A small Java 8 Streams & Lambda example: finding the top-K words in a text

Thomas Darimont · March 30, 2014

Hello,

here is a small example of how to use the Java 8 Streams API and lambdas to build a word histogram from arbitrary text and use it to find the top-K most frequently used words.

For our example, we want to list the 10 most frequently used words in a FAQ about the demoscene.
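To make the term "word histogram" concrete: it is simply a map from each word to the number of times it occurs. A minimal sketch on a hard-coded sentence might look like this (the class name `WordHistogramSketch`, the helper `histogram` and the sample sentence are made up for illustration and are not part of the actual example):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordHistogramSketch {

    // Builds a word -> occurrence-count map from a whitespace-separated text.
    public static Map<String, Long> histogram(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(w -> !w.isEmpty()) // an empty input yields one empty token; drop it
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // Illustrative input, not taken from the FAQ.
        Map<String, Long> counts = histogram("demo scene demo party demo");
        System.out.println(counts.get("demo"));  // 3
        System.out.println(counts.get("scene")); // 1
    }
}
```

Finding the top-K words then amounts to sorting the entries of this map by their count and keeping the first K.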

Java:

package de.tutorials.training;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Stream;

import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;

/**
 * @author Thomas Darimont
 */
public class PrintTopKWordsStreamsExample {

    public static void main(String[] args) throws Exception {

        // A whole word: starts at a word boundary and contains no whitespace, underscores, digits or . , ;
        Pattern wholeWordsOnlyPattern = Pattern.compile("\\b[^\\s_\\d.,;]+");
        // Common English stop words, matched case-insensitively
        Pattern stopWordPattern = Pattern.compile("(?i)(a|able|about|across|after|all|almost|also|am|among|an|and|any|are|as|at|be|because|been|but|by|can|cannot|could|dear|did|do|does|either|else|ever|every|for|from|get|got|had|has|have|he|her|hers|him|his|how|however|i|if|in|into|is|it|its|just|least|let|like|likely|may|me|might|most|must|my|neither|no|nor|not|of|off|often|on|only|or|other|our|own|rather|said|say|says|she|should|since|so|some|than|that|the|their|them|then|there|these|they|this|tis|to|too|twas|us|wants|was|we|were|what|when|where|which|while|who|whom|why|will|with|would|yet|you|your)");

        Predicate<String> wholeWordsOnly = s -> wholeWordsOnlyPattern.matcher(s).matches();
        Predicate<String> noStopWord = s -> !stopWordPattern.matcher(s).matches();

        String url = "http://tomaes.32x.de/text/pcdemoscene_faq.txt";
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(new URL(url).openStream()))) {
            reader.lines() // stream of all lines from the reader
                    .flatMap(line -> Stream.of(line.split(" "))) // tokenize each line into a stream of words
                    .filter(wholeWordsOnly) // drop tokens that are not whole words
                    .filter(noStopWord) // drop common English stop words
                    .collect(groupingBy(s -> s, counting())) // group equal words and count them; yields a Map<String, Long> of word -> occurrence count
                    .entrySet().stream() // new stream over the map entries
                    .sorted((a, b) -> Long.compare(b.getValue(), a.getValue())) // sort by occurrence count, highest first
                    .limit(10) // keep the top-K (here K = 10) entries of the histogram
                    .forEach(System.out::println); // print each histogram entry
        }
    }
}

Output:

Code:

demo=40
demoscene=37
demos=32
scene=29
make=18
people=18
more=18
music=17
graphics=17
sceners=15

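Note that `make`, `people` and `more` all occur 18 times. The program sorts by count only, so the relative order of words with equal counts is left to the iteration order of the map and is not something to rely on. If a reproducible order is wanted, one option is a secondary sort key. The following is a rough sketch under that assumption, using a hard-coded sample string instead of the FAQ; the class name `TopKTieBreakSketch` and the method `topK` are hypothetical and not part of the example above:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class TopKTieBreakSketch {

    // Returns the k most frequent words; words with equal counts are ordered alphabetically.
    public static List<Map.Entry<String, Long>> topK(String text, int k) {
        Map<String, Long> counts = Arrays.stream(text.trim().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        // Primary key: count, highest first. Secondary key: the word itself, ascending.
        Comparator<Map.Entry<String, Long>> byCountDesc =
                Map.Entry.<String, Long>comparingByValue().reversed();
        Comparator<Map.Entry<String, Long>> byWordAsc = Map.Entry.comparingByKey();

        return counts.entrySet().stream()
                .sorted(byCountDesc.thenComparing(byWordAsc))
                .limit(k)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Illustrative input: "make" and "more" tie with 2 occurrences each.
        topK("more make demo more make demo demo", 2).forEach(System.out::println);
        // prints:
        // demo=3
        // make=2
    }
}
```

Whether alphabetical order is the right tie-breaker depends on the use case; the point is only that some explicit secondary key makes the result stable across runs.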
Regards, Tom
