java lambda collections hashmap java-stream

How to count words in Map via Stream


I'm working with a List<String> that contains a large text. It looks like this:

List<String> lines = Arrays.asList("The first line", "The second line", "Some words can repeat", "The first the second"); //etc

I need to count the words in it and produce output like this:

first - 2
line - 2
second - 2
repeat - 1
some - 1
words - 1

Words shorter than 4 characters should be skipped, which is why "the" and "can" are not in the output. This is a simplified example; in the real task, a rare word (one that occurs fewer than 20 times) should also be skipped. The map should then be sorted by key in alphabetical order. All of this has to be done using only streams, without "if", "while" or "for" constructs.

What I have implemented:

Map<String, Integer> wordCount = Stream.of(lines)
                .flatMap(Collection::stream)
                .flatMap(str -> Arrays.stream(str.split("\\p{Punct}| |[0-9]|…|«|»|“|„")))
                .filter(str -> (str.length() >= 4))
                .collect(Collectors.toMap(
                        i -> i.toLowerCase(),
                        i -> 1,
                        (a, b) -> Integer.sum(a, b))
                );

wordCount now maps each word to its number of occurrences. But how can I skip the rare words? Should I create a new stream? If so, how do I get at the values of the Map? I tried this, but it's not correct:

 String result = Stream.of(wordCount)
         .filter(i -> (Map.Entry::getValue > 10));

My calculation should return a String of the form:

"word" - number of entries

Thank you!


Solution

  • Given the stream you already have:

    List<String> lines = Arrays.asList(
            "For the rabbit, it was a bad day.",
            "An Antillean rabbit is very abundant.",
            "She put the rabbit back in the cage and closed the door securely, then ran away.",
            "The rabbit tired of her inquisition and hopped away a few steps.",
            "The Dean took the rabbit and went out of the house and away."
    );
    
    Map<String, Integer> wordCounts = Stream.of(lines)
            .flatMap(Collection::stream)
            .flatMap(str -> Arrays.stream(str.split("\\p{Punct}| |[0-9]|…|«|»|“|„")))
            .filter(str -> (str.length() >= 4))
            .collect(Collectors.toMap(
                    String::toLowerCase,
                    i -> 1,
                    Integer::sum)
            );
    
    System.out.println("Original:" + wordCounts);
    

    Original output:

    Original:{dean=1, took=1, door=1, very=1, went=1, away=3, antillean=1, abundant=1, tired=1, back=1, then=1, house=1, steps=1, hopped=1, inquisition=1, cage=1, securely=1, rabbit=5, closed=1}
    

    You can do:

    String results = wordCounts.entrySet()
            .stream()
            .filter(wordToCount -> wordToCount.getValue() > 2) // counts of 2 or fewer are treated as rare here
            .sorted(Map.Entry.comparingByKey())
            .map(wordToCount -> wordToCount.getKey() + " - " + wordToCount.getValue())
            .collect(Collectors.joining(", "));
    
    System.out.println(results);
    

    Filtered output:

    away - 3, rabbit - 5
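
If you would rather not keep the intermediate HashMap around, the two steps above can probably be combined into a single pipeline. The following is only a sketch under a few assumptions: the class `WordCounter`, the method `frequentWords` and its parameters `minLength` and `minCount` are hypothetical names, the split pattern is simplified to "any run of non-letters" instead of the pattern from the question, and the words are joined with a line break rather than ", ". It collects into a TreeMap through `Collectors.groupingBy`, so the keys are already in alphabetical order and no separate `sorted` call is needed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class WordCounter {

    // Hypothetical helper: counts words of at least minLength characters and
    // keeps only those that occur at least minCount times, sorted by word.
    static String frequentWords(List<String> lines, int minLength, long minCount) {
        return lines.stream()
                // Assumption: split on any run of non-letter characters
                // (a simplification of the pattern used in the question).
                .flatMap(line -> Arrays.stream(line.split("\\P{L}+")))
                .filter(word -> word.length() >= minLength)
                .map(String::toLowerCase)
                // A TreeMap keeps its keys in natural (alphabetical) order.
                .collect(Collectors.groupingBy(
                        Function.identity(),
                        TreeMap::new,
                        Collectors.counting()))
                .entrySet()
                .stream()
                // Drop the rare words.
                .filter(entry -> entry.getValue() >= minCount)
                .map(entry -> entry.getKey() + " - " + entry.getValue())
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "For the rabbit, it was a bad day.",
                "An Antillean rabbit is very abundant.",
                "She put the rabbit back in the cage and closed the door securely, then ran away.",
                "The rabbit tired of her inquisition and hopped away a few steps.",
                "The Dean took the rabbit and went out of the house and away."
        );
        // Words of 4+ characters that occur at least 3 times.
        System.out.println(frequentWords(lines, 4, 3));
    }
}
```

For the real task from the question you would call something like `frequentWords(lines, 4, 20)`. Note that `Collectors.counting()` yields `Long` values, so the resulting map is a `TreeMap<String, Long>` rather than a `Map<String, Integer>`.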