
Count recurrent words in two files


I have code that counts word occurrences in a file. I would like to use it with two files and display the recurrent words (those that appear in both files) in a separate table. How could I adapt it to work with two files?

    while ((inputLine = bufferedReader.readLine()) != null) {
        // split the line into words on whitespace and punctuation
        String[] words = inputLine.split("[ \n\t\r.,;:!?(){}]");

        for (int counter = 0; counter < words.length; counter++) {
            String key = words[counter].toLowerCase();
            if (key.length() > 0) {
                // first occurrence: store 1, otherwise increment the stored count
                if (crunchifyMap.get(key) == null) {
                    crunchifyMap.put(key, 1);
                } else {
                    int value = crunchifyMap.get(key).intValue();
                    value++;
                    crunchifyMap.put(key, value);
                }
            }
        }
    }
    Set<Map.Entry<String, Integer>> entrySet = crunchifyMap.entrySet();
    System.out.println("Words" + "\t\t" + "# of Occurances");
    for (Map.Entry<String, Integer> entry : entrySet) {
        System.out.println(entry.getKey() + "\t\t" + entry.getValue());
    }

Solution

  • You should probably use the following (very coarse) algorithm:

    1. Read the first file and store all words in a Set words;
    2. Read the second file and store all words in a Set words2;
    3. Compute the intersecting set by retaining all words in words that are also contained in words2: words.retainAll(words2)
    4. words contains your final list.

    Note that you can reuse the file-reading algorithm if you put it into a method like

        public Set<String> readWords(Reader reader) {
            ....
        }
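
    For instance, a possible implementation of that method, combined with steps 1-3 above, might look like the sketch below (the file names file1.txt and file2.txt are placeholders; the split pattern is the one from the question):

        import java.io.BufferedReader;
        import java.io.FileReader;
        import java.io.IOException;
        import java.io.Reader;
        import java.util.HashSet;
        import java.util.Set;

        public class CommonWords {

            // reads all distinct, lower-cased words from the given reader
            public static Set<String> readWords(Reader reader) throws IOException {
                Set<String> words = new HashSet<>();
                try (BufferedReader br = new BufferedReader(reader)) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        // same split pattern as in the question
                        for (String word : line.split("[ \n\t\r.,;:!?(){}]")) {
                            if (!word.isEmpty()) {
                                words.add(word.toLowerCase());
                            }
                        }
                    }
                }
                return words;
            }

            public static void main(String[] args) throws IOException {
                // steps 1-3: read both files, then keep only the words present in both
                Set<String> words = readWords(new FileReader("file1.txt"));
                Set<String> words2 = readWords(new FileReader("file2.txt"));
                words.retainAll(words2);
                System.out.println("Words contained in both files: " + words);
            }
        }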
    

    Count frequency of occurrence

    If you also want to know the frequency of occurrence, you should read each file into a Map<String, Integer> which maps each word to its frequency of occurrence within that file.

    The Map.merge(...) method (available since Java 8) simplifies the counting:

        Map<String, Integer> freq = new HashMap<>();
        for (String word : words) {
            // insert 1 or increment the already mapped value
            freq.merge(word, 1, Integer::sum);
        }
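
    Combined with the readWords method sketched above, a frequency-counting variant might look like this (a sketch; the name readWordFrequencies is only illustrative, and java.util.HashMap and java.util.Map have to be imported as well):

        // reads the words from the given reader and counts how often each one occurs
        public static Map<String, Integer> readWordFrequencies(Reader reader) throws IOException {
            Map<String, Integer> freq = new HashMap<>();
            try (BufferedReader br = new BufferedReader(reader)) {
                String line;
                while ((line = br.readLine()) != null) {
                    for (String word : line.split("[ \n\t\r.,;:!?(){}]")) {
                        if (!word.isEmpty()) {
                            freq.merge(word.toLowerCase(), 1, Integer::sum); // insert 1 or increment
                        }
                    }
                }
            }
            return freq;
        }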
    

    Then apply the following, slightly modified algorithm:

    1. Read the first file and store all words in a Map wordsFreq1;
    2. Read the second file and store all words in a Map wordsFreq2;
    3. Copy the words of the first map into a new set (copying avoids modifying the map through its key-set view): Set<String> words = new HashSet<>(wordsFreq1.keySet());
    4. Compute the intersection by retaining only the words that also occur in the second map: words.retainAll(wordsFreq2.keySet())
    5. Now words contains all the words the two files have in common, while wordsFreq1 and wordsFreq2 still hold the frequencies of every word in each file.

    With these three data structures, you can easily get all the information you want. Example:

        Map<String, Integer> wordsFreq1 = ... // read from file
        Map<String, Integer> wordsFreq2 = ... // read from file
    
        Set<String> commonWords = new HashSet<>(wordsFreq1.keySet());
        commonWords.retainAll(wordsFreq2.keySet());
    
        // Map that contains the summarized frequencies of all words
        Map<String, Integer> allWordsTotalFreq = new HashMap<>(wordsFreq1);
        wordsFreq2.forEach((word, freq) -> allWordsTotalFreq.merge(word, freq, Integer::sum));
    
        // Map that contains the summarized frequencies of words in common
        Map<String, Integer> commonWordsTotalFreq = new HashMap<>(allWordsTotalFreq);
        commonWordsTotalFreq.keySet().retainAll(commonWords);
    
        // List of common words sorted by frequency:
        List<String> list = new ArrayList<>(commonWords);
        Collections.sort(list, Comparator.comparingInt(commonWordsTotalFreq::get).reversed());
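
    Finally, to print the common words as a table similar to the one in the question, you could do something like the following (a sketch that builds on the variables from the example above):

        // print each common word with its per-file and total frequency, most frequent first
        System.out.println("Word" + "\t\t" + "File 1" + "\t" + "File 2" + "\t" + "Total");
        for (String word : list) {
            System.out.println(word + "\t\t"
                    + wordsFreq1.get(word) + "\t"
                    + wordsFreq2.get(word) + "\t"
                    + commonWordsTotalFreq.get(word));
        }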