Search code examples
javaalgorithmarraylistcollections

How can I find the most frequent word in a huge amount of words (eg. 900000)


I am facing a task which is generating 900000 random words and then print out the most frequent one. So here is my algorithm:

1. move all number into a collection rather than printhing out them
2. for (900000...){move the frequency of Collection[i] to another collection B}
** 90W*90W is too much for a computer(lack of efficiency)
3. find the biggest number in that collection and the index.
4. then B[index] is output.

But the thing is that my computer cannot handle the second step. So I searched on this website and find some answer about find the frequency of word in a bunch of words and I viewed the answer code, but I haven't find a way to apply them into huge amount of words.

Now I show my code here:

/** Funny Words Generator
  * Tony
  */

import java.util.*;

public class WordsGenerator {

  //data field (can be accessed in whole class):
  private static int xC; // define a xCurrent so we can access it all over the class
  private static int n;
  private static String[] consonants = {"b","c","d","f","g","h","j","k","l","m","n","p","r","s","t","v","w","x","z"};
  private static String[] vowels = {"a", "e", "i", "o", "u"};
  private static String funnyWords = "";



  public static void main(String[] args) {

    Scanner sc = new Scanner(System.in);
    int times = 900000; // words number
    xC = sc.nextInt(); // seeds (only input)

    /* Funny word list */
    ArrayList<String> wordsList = new ArrayList<String>();
    ArrayList<Integer> frequencies = new ArrayList<Integer>();
    int maxFreq;
    for (int i = 0; i < times; i++) {
      n = 6; // each words are 6 characters long
      funnyWords = ""; // reset the funnyWords each new time
      for (int d = 0; d < n; d ++) {

        int letterNum = randomGenerator(); /* random generator will generate numbers based on current x */
        int letterIndex = 0; /* letterNum % 19 or % 5 based on condition */

        if ((d + 1) % 2 == 0) {
          letterIndex = letterNum % 5;
          funnyWords += vowels[letterIndex];
        }

        else if ((d + 1) % 2 != 0) {
          letterIndex = letterNum % 19;
          funnyWords += consonants[letterIndex];
        }
      }
      wordsList.add(funnyWords);
    }


    /* put all frequencies of each words into an array called frequencies */
    for (int i = 0; i < 900000; i++) {
      frequencies.add(Collections.frequency(wordsList, wordsList.get(i)));
    }



    maxFreq = Collections.max(frequencies);
    int index = frequencies.indexOf(maxFreq); // get the index of the most frequent word
    System.out.print(wordsList.get(index));


    sc.close();
  }

  /** randomGenerator
    * param: N(generate times), seeds
    * return: update the xC and return it */
  private static int randomGenerator() {
    int a = 445;
    int c = 700001;
    int m = 2097152;
    xC = (a * xC + c) % m; // update
    return xC; // return
  }

}

So I have realized that maybe there is a way skip the second step somehow. Anyone can give me a hint? Just a hint not code so I can try it myself will be great! Thx!

Modified: I see lots of your answer code contains "words.stream()", I googled it and I couldn't find it. Could you guys please tell me where I can find this kind of knowledge? this stream method is in which class? Thank you!


Solution

  • HashMap is one of the fastest data structures, just loop through each words, use it as key to the HashMap, inside the loop, make the counter the value of the hashMap.

    HashMap<string, Integer> hashMapVariable = new HashMap<>();
    ...
    //inside the loop of words
    if (hashMapVariable.containsKey(word){
       hashMapVariable.put(key, hashMapVariable.get(key) + 1);
    } else {
       hashMapVariable.put(word, 1);
    }
    ...
    

    For each key(word) just increment the value as associated with the key. although you have to check if the key exits ( in java its hashMapVariable.containsKey("key") ). if its exits then just increment else add it to the HashMap. by doing this you are not restoring the whole data you are only making every key just one and the number of times it occurs as value to the key.

    At the end of the loop the most frequent word will have the highest counter/value.