data-structures stream big-o filestream

What is the best data structure to store the words found in a document along with a count of their occurrences?


Let's say I have a corpus of documents that I want to read one by one and store in a data structure. The structure will probably be a list of objects, where each object's class represents a single document. Inside that class I'll need a data structure to hold the contents of each document; what should that be? Also, if I want to count word occurrences and retrieve the most frequent words in each document, do I need a data structure that lets me do this in less than the O(n) time it would take to examine all the contents sequentially?


Solution

  • Use an associative array, also called a map or dictionary; different programming languages use different terms for the same data structure.

    Each entry's key would be a word, and the entry's value would be that word's counter. For example:

    {
      'on' -> 15,
      'and' -> 43,
      'I' -> 157,
      'confluence' -> 1,
      'dear' -> 2
    }
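
As a rough illustration of the idea above, here is a minimal Python sketch that keeps one word-to-count map per document. The `Document` class, its method names, the file names, and the sample sentences are all hypothetical, and the tokenizer (lowercasing, then splitting on runs of letters and apostrophes) is only an assumption; a real corpus may need something more careful.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical container for one document of the corpus."""
    name: str
    # Maps each word to the number of times it has been seen so far.
    word_counts: Counter = field(default_factory=Counter)

    def add_text(self, text):
        # Assumed tokenizer: lowercase, then take runs of letters/apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        # Each increment is an average O(1) dictionary update.
        self.word_counts.update(words)

    def most_frequent(self, k):
        # Returns the k (word, count) pairs with the highest counts.
        return self.word_counts.most_common(k)


# The corpus is a plain list of Document objects, as in the question.
corpus = [Document("a.txt"), Document("b.txt")]
corpus[0].add_text("Dear reader, I read and I write and I rest.")
corpus[1].add_text("On and on and on.")

top_a = corpus[0].most_frequent(2)
top_b = corpus[1].most_frequent(1)
print(top_a)
print(top_b)
```

With this layout, counting happens while the document is read, so a later lookup of a single word's count is an average constant-time dictionary access. Retrieving the top k words still has to look at every distinct word in the map once, but it does not reread the document text.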