Tags: python, python-3.x, nlp, corpus, word-frequency

Extracting Word Frequency List from a Large Corpus


I have a large English corpus named SubIMDB, and I want to build a list of all the words with their frequencies, that is, how many times each word appears in the whole corpus. This frequency list should have some characteristics:

  1. Pairs like boy and boys, or grammatical variants such as get and getting, should be counted as the same word (lemma): if there are 3 occurrences of boy and 2 of boys, the list should show boy 5. However, this should not apply to irregular forms like go and went (or foot and feet).
  2. I want to use this frequency list as a kind of dictionary, so whenever I see a word in another part of the program I can check its frequency in this list. It should therefore be searchable without scanning the whole thing.

My questions are:

  1. For the first requirement, what should I do: lemmatization or stemming? How can I get that behavior?
  2. For the second, what data type should I use: a dictionary, a list, or something else?
  3. Is it best to save the result as CSV?
  4. Is there a ready-made Python toolkit that does all of this?

Thank you so much.


Solution

  • As pointed out above, the question is somewhat opinion-based and vague, but here are some directions:

    1. Both will work for your case. Stemming is usually simpler and faster; I suggest starting with nltk's PorterStemmer. If you need more sophisticated lemmatization, take a look at spaCy, which is IMO the industry standard. Note that stemming actually fits your first requirement better, since a lemmatizer will merge irregular forms like went into go (see the first sketch after this list).
    2. You need a dictionary, which gives you amortized O(1) lookup once you have your stem/lemma. A collections.Counter may also be useful (second sketch below).
    3. Depends on your use case. CSV is more portable; pickle may be easier to use (third sketch below).
    4. There are a lot of building blocks in nltk and spaCy; assembling your own pipeline/models is up to you.
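
Here is a minimal sketch contrasting the two approaches. The sample words are illustrative; it assumes nltk and spacy are installed and that the en_core_web_sm model has been downloaded (python -m spacy download en_core_web_sm):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    words = ["boys", "getting", "went", "feet"]
    print([stemmer.stem(w) for w in words])
    # ['boy', 'get', 'went', 'feet'] -- regular forms merge,
    # irregular ones stay distinct, as the question asks

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The boys went home on their feet")
    print([(tok.text, tok.lemma_) for tok in doc])
    # lemmatization maps 'went' -> 'go' and 'feet' -> 'foot',
    # which the question explicitly does not want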
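
Next, a sketch of building the frequency list as a Counter, streaming the corpus line by line. The corpus path is hypothetical and the tokenization is deliberately simple:

    import re
    from collections import Counter
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    freq = Counter()

    with open("subimdb.txt", encoding="utf-8") as f:  # hypothetical corpus file
        for line in f:  # stream line by line, since the corpus is large
            for word in re.findall(r"[a-z']+", line.lower()):
                freq[stemmer.stem(word)] += 1

    # Counter behaves like a dict: O(1) average lookup, and 0 for unseen words
    print(freq["boy"])            # combined count of 'boy' and 'boys'
    print(freq.most_common(10))   # ten most frequent stems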
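
And finally, a sketch of both persistence options, assuming freq is the Counter from the previous snippet; the file names are illustrative:

    import csv
    import pickle

    # CSV: portable and human-readable
    with open("freq.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(freq.most_common())

    # pickle: round-trips the Counter object directly
    with open("freq.pkl", "wb") as f:
        pickle.dump(freq, f)

    with open("freq.pkl", "rb") as f:
        freq = pickle.load(f)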