I have a large English corpus named SubIMDB, and I want to make a list of all its words with their frequencies, meaning how many times each word appears in the whole corpus. This frequency list should have some characteristics:
My questions are:
Thank you so much.
As pointed out above, the question is opinion-based and vague, but here are some directions:
PorterStemmer. If you need sophisticated lemmatization, take a look at spaCy; IMO that's the industry standard.
counter may become useful.
pickle may be easier to use.
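To make the last two pointers concrete, here's a minimal sketch, assuming "counter" means Python's `collections.Counter` and that the corpus can be read line by line. The regex tokenizer and the helper names (`count_words`, `save_counts`, `load_counts`) are made up for illustration and are not part of SubIMDB or any library; swap in a stemmer or lemmatizer where the tokens are produced if you need one.

```python
import pickle
import re
from collections import Counter

# Hypothetical tokenizer: lowercase runs of letters and apostrophes.
# A real corpus will probably need something more careful.
TOKEN_RE = re.compile(r"[a-z']+")


def count_words(lines):
    """Return a Counter mapping each token to its number of occurrences."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN_RE.findall(line.lower()))
    return counts


def save_counts(counts, path):
    """Store the frequency list on disk with pickle."""
    with open(path, "wb") as fh:
        pickle.dump(counts, fh)


def load_counts(path):
    """Load a frequency list previously written by save_counts."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Tiny in-memory stand-in for the corpus; for a real file, pass the open
# file object instead so lines are streamed rather than loaded at once.
sample = ["The cat sat on the mat.", "The dog didn't sit."]
freq = count_words(sample)
print(freq.most_common(3))
```

Since `count_words` accepts any iterable of strings, a large corpus never has to fit in memory, and `most_common()` gives you the list sorted by frequency.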