Tags: python, nltk, n-gram, collocation

Bi-grams in Python with lots of txt files


I have a corpus of 70,429 files (296.5 MB). I am trying to find bi-grams using the whole corpus. I have written the following code:

import os
import codecs

import nltk
from nltk.collocations import BigramCollocationFinder

allFiles = ""
# Walk two directory levels below the root, concatenating every file
for dirName in os.listdir(rootDirectory):
    dirPath = os.path.join(rootDirectory, dirName)
    for subDir in os.listdir(dirPath):
        subPath = os.path.join(dirPath, subDir)
        for fileN in os.listdir(subPath):
            with codecs.open(os.path.join(subPath, fileN), encoding="iso8859-9") as FText:
                allFiles += FText.read()

tokens = allFiles.split()
finder = BigramCollocationFinder.from_words(tokens, window_size=3)
finder.apply_freq_filter(2)
bigram_measures = nltk.collocations.BigramAssocMeasures()
for k, v in finder.ngram_fd.most_common(100):
    print(k, v)

There is a root directory that contains subdirectories, and each subdirectory contains numerous files. What I have done is:

I read all of the files one by one and append their contents to the string called allFiles. Finally, I split the string into tokens and call the relevant bi-gram functions. The problem is:

I ran the program for a day and couldn't get any results. Is there a more efficient way to find bigrams within a corpus that includes lots of files?

Any advice and suggestions will be greatly appreciated. Thanks in advance.


Solution

  • By trying to read a huge corpus into memory at once, you're blowing out your memory, forcing a lot of swap use, and slowing everything down.

    The NLTK provides various "corpus readers" that can return your words one by one, so that the complete corpus is never stored in memory at the same time. This might work if I understand your corpus layout right:

    from nltk.corpus.reader import PlaintextCorpusReader

    # The fileids argument is a regular expression (not a shell glob);
    # this pattern matches files two directory levels below the root.
    reader = PlaintextCorpusReader(rootDirectory, r".*/.*/.*", encoding="iso8859-9")
    finder = BigramCollocationFinder.from_words(reader.words(), window_size=3)
    finder.apply_freq_filter(2)  # Continue processing as before
    ...
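
    If the pattern doesn't match your directory layout, the reader will simply find no files, so it's worth a quick sanity check before starting a long run. A minimal sketch, reusing the reader built above:

    fileids = reader.fileids()
    print(len(fileids))   # should report 70,429 files
    print(fileids[:3])    # spot-check a few relative paths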
    

    Addendum: Your approach has a bug: you're collecting n-grams that span from the end of one document to the beginning of the next. Those are nonsense and you want to get rid of them. I recommend the following variant, which collects n-grams from each document separately.

    # One token stream per document, so n-grams never cross document boundaries
    document_streams = (reader.words(fname) for fname in reader.fileids())
    # from_documents() has no window_size argument, so set the class default
    BigramCollocationFinder.default_ws = 3
    finder = BigramCollocationFinder.from_documents(document_streams)
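
    From here you can filter and score as before. A minimal sketch, reusing the frequency threshold and output loop from the question (nbest with the likelihood-ratio measure is one standard alternative from NLTK's collocations API):

    finder.apply_freq_filter(2)
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    # Most frequent bigrams, as in the original code
    for k, v in finder.ngram_fd.most_common(100):
        print(k, v)
    # Or rank by an association measure instead of raw frequency
    print(finder.nbest(bigram_measures.likelihood_ratio, 10))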