python pandas nltk term-document-matrix

Efficient Term Document Matrix with NLTK


I am trying to create a term document matrix with NLTK and pandas. I wrote the following function:

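def fnDTM_Corpus(xCorpus):
    '''Create a Term Document Matrix from an NLTK corpus.'''
    import nltk
    import pandas as pd
    # one frequency distribution (word -> count) per file in the corpus
    fileids = xCorpus.fileids()
    fd_list = [nltk.FreqDist(xCorpus.words(fileid)) for fileid in fileids]
    # rows are documents here; words absent from a document become NaN
    DTM = pd.DataFrame(fd_list, index=fileids)
    DTM.fillna(0, inplace=True)
    # transpose so that rows are terms and columns are documents
    return DTM.T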

To run it:

import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Data/'

newcorpus = PlaintextCorpusReader(corpus_root, '.*')

x = fnDTM_Corpus(newcorpus)

It works well for a few small files in the corpus, but gives me a MemoryError when I run it on a corpus of 4,000 files (about 2 KB each).

Am I missing something?

I am using 32-bit Python (on Windows 7, 64-bit OS, Core Quad CPU, 8 GB RAM). Do I really need 64-bit Python for a corpus of this size?
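
For a rough sense of scale, here is a back-of-the-envelope estimate of what the dense DataFrame needs. After `fillna`, each cell is a 64-bit float (8 bytes). The vocabulary size below is an assumed, hypothetical figure, not something measured on this corpus, and `dense_matrix_bytes` is just an illustrative helper name:

```python
def dense_matrix_bytes(n_docs, n_terms, bytes_per_cell=8):
    """Bytes needed for one dense n_docs x n_terms matrix of float64 cells."""
    return n_docs * n_terms * bytes_per_cell

n_docs = 4000        # number of files in the corpus described above
n_terms = 50_000     # ASSUMPTION: hypothetical vocabulary size, not measured

size = dense_matrix_bytes(n_docs, n_terms)
print(f"{size / 2**30:.2f} GiB for a single copy of the matrix")
```

Under that assumption a single copy is about 1.5 GiB, and steps such as `fillna` or the transpose may need temporary copies on top of it. A 32-bit Windows process is typically limited to roughly 2 GiB of address space, so a MemoryError would not be surprising even with 8 GB of RAM installed.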


Solution

  • Thanks to Radim and Larsmans. My objective was to have a DTM like the one you get from the tm package in R. I decided to use scikit-learn, partly inspired by this blog entry. This is the code I came up with.

    I post it here in the hope that someone else will find it useful.

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer 
    
    def fn_tdm_df(docs, xColNames=None, **kwargs):
        '''Create a term document matrix as a pandas DataFrame.
        Extra **kwargs are passed on to CountVectorizer.
        If xColNames is given, it is used as the DataFrame's column names.'''
    
        # initialize the vectorizer and count terms (sparse, documents x terms)
        vectorizer = CountVectorizer(**kwargs)
        x1 = vectorizer.fit_transform(docs)
        # create the DataFrame: transpose so rows are terms, columns are documents
        # (scikit-learn releases before 1.0 used get_feature_names() instead)
        df = pd.DataFrame(x1.toarray().transpose(),
                          index=vectorizer.get_feature_names_out())
        if xColNames is not None:
            df.columns = xColNames
    
        return df
    

    To use it on the text files in a directory:

    DIR = 'C:/Data/'
    
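    def fn_CorpusFromDIR(xDIR):
        '''Create a corpus from the files in a directory.
        Input: directory path
        Output: a dictionary with
                 the names of the files ['ColNames']
                 the text of each file ['docs']'''
        import os
        fnames = os.listdir(xDIR)  # list once so docs and names stay aligned
        docs = []
        for f in fnames:
            # close each file after reading it
            with open(os.path.join(xDIR, f)) as fh:
                docs.append(fh.read())
        # a list rather than a lazy map object, so it also works on Python 3
        return dict(docs=docs, ColNames=['P_' + f[0:6] for f in fnames])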
    

    To create the DataFrame:

    corpus = fn_CorpusFromDIR(DIR)  # read the directory only once
    d1 = fn_tdm_df(docs=corpus['docs'],
                   xColNames=corpus['ColNames'],
                   # decode_error was named charset_error in old scikit-learn releases
                   stop_words=None, decode_error='replace')