Tags: python-2.7, ipython, nltk, corpus, lemmatization

Lemmatization makes corpus huge


I am using IPython with Python 2.7 and a corpus containing non-ASCII characters.

The cleansing process seems to be fine, but once I use either the WordNet lemmatizer or the Porter stemmer on the corpus, the size of the file increases dramatically. Please see the code below:

    from nltk.corpus import stopwords

    tokenized_docs_no_stopwords = []
    for doc in tokenized_docs_no_punctuation:
        new_term_vector = []
        for word in doc:
            if word not in stopwords.words('english'):
                new_term_vector.append(word)
        tokenized_docs_no_stopwords.append(new_term_vector)  # once per document

and the stemming/lemmatization routine:

    from nltk.stem.porter import PorterStemmer
    from nltk.stem.wordnet import WordNetLemmatizer

    porter = PorterStemmer()
    wordnet = WordNetLemmatizer()

    preprocessed_docs = []
    for doc in tokenized_docs_no_stopwords:
        final_doc = []
        for word in doc:
            final_doc.append(porter.stem(word))
            #final_doc.append(snowball.stem(word))
            #final_doc.append(wordnet.lemmatize(word))
        preprocessed_docs.append(final_doc)  # once per document, outside the word loop

This seems to make the corpus about 10 times bigger. Isn't the objective of removing stop words and lemmatising supposed to reduce the corpus size?

I have tried adjusting the indentation, and I have a feeling there might be a more efficient loop than the 'append' routine, but I am more concerned about the huge increase in memory and file size.
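For what it's worth, here is a minimal toy sketch (hypothetical `toy_docs` data, not my real corpus) of how a mis-indented append could multiply the output, which is what I suspect is happening:

    # Toy illustration: if the outer append is indented one level too deep,
    # it runs once per word instead of once per document, so every document
    # ends up in the output len(doc) times.
    toy_docs = [['the', 'cat', 'sat'], ['dogs', 'bark', 'loudly']]

    correct = []
    for doc in toy_docs:
        final_doc = [w.lower() for w in doc]
        correct.append(final_doc)          # once per document -> 2 entries

    broken = []
    for doc in toy_docs:
        final_doc = [w.lower() for w in doc]
        for w in doc:
            broken.append(final_doc)       # once per word -> 6 entries

    print len(correct), len(broken)        # Python 2: prints "2 6"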

I am working off the example here: http://stanford.edu/~rjweiss/public_html/IRiSS2013/text2

Any help or direction would be most appreciated.


Solution

  • OK - the indentation of the code was critical, but I eliminated the messy append loops and used a lambda instead:

      filtered_words = stopwords.words('english')
      tokenized_docs_no_stopwords = []

      tokenized_docs_no_stopwords = filter(lambda x: x not in filtered_words,
                                           tokenized_docs_no_irishstopwords)
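
As a follow-up sketch (not part of the fix above): if you keep the corpus as a list of tokenized documents rather than a flat word list, the same filtering idea can be applied per document and combined with the Porter stemmer. Here `tokenized_docs_no_punctuation` is assumed from the question, and `stop_set` is just a name introduced for illustration:

    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer

    porter = PorterStemmer()
    stop_set = set(stopwords.words('english'))   # set lookups are much faster than a list

    # Filter stop words and stem inside each document in one pass, no manual appends.
    preprocessed_docs = [
        [porter.stem(word) for word in doc if word not in stop_set]
        for doc in tokenized_docs_no_punctuation
    ]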