Tags: python-2.7, ipython, nltk, corpus, lemmatization

Lemmatization makes corpus huge


I am using IPython with Python 2.7 and a corpus containing non-ASCII characters.

The cleansing process seems to be fine, but once I use either the WordNet lemmatizer or the Porter stemmer on the corpus, the size of the file increases dramatically. Please see the code below:

    from nltk.corpus import stopwords

    tokenized_docs_no_stopwords = []
    for doc in tokenized_docs_no_punctuation:
        new_term_vector = []
        for word in doc:
            if word not in stopwords.words('english'):
                new_term_vector.append(word)
        tokenized_docs_no_stopwords.append(new_term_vector)  # once per document

and the stemming/lemmatization routine:

    from nltk.stem.porter import PorterStemmer
    from nltk.stem.wordnet import WordNetLemmatizer

    porter = PorterStemmer()
    wordnet = WordNetLemmatizer()

    preprocessed_docs = []
    for doc in tokenized_docs_no_stopwords:
        final_doc = []
        for word in doc:
            final_doc.append(porter.stem(word))
            #final_doc.append(snowball.stem(word))
            #final_doc.append(wordnet.lemmatize(word))
        preprocessed_docs.append(final_doc)  # once per document, outside the word loop

This seems to make the corpus about 10 times bigger. Isn't the objective of removing stop words and lemmatising supposed to reduce the corpus size?

I have tried adjusting the indentation, and I have a feeling there might be a more efficient loop than the 'append' routine, but I am more concerned about the huge increase in memory and file size.
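For what it's worth, here is a minimal toy sketch (hypothetical `toy_docs` data, not my real corpus) of how a mis-indented append could multiply the output, which is what I suspect is happening:

    # Toy illustration: if the outer append is indented one level too deep,
    # it runs once per word instead of once per document, so every document
    # ends up in the output len(doc) times.
    toy_docs = [['the', 'cat', 'sat'], ['dogs', 'bark', 'loudly']]

    correct = []
    for doc in toy_docs:
        final_doc = [w.lower() for w in doc]
        correct.append(final_doc)          # once per document -> 2 entries

    broken = []
    for doc in toy_docs:
        final_doc = [w.lower() for w in doc]
        for w in doc:
            broken.append(final_doc)       # once per word -> 6 entries

    print len(correct), len(broken)        # Python 2: prints "2 6"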

I am working off the example here: http://stanford.edu/~rjweiss/public_html/IRiSS2013/text2

Any help or direction would be most appreciated.


Solution

  • OK - the indentation of the code was critical, but I eliminated the messy append loops and used a lambda instead:

      filtered_words = stopwords.words('english')
      tokenized_docs_no_stopwords = []

      tokenized_docs_no_stopwords = filter(lambda x: x not in filtered_words,
                                           tokenized_docs_no_irishstopwords)
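
As a follow-up sketch (not part of the fix above): if you keep the corpus as a list of tokenized documents rather than a flat word list, the same filtering idea can be applied per document and combined with the Porter stemmer. Here `tokenized_docs_no_punctuation` is assumed from the question, and `stop_set` is just a name introduced for illustration:

    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer

    porter = PorterStemmer()
    stop_set = set(stopwords.words('english'))   # set lookups are much faster than a list

    # Filter stop words and stem inside each document in one pass, no manual appends.
    preprocessed_docs = [
        [porter.stem(word) for word in doc if word not in stop_set]
        for doc in tokenized_docs_no_punctuation
    ]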