I am using IPython (Python 2.7) and a corpus with non-ASCII characters. The cleaning process seems to be fine, but once I use either the WordNet lemmatizer or the Porter stemmer, the size of the file increases enormously. Please see the code below:
from nltk.corpus import stopwords

tokenized_docs_no_stopwords = []
for doc in tokenized_docs_no_punctuation:
    new_term_vector = []
    for word in doc:
        if word not in stopwords.words('english'):
            new_term_vector.append(word)
    tokenized_docs_no_stopwords.append(new_term_vector)
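One detail about the loop above: stopwords.words('english') is called inside the inner loop, so the call is repeated and the resulting list is scanned linearly for every single token. A variant that builds the stop word collection once as a set (just a sketch; tokenized_docs_no_punctuation is the same list of token lists as above) avoids that:

from nltk.corpus import stopwords

# build the stop word set once; membership tests on a set are constant time,
# instead of scanning a freshly fetched list for every token
stop_set = set(stopwords.words('english'))

tokenized_docs_no_stopwords = []
for doc in tokenized_docs_no_punctuation:
    new_term_vector = []
    for word in doc:
        if word not in stop_set:
            new_term_vector.append(word)
    tokenized_docs_no_stopwords.append(new_term_vector)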
and the stemming/lemmatising routine:
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

porter = PorterStemmer()
wordnet = WordNetLemmatizer()

preprocessed_docs = []
for doc in tokenized_docs_no_stopwords:
    final_doc = []
    for word in doc:
        final_doc.append(porter.stem(word))
        #final_doc.append(snowball.stem(word))
        #final_doc.append(wordnet.lemmatize(word))
    preprocessed_docs.append(final_doc)
This seems to make the corpus about 10 times bigger. Isn't the objective of removing stop words and lemmatising to reduce the corpus size?
I have tried adjusting the indentation, and I have a feeling there might be a more efficient approach than the nested 'append' loops, but I am more concerned about the large increase in memory.
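To pin down where the growth comes from, a quick sanity check could compare token and character counts before and after stemming (just a sketch; it assumes both corpora are lists of token lists as in the code above, and corpus_stats is a throwaway helper):

def corpus_stats(docs):
    # docs is expected to be a list of token lists
    n_tokens = sum(len(doc) for doc in docs)
    n_chars = sum(len(word) for doc in docs for word in doc)
    return n_tokens, n_chars

print(corpus_stats(tokenized_docs_no_stopwords))   # before stemming
print(corpus_stats(preprocessed_docs))             # after stemming

Stemming maps each token to a single (usually shorter) stem, so neither count should grow; if the second pair of numbers is much larger, something in the loop structure is probably appending tokens more than once.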
I am working off the example here: http://stanford.edu/~rjweiss/public_html/IRiSS2013/text2 Any help or direction would be most appreciated.
OK, the indentation of the code was critical, but I eliminated the messy append loops and used filter with a lambda instead:
filtered_words = stopwords.words('english')
tokenized_docs_no_stopwords = []
tokenized_docs_no_stopwords = filter(lambda x: x not in filtered_words,
                                     tokenized_docs_no_irishstopwords)
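One caveat on the filter version: it removes stop words from whatever flat sequence it is given. If the corpus is still a list of documents (one token list per document), each document needs its own pass. A per-document sketch of the same idea, assuming tokenized_docs_no_punctuation is still the list of token lists from the cleaning step:

filtered_words = set(stopwords.words('english'))   # set lookup instead of a list scan

tokenized_docs_no_stopwords = [[word for word in doc if word not in filtered_words]
                               for doc in tokenized_docs_no_punctuation]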