I recently updated a conda environment from Python 3.4 to Python 3.6. The environment belongs to a project that uses gensim, which worked perfectly on 3.4. After the update, using the library raises several errors, such as:
TypeError: object of type 'itertools.chain' has no len()
AssertionError: decomposition not initialized yet
Does anyone know why this happens, given that gensim explicitly states that Python 3.5 and 3.6 are supported?
The code in question:
# Create Texts
texts = src.data.raw.extract_clean_merge_titles_abstracts(papers)
texts = src.data.raw.tokenize_stream(texts)
print("Size of corpus: ", len(texts)) # ERROR 1 HERE
# Create Dictionary
dictionary = gensim.corpora.dictionary.Dictionary(texts, prune_at=None)
dictionary.filter_extremes(no_below=3, no_above=0.1, keep_n=None)
# Create corpus
corpus = [dictionary.doc2bow(text) for text in texts]
#gensim.corpora.MmCorpus.serialize(config.paths.PATH_DATA_GENSIM_TEMP_CORPUS, corpus)
corpus_index = gensim.similarities.docsim.Similarity(config.paths.PATH_DATA_GENSIM_TEMP_CORPUS_INDEX, corpus, num_features=len(dictionary))
# tf-idf
tfidf = gensim.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
#gensim.corpora.MmCorpus.serialize(config.paths.PATH_DATA_GENSIM_TEMP_CORPUS_TFIDF, corpus_tfidf)
corpus_tfidf_index = gensim.similarities.docsim.Similarity(config.paths.PATH_DATA_GENSIM_TEMP_CORPUS_TFIDF_INDEX, corpus_tfidf, num_features=len(dictionary))
# lsa
lsa_num_topics = 100
lsa = gensim.models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=lsa_num_topics)
corpus_lsa = lsa[corpus_tfidf] # ERROR 2 HERE
#gensim.corpora.MmCorpus.serialize(config.paths.PATH_DATA_GENSIM_TEMP_CORPUS_LSA, corpus_lsa)
corpus_lsa_index = gensim.similarities.docsim.Similarity(config.paths.PATH_DATA_GENSIM_TEMP_CORPUS_LSA_INDEX, corpus_lsa, num_features=lsa_num_topics)
Here is the list of the packages installed:
gensim 2.2.0 np113py36_0
matplotlib 2.0.2 np113py36_0
nltk 3.2.4 py36_0
numpy 1.13.1 py36_0
python 3.6.1 2
scikit-learn 0.18.2 np113py36_0
scipy 0.19.1 np113py36_0
smart_open 1.5.3 py36_0
My bad, the problem came from the Phraser step in tokenize_stream:
def tokenize_stream(stream, max_num_words=3):
    tokens_stream = [gensim.utils.simple_preprocess(t, min_len=2, max_len=50) for t in stream]
    for i, tokens in enumerate(tokens_stream):
        tokens_stream[i] = [j for j in tokens if j not in stop_words]
    phrases = gensim.models.phrases.Phrases.load(config.paths.PATH_DATA_GENSIM_PHRASES)
    grams = gensim.models.phrases.Phraser(phrases)
    tokens_stream = list(grams[tokens_stream])  ## HERE LIST IS IMPORTANT
    return tokens_stream
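For some reason, omitting list(grams[...]) worked in my code with Python 3.4. With Python 3.6, the same expression returns an itertools.chain instance. Such an object has no len(), which explains error 1, and it can only be iterated once: after the dictionary has consumed it, the list comprehension that builds the corpus sees nothing and produces an empty corpus, which is presumably what triggers error 2.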
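To illustrate the one-shot behaviour, here is a minimal sketch that uses only the standard library. It does not call gensim at all; the docs list is a hypothetical stand-in for the tokenized texts, and the chain object only mimics the kind of lazy stream returned by grams[tokens_stream].

```python
import itertools

# Hypothetical stand-in for the tokenized documents (not real project data).
docs = [["topic", "model"], ["latent", "semantic", "analysis"]]

# An itertools.chain object is lazy and can be consumed only once.
stream = itertools.chain(docs[:1], docs[1:])

# len() is not defined for it, which reproduces the first error.
try:
    len(stream)
    len_error = None
except TypeError as exc:
    len_error = str(exc)

# The first pass (think: building the dictionary) consumes the stream.
first_pass = [doc for doc in stream]
# The second pass (think: building the bag-of-words corpus) gets nothing.
second_pass = [doc for doc in stream]

# Wrapping the stream in list() once makes it sized and re-iterable.
materialized = list(itertools.chain(docs[:1], docs[1:]))
again = [doc for doc in materialized]

print(len_error)
print(len(first_pass), len(second_pass), len(materialized), len(again))
```

Materializing with list() keeps every document in memory, which is fine for a corpus that fits in RAM. For a larger corpus, an object that creates a fresh iterator each time it is iterated would be the alternative, but that is beyond what the code above needs.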