
Normalizing bag of words data in Gensim


I am using gensim to build a bag-of-words model and I want to normalize it. I found the documentation (https://radimrehurek.com/gensim/models/normmodel.html), but I am unsure how to apply it to the code I have. conversations is a list of tokenized documents, i.e. a list of lists in which each inner list is one document.

from gensim import corpora, models
from gensim.models import NormModel

id2word = corpora.Dictionary(conversations)
id2word.filter_extremes(keep_n=5000, keep_tokens=None)
corpus = [id2word.doc2bow(text) for text in conversations]
norm_corpus = NormModel(corpus)

corpus is a sparse representation, I believe. For each document, it lists only the terms that actually occur, as (term id, count) pairs: [[(0, 2), (1, 5), (2, 4)...(92, 2), (93, 3)],...].

The norm_corpus produced by the last line does not work when I pass it to the next step: models.LsiModel(norm_corpus, id2word=id2word, num_topics=12). I get a TypeError with the message 'int' object is not iterable. However, the documentation says to pass in a corpus, so I'm confused. I would appreciate any help. Thanks!
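
One plausible reading of this error can be reproduced without gensim. The sketch below assumes that the object returned by NormModel(corpus) is a model that can be indexed with a single document but does not define its own iteration; that is an assumption, not something confirmed in the question. SingleDocTransform is a hypothetical stand-in class, not part of gensim. When Python iterates over such an object, it falls back to calling obj[0], obj[1], and so on, so the integer 0 ends up being treated as a document.

```python
class SingleDocTransform:
    """Hypothetical stand-in for a model that transforms one document at a time."""

    def __getitem__(self, bow):
        # Expects a list of (token_id, count) pairs and rescales the counts
        # so that they sum to 1.
        total = sum(count for _, count in bow)
        return [(token_id, count / total) for token_id, count in bow]


model = SingleDocTransform()

# Indexing with a real document works as intended.
single = model[[(0, 2), (1, 6)]]

# Iterating over the model itself makes Python call model[0], model[1], ...
# so the integer 0 is handed to __getitem__ as if it were a document.
try:
    for doc in model:
        pass
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(single)
print(error_message)
```

If that reading is right, anything that loops over the model as though it were a corpus would fail in the same way.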


Solution

  • I don't have a way to check at the moment, but try this:

    norm_corpus = NormModel()
    norm_corpus.normalize(text)
    

    or, if normalize expects a bag-of-words vector rather than raw tokens:

    norm_corpus.normalize(id2word.doc2bow(text))

    In your original code you have

    NormModel(iterable)

    but the documentation says you need to pass:

    NormModel(iterable of iterable of (int, number))

    that is, a collection of documents, each of which is a list of (term id, weight) pairs. I hope this makes sense.
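
To make the per-document idea concrete, here is a minimal plain-Python sketch of L2 normalization applied to a bag-of-words corpus. It does not use gensim: l2_normalize is a hypothetical helper written for illustration, and the tiny corpus is made-up data. It only shows the kind of result to expect, a new list of documents in the same (term id, weight) format with each document scaled to unit length.

```python
import math


def l2_normalize(bow):
    """Scale the weights of one bag-of-words document to unit Euclidean length."""
    length = math.sqrt(sum(weight * weight for _, weight in bow))
    if length == 0:
        # Empty or all-zero document: nothing to scale.
        return list(bow)
    return [(token_id, weight / length) for token_id, weight in bow]


# Made-up corpus in the same format as doc2bow output.
corpus = [[(0, 3), (1, 4)], [(2, 5)], []]

# Normalize document by document; the result is again a list of documents.
norm_corpus = [l2_normalize(bow) for bow in corpus]

for doc in norm_corpus:
    print(doc)
```

With gensim, the analogous step would presumably be to call the model's normalize method on each document in corpus, collect the results in a list, and pass that list (rather than the model object) to models.LsiModel.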