python machine-learning nlp gensim

TypeError when extracting bigrams with Gensim (Python)


I want to extract and print bigrams using Gensim. For this purpose I used the following code in Google Colab:

import gensim.downloader as api
from gensim.models import Word2Vec
from gensim.corpora import WikiCorpus, Dictionary
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from collections import Counter

data = api.load("text8") # wikipedia corpus
bigram = Phrases(data, min_count=3, threshold=10)


cntr = Counter()
for key in bigram.vocab.keys():
  if len(key.split('_')) > 1:
    cntr[key] += bigram.vocab[key]

for key, counts in cntr.most_common(50):
  print(key, " - ", counts)

But there's an error:

TypeError

Then I tried this:

cntr = Counter()
for key in bigram.vocab.keys():
  if len(key.split(b'_')) > 1:
    cntr[key] += bigram.vocab[key]

for key, counts in cntr.most_common(50):
  print(key, " - ", counts)

And then I got an error again.

What is wrong?


Solution

Check the type of the keys in bigram.vocab:

    bigram_token = list(bigram.vocab.keys())
    type(bigram_token[0])

    # output
    bytes

The keys are bytes, not str, so calling key.split('_') on them raises a TypeError. Convert each key to a string before splitting, and split on a str separator:

    cntr = Counter()
    for key in bigram.vocab.keys():
        if len(key.decode('utf-8').split('_')) > 1:  # .decode('utf-8') turns bytes into str
            cntr[key] += bigram.vocab[key]
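
If you are not sure whether the keys are bytes or str (as far as I know this differs between Gensim versions), one option is to normalize each key before splitting. The following is only a minimal sketch: to_text, count_bigrams and sample_vocab are hypothetical names, and sample_vocab is a hand-made stand-in for bigram.vocab so the snippet runs without downloading the corpus.

```python
from collections import Counter


def to_text(key, encoding="utf-8"):
    """Return key as str, decoding it first if it is bytes."""
    return key.decode(encoding) if isinstance(key, bytes) else key


def count_bigrams(vocab, delimiter="_"):
    """Collect counts for vocab entries whose key contains the delimiter."""
    counts = Counter()
    for key, freq in vocab.items():
        text = to_text(key)
        # A key such as "new_york" splits into two parts, a unigram into one.
        if len(text.split(delimiter)) > 1:
            counts[text] += freq
    return counts


# Hypothetical stand-in for bigram.vocab, mixing bytes and str keys.
sample_vocab = {b"new_york": 12, b"york": 30, "machine_learning": 7, b"the": 100}

for phrase, freq in count_bigrams(sample_vocab).most_common(50):
    print(phrase, "-", freq)
```

With the real model you would pass bigram.vocab instead of sample_vocab. Because the counter stores the decoded text, the printed keys are plain strings and not b'...' literals.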