Tags: python, tokenize, gensim

Tokenizing a Gensim dataset


I'm trying to tokenize a gensim dataset, which I've never worked with before, and I'm not sure if it's a small bug or if I'm not doing it properly.

I loaded the dataset using

model = api.load('word2vec-google-news-300')

and from my understanding, to tokenize using nltk, all I need to do is call

tokens = word_tokenize(model)

However, the error I'm getting is "TypeError: expected string or bytes-like object". What am I doing wrong?


Solution

  • word2vec-google-news-300 isn't a dataset that's appropriate to 'tokenize'; it's the pretrained GoogleNews word2vec model released by Google circa 2013 with 3 million word-vectors. It's got lots of word-tokens, each with a 300-dimensional vector, but no multiword texts needing tokenization.

    You can run type(model) on the object that api.load() returns to see its Python type, which will offer more clues as to what's appropriate to do with it (see the inspection sketch at the end of this answer).

    Also, something like nltk's word_tokenize() takes a single string; you typically wouldn't pass it an entire large dataset in one call anyway. (You'd be more likely to iterate over many individual texts as strings, tokenizing each in turn; see the tokenization sketch below.)

    Rewind a bit & think more about what kind of dataset you're looking for.

    Try to get the data in a simple format you can inspect yourself, as files, before doing extra steps. (Gensim's api.load() is really bad/underdocumented for that, returning who-knows-what depending on what you've requested; see the downloader sketch below.)

    Try building on well-explained examples that already work, making minimal changes that you understand individually, and checking continued proper operation after each step.

    (Also, for future SO questions that may be any more complicated than this one: it's usually best to include the full error message you've received, including all lines of 'traceback' context showing the files and lines of code involved, so readers can better point at the relevant lines, in your code or in the libraries you're using, that are most directly involved.)
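
For example, a minimal sketch of the inspection suggested above, assuming gensim and its downloader are installed (note this model is a large download, roughly 1.6 GB):

import gensim.downloader as api

model = api.load('word2vec-google-news-300')  # returns pretrained word-vectors, not raw texts
print(type(model))                            # expect gensim's KeyedVectors class
print(model['king'].shape)                    # each word maps to a 300-dimensional vector
print(model.most_similar('king', topn=3))     # nearest neighbours by vector similarity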
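
And a sketch of the usual word_tokenize() pattern, calling it on individual strings rather than on a whole dataset object (the example texts here are made up):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer data, needed once (newer NLTK releases may also want 'punkt_tab')

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "A second short document, tokenized on its own.",
]
tokenized = [word_tokenize(text) for text in texts]  # one call per string
print(tokenized[0])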
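
Finally, a sketch of how to check what Gensim's downloader will actually hand you before building on it, using api.info() and the return_path option from gensim.downloader:

import gensim.downloader as api

print(api.info('word2vec-google-news-300'))  # metadata describing what api.load() returns
path = api.load('word2vec-google-news-300', return_path=True)  # just the local file path, nothing parsed
print(path)  # inspect this file on disk yourself before doing extra steps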