python, gensim

Tokenizing the data properly in gensim


I am a bit confused about how to tokenize data correctly in gensim. I have a text file, myfile.txt, that contains the following text:

""" 
this is a very long string with a title


and some white space. Multiple sentences, too. This is nuts!
Yay! :):):) 
"""

I load this file in gensim using LineReader('myfile.txt') to train a word2vec model (of course, my data is much bigger than the example above).

But is this text tokenized properly? I am asking because LineReader seems to be very specific about its input:

"The format of files (either text, or compressed text files) in the path is one sentence = one line, with words already preprocessed and separated by whitespace." (See https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence)

I am confused. Am I doing things right? How should I tokenize my text for LineReader?

Thanks!


Solution

  • That will work, but because Gensim's LineSentence class (what I assume you mean) breaks tokens on whitespace, your line...

    and some white space. Multiple sentences, too. This is nuts!
    

    ...will become the list of word-tokens:

    ['and', 'some', 'white', 'space.', 'Multiple', 
    'sentences,', 'too.', 'This', 'is', 'nuts!']
    

    That means tokens like 'space.', 'sentences,', and 'nuts!' will be treated as words, potentially even receiving trained word-vectors of their own (if they appear at least min_count times).

    That's probably not what you want, but it's also not necessarily a big problem. In a sufficiently large corpus, the words you care about will appear so many times without attached punctuation that you'll probably still get good vectors for them.

    But more typically, you'd preprocess your text to either strip that punctuation, or split it off from words with extra space delimiter characters. (When you do that, the punctuation marks themselves become 'words' of a sort.)
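The two preprocessing options above can be sketched with nothing but the standard library. This is a minimal illustration, not Gensim's own tokenizer: the helper names strip_punctuation and split_off_punctuation are hypothetical, and the regular expressions are just one simple way to define "word" and "punctuation". Gensim also ships gensim.utils.simple_preprocess, which lowercases and drops punctuation, if you would rather not write this yourself.

```python
import re

line = "and some white space. Multiple sentences, too. This is nuts!"

# A plain whitespace split leaves punctuation glued to the words.
whitespace_tokens = line.split()


def strip_punctuation(text):
    """Hypothetical helper: keep only runs of word characters."""
    return re.findall(r"\w+", text)


def split_off_punctuation(text):
    """Hypothetical helper: keep words, and emit each punctuation
    mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


stripped = strip_punctuation(line)
separated = split_off_punctuation(line)

# Re-joining with single spaces yields a line that a whitespace-based
# reader will split back into exactly these tokens.
preprocessed_line = " ".join(separated)

print(whitespace_tokens)
print(stripped)
print(preprocessed_line)
```

Writing each preprocessed line back to a text file (one sentence per line) gives input in the whitespace-separated format the documentation describes. Whether to strip punctuation or keep it as separate tokens depends on whether you want vectors for the punctuation marks themselves.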