I am a bit confused as to how to tokenize the data correctly in gensim.
I have a text file, myfile.txt, that contains the following text:
"""
this is a very long string with a title
and some white space. Multiple sentences, too. This is nuts!
Yay! :):):)
"""
I load this file in gensim using LineReader('myfile.txt') to train a word2vec model (of course, my data is much bigger than the example above).
But is this text tokenized properly? I am asking because LineReader seems to be very specific:
The format of files (either text, or compressed text files) in the path is one sentence = one line, with words already preprocessed and separated by whitespace. See https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence
I am confused. Am I doing things right? How should I tokenize my text for LineReader?
Thanks!
That will work, but because Gensim's LineSentence class (which I assume is what you mean) breaks tokens on whitespace, your line...
and some white space. Multiple sentences, too. This is nuts!
...will become the list of word-tokens:
['and', 'some', 'white', 'space.', 'Multiple',
'sentences,', 'too.', 'This', 'is', 'nuts!']
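As a rough check, the same list can be reproduced with plain Python. This sketch assumes that splitting on whitespace with str.split() is a fair stand-in for what LineSentence does to each line, as its documentation describes. It does not call Gensim itself.

```python
# One line from the example file in the question.
line = "and some white space. Multiple sentences, too. This is nuts!"

# Assumption: splitting on runs of whitespace approximates how
# LineSentence turns one line into a list of tokens.
tokens = line.split()

print(tokens)
# ['and', 'some', 'white', 'space.', 'Multiple',
#  'sentences,', 'too.', 'This', 'is', 'nuts!']
```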
That means tokens like 'space.', 'sentences,', & 'nuts!' will be treated as words, potentially even receiving trained word-vectors, too (if they appear at least min_count times).
That's probably not what you want, but it's also not necessarily a big problem. In a sufficiently large corpus, all the words you care about will appear so many times without this connected-punctuation issue that you'll probably still get good vectors for them.
But more typically, you'd preprocess your text to either strip that punctuation, or split it off from words with extra space delimiter characters. (When you do that, the punctuation marks themselves become 'words' of a sort.)
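Both preprocessing options can be sketched with the standard library's re module. The helper names strip_punctuation and split_off_punctuation are made up for this illustration, and the regular expressions are just one simple choice, not something Gensim prescribes.

```python
import re

line = "and some white space. Multiple sentences, too. This is nuts!"

def strip_punctuation(text):
    # Hypothetical helper: keep only runs of word characters,
    # so punctuation is discarded entirely.
    return re.findall(r"\w+", text)

def split_off_punctuation(text):
    # Hypothetical helper: each run of word characters is a token, and
    # each punctuation character becomes its own one-character token.
    return re.findall(r"\w+|[^\w\s]", text)

stripped = strip_punctuation(line)
split_off = split_off_punctuation(line)

print(stripped)
# ['and', 'some', 'white', 'space', 'Multiple', 'sentences', 'too', 'This', 'is', 'nuts']
print(split_off)
# ['and', 'some', 'white', 'space', '.', 'Multiple', 'sentences', ',', 'too', '.', 'This', 'is', 'nuts', '!']

# Re-joined with single spaces, a line is back in the
# "one sentence = one line, whitespace-separated" form.
preprocessed_line = " ".join(split_off)
print(preprocessed_line)
# and some white space . Multiple sentences , too . This is nuts !
```

Writing each preprocessed line back to a text file gives input that LineSentence can split cleanly. Gensim also ships gensim.utils.simple_preprocess, which lowercases and drops punctuation (and, by default, very short tokens), if you would rather not write your own.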