When I attempt to run FastText using gensim in Python, the best I can get is a list of most-similar results in which every entry is a single character. (I'm on a Windows machine, which I've heard can affect the result.)
I have all of my data stored either in a CSV file, in which I've already tokenized each sentence, or in the original txt file I started with. When I try to use the CSV file, I end up with the single-character results.
Here's the code I'm using to process my CSV file (I'm analyzing how sports articles discuss white vs. nonwhite NFL quarterbacks differently; this is the code for my NonWhite results CSV file):
from gensim.models import FastText
from gensim.test.utils import get_tmpfile, datapath
from gensim import utils
import os

# modelpath and csvpath are defined earlier in my script
embedding_size = 200
window_size = 10
min_word = 5
down_sampling = 1e-2

if os.path.isfile(modelpath):
    model1 = FastText.load(modelpath)
else:
    class NWIter():
        def __iter__(self):
            path = datapath(csvpath)
            with utils.open(path, 'r') as fin:
                for line in fin:
                    yield line

    model1 = FastText(vector_size=embedding_size, window=window_size,
                      min_count=min_word, sample=down_sampling, workers=4)
    model1.build_vocab(corpus_iterable=NWIter())
    exs1 = model1.corpus_count
    model1.train(corpus_iterable=NWIter(), total_examples=exs1, epochs=50)
    model1.save(modelpath)
In the cleaned CSV data, each row represents a sentence that has already been cleaned (stopwords removed, tokenized, and lemmatized).
When that didn't work, I tried to bring in the raw text instead, but got lots of UTF-8 encoding errors from unrecognizable characters. After working around those, I finally got the raw text file to read in, only for the single-character results to come back.
So the issue persists regardless of whether I use the CSV file or the txt file. I'd prefer to stick with the CSV since I've already processed the information; how can I bring that data in without Python (or gensim) treating individual characters as the unit of analysis?
Edit: Here are the results I get when I run:
print('NonWhite: ',model1.wv.most_similar('smart', topn=10))
NonWhite: [('d', 0.36853086948394775), ('q', 0.326141357421875), ('s', 0.3181183338165283), ('M', 0.27458563446998596), ('g', 0.2703150510787964), ('o', 0.215525820851326), ('x', 0.2153075635433197), ('j', 0.21472081542015076), ('f', 0.20139966905117035), ('a', 0.18369245529174805)]
The Gensim FastText model (like the other models in the Word2Vec family) needs each individual text as a list-of-string-tokens, not a plain string.
If you pass texts as plain strings, they behave like lists of single characters, because that is how Python treats strings. Hence, the only 'words' the model sees are single characters, including the individual spaces.
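To see the difference concretely, here's a quick illustrative snippet (not part of your pipeline) showing what gensim receives in each case:

sentence = "quarterback threw a smart pass"

# Iterating a plain string yields one character at a time,
# so these single characters become the model's "words":
print(list(sentence)[:11])
# ['q', 'u', 'a', 'r', 't', 'e', 'r', 'b', 'a', 'c', 'k']

# A list of string tokens is what the model actually needs:
print(sentence.split())
# ['quarterback', 'threw', 'a', 'smart', 'pass']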
If the format of your file is such that each line is already a space-delimited text, you could simply change your yield line to:

yield line.split()
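Applied to your iterator, that one-line change looks like this (a minimal sketch; it assumes csvpath, as in your original code, points at a file where every line is one space-delimited, already-cleaned sentence):

from gensim import utils
from gensim.test.utils import datapath

class NWIter():
    def __iter__(self):
        path = datapath(csvpath)  # csvpath defined as in your original code
        with utils.open(path, 'r') as fin:
            for line in fin:
                # split() turns the raw line into a list of string tokens,
                # the unit of analysis gensim expects
                yield line.split()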
If instead it's truly a CSV, and your desired training texts are in only one column, you should pick out that field and break it properly into a list-of-string-tokens.
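For example, if the cleaned sentence is stored as a space-separated string in the first column, a sketch along these lines would work (the column index and encoding here are assumptions; adjust them to your file's actual layout):

import csv

class NWIter():
    def __iter__(self):
        with open(csvpath, 'r', encoding='utf-8', newline='') as fin:
            for row in csv.reader(fin):
                # row[0] is assumed to hold the cleaned sentence;
                # split it into the list-of-tokens format gensim expects
                yield row[0].split()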