I want to train fastText on my own corpus. However, I have a small question before continuing. Do I need each sentence as a different item in the corpus, or can I have many sentences as one item?
For example, I have this DataFrame:
text | summary
------------------------------------------------------------------
this is sentence one this is sentence two continue | one two other
other similar sentences some other | word word sent
Basically, the column text
is an article, so it has many sentences. Because of the preprocessing, I no longer have full stops. So the question is: can I do something like this directly, or do I need to split it into individual sentences?
from sklearn.feature_extraction.text import TfidfVectorizer

docs = df['text']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
From the tutorials I read, I need a list of words for each sentence, but what if I have a list of words from a whole article? What is the difference? Is this the right way to train fastText on my own corpus?
Thank you!
FastText requires text as its training data - not anything that's pre-vectorized, as if by TfidfVectorizer. (If that's part of your FastText process, it's misplaced.)
The Gensim FastText support requires the training corpus as a Python iterable, where each item is a list of string word-tokens.
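As a minimal sketch (assuming Gensim 4.x, a DataFrame df with a 'text' column like the one above, and a plain whitespace split standing in for whatever tokenization you actually use):

from gensim.models import FastText

# Each article becomes one list of string word-tokens; the whitespace
# split is only a placeholder for your real preprocessing/tokenizer.
corpus = [text.split() for text in df['text']]

# Illustrative parameters, not tuned recommendations.
model = FastText(corpus, vector_size=100, window=5, min_count=2, epochs=10)

# FastText's subword information lets it build vectors even for rare/unseen words.
vector = model.wv['sentence']

Note that each item here is a whole article, not a single sentence - that's fine, subject to the length limit described below.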
Each list-of-tokens is typically some cohesive text, where the neighboring words appear together the way they would in normal natural-language usage. It might be a sentence, a paragraph, a post, an article/chapter, or whatever. Gensim's only limitation is that each text shouldn't be more than 10,000 tokens long. (If your texts are longer than that, they should be fragmented into separate 10,000-or-fewer-token parts. But don't worry too much about the loss of association around the split points - in training sets sufficiently large for an algorithm like FastText, any such loss of context is negligible.)
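If some of your articles do run past that limit, the split is easy to do yourself - a hedged sketch in plain Python (the fragments helper is a hypothetical name, not a Gensim function):

MAX_TOKENS = 10000  # Gensim's per-text limit

def fragments(tokens, size=MAX_TOKENS):
    # Yield successive pieces of at most `size` tokens.
    for start in range(0, len(tokens), size):
        yield tokens[start:start + size]

corpus = [piece
          for text in df['text']
          for piece in fragments(text.split())]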