machine-learning nlp pipeline word2vec fasttext

How to access to FastText classifier pipeline?

As we know Facebook's FastText is a great open-source, free, lightweight library which can be used for text classification. But here a problem is the pipeline seem to be end-to end black-box. Yes, we can change the hyper-parameters from these options for setting training configuration. But I couldn't manage to find a way to access to the vector embedding it generates internally.

Actually I want to do some manipulation on the vector embedding - like introducing tf-idf weighting apart from these word2vec representations and another thing I want to to is oversampling using SMOTE which requires numerical representation. For these reasons I need to introduce my custom code in between the overall pipeline which seems to be inaccessible for me. How introduce custom steps in this pipeline?

Solution

The full source code is available:

https://github.com/facebookresearch/fastText

So, you can make any changes or extensions you can imagine - if you're comfortable reading & modifying its C++ source code. Nothing is hidden or inaccessible.

Note that both FastText, and its supervised classification mode, are chiefly conventions for training a shallow neural-network. It may not be helpful to think of it as a "pipeline" like in the architecture of other classifier libraries - as none of the internal interfaces use that sort of language or modular layout.

Specifically, if you get the gist of word2vec training, FastText classifier mode really just replaces attempted-predictions of neighboring (in-context-window) vocabulary words, with attempted-predictions of known labels instead.

For the sake of understanding FastText's relationship to other techniques, and potential aspects for further extension, I think it's useful to also review:

this skeptical blog post comparing FastText to the much-earlier 'vowpal wabbit' tool: "Fast & easy baseline text categorization with vw"
Facebook's far-less discussed extension of such vector-training for more generic categorical or numerical tasks, "StarSpace"