Facebook's FastText is a great free, open-source, lightweight library for text classification. The problem is that the pipeline seems to be an end-to-end black box. Yes, we can change the hyperparameters through these options for setting the training configuration, but I couldn't find a way to access the vector embeddings it generates internally.
I want to manipulate the vector embeddings, for example by introducing tf-idf weighting in addition to the word2vec representations. I also want to oversample using SMOTE, which requires a numerical representation. For these reasons I need to insert my own code in the middle of the overall pipeline, which seems to be inaccessible to me. How can I introduce custom steps into this pipeline?
The full source code is available:
https://github.com/facebookresearch/fastText
So, you can make any changes or extensions you can imagine - if you're comfortable reading & modifying its C++ source code. Nothing is hidden or inaccessible.
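If editing the C++ is more than you need, one alternative is to do the manipulation outside FastText, on vectors exported from a trained model. Below is a minimal sketch of tf-idf weighted averaging in plain Python. The function name `tfidf_weighted_vectors`, the toy vectors, and the smoothed idf formula are my own assumptions for illustration, not anything FastText does internally. In practice the `word_vectors` dict would be filled from a trained model, for example with the `get_word_vector` method of the official Python bindings.

```python
import math
from collections import Counter

def tfidf_weighted_vectors(docs, word_vectors, dim):
    """Return one tf-idf weighted average vector per tokenized document.

    docs: list of token lists. word_vectors: dict of word -> list of floats
    (hypothetical here; assumed to be exported from a trained model).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    # Smoothed idf (an assumption; any idf variant could be swapped in).
    idf = {w: math.log((1 + n_docs) / (1 + c)) + 1.0 for w, c in df.items()}

    result = []
    for tokens in docs:
        acc = [0.0] * dim
        total = 0.0
        for word, count in Counter(tokens).items():
            vec = word_vectors.get(word)
            if vec is None:
                continue  # skip words with no exported vector
            weight = count * idf[word]
            total += weight
            for i, x in enumerate(vec):
                acc[i] += weight * x
        # Normalize by total weight; all-unknown documents stay at zero.
        result.append([a / total for a in acc] if total > 0 else acc)
    return result

# Toy data, purely illustrative.
toy_vectors = {"good": [1.0, 0.0], "bad": [0.0, 1.0], "movie": [0.5, 0.5]}
toy_docs = [["good", "movie"], ["bad", "movie"]]
doc_vecs = tfidf_weighted_vectors(toy_docs, toy_vectors, dim=2)
```

The resulting fixed-size numeric rows are the kind of input that SMOTE implementations expect. Note that the oversampled vectors could then only be fed to some other classifier, not back into FastText's own training, unless you modify its source.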
Note that both FastText, and its supervised classification mode, are chiefly conventions for training a shallow neural network. It may not be helpful to think of it as a "pipeline" like in the architecture of other classifier libraries, as none of the internal interfaces use that sort of language or modular layout.
Specifically, if you get the gist of word2vec training, FastText classifier mode really just replaces attempted-predictions of neighboring (in-context-window) vocabulary words, with attempted-predictions of known labels instead.
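To make that concrete, here is a rough sketch of the prediction side of such a classifier: average the input vectors of a text's tokens, then score each known label with an output vector and apply a softmax. This is a simplification under stated assumptions: it ignores character n-grams, word n-grams, hashing, and the alternative loss functions, and the names `predict_label_probs`, `input_vectors`, and `output_weights` plus all numbers are hypothetical, not FastText's actual internals.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label_probs(tokens, input_vectors, output_weights, dim):
    """Average token vectors, then predict labels instead of neighboring words."""
    known = [input_vectors[t] for t in tokens if t in input_vectors]
    if known:
        # Hidden layer: element-wise mean of the token vectors.
        hidden = [sum(col) / len(known) for col in zip(*known)]
    else:
        hidden = [0.0] * dim
    labels = list(output_weights)
    # One output vector per label; in word2vec these would be vocabulary words.
    scores = [sum(w * h for w, h in zip(output_weights[lab], hidden))
              for lab in labels]
    return dict(zip(labels, softmax(scores)))

# Toy parameters, purely illustrative.
input_vectors = {"good": [1.0, 0.0], "bad": [0.0, 1.0], "movie": [0.5, 0.5]}
output_weights = {"__label__pos": [2.0, -2.0], "__label__neg": [-2.0, 2.0]}
probs = predict_label_probs(["good", "movie"], input_vectors, output_weights, dim=2)
```

Training adjusts both sets of vectors by backpropagating the error of these label predictions, the same way word2vec does for its predictions of context words.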
For the sake of understanding FastText's relationship to other techniques, and potential aspects for further extension, I think it's useful to also review: