machine-learning nlp pipeline word2vec fasttext

How to access the FastText classifier pipeline?


As we know, Facebook's FastText is a great free, open-source, lightweight library that can be used for text classification. The problem is that the pipeline seems to be an end-to-end black box. Yes, we can change the hyperparameters through the available options for setting the training configuration. But I couldn't find a way to access the vector embeddings it generates internally.

Actually, I want to do some manipulation of the vector embeddings, such as introducing tf-idf weighting on top of these word2vec-style representations. Another thing I want to do is oversampling with SMOTE, which requires a numerical representation. For these reasons I need to insert my own code into the middle of the overall pipeline, which seems to be inaccessible to me. How can I introduce custom steps into this pipeline?


Solution

  • The full source code is available:

    https://github.com/facebookresearch/fastText

    So, you can make any changes or extensions you can imagine - if you're comfortable reading & modifying its C++ source code. Nothing is hidden or inaccessible.

    Note that both FastText and its supervised classification mode are chiefly conventions for training a shallow neural network. It may not be helpful to think of it as a "pipeline" in the sense used by other classifier libraries, as none of the internal interfaces use that sort of language or modular layout.

    Specifically, if you get the gist of word2vec training, FastText's classifier mode really just replaces the attempted predictions of neighboring (in-context-window) vocabulary words with attempted predictions of known labels.
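To make that swap concrete, here is a toy sketch in plain Python. It is not FastText's code: the vectors, the label names, and the helper functions are all made up for illustration, and it ignores training, subword n-grams, and the alternative loss functions. It only shows that the same "average the input vectors, score against an output table, softmax" step predicts a word when the output table is keyed by vocabulary words (as in the CBOW flavor of word2vec) and predicts a label when the table is keyed by labels.

```python
import math

# Hypothetical 2-dimensional input vectors, one per vocabulary word.
INPUT_VECTORS = {
    "great": [1.0, 0.2],
    "awful": [-1.0, 0.2],
    "movie": [0.0, 1.0],
}

# Output table for word2vec-style training: one row per vocabulary word.
WORD_OUTPUT_VECTORS = {
    "great": [0.9, 0.1],
    "awful": [-0.9, 0.1],
    "movie": [0.1, 0.8],
}

# Output table for classifier mode: one row per known label instead.
LABEL_OUTPUT_VECTORS = {
    "__label__pos": [1.0, 0.0],
    "__label__neg": [-1.0, 0.0],
}


def average_vectors(tokens, input_vectors):
    """Average the input vectors of the known tokens (the hidden layer)."""
    known = [input_vectors[t] for t in tokens if t in input_vectors]
    if not known:
        raise ValueError("none of the tokens are in the vocabulary")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(tokens, input_vectors, output_vectors):
    """Score the averaged input against every row of the output table."""
    hidden = average_vectors(tokens, input_vectors)
    targets = sorted(output_vectors)
    scores = [
        sum(h * w for h, w in zip(hidden, output_vectors[t])) for t in targets
    ]
    return dict(zip(targets, softmax(scores)))


# Same forward pass, two different kinds of prediction target.
word_probs = predict_proba(["great"], INPUT_VECTORS, WORD_OUTPUT_VECTORS)
label_probs = predict_proba(["great", "movie"], INPUT_VECTORS, LABEL_OUTPUT_VECTORS)
```

In this sketch the only thing that changes between the two calls is the output table. The averaged vector computed in `average_vectors` is the numerical text representation that any custom weighting or resampling step would need to get at.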

    For the sake of understanding FastText's relationship to other techniques, and potential aspects for further extension, I think it's useful to also review: