
Reducing the size of Facebook's fastText embeddings


I am building a machine learning model that will process documents and extract some key information from them. For this, I need word embeddings for the OCRed output. I have several options for the embeddings (Google's word2vec, Stanford's GloVe, Facebook's fastText), but my main concern is OOV words, as the OCR output will contain many misspelled words. For example, I want the vectors for "Embedding" and "Embdding" (the 'e' missed by the OCR) to have a certain level of similarity. I don't care much about the associated contextual information.
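For illustration, this is the kind of check I have in mind; a minimal sketch using the fasttext Python package and a pre-trained English model (the cosine helper is my own):

    import numpy as np
    import fasttext

    ft = fasttext.load_model('cc.en.300.bin')  # pre-trained English model

    def cosine(a, b):
        # plain cosine similarity between two vectors
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    v_full = ft.get_word_vector('Embedding')
    v_ocr = ft.get_word_vector('Embdding')  # OOV: composed from character n-grams
    print(cosine(v_full, v_ocr))            # should be fairly high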

I chose Facebook's fastText because it produces embeddings for OOV words as well, composing them from character n-grams. My only concern is the size of the embeddings: fastText's pre-trained vectors are 300-dimensional. Is there a way to reduce the size of the returned word vectors? I am thinking of using PCA or another dimensionality reduction technique, but given the number of word vectors, that could be time-consuming.
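For reference, a PCA-based reduction would look roughly like this; a sketch assuming scikit-learn, where the components are fit on a sample of in-vocabulary words and then applied to any vector, OOV ones included:

    import numpy as np
    import fasttext
    from sklearn.decomposition import PCA

    ft = fasttext.load_model('cc.en.300.bin')

    # Fit PCA on a sample of in-vocabulary vectors ...
    sample = ft.get_words()[:10000]
    matrix = np.stack([ft.get_word_vector(w) for w in sample])
    pca = PCA(n_components=100).fit(matrix)

    # ... then project any vector, including OOV ones, down to 100 dimensions
    reduced = pca.transform(ft.get_word_vector('Embdding').reshape(1, -1))
    print(reduced.shape)  # (1, 100)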


Solution

    import fasttext
    import fasttext.util

    # Download the pre-trained English vectors once (skipped if already present)
    fasttext.util.download_model('en', if_exists='ignore')

    ft = fasttext.load_model('cc.en.300.bin')
    print(ft.get_dimension())  # 300

    # Reduce the model in place to 100-dimensional vectors
    fasttext.util.reduce_model(ft, 100)
    print(ft.get_dimension())  # 100


    This reduces the embedding dimension from 300 to 100, shrinking both the vector length and the model's memory footprint.

    Link to official documentation: https://fasttext.cc/docs/en/crawl-vectors.html
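
    If you want to keep the smaller model around, you can save it back to disk and query it as usual; a short sketch (the output filename is just an example):

    # Persist the reduced model under a new name (example filename)
    ft.save_model('cc.en.100.bin')

    # OOV words still get vectors, now 100-dimensional
    print(ft.get_word_vector('Embdding').shape)  # (100,)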