
How does Facebook's fastText library handle numerical data in input for word vectorization?


I am using Facebook's fastText for text classification. I would like to know how the library handles numbers in a text string provided as input for word vectorization.

  1. Does fastText cast each number to a string before creating word vectors?

    E.g. 1124 to " 1124 "

  2. Or is some other transformation/preprocessing performed in the background before training?

    E.g. 1124 to " one one two four "

What is the best approach to handling numerical data if my input text to fastText contains numbers?


Solution

  • fastText doesn't do any preprocessing of numeric tokens. They are treated like any other whitespace-separated "word", so 1124 is simply the token "1124".

    Unless you already have a specific problem with fastText and numbers in your input, I wouldn't worry about what fastText does with the numbers. Just use it as normal.

    If you have a lot of numbers and they're causing problems (this is possible, since fastText likely doesn't have useful vectors for most specific numbers), you can pre-process your input to replace them with <NUMBER> or another dummy token. That way these sentences will be the same to fastText:

    1. I ate 1023 oranges.
    2. I ate 1024 oranges.

    Whether you want to treat those as the same or not depends on your application.
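
The replacement step described above could be sketched in Python roughly as follows. This is only a minimal illustration, not something fastText provides: the `mask_numbers` helper and the regular expression are assumptions made for this example, and the pattern only covers plain integers and simple decimal or thousands-separated forms. It runs on the raw text before the text is handed to fastText.

```python
import re

# Hypothetical pattern: a run of digits, optionally with "." or ","
# separated digit groups (e.g. 1023, 3.14, 1,024). The lookarounds skip
# digits glued to letters or underscores, so tokens such as "abc123"
# or "__label__5" are left untouched.
NUMBER_RE = re.compile(r"(?<![\w.])\d+(?:[.,]\d+)*(?!\w)")


def mask_numbers(text, token="<NUMBER>"):
    """Replace every standalone number in `text` with a dummy token."""
    return NUMBER_RE.sub(token, text)


if __name__ == "__main__":
    for line in ["I ate 1023 oranges.", "I ate 1024 oranges."]:
        print(mask_numbers(line))
```

After masking, both example sentences become identical, so the model sees a single shared token instead of many rare numeric tokens. If the magnitude of the number matters for your task, a coarser scheme (for instance, one token per number of digits) may be a better fit than a single dummy token.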