nlp · word2vec · fasttext

How do I limit word length in FastText?


I am using FastText to compute skipgrams on a corpus consisting of one long sequence of characters with no spaces. After an hour or so, FastText produces a model containing vectors (of length 100) corresponding to 50-character "words" from the corpus.

I tried setting the -minn and -maxn parameters, but that does not help (I kind of knew it wouldn't, but tried anyway), and the -wordNgrams parameter only applies if there are spaces, I guess (?!). This is just a long stream of characters representing state, without spaces.
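
For concreteness, the training call is roughly equivalent to the following (shown via the Python bindings; the file name and exact parameter values are just illustrative, not my real setup):

```python
import fasttext

# Roughly what the training looks like; "states.txt" and the exact
# parameter values are illustrative.
model = fasttext.train_unsupervised(
    "states.txt",       # one long character stream, no spaces
    model="skipgram",
    dim=100,            # vector length mentioned above
    minn=3,             # character n-gram bounds I tried varying
    maxn=6,
)
```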

The documentation doesn't seem to have any information on this (or perhaps I'm missing something?)


Solution

  • The tool just takes whatever space-delimited tokens you feed it.

    If you want to truncate or discard tokens longer than 50 characters (or any other threshold), you'd need to preprocess the data yourself; see the sketch below.

    (If your question is actually something else, add more details to the question showing example lines from your corpus, how you're invoking fasttext on it, how you're reviewing the unsatisfactory results, and how you'd expect satisfactory results to look instead.)
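
For example, a minimal preprocessing sketch along those lines (the file names, the filter_tokens helper, and the 50-character threshold are all illustrative; whether you truncate or drop over-long tokens is your call):

```python
MAX_LEN = 50  # length threshold; pick whatever suits your data

def filter_tokens(line, max_len=MAX_LEN, mode="truncate"):
    """Truncate (or drop) whitespace-delimited tokens longer than max_len."""
    kept = []
    for token in line.split():
        if len(token) <= max_len:
            kept.append(token)
        elif mode == "truncate":
            kept.append(token[:max_len])
        # mode == "discard": skip the over-long token entirely
    return " ".join(kept)

with open("corpus.txt") as src, open("corpus_filtered.txt", "w") as dst:
    for raw in src:
        dst.write(filter_tokens(raw.strip()) + "\n")
```

Then point fasttext at the filtered file instead of the raw corpus.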