I'm currently working on fastText unsupervised learning. I wanted to clarify something about the context window described in the fastText documentation.
The description of the fastText library for Python, https://github.com/facebookresearch/fastText/tree/master/python, lists several arguments for training a fastText model. One of the arguments is,
My input file contains lines with 2 - 3 tokens.
Eg.,
The default window size is 5. In the above example, my lines have fewer tokens than the window size. What will happen if the window size is bigger than the document length?
FastText (& related algorithms like word2vec) will simply use as much of the context window as is possible.
For example, assume a window-size of 5 and the input tokens:
['Senior', 'Database', 'Administrator']
When training with the 'center' word 'Senior', the algorithm would be ready to consult up to 5 words in either direction. But there are 0 words preceding 'Senior', and only 2 words succeeding 'Senior', so only those 2 following words will be considered as neighbors.
(No 'plug values' will be used as if they were blank neighbors, nor will any 'bleed-through' to neighboring texts occur.)
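To make the clipping concrete, here is a minimal Python sketch of the behavior described above. It is not fastText's actual code: `context_words` is a hypothetical helper written only for illustration, and it models nothing beyond cutting the window off at the edges of a single text.

```python
def context_words(tokens, center, window=5):
    """Return the neighbors of tokens[center], clipped to the text boundaries.

    Hypothetical illustration only; not part of the fastText API.
    """
    start = max(0, center - window)                # cannot reach before the first token
    end = min(len(tokens), center + window + 1)    # cannot reach past the last token
    return [tokens[i] for i in range(start, end) if i != center]


tokens = ['Senior', 'Database', 'Administrator']
for i, word in enumerate(tokens):
    print(word, '->', context_words(tokens, i))
```

With a window of 5 and only 3 tokens, every center word simply gets the other 2 tokens as its neighbors. No padding is added, and nothing is borrowed from the lines before or after.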
Two other related notes to keep in mind: