python-3.x, fasttext

FastText inconsistent on one-label model classification


I'm using the official FastText Python library (v0.9.2) for intent classification.

import fasttext

model = fasttext.train_supervised(input='./test.txt',
  loss='softmax',
  dim=200,
  bucket=2000000,
  epoch=25,
  lr=1.0)

where test.txt is a file containing just one sample:

__label__greetings hi

and when predicting two utterances, the results are:

print(model.words)
print('hi', model.predict('hi'))
print('bye', model.predict('bye'))
app_1  | ['hi']
app_1  | hi (('__label__greetings',), array([1.00001001]))
app_1  | bye ((), array([], dtype=float64))

This is my expected output. However, if I set two samples for the same label:

__label__greetings hi
__label__greetings hello

the result for the OOV utterance is not correct:

app_1  | ['hi', '</s>', 'hello']
app_1  | hi (('__label__greetings',), array([1.00001001]))
app_1  | bye (('__label__greetings',), array([1.00001001]))

I understand that the problem is with the </s> token (maybe from the \n in the text file?): when none of the words in the input are in the vocabulary, the text is replaced by </s>. Is there any training option or other way to avoid this behavior?

Thanks!


Solution

  • FastText is a big, data-hungry algorithm that starts from random initialization. You shouldn't expect its results to be sensible, or to match any particular expectations, on toy-sized datasets, where (for example) 100%-minus-epsilon of your n-gram buckets won't have received any training.

    I also wouldn't expect supervised mode to ever reliably predict no labels on realistic datasets: it expects all of its training data to have labels, and I've not seen any mention of using it to predict an implied "ghost" category of "not in training data" versus a single known label (as in one-class classification).

    (Speculatively, I think you might have to feed FastText's supervised mode explicitly __label__not-greetings labeled contrast data, perhaps just synthesized random strings if you've got nothing else, for it to have any hope of meaningfully predicting "not-greetings". A rough sketch of that idea follows at the end of this answer.)

    Given that, I wouldn't consider your first result for the input bye correct, nor the second result incorrect. Both are just noise from an undertrained model being asked to make a kind of distinction it isn't known to be able to make.
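
    For illustration, here is a minimal sketch of that contrast-data idea, assuming you have nothing better than synthesized gibberish for the negative class. The file name train_contrast.txt, the label __label__not-greetings, the number of synthetic lines, and the random_token helper are all assumptions made up for this example; only train_supervised and predict are the actual fasttext API.

    import random
    import string

    import fasttext

    def random_token(length=5):
        # Gibberish tokens so the contrast label has some material to learn from.
        return ''.join(random.choices(string.ascii_lowercase, k=length))

    # 'train_contrast.txt' is a made-up file name for this sketch.
    with open('train_contrast.txt', 'w') as f:
        # The two real greeting samples from the question.
        f.write('__label__greetings hi\n')
        f.write('__label__greetings hello\n')
        # Synthetic negative examples under an explicit contrast label.
        for _ in range(50):
            f.write(f'__label__not-greetings {random_token()} {random_token()}\n')

    model = fasttext.train_supervised(input='train_contrast.txt',
        loss='softmax',
        dim=200,
        epoch=25,
        lr=1.0)

    # With two labels the model can at least express "not a greeting",
    # though on data this small the output is still largely noise.
    print('hi', model.predict('hi'))
    print('bye', model.predict('bye'))

    Even then, with so little real data the predicted probabilities will vary between runs, which is the broader point above.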