The StarSpace documentation is unclear on the parameter 'fileFormat', which takes the value 'labelDoc' or 'fastText'. I would like to understand intuitively what material difference setting this parameter makes.
Currently, my best guess is that if you set fileFormat to 'fastText' then all tokens in the training file that do not have the prefix '__label__' will be broken down into character-level n-grams as in fastText. Alternatively, if you set fileFormat to 'labelDoc' then starspace will assume that all tokens are actually labels, and you do not need to prepend '__label__' to the tokens, because they will be recognized as labels anyway.
Is my thinking correct?
How StarSpace uses the labels depends heavily on the trainMode you are using. The labelDoc format is useful when you choose a trainMode that relies only on labels (trainMode 1 through 4). For those modes, using the fastText format with the __label__ prefix may amount to the same thing, but some trainModes (i.e. trainMode 1 or 3) benefit from the labelDoc format because it lets a whole sentence serve as a single label element.
To clarify: if you are performing a text classification task (as explained in this example), labelDoc wouldn't have any input recognized. The fastText format, on the other hand, will, as you stated, break down all unlabeled text into input and learn to predict the __label__ tags.
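To make the fastText-format split concrete, here is a minimal Python sketch that separates one training line into input words and labels. It only illustrates the idea and is not StarSpace's actual parser; the function name split_fasttext_line and the sample sentence are made up for this example.

```python
LABEL_PREFIX = "__label__"

def split_fasttext_line(line):
    """Split one fastText-format line into (input_words, labels)."""
    words, labels = [], []
    for token in line.split():
        # Tokens carrying the prefix are labels; everything else is input.
        if token.startswith(LABEL_PREFIX):
            labels.append(token)
        else:
            words.append(token)
    return words, labels

words, labels = split_fasttext_line("restaurant has great food __label__positive")
print(words)
print(labels)
```

In a classification setup the words would form the input side and the labels the side to be predicted.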
An example for the labelDoc format would be developing a content-based recommender system (as explained in this example), where every tab-separated sentence is used as LHS or RHS during training. But if you take a collaborative approach (the content of the articles, or wherever your sentences come from, is not taken into account), the model can be trained with either the fastText format (specifying the __label__ prefix) or the labelDoc format, since labels are picked randomly for LHS or RHS during training. (This second example is explained here.)
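The labelDoc case can be sketched in Python in the same hedged way. The code below splits one line into tab-separated sentences and randomly assigns one of them to the RHS and the rest to the LHS, which is how I understand trainMode 1 to behave. It is an illustration only, not StarSpace's implementation; the function names and sample line are hypothetical.

```python
import random

def split_labeldoc_line(line):
    """Split one labelDoc-format line into sentences (lists of words)."""
    parts = line.rstrip("\n").split("\t")
    return [part.split() for part in parts if part.strip()]

def pick_lhs_rhs(sentences, rng):
    """Pick one sentence at random as RHS; the remaining ones form the LHS.

    Assumption: this mirrors the idea behind trainMode 1, not its exact code.
    """
    idx = rng.randrange(len(sentences))
    rhs = sentences[idx]
    lhs = [s for i, s in enumerate(sentences) if i != idx]
    return lhs, rhs

sentences = split_labeldoc_line("hello world\tgood morning\tsee you later\n")
lhs, rhs = pick_lhs_rhs(sentences, random.Random(0))
print(lhs)
print(rhs)
```

Each sentence stays intact as one label element here, which is the practical difference from the fastText format, where a label is a single prefixed token.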