How do I build a H2O word2vec training_frame that distinguishes between different document/sentences etc.?
As far as I can read from the very limited documentation I have found, you simply supply one long list of words? Such as
'This' 'is' 'the' 'first' 'This' 'is' 'number' 'two'
However it would make sense to be able to distinguish – ideally something like this:
Name | ID
This | 1
is | 1
the | 1
first | 1
This | 2
is | 2
number | 2
two | 2
Is that possible?
word2vec is a type of unsupervised learning: it turns string data into numbers. So to do a classification you need to do a two-step process:
The documentation contains links to a categorization example in each of R and Python. This tutorial shows the same process on a different data set (and there should be a H2O World 2017 video that goes with that).
By the way, in your original example, you don't just supply the words; the sentences are separated by NA. If you give h2o.tokenize() a vector of sentences, it will make this format for you. So your example would actually be:
'This' 'is' 'the' 'first' NA 'This' 'is' 'number' 'two'