I am trying to understand the output of Spark's word2vec algorithm.
I have a data frame with a text column, which I tokenized, so each text is now a list of words in a column.
| tokenised_text|
|[if, you, hate, d...|
|[rampant, teen, s...|
|[the, united, sta...|
|[reuters, health,...|
|[brussels, (reute...|
I then run word2vec on this column as below:
from pyspark.ml.feature import Word2Vec
word2Vec = Word2Vec(vectorSize=100, seed=42, inputCol="tokenised_text", outputCol="model")
w2vmodel = word2Vec.fit(tokensDf)
When I transform the original data frame with the fitted word2vec model, I get an additional column containing a 100-dimensional vector. The first row of the resulting data frame is below.
[Row(tokenised_text=[u'if', u'you', u'hate', u'dealing', u'with', u'bank', u'tellers', u'or', u'customer', u'service', u'representatives,', u'then', u'the', u'royal', u'bank', u'of', u'scotland', u'might', u'have', u'a', u'solution', u'for', u'you.if', u'this', u'program', u'is', u'successful,', u'it', u'could', u'be', u'a', u'big', u'step', u'forward', u'on', u'the', u'road', u'to', u'automated', u'customer', u'service', u'through', u'the', u'use', u'of', u'ai,', u'notes', u'laurie', u'beaver,', u'research', u'associate', u'for', u'bi', u'intelligence,', u'business', u"insider's", u'premium', u'research', u"service.it's", u'noteworthy', u'that', u'luvo', u'does', u'not', u'operate', u'via', u'a', u'third-party', u'app', u'such', u'as', u'facebook', u'messenger,', u'wechat,', u'or', u'kik,', u'all', u'of', u'which', u'are', u'currently', u'trying', u'to', u'create', u'bots', u'that', u'would', u'assist', u'in', u'customer', u'service', u'within', u'their', u'respective', u'platforms.luvo', u'would', u'be', u'available', u'through', u'the', u'web', u'and', u'through', u'smartphones.', u'it', u'would', u'also', u'use', u'machine', u'learning', u'to', u'learn', u'from', u'its', u'mistakes,', u'which', u'should', u'ultimately', u'help', u'with', u'its', u'response', u'accuracy.down', u'the', u'road,', u'luvo', u'would', u'become', u'a', u'supplement', u'to', u'the', u'human', u'staff.', u'it', u'can', u'currently', u'answer', u'20', u'set', u'questions', u'but', u'as', u'that', u'number', u'grows,', u'it', u'would', u'allow', u'the', u'human', u'employees', u'to', u'more', u'complicated', u'issues.', u'if', u'a', u'problem', u'is', u'beyond', u"luvo's", u'comprehension,', u'then', u'it', u'would', u'refer', u'the', u'customer', u'to', u'a', u'bank', u'employee;', u'however,\xa0a', u'user', u'could', u'choose', u'to', u'speak', u'with', u'a', u'human', u'instead', u'of', u'luvo', u'anyway.ai', u'such', u'as', u'luvo,', u'if', u'successful,', u'could', u'help', 
u'businesses', u'become', u'more', u'efficient', u'and', u'increase', u'their', u'productivity,', u'while', u'simultaneously', u'improving', u'customer', u'service', u'capacity,', u'which', u'would', u'consequently\xa0save', u'money', u'that', u'would', u'otherwise', u'go', u'toward', u'manpower.and', u'this', u'trend', u'is', u'already', u'starting.', u'google,', u'microsoft,', u'and', u'ibm', u'are', u'investing', u'significantly', u'into', u'ai', u'research.', u'furthermore,', u'the', u'global', u'ai', u'market', u'is', u'estimated', u'to', u'grow', u'from', u'approximately', u'$420', u'million', u'in', u'2014', u'to', u'$5.05', u'billion', u'in', u'2020,', u'according', u'to', u'a', u'forecast', u'by', u'research', u'and', u'markets.\xa0the', u'move', u'toward', u'ai', u'would', u'be', u'just', u'one', u'more', u'way', u'in', u'which', u'the', u'digital', u'age', u'is', u'disrupting', u'retail', u'banking.', u'customers,', u'particularly', u'millennials,', u'are', u'increasingly', u'moving', u'toward', u'digital', u'banking,', u'and', u'as', u'a', u'result,', u"they're", u'walking', u'into', u'their', u"banks'", u'traditional', u'brick-and-mortar', u'branches', u'less', u'often', u'than', u'ever', u'before.'], model=DenseVector([-0.036, 0.0759, 0.196, 0.0379, 0.0331, 0.069, -0.1531, 0.0588, -0.1662, -0.0624, -0.0924, -0.0304, 0.0155, -0.0245, -0.0507, 0.0809, 0.0199, -0.0364, 0.0703, 0.0469, 0.0768, -0.0214, 0.0404, 0.0522, -0.0506, 0.0095, 0.1129, 0.0515, -0.0867, 0.0224, -0.0499, 0.0848, 0.1583, -0.0882, -0.0262, -0.0083, -0.0019, -0.0172, 0.0554, 0.0478, -0.0328, 0.1219, 0.0153, -0.1409, -0.0262, 0.0829, -0.1318, -0.0952, -0.1854, 0.0837, 0.0084, -0.0004, 0.0172, 0.0073, 0.1217, 0.0137, -0.0735, -0.0481, -0.0223, -0.0708, -0.0617, -0.0049, -0.0069, -0.0211, 0.0615, -0.0919, 0.0509, 0.0871, -0.0278, -0.0295, -0.2326, -0.0931, -0.1146, 0.0371, -0.0024, 0.0294, -0.0177, 0.0384, 0.019, 0.0767, -0.0922, -0.0418, 0.0005, 0.0221, -0.0624, 0.0149, -0.0496, -0.0434, 
0.1202, -0.0305, 0.1478, -0.0385, -0.0342, 0.0798, 0.0302, -0.013, 0.0923, -0.0287, -0.0976, -0.0634]))]
However, I am not able to understand this output. Each row of my token column contains multiple words, and word2vec represents each word as a 100-dimensional vector. So I would expect multiple 100-dimensional vectors per row, one for each word in the token list. Instead I get only one 100-dimensional vector, which would seem to correspond to a single word. What about the other words in the token list of each row?
I am sure I am missing something here, but the Spark documentation is so badly written that none of the method docs helps me understand this.
From the docstring:
Transform a sentence column to a vector column to represent the whole sentence. The transform is performed by averaging all word vectors it contains.
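To make the averaging concrete, here is a minimal sketch in plain Python, with no Spark involved. The function name `average_word_vectors`, the 3-dimensional toy vectors, and the handling of unknown words are all assumptions made for illustration; in particular, dividing by the full token count (rather than by the number of known words) is my reading of the behaviour, not something the docstring states.

```python
def average_word_vectors(tokens, word_vectors, size):
    """Average the vectors of all tokens into one document vector."""
    if not tokens:
        return [0.0] * size
    total = [0.0] * size
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue  # out-of-vocabulary word adds nothing to the sum
        for i in range(size):
            total[i] += vec[i]
    # Assumption: divide by the full token count, not only the known words.
    return [x / len(tokens) for x in total]


# Hypothetical 3-dimensional word vectors (the real ones have 100 dimensions).
toy_vectors = {"bank": [1.0, 0.0, 2.0], "teller": [3.0, 2.0, 0.0]}
doc_vector = average_word_vectors(["bank", "teller"], toy_vectors, 3)
print(doc_vector)  # [2.0, 1.0, 1.0]
```

So however many words a row contains, the output is always one vector of length `vectorSize`, which is why the row above has a single 100-dimensional `DenseVector`.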
Regarding this statement:
So ideally I would have multiple such 100 dim vectors
You have to remember that this transforms a document, so the output should be a single vector. There are much more sophisticated techniques for combining word embeddings, and if you want to use these you can easily extract the mappings from the mllib model (using Word2VecModel.transform() does not work inside a map function).
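If you want the individual word vectors, one possible approach is sketched below. It assumes the vocabulary is small enough to collect to the driver. With the `pyspark.ml` model from the question, `w2vmodel.getVectors()` returns a data frame with `word` and `vector` columns; the helpers `build_lookup` and `per_word_vectors` are hypothetical names, and the rows here are stand-in data so the snippet runs without Spark.

```python
def build_lookup(vector_rows):
    """Turn rows with 'word' and 'vector' fields into a plain dict."""
    return {row["word"]: row["vector"] for row in vector_rows}


def per_word_vectors(tokens, lookup):
    """Return (word, vector) pairs for the tokens found in the vocabulary."""
    return [(t, lookup[t]) for t in tokens if t in lookup]


# With Spark this would be: vector_rows = w2vmodel.getVectors().collect()
# Stand-in rows with made-up 2-dimensional vectors, for illustration only.
vector_rows = [
    {"word": "bank", "vector": [0.1, 0.2]},
    {"word": "of", "vector": [0.3, 0.4]},
]
lookup = build_lookup(vector_rows)
pairs = per_word_vectors(["royal", "bank", "of", "scotland"], lookup)
print(pairs)  # [('bank', [0.1, 0.2]), ('of', [0.3, 0.4])]
```

Words missing from the vocabulary (for example those below the model's `minCount`) are simply skipped here. Once you have the per-word vectors you can combine them however you like instead of taking the plain average.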