I have a dataframe called article
+--------------------+
| processed_title|
+--------------------+
|[new, relictual, ...|
|[once, upon,a,time..|
+--------------------+
I want to flatten it to get it as bag of words. How could I achieve this using the current situation. I have tried the code below which seems to give me a Type mismatch issue.
val bow_corpus = article.select("processed_title").rdd.flatMap(y => y)
I eventually want to use this bow_corpus to train a word2vec model.
Thanks
Assuming that processed_title
is represented in SQL as array<string>
:
article.select("processed_title").rdd.flatMap(_.getSeq[String](0))
There is also Word2Vec
transformer which can be trained directly on a DataFrame
:
import org.apache.spark.ml.feature.Word2Vec
val word2Vec = new Word2Vec()
.setInputCol("processed_title")
.setOutputCol("vectors")
.setMinCount(0)
.fit(article)
word2Vec.findSynonyms("foo", 1)
See also Spark extracting values from a Row