I'm looking for the most straightforward and idiomatic way to convert a data-frame column into a RDD. Say the columns views
contains floats. The following is not what I am looking for
views = df_filtered.select("views").rdd
for I end up with a RDD[Row]
instead of a RDD[Float]
and I thus can't feed it to any stat methods from mllib.stat (if I properly understand what's going on):
corr = Statistics.corr(views, likes, method="pearson")
TypeError: float() argument must be a string or a number
In pandas, I would go for .values()
to convert this pandas Series into the array of its values but RDD .values()
method does not seem to work this way. I finally came to the following solution
views = df_filtered.select("views").rdd.map(lambda r: r["views"])
but I wonderer whether there are more direct solutions
you need to use flatMap for this.
>>> newdf=df.select("emp_salary")
>>> newdf.show();
+----------+
|emp_salary|
+----------+
| 50000|
| 10000|
| 810000|
| 5500|
| 5500|
+----------+
>>> rdd=newdf.rdd.flatMap(lambda x:x)
>>> rdd.take(10);
[50000, 10000, 810000, 5500, 5500]
were your looking something like this?
yes than convert your statement as:
views = df_filtered.select("views").rdd.flatMap(lambda x:x)