dataframe, apache-spark, pyspark, rdd

PySpark column to RDD of its values


I'm looking for the most straightforward and idiomatic way to convert a DataFrame column into an RDD. Say the column views contains floats. The following is not what I'm looking for:

views = df_filtered.select("views").rdd

since it leaves me with an RDD[Row] instead of an RDD[Float], and thus I can't feed it to any of the stat methods from mllib.stat (if I properly understand what's going on):

corr = Statistics.corr(views, likes, method="pearson")
TypeError: float() argument must be a string or a number
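
To illustrate, here is what that RDD actually holds, with a toy DataFrame standing in for my df_filtered (a minimal sketch; the numbers are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    # toy stand-in for df_filtered; the values are invented for illustration
    df_filtered = spark.createDataFrame(
        [(1.0, 3.0), (2.0, 5.0), (4.0, 9.0)], ["views", "likes"]
    )

    views = df_filtered.select("views").rdd
    print(views.take(3))  # [Row(views=1.0), Row(views=2.0), Row(views=4.0)]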

In pandas, I would use .values to turn a Series into the array of its values, but the RDD .values() method does not work this way (it is meant for key-value pair RDDs). I finally came to the following solution:

views = df_filtered.select("views").rdd.map(lambda r: r["views"])

but I wonder whether there is a more direct solution.


Solution

  • You need to use flatMap for this.

    >>> newdf = df.select("emp_salary")
    >>> newdf.show()
    +----------+
    |emp_salary|
    +----------+
    |     50000|
    |     10000|
    |    810000|
    |      5500|
    |      5500|
    +----------+
    
    >>> rdd = newdf.rdd.flatMap(lambda x: x)
    >>> rdd.take(10)
    [50000, 10000, 810000, 5500, 5500]
    

    Were you looking for something like this?

    If yes, then convert your statement to:

    views = df_filtered.select("views").rdd.flatMap(lambda x: x)
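
    This works because each Row is iterable, so flatMap flattens every one-field Row into its bare value. Putting it together with the correlation call from the question, here is a minimal sketch with made-up data (the column names views and likes come from the question):

    from pyspark.sql import SparkSession
    from pyspark.mllib.stat import Statistics

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    # toy data standing in for df_filtered
    df_filtered = spark.createDataFrame(
        [(1.0, 3.0), (2.0, 5.0), (4.0, 9.0)], ["views", "likes"]
    )

    # flatMap turns each one-field Row into its plain float value
    views = df_filtered.select("views").rdd.flatMap(lambda x: x)
    likes = df_filtered.select("likes").rdd.flatMap(lambda x: x)

    corr = Statistics.corr(views, likes, method="pearson")
    print(corr)  # a single float: the Pearson correlation of the two columns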