Tags: python, apache-spark, pyspark, apache-spark-sql, apache-spark-mllib

PySpark computing correlation


I want to use the pyspark.mllib.stat.Statistics.corr function to compute the correlation between two columns of a pyspark.sql.dataframe.DataFrame object. The corr function expects an RDD of Vectors objects. How do I translate a column such as df['some_name'] into an RDD of Vectors.dense objects?


Solution

  • There should be no need for that. For numerical columns you can compute the correlation directly using DataFrameStatFunctions.corr:

    df1 = sc.parallelize([(0.0, 1.0), (1.0, 0.0)]).toDF(["x", "y"])
    df1.stat.corr("x", "y")
    # -1.0
    

    otherwise you can use VectorAssembler:

    from pyspark.ml.feature import VectorAssembler
    
    assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
    # DataFrame has no flatMap in Spark 2.x+; go through the underlying RDD
    assembler.transform(df).select("features").rdd.flatMap(lambda x: x)