scala · apache-spark · apache-spark-sql · apache-spark-mllib

Statistics.corr gives following error in IntelliJ IDEA: Cannot resolve overloaded method 'corr'


I am trying to follow this project: https://github.com/caroljmcdonald/spark-stock-sql/blob/master/src/main/scala/example/Stock.scala

In my IDE, the error "Cannot resolve overloaded method 'corr'" appears in the part of the code that computes the correlation between two columns read from a Parquet file:

val df = sqlContext.read.parquet("joinstock.parquet")

df.show
df.printSchema

df.explain()

// COMMAND ----------

//var agg_df = df.groupBy("location").agg(min("id"), count("id"), avg("date_diff"))
df.select(year($"dt").alias("yr"), month($"dt").alias("mo"), $"apcclose", $"xomclose", $"spyclose")
  .groupBy("yr", "mo")
  .agg(avg("apcclose"), avg("xomclose"), avg("spyclose"))
  .orderBy(desc("yr"), desc("mo"))
  .show

// COMMAND ----------

df.select(year($"dt").alias("yr"), month($"dt").alias("mo"), $"apcclose", $"xomclose", $"spyclose")
  .groupBy("yr", "mo")
  .agg(avg("apcclose"), avg("xomclose"), avg("spyclose"))
  .orderBy(desc("yr"), desc("mo"))
  .explain

These are the lines that give me the "Cannot resolve overloaded method 'corr'" error in IntelliJ:

    // COMMAND ----------
    import org.apache.spark.mllib.stat.Statistics
    import org.apache.spark.sql.Row

    var seriesX = df.select($"xomclose").map { row: Row => row.getAs[Double]("xomclose") } //.rdd
    var seriesY = df.select($"spyclose").map { row: Row => row.getAs[Double]("spyclose") } //.rdd
    var correlation = Statistics.corr(seriesX, seriesY, "pearson")

    // COMMAND ----------

    seriesX = df.select($"apcclose").map { row: Row => row.getAs[Double]("apcclose") } //.rdd
    seriesY = df.select($"xomclose").map { row: Row => row.getAs[Double]("xomclose") } //.rdd
    correlation = Statistics.corr(seriesX, seriesY, "pearson")

Solution

The mllib Statistics.corr overloads expect RDD[Double] arguments, but df.select(...).map { ... } returns a Dataset[Double], so no overload matches; that is why IntelliJ reports "Cannot resolve overloaded method 'corr'". Uncommenting the trailing .rdd converts each series to an RDD[Double] and lets the call resolve.

  • Alternatively, you can use the correlation method of the DataFrame itself, which takes column names directly and returns a Double (note that df.stat.corr currently supports only the "pearson" method):

    var correlation = df.stat.corr("xomclose", "spyclose", "pearson")
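The RDD-based fix can be sketched as below. Since joinstock.parquet is not available here, the DataFrame is built from a small in-memory stand-in with the same two column names; the shape of the fix (appending .rdd before calling Statistics.corr) is what matters:

```scala
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.SparkSession

object CorrSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("corr-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in data; in the question these columns come from joinstock.parquet.
    val df = Seq((90.0, 200.0), (92.5, 204.0), (95.0, 208.0))
      .toDF("xomclose", "spyclose")

    // .rdd turns each Dataset[Double] into the RDD[Double] that the
    // Statistics.corr overloads expect, so the call now resolves.
    val seriesX = df.select($"xomclose").map(_.getDouble(0)).rdd
    val seriesY = df.select($"spyclose").map(_.getDouble(0)).rdd

    val correlation = Statistics.corr(seriesX, seriesY, "pearson")
    println(correlation) // 1.0 here, since the toy columns are perfectly linear

    spark.stop()
  }
}
```

With the original code, the same effect is achieved simply by uncommenting the //.rdd suffixes on the seriesX and seriesY lines.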