I am trying to follow this project: https://github.com/caroljmcdonald/spark-stock-sql/blob/master/src/main/scala/example/Stock.scala
In my IDE I get the error "Cannot resolve overloaded method 'corr'" in the part where the code computes the correlation between two columns read from a Parquet file:
val df = sqlContext.read.parquet("joinstock.parquet")
df.show
df.printSchema
df.explain()
// COMMAND ----------
//var agg_df = df.groupBy("location").agg(min("id"), count("id"), avg("date_diff"))
df.select(year($"dt").alias("yr"), month($"dt").alias("mo"), $"apcclose", $"xomclose", $"spyclose").groupBy("yr", "mo").agg(avg("apcclose"), avg("xomclose"), avg("spyclose")).orderBy(desc("yr"), desc("mo")).show
// COMMAND ----------
df.select(year($"dt").alias("yr"), month($"dt").alias("mo"), $"apcclose", $"xomclose", $"spyclose").groupBy("yr", "mo").agg(avg("apcclose"), avg("xomclose"), avg("spyclose")).orderBy(desc("yr"), desc("mo")).explain
The following lines are the ones IntelliJ flags with "Cannot resolve overloaded method 'corr'":
// COMMAND ----------
var seriesX = df.select($"xomclose").map { row: Row => row.getAs[Double]("xomclose") } //.rdd
var seriesY = df.select($"spyclose").map { row: Row => row.getAs[Double]("spyclose") } //.rdd
var correlation = Statistics.corr(seriesX, seriesY, "pearson")
// COMMAND ----------
seriesX = df.select($"apcclose").map { row: Row => row.getAs[Double]("apcclose") } //.rdd
seriesY = df.select($"xomclose").map { row: Row => row.getAs[Double]("xomclose") } //.rdd
correlation = Statistics.corr(seriesX, seriesY, "pearson")
You can use the DataFrame's built-in correlation method instead:
val correlation = df.stat.corr("xomclose", "spyclose", "pearson")
Note that df.stat.corr currently supports only "pearson" as the method.
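As for why the original code fails to compile: Statistics.corr (from org.apache.spark.mllib.stat) is an RDD API whose overloads expect RDD[Double], while df.select(...).map(...) returns a Dataset[Double], so no overload matches. Uncommenting the .rdd calls that the original code left as comments resolves it. A sketch of that fix, assuming Spark 2.x with spark.implicits._ in scope for the Double encoder:

```scala
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.Row

// Statistics.corr expects RDD[Double], so convert each Dataset to an RDD
val seriesX = df.select($"xomclose")
  .map { row: Row => row.getAs[Double]("xomclose") }
  .rdd // Dataset[Double] -> RDD[Double]
val seriesY = df.select($"spyclose")
  .map { row: Row => row.getAs[Double]("spyclose") }
  .rdd
val correlation = Statistics.corr(seriesX, seriesY, "pearson")
```

Both approaches compute the same Pearson coefficient; df.stat.corr is simpler when you only need a single pairwise correlation, while Statistics.corr is useful if you are already working with RDDs.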