To practice Spark and classical statistical analysis, I'm trying to run some examples taken from books (general statistics books, not dedicated to computing or Spark).
One example in the book calculates the Spearman correlation coefficient between two judges who each score ten sportsmen:
| Judge 1 | 8.3 | 7.6 | 9.1 | 9.5 | 8.4 | 6.9 | 9.2 | 7.8 | 8.6 | 8.2
| Judge 2 | 7.9 | 7.4 | 9.1 | 9.3 | 8.4 | 7.5 | 9.0 | 7.2 | 8.2 | 8.1
Creating the intermediate matrix of ranks,
| Judge 1 | 5 | 2 | 8 | 10 | 6 | 1 | 9 | 3 | 7 | 4
| Judge 2 | 4 | 2 | 9 | 10 | 7 | 3 | 8 | 1 | 6 | 5
the book arrives at the following result:
r = 0.915
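For reference, the book's value can be reproduced by hand from the ranks above. The sketch below is plain Java with no Spark involved, and it assumes there are no tied scores, so the shortcut formula r = 1 - 6 * sum(d^2) / (n * (n^2 - 1)) applies. The class and method names are made up for illustration.

```java
import java.util.Arrays;

class SpearmanByHand {

    // Rank each value from 1 (smallest) to n (largest); assumes no tied values.
    static int[] ranks(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int[] ranks = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            ranks[i] = Arrays.binarySearch(sorted, values[i]) + 1;
        }
        return ranks;
    }

    // Spearman coefficient without ties: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    static double spearman(double[] x, double[] y) {
        int[] rx = ranks(x);
        int[] ry = ranks(y);
        int n = x.length;
        double sumSquaredDiff = 0.0;
        for (int i = 0; i < n; i++) {
            int d = rx[i] - ry[i];
            sumSquaredDiff += d * d;
        }
        return 1.0 - 6.0 * sumSquaredDiff / (n * ((double) n * n - 1.0));
    }

    public static void main(String[] args) {
        double[] judge1 = {8.3, 7.6, 9.1, 9.5, 8.4, 6.9, 9.2, 7.8, 8.6, 8.2};
        double[] judge2 = {7.9, 7.4, 9.1, 9.3, 8.4, 7.5, 9.0, 7.2, 8.2, 8.1};
        System.out.println(Arrays.toString(ranks(judge1)));
        System.out.println(Arrays.toString(ranks(judge2)));
        System.out.println(spearman(judge1, judge2));
    }
}
```

With these scores the squared rank differences sum to 14, so r = 1 - 84/990, which is about 0.915.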
I tried to implement it with Spark as follows, based on the API documentation of Correlation:
List<Row> data = Arrays.asList(
    RowFactory.create(Vectors.dense(8.3, 7.6, 9.1, 9.5, 8.4, 6.9, 9.2, 7.8, 8.6, 8.2)),
    RowFactory.create(Vectors.dense(7.9, 7.4, 9.1, 9.3, 8.4, 7.5, 9.0, 7.2, 8.2, 8.1))
);
StructType schema = new StructType(new StructField[]{
    new StructField("features", new VectorUDT(), false, Metadata.empty()),
});
Dataset<Row> df = this.session.createDataFrame(data, schema);
Row r2 = Correlation.corr(df, "features", "spearman").head();
System.out.println("Spearman correlation matrix:\n" + r2.get(0).toString());
But it doesn't return a single coefficient. Instead, I get a matrix that looks odd to me:
Spearman correlation matrix:
1.0 0.9999999999999998 NaN ... (10 total)
0.9999999999999998 1.0 NaN ...
NaN NaN 1.0 ...
0.9999999999999998 0.9999999999999998 NaN ...
NaN NaN NaN ...
-0.9999999999999998 -0.9999999999999998 NaN ...
0.9999999999999998 0.9999999999999998 NaN ...
0.9999999999999998 0.9999999999999998 NaN ...
0.9999999999999998 0.9999999999999998 NaN ...
0.9999999999999998 0.9999999999999998 NaN ...
I am new to MLlib and not very strong in statistics, so I'm clearly doing something wrong.
What am I seeing here instead of what I expected, and how can I get the result I want?
Part of the solution is embarrassing...
I had simply put the vectors the wrong way round: each row must be one observation (one sportsman) and each vector component one variable (one judge). This corrects it:
// One row per sportsman: (score from judge 1, score from judge 2)
List<Row> data = Arrays.asList(
    RowFactory.create(Vectors.dense(8.3, 7.9)),
    RowFactory.create(Vectors.dense(7.6, 7.4)),
    RowFactory.create(Vectors.dense(9.1, 9.1)),
    RowFactory.create(Vectors.dense(9.5, 9.3)),
    RowFactory.create(Vectors.dense(8.4, 8.4)),
    RowFactory.create(Vectors.dense(6.9, 7.5)),
    RowFactory.create(Vectors.dense(9.2, 9.0)),
    RowFactory.create(Vectors.dense(7.8, 7.2)),
    RowFactory.create(Vectors.dense(8.6, 8.2)),
    RowFactory.create(Vectors.dense(8.2, 8.1))
);
Correlation between the two judges' scores for the sportsmen:
1.0 0.9151515151515153
0.9151515151515153 1.0
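The off-diagonal entry of this 2 x 2 matrix is the coefficient the book gives. As for the odd 10 x 10 matrix from the first attempt, my understanding is that Spark treated each of the ten vector components as a variable observed only twice (once per row). With two observations, the rank correlation of two columns can only be +1 (both move in the same direction), -1 (opposite directions) or NaN (one column is constant, as for 9.1/9.1 and 8.4/8.4). The sketch below is plain Java, not Spark's implementation, and its names are hypothetical; it only reproduces that pattern, returning exact values where Spark prints 0.9999999999999998 because of floating-point rounding.

```java
class TwoObservationCorrelation {

    // Direction of change of one column between its two observations: -1, 0 or +1.
    static int direction(double first, double second) {
        return (int) Math.signum(second - first);
    }

    // Rank correlation of columns i and j when each column has only two observations.
    static double corr(double[] row0, double[] row1, int i, int j) {
        if (i == j) {
            return 1.0; // the diagonal is 1.0 in the matrix shown above, even for constant columns
        }
        int di = direction(row0[i], row1[i]);
        int dj = direction(row0[j], row1[j]);
        if (di == 0 || dj == 0) {
            return Double.NaN; // a constant column has zero variance
        }
        return di * dj;
    }

    public static void main(String[] args) {
        // The two rows exactly as they were passed to Spark in the first attempt.
        double[] row0 = {8.3, 7.6, 9.1, 9.5, 8.4, 6.9, 9.2, 7.8, 8.6, 8.2};
        double[] row1 = {7.9, 7.4, 9.1, 9.3, 8.4, 7.5, 9.0, 7.2, 8.2, 8.1};
        // Print the first column of the 10 x 10 matrix.
        for (int i = 0; i < row0.length; i++) {
            System.out.println(corr(row0, row1, i, 0));
        }
    }
}
```

The sixth column (6.9 then 7.5) is the only one that increases, which is why its row is the only negative one.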