
How can I calculate a Spearman correlation coefficient with Spark? I am unable to reproduce an example from a statistics book


To practice Spark and classical statistical analysis, I'm trying to reproduce examples from general statistics books (books that are not dedicated to computing or Spark).

The example in the book computes the Spearman correlation coefficient between the scores that two judges gave to ten athletes:

| Judge 1 | 8.3 | 7.6 | 9.1 | 9.5 | 8.4 | 6.9 | 9.2 | 7.8 | 8.6 | 8.2
| Judge 2 | 7.9 | 7.4 | 9.1 | 9.3 | 8.4 | 7.5 | 9.0 | 7.2 | 8.2 | 8.1

The book then builds the intermediate table of ranks:

| Judge 1 | 5 | 2 | 8 | 10 | 6 | 1 | 9 | 3 | 7 | 4
| Judge 2 | 4 | 2 | 9 | 10 | 7 | 3 | 8 | 1 | 6 | 5

and arrives at the result:

r = 0.915
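
As a sanity check on the book's figure, here is a minimal plain-Java sketch of the classic rank-difference formula, r = 1 - 6 Σd² / (n(n² - 1)), applied to the rank table above. It assumes there are no tied ranks (true for this data). The class and method names are illustrative only; they come neither from the book nor from Spark.

```java
import java.util.Locale;

class SpearmanFromRanks {

    // Spearman's coefficient from two rank arrays, assuming no ties:
    // r = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the rank difference.
    static double rho(int[] ranksA, int[] ranksB) {
        if (ranksA.length != ranksB.length) {
            throw new IllegalArgumentException("Rank arrays must have the same length");
        }
        int n = ranksA.length;
        long sumSquaredDiff = 0;
        for (int i = 0; i < n; i++) {
            long d = ranksA[i] - ranksB[i];
            sumSquaredDiff += d * d;
        }
        return 1.0 - (6.0 * sumSquaredDiff) / ((double) n * ((double) n * n - 1));
    }

    public static void main(String[] args) {
        // Ranks copied from the book's intermediate table.
        int[] judge1 = {5, 2, 8, 10, 6, 1, 9, 3, 7, 4};
        int[] judge2 = {4, 2, 9, 10, 7, 3, 8, 1, 6, 5};
        System.out.println(String.format(Locale.ROOT, "r = %.3f", rho(judge1, judge2)));
    }
}
```

For these ranks the squared differences sum to 14, so r = 1 - 84/990, which rounds to the book's 0.915.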

I tried to implement it with Spark as follows, based on the API documentation of Correlation:

List<Row> data = Arrays.asList(
   RowFactory.create(Vectors.dense(8.3, 7.6, 9.1, 9.5, 8.4, 6.9, 9.2, 7.8, 8.6, 8.2)),
   RowFactory.create(Vectors.dense(7.9, 7.4, 9.1, 9.3, 8.4, 7.5, 9.0, 7.2, 8.2, 8.1))
);

StructType schema = new StructType(new StructField[]{
   new StructField("features", new VectorUDT(), false, Metadata.empty()),
});

Dataset<Row> df = this.session.createDataFrame(data, schema);

Row r2 = Correlation.corr(df, "features", "spearman").head();
System.out.println("Spearman correlation matrix:\n" + r2.get(0).toString());

But it doesn't return a single coefficient. Instead, I get a matrix that looks odd to me:

Spearman correlation matrix:
1.0                  0.9999999999999998   NaN  ... (10 total)
0.9999999999999998   1.0                  NaN  ...
NaN                  NaN                  1.0  ...
0.9999999999999998   0.9999999999999998   NaN  ...
NaN                  NaN                  NaN  ...
-0.9999999999999998  -0.9999999999999998  NaN  ...
0.9999999999999998   0.9999999999999998   NaN  ...
0.9999999999999998   0.9999999999999998   NaN  ...
0.9999999999999998   0.9999999999999998   NaN  ...
0.9999999999999998   0.9999999999999998   NaN  ...

I am new to MLlib and not very strong in statistics. I'm clearly doing something wrong.

What am I seeing here instead of what I expected,
and how can I get the result I want?


Solution

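  • Part of the solution is embarrassing...
    I had simply put the vectors in the wrong orientation. Correlation.corr treats each Row as one observation and each vector element as one variable, so my two rows of ten values produced a 10 × 10 matrix computed from only two observations: every entry is approximately ±1, or NaN where a column holds a tied pair (9.1/9.1 and 8.4/8.4). The data has to be given as ten rows (one per athlete) of two values (one per judge):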

    // One row per athlete: (Judge 1 score, Judge 2 score)
    List<Row> data = Arrays.asList(
       RowFactory.create(Vectors.dense(8.3, 7.9)),
       RowFactory.create(Vectors.dense(7.6, 7.4)),
       RowFactory.create(Vectors.dense(9.1, 9.1)),
       RowFactory.create(Vectors.dense(9.5, 9.3)),
       RowFactory.create(Vectors.dense(8.4, 8.4)),
       RowFactory.create(Vectors.dense(6.9, 7.5)),
       RowFactory.create(Vectors.dense(9.2, 9.0)),
       RowFactory.create(Vectors.dense(7.8, 7.2)),
       RowFactory.create(Vectors.dense(8.6, 8.2)),
       RowFactory.create(Vectors.dense(8.2, 8.1))
    );
    

    Correlation between the two judges' scores for the athletes:
    1.0                 0.9151515151515153
    0.9151515151515153  1.0
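
To cross-check the off-diagonal value without Spark, here is a minimal plain-Java sketch that starts from the raw scores. It relies on the definition of Spearman's coefficient as the Pearson correlation of the ranks, with tied values sharing their average rank. The class and method names are hypothetical, and this is not a description of how Spark computes the matrix internally.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Locale;

class SpearmanFromScores {

    // 1-based average ranks: tied values share the mean of their sorted positions.
    static double[] ranks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> values[i]));
        double[] result = new double[n];
        int start = 0;
        while (start < n) {
            int end = start;
            while (end + 1 < n && values[order[end + 1]] == values[order[start]]) {
                end++;
            }
            double averageRank = (start + end) / 2.0 + 1.0;
            for (int k = start; k <= end; k++) {
                result[order[k]] = averageRank;
            }
            start = end + 1;
        }
        return result;
    }

    // Pearson correlation; yields NaN when either input has zero variance.
    static double pearson(double[] x, double[] y) {
        double meanX = Arrays.stream(x).average().orElse(Double.NaN);
        double meanY = Arrays.stream(y).average().orElse(Double.NaN);
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    // Spearman = Pearson applied to the ranks of each variable.
    static double spearman(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Both variables need the same number of observations");
        }
        return pearson(ranks(a), ranks(b));
    }

    public static void main(String[] args) {
        double[] judge1 = {8.3, 7.6, 9.1, 9.5, 8.4, 6.9, 9.2, 7.8, 8.6, 8.2};
        double[] judge2 = {7.9, 7.4, 9.1, 9.3, 8.4, 7.5, 9.0, 7.2, 8.2, 8.1};
        System.out.println(String.format(Locale.ROOT, "%.4f", spearman(judge1, judge2)));
    }
}
```

The same sketch also reproduces the pattern of the odd matrix in the question. With only two observations per variable, `spearman(new double[]{8.3, 7.9}, new double[]{7.6, 7.4})` is 1, `spearman(new double[]{8.3, 7.9}, new double[]{6.9, 7.5})` is -1, and any pair involving a tied column such as `{9.1, 9.1}` is NaN because that column's ranks have zero variance.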