Tags: python, metrics, ranking, causality

Metric to see how close two Rankings are


I am trying to confirm a survey's benchmark of causal discovery methods, so I am running the same methods on the same datasets and evaluating them with the same metrics.

To compare them, I'd like a metric that takes both rankings as input (the order of the methods under a given score, such as True Positive Rate or Structural Hamming Distance) and outputs a number quantifying how close they are.

An example of two tables would be something like this:

Table from the paper:

|        | shd   | tpr  | fdr  |
|--------|-------|------|------|
| LiNGAM | 35.00 | 0.37 | 0.32 |
| GES    | 44.00 | 0.70 | 0.55 |
| PC     | 64.00 | 0.80 | 0.63 |

My own table:

|        | shd   | tpr  | fdr  |
|--------|-------|------|------|
| LiNGAM | 28.00 | 0.00 | 1.00 |
| GES    | 13.00 | 0.65 | 0.42 |
| PC     | 16.00 | 0.65 | 0.56 |

I looked around for ways to compare rankings, but I didn't find anything of substance in Python.
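To make the input and output concrete, here is a hypothetical sketch of the kind of function I have in mind (a normalized Kendall tau distance; the name `ranking_distance` and the interface are just illustrative, not an existing library):

```python
from itertools import combinations

def ranking_distance(rank_a, rank_b):
    """Normalized Kendall tau distance between two rankings.

    rank_a, rank_b: lists of the same method names, best first.
    Returns 0.0 for identical rankings, 1.0 for completely reversed ones.
    """
    position_in_b = {method: i for i, method in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    # A pair is discordant if the two methods appear in the opposite order in rank_b.
    discordant = sum(1 for x, y in pairs if position_in_b[x] > position_in_b[y])
    return discordant / len(pairs)

# Rankings by TPR (higher is better) from the two tables above
# (GES and PC tie at 0.65 in my table; the tie is broken arbitrarily here):
print(ranking_distance(["PC", "GES", "LiNGAM"],    # paper
                       ["GES", "PC", "LiNGAM"]))   # mine -> 0.33, one of three pairs flipped
```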


Solution

  • Evaluate repeatedly (and use rank correlation if you compare a large number of algorithms)

    Spearman's rank correlation coefficient is the standard choice for comparing rankings, but judging from your tables that may not be the main issue here. I'm guessing you're using synthetic data, so I would strongly recommend simulating multiple times and repeating the experiments, so that you can report the mean and standard deviation of the results. I don't know which paper your reference numbers come from, but averaging over multiple simulations is common practice, so there is a good chance your results will end up closer to the paper's once you do the same (right now they are very far apart). Once you have averaged results over a number of repetitions you can apply rank correlation, although with only three algorithms there is little need for it; a sketch of how that could look follows.
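
    A minimal sketch of that comparison, assuming the score values from the question and assuming that lower is better for SHD and FDR while higher is better for TPR, using pandas and scipy.stats (spearmanr, with kendalltau as an alternative):

    ```python
    import pandas as pd
    from scipy.stats import spearmanr, kendalltau

    methods = ["LiNGAM", "GES", "PC"]

    # Scores from the paper and from the local reproduction (values taken from the question).
    paper = pd.DataFrame({"shd": [35.0, 44.0, 64.0],
                          "tpr": [0.37, 0.70, 0.80],
                          "fdr": [0.32, 0.55, 0.63]}, index=methods)
    mine = pd.DataFrame({"shd": [28.0, 13.0, 16.0],
                         "tpr": [0.00, 0.65, 0.65],
                         "fdr": [1.00, 0.42, 0.56]}, index=methods)

    # Assumption: lower is better for SHD and FDR, higher is better for TPR.
    higher_is_better = {"shd": False, "tpr": True, "fdr": False}

    for metric in ["shd", "tpr", "fdr"]:
        # rank() gives rank 1 to the best method; ties receive averaged ranks.
        r_paper = paper[metric].rank(ascending=not higher_is_better[metric])
        r_mine = mine[metric].rank(ascending=not higher_is_better[metric])
        rho, _ = spearmanr(r_paper, r_mine)
        tau, _ = kendalltau(r_paper, r_mine)
        print(f"{metric}: Spearman rho={rho:.2f}, Kendall tau={tau:.2f}")
    ```

    With only three methods the coefficients can take only a few distinct values, which is why rank correlation becomes more informative once you compare a larger set of algorithms (or average over many repetitions first).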