I would like to determine many correlations (millions) between pairs of columns, so I am worried about computing time.
I suspect that Pearson correlations (based on values) are faster to calculate in R than Spearman correlations (based on ranks). Is that correct?
How can I find out, please? Thank you.
You can use the rbenchmark package for this.
library(rbenchmark)
1,000 rows, 100 repetitions
x1 <- rnorm(1000)
y1 <- rnorm(1000)
benchmark(
  spearman = {
    cor(x1, y1, method = "spearman")
  },
  pearson = {
    cor(x1, y1, method = "pearson")
  },
  replications = 100
)
#> test replications elapsed relative user.self sys.self user.child
#> 2 pearson 100 0.002 1 0.002 0 0
#> 1 spearman 100 0.014 7 0.013 0 0
#> sys.child
#> 2 0
#> 1 0
1,000,000 rows, 100 repetitions
x2 <- rnorm(1000000)
y2 <- rnorm(1000000)
benchmark(
  spearman = {
    cor(x2, y2, method = "spearman")
  },
  pearson = {
    cor(x2, y2, method = "pearson")
  },
  replications = 100
)
#> test replications elapsed relative user.self sys.self user.child
#> 2 pearson 100 0.717 1.000 0.717 0.001 0
#> 1 spearman 100 37.336 52.073 36.797 0.537 0
#> sys.child
#> 2 0
#> 1 0
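This confirms your assumption: Pearson is considerably faster than Spearman, because Spearman has to rank (sort) the values before correlating them. The gap also widens as the number of rows grows: Spearman took about 7 times as long as Pearson at 1,000 rows and about 52 times as long at 1,000,000 rows in the runs above.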
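If you do need Spearman for millions of column pairs, one way to reduce the cost is to rank each column once and then run Pearson on the ranks, since Spearman's coefficient is the Pearson correlation of the ranks. The sketch below is only an illustration under assumptions: the matrix `m` and its size (1,000 rows, 50 columns) are made up for the example, and your real data layout may differ.

```r
set.seed(1)

# Hypothetical example data (an assumption): 1,000 rows, 50 columns
m <- matrix(rnorm(1000 * 50), nrow = 1000)

# Rank every column once up front (average ranks for ties, the default)
r <- apply(m, 2, rank)

# Pearson on the ranks gives the Spearman correlation matrix
spearman_via_ranks <- cor(r)

# Reference: let cor() compute Spearman directly on the matrix
spearman_direct <- cor(m, method = "spearman")

# Both approaches should agree up to floating-point tolerance
same <- isTRUE(all.equal(spearman_via_ranks, spearman_direct))
print(same)
```

Whichever method you use, passing a whole matrix to `cor()` should generally be faster than calling `cor(x, y)` in a loop over pairs, because a pairwise loop with `method = "spearman"` re-ranks the same columns on every call.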