Say that I have a correlation coefficient, C, which is user-defined, so that I have created the function C(x,y) to compute this correlation between two variables. This function gives three outputs (it's a vector) A, B, and C, where C is the correlation. Say that I have a dataset with many variables (x,y,z,w, etc.) and that I want to get the correlation matrix with all the pairwise C coefficients. How can implement that in R? I would like the output to be like the one given by the cor() function already available in R. I have tried a lot of things (loops, combn, apply, but nothing seems to work fine).
Suppose you have a user-defined function that calculates the correlation between two vectors as a single scalar, perhaps something like this:
C <- function(x, y) {
n <- length(x)
(n * sum(x * y) - sum(x) * sum(y)) /
sqrt(abs(n * sum(x^2) - sum(x)^2) * abs(n * sum(y^2) - sum(y)^2))
}
In fact, if we test this on two random vectors we will see we get the same output as we do from cor
:
set.seed(1)
x <- 1:5/10 + rnorm(5)
y <- 2:6/10 + rnorm(5)
C(x, y)
#> [1] 0.410903
cor(x, y)
#> [1] 0.410903
However, an important difference is that we can give cor
a data frame and it will calculate all pairwise correlations between the columns, outputting a matrix, as we can see if we pass it the first 4 columns of the built-in iris
data set:
cor(iris[1:4])
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
#> Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
#> Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
#> Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
But clearly this won't work with our hand-rolled function C
:
C(iris[1:4])
#> Error in C(iris[1:4]): argument "y" is missing, with no default
The most straightforward way to achieve the same kind of output with our own hand-rolled function is to build a matrix, and iterate over both dimensions of it, populating each entry [i, j] with the correlation of the ith and jth column of the data frame using a double loop:
C_dataframe <- function(x) {
m <- matrix(0, ncol = ncol(x), nrow = ncol(x),
dimnames = list(names(x), names(x)))
for(i in seq(nrow(m))) {
for(j in seq(ncol(m))) {
m[i, j] <- C(x[[i]], x[[j]])
}
}
m
}
Now testing this, using only our own user-defined function, we get exactly the same output from C_dataframe
as we would get from cor
:
C_dataframe(iris[1:4])
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
#> Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
#> Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
#> Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000