Search code examples
rfunctionmatrixcorrelation

Create user-defined correlation matrix in R


Say that I have a correlation coefficient, C, which is user-defined, so that I have created the function C(x,y) to compute this correlation between two variables. This function gives three outputs (it's a vector) A, B, and C, where C is the correlation. Say that I have a dataset with many variables (x,y,z,w, etc.) and that I want to get the correlation matrix with all the pairwise C coefficients. How can implement that in R? I would like the output to be like the one given by the cor() function already available in R. I have tried a lot of things (loops, combn, apply, but nothing seems to work fine).


Solution

  • Suppose you have a user-defined function that calculates the correlation between two vectors as a single scalar, perhaps something like this:

    C <- function(x, y) {
      n <- length(x)
      (n * sum(x * y) - sum(x) * sum(y)) /
        sqrt(abs(n * sum(x^2) - sum(x)^2) * abs(n * sum(y^2) - sum(y)^2))
    }
    

    In fact, if we test this on two random vectors we will see we get the same output as we do from cor:

    set.seed(1)
    x <- 1:5/10 + rnorm(5)
    y <- 2:6/10 + rnorm(5)
    
    C(x, y)
    #> [1] 0.410903
    cor(x, y)
    #> [1] 0.410903
    

    However, an important difference is that we can give cor a data frame and it will calculate all pairwise correlations between the columns, outputting a matrix, as we can see if we pass it the first 4 columns of the built-in iris data set:

    cor(iris[1:4])
    #>              Sepal.Length Sepal.Width Petal.Length Petal.Width
    #> Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
    #> Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
    #> Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
    #> Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
    

    But clearly this won't work with our hand-rolled function C:

    C(iris[1:4])
    #> Error in C(iris[1:4]): argument "y" is missing, with no default
    

    The most straightforward way to achieve the same kind of output with our own hand-rolled function is to build a matrix, and iterate over both dimensions of it, populating each entry [i, j] with the correlation of the ith and jth column of the data frame using a double loop:

    C_dataframe <- function(x) {
      m <- matrix(0, ncol = ncol(x), nrow = ncol(x), 
                  dimnames = list(names(x), names(x)))
      for(i in seq(nrow(m))) {
        for(j in seq(ncol(m))) {
          m[i, j] <- C(x[[i]], x[[j]])
        }
      }
      m
    }
    

    Now testing this, using only our own user-defined function, we get exactly the same output from C_dataframe as we would get from cor:

    C_dataframe(iris[1:4])
    #>              Sepal.Length Sepal.Width Petal.Length Petal.Width
    #> Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
    #> Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
    #> Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
    #> Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000