R - How do I run a regression based on a correlation matrix rather than raw data?


I would like to run a regression based on a correlation matrix rather than raw data. I have looked at this post, but can't make sense of it. How do I do this in R?

Here is some code:

#Correlation matrix.
MyMatrix <- matrix(
            c(1.0, 0.1, 0.5, 0.4,
              0.1, 1.0, 0.9, 0.3,
              0.5, 0.9, 1.0, 0.3,
              0.4, 0.3, 0.3, 1.0), 
            nrow=4, 
            ncol=4)

df <- as.data.frame(MyMatrix)

colnames(df)[colnames(df)=="V1"] <- "a"
colnames(df)[colnames(df)=="V2"] <- "b"
colnames(df)[colnames(df)=="V3"] <- "c"
colnames(df)[colnames(df)=="V4"] <- "d"

#Assume means and standard deviations as follows:
MEAN.a <- 4.00
MEAN.b <- 3.90
MEAN.c <- 4.10
MEAN.d <- 5.00
SD.a <- 1.01
SD.b <- 0.95
SD.c <- 0.99
SD.d <- 2.20

#Run model [UNSURE ABOUT THIS PART]
library(lavaan)
m1 <- 'd ~ a + b + c'
fit <- sem(m1, ????)
summary(fit, standardize=TRUE)

Solution

  • This should do it. First, convert your correlation matrix to a covariance matrix by pre- and post-multiplying it by a diagonal matrix of the standard deviations:

    MyMatrix <- matrix(
      c(1.0, 0.1, 0.5, 0.4,
        0.1, 1.0, 0.9, 0.3,
        0.5, 0.9, 1.0, 0.3,
        0.4, 0.3, 0.3, 1.0), 
      nrow=4, 
      ncol=4)
    rownames(MyMatrix) <- colnames(MyMatrix) <- c("a", "b","c","d")
    
    #Assume the following means and standard deviations:
    MEAN.a <- 4.00
    MEAN.b <- 3.90
    MEAN.c <- 4.10
    MEAN.d <- 5.00
    SD.a <- 1.01
    SD.b <- 0.95
    SD.c <- 0.99
    SD.d <- 2.20
    s <- c(SD.a, SD.b, SD.c, SD.d)
    m <- c(MEAN.a, MEAN.b, MEAN.c, MEAN.d)
    cov.mat <- diag(s) %*% MyMatrix %*% diag(s)
    rownames(cov.mat) <- colnames(cov.mat) <- rownames(MyMatrix)
    names(m) <- rownames(MyMatrix)
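
    As a quick sanity check on this step, here is a minimal sketch in base R only (the object names `vars`, `R`, `s`, and `S` are chosen here for illustration and are not part of the code above). It confirms that the conversion round-trips through `cov2cor()` and that the resulting matrix is positive definite. The correlations among `a`, `b`, and `c` are close to collinear, so this seems worth checking before fitting anything:

    ```r
    vars <- c("a", "b", "c", "d")
    R <- matrix(c(1.0, 0.1, 0.5, 0.4,
                  0.1, 1.0, 0.9, 0.3,
                  0.5, 0.9, 1.0, 0.3,
                  0.4, 0.3, 0.3, 1.0),
                nrow = 4, ncol = 4, dimnames = list(vars, vars))
    s <- c(a = 1.01, b = 0.95, c = 0.99, d = 2.20)

    # Covariance = D %*% R %*% D, with D the diagonal matrix of standard deviations.
    S <- diag(s) %*% R %*% diag(s)
    dimnames(S) <- dimnames(R)

    # Element-wise form of the same conversion: S[i, j] = s[i] * s[j] * R[i, j].
    stopifnot(isTRUE(all.equal(S, outer(s, s) * R, check.attributes = FALSE)))

    # Converting back should recover the correlations,
    # and the diagonal should hold the variances.
    stopifnot(isTRUE(all.equal(cov2cor(S), R, check.attributes = FALSE)))
    stopifnot(isTRUE(all.equal(unname(diag(S)), unname(s^2))))

    # A valid covariance matrix has strictly positive eigenvalues.
    ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
    stopifnot(all(ev > 0))
    ```

    If any of the `stopifnot()` calls fails for your own matrix, the input is probably not a valid correlation matrix (for example, because of rounding in a published table).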
    

    Then you can use lavaan to estimate the model along the lines of the post you mentioned in your question. Note that you need to supply the number of observations (sample.nobs). It does not change the regression coefficients, but the standard errors, z-values, and p-values depend on it. I used 100 for the example; replace it with your actual sample size.

    library(lavaan)
    m1 <- 'd ~ a + b + c'
    fit <- sem(m1, 
               sample.cov = cov.mat, 
               sample.nobs=100, 
               sample.mean=m,
               meanstructure=TRUE)
    summary(fit, standardized=TRUE)
    # lavaan 0.6-6 ended normally after 44 iterations
    # 
    # Estimator                                         ML
    # Optimization method                           NLMINB
    # Number of free parameters                          5
    # 
    # Number of observations                           100
    # 
    # Model Test User Model:
    #   
    # Test statistic                                 0.000
    # Degrees of freedom                                 0
    # 
    # Parameter Estimates:
    #   
    # Standard errors                             Standard
    # Information                                 Expected
    # Information saturated (h1) model          Structured
    # 
    # Regressions:
    #                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    # d ~                                                                   
    #   a                 6.317    0.095   66.531    0.000    6.317    2.900
    #   b                12.737    0.201   63.509    0.000   12.737    5.500
    #   c               -13.556    0.221  -61.307    0.000  -13.556   -6.100
    # 
    # Intercepts:
    #                 Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    # .d               -14.363    0.282  -50.850    0.000  -14.363   -6.562
    # 
    # Variances:
    #                 Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    # .d                 0.096    0.014    7.071    0.000    0.096    0.020
    # 
    #
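
    The estimates above can be cross-checked without any package, since a linear regression is fully determined by the covariance matrix, the means, and the sample size. The following is a minimal sketch in base R under a few assumptions: the names `S`, `Sxx`, `Sxy`, `beta`, `res.var.ml`, and `se` are illustrative, the residual variance is rescaled by (n - 1)/n on the assumption that lavaan's ML estimator does the same by default, and the standard errors use the usual large-sample formula, which appears to reproduce the Std.Err column printed above.

    ```r
    vars <- c("a", "b", "c", "d")
    R <- matrix(c(1.0, 0.1, 0.5, 0.4,
                  0.1, 1.0, 0.9, 0.3,
                  0.5, 0.9, 1.0, 0.3,
                  0.4, 0.3, 0.3, 1.0),
                nrow = 4, ncol = 4, dimnames = list(vars, vars))
    s <- c(a = 1.01, b = 0.95, c = 0.99, d = 2.20)  # standard deviations
    m <- c(a = 4.00, b = 3.90, c = 4.10, d = 5.00)  # means
    n <- 100                                        # assumed sample size

    S <- outer(s, s) * R        # covariance matrix
    x <- c("a", "b", "c")       # predictors
    y <- "d"                    # outcome

    Sxx <- S[x, x]              # predictor covariances
    Sxy <- S[x, y]              # predictor-outcome covariances

    # Unstandardized slopes solve Sxx %*% beta = Sxy.
    beta <- solve(Sxx, Sxy)

    # The intercept makes the fitted line pass through the means.
    intercept <- m[[y]] - sum(beta * m[x])

    # Residual variance, then rescaled to the ML (divide-by-n) version.
    res.var    <- S[y, y] - sum(Sxy * beta)
    res.var.ml <- res.var * (n - 1) / n

    # Large-sample standard errors of the slopes; n enters only here
    # and in the rescaling of the residual variance.
    se <- sqrt(res.var * diag(solve(Sxx)) / n)

    # Standardized slopes (the Std.all column).
    beta.std <- beta * s[x] / s[[y]]

    print(round(cbind(Estimate = beta, Std.Err = se,
                      z = beta / se, Std.all = beta.std), 3))
    print(round(c(intercept = intercept, res.var.ml = res.var.ml), 3))
    ```

    In this sketch the sample size never enters the slopes or the intercept, only the standard errors (which shrink in proportion to 1/sqrt(n)) and the small rescaling of the residual variance. That is why the choice of sample.nobs matters for inference but not for the coefficients themselves.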