How to run PCA on existing correlation matrix, then run regression?


I have calculated pairwise correlations between survey respondents and stored them in a dataframe. It looks like this:

          person_1  person_2  person_3
person_1  0         1.5       1.8
person_2  1.5       0         2.2
person_3  1.8       2.2       0

Now I'd like to run PCA to find the loadings for each response. I have 2 questions:

  1. Which function should I use to compute the principal components directly from the correlation matrix?
  2. On a related note, I'd then like to regress each respondent's survey rating score (from the original dataframe) on that respondent's loading. Is there a way to merge the "score" column back in so I can run the regression? Or is there another way to do the regression/prediction?

The original dataframe holds text and looks like this. I ran word mover's distance between the sentences to derive the correlation matrix.

          text                      score
person_1  I like working at Apple    2
person_2  the culture is great      -2
person_3  pandemic hits              5

Thanks!


Solution

  • Because you already have the matrix rather than the raw data, many of the usual PCA functions in R can fail with an error, typically because of their tolerance checks. I suggest using the eigen() function instead, which performs the eigendecomposition at the core of PCA. First, the data:

    #Data
    #Matrix
    mm <- structure(c(0, 1.5, 1.8, 1.5, 0, 2.2, 1.8, 2.2, 0), .Dim = c(3L, 
    3L), .Dimnames = list(c("person_1", "person_2", "person_3"), 
        c("person_1", "person_2", "person_3")))
    #Scores
    df1 <- structure(list(text. = c("I like working at Apple", "the culture is great", 
    "pandemic hits"), score = c(2L, -2L, 5L)), row.names = c(NA, 
    -3L), class = "data.frame")
    

    Then the PCA itself:

    #PCA
    myPCA <- eigen(mm)
    #Eigenvalues: the component variances (the squares of the sdev that princomp reports)
    myPCA$values
    

    Output:

    [1]  3.681925 -1.437762 -2.244163
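
    The negative eigenvalues show that mm is not positive semi-definite, which is one reason a standard PCA function may reject it. As a minimal sketch of the direct route asked about in the question: princomp() accepts a precomputed matrix through its covmat argument. The matrix R below is a made-up, valid correlation matrix used only for illustration (it is not from the question), and the tryCatch() call shows mm being rejected.

    ```r
    # Hypothetical valid correlation matrix: unit diagonal, positive definite
    R <- matrix(c(1.0, 0.5, 0.3,
                  0.5, 1.0, 0.4,
                  0.3, 0.4, 1.0), nrow = 3,
                dimnames = list(paste0("q", 1:3), paste0("q", 1:3)))

    # PCA straight from the matrix, no raw data needed
    pc <- princomp(covmat = R)

    # The squared sdev from princomp should match the eigenvalues from eigen()
    same_variances <- isTRUE(all.equal(unname(pc$sdev^2), eigen(R)$values))

    # The matrix from the question (zero diagonal, negative eigenvalues)
    mm <- matrix(c(0, 1.5, 1.8, 1.5, 0, 2.2, 1.8, 2.2, 0), nrow = 3,
                 dimnames = list(paste0("person_", 1:3), paste0("person_", 1:3)))

    # Capture the error message instead of stopping the script
    res <- tryCatch(princomp(covmat = mm), error = function(e) conditionMessage(e))
    ```

    If same_variances is TRUE, both routes agree for a valid matrix; res ends up holding the error message that princomp() raises for mm.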
    

    To get the loadings, take the eigenvectors:

    #Loadings
    myPCA$vectors
    

    Output:

              [,1]       [,2]       [,3]
    [1,] -0.5360029  0.8195308 -0.2026578
    [2,] -0.5831254 -0.5329938 -0.6130925
    [3,] -0.6104635 -0.2104444  0.7635754
    

    With these outputs we can build a dataframe for the regression:

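    #Format loadings
    Vectors <- data.frame(myPCA$vectors)
    #Each row is one respondent and each column one component;
    #the columns only borrow the person_* labels as names
    names(Vectors) <- colnames(mm)
    #Prepare for regression
    #Create data (rows are matched by position, so df1 must follow the row order of mm)
    mydf <- cbind(df1[, 'score', drop = FALSE], Vectors)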
    

    Output:

      score   person_1   person_2   person_3
    1     2 -0.5360029  0.8195308 -0.2026578
    2    -2 -0.5831254 -0.5329938 -0.6130925
    3     5 -0.6104635 -0.2104444  0.7635754
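
    cbind() pairs rows purely by position. If the scores might be stored in a different order than the matrix, a safer sketch is to join on an explicit respondent identifier. The id column and the PC1..PC3 names below are assumptions added for illustration; the original dataframe has no such column.

    ```r
    # Matrix from the question; its dimnames identify the respondents
    mm <- matrix(c(0, 1.5, 1.8, 1.5, 0, 2.2, 1.8, 2.2, 0), nrow = 3,
                 dimnames = list(paste0("person_", 1:3), paste0("person_", 1:3)))

    # Scores with a hypothetical id column, deliberately out of order
    scores <- data.frame(id = c("person_3", "person_1", "person_2"),
                         score = c(5L, 2L, -2L),
                         stringsAsFactors = FALSE)

    # Eigenvectors as a data frame: one row per respondent, one column per component
    pcs <- data.frame(id = rownames(mm), eigen(mm)$vectors,
                      stringsAsFactors = FALSE)
    names(pcs)[-1] <- paste0("PC", seq_len(ncol(mm)))

    # Join on id so every score lands next to the right respondent
    merged <- merge(scores, pcs, by = "id")
    ```

    Naming the columns PC1, PC2, PC3 also makes it explicit that they are components rather than people.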
    

    Finally, the code for the regressions:

    #Build models
    lm(score~person_1,data=mydf)
    lm(score~person_2,data=mydf)
    lm(score~person_3,data=mydf)
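
    With more than a handful of columns, typing each lm() call by hand gets tedious. As a sketch, the same one-predictor models can be fitted in a loop; reformulate() builds each formula from the column name. The object names models and slopes are my own choices for illustration.

    ```r
    # Rebuild the regression data from the question
    mm <- matrix(c(0, 1.5, 1.8, 1.5, 0, 2.2, 1.8, 2.2, 0), nrow = 3,
                 dimnames = list(paste0("person_", 1:3), paste0("person_", 1:3)))
    mydf <- data.frame(score = c(2L, -2L, 5L), eigen(mm)$vectors)
    names(mydf)[-1] <- colnames(mm)

    # One simple regression per eigenvector column
    predictors <- setdiff(names(mydf), "score")
    models <- lapply(predictors, function(p) {
      lm(reformulate(p, response = "score"), data = mydf)
    })
    names(models) <- predictors

    # Collect the slope of each model in a named vector
    slopes <- sapply(models, function(m) coef(m)[[2]])
    ```

    Note that the sign of an eigenvector is arbitrary, so the sign of a slope can differ between platforms while its magnitude stays the same.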
    

    These models can be saved in new objects if you want. For example:

    m1 <- lm(score~person_1,data=mydf)
    summary(m1)
    

    Output:

    Call:
    lm(formula = score ~ person_1, data = mydf)
    
    Residuals:
         1      2      3 
     1.411 -3.842  2.431 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   -13.66      51.60  -0.265    0.835
    person_1      -26.58      89.37  -0.297    0.816
    
    Residual standard error: 4.76 on 1 degrees of freedom
    Multiple R-squared:  0.08127,   Adjusted R-squared:  -0.8375 
    F-statistic: 0.08846 on 1 and 1 DF,  p-value: 0.816
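
    To use these numbers in further code rather than reading them off the printout, the pieces of the summary can be pulled out directly. This is a minimal sketch that rebuilds the same model; the names s1, coef_table, r2 and slope_p are illustrative.

    ```r
    # Rebuild the data and the first model from the answer
    mm <- matrix(c(0, 1.5, 1.8, 1.5, 0, 2.2, 1.8, 2.2, 0), nrow = 3,
                 dimnames = list(paste0("person_", 1:3), paste0("person_", 1:3)))
    mydf <- data.frame(score = c(2L, -2L, 5L), eigen(mm)$vectors)
    names(mydf)[-1] <- colnames(mm)
    m1 <- lm(score ~ person_1, data = mydf)

    s1 <- summary(m1)
    # Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|)
    coef_table <- coef(s1)
    # Multiple R-squared of the fit
    r2 <- s1$r.squared
    # p-value of the slope
    slope_p <- coef_table["person_1", "Pr(>|t|)"]
    # Residual degrees of freedom (observations minus estimated coefficients)
    resid_df <- df.residual(m1)
    ```

    With only three respondents there is a single residual degree of freedom, which is why the p-value in the output above is so large; more data is needed before the fit says anything useful.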