How to run a correlation test using dummy variables


I am quite new to R and am struggling to find a way to compute a Pearson correlation coefficient from a data set. I am trying to analyze whether there is a correlation between the scores received for an assignment and the topic area chosen (Algebra, Calculus, Geometry, etc.). This is my data frame:

sc.ar <- structure(list(area = structure(c(1L, 5L, 5L, 2L, 4L, 4L, 1L, 
6L, 1L, 2L, 1L, 3L, 3L, 5L, 2L, 2L, 2L, 3L, 4L, 4L, 5L, 1L, 2L, 
3L, 4L, 5L, 5L, 2L, 5L, 5L, 5L, 1L, 2L, 2L, 3L, 4L, 4L, 2L, 3L, 
4L, 4L, 5L, 5L, 2L, 3L, 4L, 4L, 4L, 5L), levels = c("Algebra", 
"Calculus", "Geometry", "Modelling", "Probability", "Other"), class = "factor"), 
    score = c(10, 10, 10, 11, 11, 11, 12, 12, 13, 13, 14, 14, 
    14, 14, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 
    17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 
    19, 20, 20, 20, 7, 9, 9)), class = "data.frame", row.names = c(NA, 
-49L))

Sorry if this isn't enough information; it's my first time on here as well.

I am able to get results from summary(lm(formula = score ~ area, data = sc.ar)), but I honestly do not know what to do with them. My goal is to find a way to use cor by inputting the dummy variables manually.


Solution

  • Maybe you want to split the scores by area (df here is your sc.ar data frame):

    > (df_s <- split(df$score, df$area))
    $Algebra
    [1] 10 12 13 14 16 17
    
    $Calculus
     [1] 11 13 15 15 15 16 17 18 18 19 20
    
    $Geometry
    [1] 14 14 15 16 18 19 20
    
    $Modelling
     [1] 11 11 15 15 16 18 18 19 19 20  7  9
    
    $Probability
     [1] 10 10 14 15 16 16 17 17 17 19 19  9
    
    $Other
    [1] 12
    

    but the areas have different numbers of scores. Maybe that's just because of your toy data; either way, you can pad each vector with NA up to the maximum length.

    > (m <- sapply(df_s, `length<-`, max(lengths(df_s))))
          Algebra Calculus Geometry Modelling Probability Other
     [1,]      10       11       14        11          10    12
     [2,]      12       13       14        11          10    NA
     [3,]      13       15       15        15          14    NA
     [4,]      14       15       16        15          15    NA
     [5,]      16       15       18        16          16    NA
     [6,]      17       16       19        18          16    NA
     [7,]      NA       17       20        18          17    NA
     [8,]      NA       18       NA        19          17    NA
     [9,]      NA       18       NA        19          17    NA
    [10,]      NA       19       NA        20          19    NA
    [11,]      NA       20       NA         7          19    NA
    [12,]      NA       NA       NA         9           9    NA
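
As a small illustration of the padding step: assigning a longer length to a vector fills the new positions with NA, which is what the length<- replacement function does inside the sapply call above. The list groups below is a made-up toy example, not the data from the question.

```r
# Toy list with vectors of unequal length (hypothetical values)
groups <- list(a = c(1, 2, 3), b = c(4, 5), c = 6)

# The longest vector determines the number of rows
n_max <- max(lengths(groups))

# `length<-`(x, n) returns x extended to length n, padded with NA;
# sapply then binds the equal-length results into a matrix
padded <- sapply(groups, `length<-`, n_max)

print(padded)
```

Each column keeps its original values at the top and carries NA below them, so later functions need to be told how to handle the missing entries.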
    

    Finally, just apply cor to the resulting matrix.

    > cor(m, use="pairwise.complete.obs")
                  Algebra  Calculus  Geometry Modelling Probability Other
    Algebra     1.0000000 0.9006049 0.9601136 0.9297804   0.9094441    NA
    Calculus    0.9006049 1.0000000 0.8492236 0.2967773   0.9461672    NA
    Geometry    0.9601136 0.8492236 1.0000000 0.9285061   0.8992441    NA
    Modelling   0.9297804 0.2967773 0.9285061 1.0000000   0.5407100    NA
    Probability 0.9094441 0.9461672 0.8992441 0.5407100   1.0000000    NA
    Other              NA        NA        NA        NA          NA    NA
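
To make the use = "pairwise.complete.obs" option concrete: for each pair of columns, only the rows where both values are present enter the calculation. A minimal sketch for one pair, with the Algebra and Calculus scores typed in from the split output above (the vector names are my own choice):

```r
# Scores copied from the split output; NA marks the padded positions
algebra  <- c(10, 12, 13, 14, 16, 17, NA, NA, NA, NA, NA)
calculus <- c(11, 13, 15, 15, 15, 16, 17, 18, 18, 19, 20)

# Rows where both vectors have a value (here: the first six)
both <- complete.cases(algebra, calculus)

# Pearson correlation computed on the complete pairs only
r_manual <- cor(algebra[both], calculus[both])

# Same pair via the option used in the answer
r_option <- cor(algebra, calculus, use = "pairwise.complete.obs")

print(c(manual = r_manual, option = r_option))
```

Both numbers should agree with the Algebra/Calculus entry of the matrix above.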
    

    If you also need p-values and the number of pairs, you could use Hmisc::rcorr.

    > Hmisc::rcorr(m)
                Algebra Calculus Geometry Modelling Probability Other
    Algebra        1.00     0.90     0.96      0.93        0.91    NA
    Calculus       0.90     1.00     0.85      0.30        0.95    NA
    Geometry       0.96     0.85     1.00      0.93        0.90    NA
    Modelling      0.93     0.30     0.93      1.00        0.54    NA
    Probability    0.91     0.95     0.90      0.54        1.00    NA
    Other            NA       NA       NA        NA          NA     1
    
    n
                Algebra Calculus Geometry Modelling Probability Other
    Algebra           6        6        6         6           6     1
    Calculus          6       11        7        11          11     1
    Geometry          6        7        7         7           7     1
    Modelling         6       11        7        12          12     1
    Probability       6       11        7        12          12     1
    Other             1        1        1         1           1     1
    
    P
                Algebra Calculus Geometry Modelling Probability Other
    Algebra             0.0143   0.0024   0.0072    0.0119           
    Calculus    0.0143           0.0156   0.3755    0.0000           
    Geometry    0.0024  0.0156            0.0025    0.0059           
    Modelling   0.0072  0.3755   0.0025             0.0695           
    Probability 0.0119  0.0000   0.0059   0.0695                     
    Other                                                            
    Warning message:
    In sqrt(npair - 2) : NaNs produced
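
If installing Hmisc is not an option, base R's cor.test gives the estimate and p-value for a single pair. A sketch for Algebra vs. Calculus, again with the scores typed in from the split output (variable names are illustrative); cor.test drops incomplete pairs itself and also defaults to Pearson:

```r
# Scores copied from the split output; NA marks the padded positions
algebra  <- c(10, 12, 13, 14, 16, 17, NA, NA, NA, NA, NA)
calculus <- c(11, 13, 15, 15, 15, 16, 17, 18, 18, 19, 20)

# Test of zero correlation; incomplete pairs are removed internally
ct <- cor.test(algebra, calculus)

print(ct$estimate)   # correlation coefficient
print(ct$p.value)    # two-sided p-value
print(ct$parameter)  # degrees of freedom: complete pairs minus 2
```

Looping over pairs of columns would reproduce the P table one cell at a time, if that is ever needed.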
    

    Pearson is the default in both.
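
If you would rather stay with the dummy-variable idea from the question, here is a rough sketch of one way it could look. This is not what the steps above do; it is an alternative under my own assumptions: model.matrix builds one 0/1 column per area, cor relates score to each of them (a point-biserial correlation), and the R-squared from the lm call you already ran summarises all areas at once. The names dummies, r_dummy, fit and r2 are just illustrative.

```r
# Data frame copied from the question
sc.ar <- structure(list(area = structure(c(1L, 5L, 5L, 2L, 4L, 4L, 1L,
6L, 1L, 2L, 1L, 3L, 3L, 5L, 2L, 2L, 2L, 3L, 4L, 4L, 5L, 1L, 2L,
3L, 4L, 5L, 5L, 2L, 5L, 5L, 5L, 1L, 2L, 2L, 3L, 4L, 4L, 2L, 3L,
4L, 4L, 5L, 5L, 2L, 3L, 4L, 4L, 4L, 5L), levels = c("Algebra",
"Calculus", "Geometry", "Modelling", "Probability", "Other"), class = "factor"),
    score = c(10, 10, 10, 11, 11, 11, 12, 12, 13, 13, 14, 14,
    14, 14, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16,
    17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19,
    19, 20, 20, 20, 7, 9, 9)), class = "data.frame", row.names = c(NA,
-49L))

# One 0/1 indicator column per area (no intercept, so no level is dropped)
dummies <- model.matrix(~ area - 1, data = sc.ar)

# Pearson correlation of score with each indicator column
r_dummy <- cor(sc.ar$score, dummies)[1, ]
print(round(r_dummy, 3))

# Overall association between score and area: R-squared of the lm fit
fit <- lm(score ~ area, data = sc.ar)
r2 <- summary(fit)$r.squared
print(c(R2 = r2, R = sqrt(r2)))
```

Each entry of r_dummy compares one area against all the others combined, while sqrt(r2) is the correlation between the observed scores and the per-area fitted means from lm.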