
Partial correlation values are larger than normal correlation in R


I am working on a large dataset (7 million rows), trying to understand the correlations between individual independent variables and the dependent variables. Running pcor(dataset) produces higher correlations than running cor(dataset) on the same data.

My dataset has 6 dependent variables and 84 independent variables. I am computing the partial correlation between each dependent variable and each of the 84 independent variables individually.

My independent variables are word counts for text types (75 categories), plus some other social variables (all numeric), e.g. gender.

My question is: why am I getting high correlations when using pcor() in R but very weak correlations when using cor()? Is this normal behavior for partial correlation?


Solution

  • If you're wondering whether a partial correlation coefficient can be larger than a "full" correlation coefficient, consider the following example.

    Let's take a look at the sample data from the ppcor reference manual

    df <- data.frame(
        hl = c(7,15,19,15,21,22,57,15,20,18),
        disp = c(0.000,0.964,0.000,0.000,0.921,0.000,0.000,1.006,0.000,1.011),
        deg = c(9,2,3,4,1,3,1,3,6,1),
        BC = c(1.78e-02,1.05e-06,1.37e-05,7.18e-03,0.00e+00,0.00e+00,0.00e+00 ,4.48e-03,2.10e-06,0.00e+00))
    

    According to the original paper, the data cover the relationship between sequence and functional evolution in yeast proteins, and are available from [Drummond et al., Molecular Biology and Evolution 23, 327–337 (2006)].

    We are interested in exploring the correlation between hl and disp.

    Linear relationship between hl and disp

    Let's start by plotting hl as a function of disp

    library(ggplot2)
    ggplot(df, aes(hl, disp)) +
        geom_point()
    

    [Scatter plot of hl against disp]

    The standard ("full") Pearson's product moment correlation coefficient is given by

    with(df, cor(hl, disp))
    #[1] -0.2378724
    

    As is obvious from the plot and the cor result, without controlling for any other variable, the linear relationship between hl and disp is not very strong.

    Partial correlation

    To recap the definition: the partial correlation between X and Y given a set of controlling variables Z is the correlation between the residuals from a linear regression of X on Z and the residuals from a linear regression of Y on Z.

    Let's visualise the partial correlation by plotting the residuals of the two corresponding linear models hl ~ deg + BC and disp ~ deg + BC.

    ggplot(data.frame(
        res.x = lm(hl ~ deg + BC, df)$residuals, 
        res.y = lm(disp ~ deg + BC, df)$residuals)) +
        geom_point(aes(res.x, res.y))
    

    [Scatter plot of the residuals res.x against res.y]

    The linear dependence between the two sets of residuals is very obvious, suggesting a substantial partial correlation between hl and disp. Let's confirm this by calculating the partial correlation between hl and disp while controlling for the confounding effects of deg and BC:

    library(ppcor)
    pcor.test(df$hl, df$disp, df[, c("deg","BC")])
    #    estimate    p.value statistic  n gp  Method
    #1 -0.6720863 0.06789202 -2.223267 10  2 pearson
    
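    The pcor.test estimate can also be reproduced directly from the definition above, as a sanity check that needs no ppcor functions: regress each variable on the controls and correlate the residuals.

    ```r
    # Partial correlation = Pearson correlation of the two residual vectors
    res.x <- resid(lm(hl ~ deg + BC, df))
    res.y <- resid(lm(disp ~ deg + BC, df))
    cor(res.x, res.y)
    #[1] -0.6720863
    ```

    This matches the estimate column of the pcor.test output exactly, which is just the definition of partial correlation at work.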

    Conclusion

    Pearson's product-moment correlation coefficient between hl and disp is larger in magnitude when we control for the confounding variables (about -0.67) than when we do not (about -0.24). So yes: a partial correlation can exceed the corresponding "full" correlation, and this is normal behavior.
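    To see why this is normal, here is a minimal simulated example (hypothetical data, not from the question's dataset) in which a suppressor variable z masks a strong relationship between x and y: z pushes x and y in opposite directions, so the full correlation looks weak, while the partial correlation controlling for z is strong.

    ```r
    set.seed(42)
    n <- 1000
    z <- rnorm(n)   # suppressor variable
    a <- rnorm(n)   # signal shared by x and y
    x <- a + z      # z adds to x ...
    y <- a - z      # ... but subtracts from y

    cor(x, y)       # near 0: the shared signal is hidden by z

    # Partial correlation controlling for z: correlate the residuals
    cor(resid(lm(x ~ z)), resid(lm(y ~ z)))  # close to 1
    ```

    The full correlation is close to zero because the variance contributed by z cancels the shared signal a; once z is regressed out, only a remains in both residuals, so the partial correlation is nearly 1.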