I am working on a large dataset (7 million rows), trying to understand the correlations between individual independent variables and the dependent variables. When I run pcor(dataset), the resulting correlation coefficients are higher than those from cor(dataset).
My dataset has 6 dependent variables and 84 independent variables. I am computing the partial correlation between each dependent variable and each of the 84 independent variables individually.
My independent variables are word counts for text types (75 categories) and some other social variables (all numerical), e.g. gender.
My question is: why am I getting high correlations when using pcor() in R but very weak correlations when using cor()? Is this normal behavior for partial correlation?
If you're wondering whether a partial correlation coefficient can be larger than a "full" correlation coefficient, consider the following example.
Let's take a look at the sample data from the ppcor reference manual:
df <- data.frame(
  hl   = c(7, 15, 19, 15, 21, 22, 57, 15, 20, 18),
  disp = c(0.000, 0.964, 0.000, 0.000, 0.921, 0.000, 0.000, 1.006, 0.000, 1.011),
  deg  = c(9, 2, 3, 4, 1, 3, 1, 3, 6, 1),
  BC   = c(1.78e-02, 1.05e-06, 1.37e-05, 7.18e-03, 0.00e+00, 0.00e+00, 0.00e+00, 4.48e-03, 2.10e-06, 0.00e+00)
)
According to the original paper, the data describe the relationship between sequence and functional evolution in yeast proteins, and are taken from [Drummond et al., Molecular Biology and Evolution 23, 327–337 (2006)].
We are interested in exploring the correlation between hl and disp.
Let's start by plotting hl as a function of disp:
library(ggplot2)
ggplot(df, aes(hl, disp)) +
  geom_point()
The standard ("full") Pearson's product moment correlation coefficient is given by
with(df, cor(hl, disp))
#[1] -0.2378724
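(As a quick sanity check, not part of the original example, we can reproduce this value by hand from the textbook definition of the sample coefficient.)

# Pearson's r from its definition: the sum of centred cross-products
# divided by the square root of the product of the centred sums of squares
with(df, {
  x <- hl - mean(hl)
  y <- disp - mean(disp)
  sum(x * y) / sqrt(sum(x^2) * sum(y^2))
})
#[1] -0.2378724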
As is obvious from both the plot and the cor() result, without controlling for any other variable, the linear relationship between hl and disp is not very strong.
To recap the definition: the partial correlation between X and Y given confounding variables Z is defined as the correlation of the residuals resulting from the linear regression of X on Z and of Y on Z.
Let's visualise the partial correlation by plotting the residuals of the two corresponding linear models, hl ~ deg + BC and disp ~ deg + BC.
ggplot(data.frame(
  res.x = lm(hl ~ deg + BC, df)$residuals,
  res.y = lm(disp ~ deg + BC, df)$residuals)) +
  geom_point(aes(res.x, res.y))
The linear dependence between the two sets of residuals is very obvious from the plot, suggesting a substantial partial correlation between hl and disp. Let's confirm this by calculating the partial correlation between hl and disp whilst controlling for the confounding effects of deg and BC:
library(ppcor)
pcor.test(df$hl, df$disp, df[, c("deg", "BC")])
#    estimate    p.value statistic  n gp  Method
#1 -0.6720863 0.06789202 -2.223267 10  2 pearson
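We can also verify that pcor.test agrees with the residual-based definition above by correlating the two sets of residuals directly (again a sanity check, not part of the original example):

# the partial correlation equals the correlation of the two residual vectors
cor(lm(hl ~ deg + BC, df)$residuals,
    lm(disp ~ deg + BC, df)$residuals)
#[1] -0.6720863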
Pearson's product moment correlation coefficient between hl and disp is considerably larger in magnitude when we control for the confounding variables (-0.672) than when we do not (-0.238). So yes, it is entirely normal for a partial correlation coefficient to exceed the corresponding full correlation coefficient.
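If you want to see this effect in your own data, you could compare the full and partial coefficients side by side for each predictor. A minimal sketch, assuming a data frame dat with one dependent variable in a column dv and the 84 predictors in the remaining columns (dat and dv are hypothetical placeholder names):

library(ppcor)

# For each predictor, compare its full correlation with the dependent
# variable against its partial correlation controlling for all other
# predictors. `dat` and `dv` are placeholders for your own data.
ivs <- setdiff(names(dat), "dv")
comparison <- t(sapply(ivs, function(v) {
  c(full    = cor(dat$dv, dat[[v]]),
    partial = pcor.test(dat$dv, dat[[v]],
                        dat[, setdiff(ivs, v)])$estimate)
}))
comparison  # one row per predictor, columns `full` and `partial`

This will show you directly which predictors gain (or lose, or flip the sign of) their apparent linear relationship with the dependent variable once the other predictors are partialled out.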