I am working on a large dataset (7 million rows), trying to understand the correlations between individual independent variables and the dependent variables. When I run pcor(dataset), the resulting correlation coefficients are higher than those from cor(dataset).
My dataset has 6 dependent variables and 84 independent variables. I am computing the partial correlation between each dependent variable and each of the 84 independent variables individually.
My independent variables are word counts for text types (75 categories) and some other social variables (all numerical), e.g. gender.
My question is: why am I getting high correlations when using pcor() in R but very weak correlations when using cor()? Is this normal behavior for partial correlation?
If you're wondering whether a partial correlation coefficient can be larger than a "full" correlation coefficient, consider the following example.
Let's take a look at the sample data from the ppcor reference manual:
df <- data.frame(
  hl   = c(7, 15, 19, 15, 21, 22, 57, 15, 20, 18),
  disp = c(0.000, 0.964, 0.000, 0.000, 0.921, 0.000, 0.000, 1.006, 0.000, 1.011),
  deg  = c(9, 2, 3, 4, 1, 3, 1, 3, 6, 1),
  BC   = c(1.78e-02, 1.05e-06, 1.37e-05, 7.18e-03, 0.00e+00, 0.00e+00, 0.00e+00, 4.48e-03, 2.10e-06, 0.00e+00)
)
According to the original paper, the data describe the relationship between sequence and functional evolution in yeast proteins, and are taken from [Drummond et al., Molecular Biology and Evolution 23, 327–337 (2006)].
We are interested in exploring the correlation between hl and disp.
Let's start by plotting hl as a function of disp:
library(ggplot2)
ggplot(df, aes(hl, disp)) +
  geom_point()
The standard ("full") Pearson's product moment correlation coefficient is given by
with(df, cor(hl, disp))
#[1] -0.2378724
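(As a quick sanity check, not part of the original example, we can reproduce this value by hand from the textbook definition of the sample coefficient.)

# Pearson's r from its definition: the sum of centred cross-products
# divided by the square root of the product of the centred sums of squares
with(df, {
  x <- hl - mean(hl)
  y <- disp - mean(disp)
  sum(x * y) / sqrt(sum(x^2) * sum(y^2))
})
#[1] -0.2378724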
As is obvious from both the plot and the cor() result, without controlling for any other variable, the linear relationship between hl and disp is not very strong.
To recap the definition: the partial correlation between X and Y given confounding variables Z is defined as the correlation of the residuals resulting from the linear regression of X on Z and of Y on Z.
Let's visualise the partial correlation by plotting the residuals of the two corresponding linear models, hl ~ deg + BC and disp ~ deg + BC.
ggplot(data.frame(
  res.x = lm(hl ~ deg + BC, df)$residuals,
  res.y = lm(disp ~ deg + BC, df)$residuals)) +
  geom_point(aes(res.x, res.y))
The linear dependence between the two sets of residuals is very obvious from the plot, suggesting a substantial partial correlation between hl and disp. Let's confirm this by calculating the partial correlation between hl and disp whilst controlling for the confounding effects of deg and BC:
library(ppcor)
pcor.test(df$hl, df$disp, df[, c("deg", "BC")])
#    estimate    p.value statistic  n gp  Method
#1 -0.6720863 0.06789202 -2.223267 10  2 pearson
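We can also verify that pcor.test agrees with the residual-based definition above by correlating the two sets of residuals directly (again a sanity check, not part of the original example):

# the partial correlation equals the correlation of the two residual vectors
cor(lm(hl ~ deg + BC, df)$residuals,
    lm(disp ~ deg + BC, df)$residuals)
#[1] -0.6720863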
Pearson's product moment correlation coefficient between hl and disp is considerably larger in magnitude when we control for the confounding variables (-0.672) than when we do not (-0.238). So yes, it is entirely normal for a partial correlation coefficient to exceed the corresponding full correlation coefficient.
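If you want to see this effect in your own data, you could compare the full and partial coefficients side by side for each predictor. A minimal sketch, assuming a data frame dat with one dependent variable in a column dv and the 84 predictors in the remaining columns (dat and dv are hypothetical placeholder names):

library(ppcor)

# For each predictor, compare its full correlation with the dependent
# variable against its partial correlation controlling for all other
# predictors. `dat` and `dv` are placeholders for your own data.
ivs <- setdiff(names(dat), "dv")
comparison <- t(sapply(ivs, function(v) {
  c(full    = cor(dat$dv, dat[[v]]),
    partial = pcor.test(dat$dv, dat[[v]],
                        dat[, setdiff(ivs, v)])$estimate)
}))
comparison  # one row per predictor, columns `full` and `partial`

This will show you directly which predictors gain (or lose, or flip the sign of) their apparent linear relationship with the dependent variable once the other predictors are partialled out.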