How to calculate partial correlations when the data frame contains missing values

I want to calculate partial correlations between sets of two variables while controlling for all the other variables in a data frame.

To do this, I used pcor(c("variable1", "variable2", "control1", "control2", etc.), var(dataFrame)) from the ggm package. However, it didn't work: I got NA for the partial correlation coefficient.

My data frame has scores of personality test results assessing the participants for neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness:

studentLecturerPersonality <- read.delim("http://www.discoveringstatistics.com/docs/Chamorro-Premuzic.dat", header = TRUE)

names(studentLecturerPersonality) <- c("age", "gender", "studentNeuroticism", "studentExtraversion", "studentOpenness", "studentAgreeableness", "studentConscientiousness", "lecturerNeuroticism", "lecturerExtraversion", "lecturerOpenness", "lecturerAgreeableness", "lecturerConscientiousness")

studentLecturerPersonalityOnlyTraits <- subset(studentLecturerPersonality, select = c("studentNeuroticism", "studentExtraversion", "studentOpenness", "studentAgreeableness", "studentConscientiousness"))

I calculated the correlations between the variables using both cor(dataFrame, use = "pairwise.complete.obs", method = "pearson") and cor(variable1, variable2, use = "pairwise.complete.obs", method = "pearson"), where the use argument lets me specify how missing values (NAs) are handled.

I wanted to calculate the partial correlation coefficient between extraversion and neuroticism while controlling for openness to experience, agreeableness, and conscientiousness:

studentLecturerPersonalityOnlyTraitsMatrix <- as.matrix(studentLecturerPersonalityOnlyTraits)

pcExtraversionNeuroticism <- pcor(c("studentExtraversion", "studentNeuroticism",
                                    "studentOpenness", 
                                    "studentAgreeableness", 
                                    "studentConscientiousness"), var(studentLecturerPersonalityOnlyTraitsMatrix))

pcExtraversionNeuroticism

which returns [1] NA.

I don't know if it's because the data frame contains missing values (NAs), which I didn't (or couldn't) specify how R should deal with (like in cor()).

Can anyone suggest how I can make pcor() work, or suggest an alternative method?

I really appreciate any help you can provide.

Solution

First, use complete.cases() to subset the matrix to just the rows which do not contain NA:

complete_matrix  <- studentLecturerPersonalityOnlyTraitsMatrix[
    complete.cases(studentLecturerPersonalityOnlyTraitsMatrix),
]
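To see why the original call returned NA, here is a minimal sketch on a made-up matrix (the object toy and its columns a, b, c are assumptions for illustration, not the questionnaire data). By default var() propagates missing values, so any covariance involving a column with an NA is itself NA, while complete.cases() returns a logical vector flagging the rows with no missing values:

```r
# Toy data (hypothetical values, not from the question's data set)
toy <- cbind(
    a = c(1, 2, 3, 4, 5),
    b = c(2, 1, NA, 5, 4),
    c = c(5, 3, 4, NA, 1)
)

# var() defaults to use = "everything", so NAs propagate
var(toy)["a", "b"]
# [1] NA

# complete.cases() flags rows with no missing value in any column
keep <- complete.cases(toy)
keep
# [1]  TRUE  TRUE FALSE FALSE  TRUE

# The covariance matrix of the complete rows contains no NA
anyNA(var(toy[keep, ]))
# [1] FALSE
```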

Then use this matrix, as before, to compute the partial correlation:

pcExtraversionNeuroticism <- pcor(
    c(
        "studentExtraversion",
        "studentNeuroticism",
        "studentOpenness",
        "studentAgreeableness",
        "studentConscientiousness"
    ), var(complete_matrix)
)

pcExtraversionNeuroticism
# [1] -0.2971974
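As an alternative sketch, var() also has its own use argument, and use = "complete.obs" performs the same row-wise deletion without building a separate matrix. The example below uses simulated data (the variables x, y, z1, z2 and all coefficients are assumptions for illustration) and does not need ggm: it reads the partial correlation of the first two variables, given all the others, off the inverse of the covariance matrix, then cross-checks it against the correlation of regression residuals:

```r
set.seed(1)  # simulated data, purely illustrative

n  <- 200
z1 <- rnorm(n)
z2 <- rnorm(n)
x  <- 0.5 * z1 + rnorm(n)
y  <- -0.4 * x + 0.3 * z2 + rnorm(n)
dat <- cbind(x = x, y = y, z1 = z1, z2 = z2)

# Introduce a few missing values
dat[c(3, 17, 42), "y"] <- NA
dat[c(8, 99), "z1"]    <- NA

# Let var() drop incomplete rows itself
S <- var(dat, use = "complete.obs")

# Partial correlation of columns 1 and 2 (x, y) given the rest,
# computed from the inverse covariance (precision) matrix
P <- solve(S)
pc_precision <- -P[1, 2] / sqrt(P[1, 1] * P[2, 2])

# Cross-check: correlate the residuals of x and y after
# regressing each on the control variables (complete rows only)
cc <- as.data.frame(dat[complete.cases(dat), ])
pc_resid <- cor(
    resid(lm(x ~ z1 + z2, data = cc)),
    resid(lm(y ~ z1 + z2, data = cc))
)

all.equal(pc_precision, pc_resid)
# [1] TRUE
```

With the question's data, the equivalent call would pass var(studentLecturerPersonalityOnlyTraitsMatrix, use = "complete.obs") as the second argument to pcor().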

It is worth noting that this drops every row that contains an NA in any column, not only in the columns you are using. In this case you are using all the columns, so that isn't a problem. However, if you were only using, for example, the first two columns, you might prefer to do:

cols_to_use  <- c("studentExtraversion", "studentNeuroticism")
complete_matrix <- studentLecturerPersonalityOnlyTraitsMatrix[
    complete.cases(studentLecturerPersonalityOnlyTraitsMatrix[, cols_to_use]),
]
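The difference between the two filters can be seen on a small made-up matrix (toy and its values are assumptions for illustration). Restricting complete.cases() to the columns of interest keeps rows whose only missing value is in a column you are not using:

```r
# Toy data (hypothetical values, not from the question's data set)
toy <- cbind(
    a = c(1, 2, 3, 4, 5),
    b = c(2, 1, NA, 5, 4),
    c = c(5, 3, 4, NA, 1)
)
cols_to_use <- c("a", "b")

# Rows that are complete across all columns
sum(complete.cases(toy))
# [1] 3

# Rows that are complete in the columns of interest only
sum(complete.cases(toy[, cols_to_use]))
# [1] 4
```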

As an aside, your variable names are very long. The Style Guide in Advanced R by Hadley Wickham says:

Generally, variable names should be nouns and function names should be verbs. Strive for names that are concise and meaningful (this is not easy!).

You have certainly got meaningful names. This is a matter of taste, but I wonder if they could be a little more concise!