Search code examples
rvectorcorrelationr-colnamesname-matching

Correlate vectors whose colnames match the values of two variables in each row in r dataframe


I have this dataframe in r (link) (Example rows and columns below)

FocalID     Mother   Adelaide Asimov Austen Brazzaville Lusaka Kinshasa
Adelaide    HalfEar    0      0.0380 0.0417      0.0366 0.0278   0.0385
Asimov      Lusaka     0.0380 0      0.0845      0.0357 0.169    0.0641
Austen      Kinshasa   0.0417 0.0845 0           0.0526 0.0952   0.0411
Brazzaville NA         0.0366 0.0357 0.0526      0      0.0395   0.0488

I would like to add a new variable, df$cor, in which the value in each row is the result of a correlation. The correlation should be between two vectors: (1) the column whose colname corresponds to the value of the variable df$FocalID in that row, and (2) the column whose colname corresponds to the value of the variable df$Mother in that row.

If the vector correspondent to the column that matches the mother's name is absent (either because the mother is not known (NA in df$Mother) or absent from colnames), the correlation should produce an NA.

I have tried the following code:

df$cor <- cor(df[, colnames(df) %in% df$FocalID], df[, colnames(df) %in% df$Mother])

However, the result doesn't seem right. Any idea?


Solution

  • If we need to do this for each pairwise column, we check whether the 'FocalID', 'Mother' columns are non-NA with complete.cases. Then, loop over the columns specifying subsetting only the non-NA columns, with apply and MARGIN = 1, do a check for whether those elements are %in% the column names of the dataset, select the data, apply cor and create the new column Cor

    i1 <-  complete.cases(df[1:2])
    df$Cor <- NA_real_
    df$Cor[i1] <-  apply(df[i1, 1:2], 1, function(x) 
       if(all(x %in% names(df))) cor(df[, x[1]], df[, x[2]]) else NA)
    

    -output

    df$Cor
    #[1]          NA  0.09769710  0.26956397          NA  0.04820137 -0.07776837          NA  0.19553956 -0.09596063          NA  0.04806345
    #[12]  0.66489746          NA          NA          NA -0.04254666 -0.05975570  0.47359966  0.09745244          NA          NA  0.24750130
    #[23]          NA          NA          NA          NA          NA          NA          NA          NA  0.10822526          NA  0.07093166
    #[34]          NA          NA -0.18088278 -0.17548394  0.11585058  0.07278060  0.36327624  0.10178624          NA          NA          NA
    #[45]  0.20491334          NA
    

    Or using rowwise from dplyr

    library(dplyr)
    df <- df %>%
            rowwise %>% 
            mutate(Cor = if(!is.na(FocalID) & !is.na(Mother) & 
               all(c(FocalID, Mother) %in% names(.)))
              cor(df[[FocalID]], df[[Mother]]) else NA_real_)
    

    data

    library(readxl)
    df <- read_excel(file.choose(), na = "NA")