Tags: r, dataframe, tibble, usage-statistics, pearson-correlation

How would I automate computing correlations within a tibble for various countries and store the results effectively?


I'm somewhat of a beginner in R, and I am working on a relatively large dataset (for me at least) of around 500,000 rows.

I am trying to find the correlation between variables for various countries (specifically, measuring the effects of bullying) in the PISA dataset (an education-based survey).

I am able to compute the correlation matrix for countries on a case by case basis.

I want to record the correlation between two variables (so not necessarily the entire matrix) across all these countries, automating this and storing all the results in a tibble so that I don't need to do it manually.

correl_countries = tibble()

for (each in list_countries){
  countries_bullying %>% #tibble subset of the original data 
    filter(CNTRYID == each)%>%
    select(reading_score, bullied_index)%>%
    correl = cor(use = "pairwise.complete.obs") #something to store the correlation values
    correl_countries %>% add_row(x = each, y = correl) #wanted to add these results to a tibble
}

Currently nothing seems to happen, and I receive this error:

Error in is.data.frame(x) : argument "x" is missing, with no default

It may have something to do with the fact that "pairwise.complete.obs" generates a correlation matrix and not a single vector.

Grateful for your recommendations!


Solution

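  • You don't really need the loop here; the tidyverse has you covered. The error most likely arises because the assignment correl = cor(use = "pairwise.complete.obs") interrupts the pipe, so cor() is called without any data. Note also that add_row() returns a new tibble rather than modifying correl_countries in place, so its result would have to be assigned back. The following returns a tibble with two columns, CNTRYID and correl: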

    library(tidyverse)
    
    # get only the correlations
    countries_bullying %>%
      group_by(CNTRYID) %>%
      summarise(correl = cor(reading_score, bullied_index, use = "pairwise.complete.obs"))
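
To check the grouped approach without the real PISA data, here is a minimal sketch on made-up data. The toy tibble, its values, and the names correl_grouped and correl_loop are assumptions for illustration only. Only CNTRYID, reading_score, bullied_index, list_countries, and countries_bullying come from the question. It also shows one way the original loop could be repaired, passing the two columns to cor() explicitly and assigning the result of add_row() back to the tibble.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for the real countries_bullying tibble
set.seed(1)
countries_bullying <- tibble(
  CNTRYID       = rep(c("A", "B", "C"), each = 50),
  reading_score = rnorm(150, mean = 500, sd = 100),
  bullied_index = rnorm(150)
)
# Introduce a few missing values so that `use` actually matters
countries_bullying$bullied_index[c(3, 60, 120)] <- NA

# Grouped approach: one row per country, one correlation per row
correl_grouped <- countries_bullying %>%
  group_by(CNTRYID) %>%
  summarise(correl = cor(reading_score, bullied_index,
                         use = "pairwise.complete.obs"))

# Repaired loop, for comparison with the grouped result
list_countries <- unique(countries_bullying$CNTRYID)
correl_loop <- tibble(CNTRYID = character(), correl = numeric())

for (each in list_countries) {
  country_data <- countries_bullying %>%
    filter(CNTRYID == each)

  # Pass both vectors explicitly so cor() returns a single number
  correl <- cor(country_data$reading_score, country_data$bullied_index,
                use = "pairwise.complete.obs")

  # add_row() returns a new tibble, so assign it back
  correl_loop <- correl_loop %>%
    add_row(CNTRYID = each, correl = correl)
}

print(correl_grouped)
print(correl_loop)
```

Both tibbles should hold the same per-country correlations. The grouped version is shorter and avoids growing a tibble row by row, which tends to matter more as the number of groups increases.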