Tags: r, correlation, economics

Running a correlation for 300 users and subsetting based on significant correlations


Apologies in advance for this question; I only have a vague understanding of what I'm trying to do, so searching for help has not produced very useful info.

Essentially my question is this: I have a data frame with 12 rows for each of the 300 hh_ids, one for each month, that looks like this:

 hh_ids      date    income consumption alcohol cleaning_materials  clothing
1  KELDK01 2012-11-1  62.70588    40.52941       0           0.000000  0.000000
2  KELDK01 2012-12-1  17.64706    42.43530       0           1.058824  7.058824
3  KELDK01 2013-01-1  91.76471    48.23529       0           0.000000  0.000000
4  KELDK01 2013-02-1  91.76470   107.52940       0           0.000000  0.000000
5  KELDK01 2013-03-1 116.47060   114.47060       0           0.000000  0.000000
6  KELDK01 2013-04-1 124.41180   118.29410       0           2.705882 17.647060
7  KELDK01 2013-05-1 137.23530   105.00000       0           1.411765  1.882353
8  KELDK01 2013-06-1 131.52940   109.54120       0           4.352942  2.941176
9  KELDK01 2013-07-1 121.52940   113.47060       0           2.352941 25.882350
10 KELDK01 2013-08-1 123.32940    86.50588       0           2.588235  2.941176

I want to see whether there is any correlation between the expenditure category "clothing" and each of the other expenditure categories (approx. 10) for each household over the course of the year. I then want to create a new data frame containing only the households that have a significant correlation between "clothing" and another expenditure category.

Any thoughts on how I'd tackle this problem?

(P.S. I'm trying to investigate whether there is any cross-product substitution between "clothing" and other expenditure categories, and to isolate the households that do show that behavior. If I'm being an idiot and there's a better way to do it, I'd be happy to hear your thoughts!)

EDIT: In response to the requests to see my work so far: it's rather embarrassing, but I've been doing it manually; I figured I'd spend about equal time working out how to do it properly.

I subsetted df into df_cloth (the households with clothing expenditure > 0 over the course of the year), which leaves 140 households.
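
For reference, that subsetting step can also be done in one dplyr pipeline; a minimal sketch, assuming the full data frame is named df and that expenditures are non-negative:

library(dplyr)

# keep the households that spent anything on clothing during the year
df_cloth <- df %>%
  group_by(hh_ids) %>%
  filter(any(clothing > 0)) %>%
  ungroup()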

I then did:

library(dplyr)

df_cloth_cor <- select(df_cloth, income, consumption, alcohol, cleaning_materials, clothing)
cor(df_cloth_cor)

I then recorded the correlation coefficients in Excel by household, with a column for each variable that clothing is correlated with.


Solution

  • I've changed your example a bit to include two different ids. Also, I'm not sure what you mean by "significant correlation": a large value, or statistical significance? I've covered both cases below.

    1. Correlation value and p value

    library(dplyr)
    
    # example dataset
    dt = read.table(text="hh_ids      date    income consumption alcohol cleaning_materials  clothing
                    KELDK01 2012-11-1  62.70588    40.52941       0           0.000000  0.000000
                    KELDK01 2012-12-1  17.64706    42.43530       0           1.058824  7.058824
                    KELDK01 2013-01-1  91.76471    48.23529       0           0.000000  0.000000
                    KELDK01 2013-02-1  91.76470   107.52940       0           0.000000  0.000000
                    KELDK01 2013-03-1 116.47060   114.47060       0           0.000000  0.000000
                    KELDK01 2013-04-1 124.41180   118.29410       0           2.705882 17.647060
                    KELDK02 2013-05-1 137.23530   105.00000       0           1.411765  1.882353
                    KELDK02 2013-06-1 131.52940   109.54120       0           4.352942  2.941176
                    KELDK02 2013-07-1 121.52940   113.47060       0           2.352941 25.882350
                    KELDK02 2013-08-1 123.32940    86.50588       0           2.588235  2.941176", 
                    sep="", header=T, stringsAsFactors = F)
    
    dt
    
    #     hh_ids      date    income consumption alcohol cleaning_materials  clothing
    # 1  KELDK01 2012-11-1  62.70588    40.52941       0           0.000000  0.000000
    # 2  KELDK01 2012-12-1  17.64706    42.43530       0           1.058824  7.058824
    # 3  KELDK01 2013-01-1  91.76471    48.23529       0           0.000000  0.000000
    # 4  KELDK01 2013-02-1  91.76470   107.52940       0           0.000000  0.000000
    # 5  KELDK01 2013-03-1 116.47060   114.47060       0           0.000000  0.000000
    # 6  KELDK01 2013-04-1 124.41180   118.29410       0           2.705882 17.647060
    # 7  KELDK02 2013-05-1 137.23530   105.00000       0           1.411765  1.882353
    # 8  KELDK02 2013-06-1 131.52940   109.54120       0           4.352942  2.941176
    # 9  KELDK02 2013-07-1 121.52940   113.47060       0           2.352941 25.882350
    # 10 KELDK02 2013-08-1 123.32940    86.50588       0           2.588235  2.941176
    
    
    # function that calculates the correlation and p value between two
    # named columns of d, restricted to the rows of one household id
    Get_cor_and_pval = function(d,n1,n2,id){
    
      # create 2 vectors based on the names of the variables and the id
      # (note: use the argument d here, not the global dt)
      x = d[,n1][d$hh_ids==id]
      y = d[,n2][d$hh_ids==id]
    
      # calculate correlation and p value
      test = cor.test(x,y)
      c = test$estimate   # keep correlation value
      p = test$p.value    # keep p value
    
      return(data.frame(c = c, p = p, row.names = NULL))
    }
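
    # quick sanity check of the helper for one id and one pair
    # (a hypothetical one-off call; the values match the table further down)
    Get_cor_and_pval(dt, "clothing", "income", "KELDK01")
    #           c         p
    # 1 0.1713298 0.7455198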
    
    
    # specify combinations of variables to calculate correlation
    names1 = "clothing"
    names2 = c("income","consumption","alcohol","cleaning_materials")
    
    dt_combs = expand.grid(names1=names1, names2=names2, stringsAsFactors = F)
    
    dt_combs
    
    #     names1             names2
    # 1 clothing             income
    # 2 clothing        consumption
    # 3 clothing           alcohol
    # 4 clothing cleaning_materials
    
    
    # process to get correlations and p values for each variable combination and each id 
    dt %>%
      select(hh_ids) %>% distinct() %>%                                       # select unique ids
      group_by(hh_ids) %>%                                                    # for each id
      do(data.frame(.,dt_combs)) %>%                                          # get all combinations of interest
      rowwise() %>%                                                           # for each id and combination
      do(data.frame(.,                                                        # keep id and combination
                    Get_cor_and_pval(dt,.$names1,.$names2,.$hh_ids),          # get correlation and p value
                    stringsAsFactors=F)) %>%                                  # keep strings as character, not factors
      ungroup()                                                               # forget groupings
    
    # # A tibble: 8 x 5
    #    hh_ids   names1             names2          c            p
    # *   <chr>   <fctr>              <chr>      <dbl>        <dbl>
    # 1 KELDK01 clothing             income  0.1713298 7.455198e-01
    # 2 KELDK01 clothing        consumption  0.3220463 5.336309e-01
    # 3 KELDK01 clothing            alcohol         NA           NA
    # 4 KELDK01 clothing cleaning_materials  0.9999636 1.989337e-09
    # 5 KELDK02 clothing             income -0.6526867 3.473133e-01
    # 6 KELDK02 clothing        consumption  0.5376850 4.623150e-01
    # 7 KELDK02 clothing            alcohol         NA           NA
    # 8 KELDK02 clothing cleaning_materials -0.1416633 8.583367e-01
    

    The last data frame shows the correlation between all pairs of interest, for each id. The alcohol variable is always 0, which is why cor.test returns NA values there. You can then apply your own filters to keep the rows you like.

    Note that for 300 ids and 6 variables this will work well. For a much larger number of ids (millions) and many variables it might become slow, and there could be a more efficient way to do it.
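
    To get at the original question (keep only the households where "clothing" is significantly correlated with at least one other category), you can filter that result on the p value and use the surviving ids to subset the data. A minimal sketch, assuming the output of the pipeline above was saved as res and using the conventional 0.05 cutoff (pick whatever threshold suits your purpose):

    res_sig = res %>% filter(!is.na(p), p < 0.05)        # keep only the significant pairs
    dt_sub  = dt %>% filter(hh_ids %in% res_sig$hh_ids)  # households with at least one significant pair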

    2. Correlation value

    If you're interested only in the correlation values and not the p values, the code is much shorter:

    library(dplyr)
    
    # example dataset
    dt = read.table(text="hh_ids      date    income consumption alcohol cleaning_materials  clothing
                    KELDK01 2012-11-1  62.70588    40.52941       0           0.000000  0.000000
                    KELDK01 2012-12-1  17.64706    42.43530       0           1.058824  7.058824
                    KELDK01 2013-01-1  91.76471    48.23529       0           0.000000  0.000000
                    KELDK01 2013-02-1  91.76470   107.52940       0           0.000000  0.000000
                    KELDK01 2013-03-1 116.47060   114.47060       0           0.000000  0.000000
                    KELDK01 2013-04-1 124.41180   118.29410       0           2.705882 17.647060
                    KELDK02 2013-05-1 137.23530   105.00000       0           1.411765  1.882353
                    KELDK02 2013-06-1 131.52940   109.54120       0           4.352942  2.941176
                    KELDK02 2013-07-1 121.52940   113.47060       0           2.352941 25.882350
                    KELDK02 2013-08-1 123.32940    86.50588       0           2.588235  2.941176", 
                    sep="", header=T, stringsAsFactors = F)
    
    
    dt %>% 
      group_by(hh_ids) %>%                  # for each id
      do(data.frame(cor(.[,3:7]))[5,]) %>%  # keep columns 3 to 7 (the numeric columns), get the correlation matrix and keep row 5 (the clothing row vs. all the others)
      ungroup()
    
    # # A tibble: 2 x 6
    #    hh_ids     income consumption alcohol cleaning_materials clothing
    #     <chr>      <dbl>       <dbl>   <dbl>              <dbl>    <dbl>
    # 1 KELDK01  0.1713298   0.3220463      NA          0.9999636        1
    # 2 KELDK02 -0.6526867   0.5376850      NA         -0.1416633        1
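
    If you'd rather not rely on column positions for that last step, you can pick the clothing row by name instead; a small variation on the same idea (a robustness tweak, not part of the original answer):

    dt %>% 
      group_by(hh_ids) %>%                            # for each id
      do(data.frame(cor(.[,3:7]))["clothing",]) %>%   # select the clothing row of the correlation matrix by name
      ungroup()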
    

    And here is an alternative using the corrr package:

    library(dplyr)
    library(corrr)
    
    # example dataset
    dt = read.table(text="hh_ids      date    income consumption alcohol cleaning_materials  clothing
                    KELDK01 2012-11-1  62.70588    40.52941       0           0.000000  0.000000
                    KELDK01 2012-12-1  17.64706    42.43530       0           1.058824  7.058824
                    KELDK01 2013-01-1  91.76471    48.23529       0           0.000000  0.000000
                    KELDK01 2013-02-1  91.76470   107.52940       0           0.000000  0.000000
                    KELDK01 2013-03-1 116.47060   114.47060       0           0.000000  0.000000
                    KELDK01 2013-04-1 124.41180   118.29410       0           2.705882 17.647060
                    KELDK02 2013-05-1 137.23530   105.00000       0           1.411765  1.882353
                    KELDK02 2013-06-1 131.52940   109.54120       0           4.352942  2.941176
                    KELDK02 2013-07-1 121.52940   113.47060       0           2.352941 25.882350
                    KELDK02 2013-08-1 123.32940    86.50588       0           2.588235  2.941176", 
                    sep="", header=T, stringsAsFactors = F)
    
    
    dt %>% 
      group_by(hh_ids) %>%                               # for each id
      do( correlate(.[,3:7]) %>% focus(clothing) ) %>%   # keep columns 3 to 7, get the correlation matrix, and keep only the correlations with "clothing"
      ungroup()
    
    # # A tibble: 8 x 3
    #    hh_ids            rowname   clothing
    #     <chr>              <chr>      <dbl>
    # 1 KELDK01             income  0.1713298
    # 2 KELDK01        consumption  0.3220463
    # 3 KELDK01            alcohol         NA
    # 4 KELDK01 cleaning_materials  0.9999636
    # 5 KELDK02             income -0.6526867
    # 6 KELDK02        consumption  0.5376850
    # 7 KELDK02            alcohol         NA
    # 8 KELDK02 cleaning_materials -0.1416633
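
    And if by "significant" you mean a large correlation value rather than a small p value, you can filter this output directly; a minimal sketch (the 0.5 cutoff here is an arbitrary assumption, adjust to taste):

    dt %>% 
      group_by(hh_ids) %>% 
      do( correlate(.[,3:7]) %>% focus(clothing) ) %>% 
      ungroup() %>% 
      filter(!is.na(clothing), abs(clothing) > 0.5)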