Convert dplyr Chain to "for loop"


I am trying to calculate the correlation between price and depth in the diamonds data set, grouped by color.

I have constructed a pipeline of dplyr commands that returns what I want:

library(dplyr)
library(ggplot2)
data(diamonds)

df <- data.frame(group = diamonds$color, a = diamonds$price , b = diamonds$depth )

df %>% group_by(group) %>% summarize(Corr = cor(a,b)) %>% as.data.frame()

This outputs what I want:


  group         Corr
1     D  0.013415309
2     E  0.017037228
3     F  0.079294072
4     G -0.032000363
5     H  0.051953865
6     I  0.009288322
7     J  0.086863041

But I would like to create a for loop that serves the exact same purpose.

I understand the basics of a for loop, but I find it hard to work out the logic for a task like this, whereas the dplyr version makes a lot of sense as it is.

Any ideas? Thank you (and happy holidays!)


Solution

  • The first thing to say is that there is no particular reason to do this in a for loop. Avoiding explicit loops is often preferable in R, and the dplyr syntax is designed for exactly this kind of grouped operation.

    However, if you want to use a loop to get more familiar with them, I would split() the data frame into a list of data frames, each of which contains one color.

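    # Produces a list of length 7, one data frame per color
    # (the formula form ~color needs R >= 4.1; split(diamonds, diamonds$color) also works)
    diamonds_color_list  <- split(diamonds, ~color)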
    
    diamonds_color_list[[1]]
    # # A tibble: 6,775 x 10
    #    carat cut       color clarity depth table price     x     y     z
    #    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
    #  1  0.23 Very Good D     VS2      60.5    61   357  3.96  3.97  2.4 
    #  2  0.23 Very Good D     VS1      61.9    58   402  3.92  3.96  2.44
    #  3  0.26 Very Good D     VS2      60.8    59   403  4.13  4.16  2.52
    #  4  0.26 Good      D     VS2      65.2    56   403  3.99  4.02  2.61
    #  5  0.26 Good      D     VS1      58.4    63   403  4.19  4.24  2.46
    #  6  0.22 Premium   D     VS2      59.3    62   404  3.91  3.88  2.31
    #  7  0.3  Premium   D     SI1      62.6    59   552  4.23  4.27  2.66
    #  8  0.3  Ideal     D     SI1      62.5    57   552  4.29  4.32  2.69
    #  9  0.3  Ideal     D     SI1      62.1    56   552  4.3   4.33  2.68
    # 10  0.24 Very Good D     VVS1     61.5    60   553  3.97  4     2.45
    # # ... with 6,765 more rows
    

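    To calculate the correlation of price and depth for the first color, you could use cor(diamonds_color_list[[1]]$price, diamonds_color_list[[1]]$depth).

    It might seem tempting to iterate through this list by index, starting with for(i in 1:7) cor(diamonds_color_list[[i]]$price, diamonds_color_list[[i]]$depth). This is OK, though for(i in seq_along(diamonds_color_list)) would be better: it does not hard-code the number of groups, and it does nothing for an empty list, whereas 1:0 would still iterate over 1 and 0. Either way, the value of cor() has to be stored or printed inside the loop, otherwise it is silently discarded.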
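    As a rough sketch of the index-based version, the loop below stores each correlation in a preallocated numeric vector. The names cors and d are illustrative choices, not something from the question, and split() is called with diamonds$color so that the snippet also runs on R versions before 4.1.

```r
library(ggplot2)  # only needed for the diamonds data set

# One data frame per color
diamonds_color_list <- split(diamonds, diamonds$color)

# Preallocate a numeric vector with one slot per color
cors <- numeric(length(diamonds_color_list))

# seq_along() yields 1, 2, ..., length(list), and nothing for an empty list
for (i in seq_along(diamonds_color_list)) {
    d <- diamonds_color_list[[i]]
    cors[i] <- cor(d$price, d$depth)
}

# Label each correlation with its color
names(cors) <- names(diamonds_color_list)
cors
```

    Because the results are plain numbers, a numeric vector is enough here; a list is only needed when each iteration returns something more complex.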

    However, I think code is more readable if you iterate over the color names, so I would do something along the lines of:

    colors  <- names(diamonds_color_list) # "D" "E" "F" "G" "H" "I" "J"
    
    # Create empty list of correct length so you don't
    # commit the sin of growing an object in a loop
    results_list  <- vector(mode="list", length = length(diamonds_color_list))  |>
        setNames(colors)
    
    # Loop through each color in the data frame 
    # calculate the cor and add to the list
    for (color in colors) {
        result <- list(
            color = color,
            cor = cor(
                diamonds_color_list[[color]]$price,
                diamonds_color_list[[color]]$depth
            )
        )
        results_list[[color]]  <- result
    }
    
    # Bind them together row-wise (this gives a matrix of list
    # elements, as printed below, rather than a true data frame)
    do.call(rbind, results_list)
    #   color cor
    # D "D"   -0.01352522
    # E "E"   -0.005518622
    # F "F"   0.006164055
    # G "G"   -0.007661284
    # H "H"   -0.02033827
    # I "I"   -0.08299649
    # J "J"   -0.04973188
    

    All these functions are base R and do not require dplyr. ggplot2 is only needed for the diamonds data set, and the |> pipe requires R 4.1 or later.
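    If an ordinary data frame with the same group and Corr columns as the dplyr result is wanted, one possible variation is to collect the correlations in a named numeric vector and build the data frame once the loop has finished. This is a sketch rather than the only way to do it; the names corr, d and result_df are illustrative.

```r
library(ggplot2)  # only needed for the diamonds data set

diamonds_color_list <- split(diamonds, diamonds$color)
colors <- names(diamonds_color_list)

# Preallocate a named numeric vector, one element per color
corr <- setNames(numeric(length(colors)), colors)

# Loop over the color names and fill in each correlation
for (color in colors) {
    d <- diamonds_color_list[[color]]
    corr[[color]] <- cor(d$price, d$depth)
}

# Assemble a data frame with the same column names as the dplyr output
result_df <- data.frame(group = colors, Corr = unname(corr))
result_df
```

    Building the data frame once at the end avoids calling rbind() on lists, so Corr stays a plain numeric column that can be sorted or plotted directly.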