
For loop using names of a dataframe in R


I am working with COVID-19 data for three regions of my country, stored in a dataframe. From each column of positive cases I want to generate a new column holding the growth between consecutive rows. The dataframe:

> df
  Lima Arequipa Huánuco
1    1       NA      NA
2    6       NA      NA
3    6        1      NA
4    8        2       5
5    9        3       7
6   11        4       8

I want to use a for loop that, for each column of df, creates a new column named after it with "_dif" appended, holding each row's value minus the previous row's value (its lag). So I used this code:

for(col in names(df)) {
  df[paste0(col, "_dif")] = df[col] - lag(df[col])
}

The output I want is the following:

  Lima Arequipa Huánuco Lima_dif Arequipa_dif Huánuco_dif
1    1       NA      NA       NA           NA          NA
2    6       NA      NA       5            NA          NA
3    6        1      NA       0            NA          NA
4    8        2       5       2            1           NA
5    9        3       7       1            1           2
6   11        4       8       2            1           1

But when I look at df after the for loop, the new columns contain only NA:

  Lima Arequipa Huánuco Lima_dif Arequipa_dif Huánuco_dif
1    1       NA      NA       NA           NA          NA
2    6       NA      NA       NA           NA          NA
3    6        1      NA       NA           NA          NA
4    8        2       5       NA           NA          NA
5    9        3       7       NA           NA          NA
6   11        4       8       NA           NA          NA

Thanks in advance.


Solution

  • We can use mutate with across from dplyr (version 1.0.0 or later). The _all/_at suffixed variants are superseded, and across is more general:

    library(dplyr)
    df %>%
       mutate(across(everything(), ~ . - lag(.), .names = "{.col}_dif"))
    #   Lima Arequipa Huánuco Lima_dif Arequipa_dif Huánuco_dif
    #1    1       NA      NA       NA           NA          NA
    #2    6       NA      NA        5           NA          NA
    #3    6        1      NA        0           NA          NA
    #4    8        2       5        2            1          NA
    #5    9        3       7        1            1           2
    #6   11        4       8        2            1           1
    

    Or in base R

    df[paste0(names(df), "_dif")] <- lapply(df, function(x) c(NA, diff(x)))
    

    Another option is to difference the whole data frame as a matrix:

    df[paste0(names(df), "_dif")] <- rbind(NA, diff(as.matrix(df)))
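
Both base R variants rely on diff(), which returns one element fewer than its input; that is why an NA is prepended. A minimal sketch of that step, using only the Lima values from the question (the variable names are illustrative):

```r
# Cumulative positive cases for one region (the Lima column from the question)
lima <- c(1L, 6L, 6L, 8L, 9L, 11L)

# diff() returns the successive differences, so it has length(lima) - 1 elements
increments <- diff(lima)
length(increments)   # 5

# Prepend NA so the result lines up with the original rows
lima_dif <- c(NA, increments)
lima_dif             # NA 5 0 2 1 2
```

The first row has no previous row to compare against, so NA is the natural placeholder there.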
    

    The issue in the OP's for loop is that df[col] is still a data.frame with a single column. We need df[[col]] to extract the column as a vector, because lag expects a vector. According to ?lag

    x - Vector of values

    lag(df[1])
    #  Lima
    #1   NA
    

    returns a single NA, which gets recycled down the whole new column

    while,

    lag(df[[1]])
    #[1] NA  1  6  6  8  9
    

    Therefore, if we change the loop to use df[[col]], we get the expected result:

    for(col in names(df)) {
      df[paste0(col, "_dif")] = df[[col]] - lag(df[[col]])
    }
    
    
    df
    #  Lima Arequipa Huánuco Lima_dif Arequipa_dif Huánuco_dif
    #1    1       NA      NA       NA           NA          NA
    #2    6       NA      NA        5           NA          NA
    #3    6        1      NA        0           NA          NA
    #4    8        2       5        2            1          NA
    #5    9        3       7        1            1           2
    #6   11        4       8        2            1           1
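
The difference between the two extraction operators that the fix relies on can be checked directly. A minimal sketch in base R, assuming a one-column data frame built from the Lima values:

```r
# One-column data frame with the Lima values from the question
df <- data.frame(Lima = c(1L, 6L, 6L, 8L, 9L, 11L))
col <- "Lima"

# Single brackets keep the data.frame wrapper
class(df[col])      # "data.frame"
dim(df[col])        # 6 1

# Double brackets extract the underlying vector
class(df[[col]])    # "integer"
length(df[[col]])   # 6
```

Functions that are documented as taking a vector, such as lag, should generally be given the df[[col]] form.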
    

    data

    df <- structure(list(Lima = c(1L, 6L, 6L, 8L, 9L, 11L), Arequipa = c(NA, 
    NA, 1L, 2L, 3L, 4L), Huánuco = c(NA, NA, NA, 5L, 7L, 8L)), 
      class = "data.frame", row.names = c("1", 
    "2", "3", "4", "5", "6"))
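
If the same step is needed on several data frames, the base R approach could be wrapped in a small function. This is only a sketch: add_diffs and its suffix argument are hypothetical names, not part of any package, and the third column is spelled Huanuco here purely to keep the example ASCII-only.

```r
# Same values as the data above (third column name written without the accent)
df <- data.frame(
  Lima     = c(1L, 6L, 6L, 8L, 9L, 11L),
  Arequipa = c(NA, NA, 1L, 2L, 3L, 4L),
  Huanuco  = c(NA, NA, NA, 5L, 7L, 8L)
)

# Hypothetical helper: for every numeric column, append a "<name><suffix>"
# column holding the difference between each row and the previous one
add_diffs <- function(data, suffix = "_dif") {
  num_cols <- names(data)[vapply(data, is.numeric, logical(1))]
  data[paste0(num_cols, suffix)] <- lapply(data[num_cols], function(x) c(NA, diff(x)))
  data
}

result <- add_diffs(df)
result$Lima_dif       # NA 5 0 2 1 2
result$Arequipa_dif   # NA NA NA 1 1 1
```

Restricting the helper to numeric columns is a design choice of this sketch, so that character or factor columns in other data frames are left untouched.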