I am working with COVID-19 data for three regions of my country in a data frame. I want to use the columns of positive cases to generate new columns containing the growth between consecutive rows. The data frame:
> df
Lima Arequipa Huánuco
1 1 NA NA
2 6 NA NA
3 6 1 NA
4 8 2 5
5 9 3 7
6 11 4 8
I want to use a for loop to add, for each column of df, a new column named after it with the suffix "_dif", containing each row's value minus the previous row's value (x - lag(x)). So I used this code:
for(col in names(df)) {
df[paste0(col, "_dif")] = df[col] - lag(df[col])
}
The output I want is:
Lima Arequipa Huánuco Lima_dif Arequipa_dif Huánuco_dif
1 1 NA NA NA NA NA
2 6 NA NA 5 NA NA
3 6 1 NA 0 NA NA
4 8 2 5 2 1 NA
5 9 3 7 1 1 2
6 11 4 8 2 1 1
But when I inspect df after the for loop, I get this (only NA in the new columns):
Lima Arequipa Huánuco Lima_dif Arequipa_dif Huánuco_dif
1 1 NA NA NA NA NA
2 6 NA NA NA NA NA
3 6 1 NA NA NA NA
4 8 2 5 NA NA NA
5 9 3 7 NA NA NA
6 11 4 8 NA NA NA
Thanks in advance.
We can use mutate with across from dplyr. The _all/_at suffixed variants are superseded, and in newer versions across is the more general replacement.
library(dplyr)
df %>%
mutate(across(everything(), ~ . - lag(.), .names = "{.col}_dif"))
# Lima Arequipa Huánuco Lima_dif Arequipa_dif Huánuco_dif
#1 1 NA NA NA NA NA
#2 6 NA NA 5 NA NA
#3 6 1 NA 0 NA NA
#4 8 2 5 2 1 NA
#5 9 3 7 1 1 2
#6 11 4 8 2 1 1
Or in base R
df[paste0(names(df), "_dif")] <- lapply(df, function(x) c(NA, diff(x)))
Or another option is
df[paste0(names(df), "_dif")] <- rbind(NA, diff(as.matrix(df)))
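To see why the base R versions need the leading NA, here is a minimal sketch on the Lima values from the question. diff() returns one fewer element than its input, so an NA is prepended to keep the result aligned with the rows. The names x, d, prev, x_dif and x_dif2 are only illustrative.

```r
# Lima column from the question
x <- c(1L, 6L, 6L, 8L, 9L, 11L)

# diff() returns the n - 1 successive differences
d <- diff(x)

# Prepend NA so the result has one value per original row
x_dif <- c(NA, d)

# The same thing written as "value minus previous value",
# building the lagged vector by hand in base R
prev <- c(NA, head(x, -1))
x_dif2 <- x - prev

print(x_dif)
print(identical(x_dif, x_dif2))
```

Both forms should give the Lima_dif column shown in the desired output.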
The issue in the OP's for loop is that df[col] is still a data.frame with a single column. We need df[[col]] to extract the column as a vector, because lag needs a vector. According to ?lag:
x - Vector of values
lag(df[1])
# Lima
#1 NA
returns a single NA, which then gets recycled across all rows, while
lag(df[[1]])
#[1] NA 1 6 6 8 9
Therefore, if we change the code to use df[[col]], we get the expected output:
for(col in names(df)) {
df[paste0(col, "_dif")] = df[[col]] - lag(df[[col]])
}
df
# Lima Arequipa Huánuco Lima_dif Arequipa_dif Huánuco_dif
#1 1 NA NA NA NA NA
#2 6 NA NA 5 NA NA
#3 6 1 NA 0 NA NA
#4 8 2 5 2 1 NA
#5 9 3 7 1 1 2
#6 11 4 8 2 1 1
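The difference between single and double brackets can be checked directly. This is a small sketch in base R only, rebuilding two of the columns from the question. The names single and double are just illustrative.

```r
# Two of the columns from the question
df <- data.frame(Lima = c(1L, 6L, 6L, 8L, 9L, 11L),
                 Arequipa = c(NA, NA, 1L, 2L, 3L, 4L))

single <- df["Lima"]    # [ keeps the data.frame structure (one column)
double <- df[["Lima"]]  # [[ extracts the column itself as a vector

print(class(single))    # "data.frame"
print(class(double))    # "integer"

# length() of a data.frame counts columns, not rows
print(length(single))   # 1
print(length(double))   # 6
```

The length of 1 for df["Lima"] is consistent with the single NA shown above: a function that expects a vector sees one element (the column) rather than six values.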
df <- structure(list(Lima = c(1L, 6L, 6L, 8L, 9L, 11L), Arequipa = c(NA,
NA, 1L, 2L, 3L, 4L), Huánuco = c(NA, NA, NA, 5L, 7L, 8L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))