I have a dataframe with more than 100 000 rows and 30 000 unique ids.
My aim is to fill all the NAs among the different columns if there is a value from the previous month and the same id. However, most of the times the previous recorded value is from more than a month ago. Those NAs I would like to leave untouched.
The id column and the date column do not have NAs. Here is an example of the data I have:
df3
id oxygen gluco dias bp date
1 0,25897842 0,20201604 0,17955655 0,14100962 31.7.2019
2 NA NA 0,38582622 0,12918231 31.12.2014
2 0,35817147 0,32943499 NA 0,43667462 30.11.2018
2 0,68557053 0,42898807 0,93897514 NA 31.10.2018
2 NA NA 0,99899076 0,44168223 31.7.2018
2 0,43848054 0,38604586 NA NA 30.4.2013
2 0,15823254 0,06216771 0,07829624 0,69755251 31.1.2016
2 NA NA 0,61645303 NA 29.2.2016
2 0,94671363 0,50682091 0,96770222 0,97403356 31.5.2018
3 NA 0,77352235 0,660479 0,11554399 30.4.2019
3 0,15567703 NA 0,4553325 NA 31.3.2017
3 NA NA 0,22181609 0,08527658 30.9.2017
3 0,93660763 NA NA NA 31.3.2018
3 0,73416759 NA NA 0,78501791 30.11.2018
3 NA NA NA NA 28.2.2019
3 0,84525106 0,54360374 NA 0,40595426 31.8.2014
3 0,76221263 0,62983336 0,84592719 0,10640734 31.8.2013
4 NA 0,29108942 0,3863479 NA 31.1.2018
4 0,74075742 NA 0,38117415 0,58849266 30.11.2018
4 0,09400641 0,68860814 NA 0,88895224 31.8.2014
4 0,72202944 0,49901387 0,19967415 NA 31.8.2018
4 0,98205262 0,85213969 0,34450998 0,98962306 30.11.2013
This is the last code implementation that I have tried:
´´´
df3 %>%
group_by(id) %>%
mutate_all(funs(na.locf(., na.rm = FALSE, maxgap = 30)))
´´´
But apparently "mutate_all()
ignored the following grouping variables:
Column id
"
You can use the tidyverse
for that. Here's an approach:
date
column to class Date, then order by date
Ym
mo
flag
the rows which have max one month differencecumsum
the inverse logic in flag
fill
the rows from the same groupslibrary(dplyr)
library(tidyr)
library(lubridate)
df$date <- as.Date(df$date, format="%d.%m.%Y")
df %>%
arrange(date) %>%
mutate(
Ym = ym(strftime(date, "%Y-%m")),
mo = interval(Ym, lag(Ym, default=as.Date("1970-01-01"))) / months(1),
flag = cumsum(!(mo > -2 & mo < 1))) %>%
group_by(id, flag) %>%
fill(names(.), .direction="down") %>%
ungroup() %>%
select(-c("Ym","mo","flag")) %>%
print(n=nrow(.))
Output
# A tibble: 22 × 6
id oxygen gluco dias bp date
<int> <chr> <chr> <chr> <chr> <date>
1 2 0,43848054 0,38604586 NA NA 2013-04-30
2 3 0,76221263 0,62983336 0,84592719 0,10640734 2013-08-31
3 4 0,98205262 0,85213969 0,34450998 0,98962306 2013-11-30
4 3 0,84525106 0,54360374 NA 0,40595426 2014-08-31
5 4 0,09400641 0,68860814 NA 0,88895224 2014-08-31
6 2 NA NA 0,38582622 0,12918231 2014-12-31
7 2 0,15823254 0,06216771 0,07829624 0,69755251 2016-01-31
8 2 0,15823254 0,06216771 0,61645303 0,69755251 2016-02-29
9 3 0,15567703 NA 0,4553325 NA 2017-03-31
10 3 NA NA 0,22181609 0,08527658 2017-09-30
11 4 NA 0,29108942 0,3863479 NA 2018-01-31
12 3 0,93660763 NA NA NA 2018-03-31
13 2 0,94671363 0,50682091 0,96770222 0,97403356 2018-05-31
14 2 NA NA 0,99899076 0,44168223 2018-07-31
15 4 0,72202944 0,49901387 0,19967415 NA 2018-08-31
16 2 0,68557053 0,42898807 0,93897514 NA 2018-10-31
17 2 0,35817147 0,32943499 0,93897514 0,43667462 2018-11-30
18 3 0,73416759 NA NA 0,78501791 2018-11-30
19 4 0,74075742 NA 0,38117415 0,58849266 2018-11-30
20 3 NA NA NA NA 2019-02-28
21 3 NA 0,77352235 0,660479 0,11554399 2019-04-30
22 1 0,25897842 0,20201604 0,17955655 0,14100962 2019-07-31
df <- structure(list(id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L), oxygen = c("0,25897842",
NA, "0,35817147", "0,68557053", NA, "0,43848054", "0,15823254",
NA, "0,94671363", NA, "0,15567703", NA, "0,93660763", "0,73416759",
NA, "0,84525106", "0,76221263", NA, "0,74075742", "0,09400641",
"0,72202944", "0,98205262"), gluco = c("0,20201604", NA, "0,32943499",
"0,42898807", NA, "0,38604586", "0,06216771", NA, "0,50682091",
"0,77352235", NA, NA, NA, NA, NA, "0,54360374", "0,62983336",
"0,29108942", NA, "0,68860814", "0,49901387", "0,85213969"),
dias = c("0,17955655", "0,38582622", NA, "0,93897514", "0,99899076",
NA, "0,07829624", "0,61645303", "0,96770222", "0,660479",
"0,4553325", "0,22181609", NA, NA, NA, NA, "0,84592719",
"0,3863479", "0,38117415", NA, "0,19967415", "0,34450998"
), bp = c("0,14100962", "0,12918231", "0,43667462", NA, "0,44168223",
NA, "0,69755251", NA, "0,97403356", "0,11554399", NA, "0,08527658",
NA, "0,78501791", NA, "0,40595426", "0,10640734", NA, "0,58849266",
"0,88895224", NA, "0,98962306"), date = structure(c(18108,
16435, 17865, 17835, 17743, 15825, 16831, 16860, 17682, 18016,
17256, 17439, 17621, 17865, 17955, 16313, 15948, 17562, 17865,
16313, 17774, 16039), class = "Date")), row.names = c(NA,
-22L), class = "data.frame")