I have a dataset that looks like this in R:
name = c("john", "john", "john", "alex", "alex", "peter", "peter", "peter", "peter")
year = c(2010, 2011, 2015, 2014, 2016, 2010, 2011, 2012, 2013)
age = c(21, 21, 21, 55, 55, 61, 61, 61, 61)
problem_data = data.frame(name, year, age)
name year age
1 john 2010 21
2 john 2011 21
3 john 2015 21
4 alex 2014 55
5 alex 2016 55
6 peter 2010 61
7 peter 2011 61
8 peter 2012 61
9 peter 2013 61
In this dataset, the age of each person at the last recorded year has been erroneously inserted at each row. For example - in reality:
Sometimes years are missing - as a result:
I am trying to research a way in R that can handle such a task. I have been trying to combine "cumulative group differences" and "max row conditions" - but I am not sure how these concepts can be combined together to achieve this:
# https://stackoverflow.com/questions/39237345/subtract-value-in-previous-row-for-each-section-of-a-data-frame-that-corresponds
library(dplyr)
new_data = problem_data %>%
group_by(name) %>%
mutate(real_age= age - lag(age, default = age[1]))
Bur this approach has made everyone's age as 0!
# A tibble: 9 x 4
# Groups: name [3]
name year age real_age
<chr> <dbl> <dbl> <dbl>
1 john 2010 21 0
2 john 2011 21 0
3 john 2015 21 0
4 alex 2014 55 0
5 alex 2016 55 0
6 peter 2010 61 0
7 peter 2011 61 0
8 peter 2012 61 0
9 peter 2013 61 0
Can someone please show me how to fix this problem?
Thank you!
Group by 'name', use complete
to get the missing 'year', fill
the NA
elements in 'year', 'age' with previous non-NA values, subtract the 'age' from the sequence of group index
library(dplyr)
library(tidyr)
problem_data %>%
group_by(name) %>%
complete(year = full_seq(year, period = 1)) %>%
fill(year, age, .direction = "downup") %>%
mutate(real_age= age - (row_number() - 1)) %>%
ungroup