I am trying to clean data_set by replacing some values for some variables with the new values which is included in the Changes_file after matching the ID and the variable name.
The ID is unique in the data_set but it could be duplicated in the changes_file
Example:
> data_set = data.frame(ID = c('20','22','25','60','40','44'), Gender = c('Female','Female','Female','Male','Male','Male'), Age = c('41','45','50','52','36','61'))
> data_set
ID Gender Age
1 20 Female 41
2 22 Female 45
3 25 Female 50
4 60 Male 52
5 40 Male 36
6 44 Male 61
> Changes_file = data.frame(ID = c('22','22','60','40'), Variables = c('Gender','Age','Age','Gender'), New_Information = c('Male','50','17','Female'))
> Changes_file
ID Variables New_Information
1 22 Gender Male
2 22 Age 50
3 60 Age 17
4 40 Gender Female
The result is expected to be:
> data_set_new
ID Gender Age
1 20 Female 41
2 22 Male 50
3 25 Female 50
4 60 Male 17
5 40 Female 36
6 44 Male 61
The data I am working on has more than 10,000 rows and more than 300 variables, and I just mentioned a simple example above.
I appreciate any advice or suggestions for this process and especially for the 'matching' part.
Thank you,
We can do this by pivoting your changes data so it's the same structure as the original data and then using a couple rows_*
operations:
library(tidyr)
library(dplyr)
changes_wide = Changes_file |>
pivot_wider(names_from = Variables, values_from = New_Information)
common_cols = intersect(names(changes_wide), names(data_set))
changes_wide |>
## "patch" the changes file first so the NAs don't overwrite
## non-missing values
rows_patch(y = data_set[common_cols], by = "ID", unmatched = "ignore") |>
## then update the original with the patched changes
rows_update(x = data_set, by = "ID")
# ID Gender Age
# 1 20 Female 41
# 2 22 Male 50
# 3 25 Female 50
# 4 60 Male 17
# 5 40 Female 36
# 6 44 Male 61