Search code examples
rdata-cleaning

Cleaning dataset based on a reference file


I am trying to clean data_set by replacing some values for some variables with the new values which is included in the Changes_file after matching the ID and the variable name.

The ID is unique in the data_set but it could be duplicated in the changes_file

Example:

> data_set = data.frame(ID = c('20','22','25','60','40','44'), Gender = c('Female','Female','Female','Male','Male','Male'), Age = c('41','45','50','52','36','61'))

> data_set
  ID Gender Age
1 20 Female  41
2 22 Female  45
3 25 Female  50
4 60   Male  52
5 40   Male  36
6 44   Male  61

> Changes_file = data.frame(ID = c('22','22','60','40'), Variables = c('Gender','Age','Age','Gender'), New_Information  = c('Male','50','17','Female'))
> Changes_file
  ID Variables New_Information
1 22    Gender            Male
2 22       Age              50
3 60       Age              17
4 40    Gender          Female

The result is expected to be:

> data_set_new
  ID Gender Age
1 20 Female  41
2 22   Male  50
3 25 Female  50
4 60   Male  17
5 40 Female  36
6 44   Male  61

The data I am working on has more than 10,000 rows and more than 300 variables, and I just mentioned a simple example above.

I appreciate any advice or suggestions for this process and especially for the 'matching' part.

Thank you,


Solution

  • We can do this by pivoting your changes data so it's the same structure as the original data and then using a couple rows_* operations:

    library(tidyr)
    library(dplyr)
    
    changes_wide = Changes_file |>
      pivot_wider(names_from = Variables, values_from = New_Information)
    
    common_cols = intersect(names(changes_wide), names(data_set))
    
    changes_wide |>
      ## "patch" the changes file first so the NAs don't overwrite 
      ## non-missing values
      rows_patch(y = data_set[common_cols], by = "ID", unmatched = "ignore") |>
      ## then update the original with the patched changes
      rows_update(x = data_set, by = "ID")
    #   ID Gender Age
    # 1 20 Female  41
    # 2 22   Male  50
    # 3 25 Female  50
    # 4 60   Male  17
    # 5 40 Female  36
    # 6 44   Male  61