Tags: r, string, data-cleaning, recode, fuzzy-comparison

R - Fuzzy find and recode


I am cleaning demographic data submitted by 10+ school districts, and the submissions are not standardized. I would like to find patterns and recode them so that the data is clean and simple.

Let's say I have a variable called Race, and one of the categories is Native Hawaiian - Pacific Islander.

School A submits this category as Native Hawaiian or Other Pacific Islander. School B submits this category as Native Hawaiian/Pacific Islander. School C submits this category as Native Hawaiian or Pacific Islander.

How could I recode this such that if R sees the word Pacific anywhere in the variable, it will recode to Native Hawaiian - Pacific Islander?

Here is the original data:

df_original <- data.frame(Race=c("Native Hawaiian or Other Pacific Islander",
                                 "Native Hawaiian/Pacific Islander", "Native Hawaiian or Pacific Islander",
                                 "Black or African American", "Black", "Black/African American"))

Here is the ideal cleaned data:

df_desired <- data.frame(Race=c("Native Hawaiian - Pacific Islander","Native Hawaiian - Pacific Islander",
                                "Native Hawaiian - Pacific Islander","Black - African American",
                                "Black - African American","Black - African American"))

Solution

  • grepl() returns TRUE for strings that contain "Pacific" and FALSE otherwise (the match is case-sensitive by default). Use that logical vector to subset the column and assign the string you want:

    df_original$Race[grepl("Pacific", df_original$Race)] <- "Native Hawaiian - Pacific Islander"
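The same idea can be extended to cover every category in one pass. The following is only a sketch, assuming that a single keyword per category ("pacific", "black") is enough to identify it; the recode_map vector and the df_clean name are hypothetical and not part of the original question. It sets stringsAsFactors = FALSE because in R versions before 4.0 data.frame() turns strings into factors by default, and assigning a label that is not an existing factor level produces NA.

```r
# Original data, kept as character rather than factor
df_original <- data.frame(
  Race = c("Native Hawaiian or Other Pacific Islander",
           "Native Hawaiian/Pacific Islander",
           "Native Hawaiian or Pacific Islander",
           "Black or African American", "Black", "Black/African American"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: keyword pattern (name) -> standardized label (value)
recode_map <- c(
  "pacific" = "Native Hawaiian - Pacific Islander",
  "black"   = "Black - African American"
)

df_clean <- df_original
for (pattern in names(recode_map)) {
  # Match against the untouched original column so that an already
  # recoded label can never be matched again by a later pattern
  hits <- grepl(pattern, df_original$Race, ignore.case = TRUE)
  df_clean$Race[hits] <- recode_map[[pattern]]
}
```

Values that match none of the patterns are left unchanged, so running unique(df_clean$Race) afterwards is a quick way to spot categories that still need a rule.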