Search code examples
rdataframedplyrnastring-matching

Replacing NAs in a dataframe based on a partial string match (in another dataframe) in R


Goal: To change a column of NAs in one dataframe based on a "key" in another dataframe (something like a VLookUp, except only in R)

Given df1 here (For Simplicity's sake, I just have 6 rows. The key I have is 50 rows for 50 states):

Index State_Name Abbreviation
1 California CA
2 Maryland MD
3 New York NY
4 Texas TX
5 Virginia VA
6 Washington WA

And given df2 here (This is just an example. The real dataframe I'm working with has a lot more rows) :

Index State Article
1 NA Texas governor, Abbott, signs new abortion bill
2 NA Effort to recall California governor Newsome loses steam
3 NA New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4 NA Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5 NA DC statehood unlikely as Manchin opposes
6 NA Amazon HQ2 causing housing prices to soar in northern Virginia

Task: To create an R function that loops and reads the state in each df2$Article row; then cross-reference it with df1$State_Name to replace the NAs in df2$State with the respective df1$Abbreviation key based on the state in df2$Article. I know it's quite a mouthful. I'm stuck with how to start, and finish this puzzle. Hard-coding is not an option as the real datasheet I have have thousands of rows like this, and will update as we add more articles to text-scrape.

The output should look like:

Index State Article
1 TX Texas governor, Abbott, signs new abortion bill
2 CA Effort to recall California governor Newsome loses steam
3 NY New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4 MD Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5 NA DC statehood unlikely as Manchin opposes
6 VA Amazon HQ2 causing housing prices to soar in northern Virginia

Note: The fifth entry with DC is intended to be NA.

Any links to guides, and/or any advice on how to code this is most appreciated. Thank you!


Solution

  • You can create create a regex pattern from the State_Name and use str_extract to extract it from Article. Use match to get the corresponding Abbreviation name from df1.

    library(stringr)
    
    df2$State <- df1$Abbreviation[match(str_extract(df2$Article, 
                   str_c(df1$State_Name, collapse = '|')), df1$State_Name)]
    df2$State
    #[1] "TX" "CA" "NY" "MD" NA   "VA"
    

    You can also use inbuilt state.name and state.abb instead of df1 to get state name and abbreviations.


    Here's a way to do this in for loop -

    for(i in seq(nrow(df1))) {
      inds <- grep(df1$State_Name[i], df2$Article)
      if(length(inds)) df2$State[inds] <- df1$Abbreviation[i]
    }
    df2
    
    #  Index State                                                                      Article
    #1     1    TX                              Texas governor, Abbott, signs new abortion bill
    #2     2    CA                     Effort to recall California governor Newsome loses steam
    #3     3    NY New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
    #4     4    MD     Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
    #5     5  <NA>                                     DC statehood unlikely as Manchin opposes
    #6     6    VA               Amazon HQ2 causing housing prices to soar in northern Virginia