Goal: To change a column of NAs in one dataframe based on a "key" in another dataframe (something like a VLookUp, except only in R)
Given df1 here (For Simplicity's sake, I just have 6 rows. The key I have is 50 rows for 50 states):
Index | State_Name | Abbreviation |
---|---|---|
1 | California | CA |
2 | Maryland | MD |
3 | New York | NY |
4 | Texas | TX |
5 | Virginia | VA |
6 | Washington | WA |
And given df2 here (This is just an example. The real dataframe I'm working with has a lot more rows) :
Index | State | Article |
---|---|---|
1 | NA | Texas governor, Abbott, signs new abortion bill |
2 | NA | Effort to recall California governor Newsome loses steam |
3 | NA | New York governor, Cuomo, accused of manipulating Covid-19 nursing home data |
4 | NA | Hogan (Maryland, R) announces plans to lift statewide Covid restrictions |
5 | NA | DC statehood unlikely as Manchin opposes |
6 | NA | Amazon HQ2 causing housing prices to soar in northern Virginia |
Task: To create an R function that loops and reads the state in each df2$Article row; then cross-reference it with df1$State_Name to replace the NAs in df2$State with the respective df1$Abbreviation key based on the state in df2$Article. I know it's quite a mouthful. I'm stuck with how to start, and finish this puzzle. Hard-coding is not an option as the real datasheet I have have thousands of rows like this, and will update as we add more articles to text-scrape.
The output should look like:
Index | State | Article |
---|---|---|
1 | TX | Texas governor, Abbott, signs new abortion bill |
2 | CA | Effort to recall California governor Newsome loses steam |
3 | NY | New York governor, Cuomo, accused of manipulating Covid-19 nursing home data |
4 | MD | Hogan (Maryland, R) announces plans to lift statewide Covid restrictions |
5 | NA | DC statehood unlikely as Manchin opposes |
6 | VA | Amazon HQ2 causing housing prices to soar in northern Virginia |
Note: The fifth entry with DC is intended to be NA.
Any links to guides, and/or any advice on how to code this is most appreciated. Thank you!
You can create create a regex pattern from the State_Name
and use str_extract
to extract it from Article
. Use match
to get the corresponding Abbreviation
name from df1
.
library(stringr)
df2$State <- df1$Abbreviation[match(str_extract(df2$Article,
str_c(df1$State_Name, collapse = '|')), df1$State_Name)]
df2$State
#[1] "TX" "CA" "NY" "MD" NA "VA"
You can also use inbuilt state.name
and state.abb
instead of df1
to get state name and abbreviations.
Here's a way to do this in for
loop -
for(i in seq(nrow(df1))) {
inds <- grep(df1$State_Name[i], df2$Article)
if(length(inds)) df2$State[inds] <- df1$Abbreviation[i]
}
df2
# Index State Article
#1 1 TX Texas governor, Abbott, signs new abortion bill
#2 2 CA Effort to recall California governor Newsome loses steam
#3 3 NY New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
#4 4 MD Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
#5 5 <NA> DC statehood unlikely as Manchin opposes
#6 6 VA Amazon HQ2 causing housing prices to soar in northern Virginia