Looking for some assistance on searching for multiple specific words from one dataframe, within another dataframe's column (body of text) and subsequently pulling those values out into a new column.
To explain further:
Here's some sample data that you can use to recreate my problem, with two dataframes: (1) test_admin
for the list of admin levels that I want to search for, and (2) test_dataset$Summary
which is the column I want to run the search on. (You can ignore values of Other_Variables, those are populated with a whole lot of values in the real dataset)
test_admin <- data.frame(adm1_name = c("Sindh"),
adm2_name = c("Central Karachi", "Dadu", "East Karachi", "Ghotki", "Sujawal", "Sukkur"))
test_dataset <- data.frame(Summary = c("In Cox's Bazar, this and that happened.",
"In Yangon, something else happened",
"In Central Karachi, this happened",
"In Sindh, this happened",
"In Dadu AND East Karachi, this happened"),
Other_Variable_1 = 1:5,
Other_Variable_2 = 1:5)
To further complicate things, I'd be interested in also being able to search for values from two columns on the test_admin
dataframe. For example, if you the value "Sindh" comes up from the adm1_level column, it would be super cool to return all results under adm2_level too.
But if you can solve it at the more basic level (just searching one column), I would also be very satisfied with that.
The output I'd be looking for would be something like the dataframe below, which would have also returned multiple rows for where multiple values turned up.
Summary Other_Variable_1 Other_Variable_2 Locations
1 In Cox's Bazar, this and that happened. 1 1 <NA>
2 In Yangon, something else happened 2 2 <NA>
3 In Central Karachi, this happened 3 3 Central Karachi
4 In Sindh, this happened 4 4 Central Karachi
5 In Sindh, this happened 4 4 Dadu
6 In Sindh, this happened 4 4 East Karachi
7 In Sindh, this happened 4 4 Ghotki
8 In Sindh, this happened 4 4 Sujawal
9 In Sindh, this happened 4 4 Sukkur
10 In Dadu AND East Karachi, this happened 5 5 Dadu
11 In Dadu AND East Karachi, this happened 5 5 East Karachi
I tried with some mutate and grepl functions, but came up short. The other examples I found seemed to only have this work for exact values or single searches. Thanks for the help!
#tidyverse solutions preferred
Here is one way to do this :
library(tidyverse)
map_df(seq(nrow(test_dataset)), function(i) {
inds <- str_detect(test_dataset$Summary[i], test_admin$adm1_name) |
str_detect(test_dataset$Summary[i], test_admin$adm2_name)
if(any(inds)) tibble(test_dataset[i, ], Locations = test_admin$adm2_name[inds])
else tibble(test_dataset[i, ], Locations = NA)
})
# Summary Other_Variable_1 Other_Variable_2 Locations
# <chr> <int> <int> <chr>
# 1 In Cox's Bazar, this and that happened. 1 1 NA
# 2 In Yangon, something else happened 2 2 NA
# 3 In Central Karachi, this happened 3 3 Central Karachi
# 4 In Sindh, this happened 4 4 Central Karachi
# 5 In Sindh, this happened 4 4 Dadu
# 6 In Sindh, this happened 4 4 East Karachi
# 7 In Sindh, this happened 4 4 Ghotki
# 8 In Sindh, this happened 4 4 Sujawal
# 9 In Sindh, this happened 4 4 Sukkur
#10 In Dadu AND East Karachi, this happened 5 5 Dadu
#11 In Dadu AND East Karachi, this happened 5 5 East Karachi
For each value in Summary
we check if it matches either adm1_name
or adm2_name
. If any of the row matches we include corresponding Location
value in the output or else return NA
.