I feel like there should be an easy way to do this but I've hit a dead end.
I have a large text dataset, and I want to know which countries are mentioned in each document. Sometimes it will say "afghanistan", sometimes "afghan", but since those are referring to the same country I want to only str_extract the first mention of either of those words. I have a pattern vector that therefore looks like this:
pattern <- c("afghanistan|afghan", "algeria|algerian", "albania|albanian", "angola|angolan", "argentina|argentine")
text <- c("the first stop on the trip is afghanistan, where he will meet the afghan president", "then he will leave afghanistan and head to argentina", "meetings with the afghan president in afghanistan should last 1 hour, and meetings with the argentine president in argentina should last 2 hours")
The goal is a series of vectors/df column that looks like the following:
c("afghanistan")
c("afghanistan", "argentina")
c("afghan", "argentine")
I originally made a long match pattern for all of the countries and nationalities all together and used str_extract_all() + unique() - this worked perfectly except when a text used both "afghanistan" and "afghan", in which case that country would be double counted.
I've tried various versions of map(), mapply(), etc., and it usually results in a list filled with character(0).
The closest I've gotten is a for loop:
country <- as.character(1:length(pattern)) #placeholder vector
for(i in 1:length(pattern)){
country[i] = str_extract(text, pattern[i])
}
This gives a vector of the correct length, but filled with NAs.
Any ideas on how to iterate a str_extract() call like this would be appreciated!
You can just remove the NA values. For example, using map():
library(purrr)
library(stringr)
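text |>
  # for each document, try every country pattern; entries are NA where a pattern does not match
  map(function(t) map_chr(pattern, ~str_extract(t, .))) |>
  # drop the NA entries so only the matched countries remain
  map(~.x[!is.na(.x)])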
# [[1]]
# [1] "afghanistan"
#
# [[2]]
# [1] "afghanistan" "argentina"
#
# [[3]]
# [1] "afghan" "argentine"
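As for why the loop in the question comes back mostly NA: str_extract(text, pattern[i]) is vectorised over text, so it returns one value per document (three here), and country[i] can only keep the first of them. In other words, the loop only ever looks at the first document. If you would rather avoid purrr, here is a rough sketch of the same idea with lapply(), relying on str_extract() also being vectorised over pattern. The names countries, doc and hits are just placeholders I made up for the example.

```r
library(stringr)

pattern <- c("afghanistan|afghan", "algeria|algerian", "albania|albanian",
             "angola|angolan", "argentina|argentine")
text <- c(
  "the first stop on the trip is afghanistan, where he will meet the afghan president",
  "then he will leave afghanistan and head to argentina",
  "meetings with the afghan president in afghanistan should last 1 hour, and meetings with the argentine president in argentina should last 2 hours"
)

# Iterate over documents (outer) and let str_extract() handle the patterns (inner).
countries <- lapply(text, function(doc) {
  # one result per country pattern: the first match in this document, or NA
  hits <- str_extract(doc, pattern)
  # keep only the countries that were actually mentioned
  hits[!is.na(hits)]
})

countries
```

This should give the same list as the map() version. One thing to keep in mind with patterns like "algeria|algerian": the alternatives are tried left to right, so "algerian" in a text would be extracted as "algeria". That still counts the country once, which seems to be what you are after.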