Search code examples
rsearchmatchtidyversegrepl

Search for the inclusion of specific text across multiple dataframes, and return those values in a new column (with multiple occurrences)


Looking for some assistance on searching for multiple specific words from one dataframe, within another dataframe's column (body of text) and subsequently pulling those values out into a new column.

To explain further:

  • First, I have a dataframe including a huge list of text summaries across fourteen countries.
  • Secondly, I have a second dataframe with all administrative level (lvl_2) names, like provinces, villages, etc.
  • I want to basically extract any mentions of these specific adm2 provinces/village names from the large summaries, and create a new column with each of those words, pivoted longer.

Here's some sample data that you can use to recreate my problem, with two dataframes: (1) test_admin for the list of admin levels that I want to search for, and (2) test_dataset$Summary which is the column I want to run the search on. (You can ignore values of Other_Variables, those are populated with a whole lot of values in the real dataset)

test_admin <- data.frame(adm1_name = c("Sindh"),
                   adm2_name = c("Central Karachi", "Dadu", "East Karachi", "Ghotki", "Sujawal", "Sukkur"))
                   
test_dataset <- data.frame(Summary = c("In Cox's Bazar, this and that happened.",
                                       "In Yangon, something else happened",
                                       "In Central Karachi, this happened",
                                       "In Sindh, this happened",
                                       "In Dadu AND East Karachi, this happened"),
                           Other_Variable_1 = 1:5,
                           Other_Variable_2 = 1:5)

To further complicate things, I'd be interested in also being able to search for values from two columns on the test_admin dataframe. For example, if you the value "Sindh" comes up from the adm1_level column, it would be super cool to return all results under adm2_level too.

But if you can solve it at the more basic level (just searching one column), I would also be very satisfied with that.

The output I'd be looking for would be something like the dataframe below, which would have also returned multiple rows for where multiple values turned up.

                                   Summary Other_Variable_1 Other_Variable_2       Locations
1  In Cox's Bazar, this and that happened.                1                1            <NA>
2       In Yangon, something else happened                2                2            <NA>
3        In Central Karachi, this happened                3                3 Central Karachi
4                  In Sindh, this happened                4                4 Central Karachi
5                  In Sindh, this happened                4                4            Dadu
6                  In Sindh, this happened                4                4    East Karachi
7                  In Sindh, this happened                4                4          Ghotki
8                  In Sindh, this happened                4                4         Sujawal
9                  In Sindh, this happened                4                4          Sukkur
10 In Dadu AND East Karachi, this happened                5                5            Dadu
11 In Dadu AND East Karachi, this happened                5                5    East Karachi

I tried with some mutate and grepl functions, but came up short. The other examples I found seemed to only have this work for exact values or single searches. Thanks for the help!

#tidyverse solutions preferred


Solution

  • Here is one way to do this :

    library(tidyverse)
    
    map_df(seq(nrow(test_dataset)), function(i) {
      inds <- str_detect(test_dataset$Summary[i], test_admin$adm1_name) | 
                 str_detect(test_dataset$Summary[i], test_admin$adm2_name)
      if(any(inds)) tibble(test_dataset[i, ], Locations = test_admin$adm2_name[inds])
        else tibble(test_dataset[i, ], Locations = NA)
    })
    
    #  Summary                                 Other_Variable_1 Other_Variable_2 Locations      
    #   <chr>                                              <int>            <int> <chr>          
    # 1 In Cox's Bazar, this and that happened.                1                1 NA             
    # 2 In Yangon, something else happened                     2                2 NA             
    # 3 In Central Karachi, this happened                      3                3 Central Karachi
    # 4 In Sindh, this happened                                4                4 Central Karachi
    # 5 In Sindh, this happened                                4                4 Dadu           
    # 6 In Sindh, this happened                                4                4 East Karachi   
    # 7 In Sindh, this happened                                4                4 Ghotki         
    # 8 In Sindh, this happened                                4                4 Sujawal        
    # 9 In Sindh, this happened                                4                4 Sukkur         
    #10 In Dadu AND East Karachi, this happened                5                5 Dadu           
    #11 In Dadu AND East Karachi, this happened                5                5 East Karachi   
    

    For each value in Summary we check if it matches either adm1_name or adm2_name. If any of the row matches we include corresponding Location value in the output or else return NA.