Asymmetric partial matching of text strings between two dataframes

I have two dataframes:

df1 is based on survey responses and includes a non-restricted field for users to add their location in the UK (or refuse to do so) formatted as so (not real data):

Name	Location
Alice	London/Liverpool
Bob	Manchester
Jim	No
David	I am based in Cardiff

I have another dataframe (df2) that is a single column list of UK cities and towns, like so:

city
Manchester
Cardiff
London
Brighton
Nottingham

What I want to do is write code that will:

Loop through df1
For each entry in the location column, scan df2 and if ANY of the cities in df2 are exactly replicated in df1, add a new column(s) to df 1 pulling out the first match like so:

Name	Location	Location_precise
Alice	London, or Brighton	London
Bob	Manchester	Manchester
Jim	No	-
David	I am based in Cardiff	Cardiff

I can then use the location_precise column for further analysis with standardised names. What's tripping me up with regular left_join/lapply functions is that I want to allow partial matching in one direction (if a city in df2 is partially contained in df1) but NOT in the other direction (if an answer in df1$location is part of a city in df2, e.g. 'no' and 'nottingham' in the above example.

I don't know how to adapt regular dplyr functions to this because there is no common identifier in the two datasets - and therefore every join attempt ends in various errors. I am otherwise completely at my wits' end; I have tried using grep:

df1$location_precise <- lapply(df1$Location, grep, df2$city,value=T)

But this doesn't pick up partial matches without whitespace, like 'London/Glasgow'

I've also tried agrep, but this falsely identifies partial matches, like 'No' = 'Nottingham', and still doesn't pick up many other partial matches

Solution

Always remember to place a reproducible example.

I have taken your tables (df1 and df2) and added an extra case to both:

df1 <- data.frame(Name = LETTERS[1:6], 
                  Location = c("London/Liverpool", 
                               "London/Glasgow",
                               "I born in Manchester",
                               "I am from Chester",
                               "No",
                               "I am based in Cardiff"))

df2 <- data.frame(city = c("Manchester",
                           "Cardiff",
                           "London",
                           "Chester",
                           "Glasgow",
                           "Brighton",
                           "Nottingham"))


df1$Location_precise <- strsplit(x = df1$Location, split = "[[:punct:]]|[[:space:]]") |> 
  
  lapply(FUN = \(locations, patterns){
    
    sapply(X = locations, FUN = \(loc){
      patterns[sapply(X = sprintf("^%s$", patterns), 
                      FUN = grepl, x = loc, ignore.case = TRUE)]  
    })
  }, patterns = df2$city) |> 
  
  lapply(FUN = unlist) |> 
  
  sapply(FUN = \(x) if(length(x) == 0) "-" else if(length(x) == 1) x else if(length(x) > 1) paste(x, collapse = ", "))

The output:

> df1
  Name              Location Location_precise
1    A      London/Liverpool           London
2    B        London/Glasgow  London, Glasgow
3    C  I born in Manchester       Manchester
4    D     I am from Chester          Chester
5    E                    No                -
6    F I am based in Cardiff          Cardiff

It may not be the most elegant example, but you could try it and we'll refine it. For cases where there is more than one match, I have it show all matches and separate them with commas.

UPDATE: I have extended the steps and this time it is now able to identify complete words. However, when I talk about identifying complete words, I mean that the script now cuts and searches in each word of each element of df1$Location. For this you must keep in mind that 1) a (complete) word is defined as a set of characters contained either between two blanks or between two punctuation marks (check what characters are included in [:space:] and [:punct:] in this link), 2) because of the above, if there is a Man-chester, Man,chester, Man/chester, etc., the script will assign Chester to it and 3) this is computationally somewhat more expensive (because the search is done word by word), so if you plan to apply it over millions of values it may be slow.