Search code examples
rdataframetextmatchingstringr

Compare 2 Text column of 2 Data frame by consecutive letter in R


I've the following 2 dataset: The original dataset and the matching dataset. The matching dataset is used to match against the original dataset.

Original Dataset
MelbourneAir
MelbourneCityAir
AirMelbourne
LondonAir
londonterminal
Airportlondon
NewyorkAirport
Airnewyork
newyorkterminal
airnorway
terminalnorway
swedenair
airsweden
swedenbus
shopbanana
bananashop
bananaashop
appleshop
shopapple
Original_df = structure(list(df.Original.Data = structure(c(11L, 12L, 1L, 9L, 
10L, 4L, 13L, 2L, 14L, 15L, 3L, 20L, 18L, 5L, 19L, 17L, 8L, 7L, 
6L, 16L), .Label = c("AirMelbourne", "airnewyork", "airnorway", 
"AirportLondon", "airsweden", "appleshop", "bananaashop", "bananashop", 
"LondonAir", "LondonTerminal", "MelbourneAir", "MelbourneCityAir", 
"newyorkairport", "newyorkterminal", "norwayterminal", "shopapple", 
"shopbanana", "swedenair", "swedenbus", "terminalnorway"), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L))
Matching Dataset
MelbourneAirport
LondonTerminal
NewYorkAirport
NorwayTerminal
SwedenAirport
BananaShop
AppleShop
Matching_df = structure(list(df.Matching.Data = structure(c(5L, 4L, 6L, 7L, 
8L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("", "AppleShop", "BananaShop", "LondonTerminal", 
"MelbourneAirport", "NewYorkAirport", "NorwayTerminal", "SwedenAirport"
), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L))

head(matching_df,7)

Does anyone know how I can use R and it's packages (stringdist, stringr, stringi, tidyr, dplyr or base R) to do the matching based on consecutive letters (i.e. as long as the text in original dataset matches the text in the matching dataset by 5(or 6) consecutive letters, R will match it accordingly to the following table? I've been using Excel so far to do the matching but given the limit to how much data excel can handle, using R would be more efficient.

Original Dataset Matched Dataset
MelbourneAir MelbourneAirport
MelbourneCityAir MelbourneAirport
AirMelbourne MelbourneAirport
LondonAir LondonTerminal
londonterminal LondonTerminal
Airportlondon LondonTerminal
NewyorkAirport NewYorkAirport
Airnewyork NewYorkAirport
newyorkterminal NewYorkAirport
airnorway NorwayTerminal
terminalnorway NorwayTerminal
swedenair SwedenAirport
airsweden SwedenAirport
swedenbus SwedenAirport
shopbanana BananaShop
bananashop BananaShop
bananaashop BananaShop
appleshop AppleShop
shopapple AppleShop

Thanks in advance!


Solution

  • stringdist indeed is a way to calculate the distance between two strings.

    #load stringdist package
    library(stringdist)
    #calculate "Optimal string alignment" distance between two strings
    sapply(1:nrow(original_df), function(x) stringdist(original_df$df.Original.Data[x], matching_df$df.Matching.Data[x], method = "osa"))
    
    distlist <- list()
    for(i in 1:nrow(original_df)) {
      distlist[[i]] <- stringdist(tolower(original_df$df.Original.Data[i]), tolower(matching_df$df.Matching.Data), method = "osa")
    }
    minindexes <- sapply(distlist, which.min)
    original_df$result <- Matching_df$df.Matching.Data[minindexes]
    
    
    > original_df
       df.Original.Data           result
    1      MelbourneAir MelbourneAirport
    2  MelbourneCityAir MelbourneAirport
    3      AirMelbourne        AppleShop
    4         LondonAir   LondonTerminal
    5    LondonTerminal   LondonTerminal
    6     AirportLondon        AppleShop
    7    newyorkairport   NewYorkAirport
    8        airnewyork        AppleShop
    9   newyorkterminal   LondonTerminal
    10   norwayterminal   NorwayTerminal
    11        airnorway       BananaShop
    12   terminalnorway    SwedenAirport
    13        swedenair    SwedenAirport
    14        airsweden        AppleShop
    15        swedenbus    SwedenAirport
    16       shopbanana       BananaShop
    17       bananashop       BananaShop
    18      bananaashop       BananaShop
    19        appleshop        AppleShop
    20        shopapple        AppleShop
    

    Several methods are available, see stringdist-metrics

    enter image description here