Search code examples
rrecord-linkagejaro-winkler

How to group similar strings together in a database in R


I have a tibble of just 1 column called 'title'.

> dat
# A tibble: 13 x 1
   title                                          
   <chr>                                          
 1 lymphoedema clinic                             
 2 zostavax shingles vaccine                      
 3 xray operator                                  
 4 workplace mental health wellbeing workshop     
 5 zostavax recall toolkit                        
 6 xray meetint                                   
 7 workplace mental health and wellbeing          
 8 lymphoedema early intervenstion                
 9 lymphoedema expo                               
10 lymphoedema for breast care nurses             
11 xray meeting and case studies                  
12 xray online examination                        
13 xray operator in service paediatric extremities

I wish to find similar records and group them together as such (all the while keeping their indices):

> dat
# A tibble: 13 x 1
   title                                          
   <chr>                                          
 1 lymphoedema clinic   
 8 lymphoedema early intervenstion                
 9 lymphoedema expo                               
10 lymphoedema for breast care nurses                             
 2 zostavax shingles vaccine 
 5 zostavax recall toolkit                                
 3 xray operator                                  
 6 xray meetint     
11 xray meeting and case studies                  
12 xray online examination                        
13 xray operator in service paediatric extremities
 4 workplace mental health wellbeing workshop                                  
 7 workplace mental health and wellbeing          

I'm using the below function to find strings that are close enough to each other (cutoff = 0.75)

compareJW <- function(string1, string2, cutoff)
{
  require(RecordLinkage)
  jarowinkler(string1, string2) > cutoff
}

I've implemented the loop below to 'send' similar records together in a new dataframe but it's not working properly, I've tried a few variations but nothing is working yet.

# create new database
newDB <- data.frame(matrix(ncol = ncol(dat), nrow = 0))
colnames(newDB) <- names(dat)
newDB <- as_tibble(newDB)

for(i in 1:nrow(dat))
{
  # print(dat$title[i])

  for(j in 1:nrow(dat))
  {
    print(dat$title[i])
    print(dat$title[j])
    # score <- jarowinkler(dat$title[i], dat$title[j])

    if(dat$title[i] != dat$title[j]
       &&
       compareJW(dat$title[i], dat$title[j], 0.75))
    {
      print("if")

      # newDB <- rbind(newDB, 
      #                dat$title[i],
      #                dat$title[j])
    }
    else
    {
      print("else")
      # newDB <- rbind(newDB, dat$title[i])
    }
  }
}

(I've inserted prints in the loop 'to see what's happening')

REPRODUCIBLE DAT:

dat <- 
structure(list(title = c("lymphoedema clinic", "zostavax shingles vaccine", 
                         "xray operator", "workplace mental health wellbeing workshop", 
                         "zostavax recall toolkit", "xray meetint", "workplace mental health and wellbeing", 
                         "lymphoedema early intervenstion", "lymphoedema expo", "lymphoedema for breast care nurses", 
                         "xray meeting and case studies", "xray online examination", "xray operator in service paediatric extremities"
)), row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"
))

Any suggestions please? EDIT: I'd also like a new index column called 'group' as below:

> dat
# A tibble: 13 x 1
index   group    title                                          
                 <chr>                                          
 1       1   lymphoedema clinic   
 8       1   lymphoedema early intervenstion                
 9       1   lymphoedema expo                               
10       1   lymphoedema for breast care nurses                             
 2       2   zostavax shingles vaccine 
 5       2   zostavax recall toolkit                                
 3       3   xray operator                                  
 6       3   xray meetint     
11       3   xray meeting and case studies                  
12       3   xray online examination                        
13       3   xray operator in service paediatric extremities
 4       4   workplace mental health wellbeing workshop                                  
 7       4   workplace mental health and wellbeing          

Solution

  • I'm afraid I've never tried RecordLinkage, but if you're just using the Jaro-Winkler distance it should also be fairly easy to cluster similar strings with the stringdist package. Using your dput above:

    library(tidyverse)
    library(stringdist)
    
    map_dfr(dat$title, ~ {
        i <- which(stringdist(., dat$title, "jw") < 0.40)
        tibble(index = i, title = dat$title[i])
    }, .id = "group") %>%
        distinct(index, .keep_all = T) %>% 
        mutate(group = as.integer(group))
    

    Explanation: map_dfr iterates over each string in dat$title, extracts the indices of the closest matches computed by stringdist (constrained by 0.40, i.e. your "threshold"), creates a tibble with the indices and matches, then stacks these tibbles with a group variable corresponding to the integer position (and row number) of the original string. distinct then drops any cluster duplicates based on repeats of index.

    Output:

    # A tibble: 13 x 3
       group index title                                          
       <int> <int> <chr>                                          
     1     1     1 lymphoedema clinic                             
     2     1     8 lymphoedema early intervenstion                
     3     1     9 lymphoedema expo                               
     4     1    10 lymphoedema for breast care nurses             
     5     2     2 zostavax shingles vaccine                      
     6     2     5 zostavax recall toolkit                        
     7     2    11 xray meeting and case studies                  
     8     3     3 xray operator                                  
     9     3     6 xray meetint                                   
    10     3    12 xray online examination                        
    11     3    13 xray operator in service paediatric extremities
    12     4     4 workplace mental health wellbeing workshop     
    13     4     7 workplace mental health and wellbeing          
    

    An interesting alternative would be to use tidytext with widyr to tokenize by word and compute the cosine similarity of the titles based on similar words, rather than characters as above.