Tags: r, comparison, fuzzy-search, stringdist, jaro-winkler

JaroWinkler Method --> Identifying Character/Numeric spots in a string


I am working on a problem to identify if a specified string has the correct format. I am attempting to use a fuzzy matching technique, JaroWinkler, to find the similarity score between a reference string and the strings of interest.

The correct format for the string follows this order (N=number, C=character): NNNCCCCCC

I found a similar problem on another StackOverflow question and edited the code a little here:

library(RecordLinkage)
library(dplyr)
library(stringdist)

ref <-c('123ABCDEF')
words <-c("456GHIJKL","123ABCDEF","78D78DAA2","660ABCDEF")

wordlist <- expand.grid(words = words, ref = ref, stringsAsFactors = FALSE)

df <- wordlist %>% 
        group_by(words) %>% 
        mutate(match_score = jarowinkler(words, ref))

df <- as.data.frame(df)
df

I know the JaroWinkler method is used for identifying common characters and considering string distance, but I'm not sure if this is the best method. Ideally, I'd like for the first and last elements in the words vector to be classified as correct and receive scores of 1 since they have the NNNCCCCCC format.

However, when I run this code, I get the following:

      words       ref match_score
1 456GHIJKL 123ABCDEF   0.0000000
2 123ABCDEF 123ABCDEF   1.0000000
3 78D78DAA2 123ABCDEF   0.3148148
4 660ABCDEF 123ABCDEF   0.7777778

Is there a better method for this type of matching exercise? Any help would be appreciated! Thank you!


Solution

  • As suggested in the comment above, I would use exact string matching with a regular expression. The only open question for now is what you mean by "characters": only letters from A-Z, or also other symbols such as punctuation? If only letters, see the code below.

    library(tidyverse)
    
    words <-c("456GHIJKL","123ABCDEF","78D78DAA2","660ABCDEF")
    
    str_detect(words, "[[:digit:]]{3}(?=[[:alpha:]]{6})")
    

    which gives:

    [1]  TRUE  TRUE FALSE  TRUE
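
    One caveat: the pattern above is not anchored, so it also accepts longer strings that merely contain three digits followed by six letters. If the whole string has to be exactly NNNCCCCCC, an anchored variant may be closer to what you want. The following is a minimal sketch in base R; `is_valid_format` and the `extra` test strings are made up for illustration and are not part of the question.

    ```r
    # Hypothetical helper: TRUE only if the entire string is 3 digits + 6 letters.
    is_valid_format <- function(x) {
      # ^ and $ anchor the match to the start and end of the string
      grepl("^[[:digit:]]{3}[[:alpha:]]{6}$", x)
    }

    words <- c("456GHIJKL", "123ABCDEF", "78D78DAA2", "660ABCDEF")
    is_valid_format(words)
    # [1]  TRUE  TRUE FALSE  TRUE

    # Made-up inputs that contain a valid-looking run inside a longer string;
    # an unanchored pattern would accept both of these.
    extra <- c("X123ABCDEFG", "1234ABCDEF")
    is_valid_format(extra)
    # [1] FALSE FALSE
    ```

    Note that `[[:alpha:]]` also matches lowercase letters; if only uppercase A-Z should count, `[A-Z]` could be used instead.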
    

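    Updated to reflect the OP's changed pattern. Lookaheads are zero-width, so chaining several of them after the digits tests the same position each time; the pattern below consumes the characters instead. It assumes (inferred from the new test string "660A7CDEF", not stated explicitly) that the fifth character may be either a letter or a digit.

    words <-c("456GHIJKL","123ABCDEF","78D78DAA2","660ABCDEF", "660A7CDEF")
    
    str_detect(words, "^[[:digit:]]{3}[[:alpha:]][[:alpha:][:digit:]][[:alpha:]]{4}$")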
    

    gives:

    [1]  TRUE  TRUE FALSE  TRUE  TRUE
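
    On why Jaro-Winkler does not fit here: it scores shared characters and their positions, not character classes, which is why "456GHIJKL" scores 0 against "123ABCDEF" even though both have the NNNCCCCCC format. If you still want a score column like the one in the question, one option is to translate each string into its "shape" using the question's N/C notation and compare that to the reference shape. This is only a sketch in base R; `to_shape` is a hypothetical helper name, and it treats every letter as C and every digit as N.

    ```r
    # Hypothetical helper: map each letter to "C" and each digit to "N".
    to_shape <- function(x) {
      x <- gsub("[[:alpha:]]", "C", x)  # letters first, so the "N"s added next are not remapped
      gsub("[[:digit:]]", "N", x)
    }

    words <- c("456GHIJKL", "123ABCDEF", "78D78DAA2", "660ABCDEF")

    df <- data.frame(words = words, stringsAsFactors = FALSE)
    df$shape <- to_shape(df$words)
    # 1 if the shape is exactly NNNCCCCCC, 0 otherwise
    df$match_score <- as.integer(df$shape == "NNNCCCCCC")
    df
    ```

    With the four strings from the question, the first, second and fourth rows get a `match_score` of 1 and "78D78DAA2" (shape NNCNNCCCN) gets 0. Characters that are neither letters nor digits are left unchanged in the shape, so such strings never match.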