Search code examples
javascriptpythonfull-text-searchstring-matchingfuzzy-search

How do I approximate search multiple terms in a string in JS?


I have a string S and a list of strings allItems, allItems contains strings that may have common "sub-words" but one element is never an extension of another:

//good, they both contain Fuzzy but Fuzzy isn't in allItems
const allItems = "FuzzyBunny", "FuzzyBear"

//not good BCS allItems[i] = whatever1+allItems[j]+whatever2
const allItems = "Fuzzy", "FuzzyBear", "FuzzyBunny" , "FuzzyBearBunny"

My goal is to find every match or approximate match of a string in allItems in S alongside their index(can be start or end, or ideally both). I've been searching for some algorithms to do this, similar to the aho-corasick algorithm, but that doesn't do approximate string matching.

Example:

S = "I love FuzzyBears and FuzzyDucks"
allItems = ["FuzzyBear", "FuzzyDuck"]

->

[
 { match: "FuzzyBear", matchIndex: 0, startIndex: 7, endIndex: 16 },
 { match: "FuzzyDuck", matchIndex: 1, startIndex: 22, endIndex: 31 }
]

I'm still pretty new to pattern matching so I'd appreciate some resources on how to code any of the recommended algorithms.

UPDATE: I've found a Fuzzified Aho-Corasick automata as described here, but I have very little idea on how to implement this in JS.I also have no issues with the code being slow as I use this for a one time run, and rarely need to do this often.


Solution

  • Python version (fuzzy-matching) [splits string by spaces]:

    import difflib
    
    s = "I love FuzzyBears and FuzzyDucks"
    allitems = ["FuzzyBear", "FuzzyDuck"]
    
    lst = []
    for x in allitems:
      d = {}
      match = difflib.get_close_matches(x, s.split())
      if match:
        d["match"] = x
        d["matchIndex"] = allitems.index(x)
        d["startIndex"] = s.find(x)
        d["endIndex"] = d["startIndex"] + len(x)
      lst.append(d)
    print(lst)
    
    >>>
    [{'match': 'FuzzyBear', 'matchIndex': 0, 'startIndex': 7, 'endIndex': 16}, {'match': 'FuzzyDuck', 'matchIndex': 1, 'startIndex': 22, 'endIndex': 31}]
    

    Python version (fuzzy-matching) [split by all possible substrings]:

    import difflib
    from itertools import combinations
    
    s = "I love FuzzyBears and FuzzyDucks"
    allitems = ["FuzzyBear", "FuzzyDuck"]
    
    lst = []
    for x in allitems:
      d = {}
      match = difflib.get_close_matches(x, [x[i:j] for i, j in combinations(range(len(x) + 1), r=2)])
      if match:
        d["match"] = x
        d["matchIndex"] = allitems.index(x)
        d["startIndex"] = s.find(x)
        d["endIndex"] = d["startIndex"] + len(x)
      lst.append(d)
    print(lst)
    
    >>>
    [{'match': 'FuzzyBear', 'matchIndex': 0, 'startIndex': 7, 'endIndex': 16}, {'match': 'FuzzyDuck', 'matchIndex': 1, 'startIndex': 22, 'endIndex': 31}]
    

    After a bunch of online searching:

    JS version (exact-matching):

    const s = "I love FuzzyBears and FuzzyDucks";
    const allitems = ["FuzzyBear", "FuzzyDuck"];
    const lst = [];
    for (const i of allitems) {
      const x = {};
      if (s.includes(i)) {
        x["match"] = i;
        x["matchIndex"] = allitems.indexOf(i);
        x["startIndex"] = s.indexOf(i);
        x["endIndex"] =  s.indexOf(i) + i.length;
       }
      lst.push(x);
    }
    console.log(lst);
    
    >>>
    [ { match: 'FuzzyBear', matchIndex: 0, startIndex: 7, endIndex: 16 },
      { match: 'FuzzyDuck',
        matchIndex: 1,
        startIndex: 22,
        endIndex: 31 } ]