Search code examples
javascriptstringsubstringindexof

Find position of substring that almost matches


Let's say I have the following text:

const input = `Hello world!

This is a random text.
I don't know...
nd ths lines got som spelling mitsakes`

And I want to find the position of I don't know.... In that case, I could just just do this:

const position = input.indexOf(`I don't know...`)

What I'm having issues with is finding a substring that doesn't perfectly match. This for example wouldn't yield a match:

const position = input.indexOf(`And this line's got some spelling mistakes.`)

Now what I'm looking for is a way to find a given substring and allow it to differ by a given percentage.

For example:

// Find position that matches by at least 50%:
const position = indexOfSimilar(input, `And this line's got some spelling mistakes.`, 0.5)

// Find position that matches by at least 99%:
const position = indexOfSimilar(input, `And this line's got some spelling mistakes.`, 0.99)

Any ideas how this could be implemented?


Solution

  • As some commenters mentioned about the Levenshtein distance, it could probably be a starting point, by implementing an algorithm based on that. A simpler solution would be to use a sliding window and find the matching substring and its index by the weight of matching pattern in the input string, for example:

    const input = `Hello world!
    
    This is a random text.
    I don't know...
    nd ths lines got som spelling mitsakes`
    
    const match = (input, pattern) => {
      const ids = new Map;
      for (let i = 0; i < input.length; i++) {
        const s = input.slice(i, i + pattern.length);
        const n = [...s].reduce((a, c, i) => (a += (c === pattern[i] ? c.charCodeAt(0) * (i + 1) : 0)), 0);
        ids.set(i, n);
      }
      const matches = [...ids].sort(([, n1], [, n2]) => n2 - n1);
      const [id, n] = matches[0];
      const max = [...pattern].reduce((a, c, i) => a += c.charCodeAt(0) * (i + 1), 0);
      const weight = Math.floor(n / max * 100);
      return {weight, id, slice: input.slice(id, id + pattern.length)};
    };
    
    console.log(match(input, `qwerty uiop`));
    console.log(match(input, `This is a random text.`));
    console.log(match(input, `And this line's got some spelling mistakes.`));

    Sure, the results of this simple solution heavily depends on the length of pattern and matched substring. The Levenshtein distance solution returns more accurate results:

    const input = `Hello world!
    
    This is a random text.
    I don't know...
    nd ths lines got som spelling mitsakes`
    
    const levenshtein = (a, b) => {
       const la = a.length, lb = b.length,
       aa = [...a, 0], ab = [...b, 0],
       map = ab.map(() => [...aa]);
       aa.forEach((v, i) => map[0][i] = i);
       ab.forEach((v, i) => map[i][0] = i);
       for (let j = 1; j <= lb; j++) {
          for (let i = 1; i <= la; i++) {
             const n = a[i - 1] === b[j - 1] ? 0 : 1;
             map[j][i] = Math.min(
                map[j][i - 1] + 1, map[j - 1][i] + 1, map[j - 1][i - 1] + n
             );
          }
       }
       return map[lb][la];
    };
    
    const match = (input, pattern) => {
      const ids = new Map, lp = pattern.length;
      for (let i = 0; i < input.length; i++) {
        const s = input.slice(i, i + lp);
        const n = levenshtein(s, pattern);
        ids.set(i, n);
      }
      const matches = [...ids].sort(([, n1], [, n2]) => n1 - n2);
      const [id, n] = matches[0];
      const weight = Math.floor((lp - n) / lp * 100);
      return {weight, id, slice: input.slice(id, id + lp)};
    };
    
    console.log(match(input, `qwerty uiop`));
    console.log(match(input, `This is a random text.`));
    console.log(match(input, `And this line's got some spelling mistakes.`));