Search code examples
arraysnode.jsstringcomparison

Group similar strings from an array in Node.js


Let's say I have a collection of different URLs in an array:

var source = ['www.xyz.com/Product/1', 'www.xyz.com/Product/3', 'www.xyz.com/Category/1', 'somestring']

What would be a good way to iterate over the array and group similar strings into a separate array? The desired output from the example above would be:

var output = [
    ['www.xyz.com/Product/1', 'www.xyz.com/Product/3'],
    ['www.xyz.com/Category/1'],
    ['somestring']
];

Conditions

  • All items within source can be random strings
  • The logic must be able to compare and group around 100'000 items in a meaningful time

I found the string-similarity library which gives the possibility to compare one string against a collection of strings. One way would be to iterate over the source, compare each item against the source collection and apply a rule to group items with a similar score. However I guess this would be terrible inefficient.

Can someone suggest me an efficient way to accomplish what I need?


Solution

  • The best solution I can think of is to compare the strings with each other and test how different they are. There is an algorithm that does that, which is the Levenshtein distance algorithm:

    The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.


    We can easily create a Levenshtein filter on top of fast-levenshtein module:

    const levenshtein = require('fast-levenshtein'); 
    
    const levenshteinFilter = (source, maximum = 5) => {
      let _source, matches, x, y;
      _source = source.slice();
      matches = [];
      for (x = _source.length - 1; x >= 0; x--) {
        let output = _source.splice(x, 1);
        for (y = _source.length - 1; y >= 0; y--) {
          if (levenshtein.get(output[0], _source[y]) <= maximum) {
            output.push(_source[y]);
            _source.splice(y, 1);
            x--;
          }
        }
        matches.push(output);
      }
      return matches;
    }
    
    let source = ['www.xyz.com/Product/1', 'www.xyz.com/Product/3', 'www.xyz.com/Category/1', 'somestring'];
    let output = levenshteinFilter(source);
    // [ [ 'www.xyz.com/Product/1', 'www.xyz.com/Product/3' ],
    //   [ 'www.xyz.com/Category/1' ],
    //   [ 'somestring' ] ]
    

    You can define the maximum acceptable distance in the 2 argument of the function (defaults to 5).