Let's say I have a collection of different URLs in an array:
var source = ['www.xyz.com/Product/1', 'www.xyz.com/Product/3', 'www.xyz.com/Category/1', 'somestring']
What would be a good way to iterate over the array and group similar strings into a separate array? The desired output from the example above would be:
var output = [
['www.xyz.com/Product/1', 'www.xyz.com/Product/3'],
['www.xyz.com/Category/1'],
['somestring']
];
Conditions
source
can be random stringsI found the string-similarity library which gives the possibility to compare one string against a collection of strings. One way would be to iterate over the source, compare each item against the source collection and apply a rule to group items with a similar score. However I guess this would be terrible inefficient.
Can someone suggest me an efficient way to accomplish what I need?
The best solution I can think of is to compare the strings with each other and test how different they are. There is an algorithm that does that, which is the Levenshtein distance algorithm:
The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
We can easily create a Levenshtein filter on top of fast-levenshtein module:
const levenshtein = require('fast-levenshtein');
const levenshteinFilter = (source, maximum = 5) => {
let _source, matches, x, y;
_source = source.slice();
matches = [];
for (x = _source.length - 1; x >= 0; x--) {
let output = _source.splice(x, 1);
for (y = _source.length - 1; y >= 0; y--) {
if (levenshtein.get(output[0], _source[y]) <= maximum) {
output.push(_source[y]);
_source.splice(y, 1);
x--;
}
}
matches.push(output);
}
return matches;
}
let source = ['www.xyz.com/Product/1', 'www.xyz.com/Product/3', 'www.xyz.com/Category/1', 'somestring'];
let output = levenshteinFilter(source);
// [ [ 'www.xyz.com/Product/1', 'www.xyz.com/Product/3' ],
// [ 'www.xyz.com/Category/1' ],
// [ 'somestring' ] ]
You can define the maximum acceptable distance in the 2 argument of the function (defaults to 5).