The requirements are:
<u>
and </u>
hereFor the sake of simplicity answers can concentrate on case-insensitive search through a list including only ASCII characters and assume that the term delimiter is a plain space - ie. a query entered as "Foo bar baz" means the search terms are ['foo', 'bar', 'baz']
.
To clarify:
The final application is (perhaps not surprisingly) an autocomplete of sorts.
TL;DR
Most recent fiddle comparing the proposed algorithms side by side:
https://jsfiddle.net/Mikk3lRo/ndeuqn02/7/
(feel free to update this link if you add a new algorithm)jsPerf comparing algorithms in a somewhat more realistic / representative way - a few strings are basically "entered" one character at a time on each rep:
https://jsperf.com/comparison-of-algorithms-to-search-and-highlightAt this point it is clear (thanks to trincot's excellent base-comparison) that the majority of time used by the original implementations was spent on DOM-output. Its significance has been minimized as much as possible in the fiddle.
There is still a clear difference in performance between the algorithms in each search, but not one of them is consistently fastest on every keystroke. After revisiting and cleaning up my own "Divide and Conquer" it does outperform the others consistently in any realistic scenario I try though.
Tigregalis introduced the idea of a pre-run optimization, which seems reasonable given that options are unlikely to change between keystrokes. I have added (a function for) this to all methods here. The only algorithm where I saw an obvious benefit from it was in Skirtle's Permutations, but I'll encourage each answerer to consider if it might be useful for their own algorithms.
Some algorithms will be much easier to adapt than others. It is still my opinion that this will be more important than the minor performance differences in a real implementation.
Note that the current version of Tigregalis' Shrinking Set has a bug - I've excluded it from fiddle and jsperf until that is fixed.
In theory this can be solved by "manually" constructing a RegExp that contains every possible permutation of the search terms separated by a capturing group to catch anything between terms - a search for "foo bar" results in (foo)(.*?)(bar)|(bar)(.*?)(foo)
.
The highlighting is then done in one pass with string.replace()
. If there is any change in the string we have a match.
Demo:
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
var terms, terms_esc;
function viral_permutations() {
var t0, t1, i, permuted, res_elm, meta_elm, regex_string, regex, li, option, match_groups, highlighted;
meta_elm = document.getElementById('viral_permutations_meta');
res_elm = document.getElementById('viral_permutations_result');
res_elm.innerHTML = meta_elm.innerHTML = '';
t0 = performance.now();
//Split query in terms at delimiter and lowercase them
terms = document.getElementById('viral_permutations').value.split(/\s/).filter(function(n) {
return n;
}).map(function(term){
return term.toLowerCase();
});
meta_elm.innerHTML += 'Terms: ' + JSON.stringify(terms) + '<br>';
//Escape terms
terms_esc = terms.map(function(term) {
return term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
});
//Wrap terms in in individual capturing groups
terms_esc = terms.map(function(term) {
return '(' + term + ')';
});
//Find all permutations
permuted = permutate_array(terms_esc);
//Construct a group for each permutation
match_groups = [];
for (var i in permuted) {
match_groups.push(permuted[i].join('(.*?)'));
}
try {
//Construct the final regex
regex_string = match_groups.join('|');
//Display it
document.getElementById('viral_permutations_regex').innerHTML = regex_string;
meta_elm.innerHTML += 'RegExp length: ' + regex_string.length + '<br>';
regex = new RegExp(regex_string, 'i');
//Loop through each option
for (i = 0; i < options.length; i++) {
option = options[i];
//Replace the terms with highlighted terms
highlighted = option.replace(regex, viral_permutations_replace);
//If anything was changed (or the query is empty) we have a match
if (highlighted !== option || terms.length === 0) {
//Append it to the result list
li = document.createElement('li');
li.innerHTML = highlighted;
res_elm.appendChild(li);
}
}
//Print some meta
t1 = performance.now();
meta_elm.innerHTML += 'Time: ' + (Math.round((t1 - t0) * 100) / 100) + 'ms';
} catch(e) {
meta_elm.innerHTML += '<span style="color:red">' + e.message + '</span>';
}
}
//The replacement function
function viral_permutations_replace() {
var i, m, j, retval, m_cased, unmatched;
retval = '';
//Make a clone of the terms array (that we can modify without destroying the original)
unmatched = terms.slice(0);
//Loop arguments between the first (which is the full match) and
//the last 2 (which are the offset and the full option)
for (i = 1; i < arguments.length - 1; i++) {
m = arguments[i];
//Check that we have a string - most of the arguments will be undefined
if (typeof m !== 'string') continue;
//Lowercase the match
m_cased = m.toLowerCase();
//Append it to the return value - highlighted if it is among our terms
j = unmatched.indexOf(m_cased);
if (j >= 0) {
//Remove it from the unmatched terms array
unmatched.splice(j, 1);
retval += '<u>' + m + '</u>';
} else {
retval += m;
}
}
return retval;
}
//Helper function to return all possible permutations of an array
function permutate_array(arr) {
var perm, perm_intern;
perm_intern = function(perm, pre, post, n) {
var elem, i, j, ref, rest;
if (n > 0) {
for (i = j = 0, ref = post.length; 0 <= ref ? j < ref : j > ref; i = 0 <= ref ? ++j : --j) {
rest = post.slice(0);
elem = rest.splice(i, 1);
perm_intern(perm, pre.concat(elem), rest, n - 1);
}
} else {
perm.push(pre);
}
};
perm = [];
perm_intern(perm, [], arr, arr.length);
return perm;
}
viral_permutations();
<input type="text" id="viral_permutations" onkeyup="viral_permutations()">
<p id="viral_permutations_meta"></p>
<pre style="width:100%;overflow:auto;background:#eee" id="viral_permutations_regex"></pre>
<ul style="height:300px;overflow:auto" id="viral_permutations_result"></ul>
Thanks to trincot for pointing out that my original version would occasionally highlight a recurring term twice - which is fixed in this snippet.
Fails because:
Error: regexp too big
...(foo|bar)(.*)(foo|bar)
Fails because:
The food in the food court
would match though it obviously shouldn't.The food in the food bar
would find foo
twice and never get to bar
.(foo|bar|baz)(.*?)((?!\1)(?:foo|bar|baz))(.*?)((?!\1|\3)(?:foo|bar|baz))
Fails because:
foo
, bar
or bar
which is not a foo
nor a bar
".(?=.*foo)(?=.*bar)(?=.*baz)
Fails because:
Somewhat more complicated than the single-regex Viral Permutations strategy - this recursive algorithm searches for each term one after the other, starting with the longest term.
Each time a match is found it divides that "bite" into three (unless at the beginning or end), marking the matched "bite" as consumed, and attempts to match the next-longest term in any of the unconsumed "bites".
When it is unable to find the longest unmatched term, it will backtrack and attempt to match the previous term at a different position (even in a different "bite").
If it comes back to the longest term and cannot find another position to match it at, it will return false.
This means that in most cases it can return non-matches pretty quickly, simply because they don't even contain the longest term.
Of course if it runs out of terms - ie. successfully matches the shortest - it will return the highlighted match by joining all the "bites" back together.
Updated for improved performance: The base algorithm is exactly the same, but there were some pretty expensive calls to arr.slice()
that could be avoided completely.
function divide_and_conquer_replace(query, options, separator) {
var terms, terms_esc;
//The inner replacement function
function divide_and_conquer_inner(bites, depth) {
var this_term, i, bite, match, new_bites, found_all_others;
depth = depth ? depth : 1;
//Get the longest remaining term
this_term = terms_esc[terms_esc.length - depth];
//Loop all the bites
for (i = 0; i < bites.length; i++) {
bite = bites[i];
//Reset the lastIndex since we're reusing the RegExp objects
this_term.lastIndex = 0;
//Check that we have a string (ie. do not attempt to match bites
//that are already consumed)
if (typeof bite === 'string') {
//Find the next matching position (if any)
while (match = this_term.exec(bite)) {
new_bites = (i > 0) ? bites.slice(0, i) : [];
if (match.index > 0) {
new_bites.push(bite.slice(0, match.index));
}
new_bites.push(['<u>' + match[0] + '</u>']);
if (this_term.lastIndex < bite.length) {
new_bites.push(bite.slice(this_term.lastIndex));
}
if (i < bites.length - 1) {
new_bites = new_bites.concat(bites.slice(i + 1));
}
if (terms_esc.length > depth) {
//Attempt to find all other terms
found_all_others = divide_and_conquer_inner(new_bites, depth + 1);
//If we found all terms we'll pass the modified string all the
//way up to the original callee
if (found_all_others) {
return found_all_others;
}
//Otherwise try to match current term somewhere else
this_term.lastIndex = match.index + 1;
} else {
//If no terms remain we have a match
return new_bites.join('');
}
}
}
}
//If we reach this point at least one term was not found
return null;
};
// Split query in terms at delimiter
terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
//Sort terms according to length - longest term last
terms.sort(function(a, b) {
return a.length - b.length;
});
//Escape terms
//And store RegExp's instead of strings
terms_esc = terms.map(function (term) {
return term.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
}).map(function (term) {
return new RegExp(term, 'gi');
});
//Loop through each option
return options.map(function(option){
return divide_and_conquer_inner([option]);
}).filter(Boolean);
}
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
var separator = ' ';
function divide_and_conquer(){
var query = document.getElementById('divide_and_conquer').value;
var res_elm = document.getElementById('divide_and_conquer_result');
var t0 = performance.now();
var results = divide_and_conquer_replace(query, options, separator);
var t1 = performance.now();
document.getElementById('divide_and_conquer_meta').innerHTML = 'Time: ' + (t1 - t0).toFixed(2) + 'ms';
res_elm.innerHTML = '';
for (var result of results) {
res_elm.innerHTML += '<li>' + result + '</li>';
}
};
divide_and_conquer();
<input type="text" id="divide_and_conquer" onkeyup="divide_and_conquer()">
<p id="divide_and_conquer_meta"></p>
<ul style="height:300px;overflow:auto" id="divide_and_conquer_result"></ul>
This strategy has performance issues when the query consists exclusively of (usually very short) strings that are all / most present in many of the options - such as a a a a a a a a
...
In realistic scenarios it is currently outperforming the other proposed algorithms - see the link to jsperf added to the question.