We have a large database of partial URLs (strings) such as:
"example1.com"
"example2.com/test.js"
"/foo.js"
Our software listens for HTTP requests and tries to find one of our database's partial URLs in the HTTP request's full URL.
So we are getting full URLs (e.g. http://www.example.com/blah.js?foo=bar) and trying to match one of our database's partial patterns against them.
Which would be the best data structure to store our partial-URL database in, if all we care about is search speed?
Right now, this is what we do:
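Roughly speaking, we loop over every partial URL in the database and substring-test it against the full URL. A minimal sketch of that naive approach (urlArray and isListed are placeholder names, not our actual code):

    // Naive linear scan: O(number of patterns) substring tests per request.
    function isListed(fullUrl, urlArray) {
      return urlArray.some(function (partialUrl) {
        return fullUrl.indexOf(partialUrl) !== -1;
      });
    }

    isListed("http://www.example2.com/test.js",
             ["example1.com", "example2.com/test.js", "/foo.js"]); // true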
UPDATE:
This software is a Firefox extension written in JavaScript on Firefox's Add-on SDK.
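For reference, a minimal sketch of how we hook requests with the Add-on SDK (assuming the sdk/system/events module and the http-on-modify-request notification; the matching logic itself is left out):

    // Observe outgoing HTTP requests and extract the full URL.
    var events = require("sdk/system/events");
    var { Ci } = require("chrome");

    events.on("http-on-modify-request", function (event) {
      var channel = event.subject.QueryInterface(Ci.nsIHttpChannel);
      var fullUrl = channel.URI.spec; // e.g. "http://www.example.com/blah.js?foo=bar"
      // ...look fullUrl up against the partial-URL database here...
    });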
Assuming that your partial strings are only domain names and/or page names, you could try to generate all possible combinations from the URL, starting from the end:
http://www.example.com/blah.js?foo=bar
blah.js
example.com/blah.js
www.example.com/blah.js
Then hash all the combinations and look each of them up in a hash set containing the hashes of all the partial strings from your database; that way each candidate costs one constant-time lookup instead of a scan over the whole database.
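A sketch of that idea in JavaScript (combinations, db, and isListed are illustrative names; note that a plain object already hashes its keys, so the candidate strings can be used directly as keys instead of hashing them by hand):

    // Build candidate keys for a full URL, starting from the end:
    // the page name alone, the path, then each host suffix with and without the path.
    function combinations(fullUrl) {
      var stripped = fullUrl.replace(/^[a-z]+:\/\//i, "").split(/[?#]/)[0];
      var slash = stripped.indexOf("/");
      var host = slash === -1 ? stripped : stripped.slice(0, slash);
      var path = slash === -1 ? "" : stripped.slice(slash);
      var labels = host.split(".");
      var keys = [];
      if (path) {
        keys.push(path.slice(path.lastIndexOf("/") + 1)); // "blah.js"
        keys.push(path);                                  // covers patterns like "/foo.js"
      }
      for (var i = labels.length - 2; i >= 0; i--) {
        var suffix = labels.slice(i).join(".");
        keys.push(suffix);        // "example.com" - covers domain-only patterns
        keys.push(suffix + path); // "example.com/blah.js"
      }
      return keys;
    }

    // Store the database once as object keys (hashed lookups in JS),
    // so each candidate check is O(1).
    var db = Object.create(null);
    ["example1.com", "example2.com/test.js", "/foo.js"].forEach(function (p) {
      db[p] = true;
    });

    function isListed(fullUrl) {
      return combinations(fullUrl).some(function (key) {
        return key in db;
      });
    }

    isListed("http://www.example2.com/test.js"); // true (matches "example2.com/test.js")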
NOTE:
In case you want to match ANY string in the URL, like ample in example.com, it becomes a little complicated in terms of storage, because the number of possible combinations of characters drawn from a URL is C(n, k), where n is the length of the URL and k is the length of the string to find. According to this SO question the maximum reasonable length of a URL is 2000 characters, and since you'd want to match strings of any length, k varies from 1 to 2000. That results in an enormous number of hashes generated from the URL - the sum of C(2000, k) for each k from 1 to 2000, or more precisely 2000! / (k! * (2000 - k)!) different hashes for each value of k.
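That said, a match like ample in example.com only needs contiguous substrings, of which there are just n - k + 1 per length k - about n * (n + 1) / 2 in total, i.e. roughly two million candidates for a 2000-character URL. Still a lot, but far below the combinatorial figure above. A quick sketch (countSubstrings is an illustrative helper; the formula assumes contiguous substrings only):

    // Contiguous substrings of a string of length n:
    // (n - k + 1) of each length k, which sums to n * (n + 1) / 2.
    function countSubstrings(n) {
      return n * (n + 1) / 2;
    }

    console.log(countSubstrings(2000)); // 2001000 candidate hashes per URL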