I have a set of documents, each annotated with a set of tags, which may contain spaces. The user supplies a set of possibly misspelled tags and I wants to find the documents with the highest number of matching tags (optionally weighted).
There are several thousand documents and tags but at most 100 tags per document.
I am looking on a lightweight and performant solution where the search should be fully on the client side using JavaScript but some preprocessing of the index with node.js is possible.
My idea is to create an inverse index of tags to documents using a multiset, and a fuzzy index that that finds the correct spelling of a misspelled tag, which are created in a preprocessing step in node.js and serialized as JSON files. In the search step, I want to consult for each item of the query set first the fuzzy index to get the most likely correct tag, and, if one exists to consult the inverse index and add the result set to a bag (numbered set). After doing this for all input tags, the contents of the bag, sorted in descending order, should provide the best matching documents.
My Questions
You should be able to achieve what you want using Lunr, here is a simplified example (and a jsfiddle):
var documents = [{
id: 1, tags: ["foo", "bar"],
},{
id: 2, tags: ["hurp", "durp"]
}]
var idx = lunr(function (builder) {
builder.ref('id')
builder.field('tags')
documents.forEach(function (doc) {
builder.add(doc)
})
})
console.log(idx.search("fob~1"))
console.log(idx.search("hurd~2"))
This takes advantage of a couple of features in Lunr:
The results will be sorted by which document best matches the query, in simple terms, documents that contain more matching tags will rank higher.
Is it better to keep the fuzzy step separate from the inverted index or is there a way to combine them?
As ever, it depends. Lunr maintains two data structures, an inverted index and a graph. The graph is used for doing the wildcard and fuzzy matching. It keeps separate data structures to facilitate storing extra information about a term in the inverted index that is unrelated to matching.
Depending on your use case, it would be possible to combine the two, an interesting approach would be a finite state transducers, so long as the data you want to store is simple, e.g. an integer (think document id). There is an excellent article talking about this data structure which is similar to what is used in Lunr - http://blog.burntsushi.net/transducers/