Search code examples
javascriptdatabaseinformation-retrievalquerying

Will using a database improve performance when sorting entries by string similarity?


I have a 11 MB JSON file that looks like this:

[{
  "name": "Guayabal de Síquima",
  "country": "Colombia",
  "population": 1051,
  "timezone": "America/Bogota"
}, {
  "name": "Maracaibo",
  "country": "Venezuela",
  "population": 19637,
  "timezone": "America/Caracas"
}]

I do a query to sort the cities by name like this:

cityList.sort((city1,city2) => 
  (stringSimilarity(city1.name, query) - population) -
  (stringSimilarity(city2.name, query) - population))

I'm also creating an array with only countries and cities per timezone, so that I can do the same thing, but with countries. It's a pretty big list with 137,530 cities. Is there any advantage to using a database for this? I already get pretty satisfactory results, but I don't know if I can speed things up a tiny bit to get more overhead for new features.

The use is: users will enter a city, and they'll get a piece of information that they need about it. Since so many cities share the same name, I order them by string similarity, and then by population, which will probably get the most relevant result, but I return 5 anyway to be sure.

It's already pretty fast. I don't know how fast but it's less than a second, including startup to load the document to memory and parse it. Once it's started it's very fast as well. It's just not instantaneous. I use https://www.npmjs.com/package/string-similarity


Solution

  • I do not know what your stringSimilarity does, however, even it does something like BM25 which usually DBs do, using DB will be slower. If it is something else, you have to customize a DB (or search engine) if allowed. It will be slower. If not allowed, you do what you do, calculating similarities and sorting, on DB entries. It will be even slower.

    DBs use binary files and they heavily depend on cache (e.g., B-tree indexing) because file IO is much slower than memory. Use DB when you cannot put everything on memory or when you do not want waste your memory.

    Additionally, compareFunction for your sorting does not need to be subtracted by population because it is for relative comparison.

    cityList.sort((city1,city2) => 
      stringSimilarity(city1.name, query) -
      stringSimilarity(city2.name, query)
    )