Search code examples
jsondatabasecouchdbapache-drill

Accessing objects from json files on disk


I have ~500 json files on my disk that represents hotels all over the world, each around 30 mbs, all objects have the same structure.

At certain points in my spring server I require to get the information of a single hotel, let's say via code (which is inside the json object).

The data is read only, but I might get updates from the hotels providers at certain times, like extra json files or delta changes.

Now I don't want to migrate my json files to a relational database that's for sure, so I've been investigating in the best solution to achieve what I want.

I tried Apache Drill because querying straight from json files made me think less headaches of dealing with the data, I did a directory query using Drill, something like:

SELECT * FROM dfs.'C:\hotels\' WHERE code='1b3474';

but this obviously does not seem to be the most efficient way for me as it takes around 10 seconds to fetch a single hotel.

At the moment I'm trying out Couch DB, but I'm still learning it. Should I migrate all the hotels to a single document (makes a bit of sense to me)? Or should I consider each hotel a document?

I'm just looking for pointers on what is a good solution to achieve what I want, so here to take your opinion.


Solution

  • The main issue here is that json files do not have indexes associated with them, and Drill does not create indexes for them. So whenever you do a query like SELECT * FROM dfs.'C:\hotels\' WHERE code='1b3474'; Drill has no choice but to read each json file and parse and process all the data in each file. The more files and data you have, the longer this query will take. If you need to do lookups like this often, I would suggest not using Drill for this use case. Some alternatives are:

    • A relational database where you have an index built for the code column.
    • A key value store where code is the key.