amazon-s3 apache-drill

How to keep S3 data in memory with Apache Drill?


Querying JSON data stored on AWS S3 with Apache Drill works great, but Drill fetches the data fresh from S3 for every query.

How can I tell Drill to keep the data in memory for the next query?


Solution

  • The best solution I have found is to use http://tachyon-project.org/. It uses a RAM drive to store the data, so data from S3 is fetched only once; after that, Apache Drill reads the data directly from Tachyon. Setting up Tachyon seems complex at first, but in the end you only need to change about 6 lines in the config and copy one Tachyon jar into Drill.

    UPDATE 2016-07-22
    After some testing I found Tachyon overly complex. I now use the sync tool from the S3 SDK and a RAM drive on Linux and OS X to keep the data quickly accessible. It works very well so far.

    UPDATE 2018-02-09
    In the end we settled on a Linux RAM drive, which works very well.
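
The RAM drive approach from the updates could look roughly like the sketch below. It is not the exact setup used above, just one way to do it under some assumptions: the AWS CLI (`aws s3 sync`) is installed and configured, `/dev/shm` is a RAM-backed tmpfs mount (common on Linux, not available on OS X), and the bucket name, prefix, and target directory are placeholders.

```python
import subprocess
from pathlib import Path
from typing import List

# Assumption: /dev/shm is a RAM-backed tmpfs mount (typical on Linux).
# The directory name "drill-data" is a placeholder.
RAM_ROOT = Path("/dev/shm/drill-data")


def build_sync_command(bucket: str, prefix: str, target: Path) -> List[str]:
    """Build the `aws s3 sync` command that mirrors an S3 prefix into target."""
    source = f"s3://{bucket}/{prefix.strip('/')}"
    # --delete removes local files that no longer exist in S3,
    # so the RAM copy does not keep stale data around.
    return ["aws", "s3", "sync", source, str(target), "--delete"]


def sync_to_ram(bucket: str, prefix: str, target: Path = RAM_ROOT) -> Path:
    """Create the RAM-backed directory and mirror the S3 prefix into it."""
    target.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the AWS CLI exits with an error.
    subprocess.run(build_sync_command(bucket, prefix, target), check=True)
    return target


if __name__ == "__main__":
    # "my-bucket" and "events/json" are hypothetical names.
    local_dir = sync_to_ram("my-bucket", "events/json")
    print(f"Data mirrored to {local_dir}")
```

Drill can then read the files through its local file system storage plugin instead of S3, for example with a query such as SELECT * FROM dfs.`/dev/shm/drill-data` LIMIT 10 (assuming the default `dfs` plugin is enabled). Rerun the sync, for instance from cron, whenever the S3 data changes. Keep in mind that a RAM drive is emptied on reboot, so the sync has to run again after a restart.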