
Using HDFS to store files of different sizes


I have a rather theoretical question.

My team is developing and supporting a medium-sized Java application (currently ~400k lines) that deals heavily with binary files. We currently store all our data on an FS storage. We developed a small "framework" that should let us scale the file storage in the future; however, I strongly suspect that storing our data on a Windows/Linux filesystem will remain a bottleneck (needless to say, reinventing the wheel in distributed data processing and then relying on it does not seem like a really good solution :)).

The size of the data we deal with ranges from 1-2 MB per file to hundreds of MB (rarely gigabytes), and it is accessed frequently. But I'd like to emphasize that the files are mostly small. Also, considering our long-term plans to move towards big data and ML analysis, I am investigating the possibility of integrating the Hadoop ecosystem into our application.

The question I currently have is whether HDFS, and possibly HBase, would play well in our environment. As far as I know, HDFS was designed to store really large binary data, but maybe with HBase and some configuration tuning it is possible to make it work with smaller data? I should also mention that performance matters for both reading and writing files.

I would love to hear about your experience with the tech I mentioned, and maybe someone can recommend alternative solutions for the problem (Apache Parquet?).

Also, our team has no experience with distributed big data solutions like the ones Hadoop provides, so if you think these frameworks may work for our case, I would appreciate your feedback on their integration or any tips on where to start my investigation. Thank you for your attention. :)

P.S. Besides the FS, we also use S3 to archive old data and to store large (> 1 GB) binary files, so introducing a single storage system would be cool from that point of view as well.


Solution

  • After a small investigation, I learned that distributed file storages such as HDFS, as well as NoSQL storages, are not well suited for applications that target low latency.

    These systems were designed to operate in the Big Data world, where high overall throughput is more valuable than latency and the stored binaries are huge (see the first sketch after this list for what file I/O through HDFS looks like in practice).

    For most cloud-based applications that interact with real users, or that provide services to such applications, the most suitable data storages are object stores such as Amazon S3. They provide a convenient API (see the second sketch below), reasonable latency, high availability, and virtually unlimited capacity. And most importantly, they are usually managed by third parties, which eliminates a lot of work and concerns on the developers' side.
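
For context, this is roughly what file I/O through the HDFS Java API looks like. A minimal sketch, assuming the `hadoop-client` dependency and a reachable cluster at the placeholder address `hdfs://namenode:8020` (the path and file names are also placeholders). Note that every `create()`/`open()` call involves a round trip to the NameNode, which is part of the per-file overhead that hurts small-file, low-latency workloads:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.net.URI;

public class HdfsBlobStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "hdfs://namenode:8020" is a placeholder for your cluster's NameNode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path remote = new Path("/blobs/report.bin");

        // Write: create() contacts the NameNode first, then streams blocks to DataNodes.
        try (FSDataOutputStream out = fs.create(remote, /* overwrite = */ true);
             FileInputStream in = new FileInputStream("report.bin")) {
            IOUtils.copyBytes(in, out, 4096, false);
        }

        // Read: open() is another NameNode round trip to resolve block locations.
        try (FSDataInputStream in = fs.open(remote);
             FileOutputStream out = new FileOutputStream("report-copy.bin")) {
            IOUtils.copyBytes(in, out, 4096, false);
        }

        fs.close();
    }
}
```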
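
By contrast, the object-storage route needs very little code. A minimal sketch with the AWS SDK for Java 2.x; the bucket name and object key are placeholders, and credentials/region are assumed to come from the SDK's default provider chain:

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.nio.file.Paths;

public class S3BlobStore {
    public static void main(String[] args) {
        // Region and credentials are resolved from the default provider chain.
        try (S3Client s3 = S3Client.create()) {
            // Upload a binary file; "my-app-blobs" and the key are placeholders.
            s3.putObject(
                    PutObjectRequest.builder()
                            .bucket("my-app-blobs")
                            .key("blobs/report.bin")
                            .build(),
                    RequestBody.fromFile(Paths.get("report.bin")));

            // Download it back to a local file.
            s3.getObject(
                    GetObjectRequest.builder()
                            .bucket("my-app-blobs")
                            .key("blobs/report.bin")
                            .build(),
                    Paths.get("report-copy.bin"));
        }
    }
}
```

One API call per upload or download, no cluster to operate, and the same store can cover both the hot small files and the archived large ones.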