
Best practice for file format storage (Hadoop)


I would like to get some advice about data formats, and in particular about the best way to store my data in HDFS.

I am receiving a lot of messages in JSON and XML format. For efficient processing I need to convert these files to a format better suited to Hadoop and store them in HDFS. The schema of these files does not change over time, and the files can be large or small (< 64 MB). I will also need to compress these files. Then I will process the data with Spark to determine whether there are errors, and generate a report.

So, after some research, I think the best format for my use case is Avro (even though I don't need schema evolution), because it offers both compression and splittability. But I'm not sure about this choice.
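For concreteness, this is roughly the conversion job I have in mind (just a sketch on my side: the HDFS paths are placeholders and I'm assuming the spark-avro module is available):

```scala
import org.apache.spark.sql.SparkSession

object JsonToAvro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-to-avro")
      .getOrCreate()

    // Read the raw JSON messages (placeholder path).
    val messages = spark.read.json("hdfs:///data/raw/messages")

    // Write them back as Snappy-compressed Avro, which stays splittable.
    // Requires the spark-avro module on the classpath.
    messages.write
      .format("avro")
      .option("compression", "snappy")
      .save("hdfs:///data/avro/messages")

    spark.stop()
  }
}
```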

Thanks for your help :)


Solution

  • It depends on your needs:

    • Avro is a good file format for storing data because it compresses well, and Avro is pluggable with Pig, Hive, Spark, and so on. In addition, with Confluent's Schema Registry you can manage the evolution of your schemas.

    • Parquet also has a good compression ratio, but it is a columnar format. It is likewise pluggable with Pig, Hive, and Spark, but Parquet is more efficient for queries with filters.

    In my opinion, if you just want to store the data and do full scans, I would go with Avro; but if you want to query the data with Impala or Hive for Business Intelligence, you will get better results with Parquet (see the sketch after this list).

    My 2 cents
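To make the trade-off concrete, here is a minimal Spark sketch that writes the same data as both Avro and Parquet and then runs a filtered, column-pruned query on the Parquet copy; the HDFS paths and column names are placeholders, and it assumes the spark-avro module is on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object AvroVsParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("avro-vs-parquet")
      .getOrCreate()

    val events = spark.read.json("hdfs:///data/raw/events")

    // Row-oriented Avro: a good fit when jobs read whole records (full scans).
    events.write
      .format("avro")
      .option("compression", "snappy")
      .save("hdfs:///data/avro/events")

    // Columnar Parquet: a good fit when queries select a few columns and filter rows.
    events.write
      .format("parquet")
      .option("compression", "snappy")
      .save("hdfs:///data/parquet/events")

    // Filtered, column-pruned query: Parquet reads only the selected columns
    // and can skip row groups using its min/max statistics.
    val errors = spark.read.parquet("hdfs:///data/parquet/events")
      .filter("status = 'ERROR'")
      .select("id", "timestamp", "status")

    errors.show()
    spark.stop()
  }
}
```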