I need to consider how to write my data to Hadoop.
I'm using Spark; I receive messages from a Kafka topic, and each message is a JSON record.
I have around 200B records per day.
The data fields may change (not a lot, but they may change in the future).
I need fast writes, fast reads, and a small footprint on disk.
What should I choose? Avro or Parquet?
I also read the following: https://community.hitachivantara.com/community/products-and-solutions/pentaho/blog/2017/11/07/hadoop-file-formats-its-not-just-csv-anymore and "Avro vs. Parquet",
but I still have no idea what to choose.
Any suggestions?
If you care about both storage footprint and query performance, columnar formats such as ORC and Parquet are generally the best options.
If you are limited on disk space and are willing to sacrifice retrieval speed, Snappy or Bzip2 compression would be best, with Bzip2 giving the higher compression ratio at the cost of much slower reads and writes.
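For example, here is a minimal Spark (Scala) sketch (the HDFS paths and input layout are assumptions) that writes the same data once as Snappy-compressed Parquet and once as Bzip2-compressed JSON, so you can compare sizes and scan times on your own data:

    import org.apache.spark.sql.SparkSession

    object CompressionComparison {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("compression-comparison").getOrCreate()

        // Hypothetical path: a directory of raw JSON records for one day.
        val df = spark.read.json("hdfs:///data/raw/events/2018-01-01")

        // Parquet + Snappy: a good balance of size and read/query speed.
        df.write.option("compression", "snappy").parquet("hdfs:///data/events-parquet-snappy")

        // Bzip2-compressed JSON: smaller text output, but slow to write and to scan.
        df.write.option("compression", "bzip2").json("hdfs:///data/events-json-bzip2")

        spark.stop()
      }
    }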
Typically, I see people write the JSON data directly to Hadoop, then run a batch job (daily, for example) to convert it into a more optimal format, since Hadoop prefers a few very large files over lots of tiny ones.
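That daily conversion job can be as small as the following sketch (the paths, the coalesce factor, and the dt= directory layout are assumptions; tune them to your volume and partitioning scheme):

    import org.apache.spark.sql.SparkSession

    object DailyJsonToParquet {
      def main(args: Array[String]): Unit = {
        val day = args(0) // e.g. "2018-01-01", passed in by your scheduler
        val spark = SparkSession.builder().appName(s"json-to-parquet-$day").getOrCreate()

        spark.read
          .json(s"hdfs:///data/raw/events/dt=$day")      // many small files from the stream
          .coalesce(32)                                   // collapse into a few large files
          .write
          .mode("overwrite")                              // safe to re-run for the same day
          .option("compression", "snappy")
          .parquet(s"hdfs:///data/parquet/events/dt=$day")

        spark.stop()
      }
    }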
If you care about retrieval speed, use HBase or some other database (Hive is not a database), but at the very least, you will need to compact streaming data into larger time chunks according to your business needs.
Avro natively supports schema evolution, and if you are able to install the Confluent Schema Registry alongside your existing Kafka cluster, then you can just use the Kafka Connect HDFS sink to write Parquet straight from Avro (or from JSON, I think, assuming the messages carry a schema field) into HDFS, along with a Hive table.
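As a rough sketch of such a connector (the name, topic, hosts, and sizes below are placeholders; check the Confluent HDFS connector documentation for the exact properties in your version), the configuration might look like:

    name=events-hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=4
    topics=events
    hdfs.url=hdfs://namenode:8020
    # Write Parquet files and register/update a Hive table
    format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
    flush.size=100000
    hive.integration=true
    hive.metastore.uris=thrift://hive-metastore:9083
    schema.compatibility=BACKWARD
    # Partition output into hourly directories
    partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner
    partition.duration.ms=3600000
    path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
    locale=en-US
    timezone=UTC
    # Read Avro from the topic using the Schema Registry
    value.converter=io.confluent.connect.avro.AvroConverter
    value.converter.schema.registry.url=http://schema-registry:8081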
Other options include Apache NiFi and StreamSets. In other words, don't reinvent the wheel by writing Spark code to pull data from Kafka into HDFS.