Search code examples
sqlapache-sparkdatabricksazure-databricks

How to store and aggregate data in ~300m JSON objects efficiently


I have an app where I receive 300m JSON text files (10m daily, retention = 30 days) from a Kafka topic.

The data it contains needs to be aggregated every day based on different properties.

We would like to build it with Apache Spark, using Azure Databricks, because the size of the data will gro, we cannot vertically scale this process anymore (currently runs in 1 Postgres server) and we also need something that is cost-effective.

Having this job in Apache Spark is straightforward in theory, but I haven't found any practical advice on how to process JSON objects efficiently.

These are the options as I see:

  1. Store the data in Postgres and ingest it with the Spark job (SQL) - may be slow to transfer the data

  2. Store the data in Azure Blob Storage in JSON format - We may run out of the number of files that can be stored, also this seems inefficient to read so many files

  3. Store the JSON data in big chunks, eg. 100.000 JSON in one file - it could be slow to delete/reinsert when the data changes

  4. Convert the data to CSV or some binary format with a fixed structure and store it in blob format in big chunks - Changing the format will be a challenge but it would rarely happen in the future, also CSV/binary is quicker to parse

Any practical advice would be really appreciated. Thanks in advance.


Solution

  • There are multiple factors to be consider :

    1. If you are trying to read the data on daily manner then strongly suggested to do store the data in Parquet format and store in databricks. If not accessing daily then store in Azure buckets itself (computation cost will be minimised)
    2. If JSON data to be flattened then you need to do all the data manipulations and write into delta tables with OPTIMISE conditions.
    3. If really retention 30 mandatory then be cautious with file formats bcz data will grow exponentially on daily basis. Other wise Alter table properties with retention period to 7 days or 15 days.