Tags: amazon-athena, netcdf, amazon-quicksight

Analyze binary NetCDF files with AWS QuickSight / Athena


I have a task to analyze weather forecast data in QuickSight. The forecast data is held in NetCDF binary files in a public S3 bucket. The question is: how do you expose the contents of these binary files to QuickSight, or even to Athena?

There are Python libraries that will decode the data from the binary files, such as Iris, which is used like this:

import iris

# Load all the cubes (multi-dimensional data arrays) from a NetCDF file.
filename = iris.sample_data_path('forecast_20200304.nc')
cubes = iris.load(filename)
print(cubes)

So what would be the AWS workflow and services necessary to create a data ingestion pipeline that would:

  1. Respond to an SQS message that a new binary file is available
  2. Access the new binary file and decode it to access the forecast data
  3. Add the decoded data to the set of already decoded data from previous SQS notifications
  4. Make all the decoded data available in Athena / Quicksight

Tricky one, this...


Solution

  • What I would do is probably something like this:

    Write a Lambda function in Python that is triggered when new files appear in the S3 bucket – either by S3 notifications (if you control the bucket), via SNS or SQS, or on a schedule with EventBridge. The function uses the code snippet included in your question to transform each new file and upload the transformed data to another S3 bucket.
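    A minimal sketch of such a handler, assuming the SQS message body wraps a standard S3 event notification, that `iris` and `boto3` are packaged with the function (e.g. in a Lambda layer), and that the output bucket name comes from a hypothetical `OUTPUT_BUCKET` environment variable:

    ```python
    import json
    import os
    import urllib.parse


    def derive_output_key(input_key: str) -> str:
        """Map an input NetCDF key to the key of the converted CSV file,
        e.g. 'incoming/forecast_20200304.nc' -> 'converted/forecast_20200304.csv'.
        """
        base = os.path.splitext(os.path.basename(input_key))[0]
        return f"converted/{base}.csv"


    def handler(event, context):
        # boto3 and iris are imported lazily so the pure helpers above
        # can be unit-tested without them installed.
        import boto3
        import iris
        import iris.pandas

        s3 = boto3.client("s3")
        for record in event["Records"]:
            # Each SQS record body is an S3 event notification.
            s3_event = json.loads(record["body"])
            for s3_record in s3_event["Records"]:
                bucket = s3_record["s3"]["bucket"]["name"]
                key = urllib.parse.unquote_plus(s3_record["s3"]["object"]["key"])

                # Download the NetCDF file to Lambda's scratch space.
                local_path = f"/tmp/{os.path.basename(key)}"
                s3.download_file(bucket, key, local_path)

                # Decode with iris and flatten the first cube to a table.
                cubes = iris.load(local_path)
                frame = iris.pandas.as_data_frame(cubes[0])

                # Upload the converted data to the output bucket.
                csv_path = local_path + ".csv"
                frame.to_csv(csv_path)
                s3.upload_file(csv_path, os.environ["OUTPUT_BUCKET"],
                               derive_output_key(key))
    ```

    How you flatten a cube to rows depends entirely on the shape of your data; `iris.pandas.as_data_frame` only handles simple cases, so treat that line as a placeholder for your own transformation.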

    I don't know the size of these files or how often they are published, so whether to convert to CSV, JSON, or Parquet is something you have to decide – if the data is small, CSV will probably be easiest and good enough.

    With the converted data in a new S3 bucket all you need to do is create an Athena table for the data set and start using QuickSight.
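    Creating that table can itself be scripted. A sketch using boto3's Athena client – the bucket names, database, and column list here are illustrative assumptions, not taken from your data:

    ```python
    # DDL for an Athena table over the converted CSV files. The columns
    # are hypothetical; they must match whatever your conversion emits.
    CREATE_TABLE_DDL = """
    CREATE EXTERNAL TABLE IF NOT EXISTS forecasts (
      time        STRING,
      latitude    DOUBLE,
      longitude   DOUBLE,
      temperature DOUBLE
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    LOCATION 's3://my-converted-forecasts/converted/'
    TBLPROPERTIES ('skip.header.line.count' = '1')
    """


    def create_table(database: str = "forecasts_db"):
        # Imported here so the DDL string can be inspected without boto3.
        import boto3

        athena = boto3.client("athena")
        return athena.start_query_execution(
            QueryString=CREATE_TABLE_DDL,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
        )
    ```

    Running the DDL once in the Athena console works just as well; scripting it only matters if you rebuild environments often.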

    If you end up with a lot of small files you might want to implement a second step that combines the converted files into bigger files once per day, possibly as Parquet – but don't add that complexity unless you have to.


    An alternative way would be to use Athena Federated Query: by implementing Lambda function(s) that respond to specific calls from Athena you can make Athena read any data source that you want. It's currently in preview, and as far as I know all the example code is written in Java – but theoretically it would be possible to write the Lambda functions in Python.

    I'm not sure whether it would be less work than implementing an ETL workflow like the one you suggest, but yours is one of the use cases for which Athena Federated Query was designed, and it might be worth looking into. If NetCDF files are common and a data source for such files would be useful to other people, I'm sure the Athena team would love to talk to you and help you out.