Tags: apache-spark, hdf5

How do I write HDF5 files from Apache Spark?


I have found tools for reading HDF5 files from Spark, but not for writing them. Is it possible?

We have a dataset that is 10-40TB in size. We're currently writing it out as roughly 20,000 Python pickle files. That's not very portable. Also, HDF5 offers compression.

We can write Parquet files, so one approach is to write out Parquet and then convert it to HDF5. However, this approach is undesirable because none of the conversion tools we have found are multi-threaded.

We want to use HDF5 because it has broad acceptance within the scientific community. Its support in programs like MATLAB and Stata appears significantly better than Parquet's.


Solution

  • After consultation with The HDF Group, we have determined that there is currently no way to write HDF5 files directly from Spark. They can, however, be written from Dask using NumPy and pandas, but not from Spark.