
Better way than HDF5 -> Pandas -> PostgreSQL


I have 51 massive HDF5 tables, each with enough (well-behaved) data that I cannot load even one of them completely into memory. To make life easier for the rest of my team, I need to transfer this data into a PostgreSQL database (and delete the HDF5 tables). However, this is easier said than done, mainly because of these hurdles:

  1. pandas.read_hdf() still has a wonky chunksize kwarg: SO Question; Open GitHub issue
  2. pandas.DataFrame.to_sql() is monumentally slow and inefficient: Open GitHub issue (see my post at the bottom of the issue page)
  3. PostgreSQL does not have a native or third-party foreign data wrapper for HDF5: PostgreSQL wiki article
  4. The HDF5 ODBC driver is still nascent: HDF5 ODBC blog

Basically, going from HDF5 -> Pandas -> PostgreSQL requires surmounting hurdles 1 and 2 with extensive monkey patching, and there seems to be no way to go from HDF5 to PostgreSQL directly. Unless I am missing something.
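
To make the hurdles concrete, the straightforward pandas route would look something like the sketch below. It assumes the tables were written in PyTables "table" format (plain "fixed" format cannot be read in chunks at all), and the key and connection string are placeholders. Hurdles 1 and 2 live exactly in this loop:

    import pandas as pd
    from sqlalchemy import create_engine

    # placeholder connection string and HDF5 key
    engine = create_engine('postgresql://postgres@localhost/mydb')

    # chunksize only works for "table"-format stores (hurdle 1), and
    # to_sql emits row-by-row INSERTs by default (hurdle 2)
    for chunk in pd.read_hdf('input.hdf5', 'table_key', chunksize=50000):
        chunk.to_sql('mytable', engine, if_exists='append', index=False)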

Perhaps one of you fine users can point out something I am missing, share a patch you wrote to surmount a similar issue, or offer any other suggestions or advice...


Solution

  • You could convert to CSV with something like the following:

    import csv
    import h5py

    with h5py.File('input.hdf5', 'r') as hdf5file:
        with open('output.csv', 'w', newline='') as csvfile:
            writer = csv.writer(csvfile)
            # stream row by row so the table never has to fit in memory
            for row in hdf5file['__data__']['table']:
                writer.writerow(row)
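
    Note that iterating an h5py dataset row by row issues one small read per row, which can be painfully slow on tables this size. Reading fixed-size slices amortizes the I/O; a sketch of the same conversion with slicing (the 100000 step size is an arbitrary choice):

    import csv
    import h5py

    with h5py.File('input.hdf5', 'r') as hdf5file:
        dset = hdf5file['__data__']['table']
        with open('output.csv', 'w', newline='') as csvfile:
            writer = csv.writer(csvfile)
            # each slice pulls many rows in a single HDF5 read
            for start in range(0, dset.shape[0], 100000):
                writer.writerows(dset[start:start + 100000])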
    

    And then import into Postgres with psql:

    create table mytable (col1 bigint, col2 float, col3 float);
    \copy mytable from 'output.csv' CSV
    

    Depending on the complexity of your data, you could probably do something clever to get the schema out of the HDF5 file and use that to generate the CREATE TABLE statement, e.g. along the lines of the sketch below.
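
    A rough sketch of that idea, reading the dataset's compound dtype and mapping numpy type kinds to PostgreSQL types (the kind_map here is a minimal assumption, not an exhaustive mapping):

    import h5py

    # assumed mapping from numpy dtype kinds to PostgreSQL column types
    kind_map = {'i': 'bigint', 'u': 'bigint', 'f': 'double precision',
                'S': 'text', 'b': 'boolean'}

    with h5py.File('input.hdf5', 'r') as hdf5file:
        dtype = hdf5file['__data__']['table'].dtype
        columns = ', '.join(
            '"{}" {}'.format(name, kind_map[dtype[name].kind])
            for name in dtype.names
        )
        print('CREATE TABLE mytable ({});'.format(columns))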

    Alternatively, you could try writing your own INSERT statements in your Python script. This will probably be slower than using COPY, but could be a simpler solution:

    import h5py
    import psycopg2
    from itertools import islice

    with h5py.File('input.hdf5', 'r') as hdf5file:
        with psycopg2.connect("dbname=mydb user=postgres") as conn:
            cur = conn.cursor()
            chunksize = 50
            t = iter(hdf5file['__data__']['table'])
            rows = list(islice(t, chunksize))
            while rows:
                # mogrify escapes each row into a "(v1,v2,v3)" values tuple;
                # the template assumes the three columns created above, and
                # .item() turns numpy scalars into plain Python values
                values = ','.join(
                    cur.mogrify('(%s,%s,%s)',
                                tuple(v.item() for v in row)).decode()
                    for row in rows
                )
                cur.execute("INSERT INTO mytable VALUES " + values)
                rows = list(islice(t, chunksize))
            conn.commit()
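
    If your psycopg2 is 2.7 or newer, psycopg2.extras.execute_values does the same multi-row batching with less manual string building; a minimal sketch under the same assumptions about the table layout:

    import h5py
    import psycopg2
    from psycopg2.extras import execute_values

    with h5py.File('input.hdf5', 'r') as hdf5file:
        with psycopg2.connect("dbname=mydb user=postgres") as conn:
            cur = conn.cursor()
            # execute_values groups the rows into multi-row INSERT statements
            execute_values(
                cur,
                "INSERT INTO mytable VALUES %s",
                (tuple(v.item() for v in row)
                 for row in hdf5file['__data__']['table']),
                page_size=1000,
            )
            conn.commit()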