I have 51 massive HDF5 tables, each with enough (well-behaved) data that I cannot load even one of them completely into memory. To make life easier for the rest of my team I need to transfer this data into a PostgreSQL database (and delete the HDF5 tables). However, this is easier said than done, mainly because of these hurdles:

1. pandas.read_hdf() still has a wonky chunksize kwarg: SO Question; open GitHub issue
2. pandas.DataFrame.to_sql() is monumentally slow and inefficient: open GitHub issue (see my post at the bottom of the issue page)

Basically, going HDF5 -> Pandas -> PostgreSQL would require surmounting hurdles 1 and 2 with extensive monkey-patching, and there seems to be no way to go from HDF5 to PostgreSQL directly. Unless I am missing something.
Perhaps one of you fine users can hint at something I am missing, share some patchwork you created to surmount a similar issue, or offer any suggestions or advice...
You could convert to CSV with something like the following:
import csv
import h5py

# iterate the dataset row by row; h5py reads lazily, so the whole
# table is never loaded into memory at once
with h5py.File('input.hdf5', 'r') as hdf5file:
    with open('output.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        for row in hdf5file['__data__']['table']:
            writer.writerow(row)
And then import into Postgres with psql:
create table mytable (col1 bigint, col2 float, col3 float);
\copy mytable from 'output.csv' CSV
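If you would rather script the import than run it interactively, the same \copy works from the shell (assuming a database named mydb, as in the snippets below):

psql mydb -c "\copy mytable from 'output.csv' CSV"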
Depending on the complexity of your data, you could probably do something clever to get the schema out of the HDF5 file and use that to generate the CREATE TABLE statement, as sketched below.
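For example, if the table is stored as a compound (structured) dataset, you could map its numpy dtype to Postgres column types. A minimal sketch, assuming the same '__data__/table' path as above and that all columns are simple scalar types; the PG_TYPES mapping is just an illustrative starting point:

import h5py

# illustrative mapping from numpy dtype kinds to Postgres types;
# extend as needed for your data
PG_TYPES = {'i': 'bigint', 'u': 'bigint', 'f': 'double precision',
            'S': 'text', 'U': 'text', 'b': 'boolean'}

with h5py.File('input.hdf5', 'r') as hdf5file:
    dtype = hdf5file['__data__']['table'].dtype
    columns = ', '.join('{} {}'.format(name, PG_TYPES[dtype[name].kind])
                        for name in dtype.names)
    print('CREATE TABLE mytable ({});'.format(columns))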
Alternatively, you could try writing your own INSERT statements in your Python script. This will probably be slower than using COPY (though see the last snippet below for a way to stream COPY directly from Python), but could be a simpler solution:
import h5py
import psycopg2
from itertools import islice

chunksize = 50

with h5py.File('input.hdf5', 'r') as hdf5file:
    with psycopg2.connect("dbname=mydb user=postgres") as conn:
        cur = conn.cursor()
        t = iter(hdf5file['__data__']['table'])
        rows = list(islice(t, chunksize))
        while rows:
            # mogrify quotes and escapes each row safely; adjust the
            # number of %s placeholders to match your columns
            values = ','.join(
                cur.mogrify('(%s,%s,%s)', tuple(r.tolist())).decode()
                for r in rows)
            cur.execute('INSERT INTO mytable VALUES ' + values)
            rows = list(islice(t, chunksize))
        conn.commit()
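Finally, if you want COPY speed without writing the intermediate CSV to disk, psycopg2 can stream CSV straight into COPY ... FROM STDIN with copy_expert. A rough sketch combining the two approaches above; the 1 MB flush threshold and the flush() helper are arbitrary choices, not anything prescribed by psycopg2:

import csv
import io

import h5py
import psycopg2

def flush(cur, buf):
    # stream whatever CSV rows we have buffered into Postgres
    buf.seek(0)
    cur.copy_expert("COPY mytable FROM STDIN CSV", buf)

with h5py.File('input.hdf5', 'r') as hdf5file:
    with psycopg2.connect("dbname=mydb user=postgres") as conn:
        cur = conn.cursor()
        buf = io.StringIO()
        writer = csv.writer(buf)
        for row in hdf5file['__data__']['table']:
            writer.writerow(row)
            if buf.tell() > 1 << 20:  # flush roughly every 1 MB
                flush(cur, buf)
                buf = io.StringIO()
                writer = csv.writer(buf)
        flush(cur, buf)  # remaining rows
        conn.commit()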