I am working with an enormous dataset (hundreds of GBs) that has ~40 million identifiers stored as 32-character strings, and for each identifier hundreds or thousands of rows of numeric data.
Both to save space and to make it more efficient to read the data from disk, it seems preferable to not repeat the identifiers over and over again in the dataset. For example, a data table that looks like
verylongstringidentifier1, 1.2
verylongstringidentifier1, 2.3
verylongstringidentifier1, 3.4
.
.
verylongstringidentifier2, 2.1
verylongstringidentifier2, 1.0
.
.
could be stored more efficiently if the string identifiers weren't repeated. One option is to save separate files for each identifier, and I may go this route, but it's somewhat annoying to have millions of separate small files, and possibly inefficient from a disk I/O standpoint.
I am totally new to hdf5, but what I've read about it suggests it should work well for this situation, since the datasets can be stored with the identifiers as the keys. However, when I save to an hdf5 file the resulting file is roughly 40 times larger than what I would get if I simply wrote to a flat csv file. Am I missing something about how hdf5 files are stored, or am I just doing something wrong? The test code below is what I used to verify (and try to diagnose) the problem.
# trying to figure out why hdf5 file sizes are so huge
import time
import string
import random
import numpy as np
import pandas as pd
from pandas import HDFStore
# generate 1000 random 32-character strings
strings = [''.join(random.choices(string.ascii_lowercase, k=32)) for _ in range(1000)]
# for each of these random strings, create 200 rows of three random floats
# concatenate into one big dataframe
df = pd.DataFrame()
for s in strings:
    vars = np.random.rand(200, 3)
    ss = np.full((200, 1), s)
    s_data = np.concatenate((ss, vars), axis=1)
    df = pd.concat([df, pd.DataFrame(s_data)], axis=0)
df.columns = ['string', 'v1', 'v2', 'v3']
# write to one big csv file
df.to_csv('/tmp/test.csv', index=False)
# write to compressed bzip2 file
df.to_csv('/tmp/test.csv.bz2', index=False, compression='bz2')
# write to separate csv files for each string
unique_strings = df.string.unique()
for s in unique_strings:
    s_chunk = df[df.string == s]
    fname = '/tmp/test_' + s + '.csv.bz2'
    # don't need to store the string, since it can be retrieved as the filename
    s_chunk[['v1', 'v2', 'v3']].to_csv(fname, index=False, compression='bz2')
# write to hdf5 file with strings as keys
# what I'm trying to do here is *not* save the strings in the datasets, but instead
# use the strings as the names (keys) for the datasets
# My understanding is this would enable me to retrieve the data for a given string
# with pd.read_hdf(h5data, key=<string for which I want data>)
h5data = HDFStore('/tmp/test.h5')
for s in unique_strings:
    s_chunk = df[df.string == s]
    # don't need to store the string, because we'll use it as the key
    s_chunk[['v1', 'v2', 'v3']].to_hdf(h5data, key=s, format='table', complib='bzip2')
h5data.close()
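For reference, this is roughly the read path I have in mind once the file is written (just a sketch; some_id here stands in for one of the 32-character identifier strings):
# sketch: retrieve the rows for a single identifier by key
some_id = unique_strings[0]
subset = pd.read_hdf('/tmp/test.h5', key=some_id)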
The resulting file sizes:
18M /tmp/test.csv
4.7M /tmp/test.csv.bz2
80M /tmp/test.h5
This may be happening because pandas dumps a lot of extraneous information for each group/dataset into the HDF5 file. When I ran your code and examined the resulting file with HDFView, the extra metadata was readily apparent.
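If you'd rather see that from Python than from HDFView, here is a quick sketch (my own helper, assuming h5py is available) that walks the file and prints each group and dataset along with the number of attributes attached to it:
import h5py

with h5py.File("/tmp/test.h5", "r") as f:
    # print every group/dataset in the file and how many attributes it carries
    def show(name, obj):
        kind = "Group" if isinstance(obj, h5py.Group) else "Dataset"
        print(f"{kind}: {name} ({len(obj.attrs)} attributes)")
    f.visititems(show)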
I prefer to use the h5py library for creating and managing HDF5 files, as it allows for greater simplicity and control.
I tried using h5py to structure the file so that each Group is named after one of the unique strings, and each Group contains a Dataset for each column of the DataFrame. I used the following in place of the HDF5-writing section of your script:
import h5py

with h5py.File("/tmp/test.h5", "w") as h5data:
    for s in unique_strings:
        s_chunk = df[df.string == s]
        # create group with name = string
        g = h5data.create_group(s)
        # create datasets within group for each data column
        dset_v1 = g.create_dataset("v1", data=s_chunk["v1"].values.astype(np.float32), compression="gzip")
        dset_v2 = g.create_dataset("v2", data=s_chunk["v2"].values.astype(np.float32), compression="gzip")
        dset_v3 = g.create_dataset("v3", data=s_chunk["v3"].values.astype(np.float32), compression="gzip")
The results (note that I used gzip instead of bz2):
18M /tmp/test.csv
5.2M /tmp/test.csv.bz2
11M /tmp/test.h5
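Reading the data back for a given string is then just a lookup by group name, along these lines (a sketch, using one of the identifiers generated by your script):
import h5py

# fetch the three column arrays back for one identifier
with h5py.File("/tmp/test.h5", "r") as h5data:
    s = unique_strings[0]
    v1 = h5data[s]["v1"][:]  # 1D float32 array
    v2 = h5data[s]["v2"][:]
    v3 = h5data[s]["v3"][:]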
A further optimization would be to only have one dataset within each group, where that dataset is a 2D array. In that case, the three create_dataset calls would be replaced by one:
dset = g.create_dataset("data", data=s_chunk[["v1", "v2", "v3"]].values.astype(np.float32), compression="gzip")
The results of that:
18M /tmp/test.csv
5.0M /tmp/test.csv.bz2
6.0M /tmp/test.h5
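For completeness, reading one identifier back from this single-dataset layout into a DataFrame would look something like this (a sketch; the column order matches the order used when writing):
import h5py

# rebuild a DataFrame for one identifier from its 2D dataset
with h5py.File("/tmp/test.h5", "r") as h5data:
    s = unique_strings[0]
    arr = h5data[s]["data"][:]  # float32 array of shape (n_rows, 3)
    s_chunk = pd.DataFrame(arr, columns=["v1", "v2", "v3"])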
Using bz2 as the compression would shrink this even further.