I am experimenting with different pandas-friendly storage schemes for tick data. The fastest (in terms of reading and writing) so far has been using an HDFStore with blosc compression and the "fixed" format.
import pandas as pd

# fixed format with blosc compression: fast reads/writes, not queryable
store = pd.HDFStore(path, complevel=9, complib='blosc')
store.put(symbol, df)
store.close()
I'm indexing by ticker symbol since that is my common access pattern. However, this scheme adds about 1 MB of space per symbol. That is, if the data frame for a microcap stock contains just a thousand ticks for that day, the file still grows by a megabyte. So for a large universe of small stocks, the .h5 file quickly becomes unwieldy.
Is there a way to keep the performance benefits of blosc/fixed format but get the size down? I have tried the "table" format, which requires about 285 KB per symbol.
store.append(symbol, df, data_columns=True)
However, this format is dramatically slower to read and write.
In case it helps, here is what my data frame looks like:
exchtime datetime64[ns]
localtime datetime64[ns]
symbol object
country int64
exch object
currency int64
indicator int64
bid float64
bidsize int64
bidexch object
ask float64
asksize int64
askexch object
The blosc compression itself works pretty well, since the resulting .h5 file requires only 30-35 bytes per row. So right now my main concern is decreasing the size penalty per node in HDFStore.
AFAIK there is a certain minimum block size in PyTables.
Here are some suggestions:
You can ptrepack the file, using the option chunkshape='auto'. This will pack it using a chunkshape that is computed by looking at all the data, and can repack the data into a more efficient blocksize, resulting in smaller file sizes. The reason is that PyTables needs to be informed of the expected number of rows of the final array/table size.
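A rough sketch of that invocation (the file names are placeholders; check ptrepack --help for the exact options available in your PyTables version):

# repack an existing store, letting PyTables pick the chunkshape from the data
ptrepack --chunkshape=auto --complevel=9 --complib=blosc ticks.h5 ticks_repacked.h5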
You can achieve an optimal chunksize in a Table format by passing expectedrows= (and only doing a single append). However, ptrepacking will STILL have a benefit here.
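Roughly like this, assuming each symbol is written in a single append (using len(df) as the row-count hint is just one reasonable choice):

# tell PyTables the final table size so it can choose a sensible chunksize up front
store = pd.HDFStore(path, complevel=9, complib='blosc')
store.append(symbol, df, expectedrows=len(df))
store.close()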
You can also try writing in the Table format without setting data_columns=True; just pass format='table'. It will write the table format (but you won't be able to query except by the index), and it stores the data as a single block, so it should be almost as fast as fixed while being somewhat more space efficient.
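Something along these lines (a sketch; the store setup is the same as in your question):

# table format without data_columns: queryable only by the index, but more compact than fixed
store.put(symbol, df, format='table')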
In PyTables 3.1 (just released), there is a new blosc filter, which might reduce file sizes. See here.
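If your installed pandas/PyTables versions expose the newer blosc compressors (this is an assumption about your setup; 'blosc:lz4' below is just one of the variants, and plain 'blosc' remains the safe fallback), selecting one is only a different complib string:

# try one of the newer blosc-based compressors; requires sufficiently recent pandas/PyTables
store = pd.HDFStore(path, complevel=9, complib='blosc:lz4')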