
Removing a table does not free disk space in pytables


I have a table in pytables created as follows:

import tables as tb
import random
import time
h5f = tb.open_file('enum.h5', 'w')
class BallExt(tb.IsDescription): 
    ballTime = tb.Time32Col() 
    ballColor = tb.Int64Col()
tbl = h5f.create_table('/', 'df', BallExt)
now = time.time()
row = tbl.row
for i in range(10000): 
    row['ballTime'] = now + i 
    row['ballColor'] = int(random.choice([1,2,3,4,5]))  # take note of this 
    row.append()
tbl.flush()
h5f.close()

The size of this file on disk is shown as 133KB.

Now when I try to delete the table, everything works as expected (and the final file size is around 1KB).

h5f = tb.open_file('enum.h5', 'a')
tbl = h5f.root.df
tbl.remove()
h5f.flush()
h5f.close()

However, if I copy part of this table to a new table and delete the original table, the file size seems to increase (to 263KB). It looks like only a reference is deleted and the data is still present on disk.

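# Assumes `tb` and BallExt from the first snippet are still in scope
h5f = tb.open_file('enum.h5', 'a')
tbl = h5f.root.df
new_tbl = h5f.create_table('/', 'df2', BallExt)
# Copy only the rows matching the condition into df2
tbl.append_where(new_tbl, '(ballColor >= 3)')
# Remove the original table
tbl.remove()
h5f.flush()
h5f.close()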

Is this expected? If so, is there a way to delete tbl as well as free the disk space occupied by the table? (I am using pytables==3.6.1)


Solution

  • Yes, that behavior is expected. Take a look at this answer for a more detailed example of the same behavior: How does HDF handle the space freed by deleted datasets without repacking. Note that the freed space is not returned to the file system, but it may be reused if you add new datasets.

    To reclaim the unused space in the file, you have to use a command line utility. There are two choices, ptrepack and h5repack, both of which handle a number of external file operations. To reduce file size after object deletion, create a new file from the old one as shown below:

    • ptrepack utility delivered with PyTables.
      • Reference here: PyTables ptrepack doc
      • Example: ptrepack file1.h5 file2.h5 (creates file2.h5 from file1.h5)
    • h5repack utility from The HDF Group.
      • Reference here: HDF5 h5repack doc
      • Example: h5repack [OPTIONS] file1.h5 file2.h5 (creates file2.h5 from file1.h5)

    Both have options to use a different compression method when creating the new file, so they are also handy if you want to convert from compressed to uncompressed (or vice versa).
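
If you would rather trigger the repack from Python than from a shell, something along these lines might work. It is only a sketch: repack_command and repack_in_place are hypothetical helper names (not part of PyTables), the ".repacked" temporary suffix is an arbitrary choice, and it assumes ptrepack (or h5repack) is installed and on your PATH. It simply runs the same "ptrepack file1.h5 file2.h5" command shown above and then swaps the new file in place of the old one.

```python
import os
import shutil
import subprocess


def repack_command(src, dst, tool="ptrepack"):
    """Build the argument list for `<tool> src dst` (hypothetical helper)."""
    return [tool, src, dst]


def repack_in_place(path, tool="ptrepack"):
    """Rewrite `path` through the repack tool and return (old_size, new_size).

    Hypothetical helper: it assumes `tool` is on PATH and that no other
    process has the HDF5 file open while it runs.
    """
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} not found on PATH")

    old_size = os.path.getsize(path)

    # Write the compacted copy next to the original so the final rename
    # stays on the same filesystem.
    tmp = path + ".repacked"
    if os.path.exists(tmp):
        os.remove(tmp)  # start from a clean destination file

    # Equivalent to running: ptrepack file1.h5 file2.h5
    subprocess.run(repack_command(path, tmp, tool), check=True)

    new_size = os.path.getsize(tmp)
    os.replace(tmp, path)  # swap the compacted file in for the original
    return old_size, new_size


# Example usage (close the file in PyTables first):
# old, new = repack_in_place('enum.h5')
# print(f"{old} bytes -> {new} bytes")
```

Passing tool="h5repack" should work the same way for the plain "h5repack file1.h5 file2.h5" form; any extra options would need to be added to the argument list yourself.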