
HDF5 min_itemsize error: ValueError: Trying to store a string with len [##] in [y] column but this column has a limit of [##]!


I get the following error when calling pandas.HDFStore().append():

ValueError: Trying to store a string with len [150] in [values_block_0] column but  this column has a limit of [127]!

Consider using min_itemsize to preset the sizes on these columns

I am creating a pandas DataFrame and appending it to the HDF5 file as follows:

import pandas as pd

store = pd.HDFStore("test1.h5", mode='w')

hdf_key = "one_key"

columns = ["col1", "col2", ... ]

df = pd.DataFrame(...)
df.col1 = df.col1.astype(str)
df.col2 = df.col2.astype(int)
df.col3 = df.col3.astype(str)
.... 
store.append(hdf_key, df, data_column=columns, index=False)

I get the error above: "ValueError: Trying to store a string with len [150] in [values_block_0] column but this column has a limit of [127]!"

Afterwards, I execute the code:

store.get_storer(hdf_key).table.description

which outputs

{
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": StringCol(itemsize=127, shape=(5,), dflt=b'', pos=1),
  "values_block_1": Int64Col(shape=(5,), dflt=0, pos=2),
  "col1": StringCol(itemsize=20, shape=(), dflt=b'', pos=3),
  "col2": StringCol(itemsize=39, shape=(), dflt=b'', pos=4)}

What are values_block_0 and values_block_1?

So, following the Stack Overflow question "Pandas pytable: how to specify min_itemsize of the elements of a MultiIndex", I tried:

store.append(hdf_key, df, data_column=columns, index=False,  min_itemsize={"values_block_0":250})

This doesn't work either. Now I get this error:

ValueError: Trying to store a string with len [250] in [values_block_0] column but  this column has a limit of [127]!

Consider using min_itemsize to preset the sizes on these columns

What am I doing wrong?

EDIT: The following code, saved as filename.py, produces the error ValueError: min_itemsize has the key [values_block_0] which is not an axis or data_column

import pandas as pd
store = pd.HDFStore("test1.h5", mode='w')
hdf_key = "one_key"

my_columns = ["col1", "col2", ... ]

df = pd.DataFrame(...)
df.col1 = df.col1.astype(str)
df.col2 = df.col2.astype(int)
df.col3 = df.col3.astype(str)
.... 
store.append(hdf_key, df, data_column=my_columns, index=False, min_itemsize={"values_block_0":350})

Here is the full error:

(python-3) -bash:1008 $ python filename.py
Traceback (most recent call last):
  File "filename.py", line 50, in <module>
    store.append(hdf_key, dicts_into_df,  data_column=my_columns, index=False, min_itemsize={'values_block_0':350})
  File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 970, in append
    **kwargs)
  File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 1315, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 4263, in write
    obj=obj, data_columns=data_columns, **kwargs)
  File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 3853, in write
    **kwargs)
  File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 3535, in create_axes
    self.validate_min_itemsize(min_itemsize)
  File "/path/lib/python-3/lib/python3.5/site-packages/pandas/io/pytables.py", line 3174, in validate_min_itemsize
    "data_column" % k)
ValueError: min_itemsize has the key [values_block_0] which is not an axis or data_column

Solution

  • UPDATE:

    You have misspelled the data_columns parameter: you wrote data_column, but it should be data_columns. As a result, none of your columns were stored as data columns, so pandas packed them into dtype-grouped blocks named values_block_X:

    In [70]: store = pd.HDFStore(r'D:\temp\.data\my_test.h5')
    

    The misspelled keyword is silently ignored here (append passes extra keyword arguments along, as the **kwargs in your traceback shows):

    In [71]: store.append('no_idx_wrong_dc', df, data_column=df.columns, index=False)
    
    In [72]: store.get_storer('no_idx_wrong_dc').table
    Out[72]:
    /no_idx_wrong_dc/table (Table(10,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
      "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
      "values_block_2": StringCol(itemsize=30, shape=(1,), dflt=b'', pos=3)}
      byteorder := 'little'
      chunkshape := (1213,)
    

    which is the same as passing no data columns at all:

    In [73]: store.append('no_idx_no_dc', df, index=False)
    
    In [74]: store.get_storer('no_idx_no_dc').table
    Out[74]:
    /no_idx_no_dc/table (Table(10,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
      "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
      "values_block_2": StringCol(itemsize=30, shape=(1,), dflt=b'', pos=3)}
      byteorder := 'little'
      chunkshape := (1213,)
    

    Now with the correct spelling:

    In [75]: store.append('no_idx_dc', df, data_columns=df.columns, index=False)
    
    In [76]: store.get_storer('no_idx_dc').table
    Out[76]:
    /no_idx_dc/table (Table(10,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "value": Float64Col(shape=(), dflt=0.0, pos=1),
      "count": Int64Col(shape=(), dflt=0, pos=2),
      "s": StringCol(itemsize=30, shape=(), dflt=b'', pos=3)}
      byteorder := 'little'
      chunkshape := (1213,)
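
The correctly spelled call can be wrapped in a small helper to check that min_itemsize is honored once the columns are stored as data columns. This is only a minimal sketch and not part of the original answer: the name append_with_data_columns and its arguments are made up for illustration, it assumes the optional PyTables package ("tables") is installed, and it reads the stored sizes from the table's coldtypes mapping.

```python
import pandas as pd


def append_with_data_columns(path, key, df, min_itemsize=None):
    """Write df to a new HDF5 table with every column as a data column.

    Returns {column name: itemsize in bytes} for the string columns of
    the resulting PyTables table. Hypothetical helper for illustration.
    """
    # mode="w" creates the file from scratch (an existing file is replaced).
    with pd.HDFStore(path, mode="w") as store:
        # Note the plural keyword: data_columns, not data_column.
        store.append(
            key,
            df,
            data_columns=True,
            index=False,
            min_itemsize=min_itemsize,
        )
        table = store.get_storer(key).table
        # coldtypes maps each column name to its NumPy dtype;
        # fixed-width byte strings have kind "S".
        return {
            name: dtype.itemsize
            for name, dtype in table.coldtypes.items()
            if dtype.kind == "S"
        }
```

For example, append_with_data_columns("test1.h5", "one_key", df, min_itemsize={"col1": 250}) should report an itemsize of 250 for col1, assuming col1 is a string column whose values all fit in 250 bytes.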
    

    OLD Answer:

    As far as I know, min_itemsize only takes effect on the first append, when the table is created.

    Demo:

    In [33]: df
    Out[33]:
       num                 s
    0   11  aaaaaaaaaaaaaaaa
    1   12    bbbbbbbbbbbbbb
    2   13     ccccccccccccc
    3   14       ddddddddddd
    
    In [34]: store = pd.HDFStore(r'D:\temp\.data\my_test.h5')
    
    In [35]: store.append('test_1', df, data_columns=True)
    
    In [36]: store.get_storer('test_1').table.description
    Out[36]:
    {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "num": Int64Col(shape=(), dflt=0, pos=1),
      "s": StringCol(itemsize=16, shape=(), dflt=b'', pos=2)}
    
    In [37]: df.loc[4] = [15, 'X'*200]
    
    In [38]: df
    Out[38]:
       num                                                  s
    0   11                                   aaaaaaaaaaaaaaaa
    1   12                                     bbbbbbbbbbbbbb
    2   13                                      ccccccccccccc
    3   14                                        ddddddddddd
    4   15  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
    
    In [39]: store.append('test_1', df, data_columns=True)
    ...
    skipped
    ...
    ValueError: Trying to store a string with len [200] in [s] column but
    this column has a limit of [16]!
    Consider using min_itemsize to preset the sizes on these columns    
    

    Now using min_itemsize, but still appending to the existing test_1 table:

    In [40]: store.append('test_1', df, data_columns=True, min_itemsize={'s':250})
    ...
    skipped
    ...
    ValueError: Trying to store a string with len [250] in [s] column but
    this column has a limit of [16]!
    Consider using min_itemsize to preset the sizes on these columns
    

    The following works if we create a new object in the store:

    In [41]: store.append('test_2', df, data_columns=True, min_itemsize={'s':250})
    

    Check column sizes:

    In [42]: store.get_storer('test_2').table.description
    Out[42]:
    {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "num": Int64Col(shape=(), dflt=0, pos=1),
      "s": StringCol(itemsize=250, shape=(), dflt=b'', pos=2)}
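
Because the column width is fixed when the table is first created, it can help to compute min_itemsize up front from all the data you expect to append. The helper below is a rough sketch under stated assumptions: string_itemsizes, headroom and encoding are hypothetical names, and it assumes the stored width corresponds to the UTF-8 encoded byte length of the longest string.

```python
import pandas as pd


def string_itemsizes(frames, headroom=0, encoding="utf-8"):
    """Return {column: longest encoded string length + headroom}.

    frames is an iterable of DataFrames that will all be appended to the
    same table. Hypothetical helper for illustration only.
    """
    sizes = {}
    for df in frames:
        for name in df.columns:
            col = df[name]
            # Skip numeric and other non-string columns.
            if not pd.api.types.is_string_dtype(col):
                continue
            # Measure the encoded byte length, ignoring missing values.
            longest = max(
                (len(str(v).encode(encoding)) for v in col.dropna()),
                default=0,
            )
            sizes[name] = max(sizes.get(name, 0), longest + headroom)
    return sizes


df = pd.DataFrame({"num": [11, 12], "s": ["a" * 16, "X" * 200]})
print(string_itemsizes([df], headroom=50))
```

The resulting dict can be passed as min_itemsize on the first append, together with data_columns=True, so that later appends of the same frames fit.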