I am using Python 2.7 with pandas and HDFStore.
I'm trying to process a big data set that fits on disk but not in memory.
I store the data set in an .h5 file, and the size of the data in each column is not constant: for instance, one column may hold a 5-character string in one row and a 20-character string in another.
So I had issues writing the data to the file in iterations, when the first batch contained small values and the following batches contained larger ones.
I found that the issue was that min_itemsize was not set properly, so the data did not fit into the allocated columns. I used the following code to cache the database into the .h5 file without errors:
import pandas as pd

# query the database for the maximum character length of each column,
# so min_itemsize can reserve enough room for the widest value
colsLen = {}
for col in dbCols:
    curs.execute('SELECT MAX(CHAR_LENGTH(%s)) FROM table' % col)
    for a in curs:
        colsLen.update({col: a[0]})

hdf = pd.HDFStore(hdfFilePath)  # hdfFilePath: path of the output .h5 file

# get the first row to create the table in the HDFStore
rx = dbConAndQuery.dbTableToDf(con, table, limit=1, offset=0)  # my utility that queries the db into a dataframe
hdf.put("table", rx, format="table", data_columns=True, min_itemsize=colsLen)

# append the remaining rows in batches, reusing the same column sizes
for i in range(rxRowCount / batchSize + 1):
    rx = dbConAndQuery.dbTableToDf(con, table, limit=batchSize, offset=i * batchSize + 1)
    hdf.append("table", rx, format="table", data_columns=True, min_itemsize=colsLen)

hdf.close()
The issue is: how can I use HDFStore in cases where I can't query for the maximum size of each column's data in advance? e.g. when getting or creating the data in iterations due to memory constraints.
I found that I can process the data using dask with on-disk dataframes, but it lacks some functionality that I need in pandas, so the main idea is to process the data in batches, appending each batch to the existing HDFStore file.
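For reference, a minimal sketch of the dask approach, assuming the data is already in an HDF file (the file name 'table.h5', key 'table', and column 'some_column' are just placeholders):

import dask.dataframe as dd

# read the HDF file lazily; dask processes it in chunks on disk
# instead of loading the whole table into memory
ddf = dd.read_hdf('table.h5', key='table')
result = ddf.groupby('some_column').size().compute()  # placeholder column name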
Thanks!
I found that the issue was that HDFStore optimizes the storage by sizing each string column to fit the largest value it has seen when the table is created, so later batches with longer values no longer fit.
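For example, a minimal reproduction of the failure (the file name 'demo.h5', key 't', and column 'name' are just illustrative):

import pandas as pd

hdf = pd.HDFStore('demo.h5')
# the first put() fixes the 'name' column width at 5 characters
hdf.put('t', pd.DataFrame({'name': ['abcde']}), format='table', data_columns=True)
# appending a 20-character string exceeds that width and raises a ValueError
hdf.append('t', pd.DataFrame({'name': ['x' * 20]}), format='table', data_columns=True)
hdf.close()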
I found two ways to solve this:
1. Pre-query the database to get the maximum character length of each column and pass it as min_itemsize (as in the code above).
2. Insert each batch under a new key in the file; then it works, because each batch is written using its own largest values to size the columns.
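A minimal sketch of the second approach (the batch generator get_next_batch() and the file name are hypothetical, standing in for whatever produces the batches):

import pandas as pd

hdf = pd.HDFStore('table.h5')
# write each batch under its own key, so each key's string columns are
# sized by that batch's longest values only
for i, batch in enumerate(get_next_batch()):
    hdf.put('table_batch_%d' % i, batch, format='table', data_columns=True)
hdf.close()

# later, process the data one key (one batch) at a time
hdf = pd.HDFStore('table.h5')
for key in hdf.keys():
    df = hdf[key]
    # ... process df in memory ...
hdf.close()

The trade-off is that the data is no longer a single appendable table, so reading it back means iterating over the keys.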