Tags: python, pandas, dataframe, bigdata, hdfstore

python pandas HDFStore append with non-constant size data


I am using Python 2.7 with pandas and HDFStore.

I am trying to process a big data set that fits on disk but not in memory.

I store a large data set in a .h5 file, and the size of the data in each column is not constant; for instance, one column may hold a string of 5 characters in one row and a string of 20 characters in another.

So I had issues writing the data to the file in iterations, when the first iteration contained small values and the following batches contained larger ones.

I found that the issue was that min_itemsize was not set properly and the data did not fit into the columns. I used the following code to cache the database into the .h5 file without errors:

    colsLen = {}
    for col in dbCols:
        # find the longest value in each column so min_itemsize is big enough
        curs.execute('SELECT MAX(CHAR_LENGTH(%s)) FROM table' % col)
        for a in curs:
            colsLen.update({col: a[0]})

    # get the first row to create the hdfstore with the right column sizes
    rx = dbConAndQuery.dbTableToDf(con, table, limit=1, offset=0)  # my utility that queries the db
    hdf.put("table", rx, format="table", data_columns=True, min_itemsize=colsLen)

    # append the remaining rows in batches (rxRowCount is the table's total row count)
    for i in range(rxRowCount / batchSize + 1):
        rx = dbConAndQuery.dbTableToDf(con, table, limit=batchSize, offset=i * batchSize + 1)
        hdf.append("table", rx, format="table", data_columns=True, min_itemsize=colsLen)

    hdf.close()

The issue is: how can I use HDFStore in cases where I can't query the maximum size of each column's data in advance, e.g. when getting or creating the data in iterations due to memory constraints?

I found that I can process the data using dask with on-disk dataframes, but it lacks some functionality that I need in pandas, so the main idea is to process the data in batches, appending it to the existing HDFStore file.
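
For reference, a minimal sketch of the dask route mentioned above (the file name, key and column name are hypothetical placeholders):

    import dask.dataframe as dd

    # Lazily open the on-disk HDF5 table; dask only loads chunks into memory as needed.
    # 'data.h5', '/table' and 'some_column' are placeholder names for illustration.
    ddf = dd.read_hdf('data.h5', '/table')

    # Operations build a lazy task graph; compute() materializes the result in memory.
    max_len = ddf['some_column'].str.len().max().compute()
    print(max_len)

As far as I know, dask's read_hdf needs the data stored in table format, which format="table" already gives you here.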

Thanks!


Solution

  • I found that the issue was HDF optimizing the data storage by relying on the size of the largest value seen in each column.

    I found two ways to solve this:

    1. Pre-query the database to get the maximum character length of each column (as in the code above).
    2. Insert each batch into a new key in the file. Then it works: each batch is written to the HDF file with its own largest value as the column's largest value. A sketch of this approach is shown below.
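
Here is a minimal sketch of the second option, assuming a hypothetical get_next_batch() generator that yields pandas DataFrames whose string sizes are unknown in advance:

    import pandas as pd

    # Write each batch under its own key, so each key's string columns are sized
    # from that batch's own largest values and no global min_itemsize is needed.
    # get_next_batch() is a hypothetical generator yielding DataFrames.
    with pd.HDFStore('data.h5') as hdf:
        for i, batch in enumerate(get_next_batch()):
            hdf.put('table_batch_%d' % i, batch, format='table', data_columns=True)

    # Later, process the data one key at a time to stay within memory limits.
    with pd.HDFStore('data.h5') as hdf:
        for key in hdf.keys():
            df = hdf[key]
            # ... process df ...

If the combined result fits in memory you can pd.concat the frames read from the keys; otherwise just keep processing them key by key.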