I'm using Python from the Anaconda distribution with pyarrow installed. I started with a dataset of 166 columns; during my first pass over the data I had to decompose many of them into dummy variables, which took it up to 915 columns, and in the refinement phase I had to bin some data, growing it to 1880 columns.
At 915 columns I was already unable to save the file as HDF, so I moved to Parquet; now, in this last phase, Parquet is failing on me with the message ArrowNotImplementedError: Fields with more than one child are not supported.
Fortunately I was able to write it as CSV, but that takes almost 3 GB on my drive, and I would like to understand what this error means. The columns are simple: they are either categories or binary (numeric), that is all. I have some missing values, but I'm training with XGBoost, so that is not a problem.
Does anyone know why Parquet suddenly fails to save my file just because the number of columns increased? I have run describe(), info(), and many other operations without any problem. I have even trained the XGBoost model without saving the data, but it takes too long to aggregate all those columns each time.
data.to_parquet("../data/5_all_data.parquet") => did not work
ArrowNotImplementedError: Fields with more than one child are not supported.
data.to_hdf("../data/5_all_data.h5", key="data") => did not work (see the format="table" sketch after this list)
NotImplementedError: Cannot store a category dtype in a HDF5 dataset that uses format="fixed". Use format="table".
data.to_csv("../data/5_all_data.csv") => worked
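
As a side note on the HDF attempt above, the error message itself points at a workaround: the default format="fixed" store cannot hold category dtypes, but format="table" (the PyTables table layout) can. A minimal sketch, assuming the same data DataFrame as above; I have not verified it against a frame this wide:

# format="table" stores the frame through PyTables' table layout,
# which accepts category dtypes, unlike the default format="fixed".
data.to_hdf("../data/5_all_data.h5", key="data", format="table")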
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 605847 entries, 630776 to 1049122
Data columns (total 1880 columns):
dtypes: category(118), float64(88), int64(38), uint8(1636)
memory usage: 1.6 GB
Any help, please?
The problem is that the error message is not helpful here. The real problem, in my case, was that two columns had the exact same name. After changing package versions up and down, changing column types, and a bunch of other things, all I had to do was rename the duplicated columns, and then I could save to Parquet with any version of the package.
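
For reference, a minimal sketch of that check and fix, assuming the DataFrame is called data as in the question; dedup_columns is a hypothetical helper, and the _1, _2 suffix scheme is just one way to make the names unique:

def dedup_columns(df):
    # Append _1, _2, ... to every repeated column name so all names become unique.
    counts = {}
    new_cols = []
    for col in df.columns:
        if col in counts:
            counts[col] += 1
            new_cols.append(f"{col}_{counts[col]}")
        else:
            counts[col] = 0
            new_cols.append(col)
    out = df.copy()
    out.columns = new_cols
    return out

# List the column names that appear more than once.
print(list(data.columns[data.columns.duplicated()].unique()))

data = dedup_columns(data)
data.to_parquet("../data/5_all_data.parquet")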