Tags: python, r, pandas, list, feather

Error when trying to write DataFrame to feather. Does feather support list columns?


I'm working with both R and Python and I want to write one of my pandas DataFrames as a feather so I can work with it more easily in R. However, when I try to write it as a feather, I get the following error:

ArrowInvalid: trying to convert NumPy type float64 but got float32

I double-checked my column types, and they are already float64:

In[1]
df.dtypes

Out[1]
id         object
cluster    int64
vector_x   float64
vector_y   float64

I get the same error regardless of using feather.write_dataframe(df, "path/df.feather") or df.to_feather("path/df.feather").

I saw these two issues but couldn't tell whether they were related: https://issues.apache.org/jira/browse/ARROW-1345 and https://github.com/apache/arrow/issues/1430

In the end, I can just save it as a CSV and change the columns in R (or just do the whole analysis in Python), but I was hoping to use feather.

Edit 1:

I'm still having the same issue despite the great advice below, so I'm updating this with what I've tried.

df[['vector_x', 'vector_y', 'cluster']] = df[['vector_x', 'vector_y', 'cluster']].astype(float)

df[['doc_id', 'text']] = df[['doc_id', 'text']].astype(str)

df[['doc_vector', 'doc_vectors_2d']] = df[['doc_vector', 'doc_vectors_2d']].astype(list)

df.dtypes

Out[1]:
doc_id           object
text             object
doc_vector       object
cluster          float64
doc_vectors_2d   object
vector_x         float64
vector_y         float64
dtype: object

Edit 2:

After much searching, it appears that the issue is that my cluster column is a list type made up of int64 integers. So I guess the real question is: does the feather format support lists?
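Since df.dtypes only reports object for such a column, here is a rough sketch of one way to find which columns actually hold lists. The helper name list_valued_columns and the sample data are made up for illustration; they are not from my real DataFrame.

```python
import numpy as np
import pandas as pd


def list_valued_columns(frame: pd.DataFrame) -> list:
    """Return the names of object columns with at least one list-like cell."""
    nested_types = (list, tuple, np.ndarray)
    found = []
    for name in frame.columns:
        column = frame[name]
        # Only object columns can hold arbitrary Python values such as lists.
        if column.dtype != object:
            continue
        if column.map(lambda value: isinstance(value, nested_types)).any():
            found.append(name)
    return found


# Hypothetical frame: "cluster" looks like a plain column but stores lists.
df = pd.DataFrame(
    {
        "doc_id": ["a", "b"],
        "cluster": [[1, 2], [3]],
        "vector_x": [0.1, 0.3],
    }
)

suspects = list_valued_columns(df)
print(suspects)
```

Any column this reports is a candidate for the kind of conversion error above.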

Edit 3:

Just to tie this in a bow, feather does not support nested data types like lists, at least not yet.


Solution

    • Luckily, I found the cause of my feather I/O error here.
    • I also found the solution: DataFrame.to_feather and pandas.read_feather are both based on pyarrow, and pyarrow has supported columns that contain lists as values since 2019.

    Solution:

    pip install --upgrade pyarrow  # my version is 1.0.0 and it works
    

    Then keep using df.to_feather("Filename") and pd.read_feather("Filename").