I have a huge MultiIndex dataframe in long format. There's only one "value" column. Some entries in "value" are of type pd.Timedelta.
I got an error when trying to save that dataframe as a parquet file using DataFrame.to_parquet:
pyarrow.lib.ArrowInvalid: ("Could not convert Timedelta('2903 days 00:00:00') with type Timedelta: tried to convert to double", 'Conversion failed for column value with type object')
If I convert the particular value from the error message to numpy, I get the following:
array([Timedelta('2903 days 00:00:00')], dtype=object)
I set up a toy example to check whether it is possible at all to write pd.Timedelta values to parquet. The following code works just fine:
import pandas as pd
idx = pd.MultiIndex.from_tuples(
    [("A", "x"), ("A", "y"), ("B", "x"), ("B", "y")],
    names=["idx1", "idx2"])
data = {"value": [
    pd.Timedelta(days=10),
    2.5,
    pd.Timedelta(days=20),
    5
]}
df = pd.DataFrame(data=data, index=idx)
df.to_parquet("test.parquet")
x = pd.read_parquet("test.parquet")
A simple df.iloc[0, :].to_numpy() delivers the exact same type as in my real dataframe: array([Timedelta('10 days 00:00:00')], dtype=object).
I am wondering what the difference might be between my original dataframe and my toy example?
This part of your error, 'Conversion failed for column value with type object', indicates that your column is a mixture of floats and timedeltas, as you suspected.
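As a quick check (not part of your post), you can count which Python types actually occur in the column; this assumes the column is really named "value":
df["value"].map(type).value_counts()  # shows how many floats vs. Timedelta objects the column holds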
The problem comes from the fact that, when the dataframe is big, pyarrow infers the types from the first rows of the pandas dataframe. Most probably, in your case, those first rows are all floats. This code (the same example you have, with a small modification) will reproduce your error:
import pandas as pd

# 10000 float rows come first, so pyarrow infers a double column ...
float_vals = [2.5 for i in range(10000)]
# keep the second index level all-strings so that only the "value" column mixes types
float_index_val = [("float", str(i)) for i in range(10000)]
float_index_val.extend([("A", "x"), ("A", "y"), ("B", "x"), ("B", "y")])
# ... followed by a few Timedelta rows at the end, which can no longer be converted to double
float_vals.extend([
    pd.Timedelta(days=2903),
    2.5,
    pd.Timedelta(days=20),
    5
])
idx = pd.MultiIndex.from_tuples(
    float_index_val,
    names=["idx1", "idx2"])
data = {"value": float_vals}
df = pd.DataFrame(data=data, index=idx)
df.to_parquet("test.parquet")  # raises pyarrow.lib.ArrowInvalid, as in your traceback
x = pd.read_parquet("test.parquet")
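You can also see the inference behaviour directly with pyarrow (this is not part of the original example; the variable name mixed is made up here): converting only the leading float rows succeeds and yields a double type, while converting the full column fails once the Timedelta is reached.
import pandas as pd
import pyarrow as pa

# an object column that starts with many floats and ends with a Timedelta
mixed = pd.Series([2.5] * 10000 + [pd.Timedelta(days=2903)], dtype=object)

print(pa.Array.from_pandas(mixed.head(100)).type)  # double, inferred from the leading floats

try:
    pa.Array.from_pandas(mixed)  # the trailing Timedelta cannot be coerced to double
except pa.ArrowInvalid as err:
    print(err)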
The problem here is that a column in a parquet file cannot hold multiple types. In your toy example, if you load the saved parquet file you will see that everything has been converted to timedelta. For example, x = pd.read_parquet("test.parquet") followed by x.iloc[1, :].to_numpy() delivers array([2], dtype='timedelta64[us]').
If the float values do not represent a time delta, you should do as the comment suggests and unpivot the data (there is a sketch of that at the end of this answer). If the float values do represent a time delta, you can convert them before saving, or pass a schema when saving:
df.value = pd.to_timedelta(df.value)
df.to_parquet("test.parquet")
or
import pyarrow as pa
df.to_parquet("test.parquet", schema=pa.schema([("value", pa.duration("s"))]))
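For the unpivot route mentioned above, here is a minimal sketch (not from the question; the column names value_td and value_float are made up): split the mixed column into one homogeneous column per type before saving.
import pandas as pd

# boolean mask marking which rows hold a Timedelta
is_td = df["value"].map(lambda v: isinstance(v, pd.Timedelta))

out = pd.DataFrame(index=df.index)
out["value_td"] = pd.to_timedelta(df["value"].where(is_td))    # Timedelta rows, NaT elsewhere
out["value_float"] = pd.to_numeric(df["value"].where(~is_td))  # numeric rows, NaN elsewhere
out.to_parquet("test_split.parquet")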