Search code examples
pythonpandasparquetpyarrow

pandas.to_parquet pyarrow.lib.ArrowInvalid: Could not convert Timedelta


I have a huge multiindex dataframe in long format. There's only one "value" column. Some entries in "value" are of type pd.Timedelta.

I got an error when trying to save that dataframe as parquet file using pd.to_parquet:

pyarrow.lib.ArrowInvalid: ("Could not convert Timedelta('2903 days 00:00:00') with type Timedelta: tried to convert to double", 'Conversion failed for column value with type object')

If I convert the particular value from the error message to numpy, I get the following:

array([Timedelta('2903 days 00:00:00')], dtype=object)

I set up a toy example to check if it is possible at all to convert pd.Timedelta to parquet. The following code works just fine:

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("A", "x"), ("A", "y"), ("B", "x"), ("B", "y")],
    names=["idx1", "idx2"])
data = {"value": [
    pd.Timedelta(days=10),
    2.5,
    pd.Timedelta(days=20),
    5
]}

df = pd.DataFrame(data=data, index=idx)
df.to_parquet("test.parquet")

x = pd.read_parquet("test.parquet")

A simple df.iloc[0, :].to_numpy() delivers the exact same type as in my real dataframe: array([Timedelta('10 days 00:00:00')], dtype=object).

I am wondering what might be the difference of my original dataframe compared to my toy example?


Solution

  • This part of your error 'Conversion failed for column value with type object', indicates that your column is a mixture between floats and time delta, as you expected.

    The problem comes from the fact that pyarrow when you have a big dataframe infers, the types from the first rows of the pandas dataframe. Most probably in your case they are all floats. This code (the same example you have with a small modification) will reproduce your error:

    import pandas as pd
    
    float_vals = [2.5 for i in range(10000)]
    float_index_val = [("float", i) for i in range(10000)]
    float_vals.extend([("A", "x"), ("A", "y"), ("B", "x"), ("B", "y")])
    float_vals.extend([
        pd.Timedelta(days=2903),
        2.5,
        pd.Timedelta(days=20),
        5
    ])
    
    idx = pd.MultiIndex.from_tuples(
        float_vals,
        names=["idx1", "idx2"])
    data = {"value": float_vals}
    
    df = pd.DataFrame(data=data, index=idx)
    df.to_parquet("test.parquet")
    
    x = pd.read_parquet("test.parquet")
    

    The problem here is that a column in parquet cannot have multiple types. In your example if you load the saved parquet you will see that everything has been converted to timedelta. For example x = pd.read_parquet("test.parquet") follow byx.iloc[1, :].to_numpy() delivers this array([2], dtype='timedelta64[us]')

    If the float values do not represent a delta time you should do as the comment suggest and unpivot the data. If the float values represent a deltatime, you can transform them before saving, or pass a schema when saving.

    df.value = pd.to_timedelta(df.value)
    df.to_parquet("test.parquet")
    

    or

    df.to_parquet("test.parquet", schema=pa.schema([("value", pa.duration("s"))]))