Tags: pandas, parquet, pyarrow, huggingface, huggingface-datasets

Arrow-related error when pushing a dataset to the Hugging Face Hub


I have a problem with my dataset.

The (future) dataset is a pandas DataFrame that I loaded from a pickle file, and the DataFrame itself behaves correctly. My code is:

from datasets import Dataset

dataset = Dataset.from_pandas(df)
dataset.push_to_hub("username/my_dataset", private=True)

Because I thought it was pandas' fault, I also tried:

dataset = Dataset.from_dict(df_sentences.to_dict(orient='list'))
dataset.push_to_hub("username/my_dataset", private=True)

I also tried loading it from a file.

The error I get is:

ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: string

My dataset consists of four string columns and one integer column, with around 3600 rows.


Solution

  • Without a reproducible sample this is hard to test, but one option is to convert the string columns to the string[pyarrow] dtype:

    dtypes = {
        'column_a': 'string[pyarrow]',
        'col_b': 'string[pyarrow]',
        ...
    }

    df_converted = df.astype(dtypes)
    # proceed with the push
    

    If possible, I would also upgrade to the latest versions, especially of pyarrow and pandas.