Pyarrow keeps converting string to binary using Pandas


I am trying to convert a CSV file to Parquet using pandas and pyarrow in Python 2.7.

My problem is that pa.Table.from_pandas(df) does not keep string columns as strings. It keeps converting their data type to 'binary', and this makes AWS Glue very unhappy.

I have tried passing a customized schema, but it does not work.

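import pandas as pd
import pyarrow as pa
from StringIO import StringIO  # Python 2.7 module

# dtypes is assumed to map each column name to a pyarrow type, e.g. pa.string()
fields = []
for name, dtype in dtypes.items():
    fields.append(pa.field(name, dtype))
my_schema = pa.schema(fields)
df = pd.read_csv(StringIO(file), delimiter="\t")  # file holds the CSV text
table = pa.Table.from_pandas(df, schema=my_schema)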

Previously I specified the data types when reading in the CSV, but that did not work either. I also tried replace_schema_metadata(), but that does not help much because it only changes the metadata, not the actual schema.


Solution

  • Python 2's str type is a byte string, which has the same content as Parquet's definition of BINARY, so every column holding str objects is saved as binary. When read back in Python 3, those columns should then load correctly as bytes. To store a column as string / UTF-8 in Parquet, convert its values to unicode objects before calling pa.Table.from_pandas(df).
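
The conversion described above could be sketched roughly as follows. This is a minimal sketch, not code from the original answer: the helper names to_text, decode_object_columns and to_arrow_table are hypothetical, and it assumes the data is UTF-8 encoded. Non-string values such as NaN are left untouched.

```python
def to_text(value, encoding="utf-8"):
    """Decode a byte string to text; return any other value unchanged."""
    # On Python 2, `bytes` is an alias of `str`, so this catches plain str objects.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def decode_object_columns(df, encoding="utf-8"):
    """Return a copy of a pandas DataFrame whose object columns hold text, not bytes."""
    out = df.copy()
    for col in out.columns:
        # Only object-dtype columns can contain Python byte strings.
        if out[col].dtype == object:
            out[col] = out[col].map(lambda v: to_text(v, encoding))
    return out


def to_arrow_table(df, encoding="utf-8"):
    """Build a pyarrow Table after decoding byte-string columns (hypothetical helper)."""
    import pyarrow as pa  # imported lazily so the helpers above work without pyarrow

    return pa.Table.from_pandas(decode_object_columns(df, encoding))
```

With the DataFrame from the question, table = to_arrow_table(df) would replace the direct pa.Table.from_pandas(df) call, and table.schema can then be printed to check that the columns are reported as string rather than binary.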