python, pandas, parquet, pyarrow

How to change pyarrow Table column precision for multi-level index/column DataFrames


I have a pyarrow.Table that's created from a pandas DataFrame:

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"col1": [1.0, 2.0], "col2": [2.3, 2.4]})
    df.columns = pd.MultiIndex.from_tuples([('a', 100), ('b', 200)], names=('name', 'number'))
    df.index = pd.MultiIndex.from_tuples([('a', 100), ('b', 200)], names=('name', 'number'))

    table = pa.Table.from_pandas(df)

The original df has thousands of columns and rows, all of dtype float64, so every column becomes double when I convert to a pyarrow Table.

How can I change them all to float32?

I tried the following:

    schema = pa.schema([pa.field("('a',100)", pa.float32()),
                        pa.field("('b',200)", pa.float32())])
    table = pa.Table.from_pandas(df, schema=schema)

but that complains that the schema and the DataFrame do not match:

    KeyError: "name '('a',100)' present in the specified schema is not found in the columns or index"


Solution

  • You can cast the table to the types you need:

    table = pa.Table.from_pandas(df)
    # After from_pandas, the MultiIndex column labels are flattened to
    # strings and the index levels ("name", "number") become columns.
    table = table.cast(pa.schema([("('a', '100')", pa.float32()),
                                  ("('b', '200')", pa.float32()),
                                  ("name", pa.string()),
                                  ("number", pa.string())]))
    

    I doubt you will find a way to provide a working schema to Table.from_pandas when the DataFrame uses a pandas MultiIndex. In that case the column name is a tuple, ('a', 100), but Arrow schema field names can only be strings, so you can never build a schema that refers to the same column names the DataFrame has. You can see the flattening directly, as shown below.
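
    Printing the column names of the converted table shows the flattened string labels (the exact rendering below is from one pyarrow version and may differ):

    print(pa.Table.from_pandas(df).column_names)
    # ["('a', '100')", "('b', '200')", 'name', 'number']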

    That's why casting afterwards works: once the Arrow table exists (and all column names have become strings), you can finally pass the string form of each column name to the cast function.
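
    Since the real DataFrame has thousands of columns, spelling out the cast schema by hand is impractical. Below is a minimal sketch of a programmatic variant (the downcast_doubles helper is illustrative, not part of pyarrow): rebuild the schema from table.schema, swap every float64 field for float32, and reattach the original schema metadata so pandas can still rebuild the MultiIndex on to_pandas().

    import pyarrow as pa

    def downcast_doubles(table: pa.Table) -> pa.Table:
        # Illustrative helper: cast every float64 column to float32,
        # leaving all other columns (e.g. the index columns) unchanged.
        fields = [
            pa.field(f.name, pa.float32()) if f.type == pa.float64() else f
            for f in table.schema
        ]
        # Keep the original schema metadata; it stores the pandas
        # index/column information that to_pandas() uses.
        return table.cast(pa.schema(fields, metadata=table.schema.metadata))

    table = downcast_doubles(pa.Table.from_pandas(df))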