python · pandas · parquet · floating-accuracy · precision-recall

pandas messes up multi-level index float accuracy when writing to parquet


I have some data with a multi-level index & columns. My goal is to store the data as float32 instead of double/float64 in a parquet file, to save disk space. I don't care about the accuracy of the cell values, but I do want the index & column labels themselves (here the number level) to remain float64 and stay exact.

However, in the following example, 7080.39 gets turned into 7080.39013671875. I understand how that probably happens, but I can't figure out a way to prevent it.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"col1": [1.0, 2.0, 3.0], "col2": [2.3, 2.4, 2.5], "col3": [3.1, 3.2, 3.3]})
df.index = pd.MultiIndex.from_tuples([('a', '2021-01-01', 100), ('a', '2021-01-01', 200), ('a', '2021-01-01', 7080.39)], names=('name', 'date', 'number'))
df.columns = pd.MultiIndex.from_tuples([('a', '2021-01-01', 100), ('a', '2021-01-01', 200), ('a', '2021-01-01', 7080.39)], names=('name', 'date', 'number'))

# write to parquet as float64/double
table = pa.Table.from_pandas(df)
pq.write_table(table, 'double.parquet')

# cast to float32 and write to parquet
schema = pa.schema(
    [pa.field(field.name,
              pa.float32() if field.type == pa.float64() else field.type)
     for field in table.schema],
    metadata=table.schema.metadata)
table = table.cast(schema)
pq.write_table(table, 'float.parquet')

# now I read the data back, but as you can see, the "number" level in df2's
# index & columns is messed up: I didn't change its type, yet 7080.39 becomes
# 7080.39013671875, which is the float32 representation of 7080.39
df1 = pd.read_parquet('double.parquet')
df2 = pd.read_parquet('float.parquet')

df1.index
#MultiIndex([('a', '2021-01-01',   100.0),
#            ('a', '2021-01-01',   200.0),
#            ('a', '2021-01-01', 7080.39)],
#           names=['name', 'date', 'number'])

df2.index
#MultiIndex([('a', '2021-01-01',            100.0),
#            ('a', '2021-01-01',            200.0),
#            ('a', '2021-01-01', 7080.39013671875)],
#       names=['name', 'date', 'number'])
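
For reference, that trailing value is exactly the float32 rounding of 7080.39 (numpy is used here only to illustrate):

import numpy as np

# 7080.39 has no exact 32-bit representation; widening the nearest float32
# back to float64 gives the value that shows up in the index
float(np.float32(7080.39))   # 7080.39013671875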

This is really annoying because I need to join this DataFrame with many other pieces of data, and 7080.39013671875 no longer finds a match.

Any suggestions on how to fix it? (Not how to convert 7080.39013671875 back to 7080.39, but how to prevent this from happening in the first place, because I don't want the type of my index to change.)


Solution

  • If you do print(table.schema) you can see that it goes from:

    ('a', '2021-01-01', '100.0'): double
    ('a', '2021-01-01', '200.0'): double
    ('a', '2021-01-01', '7080.39'): double
    name: string
    date: string
    number: double
    

    to (after the cast):

    ('a', '2021-01-01', '100.0'): float
    ('a', '2021-01-01', '200.0'): float
    ('a', '2021-01-01', '7080.39'): float
    name: string
    date: string
    number: float
    

    You just need to make sure that the index columns (name, date and number) are not cast from double/float64 to float/float32; only the flattened data columns, whose field names are the stringified column tuples, should be downcast. Here's a hacky way to do that, keying on the '(' that only those names contain:

    schema = pa.schema(
        [pa.field(field.name,
                  pa.float32() if field.type == pa.float64() and '(' in field.name
                  else field.type)
         for field in table.schema],
        metadata=table.schema.metadata)
    table = table.cast(schema)
    

    Which gives you:

    ('a', '2021-01-01', '100.0'): float
    ('a', '2021-01-01', '200.0'): float
    ('a', '2021-01-01', '7080.39'): float
    name: string
    date: string
    number: double
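
    A slightly less fragile variant of the same idea (a sketch, assuming the table was built with pa.Table.from_pandas, so the schema's pandas metadata lists the serialized index columns under "index_columns") is to skip anything recorded there instead of keying on '(' in the field name:

    # The b'pandas' metadata written by Table.from_pandas records which columns
    # back the DataFrame index; leave those alone and downcast only the
    # remaining float64 data columns.
    index_cols = set(
        c for c in table.schema.pandas_metadata.get("index_columns", [])
        if isinstance(c, str)  # RangeIndex entries are dicts, not column names
    )
    schema = pa.schema(
        [pa.field(field.name,
                  pa.float32()
                  if field.type == pa.float64() and field.name not in index_cols
                  else field.type)
         for field in table.schema],
        metadata=table.schema.metadata)
    table = table.cast(schema)

    Either way, reading float.parquet back should now give you float32 cell values while the number level of the index stays at 7080.39.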