python pandas datetime parquet

Parquet File datetime value mismatch


When I read a parquet file with pandas, the values in the datetime field change. For example, the field reads 2021-02-07 10:43:20.067 from the parquet file, but the actual value should be 2021-02-07 6:43:20. Likewise, for some records the datetime columns are off by +4 or +5 hours. The date, minutes, and seconds are the same; only the hours field changes.

This is the code I am using:

import pandas as pd
df = pd.read_parquet('filename.parquet')

The data type of all the fields in the parquet file is datetime64[ns].


Solution

  • From your description, I suspect this has to do with timezones that observe daylight saving time. If you think this is the reason, then I would do:

    pd.to_datetime(df['datetime'])\
        .dt.tz_localize('UTC')\
        .dt.tz_convert('Europe/Berlin')
    

    The first line makes sure the column has a datetime dtype, so the .dt accessor in the subsequent lines will work.

    The second line marks the timezone-naive values as UTC, on the assumption that UTC is the timezone used when the parquet file was produced.

    The third line converts the values to the timezone that you are interested in; 'Europe/Berlin' is only an example.
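
    To see why a daylight-saving timezone would produce two different hour shifts, here is a minimal sketch. It assumes the stored values are UTC and uses a small hypothetical DataFrame instead of the parquet file; 'America/New_York' is also an assumption, picked only because its UTC offsets (-5 in winter, -4 in summer) match the 4 and 5 hour differences described in the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the parquet column:
# timezone-naive timestamps that are assumed to have been written in UTC.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2021-02-07 10:43:20",  # winter date
        "2021-07-07 10:43:20",  # summer date
    ])
})

# Mark the naive values as UTC, then convert to the assumed local timezone.
local = (
    df["datetime"]
    .dt.tz_localize("UTC")
    .dt.tz_convert("America/New_York")
)

# UTC offset of each converted value, in hours.
offsets = [ts.utcoffset().total_seconds() / 3600 for ts in local]
print(offsets)  # [-5.0, -4.0]

# Drop the timezone info again to get naive local wall-clock times.
naive_local = local.dt.tz_localize(None)
print(naive_local.tolist())
```

    If the offsets printed for your own data match the shifts you see, the file most likely stores UTC and the values you expect are local times, so the conversion above is only needed for display.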