Tags: python, dask, parquet, apache-drill, pydrill

Storing date/timestamp columns in Parquet with Dask


I have a Dask data frame that has two columns, a date and a value.

I store it like so:

ddf.to_parquet('/some/folder', engine='pyarrow', overwrite=True)

I'm expecting Dask to store the date column as a date in Parquet, but when I query it with Apache Drill I get 16-digit numbers (presumably timestamps) instead of dates. For example I get:

1546300800000000 instead of 2019-01-01

1548979200000000 instead of 2019-02-01

Is there a way to tell Dask to store columns as dates? How can I run a select with Apache Drill and get the dates? I tried using SELECT CAST in Drill but it doesn't convert the numbers to dates.


Solution

  • Not sure if it is relevant for you, but it seems that you are interested only in the date value (ignoring hours, minutes, etc.). If so, you can explicitly convert the timestamp information into a date using .dt.date:

    import pandas as pd
    import dask.dataframe as dd
    
    sample_dates = [
        '2019-01-01 00:01:00',
        '2019-01-02 05:04:02',
        '2019-01-02 15:04:02'
    ]
    
    df = pd.DataFrame(zip(sample_dates, range(len(sample_dates))), columns=['datestring', 'value'])
    
    ddf = dd.from_pandas(df, npartitions=2)
    
    # convert to timestamp and calculate as unix time (relative to 1970)
    ddf['unix_timestamp_seconds'] = (ddf['datestring'].astype('M8[s]') - pd.to_datetime('1970-01-01')).dt.total_seconds()
    
    # convert to timestamp format and extract dates
    ddf['datestring'] = ddf['datestring'].astype('M8[s]').dt.date
    
    # write to Parquet; coerce_timestamps is passed on to the pyarrow writer
    ddf.to_parquet('test.parquet', engine='pyarrow', write_index=False, coerce_timestamps='ms')
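
    As a quick sanity check (a minimal sketch; 'test.parquet' refers to the dataset written by the snippet above), you can read the data back with Dask and confirm that the datestring column now holds dates rather than raw integer timestamps:

    import dask.dataframe as dd
    
    # read the freshly written dataset back and inspect dtypes and rows
    ddf_check = dd.read_parquet('test.parquet', engine='pyarrow')
    print(ddf_check.dtypes)
    print(ddf_check.head())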
    

    For the time conversion, you can use .astype or dd.to_datetime (see the answers to this question). There is also a very similar question and answer, which suggests that making sure the timestamp is downcast to ms resolution resolves the issue.
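
    For reference, here is a minimal sketch of the dd.to_datetime alternative, assuming the same ddf as in the snippet above:

    # equivalent conversion using dd.to_datetime instead of .astype('M8[s]')
    ddf['datestring'] = dd.to_datetime(ddf['datestring']).dt.date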

    So, playing around with the values you provided, it's possible to see that the core problem is a mismatch in the time unit of the stored values (they are microseconds since the Unix epoch):

    # both yield: Timestamp('2019-01-01 00:00:00')
    
    pd.to_datetime(1546300800000000*1000, unit='ns')
    pd.to_datetime(1546300800000000/1000000, unit='s')
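
    On the Drill side, since the stored values are microseconds since the epoch, one option is to divide by 1,000,000 and convert with Drill's TO_TIMESTAMP (which expects seconds). The sketch below uses pydrill against a local Drill instance; the table path dfs.`/some/folder` and the column name ts are placeholders for your actual dataset:

    from pydrill.client import PyDrill
    
    drill = PyDrill(host='localhost', port=8047)
    
    # TO_TIMESTAMP expects seconds since the epoch, so scale the
    # microsecond values down by 1,000,000 before converting
    result = drill.query("""
        SELECT TO_TIMESTAMP(ts / 1000000) AS ts_converted
        FROM dfs.`/some/folder`
        LIMIT 10
    """)
    
    for row in result:
        print(row)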