Search code examples
python-polarsrust-polars

Polars: Date formatting with 'custom' nanosecond precision


I am new to polars, and i was trying to convert a flat file datetime type column to a dataframe column with 1 nanosecond precision. For example:

1974-08-08 16:28:51 => 1974-08-08 16:28:51.0
1997-12-12 17:56:19 => 1997-12-12 17:56:19.0

Flat File:

┌─────────────────────┐
│ DATETIME_COL        │
╞═════════════════════╡
│ 1974-08-08 16:28:51 │
│ 1997-12-12 17:56:19 │
│ 1986-05-24 09:04:12 │
│ 2008-01-13 05:31:57 │
│ …                   │
│ 2001-03-24 22:00:13 │
│ 2015-06-16 05:11:43 │
│ 2019-01-15 17:30:22 │
│ 2003-10-25 00:16:40 │
└─────────────────────┘

What I tried?

df = pl.read_csv(...)

df = df.with_columns(
    pl.coalesce(
        pl.col("DATETIME_COL")
          .str.strptime(pl.Datetime, format="%Y-%m-%d %H:%M:%S")
          .dt.strftime("%y-%m-%d %H:%M:%S.%3f"),
        pl.lit("2000-11-11 00:00:00.0"),
    )
)

Output:

┌───────────────────────┐
│ DATETIME_COL          │
│ ---                   │
│ str                   │
╞═══════════════════════╡
│ 74-08-08 16:28:51.000 │
│ 01-03-24 22:00:13.000 │
│ 19-01-15 17:30:22.000 │
│ 81-12-15 17:43:59.000 │
│ …                     │
│ 03-10-25 00:16:40.000 │
│ 71-04-23 13:05:49.000 │
│ 86-05-24 09:04:12.000 │
│ 08-01-13 05:31:57.000 │
└───────────────────────┘

Any pointers how to get output with 1 nanosecond precision will be really helpful. Thanks a lot in advance for your time.


Solution

  • Although it doesn't show up on any row of the output, you can use dt.cast_time_unit to make the column a data type of datetime[ns] (default is us), so the nanosecond precision is there on future calculations. For example:

    df = df.with_columns(pl.col('DATETIME_COL').dt.cast_time_unit('ns'))
    df.with_columns(add=pl.col('DATETIME_COL') + pl.duration(nanoseconds = pl.col('DATETIME_COL').dt.second()))
    
    shape: (8, 2)
    ┌─────────────────────┬───────────────────────────────┐
    │ DATETIME_COL        ┆ add                           │
    │ ---                 ┆ ---                           │
    │ datetime[ns]        ┆ datetime[ns]                  │
    ╞═════════════════════╪═══════════════════════════════╡
    │ 1974-08-08 16:28:51 ┆ 1974-08-08 16:28:51.000000051 │
    │ 1997-12-12 17:56:19 ┆ 1997-12-12 17:56:19.000000019 │
    │ 1986-05-24 09:04:12 ┆ 1986-05-24 09:04:12.000000012 │
    │ 2008-01-13 05:31:57 ┆ 2008-01-13 05:31:57.000000057 │
    │ 2001-03-24 22:00:13 ┆ 2001-03-24 22:00:13.000000013 │
    │ 2015-06-16 05:11:43 ┆ 2015-06-16 05:11:43.000000043 │
    │ 2019-01-15 17:30:22 ┆ 2019-01-15 17:30:22.000000022 │
    │ 2003-10-25 00:16:40 ┆ 2003-10-25 00:16:40.000000040 │
    └─────────────────────┴───────────────────────────────┘
    

    Another option is directly in the read_csv call, use the dtypes keyword argument to specify an exact data type, and pl.Datetime supports setting the time unit right there and then:

    dtypes = {'DATETIME_COL' : pl.Datetime(time_unit = 'ns')}