Search code examples
pythondataframepython-polarspython-holidays

Efficiently Marking Holidays in a Data Column


I'm trying to add a column that indicates whether a date is a holiday or not. I found some code online, but I believe there's a more efficient way to do it, possibly using a polar method instead of map elements and lambda.

Example code:

import polars as pl
import holidays

# Initialize the holidays for Chile
cl_holidays = holidays.CL()

# Sample data
data = {
    "Date": ["2024-06-20 00:00:00", "2024-06-21 00:00:00", "2024-06-22 00:00:00", "2024-06-23 00:00:00", "2024-06-24 00:00:00"],
    "Amount": [100, 200, 300, 400, 500],
    "User_Count" : [1, 2, 3, 4, 5]
}

# Create DataFrame
df = pl.DataFrame(data)

# Add a new column 'Is_Holiday' based on the Date column
df = df.with_columns(
    (pl.col("Date").map_elements(lambda x: x.split(" ")[0] in cl_holidays, return_dtype=pl.Boolean)).alias("Is_Holiday")
).with_columns(pl.col("Date").str.strptime(pl.Datetime))

df

Expected output:

shape: (5, 4)
┌─────────────────────┬────────┬────────────┬────────────┐
│ Date                ┆ Amount ┆ User_Count ┆ Is_Holiday │
│ ---                 ┆ ---    ┆ ---        ┆ ---        │
│ datetime[μs]        ┆ i64    ┆ i64        ┆ bool       │
╞═════════════════════╪════════╪════════════╪════════════╡
│ 2024-06-20 00:00:00 ┆ 100    ┆ 1          ┆ true       │
│ 2024-06-21 00:00:00 ┆ 200    ┆ 2          ┆ false      │
│ 2024-06-22 00:00:00 ┆ 300    ┆ 3          ┆ false      │
│ 2024-06-23 00:00:00 ┆ 400    ┆ 4          ┆ false      │
│ 2024-06-24 00:00:00 ┆ 500    ┆ 5          ┆ false      │
└─────────────────────┴────────┴────────────┴────────────┘

UPDATE: i tried with @ignoring_gravity aproach, and also tried changing the date format but i keep getting false instead of true

UPDATE2: If i try @Hericks aproach i keep getting false. (I'm using polars 0.20.31 )

import polars as pl
import holidays

# Initialize the holidays for Chile
cl_holidays = holidays.CL()

# Sample data
data = {
    "Date": ["2024-06-20 00:00:00", "2024-06-21 00:00:00", "2024-06-22 00:00:00", "2024-06-23 00:00:00", "2024-06-24 00:00:00"],
    "Amount": [100, 200, 300, 400, 500],
    "User_Count" : [1, 2, 3, 4, 5]
}

# Create DataFrame
df = pl.DataFrame(data)

# Add a new column 'Is_Holiday' based on the Date column
df.with_columns(
    Is_Holiday=pl.col('Date').str.to_datetime().dt.date().is_in(cl_holidays.keys())
)

Output:

shape: (5, 4)
┌─────────────────────┬────────┬────────────┬────────────┐
│ Date                ┆ Amount ┆ User_Count ┆ Is_Holiday │
│ ---                 ┆ ---    ┆ ---        ┆ ---        │
│ str                 ┆ i64    ┆ i64        ┆ bool       │
╞═════════════════════╪════════╪════════════╪════════════╡
│ 2024-06-20 00:00:00 ┆ 100    ┆ 1          ┆ false      │
│ 2024-06-21 00:00:00 ┆ 200    ┆ 2          ┆ false      │
│ 2024-06-22 00:00:00 ┆ 300    ┆ 3          ┆ false      │
│ 2024-06-23 00:00:00 ┆ 400    ┆ 4          ┆ false      │
│ 2024-06-24 00:00:00 ┆ 500    ┆ 5          ┆ false      │
└─────────────────────┴────────┴────────────┴────────────┘

Solution

  • After doing some research, I found that I need to pass the years into my 'holidays.CL' constructor

    import polars as pl
    import holidays
    
    # Initialize the holidays for Chile
    cl_holidays = holidays.CL()
    
    # Sample data
    data = {
        "Date": ["2024-06-20 00:00:00", "2024-06-21 00:00:00", "2024-06-22 00:00:00", "2024-06-23 00:00:00", "2024-06-24 00:00:00"],
        "Amount": [100, 200, 300, 400, 500],
        "User_Count" : [1, 2, 3, 4, 5]
    }
    
    # Create DataFrame
    df = pl.DataFrame(data)
    
    # Add a new column 'Is_Holiday' based on the Date column
    df.with_columns(
        pl.col("Date").str.to_date(format="%Y-%m-%d %H:%M:%S")
    ).with_columns(
        pl.col("Date").is_in(list(holidays.CL(years=[2024]).keys())).alias("holiday")
    )
    

    Output :

    shape: (5, 4)
    ┌────────────┬────────┬────────────┬─────────┐
    │ Date       ┆ Amount ┆ User_Count ┆ holiday │
    │ ---        ┆ ---    ┆ ---        ┆ ---     │
    │ date       ┆ i64    ┆ i64        ┆ bool    │
    ╞════════════╪════════╪════════════╪═════════╡
    │ 2024-06-20 ┆ 100    ┆ 1          ┆ true    │
    │ 2024-06-21 ┆ 200    ┆ 2          ┆ false   │
    │ 2024-06-22 ┆ 300    ┆ 3          ┆ false   │
    │ 2024-06-23 ┆ 400    ┆ 4          ┆ false   │
    │ 2024-06-24 ┆ 500    ┆ 5          ┆ false   │
    └────────────┴────────┴────────────┴─────────┘