Search code examples
python-polars

String manipulation of cell contents in polars


In polars, I am trying to perform selection of rows and create a new content based on string manipulation. However, the python string manipulation commands below don't work. I've seen polars uses regular expressions, but am unsure how to use this to create a number from the option_type column using everything before the '_'.

import polars as pl

columns = ['2022-03-01_bid', '2022-03-01_ask', 'option_type']

data = [
    [100.0, 110.0, '100_P'],
    [100.0, 110.0, '100_C'],
    [100.0, 110.0, '200_P'],
    [100.0, 110.0, '200_C'],
    [100.0, 110.0, '300_P'],
    [100.0, 110.0, '300_C'],
    [100.0, 110.0, '400_P'],
    [100.0, 110.0, '400_C'],
    [100.0, 110.0, '500_P'],
    [100.0, 110.0, '500_C'],
]

df = pl.DataFrame(data, orient="row", schema=columns)

# Filter rows where option_type ends with "P"
df_filtered = df.filter(pl.col("option_type").str.ends_with("_P"))

# Create a new column "strike"
df_filtered = df_filtered.with_columns(
    pl.col("option_type").str.split("_").str[0].astype(int)
)

Solution

  • Is this what you are looking for?

    df = pl.DataFrame({
        '2022-03-01_bid': [100.0, 100.0,100.0,100.0,100.0], 
        '2022-03-01_ask': [110.0, 110.0,110.0,110.0,110.0],
        'option_type' : ['100_P', '100_C', '200_P', '200_C', '300_P']
    })
    
    df_filtered = df.filter(pl.col('option_type').str.ends_with('_P'))
    
    df_filtered = df_filtered.with_columns(
        strike = pl.col('option_type').str.split("_").list.get(0).cast(pl.UInt32),
    )
    

    Outputs:

    ┌────────────────┬────────────────┬─────────────┬────────┐
    │ 2022-03-01_bid ┆ 2022-03-01_ask ┆ option_type ┆ strike │
    │ ---            ┆ ---            ┆ ---         ┆ ---    │
    │ f64            ┆ f64            ┆ str         ┆ u32    │
    ╞════════════════╪════════════════╪═════════════╪════════╡
    │ 100.0          ┆ 110.0          ┆ 100_P       ┆ 100    │
    │ 100.0          ┆ 110.0          ┆ 200_P       ┆ 200    │
    │ 100.0          ┆ 110.0          ┆ 300_P       ┆ 300    │
    └────────────────┴────────────────┴─────────────┴────────┘
    

    Also the square bracket masking syntax from pandas isn't really the way of polars, see this article on coming from pandas

    EDIT

    If you wanted to use a regular expression (perhaps a slightly cleaner solution) you could do something like this

    df_filtered = df_filtered.with_columns(
        strike = pl.col('option_type').str.extract(r"(\d+)").cast(pl.UInt32)
    )