In polars, I am trying to perform selection of rows and create a new content based on string manipulation. However, the python string manipulation commands below don't work. I've seen polars uses regular expressions, but am unsure how to use this to create a number from the option_type column using everything before the '_'.
import polars as pl
columns = ['2022-03-01_bid', '2022-03-01_ask', 'option_type']
data = [
[100.0, 110.0, '100_P'],
[100.0, 110.0, '100_C'],
[100.0, 110.0, '200_P'],
[100.0, 110.0, '200_C'],
[100.0, 110.0, '300_P'],
[100.0, 110.0, '300_C'],
[100.0, 110.0, '400_P'],
[100.0, 110.0, '400_C'],
[100.0, 110.0, '500_P'],
[100.0, 110.0, '500_C'],
]
df = pl.DataFrame(data, orient="row", schema=columns)
# Filter rows where option_type ends with "P"
df_filtered = df.filter(pl.col("option_type").str.ends_with("_P"))
# Create a new column "strike"
df_filtered = df_filtered.with_columns(
pl.col("option_type").str.split("_").str[0].astype(int)
)
Is this what you are looking for?
df = pl.DataFrame({
'2022-03-01_bid': [100.0, 100.0,100.0,100.0,100.0],
'2022-03-01_ask': [110.0, 110.0,110.0,110.0,110.0],
'option_type' : ['100_P', '100_C', '200_P', '200_C', '300_P']
})
df_filtered = df.filter(pl.col('option_type').str.ends_with('_P'))
df_filtered = df_filtered.with_columns(
strike = pl.col('option_type').str.split("_").list.get(0).cast(pl.UInt32),
)
Outputs:
┌────────────────┬────────────────┬─────────────┬────────┐
│ 2022-03-01_bid ┆ 2022-03-01_ask ┆ option_type ┆ strike │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ str ┆ u32 │
╞════════════════╪════════════════╪═════════════╪════════╡
│ 100.0 ┆ 110.0 ┆ 100_P ┆ 100 │
│ 100.0 ┆ 110.0 ┆ 200_P ┆ 200 │
│ 100.0 ┆ 110.0 ┆ 300_P ┆ 300 │
└────────────────┴────────────────┴─────────────┴────────┘
Also the square bracket masking syntax from pandas isn't really the way of polars, see this article on coming from pandas
EDIT
If you wanted to use a regular expression (perhaps a slightly cleaner solution) you could do something like this
df_filtered = df_filtered.with_columns(
strike = pl.col('option_type').str.extract(r"(\d+)").cast(pl.UInt32)
)