I'm working with a Polars DataFrame and trying to clean up a column by applying multiple string operations. The first operation I need to do is a str.replace()
to fix some inconsistencies in the string, and then I want to extract several values into new columns.
My current approach:
df = pl.DataFrame(
{
"engine": ["172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel",
"3.0 Liter Twin Turbo",
"429.0HP 5.0L 8 Cylinder Engine Gasoline Fuel"],
}
)
(
df
.with_columns(
pl.col("engine").str.replace(r'\sLiter', "L")
)
.with_columns(
pl.col("engine").str.extract(r'(\d+)\.\d+HP',1).alias("HP"),
pl.col("engine").str.extract(r'(\d+\.\d+)L',1).alias("Displacement"),
pl.col("engine").str.extract(r'(\d+)\sCylinder',1).alias("Cylinder"),
)
)
Since I'm going to apply multiple operations over the main dataframe, I want to create a function to make this code more reusable and cleaner. This is the function-based approach I've come up with:
Approach with function:
def get_engine() -> pl.Expr:
return (
pl.col("engine").str.extract(r'(\d+)\.\d+HP',1).alias("HP"),
pl.col("engine").str.extract(r'(\d+\.\d+)L',1).alias("Displacement"),
pl.col("engine").str.extract(r'(\d+)\sCylinder',1).alias("Cylinder"),
pl.col("engine").str.contains("Electric").alias("Electric")
)
(
df
.with_columns(
pl.col("engine").str.replace(r'\sLiter', "L")
)
.with_columns(
get_engine()
)
)
Is there a better or more efficient way to combine these operations while keeping the code clean?
EDIT1: Just edited the approach with fuction as @DeanMacGregor suggeested
It might be more performant to use extract_groups
rather than repeated extract
s. It assumes that the specifications are always in the same order. It doesn't require that every row have every spec just that they're never in a different order. Using this approach you can also forego the str.replace
.
Here's a crack at that
def get_engine():
extract = (
pl.col('engine')
.str.extract_groups(
r"(?:(\d+(\.\d+)?)\s*HP\s*)?.*?(\d+(\.\d+)?)\s*L.+?(?:\b(\d)\s*Cylinder)?"
)
)
return (
extract.struct.field('1').alias('HP').cast(pl.Float64),
extract.struct.field('3').alias('Displacement').cast(pl.Float64),
extract.struct.field("6").alias('Cyliner').cast(pl.Int8),
pl.col('engine').str.contains("Electric").alias("Electric")
)
I'm not very good at regex so that is with help from ChatGPT. In that extraction we're extracting groups we don't need and then disregarding them. It ought to be possible to refine the regex so it's strictly only getting the groups you care about. I'm not sure what the performance implications are with the excess groups but, at least in theory, it's more optimization potential.
With the above you can just do
df.with_columns(get_engine())
shape: (3, 5)
┌──────────────────────────────────────────────┬───────┬──────────────┬─────────┬──────────┐
│ engine ┆ HP ┆ Displacement ┆ Cyliner ┆ Electric │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ i8 ┆ bool │
╞══════════════════════════════════════════════╪═══════╪══════════════╪═════════╪══════════╡
│ 172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel ┆ 172.0 ┆ 1.6 ┆ 4 ┆ false │
│ 3.0 Liter Twin Turbo ┆ null ┆ 3.0 ┆ null ┆ false │
│ 429HP 5.0L 8 Cylinder Engine Gasoline Fuel ┆ 429.0 ┆ 5.0 ┆ 8 ┆ false │
└──────────────────────────────────────────────┴───────┴──────────────┴─────────┴──────────┘