I have a dataframe with fields age
, diagnosis
, etc that I'm reading from a csv file. I need to add a column code
(a disease code) that depends on diagnosis
and/or age
. I am currently using the map_elements
method:
disease_codes = {
"malaria": {"under_5": "a", "over_5": "b"},
"PUD": "c",
"Asthma": "d",
}
# The callable to return the appropriate code
def return_code(row):
diagnosis = row["diagnosis"]
age = row["age"]
dx_return = disease_codes.get(diagnosis, "Undefined")
if type(dx_return) == dict:
if age >= 5:
return dx_return.get("over_5")
return dx_return.get("under_5")
return dx_return
df.with_columns(pl.struct(["diagnosis", "age"]).map_elements(return_code).alias("code"))
This is working as expected. But because this method uses a for
loop, there is a warning in the docs:
This method is much slower than native expressions API. Only use it if you cannot implement your logic otherwise.
I have been trying to find a way to achieve it with the native API and I thought map_batches
came close:
def return_code(diagnosis, age):
dx_return = disease_codes.get(diagnosis, "Undefined")
if type(dx_return) == dict:
if age >= 5:
return dx_return.get("over_5")
return dx_return.get("under_5")
return dx_return
df.with_columns(
(pl.struct(["diagnosis", "age"]).map_batches(
lambda x: return_code(x.struct.field("diagnosis"), x.struct.field("age"))
)).alias("code")
)
Apparently, I can't do that as I get an error:
TypeError: cannot use `__getitem__` on Series of dtype Struct([Field('diagnosis', Utf8), Field('age', Int64)]) with argument 'diagnosis' of type 'str'
Looking at the documentation, this is by far the only method I thought would do the trick. Am I stuck with the slower approach? Is there something I am not doing right? What would you suggest?
The expected output, which I am getting with the working approach, should be:
Method 1. You could use a when-then-otherwise expression as follows:
import polars as pl
df = pl.DataFrame({
"age": [2, 37, 53, 1, 52],
"diagnosis": ["Malaria", "PUD", "Malaria", "PUD", "Other"],
})
(
df
.with_columns(
pl.when(pl.col("diagnosis") == "Malaria").then(
pl.when(pl.col("age") < 5).then(pl.lit("a")).otherwise(pl.lit("b"))
).otherwise(
pl.when(pl.col("diagnosis") == "PUD").then(pl.lit("c")).otherwise(pl.lit("d"))
).alias("disease_code"),
)
)
If there are many conditions depending only on a single column (e.g. diagnosis
) and few depending on another column (e.g. age
), it might be reasonable to first start with pl.map_dict
. Subsequently, you could handle the age
-dependant special cases with the when-then-otherwise expressions as outlined above.
Method 2. Alternatively, you could create (using any method) a table of disease codes, add the necessary columns to df
, and merge the tables.
This could look look as follows:
disease_codes = pl.DataFrame({
"diagnosis": ["Malaria"] * 2 + ["PUD"] * 2 + ["Other"] * 2,
"le_5": [True, False] * 3,
"code": ["a", "b", "c", "c", "d", "d"]
})
(
df
.with_columns(le_5=pl.col("age") < 5)
.join(disease_codes, on=["diagnosis", "le_5"], how="left")
.drop("le_5")
)
The output will be identical the one shown above.