I have a data frame as below:
df = pl.DataFrame({'last_name':[None,'mallesh','bhavik'],
'first_name':['a','b','c'],
'middle_name_or_initial':['aa','bb','cc']})
shape: (3, 3)
┌───────────┬────────────┬────────────────────────┐
│ last_name ┆ first_name ┆ middle_name_or_initial │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════╪════════════╪════════════════════════╡
│ null ┆ a ┆ aa │
│ mallesh ┆ b ┆ bb │
│ bhavik ┆ c ┆ cc │
└───────────┴────────────┴────────────────────────┘
Here I would like to find an observation which has First and Middle Name not NULL and Last Name is Null.
In this case last_name should take the value from first_name, first_name should take the value from middle_name, and middle_name should be NULL.
expected output will be:
shape: (3, 3)
┌───────────┬────────────┬────────────────────────┐
│ last_name ┆ first_name ┆ middle_name_or_initial │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════╪════════════╪════════════════════════╡
│ a ┆ aa ┆ null │
│ mallesh ┆ b ┆ bb │
│ bhavik ┆ c ┆ cc │
└───────────┴────────────┴────────────────────────┘
I'm trying with this command:
df.with_columns(
pl.when((pl.col('first_name').is_not_null()) & (pl.col('middle_name_or_initial').is_not_null()) & (pl.col('last_name').is_null()))
.then(pl.col('first_name').alias('last_name')).otherwise(pl.col('last_name').alias('first_name')),
pl.when((pl.col('first_name').is_not_null()) & (pl.col('middle_name_or_initial').is_not_null()) & (pl.col('last_name').is_null()))
.then(pl.col('middle_name_or_initial').alias('first_name')).alias('middle_name_or_initial')
)
But it is not producing the expected output.
You can actually change the values of multiple columns within a single when/then/otherwise
statement.
name_cols = ["last_name", "first_name", "middle_name_or_initial"]
(
df.with_columns(
pl.when(
pl.col("first_name").is_not_null(),
pl.col("middle_name_or_initial").is_not_null(),
pl.col("last_name").is_null()
)
.then(pl.struct(
last_name = "first_name",
first_name = "middle_name_or_initial",
middle_name_or_initial = None
))
.otherwise(pl.struct(name_cols))
.struct.field(name_cols)
)
)
shape: (3, 3)
┌───────────┬────────────┬────────────────────────┐
│ last_name ┆ first_name ┆ middle_name_or_initial │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════╪════════════╪════════════════════════╡
│ a ┆ aa ┆ null │
│ mallesh ┆ b ┆ bb │
│ bhavik ┆ c ┆ cc │
└───────────┴────────────┴────────────────────────┘
To change the values of multiple columns within a single when/then/otherwise statement, we can use Structs.
We use pl.struct()
to create a struct for each condition.
df.select(
then = pl.struct(
last_name = "first_name",
first_name = "middle_name_or_initial",
middle_name_or_initial = None
),
otherwise = pl.struct(name_cols)
)
shape: (3, 2)
┌─────────────────┬──────────────────────┐
│ then ┆ otherwise │
│ --- ┆ --- │
│ struct[3] ┆ struct[3] │
╞═════════════════╪══════════════════════╡
│ {"a","aa",null} ┆ {null,"a","aa"} │
│ {"b","bb",null} ┆ {"mallesh","b","bb"} │
│ {"c","cc",null} ┆ {"bhavik","c","cc"} │
└─────────────────┴──────────────────────┘
.struct.field()
is used to "unnest" the result and replace the original values.