I have a dataframe
pl.DataFrame({'last_name':['Unknown','Mallesham',None,'Bhavik','Unknown'],
'first_name_or_initial':['U',None,'TRUE','yamulla',None],
'number':['003123490','012457847','100030303','','0023004648'],
'date_of_birth':[None,'12/09/1900','12/09/1900','12/09/1900',None]})
Here I would like to add a new column which contains the field names that do hold on any information except NULL/EMPTY/NAN.
For example:
first row: it has last,first and number field information, and dob is NULL, hence a new column conso_field is filled in with these field names such as last_name,first_name_or_initial and number. like wise I need to get this done for all the rows.
Here is an expected output:
First, let's expand the example to show a row with all null/empty fields (to show how the algorithm handles this case).
import polars as pl
import numpy as np
df = pl.DataFrame(
{
"last_name": ["Unknown", "Mallesham", None, "Bhavik", "Unknown", None],
"first_name_or_initial": ["U", None, "TRUE", "yamulla", None, None],
"number": ["003123490", "012457847", "100030303", "", "0023004648", None],
"date_of_birth": [None, "12/09/1900", "12/09/1900", "12/09/1900", None, None],
}
)
df
shape: (6, 4)
┌───────────┬───────────────────────┬────────────┬───────────────┐
│ last_name ┆ first_name_or_initial ┆ number ┆ date_of_birth │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞═══════════╪═══════════════════════╪════════════╪═══════════════╡
│ Unknown ┆ U ┆ 003123490 ┆ null │
│ Mallesham ┆ null ┆ 012457847 ┆ 12/09/1900 │
│ null ┆ TRUE ┆ 100030303 ┆ 12/09/1900 │
│ Bhavik ┆ yamulla ┆ ┆ 12/09/1900 │
│ Unknown ┆ null ┆ 0023004648 ┆ null │
│ null ┆ null ┆ null ┆ null │
└───────────┴───────────────────────┴────────────┴───────────────┘
df = df.with_row_index()
(
df
.join(
df
.unpivot(index="index")
.filter(pl.col("value").is_not_null() & (pl.col("value") != ""))
.group_by("index")
.agg(pl.col("variable").alias("conso_field"))
.with_columns(pl.col("conso_field").list.join(","))
,
on="index",
how="left"
)
)
shape: (6, 6)
┌───────┬───────────┬───────────────────────┬────────────┬───────────────┬─────────────────────────────────┐
│ index ┆ last_name ┆ first_name_or_initial ┆ number ┆ date_of_birth ┆ conso_field │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═══════╪═══════════╪═══════════════════════╪════════════╪═══════════════╪═════════════════════════════════╡
│ 0 ┆ Unknown ┆ U ┆ 003123490 ┆ null ┆ last_name,first_name_or_initia… │
│ 1 ┆ Mallesham ┆ null ┆ 012457847 ┆ 12/09/1900 ┆ last_name,number,date_of_birth │
│ 2 ┆ null ┆ TRUE ┆ 100030303 ┆ 12/09/1900 ┆ first_name_or_initial,number,d… │
│ 3 ┆ Bhavik ┆ yamulla ┆ ┆ 12/09/1900 ┆ last_name,first_name_or_initia… │
│ 4 ┆ Unknown ┆ null ┆ 0023004648 ┆ null ┆ last_name,number │
│ 5 ┆ null ┆ null ┆ null ┆ null ┆ null │
└───────┴───────────┴───────────────────────┴────────────┴───────────────┴─────────────────────────────────┘
Note that the algorithm keeps the last row with all null/empty values.
To see how it works, let's take it in steps. First, we'll need to attach a row number to each row. (This is needed in case any row has all null/empty values.)
Then we'll use unpivot
to place each value in each column on a separate row, next to it's column name.
df = df.with_row_index()
(
df
.unpivot(index="index")
)
shape: (24, 3)
┌───────┬───────────────┬────────────┐
│ index ┆ variable ┆ value │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str │
╞═══════╪═══════════════╪════════════╡
│ 0 ┆ last_name ┆ Unknown │
│ 1 ┆ last_name ┆ Mallesham │
│ 2 ┆ last_name ┆ null │
│ 3 ┆ last_name ┆ Bhavik │
│ 4 ┆ last_name ┆ Unknown │
│ … ┆ … ┆ … │
│ 1 ┆ date_of_birth ┆ 12/09/1900 │
│ 2 ┆ date_of_birth ┆ 12/09/1900 │
│ 3 ┆ date_of_birth ┆ 12/09/1900 │
│ 4 ┆ date_of_birth ┆ null │
│ 5 ┆ date_of_birth ┆ null │
└───────┴───────────────┴────────────┘
Note that columns values will be converted to string values in this step.
Next, we'll filter out any rows with null
or "" values.
df = df.with_row_index()
(
df
.unpivot(index="index")
.filter(pl.col("value").is_not_null() & (pl.col("value") != ""))
)
shape: (14, 3)
┌───────┬───────────────────────┬────────────┐
│ index ┆ variable ┆ value │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str │
╞═══════╪═══════════════════════╪════════════╡
│ 0 ┆ last_name ┆ Unknown │
│ 1 ┆ last_name ┆ Mallesham │
│ 3 ┆ last_name ┆ Bhavik │
│ 4 ┆ last_name ┆ Unknown │
│ 0 ┆ first_name_or_initial ┆ U │
│ … ┆ … ┆ … │
│ 2 ┆ number ┆ 100030303 │
│ 4 ┆ number ┆ 0023004648 │
│ 1 ┆ date_of_birth ┆ 12/09/1900 │
│ 2 ┆ date_of_birth ┆ 12/09/1900 │
│ 3 ┆ date_of_birth ┆ 12/09/1900 │
└───────┴───────────────────────┴────────────┘
In the next step, we'll aggregate up all the remaining rows by row number, keeping only the column names. These represent columns with non-null, non-empty values.
df = df.with_row_index()
(
df
.unpivot(index="index")
.filter(pl.col("value").is_not_null() & (pl.col("value") != ""))
.group_by("index")
.agg(pl.col("variable").alias("conso_field"))
)
shape: (5, 2)
┌───────┬─────────────────────────────────┐
│ index ┆ conso_field │
│ --- ┆ --- │
│ u32 ┆ list[str] │
╞═══════╪═════════════════════════════════╡
│ 3 ┆ ["last_name", "first_name_or_i… │
│ 0 ┆ ["last_name", "first_name_or_i… │
│ 1 ┆ ["last_name", "number", "date_… │
│ 2 ┆ ["first_name_or_initial", "num… │
│ 4 ┆ ["last_name", "number"] │
└───────┴─────────────────────────────────┘
Note that we get a list of column names for each row. (Note: we don't need to worry about the order of the rows at this point. We'll use the row number and a left-join in the last step recombine the values to the original DataFrame.)
Then, it's simply a matter of joining the columns names into one string:
df = df.with_row_index()
(
df
.unpivot(index="index")
.filter(pl.col("value").is_not_null() & (pl.col("value") != ""))
.group_by("index")
.agg(pl.col("variable").alias("conso_field"))
.with_columns(pl.col("conso_field").list.join(","))
)
shape: (5, 2)
┌───────┬─────────────────────────────────┐
│ index ┆ conso_field │
│ --- ┆ --- │
│ u32 ┆ str │
╞═══════╪═════════════════════════════════╡
│ 1 ┆ last_name,number,date_of_birth │
│ 3 ┆ last_name,first_name_or_initia… │
│ 0 ┆ last_name,first_name_or_initia… │
│ 2 ┆ first_name_or_initial,number,d… │
│ 4 ┆ last_name,number │
└───────┴─────────────────────────────────┘
From here, we simply use a "left join" to merge the data back to the original dataset (as shown at the beginning.)