How to check if dataframe columns contains any information except NULL/EMPTY and show them in a new column in python polars?

I have a dataframe

pl.DataFrame({'last_name':['Unknown','Mallesham',None,'Bhavik','Unknown'],
              'first_name_or_initial':['U',None,'TRUE','yamulla',None],
              'number':['003123490','012457847','100030303','','0023004648'],
              'date_of_birth':[None,'12/09/1900','12/09/1900','12/09/1900',None]})

Here I would like to add a new column which contains the field names that do hold on any information except NULL/EMPTY/NAN.

For example:

first row: it has last,first and number field information, and dob is NULL, hence a new column conso_field is filled in with these field names such as last_name,first_name_or_initial and number. like wise I need to get this done for all the rows.

Here is an expected output:

Solution

First, let's expand the example to show a row with all null/empty fields (to show how the algorithm handles this case).

import polars as pl
import numpy as np

df = pl.DataFrame(
    {
        "last_name": ["Unknown", "Mallesham", None, "Bhavik", "Unknown", None],
        "first_name_or_initial": ["U", None, "TRUE", "yamulla", None, None],
        "number": ["003123490", "012457847", "100030303", "", "0023004648", None],
        "date_of_birth": [None, "12/09/1900", "12/09/1900", "12/09/1900", None, None],
    }
)
df

shape: (6, 4)
┌───────────┬───────────────────────┬────────────┬───────────────┐
│ last_name ┆ first_name_or_initial ┆ number     ┆ date_of_birth │
│ ---       ┆ ---                   ┆ ---        ┆ ---           │
│ str       ┆ str                   ┆ str        ┆ str           │
╞═══════════╪═══════════════════════╪════════════╪═══════════════╡
│ Unknown   ┆ U                     ┆ 003123490  ┆ null          │
│ Mallesham ┆ null                  ┆ 012457847  ┆ 12/09/1900    │
│ null      ┆ TRUE                  ┆ 100030303  ┆ 12/09/1900    │
│ Bhavik    ┆ yamulla               ┆            ┆ 12/09/1900    │
│ Unknown   ┆ null                  ┆ 0023004648 ┆ null          │
│ null      ┆ null                  ┆ null       ┆ null          │
└───────────┴───────────────────────┴────────────┴───────────────┘

The Algorithm

df = df.with_row_index()

(
    df
    .join(
        df
        .unpivot(index="index")
        .filter(pl.col("value").is_not_null() & (pl.col("value") != ""))
        .group_by("index")
        .agg(pl.col("variable").alias("conso_field"))
        .with_columns(pl.col("conso_field").list.join(","))
        ,
        on="index",
        how="left"
    )
)

shape: (6, 6)
┌───────┬───────────┬───────────────────────┬────────────┬───────────────┬─────────────────────────────────┐
│ index ┆ last_name ┆ first_name_or_initial ┆ number     ┆ date_of_birth ┆ conso_field                     │
│ ---   ┆ ---       ┆ ---                   ┆ ---        ┆ ---           ┆ ---                             │
│ u32   ┆ str       ┆ str                   ┆ str        ┆ str           ┆ str                             │
╞═══════╪═══════════╪═══════════════════════╪════════════╪═══════════════╪═════════════════════════════════╡
│ 0     ┆ Unknown   ┆ U                     ┆ 003123490  ┆ null          ┆ last_name,first_name_or_initia… │
│ 1     ┆ Mallesham ┆ null                  ┆ 012457847  ┆ 12/09/1900    ┆ last_name,number,date_of_birth  │
│ 2     ┆ null      ┆ TRUE                  ┆ 100030303  ┆ 12/09/1900    ┆ first_name_or_initial,number,d… │
│ 3     ┆ Bhavik    ┆ yamulla               ┆            ┆ 12/09/1900    ┆ last_name,first_name_or_initia… │
│ 4     ┆ Unknown   ┆ null                  ┆ 0023004648 ┆ null          ┆ last_name,number                │
│ 5     ┆ null      ┆ null                  ┆ null       ┆ null          ┆ null                            │
└───────┴───────────┴───────────────────────┴────────────┴───────────────┴─────────────────────────────────┘

Note that the algorithm keeps the last row with all null/empty values.

How it works

To see how it works, let's take it in steps. First, we'll need to attach a row number to each row. (This is needed in case any row has all null/empty values.)

Then we'll use unpivot to place each value in each column on a separate row, next to it's column name.

df = df.with_row_index()
(
    df
    .unpivot(index="index")
)

shape: (24, 3)
┌───────┬───────────────┬────────────┐
│ index ┆ variable      ┆ value      │
│ ---   ┆ ---           ┆ ---        │
│ u32   ┆ str           ┆ str        │
╞═══════╪═══════════════╪════════════╡
│ 0     ┆ last_name     ┆ Unknown    │
│ 1     ┆ last_name     ┆ Mallesham  │
│ 2     ┆ last_name     ┆ null       │
│ 3     ┆ last_name     ┆ Bhavik     │
│ 4     ┆ last_name     ┆ Unknown    │
│ …     ┆ …             ┆ …          │
│ 1     ┆ date_of_birth ┆ 12/09/1900 │
│ 2     ┆ date_of_birth ┆ 12/09/1900 │
│ 3     ┆ date_of_birth ┆ 12/09/1900 │
│ 4     ┆ date_of_birth ┆ null       │
│ 5     ┆ date_of_birth ┆ null       │
└───────┴───────────────┴────────────┘

Note that columns values will be converted to string values in this step.

Next, we'll filter out any rows with null or "" values.

df = df.with_row_index()
(
    df
    .unpivot(index="index")
    .filter(pl.col("value").is_not_null() & (pl.col("value") != ""))
)

shape: (14, 3)
┌───────┬───────────────────────┬────────────┐
│ index ┆ variable              ┆ value      │
│ ---   ┆ ---                   ┆ ---        │
│ u32   ┆ str                   ┆ str        │
╞═══════╪═══════════════════════╪════════════╡
│ 0     ┆ last_name             ┆ Unknown    │
│ 1     ┆ last_name             ┆ Mallesham  │
│ 3     ┆ last_name             ┆ Bhavik     │
│ 4     ┆ last_name             ┆ Unknown    │
│ 0     ┆ first_name_or_initial ┆ U          │
│ …     ┆ …                     ┆ …          │
│ 2     ┆ number                ┆ 100030303  │
│ 4     ┆ number                ┆ 0023004648 │
│ 1     ┆ date_of_birth         ┆ 12/09/1900 │
│ 2     ┆ date_of_birth         ┆ 12/09/1900 │
│ 3     ┆ date_of_birth         ┆ 12/09/1900 │
└───────┴───────────────────────┴────────────┘

In the next step, we'll aggregate up all the remaining rows by row number, keeping only the column names. These represent columns with non-null, non-empty values.

df = df.with_row_index()
(
    df
    .unpivot(index="index")
    .filter(pl.col("value").is_not_null() & (pl.col("value") != ""))
    .group_by("index")
    .agg(pl.col("variable").alias("conso_field"))
)

shape: (5, 2)
┌───────┬─────────────────────────────────┐
│ index ┆ conso_field                     │
│ ---   ┆ ---                             │
│ u32   ┆ list[str]                       │
╞═══════╪═════════════════════════════════╡
│ 3     ┆ ["last_name", "first_name_or_i… │
│ 0     ┆ ["last_name", "first_name_or_i… │
│ 1     ┆ ["last_name", "number", "date_… │
│ 2     ┆ ["first_name_or_initial", "num… │
│ 4     ┆ ["last_name", "number"]         │
└───────┴─────────────────────────────────┘

Note that we get a list of column names for each row. (Note: we don't need to worry about the order of the rows at this point. We'll use the row number and a left-join in the last step recombine the values to the original DataFrame.)

Then, it's simply a matter of joining the columns names into one string:

df = df.with_row_index()
(
    df
    .unpivot(index="index")
    .filter(pl.col("value").is_not_null() & (pl.col("value") != ""))
    .group_by("index")
    .agg(pl.col("variable").alias("conso_field"))
    .with_columns(pl.col("conso_field").list.join(","))
)

shape: (5, 2)
┌───────┬─────────────────────────────────┐
│ index ┆ conso_field                     │
│ ---   ┆ ---                             │
│ u32   ┆ str                             │
╞═══════╪═════════════════════════════════╡
│ 1     ┆ last_name,number,date_of_birth  │
│ 3     ┆ last_name,first_name_or_initia… │
│ 0     ┆ last_name,first_name_or_initia… │
│ 2     ┆ first_name_or_initial,number,d… │
│ 4     ┆ last_name,number                │
└───────┴─────────────────────────────────┘

From here, we simply use a "left join" to merge the data back to the original dataset (as shown at the beginning.)