python pandas dataframe multiple-columns

Check multiple columns for datetime, null or a string value, if exists skip and go to other column otherwise pick the value and put in new column

I have 3 columns Col1, Col2 and Col3 in pandas dataframe 'df'. I want to make a new column, Col4 out of these. The steps are as follows:

Check Col1, Col2 and Col3, if value is null and not 'aaa' or has a datetime value, skip and go to the other column, if value is not null and not datettime and not 'aaa' pick value and put in the new column i.e. Col4.

Col1, Col2 and Col3 are object datatypes.

For example:

col1                    col2                    col3   col4
aaa                     2011-02-01 10:04:03     abc     abc
2011-02-01 10:04:03     xyz                     abc     xyz
                        ijk                     xyz     ijk

Currently i have been using the code:

df["col4"] = np.where(
    ((df["col1"].notnull()) & (df["col1"] != "aaa")),
    df["col1"],
    (
        ((df["col2"].notnull()) & (df["col2"] != "aaa")),
        df["col2"],
        (((df["col3"].notnull()) & (df["col3"] != "aaa")), df["col3"], "No Value"),
    ),
)

This incorporates the part where we skip the null values and 'aaa'. I want a code so that i am also able to skip the datetime values here.

Can you please help?

Solution

With the dataframe you provided:

import datetime
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [
            "aaa",
            datetime.datetime.strptime("2011-02-01 10:04:03", "%Y-%m-%d %H:%M:%S"),
            "",
        ],
        "col2": [
            datetime.datetime.strptime("2011-02-01 10:04:03", "%Y-%m-%d %H:%M:%S"),
            "xyz",
            "ijk",
        ],
        "col3": ["abc", "abc", "xyz"],
    }
)

Here is another way to do it by first defining a helper function which uses Python built-in function isinstance to check datetime values:

def check(val):
    return (
        val
        if (val and val != "aaa" and not isinstance(val, datetime.datetime))
        else None
    )

And then, use Pandas apply to check each column inside a generator (here, Python next built-in function returns the first non-false element, as suggested in this clever post):

df["col4"] = df.apply(
    lambda x: next(
        (y for y in [check(x["col1"]), check(x["col2"]), check(x["col3"])] if y), pd.NA
    ),
    axis=1,
)

print(df)
# Output
                  col1                 col2 col3 col4
0                  aaa  2011-02-01 10:04:03  abc  abc
1  2011-02-01 10:04:03                  xyz  abc  xyz
2                                       ijk  xyz  ijk