python pandas data-cleaning

Creating another column in pandas df based on partially empty columns


I want to create a third column in my pandas DataFrame based on columns id1 and id2. When both are filled in they always match, but either one can be blank. If I build the third column from id1 alone, it ends up blank wherever id1 is blank. I want it to use whichever of the two ids isn't blank to create the college name.

Original:

    id1     id2            
0   ID01   ID01             
1          ID03            
2   ID07                   
3   ID08   ID08            

Desired:

    id1     id2            college_name
0   ID01   ID01             College1
1          ID03             College3
2   ID07                    College7
3   ID08   ID08             College8

Also, I am pretty sure that in every row the first two columns are either an exact match or one of them is empty. I would like to double-check whether there is any row where id1 and id2 are both filled in but hold different ids. How should I do that?


Solution

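  • Backfill across each row so that id1 takes the value from id2 wherever id1 is missing. Extract the digits, convert them to int (which drops the leading zero), then back to str and prepend 'College'.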

    Given:

        id1   id2
    0  ID01  ID01
    1   NaN  ID03
    2  ID07   NaN
    3  ID08  ID08
    

    Doing:

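    # Fill id1 from id2 where missing, then turn e.g. 'ID03' into 'College3'.
    df['college_name'] = 'College' + (df[['id1', 'id2']].bfill(axis=1)['id1']
                                        .str.extract(r'(\d+)', expand=False)
                                        .astype(int)
                                        .astype(str))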
    

    Output:

        id1   id2 college_name
    0  ID01  ID01     College1
    1   NaN  ID03     College3
    2  ID07   NaN     College7
    3  ID08  ID08     College8
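
The backfill above only fills cells that pandas sees as missing (NaN). The question shows the blanks as empty cells, so they may actually be empty strings, depending on how the data was loaded. That is an assumption here, not something the question confirms. A minimal sketch for that case, which first converts blanks to NaN and then uses `fillna` in place of the row-wise backfill (the sample frame and the names `ids`, `merged_id` and `digits` are illustrative):

```python
import numpy as np
import pandas as pd

# Sample frame mirroring the question; blanks are assumed to be empty strings.
df = pd.DataFrame({
    "id1": ["ID01", "", "ID07", "ID08"],
    "id2": ["ID01", "ID03", "", "ID08"],
})

# Treat empty or whitespace-only cells as missing so fillna can see them.
ids = df[["id1", "id2"]].replace(r"^\s*$", np.nan, regex=True)

# Take id1 where present, otherwise fall back to id2.
merged_id = ids["id1"].fillna(ids["id2"])

# Pull out the digits, drop leading zeros via int, and prefix with 'College'.
digits = merged_id.str.extract(r"(\d+)", expand=False)
df["college_name"] = "College" + digits.astype(int).astype(str)

print(df)
```

This sketch assumes every row has at least one id. If both were blank, `digits` would contain NaN and `astype(int)` would raise, so such rows would need to be dropped or handled first.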
    

    To check for rows where the ids are different:

    Given:

        id1   id2
    0  ID01  ID01
    1   NaN  ID03
    2  ID07   NaN
    3  ID08  ID98
    

    Doing:

    print(df[df.id1.ne(df.id2) & df.id1.notna() & df.id2.notna()])
    

    Output:

        id1   id2
    3  ID08  ID98
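
For reference, the same check as a self-contained sketch that can be run as is. The sample frame is rebuilt from the "Given" table above, and the names `both_present` and `conflicts` are illustrative. An empty `conflicts` frame would suggest the ids never disagree.

```python
import pandas as pd

# Sample frame from the answer, with one deliberate mismatch in row 3.
df = pd.DataFrame({
    "id1": ["ID01", None, "ID07", "ID08"],
    "id2": ["ID01", "ID03", None, "ID98"],
})

# A row conflicts only when both ids are present and they differ.
both_present = df["id1"].notna() & df["id2"].notna()
conflicts = df[both_present & df["id1"].ne(df["id2"])]

print(f"{len(conflicts)} conflicting row(s)")
print(conflicts)
```

The `notna()` conditions matter because `ne` treats a missing value as different from any id, so without them rows 1 and 2 would be reported as well.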