Search code examples
pythonpandasdataframedata-cleaning

Creating another column in pandas based on a pre-existing column


I have a third column in my data frame where I want to be able to create a fourth column that looks almost the same, except it has no double quotes and there is a 'user/' prefix before each ID in the list. Also, sometimes it is just a single ID vs. list of IDs (as shown in example DF).

original

col1   col2     col3 
01      01     "ID278, ID289"

02      02     "ID275"

desired

col1   col2     col3                col4
01      01     "ID278, ID289"     user/ID278, user/ID289

02      02     "ID275"            user/ID275


Solution

  • Given:

       col1  col2            col3
    0   1.0   1.0  "ID278, ID289"
    1   2.0   2.0         "ID275"
    2   2.0   1.0             NaN
    

    Doing:

    df['col4'] = (df.col3.str.strip('"')  # Remove " from both ends.
                         .str.split(', ') # Split into lists on ', '.
                         .apply(lambda x: ['user/' + i for i in x if i] # Apply this list comprehension,
                                           if isinstance(x, list)  # If it's a list.
                                           else x)
                         .str.join(', ')) # Join them back together.
    print(df)
    

    Output:

       col1  col2            col3                    col4
    0   1.0   1.0  "ID278, ID289"  user/ID278, user/ID289
    1   2.0   2.0         "ID275"              user/ID275
    2   2.0   1.0             NaN                     NaN