Search code examples
pandascontains

create new pandas column is other column contains a string


I have:

df = pd.DataFrame({'col1': {0: 'success', 1: 'failed', 2: 'variable x 10', 3: 'variable xr 134', 4: 'error', 5: 'not found'}})

I want to create 3 columns depending on if a substring is present in the main column.

  • if the column contains the word variable, then I want to split the string and take the 2nd element
  • if the column contains the word variable, then I want to split the string and take the 3rd element. this value can also be a string
  • if the column contains one of these key words, success, fail, error, or not found, then I want to place a key word of Success, Fail, or Error.
  • if none of the keywords are present, then I would like a None or '' to appear

I tried:

df['var']=np.where(df['col1'].str.contains('variable'), df['col1'].str.split(' '),'')
df['val']=np.where(df['col1'].str.contains('variable'), df['col1'].str.split(' '),'')
df['other']=np.where(df['col1'].str.contains('not found'|'error'),'Error',
                     np.where(df['col1'].str.contains('success'),'Success',
                              np.where(df['col1'].str.contains('fail'),'Fail','')))

but I get the error unsupported operand type(s) for |: 'str' and 'str' with the str.contains and cannot select specific portions of the split string, as I receive the error operands could not be broadcast together with shapes (6,) (3,) () when trying df['val']=np.where(df['col1'].str.contains('variable'), df['col1'].str.split(' ')[2],'')

Pictorially, this is what I would like to end up with enter image description here

Any suggestions?


Solution

  • I would extract each of the three portions (if any), then concat the whole thing :

    others = {
        "success": "Success", "fail": "Fail",
        "error": "Error", "not found": "Error"
    }
    
    pat_vax = r"variable\s+(?P<var>\S+)\s+(?P<val>\S+)"
    pat_oth = r"(%s)" % "|".join(others)
    
    out = (
        pd.concat([
            df, df["col1"].str.extract(pat_vax),
            df["col1"].str.extract(pat_oth, expand=False)
                .map(others).rename("other")], axis=1
        )
    )
    

    Output :

    print(out)
    
                   col1  var   val    other
    0           success  NaN   NaN  Success
    1            failed  NaN   NaN     Fail
    2     variable x 10    x    10      NaN
    3  variable xr 134r   xr  134r      NaN
    4             error  NaN   NaN    Error
    5         not found  NaN   NaN    Error
    
    [6 rows x 4 columns]