python pandas valueerror findall

How can I add a delimiter to my "findall" result when only one match is found for a given cell?


I'm trying to extract substrings containing equipment names from the cells in a dataframe. Because of the way the data was created, these substrings can be in any cell. I created this program which uses "findall" and some regex to create a list of all the equipment found in the cells in a given row.

The problem is, the output isn't exactly as I need it. For example, if "findall" matches only one substring in the cell, my script does not add a delimiter afterwards. When the program continues to the next column, it joins the first column match with the second column matches, without a delimiter between the results. And I need the delimiter so I can explode the list later on.

Here is the code:

import pandas as pd

# IMPORT FILE AND CREATE DATAFRAME
d = {'Cause':['Consider checking XXX-1000 for deficiencies prior to train switch', 'XXX-2000 AND PPP-2200 to be taken out of service', 'Need to check XXX-3000 and potentially XXX-1000 for degradation'], 'Mitigation':['ZZZ-9999 is dependent on ZZZ-8000', 'These equipment will be out of service in 2025, not applicable', 'No further comments']}

df = pd.DataFrame(data=d)

# Trying the findall technique
df['equipment'] = ""
for column in df.columns:
    df['equipment'] =  df['equipment'] + df[column].str.findall(r'\s*(\w{3,}-\d{4}\D*?) ').str.join('|')
    if df['equipment'].str.contains('|') == False:
         df['equipment'] += '|'

My output looks like this:

0   XXX-1000ZZZ-9999|ZZZ-8000
1   XXX-2000|PPP-2200
2   XXX-3000|XXX-1000

But I want it to look like this:

0   XXX-1000|ZZZ-9999|ZZZ-8000
1   XXX-2000|PPP-2200
2   XXX-3000|XXX-1000

So I added the last two lines of the code above to try to append the pipe character. That doesn't work and raises the following error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I know this is because the program expects a boolean value but I can't figure out how to fix it.


Solution

  • I suggest this solution:

    import pandas as pd
    
    # IMPORT FILE AND CREATE DATAFRAME
    d = {'Cause':['Consider checking XXX-1000 for deficiencies prior to train switch', 'XXX-2000 AND PPP-2200 to be taken out of service', 'Need to check XXX-3000 and potentially XXX-1000 for degradation'], 'Mitigation':['ZZZ-9999 is dependent on ZZZ-8000', 'These equipment will be out of service in 2025, not applicable', 'No further comments']}
    
    df = pd.DataFrame(data=d)
    
    # Join both text columns, extract every equipment tag, and pipe-delimit the matches per row.
    # str.join never leaves a trailing '|', so no clean-up step is needed afterwards.
    df['equipment'] = (df['Cause'] + ' ' + df['Mitigation']).str.findall(r'(\w{3,}-\d{4})').str.join('|')
    
    for i in df['equipment']:
        print(i)
    

    which returns:

    XXX-1000|ZZZ-9999|ZZZ-8000
    XXX-2000|PPP-2200
    XXX-3000|XXX-1000
    

    or simply

    df['equipment']
    

    giving

    0    XXX-1000|ZZZ-9999|ZZZ-8000
    1             XXX-2000|PPP-2200
    2             XXX-3000|XXX-1000
    Name: equipment, dtype: object
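
As for the ValueError itself: str.contains returns one boolean per row, so an "if ... == False" test has no single truth value to evaluate. Note also that '|' is the regex alternation operator, so with the default regex=True the pattern matches every string; passing regex=False looks for a literal pipe. The sketch below is one possible variant rather than the only way to do it. It builds a per-row mask instead of an "if", and it collects matches from every column that exists before 'equipment' is added, so the column names need not be hard-coded. The names pattern, text_columns, extract_equipment and no_pipe are made up for illustration.

```python
import re

import pandas as pd

d = {
    'Cause': [
        'Consider checking XXX-1000 for deficiencies prior to train switch',
        'XXX-2000 AND PPP-2200 to be taken out of service',
        'Need to check XXX-3000 and potentially XXX-1000 for degradation',
    ],
    'Mitigation': [
        'ZZZ-9999 is dependent on ZZZ-8000',
        'These equipment will be out of service in 2025, not applicable',
        'No further comments',
    ],
}
df = pd.DataFrame(data=d)

# Equipment tag: three or more word characters, a hyphen, four digits.
pattern = r'\w{3,}-\d{4}'

# Per-row mask instead of `if`: True where the 'Cause' matches contain no pipe.
cause_matches = df['Cause'].str.findall(pattern).str.join('|')
no_pipe = ~cause_matches.str.contains('|', regex=False)
print(no_pipe.tolist())

# Freeze the column list before adding 'equipment', so the result column
# itself is never scanned for matches.
text_columns = list(df.columns)


def extract_equipment(cells):
    """Return every tag found in the cells of one row, pipe-delimited."""
    found = []
    for cell in cells:
        found.extend(re.findall(pattern, str(cell)))
    return '|'.join(found)


df['equipment'] = [
    extract_equipment(cells)
    for cells in df[text_columns].itertuples(index=False)
]
print(df['equipment'].tolist())
```

Because all matches for a row are joined in one step, no delimiter has to be patched in afterwards. A row with no matches ends up as an empty string, which is worth keeping in mind before splitting on '|' and exploding.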