Search code examples
pythonpandasreplacepython-re

If value is in this list and somewhere in a column's value can I replace that column value with list value? Pandas DataFrame


There is a LOT of documentation of this yet I can not figure this out.

Here is a list I need to check if one of these values in my column values. if so, replace entire cell with list value.

active_crews = ["CREW #101", "CREW #102", "CREW #203", "CREW #301", "CREW #404", "CREW #501", "CREW #406", "CREW #304", "CREW #701", "CREW #702", "CREW #703", "CREW #704", "CREW #705", "CREW #706",
                "CREW #707" "CREW #708", "CREW #801", "CREW #802", "CREW #803", "CREW #805"]

Example of the data i want to replace. and yes formatting has slight differences as well:

Debris Crew WO# 
REFER TO IAP 12/16 TO 12/19 CREW #405
REFER TO IAP 06/02 TO 06/05 CREW #406
REFER TO IAP 03/24TO 03/27 CREW # 803

Expected out put

Debris Crew WO#
CREW #405
CREW #406
CREW #803

My issue is that I do not know how to tell python to search a column value with a the list to look for a match. and if that list value is in that column value. replace the current column value with the list value

Codes I have tried:

1)

df.loc[df['Debris Crew WO#'] == active_crews, 'Debris Crew WO#']
# doesn't work. This was done before research lol I get the following error, which makes sense
# ValueError: ('Lengths must match to compare', (2216,), (19,))
df.loc[:, ['Place Holder']] = df.loc[:, 'Debris Crew WO#'].str[28:]
# this code "works" but due to different formatting i get data back like this:
8   REW #406
9   CREW #406
# not very effective and can not be relied on. I hate hard coding anything.
df.loc[:, ['Place Holder']] = df.loc[:, 'Debris Crew WO#'].str[26:]
df.loc[:, ['Place Holder']] = df[['Place Holder']].str.split().join(" ")
# tried this due to I have this filter for specials characters with a for loop in a different code and yet I get this error and I have no clue why. Works on my other codes with no problems

#AttributeError: 'DataFrame' object has no attribute 'str'

# even if I use .loc I get the same error:
df.loc[:, ['Place Holder']] = df.loc[:, 'Debris Crew WO#'].str[26:]
df.loc[:, ['Place Holder']] = df.loc[:, ['Place Holder']].str.split().join(" ")
#plus its still hard coding (gross)

Next I'm going to work with RE. I have been told it is great for a "CTRL find" style like type of filtering and is a key tool in data science. So going down that rabbit hole for the next week+ starting with the RE Documentation and practice it on this problem. Will edit with updates as I progress

That said. I have been learning python for almost a full two month now. Please forgive any "noob" styles/coding just trying and experimenting so I can make my life, and others around me a whole lot better. Any help would be apricated. Thanks in advance


Solution

  • Method #1 referring to a list:

    You can use str.extract() with the capture group being a joined list with join('|'). The | symbol is for OR and allows you to search multiple values simultaneously for each row. Capture groups require parentheses around them which is why I add parentheses as strings before and after.

    active_crews = ["CREW #101", "CREW #102", "CREW #203", "CREW #301", "CREW #404", "CREW #501", 
                    "CREW #406", "CREW #304", "CREW #701", "CREW #702", "CREW #703", "CREW #704", 
                    "CREW #705", "CREW #706", "CREW #707" "CREW #708", "CREW #801", "CREW #802", 
                    "CREW #803", "CREW #805"]
    
    df['Debris Crew WO#'] = df['Debris Crew WO#'].str.extract('(' + '|'.join(active_crews) + ')')
    df
    
    #You  can also use a formatted string like this:
    df['Debris Crew WO#'] = df['Debris Crew WO#'].str.extract(f'({"|".join(active_crews)})')
    
    Out[1]: 
      Debris Crew WO#
    0             NaN
    1       CREW #406
    2             NaN
    

    Method #2 Extracting based off a regex pattern and ignoring the list. A ? after a space means that the space is optional. Instead of a space you can also do \s or \s+ for multiple spaces. \d+ means consecutive numbers. If there are commas in the numbers, the regex is slightly different:

    df['Debris Crew WO#'] = df['Debris Crew WO#'].str.extract('(CREW ?# ?\d+)')
    
    Out[2]: 
      Debris Crew WO#
    0            #405
    1            #406
    2           # 803