Search code examples
pythonregexpandasregexp-replace

Use regex pattern to replace numbers followed by a substring or numbers followed by a space and then substring


For a column in a pandas dataframe, I want to remove any number either immediately followed by "gb" or "mb" or with a space in-between, in it's entirety. I.e. remove strings such as "500 gb" and "500mb".

    Column_To_Fix
0   coolblue 100gb
1   connector 500 mb for thing
2   5gb for user
3   load 800 mb
4   1000 add-on
5   20 gb 

The function below only works for row 0 and row 2, not sure how to add in the space requirement for the pattern:

pat = '(^|\s)\d+(gb|mb)($|\s)'
df['Column_To_Fix'].str.lower().replace(pat, ' ', regex=True)

Desired Output:

    Column_To_Fix
0   coolblue
1   connector for thing
2   for user
3   load
4   1000 add-on
5   

Solution

  • Try this pattern

    pat = '\d+ *(gb|mb)'
    
    df['Column_To_Fix'].str.lower().str.replace(pat, ' ')
    
    Out[462]:
    0               coolblue
    1    connector   for thing
    2                 for user
    3                   load
    4              1000 add-on
    5
    Name: Column_To_Fix, dtype: object
    

    If you prefer series.replace

    df['Column_To_Fix'].str.lower().replace(pat, ' ', regex=True)