Search code examples
pythonregexstringpandasspecial-characters

How to filter pd.Dataframe based on strings and special characters?


Here is what I have:

import re
import pandas as pd
d = {'ID': [1, 2, 3, 4, 5], 'Desc': ['0*1***HHCM', 'HC:83*20', 'HC:5*2CASL', 'DM*72\nCAS*', 'HC:564*CAS*5']}
df = pd.DataFrame(data=d)
df

Output:

   ID          Desc
0   1    0*1***HHCM
1   2      HC:83*20
2   3    HC:5*2CASL
3   4   DM*72\nCAS*
4   5  HC:564*CAS*5

I need to filter the dataframe by column "Desc", if it contains "CAS" or "HC" that are not surrounded by letters or digits.

Here is what I tried:

new_df = df[df['Desc'].str.match(r'[^A-Za-z0-9]CAS[^A-Za-z0-9]|[^A-Za-z0-9]HC[^A-Za-z0-9]') == True]

It returns an empty dataframe.

I want it to return the following:

   ID          Desc
1   2      HC:83*20
2   3    HC:5*2CASL
3   4   DM*72\nCAS*
4   5  HC:564*CAS*5

Another thing: since 3rd row has "\nCas", where "\n" is a line separator, will it treat it as a letter before "CAS"?

Please help.


Solution

  • Try this:

    df.loc[df['Desc'].str.contains(r'(\W|^)(HC|CAS)(\W|$)', flags=re.M)]
    
    # If you don't want to import re you can just use flags=8:
    
    df.loc[df['Desc'].str.contains(r'(\W|^)(HC|CAS)(\W|$)', flags=8)]
    

    Result:

       ID          Desc
    1   2      HC:83*20
    2   3    HC:5*2CASL
    3   4   DM*72\nCAS*
    4   5  HC:564*CAS*5
    

    To answer your other question, as long as \n is passed correctly it will be parsed as a newline character instead of an alphanumeric character n. i.e.:

    r'\n' -> `\\n` (backslash character + n character)
    '\n' -> '\n'   (newline character)
    

    For further explanation on the regex, please see Regex101 demo: https://regex101.com/r/FNBgPV/2