Tags: python, python-3.x, regex, text-processing

Correct re.compile to eliminate periods but add space for comma


I have a few lines of Python code that go through a list and remove punctuation from each row. Here is the code:

import pandas as pd
import re
data = [['M.B.B.S'], ['M.B.B.S,B.S'],['ACN-P, D.N.P'],['ACNP-BC, DNP']] 
df = pd.DataFrame(data, columns = ['ID']) 
p = re.compile(r'[^\w\s\d]+')  # runs of characters that are not word characters or whitespace (\d is already covered by \w)
df['ID'] = [p.sub('',x) for x in df['ID'].tolist()]
df

The problem is that I need the periods and dashes (".", "-") to be removed with nothing in their place, as the code above does, but the commas (",") to be replaced with a space. I can't work out the correct expression. For example, the second row gives "MBBSBS" when I need it to read "MBBS BS".


Solution

  • Replace the commas with spaces first, then apply the regex from the question:

    df['ID'] = [p.sub('',x.replace(',',' ')) for x in df['ID'].tolist()]
    

    Alternatively, use the Python string method .translate and skip the regex entirely:

    import pandas as pd
    import string
    
    repl={ord(k):'' for k in string.punctuation}
    repl[ord(',')]=' '
    data = [['M.B.B.S'], ['M.B.B.S,B.S'],['ACN-P, D.N.P'],['ACNP-BC, DNP']] 
    df = pd.DataFrame(data, columns = ['ID']) 
    
    df['ID'] = [x.translate(repl) for x in df['ID'].tolist()]
    
    >>> df
                ID
    0         MBBS
    1      MBBS BS
    2    ACNP  DNP
    3  ACNPBC  DNP
    

    If you don't want ', ' to become two spaces, replace that sequence with a single space before doing the other replacements:

    df['ID'] = [x.replace(', ',' ').translate(repl) for x in df['ID'].tolist()]
    

    You get the idea...
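
Since the question asks for a single compiled pattern, here is a minimal sketch of one way to do that, offered as an alternative to the answers above and not taken from them. It passes a function to sub() so that each match decides its own replacement. The names _PUNCT and clean_credentials are made up for this illustration, and it assumes a comma should swallow any whitespace next to it so that ', ' yields one space.

```python
import re

# Either a comma with any whitespace around it, or any other single
# character that is neither a word character nor whitespace.
# (Hypothetical name; like the pattern in the question, this keeps "_".)
_PUNCT = re.compile(r'\s*,\s*|[^\w\s]')


def clean_credentials(text):
    """Remove punctuation, turning each comma (and adjacent spaces) into one space."""
    def _swap(match):
        # A match containing a comma becomes a single space;
        # every other punctuation character is dropped.
        return ' ' if ',' in match.group() else ''
    return _PUNCT.sub(_swap, text).strip()


rows = ['M.B.B.S', 'M.B.B.S,B.S', 'ACN-P, D.N.P', 'ACNP-BC, DNP']
cleaned = [clean_credentials(row) for row in rows]
print(cleaned)  # ['MBBS', 'MBBS BS', 'ACNP DNP', 'ACNPBC DNP']
```

It can be applied to the DataFrame the same way as the other versions, for example with a list comprehension over df['ID'].tolist().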