Tags: python, regex, pandas, text

Removing repeated commas from a pandas DataFrame column, keeping only the text items separated by single commas


I have a DataFrame with a Text column; Cleaned Col shows the output I want:

Text                                | Cleaned Col
, , , Apples , , , Hard Work , ,    | Apples, Hard Work
, , , , , , , , Apples , , , , ,    | Apples
Apples , , Watermelon , , , , , ,   | Apples, Watermelon
, , , , , , , , , , , , , , , , ,   |

I would like to create a column like Cleaned Col, ideally using regex.

I tried several patterns, such as r'\s*,*([^(a-zA-Z)]*)', but none of them gives the right result.


Solution

  • Since your fields are comma-delimited, you can use Series.str.findall:

    # If the fields CANNOT contain whitespace:
    df['Cleaned Col'] = df['Text'].str.findall(r'[^\s,]+').str.join(', ')
    
    # If the fields can contain whitespace:
    df['Cleaned Col'] = df['Text'].str.findall(r'[^\s,](?:[^,]*[^\s,])?').str.join(', ')
    

    Series.str.findall extracts all matches from each value as a list, and .str.join(', ') joins the list items into a single string, so a row with no matches becomes an empty string. The regex (see its demo) means:

    • [^\s,]+ - one or more chars other than whitespace and comma
    • [^\s,] - a single char other than whitespace and comma
    • (?:[^,]*[^\s,])? - an optional occurrence of any zero or more chars other than a comma and then a char other than whitespace and comma.
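
As a quick check, the second findall pattern can be run against the sample rows from the question. This is only a minimal sketch: the DataFrame is rebuilt by hand from the question's table, so the exact spacing inside the strings is an assumption.

```python
import pandas as pd

# Sample data retyped from the question (exact spacing is assumed).
df = pd.DataFrame({
    "Text": [
        ", , , Apples , , , Hard Work , , ",
        ", , , , , , , , Apples , , , , , ",
        "Apples , , Watermelon , , , , , , ",
        ", , , , , , , , , , , , , , , , , ",
    ]
})

# A field starts and ends with a char that is neither whitespace nor a comma;
# anything except a comma (including inner spaces) may appear in between.
pattern = r"[^\s,](?:[^,]*[^\s,])?"

# findall gives one list of fields per row; join turns each list into a string.
df["Cleaned Col"] = df["Text"].str.findall(pattern).str.join(", ")

print(df["Cleaned Col"].tolist())
# ['Apples, Hard Work', 'Apples', 'Apples, Watermelon', '']
```

The last row contains no fields, so findall returns an empty list and the joined result is an empty string rather than NaN.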

    If you have your commas padded with spaces and you really want to use Series.str.replace, you could use

    df['Cleaned Col'] = df['Text'].str.replace(r'^[\s,]+|[\s,]+$|(\s)*(,)[\s,]*', r'\2\1', regex=True)
    

    See this regex demo.

    Details:

    • ^[\s,]+ - one or more whitespaces or commas at the start of string
    • [\s,]+$ - one or more whitespaces or commas at the end of string
    • (\s)*(,)[\s,]* - zero or more whitespaces (the last one matched is kept in Group 1, \1), then a comma (captured into Group 2, \2) and then zero or more whitespace or comma chars.

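    The replacement, \2\1, is the Group 2 value followed by the Group 1 value, that is, a comma and then the captured whitespace. When one of the first two alternatives matches, both groups are empty and the match is simply removed.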
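
To see how each alternative behaves, the same pattern can be tried with Python's re.sub on plain strings. This is a sketch under a few assumptions: the clean helper and the sample strings are illustrative and not part of the original answer, and it relies on Python 3.5 or later, where an unmatched group in the replacement expands to an empty string. As far as I know, this mirrors what Series.str.replace(..., regex=True) does for each value of an ordinary object-dtype column.

```python
import re

# Same pattern as in the str.replace call above.
PATTERN = re.compile(r"^[\s,]+|[\s,]+$|(\s)*(,)[\s,]*")


def clean(text: str) -> str:
    """Hypothetical helper: collapse comma runs, trim them at both ends."""
    # \2 is the comma, \1 is the last whitespace matched before it.
    # Both are empty when a leading or trailing run was matched.
    return PATTERN.sub(r"\2\1", text)


samples = [
    ", , , Apples , , , Hard Work , , ",
    ", , , , , , , , Apples , , , , , ",
    "Apples , , Watermelon , , , , , , ",
    ", , , , , , , , , , , , , , , , , ",
]
cleaned = [clean(s) for s in samples]
print(cleaned)
# ['Apples, Hard Work', 'Apples', 'Apples, Watermelon', '']

# Caveat: without a space before the comma, Group 1 is empty,
# so the separator becomes "," instead of ", ".
print(clean("Apples,,Pears"))
# Apples,Pears
```

The second print shows why this approach assumes space-padded commas: the space in the output comes only from the whitespace captured before the comma.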