I have this dataframe with the Text
column
Text | Cleaned Col |
---|---|
, , , Apples , , , Hard Work , , | Apples, Hard Work |
, , , , , , , , Apples , , , , , | Apples |
Apples , , Watermelon , , , , , , | Apples, Watermelon |
, , , , , , , , , , , , , , , , , |
I would like to create a column such as Cleaned Col
essentially using regex.
I looked at different patterns such as this r'\s*,*([^(a-zA-Z)]*)'
but I am not getting the right outcome.
Since your fields are comma-delimited you can use
# If the fields CANNOT contain whitespace:
df['Cleaned Col'] = df['Text'].str.findall(r'[^\s,]+').str.join(', ')
# If the fields can contain whitespace:
df['Cleaned Col'] = df['Text'].str.findall(r'[^\s,](?:[^,]*[^\s,])?').str.join(', ')
The regex extracts all found matches and .str.join(', ')
joins the resulting list items into a single string. The regex (see its demo) means:
[^\s,]+
- one or more chars other than whitespace and comma[^\s,]
- a single char other than whitespace and comma(?:[^,]*[^\s,])?
- an optional occurrence of any zero or more chars other than a comma and then a char other than whitespace and comma.If you have your commas padded with spaces and you really want to use Series.str.replace
, you could use
df['Cleaned Col'] = df['Text'].str.replace(r'^[\s,]+|[\s,]+$|(\s)*(,)[\s,]*', r'\2\1', regex=True)
See this regex demo.
Details:
^[\s,]+
- one or more whitespaces or commas at the start of string[\s,]+$
- one or more whitespaces or commas at the end of string(\s)*(,)[\s,]*
- zero or more whitespaces (the last one matched is kept in Group 1, \1
), then a comma (captured into Group 2, \2
) and then zero or more whitespace or comma chars.The replacement is Group 2 + Group 1 values.