I'm cleaning some data and wondering how to remove trailing phrases. I don't want to get rid of all numbers as some flavors have numbers. The first table is the pre-cleaned data, the second table is what I want.
Flavor |
---|
Orange 5 ml |
Cherry |
Strawberry 5 mg/ml |
#1 flavor |
Passion fruit 1. |
Cherry Blossom |
Flavor |
---|
Orange |
Cherry |
Strawberry |
#1 flavor |
Passion fruit |
Cherry Blossom |
Like all data cleansing, this requires knowledge of the entire dataset, so the help you can get is minimal. However, I've cooked up a regular expression that you can use to remove numbers, whitespace, units (ml
, mg
), slashes (/
) and periods (.
) from the end of the strings:
\s*\b[/mgl\d\s.]+$
You can use it like this:
df['Flavor'] = df['Flavor'].str.replace(r'\s*\b[/mgl\d\s.]+$', '', regex=True)