Search code examples
pythonpandasdata-analysisdata-cleaning

Data Cleaning - removing trailing phrases


I'm cleaning some data and wondering how to remove trailing phrases. I don't want to get rid of all numbers as some flavors have numbers. The first table is the pre-cleaned data, the second table is what I want.

Flavor
Orange 5 ml
Cherry
Strawberry 5 mg/ml
#1 flavor
Passion fruit 1.
Cherry Blossom
Flavor
Orange
Cherry
Strawberry
#1 flavor
Passion fruit
Cherry Blossom

Solution

  • Like all data cleansing, this requires knowledge of the entire dataset, so the help you can get is minimal. However, I've cooked up a regular expression that you can use to remove numbers, whitespace, units (ml, mg), slashes (/) and periods (.) from the end of the strings:

    \s*\b[/mgl\d\s.]+$
    

    You can use it like this:

    df['Flavor'] = df['Flavor'].str.replace(r'\s*\b[/mgl\d\s.]+$', '', regex=True)