Remove unwanted characters from a set of strings in Python


I am trying to clean a set of strings to remove unwanted characters.

Input

Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . Alex Jary7 .
Muscika 1 v5+ W5+ 0 0 D 5 v5+ W5+ D O'Meara . Cam Hardie . C5
Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .
Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harker . Connor Beasley .
One Night Stand 0 0 D 34 W Jarvis . Silvestre De Sousa . 30 C1 C5
Dancinginthewoods 1 0 0 D 24 D Ivory . 14 Jamie Spencer . 30
Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew Mullen . 14

Wanted Output

Lethal Lunch
Muscika
Typhoon Ten
Wentworth Falls
One Night Stand
Dancinginthewoods 
Case Key

I have tried this

re.findall('([a-zA-Z ]*)\d*.*',final_df.loc[index, 'Horse'])

This removes everything after a number, but it leaves the t on the first entry. Is there a better way?


Solution

  • I'd use re.split instead:

    import re

    # data is assumed to hold the input above as one multi-line string
    for d in data.splitlines():
        print(re.split(r'\s+t?[0-9]\+?', d)[0])
    
    Result
    Lethal Lunch
    Muscika
    Typhoon Ten
    Wentworth Falls
    One Night Stand
    Dancinginthewoods
    Case Key
    

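    Explanation: re.split splits each line wherever the pattern matches, and [0] keeps the part before the first match. The pattern matches whitespace, an optional t, a digit and an optional +, so a name followed directly by a different code such as v5+ or cp5+ would keep that code. You probably want to extend the pattern if such lines occur.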
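As a rough sketch of such a tweak, the pattern below allows up to two lowercase letters before the first digit, so codes like cp5+ or v3 that directly follow the name are cut off as well. The helper name horse_name and the {0,2} limit are assumptions made for illustration, not part of the original answer, and the last sample line is made up. A horse name that itself contains a digit would still be truncated.

```python
import re

# Assumed generalisation: whitespace, then zero to two lowercase
# letters (t, v, cp, ...), then a digit.
FIRST_CODE = re.compile(r'\s+[a-z]{0,2}[0-9]')

def horse_name(line):
    """Return the text before the first digit or letter+digit token."""
    # maxsplit=1 stops after the first match; if nothing matches,
    # the whole line is returned unchanged.
    return FIRST_CODE.split(line, maxsplit=1)[0]

samples = [
    "Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . Alex Jary7 .",
    "Muscika 1 v5+ W5+ 0 0 D 5 v5+ W5+ D O'Meara . Cam Hardie . C5",
    "Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew Mullen . 14",
    "Wentworth Falls cp5+ 0 0 C D",  # hypothetical line, not from the question
]
names = [horse_name(s) for s in samples]
print(names)
```

Compiling the pattern once keeps the loop cheap when the function is called for many rows.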

    In Pandas

    I just noticed you seem to be using Pandas – assuming your df looks like this:

                                                   Horse
    0  Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . A...
    1  Muscika 1 v5+ W5+ 0 0 D 5 v5+ W5+ D O'Meara . ...
    2  Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .
    3  Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harke...
    4  One Night Stand 0 0 D 34 W Jarvis . Silvestre ...
    5  Dancinginthewoods 1 0 0 D 24 D Ivory . 14 Jami...
    6  Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew M...
    

    You can do

    from operator import itemgetter
    
    df["name"] = df.Horse.str.split(r'\s+t?[0-9]\+?').map(itemgetter(0))
    

    to get this:

                                                   Horse               name
    0  Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . A...       Lethal Lunch
    1  Muscika 1 v5+ W5+ 0 0 D 5 v5+ W5+ D O'Meara . ...            Muscika
    2  Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .        Typhoon Ten
    3  Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harke...    Wentworth Falls
    4  One Night Stand 0 0 D 34 W Jarvis . Silvestre ...    One Night Stand
    5  Dancinginthewoods 1 0 0 D 24 D Ivory . 14 Jami...  Dancinginthewoods
    6  Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew M...           Case Key
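A possible alternative, sketched here under the assumption that the column is called Horse as above, is to capture the name directly with Series.str.extract instead of splitting and taking the first piece. The three-row frame is only a cut-down stand-in for the real data. One behavioural difference to be aware of: rows where the pattern does not match become NaN, whereas the split version returns the whole string.

```python
import pandas as pd

# Small stand-in frame using rows from the question
df = pd.DataFrame({"Horse": [
    "Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . Alex Jary7 .",
    "Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .",
    "Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew Mullen . 14",
]})

# Lazily capture everything before the first
# whitespace + optional "t" + digit; expand=False returns a Series.
df["name"] = df["Horse"].str.extract(r'^(.*?)\s+t?[0-9]', expand=False)
print(df["name"].tolist())
```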