pandas regex dataframe data-cleaning python-re

How to stop regex matching before a special character


I'm working with regex in Python to clean a dataset. Below is a sample.

Player
DG Bradman (AUS)
HC Brook (ENG)

I am trying to use regex to split the player name from the country. I know that str.split would work, but I would like to see whether this can be done with regex.

Country=Player_column.str.extract(r"(\B\(.+)")
Player=Player_column.str.extract(r"([^a-z]\$(.)")
df['Country'] = Country
df['Player'] = Player
df

I was able to extract the part within the brackets (the country name), but I can't work out how to extract the player name alone. Could someone help me with this, please?


Solution

  • If all of the lines match that format, you can extract the three parts with a small regex: [^ )(]+

    That returns each run of characters that contains neither a space nor a parenthesis, so for this example you get ['DG', 'Bradman', 'AUS'] back.

    import re

    inputstring = "DG Bradman (AUS)"

    # Each match is a run of characters with no space or parenthesis.
    print(re.findall(r"[^ )(]+", inputstring))  # ['DG', 'Bradman', 'AUS']
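If the goal is to keep the full player name together and stop matching before the opening parenthesis, one option is a single pattern with two named groups. The following is a minimal sketch, not the answerer's method; the names PLAYER_PATTERN and split_player are hypothetical, and it assumes every row looks like "Name (COUNTRY)".

```python
import re

# Hypothetical pattern, assuming each value looks like "Name (COUNTRY)".
# Player:  everything before the opening parenthesis (trailing space excluded)
# Country: everything between the parentheses
PLAYER_PATTERN = re.compile(r"^(?P<Player>[^(]+?)\s*\((?P<Country>[^)]+)\)$")


def split_player(text):
    """Return (player, country), or (None, None) if the text does not match."""
    match = PLAYER_PATTERN.match(text)
    if match is None:
        return None, None
    return match.group("Player"), match.group("Country")


samples = ["DG Bradman (AUS)", "HC Brook (ENG)"]
results = [split_player(s) for s in samples]
print(results)  # [('DG Bradman', 'AUS'), ('HC Brook', 'ENG')]
```

With pandas, the same pattern string (PLAYER_PATTERN.pattern) can be passed to Player_column.str.extract, which returns a DataFrame with one column per named group, here Player and Country.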