Search code examples
pythonpandassplit

Split column by last delimiter AND uppercase values - python


I'm trying to split a column by the last ' - ' AND being followed by all uppercase strings letters.

It may not necessarily be the last delimiter in isolation. But it will be the last before all uppercase strings.

I can find separate questions that separate based on first/last delimiter. But not with a combination.

Below, I have a df with Value containing various combinations. I want to split the col into two individuals columns, whereby, everything before the last ' - ' and uppercase letters.

I've got Last column correct but not First column.

df = pd.DataFrame({
   'Value': ['Juan-Diva - HOLLS', 'Carlos - George - ESTE BAN - BOM', 'Javier Plain - Hotham Ham - ALPINE', 'Yul - KONJ KOL MON'],
   })

option 1)

df[['First', 'l']] = df['Value'].str.split(' - ', n=1, expand=True)

df['Last'] = df['Value'].str.split('- ').str[-1]

option 2)

# Regular expression pattern
pattern = r'^(.*) - ([A-Z\s]+)$'

# Extract groups into two new columns
df[['First', 'Last']] = df['Value'].str.extract(pattern)

option 3)

df[["First", "Last"]] = df["Value"].str.rsplit(" - ", n=1, expand=True)

None of these options return the intended output.

intended output:

                       First            Last
0                  Juan-Diva           HOLLS
1            Carlos - George  ESTE BAN - BOM
2  Javier Plain - Hotham Ham          ALPINE
3                        Yul    KONJ KOL MON

regex:

df[["First", "Last"]] = df["Value"].str.extract(r'(.*?)\s*-\s*([A-Z]+(?:\s*-?\s*[A-Z]+)*)')
                                Value         First          Last
0                   Juan-Diva - HOLLS          Juan             D
1    Carlos - George - ESTE BAN - BOM        Carlos             G
2  Javier Plain - Hotham Ham - ALPINE  Javier Plain             H
3                  Yul - KONJ KOL MON           Yul  KONJ KOL MON

Solution

  • I would use str.extract here:

    df[["First", "Last"]] = df["Value"].str.extract(r'(.*?)\s*-\s*([A-Z]+(?:\s*-?\s*[A-Z]+)*)$')
    

    Demo

    The above uses the 2/multicolumn version of str.extract. It captures the first name in the first capture group. It matches until hitting the nearest hyphen which is followed by all caps. Then it matches the last name in the second capture group.