I'm trying to split a column by the last ' - '
AND being followed by all uppercase strings letters.
It may not necessarily be the last delimiter in isolation. But it will be the last before all uppercase strings.
I can find separate questions that separate based on first/last delimiter. But not with a combination.
Below, I have a df with Value
containing various combinations. I want to split the col into two individuals columns, whereby, everything before the last ' - '
and uppercase letters.
I've got Last
column correct but not First
column.
df = pd.DataFrame({
'Value': ['Juan-Diva - HOLLS', 'Carlos - George - ESTE BAN - BOM', 'Javier Plain - Hotham Ham - ALPINE', 'Yul - KONJ KOL MON'],
})
option 1)
df[['First', 'l']] = df['Value'].str.split(' - ', n=1, expand=True)
df['Last'] = df['Value'].str.split('- ').str[-1]
option 2)
# Regular expression pattern
pattern = r'^(.*) - ([A-Z\s]+)$'
# Extract groups into two new columns
df[['First', 'Last']] = df['Value'].str.extract(pattern)
option 3)
df[["First", "Last"]] = df["Value"].str.rsplit(" - ", n=1, expand=True)
None of these options return the intended output.
intended output:
First Last
0 Juan-Diva HOLLS
1 Carlos - George ESTE BAN - BOM
2 Javier Plain - Hotham Ham ALPINE
3 Yul KONJ KOL MON
regex:
df[["First", "Last"]] = df["Value"].str.extract(r'(.*?)\s*-\s*([A-Z]+(?:\s*-?\s*[A-Z]+)*)')
Value First Last
0 Juan-Diva - HOLLS Juan D
1 Carlos - George - ESTE BAN - BOM Carlos G
2 Javier Plain - Hotham Ham - ALPINE Javier Plain H
3 Yul - KONJ KOL MON Yul KONJ KOL MON
I would use str.extract
here:
df[["First", "Last"]] = df["Value"].str.extract(r'(.*?)\s*-\s*([A-Z]+(?:\s*-?\s*[A-Z]+)*)$')
The above uses the 2/multicolumn version of str.extract
. It captures the first name in the first capture group. It matches until hitting the nearest hyphen which is followed by all caps. Then it matches the last name in the second capture group.