I'm trying to extract strings using different regex rules conditional on some column value in Python. For example, given the following dataframe:
col1 col2
1 John Smith First
2 Jane Smith First
3 Pritchard James Doe Second
4 Helen Joanne Doe Second
5 Walker Jean Last
6 Hall Jensen Last
I want to obtain:
col1 col2 col3
1 John Smith First John
2 Jane Smith First Jane
3 Pritchard James Doe Second James
4 Helen Joanne Doe Second Joanne
5 Walker Jean Last Jean
6 Hall Jensen Last Jensen
In R I would do something like:
library(tidyverse)
df <- data.frame(col1 = c("John Smith", "Jane Smith", "Pritchard James Doe", "Helen Joanne Doe", "Walker Jean", "Hall Jensen"),
col2 = c("First", "First", "Second", "Second", "Last", "Last"))
df %>%
mutate(col3 = case_when(col2 == "First" ~ str_extract(col1, "^[A-Za-z]+"),
col2 == "Second" ~ str_extract(col1, "(?<=\\s+)[A-Za-z]+"),
col2 == "Last" ~ str_extract(col1, "[A-Za-z]+$")))
However, I'm unsure how to achieve this in Python. I've tried to use case_when in pandas but been unsuccessful in getting it to work, attempting to use a lambda function or pd.Series.str.extract
import pandas as pd
import re
d = {'col1': ['John Smith', 'Jane Smith', 'Pritchard James Doe', 'Helen Joanne Doe', 'Walker Jean', 'Hall Jensen'],
'col2': ['First', 'First', 'Second', 'Second', 'Last', 'Last']}
df = pd.DataFrame(d)
cl = [(df['col2'] == 'First', lambda x: re.search('(^[A-Za-z]+)', x).group()),
(df['col2'] == 'Second', lambda x: re.search(r'(?<=\s)([A-Za-z]+)', x).group()),
(df['col2'] == 'Last', lambda x: re.search('([A-Za-z]+$)', x).group())]
df.assign(col3 = df['col1'].case_when(cl))
Use groupby
with a dictionary of the patterns to extract
:
d = {'First': r'(^[A-Za-z]+)',
'Second': r'(?<=\s)([A-Za-z]+)',
'Last': r'([A-Za-z]+$)',
}
default_regex = r'()'
df['col3'] = (df.groupby('col2', group_keys=False)['col1']
.apply(lambda x: x.str.extract(d.get(x.name, default_regex)))
)
Less efficient alternative with numpy.select
:
lst = [lambda x: re.search(r'(^[A-Za-z]+)', x).group(),
lambda x: re.search(r'(?<=\s)([A-Za-z]+)', x).group(),
lambda x: re.search(r'([A-Za-z]+$)', x).group(),
]
default_regex = lambda x: re.search('', x).group()
df['col3'] = np.select(
[df['col2'] == 'First', df['col2'] == 'Second', df['col2'] == 'Last'],
[df['col1'].apply(f) for f in lst],
df['col1'].apply(default_regex)
)
Output:
col1 col2 col3
0 John Smith First John
1 Jane Smith First Jane
2 Pritchard James Doe Second James
3 Helen Joanne Doe Second Joanne
4 Walker Jean Last Jean
5 Hall Jensen Other