Suppose that I have a dataset with three categorical columns: df.type1 df.type2 df.type3 and I want to create a new column [df.new] that it takes:
df.new = df.type1 if df.type1 is true and the remaining are false
df.new = df.type2 if df.type2 is true and the remaining are false
df.new = df.type3 if df.type3 is true and the remaining are false
What is the best approach? I am quite confused by np.where() - too long and too script intense
Example:
City dt.t1 dt.t2 dt.t3
NY US Non EU Non Asia
Rome Non US EU Non Asia
SF US Non EU Non Asia
HK Non US Non EU Asia
my final result would be:
City dt.new
NY US
Rome EU
SF US
HK Asia
Use:
df = df.set_index('City')
df['dt.new'] = df.mask(df.apply(lambda x: x.str.contains('Non\s+'))).ffill(axis=1).iloc[:, -1]
Alternative solution with select columns for check values:
cols = df.filter(regex='^dt\.').columns
#or use list of columns names
#cols = ['dt.t1','dt.t2','dt.t3']
df['dt.new'] = df[cols].mask(df[cols].apply(lambda x: x.str.contains('Non\s+'))).ffill(axis=1).iloc[:, -1]
print (df)
dt.t1 dt.t2 dt.t3 dt.new
City
NY US Non EU Non Asia US
Rome Non US EU Non Asia EU
SF US Non EU Non Asia US
HK Non US Non EU Asia Asia
Detail:
First set_index
by City
column and then check for Non
string with one or more whitespaces:
df = df.set_index('City')
print (df.apply(lambda x: x.str.contains('Non\s+')))
dt.t1 dt.t2 dt.t3
City
NY False True True
Rome True False True
SF False True True
HK True True False
Then replace matching values to NaN
s by mask
:
print (df.mask(df.apply(lambda x: x.str.contains('Non\s+'))))
dt.t1 dt.t2 dt.t3
City
NY US NaN NaN
Rome NaN EU NaN
SF US NaN NaN
HK NaN NaN Asia
Forward fill non missing values per rows:
print (df.mask(df.apply(lambda x: x.str.contains('Non\s+'))).ffill(axis=1))
dt.t1 dt.t2 dt.t3
City
NY US US US
Rome NaN EU EU
SF US US US
HK NaN NaN Asia
And last select last column:
print (df.mask(df.apply(lambda x: x.str.contains('Non\s+'))).ffill(axis=1).iloc[:, -1])
City
NY US
Rome EU
SF US
HK Asia
Name: dt.t3, dtype: object
EDIT:
m1 = df['dt.t1'] == 'US'
m2 = df['dt.t2'] == 'EU'
m3 = df['dt.t3'] == 'Asia'
df['dt.new'] = np.select([m1, m2, m3], ['US','EU','Asia'], default=None)
Or:
df['dt.new'] = np.where(m1, 'US',
np.where(m2, 'EU',
np.where(m3, 'Asia', None)))
print (df)
City dt.t1 dt.t2 dt.t3 dt.new
0 NY US Non EU Non Asia US
1 Rome Non US EU Non Asia EU
2 SF US Non EU Non Asia US
3 HK Non US Non EU Asia Asia