A df has column ABC_81, ABC_y, ABC_aa, ZXC_aa, ZXC_5, ZXC_hi
:
data = {
'ABC_81': [1, 2, 3],
'ABC_y': [4, 5, 6],
'ABC_aa': [7, 8, 9],
'ZXC_aa': [10, 11, 12],
'ZXC_5': [13, 14, 15],
'ZXC_hi': [16, 17, 18],
'PPP_tr' : [88,55,99],
'QWE_gf' : [78,98,14]
}
Going from left to right in the df, I want to keep first instance of column name starting with ABC and ZXC, drop other columns starting with ABC and ZXC.
Keep ABC_81,
Drop ABC_y, ABC_aa
Similarly
Keep ZXC_aa,
Drop ZXC_5, ZXC_hi
and so on
Result should be:
data = {
'ABC_81': [1, 2, 3],
'ZXC_aa': [10, 11, 12],
'PPP_tr' : [88,55,99],
'QWE_gf' : [78,98,14]
}
Additionally, can this be done without knowing the column names?
Assuming the fixed part come before the underscore, you can use a regex to extract
the prefix, then identify the first value with duplicated
and finally use boolean indexing to select the first column for each prefix:
m = ~df.columns.str.extract(r'^([^_]+)', expand=False).duplicated()
out = df.loc[:, m]
As a one-liner:
out = df.loc[:, ~df.columns.str.extract(r'^([^_]+)', expand=False).duplicated()]
Output:
ABC_81 ZXC_aa PPP_tr QWE_gf
0 1 10 88 78
1 2 11 55 98
2 3 12 99 14
Intermediates:
df.columns.str.extract(r'^([^_]+)', expand=False)
# Index(['ABC', 'ABC', 'ABC', 'ZXC', 'ZXC', 'ZXC'], dtype='object')
df.columns.str.extract(r'^([^_]+)', expand=False).duplicated()
# array([False, True, True, False, True, True, False, False])
~df.columns.str.extract(r'^([^_]+)', expand=False).duplicated()
# array([ True, False, False, True, False, False, True, True])
You can add more conditions if needed:
# extract prefix
prefix = df.columns.str.extract(r'^([^_]+)', expand=False)
# identify duplicated columns that match ABC/ZXC
m = prefix.isin(['ABC', 'ZXC']) & prefix.duplicated()
# select all but duplicated ABC/ZXC
out = df.loc[:, ~m]