Keep first instance of column and drop others, using partial text in column name

A df has column ABC_81, ABC_y, ABC_aa, ZXC_aa, ZXC_5, ZXC_hi:

data = {
    'ABC_81': [1, 2, 3],
    'ABC_y': [4, 5, 6],
    'ABC_aa': [7, 8, 9],
    'ZXC_aa': [10, 11, 12],
    'ZXC_5': [13, 14, 15],
    'ZXC_hi': [16, 17, 18],
    'PPP_tr' : [88,55,99],
    'QWE_gf' : [78,98,14]
    }

Going from left to right in the df, I want to keep first instance of column name starting with ABC and ZXC, drop other columns starting with ABC and ZXC.

Keep ABC_81,
Drop ABC_y, ABC_aa

Similarly

Keep ZXC_aa,
Drop ZXC_5, ZXC_hi

and so on

Result should be:

data = {
    'ABC_81': [1, 2, 3],
    'ZXC_aa': [10, 11, 12],
    'PPP_tr' : [88,55,99],
    'QWE_gf' : [78,98,14]
    }

Additionally, can this be done without knowing the column names?

Solution

Assuming the fixed part come before the underscore, you can use a regex to extract the prefix, then identify the first value with duplicated and finally use boolean indexing to select the first column for each prefix:

m = ~df.columns.str.extract(r'^([^_]+)', expand=False).duplicated()

out = df.loc[:, m]

As a one-liner:

out = df.loc[:, ~df.columns.str.extract(r'^([^_]+)', expand=False).duplicated()]

Output:

   ABC_81  ZXC_aa  PPP_tr  QWE_gf
0       1      10      88      78
1       2      11      55      98
2       3      12      99      14

Intermediates:

df.columns.str.extract(r'^([^_]+)', expand=False)
# Index(['ABC', 'ABC', 'ABC', 'ZXC', 'ZXC', 'ZXC'], dtype='object')

df.columns.str.extract(r'^([^_]+)', expand=False).duplicated()
# array([False,  True,  True, False,  True,  True, False, False])

~df.columns.str.extract(r'^([^_]+)', expand=False).duplicated()
# array([ True, False, False,  True, False, False,  True,  True])

regex demo

more complex filter

You can add more conditions if needed:

# extract prefix
prefix = df.columns.str.extract(r'^([^_]+)', expand=False)
# identify duplicated columns that match ABC/ZXC
m = prefix.isin(['ABC', 'ZXC']) & prefix.duplicated()

# select all but duplicated ABC/ZXC
out = df.loc[:, ~m]