Search code examples
pythonpandasdataframedata-cleaning

Python Data Cleaning with Pandas


My dataset looks like this

Name    Subset            Value
A       67-A-5678          14
A       58-ABC-87555       187
A       45-ASH-87954       5465
S       34-A-8785          454
S       58-ASO-98978       54
S       23-ASH-87895       784
X       98-X-87876         455
X       87-ABC-54578       4545
X       56-ASH-89667       854
Y       09-D-98644         45
Y       87-ABC-78834       98
Y       87-ASH-87455A      4566
L       67-A-87545         78
L       89-GHS-08753       12 
L       78-PHU-09876       655 

I want to keep only those groups of rows whose "subset" columns are of pattern; *, *ABC, *ASH (Note: * is any alphabet or digit).

For example, output should look like

Name    Subset             Value
A       67-A-5678          14
A       58-ABC-87555       187
A       45-ASH-87954       5465
X       98-X-87876         455
X       87-ABC-54578       4545
X       56-ASH-89667       854
Y       09-D-98644         45
Y       87-ABC-78834       98
Y       87-ASH-87455A      4566

P.S. Actual dataset can be of many columns/rows.


Solution

  • Try this:

    filtered = df[df.groupby('Name')['Subset'].transform(lambda x: len(x) >= 3 and'-ABC-' in x.iloc[1] and '-ASH-' in x.iloc[2])]
    

    Output:

    >>> filtered
       Name         Subset  Value
    0     A      67-A-5678     14
    1     A   58-ABC-87555    187
    2     A   45-ASH-87954   5465
    6     X     98-X-87876    455
    7     X   87-ABC-54578   4545
    8     X   56-ASH-89667    854
    9     Y     09-D-98644     45
    10    Y   87-ABC-78834     98
    11    Y  87-ASH-87455A   4566