Search code examples
pythonpandasgroup-byconditional-statements

Pandas Groupby Show Non-Matching Values


I have the following dataframe:

data = [['123456ABCD234567', 'A'], ['8502', 'A'], 
    ['74523654894WRZI3', 'B'], ['85CGNK6987541236', 'B'], 
    ['WF85Z4HJ95R4CF2V', 'C'], ['VB52FG85RT74DF96', 'C'],
    ['WERTZ852146', 'D'], ['APUNGF', 'D']
   ]
df = pd.DataFrame(data, columns=['CODE', 'STOCK'])
df

    CODE                STOCK
0   123456ABCD234567    A
1   8502                A
2   74523654894WRZI3    B
3   85CGNK6987541236    B
4   WF85Z4HJ95R4CF2V    C
5   VB52FG85RT74DF96    C
6   WERTZ852146         D
7   APUNGF              D

Each stock is part of various codes. The code should have a length of 16 characters. My objective is to filter out any stocks that have no codes attached which are not composed of 16 characters. In this example, stock A has at least one code with a length of 16 characters, so it should be kept. However, stock D has no codes with a length of 16 characters.

I believe this can be accomplished using the groupby-function in Pandas.

Ultimately, I aim at obtaining below output:

    CODE          STOCK
6   WERTZ852146   D
7   APUNGF        D

Many thanks for any suggestions in advance!


Solution

  • You could create a boolean column for values not matching the condition and groupby.transform with all to identify the STOCK with all non matching rows:

    out = df[df['CODE'].str.len().ne(16).groupby(df['STOCK']).transform('all')]
    

    Output:

              CODE STOCK
    6  WERTZ852146     D
    7       APUNGF     D
    

    Intermediates:

                   CODE STOCK  str.len  ne(16)  transform('all')
    0  123456ABCD234567     A       16   False             False
    1              8502     A        4    True             False
    2  74523654894WRZI3     B       16   False             False
    3  85CGNK6987541236     B       16   False             False
    4  WF85Z4HJ95R4CF2V     C       16   False             False
    5  VB52FG85RT74DF96     C       16   False             False
    6       WERTZ852146     D       11    True              True
    7            APUNGF     D        6    True              True
    

    Using DeMorgan's law you could also run:

    out = df[~df['CODE'].str.len().eq(16).groupby(df['STOCK']).transform('any')]
    

    Intermediates:

                   CODE STOCK  str.len  eq(16)  transform('any')      ~
    0  123456ABCD234567     A       16    True              True  False
    1              8502     A        4   False              True  False
    2  74523654894WRZI3     B       16    True              True  False
    3  85CGNK6987541236     B       16    True              True  False
    4  WF85Z4HJ95R4CF2V     C       16    True              True  False
    5  VB52FG85RT74DF96     C       16    True              True  False
    6       WERTZ852146     D       11   False             False   True
    7            APUNGF     D        6   False             False   True
    

    And without groupby you could identify all the STOCK that have at least one match, and reverse select the others with isin:

    out = df[~df['STOCK'].isin(df.loc[df['CODE'].str.len().eq(16), 'STOCK'].unique())]
    

    Intermediates:

    df.loc[df['CODE'].str.len().eq(16), 'STOCK'].unique()
    # array(['A', 'B', 'C'], dtype=object)
    
    ~df['STOCK'].isin(df.loc[df['CODE'].str.len().eq(16), 'STOCK'].unique())
    # 0    False
    # 1    False
    # 2    False
    # 3    False
    # 4    False
    # 5    False
    # 6     True
    # 7     True
    # Name: STOCK, dtype: bool