I have the following dataframe:
data = [['123456ABCD234567', 'A'], ['8502', 'A'],
['74523654894WRZI3', 'B'], ['85CGNK6987541236', 'B'],
['WF85Z4HJ95R4CF2V', 'C'], ['VB52FG85RT74DF96', 'C'],
['WERTZ852146', 'D'], ['APUNGF', 'D']
]
df = pd.DataFrame(data, columns=['CODE', 'STOCK'])
df
CODE STOCK
0 123456ABCD234567 A
1 8502 A
2 74523654894WRZI3 B
3 85CGNK6987541236 B
4 WF85Z4HJ95R4CF2V C
5 VB52FG85RT74DF96 C
6 WERTZ852146 D
7 APUNGF D
Each stock is part of various codes. The code should have a length of 16 characters. My objective is to filter out any stocks that have no codes attached which are not composed of 16 characters. In this example, stock A has at least one code with a length of 16 characters, so it should be kept. However, stock D has no codes with a length of 16 characters.
I believe this can be accomplished using the groupby-function in Pandas.
Ultimately, I aim at obtaining below output:
CODE STOCK
6 WERTZ852146 D
7 APUNGF D
Many thanks for any suggestions in advance!
You could create a boolean column for values not matching the condition and groupby.transform
with all
to identify the STOCK with all non matching rows:
out = df[df['CODE'].str.len().ne(16).groupby(df['STOCK']).transform('all')]
Output:
CODE STOCK
6 WERTZ852146 D
7 APUNGF D
Intermediates:
CODE STOCK str.len ne(16) transform('all')
0 123456ABCD234567 A 16 False False
1 8502 A 4 True False
2 74523654894WRZI3 B 16 False False
3 85CGNK6987541236 B 16 False False
4 WF85Z4HJ95R4CF2V C 16 False False
5 VB52FG85RT74DF96 C 16 False False
6 WERTZ852146 D 11 True True
7 APUNGF D 6 True True
Using DeMorgan's law you could also run:
out = df[~df['CODE'].str.len().eq(16).groupby(df['STOCK']).transform('any')]
Intermediates:
CODE STOCK str.len eq(16) transform('any') ~
0 123456ABCD234567 A 16 True True False
1 8502 A 4 False True False
2 74523654894WRZI3 B 16 True True False
3 85CGNK6987541236 B 16 True True False
4 WF85Z4HJ95R4CF2V C 16 True True False
5 VB52FG85RT74DF96 C 16 True True False
6 WERTZ852146 D 11 False False True
7 APUNGF D 6 False False True
And without groupby
you could identify all the STOCK that have at least one match, and reverse select the others with isin
:
out = df[~df['STOCK'].isin(df.loc[df['CODE'].str.len().eq(16), 'STOCK'].unique())]
Intermediates:
df.loc[df['CODE'].str.len().eq(16), 'STOCK'].unique()
# array(['A', 'B', 'C'], dtype=object)
~df['STOCK'].isin(df.loc[df['CODE'].str.len().eq(16), 'STOCK'].unique())
# 0 False
# 1 False
# 2 False
# 3 False
# 4 False
# 5 False
# 6 True
# 7 True
# Name: STOCK, dtype: bool