Search code examples
pythonpandaspercentagespark-notebook

python Count of ID's for percentage levels


I have the table like below which is stored in a DataFrame. I want to add the tolerance_level = [0, 1, 5, 10, 20, 30, 50, 100, 200, 300, 500, 700] and get the count of ID's for columns A, B, C that fall under the tolerance_level

For example: in the below table count of ID's for column A having tolerance_level of 10% = 2 count of ID's for column A having tolerance_level of 0% = 1 count of ID's for columns A having tolerance_level of 100% = 1 and 700% =1

And count of ID's for column B having tolerance_level of 30% = 2 and so on..... And if a columns has % which is not defined in the tolerance_level then the ID falls under the nearest tolerance_level. For example, if a columns has 900% then it will be in >700% tolerance_level

ID A B C
1 0% 1% 5%
3 10% 30% 50%
6 100% 300% 500%
7 700% 900% 50%
10 10% 30% 50%

So, the result would be something like

tolerance_level A B C
0% 17 100 50
1% 10 50 70
5% 60 80 40

Solution

  • Use merge_asof for add new column filled by nearest values from helper DataFrame and count output by crosstab:

    df = pd.DataFrame({'ID': [1, 3, 6, 7, 10], 
                       'A': ['0%', '17%', '108%', '700%', '10%'], 
                       'B': ['1%', '30%', '299%', '900%', '30%'], 
                       'C': ['5%', '50%', '500%', '51%', pd.Timestamp('12-10-2000')]})
    print (df)
       ID     A     B                    C
    0   1    0%    1%                   5%
    1   3   17%   30%                  50%
    2   6  108%  299%                 500%
    3   7  700%  900%                  51%
    4  10   10%   30%  2000-12-10 00:00:00
    

    tolerance_level = [0, 1, 5, 10, 20, 30, 50, 100, 200, 300, 500, 700]
    df1 = pd.DataFrame({'tolerance_level':tolerance_level})
    df1['tolerance_level'] = df1['tolerance_level'].astype('float')
    
    df2 = df.melt('ID', value_name='tol')
    df2['tol'] = pd.to_numeric(df2['tol'].str.rstrip('%'), errors='coerce')
    
    df3 = pd.merge_asof(df2.dropna(subset=['tol']).sort_values('tol'), 
                        df1, 
                        left_on='tol',
                        right_on='tolerance_level', 
                        direction='nearest')
    
    
    out = (pd.crosstab(df3['tolerance_level'], df3['variable'])
             .rename_axis(columns=None)
             .rename(lambda x: f"{x}%")
             .reset_index())
    print (out)
       tolerance_level  A  B  C
    0               0%  1  0  0
    1               1%  0  1  0
    2               5%  0  0  1
    3              10%  1  0  0
    4              20%  1  0  0
    5              30%  0  2  0
    6              50%  0  0  3
    7             100%  1  0  0
    8             300%  0  1  0
    9             500%  0  0  1
    10            700%  1  1  0