Search code examples
pythonpandaspandas-groupbycrosstab

Pandas Check for key values in rows prior to running pd.crosstab and add known key values if absent


Afternoon All,

I run a script to create a crosstab style report from which calculations are added to the output, this assumes all fields used in the calculations are present. This assumption is incorrect, especially early in the month where a number of Results are not yet present for each Client. I would like to add some validation prior to the run. There are no issues if and only if the following values are ALL present for field Result for each Client

Covered
Customer Reject
Customer Timeout
Dealer Reject
Dealer Timeout
Done
No RFQ
Tied Covered
Tied Done
Tied Traded Away
Traded Away

The case where all values are present and the desired output is displayed:

df = {'Client': ['A', 'A', 'A',
                 'B', 'B', 'B', 'B','B','B','B','B',
                 'C',
                 'D','D','D','D','D','D','D','D','D','D','D'],
'Result': ['Covered', 'Customer Reject', 'Customer Timeout',
           'Dealer Reject','Dealer Timeout','Done','Tied Covered','Tied Done','Tied Traded Away','Traded Away','No RFQ',
           'No RFQ',
           'Covered','Customer Reject','Customer Timeout','Dealer Reject','Dealer Timeout','Done','Tied Covered','Tied Done','Tied Traded Away','Traded Away','No RFQ']}
df = pd.DataFrame.from_dict(df)
print(df)

   Client            Result
0       A           Covered
1       A   Customer Reject
2       A  Customer Timeout
3       B     Dealer Reject
4       B    Dealer Timeout
5       B              Done
6       B      Tied Covered
7       B         Tied Done
8       B  Tied Traded Away
9       B       Traded Away
10      B            No RFQ
11      C            No RFQ
12      D           Covered
13      D   Customer Reject
14      D  Customer Timeout
15      D     Dealer Reject
16      D    Dealer Timeout
17      D              Done
18      D      Tied Covered
19      D         Tied Done
20      D  Tied Traded Away
21      D       Traded Away
22      D            No RFQ

df = pd.crosstab(df.Client,df.Result,margins=True, margins_name='Total_Result_Per_Client').drop('Total_Result_Per_Client')
# Percentage_Priced_Back_Count = (Total - Dealer Reject - Dealer_Timeout) / (Total - Customer Reject)
df['Percentage_Priced_Back_Count'] = ( df['Total_Result_Per_Client'] - df['Dealer Reject'] - df['Dealer Timeout'] ) / ( df['Total_Result_Per_Client'] - df['Customer Reject'] )
# Hit_Rate_Count = (Total – (Covered + Tied Covered) – Customer Reject – Dealer Reject – Dealer Timeout – (Traded Away + Tied Traded Away)) / (Total – Dealer Reject – Dealer Timeout)
df['Hit_Rate_Count'] = ( df['Total_Result_Per_Client'] - df['Covered'] - df['Tied Covered'] - df['Customer Reject'] - df['Dealer Reject'] - df['Dealer Timeout'] - df['Traded Away'] - df['Tied Traded Away']) / \
                       ( df['Total_Result_Per_Client'] - df['Dealer Reject'] - df['Dealer Timeout'] )
df = df.stack().reset_index(name='value')


df2 = pd.DataFrame([[''] * len(df.columns), df.columns], columns=df.columns)
df1= (df.groupby('Client', group_keys=False)
        .apply(lambda d: d.append(df2))
        .iloc[:-2]
        .reset_index(drop=True))
print (df1)

Sample output:

15  Client                        Result     value
16       B                       Covered         0
17       B               Customer Reject         0
18       B              Customer Timeout         0
19       B                 Dealer Reject         1
20       B                Dealer Timeout         1
21       B                          Done         1
22       B                        No RFQ         1
23       B                  Tied Covered         1
24       B                     Tied Done         1
25       B              Tied Traded Away         1
26       B                   Traded Away         1
27       B       Total_Result_Per_Client         8
28       B  Percentage_Priced_Back_Count      0.75
29       B                Hit_Rate_Count       0.5

The case where a value in Result (used in the calculation) for a Client is not present, an error is raised. In the example below Traded Away is removed for Client B and D. The error is raised in the pd.crosstab call:

df = {'Client': ['A', 'A', 'A',
                 'B', 'B', 'B', 'B','B','B','B',
                 'C',
                 'D','D','D','D','D','D','D','D','D','D'],
'Result': ['Covered', 'Customer Reject', 'Customer Timeout',
           'Dealer Reject','Dealer Timeout','Done','Tied Covered','Tied Done','Tied Traded Away','No RFQ',
           'No RFQ',
           'Covered','Customer Reject','Customer Timeout','Dealer Reject','Dealer Timeout','Done','Tied Covered','Tied Done','Tied Traded Away','No RFQ']}
df = pd.DataFrame.from_dict(df)
print(df)

df = pd.crosstab(df.Client,df.Result,margins=True, margins_name='Total_Result_Per_Client').drop('Total_Result_Per_Client')

KeyError: 'Traded Away'

i.e. if the known Result values are not present for any Client add it (with a zero values) before the function call so all calculations can be completed.


Solution

  • I suggest use DataFrame.reindex for add all missing values from list and fill them by 0:

    vals = ['Covered',
    'Customer Reject',
    'Customer Timeout',
    'Dealer Reject',
    'Dealer Timeout',
    'Done',
    'No RFQ',
    'Tied Covered',
    'Tied Done',
    'Tied Traded Away',
    'Traded Away']
    
    df = (pd.crosstab(df.Client,
                     df.Result,
                     margins=True, 
                     margins_name='Total_Result_Per_Client')
            .drop('Total_Result_Per_Client')
            .reindex(vals + ['Total_Result_Per_Client'], axis=1, fill_value=0))
    

    print(df)
    
    Result  Covered  Customer Reject  Customer Timeout  Dealer Reject  \
    Client                                                              
    A             1                1                 1              0   
    B             0                0                 0              1   
    C             0                0                 0              0   
    D             1                1                 1              1   
    
    Result  Dealer Timeout  Done  No RFQ  Tied Covered  Tied Done  \
    Client                                                          
    A                    0     0       0             0          0   
    B                    1     1       1             1          1   
    C                    0     0       1             0          0   
    D                    1     1       1             1          1   
    
    Result  Tied Traded Away  Traded Away  Total_Result_Per_Client  
    Client                                                          
    A                      0            0                        3  
    B                      1            0                        7  
    C                      0            0                        1  
    D                      1            0                       10