Search code examples
pythonpython-3.xpandascrosstabreindex

Pandas crosstab - How to print rows/columns for values that don't exist in the data sets?


I am a beginner with pandas at best and I couldn't find a solution to this problem anywhere.

Let's say I have two variables: variable1, variable2.

They can have the following predefined values:

variable1 = ['1', '4', '9', '15', '20']
variable2 = ['2', '5', '6']

However, the current data set only has some of those values:

df = pd.DataFrame({variable1 : ['1', '9', '20'],
                  variable2 : ['2', '2', '6']})

When crossing the variables:

pd.crosstab(df.variable1, df.variable2)

I get:

variable2  2  6
variable1      
1          1  0
20         0  1
9          1  0

Is there a way to put all the possible categorical values in both the columns and the rows even if the current data set does not have all of them? The goal is to have a table of the same size when running the script with an updated data set which may have the values that were not present in the previous data set.


Solution

  • Use DataFrame.reindex:

    variable1 = ['1', '4', '9', '15', '20']
    variable2 = ['2', '5', '6']
    
    
    df = pd.DataFrame({'variable1' : ['1', '9', '20'],
                      'variable2' : ['2', '2', '6']})
    
    print (df)                  
      variable1 variable2
    0         1         2
    1         9         2
    2        20         6
    
    df = pd.crosstab(df.variable1, df.variable2)
    df = df.reindex(index=variable1, columns=variable2, fill_value=0)
    print (df)
    variable2  2  5  6
    variable1         
    1          1  0  0
    4          0  0  0
    9          1  0  0
    15         0  0  0
    20         0  0  1
    

    from collections import OrderedDict
    
    
    valuelabels = OrderedDict([('S8', [['1', 'Medical oncology'], 
                                       ['2', 'Hematology'], 
                                       ['3', 'Hematology/Oncology'], 
                                       ['4', 'Other']]), 
                               ('S9', [['1', 'Academic / Teaching Hospital'], 
                                       ['2', 'Community-Based Solo Private Practice'], 
                                       ['3', 'Community-Based Group Private Practice (record practice size )'], ['4', 'Community Non-Teaching Hospital'], 
                                       ['5', 'Comprehensive Cancer Center'], 
                                       ['6', 'Other (specify)']])])
    #print (valuelabels)
    
    
    df = pd.DataFrame({'variable1' : ['1', '2', '4'],
                      'variable2' : ['2', '3', '1']})
    
    table = pd.crosstab(df.variable1, df.variable2)      
    print (table)
    variable2  1  2  3
    variable1         
    1          0  1  0
    2          0  0  1
    4          1  0  0
    
    d1 = dict(list(zip([a[0] for a in valuelabels['S8']], [a[1] for a in valuelabels['S8']])))
    print (d1)
    {'4': 'Other', '1': 'Medical oncology', '2': 'Hematology', '3': 'Hematology/Oncology'}
    
    d2 = dict(list(zip([a[0] for a in valuelabels['S9']], [a[1] for a in valuelabels['S9']])))
    print (d2)
    {'1': 'Academic / Teaching Hospital', 
    '3': 'Community-Based Group Private Practice (record practice size )', 
    '4': 'Community Non-Teaching Hospital', 
    '6': 'Other (specify)', 
    '2': 'Community-Based Solo Private Practice', 
    '5': 'Comprehensive Cancer Center'}
    
    table = table.reindex(index=[a[0] for a in valuelabels['S8']], 
                          columns=[a[0] for a in valuelabels['S9'], fill_value=0)
    print (table)
    variable2  1  2  3  4  5  6
    variable1                  
    1          0  1  0  0  0  0
    2          0  0  1  0  0  0
    3          0  0  0  0  0  0
    4          1  0  0  0  0  0
    
    table.index = table.index.to_series().map(d1).values
    table.columns = table.columns.to_series().map(d2).values
    
    print (table)
                         Academic / Teaching Hospital  \
    Medical oncology                                0   
    Hematology                                      0   
    Hematology/Oncology                             0   
    Other                                           1   
    
                         Community-Based Solo Private Practice  \
    Medical oncology                                         1   
    Hematology                                               0   
    Hematology/Oncology                                      0   
    Other                                                    0   
    
                         Community-Based Group Private Practice (record practice size )  \
    Medical oncology                                                     0                
    Hematology                                                           1                
    Hematology/Oncology                                                  0                
    Other                                                                0                
    
                         Community Non-Teaching Hospital  \
    Medical oncology                                   0   
    Hematology                                         0   
    Hematology/Oncology                                0   
    Other                                              0   
    
                         Comprehensive Cancer Center  Other (specify)  
    Medical oncology                               0                0  
    Hematology                                     0                0  
    Hematology/Oncology                            0                0  
    Other                                          0                0