Search code examples
pythonpandasdataframepivot-tablecrosstab

Cross tabulation between two sets of columns in a DataFrame


I have tried to get crosstab of data specified by slice.

But something wrong in syntax.

data.csv like the following

ia,ib,ic,id,ie,if,ig  
a,0,0,0,e,0,g         
0,b,0,0,e,f,0         
0,0,c,d,0,f,g         

And then do python3 test.py like the following

import pandas as pd
import enum

df = pd.read_csv('data.csv')

class Slices(enum.Enum):
    first = slice(0, 2)
    second = slice(4, 6)

def getCrosstab(*args):
    cols1 = []
    cols1.append(df.iloc[:, args[0].value])
    cols2 = []
    cols2.append(df.iloc[:, args[1].value])
    print( pd.crosstab(cols1, cols2) )

if __name__ == '__main__':
    getCrosstab(Slices.first, Slices.second)

Expected result:

col2 ie if ig
col1
ia   1  0  1
ib   1  1  0
ic   0  1  1

But I had an error:
ValueError: Shape of passed values is (2, 2), indices imply (2, 3)

I can not fully understand the meaning of this error.

Please give me your guidance.


Solution

  • melt twice, once for each set of columns, and then call crosstab:

    u = (df.melt(['ia', 'ib', 'ic'], var_name='C', value_name='D')
           .melt(['C', 'D'], var_name='A', value_name='B')
           .query("B != '0' and D != '0'"))
    
    pd.crosstab(u.A, u.C)
    
    C   id  ie  if  ig
    A                 
    ia   0   1   0   1
    ib   0   1   1   0
    ic   1   0   1   1
    

    def crosstab_for(df, sliceA, sliceB):
        u = (df.reindex(df.columns[sliceA] | df.columns[sliceB], axis=1)
               .melt(df.columns[sliceA], var_name='C', value_name='D')
               .melt(['C', 'D'], var_name='A', value_name='B')
               .query("B != '0' and D != '0'"))
    
        return pd.crosstab(u.A, u.C)
    
    crosstab_for(df, slice(0, 3), slice(4, 7))
    
    C   ie  if  ig
    A             
    ia   1   0   1
    ib   1   1   0
    ic   0   1   1