Search code examples
pythonpandasnumpydataframecrosstab

Python: Add a column to a crosstab based on value and condition


I have a Pandas crosstab created from NumPy arrays of zeros and ones. After creating crosstab by:ct = pd.crosstab(index=np.array(labels), columns=np.array(cols) I have a crosstab that looks like:

col_0     0     1 
row_0
---------------------
0        618    53   
1        53     79

I wanted to add a column in here for error percentages on false positives and false negatives. I added a new column to the crosstab using ct[2] = ct.apply(lambda row: row[0] + row[1], axis = 1) which added a column of sum of the first two columns. I would like to add a fourth column that would calculate ct[0][0] divided by the sum for the first row, and ct[1][1] divided by the sum for the second row, and add that column to ct[3]. I tried:

for i,y in ct.iterrows():
   ct[3] = ct.apply(lambda y: y[i]/y[2] , axis = 1)

but since it executes twice, it will overwrite the column with the 2nd result. I tried using ct[3][i] to no avail. Using lambda y: y[1] if y==0 else y[0] doesn't work. What is the method for adding a new column to the dataframe / crosstab, with a condition-based calculation on existing values?


Solution

  • Try this using numpy and pandas:

    ct[3] = (ct * np.eye(2)).sum() / ct.sum(1)
    

    or

    ct[3] = np.diag(ct) / ct.sum(1)  
    

    Output:

    col_0    0   1         3
    row_0                   
    0      618  53  0.921013
    1       53  79  0.598485