Search code examples
pandasmulti-index

Calculating multiindex series / dataframes with conditions


I have a multiindex Sries that I would like to make calculations on it. I'm not familiar with multiindex dataframes, some advice will be much appreciated.

d = {'c1':['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd'], 'c2':['False', 'True', 'True', 'True', 'True', 'True', 'False', 'False']}
dd = pd.DataFrame(data=d)
dd.groupby('c1')['c2'].value_counts(normalize=True)
c1  c2   
a   False    0.500000
    True     0.500000
b   True     1.000000
c   True     0.666667
    False    0.333333
d   False    1.000000

I'd like to take the max number with True in the c2 column divides by other numbers with True. For the above example, it will be 1 / 0.500000 and 1 / 0.666667

Expected output

enter image description here


Solution

  • Basically, all you have to do is use the .xs method on your value counts.

    Take:

    series = dd.groupby('c1')['c2'].value_counts(normalize=True)
    

    To get the number of Trues by c1 you can do

    true_counts = series.xs('True', level=1)
    

    Which returns something like this:

    c1
    a    0.500000
    b    1.000000
    c    0.666667
    Name: c2, dtype: float64
    

    Now you can do all your operations on a single index Series. Start by taking the max_index and max_value, which, in this case (because it's normalized), could be achieved by true_counts.loc[true_counts == 1] and then retrieving the values and index, or:

    max_index = true_counts.index[true_counts.argmax()]
    max_val = true_counts[max_index]
    

    And then, doing the calculation over the remaining rows:

    true_counts_dropped = true_counts.drop(max_index)
    ans = max_val / true_counts_dropped
    

    To make the index look like b/a, b/c, you can do:

    ans.index = [max_index + '/' + idx for idx in ans.index]
    

    And the final result:

    >>> ans
    b/a    2.0
    b/c    1.5
    Name: c2, dtype: float64