Search code examples
pythonpandaspandas-groupbyseries

Remove certain rows in multiindex series with a condition


i want to remove all the rows that equal to 0, so only rows that equal to 4, 1, 2 and 3 remain from the series below

my_series = df.groupby(by=['Continent','bins']).size()

my_series output:

Continent      bins            
Asia           (2.212, 15.753]     4
               (15.753, 29.227]    1
               (29.227, 42.701]    0
               (42.701, 56.174]    0
               (56.174, 69.648]    0
Australia      (2.212, 15.753]     1
               (15.753, 29.227]    0
               (29.227, 42.701]    0
               (42.701, 56.174]    0
               (56.174, 69.648]    0
Europe         (2.212, 15.753]     1
               (15.753, 29.227]    3
               (29.227, 42.701]    2
               (42.701, 56.174]    0
               (56.174, 69.648]    0
North America  (2.212, 15.753]     1
               (15.753, 29.227]    0
               (29.227, 42.701]    0
               (42.701, 56.174]    0
               (56.174, 69.648]    1
South America  (2.212, 15.753]     0
               (15.753, 29.227]    0
               (29.227, 42.701]    0
               (42.701, 56.174]    0
               (56.174, 69.648]    1
dtype: int64

the result type must be series, not dataframe.

i tried "dataframe-like" approach like

df = df[df['bins']!=0] 

but surely this didn't work out.

and same failure here(apparently it's not applyable to a multiindex series)

my_series.difference(list)

Solution

  • Because bins has the category dtype when you use groupby, Pandas generates all combinations of bins, that's why you have 0 in your result.

    From documentation

    observed bool, default False

    This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.

    First solution based on your code, use observed=True:

    my_series = df.groupby(['Continent','bins'], observed=True).size()
    

    Second solution, more straightforward, use value_counts:

    my_series = df.value_counts(['Continent', 'bins'])