Tags: python, pandas, dataframe, dummy-variable

Frequency of a small subset of values in a Large Pandas Dataframe


This question provided an example of how a per-row frequency count can be obtained from a pandas DataFrame using pd.get_dummies plus aggregation. However, this doesn't scale if you only want a small subset of terms from a very large DataFrame.

Consider the same example:

import pandas as pd

df = pd.DataFrame({'ID': ['xyz_1', 'xyz_2', 'xyz_3', 'xyz_4', 'xyz_400'],
                   'class1': ['yes_1', 'no_2', pd.NA, 'no_3', 'no_7'],
                   'class2': ['no_8', 'yes_15', 'yes_16', 'no_18', 'no_21'],
                   'class3': [pd.NA, 'no_51', 'yes_1', 'no_3', 'no_4'],
                   'class100': ['yes_3', 'no_5', pd.NA, 'yes_6', 'no_7']})

        ID class1  class2 class3 class100
0    xyz_1  yes_1    no_8   <NA>    yes_3
1    xyz_2   no_2  yes_15  no_51     no_5
2    xyz_3   <NA>  yes_16  yes_1     <NA>
3    xyz_4   no_3   no_18   no_3    yes_6
4  xyz_400   no_7   no_21   no_4     no_7

Instead of coming from the set {yes, no}, the values can be any of a large number of categorical variables. If you only want frequency counts for yes_1 and no_51, this means a ton of extra computation.
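
For context, the get_dummies route from the linked question looks roughly like this (a sketch, not the exact code from that answer; dummies and row_counts are illustrative names):

dummies = pd.get_dummies(df.set_index('ID'), prefix='', prefix_sep='')
# Every distinct value in every class column becomes its own indicator
# column; summing columns that share a name gives the per-row counts.
row_counts = dummies.T.groupby(level=0).sum().T

On a wide frame with thousands of distinct values, almost all of those indicator columns are wasted work when only two of them matter.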

The best solution I've found so far is pre-processing all other values into NA:

keep = ['yes_1', 'no_51']
df[~df.isin(keep)] = pd.NA

     ID class1 class2 class3 class100
0  <NA>  yes_1   <NA>   <NA>     <NA>
1  <NA>   <NA>   <NA>  no_51     <NA>
2  <NA>   <NA>   <NA>  yes_1     <NA>
3  <NA>   <NA>   <NA>   <NA>     <NA>
4  <NA>   <NA>   <NA>   <NA>     <NA>
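
If the intent after this masking step is to feed the thinned-out frame through the same dummy-and-aggregate pass, that continuation might look like the following (an assumption about the intent, not code from the question):

dummies = pd.get_dummies(df, prefix='', prefix_sep='')
# Only 'yes_1' and 'no_51' survive the masking, so the one-hot step now
# produces a handful of columns rather than one per distinct value.
row_counts = dummies.T.groupby(level=0).sum().T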

This is still glacially slow for large DataFrames on the order of 1 million entries. Is there a way to scale this better?


Solution

  • In the linked question, a much more performant solution would be:

    df.apply(lambda row: row.value_counts(dropna=False), axis=1).fillna(0)
    

    This might already be sufficient for your purposes; however, if you're only after a couple of values, it might be faster to just:

    counts = pd.Series({key: (df == key).values.sum() for key in ['yes_1', 'no_51']})
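
    The snippet above returns one total per key over the whole frame. If per-row frequencies are what you need (as in the original question), the same vectorised comparison also works row by row; this is a sketch along those lines, with keys and row_counts as illustrative names:

    keys = ['yes_1', 'no_51']
    # One column of per-row match counts per key; the ID strings never
    # equal these keys, so the ID column contributes nothing.
    row_counts = pd.DataFrame({key: (df == key).sum(axis=1) for key in keys})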