
Pythonic Way of Reducing Factor Levels in Large Dataframe


I am attempting to reduce the number of factor levels within a column in a pandas dataframe so that any factor whose total instances, as a proportion of all rows in the column, fall below a defined threshold (default 1%) is bucketed into a new factor labeled 'Other'. Below is the function I am using to accomplish this task:

def condenseMe(df, column_name, threshold = 0.01, newLabel = "Other"):
    # Proportion of rows taken up by each level of the column
    valDict = dict(df[column_name].value_counts() / len(df[column_name]))
    # Levels whose proportion falls below the threshold
    toCondense = [v for v in valDict.keys() if valDict[v] < threshold]
    # Never condense the 'Missing' level
    if 'Missing' in toCondense:
        toCondense.remove('Missing')
    # Replace every rare level with the new label, row by row
    df[column_name] = df[column_name].apply(lambda x: newLabel if x in toCondense else x)
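
For example, on a small made-up dataframe (the data and threshold here are purely illustrative), the intended behavior looks like this:

import pandas as pd

# 'b' and 'c' each account for only 10% of the rows, so with threshold=0.2
# they are bucketed into 'Other' while 'a' is left unchanged
df = pd.DataFrame({'cat': ['a'] * 8 + ['b', 'c']})
condenseMe(df, 'cat', threshold=0.2)
print(df['cat'].value_counts())
# a        8
# Other    2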

The issue I am running into is that I am working with a large dataset (~18 million rows) and am attempting to use this function on a column with more than 10,000 levels. Because of this, executing the function on this column takes a very long time to complete. Is there a more pythonic way to reduce the number of factor levels that will execute faster? Any help would be much appreciated!


Solution

  • You can do this with a combination of groupby, transform, and count:

    def condenseMe(df, col, threshold = 0.01, newLabel="Other"):
        # Per-row normalized counts: the proportion of all rows that share
        # the value found in each row of the column
        counts = df[[col]].groupby(col)[col].transform('count') / len(df)
        # Create a 1D mask based on the threshold (ignoring "Missing")
        mask = (counts < threshold) & (df[col] != 'Missing')

        # Assign the masked rows the new label; .loc avoids chained indexing,
        # which is unreliable and raises SettingWithCopyWarning
        df.loc[mask, col] = newLabel
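
On the same made-up dataframe as in the question, this version produces the same result, but the counting is done by vectorized groupby/transform calls instead of a per-row apply, which should scale much better to millions of rows:

    df = pd.DataFrame({'cat': ['a'] * 8 + ['b', 'c']})
    condenseMe(df, 'cat', threshold=0.2)
    # df['cat'] now holds 8 rows of 'a' and 2 rows of 'Other'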