I am attempting to reduce the number of factor levels within a column of a pandas DataFrame: any level whose total instances, as a proportion of all column rows, fall below a defined threshold (default 1%) should be bucketed into a new level labeled 'Other'. Below is the function I am using to accomplish this task:
    def condenseMe(df, column_name, threshold = 0.01, newLabel = "Other"):
        valDict = dict(df[column_name].value_counts() / len(df[column_name]))
        toCondense = [v for v in valDict.keys() if valDict[v] < threshold]
        if 'Missing' in toCondense:
            toCondense.remove('Missing')
        df[column_name] = df[column_name].apply(lambda x: newLabel if x in toCondense else x)
The issue I am running into is that I am working with a large dataset (~18 million rows) and applying this function to a column with more than 10,000 levels, so it takes a very long time to complete. Is there a more pythonic way to reduce the number of factor levels that will execute faster? Any help would be much appreciated!
You can do this with a combination of groupby, transform, and count:
    def condenseMe(df, col, threshold = 0.01, newLabel="Other"):
        # Create a Series of normalized value counts, aligned with df's index
        counts = df[[col]].groupby(col)[col].transform('count') / len(df)
        # Create a boolean mask based on the threshold (ignoring "Missing")
        mask = (counts < threshold) & (df[col] != 'Missing')
        # Assign the masked rows the new label; use .loc rather than
        # df[col][mask] = ..., which is chained indexing and may silently
        # write to a copy (SettingWithCopyWarning)
        df.loc[mask, col] = newLabel
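For example, on a small made-up frame (hypothetical data, threshold raised to 2% so the effect is visible on 100 rows) the rare level is relabeled while 'Missing' is left alone:

```python
import pandas as pd

def condenseMe(df, col, threshold=0.01, newLabel="Other"):
    # Normalized per-level counts, broadcast back onto each row
    counts = df[[col]].groupby(col)[col].transform('count') / len(df)
    mask = (counts < threshold) & (df[col] != 'Missing')
    df.loc[mask, col] = newLabel

# 100 rows: 'a' and 'b' are common, 'c' and 'Missing' each appear once (1%)
df = pd.DataFrame({'cat': ['a'] * 49 + ['b'] * 49 + ['c'] + ['Missing']})
condenseMe(df, 'cat', threshold=0.02)
print(sorted(df['cat'].unique()))  # ['Missing', 'Other', 'a', 'b']
```

Because the whole operation is a vectorized count plus a single boolean-masked assignment, it avoids the per-row Python lambda and the `x in toCondense` list scan entirely.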