Search code examples
pythonpandas

How can I make cuts into a numerical column based on a categorical column


I have a very large dataset (about 10^7 rows and 1000 columns) and need to make cuts into one of the columns, for trining/validation separation, with the bins changing based on another column. I am pretty new to python and am using this function:

SEGMENT is either A, B or C, and DATE is what I am cutting (yes, it is a numerical column, I know it looks terrible but it was not my choice), with different bins for different values of SEGMENT.

cuts = {
    "A": {"cut":[0,20240101,20240801,20241201,20250000], "class":["out", "training", "validation", "out"]},
    "B": {"cut":[0,20230701,20240701,20241201,20250000], "class":["out", "training", "validation", "out"]},
    "C": {"cut":[0,20230701,20240701,20250101,20250201], "class":["out", "training", "validation", "out"]}
}
def divisions(row):
    rules = cuts[row["SEGMENT"]]
    return pd.cut([row["DATE"]], bins=rules["cut"], labels=rules["class"], right=False, ordered=False)[0]

df["CLASS"] = df.apply(divisions, axis=1)

This seems to work but has been insanely slow even on samples of less than 0.1% of the actual dataset.

How can I improve this?

All I need is this CLASS column, to check if the training and validation datasets show similar behaviors. I am not yet doing the actual modeling.


Solution

  • You could use a groupby.apply to handle all rows of a SEGMENT simultaneously:

    df['CLASS'] = (df.groupby('SEGMENT', group_keys=False)['DATE']
                     .apply(lambda x: pd.cut(x, bins=cuts[x.name]['cut'],
                                             labels=cuts[x.name]['class'],
                                             ordered=False))
                  )
    

    Example output:

      SEGMENT      DATE       CLASS
    0       A  20240102    training
    1       A  20241215         out
    2       A  20231201         out
    3       B  20240102    training
    4       B  20241215         out
    5       C  20231201    training
    6       C  20240102    training
    7       C  20241215  validation
    

    Testing on 10K rows, this is ~1000x faster that the original approach:

    # groupby
    3.68 ms ± 76.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    # original
    3.34 s ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Testing on 1M rows, this takes about half a second:

    529 ms ± 277 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)