I have a very large dataset (about 10^7 rows and 1000 columns) and need to cut one of its columns into bins for training/validation separation, with the bins changing based on another column. I am pretty new to Python and am using the function below.
SEGMENT is either A, B or C, and DATE is the column I am cutting (yes, it is a numerical column, I know it looks terrible but it was not my choice), with different bins for different values of SEGMENT.
import pandas as pd

cuts = {
    "A": {"cut": [0, 20240101, 20240801, 20241201, 20250000], "class": ["out", "training", "validation", "out"]},
    "B": {"cut": [0, 20230701, 20240701, 20241201, 20250000], "class": ["out", "training", "validation", "out"]},
    "C": {"cut": [0, 20230701, 20240701, 20250101, 20250201], "class": ["out", "training", "validation", "out"]}
}

def divisions(row):
    rules = cuts[row["SEGMENT"]]
    return pd.cut([row["DATE"]], bins=rules["cut"], labels=rules["class"], right=False, ordered=False)[0]

df["CLASS"] = df.apply(divisions, axis=1)
This seems to work but has been insanely slow even on samples of less than 0.1% of the actual dataset.
How can I improve this?
All I need is this CLASS column, to check if the training and validation datasets show similar behavior. I am not yet doing the actual modeling.
You could use a groupby.apply to handle all rows of a SEGMENT simultaneously:
# one vectorized pd.cut per SEGMENT group, using that group's own bins and labels
df['CLASS'] = (df.groupby('SEGMENT', group_keys=False)['DATE']
                 .apply(lambda x: pd.cut(x, bins=cuts[x.name]['cut'],
                                         labels=cuts[x.name]['class'],
                                         right=False, ordered=False))
               )
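For reproducibility, a small sample frame along these lines can be used as df; the exact values are an assumption inferred from the output below:

import pandas as pd

# hypothetical sample frame matching the example output
df = pd.DataFrame({
    "SEGMENT": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "DATE": [20240102, 20241215, 20231201, 20240102,
             20241215, 20231201, 20240102, 20241215],
})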
Example output:
  SEGMENT      DATE       CLASS
0       A  20240102    training
1       A  20241215         out
2       A  20231201         out
3       B  20240102    training
4       B  20241215         out
5       C  20231201    training
6       C  20240102    training
7       C  20241215  validation
Testing on 10K rows, this is ~1000x faster than the original approach:
# groupby
3.68 ms ± 76.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# original
3.34 s ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Testing on 1M rows, this takes about half a second:
529 ms ± 277 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
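If the groupby.apply ever becomes a bottleneck at the full 10^7 rows, an equivalent variant (a sketch only, not benchmarked here) is to loop over the handful of segments and assign each slice with a single vectorized pd.cut through a boolean mask:

import pandas as pd

# sketch: one vectorized pd.cut per segment, written back through a mask
df["CLASS"] = pd.Series(index=df.index, dtype="object")
for seg, rules in cuts.items():
    mask = df["SEGMENT"].eq(seg)
    df.loc[mask, "CLASS"] = pd.cut(df.loc[mask, "DATE"],
                                   bins=rules["cut"],
                                   labels=rules["class"],
                                   right=False, ordered=False)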