I have a very large dataset (about 10^7 rows and 1000 columns) and need to cut one of its columns into bins for training/validation separation, with the bins changing based on another column. I am pretty new to Python and am using the function below.
SEGMENT is either A, B or C, and DATE is the column I am cutting (yes, it is a numerical column, I know it looks terrible but it was not my choice), with different bins for different values of SEGMENT.
import pandas as pd

cuts = {
    "A": {"cut": [0, 20240101, 20240801, 20241201, 20250000], "class": ["out", "training", "validation", "out"]},
    "B": {"cut": [0, 20230701, 20240701, 20241201, 20250000], "class": ["out", "training", "validation", "out"]},
    "C": {"cut": [0, 20230701, 20240701, 20250101, 20250201], "class": ["out", "training", "validation", "out"]}
}

def divisions(row):
    rules = cuts[row["SEGMENT"]]
    return pd.cut([row["DATE"]], bins=rules["cut"], labels=rules["class"], right=False, ordered=False)[0]

df["CLASS"] = df.apply(divisions, axis=1)
This seems to work but has been insanely slow even on samples of less than 0.1% of the actual dataset.
How can I improve this?
All I need is this CLASS column, to check if the training and validation datasets show similar behavior. I am not yet doing the actual modeling.
You could use a groupby.apply to handle all rows of a SEGMENT simultaneously:
# one vectorized pd.cut per SEGMENT group, using that group's own bins and labels
df['CLASS'] = (df.groupby('SEGMENT', group_keys=False)['DATE']
                 .apply(lambda x: pd.cut(x, bins=cuts[x.name]['cut'],
                                         labels=cuts[x.name]['class'],
                                         right=False, ordered=False))
               )
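For reproducibility, a small sample frame along these lines can be used as df; the exact values are an assumption inferred from the output below:

import pandas as pd

# hypothetical sample frame matching the example output
df = pd.DataFrame({
    "SEGMENT": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "DATE": [20240102, 20241215, 20231201, 20240102,
             20241215, 20231201, 20240102, 20241215],
})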
Example output:
  SEGMENT      DATE       CLASS
0       A  20240102    training
1       A  20241215         out
2       A  20231201         out
3       B  20240102    training
4       B  20241215         out
5       C  20231201    training
6       C  20240102    training
7       C  20241215  validation
Testing on 10K rows, this is ~1000x faster than the original approach:
# groupby
3.68 ms ± 76.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# original
3.34 s ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Testing on 1M rows, this takes about half a second:
529 ms ± 277 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
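If the groupby.apply ever becomes a bottleneck at the full 10^7 rows, an equivalent variant (a sketch only, not benchmarked here) is to loop over the handful of segments and assign each slice with a single vectorized pd.cut through a boolean mask:

import pandas as pd

# sketch: one vectorized pd.cut per segment, written back through a mask
df["CLASS"] = pd.Series(index=df.index, dtype="object")
for seg, rules in cuts.items():
    mask = df["SEGMENT"].eq(seg)
    df.loc[mask, "CLASS"] = pd.cut(df.loc[mask, "DATE"],
                                   bins=rules["cut"],
                                   labels=rules["class"],
                                   right=False, ordered=False)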