I am working on a classification problem whose evaluation metric is ROC AUC. So far I have tried XGBoost (xgb) with different parameters. Here is the function I used to sample the data. You can find the relevant notebook here (Google Colab).
def get_data(x_train, y_train, shuffle=False):
    if shuffle:
        total_train = pd.concat([x_train, y_train], axis=1)
        # generate n random number in range(0, len(data))
        n = np.random.randint(0, len(total_train), size=len(total_train))
        x_train = total_train.iloc[n]
        y_train = total_train.iloc[n]['is_pass']
        x_train.drop('is_pass', axis=1, inplace=True)
        # keep the first 1000 rows as test data
        x_test = x_train.iloc[:1000]
        # keep the 1000 to 10000 rows as validation data
        x_valid = x_train.iloc[1000:10000]
        x_train = x_train.iloc[10000:]
        y_test = y_train[:1000]
        y_valid = y_train[1000:10000]
        y_train = y_train.iloc[10000:]
        return x_train, x_valid, x_test, y_train, y_valid, y_test
    else:
        # keep the first 1000 rows as test data
        x_test = x_train.iloc[:1000]
        # keep the 1000 to 10000 rows as validation data
        x_valid = x_train.iloc[1000:10000]
        x_train = x_train.iloc[10000:]
        y_test = y_train[:1000]
        y_valid = y_train[1000:10000]
        y_train = y_train.iloc[10000:]
        return x_train, x_valid, x_test, y_train, y_valid, y_test
Here are the two outputs that I get after running on shuffled and non-shuffled data:
AUC with shuffling: 0.9021756235738453
AUC without shuffling: 0.8025162142685565
Can you figure out what the issue is here?
The problem is in your implementation of shuffling. np.random.randint generates random integers, but they can repeat, so you are sampling rows with replacement. As a result, the same rows appear in both your training set and your test/validation sets, which leaks training data into the evaluation and inflates the AUC. You should use np.random.permutation instead (and consider using np.random.seed to make the result reproducible).
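To illustrate the difference, here is a minimal sketch, not the notebook's actual code. The helper name split_shuffled, its parameters, and the toy data are assumptions made up for this example; only the 1000/9000 split sizes and the 'is_pass' column name come from the question. It uses np.random.default_rng, which is NumPy's Generator-based counterpart of np.random.seed plus np.random.permutation.

```python
import numpy as np
import pandas as pd


def split_shuffled(x, y, n_test=1000, n_valid=9000, seed=0):
    """Shuffle rows without replacement, then split into train/valid/test.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    rng = np.random.default_rng(seed)   # seeded for reproducibility
    order = rng.permutation(len(x))     # every row position appears exactly once
    x, y = x.iloc[order], y.iloc[order]
    stop = n_test + n_valid
    x_test, y_test = x.iloc[:n_test], y.iloc[:n_test]
    x_valid, y_valid = x.iloc[n_test:stop], y.iloc[n_test:stop]
    x_train, y_train = x.iloc[stop:], y.iloc[stop:]
    return x_train, x_valid, x_test, y_train, y_valid, y_test


# Toy data standing in for the real dataset (made-up feature column).
n_rows = 20000
x = pd.DataFrame({"feature": np.arange(n_rows)})
y = pd.Series(np.arange(n_rows) % 2, name="is_pass")

# Sampling with replacement, as in the question: indices repeat, so rows
# drawn for the first 10000 positions also show up in the remaining ones.
rng = np.random.default_rng(0)
idx = rng.integers(0, n_rows, size=n_rows)
leaked = len(set(idx[:10000]) & set(idx[10000:]))

# Permutation: no row is shared between the training and held-out splits.
x_train, x_valid, x_test, y_train, y_valid, y_test = split_shuffled(x, y)
held_out = set(x_valid.index) | set(x_test.index)
shared = len(set(x_train.index) & held_out)

print("rows shared with replacement:", leaked)
print("rows shared with permutation:", shared)
```

With the permutation-based split the shared count is zero by construction, so the shuffled and non-shuffled AUC values should be much closer to each other.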
Another note: there is a very large difference in performance between the training set and the validation/test sets (training shows an almost perfect ROC AUC), which indicates overfitting. My guess is that this is caused by the maximum tree depth you allow (14), which is too high for the size of the dataset you have (~60K rows).
P.S. Thanks for sharing the Colaboratory link. I was not aware of it, but it is very useful.