I am currently following the slides from the following link. I am on slide 121/128 and I would like to know how to replicate the AUC. The author did not explain how to do so (the same applies to slide 124). Secondly, on slide 125 the following code is produced:
bestRound = which.max(as.matrix(cv.res)[,3]-as.matrix(cv.res)[,4])
bestRound
I receive the following error:
Error in as.matrix(cv.res)[, 2] : subscript out of bounds
The data for the following code can be downloaded from here, and I have reproduced the code below for your reference.
Question: How can I produce the AUC as the author does, and why is the subscript out of bounds?
----- Code ------
# Kaggle Winning Solutions
train <- read.csv('train.csv', header = TRUE)
test <- read.csv('test.csv', header = TRUE)
y <- train[, 1]
train <- as.matrix(train[, -1])
test <- as.matrix(test)
train[1, ]
# We want to determine who is more influential than the other
new.train <- cbind(train[, 12:22], train[, 1:11])
train = rbind(train, new.train)
y <- c(y, 1 - y)
x <- rbind(train, test)
# The function wrapper appears to have been lost when copying from the slides;
# reconstructed here so the calcRatio calls below work
calcRatio <- function(dat, i, j, lambda = 1) (dat[,i]+lambda)/(dat[,j]+lambda)
A.follow.ratio = calcRatio(x,1,2)
A.mention.ratio = calcRatio(x,4,6)
A.retweet.ratio = calcRatio(x,5,7)
A.follow.post = calcRatio(x,1,8)
A.mention.post = calcRatio(x,4,8)
A.retweet.post = calcRatio(x,5,8)
B.follow.ratio = calcRatio(x,12,13)
B.mention.ratio = calcRatio(x,15,17)
B.retweet.ratio = calcRatio(x,16,18)
B.follow.post = calcRatio(x,12,19)
B.mention.post = calcRatio(x,15,19)
B.retweet.post = calcRatio(x,16,19)
x = cbind(x[,1:11],
A.follow.ratio,A.mention.ratio,A.retweet.ratio,
A.follow.post,A.mention.post,A.retweet.post,
x[,12:22],
B.follow.ratio,B.mention.ratio,B.retweet.ratio,
B.follow.post,B.mention.post,B.retweet.post)
AB.diff = x[,1:17]-x[,18:34]
x = cbind(x,AB.diff)
train = x[1:nrow(train),]
test = x[-(1:nrow(train)),]
set.seed(1024)
cv.res <- xgb.cv(data = train, nfold = 3, label = y, nrounds = 100, verbose = FALSE,
objective = 'binary:logistic', eval_metric = 'auc')
set.seed(1024)
cv.res = xgb.cv(data = train, nfold = 3, label = y, nrounds = 3000,
objective='binary:logistic', eval_metric = 'auc',
eta = 0.005, gamma = 1,lambda = 3, nthread = 8,
max_depth = 4, min_child_weight = 1, verbose = F,
subsample = 0.8,colsample_bytree = 0.8)
#bestRound: - subscript out of bounds
bestRound <- which.max(as.matrix(cv.res)[,3]-as.matrix(cv.res)[,4])
bestRound
cv.res
cv.res[bestRound,]
set.seed(1024)
bst <- xgboost(data = train, label = y, nrounds = 3000,
objective='binary:logistic', eval_metric = 'auc',
eta = 0.005, gamma = 1,lambda = 3, nthread = 8,
max_depth = 4, min_child_weight = 1,
subsample = 0.8,colsample_bytree = 0.8)
preds <- predict(bst,test,ntreelimit = bestRound)
result <- data.frame(Id = 1:nrow(test), Choice = preds)
write.csv(result,'submission.csv',quote=FALSE,row.names=FALSE)
Many parts of the code make little sense to me, but here is a minimal example of building a model with the provided data:
Data:
train <- read.csv('train.csv', header = TRUE)
y <- train[, 1]
train <- as.matrix(train[, -1])
Model:
library(xgboost)
cv.res <- xgb.cv(data = train, nfold = 3, label = y, nrounds = 100, verbose = FALSE,
objective = 'binary:logistic', eval_metric = 'auc', prediction = T)
To obtain cross-validation predictions one must specify prediction = T when calling xgb.cv.
To obtain best iteration:
it = which.max(cv.res$evaluation_log$test_auc_mean)
best.iter = cv.res$evaluation_log$iter[it]
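As for the subscript-out-of-bounds error: in recent versions of the xgboost package, xgb.cv returns a list, not a matrix or data frame, so as.matrix(cv.res)[, 3] no longer indexes the evaluation columns. The per-round scores live in cv.res$evaluation_log instead. A minimal sketch, assuming cv.res comes from the xgb.cv call above (the column names test_auc_mean / test_auc_std correspond to eval_metric = 'auc'):

```r
# cv.res$evaluation_log is a data.table with one row per boosting round
log <- cv.res$evaluation_log
head(log)  # columns: iter, train_auc_mean, train_auc_std, test_auc_mean, test_auc_std

# The slide's which.max(mean - std) idiom, rewritten against the log:
bestRound <- which.max(log$test_auc_mean - log$test_auc_std)
log[bestRound, ]  # inspect the AUC of the chosen round
```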
To plot the ROC curve on the cross-validation results:
library(pROC)
plot(pROC::roc(response = y,
predictor = cv.res$pred,
levels=c(0, 1)),
lwd=1.5)
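To replicate the author's single AUC number (rather than just the plot), pROC can compute it from the same out-of-fold predictions; a minimal sketch, assuming cv.res was built with prediction = T as above:

```r
library(pROC)

# AUC over the cross-validated predictions returned by xgb.cv
roc_obj <- pROC::roc(response = y, predictor = cv.res$pred, levels = c(0, 1))
pROC::auc(roc_obj)

# Alternatively, read the cross-validated AUC straight off the evaluation log:
max(cv.res$evaluation_log$test_auc_mean)
```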
To get a confusion matrix (assuming 0.5 prob is the threshold):
library(caret)
# newer caret versions require factor inputs rather than numeric vectors
confusionMatrix(factor(ifelse(cv.res$pred <= 0.5, 0, 1)), factor(y))
#output
Reference
Prediction 0 1
0 2020 638
1 678 2164
Accuracy : 0.7607
95% CI : (0.7492, 0.772)
No Information Rate : 0.5095
P-Value [Acc > NIR] : <2e-16
Kappa : 0.5212
Mcnemar's Test P-Value : 0.2823
Sensitivity : 0.7487
Specificity : 0.7723
Pos Pred Value : 0.7600
Neg Pred Value : 0.7614
Prevalence : 0.4905
Detection Rate : 0.3673
Detection Prevalence : 0.4833
Balanced Accuracy : 0.7605
'Positive' Class : 0
That being said, one should aim to tune the hyper-parameters with cross validation, such as eta, gamma, lambda, subsample, colsample_bytree, colsample_bylevel, etc.
The easiest way is to construct a grid search where you use expand.grid
on all combinations of hyper-parameters and lapply over the grid with xgb.cv
as part of a custom function. If you need more detail please comment.
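A sketch of that grid search, assuming train and y are the matrix and label vector defined above (the grid values here are illustrative only, not recommendations):

```r
library(xgboost)

# Illustrative grid; extend with gamma, lambda, subsample, etc. as needed
grid <- expand.grid(eta = c(0.01, 0.1), max_depth = c(4, 6))

results <- lapply(seq_len(nrow(grid)), function(i) {
  p <- grid[i, ]
  cv <- xgb.cv(data = train, label = y, nfold = 3, nrounds = 100,
               objective = 'binary:logistic', eval_metric = 'auc',
               eta = p$eta, max_depth = p$max_depth, verbose = FALSE)
  data.frame(eta = p$eta, max_depth = p$max_depth,
             best_auc = max(cv$evaluation_log$test_auc_mean))
})
do.call(rbind, results)  # pick the row with the highest best_auc
```

Each lapply call returns one row of results, so the final rbind gives a table of hyper-parameter combinations against their best cross-validated AUC.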