I have a dataset like so:
> head(training_data)
year month channelGrouping visitStartTime visitNumber timeSinceLastVisit browser
1 2016 October Social 1477775021 1 0 Chrome
2 2016 September Social 1473037945 1 0 Safari
3 2017 July Organic Search 1500305542 1 0 Chrome
4 2017 July Organic Search 1500322111 2 16569 Chrome
5 2016 August Social 1471890172 1 0 Safari
6 2017 May Direct 1495146428 1 0 Chrome
operatingSystem isMobile continent subContinent country source medium
1 Windows 0 Americas South America Brazil referral
2 Macintosh 0 Americas Northern America United States referral
3 Windows 0 Americas Northern America Canada google organic
4 Windows 0 Americas Northern America Canada google organic
5 Macintosh 0 Africa Eastern Africa Zambia referral
6 Android 1 Americas Northern America United States (direct)
isTrueDirect hits pageviews positiveTransaction
1 0 1 1 No
2 0 1 1 No
3 0 5 5 No
4 1 3 3 No
5 0 1 1 No
6 1 6 6 No
> str(training_data)
'data.frame': 1000 obs. of 18 variables:
$ year : int 2016 2016 2017 2017 2016 2017 2016 2017 2017 2016 ...
$ month : Factor w/ 12 levels "January","February",..: 10 9 7 7 8 5 10 3 3 12 ...
$ channelGrouping : chr "Social" "Social" "Organic Search" "Organic Search" ...
$ visitStartTime : int 1477775021 1473037945 1500305542 1500322111 1471890172 1495146428 1476003570 1488556031 1490323225 1480696262 ...
$ visitNumber : int 1 1 1 2 1 1 1 1 1 1 ...
$ timeSinceLastVisit : int 0 0 0 16569 0 0 0 0 0 0 ...
$ browser : chr "Chrome" "Safari" "Chrome" "Chrome" ...
$ operatingSystem : chr "Windows" "Macintosh" "Windows" "Windows" ...
$ isMobile : int 0 0 0 0 0 1 0 1 0 0 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 1 2 3 3 2 4 ...
$ subContinent : chr "South America" "Northern America" "Northern America" "Northern America" ...
$ country : chr "Brazil" "United States" "Canada" "Canada" ...
$ source : chr "" "" "google" "google" ...
$ medium : chr "referral" "referral" "organic" "organic" ...
$ isTrueDirect : int 0 0 0 1 0 1 0 0 0 0 ...
$ hits : int 1 1 5 3 1 6 1 1 2 1 ...
$ pageviews : int 1 1 5 3 1 6 1 1 2 1 ...
$ positiveTransaction: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
I then define my custom RMSLE function using the Metrics package:
rmsleMetric <- function(data, lev = NULL, model = NULL){
out <- Metrics::rmsle(data$obs, data$pred)
names(out) <- c("rmsle")
return (out)
}
Then I define the trainControl:
tc <- trainControl(method = "repeatedcv",
number = 5,
repeats = 5,
summaryFunction = rmsleMetric,
classProbs = TRUE)
My grid search:
tg <- expand.grid(alpha = 0, lambda = seq(0, 1, by = 0.1))
Finally, my model:
penalizedLogit_ridge <- train(positiveTransaction ~ .,
data = training_data,
method = "glmnet",
family = "binomial",
trControl = tc,
tuneGrid = tg)
When I try to run the command above, I get an error:
Something is wrong; all the rmsle metric values are missing:
Min. : NA
1st Qu.: NA
Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :11
Error: Stopping
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Looking at warnings, I find:
1: In Ops.factor(1, actual) : ‘+’ not meaningful for factors
2: In Ops.factor(1, predicted) : ‘+’ not meaningful for factors
repeated 25 times
Since the same thing works fine if I change the metric to AUC using prSummary as my summary function, I don't believe that there are any issues with my data.
So, I believe that my function is wrong but I don't know how to figure out why it is wrong.
Any help is highly appreciated.
Your custom metric is not defined properly. If you use classProbs = TRUE and savePredictions = "final" with trainControl, you will see that there are two columns named after your target classes which hold the predicted probabilities, while the data$pred column holds the predicted class, which cannot be used to calculate the desired metric.
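To make that concrete, here is a small base-R sketch of why the factor column cannot be fed to an RMSLE calculation. The data frame `toy` is a made-up stand-in for the `data` argument a summary function receives when classProbs = TRUE; its values are invented for illustration and are not taken from the question's data.

```r
# Hypothetical stand-in for what a summary function receives:
# obs and pred are factors, plus one probability column per class level.
toy <- data.frame(
  obs  = factor(c("No", "No", "Yes", "No"),  levels = c("No", "Yes")),
  pred = factor(c("No", "No", "Yes", "Yes"), levels = c("No", "Yes")),
  No   = c(0.9, 0.8, 0.3, 0.4),
  Yes  = c(0.1, 0.2, 0.7, 0.6)
)
str(toy)

# RMSLE starts from log(1 + x). Arithmetic on a factor is not defined,
# so R returns NA (with the "'+' not meaningful for factors" warning).
res <- suppressWarnings(1 + toy$pred)
print(res)

# The probability columns are plain numerics and work as expected.
print(log(1 + toy$No))
```

Every resample ends up with NA for the metric this way, which is what triggers the "all the rmsle metric values are missing" error.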
A proper way to define the function would be to get the possible levels and use them to extract the probabilities for one of the classes:
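rmsleMetric <- function(data, lev = NULL, model = NULL){
  lvls <- levels(data$obs)
  # 1 when the observed class is the first level, 0 otherwise
  actual <- ifelse(data$obs == lvls[1], 1, 0)
  # compare against the predicted probability of that same first level
  out <- Metrics::rmsle(actual, data[, lvls[1]])
  names(out) <- c("rmsle")
  return(out)
}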
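A summary function like this can be checked outside of caret by calling it on a hand-built data frame. The sketch below does that with base R only. `rmsle_manual` and `rmsle_summary` are hypothetical names, the RMSLE is written out by hand as sqrt(mean((log(1 + actual) - log(1 + predicted))^2)) so the example does not need the Metrics package, and the numbers in `toy` are invented.

```r
# Hand-written RMSLE so the example runs without the Metrics package.
rmsle_manual <- function(actual, predicted) {
  sqrt(mean((log1p(actual) - log1p(predicted))^2))
}

# Same idea as the summary function above: encode the observed class as 0/1
# and compare it with the predicted probability of the first class level.
rmsle_summary <- function(data, lev = NULL, model = NULL) {
  lvls <- levels(data$obs)
  actual <- ifelse(data$obs == lvls[1], 1, 0)
  predicted <- data[, lvls[1]]
  c(rmsle = rmsle_manual(actual, predicted))
}

# Made-up input shaped like what caret passes in with classProbs = TRUE.
toy <- data.frame(
  obs  = factor(c("No", "No", "Yes", "No"),  levels = c("No", "Yes")),
  pred = factor(c("No", "No", "Yes", "Yes"), levels = c("No", "Yes")),
  No   = c(0.9, 0.8, 0.3, 0.4),
  Yes  = c(0.1, 0.2, 0.7, 0.6)
)

result <- rmsle_summary(toy, lev = levels(toy$obs))
print(result)
```

If the call returns a single named, finite number, the function has the shape caret expects from a summaryFunction.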
Does it work? Trying it on the Sonar data from the mlbench package:
library(caret)
library(mlbench)
data(Sonar)
tc <- trainControl(method = "repeatedcv",
number = 2,
repeats = 2,
summaryFunction = rmsleMetric,
classProbs = TRUE,
savePredictions = "final")
tg <- expand.grid(alpha = 0, lambda = seq(0, 1, by = 0.1))
penalizedLogit_ridge <- train(Class ~ .,
data = Sonar,
method = "glmnet",
family = "binomial",
trControl = tc,
tuneGrid = tg)
208 samples
60 predictor
2 classes: 'M', 'R'
No pre-processing
Resampling: Cross-Validated (2 fold, repeated 2 times)
Summary of sample sizes: 105, 103, 104, 104
Resampling results across tuning parameters:
lambda rmsle
0.0 0.2835407
0.1 0.2753197
0.2 0.2768288
0.3 0.2797847
0.4 0.2827953
0.5 0.2856088
0.6 0.2881894
0.7 0.2905501
0.8 0.2927171
0.9 0.2947169
1.0 0.2965505
Tuning parameter 'alpha' was held constant at a value of 0
rmsle was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0 and lambda = 1.
You can inspect caret::twoClassSummary; it is defined quite similarly.
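One more thing is visible in the printout above: the model was selected "using the largest value", so the lambda with the worst RMSLE was picked. For a custom metric, train maximizes by default. Since RMSLE is an error measure, you probably want the smallest value instead. A sketch of the same call with metric = "rmsle" and maximize = FALSE, assuming the caret, glmnet, mlbench and Metrics packages are installed (the seed is arbitrary):

```r
library(caret)
library(mlbench)  # provides the Sonar data
data(Sonar)

# RMSLE on the predicted probability of the first class level
rmsleMetric <- function(data, lev = NULL, model = NULL) {
  lvls <- levels(data$obs)
  actual <- ifelse(data$obs == lvls[1], 1, 0)
  c(rmsle = Metrics::rmsle(actual, data[, lvls[1]]))
}

tc <- trainControl(method = "repeatedcv", number = 2, repeats = 2,
                   summaryFunction = rmsleMetric, classProbs = TRUE)
tg <- expand.grid(alpha = 0, lambda = seq(0, 1, by = 0.1))

set.seed(1)  # arbitrary, only to make the folds reproducible
fit <- train(Class ~ ., data = Sonar,
             method = "glmnet", family = "binomial",
             metric = "rmsle", maximize = FALSE,  # lower RMSLE is better
             trControl = tc, tuneGrid = tg)

# The chosen lambda should now be the one with the lowest resampled RMSLE
fit$bestTune
```

Naming the metric explicitly also avoids the warning that "Accuracy" was not found in the result set.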