I'm trying to do a churn analysis with R and SQL Server 2016.
I have uploaded my dataset on my database in a local SQL Server and I did all the preliminary work on this dataset.
Well, now I have this function trainModel()
which I would use to estimate my random model forest:
trainModel = function(sqlSettings, trainTable) {
sqlConnString = sqlSettings$connString
trainDataSQL <- RxSqlServerData(connectionString = sqlConnString,
table = trainTable,
colInfo = cdrColInfo)
## Create training formula
labelVar = "churn"
trainVars <- rxGetVarNames(trainDataSQL)
trainVars <- trainVars[!trainVars %in% c(labelVar)]
temp <- paste(c(labelVar, paste(trainVars, collapse = "+")), collapse = "~")
formula <- as.formula(temp)
## Train gradient tree boosting with mxFastTree on SQL data source
library(RevoScaleR)
rx_forest_model <- rxDForest(formula = formula,
data = trainDataSQL,
nTree = 8,
maxDepth = 16,
mTry = 2,
minBucket = 1,
replace = TRUE,
importance = TRUE,
seed = 8,
parms = list(loss = c(0, 4, 1, 0)))
return(rx_forest_model)
}
But when I run the function I get this wrong output:
> system.time({
+ trainModel(sqlSettings, trainTable)
+ })
user system elapsed
0.29 0.07 58.18
Warning message:
In tempGetNumObs(numObs) :
Number of observations not available for this data source. 'numObs' set to 1e6.
And for this warning message, the function trainModel()
does not create the object rx_forest_model
Does anyone have any suggestions on how to solve this problem?
After several attempts, I found the reason why the function trainModel()
did not function properly.
Is not a connection string problem and is not even a data source type issue.
The problem is in the syntax of function trainModel()
.
It is enough to eliminate from the body of the function the statement:
return(rx_forest_model)
In this way, the function returns the same warning message, but creates the object rx_forest_model
in the correct way.
So, the correct function is:
trainModel = function(sqlSettings, trainTable) {
sqlConnString = sqlSettings$connString
trainDataSQL <- RxSqlServerData(connectionString = sqlConnString,
table = trainTable,
colInfo = cdrColInfo)
## Create training formula
labelVar = "churn"
trainVars <- rxGetVarNames(trainDataSQL)
trainVars <- trainVars[!trainVars %in% c(labelVar)]
temp <- paste(c(labelVar, paste(trainVars, collapse = "+")), collapse = "~")
formula <- as.formula(temp)
## Train gradient tree boosting with mxFastTree on SQL data source
library(RevoScaleR)
rx_forest_model <- rxDForest(formula = formula,
data = trainDataSQL,
nTree = 8,
maxDepth = 16,
mTry = 2,
minBucket = 1,
replace = TRUE,
importance = TRUE,
seed = 8,
parms = list(loss = c(0, 4, 1, 0)))
}