Search code examples
sql-serverrmicrosoft-r

R: how to properly create rx_forest_model object?


I'm trying to do a churn analysis with R and SQL Server 2016. I have uploaded my dataset on my database in a local SQL Server and I did all the preliminary work on this dataset. Well, now I have this function trainModel() which I would use to estimate my random model forest:

trainModel = function(sqlSettings, trainTable) {
sqlConnString = sqlSettings$connString

trainDataSQL <- RxSqlServerData(connectionString = sqlConnString,
                                table = trainTable,
                                colInfo = cdrColInfo)

## Create training formula
labelVar = "churn"
trainVars <- rxGetVarNames(trainDataSQL)
trainVars <- trainVars[!trainVars %in% c(labelVar)]
temp <- paste(c(labelVar, paste(trainVars, collapse = "+")), collapse = "~")
formula <- as.formula(temp)

## Train gradient tree boosting with mxFastTree on SQL data source
library(RevoScaleR)
rx_forest_model <- rxDForest(formula = formula,
                             data = trainDataSQL,
                             nTree = 8,
                             maxDepth = 16,
                             mTry = 2,
                             minBucket = 1,
                             replace = TRUE,
                             importance = TRUE,
                             seed = 8,
                             parms = list(loss = c(0, 4, 1, 0)))

return(rx_forest_model)
}

But when I run the function I get this wrong output:

> system.time({
+   trainModel(sqlSettings, trainTable)
+ })
   user  system elapsed 
   0.29    0.07   58.18 
Warning message:
In tempGetNumObs(numObs) :
  Number of observations not available for this data source. 'numObs' set to 1e6.

And for this warning message, the function trainModel() does not create the object rx_forest_model

Does anyone have any suggestions on how to solve this problem?


Solution

  • After several attempts, I found the reason why the function trainModel() did not function properly. Is not a connection string problem and is not even a data source type issue. The problem is in the syntax of function trainModel().

    It is enough to eliminate from the body of the function the statement:

    return(rx_forest_model)
    

    In this way, the function returns the same warning message, but creates the object rx_forest_model in the correct way.

    So, the correct function is:

    trainModel = function(sqlSettings, trainTable) {
    sqlConnString = sqlSettings$connString
    
    trainDataSQL <- RxSqlServerData(connectionString = sqlConnString,
                                table = trainTable,
                                colInfo = cdrColInfo)
    
    ## Create training formula
    labelVar = "churn"
    trainVars <- rxGetVarNames(trainDataSQL)
    trainVars <- trainVars[!trainVars %in% c(labelVar)]
    temp <- paste(c(labelVar, paste(trainVars, collapse = "+")), collapse = "~")
    formula <- as.formula(temp)
    
    ## Train gradient tree boosting with mxFastTree on SQL data source
    library(RevoScaleR)
    rx_forest_model <- rxDForest(formula = formula,
                                 data = trainDataSQL,
                                 nTree = 8,
                                 maxDepth = 16,
                                 mTry = 2,
                                 minBucket = 1,
                                 replace = TRUE,
                                 importance = TRUE,
                                 seed = 8,
                                 parms = list(loss = c(0, 4, 1, 0)))
    
    }