I have a question about the summary() method in SparkR when using Random Forest regression. The model building process works fine, but I'm interested in the featureImportances, which are one of the results of the algorithm. I want to store the featureImportances values in a SparkDataFrame to visualize them, but I have no idea how to extract them.
model <- spark.randomForest(x1, x2 , x3, type = "regression", maxDepth = 30, maxBins = 50, numTrees=50, impurity="variance", featureSubsetStrategy="all")
summaryRF <- summary(model)
summaryRF$features:
1. 'x1'
2. 'x2'
3. 'x3'
summaryRF$featureImportances:
'(3,[0,1,2],[0.01324152135,0.0545454422,0.0322122334])'
Is there any way to get the featureImportances values out of the list object and store them in a SparkDataFrame?
Using the collect() method gives the following error:
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘collect’ for signature ‘"character"’
summaryRF is not a SparkDataFrame anymore, that's why collect doesn't work :) summaryRF$featureImportances is a character string (on the Spark side it is a SparseVector, which can't currently (v. 2.1.0) be serialised to and from R, which I guess is why it gets coerced into a string).
So as far as I can tell, you have to extract the relevant bits by manipulating the string directly:
# extract the feature indexes and feature importances strings:
fimpList <- strsplit(gsub("\\(.*?\\[","",summaryRF$featureImportances),"\\],\\[")
# split the index and feature importances strings into vectors (and remove "])" from the last record):
fimp <- lapply(fimpList, function(x) strsplit(gsub("\\]\\)","",x),","))
# it's now a list of lists, but you can make this into a dataframe if you like:
fimpDF <- as.data.frame(do.call(cbind,(fimp[[1]])))
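The columns produced above are unnamed character vectors, so for plotting it helps to convert them to numbers and name them. A minimal sketch, assuming the string always has the sparse-vector shape shown in the question; the helper name parseFeatureImportances and the column names are made up for illustration and are not part of SparkR:

```r
# Hypothetical helper (not a SparkR function): parse a string shaped like
# "(n,[i0,i1,...],[v0,v1,...])" into a local data.frame.
parseFeatureImportances <- function(s) {
  # Pull out the two bracketed groups: indexes first, importances second.
  groups <- regmatches(s, gregexpr("\\[[^]]*\\]", s))[[1]]
  # Assumption: exactly two groups, i.e. the sparse-vector form.
  stopifnot(length(groups) == 2)
  # Drop the surrounding brackets and split each group on commas.
  parts <- strsplit(gsub("\\[|\\]", "", groups), ",")
  data.frame(featureIndex = as.integer(parts[[1]]),
             importance   = as.numeric(parts[[2]]),
             stringsAsFactors = FALSE)
}

# Sample string copied from the question.
s <- "(3,[0,1,2],[0.01324152135,0.0545454422,0.0322122334])"
fimpDF <- parseFeatureImportances(s)
print(fimpDF)
```

If a SparkDataFrame is really needed, passing the local data.frame to SparkR's createDataFrame() in an active session should work, although for a plot the local data.frame is usually enough.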
ETA: by the way, indexes in Spark start at 0, so if you want to merge on the feature index in summaryRF$featureImportances when joining the feature names in summaryRF$features, you have to take that into account:
# build a lookup of feature names with zero-based indexes to match Spark:
featureNameAndIndex <- data.frame(featureName = unlist(summaryRF$features),
                                  featureIndex = 0:(length(summaryRF$features) - 1),
                                  stringsAsFactors = FALSE)
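Putting the two pieces together, the names and importances can be joined on the zero-based index with base R's merge(). The sketch below stands in for summaryRF with a plain list holding the values from the question, so it runs without Spark; that stand-in and the column names are assumptions for illustration:

```r
# Stand-in for summary(model), values copied from the question
# (assumption: the real object exposes the same two fields).
summaryRF <- list(
  features = list("x1", "x2", "x3"),
  featureImportances = "(3,[0,1,2],[0.01324152135,0.0545454422,0.0322122334])"
)

# Parse the string with the same steps as above.
fimpList <- strsplit(gsub("\\(.*?\\[", "", summaryRF$featureImportances), "\\],\\[")
fimp <- lapply(fimpList, function(x) strsplit(gsub("\\]\\)", "", x), ","))

# Convert the two character vectors to typed, named columns.
fimpDF <- data.frame(featureIndex = as.integer(fimp[[1]][[1]]),
                     importance   = as.numeric(fimp[[1]][[2]]),
                     stringsAsFactors = FALSE)

# Feature names with zero-based indexes to match Spark.
featureNameAndIndex <- data.frame(featureName = unlist(summaryRF$features),
                                  featureIndex = 0:(length(summaryRF$features) - 1),
                                  stringsAsFactors = FALSE)

# Join on the index and sort by importance, largest first.
result <- merge(featureNameAndIndex, fimpDF, by = "featureIndex")
result <- result[order(-result$importance), ]
print(result)
```

Since a sparse vector only lists non-zero entries, some features may be missing from the string; in that case merge(..., all.x = TRUE) keeps them with an NA importance instead of dropping them.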