Extracting featureImportances from summary() in SparkR

I have a question about the summary() method in SparkR when used with Random Forest regression. Building the model works fine, but I'm interested in the featureImportances element of the result. I want to store the feature importance values in a SparkDataFrame so I can visualize them, but I have no idea how to extract them.

model <- spark.randomForest(x1, x2, x3, type = "regression", maxDepth = 30, maxBins = 50, numTrees = 50, impurity = "variance", featureSubsetStrategy = "all")

summaryRF <- summary(model)

summaryRF$features:
1. 'x1'
2. 'x2'
3. 'x3'

summaryRF$featureImportances: 
'(3,[0,1,2],[0.01324152135,0.0545454422,0.0322122334])'

Is there any way to get the featureImportances values out of the list object and store them in a SparkDataFrame?

Using the collect() method gives the following error:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘collect’ for signature ‘"character"’

Solution

summaryRF is not a SparkDataFrame but a plain R list, which is why collect() doesn't work :)

summaryRF$featureImportances is a character string. On the Spark side it is a SparseVector, which can't currently (v. 2.1.0) be serialised to and from R, and I guess that is why it gets coerced into a string.

So as far as I can tell, you have to extract the relevant bits by manipulating the string directly:

# extract the feature indexes and feature importances strings:
fimpList <- strsplit(gsub("\\(.*?\\[","",summaryRF$featureImportances),"\\],\\[")

# split the index and feature importances strings into vectors (and remove "])" from the last record):
fimp <- lapply(fimpList, function(x) strsplit(gsub("\\]\\)","",x),","))

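# it's now a list of lists, but you can make this into a data.frame if you like
# (both columns are character strings; convert with as.integer()/as.numeric() as needed):
fimpDF <- as.data.frame(do.call(cbind, fimp[[1]]), stringsAsFactors = FALSE)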
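As an alternative to the step-by-step string manipulation, the same parsing can be sketched as one small function that returns typed columns. This is only a sketch: parseSparseVectorString is a hypothetical helper name, not part of SparkR, and it assumes the string always has the "(size,[indices],[values])" shape shown in the question. The input string is copied from the question so the block runs without a Spark session.

```r
# Hypothetical helper (not part of SparkR): parse the string form of a
# sparse vector, "(size,[indices],[values])", into a typed data.frame.
parseSparseVectorString <- function(s) {
  # Pull out each bracketed group: the indices first, then the values.
  groups <- regmatches(s, gregexpr("\\[[^]]*\\]", s))[[1]]
  stopifnot(length(groups) == 2)
  # Strip the brackets and split each group on commas.
  parts <- strsplit(gsub("\\[|\\]", "", groups), ",")
  data.frame(
    featureIndex = as.integer(parts[[1]]),  # 0-based, as on the Spark side
    importance = as.numeric(parts[[2]]),
    stringsAsFactors = FALSE
  )
}

# String copied from the question; in practice use summaryRF$featureImportances.
fimpString <- "(3,[0,1,2],[0.01324152135,0.0545454422,0.0322122334])"
fimpTyped <- parseSparseVectorString(fimpString)

# With an active SparkR session, the local data.frame can then be converted:
# fimpSDF <- createDataFrame(fimpTyped)
```

The createDataFrame() call is left commented out because it needs a running SparkR session; it is what turns the local data.frame into the SparkDataFrame asked for in the question.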

ETA: by the way, indexes in Spark start at 0, so if you want to join the feature names in summaryRF$features to the feature indexes in summaryRF$featureImportances, you have to take that into account:

featureNameAndIndex <- data.frame(featureName = unlist(summaryRF$features),
                                  featureIndex = seq_along(summaryRF$features) - 1,
                                  stringsAsFactors = FALSE)
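To illustrate the join on the 0-based index, here is a minimal sketch using stand-in values copied from the question rather than a live model. The names features, importances and fimpNamed are illustrative only. The left join with zero-filling assumes that a sparse vector may leave out features whose importance is zero; if every feature is present, that step is a no-op.

```r
# Stand-in values copied from the question; in practice these would come from
# summaryRF$features and the parsed summaryRF$featureImportances string.
features <- list("x1", "x2", "x3")
importances <- data.frame(
  featureIndex = c(0L, 1L, 2L),
  importance = c(0.01324152135, 0.0545454422, 0.0322122334)
)

# Feature names paired with their 0-based Spark index.
featureNameAndIndex <- data.frame(featureName = unlist(features),
                                  featureIndex = seq_along(features) - 1L,
                                  stringsAsFactors = FALSE)

# Left join on the index so that features missing from the sparse vector are kept.
fimpNamed <- merge(featureNameAndIndex, importances,
                   by = "featureIndex", all.x = TRUE)
# Assumption: a feature absent from the sparse vector has importance 0.
fimpNamed$importance[is.na(fimpNamed$importance)] <- 0

# Sort by decreasing importance, ready for plotting.
fimpNamed <- fimpNamed[order(-fimpNamed$importance), ]
```

The result is a local data.frame, which can be plotted directly or passed to createDataFrame() if a SparkDataFrame is needed.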