r lapply rpart

Data manipulation makes lapply not work


EDIT: OK, it has something to do with the data type of data.train.filtered.

The filtered list gets created from data.train.raw, which works fine with any of the lapply calls below. The weird thing is that I can't find out how the two differ...

data.selectedFeatures <- sapply(data.train.raw, FUN = sf.getGoodFeaturesVector, treshold = 5)

data.train.filtered <- lapply(seq(1, 8), FUN = function(i) sf.filterFeatures(data.train.raw[[i]], data.selectedFeatures[[i]]))

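# Returns TRUE if a feature should be kept: non-numeric features always pass;
# numeric ones need at least `treshold` values above zero and `treshold` others.
st.testFeature <- function(featureVector, treshold) {
  if (!is.numeric(featureVector)) {return(TRUE)}

  numberOfNonZero <- sum(featureVector > 0)
  numberOfZero <- length(featureVector) - numberOfNonZero

  return(min(numberOfNonZero, numberOfZero) >= treshold)
}

# Returns a logical vector with one entry per column of `data`.
sf.getGoodFeaturesVector <- function(data, treshold) {

  # Pass the threshold as a named argument (`=`), not via assignment (`<-`).
  selectedFeatures <- sapply(data, FUN = st.testFeature, treshold = treshold)
  whitelistedFeatures <- names(data) %in% c("id", "tp")

  return(selectedFeatures | whitelistedFeatures)

}

# Keeps only the columns flagged in `selectedFeatures`.
sf.filterFeatures <- function(data, selectedFeatures) {
  return(data[, selectedFeatures])
}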

Any idea what I am doing wrong when manipulating the data that causes the subsequent lapply not to work?

Original post:

I have a list of datasets called data.train.filtered and want to get a list of models (for predicting a feature called tp) trained by rpart on them. The easiest solution I could think of was using lapply, but it doesn't work for some reason.

lapply(data.train.filtered, function(dta) rpart(tp ~ ., data = dta))

Error in terms.formula(formula, data = data) : 
  '.' in formula and no 'data' argument 

The problem is probably not in the data, as using it for just one (any) dataset works fine:

rpart(tp ~ ., data = data.train.filtered[[1]])

Even though accessing just one dataset via index works fine (as shown above), using lapply through the indexes fails the same way the first example did.

lapply(1:8, function(i) rpart(tp ~ ., data = data.train.filtered[[i]])) 

Error in terms.formula(formula, data = data) : 
  '.' in formula and no 'data' argument 

The traceback for the index version is as follows:

10 terms.formula(formula, data = data) 
9 terms(formula, data = data) 
8 model.frame.default(formula = tp ~ ., data = data.train.filtered[[i]], 
    na.action = function (x) 
    {
        Terms <- attr(x, "terms") ... 
7 stats::model.frame(formula = tp ~ ., data = data.train.filtered[[i]], 
    na.action = function (x) 
    {
        Terms <- attr(x, "terms") ... 
6 eval(expr, envir, enclos) 
5 eval(expr, p) 
4 eval.parent(temp) 
3 rpart(tp ~ ., data = data.train.filtered[[i]]) 
2 FUN(X[[i]], ...) 
1 lapply(1:8, function(i) rpart(tp ~ ., data = data.train.filtered[[i]])) 

I'm quite sure I'm missing something extremely trivial here, but being quite new to R, I just can't find the problem.

PS: I know that I could iterate through all the datasets via a for loop, but that feels really dirty and I'd prefer an R-idiomatic solution.


Solution

  • OK, I finally managed to find the answer. The problem was that data.train.filtered was actually not what I thought it was. I had an error in the filtering process which corrupted (silently, thanks R) everything.

    The fix was to use:

    data.selectedFeatures <- lapply(data.train.raw, FUN = sf.getGoodFeaturesVector, treshold = 5)
    

    instead of

    data.selectedFeatures <- sapply(data.train.raw, FUN = sf.getGoodFeaturesVector, treshold = 5)
    

    Thanks for all the other answers, though.
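
A likely explanation for why this fix works: when every call returns a vector of the same length, sapply simplifies the result into a matrix, while lapply always returns a list. Indexing a matrix with [[i]] then yields a single cell rather than the i-th dataset's selection vector. The sketch below illustrates this on toy data; the data frames and the keep_columns helper are hypothetical stand-ins, and it assumes (as is not stated in the question) that all original datasets had the same number of columns.

```r
# Toy stand-in for data.train.raw: two data frames with identical columns.
# (Hypothetical data, not the original datasets.)
raw <- list(
  data.frame(id = 1:4, a = c(0, 1, 0, 1), tp = c(1, 0, 1, 0)),
  data.frame(id = 1:4, a = c(0, 0, 0, 0), tp = c(0, 1, 1, 0))
)

# Hypothetical selector: one logical per column (TRUE if any value is > 0).
keep_columns <- function(df) {
  vapply(df, function(col) any(col > 0), logical(1))
}

as_matrix <- sapply(raw, keep_columns)  # equal-length results become a matrix
as_list   <- lapply(raw, keep_columns)  # always a list, one vector per dataset

is.matrix(as_matrix)    # TRUE
is.list(as_list)        # TRUE

# On the matrix, [[2]] is a single cell; on the list, it is the full vector.
length(as_matrix[[2]])  # 1
length(as_list[[2]])    # 3

# A single TRUE is recycled, so the "filter" silently keeps every column.
ncol(raw[[2]][, as_matrix[[2]]])  # 3
ncol(raw[[2]][, as_list[[2]]])    # 2
```

If sapply is preferred for its naming behaviour, sapply(..., simplify = FALSE) also returns a list. Checking str(data.selectedFeatures) right after it is built is a quick way to spot this kind of mismatch.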