
R ctree strange error


I have a strange problem when calling ctree inside a for loop: if I run this code in a loop, R freezes.

data = read.csv("train.csv") #data description https://www.kaggle.com/c/titanic-gettingStarted/data

treet = ctree(Survived ~ ., data = data)
print(plot(treet))

Sometimes I get the error "More than 52 levels in a predicting factor, truncated for printout" and the tree is displayed in a very weird way. Sometimes it works just fine. Really, really strange!

My loop code:

functionPlot <- function(traine, i) {
  print(i) # print only once, then RStudio freezes
  tempd <- ctree(Survived ~ ., data = traine)
  print(plot(tempd))
}

for(i in 1:2) {
  smp_size <- floor(0.70 * nrow(data))
  train_ind <- sample(seq_len(nrow(data)), size = smp_size)
  set.seed(100 + i)
  train <- data[train_ind, ]
  test <- data[-train_ind, ]
#
  functionPlot(train,i)
}

Solution

  • The ctree() function expects that (a) appropriate classes (numeric, factor, etc.) are used for each variable, and that (b) only useful predictors are employed in the model formula.

    As for (b), you have supplied variables that are really just character strings (like Name) rather than meaningful factors. These would either need to be pre-processed appropriately or omitted from the analysis.
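
    One way to spot such columns before fitting is to look at each column's class and its number of distinct values. The following is only a sketch on a small made-up data frame: toy, its values, and the "more than half the rows" threshold are assumptions for illustration, not part of the Kaggle file or of ctree() itself.

```r
# Toy stand-in for the Titanic data; all values are made up for illustration.
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1, 0),
  Pclass   = c(3, 1, 3, 1, 2, 3),
  Name     = c("A, Mr. a", "B, Mrs. b", "C, Miss. c",
               "D, Mrs. d", "E, Mr. e", "F, Mr. f"),
  Sex      = c("male", "female", "female", "female", "male", "male"),
  stringsAsFactors = TRUE  # mimics the read.csv() default before R 4.0.0
)

# Class of every column and number of distinct values per column.
col_class  <- sapply(toy, function(x) class(x)[1])
n_distinct <- sapply(toy, function(x) length(unique(x)))
print(data.frame(class = col_class, n_distinct = n_distinct))

# Factor columns with nearly one level per row behave like identifiers.
# The cutoff (more than half the rows) is an arbitrary, hypothetical choice.
id_like <- names(toy)[col_class == "factor" & n_distinct > nrow(toy) / 2]
print(id_like)
```

    In this toy data only Name is flagged, which is the kind of column to drop from the formula or to pre-process first.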

    Even then, you will not get the best results because of (a): some variables (like Survived and Pclass) are coded numerically but are really categorical variables that should be factors. The scripts at https://www.kaggle.com/c/titanic/forums/t/13390/introducing-kaggle-scripts also show how the data preparation can be carried out. Here, I use

    titanic <- read.csv("train.csv")
    titanic$Survived <- factor(titanic$Survived,
      levels = 0:1, labels = c("no", "yes"))
    titanic$Pclass <- factor(titanic$Pclass)
    titanic$Name <- as.character(titanic$Name)
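
    To make the effect of the levels and labels arguments concrete, here is a small sketch on a made-up vector. The names codes and outcome are illustrative only and do not appear in the original data.

```r
# Made-up 0/1 outcome codes, standing in for the Survived column.
codes <- c(0, 1, 1, 0, 1)

# levels lists the codes in order; labels gives the name used for each code.
outcome <- factor(codes, levels = 0:1, labels = c("no", "yes"))

print(outcome)             # the values shown with their labels
print(table(outcome))      # counts per level
print(is.factor(outcome))  # confirms the class a tree method would see
```

    With a factor response like this, the tree is fitted as a classification tree rather than treating 0/1 as a numeric outcome.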
    

    As for (b), I then go on to call ctree() with only the variables which have been sufficiently pre-processed for meaningful analysis. (And I use the newer recommended implementation from package partykit.)

    library("partykit")
    ct <- ctree(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
      data = titanic)
    plot(ct)
    print(ct)
    

    This yields the following graphical output:

    [Figure: ctree for titanic data]

    And the following print output:

    Model formula:
    Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
    
    Fitted party:
    [1] root
    |   [2] Sex in female
    |   |   [3] Pclass in 1, 2: yes (n = 170, err = 5.3%)
    |   |   [4] Pclass in 3
    |   |   |   [5] Fare <= 23.25: yes (n = 117, err = 41.0%)
    |   |   |   [6] Fare > 23.25: no (n = 27, err = 11.1%)
    |   [7] Sex in male
    |   |   [8] Pclass in 1
    |   |   |   [9] Age <= 52: no (n = 88, err = 43.2%)
    |   |   |   [10] Age > 52: no (n = 34, err = 20.6%)
    |   |   [11] Pclass in 2, 3
    |   |   |   [12] Age <= 9
    |   |   |   |   [13] Pclass in 3: no (n = 71, err = 18.3%)
    |   |   |   |   [14] Pclass in 2: yes (n = 13, err = 30.8%)
    |   |   |   [15] Age > 9: no (n = 371, err = 11.3%)
    
    Number of inner nodes:    7
    Number of terminal nodes: 8
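
Applied to the loop from the question, the same ideas might look like the sketch below. It is not part of the original answer. To stay self-contained it builds a random stand-in data set instead of reading train.csv, and the helpers make_toy_titanic() and fit_and_plot() are hypothetical names introduced here. Two small changes relative to the question's loop: set.seed() is called before sample() so that each split is reproducible, and plot() is called directly because it draws as a side effect and does not need to be wrapped in print().

```r
# Synthetic stand-in for the prepared Titanic data; values are random, not real.
make_toy_titanic <- function(n = 200) {
  data.frame(
    Survived = factor(sample(0:1, n, replace = TRUE),
                      levels = 0:1, labels = c("no", "yes")),
    Pclass   = factor(sample(1:3, n, replace = TRUE)),
    Sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
    Age      = round(runif(n, 1, 80)),
    Fare     = round(runif(n, 5, 100), 2)
  )
}

# Fit on an explicit set of predictors and draw the tree.
# Fitting is skipped when the partykit package is not installed.
fit_and_plot <- function(train, i) {
  if (!requireNamespace("partykit", quietly = TRUE)) return(invisible(NULL))
  ct <- partykit::ctree(Survived ~ Pclass + Sex + Age + Fare, data = train)
  plot(ct, main = paste("Split", i))
  invisible(ct)
}

set.seed(1)
titanic_toy <- make_toy_titanic()
split_sizes <- integer(0)

for (i in 1:2) {
  set.seed(100 + i)                     # seed first, then sample
  smp_size  <- floor(0.70 * nrow(titanic_toy))
  train_ind <- sample(seq_len(nrow(titanic_toy)), size = smp_size)
  train <- titanic_toy[train_ind, ]
  test  <- titanic_toy[-train_ind, ]    # held out, unused in this sketch
  split_sizes <- c(split_sizes, nrow(train))
  fit_and_plot(train, i)
}
print(split_sizes)
```

Because the formula names its predictors explicitly instead of using `Survived ~ .`, identifier-like columns such as Name never reach ctree().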