Tags: r, regression, rpart

fit$variable.importance is NULL for an rpart decision tree in R


I have this code

library(rpart)

# Import data and keep only the columns used in the model
tugas=read.csv("D:/FlightDelays.csv")
dipakai=c(1,2,4,8,10,13)
tugas<-tugas[,dipakai]

## Split the data into training and test sets
n <- round(nrow(tugas)*0.70);n
set.seed(123)
samp=sample(1:nrow(tugas),n)
data.train = tugas[samp,]
data.test = tugas[-samp,]
dim(data.train)
dim(data.test)

fit <- rpart(delay~., data = data.train, method = 'class')
summary(fit)
fit$variable.importance

but fit$variable.importance returns NULL, so I cannot get the variable importance. How can I fix this?


Solution

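  • It doesn't work because every prediction is the majority class: the tree has no splits, so there is nothing to compute importance from and fit$variable.importance is NULL: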

    library(rpart)

    fl = "https://raw.githubusercontent.com/niharikabalachandra/Logistic-Regression/master/FlightDelays.csv"
    
    tugas=read.csv(fl)
    dipakai=c(1,2,4,8,10,13) 
    l=dim(tugas)[1] 
    tugas<-tugas[1:l,dipakai] 
    
    n <- round(nrow(tugas)*0.70)
    set.seed(123)
    samp=sample(1:nrow(tugas),n)
    data.train = tugas[samp,]
    data.test = tugas[-samp,]
    
    fit <- rpart(delay~., data = data.train, method = 'class')
    table(predict(fit,type="class"))
    
    delayed  ontime 
          0    1541 
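
    One way to confirm this diagnosis is to check whether the fitted tree is a root node only. The sketch below uses a small synthetic data set (the data frame `toy` and its predictor `x` are made up for illustration and are not taken from FlightDelays.csv). Here `x` does carry signal, yet the majority class still wins on both sides of the only possible split, so with the default `cp` rpart discards the split and stores no importance:

    ```r
    library(rpart)

    # Synthetic, imbalanced data (an assumption for illustration only):
    # level "a" is 40% delayed, level "b" is 5% delayed.
    toy <- data.frame(
      x     = factor(rep(c("a", "b"), each = 100)),
      delay = factor(c(rep(c("delayed", "ontime"), times = c(40, 60)),
                       rep(c("delayed", "ontime"), times = c(5, 95))))
    )

    toy_fit <- rpart(delay ~ x, data = toy, method = "class")

    # A tree whose $frame has a single row is a root node with no splits.
    nrow(toy_fit$frame)                      # 1
    is.null(toy_fit$variable.importance)     # TRUE
    table(predict(toy_fit, type = "class"))  # every case predicted "ontime"
    ```

    Running `nrow(fit$frame)` on the flight-delay model should show the same single-row frame if the tree did not split.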
    

    You need to address the class imbalance. Below I only adjust the case weights so that not every prediction is the majority class; this does not, however, improve the precision of the model:

    wt = ifelse(data.train$delay == "delayed",1.5,1)
    fit <- rpart(delay~., data = data.train, method = 'class',weights = wt)
    table(predict(fit,type="class"))
    
    delayed  ontime 
         97    1444
    
    table(predict(fit,data.train,type="class"),data.train$delay)
             
              delayed ontime
      delayed      53     44
      ontime      235   1209
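
    To put numbers on that remark, precision and recall for the "delayed" class can be read off the confusion matrix above. The sketch below re-enters the four counts by hand (the object name `cm` is an illustrative choice) rather than refitting the model:

    ```r
    # Confusion matrix from the output above: rows = predicted, columns = actual.
    cm <- matrix(c(53, 235, 44, 1209), nrow = 2,
                 dimnames = list(predicted = c("delayed", "ontime"),
                                 actual    = c("delayed", "ontime")))

    precision <- cm["delayed", "delayed"] / sum(cm["delayed", ])  # 53 / 97
    recall    <- cm["delayed", "delayed"] / sum(cm[, "delayed"])  # 53 / 288
    accuracy  <- sum(diag(cm)) / sum(cm)                          # 1262 / 1541

    round(c(precision = precision, recall = recall, accuracy = accuracy), 3)
    ```

    On the training data this gives a precision of about 0.546 and a recall of about 0.184, with an accuracy of about 0.819, barely above the roughly 0.813 obtained by always predicting ontime.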
    

    You can get the importance now:

    fit$variable.importance
      carrier      dest schedtime   dayweek    origin 
    40.275159 23.709600 19.088864 16.221204  9.527087
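
    Case weights are not the only way to make rpart take the minority class seriously. A misclassification cost can also be passed through `parms = list(loss = ...)`. The following is a minimal sketch on made-up data (the `toy` data frame and the cost of 2 are illustrative assumptions, not tuned values for FlightDelays.csv). Rows of the loss matrix are the true class and columns the predicted class, in factor-level order, with zeros on the diagonal:

    ```r
    library(rpart)

    # Synthetic, imbalanced data (illustration only, not the flight data).
    toy <- data.frame(
      x     = factor(rep(c("a", "b"), each = 100)),
      delay = factor(c(rep(c("delayed", "ontime"), times = c(40, 60)),
                       rep(c("delayed", "ontime"), times = c(5, 95))))
    )

    # Missing a true "delayed" costs 2; a false alarm costs 1 (assumed costs).
    loss <- matrix(c(0, 2,
                     1, 0), nrow = 2, byrow = TRUE)

    cost_fit <- rpart(delay ~ x, data = toy, method = "class",
                      parms = list(loss = loss))

    nrow(cost_fit$frame) > 1       # TRUE when the tree has at least one split
    cost_fit$variable.importance   # non-NULL once a split exists
    ```

    Whether weights, a loss matrix, or resampling works best for the flight data would need to be judged on the held-out data.test set, not on the training predictions.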