Tags: r, glm, data-partitioning

Incorrect splitting of data using sample.split in R and issue with logistic regression


I have 2 issues.

  1. When I split my data into training and test sets using sample.split as below, the resulting sizes are not what I expect. The data d has 392 rows, so a 4:1 split should put 0.8*392 = 313.6, i.e. 313 or 314 rows, in the training set, but the length shown is 304. Is there something I am missing?

    require(caTools)
    set.seed(101)
    samplev = sample.split(d[,], SplitRatio= 0.80)
    train = subset(d, samplev == TRUE)
    test = subset(d, samplev == FALSE)
    
  2. I'm trying to use the split data for a logistic regression task in R, as follows:

    #Training
    m <- glm(mpg01~ . -name, data= train, family = binomial(link = 'logit'))
    out2 <- predict.glm(m, test, type = "response")
    class2 <- vector()
    for (i in 1:length(out2))
    {
      if(out2[i] >= 0.5)
      {
        class2[i] <- 1
      }
      else
      {
        class2[i] <- 0
      }
    }
    r2 <- table(class2, test$mpg01)  #confusion Matrix
    

The idea is to exclude the 'name' column from training. When I run the fitted model on the test data, I get the following:

out2 <- predict.glm(m, test, type = "response")

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :

factor name has new levels amc ambassador sst, amc concord dl 6, amc pacer, amc pacer d/l, amc rebel sst, audi 100 ls, audi 5000, buick century 350, buick century limited, cadillac seville, capri ii, chevrolet bel air, chevrolet cavalier, chevrolet cavalier wagon, chevrolet monte carlo, chevrolet vega 2300, chrysler lebaron town @ country (sw), chrysler new yorker brougham, datsun 510 hatchback, datsun b210 gx, datsun f-10 hatchback, dodge aries wagon (sw), dodge aspen 6, dodge colt hardtop, dodge colt m/m, dodge dart custom, dodge magnum xe, dodge rampage, fiat 124 tc, ford mustang, ford mustang ii, ford ranger, honda civic 1500 gl, maxda rx3, mazda 626, mazda glc 4, mazda glc custom, mercedes-benz 240d, mercedes-benz 280s, mercury capri 2000, mercury marquis, oldsmobile cutlass ciera (diesel), peugeot 505s turbo diesel, plymouth 'cuda 340, plymouth fury gran sedan, plymouth grand fury, plymouth horizon, plymouth horizon miser, plymouth horizon tc3, plymouth satellite, plymo

As I understand it, this error should not appear, since the 'name' column is excluded from the model. If it is somehow still being used, what am I doing wrong?


Solution

  • Issue 1 Answer

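    The sample.split function expects a vector (typically the outcome labels) as its first argument, but you are passing a data frame (d[,] is still the whole data frame). A data frame has one element per column, so the returned logical vector most likely has one entry per column rather than per row, and subset then recycles it across the rows, which produces the odd counts. Here is a simple example showing the difference in behavior.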

    # Mock up some data
    library(caTools)
    df0 <- data.frame(
         y = as.factor(rbinom(392, 1, 0.75)),
         x1 = rnorm(392)
    )
    
    # sample.split with a data.frame as the first argument does not split 80/20 as expected
    set.seed(101)
    samplev = sample.split(df0, SplitRatio= 0.80)
    train = subset(df0, samplev == TRUE)
    test = subset(df0, samplev == FALSE)
    nrow(train)
    [1] 196
    nrow(test)
    [1] 196
    
    # feed in your response variable as a vector to get the expected split
    set.seed(101)
    samplev = sample.split(df0$y, SplitRatio= 0.80)
    train = subset(df0, samplev == TRUE)
    test = subset(df0, samplev == FALSE)
    nrow(train)
    [1] 314
    nrow(test)
    [1] 78
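
A minimal base-R sketch of the recycling effect described above, plus a row-based split that does not depend on caTools. The data frame `d_demo` (9 columns) and the particular `mask` are hypothetical stand-ins, chosen only because they happen to reproduce a count of 304; the asker's actual column count is not stated in the question.

```r
# Hypothetical stand-in: 392 rows, 9 columns (the real data is not shown).
d_demo <- as.data.frame(matrix(rnorm(392 * 9), nrow = 392, ncol = 9))

# A logical vector with one element per COLUMN (7 TRUE, 2 FALSE), similar in
# shape to what a split computed over the columns would return.
mask <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)

# Row subsetting recycles the short vector over all 392 rows.
recycled <- rep_len(mask, nrow(d_demo))
n_kept <- sum(recycled)

# Alternative: sample row indices directly, so the split is over rows.
set.seed(101)
train_idx  <- sample(seq_len(nrow(d_demo)), size = floor(0.8 * nrow(d_demo)))
train_demo <- d_demo[train_idx, ]
test_demo  <- d_demo[-train_idx, ]
```

Note that the index-based split is purely random; unlike sample.split on a label vector, it makes no attempt to preserve class proportions.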
    

  • Issue 2 Answer

    What you are doing seems reasonable, and it looks like it should work the way you expect, but it does not match how glm, and ultimately model.frame, handle formulas under the hood.

    First off, here is some code that reproduces what you are doing and the error you are seeing.

    set.seed(123)
    df <- data.frame(
        y = as.factor(rbinom(100, 1, 0.5)),
        x1 = rnorm(100),
        x2 = rnorm(100),
        name = c(rep('a',40), rep('b',30), rep('c', 30))
    )
    train <- df[1:70,]
    test <- df[71:100,]
    m <- glm(y~ . -name, data= train, family = binomial(link = 'logit'))
    out2 <- predict.glm(m, test, type = "response")
    

    Now notice that when I call model.frame directly with your formula it is still including the name column.

    head(model.frame(y~ . -name, data = train), 1)
      y        x1        x2 name
    1 0 0.2533185 0.7877388    a
    

    Whereas a formula that does not include the . columns symbol will not include that extra column.

    head(model.frame(y~ x1 + x2, data = train), 1)
      y        x1        x2
    1 0 0.2533185 0.7877388
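
One way to see the consequence on a fitted model is to inspect what glm stores, as in this hedged sketch. It rebuilds the same mock data under the hypothetical names `df_demo`, `train_demo`, `test_demo`, and `m_demo`, and assumes the behavior shown above (the excluded column still appearing in the model frame) holds in your R version.

```r
set.seed(123)
df_demo <- data.frame(
  y    = as.factor(rbinom(100, 1, 0.5)),
  x1   = rnorm(100),
  x2   = rnorm(100),
  name = c(rep("a", 40), rep("b", 30), rep("c", 30))
)
train_demo <- df_demo[1:70, ]
test_demo  <- df_demo[71:100, ]

m_demo <- glm(y ~ . - name, data = train_demo,
              family = binomial(link = "logit"))

# The coefficients do not involve 'name' ...
coef_names <- names(coef(m_demo))

# ... but the fitted object still records the levels of 'name' seen in training.
stored_levels <- m_demo$xlevels

# predict() rebuilds the model frame from the new data and checks those levels,
# so a level that occurs only in the test rows raises the error.
err <- tryCatch(
  predict(m_demo, test_demo, type = "response"),
  error = function(e) conditionMessage(e)
)
```

If `stored_levels` lists 'name', that is the information predict compares against the test data, which would explain why a column that contributes no coefficient can still break prediction.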
    

    In the end, it appears you'll need to work around this, either by specifying the columns directly in the formula or, if you want to keep using the . columns symbol, by dropping the columns you wish to exclude from the data.

    More specifically, with my simple example, workaround 1 would look like this:

    m <- glm(y~ x1 + x2, data= train, family = binomial(link = 'logit'))
    out2 <- predict.glm(m, test, type = "response")
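
With many predictors, typing every column into the formula gets tedious. A possible variant of workaround 1, sketched here with base R's reformulate, builds the explicit formula from the column names. The names `df_demo`, `train_demo`, `test_demo`, `predictors`, `f`, and `m_demo` are illustrative and not part of the original answer.

```r
set.seed(123)
df_demo <- data.frame(
  y    = as.factor(rbinom(100, 1, 0.5)),
  x1   = rnorm(100),
  x2   = rnorm(100),
  name = c(rep("a", 40), rep("b", 30), rep("c", 30))
)
train_demo <- df_demo[1:70, ]
test_demo  <- df_demo[71:100, ]

# Every column except the response and the excluded column becomes a predictor.
predictors <- setdiff(names(train_demo), c("y", "name"))

# Build an explicit formula with no '.' and no '- name'.
f <- reformulate(predictors, response = "y")

m_demo   <- glm(f, data = train_demo, family = binomial(link = "logit"))
out_demo <- predict(m_demo, test_demo, type = "response")
```

Because the formula never mentions 'name', the test data may keep that column without affecting prediction.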
    

    And workaround 2 would look like this:

    m <- glm(y~ ., data= train[,names(train) != 'name'], family = binomial(link = 'logit'))
    out2 <- predict.glm(m, test, type = "response")