Variable lengths differ error message in R


I am running a logistic regression on the spam dataset from https://hastie.su.domains/ElemStatLearn/. The dependent variable is in the last column, which is given as V58 after I import the data in R. But I get the error message

variable lengths differ (found in 'V1')

I've checked for NAs using na.omit, and I tried removing the V1 column to see whether that fixes the issue, but I get the same message. I also tried the older cv.glm function instead, which did not work either, and I tried referring to the dependent variable as spam.data$V58. I am stuck. What am I missing? Below is the code I have so far, just to fit the model.

spam.data <- data_frame(read.table(datapathname))

dim(spam.data)
str(spam.data)
summary(spam.data)

set.seed(2718)
row.number = sample(1:nrow(spam.data), 0.7*nrow(spam.data))
train = spam.data[row.number,]
test = spam.data[-row.number,]
dim(train)
dim(test)

model.logistic = glm(as.factor(spam.data[58])~., data=train, family=binomial) #The error gets thrown here.

summary(model.logistic)

I should also say that I created the data set by copying the data from the website into a text file and reading that file into R. The info on the site says we should get 4601 rows and 58 columns, and that is indeed the size of the data set I get.


Solution

  • There are two mistakes when you fit the logistic model:

    • The data is train, but you take the outcome from spam.data. These two data frames have different numbers of observations, which is what "variable lengths differ" is reporting. Instead, use train[58] as the outcome.
    • When you convert the outcome with as.factor, a single NA level is returned. This is because train[58] is a one-column data frame (a tibble here), not a vector; as.factor handles a dbl vector fine, but not a data frame. So extract the column as a vector first, for example with unlist(train[58]) or train[[58]].

    With these two changes, the model worked for me:

    model.logistic = glm(as.factor(unlist(train[58]))~., data=train, family=binomial)
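
The two points above can be reproduced on a small synthetic data set. This is only a sketch: the data frame toy, its size, and the outcome column V4 are made up to stand in for the spam data and V58, and a base data frame is used instead of a tibble so that no packages are needed. It also shows an alternative way of writing the formula, naming the outcome column directly. As far as I can tell, when the response is written as an expression on train (as in the call above), the "." on the right-hand side still includes V58 among the predictors, whereas naming the column lets "." leave it out.

```r
# Synthetic stand-in for the spam data (hypothetical: 200 rows, 3 predictors,
# binary outcome in the last column, called V4 here instead of V58).
set.seed(2718)
n <- 200
toy <- data.frame(V1 = rnorm(n), V2 = rnorm(n), V3 = rnorm(n))
toy$V4 <- rbinom(n, 1, plogis(0.8 * toy$V1 - 0.5 * toy$V2))

# Same 70/30 split as in the question.
row.number <- sample(seq_len(nrow(toy)), floor(0.7 * nrow(toy)))
train <- toy[row.number, ]
test  <- toy[-row.number, ]

# Single brackets keep the data frame wrapper; double brackets give the vector.
is.data.frame(train[4])   # TRUE: a one-column data frame
is.numeric(train[[4]])    # TRUE: a plain numeric vector

# The outcome taken from the full data has more rows than the training set,
# which is the length mismatch behind the error.
c(outcome = nrow(toy[4]), predictors = nrow(train))

# Naming the outcome column lets "." expand to every other column of train.
model.logistic <- glm(factor(V4) ~ ., data = train, family = binomial)
names(coef(model.logistic))
```

For the real data the corresponding call would be glm(factor(V58) ~ ., data = train, family = binomial), assuming the outcome column is named V58 as in the question.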