I am running a logistic regression on the spam dataset from https://hastie.su.domains/ElemStatLearn/. The dependent variable is in the last column, which is named V58 after I import the data into R. But I get the error message
variable lengths differ (found in 'V1')
I've checked for NAs using na.omit, and I tried removing the V1 column just to see if that fixes the issue, but I get the same message. I tried using the old cv.glm function instead, but that does not work either. I also tried referring to the dependent variable as spam.data$V58. I am stuck. What am I missing? Below is the code I have so far to just get the model.
spam.data <- data_frame(read.table(datapathname))
dim(spam.data)
str(spam.data)
summary(spam.data)
set.seed(2718)
row.number = sample(1:nrow(spam.data), 0.7*nrow(spam.data))
train = spam.data[row.number,]
test = spam.data[-row.number,]
dim(train)
dim(test)
model.logistic = glm(as.factor(spam.data[58])~., data=train, family=binomial) #The error gets thrown here.
summary(model.logistic)
I should also mention that I created the dataset by copying the data from the website into a text file and reading that file into R. The site says there should be 4601 rows and 58 columns, and that is indeed the size I get.
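There are two mistakes when you fit the logistic model:
1. The outcome is taken from spam.data (all 4601 rows) while the predictors come from train (about 70% of the rows), so the lengths differ. Use train[58] as the outcome.
2. train[58] is a one-column data frame, not a vector. Convert it with unlist(train[58]).
Using these two changes the model worked for me: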
model.logistic = glm(as.factor(unlist(train[58]))~., data=train, family=binomial)
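A possibly simpler variant is to refer to the outcome column by name, so that the formula's `.` expands to the remaining columns only. As far as I can tell, when the left-hand side is an expression such as `as.factor(unlist(train[58]))`, the `.` still picks up V58 as a predictor, which is probably not what you want. The sketch below uses a small synthetic data frame (`toy`, with hypothetical columns V1 to V4 standing in for V1 to V58) rather than the real spam file, so the data, the coefficients 0.8 and -0.5, and the sample size are all assumptions for illustration.

```r
# Synthetic stand-in for the spam data: three predictors and a 0/1 outcome
# in the last column (V4 plays the role of V58). All values are made up.
set.seed(2718)
n <- 200
toy <- data.frame(V1 = rnorm(n), V2 = rnorm(n), V3 = rnorm(n))
toy$V4 <- rbinom(n, 1, plogis(0.8 * toy$V1 - 0.5 * toy$V2))

# Same 70/30 split as in the question.
row.number <- sample(1:nrow(toy), 0.7 * nrow(toy))
train <- toy[row.number, ]
test <- toy[-row.number, ]

# The source of "variable lengths differ": the full data has more rows
# than the training subset passed via data=.
nrow(toy[4])
nrow(train)

# Convert the outcome to a factor inside the training set and refer to it
# by name, so that "." expands to V1..V3 only.
train$V4 <- as.factor(train$V4)
model.logistic <- glm(V4 ~ ., data = train, family = binomial)

# Inspect which terms ended up in the model.
names(coef(model.logistic))
```

With the real data the corresponding call would be glm(V58 ~ ., data=train, family=binomial), after converting train$V58 to a factor (or leaving it as 0/1, which glm also accepts for a binomial family).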