Tags: r, vector, linear-regression, logistic-regression, dummy-variable

Adding a vector of dummy variables in logistic regression


I am currently trying to conduct logistic regression where one of the variables is a vector of 32 dummy variables. Each dummy represents a type of crime. For example:

narcotics <- ifelse(train$PRIMARY.DESCRIPTION == "NARCOTICS", 1,0)

Then the vector is created:

crime.type <- c(narcotics, theft, other.offense, burglary, motor.vehicle.theft, battery, robbery, assault, criminal.damage, deceptive.practice, kidnapping, etc.)

The logistic model is as follows:

logit.mod.train <- lm(street1 ~ BEAT+WARD+X.COORDINATE+Y.COORDINATE+LATITUDE+LONGITUDE+crime.type, data = train, family = "binomial")

Note that street1 is itself a dummy variable indicating whether the crime took place on the street. It is derived from the LOCATION.DESCRIPTION column, where the matching value is STREET:

street1 <- ifelse(train$LOCATION.DESCRIPTION == "STREET", 1, 0)

It yields this error:

Error in model.frame.default(formula = street1 ~ BEAT + WARD + X.COORDINATE +  : 
variable lengths differ (found for 'crime.type')

I thought this would work because they are derived from the same data set and the dummies represent each unique element of one of the columns. When I input each dummy variable separately it's successful but I want to condense the regression and make it more efficient.

Thank you in advance


Solution

  • If you intend for each type of crime to be its own predictor, you'll need to bind the dummies to train as columns and then name them in the model formula. (For a logit model, use glm() rather than lm(); lm() fits ordinary least squares and does not use the family argument.)

    For a more compact formula, subset train in the data= argument of glm() so that it contains only your response variable and your intended predictors. Then use street1 ~ . as your formula.

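    # Bind the dummy vectors (and the response, if it is not already a column) to train
    train <- cbind(train, narcotics, theft, street1)
    model.vars <- c("narcotics", "theft", "street1")
    # street1 ~ . uses every other column of the subset as a predictor
    logit.mod.train <- glm(street1 ~ ., data = train[, model.vars], family = "binomial")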
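
    As an alternative sketch, separate from the binding approach above: R's formula interface expands a factor into dummy columns on its own, so the hand-built dummies may not be needed at all. The example below uses made-up data with three crime types and assumes PRIMARY.DESCRIPTION holds the crime type as text. The seed, sample size, and category names are illustrative only.

    ```r
    set.seed(1)  # illustrative seed, only for reproducibility
    N <- 100

    # Toy stand-in for the real train data (assumed structure)
    train <- data.frame(
      PRIMARY.DESCRIPTION = sample(c("NARCOTICS", "THEFT", "BATTERY"),
                                   size = N, replace = TRUE),
      street1 = rbinom(n = N, size = 1, prob = 0.5)
    )

    # factor() makes glm() build one dummy per level, minus a reference level
    fit <- glm(street1 ~ factor(PRIMARY.DESCRIPTION),
               data = train, family = "binomial")

    # One intercept plus one coefficient per non-reference crime type
    names(coef(fit))
    ```

    With the default treatment contrasts, the first level in sorted order serves as the reference category, and each remaining coefficient is interpreted relative to it.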
    

    More explanation:

    Using ifelse as you've done produces a 1 or 0 for every row in train, so each dummy is a vector with nrow(train) elements.
    c() concatenates those vectors end to end, so crime.type ends up much longer than the number of rows in train: with 32 dummies it has 32 * nrow(train) elements.
    You are then asking the model function to build a design matrix in which one predictor (crime.type) has more elements than the other predictors. That's why you're getting the error.

    Here's a replication of the issue:

    N <- 100
    train <- data.frame(PRIMARY.DESCRIPTION=sample(c("A","B"), replace = TRUE, size = N),
                        response = rbinom(n=N, prob=0.7, size=1))
    dim(train) # 100  2
    
    narcotics <- ifelse(train$PRIMARY.DESCRIPTION == "A", 1, 0) 
    length(narcotics) # 100
    
    theft <-  ifelse(train$PRIMARY.DESCRIPTION == "B", 1, 0)
    length(theft) # 100
    
    # c() concatenates the two dummies end to end
    crime.type <- c(narcotics, theft)
    length(crime.type) # 200
    
    logit.mod.train <- glm(response ~ PRIMARY.DESCRIPTION+crime.type, data = train, family = "binomial")
    

    Error in model.frame.default(formula = response ~ PRIMARY.DESCRIPTION + : variable lengths differ (found for 'crime.type')
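
    One way to keep a single crime.type term, sketched on data like the replication above: combine the dummies with cbind() instead of c(). That gives a matrix with one row per observation, which a formula accepts as a multi-column predictor. A third category "C" is added here as an assumption so that one level can be left out as the reference. Including a dummy for every category alongside the intercept would make the columns perfectly collinear.

    ```r
    set.seed(42)  # illustrative seed, only for reproducibility
    N <- 100

    # Toy data; the third category "C" is an assumption for this sketch
    train <- data.frame(
      PRIMARY.DESCRIPTION = sample(c("A", "B", "C"), size = N, replace = TRUE),
      response = rbinom(n = N, size = 1, prob = 0.7)
    )

    narcotics <- ifelse(train$PRIMARY.DESCRIPTION == "A", 1, 0)
    theft     <- ifelse(train$PRIMARY.DESCRIPTION == "B", 1, 0)

    # cbind() places the dummies side by side: a 100 x 2 matrix,
    # not a 200-element vector, so the row count matches train
    crime.type <- cbind(narcotics, theft)
    dim(crime.type)  # 100 2

    # The matrix enters the model as one term with one coefficient per column
    fit <- glm(response ~ crime.type, data = train, family = "binomial")
    names(coef(fit))
    ```

    The formula stays short, while each column of the matrix still gets its own coefficient, with "C" acting as the baseline.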