
R: add summation to regression equation


I am trying to write this formula in R, where i indexes the values of category (category can be 1, 2, 3, or 4):

[Image of the formula, not reproduced here]

This is my code attempt but R prints this error message:
Error in lm(category ~ (year * state * district) + year + state + district + : formal argument "data" matched by multiple actual arguments

I am trying to create a summation, so I added multiple arguments after the data argument. Is there another way to write the summation that avoids the error message? I searched online but could not find anything similar; I am guessing it is rare to add a summation to a regression. Thank you in advance for any help.

ID <-  c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,
         17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,
         33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48)
year <- c(1980,1980,1980,1980,1980,1980,1980,1980,1980,1980,1980,1980,1980,1980,1980,1980,
      1981,1981,1981,1981,1981,1981,1981,1981,1981,1981,1981,1981,1981,1981,1981,1981,
      1982,1982,1982,1982,1982,1982,1982,1982,1982,1982,1982,1982,1982,1982,1982,1982)
state <- c("NY","NY","NY","NY","NY","NY","NY","NY","CA","CA","CA","CA","CA","CA","CA","CA",
      "NY","NY","NY","NY","NY","NY","NY","NY","CA","CA","CA","CA","CA","CA","CA","CA",
      "NY","NY","NY","NY","NY","NY","NY","NY","CA","CA","CA","CA","CA","CA","CA","CA")
district <- c(1,1,1,1,2,2,2,2,1,1,1,1,2,2,2,2,
          1,1,1,1,2,2,2,2,1,1,1,1,2,2,2,2,
          1,1,1,1,2,2,2,2,1,1,1,1,2,2,2,2)
quantity <- c(100,200,45,87,65,32,94,52,67,72,14,53,28,94,12,41,
          10,20,45,87,65,32,8,52,67,1,14,53,28,94,12,41,
          1000,2000,45,87,9,32,94,5,6,7,1,5,2,9,1,4)
category <- c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4,
   1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4,
   1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)

df <- data.frame(ID,year,state,district,quantity,category)

df$year <- as.factor(df$year)
df$state <- as.factor(df$state)
df$district <- as.factor(df$district)
df$category <- as.factor(df$category)

print(df)

# force regression baseline values
# (relevel() returns a new factor, so the result must be assigned back)
df$year <- relevel(df$year, ref = '1981')
df$district <- relevel(df$district, ref = '2')

# r1 is when y = 1
r1 <- lm( category ~ (year*state*district) +  
           quantity + district + state + year,
           data = subset(df, year == 1980)
      +
        (year*state*district) +  
         quantity + district + state + year,
         data = subset(df, year == 1981)
      +
        (year*state*district) +  
         quantity + district + state + year,
         data = subset(df, year == 1980)
     )
summary(r1)

# r2 is when y = 2
r2 <- lm( category ~ (year*state*district) +  
        year + state + district + quantity,
      data = subset(df, year == 1980)
      +
        (year*state*district) +  
        year + state + district + quantity,
      data = subset(df, year == 1981)
      +
        (year*state*district) +  
        year + state + district + quantity,
      data = subset(df, year == 1980)
)
summary(r2)

r3 and r4 follow the same pattern.

Solution

  • There are several problems here:

    • The line with the summation near the beginning needs a coefficient multiplying each term.

    • What does year[i, y] mean? It is not defined.

    • Linear regression is not appropriate for a categorical response. Assuming the categories are unordered, we can use multinomial logistic regression.

    • Interactions normally require that all lower-order interactions (and main effects) be included as well.
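
On the error message itself: R raises it whenever a named argument such as data appears more than once in a single call, which is what happens when several formula/data pairs are joined with + inside one lm() call. A minimal sketch, using a hypothetical toy data frame d rather than the data from the question:

```r
# Hypothetical toy data, only for illustrating the error.
d <- data.frame(
  x = 1:10,
  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2)
)

# Supplying `data` twice in one call reproduces the error from the question.
msg <- tryCatch(
  lm(y ~ x, data = d, data = d),
  error = function(e) conditionMessage(e)
)
print(msg)

# One call takes one formula and one data frame.
# Restricting the rows goes through the `subset` argument instead.
fit <- lm(y ~ x, data = d, subset = x <= 5)
print(coef(fit))
```

Inside a formula, + adds terms to a single model; it does not sum separate fits over different subsets of the data.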

    Perhaps you want this:

    library(nnet)
    
    fm <- multinom(category ~ year/(district * state) + district + state + quantity, df)
    summary(fm)
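
As a rough sketch of what the fitted object can be used for, the block below rebuilds the question's data in compact form (same values, written with rep()), refits the model, and extracts predicted class probabilities. The trace = FALSE argument only silences the iteration log; the names probs and pred are introduced here for illustration.

```r
library(nnet)

# Same data as in the question, built compactly with rep().
df <- data.frame(
  year     = factor(rep(1980:1982, each = 16)),
  state    = factor(rep(rep(c("NY", "CA"), each = 8), times = 3)),
  district = factor(rep(rep(1:2, each = 4), times = 6)),
  quantity = c(100, 200, 45, 87, 65, 32, 94, 52, 67, 72, 14, 53, 28, 94, 12, 41,
               10, 20, 45, 87, 65, 32, 8, 52, 67, 1, 14, 53, 28, 94, 12, 41,
               1000, 2000, 45, 87, 9, 32, 94, 5, 6, 7, 1, 5, 2, 9, 1, 4),
  category = factor(rep(1:4, times = 12))
)

fm <- multinom(category ~ year/(district * state) + district + state + quantity,
               data = df, trace = FALSE)

# One row per observation, one column per category; each row sums to 1.
probs <- predict(fm, type = "probs")
head(round(probs, 3))

# Most likely category for each observation, cross-tabulated with the truth.
pred <- predict(fm, type = "class")
table(observed = df$category, predicted = pred)
```

A single multinomial fit models all four categories at once, so separate r1 to r4 models are not needed.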
    

    fm is of class "multinom" with these methods:

    methods(class = "multinom")
    ## [1] add1        anova       coef        confint     drop1       extractAIC 
    ## [7] logLik      model.frame predict     print       summary     vcov       
    ## see '?methods' for accessing help and source code
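
A few of these methods in use, as a hedged sketch. The reduced model fm0 (main effects only) is a hypothetical comparison model that is not part of the original answer, and the Wald z statistics follow the usual coefficient / standard error recipe rather than anything specific to this data.

```r
library(nnet)

# Same data as in the question, built compactly with rep().
df <- data.frame(
  year     = factor(rep(1980:1982, each = 16)),
  state    = factor(rep(rep(c("NY", "CA"), each = 8), times = 3)),
  district = factor(rep(rep(1:2, each = 4), times = 6)),
  quantity = c(100, 200, 45, 87, 65, 32, 94, 52, 67, 72, 14, 53, 28, 94, 12, 41,
               10, 20, 45, 87, 65, 32, 8, 52, 67, 1, 14, 53, 28, 94, 12, 41,
               1000, 2000, 45, 87, 9, 32, 94, 5, 6, 7, 1, 5, 2, 9, 1, 4),
  category = factor(rep(1:4, times = 12))
)

fm  <- multinom(category ~ year/(district * state) + district + state + quantity,
                data = df, trace = FALSE)
# Hypothetical reduced model: main effects only.
fm0 <- multinom(category ~ year + district + state + quantity,
                data = df, trace = FALSE)

coef(fm0)     # one row of coefficients per non-baseline category
logLik(fm0)   # log-likelihood of the reduced model

# Likelihood-ratio comparison of the two nested models.
a <- anova(fm0, fm)
print(a)

# Wald z statistics and two-sided p-values for the reduced model.
s <- summary(fm0)
z <- s$coefficients / s$standard.errors
p <- 2 * pnorm(abs(z), lower.tail = FALSE)
round(p, 3)
```

With only 48 observations and many parameters, the standard errors and p-values from either model should be read with caution.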
    

    For interpretation see https://stats.oarc.ucla.edu/r/dae/multinomial-logistic-regression/