Search code examples
rregressionlmpoly

lm() breaks when using poly() with predictors set up as factors


I'm trying to model the relationship between a categorical predictor variable and a continuous outcome variable. I use lm() to this end. Since it's a categorical variable, the proper practice is to convert it to a factor variable type. However, when using poly() for the predictor's regression term and when setting up the predictor variable as a factor it causes lm() to break. On the other hand, if I run lm() without using poly() (but do keep the predictor as factor) or keep poly() but not convert the predictor to a factor (let it be numeric) -- then lm() doesn't break. I don't understand why it breaks and I don't understand if I can trust the results when it doesn't break.

Data

Data about 50 basketball players. One column (PosCode) is about player's position in the game, and the other (Height) is player's height.

data <-
structure(list(Player = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 
44, 45, 46, 47, 48, 49, 50), PosCode = c(3, 3, 4, 1, 4, 1, 3, 
1, 2, 2, 4, 1, 5, 5, 2, 1, 2, 5, 4, 4, 5, 4, 4, 4, 2, 3, 2, 3, 
1, 1, 2, 4, 1, 2, 3, 1, 5, 4, 3, 4, 4, 1, 1, 4, 5, 1, 1, 1, 5, 
2), Height = c(176.1, 179.1, 183.1, 169.7, 177.3, 179, 176.4, 
174.9, 180.2, 176.5, 178.6, 167.9, 183.4, 166.2, 189.5, 171.9, 
188.5, 172.6, 167.7, 172.6, 186.9, 163.8, 179.3, 165.4, 182.2, 
166.1, 176.8, 171.9, 173.8, 163, 172.5, 184.9, 170.4, 170.6, 
166.8, 172.6, 184.3, 163.3, 182.4, 165.8, 173.4, 182.1, 172.9, 
184.9, 173.2, 185.8, 161.4, 186, 178.4, 170.7)), row.names = c(NA, 
-50L), class = c("tbl_df", "tbl", "data.frame"))


> data
## # A tibble: 50 x 3
##    Player PosCode Height
##    <dbl>   <dbl>  <dbl>
##  1      1       3   176.
##  2      2       3   179.
##  3      3       4   183.
##  4      4       1   170.
##  5      5       4   177.
##  6      6       1   179 
##  7      7       3   176.
##  8      8       1   175.
##  9      9       2   180.
## 10     10       2   176.
## # ... with 40 more rows

Modeling the data

I want to know whether I can predict players height from their position in the game. Since position is categorical (there are 5 possible positions), this variable should be of a factor type, with 5 levels.

library(tidyverse)
library(magrittr) 

data %<>% mutate_at(vars(PosCode), ~ as.factor(.)) ## convert PosCode from dbl to fct

Modeling by using lm() without poly()

lm(Height ~ PosCode, data = data)

## Call:
## lm(formula = Height ~ PosCode, data = data)
## 
## Coefficients:
## (Intercept)     PosCode2     PosCode3     PosCode4     PosCode5  
##    173.6714       4.9397       0.4429       0.1824       4.1857  

Modeling by using lm() with poly()

lm(Height ~ poly(PosCode ,1), data = data)

## Error in qr.default(X) : NA/NaN/Inf in foreign function call (arg 1)
## In addition: Warning messages:
## 1: In mean.default(x) : argument is not numeric or logical: returning NA
## 2: In Ops.factor(x, xbar) : ‘-’ not meaningful for factors

If the predictor isn't a factor, there's no problem regardless of poly()

## convert PosCode from fct back to dbl
data %<>% mutate_at(vars(PosCode), ~ as.double(.)) 

## lm() without poly()
lm(Height ~ PosCode, data = data)

Call:
lm(formula = Height ~ PosCode, data = data)

## Coefficients:
## (Intercept)      PosCode  
##   174.3848       0.3112 


## lm() with poly() 
lm(Height ~ poly(PosCode ,1), data = data)

## Call:
## lm(formula = Height ~ poly(PosCode, 1), data = data)

## Coefficients:
##      (Intercept)  poly(PosCode, 1)  
##          175.256             3.173 

But clearly, treating PosCode as dbl rather than fct changes the model in a way that is wrong.

Bottom line

I don't understand why including poly() in lm() breaks it when the predictor is set up as a factor variable.


Solution

  • From help("poly"):

    x a numeric vector at which to evaluate the polynomial.

    Thus, you cannot use factors inside poly(). This is expected, because categorical variables (i.e., factors) have to be recoded, for example, into dummy variables. And it does neither make sense to have, say, a quadratic effect for the categorical variable as a whole nor for the coded (dummy) variables. (It does not make sense from a substantive perspective, but squaring a dummy variable that has only 0s and 1s does also not make much sense from a perspective blind to statistics.)

    You can see that lm() recodes your factor because you get four coefficents (one less than the number of categories) for the variable PosCode in your first model.

    In the end, poly() is not of much use unless you set its argument degree to a value > 1