
How to create dummy variable with range


I am trying to create a price range variable and fit an lm model with it as a dummy variable. So I did:

> #price range 
> airbnblisting$PriceRange[price <= 500] <- 0 
> airbnblisting$PriceRange[price > 500 & price <= 1000] <- 1
> airbnblisting$PriceRange[price > 1000] <- 2

Then run:

> r1 <- lm(review_scores_rating ~ PriceRange, data=airbnblisting,)
> summary(r1)

But the result shows NA for PriceRange. Any idea how I can get PriceRange working properly?

    Min      1Q  Median      3Q     Max 
-4.7619 -0.0319  0.1281  0.2381  0.2381 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.761914   0.003115    1529   <2e-16 ***
PriceRange        NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1   

Example price values:

$102.00 
$179.00 
$1140.00 
$104.00 
$539.00 
$1090.00 
$149.00 
$44.00 
$1500.00 
$200.00 
$153.00 
$58.00 
$350.00 

Solution

  • The dollar sign $ indicates that your prices are character strings, not numbers. You need to clean your data first.

    Currently you're doing

    dat$PriceRange[dat$price <= 500] <- 0 
    dat$PriceRange[dat$price > 500 & dat$price <= 1000] <- 1
    dat$PriceRange[dat$price > 1000] <- 2
    

    which yields all zeros, because price is still a character column and the comparisons are done on strings:

    dat$PriceRange
    # [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    

    so lm cannot estimate a slope for the constant PriceRange:

    lm(review ~ PriceRange, dat)$coe
    # (Intercept)  PriceRange 
    #   2.428571          NA 
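
    To see why the comparison misbehaves, it can help to inspect the column first. A minimal sketch, using a small made-up vector p (not the original data): as.numeric fails on the raw strings, and comparing a string with a number converts the number to a string first, so the comparison is alphabetical rather than numeric.

    ```r
    # Hypothetical example values, not the original data
    p <- c("$102.00", "$539.00", "$1,258.00")

    is.character(p)
    # [1] TRUE

    # Direct conversion fails because of the "$" and ","
    suppressWarnings(as.numeric(p))
    # [1] NA NA NA

    # 500 is coerced to the string "500", and "1" sorts before "5"
    "1000" < 500
    # [1] TRUE
    ```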
    

    Now we clean price with gsub. The pattern removes the dollar sign $ (which needs to be escaped) or (|) the comma used as a thousands separator.

    dat <- transform(dat, price=as.numeric(gsub('\\$|,', '', price)))
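
    A quick check after cleaning, as a sketch: if a value still contains unexpected characters, as.numeric returns NA for it, so counting NAs shows whether the pattern covered everything. The vector raw below is made up for illustration.

    ```r
    # Hypothetical raw values, including one with a thousands separator
    raw <- c("$102.00", "$44.10", "$1,258.00")

    clean <- as.numeric(gsub('\\$|,', '', raw))
    clean
    # [1]  102.0   44.1 1258.0

    # Zero NAs means every string was converted successfully
    sum(is.na(clean))
    # [1] 0
    ```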
    

    Now price is numeric, so the comparisons work as intended:

    dat$PriceRange[dat$price <= 500] <- 0 
    dat$PriceRange[dat$price > 500 & dat$price <= 1000] <- 1
    dat$PriceRange[dat$price > 1000] <- 2
    
    dat$PriceRange
    # [1] 0 0 2 0 1 2 0 0 2 0 0 0 2 0
    

    And lm now works:

    lm(review ~ PriceRange, dat)$coe
    # (Intercept)  PriceRange 
    #   2.5350318  -0.1656051 
    

    More simply, you could use cut to create the variable (assuming the data is already clean). With labels=FALSE, cut returns the integer codes 1, 2, 3, so we subtract 1 to get 0, 1, 2.

    dat <- transform(dat,
                     PriceRange=cut(price, c(0, 500, 1000, Inf), 
                                    labels=FALSE) - 1)
    lm(review ~ PriceRange, dat)$coe
    # (Intercept)  PriceRange 
    #   2.5350318  -0.1656051 
    

    Note that you are coding a categorical variable as a single numeric one, which forces the effect of each step between price ranges to be equal and might be statistically problematic!
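
    One way to treat the ranges as categorical, as a sketch: store them as a factor, and lm will build one dummy variable per non-reference level. The column name PriceCat and the level labels are made up here, and the data frame is a hand-cleaned copy of the example data with price already numeric.

    ```r
    # Cleaned version of the example data (price already numeric)
    dat <- data.frame(
      review = c(4, 4, 1, 3, 2, 2, 3, 0, 2, 3, 2, 3, 4, 1),
      price  = c(102, 179, 1140, 104, 539, 1090, 149, 44.1, 1500,
                 200, 153, 58, 1258, 350)
    )

    # PriceCat is a hypothetical column name; cut() returns a factor
    dat$PriceCat <- cut(dat$price, c(0, 500, 1000, Inf),
                        labels = c("low", "mid", "high"))

    # Dummy columns that lm builds internally ("low" is the reference)
    head(model.matrix(~ PriceCat, dat), 3)

    # One coefficient per non-reference level
    fit <- lm(review ~ PriceCat, dat)
    names(coef(fit))
    # [1] "(Intercept)"  "PriceCatmid"  "PriceCathigh"
    ```

    Each coefficient is then the difference in mean review between that range and the reference range, rather than a single slope across all ranges.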


    Data:

    dat <- structure(list(review = c(4L, 4L, 1L, 3L, 2L, 2L, 3L, 0L, 2L, 
    3L, 2L, 3L, 4L, 1L), price = c("$102.00", "$179.00", "$1140.00", 
    "$104.00", "$539.00", "$1090.00", "$149.00", "$44.10", "$1500.00", 
    "$200.00", "$153.00", "$58.00", "$1,258.00", "$350.00")), class = "data.frame", row.names = c(NA, 
    -14L))