Search code examples
rdummy-variablecontinuous

creating dummy quantile variable from continuous variable


Here is the data I am working with:

x <- getURL("https://raw.githubusercontent.com/dothemathonthatone/maps/master/testmain.csv")
    data <- read.csv(text = x)

I want to make a dummy variable for the top, middle, and lower third of the values in year_hh_inc. Every value in my id column reg_schl potentially has more than one value for year_hh_inc, so the dummy variable needs to group on reg_schl. I want to be able to differentiate the values in year_hh_inc within each unique reg_schl.

so far I have the following which is posted below as solution from Sotos:

data %>% 
 group_by(reg_schl) %>%
 mutate(category = cut(year_hh_inc, breaks = (quantile(year_hh_inc, c(0, 1 / 3, 2 / 3, 1), na.rm = TRUE)), labels = c("low", "middle", "high"), include.lowest = TRUE), vals = 1) %>% 
 pivot_wider(names_from = category, values_from = vals, values_fill = list(vals = 0))

This is working well.

I have also used this solution provided by Allan:

cut_by_id <- function(x)
{
  x$category <- cut(x$year_hh_inc, quantile(x$year_hh_inc, c(0,1/3,2/3,1), na.rm = TRUE), 
                    labels = c("low","middle","high"), include.lowest = TRUE)
  return(x)
}

data <- do.call(rbind, lapply(split(data, data$id), cut_by_id))

Solution

  • We can create your factor variable based on quantiles and spread those values, i.e.

    library(dplyr)
    library(tidyr)
    
    data %>% 
     group_by(id) %>%
     mutate(category = cut(inc, breaks = (quantile(inc, c(0, 1 / 3, 2 / 3, 1), na.rm = TRUE)), labels = c("low", "middle", "high"), include.lowest = TRUE), vals = 1) %>% 
     pivot_wider(names_from = category, values_from = vals, values_fill = list(vals = 0))
    

    which gives,

    # A tibble: 15 x 8
    # Groups:   id [3]
          id   inc   fee  fert fee_per_inc   low middle  high
       <dbl> <dbl> <dbl> <dbl>       <dbl> <dbl>  <dbl> <dbl>
     1     1 11000   125  0.15     0.0114      1      0     0
     2     1 15000   150  0.12     0.01        1      0     0
     3     1 17000   175  0.22     0.0103      0      1     0
     4     1 19000   200  0.13     0.0105      0      0     1
     5     1 21000   225  0.12     0.0107      0      0     1
     6     2 13000    55  0.11     0.00423     1      0     0
     7     2 16000    75  0.09     0.00469     1      0     0
     8     2 19000    85  0.23     0.00447     0      1     0
     9     2 21000    95  0.05     0.00452     0      0     1
    10     2 25000   105  0.01     0.0042      0      0     1
    11     3 18000    75  0.25     0.00417     1      0     0
    12     3 21000    85  0.03     0.00405     1      0     0
    13     3 23000    95  0.05     0.00413     0      1     0
    14     3 27000   105  0.15     0.00389     0      0     1
    15     3 30000   115  0.25     0.00383     0      0     1
    

    NOTE I added the argument include.lowest = TRUE in cut in order to capture the lowest value in the first label (low)