Here is the data I am working with:
x <- getURL("https://raw.githubusercontent.com/dothemathonthatone/maps/master/testmain.csv")
data <- read.csv(text = x)
I want to make a dummy variable for the top, middle, and lower third of the values in year_hh_inc
. Every value in my id column reg_schl
potentially has more than one value for year_hh_inc
, so the dummy variable needs to group on reg_schl
.
I want to be able to differentiate the values in year_hh_inc
within each unique reg_schl
.
so far I have the following which is posted below as solution from Sotos:
data %>%
group_by(reg_schl) %>%
mutate(category = cut(year_hh_inc, breaks = (quantile(year_hh_inc, c(0, 1 / 3, 2 / 3, 1), na.rm = TRUE)), labels = c("low", "middle", "high"), include.lowest = TRUE), vals = 1) %>%
pivot_wider(names_from = category, values_from = vals, values_fill = list(vals = 0))
This is working well.
I have also used this solution provided by Allan:
cut_by_id <- function(x)
{
x$category <- cut(x$year_hh_inc, quantile(x$year_hh_inc, c(0,1/3,2/3,1), na.rm = TRUE),
labels = c("low","middle","high"), include.lowest = TRUE)
return(x)
}
data <- do.call(rbind, lapply(split(data, data$id), cut_by_id))
We can create your factor variable based on quantiles and spread those values, i.e.
library(dplyr)
library(tidyr)
data %>%
group_by(id) %>%
mutate(category = cut(inc, breaks = (quantile(inc, c(0, 1 / 3, 2 / 3, 1), na.rm = TRUE)), labels = c("low", "middle", "high"), include.lowest = TRUE), vals = 1) %>%
pivot_wider(names_from = category, values_from = vals, values_fill = list(vals = 0))
which gives,
# A tibble: 15 x 8 # Groups: id [3] id inc fee fert fee_per_inc low middle high <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1 11000 125 0.15 0.0114 1 0 0 2 1 15000 150 0.12 0.01 1 0 0 3 1 17000 175 0.22 0.0103 0 1 0 4 1 19000 200 0.13 0.0105 0 0 1 5 1 21000 225 0.12 0.0107 0 0 1 6 2 13000 55 0.11 0.00423 1 0 0 7 2 16000 75 0.09 0.00469 1 0 0 8 2 19000 85 0.23 0.00447 0 1 0 9 2 21000 95 0.05 0.00452 0 0 1 10 2 25000 105 0.01 0.0042 0 0 1 11 3 18000 75 0.25 0.00417 1 0 0 12 3 21000 85 0.03 0.00405 1 0 0 13 3 23000 95 0.05 0.00413 0 1 0 14 3 27000 105 0.15 0.00389 0 0 1 15 3 30000 115 0.25 0.00383 0 0 1
NOTE I added the argument include.lowest = TRUE
in cut
in order to capture the lowest value in the first label (low
)