Search code examples
rcategorical-datadummy-variableindicator

R: Create indicator variables from data and extract category


I have a data frame df that stores the average height in cm of several thousand plants in different years:

Name    Year    Height
Plant1  2010    440
Plant2  2011    60
Plant1  2011    1980
Plant3  2013    650
Plant4  2016    210

I want to do the following:

a) Create a variable for every height interval of 50 cm between 400 cm and 2000 cm (inclusive) with two variables <400 and >2000. df should look like this:

Name    Year    Height h_0_400 h_400 h_450 h_500 h_550 etc.
Plant1  2010    440    
Plant2  2011    60
Plant1  2011    1980
Plant3  2013    640
Plant4  2016    210

b) Assign the variables 0 or 1 depending on the actual height:

Name    Year    Height h_0_400 h_400 h_450 h_500 h_550 etc.
Plant1  2010    440    0       1     0     0     0
Plant2  2011    60     1       0     0     0     0
Plant1  2011    1980   0       0     0     0     0
Plant3  2013    640    0       0     0     0     0
Plant4  2016    210    1       0     0     0     0

c) Add a variable that indicates which category of heights the entry belongs to

Name    Year    Height h_0_400 h_400 h_450 h_500 h_550 etc. height_index
Plant1  2010    440    0       1     0     0     0          h_400
Plant2  2011    60     1       0     0     0     0          h_0_400
Plant1  2011    1980   0       0     0     0     0          h_1950
Plant3  2013    640    0       0     0     0     0          h_600
Plant4  2016    210    1       0     0     0     0          h_0_400

I am not sure how to approach the problem and would appreciate any insight. So far, I've tried to work with seq(400,2000,by=1) and drop the values that are not needed, but that seems very inefficient. I am happy to work with any package. Thanks a lot!


Solution

  • An option is to use cut (or using findInterval) to create a variable group and then reshape to wide format

    library(dplyr)
    library(tidyr)
    library(stringr)
    out <- df %>%
       # // create grouping variable with cut based on the Height
       mutate(ind = cut(Height, breaks = c(-Inf, c(0, seq(400, 2000,
              by = 50 ))), labels = c('h_0_400', 
      str_c('h_', seq(400, 2000, by = 50)))), height_index = ind, n = 1)  %>%
       # // reshape to wide format
       pivot_wider(names_from = ind, values_from = n, values_fill= list(n = 0))
    
    # // missing columns are created with setdiff and assigned to 0
    out[setdiff(levels(out$height_index), out$height_index)] <- 0
    

    data

    df <- structure(list(Name = c("Plant1", "Plant2", "Plant1", "Plant3", 
    "Plant4"), Year = c(2010L, 2011L, 2011L, 2013L, 2016L), Height = c(340L, 
    60L, 1980L, 650L, 210L)), class = "data.frame", row.names = c(NA, 
    -5L))