I have a data frame df
that stores the average height in cm of several thousand plants in different years:
Name Year Height
Plant1 2010 440
Plant2 2011 60
Plant1 2011 1980
Plant3 2013 650
Plant4 2016 210
I want to do the following:
a) Create a variable for every height interval of 50 cm between 400 cm and 2000 cm (inclusive) with two variables <400 and >2000. df
should look like this:
Name Year Height h_0_400 h_400 h_450 h_500 h_550 etc.
Plant1 2010 440
Plant2 2011 60
Plant1 2011 1980
Plant3 2013 640
Plant4 2016 210
b) Assign the variables 0 or 1 depending on the actual height
:
Name Year Height h_0_400 h_400 h_450 h_500 h_550 etc.
Plant1 2010 440 0 1 0 0 0
Plant2 2011 60 1 0 0 0 0
Plant1 2011 1980 0 0 0 0 0
Plant3 2013 640 0 0 0 0 0
Plant4 2016 210 1 0 0 0 0
c) Add a variable that indicates which category of heights
the entry belongs to
Name Year Height h_0_400 h_400 h_450 h_500 h_550 etc. height_index
Plant1 2010 440 0 1 0 0 0 h_400
Plant2 2011 60 1 0 0 0 0 h_0_400
Plant1 2011 1980 0 0 0 0 0 h_1950
Plant3 2013 640 0 0 0 0 0 h_600
Plant4 2016 210 1 0 0 0 0 h_0_400
I am not sure how to approach the problem and would appreciate any insight. So far, I've tried to work with seq(400,2000,by=1)
and drop the values that are not needed, but that seems very inefficient.
I am happy to work with any package. Thanks a lot!
An option is to use cut
(or using findInterval
) to create a variable group and then reshape to wide format
library(dplyr)
library(tidyr)
library(stringr)
out <- df %>%
# // create grouping variable with cut based on the Height
mutate(ind = cut(Height, breaks = c(-Inf, c(0, seq(400, 2000,
by = 50 ))), labels = c('h_0_400',
str_c('h_', seq(400, 2000, by = 50)))), height_index = ind, n = 1) %>%
# // reshape to wide format
pivot_wider(names_from = ind, values_from = n, values_fill= list(n = 0))
# // missing columns are created with setdiff and assigned to 0
out[setdiff(levels(out$height_index), out$height_index)] <- 0
df <- structure(list(Name = c("Plant1", "Plant2", "Plant1", "Plant3",
"Plant4"), Year = c(2010L, 2011L, 2011L, 2013L, 2016L), Height = c(340L,
60L, 1980L, 650L, 210L)), class = "data.frame", row.names = c(NA,
-5L))