Search code examples
rlookupsapply

Create new variable using a lookup table


I want to create a new variable using a lookup table. The dataframe looks like this:

  id    sex     age length
   1    Female  1   45
   2    Female  2   54
   3    Female  3   56
   4    Female  4   60
   5    Female  5   60
   6    Female  6   61
   7    Female  7   63
   8    Male    1   55
   9    Male    2   54
   10   Male    3   58
   11   Male    4   61
   12   Male    5   65
   13   Male    6   63
   14   Male    7   65
   15   Male    8   67
   16   Male    9   68
   17   Male    10  69

and the lookup table looks like this

sex    age  length
Female  1   50
Female  2   53
Female  3   56
Female  4   58
Female  5   60
Female  6   61
Female  7   63
Male    1   50
Male    2   54
Male    3   57
Male    4   60
Male    5   62
Male    6   63
Male    7   65
Male    8   66
Male    9   67
Male    10  69

I want to create a new variable growth.rate with two levels: "Normal" and "Low", so the final data frame looks like this ,

id   sex   age  length  growth.rate
1   Female  1   45  Low
2   Female  2   54  Normal
3   Female  3   56  Low
4   Female  4   60  Normal
5   Female  5   60  Low
6   Female  6   61  Low
7   Female  7   63  Low
8   Male    1   55  Normal
9   Male    2   54  Low
10  Male    3   58  Normal
11  Male    4   61  Normal
12  Male    5   65  Normal
13  Male    6   63  Low
14  Male    7   65  Low
15  Male    8   67  Normal
16  Male    9   68  Normal
17  Male    10  69  Low

In this example, the growth.rate for id 1 is "Low" because her length is lower than the value in the lookup table for females age 1.

Conversely, the growth.rate for id 2 is "Normal" because her length is higher than the value in the lookup table for females age 2.

I tried to adapt this solution without success Getting contextstack overflow error - too many nested ifelse statements within for loop?

any help is much appreciated


Solution

  • If we do a left_join betweeen the first and lookup dataset based on 'sex', 'age, we get two 'length' column, do the comparison between those columns and create a new column with ifelse or case_when

    library(dplyr)
    left_join(df1, lookup, by = c('sex', 'age')) %>%
        transmute(id, sex, age, 
          growth.rate = case_when(length.x <= length.y ~ "Low", 
            TRUE ~ "Normal"), length = length.x)
    #   id    sex age growth.rate length
    #1   1 Female   1         Low     45
    #2   2 Female   2      Normal     54
    #3   3 Female   3         Low     56
    #4   4 Female   4      Normal     60
    #5   5 Female   5         Low     60
    #6   6 Female   6         Low     61
    #7   7 Female   7         Low     63
    #8   8   Male   1      Normal     55
    #9   9   Male   2         Low     54
    #10 10   Male   3      Normal     58
    #11 11   Male   4      Normal     61
    #12 12   Male   5      Normal     65
    #13 13   Male   6         Low     63
    #14 14   Male   7         Low     65
    #15 15   Male   8      Normal     67
    #16 16   Male   9      Normal     68
    #17 17   Male  10         Low     69
    

    In data.table, this can be made more compact

    library(data.table)
    setDT(df1)[lookup, growth.rate := fcase(length <= i.length, "Low", 
               "Normal"), on = .(sex, age)]
    

    Or with an index

    setDT(df1)[lookup, growth.rate := 
           c("Normal", "Low")[1 + (length <= i.length)], on = .(sex, age)]
    

    data

    df1 <- structure(list(id = 1:17, sex = c("Female", "Female", "Female", 
    "Female", "Female", "Female", "Female", "Male", "Male", "Male", 
    "Male", "Male", "Male", "Male", "Male", "Male", "Male"), age = c(1L, 
    2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L
    ), length = c(45L, 54L, 56L, 60L, 60L, 61L, 63L, 55L, 54L, 58L, 
    61L, 65L, 63L, 65L, 67L, 68L, 69L)), class = "data.frame", row.names = c(NA, 
    -17L))
    
    lookup <- structure(list(sex = c("Female", "Female", "Female", "Female", 
    "Female", "Female", "Female", "Male", "Male", "Male", "Male", 
    "Male", "Male", "Male", "Male", "Male", "Male"), age = c(1L, 
    2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L
    ), length = c(50L, 53L, 56L, 58L, 60L, 61L, 63L, 50L, 54L, 57L, 
    60L, 62L, 63L, 65L, 66L, 67L, 69L)), class = "data.frame", row.names = c(NA, 
    -17L))