Search code examples
rdataframejoindplyroverlap

If two columns in separate dataframes match then pull numbers from another column r


I have two dataframes were I want to match age and height to the percentile they fall within (according to WHO guidelines). So if the ages in df_per and df_height match, find the percentile column in df_per that the height falls in from df_height. The percentiles (columns P3, P15, P50, P85, P97) contain heights (cm). I think I may need to create a min and max column for each percentile so there is a set range for the heights to fall between. So if the age is 0 days and the height 49 cm, the percentile would be P15 as it is > 47.217 and <49.148.

Age <- c(0,1,2,3,4,5,6)
P3 <- c(45.644,45.808,45.971,46.134,46.297,46.461,46.624)
P15 <- c(47.217,47.383,47.549,47.714,47.88,48.046,48.212)
P50 <- c(49.148,49.317,49.485,49.654,49.823,49.992,50.161)
P85 <- c(51.078,51.25,51.422,51.594,51.766,51.938,52.11)
P97 <- c(52.651,52.825,53,53.175,53.349,53.524,53.698)
    
df_per <- data.frame(Age, P3, P15, P50, P85, P97)
df_per
  Age     P3    P15    P50    P85    P97
1   0 45.644 47.217 49.148 51.078 52.651
2   1 45.808 47.383 49.317 51.250 52.825
3   2 45.971 47.549 49.485 51.422 53.000
4   3 46.134 47.714 49.654 51.594 53.175
5   4 46.297 47.880 49.823 51.766 53.349
6   5 46.461 48.046 49.992 51.938 53.524
7   6 46.624 48.212 50.161 52.110 53.698
Age <- c(0,1,2,2,4,6,6)
Height <- c(49,50.4,48.8,51.5,52.0,46.8,49)

df_height <- data.frame(Age,Height)
df_height
  Age Height
1   0   49.0
2   1   50.4
3   2   48.8
4   2   51.5
5   4   52.0
6   6   46.8
7   6   49.0

Output that I'd like to get

  Age Height Percentile
1   0   49.0        P15
2   1   50.4        P50
3   2   48.8        P15
4   2   51.5        P85
5   4   52.0        P85
6   6   46.8         P3
7   6   49.0        P15

Thank you in advance for any help!


Solution

  • After pivoting to long, you can use fuzzyjoin to do an overlap join:

    # First, some data manipulation (pivot df_per to long)
    library(dplyr)
    df_per <- df_per %>%
      pivot_longer(-Age, names_to = "percentile", values_to = "start") %>%
      group_by(Age) %>% 
      mutate(end = lead(start, default = Inf))
    
    # # A tibble: 35 × 4
    # # Groups:   Age [7]
    #     Age percentile start   end
    #   <dbl> <chr>      <dbl> <dbl>
    # 1     0 P3          45.6  47.2
    # 2     0 P15         47.2  49.1
    # 3     0 P50         49.1  51.1
    # 4     0 P85         51.1  52.7
    # 5     0 P97         52.7 Inf  
    # 6     1 P3          45.8  47.4
    # 7     1 P15         47.4  49.3
    # 8     1 P50         49.3  51.2
    # 9     1 P85         51.2  52.8
    # # … with 26 more rows
    
    #Then, the actual join
    library(fuzzyjoin)
    df_per %>% 
      fuzzy_inner_join(df_height, .,
                    by = c('Age', 'Height' = 'start', 'Height' = 'end'),
                    match_fun = list(`==`, `>=`, `<=`)) %>% 
      select(Age = Age.x, Height, percentile)
    
    #   Age Height percentile
    # 1   0   49.0        P15
    # 2   1   50.4        P50
    # 3   2   48.8        P15
    # 4   2   51.5        P85
    # 5   4   52.0        P85
    # 6   6   46.8         P3
    # 7   6   49.0        P15
    

    Note that overlap join are already supported in dplyr's development version (as of 2022-09-09) and soon be available on CRAN. You could do:

    #devtools::install_github("tidyverse/dplyr")
    library(dplyr)
    
    inner_join(df_height, df_per, by = join_by(Age, between(Height, start, end)))
    
    #   Age Height percentile
    # 1   0   49.0        P15
    # 2   1   50.4        P50
    # 3   2   48.8        P15
    # 4   2   51.5        P85
    # 5   4   52.0        P85
    # 6   6   46.8         P3
    # 7   6   49.0        P15