Search code examples
rdplyrtidyversedata-wrangling

Extract the first N observations of groups according to another dataframe


I would like to extract the first observations of different groups from a dataframe. The number of observations comes from another dataframe.

Here is an example: In dataframe "x" there are different groups of animals with their names.

x <- data.frame(animals = c("Mice", "Mice", "Cats", "Cats", "Dogs", "Dogs", "Birds", "Birds"),
       names = c("Jerry", "Speedy", "Tom", "Garfield", "Snoopy", "Pluto", "Tweety", "Donald"))

In dataframe "y", on the other hand, the group of animals is also shown, plus the number of observations I would like to have for the respective group.

y <- data.frame(animals = c("Mice", "Cats", "Dogs", "Birds"),
       firstObs = c(1, 2, 1, 2))

For the mice I would like to have the first observation, for the cats the first two (here all), for the dogs again only the first observation and for the birds finally again the first two. This result should be stored in data frame "z".

z <- data.frame(animals = c("Mice", "Cats", "Cats", "Dogs", "Birds", "Birds"),
       names = c("Jerry", "Tom", "Garfield", "Snoopy", "Tweety", "Donald"))

I had thought of using dplyr, grouping by "animals" and then using the head() function. But unfortunately I don't know how to pass different values for head(n=) according to the groups....

I had then thought of duplicating the rows in "y" and applying a left-join.

y.2 <- data.frame(animals = c("Mice", "Cats", "Cats", "Dogs", "Birds", "Birds"))

y.2 %>% left_join(x, by = "animals")

However, then not the first or first two observations are joined but y.2 is extended.

I would therefore like to ask you if you have a solution to my problem. I would really appreciate it and thank you very much in advance!


Solution

  • Create logical condition with row_number() on the 'first_Obs' after joining the two datasets

    library(dplyr)
    left_join(x, y) %>% 
       group_by(animals) %>% 
       filter(row_number() <= first(firstObs)) %>% 
       ungroup %>%
       select(-firstObs)
    

    -output

    # A tibble: 6 × 2
      animals names   
      <chr>   <chr>   
    1 Mice    Jerry   
    2 Cats    Tom     
    3 Cats    Garfield
    4 Dogs    Snoopy  
    5 Birds   Tweety  
    6 Birds   Donald  
    

    Or use match with slice

    x %>% 
      group_by(animals) %>% 
      slice(seq_len(y$firstObs[match(cur_group()$animals, y$animals)])) %>% 
      ungroup
    # A tibble: 6 × 2
      animals names   
      <chr>   <chr>   
    1 Birds   Tweety  
    2 Birds   Donald  
    3 Cats    Tom     
    4 Cats    Garfield
    5 Dogs    Snoopy  
    6 Mice    Jerry