r, dataframe, split, sample

Select a random sample by IDs


I have a dataframe with 811777 rows and 133 different worker IDs. My dataframe looks like this:

    PERS_ID           NEU_DATUM
 1       22 2022-03-01 00:00:00
 2       22 2022-03-01 00:00:00
 3       22 2022-03-01 00:00:00
 4       22 2022-03-01 00:00:00
 5       22 2022-03-01 00:00:00
 6       22 2022-03-01 00:00:00
 7       22 2022-03-01 00:00:00
 8       22 2022-03-01 00:00:00
 9       22 2022-03-01 00:00:00
 10      22 2022-03-01 00:00:00

The first 10 rows show only one worker, with the ID "22", but as I said above, my df has 133 different worker IDs. I want to pick 50 random worker IDs and create a new df from them. I don't want just one row per ID; I want every row that has one of those worker IDs. So my new df should contain all rows for 50 randomly chosen workers. I already tried with sample, but I couldn't get it to work :(. Thanks in advance!


Solution

  • If your data frame is called df, you can do the following in base R:

    df[df$PERS_ID %in% sample(unique(df$PERS_ID), 50), ]
    

    or with data.table (note that setDT converts df to a data.table in place):

    library(data.table)
    setDT(df)[PERS_ID %in% sample(unique(PERS_ID),50)]
    

    or with dplyr:

    library(dplyr)
    df %>% filter(PERS_ID %in% sample(unique(PERS_ID),50))
    

    You can also use a join: sample 50 distinct IDs, then join them back to df so that every row for those IDs is kept. One such approach with dplyr:

    inner_join(
      df,
      df %>% distinct(PERS_ID) %>% slice_sample(n = 50),
      by = "PERS_ID"
    )
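
    If you want the draw to be repeatable and want to check the result, a minimal base R sketch might look like the following. The toy data (133 IDs with 10 rows each), the seed value, and the names ids and df_sub are assumptions made for illustration, not part of the original data.

```r
# Toy data standing in for the real df (assumed: 133 IDs, 10 rows each)
df <- data.frame(
  PERS_ID   = rep(1:133, each = 10),
  NEU_DATUM = as.POSIXct("2022-03-01 00:00:00", tz = "UTC")
)

set.seed(42)  # arbitrary seed so the same 50 IDs are drawn on every run

# Draw 50 distinct IDs (sample() draws without replacement by default)
ids <- sample(unique(df$PERS_ID), 50)

# Keep every row whose PERS_ID is one of the sampled IDs
df_sub <- df[df$PERS_ID %in% ids, ]

length(unique(df_sub$PERS_ID))  # 50 distinct workers
nrow(df_sub)                    # 500 rows here, since each toy ID has 10 rows
```

    Storing the sampled IDs in their own vector also lets you build the complementary set afterwards with df[!df$PERS_ID %in% ids, ].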