Matching and sampling large dataframe (loop?)


Hi, I'm trying to match two data frames. I have a large data frame with a million observations, and a second data frame with an ID variable and the size of the random sample to draw for each ID.

Name <- c("Jon", "Bill", "Maria", "Ben", "Tina", "Jack", "Laura")
Gender <- c("male", "male", "female", "male", "female", "male", "female")

bigdf <- data.frame(Name, Gender)

ID <- c("male", "female")
samplesize <- c(1,2)
sampledf <- data.frame(ID, samplesize)

So what I want is to match both data frames and get an outcome like the following (for example):

Name Gender
Ben male
Laura female
Maria female

I tried to create a function like

j <- function(x,y){
output<- filter(bigdf, Gender==x) %>% sample_n(y)
}
mapply(j, sampledf$Gender, sampledf$samplesize)

But the only thing I get is a long waiting time and a lot of empty columns. So it's obvious that I'm doing something wrong.

Any suggestions?

Thanks!


Solution

  • dplyr

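    library(dplyr)

    # Attach each group's requested sample size, then keep that many
    # randomly chosen rows within each Gender group.
    # sample(n(), k) draws k distinct row positions out of the n() rows in
    # the group; sample(k) alone only permutes 1:k, so it would always keep
    # the first k rows. Assumes samplesize never exceeds the group size.
    left_join(bigdf, sampledf, by = c(Gender = "ID")) %>%
      group_by(Gender) %>%
      filter(row_number() %in% sample(n(), first(samplesize))) %>%
      ungroup() %>%
      select(-samplesize)
    # One possible result (the sampled rows vary between runs):
    # # A tibble: 3 × 2
    #   Name  Gender
    #   <chr> <chr> 
    # 1 Jon   male  
    # 2 Maria female
    # 3 Tina  female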
    

  • base R

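    # sample(length(z), z[1]) draws z[1] random row positions within each
    # group; sample(z[1]) alone would always select the first z[1] rows.
    # Assumes samplesize never exceeds the group size.
    merge(bigdf, sampledf, by.x = "Gender", by.y = "ID") |>
      subset(ave(samplesize, Gender,
                 FUN = function(z) seq_along(z) %in% sample(length(z), z[1])) > 0,
             select = -samplesize)
    # One possible result (the sampled rows vary between runs):
    #   Gender  Name
    # 1 female Maria
    # 2 female  Tina
    # 4   male   Jon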
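
Another option, closer in spirit to the mapply attempt in the question, is to split the big data frame by group once and then draw each group's sample by looking the group up by name. The following is a minimal base R sketch, not a benchmarked implementation. The names `groups`, `draw_rows`, and `result` are introduced here for illustration, and the code assumes every ID in sampledf occurs in bigdf$Gender.

```r
# Example data from the question
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina", "Jack", "Laura")
Gender <- c("male", "male", "female", "male", "female", "male", "female")
bigdf <- data.frame(Name, Gender)

ID <- c("male", "female")
samplesize <- c(1, 2)
sampledf <- data.frame(ID, samplesize)

# Split once so each group is a small data frame that can be looked up by name
groups <- split(bigdf, bigdf$Gender)

# Hypothetical helper: draw up to n random rows from the group named id
draw_rows <- function(id, n) {
  g <- groups[[id]]
  # min() guards against a requested size larger than the group
  g[sample(nrow(g), min(n, nrow(g))), , drop = FALSE]
}

# One call per row of sampledf, then stack the pieces into one data frame
result <- do.call(
  rbind,
  Map(draw_rows, as.character(sampledf$ID), sampledf$samplesize)
)
rownames(result) <- NULL
result
```

Because the sample sizes are passed per ID, this returns one male row and two female rows for the example data. Which rows are drawn changes between runs unless set.seed() is called first.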