Search code examples
rdata.tablerowsample

How to take a random subsample from rows of a data.table per factor?


I'm trying to create a subsample of a big dataset with various n's for each factor. I want to be able to do this very fast, because I'm doing it hundreds of thousands of times. Could you help me optimise this process using data.tables?

What I do now is add a "rownumber" rn to the data.table, sample that using the various n's, and then do the subset based on that. I think there must be a smarter way of doing this, but I cannot seem to figure it out.

# generate data.table
DT <- data.table(rn = 1:100, factor = letters[1:3],
  value = rnorm(100, c(1, 5, 10)))

# subset based on "row number" with various numbers per category
subsetrn <- DT[, .(rn = sample(rn,
  if (factor == "a") 12
  else if (factor == "b") 20
  else if (factor == "c") 5
  else NULL, replace = TRUE)),
  by = factor]

# subset
ss <- DT[rn %in% subsetrn[, rn]]

EDIT: I've seen this way of quickly sampling from a data.table, but it doesn't do it by factor: https://stackoverflow.com/a/33201094/5252333 I've also seen tricks on how to do it for each factor in equal amounts, but not for each factor in different amounts.

EDIT2: I've been playing around with the solution by @akron but am still having trouble:

If I have one of the factors at 0, something is wrong:

# generate data.table
DT <- data.table(factor = letters[1:4],
  value = rnorm(300, c(1, 5, 10)))

# subset based on "row number" with various numbers per category subsetrn

# index data table with numbers
id <- data.table(factor = letters[1:4], val = c(12, 20, 5, 0))
# map our DT onto it, then subsample by the new val
ssid <- DT[id, on = .(factor)][, sample(.I, val[1], replace = TRUE), factor]
# subset
ss <- DT[ssid[, V1]]
count(ss[, factor])
##   x freq
## 1 a   10
## 2 b   12
## 3 c    5
## 4 d   10

# this is wrong! It only works if I do it like this
ss <- DT[id, on = .(factor)][ssid[, V1]]

I would like to be able to get ssid such that I can just DT[ssid[, V1]] (or DT[ssid]), so that I can do everything by reference instead of making a local copy of the DT. In my application, this is all wrapped in a function that currently makes a copy of a small part of DT 50k times, taking ~25 minutes. The function performs some calculations with the subset and then returs the output. This is slow, and I'd like to figure out if it's possible to do it by reference somehow.

This question might become a bit too specific now ;-).


Solution

  • We could join with a key/value dataset and use .I to sample

    DT[DT[data.table(factor = letters[1:3], val = c(12, 20, 5)), 
          on = .(factor)][, sample(.I, val[1], replace = TRUE), factor]$V1]
    

    If we split this into parts-

    data.table(factor = letters[1:3], val = c(12, 20, 5))
    

    is a key/value data.table to get the 'val' as a column on the original dataset by joining on the 'factor`.

    In the second step, we do the joining

    DT[data.table(factor = letters[1:3], val = c(12, 20, 5)), 
          on = .(factor)]
    

    Now, we sample the row index, grouped by 'factor', specifying the size as the first element of 'val', extract the rowindex column $V1 and use this to subset the original dataset. i.e.

    DT[....$V1]