Search code examples
rdata.tablesample

Question: Randomly pick rows based on the number of other elements with data.table?


I have a dataset as following in data.table.

Would like to compare the value and select the rows based on the number of least elements.

Please find the toy sample for a better picture.

set.seed(100)
DT <- data.table(CATA = sample(LETTERS[1:4], 4489, replace = T),
                 ITEM = sample(LETTERS[24:26], 4489, prob = c(0.4, 0.3, 0.3),replace = T),
                 VAL  = sample(100:999, 4489, replace = T)
      )

# 4489 is a random-picked number

If counting the items based on the CATA

DT[,.N, by = .(CATA, ITEM)][order(CATA)]
#     CATA ITEM   N
#  1:    A    X 433
#  2:    A    Y 323
#  3:    A    Z 342
#  4:    B    X 452
#  5:    B    Y 333
#  6:    B    Z 358
#  7:    C    X 461
#  8:    C    Y 302
#  9:    C    Z 359
# 10:    D    X 461
# 11:    D    Y 344
# 12:    D    Z 321

I can find the min of each category.

DTmin <- DT[,.N, by = .(CATA, ITEM)][order(CATA)][,.(MIN = min(N)), by = CATA]

>DTmin
#    CATA MIN
# 1:    A 323
# 2:    B 333
# 3:    C 302
# 4:    D 321

what I need is to get pass the DTmin value to get all items has same number under each category, like,

DT[,.N, by = .(CATA, ITEM)][order(CATA)]
#     CATA ITEM   N
#  1:    A    X 323 # was 433
#  2:    A    Y 323
#  3:    A    Z 323 # was 342
#  4:    B    X 333 # was 452
#  5:    B    Y 333
#  6:    B    Z 333 # was 358
#  7:    C    X 302 # was 461
#  8:    C    Y 302
#  9:    C    Z 302 # was 359
# 10:    D    X 321 # was 461
# 11:    D    Y 321 # was 344
# 12:    D    Z 321

In the end, DT shall has row number of 3837 ( sum(DTmin$MIN)*3 )

Here is my way, pls advise more smooth way without breaking the chain or introducing new column.

# create a new index column, and remove it.
DTmin[DT, on = .(CATA)][,tmpV:= sample(.N), by = .(CATA, ITEM)][tmpV<MIN][,tmpV:=NULL]

#check
DTmin[DT, on = .(CATA)][,tmpV:= sample(.N), by = .(CATA, ITEM)][tmpV<MIN][,tmpV:=NULL][,.N, by = .(CATA, ITEM)][order(CATA, ITEM)]

#     CATA ITEM   N
#  1:    A    X 322
#  2:    A    Y 322
#  3:    A    Z 322
#  4:    B    X 332
#  5:    B    Y 332
#  6:    B    Z 332
#  7:    C    X 301
#  8:    C    Y 301
#  9:    C    Z 301
# 10:    D    X 320
# 11:    D    Y 320
# 12:    D    Z 320

Thanks to @ronak-shan

DT[DTmin, on = 'CATA'][, .SD[sample(.N, first(MIN))], .(CATA, ITEM)]

My intepretation is, DT[DTmin, on = 'CATA'], as a DT, was sent to next block, select MIN rows among .N, limited in the box defined by CATA & ITEM.


Solution

  • We can join DT and DTmin so that we get the MIN value in the dataframe and for each CATA and ITEM we can select MIN rows.

    library(data.table)
    DT[DTmin, on = 'CATA'][, .SD[sample(.N, first(MIN))], .(CATA, ITEM)]
    

    Similarly, with dplyr :

    library(dplyr)
    DT %>%
      left_join(DTmin, by = 'CATA') %>%
      group_by(CATA, ITEM) %>%
      sample_n(first(MIN))
    

    All the MIN value are the same throughout the group we can use any of it, I use the first one.