I have a dataset as following in data.table.
Would like to compare the value and select the rows based on the number of least elements.
Please find the toy sample for a better picture.
set.seed(100)
DT <- data.table(CATA = sample(LETTERS[1:4], 4489, replace = T),
ITEM = sample(LETTERS[24:26], 4489, prob = c(0.4, 0.3, 0.3),replace = T),
VAL = sample(100:999, 4489, replace = T)
)
# 4489 is a random-picked number
If counting the items based on the CATA
DT[,.N, by = .(CATA, ITEM)][order(CATA)]
# CATA ITEM N
# 1: A X 433
# 2: A Y 323
# 3: A Z 342
# 4: B X 452
# 5: B Y 333
# 6: B Z 358
# 7: C X 461
# 8: C Y 302
# 9: C Z 359
# 10: D X 461
# 11: D Y 344
# 12: D Z 321
I can find the min of each category.
DTmin <- DT[,.N, by = .(CATA, ITEM)][order(CATA)][,.(MIN = min(N)), by = CATA]
>DTmin
# CATA MIN
# 1: A 323
# 2: B 333
# 3: C 302
# 4: D 321
what I need is to get pass the DTmin value to get all items has same number under each category, like,
DT[,.N, by = .(CATA, ITEM)][order(CATA)]
# CATA ITEM N
# 1: A X 323 # was 433
# 2: A Y 323
# 3: A Z 323 # was 342
# 4: B X 333 # was 452
# 5: B Y 333
# 6: B Z 333 # was 358
# 7: C X 302 # was 461
# 8: C Y 302
# 9: C Z 302 # was 359
# 10: D X 321 # was 461
# 11: D Y 321 # was 344
# 12: D Z 321
In the end, DT shall has row number of 3837
( sum(DTmin$MIN)*3
)
Here is my way, pls advise more smooth way without breaking the chain or introducing new column.
# create a new index column, and remove it.
DTmin[DT, on = .(CATA)][,tmpV:= sample(.N), by = .(CATA, ITEM)][tmpV<MIN][,tmpV:=NULL]
#check
DTmin[DT, on = .(CATA)][,tmpV:= sample(.N), by = .(CATA, ITEM)][tmpV<MIN][,tmpV:=NULL][,.N, by = .(CATA, ITEM)][order(CATA, ITEM)]
# CATA ITEM N
# 1: A X 322
# 2: A Y 322
# 3: A Z 322
# 4: B X 332
# 5: B Y 332
# 6: B Z 332
# 7: C X 301
# 8: C Y 301
# 9: C Z 301
# 10: D X 320
# 11: D Y 320
# 12: D Z 320
Thanks to @ronak-shan
DT[DTmin, on = 'CATA'][, .SD[sample(.N, first(MIN))], .(CATA, ITEM)]
My intepretation is, DT[DTmin, on = 'CATA']
, as a DT, was sent to next block, select MIN
rows among .N
, limited in the box defined by CATA
& ITEM
.
We can join DT
and DTmin
so that we get the MIN
value in the dataframe and for each CATA
and ITEM
we can select MIN
rows.
library(data.table)
DT[DTmin, on = 'CATA'][, .SD[sample(.N, first(MIN))], .(CATA, ITEM)]
Similarly, with dplyr
:
library(dplyr)
DT %>%
left_join(DTmin, by = 'CATA') %>%
group_by(CATA, ITEM) %>%
sample_n(first(MIN))
All the MIN
value are the same throughout the group we can use any of it, I use the first
one.