I have a large dataset with 200.000+ rows. I want to group it by two columns, filter based on another column, while keeping all the other columns. Sample looks like this:
hv001 hv002 hv101 hv104 hv105
1 1 1 1 1 35
2 1 1 2 2 22
3 1 1 3 2 2
4 1 1 3 2 0
5 1 4 1 1 35
6 1 4 2 2 32
7 1 4 3 2 4
8 1 4 3 2 2
9 1 4 3 1 1
10 1 7 1 1 51
structure(list(hv001 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), hv002 = c(1,
1, 1, 1, 4, 4, 4, 4, 4, 7), hv101 = c(1, 2, 3, 3, 1, 2, 3, 3,
3, 1), hv104 = c(1, 2, 2, 2, 1, 2, 2, 2, 1, 1), hv105 = c(35,
22, 2, 0, 35, 32, 4, 2, 1, 51)), row.names = c(NA, -10L), class = "data.frame")
I can do this with dplyr, however it is too slow. Code looks like this:
df |>
group_by(hv001, hv002) |>
slice_min(hv101)
Now with data.table it is possible to do this much faster, however it does not keep all the columns. How do keep all columns or how do I slice the columns when doing a group by? Code looks like this:
df_dt <- as.data.table(df)
df_dt[, .(min_hv101 = min(hv101)), by = c("hv001", "hv002")]
You could add .SD
to subset your data table like this:
library(data.table)
df_dt <- as.data.table(df)
df_dt[, .SD[(min_hv101 = min(hv101))], by = c("hv001", "hv002")]
#> hv001 hv002 hv101 hv104 hv105
#> 1: 1 1 1 1 35
#> 2: 1 4 1 1 35
#> 3: 1 7 1 1 51
Created on 2023-01-08 with reprex v2.0.2
This answer explains the meaning of .SD
very well: What does .SD stand for in data.table in R