I have two massive datasets. I created a simple example (below) to replicate the structure of my data: DT_ade has variables a, d, e, while DT_abce has variables a, b, c, e.
I would like to compute the average of d by a-b-c. In the code below, I merged the two DTs based on the values of a-e. Because I am not interested in variable e, I just selected the remaining variables after the join (a, b, c, d). I then computed the average of d by a-b-c.
library(data.table)
DT_ade = data.table(a = c("a1", "a1", "a2", "a3", "a3"),
                    d = c(10, 20, 30, 40, 80),
                    e = c("e1", "e2", "e2", "e2", "e3"))
DT_abce = data.table(a = c("a1", "a1", "a1", "a1", "a2", "a3", "a3"),
                     b = c("b1", "b1", "b2", "b2", "b2", "b2", "b2"),
                     c = c("c1", "c1", "c2", "c2", "c2", "c2", "c2"),
                     e = c("e1", "e2", "e1", "e2", "e2", "e2", "e3"))
DT_ade[
  DT_abce, on = .(a, e), .(a, b, c, d)
][, .(mean_d = mean(d, na.rm = TRUE)),
  .(a, b, c)]
The code works on this simple example and gives the following output:
a b c mean_d
1: a1 b1 c1 15
2: a1 b2 c2 15
3: a2 b2 c2 30
4: a3 b2 c2 60
However, when applying this code to my real data, I get the following error message due to the size of my data:
Error in gforce(thisEnv, jsub, o__, f__, len__, irows) :
Internal error: Failed to allocate counts or TMP when assigning g in gforce
I am wondering if there is any less demanding way to do this. For instance, could I compute the mean of d by a-b-c without generating the full a-b-c-d table as an intermediate step?
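For example, something along these lines, pre-aggregating d by a-e before the join so the intermediate table stays smaller (just a sketch to illustrate the kind of thing I mean; I have not checked whether it is actually less demanding):
library(data.table)
# sketch: pre-aggregate d by a-e first (smaller than the full join),
# then recombine the partial sums/counts when grouping by a-b-c
tmp = DT_ade[, .(sum_d = sum(d, na.rm = TRUE), n_d = sum(!is.na(d))), by = .(a, e)]
tmp[DT_abce, on = .(a, e)][
  , .(mean_d = sum(sum_d, na.rm = TRUE) / sum(n_d, na.rm = TRUE)), by = .(a, b, c)]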
I'll expand on Waldi's recommendation of arrow with an example.
Up front: the benefits of arrow come from its lazy retrieval and processing when dealing with large amounts of data. While I use arrow::arrow_table here, that is only because it's convenient for this example; in your case, the data should be parquet files on the filesystem (preferably local).
That is, ideally you would create these objects as:
ds_ade <- arrow::open_dataset("path/to/ade.parquet")
ds_abce <- arrow::open_dataset("path/to/abce.parquet")
But for now, I'll use:
ds_ade <- arrow::arrow_table(DT_ade)
ds_abce <- arrow::arrow_table(DT_abce)
Each of these is just a reference to the arrow object: no data is pulled yet, printing only shows what the data looks like:
ds_ade
# Table
# 5 rows x 3 columns
# $a <string>
# $d <double>
# $e <string>
From here, a relatively simple dplyr pipe can be used; it does not pull data or do the calculations yet:
library(dplyr)
left_join(ds_ade, ds_abce, by = c("a", "e")) %>%
  group_by(a, b, c) %>%
  summarize(mean_d = mean(d, na.rm = TRUE)) %>%
  ungroup()
# Table (query)
# a: string
# b: string
# c: string
# mean_d: double
# See $.data for the source Arrow object
We just add %>% collect() to that, and the data materializes (i.e., is calculated and then pulled into R):
left_join(ds_ade, ds_abce, by = c("a", "e")) %>%
  group_by(a, b, c) %>%
  summarize(mean_d = mean(d, na.rm = TRUE)) %>%
  ungroup() %>%
  collect()
# # A tibble: 4 × 4
# a b c mean_d
# <chr> <chr> <chr> <dbl>
# 1 a1 b2 c2 15
# 2 a1 b1 c1 15
# 3 a2 b2 c2 30
# 4 a3 b2 c2 60
Don't confuse "speed of data.table
-vs-dplyr
" with what is going on here: you're losing no speed, since the calculations are being done in arrow, not in data.table
or dplyr
. It is not until the collect()
that the calculations are actually done (in arrow) and then R gets to first see the data. If you then choose to as.data.table
this data (for the rest of your processing), that's fine.
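For instance, a minimal sketch of that last step, assuming you want to keep working in data.table afterwards:
library(data.table)
library(dplyr)
# collect() materializes the arrow result as a tibble;
# as.data.table() then converts it for downstream data.table processing
res <- left_join(ds_ade, ds_abce, by = c("a", "e")) %>%
  group_by(a, b, c) %>%
  summarize(mean_d = mean(d, na.rm = TRUE)) %>%
  ungroup() %>%
  collect() %>%
  as.data.table()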
The "pain" of this approach will be to put your data into parquet files in the first place. Since your data is fairly large, it seems likely that when it is loaded into R, you are in a fragile situation: if your windows computer decides it must update now and reboot, or if you lose power (and UPS/batteries fail), or something else happens, you're sunk, and need to create or load all of the data again.
In my experience, depending on your real data, you might want to load your CSV into R first anyway, do some simple processing (e.g., convert to POSIXt, set factors and levels, etc., but no reshaping/aggregation), and then save to parquet. When saved, the underlying data classes and attributes are preserved. In fact, since you are saving a data.table, you can very quickly (inexpensively) run ds_ade %>% head() %>% collect() and see that (1) it is nearly instantaneous, and (2) it is a data.table. If you have other classed objects (POSIXt, factor), they will retain those classes. filtering is efficient, especially when the variables are numeric. (But I'm not an arrow maintainer/designer/engineer, so it's always possible I have misunderstood and therefore misrepresented some portions of the laziness here.)
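As a small illustration of that lazy filtering (just a sketch; the filter condition is arbitrary):
library(dplyr)
# the filter and head are pushed down to arrow; only the matching rows
# are materialized in R when collect() is called
ds_ade %>%
  filter(a == "a1") %>%
  head() %>%
  collect()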