I have a disk frame with these columns
key_a
key_b
key_c
value
Say the disk frame is 200M rows and I'd like to group it by key_b. Additionally, I want to keep the underlying disk frame intact and unchanged so I can later join it to something else on key_c or aggregate it on key_a. I'm concerned that srckeep affects the underlying disk frame.
Will either of these work? If so, can I expect one to be faster than the other?
df %>%
srckeep(c("value", "key_b")) %>%
group_by(key_b) %>%
summarize(avg = mean(value)) %>%
collect
df[,
.(avg = mean(value)),
.(key_b),
keep = c("value", "key_b")
]
How will either of these aggregations affect the underlying disk frame? I had an experience earlier where I assigned an aggregation to a variable and then ran delete(aggregation), but it deleted the entire disk frame.
When you apply an operation, it doesn't change the underlying disk.frame at all!
srckeep only affects which columns get used! It loads only the columns named in srckeep into memory when doing the processing. Again, it doesn't affect the underlying data at all.
The one exception is if you call write_disk.frame(some_other_diskf, "to/location_of_disk.frame.df", overwrite=TRUE), which will overwrite the old disk.frame at that location.
The disk.frame is always on disk. You can see where it is with attr(diskf, "path")
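If you want to convince yourself, here is a minimal sketch that checks the points above on a toy disk.frame. The data, the temporary output folder, and the variable names (toy, diskf, agg) are made up for illustration; only the column names come from the question. It assumes the disk.frame and dplyr packages are installed.

```r
library(disk.frame)
library(dplyr)

# Start a single background worker (enough for a toy example)
setup_disk.frame(workers = 1)

# Hypothetical stand-in for the 200M-row disk.frame in the question
toy <- data.frame(
  key_a = rep(c("x", "y"), each = 4),
  key_b = rep(c("p", "q"), times = 4),
  key_c = 1:8,
  value = as.numeric(1:8)
)

# Write the toy data to a temporary folder as a disk.frame
diskf <- as.disk.frame(
  toy,
  outdir = file.path(tempdir(), "toy.df"),
  overwrite = TRUE
)

# Record what the source looks like before aggregating
cols_before <- names(diskf)
rows_before <- nrow(diskf)

# srckeep limits which columns are read from disk for this pipeline only
agg <- diskf %>%
  srckeep(c("value", "key_b")) %>%
  group_by(key_b) %>%
  summarize(avg = mean(value)) %>%
  collect()

# The source on disk should still have every column and every row
stopifnot(identical(names(diskf), cols_before))
stopifnot(nrow(diskf) == rows_before)

# Where the disk.frame lives on disk
attr(diskf, "path")
```

Because the result is collected into memory, agg is an ordinary data frame and is independent of the files under attr(diskf, "path").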