
How does srckeep affect the underlying disk frame?


I have a disk frame with these columns

key_a
key_b
key_c
value

Say the disk frame is 200M rows and I'd like to group it by key_b. Additionally, I want to keep the underlying disk frame intact and unchanged so that I can later join it to something else on key_c or aggregate it on key_a. I'm concerned that srckeep affects the underlying disk frame.

Will either of these work? If so, can I expect one to be faster than the other?

  df %>%
    srckeep(c("value", "key_b")) %>%
    group_by(key_b) %>%
    summarize(avg = mean(value)) %>%
    collect

  df[,
    .(avg = mean(value)),
    .(key_b),
    keep = c("value", "key_b")
    ]

How will either of these aggregations affect the underlying disk frame? I had an experience earlier where I assigned an aggregation to a variable and then ran delete(aggregation), but it deleted the entire disk frame.


Solution

  • When you apply an operation, it doesn't change the underlying disk.frame at all!

    srckeep only affects which columns get read! It loads only the columns listed in srckeep into memory during processing. Again, it doesn't affect the underlying data at all.

    The exception is if you do write_disk.frame(some_other_diskf, "to/location_of_disk.frame.df", overwrite=TRUE), which will overwrite the old disk.frame at that location.

    The disk.frame is always on disk. You can see where it is with attr(diskf, "path").
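The points above can be checked with a minimal sketch on toy data. The data frame `toy`, the output directory under `tempdir()`, and the variable names are all made up for illustration; this assumes the disk.frame and dplyr packages are installed and is not the asker's actual 200M-row setup.

```r
library(disk.frame)
library(dplyr)

setup_disk.frame(workers = 2)

# Hypothetical toy data standing in for the real disk frame
toy <- data.frame(
  key_a = rep(1:2, each = 50),
  key_b = rep(c("x", "y"), times = 50),
  key_c = 1:100,
  value = as.numeric(1:100)
)

# Write the toy data to disk as a disk.frame with two chunks
diskf <- as.disk.frame(
  toy,
  outdir = file.path(tempdir(), "toy.df"),
  nchunks = 2,
  overwrite = TRUE
)

# Record the on-disk shape before aggregating
cols_before <- names(diskf)
rows_before <- nrow(diskf)

# srckeep limits which columns are read; collect returns an in-memory result
agg <- diskf %>%
  srckeep(c("value", "key_b")) %>%
  group_by(key_b) %>%
  summarize(avg = mean(value)) %>%
  collect

# The original disk.frame still has every column and every row
stopifnot(identical(names(diskf), cols_before))
stopifnot(nrow(diskf) == rows_before)

# Location of the files backing the disk.frame
attr(diskf, "path")
```

Note that `agg` is an ordinary in-memory data frame only because of the final `collect`. Without it, the result is most likely still a lazy disk.frame pointing at the same path, which may be why calling delete on an uncollected aggregation removed the original files; treat that as a probable explanation rather than a confirmed one.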