How to "serialize" a non-R object together with an R object


Some objects in R are actually pointers to lower-level constructs (not sure if that's the right term for it) that require specialized functions to save to disk. For example, saveRDS is not sufficient to preserve a lightgbm booster:

## Create a lightgbm booster
library(lightgbm)
data(agaricus.train, package = "lightgbm")
train = agaricus.train
bst = lightgbm(data = train$data, label = train$label,
               nrounds = 1, objective = "binary")

## but suppose bst is only one part of a bigger analysis
results = list(bst = bst, metadata = 'other stuff')

## then it would be nice if this IO cycle worked, but the last line crashes R
# saveRDS(results, file = 'so_post_temp')
# rm(results)
# rm(bst)
# lgb.unloader(wipe = TRUE)
# results = readRDS('so_post_temp')
# predict(results$bst, train$data)

The standard solution is not terrible, but it is annoying enough. It requires using a separate lightgbm-specific saver and creating a separate 'companion' file for every analysis that I want to save:

results = list(lgbpath = 'bst.lightgbm', metadata = 'other stuff')
saveRDS(results, file = 'so_post_temp')
lgb.save(bst, filename = 'bst.lightgbm')
# destruct:
rm(results)
rm(bst)
lgb.unloader(wipe = TRUE)
# reconstruct:
results = readRDS('so_post_temp')
bst = lgb.load(results$lgbpath)
predict(bst, train$data)

Is there any way to clean this up to somehow bind R objects and other objects into a single file? Something like

fake_pointer_to_disk = [points to some kind of R object instead]
fake_file_object = lgb.save(bst, file = fake_pointer_to_disk)
results = list(bst = fake_file_object, metadata = 'other stuff')
# later loaded as
bst = lgb.load(results$bst)

Solution

  • I think readBin should suffice:

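    tf <- tempfile()

    # with a real booster, write the model to the temp file:
    # lgb.save(bst, filename = tf)

    # since I don't have lightgbm loaded, this is my fake model/save
    bst <- 100:150 # my fake data
    writeBin(bst, tf) # poor man's lgb.save :-) (the second argument of writeBin is `con`, not `file`)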
    

    Now read it in as a blob:

    rawbst <- readBin(tf, raw(), n=file.size(tf))
    file.remove(tf)
    

    and save it the way you wanted to:

    saveRDS(list(bst = rawbst, metadata = 'other stuff'), file = 'so_post_temp')
    

    When you're ready to re-hydrate your results and model:

    tf2 <- tempfile()
    results <- readRDS('so_post_temp')
    writeBin(results$bst, tf2)
    bst <- lgb.load(tf2)
    file.remove(tf2)
    

    (Caveat: under-tested. The raw round trip worked with the fake data; I have not tried it with a real bst object.)
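
If the temp-file dance above gets repetitive, it could be wrapped in a pair of small helpers. This is only a sketch: `to_raw`, `from_raw`, `fake_save` and `fake_load` are hypothetical names made up for this example (none of them come from lightgbm), and the stand-in saver/loader use base R so the code runs without lightgbm installed.

```r
# Hypothetical helpers (not part of lightgbm): capture whatever a file-based
# saver writes as a raw vector, and later hand it back to a file-based loader.
to_raw <- function(object, saver) {
  tf <- tempfile()
  on.exit(unlink(tf), add = TRUE)   # clean up the temp file on exit
  saver(object, tf)                 # saver must write `object` to the path `tf`
  readBin(tf, what = "raw", n = file.size(tf))
}

from_raw <- function(blob, loader) {
  tf <- tempfile()
  on.exit(unlink(tf), add = TRUE)
  writeBin(blob, tf)                # put the bytes back on disk
  loader(tf)                        # loader must read an object from the path `tf`
}

# Stand-in saver/loader built on base R, playing the role of lgb.save/lgb.load.
fake_save <- function(x, path) writeBin(x, path)
fake_load <- function(path) {
  # integers are written with 4 bytes each by default
  readBin(path, what = "integer", n = file.size(path) / 4L)
}

fake_model <- 100:150
results <- list(bst = to_raw(fake_model, fake_save), metadata = "other stuff")

# One single file now holds both the "model" bytes and the ordinary R data.
rds <- tempfile(fileext = ".rds")
saveRDS(results, rds)
restored <- readRDS(rds)
roundtrip <- from_raw(restored$bst, fake_load)
unlink(rds)
```

With lightgbm available, the saver would presumably be something like `function(b, path) lgb.save(b, filename = path)` and the loader `lgb.load`; that combination is an assumption and has not been run here.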