
Saving and Loading lightgbm Dataset


I am trying to save and load lightgbm datasets using the save_binary command.

The following seems to work for the saving part:

import numpy as np
import lightgbm as lgb

data = lgb.Dataset(np.array([[1,2],[12,2]]))
data.save_binary('test.bin')

But so far, I have not been able to load the dataset back. Does anyone have an idea how I should proceed here?

Many thanks!


Solution

  • Short Answer

    You can create a new Dataset from a file produced by .save_binary() by passing that file's path to the data argument of lgb.Dataset().

    Try this example with Python 3.7, numpy==1.21.0, scikit-learn==0.24.1, and lightgbm==3.2.1.

    import lightgbm as lgb
    from numpy.testing import assert_equal
    from sklearn.datasets import load_breast_cancer
    
    X, y = load_breast_cancer(return_X_y=True)
    
    # construct a Dataset from arrays in memory
    dataset_in_mem = lgb.Dataset(
        data=X,
        label=y
    )
    dataset_in_mem.construct()
    
    # save that dataset to a file
    dataset_in_mem.save_binary('test.bin')
    
    # create a new Dataset from that file
    dataset_from_file = lgb.Dataset(data="test.bin")
    dataset_from_file.construct()
    
    # confirm that the Datasets are the same
    print("--- X ---")
    print(f"num rows: {X.shape[0]}")
    print(f"num features: {X.shape[1]}")
    
    print("--- in-memory dataset ---")
    print(f"num rows: {dataset_in_mem.num_data()}")
    print(f"num features: {dataset_in_mem.num_feature()}")
    
    print("--- dataset from file ---")
    print(f"num rows: {dataset_from_file.num_data()}")
    print(f"num features: {dataset_from_file.num_feature()}")
    
    # check that labels are the same
    assert_equal(dataset_in_mem.label, y)
    assert_equal(dataset_from_file.label, y)
    
    That script produces the following output:

    --- X ---
    num rows: 569
    num features: 30
    --- in-memory dataset ---
    num rows: 569
    num features: 30
    --- dataset from file ---
    num rows: 569
    num features: 30
    

    Description

    LightGBM training requires some pre-processing of raw data, such as binning continuous features into histograms and dropping features that are unsplittable. This pre-processing is done once, during the "construction" of a LightGBM Dataset object.
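    As an illustration, parameters that affect this pre-processing (such as max_bin, the maximum number of histogram bins per feature) are fixed at construction time. The following is a minimal sketch with made-up data, just to show where construction happens:

    import numpy as np
    import lightgbm as lgb

    X = np.random.rand(1_000, 5)

    # binning parameters such as max_bin are applied when the Dataset is constructed
    ds = lgb.Dataset(data=X, params={"max_bin": 63})
    ds.construct()  # performs the one-time pre-processing (histogram binning, etc.)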

    In the Python package (lightgbm), it's common to create a Dataset from arrays in memory. If you want to re-use that Dataset many times (for example, for hyperparameter tuning) without repeating that construction work, you can construct it once and then save it to a file with .save_binary().
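    For example, a hyperparameter-tuning loop could re-create the Dataset from that binary file on each iteration rather than re-processing the raw arrays. This is a rough sketch (the parameter grid is made up, and it assumes "test.bin" was written as in the sample code above):

    import lightgbm as lgb

    for learning_rate in [0.05, 0.1]:
        # loading from the binary file skips the raw-data construction work
        train_set = lgb.Dataset(data="test.bin")
        booster = lgb.train(
            params={
                "objective": "binary",
                "learning_rate": learning_rate,
                "verbosity": -1
            },
            train_set=train_set,
            num_boost_round=10
        )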

    When you later want to re-create that Dataset, pass the file path to the data argument of lgb.Dataset(), as shown in the sample code above.

    NOTE: The Dataset saved to disk does not include your raw data. So, in the sample code above, dataset_from_file.data is None. This is done for efficiency: once LightGBM has created its own "constructed" representation of the training data, it no longer needs the raw data.
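    Continuing from the sample code above, you can confirm this yourself:

    # the Dataset re-created from "test.bin" does not carry the raw feature matrix
    print(dataset_from_file.data)  # None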