I am trying to save and load lightgbm datasets using the save_binary command.
The following seems to work for the saving part:
import numpy as np
import lightgbm as lgb
data = lgb.Dataset(np.array([[1,2],[12,2]]))
data.save_binary('test.bin')
But so far, I have not been able to load the dataset back. Does anyone have an idea how I should proceed here?
Many thanks!
Short Answer
You can create a new Dataset from a file created with .save_binary() by passing a path to that file to the data argument of lgb.Dataset().
Try this example with Python 3.7, numpy==1.21.0, scikit-learn==0.24.1, and lightgbm==3.2.1.
import lightgbm as lgb
from numpy.testing import assert_equal
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
# construct a Dataset from arrays in memory
dataset_in_mem = lgb.Dataset(
    data=X,
    label=y
)
dataset_in_mem.construct()
# save that dataset to a file
dataset_in_mem.save_binary('test.bin')
# create a new Dataset from that file
dataset_from_file = lgb.Dataset(data="test.bin")
dataset_from_file.construct()
# confirm that the Datasets are the same
print("--- X ---")
print(f"num rows: {X.shape[0]}")
print(f"num features: {X.shape[1]}")
print("--- in-memory dataset ---")
print(f"num rows: {dataset_in_mem.num_data()}")
print(f"num features: {dataset_in_mem.num_feature()}")
print("--- dataset from file ---")
print(f"num rows: {dataset_from_file.num_data()}")
print(f"num features: {dataset_from_file.num_feature()}")
# check that labels are the same
assert_equal(dataset_in_mem.label, y)
assert_equal(dataset_from_file.label, y)
--- X ---
num rows: 569
num features: 30
--- in-memory dataset ---
num rows: 569
num features: 30
--- dataset from file ---
num rows: 569
num features: 30
Description
LightGBM training requires some pre-processing of raw data, such as binning continuous features into histograms and dropping features that are unsplittable. This pre-processing is done one time, in the "construction" of a LightGBM Dataset object.
In the Python package (lightgbm), it's common to create a Dataset from arrays in memory. If you want to then re-use that Dataset many times (for example, to perform hyperparameter tuning) without needing to repeat that construction work, you can do it one time and then save the Dataset to a file with .save_binary().
When you later want to recreate the Dataset object in memory, pass the path to that file to the data argument of lgb.Dataset(), as shown in the sample code above.
NOTE: The Dataset object stored to disk will not include your raw data. So, in the sample code above, dataset_from_file.data is None. This is done for efficiency: once LightGBM has created its own "constructed" representation of the training data, it no longer needs the raw data.