Saving and Loading lightgbm Dataset

I am trying to save and load lightgbm datasets using the save_binary command.

The following seems to work for the saving part:

import numpy as np
import lightgbm as lgb

data = lgb.Dataset(np.array([[1,2],[12,2]]))
data.save_binary("test.bin")

But so far, I have not been able to load the dataset back. Does anyone have an idea how I should proceed here?

Many thanks!


  • Short Answer

    You can create a new Dataset from a file created with .save_binary() by passing a path to that file to the data argument of lgb.Dataset().

    Try this example with Python 3.7, numpy==1.21.0, scikit-learn==0.24.1, and lightgbm==3.2.1.

    import lightgbm as lgb
    from numpy.testing import assert_equal
    from sklearn.datasets import load_breast_cancer
    X, y = load_breast_cancer(return_X_y=True)
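    # construct a Dataset from arrays in memory
    dataset_in_mem = lgb.Dataset(
        data=X,
        label=y
    )
    dataset_in_mem.construct()
    # save that dataset to a file
    dataset_in_mem.save_binary("test.bin")
    # create a new Dataset from that file
    dataset_from_file = lgb.Dataset(data="test.bin")
    dataset_from_file.construct()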
    # confirm that the Datasets are the same
    print("--- X ---")
    print(f"num rows: {X.shape[0]}")
    print(f"num features: {X.shape[1]}")
    print("--- in-memory dataset ---")
    print(f"num rows: {dataset_in_mem.num_data()}")
    print(f"num features: {dataset_in_mem.num_feature()}")
    print("--- dataset from file ---")
    print(f"num rows: {dataset_from_file.num_data()}")
    print(f"num features: {dataset_from_file.num_feature()}")
    # check that labels are the same
    assert_equal(dataset_in_mem.label, y)
    assert_equal(dataset_from_file.label, y)

    Output:

    --- X ---
    num rows: 569
    num features: 30
    --- in-memory dataset ---
    num rows: 569
    num features: 30
    --- dataset from file ---
    num rows: 569
    num features: 30


    LightGBM training requires some pre-processing of raw data, such as binning continuous features into histograms and dropping features that are unsplittable. This pre-processing is done one time, in the "construction" of a LightGBM Dataset object.

    In the Python package (lightgbm), it's common to create a Dataset from arrays in memory. If you want to then re-use that Dataset many times (for example, to perform hyperparameter tuning) without needing to repeat that construction work, you can do it one time and then save the Dataset to a file with .save_binary().

    When you later want to load that Dataset back into memory, you can pass the path of the saved file to the data argument in lgb.Dataset(), as shown in the sample code above.

    NOTE: The Dataset object stored to disk will not include your raw data. So, in the sample code above, dataset_from_file.data is None. This is done for efficiency: once LightGBM has created its own "constructed" representation of the training data, it no longer needs the raw data.