
Saving and Loading lightgbm Dataset


I am trying to save and load lightgbm datasets using the save_binary command.

The following seems to work for the saving part:

import numpy as np
import lightgbm as lgb

data = lgb.Dataset(np.array([[1,2],[12,2]]))
data.save_binary('test.bin')

But so far, I have not been able to load the dataset back. Does anyone have an idea how I should proceed here?

Many thanks!


Solution

  • Short Answer

    You can create a new Dataset from a file produced by .save_binary() by passing that file's path to the data argument of lgb.Dataset().

    Try this example with Python 3.7, numpy==1.21.0, scikit-learn==0.24.1, and lightgbm==3.2.1.

    import lightgbm as lgb
    from numpy.testing import assert_equal
    from sklearn.datasets import load_breast_cancer
    
    X, y = load_breast_cancer(return_X_y=True)
    
    # construct a Dataset from arrays in memory
    dataset_in_mem = lgb.Dataset(
        data=X,
        label=y
    )
    dataset_in_mem.construct()
    
    # save that dataset to a file
    dataset_in_mem.save_binary('test.bin')
    
    # create a new Dataset from that file
    dataset_from_file = lgb.Dataset(data="test.bin")
    dataset_from_file.construct()
    
    # confirm that the Datasets are the same
    print("--- X ---")
    print(f"num rows: {X.shape[0]}")
    print(f"num features: {X.shape[1]}")
    
    print("--- in-memory dataset ---")
    print(f"num rows: {dataset_in_mem.num_data()}")
    print(f"num features: {dataset_in_mem.num_feature()}")
    
    print("--- dataset from file ---")
    print(f"num rows: {dataset_from_file.num_data()}")
    print(f"num features: {dataset_from_file.num_feature()}")
    
    # check that labels are the same
    assert_equal(dataset_in_mem.label, y)
    assert_equal(dataset_from_file.label, y)
    
    That script produces the following output:

    --- X ---
    num rows: 569
    num features: 30
    --- in-memory dataset ---
    num rows: 569
    num features: 30
    --- dataset from file ---
    num rows: 569
    num features: 30
    

    Description

    LightGBM training requires some pre-processing of raw data, such as binning continuous features into histograms and dropping features that are unsplittable. This pre-processing is done once, during the "construction" of a LightGBM Dataset object.
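    As an illustration, parameters that affect this pre-processing (such as max_bin, the maximum number of histogram bins per feature) are fixed at construction time. The following is a minimal sketch with made-up data, just to show where construction happens:

    import numpy as np
    import lightgbm as lgb

    X = np.random.rand(1_000, 5)

    # binning parameters such as max_bin are applied when the Dataset is constructed
    ds = lgb.Dataset(data=X, params={"max_bin": 63})
    ds.construct()  # performs the one-time pre-processing (histogram binning, etc.)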

    In the Python package (lightgbm), it's common to create a Dataset from arrays in memory. If you want to re-use that Dataset many times (for example, for hyperparameter tuning) without repeating that construction work, you can construct it once and then save it to a file with .save_binary().
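    For example, a hyperparameter-tuning loop could re-create the Dataset from that binary file on each iteration rather than re-processing the raw arrays. This is a rough sketch (the parameter grid is made up, and it assumes "test.bin" was written as in the sample code above):

    import lightgbm as lgb

    for learning_rate in [0.05, 0.1]:
        # loading from the binary file skips the raw-data construction work
        train_set = lgb.Dataset(data="test.bin")
        booster = lgb.train(
            params={
                "objective": "binary",
                "learning_rate": learning_rate,
                "verbosity": -1
            },
            train_set=train_set,
            num_boost_round=10
        )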

    When you later want to re-create that Dataset, pass the file path to the data argument of lgb.Dataset(), as shown in the sample code above.

    NOTE: The Dataset saved to disk does not include your raw data. So, in the sample code above, dataset_from_file.data is None. This is done for efficiency: once LightGBM has created its own "constructed" representation of the training data, it no longer needs the raw data.
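    Continuing from the sample code above, you can confirm this yourself:

    # the Dataset re-created from "test.bin" does not carry the raw feature matrix
    print(dataset_from_file.data)  # None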