I have trained a model using lightgbm.sklearn.LGBMClassifier
from the lightgbm
package. I can find out the number of columns and column names of the training data from the model but I have not found a way to find the row number of the training data. Is it possible to do so? The best solution would be to obtain the training data from the model but I have not come across anything like that.
# This gives the number of the columns the model is trained with
lgbm_model.n_features_
# Any way to find out the row number of the training data as well?
lgbm_model.n_instances_ # does not exist!
The tree structure of a LightGBM model includes information about how many records from the training data would fall into each node in the tree if that node were a leaf node. In LightGBM's code, this value is called internal_count
.
Since all data matches the root node of each tree, in most situations you can use that information to figure out, given a LightGBM model, how many instances were in the training data.
Consider the following example, using lightgbm==3.3.2
and Python 3.8.8.
import lightgbm as lgb
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=1234, centers=[[-4, -4], [-4, 4]])
clf = lgb.LGBMClassifier(num_iterations=10, subsample=0.5)
clf.fit(X, y)
num_data = clf.booster_.dump_model()["tree_info"][0]["tree_structure"]["internal_count"]
print(num_data)
# 1234
This will work in most cases. There are two special circumstances where this number could be misleading as an answer to the question "how much data was used to train this model":
bagging_fraction<1.0
, then at each iteration LightGBM will only use a fraction of the training data to evaluate splits (see the LightGBM docs for details on bagging_fraction
)