Is it possible to get the number of rows of the training set from an LGBMClassifier?

I have trained a model using lightgbm.sklearn.LGBMClassifier from the lightgbm package. I can get the number of columns and the column names of the training data from the model, but I have not found a way to get the number of rows. Is it possible to do so? The best solution would be to obtain the training data itself from the model, but I have not come across anything like that.

# This gives the number of columns the model was trained with
lgbm_model.n_features_
# Is there a way to find the number of rows in the training data as well?
lgbm_model.n_instances_  # does not exist!

Solution

The tree structure of a LightGBM model records, for every internal node, how many records from the training data would fall into that node if it were a leaf. In the model dump, this value is stored under the key internal_count.

Since all data matches the root node of each tree, in most situations you can use that information to figure out, given a LightGBM model, how many instances were in the training data.

Consider the following example, using lightgbm==3.3.2 and Python 3.8.8.

import lightgbm as lgb
from sklearn.datasets import make_blobs

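# 1234 rows, 2 features, 2 classes
X, y = make_blobs(n_samples=1234, centers=[[-4, -4], [-4, 4]])
# subsample has no effect here: subsample_freq defaults to 0, so bagging stays disabled
clf = lgb.LGBMClassifier(num_iterations=10, subsample=0.5)
clf.fit(X, y)
# number of training records that reach the root node of the first tree
num_data = clf.booster_.dump_model()["tree_info"][0]["tree_structure"]["internal_count"]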

print(num_data)
# 1234

This will work in most cases. There are two special circumstances where this number could be misleading as an answer to the question "how much data was used to train this model":

- If you enable bagging (bagging_fraction < 1.0 together with bagging_freq > 0), LightGBM uses only a fraction of the training data to evaluate splits at each iteration, so the root count reflects the sampled subset rather than the full training set (see the LightGBM docs for details on bagging_fraction).
- If you use "training continuation", where you take an existing model and perform additional boosting rounds on a different training set, then "how much data was used to train this model" has a complicated answer that depends on which range of boosting rounds you mean by "this model".
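
One way to spot either situation is to look at the root count of every tree instead of only the first one. The sketch below is an illustration, not part of LightGBM: root_counts is a hypothetical helper name, and example_dump is a hand-written stand-in with invented numbers, trimmed to the keys the helper reads. It assumes the dictionary has the shape returned by Booster.dump_model(), and that a tree which never split carries no internal_count at its root, in which case None is recorded.

```python
from typing import Any, Dict, List, Optional


def root_counts(model_dump: Dict[str, Any]) -> List[Optional[int]]:
    """Return the root-node record count for every tree in a model dump.

    A tree whose root has no "internal_count" key (for example a tree
    with no splits) contributes None instead of raising a KeyError.
    """
    return [
        tree["tree_structure"].get("internal_count")
        for tree in model_dump["tree_info"]
    ]


# Hand-written stand-in for a model dump (hypothetical values,
# not produced by LightGBM), reduced to the keys used above.
example_dump = {
    "tree_info": [
        {"tree_structure": {"internal_count": 1234}},
        {"tree_structure": {"internal_count": 1234}},
        {"tree_structure": {"internal_count": 800}},  # e.g. a later round on other data
        {"tree_structure": {"leaf_value": 0.0}},      # a tree with no splits
    ]
}

counts = root_counts(example_dump)
print(counts)
# [1234, 1234, 800, None]

# Distinct root counts; more than one value suggests a caveat applies.
distinct = sorted({c for c in counts if c is not None})
print(distinct)
# [800, 1234]
```

With a fitted model, the call would be root_counts(clf.booster_.dump_model()). If every tree reports the same count, that number is a reasonable answer to the question; if the counts differ between trees, one of the two circumstances above probably applies.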