
How to perform cross-validation with LightGBM.LGBMRanker, while keeping groups together?


I'm working on a search problem with a dataset of queries and URLs. Each (query, URL) pair has a relevance score (the target), a float that should preserve the ordering of the URLs for a given query. I would like to perform cross validation for my lightgbm.LGBMRanker model, with the objective as ndcg.

I went through the documentation and saw that it is important to keep the instances of a group together, because a group is a query with all its associated URLs. However, I run into the following error:

ValueError: Computing NDCG is only meaningful when there is more than 1 document. Got 1 instead.

I used the debugger, and while no group in my dataset has fewer than 2 documents, some groups are smaller inside the _feval function, meaning the cv() function did not actually keep the groups together.

In the lightgbm.cv I see no sign of the group argument which is used in the LGBMRanker. But the documentation for lightgbm.cv states that "Values passed through params take precedence over those supplied via arguments". My understanding was that this value is passed to the underlying model of the cv function.

Here is the code that I have so far:

def eval_model(
    self,
    model: lightgbm.LGBMRanker,
    k_fold: int = 3,
    seed: int = 42,
):
    """Evaluates with NDCG"""

    def _feval(y_pred: np.ndarray, lgb_dataset: lightgbm.basic.Dataset):
        y_true = lgb_dataset.get_label()
        serp_sizes = lgb_dataset.get_group()

        ndcg_values = []
        start = 0
        for size in serp_sizes:
            end = start + size
            y_true_serp, y_pred_serp = y_true[start:end], y_pred[start:end]
            ndcg_serp = sklearn.metrics.ndcg_score(
                [y_true_serp], [y_pred_serp], k=10
            )
            ndcg_values.append(ndcg_serp)
            start = end

        eval_name = "my-ndcg"
        eval_result = np.mean(ndcg_values)
        greater_is_better = True
        return eval_name, eval_result, greater_is_better

    lgb_dataset = lightgbm.Dataset(data=self.X, label=self.y, group=self.serp_sizes)
    cv_results = lightgbm.cv(
        params={**model.get_params(), "group": self.serp_sizes},
        train_set=lgb_dataset,
        num_boost_round=1_000,
        nfold=k_fold,
        stratified=False,
        seed=seed,
        feval=_feval,
    )
    ndcg = np.mean(cv_results["my-ndcg"])

    return ndcg

Where is my mistake or misunderstanding? Is there a simple workaround to perform cross-validation using a lightgbm.LGBMRanker while keeping the groups together?


Solution

    "I would like to perform cross validation for my lightgbm.LGBMRanker model, with the objective as ndcg."

    As of lightgbm==4.1.0 (the latest version as of this writing), lightgbm.sklearn.LGBMRanker cannot be used with scikit-learn's cross-validation APIs.

    It also cannot be passed to lightgbm.cv().

    "In the lightgbm.cv I see no sign of the group argument which is used in the LGBMRanker"

    As described in LightGBM's documentation, lightgbm.cv() expects to be passed a lightgbm.Dataset object.

    group is an attribute of the Dataset object.
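
For illustration, group holds the number of rows belonging to each query, in the order the rows appear, so its values sum to the number of rows. A minimal sketch, assuming a hypothetical per-row query_ids array (not part of the question) whose rows are already sorted by query:

```python
import numpy as np

# Hypothetical per-row query identifiers; rows of one query must be contiguous.
query_ids = np.array([10, 10, 10, 11, 11, 12, 12, 12, 12])

# Positions where a new query starts, then the run length of each query.
boundaries = np.flatnonzero(np.diff(query_ids)) + 1
group_sizes = np.diff(np.concatenate(([0], boundaries, [len(query_ids)])))

print(group_sizes)  # [3 2 4]
```

An array like group_sizes is what would be passed as group= when constructing the lightgbm.Dataset.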

    To perform cross-validation of a LightGBM learning-to-rank model, use lightgbm.cv() instead of lightgbm.sklearn.LGBMRanker().

    Here's a minimal, reproducible example using Python 3.11.7 and lightgbm==4.1.0.

    import lightgbm as lgb
    import numpy as np
    import requests
    from sklearn.datasets import load_svmlight_file
    from tempfile import NamedTemporaryFile
    
    # get training data from LightGBM examples
    data_url = "https://raw.githubusercontent.com/microsoft/LightGBM/master/examples/lambdarank"
    with NamedTemporaryFile(mode="w") as f:
        train_data_raw = requests.get(f"{data_url}/rank.train").text
        f.write(train_data_raw)
        # flush so the full file is on disk before it is read back by name
        f.flush()
        X, y = load_svmlight_file(f.name)
    
    group = np.loadtxt(f"{data_url}/rank.train.query")
    
    # create a LightGBM Dataset
    dtrain = lgb.Dataset(
        data=X,
        label=y,
        group=group
    )
    
    # perform LambdaRank 3-fold cross-validation with 1 set of hyperparameters
    cv_results = lgb.cv(
        train_set=dtrain,
        params={
            "objective": "lambdarank",
            "eval_at": 2,
            "num_iterations": 10
        },
        nfold=3,
        return_cvbooster=True
    )
    
    # check metrics
    np.round(cv_results["valid ndcg@2-mean"], 3)
    # array([0.593, 0.597, 0.64 , 0.632, 0.64 , 0.636, 0.655, 0.655, 0.653, 0.669])
    

    lightgbm.cv() will correctly preserve query groups when creating cross-validation folds.
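
To make "preserving query groups" concrete, the sketch below assigns whole queries to folds so that no query is split between training and validation. It is an illustration in plain NumPy, not LightGBM's internal code; the function name group_folds and the round-robin assignment are assumptions made for the example.

```python
import numpy as np

def group_folds(group_sizes, nfold):
    """Yield (train_idx, valid_idx) row indices with every query kept whole."""
    group_sizes = np.asarray(group_sizes, dtype=int)
    # Expand sizes into one group id per row: [3, 2] -> [0, 0, 0, 1, 1].
    row_group = np.repeat(np.arange(len(group_sizes)), group_sizes)
    # Assign each query (not each row) to a fold, round-robin.
    fold_of_group = np.arange(len(group_sizes)) % nfold
    for fold in range(nfold):
        valid_mask = fold_of_group[row_group] == fold
        yield np.flatnonzero(~valid_mask), np.flatnonzero(valid_mask)

# Hypothetical group sizes for six queries (17 rows in total).
sizes = [3, 2, 4, 2, 5, 3]
folds = list(group_folds(sizes, nfold=3))
```

Each validation fold then contains only complete queries, so a per-query metric such as NDCG never sees a truncated group.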

    "Values passed through params take precedence over those supplied via arguments"

    In LightGBM's documentation, "param" refers specifically to the configuration described at https://lightgbm.readthedocs.io/en/v4.1.0/Parameters.html.

    The statement you've quoted does not apply to data like group, init_score, and label, and those things should not be passed through the params keyword argument in any of LightGBM's interfaces.
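
One way to apply this to the code in the question is to keep only configuration values in params and attach group to the Dataset. A minimal sketch: the dict below is a hypothetical stand-in for the merged dictionary built in the question, and split_params is a helper name invented for the example.

```python
# Data-like fields that belong on the Dataset, not in params.
DATA_FIELDS = {"group", "init_score", "label"}

def split_params(raw):
    """Separate configuration values from data fields."""
    params = {k: v for k, v in raw.items() if k not in DATA_FIELDS}
    data = {k: v for k, v in raw.items() if k in DATA_FIELDS}
    return params, data

# Hypothetical stand-in for {**model.get_params(), "group": serp_sizes}.
raw = {"objective": "lambdarank", "learning_rate": 0.1, "group": [3, 2, 4]}
params, data = split_params(raw)

print(params)  # {'objective': 'lambdarank', 'learning_rate': 0.1}
print(data)    # {'group': [3, 2, 4]}
```

Here params would go to lightgbm.cv(params=...), while data["group"] would be passed as group= when building the lightgbm.Dataset.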