
Cross-validation in Lenskit


I'm trying to understand how exactly cross-validation is performed in lenskit. The documentation says that by default the data are partitioned by user. Does that mean that, in each fold, none of the users in the test set has been used for training? Is this achieved through the "holdout" option? If so, does this option break the user-based partitioning and yield folds in which each user shows up in both the training and test sets?

Right now, my evaluation code looks something like this:

dataset crossfold("data") {
    source csvfile(sourceFile) {
        delimiter "\t"
        domain {
            minimum 0.0
            maximum 10.0
            precision 0.1
        }
    }
//        order RandomOrder
    holdoutFraction 0.1
}

I commented out the "order" option because, when using it, lenskit eval throws an error.

Cheers!!!


Solution

  • Each user appears in both the training and the test sets, no matter the holdout, holdoutFraction, or retain options.

    However, for each test user (when using 5 partitions, 20% of the users), some of their ratings (the test ratings) are held out and placed in the test set. The remainder of their ratings are placed in the training set, along with all ratings from other users; a short sketch at the end of this answer illustrates this per-fold split.

    This simulates the common case of a recommender system: you have users, for whom some of their history is already known and can be used in model training, and you're trying to recommend or predict their future behavior.

    The holdout, holdoutFraction, and retain options are different ways of deciding how many ratings are put in the test set. If you say holdout 5, then 5 ratings from each test user are put in the test set, and the rest are used for training. If you say holdoutFraction 0.2, then 20% are used for testing and 80% for training. If you say retain 5, then 5 are used for training and the rest are used for testing.
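    To make that concrete, here is what the crossfold block from the question might look like with each of the three options (a sketch only; the partition count and the values 5 and 0.2 are illustrative, and only one of the three options should be active at a time):

    dataset crossfold("data") {
        source csvfile(sourceFile) {
            delimiter "\t"
            domain {
                minimum 0.0
                maximum 10.0
                precision 0.1
            }
        }
        partitions 5            // 5 folds, so each fold's test partition holds 20% of the users
        holdout 5               // 5 ratings from each test user go to the test set
        // holdoutFraction 0.2  // or: hold out 20% of each test user's ratings
        // retain 5             // or: keep 5 ratings for training, test on the rest
    }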
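    And here is the per-fold split itself, sketched in plain Groovy rather than LensKit code (the user IDs, rating IDs, and the 0.2 fraction are made up for illustration; which ratings actually get held out depends on the order option):

    // Conceptual illustration of one fold, not LensKit's implementation.
    def ratingsByUser = [
        u1: ['r1', 'r2', 'r3', 'r4', 'r5'],     // hypothetical rating IDs
        u2: ['r6', 'r7', 'r8', 'r9', 'r10'],
        u3: ['r11', 'r12', 'r13', 'r14', 'r15']
    ]
    def testUsers = ['u1'] as Set               // users assigned to this fold's test partition

    def train = []
    def test = []
    ratingsByUser.each { user, ratings ->
        if (user in testUsers) {
            // holdoutFraction 0.2: hold out 20% of this test user's ratings
            int nTest = Math.max(1, (int) Math.round(ratings.size() * 0.2))
            int nKeep = ratings.size() - nTest
            train.addAll(ratings.take(nKeep))   // the kept ratings still go to training
            test.addAll(ratings.drop(nKeep))    // the held-out ratings form the test set
        } else {
            train.addAll(ratings)               // non-test users: all ratings go to training
        }
    }
    println "train: $train"                     // u1's first four ratings plus everything from u2 and u3
    println "test:  $test"                      // only u1's last rating

    So the test users do appear in the training data too, just with their held-out ratings removed.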