python, numpy, recommendation-engine

Optimization of K-fold cross validation for implicit recommendation systems


I have been trying to test my recommendation system using k-fold cross validation. My recommendation system is based on implicit feedback. Since I am implementing k-fold cross validation on my user-item matrix, I can't use scikit-learn's native k-fold methods (or can I?). I am having trouble with my k-fold cross validation implementation: it uses a lot of for loops and has become very slow. I have gone through these links: Optimize this function with numpy (or other vectorization methods) and Speed up for loop with numpy, but I can't seem to apply them to my code. Can somebody help me out?

My code:

    import scipy.sparse as sparse

    def TrainRepeat2(counts, FinalArr, k=3):
        """
        Parameters
        ----------
        counts   : user-item matrix
        FinalArr : shuffled indices
        k        : number of folds

        Example:
        if k = 3, FinalArr will be a list containing 3 lists of
        randomly shuffled indices.
        """
        # Numbers of latent factors to evaluate
        num_factors = [10, 20]
        PartitionList = range(k)

        # Iterate over the numbers of factors
        for i in range(len(num_factors)):

            # Iterate over the folds
            for partition in PartitionList:

                # Keep one fold for testing
                validation = counts[FinalArr[partition], :]

                # Keep the rest for training
                validation_list = [x for x in PartitionList if x != partition]

                # Train over the rest
                for t in validation_list:
                    train = counts[FinalArr[t], :]
                    train = sparse.csr_matrix(train)
                    print "The evaluation is being done for factor no %d" % (num_factors[i])
                    reg_param = 5

                    # ImplicitMF and flag are defined elsewhere in my code
                    MF_als = ImplicitMF(train, validation, num_factors=num_factors[i],
                                        num_iterations=80, reg_param=reg_param, num_threads=14)
                    user_vectors, item_vectors = MF_als.train_model(flag, leaveone=False)

Specifically, the algorithm is O(N^3). I want to somehow remove the for loops and optimize the code.

Any help would be appreciated

Thanks!


Solution

  • Edited per comment

    At the end of the day, if you want to run cross validation n times, you are going to have to loop n times. Whether that loop is hidden from you (and hopefully written very efficiently, in Cython or something similar) or visible in your code, it will happen.
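    If the loop stays visible, the biggest saving in the question's code is not the two outer loops (one per factor setting, one per fold) but the inner one: if the intent is standard k-fold cross validation, the k-1 remaining folds can be stacked into a single training matrix, so each (factor, fold) pair trains one model instead of k-1. A minimal sketch, assuming counts is the user-item matrix, FinalArr is the list of k shuffled index arrays, and ImplicitMF / flag come from the question:

    import numpy as np
    from scipy import sparse

    def train_kfold(counts, FinalArr, k=3, num_factors=(10, 20), reg_param=5):
        for f in num_factors:
            for partition in range(k):
                # held-out fold for validation
                validation = counts[FinalArr[partition], :]

                # stack the remaining folds into ONE training matrix, so each
                # (factor, fold) pair trains a single model instead of k-1 models
                train_idx = np.concatenate([FinalArr[p] for p in range(k) if p != partition])
                train = sparse.csr_matrix(counts[train_idx, :])

                # ImplicitMF and flag are taken from the question's code
                mf = ImplicitMF(train, validation, num_factors=f, num_iterations=80,
                                reg_param=reg_param, num_threads=14)
                user_vectors, item_vectors = mf.train_model(flag, leaveone=False)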

    I think at a high level what you want is here:

    http://scikit-learn.org/stable/modules/cross_validation.html

    Things you need to do: write a classifier object that takes in train_data, train_class, and test_data and returns a list of predictions for test_data. This is your "recommender" class, and it is functionally equivalent to any of the sklearn classifiers.
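    For illustration, a minimal sketch of such a wrapper, built on sklearn's BaseEstimator so it plugs into cross_val_score. The fit/predict bodies below are hypothetical placeholders (a simple item-popularity model) standing in for the real factorization done by something like the question's ImplicitMF:

    import numpy as np
    from sklearn.base import BaseEstimator

    class ImplicitRecommender(BaseEstimator):
        """Sklearn-compatible wrapper: fit() learns a model from the training
        rows of the user-item matrix, predict() returns one score per row."""

        def __init__(self, num_factors=10, num_iterations=80, reg_param=5):
            self.num_factors = num_factors
            self.num_iterations = num_iterations
            self.reg_param = reg_param

        def fit(self, X, y=None):
            # Placeholder "training": per-item popularity. A real implementation
            # would learn user/item latent vectors here (e.g. via ImplicitMF).
            X = np.asarray(X, dtype=float)
            self.item_scores_ = X.mean(axis=0)
            return self

        def predict(self, X):
            # One prediction per test row: here, each user's interactions
            # weighted by the learned item popularity. Replace with scores
            # computed from the learned factors.
            X = np.asarray(X, dtype=float)
            return X.dot(self.item_scores_)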

    Write a scoring object. Per your comment below, this should take in two arrays of the same length, the prediction and the correct classification, and calculate the error. Then you can use those two objects directly in the sample sklearn code below.
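    A sketch of the scoring piece, using sklearn.metrics.make_scorer so the error function (plain mean squared error here, only as a stand-in for whatever implicit-feedback metric you prefer) can be passed straight to cross_val_score:

    import numpy as np
    from sklearn.metrics import make_scorer

    def prediction_error(y_true, y_pred):
        # Takes two arrays of the same length and returns a single error value;
        # mean squared error is just a placeholder metric.
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        return np.mean((y_true - y_pred) ** 2)

    # greater_is_better=False tells sklearn that lower error means a better model
    scorer = make_scorer(prediction_error, greater_is_better=False)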

    Assuming:

    • your full dataset is in df
    • your "target" (however defined) is in targets
    • clf is your classifier (or recommender in this case)
    • scorer is how you calculate error

    from sklearn import cross_validation

    n_samples = len(df)
    cv = cross_validation.ShuffleSplit(n_samples, n_iter=3, test_size=0.3, random_state=0)

    cross_validation.cross_val_score(clf, df, targets, scoring=scorer, cv=cv)
    # example output:
    # array([ 0.97...,  0.97...,  1.        ])