I have been trying to test my recommendation system using k-fold cross validation. My recommendation system is based on implicit feedback. Since I am trying to implement k-fold cross validation on my user-item matrix, I can't use scikit-learn's native k-fold methods (or can I?). I am having trouble implementing the cross validation: I seem to be using a lot of for loops, and the code is becoming very slow. I have gone through these links: Optimize this function with numpy (or other vectorization methods) and Speed up for loop with numpy, but I can't seem to apply them to my code. Can somebody help me out?
My code:
import numpy as np
from scipy import sparse

def TrainRepeat2(counts, FinalArr, k=3):
    """
    Parameters:
    -------------------------------------------
    counts : user-item matrix
    k : number of folds
    FinalArr : shuffled indices

    Example:
    if k = 3, FinalArr will be a list containing 3 lists
    with randomly shuffled indices
    """
    # Numbers of latent factors to try
    num_factors = [10, 20]
    PartitionList = range(k)
    # Iterate over the numbers of factors
    for i in range(len(num_factors)):
        # Iterate over the folds
        for partition in PartitionList:
            # Keep one fold for validation
            validation = counts[FinalArr[partition], :]
            # Keep the rest for training
            validation_list = [x for x in PartitionList if x != partition]
            # Train over the rest
            for t in validation_list:
                train = sparse.csr_matrix(counts[FinalArr[t], :])
                print "The evaluation is being done for factor no %d" % num_factors[i]
                reg_param = 5
                # ImplicitMF and flag are defined elsewhere in my code
                MF_als = ImplicitMF(train, validation,
                                    num_factors=num_factors[i],
                                    num_iterations=80,
                                    reg_param=reg_param,
                                    num_threads=14)
                user_vectors, item_vectors = MF_als.train_model(flag, leaveone=False)
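For reference, FinalArr is just a list of k index arrays. A minimal sketch of how such a structure could be produced (illustrative, not my exact shuffling code):

# Shuffle all row indices, then split them into k roughly equal folds.
indices = np.random.permutation(counts.shape[0])
FinalArr = np.array_split(indices, k)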
Specifically, the algorithm is O(N^3). I want to somehow remove the for loops and optimize the code.
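For example, I suspect the innermost loop could at least be collapsed into a single slice, along these lines (a sketch; I have not verified it gives the same results):

# Gather all training folds into one index array and build a single
# training matrix, so only one model is trained per held-out fold.
train_idx = np.concatenate([FinalArr[t] for t in PartitionList if t != partition])
train = sparse.csr_matrix(counts[train_idx, :])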
Any help would be appreciated. Thanks!
Edited per comment
At the end of the day, if you want to run cross validation n times, you are going to have to loop n times. Whether that loop is hidden from you (and hopefully written very efficiently, in Cython or something similar) or visible in your code, it is going to happen.
I think at a high level what you want is here:
http://scikit-learn.org/stable/modules/cross_validation.html
Things you need to do: write a classifier object that takes in train_data, train_class, and test_data and returns a list of predictions for test_data. This is your "recommender" class, and it is equivalent in function to any of the sklearn classifiers.
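A minimal sketch of such a recommender, following sklearn's fit/predict convention (MeanRecommender is a hypothetical stand-in that just predicts the global mean of the training targets; the real fit would run your matrix factorization, e.g. ImplicitMF):

import numpy as np
from sklearn.base import BaseEstimator

class MeanRecommender(BaseEstimator):
    """Toy recommender with the interface cross_val_score expects."""
    def __init__(self, num_factors=10):
        self.num_factors = num_factors

    def fit(self, train_data, train_class):
        # Learn whatever the model needs from the training split.
        self.mean_ = np.mean(train_class)
        return self

    def predict(self, test_data):
        # Return one prediction per row of test_data.
        return np.repeat(self.mean_, test_data.shape[0])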
Write a scoring object. Per your comment below, this should take in two arrays of the same length, the predictions and the correct classifications, and calculate the error. Then you can use those two objects directly in the sample sklearn code below.
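A sketch of such a scorer, assuming RMSE is the error you want (make_scorer wraps a plain error function into the object cross_val_score expects):

import numpy as np
from sklearn.metrics import make_scorer

def rmse(y_true, y_pred):
    # Both arrays have the same length; lower is better.
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# greater_is_better=False tells sklearn that smaller rmse is better.
scorer = make_scorer(rmse, greater_is_better=False)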
Assuming:
your full dataset is in df
your "target" (however defined) is in targets
clf is your classifier (or recommender in this case)
scorer is how you calculate error
from sklearn import cross_validation

n_samples = len(df)
cv = cross_validation.ShuffleSplit(n_samples, n_iter=3, test_size=0.3, random_state=0)
cross_validation.cross_val_score(clf, df, targets, scoring=scorer, cv=cv)
# example output:
# array([ 0.97...,  0.97...,  1.  ])
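One note: if the scorer is built with greater_is_better=False (as in the RMSE sketch above), cross_val_score reports the errors negated, since sklearn always treats larger scores as better; values closer to zero then mean lower error.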