Tags: json, apache-spark, pyspark, apache-spark-sql, collaborative-filtering

Collaborative filtering spark python


I'm trying to save only 10 rows of a DataFrame to JSON, but instead of 10 rows it saves everything.

userRecs = model.recommendForAllUsers(10)

  • this shows 10, and then I save:
userRecs.coalesce(1).write.mode('overwrite').json("gs://imdbcc1/ML/userrecs")

But this gives me 200 000 records; I only want to save 10.

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

# Split the ratings into training and test sets
(training, test) = ratings.randomSplit([0.8, 0.2])
als = ALS(maxIter=10, regParam=1, userCol="user_id", itemCol="tconst", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(training)
# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))
# Generate top 10 movie recommendations for each user
userRecs = model.recommendForAllUsers(10)
userRecs.coalesce(1).write.mode('overwrite').json("gs://imdbcc1/ML/userrecs")

Solution

  • #Generate top 10 movie recommendations for each user
    userRecs = model.recommendForAllUsers(10)
    

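    recommendForAllUsers(10) returns the top 10 movie recommendations for every user, so the result has one row per user, each holding that user's 10 recommendations. The 10 limits the number of recommendations per user, not the number of rows, which is why all the records are written.

    To save only 10 users (each with their top 10 movie recommendations), call limit(10) before coalesce: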

    userRecs.limit(10).coalesce(1).write.mode('overwrite').json("gs://imdbcc1/ML/userrecs")
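
As a rough sketch of the same idea, the two helpers below wrap the limit-then-write step and an alternative that computes recommendations for only a few users. The helper names, their parameters, and the orderBy step are assumptions added for illustration, not part of the original answer. Without an ordering, limit(10) may return any 10 rows, so sorting by the user column is one way to make the selection repeatable. recommendForUserSubset is an existing ALSModel method that takes a DataFrame of user ids and a number of items.

```python
def save_recs_for_n_users(user_recs, path, n_users=10, user_col="user_id"):
    """Write recommendations for only n_users users as a single JSON part file.

    user_recs: DataFrame returned by model.recommendForAllUsers(k),
               which has one row per user.
    """
    (
        user_recs
        .orderBy(user_col)      # assumption: sort so the same users are picked each run
        .limit(n_users)         # keep n_users rows, i.e. n_users users
        .coalesce(1)            # one output part file
        .write.mode("overwrite")
        .json(path)
    )


def recommend_for_n_users(model, ratings, n_users=10, n_items=10, user_col="user_id"):
    """Compute recommendations for a subset of users instead of all users."""
    # Pick n_users distinct user ids from the ratings DataFrame (arbitrary choice)
    users = ratings.select(user_col).distinct().limit(n_users)
    return model.recommendForUserSubset(users, n_items)
```

With the names from the question, this could be called as save_recs_for_n_users(userRecs, "gs://imdbcc1/ML/userrecs"). If only a handful of users are ever needed, recommend_for_n_users(model, ratings) avoids scoring all users first.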