Think about a platform where a user chooses which factors matter most to them. For example, there are 5 criteria: A, B, C, D, E.
Each product review then has a rating for each criterion: A1, B1, C1, D1, E1.
So if the user gave more importance to A, the weighting takes that into account. The result is that each review can have a different overall score for each user.
My problem is the algorithm for this: the processing is currently slow.
For each category summary, I need to iterate over all companies in that category, and over all reviews of each company.
# step 1: find companies of category X with more than 1 published review
companies_X = [1, 2, 3, 5, n]

# step 2: iterate over all companies, and over all reviews of these companies
for company in companies:
    for review in company:
        # calculate the weighting of the review for the current user's criteria
        # give more importance to recent reviews

# step 3: average all reviews of each company

# step 4: average over all companies of this category to produce the final score for category X
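To make the steps concrete, here is a minimal sketch of the naive version in plain Python. The data layout (`reviews_by_company`, `user_weights`), the helper names (`review_score`, `category_score`), and the recency decay formula are all assumptions for illustration, not part of the original description:

```python
from datetime import datetime, timezone

# Hypothetical data: per-criterion ratings plus a publish date for each review.
reviews_by_company = {
    1: [{"A": 4, "B": 5, "C": 3, "D": 4, "E": 5,
         "published": datetime(2024, 1, 10, tzinfo=timezone.utc)},
        {"A": 2, "B": 3, "C": 4, "D": 5, "E": 1,
         "published": datetime(2023, 6, 1, tzinfo=timezone.utc)}],
    2: [{"A": 5, "B": 4, "C": 5, "D": 3, "E": 4,
         "published": datetime(2024, 3, 2, tzinfo=timezone.utc)}],
}

# Hypothetical user preferences: one weight per criterion, summing to 1.
user_weights = {"A": 0.4, "B": 0.3, "C": 0.1, "D": 0.1, "E": 0.1}

def review_score(review, weights, now=None):
    """Step 2: weight the criteria ratings, then decay by review age."""
    now = now or datetime.now(timezone.utc)
    base = sum(review[c] * w for c, w in weights.items())
    age_days = (now - review["published"]).days
    recency = 1 / (1 + age_days / 365)  # assumed decay; tune to taste
    return base * recency

def category_score(reviews_by_company, weights):
    company_avgs = []
    for reviews in reviews_by_company.values():
        scores = [review_score(r, weights) for r in reviews]
        company_avgs.append(sum(scores) / len(scores))   # step 3
    return sum(company_avgs) / len(company_avgs)         # step 4
```

With real data volumes, these nested Python loops run once per user per page view, which is exactly where the 30 seconds go.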
This works, but I can't have a page that takes 30 seconds to load.
I've thought about caching this page, but then I'd have to process it in the background for every user, which is definitely not a good solution.
Any ideas for improvements? Any insight is welcome.
First option: using numpy and pandas could improve your speed, if leveraged in a smart way, that is, by avoiding Python loops whenever possible. You can do this with vectorized column operations, or with the `apply` method in pandas combined with a condition or a lambda function.
The nested loops

for company in companies:
    for review in company:

can be replaced by a single vectorized assignment such as

review_data["note"] = note_formula(review_data["number_reviews"])
Edit: here note_formula is a function returning the weighting of the review, as indicated in the comments of the question:

# calculate the weighting of the review for the current user's criteria
# give more importance to recent reviews
Your steps 3 and 4 can be performed using pandas' groupby method along with an average calculation.
Second option: where is your data stored? If it is in a database, a good rule for performance is: move the data as little as possible. So perform the request directly in the database (I think all your operations can be written in SQL) and return only the result to the Python script. If your data is stored some other way, consider using a database engine, for instance SQLite at the beginning if you don't aim to scale fast.
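As a rough illustration of pushing the work into the database, here is a self-contained SQLite sketch. The schema, sample data, and decay formula are assumptions; the weights arrive as bound parameters, the whole aggregation runs in SQL, and only a single number crosses back to Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reviews (
        company_id INTEGER, a REAL, b REAL, c REAL, d REAL, e REAL,
        age_days INTEGER
    );
    INSERT INTO reviews VALUES
        (1, 4, 5, 3, 4, 5, 30),
        (1, 2, 3, 4, 5, 1, 300),
        (2, 5, 4, 5, 3, 4, 10),
        (2, 3, 3, 3, 3, 3, 60),
        (3, 5, 5, 5, 5, 5, 5);  -- only one review: filtered out below
""")

# The user's criterion weights, bound as query parameters.
weights = (0.4, 0.3, 0.1, 0.1, 0.1)

# Inner query: per-review weighted score with an assumed recency decay,
# averaged per company (steps 1-3, including the "more than 1 review" filter).
# Outer query: average over companies (step 4).
row = conn.execute("""
    SELECT AVG(company_note) FROM (
        SELECT AVG((a*? + b*? + c*? + d*? + e*?)
                   / (1.0 + age_days / 365.0)) AS company_note
        FROM reviews
        GROUP BY company_id
        HAVING COUNT(*) > 1
    )
""", weights).fetchone()
category_score = row[0]
```

The same query shape works on PostgreSQL or MySQL, and a per-user cache key (the weight vector) then only has to cache one number per category rather than a rendered page.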