I have a table in postgresql containing some cars +1000000 records:
+----+--------+------+---------+-----------+-------------+------------+------------+
| id | price | year | mileage | fuel_type | body_type | brand | model |
+----+--------+------+---------+-----------+-------------+------------+------------+
| 1 | 4894 | 2011 | 121842 | "Benzin" | "Sedan" | "Toyota" | "Yaris" |
| 2 | 4989 | 2012 | 33901 | "Benzin" | "Hatchback" | "Renault" | "Twingo" |
| 3 | 4990 | 2013 | 55105 | "Benzin" | "Hatchback" | "Renault" | "Twingo" |
| 3 | 5290 | 2013 | 20967 | "Benzin" | "Hatchback" | "Renault" | "Twingo" |
| 5 | 5594 | 2008 | 121281 | "Benzin" | "Hatchback" | "Mercedes" | "A170" |
| 6 | 4690 | 2012 | 71303 | "Benzin" | "Hatchback" | "Renault" | "Twingo" |
| 7 | 5290 | 2013 | 58300 | "Benzin" | "Hatchback" | "Renault" | "Twingo" |
| 8 | 5890 | 2013 | 35732 | "Benzin" | "Hatchback" | "Renault" | "Twingo" |
| 9 | 5990 | 2013 | 38777 | "Benzin" | "Hatchback" | "Renault" | "Twingo" |
| 10 | 6180 | 2013 | 69491 | "Benzin" | "Hatchback" | "VW" | "up!" |
| 11 | 6490 | 2012 | 72900 | "Benzin" | "Sedan" | "Renault" | "Clio III" |
| 12 | 6790 | 2012 | 49541 | "Benzin" | "Hatchback" | "Renault" | "Clio III" |
| 13 | 6790 | 2012 | 46377 | "Benzin" | "Hatchback" | "Renault" | "Clio III" |
| 14 | 6790 | 2012 | 45200 | "Benzin" | "Hatchback" | "Renault" | "Clio III" |
| 15 | 6894 | 2007 | 108840 | "Benzin" | "Sedan" | "VW" | "Golf V" |
| 16 | 6990 | 2009 | 54200 | "Benzin" | "Sedan" | "Renault" | "Mégane" |
| 17 | 6990 | 2012 | 40652 | "Benzin" | "Hatchback" | "Renault" | "Clio III" |
| 18 | 6990 | 2012 | 38080 | "Benzin" | "Sedan" | "Renault" | "Clio III" |
| 19 | 7290 | 2012 | 28600 | "Benzin" | "Hatchback" | "Renault" | "Clio III" |
| 20 | 7290 | 2013 | 52800 | "Benzin" | "Hatchback" | "Renault" | "Twingo" |
+----+--------+------+---------+-----------+-------------+------------+------------+
I would like to create a recommendation engine, that can return the 20 most "similar" matches based on some varying criteria e.g. When a user does a search for: brand = 'Renault' AND price < 60000 AND year > 2010
, I want to present, outside the search result some other, more loose results with other cars, that is similar, but doesn't necessarily match all the search criteria exact.
I have tried making some rule based code in ruby, that does something like:
Based on this code, I generate an SQL query with where and order by clauses.
The problem however is, that things get's huge, as I have like 20 different columns I would like to optionally take into consideration, base on the initial criteria. Also I want the recommendation to be backward compatible in the sense that I don't want to just do a simple filtering (WHERE
) query, that in cases might end up returning zero matches. Instead I want it do something similar to when you use text similarity algorithms, where you can compare one phrase to all and get a comparison score for all of them which you can then sort by.
I'm super puzzled about how I could implement this, in an approach this is NOT defining 1000 rules and if/then statements to generate an SQL query. Is there some other technique I could use, or maybe another technology than postgresql?
Apply machine learning techniques.
MADlib http://madlib.incubator.apache.org/ is an extension to Postgres that gives you an ability to work with various machine learning algorithms right inside the database. It is worth to learn and try.
Start with linear regression for your vectors and then also try random forests and other algorithms and compare what works better in your case (the trick to evaluate the algorithm's quality is simple: you take all your the data you have, use 70-80% of it to train, and then use the remainder to get estimations from the trained engine -- and then use some function to calculate the deviation error, usually people use mean square error approach).
Also, I can recommend an awesome Stanford online course, with online book and lectures published on Youtube (all free!): http://mmds.org/. Various modern approaches to building recommendation engines are described very well in it.