solr lucene machine-learning mahout

Automatic product classification and query weighting


I'm facing ranking problems with Solr and I'm stuck.

On an e-commerce site, the query "ipad" returns:

  1. ipad case for ipad 2
  2. ipad case
  3. ipad connection kit
  4. ipad 32gb wifi

This is a problem because we want the main products (the products themselves) to rank first, but tf/idf ranks the accessories first because of descriptions like "ipad case compatible with ipad, ipad2, ipad3, ipad retina, ipad mini, etc", which repeat the query term many times.

Furthermore, our categories give us no way to determine whether an item is an accessory or a main product.

I wonder whether automatic classification would help. Other solutions that improve this ranking (such as Named Entity Recognition) would also be appreciated.


Solution

  • Could you provide tagged data?

    If you have >50k items, a Naive Bayes classifier with a bigram language model trained on the product names will catch almost all accessories, at around 99% accuracy. I guess you can train such a Naive Bayes model with Mahout; however, product names contain a fairly limited number of distinct bigrams, so nowadays this can be trained quickly and easily even on a smartphone.
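To make the Naive Bayes idea concrete, here is a minimal sketch in plain Python rather than Mahout. It simplifies the "bigram language model" to a bag of word bigrams with add-one smoothing. The class name `BigramNaiveBayes`, the labels `"product"` and `"accessory"`, and the tiny training set are all made up for illustration; real training data would be your tagged product names.

```python
import math
from collections import Counter, defaultdict


def bigrams(name):
    """Lowercase, split on whitespace, pad with markers, return word bigrams."""
    tokens = ["<s>"] + name.lower().split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))


class BigramNaiveBayes:
    """Multinomial Naive Bayes over word bigrams (illustrative, not Mahout)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                         # add-alpha smoothing
        self.class_docs = Counter()                # number of names per label
        self.feature_counts = defaultdict(Counter) # bigram counts per label
        self.vocab = set()                         # all bigrams seen in training

    def fit(self, names, labels):
        for name, label in zip(names, labels):
            self.class_docs[label] += 1
            for bg in bigrams(name):
                self.feature_counts[label][bg] += 1
                self.vocab.add(bg)
        return self

    def log_scores(self, name):
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_docs.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)  # log prior
            for bg in bigrams(name):
                score += math.log((counts[bg] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, name):
        scores = self.log_scores(name)
        return max(scores, key=scores.get)


# Hypothetical tagged data; in practice this comes from your catalogue.
training = [
    ("ipad 32gb wifi", "product"),
    ("ipad mini 16gb", "product"),
    ("iphone 5 64gb black", "product"),
    ("galaxy tab 16gb wifi", "product"),
    ("ipad case", "accessory"),
    ("ipad case for ipad 2", "accessory"),
    ("ipad connection kit", "accessory"),
    ("iphone case leather", "accessory"),
    ("power adapter for iphone", "accessory"),
]
names, labels = zip(*training)
model = BigramNaiveBayes().fit(names, labels)

print(model.predict("iphone case"))      # accessory
print(model.predict("galaxy tab 16gb"))  # product
```

The predicted label could then be stored as an extra field at index time and used to boost main products over accessories at query time; how exactly to wire that into Solr depends on your schema and query parser.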

    This is a typical Mechanical Turk task, and tagging a few items shouldn't be that expensive. However, if you insist on a semi-supervised algorithm, I found Iterative Similarity Aggregation pretty useful.

    The main idea is that you give it a few seed tokens like "case"/"power adapter" and it iteratively finds new tokens that indicate the same class (here, the accessories that spam your results) because they appear in the same contexts.
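A rough sketch of that seed-expansion idea, in plain Python. This is a simplified illustration and not the algorithm from the paper: it assumes a token's "context" is the set of its left and right neighbours in product names, uses Jaccard similarity between context sets, and keeps adding tokens whose average similarity to the current set reaches a threshold. The function names, the threshold of 0.4, and the product names are all hypothetical.

```python
from collections import defaultdict


def token_contexts(names):
    """Map each token to the set of (side, neighbour) pairs it occurs with."""
    contexts = defaultdict(set)
    for name in names:
        tokens = ["<s>"] + name.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            contexts[tokens[i]].add(("left", tokens[i - 1]))
            contexts[tokens[i]].add(("right", tokens[i + 1]))
    return contexts


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def expand_seeds(names, seeds, threshold=0.4, max_rounds=10):
    """Grow the seed set with tokens that share contexts with it."""
    contexts = token_contexts(names)
    expanded = {s for s in seeds if s in contexts}
    if not expanded:
        return set()
    for _ in range(max_rounds):
        new_tokens = set()
        for token, ctx in contexts.items():
            if token in expanded:
                continue
            # Average similarity of this token to every current member.
            sims = [jaccard(ctx, contexts[m]) for m in expanded]
            if sum(sims) / len(sims) >= threshold:
                new_tokens.add(token)
        if not new_tokens:
            break  # converged: nothing new passed the threshold
        expanded |= new_tokens
    return expanded


# Hypothetical product names.
product_names = [
    "ipad case black",
    "iphone case black",
    "ipad sleeve black",
    "iphone sleeve leather",
    "ipad cover leather",
    "iphone cover black",
    "iphone pouch leather",
    "ipad 32gb wifi",
    "ipad 16gb wifi",
    "iphone 64gb wifi",
]

accessory_tokens = expand_seeds(product_names, {"case"})
print(sorted(accessory_tokens))  # ['case', 'cover', 'pouch', 'sleeve']
```

On this toy data "sleeve" and "cover" are picked up in the first round because they share most of their neighbours with "case"; "pouch" only passes the threshold in the second round, once "sleeve" and "cover" are part of the set, which is the point of iterating.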

    Here is the paper, but I have also written a blog post about it that sums up the idea in plain language. The paper also mentions the same "let the user find the right item" paradigm that Sean has proposed, so both can be used in conjunction.

    Oh, and if you need some advice on machine learning with Lucene and Solr, I can recommend the talk my friend Tommaso Teofili gave at ApacheCon Europe this year. You can find the slides on SlideShare. There is also a YouTube video of the talk out there, just search for it ;)