java mysql lucene tagging hibernate-search

Suggesting tags in a Java / MySQL / Hibernate Search / Lucene environment


I am working on a web-based application that allows our users to post typical blog / microblog / forum type posts. The one problem we have had is that our users do not tag their content very often. Since tags are very important in our app for several reasons, we want to drive our users towards tagging.

We implemented hash tagging, which seemed to have some effect, and we also intend to implement some form of gamification to encourage tagging.

In addition to the above, we want to implement tag suggestions (basically what StackOverflow has). We want to suggest tags based on existing tags in our database, and when there are no matching tags we would also like to suggest tags "out of the blue", maybe using some kind of tf-idf library. My question is two-fold:

  1. Is it feasible, from a performance perspective, to suggest tags as the user types (i.e. on each keystroke)? I think this is how StackOverflow does it when you post a question, and we are looking for something very similar. Or would we have to do some post-processing instead (i.e. suggest tags after the user has already added the content)?

  2. Are there any tools or libraries that could provide these suggestions and also give us stemming, and perhaps even synonym matching? Our data is currently stored in MySQL, and because we use Hibernate Search it is also stored in Lucene indexes (although we do not currently interact with these directly, only through Hibernate Search). We are open to storing this data in a different type of data source if that would help.


Solution

    1. Performing a search on each keystroke is indeed feasible. Our application currently does this with an install base of a couple of million clients (though they are not all searching at once). I would suggest introducing a small delay (a couple of seconds or so) before looking up tags, both to reduce the load on your server and to prevent the tag list from updating too frequently.
    2. Hibernate Search (via Lucene) should be able to give you the functionality you require. The key is to set the proper analyzer on your fields so that synonyms and stemming are handled correctly. For example, Lucene's EnglishAnalyzer removes stop words such as "the" and "and" and uses a Porter stemmer for stemming; it could perhaps be coupled with a SynonymFilter initialized with your synonyms.
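
The small delay suggested in point 1 is usually called debouncing: each keystroke cancels the lookup scheduled for the previous one, so the search runs only once the user pauses. Below is a minimal sketch of that idea using only the Java standard library. The class name `TagSuggestDebouncer`, the `lookup` callback, and the 300 ms delay in `main` are illustrative assumptions, not part of Hibernate Search or Lucene. In a web application this logic would more commonly live in the browser, and the callback would issue the actual tag query.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical helper: runs the tag lookup only after the user pauses typing. */
class TagSuggestDebouncer implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final long delayMillis;
    private final Consumer<String> lookup; // assumed callback that performs the search
    private ScheduledFuture<?> pending;    // guarded by "this"

    TagSuggestDebouncer(long delayMillis, Consumer<String> lookup) {
        this.delayMillis = delayMillis;
        this.lookup = lookup;
    }

    /** Call on every keystroke with the current input text. */
    synchronized void onKeystroke(String text) {
        if (pending != null) {
            // Drop the lookup that was scheduled for the older, shorter input.
            pending.cancel(false);
        }
        pending = scheduler.schedule(
                () -> lookup.accept(text), delayMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        try (TagSuggestDebouncer debouncer = new TagSuggestDebouncer(
                300, text -> System.out.println("searching tags for: " + text))) {
            // Three rapid keystrokes; only the last input should be searched.
            debouncer.onKeystroke("l");
            debouncer.onKeystroke("lu");
            debouncer.onKeystroke("luc");
            Thread.sleep(600); // wait long enough for the delayed lookup to run
        }
    }
}
```

The delay is a tuning parameter: a shorter value makes suggestions feel more responsive, while a longer one (such as the couple of seconds mentioned above) further reduces server load.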