
StormCrawler / Elasticsearch and keeping track of inbound links to a page


When we search the results of the StormCrawler crawl in the Elasticsearch index, people inevitably compare them to Google, and our results compare unfavorably with a Google search on the same topic. One of the signals Google uses to rank pages is the set of inbound links to any given page.

In thinking about the search results on our page, and looking through the status index, I came across the field url.path. url.path appears to contain the entire path that led to the current page.

Would it be possible to create a multivalued field in the index that gets populated with just the last URL from whatever bolt or function generates url.path? That way the field would end up being an array of all the pages that link directly to the current document.

With that info, you could count the values and get an idea of the relative popularity of the current doc from the number of pages linking to it.

Is something like that possible with StormCrawler?


Solution

  • This would be possible with some modifications to the code. By default, we keep the info about a discovered URL, including the path that led to it, only for the first time that URL is discovered. There are various ways to implement this, for instance with a custom bolt that accumulates the inlinks in Redis or a graph database.

    Your underlying question is about relevance tuning with Elasticsearch. That depends, of course, on which fields the crawler sends, but not only on that. I know of some StormCrawler users who used it with ES as a replacement for Google Search Appliance with great success. Info about inlinks could help, but you should be able to get decent results without it.
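
The accumulation idea from the answer could be sketched roughly as follows. This is plain Java with no Storm or StormCrawler dependencies: the class InlinkAccumulator and its methods are hypothetical names, and the in-memory map is only a stand-in for Redis or a graph database. In a real topology, something like record(...) would presumably be called from a custom bolt for every outlink found on a fetched page.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not part of StormCrawler): collects the distinct pages linking to each URL. */
public class InlinkAccumulator {
    // target URL -> distinct source URLs; an in-memory stand-in for Redis or a graph DB
    private final Map<String, Set<String>> inlinks = new HashMap<>();

    /** Record that the page at sourceUrl contains a link to targetUrl. */
    public void record(String sourceUrl, String targetUrl) {
        if (sourceUrl.equals(targetUrl)) {
            return; // a page linking to itself says nothing about its popularity
        }
        inlinks.computeIfAbsent(targetUrl, k -> new LinkedHashSet<>()).add(sourceUrl);
    }

    /** All distinct pages seen linking to targetUrl (empty if none). */
    public Set<String> inlinksOf(String targetUrl) {
        return inlinks.getOrDefault(targetUrl, Collections.emptySet());
    }

    /** Number of distinct inlinking pages, usable as a rough popularity signal. */
    public int inlinkCount(String targetUrl) {
        return inlinksOf(targetUrl).size();
    }

    public static void main(String[] args) {
        InlinkAccumulator acc = new InlinkAccumulator();
        acc.record("https://example.com/", "https://example.com/docs");
        acc.record("https://example.com/blog", "https://example.com/docs");
        acc.record("https://example.com/", "https://example.com/docs"); // duplicate, ignored by the set
        System.out.println(acc.inlinkCount("https://example.com/docs")); // prints 2
    }
}
```

Storing the sources in a set rather than a counter keeps repeated discoveries of the same link from inflating the count, and the resulting number could then be indexed as a numeric field and used as one of several ranking signals in Elasticsearch.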