Search code examples
algorithmsortingsearchsearch-engine

Efficient way to merge-sort distributed index postings for pagination


This is distributed search engine system processing 50 ~ 100 requests at 1 seconds.

Every search request need to merge sort searched documents among several servers.

There are 1 master server and N slave servers.

Each slave server has document id list with score data (Entry[]).

Entry is like below.

| docuemnt id (int) | score data (byte[]) |

Master server has to merge-sort each slave servers entry list and make page data (ex: 5000~5100)

Limitation1) Do not use huge memory. Cannot do merge-sort whole documents in memory at a time. Limitation2) Do not use temp file. Temp file might make search system slow.

If user want N documents start from R ( means rank R ) then ..


Solution-1) 1. Make top R + N sorted list at each slave server and sent it to master server. (use heap structure) 2. Master server merge-sort and make Top R + N 3. Return N documents start from R

Problem-1) 1. Each slave server have to maintain Top R + N documents in memory 2. If master get each slave servers R + N documents in memory, Outofmemory error might be occur.


Solution-2) 1. Each slave server send unsorted document list with score data to master server.(by chunk of data for memory limitation) 2. Master server do n-way merge sort

Problem-2) 1. User need only N document, but slave servers must send whole list of documents.(network transport data getting too large)


Any other good solution for this case?

I think making sorted N document from R need to calculate with every docuemnt list resides among slave servers, is it right?

Which is good? "local sort and remote merge" or "remote whole list" ?

I'm getting stuck last 1 week.

Thanks in advance.


Solution

  • You should think this problem as a different point of view. Show later result page is high cost job, and it's is rarely used. Search engine must provides user-want-documents at top ranking. User must be able to find a meaningful document in less than 10 pages. For example, amazon provides only top 20 pages for total search result and 400 pages for cateogry specified search result. That means nobody need thousand of result pages, if search engine provides good quality result. Check out this document: http://www.slideshare.net/songaal/ss-30031348