Search code examples
lucene

About Lucene index postings list, why are all deltas between 0 and 255 during FOR encoding?


From this blog, it says that postings lists are split into blocks of 256 docs and then each block is compressed separately. But what if a term's postings list is [72, 373]? Is there anything that Lucene does to avoid a deltas greater than 255, like altering doc sequence so the docs have appropriate doc ids?


Solution

  • The article doesn't say there is a limit of 256 for delta, but the example in the article does.

    Lucene computes the maximum number of bits required to store deltas in a block, adds this information to the block header, and then encodes all deltas of the block using this number of bits.

    For example, If a posting list contains doc ids 1 to 256 like [1,2,3,.....256], the delta encoded block would be [1,1,1,1,....1] which means the block needs only 1 bit per doc id to store.

    Taking the example in your question [72,373..], the delta encoded block would be [72, 301,...] which means the block will need 9 bits per doc id (assuming that 301 is the largest delta in the block).

    Hope it clears.