elasticsearch

What is _recovery_source field in ElasticSearch?


I created two indexes in ElasticSearch with identical mappings except for one parameter: one mapping excludes the dense_vector field from _source and the other does not:

"mappings": {
        "_source": {"excludes": ["title_vector"]},
        "properties": {
        ...}

Then I indexed the same 1,000 documents into both indexes:

index                  docs.count  docs.deleted  store.size  pri.store.size
vector_in_source       1000        0             21.5mb      21.5mb
no_vector_in_source    1000        0             21.2mb      21.2mb

When I ran

curl --location --request POST 'http://127.0.0.1:9200/index_name/_disk_usage?run_expensive_tasks=true'

on both indexes I found out that:

  1. The index with vectors in source stores the dense_vector values as plain floats in _source, as I expected.
  2. The index with no vectors in source does not store the dense vectors in _source, BUT it creates a new field called _recovery_source whose size equals what 1000 1024-dim vectors stored as plain floats would occupy.
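
A comparison like the one above can be sketched in a few lines of Python. This is only an illustration: it assumes the parsed _disk_usage response keeps per-field figures under `<index name>` -> `fields` -> `<field name>` -> `total_in_bytes`, and `field_sizes` is a hypothetical helper name, not part of any client library.

```python
def field_sizes(disk_usage_response, index, field_names):
    """Return {field: total_in_bytes} for the requested fields.

    Assumes the response layout described above; a field that is
    absent from the response is reported as 0 bytes.
    """
    fields = disk_usage_response[index]["fields"]
    return {
        name: fields.get(name, {}).get("total_in_bytes", 0)
        for name in field_names
    }

# Usage, with `response` being the parsed JSON body of the _disk_usage call:
# field_sizes(response, "no_vector_in_source", ["_source", "_recovery_source"])
```

Running it against both indexes makes it easy to see whether the bytes saved in _source reappear under _recovery_source.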

So even though I explicitly excluded dense vectors from being stored in Elastic they are still stored just in a new field!

So I was wondering what this field is, can I disable it or at least exclude dense_vectors from being stored in this field?


Solution

  • So I was wondering what this field is, can I disable it or at least exclude dense_vectors from being stored in this field?

    This field is created automatically when _source is modified from its original state (by excluding a field from the source or by switching to synthetic source). It is needed for per-operation replication. The field has a limited lifespan, 12 hours by default. If a segment is merged after those 12 hours, the field is removed from its documents; otherwise it stays there.

    If your index lives significantly longer than 12 hours and is actively updated, you don't have to worry about it: segments will be created and merged as needed, so the field will eventually be removed from old records and will exist only in young segments and new records.

    If you create an index and it then remains idle, there is a workaround: reduce index.soft_deletes.retention_lease.period and, once indexing is over, force merge the index.
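
The workaround can be sketched as two HTTP calls. The snippet below is a minimal sketch, not a definitive recipe: it assumes an unsecured local node, the `"1m"` period is an arbitrary illustrative value, and `build_workaround_requests` and `send` are hypothetical helper names. It uses the index settings endpoint (`PUT /<index>/_settings`) and the force merge endpoint (`POST /<index>/_forcemerge`).

```python
import json
import urllib.request


def build_workaround_requests(base_url, index, period="1m"):
    """Build (but do not send) the two requests for the workaround."""
    # Step 1: shorten the retention lease period (12h by default).
    settings_body = json.dumps(
        {"index.soft_deletes.retention_lease.period": period}
    ).encode("utf-8")
    put_settings = urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=settings_body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    # Step 2: force merge so that existing segments are rewritten.
    force_merge = urllib.request.Request(
        url=f"{base_url}/{index}/_forcemerge?max_num_segments=1",
        method="POST",
    )
    return [put_settings, force_merge]


def send(reqs):
    """Send each request in order and print the HTTP status."""
    for req in reqs:
        with urllib.request.urlopen(req) as resp:
            print(req.get_method(), req.full_url, resp.status)


# Usage (requires a running cluster):
# send(build_workaround_requests("http://127.0.0.1:9200", "no_vector_in_source"))
```

In practice the force merge probably needs to run only after the shortened period has elapsed, so it may be safer to send the two requests separately rather than back to back.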

    I recently wrote a blog post about this issue that provides more details.