I created two indexes in Elasticsearch with identical mappings except for one parameter: one mapping excludes the dense_vector field from _source and the other does not:
"mappings": {
  "_source": {"excludes": ["title_vector"]},
  "properties": {
    ...
  }
}
Then I indexed the same 1000 documents into both indexes:
vector_in_source 1000 0 21.5mb 21.5mb
no_vector_in_source 1000 0 21.2mb 21.2mb
When I ran
curl --location --request POST 'http://127.0.0.1:9200/index_name/_disk_usage?run_expensive_tasks=true'
on both indexes, I found that:
in vector_in_source, the dense_vector is stored as plain floats in the source, as I expected;
in no_vector_in_source, there is a field called _recovery_source with a size equal to what 1000 1024-dim vectors stored as plain floats would occupy.
So even though I explicitly excluded dense vectors from being stored in Elasticsearch, they are still stored, just in a new field!
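To compare the two indexes field by field, the `_disk_usage` response can be reduced to the stored-field size of each field. The following is a minimal sketch, assuming the response nests per-field statistics under `<index>.fields.<field>.stored_fields_in_bytes`. The sample response and all of its numbers are invented for illustration and are not real measurements:

```python
def stored_bytes_per_field(disk_usage_response, index_name):
    """Return {field: stored_fields_in_bytes} for one index, largest first."""
    fields = disk_usage_response[index_name]["fields"]
    sizes = {
        name: stats.get("stored_fields_in_bytes", 0)
        for name, stats in fields.items()
    }
    return dict(sorted(sizes.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical, hand-written response fragment (all numbers are invented).
sample_response = {
    "no_vector_in_source": {
        "fields": {
            "_source": {"total_in_bytes": 150_000, "stored_fields_in_bytes": 150_000},
            "_recovery_source": {"total_in_bytes": 9_000_000, "stored_fields_in_bytes": 9_000_000},
            "title_vector": {"total_in_bytes": 4_100_000, "stored_fields_in_bytes": 0},
        }
    }
}

for field, size in stored_bytes_per_field(sample_response, "no_vector_in_source").items():
    print(f"{field}: {size} bytes")
```

Running the same function on the response for each index makes it easy to see which field holds the stored bytes.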
So I was wondering what this field is. Can I disable it, or at least exclude dense_vectors from being stored in it?
This field is automatically created when _source is modified from its original state (by excluding a field from the source or switching to synthetic source). It is needed for per-operation replication. The field has a limited lifespan, 12 hours by default. If a segment is merged after those 12 hours, the field is removed from its documents; otherwise it remains there.
If your index has a lifespan significantly longer than 12 hours and is actively updated, you don't have to worry about it: segments will be created and merged as needed, so the field will eventually be removed from old records and will exist only in young segments and new records.
If you create an index and it then remains idle, there is a workaround: reduce index.soft_deletes.retention_lease.period and, once indexing is over, force merge the index.
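The workaround could be scripted roughly as follows. This is only a sketch using the Python standard library. It assumes an unsecured local node at http://127.0.0.1:9200 (the address from the question) and the index name no_vector_in_source. The retention period of `1s` and `max_num_segments=1` are illustrative assumptions, not recommended values:

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:9200"  # assumption: local node without authentication


def settings_request(index, period):
    """Build the PUT request that lowers the soft-deletes retention lease period."""
    body = {"index.soft_deletes.retention_lease.period": period}
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def forcemerge_request(index, max_num_segments=1):
    """Build the POST request that force merges the index."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/_forcemerge?max_num_segments={max_num_segments}",
        method="POST",
    )


def apply_workaround(index, period="1s"):
    """Send both requests in order; call this only after indexing has finished."""
    for request in (settings_request(index, period), forcemerge_request(index)):
        with urllib.request.urlopen(request) as response:
            print(request.get_method(), request.full_url, response.status)


# apply_workaround("no_vector_in_source")  # uncomment to run against a live node
```

Because the field is only dropped when a merge happens after the retention period has passed, a short wait between the two calls may be needed.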
I recently wrote a blog post about this issue that provides more details.