Receiving responses of different formats for the same query in vespa


I have this schema:

schema embeddings {
  document embeddings {
    field id type int {}
    field text_embedding type tensor<double>(d0[960]) {
      indexing: attribute | index
      attribute {
        distance-metric: euclidean
      }
    }
  }

  rank-profile closeness {
    num-threads-per-search:1
    inputs {
      query(query_embedding) tensor<double>(d0[960])
    }
    first-phase {
      expression: closeness(field, text_embedding)
    }
  }
}

And these services:

...
    <container id="query" version="1.0">
        <search/>
        <nodes>
            <node hostalias="query" />
        </nodes>
    </container>

    <content id='mind' version='1.0'>
        <redundancy>1</redundancy>
        <documents>
            <document type='embeddings' mode="index"/>
        </documents>
        <nodes>
            <node hostalias="content1" distribution-key="0"/>
        </nodes>
    </content>
...

Then I have a number of queries of the same format:

{
    'yql': 'select * from embeddings where ({approximate:false, targetHits:100} nearestNeighbor(text_embedding, query_embedding));',
    'timeout': 5,
    "hits":100,
    'input': {
        'query(query_embedding)': [...],
    },
    'ranking': {
        'profile': 'closeness',
    },
}

which are then run via app.query_batch(test_queries)

The problem is that some responses look like this (and contain the id as an integer, just as I inserted it):

{'id': 'id:embeddings:embeddings::786559', 'relevance': 0.5703559830732123, 'source': 'mind', 'fields': {'sddocname': 'embeddings', 'documentid': 'id:embeddings:embeddings::786559'}}

while others look like this (they neither contain the integer id I inserted nor follow the format of the previous example):

{'id': 'index:mind/0/b0dde169c545ce11e8fd1a17', 'relevance': 0.49024561522459087, 'source': 'mind'}

How can I make all responses look like the first one? Why are they different at all?


Solution

  • Some of the hits are filled with content and some are not, presumably because the request timed out. Check the coverage information in the response, and run with traceLevel=3 to see more details.

    Some more background info on what's going on:

    Searches are executed in two phases. First, minimal information on each hit is returned from each content node up to the issuing container. These partial lists are then merged to produce the final list of matches, whose length is given by the hits parameter. For those matches, phase two is executed, which fills in the content of the final hits. This involves another request to each of the content nodes to fetch the relevant content.

    If there is little time left, a lot of data, expensive summary features to compute, a slow disk subsystem or network, or a node in some kind of trouble, the fill phase may time out, leaving only some hits filled. That is what you are seeing.
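
To make the two phases concrete, here is a toy model in Python. It is only a sketch of the idea described above, not Vespa's implementation: the function name `two_phase_search`, the `fill_budget` parameter (standing in for "how many fill requests finish before the timeout"), and the sample data are all hypothetical.

```python
import heapq


def two_phase_search(nodes, hits, fill_budget):
    """Toy model of a two-phase search (hypothetical, not Vespa code).

    nodes:       {node_name: {local_id: (relevance, summary_dict)}}
    hits:        number of hits the container should return
    fill_budget: number of per-node fill requests that complete in time
    """
    # Phase 1: every node returns only minimal data for its best matches.
    partial = []
    for node, docs in nodes.items():
        candidates = ((rel, node, lid) for lid, (rel, _) in docs.items())
        partial.extend(heapq.nlargest(hits, candidates))

    # The container merges the partial lists into the final top-`hits` list.
    merged = heapq.nlargest(hits, partial)

    # Phase 2: fill the final hits, one request per node, until the
    # simulated time budget is exhausted.
    filled_nodes = set()
    results = []
    for rel, node, lid in merged:
        if node not in filled_nodes and len(filled_nodes) < fill_budget:
            filled_nodes.add(node)
        if node in filled_nodes:
            summary = nodes[node][lid][1]
            results.append(
                {"id": summary["documentid"], "relevance": rel, "fields": summary}
            )
        else:
            # Unfilled hit: only an internally generated id is available.
            results.append({"id": f"index:{node}/{lid}", "relevance": rel})
    return results


if __name__ == "__main__":
    sample_nodes = {
        "node0": {
            1: (0.9, {"documentid": "id:ns:doc::1"}),
            2: (0.5, {"documentid": "id:ns:doc::2"}),
        },
        "node1": {7: (0.8, {"documentid": "id:ns:doc::7"})},
    }
    for hit in two_phase_search(sample_nodes, hits=2, fill_budget=1):
        print(hit)
```

With `fill_budget=1`, the hit from `node0` comes back with its `fields`, while the hit from `node1` has only a relevance score and an `index:...` id, mirroring the two response shapes in the question.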

    Why are the ids not the true document ids in these cases? The document id string is stored in the document blob on disk but not in memory as an attribute, so it also has to be fetched in the fill phase. If it is not fetched, an internally generated unique id is used instead.
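
To act on the advice above, a few small helpers for inspecting a response might look like the following. This is a minimal sketch that works on the plain JSON dict of a search response (with pyvespa, I am assuming this is what `response.json` gives you). The helper names `split_hits`, `coverage_summary` and `with_trace` are made up for illustration, and the `root.coverage`, `root.errors` and `root.children` keys are what I would expect in Vespa's default JSON result format.

```python
def split_hits(response_json):
    """Split the hits of a search response into (filled, unfilled).

    A hit is treated as filled when it carries a 'fields' object.
    """
    hits = response_json.get("root", {}).get("children", [])
    filled = [h for h in hits if "fields" in h]
    unfilled = [h for h in hits if "fields" not in h]
    return filled, unfilled


def coverage_summary(response_json):
    """Collect the parts of the response that hint at timeouts."""
    root = response_json.get("root", {})
    coverage = root.get("coverage", {})
    return {
        "coverage_percent": coverage.get("coverage"),
        # 'degraded' is expected only when the result is incomplete.
        "degraded": coverage.get("degraded"),
        "errors": root.get("errors", []),
    }


def with_trace(query_body, level=3):
    """Return a copy of a query body with traceLevel set."""
    return dict(query_body, traceLevel=level)
```

Typical use would be to re-run one of the affected queries with `with_trace(query)`, then look at `coverage_summary(response.json)` and at the ids of the hits in `split_hits(response.json)[1]`. Hits whose id starts with `index:` are the ones that were never filled.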