
Queries are very slow in local vespa


I am having difficulty executing a Vespa query correctly. I want to query two different index fields with OR between them, the equivalent of an Elasticsearch match query.

I got a lot of soft timeouts, so I increased the timeout to get the true result and check how much time it took.

This is the query I sent:

{
    "name": "some name",
    "timeout": 10,
    "traceLevel": 4,
    "ranking": {
        "profile": "addition_score"
    },
    "hits": 2,
    "address__street_address": "some street address",
    "yql": "select * from sources * where   (([{\"grammar\":\"any\",\"defaultIndex\":\"address__street_address\"}] userInput(@address__street_address))  or ([{\"grammar\":\"any\",\"defaultIndex\":\"name\"}] userInput(@name))) and address__address_region contains \"us-ca\"   ;"}
}

And here is the trace data from the second run:

{
                                                "timestamp_ms": 4978.7203,
                                                "tag": "match_threads",
                                                "threads": [
                                                    {
                                                        "traces": [
                                                            {
                                                                "timestamp_ms": 12.1874,
                                                                "event": "Start MatchThread::run"
                                                            },
                                                            {
                                                                "timestamp_ms": 12.2481,
                                                                "event": "Start match and first phase rank"
                                                            },
                                                            {
                                                                "timestamp_ms": 4976.272,
                                                                "event": "Start second phase rerank"
                                                            },
                                                            {
                                                                "timestamp_ms": 4977.3873,
                                                                "event": "Create result set"
                                                            },
                                                            {
                                                                "timestamp_ms": 4978.6731,
                                                                "event": "Start thread merge"
                                                            },
                                                            {
                                                                "timestamp_ms": 4978.6816,
                                                                "event": "MatchThread::run Done"
                                                            }
                                                        ]
                                                    }
                                                ]
                                            }
                                        ],
                                        "distribution-key": 0,
                                        "duration_ms": 4978.8584
                                    }
                                ]
                            }

From my understanding, this means that matching and first-phase ranking took about 5 seconds, which seems strangely large to me.

I run Vespa in Docker on my local machine with 8 GB of RAM and have about 40 million documents. The schema is the following:

schema organization {
    document organization {
        field name type string {
            indexing: index | summary
            weight : 100
        }
        field url type string {
            indexing: index
            weight : 10
        }
        field naics type string {
            indexing: attribute
            weight : 10
        }
        field number_of_employees type int {
            indexing: attribute
            weight : 1
        }
        field is_hq type bool {
            indexing: attribute
            weight : 1
        }
        field address__address_country type string {
            indexing: attribute | summary
            weight : 10
        }
        field address__address_region type string {
            indexing: attribute | summary
            weight : 20
        }
        field address__address_locality type string {
            indexing: index | summary
            weight : 50
        }
        field address__postal_code type string {
            indexing: attribute
            weight : 70
        }
        field address__street_address type string {
            indexing: index | summary
            weight : 100
        }
        field duns type string {
            indexing: attribute | summary
            weight : 1000
        }
        field density type int {
            indexing: attribute
            weight : 1
        }
    }

    fieldset default {
        fields: name
    }
    rank-profile native {
        first-phase {
            expression: nativeRank(name,address__street_address)
        }
        second-phase {
           expression: fieldMatch(name) * fieldMatch(address__street_address)
        }
        summary-features {
             nativeRank(name)
             nativeRank(address__street_address)
             fieldMatch(name)
             fieldMatch(address__street_address)
        }
    }
     rank-profile addition_score {
        first-phase {
           expression: nativeRank(address__street_address,name)*(attribute(density)/11)
        }
        second-phase {
            expression: (fieldMatch(name)*100 + fieldMatch(address__street_address)*100+attributeMatch(address__address_country)*10+fieldMatch(address__address_locality)*20+attributeMatch(address__postal_code)*50 +attribute(density))/(290)
        }
        summary-features {
             attributeMatch(duns)
             fieldMatch(name)
             fieldMatch(address__address_locality)
             fieldMatch(address__street_address)
             attributeMatch(address__address_country)
             attributeMatch(address__postal_code)
        }
    }
}

What did I do wrong?


Solution

  • Background: Vespa has both the index and the attribute concept for indexing, exemplified in the schema below:

    schema doc {
      document doc {
        field license type string {
          indexing: summary | attribute
        }
        field title type string {
          indexing: summary | index
        }
      }
      rank-profile my-ranking {
        first-phase { expression: nativeRank(title) }
      }
    }
    

    The title field will be tokenized and stemmed because of the default match setting, which is the type of processing you expect for text matching. By default, index fields are matched using text matching (per unit/token/atom). Index fields can be searched in potentially sub-linear time thanks to inverted index structures.
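
    For illustration, the implicit defaults for an index field can be written out explicitly. This is only a sketch; the match and stemming settings below are my assumption of what the defaults correspond to, they are not part of the schema above:

     field title type string {
          indexing: summary | index
          # assumed defaults for an index field, spelled out explicitly:
          # tokenized, per-token text matching
          match: text
          # stem terms at indexing and query time
          stemming: best
     }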

    The license field is defined without index, but with attribute. An attribute field is kept in memory at all times. Its default matching mode differs from index and is geared toward exact matching: no tokenization and no stemming. Also, by default there are no inverted-index-like structures for attribute fields. One can add them with attribute: fast-search, at the cost of higher memory usage and slower overall indexing, which is the primary reason attribute fields in Vespa do not use fast-search by default.

    Now, let us look at what this means for search performance. Assume we have 10M documents stored in a Vespa cluster using the simplified schema:

    /search/?query=license:unk&yql=select id,license from doc where userQuery();&ranking=my-ranking
    

    The above request searches the license attribute field, but since it has no inverted-index-like structures, the search is linear over all 10M documents. In other words, the search process traverses all 10M documents, reads the value of the license field from memory and compares it with the input query term; if the term matches the value, the document is exposed to the ranking profile. This is not particularly efficient. If we change the attribute definition to also include fast-search, Vespa will build inverted-index-like data structures:

     field license type string {
          indexing: summary | attribute
          attribute: fast-search
     }
    

    If we deploy this change and follow the procedure for changes that require a restart, our query will use the inverted-index structure. The search process then changes from traversing all documents linearly to looking up the term in the dictionary and traversing its posting list. If the term ‘unk’ occurs in fewer than 10M documents, the search becomes sub-linear and fewer than 10M documents are exposed to ranking.

    The above was a simplified example; what if we have a more complex query searching multiple fields?

    /search/?query=license:unk title:foo&yql=select id,license from doc where userQuery();&ranking=my-ranking
    

    In the above example, we search using AND between license:unk and a requirement that title contains the term foo (tokenized text match). In this case (and others), the search process builds a query execution plan for how to efficiently match the query against the indexes. When the license field does not have fast-search, the query plan assumes the term occurs in all documents (worst case). However, since we include a term for a field that has an index defined, the planner knows an upper bound on the number of hits, and the overall search becomes sub-linear. Had we used OR to combine license:unk with title:foo, the search would become linear, since we are asking for documents where either license:unk matches OR the token foo occurs in the title (logical disjunction).
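
    As a sketch of the disjunction case, the OR query could be written directly in YQL against the simplified schema (URL encoding omitted for readability):

    /search/?yql=select id,license from doc where license contains "unk" or title contains "foo";&ranking=my-ranking

    Since either side of the OR may match on its own, and the license attribute has no fast-search structures, the whole disjunction has to be evaluated linearly.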

    How to debug matching and ranking performance of single queries?

    Run a representative query with the intended ranking profile and compare the totalCount with the number of documents searched. This information is provided in the result template (see the coverage element for the number of documents searched). The overall cost of a query depends heavily on totalCount: a higher totalCount means more documents are exposed to first-phase ranking. If a query is slow but retrieves relatively few documents, it is a clear indication that matching hit a linear scan path. Ranking complicates this, however, as the complexity of the first-phase ranking expression also impacts performance. Vespa has a built-in ‘unranked’ ranking profile which can be used to separate matching cost from ranking cost: using it gives the lower bound of search performance without any first-phase ranking, so matching performance can be debugged in isolation.
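
    As an example of this technique, the same request can be issued with the built-in unranked profile (a sketch, reusing the example request from above; hits=0 is set here so no summaries need to be filled):

    /search/?query=license:unk title:foo&yql=select id,license from doc where userQuery();&ranking=unranked&hits=0

    If this request is still slow while returning a modest totalCount, matching itself (for example a linear attribute scan) is the bottleneck; if it is fast but the ranked request is slow, the cost sits in the first-phase ranking expression.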

    Query tracing:

    The content nodes' query plan for a query can be inspected by adding &tracelevel=6&trace.timestamps=1 to the search request. The query blueprint from every content node involved in the query is then included in the traced response. In the blueprint query tree trace there is a docid_limit, which is the number of documents indexed (counting from 1), so in our case with 10M documents it would be 10M + 1. If the estHits on the top root of the query tree equals the docid_limit, the overall complexity is linear.
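
    For example, the trace parameters can be appended to the earlier request (same simplified schema):

    /search/?query=license:unk title:foo&yql=select id,license from doc where userQuery();&ranking=my-ranking&tracelevel=6&trace.timestamps=1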

    Query blueprint example: [trace screenshot showing linear matching]

    In this example, the top root of the query tree estimates that the total number of hits will be equal to the docid_limit (which is the number of documents indexed). This indicates linear matching.

    [trace screenshot showing sub-linear matching]

    In this example, the top root of the query tree estimates a much lower number of hits, due to the presence of a must-have term in the query tree whose field has either index or attribute: fast-search. This restricts the query to match fewer documents than the number indexed, and matching and ranking become sub-linear.