
Why does Retrieve and Rank ignore my indexes when querying a collection?


We have a Solr collection in Retrieve and Rank which contains, among others, a field called document_sub_type. This field is indexed in the Solr schema, but does not have a field type value (I understand that fields intended to be used by the ranker must have a field type value of "Watson_text_en"; this field does not). We want to filter results on this document_sub_type metadata field.

If I send the query power systems client reference AND (document_sub_type:"Client Reference*" OR document_sub_type:"Case Study*") to the /select endpoint of R&R, I get back only documents with a document_sub_type value of "Client Reference Book" or "Client Reference Brief", just as expected. However, if I send the same query to the /fcselect endpoint, the returned documents can apparently have any document_sub_type value.

I will admit that our ranker is not fully trained, but this occurs even if we omit the ranker from the query.

Why does /fcselect ignore the metadata part of the query?

Here are the full response bodies from the two queries:

From /select:

{
  "responseHeader": {
    "status": 0,
    "QTime": 2,
    "params": {
      "q": "power systems client reference AND (document_sub_type:\"Client Reference*\" OR document_sub_type:\"Case Study*\")",
      "fl": "document_sub_type",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 89,
    "start": 0,
    "docs": [
      {
        "document_sub_type": "Client Reference Book"
      },
      {
        "document_sub_type": "Client Reference Brief"
      },
      {
        "document_sub_type": "Client Reference Brief"
      },
      {
        "document_sub_type": "Client Reference Brief"
      },
      {
        "document_sub_type": "Client Reference Book"
      },
      {
        "document_sub_type": "Client Reference Brief"
      },
      {
        "document_sub_type": "Client Reference Brief"
      },
      {
        "document_sub_type": "Client Reference Brief"
      },
      {
        "document_sub_type": "Client Reference Brief"
      },
      {
        "document_sub_type": "Client Reference Brief"
      }
    ]
  }
}

From /fcselect:

{
  "responseHeader": {
    "status": 0,
    "QTime": 65,
    "params": {
      "q": "power systems client reference AND (document_sub_type:\"Client Reference*\" OR document_sub_type:\"Case Study*\")",
      "ranker_id": "c852c8x19-rank-422",
      "fl": "document_sub_type",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 39428,
    "start": 0,
    "maxScore": 10,
    "docs": [
      {
        "document_sub_type": "Sales guidance"
      },
      {
        "document_sub_type": "Other sales tool or Utility"
      },
      {
        "document_sub_type": "Client Reference Book"
      },
      {
        "document_sub_type": "Client Reference Brief"
      },
      {
        "document_sub_type": "Client Reference Book"
      },
      {
        "document_sub_type": "At a Glance"
      },
      {
        "document_sub_type": "Brief or Template for Marketing"
      },
      {
        "document_sub_type": "text/plain"
      },
      {
        "document_sub_type": "Brief or Template for Marketing"
      },
      {
        "document_sub_type": "QRG"
      }
    ]
  }
}

Solution

  • The /fcselect endpoint does not support combining terms with Boolean operators in the query parameter (q) itself. For this type of operation, you should be able to use filter queries (the fq parameter) to get the expected results. See the documentation for details: https://www.ibm.com/watson/developercloud/doc/retrieve-rank/plugin_query_syntax.shtml#top
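
As a rough illustration of that suggestion, the sketch below splits the original query into a plain-text q and a separate fq filter, then builds the query string. It only constructs the parameters and does not call the service. The helper name build_fcselect_params is hypothetical, and whether /fcselect accepts an OR expression inside fq exactly as written here is an assumption that should be checked against the linked documentation.

```python
from urllib.parse import urlencode


def build_fcselect_params(question, filter_clauses, ranker_id=None,
                          fields="document_sub_type"):
    """Hypothetical helper: keep the free-text question in q and move
    the metadata restrictions into a Solr filter query (fq)."""
    params = {
        "q": question,                       # free text only, no boolean operators
        "fq": " OR ".join(filter_clauses),   # metadata filter, applied separately
        "fl": fields,
        "wt": "json",
    }
    if ranker_id:
        params["ranker_id"] = ranker_id
    return params


# The filter clauses are copied from the original query in the question.
params = build_fcselect_params(
    "power systems client reference",
    ['document_sub_type:"Client Reference*"',
     'document_sub_type:"Case Study*"'],
    ranker_id="c852c8x19-rank-422",
)

# URL-encoded query string to append to the collection's /fcselect URL.
query_string = urlencode(params)
print(query_string)
```

With this split, the text that the ranker sees stays free of field syntax, while the restriction on document_sub_type is expressed as a filter, which is the approach the answer points to.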