elasticsearch elasticsearch-5 elasticsearch-dsl

Elasticsearch exact search with fuzzy search

I have an index that contains a company name, an abbreviation for the company, and a description of what the company does (the index schema is below). An example document in this index:

{
  "abbreviation": "APPL",
  "name": "Apple",
  "description": "Computer software and hardware"
}

Users typically search by abbreviation. Sometimes they mistype it, and Elasticsearch's fuzzy matching handles that case well. Most of the time, however, they type the abbreviation exactly; the best matches come back at the top of the response, but so do some junk documents with low scores (greater than 0). I have tried tuning min_score in queries, but it is difficult to choose a value because scores fluctuate a lot.

Is there a way to drop documents that are not an exact match on the abbreviation field, while keeping fuzzy matching as a fallback for when no exact match is found or when the user searches other fields (e.g. name and description)?

Here are a couple of examples:

Querying for just AAPL yields 3 results. The first two are exact matches for the query and so have a fairly high score; ADP is somewhat similar but clearly isn't what the user searched for.

{
  "abbreviation": "APPL",
  "name": "Apple, Inc.",
  "description": "Computer software and hardware"
},
{
  "abbreviation": "APPL",
  "name": "Apple, Inc.",
  "description": "Computer software and hardware"
},
{
  "abbreviation": "ADP",
  "name": "Automatic Data Processing, Inc",
  "description": "Computer software and hardware"
}

Querying for Apple, the top few entries are again highly relevant, but then some other company names show up.

{
  "abbreviation": "APPL",
  "name": "Apple, Inc.",
  "description": "Computer software and hardware"
},
{
  "abbreviation": "APPL",
  "name": "Apple, Inc.",
  "description": "Computer software and hardware"
},
{
  "abbreviation": "CSCO",
  "name": "AppDynamics (Cisco subsidiary)",
  "description": "Computer software"
}

The index schema:

{
  "settings": {
    "index": {
      "requests.cache.enable": true
    }
  },
  "mappings": {
    "properties": {
      "abbreviation_and_name": {
        "type": "text",
        "boost": 2
      },
      "abbreviation": { "type": "text", "copy_to": "abbreviation_and_name", "boost": 20 },
      "name": { "type": "text", "copy_to": "abbreviation_and_name" },
      "description": { "type": "text" }
    }
  }
}

Solution

First, I'd probably question why the following document should be brought back when searching for AAPL:

{
  "abbreviation": "ADP",
  "name": "Automatic Data Processing, Inc",
  "description": "Computer software and hardware"
}

Second, I'd recommend removing the boost settings from the index mappings; boosting at query time is the recommended approach.

But overall, I believe you might simply want an OR query, i.e. a bool query with should clauses:

{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "abbreviation": {
              "query": "AAPL",
              "boost": 2
            }
          }
        },
        {
          "multi_match": {
            "query": "AAPL",
            "fields": ["name", "description"],
            "fuzziness": "AUTO"
          }
        }
      ]
    }
  }
}

This might not yield exactly the results you described, but I believe it should work fine for your use case.
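
If the junk results must be removed entirely rather than just ranked lower, one option is to do it client-side in two phases: run a non-fuzzy query on abbreviation first, and only fall back to a fuzzy query when it returns nothing. Below is a minimal sketch of that idea, not a definitive implementation. The function names and the run_search callable are hypothetical; run_search is assumed to send a query body to Elasticsearch and return the list of hits. The fallback is a variation of the query above that also searches abbreviation fuzzily, so mistyped abbreviations can still match.

```python
from typing import Any, Callable, Dict, List

Query = Dict[str, Any]


def exact_abbreviation_query(text: str) -> Query:
    """Phase 1: non-fuzzy match on the abbreviation field only."""
    return {"query": {"match": {"abbreviation": {"query": text}}}}


def fuzzy_fallback_query(text: str) -> Query:
    """Phase 2: fuzzy match across all fields, boosting abbreviation at query time."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                # "abbreviation^2" applies a query-time boost of 2 to that field.
                "fields": ["abbreviation^2", "name", "description"],
                "fuzziness": "AUTO",
            }
        }
    }


def search_with_fallback(
    run_search: Callable[[Query], List[dict]], text: str
) -> List[dict]:
    """Return exact abbreviation hits if there are any, otherwise fuzzy hits.

    run_search is a hypothetical callable supplied by the caller; it is
    assumed to execute the query body and return the list of hits.
    """
    hits = run_search(exact_abbreviation_query(text))
    if hits:
        return hits
    return run_search(fuzzy_fallback_query(text))
```

The trade-off is a second round trip whenever the first phase finds nothing. Also note that abbreviation is mapped as text, so "exact" here means an exact match on the analyzed token; with the default analyzer that makes the match case-insensitive.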