Tags: ruby-on-rails, ruby, elasticsearch, tire

Elasticsearch exact match with dashes


We have an index of domain names in Elasticsearch (we use the tire gem with Ruby to connect to and maintain it), but we are having trouble with exact searches.

If I search for the term google.com in domains, it brings back google.com, but it also brings back any domain containing a dash (-), such as in-google.com. My research leads me to believe that - is a wildcard in ES and that all I need to do is set not_analyzed, but that doesn't work.

    :domain   => { :type => 'string', :analyzer => 'whitespace' },
    :domain_2 => { :type => 'string', :analyzer => 'pattern' },
    :domain_3 => { :type => 'string', :index => 'not_analyzed' },
    :domain_4 => { :type => 'string', :analyzer => 'snowball' }

I've tried different analysers, as you can see above, but they all have the same issue when searched using the 'head' plugin.

The code I'm using to generate the test dataset is at https://gist.github.com/anonymous/8080839. What I'm looking for is the ability to search for JUST google, and if I want *google, I can implement my own wildcard?

I'm resigned to the fact that I'm going to have to delete and regenerate my index, but no matter which analyser or type I choose, I still cannot get an exact match.


Solution

  • You're not showing the sample queries you are using. Are you sure your queries and your indexing use the same text processing?

    Also, you may want to check out the multi_field approach to analyzing things in multiple ways.

    I've made a runnable example with a bunch of different queries that illustrate this. Note that the domain has been indexed in two ways, and note which field the queries are hitting: https://www.found.no/play/gist/ecc52fad687e83ddcf73

    #!/bin/bash
    
    export ELASTICSEARCH_ENDPOINT="http://localhost:9200"
    
    # Create indexes
    
    curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
        "mappings": {
            "type": {
                "properties": {
                    "domain": {
                        "type": "multi_field",
                        "fields": {
                            "domain": {
                                "type": "string",
                                "analyzer": "standard"
                            },
                            "whitespace": {
                                "type": "string",
                                "analyzer": "whitespace"
                            }
                        }
                    }
                }
            }
        }
    }'
    
    
    # Index documents
    curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
    {"index":{"_index":"play","_type":"type"}}
    {"domain":"google.com"}
    {"index":{"_index":"play","_type":"type"}}
    {"domain":"in-google.com"}
    '
    
    # Do searches
    
    # Matches both
    curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
    {
        "query": {
            "match": {
                "_all": "google.com"
            }
        }
    }
    '
    
    # Also matches "google.com". in-google.com gets tokenized to ["in", "google.com"]
    # and the default match operator is `or`.
    curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
    {
        "query": {
            "match": {
                "domain": {
                    "query": "in-google.com"
                }
            }
        }
    }
    '
    
    # What terms are generated? (Answer: `google.com` and `in`)
    curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
    {
        "size": 0,
        "facets": {
            "domain": {
                "terms": {
                    "field": "domain"
                }
            }
        }
    }
    '
    
    # This should just match the second document.
    curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
    {
        "query": {
            "match": {
                "domain.whitespace": {
                    "query": "in-google.com"
                }
            }
        }
    }
    '
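
To make the comments in the script above concrete, here is a minimal Ruby sketch of why a `match` query for google.com also hits in-google.com on the `domain` field but not on `domain.whitespace`. It does not talk to Elasticsearch. `WHITESPACE`, `STANDARD_LIKE`, and `hits` are hypothetical names, and `STANDARD_LIKE` is only a rough regex approximation of the tokenization described above (in-google.com becoming `in` and `google.com`), not the real standard analyzer.

```ruby
# Stand-in for the whitespace analyzer: split on whitespace only,
# so dashes and dots stay inside the token.
WHITESPACE = ->(text) { text.split }

# ASSUMPTION: a rough imitation of the standard analyzer for simple ASCII
# domain names. It lowercases, keeps dots between alphanumeric runs, and
# breaks on everything else (including dashes).
STANDARD_LIKE = ->(text) { text.downcase.scan(/[a-z0-9]+(?:\.[a-z0-9]+)*/) }

DOMAINS = ["google.com", "in-google.com"].freeze

# Imitates a match query with the default `or` operator: a document is a
# hit when it shares at least one token with the analyzed query.
def hits(query, documents, analyzer)
  query_tokens = analyzer.call(query)
  documents.select { |doc| (analyzer.call(doc) & query_tokens).any? }
end

puts STANDARD_LIKE.call("in-google.com").inspect    # ["in", "google.com"]
puts WHITESPACE.call("in-google.com").inspect       # ["in-google.com"]

puts hits("google.com", DOMAINS, STANDARD_LIKE).inspect  # ["google.com", "in-google.com"]
puts hits("google.com", DOMAINS, WHITESPACE).inspect     # ["google.com"]
```

With the whitespace-style tokens, the whole domain is a single term, so only an identical domain matches. That is the behaviour the `domain.whitespace` sub-field is meant to provide, as long as the query targets that field rather than `domain` or `_all`.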