Tags: python, elasticsearch, duplicates, elasticsearch-py, minhash

Why does my query using a MinHash analyzer fail to retrieve duplicates?


I am trying to query an Elasticsearch index for near-duplicates using its MinHash implementation. I use the Python client running in containers to index and perform the search.

My corpus is a JSONL file a bit like this:

{"id":1, "text":"I'd just like to interject for a moment"}
{"id":2, "text":"I come up here for perception and clarity"}
...

I create an Elasticsearch index successfully, trying to use custom settings and analyzer, taking inspiration from the official examples and MinHash docs:

def create_index(client):
    client.indices.create(
        index="documents",
        body={
            "settings": {
                "analysis": {
                    "filter": {
                        "my_shingle_filter": {
                            "type": "shingle",
                            "min_shingle_size": 5,
                            "max_shingle_size": 5,
                            "output_unigrams": False
                        },
                        "my_minhash_filter": {
                            "type": "min_hash",
                            "hash_count": 10,
                            "bucket_count": 512,
                            "hash_set_size": 1,
                            "with_rotation": True
                        }
                    },
                    "analyzer": {
                        "my_analyzer": {
                            "tokenizer": "standard",
                            "filter": [
                                "my_shingle_filter",
                                "my_minhash_filter"
                            ]
                        }
                    }
                }
            },
            "mappings": {
                "properties": {
                    "name": {"type": "text", "analyzer": "my_analyzer"}
                }
            },
        },
        ignore=400,
    )

I verified via Kibana that index creation raised no obvious problems, and visiting http://localhost:9200/documents/_settings returns something that seems in order:

[Screenshot: JSON returned by the _settings endpoint]

However, querying the index with:

def get_duplicate_documents(body, K, es):
    doc = {
        '_source': ['_id', 'body'],
        'size': K,
        'query': {
            "match": {
                "body": {
                    "query": body,
                    "analyzer" : "my_analyzer"
                }
            }
        }
    }

    res = es.search(index='documents', body=doc)
    top_matches = [hit['_source']['_id'] for hit in res['hits']['hits']]

res['hits'] is consistently empty, even when I set body to exactly the text of one of the entries in my corpus. In other words, I get no results for values of body such as

"I come up here for perception and clarity"

or substrings like

"I come up here for perception"

while ideally, I'd like the procedure to return near-duplicates, with a score being an approximation of the Jaccard similarity of the query and the near-duplicates, obtained via MinHash.

Is there something wrong with my query and/or the way I index into Elasticsearch? Am I missing something else entirely?

P.S.: You can have a look at https://github.com/davidefiocco/dockerized-elasticsearch-duplicate-finder/tree/ea0974363b945bf5f85d52a781463fba76f4f987 for a non-functional, but hopefully reproducible example (I will also update the repo as I find a solution!)


Solution

  • Here are some things that you should double-check as they are likely culprits:

    • when you create your mapping, rename the field from "name" to "text" in the body parameter of client.indices.create, because your JSON documents have a field called text, not name:

        "mappings": {
            "properties": {
                "text": {"type": "text", "analyzer": "my_analyzer"}
            }
        }

    • in the indexing phase, you could also rework your generate_actions() method, following the documentation, into something like:

      for elem in corpus:
        yield {
            "_op_type": "index",
            "_index": "documents",
            "_id": elem["id"],
            # _source must be a JSON object; its key must match the mapped field
            "_source": {"text": elem["text"]},
        }
      

      Incidentally, if you are indexing pandas dataframes, you may want to check the experimental official library eland.

    • Also, according to your mapping, you are using a min_hash token filter, so Lucene stores hashes of your text in the text field rather than the original words. Matching therefore happens on those hashes, not on the raw string, so a query like "I come up here for perception and clarity" only matches if it is run against the field that was actually analyzed that way. The most reliable way to use it is to retrieve the content of a document's text field and then query Elasticsearch with that same value. In addition, the _id metadata field sits alongside _source in each hit, not inside it, so you should change your get_duplicate_documents() method to:

      def get_duplicate_documents(body, K, es):
        doc = {
            '_source': ['text'],
            'size': K,
            'query': {
                "match": { 
                    "text": { # I changed this line!
                        "query": body
                    }
                }
            }
        }
      
        res = es.search(index='documents', body=doc)
        # also changed the list comprehension: _id lives on the hit, not in _source
        top_matches = [(hit['_id'], hit['_source']) for hit in res['hits']['hits']]
        return top_matches
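
To make the point about shingles and hashes more concrete, here is a minimal pure-Python sketch of the quantity the MinHash score is meant to approximate: the Jaccard similarity between the sets of 5-word shingles of two texts. It does not call Elasticsearch at all. The helpers shingles() and jaccard() are hypothetical names introduced for illustration, and splitting on whitespace is only a rough stand-in for the standard tokenizer, so the real analyzer may tokenize some inputs differently.

```python
def shingles(text, size=5):
    """Return the set of word shingles of exactly `size` words.

    Assumption: whitespace splitting approximates the standard tokenizer.
    """
    words = text.split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc = "I come up here for perception and clarity"
query = "I come up here for perception"

doc_shingles = shingles(doc)      # 8 words give 4 shingles of 5 words
query_shingles = shingles(query)  # 6 words give 2 shingles of 5 words

print(jaccard(doc_shingles, query_shingles))  # 0.5
print(shingles("I come up here"))             # set(): fewer than 5 words
```

With min_shingle_size and max_shingle_size both set to 5 and output_unigrams disabled, a text shorter than five words yields no shingles in this sketch, which suggests that very short queries may give the min_hash filter nothing to hash.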