elasticsearch sequence fuzzy-search

Elasticsearch compare long sequence strings with fuzzy query


I have two long String sequences that are similar:

C50FD711C2C43287351892A4D82F44B055F048C46D2C54197AC1D1E921F11E6699C4057C4B93907518E6DCA51A672D3D3E419160DAE276CB7716D11B94D8C3BB2E4A591329B7AF973D17A7F9336342FFAAFD4D

and

C50FD711C2C43287351892A4D820B5EAC5F048C1E67CAC197AC1D1E921F11C3623C1DCD6493907518E6DCA18CD71016E7FD1160DAE276CB7716D11B94A6B762E4A591329B7AF973D17A7F9336342FFAAFD4D

The edit distance between them is 41. I would like to find strings that are similar to each other. I started with a query like this:

GET my_index/_type/_search
{
    "query": {
        "fuzzy": {
            "sequence.keyword": {
                "value": "C50FD711C2C43287351892A4D820B5EAC5F048C1E67CAC197AC1D1E921F11C3623C1DCD6493907518E6DCA18CD71016E7FD1160DAE276CB7716D11B94A6B762E4A591329B7AF973D17A7F9336342FFAAFD4D",
                "boost": 1.0,
                "fuzziness": 50,
                "prefix_length": 10,
                "max_expansions": 200
            }
        }
    }
}

I tried both sequence.keyword and sequence (the field is mapped as both text and keyword). However, neither query found the other, similar sequence string in my index. Why?


Solution

  • The answer is pretty simple: the maximum edit distance allowed in a fuzzy query is 2 (as can be seen in the source code for the Fuzziness class), so a fuzziness of 50 cannot match two strings that are 41 edits apart.

    You can verify this with a simpler value: if you index AAAAAA and search for AAABBB with fuzziness: 3, you'll get nothing.
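
To see why the small example returns nothing, here is a minimal Python sketch of the classic Levenshtein distance applied to that pair. The helper name `levenshtein` and the constant `MAX_FUZZY_EDITS` are hypothetical and not part of any Elasticsearch API. Elasticsearch's fuzzy query also counts a transposition of two adjacent characters as a single edit by default, which this sketch does not model, but that makes no difference for this pair.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


# Hypothetical constant: the limit of 2 edits stated in the answer above.
MAX_FUZZY_EDITS = 2

distance = levenshtein("AAAAAA", "AAABBB")
print(distance, distance <= MAX_FUZZY_EDITS)  # prints: 3 False
```

The same check applied to the two long sequences from the question would show how far they are from that limit, so comparing them probably has to happen outside the fuzzy query, for example in application code like the sketch above.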