java hibernate hibernate-search

Hibernate Search not returning the closest matches for a group of words


I have a Hibernate Search endpoint that needs to return the closest matches for a group of words. When I run a search, the closest matches do not appear in the first 10 results. Below is the Hibernate Search snippet:

// Build a keyword query on the "arg" field of the Test entity
FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(entityManager);
QueryBuilder qb = fullTextEntityManager.getSearchFactory()
        .buildQueryBuilder().forEntity(Test.class).get();
org.apache.lucene.search.Query luceneQuery = qb.keyword().onFields("arg")
        .matching(searchTerm).createQuery();
// Wrap the Lucene query so it can be executed as a JPA query
javax.persistence.Query jpaQuery = fullTextEntityManager.createFullTextQuery(luceneQuery, Test.class);

How can I return the closest matches for a group of words?


Solution

  • While full-text search can return "close matches" (e.g. to account for typos), you still need to opt in.

    For approximate matches, you have two solutions:

    1. Use "fuzzy" queries: this solution is limited and not very configurable, but simple to set up.
    2. Configure an analyzer. More configurable, but requires a little bit more knowledge.
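
To make option 1 more concrete: a fuzzy query matches indexed terms that lie within a small edit distance of the query term. The sketch below is plain Java, not Hibernate Search code. The class `EditDistanceSketch` and its methods are hypothetical names used for illustration, and it uses classic Levenshtein distance, which may count edits differently from the search engine's own implementation (for example, a swap of two adjacent letters counts as two edits here).

```java
final class EditDistanceSketch {

    // Classic Levenshtein distance: the minimum number of single-character
    // insertions, deletions, or substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical helper: a term "fuzzy-matches" when it is within maxEdits.
    static boolean fuzzyMatches(String queryTerm, String indexedTerm, int maxEdits) {
        return levenshtein(queryTerm, indexedTerm) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("qick", "quick"));      // 1
        System.out.println(levenshtein("borwn", "brown"));     // 2
        System.out.println(fuzzyMatches("qick", "quick", 1));  // true
        System.out.println(fuzzyMatches("borwn", "brown", 1)); // false
    }
}
```

In Hibernate Search 5 the query DSL exposes fuzzy matching through `keyword().fuzzy()`, so nothing like this has to be written by hand; the sketch only shows why "qick" is an easy fuzzy match while "borwn" needs a larger tolerance.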

    If you go with solution #2, I suggest you first familiarize yourself with full-text search concepts in the Hibernate Search documentation. (The Hibernate Search 6 documentation explains them well, and the concepts are the same as in Hibernate Search 5.)

    Then have a look at how to configure an analyzer in Hibernate Search 5.

    Now you should have a better idea of what analyzers are: they transform the text, both when indexing and when querying, into tokens that will be matched exactly. Approximate matches are achieved through an approximate transformation: if analysis transforms "Résumé" into "resume", then the query "resume" will match a document containing "Résumé".

    For example:

    Document: "Quick Brown Fox" => "quick", "brown", "fox"
    Queried: "Qick borwn fox" => "qick", "borwn", "fox"
    Matching: "fox"
    

    There are typos in the query. The document should rank high in the search hits, but it won't, because only one term matches: "fox".
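
The behaviour described above can be reproduced with a small plain-Java sketch. It is not how Hibernate Search or Lucene implement analysis; `AnalysisSketch`, `analyze`, and `matchingTokens` are hypothetical names, and the analyzer here (strip accents, lowercase, split on non-alphanumeric characters) is just one assumed configuration.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AnalysisSketch {

    // Assumed analysis chain: remove accents, lowercase, split into words.
    static List<String> analyze(String text) {
        String noAccents = Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", ""); // drop combining accent marks
        List<String> tokens = new ArrayList<>();
        for (String token : noAccents.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Tokens are compared exactly: only tokens present on both sides match.
    static Set<String> matchingTokens(String document, String query) {
        Set<String> matches = new LinkedHashSet<>(analyze(query));
        matches.retainAll(analyze(document));
        return matches;
    }

    public static void main(String[] args) {
        // "R\u00e9sum\u00e9" is "Résumé" written with Unicode escapes.
        System.out.println(analyze("R\u00e9sum\u00e9"));                         // [resume]
        System.out.println(analyze("Quick Brown Fox"));                           // [quick, brown, fox]
        System.out.println(matchingTokens("Quick Brown Fox", "Qick borwn fox")); // [fox]
    }
}
```

Because the comparison happens on tokens, any improvement in approximate matching has to come from changing what `analyze` produces, which is what the n-gram approach does.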

    To get even more approximate matches, one strategy is to break down words into what are called "ngrams". To that end, add NGramFilterFactory to your analyzer definition.

    If we set up analysis to break down words into 3-grams, we will get this:

    Document: "quick brown fox" => "qui", "uic", "ick", "bro", "row", "own", "fox"
    Queried: "qick borwn fox" => "qic", "ick", "bor", "orw", "rwn", "fox"
    Matching: "ick", "fox"
    

    Now it's a little better: two terms will match, "ick" and "fox". The document will be higher up in the result list.
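
As a rough illustration of the 3-gram breakdown, here is a plain-Java sketch that does not use Lucene's NGramFilterFactory. `NGramSketch`, `ngrams`, and `sharedGrams` are hypothetical names, and the token order may differ from what a real filter emits; words shorter than the minimum gram size simply produce no tokens in this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class NGramSketch {

    // Lowercase the text, split it on whitespace, and emit every substring
    // of length minGram..maxGram for each word.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            for (int n = minGram; n <= maxGram; n++) {
                for (int i = 0; i + n <= word.length(); i++) {
                    grams.add(word.substring(i, i + n));
                }
            }
        }
        return grams;
    }

    // Grams that appear in both the query and the document, in query order.
    static Set<String> sharedGrams(String document, String query, int minGram, int maxGram) {
        Set<String> shared = new LinkedHashSet<>(ngrams(query, minGram, maxGram));
        shared.retainAll(ngrams(document, minGram, maxGram));
        return shared;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("Quick Brown Fox", 3, 3));
        // [qui, uic, ick, bro, row, own, fox]
        System.out.println(ngrams("Qick borwn fox", 3, 3));
        // [qic, ick, bor, orw, rwn, fox]
        System.out.println(sharedGrams("Quick Brown Fox", "Qick borwn fox", 3, 3));
        // [ick, fox]
    }
}
```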

    Of course, it's not perfect either:

    1. You will now get matches with possibly unrelated documents, such as one containing "fickle" (=> "fic", "ick", "kle"). This should be counterbalanced by sorting by relevance, which brings the best matches near the top of the result list: if users find what they want near the top, they won't mind that other results are irrelevant.
    2. The word "borwn" still wasn't detected as a match. You could add 2-grams on top of the 3-grams, so that "wn" matches, but be careful: you will get even more irrelevant matches.
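
Both trade-offs can be seen with a crude stand-in for relevance scoring: count the distinct n-grams a document shares with the query and sort by that count. This is only a sketch under that assumption. Real engines use more elaborate scoring formulas, and `NGramRankingSketch`, `score`, and `rank` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class NGramRankingSketch {

    // Distinct n-grams of length minGram..maxGram for every word in the text.
    static Set<String> grams(String text, int minGram, int maxGram) {
        Set<String> result = new LinkedHashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            for (int n = minGram; n <= maxGram; n++) {
                for (int i = 0; i + n <= word.length(); i++) {
                    result.add(word.substring(i, i + n));
                }
            }
        }
        return result;
    }

    // Simplified relevance: number of distinct grams shared with the query.
    static int score(String document, String query, int minGram, int maxGram) {
        Set<String> shared = grams(query, minGram, maxGram);
        shared.retainAll(grams(document, minGram, maxGram));
        return shared.size();
    }

    // Best-scoring documents first; ties keep their original order.
    static List<String> rank(List<String> documents, String query, int minGram, int maxGram) {
        List<String> ranked = new ArrayList<>(documents);
        ranked.sort(Comparator.comparingInt(
                (String doc) -> score(doc, query, minGram, maxGram)).reversed());
        return ranked;
    }

    public static void main(String[] args) {
        String query = "qick borwn fox";
        List<String> docs = List.of("fickle", "quick brown fox");
        System.out.println(rank(docs, query, 3, 3));                // [quick brown fox, fickle]
        System.out.println(score("quick brown fox", query, 3, 3)); // 2
        System.out.println(score("fickle", query, 3, 3));          // 1
        System.out.println(score("quick brown fox", query, 2, 3)); // 7
        System.out.println(score("fickle", query, 2, 3));          // 3
    }
}
```

With 2-grams added, the intended document gains matches such as "wn", but the unrelated "fickle" also scores higher than before, which is the extra noise the second point warns about.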

    As you can see, getting a full-text search that behaves just the way you want requires some work and configuration; there's no "one-size-fits-all" solution. You just need to try different configurations and see what suits you best.