python · elasticsearch · artificial-intelligence

More accurate queries with Elasticsearch


I'm building an Elasticsearch-based application where I retrieve a document and split it into chunks. Then I have a query:

query = "what is your pricing?"

and then I feed the text and the query into my function:

def construct_enhanced_query(user_query):
    # Turn each extracted keyword into a boosted "should" clause on the content field
    keywords = extract_keywords(user_query)
    should_clauses = [
        {"match": {"content": {"query": keyword, "boost": 2}}}
        for keyword in keywords
    ]

    query = {
        "query": {
            "bool": {
                "should": should_clauses,
                "minimum_should_match": 1,
            }
        }
    }
    return query

and extract_keywords() looks like this:

def extract_keywords(text):
    # nlp is a loaded spaCy pipeline (noun_chunks and ents are spaCy attributes)
    doc = nlp(text)
    keywords = set(
        [chunk.text for chunk in doc.noun_chunks] + [ent.text for ent in doc.ents]
    )
    return keywords

Now, the thing is, in the chunks, I have things like this:

ABOUT PRICING CONTACT Learn more and also PRICING PLANS Private Lessons $150 PER STUDENT 30 minute lessons 1 instructor and 1 student (1:1) Book Now Semi-Private $130 /PER STUDENT 30 minute lessons 1 instructor and 2 students (1:2)

As you can imagine, I'm looking to retrieve the "$150 PER STUDENT" part and its surrounding information, and that should work because it's all in one chunk, which is good. The problem is, it's only returning this:

Document ID: nrIzDI4BhRAP3y-2FwQt Content: PRICING
Document ID: orIzDI4BhRAP3y-2FwQ0 Content: PRICING
Document ID: FLIwDI4BhRAP3y-2UwI2 Content: PRICING
Document ID: GLIwDI4BhRAP3y-2UwJD Content: PRICING

which, as you can imagine, is due to the keyword search plus the boosting. I'm trying to make it more dynamic so it actually retrieves the relevant information, because the query won't always be "what is your pricing?"; it could be anything, so I can't hardcode it. Any advice?


Solution

  • It sounds like you are trying to build a semantic search engine from scratch. Elastic has features that make it easy to import models from external sources (like Hugging Face) and use them in your search.

    If you still want to build it piece by piece, you can use a keyword-extraction model and add its output to your query, or you can use an embedding model from the start so that queries search by meaning rather than by word matching (which is what I would recommend for your example).

    Take a look at the starting guide from the docs: https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/search/00-quick-start.ipynb

    Or, if you want to read more about semantic vs. lexical search, there are some good articles on Search Labs: https://www.elastic.co/search-labs/blog/articles/lexical-and-semantic-search-with-elasticsearch

    The parts you would need for your use case:

    1. Importing a model through Docker and setting up your index and mappings (including a dense_vector field where the embeddings will go).
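    As a sketch, the mapping for step 1 could look like this. The index name book_index matches the quick-start notebook; the 384 dimensions are an assumption based on a model like all-MiniLM-L6-v2, so adjust dims to whatever your embedding model outputs:

    ```python
    # dense_vector mapping for the embedding field; "dims" must match your model
    mappings = {
        "properties": {
            "title": {"type": "text"},
            "title_vector": {
                "type": "dense_vector",
                "dims": 384,  # assumption: all-MiniLM-L6-v2 output size
                "index": True,
                "similarity": "cosine",
            },
        }
    }
    # client.indices.create(index="book_index", mappings=mappings)
    print(mappings)
    ```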

    2. Running a pipeline or bulk process to generate embeddings for all the documents in your index.

    operations = []
    for book in books:
        operations.append({"index": {"_index": "book_index"}})
        # Transforming the title into an embedding using the model
        book["title_vector"] = model.encode(book["title"]).tolist()
        operations.append(book)
    client.bulk(index="book_index", operations=operations, refresh=True)
    
    3. When you search with a natural-language query, like query = "what is your pricing?", embed the query with the same model and send it as part of your knn search:
    response = client.search(
        index="book_index",
        knn={
            "field": "title_vector",
            "query_vector": model.encode(query),
            "k": 10,
            "num_candidates": 100,
        },
    )
    

    Hope this helps!