
Complex queries with Elasticsearch percolation


I have a PostgreSQL database containing news articles parsed from the web. The parser runs every hour, collects new news items, and stores them in the database. Users of the app can add keywords to their profile so that they are notified whenever a new news item contains one of those keywords. Currently I use an SQL query for this: whenever I get a new news article, I match it against all keywords added by users and then send out notifications, but this takes a lot of time. So I am thinking of integrating Elasticsearch. I have come across the percolate query, but I can't find good documentation for it, so I am not sure whether I will be able to create complex queries with it. The search needs to take the following into account:

  1. Users can combine keywords with AND, OR and NOT, which should mean "contains all", "contains any one" and "does not contain" respectively. For example, for the keywords "Bitcoin" AND "Cryptocurrency" NOT "Mining", the search query should only match news articles that contain the words "Bitcoin" and "Cryptocurrency" and do not contain the word "Mining". The keywords can appear anywhere within the article title or article body.
  2. Stemming. If a user keyword is "raining" and the article contains the word "rain", the percolation search should still return the id for that keyword.
  3. Users can also provide an author as a keyword, in which case we need to return articles written by that author.

Solution

  • Thanks for the clarification.

    To use the percolate query in your case, you would have to:

    1. Create an index that defines the mapping of your articles, the information about your users, and a percolator field holding the query that corresponds to each user's preferences.
    PUT /percolated_queries_index
    {
        "mappings": {
            "properties": {
                "article": {
                    // Mapping for your article
                },
                "query": {
                    "type": "percolator"
                },
                "user": {
                    // Mapping for the information related to the user
                }
            }
        }
    }
    

    The article field is required because the article documents that you percolate will use this mapping. It should probably be the same mapping as the one you use in the article index. As mentioned in the documentation, you should see this mapping as the preprocessing applied to the document before it is matched against the stored queries. For example, this is where you have to specify a stemming analyzer.
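As a rough illustration of what such a mapping could look like, here is a minimal Python sketch that builds the index body as a plain dictionary. The sub-fields title, body and author under article, and id and email under user, are assumptions made for this example and do not come from the question. The built-in english analyzer is used on the text fields because it applies stemming; a custom analyzer may suit your data better.

```python
import json


def build_percolator_mapping():
    """Build the index body for the percolator index.

    All sub-field names (title, body, author, id, email) are hypothetical.
    """
    return {
        "mappings": {
            "properties": {
                "article": {
                    "properties": {
                        # The built-in "english" analyzer stems tokens,
                        # so "raining" and "rain" reduce to the same term.
                        "title": {"type": "text", "analyzer": "english"},
                        "body": {"type": "text", "analyzer": "english"},
                        # Default (standard) analyzer: names are not stemmed.
                        "author": {"type": "text"},
                    }
                },
                # Stores one Elasticsearch query per user.
                "query": {"type": "percolator"},
                "user": {
                    "properties": {
                        "id": {"type": "keyword"},
                        "email": {"type": "keyword"},
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent with PUT /percolated_queries_index
    print(json.dumps(build_percolator_mapping(), indent=2))
```

The printed JSON can be used as the request body when creating the index.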

    2. For each user, index the search query corresponding to their preferences in the percolator field.
    POST /percolated_queries_index/_doc
    {
        "query" : {
            // The elasticsearch query corresponding to the user preferences
        },
        "user": {
            // Information for the user, e.g., id, email
        }
    }
    

    The query is the user's preferences rewritten as an Elasticsearch query: for example, a match query for the author of the article and bool queries for the AND, OR, NOT keywords. This will probably be the difficult part, because you will have to write something that transforms the user's input into an Elasticsearch query. If you can use the query string syntax, it should be much easier.

    You should not set an article field here.
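The transformation from user preferences to a stored query could look roughly like the following Python sketch. It assumes the preferences are already split into three keyword lists plus an optional author, and that the article mapping nests title, body and author fields under article; these field names, the function names and the shape of the user object are all hypothetical.

```python
def keyword_clause(keyword):
    """Match one keyword in either the article title or the article body."""
    return {
        "multi_match": {
            "query": keyword,
            # Hypothetical field paths; adjust to your own article mapping.
            "fields": ["article.title", "article.body"],
        }
    }


def build_user_query(all_of=(), any_of=(), none_of=(), author=None):
    """Translate AND / OR / NOT keyword lists into a bool query."""
    if not (all_of or any_of or none_of or author):
        raise ValueError("at least one keyword or an author is required")

    bool_query = {}
    # AND keywords: every clause has to match.
    must = [keyword_clause(k) for k in all_of]
    if author:
        must.append(
            {"match": {"article.author": {"query": author, "operator": "and"}}}
        )
    if must:
        bool_query["must"] = must
    # OR keywords: at least one clause has to match.
    if any_of:
        bool_query["should"] = [keyword_clause(k) for k in any_of]
        bool_query["minimum_should_match"] = 1
    # NOT keywords: none of the clauses may match.
    if none_of:
        bool_query["must_not"] = [keyword_clause(k) for k in none_of]
    return {"bool": bool_query}


def build_percolator_doc(user_id, email, **preferences):
    """Build the document to index for one user (no article field)."""
    return {
        "query": build_user_query(**preferences),
        "user": {"id": user_id, "email": email},
    }
```

For the example from the question, build_user_query(all_of=["Bitcoin", "Cryptocurrency"], none_of=["Mining"]) yields a bool query with two must clauses and one must_not clause.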

    3. When a new article is indexed, run a percolate search query with this article in the document field parameter. If the article is already indexed, you can also reference it directly by its id (the syntax is given in the documentation).
    GET /percolated_queries_index/_search
    {
        "query" : {
            "percolate" : {
                "field" : "query",
                "document" : {
                    // The content of the article
                }
            }
        },
        "_source": "user"
    }
    

    The response contains the stored documents whose search query matches the article, together with the user information stored alongside each query. Since you are usually not interested in the search query itself, you can use the _source filter to return only the user field.

    This gives you all the users who should be notified about the new article.
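To make the last step concrete, the Python sketch below builds the percolate request body and pulls the users out of a search response. It works on plain dictionaries and leaves the HTTP call to whichever client you use. The nesting of the article under an article key, the user id field and the size value of 100 are assumptions for illustration.

```python
def build_percolate_request(article, size=100):
    """Build the body for GET /percolated_queries_index/_search.

    `article` is a dict with the article fields; it is wrapped under
    "article" on the assumption that the mapping nests them there.
    """
    return {
        "query": {
            "percolate": {
                "field": "query",
                "document": {"article": article},
            }
        },
        # Return only the user part of each matching stored query.
        "_source": "user",
        # Searches return 10 hits by default; 100 is an arbitrary choice.
        "size": size,
    }


def users_to_notify(response):
    """Collect the distinct users from a percolate search response."""
    users = {}
    for hit in response.get("hits", {}).get("hits", []):
        user = hit.get("_source", {}).get("user")
        if user and "id" in user:
            # A user with several matching queries is kept only once.
            users[user["id"]] = user
    return list(users.values())
```

Because a search returns only the first page of hits, a system with many users would need a larger size or a paging mechanism to reach every matching query.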