elasticsearch indexing lucene latex mathml

What is the best way to index documents that contain mathematical expressions in Elasticsearch?


The problem I am trying to solve: I have a set of documents that contain mathematical expressions or formulas, and I want to search the documents by formula or expression.

Based on my research so far, I'm considering converting each mathematical expression to LaTeX and storing it as a string in the database (Elasticsearch).

With this approach, will I be able to search for documents by their LaTeX string?

For example, the LaTeX form of a² + b² = c² is a^{2} + b^{2} = c^{2}. Can this string be searched in Elasticsearch?


Solution

  • I agree with user @Lue E, with some modifications. I first tried a plain keyword approach, but it gave me some issues, so I switched to a custom analyzer built on the keyword tokenizer, which should cover most of your use cases.

    Index definition with a custom analyzer. The keyword tokenizer keeps each formula as a single token so it can be searched as a whole, the lowercase filter makes the search case-insensitive, and the trim filter removes leading and trailing spaces.

    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_custom_analyzer": {
                        "type": "custom",
                        "tokenizer": "keyword",
                        "filter": [
                            "lowercase",
                            "trim"
                        ]
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "mathformula": {
                    "type": "text",
                    "analyzer": "my_custom_analyzer"
                }
            }
        }
    }
    

    Index sample docs

    {
        "mathformula" : "(a+b)^2 = a^2 + b^2 + 2ab"
    }
    
    {
        "mathformula" : "a2+b2 = c2"
    }
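
    If the formulas are stored as LaTeX, as the question proposes, commands such as \frac contain backslashes, which must be escaped in the JSON request body. The following is a minimal sketch, assuming the documents are built in Python; the formula used here is a hypothetical example, not one from the question.

```python
import json

# Hypothetical LaTeX formula, used only to illustrate backslash escaping.
doc = {"mathformula": r"\frac{a}{b} = c^{2}"}

# json.dumps escapes the backslash, so the JSON text contains \\frac.
body = json.dumps(doc)
print(body)

# Parsing the JSON text gives back the original LaTeX string unchanged.
restored = json.loads(body)
print(restored["mathformula"] == doc["mathformula"])
```

    Letting a JSON library do the escaping avoids hand-written request bodies that break on the first backslash.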
    

    Search query (a match query, which applies the same analyzer at search time as at index time)

    {
        "query": {
            "match" : {
                "mathformula" : {
                    "query" : "a2+b2 = c2"
                }
            }
        }
    }
    

    The search result contains only the document whose formula matches the query

    "hits": [
        {
            "_index": "so_math",
            "_type": "_doc",
            "_id": "1",
            "_score": 0.6931471,
            "_source": {
                "mathformula": "a2+b2 = c2"
            }
        }
    ]
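
    To see why only this document is returned, here is a rough Python stand-in for the analysis chain (keyword tokenizer, then lowercase, then trim). It is a simplified illustration and not Elasticsearch's actual implementation; the function names and the id "2" for the other sample document are assumptions.

```python
def analyze(text: str) -> list[str]:
    """Simplified stand-in for my_custom_analyzer."""
    token = text           # keyword tokenizer: the whole input is one token
    token = token.lower()  # lowercase filter
    token = token.strip()  # trim filter: leading/trailing whitespace only
    return [token]


# Sample documents; id "2" is an assumed id for illustration.
docs = {
    "1": "a2+b2 = c2",
    "2": "(a+b)^2 = a^2 + b^2 + 2ab",
}

# Toy inverted index: token -> set of document ids.
index: dict[str, set[str]] = {}
for doc_id, formula in docs.items():
    for token in analyze(formula):
        index.setdefault(token, set()).add(doc_id)


def search(query: str) -> set[str]:
    """Analyze the query the same way as the documents, then look up tokens."""
    hits: set[str] = set()
    for token in analyze(query):
        hits |= index.get(token, set())
    return hits


print(search("  A2+B2 = C2 "))  # differs only in case and outer spaces: matches doc 1
print(search("a2 + b2 = c2"))   # inner spacing differs: no match
print(search("a2+b2"))          # partial formula: no match
```

    Because each formula is a single token, this setup gives whole-formula matching that ignores case and surrounding spaces. Differences in inner spacing, or a search for part of a formula, will not match, so formulas need to be normalized consistently before indexing and querying.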