Tags: solr, full-text-search, markdown, apache-tika, full-text-indexing

Indexing markdown documents for full text search in Apache SOLR

I am using Apache SOLR to index markdown documents.
As you know, Markdown is basically plain text with special tags for formatting, like bold and italic. The problem is: if the markdown has bold or italic formatting, full-text search does not work. However, if the markdown document has no formatting elements (bold, italic, headings, links, etc.), full-text search works. To summarize: it works only when the markdown document is identical to the plain text (i.e. no word carries any markdown formatting).

I have concluded that I need to convert the markdown to plain text before indexing the documents. Only then will full-text search work as expected in all cases.

I did some searching and reading on different online forums, and I think I need to implement a custom analyzer. The custom analyzer would first convert the markdown to plain text and then index it. This seems similar to what Apache Tika does for Microsoft Office documents: it parses them and extracts the plain text. I think I need to do the same for markdown documents - parse them and convert them to plain text. I have already found a way to convert markdown to plain text (see the sketch below).
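
For example, here is a minimal sketch of the kind of conversion I mean, using the commonmark-java library and its TextContentRenderer (the library choice is just an illustration; any markdown parser that can emit plain text would work):

```java
import org.commonmark.node.Node;
import org.commonmark.parser.Parser;
import org.commonmark.renderer.text.TextContentRenderer;

public class MarkdownToPlainText {

    // Parse the markdown into an AST, then render it back without formatting.
    // commonmark-java is an assumption here; any markdown-to-text parser works.
    public static String toPlainText(String markdown) {
        Parser parser = Parser.builder().build();
        Node document = parser.parse(markdown);
        return TextContentRenderer.builder().build().render(document);
    }

    public static void main(String[] args) {
        // "**bold**" comes back as plain "bold", so the term becomes searchable.
        System.out.println(toPlainText("This is **bold** and *italic* text."));
    }
}
```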

However, I am not sure whether I really need to create a custom analyzer. I have read some code for custom analyzers, but all of them use token filters. From my understanding, token filters operate on the token stream on a token-by-token basis, whereas in my case the entire markdown content has to be converted to plain text before tokenization. So, please suggest an approach for this.

Another approach I have considered is to convert the markdown to plain text first and save the plain text alongside the markdown on disk. However, I want to avoid this and handle it inside SOLR: I expect SOLR to convert the document to plain text and then index it.

  1. Should I be creating a custom analyzer to convert the markdown documents to plain text? Or is a custom query parser required?
  2. Can someone give a code example for this (pseudocode is also fine)?

Please help.


Solution

  • Use a StandardTokenizer - it splits on most non-alphanumeric characters, which should be suitable for getting Markdown indexed as single terms instead of with the Markdown syntax kept intact.

    This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

    Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain names.

    The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved as single tokens.

    If you want to split on periods between words as well, you can use a PatternReplaceCharFilterFactory to insert a space after words that are separated by a dot without whitespace. A sketch of the resulting analyzer chain follows below.
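
For reference, here is a minimal sketch of what such a field type could look like in the schema; the field type name and the exact regex are illustrative assumptions, so adjust them to your setup:

```xml
<!-- Hypothetical field type; "text_markdown" and the regex are examples only. -->
<fieldType name="text_markdown" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Insert a space after a dot squeezed between two word characters,
         so "word1.word2" is tokenized as two separate terms. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\w)\.(\w)" replacement="$1. $2"/>
    <!-- StandardTokenizer drops surrounding punctuation, including Markdown
         syntax characters such as *, #, [ and ]. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain, "**bold**" is indexed as the single term "bold", since the tokenizer discards the surrounding asterisks.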