
Index only plain text from HTML in Solr


I need to index only the plain text from an HTML document and discard all HTML tags.

For example, I have HTML like this:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>
       title
    </title>
    <link href="./test.html" rel="StyleSheet" type="text/css" />
    </head>
    <body>
      <h1 style="height: 22px">
       header
      </h1>
    </body>
</html>

I want to index only the 'header' text under the body tag and keep all HTML tags out of the _text_ field in Solr.

I tried <charFilter class="solr.HTMLStripCharFilterFactory"/> like below:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

But it is still indexing the HTML tag attributes.

According to the Solr documentation, solr.HTMLStripCharFilterFactory should strip the HTML tags, so they should not be indexed.

When I search solr/testcore/select?q=_text_:height&wt=json, it returns a record, which it should not.

I tried this in both solr-5.3.1 and solr-6.6.0.

I am stuck on this, please help me out.


Solution

  • Since you're posting the HTML raw to Solr, it's being handled by the extracting request handler ("Solr Cell") - which uses Apache Tika to extract content from the HTML file.

    That means that the _text_ field never sees the HTML tags at all, since the content has already been extracted by Apache Tika and the HTML tags have disappeared - so there's nothing to remove.

    If you use a Solr client in a programming language of your choice and submit the HTML as a field value directly, the HTML stripping will take place as you expect (since the tags are then actually part of the content submitted to the field types internally in Solr).

    I tried to find a way of configuring the HTML parser in the bundled Tika version (it uses the TagSoup library for parsing), but I can't see any exposed configuration that would change what you're experiencing.
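
As a rough illustration of submitting the HTML as a field value directly, here is a minimal Python sketch that uses only the standard library and posts a JSON document to the core's /update handler. The host and port (localhost:8983), the field name `content`, and the helper names are assumptions for illustration, not something from the question. `content` is assumed to be a field whose type includes solr.HTMLStripCharFilterFactory (for example text_general as configured above) or that is copied into _text_.

```python
import json
import urllib.request

# Assumed location of the core; "testcore" is the core name from the question,
# host and port are assumptions.
SOLR_UPDATE_URL = "http://localhost:8983/solr/testcore/update?commit=true"


def build_update_request(doc_id, html, field="content", url=SOLR_UPDATE_URL):
    """Build a JSON update request carrying the raw HTML as a plain field value.

    `field` is a hypothetical field name; it must exist in your schema.
    """
    payload = json.dumps([{"id": doc_id, field: html}]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_html(doc_id, html):
    """Send the document to Solr and return the HTTP status code."""
    request = build_update_request(doc_id, html)
    with urllib.request.urlopen(request) as response:
        return response.status
```

With a running core, calling `post_html("doc1", html_string)` sends the markup through the regular update handler instead of Solr Cell, so the tags reach the analyzer chain and the char filter has something to strip. Note that the char filter only changes what is indexed; the stored value of the field still contains the original HTML.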