indexing, solrj, solr6

How to remove scripts and styles from the content field of Solr indexes when indexing through a URL?


When a website is indexed into a Solr collection (created with the configSet sample_techproducts_configs) from a URL, using the following command:

bin/post -p 8983 -c collection https://www.mywebsite.com -recursive 3 

The indexed documents have a content field, which is copied to the text field. This field holds the content of the web page as parsed by the embedded Tika parser.

But when a web page contains <script> or <style> tags, the markup is removed while the script or style code inside those tags remains in the content of the page and is shown in responses to Solr queries.

How can this unwanted content be removed?


Solution

  • In the readPageFromUrl(URL u) function of SimplePostTool, read the input stream used for DATA_MODE_WEB (only for pages whose content type is "text/html"), remove all <script> and <style> tags together with their content, and then convert the resulting string back to a stream using stringToStream(String).
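
The stripping step described above could look roughly like the following. This is a minimal sketch, not the actual SimplePostTool code: the class name HtmlScriptStyleStripper and its methods are hypothetical helpers, the page is assumed to be UTF-8 encoded, and a regular expression is used as a simple stand-in for real HTML parsing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Solr.
public class HtmlScriptStyleStripper {

    // Matches <script ...>...</script> and <style ...>...</style>,
    // case-insensitively and across line breaks (non-greedy body).
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Removes script and style elements, including their content.
    public static String strip(String html) {
        return SCRIPT_OR_STYLE.matcher(html).replaceAll("");
    }

    // Reads the whole stream, strips script/style elements, and returns
    // a new stream over the cleaned HTML. Assumes UTF-8 content.
    public static InputStream stripStream(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        String html = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        return new ByteArrayInputStream(strip(html).getBytes(StandardCharsets.UTF_8));
    }
}
```

In a patched copy of SimplePostTool, something like stripStream would be applied to the fetched page inside readPageFromUrl(URL u) only when the response content type starts with "text/html", so that other document types still reach Tika unchanged. A regular expression will miss edge cases such as unclosed tags; a real HTML parser would be more robust if adding a dependency is acceptable.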