When a website is indexed into a Solr collection (created with the configSet sample_techproducts_configs) by URL, using the following command:
bin/post -p 8983 -c collection https://www.mywebsite.com -recursive 3
The resulting documents have a content field, which is copied to the text field.
This field holds the content of the web page as parsed by the embedded Tika parser.
However, when a web page contains a <script> or <style> tag, the <body> markup is removed but the script or style code inside those tags remains in the page content and is shown in responses to Solr queries.
How can this unwanted content be removed?
Read the input stream for DATA_MODE_WEB in SimplePostTool (only for pages whose content type is "text/html"), remove all <script> and <style> tags together with their content, and then convert the resulting string back to a stream using stringToStream(String) in the readPageFromUrl(URL u) function.
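
The filtering step described above could look roughly like the following. This is a minimal standalone sketch, not the actual SimplePostTool code: the class name HtmlCleaner and both method names are hypothetical, the page is assumed to be UTF-8 encoded, and a regular expression is used in place of a real HTML parser, so unusual markup (for example a literal "</script>" inside a JavaScript string) may not be handled correctly. Inside SimplePostTool itself, the cleaned string would be passed to stringToStream(String) rather than wrapped by hand.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper; none of these names exist in Solr.
final class HtmlCleaner {

    // (?i) ignores case, (?s) lets '.' match line breaks.
    // Matches an opening <script> or <style> tag, its body, and the matching close tag.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Removes every <script>...</script> and <style>...</style> element from the HTML text.
    static String stripScriptAndStyle(String html) {
        return SCRIPT_OR_STYLE.matcher(html).replaceAll("");
    }

    // Reads the whole stream (assumed UTF-8), strips the unwanted elements,
    // and returns a fresh stream over the cleaned text.
    static InputStream cleanHtmlStream(InputStream in) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            String html = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            String cleaned = stripScriptAndStyle(html);
            return new ByteArrayInputStream(cleaned.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><style>p { color: red; }</style></head>"
                + "<body><p>Hello</p><script>alert('hi');</script></body></html>";
        System.out.println(stripScriptAndStyle(page));
    }
}
```

In readPageFromUrl(URL u), a call like cleanHtmlStream would be applied only when the response content type starts with "text/html", so that PDFs and other binary documents are passed to Tika unchanged.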