web-crawler stormcrawler

How to exclude script and style tags from text extracted by StormCrawler?


I am working with StormCrawler 1.10 and Elasticsearch 6.3.x. I added http.content.limit=-1 to the config. The crawler works well, but when I check the results, JavaScript functions and CSS data show up in the index. Is it possible to apply an XPath filter (e.g. for <script> and <style>) in parsefilters.json, or is there another way to stop the crawler from indexing these? Here is some sample data that shows up in the records:

 document.getElementById('cloak6258804dfa0d517eaedf4b69a99ed997').innerHTML = '';
                var prefix = '&#109;a' + 'i&#108;' + '&#116;o';
                var path = 'hr' + 'ef' + '=';
                var addy6258804dfa0d517eaedf4b69a99ed997 = '&#97;dm&#105;ss&#105;&#111;ns' + '&#64;';
                addy6258804dfa0d517eaedf4b69a99ed997 = addy6258804dfa0d517eaedf4b69a99ed997 + '&#97;&#117;k' + '&#46;' + '&#111;rg';
                var addy_text6258804dfa0d517eaedf4b69a99ed997 = '&#97;dm&#105;ss&#105;&#111;ns' + '&#64;' + '&#97;&#117;k' + '&#46;' + '&#111;rg';document.getElementById('cloak6258804dfa0d517eaedf4b69a99ed997').innerHTML += '<a ' + path + '\'' + prefix + ':' + addy6258804dfa0d517eaedf4b69a99ed997 + '\'>'+addy_text6258804dfa0d517eaedf4b69a99ed997+'<\/a>'

Solution

  • The XPathFilter serves a different purpose: it extracts metadata using XPath expressions. The ContentFilter is closer to what you need, as it lets you restrict the extracted text to the scope of a set of XPath expressions. However, it does not give you a way to filter out specific tags and keep everything else.

    Your best option at this stage is probably to use the Tika-based ParserBolt. It can be configured with an HTML mapper implementation, which defaults to IdentityHtmlMapper but can be any other implementation provided by Tika or written yourself; see the Tika documentation on HtmlMapper.

    Feel free to open an issue on GitHub to request a new type of ParseFilter that excludes some HTML elements, as this could be useful to have. We have a related issue for googleon / googleoff tags, and that could be a way of implementing it.

    EDIT: we have since released the TextExtractor; see the StormCrawler 1.13 release announcement.
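
    As a rough illustration of the two routes mentioned in this answer (a different Tika HTML mapper, or the TextExtractor from 1.13), the fragment below sketches how each might look in the crawler configuration. The key names and the mapper class are recalled from the StormCrawler and Tika defaults, not taken from this thread, so treat them as assumptions and check them against the crawler-default.yaml and the Tika module of the version you run. The first setting should only matter if the topology uses the Tika-based ParserBolt; the second is meant for the TextExtractor used by the JSoup-based parser.

    ```yaml
    config:
      # Option 1 (assumed key, Tika-based ParserBolt): replace the identity
      # mapper with Tika's DefaultHtmlMapper, which treats SCRIPT and STYLE
      # as discardable elements. It also maps the markup to a reduced set of
      # "safe" elements, so check that the rest of the output still suits you.
      parser.htmlmapper.classname: "org.apache.tika.parser.html.DefaultHtmlMapper"

      # Option 2 (assumed key, StormCrawler 1.13 or later): list the tags
      # whose content the TextExtractor should leave out of the text.
      textextractor.exclude.tags:
        - STYLE
        - SCRIPT
    ```

    Only one of the two is needed, depending on which parser bolt the topology wires in. Documents indexed before the change keep the old text until they are fetched and parsed again.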