
StormCrawler: crawling only the main content of a page


By default, the crawler processes the whole page, including the header and footer, which are common to all pages. Our requirement is that the crawler should only extract the main content of the page (which is under div#body-wrapper).

We achieved this with the following entry in parsefilters.json:

{
  "class": "com.digitalpebble.stormcrawler.parse.filter.ContentFilter",
  "name": "ContentFilter",
  "params": {
    "pattern": "//DIV[@id=\"body-wrapper\"]",
    "pattern2": "//DIV[@itemprop=\"articleBody\"]",
    "pattern3": "//ARTICLE"
  }
}

After updating parsefilters.json, the crawler extracts only that div, but the content includes all the whitespace, newlines, and JS and CSS code, as shown below.

"content" : "\n\t\t\t\n\n\t\t\t\t\n\t\t\t\t\t Growing Your Business ............. \n\n\n\n\n\n\t\n\t\t\n\t\t\t\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\n\n\t\n\t\t\n\n\n\n\n\t\n\n\t\n\t\t\n\t\t\t\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\n\n\t\t\n\n\t\t\n\n\n\n\t\n\t\t\n\t\t\n\n\n\t\t\t\n\t\t\t\t\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\n\t\t\n\n\t\t\n\n\t\t\n\n\t\t\n\t\n\t\t\t\t\n\t\t\t\n\t\t \n\t\t\n\n\t\t\n\t\t\t\n\t\t\t\t\n\n\n\t\n\n\n\n\t\n\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\n\n\t\t\t\t\n\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\n\n\n\t\n\n\t\t\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\n\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\n\t\t\t\t\t\t\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\n\n\t\n\t\t\n\t\n.landing-page-indicators { \n\ttop:inherit !important;\n}\n\n\t.slide-share .slide-share-indicators li {\n\t width: 10px;\n\t height: 10px;\n\t border-radius: 10px;\n\t border: none;\n\t margin: 0px 0 0 14px;\n}\n.slide-share .cta-btn-inline { \n margin-left:0px;\n}\n .slide-share .slide-share-indicators .active {\n\t background-color: #f33;\n}\n .slide-share .slide-share-item-img {\n\t width: 100%;\n\t height: 360px;\n\t max-height: 370px;\n\t background-size: cover;\n\t background-position: center;\n}\n .slide-share .carousel-indicators {\n\t margin-bottom: 0px;\n\t bottom: 24px;\n}\n .slide-share .slide-share-item-caption {\n\t width: 100%;\n\t -webkit-transition: height 0.4s ease;\n\t transition: height 0.4s ease;\n\t padding: 24px 16px;\n\t padding-bottom:0px;\n\t position: absolute;\n\t bottom: 5%;\n\t display: block;\n\t color: black;\n}\n .slide-share .slide-share-item-caption:hover {\n\t text-decoration: none;\n}\n .slide-share .slide-share-item-desc {\n\t max-width: 992px;\n\t width: 
100%;\n\t position: relative;\n\t margin: 0 auto;\n}\n .slide-share .slide-share-item-desc h2 {\n\t margin-bottom: 8px;\n\t font-size: 36px;\n\t font-weight: 700;\n}\n .slide-share .slide-share-item-desc p {\n\t line-height: 1.5;\n\t margin-bottom: 24px;\n\t font-size:24px;\n\t font-weight: 400;\n\t width:60%;\n}\n .slide-share .slide-share-arrows {\n\t top: 50px;\n\t margin: 30px;\n\t width: 0;\n\t align-items: initial;\n}\n .slide-share .slide-share-arrow-icon {\n\t color: #fff;\n\t font-size: 25px;\n\t margin-top: 75px;\n}\n.slide-share .slide-share-item-desc {\n background-color: transparent;\n}\n .slide-share .slide-share-arrow-icon:hover {\n\t color: #ee1818;\n\t font-size: 25px;\n}\n\n.slide-share .carousel-item .shade { \n width: 60%;\n height: 100%;\n position: absolute;\n background-image: linear-gradient(to right, #2e2e2e, transparent);\n opacity: .6;\n \n}\n\n @media (max-width: 991px) and (min-width: 768px) {\n\t .slide-share .slide-share-item-desc h2 {\n\t\t width: 100%;\n\t}\n\t .slide-share .slide-share-item-desc p {\n\t\t width: 100%;\n\t}\n}\n @media (max-width: 768px) {\n\t .slide-share .slide-share-item-desc h2 {\n\t\t width: 100%;\n\t\t font-size: 24px;\n\t\t margin-bottom: 16px;\n\n\t}\n\t .slide-share .slide-share-item-desc p {\n\t\t font-size: 16px;\n\t\t display: none;\n\t}\n\t.slide-share-item-img.left-center {\n\tbackground-position: left center;\n\t} \n\n\t.slide-share-item-img.right-center {\n\tbackground-position: right center;\n\t} \n\t.slide-share-item-img.center-center {\n\tbackground-position: centercenter;\n\t}\n}\n \n\n\n\n\n \n\t\n\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t

However, when the crawler processed the full page (default configuration), the content did not include the whitespace, newlines, or JS and CSS code.

How can we extract only part of a page, without the whitespace, newlines, JS, CSS, etc.?

Please advise.

Thank you.


Solution

  • The ContentFilter has been deprecated since StormCrawler 1.13 and was replaced by the TextExtractor.

    From the release notes:

    [...] the main new feature is the addition of the TextExtractor (#678) for the JsoupParserBolt. Unlike the ContentParseFilter, which it replaces, it is configured from the main configuration and is not a ParseFilter as it operates directly on the objects generated by Jsoup. The TextExtractor allows restricting the text to specific elements to avoid boilerplate code and navigation elements but provides a far cleaner text content compared to the ContentParseFilter which merges some tokens. The TextExtractor can also be used to define exclusion zones which will be applied either to the restricted zones or the whole document if no such zone were defined or found. This is useful for instance to remove SCRIPT or STYLE elements.

    The configuration generated by the archetypes uses the TextExtractor with a configuration similar to what the ContentFilter used to do.
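
    As a rough sketch of what that could look like for the page in the question: the fragment below goes in the main crawler configuration (for example crawler-conf.yaml), not in parsefilters.json. The key names are the ones I recall from the archetype-generated configuration, and the body-wrapper selector is my adaptation of the XPath in the question, so please check both against the configuration shipped with your StormCrawler version. Note that the patterns are Jsoup-style selectors rather than XPath expressions.

    ```yaml
    config:
      # Restrict text extraction to these elements (tried in order);
      # the first entry is adapted from the question's div#body-wrapper.
      textextractor.include.pattern:
        - DIV[id="body-wrapper"]
        - DIV[itemprop="articleBody"]
        - ARTICLE

      # Exclusion zones: drop inline CSS and JS from the extracted text.
      textextractor.exclude.tags:
        - STYLE
        - SCRIPT
    ```

    With this in place, the ContentFilter entry should be removed from parsefilters.json so that the two mechanisms do not both try to set the text.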