web-crawler, apache-storm, data-extraction, stormcrawler

How to crawl specific data from a website using StormCrawler


I am crawling news websites using StormCrawler (v1.16) and storing the data in Elasticsearch (v7.5.0). My crawler-conf file is the default one from the StormCrawler files. I am using Kibana for visualization. My issues are:

  • While crawling a news website I want only the URLs of article content, but I am also getting the URLs of ads and other tabs on the website. What changes do I have to make, and where?
  • If I have to get only specific things from a URL (like only the title, or only the content), how can I do that?

EDIT: I was thinking of adding a field to the content index, so I made changes in src/main/resources/parsefilter.json, ES_IndexInit.sh, and crawler-conf.yaml. The XPath I added is correct. In parsefilter.json I added:

"parse.pubDate":"//META[@itemprop=\"datePublished\"]/@content"

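For reference, that entry sits inside the XPathFilter block of parsefilter.json, roughly like this (other entries omitted):

    {
      "com.digitalpebble.stormcrawler.parse.ParseFilters": [
        {
          "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
          "name": "XPathFilter",
          "params": {
            "parse.pubDate": "//META[@itemprop=\"datePublished\"]/@content"
          }
        }
      ]
    }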

In crawler-conf.yaml I added:

    indexer.md.mapping:
      - parse.pubDate=PublishDate

and in the properties of ES_IndexInit.sh I added:

"PublishDate": { "type": "text", "index": false, "store": true }

But I am still not getting any field named PublishDate in Kibana or Elasticsearch. The ES_IndexInit.sh mapping is as follows:

{
  "mapping": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "PublishDate": {
        "type": "text",
        "index": false,
        "store": true
      },
      "content": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "description": {
        "type": "text",
        "store": true
      },
      "domain": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "host": {
        "type": "keyword",
        "store": true
      },
      "keywords": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "title": {
        "type": "text",
        "store": true
      },
      "url": {
        "type": "keyword",
        "store": true
      }
    }
  }
}
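Since _source is disabled and PublishDate is declared as stored but not indexed, the field would only come back when requested explicitly as a stored field. A quick way to check whether it made it into the index at all (assuming the default index name content on a local Elasticsearch):

    curl -s -H 'Content-Type: application/json' \
      'http://localhost:9200/content/_search?pretty' -d '{
        "query": { "match_all": {} },
        "stored_fields": ["PublishDate", "title"]
      }'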


Solution

  • One approach to indexing only news pages from a site is to rely on sitemaps, but not all sites will provide these.

    Alternatively, you'd need a mechanism as part of the parsing, maybe in a ParseFilter, to determine that a page is a news item, and then filter on the presence of a key/value in the metadata during indexing (see the sketch at the end of this answer).

    The way it is done in the news crawl dataset from CommonCrawl is that the seed URLs are sitemaps or RSS feeds.

    To not index the content, simply comment out

      indexer.text.fieldname: "content"
    

    in the configuration.
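
    To illustrate the ParseFilter route mentioned above, here is a minimal sketch (the class and the isNews key are made up for the example). It tags a page as a news item when the publication-date metadata extracted earlier by the XPathFilter is present:

      import com.digitalpebble.stormcrawler.Metadata;
      import com.digitalpebble.stormcrawler.parse.ParseFilter;
      import com.digitalpebble.stormcrawler.parse.ParseResult;
      import org.w3c.dom.DocumentFragment;

      // Hypothetical filter: tags pages whose metadata contains a
      // publication date, e.g. one extracted by the XPathFilter.
      public class NewsOnlyFilter extends ParseFilter {
          @Override
          public void filter(String url, byte[] content,
                             DocumentFragment doc, ParseResult parse) {
              Metadata md = parse.get(url).getMetadata();
              if (md.getFirstValue("parse.pubDate") != null) {
                  md.setValue("isNews", "true");
              }
          }
      }

    Registered in parsefilter.json after the XPathFilter (filters run in the order they are listed), this could then be combined with

      indexer.md.filter: "isNews=true"

    in the configuration, so that only pages carrying that key/value get indexed.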