I am crawling news websites using StormCrawler (v1.16) and storing the data in Elasticsearch (v7.5.0). My crawler-conf file is based on the default StormCrawler configuration files, and I am using Kibana for visualization. My issue is the following:
EDIT: I wanted to add a field to the content index, so I made changes in src/main/resources/parsefilter.json, ES_IndexInit.sh, and crawler-conf.yaml. The XPath expression I added is correct. I added
"parse.pubDate":"//META[@itemprop=\"datePublished\"]/@content"
in parsefilter.json,
parse.pubDate=PublishDate
in crawler-conf.yaml, and
PublishDate": {
"type": "text",
"index": false,
"store": true}
in the properties of ES_IndexInit.sh. But I am still not getting any field named PublishDate in Kibana or Elasticsearch. The ES_IndexInit.sh mapping is as follows (a sketch of how the first two snippets fit into their files appears after the mapping):
{
  "mapping": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "PublishDate": {
        "type": "text",
        "index": false,
        "store": true
      },
      "content": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "description": {
        "type": "text",
        "store": true
      },
      "domain": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "host": {
        "type": "keyword",
        "store": true
      },
      "keywords": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "title": {
        "type": "text",
        "store": true
      },
      "url": {
        "type": "keyword",
        "store": true
      }
    }
  }
}
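For completeness, here is roughly how the first two snippets sit in their files. This is only a sketch based on the default layout generated by the StormCrawler archetype, with the unrelated default entries trimmed. In src/main/resources/parsefilter.json the XPath expression goes into the params of the XPathFilter:

{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "parse.title": "//TITLE",
        "parse.pubDate": "//META[@itemprop=\"datePublished\"]/@content"
      }
    }
  ]
}

In crawler-conf.yaml the mapping line belongs in the indexer.md.mapping list:

indexer.md.mapping:
- parse.title=title
- parse.keywords=keywords
- parse.description=description
- parse.pubDate=PublishDate
- domain=domain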
One approach to indexing only news pages from a site is to rely on sitemaps, but not all sites will provide these.
Alternatively, you'd need a mechanism as part of the parsing, maybe in a ParseFilter, to determine that a page is a news item, and then filter during indexing based on the presence of a key/value pair in the metadata.
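A rough sketch of what such a ParseFilter could look like is below. The class name and the isNews marker key are made up for this example, and it assumes the filter is declared after the XPathFilter in parsefilter.json so that the parse.pubDate key extracted above is already in the metadata:

import org.w3c.dom.DocumentFragment;

import com.digitalpebble.stormcrawler.Metadata;
import com.digitalpebble.stormcrawler.parse.ParseFilter;
import com.digitalpebble.stormcrawler.parse.ParseResult;

// Hypothetical filter: flags a page as a news item when the
// datePublished value extracted by the XPathFilter is present.
public class NewsDetectorFilter extends ParseFilter {

    @Override
    public void filter(String url, byte[] content, DocumentFragment doc,
            ParseResult parse) {
        Metadata md = parse.get(url).getMetadata();
        // "parse.pubDate" is the key used in the XPath mapping above;
        // "isNews" is an arbitrary marker key chosen for this example
        if (md.getFirstValue("parse.pubDate") != null) {
            md.setValue("isNews", "true");
        }
    }
}

The indexer can then be restricted to documents carrying that marker, e.g. by setting indexer.md.filter: "isNews=true" in crawler-conf.yaml.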
The way it is done in the news crawl dataset from CommonCrawl is that the seed URLs are sitemaps or RSS feeds.
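Concretely, the seeds would look something like the lines below, where the tab-separated key=value pairs after each URL are metadata read by StormCrawler's file spout (the URLs are invented, and isFeed only has an effect if the topology includes the FeedParserBolt):

https://www.example.com/sitemap-news.xml	isSitemap=true
https://www.example.com/feeds/world.rss	isFeed=true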
To not index the content, simply comment out
indexer.text.fieldname: "content"
in the configuration.
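In the indexer section of crawler-conf.yaml that looks like this, with the neighbouring lines left at the archetype defaults:

indexer.url.fieldname: "url"
# indexer.text.fieldname: "content"
indexer.canonical.name: "canonical"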