We are trying to store the content of each webpage in the status index, along with its URL, status and metadata.
We edited ES_IndexInit.sh and added the following property to the mapping section of the status index:
"content": {
"type": "text",
"index": "true",
"store": true
}
but the content does not show up in Kibana after the crawl.
Our guess is that we would have to alter the Java source code of the StormCrawler project, but we don't know how to proceed with that.
Any insight would be very helpful. Thank you in advance.
The content is usually stored in a separate index; the status index is used essentially for scheduling URLs and keeping their metadata. Storing the page content there as well would probably have an impact on performance.
If that's the way you want to proceed though, you could write a custom ParseFilter that stores the text content in the metadata. As usual, you'd need to add the key used to store the text to the config entry listing the metadata to persist in the status index (metadata.persist).
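As a rough sketch of the configuration side only, and assuming the custom ParseFilter writes the page text under a metadata key named parse.text (a hypothetical name chosen for illustration, as are the other entries you may already have in the list), the crawler configuration could look something like this:

```yaml
config:
  # Metadata keys that are kept with the URL in the status index.
  # "parse.text" is a hypothetical key; it must match whatever key
  # your custom ParseFilter uses when it stores the text.
  metadata.persist:
   - parse.text
```

The key only reaches the status index if the filter actually sets it, so the custom ParseFilter also has to be registered in the parse filters configuration file used by your topology.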