
StormCrawler - Metadata fields not being persisted


I have a topology in which a spout emits tuples on the status stream. They are picked up by the StatusUpdaterBolt, which in turn writes the data to an Elasticsearch index.

The spout emits each tuple with a Metadata object that contains custom metadata (e.g. a crawler key).

These custom fields are not being written to the status index.

The config looks something like this:


spouts:
  - id: "myspout"
    className: com.mycompany.MySpout
    parallelism: 8

bolts:
  - id: "status"
    className: com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt
    parallelism: 4

streams:
  - from: "myspout"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

The Metadata object is built like this:

Metadata metadata = new Metadata();
...
metadata.setValue("crawler", "mycrawl");

and then is emitted:

collector.emit(new Values(url, metadata));

Why would the custom properties not get written to the status index?

Versions:

storm: 2.4.0
stormcrawler: 2.8


Solution

  • As per the documentation here: https://github.com/DigitalPebble/storm-crawler/wiki/MetadataTransfer

    You have to list the metadata keys that you want transferred/persisted into the status index. If a key is not listed, it won't get persisted.

    For your example, the configuration needs:

    metadata.persist:
      - crawler
    

    Note: If you were using parse filters to extract outlinks and wanted the key on the new documents generated by outlink identification, you'd also need to include:

    metadata.transfer:
      - crawler
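
To make the effect of these two settings concrete, here is a minimal plain-Java sketch of the allow-list idea behind them. It is an illustration only and not StormCrawler's own code: the class MetadataFilterSketch and the method keepOnly are hypothetical names, values are simplified to single strings, and the real crawler may also keep some built-in keys that this sketch ignores.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the filtering idea; not part of StormCrawler.
class MetadataFilterSketch {

    // Returns a copy of the metadata that contains only the allowed keys.
    static Map<String, String> keepOnly(Map<String, String> metadata, Set<String> allowedKeys) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            if (allowedKeys.contains(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("crawler", "mycrawl");
        metadata.put("scratch", "not listed anywhere");

        // Nothing listed: the custom key does not make it through.
        System.out.println(keepOnly(metadata, Set.of()));

        // "crawler" listed, as with metadata.persist above: only that key is kept.
        System.out.println(keepOnly(metadata, Set.of("crawler")));
    }
}
```

The same reasoning applies to metadata.transfer, except that the allow-list decides which keys are copied from a page onto the outlinks found in it rather than which keys are stored with the status.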