I was looking into extracting specific data with XPath using the filter classes provided in StormCrawler. I was wondering whether JSoupParserBolt uses the filter classes and configuration files out of the box, or whether we have to override the filter classes to extract the required data.
I was also trying to understand how to use the indexer.md.filter and indexer.md.mapping entries in crawler-conf.yaml, but due to the limited documentation their use is not clear to me.
Can anyone help me out?
JSoupParserBolt calls the ParseFilters defined in parsefilters.json; the file generated by the archetype gives a good example of what you can do with them. If you need to do some simple XPath extraction, you should be able to do it purely through configuration of com.digitalpebble.stormcrawler.parse.filter.XPathFilter. For instance,
"parse.title": [
"//TITLE",
"//META[@name=\"title\"]/@content"
]
will try to match the two XPath expressions and store whichever value is found in the metadata under the key parse.title.
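For context, such an entry goes into the params section of the XPathFilter declaration in parsefilters.json. A minimal sketch, modelled on the file generated by the archetype (the parse.description expression here is an illustrative assumption):

{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "parse.title": [
          "//TITLE",
          "//META[@name=\"title\"]/@content"
        ],
        "parse.description": "//META[@name=\"description\"]/@content"
      }
    }
  ]
}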
You can of course implement custom ParseFilters; the parse.filter package contains various implementations that you can use as a source of inspiration.
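As a rough illustration of what a custom implementation looks like, here is a minimal sketch of a ParseFilter that stores the lang attribute of the HTML element in the metadata. The class name and the metadata key are made up, and the exact ParseFilter API may differ between StormCrawler versions:

import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

import com.digitalpebble.stormcrawler.Metadata;
import com.digitalpebble.stormcrawler.parse.ParseFilter;
import com.digitalpebble.stormcrawler.parse.ParseResult;

/** Hypothetical filter storing the HTML lang attribute under 'parse.lang'. */
public class LangAttributeFilter extends ParseFilter {

    @Override
    public void filter(String url, byte[] content, DocumentFragment doc, ParseResult parse) {
        if (doc == null) {
            return;
        }
        // look for the <html> element among the fragment's children
        NodeList nodes = doc.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node n = nodes.item(i);
            if ("html".equalsIgnoreCase(n.getNodeName()) && n.getAttributes() != null) {
                Node lang = n.getAttributes().getNamedItem("lang");
                if (lang != null) {
                    // store the value in the metadata of the parsed document
                    Metadata md = parse.get(url).getMetadata();
                    md.setValue("parse.lang", lang.getNodeValue());
                }
            }
        }
    }

    @Override
    public boolean needsDOM() {
        // tell JSoupParserBolt to pass us the parsed DOM
        return true;
    }
}

Once compiled, it would be declared in parsefilters.json like any other filter.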
As for the indexer.md.* configs, see the wiki. Basically, the mapping allows you to rename the metadata keys:
indexer.md.mapping:
- parse.title=title
- parse.keywords=keywords
- parse.description=description
- domain=domain
In the example above, the key 'parse.title' will be indexed as a field named 'title'. Only the metadata keys listed in the mapping will be used for indexing.
indexer.md.filter serves a different purpose. As explained in the Javadoc, it is used to filter out (i.e. skip indexing) a document which has that key+value in its metadata.
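For instance, assuming a hypothetical metadata key 'noindex' set by one of your parse filters, and assuming the key=value syntax for this config, the following entry in the YAML config would skip indexing any document carrying it:

indexer.md.filter: "noindex=true"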