I want to know if there's a quick way to unit test LinkParseFilter configurations.
For example, if I have a parsefilter file with a LinkParseFilter specified like so:
...
{
  "class": "com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter",
  "name": "MyGalleryParseFilter",
  "params": {
    "thumbnails": "substring-before(substring-after(//a[@class='thumbnail']/span/@style, 'background-image: url('), ')')",
    "gallery": "//div[@class='browse']//a/@href",
    "interesting": "//ul[@class='also-interesting']//a/@href",
    "original": "//div[@id='original-image-frame']//a/img/@src"
  }
},
...
What's the quickest way to unit test this with some sample page content to check that it's extracting what I want?
One option is to write a unit test like the ones in the core module; you'd need to save a copy of the page in src/test/resources/. However, this assumes that the FetcherBolt returns the same content as the copy you stored, which is not necessarily the case.
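Before wiring anything into StormCrawler, it can help to check the XPath expressions on their own. The sketch below is not StormCrawler code: it uses only the JDK's javax.xml.xpath API, and the class name XPathSanityCheck, the sample markup and the URLs in it are all made up for illustration. It also assumes the sample is well-formed XML, whereas the crawler builds its DOM from real-world HTML, so the tree it sees may differ. Treat a pass here as a first check of the expressions, not as proof that the filter extracts the same links.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSanityCheck {

    // Hypothetical, well-formed sample page; replace with markup copied from the real site.
    static final String SAMPLE =
        "<html><body>"
        + "<a class='thumbnail' href='/img/1'>"
        + "<span style='background-image: url(/thumbs/1.jpg)'/></a>"
        + "<div class='browse'><a href='/gallery/2'>next</a></div>"
        + "<ul class='also-interesting'><li><a href='/gallery/7'>seven</a></li></ul>"
        + "<div id='original-image-frame'><a href='/img/1'><img src='/full/1.jpg'/></a></div>"
        + "</body></html>";

    // Parses the sample with the JDK XML parser (not the parser the crawler uses).
    static Document parse(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xhtml)));
    }

    // Evaluates an expression that selects nodes and returns the value of each one.
    static List<String> nodeValues(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    // Evaluates an expression whose result is a string, such as substring-before(...).
    static String stringValue(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (String) xpath.evaluate(expression, doc, XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println("gallery: " + nodeValues(doc, "//div[@class='browse']//a/@href"));
        System.out.println("interesting: "
            + nodeValues(doc, "//ul[@class='also-interesting']//a/@href"));
        System.out.println("original: "
            + nodeValues(doc, "//div[@id='original-image-frame']//a/img/@src"));
        System.out.println("thumbnails: " + stringValue(doc,
            "substring-before(substring-after(//a[@class='thumbnail']/span/@style, "
            + "'background-image: url('), ')')"));
    }
}
```

Note that the thumbnails expression is a string function, so in XPath 1.0 it yields a single string built from the first matching node rather than one value per thumbnail. If the page has several thumbnails, that is worth keeping in mind when checking what the filter returns.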
You could also modify your topology or write a custom one to use the same configuration with the MemorySpout. The topology from the archetype is a good starting point as the StdOutStatusUpdater will print out all the URLs found. Running it in debug mode with Eclipse (or the editor of your choice) can also help.
Could it be that there is a URL filter removing the outlinks you just created?