
Custom parsefilter.json file not found when running StormCrawler from Eclipse


I have been investigating the StormCrawler SDK for extracting content from HTML responses. I know that JSoupParserBolt uses a parsefilter.json file to extract content according to specific needs, and that a default file exists for the same purpose. In my case, I am using Eclipse to run the pom.xml and build the .jar file for my crawler. I then run the CrawlTopology class, which contains the main function and a run function that wires the required spouts and bolts from the SDK into a topology (I used the Maven archetype to generate the example crawler).

The problem is that the CrawlTopology class does not pick up my modified parsefilter.json file; it always uses the default parsefilter.json instead. I cannot figure out what is causing this, whether it is a Maven dependency issue or a problem with the default project.

Can anyone help me out?


Solution

  • If your code was generated from the archetype, then parsefilter.json should already be in the right place, i.e. src/main/resources/.

    When using Eclipse, make sure you import the project as a Maven project. This adds src/main/resources/ to the classpath, and Eclipse will resolve the dependencies and manage the compiled classes. I routinely run topologies in Eclipse without any problems.

    Running in Eclipse is fine for testing and debugging, but the best approach is to run the code outside Eclipse as indicated in the README. Another option, if you haven't installed Storm, is to use

     mvn clean compile exec:java -Dexec.mainClass=insert.package.CrawlTopology -Dexec.args="-conf crawler-conf.yaml -local"
    

    to run it in local mode outside of Eclipse.
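If it is still unclear which copy of the file is being picked up, a small check like the one below can print where a resource name resolves on the classpath. This is only a sketch using the standard Java class loader API; the class name ResourceLocator and its describe method are hypothetical helpers and not part of StormCrawler. It also assumes the file is looked up by its bare name, as it would be when it sits directly in src/main/resources/.

```java
import java.net.URL;

// Hypothetical diagnostic helper, not part of the StormCrawler SDK.
public class ResourceLocator {

    // Describes where a resource name resolves on the classpath, if anywhere.
    static String describe(String name) {
        URL url = ResourceLocator.class.getClassLoader().getResource(name);
        return url == null
                ? name + ": not found on classpath"
                : name + ": " + url;
    }

    public static void main(String[] args) {
        // Defaults to the file name used in the question; pass another name as an argument.
        String name = args.length > 0 ? args[0] : "parsefilter.json";
        System.out.println(describe(name));
    }
}
```

Run from the same Eclipse project (or with the same classpath as the topology), a URL pointing into a dependency jar rather than into the project's own output folder would suggest that src/main/resources/ is not on the classpath.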