StormCrawler with SQL external module gets ParseFilters exception at crawl stage


I use StormCrawler with the SQL external module. I have updated my pom.xml with:

<dependency>
        <groupId>com.digitalpebble.stormcrawler</groupId>
        <artifactId>storm-crawler-sql</artifactId>
        <version>1.8</version>
</dependency>

I use an injector/crawl procedure similar to the one for the ES setup:

storm jar target/stromcrawler-1.0-SNAPSHOT.jar  org.apache.storm.flux.Flux --local sql-injector.flux --sleep 864000

I have created a MySQL database crawl with a table urls and successfully injected my URLs into it. For example, if I run select * from crawl.urls limit 5; I can see the URLs, their status, and the other fields. From this I conclude that, at this stage, the crawler connects to the database.

sql-injector.flux looks like this:

name: "injector"

includes:
- resource: true
  file: "/crawler-default.yaml"
  override: false

- resource: false
  file: "crawler-conf.yaml"
  override: true

- resource: false
  file: "sql-conf.yaml"
  override: true

- resource: false
  file: "my-config.yaml"
  override: true

components:
- id: "scheme"
  className: "com.digitalpebble.stormcrawler.util.StringTabScheme"
  constructorArgs:
    - DISCOVERED

spouts:
- id: "spout"
  className: "com.digitalpebble.stormcrawler.spout.FileSpout"
  parallelism: 1
  constructorArgs:
    - "seeds.txt"
    - ref: "scheme"

bolts:
- id: "status"
  className: "com.digitalpebble.stormcrawler.sql.StatusUpdaterBolt"
  parallelism: 1

streams:
- from: "spout"
  to: "status"
  grouping:
    type: CUSTOM
    customClass:
      className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
      constructorArgs:
        - "byHost"

When I run:

storm jar target/stromcrawler-1.0-SNAPSHOT.jar  org.apache.storm.flux.Flux --remote sql-crawler.flux

I get the following exception in the parse bolt:

java.lang.RuntimeException: Exception caught while loading the ParseFilters from parsefilters.json
    at com.digitalpebble.stormcrawler.parse.ParseFilters.fromConf(ParseFilters.java:67)
    at com.digitalpebble.stormcrawler.bolt.JSoupParserBolt.prepare(JSoupParserBolt.java:116)
    at org.apache.storm.daemon.executor$fn__5043$fn__5056.invoke(executor.clj:803)
    at org.apache.storm.util$async_loop$fn__557.invoke(util.clj:482)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Unable to build JSON object from file
    at com.digitalpebble.stormcrawler.parse.ParseFilters.<init>(ParseFilters.java:92)
    at com.digitalpebble.stormcrawler.parse.ParseFilters.fromConf(ParseFilters.java:62)
    ... 5 more
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('}' (code 125)): was expecting double-quote to start field name...

sql-crawler.flux:

name: "crawler"

includes:
- resource: true
  file: "/crawler-default.yaml"
  override: false

- resource: false
  file: "crawler-conf.yaml"
  override: true

- resource: false
  file: "sql-conf.yaml"
  override: true

- resource: false
  file: "my-config.yaml"
  override: true

spouts:
- id: "spout"
  className: "com.digitalpebble.stormcrawler.sql.SQLSpout"
  parallelism: 100

bolts:
- id: "partitioner"
  className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
  parallelism: 1
- id: "fetcher"
  className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
  parallelism: 1
- id: "sitemap"
  className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
  parallelism: 1
- id: "parse"
  className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
  parallelism: 1
- id: "status"
  className: "com.digitalpebble.stormcrawler.sql.StatusUpdaterBolt"
  parallelism: 1


streams:
- from: "spout"
  to: "partitioner"
  grouping:
    type: SHUFFLE

- from: "partitioner"
  to: "fetcher"
  grouping:
    type: FIELDS
    args: ["key"]

- from: "fetcher"
  to: "sitemap"
  grouping:
    type: LOCAL_OR_SHUFFLE

- from: "sitemap"
  to: "parse"
  grouping:
    type: LOCAL_OR_SHUFFLE

- from: "fetcher"
  to: "status"
  grouping:
    type: FIELDS
    args: ["url"]
    streamId: "status"

- from: "sitemap"
  to: "status"
  grouping:
    type: FIELDS
    args: ["url"]
    streamId: "status"

- from: "parse"
  to: "status"
  grouping:
    type: FIELDS
    args: ["url"]
    streamId: "status"

It looks like object StringUtils at ParseFilters.java:60 is blank.


Solution

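  • Check the content of src/main/resources/parsefilters.json (or whichever file you have set in parsefilters.config.file). Judging by the error message, the JSON it contains is not valid: the parser hit a '}' where it expected a quoted field name, which typically means a trailing comma after the last entry of an object. Don't forget to rebuild the uber jar with mvn clean package after fixing the file.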
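
A quick way to find the offending spot is to run the file through any strict JSON parser before rebuilding. The following is only a minimal sketch in Python: the helper names check_json and check_file are made up for illustration, and the sample fragment is hypothetical, not taken from the asker's file. Python's json module is not Jackson, so treat this as a first check rather than a guarantee that StormCrawler will accept the file.

```python
import json


def check_json(text):
    """Return None if text parses as JSON, otherwise a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno/colno point at (or near) the character the parser rejected
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


def check_file(path):
    """Read a file as UTF-8 and run check_json on its content."""
    with open(path, encoding="utf-8") as handle:
        return check_json(handle.read())


# Hypothetical fragment: the trailing comma before '}' makes it invalid JSON.
broken = '{"class": "com.example.MyFilter", "name": "MyFilter",}'
print(check_json(broken) or "valid JSON")
```

Calling check_file("src/main/resources/parsefilters.json") from the project root (the path assumes the default location) reports the position of the first syntax error, if any. The standard-library command python -m json.tool src/main/resources/parsefilters.json performs a similar check without any extra code.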