I use StormCrawler with the external SQL module. I have updated my pom.xml with:
<dependency>
    <groupId>com.digitalpebble.stormcrawler</groupId>
    <artifactId>storm-crawler-sql</artifactId>
    <version>1.8</version>
</dependency>
I use an injector/crawl procedure similar to the one for the ES setup:
storm jar target/stromcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local sql-injector.flux --sleep 864000
I have created a MySQL database crawl with a table urls and successfully injected my URLs into it. For example, if I run select * from crawl.urls limit 5; I can see the URLs, status, and other fields. From this, I conclude that at this stage the crawler connects to the database.
sql-injector.flux looks like this:
name: "injector"
includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-conf.yaml"
    override: true
  - resource: false
    file: "sql-conf.yaml"
    override: true
  - resource: false
    file: "my-config.yaml"
    override: true
components:
  - id: "scheme"
    className: "com.digitalpebble.stormcrawler.util.StringTabScheme"
    constructorArgs:
      - DISCOVERED
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "seeds.txt"
      - ref: "scheme"
bolts:
  - id: "status"
    className: "com.digitalpebble.stormcrawler.sql.StatusUpdaterBolt"
    parallelism: 1
streams:
  - from: "spout"
    to: "status"
    grouping:
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byHost"
When I run:
storm jar target/stromcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --remote sql-crawler.flux
I got the following exception at the Parse bolt:
java.lang.RuntimeException: Exception caught while loading the ParseFilters from parsefilters.json
    at com.digitalpebble.stormcrawler.parse.ParseFilters.fromConf(ParseFilters.java:67)
    at com.digitalpebble.stormcrawler.bolt.JSoupParserBolt.prepare(JSoupParserBolt.java:116)
    at org.apache.storm.daemon.executor$fn__5043$fn__5056.invoke(executor.clj:803)
    at org.apache.storm.util$async_loop$fn__557.invoke(util.clj:482)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Unable to build JSON object from file
    at com.digitalpebble.stormcrawler.parse.ParseFilters.<init>(ParseFilters.java:92)
    at com.digitalpebble.stormcrawler.parse.ParseFilters.fromConf(ParseFilters.java:62)
    ... 5 more
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('}' (code 125)): was expecting double-quote to start field name...
sql-crawler.flux:
name: "crawler"
includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-conf.yaml"
    override: true
  - resource: false
    file: "sql-conf.yaml"
    override: true
  - resource: false
    file: "my-config.yaml"
    override: true
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.sql.SQLSpout"
    parallelism: 100
bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.sql.StatusUpdaterBolt"
    parallelism: 1
streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE
  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]
  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
It looks like the StringUtils object at ParseFilters.java:60 is blank.
Check the content of src/main/resources/parsefilters.json (or whichever file you have set for parsefilters.config.file). Judging by the error message, the JSON it contains is not valid. Don't forget to rebuild the uber jar with mvn clean package afterwards.
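A quick way to find the offending spot before rebuilding is to run the file through any JSON parser. The following is a minimal sketch in Python, not part of StormCrawler; the helper name check_json is made up for illustration, and the path in the usage note assumes the default Maven layout. Python's parser and Jackson can differ in edge cases, so treat a clean result as a strong hint rather than a guarantee.

```python
import json


def check_json(path):
    """Return None if the file parses with Python's json module,
    otherwise a short message locating the first error.

    check_json is a hypothetical helper, not a StormCrawler API.
    """
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno are 1-based positions reported by the parser
        return f"{path}: line {err.lineno}, column {err.colno}: {err.msg}"
    return None
```

Calling check_json("src/main/resources/parsefilters.json") returns None for a file that parses, or a message with the line and column to inspect. An unexpected '}' where a field name should start, as in the exception above, is often caused by a trailing comma after the last entry of an object, which is not allowed in JSON.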