elasticsearch web-crawler nutch stormcrawler

Storm-crawler crawl and indexing

I've worked with Nutch 1.x for crawling websites and used Elasticsearch to index the data. I've come across StormCrawler recently and like it, especially its streaming nature.

Do I have to init and create the mappings for my ES server that Storm-crawler is sending the data to?

With Nutch, as long as I had the ES index up and running, the mapping took care of itself... except for some fine tuning. Is it the same for Stormcrawler? Or do I have to init the index and mapping before?

Solution

Great to hear you like StormCrawler.

As explained in the README and in the video tutorial (which is based on ES 2.x), you should use the ES_IndexInit script to set the mappings explicitly. Indexing will probably work without it, but the result would not be optimal.
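To illustrate what "setting the mapping explicitly" means, here is a minimal sketch of creating an index with a hand-written mapping before any documents arrive. It is not the content of the ES_IndexInit script: the index name, the `doc` type and the `url`/`title`/`content` fields are hypothetical, and the mapping syntax assumes an ES 2.x server (newer versions use different field types and no mapping types). Use the script shipped with StormCrawler for the real mappings.

```python
import json
import urllib.request


def build_mapping(doc_type="doc"):
    """Return an ES 2.x-style index body with an explicit mapping.

    The type name and the field names are hypothetical, for illustration only.
    """
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    # Keep the URL as a single untokenized term.
                    "url": {"type": "string", "index": "not_analyzed"},
                    # Full-text fields, analyzed with the default analyzer.
                    "title": {"type": "string"},
                    "content": {"type": "string"},
                }
            }
        }
    }


def build_create_index_request(base_url, index_name):
    """Build (but do not send) the PUT request that creates the index."""
    body = json.dumps(build_mapping()).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index_name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Host and index name are assumptions; adjust them to your setup.
    req = build_create_index_request("http://localhost:9200", "index")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually create the index, send it: urllib.request.urlopen(req)
```

Creating the index this way before starting the crawl means Elasticsearch does not have to guess field types from the first documents it receives, which is the dynamic-mapping behaviour the question describes with Nutch.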
