Ok, I asked this already, but I guess I didn't ask it to the way stackoverflow expects. Hopefully I will get more luck this time and an answer.
I am trying to run nutch to crawl this site: http://www.tigerdirect.com/
I want it to crawl that site and all sublinks.
The problem is its not working. In my reg-ex file I tried a couple of things, but none of them worked:
+^http://([a-z0-9]*\.)*tigerdirect.com/
+^http://tigerdirect.com/([a-z0-9]*\.)*
my urls.txt is:
http://tigerdirect.com
Basically what I am trying to accomplish is to crawl all the product pages on their website so I can create a search engine (I am using solr) of electronic products. Eventually I want to crawl bestbuy.com, newegg.com and other sites as well.
BTW, I followed the tutorial from here: http://wiki.apache.org/nutch/NutchTutorial and I am using the script mentioned in session 3.3 (after fixing a bug it had).
I have a background in java and android and bash so this is a little new to me. I used to do regex in perl 5 years ago, but that is all forgotten.
Thanks!
According to your comments I see that you have crawled something before and this is why your Nutch starts to crawl Wikipedia.
When you crawl something with Nutch it records some metada at a table (if you use Hbase it is a table named webpage) When you finish a crawling and start a new one that table is scanned and if there is a record that has a metada says "this record can be fetched again because next fetch time is passed" Nutch starts to fetch that urls and also your new urls.
So if you want to have just http://www.tigerdirect.com/ crawled at your system you have to clean up that table first. If you use Hbase start shell:
./bin/hbase shell
and disable table:
disable 'webpage'
and finally drop it:
drop 'webpage'
I could truncate that table but removed it.
Next thing is putting that into your seed.txt:
http://www.tigerdirect.com/
open regex-urlfilter.txt that is located at:
nutch/runtime/local/conf
write that line into it:
+^http://([a-z0-9]*\.)*www.tigerdirect.com/([a-z0-9]*\.)*
you will put that line instead of +.
I have indicated to crawl subdomains of tigerdirect, it is up to you.
After that you can send it into solr to index and make a search on it. I have tried it and works however you may have some errors at Nutch side but it is another topic to talk about.