I have configured Nutch 2.3.1 with the Hadoop/HBase ecosystem. I have not changed gora.buffer.read.limit or gora.buffer.write.limit, i.e., both use their default value of 10000. In the generate phase, I set topN to 100,000. During the generate job I get the following message:
org.apache.gora.mapreduce.GoraRecordWriter: Flushing the datastore after 60000 records
After job completion, I found that 100,000 URLs were marked for fetching, which is what I wanted. But I am confused: what does the above message indicate? What is the impact of gora.buffer.read.limit on my crawling? Can someone guide me?
That log line is written here. By default, the buffer is flushed after writing 10000 records, so you must have configured gora.buffer.write.limit to 60000 somewhere (in core-site.xml, mapred-site.xml, or code?).
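If you set it in code, a minimal sketch could look like this, assuming you build the job's Hadoop Configuration yourself (the class name and printout are only illustrative; the equivalent XML property stanza is shown in the comment):

```java
import org.apache.hadoop.conf.Configuration;

public class GoraBufferConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Equivalent XML in core-site.xml / mapred-site.xml:
        //   <property>
        //     <name>gora.buffer.write.limit</name>
        //     <value>60000</value>
        //   </property>
        conf.setInt("gora.buffer.write.limit", 60000);
        // Gora's default write-buffer size is 10000 records; a value of 60000
        // would explain the "Flushing the datastore after 60000 records" log.
        System.out.println(conf.getInt("gora.buffer.write.limit", 10000));
    }
}
```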
It is not a problem, since the message is logged at INFO level. It only notifies you that the write buffer is about to be flushed to the storage.
The writing happens each time you call store.flush(), or automatically in batches of gora.buffer.write.limit records.
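To illustrate the two flush paths, here is a minimal sketch; the helper method and the way records arrive are hypothetical, but put() and flush() are the Gora DataStore calls involved:

```java
import java.util.Map;

import org.apache.gora.persistency.Persistent;
import org.apache.gora.store.DataStore;

public class FlushSketch {
    // Hypothetical helper: write a batch of records, then flush explicitly.
    // Even without the explicit flush() at the end, Gora writes the buffer
    // on its own once gora.buffer.write.limit records have accumulated.
    static <K, T extends Persistent> void writeBatch(DataStore<K, T> store,
                                                     Map<K, T> records) {
        for (Map.Entry<K, T> e : records.entrySet()) {
            store.put(e.getKey(), e.getValue()); // buffered in memory
        }
        store.flush(); // push buffered writes to the backing store (HBase here)
    }
}
```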