Tags: hadoop, hbase, nutch, gora, nutch2

Apache Nutch flushes Gora records after limit


I have configured Nutch 2.3.1 with the Hadoop/HBase ecosystem. I have not changed gora.buffer.read.limit or gora.buffer.write.limit, i.e., I am using their default value of 10000 in both cases. At the generate phase, I set topN to 100,000. During the generate job I get the following message:

org.apache.gora.mapreduce.GoraRecordWriter: Flushing the datastore after 60000 records

After the job completed, I found that 100,000 URLs were marked for fetching, as intended. But I am confused: what does the above message mean? What impact does gora.buffer.read.limit have on my crawling? Can someone guide me?


Solution

  • That log is written here. By default, the buffer is flushed after writing 10000 records, so you must have configured gora.buffer.write.limit to 60000 somewhere (in core-site.xml, mapred-site.xml, or in code? See the first sketch below.)

    It is not important, since it is logged at INFO level. It only notifies you that the write buffer is about to be written into the storage. The actual writing happens each time you call store.flush(), or automatically in batches of gora.buffer.write.limit records (see the second sketch below).
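
If you want to set that threshold deliberately, the property goes on the job's Hadoop configuration. Below is a minimal sketch, assuming only hadoop-common on the classpath; gora.buffer.write.limit and gora.buffer.read.limit are Gora's real property keys, but the 60000 value is just the one from your log:

    import org.apache.hadoop.conf.Configuration;

    public class GoraBufferConfig {
        public static void main(String[] args) {
            // In a real Nutch job this configuration is built from
            // nutch-default.xml / nutch-site.xml; a plain one is used here.
            Configuration conf = new Configuration();

            // Write path: GoraRecordWriter flushes the datastore after this
            // many buffered records (default 10000). 60000 reproduces the
            // "Flushing the datastore after 60000 records" message above.
            conf.setInt("gora.buffer.write.limit", 60000);

            // Read path: used by GoraRecordReader to chunk query results
            // while reading (default also 10000); it has no effect on writes.
            conf.setInt("gora.buffer.read.limit", 10000);

            System.out.println("write limit = " + conf.get("gora.buffer.write.limit"));
        }
    }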
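
To make the flush semantics concrete outside of a MapReduce job, here is a hypothetical sketch that mirrors what GoraRecordWriter effectively does; it is not the actual GoraRecordWriter source. It assumes a default datastore configured in gora.properties and Nutch's WebPage class, and the key names are made up:

    import org.apache.gora.store.DataStore;
    import org.apache.gora.store.DataStoreFactory;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.storage.WebPage;

    public class FlushSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Uses the default store declared in gora.properties.
            DataStore<String, WebPage> store =
                DataStoreFactory.getDataStore(String.class, WebPage.class, conf);

            int writeLimit = conf.getInt("gora.buffer.write.limit", 10000);
            long written = 0;

            for (int i = 0; i < 25000; i++) {
                WebPage page = WebPage.newBuilder().build();
                // Nutch actually keys rows by reversed URL; this key is illustrative.
                store.put("com.example/page-" + i, page);
                // The equivalent of GoraRecordWriter's internal check:
                if (++written % writeLimit == 0) {
                    store.flush();  // push the buffered puts to the backend
                }
            }
            store.flush();  // final flush for the remainder
            store.close();
        }
    }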