Questions
When reading from HBase into Spark, the regions appear to dictate the Spark partitions, and thus a single large region runs into the 2 GB partition limit, which causes problems with caching. Does this imply that the region size needs to be kept small?
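If region size does bound partition size, one possible mitigation is to cap the maximum region size so each region (and hence each Spark partition) stays well under 2 GB. A sketch using the standard `hbase.hregion.max.filesize` property; the value here is an assumption for illustration, not a tuned recommendation:

```xml
<!-- hbase-site.xml: cap regions at ~1 GB so one region (and hence
     one Spark partition) stays well under the 2 GB limit.
     The value is illustrative, not a recommendation. -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>1073741824</value>
</property>
```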
TableSnapshotInputFormat, which bypasses the region servers and reads directly from the snapshots, also creates its splits per region, so it would still run into the region-size problem above. It is possible to read key-values from HFiles directly, in which case the split size is determined by the HDFS block size. Is there an implementation of a scanner or other utility that can read a row directly from an HFile (specifically, from a snapshot-referenced HFile)?
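The effect of block-size-driven splitting can be sketched with simple arithmetic: FileInputFormat-style readers produce roughly one split per HDFS block. The helper below is hypothetical, shown only to contrast block-sized splits with region-sized ones; it is not an HBase API:

```java
// Hypothetical illustration: when reading HFiles directly, the number of
// input splits is roughly one per HDFS block, i.e. ceil(fileSize / blockSize),
// instead of one split per region.
public final class SplitMath {
    // Returns the approximate number of splits for a file of the given size.
    static long splitCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;        // 128 MB HDFS block
        long hfileSize = 10L * 1024 * 1024 * 1024;  // 10 GB HFile
        // A 10 GB HFile over 128 MB blocks yields 80 splits,
        // versus a single region-sized split of 10 GB.
        System.out.println(splitCount(hfileSize, blockSize)); // 80
    }
}
```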
Are there any other pointers, say configurations, that may help boost performance, for instance the HDFS block size? The main use case is a full table scan for the most part.
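For a full-scan-heavy workload, two commonly cited knobs are the HDFS block size and the scanner caching. The fragment below is a sketch only; the values are assumptions for illustration, not tuned recommendations:

```xml
<!-- hdfs-site.xml: larger blocks mean fewer, bigger sequential reads
     during a full scan (256 MB here is an illustrative value). -->
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>
</property>

<!-- hbase-site.xml: fetch more rows per scanner RPC to cut round trips
     (1000 is an illustrative value). -->
<property>
  <name>hbase.client.scanner.caching</name>
  <value>1000</value>
</property>
```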
As it turns out, the scan itself was actually pretty fast. Performance analysis showed that the problem lay in one of the object representations of an IP address: InetAddress took a significant amount of time to resolve an IP address. We switched to using the raw bytes to extract whatever we needed, and this change alone brought the job down to about 2.5 hours. Modelling the problem as a MapReduce job and running it on MR2 with the same change showed that it could finish in about 1 hour 20 minutes. The iterative nature and smaller memory footprint helped MR2 achieve more parallelism, and it was therefore much faster.
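The raw-bytes approach can be sketched as follows; `formatIpv4` is a hypothetical helper, shown only to illustrate avoiding `InetAddress` on the hot path:

```java
// Hypothetical sketch: format an IPv4 address from the raw 4 bytes stored
// in the row, avoiding java.net.InetAddress on the hot path (its
// getHostName() can trigger slow reverse-DNS lookups).
public final class IpBytes {
    // Converts 4 raw bytes to dotted-quad text; bytes are masked to unsigned.
    static String formatIpv4(byte[] b) {
        return (b[0] & 0xFF) + "." + (b[1] & 0xFF) + "."
             + (b[2] & 0xFF) + "." + (b[3] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 192, (byte) 168, 0, 1};
        System.out.println(formatIpv4(raw)); // 192.168.0.1
    }
}
```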