Questions
When reading from HBase into Spark, the regions appear to dictate the Spark partitions, and thus a single large region runs into the 2 GB partition limit, which causes problems with caching. Does this imply that the region size needs to be kept small?
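If region size does bound partition size, one possible mitigation is to cap the maximum region size so each region (and hence each Spark partition) stays well under 2 GB. A sketch using the standard `hbase.hregion.max.filesize` property; the value here is an assumption for illustration, not a tuned recommendation:

```xml
<!-- hbase-site.xml: cap regions at ~1 GB so one region (and hence
     one Spark partition) stays well under the 2 GB limit.
     The value is illustrative, not a recommendation. -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>1073741824</value>
</property>
```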
TableSnapshotInputFormat, which bypasses the region servers and reads directly from the snapshots, also creates its splits per region, so it would still run into the region-size problem above. It is possible to read key-values from HFiles directly, in which case the split size is determined by the HDFS block size. Is there an implementation of a scanner or other utility that can read a row directly from an HFile (specifically, from a snapshot-referenced HFile)?
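The effect of block-size-driven splitting can be sketched with simple arithmetic: FileInputFormat-style readers produce roughly one split per HDFS block. The helper below is hypothetical, shown only to contrast block-sized splits with region-sized ones; it is not an HBase API:

```java
// Hypothetical illustration: when reading HFiles directly, the number of
// input splits is roughly one per HDFS block, i.e. ceil(fileSize / blockSize),
// instead of one split per region.
public final class SplitMath {
    // Returns the approximate number of splits for a file of the given size.
    static long splitCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;        // 128 MB HDFS block
        long hfileSize = 10L * 1024 * 1024 * 1024;  // 10 GB HFile
        // A 10 GB HFile over 128 MB blocks yields 80 splits,
        // versus a single region-sized split of 10 GB.
        System.out.println(splitCount(hfileSize, blockSize)); // 80
    }
}
```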
Are there any other pointers, say configurations, that may help boost performance, for instance the HDFS block size? The main use case is a full table scan for the most part.
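For a full-scan-heavy workload, two commonly cited knobs are the HDFS block size and the scanner caching. The fragment below is a sketch only; the values are assumptions for illustration, not tuned recommendations:

```xml
<!-- hdfs-site.xml: larger blocks mean fewer, bigger sequential reads
     during a full scan (256 MB here is an illustrative value). -->
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>
</property>

<!-- hbase-site.xml: fetch more rows per scanner RPC to cut round trips
     (1000 is an illustrative value). -->
<property>
  <name>hbase.client.scanner.caching</name>
  <value>1000</value>
</property>
```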
As it turns out, the scan itself was actually pretty fast. Performance analysis showed that the problem lay in one of the object representations of an IP address: InetAddress took a significant amount of time to resolve an IP address. We switched to using the raw bytes to extract whatever we needed, and this change alone brought the job down to about 2.5 hours. Modelling the problem as a MapReduce job and running it on MR2 with the same change showed that it could finish in about 1 hour 20 minutes. The iterative nature and smaller memory footprint helped MR2 achieve more parallelism, and it was therefore much faster.
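The raw-bytes approach can be sketched as follows; `formatIpv4` is a hypothetical helper, shown only to illustrate avoiding `InetAddress` on the hot path:

```java
// Hypothetical sketch: format an IPv4 address from the raw 4 bytes stored
// in the row, avoiding java.net.InetAddress on the hot path (its
// getHostName() can trigger slow reverse-DNS lookups).
public final class IpBytes {
    // Converts 4 raw bytes to dotted-quad text; bytes are masked to unsigned.
    static String formatIpv4(byte[] b) {
        return (b[0] & 0xFF) + "." + (b[1] & 0xFF) + "."
             + (b[2] & 0xFF) + "." + (b[3] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 192, (byte) 168, 0, 1};
        System.out.println(formatIpv4(raw)); // 192.168.0.1
    }
}
```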