I have 100 million records to insert into an HBase table (Phoenix) as the result of a Spark job. I would like to know: if I convert them to a DataFrame and save it, will that do a bulk load, or is that not an efficient way to write data to a Phoenix HBase table?
From: Josh Mahonin
Date: Wed, May 18, 2016 at 10:29 PM
Subject: Re: PHOENIX SPARK - DataFrame for BulkLoad
Hi,
The Spark integration uses the Phoenix MapReduce framework, which under the hood translates the saved rows into UPSERTs spread across a number of workers.
You should try out both methods and see which works best for your use case. For what it's worth, we routinely do load / save operations using the Spark integration on those data sizes.
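For reference, a minimal sketch of the DataFrame save path through the Phoenix Spark plugin is below; the table name, ZooKeeper URL, and input path are placeholders, not details from your job:

import org.apache.spark.sql.{SaveMode, SparkSession}

// Build a session and load the records to write (the parquet path is a stand-in).
val spark = SparkSession.builder().appName("phoenix-save").getOrCreate()
val df = spark.read.parquet("/path/to/records")

// Save through the Phoenix Spark plugin; each row is written as an UPSERT
// distributed across the executors. The plugin expects SaveMode.Overwrite.
df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "OUTPUT_TABLE")
  .option("zkUrl", "zkhost:2181")
  .save()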