I have 100 million records to insert into an HBase table (Phoenix) as the result of a Spark job. I would like to know: if I convert them to a DataFrame and save it, will that do a bulk load, or is that not an efficient way to write data to a Phoenix HBase table?
From: Josh Mahonin
Date: Wed, May 18, 2016 at 10:29 PM
Subject: Re: PHOENIX SPARK - DataFrame for BulkLoad
Hi,
The Spark integration uses the Phoenix MapReduce framework, which under the hood translates the saved rows into UPSERTs spread across a number of workers.
You should try out both methods and see which works best for your use case. For what it's worth, we routinely do load / save operations using the Spark integration on those data sizes.
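For reference, a minimal sketch of the DataFrame save path through the Phoenix Spark plugin is below; the table name, ZooKeeper URL, and input path are placeholders, not details from your job:

import org.apache.spark.sql.{SaveMode, SparkSession}

// Build a session and load the records to write (the parquet path is a stand-in).
val spark = SparkSession.builder().appName("phoenix-save").getOrCreate()
val df = spark.read.parquet("/path/to/records")

// Save through the Phoenix Spark plugin; each row is written as an UPSERT
// distributed across the executors. The plugin expects SaveMode.Overwrite.
df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "OUTPUT_TABLE")
  .option("zkUrl", "zkhost:2181")
  .save()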