Looking to find if anybody got through a way to perform upserts (append / update / partial inserts/update) on Phoenix using Apache Spark. I could see as per Phoenix documentation save SaveMode.Overwrite
is only supported - which is overwrite with full load. I tried changing the mode it throws error.
Currently, we have *.hql jobs running to perform this operation, now we want to rewrite them in Spark Scala. Thanks for sharing your valuable inputs.
While Phoenix connector indeed supports only SaveMode.Overwrite
, the implementation doesn't conform to the Spark standard, which states that:
Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame
If you check the source, you'll see that saveToPhoenix
just calls saveAsNewAPIHadoopFile
with PhoenixOutputFormat
, which
internally builds the UPSERT query for you
In other words SaveMode.Overwrite
with Phoenix Connector is in fact UPSERT
.