apache-spark, pyspark, delta-lake

Insert or Update a delta table from a dataframe in Pyspark


I currently have a PySpark dataframe from which I initially created a Delta table using the code below:

df.write.format("delta").saveAsTable("events")

Since this dataframe is populated with new data on a daily basis in my use case, I used the following syntax to append new records to the Delta table:

df.write.format("delta").mode("append").saveAsTable("events")

I did all of this in Databricks on my own cluster. I want to know how I can write generic PySpark code in Python that creates the Delta table if it does not exist and appends records if it does. I want this because if I give my Python package to someone else, they will not have the same Delta table in their environment, so it should get created dynamically from the code.


Solution

  • If you don't have the Delta table yet, it will be created when you use the append mode. So you don't need any special code to handle the case when the table doesn't exist yet versus when it already exists.

    P.S. You need such code only if you're performing a merge into the table, not an append. In that case the code looks like this:

    if table_exists:
      do_merge
    else:
      df.write....
    

    P.S. Here is a generic implementation of that pattern.
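
    As a minimal sketch of such an implementation: it assumes the table is named events (as in the question) and that a hypothetical event_id column identifies a unique record for the merge condition, so adjust both to your schema.

    from delta.tables import DeltaTable

    def upsert_events(spark, df, table_name="events"):
        # Create the Delta table on the first run; merge (upsert) into it afterwards.
        if spark.catalog.tableExists(table_name):
            target = DeltaTable.forName(spark, table_name)
            (target.alias("t")
                .merge(df.alias("s"), "t.event_id = s.event_id")  # event_id is a placeholder key
                .whenMatchedUpdateAll()      # update records that already exist
                .whenNotMatchedInsertAll()   # insert records that are new
                .execute())
        else:
            # Fresh environment: create the table from the dataframe
            df.write.format("delta").saveAsTable(table_name)

    Calling upsert_events(spark, daily_df) each day then creates the table the first time and merges new records on every later run. Note that spark.catalog.tableExists was added in PySpark 3.3, so an older cluster needs a different existence check.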