Tags: apache-spark, pyspark, azure-synapse, delta-lake

Change Column name in table and delta files?


I have a folder delta_table containing Delta files, and I have created a table called test_delta_table on top of those files. How can I change the name of a column both in the underlying Delta files and in the table itself?

I get a Spark error when I read from a location and try to overwrite that same location, and it also seems to corrupt the files there. So I suspect it's not possible to overwrite a location you are reading from?


Solution

  • Correct: you cannot write to the same location you are reading from, but there are a couple of ways to get around that.

    Option 1: checkpoint the DataFrame

    The first thing you can do is checkpoint the DataFrame so the lineage gets broken, allowing you to write to the read location.

    >>> # The checkpoint directory must be set beforehand
    >>> spark.sparkContext.setCheckpointDir("/hdfs/checkpoint/path")
    >>> df = df.checkpoint()
    >>> # Now you can write, saveAsTable, or do whatever you want on the read location
    >>> df.write.mode("overwrite").parquet("/read/path")
    

    You can also use localCheckpoint, which saves the DataFrame to the executors' cache memory instead of to disk as checkpoint does; see the sketch below.
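
    A minimal sketch of that variant (keep in mind that a locally checkpointed DataFrame lives only in executor storage, so it is lost if an executor dies):

    >>> # No checkpoint dir needed; the data stays in executor storage
    >>> df = df.localCheckpoint()
    >>> df.write.mode("overwrite").parquet("/read/path")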

    Option 2: write to a temporary location

    The second approach is doing the checkpoint manually: first, write the DataFrame to a temporary location, then read from that temporary location to create a new DataFrame, and finally write back to the read path.

    >>> # Manually checkpoint through a temporary location
    >>> df.write.parquet("/tmp/path")
    >>> tmp_df = spark.read.parquet("/tmp/path")
    >>> # Safe now: the lineage points at /tmp/path, not /read/path
    >>> tmp_df.write.mode("overwrite").parquet("/read/path")
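
    Putting it together for the original question, here is a minimal sketch; /path/to/delta_table is a placeholder for your folder, and old_col/new_col stand in for the real column names. The overwriteSchema option tells Delta to accept the changed schema on overwrite, and since test_delta_table reads its schema from the Delta transaction log, it should then reflect the renamed column.

    >>> # Read the Delta files, rename the column, and break the lineage
    >>> spark.sparkContext.setCheckpointDir("/hdfs/checkpoint/path")
    >>> df = spark.read.format("delta").load("/path/to/delta_table")
    >>> df = df.withColumnRenamed("old_col", "new_col").checkpoint()
    >>> # Overwrite in place; overwriteSchema lets Delta accept the new column name
    >>> (df.write.format("delta")
    ...     .mode("overwrite")
    ...     .option("overwriteSchema", "true")
    ...     .save("/path/to/delta_table"))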