I have a Delta table at dbfs:/mnt/some_table
in PySpark, which, as you know, is a folder containing a series of .parquet files (plus the _delta_log transaction log). I want to get the last modified time of that table without having to query the data in the table.
If I run dbutils.fs.ls(path)
on the table's folder, I get a modificationTime that seems to just be now() every time I query it. This makes me believe modificationTime doesn't work accurately on folders in PySpark.
I could just get the modificationTime of every .parquet file in the folder and take the greatest value, but I wonder if there is a built-in or more performant way to get the last modification time of a Delta table.
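For reference, here is a minimal sketch of that fallback, assuming dbutils is available (as on Databricks) and that the parquet files sit at the top level of the folder rather than in partition subdirectories:

# Fallback: take the max modificationTime (epoch milliseconds) across the
# table's top-level .parquet files. Listing can be slow for large tables,
# and this does not recurse into partition subdirectories.
files = dbutils.fs.ls("dbfs:/mnt/some_table")
last_modified_ms = max(
    f.modificationTime for f in files if f.name.endswith(".parquet")
)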
The simplest way to do that is to query the history of the table - it's a relatively lightweight operation that reads this data from the transaction log:
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, pathToTable)
# history(1) returns a one-row DataFrame with the most recent entry
# from the transaction log, so this doesn't scan the table's data
lastOperationTimestamp = deltaTable.history(1).select("timestamp").collect()[0][0]
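If you'd rather not construct a DeltaTable object, the same information is exposed through SQL; a sketch, using the path from the question:

# DESCRIBE HISTORY reads the same transaction log; LIMIT 1 keeps only
# the most recent operation
lastOperationTimestamp = (
    spark.sql("DESCRIBE HISTORY delta.`dbfs:/mnt/some_table` LIMIT 1")
    .collect()[0]["timestamp"]
)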