apache-spark pyspark apache-spark-sql databricks azure-databricks

Add the creation date of a parquet file into a DataFrame


Currently I load multiple parquet files with this code:

df = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")

(Inside the Voucher folder there is one folder per date, with one parquet file inside each.)

How can I add the creation date of each parquet file to my DataFrame?

Thanks

EDIT 1:

Thanks rainingdistros, I wrote this:

import os
from datetime import datetime

Path = "/dbfs/mnt/dev/bronze/Voucher/2022-09-23/"
fileFull = Path + 'XXXXXX.parquet'
statinfo = os.stat(fileFull)
create_date = datetime.fromtimestamp(statinfo.st_ctime)
display(create_date)

Now I must find a way to loop through all the files and add a column in the DataFrame.


Solution

    • The information returned by os.stat might not be accurate unless the operation you need (i.e., adding the additional column with the creation time) is the first operation performed on these files.

    • Each time the file is modified, both st_mtime and st_ctime are updated to the modification time.

    • When I modify this file, the change can be observed in the information returned by os.stat.
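The behaviour described above can be reproduced with a small sketch. It assumes a Linux-style file system (such as the one behind /dbfs), where st_ctime is the time of the last metadata change rather than a true creation time; on Windows st_ctime behaves differently. The temporary file and the `stat_times` helper are hypothetical and only used for illustration.

```python
import os
import tempfile
import time


def stat_times(path):
    """Return (st_mtime, st_ctime) for the given file."""
    info = os.stat(path)
    return info.st_mtime, info.st_ctime


# Hypothetical temporary file standing in for one of the data files.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("id,amount\n1,10\n")
    sample_path = tmp.name

before = stat_times(sample_path)

time.sleep(0.1)  # give the file system clock time to advance

# Modify the file by appending a row.
with open(sample_path, "a") as f:
    f.write("2,20\n")

after = stat_times(sample_path)

print("before modification:", before)
print("after modification: ", after)

os.remove(sample_path)  # clean up the temporary file
```

If both values move forward after the append, st_ctime cannot be relied on as a creation date once a file has been rewritten.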

    • So, if adding this column is the first operation that will be performed on these files, you can use the following code to add the date as a column to your files.
    import os
    from datetime import datetime
    import pandas as pd

    path = "/dbfs/mnt/repro/2022-12-01"
    for file in os.listdir(path):
        file_path = f"{path}/{file}"
        pdf = pd.read_csv(file_path)
        display(pdf)  # Databricks notebook helper
        # Stat the file being processed (not a hard-coded one) before rewriting it;
        # on Linux, st_ctime is the last metadata change, not a true creation time.
        statinfo = os.stat(file_path)
        create_date = datetime.fromtimestamp(statinfo.st_ctime)
        pdf['creation_date'] = create_date.date()
        pdf.to_csv(file_path, index=False)
    

    • After running the code, each file contains the new creation_date column.

    • In this case it might be better to take the date directly from the folder name: the information is already available there, so it only needs to be extracted and added as a column, in the same way as in the code above.
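Taking the date from the folder name could look roughly like the sketch below. It assumes the folders are named in YYYY-MM-DD format, as in the question's path; `DATE_FOLDER`, `folder_date`, `with_folder_date` and the `folder_date` column name are hypothetical names chosen for this example. The Spark helper uses `input_file_name()` and `regexp_extract()` from `pyspark.sql.functions` and is imported lazily so the plain-Python part runs without a Spark environment.

```python
import re
from datetime import date

# Matches a path segment such as /2022-09-23/ (assumed folder naming).
DATE_FOLDER = r"/(\d{4}-\d{2}-\d{2})/"


def folder_date(file_path):
    """Extract the date folder from a file path, or None if there is none."""
    match = re.search(DATE_FOLDER, file_path)
    return date.fromisoformat(match.group(1)) if match else None


def with_folder_date(df):
    """Add a folder_date column to a Spark DataFrame read from dated folders."""
    # Imported here so the rest of the sketch works without pyspark installed.
    from pyspark.sql import functions as F

    return df.withColumn(
        "folder_date",
        F.to_date(F.regexp_extract(F.input_file_name(), DATE_FOLDER, 1)),
    )


# Plain-Python check using the path from the question.
print(folder_date("/dbfs/mnt/dev/bronze/Voucher/2022-09-23/XXXXXX.parquet"))

# In a Databricks notebook (where `spark` exists) the usage would be:
# df = with_folder_date(spark.read.parquet("/mnt/dev/bronze/Voucher/*/*"))
```

This avoids touching the files at all, so the result does not depend on whether they were modified after being written.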