apache-spark, hdfs, parquet

In Spark, how to get the Parquet file created timestamp as a column


In Spark, while reading files from HDFS, I want to add a column to the DataFrame that holds, for each record, the created timestamp of the file from which that record was read.

For example, HDFS has the structure below:

/data/module/
|----------- file1.parquet
|----------- file2.parquet
|----------- file3.parquet
|----------- file4.parquet

When this directory is read in Spark, I want each record to get a column containing the created timestamp of the file it was read from.

I tried using df.withColumn("records_inserted_time", current_timestamp()),

but this does not give the required result: current_timestamp() returns the time the query is executed, not the timestamp of the source file.


Solution

  • Based on the information you provided, it looks like you want to add a column to each and every record of the DataFrame that holds the timestamp of the file that particular record came from.

    For this you can use Hadoop's FileSystem class via spark._jvm to get the file names along with their timestamps. Note that HDFS exposes a modification time rather than a true creation time; for write-once Parquet files the two are effectively the same.

    from py4j.java_gateway import java_import

    # Import Hadoop's FileSystem and Path classes into the JVM gateway
    java_import(spark._jvm, 'org.apache.hadoop.fs.FileSystem')
    java_import(spark._jvm, 'org.apache.hadoop.fs.Path')

    hdfs_path = "/data/module/"  # the directory being read

    fs = spark._jvm.FileSystem.get(spark._jsc.hadoopConfiguration())
    file_statuses = fs.listStatus(spark._jvm.Path(hdfs_path))

    # getModificationTime() returns milliseconds since the epoch; for
    # write-once Parquet files this is effectively the creation time
    creation_times = [(status.getPath().toString(), status.getModificationTime())
                      for status in file_statuses]
    

    Once you have the file names and their created times, you can add the new column with a simple lookup, as sketched below.
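
    As a minimal sketch (assuming the creation_times list built above; the times_df and file_path names are illustrative, not from the original post): tag each record with its source file via input_file_name(), then join against a small lookup DataFrame of file timestamps. Depending on your cluster, the URI returned by input_file_name() may differ from getPath().toString() in scheme or authority, so you may need to normalize the paths before joining.

    from pyspark.sql import functions as F

    # Build a lookup DataFrame of (file path, timestamp) from creation_times;
    # the epoch milliseconds are converted to a proper timestamp column
    times_df = (spark.createDataFrame(creation_times, ["file_path", "created_ms"])
                .withColumn("file_created_time",
                            (F.col("created_ms") / 1000).cast("timestamp"))
                .drop("created_ms"))

    # input_file_name() records the source file URI for every row
    df = (spark.read.parquet(hdfs_path)
          .withColumn("file_path", F.input_file_name()))

    # Left join so each record picks up the timestamp of its source file
    result = df.join(times_df, on="file_path", how="left")
    result.show(truncate=False)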

    I have written a post with an explanation and sample code for this problem: https://medium.com/@azam.khan681542/apache-spark-get-source-files-created-timestamp-as-a-column-in-dataframe-4fb1baca82bd