Tags: python, azure-storage, azure-databricks

File metadata such as time in Azure Storage from Databricks


I'm trying to get a file's creation time metadata.

File is in: Azure Storage
Accessing data through: Databricks

Right now I'm using:

   file_path = my_storage_path
   dbutils.fs.ls(file_path)

but it returns

[FileInfo(path='path_myFile.csv', name='fileName.csv', size=437940)]

I do not get any information about the creation time. Is there a way to get that information?

Other solutions on Stack Overflow, such as "Does databricks dbfs support file metadata such as file/folder create date or modified date", refer to files that are already in Databricks. In my case we access the data from Databricks, but the data are in Azure Storage.


Solution

  • It really depends on the version of Databricks Runtime (DBR) that you're using. For example, the modification timestamp is available if you use DBR 10.2 (I didn't test with 10.0/10.1, but it's definitely not available on 9.1):

    (screenshot: dbutils.fs.ls output including a modificationTime field)

    If you need to get that information on an older runtime, you can use the Hadoop FileSystem API via the Py4j gateway, like this:

    # Access the Hadoop classes through the Py4j gateway exposed by SparkContext
    URI           = sc._gateway.jvm.java.net.URI
    Path          = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem    = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
    
    fs = FileSystem.get(URI("/tmp"), Configuration())
    
    # listStatus returns Hadoop FileStatus objects, which expose the
    # modification time (epoch milliseconds) alongside path and size
    status = fs.listStatus(Path('/tmp/'))
    for fileStatus in status:
        print(f"path={fileStatus.getPath()}, size={fileStatus.getLen()}, mod_time={fileStatus.getModificationTime()}")