azure, apache-spark, pyspark, databricks, azure-data-lake-gen2

How to efficiently read the data lake files' metadata


I want to read the last-modified datetime of the files in a data lake from a Databricks script. Ideally, I could read it efficiently as a column while reading the data from the data lake.
Thank you:)


UPDATE: If you're working in Databricks, since Databricks Runtime 10.4 (released on Mar 18, 2022) the dbutils.fs.ls() command returns the modificationTime of folders and files as well.
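
For example, in a Databricks notebook on Runtime 10.4 or later, you can list a folder and read each file's modificationTime directly; this is a minimal sketch where the abfss path is a placeholder to replace with your own container and path:

    from datetime import datetime, timezone

    # Each FileInfo returned by dbutils.fs.ls has path, name, size and,
    # since Databricks Runtime 10.4, modificationTime (epoch milliseconds).
    files = dbutils.fs.ls("abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/")

    for f in files:
        print(f.name, datetime.fromtimestamp(f.modificationTime / 1000, tz=timezone.utc))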


Solution

  • Regarding the issue, you can list the files through the Hadoop FileSystem API and read each file's modification time, as in the following code:

    # Access the Hadoop filesystem classes through the JVM gateway
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    conf = sc._jsc.hadoopConfiguration()
    
    # Authenticate to the storage account with its access key
    conf.set(
      "fs.azure.account.key.<account-name>.dfs.core.windows.net",
      "<account-access-key>")
    
    # Get a FileSystem handle for the ADLS Gen2 container
    fs = Path('abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/').getFileSystem(conf)
    
    # List the files under the path and print each one's modification time (epoch milliseconds)
    status = fs.listStatus(Path('abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/'))
    
    for file_status in status:
      print(file_status)
      print(file_status.getModificationTime())
    
    

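    If you want the metadata as a column, one option is to turn the listStatus result into a small Spark DataFrame. This is a sketch assuming the status variable from the code above and the spark session available in a Databricks notebook:

    from datetime import datetime, timezone

    # Build (path, size, modification time) rows from the Hadoop FileStatus objects
    rows = [
        (s.getPath().toString(),
         s.getLen(),
         datetime.fromtimestamp(s.getModificationTime() / 1000, tz=timezone.utc))
        for s in status
    ]

    files_df = spark.createDataFrame(rows, ["path", "size_bytes", "modification_time"])
    files_df.show(truncate=False)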