azure, apache-spark, pyspark, databricks, azure-data-lake-gen2

How to efficiently read the data lake files' metadata


I want to read the last-modified datetime of the files in a data lake from a Databricks script. Ideally, I could read it efficiently as a column while reading the data from the data lake.
Thank you:)


UPDATE: If you're working in Databricks, since Databricks Runtime 10.4 (released on Mar 18, 2022) the dbutils.fs.ls() command returns the modificationTime of folders and files as well.
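
For example, in a Databricks notebook on Runtime 10.4 or later, you can list a folder and read each file's modificationTime directly; this is a minimal sketch where the abfss path is a placeholder to replace with your own container and path:

    from datetime import datetime, timezone

    # Each FileInfo returned by dbutils.fs.ls has path, name, size and,
    # since Databricks Runtime 10.4, modificationTime (epoch milliseconds).
    files = dbutils.fs.ls("abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/")

    for f in files:
        print(f.name, datetime.fromtimestamp(f.modificationTime / 1000, tz=timezone.utc))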


Solution

  • Regarding the issue, you can list the files through the Hadoop FileSystem API and read each file's modification time, as in the following code:

    # Access the Hadoop filesystem classes through the JVM gateway
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    conf = sc._jsc.hadoopConfiguration()
    
    # Authenticate to the storage account with its access key
    conf.set(
      "fs.azure.account.key.<account-name>.dfs.core.windows.net",
      "<account-access-key>")
    
    # Get a FileSystem handle for the ADLS Gen2 container
    fs = Path('abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/').getFileSystem(conf)
    
    # List the files under the path and print each one's modification time (epoch milliseconds)
    status = fs.listStatus(Path('abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/'))
    
    for file_status in status:
      print(file_status)
      print(file_status.getModificationTime())
    
    

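    If you want the metadata as a column, one option is to turn the listStatus result into a small Spark DataFrame. This is a sketch assuming the status variable from the code above and the spark session available in a Databricks notebook:

    from datetime import datetime, timezone

    # Build (path, size, modification time) rows from the Hadoop FileStatus objects
    rows = [
        (s.getPath().toString(),
         s.getLen(),
         datetime.fromtimestamp(s.getModificationTime() / 1000, tz=timezone.utc))
        for s in status
    ]

    files_df = spark.createDataFrame(rows, ["path", "size_bytes", "modification_time"])
    files_df.show(truncate=False)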