Tags: python, scala, databricks, azure-data-lake, azure-databricks

List All Files in a Folder Sitting in a Data Lake


I'm trying to get an inventory of all files in a folder that has a few sub-folders, all of which sit in a data lake. Here is the code I'm testing.

import sys, os
import pandas as pd

mylist = []
root = "/mnt/rawdata/parent/"
path = os.path.join(root, "targetdirectory") 

for path, subdirs, files in os.walk(path):
    for name in files:
        mylist.append(os.path.join(path, name))


df = pd.DataFrame(mylist)
print(df)

I also tried the sample code from this link:

Python list directory, subdirectory, and files

I'm working in Azure Databricks, and I'm open to using Scala to do the job. So far nothing has worked; each time I get an empty dataframe. I believe this is pretty close, but I must be missing something small. Thoughts?
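(A note on why the dataframe likely comes back empty: os.walk runs against the driver's local filesystem, and in Databricks DBFS mounts such as /mnt/rawdata are typically exposed there under a /dbfs prefix. The sketch below is a hedged guess at the fix rather than part of the original post; the only change from the question's code is the path prefix.)

import os
import pandas as pd

# DBFS mounts are usually visible to local-filesystem APIs under /dbfs,
# so prefix the mount path when calling os.walk from the driver.
root = "/dbfs/mnt/rawdata/parent/"

mylist = []
for dirpath, subdirs, files in os.walk(root):
    for name in files:
        mylist.append(os.path.join(dirpath, name))

df = pd.DataFrame(mylist, columns=["path"])
print(df)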


Solution

  • I got this to work:

    from azure.storage.blob import BlockBlobService

    # Connect to the storage account behind the data lake.
    blob_service = BlockBlobService(account_name='your_account_name', account_key='your_account_key')

    # Page through every blob in the 'rawdata' container, following the
    # continuation marker until the service reports no more results.
    blobs = []
    marker = None
    while True:
        batch = blob_service.list_blobs('rawdata', marker=marker)
        blobs.extend(batch)
        if not batch.next_marker:
            break
        marker = batch.next_marker

    # Print the full path of every blob found.
    for blob in blobs:
        print(blob.name)
    

    The only prerequisite is installing the azure.storage package on the cluster: in the Clusters window, click 'Install New' -> 'PyPI', set the package to 'azure.storage', and click 'Install'.
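
    A Databricks-native alternative, if you'd rather avoid storage account keys, is to call dbutils.fs.ls recursively against the mount point. This is a sketch rather than part of the original answer; it assumes it runs in a Databricks notebook (where dbutils is predefined) and that /mnt/rawdata/parent/ is the mounted folder from the question.

    def list_files(path):
        # dbutils.fs.ls returns FileInfo entries; isDir() tells
        # folders apart from files, so recurse into folders.
        files = []
        for item in dbutils.fs.ls(path):
            if item.isDir():
                files.extend(list_files(item.path))
            else:
                files.append(item.path)
        return files

    for f in list_files("/mnt/rawdata/parent/"):
        print(f)

    The resulting list can then be passed to pandas.DataFrame exactly as in the question.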