Tags: python, scala, databricks, azure-data-lake, azure-databricks

List All Files in a Folder Sitting in a Data Lake


I'm trying to get an inventory of all files in a folder that has a few sub-folders, all of which sit in a data lake. Here is the code I'm testing.

import sys, os
import pandas as pd

mylist = []
root = "/mnt/rawdata/parent/"
path = os.path.join(root, "targetdirectory") 

for path, subdirs, files in os.walk(path):
    for name in files:
        mylist.append(os.path.join(path, name))


df = pd.DataFrame(mylist)
print(df)

I also tried the sample code from this link:

Python list directory, subdirectory, and files

I'm working in Azure Databricks, and I'm open to using Scala to do the job. So far nothing has worked; each time I get an empty dataframe. I believe this is pretty close, but I must be missing something small. Thoughts?
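(A note on why the dataframe likely comes back empty: os.walk runs against the driver's local filesystem, and in Databricks DBFS mounts such as /mnt/rawdata are typically exposed there under a /dbfs prefix. The sketch below is a hedged guess at the fix rather than part of the original post; the only change from the question's code is the path prefix.)

import os
import pandas as pd

# DBFS mounts are usually visible to local-filesystem APIs under /dbfs,
# so prefix the mount path when calling os.walk from the driver.
root = "/dbfs/mnt/rawdata/parent/"

mylist = []
for dirpath, subdirs, files in os.walk(root):
    for name in files:
        mylist.append(os.path.join(dirpath, name))

df = pd.DataFrame(mylist, columns=["path"])
print(df)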


Solution

  • I got this to work:

    from azure.storage.blob import BlockBlobService

    # Connect to the storage account behind the data lake.
    blob_service = BlockBlobService(account_name='your_account_name', account_key='your_account_key')

    # Page through every blob in the 'rawdata' container, following the
    # continuation marker until the service reports no more results.
    blobs = []
    marker = None
    while True:
        batch = blob_service.list_blobs('rawdata', marker=marker)
        blobs.extend(batch)
        if not batch.next_marker:
            break
        marker = batch.next_marker

    # Print the full path of every blob found.
    for blob in blobs:
        print(blob.name)
    

    The only prerequisite is installing the azure.storage package on the cluster: in the Clusters window, click 'Install New' -> 'PyPI', set the package to 'azure.storage', and click 'Install'.
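
    A Databricks-native alternative, if you'd rather avoid storage account keys, is to call dbutils.fs.ls recursively against the mount point. This is a sketch rather than part of the original answer; it assumes it runs in a Databricks notebook (where dbutils is predefined) and that /mnt/rawdata/parent/ is the mounted folder from the question.

    def list_files(path):
        # dbutils.fs.ls returns FileInfo entries; isDir() tells
        # folders apart from files, so recurse into folders.
        files = []
        for item in dbutils.fs.ls(path):
            if item.isDir():
                files.extend(list_files(item.path))
            else:
                files.append(item.path)
        return files

    for f in list_files("/mnt/rawdata/parent/"):
        print(f)

    The resulting list can then be passed to pandas.DataFrame exactly as in the question.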