I'm trying to get an inventory of all files in a folder, which has a few sub-folders, all of which sit in a data lake. Here is the code that I'm testing.
import sys, os
import pandas as pd

mylist = []
root = "/mnt/rawdata/parent/"
path = os.path.join(root, "targetdirectory")

for path, subdirs, files in os.walk(path):
    for name in files:
        mylist.append(os.path.join(path, name))

df = pd.DataFrame(mylist)
print(df)
I also tried the sample code from this link:
Python list directory, subdirectory, and files
I'm working in Azure Databricks, and I'm open to using Scala to do the job. So far, nothing has worked for me; each time I just get an empty DataFrame back. I believe this is pretty close, but I must be missing something small. Thoughts?
I got this to work.
from azure.storage.blob import BlockBlobService

blob_service = BlockBlobService(account_name='your_account_name', account_key='your_account_key')

# Page through the 'rawdata' container, following the continuation marker
# until every blob has been collected
blobs = []
marker = None
while True:
    batch = blob_service.list_blobs('rawdata', marker=marker)
    blobs.extend(batch)
    if not batch.next_marker:
        break
    marker = batch.next_marker

for blob in blobs:
    print(blob.name)
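Since the original goal was a DataFrame inventory of the files, the blob names collected above drop straight into pandas. This is just a small sketch built on the blobs list from the snippet above; the 'path' column name is arbitrary.

import pandas as pd

# Turn the listing into a one-column DataFrame of blob paths
df = pd.DataFrame([blob.name for blob in blobs], columns=['path'])
print(df)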
The only prerequisite is that you need to install the azure.storage package. In the Clusters window, click 'Install New' -> PyPI > package = 'azure.storage', then click 'Install'.
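If you prefer to install from the notebook itself, a pip cell should also work. This assumes a Databricks Runtime recent enough to support the %pip magic; 'azure-storage' is simply the normalized PyPI name of the same package.

%pip install azure-storage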