Tags: python, azure, azure-blob-storage, azure-databricks

How to read a shapefile from Azure Blob Storage into an Azure Databricks Notebook?


As the title suggests, I have a shapefile (.shp) in my Azure Blob Storage container, and I'm trying to read it into my Azure Databricks notebook directly, without downloading it to my local drive.

I'm able to read CSV files from Blob Storage, but I'm running into problems with the shapefile, and I haven't been able to find a solution in past Stack Overflow questions.
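For reference, the CSV read that works looks roughly like this (a minimal sketch; the blob name is a placeholder, and the credentials are the same as in the code below):

from io import BytesIO

import pandas as pd
from azure.storage.blob import BlobClient

# Works: a CSV is a single, self-contained file, so the raw bytes are enough.
csv_client = BlobClient(
    account_url=ACCOUNT_URL,
    container_name=CONTAINER_NAME,
    blob_name="some_table.csv",  # hypothetical blob name
    credential=BLOB_STORAGE_CREDENTIAL,
)
df = pd.read_csv(BytesIO(csv_client.download_blob().readall()))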

Here's the code I'm using:

from io import BytesIO
from azure.storage.blob import BlobClient

# Client pointing at the .shp blob in the container
blob_client = BlobClient(
    account_url=ACCOUNT_URL,
    container_name=CONTAINER_NAME,
    blob_name=BLOB_NAME,
    credential=BLOB_STORAGE_CREDENTIAL,
)
# Download the blob's bytes and wrap them in an in-memory buffer
blob_data = blob_client.download_blob().readall()
shapefile = BytesIO(blob_data)

shapefile

This returns <_io.BytesIO at 0x7f1ad4bbbcc0>.

Subsequently I've tried reading the shapefile with geopandas and Fiona:

import fiona
import geopandas as gpd

# open with geopandas
gdf = gpd.read_file(shapefile)

# open with fiona
with fiona.open(shapefile) as shp:
    first_feature = next(iter(shp))
    print(first_feature)

The geopandas call gives the error DriverError: '/vsimem/31debcdbc2b0480b9f0567aea3a687d7' not recognized as a supported file format.

Fiona gives a similar error: DriverError: '/vsimem/04e527ecf5324605bdcf3643ea3b4bd2/04e527ecf5324605bdcf3643ea3b4bd2' not recognized as a supported file format.

There don't appear to be any issues with the file itself. I've uploaded the shapefile into my Azure workspace and it reads fine from there, but because this file is meant to be used in a workflow on the cloud, I can't take this approach.


Solution

  • You can mount your storage account to Databricks and read the shapefile (.shp). Below is the shapefile I'm working with.


    Code to mount the container:

    dbutils.fs.mount(
        source="wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
        mount_point="/mnt/blob/",
        extra_configs={"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": "<Account_key>"},
    )
    

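    As a quick sanity check (my addition, not shown in the original answer), you can list the mount point to confirm the .shp and its sidecar files are visible:

    # List the mounted folder; the shapefile's sidecars (.shx, .dbf, ...) should appear too
    display(dbutils.fs.ls("/mnt/blob/spatial/"))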

    You can read it using the code below:

    import geopandas

    gdf = geopandas.read_file("/dbfs/mnt/blob/spatial/samp.shp")
    gdf
    

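    As a follow-up check (a sketch of my own, assuming the read succeeded), you can inspect the resulting GeoDataFrame:

    print(gdf.crs)     # coordinate reference system, parsed from the .prj sidecar if present
    print(gdf.head())  # first few records with their geometry column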

    Here, you can see I prefixed the path with /dbfs: geopandas.read_file resolves paths from the local filesystem root, since it does not go through the Spark context.
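    To illustrate that distinction (a sketch of mine, with a hypothetical CSV path): Spark APIs address the mount through the dbfs:/ scheme, while plain-Python libraries such as geopandas go through the /dbfs FUSE mount on the driver's local filesystem.

    # Spark resolves the mount via the DBFS scheme:
    df = spark.read.csv("dbfs:/mnt/blob/spatial/example.csv")  # hypothetical file

    # Local-filesystem libraries bypass the Spark context, so they need the /dbfs prefix:
    gdf = geopandas.read_file("/dbfs/mnt/blob/spatial/samp.shp")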