As the title suggests, I have a shapefile (.shp) in my Azure Blob Storage container, and I'm trying to read it into my Azure Databricks notebook directly without downloading it into my local drive.
I'm able to read CSV files from Blob Storage, but I'm running into problems with the shapefile, and I haven't been able to find a solution in past Stack Overflow questions.
Here's the code I'm using:
from io import BytesIO
from azure.storage.blob import BlobClient
blob_client = BlobClient(
    account_url=ACCOUNT_URL,
    container_name=CONTAINER_NAME,
    blob_name=BLOB_NAME,
    credential=BLOB_STORAGE_CREDENTIAL,
)
blob_data = blob_client.download_blob().readall()
shapefile = BytesIO(blob_data)
shapefile
This returns <_io.BytesIO at 0x7f1ad4bbbcc0>.
Subsequently I've tried reading the shapefile with geopandas and Fiona:
import geopandas as gpd
import fiona

# open with geopandas
gdf = gpd.read_file(shapefile)

# open with fiona
with fiona.open(shapefile) as shp:
    first_feature = next(iter(shp))
    print(first_feature)
geopandas gives the error DriverError: '/vsimem/31debcdbc2b0480b9f0567aea3a687d7' not recognized as a supported file format.
Fiona gives a similar error: DriverError: '/vsimem/04e527ecf5324605bdcf3643ea3b4bd2/04e527ecf5324605bdcf3643ea3b4bd2' not recognized as a supported file format.
There don't appear to be any issues with the file itself: I uploaded the shapefile into my Azure workspace and it read fine from there, but because this file is meant to be used in a workflow on the cloud, I can't use that approach.
You can mount your storage account to Databricks and read the shapefile (.shp). Below is the shapefile I am using.
Code to mount:
dbutils.fs.mount(
    source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
    mount_point = "/mnt/blob/",
    extra_configs = {"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": "<Account_key>"}
)
Using the code below, you can read it:
import geopandas

gdf = geopandas.read_file("/dbfs/mnt/blob/spatial/samp.shp")
gdf
Here, you can see I prefixed /dbfs to the path: geopandas.read_file resolves paths on the local filesystem from the root (it does not go through the Spark context), and /dbfs is where the mounted storage is exposed to local file APIs.
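For context on why the mount works while the BytesIO approach fails: a shapefile is not a single file but a set of sidecar files (.shp plus the mandatory .shx and .dbf, and usually .prj), and GDAL needs to find them side by side. The mount exposes the whole directory, whereas your BytesIO holds only the .shp bytes. A minimal stdlib sketch (the helper name is hypothetical) of the sibling blob names that belong together:

```python
from pathlib import PurePosixPath

def shapefile_sidecars(shp_blob_name):
    """Given the .shp blob name, list the companion blob names GDAL
    expects next to it. .shx and .dbf are mandatory; .prj and .cpg
    are common optional extras."""
    stem = PurePosixPath(shp_blob_name).with_suffix("")
    return [str(stem) + ext for ext in (".shp", ".shx", ".dbf", ".prj", ".cpg")]

print(shapefile_sidecars("spatial/samp.shp"))
# ['spatial/samp.shp', 'spatial/samp.shx', 'spatial/samp.dbf', 'spatial/samp.prj', 'spatial/samp.cpg']
```

If you do want to stay off the mount, downloading each of these blobs into one temporary directory and pointing geopandas.read_file at the .shp there should also work.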