Tags: python, azure, pyspark, azure-synapse, pydicom

How to retrieve .dcm image files from the ADLS gen2 using Azure Synapse and pySpark notebook?


I want to access files of type .dcm (DICOM) stored in a container on ADLS Gen2 from a PySpark notebook in Azure Synapse Analytics. I'm using pydicom to read the files, but I'm getting an error that the file does not exist. Please have a look at the code below.

To create the filepath I'm using the pathlib library:

Path(path_to_dicoms_dir).joinpath('stage_2_train_images/%s.dcm' % pid)

where pid is the ID of the .dcm image.
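For reference, a minimal runnable sketch of that path construction (the directory and pid values below are placeholders, not the asker's actual data):

```python
from pathlib import Path

# Hypothetical values for illustration only
path_to_dicoms_dir = "/mnt/data"
pid = "003d8fa0-6bf1-40ed-b54c-ac657f8495c5"

# joinpath appends the relative path to the base directory
dicom_path = Path(path_to_dicoms_dir).joinpath('stage_2_train_images/%s.dcm' % pid)
print(dicom_path)  # /mnt/data/stage_2_train_images/003d8fa0-6bf1-40ed-b54c-ac657f8495c5.dcm
```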

To read the .dcm image I'm trying either of the following:

d = pydicom.read_file(data['dicom']) 
OR
d = pydicom.dcmread(data['dicom'])  

where data['dicom'] is the path.

I've checked the path and there is no issue with it; the file exists and all the access rights are in place, since I'm able to access other files in the directory just above the one containing these .dcm files. But those other files are CSV, not DCM.

Error:

FileNotFoundError: [Errno 2] No such file or directory: 'abfss:/@.dfs.core.windows.net//stage_2_train_images/stage_2_train_images/003d8fa0-6bf1-40ed-b54c-ac657f8495c5.dcm'

Questions that I have in my mind:

  • Is this the right storage solution for such image data? If not, should I use Blob Storage instead?
  • Is it some issue with the pydicom library? Am I missing a setting that tells pydicom this is an ADLS path?
  • Or should I change my approach entirely and use Databricks to run my notebooks instead?
  • Or can someone help me with this issue?

Solution

  • Is this the right storage solution for such image data? If not, should I use Blob Storage instead?

    The ADLS Gen2 storage account works perfectly fine with Synapse, so there is no need to switch to Blob Storage.

    It seems that pydicom is not resolving the path correctly.

    You need to mount the ADLS Gen2 account in Synapse so that pydicom treats the path as a local filesystem path instead of a URL.

    Follow the Microsoft tutorial How to mount Gen2/Blob Storage to do the same.

    You first need to create a Linked Service in Synapse that stores your ADLS Gen2 account connection details. Then use the code below in your notebook to mount the storage account:

    mssparkutils.fs.mount(
        "abfss://mycontainer@<accountname>.dfs.core.windows.net",  # source container
        "/test",                                                   # mount point
        {"linkedService": "mygen2account"}                         # Linked Service holding the credentials
    )