Tags: python, azure-data-lake, azure-synapse-analytics

How can I read PDF, PPTX, or DOCX files in Python from ADLS Gen2 using Synapse?


I want to read files of different formats with Python in a Synapse notebook. These include .pdf, .pptx, .docx, .msg, and .eml. I would like to read the files in, then parse and manipulate them with Python. I was able to do this in Databricks using different Python libraries.

This is how I accomplished this in Databricks:

# PowerPoint files
from pptx import Presentation
prs = Presentation(file_name)

# PDF files
from pypdf import PdfReader
reader = PdfReader(open(file_name, 'rb'))

# Word documents
import docx
doc = docx.Document(file_name)

# .eml files
import email
msg = email.message_from_file(open(file_name))

# .msg files
import extract_msg
msg = extract_msg.Message(file_name)

In Synapse I have been getting an error: FileNotFoundError: [Errno 2] No such file or directory.

The same file paths work for reading CSV, Excel, or text data with Spark or pandas, so I don't think there is an authorization or connectivity issue. The path format is: abfs[s]://file_system_name@account_name.dfs.core.windows.net/file_path

I also tried mounting the storage location (see "Mounting Storage locations in Synapse"). This let me read text files, but not the other formats.


Solution

  • Mounting was the right approach, as this answer explains. I was using Synapse Studio. The key was to use the path format returned by the getMountPath command for the mounted storage. Otherwise I could use essentially the same code as in my question. The only change was for PDFs, where I had to switch from the pypdf library to PyPDF2.

    The format that worked was:

    path = mssparkutils.fs.getMountPath("/mounted_name") 
    # this gave me this format '/synfs/{jobId}/mounted_path/{filename}'
    

    What did not work was the format returned by mssparkutils.fs.ls:

    mssparkutils.fs.ls("synfs:/{jobId}/mounted_path/") 
    # this gave a different format which did not work   'synfs:/{jobId}/mounted_path/{filename}'
    

    Here is the whole process:

    First install the library you need. Then mount the storage location and get its mounted path. Finally, read the file using the PyPDF2 library.

    !pip install PyPDF2

    # mount the storage location
    from notebookutils import mssparkutils
    mssparkutils.fs.mount( "abfss://mycontainer@<accountname>.dfs.core.windows.net", "/test", {"LinkedService":"mygen2account"} )

    # get mounted path
    path = mssparkutils.fs.getMountPath("/test")
    file_name  = path + '/filename'

    # now read the file
    from PyPDF2 import PdfReader

    reader = PdfReader(open(file_name, 'rb'))
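The same idea can be extended to the other formats from the question. Below is a minimal sketch, not a definitive implementation: the helper names to_local_mount_path and read_document are hypothetical, and the prefix rewrite assumes that the 'synfs:/...' form and the local '/synfs/...' form differ only in their prefix, as the two examples above suggest. The third-party parsers (PyPDF2, python-pptx, python-docx, extract_msg) are imported lazily, so only the library for the file type actually being read has to be installed.

```python
import email
from email import policy
from pathlib import Path


def to_local_mount_path(path):
    """Rewrite 'synfs:/...' as '/synfs/...'; return any other path unchanged.

    Assumption: the two forms differ only in this prefix.
    """
    prefix = "synfs:/"
    if path.startswith(prefix):
        return "/synfs/" + path[len(prefix):].lstrip("/")
    return path


def read_document(path):
    """Open a file on a local (mounted) path with a parser chosen by extension."""
    local_path = to_local_mount_path(str(path))
    suffix = Path(local_path).suffix.lower()

    if suffix == ".pdf":
        from PyPDF2 import PdfReader  # third-party, imported only when needed
        return PdfReader(local_path)
    if suffix == ".pptx":
        from pptx import Presentation
        return Presentation(local_path)
    if suffix == ".docx":
        import docx
        return docx.Document(local_path)
    if suffix == ".msg":
        import extract_msg
        return extract_msg.Message(local_path)
    if suffix == ".eml":
        # Standard library only; binary mode avoids guessing the text encoding.
        with open(local_path, "rb") as fh:
            return email.message_from_binary_file(fh, policy=policy.default)
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

In a Synapse notebook this would be called with a path built from mssparkutils.fs.getMountPath, for example read_document(path + '/filename.docx'), where the file name is a placeholder.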