Reading PDF file with Azure Synapse Notebooks


This is my first post asking for help. Until now I have relied on existing Stack Overflow answers, but I can't find one for this problem. Apologies if the formatting is poor; I will try to improve it in future.

I am struggling to read PDF files from Azure Data Lake Storage Gen2 with Azure Synapse notebooks.

Update: I am using Azure Synapse inside a virtual network, with private endpoint access to the storage account.

Reading a CSV file is not a problem. I can access it with this command:

%%pyspark
df = spark.read.load('abfss://**accountname**.dfs.core.windows.net/**file.csv**'
## If header exists uncomment line below
##, header=True
)
display(df.limit(10))

But reading a PDF always fails. I tried libraries such as PyPDF2 and Camelot.

pdf_file = "abfss://**accountname**.dfs.core.windows.net/**file.pdf**"
# Open the PDF using PyPDF2
pdf_reader = PyPDF2.PdfReader(pdf_file)

I receive an error:

FileNotFoundError: [Errno 2] No such file or directory

I tried mounting the storage location as described in the post "How can I read pdf or pptx or docx files in python from ADLS gen2 using Synapse?"

I can still read the CSV file from the mounted storage, but not the PDF.

mssparkutils.fs.mount( 
    "abfss://container@accountname.dfs.core.windows.net/", 
    "/TR", 
    {"LinkedService":"linkedservice"} 
)

# Getting the mount path works:
path = mssparkutils.fs.getMountPath("TR")
print(path)

import PyPDF2
with open("/synfs/mount#/TR/file.pdf") as f:
    pdf_reader = PyPDF2.PdfReader(f)

Gives an error:

OSError: [Errno 5] Input/output error:

I also tried reading the file through the path returned by getMountPath, which still does not work.

file_name = path + "/file.pdf"
print(file_name)
reader = PyPDF2.PdfReader(open(file_name, 'rb'))

This gives an error: OSError: [Errno 5] Input/output error

I also tried passing the path directly to PyPDF2:

pdf_reader = PyPDF2.PdfReader(file_name)

Gives an error:

logger_warning(
    "PdfReader stream/file object is not in binary mode. "
    "It may not be read correctly.",
    __name__,

Please advise if you know how to solve this. I am using Azure Synapse Studio, not the SDK.


Solution

  • This error occurs when the storage account is accessed through private endpoints on a virtual network.

    • If the Linked Service to Azure Data Lake Storage Gen2 used for the mount relies on a managed private endpoint (with a dfs URI), you need to create a second managed private endpoint using the Azure Blob Storage option (with a blob URI), so that the internal fsspec/adlfs code can connect through the BlobServiceClient interface.

    Refer to this documentation for more information.

    So, create a Blob Storage private endpoint and then try the code below.

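    import PyPDF2

    # The mount is exposed on the local file system under /synfs/<jobId>/<mount point>
    jobId = mssparkutils.env.getJobId()
    path = f"/synfs/{jobId}/TR/pdf/bob.pdf"
    print(path)

    # Read the PDF from the mounted path and extract the text of the first page
    pdf_reader = PyPDF2.PdfReader(path)
    number_of_pages = len(pdf_reader.pages)
    page = pdf_reader.pages[0]
    text = page.extract_text()
    print(text)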

    You can also refer to this answer.
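
To tell a storage connectivity failure apart from a PDF parsing problem, it can help to open the mounted file in binary mode and check its header before handing it to PyPDF2. The sketch below uses only the standard library and relies on the fact that a PDF file starts with the bytes `%PDF-`. The name `check_pdf_readable` and its status strings are hypothetical, chosen for illustration; the path you pass in is whatever your mount resolves to.

```python
import errno


def check_pdf_readable(path):
    """Return a short status string describing whether `path` looks like a readable PDF.

    Hypothetical helper for illustration; not part of PyPDF2 or mssparkutils.
    """
    try:
        # Binary mode matters: PyPDF2 warns when it is given a text-mode file object.
        with open(path, "rb") as f:
            header = f.read(5)
    except FileNotFoundError:
        # The path does not exist on the local file system.
        return "missing"
    except OSError as exc:
        # Errno 5 (EIO) is the "Input/output error" reported in the question.
        if exc.errno == errno.EIO:
            return "io_error"
        return "os_error"
    # Every PDF file begins with the "%PDF-" marker.
    return "ok" if header == b"%PDF-" else "not_pdf"
```

If this returns `io_error` for a file on the mount, the problem lies in reaching the storage rather than in the PDF itself, which would be consistent with the missing blob private endpoint described above.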

    If this is urgent, as a workaround you can copy the file to local storage and read it from there.

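    import PyPDF2

    # synfs:/<jobId>/<mount point> is the URI form of the mount used by mssparkutils.fs
    jobId = mssparkutils.env.getJobId()
    path = f"synfs:/{jobId}/TR/pdf/bob.pdf"

    # Copy the file to the local file system, then read the local copy
    mssparkutils.fs.cp(path, "file:/tmp/t_pdf/bob.pdf")
    pdf_reader = PyPDF2.PdfReader("/tmp/t_pdf/bob.pdf")
    number_of_pages = len(pdf_reader.pages)
    page = pdf_reader.pages[0]
    text = page.extract_text()
    print(text)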

    Here, the file is copied to the /tmp folder on the local file system and read from there.
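
The two snippets in this answer use two spellings of the same mounted location: a local file path (`/synfs/{jobId}/...`) for ordinary Python libraries, and a `synfs:/{jobId}/...` URI for `mssparkutils.fs`. A minimal sketch of a helper that builds both from the job ID is shown here, assuming the mount point `/TR` and the relative path `pdf/bob.pdf` used in this answer. The name `synfs_paths` is hypothetical and is not part of mssparkutils.

```python
import posixpath


def synfs_paths(job_id, mount_point, relative_path):
    """Build the local path and the synfs URI for a file under a Synapse mount.

    Hypothetical helper for illustration; it only formats strings.
    """
    # Normalise so "/TR", "TR" and "TR/" all produce the same result.
    tail = posixpath.join(mount_point.strip("/"), relative_path.lstrip("/"))
    local_path = f"/synfs/{job_id}/{tail}"  # for open(), PyPDF2 and similar
    synfs_uri = f"synfs:/{job_id}/{tail}"   # for mssparkutils.fs.cp and similar
    return local_path, synfs_uri


# The job ID "42" is a placeholder value for this example.
local_path, synfs_uri = synfs_paths("42", "/TR", "pdf/bob.pdf")
print(local_path)
print(synfs_uri)
```

In a notebook, `job_id` would come from `mssparkutils.env.getJobId()`, as in the code in this answer.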