I am reading a PDF file using Form Recognizer and storing the result in a "result" variable/object. Following the syntax given for Azure Databricks/PySpark in the Form Recognizer documentation, my output comes out as shown below. Instead, I need to put the output into a DataFrame, with each table going into a separate DataFrame. Please suggest the syntax. Thanks in advance.
with open(formUrl, "rb") as f:
    poller = document_analysis_client.begin_analyze_document("prebuilt-layout", document=f)
result = poller.result()

for table_idx, table in enumerate(result.tables):
    print(
        "Table # {} has {} rows and {} columns".format(
            table_idx, table.row_count, table.column_count
        )
    )
    for cell in table.cells:
        print(
            "...Cell[{}][{}] has content '{}'".format(
                cell.row_index,
                cell.column_index,
                cell.content.encode("utf-8"),
            )
        )
Output
Table # 0 has 3 rows and 7 columns
...Cell[0][0] has content 'b'BIOMARKER''
...Cell[0][1] has content 'b'METHOD|''
...Cell[0][2] has content 'b'ANALYTE''
...Cell[0][3] has content 'b'RESULT''
...Cell[0][4] has content 'b'THERAPY ASSOCIATION''
...Cell[0][6] has content 'b'BIOMARKER LEVELE''
...Cell[1][0] has content 'b'''
...Cell[1][1] has content 'b'IHC''
...Cell[1][2] has content 'b'Protein''
...Cell[1][3] has content 'b'Negative | 0''
...Cell[1][4] has content 'b'LACK OF BENEFIT''
...Cell[1][5] has content 'b'alectinib, brigatinib''
...Cell[1][6] has content 'b'Level 1''
...Cell[2][0] has content 'b'ALK''
...Cell[2][1] has content 'b'Seq''
...Cell[2][2] has content 'b'RNA-Tumor''
...Cell[2][3] has content 'b'Fusion Not Detected''
...Cell[2][5] has content 'b'ceritinib''
...Cell[2][6] has content 'b'Level 1''
...Cell[3][1] has content 'b'''
...Cell[3][2] has content 'b'''
...Cell[3][3] has content 'b'''
...Cell[3][5] has content 'b'crizotinib''
...Cell[3][6] has content 'b'Level 1''
Table # 1 has 3 rows and 4 columns
...Cell[0][0] has content 'b'''
...Cell[0][1] has content 'b'''
...Cell[0][2] has content 'b'''
...Cell[0][3] has content 'b'''
...Cell[1][0] has content 'b'NTRK1/2/3''
...Cell[1][1] has content 'b'Seq''
...Cell[1][2] has content 'b'RNA-Tumor''
...Cell[1][3] has content 'b'Fusion Not Detected''
...Cell[2][0] has content 'b'Tumor Mutational Burden''
...Cell[2][1] has content 'b'Seq''
...Cell[2][2] has content 'b'DNA-Tumor''
...Cell[2][3] has content 'b'High | 19 Mutations/ Mb''
I tried to read a PDF document using Azure Form Recognizer and used Azure Databricks to convert the results to a DataFrame. The detailed steps are as follows:
-> Log in to the subscribed Azure account in Form Recognizer Studio - Microsoft Azure and select Layout under Document Analysis.
-> Browse to the required invoice PDF file and click Analyze; the recognizer then analyzes the document.
-> Create a storage account and two containers: input (for storing the input invoice) and freg (for storing the output CSV).
-> Create an Azure Databricks notebook.
1. Package Installation
%pip install azure-storage-blob
%pip install azure-ai-formrecognizer
2. Connect to Azure Storage Container
from azure.storage.blob import ContainerClient

# Container that holds the input PDFs; the URL must be publicly readable or carry a SAS token
container_url = "https://formrecognizerdemo070621.blob.core.windows.net/pdf-raw"
container = ContainerClient.from_container_url(container_url)
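As a quick optional check (assuming the container allows anonymous listing or the URL carries a SAS token), the connection can be verified by listing the blobs; the same names are reused in step 4:

# Optional: list the PDFs in the container to confirm the connection works
for blob in container.list_blobs():
    print(blob.name)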
3. Enable Cognitive Services
In place of cognitiveServicesEndpoint and cognitiveServicesKey in the code below, provide the endpoint and key of the Form Recognizer resource created earlier (stored here as Databricks secrets).
from azure.ai.formrecognizer import FormRecognizerClient
from azure.core.credentials import AzureKeyCredential

# The Form Recognizer endpoint and key are stored as Databricks secrets in the "formrec" scope
endpoint = dbutils.secrets.get(scope="formrec", key="cognitiveServicesEndpoint")
key = dbutils.secrets.get(scope="formrec", key="cognitiveServicesKey")

form_recognizer_client = FormRecognizerClient(endpoint=endpoint, credential=AzureKeyCredential(key))
4. Send files to Cognitive Services and convert to a DataFrame
import pandas as pd

field_list = ["InvoiceId", "VendorName", "VendorAddress", "CustomerName", "CustomerAddress", "CustomerAddressRecipient", "InvoiceDate", "InvoiceTotal", "DueDate"]
df = pd.DataFrame(columns=field_list)

for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)
    invoices = poller.result()
    print("Scanning " + blob.name + "...")
    for idx, invoice in enumerate(invoices):
        # Collect the requested fields for this invoice into a one-row DataFrame
        single_df = pd.DataFrame(columns=field_list)
        for field in field_list:
            entry = invoice.fields.get(field)
            if entry:
                single_df[field] = [entry.value]
        single_df['FileName'] = blob.name
        # DataFrame.append was removed in pandas 2.x; pd.concat does the same job
        df = pd.concat([df, single_df])

df = df.reset_index(drop=True)
df
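Since this runs on Azure Databricks, the pandas DataFrame can also be converted to a Spark DataFrame when PySpark operations are needed downstream. A minimal sketch, assuming the spark session that Databricks provides; the cast to string is only a precaution against type-inference issues with the SDK's value objects:

# Convert the pandas result into a Spark DataFrame (Databricks supplies `spark`)
spark_df = spark.createDataFrame(df.astype(str))
display(spark_df)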
5. Upload the DataFrame of results to Azure
Due to security concerns I have used placeholder names in place of the actual names: StorageAccountName = the storage account created initially, OutputContainer = the container created for storing the output file, formreckey = the key of the Form Recognizer resource.
account_name = "StorageAccountName"
account_key = "fs.azure.account.key." + account_name + ".blob.core.windows.net"

# Mount the output container so it is reachable through /dbfs
dbutils.fs.mount(
    source = "wasbs://OutputContainer@StorageAccountName.blob.core.windows.net",
    mount_point = "/mnt/OutputContainer",
    extra_configs = {account_key: dbutils.secrets.get(scope = "formrec", key = "formreckey")})

df.to_csv(r"/dbfs/mnt/OutputContainer/output.csv", index=False)