I am reading a PDF file using Form Recognizer and storing the result in a "result" variable/object. Following the syntax given for Azure Databricks/PySpark in the Form Recognizer documentation, my output comes out as shown below. Instead, I need to put the output into a DataFrame, with each table going into a separate DataFrame. Please suggest the syntax. Thanks in advance.
with open(formUrl, "rb") as f:
    poller = document_analysis_client.begin_analyze_document("prebuilt-layout", document=f)
result = poller.result()

for table_idx, table in enumerate(result.tables):
    print(
        "Table # {} has {} rows and {} columns".format(
            table_idx, table.row_count, table.column_count
        )
    )
    for cell in table.cells:
        print(
            "...Cell[{}][{}] has content '{}'".format(
                cell.row_index,
                cell.column_index,
                cell.content.encode("utf-8"),
            )
        )
Output
Table # 0 has 3 rows and 7 columns
...Cell[0][0] has content 'b'BIOMARKER''
...Cell[0][1] has content 'b'METHOD|''
...Cell[0][2] has content 'b'ANALYTE''
...Cell[0][3] has content 'b'RESULT''
...Cell[0][4] has content 'b'THERAPY ASSOCIATION''
...Cell[0][6] has content 'b'BIOMARKER LEVELE''
...Cell[1][0] has content 'b'''
...Cell[1][1] has content 'b'IHC''
...Cell[1][2] has content 'b'Protein''
...Cell[1][3] has content 'b'Negative | 0''
...Cell[1][4] has content 'b'LACK OF BENEFIT''
...Cell[1][5] has content 'b'alectinib, brigatinib''
...Cell[1][6] has content 'b'Level 1''
...Cell[2][0] has content 'b'ALK''
...Cell[2][1] has content 'b'Seq''
...Cell[2][2] has content 'b'RNA-Tumor''
...Cell[2][3] has content 'b'Fusion Not Detected''
...Cell[2][5] has content 'b'ceritinib''
...Cell[2][6] has content 'b'Level 1''
...Cell[3][1] has content 'b'''
...Cell[3][2] has content 'b'''
...Cell[3][3] has content 'b'''
...Cell[3][5] has content 'b'crizotinib''
...Cell[3][6] has content 'b'Level 1''
Table # 1 has 3 rows and 4 columns
...Cell[0][0] has content 'b'''
...Cell[0][1] has content 'b'''
...Cell[0][2] has content 'b'''
...Cell[0][3] has content 'b'''
...Cell[1][0] has content 'b'NTRK1/2/3''
...Cell[1][1] has content 'b'Seq''
...Cell[1][2] has content 'b'RNA-Tumor''
...Cell[1][3] has content 'b'Fusion Not Detected''
...Cell[2][0] has content 'b'Tumor Mutational Burden''
...Cell[2][1] has content 'b'Seq''
...Cell[2][2] has content 'b'DNA-Tumor''
...Cell[2][3] has content 'b'High | 19 Mutations/ Mb''
I tried to read a PDF document using Azure Form Recognizer and used Azure Databricks to convert the results to a DataFrame. The detailed steps are as follows:
-> Log in to the subscribed Azure account in Form Recognizer Studio - Microsoft Azure and select Layout under Document Analysis.
-> Browse to the required invoice PDF file and click Analyze; the recognizer then analyzes the document.
-> Create a storage account and two containers: input (for storing the input invoice) and freg (for storing the output CSV).
-> Create an Azure Databricks notebook.
1. Package Installation
%pip install azure-storage-blob
%pip install azure-ai-formrecognizer
2. Connect to Azure Storage Container
from azure.storage.blob import ContainerClient

# Container that holds the input PDFs; the URL must be publicly readable or carry a SAS token
container_url = "https://formrecognizerdemo070621.blob.core.windows.net/pdf-raw"
container = ContainerClient.from_container_url(container_url)
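As a quick optional check (assuming the container allows anonymous listing or the URL carries a SAS token), the connection can be verified by listing the blobs; the same names are reused in step 4:

# Optional: list the PDFs in the container to confirm the connection works
for blob in container.list_blobs():
    print(blob.name)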
3. Enable Cognitive Services
In place of cognitiveServicesEndpoint and cognitiveServicesKey in the code below, provide the endpoint and key of the Form Recognizer resource created earlier (stored here as Databricks secrets).
from azure.ai.formrecognizer import FormRecognizerClient
from azure.core.credentials import AzureKeyCredential

# The Form Recognizer endpoint and key are stored as Databricks secrets in the "formrec" scope
endpoint = dbutils.secrets.get(scope="formrec", key="cognitiveServicesEndpoint")
key = dbutils.secrets.get(scope="formrec", key="cognitiveServicesKey")

form_recognizer_client = FormRecognizerClient(endpoint=endpoint, credential=AzureKeyCredential(key))
4. Send files to Cognitive Services and convert to a DataFrame
import pandas as pd

field_list = ["InvoiceId", "VendorName", "VendorAddress", "CustomerName", "CustomerAddress", "CustomerAddressRecipient", "InvoiceDate", "InvoiceTotal", "DueDate"]
df = pd.DataFrame(columns=field_list)

for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)
    invoices = poller.result()
    print("Scanning " + blob.name + "...")
    for idx, invoice in enumerate(invoices):
        # Collect the requested fields for this invoice into a one-row DataFrame
        single_df = pd.DataFrame(columns=field_list)
        for field in field_list:
            entry = invoice.fields.get(field)
            if entry:
                single_df[field] = [entry.value]
        single_df['FileName'] = blob.name
        # DataFrame.append was removed in pandas 2.x; pd.concat does the same job
        df = pd.concat([df, single_df])

df = df.reset_index(drop=True)
df
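Since this runs on Azure Databricks, the pandas DataFrame can also be converted to a Spark DataFrame when PySpark operations are needed downstream. A minimal sketch, assuming the spark session that Databricks provides; the cast to string is only a precaution against type-inference issues with the SDK's value objects:

# Convert the pandas result into a Spark DataFrame (Databricks supplies `spark`)
spark_df = spark.createDataFrame(df.astype(str))
display(spark_df)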
5. Upload the DataFrame of results to Azure
Due to security concerns I have used placeholder names in place of the actual names: StorageAccountName = the storage account created initially, OutputContainer = the container created for storing the output file, formreckey = the key of the Form Recognizer resource.
account_name = "StorageAccountName"
account_key = "fs.azure.account.key." + account_name + ".blob.core.windows.net"

# Mount the output container so it is reachable through /dbfs
dbutils.fs.mount(
    source = "wasbs://OutputContainer@StorageAccountName.blob.core.windows.net",
    mount_point = "/mnt/OutputContainer",
    extra_configs = {account_key: dbutils.secrets.get(scope = "formrec", key = "formreckey")})

df.to_csv(r"/dbfs/mnt/OutputContainer/output.csv", index=False)