
Load Form recognizer data into a dataframe


I am reading a PDF file with Form Recognizer and storing the response in a "result" object. Following the syntax given in the Form Recognizer documentation for Azure Databricks/PySpark, my output is printed as shown below. Instead, I need to load the output into dataframes, with each table in a separate dataframe. Please suggest the syntax. Thanks in advance.

with open(formUrl, "rb") as f:
    poller = document_analysis_client.begin_analyze_document("prebuilt-layout", document=f)
    result = poller.result()

for table_idx, table in enumerate(result.tables):
    print(
        "Table # {} has {} rows and {} columns".format(
            table_idx, table.row_count, table.column_count
        )
    )

    for cell in table.cells:
        print(
            "...Cell[{}][{}] has content '{}'".format(
                cell.row_index,
                cell.column_index,
                cell.content.encode("utf-8"),
            )
        )

Output

 Table # 0 has 3 rows and 7 columns
     ...Cell[0][0] has content 'b'BIOMARKER''
     ...Cell[0][1] has content 'b'METHOD|''
     ...Cell[0][2] has content 'b'ANALYTE''
     ...Cell[0][3] has content 'b'RESULT''
     ...Cell[0][4] has content 'b'THERAPY ASSOCIATION''
     ...Cell[0][6] has content 'b'BIOMARKER LEVELE''
     ...Cell[1][0] has content 'b'''
     ...Cell[1][1] has content 'b'IHC''
     ...Cell[1][2] has content 'b'Protein''
     ...Cell[1][3] has content 'b'Negative | 0''
     ...Cell[1][4] has content 'b'LACK OF BENEFIT''
     ...Cell[1][5] has content 'b'alectinib, brigatinib''
     ...Cell[1][6] has content 'b'Level 1''
     ...Cell[2][0] has content 'b'ALK''
     ...Cell[2][1] has content 'b'Seq''
     ...Cell[2][2] has content 'b'RNA-Tumor''
     ...Cell[2][3] has content 'b'Fusion Not Detected''
     ...Cell[2][5] has content 'b'ceritinib''
     ...Cell[2][6] has content 'b'Level 1''
     ...Cell[3][1] has content 'b'''
     ...Cell[3][2] has content 'b'''
     ...Cell[3][3] has content 'b'''
     ...Cell[3][5] has content 'b'crizotinib''
     ...Cell[3][6] has content 'b'Level 1''
 Table # 1 has 3 rows and 4 columns
     ...Cell[0][0] has content 'b'''
     ...Cell[0][1] has content 'b'''
     ...Cell[0][2] has content 'b'''
     ...Cell[0][3] has content 'b'''
     ...Cell[1][0] has content 'b'NTRK1/2/3''
     ...Cell[1][1] has content 'b'Seq''
     ...Cell[1][2] has content 'b'RNA-Tumor''
     ...Cell[1][3] has content 'b'Fusion Not Detected''
     ...Cell[2][0] has content 'b'Tumor Mutational Burden''
     ...Cell[2][1] has content 'b'Seq''
     ...Cell[2][2] has content 'b'DNA-Tumor''
     ...Cell[2][3] has content 'b'High | 19 Mutations/ Mb''

Solution

  • I read a PDF document with Azure Form Recognizer and used Azure Databricks to convert the result to a dataframe. The detailed steps follow.

    ->Log in to Form Recognizer Studio with your Azure subscription account and select Layout under Document analysis

    ->Browse to the required invoice PDF file and click Analyze to run the analysis

    ->I created a storage account with two containers: input (holding the input invoice) and freg (holding the output CSV)

    ->Create Azure databricks notebook

    1. Package Installation

    %pip install azure-storage-blob
    %pip install azure-ai-formrecognizer
    

    2. Connect to Azure Storage Container

    from azure.storage.blob import ContainerClient
    
    container_url = "https://formrecognizerdemo070621.blob.core.windows.net/pdf-raw"
    container = ContainerClient.from_container_url(container_url)
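
Step 4 builds each blob URL by joining container_url and the blob name with a slash. Blob names that contain spaces or other reserved characters would need percent-encoding first. A minimal sketch using only the standard library follows; the helper name blob_url and the example blob name are hypothetical, and it assumes the container URL carries no SAS query string.

```python
from urllib.parse import quote


def blob_url(container_url, blob_name):
    """Join a container URL and a blob name, percent-encoding the name."""
    # safe="/" keeps the slashes that act as virtual directory separators.
    return container_url.rstrip("/") + "/" + quote(blob_name, safe="/")


# Hypothetical blob name containing a space.
example = blob_url(
    "https://formrecognizerdemo070621.blob.core.windows.net/pdf-raw",
    "2021/invoice 01.pdf",
)
print(example)
```

In the loop of step 4, blob_url(container_url, blob.name) could replace the plain string concatenation.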
    

    3. Enable Cognitive Services

    The cognitiveServicesEndpoint and cognitiveServicesKey secrets in the code hold the endpoint and key of the Form Recognizer resource that was created

    import requests
    from azure.ai.formrecognizer import FormRecognizerClient
    from azure.core.credentials import AzureKeyCredential
    
    endpoint = dbutils.secrets.get(scope="formrec",key="cognitiveServicesEndpoint")
    key = dbutils.secrets.get(scope="formrec",key="cognitiveServicesKey")
    
    form_recognizer_client = FormRecognizerClient(endpoint=endpoint, credential=AzureKeyCredential(key))
    

    4. Send files to Cognitive Services and convert the results to a dataframe

    import pandas as pd
    
    field_list = ["InvoiceId", "VendorName", "VendorAddress", "CustomerName", "CustomerAddress", "CustomerAddressRecipient", "InvoiceDate", "InvoiceTotal", "DueDate"]
    df = pd.DataFrame(columns=field_list)
    
    for blob in container.list_blobs():
      blob_url = container_url + "/" + blob.name
      poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)
      invoices = poller.result()
      print("Scanning " + blob.name + "...")
      
      for idx, invoice in enumerate(invoices):
          single_df = pd.DataFrame(columns=field_list)
    
          for field in field_list:
            entry = invoice.fields.get(field)
            
            if entry:
              single_df[field] = [entry.value]
              
          single_df['FileName'] = blob.name
          # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
          df = pd.concat([df, single_df])
    
    df = df.reset_index(drop=True)
    df
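
The steps above extract invoice fields, whereas the question asks for one dataframe per table in result.tables. A minimal sketch of that conversion follows, assuming each table exposes row_count, column_count and cells, and each cell exposes row_index, column_index and content, as in the question's loop. The helper names table_to_dataframe and tables_to_dataframes are hypothetical, and the demo_tables objects are stand-ins so the sketch runs without calling the service.

```python
from types import SimpleNamespace

import pandas as pd


def table_to_dataframe(table):
    """Arrange one table's cells into a pandas DataFrame grid."""
    # Size the grid from the reported counts and from the largest index seen,
    # in case the two disagree.
    n_rows = max([table.row_count] + [c.row_index + 1 for c in table.cells])
    n_cols = max([table.column_count] + [c.column_index + 1 for c in table.cells])
    grid = [[None] * n_cols for _ in range(n_rows)]
    for cell in table.cells:
        grid[cell.row_index][cell.column_index] = cell.content
    return pd.DataFrame(grid)


def tables_to_dataframes(tables):
    """Return one DataFrame per table, in the order they were reported."""
    return [table_to_dataframe(t) for t in tables]


# Stand-in for result.tables (an assumption made for illustration only).
def _cell(row, col, text):
    return SimpleNamespace(row_index=row, column_index=col, content=text)


demo_tables = [
    SimpleNamespace(row_count=2, column_count=2, cells=[
        _cell(0, 0, "BIOMARKER"), _cell(0, 1, "METHOD"),
        _cell(1, 0, "ALK"), _cell(1, 1, "Seq"),
    ]),
    SimpleNamespace(row_count=2, column_count=2, cells=[
        _cell(1, 0, "NTRK1/2/3"), _cell(1, 1, "Seq"),
    ]),
]
dataframes = tables_to_dataframes(demo_tables)
```

With a real response, the call would be tables_to_dataframes(result.tables). The grid is left without a header row, since the sample output shows header rows with empty or missing cells; promoting the first row to column names is a separate choice. In a Databricks notebook, spark.createDataFrame(dataframes[0]) should then give a Spark dataframe, although a column that is entirely empty may need an explicit schema.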
    

    5. Upload dataframe of results to Azure

    For security reasons, I have replaced the actual names with placeholders: StorageAccountName = the name of the storage account created initially, OutputContainer = the container created for storing the output file, formreckey = the secret that holds the key (for this mount configuration, that needs to be the storage account access key)

    account_name = "StorageAccountName"
    # Name of the Spark config entry that carries the storage account key
    account_key = "fs.azure.account.key." + account_name + ".blob.core.windows.net"

    dbutils.fs.mount(
        source="wasbs://OutputContainer@" + account_name + ".blob.core.windows.net",
        mount_point="/mnt/OutputContainer",
        extra_configs={account_key: dbutils.secrets.get(scope="formrec", key="formreckey")},
    )
    
    
    df.to_csv(r"/dbfs/mnt/OutputContainer/output.csv", index=False)
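
If several dataframes need saving, for example one per table as the question asks, each can be written to its own CSV file. The sketch below is one way to do that under stated assumptions: the helper name save_tables_as_csv and the table_<n>.csv naming are hypothetical, and the demo writes to a temporary directory rather than the mounted container.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_tables_as_csv(dataframes, output_dir, prefix="table"):
    """Write each DataFrame to <output_dir>/<prefix>_<n>.csv and return the paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for idx, df in enumerate(dataframes):
        path = out / f"{prefix}_{idx}.csv"
        df.to_csv(path, index=False)
        paths.append(path)
    return paths


# Demo with a temporary directory; on Databricks the target could instead be
# the mounted path used above, such as "/dbfs/mnt/OutputContainer".
demo = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"b": ["x"]})]
with tempfile.TemporaryDirectory() as tmp:
    written = save_tables_as_csv(demo, tmp)
    names = [p.name for p in written]
    round_trip = pd.read_csv(written[1])
```

Keeping one file per table avoids forcing tables with different column counts into a single CSV.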
    
