azure-databricks, azure-cognitive-services

Databricks Azure Form Recognizer library for Python failing: Content type could not be auto-detected. Please pass the content_type keyword argument


I am following the guide at https://learn.microsoft.com/en-us/python/api/overview/azure/ai-formrecognizer-readme?view=azure-python to recognize content from Databricks.

The code that I'm using is:

from azure.ai.formrecognizer import FormRecognizerClient
from azure.core.credentials import AzureKeyCredential

endpoint = "https://<region>.api.cognitive.microsoft.com/"
credential = AzureKeyCredential("<api_key>")

form_recognizer_client = FormRecognizerClient(endpoint, credential)

with open("/dbfs/mnt/lake/RAW/export/sentimenttest.txt", "rb") as fd:
    form = fd.read()

poller = form_recognizer_client.begin_recognize_content(form)
form_pages = poller.result()

for content in form_pages:
    for table in content.tables:
        print("Table found on page {}:".format(table.page_number))
        print("Table location {}:".format(table.bounding_box))
        for cell in table.cells:
            print("Cell text: {}".format(cell.text))
            print("Location: {}".format(cell.bounding_box))
            print("Confidence score: {}\n".format(cell.confidence))

    if content.selection_marks:
        print("Selection marks found on page {}:".format(content.page_number))
        for selection_mark in content.selection_marks:
            print("Selection mark is '{}' within bounding box '{}' and has a confidence of {}".format(
                selection_mark.state,
                selection_mark.bounding_box,
                selection_mark.confidence
            ))

Note that the path I'm using is:

/dbfs/mnt/lake/RAW/export/sentimenttest.txt

When I execute the code I get the error:

ValueError: Content type could not be auto-detected. Please pass the content_type keyword argument.

Can someone let me know what I need to do to fix this?


Solution

  • Prerequisites

    • Python 2.7, or 3.5 or later, is required to use this package.

    • You must have an Azure subscription and a Cognitive Services or Form Recognizer resource to use this package.

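    The documentation for begin_recognize_content describes the method as follows:

    Extract text and content/layout information from a given document. The input document must be of one of the supported content types: 'application/pdf', 'image/jpeg', 'image/png', 'image/tiff' or 'image/bmp'.

    New in version v2.1: The pages, language and reading_order keyword arguments and support for image/bmp content.

    The file in the question, sentimenttest.txt, appears to be a plain-text file, which is not one of these content types, so the library cannot auto-detect a content type and raises the ValueError. Passing content_type for a text file is unlikely to help, because plain text is not in the supported list; send a PDF, JPEG, PNG, TIFF or BMP document instead.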

    Refer to the begin_recognize_content reference documentation for more information.
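
As a rough illustration of the supported-types rule above, the sketch below checks the leading bytes of a file before calling the service, so that an unsupported file such as a .txt fails with a clearer message. The helpers sniff_content_type and recognize_supported are hypothetical names introduced here; they are not part of azure-ai-formrecognizer, and the byte signatures are the commonly used ones for these formats rather than anything taken from the library. The only library call assumed is begin_recognize_content with the content_type keyword argument that the error message itself mentions.

```python
# Sketch only: `sniff_content_type` and `recognize_supported` are hypothetical
# helpers, not part of azure-ai-formrecognizer.

# Leading "magic" bytes for the five content types listed in the documentation.
_SIGNATURES = [
    (b"%PDF", "application/pdf"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG", "image/png"),
    (b"II*\x00", "image/tiff"),  # little-endian TIFF
    (b"MM\x00*", "image/tiff"),  # big-endian TIFF
    (b"BM", "image/bmp"),
]


def sniff_content_type(data):
    """Return the supported content type matching the leading bytes, or None."""
    for signature, content_type in _SIGNATURES:
        if data.startswith(signature):
            return content_type
    return None


def recognize_supported(client, data):
    """Start content recognition only if the bytes look like a supported format."""
    content_type = sniff_content_type(data)
    if content_type is None:
        raise ValueError(
            "Unsupported file: expected a PDF, JPEG, PNG, TIFF or BMP document"
        )
    # `content_type` is the keyword argument named in the original error message.
    return client.begin_recognize_content(data, content_type=content_type)
```

With the client from the question, this could be called as recognize_supported(form_recognizer_client, form), where form holds the bytes read from the file; the returned poller is then used exactly as before.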