azure-cognitive-services

Microsoft Computer Vision OCR Read API charged as S3 transaction instead of S2


I am using the Microsoft Computer Vision API for OCR processing, and I noticed on my bill that my calls are being charged as S3 transactions instead of S2.

I'm using the .NET SDK, specifically the ReadAsync method: https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.cognitiveservices.vision.computervision.computervisionclientextensions.readasync?view=azure-dotnet

I have also confirmed that the REST API the SDK actually calls is POST /vision/v3.2/read/analyze: https://centraluseuap.dev.cognitive.microsoft.com/docs/services/computer-vision-v3-2/operations/5d986960601faab4bf452005

According to the documentation, that should be the OCR Read API. Am I correct? https://learn.microsoft.com/en-us/azure/cognitive-services/computer-vision/vision-api-how-to-topics/call-read-api

I am puzzled as to why my calls are getting charged as S3 instead of S2. This is important for me because S3 is 50% more expensive than S2. Using the Pricing Calculator, 1000 S2 transactions is $1, whereas 1000 S3 transactions is $1.5. https://azure.microsoft.com/en-us/pricing/calculator/?service=cognitive-services
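To show how the gap grows with volume, here is a minimal sketch of the arithmetic. It uses only the two calculator rates quoted above and assumes a flat per-1,000 rate with no volume tiers; the function name and the 250,000-call volume are hypothetical, not figures from my bill.

```python
from decimal import Decimal

# Rates in USD per 1,000 transactions, as quoted from the Pricing Calculator.
# Assumption: a flat rate, ignoring any volume-based tiers.
RATE_PER_1000 = {"S2": Decimal("1.00"), "S3": Decimal("1.50")}


def monthly_cost(transactions, tier):
    """Return the cost in USD for `transactions` calls billed at `tier`."""
    return RATE_PER_1000[tier] * Decimal(transactions) / Decimal(1000)


calls = 250_000  # hypothetical monthly volume
s2 = monthly_cost(calls, "S2")
s3 = monthly_cost(calls, "S3")
extra_percent = (s3 / s2 - 1) * 100
print(f"S2: ${s2}  S3: ${s3}  extra: ${s3 - s2} ({extra_percent:.0f}% more)")
```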

What's the difference between OCR and "Describe and Recognize Text" anyway? OCR (Optical Character Recognition) by definition must recognize text. I am calling the Read API without any of the optional parameters, so I did not ask for "Describe"; hence I think the call should be billed as an S2 feature rather than an S3 feature.

I already posted this question on Microsoft Q&A, but I thought SO might get more traffic and so help me get an answer faster. https://learn.microsoft.com/en-us/answers/questions/689767/computer-vision-api-charged-as-s3-transaction-inst.html


Solution

  • To understand this, you need a bit of history about these services. The Computer Vision API (and all the SDKs that call it, whether C#/.NET, Java, Python, etc.) has changed frequently, and it is sometimes hard to tell which SDK version calls which version of the API.

    API operations history

    For optical character recognition, the available operations have changed from version to version:

    Computer Vision 1.0

    This version contained:

    • OCR operation, a synchronous operation to recognize printed text
    • Recognize Handwritten Text operation, an asynchronous operation for handwritten text (with "Get Handwritten Text Operation Result" operation to collect the result once completed)

    Computer Vision 2.0

    OCR was still there, but "Recognize Handwritten Text" was replaced. The operations were:

    • OCR operation, a synchronous operation to recognize printed text
    • Recognize Text operation, asynchronous (+ Get Recognize Text Operation Result to collect the result), accepting both printed or handwritten text (see mode input parameter)
    • Batch Read File operation, asynchronous (+ "Get Read Operation Result" to collect the result), which also processed PDF files, whereas the other two accepted only images. It was intended "for text-heavy documents"

    Computer Vision 2.1 was similar in terms of operations.

    Computer Vision 3.0

    Main changes: Recognize Text and Batch Read File were "unified" into a single Read operation, with model improvements. For example, there is no longer any need to specify handwritten or printed text, as the "Upgrade from 2.0 to 3.0" documentation notes:

    "The Read API is optimized for text-heavy images and multi-page, mixed language, and mixed type (print – seven languages and handwritten – English only) documents"


    So there were:

    • OCR operation, a synchronous operation to recognize printed text
    • Read operation, asynchronous (+ Get Read Result to collect the result), accepting both printed or handwritten text, images and PDF inputs.

    The same applies to Computer Vision v3.1-preview.1, v3.1-preview.2, v3.1, v3.2-preview.1, v3.2-preview.2 and v3.2-preview.3.
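
The asynchronous pattern of the Read operation (submit, then collect the result) can be sketched at the REST level as follows. This is a minimal illustration, not the SDK's implementation: the POST path is the one quoted in the question, while the helper names, the polling interval and the status values ("notStarted", "running", "succeeded", "failed") are stated here as assumptions about the v3.2 service. In a real call, the URL to poll would come from the Operation-Location header of the POST response.

```python
import json
import time
import urllib.request


def build_read_request(endpoint, key, image_url):
    """Build (but do not send) the POST that starts a Read operation."""
    return urllib.request.Request(
        url=f"{endpoint.rstrip('/')}/vision/v3.2/read/analyze",
        data=json.dumps({"url": image_url}).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def wait_for_read_result(fetch_status, operation_url, interval=1.0, max_polls=30):
    """Poll until the operation reaches a final state, then return its body.

    `fetch_status` is a hypothetical callable that GETs `operation_url`
    and returns the parsed JSON body as a dict.
    """
    for _ in range(max_polls):
        body = fetch_status(operation_url)
        if body.get("status") in ("succeeded", "failed"):
            return body
        time.sleep(interval)
    raise TimeoutError("Read operation did not finish in time")


# Offline demo: a fake fetcher stands in for the real GET, so nothing is sent.
_fake_bodies = iter(
    [{"status": "notStarted"}, {"status": "running"}, {"status": "succeeded"}]
)
final = wait_for_read_result(
    lambda url: next(_fake_bodies), "https://example.invalid/operation", interval=0
)
print(final["status"])
```

Passing the fetcher in as a parameter keeps the polling logic testable without a key or a network connection.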

    SDKs

    All recent versions of the SDKs that implement a Read method call this 3.x operation. For example, according to the changelog of the .NET SDK, v7.0.x of the SDK "supports v3.2 Cognitive Services Computer Vision API endpoints."

    Conclusion

    It is normal that you are billed at the S3 rate for Read. The calculator is misleading, though: its "Recognize Text" label should be changed to "Read".

    If you really want to use the OCR operation, use the SDK's RecognizePrintedTextAsync method, which is the one that calls it.
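
For comparison with Read, the synchronous OCR operation can be sketched at the REST level as below. This is only an illustration under stated assumptions: the /vision/v3.2/ocr path, the language and detectOrientation query parameters, and the regions/lines/words response shape are given from memory of the REST reference rather than taken from this thread, and the helper names are hypothetical. Unlike Read, the text comes back directly in the response, with no polling step.

```python
import json
import urllib.parse
import urllib.request


def build_ocr_request(endpoint, key, image_url, language="unk"):
    """Build (but do not send) a synchronous OCR request."""
    query = urllib.parse.urlencode(
        {"language": language, "detectOrientation": "true"}
    )
    return urllib.request.Request(
        url=f"{endpoint.rstrip('/')}/vision/v3.2/ocr?{query}",
        data=json.dumps({"url": image_url}).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def ocr_lines(result):
    """Flatten an OCR response (regions -> lines -> words) into text lines."""
    return [
        " ".join(word["text"] for word in line.get("words", []))
        for region in result.get("regions", [])
        for line in region.get("lines", [])
    ]


# Offline demo with a hand-written response in the assumed shape.
sample = {
    "regions": [
        {"lines": [{"words": [{"text": "Hello"}, {"text": "world"}]}]}
    ]
}
print(ocr_lines(sample))
```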

    OCR is an old model, used only for printed text; the Read operation uses the latest model. I can also confirm (based on a few tests that I made) that OCR performs worse than the Read operation. If you want to test quickly, you can use your key on this website: it is an open-source portal created by another Microsoft MVP, to which I also contributed. It shows the results of both the OCR and Read operations. It currently uses version 6.0.0 of the Computer Vision SDK (see source).

    Sample: OCR result (image: OCR output); Read result (image: Read output).