
Azure Form Recognizer mainline support for Office documents


I have been using the 2022-06-30-preview version of the API to OCR .docx and PowerPoint documents. Now that the API has been stabilized and has moved to 2022-08-31, I have updated my code to use the stable version (just a version bump of the SDK client), but the same documents are now rejected with an InvalidContent error: "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats."

Has support for Office documents been dropped, or is there a setting I need to add? I don't see any mention in the changelog that support was dropped between the last preview version and the stable one.

I'm using the Node.js SDK. I have checked that the same .docx document, with exactly the same code, is accepted by the @azure/ai-form-recognizer@4.0.0-beta.5 SDK client but not by the latest stable version, @azure/ai-form-recognizer@4.0.0. The code I'm using is almost exactly the example code from the quickstart; only the URLs differ.


Solution

    • According to the Microsoft documentation, support for Microsoft Office files has been dropped from all SDKs.

    • That leaves two options. Form Recognizer still supports Microsoft Office files through the REST API, so you can either make the HTTP calls yourself or convert the files to PDF and then use the regular SDK for further processing.

    • Below, the conversion is done with the docx-pdf npm package. I have a file hjh.docx, which I convert to pdfuploader.pdf and then process.

    const fs = require("fs");
    const docxConverter = require("docx-pdf");
    const { AzureKeyCredential, DocumentAnalysisClient } = require("@azure/ai-form-recognizer");

    const key = "";
    const endpoint = "";

    // docx-pdf is callback-based. Wrapping it in a Promise lets main() wait
    // for the PDF to be written before it is read and uploaded.
    function convertDocxToPdf(input, output) {
        return new Promise((resolve, reject) => {
            docxConverter(input, output, (err, result) => {
                if (err) {
                    reject(err);
                    return;
                }
                resolve(result);
            });
        });
    }

    async function main() {
        // Conversion logic
        const result = await convertDocxToPdf("./hjh.docx", "./pdfuploader.pdf");
        console.log("result" + result);

        // Form Recognizer logic
        const client = new DocumentAnalysisClient(endpoint, new AzureKeyCredential(key));
        // "<Path>" is the path of the PDF to analyze.
        const readStream = fs.createReadStream("<Path>");
        const poller = await client.beginAnalyzeDocument("prebuilt-document", readStream, {
            onProgress: ({ status }) => {
                console.log(`status: ${status}`);
            },
        });
        const e = await poller.pollUntilDone();
        console.log(e);
    }

    main().catch((error) => {
        console.error("An error occurred:", error);
        process.exit(1);
    });
    

    @azure/ai-form-recognizer output: (screenshot not included)

    @azure/ai-form-recognizer@4.0.0-beta.5 output: (screenshot not included)
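For the other option mentioned above, calling the REST API directly, here is a minimal sketch of what the HTTP calls could look like. It is not taken from the quickstart. The helper names `buildAnalyzeRequest` and `analyzeViaRest` are made up for illustration, and the `api-version` and model ID are passed in by the caller rather than being fixed here. It assumes Node.js 18+ for the built-in `fetch`. Whether a given API version actually accepts Office files over REST is the claim made in this answer and should be checked against the documentation for that version.

```javascript
// Sketch only: helper names are hypothetical; api-version and model ID are caller-supplied.
const DOCX_MIME =
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

// Build the URL and headers for an analyze request (pure, no network access).
function buildAnalyzeRequest({ endpoint, key, modelId, apiVersion, contentType }) {
  const base = endpoint.replace(/\/+$/, ""); // drop trailing slashes
  const url =
    `${base}/formrecognizer/documentModels/${encodeURIComponent(modelId)}:analyze` +
    `?api-version=${encodeURIComponent(apiVersion)}`;
  return {
    url,
    headers: {
      "Ocp-Apim-Subscription-Key": key,
      "Content-Type": contentType,
    },
  };
}

// Submit the document, then poll the Operation-Location URL until the
// operation succeeds or fails. fetchImpl is injectable so the flow can be
// exercised without a network connection.
async function analyzeViaRest(body, options, fetchImpl = fetch, pollMs = 1000) {
  const { url, headers } = buildAnalyzeRequest(options);
  const submit = await fetchImpl(url, { method: "POST", headers, body });
  if (submit.status !== 202) {
    throw new Error(`Analyze request failed with HTTP ${submit.status}`);
  }
  const operationUrl = submit.headers.get("operation-location");
  for (;;) {
    const res = await fetchImpl(operationUrl, {
      headers: { "Ocp-Apim-Subscription-Key": options.key },
    });
    const json = await res.json();
    if (json.status === "succeeded") return json.analyzeResult;
    if (json.status === "failed") throw new Error(JSON.stringify(json.error));
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}
```

A call would look something like `analyzeViaRest(fs.readFileSync("./hjh.docx"), { endpoint, key, modelId: "prebuilt-document", apiVersion: "2022-08-31", contentType: DOCX_MIME })`. If the service rejects the file here as well, converting to PDF first remains the fallback.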