Tags: c#, artificial-intelligence, cloud-document-ai

Google Document AI (C#): "Unsupported input file format" with MIME type application/pdf


I am trying to upload a PDF to Google's Document AI service for processing, using the Google.Cloud.DocumentAI.V1 package for C#. I looked at the GitHub repository and the docs, but there is not much information. The PDF is on my local drive. I converted the PDF to a byte array, converted that to a ByteString, and set the request MIME type to "application/pdf", but the call returned this error:

Status(StatusCode="InvalidArgument", Detail="Unsupported input file format.", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1627582435.256000000","description":"Error received from peer ipv4:142.250.72.170:443","file":"......\src\core\lib\surface\call.cc","file_line":1067,"grpc_message":"Unsupported input file format.","grpc_status":3}")

Code:

try
{
    //Generate a document
    string pdfFilePath = "C:\\Users\\maponte\\Documents\\Projects\\SettonProjects\\OCRSTUFF\\DOC071621-0016.pdf";
    var bytes = Encoding.UTF8.GetBytes(pdfFilePath);


    ByteString content = ByteString.CopyFrom(bytes);

    // Create client
    DocumentProcessorServiceClient documentProcessorServiceClient = await DocumentProcessorServiceClient.CreateAsync();
    // Initialize request argument(s)
    ProcessRequest request = new ProcessRequest
    {
        ProcessorName = ProcessorName.FromProjectLocationProcessor("*****", "mycountry", "***"),
        SkipHumanReview = false,
        InlineDocument = new Document(),
        RawDocument = new RawDocument(),
    };
    
    request.RawDocument.MimeType = "application/pdf";
    request.RawDocument.Content = content;

    // Make the request
    ProcessResponse response = await documentProcessorServiceClient.ProcessDocumentAsync(request);

    Document docResponse = response.Document;

    Console.WriteLine(docResponse.Text);
   
}
catch(Exception ex)
{
    Console.WriteLine(ex.Message);
}

Solution

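  • This is the problem (or at least one problem): you aren't actually loading the file. Encoding.UTF8.GetBytes(pdfFilePath) encodes the characters of the path string itself, so the service receives the path text rather than PDF data: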

    string pdfFilePath = "C:\\Users\\maponte\\Documents\\Projects\\SettonProjects\\OCRSTUFF\\DOC071621-0016.pdf";
    var bytes = Encoding.UTF8.GetBytes(pdfFilePath);
    
    ByteString content = ByteString.CopyFrom(bytes);
    

    Instead, you want to read the file's contents:

    string pdfFilePath = "path-as-before";
    var bytes = File.ReadAllBytes(pdfFilePath);
    ByteString content = ByteString.CopyFrom(bytes);
    

    I'd also note that InlineDocument and RawDocument are alternatives to each other: setting either one clears the other, so only one of them is ever sent. Your request creation would be better written as:

    ProcessRequest request = new ProcessRequest
    {
        ProcessorName = ProcessorName.FromProjectLocationProcessor("*****", "mycountry", "***"),
        SkipHumanReview = false,
        RawDocument = new RawDocument
        {
            MimeType = "application/pdf",
            Content = content
        }
    };
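
To see the difference between the two calls without contacting the service, here is a minimal sketch that uses only the .NET base class library. The class name PdfBytesDemo, the helper LooksLikePdf, and the temporary file name are hypothetical and exist only for illustration. The stand-in file holds just a PDF-style header rather than a valid document, and the check assumes the "%PDF-" signature sits at the very start of the data, which is typical for PDF files.

```csharp
using System;
using System.IO;
using System.Text;

public static class PdfBytesDemo
{
    // Hypothetical helper: true when the buffer starts with the "%PDF-" signature.
    public static bool LooksLikePdf(byte[] data)
    {
        byte[] magic = Encoding.ASCII.GetBytes("%PDF-");
        if (data == null || data.Length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
        {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }

    public static void Main()
    {
        // Stand-in file for the demo: only a PDF-style header, not a valid PDF.
        string path = Path.Combine(Path.GetTempPath(), "pdf-bytes-demo.pdf");
        File.WriteAllBytes(path, Encoding.ASCII.GetBytes("%PDF-1.7\n"));

        // Encodes the characters of the path string (what the question's code does).
        byte[] pathBytes = Encoding.UTF8.GetBytes(path);
        // Reads the actual contents of the file (what the answer suggests).
        byte[] fileBytes = File.ReadAllBytes(path);

        Console.WriteLine($"Path bytes look like a PDF: {LooksLikePdf(pathBytes)}"); // False
        Console.WriteLine($"File bytes look like a PDF: {LooksLikePdf(fileBytes)}"); // True

        File.Delete(path);
    }
}
```

A check like this before building the request can catch the mistake locally, instead of waiting for the service to reject the payload.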