Tags: node.js, pdf, stream, ibm-cloud, ibm-watson

What causes an "Error in the web application" error when using Watson's document conversion service?


I am trying to convert some PDF files into answer units with Watson's Document Conversion service. These files are all zipped up into one big .zip file, which is uploaded to my Bluemix server running a Node.js application. The application unzips the files in memory and tries to send each one in turn to the conversion service:

var Readable = require('stream').Readable;  //needed for the in-memory stream below
var document_conversion = watson.document_conversion(dcCredentials);

function createCollection(res, solrClient, docs)
   {
   for (var doc in docs) //docs is an array of objects describing the pdf files
      {
      console.log("Converting: %s", docs[doc].filename);

      //make a stream of this pdf file
      var rs = new Readable;    //create the stream
      rs.push(docs[doc].data);  //add pdf file (string object) to stream
      rs.push(null);        //end of stream marker

      document_conversion.convert(
         {
         file: rs,
         conversion_target: "ANSWER_UNITS"
         }, 
         function (err, response) 
            {
            if (err) 
               {
               console.log("Error converting doc: ", err);
        .
        .
        .
        etc...

Every time, the conversion service returns error 400 with the description "Error in the web application".

After scratching my head for two days trying to figure out the cause of this rather unhelpful error message, I have pretty much decided that the problem must be that the conversion service can't figure out what type of file is being sent, since there's no filename associated with it. This of course is just a guess on my part, but I can't test this theory because I don't know how to provide that information to the service without actually writing each file to disk and reading it back.

Can anyone help?


Solution

  • Updated: The problem is in how the underlying form-data library handles streams: it doesn't calculate the length of streams (with the exception of file and request streams, which it has extra logic to handle).

    From the form-data documentation: "getLengthSync() method DOESN'T calculate length for streams, use knownLength options as workaround."

    I found two ways around this. Calculate the length yourself and pass it as an option:

    document_conversion.convert({
      file: { value: rs, options: { knownLength: 12345 } }
      ...
    

    Or use a Buffer:

    document_conversion.convert({
      file: { value: myBuffer, options: {} }
      ...
    

    You were getting a 400 response because the Content-Length header of your request was calculated incorrectly: the length was too small for the request, so the MIME part of the request was truncated (and never closed).

    I suspect this is due to the Readable stream not providing a length or size for your content when the request library calculates the size of the entity.

    Also, apologies for the useless error message. We'll make that better.
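
To make the first workaround concrete, here is a minimal sketch of how the byte length could be computed for an in-memory document before it is wrapped in a stream. The helper name buildFileParam is hypothetical (it is not part of the Watson SDK or form-data), and the sketch assumes the document arrives as either a Buffer or a string with a known encoding. It normalizes to a Buffer first so that knownLength matches the bytes the stream actually emits, since a string's character count can differ from its byte count.

```javascript
// Sketch only: buildFileParam is a hypothetical helper, not a Watson SDK function.
const { Readable } = require('stream');

function buildFileParam(data, encoding) {
  // Normalize to a Buffer so the byte count matches exactly what the stream emits.
  // The 'utf8' default is an assumption; pass the encoding your data really uses.
  const buf = Buffer.isBuffer(data) ? data : Buffer.from(data, encoding || 'utf8');

  const rs = new Readable({ read() {} }); // no-op read: all data is pushed up front
  rs.push(buf);   // the whole document as a single chunk
  rs.push(null);  // end-of-stream marker

  // Same shape as the `file` parameter shown in the answer above.
  return { value: rs, options: { knownLength: buf.length } };
}

// Character count and byte count differ for non-ASCII text.
const sample = 'caf\u00e9';
const param = buildFileParam(sample);
console.log(sample.length, param.options.knownLength); // 4 5
```

The returned object would then be passed as the file parameter to document_conversion.convert, as in the first workaround. That call is not exercised here because it needs service credentials.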