Getting a strange error from Watson's Document Conversion service


I am trying to convert some documents into answer units with Watson's Document Conversion service, using the watson-developer-cloud JavaScript library in Node.js. Some of them (one example, a .DOCX file, is at IBM internal link) return this error:

Error: code:400 error: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)

If I try to convert it via the document conversion demo site, it seems to convert without error. My program downloads the file from the source, writes it to disk, and then uploads it to the Document Conversion service via the above-mentioned library.

Is there any way around this error? Consider that this conversion is part of a massive automated conversion of thousands of documents, so manual handling for these outliers is out of the question.


Solution

  • The service attempts to autodetect the media type of the uploaded file from the first few bytes of the file and from the file name.

    If the file name is unavailable (for example, it is not passed along with the upload), you can state the media type explicitly in the file portion of the convert call:

    file: {
        value: fs.createReadStream('filename'),
        options: {
          // Media type for .docx (Office Open XML); note the hyphen after "openxmlformats"
          contentType: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
        }
    }
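
For a batch job covering thousands of files, the explicit media type could be picked per file from its extension rather than hard-coded. The following is only a minimal sketch: `MEDIA_TYPES`, `mediaTypeFor`, and `buildFilePart` are hypothetical helper names, not part of the watson-developer-cloud library, and it assumes each downloaded file keeps an extension that matches its real format.

```javascript
const fs = require('fs');
const path = require('path');

// Standard media types for a few common input formats.
// Extend this table for whatever formats your corpus contains.
const MEDIA_TYPES = {
  '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  '.doc': 'application/msword',
  '.pdf': 'application/pdf',
  '.html': 'text/html',
  '.htm': 'text/html'
};

// Hypothetical helper: look up the media type by file extension,
// case-insensitively. Returns undefined for unknown extensions.
function mediaTypeFor(filename) {
  return MEDIA_TYPES[path.extname(filename).toLowerCase()];
}

// Hypothetical helper: build the `file` portion of the convert call.
// When the extension is unknown, no contentType is set, so the
// service falls back to its own autodetection.
function buildFilePart(filePath) {
  const part = { value: fs.createReadStream(filePath) };
  const contentType = mediaTypeFor(filePath);
  if (contentType) {
    part.options = { contentType: contentType };
  }
  return part;
}
```

The result of `buildFilePart('/path/to/file.docx')` would then be passed as the `file` parameter in place of the literal object shown above.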