marklogic, marklogic-9

XDMP-DOCUTF8SEQ when trying to store binary content in MarkLogic


My case is slightly different from the one mentioned in an earlier post: I am pushing content that I actually want to store as binary in MarkLogic. A trigger later processes the content of the file. The content in question is uploaded with a URI that ends in .txt.

Using the Java API, I have:

    BinaryDocumentManager docManager = binaryClient.newBinaryDocumentManager();
    BinaryWriteHandle handle = new BytesHandle(content).withFormat(Format.BINARY);

I hoped that would bypass the UTF-8 requirement. Is my assumption correct?

 Server Message: XDMP-DOCUTF8SEQ: Invalid UTF-8 escape sequence at  line 1 -- document is not UTF-8 encoded
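
The message indicates that the server tried to read the payload as UTF-8 text. A minimal sketch, using only the JDK, of checking client-side whether a byte array would survive strict UTF-8 decoding (the class and method names here are hypothetical, not part of the MarkLogic Java API):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for diagnosing the payload before upload.
final class Utf8Check {
    // Returns true only if the bytes decode as UTF-8 with no malformed sequences.
    static boolean isValidUtf8(byte[] content) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(content));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] text = "plain text".getBytes(StandardCharsets.UTF_8);
        // 0xFF can never appear in well-formed UTF-8.
        byte[] binary = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x01};
        System.out.println(isValidUtf8(text));
        System.out.println(isValidUtf8(binary));
    }
}
```

If the check fails for the uploaded bytes, the content cannot be stored as a text document as-is, which is consistent with the error above.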

Solution

  • The Java API goes through the REST API, and there is some auto-magic processing that occurs when you invoke /v1/documents PUT to insert a document.

    If the URI has a known file extension, the MIME type mappings are used to determine the format. A URI with a .txt file extension is therefore assumed to be a text document.

    If you use a URI that does not end with a .txt file extension, for instance one ending in .txt.bin, the document should be inserted as a binary() node.

    If you want to insert the document with a .txt file extension as a binary() node, then you will likely need to insert it differently.
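
The extension-based behavior described above can be modeled roughly as follows. This is a simplified sketch, not MarkLogic's implementation: the class, the method names, and the small mapping table are all hypothetical, and treating unmapped extensions as binary is an assumption. The real associations come from the server's MIME type configuration.

```java
import java.util.Locale;
import java.util.Map;

// Simplified, hypothetical model of extension-based format selection.
final class UriFormatSketch {
    // Illustrative mapping only; the actual table is defined on the server.
    private static final Map<String, String> FORMAT_BY_EXTENSION = Map.of(
        "txt", "text",
        "xml", "xml",
        "json", "json",
        "pdf", "binary",
        "bin", "binary");

    // Returns the format implied by the last extension of the URI.
    // Missing or unmapped extensions fall back to "binary" (an assumption).
    static String formatForUri(String uri) {
        int slash = uri.lastIndexOf('/');
        int dot = uri.lastIndexOf('.');
        if (dot <= slash || dot == uri.length() - 1) {
            return "binary";
        }
        String ext = uri.substring(dot + 1).toLowerCase(Locale.ROOT);
        return FORMAT_BY_EXTENSION.getOrDefault(ext, "binary");
    }

    // Appends ".bin" when the URI's extension would not resolve to binary.
    static String binarySafeUri(String uri) {
        return "binary".equals(formatForUri(uri)) ? uri : uri + ".bin";
    }

    public static void main(String[] args) {
        System.out.println(formatForUri("/uploads/report.txt"));
        System.out.println(binarySafeUri("/uploads/report.txt"));
        System.out.println(formatForUri("/uploads/report.txt.bin"));
    }
}
```

Only the last extension is considered in this model, which is why a URI ending in .txt.bin resolves to binary while one ending in .txt resolves to text.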

    General Content Type Guidelines

    The following guidelines apply to specifying input and output content type for most requests:

    • Document content: Rely on the MarkLogic Server MIME type mapping defined for the URI extension.
    • Non-document data: Set the request Content-type and/or Accept headers. In most cases, this means setting the header(s) to application/xml or application/json.

    The installation-wide MarkLogic Server MIME type mappings define associations between MIME type, URI extensions, and document format. For example, the default mappings associate the MIME type application/pdf, the 'pdf' URI extension, and the binary document format. You can view, change, and extend the mappings in the 'Mimetypes' section of the Admin Interface or using the XQuery functions admin:mimetypes-get and admin:mimetypes-add.