
MarkLogic CoRB tool is not saving XML files in UTF-8 format


If we save an XML document from MarkLogic with the xdmp:save function, the file is written in UTF-8.

If we save the same document with the MarkLogic CoRB tool, the file is written in ANSI encoding instead of UTF-8.

Why?

The XQuery code below, run from the MarkLogic Query Console, saves the XML file in UTF-8:

xquery version "1.0-ml";

let $data := fn:collection('00-2346447146')/metadata

return xdmp:save("E:\ML_CoRB_Tool\DD-7759627900-test\Report\00-2346447146-1.xml", $data)

The CoRB PROCESS-MODULE XQuery code below, by contrast, saves the same XML file in ANSI encoding:

xquery version "1.0-ml";
declare variable $URI external;
declare variable $SCR-database-name    := 'SCR';
let $scr-data:= xdmp:eval('xquery version "1.0-ml";
                           declare variable $URI external; 
                           let $UPI := fn:replace($URI, ".xml", "") 
                            let $scr-metadata := cts:search(collection("scr-asset"), cts:element-range-query(xs:QName("SAPID"), "=", xs:int($UPI)))
                            let $assetID := $scr-metadata/metadata/assetIdentifiers/assetIdentifier/AssetID
                            return
                              try
                              {
                              if ($scr-metadata)
                              then $scr-metadata
                              else <doc-not-found>{fn:concat("DOC-NOT-PRESENT for UPI: ", $UPI)}</doc-not-found>
                              }
                               catch($x)
                               {
                                  (
                                  xdmp:log("============Transform error ============="),
                                  xdmp:log($x),
                                  <error>{fn:concat("ERROR in UPI:", $UPI," Assetid: ",$assetID)}</error>
                                  )
                               }'
                  , (xs:QName("URI"), $URI),
                <options xmlns="xdmp:eval">
                <database>{xdmp:database($SCR-database-name)}</database>
                </options>
                )
return
if ($scr-data/metadata) then $scr-data else ()  

We are using the following CoRB properties:

THREAD-COUNT=8
MODULE-ROOT=/
MODULES-DATABASE=.\\test\\XQuery\\PROD-Metadata
URIS-FILE=.\\test\\Input\\assets_for_extraction_from_scr_20220121.csv
PROCESS-MODULE=.\\test\\XQuery\\new-query.xqy|ADHOC
EXPORT-FILE-DIR=.\\test\\Report
URIS_BATCH_REF='URIS_BATCH_REF'
LOADER-SET-URIS-BATCH-REF=true
EXPORT-FILE-URI-TO-PATH=false
PRE-BATCH-TASK=com.marklogic.developer.corb.PreBatchUpdateFileTask
PROCESS-TASK=com.marklogic.developer.corb.ExportToFileTask
POST-BATCH-TASK=com.marklogic.developer.corb.PostBatchUpdateFileTask
DECRYPTER=com.marklogic.developer.corb.JasyptDecrypter

Solution

  • The CoRB tasks use the method getValueAsBytes(), which invokes:

    item.asString().getBytes();
    

    The Java String getBytes() method:

    Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

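    On Windows the platform default is typically a legacy single-byte code page such as windows-1252, which many editors report as ANSI. So it looks like CoRB should explicitly ask for UTF-8 encoded bytes to be written, rather than rely on the platform charset: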

    item.asString().getBytes(StandardCharsets.UTF_8);
    

    I have filed an issue to get that adjusted.

    In the meantime, as @David Ennis has suggested, you can make UTF-8 the default file encoding by passing the system property -Dfile.encoding=UTF-8 to the JVM that runs CoRB.
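
To see the effect described above outside of CoRB, here is a minimal standalone sketch. The class name EncodingDemo, the helper encode, and the sample string are made up for illustration and are not part of CoRB. ISO-8859-1 stands in for a single-byte Windows code page, since it encodes the sample character the same way windows-1252 does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {

    // Hypothetical helper: encodes with an explicitly named charset,
    // so the result does not depend on JVM or platform settings.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        // Sample data (an assumption): 17 characters, one of them non-ASCII.
        String xml = "<name>Caf\u00e9</name>";

        byte[] utf8 = encode(xml, StandardCharsets.UTF_8);
        byte[] singleByte = encode(xml, StandardCharsets.ISO_8859_1);
        byte[] platform = xml.getBytes(); // the no-argument call uses the default charset

        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 length: " + utf8.length);            // 18
        System.out.println("ISO-8859-1 length: " + singleByte.length); // 17
        System.out.println("Platform bytes match UTF-8: " + Arrays.equals(platform, utf8));
    }
}
```

Running this once as-is and once with java -Dfile.encoding=UTF-8 EncodingDemo should show whether the last line changes on your machine. If it prints false without the flag, the no-argument getBytes() call is producing non-UTF-8 bytes, which matches the symptom in the question.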