marklogic marklogic-9 mlcp marklogic-dhf

Extract element from root node and populate with each document

MarkLogic version : 9.0-6.2

Here is a sample XML file that I am ingesting into the staging database using mlcp. I need to split the XML into a separate document for each Policy, with the URI built as /policy/PolNum/TransactionRequestDt.xml, where TransactionRequestDt is reformatted to YYYYMMDDHHMMSS. An example URI is /policy/P123/201610171533390000000.xml

<?xml version="1.0" encoding="UTF-8"?>
<PolicyInfo>
    <TransactionRequestDt>2016-10-17T15:33:39.770</TransactionRequestDt>
    <Policy>
        <PolNum>P123</PolNum>
        ....
        ....
    </Policy>
    <Policy>
        <PolNum>P456</PolNum>
        ....
        ....
    </Policy>
</PolicyInfo>

My mlcp command looks like this:

mlcp.sh import -ssl \
-host localhost \
-port 8010 \
-username nnnn \
-password ffff \
-input_file_path /f1/f2 \
-input_file_type aggregates \
-aggregate_record_element Policy \
-output_collections policy \
-output_uri_prefix /policy/ \
-uri_id PolNum \
-transform_module /ext/ingesttransform.sjs \
-output_uri_suffix ".xml"

My idea was to use the transform function to reformat TransactionRequestDt, but I realized that the TransactionRequestDt element is not available to the transform, because it sits outside the 'Policy' aggregate.

What is the best way to access TransactionRequestDt and use it in the URI? I tried

-transform_param TransactionRequestDt

but it looks like the parameter value is passed as the literal string 'TransactionRequestDt' instead of the actual date value of the TransactionRequestDt element.

Solution

I'd consider dropping the -aggregate_record_element param, so that the transform gets access to the full document (and is consequently invoked once for the entire file). Inside the transform you read and normalize that date, get the Policy children (using something like content.value.xpath('/PolicyInfo/Policy')), iterate over those, and build up a sequence of { uri: ..., value: ... } objects to return as the result of the transform. MLCP will detect that you are returning multiple results, and write them all.

Here is a similar SO answer with sample code. Mind, though, that it is about splitting JSON rather than XML. Don't do the toObject(), but use xpath() instead, and there is no need for xdmp.toJSON() either:

https://stackoverflow.com/a/36506478/918496
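
For the XML case, the approach described above could look roughly like the sketch below. It is untested against a live server and rests on a few assumptions: the helper names toCompactTimestamp and policyUri are made up for illustration, the transform builds the complete URI itself rather than relying on -output_uri_prefix and -output_uri_suffix, and MLCP is assumed to accept an element node as the value of each returned object (wrap it in a document node if it does not). The timestamp helper follows the stated YYYYMMDDHHMMSS format and drops fractional seconds, so it does not reproduce the extra trailing digits shown in the example URI of the question.

```javascript
// Hypothetical helper: "2016-10-17T15:33:39.770" -> "20161017153339".
// Fractional seconds and any timezone suffix are ignored.
function toCompactTimestamp(isoDateTime) {
  const m = /^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})/
    .exec(String(isoDateTime).trim());
  if (!m) {
    throw new Error('Unexpected dateTime value: ' + isoDateTime);
  }
  return m.slice(1).join('');
}

// Hypothetical helper: builds /policy/<PolNum>/<timestamp>.xml
function policyUri(polNum, isoDateTime) {
  return '/policy/' + polNum + '/' + toCompactTimestamp(isoDateTime) + '.xml';
}

// MLCP transform entry point. It is invoked once per input file, because
// the file is no longer split with -aggregate_record_element.
function transform(content, context) {
  const doc = content.value; // the whole PolicyInfo document
  const requestDt = fn.string(
    fn.head(doc.xpath('/PolicyInfo/TransactionRequestDt')));

  const results = [];
  for (const policy of doc.xpath('/PolicyInfo/Policy')) {
    const polNum = fn.string(fn.head(policy.xpath('PolNum')));
    // Assumption: an element node is accepted as the document value.
    results.push({ uri: policyUri(polNum, requestDt), value: policy });
  }
  // Returning several { uri, value } objects makes MLCP write them all.
  return Sequence.from(results);
}

exports.transform = transform;
```

With a transform along these lines, the import would presumably be run without -aggregate_record_element and -uri_id, and with -input_file_type documents (the default), so that each file reaches the transform whole.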

HTH!