amazon-s3 marklogic mlcp

MarkLogic - S3 Import


Can we import data from Amazon S3 into MarkLogic using

  1. JavaScript/XQuery API
  2. MarkLogic Content Pump
  3. Any other way?

Please share the reference, if available.


Solution

  • I'm not an AWS expert by any stretch, but if you know the locations of your data on S3, you can use xdmp:document-get(), with an http:// prefix in the $location, to retrieve documents. You can also use xdmp:http-get(), perhaps to query for the locations of your documents. Once the document has been retrieved, you can insert it with the usual xdmp:document-insert().

    That approach should be fine for a small number of documents. If you have a large set you want to import, you'll have to factor in the possibility of the transaction timing out.
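
    As a minimal sketch of that approach, the query below fetches one object over HTTP and inserts it. The bucket name, object key, and target URI are hypothetical placeholders, and the sketch assumes the object is readable without AWS request signing (for example, a public object or a pre-signed URL).

    ```xquery
    xquery version "1.0-ml";

    (: Hypothetical S3 object URL; replace with the real location of your data. :)
    let $location := "http://example-bucket.s3.amazonaws.com/data/doc1.xml"
    (: Hypothetical URI under which the document is stored in the database. :)
    let $uri := "/s3-import/doc1.xml"
    (: Fetch the document over HTTP, then insert it in the same transaction. :)
    return xdmp:document-insert($uri, xdmp:document-get($location))
    ```

    Both calls run in a single transaction here, which is why this pattern only suits a small number of documents.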

    For a larger data set, you might want to manage the process externally. Here are a few options:

    • export data from S3 onto your local filesystem, then use MLCP to send it to MarkLogic
    • insert a document that has a list of resources at S3 that you want to import; spawn tasks that will each take a group of those resources and import them using xdmp:document-get()
    • use Java code to pull a document (or batch of documents) from S3, then use the Java Client API to insert that data into MarkLogic
    • once MarkLogic 9 comes out, use the Data Movement SDK, which is intended to make projects like this easier (as of this writing, the DMSDK is still in development)
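
    The spawned-task option above could look roughly like the sketch below. It assumes a hypothetical manifest document at /import/s3-manifest.xml shaped like <manifest><url>...</url></manifest>, an arbitrary batch size of 50, and a made-up URI scheme for the imported documents. The transaction-mode option tells MarkLogic to run each spawned function as an update; treat the exact options as something to check against the documentation for your server version.

    ```xquery
    xquery version "1.0-ml";

    (: Hypothetical manifest listing the S3 object URLs to import. :)
    let $urls := fn:doc("/import/s3-manifest.xml")/manifest/url/fn:string()
    (: Assumed batch size; tune it so each task finishes well within the timeout. :)
    let $batch-size := 50
    (: Number of batches, rounded up with integer arithmetic. :)
    let $batch-count := (fn:count($urls) + $batch-size - 1) idiv $batch-size
    for $i in 0 to $batch-count - 1
    let $batch := fn:subsequence($urls, $i * $batch-size + 1, $batch-size)
    return
      (: Each batch goes to the task server and runs in its own transaction. :)
      xdmp:spawn-function(
        function() {
          for $url in $batch
          (: Illustrative URI scheme: last path segment of the URL.
             Objects sharing a file name would overwrite each other. :)
          let $uri := fn:concat("/s3-import/", fn:tokenize($url, "/")[fn:last()])
          return xdmp:document-insert($uri, xdmp:document-get($url))
        },
        <options xmlns="xdmp:eval">
          <transaction-mode>update-auto-commit</transaction-mode>
        </options>
      )
    ```

    Because each batch commits independently, a failure in one task does not roll back the others, so you may want to record which URLs succeeded if the import needs to be resumable.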