Tags: marklogic, mlcp, marklogic-dhf

Could MLCP Content Transformation and Triggers be used together during document ingestion?


As I understand it, both an MLCP transformation and a trigger can be used to modify ingested documents. The difference is that a content transformation operates on the in-memory document object during ingestion, whereas a trigger is fired after a document has been created.
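(For reference, an MLCP custom content transform is an installed XQuery module that exposes a function of the shape sketched below; the ingested document arrives in memory under the "value" key of $content. The module namespace here is a placeholder.)

    xquery version "1.0-ml";
    (: minimal MLCP transform skeleton; the namespace URI is a placeholder :)
    module namespace example = "http://example.com/mlcp-transform";

    declare function example:transform(
      $content as map:map,
      $context as map:map
    ) as map:map*
    {
      (: "value" holds the in-memory document node; modify it and put it back :)
      let $doc := map:get($content, "value")
      return (
        map:put($content, "value", $doc),
        $content
      )
    };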

So it seems to me there is no reason why I cannot use both of them together. My use case is that I need to update some nodes of the documents after they are ingested into the database. The reason I use a trigger is that running the same logic in an MLCP transformation using the in-mem-update module always caused ingestion failures, presumably due to the large file sizes and the large number of nodes I attempted to update:

2018-08-22 23:02:24 ERROR TransformWriter:546 - Exception:Error parsing HTTP headers: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

So far, I have not been able to combine content transformations and triggers. When I enabled the transformation during MLCP ingestion, the trigger was not fired. When I disabled the transformation, the trigger worked without a problem.

Is there any intrinsic reason why I cannot use both of them together? Or is it an issue related to my configuration? Thanks!

Edit:

I would like to provide some context for clarification and report results based on suggestions from @ElijahBernstein-Cooper, @MadsHansen and @grtjn (thanks!). I am using the MarkLogic Data Hub Framework to ingest PDF files (some are quite large) as binaries and extract the text as XML. I essentially followed this example, except that I am using xdmp:pdf-convert instead of xdmp:document-filter: https://github.com/marklogic/marklogic-data-hub/blob/master/examples/load-binaries/plugins/entities/Guides/input/LoadAsXml/content/content.xqy
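For concreteness, the conversion core of my content.xqy looks roughly like the sketch below (the variable names and sample URI are illustrative, and I am assuming the first item of the sequence returned by xdmp:pdf-convert is the main converted XHTML document):

    let $binary := fn:doc("/ingest/sample.pdf")   (: hypothetical URI :)
    let $converted := xdmp:pdf-convert($binary, "sample.pdf")
    (: assumption: the first item is the converted XHTML document,
       any remaining items are auxiliary parts :)
    return $converted[1]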

While xdmp:pdf-convert seems to preserve the PDF structure better than xdmp:document-filter, it also includes some styling nodes (<link> and <style>) and attributes (class and style) that I do not need. In attempting to remove them, I explored two different approaches:

  1. The first approach is to use the in-mem-update module to remove the unwanted nodes from the in-memory document representation within the above content.xqy script, as part of the content transformation flow (see the sketch after this list). The problem is that the process can be quite slow, and as @grtjn pointed out, I have to limit parallelization to avoid timeouts.
  2. The second approach is to use a post-commit trigger function to modify the documents using xdmp:node-delete after they have been ingested into the database. However, the trigger won't fire when the triggering condition is set to document-content("create"). It does fire if I change the condition to document-content("modify"), but then for some reason I cannot access the document using fn:document($trgr:uri), similar to this SO question (MarkLogic 9 sjs trigger not able to acces post-commit() document data).
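For the first approach, a single-pass recursive copy is an alternative to in-mem-update that avoids rebuilding the tree once per deleted node. A minimal sketch, assuming the converted output lives in the XHTML namespace (drop the prefix if the elements are in no namespace):

    xquery version "1.0-ml";
    declare namespace xhtml = "http://www.w3.org/1999/xhtml";

    (: copy the tree, dropping <link>/<style> elements and class/style attributes :)
    declare function local:strip($n as node()) as node()?
    {
      typeswitch ($n)
        case element(xhtml:link)  return ()
        case element(xhtml:style) return ()
        case element() return
          element { fn:node-name($n) } {
            $n/@* except ($n/@class, $n/@style),
            for $child in $n/node() return local:strip($child)
          }
        case document-node() return
          document { for $child in $n/node() return local:strip($child) }
        default return $n
    };

    (: tiny self-contained example; in content.xqy this would be the converted doc :)
    local:strip(document {
      <html xmlns="http://www.w3.org/1999/xhtml">
        <head><style>p {{ color: red }}</style></head>
        <body><p class="x">text</p></body>
      </html>
    })

From the trigger of the second approach, the stripped copy could be written back with xdmp:document-insert($trgr:uri, local:strip(fn:document($trgr:uri))).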

Solution

  • MLCP Transforms and Triggers operate independently. There is nothing in those Transforms that should stop Triggers from working per se.

    Triggers are fired by events. I typically use both a create and a modify trigger to cover the case where I import the same files a second time (for testing purposes, for instance).

    Triggers also have a scope. They are configured to watch either a directory or a collection. Make sure your MLCP configuration matches the trigger scope, and that your transform does not change the URI in such a way that it no longer matches the directory scope, if that is used.
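    As a sketch of such a configuration (names, collection, and module location are placeholders; run this against the triggers database, and note that a trigger takes a single event, so covering "modify" as well means creating a second trigger):

        xquery version "1.0-ml";
        import module namespace trgr = "http://marklogic.com/xdmp/triggers"
          at "/MarkLogic/triggers.xqy";

        trgr:create-trigger(
          "strip-styling-on-create",
          "Strip styling nodes after ingest",
          trgr:trigger-data-event(
            trgr:collection-scope("pdf-ingest"),  (: must match the collection MLCP assigns :)
            trgr:document-content("create"),
            trgr:post-commit()
          ),
          trgr:trigger-module(xdmp:database("Modules"), "/triggers/", "strip-styling.xqy"),
          fn:true(),
          xdmp:default-permissions()
        )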

    Looking more closely at the error message, however, I'd say it is caused by a timeout. Timeouts can occur both server-side (10 minutes by default) and client-side (the limit may depend on client-side settings and could be much smaller). The message basically says that the server took too long to respond, so I'd say you are facing a client-side timeout.

    Timeouts can be caused by too-small time limits. You could try to increase the timeout settings both server-side (xdmp:set-request-time-limit()) and client-side (I am not sure how to do that in Java).
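    Server-side, the limit can also be raised from inside the transform or trigger module itself, capped by the app server's max time limit setting:

        (: raise this request's limit to 30 minutes :)
        xdmp:set-request-time-limit(1800)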

    It is more common, though, that you are simply trying to do too much at the same time. MLCP opens transactions and tries to execute a number of batches within each transaction; that number is the transaction_size. Each batch contains a number of documents given by batch_size. By default, MLCP tries to process 10 x 100 = 1000 documents per transaction.

    It also runs with 10 threads by default, so it typically opens 10 transactions at the same time and runs 10 threads that each process 1000 docs in parallel. With simple inserts this is fine. With heavier processing in transforms or pre-commit triggers, it can become a bottleneck, particularly when the threads start to compete for server resources like memory and CPU.

    Functions like xdmp:pdf-convert can be fairly slow. It depends on an external converter plugin for starters, and imagine it having to process a 200-page PDF. Binaries can be large. You'll want to slow down to process them. If using -transaction_size 1 -batch_size 1 -thread_count 1 makes your transforms work, you really were facing timeouts, and you may have been flooding your server. From there you can look at increasing some of these numbers, but binary sizes can be unpredictable, so be conservative.
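    An illustrative invocation with those conservative settings (connection details, paths, and the collection name are placeholders):

        mlcp.sh import -mode local \
          -host localhost -port 8010 -username admin -password admin \
          -input_file_path /data/pdfs -document_type binary \
          -output_collections pdf-ingest \
          -transform_module /transforms/pdf-transform.xqy \
          -thread_count 1 -batch_size 1 -transaction_size 1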

    It might also be worth looking at doing heavy processing asynchronously, for instance using CPF, the Content Processing Framework. It is a very robust implementation for processing content, and is designed to survive server restarts.

    HTH!