Search requests timing out with concurrent transactions in MarkLogic

Apologies in advance for this non-simplified use case.

During one of my data load processes, concurrent transactions are used to load data into MarkLogic.

Each concurrent thread does the following operations at a high level:

1) Create a transaction via the transactions API

2) Search for whether a certain document exists, passing in the transaction ID from step 1. (This could search for a document that is being updated and locked in another concurrent transaction with a different transaction ID. This is the query that is timing out.)

3) If the document does not exist, create a new document with the transaction ID from step 1; if it does exist, upsert the new document with the transaction ID from step 1.

4) Commit transaction from step 1.

Deadlocks occur with this process, and we see them in the error logs. We are fine with that: if any request gets deadlocked, we push it to the bottom of the queue to retry. The issue we are running into is that some cts:search() queries time out rather than throwing a retryable deadlock error right away, which is what we expect, and this causes our concurrent transaction process to lengthen exponentially.

Below is a sample request to the server that is timing out:

This is a POST to the /v1/search endpoint, which just does a basic unfiltered path-range-query that we have an index for, as well as a json-property-value query.

cts:search(fn:collection(), cts:and-query((cts:collection-query("my_collection"), cts:path-range-query("/structure/id", "=", "28820425", ("collation=http://marklogic.com/collation/"), 1), cts:json-property-value-query("@class", "myClass", ("lang=en"), 1)), ()), ("unfiltered", cts:score-order("descending")), xs:double("0"), ())

And we get the following timeout error:

{"errorResponse":{"statusCode":500, "status":"Internal Server Error", "messageCode":"INTERNAL ERROR", "message":"SVC-EXTIME:

When debugging, we do see that while this transaction is running, documents returned by the cts:search are locked for update in other transactions.

Why is our cts:search returning a timeout error here? Is the request hanging because certain documents are locked? When we explicitly lock documents in our testing and then search for them, all the documents are still returned with a valid search response, without an error or a timeout.

Our dataset is not large (3k documents) and our documents are even smaller (10-15 JSON fields), so performance can't be the issue here.

Are there any options we can set on the appserver itself to help here? Or should we structure our query differently? Any explanation would help greatly. Sorry again for not being able to provide a testable case; I'm just curious whether anyone has run into something similar.

Solution

"When debugging, we do see that while this transaction is running, documents returned by the cts:search are locked for update in other transactions. We are well aware of this possibility and are okay with it."

You may think that you are okay with it, but you are running into performance issues that are likely due to it, and are looking to avoid timeouts - so you probably aren't okay with it.

When you perform a search in an update transaction, all of the fragments will get a read-lock. You can have multiple transactions all obtain read-locks on the same URI without a problem. However, if one of those transactions then decides it wants to update one of those documents, it needs to promote its shared read-lock to an exclusive write-lock. When that happens, all of those other transactions that had a read-lock on that URI will get restarted. If they need access to a URI that has an exclusive write-lock, they will have to wait until the transaction holding the write-lock completes and lets go.

So, if you have a lot of competing transactions all performing searches with the same criteria and trying to snag the first item (or first set of items) from the search results, they can cause each other to keep restarting and/or waiting, which takes time. Adding more threads in an attempt to do more makes it even worse.

There are several strategies that you can use to avoid this lock contention.

Instead of using cts:search() to search for and retrieve the documents, you could use cts:uris() to get only the URIs. Then, rather than reading the doc with fn:doc() (which would first obtain a read-lock) and later attempting the UPSERT (which would promote that read-lock to a write-lock), you could call xdmp:lock-for-update() on the URI to obtain an exclusive write-lock up front, and only then read the doc with fn:doc().
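A minimal sketch of that first strategy, assuming the work is done in a server-side XQuery module rather than through separate REST calls, and assuming the URI lexicon is enabled on the database (cts:uris() needs it). The query mirrors the one in the question; the fallback URI scheme and the replacement document are hypothetical placeholders:

```xquery
xquery version "1.0-ml";

(: Sketch only. The fallback URI scheme and the new document body
   are hypothetical; adapt them to your own data model. :)
declare variable $id as xs:string := "28820425";

let $query := cts:and-query((
  cts:collection-query("my_collection"),
  cts:path-range-query("/structure/id", "=", $id,
    "collation=http://marklogic.com/collation/"),
  cts:json-property-value-query("@class", "myClass")
))
(: cts:uris() resolves from the URI lexicon and indexes, so no
   document is read at this point :)
let $existing := cts:uris((), "limit=1", $query)
(: Hypothetical URI to use when nothing matches yet :)
let $uri := ($existing, fn:concat("/my_collection/", $id, ".json"))[1]
(: Take the exclusive write-lock first, instead of a read-lock that
   would have to be promoted later :)
let $_ := xdmp:lock-for-update($uri)
(: Safe to read now; merge fields from $current into $new as needed :)
let $current := fn:doc($uri)
let $new := object-node {
  "@class": "myClass",
  "structure": object-node { "id": $id }
}
return
  xdmp:document-insert($uri, $new, xdmp:default-permissions(), "my_collection")
```

Because the lexicon lookup happens before the lock is taken, another transaction may have changed the document in between, so you may want to re-check $current against your criteria after the lock is held.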

If you are trying to perform some sort of batch processing, you could use a tool such as CoRB to first query for the set of URIs to process (lock-free) in a read-only transaction, and then fire off lots of worker transactions that each process one URI separately, reading and locking the doc without any contention.

You could also separate the search and update work, using xdmp:invoke-function() or xdmp:spawn-function() so that the search is executed lock-free and the update work is isolated.
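A rough sketch of that separation, assuming MarkLogic 9 or later (for the update option of xdmp:invoke-function) and an enabled URI lexicon. The "processed" collection is a hypothetical stand-in for whatever update you actually need to make:

```xquery
xquery version "1.0-ml";

(: Sketch only. The "processed" collection is a hypothetical placeholder
   for the real update work. :)
declare variable $query := cts:and-query((
  cts:collection-query("my_collection"),
  cts:json-property-value-query("@class", "myClass")
));

(: Step 1: run the search in its own read-only transaction, so it
   executes at a timestamp and takes no locks :)
let $uris :=
  xdmp:invoke-function(
    function() { cts:uris((), (), $query) },
    <options xmlns="xdmp:eval">
      <update>false</update>
    </options>)
(: Step 2: update each document in its own isolated transaction, which
   only ever locks the single URI it is working on :)
for $uri in $uris
return
  xdmp:invoke-function(
    function() {
      xdmp:lock-for-update($uri),
      xdmp:document-add-collections($uri, "processed")
    },
    <options xmlns="xdmp:eval">
      <update>true</update>
    </options>)
```

Swapping the second xdmp:invoke-function() for xdmp:spawn-function() would queue the updates on the task server instead of running them inline, which may suit larger batches.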

Some resources that describe locks and performance issues caused by lock contention: