Search code examples
azureazure-cosmosdbcomposite-index

Composite index for Cosmos db not improving RU usage


Azure notified me that a query I had was not performant because of an ORDER BY clause and it would greatly benefit from adding a composite index. I've done a bunch of reading and tried repeatedly to add a composite index for this query but unfortunately I've either had no improvement or a slight increase in RUs for my query with my attempts.

I have the following piece to my query:

WHERE
     p.BusinessId = 'id' AND
     p.PurchaseDate >= '2018-07-04T16:26:42.0882176-06:00' AND p.PurchaseDate < '2023-07-10T15:26:42.0882176-06:00' AND
     NOT Contains(LOWER(p.TransactionIdentifier), 'history')
ORDER BY p.PurchaseDate DESC

Based on all the reading I've done I should have two composite indexes, a (businessId, purchaseDate descending) and a (businessId, transactionIdentifier). Because I can only have one range or function per composite index and I have two so I would need two indexes. That does not work, however, because it's complaining about not having a composite index that matches. The only configuration that it doesn't complain about is (businessId, purchaseDate descending, transactionIdentifer) which actually increases the RUs required to run the query.

Documentation also states that I should include these filters in my ORDER BY for improved performance. I've tried adding those as well. Nothing seems to give me any reduction in RUs.

I've been trying to follow along with the following resources and I'm just not making progress:

I'm stuck. Can anyone help me figure out the proper composite index configuration for this query?


Solution

  • The cause

    I'm pretty sure this condition cannot utilize the index: NOT Contains(LOWER(p.TransactionIdentifier), 'history')

    From LOWER function documentation:

    • This function doesn't use the index.
    • If you plan to do frequent case insensitive comparisons, this function may consume a significant number of RUs. Consider normalizing the casing of strings when ingesting your data. Then a query like SELECT * FROM c WHERE LOWER(c.name) = 'USERNAME' is simplified to SELECT * FROM c WHERE c.name = 'USERNAME'.

    A sidenote, that you don't have to use TOLOWER explicitly, see CONTAINS function:

    CONTAINS(<string_expr_1>, <string_expr_2> [, <bool_expr>])

    bool_expr (Optional) - Optional boolean value for ignoring case. When set to true, CONTAINS performs a case-insensitive search. When unspecified, this value defaults to false.

    The bad news is again in Remarks:

    This function performs a full scan.

    While I've seen hints of these functions being able to utilize index (see here), the docs currently say otherwise. Either way, you also have NOT which likely breaks the deal anyway.

    Hence - no index use for this.

    What can you do

    Rethink your data model

    i.e. precalculate a boolean IsTransactionHistory or similar to your model and efficient indexing options light up. This is the preferred way to get good performance with no trailing baggage.

    Exclude TransactionIdentifier from that index.

    Because as it cannot be used and just makes the index bulkier and match less = cost more.

    Your best bet is most likely simpler index on (p.BusinessId, p.PurchaseDate).

    Depending on data cardinality and parameter patterns, I would also try simpler index on just PurchaseDate. It may work better if you query really narrow date range and there are very few wrong BusinessId-s in the slice.

    Ignore it.

    Sometimes it is economically viable to just ignore the inefficiency and buy those few extra RU's. Rarely, but do the math.