Search code examples
postgresqlkotlinjooqmultiset

jOOQ multiset performance


Have been fiddling with jOOQ and its MULTISET feature. But was a bit dissappointed by the performance of the queries. I assume I'm doing something suboptimal, or perhaps the JSON de-serializing is more expensive than expected.

Given the popular Sakila DB and these two queries:

val result: Result<Record3<CustomerId, Int, Int>> = dslContext
.select(
    CUSTOMER.CUSTOMER_ID,
    PAYMENT.PAYMENT_ID,
    RENTAL.RENTAL_ID,
)
.from(CUSTOMER)
.join(PAYMENT).on(PAYMENT.CUSTOMER_ID.eq(CUSTOMER.CUSTOMER_ID))
.join(RENTAL).on(RENTAL.CUSTOMER_ID.eq(CUSTOMER.CUSTOMER_ID))
.orderBy(CUSTOMER.EMAIL)
.fetch()

and

val result: Result<Record3<CustomerId, Result<Record1<Int>>, Result<Record1<Int>>>> = dslContext
.select(
    CUSTOMER.CUSTOMER_ID,
    multiset(
        DSL.select(PAYMENT.PAYMENT_ID)
           .from(PAYMENT)
           .where(PAYMENT.CUSTOMER_ID.eq(CUSTOMER.CUSTOMER_ID))
    ),
    multiset(
        DSL.select(RENTAL.RENTAL_ID)
           .from(RENTAL)
           .where(RENTAL.CUSTOMER_ID.eq(CUSTOMER.CUSTOMER_ID))
    ),
)
.from(CUSTOMER)
.fetch()

Measuring the number of row and time it takes leads to the following:

445483 Records via JOIN.    in 466.389250ms
599    Records via MULTISET in 747.627541ms

Which is rather disappointing... even if I select more data from CUSTOMER table resulting in more duplicate data transfer in the JOIN case, it doesn't change much in MULTISETs favor.

Am I missing something?


Solution

  • As illustrated in this blog post, MULTISET and the underlying JSON aggregation techniques are well suited for small nested collections, not large ones, in case of which there are better optimised approaches, including:

    • Joins (which denormalise data, but the overhead of re-normalising/grouping is still often smaller than the penalty of bad execution plans)
    • Multiple queries (not N+1, but perhaps 1 query per table or per relationship)

    Your example query doesn't really represent a real world use-case. You're just fetching all the data of the parent table, as well as all the data of the child tables, which is hardly ever done this way, and if it is, then you probably don't have this nesting use-case, but just flat export your data into a CSV, or whatever. In real-world use-cases, MULTISET can often outperform equivalent JOIN tables, but you'll have to consider execution plans (as always with SQL, not just with MULTISET!). Why not measure a real-world use-case instead? E.g. 5 rows in the parent table, and the top 10 rows per child.

    If you had done any further measurements, profiling, etc., you would have seen that the overhead of JSON serialisation / deserialisation, while present, no doubt, is still marginal compared to the effect of the inferior execution plan. Most RDBMS still produce nested loops of sorts when collecting data from subqueries into JSON documents, whereas joins can profit from hash joins or merge joins, which have better algorithmic complexity. Just like nested loop joins are inferior to hash joins for large data sets, nesting collections can be inferior as well.

    As always with SQL, don't try to establish set-in-stone rules about which approach is better than the other, but try to understand why things are the way they are, and frequently verify your assumptions, as optimisers tend to get better, and there's no reason for RDBMS not to eventually find ways to transform nested collection loops like these into some sort of hashed collection. jOOQ recommends techniques like these benchmarking approaches for measurements of various equivalent alternative queries.