What's a good pattern to implement running queries on individual entities from a obtained result set in Datanucleus/JPA

I am basically obtaining a decently sized result set (a few thousand) through datanucleus by running a JPQL query. On each of these, I also want to find the number of references from another table. The data is in a MySQL db.

For example:

List<Instrument> instruments = em().createQuery("SELECT i FROM Instrument AS i").getResultList();
for(Instrument i : instruments)
{
    Query q = em().createQuery("SELECT COUNT(c) FROM Component AS c WHERE c.instrument.id = :id")
    q.setParameter("id", i.getId());
    long count = (Long) q.getSingleResult();
}

So, basically I want the list of instruments and also the list of components attached to the instrument as per the above example.

I've used similar code at a bunch of places and it performs pretty poorly. I understand that for 2000 instruments, I'll fire 2000 additional queries to count components and that will slow things down. I'm sure there's a better pattern to obtain the same result that I want. How can I get things to speed up?

Solution

That's right, this is not an optimal solution. But the good news is that all of this can be done with one or at most two queries. For instance you don't have to execute the counting query once for each instrument. You can use grouping and get all counts with one query:

List<Instrument> instruments = em().createQuery("SELECT i FROM Instrument AS i").getResultList();

Query q = em().createQuery("SELECT c.instrument.id, COUNT(c) FROM Component AS c GROUP BY c.instrument.id")
List<Object[]> counts = q.getResultList();

for (Object[] elem : counts) {
    // do something 
    // elem[0] is instrument ID
    // elem[1] is count
}

I haven't check that but you can probably also do everything with one query by putting the second query as a subquery in the first one:

SELECT i 
(SELECT COUNT(c) FROM Component AS c WHERE c.instrument.id = i.id)
FROM Instrument AS i

Similar to the first example, result list elem[0] would be an Instrument and elem[1] the count. It can be less efficient because the DB will have to execute the subquery for each instrument anyway, but it will be still quicker than your code, because it happens fully on DB side (no round-trips to DB for each counting query).