Search code examples
swiftobjective-ccore-datanspredicate

How to use AND with ANY in NSPredicates while staying efficient


Imagine I have a collection of books in Core Data. Each book can have multiple authors. Here's what this imaginary collection of books could look like (it only has one book, just to simplify things):

[
    Book(authors: [
        Author(first: "John", last: "Monarch"),
        Author(first: "Sarah", last: "Monarch")
    ])
]

I want to filter my collection of books down to only those that have an author whose name is "Sarah Monarch".

From what I've seen so far, if I wanted to write an NSPredicate to filter my collection and return a filtered collection that only contains this book, I could do it by using a SUBQUERY:

NSPredicate(format: "SUBQUERY(authors, $author, $author.#first == 'Sarah' && $author.#last == 'Monarch').@count > 0")

My understanding is that this operation is essentially the same as:

books.filter {
    book in
    
    let matchingAuthors = book.authors.filter {
        author in
        
        author.first == "John" && author.last == "Monarch"
    }
    
    return matchingAuthors.count > 0
}

My problem is that it seems there's some inefficiency here — SUBQUERY (and the example code above) will look at all authors, when we can stop after finding just one that matches. My intuition would lead me to try a predicate like:

ANY (authors.#first == "Sarah" && authors.#last == "Monarch")

Which, as code, could be:

books.filter {
    book in
    
    return book.authors.contains {
        author in
        
        author.first == "John" && author.last == "Monarch"
    }
}

But this predicate's syntax isn't valid.

If I'm right, and the SUBQUERY-based approach is less efficient (because it looks at all elements in a collection, rather than just stopping at the first match) is there a more correct and efficient way to do this?


Solution

  • The following is an alternative approach, which does not use SUBQUERY, but should ultimately have the same results. I've no idea whether this would in practice be more or less efficient than using SUBQUERY.

    Your specific concern is the inefficiency of counting all the matching Authors, rather than just stopping when the first matching Author is found. Bear in mind that there is a lot going on behind the scenes whenever a fetch request is processed. First, CoreData has to parse your predicate and turn it into an equivalent SQLite query, which will look something like this:

    SELECT 0, t0.Z_PK, t0.Z_OPT, t0.ZNAME, ... FROM ZBOOK t0 WHERE (SELECT COUNT(t1.Z_PK) FROM ZAUTHOR t1 WHERE (t0.Z_PK = t1.ZBOOK AND ( t1.ZFIRST == 'Sarah' AND  t1.ZLAST == 'Monarch')) ) > 0
    

    (The precise query will depend on whether the inverse relationship is defined and if so whether it is one-many or many-many; the above is based on the relationship having a to-one inverse, book).

    When SQLite is handed that query to execute, it will check what indexes it has available and then invoke the query planner to determine how best to process it. The bit of interest is the subselect to get the count:

    SELECT COUNT(t1.Z_PK) FROM ZAUTHOR t1 WHERE (t0.Z_PK = t1.ZBOOK AND ( t1.ZFIRST == 'Sarah' AND  t1.ZLAST == 'Monarch'))
    

    This is the bit of the query which corresponds to your first code snippet. Note that it is in SQLite terms a correlated subquery: it includes a parameter (t0.Z_PK - this is essentially the relevant Book) from the outer SELECT statement. SQLite will search the entire table of Authors, looking first to see whether they are related to that Book and then to see whether the author first and last names match. Your proposition is that this is inefficient; the nested select can stop as soon as any matching Author is found. In SQLite terms, that would correspond to a query like this:

    SELECT 0, t0.Z_PK, t0.Z_OPT, t0.ZNAME, ... FROM ZBOOK t0 WHERE EXISTS(SELECT 1 FROM ZAUTHOR t1 WHERE (t0.Z_PK = t1.ZBOOK AND ( t1.ZFIRST == 'Sarah' AND  t1.ZLAST == 'Monarch')) )
    

    (It's unclear from the SQLite docs whether the EXISTS operator actually shortcuts the underlying subselect, but this answer elsewhere on SO suggests it does. If not it might be necessary to add "LIMIT 1" to the subselect to get it to stop after one row is returned). The problem is, I don't know of any way to craft a CoreData predicate which would be converted into a SQLite query using the EXISTS operator. Certainly it's not listed as a function in the NSExpression documentation, nor is it mentioned (or listed as a reserved word) in the predicate format documentation. Likewise, I don't know of any way to add "LIMIT 1" to the subquery (though it's relatively straightforward to add to the main fetch request using fetchLimit.

    So not much scope for addressing the issue you identify. However, there might be other inefficiencies. Scanning the Author table for each Book in turn (the correlated subquery) might be one. Might it be more efficient to scan the Author table once, identify those that meet the relevant criteria (first = "Sarah" and last = "Monarch"), then use that (presumably much shorter) list to search for the books? As I said at the start, that's an open question: I have no idea whether it is or isn't more efficient.

    To pass results of one fetch request to another, use NSFetchRequestExpression. It's a bit arcane but hopefully the following code is clear enough:

        let authorFetch = Author.fetchRequest()
        authorFetch.predicate = NSPredicate(format: "#first == 'Sarah' AND #last == 'Monarch'")
        authorFetch.resultType = .managedObjectIDResultType
        let contextExp = NSExpression(forConstantValue: self.managedObjectContext)
        let fetchExp = NSExpression(forConstantValue: authorFetch)
        let fre = NSFetchRequestExpression.expression(forFetch: fetchExp, context: contextExp, countOnly: false)
        let bookFetch = Book.fetchRequest()
        bookFetch.predicate = NSPredicate(format: "ANY authors IN %@", fre)
        let results = try! self.managedObjectContext!.fetch(bookFetch)
    

    The result of using that fetch is a SQL query like this:

    SELECT DISTINCT 0, t0.Z_PK, t0.Z_OPT, t0.ZNAME, ... FROM ZBOOK t0 JOIN ZAUTHOR t1 ON t0.Z_PK = t1.ZBOOK WHERE  t1.Z_PK IN (SELECT n1_t0.Z_PK FROM ZAUHTOR n1_t0 WHERE ( n1_t0.ZFIST == 'Sarah' AND  n1_t0.ZLAST == 'Monarch')
    

    This has its own complexities (a DISTINCT, a JOIN, and a subselect) but importantly the subselect is no longer correlated: it is independent of the outer SELECT so can be evaluated once rather than being re-evaluated for each row of the outer SELECT.