Search code examples
xmlxpathxquerysaxonxquery-3.0

Efficient way of finding elements that exists in one document but not the other in XQuery


I have the following data:

<Subjects>
    <Subject>
        <Id>1</Id>
        <Name>Maths</Name>
    </Subject>
    <Subject>
        <Id>2</Id>
        <Name>Science</Name>
    </Subject>
    <Subject>
        <Id>2</Id>
        <Name>Advanced Science</Name>
    </Subject>
    <Subject>
        <Id>500</Id>
        <Name>XYZ</Name>
    </Subject>
    <Subject>
        <Id>1000</Id>
        <Name>ABC</Name>
    </Subject>
</Subjects>

and:

<Courses>
    <Course>
        <SubjectId>1</SubjectId>
        <Name>Algebra I</Name>
    </Course>
    <Course>
        <SubjectId>1</SubjectId>
        <Name>Algebra II</Name>
    </Course>
    <Course>
        <SubjectId>1</SubjectId>
        <Name>Percentages</Name>
    </Course>
    <Course>
        <SubjectId>2</SubjectId>
        <Name>Physics</Name>
    </Course>
    <Course>
        <SubjectId>2</SubjectId>
        <Name>Biology</Name>
    </Course>
</Courses>

and I want to be able to get the subject elements with 500 and 1000, because they don't appear in the 2nd XML document.

How do I do this in the most efficient way possible (bare in mind I have about 750 subjects, each with 120 courses)?


Solution

  • Efficiency depends on your optimizer, but since you mention Saxon in your tags, I guess that's what we could target on. The simplest query, assuming you have bound the variables $subjects and $courses to the <Subjects> and <Courses> elements respectively, is probably

    $subjects/Subject[not(Id = $courses/Course/SubjectId)]
    

    and as a first step I would try running that and seeing if it produces the right results in an acceptable time; from then on it's performance tuning work. For the performance tuning, make sure you have source documents of different sizes so that you can measure how the performance scales with document size.

    Normally for join queries Saxon-EE will do a much better job of optimizing than Saxon-HE, but I doubt that it will have much success with this one as the predicate is expressed as a negation. So this will probably have quadratic performance.

    To hand-optimize this, I would build an index. In XSLT that can be done using xsl:key, in XQuery 3.1 it can be done using maps. Define a map containing all the SubjectIds that appear in $courses:

    let $courseSubjects := map:merge($courses/Course/SubjectId ! map{.: true()})
    

    and then use this to select:

    return $subjects/Subject[not(map:contains($courseSubjects, Id))]
    

    POSTSCRIPT

    I under-estimated the Saxon-EE optimizer. It does in fact generate an index to support evaluation of this join. So creating your own map is probably quite unnecessary. But I haven't done any measurements.