Tags: xml, xsd, marklogic, xml-validation, schematron

How can I validate XML files against multiple schema definitions in MarkLogic?


I am working in a MarkLogic database that has approximately 130,000 XML documents. These documents are written using the MODS schema, with an additional local schema used in the MODS extension element. What I want to do is validate these documents against both the official MODS 3.7 XSD and a locally written Schematron file, schematron.sch.

How can I validate all elements in the MODS namespace against both the mods-3.7.xsd and schematron.sch? Elements in our local namespace would also need to be validated against the schematron.sch.

What do I need to do within MarkLogic to properly set up validation in this way?

I've tried moving mods-3.7.xsd and schematron.sch into the MarkLogic Schemas database, updating the xsi:schemaLocation in the XML documents to xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-7.xsd http://www.loc.gov/mods/v3 /Schemas/schematron.sch", and then testing validation in the MarkLogic Query Console using xdmp:document-insert($new-uri, validate strict { $doc } ). This just returns the error: [1.0-ml] XDMP-VALIDATENODECL: (err:XQDY0084) validate strict { $doc } -- Missing element declaration: Expected declaration for node fn:doc("/Apps/theocom-maggie/scripts/MODS-conversion/ia-to-mods.xsl")/mods:mods in non-lax mode using schema "".

Help!


Solution

  • Keep in mind that the URI of the schema in the schemaLocation is resolved against the Schemas database, not fetched from the web.

    To be honest, I think it is easiest in MarkLogic not to use the xsi:schemaLocation attribute at all, and instead to import the schema explicitly in XQuery (using the import schema statement), so you can be sure it is found correctly.

    Joshua is right about Schematron, by the way: the validate expression does not do Schematron validation. MarkLogic does provide Schematron support, however, which you can apply separately:

    https://docs.marklogic.com/schematron

    The pattern is roughly as follows. Start by uploading the Schematron file and the XSD into your Schemas database. Then compile the Schematron file using something like:

    xquery version "1.0-ml";
    
    import module namespace schematron = "http://marklogic.com/xdmp/schematron" 
          at "/MarkLogic/schematron/schematron.xqy";
    
    schematron:put("/schematron.sch")
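
    The upload step mentioned above has to happen before schematron:put is called. A minimal sketch of it, assuming both files sit under /tmp on the MarkLogic host (hypothetical paths) and that the Schemas database attached to the current content database is the target. The target URIs are only chosen to match the ones used in the other snippets here:

    ```xquery
    xquery version "1.0-ml";

    (: Assumption: the files are readable by MarkLogic at these hypothetical
       filesystem paths. Adjust the paths and target URIs to your setup. :)
    let $as-xml :=
      <options xmlns="xdmp:document-get">
        <format>xml</format>
      </options>
    return
      xdmp:invoke-function(
        function() {
          (: Load each file as XML and store it at the URI the imports use. :)
          xdmp:document-insert("/mods-3-7.xsd",
            xdmp:document-get("/tmp/mods-3-7.xsd", $as-xml)),
          xdmp:document-insert("/schematron.sch",
            xdmp:document-get("/tmp/schematron.sch", $as-xml))
        },
        (: Run the inserts against the Schemas database of the current database. :)
        <options xmlns="xdmp:eval">
          <database>{xdmp:schema-database()}</database>
          <update>true</update>
        </options>
      )
    ```

    Any other way of loading documents into the Schemas database should work just as well. The point is only that the URIs used here match the ones passed to import schema and schematron:put.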
    

    After that, use import schema together with validate for the XSD check, and schematron:validate for the Schematron rules. Something like:

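    xquery version "1.0-ml";

    (: Point "at" to the URI of the XSD you uploaded to the Schemas database :)
    import schema namespace mods = "http://www.loc.gov/mods/v3" at "/mods-3-6.xsd";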
    
    import module namespace schematron = "http://marklogic.com/xdmp/schematron" 
          at "/MarkLogic/schematron/schematron.xqy";
    
    let $xml := <mods version="3.3" xmlns="http://www.loc.gov/mods/v3">
    
    <titleInfo>
    <title>FranUlmer.com -- Home Page</title>
    </titleInfo>
    <titleInfo type="alternative">
    <title>Fran Ulmer, Democratic candidate for Governor, Alaska, 2002</title>
    </titleInfo>
    <name type="personal">
    <namePart>Ulmer, Fran</namePart>
    </name>
    <genre>Website</genre>
    <originInfo>
    <dateCaptured point="start" encoding="iso8601">20020702 </dateCaptured>
    <dateCaptured point="end" encoding="iso8601"> 20021203</dateCaptured>
    </originInfo>
    <language>
    <languageTerm authority="iso639-2b">eng</languageTerm>
    </language>
    <physicalDescription>
    <internetMediaType>text/html</internetMediaType>
    <internetMediaType>image/jpg</internetMediaType>
    </physicalDescription>
    <abstract>Website promoting the candidacy of Fran Ulmer, Democratic candidate for Governor, Alaska, 2002. Includes candidate biography, issue position statements, campaign contact information, privacy policy and campaign news press releases. Site features enable visitors to sign up for campaign email list, volunteer, make campaign contributions and follow links to other internet locations. </abstract>
    <subject>
    <topic>Elections</topic>
    <geographic>Alaska</geographic>
    </subject>
    <subject>
    <topic>Governors</topic>
    <geographic>Alaska</geographic>
    <topic>Election</topic>
    </subject>
    <subject>
    <topic>Democratic Party (AK)</topic>
    </subject>
    <relatedItem type="host">
    <titleInfo>
    <title>Election 2002 Web Archive</title>
    </titleInfo>
    <location>
    <url>http://www.loc.gov/minerva/collect/elec2002/</url>
    </location>
    </relatedItem>
    <location>
    <url displayLabel="Active site (if available)">http://www.franulmer.com/</url>
    </location>
    <location>
    <url displayLabel="Archived site">http://wayback-cgi1.alexa.com/e2002/*/http://www.franulmer.com/</url>
    </location>
    </mods>
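    return
      (: validate strict raises an error if the XSD check fails; otherwise the
         validated node is checked against the compiled Schematron rules and
         an SVRL report is returned :)
      schematron:validate(
        validate strict { $xml },
        schematron:get("/schematron.sch")
      )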
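
    Because validate strict throws on the first XSD problem, the Schematron step never runs for an invalid document. If you would rather collect both kinds of findings per document (useful when going through many documents), a sketch along these lines may help. It assumes xdmp:validate is acceptable in place of the validate expression, that the XSD lives at /mods-3-7.xsd, and that /example/record.xml is one of your documents (both URIs are hypothetical):

    ```xquery
    xquery version "1.0-ml";

    import schema namespace mods = "http://www.loc.gov/mods/v3" at "/mods-3-7.xsd";

    import module namespace schematron = "http://marklogic.com/xdmp/schematron"
          at "/MarkLogic/schematron/schematron.xqy";

    declare namespace svrl = "http://purl.oclc.org/dsdl/svrl";

    (: Hypothetical document URI; substitute one of your own. :)
    let $doc := fn:doc("/example/record.xml")

    (: xdmp:validate returns an xdmp:validation-errors element rather than
       throwing; its children describe the individual XSD problems. :)
    let $xsd-errors := xdmp:validate($doc, "strict")/*

    (: schematron:validate returns an SVRL report; failed assert rules
       show up as svrl:failed-assert elements. :)
    let $failed := schematron:validate($doc, schematron:get("/schematron.sch"))
                     //svrl:failed-assert
    return
      <result>
        <xsd-valid>{fn:empty($xsd-errors)}</xsd-valid>
        <schematron-valid>{fn:empty($failed)}</schematron-valid>
        {$xsd-errors, $failed}
      </result>
    ```

    Note that this only looks at svrl:failed-assert. If your schematron.sch uses report rules to flag problems, you would also want to check svrl:successful-report.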
    

    HTH!