Unable to validate in-memory XML against a schema, but validation works after writing the file and reading it back

I am currently using lxml and want to validate some XML content.

I built it entirely in Python, starting from tei = etree.Element("TEI", nsmap={None: ''}) and adding many subelements.

At some point, I want to check whether the structure is valid against a specific .xsd file, using the following code:

from lxml import etree

xmlschema_doc = etree.parse(xsd_file_path)
xmlschema = etree.XMLSchema(xmlschema_doc)
# run check
status = xmlschema.validate(xml_tree)

It returns False with the error: Element 'TEI': No matching global declaration available for the validation root.

Strangely, if I write the XML to a file using

ET = etree.ElementTree(xmlData)
ET.write('test.xml', pretty_print=True, xml_declaration=True, encoding='utf-8')

and reopen it with b = etree.parse('test.xml'), xmlschema.validate(b) reports no error and the XML structure is valid.

Any idea what I need to add to my XML structure?

EDIT: First items in the invalid XML: [image: lines]

First items in the valid XML file: [image: file]


<?xml version='1.0' encoding='UTF-8'?>
<TEI xmlns="">
            <title xml:lang="en">article</title>
            <title xml:lang="fr">article</title>
            <title type="sub" xml:lang="en">A subtitle</title>
            <author role="aut">
                <forename type="first">John</forename>
              <idno type="">orcid</idno>
              <affiliation ref="#localStruct-affiliation"/>
              <affiliation ref="#struct-affiliation"/>
            <author role="aut">
                <forename type="first">Jane</forename>
                <forename type="middle">Middle</forename>
              <idno type="">orcid</idno>
              <affiliation ref="#localStruct-affiliationA"/>
              <affiliation ref="#localStruct-affiliationB"/>
              <ref type="file" subtype="author" n="1" target="upload.pdf"/>
              <licence target=""/>
            <note type="audience" n="2"/>
            <note type="invited" n="1"/>
            <note type="popular" n="0"/>
            <note type="peer" n="1"/>
            <note type="proceedings" n="0"/>
            <note type="commentary">small comment</note>
            <note type="description">small description</note>
                <title xml:lang="en">article</title>
                <title xml:lang="fr">article</title>
                <title type="sub" xml:lang="en">A subtitle</title>
                <author role="aut">
                    <forename type="first">John</forename>
                  <idno type="">orcid</idno>
                  <affiliation ref="#localStruct-affiliation"/>
                  <affiliation ref="#struct-affiliation"/>
                <author role="aut">
                    <forename type="first">Jane</forename>
                    <forename type="middle">Middle</forename>
                  <idno type="">orcid</idno>
                  <affiliation ref="#localStruct-affiliationA"/>
                  <affiliation ref="#localStruct-affiliationB"/>
                <idno type="isbn">978-1725183483</idno>
                <idno type="halJournalId">117751</idno>
                <idno type="issn">xxx</idno>
                  <biblScope unit="serie">a special collection</biblScope>
                  <biblScope unit="volume">20</biblScope>
                  <biblScope unit="issue">1</biblScope>
                  <biblScope unit="pp">10-25</biblScope>
                  <date type="datePub">2024-01-01</date>
              <idno type="doi">reg</idno>
              <idno type="arxiv">ger</idno>
              <idno type="bibcode">erg</idno>
              <idno type="ird">greger</idno>
              <idno type="pubmed">greger</idno>
              <idno type="ads">gaergezg</idno>
              <idno type="pubmedcentral">gegzefdv</idno>
              <idno type="irstea">vvxc</idno>
              <idno type="sciencespo">gderg</idno>
              <idno type="oatao">gev</idno>
              <idno type="ensam">xcvcxv</idno>
              <idno type="prodinra">vxcv</idno>
              <ref type="publisher"></ref>
              <ref type="seeAlso"></ref>
              <ref type="seeAlso"></ref>
              <ref type="seeAlso"></ref>
              <keywords scheme="author">
                <term xml:lang="en">keyword1</term>
                <term xml:lang="en">keyword2</term>
                <term xml:lang="fr">mot-clé1</term>
                <term xml:lang="fr">mot-clé2</term>
              <classCode scheme="halDomain" n="physics"/>
              <classCode scheme="halDomain" n="halDomain2"/>
              <classCode scheme="halTypology" n="ART"/>
      <listOrg type="structures">
        <org type="institution" xml:id="localStruct-affiliation">
          <orgName>laboratory for MC, university of Yeah</orgName>
          <orgName type="acronym">LMC</orgName>
              <addrLine>Blue street 155, 552501 Olso, Norway</addrLine>
              <country key="LS">Lesotho</country>
            <ref type="url" target=""/>
        <org type="institution" xml:id="localStruct-affiliationB">
          <orgName>laboratory for MCL, university of Yeah</orgName>
          <orgName type="acronym">LMCL</orgName>
              <addrLine>Blue street 155, 552501 Olso, Norway</addrLine>
              <country key="NO">Norway</country>
            <ref type="url" target=""/>


  • You should basically use

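    # Assumed TEI namespace URI; it must match the schema's targetNamespace.
    TEI_NAMESPACE = "http://www.tei-c.org/ns/1.0"
    TEI = "{%s}" % TEI_NAMESPACE
    NSMAP = {None : TEI_NAMESPACE} # the default namespace (no prefix)
    root = etree.Element(TEI + "TEI", nsmap=NSMAP) # lxml only!
    text = etree.SubElement(root, TEI + "text")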

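    and so on for all elements, to ensure they are created in the TEI namespace. The nsmap argument only declares the namespace for serialization; it does not put an element created with the plain tag "TEI" into that namespace. That is why your tree validates only after it has been written out and parsed again.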

    A minimal tree built in memory that validates against the schema (which I downloaded together with the W3C xml.xsd it imports) is, for example:

    from lxml import etree

    # Assumed TEI namespace URI; it must match the schema's targetNamespace.
    TEI_NAMESPACE = "http://www.tei-c.org/ns/1.0"
    TEI = "{%s}" % TEI_NAMESPACE
    NSMAP = {None : TEI_NAMESPACE} # the default namespace (no prefix)

    # every element is created with a namespace-qualified tag
    root = etree.Element(TEI + "TEI", nsmap=NSMAP) # lxml only!
    text = etree.SubElement(root, TEI + "text")
    body = etree.SubElement(text, TEI + "body")
    listBibl = etree.SubElement(body, TEI + "listBibl")
    biblFull = etree.SubElement(listBibl, TEI + "biblFull")
    sourceDesc = etree.SubElement(biblFull, TEI + "sourceDesc")
    profileDesc = etree.SubElement(biblFull, TEI + "profileDesc")

    xmlschema_doc = etree.parse("aofr.xsd")
    xmlschema = etree.XMLSchema(xmlschema_doc)
    # run check
    status = xmlschema.validate(root)
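
To see why the written and reparsed file behaves differently from the tree built in memory, it may help to compare the tags directly. The following is a minimal sketch, not taken from the question's code: the namespace URI is an assumption (the usual TEI one) and should be replaced by whatever targetNamespace your schema declares. It needs no schema file, since it only inspects tag names.

```python
from lxml import etree

# Assumption: the usual TEI namespace URI; use your schema's targetNamespace.
TEI_NAMESPACE = "http://www.tei-c.org/ns/1.0"

# nsmap only declares xmlns="..."; the tag "TEI" itself stays un-namespaced.
unqualified = etree.Element("TEI", nsmap={None: TEI_NAMESPACE})
etree.SubElement(unqualified, "text")

# Serializing and reparsing lets the parser apply the default namespace
# to every unprefixed element, which is what writing test.xml did.
reparsed = etree.fromstring(etree.tostring(unqualified))

# Building the element with a qualified tag gives the same name directly.
qualified = etree.Element("{%s}TEI" % TEI_NAMESPACE,
                          nsmap={None: TEI_NAMESPACE})

print(unqualified.tag)   # plain local name, no namespace
print(reparsed.tag)      # namespace-qualified after the round trip
print(qualified.tag)     # namespace-qualified from the start
```

The validator matches root elements by their qualified name, so only the second and third forms can match a global declaration in a schema whose target namespace is the TEI one.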