Are DTD external unparsed entities and notations used in the wild?


DTDs provide a mechanism for referencing external entities of arbitrary formats, thus allowing SGML and XML files to link to any file with a URI without creating a custom mechanism for that. So, for example, one could specify in a DTD:

<!ELEMENT img EMPTY>
<!ATTLIST img src ENTITY #REQUIRED>
<!NOTATION gif PUBLIC "-//CompuServe//NOTATION Graphics Interchange Format 89a//EN" "image/gif">
<!ENTITY myimg1 SYSTEM "img1.gif" NDATA gif>
<!ENTITY myimg2 SYSTEM "img2.gif" NDATA gif>
<!ENTITY myimg3 SYSTEM "img3.gif" NDATA gif>

When creating an img element, one could then use a value like myimg1 for the src attribute, and the application working with the document should be informed that the file img1.gif is referenced, along with its format.
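To make "the application should be informed" concrete, here is a minimal sketch of how a parser can report these declarations. It uses Python's standard xml.parsers.expat module; the sample document (the DTD above moved into an internal subset) and the helper name find_unparsed_references are assumptions made for illustration only. Note that expat is non-validating, so the sketch does its own bookkeeping of which attributes are declared as ENTITY.

```python
import xml.parsers.expat

# Hypothetical sample: the declarations from the question as an internal subset.
DOC = """<?xml version="1.0"?>
<!DOCTYPE img [
<!ELEMENT img EMPTY>
<!ATTLIST img src ENTITY #REQUIRED>
<!NOTATION gif PUBLIC "-//CompuServe//NOTATION Graphics Interchange Format 89a//EN" "image/gif">
<!ENTITY myimg1 SYSTEM "img1.gif" NDATA gif>
]>
<img src="myimg1"/>
"""


def find_unparsed_references(xml_text):
    """Return (notations, references) as reported by expat for xml_text."""
    notations = {}        # notation name -> (public id, system id)
    entities = {}         # entity name -> (system id, notation name)
    entity_attrs = set()  # (element, attribute) pairs declared as ENTITY
    references = []       # (element, attribute, system id, notation name)

    def on_notation(name, base, system_id, public_id):
        notations[name] = (public_id, system_id)

    def on_unparsed_entity(name, base, system_id, public_id, notation_name):
        entities[name] = (system_id, notation_name)

    def on_attlist(elname, attname, att_type, default, required):
        if att_type == "ENTITY":
            entity_attrs.add((elname, attname))

    def on_start(tag, attrs):
        # Resolve ENTITY-typed attribute values against the declared entities.
        for attname, value in attrs.items():
            if (tag, attname) in entity_attrs and value in entities:
                references.append((tag, attname) + entities[value])

    parser = xml.parsers.expat.ParserCreate()
    parser.NotationDeclHandler = on_notation
    parser.UnparsedEntityDeclHandler = on_unparsed_entity
    parser.AttlistDeclHandler = on_attlist
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return notations, references


notations, references = find_unparsed_references(DOC)
print(references)
print(notations["gif"])
```

The parser only hands over the identifiers; what to do with img1.gif and the gif notation is still entirely up to the application.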

The way I understand it, there are three advantages to this:

  • Standardization. Regardless of any actual schema in use, an application could be made to find out everything the document links to, even though it may not understand it. This might be useful for security, searching, filtering etc.
  • Avoiding repetition. The entity URI is defined only once, but it can be referred to many times.
  • Specifying the format (notation) alongside the entity. In case the system doesn't provide or know the format, or there are multiple formats or displaying methods to choose from (show or download for example), there is no need to clutter the document with this information.

Yet so far I haven't been able to find any dataset or application that predominantly uses this mechanism. In practice, all of these points are defeated:

  • The vast majority of resources are still linked to in a schema-specific way, as in XHTML. XLink is used for standardized linking to resources in the XML way. XML Schema defines anyURI, so links can still be found automatically (there is a difference between embedding and linking to a resource, though).
  • Internal parsed entities can already provide a way to reuse a URI in any place in the document. Compression further reduces the need to care about larger documents in datasets.
  • HTTP, the most widely used protocol, provides means for specifying or negotiating the format of the target file. This has the advantage that the server is not locked into storing the file in one specific format; it could, for example, upgrade to a better image format (e.g. PNG over GIF) without the need to modify any document that refers to it.
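As an illustration of the second point, here is a small sketch of an internal parsed entity being reused inside attribute values. The gallery document, the entity name base, and the example.org URL are made up for this example; the parsing uses Python's standard xml.etree.ElementTree, which expands internal entities declared in the internal subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the URI prefix is declared once and reused twice.
DOC = """<!DOCTYPE gallery [
<!ENTITY base "http://example.org/images/">
]>
<gallery>
  <img src="&base;img1.gif"/>
  <img src="&base;img2.gif"/>
</gallery>
"""

root = ET.fromstring(DOC)

# The entity reference is already expanded by the time the application sees it.
sources = [img.get("src") for img in root.iter("img")]
print(sources)
```

Unlike an unparsed entity, this carries no notation: the application just sees an ordinary string and has to know by itself that src holds a URI.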

All tutorials about this mechanism that I've found simply state what it can be used for (mostly copying paragraphs from other documents), with examples of custom DTDs like the one above. Additionally, since an entity like this can only be referenced from an attribute value, it can never actually be considered part of the content of any element, and its processing always depends on the application.

Is there a system using or relying on external entities and notations? Are there applications that recognize entities used this way and are able to understand notations? What kind of public IDs for notations can I use reasonably, and what are some real-world examples of system IDs? And are there common public IDs for entities or notations?


Solution

  • Notations and unparsed entities are notably used by DocBook and TEI.

    They are also used for a general templating/parametric macro expansion mechanism in my SGML software (http://sgmljs.net), much in the spirit of adding features to SGML without new syntax. Specifically, in SGML (but not XML), entity declarations can have data attributes, as in

    <!ENTITY e SYSTEM "..." NDATA sgml [ x=1 y=2 ]>
    

    Support for XLink/XInclude is generally just as spotty as, or arguably even spottier than, support for entity/notation declarations, given that the latter are core SGML/XML constructs (see e.g. "Trying to use XInclude with Java and resolving the fragment with xml:id"). The graver concern with XInclude is that it interacts with schema validation in unintended ways (see "XInclude Schema/Namespace Validation?") because it is layered as an XML application/vocabulary rather than a core feature.

    XLink might be nice on paper (I don't think it's even that, given that it blindly brings over HyTime concepts without context, e.g. with an extremely vague specification of link roles other than plain HTML-like links). But the reality is that the most common document format out there by far (i.e. HTML) makes use of URLs, which XML can't reasonably deal with at all because URLs allow, and frequently contain, & (ampersand) characters that XML always wants to interpret as the start of an entity reference. The WebSGML revision of SGML (created by the authors of the original XML spec along with introducing XML as a standalone subset of SGML to align these two specs) has introduced data specification attributes (explained in http://sgmljs.net/docs/parsing-html-tutorial/parsing-html-tutorial.html) to deal with this problem specifically.
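The ampersand problem is easy to reproduce. The following sketch, using only Python's standard library, shows a URL with a query string being rejected when pasted verbatim into an XML attribute and accepted once the ampersand is escaped; the URL and the helper name parses are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Hypothetical URL with two query parameters, as commonly found in HTML.
url = "http://example.org/search?q=sgml&lang=en"


def parses(xml_text):
    """Return True if xml_text is accepted by the XML parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


raw = '<a href="%s"/>' % url           # URL pasted verbatim, as HTML would allow
escaped = "<a href=%s/>" % quoteattr(url)  # quoteattr adds quotes and escapes &

print(parses(raw))
print(parses(escaped))
print(ET.fromstring(escaped).get("href") == url)
```

So in XML every such URL has to be rewritten with &amp; before it can appear in a document, which is the friction the data specification attributes mentioned above are meant to avoid in SGML.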

    Update: regarding commonly used public identifiers for notations in SGML and XML, there are:

    • the historic, withdrawn ISO/IEC 9070 spec and the identifiers it defines (see http://xml.coverpages.org/wg4-n1990.html)

    • the older ISO HTML 4 spec (ISO/IEC 15445) assigning alternate public identifiers for (ISO) HTML as opposed to the well-known ones for W3C HTML 4 (see http://www.cs.tcd.ie/misc/15445/15445.dtd)

    • the storage notation identifiers of ISO/IEC 10744 (HyTime 2nd ed), though these really are only for use in formal system identifiers (see eg http://sgmljs.net/docs/sgmlrefman.html#identifiers for an explanation), among them a convention for defining a notation for an external program to be used as viewer app via MIME/IANA media type associations

    • a convention to establish new identifiers in ISO/IEC 8879:1986 Technical Corrigendum 2 (aka WebSGML aka Annex K) delegating formation of unique identifiers to domain name resolution; for example +//IDN www.someisp.net/users/mtb refers to the notation whose spec document lives at the canonical location http://www.someisp.net/users/mtb

    There are also the well-known entity sets for special characters in SGML, HTML, and XML (and specifically MathML).