
Default XML schema / XSD when none is specified?


I am capturing lessons learned about data I/O from a data analyst's perspective, without the benefit of data engineering expertise (and being quite explicit about that shortcoming). To give context to the various alternatives, and taking into consideration the constraints within my shop, I've experimented briefly with XML import/export and done some online reading about schemas. One thing I noticed about an open source utility for a 4th generation language environment is that it seems to use a default schema (I haven't specified one):

<?xml version="1.0" encoding="utf-8"?>
<y>
   <DataFrame1>
      <DataFrame1_Field1>[75;75;75;75;75;75;75;75;75;...;75;75]</DataFrame1_Field1>
      <DataFrame1_Field2>[2014;2014;2015;2015;2016;2016;...;2083;2084;2084;2085;2085;2086;2086]</DataFrame1_Field2>
      <DataFrame1_Field3>
         <item>ABC</item>
         <item>DEF</item>
      <...snip...>
         <item>00-00</item>
         <item>00-00</item>
         <item>00-00</item>
      </DataFrameP_FieldM>
      <DataFrameP_FieldN>[2;2;4;2;5;3;5;3;3;1;5;5;...;4;5;3;3;2;4;2;1;2;4]</DataFrameP_FieldN>
   </DataFrameQ>
   <DataFrameR>
      <DataFrameR_Field1>[75;75;75;75;75;75;...;75;75;75;75;75]</DataFrameR_Field1>
      <DataFrameR_Field2>[1;2;3;4;5;6;7;...;1638;1639;1640;1641;1642]</DataFrameR_Field2>
      <DataFrameR_Field3>[0;0;0;0;0;0.014925;0.223881;0.014925;...;0;0.059701;0;0;0;0;0;0;0.626866]</DataFrameR_Field3>
   </DataFrameR>
   <DataFrameS>
      <DataFrameS_Field1>[75;75;75;75;75;75;...;75;75;75;75;75;75;75]</DataFrameS_Field1>
      <DataFrameS_Field2>[1;1;1;1;1;1;1;...;1642;1642;1642;1642;1642]</DataFrameS_Field2>
      <DataFrameS_Field3>[0;0;0;0;0;0;0;0;...;7;0.7;0.7;0.8;0.8;0.8;0.9;0.9;1]</DataFrameS_Field3>
      <DataFrameS_Field4>[0;0.1;0.2;...;0;0.1;0.2;0;0.1;0]</DataFrameS_Field4>
      <DataFrameS_Field5>[1;0.9;0.8;...;0.3;0.2;0.1;0;0.2;0.1;0;0.1;0;0]</DataFrameS_Field5>
      <DataFrameS_Field6>[0;0;0;0;0;0;...1;1;1;1;1;1;1;1;1;1]</DataFrameS_Field6>
   </DataFrameS>
</y>

Interpreting the labels: All labels starting with the string "DataFrame..." are anonymizations I made in the code. Before anonymization, DataFrameX (where X is any alphanumeric character) was the name of a data frame object in my 4GL environment [1]. All labels containing the strings "DataFrame" and "Field" are also anonymizations. Before anonymization, they were the names of fields within data frames. The label <y> is just the object name of the collection of data frames in the 4GL environment.

The arrangement of the data all makes sense to me, knowing what I do about the data frames from which the data come. All the tagging makes sense. I assumed that the tags come from a generic default schema. However, my web searching has not revealed any indication that such a default schema exists, much less that one has been agreed upon or standardized. Is there such a generic default, or are these tags the export utility author's own design?

[1] The 4GL environment is Matlab, but my question is about XML practices & conventions rather than Matlab.


Solution

  • There is no default XML schema for an arbitrary XML file. There are rules of well-formedness given by the W3C XML Recommendation, but those define XML itself rather than the vocabulary and grammar of any given XML schema.
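
    To illustrate the difference between well-formedness and a schema, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`. The helper name `is_well_formed` and the two sample strings are made up for illustration. The parser accepts any element names as long as the XML rules hold, and it consults no schema:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if text parses as XML. No schema is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Arbitrary element names are fine: well-formedness says nothing about vocabulary.
good = "<y><DataFrame1><item>ABC</item></DataFrame1></y>"
# A mismatched end tag breaks well-formedness, regardless of any schema.
bad = "<y><DataFrame1><item>ABC</item></DataFrameQ></y>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

    Checking that a document uses only the element names and structure that some vocabulary allows is a separate step (validation), and it requires an actual schema.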

    Identifying an XSD when none is specified

    1. When schemaLocation is specified in the XML, see the XSD specified there. For more on schemaLocation, see How to link XML to XSD using schemaLocation or noNamespaceSchemaLocation?
    2. When only a namespace is used, see How to locate an XML Schema (XSD) by namespace?
    3. When the provider of the XML is available, ask or inspect the source/documentation.
    4. When relatively unique or informative element names are used, or if you know the sector/industry, search the web for the element names, or for the sector/industry, together with "xml schema".

    If none of the above work, go schema-less, or write your own to fit the data.
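
    The first two checks above can be sketched in code. This is only an illustration, assuming Python's standard-library `xml.etree.ElementTree`; the function name `schema_hints`, the `example.com` namespace, and `orders.xsd` are hypothetical. It reports whatever schema hints the root element carries:

```python
import xml.etree.ElementTree as ET

# Namespace of the xsi:schemaLocation / xsi:noNamespaceSchemaLocation attributes.
XSI = "http://www.w3.org/2001/XMLSchema-instance"

def schema_hints(xml_text: str) -> dict:
    """Collect the schema-related hints present on the root element."""
    root = ET.fromstring(xml_text)
    tag = root.tag  # ElementTree spells namespaced names as "{uri}local"
    namespace = tag[1:tag.index("}")] if tag.startswith("{") else None
    return {
        "schemaLocation": root.get(f"{{{XSI}}}schemaLocation"),
        "noNamespaceSchemaLocation": root.get(f"{{{XSI}}}noNamespaceSchemaLocation"),
        "namespace": namespace,
        "root": tag.split("}")[-1],
    }

# Hypothetical document that does point at a schema.
with_hint = (
    '<order xmlns="http://example.com/orders" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://example.com/orders orders.xsd"/>'
)
# Document shaped like the one in the question: no hints at all.
bare = "<y><DataFrame1/></y>"

print(schema_hints(with_hint))
print(schema_hints(bare))
```

    When every hint comes back empty, as for the second document, only the remaining options apply: ask the provider, search on the element names, or go schema-less.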


    More on XML design

    In the comments, @user2153235 asks:

    Is there a prevailing practice (or even a universal, minimal "base" scheme that is defaulted to in the absence of an explicit schema) wherein the atomic element is "item", and any other tag represents an element that is either a string or a structure composed of subordinate elements?

    Answer to the question: No. There is no such prevailing practice and no universal, minimal "base" schema; there are only the rules of well-formedness for XML itself.

    The XML in your post is poorly designed:

    • Naming is terrible:
      • The root element is named y, yet the content is clearly not a simple y-coordinate or anything else that could reasonably be described as y.
      • DataFrame-based names have C character suffixes followed by _FieldN numeric suffixes. Unless the C character is meaningful in some domain, the abbreviation ought to be expanded. Hard-wired numerical suffixes on list members are better left implied by position, so that the name can lexically signal type without having to be decomposed.
    • Substructure is left unmarked up: Generally, structure shouldn't be buried in micro-formats within strings; mark-up should be imposed so that the XML parser can be leveraged rather than having to implement micro-parsers within an application.
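
    As a rough sketch of that last point, compare reading a packed string with reading marked-up items. This assumes Python's standard-library `xml.etree.ElementTree`; the element name `Field2`, the three values, and both function names are illustrative only, not taken from the export utility:

```python
import xml.etree.ElementTree as ET

# Micro-format: the list structure is hidden inside a string.
packed = "<Field2>[2014;2014;2015]</Field2>"
# Marked-up alternative: the list structure is visible to the XML parser.
marked_up = "<Field2><item>2014</item><item>2014</item><item>2015</item></Field2>"

def parse_packed(xml_text: str) -> list:
    """The application needs its own micro-parser for brackets and semicolons."""
    text = ET.fromstring(xml_text).text
    return [int(value) for value in text.strip("[]").split(";")]

def parse_marked_up(xml_text: str) -> list:
    """The XML parser already delivers one element per value."""
    return [int(item.text) for item in ET.fromstring(xml_text).findall("item")]

print(parse_packed(packed))
print(parse_marked_up(marked_up))
```

    Both functions return the same list, but only the second format lets generic XML tooling (a schema, XPath, XSLT) see the individual values.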