I am capturing lessons learned about data I/O from a data analyst's perspective, without the benefit of data engineering expertise (and being quite explicit about that shortcoming). To give context to the various alternatives, and taking into consideration the constraints within my shop, I've experimented briefly with XML import/export and done some online reading about schemas. One thing I noticed about an open source utility for a 4th generation language environment is that it seems to use a default schema (I haven't specified one):
<?xml version="1.0" encoding="utf-8"?>
<y>
<DataFrame1>
<DataFrame1_Field1>[75;75;75;75;75;75;75;75;75;...;75;75]</DataFrame1_Field1>
<DataFrame1_Field2>[2014;2014;2015;2015;2016;2016;...;2083;2084;2084;2085;2085;2086;2086]</DataFrame1_Field2>
<DataFrame1_Field3>
<item>ABC</item>
<item>DEF</item>
<...snip...>
<item>00-00</item>
<item>00-00</item>
<item>00-00</item>
</DataFrameP_FieldM>
<DataFrameP_FieldN>[2;2;4;2;5;3;5;3;3;1;5;5;...;4;5;3;3;2;4;2;1;2;4]</DataFrameP_FieldN>
</DataFrameQ>
<DataFrameR>
<DataFrameR_Field1>[75;75;75;75;75;75;...;75;75;75;75;75]</DataFrameR_Field1>
<DataFrameR_Field2>[1;2;3;4;5;6;7;...;1638;1639;1640;1641;1642]</DataFrameR_Field2>
<DataFrameR_Field3>[0;0;0;0;0;0.014925;0.223881;0.014925;...;0;0.059701;0;0;0;0;0;0;0.626866]</DataFrameR_Field3>
</DataFrameR>
<DataFrameS>
<DataFrameS_Field1>[75;75;75;75;75;75;...;75;75;75;75;75;75;75]</DataFrameS_Field1>
<DataFrameS_Field2>[1;1;1;1;1;1;1;...;1642;1642;1642;1642;1642]</DataFrameS_Field2>
<DataFrameS_Field3>[0;0;0;0;0;0;0;0;...;7;0.7;0.7;0.8;0.8;0.8;0.9;0.9;1]</DataFrameS_Field3>
<DataFrameS_Field4>[0;0.1;0.2;...;0;0.1;0.2;0;0.1;0]</DataFrameS_Field4>
<DataFrameS_Field5>[1;0.9;0.8;...;0.3;0.2;0.1;0;0.2;0.1;0;0.1;0;0]</DataFrameS_Field5>
<DataFrameS_Field6>[0;0;0;0;0;0;...1;1;1;1;1;1;1;1;1;1]</DataFrameS_Field6>
</DataFrameS>
</y>
Interpreting the labels: All labels starting with the string "DataFrame..." are anonymizations I made in the code. Before anonymization, DataFrameX (where X is any alphanumeric character) was the name of a data frame object in my 4GL environment [1]. All labels containing the strings "DataFrame" and "Field" are also anonymizations. Before anonymization, they were the names of fields within data frames. The label <y> is just the object name of the collection of data frames in the 4GL environment.
The arrangement of the data all makes sense to me, knowing what I do about the data frames from which the data come. All the tags make sense. I assumed that they come from a generic default schema. However, my web searching has not revealed any indication that such a default schema exists, much less that one has been agreed upon or standardized. Is there such a generic default, or are these tags the invention of the export utility's author?
[1] The 4GL environment is Matlab, but my question is about XML practices & conventions rather than Matlab.
There is no default XML schema for an arbitrary XML file. There are rules of well-formedness given by the W3C XML Recommendation, but those define XML itself rather than the vocabulary and grammar of any given XML schema.
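To make the distinction concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It checks well-formedness only; no schema is consulted at any point. The helper name `is_well_formed` and the two sample strings are illustrative, with the element names borrowed from the question.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text parses as XML. No schema is involved."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Made-up element names are accepted as long as tags nest and close properly.
good = "<y><DataFrameR><DataFrameR_Field1>[75;75]</DataFrameR_Field1></DataFrameR></y>"

# A mismatched closing tag violates well-formedness, whatever the names are.
bad = "<y><DataFrameR></DataFrameS></y>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

The parser has no opinion about whether `DataFrameR` or `item` are sensible names; deciding that is the job of a schema, if one exists.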
If a schemaLocation is specified in the XML, see the XSD specified there. For more on schemaLocation, see How to link XML to XSD using schemaLocation or noNamespaceSchemaLocation?
If none of the above work, go schema-less, or write your own to fit the data.
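As a rough sketch of the schemaLocation check, the following looks for the two schema-hint attributes on the root element, where they conventionally appear. The function name `schema_hints` and the file name `frames.xsd` are hypothetical; the namespace URI is the standard XML Schema instance namespace.

```python
import xml.etree.ElementTree as ET

# Standard XML Schema instance namespace (conventionally bound to the xsi prefix).
XSI = "http://www.w3.org/2001/XMLSchema-instance"


def schema_hints(xml_text: str) -> dict:
    """Collect xsi:schemaLocation / xsi:noNamespaceSchemaLocation from the root."""
    root = ET.fromstring(xml_text)
    hints = {}
    for name in ("schemaLocation", "noNamespaceSchemaLocation"):
        # ElementTree exposes namespaced attributes as "{uri}localname".
        value = root.get(f"{{{XSI}}}{name}")
        if value is not None:
            hints[name] = value
    return hints


# Hypothetical document that does point at an XSD.
with_hint = (
    '<y xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:noNamespaceSchemaLocation="frames.xsd"/>'
)

# Like the export in the question: no hint at all.
without_hint = "<y><DataFrameR/></y>"

print(schema_hints(with_hint))
print(schema_hints(without_hint))
```

An empty result, as for the export shown in the question, means the file itself points at no schema.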
In the comments, @user2153235 asks:
Is there a prevailing practice (or even a universal, minimal "base" scheme that is defaulted to in the absence of an explicit schema) wherein the atomic element is "item", and any other tag represents an element that is either a string or a structure composed of subordinate elements?
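Answer to the question: No, there is no universal, minimal "base" schema, just the rules of well-formedness for XML itself. XML assigns no special meaning to an element named item; names like that are chosen by whoever designed the vocabulary, here the author of the export utility.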
The XML in your post is poorly designed:
The root element is named y, yet the content is clearly not a simple y-coordinate or anything else that could reasonably be described as y.
DataFrame-based names have C character suffixes followed by _FieldN numeric suffixes. Unless the C character is meaningful in some domain, the abbreviation ought to be expanded. Hard-wired numerical suffixes on list members are better left implied by position so that the name can lexically signal type without having to decompose.
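One possible reading of the naming critique above, sketched in Python: every data frame gets the same element name, every field gets the same element name, and order is carried by position rather than by a suffix. The root name `DataFrames`, the `name` attributes, and the sample data are my assumptions for illustration; they are not something the export utility produces.

```python
import xml.etree.ElementTree as ET


def frames_to_xml(frames: dict) -> str:
    """Serialize {frame_name: {field_name: [values]}} with type-signalling tags."""
    root = ET.Element("DataFrames")  # descriptive root name (assumed) instead of "y"
    for frame_name, fields in frames.items():
        # Same tag for every frame; identity moves to an attribute (assumption).
        frame = ET.SubElement(root, "DataFrame", name=frame_name)
        for field_name, values in fields.items():
            field = ET.SubElement(frame, "Field", name=field_name)
            for value in values:
                # List members carry no numeric suffix; position implies order.
                ET.SubElement(field, "item").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical data standing in for the anonymized frames.
frames = {
    "Rates": {"Year": [2014, 2015], "Code": ["ABC", "DEF"]},
    "Scores": {"Value": [0.1]},
}

print(frames_to_xml(frames))
```

With uniform tags, a consumer can select all frames or all fields by element name alone, without decomposing names such as DataFrameR_Field3.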