
What's the difference between "not well-formed" XML and "invalid" XML?


I made a claim that an unescaped ampersand in some XML source was "invalid XML". LarsH then mentioned:

technically, the XML is "not well-formed". "Invalid" would mean that it fails to conform to a specific schema.

I tried to find official definitions of "invalid" XML and "not well-formed" XML to confirm LarsH's claim, but I wasn't able to find any definitions in an official specification to compare.

How does "invalid" XML differ from "not well-formed" XML?
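
To make the ampersand case concrete, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`, which checks well-formedness only and performs no validation. The helper name `is_well_formed` and the sample strings are illustrative assumptions, not part of the original discussion.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser rejects the input before any schema or DTD is consulted.
        return False
    return True


unescaped = "<title>Tom & Jerry</title>"      # bare ampersand
escaped = "<title>Tom &amp; Jerry</title>"    # ampersand written as an entity

print(is_well_formed(unescaped))
print(is_well_formed(escaped))
```

The first string is rejected without any schema or DTD being involved, which is what "not well-formed" refers to.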


Solution

  • I think the general difference is clear, and Nathan's and Shawn's answers are accurate. The unclear corner case that raised the question is this:

    • If a document is not well-formed, can it be valid? Can it be invalid?

    From long experience working with XML, my impression is that the question of validity is undefined for a non-well-formed XML fragment. But I couldn't prove that from the XML spec.

    In theory

    The official definition of "valid" in the XML spec is:

    Definition: An XML document is valid if it has an associated document type declaration and if the document complies with the constraints expressed in it.

    Note that this definition begins with "an XML document". An XML document is defined as:

    Definition: A data object is an XML document if it is well-formed, as defined in this specification. In addition, the XML document is valid if it meets certain further constraints.

    This means that the above definition of "valid" is only applicable to XML documents, that is, to well-formed data objects. About data objects that are not (well-formed) XML documents, the definition of "valid" doesn't say anything.
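
    One way to picture this reading is as a three-way classification rather than a yes/no answer. The sketch below is only an illustration of that idea: the `Status` names, the `classify` function, and the `conforms` callback are hypothetical, and the toy rule "the root element must be `note`" merely stands in for a real DTD, which Python's standard library cannot validate against.

```python
import xml.etree.ElementTree as ET
from enum import Enum
from typing import Callable


class Status(Enum):
    NOT_WELL_FORMED = "not well-formed (validity not defined)"
    VALID = "valid"
    NOT_VALID = "not valid"


def classify(text: str, conforms: Callable[[ET.Element], bool]) -> Status:
    """Classify `text` following the reading described above.

    `conforms` is a hypothetical stand-in for checking a document against
    its declared constraints; it is only called on well-formed input.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Not an XML document, so the definition of "valid" does not apply.
        return Status.NOT_WELL_FORMED
    return Status.VALID if conforms(root) else Status.NOT_VALID


# Toy constraint, assumed purely for illustration.
def root_is_note(root: ET.Element) -> bool:
    return root.tag == "note"


print(classify("<note>a &amp; b</note>", root_is_note))
print(classify("<memo>a &amp; b</memo>", root_is_note))
print(classify("<note>a & b</note>", root_is_note))
```

    The third state is named "not valid" rather than "invalid" on purpose, since the spec defines only the former.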

    Various hermeneutic questions remain...

    • Do we take the above definition of valid as exhaustive... that is, do we assume that nothing else written about validity is definitive? (If so, we ignore in/validity based on XML Schema or RelaxNG, etc.)

    • Do we take the "if" as "only if"? E.g. can a well-formed XML document with no DTD be considered valid too? Can a not-well-formed XML document be considered valid if it conforms to its associated DTD? (Bob DuCharme seems to say that this definition means "only if": "The XML spec explicitly says that valid documents must be well-formed [emphasis mine].")

    • Can we assume that every XML document that is not "valid" is "invalid"? I think so. But what about every data object? E.g. is there such a thing as undefined validity status? The XML spec never defines the term "invalid", leaving some leeway for interpretation. It's clear that if X is invalid, it's not valid. But what about the converse: if X is not valid, does that mean it must be invalid?

    Taking a look at respected experts on XML, outside of the spec... Bob DuCharme writes that an XML "document that isn't valid ... may still be well-formed...", implying that an XML document that isn't valid might not be well-formed. But again, is "not valid" the same as "invalid"? And furthermore, according to the spec, an XML document is well-formed by definition. So technically, an XML document that isn't valid must still be well-formed. I believe DuCharme is using terms somewhat loosely here.

    We also need to keep in mind the broader context of SGML-descended languages, including HTML. This web page gives examples of XHTML pages that it says are valid according to the W3C validator service, but are not well-formed. However, when I run them through the validator service, it doesn't report them as valid.

    In practice

    In practice, it's difficult for any validation engine to work with anything that's not well-formed XML. It would have to first "correct" the input data, guessing what the intended, correct XML structure should be, and there is no official specification for that process. So the results could differ significantly between implementations. Validation would then be implementation-dependent.
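
    As a small illustration of why a validator has nothing to work with, the sketch below (using Python's standard `xml.etree.ElementTree`; the helper name `try_parse` and the sample input are assumptions made for this example) shows that a conforming parser stops at the first well-formedness error and returns no tree at all.

```python
import xml.etree.ElementTree as ET


def try_parse(text: str):
    """Return (root, None) on success, or (None, (line, column)) on failure."""
    try:
        return ET.fromstring(text), None
    except ET.ParseError as err:
        # The parser gives up at the first error; no partial tree is returned.
        return None, err.position


# Two problems: a bare ampersand, and a <c> element that is never closed.
root, position = try_parse("<a><b>R&D</b><c></a>")

print(root)      # no element tree for a validator to inspect
print(position)  # location of the first error only
```

    Only the first problem is reported. Anything beyond that point would require the kind of guessing described above.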

    Conclusion

    For that reason, I would say that for all practical purposes, it's misleading to claim that a data object is invalid XML if it's not a (well-formed) XML document. If you mean to communicate (as in the case of the unescaped ampersand) that the data is not well-formed XML, then the term "invalid" communicates the wrong thing, even if it could arguably be considered true. It's a bit like saying a spider is not a fruit fly because it has more than six legs, when you mean that a spider is not an insect because it has more than six legs. It's true that a spider is not a fruit fly, but the intended meaning wasn't communicated.

    Nevertheless, I don't see an ironclad argument from the XML spec that says whether a data object that is not well-formed XML can be (or must be) invalid. If we follow Bob DuCharme, which we probably should, we can safely conclude that a data object that is not well-formed XML cannot be valid.

    Certainly if we say, referring to a data object that is not well-formed XML, that it is invalid, we raise confusion, and we may easily be understood to be claiming something other than that it's not well-formed.

    I would expect that further reading of experts on XML could give us a better idea of consensus about this question, even if the answer isn't as official as the XML spec.