regex xsd regular-language formal-languages

Is there an algorithm for determining if the set of all valid XML instances in respect with a specific XSD schema is a regular language or not?

Essentially I want to know if a specific XSD schema can be replaced by a regular expression or not. I know that XML Schema language can produce XSDs whose set of valid XML instances can be of any type of language (even context-sensitive). I want to identify those schemas that are "regex-equivalent". I came up with this question after tackling the following problem:

I needed to parse a specific text format and I first tried regular expressions and I saw that regexp is sufficient to parse it. I then wanted to make an XML representation for the messages that I received in this format so I mapped regex groups with XML elements. I then created manually an XSD schema based on the structure of the regex. In the end , I had a schema that could replace my regex, in the sense that the original regex was possible to be constructed from the schema. I also managed to do the opposite: Create the schema automatically from the regex. So I could transform the message into XML and validate it on the same time. My questions are:

Can every regex be represented by an XSD schema? (I mean, given a regex to be able to produce an XSD schema)
Given an arbitrary XSD schema is there a way to determine if there is a regex whose representation is the given schema?

~~EDIT: Probably the answer to 1st question is yes since i did it with my regex in a way that did not depend on the specific regex~~ (This isn't a proof for every regex).

Solution

XML Schema language is a super-set of regular languages, but only within the domain of XML documents, obviously.

For #1: with the added condition that the regex matches a well-formed XML document and nothing else, yes.

For #2: yes, it's a matter of checking for any features of XSD which are allowed in a regular language. Finding the regular expression would be much more work.

A regular language has a fairly simple definition, informally:

The empty set/string
Literals (a "singleton language"), e.g., "x"
For a regular language A, A* is also a regular language
For regular languages A and B, A|B (union) and AB (concatenated) are regular.

Basically, all concatenations and alternations are fine but recursion is impossible and there's no back-references or "memory". No element type can contain choice/all/element elements referencing itself or parent types, and you can't make use of any information you found earlier in the parsing process.

The restriction on recursion extends to the any element, which would be forbidden. By definition it accepts any element, including elements with sub-elements. Since you don't know the nesting depth of this unknown element you need a recursive pattern to match it, and you can't do that in a regular language.

The restriction on back-references means you can't do things like "some number of 'A' followed by the same number of 'B'" (A{n}B{n}). I don't think this is even possible in XSD, however, at least I can't think how you would do it.

Restricting numeric values (e.g., minInclusive) would not be possible in a regex.

The all element would be problematic in that it would have to accept all the possible orderings of child elements, which would make the regex expand exponentially (binomial coefficient, (n/k)^k <= n!/k!(n-k)! <= (ne/k)^k) with the number of child elements, and matching the regex is super-linear on that length. Recognizing attributes suffers from the same issue, since the ordering of attributes within an element is not constrained by the schema. Of course, if you only care about whether a regex exists and not about finding it, then it doesn't matter.