I have a data model representing a network graph. So I got Host entities (with their address, and many other attributes/elements) and I need to model in some way the Link entity (representing a network link between a source node and a destination node, with latency and throughput attributes).
The question is, I can't imagine a proper way to design the network using XML Schema. How should I design it in a proper way? (After XML design, I will use this schema with a Java application).
I suppose I should create a Network element as root element of the schema, but how can I manage Links between Hosts? Don't know if I have to put Link element inside the root element Network, so beside Host element, or if I have to put Link element inside Host element.
Here is a guide example
<xsd:element name="network" type="NetworkType"/>
<xsd:complexType name="NetworkType">
<xsd:sequence>
<xsd:element name="host" type="HostType"/>
<!-- don't know if put Link element here or inside HostType-->
</xsd:sequence>
</complexType>
Please ignore the schema's declaration missing and so on, I need just a modelling advice and, if you can, an example of how to use "host" or the host's attribute "hostName" (not showed in the above example) as key and how to make "link"'s elements/attributes sourceHost and destHost to refer to the previous one.
EDIT: I will just tell you more about the modeling problem, I noticed my question is not very accurate. Since I'm modeling a network infrastructure, I don't even care about a Vertex (Host) not "connected" to other vertices (hosts). Said that, I thought about modeling the graph by means of Links only, and since I don't care about source vertex and destination vertex (for my use case) I could model it inserting only a Link for each couple of connected vertices. But the fact is, I have to model a XML application (and a XML schema) starting from a Java generic interface and representing all the information referring to it. Let us assume the interface is
public interface NetworkReader {
public Set<Host> getHosts();
public Host getHost(String hostName);
public Connection getConnectionPerformance(Host h1, Host h2);
}
Given an interface like this, I choosed to also include Host elements in my root element Network (it can make easier host's accesses required by first and second method of the interface), that's why the above consideration about a links-only network element fails (in my opinion).
As you can notice, the third method requires the information about the link state given two hosts, that's why I need also the Link element in my XSD.
Since you say it's the modeling problem you're interested in, not the XSD details, let's consider some alternatives.
Abstractly, a graph is a pair (V, E), where V is an arbitrary set and E is a relation on V, i.e. a set of pairs (v1, v2) where (a) v1 and v2 are both in V and (b) (v2, v1) is in E if and only if (v2, v1) is in E. The members of V are the vertices of the graph and the members of E are the edges. Some definitions of graph make E a bag of edges, not a set, so that two vertices can be linked by zero or more arcs; some definitions allow and others forbid edges with v1 = v2.
In XML, there are three fairly obvious ways to represent a graph:
An element for each vertex and an element for each edge giving the pair of endpoints in either order, neither enclosed within the other. A graph of three nodes a, b, c, with edges from b to itself and from a to be might be:
<graph>
<vertex id="a"/>
<vertex id="b"/>
<vertex id="c"/>
<edge endpoints="a b"/>
<edge endpoints="b b"/>
</graph>
Some users (and probably some tool chains) will prefer that the endpoints of edges be given by children, not by an attribute; it's your schema, you decide on the basis of your own knowledge, skill, and taste.
An element for each node, and subordinate elements indicating what other elements it's adjacent to. If we allow the edge to be recorded at either end and not necessarily both, we might have for the graph described above
<graph>
<vertex id="a">
<adjacent vertex="b"/>
</vertex>
<vertex id="b">
<adjacent vertex="b"/>
</vertex>
<vertex id="c"/>
</graph>
Depending on the relative frequency of operations like update and search, we might prefer to require that every edge is recorded at both ends, so every vertex has as children a complete list of all adjacent nodes (at the cost of more complicate validation of the XML); then we might need:
<graph>
<vertex id="a">
<adjacent vertex="b"/>
</vertex>
<vertex id="b">
<adjacent vertex="b"/>
<adjacent vertex="a"/>
</vertex>
<vertex id="c"/>
</graph>
Note that in this representation, the set of edges is represented indirectly by the adjacency relation. For some purposes, that's a good idea; for others, it will probably be a bad idea. Your choice.
Just as it's possible to subordinate edges to vertices in the XML, it's possible to subordinate vertices to edges. As the graph is not necessarily connected, we will also need some other way to signal vertices which are not incident to any edge. Our example graph might be:
<graph>
<edge endpoints="a b"/>
<edge endpoints="b b"/>
<isolated vertices="c"/>
</graph>
Here, it is the set of vertices which is implicit (it is distinct-values(for $e in $graph/edge return tokenize(@endpoints,' '), tokenize($graph/isolated/@vertices,' '))
).
Any of these are simple to define in XSD; making the XSD enforce the necessary referential integrity constraints will likely be easier in some representations than in others. (In particular, in the second variant requiring that each edge be represented at both ends will be difficult in XSD 1.0.)
Note that in each case there is some tradeoff between directness of expression and redundancy. In method 1, we have a simple XML representation of the set of vertices and also of the set of edges. But the separation of the two means that in order to check that the edges are represented correctly we must examine each endpoint value in the edges to make sure it names a known vertex. Also, if we are only ever interested in vertices connected to other vertices -- if, that is, an isolated vertex is an encoding error -- then in method 1 we also need to check each vertex to make sure it's named as an endpoint by at least one edge. In method 2, one endpoint for each edge is guaranteed correct, because edges only appear as children of vertices; we do however have to check each identifier for the other vertex on each edge. And method 2 requires either redundant information about each link under each endpoint or else it requires a search of all nodes in the network in order to locate all edges connected to a given edge.
If you have non-trivial information to store about each node and each link, then method 1 will be least redundant.