Why is my dotNetRDF graph failing to load due to the XML Namespaces specification?


I load an RDF file with graph.LoadFromFile(). The same file has been parsed successfully for years in another language, but dotNetRDF in C# throws the error "The value '4259-306N4220DP6' for rdf:ID is not valid, RDF IDs can only be valid NCNames as defined by the W3C XML Namespaces specification." Is there a way to bypass this specific rdf:ID and log it, a namespace I can manually include to allow it, or any other workaround?

When I removed the rdf:ID in question, parsing continued, but removing it in production is not an option. Adding an underscore to the front of the value also let processing continue.


Solution

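  • The message from dotNetRDF is correct: the exhibited rdf:ID is syntactically invalid. An NCName must start with a letter or an underscore, and this value starts with a digit, which is also why prefixing an underscore made the error go away. One solution would be to rewrite attributes of the form rdf:ID="x" to rdf:about="#x" since the latter does not have the same restrictions as the former (but see my Closing Remarks, below).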
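
    To see which values the NCName rule accepts, a quick check can be made with the .NET base class library alone. This is a minimal sketch: IsValidRdfId is a hypothetical helper name, not part of dotNetRDF, and it assumes that XmlConvert.VerifyNCName applies the same NCName rule that the error message refers to.

    ```csharp
    using System;
    using System.Xml;

    // Hypothetical helper (not a dotNetRDF API): reports whether a value
    // is a valid NCName and could therefore be used as an rdf:ID.
    bool IsValidRdfId(string value)
    {
        if (string.IsNullOrEmpty(value))
        {
            return false; // an empty value is never a valid NCName
        }
        try
        {
            XmlConvert.VerifyNCName(value); // throws XmlException when invalid
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    Console.WriteLine(IsValidRdfId("4259-306N4220DP6"));  // False: starts with a digit
    Console.WriteLine(IsValidRdfId("_4259-306N4220DP6")); // True: starts with an underscore
    ```

    A check like this could be used to scan and log offending values before the file ever reaches the parser.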

    Unfortunately, at the time of writing there is no public API within dotNetRDF 3.1 that allows us to correct or discard erroneous elements on the fly. The parser throws an exception that terminates processing at the point of the first error. That leaves us with no choice but to correct the XML before feeding it to dotNetRDF.

    Ideally, the upstream program would be changed. But since that has been ruled out in this case, we will have to take matters into our own hands.

    The code that follows is a bare-bones C# scripting example that shows a way to perform the rewrites using DOM manipulation. A streaming solution might be preferred, but the code is long enough as it is :)

    using System;
    using System.IO;
    using System.Xml;
    using VDS.RDF;
    using VDS.RDF.Parsing;
    using VDS.RDF.Writing;
    
    // sample RDF/XML with an invalid rdf:ID
    var rdfXml = """
        <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
            xmlns:ex='http://example.org/'
            xml:base='http://example.org/'>
        <rdf:Description rdf:ID='4259-306N4220DP6'>
            <ex:name>example</ex:name>
        </rdf:Description>
        </rdf:RDF>
        """;
    
    // load the XML as a DOM
    var doc = new XmlDocument();
    var ns = new XmlNamespaceManager(doc.NameTable);
    ns.AddNamespace("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
    doc.LoadXml(rdfXml);
    
    // replace rdf:ID="..." with rdf:about="#..." everywhere
    // if this code is removed then parsing will fail
    foreach (XmlAttribute attr in doc.SelectNodes("//@rdf:ID", ns))
    {
        attr.OwnerElement.SetAttribute("rdf:about", $"#{attr.Value}");
        attr.OwnerElement.RemoveAttributeNode(attr);
    }
    
    // load a graph from the corrected DOM
    var graph = new Graph();
    new RdfXmlParser().Load(graph, doc);
    
    // display the result as Turtle
    var sw = new System.IO.StringWriter();
    new CompressingTurtleWriter().Save(graph, sw);
    Console.WriteLine(sw.ToString());
    

    Result:

    @base <http://example.org/>.
    
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
    @prefix xml: <http://www.w3.org/XML/1998/namespace>.
    
    <http://example.org/#4259-306N4220DP6> <http://example.org/name> "example"^^<http://www.w3.org/2001/XMLSchema#string>.
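
    Since the question also asks about logging the offending ID, the rewrite can be narrowed so that only invalid rdf:ID values are touched and each one is reported. The following is a sketch under stated assumptions rather than a definitive implementation: RewriteInvalidRdfIds is a hypothetical helper, the sample XML (including the valid ID 'item1') is made up for illustration, and validity is judged with XmlConvert.VerifyNCName. The replacement attribute is created with both its prefix and its namespace URI so that it is namespace-qualified in the DOM.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Xml;

    // Hypothetical helper (not a dotNetRDF API): rewrites only the rdf:ID
    // attributes that are not valid NCNames and returns the offending values.
    List<string> RewriteInvalidRdfIds(XmlDocument doc)
    {
        const string rdfNs = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("rdf", rdfNs);

        // snapshot the matches so the DOM is not modified while it is enumerated
        var candidates = new List<XmlAttribute>();
        foreach (XmlAttribute attr in doc.SelectNodes("//@rdf:ID", ns))
        {
            candidates.Add(attr);
        }

        var rewritten = new List<string>();
        foreach (var attr in candidates)
        {
            var valid = false;
            if (!string.IsNullOrEmpty(attr.Value))
            {
                try { XmlConvert.VerifyNCName(attr.Value); valid = true; }
                catch (XmlException) { /* not an NCName */ }
            }
            if (valid) continue; // leave well-formed IDs untouched

            // capture the owner first: it is cleared once the attribute is removed
            var owner = attr.OwnerElement;
            var about = doc.CreateAttribute(attr.Prefix, "about", rdfNs);
            about.Value = "#" + attr.Value;
            owner.RemoveAttributeNode(attr);
            owner.SetAttributeNode(about);
            rewritten.Add(attr.Value);
        }
        return rewritten;
    }

    // made-up sample with one valid and one invalid rdf:ID
    var doc = new XmlDocument();
    doc.LoadXml("""
        <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
            xmlns:ex='http://example.org/'
            xml:base='http://example.org/'>
        <rdf:Description rdf:ID='item1'><ex:name>ok</ex:name></rdf:Description>
        <rdf:Description rdf:ID='4259-306N4220DP6'><ex:name>example</ex:name></rdf:Description>
        </rdf:RDF>
        """);

    var invalidIds = RewriteInvalidRdfIds(doc);
    foreach (var id in invalidIds)
    {
        Console.Error.WriteLine($"Rewrote invalid rdf:ID '{id}' to rdf:about");
    }
    ```

    The corrected document can then be handed to new RdfXmlParser().Load(graph, doc) exactly as in the code above.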
    

    Closing Remarks

    There is a reason why rdf:ID has this restriction. The assumption is that the ID will be used as a fragment identifier pointing to an element within an XML document, where identifiers carry similar restrictions. Some other content types impose the same restrictions, so the best-practice advice is to conform, at least in hash-style vocabularies. Slash-style vocabularies do not have the same issue (but have other implications).

    Of course, if the IRI is never dereferenced in this way then none of this matters.