c# dotnetrdf

VDS.RDF.GraphHandler for skipping Triples


I want to parse only some of the data in a ~100 MB RDF cell line ontology. So far I am interested in 169,796 out of 1,387,097 triples (the other 1,217,301 triples are skipped).

Creating the graph with the handler below takes ~24 seconds. That is only a few seconds less than parsing the whole ontology.

Is there anything I could improve in the way I skip the triples I am not interested in?

Thanks!

    private class MyHandler : VDS.RDF.GraphHandler
    {
        public MyHandler(IGraph g)
            : base(g)
        {
        }

        protected override bool HandleTripleInternal(Triple t)
        {
            if (t.Predicate is UriNode uri
                && uri.Uri.AbsoluteUri != "http://www.w3.org/2000/01/rdf-schema#subClassOf"
                && uri.Uri.AbsoluteUri != "http://www.w3.org/2000/01/rdf-schema#label")
            {
                return true;
            }
            else if (t.Object is LiteralNode l && l.Language == "zh")
            {
                return true;
            }
            return base.HandleTripleInternal(t);
        }
    }

Solution

  • To make the node comparison a bit faster, you could compare directly against UriNode instances created from the graph instead of comparing URI strings. Use IGraph.CreateUriNode() in your handler's constructor to create nodes for rdfs:subClassOf and rdfs:label, then use IUriNode.Equals() as the comparator. You should find that the node comparison can then use a faster object reference equality check rather than a string comparison.

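    private class MyHandler : GraphHandler
    {
        // Predicate nodes created once, so each triple is compared against nodes rather than strings
        private readonly IUriNode _rdfsSubClassOf;
        private readonly IUriNode _rdfsLabel;

        public MyHandler(IGraph g)
            : base(g)
        {
            _rdfsSubClassOf = g.CreateUriNode(UriFactory.Create("http://www.w3.org/2000/01/rdf-schema#subClassOf"));
            _rdfsLabel = g.CreateUriNode(UriFactory.Create("http://www.w3.org/2000/01/rdf-schema#label"));
        }

        protected override bool HandleTripleInternal(Triple t)
        {
            // Skip triples whose predicate is neither rdfs:subClassOf nor rdfs:label.
            // Returning true keeps the parser going without adding the triple to the graph.
            if (t.Predicate is UriNode uri
                && !uri.Equals(_rdfsSubClassOf)
                && !uri.Equals(_rdfsLabel))
            {
                return true;
            }
            // Skip literals tagged with the language "zh"
            else if (t.Object is LiteralNode l && l.Language == "zh")
            {
                return true;
            }
            // Everything else is added to the graph by the base handler
            return base.HandleTripleInternal(t);
        }
    }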
    

    However, that only speeds up the filter itself. I suspect that if you profile the parse of the file, you will find that most of the time is spent parsing the syntax to create each triple that is passed to your filter. There isn't really a way around this in the dotNetRDF handler architecture.
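
One rough way to check where the time goes is to time a parse that stores nothing and compare it with a parse that builds the graph. The sketch below is only an illustration under a few assumptions: the file name cl.owl is hypothetical, the ontology is assumed to be RDF/XML (hence RdfXmlParser), and the ParseTiming class and its Time helper are names made up for this example. It uses the CountHandler that ships with dotNetRDF as the "parse only" baseline.

```csharp
using System;
using System.Diagnostics;
using VDS.RDF;
using VDS.RDF.Parsing;
using VDS.RDF.Parsing.Handlers;

public static class ParseTiming
{
    // Hypothetical helper: parses the file with the given handler and returns the elapsed time.
    public static TimeSpan Time(IRdfReader parser, IRdfHandler handler, string path)
    {
        var sw = Stopwatch.StartNew();
        parser.Load(handler, path);
        sw.Stop();
        return sw.Elapsed;
    }

    public static void Main(string[] args)
    {
        // "cl.owl" is a placeholder; pass the real path as the first argument.
        string path = args.Length > 0 ? args[0] : "cl.owl";
        var parser = new RdfXmlParser(); // assumption: the ontology is RDF/XML

        // Baseline: CountHandler only counts triples and stores none of them,
        // so this run is dominated by the cost of parsing the syntax.
        var counter = new CountHandler();
        TimeSpan parseOnly = Time(parser, counter, path);
        Console.WriteLine($"Parse only: {parseOnly.TotalSeconds:F1} s, {counter.Count} triples");

        // Comparison: a plain GraphHandler stores every triple.
        // Swap in your own filtering handler here to time it instead.
        var g = new Graph();
        TimeSpan withGraph = Time(parser, new GraphHandler(g), path);
        Console.WriteLine($"Parse + store: {withGraph.TotalSeconds:F1} s, {g.Triples.Count} triples kept");
    }
}
```

If the "parse only" time is already close to the ~24 seconds from the question, the filter is not the bottleneck and tuning it further is unlikely to help much.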