rdf jena dbpedia fuseki dotnetrdf

dotNetRdf issue with Unicode escape sequences / Jena Fuseki inability to load apostrophe in URI


I am developing a web application that needs to store RDF data on my Jena Fuseki server from several data sources (DB dumps / URIs). I have run into a problem with dotNetRdf. I am using the newest version (2.2.0), installed as a NuGet package. I suspect the problem is caused by the way Unicode escape sequences are handled during parsing.

First, I tried to get an example from the dotNetRdf documentation working (section: Reading RDF data, link below), but I got a parsing error. The failing code is the following:

IGraph g = new Graph();
g.LoadFromUri(new Uri("http://dbpedia.org/resource/Barack_Obama"));

This should be functionally equivalent to the code sample in the documentation (https://github.com/dotnetrdf/dotnetrdf/wiki/UserGuide-Reading-RDF#reading-rdf-from-uris); I am just using an extension method.

I am getting a VDS.RDF.Parsing.RdfParseException with message:

[Line 2233 Column 42 to Line 2233 Column 83] 
Unexpected Token 'Integer' encountered, expected a Property Value
describing one of the properties of an Object Node

Line 2233 of the given DBpedia resource should be the following:

"Barack Hussein Obama II (US /b\u0259\u02C8r\u0251\u02D0k hu\u02D0\u02C8se\u026An o\u028A\u02C8b\u0251\u02D0m\u0259/; born August 4, 1961) is an American politician who is the 44th and current President of the United States. He is the first African American to hold the office and the first president born outside the continental United States. Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he was president of the Harvard Law Review. He was a community organizer in Chicago before earning his law degree. He worked as a civil rights attorney and taught constitutional law at the University of Chicago Law School between 1992 and 2004. While serving three terms representing the 13th District in the Illinois Senate from 1997 to 2004, he ran unsuccessfully in the Democratic primary for the United States Hou"@en ,

Between columns 42 and 83 there are a few Unicode escape sequences, so I suppose dotNetRdf is not parsing them correctly (the message mentions an unexpected integer).

I have seen a few StackOverflow questions discussing DBpedia's inability to provide correct data, but they seem somewhat outdated (it is already 2019), so I don't think DBpedia is the problem. I have very little experience working with RDF data, but everything here looks okay to me.


Secondly, I tried downloading the content with .NET's HttpClient, specifying an Accept header (in my case text/turtle), and then loading the data into an IGraph instance by calling the IGraph.LoadFromString(...) method. That didn't help: same problem, but a different exception.
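For reference, the download step described above can be sketched roughly as follows. This is only an illustration using the standard System.Net.Http types; the class and method names (TurtleFetch, BuildTurtleRequest, FetchTurtleAsync) are hypothetical and not taken from the original code, and parsing the returned string with dotNetRdf is deliberately left out.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class TurtleFetch
{
    // Builds a GET request that asks the server for Turtle via the Accept header.
    public static HttpRequestMessage BuildTurtleRequest(string url)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.Accept.Add(new MediaTypeWithQualityHeaderValue("text/turtle"));
        return request;
    }

    // Sends the request and returns the response body as a string.
    // Throws HttpRequestException if the status code is not a success code.
    public static async Task<string> FetchTurtleAsync(HttpClient client, string url)
    {
        using (var request = BuildTurtleRequest(url))
        using (var response = await client.SendAsync(request))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

The returned string is what would then be handed to the parser, for example as the `content` variable in the workaround code further down.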

Thirdly, I finally found a workaround. I loaded the content into a string variable (via HttpClient, as described above) and then parsed it with the VDS.RDF.Parsing.Notation3Parser class. This worked, but another problem occurred: when I tried to save the graph to my Jena Fuseki triplestore, I got an RdfStorageException with an inner exception (WebException: remote server returned 400 Bad Request).

Exception message:

A HTTP error (HTTP 400 Parse error: [line: 10, col: 50] 
The declaration for the entity "ns5" must end with '>'.) 
occurred while saving a Graph to the Store.
Empty response body, see aformentioned status line or the inner exception for further details

So perhaps the data was not parsed correctly after all? Is that even possible?

Here is the simplified workaround code:

string content = /* get content via HttpClient */;

IGraph g = new Graph();
IRdfReader reader = new Notation3Parser();
reader.Load(g, new StringReader(content));

string connectionStr = "...";
var store = new PersistentTripleStore(new FusekiConnector(connectionStr));
...
store.UnderlyingStore.SaveGraph(g); // this call causes the mentioned RdfStorageException

I used the extension method to save the IGraph to a file to see what it contains (the file content is available here: https://pastebin.com/nULJtjXu). Again, when I looked at line 10, the one causing the problem, I found a Unicode escape sequence:

@prefix ns5:    <http://dbpedia.org/resource/Buyer\u0027s_Remorse:> .

(Note: \u0027 is an apostrophe ('))

This is odd, because the HTTP response returned by DBpedia contains many Unicode escape sequences and parsing doesn't fail on the first occurrence.

So maybe it is more likely that my Jena Fuseki has a problem loading data with an apostrophe in a URI?

Any help with my problem would be much appreciated.


Solution

  • The Fuseki error is likely caused by a bug in the RDF/XML writer of dotNetRDF.

    When you wrote your IGraph to a file, it looks like you used the Turtle or Notation3 writer. But when dotNetRDF talks to Fuseki, it uses the RDF/XML writer. So the contents of your pastebin are not what is being sent to Fuseki.

    I get the same kind of error from Fuseki when sending an RDF/XML file like this:

    <!DOCTYPE RDF [
      <!ENTITY ns5 'http://dbpedia.org/resource/Buyer's_Remorse:' >
    ]>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>
    

    This file contains no data, it just sets up an XML entity as is common in RDF/XML. The file is invalid because the apostrophe in the middle of the entity declaration is not escaped. (This is XML, so it needs to be escaped as &apos;.)
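The effect of the unescaped apostrophe can be reproduced without dotNetRDF or Fuseki, using only the XML parser in the .NET base class library. The sketch below is an illustration under stated assumptions: EntityEscapeDemo, BuildDocument and IsWellFormed are hypothetical helper names, and the document built here uses rdf:RDF as the DOCTYPE name. It builds the same kind of file with a raw and with an escaped entity value and checks whether each one parses.

```csharp
using System;
using System.IO;
using System.Security;
using System.Xml;

public static class EntityEscapeDemo
{
    // Builds a minimal RDF/XML document declaring one entity with the given value.
    // The value is inserted verbatim between single quotes, with no escaping.
    public static string BuildDocument(string entityValue)
    {
        return "<!DOCTYPE rdf:RDF [\n" +
               "  <!ENTITY ns5 '" + entityValue + "' >\n" +
               "]>\n" +
               "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"/>";
    }

    // Returns true if the XML, including its internal DTD subset, parses without error.
    public static bool IsWellFormed(string xml)
    {
        var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse };
        try
        {
            using (var reader = XmlReader.Create(new StringReader(xml), settings))
            {
                while (reader.Read()) { } // read to the end to surface any parse error
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        string uri = "http://dbpedia.org/resource/Buyer's_Remorse:";

        // Raw apostrophe: the entity literal ends early, so the declaration is malformed.
        Console.WriteLine(IsWellFormed(BuildDocument(uri)));

        // SecurityElement.Escape replaces ' with &apos; (and also escapes <, >, " and &).
        Console.WriteLine(IsWellFormed(BuildDocument(SecurityElement.Escape(uri))));
    }
}
```

The first document is rejected for the same reason Fuseki reports: the parser treats the apostrophe as the end of the entity value and then expects the closing '>'.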

    You could verify the problem by writing the IGraph to a file with the RDF/XML writer.

    I have filed a bug report for dotNetRDF about this.
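The verification suggested above (writing the IGraph with the RDF/XML writer) could look roughly like the sketch below. It assumes the dotNetRDF NuGet package is referenced and uses its RdfXmlWriter class from VDS.RDF.Writing; RdfXmlCheck, WriteAsRdfXml and FindXmlError are hypothetical helper names, and the well-formedness check relies only on the standard .NET XML reader. Whether the writer emits an entity declaration for a given namespace depends on the graph and the writer's settings, so this is a way to inspect the output rather than a guaranteed reproduction.

```csharp
using System;
using System.IO;
using System.Xml;
using VDS.RDF;
using VDS.RDF.Writing;

public static class RdfXmlCheck
{
    // Serializes the graph with dotNetRDF's RDF/XML writer to the given file path.
    public static void WriteAsRdfXml(IGraph g, string path)
    {
        var writer = new RdfXmlWriter();
        writer.Save(g, path);
    }

    // Returns null if the file is well-formed XML, otherwise the parser's error message.
    public static string FindXmlError(string path)
    {
        var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse };
        try
        {
            using (var text = File.OpenText(path))
            using (var reader = XmlReader.Create(text, settings))
            {
                while (reader.Read()) { } // read to the end to surface any parse error
            }
            return null;
        }
        catch (XmlException ex)
        {
            return ex.Message;
        }
    }
}
```

Calling WriteAsRdfXml with the graph from the question and then FindXmlError on the resulting file should show whether the serialized RDF/XML is already malformed before it ever reaches Fuseki.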