rdf, jena, apache-jena, tdb

Sanitize YAGO files before loading into apache-jena TDB triplestore


I want to load the YAGO 3 RDF triples (yago3_entire_ttl.7z from http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/ ) into the apache-jena TDB triplestore (3.1.0) using tdbloader.

The riot tool that apache-jena provides for validating the input reports two types of errors, each occurring many times:

  1. Illegal unicode escape sequence value: \\ (0x5C)
  2. Illegal character in IRI (codepoint 0x7C, '|')

My first thought is to replace '\\' and '|' with character sequences that pass the riot validation, but is there another solution?
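To get a feel for how widespread the problem is before replacing anything, the affected lines can be counted first. This is only a rough sketch: it assumes a POSIX shell with grep and .ttl files in the current directory, and `count_suspect_lines` is a made-up helper name. It counts every line that contains '|' or a backslash, including backslashes that are legal escapes inside literals, so it over-approximates what riot reports.

```shell
# Hypothetical helper: print how many lines of a file contain '|' or '\'.
# Inside a bracket expression both backslashes are literal characters.
count_suspect_lines() {
  grep -c '[|\\]' "$1"
}

# Usage: report the count for each .ttl file in the current directory.
for f in ./*.ttl; do
  [ -e "$f" ] || continue   # skip if the glob matched nothing
  printf '%s: %s\n' "$f" "$(count_suspect_lines "$f")"
done
```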


Solution

  • Found a solution here:

    Now the .ttl files need to be preprocessed: the characters that Jena rejects are replaced so that it accepts the data. On Linux, run sed -i 's/|/-/g' ./* && sed -i 's/\\/-/g' ./* && sed -i 's/–/-/g' ./* from within the directory that contains your .ttl files. On Windows, start the Ubuntu Bash, navigate to the respective directory (e.g. /mnt/c/Users/Ferdinand/yago) and run the same command. It will take several minutes. I mean, really several...

    https://ferdinand-muetsch.de/how-to-load-yago-into-apache-jena-fuseki.html
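The three sed passes in the quoted command can be merged into a single pass per file, which should help when each pass takes minutes. The following is a minimal sketch rather than a tested recipe for the full YAGO dump: it assumes a POSIX shell with sed, and the name `sanitize_ttl` and the `.clean.ttl` suffix are invented for illustration. Like the quoted command, it replaces every backslash, including legitimate escapes such as `\"` inside literals, so the cleanup is lossy.

```shell
# Hypothetical helper: apply the three replacements from the quoted command
# in one sed pass, reading from stdin and writing to stdout.
sanitize_ttl() {
  sed -e 's/|/-/g' -e 's/\\/-/g' -e 's/–/-/g'
}

# Usage: write cleaned copies next to the originals instead of editing in place.
for f in ./*.ttl; do
  [ -e "$f" ] || continue                      # skip if the glob matched nothing
  case "$f" in *.clean.ttl) continue ;; esac   # do not re-process cleaned copies
  sanitize_ttl < "$f" > "${f%.ttl}.clean.ttl"
done
```

Writing copies keeps the original files available if the replacement turns out to be too aggressive. Running riot with --validate on the cleaned files afterwards should show whether any errors remain before tdbloader is started.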