Tags: replace, notepad++, triplestore, turtle-rdf

How to replace underscores in a .ttl file only for objects


I have a file of RDF triples (subject, predicate, object) in Turtle syntax (a .ttl file). I need to replace every _ with a space, but only in the objects of the triples; subjects and predicates must stay unchanged. Here is an example (in my case every object is enclosed in double quotes "):

<http://dbpedia.org/resource/Animalia_(book)> <http://dbpedia.org/property/author> "Graeme_Base" .
<http://dbpedia.org/resource/Animalia_(book)> <http://dbpedia.org/property/illustrator> "Graeme_Base" .

I would like to get:

<http://dbpedia.org/resource/Animalia_(book)> <http://dbpedia.org/property/author> "Graeme Base" .
<http://dbpedia.org/resource/Animalia_(book)> <http://dbpedia.org/property/illustrator> "Graeme Base" .

What is the easiest and fastest way to achieve this? The files are very large, so I can't replace the underscores one at a time. I've tried using regular expressions in Notepad++, but I don't understand how to exclude the subject and the predicate.

Thanks a lot for the help.


Solution

  • You might use:

    (?:^<[^\n<>]+>\h+<[^<>\n]+>\h+"|\G(?!^))[^_\n]+\K_(?=[^"\n]*")
    

    Explanation

    • (?: Non-capturing group
      • ^ Assert the start of a line
      • <[^\n<>]+>\h+<[^<>\n]+>\h+" Match two <...> terms (the subject and the predicate), each followed by 1+ horizontal whitespace chars, then match the opening " of the object
      • | Or
      • \G(?!^) Assert the position at the end of the previous match, but not at the start of a line
    • ) Close the non-capturing group
    • [^_\n]+\K_ Match 1+ chars other than an underscore or a newline using a negated character class, forget what was matched so far using \K, then match the underscore. Because of the +, an underscore that directly follows the opening " or another underscore is not matched; use [^_\n]* instead if that can occur in your data.
    • (?=[^"\n]*") Positive lookahead to assert that a closing " follows on the same line

    In the replacement use a space.
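
If the files are too large to handle comfortably in an editor, the same idea can be sketched as a small script that streams the file line by line. This is only a minimal sketch, not part of the Notepad++ approach above: it uses Python's standard re module (which has no \G, \K or \h, so the pattern is restructured with capture groups), the function names fix_line and fix_file are made up for illustration, and it assumes one triple per line with the object in double quotes and no escaped \" inside the literal.

```python
import re

# Assumed line shape: <subject> <predicate> "literal" ...
# Group 1: everything up to and including the opening quote of the object.
# Group 2: the literal's text (no quotes or newlines; escaped \" is NOT handled).
# Group 3: the closing quote.
TRIPLE = re.compile(r'^(<[^<>\n]+>\s+<[^<>\n]+>\s+")([^"\n]*)(")')


def fix_line(line: str) -> str:
    """Replace underscores with spaces inside the quoted object only."""
    return TRIPLE.sub(
        lambda m: m.group(1) + m.group(2).replace("_", " ") + m.group(3),
        line,
        count=1,
    )


def fix_file(src_path: str, dst_path: str) -> None:
    """Stream src_path to dst_path so the whole file is never held in memory."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(fix_line(line))
```

Lines that do not match the assumed shape (comments, prefixes, objects that are IRIs) are written out unchanged. Calling fix_file("input.ttl", "output.ttl"), with both file names being placeholders, writes the result to a new file and leaves the original untouched.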