python rdflib shacl

RDFLib turning a single backslash into multiple backslashes


I'm using RDFLib to serialize my triples in TriG format (Turtle-based), but for some reason the backslashes (\) in my sh:pattern statements (regular expressions) come out doubled. I tried passing the value for sh:pattern as a raw string, and I tried escaping the backslashes, but the resulting TriG file still contains too many backslashes.

Example:

"shpattern": r"^\s|\d{VALUE}\D"

once serialized becomes:

sh:pattern "^\\s|\\d{4}\\D"

There are a few transformation steps between the input and the eventual serialization, but none of them touch the backslashes in the original input, which makes me suspect the serialization.

Does anybody know why RDFLib does this, and whether there is a way to turn it off? I can imagine that RDFLib sees a string and decides that any backslashes in it should be escaped (which I would usually want), but since the output is input for SHACL shapes, where the backslash has a regular-expression meaning, I don't want them escaped!

Thanks for any hints!


Solution

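  • Trying to parse this RDF with RDFLib fails:

    from rdflib import Graph

    # Raw string, so Python passes the single backslashes through unchanged
    ttl = r"""
        PREFIX sh: <http://www.w3.org/ns/shacl#>
    
        <a:> sh:pattern "^\s|\d{VALUE}\D" .
        """
    # Fails: "\s" is not a valid escape sequence in a Turtle string literal
    Graph().parse(data=ttl, format="turtle")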
    

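    But building the graph in Python and round-tripping it works:

    from rdflib import Graph, Literal, URIRef

    g = Graph()
    g.add((
        URIRef("a:"),
        URIRef("http://www.w3.org/ns/shacl#pattern"),
        Literal(r"^\s|\d{VALUE}\D")  # raw string: single backslashes
    ))
    # The serializer writes each backslash as "\\"; the parser decodes it again
    g2 = Graph().parse(data=g.serialize(), format="turtle")
    print(g2.serialize())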
    

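    So the answer is that the backslash is the escape character inside Turtle (and TriG) string literals, so a literal backslash has to be written as \\ in the file, and the parser rejects an unescaped one like \s. RDFLib escapes the backslashes once when it serializes and decodes them again when it parses; it does not escape them a second time. The value stored in the graph therefore still has single backslashes, and any tool that reads the file with a Turtle parser sees the original regex. You only have to unescape the pattern yourself if you pull it out of the serialized text without such a parser.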

    I'm sure pySHACL works fine with all forms of input.