RegEx in PHP to extract components of nquad


I'm looking for a RegEx that can help me parse an nquad file. An nquad file is a plain text file where each line represents a quad (s, p, o, c):

<http://mysubject> <http://mypredicate> <http://myobject> <http://mycontext> .
<http://mysubject> <http://mypredicate2> <http://myobject2> <http://mycontext> .
<http://mysubject> <http://mypredicate2> <http://myobject2> <http://mycontext> .

The objects can also be literals (instead of URIs), in which case they are enclosed in double quotes:

<http://mysubject> <http://mypredicate> "My object" <http://mycontext> .

I'm looking for a regex that, given one line of this file, will give me back a PHP array in the following format:

[0] => "http://mysubject"
[1] => "http://mypredicate"
[2] => "http://myobject"
[3] => "http://mycontext"

...or in the case where the double quotes are used for the object:

[0] => "http://mysubject"
[1] => "http://mypredicate"
[2] => "My object"
[3] => "http://mycontext"

One final thing: in an ideal world, the regex will cater for the scenario where there may be one or more spaces between the various components, e.g.

<http://mysubject>     <http://mypredicate>  "My object"       <http://mycontext> .

Solution

  • I'm going to add another answer as an additional solution using only a regex and explode:

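    $line = "<http://mysubject> <http://mypredicate> <http://myobject> <http://mycontext>";
    $line2 = '<http://mysubject> <http://mypredicate> "My object" <http://mycontext>';
    
    $delimiter = '---'; // Can't use a space, because a literal may contain spaces
    // Capture subject, predicate, object (URI or quoted literal) and context.
    // \s+ tolerates repeated spaces; an optional trailing "." is matched so it gets dropped.
    $pattern = '/^\s*<([^>]*)>\s+<([^>]*)>\s+["<]([^">]*)[">]\s+<([^>]*)>\s*\.?\s*$/';
    $result = preg_replace($pattern, implode($delimiter, ['$1', '$2', '$3', '$4']), $line);
    $array = explode($delimiter, $result); // the same two calls work for $line2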
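If the intermediate delimiter feels fragile (a URI or literal could itself contain `---`), a variant is to read the capture groups directly with `preg_match`. The following is a minimal sketch, not a full N-Quads parser: the function name `parseNquadLine` is made up for illustration, and it assumes literals contain no escaped double quotes and carry no language tag or datatype suffix.

```php
<?php
// Hypothetical helper (name chosen for this example only).
// Returns [subject, predicate, object, context], or null if the line does not match.
function parseNquadLine(string $line): ?array
{
    // Object is either <uri> (group 3) or "literal" (group 4).
    // Assumption: no escaped quotes inside the literal; trailing "." is optional.
    $pattern = '/^\s*<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s+<([^>]*)>\s*\.?\s*$/';

    if (preg_match($pattern, $line, $m) !== 1) {
        return null;
    }

    // Only one of groups 3 and 4 takes part in the match; the other is an empty string.
    $object = $m[3] !== '' ? $m[3] : $m[4];

    return [$m[1], $m[2], $object, $m[5]];
}

$quad = parseNquadLine('<http://mysubject>     <http://mypredicate>  "My object"       <http://mycontext> .');
// $quad is ['http://mysubject', 'http://mypredicate', 'My object', 'http://mycontext']
```

Returning `null` for lines that do not match makes it easy to skip blank or malformed lines when looping over the file.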