PHP regex code to extract FDF data


I am trying to parse an FDF file using PHP and regex, but I just can't get my head around regex. I am stuck parsing the file to generate an array.

%FDF-1.2
%âãÏÓ
1 0 obj 
<<
/FDF 
<<
/Fields [
<<
/V ([email protected])
/T (field_email)
>> 
<<
/V (John)
/T (field_name)
>> 
<<
/V ()
/T (field_reference)
>>]
>>
>>
endobj 
trailer

<<
/Root 1 0 R
>>
%%EOF

Current function (source: http://php.net/manual/en/ref.fdf.php):

function parse2($file) {
 if (!preg_match_all("/<<\s*\/V([^>]*)>>/x", $file,$out,PREG_SET_ORDER))
         return;
 for ($i=0;$i<count($out);$i++) {
         $pattern = "<<.*/V\s*(.*)\s*/T\s*(.*)\s*>>";
         $thing = $out[$i][1];
         if (eregi($pattern,$out[$i][0],$regs)) {
                 $key = $regs[2];
                 $val = $regs[1];
                 $key = preg_replace("/^\s*\(/","",$key);
                 $key = preg_replace("/\)$/","",$key);
                 $key = preg_replace("/\\\/","",$key);
                 $val = preg_replace("/^\s*\(/","",$val);
                 $val = preg_replace("/\)$/","",$val);
                 $matches[$key] = $val;
         }
 }
 return $matches;
}

Result:

Array
(
    [field_email)
    ] => [email protected])

    [field_name)
    ] => John)

    [field_reference)
    ] => )

)

Why does it include the ) and the newline? I know this problem is trivial for someone who understands regular expressions, so any help would be appreciated.


Solution

  • Description

    Your initial expression simply finds the entire block of text that represents each key/value pair. Then, in your clean-up section, you look for a closing paren immediately followed by the end of the string, \)$. As your output shows, though, there are additional characters (at least a line break) between the closing paren and the end of the string, so the pattern does not match and nothing is removed.
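To illustrate the point about trailing characters, here is a minimal sketch. It assumes the captured text ends with a carriage return plus line feed; the question does not show the file's actual line endings, so that is a guess, and `$captured` is a made-up stand-in for one of the `$regs` entries.

```php
<?php
// Assumption: the captured value still carries a Windows line ending (CR LF).
$captured = "(John)\r\n";

// The clean-up from the question: ")" must sit directly at the end of the string.
// The "\r" sits between ")" and the end, so nothing is replaced.
$naive = preg_replace('/\)$/', '', $captured);
var_dump($naive === $captured);

// Allowing optional whitespace before the end lets the clean-up succeed.
$fixed = preg_replace('/\)\s*$/', '', $captured);
$fixed = preg_replace('/^\s*\(/', '', $fixed);
var_dump($fixed);
```

Adding `\s*` before `$` is only one possible repair of the original clean-up; calling `trim()` on the captured text first would be another.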

    Instead I'd handle all this in one operation. This expression will:

    • find the field value (/V)
      • trim off the surrounding parens
      • place it into capture group 1
    • find the field name (/T)
      • trim off the field_ prefix
      • trim off the surrounding parens
      • place it into capture group 2
    • need the case-insensitive and multi-line options (the i and m modifiers in PHP)

    ^\/V\s\(([^)]*)\)[\r\n]*^\/T\s\(field_([^)]*)\)
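As a rough sketch of how the expression above might be applied in PHP, the following runs it through `preg_match_all()` with the `i` and `m` modifiers. The sample is a shortened version of the fields in the question, and the e-mail address is a made-up placeholder because the real value is not visible in the question.

```php
<?php
// Shortened sample; the e-mail address is a placeholder, not the original value.
$fdf = <<<'FDF'
<<
/V (jane@example.com)
/T (field_email)
>>
<<
/V (John)
/T (field_name)
>>
<<
/V ()
/T (field_reference)
>>
FDF;

// i = case-insensitive, m = multi-line, so ^ matches at the start of every line.
$pattern = '/^\/V\s\(([^)]*)\)[\r\n]*^\/T\s\(field_([^)]*)\)/im';

// PREG_SET_ORDER groups the results per match: [full match, value, name].
preg_match_all($pattern, $fdf, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the value, $m[2] is the field name without the "field_" prefix.
    printf("%s => %s\n", $m[2], $m[1]);
}
```

`PREG_SET_ORDER` is used here so that each element of `$matches` has the same layout as the match listing in this answer.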


    Example

    Sample Text

    %FDF-1.2
    %âãÏÓ
    1 0 obj 
    <<
    /FDF 
    <<
    /Fields [
    <<
    /V ([email protected])
    /T (field_email)
    >> 
    <<
    /V (John)
    /T (field_name)
    >> 
    <<
    /V ()
    /T (field_reference)
    >>]
    >>
    >>
    endobj 
    trailer
    
    <<
    /Root 1 0 R
    >>
    %%EOF
    

    Matches

    [0][0] = /V ([email protected])
    /T (field_email)
    [0][1] = [email protected]
    [0][2] = email
    
    [1][0] = /V (John)
    /T (field_name)
    [1][1] = John
    [1][2] = name
    
    [2][0] = /V ()
    /T (field_reference)
    [2][1] = 
    [2][2] = reference
    



    Or

    If you want to retain the field_ substring, simply remove it from the expression, like so:

    ^\/V\s\(([^)]*)\)[\r\n]*^\/T\s\(([^)]*)\)
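One way this variant could replace the original parse2() function is sketched below. The helper name `parseFdfFields` and its signature are hypothetical (not from the PHP manual), the type hints assume PHP 7 or later, and the sample deliberately uses CR LF line endings to show that `[\r\n]*` copes with them.

```php
<?php
/**
 * Hypothetical helper: returns an array of field name => field value
 * for simple FDF fields whose values contain no ")" character.
 */
function parseFdfFields(string $fdf): array
{
    // Same expression as above, keeping the full field name (including "field_").
    $pattern = '/^\/V\s\(([^)]*)\)[\r\n]*^\/T\s\(([^)]*)\)/im';
    $fields = [];

    if (preg_match_all($pattern, $fdf, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // $m[2] is the name from /T, $m[1] is the value from /V.
            $fields[$m[2]] = $m[1];
        }
    }

    return $fields;
}

// Made-up sample with Windows line endings.
$sample = "<<\r\n/V (John)\r\n/T (field_name)\r\n>>\r\n<<\r\n/V ()\r\n/T (field_reference)\r\n>>\r\n";
print_r(parseFdfFields($sample));
```

Because the character class `[^)]*` stops at the first closing paren, a value that itself contains `)` will not be matched correctly by this sketch.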
