PHP regex code to extract FDF data

I am trying to parse an FDF file using PHP and regex, but I just can't get my head around regex. I am stuck parsing the file to generate an array.

1 0 obj 
/Fields [
/V ([email protected])
/T (field_email)
/V (John)
/T (field_name)
/V ()
/T (field_reference)

/Root 1 0 R

Current function (source:

function parse2($file) {
 // Collect every << /V ... >> block; bail out if there are none
 if (!preg_match_all("/<<\s*\/V([^>]*)>>/x", $file,$out,PREG_SET_ORDER))
         return array();
 for ($i=0;$i<count($out);$i++) {
         $pattern = "<<.*/V\s*(.*)\s*/T\s*(.*)\s*>>";
         $thing = $out[$i][1];
         if (eregi($pattern,$out[$i][0],$regs)) {
                 $key = $regs[2];
                 $val = $regs[1];
                 // Strip the surrounding parens from key and value
                 $key = preg_replace("/^\s*\(/","",$key);
                 $key = preg_replace("/\)$/","",$key);
                 $key = preg_replace("/\\\/","",$key);
                 $val = preg_replace("/^\s*\(/","",$val);
                 $val = preg_replace("/\)$/","",$val);
                 $matches[$key] = $val;
         }
 }
 return $matches;
}


    ] => [email protected])

    ] => John)

    ] => )


Why does it include the ) and a newline? I know this problem is trivial for someone who understands regular expressions, so any help would be appreciated.


  • Description

    Your initial expression simply finds the entire block of text that represents each key and value pair. Then, in your cleanup section, you look for a closing paren immediately followed by the end of the string (\)$), but there are most likely additional characters, such as trailing whitespace or a line break, between the closing paren and the end of the string.
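
    To illustrate why the cleanup step leaves the ) in place, here is a minimal sketch. It assumes the captured value still ends with a Windows-style line break (\r\n); the question does not show which characters actually trail the value, so treat the input string as a hypothetical example.

```php
<?php
// Assumption: the captured value still carries a CRLF line ending.
$val = "John)\r\n";

// "\)$" needs ")" at the very end of the string (or directly before a
// final "\n"). The "\r" sits in between, so nothing is removed.
$unchanged = preg_replace('/\)$/', '', $val);

// Allowing optional whitespace before the end makes the cleanup work.
$trimmed = preg_replace('/\)\s*$/', '', $val);

var_dump($unchanged === $val); // bool(true)
var_dump($trimmed);            // string(4) "John"
```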

    Instead, I'd handle all of this in one operation. This expression will:

    • find the field value
      • trim the surrounding parens off
      • and place it into capture group 1
    • find the field name
      • trim the field_ substring off
      • trim the surrounding parens off
      • and place it into capture group 2

    It requires the case-insensitive and multi-line options.
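
    The steps above can be sketched as follows. This pattern is an assumption reconstructed from that description, not necessarily the exact expression the answer used; the ~ delimiter, the variable names and the john@example.com placeholder value are illustrative only.

```php
<?php
// Sample FDF body; the e-mail value is a made-up placeholder.
$fdf = <<<'FDF'
1 0 obj
/Fields [
/V (john@example.com)
/T (field_email)
/V (John)
/T (field_name)
/V ()
/T (field_reference)
/Root 1 0 R
FDF;

// i = case-insensitive, m = multi-line (^ matches at every line start).
// Group 1: the value between the parens after /V.
// Group 2: the name between the parens after /T, without "field_".
$pattern = '~^\s*/V\s*\(([^)]*)\)\s*/T\s*\(field_([^)]*)\)~im';

$count = preg_match_all($pattern, $fdf, $matches, PREG_SET_ORDER);

// Print each pair as "name => value".
foreach ($matches as $m) {
    printf("%s => %s\n", $m[2], $m[1]);
}
```

    Note that [^)]* stops at the first closing paren, so a value that itself contains an escaped paren would be cut short. For such files a stricter parser is probably needed.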



    Sample Text

    1 0 obj 
    /Fields [
    /V ([email protected])
    /T (field_email)
    /V (John)
    /T (field_name)
    /V ()
    /T (field_reference)
    /Root 1 0 R


    Matches

    [0][0] = /V ([email protected])
    /T (field_email)
    [0][1] = [email protected]
    [0][2] = email
    [1][0] = /V (John)
    /T (field_name)
    [1][1] = John
    [1][2] = name
    [2][0] = /V ()
    /T (field_reference)
    [2][1] = 
    [2][2] = reference


    If you want to retain the field_ substring, simply remove field_ from the expression.
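
    As a hedged sketch of that variant, the function below keeps the full field name and builds the array the question asks for. The function name parseFdfFields is hypothetical, the pattern is an assumption rather than the answer's exact expression, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: returns an array of field name => field value.
function parseFdfFields(string $fdf): array
{
    $fields = [];

    // Same idea as before, but "field_" is no longer excluded,
    // so group 2 holds the complete name, e.g. "field_name".
    $pattern = '~^\s*/V\s*\(([^)]*)\)\s*/T\s*\(([^)]*)\)~im';

    if (preg_match_all($pattern, $fdf, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $fields[$m[2]] = $m[1];
        }
    }

    return $fields;
}

// Usage with a two-field sample:
$result = parseFdfFields("/V (John)\n/T (field_name)\n/V ()\n/T (field_reference)\n");
print_r($result);
```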

