PHP - How to extract blocks from a text file by reading it line-by-line

I have an input text file like the following:

BEGIN
#1 
#2 
#3 
#4 
#5 
#6 
1       2015-05-31  2001-11-24  'Name Surname'      ID_1        0 
2       2011-04-01  ?           ?                   ID_2        1 
2       2013-02-24  ?           ?                   ID_3        1 
2       2014-02-28  ?           'Name Surname'      ID_4        2 
END
#7      'value 1'
#8      'value 2'
#9      'value 3'
#10     'value 4'
END

When the text file contains a BEGIN line, a loop starts there: each following line that begins with # is a key, and the values for those keys are the columns of the rows that come after them, up to the next END. Each loop should generate an array like the following:

Array ( [#1] => Array ( [0] => 1 [1] => 2 [2] => 2 [3] => 2 ) [#2] => Array ( [0] => 2015-05-31 [1] => 2011-04-01 [2] => 2013-02-24 [3] => 2014-02-28 ) [#3] => Array ( [0] => 2001-11-24 [1] => ? [2] => ? [3] => ? ) [#4] => Array ( [0] => 'Name Surname' [1] => ? [2] => ? [3] => 'Name Surname' ) [#5] => Array ( [0] => ID_1 [1] => ID_2 [2] => ID_3 [3] => ID_4 ) [#6] => Array ( [0] => 0 [1] => 1 [2] => 1 [3] => 2 ) )
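To make the intended mapping concrete, here is a minimal sketch of how the lines between BEGIN and END could be turned into columns. The helper names split_row and parse_loop are hypothetical, and the sketch assumes that data rows never start with # and that quoted values contain no single quotes of their own:

```php
<?php
// Hypothetical helper: split one data row into fields, keeping 'quoted strings' intact.
function split_row(string $line): array
{
    preg_match_all("/'[^']*'|\S+/", $line, $m);
    return $m[0];
}

// Hypothetical helper: turn the lines between BEGIN and END into columns keyed by "#n".
function parse_loop(array $lines): array
{
    $keys = [];
    $columns = [];
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        if ($line[0] === '#') {
            // Key line: remember its position and start an empty column.
            $keys[] = $line;
            $columns[$line] = [];
            continue;
        }
        // Data row: the i-th field belongs to the i-th key.
        foreach (split_row($line) as $i => $field) {
            if (isset($keys[$i])) {
                $columns[$keys[$i]][] = $field;
            }
        }
    }
    return $columns;
}

$loop = parse_loop([
    "#1 ", "#2 ", "#3 ",
    "1   2015-05-31  'Name Surname'",
    "2   2011-04-01  ?",
]);
print_r($loop);
```

Because each row is processed on its own, this version does not need one large regular expression built with str_repeat over the whole block.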

Outside a BEGIN...END block, a line that starts with # is a single key whose value is the text between the single quotes. These lines should generate an array like the following:

Array ( [#7] => 'value 1' [#8] => 'value 2' [#9] => 'value 3' [#10] => 'value 4' )
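A key/value line of this kind can be parsed on its own with a single preg_match. The following is only a sketch; parse_pair is a hypothetical name, and it assumes one key and one value per line, with the value either quoted or a single token:

```php
<?php
// Hypothetical helper: parse a "#key value" line found outside any BEGIN...END block.
// Returns [key, value], or null when the line is not a key/value pair.
function parse_pair(string $line): ?array
{
    if (preg_match("/^(#\S+)\s+('[^']*'|\S+)\s*$/", trim($line), $m) === 1) {
        return [$m[1], $m[2]];
    }
    return null;
}

$no_loop = [];
foreach (["#7      'value 1'", "#8      'value 2'", "END"] as $line) {
    $pair = parse_pair($line);
    if ($pair !== null) {
        $no_loop[$pair[0]] = $pair[1];
    }
}
print_r($no_loop);
```

Lines that do not match, such as the trailing END, are simply skipped.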

This is what I would like to obtain. My current code is the following:

<?php
    $start = microtime(true); // float seconds; no explode() needed

    ini_set("max_execution_time", 300); // 300 seconds = 5 minutes
    ini_set("pcre.backtrack_limit", "100000000"); // default 100k = "100000"
    ini_set("memory_limit", "1024M");

    $txt_path = "./test_2.txt";
    $txt_data = @file_get_contents($txt_path) or die("Could not access file: $txt_path");
    //echo $txt_data;

    /* BEGIN ARRAY FOR LOOP ENTRIES */

    $loop_pattern = "/BEGIN(.*?)END/s";
    preg_match_all($loop_pattern, $txt_data, $matches);
    $loops = $matches[0];
    $loops_count = count($loops);
    //echo("<br><br>".$loops_count."<br><br>");

    foreach ($loops as $key => $value) {
        $value = trim($value);
        $pattern = array("/BEGIN(.*?)/", "/END(.*?)/", "/[[:blank:]]+/");
        $replacement = array("", "", " ");
        $value = preg_replace($pattern, $replacement, $value);
        //echo $value."<br><br>";

        preg_match_all( '/^#\d+/m', $value, $matches );
        $keys = $matches[0];
        //print_r($keys);
        //echo "<br><br>";

        $value = preg_replace( '/^#\d+\s*/m', '', $value );

        $value = str_replace( "\n", " ", $value );

        $pattern = '/'.str_repeat( "('[^']+'|\S+)\s+", count( $keys ) ).'/';

        preg_match_all( $pattern, $value, $matches );
        //print_r($matches);
        //echo "<br><br>";

        $loop_dic = array_combine( $keys, array_slice( $matches, 1 ) );

        print_r( $loop_dic );
        echo("<br><br>");
    }

    /* END ARRAY FOR LOOP ENTRIES */

    /* BEGIN ARRAY FOR NO LOOP ENTRIES */

    $txt_data_without_loops = preg_replace( "/BEGIN(.*?)END/s", "", $txt_data );
    //echo $txt_data_without_loops;

    $pattern = array("/END(.*?)/", "/[[:blank:]]+/");
    $replacement = array("", " ");
    $txt_data_without_loops_clean = preg_replace($pattern, $replacement, $txt_data_without_loops);
    //echo $txt_data_without_loops_clean;
    preg_match_all( '/^#(.*?)\S+/m', $txt_data_without_loops_clean, $matches );
    $keys = $matches[0];
    //print_r($keys);
    $txt_data_without_loops_clean = preg_replace( '/^#(.*?)\S+\s*/m', '', $txt_data_without_loops_clean );
    //print_r($txt_data_without_loops_clean);

    $txt_data_without_loops_clean_no_newline = str_replace( "\n", " ", $txt_data_without_loops_clean );
    //print_r($txt_data_without_loops_clean_no_newline);
    $pattern = '/'.str_repeat( "('[^']+'|\S+)\s+", 1 ).'/';
    preg_match_all( $pattern, $txt_data_without_loops_clean_no_newline, $matches );
    //print_r( $matches[0] );

    $no_loop_dic = array_combine( $keys, $matches[0] );
    print_r( $no_loop_dic );
    echo("<br><br>");

    /* END ARRAY FOR NO LOOP ENTRIES */

    $finish = microtime(true);
    $total_time = round(($finish - $start), 4);
    echo '<br><br><b>Page generated in '.$total_time.' seconds.</b><br><br>';
?>

As a first approach, to obtain the BEGIN-END loops and their arrays, I read the input file with:

$txt_path = "./input.txt";
$txt_data = @file_get_contents($txt_path) or die("<b>Could not access file: $txt_path</b><br><br>");

This works fine for small files, but with big input files the browser stops responding (I'm testing on Firefox), perhaps because parsing the whole file at once saturates the RAM (my laptop has 3 GB of RAM).

I tried the following settings in the PHP file:

ini_set("max_execution_time", 300); // 300 seconds = 5 minutes
ini_set("pcre.backtrack_limit", "100000000"); // default 100k = "100000"
ini_set("memory_limit", "1024M");

These seem to solve the problem for files that are not too big. With big files, however, the process finishes without errors only when few other resources are in use at the same time, so this isn't the best solution.

Searching on the web, I found this page where I read:

If you're reading files, read them line-by-line instead of reading in the complete file into memory. Look at fgets and SplFileObject::fgets.

So I decided to use fgets to read and parse the whole input file. After generating an array of all the lines, I need to extract each loop from it and add it to a loops_array, while adding the other no_loop key-value pairs to another array.
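One way to do this split while reading, without first storing every line, is a small state machine that remembers whether it is currently inside a block. This is only a sketch under the assumption that BEGIN and END appear alone on their lines and that blocks are not nested; read_lines and parse_blocks are hypothetical names:

```php
<?php
// Hypothetical generator: yield trimmed, non-empty lines one at a time,
// so the whole file is never held in memory at once.
function read_lines(string $path): Generator
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Could not access file: $path");
    }
    try {
        while (($line = fgets($handle)) !== false) {
            $line = trim($line);
            if ($line !== '') {
                yield $line;
            }
        }
    } finally {
        fclose($handle);
    }
}

// Hypothetical parser: a tiny state machine over any iterable of lines.
function parse_blocks(iterable $lines): array
{
    $loops = [];      // one array of raw lines per BEGIN...END block
    $no_loop = [];    // raw "#key value" lines found outside the blocks
    $current = null;  // null while outside a block
    foreach ($lines as $line) {
        if ($line === 'BEGIN') {
            $current = [];
        } elseif ($line === 'END') {
            if ($current !== null) {
                $loops[] = $current; // the first END after a BEGIN closes the block
                $current = null;
            }
            // An END seen outside a block only terminates a no_loop zone.
        } elseif ($current !== null) {
            $current[] = $line;
        } else {
            $no_loop[] = $line;
        }
    }
    return [$loops, $no_loop];
}

// Demo on a small temporary file.
$path = tempnam(sys_get_temp_dir(), 'blk');
file_put_contents($path, "BEGIN\n#1\n#2\n1 a\n2 b\nEND\n#7 'value 1'\nEND\n");
[$loops, $no_loop] = parse_blocks(read_lines($path));
unlink($path);
print_r($loops);
print_r($no_loop);
```

To lower memory use further, each block could be converted and output as soon as its END is reached instead of being collected in $loops.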

My idea, which seems to be fast, is to find the index of each BEGIN, in this way:

$txt_path = "./test.txt";
$txt_data = @fopen($txt_path, "rb") or die("<b>Could not access file: $txt_path</b><br/><br/>");

$lines = array();
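// fgets() returns false at end of file, so test its result instead of feof();
// no length limit is passed, so long lines are not split.
while ( ($line = fgets($txt_data)) !== false ) {
    //echo($line."<br/><br/>");
    array_push($lines, trim($line));
}
fclose($txt_data);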

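// Drop empty lines (but keep a line that is just "0") and reindex from 0,
// so that the keys match the offsets used later by array_slice().
$lines = array_values(array_filter($lines, 'strlen'));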
//print_r($lines);
//echo("<br/><br/>");

$begins = array_keys($lines, "BEGIN");
//echo("<b>Begins:</b><br/><br/>");
//print_r($begins);
//echo("<br/><br/>");

Now I need to find the index of the first END after each element of the $begins array. If I do:

$ends = array_keys($lines, "END");
//echo("<b>Ends:</b><br/><br/>");
//print_r($ends);
//echo("<br/><br/>");

it also picks up the END strings in the no_loop zones of the input file, whereas I need the index of the first END after each BEGIN, so that I can combine them with:

$begins_ends = array_combine($begins, $ends);

and extract all the loops with array_slice, finally adding each $loop to a new array, $loops, in a way like this:

$i = 0;
$loops = array();
foreach ($begins_ends as $key => $value) {
    $begin = trim($key);
    $end = trim($value);
    $loop = array_slice( $lines, $begin, ($end - $begin), false );
    $this_loop = array();
    for ($el=$begin; $el < $end+1; $el++) {
        array_push($this_loop, $lines[$el]);
        unset($lines[$el]);
    }
    array_push($loops, $this_loop);
    $loop = array_values($lines);
    //echo("<b>Loops Dictionary $i:</b><br/><br/>");
    //print_r($loop);
    //echo("<br/><br/>");
    $i++;
}
//print_r($loops);
//echo("<br/><br/>");
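The pairing of each BEGIN with the first END that follows it can be done in a single pass over $lines, which avoids calling array_keys twice and then filtering the extra END entries. A minimal sketch, assuming blocks are never nested; pair_begins_ends is a hypothetical name:

```php
<?php
// Hypothetical helper: map the index of each BEGIN to the index of the first END after it.
function pair_begins_ends(array $lines): array
{
    $pairs = [];
    $open = null; // index of the BEGIN currently waiting for its END
    foreach ($lines as $i => $line) {
        if ($line === 'BEGIN') {
            $open = $i;
        } elseif ($line === 'END' && $open !== null) {
            $pairs[$open] = $i;
            $open = null; // later END lines are ignored until the next BEGIN
        }
    }
    return $pairs;
}

$lines = ['BEGIN', '#1', '1', 'END', "#7 'value 1'", 'END', 'BEGIN', '#2', '2', 'END'];
$begins_ends = pair_begins_ends($lines);
print_r($begins_ends);
```

The result has the same shape as the array that array_combine($begins, $ends) was meant to produce, so it can be fed to the foreach loop above. The END at index 5, which closes a no_loop zone, is not paired with anything.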

The problem is obtaining the correct $ends array, one that ignores the END strings of the no_loop zones in the input file, so that I get the same output as before:

Array ( [#1] => Array ( [0] => 1 [1] => 2 [2] => 2 [3] => 2 ) [#2] => Array ( [0] => 2015-05-31 [1] => 2011-04-01 [2] => 2013-02-24 [3] => 2014-02-28 ) [#3] => Array ( [0] => 2001-11-24 [1] => ? [2] => ? [3] => ? ) [#4] => Array ( [0] => 'Name Surname' [1] => ? [2] => ? [3] => 'Name Surname' ) [#5] => Array ( [0] => ID_1 [1] => ID_2 [2] => ID_3 [3] => ID_4 ) [#6] => Array ( [0] => 0 [1] => 1 [2] => 1 [3] => 2 ) )

Array ( [#7] => 'value 1' [#8] => 'value 2' [#9] => 'value 3' [#10] => 'value 4' )

using the fastest approach and the lowest memory usage, so that the browser no longer stops responding with big files.
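To compare approaches on speed and memory, both can be measured directly in the script. This is a rough sketch only; the range() call is a stand-in for the real parsing work:

```php
<?php
$start = microtime(true); // float seconds since the Unix epoch

// ... the parsing work to be measured would go here ...
$data = range(1, 1000);

$elapsed = round(microtime(true) - $start, 4);
// Peak memory allocated to the script so far, converted to megabytes.
$peak_mb = round(memory_get_peak_usage(true) / 1048576, 2);
echo "Elapsed: {$elapsed} s, peak memory: {$peak_mb} MB\n";
```

Running the same input through the whole-file version and a line-by-line version should show whether the change actually reduces the peak memory figure.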

Thank you

Solution

It turned out that fgets() was not necessary and that fread() can be used instead; the source of the information is here.

As you can read there, file() is very similar to the previously used file_get_contents(), so switching to it should not make a difference.

The previous working code can be adapted quite simply:

test_2.txt file content:

BEGIN
#1 
#2 
#3 
#4 
#5 
#6 
1       2015-05-31  2001-11-24  'Name Surname'      ID_1        0 
2       2011-04-01  ?           ?                   ID_2        1 
2       2013-02-24  ?           ?                   ID_3        1 
2       2014-02-28  ?           'Name Surname'      ID_4        2 
END
#7      'value 1'
#8      'value 2'
#9      'value 3'
#10     'value 4'
END
BEGIN
#11 
#12 
#13 
#14 
#15 
#16 
1       2015-05-31  2001-11-24  'Name Surname'      ID_5        0 
2       2011-04-01  ?           ?                   ID_6        1 
2       2013-02-24  ?           ?                   ID_7        1 
2       2014-02-28  ?           'Name Surname'      ID_8        2 
END
BEGIN
#17 
#18 
#19 
#20 
#21 
#22 
1       2015-05-31  2001-11-24  'Name Surname'      ID_9        0 
2       2011-04-01  ?           ?                   ID_10        1 
2       2013-02-24  ?           ?                   ID_11        1 
2       2014-02-28  ?           'Name Surname'      ID_12        2 
END

PHP code:

<?php
$start = microtime(true); // float seconds; no explode() needed

$filename = "./test_2.txt";
$handle = fopen($filename, "rb") or die("<b>Could not access file: $filename</b><br/><br/>");
$contents = fread($handle, filesize($filename));
fclose($handle);

//echo($contents."<br><br>");

$loop_pattern = "/BEGIN(.*?)END/s";
preg_match_all($loop_pattern, $contents, $matches);
$loops = $matches[0];
//print_r($loops);
//echo("<br><br>");
$loops_count = count($loops);
//print_r($loops_count);
//echo "<br><br>";

foreach ($loops as $key => $value) {
    $value = trim($value);
    //echo($value."<br><br>");
    $pattern = array("/[[:blank:]]+/", "/BEGIN(.*)/", "/END(.*)/");
    $replacement = array(" ", "", "");
    $value = preg_replace($pattern, $replacement, $value);
    //echo($value."<br><br>");

    preg_match_all( '/^#\d+/m', $value, $matches );
    $keys = $matches[0];
    //print_r($keys);
    //echo "<br><br>";

    $value = preg_replace( '/^#\d+\s*/m', '', $value );

    $value = str_replace( "\n", " ", $value );

    $pattern = '/'.str_repeat( "('[^']+'|\S+)\s+", count( $keys ) ).'/';
    preg_match_all( $pattern, $value, $matches );
    //print_r($matches);
    //echo "<br><br>";

    $values = array_combine( $keys, array_slice( $matches, 1, count( $keys ), false ) );
    print_r( $values );
    echo "<br><br>";
}

$finish = microtime(true);
$total_time = round(($finish - $start), 4);
echo("<br/><br/><b>Page generated in ".$total_time." seconds.</b><br/><br/>");
?>

I also removed @, writing:

fopen($filename, "rb") or die("<b>Could not access file: $filename</b><br/><br/>");

instead of the previous:

@fopen($txt_path, "rb") or die("<b>Could not access file: $txt_path</b><br/><br/>");

as suggested here.
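An alternative to both @ and "or die" is to check the result of fopen() explicitly. The following is only a sketch; open_for_reading is a hypothetical helper, and the path used in the demo is deliberately one that should not exist:

```php
<?php
// Hypothetical helper: open a file for reading and fail with a clear message.
function open_for_reading(string $path)
{
    // Checking first avoids the warning that fopen() emits for a missing file.
    if (!is_readable($path)) {
        throw new RuntimeException("Could not access file: $path");
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Could not open file: $path");
    }
    return $handle;
}

// Demo with a path that is assumed not to exist.
$missing = sys_get_temp_dir() . '/no_such_file_' . uniqid() . '.txt';
try {
    $handle = open_for_reading($missing);
    fclose($handle);
    $message = 'opened';
} catch (RuntimeException $e) {
    $message = $e->getMessage();
}
echo $message, "\n";
```

Throwing an exception lets the calling code decide how to report the error, instead of ending the script inside the expression.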

EDIT 1

Another approach is the following:

$txt_path = "./test_2.txt";
$handle = new SplFileObject($txt_path);

// Loop until we reach the end of the file.
$lines_array = array();
while ( !$handle->eof() ) {
    $line = $handle->fgets();
    //echo($line."<br/><br/>"); // Echo one line from the file.
    array_push($lines_array, trim($line));
}

// Unset the file to call __destruct(), closing the file handle.
$handle = null;

$lines_array = array_filter($lines_array);
//print_r($lines_array);
//echo("<br/><br/>");

$lines_joined = implode("\n", $lines_array);
//echo($lines_joined."<br/><br/>");
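SplFileObject can also be iterated with foreach, and its flags can drop line endings and skip empty lines during reading. The following sketch uses a small temporary file as stand-in input; the extra trim() and empty check are there because SKIP_EMPTY does not remove lines that contain only spaces:

```php
<?php
// Stand-in input written to a temporary file for the demo.
$path = tempnam(sys_get_temp_dir(), 'spl');
file_put_contents($path, "BEGIN\n#1 \n\n1   a\nEND\n");

$file = new SplFileObject($path);
$file->setFlags(
    SplFileObject::READ_AHEAD | SplFileObject::SKIP_EMPTY | SplFileObject::DROP_NEW_LINE
);

$lines_array = [];
foreach ($file as $line) {
    $line = trim((string) $line);
    if ($line === '') {
        continue; // whitespace-only lines are not caught by SKIP_EMPTY
    }
    $lines_array[] = $line;
}

// Release the object so the file handle is closed before deleting the file.
$file = null;
unlink($path);
print_r($lines_array);
```

With this, the separate array_filter() step is no longer needed, although the lines are still collected in one array.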