Extract information from text file using PHP


Problem:

I want to extract information from a text file using PHP. The file has the following structure:

  • Date (in the format YYYY-MM-DD)
  • Title
  • Text: value
  • Text: value
  • Text: value

Input:

2015-03-18
 Store A
Text 1: 5,00 USD
Text 2: 2015-03-18
Text 3: 2015-03-12
 Store B
Text 1: 10,00 USD
Text 2: 2015-03-18
Text 3: 2015-03-12
 Store C
Text 1: 15,00 USD
Text 2: 2015-03-18
Text 3: 2015-03-12
2015-03-19
 Store D
Text 1: 20,00 USD
Text 2: 2015-03-18
Text 3: 2015-03-12

PHP Code (so far):

<?php
    // Creates array to store data from textfile
    $data       = array();

    // Opens text file
    $text_file  = fopen('data.txt', 'r');

    // Loops through each line
    while ($line = fgets($text_file))
    {
        // Checks whether line is a date
        if (preg_match("/^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1])$/", trim($line)))
        {
            $data[$line] = array();
        }
        else
        {
            $data[] = trim($line);
        }
    }

    // Removes first array key
    $data = array_slice($data, 1);

    // Prints out full array
    echo "<xmp>" . print_r($data, true) . "</xmp>";
 ?>

HTML Code:

<table border="1">
  <tr>
    <th>Date</th>
    <th>Store</th>
    <th>Text 1</th>
    <th>Text 2</th>
    <th>Text 3</th>
  </tr>
  <tr>
    <td>2015-03-18</td>
    <td>Store A</td>
    <td>5,00 USD</td>
    <td>2015-03-18</td>
    <td>2015-03-12</td>
  </tr>
  <tr>
    <td></td>
    <td>Store B</td>
    <td>10,00 USD</td>
    <td>2015-03-18</td>
    <td>2015-03-12</td>
  </tr>
  <tr>
    <td></td>
    <td>Store C</td>
    <td>15,00 USD</td>
    <td>2015-03-18</td>
    <td>2015-03-12</td>
  </tr>
  <tr>
    <td>2015-03-19</td>
    <td>Store D</td>
    <td>20,00 USD</td>
    <td>2015-03-18</td>
    <td>2015-03-12</td>
  </tr>
</table>

Desired output:

[image not available]

Questions:

  1. What is the appropriate way to extract and store the different values?
  2. What is the appropriate way to print out the information as the output example?

Solution

  • The approach is built around the 'groups' of records in the source file:

    • Date group - starts with a line that contains only a date.
    • Store group - consists of:
      • the store name
      • the price
      • a group of dates

    Added requirement: print only the groups whose date is today or later. I call this threshold the 'cutoff_date' in the code.

    I use a 'read-ahead' technique, so there is always a record available to process.

    I supply helper functions to 'identify things'. They keep the controlling logic easy to follow.
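
    To make the read-ahead idea concrete, here is a minimal sketch that works on an in-memory array instead of a file. The function name `groupByDate` and the sample lines are assumptions for illustration only, and the sketch assumes PHP 7 or later and that the first line is a date. The next line is always loaded before it is needed, so each loop only has to inspect `$current`.

    ```php
    <?php
    // Read-ahead sketch over an in-memory list of lines.
    // groupByDate() is a hypothetical name, not part of the code further down.
    function groupByDate(array $lines): array
    {
        $groups = [];
        $i = 0;
        $current = $lines[$i] ?? null;           // read-ahead: first record is already loaded

        while ($current !== null) {              // process until the input is exhausted
            $date = $current;                    // assumption: a group starts with its date line
            $groups[$date] = [];
            $current = $lines[++$i] ?? null;     // read ahead

            // consume every line up to the next date line (or the end of the input)
            while ($current !== null && !preg_match('/^\d{4}-\d{2}-\d{2}$/', $current)) {
                $groups[$date][] = $current;
                $current = $lines[++$i] ?? null; // read ahead
            }
        }
        return $groups;
    }

    print_r(groupByDate(['2015-03-18', 'Store A', 'Store B', '2015-03-19', 'Store D']));
    ```

    Because the inner loop stops on the date line without consuming it, the outer loop finds that line waiting in `$current` and can start the next group straight away.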

    The code:

    <?php // https://stackoverflow.com/questions/29121286/extract-information-from-text-file-using-php
    
    /**
     * We only need to show store entries on or after a certain date.
     * I call this the 'cutoff_date'.
     *
     * It defaults to today's date.
     */
    $now = new DateTime();
    $CUTOFF_DATE = $now->format('Y-m-d');
    
    // output stored in here
    $outHtml = '<table border="1">
      <tr>
        <th>Date</th>
        <th>Store</th>
        <th>Text 1</th>
        <th>Text 2</th>
        <th>Text 3</th>
      </tr>';
    
    
    // source - we use 'read-ahead' as it makes life easier
    $sourceFile = fopen(__DIR__ . '/Q29121286.txt', 'rb');
    
    $currentLine = readNextLine($sourceFile); // read-ahead
    
    while (!empty($currentLine)) { // process until eof...
    
        // start of a date group...
        $currentGroupDate = $currentLine; // ignore this group if less than CUTOFF_DATE
        $currentLine = readNextLine($sourceFile); // read ahead
    
        while (!empty($currentGroupDate) && $currentGroupDate < $CUTOFF_DATE) { // find next date_group record
            while (!empty($currentLine) && datePosition($currentLine) !== 0) { // read to end of current group
                $currentLine = readNextLine($sourceFile);
            }
            $currentGroupDate = $currentLine;
            $currentLine = readNextLine($sourceFile); // read ahead
        }
    
        $htmlCurrentDate = $currentGroupDate; // only print the date once
    
        $html = '';
        // display all the rows for this 'date group' -- look for next 'date'
        while (!empty($currentLine) && datePosition($currentLine) !== 0) {
    
            $html = '<tr>';
    
            $html .= '<td>'. $htmlCurrentDate .'</td>';
            $htmlCurrentDate = ''; // only display the date once
    
            $html .= '<td>'. $currentLine .'</td>'; // store
            $currentLine = readNextLine($sourceFile);
    
            // process the price (the text after the first ':')
            $lineParts = explode(':', $currentLine, 2); // need the price...
            $html .= '<td>'. trim($lineParts[1]) .'</td>';
            $currentLine = readNextLine($sourceFile);
    
            // now process the group of dates - look for a line
            // that starts with 'text' and must contain a date
            while (   !empty($currentLine)
                    && isTextLine($currentLine)
                    && datePosition($currentLine) >= 1) {
    
                $lineParts = explode(':', $currentLine, 2); // need the date...
                $html .= '<td>'. trim($lineParts[1]) .'</td>';
                $currentLine = readNextLine($sourceFile); // read next
            }
    
            // end of this group...
            $html .= '</tr>';
    
            $outHtml .= $html;
    
        } // end of 'dateGroup'
    } // end of data file...
    
    $outHtml .= '</table>';
    fclose($sourceFile);
    
    
    // display output
    echo $outHtml;
    exit;
    
    /**
     * These routines hide the low-level processing;
     */
    
    /**
     * Return the position of the first date string in the line, or -1 if there is none
     * @param string $line
     * @return int
     */
    function datePosition($line)
    {
        $result = preg_match("/\d{4}-\d{2}-\d{2}/", $line, $matches, PREG_OFFSET_CAPTURE);
        $pos = -1;
        if (!empty($matches)) {
            $match = current($matches);
            $pos = $match[1];
        }
        return $pos;
    }
    
    /**
     * Return whether the line is a text line (starts with 'text', case-insensitive)
     *
     * @param string $text
     * @return bool
     */
    function isTextLine($text)
    {
        return strpos(strtolower($text), 'text') === 0;
    }
    
    /**
     * Return the trimmed line, or an empty string at eof
     * Added 'fudge' to not read past the eof - ;-/
     * @param resource $handle
     * @return string
     */
    function readNextLine($handle)
    {
        static $isEOF = false;
    
        if ($isEOF) {
            return '';
        }
    
        $line = fgets($handle);
        if ($line !== false) {
            $line = trim($line);
        }
        else {
            $isEOF = true;
            $line = '';
        }
        return $line;
    }
    

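    Output from the supplied file. To reproduce it, set $CUTOFF_DATE to '2015-03-18' or earlier, because the default of today's date skips every group in this sample: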

    | Date       | Store   | Text 1    | Text 2     | Text 3     |
    |------------|---------|-----------|------------|------------|
    | 2015-03-18 | Store A | 5,00 USD  | 2015-03-18 | 2015-03-12 |
    |            | Store B | 10,00 USD | 2015-03-18 | 2015-03-12 |
    |            | Store C | 15,00 USD | 2015-03-18 | 2015-03-12 |
    | 2015-03-19 | Store D | 20,00 USD | 2015-03-18 | 2015-03-12 |
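
    The code above builds the HTML while it reads, so nothing is stored. If the values should also be kept in an array, as the first question asks, one possible approach is sketched below. The names `parseStores` and `renderTable` are assumptions for illustration, the sketch assumes PHP 7 or later, and it assumes that store names never contain a ':' character. It leaves out the header row for brevity.

    ```php
    <?php
    // Sketch only: parseStores() and renderTable() are hypothetical helper names.

    /**
     * Parse lines into [date => [[store, value, value, ...], ...]].
     * Assumes the layout from the question: a date line, then for each store
     * a name line followed by "Label: value" lines.
     */
    function parseStores(array $lines): array
    {
        $data = [];
        $date = null;
        foreach ($lines as $line) {
            $line = trim($line);
            if ($line === '') {
                continue;                                 // skip blank lines
            }
            if (preg_match('/^\d{4}-\d{2}-\d{2}$/', $line)) {
                $date = $line;                            // start a new date group
                $data[$date] = [];
            } elseif ($date !== null && strpos($line, ':') !== false) {
                $last = count($data[$date]) - 1;
                if ($last >= 0) {                         // value line for the latest store
                    $data[$date][$last][] = trim(explode(':', $line, 2)[1]);
                }
            } elseif ($date !== null) {
                $data[$date][] = [$line];                 // a store name starts a new row
            }
        }
        return $data;
    }

    /** Render the rows as HTML, printing each date once and skipping dates before $cutoff. */
    function renderTable(array $data, string $cutoff): string
    {
        $html = "<table border=\"1\">\n";
        foreach ($data as $date => $stores) {
            if (strcmp((string) $date, $cutoff) < 0) {
                continue;                                 // Y-m-d strings sort chronologically
            }
            foreach ($stores as $i => $row) {
                $cells = array_merge([$i === 0 ? (string) $date : ''], $row);
                $html .= '  <tr><td>'
                    . implode('</td><td>', array_map('htmlspecialchars', $cells))
                    . "</td></tr>\n";
            }
        }
        return $html . '</table>';
    }

    // Sample lines in the format from the question.
    $lines = ['2015-03-18', ' Store A', 'Text 1: 5,00 USD', 'Text 2: 2015-03-18',
              '2015-03-19', ' Store D', 'Text 1: 20,00 USD'];
    $data = parseStores($lines);
    echo renderTable($data, '2015-03-19'), "\n";
    ```

    For a real file, the lines could come from `file('data.txt', FILE_IGNORE_NEW_LINES)`. Keeping the parsed array separate from the rendering makes it possible to reuse the data, for example to filter or sort it, before any HTML is produced.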