Tags: php, regex, string, preg-match, file-get-contents

Regex pattern works on string but not on loaded file content


I want to extract words between ";" and ":" from an XML file, for example the word " Index" here

bla bla bla ; Index : bla bla

The file is loaded from its URL using file_get_contents:

$output = file_get_contents("https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Exporter/Base_de_donn%C3%A9es");
       
 preg_match_all('/\;.[a-zA-Z]+.\:/', $output, $matches, PREG_SET_ORDER, 0);
 var_dump($matches);

The regex pattern works fine on the same file content in regex101, and also when I copy the text into a string variable. But the code above does not work: it returns only the last match.

What am I doing wrong?

PS: I also tried loading the XML file using DOMDocument, with the same result.


Solution

  • Here is a way to do it with a low memory footprint. Several considerations:

    • the file is big (not enormous but big).
    • the fact that you are dealing with an XML file isn't very important in this case, since the text you are looking for follows its own line-based format (XWiki format for standard definitions) that is independent of the XML format. However, if you absolutely want to use an XML parser here to extract the content of the text tag, I suggest using XMLReader instead of DOMDocument.
    • the lines you are looking for are always single lines, start with ; (without indentation) and are always immediately followed by : on the next line.

    Once you see that (right click, view source), you can choose to read the file line by line (instead of loading the whole file with file_get_contents) and use a generator function to select the interesting lines:

    $url = 'https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Exporter/Base_de_donn%C3%A9es';
    
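    $handle = fopen($url, 'rb');
    if ($handle === false) {
        // fopen() returns false when the URL cannot be opened
        exit('Unable to open ' . $url . PHP_EOL);
    }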
    
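    // Yields each line that starts with ';' and is immediately
    // followed by a line that starts with ':'.
    function filterLines($handle) {
        $temp = false; // last ';' line seen, or false if there is none
        // fgets() returns false at end of file, which ends the loop
        while (($line = fgets($handle)) !== false) {
            if ( $line[0] === ';' ) {
                $temp = $line;
                continue;
            }
            if ( $line[0] === ':' && $temp !== false )
                yield $temp;

            $temp = false;
        }
    }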
    
    foreach (filterLines($handle) as $line) {
        if ( preg_match_all('~\b\p{Latin}+(?: \p{Latin}+)*\b~u', $line, $matches) )
            echo implode(', ', $matches[0]), PHP_EOL;
    }
    
    fclose($handle);
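
If you do want to go through an XML parser, as mentioned in the considerations above, the following is only a rough sketch of how XMLReader could be combined with the same line filter. Several things here are assumptions and not taken from the real export: the inline sample XML, the element name `text`, and the helper names `extractText` and `definedTerms`. It also keeps the whole term after the `;` instead of applying the `\p{Latin}` pattern, purely to keep the example short.

```php
<?php
// Hypothetical sample standing in for the exported XML (the real file is much larger).
$xml = <<<'XML'
<mediawiki>
  <page>
    <title>Sample</title>
    <revision>
      <text>Intro line
; Index
: Definition of an index.
; Orphan term
Some other text
; Clé primaire
: Definition of a primary key.
</text>
    </revision>
  </page>
</mediawiki>
XML;

// Concatenate the content of every <text> element (element name is an assumption).
function extractText(string $xml): string {
    $reader = new XMLReader();
    $reader->XML($xml); // for a file or URL, $reader->open($url) would be used instead
    $content = '';
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'text') {
            $content .= $reader->readString();
        }
    }
    $reader->close();
    return $content;
}

// Keep each ';' line that is immediately followed by a ':' line,
// and return it without the leading ';' and surrounding whitespace.
function definedTerms(string $content): array {
    $terms = [];
    $candidate = null; // last ';' line seen, or null if there is none
    foreach (explode("\n", $content) as $line) {
        if ($line !== '' && $line[0] === ';') {
            $candidate = $line;
            continue;
        }
        if ($line !== '' && $line[0] === ':' && $candidate !== null) {
            $terms[] = trim(substr($candidate, 1));
        }
        $candidate = null;
    }
    return $terms;
}

echo implode(', ', definedTerms(extractText($xml))), PHP_EOL;
// prints: Index, Clé primaire
```

Note that readString() pulls the whole content of the text element into memory, so this variant gives up part of the low memory footprint of the fgets() approach.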