I want to extract the words between ";" and ":" from an XML file, for example the word "Index" here:
bla bla bla ; Index : bla bla
The file is loaded from its URL using file_get_contents:
$output = file_get_contents("https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Exporter/Base_de_donn%C3%A9es");
preg_match_all('/\;.[a-zA-Z]+.\:/', $output, $matches, PREG_SET_ORDER, 0);
var_dump($matches);
The regex pattern works fine on the same file content on regex101, and also when I copy the text into a string variable. But the code above does not work: it returns only the last match.
What am I doing wrong?
PS: I also tried loading the XML file using DOMDocument, with the same result.
Here is a way to do it with a low memory footprint. Several considerations:
To get at the text tag content, I suggest using XMLReader instead of DOMDocument.
The lines you are looking for start with ; (without indentation) and are always immediately followed by : on the next line.
Once you see that (right click, view source), you can choose to read the file line by line (instead of loading the whole file with file_get_contents) and use a generator function to select the interesting lines:
$url = 'https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Exporter/Base_de_donn%C3%A9es';
$handle = fopen($url, 'rb');
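// Yields each line starting with ';' that is immediately followed by a line starting with ':'
function filterLines($handle) {
    $temp = false; // last ';' line seen, waiting for its ':' line
    // fgets returns false at end of file (or on error), which ends the loop
    while (($line = fgets($handle)) !== false) {
        if ($line[0] === ';') {
            $temp = $line;
            continue;
        }
        if ($temp !== false && $line[0] === ':') {
            yield $temp;
        }
        $temp = false;
    }
}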
foreach (filterLines($handle) as $line) {
    // Keep runs of Latin-script words (accented letters included) separated by single spaces
    if (preg_match_all('~\b\p{Latin}+(?: \p{Latin}+)*\b~u', $line, $matches)) {
        echo implode(', ', $matches[0]), PHP_EOL;
    }
}
fclose($handle);
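If you would rather go through the XML structure, as suggested with XMLReader, here is a minimal sketch of that variant. The sample document, the function name extractTerms and the multiline regex are my own assumptions for illustration, not something taken from the real export (which has many more elements and a default namespace, hence the comparison on localName). It only relies on the same observation as above: a term line starts with ; and the next line starts with :.

```php
<?php
// Hypothetical, shortened stand-in for the exported page (assumption, not the real file).
$xml = <<<'XML'
<mediawiki>
  <page>
    <title>Sample</title>
    <revision>
      <text>intro
; Index
: structure that speeds up lookups
; Clé primaire
: identifiant unique
</text>
    </revision>
  </page>
</mediawiki>
XML;

// Hypothetical helper: returns the terms found in every <text> element.
function extractTerms(string $xml): array {
    $reader = new XMLReader();
    $reader->XML($xml); // for a remote file, $reader->open($url) could be used instead

    $terms = [];
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'text') {
            $text = $reader->readString();
            // ^; at the start of a line, the term, then a line break followed by ':'
            if (preg_match_all('~^;\h*(.+?)\h*\R(?=:)~mu', $text, $m)) {
                $terms = array_merge($terms, $m[1]);
            }
        }
    }
    $reader->close();

    return $terms;
}

print_r(extractTerms($xml));
```

Unlike the \p{Latin} pattern above, this regex keeps the whole term line as is, so digits or punctuation inside a term are preserved. Whether that is what you want depends on the page content.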