Using regex function in a while loop


I have a function that gets a specific link from a specific website, and it works. The problem starts when I call this function in a while loop: the links get longer and longer on each iteration for some reason.

function getLinks($link) {

$link1 = $link;
$content = file_get_contents($link1);

$content = str_replace("<", "", $content);
$content = str_replace(">", "", $content);

preg_match("~previous page.+?next page~i", $content, $match);
preg_match("~\"(/.+?)\"~i", $match[0], $match);
$link2 = "https://en.wiktionary.org".$match[1];

echo $link1."<br>";
echo $link2."<br>";

return $link2;

}


$firstLink = getLinks("https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=AUTOPILOT%0Aautopilot#mw-pages");

Result of $firstLink = getLinks(...):

https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=AUTOPILOT%0Aautopilot#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=BAGSIE%0Abagsie#mw-pages

^--- See how it works fine when it's like this? Then when I put it in a while loop:

$count = 0; 
while ($count < 5) {

$count++;
$firstLink = getLinks($firstLink);

}

The results come out totally messed up, and the links start to stack up on each other, like so:

https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=AUTOPILOT%0Aautopilot#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=BAGSIE%0Abagsie#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=BAGSIE%0Abagsie#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bpagefrom=BAGSIE%0Abagsie&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bpagefrom=BAGSIE%0Abagsie&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bamp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bamp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bamp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bamp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bamp%3Bamp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bamp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages

This is driving me insane, so if anyone knows what I did wrong, please, please tell me. Thank you.

Regular function in while loop:

function addOne($num) {

echo $num."<br>";   
$num++;
return $num;    

}

$num = 0;
$count = 0;
while ($count < 5) {

$count++;
$num = addOne($num);    

}

^---Works just fine


Solution

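  • Your problem is with HTML entities. The href you capture from the page source is HTML-escaped, so every & in it arrives as &amp;. Passing that string straight back into file_get_contents() requests a URL with a parameter literally named amp;pagefrom, which the site appears to carry over into the next link, so each pass adds another layer. I've rewritten the function to address that issue, skip repeated URLs, and make it more efficient. You call it with a depth parameter, which in your case would be your while loop's max.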

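    function getLinks($linkd, $depth, $checked=array()) {
    
        // Accept either a single URL or an array of URLs.
        if(!is_array($linkd)) $linkd=array($linkd);
        foreach($linkd as $link)
        {
            // Skip URLs that were already fetched.
            if(isset($checked[$link])) continue;
            $link1 = $link;
            $content = file_get_contents($link1);
            if($content === false) return null; // the request failed
    
            $content = str_replace("<", "", $content);
            $content = str_replace(">", "", $content);
    
            // Stop if the pagination block or the link inside it is missing.
            if(!preg_match("~previous page.+?next page~i", $content, $match)) return null;
            if(!preg_match("~\"(/.+?)\"~i", $match[0], $match)) return null;
            $link2 = "https://en.wiktionary.org".$match[1];
    
            echo $link1."<br>";
            echo $link2."<br>";
    
            $checked[$link] = true;
    
            if($depth>0)
            {
                $depth--;
                // The captured href is HTML-escaped (&amp;), so decode it before requesting it.
                return getLinks(html_entity_decode($link2), $depth, $checked);
            }
            else
            {
                return $link2;
            }
    
        }
    }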
    
    
    $firstLink = "https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=AUTOPILOT%0Aautopilot#mw-pages";
    
    $firstLink = getLinks($firstLink, 5);
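
To see the entity problem in isolation, here is a minimal sketch that runs without a network request. Both extractNextLink() and the $sample markup are made up for illustration (the snippet only imitates the pagination links and is not copied from Wiktionary); the two regexes are the ones from the question.

```php
<?php
// Hypothetical helper: same extraction steps as getLinks(), but it takes
// the page source as a string instead of fetching it.
function extractNextLink($content, $decode = true) {
    $content = str_replace(array("<", ">"), "", $content);

    // Isolate the pagination block, then the first quoted path inside it.
    if (!preg_match("~previous page.+?next page~i", $content, $block)) {
        return null;
    }
    if (!preg_match("~\"(/.+?)\"~i", $block[0], $href)) {
        return null;
    }

    // In HTML source, "&" inside an attribute is written as "&amp;".
    // Decode it before the value is used as a URL.
    $path = $decode ? html_entity_decode($href[1]) : $href[1];

    return "https://en.wiktionary.org" . $path;
}

// Made-up markup that imitates the pagination links (assumption, not real page source).
$sample = '(previous page) (<a href="/w/index.php?title=Category:English_verbs'
        . '&amp;pagefrom=BAGSIE%0Abagsie#mw-pages">next page</a>)';

$raw  = extractNextLink($sample, false); // what the question's code passes on
$next = extractNextLink($sample);        // decoded, safe to request

echo $raw, "\n";
echo $next, "\n";
```

The first line printed still contains &amp;pagefrom, which is the string the original loop fed back into file_get_contents(); the second contains a plain &pagefrom.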