Tags: php, symfony, guzzle, goutte, domcrawler

Join URLs in symfony/goutte


I have a Goutte\Client (Goutte uses Symfony components for its requests), and I would like to join a path to the current URL and get the final URL:

$client = new Goutte\Client();
$crawler = $client->request('GET', 'http://DOMAIN/some/path/');
// $crawler is an instance of Symfony\Component\DomCrawler\Crawler

$new_path = '../new_page';
$final_path = $crawler->someMagicFunction($new_path);
// $final_path == http://DOMAIN/some/new_page

What I'm looking for is an easy way to join the $new_path variable with the current page from the request and get the new URL.

Note that $new_path can be any of:

new_page    ==> http://DOMAIN/some/path/new_page
../new_page ==> http://DOMAIN/some/new_page
/new_page   ==> http://DOMAIN/new_page

Do Symfony, Goutte, or Guzzle provide an easy way to do this?

I found getUriForPath in Symfony\Component\HttpFoundation\Request, but I don't see an easy way to convert the Symfony\Component\BrowserKit\Request into an HttpFoundation\Request.


Solution

  • You can use parse_url to get the URL's path:

    $components = parse_url('http://DOMAIN/some/path/');
    $path = $components['path'];
    

    Then you need a way to normalize it. This answer can help you:

    function normalizePath($path, $separator = '\\/')
    {
        // Remove invisible Unicode characters (category C, e.g. control characters) and a leading "./"
        $normalized = preg_replace('#\p{C}+|^\./#u', '', $path);
    
        // Remove self-referring path segments ("/./").
        $normalized = preg_replace('#/\.(?=/)|^\./|\./$#', '', $normalized);
    
        // Regex for resolving relative paths
        $regex = '#\/*[^/\.]+/\.\.#Uu';
    
        while (preg_match($regex, $normalized)) {
            $normalized = preg_replace($regex, '', $normalized);
        }
    
        if (preg_match('#/\.{2}|\.{2}/#', $normalized)) {
            throw new LogicException('Path is outside of the defined root, path: [' . $path . '], resolved: [' . $normalized . ']');
        }
    
        return trim($normalized, $separator);
    }
    

    All that's left is to rebuild the URL. You can use the function from this comment:

    function unparse_url($parsed_url) { 
        $scheme   = isset($parsed_url['scheme']) ? $parsed_url['scheme'] . '://' : ''; 
        $host     = isset($parsed_url['host']) ? $parsed_url['host'] : ''; 
        $port     = isset($parsed_url['port']) ? ':' . $parsed_url['port'] : ''; 
        $user     = isset($parsed_url['user']) ? $parsed_url['user'] : ''; 
        $pass     = isset($parsed_url['pass']) ? ':' . $parsed_url['pass']  : ''; 
        $pass     = ($user || $pass) ? "$pass@" : ''; 
        $path     = isset($parsed_url['path']) ? $parsed_url['path'] : ''; 
        $query    = isset($parsed_url['query']) ? '?' . $parsed_url['query'] : ''; 
        $fragment = isset($parsed_url['fragment']) ? '#' . $parsed_url['fragment'] : ''; 
        // normalizePath() trims the leading slash, so "/" is re-added before $path here
        return "$scheme$user$pass$host$port/$path$query$fragment";
    }
    

    Final path:

    $new_path = '../new_page';
    
    if (strpos($new_path, '/') === 0) { // absolute path, replace it entirely
        $path = $new_path;
    } else { // relative path, append it
        $path = $path . $new_path;
    }
    

    Put it all together:

    // http://DOMAIN/some/new_page
    echo unparse_url(array_replace($components, array('path' => normalizePath($path))));
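
The steps above have two limits. The regex in normalizePath excludes dots from segment names, so a segment such as "v1.2" followed by ".." is not collapsed correctly. Appending $new_path to $path also assumes the base path ends in "/". Below is a minimal sketch in plain PHP that avoids both by resolving the segments with a stack. The function names resolveUrl and removeDotSegments are hypothetical and are not part of Goutte, Symfony, or Guzzle. It assumes an absolute base URL and a path-only reference; protocol-relative references ("//host/...") are not handled, and any query string or fragment on the base is dropped.

```php
<?php

/**
 * Collapse "." and ".." segments using a stack instead of regexes,
 * so segment names that contain dots (e.g. "v1.2") are left alone.
 * Empty segments ("//") are collapsed as well, which is a simplification.
 */
function removeDotSegments(string $path): string
{
    $output = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($output); // go up one level; extra ".." at the root is ignored
        } elseif ($segment !== '.' && $segment !== '') {
            $output[] = $segment;
        }
    }

    $result = '/' . implode('/', $output);

    // Keep a trailing slash when the input ended in "/", "/." or "/.."
    if ($output && preg_match('#/(\.\.?)?$#', $path)) {
        $result .= '/';
    }

    return $result;
}

/**
 * Resolve a path reference against an absolute base URL.
 */
function resolveUrl(string $base, string $relative): string
{
    // A reference that already has a scheme is returned unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $relative)) {
        return $relative;
    }

    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Base URL must be absolute: $base");
    }
    $basePath = $parts['path'] ?? '/';

    if ($relative === '') {
        $path = $basePath;
    } elseif ($relative[0] === '/') {
        $path = $relative; // root-relative: replace the whole path
    } else {
        // Drop everything after the last "/" of the base path, then append.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relative;
    }

    // Rebuild scheme://[user[:pass]@]host[:port] from the base URL.
    $authority = $parts['host'] . (isset($parts['port']) ? ':' . $parts['port'] : '');
    if (isset($parts['user'])) {
        $userInfo = $parts['user'] . (isset($parts['pass']) ? ':' . $parts['pass'] : '');
        $authority = $userInfo . '@' . $authority;
    }

    return $parts['scheme'] . '://' . $authority . removeDotSegments($path);
}

echo resolveUrl('http://DOMAIN/some/path/', 'new_page'), "\n";    // http://DOMAIN/some/path/new_page
echo resolveUrl('http://DOMAIN/some/path/', '../new_page'), "\n"; // http://DOMAIN/some/new_page
echo resolveUrl('http://DOMAIN/some/path/', '/new_page'), "\n";   // http://DOMAIN/new_page
```

With Goutte, the base URL can be taken from the crawler itself, for example resolveUrl($crawler->getUri(), $new_path).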