php, path, relative-path, string-parsing

Untangling directory separator madness using string manipulation?


I'm working on converting a website. It involves standardizing the directory structure of images and media files. I'm parsing path information from various tags, standardizing the paths, checking whether the media exists in the new standardized location, and putting it there if it doesn't. I'm using string manipulation to do so.

This is a little open-ended, but is there a class, tool, or concept out there I can use to save myself some headaches? For instance, I'm running into problems where, say, a page in a subdirectory (website.com/subdir/dir/page.php) has relative image paths (../images/image.png), or other things like this. It's not like there's one overarching problem, just a lot of little things that add up.

When I think I've got my script covering most cases, I get errors like "Could not find file at export/standardized_folder/proper_image_folderimage.png" where it should be export/standardized_folder/proper_image_folder/image.png. It's kind of driving me mad, doing string parsing and checks to make sure that directory separators are in the proper places.

I feel like I'm putting too much work into making a one-off import script very robust. Perhaps someone's already untangled this mess in a re-useable way, one which I can take advantage of?

Post Script: So here's a more in-depth scoop. I write my script so that it parses one "type" of page and pulls content from other pages of the same kind. Then I turn my script to parse another type of page, get all kinds of errors, and learn that all my assumptions about how paths are referenced must be thrown out the window. Wash, rinse, repeat.

So I'm looking at doing some major re-factoring of my script, throwing out all assumptions, and checking, re-checking, and double-checking path information. Since I'm really trying to build a robust path building script, hopefully I can avoid re-inventing the wheel. Is there a wheel out there?


Solution

  • If your problems are rooted in resolving the relative links in a document to absolute ones (which should be half the job of mapping the linked image paths onto the file system), I normally use Net_URL2 from PEAR. It's a simple class that just does the job.

    To install, as root just call

    # pear install channel://pear.php.net/Net_URL2-0.3.1
    

    Even though it's a beta package, it's really stable.

    A little example: let's say there is an array with all the image srcs in question and a base URL for the document:

    require_once('Net/URL2.php');
    
    $baseUrl = 'http://www.example.com/test/images.html';
    
    $docSrcs = array(...);
    
    $baseUrl = new Net_URL2($baseUrl);
    
    foreach($docSrcs as $href)
    {
        $url = $baseUrl->resolve($href);
        echo ' * ', $href, ' -> ', $url->getURL(), "\n";
        // or
        echo " $href -> $url\n"; # Net_URL2 supports string context
    }
    

    This will convert any relative links into absolute ones based on your base URL. The base URL is first of all the document's address. The document can override it by specifying another one with the base element, so you could look that up with the HTML parser you're already using (as well as the src and href values).
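
    As a rough sketch of that lookup, assuming PHP's bundled DOM extension is the HTML parser in use: findBaseUrl() is a made-up helper name, not part of Net_URL2. It returns the href of the first base element, or falls back to the document's own address when there is none.

    ```php
    <?php
    // Hypothetical helper (not part of Net_URL2): find the URL that a
    // document's relative links should be resolved against.
    function findBaseUrl(string $html, string $documentUrl): string
    {
        $doc = new DOMDocument();

        // Real-world HTML is rarely valid, so collect parser warnings
        // internally instead of printing them.
        $previous = libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        foreach ($doc->getElementsByTagName('base') as $base) {
            $href = trim($base->getAttribute('href'));
            if ($href !== '') {
                return $href; // the first <base href="..."> wins
            }
        }

        // No usable base element: the document's address is the base URL.
        return $documentUrl;
    }

    $html = '<html><head><base href="http://www.example.com/assets/"></head>'
          . '<body><img src="../images/a.png"></body></html>';
    echo findBaseUrl($html, 'http://www.example.com/test/images.html'), "\n";
    // prints http://www.example.com/assets/
    ```

    A base href can itself be relative. In that case it would need to be resolved against the document's address first, for example with resolve() as shown above.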

    Net_URL2 reflects the current RFC 3986 to do the URL resolving.

    Another thing that might be handy for your URL handling is the getNormalizedURL function. It removes some potential error cases, such as needless dot segments, which is useful if you need to compare one URL with another and, naturally, for mapping the URL to a path:

    foreach($docSrcs as $href)
    {
        $url = $baseUrl->resolve($href);
        # keep the Net_URL2 object in $url; the normalized form gets its own variable
        $normalized = $url->getNormalizedURL();
        echo " $href -> $normalized\n";
    }
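
    To illustrate what resolving and dot-segment removal amount to for the path part alone, here is a minimal plain-PHP sketch. It is not Net_URL2's implementation: removeDotSegments() and resolveRelativePath() are hypothetical helpers that loosely follow RFC 3986 section 5.2.4 and assume the base path is absolute (starts with a slash). They ignore schemes, hosts, and query strings, which is exactly why a real URL class is the better tool for the full job.

    ```php
    <?php
    // Hypothetical helper: collapse "." and ".." segments in an absolute path.
    function removeDotSegments(string $path): string
    {
        $output = array();
        $segments = explode('/', $path);
        $last = count($segments) - 1;

        foreach ($segments as $i => $segment) {
            if ($segment === '.' || $segment === '..') {
                // ".." drops the previous segment, but never the leading
                // empty entry that stands for the root slash.
                if ($segment === '..' && count($output) > 1) {
                    array_pop($output);
                }
                // A trailing "." or ".." names a directory: keep the final slash.
                if ($i === $last) {
                    $output[] = '';
                }
                continue;
            }
            $output[] = $segment;
        }

        return implode('/', $output);
    }

    // Hypothetical helper: resolve $href against the path of the base document.
    function resolveRelativePath(string $basePath, string $href): string
    {
        if ($href !== '' && $href[0] === '/') {
            return removeDotSegments($href); // already rooted at the site
        }
        // Keep the base path up to and including its last slash, then append.
        $dir = substr($basePath, 0, strrpos($basePath, '/') + 1);
        return removeDotSegments($dir . $href);
    }

    echo resolveRelativePath('/subdir/dir/page.php', '../images/image.png'), "\n";
    // prints /subdir/images/image.png
    ```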
    

    Once all URLs are resolved to absolute ones and normalized, you can decide whether or not they belong to your site. As long as the URL is still a Net_URL2 instance, you can use one of its many functions to do that:

    $host = strtolower($url->getHost());
    if (in_array($host, array('example.com', 'www.example.com')))
    {
        # URL is on my server, process it further
    }
    

    What's left is the concrete path to the file in the URL:

    $path = $url->getPath();
    

    That path, considering you're comparing against a UNIX file-system, should be easy to prefix with a concrete base directory:

    $filesystemImagePath = '/var/www/site-new/images';
    $newPath = $filesystemImagePath . $path;
    if (is_file($newPath))
    {
        # new image already exists.
    }
    

    If you have trouble combining the base path with the image path, keep in mind that the image path will always have a slash at the beginning.
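
    For the separator problem from the question (proper_image_folderimage.png), a small join helper takes the guesswork out of concatenation. This is only a sketch: joinPath() is a hypothetical name, and it assumes forward slashes as on a UNIX file system. It trims slashes at each seam and puts exactly one back, so it no longer matters whether a part starts or ends with a slash.

    ```php
    <?php
    // Hypothetical helper: join path parts with exactly one "/" between them.
    function joinPath(string ...$parts): string
    {
        $joined = '';
        foreach ($parts as $part) {
            if ($part === '') {
                continue; // skip empty parts so no doubled separators appear
            }
            if ($joined === '') {
                $joined = $part; // keep a leading slash on the first part
            } else {
                $joined = rtrim($joined, '/') . '/' . ltrim($part, '/');
            }
        }
        return $joined;
    }

    echo joinPath('export/standardized_folder/proper_image_folder', 'image.png'), "\n";
    // prints export/standardized_folder/proper_image_folder/image.png

    echo joinPath('/var/www/site-new/images/', '/test/image.png'), "\n";
    // prints /var/www/site-new/images/test/image.png
    ```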

    Hope this helps.