Tags: php, regex, url, preg-match, whitespace

How to filter URLs that contain whitespace with preg_match?


I am parsing a text that contains several links. Some of them contain whitespace but still end with a file extension. My current pattern is:

preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $links, $match);

This one behaves the same way:

preg_match_all('/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/', $links, $match);

I don't know much about regex patterns and haven't found a good tutorial that explains the syntax and shows examples.

How could I match a URL like this: http://my-url.com/my doc.doc or even http://my-url.com/my doc with more white spaces.doc?

The \s in those preg_match_all patterns stands for a whitespace character. But how could I check whether a file extension follows one or more whitespace characters?

Is it possible?
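
To make the problem concrete, here is a small reproduction sketch. The $text string is a made-up sample built from the first example URL, and $m1 and $m2 are just illustrative variable names; both patterns are the ones quoted above.

```php
<?php
// Hypothetical sample text containing one URL with a space in the filename.
$text = 'Download http://my-url.com/my doc.doc here';

// First pattern: [^\s()<>]+ excludes whitespace, so the match cannot cross the space.
preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $text, $m1);

// Second pattern: \S* after the slash also stops at the first whitespace character.
preg_match_all('/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/', $text, $m2);

print_r($m1[0]); // full matches of the first pattern
print_r($m2[0]); // full matches of the second pattern
```

Both patterns cut the URL off at http://my-url.com/my, which is exactly the behaviour I want to get around.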


Solution

  • Alright, after working through this really helpful tutorial I finally understand how the regex syntax works. After finishing it I experimented a bit on this site.

    It turned out to be pretty easy once I figured out that all hyperlinks in my parsed document were enclosed in quotation marks, so I just had to change the regex to:

    preg_match_all('#\bhttps?://[^()<>"]+#', $links, $match);
    

    Whitespace is no longer excluded, so each match runs up to the closing " and the search then continues with the next occurrence of http.
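
    As a quick check of that behaviour, here is a minimal sketch. The sample $links string is made up for illustration; it only assumes, as described above, that every link sits between double quotes.

```php
<?php
// Hypothetical sample input: two links enclosed in double quotes.
$links = '<a href="http://my-url.com/my doc.doc">one</a> '
       . '<a href="http://my-url.com/my doc with more white spaces.doc">two</a>';

// The character class excludes ( ) < > and " but not whitespace,
// so each match runs until the closing quote.
preg_match_all('#\bhttps?://[^()<>"]+#', $links, $match);

print_r($match[0]); // the full URLs, spaces included
```

    Note that this only works because a quote reliably terminates each link; in free-running text without such a delimiter the match would keep going past the end of the URL.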

    But that's not the full solution yet. The user Class was right: without applying rawurlencode to the filenames it won't work.

    So the next step was this:

    // Returns true if $haystack ends with $needle.
    function endsWith($haystack, $needle)
    {
        return $needle === "" || substr($haystack, -strlen($needle)) === $needle;
    }

    if (endsWith($textlink, ".doc") || endsWith($textlink, ".docx") || endsWith($textlink, ".pdf") || endsWith($textlink, ".jpg") || endsWith($textlink, ".jpeg") || endsWith($textlink, ".png")) {
        // Everything after the last slash is the filename.
        $file = substr($textlink, strrpos($textlink, '/') + 1);
        // Everything up to and including the last slash stays untouched.
        $rest_url = substr($textlink, 0, strrpos($textlink, '/') + 1);
        // Encode only the filename, so spaces become %20.
        $textlink = $rest_url . rawurlencode($file);
    }
    

    That extracts the filenames from the URLs and rawurlencodes them so that the output links are correct.
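
    Putting both steps together, the whole flow might look roughly like this. It is only a sketch: encodeFilename and the sample $text are hypothetical names, and the extension check via pathinfo and strtolower is a swapped-in, case-insensitive variation on the endsWith chain, not the original code.

```php
<?php
// Hypothetical helper: percent-encodes only the filename part of a URL
// when it ends in one of the listed extensions (compared case-insensitively).
function encodeFilename(string $url, array $extensions = ['doc', 'docx', 'pdf', 'jpg', 'jpeg', 'png']): string
{
    $slash = strrpos($url, '/');
    if ($slash === false) {
        return $url; // no slash at all, nothing to split
    }

    $file = substr($url, $slash + 1);
    $ext  = strtolower(pathinfo($file, PATHINFO_EXTENSION));
    if (!in_array($ext, $extensions, true)) {
        return $url; // not one of the file types of interest, leave as-is
    }

    // Keep everything up to the last slash and encode only the filename.
    return substr($url, 0, $slash + 1) . rawurlencode($file);
}

// Hypothetical sample text with quoted links, as in the parsed document.
$text = 'See "http://my-url.com/my doc.doc" and "http://my-url.com/page".';

preg_match_all('#\bhttps?://[^()<>"]+#', $text, $match);
$encoded = array_map('encodeFilename', $match[0]);

print_r($encoded);
```

    The first link comes out with my%20doc.doc as its filename, while the second has no listed extension and passes through unchanged.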