I am parsing a text that contains several links. Some of them contain whitespace but still end with a file extension. My current pattern is:
preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $links, $match);
This works the same way:
preg_match_all('/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/', $links, $match);
I don't know much about regex patterns and haven't found a good tutorial that explains the meaning of the possible patterns and shows examples.
How could I match a URL like this: http://my-url.com/my doc.doc or even http://my-url.com/my doc with more white spaces.doc
The \s in those preg_match_all calls stands for a whitespace character. But how could I check whether there is a file extension after one or more spaces?
Is it possible?
Alright, after working through this really helpful tutorial I finally know how the regex syntax works. After finishing it I experimented a bit on this site.
It was pretty easy once I figured out that all hyperlinks in my parsed document were between quotation marks, so I just had to change the regex to:
preg_match_all('#\bhttps?://[^()<>"]+#', $links, $match);
so that each match stops at the closing " and the search then continues with the next match that begins with http.
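To illustrate how that pattern behaves, here is a minimal sketch. The sample string is an assumption: it simply places the two example URLs from the question inside quoted href attributes, since that is how the links appeared in my document.

```php
<?php
// Hypothetical sample input: the example URLs from the question,
// wrapped in quotation marks as they were in the parsed document.
$links = '<a href="http://my-url.com/my doc.doc">one</a> '
       . '<a href="http://my-url.com/my doc with more white spaces.doc">two</a>';

// [^()<>"]+ consumes everything, spaces included, up to the closing quote.
preg_match_all('#\bhttps?://[^()<>"]+#', $links, $match);

print_r($match[0]);
// $match[0][0] is 'http://my-url.com/my doc.doc'
// $match[0][1] is 'http://my-url.com/my doc with more white spaces.doc'
```

Because the character class no longer excludes whitespace, this only works when every link is terminated by a quotation mark (or one of the other excluded characters).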
But that's not the full solution yet. The user Class was right: without applying rawurlencode to the filenames, the links won't work.
So the next step was this:
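function endsWith($haystack, $needle)
{
    // True when $haystack ends with $needle (an empty needle always matches).
    return $needle === "" || substr($haystack, -strlen($needle)) === $needle;
}

// Only touch links that point to one of the expected file types.
if (endsWith($textlink, ".doc") || endsWith($textlink, ".docx") || endsWith($textlink, ".pdf") || endsWith($textlink, ".jpg") || endsWith($textlink, ".jpeg") || endsWith($textlink, ".png")) {
    // Split at the last slash: everything after it is the filename.
    $file = substr($textlink, strrpos($textlink, '/') + 1);
    $rest_url = substr($textlink, 0, strrpos($textlink, '/') + 1);
    // Encode only the filename so the rest of the URL stays intact.
    $textlink = $rest_url . rawurlencode($file);
}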
That extracts the filenames from the URLs and rawurlencodes them so that the output links are correct.
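The same step could also be wrapped in a small function. The following is only a sketch of one possible variant, not what I actually used: the function name encodeFileName and the default extension list are made up for illustration, and it uses PHP's pathinfo() instead of my endsWith() helper, which also makes the extension check case-insensitive.

```php
<?php
// Hypothetical helper; the name and the default extension list are assumptions.
function encodeFileName(string $url, array $extensions = ['doc', 'docx', 'pdf', 'jpg', 'jpeg', 'png']): string
{
    // pathinfo() returns the part after the last dot of the last path segment,
    // or an empty string when there is no extension.
    $ext = strtolower(pathinfo($url, PATHINFO_EXTENSION));
    if (!in_array($ext, $extensions, true)) {
        return $url; // not one of the listed file types, leave untouched
    }

    $slash = strrpos($url, '/');
    if ($slash === false) {
        return $url; // no path separator, nothing to split
    }

    // Encode only the filename so the scheme, host and directories stay intact.
    return substr($url, 0, $slash + 1) . rawurlencode(substr($url, $slash + 1));
}

echo encodeFileName('http://my-url.com/my doc.doc'), "\n";
// http://my-url.com/my%20doc.doc
```

Encoding only the part after the last slash matters here: running rawurlencode over the whole URL would also escape the colon and slashes and break the link.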