php regex preg-match strpos

Using preg_match to discover and validate types of links embedded in HTML


I have implemented a function to validate .edu domains. This is how I am doing it:

if( preg_match('/edu/', $matches[0])==FALSE )
    return FALSE;
return TRUE;

Now I also want to skip URLs that point to documents such as .pdf and .doc files.

For this, the following code should have worked, but it does not:

if( preg_match('/edu/', $matches[0])==FALSE || preg_match('/pdf/i', $matches[0])!=FALSE || preg_match('/doc/i', $matches[0]!=FALSE))
        return FALSE;
return TRUE;

Where am I going wrong? Also, how can I use preg_match to check a URL string against a list of document types, so that it returns false if any of them is found? In other words, I want to provide a list (an array, maybe) of document types as the $pattern to look for in a URL.

Note: $matches[0] contains the whole URL string, e.g. http://www.nust.edu.pk/Documents/pdf/NNBS_Form.pdf

The code for the function:

public function validateEduDomain($url) {
    // get host name from URL
    preg_match('@^(?:http://)?([^/]+)@i', $url, $matches);
    $host = $matches[1];

    // get last two segments of host name
    preg_match('/[^.]+\.[^.]+$/', $host, $matches);

    if( preg_match('/edu/', $matches[0])!=FALSE && (preg_match('/pdf/i', $matches[0])==FALSE || preg_match('/doc/i', $matches[0]==FALSE)))      
        return TRUE;
    return FALSE;
}

Solution

  • I wonder why you are making everything so complicated. I also noticed you have $$matches[0] instead of $matches[0]. The condition you want is:

    if( preg_match('/^https?:\/\/[A-Za-z]+[A-Za-z0-9\.-]+\.edu/i', $matches[0]) && !preg_match('/\.(pdf|doc)$/i', $matches[0]) ) {
        // do something here...
    }
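
The question also asks how to pass a list of document types rather than one hard-coded pattern. A minimal sketch of one way to do that follows. The function name isEduPageUrl, the default extension list, and the rule for what counts as an .edu host are all assumptions made for illustration, not part of the original post. It uses parse_url() to split host and path instead of the two preg_match calls in the question.

```php
<?php
// Hypothetical helper (not from the original post).
// Returns true for a URL on an .edu / .edu.xx host whose path does not end
// in one of the listed document extensions.
function isEduPageUrl($url, array $blockedExtensions = array('pdf', 'doc', 'docx'))
{
    // parse_url() needs a scheme to recognise the host; scheme-less input
    // such as "www.nust.edu.pk/x" yields no host and is rejected here.
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false;
    }

    // Assumption: "edu" must be the last host label, or the second-to-last
    // one before a two-letter country code (mit.edu, nust.edu.pk).
    if (preg_match('/(^|\.)edu(\.[a-z]{2})?$/i', $host) !== 1) {
        return false;
    }

    if (count($blockedExtensions) === 0) {
        return true; // nothing to block
    }

    // Turn the list into one alternation, e.g. \.(?:pdf|doc|docx)$
    $escaped = array_map(function ($ext) {
        return preg_quote($ext, '/');
    }, $blockedExtensions);
    $pattern = '/\.(?:' . implode('|', $escaped) . ')$/i';

    // Test only the path, so a query string after the file name is ignored.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        $path = ''; // URL without a path, e.g. http://mit.edu
    }

    return preg_match($pattern, $path) !== 1;
}
```

Under these assumptions the example URL from the question, http://www.nust.edu.pk/Documents/pdf/NNBS_Form.pdf, is rejected, while http://www.nust.edu.pk/Documents/pdf/ is accepted because only the file extension is tested, not the directory name. Two details in the question's code are worth noting as well: in preg_match('/doc/i', $matches[0]!=FALSE) the closing parenthesis sits after the comparison, so the result of the comparison is passed as the subject; and inside validateEduDomain the second preg_match overwrites $matches, so $matches[0] holds only the last two host segments and can never contain a file extension.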