php regex html-content-extraction

Extract specific part of URL from string


I need to extract only part of a URL with PHP, but I am struggling to set the point where the extraction should stop. I used a regex to extract the entire URL from a longer string like this:

$regex = '/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i';
preg_match_all($regex, $href, $matches);

The result is the following string:

http://www.cambridgeenglish.org/test-your-english/&amp;sa=U&amp;ei=a4rbU8agB-zY0QWS_IGYDw&amp;ved=0CFEQFjAL&amp;usg=AFQjCNGU4FMUPB2ZuVM45OoqQ39rJbfveg

Now I want to extract only this bit: http://www.cambridgeenglish.org/test-your-english/. I basically need to get rid of everything from &amp; onwards.

Does anyone have an idea how to achieve this? Do I need to run another regex, or can I add it to the initial one?


Solution

  • The regex below removes everything from the string &amp onwards. Your PHP code would be:

    <?php
    // Remove "&amp" and everything after it, up to the end of the string.
    echo preg_replace('~&amp.*$~', '', 'http://www.cambridgeenglish.org/test-your-english/&amp;sa=U&amp;ei=a4rbU8agB-zY0QWS_IGYDw&amp;ved=0CFEQFjAL&amp;usg=AFQjCNGU4FMUPB2ZuVM45OoqQ39rJbfveg');
    // => http://www.cambridgeenglish.org/test-your-english/
    ?>
    

    Explanation:

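    • &amp matches the literal string &amp.
    • .* matches any character (except a newline) zero or more times.
    • $ matches the end of the string (without the m modifier, PHP treats the whole subject as one line).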
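To address the question of whether this needs a second regex, here is a minimal sketch that runs the two steps back to back. It assumes the extraction regex from the question is kept unchanged. The helper name extractBaseUrls and the sample $href value are made up for illustration and do not come from the original post.

```php
<?php
// Hypothetical helper: finds every URL in $text, then cuts each one
// at the first "&amp" (the same replacement used in the answer above).
function extractBaseUrls(string $text): array
{
    // URL pattern copied from the question.
    $regex = '/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i';
    preg_match_all($regex, $text, $matches);

    // $matches[0] holds the full matches; preg_replace accepts an array
    // subject and returns an array with the replacement applied to each item.
    return preg_replace('~&amp.*$~', '', $matches[0]);
}

// Assumed sample input, shaped like the string in the question.
$href = '/url?q=http://www.cambridgeenglish.org/test-your-english/&amp;sa=U&amp;ei=a4rbU8agB-zY0QWS_IGYDw';
print_r(extractBaseUrls($href));
```

With this input the returned array should contain the single entry http://www.cambridgeenglish.org/test-your-english/. Keeping the trimming as a separate step leaves the original URL pattern untouched, which is probably easier to maintain than folding a stop condition into its character class.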