php regex idn

Splitting up an IDN URL in PHP


I'm trying to split an IDN URL such as http://exämple.se/path or https://äxämple.se/anotherpath?foo=bar&baf=bas into its components, like so:

[0] http(s)://
[1] äxämple.se
[2] /anotherpath?foo=bar&baf=bas

My first thought was "I'll just use parse_url!". Well, except it doesn't handle IDNs, so no luck.

Next I tried a bunch of my own regex tricks but failed to get any useful output (some of them worked to a degree but were still painfully lacking).

Finally I tried various other people's regex patterns, but none of them worked right for me (worked right = captured anything useful). One captured the whole URL as its "protocol" part; most of the others captured nothing or were clearly functionally identical to ones I'd already tried.

And of course, why am I doing this? I want to run idn_to_ascii on the domain name before piecing the URL back together and storing it in a db.

So, what am I doing wrong here? Is my approach completely wrong or is there some magic invocation of preg_match which will fix my problem?

Edit: Preferably I'd like a solution which doesn't involve downloading a blob of code someone else wrote (like say, a custom class named something like ParseIDNUrl weighing in at 100kB)


Solution

  • parse_url should work fine. Using PHP 5.3.4 I've been able to extract just the domain part:

    print parse_url('http://äxämple.se/foobar', PHP_URL_HOST);
    

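    You may need to adjust the encoding if the output looks garbled, for example when your terminal or page expects ISO-8859-1 rather than UTF-8:

    // utf8_decode() converts UTF-8 to ISO-8859-1; it is deprecated as of PHP 8.2
    print utf8_decode(parse_url('http://äxämple.se/foobar', PHP_URL_HOST));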
    

    The output I get is:

    äxämple.se
    

    Hope that helps!
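
If you want the exact three-part split from the question, a single preg_match may also be enough. This is only a minimal sketch under a few assumptions: the URL is http or https, it carries no user:password@ part, and the intl extension is loaded so that idn_to_ascii is available (without it, the host is left untouched). The helper names split_idn_url and idn_url_to_ascii are made up for illustration.

```php
<?php
// Hypothetical helper: split a URL into [scheme prefix, host, everything else].
// Assumes an http(s) URL without a "user:pass@" part.
function split_idn_url(string $url): ?array
{
    // The u flag makes PCRE treat the pattern and subject as UTF-8.
    // The host stops at the first "/", "?", "#" or ":" so a port ends up in the rest.
    if (!preg_match('~^(https?://)([^/?#:]+)(.*)$~iu', $url, $m)) {
        return null; // no match, or the input was not valid UTF-8
    }
    return [$m[1], $m[2], $m[3]];
}

// Hypothetical helper: convert only the host to its ASCII form and rebuild the URL.
function idn_url_to_ascii(string $url): ?string
{
    $parts = split_idn_url($url);
    if ($parts === null) {
        return null;
    }
    [$scheme, $host, $rest] = $parts;

    // idn_to_ascii() comes from the intl extension; skip the conversion if it is missing.
    if (function_exists('idn_to_ascii')) {
        $ascii = idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
        if ($ascii === false) {
            return null; // the host could not be converted
        }
        $host = $ascii;
    }
    return $scheme . $host . $rest;
}
```

Calling split_idn_url('https://äxämple.se/anotherpath?foo=bar&baf=bas') gives the three parts listed in the question: 'https://', 'äxämple.se' and '/anotherpath?foo=bar&baf=bas'. idn_url_to_ascii() then passes only the middle part through idn_to_ascii before gluing the pieces back together, so the path and query string are stored exactly as they came in.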