I'm trying to take an IDN URL along the lines of http://exämple.se/path
or https://äxämple.se/anotherpath?foo=bar&baf=bas
and split it so that I get its components like so:
[0] http(s)://
[1] äxämple.se
[2] /anotherpath?foo=bar&baf=bas
My first thought was "I'll just use parse_url!" Well, except it doesn't handle IDN domains, so no luck.
Next I tried a bunch of my own regex tricks but somehow failed to get any useful output (some of them worked to a degree but were still painfully lacking).
Finally I tried various other people's regex patterns, but none of them worked right for me (worked right = captured anything useful). One captured the whole URL as its "protocol" part; most of the others I ran across captured nothing or were clearly functionally identical to ones I'd already tried.
And of course, why am I doing this? I want to run idn_to_ascii on the domain name before piecing the URL back together and storing it in a db.
So, what am I doing wrong here? Is my approach completely wrong, or is there some magic invocation of preg_match which will fix my problem?
Edit: Preferably I'd like a solution which doesn't involve downloading a blob of code someone else wrote (like, say, a custom class named something like ParseIDNUrl weighing in at 100kB).
parse_url should work fine. Using PHP 5.3.4 I've been able to extract just the domain part:
print parse_url('http://äxämple.se/foobar', PHP_URL_HOST);
Depending on the encoding your page or terminal expects, you may need to convert the result (utf8_decode converts UTF-8 to ISO-8859-1):
print utf8_decode(parse_url('http://äxämple.se/foobar', PHP_URL_HOST));
The output I get is:
äxämple.se
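If parse_url doesn't cooperate on your setup, a small preg_match split is another option. The following is only a rough sketch, not a full URL parser: the function names split_idn_url and idn_url_to_ascii are made up for illustration, it assumes PHP 7.1+ and UTF-8 input, and it treats everything between "://" and the first "/", "?" or "#" as the host (so a port or user:pass@ prefix would need extra handling). The idn_to_ascii call comes from the intl extension and is skipped when that extension isn't loaded.

```php
<?php
// Hypothetical helper: split a URL into [scheme, host, rest].
// Returns null if the string doesn't look like scheme://host...
function split_idn_url(string $url): ?array
{
    // 1: scheme plus "://"   2: host (up to first /, ? or #)   3: the remainder
    // The "u" modifier makes PCRE treat the subject as UTF-8.
    if (!preg_match('~^([a-z][a-z0-9+.-]*://)([^/?#]+)(.*)$~iu', $url, $m)) {
        return null;
    }
    return [$m[1], $m[2], $m[3]];
}

// Hypothetical helper: rebuild the URL with the host converted to ASCII.
function idn_url_to_ascii(string $url): ?string
{
    $parts = split_idn_url($url);
    if ($parts === null) {
        return null;
    }
    [$scheme, $host, $rest] = $parts;

    // idn_to_ascii() needs the intl extension; leave the host as-is without it.
    if (function_exists('idn_to_ascii')) {
        $ascii = idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
        if ($ascii === false) {
            return null; // conversion failed
        }
        $host = $ascii;
    }
    return $scheme . $host . $rest;
}

print_r(split_idn_url('https://äxämple.se/anotherpath?foo=bar&baf=bas'));
```

The array returned by split_idn_url matches the [0], [1], [2] layout from the question, so you can run idn_to_ascii on element 1 and concatenate the three pieces again before storing the URL.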
Hope that helps!