In Python, there is a standard library module urllib.parse
that deals with parsing URLs:
>>> import urllib.parse
>>> p = urllib.parse.urlparse("https://127.0.0.1:6443")
>>> p
ParseResult(scheme='https', netloc='127.0.0.1:6443', path='', params='', query='', fragment='')
There are also properties on urllib.parse.ParseResult
that return the hostname and the port:
>>> p.hostname
'127.0.0.1'
>>> p.port
6443
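As a small sketch of how these properties behave on less trivial input (the URLs below are illustrative values, not ones taken from the session above): `hostname` strips the brackets around an IPv6 literal and lowercases the name, while `username` and `password` expose the userinfo part of `netloc`.

```python
import urllib.parse

# Illustrative URL with userinfo and a bracketed IPv6 host.
p6 = urllib.parse.urlparse("https://user:pass@[::1]:6443/path")

print(p6.netloc)    # user:pass@[::1]:6443
print(p6.username)  # user
print(p6.password)  # pass
print(p6.hostname)  # ::1  (brackets are stripped)
print(p6.port)      # 6443

# hostname is lowercased, while netloc keeps the original spelling.
mixed = urllib.parse.urlparse("https://X.com")
print(mixed.netloc)    # X.com
print(mixed.hostname)  # x.com
```

None of these properties are tuple fields, so all of them have to be reassembled into `netloc` by hand when building a modified URL.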
And, by virtue of ParseResult being a namedtuple, it has a _replace()
method that returns a new ParseResult with the given field(s) replaced:
>>> p._replace(netloc="foobar.tld")
ParseResult(scheme='https', netloc='foobar.tld', path='', params='', query='', fragment='')
However, it cannot replace hostname or port because they are dynamic properties rather than fields of the tuple:
>>> p._replace(hostname="foobar.tld")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.11/collections/__init__.py", line 455, in _replace
    raise ValueError(f'Got unexpected field names: {list(kwds)!r}')
ValueError: Got unexpected field names: ['hostname']
It might be tempting to simply concatenate the new hostname with the existing port and pass it as the new netloc:
>>> p._replace(netloc='{}:{}'.format("foobar.tld", p.port))
ParseResult(scheme='https', netloc='foobar.tld:6443', path='', params='', query='', fragment='')
However, this quickly turns into a mess if we consider that the URL may contain a username and password (https://user:pass@hostname.tld) and that IPv6 addresses must be bracketed (https://::1 isn't valid but https://[::1] is).
What is the cleanest, correct way to replace the hostname in a URL in Python?
The solution must handle IPv6 (both as a part of the original URL and as the replacement value), URLs containing username/password, and in general all well-formed URLs.
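To make these requirements concrete, here is a minimal sketch of how the netloc concatenation shown earlier behaves on such inputs. The helper name `naive_host_replace` is hypothetical and exists only for this illustration.

```python
import urllib.parse

def naive_host_replace(url, new_host):
    """The tempting approach: rebuild netloc from the new host and the old port."""
    p = urllib.parse.urlparse(url)
    return p._replace(netloc="{}:{}".format(new_host, p.port)).geturl()

# Credentials are silently dropped.
print(naive_host_replace("https://user:pass@hostname.tld:6443", "foobar.tld"))
# https://foobar.tld:6443

# A URL without a port ends up with a literal "None" as its port.
print(naive_host_replace("https://hostname.tld", "foobar.tld"))
# https://foobar.tld:None

# An IPv6 replacement is not bracketed, so the result is not a valid URL.
print(naive_host_replace("https://hostname.tld:6443", "::1"))
# https://::1:6443
```

Each failure comes from treating `netloc` as just "host:port" when it can also carry userinfo, an optional port, and bracketed IPv6 literals.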
(There is a wide assortment of existing posts that try to ask the same question, but none of them ask for (or provide) a solution that fits all of the criteria above.)
Nice nerd snipe. Quite difficult to get right.
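import urllib.parse
import socket


def is_ipv6(s):
    """Return True if s is a valid IPv6 address literal (without brackets)."""
    try:
        socket.inet_pton(socket.AF_INET6, s)
    except Exception:
        return False
    else:
        return True


def host_replace(url, new_host):
    parsed = urllib.parse.urlparse(url)
    # Drop any "user:pass@" prefix, leaving "host" or "host:port".
    _, _, host = parsed.netloc.rpartition("@")
    _, sep, bracketed = host.partition("[")
    if sep:
        # Bracketed IPv6 literal: take the text between "[" and "]".
        host, _, _ = bracketed.partition("]")
        ipv6 = True
    else:
        # Hostname or IPv4 address; might have a port suffix.
        host, _, _ = host.partition(':')
        ipv6 = False
    new_ipv6 = is_ipv6(new_host)
    if ipv6 and not new_ipv6:
        # Match the old brackets too, so that they are dropped.
        host = f"[{host}]"
    elif not ipv6 and new_ipv6:
        # The new IPv6 literal needs brackets of its own.
        new_host = f"[{new_host}]"
    port = parsed.port
    netloc = parsed.netloc
    # Strip the port before searching, so a host that looks like the
    # port (e.g. "80:80") is still replaced correctly.
    if port is not None:
        netloc = netloc.removesuffix(f":{port}")  # requires Python 3.9+
    left, sep, right = netloc.rpartition(host)
    new_netloc = left + new_host + right
    if port is not None:
        new_netloc += f":{port}"
    new_url = parsed._replace(netloc=new_netloc).geturl()
    return new_url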
I also include my test-cases:
tests = [
    ("https://x.com", "example.org", "https://example.org"),
    ("https://X.com", "example.org", "https://example.org"),
    ("https://x.com/", "example.org", "https://example.org/"),
    ("https://x.com/i.html", "example.org", "https://example.org/i.html"),
    ("https://x.com:8888", "example.org", "https://example.org:8888"),
    ("https://u@x.com:8888", "example.org", "https://u@example.org:8888"),
    ("https://u:p@x.com:8888", "example.org", "https://u:p@example.org:8888"),
    ("https://[::1]:1234", "example.org", "https://example.org:1234"),
    ("https://[::1]:1234", "::2", "https://[::2]:1234"),
    ("https://x.com", "::2", "https://[::2]"),
    ("http://u:p@80:80", "foo", "http://u:p@foo:80"),
]
for url, new_host, expect in tests:
    actual = host_replace(url, new_host)
    assert actual == expect, f"\n{actual=}\n{expect=}"
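If relying on `socket.inet_pton` is undesirable, one possible alternative for the IPv6 check is the standard library's `ipaddress` module. This is only a sketch, and the name `is_ipv6_alt` is hypothetical; it is not part of the answer's code and may differ from `inet_pton` on unusual inputs such as scoped addresses.

```python
import ipaddress

def is_ipv6_alt(s):
    """Return True if the string s parses as an IPv6 literal (without brackets)."""
    try:
        ipaddress.IPv6Address(s)
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False
    return True

print(is_ipv6_alt("::2"))          # True
print(is_ipv6_alt("example.org"))  # False
print(is_ipv6_alt("127.0.0.1"))    # False
print(is_ipv6_alt("[::1]"))        # False (brackets are not part of the address)
```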