I am working on a web crawler these days. When my crawler gathers the links on a site, some of the URLs look like this:
about.html
/pages
#form-login
javascript:validate();
../help
../../
./
I have tried urllib's urlparse and urljoin, and the os module's join functions. The part of my project's code that is related to the question is given below.
from urllib.parse import urlparse, urljoin

base_url = input('Enter base url : ')

def make_links(link):
    u = urlparse(link)
    if link[:3] == 'www':
        link = u['scheme'] + link
    elif link[:1] == '/':
        link = base_url + link
    elif link[:3] == '../':
        link = urljoin(base_url, link)
    elif link[:2] == './':
        link = urljoin(base_url, link)
    link = base_url + '/' + link

while True:
    i = input("Enter your url : ")
    if i == 'exit':
I expect the relative URL entered by the user to be resolved against the base URL entered by the user. When the user enters an absolute URL as base_url and then enters a relative URL, the output should be the absolute URL at which the page can be opened in a browser. The program should also support any type of relative URL. If you want to know the ways relative URLs can be written, refer to this,
It should not execute JavaScript when the program comes across URLs like
urljoin does most of the heavy lifting for you. Hence, something as simple as this would do the trick:
def make_links(link):
    url = urljoin(base_url, link)
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.scheme.startswith('http'):
        # invalid, e.g. javascript, etc.
        return None
    return url
Enter base url : http://example.com/dir1/file.php
Enter your url : ../dir2
Enter your url : #hello
Enter your url : javascript: return false
Enter your url : /world
Enter your url : www.test.com
Enter your url : http://www.test.com
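If you would rather check this without typing into prompts, here is a small non-interactive sketch of the same session. It assumes the same base URL as above and hard-codes it instead of reading it with input(); the function body is unchanged.

```python
from urllib.parse import urljoin, urlparse

# Assumption: the base URL from the session above, hard-coded for the demo.
base_url = 'http://example.com/dir1/file.php'

def make_links(link):
    url = urljoin(base_url, link)
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.scheme.startswith('http'):
        # invalid, e.g. javascript, etc.
        return None
    return url

samples = [
    '../dir2',                  # -> http://example.com/dir2
    '#hello',                   # -> http://example.com/dir1/file.php#hello
    'javascript: return false', # -> None
    '/world',                   # -> http://example.com/world
    'www.test.com',             # -> http://example.com/dir1/www.test.com
    'http://www.test.com',      # -> http://www.test.com
]

for link in samples:
    print(repr(link), '->', make_links(link))
```

Note that www.test.com is resolved as a path relative to the base, not as a separate host.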
As you can see, the only downside is that external URLs have to start with http; a bare www.test.com is treated as a relative path. And this actually makes sense, as there are no strict rules: a website could use www as a subresource...
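If your crawler does want to guess that links starting with www. are hosts, one possible sketch is below. This is only a heuristic and not something urllib does for you: the function name make_links_guessing_www is made up for illustration, and it will misread a site that really has a relative path beginning with www. It turns the link into a scheme-relative URL (//www...) so that urljoin reuses the scheme of the base URL, and it only accepts http and https results.

```python
from urllib.parse import urljoin, urlparse

def make_links_guessing_www(base_url, link):
    """Hypothetical variant: treat links starting with 'www.' as hosts."""
    if link.lower().startswith('www.'):
        # Heuristic assumption: this is a host name, not a relative path.
        # A leading '//' makes urljoin inherit the scheme from base_url.
        link = '//' + link
    url = urljoin(base_url, link)
    if urlparse(url).scheme not in ('http', 'https'):
        # e.g. javascript:, mailto:, or no scheme at all
        return None
    return url

base = 'http://example.com/dir1/file.php'
print(make_links_guessing_www(base, 'www.test.com'))
print(make_links_guessing_www(base, '../dir2'))
print(make_links_guessing_www(base, 'javascript:validate();'))
```

Whether this guess is acceptable depends on the sites you crawl, so it is probably best kept as an opt-in.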