web-scraping, url, similarity, sentence-similarity

How to check the similarity score between two web URLs?


I'm working on a project that frequently needs to check the similarity score between two web URLs. Initially I did this by scraping all the text from each web page and then calculating the document similarity. However, this is really time-consuming. What I'm looking for instead is a way to estimate the similarity from the URL strings themselves, without going through all the page text.

eg:
url1:  https://en.wikipedia.org/wiki/Tic-tac-toe
url2:  https://en.wikipedia.org/wiki/Chess
a rough similarity estimate: 67% (since both are from wiki and both are related to games)

Solution

  • You are probably better off comparing the individual pieces of each URL rather than the whole string: foo.com/a/b/c and boo.com/a/b/c would get a high sequence-similarity score but would probably have very different contents.

    For this you can use:

    • Python's urllib.parse.urlparse() to split URLs into parts such as netloc (domain), path and query parameters.
    • Python's difflib.SequenceMatcher, which can tell how similar two strings are.
    • w3lib.url.canonicalize_url to normalize your URLs, since differences such as parameter order can make URLs look very different even though they point to the same content. See the w3lib docs for more.
    from difflib import SequenceMatcher
    from w3lib.url import canonicalize_url
    from urllib.parse import urlparse
    
    
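    def compare_urls(url1, url2):
        """Return per-component similarity ratios (0.0 to 1.0) for two URLs."""
        # Normalize first so equivalent URLs (e.g. reordered parameters) compare equal.
        url1 = canonicalize_url(url1)
        url2 = canonicalize_url(url2)
        url1_parsed = urlparse(url1)
        url2_parsed = urlparse(url2)
        # Compare each component separately instead of the whole URL string.
        domain = SequenceMatcher(None, url1_parsed.netloc, url2_parsed.netloc).ratio()
        path = SequenceMatcher(None, url1_parsed.path, url2_parsed.path).ratio()
        # Two empty query strings count as identical (ratio 1.0).
        query = SequenceMatcher(None, url1_parsed.query, url2_parsed.query).ratio()
        return {
            "domain": domain,
            "path": path,
            "query": query,
        }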
    
    if __name__ == "__main__":
        print(compare_urls(
            "https://en.wikipedia.org/wiki/Tic-tac-toe",
            "https://en.wikipedia.org/wiki/Chess"
        ))
    # prints: {'domain': 1.0, 'path': 0.5, 'query': 1.0}
    

    By running the sequence comparison separately on netloc (domain), path and query parameters, you can assign a weight to each score and combine them into a comparison algorithm that better fits your data.
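
One way the weighting idea could look is sketched below. This is only an illustration, not a tuned algorithm: the function name `weighted_url_similarity` and the weights in `WEIGHTS` are assumptions made up for this example, and the `canonicalize_url` step is left out so the sketch runs on the standard library alone.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical weights chosen for illustration only; tune them for your data.
WEIGHTS = {"domain": 0.5, "path": 0.4, "query": 0.1}


def weighted_url_similarity(url1, url2, weights=WEIGHTS):
    """Combine per-component similarity ratios into one score in [0, 1]."""
    a, b = urlparse(url1), urlparse(url2)
    scores = {
        "domain": SequenceMatcher(None, a.netloc, b.netloc).ratio(),
        "path": SequenceMatcher(None, a.path, b.path).ratio(),
        "query": SequenceMatcher(None, a.query, b.query).ratio(),
    }
    # Weighted average; dividing by the total keeps the result in [0, 1]
    # even if the weights do not sum to exactly 1.
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in scores) / total


if __name__ == "__main__":
    print(weighted_url_similarity(
        "https://en.wikipedia.org/wiki/Tic-tac-toe",
        "https://en.wikipedia.org/wiki/Chess",
    ))
```

With these example weights the two Wikipedia URLs score about 0.8, because the identical domain and the (empty) query strings outweigh the half-matching path. Lowering the query weight, or skipping the query when both are empty, would be reasonable variations to try.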