Search code examples
algorithmhashsystem

Designing a URL Shortening service like TinyURL


I'm reading an online document that explains how to design a url shortening service. The website is https://www.educative.io/courses/grokking-the-system-design-interview .

In the section, Encoding actual URL, they said -> "We can compute a unique hash (e.g., MD5 or SHA256, etc.) of the given URL. The hash can then be encoded for displaying. This encoding could be base36 ([a-z ,0-9]) or base62 ([A-Z, a-z, 0-9]) and if we add ‘+’ and ‘/’ we can use Base64 encoding. A reasonable question would be, what should be the length of the short key? 6, 8, or 10 characters."

"If we use the MD5 algorithm as our hash function, it’ll produce a 128-bit hash value. After base64 encoding, we’ll get a string having more than 21 characters (since each base64 character encodes 6 bits of the hash value).Since we only have space for 8 characters per short key, how will we choose our key then? We can take the first 6 (or 8) letters for the key. This could result in key duplication, to resolve that, we can choose some other characters out of the encoding string or swap some characters."

I used online MD5 hash generator (http://onlinemd5.com/) and Base64 encoder (https://www.base64encode.org/) to verify the above. I used "www.yahoo.com" as the input string for MD5 hash and output is 1B03577ED104F16AADC00A639D33CB44 . Then I Base64 encoded it and got MUIwMzU3N0VEMTA0RjE2QUFEQzAwQTYzOUQzM0NCNDQ= with UTF-8 destination charset and Unix newline seperator.

Can anyone explain if I'm doing it correctly? I see the number of characters are way more than 21.


Solution

  • The problem is that you are using the output of MD5 as a string of hexadecimal digits, and then base64 encoding that string. There's no reason to base64 encode that string - base64 encoding is intended for binary data. What you probably wanted to do is base64 the actual 128-bit binary value of the MD5 hash. Here is some Python code that does what I think you are trying to do:

    import hashlib, base64
    
    text = "www.yahoo.com"
    text_utf8 = text.encode('utf8')
    md5 = hashlib.md5(text_utf8).digest()
    b64 = base64.b64encode(md5)
    print(b64)
    

    This gets the result GwNXftEE8WqtwApjnTPLRA which has the length you were expecting.