
How to decode text with special symbols using base64 in python3?


I am trying to decode a list of strings using the base64 module. Some of them decode fine, but the ones that contain special symbols fail.

import base64

# List of strings we are trying to decode
encoded_text_list = ['MTA0MDI0','MTA0MDYw','MTA0MDgz','MTA0MzI%3D']
    
# Iterating and decoding string using base64    
for k in encoded_text_list:
    print(k, base64.b64decode(k).decode())

Output:

MTA0MDI0 104024
MTA0MDYw 104060
MTA0MDgz 104083

---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
<ipython-input-60-d1ba00f4e54a> in <module>
      2 for k in member_url_list:
      3     print(k)
----> 4     print(base64.b64decode(k).decode())
      5     # break

/usr/lib/python3.6/base64.py in b64decode(s, altchars, validate)
     85     if validate and not re.match(b'^[A-Za-z0-9+/]*={0,2}$', s):
     86         raise binascii.Error('Non-base64 digit found')
---> 87     return binascii.a2b_base64(s)
     88 
     89 

Error: Incorrect padding

The script works until it reaches the string 'MTA0MzI%3D', where it raises the error above.

Since the strings in the list come from URLs, I also tried the unquote function from urllib.parse.

import base64
from urllib.parse import unquote
b64_string = 'MTA0MzI%3D'
b64_string = unquote(b64_string) # 'MTA0MzI=' 
b64_string += "=" * ((4 - len(b64_string) % 4) % 4)
print(base64.b64decode(b64_string).decode())

Output:

10432

Expected Output:

104327

Now the output may seem correct, but it isn't what I expect: unquoting changes the input from 'MTA0MzI%3D' to 'MTA0MzI=', and the output changes from '104327' to '10432'. The original text, symbol included, decodes to 104327 on this base64 site.

I have tried different versions of Python (2, 3.6, 3.8, etc.), as well as the codecs module and several other base64 functions, without success. Can someone help me get this working, or suggest another way to do it?


Solution

  • These are URL-quoted strings, so URL-unquoting them is the correct procedure. First unquote them with urllib.parse.unquote, and only then base64-decode the result. There is no need to manually adjust the base64 padding character =.

    The website you reference ignores invalid base64 characters and infers the padding from the length of the base64-encoded data. So when you give it MTA0MzI%3D, it throws away the % because it is not a valid base64 character, processes MTA0MzI3D, and returns 104327. The 3 left over from the escape sequence %3D is decoded as data, which is where the extra 7 comes from. Base64 padding is redundant when the length of the data is known, and I'm not sure why some base64 encoding standards require it, but many do.

    Example:

    import base64
    import urllib.parse
    
    # List of strings we are trying to decode
    encoded_text_list = ['MTA0MDI0', 'MTA0MDYw', 'MTA0MDgz', 'MTA0MzI%3D']
    
    # Iterating and decoding string using base64
    for k in encoded_text_list:
        url_unquoted = urllib.parse.unquote(k)
        print(k, base64.b64decode(url_unquoted).decode('utf-8'))
    

    Output

    MTA0MDI0 104024
    MTA0MDYw 104060
    MTA0MDgz 104083
    MTA0MzI%3D 10432
    

    and 10432 is the correct output, not 104327.
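
To see where the extra 7 comes from, the website's behaviour described above can be approximated in a few lines. This is only a sketch under assumptions: lenient_b64decode is a hypothetical helper written for illustration, and the rules it applies (drop every character outside the standard base64 alphabet, drop a dangling final character, then add padding) are a guess at what such a site does, not its actual code.

```python
import base64
import re
from urllib.parse import unquote


def lenient_b64decode(text):
    """Hypothetical helper: decode the way a lenient decoder might."""
    # Drop everything outside the standard base64 alphabet, including '%' and '='.
    cleaned = re.sub(r'[^A-Za-z0-9+/]', '', text)
    # A single leftover character carries only 6 bits and cannot form a byte.
    if len(cleaned) % 4 == 1:
        cleaned = cleaned[:-1]
    # Infer the padding from the length.
    cleaned += '=' * (-len(cleaned) % 4)
    return base64.b64decode(cleaned)


raw = 'MTA0MzI%3D'

# Lenient path: '%' is discarded, so the '3' of '%3D' is decoded as data.
print(lenient_b64decode(raw).decode())          # 104327

# Correct path: '%3D' is the URL escape for '=', i.e. real padding.
print(base64.b64decode(unquote(raw)).decode())  # 10432
```

The lenient result is an artifact of treating part of a URL escape sequence as base64 data, which is why unquoting first gives the value that was actually encoded.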