python unicode sendgrid punycode

Python: convert Punycode back to Unicode


I'm trying to add contacts to SendGrid from a database that occasionally stores the user's email address in Punycode ([email protected]), which corresponds to example-email@yahóo.com in Unicode.

If I try to add the ASCII version, I get an error because SendGrid doesn't accept it. It does, however, accept the Unicode version.

So, long story short: is there a way in Python to decode Punycode back to Unicode?

Edit

As suggested in the comments, I tried 'example-email@yahóo.com'.encode('punycode').decode(), which returns [email protected]. That result is not valid outside of Python, so it is not a solution.

Thanks in advance.


Solution

  • Your encoded e-mail address contains the xn-- ACE prefix:

    The ACE prefix for IDNA is "xn--" or any capitalization thereof.

    So apply the idna codec to the domain part of the address (see Python Specific Encodings):

    Codec idna: Implement RFC 3490, see also encodings.idna. Only errors='strict' is supported.

    Result:

    'yahóo.com'.encode('idna').decode()
    # 'xn--yaho-sqa.com'
    

    and vice versa:

    'xn--yaho-sqa.com'.encode().decode('idna')
    # 'yahóo.com'
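
The snippets above work on a bare domain, while the question is about whole e-mail addresses. The raw punycode codec tried in the question treats the entire string as a single label and adds no xn-- prefix, which is why its output is not usable elsewhere. A minimal sketch of applying the idna codec to the domain part only follows; the helper name email_to_unicode and the sample address are assumptions made for illustration, not part of any library.

```python
def email_to_unicode(address: str) -> str:
    """Decode an IDNA (xn--) domain in an e-mail address to Unicode.

    Hypothetical helper for illustration; only the domain is decoded,
    the local part is returned unchanged.
    """
    # Split on the last "@" so only the domain part is touched.
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError(f"not an e-mail address: {address!r}")
    # The built-in "idna" codec (RFC 3490) decodes each xn-- label.
    return f"{local}@{domain.encode('ascii').decode('idna')}"


# Sample address assumed for illustration, built from the domain above.
print(email_to_unicode("example-email@xn--yaho-sqa.com"))
# example-email@yahóo.com
```

Addresses whose domain is already plain ASCII without an xn-- label pass through unchanged, so the helper should be safe to run over every row.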
    

    You could use the idna library instead:

    Support for the Internationalised Domain Names in Applications (IDNA) protocol as specified in RFC 5891. This is the latest version of the protocol and is sometimes referred to as “IDNA 2008”.

    This library also provides support for Unicode Technical Standard 46, Unicode IDNA Compatibility Processing.

    This acts as a suitable replacement for the “encodings.idna” module that comes with the Python standard library, but which only supports the older superseded IDNA specification (RFC 3490).
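
For comparison, here is a rough sketch of the same round trip with the third-party idna package (pip install idna). The wrapper names domain_to_unicode and domain_to_ascii are made up for this example, and the fallback to the standard-library codec when the package is missing is an assumption about how you might want it to behave, not something the library does itself.

```python
try:
    import idna  # third-party package implementing IDNA 2008
except ImportError:
    idna = None  # assumption: fall back to the stdlib RFC 3490 codec


def domain_to_unicode(domain: str) -> str:
    """Decode an ASCII (xn--) domain name to its Unicode form."""
    if idna is not None:
        return idna.decode(domain)  # returns str
    return domain.encode("ascii").decode("idna")


def domain_to_ascii(domain: str) -> str:
    """Encode a Unicode domain name to its ASCII (xn--) form."""
    if idna is not None:
        return idna.encode(domain).decode("ascii")  # idna.encode returns bytes
    return domain.encode("idna").decode("ascii")


print(domain_to_unicode("xn--yaho-sqa.com"))
print(domain_to_ascii("yahóo.com"))
```

Both code paths agree for this domain; they can differ for names where IDNA 2003 and IDNA 2008 disagree, so pick whichever matches what the receiving service expects.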