Search code examples
pythonunicodeencodingemojisurrogate-pairs

Python: Find equivalent surrogate pair from non-BMP unicode char


The answer presented here: How to work with surrogate pairs in Python? tells you how to convert a surrogate pair, such as '\ud83d\ude4f' into a single non-BMP unicode character (the answer being "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')). I would like to know how to do this in reverse. How can I, using Python, find the equivalent surrogate pair from a non-BMP character, converting '\U0001f64f' (🙏) back to '\ud83d\ude4f'. I couldn't find a clear answer to that.


Solution

  • You'll have to manually replace each non-BMP point with the surrogate pair. You could do this with a regular expression:

    import re
    
    _nonbmp = re.compile(r'[\U00010000-\U0010FFFF]')
    
    def _surrogatepair(match):
        char = match.group()
        assert ord(char) > 0xffff
        encoded = char.encode('utf-16-le')
        return (
            chr(int.from_bytes(encoded[:2], 'little')) + 
            chr(int.from_bytes(encoded[2:], 'little')))
    
    def with_surrogates(text):
        return _nonbmp.sub(_surrogatepair, text)
    

    Demo:

    >>> with_surrogates('\U0001f64f')
    '\ud83d\ude4f'