python algorithm utf-8 substring repr

Is there an easy way in Python to get a substring of a UTF-8 encoded string whose repr is at most N characters long?


For example, I have a string, and I am hoping to find an easy way to get a substring of its UTF-8 encoding such that the length of the substring's repr is <= N. Of course I could start with a substring of N/3 bytes and then try N/3+1, N/3+2, ..., but is there an easier way?

word = u"this is a ship, and some other words".encode("utf-8")
# some way to get a substring (func is the function I am looking for)
substring = func(word, N)
# assert len(repr(substring)) <= N

Thanks!


Solution

  • A possible approach:

    1. Take the first N-1 bytes of the repr of the whole string.
    2. Examine the last 3 bytes to see whether you broke an escape sequence, and cut off bytes if necessary.
    3. Append a closing quote, keeping in mind that it may be ' or ".
    4. Eval the repr back to UTF-8 bytes.
    5. Examine the last few bytes to see whether you broke the string in the middle of a Unicode code point, and cut off bytes if necessary. You can tell leading bytes and continuation bytes apart by examining the bit pattern (continuation bytes have the form 10xxxxxx).
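
The steps above can be sketched roughly as follows. This is only one possible reading of the approach, and it assumes Python 3, where the repr of a bytes object starts with a two-character prefix such as b' (in Python 2 the repr of a str has no b, so the prefix would be one character). The names truncate_for_repr, drop_partial_escape and drop_partial_utf8 are made up for illustration, and the input is assumed to be valid UTF-8. Instead of looking only at the last 3 characters, the sketch scans the truncated repr from the start, which avoids misreading a run of backslashes.

```python
import ast


def drop_partial_escape(body: str) -> str:
    """Step 2: cut a trailing, incomplete escape off a truncated repr body."""
    i = 0
    while i < len(body):
        if body[i] == "\\":
            # The repr of bytes emits either \xNN (4 chars) or a 2-char
            # escape such as \\, \n, \t, \r, \' or \".
            width = 4 if body[i + 1:i + 2] == "x" else 2
            if i + width > len(body):
                break  # the truncation cut this escape in half
            i += width
        else:
            i += 1
    return body[:i]


def drop_partial_utf8(chunk: bytes) -> bytes:
    """Step 5: cut a trailing, incomplete UTF-8 sequence (valid input assumed)."""
    i = len(chunk)
    # Walk back over continuation bytes, which look like 10xxxxxx.
    while i > 0 and (chunk[i - 1] & 0xC0) == 0x80:
        i -= 1
    if i == 0:
        return b""  # empty, or no leading byte left at all
    lead = chunk[i - 1]
    if lead < 0x80:
        need = 1  # 0xxxxxxx: ASCII
    elif lead >> 5 == 0b110:
        need = 2  # 110xxxxx
    elif lead >> 4 == 0b1110:
        need = 3  # 1110xxxx
    else:
        need = 4  # 11110xxx
    have = len(chunk) - (i - 1)
    return chunk if have >= need else chunk[:i - 1]


def truncate_for_repr(data: bytes, n: int) -> bytes:
    """Return a prefix of data such that len(repr(prefix)) <= n."""
    full = repr(data)
    if len(full) <= n:
        return data
    if n < 3:
        raise ValueError("the repr of bytes is at least 3 characters (b'')")
    head, quote = full[:2], full[-1]           # e.g. b' and '
    body = drop_partial_escape(full[2:n - 1])  # steps 1 and 2
    chunk = ast.literal_eval(head + body + quote)  # steps 3 and 4
    return drop_partial_utf8(chunk)            # step 5
```

With the string from the question, truncate_for_repr(word, 10) gives b'this is', whose repr is exactly 10 characters long. ast.literal_eval is used rather than eval so that only a literal is ever evaluated.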