I have a Telegram game bot that uses first name / last name pairs to render a leaderboard of the users in a chat, ranked by score. Screenshot example below:
Every user in the list is rendered as a link to their profile. This is the actual code that generates a link:
import typing
from html import escape as html_escape

from telegram import User  # python-telegram-bot

EscapeType = typing.Literal['html']


def escape_string(s: str, escape: EscapeType | None = None) -> str:
    if escape == 'html':
        s = html_escape(s)
    elif escape is None:
        pass
    else:
        raise NotImplementedError(escape)
    return s


def getter(d):
    # Return a function that looks up a field on a User, a dict-like, or any object.
    if isinstance(d, User):
        return lambda attr: getattr(d, attr, None)
    elif hasattr(d, '__getitem__') and hasattr(d, 'get'):
        return lambda attr: d.get(attr, None)
    else:
        return lambda attr: getattr(d, attr, None)


def personal_appeal(user: User | dict, escape: EscapeType | None = 'html') -> str:
    get = getter(user)
    if full_name := get("full_name"):
        appeal = full_name
    elif name := get("name"):
        appeal = name
    elif first_name := get("first_name"):
        if last_name := get("last_name"):
            appeal = f"{first_name} {last_name}"
        else:
            appeal = first_name
    elif username := get('username'):
        appeal = username
    else:
        raise ValueError(user)
    return escape_string(appeal, escape)


def user_mention(id: int | User, name: str | None = None, escape: EscapeType | None = 'html') -> str:
    if isinstance(id, User):
        user = id
        id = user.id
        # Take the raw name here so that it is escaped exactly once below.
        name = personal_appeal(user, escape=None)
    if name is None:
        name = "N/A"
    else:
        name = escape_string(name, escape=escape)
    if id is not None:
        return f'<a href="tg://user?id={id}">{name}</a>'
    else:
        return name
Basically, this code generates a link from a user name / user ID pair. As you can see, the name is HTML-escaped by default.
However, one user's unusual first name somehow breaks this code. Here is the actual sequence of characters they use:
'$̴̢̛̙͈͚̎̓͆͑.̸̱̖͑͒ ̧̡͉̺̬͎̯.̸̧̢̠̺̮̬͙͛̓̀̐́.̵̦͑̉͌͌̎͘ ̞ ̷̡͈̤̓̀͋͗͊̈́̑̽͝'
Screenshot of the result of the same code run against this first name:
As you can see, Telegram seems to get lost in the markup. The link spills over onto unrelated characters, and the <b> tag is broken, too.
This is the actual string that is sent to the Telegram servers (except for the IDs, which I redacted):
🔝🏆 <u>Рейтинг игроков чата</u>:
🥇 1. <a href="tg://user?id=1">andy alexanderson</a> (<b>40</b>)
🥈 2. <a href="tg://user?id=2">$̴̢̛̙͈͚̎̓͆͑.̸̱̖͑͒ ̧̡͉̺̬͎̯.̸̧̢̠̺̮̬͙͛̓̀̐́.̵̦͑̉͌͌̎͘ ̞ ̷̡͈̤̓̀͋͗͊̈́̑̽͝</a> (<b>40</b>)
🤡 3. <a href="tg://user?id=3">: )</a> (<b>0</b>)
⏱️ <i>Рейтинг составлен 1 минуту назад</i>.
⏭️ <i>Следующее обновление через 28 минут</i>.
The only odd thing in this markup seems to be the nickname, though.
Is this a Telegram bug?
Can something be done to mitigate this, so that my users can't break out of the HTML markup? I am willing to sacrifice the correctness of how their names are displayed (since such users deliberately obfuscate their names), but I need some way to detect anything that would break the markup.
Or maybe there is some UTF-16 <-> UTF-8 encoding issue going on that I'm missing?
Framework used: python-telegram-bot.
Python version: 3.10.12.
As @roganjosh pointed out, this turns out to be a so-called "zalgo" sequence: ordinary characters overloaded with stacked combining diacritical marks. To remove the zalgo characters, I first found this decode function in an old JS library called lunicode.js. I found it by reverse-engineering this zalgo-text encoder-decoder website.
It turned out to be a very simple function, so here it is rewritten in Python:
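def remove_zalgo(txt: str) -> str:
    # Drop code points 768..865 (U+0300..U+0361), which lie in the
    # Combining Diacritical Marks block that zalgo text is built from.
    return ''.join([
        char
        for char in txt
        if ord(char) < 768 or ord(char) > 865
    ])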
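For illustration, here is one way the filter could be wired into the name-rendering path. This is only a sketch: the helper `safe_display_name` and its `fallback` parameter are hypothetical names that are not part of my bot's code, and it assumes the name should be stripped first and HTML-escaped afterwards, with a placeholder used when nothing visible is left.

```python
from html import escape as html_escape


def remove_zalgo(txt: str) -> str:
    # Same filter as above: drop code points U+0300..U+0361.
    return ''.join(char for char in txt if ord(char) < 768 or ord(char) > 865)


def safe_display_name(raw: str, fallback: str = "N/A") -> str:
    # Hypothetical helper: strip the marks, trim whitespace, then escape.
    cleaned = remove_zalgo(raw).strip()
    if not cleaned:
        # The name consisted only of combining marks and whitespace.
        return fallback
    return html_escape(cleaned)


if __name__ == "__main__":
    print(safe_display_name("Z\u0351a\u0310l\u0300go <b>"))
    print(safe_display_name("\u0300\u0301 "))
```

Stripping before escaping keeps the escape step as the last thing that touches the string, so the result can be dropped straight into the `<a>` tag.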
Now my markup doesn't break, and there are no zalgo characters in my users' names. I think it's a win :)
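A possible refinement, in case the fixed range turns out to be too narrow or too aggressive: instead of removing a hard-coded range, cap the number of consecutive combining marks using the standard `unicodedata` module. This is an untested-in-production sketch, not what my bot runs; the function name `limit_combining_marks` and the `max_marks` parameter are made up for this example, and whether Telegram accepts the capped output is an assumption I have not verified.

```python
import unicodedata


def limit_combining_marks(text: str, max_marks: int = 1) -> str:
    """Keep at most `max_marks` consecutive combining marks after a base character."""
    result = []
    run = 0  # length of the current run of combining marks
    for char in text:
        is_mark = (
            unicodedata.combining(char) != 0
            or unicodedata.category(char) in ("Mn", "Me")
        )
        if is_mark:
            run += 1
            if run > max_marks:
                continue  # drop marks beyond the allowed count
        else:
            run = 0
        result.append(char)
    return "".join(result)


if __name__ == "__main__":
    print(limit_combining_marks("e\u0301\u0302\u0303"))
```

With `max_marks=1`, a name written with decomposed accents keeps one mark per letter, while a zalgo pile-up is cut down to a single mark per base character. Setting `max_marks=0` removes every combining mark.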