How to use TextBuffer.register_serialize_format of PyGTK?

I'm using serialize and deserialize right now, and when decoding the serialized textbuffer with utf-8 I get this:

GTKTEXTBUFFERCONTENTS-0001 <text_view_markup>
 <tags>
  <tag name="bold" priority="1">
   <attr name="weight" type="gint" value="700" />
  </tag>
  <tag name="#efef29292929" priority="2">
   <attr name="foreground-gdk" type="GdkColor" value="efef:2929:2929" />
  </tag>
  <tag name="underline" priority="0">
   <attr name="underline" type="PangoUnderline" value="PANGO_UNDERLINE_SINGLE" />
  </tag>
 </tags>
<text><apply_tag name="underline">At</apply_tag> the first <apply_tag name="bold">comes</apply_tag> rock!  <apply_tag name="underline">Rock</apply_tag>, <apply_tag name="bold">paper,</apply_tag> <apply_tag name="#efef29292929">scissors!</apply_tag></text>
</text_view_markup>

I'm trying to convert these tags to HTML tags such as <u></u> and <b></b>. My earlier question about this was closed as a duplicate, so I'm asking it differently. How can I tell where each tag ends if they all end with a plain </apply_tag>, rather than something like </apply_tag name="nameoftag">? I tried this before:

def correctTags(text):
    tags = []
    newstring = ''
    for i in range(len(text)):
        # text[i+17] is the first character of the name in '<apply_tag name="...'
        if text[i] == '<' and i+18 <= len(text):
            if text[i+17] == '#':
                tags.append('</font>')
            elif text[i+17] == 'b':
                tags.append('</b>')
            elif text[i+17] == 'u':
                tags.append('</u>')

    newstring = text.replace('<apply_tag name="#', '<font color="#').replace('<apply_tag name="bold">', '<b>').replace('<apply_tag name="underline">', '<u>')

    # Replace the closing tags one by one, in the order the opening tags appeared
    for j in tags:
        newstring = newstring.replace('</apply_tag>', j, 1)

    return '<text>' + newstring + '</text>'

But there is a problem with nested tags: they get closed in the wrong places. I think the answer may be gtk.TextBuffer.register_serialize_format, since as I understand it this serializes using the MIME type I pass to it, such as HTML, and then I would know where the tags end. But I haven't found any extensive, beginner-friendly example of how to use it.

Solution

I found a way to get the tags out of the serialized textbuffer correctly at Serialising Gtk TextBuffers to HTML. It doesn't use register_serialize_format; as that site says, it is possible to write a serializer, but the documentation is sparse (I assume that approach would use register_serialize_format). Either way, the solution uses html.parser and xml.etree.ElementTree, though BeautifulSoup can be used instead.

Basically, this script processes the serialized textbuffer content with an HTML parser. The hard work starts in feed, which receives bytes (the serialized textbuffer content) and returns a string (the formatted text with HTML tags). First it finds the index of <text_view_markup> and drops the header that precedes it (GTKTEXTBUFFERCONTENTS-0001 and the length mark that follows). This header is what can't be decoded with decode('utf-8'); trying results in "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position : invalid start byte". You could use decode('utf-8', errors='ignore') or errors='replace' for that, but since feed drops this part, the rest is decoded with a plain .decode().
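
The header-dropping step described above can be tried in isolation. This is only a minimal sketch: the raw bytes are a made-up sample (real payloads carry their own length bytes), and strip_header is a hypothetical helper name, not part of the original script.

```python
# Made-up sample payload: the ASCII magic string, a few binary length bytes
# (one of them 0xb4, which is not valid UTF-8 on its own), then the XML.
raw = (
    b"GTKTEXTBUFFERCONTENTS-0001\x00\x00\x01\xb4"
    b"<text_view_markup>\n <tags>\n </tags>\n"
    b"<text>Rock</text>\n</text_view_markup>\n"
)


def strip_header(data: bytes) -> str:
    """Drop everything before <text_view_markup> and decode the rest."""
    start = data.find(b"<text_view_markup>")
    if start == -1:
        raise ValueError("not a serialized GtkTextBuffer payload")
    return data[start:].decode("utf-8")


markup = strip_header(raw)
```

Checking for -1 matters here: bytes.find returns -1 when the marker is missing, and slicing with data[-1:] would silently keep only the last byte.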

Then the tags and the text are handled separately. The tags are handled first, and here I used xml.etree.ElementTree, although BeautifulSoup can be used as in the original script. Once the tags are handled, the text is passed to the feed method of HTMLParser.
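
As a rough illustration of the tag-handling step, the <tags> block can be read with xml.etree.ElementTree on its own. The sample XML is copied from the question; read_tags is a hypothetical helper and is not how the script below is organised.

```python
from xml.etree.ElementTree import fromstring

# The <tags> block from the serialized buffer shown in the question.
tags_xml = """<tags>
  <tag name="bold" priority="1">
   <attr name="weight" type="gint" value="700" />
  </tag>
  <tag name="#efef29292929" priority="2">
   <attr name="foreground-gdk" type="GdkColor" value="efef:2929:2929" />
  </tag>
</tags>"""


def read_tags(xml: str) -> dict:
    """Map each tag name to a list of (attr name, type, value) triples."""
    result = {}
    for tag in fromstring(xml).iter("tag"):
        result[tag.attrib["name"]] = [
            (attr.attrib["name"], attr.attrib["type"], attr.attrib["value"])
            for attr in tag.iter("attr")
        ]
    return result


tags = read_tags(tags_xml)
```

Iterating over the attr children of each tag keeps the attributes grouped with their tag, so a tag carrying more than one attribute would not shift the pairing.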

It's also possible to handle more tags than italics, bold, and color; you just need to update the tag2html dictionary.

Besides not using BeautifulSoup, I made some other changes. All my tags have names, so the lookup uses the tag name rather than the id. My color tag also already has hex values, so I didn't need the pango_to_html_hex method. Here is how it looks right now:

from html.parser            import HTMLParser
from typing                 import Dict, List, Optional, Tuple
from xml.etree.ElementTree  import fromstring

from gi import require_version
require_version('Pango', '1.0')
from gi.repository import Pango

class PangoToHtml(HTMLParser):
    """Decode a subset of Pango markup and serialize it as HTML.

    Only the Pango markup used within Gourmet is handled, although expanding it
    is not difficult.

    Due to the way that Pango attributes work, the HTML is not necessarily the
    simplest. For example italic tags may be closed early and reopened if other
    attributes, eg. bold, are inserted mid-way:

        <i> italic text </i><i><u>and underlined</u></i>

    This means that the HTML resulting from the conversion by this object may
    differ from the original that was fed to the caller.
    """
    def __init__(self):
        super().__init__()
        self.markup_text:           str  = ""  # the resulting content
        self.current_opening_tags:  str  = ""  # used during parsing
        self.current_closing_tags:  List = []  # used during parsing

        # The key is the Pango id of a tag, and the value is a tuple of opening
        # and closing html tags for this id.
        self.tags: Dict[str, Tuple[str, str]] = {}

    tag2html: Dict[str, Tuple[str, str]] = {
                                            Pango.Style.ITALIC.value_name:      ("<i>", "</i>"),  # Pango doesn't do <em>
                                            str(Pango.Weight.BOLD.real):        ("<b>", "</b>"),
                                            Pango.Underline.SINGLE.value_name:  ("<u>", "</u>"),
                                            "foreground-gdk":                   (r'<span foreground="{}">', "</span>"),
                                            "background-gdk":                   (r'<span background="{}">', "</span>")
                                            }

    def feed(self, data: bytes) -> str:
        """Convert a buffer (text and the buffer's iterators) to an HTML string.

        Unlike an HTMLParser, the whole string must be passed at once, chunks
        are not supported.
        """
        # Remove the Pango header: it contains a length mark, which we don't
        # care about, and which does not necessarily decode as valid UTF-8.
        header_end  = data.find(b"<text_view_markup>")
        data        = data[header_end:].decode()

        # Get the tags
        tags_begin  = data.index("<tags>")
        tags_end    = data.index("</tags>") + len("</tags>")
        tags        = data[tags_begin:tags_end]
        data        = data[tags_end:]

        # Get the textual content
        text_begin  = data.index("<text>")
        text_end    = data.index("</text>") + len("</text>")
        text        = data[text_begin:text_end]

        # Convert the tags to html.
        # We know that only a subset of HTML is handled in Gourmet:
        # italics, bold, underlined and normal

        root            = fromstring(tags)
        tags_name       = list(root.iter('tag'))
        tags_attributes = list(root.iter('attr'))
        tags            = [ [tag_name, tag_attribute] for tag_name, tag_attribute in zip(tags_name, tags_attributes)]

        tags_list = {}
        for tag in tags:
            opening_tags = ""
            closing_tags = ""

            tag_name    = tag[0].attrib['name']
            vtype       = tag[1].attrib['type']
            value       = tag[1].attrib['value'] 
            name        = tag[1].attrib['name']

            if vtype == "GdkColor":  # Convert colours to html
                if name in ['foreground-gdk', 'background-gdk']:
                    opening, closing = self.tag2html[name]
                    hex_color = f'{value.replace(":","")}' #hex color already handled by gtk.gdk.color.to_string() method
                    opening = opening.format(hex_color)
                else:
                    continue  # no idea!
            else:
                opening, closing = self.tag2html[value]

            opening_tags += opening
            closing_tags = closing + closing_tags   # closing tags are FILO

            tags_list[tag_name] = (opening_tags, closing_tags)

        self.tags = tags_list

        # Create a single output string that will be sequentially appended to
        # during feeding of text. It can then be returned once we've parsed all
        # of it.
        self.markup_text                = ""
        self.current_opening_tags       = ""
        self.current_closing_tags       = []  # Closing tags are FILO

        super().feed(text)

        return self.markup_text

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, str]]) -> None:
        # The pango tags are either "apply_tag", or "text". We only really care
        # about the "apply_tag". There could be an assert, but we let the
        # parser quietly handle nonsense.
        if tag == "apply_tag":
            attrs       = dict(attrs)
            tag_name    = attrs.get('name')
            tags        = self.tags.get(tag_name)

            if tags is not None:
                (self.current_opening_tags, closing_tag) = tags
                self.current_closing_tags.append(closing_tag)

    def handle_data(self, data: str) -> None:
        data = self.current_opening_tags + data
        self.markup_text += data

    def handle_endtag(self, tag: str) -> None:
        if self.current_closing_tags:  # Can be empty due to closing "text" tag
            self.markup_text += self.current_closing_tags.pop()
        self.current_opening_tags = ""
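
A note on the colour handling in feed: the code above joins the three 16-bit channels into a single 12-digit string. If whatever consumes the output expects the common #rrggbb form instead (an assumption; nothing above says which form is needed), a hypothetical helper along these lines could be used. The name gdk_color_to_html_hex is made up for this sketch.

```python
def gdk_color_to_html_hex(value: str) -> str:
    """Convert 'rrrr:gggg:bbbb' (16 bits per channel) to '#rrggbb'."""
    channels = value.split(":")
    if len(channels) != 3:
        raise ValueError(f"unexpected colour value: {value!r}")
    parts = []
    for channel in channels:
        # Keep only the most significant byte of each 16-bit channel.
        high_byte = int(channel, 16) >> 8
        parts.append(format(high_byte, "02x"))
    return "#" + "".join(parts)


html_color = gdk_color_to_html_hex("efef:2929:2929")
```

Dropping the low byte loses precision, which is usually acceptable for HTML or CSS colours but should be checked against the target format.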

Also, a big thanks to Cyril Danilevski, who wrote this; all credit goes to him. As he explained, "There is also <text> tags, that mark the beginning and end of a TextBuffer's content." So if you follow along with the example from the site, its handle_endtag has self.markup_text += self.current_closing_tags.pop(), which will try to pop from an empty list. I recommend that anyone who wants to handle tags also look at pango_html.py, which handles this by checking that the list is not empty (that check is also in the handle_endtag code in this answer). There's also a test file, test_pango_html.py.
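
To see the closing-tag bookkeeping without needing GTK installed, here is a stripped-down sketch of the same idea: a stack of pending closing tags, with a guard so that </text> or a stray </apply_tag> never pops from an empty list. The TAGS mapping and the ApplyTagConverter class are hypothetical and much simpler than PangoToHtml; in particular, this sketch emits the opening tag as soon as it sees <apply_tag>, which is not what the class above does.

```python
from html.parser import HTMLParser

# Hypothetical mapping from tag names to (opening, closing) HTML.
TAGS = {"bold": ("<b>", "</b>"), "underline": ("<u>", "</u>")}


class ApplyTagConverter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []      # pieces of the resulting markup
        self.closing = []  # stack of closing tags still to be emitted

    def handle_starttag(self, tag, attrs):
        if tag != "apply_tag":
            return  # e.g. the outer <text> element
        name = dict(attrs).get("name")
        # Unknown names push empty strings so later pops stay aligned.
        opening, closing = TAGS.get(name, ("", ""))
        self.out.append(opening)
        self.closing.append(closing)

    def handle_data(self, data):
        self.out.append(data)

    def handle_endtag(self, tag):
        # Only pop for </apply_tag>, and only if something is pending.
        if tag == "apply_tag" and self.closing:
            self.out.append(self.closing.pop())


def convert(text: str) -> str:
    parser = ApplyTagConverter()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


nested = convert(
    '<text><apply_tag name="bold">a'
    '<apply_tag name="underline">b</apply_tag>c</apply_tag></text>'
)
```

Because each </apply_tag> pops the most recently pushed closing tag, nested tags are closed in the right order, which is the problem the string-replacement attempt in the question ran into.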

Example of usage

# Assumes the class above is saved as pango_html.py
from pango_html import PangoToHtml

start_iter  = text_buffer.get_start_iter()
end_iter    = text_buffer.get_end_iter()
format      = text_buffer.register_serialize_tagset()
exported    = text_buffer.serialize( text_buffer,
                                     format,
                                     start_iter,
                                     end_iter )

p = PangoToHtml()
html = p.feed(exported)  # feed() returns the converted markup as a string