Algorithm for multiple string replacement by index


I'm having trouble coming up with a good algorithm to replace some entities in a text. Here are the details: I have a text that I need to format as HTML, and the formatting information is in a Python list of entity dictionaries. For example, suppose the original text looked like this (note the formatting: "Lorem Ipsum" is bold, "dummy" is italic, and "printing" is a link):

Lorem Ipsum is simply dummy text of the printing and typesetting industry.

The text I will get will be this (without formatting):

Lorem Ipsum is simply dummy text of the printing and typesetting industry.

and a list of entities like this:

entities = [
    {"entity_text": "Lorem Ipsum", "type": "bold", "offset": 0, "length": 11},
    {"entity_text": "dummy", "type": "italic", "offset": 22, "length": 5},
    {"entity_text": "printing", "type": "text_link", "offset": 40, "length": 8, "url": "google.com"},
]

My algorithm should translate the given unformatted text and entities into this HTML:

<b>Lorem Ipsum</b> is simply <i>dummy</i> text of the <a href="google.com">printing</a> and typesetting industry

This way it can be rendered as the original message. I have tried string replacement, but it messes up the offsets (the position of each entity from the start of the text). Also, remember that the same words may appear elsewhere in the text without formatting, so I have to find exactly the occurrences that should be formatted. Can anyone help? I'm writing the code in Python, but you can describe the algorithm in any language.
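To make the offset problem concrete, here is a small sketch (the variable names `tagged` and `shift` are illustrative only) showing how wrapping the first entity in tags moves every later entity away from its recorded offset:

```python
text = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."

# In the untouched text, "dummy" starts at offset 22.
assert text[22:22 + 5] == "dummy"

# Wrap the first entity (offset 0, length 11) in <b>...</b>.
tagged = "<b>" + text[0:11] + "</b>" + text[11:]

# The inserted tag characters push everything after them to the right,
# so the recorded offset 22 no longer points at "dummy".
shift = len("<b>") + len("</b>")
print(repr(tagged[22:22 + 5]))                  # some other slice of the text
print(repr(tagged[22 + shift:22 + shift + 5]))  # "dummy", found at the shifted offset
```

Any approach that edits the string in place therefore has to either correct the later offsets or avoid disturbing them in the first place.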

EDIT: Sorry, I forgot to post the code that I have tried. Here it is:

def format_html(text, entities):
    for entity in entities:
        try:
            entity_text = entity['entity_text']
            # Only format when the entity text sits exactly at its recorded offset.
            position = text.find(entity_text, entity['offset'])
            if position == entity['offset']:
                before = text[:position]
                # int() guards against lengths that arrive as strings.
                after = text[position + int(entity['length']):]
                if entity['type'] == 'text_link':
                    text_link = '<a href="{}">{}</a>'.format(entity['url'], entity_text)
                    text = before + text_link + after
                elif entity['type'] == 'code':
                    code = '<code>{}</code>'.format(entity_text)
                    text = before + code + after
                elif entity['type'] == 'bold':
                    bold_text = '<b>{}</b>'.format(entity_text)
                    text = before + bold_text + after
                elif entity['type'] == 'italic':
                    italic_text = '<i>{}</i>'.format(entity_text)
                    text = before + italic_text + after
                elif entity['type'] == 'pre':
                    pre_code = '<pre>{}</pre>'.format(entity_text)
                    text = before + pre_code + after
                # Problem: once tags are inserted, the offsets of all later
                # entities are stale, so they no longer match.
        except (KeyError, TypeError, ValueError):
            # Skip malformed entities.
            pass
    return text

Solution

  • Well, this is how I solved it. Each time I modify the text, I shift the offsets of the later entities by the length of the extra characters that the tags added. This is costly in terms of computation time, because every replacement loops over all the entities again, but it is the only option I've seen so far.

    def format_html(text, entities):
        # Note: this updates the 'offset' values of the entity dicts in place.
        for entity in entities:
            try:
                modified = None  # number of tag characters inserted for this entity
                entity_text = entity['entity_text']
                # Only format when the entity text sits exactly at its (adjusted) offset.
                position = text.find(entity_text, entity['offset'])
                if position == entity['offset']:
                    before = text[:position]
                    # int() guards against lengths that arrive as strings.
                    after = text[position + int(entity['length']):]
                    if entity['type'] == 'text_link':
                        text_link = '<a href="{}">{}</a>'.format(entity['url'], entity_text)
                        text = before + text_link + after
                        # len('<a href="">') + len('</a>') == 15, plus the URL itself
                        modified = 15 + len(entity['url'])
                    elif entity['type'] == 'code':
                        code = '<code>{}</code>'.format(entity_text)
                        text = before + code + after
                        modified = 13
                    elif entity['type'] == 'bold':
                        bold_text = '<b>{}</b>'.format(entity_text)
                        text = before + bold_text + after
                        modified = 7
                    elif entity['type'] == 'italic':
                        italic_text = '<i>{}</i>'.format(entity_text)
                        text = before + italic_text + after
                        modified = 7
                    elif entity['type'] == 'pre':
                        pre_code = '<pre>{}</pre>'.format(entity_text)
                        text = before + pre_code + after
                        modified = 11
                    if modified:
                        # Shift every entity that starts after this one.
                        for other in entities:
                            if other['offset'] > entity['offset']:
                                other['offset'] += modified
            except (KeyError, TypeError, ValueError):
                # Skip malformed entities.
                pass
        return text
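
As a possible alternative to adjusting offsets after every replacement, the output can be built in a single left-to-right pass over the original text, so no offset ever goes stale. The following is only a sketch, not the approach used above. The function name `format_html_single_pass` and the `templates` dictionary are hypothetical. It assumes that entities do not overlap or nest, that offsets and lengths count Python string characters, and that the text needs no HTML escaping.

```python
def format_html_single_pass(text, entities):
    """Build the HTML left to right, copying untouched text between entities."""
    # Hypothetical tag templates, keyed by the entity types seen in the question.
    templates = {
        "bold": "<b>{text}</b>",
        "italic": "<i>{text}</i>",
        "code": "<code>{text}</code>",
        "pre": "<pre>{text}</pre>",
        "text_link": '<a href="{url}">{text}</a>',
    }
    pieces = []
    cursor = 0  # index in the ORIGINAL text up to which we have already copied
    for entity in sorted(entities, key=lambda e: int(e["offset"])):
        start = int(entity["offset"])
        end = start + int(entity["length"])
        template = templates.get(entity["type"])
        # Assumption: skip unknown types, overlapping entities, and bad ranges.
        if template is None or start < cursor or end > len(text):
            continue
        pieces.append(text[cursor:start])  # plain text before the entity
        pieces.append(template.format(text=text[start:end], url=entity.get("url", "")))
        cursor = end
    pieces.append(text[cursor:])  # plain text after the last entity
    return "".join(pieces)


text = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."
entities = [
    {"entity_text": "Lorem Ipsum", "type": "bold", "offset": 0, "length": 11},
    {"entity_text": "dummy", "type": "italic", "offset": 22, "length": 5},
    # "printing" starts at index 40 in this sentence.
    {"entity_text": "printing", "type": "text_link", "offset": 40, "length": 8, "url": "google.com"},
]
print(format_html_single_pass(text, entities))
```

Because every slice is taken from the original string, each entity is visited once and the input list is left unmodified. The trade-off is that overlapping or nested entities are simply skipped in this sketch and would need extra handling.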