I'm having trouble coming up with a good algorithm to replace some entities in a text. Here are the details: I have a text that I need to format as HTML. The formatting information is in a Python list of dictionaries, one per entity. For example, suppose the original text looked like this (please pay attention to the formatting):
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
The text I will get will be this (without formatting):
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
and a list of entities like this:
    entities = [
        {"entity_text": "Lorem Ipsum", "type": "bold", "offset": 0, "length": "11"},
        {"entity_text": "dummy", "type": "italic", "offset": 22, "length": "5"},
        {"entity_text": "printing", "type": "text_link", "offset": 40, "length": "8", "url": "google.com"},
    ]
My algorithm should translate the given unformatted text and entities into this HTML:
<b>Lorem Ipsum</b> is simply <i>dummy</i> text of the <a href="google.com">printing</a> and typesetting industry.
so that it can be rendered as the original message. I have tried string replacement, but it messes up the offsets (the position of each entity from the start of the text). Keep in mind that the same words may also appear elsewhere in the text without formatting, so I have to find exactly the occurrences that should be formatted. Can anyone help? I'm writing the code in Python, but you can describe the algorithm in any language.
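To make the offset problem concrete, here is a minimal sketch using the sample sentence. The variable names (`formatted`, `shifted`, `corrected`) are only for illustration. It shows that once the first entity is wrapped in tags, the original offset of a later entity no longer points at the right characters.

```python
text = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."

# In the unformatted text, "dummy" sits at offset 22 with length 5.
assert text[22:27] == "dummy"

# Wrap the first entity ("Lorem Ipsum", offset 0, length 11) in <b> tags.
formatted = "<b>" + text[0:11] + "</b>" + text[11:]

# The same slice now lands on different characters, because 7 tag
# characters ("<b>" plus "</b>") were inserted in front of it.
shifted = formatted[22:27]
print(repr(shifted))      # 'simpl'

# Adding the 7 inserted characters to the offset finds the word again.
corrected = formatted[22 + 7:27 + 7]
print(repr(corrected))    # 'dummy'
```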
EDIT: Sorry, I forgot to post the code that I have tried. Here it is:
    def format_html(text, entities):
        for entity in entities:
            try:
                entity_text = entity['entity_text']
                # Problem: every tag inserted below shifts the text, so the
                # offsets of later entities no longer match and they are skipped.
                position = text.find(entity_text, entity['offset'])
                if position == entity['offset']:
                    before = text[:position]
                    # int() tolerates lengths that are stored as strings
                    after = text[position + int(entity['length']):]
                    if entity['type'] == 'text_link':
                        text_link = '<a href="{}">{}</a>'.format(entity['url'], entity_text)
                        text = before + text_link + after
                    elif entity['type'] == 'code':
                        code = '<code>{}</code>'.format(entity_text)
                        text = before + code + after
                    elif entity['type'] == 'bold':
                        bold_text = '<b>{}</b>'.format(entity_text)
                        text = before + bold_text + after
                    elif entity['type'] == 'italic':
                        italic_text = '<i>{}</i>'.format(entity_text)
                        text = before + italic_text + after
                    elif entity['type'] == 'pre':
                        pre_code = '<pre>{}</pre>'.format(entity_text)
                        text = before + pre_code + after
            except (KeyError, TypeError, ValueError):
                # Skip malformed entities instead of hiding every error
                pass
        return text
Well, this is how I solved it. Each time I modify the text, I adjust the offsets of the later entities by the length of the extra characters added (the tags). This is costly in terms of computation time, because every insertion triggers another pass over the entity list, but it is the only option I have found so far:
    def format_html(text, entities):
        # Note: this updates the 'offset' values inside `entities` in place.
        for entity in entities:
            try:
                modified = None  # number of tag characters added for this entity
                entity_text = entity['entity_text']
                position = text.find(entity_text, entity['offset'])
                if position == entity['offset']:
                    before = text[:position]
                    # int() tolerates lengths that are stored as strings
                    after = text[position + int(entity['length']):]
                    if entity['type'] == 'text_link':
                        text_link = '<a href="{}">{}</a>'.format(entity['url'], entity_text)
                        text = before + text_link + after
                        modified = 15 + len(entity['url'])  # '<a href="">' + '</a>' + url
                    elif entity['type'] == 'code':
                        code = '<code>{}</code>'.format(entity_text)
                        text = before + code + after
                        modified = 13
                    elif entity['type'] == 'bold':
                        bold_text = '<b>{}</b>'.format(entity_text)
                        text = before + bold_text + after
                        modified = 7
                    elif entity['type'] == 'italic':
                        italic_text = '<i>{}</i>'.format(entity_text)
                        text = before + italic_text + after
                        modified = 7
                    elif entity['type'] == 'pre':
                        pre_code = '<pre>{}</pre>'.format(entity_text)
                        text = before + pre_code + after
                        modified = 11
                if modified:
                    # Shift every entity that starts after the one just wrapped
                    for other in entities:
                        if other['offset'] > entity['offset']:
                            other['offset'] += modified
            except (KeyError, TypeError, ValueError):
                # Skip malformed entities instead of hiding every error
                pass
        return text
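One way to avoid adjusting offsets at all might be to process the entities from the end of the text backwards, so that each insertion happens to the right of every entity that is still waiting. The following is only a sketch under some assumptions: entities do not overlap or nest, lengths may be strings or integers, and the names `TAGS` and `format_html_reverse` are made up for this example. The offset 40 for "printing" is what counting characters in the sample sentence gives. Only the URL is escaped here; the message text itself is inserted as-is.

```python
import html

# Hypothetical lookup table for the simple tag types mentioned in the post.
TAGS = {
    "bold": ("<b>", "</b>"),
    "italic": ("<i>", "</i>"),
    "code": ("<code>", "</code>"),
    "pre": ("<pre>", "</pre>"),
}


def format_html_reverse(text, entities):
    """Wrap each entity in its tag, working from the last entity to the first."""
    # Highest offset first: text to the left of the current entity is never
    # touched, so the offsets of the remaining entities stay valid.
    for entity in sorted(entities, key=lambda e: int(e["offset"]), reverse=True):
        start = int(entity["offset"])
        end = start + int(entity["length"])
        if entity["type"] == "text_link":
            # Escape the URL so quotes cannot break out of the attribute.
            open_tag = '<a href="{}">'.format(html.escape(entity["url"], quote=True))
            close_tag = "</a>"
        elif entity["type"] in TAGS:
            open_tag, close_tag = TAGS[entity["type"]]
        else:
            continue  # unknown type: leave this span unformatted
        text = text[:start] + open_tag + text[start:end] + close_tag + text[end:]
    return text


sample_text = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."
sample_entities = [
    {"entity_text": "Lorem Ipsum", "type": "bold", "offset": 0, "length": "11"},
    {"entity_text": "dummy", "type": "italic", "offset": 22, "length": "5"},
    {"entity_text": "printing", "type": "text_link", "offset": 40, "length": "8", "url": "google.com"},
]
print(format_html_reverse(sample_text, sample_entities))
```

Because it slices by offset and length instead of searching for `entity_text`, this version does not depend on the entity text being unique, and it leaves the input list unmodified. It still copies the string once per entity, so for very long texts collecting the pieces in a list and joining them at the end may be worth considering.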