Highlight or bold strings in a text file using python-docx?

I have a list of 'short strings', such as:

['MKWVTFISLLLLFSSAYSRGV', 'SSAYSRGVFRRDTHKSEIAH', 'KPKATEEQLKTVMENFVAFVDKCCA']

I need to match these against a 'long string' contained in a Word file (BSA.docx) or a .txt file (either is fine), such as:

sp|P02769|ALBU_BOVIN Albumin OS=Bos taurus OX=9913 GN=ALB PE=1 SV=4 MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVKLVNELTEFAKTCVADESHAGCEKSLHTLFGDELCKVASLRETYGDMADCCEKQEPERNECFLSHKDDSPDLPKLKPDPNTLCDEFKADEKKFWGKYLYEIARRHPYFYAPELLYYANKYNGVFQECCQAEDKGACLLPKIETMREKVLASSARQRLRCASIQKFGERALKAWSVARLSQKFPKAEFVEVTKLVTDLTKVHKECCHGDLLECADDRADLAKYICDNQDTISSKLKECCDKPLLEKSHCIAEVEKDAIPENLPPLTADFAEDKDVCKNYQEAKDAFLGSFLYEYSRRHPEYAVSVLLRLAKEYEATLEECCAKDDPHACYSTVFDKLKHLVDEPQNLIKQNCDQFEKLGEYGFQNALIVRYTRKVPQVSTPTLVEVSRSLGKVGTRCCTKPESERMPCTEDYLSLILNRLCVLHEKTPVSEKVTKCCTESLVNRRPCFSALTPDETYVPKAFDEKLFTFHADICTLPDTEKQIKKQTALVELLKHKPKATEEQLKTVMENFVAFVDKCCAADDKEACFAVEGPKLVVSTQTALA

What I would like to obtain is the following using python (in a terminal or in a jupyter notebook):

1. Highlight every match of the short strings in the long string. The highlight style is not important: a yellow marker, bold, underline, or anything else that makes it obvious at a glance whether there were matches.
2. Find the coverage of the long string as ((number of highlighted characters)/(total length of the long string))*100. Note that the first line of the long string, starting with ">>", is just an identifier and needs to be disregarded.

Here is my current code for the first task:

import re

from docx import Document

doc = Document('BSA.docx')

peptide_list = ['MKWVTFISLLLLFSSAYSRGV', 'SSAYSRGVFRRDTHKSEIAH', 'KPKATEEQLKTVMENFVAFVDKCCA']

def highlight_peptides(doc, keywords):
    # The sequence is in the second paragraph of the document
    text = doc.paragraphs[1].text
    # Wrap each match in ANSI escape codes: red foreground on, then default foreground
    replacement = "\033[91m" + "\\1" + "\033[39m"
    text = re.sub("(" + "|".join(map(re.escape, keywords)) + ")", replacement, text, flags=re.I)
    print(text)

highlight_peptides(doc, peptide_list)

The problem is that the first two short strings in the list are overlapping and in the results only the first one is highlighted in red in the sequence.

See the first link below, which shows the output I am currently getting.

current result

See this second link to visualize my 'ideal' result.

ideal result

In the ideal result I also included the second task of finding the sequence coverage. I am not sure how to count the colored or highlighted characters.

Solution

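You can use the third-party regex module to do an overlapping keyword search, which the standard re module's sub and finditer do not support directly because they only return non-overlapping matches. It is then perhaps easiest to process the matches in two passes: (1) store the start and end positions of each highlighted segment, combining any segments that overlap: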

import regex as re # important - not using the usual re module

def find_keywords(keywords, text):
    """ Return a list of positions where keywords start or end within the text. 
    Where keywords overlap, combine them. """
    pattern = "(" + "|".join(re.escape(word) for word in keywords) + ")"
    r = []
    for match in re.finditer(pattern, text, flags=re.I, overlapped=True):
        start, end = match.span()
        if not r or start > r[-1]: 
            r += [start, end]  # add new segment
        elif end > r[-1]:
            r[-1] = end        # combine with previous segment
    return r

text = doc.paragraphs[1].text  # the sequence paragraph, as in the question
positions = find_keywords(peptide_list, text)
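If installing regex is not an option, a similar overlapping search can be sketched with the standard re module by wrapping the alternation in a lookahead, which matches at every position without consuming any text. This is a swapped-in alternative rather than the approach above; the name find_keywords_stdlib and the demo strings are made up for illustration.

```python
import re  # the standard library module this time, not regex

def find_keywords_stdlib(keywords, text):
    """Hypothetical stdlib-only variant of find_keywords.

    Returns a flat list [start0, end0, start1, end1, ...] of merged segments.
    """
    # Try longer keywords first, so that when two keywords start at the
    # same position the alternation picks the longer one.
    ordered = sorted(keywords, key=len, reverse=True)
    # A lookahead consumes no characters, so successive matches may overlap.
    pattern = "(?=(" + "|".join(re.escape(word) for word in ordered) + "))"
    r = []
    for match in re.finditer(pattern, text, flags=re.I):
        start, end = match.span(1)  # span of the captured keyword
        if not r or start > r[-1]:
            r += [start, end]       # add new segment
        elif end > r[-1]:
            r[-1] = end             # extend the previous segment
    return r

demo_text = "xxABCDEFGHxxIJKxx"
demo_keywords = ["ABCDE", "DEFGH", "IJK"]  # the first two overlap on "DE"
demo_positions = find_keywords_stdlib(demo_keywords, demo_text)
print(demo_positions)  # [2, 10, 12, 15]
```

The two overlapping keywords end up as a single segment from 2 to 10, which is what the highlighting and coverage steps expect.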

Your 'keyword coverage' (percent highlighted) can be calculated as:

coverage = sum(positions[1::2]) - sum(positions[::2]) # sum of end positions - sum of start positions
percent_coverage = coverage * 100 / len(text)
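The question asks for the identifier line to be left out of the coverage. A minimal sketch of that step, assuming the identifier is the first line of the input and starts with ">" (as in FASTA-style files) and that the sequence may be wrapped over several lines. The helper names split_header and coverage_percent, and the sample data, are hypothetical.

```python
def split_header(raw):
    """Split raw file content into (identifier line, sequence).

    Assumption: an identifier, if present, is the first line and starts with '>'.
    """
    lines = raw.strip().splitlines()
    if lines and lines[0].startswith(">"):
        header, body = lines[0], lines[1:]
    else:
        header, body = "", lines
    # Join wrapped sequence lines and drop any whitespace inside them.
    sequence = "".join("".join(line.split()) for line in body)
    return header, sequence

def coverage_percent(positions, sequence):
    """Percent of the sequence covered by merged [start, end, ...] segments."""
    if not sequence:
        return 0.0
    covered = sum(positions[1::2]) - sum(positions[::2])  # ends minus starts
    return covered * 100 / len(sequence)

raw = ">>example identifier\nABCDEFGHIJ\nKLMNOPQRST\n"
header, sequence = split_header(raw)
# Pretend the keyword search on `sequence` found these two merged segments.
example_positions = [0, 5, 10, 15]
print(header)                                        # >>example identifier
print(coverage_percent(example_positions, sequence)) # 50.0
```

For the percentage to be meaningful, the keyword search has to run on the stripped sequence, not on the raw text that still contains the identifier.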

Then (2) add the formatting to the text, using the run properties in python-docx:

import docx

def highlight_sections_docx(positions, text):
    """ Add characters to a text to highlight the segments indicated by
     a list of alternating start and end positions """
    document = docx.Document()
    p = document.add_paragraph()
    for i, (start, end) in enumerate(zip([None] + positions, positions + [None])):
        run = p.add_run(text[start:end])
        if i % 2:  # odd segments are highlighted
            run.bold = True   # or add other formatting - see https://python-docx.readthedocs.io/en/latest/api/text.html#run-objects
    return document

doc = highlight_sections_docx(positions, text)
doc.save("my_word_doc.docx")
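The zip([None] + positions, positions + [None]) idiom pairs each boundary with the next one, so the resulting slices alternate between plain and highlighted text. The sketch below isolates that step in a hypothetical helper, iter_segments, so the alternation can be checked without creating a document.

```python
def iter_segments(positions, text):
    """Yield (chunk, highlighted) pairs that together cover the whole text.

    positions is a flat list [start0, end0, start1, end1, ...] of sorted,
    non-overlapping segments, as returned by find_keywords.
    """
    bounds = zip([None] + positions, positions + [None])
    for i, (start, end) in enumerate(bounds):
        chunk = text[start:end]  # None means "from the start" / "to the end"
        if chunk:                # skip empty chunks at the edges of the text
            yield chunk, bool(i % 2)  # odd-numbered chunks are highlighted

segments = list(iter_segments([2, 10, 12, 15], "xxABCDEFGHxxIJKxx"))
print(segments)
# [('xx', False), ('ABCDEFGH', True), ('xx', False), ('IJK', True), ('xx', False)]
```

Each pair could then become one run in the paragraph, with the formatting applied only when the flag is true. If a yellow marker is preferred over bold, python-docx exposes run.font.highlight_color, which can be set to WD_COLOR_INDEX.YELLOW from docx.enum.text.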

Alternatively, you could highlight the text in HTML, and then save this to a Word document using the htmldocx package:

def highlight_sections(positions, text, start_highlight="<mark>", end_highlight="</mark>"):
    """ Add characters to a text to highlight the segments indicated by
     a list of alternating start and end positions """
    r = ""
    for i, (start, end) in enumerate(zip([None] + positions, positions + [None])):
        if i % 2:  # odd segments are highlighted
            r += start_highlight + text[start:end] + end_highlight
        else:      # even segments are not
            r += text[start:end]
    return r

from htmldocx import HtmlToDocx
s = highlight_sections(positions, text, start_highlight="<strong>", end_highlight="</strong>")
html = f"""<html><head></head><body><span style="width:100%; word-wrap:break-word; display:inline-block;">{s}</span></body></html>"""
HtmlToDocx().parse_html_string(html).save("my_word_doc.docx")

(<mark> would be a more appropriate html tag to use than <strong>, but unfortunately HtmlToDocx does not preserve any formatting of <mark>, and ignores CSS styles).

highlight_sections can also be used to output to the console:

print(highlight_sections(positions, text, start_highlight="\033[91m", end_highlight="\033[39m"))

... or to a Jupyter / IPython notebook:

from IPython.display import HTML, display
s = highlight_sections(positions, text)
display(HTML(f"""<span style="width:100%; word-wrap:break-word; display:inline-block;">{s}</span>"""))