
Combining strings which have been altered


I have the following three strings:

"A randomized, prospective study of [intervention]endometrial resection[intervention] to prevent recurrent endometrial polyps in women with breast cancer receiving tamoxifen. To assess the role of endometrial resection in preventing recurrence of tamoxifen-associated endometrial polyps in women with breast cancer."
"A randomized, prospective study of endometrial resection to prevent [condition]recurrent endometrial polyps[condition] in women with breast cancer receiving tamoxifen. To assess the role of endometrial resection in preventing recurrence of tamoxifen-associated endometrial polyps in women with breast cancer."
"A randomized, prospective study of endometrial resection to prevent recurrent endometrial polyps in [eligibility]women with breast cancer receiving tamoxifen[eligibility]. To assess the role of endometrial resection in preventing recurrence of tamoxifen-associated endometrial polyps in women with breast cancer."

Is there a way to efficiently combine the three strings into one, where you can see all the annotations (between brackets) that I have made? I cannot come up with anything efficient by myself. The result should look like:

"A randomized, prospective study of [intervention]endometrial resection[intervention] to prevent [condition]recurrent endometrial polyps[condition] in [eligibility]women with breast cancer receiving tamoxifen[eligibility]. To assess the role of endometrial resection in preventing recurrence of tamoxifen-associated endometrial polyps in women with breast cancer."

Thanks in advance!


Solution

  • This assumes that each annotation is attached directly to an existing word, so that splitting every string on whitespace yields the same number of tokens in the same positions (which is the case in the example). Under that assumption, a simple solution is to zip the split strings, keep the longest variant of each token using max, then join the tokens back into a single string:

    strings = ["A randomized, prospective study of [intervention]endometrial resection[intervention] to prevent recurrent endometrial polyps in women with breast cancer receiving tamoxifen. To assess the role of endometrial resection in preventing recurrence of tamoxifen-associated endometrial polyps in women with breast cancer.",
               "A randomized, prospective study of endometrial resection to prevent [condition]recurrent endometrial polyps[condition] in women with breast cancer receiving tamoxifen. To assess the role of endometrial resection in preventing recurrence of tamoxifen-associated endometrial polyps in women with breast cancer.",
               "A randomized, prospective study of endometrial resection to prevent recurrent endometrial polyps in [eligibility]women with breast cancer receiving tamoxifen[eligibility]. To assess the role of endometrial resection in preventing recurrence of tamoxifen-associated endometrial polyps in women with breast cancer.",
              ]
    
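    # Split each string on whitespace, zip the aligned tokens together,
    # and keep the longest variant of each token (the annotated one, if any).
    out = ' '.join(max(tokens, key=len) for tokens in zip(*(s.split() for s in strings)))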
    

    Output:

    'A randomized, prospective study of [intervention]endometrial resection[intervention] to prevent [condition]recurrent endometrial polyps[condition] in [eligibility]women with breast cancer receiving tamoxifen[eligibility]. To assess the role of endometrial resection in preventing recurrence of tamoxifen-associated endometrial polyps in women with breast cancer.'
    

    If you need something more robust, a good starting point might be to use the difflib module to compute the successive differences, keeping the longest variant in each comparison.
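One way the difflib idea could look is sketched below. It is only a starting point, not a definitive implementation: the helper names `merge_pair` and `merge_annotations` are made up for illustration, and the sketch assumes that annotations only ever add characters, so the longer side of any difference is the annotated one. It compares the strings token by token with `difflib.SequenceMatcher` and folds them together pairwise with `functools.reduce`.

```python
from difflib import SequenceMatcher
from functools import reduce


def merge_pair(a_tokens, b_tokens):
    """Merge two token lists, keeping the longer side wherever they differ.

    Hypothetical helper: assumes an annotation only adds characters,
    so the longer side of a difference is the annotated one.
    """
    merged = []
    matcher = SequenceMatcher(None, a_tokens, b_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        a_part, b_part = a_tokens[i1:i2], b_tokens[j1:j2]
        if tag == "equal":
            merged.extend(a_part)
        else:
            # 'replace', 'insert' or 'delete': keep whichever side is longer
            longer = max(a_part, b_part, key=lambda part: len(" ".join(part)))
            merged.extend(longer)
    return merged


def merge_annotations(strings):
    """Fold all variants into one string by merging them pairwise."""
    return " ".join(reduce(merge_pair, (s.split() for s in strings)))


variants = [
    "the [x]quick[x] brown fox jumps",
    "the quick brown [y]fox[y] jumps",
]
print(merge_annotations(variants))
# the [x]quick[x] brown [y]fox[y] jumps
```

Unlike the zip version, this does not require every string to split into the same number of tokens. One limitation to be aware of: if two different annotations touch directly neighbouring words, with no unchanged word between them, they fall into a single differing block and only the longer side survives. Handling that case would need a finer-grained (for example character-level) comparison.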