I have the following three strings:
"A randomized, prospective study of [intervention]endometrial resection[intervention] to prevent recurrent endometrial polyps in women with breast cancer receiving tamoxifen. To assess the role of endometrial resection in preventing recurrence of tamoxifen-associated endometrial polyps in women with breast cancer.
"A randomized, prospective study of endometrial resection to prevent [condition]recurrent endometrial polyps[condition] in women with breast cancer receiving tamoxifen. To assess the role of endometrial resection in preventing recurrence of tamoxifen-associated endometrial polyps in women with breast cancer.
"A randomized, prospective study of endometrial resection to prevent recurrent endometrial polyps in [eligibility]women with breast cancer receiving tamoxifen[eligibility]. To assess the role of endometrial resection in preventing recurrence of tamoxifen-associated endometrial polyps in women with breast cancer.
Is there a way to efficiently combine the three strings into one, where you can see all the annotations (between brackets) that I have made? I cannot come up with anything efficient by myself. The result should look like:
"A randomized, prospective study of [intervention]endometrial resection[intervention] to prevent [condition]recurrent endometrial polyps[condition] in [eligibility]women with breast cancer receiving tamoxifen[eligibility]. To assess the role of endometrial resection in preventing recurrence of tamoxifen-associated endometrial polyps in women with breast cancer.
Thanks in advance!
This assumes the annotations are attached directly to existing words, so that splitting each string on whitespace yields the same number of tokens in the same positions (which is the case in the example). A simple solution is then to zip the split strings, keep the longest variant of each token using max, and join the result back into a single string:
strings = ["A randomized, prospective study of [intervention]endometrial resection[intervention] to prevent recurrent endometrial polyps in women with breast cancer receiving tamoxifen. To assess the role of endometrial resection in preventing recurrence of tamoxifen-associated endometrial polyps in women with breast cancer.",
"A randomized, prospective study of endometrial resection to prevent [condition]recurrent endometrial polyps[condition] in women with breast cancer receiving tamoxifen. To assess the role of endometrial resection in preventing recurrence of tamoxifen-associated endometrial polyps in women with breast cancer.",
"A randomized, prospective study of endometrial resection to prevent recurrent endometrial polyps in [eligibility]women with breast cancer receiving tamoxifen[eligibility]. To assess the role of endometrial resection in preventing recurrence of tamoxifen-associated endometrial polyps in women with breast cancer.",
]
# Align the strings token by token and keep the longest (annotated) variant of each token
out = ' '.join(max(variants, key=len) for variants in zip(*(s.split() for s in strings)))
Output:
'A randomized, prospective study of [intervention]endometrial resection[intervention] to prevent [condition]recurrent endometrial polyps[condition] in [eligibility]women with breast cancer receiving tamoxifen[eligibility]. To assess the role of endometrial resection in preventing recurrence of tamoxifen-associated endometrial polyps in women with breast cancer.'
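Note that zip stops at the shortest input, so if one string splits into a different number of tokens (for instance because an annotation was separated from its word by a space), the one-liner silently truncates the result. A minimal guarded sketch of the same idea follows; the function name `merge_by_tokens` is hypothetical and not part of the original answer:

```python
def merge_by_tokens(strings):
    """Merge annotated variants of one sentence, token by token.

    Assumes every string splits on whitespace into the same number of
    tokens; raises ValueError otherwise instead of truncating silently.
    """
    token_lists = [s.split() for s in strings]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("strings do not split into the same number of tokens")
    # For each position, keep the longest variant (the annotated one, if any).
    return " ".join(max(variants, key=len) for variants in zip(*token_lists))


print(merge_by_tokens(["a [x]b[x] c", "a b [y]c[y]"]))
# → a [x]b[x] [y]c[y]
```

As with the one-liner, if two strings annotate the same token differently, only the longer variant survives.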
If you need something more robust, a good starting point might be to use the difflib module to compute the successive differences, keeping the longest variant in each comparison.
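One way the difflib idea could look is sketched below. It is only an illustration under stated assumptions: the helper names `merge_pair` and `merge_annotations` are made up here, the comparison is done character by character with `difflib.SequenceMatcher`, and `autojunk=False` is set so that frequent characters are not discarded as junk in longer strings. It has not been validated beyond simple cases like the one in the question.

```python
from difflib import SequenceMatcher
from functools import reduce


def merge_pair(a, b):
    """Merge two annotated variants of the same text.

    For every diff block, keep the longer of the two sides: identical
    blocks are kept once, and a block present on one side only (an
    annotation tag) is carried over into the result.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    parts = []
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        parts.append(max(a[i1:i2], b[j1:j2], key=len))
    return "".join(parts)


def merge_annotations(strings):
    """Fold merge_pair over all variants, left to right."""
    return reduce(merge_pair, strings)


variants = [
    "study of [i]endometrial resection[i] to prevent recurrent polyps in women",
    "study of endometrial resection to prevent [c]recurrent polyps[c] in women",
    "study of endometrial resection to prevent recurrent polyps in [e]women[e]",
]
print(merge_annotations(variants))
# → study of [i]endometrial resection[i] to prevent [c]recurrent polyps[c] in [e]women[e]
```

Unlike the token-based version, this does not depend on whitespace alignment, but when two variants differ at the same position only the longer side is kept, so overlapping annotations may need extra handling.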