Tags: python, html, beautifulsoup, abbr

How to convert HTML `abbr` tag text to text in parentheses in Python?


I need to convert hundreds of HTML sentences generated by an outside source into readable text, and I have a question about converting the abbr tag. Here is an example:

    from bs4 import BeautifulSoup
    text = "<abbr title=\"World Health Organization\" style=\"color:blue\">WHO</abbr> is a specialized agency of the <abbr title=\"United Nations\" style=\"color:#CCCC00\">UN</abbr>."
    # Name the parser explicitly so the result does not depend on which parsers are installed
    print(BeautifulSoup(text, "html.parser").get_text())

This code prints "WHO is a specialized agency of the UN.". However, what I want is "WHO (World Health Organization) is a specialized agency of the UN (United Nations)." Is there a way to do this, perhaps with a module other than BeautifulSoup?


Solution

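  • You can iterate over the top-level nodes in soup.contents, keeping text nodes as they are and appending the title in parentheses to each abbr tag. This assumes every top-level tag is an abbr with a title attribute:

    from bs4 import BeautifulSoup as soup
    text = "<abbr title=\"World Health Organization\" style=\"color:blue\">WHO</abbr> is a specialized agency of the <abbr title=\"United Nations\" style=\"color:#CCCC00\">UN</abbr>."
    # Text nodes have name None; every tag is assumed to be a top-level <abbr> with a title
    d = ''.join(str(i) if i.name is None else f'{i.text} ({i["title"]})' for i in soup(text, 'html.parser').contents)
    print(d)

    Output:

    WHO (World Health Organization) is a specialized agency of the UN (United Nations).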
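
The one-liner above only looks at top-level nodes, so it would miss an abbr nested inside another tag and would raise a KeyError on a tag without a title. If the input might contain either, a possible variant is to replace each titled abbr in place and then extract the text. This is only a sketch: the function name expand_abbr is made up for illustration, and it assumes abbr tags without a title should be left as plain text.

```python
from bs4 import BeautifulSoup


def expand_abbr(html):
    """Return the text of `html`, with each titled <abbr> followed by its title in parentheses."""
    soup = BeautifulSoup(html, "html.parser")
    # title=True matches only <abbr> tags that have a title attribute, at any depth
    for abbr in soup.find_all("abbr", title=True):
        # Swap the whole tag for a plain string such as "WHO (World Health Organization)"
        abbr.replace_with(f"{abbr.get_text()} ({abbr['title']})")
    return soup.get_text()


text = "<abbr title=\"World Health Organization\" style=\"color:blue\">WHO</abbr> is a specialized agency of the <abbr title=\"United Nations\" style=\"color:#CCCC00\">UN</abbr>."
print(expand_abbr(text))
```

For the example sentence this prints the same result as the one-liner, and the function can be called once per sentence when processing the whole batch.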