python text beautifulsoup nlp

How to extract text within flagged tags?


I have the following document and I would like to extract the text inside every category tag, grouped by category.

Input: a variable named doc that holds unstructured text.

doc = """Like APC , <category="Modifier">APC2</category> regulates the formation of active betacatenin-Tcf 
       complexes , as demonstrated using transient transcriptional activation assays in APC - / - 
       <category="Modifier">colon carcinoma</category> cells. Human APC2 maps to chromosome 19p13 . 3 . 
       APC and APC2 may therefore have comparable functions in development 
       and <category="SpecificDisease">cancer</category>"""

Output: Should be as follows:

{
'Modifier': ['APC2', 'colon carcinoma'],
'SpecificDisease': ['cancer']
}

This should be automated so that it extracts all category tags across a corpus.


I tried the following code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(doc, "html.parser")
contents = soup.find_all('category')

But I didn't know how to extract each category and its text from there.


Solution

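  • BeautifulSoup cannot parse this type of document reliably, because <category="Modifier"> is not well-formed HTML or XML (the value has no attribute name). As a workaround, you can use the re module, for example: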

    import re
    
    doc = """Like APC , <category="Modifier">APC2</category> regulates the formation of active betacatenin-Tcf 
           complexes , as demonstrated using transient transcriptional activation assays in APC - / - 
           <category="Modifier">colon carcinoma</category> cells. Human APC2 maps to chromosome 19p13 . 3 . 
           APC and APC2 may therefore have comparable functions in development 
           and <category="SpecificDisease">cancer</category>"""
    
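    # Map each category label to the list of texts tagged with it.
    out = {}
    # Group 1 captures the label, group 2 the tagged text (both non-greedy).
    for c, t in re.findall(r'<category="(.*?)">(.*?)</category>', doc):
        out.setdefault(c, []).append(t)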
    
    print(out)
    

    Prints:

    {'Modifier': ['APC2', 'colon carcinoma'], 'SpecificDisease': ['cancer']}
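
For the corpus case mentioned in the question, the same regex idea can be wrapped in a function. This is only a sketch: the names extract_categories, CATEGORY_RE and corpus are made up for illustration, and it assumes every tag has exactly the shape <category="LABEL">TEXT</category> with no nesting. It also adds re.DOTALL, because without that flag `.` does not match a newline and a tagged span that wraps onto the next line would be missed.

```python
import re
from collections import defaultdict

# Assumed tag shape: <category="LABEL">TEXT</category>, never nested.
# re.DOTALL lets the tagged text run across line breaks.
CATEGORY_RE = re.compile(r'<category="([^"]*)">(.*?)</category>', re.DOTALL)


def extract_categories(docs):
    """Collect tagged spans from an iterable of strings, grouped by label."""
    out = defaultdict(list)
    for doc in docs:
        for label, text in CATEGORY_RE.findall(doc):
            # Collapse line breaks and indentation inside a span to single spaces.
            out[label].append(" ".join(text.split()))
    return dict(out)


# Hypothetical mini-corpus built from the question's text.
corpus = [
    'Like APC , <category="Modifier">APC2</category> regulates the formation',
    'assays in APC - / - <category="Modifier">colon\n       carcinoma</category> cells.',
    'in development and <category="SpecificDisease">cancer</category>',
]

print(extract_categories(corpus))
```

If the documents live in files, you could pass in the result of reading each file instead of the hard-coded list; the function only needs an iterable of strings.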