How to extract text within flagged tags?

I have the following document and I would like to extract the text inside every category tag, grouped by category.

Input: a variable named doc that holds unstructured text.

doc = """Like APC , <category="Modifier">APC2</category> regulates the formation of active betacatenin-Tcf 
       complexes , as demonstrated using transient transcriptional activation assays in APC - / - 
       <category="Modifier">colon carcinoma</category> cells. Human APC2 maps to chromosome 19p13 . 3 . 
       APC and APC2 may therefore have comparable functions in development 
       and <category="SpecificDisease">cancer</category>"""

Output: Should be as follows:

{
'Modifier': ['APC2', 'colon carcinoma'],
'SpecificDisease': ['cancer']
}

This should be automated so that it can extract all category tags across a corpus.

I tried the following code:

from bs4 import BeautifulSoup
soup = BeautifulSoup(doc, "html.parser")
contents = soup.find_all('category')

But I didn't know how to extract each category and its text.

Solution

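BeautifulSoup cannot parse this kind of document reliably: <category="Modifier"> is not well-formed HTML or XML, because the value is attached directly to the tag name with no attribute name. As a workaround, you can use the re module, for example: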

import re

doc = """Like APC , <category="Modifier">APC2</category> regulates the formation of active betacatenin-Tcf 
       complexes , as demonstrated using transient transcriptional activation assays in APC - / - 
       <category="Modifier">colon carcinoma</category> cells. Human APC2 maps to chromosome 19p13 . 3 . 
       APC and APC2 may therefore have comparable functions in development 
       and <category="SpecificDisease">cancer</category>"""

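out = {}
# Group 1 captures the category name, group 2 the text between the tags.
for c, t in re.findall(r'<category="(.*?)">(.*?)</category>', doc):
    # Create the list on first sight of a category, then append to it.
    out.setdefault(c, []).append(t)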

print(out)

Prints:

{'Modifier': ['APC2', 'colon carcinoma'], 'SpecificDisease': ['cancer']}
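
To run the same idea over a whole corpus, one option is to wrap the pattern in a function that accepts any iterable of strings. This is only a sketch: the names CATEGORY_RE, extract_categories and corpus are made up for illustration, and the re.DOTALL flag plus the whitespace cleanup assume that tagged text may wrap across lines in your data.

```python
import re
from collections import defaultdict

# Compiled once so it can be reused across many documents.
# re.DOTALL lets tagged text span a line break (an assumption about the corpus).
CATEGORY_RE = re.compile(r'<category="(.*?)">(.*?)</category>', re.DOTALL)


def extract_categories(docs):
    """Collect tagged text from an iterable of strings, grouped by category."""
    out = defaultdict(list)
    for doc in docs:
        for category, text in CATEGORY_RE.findall(doc):
            # Collapse internal whitespace so wrapped text reads as one phrase.
            out[category].append(" ".join(text.split()))
    return dict(out)


# Hypothetical two-document corpus, for illustration only.
corpus = [
    'Like APC , <category="Modifier">APC2</category> regulates the formation',
    'in APC - / - <category="Modifier">colon\n carcinoma</category> cells and '
    '<category="SpecificDisease">cancer</category>',
]

result = extract_categories(corpus)
print(result)
```

Duplicates are kept in order of appearance; if you only need unique values per category, collecting into sets instead of lists would be a small change.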