How do I parse information from a child tag in an XML file using BeautifulSoup?


Here is the layout of the XML file that I am parsing. Whenever a tag such as driverslicense carries multiple values, I want to extract each name together with its value, i.e. {number: 99999999, state: CA}

"""
<subjects>
    <subject id="B6">
        <name type="primary">
            <first>Frank </first>
            <middle></middle>
            <last>Darko</last>
        </name>
        <birthdate>10/26/2001</birthdate>
        <age>17</age>
        <ssn>12345679</ssn>
        <description>
            <sex>Male</sex>
        </description>
        <address type="residence" ref="A1"/>
        <driverslicense state="CA" number="99999999"/>
    </subject>
</subjects>
"""

My code is as follows:

dl = bs_data.find("driverslicense")

Output:

<driverslicense number="T64430698" state="VA"/>

I tried a for loop, but it prints nothing. I also tried .text, but that returns nothing either.

for i in bs_data.find('driverslicense'):
    print(i)
------------------
DriverLicense = bs_data.find("driverslicense")
print(DriverLicense.text)

I would prefer to get this as a dictionary, but separate variables such as state = CA and number = 99999999 would work as well.


Solution

  • Just in addition: if you would like a dict with the attributes and values of a tag, you can simply use its .attrs property.

    soup.select_one('driverslicense').attrs
    

    Note: In this case it works like a charm. In other cases, where you need only specific attributes, the approach from @platipus_on_fire would be ideal, or you might have to ignore or drop the additional ones.

    Example

    from bs4 import BeautifulSoup
    html = '''
    <subjects>
        <subject id="B6">
            <name type="primary">
                <first>Frank </first>
                <middle></middle>
                <last>Darko</last>
            </name>
            <birthdate>10/26/2001</birthdate>
            <age>17</age>
            <ssn>12345679</ssn>
            <description>
                <sex>Male</sex>
            </description>
            <address type="residence" ref="A1"/>
            <driverslicense state="CA" number="99999999"/>
        </subject>
    </subjects>
    '''
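    # html.parser ships with Python; 'xml' also works if lxml is installed
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.select_one('driverslicense').attrs)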
    

    Output

    {'state': 'CA', 'number': '99999999'}
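
As a side note on why the loop and .text in the question come back empty: driverslicense is a self-closing tag, so it has no child nodes and no text, and its values live in attributes instead. Below is a minimal sketch of reading single attributes and of collecting only selected attributes from several tags. The second subject (id "B7") and the `wanted` tuple are made up for illustration, and html.parser is an assumption chosen because it needs no extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two subjects, each with a self-closing <driverslicense/>.
xml = '''
<subjects>
    <subject id="B6">
        <driverslicense state="CA" number="99999999"/>
    </subject>
    <subject id="B7">
        <driverslicense state="VA" number="T64430698"/>
    </subject>
</subjects>
'''

# html.parser ships with Python; 'xml' would also work if lxml is installed.
soup = BeautifulSoup(xml, 'html.parser')

first = soup.find('driverslicense')

# The tag is self-closing: no children and no text, only attributes.
children = list(first.children)  # empty list, so a for loop over the tag prints nothing
text = first.text                # empty string

# Single attributes: tag['name'] raises KeyError if the attribute is missing,
# tag.get('name') returns None instead.
state = first['state']
number = first.get('number')

# Keep only the chosen attributes for every matching tag.
wanted = ('number', 'state')
licenses = [
    {key: tag.get(key) for key in wanted}
    for tag in soup.find_all('driverslicense')
]

print(state, number)
print(licenses)
```

Using find_all instead of find gives one dict per driverslicense tag, which may be handy if the real file contains more than one subject.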