Tags: python-3.x, xml, beautifulsoup, xml-parsing, namespaces

How to parse XML namespaces in Python 3 and Beautiful Soup 4?


I am trying to parse XML with BS4 in Python 3.

For some reason, I am not able to find elements that use a namespace prefix. I looked at the answers to a related question, but they don't work for me, and I don't get any error message either.

Why does the first part work, but the second does not?

import requests
from bs4 import BeautifulSoup

input = """
<?xml version="1.0" encoding="utf-8"?>
<wb:countries page="1" pages="6" per_page="50" total="299" xmlns:wb="http://www.worldbank.org">
  <wb:country id="ABW">
    <wb:iso2Code>AW</wb:iso2Code>
    <wb:name>Aruba</wb:name>
    <wb:region id="LCN" iso2code="ZJ">Latin America &amp; Caribbean </wb:region>
    <wb:adminregion id="" iso2code="" />
    <wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
    <wb:lendingType id="LNX" iso2code="XX">Not classified</wb:lendingType>
    <wb:capitalCity>Oranjestad</wb:capitalCity>
    <wb:longitude>-70.0167</wb:longitude>
    <wb:latitude>12.5167</wb:latitude>
  </wb:country>
  <wb:country id="AFE">
    <wb:iso2Code>ZH</wb:iso2Code>
    <wb:name>Africa Eastern and Southern</wb:name>
    <wb:region id="NA" iso2code="NA">Aggregates</wb:region>
    <wb:adminregion id="" iso2code="" />
    <wb:incomeLevel id="NA" iso2code="NA">Aggregates</wb:incomeLevel>
    <wb:lendingType id="" iso2code="">Aggregates</wb:lendingType>
    <wb:capitalCity />
    <wb:longitude />
    <wb:latitude />
  </wb:country>
</wb:countries>

<item>
  <title>Some string</title>
  <pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
  <guid isPermaLink="false">4574785</guid>
  <link>https://somesite.com</link>
  <itunes:subtitle>A subtitle</itunes:subtitle>
  <enclosure length="0" type="audio/mpeg" url="https://assets.somesite.com/123.mp3"/>
  <itunes:image href="https://somesite.com/img.jpg"/>
  <itunes:duration>7845</itunes:duration>
  <itunes:explicit>no</itunes:explicit>
  <itunes:episodeType>Full</itunes:episodeType>
</item>
"""

soup = BeautifulSoup(input, 'xml')

# Working
for x in soup.find_all('wb:country'):
    print(x.find('wb:name').text)

# Not working
for x in soup.find_all('item'):
    print(x.find('itunes:subtitle').text)

Solution

  • This looks like non-conforming XML in which two documents are mixed together. The XML parser is strict about namespaces: a prefix such as itunes: is expected to be declared before it is used, and here it is not. Use the more lenient lxml parser instead to get the expected result from this wild mix:

    soup = BeautifulSoup(xml_string, 'lxml')
    
    # Working
    for x in soup.find_all('wb:country'):
        print(x.find('wb:name').text)
    
    # also working
    for x in soup.find_all('item'):
        print(x.find('itunes:subtitle').text)
    

    Note: Avoid using the names of Python built-ins (such as input) as variable names. Shadowing them can have unwanted effects on your code.
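
A minimal sketch of what the note above warns about: input is a built-in function rather than a keyword, so Python lets you rebind it, and the built-in is then hidden in that scope. The function name shadowing_demo is hypothetical and exists only for illustration.

```python
import builtins
import keyword


def shadowing_demo():
    """Rebind the name input locally and show that the built-in is hidden."""
    input = "<item/>"  # the name now refers to a str inside this function
    try:
        input()  # calling a str fails, so the built-in is unreachable here
    except TypeError as exc:
        return str(exc)
    return None


print(keyword.iskeyword("input"))   # False: input is not a reserved keyword
print(hasattr(builtins, "input"))   # True: it is a built-in function name
print(shadowing_demo())             # the TypeError message from calling a str
```

Renaming the variable (for example to xml_string, as in this answer) avoids the problem entirely.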


    If the second document is on its own, you can keep the XML parser and search for the tag name without the prefix:

    for x in soup.find_all('item'):
        print(x.find('subtitle').text)
    

    Example

    from bs4 import BeautifulSoup
    
    xml_string = """
    <?xml version="1.0" encoding="utf-8"?>
    <item>
      <title>Some string</title>
      <pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
      <guid isPermaLink="false">4574785</guid>
      <link>https://somesite.com</link>
      <itunes:subtitle>A subtitle</itunes:subtitle>
      <enclosure length="0" type="audio/mpeg" url="https://assets.somesite.com/123.mp3"/>
      <itunes:image href="https://somesite.com/img.jpg"/>
      <itunes:duration>7845</itunes:duration>
      <itunes:explicit>no</itunes:explicit>
      <itunes:episodeType>Full</itunes:episodeType>
    </item>
    """
    
    soup = BeautifulSoup(xml_string, 'xml')
    
    # working
    for x in soup.find_all('item'):
        print(x.find('subtitle').text)
    

    Otherwise, declare a namespace for the itunes prefix on your item element, and you can still use the XML parser:

    <?xml version="1.0" encoding="utf-8"?>
    <item xmlns:itunes="http://www.w3.org/TR/html4/">
      <title>Some string</title>
      <pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
      <guid isPermaLink="false">4574785</guid>
      <link>https://somesite.com</link>
      <itunes:subtitle>A subtitle</itunes:subtitle>
      ...
    

    When a namespace prefix is declared on an element, that element and all of its child elements that use the same prefix are associated with the same namespace.
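
The effect of declaring the prefix can be checked with a small sketch. It assumes the standard-library xml.etree.ElementTree module as a stand-in for a strict XML parser; this is not Beautiful Soup, and it is not meant to show how Beautiful Soup works internally. The helper name subtitle_of is hypothetical, and the namespace URI is the placeholder used in the snippet above.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI, copied from the snippet above.
ITUNES_NS = "http://www.w3.org/TR/html4/"

# Shortened item with the itunes prefix declared on the root element.
DECLARED = f"""<item xmlns:itunes="{ITUNES_NS}">
  <title>Some string</title>
  <itunes:subtitle>A subtitle</itunes:subtitle>
  <itunes:duration>7845</itunes:duration>
</item>"""

# The same item with the declaration removed.
UNDECLARED = DECLARED.replace(f' xmlns:itunes="{ITUNES_NS}"', "")


def subtitle_of(xml_text):
    """Return the itunes:subtitle text, or None if the XML is rejected."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # A strict parser refuses a prefix that was never bound to a URI.
        return None
    node = root.find("itunes:subtitle", {"itunes": ITUNES_NS})
    return node.text if node is not None else None


print(subtitle_of(DECLARED))
print(subtitle_of(UNDECLARED))
```

With the declaration in place, both itunes:subtitle and itunes:duration resolve to the same namespace URI; without it, this strict parser rejects the document instead of silently returning nothing.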