I would like to retrieve data from an SDMX file (such as https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx&mode=its). I tried BeautifulSoup, but it does not seem to see the tags. Here is the code:
import urllib2
from bs4 import BeautifulSoup
url = "https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx"
html_source = urllib2.urlopen(url).read()
soup = BeautifulSoup(html_source, 'lxml')
ts_series = soup.findAll("bbk:Series")
which gives me an empty object.
Is BS4 the wrong tool or, more likely, am I doing something wrong? Thanks in advance.
soup.findAll("bbk:series")
would return the result.
In this case, even if you use lxml as the parser, BeautifulSoup still parses the document as HTML. Because HTML tags are case-insensitive, BeautifulSoup lowercases all tag names, so soup.findAll("bbk:series") works. See "Other parser problems" in the official documentation.
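To make the lowercasing visible, here is a minimal sketch that runs without downloading anything. The markup is a made-up stand-in for the Bundesbank file (the element and attribute names are assumptions, not the real SDMX schema), and it needs bs4 and lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical SDMX-like snippet, invented for illustration only.
sample = (
    '<bbk:DataSet>'
    '<bbk:Series FREQ="M"><bbk:Obs>1.0</bbk:Obs></bbk:Series>'
    '<bbk:Series FREQ="A"><bbk:Obs>2.0</bbk:Obs></bbk:Series>'
    '</bbk:DataSet>'
)

# 'lxml' selects lxml's HTML parser, which lowercases tag names.
soup = BeautifulSoup(sample, "lxml")

mixed_case = soup.find_all("bbk:Series")  # original spelling: no match
lower_case = soup.find_all("bbk:series")  # lowercased spelling: matches

print(len(mixed_case), len(lower_case))
```

The mixed-case lookup comes back empty, which is the same symptom as in the question.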
If you want to parse it as XML, use soup = BeautifulSoup(html_source, 'xml') instead. This also uses lxml, since lxml is the only XML parser BeautifulSoup supports. Now you can use ts_series = soup.findAll("Series") to get the result, because BeautifulSoup drops the bbk namespace prefix from the tag name.
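A rough sketch of the XML route, again on a made-up snippet rather than the real download. The namespace URI and the FREQ, TIME_PERIOD and OBS_VALUE attribute names are assumptions for illustration; check the actual file for the names it uses. It needs bs4 and lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical SDMX-like snippet; the namespace URI and attribute names are invented.
xml_source = (
    '<bbk:DataSet xmlns:bbk="http://example.com/bbk">'
    '<bbk:Series FREQ="M">'
    '<bbk:Obs TIME_PERIOD="2015-01" OBS_VALUE="1.0"/>'
    '<bbk:Obs TIME_PERIOD="2015-02" OBS_VALUE="1.5"/>'
    '</bbk:Series>'
    '<bbk:Series FREQ="A">'
    '<bbk:Obs TIME_PERIOD="2015" OBS_VALUE="2.0"/>'
    '</bbk:Series>'
    '</bbk:DataSet>'
)

# 'xml' selects lxml's XML parser, which keeps the original tag case.
soup = BeautifulSoup(xml_source, "xml")

# Look up elements by local name, without the bbk prefix.
ts_series = soup.find_all("Series")

# Collect (frequency, period, value) rows from each series.
rows = [
    (series["FREQ"], obs["TIME_PERIOD"], obs["OBS_VALUE"])
    for series in ts_series
    for obs in series.find_all("Obs")
]

for row in rows:
    print(row)
```

Attribute values come back as strings, so convert OBS_VALUE with float() if you need numbers.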