
Finding Audio and Text between two <td> tags Python BeautifulSoup

I am working with this website http://www.nemoapps.com/phrasebooks/hebrew.

For every td element, I would like to get the first mp3 audio file (for example /audio/mp3/HEBFND1_1395.mp3), the Hebrew text (שרה שרה שיר שמח, שיר שמח שרה שרה), and the pronunciation (Sara shara shir sameaĥ, shir sameaĥ shara sara).

My current code, shown further down, gets part of the way there but does not quite do what I want.

<source src="/audio/ogg/HEBFND1_1395.ogg" type="audio/ogg">
</source></source>
[]
[<div class="target1" lang="he"><strong>\u05e9\u05e8\u05d4 \u05e9\u05e8\u05d4 \u05e9\u05d9\u05e8 \u05e9\u05de\u05d7, \u05e9\u05d9\u05e8 \u05e9\u05de\u05d7 \u05e9\u05e8\u05d4 \u05e9\u05e8\u05d4</strong></div>, <div class="target2" lang="he-Latn"><strong>Sara shara shir samea\u0125, shir samea\u0125 shara sara</strong></div>, <div class="translation">Tongue Twister: Sara sings a happy song, a happy song Sara is singing</div>]

This is a sample of the output I get, but I still need to drill into the <source> and <div> elements to retrieve the information I want.

The following is the code I used.

from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3: urlopen lives in urllib.request

_url = "http://www.nemoapps.com/phrasebooks/hebrew"
soup = BeautifulSoup(urlopen(_url), features="lxml")
_trs = soup.find_all('tr')
for tr in _trs:
    cells = tr.find_all('td', recursive=False)
    for cell in cells:
        audio = cell.find_all('audio')
        div = cell.find_all('div')
        for a in audio:
            source = a.find_all('source', type='audio/mpeg')
            for s in source:
                print(s)
        print(div)
    print("++++++++")

Please let me know if there is a more efficient way to accomplish this. Thanks.

Solution

Personally I find BeautifulSoup hard to use and byzantine, especially when it comes to finding elements by anything but the simplest rules.

The lxml module functions well with HTML sources and offers XPath support, which means finding elements is much easier and much more flexible.

I also prefer using the requests module over the very bare-bones urllib.

import requests
from lxml import etree

resp = requests.get("http://www.nemoapps.com/phrasebooks/hebrew")

htmlparser = etree.HTMLParser()
tree = etree.fromstring(resp.text, htmlparser)

for tr in tree.xpath('.//table[@class="nemocards"]/tr'):
    data = {
        'source': tr.xpath('string(.//source[@type="audio/mpeg"]/@src)'),
        'hebrew': tr.xpath('normalize-space(.//div[@lang="he"])'),
        'latin': tr.xpath('normalize-space(.//div[@lang="he-Latn"])'),
        'translation': tr.xpath('normalize-space(.//div[@class="translation"])'),
    }
    print(data)

Notes:

string() is an XPath function that gets a node's text content (whether that's an attribute node or an element node makes no difference).
normalize-space() does the same, but additionally it trims excess whitespace.
When called without such a string conversion (i.e. like tr.xpath('.//div[@lang="he"]')), you'd get a (possibly empty) list of matching elements. Since your goal is to extract the elements' text content, and that is harder to do in Python code, using the XPath string functions right away makes your life much easier: they simply return a (possibly empty) string.
.//table[@class="nemocards"] will only match when the table's class attribute is precisely "nemocards". For partial matches, something like .//table[contains(@class, "nemocards")] could be used.
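To make the notes above concrete, here is a small sketch run against a made-up HTML fragment rather than the live page. The fragment, its class value "nemocards big", and the file name EXAMPLE.mp3 are assumptions for illustration only; the XPath expressions are the ones discussed in the notes.

```python
from lxml import etree

# Hypothetical fragment shaped like one row of the page (not copied from the site).
SAMPLE = """
<table class="nemocards big">
  <tr>
    <td>
      <audio><source src="/audio/mp3/EXAMPLE.mp3" type="audio/mpeg"/></audio>
      <div lang="he-Latn"><strong>  Shalom  </strong></div>
    </td>
  </tr>
</table>
"""

tree = etree.fromstring(SAMPLE, etree.HTMLParser())

# Exact class comparison fails here because the attribute is "nemocards big".
exact = tree.xpath('.//table[@class="nemocards"]')
# contains() accepts the partial match.
partial = tree.xpath('.//table[contains(@class, "nemocards")]')

row = tree.xpath('.//tr')[0]

# Without a string function, xpath() returns a list of elements (maybe empty).
elements = row.xpath('.//div[@lang="he-Latn"]')
missing = row.xpath('.//div[@lang="he"]')

# string() keeps the surrounding whitespace, normalize-space() trims it.
raw = row.xpath('string(.//div[@lang="he-Latn"])')
clean = row.xpath('normalize-space(.//div[@lang="he-Latn"])')

# No match yields an empty string instead of an empty list.
empty = row.xpath('normalize-space(.//div[@lang="he"])')

# string() works on attribute nodes as well.
src = row.xpath('string(.//source[@type="audio/mpeg"]/@src)')

print(len(exact), len(partial), len(elements), len(missing))
print(repr(raw), repr(clean), repr(empty), src)
```

The snippet is self-contained and never touches the network; the per-row dictionaries come from the full script further up.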

Running the script prints one dictionary per row, like this:

{'source': '/audio/mp3/HEBFND1_0001.mp3', 'hebrew': 'שלום', 'latin': 'Shalom', 'translation': 'Hello'}
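The 'source' values are site-relative paths, so downloading the audio needs an absolute URL first. A minimal sketch, assuming the paths resolve against the same page URL used above (I have not verified where the site actually serves its audio); absolute_audio_url is a hypothetical helper name, not something from lxml or requests.

```python
from urllib.parse import urljoin

PAGE_URL = "http://www.nemoapps.com/phrasebooks/hebrew"


def absolute_audio_url(src, page_url=PAGE_URL):
    """Resolve a site-relative src value against the page URL.

    Returns None for an empty src, which is what the XPath string()
    call yields when a row has no matching <source> element.
    """
    if not src:
        return None
    return urljoin(page_url, src)


print(absolute_audio_url("/audio/mp3/HEBFND1_0001.mp3"))
# prints http://www.nemoapps.com/audio/mp3/HEBFND1_0001.mp3
```

The resulting URL could then be passed to requests.get() to fetch the file, skipping rows where the helper returns None.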