Python BeautifulSoup Not Recursive Text


I have a span element like the one below. How can I extract only the text that sits outside the anchor (a) tags?

# print soup.prettify()
<span class="1">
    text_wanted         
    <a data-toggle="notify" href="https://www.abc.com/1" class="class1"><span>text1</span></a>
    <a data-toggle="notify" href="https://www.abc.com/2" class="class2"><span>text2</span></a>
</span>

I am thinking about the solution below:

# Take all the text, then remove each anchor's text from it
text_all = soup.text
text_strip_list = [a.text.strip() for a in soup.find_all('a')]
for text_strip in text_strip_list:
    text_all = text_all.replace(text_strip, '').strip()

I am wondering whether there is an easier way to get the wanted text without diving into the anchor tags.

Thanks in advance...


Solution

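  • Assuming html is the BeautifulSoup object holding the parsed HTML,

    from bs4 import NavigableString

    print([node for node in html.find('span').contents if type(node) is NavigableString])

    will yield the text nodes that are direct children of the outermost span, skipping the text inside the anchor tags. The list also contains whitespace-only nodes, so strip and filter them as needed.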
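
    For reference, here is a minimal self-contained sketch of the same idea, assuming Python 3 with the bs4 package and its built-in "html.parser". The helper name direct_text and the HTML constant are made up for illustration; the markup is copied from the question.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup copied from the question.
HTML = """
<span class="1">
    text_wanted
    <a data-toggle="notify" href="https://www.abc.com/1" class="class1"><span>text1</span></a>
    <a data-toggle="notify" href="https://www.abc.com/2" class="class2"><span>text2</span></a>
</span>
"""


def direct_text(tag):
    """Join the text nodes that are direct children of tag (hypothetical helper).

    Text inside nested tags such as <a> is skipped because only
    tag.contents (the immediate children) is inspected.
    """
    parts = [
        str(node).strip()
        for node in tag.contents
        if type(node) is NavigableString  # exact type, so comments are excluded
    ]
    # Drop the whitespace-only nodes that sit between the child tags.
    return " ".join(part for part in parts if part)


soup = BeautifulSoup(HTML, "html.parser")
outer_span = soup.find("span")  # first span in the document, i.e. the outer one
print(direct_text(outer_span))
```

    With the markup from the question this should print only text_wanted. Compared with removing each anchor's text by string replacement, it does not break when the wanted text happens to contain the same words as one of the links.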