Tags: python, html, parsing, html-parsing, beautifulsoup

BS4: Getting text in tag


I'm using Beautiful Soup. There is a tag like this:

<li><a href="example"> s.r.o., <small>small</small></a></li>

I want to get only the text that sits directly inside the anchor <a> tag, without any text from the nested <small> tag in the output; i.e. " s.r.o., "

I tried find('li').text[0] but it does not work.

Is there a command in BS4 which can do that?


Solution

  • One option is to take the first item of the a element's .contents list:

    >>> from bs4 import BeautifulSoup
    >>> data = '<li><a href="example"> s.r.o., <small>small</small></a></li>'
    >>> soup = BeautifulSoup(data, 'html.parser')
    >>> print(soup.find('a').contents[0])
     s.r.o., 
    

    Another option is to find the small tag and take its previous sibling:

    >>> print(soup.find('small').previous_sibling)
     s.r.o., 
    

    There are also less conventional options, which work here because the text node is the first child of the a element:

    >>> print(next(soup.find('a').descendants))
     s.r.o., 
    >>> print(next(iter(soup.find('a'))))
     s.r.o.,
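
The options above all rely on the text being the first child of the a element. If the position might vary, a minimal sketch along these lines could collect only the strings that are direct children of the tag. The helper name direct_text is hypothetical (it is not part of BS4), and the sketch assumes bs4 4.4.0 or later, where find_all accepts the string argument. It also shows why find('li').text[0] falls short: .text is a plain string of all nested text, so [0] is just its first character.

```python
from bs4 import BeautifulSoup


def direct_text(tag):
    """Hypothetical helper: join the strings that are direct children of tag.

    recursive=False restricts the search to immediate children, so text
    inside nested tags such as <small> is left out.
    """
    return "".join(tag.find_all(string=True, recursive=False))


data = '<li><a href="example"> s.r.o., <small>small</small></a></li>'
soup = BeautifulSoup(data, "html.parser")

# .text concatenates every nested string, including the <small> text.
all_text = soup.find("li").text

# Indexing that string yields a single character, not the first text node.
first_char = all_text[0]

# Only the anchor's own text, regardless of where it sits among the children.
own_text = direct_text(soup.find("a"))

print(repr(all_text))
print(repr(first_char))
print(repr(own_text))
```

Calling .strip() on the result would drop the surrounding whitespace if it is not wanted.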