Tags: html, beautifulsoup, python-re

Retrieve all names from HTML tags using BeautifulSoup


I managed to set up Beautiful Soup and find the tags that I needed. How do I extract all the names in the tags?

tags = soup.find_all("a")
print(tags)

After running the above code, I got the following output:

[<a href="/wiki/Alfred_the_Great" title="Alfred the Great">Alfred the Great</a>, <a class="mw-redirect" href="/wiki/Elizabeth_I_of_England" title="Elizabeth I of England">Queen Elizabeth I</a>, <a href="/wiki/Family_tree_of_Scottish_monarchs" title="Family tree of Scottish monarchs">Family tree of Scottish monarchs</a>, <a href="/wiki/Kenneth_MacAlpin" title="Kenneth MacAlpin">Kenneth MacAlpin</a>]

How do I retrieve the names (Alfred the Great, Queen Elizabeth I, Kenneth MacAlpin, etc.)? Do I need to use a regular expression? Using .string gave me an error.


Solution

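  • There is no need for re. Iterate over the a tags and, for each one, read its text with get_text() or .find(string=True) (spelled text=True in older Beautiful Soup versions), or read its title attribute with .get('title'). Note that the title can differ from the link text: the second link's title is "Elizabeth I of England", while its text is "Queen Elizabeth I".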

    html='''
    <html>
     <body>
      <a href="/wiki/Alfred_the_Great" title="Alfred the Great">
       Alfred the Great
      </a>
      ,
      <a class="mw-redirect" href="/wiki/Elizabeth_I_of_England" title="Elizabeth I of England">
       Queen Elizabeth I
      </a>
      ,
      <a href="/wiki/Family_tree_of_Scottish_monarchs" title="Family tree of Scottish monarchs">
       Family tree of Scottish monarchs
      </a>
      ,
      <a href="/wiki/Kenneth_MacAlpin" title="Kenneth MacAlpin">
       Kenneth MacAlpin
      </a>
     </body>
    </html>
    
    '''
    
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html,'lxml')
    
    #print(soup.prettify())
    
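    # Loop over every <a> tag and print its visible link text.
    for name in soup.find_all('a'):
        txt = name.get_text(strip=True)
        # OR read the title attribute instead; it can differ from the text
        # (the second link's title is "Elizabeth I of England").
        #txt = name.get('title')
        print(txt)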
    

    Output:

    Alfred the Great
    Queen Elizabeth I
    Family tree of Scottish monarchs
    Kenneth MacAlpin
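
The .string error mentioned in the question most likely comes from calling .string on the result of find_all(), which is a list-like ResultSet rather than a single tag. The sketch below is one way to illustrate that and to collect the names into a list instead of printing them. It assumes the same four links as in the question (condensed to one line each) and uses the built-in 'html.parser' so that lxml is not required; the variable names names, titles and list_error are only illustrative.

```python
from bs4 import BeautifulSoup

# Same anchors as in the question, condensed to one line each.
html = """
<a href="/wiki/Alfred_the_Great" title="Alfred the Great">Alfred the Great</a>,
<a class="mw-redirect" href="/wiki/Elizabeth_I_of_England" title="Elizabeth I of England">Queen Elizabeth I</a>,
<a href="/wiki/Family_tree_of_Scottish_monarchs" title="Family tree of Scottish monarchs">Family tree of Scottish monarchs</a>,
<a href="/wiki/Kenneth_MacAlpin" title="Kenneth MacAlpin">Kenneth MacAlpin</a>
"""

# 'html.parser' ships with Python; 'lxml' works too if it is installed.
soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all("a")

# find_all() returns a ResultSet (a list of Tag objects), so .string has to be
# read from each element, not from the list as a whole.
list_error = None
try:
    tags.string
except AttributeError as exc:
    list_error = type(exc).__name__

# Visible link text of every tag.
names = [tag.get_text(strip=True) for tag in tags]

# Value of the title attribute of every tag (None if a tag has no title).
titles = [tag.get("title") for tag in tags]

print(list_error)
print(names)
print(titles)
```

Comparing the two lists shows why the choice matters: names contains "Queen Elizabeth I" for the second link, while titles contains "Elizabeth I of England".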