python html beautifulsoup tags extract

Extracting text inside tags from an HTML document


I have an HTML document like this: https://dropmefiles.com/wezmb. I need to extract the text between <span id="1"> and </span>, but I don't know how. I tried this code:

from bs4 import BeautifulSoup

with open("10_01.htm") as fp:
    soup = BeautifulSoup(fp, features="html.parser")
    for a in soup.find_all('span'):
        print(a.string)

But it extracts the text from every 'span' tag. How can I extract only the text between <span id="1"> and </span> in Python?


Solution

  • What you need is the .contents attribute (see the Beautiful Soup documentation), which holds the list of a tag's child nodes.

    Find the span <span id="1"> ... </span> and loop over its children using

    for x in soup.find(id="1").contents:
        print(x)
    

    OR

    x = soup.find(id="1").contents[0]  # first child node of the span; find() returns the first tag with id 1
    print(x)
    

    This prints:

    
    10
    
    

    that is, an empty line, then 10, then another empty line. The text node inside the span begins and ends with a newline, because in the HTML source 10 sits on its own line between the opening and closing tags.
    The string is therefore '\n10\n'.

    If you want just x = '10' from x = '\n10\n', you can use x = x.strip(), which removes the leading and trailing whitespace. Slicing with x = x[1:-1] also works here, since '\n' is a single character, but it breaks if the amount of surrounding whitespace changes. Hope this helps.
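
For reference, here is a minimal self-contained sketch of the steps above. The inline HTML string is a hypothetical stand-in for 10_01.htm (the real file was not inspected), and the second span is invented only to show that it gets skipped. It also uses get_text(strip=True), a standard Beautiful Soup method that returns the text with surrounding whitespace already removed, as an alternative to stripping by hand.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of 10_01.htm.
html = """
<html><body>
<span id="1">
10
</span>
<span id="2">
20
</span>
</body></html>
"""

soup = BeautifulSoup(html, features="html.parser")

# find() returns the first matching tag, or None if nothing matches.
span = soup.find("span", id="1")

raw = span.contents[0]             # first child node, newlines included
text = span.get_text(strip=True)   # text with surrounding whitespace removed

print(repr(str(raw)))  # shows the newlines explicitly
print(text)
```

If the document might not contain a span with that id, check that span is not None before reading .contents, since find() returns None when there is no match.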