
How to get text between two sets of tags in Python


I am trying to get both the text inside each tag and the text between one tag and the next. What I have tried so far has not given me what I want. Can anyone help? I really appreciate it.

html = '''
 <b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />  
'''

The expected output:

Doc Type: AABB
Doc No:   BBBBF
System No: aaa bbb
VCode: 040000033
G Code: 000045

The code I have tried is below. It only gave me the text inside the tags, not the text outside them:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
print(soup.find_all('b'))

I also tried the following, but it gave me all the text on the page. I only want the tag text and the text outside of the tags:

soup = BeautifulSoup(html, "html.parser")
lines = soup.text
print(lines)

The current output is:

Doc Type: 
Doc No:   
System No: 
VCode: 
G Code: 

Solution

  • You could use .next_sibling on each of those <b> elements. It returns the node that immediately follows the tag, which here is the text you are missing.

    Code:

    html = '''
     <b>Doc Type: </b>AABB
    <br />
    <b>Doc No: </b>BBBBF
    <br />
    <b>System No: </b>aaa bbb
    <br />
    <b>VCode: </b>040000033
    <br />
    <b>G Code: </b>000045
    <br />'''
    
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, "html.parser")
    bs = soup.find_all('b')
    
    
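    for each in bs:
        # The text node right after </b> holds the value; strip the trailing newline.
        eachFollowingText = each.next_sibling.strip()
        # each.text already ends with a space, hence the double space in the output.
        print(f'{each.text} {eachFollowingText}')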
    

    Output:

    Doc Type:  AABB
    Doc No:  BBBBF
    System No:  aaa bbb
    VCode:  040000033
    G Code:  000045
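
If a <b> label might not be followed by plain text (for example it is the last node in the document, or another tag comes right after it), .next_sibling is None or a Tag, and calling .strip() on it fails. Below is a minimal sketch of a more defensive variant, assuming the same markup as above. The name extract_fields is a hypothetical helper for illustration, not part of BeautifulSoup, and returning a dict is just one possible choice.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''
<b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />'''


def extract_fields(markup):
    """Hypothetical helper: map each <b> label to the text that follows it."""
    soup = BeautifulSoup(markup, "html.parser")
    fields = {}
    for label in soup.find_all("b"):
        sibling = label.next_sibling
        # Only a text node has a usable value; None or another tag yields "".
        value = sibling.strip() if isinstance(sibling, NavigableString) else ""
        # Drop surrounding whitespace and the trailing colon from the label.
        key = label.get_text(strip=True).rstrip(":")
        fields[key] = value
    return fields


fields = extract_fields(html)
for key, value in fields.items():
    print(f"{key}: {value}")
```

With the sample markup this prints one "label: value" line per field with a single space after the colon. Keeping the result in a dict also makes it easy to look up one field, e.g. fields["VCode"].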