Tags: python, python-3.x, beautifulsoup, urllib3

Extract only the text, excluding script tag content, from HTML with BeautifulSoup


I have HTML like this:

<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
     </span>
     <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
     </script>
</span>

I am trying to extract "Ages 15" using BeautifulSoup.

So I wrote the Python code below.

code:

from bs4 import BeautifulSoup as bs
import urllib3

URL = 'html file'

http = urllib3.PoolManager()

page = http.request('GET', URL)

soup = bs(page.data, 'html.parser')
age = soup.find("span", {"class": "age"})

print(age.text)

output:

Ages 15 getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);

I want only "Ages 15", not the function call inside the script tag. Is there any way to get only the text "Ages 15", or to exclude the content of the script tag?

PS: There are many script tags across many different URLs, so I would prefer not to clean up the output with string replacement.


Solution

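  • Use .find(text=True). It returns the first text node inside the matched tag, which here is the text that precedes the nested span and script tags. (Since Beautiful Soup 4.4.0 the same argument is also accepted as string=True.)

    Example: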

    from bs4 import BeautifulSoup
    
    html = """<span class="age">
        Ages 15
        <span class="loc" id="loc_loads1">
         </span>
         <script>
            getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
         </script>
    </span>"""
    
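    soup = BeautifulSoup(html, "html.parser")
    # find(text=True) returns only the first text node of the span ("Ages 15"
    # plus surrounding whitespace), so the script content is never reached.
    print(soup.find("span", {"class": "age"}).find(text=True).strip())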
    

    Output:

    Ages 15
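
The approach above relies on the wanted text being the first text node of the tag. If that is not guaranteed, one possible alternative is to remove the script elements from the parsed tree before reading the text. The following is only a sketch under that assumption: it reuses the sample HTML from the question, also drops style tags (an addition not asked for in the question), and the variable name age_text is chosen here for illustration.

```python
from bs4 import BeautifulSoup

html = """<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
     </span>
     <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
     </script>
</span>"""

soup = BeautifulSoup(html, "html.parser")

# Remove every <script> and <style> element (and its content) from the tree.
for tag in soup(["script", "style"]):
    tag.decompose()

age = soup.find("span", {"class": "age"})

# strip=True trims each text fragment and skips the ones that end up empty;
# the remaining fragments are joined with a single space.
age_text = age.get_text(" ", strip=True)
print(age_text)
```

Because the script tags are removed from the tree itself, this needs no string replacement on the output, and the same loop can be applied to each downloaded page regardless of how many script tags it contains.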