python web-scraping beautifulsoup data-extraction

Get the first line of text inside a tag using web scraping


I need to get the first line of text inside a tag using Python for web scraping.

Expected output: 22 September 1995

The HTML looks like this:

<div class="txt-block">
<h4 class="inline">Release Date:</h4> 22 September 1995 (USA)
<span class="see-more inline">
<a href="releaseinfo?ref_=tt_dt_dt">See more</a>&nbsp;»
</span></div>

My code to get the data is:

soup.find('div', {"class": "txt-block"}).text

Output: Release Date: 22 September 1995 (USA) See more


Solution

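  • I would do it this way:

    text = soup.find('h4').next_sibling
    # str.replace returns a new string, so assign the result; strip() trims leftover whitespace
    text = text.replace('(USA)', '').strip()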
    

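    or, matching on the class as well:

    # attrs must be a dict ({'class': 'inline'}); {'class','inline'} is a set literal
    text = soup.find('h4', {'class': 'inline'}).next_sibling
    text = text.replace('(USA)', '').strip()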
    

    Alternatively, you can use a regex to remove any parenthesised word such as (USA) from the text.

    using regex to remove a specific word from a string

    import re

    text = soup.find('h4', {'class': 'inline'}).next_sibling
    # drop a whitespace character followed by a parenthesised group, then trim the ends
    text = re.sub(r'\s\(.+\)', '', text).strip()
    

    That removes any parenthesised text from the string, not just (USA).
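
The regex step can be tried on its own, without a live page. The sketch below is only an illustration: `raw` is an assumed stand-in for what `next_sibling` would return for the HTML in the question, and `strip_parenthesised` is a hypothetical helper name, not part of BeautifulSoup. It also swaps the pattern for `\s*\([^)]*\)`, because the greedy `.+` in `\s\(.+\)` would swallow everything between the first `(` and the last `)` if the string ever contained more than one parenthesised group.

```python
import re

# Assumed value of h4.next_sibling for the HTML in the question:
# the text node between </h4> and <span>, whitespace included.
raw = " 22 September 1995 (USA)\n"


def strip_parenthesised(text):
    """Remove each '(...)' group and the whitespace before it, then trim the ends."""
    return re.sub(r"\s*\([^)]*\)", "", text).strip()


release_date = strip_parenthesised(raw)
print(release_date)
```

With a real page, `raw` would be replaced by `soup.find('h4', {'class': 'inline'}).next_sibling`; the cleaning step stays the same.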