Is there a way to find a class name and take the whole text of the parent tag?


I have a lot of HTML files and need to extract the full header from each one. The header text is split across spans with different classes: class="c6" and class="c7".

I have tried BeautifulSoup:

for head_c6 in soup.find_all('span', attrs={'class': 'c6'}):
        print(head_c6.get_text())
for head_c7 in soup.find_all('span', attrs={'class': 'c7'}):
        print(head_c7.get_text())

but the result is:

Q3 2017 American Express Co Earnings Call - Final LENGTH:

Q2 2016 Akamai Technologies Inc Call - Final Earnings

Here is what the different files look like:

File 1

<div class="c4">
<p class="c5">
<span class="c6">
      Q3 2017 American Express Co Earnings Call - Final
     </span>
</p>
</div>
<div class="c4">
<p class="c5">
<span class="c7">
      LENGTH:
     </span>
<span class="c2">
      11051 words
     </span>
</p>
</div>

File 2

<div class="c4">
<p class="c5">
<span class="c6">
      Q2 2018 Akamai Technologies Inc
     </span>
<span class="c7">
      Earnings
     </span>
<span class="c6">
      Call - Final
     </span>
</p>
</div>

File 3

<div class="c4">
    <p class="c5">
     <span class="c6">
      Q4 2018
     </span>
     <span class="c7">
      Facebook
     </span>
     <span class="c6">
      Inc
     </span>
     <span class="c7">
      Earnings
     </span>
     <span class="c6">
      Call - Final
     </span>
    </p>

What I want is to get the full text of the header:

Q3 2017 American Express Co Earnings Call - Final

Q2 2018 Akamai Technologies Inc Earnings Call - Final

Q4 2018 Facebook Inc Earnings Call - Final


Solution

  • Use a regular expression (the re module) to match the span classes. The example below uses the HTML from the last file; you can do the same with the remaining files.

    from bs4 import BeautifulSoup
    import re
    data='''<div class="c4">
        <p class="c5">
         <span class="c6">
          Q4 2018
         </span>
         <span class="c7">
          Facebook
         </span>
         <span class="c6">
          Inc
         </span>
         <span class="c7">
          Earnings
         </span>
         <span class="c6">
          Call - Final
         </span>
        </p>'''
    
    soup=BeautifulSoup(data,'html.parser')
    
    # re.compile("c") matches every span whose class contains the letter "c"
    items=[item.text.strip() for item in soup.find_all('span', class_=re.compile("c"))]
    stritem=' '.join(items)
    print(stritem.replace('\n',''))
    

    Output:

     Q4 2018 Facebook Inc Earnings Call - Final
    

    You can also restrict the match to classes containing c6 or c7:

    items=[item.text.strip() for item in soup.find_all('span', class_=re.compile("c6|c7"))]
    stritem=' '.join(items)
    print(stritem.replace('\n',''))
    

    Or, to get the text of the parent tag, try this:

    from bs4 import BeautifulSoup
    import re
    data='''<div class="c4">
        <p class="c5">
         <span class="c6">
          Q4 2018
         </span>
         <span class="c7">
          Facebook
         </span>
         <span class="c6">
          Inc
         </span>
         <span class="c7">
          Earnings
         </span>
         <span class="c6">
          Call - Final
         </span>
        </p>'''
    
    soup=BeautifulSoup(data,'html.parser')
    childtag=soup.find('span', class_=re.compile("c6|c7"))
    parenttag=childtag.parent
    # split() + join collapses the newlines and indentation inside the tag into single spaces
    print(' '.join(parenttag.text.split()))
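
The parent-tag idea can be wrapped in a small helper and tried against the other sample files. This is only a sketch: header_text is a hypothetical name, and it assumes the header always starts with a span of class c6 inside a p tag, which holds for the three samples in the question but may not hold for every file. Because it takes the parent of the first c6 span only, the separate "LENGTH:" paragraph in File 1 is not picked up.

```python
from bs4 import BeautifulSoup

def header_text(html):
    """Return the whitespace-normalized text of the <p> holding the first c6 span."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the header always starts with a span of class "c6"
    first_span = soup.find("span", class_="c6")
    if first_span is None:
        return None
    # Prefer the enclosing <p>; fall back to the direct parent if there is none
    parent = first_span.find_parent("p")
    if parent is None:
        parent = first_span.parent
    # split() + join collapses newlines and indentation into single spaces
    return " ".join(parent.get_text().split())

# Condensed versions of File 1 and File 2 from the question
file1 = '''<div class="c4"><p class="c5">
<span class="c6"> Q3 2017 American Express Co Earnings Call - Final </span>
</p></div>
<div class="c4"><p class="c5">
<span class="c7"> LENGTH: </span><span class="c2"> 11051 words </span>
</p></div>'''

file2 = '''<div class="c4"><p class="c5">
<span class="c6"> Q2 2018 Akamai Technologies Inc </span>
<span class="c7"> Earnings </span>
<span class="c6"> Call - Final </span>
</p></div>'''

for html in (file1, file2):
    print(header_text(html))
```

The function returns None when no c6 span is found, so files with a different layout can be detected and handled separately.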