I have a lot of HTML files and need to extract the full header from each one. The header text is split across tags with different classes: class="c6" and class="c7".
I have tried BeautifulSoup:
for head_c6 in soup.find_all('span', attrs={'class': 'c6'}):
    print(head_c6.get_text())
for head_c7 in soup.find_all('span', attrs={'class': 'c7'}):
    print(head_c7.get_text())
but the result has the pieces in the wrong order or includes extra text:
Q3 2017 American Express Co Earnings Call - Final LENGTH:
Q2 2018 Akamai Technologies Inc Call - Final Earnings
Here is what the different files look like:
File 1
<div class="c4">
<p class="c5">
<span class="c6">
Q3 2017 American Express Co Earnings Call - Final
</span>
</p>
</div>
<div class="c4">
<p class="c5">
<span class="c7">
LENGTH:
</span>
<span class="c2">
11051 words
</span>
</p>
</div>
File 2
<div class="c4">
<p class="c5">
<span class="c6">
Q2 2018 Akamai Technologies Inc
</span>
<span class="c7">
Earnings
</span>
<span class="c6">
Call - Final
</span>
</p>
</div>
File 3
<div class="c4">
<p class="c5">
<span class="c6">
Q4 2018
</span>
<span class="c7">
Facebook
</span>
<span class="c6">
Inc
</span>
<span class="c7">
Earnings
</span>
<span class="c6">
Call - Final
</span>
</p>
What I want is to get the full text of each header:
Q3 2017 American Express Co Earnings Call - Final
Q2 2018 Akamai Technologies Inc Earnings Call - Final
Q4 2018 Facebook Inc Earnings Call - Final
Use a regular expression (the re module) to match the class names.
I have used the HTML of the last file here. You can do the same with the remaining files.
from bs4 import BeautifulSoup
import re
data='''<div class="c4">
<p class="c5">
<span class="c6">
Q4 2018
</span>
<span class="c7">
Facebook
</span>
<span class="c6">
Inc
</span>
<span class="c7">
Earnings
</span>
<span class="c6">
Call - Final
</span>
</p>'''
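soup = BeautifulSoup(data, 'html.parser')
# A regex passed to class_ is applied with search(), so "c" matches
# every class that contains the letter c (c2, c5, c6, c7, ...).
items = [item.text.strip() for item in soup.find_all('span', class_=re.compile("c"))]
stritem = ' '.join(items)
print(stritem.replace('\n', ''))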
Output:
Q4 2018 Facebook Inc Earnings Call - Final
You can also restrict the match to the two classes:
items = [item.text.strip() for item in soup.find_all('span', class_=re.compile("c6|c7"))]
stritem = ' '.join(items)
print(stritem.replace('\n', ''))
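Note that the pattern "c6|c7" is matched as a substring, so it would also accept a class such as c60. As a minimal sketch of a stricter variant, a list of exact class names can be passed to class_ instead of a regex. The c60 span below is hypothetical and only there to show the difference; it does not appear in the files from the question.

```python
from bs4 import BeautifulSoup

# Sample based on File 2, plus a hypothetical class "c60" that is NOT a header
html = '''<p class="c5">
<span class="c6">Q2 2018 Akamai Technologies Inc</span>
<span class="c7">Earnings</span>
<span class="c60">not a header</span>
<span class="c6">Call - Final</span>
</p>'''

soup = BeautifulSoup(html, 'html.parser')

# A list matches a tag if any of its classes equals one of the listed names
spans = soup.find_all('span', class_=['c6', 'c7'])

# Spans come back in document order, so joining them keeps the header order
header = ' '.join(span.get_text(strip=True) for span in spans)
print(header)
```

Because find_all returns the spans in document order, a single call keeps the header pieces in sequence, which is what the two separate loops in the question lost.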
Or, to get the text of the parent tag, try this:
from bs4 import BeautifulSoup
import re
data='''<div class="c4">
<p class="c5">
<span class="c6">
Q4 2018
</span>
<span class="c7">
Facebook
</span>
<span class="c6">
Inc
</span>
<span class="c7">
Earnings
</span>
<span class="c6">
Call - Final
</span>
</p>'''
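soup = BeautifulSoup(data, 'html.parser')
# Find the first header span, then take the <p> that contains it
childtag = soup.find('span', class_=re.compile("c6|c7"))
parenttag = childtag.parent
# Strip each text piece and join with a space; simply deleting '\n'
# would glue the words together ("Q4 2018FacebookInc...")
print(parenttag.get_text(' ', strip=True))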
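The parent-tag approach can be wrapped in a small helper and run against every file. This is only a sketch: extract_header is a hypothetical name, and it assumes the header always sits in the <p> that contains the first c6 or c7 span, which holds for the three samples in the question. For File 1 this also avoids picking up "LENGTH:", because that span lives in a different <p>.

```python
from bs4 import BeautifulSoup


def extract_header(html):
    """Return the text of the <p> holding the first c6/c7 span, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    first_span = soup.find('span', class_=['c6', 'c7'])
    if first_span is None:
        return None
    # Strip every text piece and join the non-empty ones with a single space
    return first_span.parent.get_text(' ', strip=True)


file1 = '''<div class="c4">
<p class="c5">
<span class="c6">
Q3 2017 American Express Co Earnings Call - Final
</span>
</p>
</div>
<div class="c4">
<p class="c5">
<span class="c7">
LENGTH:
</span>
<span class="c2">
11051 words
</span>
</p>
</div>'''

file3 = '''<div class="c4">
<p class="c5">
<span class="c6">
Q4 2018
</span>
<span class="c7">
Facebook
</span>
<span class="c6">
Inc
</span>
<span class="c7">
Earnings
</span>
<span class="c6">
Call - Final
</span>
</p>'''

for html in (file1, file3):
    print(extract_header(html))
```

To process real files, the same function could be called on the contents of each file read from disk.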