I am trying to reconstruct posts on the web, and I have had success on most sites using the code below to scrape the text.
post = []  # collected pieces of the post
parent = driver.find_element(By.XPATH, "//*")  # main post element
child = parent.find_elements(By.XPATH, ".//*")  # child elements containing texts/emojis/links/images
for i in child:
    if i.text is not None:
        post.append(i.get_attribute('textContent'))
    if i.get_attribute('alt') is not None:
        post.append(i.get_attribute('alt'))
text = ''.join(post)
print(text)
I am basically reconstructing the post by collecting textContent and alt values in the order that find_elements lists the child elements. If a child contains text, its text is taken; if not, the alt of the emoji/image is taken. The whole post can be acquired with get_attribute because each piece of text and each emoji sits inside its own child element. However, I encountered a structure where a single child <div> contains all of the text and alt values. Example below:
<span class="post">
<div class="paragraph1">
<div>
<span class="html-span link">
<a>#link1 </a>
</span>
"A "
<span class="html-span emoji">
<img alt=":)">
</span>
"B "
</div>
</div>
<div class="paragraph2">
<div>
"C "
<span class="html-span link">
<a>#link2</a>
</span>
"D "
<span class="html-span emoji">
<img alt=":(">
</span>
"E"
</div>
</div>
</span>
Is it possible to put the idea of //text() or //*[@alt] into action? It seems that find_elements with By.XPATH and //text() is not allowed, since Selenium can only return element nodes, not text or attribute nodes.
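Just to illustrate the idea (this is only a sketch outside of Selenium, assuming lxml is installed and that driver.page_source reflects the rendered post; I would prefer to stay within Selenium): walking every node under the post and picking up text nodes and img alt values keeps the original order:

from lxml import html

tree = html.fromstring(driver.page_source)
post_span = tree.xpath("//span[@class='post']")[0]
parts = []
for node in post_span.xpath(".//node()"):  # elements and text nodes, in document order
    if isinstance(node, str):              # a text node
        if node.strip():
            parts.append(node.strip())
    elif node.tag == 'img':                # an image/emoji: take its alt
        alt = node.get('alt')
        if alt:
            parts.append(alt)
print(' '.join(parts))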
Expected output:
#link1 A :) B C #link2 D :( E
Tried:
post = []
parent = driver.find_element(By.XPATH, "//*")  # main post element
child = parent.find_elements(By.XPATH, ".//*[contains(@class, 'html-span')]//*")  # children containing texts/emojis/links/images
for i in child:
    print(i.get_attribute('textContent'))
    if i.text is not None:
        post.append(i.get_attribute('textContent'))
    if i.get_attribute('alt') is not None:
        post.append(i.get_attribute('alt'))
print(post)
text = ''.join(post)
print(text)
Result:
[#link1, '', :), '', '', #link2, '', :(, '']  # the plain text of the child (A, B, C, D, E) is missing
parent = driver.find_element(By.XPATH, "//span[@class='post']")
print(parent.get_attribute('textContent'))
Result:
#link1 A B C #link2 D E  # the alt attributes are missing
I am trying to group each paragraph class and reconstruct its sentence, but being able to obtain both the alt attributes and the text of the entire post, in order, would be good enough. Any idea on how to proceed is much appreciated.
This script should work. It walks the DOM via childNodes, which (unlike Selenium's element-only queries) also includes text nodes, so the bare text and the alt values are collected in document order:
script = '''
function getText(node){
    let text = [];
    for (let i = 0; i < node.childNodes.length; i++){
        const child = node.childNodes[i];
        if (child.alt){
            // image/emoji: use its alt text
            text.push(child.alt);
        } else if (child.children){
            // element node: recurse into it
            text.push(getText(child));
        } else {
            // text node: keep it if it is not just whitespace
            const content = child.textContent.trim();
            content && text.push(content);
        }
    }
    return text.join(' ');
}
return getText(arguments[0]);
'''
span = driver.find_element(By.CSS_SELECTOR, 'span.post')
text = driver.execute_script(script, span)
print(text)
Edit:
If you want a native Python solution, you can use BeautifulSoup like so:
from bs4 import BeautifulSoup

def getText(soup):
    text = []
    for child in soup.children:
        if child.string:
            # a plain string, or a tag wrapping a single string: keep it if non-empty
            content = child.get_text(strip=True)
            if content:
                text.append(content)
        elif alt := child.get('alt'):
            # a tag carrying an alt attribute (the <img> emojis)
            text.append(alt)
        elif child.contents:
            # any other tag with children: recurse into it
            text.append(getText(child))
    return ' '.join(text).strip()

span = driver.find_element(By.CSS_SELECTOR, 'span.post')
html = span.get_attribute('innerHTML')
text = getText(BeautifulSoup(html, 'html.parser'))
print(text)
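If you also want the per-paragraph grouping you mentioned, the same helper can be applied to each wrapper. A sketch, assuming the wrapping <div>s keep class names that start with "paragraph":

soup = BeautifulSoup(html, 'html.parser')
for div in soup.select('div[class^="paragraph"]'):  # div.paragraph1, div.paragraph2, ...
    print(getText(div))
# #link1 A :) B
# C #link2 D :( E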