Tags: python, selenium-webdriver, web-scraping, xpath

Is there a way to get the text values and attribute values of a single element, in document order?


I am trying to reconstruct posts on the web, and I have had success on most sites using the code below to scrape the text.

parent = driver.find_element(By.XPATH, "//*")  # main post element
child = parent.find_elements(By.XPATH, ".//*")  # children containing texts/emojis/links/images
post = []
for i in child:
    if i.text:
        post.append(i.get_attribute('textContent'))
    if i.get_attribute('alt') is not None:
        post.append(i.get_attribute('alt'))
text = ''.join(post)
print(text)

I am basically reconstructing the post by collecting textContent and alt values in the order find_elements returns them: if a child contains text, its text is taken; otherwise, the alt of an emoji/image is taken.

All the contents of the post can be acquired with get_attribute, since each text and emoji sits inside its own child element. However, I encountered a structure where a single child (<div>) contains all the text and alt values. Example below:

<span class="post">
    <div class="paragraph1">
        <div>
            <span class="html-span link">
                <a>#link1 </a>
            </span>
            "A "
            <span class="html-span emoji">
                <img alt=":)">
            </span>
            "B "
        </div>
    </div>
    <div class="paragraph2">
        <div>
            "C "
            <span class="html-span link">
                <a>#link2</a>
            </span>
            "D "
            <span class="html-span emoji">
                <img alt=":(">
            </span>
            "E"
        </div>
    </div>
</span>

Is it possible to put the idea of //text() or //img/@alt into action? It seems find_elements with By.XPATH does not allow //text(), since it has to return elements rather than text nodes.
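For what it's worth, the //text() idea does work outside Selenium. A sketch, assuming lxml is installed: hand the post's markup (e.g. from get_attribute('outerHTML')) to lxml, whose XPath engine supports text and attribute nodes, and a union of the two comes back sorted in document order. The snippet below parses the sample markup directly.

```python
from lxml import html

# sample markup from above (text nodes written plainly, as in real HTML)
snippet = '''<span class="post">
    <div class="paragraph1"><div>
        <span class="html-span link"><a>#link1 </a></span>A
        <span class="html-span emoji"><img alt=":)"></span>B
    </div></div>
    <div class="paragraph2"><div>
        C <span class="html-span link"><a>#link2</a></span>
        D <span class="html-span emoji"><img alt=":("></span>E
    </div></div>
</span>'''

doc = html.fromstring(snippet)
# text() and @alt nodes come back as one node-set, in document order
parts = doc.xpath(".//text() | .//img/@alt")
text = ' '.join(p.strip() for p in parts if p.strip())
print(text)  # #link1 A :) B C #link2 D :( E
```

The same union can be run per paragraph (e.g. on each `.//div[starts-with(@class, 'paragraph')]`) if you want one sentence per paragraph class.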

Expected output:

#link1 A :) B C #link2 D :( E

Tried:

    parent = driver.find_element(By.XPATH, "//*")  # main post element
    child = parent.find_elements(By.XPATH, ".//*[contains(@class, 'html-span')]//*")  # children containing texts/emojis/links/images
    post = []
    for i in child:
        print(i.get_attribute('textContent'))
        if i.text:
            post.append(i.get_attribute('textContent'))
        if i.get_attribute('alt') is not None:
            post.append(i.get_attribute('alt'))
    print(post)
    text = ''.join(post)
    print(text)

Result: ['#link1', '', ':)', '', '', '#link2', '', ':(', ''] # missing the text of the plain-text children

    parent = driver.find_element(By.XPATH, "//span[@class='post']")
    print(parent.get_attribute('textContent'))

Result: #link1 A B C #link2 D E # missing the alt attributes

I am trying to group each paragraph class and reconstruct its sentences, but being able to obtain both the attributes and the text of the entire post in order would be good enough.

Any idea on how to proceed is much appreciated.


Solution

  • This script should work:

    script = '''
    function getText(node){
      let text = [];
      for (let i = 0; i < node.childNodes.length; i++){
        const child = node.childNodes[i];
        if (child.alt){                 // image/emoji: take its alt
          text.push(child.alt);
        } else if (child.children){     // element node: recurse into it
          text.push(getText(child));
        } else {                        // text node: take its trimmed content
          const content = child.textContent.trim();
          content && text.push(content);
        }
      }
      return text.join(' ');
    }
    
    return getText(arguments[0]);
    '''
    span = driver.find_element(By.CSS_SELECTOR, 'span.post')
    text = driver.execute_script(script, span)
    print(text)
    

    Edit:

    If you want a pure-Python solution, you can use BeautifulSoup like so:

    from bs4 import BeautifulSoup
    
    def getText(soup):
        text = []
        for child in soup.children:
            if child.string:
                # skip whitespace-only text nodes so the join stays clean
                if stripped := child.get_text(strip=True):
                    text.append(stripped)
            elif alt := child.get('alt'):
                text.append(alt)
            elif child.contents:
                text.append(getText(child))
        
        return ' '.join(text).strip()
    
    
    span = driver.find_element(By.CSS_SELECTOR, 'span.post')
    html = span.get_attribute('innerHTML')
    
    text = getText(BeautifulSoup(html, 'html.parser'))
    print(text)
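    As a quick sanity check, here is a self-contained run against the sample markup, so no browser is needed; the guard that skips whitespace-only text nodes keeps stray indentation out of the join:

    ```python
    from bs4 import BeautifulSoup

    def getText(soup):
        text = []
        for child in soup.children:
            if child.string:
                if stripped := child.get_text(strip=True):  # skip whitespace-only nodes
                    text.append(stripped)
            elif alt := child.get('alt'):
                text.append(alt)
            elif child.contents:
                text.append(getText(child))
        return ' '.join(text).strip()

    snippet = '''<span class="post">
        <div class="paragraph1"><div>
            <span class="html-span link"><a>#link1 </a></span>A
            <span class="html-span emoji"><img alt=":)"></span>B
        </div></div>
        <div class="paragraph2"><div>
            C <span class="html-span link"><a>#link2</a></span>
            D <span class="html-span emoji"><img alt=":("></span>E
        </div></div>
    </span>'''

    result = getText(BeautifulSoup(snippet, 'html.parser'))
    print(result)  # #link1 A :) B C #link2 D :( E
    ```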