
How to extract only specific elements, combine "find_all" and "find_elements_by_xpath"?


I want to extract every <span> whose 'data-test-id' attribute matches '^fdbk-item-.*$' from the page at the URL in the code below.

Furthermore, among those I only want the ones whose text contains mirror, tray or ceramic, regardless of upper or lower case.

Using find_all(), I successfully retrieved the spans matching 'data-test-id'='^fdbk-item-.*$'.

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time
import re

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=options, executable_path='/Users/user/Desktop/pytest/chromedriver')
driver.implicitly_wait(10)
url="https://www.ebay.com/fdbk/feedback_profile/blueberbestmall?filter=feedback_page%3ARECEIVED_AS_SELLER%2Cperiod%3ATWELVE_MONTHS%2Coverall_rating%3ANEGATIVE&commentType=NEGATIVE"
driver.get(url)
time.sleep(3)
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'html.parser')
s=soup.find_all('span', attrs={'data-test-id': re.compile('^fdbk-item-.*$')})
time.sleep(3)
for i in s:
    print(i)

output

But I'm struggling to get find_all and find_elements_by_xpath to work together.

html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'html.parser')
time.sleep(3)
item_lst=[]
s=soup.find_all('span', attrs={'data-test-id': re.compile('^fdbk-item-.*$')})
try:
    for t in s:
        p=item_lst.append(t)
        item_fil=[]
        i=p.find_elements_by_xpath("//div[contains(text(),'mirror|tray|ceramic',flags=re.IGNORECASE)]")
        for j in i:
            k=item_fil.append(j)
            print(k)
except:
    pass

I have no idea how to modify:

find_elements_by_xpath("//div[contains(text(),'mirror|tray|ceramic',flags=re.IGNORECASE)]")

Is it possible to further refine the extracted elements?


Solution

  • Here are two alternative ways to reach your goal:

    • Select your elements with a CSS selector and check their text with :-soup-contains-own() (note that this match is case sensitive):

      soup.select('div:has(>span:-soup-contains-own("Mirror","Tray","Ceramic"))')
      
    • Select your elements more specifically and use a list comprehension to check their lowercased text against the keywords, so no regex is needed:

      matches = ['mirror','tray','ceramic']
      [e for e in soup.select('[data-feedback-id] .card__item') if any([x in e.text.lower() for x in matches])]
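
    • If you would rather keep the find_all() call from the question, a third possibility is to pass a second compiled pattern through the string argument. The following is only a minimal sketch: the markup is a hypothetical, trimmed-down stand-in for the real feedback rows, and the names item_id, keyword and titles are made up for illustration. It also assumes each item <span> holds a single text node, because string= is compared against the tag's .string.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup modelled on the feedback rows in the question.
html = '''
<div class="card__item"><span data-test-id="fdbk-item-1">1PC Swallow Mirror Wall Stickers</span></div>
<div class="card__item"><span data-test-id="fdbk-item-2">CERAMIC soap dish</span></div>
<div class="card__item"><span data-test-id="fdbk-item-3">Phone case</span></div>
<div class="card__comment"><span data-test-id="fdbk-comment-3">The tray was broken</span></div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Same attribute pattern as in the question.
item_id = re.compile(r'^fdbk-item-')
# Case-insensitive keyword pattern; re.IGNORECASE belongs here, not in XPath.
keyword = re.compile(r'mirror|tray|ceramic', re.IGNORECASE)

# Keep only <span> tags that satisfy both the attribute and the text pattern.
matches = soup.find_all('span', attrs={'data-test-id': item_id}, string=keyword)
titles = [span.get_text(strip=True) for span in matches]
print(titles)
```

      The comment span that mentions "tray" is skipped because its data-test-id does not start with fdbk-item-, and "Phone case" is skipped because none of the keywords occur in it.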
      

    Example

    from bs4 import BeautifulSoup
    
    html = '''
    <tr data-feedback-id="1638468213026"><td><div class="card__feedback-container"><div class="card__rating"><svg class="imagePosition icon icon--feedback-negative" data-test-id="fdbk-rating-13" data-test-type="negative" viewBox="0 0 22 22" height="24" width="24" aria-label="Negative feedback rating" role="img"><path fill="#E0103A" d="M10.969 0C4.911 0 0 4.911 0 10.969s4.911 10.969 10.969 10.969 10.969-4.91 10.969-10.969C21.938 4.912 17.027 0 10.969 0zm4.113 12.339H6.854a1.37 1.37 0 1 1 0-2.74h8.227a1.37 1.37 0 1 1 0 2.74h.001z"></path></svg></div><div class="card__feedback"><div class="card__comment"><span data-test-id="fdbk-comment-13" aria-label="Such a shame, the pack contained 6 swallows, but 5 were broken.">Such a shame, the pack contained 6 swallows, but 5 were broken.</span></div><div class="card__item"><span data-test-id="fdbk-item-13">1PC Swallow Mirror Non-toxic Peel and Stick Creative Wall Stickers for Bedroom (#384870311555)</span></div></div></div></td><td><div class="card__from"><span data-test-id="fdbk-context-13" aria-label="Feedback left by buyer.">2***w</span><span class="no-wrap">&nbsp;(<span data-test-id="fdbk-rating-score-13">340</span><span class="gspr icst3 starIcon" data-test-id="fdbk-rating-icon-13"></span>)</span></div><div class="card__price"><span data-test-id="fdbk-price-13">GBP 5.62</span></div></td><td><div><span data-test-id="fdbk-time-13" aria-label="Past year">Past year</span></div><div class="card__links"></div></td></tr>
    '''
    soup = BeautifulSoup(html, 'html.parser')
    
    print(soup.select('tr:has(span:-soup-contains("Mirror","Tray","Ceramic"))'))
    
    matches = ['mirror','tray','ceramic']
    
    [e for e in soup.select('[data-feedback-id] .card__item') if any([x in e.text.lower() for x in matches])]
    

    Output of the list comprehension

    [<div class="card__item"><span data-test-id="fdbk-item-13">1PC Swallow Mirror Non-toxic Peel and Stick Creative Wall Stickers for Bedroom (#384870311555)</span></div>]
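
    As for the XPath attempt in the question: list.append() returns None, so p is None and the bare except silently hides the resulting error. BeautifulSoup tags also do not provide Selenium's find_elements methods. In addition, XPath 1.0 contains() takes exactly two arguments and knows nothing about regular expressions or a flags parameter. If you want to stay in Selenium, one common workaround is to lowercase the text with translate() and OR several contains() tests together. The sketch below only builds such an expression; build_item_xpath is a hypothetical helper name, it assumes plain ASCII keywords without quote characters, and the Selenium call is shown in comments because it needs a live driver.

```python
UPPER = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
LOWER = 'abcdefghijklmnopqrstuvwxyz'


def build_item_xpath(keywords):
    """Build an XPath 1.0 expression for fdbk-item spans containing any keyword.

    Hypothetical helper: assumes ASCII keywords that contain no quote characters.
    """
    # translate() maps A-Z to a-z, which emulates a case-insensitive comparison.
    lowered_text = f"translate(normalize-space(.), '{UPPER}', '{LOWER}')"
    conditions = ' or '.join(
        f"contains({lowered_text}, '{kw.lower()}')" for kw in keywords
    )
    return f"//span[starts-with(@data-test-id, 'fdbk-item-')][{conditions}]"


xpath = build_item_xpath(['mirror', 'tray', 'ceramic'])
print(xpath)

# With a live Selenium 4 session (not run here) the expression could be used as:
#   from selenium.webdriver.common.by import By
#   for span in driver.find_elements(By.XPATH, xpath):
#       print(span.text)
```

    This keeps the filtering inside the browser, whereas the BeautifulSoup variants above work on the downloaded page source.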