How To Use FindAll While Web Scraping


I want to scrape https://www.ebay.co.uk/sch/i.html?_from=R40&_sacat=0&_nkw=xbox&_pgn=2&_skc=50&rt=nc and get the titles (Microsoft Xbox 360 E 250 GB Black Console, Microsoft Xbox One S 1TB Console White with 2 Wireless Controllers, etc.). In due course I want to feed the Python script different eBay URLs, but for the sake of this question I just want to focus on one specific eBay URL.

I then want to add those titles to a data frame, which I would write to Excel. I think I can do this part myself.

Did not work -

for post in soup.findAll('a',id='ListViewInner'):
    print (post.get('href'))

Did not work -

for post in soup.findAll('a',id='body'):
    print (post.get('href'))

Did not work -

for post in soup.findAll('a',id='body'):
    print (post.get('href'))

h1 = soup.find("a",{"class":"lvtitle"})
print(h1)

Did not work -

for post in soup.findAll('a',attrs={"class":"left-center"}):
    print (post.get('href'))

Did not work -

for post in soup.findAll('a',{'id':'ListViewInner'}):
    print (post.get('href'))

This gave me links from the wrong parts of the web page. I know href gives hyperlinks rather than titles, but I figured that if the code below had worked, I could amend it to get titles -

for post in soup.findAll('a'):
    print (post.get('href'))

Here is all my code -

import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import urllib.request
from bs4 import BeautifulSoup

# BaseURL, Syntax1 and Syntax2 should be standard across all
# eBay URLs, whereas Request and PageNumber can change

BaseURL = "https://www.ebay.co.uk/sch/i.html?_from=R40&_sacat=0&_nkw="

Syntax1 = "&_skc=50&rt=nc"

Request = "xbox"

Syntax2  = "&_pgn="

PageNumber ="2"

URL = BaseURL + Request + Syntax2 + PageNumber + Syntax1


print (URL)
HTML = urllib.request.urlopen(URL).read()

#print(HTML)

soup = BeautifulSoup(HTML, "html.parser")

#print (soup)

for post in soup.findAll('a'):
    print (post.get('href'))

Solution

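  • Use a CSS selector. The selector '#ListViewInner a' matches every link nested inside the element whose id is ListViewInner, whereas findAll('a', id='ListViewInner') only matches links that carry that id themselves, which is why the attempts in the question returned nothing.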

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://www.ebay.co.uk/sch/i.html?_from=R40&_sacat=0&_nkw=xbox&_pgn=2&_skc=50&rt=nc'
    # Fetch the page and parse the returned HTML
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for post in soup.select("#ListViewInner a"):
        print(post.get('href'))
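
The loop above prints links, while the question asks for titles. The following is a minimal sketch of the difference between the two lookups and of one way to pull out the title text. It runs against a small hand-written HTML snippet rather than the live site, and it assumes each title is a link inside an element with class "lvtitle" (the class name mentioned in the question); the real eBay markup may differ and can change at any time.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for an eBay results page.
# The structure (h3.lvtitle > a inside #ListViewInner) is an assumption.
SAMPLE_HTML = """
<div id="ListViewInner">
  <h3 class="lvtitle"><a href="/itm/1">Microsoft Xbox 360 E 250 GB Black Console</a></h3>
  <h3 class="lvtitle"><a href="/itm/2">Microsoft Xbox One S 1TB Console White</a></h3>
</div>
<a href="/help">Help</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Matches only <a> tags whose own id is "ListViewInner"; there are none.
own_id_links = soup.find_all("a", id="ListViewInner")

# Matches <a> tags nested anywhere inside the element with that id.
nested_links = soup.select("#ListViewInner a")

# Titles: the text of each link inside an element with class "lvtitle".
titles = [a.get_text(strip=True) for a in soup.select("#ListViewInner .lvtitle a")]

print(len(own_id_links))
print([a["href"] for a in nested_links])
print(titles)
```

The resulting list of strings can be passed to pandas as a single column, for example pd.DataFrame({"title": titles}).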
    

    Use the str.format() method instead of string concatenation to build the URL.

    import pandas as pd
    from pandas import ExcelWriter
    from pandas import ExcelFile
    import urllib.request
    from bs4 import BeautifulSoup
    
    BaseURL = "https://www.ebay.co.uk/sch/i.html?_from=R40&_sacat=0&_nkw={}&_pgn={}&_skc={}&rt={}"
    
    skc = "50"
    rt = "nc"
    Request = "xbox"
    PageNumber = "2"
    
    URL = BaseURL.format(Request,PageNumber,skc,rt)
    print(URL)
    HTML = urllib.request.urlopen(URL).read()
    soup = BeautifulSoup(HTML,"html.parser")
    for post in soup.select('#ListViewInner a'):
        print(post.get('href'))
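
Since the goal is to feed the script different search terms and page numbers, the URL can also be assembled from a dictionary of query parameters. This is a sketch of an alternative to format(), not part of the original answer: it uses urllib.parse.urlencode from the standard library, which also escapes spaces and special characters in the search term. The function name build_search_url and its default values are hypothetical, chosen only to reproduce the URL from the question.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ebay.co.uk/sch/i.html"


def build_search_url(keyword, page, skc=50, rt="nc"):
    """Return a search URL for the given keyword and page number."""
    # Parameter names and fixed values are copied from the URL in the question.
    params = {
        "_from": "R40",
        "_sacat": 0,
        "_nkw": keyword,
        "_pgn": page,
        "_skc": skc,
        "rt": rt,
    }
    # urlencode joins the pairs with "&" and percent-encodes the values.
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url("xbox", 2)
print(url)
```

For a multi-word search such as build_search_url("xbox one", 1), urlencode encodes the space as "+", so the term does not need to be escaped by hand.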