I'm having an issue where my scraper only prints the first entry of each page. I need the data from all three pages of the website to be scraped and added to the list 'infoList'.
I assume the problem is the declaration 'CAR_INFO = 0', but I'm not sure how to fix it. Any tips or a push in the right direction would be greatly appreciated.
My code:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
import re

DRIVER_PATH = r"C:\Users\salmou\Downloads\chromedriver_win32\chromedriver.exe"
URL = "https://vancouver.craigslist.org/"
browser = webdriver.Chrome(DRIVER_PATH)
browser.get(URL)
time.sleep(4)
SEARCH_TERM = "Honda"
search = browser.find_element_by_css_selector("#query")
search.send_keys(SEARCH_TERM)
search.send_keys(u'\ue007')

class ScrapedData:
    carInfo = ""

    def __init__(self, carInfo):
        self.carInfo = carInfo

    def scrapedCarInfo(self):
        print(SEARCH_TERM + " information: " + self.carInfo)
        print("****")

infoList = []
for i in range(0,3):
    content = browser.find_elements_by_css_selector(".hdrlnk")
    for e in content:
        start = e.get_attribute("innerHTML")
        soup = BeautifulSoup(start, features="lxml")
        rawString = soup.get_text().strip()
        # print(soup.get_text())
        # print("*****************************************************")
    button = browser.find_element_by_css_selector(".next")
    button.click()
    time.sleep(3)
    rawString = re.sub(r"[\n\t]*", "", rawString)
    # Replace two or more consecutive empty spaces with '*'
    rawString = re.sub('[ ]{2,}', '*', rawString)
    infoArray = rawString.split('*')
    CAR_INFO = 0
    carInfo = infoArray[CAR_INFO]
    objInfo = ScrapedData(carInfo)
    infoList.append(objInfo)

for info in infoList:
    info.scrapedCarInfo()
I see you have two loops: an outer one with i and an inner one with e, but I can't see any reference to the current value of i inside the loop. So it looks like you are performing the same action three times.
Also, rawString is assigned in the inner loop but only processed in the outer loop. So only the last value that rawString received in the inner loop is processed by the outer loop, and only one entry per page ends up in infoList. This may be the cause of your problem.
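To illustrate the point about where the processing happens, here is a minimal sketch that runs without a browser. The nested list pages and its titles are made-up stand-ins for what find_elements_by_css_selector(".hdrlnk") would return on each of the three pages, and first_field is a hypothetical helper that just wraps the cleanup steps from your code. The only real change is that the cleanup and the append sit inside the inner loop.

```python
import re


def first_field(raw_string):
    """Apply the cleanup steps from the question and return the first field."""
    cleaned = re.sub(r"[\n\t]*", "", raw_string.strip())
    # Replace two or more consecutive spaces with '*', then split on it
    cleaned = re.sub(r"[ ]{2,}", "*", cleaned)
    return cleaned.split("*")[0]


# Hypothetical stand-in for the three result pages; in the real script each
# inner list would come from the ".hdrlnk" elements of the current page.
pages = [
    ["2012 Honda Civic", "Honda CR-V  low km"],
    ["\tHonda Accord\n", "2008 Honda Fit"],
    ["Honda Odyssey"],
]

info_list = []
for titles in pages:        # outer loop: one iteration per page
    for raw in titles:      # inner loop: one iteration per listing
        # Clean and append here, inside the inner loop, so every listing
        # is stored instead of only the last one seen on the page.
        info_list.append(first_field(raw))

print(info_list)
```

Applied to your script, this would mean moving everything from the first re.sub call down to infoList.append(objInfo) into the body of the for e in content loop, while the click on the .next button and the time.sleep(3) stay in the outer loop so they run once per page. I have not run this against the live site, so treat it as a sketch.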