python selenium screen-scraping

Data getting scraped and printed is only the first entry of each page but I need all the data


I'm having an issue where my scraping code only prints out the first entry of each page. I need all the data from all three pages of the website to be scraped and added to the list 'infoList'.

I assume the problem is the declaration 'CAR_INFO = 0', but I'm not sure how to fix it. Any tips or a push in the right direction would be greatly appreciated.

My code:

import time
from selenium import webdriver 
from bs4 import BeautifulSoup
import re


DRIVER_PATH = r"C:\Users\salmou\Downloads\chromedriver_win32\chromedriver.exe"
URL = "https://vancouver.craigslist.org/"

browser = webdriver.Chrome(DRIVER_PATH)
browser.get(URL)

time.sleep(4)

SEARCH_TERM = "Honda"
search = browser.find_element_by_css_selector("#query")
search.send_keys(SEARCH_TERM)
search.send_keys(u'\ue007')


class ScrapedData:
    carInfo = ""
    
    def __init__(self, carInfo):
        self.carInfo = carInfo

    def scrapedCarInfo(self):
        print(SEARCH_TERM + " information: " + self.carInfo)
        print("****")
        

infoList = []


for i in range(0,3):
    content = browser.find_elements_by_css_selector(".hdrlnk")
    for e in content:
        start = e.get_attribute("innerHTML")
        soup= BeautifulSoup(start, features=("lxml"))
        rawString = soup.get_text().strip()
        # print(soup.get_text())
        # print("*****************************************************")
    button = browser.find_element_by_css_selector(".next")
    button.click()
    time.sleep(3)
    
    rawString = re.sub(r"[\n\t]*", "", rawString)

    # Replace two or more consecutive empty spaces with '*'

    rawString = re.sub('[ ]{2,}', '*', rawString)
    
    infoArray = rawString.split('*')
        
    CAR_INFO = 0
    
    carInfo = infoArray[CAR_INFO]
    
    
    objInfo = ScrapedData(carInfo)
    infoList.append(objInfo)


for info in infoList:
    info.scrapedCarInfo()

Solution

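  • You have two loops: an outer one over i and an inner one over e, but nothing in the outer loop refers to the current value of i. Each iteration therefore repeats the same steps, and it only reaches a new page because of the click on the .next button.
    Also, rawString is assigned in the inner loop but processed only in the outer loop, after the inner loop has finished. So only the last value rawString received in the inner loop is processed, and only one entry per page ends up in infoList. This is most likely the cause of your problem. Moving the cleanup and the infoList.append call into the inner loop should collect every entry.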
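
The point above about rawString can be sketched as follows. This is only an illustration of the loop structure, not a drop-in replacement: the Selenium calls are replaced by plain lists of strings so that it runs without a browser, and the names clean_title, collect_titles and the sample pages data are hypothetical and not part of the original script.

```python
import re


def clean_title(raw):
    """Clean one listing title with the same steps as the question's code."""
    # Trim the ends, then drop newlines and tabs
    text = re.sub(r"[\n\t]+", "", raw.strip())
    # Replace two or more consecutive spaces with '*', then keep the first field
    text = re.sub(r" {2,}", "*", text)
    return text.split("*")[0]


def collect_titles(pages):
    """Return one cleaned title per listing, across all pages."""
    info_list = []
    for page in pages:        # outer loop: one iteration per results page
        for raw in page:      # inner loop: one iteration per listing
            # The cleanup and the append happen here, inside the inner loop,
            # so every listing is kept instead of only the last one per page.
            info_list.append(clean_title(raw))
    return info_list


# Hypothetical stand-in for the text of the .hdrlnk elements on two pages
pages = [
    ["2012 Honda Civic\n", "  Honda Accord  low km"],
    ["Honda CR-V 2015"],
]

titles = collect_titles(pages)
print(titles)
```

In the original script, the inner loop would iterate over browser.find_elements_by_css_selector(".hdrlnk") and append ScrapedData objects, while the click on .next and the time.sleep call would stay in the outer loop, after the inner loop.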