Tags: python, selenium, facebook, web-scraping

How to fix an error in web scraping Facebook Marketplace with infinite scrolling using Selenium


My code is scraping housing data from Facebook Marketplace, but I'm facing an issue. Initially, it can only read 24 listings when the page is opened. However, when I try to load more listings by scrolling down the page, my code starts reading all listings from the beginning instead of the 25th listing. How can I resolve this issue?

open = driver.find_elements(By.XPATH, '//div[@class="x3ct3a4"]')
# 'open' is a list of all clickable housing listings visible when the page first loads

while True:
    for o in open:
        sleep(random.randint(1, 2))

        #Here I read the data that I need 

        close_button = driver.find_element(By.XPATH, close_xpath)
        close_button.click()
        sleep(random.randint(1, 2))
        #Here I close the listing and go to the next one
        
    #When I read all 24 listings that were in the 'open' list, I then scroll the page down and try to get new listings and then read them

    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    sleep(random.randint(2, 4))
    open = driver.find_elements(By.XPATH, open_xpath)

    #But after the scroll, my code starts reading the same listings that it already read.

Here is my output:

1
['', '2 Beds 1 Bath Apartment']
['$1,600 / Month']
2
['', '1 Bed 1 Bath Apartment']
['$1,500 / Month']

.
.
.
24
['', '2 Beds 2 Baths Apartment']
['$1,350 / Month']
25
['', '2 Beds 1 Bath Apartment']
['$1,600 / Month']
26
['', '1 Bed 1 Bath Apartment']
['$1,500 / Month']

So, after the 24th listing, the code starts reading all the listings again from the beginning.


Solution

  • You can try a couple of things.

    Remove elements from the HTML

    At the end of the for o in open: loop, delete the current element o from the HTML using JavaScript.

    for o in open:
        ...
        driver.execute_script('var element = arguments[0]; element.remove();', o)
    

    However, this method may not always work: sometimes scrolling down to load new elements causes the page to reload all the previous elements, which are then re-added to the HTML. If that happens, try the next method.
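To illustrate why removing processed elements works, here is a minimal plain-Python sketch in which a list stands in for the DOM and a `find_elements` function stands in for `driver.find_elements` (both names and the fake listing data are illustrative, not part of the original code):

```python
# Simulated "DOM": the page holds listing elements; removing a processed
# element means a fresh find_elements call will not return it again.
dom = [f"listing-{i}" for i in range(24)]

def find_elements():
    # Stand-in for driver.find_elements: returns whatever is currently in the DOM.
    return list(dom)

processed = []
for o in find_elements():   # iterate over a snapshot of the current elements
    processed.append(o)     # here you would read the listing's data
    dom.remove(o)           # stand-in for element.remove() via execute_script

# After "scrolling", only the newly loaded listings remain to be found.
dom.extend(f"listing-{i}" for i in range(24, 30))
print(find_elements())      # only listing-24 .. listing-29
```

The key point is that after each pass the already-read elements are gone, so the next `find_elements` call returns only fresh listings, provided the page does not re-add the old ones on scroll.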

    Add a counter and loop over the new elements

    Define a counter and loop only over the elements whose index is greater than or equal to the counter (i.e. only the new elements). The first time the while loop runs, counter is 0, so the for loops over every element in open (for o in open[0:]:). At the end of that pass, counter equals 24, so on the second iteration of the while we have for o in open[24:]:, which excludes the first 24 elements.

    open = driver.find_elements(By.XPATH, '//div[@class="x3ct3a4"]')
    counter = 0
    while True:
        for o in open[counter:]:
            ...
            counter += 1
               
        ...
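To see the slicing in action, here is a small self-contained sketch (plain Python, with fake listing strings standing in for Selenium's element objects; the helper name `process_new_listings` is illustrative):

```python
def process_new_listings(listings, counter):
    """Return the not-yet-processed listings and the updated counter."""
    new = listings[counter:]     # skip everything already read
    return new, counter + len(new)

# First page load: find_elements returns 24 listings (fake data here).
page = [f"listing-{i}" for i in range(24)]
new, counter = process_new_listings(page, 0)
print(len(new), counter)        # all 24 are new; counter is now 24

# After scrolling, find_elements returns all 30 listings again,
# but slicing from the counter yields only the 6 new ones.
page = [f"listing-{i}" for i in range(30)]
new, counter = process_new_listings(page, counter)
print(new)                      # only listing-24 .. listing-29
```

Note that the while True loop in the answer never terminates on its own; in a real scraper you would also add a stop condition, for example breaking out when the counter stops growing after a scroll (meaning no new listings were loaded).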