My code is scraping housing data from Facebook Marketplace, but I'm facing an issue. Initially, it can only read 24 listings when the page is opened. However, when I try to load more listings by scrolling down the page, my code starts reading all listings from the beginning instead of the 25th listing. How can I resolve this issue?
open = driver.find_elements(By.XPATH, '//div[@ class="x3ct3a4"]')
# open is a list of all clickable housing listings present when I open the page
while True:
    for o in open:
        sleep(random.randint(1, 2))
        # Here I read the data that I need
        close_button = driver.find_element(By.XPATH, close_xpath)
        close_button.click()
        sleep(random.randint(1, 2))
        # Here I close the listing and go to the next one
    # When I have read all 24 listings that were in the 'open' list, I scroll the page down and try to get new listings and then read them
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    sleep(random.randint(2, 4))
    open = driver.find_elements(By.XPATH, open_xpath)
    # But after the scroll, my code starts reading the same listings that it already read.
Here is my output:
1
['', '2 Beds 1 Bath Apartment']
['$1,600 / Month']
2
['', '1 Bed 1 Bath Apartment']
['$1,500 / Month']
.
.
.
24
['', '2 Beds 2 Baths Apartment']
['$1,350 / Month']
25
['', '2 Beds 1 Bath Apartment']
['$1,600 / Month']
26
['', '1 Bed 1 Bath Apartment']
['$1,500 / Month']
So, after the 24th opened link, the code starts reading all the listings again from the beginning.
You can try a couple of things.
At the end of each pass through the for o in open loop, delete the current element o from the HTML using JavaScript.
for o in open:
    ...
    driver.execute_script('var element = arguments[0]; element.remove();', o)
However, this method may not work: sometimes, when you scroll down to load new elements, the page reloads all the previous elements, which are then re-added to the HTML. If that is the case, try the next method.
Define a counter and loop only over the elements whose index is at or above the counter (i.e. only over the new elements). The first time the while loop is executed, counter is 0, so the for loop goes over all the elements contained in open (for o in open[0:]:). At the end of the for loop, counter will be equal to 24, so on the second execution of the while loop, once open has been re-fetched after the scroll as in your code, we will have for o in open[24:]:, which means that the first 24 elements are now excluded.
open = driver.find_elements(By.XPATH, '//div[@ class="x3ct3a4"]')
counter = 0
while True:
    for o in open[counter:]:
        ...
        counter += 1
    ...
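The counter idea can be tried out without a browser. The following is only a minimal sketch, not a drop-in scraper: iter_new_items, fetch, load_more, fake_fetch and fake_scroll are hypothetical names introduced for illustration, and a plain Python list stands in for the page. It assumes the page keeps the earlier listings in the same order after each scroll, and it adds a stop condition (no new items after a scroll) that the while True loop above does not have.

```python
from typing import Callable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_new_items(fetch: Callable[[], List[T]],
                   load_more: Callable[[], None]) -> Iterator[T]:
    """Yield each item once, in page order, re-fetching after every scroll.

    fetch     -- returns every item currently on the page (old ones included)
    load_more -- scrolls the page so that more items can load
    Both callables are hypothetical hooks, not part of Selenium.
    """
    counter = 0  # number of items already processed
    while True:
        items = fetch()
        new_items = items[counter:]  # skip everything that was already read
        if not new_items:
            break  # nothing new appeared after the last scroll
        for item in new_items:
            yield item
            counter += 1
        load_more()


# Stand-in for the page: 24 listings at first, up to 24 more per scroll, 60 in total.
page = list(range(1, 25))


def fake_fetch() -> List[int]:
    return list(page)


def fake_scroll() -> None:
    first_new = len(page) + 1
    last_new = min(len(page) + 24, 60)
    page.extend(range(first_new, last_new + 1))


seen = list(iter_new_items(fake_fetch, fake_scroll))
print(len(seen), seen[23], seen[24])
```

With Selenium, fetch would presumably wrap driver.find_elements(By.XPATH, open_xpath) and load_more would wrap the scrollTo call and the sleep from the question. If the page re-orders or drops earlier listings after a scroll, slicing by counter no longer lines up with what was already read, and this sketch would need a different way of recognising listings it has seen.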