Tags: python, selenium-webdriver, web-scraping, bots

Using Python Selenium to export Facebook posts - can't separate by post


For a research project, I am trying to scrape posts from a Facebook group. I want to get the name, date, and content of each post.

I first tried the following code, but it is not capturing the data post by post: it returns all the names together, and I cannot break them down by post.

In the following code, my intent was to find all the posts with:

posts = browser.find_elements(By.XPATH, "//div[contains(@class,'x1yztbdb x1n2onr6 xh8yej3 x1ja2u2z')]")

then iterate over each post element with:

for index, post in enumerate(posts):

and finally to collect all the <a> elements from each iterated post with:

name_spans = post.find_elements(By.XPATH,"//a[contains(@class,'x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f')]")

But when I run the code, it returns all the <a> elements together, so I cannot tell which post each one belongs to.

Any work-around?

# Imports used by this snippet
import random
from time import sleep

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

import load_cookies  # custom helper that returns a browser with my cookies added

# Loading cookies
browser = load_cookies.adding_cookies_browser(cookies)

# Get page
print("getting to fb")
browser.get("https://www.facebook.com/groups/1494117617557521?sorting_setting=CHRONOLOGICAL")
sleep(random.randint(4,6))

#Scroll until this member's post (so it only gets new posts)
last_member = ["member name here", "link here"]
new_members = [ ]

#scroll down until finding the last contact scraped previously

while True:
    #scroll down
    browser.execute_script("window.scrollBy(0, document.body.scrollHeight);")
    print("scrolling down")
    try:
        #looking for the last name that was scraped the previous time. If it is found, stop scrolling down
        text = WebDriverWait(browser, random.randint(3,5)).until(EC.presence_of_element_located((By.XPATH, f"//*[text()='{last_member[0]}']")))
        print(f"found {last_member[0]}")
        break
    except:
        pass

group_posts = []
posts = browser.find_elements(By.XPATH, "//div[contains(@class,'x1yztbdb x1n2onr6 xh8yej3 x1ja2u2z')]")
print(posts)
print(len(posts))

# Getting each post info:
for index, post in enumerate(posts):
    print(f"post{index}")
    print(post)

    #find people's names by locating the <a> elements in this post
    name_spans = post.find_elements(By.XPATH,"//a[contains(@class,'x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f')]")

    for i, name in enumerate(name_spans):
        print(f"post n{i}")
        print(name.text)

Solution

  • In the definition of name_spans you are using post.find_elements since you want to restrict the search to within post. But this is not enough: you also have to add a dot (.) in front of the XPath, as shown in the sketch below:

    .//a[contains(@class,'x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f')]
    

    Remember

    • //div finds divs anywhere in the whole HTML document
    • .//div finds divs which are descendants of the current node

    Moreover

    • .// finds the descendants of the current node
    • ./ finds the direct children of the current node
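
  • Putting the fix together, a minimal sketch of the corrected loop could look like this (it reuses the class-based XPaths from the question and assumes browser is the WebDriver that has already scrolled the group page):

    from selenium.webdriver.common.by import By

    # same post-container XPath as in the question
    posts = browser.find_elements(By.XPATH, "//div[contains(@class,'x1yztbdb x1n2onr6 xh8yej3 x1ja2u2z')]")

    for index, post in enumerate(posts):
        # the leading dot makes the XPath relative to `post`,
        # so each iteration only returns the author links inside that post
        name_spans = post.find_elements(By.XPATH, ".//a[contains(@class,'x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f')]")
        for i, name in enumerate(name_spans):
            print(f"post {index}, name {i}: {name.text}")

    The same leading-dot rule applies to any other relative XPaths you add inside the loop, for example for the date or the post text.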