python-3.x web-scraping beautifulsoup html-parsing

Scrape image metadata from public Facebook posts


This is a follow-up question in my quest to get some data from public Facebook posts. This time I'm trying to collect image metadata (the image's URL). Link posts work fine, but some posts return empty data. I used the same approach suggested in the answers to my previous question, but it doesn't work for the example below. I would appreciate any suggestions!

import requests
from bs4 import BeautifulSoup

link = "https://www.facebook.com/228735667216/posts/10151653129902217"
res = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
# Strip HTML comment delimiters so markup wrapped in comments becomes parseable
comment = res.text.replace("-->", "").replace("<!--", "")
soup = BeautifulSoup(comment, "lxml")
image = soup.find("div", class_="uiScaledImageContainer _517g")
img = image.find("img", class_="scaledImageFitWidth img")
href = img["src"]
print(href)
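
When the expected container is missing from the returned HTML, `soup.find` returns `None` and the chained `.find` call raises `AttributeError`. The following is only a minimal sketch of a guarded lookup; the helper name `image_sources` and the sample markup are assumptions made for illustration, and it uses the built-in `html.parser` so it does not depend on `lxml`.

```python
from bs4 import BeautifulSoup


def image_sources(html):
    """Return the src of every <img> carrying the scaledImageFitWidth class.

    Hypothetical helper: returns an empty list instead of raising
    when the page contains no matching image.
    """
    # Same comment-stripping step as in the snippet above
    cleaned = html.replace("-->", "").replace("<!--", "")
    soup = BeautifulSoup(cleaned, "html.parser")
    sources = []
    for img in soup.find_all("img", class_="scaledImageFitWidth"):
        src = img.get("src")  # .get avoids a KeyError when src is absent
        if src:
            sources.append(src)
    return sources


# Made-up markup that mimics the structure targeted in the question
sample = (
    '<!-- <div class="uiScaledImageContainer _517g">'
    '<img class="scaledImageFitWidth img" src="https://example.com/a.jpg">'
    "</div> -->"
)
print(image_sources(sample))
print(image_sources("<html><body>login required</body></html>"))
```

An empty list then signals that the page served to the request did not contain the image markup at all, which is easier to diagnose than a traceback.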

Solution

  • Logging in with requests is not easy, so I intentionally skipped that library. You can use selenium on its own, or selenium in combination with BeautifulSoup, to get the job done.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from bs4 import BeautifulSoup

    url = "https://www.facebook.com/228735667216/posts/10156284868312217"

    chrome_options = webdriver.ChromeOptions()

    # Run the browser headless
    chrome_options.add_argument("--headless")
    # Block the notification prompt that pops up right after login
    prefs = {"profile.default_content_setting_values.notifications": 2}
    chrome_options.add_experimental_option("prefs", prefs)
    driver = webdriver.Chrome(options=chrome_options)

    driver.get(url)
    # Fill in the login form and submit it with Enter
    driver.find_element(By.ID, "email").send_keys("your_username")
    driver.find_element(By.ID, "pass").send_keys("your_password", Keys.RETURN)
    # Reload the post now that the session is logged in
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, "lxml")
    for img in soup.find_all(class_="scaledImageFitWidth"):
        print(img.get("src"))
    driver.quit()
    

    Partial output:

    https://external.fdac17-1.fna.fbcdn.net/safe_image.php?d=AQBjBuP0TBYabtnO&w=540&h=282&url=https%3A%2F%2Fs3.amazonaws.com%2Fprod-cust-photo-posts-jfaikqealaka%2F3065-6e4c325b07b921fdefed4dd727881f8d.jpg&cfs=1&upscale=1&fallback=news_d_placeholder_publisher&_nc_hash=AQCVKXMSqvNiHZik
    https://external.fdac17-1.fna.fbcdn.net/safe_image.php?d=AQCJ6RFOF4dY2xTn&w=100&h=100&url=https%3A%2F%2Fcdn.images.express.co.uk%2Fimg%2Fdynamic%2F106%2F750x445%2F1046936.jpg&cfs=1&upscale=1&fallback=news_d_placeholder_publisher_square&_nc_hash=AQAyFxRaZTGV47Se
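
The links printed above are `safe_image.php` proxy URLs that carry the source image address percent-encoded in their `url` query parameter. If the underlying image URL is what you are after, it can be recovered with the standard library. This is only a sketch: the helper name `describe_safe_image` is hypothetical, and reading `w` and `h` as the requested width and height is an assumption based on how the URLs look, not on documented behavior.

```python
from urllib.parse import urlparse, parse_qs


def describe_safe_image(src):
    """Pull the embedded image URL and the w/h values out of a safe_image.php link."""
    # parse_qs percent-decodes each value and returns a list per key
    query = parse_qs(urlparse(src).query)
    width = query.get("w", [None])[0]
    height = query.get("h", [None])[0]
    return {
        "original_url": query.get("url", [None])[0],
        # Assumption: w and h are the requested pixel dimensions
        "width": int(width) if width else None,
        "height": int(height) if height else None,
    }


src = (
    "https://external.fdac17-1.fna.fbcdn.net/safe_image.php"
    "?d=AQBjBuP0TBYabtnO&w=540&h=282"
    "&url=https%3A%2F%2Fs3.amazonaws.com%2Fprod-cust-photo-posts-jfaikqealaka"
    "%2F3065-6e4c325b07b921fdefed4dd727881f8d.jpg"
    "&cfs=1&upscale=1&fallback=news_d_placeholder_publisher"
    "&_nc_hash=AQCVKXMSqvNiHZik"
)
print(describe_safe_image(src))
```

For a link without a `url` parameter, the helper returns `None` for `original_url` rather than raising.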