python-3.x web-scraping beautifulsoup html-parsing

Scrape original links and headlines from Facebook posts


I need to gather some information that Facebook Analytics does not provide, for example the original URL and headline of an article promoted on Facebook as a link post. This information is buried in the HTML of the Facebook post, but I am struggling to extract it. I would appreciate your help.

Let's take this example: https://www.facebook.com/bbcnews/posts/10156428513547217

I identified the class for the link (bbc.in...) as "_6ks" and the classes for the headline as 'mbs _6m6 _2cnj _5s6c'.

The code below doesn't print anything:

from bs4 import BeautifulSoup
import requests

link = 'https://www.facebook.com/bbcnews/posts/10156428513547217'
r = requests.get(link)
soup = BeautifulSoup(r.content, "lxml")

for paragraph in soup.find_all("div", class_="_6ks"):
    for a in paragraph("a"):
        print(a.get('href'))

for paragraph in soup.find_all("div", class_='mbs _6m6 _2cnj _5s6c'):
    for a in paragraph("a"):
        print(a.get('hover'))

Solution

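  • Another way to achieve the same is to send a browser-like User-Agent, strip the HTML comment markers from the response so that the markup inside them gets parsed, and then select the first link under the "mbs" headline class: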

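    from bs4 import BeautifulSoup
    import requests

    link = 'https://www.facebook.com/bbcnews/posts/10156428513547217'

    # Send a browser-like User-Agent with the request.
    res = requests.get(link, headers={'User-Agent': 'Mozilla/5.0'})
    # Drop the HTML comment markers so the markup inside them is parsed as elements.
    html = res.text.replace("-->", "").replace("<!--", "")
    soup = BeautifulSoup(html, "lxml")
    # First <a> inside an element carrying the "mbs" class (the headline block).
    item = soup.select_one('.mbs a')
    if item is not None:
        print(item.get("href"))
        print(item.text)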
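
To see why removing the comment markers matters, here is a minimal offline sketch. It assumes the target markup is wrapped in an HTML comment, which is what the replace() calls above work around. The sample HTML, the "hidden_elem" class, the example.com URL, and the helper names (LinkCollector, collect_links, strip_comment_markers) are all made up for illustration, and it uses the standard library's html.parser rather than BeautifulSoup so that it runs without a network connection or third-party packages.

```python
from html.parser import HTMLParser

# Made-up sample (an assumption, not real Facebook markup): the link we
# want sits inside an HTML comment, so a parser treats it as comment text.
SAMPLE_HTML = (
    '<html><body><div class="hidden_elem"><!-- '
    '<div class="mbs _6m6"><a href="https://example.com/story">'
    'Example headline</a></div>'
    ' --></div></body></html>'
)


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element the parser sees."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(html):
    """Parse html and return the list of (href, text) pairs found."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def strip_comment_markers(html):
    """Remove comment delimiters so the commented-out markup becomes live."""
    return html.replace("<!--", "").replace("-->", "")


links_before = collect_links(SAMPLE_HTML)
links_after = collect_links(strip_comment_markers(SAMPLE_HTML))
print(links_before)
print(links_after)
```

With the markers in place the parser reports no links at all, which matches the empty output in the question. Once they are stripped, the anchor and its text become visible to the parser.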