Tags: python, authentication, web-scraping, oauth-2.0, python-requests

Logging into the Wall Street Journal via Python/Requests


I'm working on an academic project that requires scraping news articles from across the internet with a Python script built on the Requests and BeautifulSoup libraries. Recently I was tasked with scraping articles from the Wall Street Journal and was given a subscription login to use. In past assignments, however, I never needed to log into a website before scraping an article. I've followed the basic approach of logging in with Requests via a POST, but WSJ's login doesn't seem to work that way: I still receive the "unsubscribed" article page. A previous question asked here suggests that WSJ uses OAuth 2.0 (something I have no experience with), but its solution was a shell script; I'm looking for a Python/Requests solution if possible. What I'm currently working with:

import requests
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'

# login: POST the credentials to the login page (this is the step that does not log me in)
login_url = "https://accounts.wsj.com/login?target="
login_data = {'username': '*******', 'password': '*******'}
s = requests.Session()
s.post(login_url, data=login_data, headers={'User-Agent': user_agent})

# get article with the same session, so any cookies set during login are sent along
url = "https://www.wsj.com/articles/rep-elijah-cummings-has-died-11571304573?mod=hp_lead_pos8"
r = s.get(url, headers={'User-Agent': user_agent})
page = r.content
soup = BeautifulSoup(page, 'html.parser')

Any help is appreciated.


Solution

  • First of all, I would verify that the WSJ "Terms of Service" allow you to scrape the website, even for academic research. Many websites actually prohibit this.

    Second, the "requests" library is most useful for APIs (i.e. websites for computers, not websites for humans). Requests has an 'auth' parameter for APIs that use HTTP Basic authentication (i.e. an ID/PW sent with each request, not a cookie-tracked login session).
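To make the 'auth' parameter concrete, here is a minimal sketch of HTTP Basic authentication with Requests. The URL and credentials are placeholders (assumptions, not anything WSJ exposes), and the request is only prepared, not sent, so the resulting header can be inspected offline.

```python
import requests


def build_basic_auth_request(url, username, password):
    """Prepare (but do not send) a GET request that uses HTTP Basic auth.

    Passing a (username, password) tuple as `auth` makes Requests add an
    `Authorization: Basic ...` header to the request.
    """
    request = requests.Request("GET", url, auth=(username, password))
    return request.prepare()


# ASSUMPTION: placeholder endpoint and credentials, for illustration only.
prepared = build_basic_auth_request("https://api.example.com/items", "user", "secret")

# The credentials travel in a header on every request; no login form or cookie is involved.
print(prepared.headers["Authorization"])
```

With a real API you would normally skip the prepare step and call requests.get(url, auth=(username, password)) directly. A site that logs you in through a web form and tracks the session with cookies will not be satisfied by this header, which is why the approach does not carry over to a login page like the one in the question.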

    In your particular circumstance, you may need to use Selenium. There's a good beginner book online that covers Python and Selenium:

    https://automatetheboringstuff.com/

    In my experience, I have used Selenium to launch a browser, log in with an ID/PW, and then automate clicking links and downloading data.

    It requires lots of experimentation (and frustration), so there isn't really an easy answer that I know of.