I am trying to use pdfkit to make a visual backup of our company wiki. I am running into trouble because the site requires users to be logged in. I developed a script using splinter that logs into the company wiki, but when pdfkit runs, it returns the login page, so pdfkit must be opening a separate session. How can I find out which credentials (cookies) are needed to access the pages on the site, and save them in a variable so I can grab these screenshots?
I am using Python 2.7.8 with splinter, requests, and pdfkit.
from splinter import Browser

# Log in through a real browser so the wiki sets its session cookies
browser = Browser()
browser.visit('https://companywiki.com')
browser.find_by_id('login-link').click()
browser.fill('os_username', 'username')
browser.fill('os_password', 'password')
browser.find_by_name('login').click()

import pdfkit

# This opens its own session, so it renders the login page instead
pdfkit.from_url("https://pagefromcompanywiki.com", "c:/out.pdf")
I also found the following script, which will log me in and save credentials, but I'm not sure how to tie it into what I am trying to do.
import requests

EMAIL = ''
PASSWORD = ''
URL = 'https://company.wiki.com'

def main():
    # A Session keeps cookies between requests, so the login cookie
    # set by the POST is reused for the GET below
    session = requests.Session()
    login_data = {
        'loginemail': EMAIL,
        'loginpswd': PASSWORD,
        'submit': 'login',
    }
    session.post(URL, data=login_data)
    r = session.get('https://pageoncompanywiki.com')  # r.text is the protected page

if __name__ == '__main__':
    main()
Any ideas on how to accomplish this task are appreciated.
When you log in with your Splinter browser, the site sends back HTTP cookies that identify your authorized session, and browser remembers them for further requests.

But PDFKit knows nothing about your browser. It just passes the URL you gave it down to the underlying wkhtmltopdf tool, which then fetches the page with its own default settings.

What you need to do is transfer the cookies from browser to wkhtmltopdf. Thankfully, it's easy to connect Splinter and PDFKit in this way:
# browser.cookies.all() returns the cookies as {name: value}; pdfkit's
# repeatable "cookie" option takes a sequence of (name, value) pairs
options = {"cookie": browser.cookies.all().items()}
pdfkit.from_url("https://pagefromcompanywiki.com", "c:/out.pdf", options=options)