Tags: cookies, oauth, wget, google-sites-2016

How can I download a non-public Google Site?


I would like to download the HTML of every page of a Google Site that can only be accessed by logging into Google. Google does not provide an API for the new Google Sites (source). To complicate matters, my Google login requires 2-Step Verification (2SV).

I tried authenticating in Firefox, saving my cookies via the Firefox extension cookies.txt, and then using wget:

wget \
    --load-cookies=cookies.txt \
    --no-host-directories \
    --no-directories \
    --recursive \
    --accept '*.html' \
    https://sites.google.com/a/example.com/the-website-i-need/

The result was just a Google login page.
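
Before digging further, it may be worth ruling out the possibility that the exported file simply contains no usable cookies for Google domains. The following is a minimal sketch, not a diagnosis: it assumes cookies.txt is in the Netscape format that wget's --load-cookies reads, and it treats a missing or zero expiry as a session cookie, which is an assumption about how the exporting extension writes them. The function name summarize_cookies and the default domain suffix are hypothetical.

```python
import time
from collections import Counter
from http.cookiejar import MozillaCookieJar


def summarize_cookies(path, domain_suffix="google.com"):
    """Count session, expired, and live cookies for one domain suffix."""
    jar = MozillaCookieJar()
    # Load everything, including session and expired cookies, so they can be counted.
    jar.load(path, ignore_discard=True, ignore_expires=True)
    now = time.time()
    counts = Counter()
    for cookie in jar:
        domain = cookie.domain.lstrip(".")
        # Keep only the suffix itself and its subdomains.
        if domain != domain_suffix and not domain.endswith("." + domain_suffix):
            continue
        if cookie.expires in (None, 0):
            # Assumption: no expiry or an expiry of 0 marks a session cookie.
            counts["session"] += 1
        elif cookie.is_expired(now):
            counts["expired"] += 1
        else:
            counts["live"] += 1
    return dict(counts)
```

Calling summarize_cookies("cookies.txt") returns a dictionary of counts per category. An empty result, or one with only expired entries, would suggest the export did not capture a logged-in session; a healthy-looking result does not prove the cookies are sufficient, since Google may require more than cookies to accept the request.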

I also tried from within Firefox via the cliget plugin, which can generate a wget command equivalent to what Firefox does for downloads. My idea was to add the recursive options to the generated command. However, the plugin just reported "No downloads for this session", even after I saved the root page of the Google Site as an .html file. I then started downloading a PDF file from the Google Site, which did trigger the cliget plugin. However, the resulting wget command returned 302 Moved Temporarily, which wget faithfully followed; this process repeated until wget finally gave up with "20 redirections exceeded".

Can this be done with OAuth or some other method of authentication?

Related: Accessing a non-Public Google Sites page using curl + Bearer Token


Solution

  • I finally found a way to do this. Google Takeout allows you (in theory) to download all of your Google data, including your Google Sites.

    There are some limitations:

    • For unknown reasons, it does not work for the classic Google Sites. The data is simply not in the download that Google provides, even though Google says it is supported. This may be a bug. It does work well on the new Google Sites.
    • There is, as far as I know, no automated way to do this. You'll have to walk through the Google Takeout steps. However, for a one-time export, this shouldn't be a problem.
    • If you're using G Suite, your administrators may have disabled Google Takeout. Try it, but if it says, "You have no services enabled for which data can be exported", you'll need to work with your G Suite administrators.

    The short version:

    • in Google Drive, move your Google Site to a top-level folder
    • go to https://takeout.google.com/
    • under Google Drive, select the folder used above
    • export
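
Once the export has been downloaded, the pages still have to be pulled out of the archive. The sketch below shows one way to do that, assuming the export was requested as a .zip file and that the site's pages appear inside it as .html files; the exact layout of the archive is not described here, so treat both as assumptions. The function name extract_html and the file names in the usage line are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_html(archive_path, dest_dir):
    """Extract only the .html/.htm members of a zip archive into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            # Skip directories and anything that is not an HTML file.
            if info.is_dir() or not info.filename.lower().endswith((".html", ".htm")):
                continue
            # extract() keeps the member's relative path underneath dest.
            archive.extract(info, dest)
            extracted.append(info.filename)
    return sorted(extracted)
```

For example, extract_html("takeout.zip", "site-html") would return the list of archive paths it extracted and leave the files under site-html/ with their folder structure preserved.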

    The detailed version: