
WebCopy does not fully download a password-protected website with a form login: some pages download, but the rest return a 403 Forbidden error


I do not have much experience with Cyotek WebCopy, but I did manage to get it up and running and did a partial download of my site.

I need to download the entire contents (html, js, css, assets) of an internal website which is password-protected by a form login. The site needs to retain its functionality: links should be clickable and assets should be downloadable.

I do have the credentials and permission from the site owner to do so.

WebCopy deals with a password protected site in two different ways:

  • Run a scan, which detects the form login and lets you set the credentials.
  • Open the website in a browser within WebCopy and enter the credentials yourself.

Every time I try to download the website, I manage to get all the assets from the login page, and the homepage right after the login. Every other link that branches from the homepage returns a 403 Forbidden error.

What I have tried:

  • Use WebCopy form login detection and save the credentials
  • Use WebCopy login from browser, same result
  • Remove "Use header checking" option as stated here
  • Try with "follow internal redirects" and with "follow all redirects"
  • Have tried using HTTrack with similar results

If anyone has an idea of what I can do to get this running, it would be much appreciated. I am sure something is not set up correctly in the crawler, but after searching for a solution I couldn't find any more info.


Solution

  • The first link checked after login was the logout link, so I needed to set up a rule in the crawler to exclude it.

    OK, so I will leave this up in case anyone else finds themselves in the same situation.

    The default project configuration for WebCopy (after setting the login credentials) works fine.

    My issue was that the first link checked after a successful login was the "logout" link -_-

    This caused the crawler to lose its authentication, and every subsequent request returned 403.
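
To illustrate why the crawl order matters, here is a minimal Python sketch of the failure and the fix. It is not WebCopy's code: the `SITE` map, the `crawl` function and the `logout` pattern are all hypothetical, and the session is reduced to a single flag. It only shows that following a logout link early makes every later request fail with 403, while an exclusion rule that skips that link keeps the session alive.

```python
import re
from collections import deque

# Hypothetical in-memory site: each path maps to the links found on that page.
SITE = {
    "/home": ["/logout", "/reports", "/settings"],
    "/reports": ["/reports/2023"],
    "/reports/2023": [],
    "/settings": [],
    "/logout": [],
}


def crawl(start, exclude_patterns=()):
    """Breadth-first crawl of SITE; returns {path: HTTP status code}."""
    excluded = [re.compile(p, re.IGNORECASE) for p in exclude_patterns]
    logged_in = True  # assume the form login has already succeeded
    statuses = {}
    queue = deque([start])
    while queue:
        path = queue.popleft()
        # Skip pages already visited and pages matched by an exclusion rule.
        if path in statuses or any(rx.search(path) for rx in excluded):
            continue
        if path == "/logout":
            logged_in = False  # following this link ends the session
            statuses[path] = 200
            continue
        if not logged_in:
            statuses[path] = 403  # no session, so the server refuses the page
            continue
        statuses[path] = 200
        queue.extend(SITE[path])
    return statuses


without_rule = crawl("/home")
with_rule = crawl("/home", exclude_patterns=[r"logout"])
```

In this toy model, `without_rule` contains 403 entries for every page reached after `/logout`, whereas `with_rule` reaches all four content pages with status 200. In WebCopy itself the equivalent step is adding an exclusion rule that matches the logout URL of your site; the exact pattern depends on how that URL is named.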