Tags: directory, jsoup, http-status-code-403, virtual-directory

JSoup error 403 when trying to read the contents of a directory on my website


    Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=(site)
        at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
        at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:465)
        at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
        at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
        at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
        at plan.URLReader.main(URLReader.java:21)

Hello all!

I have been looking up a way to read a directory on a website of mine for an application I'm developing.

I can read the files themselves and work with them if I hardcode their URLs, but if I try to grab the list of files from the directory I get this error.

I've tried a few ways, but this is the code I am currently working with.

    String url = ""; // site removed for privacy
    print("Fetching %s...", url);

    Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36").get();
    Elements links = doc.select("a[href]");
    Elements media = doc.select("[src]");
    Elements imports = doc.select("link[href]");

... ... ...

If I point it at the main site, as in www.google.com/, it reads the links. The problem is that I want a directory, as in www.google.com/something/something/...

When I try that on my site, I get this error.

Any idea why I can access my main site, but not directories within it?

I also notice that '/' is needed at the end.

Just curious whether I am missing something or need to do this another way.

Thank you for your time.


Solution

  • This is likely a problem with (or deliberate attempt to block access using) the server's configuration, not your application. From the tag wiki excerpt for the http-status-code-403 tag:

    The 403 or "Forbidden" error message is a HTTP standard response code indicating that the request was legal and understood but the server refuses to respond to the request.

    From the tag wiki itself:

    A 403 Forbidden may be returned by a web server due to an authorization issue or other constraint related to the request. File permissions, lack of encryption, and maximum number of users reached (among others) can all be the cause of a 403 response.

    If the target site is attempting to block screen-scraping, another possibility is an unrecognized user-agent string, but you're setting the user-agent string to one (I presume) you've obtained from an actual browser, so that shouldn't be the cause.

    It's not clear from your question if you expect to fetch a regular (HTML) web page, or a special "directory listing" page generated by the server when an index.html is not present in a directory. If it's the latter, note that many servers have these listings disabled to avoid leaking the names of files in the directory that aren't linked to from the web site itself. Again, this is a server configuration issue, not something your application can work around.
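
To confirm that the 403 comes from the server rather than from jsoup, it can help to request the site root and the directory side by side and compare the status codes. The following is only a minimal sketch, assuming Java 11 or later: it swaps in the JDK's built-in `java.net.http` client instead of jsoup so that it runs without extra dependencies, and the class `ForbiddenCheck` and its methods `explain` and `fetchStatus` are hypothetical names made up for this illustration. The "directory" hint is a guess based on the trailing slash, not something the server reports.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper class; not part of jsoup or of the original post.
class ForbiddenCheck {

    // Turns a status code into a rough hint. The "directory" guess is only a
    // heuristic based on the trailing slash; the server alone knows the reason.
    public static String explain(int status, String url) {
        if (status == 403 && url.endsWith("/")) {
            return "Forbidden on a directory URL: listing is probably disabled or access is denied";
        }
        if (status == 403) {
            return "Forbidden: the server understood the request but refuses it";
        }
        if (status >= 200 && status < 300) {
            return "OK";
        }
        return "Unexpected status " + status;
    }

    // Sends a GET request and returns only the status code. Unlike jsoup's
    // get(), this does not throw on 4xx/5xx responses.
    public static int fetchStatus(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "Mozilla/5.0") // shortened placeholder user agent
                .GET()
                .build();
        HttpResponse<Void> response =
                client.send(request, HttpResponse.BodyHandlers.discarding());
        return response.statusCode();
    }

    public static void main(String[] args) throws Exception {
        // Pass the URLs to compare as arguments, e.g. the site root and the directory.
        for (String url : args) {
            int status = fetchStatus(url);
            System.out.println(url + " -> " + status + " (" + explain(status, url) + ")");
        }
    }
}
```

If the root returns 200 and the directory returns 403 with the same user agent, that points at the server's handling of that directory. Because this client does not follow redirects by default, a directory URL without the trailing '/' may show up as a redirect status rather than 403. With jsoup itself, `Jsoup.connect(url).ignoreHttpErrors(true).execute()` returns the response instead of throwing, so `statusCode()` can be inspected in the same way.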