web-crawler jsoup

Jsoup downloading error. Says must be logged in, but there's no login


Specs: My company's server runs Jsoup to download PDFs from links I provide.

I sometimes run into a problem where a website has a document (PDF or otherwise) that I can download normally from my browser, but my scraping software gets back an error page such as this:

    Something went wrong.
    Oh no! Something is not right! Try to log in again. If you continue to see this error, please contact us at [email protected]

    Error description:
    Message: Invalid URI: The Authority/Host could not be parsed.
    TargetSite: Void CreateThis(System.String, Boolean, System.UriKind)
    StackTrace:
       at System.Uri.CreateThis(String uri, Boolean dontEscape, UriKind uriKind)
       at SWPalInc.WebHost.Controllers.DController.F(String u, String n)
       at lambda_method(Closure , ControllerBase , Object[] )
       at System.Web.Mvc.ReflectedActionDescriptor.Execute(ControllerContext controllerContext, IDictionary`2 parameters)
       at System.Web.Mvc.ControllerActionInvoker.InvokeActionMethod(ControllerContext controllerContext, ActionDescriptor actionDescriptor, IDictionary`2 parameters)
       at System.Web.Mvc.ControllerActionInvoker.<>c__DisplayClass15.b__12()
       at System.Web.Mvc.ControllerActionInvoker.InvokeActionMethodFilter(IActionFilter filter, ActionExecutingContext preContext, Func`1 continuation)
       at System.Web.Mvc.ControllerActionInvoker.InvokeActionMethodWithFilters(ControllerContext controllerContext, IList`1 filters, ActionDescriptor actionDescriptor, IDictionary`2 parameters)
       at System.Web.Mvc.ControllerActionInvoker.InvokeAction(ControllerContext controllerContext, String actionName)
       at System.Web.Mvc.Controller.ExecuteCore()
       at System.Web.Mvc.ControllerBase.Execute(RequestContext requestContext)
       at System.Web.Mvc.MvcHandler.<>c__DisplayClass6.<>c__DisplayClassb.b__5()
       at System.Web.Mvc.Async.AsyncResultWrapper.<>c__DisplayClass1.b__0()
       at System.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
       at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)
    Data: System.Collections.ListDictionaryInternal
    InnerException:
    Source: System
    Click here and try to login again

I received that error when I tried to fetch the PDF from a link such as this one using my company's server: https://meetings.municode.com/d/f?u=https://agendapalncus.blob.core.windows.net/paonia-pubu/MEET-Agenda-e11f135d48564ad983c6c46949e34894.pdf&n=Agenda-Regular%20Town%20Board%20Meeting-February%2026,%202019%206.30%20PM.pdf

I've tried using a proxy server, but I get the same result when I crawl it. Does anyone know a solution to this, or has anyone seen it before?


Solution

  • When I try to fetch this URL with Jsoup, it throws:

    Exception in thread "main" org.jsoup.UnsupportedMimeTypeException: Unhandled content type.
    Must be text/*, application/xml, or application/xhtml+xml.
    

    so Jsoup appears to be throwing a proper, explicit exception: the response is a PDF, not a type Jsoup will parse. Try catching and handling this exception. This is how I would do it in Java:

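        import org.jsoup.Jsoup;
        import org.jsoup.UnsupportedMimeTypeException;
        import org.jsoup.nodes.Document;

        // Fragment: url is a String, and the enclosing method declares
        // "throws IOException" for other network failures.
        try {
            Document doc = Jsoup.connect(url).get();
            // (...)
        } catch (UnsupportedMimeTypeException ex) {
            // handle exception here, e.g. log ex.getMimeType() and ex.getUrl()
        }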
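
If the scraper needs to decide for itself whether a response is something Jsoup would parse, rather than relying only on the exception, a check along these lines may help. This is a minimal sketch using only the JDK. The class and method names (`ContentTypeCheck`, `looksParseable`) are hypothetical, and the rule simply mirrors the content types listed in the exception message above; it is not taken from Jsoup's internals.

```java
import java.util.Locale;

public class ContentTypeCheck {

    // Hypothetical helper: returns true when a Content-Type header value
    // falls into one of the families named in the exception message
    // (text/*, application/xml, application/xhtml+xml).
    public static boolean looksParseable(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=utf-8" and normalize case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.startsWith("text/")
                || mime.equals("application/xml")
                || mime.equals("application/xhtml+xml");
    }

    public static void main(String[] args) {
        System.out.println(looksParseable("text/html; charset=utf-8")); // true
        System.out.println(looksParseable("application/pdf"));          // false
    }
}
```

For responses that fail this check, such as `application/pdf`, Jsoup's `Connection.ignoreContentType(true)` combined with `execute().bodyAsBytes()` can fetch the raw bytes instead of parsing them. Whether that also avoids the server-side "Invalid URI" page from the question is untested here.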