
Nutch and HTTP POST authentication?


I'm stuck: I need to crawl websites that require submitting a form via HTTP POST, and Nutch does not support this. How can I work around this so that I can crawl these websites with Nutch? Or is there a better solution?


Solution

    1. Create a file in which each entry holds: a regex for the URLs that require authentication, the URL to submit the form to, and the form data.
    2. Write your own HTTP protocol plugin by modifying the standard protocol-httpclient plugin. If the URL being requested requires authentication and no login has been performed yet, go to the form and submit it first.

    This is the simplest solution. The problem is that no single simple solution works for a large number of websites: cookies expire, some logins rely on JavaScript, and so on. Search Nutch's JIRA; there have been many discussions about this.
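
The two steps above could be sketched roughly as follows. This is only an illustration of the decision logic, not Nutch code: the class `FormAuthRules`, its methods, and the tab-separated rule format (`regex<TAB>form URL<TAB>key=value&key=value`) are all assumptions made for this example. Actually sending the POST and keeping the session cookies would be the job of the modified protocol plugin's HTTP client and is left out here.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): decides when a login form must be submitted.
class FormAuthRules {

    // One entry of the assumed rule file.
    static final class Rule {
        final Pattern urlPattern; // URLs that require authentication
        final String formUrl;     // URL the login form is submitted to
        final String formBody;    // URL-encoded POST body

        Rule(Pattern urlPattern, String formUrl, String formBody) {
            this.urlPattern = urlPattern;
            this.formUrl = formUrl;
            this.formBody = formBody;
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final Set<String> loggedIn = new HashSet<>(); // form URLs already submitted

    // Assumed line format: <regex> TAB <form URL> TAB <key=value&key=value>
    static FormAuthRules parse(List<String> lines) {
        FormAuthRules result = new FormAuthRules();
        for (String line : lines) {
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] parts = line.split("\t", 3);
            if (parts.length != 3) {
                throw new IllegalArgumentException("Bad rule line: " + line);
            }
            result.rules.add(new Rule(Pattern.compile(parts[0]), parts[1], encodeForm(parts[2])));
        }
        return result;
    }

    // Turns "user=alice&pass=p@ss word" into an application/x-www-form-urlencoded body.
    static String encodeForm(String raw) {
        List<String> encoded = new ArrayList<>();
        for (String pair : raw.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            encoded.add(URLEncoder.encode(key, StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(value, StandardCharsets.UTF_8));
        }
        return String.join("&", encoded);
    }

    // Returns the rule whose form must be submitted before fetching url, if any.
    Optional<Rule> loginNeededFor(String url) {
        for (Rule rule : rules) {
            // find() lets the regex match anywhere in the URL unless it is anchored
            if (rule.urlPattern.matcher(url).find() && !loggedIn.contains(rule.formUrl)) {
                return Optional.of(rule);
            }
        }
        return Optional.empty();
    }

    // Call after the form has been submitted successfully.
    void markLoggedIn(Rule rule) {
        loggedIn.add(rule.formUrl);
    }

    public static void main(String[] args) {
        FormAuthRules rules = parse(List.of(
                "^https?://example\\.com/private/\thttps://example.com/login\tuser=alice&pass=secret"));
        rules.loginNeededFor("https://example.com/private/report")
             .ifPresent(r -> System.out.println("POST " + r.formUrl + " with body " + r.formBody));
    }
}
```

In a plugin built along these lines, the fetch path would call `loginNeededFor(url)` before each request, submit the form if a rule comes back, and then call `markLoggedIn`. Cookie expiry, mentioned above, would additionally require clearing that state and logging in again, which this sketch does not handle.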