
Integrating Nutch 1.17 with Eclipse (Ubuntu 18.04)


I don't know whether the guide is outdated or I'm doing something wrong. I just started using Nutch; I've integrated it with Solr and crawled and indexed a few websites from the terminal. Now I'm trying to use it from a Java application, so I've been following the tutorial here: https://cwiki.apache.org/confluence/display/NUTCH/RunNutchInEclipse#RunNutchInEclipse-RunningNutchinEclipse

I installed Subclipse, IvyDE and m2e through Eclipse and downloaded Ant, so I should have all the prerequisites. The m2e link in the tutorial is broken, so I found it elsewhere; it also turned out that Eclipse already shipped with it.

I get a huge list of error messages when I run 'ant eclipse' in the terminal. Because of the post length limit, I put a link to a pastebin with the entire error message here.

I'm really not sure what I'm doing wrong. The directions aren't that complicated, so I can't tell which step I'm getting wrong.

Just in case it's necessary, here is the nutch-site.xml that we needed to modify.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
   <name>plugin.folders</name>
   <value>/home/user/trunk/build/plugins</value>
</property>

<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value>MarketDataCrawler</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.robots.agents</name>
  <value></value>
  <description>Any other agents, apart from 'http.agent.name', that the robots
  parser would look for in robots.txt. Multiple agents can be provided using 
  comma as a delimiter. eg. mybot,foo-spider,bar-crawler
  
  The ordering of agents does NOT matter and the robots parser would make 
  decision based on the agent which matches first to the robots rules.  
  Also, there is NO need to add a wildcard (ie. "*") to this string as the 
  robots parser would smartly take care of a no-match situation. 
    
  If no value is specified, by default HTTP agent (ie. 'http.agent.name') 
  would be used for user agent matching by the robots parser. 
  </description>
</property>

</configuration>

Many of the errors involve Ivy, so I wonder whether the Ivy version that Nutch uses is incompatible with the IvyDE plugin installed in Eclipse.


Solution

  • As the log output shows:

    [ivy:resolve]   SERVER ERROR: HTTPS Required url=http://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.6.1/slf4j-api-1.6.1.pom
    [ivy:resolve]   SERVER ERROR: HTTPS Required url=http://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.6.1/slf4j-api-1.6.1.jar
    [ivy:resolve]   SERVER ERROR: HTTPS Required url=http://repo1.maven.org/maven2/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.pom
    

    You should use up-to-date repository URLs in ivy/ivy.xml. One option is to change each URL in that file from http to https.

    I suspect you are using an older version; otherwise this issue should already be fixed.
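
If there are many URLs to change, the http-to-https edit can be scripted. The following is only a minimal sketch, not part of Nutch or Ivy: the class name RepoUrlFixer and the method toHttps are made up for illustration, and the default path ivy/ivy.xml is taken from the answer above and may differ in your checkout. It only rewrites URLs that start with http://repo1.maven.org/, the host shown in the log, and leaves everything else untouched.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Nutch or Ivy.
public class RepoUrlFixer {

    // Rewrites plain-HTTP Maven Central URLs to HTTPS; other text is left alone.
    static String toHttps(String content) {
        return content.replace("http://repo1.maven.org/", "https://repo1.maven.org/");
    }

    public static void main(String[] args) throws IOException {
        // Assumed default path (from the answer above); pass the real file as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "ivy/ivy.xml");
        String original = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        String updated = toHttps(original);
        if (!updated.equals(original)) {
            // Overwrites the file in place, so keep a backup or rely on version control.
            Files.write(file, updated.getBytes(StandardCharsets.UTF_8));
            System.out.println("Updated " + file);
        } else {
            System.out.println("No http://repo1.maven.org/ URLs found in " + file);
        }
    }
}
```

Run it from the Nutch source directory, then run 'ant eclipse' again. If the same "HTTPS Required" errors still appear, the http URL is probably defined in a different Ivy or build file than the one that was rewritten.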