Tags: web-crawler, hdfs, nutch, apache-storm, flume

Crawl data from a website into HDFS


I want to crawl data from a website, so I am using the OpenWeatherMap API (api.openweathermap.org). The Flume agent I have configured to stream the data in is as follows:

weather.channels= memory-channel
weather.channels.memory-channel.capacity=10000
weather.channels.memory-channel.type = memory
weather.sinks = hdfs-write
weather.sinks.hdfs-write.channel=memory-channel
weather.sinks.hdfs-write.type = logger
weather.sinks.hdfs-write.hdfs.path = hdfs://localhost:8020/user/hadoop/flume/
weather.sinks.hdfs-write.rollInterval = 1200
weather.sinks.hdfs-write.hdfs.writeFormat=Text
weather.sinks.hdfs-write.hdfs.fileType=DataStream
weather.sources= Weather
weather.sources.Weather.bind =     api.openweathermap.org/data/2.5/forecast/city?id=285787&APPID=8ce9bbbe446da25b19242763bdddb90a
weather.sources.Weather.username= abc
weather.sources.Weather.password= ********
weather.sources.Weather.channels=memory-channel
weather.sources.Weather.type = http
weather.sources.Weather.port = 11111

When I run the Flume agent with the following command:

flume-ng agent -f weather.conf -n weather

I get the following error:

15/03/23 05:17:34 INFO node.PollingPropertiesFileConfigurationProvider: Reloading configuration file:weather.conf
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Added sinks: hdfs-write Agent: weather
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [weather]
15/03/23 05:17:34 INFO node.AbstractConfigurationProvider: Creating channels
15/03/23 05:17:34 INFO channel.DefaultChannelFactory: Creating instance of channel memory-channel type memory
15/03/23 05:17:34 INFO node.AbstractConfigurationProvider: Created channel memory-channel
15/03/23 05:17:34 INFO source.DefaultSourceFactory: Creating instance of sourceWeather, type http
15/03/23 05:17:35 INFO sink.DefaultSinkFactory: Creating instance of sink: hdfs-write, type: logger
15/03/23 05:17:35 INFO node.AbstractConfigurationProvider: Channel memory-channel connected to [Weather, hdfs-write]
15/03/23 05:17:35 INFO node.Application: Starting new configuration:{     
sourceRunners:{Weather=EventDrivenSourceRunner: {    
source:org.apache.flume.source.http.HTTP
Source{name:Weather,state:IDLE} }} sinkRunners:{hdfs-write=SinkRunner: {   
policy:org.apache.flume.sink.DefaultSinkProcessor@529d1dd7 counterGroup:{    
name:null counters:{} } }} channels:{memory-   
channel=org.apache.flume.channel.MemoryChannel{name: memory-channel}} }
15/03/23 05:17:35 INFO node.Application: Starting Channel memory-channel
15/03/23 05:17:35 INFO instrumentation.MonitoredCounterGroup: Monitored  
countergroup for type: CHANNEL, name: memory-channel: Successfully  
registered new MBean.
15/03/23 05:17:35 INFO instrumentation.MonitoredCounterGroup: Component   
type: CHANNEL, name: memory-channel started
15/03/23 05:17:35 INFO node.Application: Starting Sink hdfs-write
15/03/23 05:17:35 INFO node.Application: Starting Source Weather
15/03/23 05:17:35 INFO mortbay.log: Logging to 
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via   
org.mortbay.log.Slf4jLog
15/3/23 05:17:35 INFO mortbay.log: jetty-6.1.26
15/03/23 05:17:36 WARN mortbay.log: failed 
SelectChannelConnector@api.openweathermap.org/data/2.5/forecast/city?
id=285787&APPID=8ce9bbbe446da25b19242763bdddb90a:11111:   
java.net.SocketException: Unresolved address
15/03/23 05:17:36 WARN mortbay.log: failed Server@642c189d: 
java.net.SocketException: Unresolved address
15/03/23 05:17:36 ERROR http.HTTPSource: Error while starting HTTPSource.    
  Exception follows.java.net.SocketException: Unresolved address
    at sun.nio.ch.Net.translateToSocketException(Net.java:157)
    at sun.nio.ch.Net.translateException(Net.java:183)
    at sun.nio.ch.Net.translateException(Net.java:189)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
    at org.mortbay.jetty.nio.SelectChannelConnector.open
    (SelectChannelConnector.java:216)
    at org.mortbay.jetty.nio.SelectChannelConnector.doStart(SelectChannelCon
    nector.java:315)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java_
    at org.mortbay.jetty.Server.doStart(Server.java:235)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java)
    at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:220)
    at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSour
    ceRunner.java:44)
    at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run
    (LifecycleSupervisor.java:251)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
    access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
    run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
    java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
    .java:615)
    at java.lang.Thread.run(Thread.java:745)
    Caused by: java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:127)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    ... 15 more
   15/03/23 05:17:36 ERROR lifecycle.LifecycleSupervisor: Unable to start 
   EventDrivenSourceRunner: {   
   source:org.apache.flume.source.http.HTTPSource{name:Weather,state:IDLE} } 
   - Exception follows.
   java.lang.RuntimeException: java.net.SocketException: Unresolved address
    at com.google.common.base.Throwables.propagate(Throwables.java:156)
    at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:224)
    at org.apache.flume.source.EventDrivenSourceRunner.start
    (EventDrivenSourceRunner.java:44)
    at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(Li
    fecycleSupervisor.java:251)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
    access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
    run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
    java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
    .java:615)
    at java.lang.Thread.run(Thread.java:745)
    Caused by: java.net.SocketException: Unresolved address
    at sun.nio.ch.Net.translateToSocketException(Net.java:157)
    at sun.nio.ch.Net.translateException(Net.java:183)
    at sun.nio.ch.Net.translateException(Net.java:189)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
    at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnec
    tor.java:216)
    at org.mortbay.jetty.nio.SelectChannelConnector.doStart(SelectChannelCon
    nector.java:315)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:
    at org.mortbay.jetty.Server.doStart(Server.java:235)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:
    at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:220)
    ... 9 more
    Caused by: java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:127)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    ... 15 more
    15/03/23 05:17:39 ERROR lifecycle.LifecycleSupervisor: Unable to start 
    EventDrivenSourceRunner: {   
    source:org.apache.flume.source.http.HTTPSource{name:Weather,state:IDLE} 
    } - Exception follows.
    java.lang.IllegalStateException: Running HTTP Server found in source:  
    Weather before I started one.Will not attempt to start.
    at com.google.common.base.Preconditions.checkState(Preconditions.java:14
    at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:189)
    at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSour
    ceRunner.java:44)
    at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(Li
    fecycleSupervisor.java:251)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
    access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
    run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
    java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
    .java:615)
    at java.lang.Thread.run(Thread.java:745)
    ^C15/03/23 05:17:41 INFO lifecycle.LifecycleSupervisor: Stopping  
    lifecycle supervisor 10
    15/03/23 05:17:41 INFO node.PollingPropertiesFileConfigurationProvider:  
    Configuration provider stopping

Can someone please help me with this issue?

Do I have to do something else before configuring the Flume agent? Or should I use Nutch or Storm to crawl the data instead?

Please advise on the best alternative.

Thank you in advance.


Solution

  • The bind parameter of the HTTP source specifies the IP address or hostname on which your agent listens for incoming data. It is not the endpoint to crawl; it is the endpoint (together with the port) to which the crawler must send its data. That is why the agent fails with "Unresolved address": it tries to open a local server socket on the API URL.

    That said, I would suggest using the Exec source to run a script that queries the OpenWeatherMap API and writes the data to its standard output; that output then becomes the input data for the agent.
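
A minimal sketch of what an Exec-source configuration could look like, reusing the agent, channel, and sink names from the question. The script path /home/hadoop/fetch_weather.py is hypothetical: it stands for any script that polls the API in a loop and prints one record per line to standard output. The sketch also assumes the goal is to write to HDFS, so it sets the sink type to hdfs (the question's config uses logger, which only logs events, so the hdfs.* settings have no effect there) and spells the roll interval as hdfs.rollInterval.

```
# Agent name "weather", as in the question
weather.sources = Weather
weather.channels = memory-channel
weather.sinks = hdfs-write

# Exec source: runs a command and turns each line of its stdout into an event.
# The script path below is a placeholder (assumption), not a real file.
weather.sources.Weather.type = exec
weather.sources.Weather.command = python /home/hadoop/fetch_weather.py
weather.sources.Weather.channels = memory-channel

# In-memory channel between source and sink
weather.channels.memory-channel.type = memory
weather.channels.memory-channel.capacity = 10000

# HDFS sink: type must be hdfs for the hdfs.* properties to apply
weather.sinks.hdfs-write.type = hdfs
weather.sinks.hdfs-write.channel = memory-channel
weather.sinks.hdfs-write.hdfs.path = hdfs://localhost:8020/user/hadoop/flume/
weather.sinks.hdfs-write.hdfs.rollInterval = 1200
weather.sinks.hdfs-write.hdfs.writeFormat = Text
weather.sinks.hdfs-write.hdfs.fileType = DataStream
```

The agent would be started the same way as in the question (flume-ng agent -f weather.conf -n weather). If the script makes a single request and exits rather than looping, the Exec source's restart and restartThrottle properties are one way to have Flume re-run it periodically.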