I want to crawl data from a website, so I am using the API from openweather.org. The agent that I have configured to stream in the data is as follows:
weather.channels= memory-channel
weather.channels.memory-channel.capacity=10000
weather.channels.memory-channel.type = memory
weather.sinks = hdfs-write
weather.sinks.hdfs-write.channel=memory-channel
weather.sinks.hdfs-write.type = logger
weather.sinks.hdfs-write.hdfs.path = hdfs://localhost:8020/user/hadoop/flume/
weather.sinks.hdfs-write.rollInterval = 1200
weather.sinks.hdfs-write.hdfs.writeFormat=Text
weather.sinks.hdfs-write.hdfs.fileType=DataStream
weather.sources= Weather
weather.sources.Weather.bind = api.openweathermap.org/data/2.5/forecast/city?id=285787&APPID=8ce9bbbe446da25b19242763bdddb90a
weather.sources.Weather.username= abc
weather.sources.Weather.password= ********
weather.sources.Weather.channels=memory-channel
weather.sources.Weather.type = http
weather.sources.Weather.port = 11111
When I run the Flume agent with the following command:
flume-ng agent -f weather.conf -n weather
I get the following error:
15/03/23 05:17:34 INFO node.PollingPropertiesFileConfigurationProvider: Reloading configuration file:weather.conf
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Added sinks: hdfs-write Agent: weather
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [weather]
15/03/23 05:17:34 INFO node.AbstractConfigurationProvider: Creating channels
15/03/23 05:17:34 INFO channel.DefaultChannelFactory: Creating instance of channel memory-channel type memory
15/03/23 05:17:34 INFO node.AbstractConfigurationProvider: Created channel memory-channel
15/03/23 05:17:34 INFO source.DefaultSourceFactory: Creating instance of sourceWeather, type http
15/03/23 05:17:35 INFO sink.DefaultSinkFactory: Creating instance of sink: hdfs-write, type: logger
15/03/23 05:17:35 INFO node.AbstractConfigurationProvider: Channel memory-channel connected to [Weather, hdfs-write]
15/03/23 05:17:35 INFO node.Application: Starting new configuration:{
sourceRunners:{Weather=EventDrivenSourceRunner: {
source:org.apache.flume.source.http.HTTP
Source{name:Weather,state:IDLE} }} sinkRunners:{hdfs-write=SinkRunner: {
policy:org.apache.flume.sink.DefaultSinkProcessor@529d1dd7 counterGroup:{
name:null counters:{} } }} channels:{memory-
channel=org.apache.flume.channel.MemoryChannel{name: memory-channel}} }
15/03/23 05:17:35 INFO node.Application: Starting Channel memory-channel
15/03/23 05:17:35 INFO instrumentation.MonitoredCounterGroup: Monitored
countergroup for type: CHANNEL, name: memory-channel: Successfully
registered new MBean.
15/03/23 05:17:35 INFO instrumentation.MonitoredCounterGroup: Component
type: CHANNEL, name: memory-channel started
15/03/23 05:17:35 INFO node.Application: Starting Sink hdfs-write
15/03/23 05:17:35 INFO node.Application: Starting Source Weather
15/03/23 05:17:35 INFO mortbay.log: Logging to
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
org.mortbay.log.Slf4jLog
15/3/23 05:17:35 INFO mortbay.log: jetty-6.1.26
15/03/23 05:17:36 WARN mortbay.log: failed
SelectChannelConnector@api.openweathermap.org/data/2.5/forecast/city?
id=285787&APPID=8ce9bbbe446da25b19242763bdddb90a:11111:
java.net.SocketException: Unresolved address
15/03/23 05:17:36 WARN mortbay.log: failed Server@642c189d:
java.net.SocketException: Unresolved address
15/03/23 05:17:36 ERROR http.HTTPSource: Error while starting HTTPSource.
Exception follows.java.net.SocketException: Unresolved address
at sun.nio.ch.Net.translateToSocketException(Net.java:157)
at sun.nio.ch.Net.translateException(Net.java:183)
at sun.nio.ch.Net.translateException(Net.java:189)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
at org.mortbay.jetty.nio.SelectChannelConnector.open
(SelectChannelConnector.java:216)
at org.mortbay.jetty.nio.SelectChannelConnector.doStart(SelectChannelCon
nector.java:315)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java_
at org.mortbay.jetty.Server.doStart(Server.java:235)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java)
at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:220)
at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSour
ceRunner.java:44)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run
(LifecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:127)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
... 15 more
15/03/23 05:17:36 ERROR lifecycle.LifecycleSupervisor: Unable to start
EventDrivenSourceRunner: {
source:org.apache.flume.source.http.HTTPSource{name:Weather,state:IDLE} }
- Exception follows.
java.lang.RuntimeException: java.net.SocketException: Unresolved address
at com.google.common.base.Throwables.propagate(Throwables.java:156)
at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:224)
at org.apache.flume.source.EventDrivenSourceRunner.start
(EventDrivenSourceRunner.java:44)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(Li
fecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Unresolved address
at sun.nio.ch.Net.translateToSocketException(Net.java:157)
at sun.nio.ch.Net.translateException(Net.java:183)
at sun.nio.ch.Net.translateException(Net.java:189)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnec
tor.java:216)
at org.mortbay.jetty.nio.SelectChannelConnector.doStart(SelectChannelCon
nector.java:315)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:
at org.mortbay.jetty.Server.doStart(Server.java:235)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:
at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:220)
... 9 more
Caused by: java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:127)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
... 15 more
15/03/23 05:17:39 ERROR lifecycle.LifecycleSupervisor: Unable to start
EventDrivenSourceRunner: {
source:org.apache.flume.source.http.HTTPSource{name:Weather,state:IDLE}
} - Exception follows.
java.lang.IllegalStateException: Running HTTP Server found in source:
Weather before I started one.Will not attempt to start.
at com.google.common.base.Preconditions.checkState(Preconditions.java:14
at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:189)
at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSour
ceRunner.java:44)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(Li
fecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:745)
^C15/03/23 05:17:41 INFO lifecycle.LifecycleSupervisor: Stopping
lifecycle supervisor 10
15/03/23 05:17:41 INFO node.PollingPropertiesFileConfigurationProvider:
Configuration provider stopping
Can anyone help me with this issue?
Do I have to do something else before configuring the Flume agent? Or should I use Nutch or Storm to crawl the data in instead?
Please tell me what the best alternative is.
Thank you in advance.
The bind parameter of HTTPSource specifies the IP address or hostname on which your agent listens for data. It is not the crawling endpoint, but the endpoint (together with the port) to which the crawler must send the data.
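To illustrate the direction of the data flow, here is a minimal Python sketch of a client pushing events to the agent. It assumes the source keeps HTTPSource's default JSON handler, which accepts a JSON array of events that each have "headers" and "body" fields, and that bind has been changed to a local address so the agent is reachable at http://localhost:11111 (the port is the one from the question). The function names are made up for this example.

```python
import json
import urllib.request


def build_flume_payload(bodies, headers=None):
    """Wrap raw strings in the JSON event list the default JSON handler expects."""
    headers = headers or {}
    # Each event carries its own copy of the (string-valued) headers.
    events = [{"headers": dict(headers), "body": body} for body in bodies]
    return json.dumps(events)


def post_to_agent(payload, agent_url="http://localhost:11111"):
    """POST the payload to the agent's HTTPSource and return the HTTP status.

    agent_url is an assumption: it only works once bind points at a local
    address or hostname of the machine running the agent.
    """
    request = urllib.request.Request(
        agent_url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json; charset=UTF-8"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

With the agent running, something like post_to_agent(build_flume_payload(['{"temp": 281.5}'])) would deliver one event to the channel; the body shown here is only placeholder data.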
That said, I would suggest using the Exec source to run a script that crawls openweather.org and writes the data to its standard output; that output is then used as input data for the agent.
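A rough sketch of such a script in Python could look like the following. The endpoint and city id are copied from the question; the http:// scheme, the OPENWEATHER_APPID environment variable, the 10-minute polling interval and the function names are my own assumptions, so adjust them to your setup. Each response is printed as a single line, because the Exec source turns every line of output into one event.

```python
import json
import os
import time
import urllib.parse
import urllib.request

# Endpoint taken from the question's config; the http:// scheme is assumed.
BASE_URL = "http://api.openweathermap.org/data/2.5/forecast/city"


def build_url(city_id, app_id):
    """Build the forecast URL for one city and one API key."""
    query = urllib.parse.urlencode({"id": city_id, "APPID": app_id})
    return f"{BASE_URL}?{query}"


def to_single_line(raw_json):
    """Re-serialize JSON compactly so one response becomes exactly one line."""
    return json.dumps(json.loads(raw_json), separators=(",", ":"))


def main():
    # The API key is read from a (hypothetical) environment variable
    # instead of being hard-coded in the script.
    url = build_url("285787", os.environ["OPENWEATHER_APPID"])
    while True:
        with urllib.request.urlopen(url, timeout=30) as response:
            # flush=True so the line reaches Flume immediately.
            print(to_single_line(response.read().decode("utf-8")), flush=True)
        time.sleep(600)  # arbitrary polling interval


if __name__ == "__main__":
    main()
```

In the agent configuration, the source would then use type = exec with its command property pointing at the script, for example command = python /path/to/fetch_weather.py (the path and file name are placeholders), instead of type = http with bind and port.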