Search code examples
hadoop-streamingflume-ng

Flume to stream in weather data


I am new to flume. but i want to stream in weather data form any website to my hdfs location. so i have created the sink, source and channel...as below

weather.channels= memory-channel
weather.channels.memory-channel.capacity=10000
weather.channels.memory-channel.type = memory
weather.sinks = hdfs-write
weather.sinks.hdfs-write.channel=memory-channel
 weather.sinks.hdfs-write.type = logger
 weather.sinks.hdfs-write.hdfs.path = hdfs://localhost:8020/user/hadoop/flume
weather.sinks.hdfs-write.rollInterval = 1200
weather.sinks.hdfs-write.hdfs.writeFormat=Text
weather.sinks.hdfs-write.hdfs.fileType=DataStream
weather.sources= Weather
weather.sources.Weather.bind =  api.openweathermap.org/data/2.5/forecast/city?id=524901&APPID=********************************
weather.sources.Weather.channels=memory-channel
weather.sources.Weather.type = netcat
weather.sources.Weather.port = 80

so i am using here API to work with this. What else i can use to stream in weather data, what online website can i use, or which API i should use to configure the source? While executing the flume-ng command to start the agent i am getting following

15/03/18 11:13:28 ERROR lifecycle.LifecycleSupervisor: Unable to start EventDrivenSourceRunner:{
 source:org.apache.flume.source.http.HTTPSource{name:Weather,state:IDLE} } - Exception follows.
java.lang.IllegalStateException: Running HTTP Server found in 
source:Weather before I started one.Will not attempt to start.
at com.google.common.base.Preconditions.checkState(Preconditions.java:145)at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:189)
at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSourceRunner.java:44)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745) 
C15/03/18 11:13:31 INFO lifecycle.LifecycleSupervisor: Stopping lifecycle supervisor 10
15/03/18 11:13:31 INFO node.PollingPropertiesFileConfigurationProvider: Configuration provider stopping
15/03/18 11:13:31 INFO instrumentation.MonitoredCounterGroup: Component type: CHANNEL, name: memory-channel stopped

Solution

  • The "lyfecycle" error you see is the cause of a previous error trying to start the http server.

    The original error is likely due to trying to bind to the priviledged 80 port with non root user. Change the port to >1024, e.g. 8080

    However, it won't work as you are trying to use. A http or netcat source listens to calls, doesn't go an fetch the url you are setting in bind.

    I see two options:

    1. Create a linux daemon to go a wget or curl to that url at regular intervals, save the result to a file and then configure flume with the spool source.
    2. Create your own Flume source that pools that url at regular intervals