Greetings, everyone!
I have installed Flume on my Cloudera 4.6 installation, and I am trying to collect tweets from Twitter.
So I created an HDFS sink and an HBase sink, and they are gathering tweets, but the data in HBase is not well structured.
Because the data is not structured, I can't query it with Impala.
I created a table named tweets with these column families: {NAME => 'tweet'}, {NAME => 'retweet'}, {NAME => 'entities'}, {NAME => 'user'}
and my Flume configuration is here: http://pastebin.com/4b5d3R8Q
I am following this tutorial, but I don't know what to do with its serializer:
https://github.com/AronMacDonald/Twitter_Hbase_Impala Do I have to build it into a JAR?
This is what I currently have in HBase: http://pastebin.com/aNGBsvB7 Everything ends up in the column tweets...
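To make "structured" concrete, here is a rough sketch (in Python, not the tutorial's actual serializer) of the kind of mapping a serializer would need to perform: take one tweet as JSON and split it into family:qualifier columns for the four column families above. The function name tweet_to_columns and the choice of qualifiers (id, text, screen_name, hashtags, and so on) are assumptions made for illustration only; the field names follow the usual Twitter JSON layout.

```python
import json


def tweet_to_columns(raw_json):
    """Split one tweet (JSON string) into a row key and 'family:qualifier' columns.

    Hypothetical helper: the families match the table in the post,
    but the qualifiers chosen here are assumptions.
    """
    tweet = json.loads(raw_json)
    columns = {}

    # Family 'tweet': top-level fields of the tweet itself.
    columns["tweet:id"] = str(tweet["id"])
    columns["tweet:text"] = tweet.get("text", "")
    columns["tweet:created_at"] = tweet.get("created_at", "")

    # Family 'user': fields of the nested user object.
    user = tweet.get("user") or {}
    columns["user:screen_name"] = user.get("screen_name", "")
    columns["user:followers_count"] = str(user.get("followers_count", 0))

    # Family 'retweet': only filled in when the tweet is a retweet.
    retweet = tweet.get("retweeted_status")
    if retweet:
        columns["retweet:id"] = str(retweet["id"])
        columns["retweet:text"] = retweet.get("text", "")

    # Family 'entities': hashtags flattened to a comma-separated string.
    entities = tweet.get("entities") or {}
    hashtags = [h["text"] for h in entities.get("hashtags", [])]
    columns["entities:hashtags"] = ",".join(hashtags)

    # Use the tweet id as the row key.
    return str(tweet["id"]), columns
```

With one column per field like this, instead of the whole JSON document in a single cell, each qualifier can be mapped to its own table column for querying.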
I recompiled and used the flume-sources-1.0-SNAPSHOT.jar from the git repository https://github.com/cloudera/cdh-twitter-example, and after that there was no problem when using 'TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource'.
Install Maven, then download the cdh-twitter-example repository.
Unzip it, then run the following inside the extracted directory (as mentioned):
$ cd flume-sources
$ mvn package
$ cd ..
This problem appeared when twitter4j was updated from 2.2.6 to 3.x: the method setIncludeEntities was removed, and the JAR has not been brought up to date.
PS: Do not download the prebuilt version; it is still the old one.