Data in HBase is not structured as it should be - Twitter Flume

Greetings, users!

I have installed Flume on my Cloudera 4.6 cluster, and I am trying to collect tweets from Twitter.

So I created an HDFS sink and an HBase sink, and they are gathering tweets, but the data in HBase is not well structured.

Because the data is not structured, I can't query it with Impala.

I created a table tweets with four column families: {NAME => 'tweet'}, {NAME => 'retweet'}, {NAME => 'entities'}, {NAME => 'user'}
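
For reference, the table described above could be created from the command line roughly like this. This is only a sketch: the function name tweets_table_ddl is made up for illustration, and it assumes the hbase CLI is on the PATH and that the HBase shell accepts statements on standard input.

```shell
# Print the HBase shell statement for the table described above:
# one table, 'tweets', with four column families.
# (tweets_table_ddl is a hypothetical helper name.)
tweets_table_ddl() {
    printf '%s\n' "create 'tweets', {NAME => 'tweet'}, {NAME => 'retweet'}, {NAME => 'entities'}, {NAME => 'user'}"
}

# Usage (not run automatically; assumes the hbase CLI is installed):
#   tweets_table_ddl | hbase shell
```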

and my Flume configuration is:

I am following this tutorial, but I don't know what to do with its serializer. Do I have to build it into a JAR?

This is what I currently have in HBase: everything is in the column tweets...


  • I recompiled flume-sources-1.0-SNAPSHOT.jar from the git repository and used it, and after that there was no problem when using 'TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource'

    Install Maven, then download the repository of cdh-twitter-example.

    Unzip it, then run the following inside the extracted directory (as mentioned):

    $ cd flume-sources

    $ mvn package

    $ cd ..

    This problem appeared when twitter4j was updated from 2.2.6 to 3.x: the method setIncludeEntities was removed, and the prebuilt JAR has not been updated to match.

    PS: Do not download the prebuilt version; it is still the old one.
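
The build steps above can be wrapped in a small script that also checks the JAR was actually produced. This is a minimal sketch, not part of the original answer: the function names are made up for illustration, and it assumes Maven writes the JAR to its default target/ directory under flume-sources, with the file name mentioned above.

```shell
# Print the path of the rebuilt JAR, or fail if it does not exist.
# Assumption: Maven's default output directory (target/) is used.
built_jar() {
    repo_dir=$1
    jar="$repo_dir/flume-sources/target/flume-sources-1.0-SNAPSHOT.jar"
    if [ -f "$jar" ]; then
        printf '%s\n' "$jar"
    else
        echo "JAR not found: $jar" >&2
        return 1
    fi
}

# Run the Maven build from the steps above, then report the JAR path.
rebuild_flume_sources() {
    repo_dir=$1
    (cd "$repo_dir/flume-sources" && mvn package) || return 1
    built_jar "$repo_dir"
}

# Usage (not run automatically; the path is a placeholder):
#   rebuild_flume_sources /path/to/cdh-twitter-example
```

If the script prints a path, that is the JAR to put on Flume's classpath in place of the prebuilt one.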