Tags: hadoop, twitter, flume

How to read Twitter data files generated by Flume


I have generated a few Twitter data log files on HDFS using Flume. What is the actual format of these log files? I was expecting the data to be in JSON format, but it looks like this. Could someone help me with how to read this data, or tell me what is wrong with the way I have done this?
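(For reference, the raw files can be dumped straight from HDFS to see what they contain. A minimal sketch, assuming the HDFS sink's default FlumeData file prefix and the directory used in the answer below:)

    $ hadoop fs -cat /home/hadoop/work/flumedata/FlumeData.* | head -n 1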


Solution

  • Download the file hive-serdes-1.0-SNAPSHOT.jar from this link:
    http://files.cloudera.com/samples/hive-serdes-1.0-SNAPSHOT.jar

    Then put this file in your $HIVE_HOME/lib.
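    For example, downloading the jar and copying it into place might look like this (a sketch assuming wget is available and $HIVE_HOME is set):

    $ wget http://files.cloudera.com/samples/hive-serdes-1.0-SNAPSHOT.jar
    $ cp hive-serdes-1.0-SNAPSHOT.jar $HIVE_HOME/lib/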
    Add the jar in the Hive shell:

    hive> ADD JAR file:///home/hadoop/work/hive-0.10.0/lib/hive-serdes-1.0-SNAPSHOT.jar;
    

    Create a table in Hive:

    hive> CREATE TABLE tweets (
            id BIGINT,
            created_at STRING,
            source STRING,
            favorited BOOLEAN,
            retweeted_status STRUCT<
              text:STRING,
              user:STRUCT<screen_name:STRING, name:STRING>,
              retweet_count:INT>,
            entities STRUCT<
              urls:ARRAY<STRUCT<expanded_url:STRING>>,
              user_mentions:ARRAY<STRUCT<screen_name:STRING, name:STRING>>,
              hashtags:ARRAY<STRUCT<text:STRING>>>,
            text STRING,
            user STRUCT<
              screen_name:STRING,
              name:STRING,
              friends_count:INT,
              followers_count:INT,
              statuses_count:INT,
              verified:BOOLEAN,
              utc_offset:INT,
              time_zone:STRING>,
            in_reply_to_screen_name STRING
          )
          ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe';
    
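    As a side note (not part of the steps above): instead of moving the files with LOAD DATA in the next step, an EXTERNAL table can point directly at the Flume output directory, so the files stay where Flume wrote them. A minimal sketch with a hypothetical table name and a trimmed column list (the SerDe ignores JSON attributes that are not declared in the table):

    hive> CREATE EXTERNAL TABLE tweets_ext (
            id BIGINT,
            text STRING,
            user STRUCT<screen_name:STRING, name:STRING>
          )
          ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
          LOCATION '/home/hadoop/work/flumedata';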

    Load the data into the table from HDFS:

    hive> LOAD DATA INPATH '/home/hadoop/work/flumedata' INTO TABLE tweets;
    
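    A quick sanity check that the rows can be read through the SerDe (a minimal example):

    hive> SELECT COUNT(*) FROM tweets;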

    Now analyze your Twitter data from this table:

    hive> SELECT id, text, user FROM tweets;
    
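    Nested fields of the STRUCT columns can be queried directly as well; for example (a sketch using only columns defined in the table above):

    hive> SELECT user.screen_name, user.followers_count, text
          FROM tweets
          LIMIT 10;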

    You are done. Note that what you query here is the deserialized data; if you need it in serialized form again, you can write it back out from the Hive table.
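    For example, writing query results back out to HDFS might look like this (a sketch; the output directory name is hypothetical, and Hive writes plain delimited text files by default):

    hive> INSERT OVERWRITE DIRECTORY '/home/hadoop/work/tweets_out'
          SELECT id, created_at, text, user.screen_name
          FROM tweets;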