Flume won't load Twitter data to HDFS


I am trying to load Twitter data into Hadoop with Flume. The log says it has processed nearly 25,000 docs, but when I check HDFS the folder is always empty. This is the command I am using:

flume-ng agent -n TwitterAgent -f flume.conf

Here is a short excerpt of the log output:

21/07/18 19:40:03 INFO twitter.TwitterSource: Processed 25,000 docs
21/07/18 19:40:03 INFO twitter.TwitterSource: Total docs indexed: 25,000, total skipped docs: 0
21/07/18 19:40:03 INFO twitter.TwitterSource: 45 docs/second
21/07/18 19:40:03 INFO twitter.TwitterSource: Run took 545 seconds and processed:
21/07/18 19:40:03 INFO twitter.TwitterSource: 0.012 MB/sec sent to index
21/07/18 19:40:03 INFO twitter.TwitterSource: 6.708 MB text sent to index
21/07/18 19:40:03 INFO twitter.TwitterSource: There were 0 exceptions ignored:
21/07/18 19:40:05 INFO twitter.TwitterSource: Processed 25,100 docs
21/07/18 19:40:06 INFO hdfs.BucketWriter: Creating /home/hadoopusr/flumetweets/FlumeData.1626629459197.tmp
21/07/18 19:40:06 WARN hdfs.HDFSEventSink: HDFS IO error
org.apache.hadoop.fs.ParentNotDirectoryException: /home (is not a directory)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkIsDirectory(FSPermissionChecker.java:538)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:278)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:206)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:189)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:507)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1612)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1630)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:551)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:291)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2282)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2225)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:728)

This is my flume.conf file:

#Naming the components on the current agent.

TwitterAgent.sources = Twitter

TwitterAgent.channels = MemChannel

TwitterAgent.sinks = HDFS

#Describing/Configuring the source

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource

TwitterAgent.sources.Twitter.channels=MemChannel

TwitterAgent.sources.Twitter.consumerKey = ************

TwitterAgent.sources.Twitter.consumerSecret =************

TwitterAgent.sources.Twitter.accessToken = ************

TwitterAgent.sources.Twitter.accessTokenSecret = ************

TwitterAgent.sources.Twitter.keywords =covid,covid-19,coronavirus

#Describing/Configuring the sink

TwitterAgent.sinks.HDFS.type = hdfs

TwitterAgent.sinks.HDFS.hdfs.path = /home/hadoopusr/flumetweets

TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream

TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text

TwitterAgent.sinks.HDFS.hdfs.batchSize = 10

TwitterAgent.sinks.HDFS.hdfs.rollSize = 0

TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600

TwitterAgent.sinks.HDFS.hdfs.rollCount = 100

#Describing/Configuring the channel

TwitterAgent.channels.MemChannel.type = memory

TwitterAgent.channels.MemChannel.capacity = 1000

TwitterAgent.channels.MemChannel.transactionCapacity = 1000

#Binding the source and sink to the channel

TwitterAgent.sources.Twitter.channels = MemChannel

TwitterAgent.sinks.HDFS.channel = MemChannel


Solution

  • As noted in the comments, you have fixed your first error; now you get a permission error when writing under the HDFS root path as the user amel.

    In your config you have:

    TwitterAgent.sinks.HDFS.hdfs.path = /home/hadoopusr/flumetweets
    

    I'm guessing that either /home or /home/hadoopusr does not exist in HDFS, so the sink is trying to create that directory.

    However, your user is not hadoopusr (your HDFS superuser), so it does not have permission to do so.

    Your options are therefore either:

    1. Run flume-ng agent as hadoopusr (sudo su hadoopusr -c "flume-ng agent ...")
    2. Change the HDFS path in the config to use /home/amel, after you create that path as hadoopusr and give yourself permissions on it:

       sudo su hadoopusr
       hadoop fs -mkdir -p /home/amel
       hadoop fs -chown -R amel /home/amel
       hadoop fs -chmod -R 760 /home/amel
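
    For the second option, the only Flume-side change is the sink path. A minimal sketch of that fragment, assuming /home/amel already exists in HDFS and is owned by amel, and keeping the flumetweets subdirectory name from the original config (the exact target path is an assumption; any HDFS directory that amel can write to would do):

```properties
# Sink path moved under a directory owned by the user running flume-ng (assumed: amel).
# The flumetweets subdirectory is an assumption carried over from the original config;
# the HDFS sink should be able to create it on first write, since amel owns the parent.
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = /home/amel/flumetweets
```

    After restarting the agent, hadoop fs -ls /home/amel/flumetweets should start listing FlumeData files once events reach the sink. Files that are still being written keep the .tmp suffix seen in the log until they roll.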