Given a huge dataset of tweets, I need to:

1. count the hashtags;
2. count the emoticons/emojis;
3. count the words (lemmas).

So, the first thing that came to mind was something like this:
val tweets = sparkContext.textFile(DATASET).cache()

// Each extractor is assumed to return all the tokens found in one tweet
// (String => Seq[String]), so flatMap is needed to count individual tokens
// rather than whole lists.
val hashtags = tweets
  .flatMap(extractHashTags)
  .map((_, 1))
  .reduceByKey(_ + _)

val emoticonsEmojis = tweets
  .flatMap(extractEmoticonsEmojis)
  .map((_, 1))
  .reduceByKey(_ + _)

val lemmas = tweets
  .flatMap(extractLemmas)
  .map((_, 1))
  .reduceByKey(_ + _)
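For reference, these are the signatures the snippets in this post assume for the extractors; the bodies below are toy stand-ins of my own, since the real extraction logic (regexes, emoji tables, lemmatizer) is not shown in the question:

def extractHashTags(tweet: String): Seq[String] =
  tweet.split("\\s+").toSeq.filter(_.startsWith("#"))

def extractEmoticonsEmojis(tweet: String): Seq[String] =
  tweet.split("\\s+").toSeq.filter(t => t == ":)" || t == ":(")   // toy stand-in

def extractLemmas(tweet: String): Seq[String] =
  tweet.toLowerCase.split("\\W+").toSeq.filter(_.nonEmpty)        // no real lemmatization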
But this way each tweet is processed three times, right? If so, is there an efficient way to count all these elements separately while processing each tweet only once?
I was thinking of something like this:
sparkContext.textFile(DATASET)
  .map(extractor) // RDD[(List[String], List[String], List[String])]
But this way it becomes a nightmare, not least because once I count the words (the third point of the requirements), I need to join the result with another RDD; that join is very simple in the first version but not in the second.
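One option that might keep the single pass without the triple-of-lists shape is to tag every token with its category inside one flatMap and key the counts by (category, token); each per-category RDD can then be recovered with a cheap filter and keeps the same (token, count) shape as the first version, so the later join stays simple. A sketch reusing the extractor names from above (the "hashtag"/"emoticon"/"lemma" tags are labels of my own choosing):

val tweets = sparkContext.textFile(DATASET)

// One pass over the tweets: every token is emitted together with its category.
val tokenCounts = tweets
  .flatMap { tweet =>
    extractHashTags(tweet).map(("hashtag", _)) ++
      extractEmoticonsEmojis(tweet).map(("emoticon", _)) ++
      extractLemmas(tweet).map(("lemma", _))
  }
  .map((_, 1))
  .reduceByKey(_ + _)   // RDD[((String, String), Int)]
  .cache()

// Per-category views, each an RDD[(String, Int)] like in the first version.
def byCategory(cat: String) =
  tokenCounts.filter(_._1._1 == cat).map { case ((_, token), n) => (token, n) }

val hashtags        = byCategory("hashtag")
val emoticonsEmojis = byCategory("emoticon")
val lemmas          = byCategory("lemma")

The Dataset API answer below expresses the same tag-and-count idea with a DataFrame groupBy.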
Using the Dataset API:
import spark.implicits._                  // for .toDF, $"..." and .as[...] (assumes a SparkSession named `spark`)
import org.apache.spark.sql.functions.lit

val tweets = sparkContext.textFile(DATASET)

val tokens = tweets
  .flatMap(extractor)       // RDD[(String, String)] of (type, token) pairs
  .toDF("type", "token")    // type is one of ("hashtag", "emoticon", "lemma")
  .groupBy("type", "token")
  .count()                  // Dataset[Row] with columns ("type", "token", "count")

val lemmas = tokens
  .where($"type" === lit("lemma"))
  .select("token", "count")
  .as[(String, Long)]
  .rdd                      // same shape as the original `lemmas` RDD, for the future join
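The other two categories come out of the same `tokens` DataFrame with the same pattern, and the join the question mentions works as in the first version. A sketch; `wordScores` is a hypothetical stand-in for the "other RDD", not a name from the original post:

val hashtags = tokens
  .where($"type" === lit("hashtag"))
  .select("token", "count")
  .as[(String, Long)]
  .rdd

val emoticonsEmojis = tokens
  .where($"type" === lit("emoticon"))
  .select("token", "count")
  .as[(String, Long)]
  .rdd

// Joining against a toy wordScores RDD, just to show the shape of the result.
val wordScores = sparkContext.parallelize(Seq(("run", 1.0), ("cat", 2.5)))
val joined = lemmas.join(wordScores)   // RDD[(String, (Long, Double))]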