
Using a Scala filter inside a Spark map operation


I have a small dataset of tweets and want to remove user names from them, that is, every word starting with an @. In the last map() operation of the following code I get a java.lang.StringIndexOutOfBoundsException: String index out of range: 0. Inside that map operation I split each sentence into words and then use the filter operation from the Scala collections rather than Spark's, so I'm wondering whether the problem is related to that. If I comment out .filter(_(0) != '@'), everything works fine.

import org.apache.spark.{SparkConf, SparkContext}

val logFile = "tweets10.csv"
val config = new SparkConf().setMaster("local").setAppName("Spark App")
val sc = new SparkContext(config)

val logData = sc.textFile(logFile, 2).cache()


val tweets = logData
  .mapPartitionsWithIndex((index, lines) => if (index == 0) lines.drop(1) else lines)
  .map(_.split(",")(1).replace("\"", ""))
  .map(line => line.split(" ")
    .filter(_(0) != '@')
    .reduce((x, y) => x + " " + y))

Dataset:

"","text","favorited","favoriteCount","replyToSN","created","truncated","replyToSID","id","replyToUID","statusSource","screenName","retweetCount","isRetweet","retweeted","longitude","latitude"
"1","RT @WDD: Check today how you can join World Diabetes Day: htts/EIQ1Za0R0t. Eyes on #diabetes htts/rN3VJYC7T0",FALSE,0,NA,2016-09-07 20:12:03,FALSE,NA,"773614831018643457",NA,"<a href=""htt://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","un_ncd",27,TRUE,FALSE,NA,NA
"2","RT @JDRFUK: With his #Rio2016 medal in hand Team GB gymnast @louissmith1989 puts type 1 #diabetes in the picture! htts:/OKkPtQLuvi",FALSE,0,NA,2016-09-07 20:10:44,FALSE,NA,"773614501853880320",NA,"<a href=""htt://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","sg0809",2,TRUE,FALSE,NA,NA
"3","RT @CleanairCA: Speaking of the things in the air you breath...
    #asthma #diabetes #copd #lungcancer #smog #losangeles #HeartDisease htts:/",FALSE,0,NA,2016-09-07 20:09:03,FALSE,NA,"773614075284746240",NA,"<a href=""htt://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","tt85207533",9,TRUE,FALSE,NA,NA
"4","So - tonight's #tweetchat is about FOOD - ""#Diabetes and Diets"" (aka - stuff we eat)  #gbdoc",FALSE,1,NA,2016-09-07 20:08:28,FALSE,NA,"773613929515941888",NA,"<a href=""htt://www.tchat.io"" rel=""nofollow"">tchat.io</a>","theGBDOC",0,FALSE,FALSE,NA,NA
"5","Learn the most important things you can do to prevent #diabetes here: htts:/eHu5pesgKw.",FALSE,0,NA,2016-09-07 20:07:00,FALSE,NA,"773613560320495617",NA,"<a href=""htt://sproutsocial.com"" rel=""nofollow"">Sprout Social</a>","MountainPointMC",0,FALSE,FALSE,NA,NA
"6","Cancer risk #NaturalCures #AlternativeMedicine #Cures #Healing #HerbalRemedies #Diabetes htts:/Ul0vwRpqbw htts:/YU77iuudeR",FALSE,0,NA,2016-09-07 20:06:09,FALSE,NA,"773613345480007680",NA,"<a href=""htt://www.socialcloudsuite.com"" rel=""nofollow"">SocialCloudSuite</a>","CureExchange",0,FALSE,FALSE,NA,NA
"7","Cancer risk #NaturalCures #AlternativeMedicine #Cures #Healing #HerbalRemedies #Diabetes htts:/wEjrW9f9b1 htts:/iHlSpbwzZl",FALSE,0,NA,2016-09-07 20:06:08,FALSE,NA,"773613341826805760",NA,"<a href=""htt://www.socialcloudsuite.com"" rel=""nofollow"">SocialCloudSuite</a>","GuineaHenWeed",0,FALSE,FALSE,NA,NA
"8","Linda Yip hopes to find better ways to diagnose, treat &amp; prevent #diabetes: htts:/tmjgnEFUkZ  #WIMmonth htts:/xL25me7ckK",FALSE,0,NA,2016-09-07 20:05:14,FALSE,NA,"773613114533171200",NA,"<a href=""htts://about.twitter.com/products/tweetdeck"" rel=""nofollow"">TweetDeck</a>","StanfordDeptMed",0,FALSE,FALSE,NA,NA
"9","A Farm Stand In South Dallas Is Fighting #Diabetes With Common Sense And Vegetables htts:/l9pWvnAA5W",FALSE,0,NA,2016-09-07 20:05:08,FALSE,NA,"773613090378166273",NA,"<a href=""htt://www.hootsuite.com"" rel=""nofollow"">Hootsuite</a>","DiabetesDallas",0,FALSE,FALSE,NA,NA
"10","Hi #gbdoc Paul here, #t1d #teampump and #cgm - 4.5 years with #diabetes now!",FALSE,0,NA,2016-09-07 20:04:25,FALSE,NA,"773612908693614592",NA,"<a href=""htt://itunes.apple.com/us/app/twitter/id409789998?mt=12"" rel=""nofollow"">Twitter for Mac</a>","t1hba1c",0,FALSE,FALSE,NA,NA

Solution

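  • My hunch is that after the split some of the words are empty strings. String.split(" ") produces an empty string wherever two spaces are adjacent (row 4 of the sample appears to have a double space before #gbdoc), and calling _(0) on an empty string throws the StringIndexOutOfBoundsException you see. Using filter from the Scala collections inside a Spark map is fine in itself. Add a check for emptiness: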

    _.split(" ")
     .filter(word => word != "" && word(0) != '@')
     .reduce((x,y) => x + " " + y)
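
To see the effect without Spark, here is a minimal plain-Scala sketch. The object TweetCleaning and the helper removeMentions are hypothetical names, not part of the original code. It swaps the word(0) check for startsWith, which is safe on empty strings, and reduce for mkString, since reduce throws on an empty collection (for instance when a tweet consists only of mentions). Note that dropping empty words also collapses repeated spaces into one.

```scala
object TweetCleaning {
  // Hypothetical helper (an assumption, not from the original post):
  // removes every word that starts with '@' from a single tweet.
  def removeMentions(tweet: String): String =
    tweet
      .split(" ")
      // nonEmpty drops the empty tokens produced by consecutive spaces;
      // startsWith never indexes into the string, so it cannot throw here.
      .filter(word => word.nonEmpty && !word.startsWith("@"))
      // mkString yields "" for an empty array, where reduce would throw.
      .mkString(" ")

  def main(args: Array[String]): Unit = {
    // Two adjacent spaces leave an empty string in the middle of the array.
    val words = "stuff we eat)  #gbdoc".split(" ")
    println(words.count(_.isEmpty))

    println(removeMentions("RT @WDD: Check  today"))
    println(removeMentions("@a @b").isEmpty)
  }
}
```

In the Spark job, the last step could then presumably be written as .map(TweetCleaning.removeMentions) in place of the inline split/filter/reduce.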