I have a text file whose lines look like this:
28.225.37.170 - - [14/May/2019:00:00:05 +0000] "GET xxxxxx "-" "yyyyyy"
80.156.48.65 - - [14/May/2019:00:00:10 +0000] "GET xxxxxxx "-" "yyyyyy"
....
I want to get an RDD that looks like this:
(28.225.37.170 , 14/May/2019:00:00:05 +0000 , xxxxxx , yyyyyy )
(80.156.48.65 , 14/May/2019:00:00:10 +0000 , xxxxxx , yyyyyy )
Which regex can I use to split each line into these fields?
val reg: scala.util.matching.Regex = """?????""".r // ????? any suggestions ?
rdd.map( lines => lines.split(reg) )
Why not pattern match on a regex instead of splitting? If each line has a fixed number of elements to extract, with different separators between them, that could work better:
val l1 = """28.225.37.170 - - [14/May/2019:00:00:05 +0000] "GET xxxxxx "-" "yyyyyy""""
val l2 = """80.156.48.65 - - [14/May/2019:00:00:10 +0000] "GET xxxxxxx "-" "yyyyyy""""
val reg = """(.*) - - \[(.*)\] "GET (.*) "-" "(.*)"""".r
def splitMyLine(line: String) = line match {
  case reg(a, b, c, d) => "line: \n" + Seq(a, b, c, d).map(s => s" __ data: $s").mkString("\n")
}
Seq(l1, l2).foreach(l => println(splitMyLine(l)))
gives:
line:
__ data: 28.225.37.170
__ data: 14/May/2019:00:00:05 +0000
__ data: xxxxxx
__ data: yyyyyy
line:
__ data: 80.156.48.65
__ data: 14/May/2019:00:00:10 +0000
__ data: xxxxxxx
__ data: yyyyyy
You can then simply define your splitting function like this:
def splitMyLine(line:String): Seq[String] = line match {
case reg(a,b,c,d) => Seq(a,b,c,d)
}
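Note that this match throws a scala.MatchError on any line that does not fit the pattern. Here is a minimal sketch of one way around that, which also returns a tuple as in the question. It assumes the same regex as above; the names LogLineParser and parseLine are made up for illustration, and a plain Seq stands in for the RDD so the snippet runs without Spark.

```scala
object LogLineParser {
  // Same pattern as above: four capture groups (IP, timestamp, request, agent).
  val reg = """(.*) - - \[(.*)\] "GET (.*) "-" "(.*)"""".r

  // Hypothetical helper: Some(tuple) when the whole line matches, None otherwise,
  // so a malformed line does not raise a MatchError.
  def parseLine(line: String): Option[(String, String, String, String)] = line match {
    case reg(ip, time, request, agent) => Some((ip, time, request, agent))
    case _                             => None
  }

  def main(args: Array[String]): Unit = {
    // A plain Seq stands in for the RDD here.
    val lines = Seq(
      """28.225.37.170 - - [14/May/2019:00:00:05 +0000] "GET xxxxxx "-" "yyyyyy"""",
      "not a log line"
    )
    // flatMap keeps the parsed tuples and silently drops the non-matching line.
    lines.flatMap(l => parseLine(l)).foreach(println)
    // prints: (28.225.37.170,14/May/2019:00:00:05 +0000,xxxxxx,yyyyyy)
  }
}
```

On an actual RDD[String], the same idea should carry over as rdd.flatMap(l => parseLine(l)), giving an RDD of 4-tuples. Whether to drop bad lines or count and log them depends on your data.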
Hope it helps.