I have a log like this:
[Pipeline] timestamps
[Pipeline] {
[Pipeline] echo
20:33:05 0
[Pipeline] echo
I am trying to extract only the time information here (20:33:05).
I have tried to do the following:
val lines = sc.textFile("/logs/log7.txt")
val individualLines=lines.flatMap(_.split("\n")) // Splitting file content into individual lines
val dates=individualLines.filter(value=>value.startsWith("[0-9]"))
I am getting the output as
MapPartitionsRDD[3] at filter at DateExtract.scala:30
How should the regex be defined here?
Any help would be much appreciated.
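If each log entry is on its own line, you do not have to split anything: sc.textFile already gives you an RDD with one String per line.
Two things go wrong in your attempt. First, startsWith compares against a literal prefix, so "[0-9]" is not treated as a regex. Second, MapPartitionsRDD[3] at filter ... is only the description of the RDD: transformations like filter are lazy, and nothing is evaluated until an action such as foreach or collect runs.
To keep the lines that start with a digit, you can check the first character with Character.isDigit, as below: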
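val lines = sc.textFile("/logs/log7.txt")
// Keep non-empty lines that start with a digit (charAt(0) would throw on an empty line)
val dates=lines.filter(value=>value.nonEmpty && Character.isDigit(value.charAt(0)))
.map(_.split(" ")(0)) // take the first space-separated token
// collect() brings the results to the driver before printing; fine for small results
dates.collect().foreach(println)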
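To see why the original filter matched nothing, here is a minimal sketch in plain Scala (no Spark needed) comparing the literal prefix check with a regex check. The value names (sampleLine, literalPrefix, regexPrefix, firstIsDigit) are made up for illustration.

```scala
// One line taken from the log in the question
val sampleLine = "20:33:05 0"

// startsWith compares literal characters: it looks for the text "[0-9]"
val literalPrefix = sampleLine.startsWith("[0-9]")

// String.matches takes a regex and must match the whole string,
// so ".*" is needed after the leading digit
val regexPrefix = sampleLine.matches("""\d.*""")

// The character-based check, guarded against empty strings
val firstIsDigit = sampleLine.nonEmpty && Character.isDigit(sampleLine.charAt(0))

println(s"literal=$literalPrefix regex=$regexPrefix digit=$firstIsDigit")
```

Only the last two checks are true for this line, which is why the startsWith("[0-9]") filter returned an empty result.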
If you want to strictly match the timestamp with a regex and drop anything that does not match, you can use:
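// Guard against empty lines before calling charAt(0)
val dates=lines.filter(value=>value.nonEmpty && Character.isDigit(value.charAt(0)))
.map(_.split(" ")(0))
.filter(_.matches("""\d{2}:\d{2}:\d{2}""")) // keep only HH:mm:ss tokens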
Output:
20:33:05
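As an alternative sketch, the digit check, split and match can be folded into a single anchored regex. The example below uses a plain Scala Seq so it runs without a cluster; the names timestamp, sampleLines and extractTime are hypothetical. I would expect the same lambda to work with flatMap on the RDD returned by sc.textFile, but that part is an assumption and not tested here.

```scala
// Anchored pattern: HH:mm:ss at the very start of the line
val timestamp = """^\d{2}:\d{2}:\d{2}""".r

// Sample input copied from the question, plus an empty line
val sampleLines = Seq(
  "[Pipeline] timestamps",
  "[Pipeline] {",
  "[Pipeline] echo",
  "20:33:05 0",
  "",
  "[Pipeline] echo"
)

// Returns Some(time) when the line starts with a timestamp, None otherwise
def extractTime(line: String): Option[String] =
  timestamp.findFirstIn(line)

// flatMap drops the None results and unwraps the Some results
val times = sampleLines.flatMap(line => extractTime(line))

times.foreach(println)
```

Because findFirstIn returns an Option, empty lines and non-matching lines are dropped without any explicit charAt guard.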
Hope this helps!