regex, hadoop, apache-spark, rdd

Extracting timestamp from string with regex in Spark RDD


I have a log like this:

[Pipeline] timestamps
[Pipeline] {
[Pipeline] echo
20:33:05 0
[Pipeline] echo

I am trying to extract only the time information (20:33:05).

I have tried to do the following:

val lines = sc.textFile("/logs/log7.txt")  
val individualLines=lines.flatMap(_.split("\n")) // Splitting file content into individual lines
val dates=individualLines.filter(value=>value.startsWith("[0-9]"))

I am getting the output as

MapPartitionsRDD[3] at filter at DateExtract.scala:30

How should the regex be defined here?

Any help would be much appreciated.


Solution

  • If your log file has one entry per line, you do not need to split it: sc.textFile already returns each line as a separate String.

    Then check whether each line starts with a digit using Character.isDigit, as below:

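      val lines = sc.textFile("/logs/log7.txt")
      // Keep lines whose first character is a digit; nonEmpty guards against blank lines
      val dates = lines.filter(value => value.nonEmpty && Character.isDigit(value.charAt(0)))
                       .map(_.split(" ")(0)) // take the text before the first space
      dates.foreach(println)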
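
As a side note on why the filter in the question matched nothing: String.startsWith compares against literal text, so "[0-9]" is treated as five ordinary characters rather than a character class. The sketch below uses plain Scala collections instead of an RDD so that it runs without Spark; the object name StartsWithCheck and the sample lines are illustrative assumptions. It also shows headOption as one possible way to test the first character without failing on an empty line.

```scala
object StartsWithCheck {
  // Sample lines modelled on the log in the question, plus one empty line.
  val sample: Seq[String] = Seq("[Pipeline] echo", "20:33:05 0", "")

  // startsWith looks for the literal prefix "[0-9]", so no log line matches.
  val literal: Seq[String] = sample.filter(_.startsWith("[0-9]"))

  // headOption is None for an empty line, so this check never throws.
  val digitFirst: Seq[String] = sample.filter(_.headOption.exists(_.isDigit))

  def main(args: Array[String]): Unit = {
    println(literal)    // prints List()
    println(digitFirst) // prints List(20:33:05 0)
  }
}
```

The same two predicates should behave identically when passed to RDD.filter, since they only inspect one String at a time.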
    

    If you want to strictly match the timestamp with a regex and filter out anything that does not match, you can use:

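    val dates = lines.filter(value => value.nonEmpty && Character.isDigit(value.charAt(0)))
        .map(_.split(" ")(0))
        // keep only values shaped like two digits, colon, two digits, colon, two digits
        .filter(_.matches("""\d{2}:\d{2}:\d{2}"""))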
    

    Output:

    20:33:05
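
An alternative that answers the "how should the regex be defined" part more directly is to let a scala.util.matching.Regex do both the filtering and the extraction. This is a minimal sketch, not the only way to do it: the names TimestampExtract, Timestamp and extract are hypothetical, and it runs on a plain Seq so that it does not need a Spark installation.

```scala
import scala.util.matching.Regex

object TimestampExtract {
  // Anchored at the start of the line: two digits, colon, two digits, colon, two digits.
  val Timestamp: Regex = """^\d{2}:\d{2}:\d{2}""".r

  // Some(timestamp) when the line starts with one, None otherwise (including empty lines).
  def extract(line: String): Option[String] = Timestamp.findFirstIn(line)

  def main(args: Array[String]): Unit = {
    // Sample lines modelled on the log in the question, plus one empty line.
    val sample = Seq("[Pipeline] timestamps", "[Pipeline] {", "[Pipeline] echo", "20:33:05 0", "")
    // flatMap drops the None results and unwraps the Some values.
    sample.flatMap(line => extract(line)).foreach(println) // prints 20:33:05
  }
}
```

Assuming lines is the RDD[String] returned by sc.textFile, the same call shape, lines.flatMap(line => TimestampExtract.extract(line)), should yield an RDD of timestamps in a single pass, with no separate digit check or split.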
    

    Hope this helps!