Search code examples
scalaparsingstring-parsingidiomspurely-functional

Splitting complex String Pattern (without regex) in a functional approach


I am trying to split a string without regex in a more idiomatic functional approach.

case class Parsed(blocks: Vector[String], block: String, depth: Int)

def processChar(parsed: Parsed, c: Char): Parsed = {
  import parsed._
  c match {

    case '|'  if depth == 0
                =>  parsed.copy(block = "", blocks = blocks :+ block ,
                                  depth = depth)                          
    case '['  => parsed.copy(block = block + c,
                                  depth = depth + 1)
    case ']'  if depth == 1
                => parsed.copy( block = "", blocks = blocks :+ (block + c),
                                depth = depth - 1)
    case ']'  => parsed.copy(block = block + c,
                                  depth = depth - 1)
    case _    => parsed.copy(block = block + c)
  }
}

val s = "Str|[ts1:tssub2|ts1:tssub2]|BLANK|[INT1|X.X.X.X|INT2|BLANK |BLANK | |X.X.X.X|[INT3|s1]]|[INT3|INT4|INT5|INT6|INT7|INT8|INT9|INT10|INT11|INT12|INT13|INT14|INT15]|BLANK |BLANK |[s2|s3|s4|INT16|INT17];[s5|s6|s7|INT18|INT19]|[[s8|s9|s10|INT20|INT21]|ts3:tssub3| | ];[[s11|s12|s13|INT21|INT22]|INT23:INT24|BLANK |BLANK ]|BLANK |BLANK |[s14|s15]"
val parsed = s.foldLeft(Parsed(Vector(), "", 0))(processChar)
parsed.blocks.size //20 
parsed.blocks foreach println

I would expect to get the following result (parsed.blocks.size should be 12).

Str
[ts1:tssub2|ts1:tssub2]
BLANK|
[INT1|X.X.X.X|INT2|BLANK |BLANK | |X.X.X.X|[INT3|s1]]
[INT3|INT4|INT5|INT6|INT7|INT8|INT9|INT10|INT11|INT12|INT13|INT14|INT15]
BLANK 
BLANK 
[s2|s3|s4|INT16|INT17];[s5|s6|s7|INT18|INT19]
[[s8|s9|s10|INT20|INT21]|ts3:tssub3| | ];[[s11|s12|s13|INT21|INT22]|INT23:INT24|BLANK |BLANK ]
BLANK 
BLANK 
[s14|s15]

However result I am getting is (parsed.blocks.size is 20)

Str
[ts1:tssub2|ts1:tssub2]

BLANK
[INT1|X.X.X.X|INT2|BLANK|BLANK||X.X.X.X|[INT3|s1]]

[INT3|INT4|INT5|INT6|INT7|INT8|INT9|INT10|INT11|INT12|INT13|INT14|INT15]

BLANK
BLANK
[s2|s3|s4|INT16|INT17]
;[s5|s6|s7|INT18|INT19]

[[s8|s9|s10|INT20|INT21]|ts1:tssub2||]
;[[s11|s12|s13|INT21|INT22]|INT23:INT24|BLANK|BLANK]

BLANK
BLANK
[s14|s15]

To my understanding this is slight variation of parenthesis balancing problem. However in this case ; would mean kind of continuation.

I have two questions in this case

1) How the extra entry /space after [ts1:tssub2|ts1:tssub2] came, also after

[INT1|X.X.X.X|INT2|BLANK|BLANK||X.X.X.X|[INT3|s1]] , [INT3|INT4|INT5|INT6|INT7|INT8|INT9|INT10|INT11|INT12|INT13|INT14|INT15] and ;[[s11|s12|s13|INT21|INT22]|INT23:INT24|BLANK|BLANK]

in my result as well ?

2) Here at the moment [s2|s3|s4|INT16|INT17] and ;[s5|s6|s7|INT18|INT19]

go in as two different entries. However this should be merged as [s2|s3|s4|INT16|INT17];[s5|s6|s7|INT18|INT19]

a single entry[So does

[[s8|s9|s10|INT20|INT21]|ts1:tssub2||]

and

;[[s11|s12|s13|INT21|INT22]|INT23:INT24|BLANK|BLANK])

as well]. Any clues to how to do so ?


Solution

  • Question 1

    The extra empty string block appears because the very previous case each time is

    case ']'  if depth == 1
    

    It adds an empty block and decreases the depth. Then we have

    case '|'  if depth == 0
    

    which also adds another empty block, pushing the previous empty one into the resulting Vector.


    Before answering the second question, I would like to suggest another approach to the implementation of this parser, which is slightly more idiomatic. My major criticism about the current one is the usage of an intermediate object (Parsed) to wrap state and copying it in each and every case. Indeed, we do not need it: more frequent approach is to use a recursive function, especially when depth is involved.

    So, without modifying significantly the processing of your cases, it can be represented as follows:

    def parse(blocks: Seq[String],
              currentBlock: String,
              remaining: String,
              depth: Int): Seq[String] =
      if (remaining.isEmpty) {
        blocks
      } else {
        val curChar = remaining.head
        curChar match {
          case '|' if depth == 0 =>
            parse(blocks :+ currentBlock, "", remaining.tail, depth)
          case '[' =>
            parse(blocks, currentBlock + curChar, remaining.tail, depth + 1)
          case ']' =>
            if (depth == 1)
              parse(blocks :+ (currentBlock + curChar), "", remaining.tail, depth - 1)
            else
              parse(blocks, currentBlock + curChar, remaining.tail, depth - 1)
          case _ =>
            parse(blocks, currentBlock + curChar, remaining.tail, depth)
        }
      }
    

    It produces exactly the same output as the original solution.

    To fix the issue with empty blocks, we need to change case '|':

    case '|' if depth == 0 =>
      val updatedBlocks = if (currentBlock.isEmpty) blocks
                          else blocks :+ currentBlock
      parse(updatedBlocks, "", remaining.tail, depth)
    

    We just skip the current block if it contains an empty string.


    Question 2

    To merge the two blocks between ; char, we need to bring back one parsed block and return it into the currentBlock reference. This represents an additional case:

    case ';' =>
      parse(blocks.init, blocks.last + curChar, remaining.tail, depth)
    

    Now, after

    val result = parse(Seq(), "", s, 0)
    result.foreach(println)
    

    The output is

    Str
    [ts1:tssub2|ts1:tssub2]
    BLANK
    [INT1|X.X.X.X|INT2|BLANK |BLANK | |X.X.X.X|[INT3|s1]]
    [INT3|INT4|INT5|INT6|INT7|INT8|INT9|INT10|INT11|INT12|INT13|INT14|INT15]
    BLANK 
    BLANK 
    [s2|s3|s4|INT16|INT17];[s5|s6|s7|INT18|INT19]
    [[s8|s9|s10|INT20|INT21]|ts3:tssub3| | ];[[s11|s12|s13|INT21|INT22]|INT23:INT24|BLANK |BLANK ]
    BLANK 
    BLANK 
    [s14|s15]
    

    And it looks very similar to what you were looking for.