Scala: Traversing an XML tree using DFS produces unexpected results


I am traversing an XML tree visiting each node using DFS. The output that I get is not what I expected.

import scala.xml.Node

object Main extends App {

  lazy val testXml =
    <vehicles>
      <vehicle>
        gg
      </vehicle>
      <variable>
      </variable>
    </vehicles>

  traverse.dfs(testXml.head)
}

object traverse {
  def dfs(node: Node): Unit = {
    println("==============")
    println(node.label + ">>>" + node.child + ">>>" + node.child.size)
    node.child.map(child => {
      dfs(child)
    })
  }
}

Output:

==============
vehicles>>>ArrayBuffer(
      , <vehicle>
        gg
      </vehicle>, 
      , <variable>
      </variable>, 
    )>>>5
==============
#PCDATA>>>List()>>>0
==============
vehicle>>>ArrayBuffer(
        gg
      )>>>1
==============
#PCDATA>>>List()>>>0
==============
#PCDATA>>>List()>>>0
==============
variable>>>ArrayBuffer(
      )>>>1
==============
#PCDATA>>>List()>>>0
==============
#PCDATA>>>List()>>>0

Process finished with exit code 0

If you take a look at the output, for the first element (vehicles) it says it has 5 children. If you print the children, two children (the first and the last) are empty.
I want the traversal to visit vehicles, then vehicle, then gg, and finally variable.

Any advice on this is appreciated. Thanks.


Solution

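  • Those children are not empty. They are text nodes that contain the line breaks and spaces between the elements. The output actually shows three of them under vehicles: one before <vehicle>, one between the two elements, and one after <variable>.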

    If you define the XML as <vehicles><vehicle>gg</vehicle><variable></variable></vehicles> without line breaks and spaces your traversal will give the desired result.
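
    To see the difference, here is a minimal sketch that lists the labels of the direct children for a pretty-printed and a compact version of the same XML. The names WhitespaceDemo, pretty, compact and childLabels are made up for this illustration, and the sketch assumes the scala-xml module is on the classpath.

    ```scala
    import scala.xml._

    object WhitespaceDemo {
      // Pretty-printed: the whitespace between the tags becomes text nodes.
      val pretty: Elem =
        <vehicles>
          <vehicle>gg</vehicle>
          <variable></variable>
        </vehicles>

      // Compact: no whitespace between the tags, so no extra text nodes.
      val compact: Elem = <vehicles><vehicle>gg</vehicle><variable></variable></vehicles>

      // Labels of the direct children; text nodes report the label "#PCDATA".
      def childLabels(node: Node): List[String] = node.child.toList.map(_.label)

      def main(args: Array[String]): Unit = {
        println(childLabels(pretty))  // List(#PCDATA, vehicle, #PCDATA, variable, #PCDATA)
        println(childLabels(compact)) // List(vehicle, variable)
      }
    }
    ```

    The three #PCDATA entries in the first list are the whitespace-only text nodes that the traversal in the question visits.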

    But if you want the traversal to work on your original XML, you can filter the children so that whitespace-only text nodes are dropped, while elements and text nodes with actual content are kept:

    import scala.xml._
    
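    // Drop text nodes that contain only whitespace, trim the remaining
    // text nodes, and keep every other node (e.g. elements) unchanged.
    def filterEmptyNodes(nodes: Seq[Node]): Seq[Node] =
      nodes.collect(Function.unlift {
        case Text(text) =>
          if (text.trim.isEmpty) None
          else Some(Text(text.trim))
        case node => Some(node)
      })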
    

    And have the traversal function use this function:

    object traverse {
      def dfs(node: Node): Unit = {
        val nonEmptyChildren = filterEmptyNodes(node.child)
        println("==============")
        println(node.label + ">>>" + nonEmptyChildren + ">>>" + nonEmptyChildren.size)
        nonEmptyChildren.foreach(dfs)
      }
    }
    

    On a side note, you may also use node \ "_" to get all child elements, but it won't contain text nodes.
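
    As a rough illustration of that side note, the sketch below applies \ "_" to a tree like the one in the question. ChildElementsDemo and elementLabels are hypothetical names used only here. Note that the text gg does not show up, because \ "_" returns elements only.

    ```scala
    import scala.xml._

    object ChildElementsDemo {
      val testXml: Elem =
        <vehicles>
          <vehicle>gg</vehicle>
          <variable></variable>
        </vehicles>

      // node \ "_" selects the child elements and skips all text nodes,
      // including the whitespace-only ones.
      def elementLabels(node: Node): List[String] =
        (node \ "_").toList.map(_.label)

      def main(args: Array[String]): Unit = {
        println(elementLabels(testXml))                    // List(vehicle, variable)
        println(elementLabels((testXml \ "vehicle").head)) // List(), since "gg" is a text node
      }
    }
    ```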

    Alternatively, node.descendant or node.descendant_or_self gives you a List of all the descendants in DFS order, so you don't have to write the traversal yourself. You still have to filter out the "empty" nodes from the descendants: filterEmptyNodes(node.descendant) or filterEmptyNodes(node.descendant_or_self).
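
    A minimal sketch of the descendant_or_self variant, applied to the XML from the question, could look like this. DescendantDemo, clean, describe and visitOrder are names invented for this example, and filterEmptyNodes is restated here with List and a small helper so that the block compiles on its own. It is one possible way to write it, not the only one.

    ```scala
    import scala.xml._

    object DescendantDemo {
      val testXml: Elem =
        <vehicles>
          <vehicle>
            gg
          </vehicle>
          <variable>
          </variable>
        </vehicles>

      // None for whitespace-only text, trimmed text otherwise, other nodes unchanged.
      def clean(node: Node): Option[Node] = node match {
        case Text(text) if text.trim.isEmpty => None
        case Text(text)                      => Some(Text(text.trim))
        case other                           => Some(other)
      }

      def filterEmptyNodes(nodes: List[Node]): List[Node] =
        nodes.flatMap(n => clean(n).toList)

      // Text nodes are shown by their content, all other nodes by their label.
      def describe(node: Node): String = node match {
        case Text(text) => text
        case other      => other.label
      }

      // descendant_or_self lists the node itself followed by its descendants in DFS order.
      def visitOrder(root: Node): List[String] =
        filterEmptyNodes(root.descendant_or_self).map(describe)

      def main(args: Array[String]): Unit =
        println(visitOrder(testXml)) // List(vehicles, vehicle, gg, variable)
    }
    ```

    This is the visiting order asked for in the question: vehicles, vehicle, gg, variable.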