
Scala how to retrieve xml tag with optional attribute


I am trying to get a Scala XML node's tag together with its attributes. I want just the tag name and its attributes, not the child elements.

I have this input:

<substance-classes>
    <nucleic-acid-sequence display-name="Nucleic Acid Sequence">
        <nucleic-acid-base>
            <base-symbol>a</base-symbol>
            <count>295</count>
        </nucleic-acid-base>
        <nucleic-acid-base>
            <base-symbol>c</base-symbol>
            <count>329</count>
        </nucleic-acid-base>
        <nucleic-acid-base>
            <base-symbol>g</base-symbol>
            <count>334</count>
        </nucleic-acid-base>
        <nucleic-acid-base>
            <base-symbol>t</base-symbol>
            <count>268</count>
        </nucleic-acid-base>
    </nucleic-acid-sequence>
    <genbank-information>
        <genbank-accession-number>EU186063</genbank-accession-number>
    </genbank-information>
</substance-classes>

I am trying to replace the contents of <nucleic-acid-sequence> by doing this:

val newNucleicAcidSequenceNode = <nucleic-acid-sequence>{ myfunction } </nucleic-acid-sequence>

But some <nucleic-acid-sequence> elements have attributes, like <nucleic-acid-sequence display-name="Nucleic Acid Sequence">. Since my newNucleicAcidSequenceNode uses a hardcoded tag, I am losing the attributes.

How do I retain the optional attributes and still pass { myfunction } to the <nucleic-acid-sequence> tag?


Solution

  • So, if I have understood you correctly:

    • you want to replace just one part of your XML
    • that part is the children of any nucleic-acid-sequence under substance-classes
    • you don't want to lose any attributes of those nucleic-acid-sequence elements
    • the new children are produced by a function (myFunction)

    In that case, my answer would be:

    import scala.xml.{Node, Elem}
    
    val myXml: Elem =
          <substance-classes>
            <nucleic-acid-sequence display-name="Nucleic Acid Sequence">
              <nucleic-acid-base>
                <base-symbol>a</base-symbol>
                <count>295</count>
              </nucleic-acid-base>
              <nucleic-acid-base>
                <base-symbol>c</base-symbol>
                <count>329</count>
              </nucleic-acid-base>
              <nucleic-acid-base>
                <base-symbol>g</base-symbol>
                <count>334</count>
              </nucleic-acid-base>
              <nucleic-acid-base>
                <base-symbol>t</base-symbol>
                <count>268</count>
              </nucleic-acid-base>
            </nucleic-acid-sequence>
            <genbank-information>
              <genbank-accession-number>EU186063</genbank-accession-number>
            </genbank-information>
          </substance-classes>
    
    def myFunction(children: Seq[Node]) : Seq[Node] = ??? // whatever you want it to be
    
    // Here's the replacement:
    
    // copy keeps the label, namespace and attributes of each matched element
    // and swaps only its children.
    myXml.copy(child = myXml.child.map {
      case e: Elem if e.label == "nucleic-acid-sequence" =>
        e.copy(child = myFunction(e.child))
      case other => other
    })
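
The replacement above only looks at the direct children of the root element. If nucleic-acid-sequence can appear deeper in the tree, one possible approach is scala-xml's RewriteRule and RuleTransformer. The following is a minimal sketch under that assumption; rewriteChildrenOf, doc and keepFirst are hypothetical names used for illustration only, and the sample document is made up.

```scala
import scala.xml.{Elem, Node}
import scala.xml.transform.{RewriteRule, RuleTransformer}

// Hypothetical helper: a rule that rewrites the children of every element
// with the given label, wherever it sits in the tree. Attributes are kept
// because Elem.copy only overrides the `child` field.
def rewriteChildrenOf(targetLabel: String)(f: Seq[Node] => Seq[Node]): RewriteRule =
  new RewriteRule {
    override def transform(n: Node): Seq[Node] = n match {
      case e: Elem if e.label == targetLabel => e.copy(child = f(e.child))
      case other                             => other
    }
  }

// Made-up sample where the target element is nested one level deeper.
val doc: Elem =
  <root>
    <wrapper>
      <nucleic-acid-sequence display-name="Nucleic Acid Sequence"><count>295</count><count>329</count></nucleic-acid-sequence>
    </wrapper>
  </root>

// Illustrative stand-in for myFunction: keep only the first child.
val keepFirst: Seq[Node] => Seq[Node] = _.take(1)

val transformed: Node =
  new RuleTransformer(rewriteChildrenOf("nucleic-acid-sequence")(keepFirst)).apply(doc)
```

For a document where the element is always a direct child of the root, the plain copy shown earlier is simpler and avoids walking the whole tree.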
    

    For instance, myFunction could keep only the children whose count is above 300, and could look something like this:

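    import scala.util.Try

    // Keep only the children whose <count> parses as an Int greater than 300.
    // Nodes without a numeric <count> (including whitespace text nodes) are dropped.
    def myFunction(children: Seq[Node]): Seq[Node] = children.filter { node =>
      Try((node \ "count").text.toInt).toOption.exists(_ > 300)
    }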
    

    In that case, if you replace the unimplemented myFunction in the first snippet with this one, the replacement gives:

      <substance-classes>
        <nucleic-acid-sequence display-name="Nucleic Acid Sequence"><nucleic-acid-base>
            <base-symbol>c</base-symbol>
            <count>329</count>
          </nucleic-acid-base><nucleic-acid-base>
            <base-symbol>g</base-symbol>
            <count>334</count>
          </nucleic-acid-base></nucleic-acid-sequence>
        <genbank-information>
          <genbank-accession-number>EU186063</genbank-accession-number>
        </genbank-information>
      </substance-classes>
    

    As you can see, no attribute of nucleic-acid-sequence is lost, and your function has kept two of the four nodes, namely those that satisfy the condition.
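
Since the attribute in the question is optional, it may help to check both cases. The sketch below assumes only that Elem.copy leaves every field other than child untouched; withChildren and the sample values are hypothetical names introduced for illustration.

```scala
import scala.xml.{Elem, Node}

// Hypothetical helper: replace an element's children while keeping its
// label, namespace and attributes (whatever they are, including none).
def withChildren(e: Elem, newChildren: Seq[Node]): Elem =
  e.copy(child = newChildren)

val withAttr: Elem =
  <nucleic-acid-sequence display-name="Nucleic Acid Sequence"><count>295</count></nucleic-acid-sequence>
val withoutAttr: Elem =
  <nucleic-acid-sequence><count>295</count></nucleic-acid-sequence>

// Made-up replacement content standing in for the result of myFunction.
val replacement: Seq[Node] = Seq(<count>1</count>)

val a: Elem = withChildren(withAttr, replacement)
val b: Elem = withChildren(withoutAttr, replacement)

a.attributes.asAttrMap.get("display-name") // Some("Nucleic Acid Sequence")
b.attributes.asAttrMap.get("display-name") // None
```

The same call works for both elements, so no special handling is needed for the case where the attribute is absent.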

    Hope it helps.