Search code examples
xmlhaskellhxt

How to merge sibling elements with same attributes?


Consider XML input like this:

<root>
  <sub>
      <p att1=0 att2=1><i>foo</i></p>
      <p att1=1 att2=1><i>bar</i></p>
      <p att1=0 att2=0><i>baz</i></p>
      <p att1=0 att2=1><i>bazz</i></p>
  </sub>
</root>

Which should be transformed to:

<root>
  <sub>
      <p att1=0 att2=1><i>foo</i><i>bazz</i></p>
      <p att1=1 att2=1><i>bar</i></p>
      <p att1=0 att2=0><i>baz</i></p>
  </sub>
</root>

(Because both p parent elemets of <i>foo</i> and <i>bazz</i> are siblings and have the same attributes.)

How to do such a transformation with HXT arrows?


Solution

  • Ok, there is my try: The code first collects all attribute lists at the parent of the siblings and then does a merge for all attribute lists that are different:

    {-# LANGUAGE Arrows #-}
    
    module Main
    where
    
    import Data.List
    import Text.XML.HXT.Core
    
    example="\
    \<root>\
    \  <sub>\
    \      <p att1=\"0\" att2=\"1\"><i>foo</i></p>\
    \      <p att1=\"1\" att2=\"1\"><i>bar</i></p>\
    \      <p att1=\"0\" att2=\"0\"><i>baz</i></p>\
    \      <p att1=\"0\" att2=\"1\"><i>bazz</i></p>\
    \  </sub>\
    \</root>"
    
    get_attrs name = getChildren >>> hasName name >>> proc x -> do
       a <- listA (((
              getAttrName
              &&& (getChildren >>> getText))  ) <<< getAttrl ) -< x
       returnA -< a
    
    
    has_attrs atts = proc x -> do
       a <- listA (((
               getAttrName
               &&& (getChildren >>> getText))  ) <<< getAttrl ) -< x
       if (a == atts)
       then returnA -< x
       else none -<< ()
    
    mk_attrs atts = map f atts
      where
        f (n, v) = sqattr n v
    
    mergeSiblings_one inp name att = catA (map constA inp)
        >>> mkelem name
                   (mk_attrs att)
                   [getChildren
                    >>> hasName name  >>> has_attrs att >>> getChildren ]
    
    mergeSiblings_core name = proc x -> do
        a <- listA (get_attrs name >>. (sort.nub) ) -< x
        b <- listA this -< x
        c <- listA (getChildren >>> neg (hasName name)) -< x
        catA ((map (mergeSiblings_one b name) a) ++ (map constA c) ) -<< ()
    
    
    is_parent_of name = getChildren >>> hasName name
    
    mergeSiblings name = processTopDownUntil (
            is_parent_of name `guards` mergeSiblings_core name
        )
    
    stuff = mergeSiblings "p"
    
    
    main :: IO ()
    main
        = do
          x <- runX ( 
                 configSysVars  [withTrace 1]
                 >>> readString [withValidate no
                               ,withPreserveComment yes
                               ,withRemoveWS yes
                            ] example
                 >>> setTraceLevel 4
                 >>> stuff >>> traceTree >>> traceSource
               )
          return ()
    

    Output for the example

    <root>
      <p att1="0" att2="0">
        <i>baz</i>
      </p>
      <p att1="0" att2="1">
        <i>foo</i>
        <i>bazz</i>
      </p>
      <p att1="1" att2="1">
        <i>bar</i>
      </p>
    </root>
    

    Nice to have

    The above version puts the merged children in front and the none-matching ones in the new children list of the parent node: A nice variation would be: insert each merged child at the old position of the first old sibling node and don't change the order of non-merged nodes. For example that

    <other>1</other><p><a/></p><other>2</other><p><b/></p>
    

    is transformed to

    <other>1</other><p><a/><b/></p><other>2</other>
    

    and not to:

    <p><a/><b/></p><other>1</other><other>2</other>
    

    Disclaimer

    Since I am new to HXT and arrows - I wouldn't be surprised if there are more concise/HXT0idiomatic/elegant answers.