Tags: xml, xquery, xquery-3.1

Wrap plain text chunks in P while skipping chunks that are already wrapped in P


I need to wrap every plain-text chunk in a paragraph, but the content may already contain paragraphs, and those should be left as they are. How would I tackle this?

I'm having a difficult time understanding how to wrap a run of plain text (together with any inline elements in it) into one paragraph while skipping existing paragraphs.

Given XML:

<section xmlns="http://www.w3.org/1999/xhtml">    
    <div>
        test test 
        <p>test</p>
        <ins>INS</ins>
        text
    </div> 
</section>

Expected Result:

<section xmlns="http://www.w3.org/1999/xhtml">    
    <div>
        <p>test test</p>
        <p>test</p>
        <p>
            <ins>INS</ins>
            text
        </p>
    </div> 
</section>

Solution

  • Here is an approach that uses a simple recursive function to partition the div content around its p nodes:

    declare default element namespace "http://www.w3.org/1999/xhtml";
    
    declare function local:collect($sequence as node()*) as node()* {
      (: index of last p in this candidate subsequence :)
      let $nextP := max((0, 
        (for $i in (1 to count($sequence)) 
         where $sequence[$i][self::p] 
         return $i)))
      return
        (: if sequence is empty then return empty sequence :)
        if(count($sequence) = 0) then ()
        (: if no p in this candidate subsequence, then wrap it in a p :)
        else if($nextP = 0) then <p>{$sequence}</p>
        (: otherwise evaluate subsequence before the last p, the p, 
           and the subsequence after the last p 
         :)
        else (
          local:collect(subsequence($sequence,1,$nextP - 1)),
          $sequence[$nextP],
          local:collect(subsequence($sequence,$nextP + 1))
        )
    };
    
    let $input :=
        <section>    
           <div>
                test test 
                <p>test</p>
                <ins>INS</ins>
                text
           </div> 
        </section>
    return
      <section>    
      {
        for $div in $input/div
        return <div>{local:collect($div/(*|text()))}</div>
      }
      </section>
    

    yields the following:

    <section xmlns="http://www.w3.org/1999/xhtml">
       <div>
          <p>
             test test 
             </p>
          <p>test</p>
          <p><ins>INS</ins>
             text
             </p>
       </div>
    </section>
    

    Your expected result is not consistent with regard to leading/trailing whitespace in the text nodes. It is not clear if you really expect to achieve the exact result presented where whitespace is normalized for some text and not for other text. Probably not.

    To normalize whitespace in all text nodes, replace this:

    <p>{$sequence}</p>
    

    with:

    <p>{for $x in $sequence return if($x[self::text()]) then normalize-space($x) else ($x)}</p>
    

    which yields:

    <section xmlns="http://www.w3.org/1999/xhtml">
      <div>
        <p>test test</p>
        <p>test</p>
        <p><ins>INS</ins>text</p>
      </div>
    </section>
    

    The algorithm here works when there is no p or multiple p, but I did not test every scenario.
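
    As a rough check of those cases, the sketch below runs the same local:collect function over a few hand-made div elements. The $cases sequence is hypothetical test data, not part of the original question, and this is only a quick sanity check rather than an exhaustive test.

    ```xquery
    declare default element namespace "http://www.w3.org/1999/xhtml";

    (: Same function as above, repeated so the sketch is self-contained. :)
    declare function local:collect($sequence as node()*) as node()* {
      (: index of the last p in this subsequence, or 0 if there is none :)
      let $nextP := max((0,
        (for $i in (1 to count($sequence))
         where $sequence[$i][self::p]
         return $i)))
      return
        if (count($sequence) = 0) then ()
        else if ($nextP = 0) then <p>{$sequence}</p>
        else (
          local:collect(subsequence($sequence, 1, $nextP - 1)),
          $sequence[$nextP],
          local:collect(subsequence($sequence, $nextP + 1))
        )
    };

    (: Hypothetical edge cases: no p, only p, alternating text and p, empty div. :)
    let $cases := (
      <div>only text</div>,
      <div><p>a</p><p>b</p></div>,
      <div>x<p>a</p>y<p>b</p>z</div>,
      <div/>
    )
    for $div in $cases
    return <div>{local:collect($div/(*|text()))}</div>

    (: Children of each result div, in order:
       1. <p>only text</p>
       2. <p>a</p> <p>b</p>
       3. <p>x</p> <p>a</p> <p>y</p> <p>b</p> <p>z</p>
       4. (none)
     :)
    ```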

    In XQuery 3.0 and later, the partitioning can be expressed more directly with a tumbling window, for example:

    declare default element namespace "http://www.w3.org/1999/xhtml";

    (: Return true if both nodes exist and either both are p or neither is p.
     :)
    declare function local:same($compare1 as node()?, $compare2 as node()?) as xs:boolean {
      if(not($compare1) or not($compare2)) then false()
      else if(($compare1[self::p] and $compare2[self::p]) 
        or (not($compare1[self::p]) and not($compare2[self::p])))
      then true()
      else false()
    };
    
    let $input :=
        <section>    
            <div>
                test test 
                <p>test</p>
                <ins>INS</ins>
                text
            </div> 
        </section>
    
    return
      <section>    
      {
        for $div in $input/div
        return
          <div>
          {
            for tumbling window $partition in $div/(*|text())
            start $s previous $s-prev when not(local:same($s, $s-prev))
            end   $e next $e-next     when not(local:same($e, $e-next))
            return 
              if($partition[1][self::p]) 
              then $partition 
              else <p>{$partition}</p>
          }
          </div>
      }
      </section>
    

    Similarly, to normalize whitespace, replace:

    <p>{$partition}</p>
    

    with something like:

    <p>{for $x in $partition return if($x[self::text()]) then normalize-space($x) else ($x)}</p>
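
    One caveat worth checking: the queries above build $input with a direct element constructor, where whitespace-only text between elements is stripped by default. If the input is instead parsed from a file, such text nodes are usually kept, and each one sitting between two p elements would be wrapped into an empty-looking paragraph. A minimal sketch of one way to avoid that, assuming whitespace-only text can simply be dropped: the `declare boundary-space preserve` line is there only to imitate parsed input, and the start-only window is a variation on the approach above, not the original author's code.

    ```xquery
    declare default element namespace "http://www.w3.org/1999/xhtml";
    (: Keep whitespace-only text in the constructor below, imitating a parsed file. :)
    declare boundary-space preserve;

    let $div :=
      <div>
        <p>one</p>
        <p>two</p>
        tail
      </div>
    (: text()[normalize-space()] selects only text nodes with non-whitespace content. :)
    let $items := $div/(* | text()[normalize-space()])
    return
      <div>{
        (: A new window starts at the first item, at every p, and right after every p,
           so each p stands alone and each run of non-p nodes forms one window. :)
        for tumbling window $w in $items
        start $s previous $prev
          when empty($prev) or $s[self::p] or $prev[self::p]
        return
          if ($w[1][self::p])
          then $w
          else <p>{
            for $x in $w
            return if ($x[self::text()]) then normalize-space($x) else $x
          }</p>
      }</div>

    (: Children of the result div, in order:
       <p>one</p> <p>two</p> <p>tail</p>
     :)
    ```

    Without the [normalize-space()] filter, the whitespace before the first p and between the two p elements would each come back as a paragraph containing nothing visible.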