Search code examples
xmlxquerymarklogicxpath-2.0altova

XML to CSV conversion using XQuery


I have an XML file and I need to convert it into XQuery. consider a simple set of XML:

books[book]
book[@isbn, title, descrption]

eg:

<books>
    <book isbn="1590593049">
        <title>Extending Flash MX 2004</title>
        <description>
        Using javascript alongwith actionscript 3.0 and mxml.</description>
    </book>
    <book isbn="0132149184">
        <title>Java Software Solutions</title>
        <description>
            Complete book full of case studies on business solutions and design concepts while building mission critical
            business applications.
        </description>
    </book>

How to convert it to CSV format using XQuery? The CSV is used by Microsoft excel,

so it would be delimited by comma (,) character and special characters should be escaped.


Solution

  • A pure XPath 2.0 expression:

    for $b in /*/book
        return
          concat(escape-html-uri(string-join(($b/@isbn,
                                              $b/title,
                                              $b/description
                                              )
                                               /normalize-space(),
                                            ",")
                                 ),
                 codepoints-to-string(10))
    

    XSLT 2 - based verification:

    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output omit-xml-declaration="yes" indent="yes"/>
    
     <xsl:template match="/">
      <xsl:sequence select=
       "for $b in /*/book
           return
             concat(escape-html-uri(string-join(($b/@isbn,
                                                 $b/title,
                                                 $b/description
                                                 )
                                                  /normalize-space(),
                                               ',')
                                    ),
                    codepoints-to-string(10))"/>
     </xsl:template>
    </xsl:stylesheet>
    

    When this transformation is applied on the provided XML document (corrected from its malformedness):

    <books>
        <book isbn="1590593049">
            <title>Extending Flash MX 2004</title>
            <description>
            Using javascript alongwith actionscript 3.0 and mxml.</description>
        </book>
        <book isbn="0132149184">
            <title>Java Software Solutions</title>
            <description>
                Complete book full of case studies on business solutions and design concepts while building mission critical
                business applications.
            </description>
        </book>
    </books>
    

    the wanted, correct result is produced:

    1590593049,Extending Flash MX 2004,Using javascript alongwith actionscript 3.0 and mxml.
     0132149184,Java Software Solutions,Complete book full of case studies on business solutions and design concepts while building mission critical business applications.
    

    Update:

    In a comment the OP has requested that any in-text comma be surrounded by a quote and that (after that) any quote be replaced by two quotes, and, finally, if the wholw result contains a quote, it must be surrounded by (single) quotes.

    Here is a pure XPath 2.0 expression that produces this:

    for $b in /*/book,
        $q in codepoints-to-string(34),
        $NL in codepoints-to-string(10),
        $isbn in normalize-space(replace($b/@isbn, ',', concat($q,',',$q))),
        $t in normalize-space(replace($b/title, ',', concat($q,',',$q))),
        $d in normalize-space(replace($b/description, ',', concat($q,',',$q))),
        $res in
         escape-html-uri(string-join(($isbn,$t,$d), ',')),
        $res2 in replace($res, $q, concat($q,$q))
       return
        if(contains($res2, $q))
           then concat($q, $res2, $q, $NL)
           else concat($res2, $NL)
    

    When this XPath expression is evaluated against this (extended with a new test-case) XML document:

    <books>
        <book isbn="1590593049">
            <title>Extending Flash MX 2004</title>
            <description>
            Using javascript alongwith actionscript 3.0 and mxml.</description>
        </book>
        <book isbn="0132149184">
            <title>Java Software Solutions</title>
            <description>
                Complete book full of case studies on business solutions and design concepts while building mission critical
                business applications.
            </description>
        </book>
        <book isbn="XX1234567">
            <title>Quotes and comma</title>
            <description>
                Hello, World from "Ms-Excel"
            </description>
        </book>
    </books>
    

    the wanted, correct result is produced:

    1590593049,Extending Flash MX 2004,Using javascript alongwith actionscript 3.0 and mxml.
    0132149184,Java Software Solutions,Complete book full of case studies on business solutions and design concepts while building mission critical business applications.
    "XX1234567,Quotes and comma,Hello"","" World from ""Ms-Excel"""