Using HXT with an XML document including a specification header


I'd like to iron out a bug in the rdf4h library that I currently maintain. Its XmlParser module parses XML/RDF documents into RDF graphs, but it fails on XML/RDF documents that start with an XML specification header (the XML declaration), e.g.

<?xml version="1.0" encoding="ISO-8859-1"?>

The parser uses the HXT arrow interface, namely the Text.XML.HXT.Core module. I have boiled the problem down to two parsing attempts, made in the functions testSuccess and testFailure. Both use runSLA. The author of hxt tells me that the problem lies in the use of xread, and that I should first extract the XML document from the string before applying xread. (Unfortunately, he hasn't responded to the GitHub issue I raised about this.)

Below, there are two strings, both containing the same XML document. The xmlDoc1 string includes a specification header, which trips up the xread arrow in testFailure.

module HXTProblem where

import Text.XML.HXT.Core

data GParseState = GParseState { stateGenId :: Int } deriving(Show)

-- this document has an XML specification included
xmlDoc1 :: String
xmlDoc1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" ++
          "<shiporder orderid=\"889923\" " ++
          "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " ++
          "xsi:noNamespaceSchemaLocation=\"shiporder.xsd\">" ++
          "<orderperson>John Smith</orderperson>" ++
             "<shipto>" ++
               "<name>Ola Nordmann</name>" ++
             "</shipto>" ++
          "</shiporder>"

-- this document does not include the XML specification
xmlDoc2 :: String
xmlDoc2 = "<shiporder orderid=\"889923\" " ++
          "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " ++
          "xsi:noNamespaceSchemaLocation=\"shiporder.xsd\">" ++
          "<orderperson>John Smith</orderperson>" ++
             "<shipto>" ++
               "<name>Ola Nordmann</name>" ++
             "</shipto>" ++
          "</shiporder>"

initState :: GParseState
initState = GParseState { stateGenId = 0 }

-- | Works
testSuccess :: (GParseState,[XmlTree])
testSuccess = runSLA xread initState xmlDoc2

{- output of running testSuccess
(GParseState {stateGenId = 0},[NTree (XTag "shiporder" [NTree (XAttr "orderid") [NTree (XText "889923") []],NTree (XAttr "xmlns:xsi") [NTree (XText "http://www.w3.org/2001/XMLSchema-instance") []],NTree (XAttr "xsi:noNamespaceSchemaLocation") [NTree (XText "shiporder.xsd") []]]) [NTree (XTag "orderperson" []) [NTree (XText "John Smith") []],NTree (XTag "shipto" []) [NTree (XTag "name" []) [NTree (XText "Ola Nordmann") []]]]]
-}

-- | Does not work
testFailure :: (GParseState,[XmlTree])
testFailure = runSLA xread initState xmlDoc1

{- ERROR running testFailure
(GParseState {stateGenId = 0},[NTree (XError 2 "\"string: \"<?xml version=\\\"1.0\\\" encoding=\\\"ISO-8859-1...\"\" (line 1, column 6):\nunexpected xml\nexpecting legal XML name character\n") []])
-}

I should add that I am looking for a solution using runSLA that generates the same XmlTree when parsing either xmlDoc1 or xmlDoc2.
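One possible reading of the "extract the XML document first" advice, as a rough sketch only: strip a leading XML declaration from the string and pass the rest to xread. The helper dropXmlDecl is hypothetical (it is not part of HXT), it ignores the declared encoding entirely, and the unit state `()` merely stands in for GParseState.

```haskell
module Main where

import Data.Char (isSpace)
import Data.List (stripPrefix)
import Text.XML.HXT.Core

-- Hypothetical helper, not part of HXT: drop a leading <?xml ... ?>
-- declaration and return the remainder of the input unchanged.
-- Other processing instructions (e.g. <?xml-stylesheet ...?>) are kept.
dropXmlDecl :: String -> String
dropXmlDecl s =
  case stripPrefix "<?xml" (dropWhile isSpace s) of
    Just rest@(c:_) | isSpace c -> afterClose rest
    _                           -> s
  where
    -- skip everything up to and including the first "?>"
    afterClose ('?':'>':rest) = rest
    afterClose (_:rest)       = afterClose rest
    afterClose []             = []

main :: IO ()
main = do
  let doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a><b>text</b></a>"
  -- () stands in for a user state such as GParseState
  print (snd (runSLA xread () (dropXmlDecl doc)))
```

This keeps runSLA and xread as they are, but it is string surgery rather than parsing, so a parser-level fix would be preferable.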


Solution

  • Hurray, this has been solved. The author of the HXT library has addressed the GitHub issue and added a new parser, xreadDoc, in this commit. I've fixed the rdf4h library from version 1.2.2 onwards by using this new parser in this commit, so XML/RDF documents (with spec and encoding headers) can now be parsed with the XmlParser.

    Note the new arrow composition in testFailure, as (xreadDoc >>> isElem).

    module HXTProblem where
    
    import Text.XML.HXT.Core
    
    data GParseState = GParseState { stateGenId :: Int } deriving(Show)
    
    -- this document has an XML specification included
    xmlDoc1 :: String
    xmlDoc1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" ++
              "<shiporder orderid=\"889923\" " ++
              "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " ++
              "xsi:noNamespaceSchemaLocation=\"shiporder.xsd\">" ++
              "<orderperson>John Smith</orderperson>" ++
                 "<shipto>" ++
                   "<name>Ola Nordmann</name>" ++
                 "</shipto>" ++
              "</shiporder>"
    
    -- this document does not include the XML specification
    xmlDoc2 :: String
    xmlDoc2 = "<shiporder orderid=\"889923\" " ++
              "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " ++
              "xsi:noNamespaceSchemaLocation=\"shiporder.xsd\">" ++
              "<orderperson>John Smith</orderperson>" ++
                 "<shipto>" ++
                   "<name>Ola Nordmann</name>" ++
                 "</shipto>" ++
              "</shiporder>"
    
    initState :: GParseState
    initState = GParseState { stateGenId = 0 }
    
    -- | Works
    testSuccess :: (GParseState,[XmlTree])
    testSuccess = runSLA xread initState xmlDoc2
    
    -- | Now works too, despite the XML specification header
    testFailure :: (GParseState,[XmlTree])
    testFailure = runSLA (xreadDoc >>> isElem) initState xmlDoc1
    
    testEquality :: Bool
    testEquality =
        let (_,x) = testSuccess
            (_,y) = testFailure
        in x == y
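To see what the isElem filter in (xreadDoc >>> isElem) is for, here is a minimal sketch. As I understand the HXT documentation, xreadDoc parses a whole document including the prolog, so its result list can contain non-element nodes (such as the declaration) alongside the root element. The sample string and the names allNodes and elemsOnly are made up for illustration, and runLA is used instead of runSLA only because no state is needed here.

```haskell
module Main where

import Text.XML.HXT.Core

-- Hypothetical sample input: a short document with an XML declaration.
sample :: String
sample = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a><b>text</b></a>"

-- Every top-level node that xreadDoc produces for the sample.
allNodes :: [XmlTree]
allNodes = runLA xreadDoc sample

-- Only the element nodes, i.e. what the rest of the parser wants to see.
elemsOnly :: [XmlTree]
elemsOnly = runLA (xreadDoc >>> isElem) sample

main :: IO ()
main = do
  putStrLn ("top-level nodes from xreadDoc: " ++ show (length allNodes))
  putStrLn ("nodes left after isElem:       " ++ show (length elemsOnly))
  -- Compare with xread applied to the same document without the declaration.
  print (elemsOnly == runLA xread "<a><b>text</b></a>")
```

Printing allNodes in GHCi is a quick way to check which extra nodes your HXT version actually returns before deciding how to filter them.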