I'd like to iron out a bug in the rdf4h library, which I currently maintain. Its XmlParser module parses RDF/XML documents into RDF graphs, but it fails on documents that include an XML specification header (the XML declaration), e.g.
<?xml version="1.0" encoding="ISO-8859-1"?>
The parser uses the HXT arrow interface, namely the Text.XML.HXT.Core module. I have boiled the problem down to two parsing attempts, made in the functions testSuccess and testFailure. Both use runSLA. The author of hxt tells me that the problem lies in the use of xread, and that I should first extract the XML document from the string before applying xread. (Unfortunately, he hasn't responded on the GitHub issue I raised about this.)
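One way to read the "extract the document first" advice is to strip the declaration from the string before it reaches xread. The following is only a minimal sketch of that idea: dropXmlDecl is a hypothetical helper, not part of HXT, and it assumes the string is already decoded, so the encoding named in the header is simply discarded.

```haskell
import Data.Char (isSpace)
import Data.List (isPrefixOf)

-- | Hypothetical helper (not part of HXT): drop a leading XML declaration
-- such as <?xml version="1.0" encoding="ISO-8859-1"?> from a document string.
-- Input without a leading declaration is returned unchanged.
dropXmlDecl :: String -> String
dropXmlDecl input
  | "<?xml" `isPrefixOf` trimmed
  , (c:_) <- drop 5 trimmed
  , isSpace c                      -- leaves e.g. "<?xml-stylesheet" alone
  = afterClose trimmed
  | otherwise = input
  where
    trimmed = dropWhile isSpace input
    -- Skip everything up to and including the first "?>".
    afterClose ('?':'>':rest) = dropWhile isSpace rest
    afterClose (_:rest)       = afterClose rest
    afterClose []             = input  -- unterminated declaration: leave as is
```

With the two strings defined below, dropXmlDecl xmlDoc1 is the same string as xmlDoc2, so runSLA xread initState (dropXmlDecl xmlDoc1) would be given exactly the input that testSuccess parses. This is a string-level workaround rather than a proper prolog parser.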
Below are two strings, both containing the same XML document. The xmlDoc1 string includes a specification header, which trips up the xread arrow in testFailure.
module HXTProblem where
import Text.XML.HXT.Core
data GParseState = GParseState { stateGenId :: Int } deriving(Show)
-- this document has an XML specification included
xmlDoc1 :: String
xmlDoc1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" ++
          "<shiporder orderid=\"889923\" " ++
          "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " ++
          "xsi:noNamespaceSchemaLocation=\"shiporder.xsd\">" ++
          "<orderperson>John Smith</orderperson>" ++
          "<shipto>" ++
          "<name>Ola Nordmann</name>" ++
          "</shipto>" ++
          "</shiporder>"
-- this document does not include the XML specification
xmlDoc2 :: String
xmlDoc2 = "<shiporder orderid=\"889923\" " ++
          "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " ++
          "xsi:noNamespaceSchemaLocation=\"shiporder.xsd\">" ++
          "<orderperson>John Smith</orderperson>" ++
          "<shipto>" ++
          "<name>Ola Nordmann</name>" ++
          "</shipto>" ++
          "</shiporder>"
initState :: GParseState
initState = GParseState { stateGenId = 0 }
-- | Works
testSuccess :: (GParseState,[XmlTree])
testSuccess = runSLA xread initState xmlDoc2
{- output of running testSuccess
(GParseState {stateGenId = 0},[NTree (XTag "shiporder" [NTree (XAttr "orderid") [NTree (XText "889923") []],NTree (XAttr "xmlns:xsi") [NTree (XText "http://www.w3.org/2001/XMLSchema-instance") []],NTree (XAttr "xsi:noNamespaceSchemaLocation") [NTree (XText "shiporder.xsd") []]]) [NTree (XTag "orderperson" []) [NTree (XText "John Smith") []],NTree (XTag "shipto" []) [NTree (XTag "name" []) [NTree (XText "Ola Nordmann") []]]]]
-}
-- | Does not work
testFailure :: (GParseState,[XmlTree])
testFailure = runSLA xread initState xmlDoc1
{- ERROR running testFailure
(GParseState {stateGenId = 0},[NTree (XError 2 "\"string: \"<?xml version=\\\"1.0\\\" encoding=\\\"ISO-8859-1...\"\" (line 1, column 6):\nunexpected xml\nexpecting legal XML name character\n") []])
-}
I should add that I am looking for a solution using runSLA that generates the same XmlTree when parsing either xmlDoc1 or xmlDoc2.
Hurray, this has been solved. The author of the HXT library has addressed the GitHub issue and added a new parser, xreadDoc, in this commit. I've fixed rdf4h in version 1.2.2 and up by using this new parser in this commit, so RDF/XML documents (with specification and encoding headers) can now be parsed with the XmlParser.
Note the new arrow composition in testFailure: (xreadDoc >>> isElem).
module HXTProblem where
import Text.XML.HXT.Core
data GParseState = GParseState { stateGenId :: Int } deriving(Show)
-- this document has an XML specification included
xmlDoc1 :: String
xmlDoc1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" ++
          "<shiporder orderid=\"889923\" " ++
          "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " ++
          "xsi:noNamespaceSchemaLocation=\"shiporder.xsd\">" ++
          "<orderperson>John Smith</orderperson>" ++
          "<shipto>" ++
          "<name>Ola Nordmann</name>" ++
          "</shipto>" ++
          "</shiporder>"
-- this document does not include the XML specification
xmlDoc2 :: String
xmlDoc2 = "<shiporder orderid=\"889923\" " ++
          "xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " ++
          "xsi:noNamespaceSchemaLocation=\"shiporder.xsd\">" ++
          "<orderperson>John Smith</orderperson>" ++
          "<shipto>" ++
          "<name>Ola Nordmann</name>" ++
          "</shipto>" ++
          "</shiporder>"
initState :: GParseState
initState = GParseState { stateGenId = 0 }
-- | Works
testSuccess :: (GParseState,[XmlTree])
testSuccess = runSLA xread initState xmlDoc2
-- | Does also now work!
testFailure :: (GParseState,[XmlTree])
testFailure = runSLA (xreadDoc >>> isElem) initState xmlDoc1
testEquality :: Bool
testEquality =
  let (_,x) = testSuccess
      (_,y) = testFailure
  in x == y
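As far as I understand it, the isElem filter is needed because xreadDoc, unlike xread, parses the prolog as well and returns those nodes (such as the XML declaration) alongside the root element. The sketch below illustrates this on a small made-up document, using the stateless runLA instead of runSLA for brevity; sampleDoc, allTopLevelNodes and elementNodes are names invented for this example, and it requires an hxt version that provides xreadDoc.

```haskell
import Text.XML.HXT.Core

-- Small illustrative document (not taken from rdf4h) with an XML declaration.
sampleDoc :: String
sampleDoc = "<?xml version=\"1.0\"?><greeting>hello</greeting>"

-- Every top-level node that xreadDoc produces, prolog nodes included.
allTopLevelNodes :: [XmlTree]
allTopLevelNodes = runLA xreadDoc sampleDoc

-- Only the element nodes, as in the fixed testFailure above.
elementNodes :: [XmlTree]
elementNodes = runLA (xreadDoc >>> isElem) sampleDoc
```

Comparing length allTopLevelNodes with length elementNodes in GHCi shows how many non-element nodes the filter removes.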