
Ignoring XML attributes with HXT unpickler


I'm writing a small application that scrapes XML from multiple sites and then processes the data the way I want. I have built such an application before in other languages; I am writing this one for Haskell practice.

Anyway, to the point. After looking around the Web at a million and one different XML parsers, I decided to go with HXT, because who doesn't love arrows? Following the page http://www.haskell.org/haskellwiki/HXT/Conversion_of_Haskell_data_from/to_XML, I used instances of XmlPickler to read my XML file into the Haskell data types I defined. It works, except for this error:

fatal error: document unpickling failed
xpCheckEmptyAttributes: unprocessed XML attribute(s) detected

I'm aware that I didn't process all attributes. I don't want all the attributes. Is there a way to ignore these? I imagine that I could process all the attributes, put them in a new data type and then extract attributes from that to get the data that I actually want. I'd like to avoid this little hack though and hence I'm here, asking for The Proper Way™.

Am I using the wrong tool for the job? Is unpickling third-party data unsafe (like it is in Python)?

I looked around the Web for a solution but Text.XML.HXT.Arrow.XmlState.SystemConfig doesn't seem to have what I need to disable this behaviour.


Solution

  • I came across this exact problem the other day, and came to the following conclusion:

    Am I using the wrong tool for the job?

    Yes. HXT's pickle functionality is designed to make serializing and deserializing data easy, but it does not offer much flexibility. From the linked page:

    They are intended to read machine generated XML, ideally generated by the same pickler.

    As for:

    Is unpickling 3rd party data unsafe (like it is in Python)?

    Not with HXT, no. Pickling in Python is unsafe because it (loosely) equates to calling eval() on arbitrary content. HXT is just an XML parser; no arbitrary code gets called.

    Personally, I've moved on to manually processing XML using the xml package (Text.XML.Light) instead of trying to get HXT's picklers to do what I want. It's not as concise, but it lets me ignore data that I don't care about. You could presumably use the non-pickle parts of HXT just as well though, if you like arrows (I'm still wrapping my head around them ;)).
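To illustrate the manual approach, here is a minimal sketch using Text.XML.Light from the xml package. The `Item` type, the `item` element name and the `id` attribute are made up for the example and do not come from the question; the point is only that attributes you never ask for are silently ignored.

```haskell
import Data.Maybe (mapMaybe)
import Text.XML.Light
  ( Element, findAttr, findElements, parseXMLDoc, strContent, unqual )

-- Hypothetical record holding only the data we care about.
data Item = Item
  { itemId    :: String
  , itemTitle :: String
  } deriving (Eq, Show)

-- Build an Item from a single <item> element.
-- Only the "id" attribute is read; any other attribute is simply ignored.
-- Elements without an "id" attribute yield Nothing.
toItem :: Element -> Maybe Item
toItem el = do
  i <- findAttr (unqual "id") el
  return (Item i (strContent el))

-- Parse a document and collect every <item> that has an "id".
-- A document that fails to parse gives an empty list.
parseItems :: String -> [Item]
parseItems src =
  case parseXMLDoc src of
    Nothing   -> []
    Just root -> mapMaybe toItem (findElements (unqual "item") root)
```

With this, something like `parseItems "<items><item id=\"1\" lang=\"en\">First</item></items>"` should produce an `Item` with `itemId = "1"` and `itemTitle = "First"`, with the `lang` attribute never causing an error.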