Python SAX Parser program calculating wrong results


The following XML file (lieferungen.xml) contains several inconsistencies: several items have more than one ID (e.g. item "apfel" has 3 different IDs):

<?xml version="1.0" encoding="UTF-8"?>
<lieferungen xmlns="urn:myspace:lieferungen" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:myspace:lieferungen C:\xml\lieferungen.xsd">
    <artikel id="3526">
        <name>apfel</name>
        <preis stueckpreis="true">8.97</preis>
        <lieferant>Fa. Krause</lieferant>
    </artikel>
    <artikel id="7866">
        <name>Kirschen</name>
        <preis stueckpreis="false">10.45</preis>
        <lieferant>Fa. Helbig</lieferant>
    </artikel>
    <artikel id="4444"> <!--DIFFERENT ID FOR apfel!! -->    
        <name>apfel</name>
        <preis stueckpreis="true">12.67</preis>
        <lieferant>Fa. Liebig</lieferant>
    </artikel>
    <artikel id="7866">
        <name>Kirschen</name>
        <preis stueckpreis="false">17.67</preis>
        <lieferant>Fa. Krause</lieferant>
    </artikel>
    <artikel id="2345"> <!--DIFFERENT ID FOR apfel!! -->
        <name>apfel</name>
        <preis stueckpreis="true">9.54</preis>
        <lieferant>Fa. Mertes</lieferant>
    </artikel>
    <artikel id="7116"> <!--DIFFERENT ID FOR Kirschen!! -->
        <name>Kirschen</name>
        <preis stueckpreis="false">16.45</preis>
        <lieferant>Fa. Hoeller</lieferant>
    </artikel>
    <artikel id="7868">
        <name>Kohl</name>
        <preis stueckpreis="false">3.20</preis>
        <lieferant>Fa. Hoeller</lieferant>
    </artikel>
    <artikel id="7866">
        <name>Kirschen</name>
        <preis stueckpreis="false">12.45</preis>
        <lieferant>Fa. Richard</lieferant>
    </artikel>
    <artikel id="3245">
        <name>Bananen</name>
        <preis stueckpreis="false">15.67</preis>
        <lieferant>Fa. Hoeller</lieferant>
    </artikel>
    <artikel id="6745"> <!--DIFFERENT ID FOR Kohl!! -->     
        <name>Kohl</name>
        <preis stueckpreis="false">3.10</preis>
        <lieferant>Fa. Reinhardt</lieferant>
    </artikel>
    <artikel id="7789">
        <name>Ananas</name>
        <preis stueckpreis="true">8.60</preis>
        <lieferant>Fa. Richard</lieferant>
    </artikel>
</lieferungen>

To find all inconsistencies in the file, I wrote the following SAX parser in Python:

import xml.sax
import sys


class C_Handler(xml.sax.ContentHandler):

    def __init__(self):
        self.items = {}
        self.items2 = {}
        self.read = 0
        self.id = 0

    def startDocument(self):
        print("Inconsistencies:\n")

    def startElement(self, tag, attributes): 
        if tag=="name":
            self.read = 1
        if tag=="artikel":
            self.id = attributes["id"]            

    def endElement(self, tag):
        if tag=="name":
            self.read = 0

    def characters(self, content):
        if self.read == 1:
            item = content
            #check whether the item is not yet part of the dictionaries
            if item not in self.items:
                #add item (e.g. "apfel") to both dictionary "items" and  
                #dictionary "items2". The value for the item is the id in the 
                #case of dictionary "items" and "0" in the case of dictionary 
                #"items2". The second dictionary contains the number of 
                #inconsistencies for each product. At the beginning, the 
                #number of inconsistencies for the product is zero.  
                self.items[item] = self.id
                self.items2[item] = 0
            else:
                if self.items[item] == self.id:
                    #increase number of inconsistencies by 1:
                    self.items2[item] = self.items2[item] + 1

    def endDocument(self):
        for prod in self.items2:
            if self.items2[prod]>0:
                print("There are {} different IDs for item \"{}\".".format(
                    self.items2[prod] + 1, prod))


if ( __name__ == "__main__"):

    c = C_Handler()
    xml.sax.parse("lieferungen.xml", c)

The output of this program is as follows:

Inconsistencies:

There are 3 different IDs for item "Kirschen".

As you can see from the file (note the comments marking items with more than one ID), this output is wrong in two ways:

  1. There are only two different IDs for item "Kirschen", not three.
  2. Several items with more than one ID are not mentioned at all (e.g. item "Kohl" has two different IDs)

However, I do not understand what is going wrong in my code.


Solution

  • Unless I've misunderstood, the error is that this line

                    if self.items[item] == self.id:
    

    should be

                    if self.items[item] != self.id:
    

    As it stands, your program appears to be counting consistencies rather than inconsistencies: Kirschen uses the ID 7866 three times and nothing else uses the same ID more than once, hence your output.

    With the above change made, I get the following output:

    Inconsistencies:
    
    There are 3 different IDs for item "apfel".
    There are 2 different IDs for item "Kirschen".
    There are 2 different IDs for item "Kohl".
    

    Having said this, I'm not sure your code would necessarily do what you want all of the time. Try moving the <artikel> element with ID 7116 above all of the other <artikel> elements and then running your code. Your code will then tell you that there are four different IDs for Kirschen, when arguably there are only two.

    The reason for this is that the number of IDs your program outputs for an item is one for the first ID found for that item, plus one for each further <artikel> element with that name whose ID differs from the first.
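
To make the overcounting concrete, here is a small sketch that mimics the corrected counting logic (with `!=`) outside of SAX. The helper name `count_like_handler` and the list of pairs are made up for illustration; the pairs are the Kirschen entries from lieferungen.xml with the 7116 entry moved to the front.

```python
def count_like_handler(pairs):
    """Hypothetical stand-in for the handler's counting scheme (using !=)."""
    first_id = {}    # first ID seen per item, like self.items
    mismatches = {}  # later entries whose ID differs from the first, like self.items2
    for name, art_id in pairs:
        if name not in first_id:
            first_id[name] = art_id
            mismatches[name] = 0
        elif first_id[name] != art_id:
            mismatches[name] += 1
    # The handler reports "mismatches + 1" as the number of different IDs.
    return {name: n + 1 for name, n in mismatches.items()}


# Kirschen entries in document order, with the 7116 entry moved to the top.
kirschen = [
    ("Kirschen", "7116"),
    ("Kirschen", "7866"),
    ("Kirschen", "7866"),
    ("Kirschen", "7866"),
]

reported = count_like_handler(kirschen)
distinct = len({art_id for _, art_id in kirschen})
print(reported["Kirschen"], distinct)  # 4 2
```

Every later entry with ID 7866 differs from the first ID (7116), so each one is counted again even though it adds no new ID.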

    If you really want to count the number of IDs used per product, a better way would be to use sets to store the IDs used for each product as you go through, and then print the lengths of any sets that contain more than one element. Here's what your characters method could look like after making this change - I'll leave it up to you to make the necessary modifications to your endDocument method:

        def characters(self, content):
            if self.read == 1:
                item = content
                #check whether the item is not yet part of the dictionary
                if item not in self.items:
                    self.items[item] = set([self.id])
                else:
                    self.items[item].add(self.id)
    

    Note that in the last line I don't need to check whether the set in self.items[item] already contains self.id. The nice thing about a set is that if you add an ID that's already in the set, nothing happens. The set doesn't end up with duplicate IDs. Note also that I'm no longer using self.items2, as self.items has all the information I need.
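
As a rough idea of what the reporting step might look like once self.items maps each item to a set of IDs, here is a sketch written as a standalone function so it can be run on its own. The name `format_report` is an assumption, not part of the original code; inside the handler, endDocument could run the same loop over self.items and print each line.

```python
def format_report(items):
    """Return one message per item that has more than one distinct ID.

    `items` maps an item name to the set of IDs seen for it.
    """
    return [
        'There are {} different IDs for item "{}".'.format(len(ids), name)
        for name, ids in items.items()
        if len(ids) > 1
    ]


# The sets the handler would end up with for lieferungen.xml.
items = {
    "apfel": {"3526", "4444", "2345"},
    "Kirschen": {"7866", "7116"},
    "Kohl": {"7868", "6745"},
    "Bananen": {"3245"},
    "Ananas": {"7789"},
}

for line in format_report(items):
    print(line)
```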

    You could even go one step further than this. We have to check whether item is in self.items and create a set for that item if it isn't. If we use a defaultdict, then that will take care of creating the set for us if it doesn't already exist. Add the line from collections import defaultdict above your C_Handler class and replace the line self.items = {} with self.items = defaultdict(set). After doing this, your characters method just needs to be the following:

        def characters(self, content):
            if self.read == 1:
                item = content
                self.items[item].add(self.id)
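
Putting the pieces together, a minimal runnable sketch of the defaultdict version might look like the following. The inline `SAMPLE` document is a made-up, trimmed-down input in the shape of lieferungen.xml (it is not the original file), and the `id_counts` dictionary is only there to show the result. Like the code above, it assumes each name arrives in a single characters() call.

```python
import xml.sax
from collections import defaultdict


class C_Handler(xml.sax.ContentHandler):

    def __init__(self):
        super().__init__()
        # Maps item name -> set of IDs seen for that item.
        self.items = defaultdict(set)
        self.read = 0
        self.id = 0

    def startElement(self, tag, attributes):
        if tag == "name":
            self.read = 1
        if tag == "artikel":
            self.id = attributes["id"]

    def endElement(self, tag):
        if tag == "name":
            self.read = 0

    def characters(self, content):
        # Assumes the whole name is delivered in one call.
        if self.read == 1:
            self.items[content].add(self.id)


# Hypothetical sample input, not the original lieferungen.xml.
SAMPLE = b"""<lieferungen xmlns="urn:myspace:lieferungen">
    <artikel id="7868"><name>Kohl</name></artikel>
    <artikel id="3245"><name>Bananen</name></artikel>
    <artikel id="6745"><name>Kohl</name></artikel>
    <artikel id="7868"><name>Kohl</name></artikel>
</lieferungen>"""

handler = C_Handler()
xml.sax.parseString(SAMPLE, handler)

id_counts = {name: len(ids) for name, ids in handler.items.items()}
print(id_counts)  # {'Kohl': 2, 'Bananen': 1}
```

The repeated ID 7868 for Kohl is added to the set twice but stored once, so the count stays at two regardless of the order of the <artikel> elements.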