C++ Arabica (over Xerces-c) getNodeValue() method does not return the actual value


I am using Arabica (wrapping Xerces-C) to parse XML. The sample code below returns the correct names from getNodeName(), but getNodeValue() returns an empty string for every element node:

bool readXML(bfs::path xmlfullfile) 
{
  // first check to see if the file exists
  if (!bfs::is_regular_file(xmlfullfile)) return false;

  Arabica::SAX2DOM::Parser<std::string> domParser;
  Arabica::SAX::CatchErrorHandler<std::string> eh;
  Arabica::DOM::Document<std::string> xmlDoc; 
  Arabica::SAX::InputSource<std::string> is;

  domParser.setErrorHandler(eh);
  is.setSystemId(xmlfullfile.string());
  domParser.parse(is);

  if(!eh.errorsReported()) 
  {
    xmlDoc = domParser.getDocument();
    xmlDoc.normalize();

    Arabica::DOM::NodeList<std::string> objects = xmlDoc.getElementsByTagName("object");
    for (size_t t = 0; t < objects.getLength(); t++) 
    {
      Arabica::DOM::Node<std::string> object = objects.item(t);
      Arabica::DOM::NodeList<std::string> values = object.getChildNodes(); 
      for (size_t u = 0; u < values.getLength(); u++) 
      {
        values.item(u).normalize(); 
        std::string name = values.item(u).getNodeName(); 
        std::string val = values.item(u).getNodeValue(); 
        std::cout << "Node streaming = \"" << values.item(u) << "\", meaning that name = \"" << name << "\" and value = \"" << val << "\"" << std::endl; 
      }
    }
    return true;
  } else {
    std::cerr << eh.errors() << std::endl;
    eh.reset();
    return false;
  }
}

The sample XML I'm trying to parse is:

<annotation>
    <filename>1a.jpg</filename>
    <folder>Sample</folder>
    <source>
        <database>Some database</database>
        <annotation>Annotator</annotation>
        <image>Some source</image>
    </source>
    <size>
        <width>3264</width>
        <height>1840</height>
        <depth>0</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>somename</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <occluded>0</occluded>
        <bndbox>
            <xmin>48</xmin>
            <ymin>671</ymin>
            <xmax>3213</xmax>
            <ymax>1616</ymax>
        </bndbox>
    </object>
</annotation>

The output looks similar to this:

Node streaming = "
                ", meaning that name = "#text" and value = "
                "
Node streaming = "<name>somename</name>", meaning that name = "name" and value = ""
Node streaming = "
                ", meaning that name = "#text" and value = "
                "
Node streaming = "<pose>Unspecified</pose>", meaning that name = "pose" and value = ""
Node streaming = "
                ", meaning that name = "#text" and value = "
                "
Node streaming = "<truncated>0</truncated>", meaning that name = "truncated" and value = ""
Node streaming = "
                ", meaning that name = "#text" and value = "
                "
Node streaming = "<difficult>0</difficult>", meaning that name = "difficult" and value = ""
Node streaming = "
                ", meaning that name = "#text" and value = "
                "
Node streaming = "<occluded>0</occluded>", meaning that name = "occluded" and value = ""
Node streaming = "
                ", meaning that name = "#text" and value = "
                "
Node streaming = "<bndbox>
                        <xmin>48</xmin>
                        <ymin>671</ymin>
                        <xmax>3213</xmax>
                        <ymax>1616</ymax>
                </bndbox>", meaning that name = "bndbox" and value = ""
Node streaming = "
        ", meaning that name = "#text" and value = "
        "

Not quite sure what I'm doing wrong. Since getNodeName() returns the correct name (when it's not #text of course), the fact that getNodeValue() doesn't return anything makes me wonder.


Solution

  • I found a solution after comparing my code with some other XML libraries. In the DOM, an element node has no value of its own: its text lives in a child text node, which is why getNodeValue() on the element returns an empty string. To read the text of a simple leaf element such as <name>somename</name>, take the element's first child and call getNodeValue() on that. Not sure if the way I'm doing it is the best way, but here is the code in case someone else has the same problem:

    for (size_t u = 0; u < values.getLength(); u++) 
    {
      string name = values.item(u).getNodeName();
      if (name == "#text") continue;
      string val = values.item(u).getFirstChild().getNodeValue(); 
      cout << "Node streaming = \"" << values.item(u) << "\", meaning that name = \"" << name << "\" and value = \"" << val << "\"" << endl; 
    }
    

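    Note: This code is only half of the solution. Production code should take into account that not all nodes are simple leaf nodes: the first child of <bndbox> is a whitespace text node, not the data you want, and an empty element has no first child at all.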
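
To make the note above concrete, here is a minimal sketch of how the non-leaf case could be handled. It deliberately does not use Arabica: ToyNode, makeText, makeElement, append, textContent, isBlank and isSimpleLeaf are all hypothetical names for a toy stand-in that only mimics the DOM rule that an element's own value is empty and its text sits in child text nodes. Treat it as an illustration of the idea, not as the library's API.

```cpp
#include <cassert>
#include <cctype>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a DOM node (hypothetical, NOT Arabica's API).
struct ToyNode {
    enum class Kind { Element, Text };
    Kind kind = Kind::Element;
    std::string name;   // tag name, or "#text" for text nodes
    std::string value;  // character data for text nodes; stays empty for elements
    std::vector<std::unique_ptr<ToyNode>> children;
};

std::unique_ptr<ToyNode> makeText(const std::string& data) {
    auto n = std::make_unique<ToyNode>();
    n->kind = ToyNode::Kind::Text;
    n->name = "#text";
    n->value = data;
    return n;
}

std::unique_ptr<ToyNode> makeElement(const std::string& tag) {
    auto n = std::make_unique<ToyNode>();
    n->kind = ToyNode::Kind::Element;
    n->name = tag;
    return n;
}

// Attach a child and return a reference to it so trees can be built step by step.
ToyNode& append(ToyNode& parent, std::unique_ptr<ToyNode> child) {
    assert(child);
    parent.children.push_back(std::move(child));
    return *parent.children.back();
}

// Concatenate the character data of all descendant text nodes, in document order.
// Safe for empty elements: with no children it simply returns "".
std::string textContent(const ToyNode& node) {
    if (node.kind == ToyNode::Kind::Text) return node.value;
    std::string out;
    for (const auto& child : node.children) out += textContent(*child);
    return out;
}

// True if the string contains only whitespace (such as indentation between tags).
bool isBlank(const std::string& s) {
    for (unsigned char ch : s) {
        if (!std::isspace(ch)) return false;
    }
    return true;
}

// True for an element whose children (if any) are all text nodes.
bool isSimpleLeaf(const ToyNode& node) {
    if (node.kind != ToyNode::Kind::Element) return false;
    for (const auto& child : node.children) {
        if (child->kind != ToyNode::Kind::Text) return false;
    }
    return true;
}
```

With a tree built like the <object> element from the question, textContent on the <name> node gives "somename" even though the node's own value is empty, isSimpleLeaf is false for <bndbox>, and isBlank can be used to skip the indentation-only #text nodes. With Arabica I would expect the same walk to be written with getFirstChild(), getNextSibling() and getNodeType(), but I have not tested that version.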