
How to extract data from a <dc> tag in Java?


I am trying to extract the <dc:title> element from an EPUB in Java. I tried using

doc.getDocumentElement().getElementsByTagName("dc:title");

but printing the result only showed 2nd element :com.sun.org.apache.xerces.internal.dom.DeepNodeListImpl. How can I extract the content of <dc:title>?

Here is my code:

File fXmlFile = new File("file directory");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
doc.getDocumentElement().normalize();

System.out.println("1st element :" + doc.getElementsByTagName("dc"));
System.out.println("2nd element :" + doc.getDocumentElement().getElementsByTagName("dc:title"));

System output:

1st element : com.sun.org.apache.xerces.internal.dom.DeepNodeListImpl@4f53e9be
2nd element :com.sun.org.apache.xerces.internal.dom.DeepNodeListImpl@e16e1a2

Added Sample Data

<dc:title>
  <![CDATA[someData]]>
</dc:title>
<dc:creator>
  <![CDATA[someData]]>
</dc:creator>
<dc:language>someData</dc:language>

Solution

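  • The method getElementsByTagName(String) returns a NodeList of all matching elements (note the plural 's'), which is why printing it shows DeepNodeListImpl@... instead of the text. You then need to pick the element you want, for example with .item(index), which gives you a Node instance. You can then call getNodeValue() on that Node object, but note that it returns null for an element node, because the text is held in the element's child nodes.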

    EDITED: because of the CDATA section, use Node.getTextContent() instead:

    NodeList elems = doc.getElementsByTagName("dc:title");
    Node item = elems.item(0);
    System.out.println(item.getTextContent());
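
For reference, the approach above can be put together as a small self-contained program. This is only a sketch under a few assumptions: the class name DcTitleExample, the helper firstText, and the inline SAMPLE string are made up for illustration (the sample is modelled on the data in the question, not taken from a real EPUB), and the XML is parsed from a string rather than from a file. It also checks getLength() before calling item(0), since item(0) returns null when nothing matches, and trims the result because the whitespace around the CDATA section is part of the text content.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class DcTitleExample {

    // Hypothetical sample input modelled on the question's data;
    // a real EPUB package file contains many more elements.
    static final String SAMPLE =
        "<metadata xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
        + "<dc:title>\n  <![CDATA[someData]]>\n</dc:title>"
        + "<dc:language>someData</dc:language>"
        + "</metadata>";

    /**
     * Returns the trimmed text of the first element whose tag name
     * matches qualifiedName (e.g. "dc:title"), or null if there is none.
     */
    static String firstText(String xml, String qualifiedName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            // getElementsByTagName returns a list, even for a single match.
            NodeList elems = doc.getElementsByTagName(qualifiedName);
            if (elems.getLength() == 0) {
                return null;
            }

            // getTextContent concatenates the text and CDATA children.
            Node item = elems.item(0);
            return item.getTextContent().trim();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstText(SAMPLE, "dc:title")); // prints "someData"
    }
}
```

To read from a file as in the question, the ByteArrayInputStream can be swapped for the File object passed to parse(); the rest stays the same.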