I'm working with the New York Times Corpus for a project, and I'm having trouble extracting the text content from its XML files.
Each year in the corpus is a single file of hundreds of megabytes that contains one XML document per article from that year.
I want to retrieve the text inside the body.content tag of each article.
The general format of the file for a specific year is something like:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE nitf SYSTEM "http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">
<nitf change.date="June 10, 2005" change.time="19:30" version="-//IPTC//DTD NITF 3.3//EN">
<head>
<title> Article1 </title>
</head>
<body>
<body.content>
</body.content>
</body>
...
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE nitf SYSTEM "http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">
<nitf change.date="June 10, 2005" change.time="19:30" version="-//IPTC//DTD NITF 3.3//EN">
<head>
<title> Article2 </title>
</head>
<body>
<body.content>
</body.content>
</body>
...
This is the class and method I used in my attempt to parse the XML file:
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import java.io.FileWriter;
import java.util.ArrayList;
public class XMLParser {
    // Singleton instance; private so callers go through getParser()
    private static final XMLParser parser = new XMLParser();
    public static final String TEXT_LOCATION = "/txts/";

    private XMLParser() {
    }

    public static XMLParser getParser() {
        return parser;
    }

    public void XMLtoText(String xmlLocation, int year) throws Exception {
        ArrayList<String> text = new ArrayList<String>();
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        // This is the call that throws the SAXParseException
        Document doc = builder.parse(xmlLocation);
        XPathFactory xFactory = XPathFactory.newInstance();
        XPath xpath = xFactory.newXPath();
        // Select the text nodes directly under every body.content element
        XPathExpression expr = xpath.compile("//body.content/text()");
        NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            text.add(nodes.item(i).getNodeValue());
        }
        // try-with-resources closes the writer even if a write fails;
        // I/O errors propagate to the caller instead of being swallowed
        try (FileWriter writer = new FileWriter(TEXT_LOCATION + year + ".txt")) {
            for (String str : text) {
                writer.write(str);
            }
        }
    }
}
This is the error I get when trying to parse.
[Fatal Error] nitf-3-3.dtd:1:3: The markup declarations contained or pointed to by the document type declaration must be well-formed.
org.xml.sax.SAXParseException; systemId: http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd; lineNumber: 1; columnNumber: 3; The markup declarations contained or pointed to by the document type declaration must be well-formed.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
at ____.XMLParser.XMLtoText(XMLParser.java:45)
at ____.Main.main(Main.java:23)
Is there a way to split this huge XML file into one XML file per article? That would make it easier to extract the text from each article without running into the invalid-XML problem. I tried removing the XML declaration and the DOCTYPE nitf line from every article except the first, but that did not resolve the issue. Removing the DOCTYPE (the second line) from the first article does let the parser get as far as the second XML declaration, where the invalid XML stops it from continuing.
PROBLEM: Your files simply aren't "well formed XML".
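They seem to be a BUNCH of different XML stanzas, all glommed together in a single file. A well-formed XML document has exactly one root element, and an XML declaration may appear only at the very start, so the parser gives up as soon as it reaches the second stanza.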
So yes, you MUST "split this huge XML file into multiple XML files".
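To make the well-formedness rule concrete, here is a minimal sketch using the JDK's built-in parser. The class and method names (`WellFormedCheck`, `isWellFormed`) are made up for illustration and are not part of the original code. It shows a single document being accepted and two concatenated documents being rejected.

```java
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.StringReader;

// Hypothetical helper, for illustration only.
final class WellFormedCheck {

    /** Returns true if the text parses as one well-formed XML document. */
    static boolean isWellFormed(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Thrown when the input is not well-formed
            return false;
        }
    }
}
```

Calling `isWellFormed` on `<a/>` should return true, while the same call on two declarations and two root elements in one string should return false, which is the situation in the corpus files.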
SUGGESTIONS:
The "delimiter" that tells you where one XML stanza ends and the next one begins seems to be <?xml version="1.0" encoding="UTF-8"?>. Use it!
Write a script that reads the "big file" line by line, copying each line into the current "small file" until it hits the next <?xml ...?> header. At that point it closes the current "small file", opens the next one, and continues copying, a stanza at a time.
Instead of writing separate files, you can do the same thing in memory: copy each stanza into a Java String and parse it, one stanza at a time.
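The in-memory variant could look roughly like the sketch below. It rests on some assumptions: every stanza starts on its own line with `<?xml`, the DTD is not needed to extract text (so the DOCTYPE reference is resolved to an empty document rather than fetched), and the names `StanzaSplitter`, `split` and `bodyText` are made up for this example.

```java
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class StanzaSplitter {
    // Assumption: each article begins on a line that starts with this prefix
    private static final String DELIMITER = "<?xml";

    /** Reads line by line and returns one string per XML stanza. */
    static List<String> split(BufferedReader reader) throws IOException {
        List<String> stanzas = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().startsWith(DELIMITER)) {
                flush(current, stanzas); // a new stanza begins here
            }
            current.append(line).append('\n');
        }
        flush(current, stanzas); // the last stanza has no trailing delimiter
        return stanzas;
    }

    private static void flush(StringBuilder current, List<String> stanzas) {
        // Trim so the XML declaration is the very first thing in the string
        String stanza = current.toString().trim();
        if (!stanza.isEmpty()) {
            stanzas.add(stanza);
        }
        current.setLength(0);
    }

    /** Parses one stanza and returns the text inside body.content. */
    static String bodyText(String stanza) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Assumption: the DTD is not needed, so resolve it to an empty
        // document instead of downloading it from the DOCTYPE URL
        builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        Document doc = builder.parse(new InputSource(new StringReader(stanza)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // string(...) returns the text of the first body.content element
        return xpath.evaluate("string(//body.content)", doc).trim();
    }
}
```

You would wrap the year file in a `BufferedReader`, call `split`, and pass each stanza to `bodyText`. For files of hundreds of megabytes it is probably better to parse and write out each stanza as soon as it is complete instead of collecting them all in a list first.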