
Text file to XML file (Problem with the structure)


I want to convert a text file to an XML file with a specific structure. I want to split the text into paragraphs and group these paragraphs into chapters. For example, every chapter should have 3 paragraphs. The root element of the XML is called "Book".

For example, I have this text file:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Velit scelerisque in dictum non consectetur a erat. Sit amet justo donec enim diam vulputate. Id aliquet lectus proin nibh nisl condimentum id venenatis a.

Eget gravida cum sociis natoque penatibus et magnis dis. Habitant morbi tristique senectus et netus et. Interdum consectetur libero id faucibus nisl tincidunt eget nullam.

I want an XML file that contains one chapter with these 3 paragraphs.

Here is my code:

Chapter class:

@Data
@AllArgsConstructor
@NoArgsConstructor
public class Chapter {

    private String paragraph;
    private List<String> sentence;
    private List<String> words;
}

My main code:

public static void main(String[] args) {
    String textInputFile = "xml_files/sample.txt";
    String xmlFileOutput = "xml_files/sample.xml";

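    // Open both resources in try-with-resources so they are closed automatically.
    try (Scanner inputfile = new Scanner(new File(textInputFile));
         FileOutputStream outXML = new FileOutputStream(xmlFileOutput)) {
        convertToXml(inputfile, outXML);
    }
    catch (Exception e) {
        // Report failures instead of silently swallowing them.
        e.printStackTrace();
    }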
}

private static void  convertToXml(Scanner inputfile, FileOutputStream outXML) throws XMLStreamException {
    XMLOutputFactory output = XMLOutputFactory.newInstance();
    XMLStreamWriter writer = output.createXMLStreamWriter(outXML);
    writer.writeStartDocument("utf-8", "1.0");
    writer.writeCharacters("\n");
    // <book> is the root element
    writer.writeStartElement("book");
    // one <Chapter> is written per input line
    while (inputfile.hasNext()){
        String line = inputfile.nextLine();
        Chapter chapter = getChapter(line);
        writer.writeCharacters("\n\t");
        writer.writeStartElement("Chapter");
        writer.writeCharacters("\n\t\t");
        writer.writeStartElement("Paragraph");
        writer.writeCharacters(chapter.getParagraph()+"");
        writer.writeEndElement();
        writer.writeCharacters("\n\t\t");
        writer.writeStartElement("Sentence");
        writer.writeCharacters(chapter.getSentence()+"");
        writer.writeEndElement();
        writer.writeCharacters("\n\t");
        writer.writeEndElement();
    }
    writer.writeCharacters("\n");
    writer.writeEndElement();
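    writer.writeEndDocument();
    // Flush any buffered output. close() frees the writer but leaves outXML open.
    writer.flush();
    writer.close();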
}

private static Chapter getChapter(String line){
    String[] paragraphs = line.split("\\r?\\n");
    String[] sentences = line.split("(?<=(?<![A-Z])\\.)");
    Chapter chapter = new Chapter();
    chapter.setParagraph(List.of(paragraphs));
    chapter.setSentence(List.of(sentences));
    return chapter;
}

The code above also splits each paragraph into sentences, but I don't have any problem with that part.

My output:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<book>

<Chapter Paragraph="Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.">
<Paragraph> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</Paragraph>
<Sentence>[Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.]
</Chapter>

   <Chapter Paragraph="" Sentences="[]">
        <Paragraph/>
        <Sentences>[]</Sentences>
    </Chapter>

<Chapter Paragraph="Velit scelerisque in dictum non consectetur a erat. Sit amet justo donec enim diam vulputate. Id aliquet lectus proin nibh nisl condimentum id venenatis a.">
<Paragraph> Velit scelerisque in dictum non consectetur a erat. Sit amet justo donec enim diam vulputate. Id aliquet lectus proin nibh nisl condimentum id venenatis a.</Paragraph>
<Sentence>[Velit scelerisque in dictum non consectetur a erat , Sit amet justo donec enim diam vulputate, Id aliquet lectus proin nibh nisl condimentum id venenatis a.]
</Chapter>

  (...)        

</book>

In the second chapter you can see that the paragraph and the sentence list are empty. How can I prevent these empty chapters from being written (every chapter with values is followed by an empty one)? My second question: how can I put several paragraphs into one chapter? For example, I want every chapter to include 3 paragraphs. Imagine that I have a text file with 10000 lines and I want to structure it as XML.


Solution

  • First question: your input contains empty lines between the Lorem Ipsum paragraphs, and Scanner.nextLine() returns these lines too. Each of them produces a Chapter with an empty <Paragraph/> and an empty sentence list in the output. To skip them, add

    if (line.isEmpty()) {
        continue;
    }
    

    to your loop, right after the call to inputfile.nextLine().
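The blank-line check can be tried in isolation with a small helper. This is only a sketch: the class name BlankLineFilter and the method readParagraphs are made up for illustration, and it uses trim() so that lines containing only spaces or tabs are skipped as well, which goes slightly beyond the isEmpty() check above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper, not part of the original code.
class BlankLineFilter {

    // Reads every line from the scanner and keeps only lines with visible content.
    static List<String> readParagraphs(Scanner input) {
        List<String> paragraphs = new ArrayList<>();
        // hasNextLine() pairs with nextLine(); hasNext() checks for tokens instead.
        while (input.hasNextLine()) {
            String line = input.nextLine();
            // Assumption: whitespace-only lines should be treated like empty lines.
            if (line.trim().isEmpty()) {
                continue;
            }
            paragraphs.add(line);
        }
        return paragraphs;
    }
}
```

Calling BlankLineFilter.readParagraphs(new Scanner(new File(textInputFile))) on the sample file should then return only the three Lorem Ipsum paragraphs, without the empty entries in between.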

    Second question: you could keep adding paragraphs to the current Chapter until it holds three and then start a new one, for example:

    private static void convertToXml(Scanner inputfile, FileOutputStream outXML) throws XMLStreamException {
        List<Chapter> chapters = new ArrayList<Chapter>();
    
        {
            Chapter chapter = null;
    
            while (inputfile.hasNext()) {
                String line = inputfile.nextLine();
    
                if (line.isEmpty()) {
                    continue;
                }
    
                String[] sentences = line.split("(?<=(?<![A-Z])\\.)");
    
                if (chapter == null) {
                    chapter = new Chapter();
                }
    
                chapter.getParagraph().add(line);
                chapter.getSentence().addAll(List.of(sentences));
    
                if (chapter.getParagraph().size() >= 3) {
                    chapters.add(chapter);
                    chapter = null;
                }
            }
    
            if (chapter != null) {
                chapters.add(chapter);
            }
        }
    
        XMLOutputFactory output = XMLOutputFactory.newInstance();
        XMLStreamWriter writer = output.createXMLStreamWriter(outXML);
        writer.writeStartDocument("utf-8", "1.0");
        writer.writeCharacters("\n");
        writer.writeStartElement("book");
        writer.writeCharacters("\n");
    
        for (Chapter chapter : chapters) {
            writer.writeCharacters("\t");
            writer.writeStartElement("Chapter");
            writer.writeCharacters("\n");
    
            for (String paragraph : chapter.getParagraph()) {
                writer.writeCharacters("\t\t");
                writer.writeStartElement("Paragraph");
                writer.writeCharacters(paragraph);
                writer.writeEndElement();
                writer.writeCharacters("\n");
            }
    
            writer.writeCharacters("\t\t");
            writer.writeStartElement("Sentence");
            writer.writeCharacters(chapter.getSentence()+"");
            writer.writeEndElement();
            writer.writeCharacters("\n\t");
            writer.writeEndElement();
            writer.writeCharacters("\n");
        }
    
        writer.writeCharacters("\n");
        writer.writeEndElement();
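        writer.writeEndDocument();
        // Flush any buffered output. close() frees the writer but leaves outXML open.
        writer.flush();
        writer.close();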
    }
    

    This goes with a Chapter.java like

    public class Chapter {
    
        private List<String> paragraph = new ArrayList<String>();
        private List<String> sentence = new ArrayList<String>();
    
        public List<String> getParagraph() {
            return paragraph;
        }
    
        public List<String> getSentence() {
            return sentence;
        }
    }
    

    With this, getChapter() is no longer needed. You may also want to put the plain-text reading and the XML generation into separate methods.

    Be aware that this proposal keeps all Chapter objects and paragraph strings in memory. To avoid that, you can merge input processing and output generation back together; I only separated the two to better illustrate how the paragraphs are collected. You could write out a Chapter as soon as it has collected 3 paragraphs, plus once more after the loop (in case a partially filled Chapter remains), and never grow a List<Chapter>.
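A rough sketch of that streaming variant follows. It is not a drop-in replacement: the class name StreamingBookWriter, the constant PARAGRAPHS_PER_CHAPTER and the helper closeChapter are invented for this example, it writes to a java.io.Writer instead of a FileOutputStream so it is easy to try out, and it leaves out the <Sentence> element for brevity.

```java
import java.io.Writer;
import java.util.Scanner;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical class: writes each <Chapter> while reading, without a List<Chapter>.
class StreamingBookWriter {

    // Assumption: a chapter is considered full after this many paragraphs.
    private static final int PARAGRAPHS_PER_CHAPTER = 3;

    static void convertToXml(Scanner input, Writer out) throws XMLStreamException {
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartDocument("1.0");
        writer.writeCharacters("\n");
        writer.writeStartElement("book");
        writer.writeCharacters("\n");

        int inChapter = 0; // paragraphs written to the currently open <Chapter>
        while (input.hasNextLine()) {
            String line = input.nextLine();
            if (line.isEmpty()) {
                continue; // skip the blank lines between paragraphs
            }
            if (inChapter == 0) {
                // First paragraph of a new chapter: open the element.
                writer.writeCharacters("\t");
                writer.writeStartElement("Chapter");
                writer.writeCharacters("\n");
            }
            writer.writeCharacters("\t\t");
            writer.writeStartElement("Paragraph");
            writer.writeCharacters(line);
            writer.writeEndElement();
            writer.writeCharacters("\n");
            inChapter++;
            if (inChapter == PARAGRAPHS_PER_CHAPTER) {
                closeChapter(writer);
                inChapter = 0;
            }
        }
        if (inChapter > 0) {
            // A partially filled chapter is still open after the loop.
            closeChapter(writer);
        }
        writer.writeEndElement(); // </book>
        writer.writeEndDocument();
        writer.flush();
        writer.close(); // does not close the underlying Writer
    }

    private static void closeChapter(XMLStreamWriter writer) throws XMLStreamException {
        writer.writeCharacters("\t");
        writer.writeEndElement(); // </Chapter>
        writer.writeCharacters("\n");
    }
}
```

Only a counter and the current line are held in memory here, so the size of the input file should not matter much. If the sentences are needed as well, they would have to be written per paragraph or collected per chapter, which this sketch does not attempt.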