
How to add multiple equations inline with text in Apache POI Word?


I am converting text containing LaTeX-style equations into an MS Word document using Apache POI. With some help I was able to get this working, but if a line contains more than one equation, the result is incorrect.

The following is my code:

import java.io.*;
import org.apache.poi.xwpf.usermodel.*;

import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;

import uk.ac.ed.ph.snuggletex.SnuggleInput;
import uk.ac.ed.ph.snuggletex.SnuggleEngine;
import uk.ac.ed.ph.snuggletex.SnuggleSession;

import java.io.IOException;

public class CreateWordFormulaFromMathML {

 static File stylesheet = new File("MML2OMML.XSL");
 static TransformerFactory tFactory = TransformerFactory.newInstance();
 static StreamSource stylesource = new StreamSource(stylesheet); 

 static CTOMath getOMML(String mathML) throws Exception {
  Transformer transformer = tFactory.newTransformer(stylesource);

  StringReader stringreader = new StringReader(mathML);
  StreamSource source = new StreamSource(stringreader);

  StringWriter stringwriter = new StringWriter();
  StreamResult result = new StreamResult(stringwriter);
  transformer.transform(source, result);

  String ooML = stringwriter.toString();
  stringwriter.close();

  CTOMath ctOMath = CTOMath.Factory.parse(ooML);
  return ctOMath.getOMathArray(0);
 }

 public static void main(String[] args) throws Exception {

  XWPFDocument document = new XWPFDocument();

  String mstr = "The expression is as: $ax^2 + bx = c$ is easier to understand than $$ax^2 + \\frac{\\sin^{-1}\\theta}{\\cot{-1}} \\times y_1$$ or anything in \\[ ay^2 + b_2 \\theta\\]";

  XWPFParagraph paragraph = document.createParagraph();
  XWPFRun run = paragraph.createRun();
 // run.setText("");

  SnuggleEngine engine = new SnuggleEngine();
  SnuggleSession session = engine.createSession();

  SnuggleInput input = new SnuggleInput(mstr);
  session.parseInput(input);

  String mathML = session.buildXMLString();
  System.out.println("Input " + input.getString() + " was converted to:\n" + mathML + "\n\n");


  for (String s : mathML.split("\\s+(?=<math)|(?<=</math>)\\s+")) {

   if (s.startsWith("<math")) {
    CTOMath ctOMath = getOMML(s);
    System.out.println(s);

    CTP ctp = paragraph.getCTP();
    ctp.setOMathArray(new CTOMath[]{ctOMath});
   } else {
    run.setText(s + " ");
    System.out.println(s);
   }
  }

  document.write(new FileOutputStream("CreateWordFormulaFromMathML.docx"));
  document.close();

 }
}

This produces a document containing:

The expression is as: is easier to understand than or anything in ay^2+b_2 \theta

Note: (ay^2+b_2 \theta) is rendered correctly in Word equation format.

What I need is inline text with multiple equations in the middle.


Solution

  • How to approach solving tasks for creating Office OpenXML files such as *.docx?

    Office OpenXML files such as *.docx are simply ZIP archives. We can unzip them and look at their internals. In a *.docx we find /word/document.xml, which contains the XML describing the document structure. For paragraphs with inline formulas we find something like:

    <w:p>
     <w:r>
      <w:t>text</w:t>
     </w:r>
     <m:oMath>... </m:oMath>
     <w:r>
      <w:t>text</w:t>
     </w:r>
     <m:oMath>... </m:oMath>
     ...
    </w:p>
    

    So we need multiple runs holding the text and between them multiple <m:oMath>... </m:oMath>.
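
    To look at that XML without unzipping by hand, the following is a minimal sketch using only java.util.zip. The class name DocxPeek and the helper readEntry are made up for this illustration and are not part of Apache POI; note that inside the archive the entry is named word/document.xml, without the leading slash.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration only; not part of Apache POI.
public class DocxPeek {

    // Returns the named ZIP entry decoded as UTF-8 text,
    // or null if the archive contains no entry with that name.
    static String readEntry(InputStream in, String entryName) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals(entryName)) {
                    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                    byte[] chunk = new byte[8192];
                    int n;
                    // read() returns -1 at the end of the current entry
                    while ((n = zip.read(chunk)) != -1) {
                        buffer.write(chunk, 0, n);
                    }
                    return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java DocxPeek CreateWordFormulaFromMathML.docx
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(readEntry(in, "word/document.xml"));
        }
    }
}
```

    Running this against a document saved from Word that has text and formulas on one line should show the alternating <w:r> and <m:oMath> elements described above.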

    That is why the paragraph has an OMathArray of type CTOMath[]. Your code overwrites this array with a new single-element array each time it finds a CTOMath, so only the last equation survives. Instead, each CTOMath that is found needs to be added to the existing array.

    To find out what we can do with org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP paragraphs, we need documentation for it. The best I have found is grepcode.com. There we find CTP.addNewOMath() and CTP.setOMathArray(int, CTOMath).

    So changing your code like this:

      for(String s : mathML.split("\\s+(?=<math)|(?<=</math>)\\s+")){
    
       if (s.startsWith("<math")) {
        CTOMath ctOMath = getOMML(s);
        System.out.println(s);
    
        CTP ctp = paragraph.getCTP();
        ctp.addNewOMath();
        ctp.setOMathArray(ctp.sizeOfOMathArray()-1, ctOMath);        
       }
       else {
        run = paragraph.createRun();
        run.setText(s + " ");
        System.out.println(s);
       }
      }
    

    should work.
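
    The loop above depends on the split pattern separating plain text from <math> elements correctly. The following is a small standalone sketch to check that, using only the standard library and no POI or SnuggleTeX. The class name SplitMathSegments, the helper describe and the sample string are invented for this illustration; the real SnuggleTeX output carries namespaces and more markup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class for illustration only.
public class SplitMathSegments {

    // Same pattern as in the loop: split on whitespace that is directly
    // followed by "<math" or directly preceded by "</math>".
    static final String SEPARATOR = "\\s+(?=<math)|(?<=</math>)\\s+";

    // Labels each segment the same way the loop decides between
    // creating an equation and creating a text run.
    static List<String> describe(String mixed) {
        List<String> out = new ArrayList<>();
        for (String s : mixed.split(SEPARATOR)) {
            out.add((s.startsWith("<math") ? "MATH: " : "TEXT: ") + s);
        }
        return out;
    }

    public static void main(String[] args) {
        // Simplified stand-in for the SnuggleTeX output.
        String sample = "Text before <math><mi>a</mi></math> text between "
                + "<math><mi>b</mi></math> text after";
        describe(sample).forEach(System.out::println);
    }
}
```

    One caveat follows from the pattern itself: it only splits where whitespace separates text and math. If text touches a <math> element directly, both end up in the same segment, so that case would need separate handling.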