
How do I read a Word document with bold and italic formatting using POI?


I am using Apache POI.

I am able to read the text from a .doc file using "org.apache.poi.hwpf.extractor.WordExtractor".

I have also fetched the tables using "org.apache.poi.hwpf.usermodel.Table".

How can I fetch the bold/italic formatting of the text?

Thanks in advance.


Solution

  • WordExtractor returns only the text, nothing else.

    The simplest way to get the text and formatting of a Word document is to switch to Apache Tika. Apache Tika builds on top of Apache POI (amongst others), and offers both plain text extraction and rich extraction (XHTML with formatting).

    Alternatively, if you want to write the code yourself, I'd suggest you review the code in Tika's WordExtractor, which demonstrates how to use Apache POI to get the formatting information out of runs of text.
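
If you go the write-it-yourself route, a minimal sketch of the idea might look like the following. It is not Tika's actual code. It assumes a binary .doc file whose path is passed as the first argument, and it uses HWPF's Range and CharacterRun classes to check each run for bold and italic. The class name RunFormattingDump and the markup helper are made up for illustration, and the <b>/<i> markers are just one way to make the formatting visible.

```java
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Range;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class RunFormattingDump {

    // Hypothetical helper (not part of POI): wraps the text in <b>/<i>
    // markers so the formatting of each run shows up in the output.
    static String markup(String text, boolean bold, boolean italic) {
        String out = text;
        if (italic) {
            out = "<i>" + out + "</i>";
        }
        if (bold) {
            out = "<b>" + out + "</b>";
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is the path to a binary .doc file.
        try (InputStream in = new FileInputStream(args[0])) {
            HWPFDocument doc = new HWPFDocument(in);
            Range range = doc.getRange();

            // A character run is a stretch of text that shares the same
            // character formatting, so bold/italic is queried per run.
            for (int i = 0; i < range.numCharacterRuns(); i++) {
                CharacterRun run = range.getCharacterRun(i);
                // run.text() is the raw text; it can include paragraph
                // marks and other control characters.
                System.out.print(markup(run.text(), run.isBold(), run.isItalic()));
            }
        }
    }
}
```

This only covers .doc (HWPF) files and only the bold and italic flags. Tables, fields and other structures need extra handling, which is where looking at the Tika code helps.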