pdf pdfbox apache-tika

How to get style information of elements in PDF using Apache Tika?


I am experimenting with Apache Tika to extract text from PDF files. How can I get style information with Tika, such as font size, text color, and whether a specific piece of text (a few words) is in italics or bold?

Is it even possible to get this type of information?

I would also like to know whether it is possible to get table information using Apache Tika, such as the start of a table, the start of the first row, the first cell, and so on.


Solution

  • It is probably more convenient to use another API such as PDFTextStream. Tika extracts the raw text from a PDF, while PDFTextStream gives you structured text with correlated information such as character encoding, height, and the region of the text.
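
Whichever library ends up reporting the font for a piece of text, bold and italic are often not exposed as simple flags, so a common workaround is to guess them from the font name. The sketch below shows that heuristic only. The class `FontStyleGuess` and its methods are hypothetical names made up for illustration, and the font name is assumed to come from a lower-level library (for example, PDFBox, which Tika uses for PDF parsing as far as I know, reports a font per glyph through its `TextPosition` objects). Font names such as `Helvetica-Bold` and `Times-Italic` are standard PDF font names; the six-uppercase-letter prefix followed by `+` is the tag that embedded subset fonts carry.

```java
import java.util.Locale;

/** Hypothetical helper: guesses bold/italic from a PDF font name. A heuristic, not a guarantee. */
public class FontStyleGuess {

    /** Strips the subset tag of an embedded font, e.g. "ABCDEF+Arial-BoldMT" becomes "Arial-BoldMT". */
    public static String baseName(String fontName) {
        int plus = fontName.indexOf('+');
        // A subset tag is exactly six uppercase letters followed by '+'.
        if (plus == 6 && fontName.substring(0, 6).matches("[A-Z]{6}")) {
            return fontName.substring(7);
        }
        return fontName;
    }

    /** True if the name contains "bold" (case-insensitive); this also matches "SemiBold". */
    public static boolean isBold(String fontName) {
        return baseName(fontName).toLowerCase(Locale.ROOT).contains("bold");
    }

    /** True if the name contains "italic" or "oblique" (case-insensitive). */
    public static boolean isItalic(String fontName) {
        String n = baseName(fontName).toLowerCase(Locale.ROOT);
        return n.contains("italic") || n.contains("oblique");
    }

    public static void main(String[] args) {
        String[] samples = {"Helvetica-Bold", "Times-Italic", "ABCDEF+Arial-BoldItalicMT", "Courier"};
        for (String s : samples) {
            System.out.println(s + " bold=" + isBold(s) + " italic=" + isItalic(s));
        }
    }
}
```

This will miss fonts whose names do not encode the style, and it says nothing about color or tables, so treat it as a starting point rather than a reliable detector.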