Search code examples
pdffontsextractfont-size

Extract text from PDF in respect to formatting (font size, type etc)


Is possible to extract text from a PDF file concerning specific font/font size/font colour etc.? I prefer Perl, python or *nix command-line utilities. My goal is to extract all headlines from PDF file so I will have a nice index of articles contained in a single PDF.


Solution

  • Text and /font/font size/position (no color, as I checked) you can get from Ghostscript's txtwrite device (try -dTextFormat=0 | 1 options), as well as from mudraw's (MuPDF) with -tt option. Then parse XML-like output with e.g. Perl.