
Apache FOP, Unicode. Text is not searchable


I've got a problem with some legacy Java code that renders PDF files.

We're using Apache FOP:

Implementation-Title: Fop
Implementation-Version: 0.20.5 
Implementation-Vendor: Apache Software Foundation (http://xml.apache.org/fop/)

With options set to:

<configuration>
  <fonts>
    <font metrics-file="arialuni.xml"
          embed-file="ARIALUNI.TTF" kerning="yes">
      <font-triplet name="arialuni" style="normal" weight="normal"/>
    </font>
  </fonts>
</configuration>

The PDF renders correctly, but there is one big problem: I can't search for text in the file, and if I copy and paste text from it, I get a lot of box symbols (□).

As far as I understand, arialuni.ttf (the Unicode version of Arial, I suppose) is what causes the trouble. Are there any known solutions? Is it possible to fix this through the font configuration?

Thanks in advance.

PS: I'm not allowed to switch to any other pdf-rendering library, or upgrade an existing one.

Edit #1

Thank you all for your answers. We'll probably drop Unicode support for now and upgrade to version 1.0 later.


Solution

  • The best solution is to slap your boss and get him to approve some time for an upgrade to Apache FOP 1.0 or later. Seriously.

    The only alternative is to pass "-enc ansi" as a parameter to the "TTFReader" application when you generate the XML font metrics file. That makes FOP 0.20.5 use WinAnsi encoding instead of CID encoding. The downside: you're restricted to the 8-bit WinAnsi encoding, so you don't get the whole Unicode set.
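
If you go the "-enc ansi" route, it's worth checking first whether your text actually fits into WinAnsi. Here is a rough sketch, assuming Java's windows-1252 charset is a close enough stand-in for PDF's WinAnsiEncoding (they overlap heavily, but I wouldn't rely on them being identical). The class and method names are made up for illustration and are not part of FOP:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class WinAnsiCheck {
    // Assumption: windows-1252 approximates PDF WinAnsiEncoding.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Returns true if every character of the text can be encoded in windows-1252. */
    public static boolean fitsWinAnsi(String text) {
        CharsetEncoder encoder = CP1252.newEncoder();
        return encoder.canEncode(text);
    }

    /** Returns the index of the first character that cannot be encoded, or -1 if all can. */
    public static int firstUnsupportedIndex(String text) {
        CharsetEncoder encoder = CP1252.newEncoder();
        for (int i = 0; i < text.length(); i++) {
            if (!encoder.canEncode(text.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Western European text (u-umlaut, sharp s) fits into windows-1252.
        System.out.println(fitsWinAnsi("Gr\u00fc\u00dfe"));                       // prints true
        // Cyrillic text does not, so it would be lost with a WinAnsi-encoded font.
        System.out.println(fitsWinAnsi("\u041f\u0440\u0438\u0432\u0435\u0442"));  // prints false
    }
}
```

Run something like this over the strings that end up in your XSL-FO input. If firstUnsupportedIndex ever returns something other than -1, the WinAnsi workaround will lose those characters and you are back to needing the upgrade.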