Tags: unicode, character-encoding, xsl-fo, apache-fop

Interpretation of Greek characters by FOP


Can you please help me interpret the Greek character whose HTML entity is &#8062; and whose hex value is 01F7E (U+1F7E)?

Details of this character can be found at the URL below:

http://www.isthisthingon.org/unicode/index.php?page=01&subpage=F&hilite=01F7E

When I run this character through Apache FOP, it throws an ArrayIndexOutOfBoundsException:

Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.fop.text.linebreak.LineBreakUtils.getLineBreakPairProperty(LineBreakUtils.java:668)
    at org.apache.fop.text.linebreak.LineBreakStatus.nextChar(LineBreakStatus.java:117)

When I looked into the FOP code, I was unable to understand the need for the lineBreakProperties[][] array in LineBreakUtils.java.

I also noticed that FOP fails with a similar error for all the non-displayable Greek characters listed on the above page.

What are these special characters?
Why is there no display for these characters? Are they line breaks or tabs?
Has anyone solved a similar issue with FOP?


Solution

  • Answer from Apache

    At first glance, this seems like a minor oversight in the implementation of Unicode linebreaking in FOP. The implementation does not take into account the possibility that a given codepoint is not assigned a 'class' in linebreaking context. (U+1F7E does not appear in the file http://www.unicode.org/Public/UNIDATA/LineBreak.txt, which is used as a basis to generate those arrays in LineBreakUtils.java.)

    On the other hand, one could obviously raise the question why you so desperately need to have an unassigned codepoint in your output. Are you absolutely sure you need this? If yes, then can you elaborate on the exact reason? (i.e. What exactly is this unassigned codepoint used for?)

    The most straightforward 'fix' seems to be roughly as follows:

    Index: src/java/org/apache/fop/text/linebreak/LineBreakStatus.java

    --- src/java/org/apache/fop/text/linebreak/LineBreakStatus.java (revision 1054383)
    +++ src/java/org/apache/fop/text/linebreak/LineBreakStatus.java (working copy)
    @@ -87,6 +87,7 @@

         /* Initial conversions */
         switch (currentClass) {
    +    case 0: // Unassigned codepoint: consider as AL?
         case LineBreakUtils.LINE_BREAK_PROPERTY_AI:
         case LineBreakUtils.LINE_BREAK_PROPERTY_SG:
         case LineBreakUtils.LINE_BREAK_PROPERTY_XX:

    What this does is assign the class 'AL' or 'Alphabetic' to any codepoint that has not been assigned a class by Unicode. This means it will be treated as a regular letter. Now, the reason why I am asking whether you are sure you know what you're doing is that this may turn out to be undesirable. Perhaps the character in question needs to be treated as a space rather than a letter. Unicode does not define U+1F7E other than as a 'reserved' character, so it makes sense that Unicode cannot say what should happen with this character in the context of linebreaking.

    That said, it is also wrong of FOP to crash in this case, so the bug is definitely genuine.
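
Until a patched FOP is available, one possible workaround is to detect and replace unassigned codepoints in the text before it reaches FOP. The following is only a minimal sketch, not part of FOP: the class name UnassignedCodePoints and its methods are hypothetical, and it relies on the JVM's own Unicode tables (Character.isDefined), which may not match the version of LineBreak.txt that FOP's arrays were generated from.

```java
// Hypothetical helper, not part of Apache FOP: pre-filters text on the caller's side.
final class UnassignedCodePoints {

    private UnassignedCodePoints() { }

    /** True if the running JVM's Unicode data assigns a character to this codepoint. */
    static boolean isAssigned(int codePoint) {
        return Character.isDefined(codePoint);
    }

    /** Returns a copy of text in which every unassigned codepoint is replaced. */
    static String replaceUnassigned(String text, String replacement) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (isAssigned(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacement);
            }
            // Advance by one or two chars, depending on whether cp is supplementary.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F7C and U+1F7D are assigned Greek Extended letters; U+1F7E is reserved.
        String input = "\u1F7C\u1F7E\u1F7D";
        System.out.println("U+1F7E assigned: " + isAssigned(0x1F7E));
        System.out.println("Length after stripping: "
                + replaceUnassigned(input, "").length());
    }
}
```

Whether to drop such codepoints, replace them with U+FFFD, or leave them in place is a judgment call that depends on why they appear in the input in the first place.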