
Detecting Unicode text ligatures in Clojure/Java


Ligatures are Unicode characters that are represented by more than one code point. For example, in Devanagari, त्र is a ligature that consists of the code points त + ् + र.
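
To make the code point claim concrete, here is a small Java sketch that lists the code points stored for त्र. The class name `ConjunctCodePoints` is only illustrative, and the string is written with `\u` escapes so the source stays ASCII. It only inspects the stored text; it says nothing about how a font renders it.

```java
final class ConjunctCodePoints {
    // The conjunct from the example above, written with escapes:
    // U+0924 (TA) + U+094D (VIRAMA) + U+0930 (RA).
    static final String TRA = "\u0924\u094D\u0930";

    /** Returns the code points of the given text in storage order. */
    static int[] codePoints(String text) {
        return text.codePoints().toArray();
    }

    public static void main(String[] args) {
        int[] cps = codePoints(TRA);
        System.out.println(cps.length + " code points");
        for (int cp : cps) {
            // Character.getName returns the Unicode name of the code point.
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }
    }
}
```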

When viewed in a simple text editor like Notepad, त्र is shown as त् + र and is stored as three Unicode characters. However, when the same file is opened in Firefox, it is rendered as a proper ligature.

So my question is: how can I detect such ligatures programmatically while reading the file from my code? Since Firefox does it, there must be a way to do it programmatically. Are there any Unicode properties that contain this information, or do I need a map of all such ligatures?

The SVG/CSS property text-rendering, when set to optimizeLegibility, does the same thing (it combines code points into a proper ligature).

PS: I am using Java.

EDIT

The purpose of my code is to count the characters in Unicode text, treating a ligature as a single character. So I need a way to collapse multiple code points into a single ligature.


Solution

  • While Aaron's answer is not exactly correct, it pushed me in the right direction. After reading through the Java API docs for java.awt.font.GlyphVector and experimenting a lot at the Clojure REPL, I was able to write a function that does what I want.

    The idea is to find the width of each glyph in the GlyphVector and merge every zero-width glyph into the most recent glyph with a non-zero width. The solution is in Clojure, but it should be translatable to Java if required.

    (ns net.abhinavsarkar.unicode
      (:import [java.awt.font TextAttribute GlyphVector]
               [java.awt Font]
               [javax.swing JTextArea]))
    
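    ;; Assumes the "Arial Unicode MS" font is installed; the glyph widths
    ;; used below depend on how the chosen font lays out the text.
    (let [^java.util.Map text-attrs {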
            TextAttribute/FAMILY "Arial Unicode MS"
            TextAttribute/SIZE 25
            TextAttribute/LIGATURES TextAttribute/LIGATURES_ON}
          font (Font/getFont text-attrs)
          ta (doto (JTextArea.) (.setFont font))
          frc (.getFontRenderContext (.getFontMetrics ta font))]
      (defn unicode-partition
        "Takes a Unicode string and returns a vector of strings, partitioning
        the input so that all code points of a single ligature end up in the
        same partition of the output vector."
        [^String text]
        (let [glyph-vector 
                (.layoutGlyphVector
                  font, frc, (.toCharArray text),
                  0, (.length text), Font/LAYOUT_LEFT_TO_RIGHT)
              glyph-num (.getNumGlyphs glyph-vector)
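              ;; getGlyphPositions returns x,y pairs; keep only the x coordinates
              glyph-positions
                (map first (partition 2
                              (.getGlyphPositions glyph-vector 0 glyph-num nil)))
              ;; width of a glyph = next glyph's x (or the total width) minus its own x
              glyph-widths
                (map -
                  (concat (next glyph-positions)
                          [(.. glyph-vector getLogicalBounds width)])
                  glyph-positions)
              ;; index of the char in the text that each glyph belongs to
              glyph-indices 
                (seq (.getGlyphCharIndices glyph-vector 0 glyph-num nil))
              glyph-index-width-map (zipmap glyph-indices glyph-widths)
              ;; widths indexed by char position; chars with no glyph entry
              ;; default to 0.0
              corrected-glyph-widths
                (reduce (fn [acc [k v]] (assoc acc k v))
                        (vec (repeat (.length text) 0.0))
                        glyph-index-width-map)]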
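          ;; Walk the chars: a non-zero width starts a new partition, while a
          ;; zero-width char is appended to the previous partition. The
          ;; (empty? acc) check covers a zero-width first char.
          (loop [idx 0 char-seq (seq text) acc []]
            (if (nil? char-seq)
              acc
              (let [ch (str (first char-seq))]
                (if (or (empty? acc)
                        (not (zero? (nth corrected-glyph-widths idx))))
                  (recur (inc idx) (next char-seq) (conj acc ch))
                  (recur (inc idx) (next char-seq)
                    (conj (pop acc) (str (peek acc) ch))))))))))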
    

    Also posted on Gist.
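
Since the question asks for Java, the width and merging steps might translate roughly as follows. This is a sketch, not a port of the full function: it leaves out the AWT calls and takes their results as plain arrays, laid out the way `getGlyphPositions` and `getGlyphCharIndices` return them. The class and method names (`LigaturePartition`, `widthsPerChar`, `partition`) and the numbers in `main` are invented for illustration and were not measured from a real font.

```java
import java.util.ArrayList;
import java.util.List;

final class LigaturePartition {

    /**
     * Computes a width for every char position in the text.
     * positions:   flat x,y pairs, one pair per glyph
     * charIndices: for each glyph, the index of the char it belongs to
     * totalWidth:  width of the whole laid-out text
     */
    static float[] widthsPerChar(float[] positions, int[] charIndices,
                                 float totalWidth, int textLength) {
        int n = charIndices.length;
        float[] widths = new float[textLength]; // unmapped chars stay at 0
        for (int g = 0; g < n; g++) {
            float x = positions[2 * g];
            // The last glyph ends where the whole text ends.
            float nextX = (g + 1 < n) ? positions[2 * (g + 1)] : totalWidth;
            widths[charIndices[g]] = nextX - x;
        }
        return widths;
    }

    /** Merges every zero-width char into the preceding partition. */
    static List<String> partition(String text, float[] widths) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            String ch = String.valueOf(text.charAt(i));
            if (parts.isEmpty() || widths[i] != 0f) {
                parts.add(ch); // non-zero width starts a new partition
            } else {
                int last = parts.size() - 1;
                parts.set(last, parts.get(last) + ch);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical layout data for KA + TA + VIRAMA + RA (4 chars).
        String text = "\u0915\u0924\u094D\u0930";
        float[] positions = {0f, 0f, 12f, 0f, 30f, 0f, 30f, 0f};
        int[] charIndices = {0, 1, 2, 3};
        float[] widths = widthsPerChar(positions, charIndices, 30f, text.length());
        System.out.println(partition(text, widths).size() + " partitions");
    }
}
```

With the made-up widths above, the last two chars have zero width, so they are folded into the partition that starts at TA. In real use the arrays would come from the `GlyphVector` produced by `Font.layoutGlyphVector`, as in the Clojure code.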