Ligatures are Unicode characters that are represented by more than one code point. For example, in Devanagari, त्र is a ligature that consists of the code points त + ् + र.
When viewed in a simple text editor like Notepad, त्र is shown as त् + र and is stored as three Unicode characters. However, when the same file is opened in Firefox, it is shown as a proper ligature.
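To make the "three Unicode characters" point concrete, here is a minimal Java sketch that counts and lists the code points stored for त्र. The class and method names are made up for illustration.

```java
public class CodePointDemo {
    // Number of Unicode code points in the string (not UTF-16 chars).
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String tra = "\u0924\u094D\u0930"; // त + ् + र
        System.out.println(codePoints(tra)); // prints 3
        // List each stored code point with its Unicode name.
        tra.codePoints().forEach(cp ->
                System.out.printf("U+%04X %s%n", cp, Character.getName(cp)));
    }
}
```

So a plain code point count reports 3 for what is rendered as a single ligature, which is exactly the mismatch I am trying to get around.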
So my question is: how can I detect such ligatures programmatically while reading the file from my code? Since Firefox does it, there must be a way to do it programmatically. Are there any Unicode properties that contain this information, or do I need to maintain a map of all such ligatures?
The SVG CSS property text-rendering, when set to optimizeLegibility, does the same thing (it combines the code points into a proper ligature).
PS: I am using Java.
EDIT
The purpose of my code is to count the characters in Unicode text, treating a ligature as a single character. So I need a way to collapse multiple code points into a single ligature.
While Aaron's answer is not exactly correct, it pushed me in the right direction. After reading through the Java API docs for java.awt.font.GlyphVector and experimenting a lot in the Clojure REPL, I was able to write a function that does what I want.
The idea is to find the width of each glyph in the GlyphVector and merge every zero-width glyph into the last glyph found with a non-zero width. The solution is in Clojure, but it should be translatable to Java if required.
(ns net.abhinavsarkar.unicode
  (:import [java.awt.font TextAttribute GlyphVector]
           [java.awt Font]
           [javax.swing JTextArea]))

(let [^java.util.Map text-attrs {TextAttribute/FAMILY "Arial Unicode MS"
                                 TextAttribute/SIZE 25
                                 TextAttribute/LIGATURES TextAttribute/LIGATURES_ON}
      font (Font/getFont text-attrs)
      ta (doto (JTextArea.) (.setFont font))
      frc (.getFontRenderContext (.getFontMetrics ta font))]
  (defn unicode-partition
    "Takes a Unicode string and returns a vector of strings, partitioning
    the input so that all code points of a single ligature end up in the
    same partition of the output vector."
    [^String text]
    (let [glyph-vector
          (.layoutGlyphVector
            font, frc, (.toCharArray text),
            0, (.length text), Font/LAYOUT_LEFT_TO_RIGHT)
          glyph-num (.getNumGlyphs glyph-vector)
          ;; getGlyphPositions returns x,y pairs; keep only the x values
          glyph-positions
          (map first (partition 2
                       (.getGlyphPositions glyph-vector 0 glyph-num nil)))
          ;; glyph width = x of the next glyph (or total width) minus own x
          glyph-widths
          (map -
               (concat (next glyph-positions)
                       [(.. glyph-vector getLogicalBounds width)])
               glyph-positions)
          glyph-indices
          (seq (.getGlyphCharIndices glyph-vector 0 glyph-num nil))
          glyph-index-width-map (zipmap glyph-indices glyph-widths)
          ;; widths indexed by char position, sized by the text so that
          ;; chars without a glyph of their own keep a width of 0.0
          corrected-glyph-widths
          (vec (reduce
                 (fn [^doubles acc [k v]] (aset acc (int k) (double v)) acc)
                 (double-array (count text))
                 glyph-index-width-map))]
      ;; append zero-width chars to the previous partition,
      ;; start a new partition for every non-zero-width char
      (loop [idx 0 pidx 0 char-seq (seq text) acc []]
        (if (nil? char-seq)
          acc
          (if-not (zero? (nth corrected-glyph-widths idx))
            (recur (inc idx) (inc pidx) (next char-seq)
                   (conj acc (str (first char-seq))))
            (recur (inc idx) pidx (next char-seq)
                   (assoc acc (dec pidx)
                          (str (nth acc (dec pidx)) (first char-seq))))))))))
Also posted on Gist.
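If a Java version is needed, the merging step (attach every zero-width character to the preceding partition) could look roughly like the sketch below. This is only an illustration under stated assumptions: the class name, method name and the `widths` parameter are hypothetical, the per-character widths are assumed to have been computed already (for example from a GlyphVector, as in the Clojure code), and the numbers in `main` are invented rather than measured from a real font.

```java
import java.util.ArrayList;
import java.util.List;

public class LigaturePartition {
    /**
     * Splits text into partitions, appending every char whose width is zero
     * to the partition started by the last non-zero-width char.
     * widths[i] is assumed to be the advance width for text.charAt(i).
     */
    public static List<String> partition(String text, double[] widths) {
        if (widths.length != text.length()) {
            throw new IllegalArgumentException("one width per char is required");
        }
        List<StringBuilder> groups = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            // A non-zero width starts a new partition; so does a leading
            // zero-width char, since there is nothing before it to join.
            if (widths[i] != 0.0 || groups.isEmpty()) {
                groups.add(new StringBuilder());
            }
            groups.get(groups.size() - 1).append(text.charAt(i));
        }
        List<String> result = new ArrayList<>();
        for (StringBuilder group : groups) {
            result.add(group.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "\u0924\u094D\u0930\u0915";  // त + ् + र followed by क
        double[] widths = {12.0, 0.0, 0.0, 10.0}; // invented widths, not from a font
        System.out.println(partition(text, widths).size()); // prints 2
    }
}
```

The size of the returned list is then the character count with each ligature counted once. Which code points actually come back with zero width depends on the font and the layout engine, so the widths should be checked against the fonts you intend to use.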