I'm working on text analysis in Python. I'm looking at a range of Irish-language texts dating from the 6th century to the 14th, which means I have a whole range of orthographic variations to account for when sorting a word list.
I want to sort a list in a way that takes into account the different grammatical forms of characters (e.g. fada, séimhiú, and urú) from different periods, ordering words by their core forms, so my custom alphabet will look like this:
"a, á, b, ḃ, bh, mb, c, ċ, ch, gc, d, ḋ, dh, nd, e, é, f, ḟ, fh, bhf, g, ġ, gh, ng, h, i, í, l, m, ṁ, mh, n, o, ó, p, ṗ, ph, bp, r, rh, s, ṡ, sh, t, ṫ, th, ts, dt, u, ú, j, k, q, v, w, x, y, z"
I can probably handle the fada (accented letters) with Unicode encoding, e.g. u'á', but I'm struggling to find a way to work with the old-style urú (diacritic dot).
Does anyone have experience with this sort of mix of characters? Is there a common way that people have developed to work with these characters?
Currently, whenever I try to use a character with a diacritic dot, such as u'ḃ', I get the following error:
Traceback (most recent call last):
  File "csv_generator.py", line 44, in <module>
    print u'ß©â'
  File "C:\Users\Charlie\Anaconda2\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u1e03' in position 0: character maps to <undefined>
The problem shown is that you are printing a character that your console's code page (cp850) doesn't support. You can manipulate Unicode strings just fine; it's only a problem of display. Python 3.6+ solves this issue by bypassing code pages and printing using the Windows Unicode APIs:
Python 3.6.3 (v3.6.3:2c5fed8, Oct 3 2017, 18:11:49) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\u1e03')
ḃ
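To check that the string is fine and only the console encoding step fails, here is a minimal sketch. It reproduces the cp850 failure without printing, then shows two possible workarounds: an ASCII-escaped form for display and a UTF-8 file for output. The file name words.txt and the temporary directory are assumptions made for illustration, not something from the question.

```python
import io
import os
import tempfile

# "ḃean", written with an escape so the source file itself stays ASCII.
word = u"\u1e03ean"

# The string is well formed: U+1E03 is a single code point.
assert len(word) == 4

# Encoding to cp850 is the step that fails when printing to that console.
try:
    word.encode("cp850")
    fits_cp850 = True
except UnicodeEncodeError:
    fits_cp850 = False

# An ASCII-safe representation that any console can display.
escaped = word.encode("ascii", "backslashreplace").decode("ascii")

# Writing to a UTF-8 file sidesteps the console entirely.
# The path below is a hypothetical location chosen for this sketch.
path = os.path.join(tempfile.mkdtemp(), "words.txt")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(word + u"\n")
with io.open(path, encoding="utf-8") as handle:
    round_trip = handle.read().strip()
```

The u prefix and io.open are used so the same sketch should run under both Python 2.7 and Python 3.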
If you can't switch to a later version of Python, can you switch to an IDE that supports UTF-8? Here is an example using PythonWin from the pywin32 package (I have Python 2.7 installed):
PythonWin 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:53:40) [MSC v.1500 64 bit (AMD64)] on win32.
Portions Copyright 1994-2008 Mark Hammond - see 'Help/About PythonWin' for further copyright information.
>>> print(u'\u1e03')
ḃ
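Since the strings themselves can be manipulated regardless of what the console can display, sorting by the custom alphabet from the question is possible too. The following is only a sketch, written for Python 3, and rests on some assumptions: sort_key is a hypothetical helper, digraphs and trigraphs are matched greedily anywhere in a word (not only at the start), and input is normalized to NFC so that b followed by a combining dot compares equal to the precomposed ḃ.

```python
import unicodedata

# The custom alphabet from the question, in the order given there.
ALPHABET = (
    "a, á, b, ḃ, bh, mb, c, ċ, ch, gc, d, ḋ, dh, nd, e, é, f, ḟ, fh, bhf, "
    "g, ġ, gh, ng, h, i, í, l, m, ṁ, mh, n, o, ó, p, ṗ, ph, bp, r, rh, "
    "s, ṡ, sh, t, ṫ, th, ts, dt, u, ú, j, k, q, v, w, x, y, z"
).split(", ")

RANK = {unit: index for index, unit in enumerate(ALPHABET)}
MAX_LEN = max(len(unit) for unit in ALPHABET)


def sort_key(word):
    """Hypothetical helper: map a word to a list of alphabet ranks."""
    # NFC turns "b" + combining dot above (U+0307) into the single ḃ (U+1E03).
    text = unicodedata.normalize("NFC", word.lower())
    key = []
    i = 0
    while i < len(text):
        # Try the longest unit first, so "bhf" wins over "bh" and "b".
        for size in range(MAX_LEN, 0, -1):
            unit = text[i:i + size]
            if unit in RANK:
                key.append(RANK[unit])
                i += len(unit)
                break
        else:
            # Characters outside the alphabet sort after all known units.
            key.append(len(ALPHABET) + ord(text[i]))
            i += 1
    return key


words = ["cat", "mbean", "bhean", "ḃean", "bean", "án", "an"]
ordered = sorted(words, key=sort_key)
```

Greedy matching is the simplest choice here, but it treats a sequence such as "mb" in the middle of a word as one unit as well, so a real word list may need a rule that only recognizes mutations at the start of a word.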