python ios iso-639-2 ietf-bcp-47

How to convert an IETF BCP 47 language identifier to ISO 639-2?


I am writing a server API for an iOS application. As part of the initialization process, the app should send the phone's interface language to the server via an API call.

The problem is that Apple uses IETF BCP 47 language identifiers in its NSLocale preferredLanguages method.

The returned values have different lengths (e.g. [aa, ab, ace, ach, ada, ady, ae, af, afa, afh, agq, ...]), and I found very few parsers that can convert these codes to a proper language identifier.

I would like to use the more common ISO 639-2 three-letter language identifier, which is ubiquitous, has parsers in many programming languages, and gives every language a standard 3-letter code.

How can I convert an IETF BCP 47 language identifier to an ISO 639-2 three-letter language identifier, preferably in Python?


Solution

  • BCP 47 identifiers start with a 2-letter ISO 639-1 code or a 3-letter ISO 639-2, 639-3 or 639-5 language code; see the RFC 5646 Syntax section:

    Language-Tag  = langtag             ; normal language tags
                  / privateuse          ; private use tag
                  / grandfathered       ; grandfathered tags
    
    langtag       = language
                    ["-" script]
                    ["-" region]
                    *("-" variant)
                    *("-" extension)
                    ["-" privateuse]
    
    language      = 2*3ALPHA            ; shortest ISO 639 code
                    ["-" extlang]       ; sometimes followed by
                                        ; extended language subtags
                  / 4ALPHA              ; or reserved for future use
                  / 5*8ALPHA            ; or registered language subtag
    

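    I don't expect Apple to use the privateuse or grandfathered forms, so you can assume that you are looking at ISO 639-1, ISO 639-2, ISO 639-3 or ISO 639-5 language codes here. The language code is the part of the tag before the first hyphen (for example, en in en-GB); any script or region subtags that follow can be ignored for this purpose. Simply map the 2-letter ISO 639-1 codes to 3-letter ISO 639-* codes.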
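
    To make the grammar above concrete, here is a minimal sketch that splits a tag into its language, script and region subtags using only the standard library. The function name split_langtag and the regular expression are assumptions made for illustration; the pattern is a simplified reading of the langtag rule and deliberately ignores extlang, variant, extension and private-use subtags.

```python
import re

# Simplified form of the RFC 5646 `langtag` rule: a 2-3 letter language,
# then an optional 4-letter script and an optional region (2 letters or
# 3 digits). Anything after that is accepted but not parsed.
_LANGTAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?:-.*)?"
)


def split_langtag(tag):
    """Return the language/script/region subtags of a tag, or None.

    None is returned for tags that do not start with a 2-3 letter
    language code, e.g. private-use ("x-...") or grandfathered tags.
    """
    match = _LANGTAG.fullmatch(tag)
    if match is None:
        return None
    return {
        "language": match["language"].lower(),
        "script": match["script"],
        "region": match["region"],
    }
```

    With this sketch, split_langtag("zh-Hans-CN") yields the language zh, script Hans and region CN, while a private-use tag such as "x-private" yields None. Only the language value matters for the ISO 639 mapping.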

    You can use the pycountry package for this:

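    import pycountry
    
    # Current pycountry releases name these alpha_2 and alpha_3;
    # older releases spelled them alpha2 and terminology.
    lang = pycountry.languages.get(alpha_2=two_letter_code)
    three_letter_code = lang.alpha_3
    

    Demo:

    >>> import pycountry
    >>> lang = pycountry.languages.get(alpha_2='aa')
    >>> lang.alpha_3
    'aar'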
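
    The lookup above takes a bare 2-letter code, whereas iOS may send a full tag such as en-GB or zh-Hans-CN. A rough end-to-end sketch follows. The function name bcp47_to_iso639_2 and the table ISO639_1_TO_2T are hypothetical, and the table is a tiny illustrative subset so that the example runs without third-party packages; in practice the table would be built from a complete data source such as pycountry.

```python
# Illustrative subset only (terminology codes); a real table would
# cover every ISO 639-1 code.
ISO639_1_TO_2T = {
    "aa": "aar",
    "de": "deu",
    "en": "eng",
    "fr": "fra",
    "zh": "zho",
}


def bcp47_to_iso639_2(tag, table=ISO639_1_TO_2T):
    """Map a BCP 47 tag to a 3-letter code, or None if that is not possible."""
    # The primary language subtag is everything before the first hyphen.
    primary = tag.split("-", 1)[0].lower()
    if not (primary.isascii() and primary.isalpha()):
        return None
    if len(primary) == 2:
        # 2-letter ISO 639-1 code: look up its 3-letter equivalent.
        return table.get(primary)
    if len(primary) == 3:
        # Assumption: a 3-letter subtag is passed through unchanged. It may
        # be an ISO 639-3 code with no ISO 639-2 entry, so callers that
        # need strict ISO 639-2 should validate it separately.
        return primary
    # 1-letter ("x", "i") or longer subtags are not ISO 639 codes.
    return None
```

    For example, bcp47_to_iso639_2("en-GB") gives "eng" and bcp47_to_iso639_2("ace") gives "ace" with the table shown here.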
    

    The 3-letter code returned here is the terminology (T) form, which is the preferred one; there is also a bibliographic (B) form, which differs for only 22 entries. See ISO 639-2 B and T codes. The package doesn't include entries from ISO 639-5, however; that list overlaps and conflicts with 639-2 in places, and I don't think Apple uses such codes at all.
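
    If the consumer of your API expects the bibliographic form instead, the conversion is a small table lookup. The sketch below is an assumption-laden illustration: the names T_TO_B, to_bibliographic and to_terminology are hypothetical, and the table lists only a handful of the pairs that differ, not the full set.

```python
# A few of the ISO 639-2 codes whose terminology (T) and bibliographic (B)
# forms differ. Illustrative subset, not the complete list.
T_TO_B = {
    "deu": "ger",  # German
    "fra": "fre",  # French
    "zho": "chi",  # Chinese
    "nld": "dut",  # Dutch
    "ell": "gre",  # Modern Greek
}
B_TO_T = {b: t for t, b in T_TO_B.items()}


def to_bibliographic(code):
    """Return the B form of a 3-letter code; most codes are unchanged."""
    return T_TO_B.get(code, code)


def to_terminology(code):
    """Return the T form of a 3-letter code; most codes are unchanged."""
    return B_TO_T.get(code, code)
```

    Codes that have a single form, such as eng, pass through both functions unchanged.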