Tags: python, machine-learning, automation, nlp, artificial-intelligence

Automatically Add Diacritic/Accent Marks to a Non-English Document


In my spare time, I am transcribing a very old, rare book written in Romanian (in fact, it is the only remaining copy, to my knowledge). It was written over a hundred years ago, well before any computers existed. As such, no digital copies exist, and I am manually transcribing and digitizing it.

The book is thousands of pages long, and it is surprisingly time-consuming (for me, at least) to add diacritic and accent marks (ă/â/î/ş/ţ) to every single word as I type. If I omit the marks and just type the bare letters (i.e. a instead of ă/â), I am able to type more than twice as fast, which is a huge benefit. Currently I am typing everything directly into a .tex file to apply special formatting for the pages and illustrations.

However, I know that eventually I will have to add all these marks back into the text, and it seems tedious/unnecessary to do all that manually, since I already have all the letters. I'm looking for some way to automatically/semi-automatically ADD diacritic/accent marks to a large body of text (not remove them - I see plenty of questions on SO asking how to remove the marks).

I tried searching for large corpora of Romanian words (this and this were the most promising two), but everything I found fell short, missing at least a few words on any random sample of text I fed it (I used a short python script). It doesn't help that the book uses many archaic/uncommon words or uncommon spellings of words.

Does anyone have any ideas on how I might go about this? There are no dumb ideas here - any document format, machine learning technique, coding language, professional tool, etc that you can think of that might help is appreciated.

I should also note that I have substantial coding experience, and would not consider it a waste of time to build something myself. Tbh, I think it might be beneficial to the community, since I could not find such a tool for any Western language (French, Czech, Serbian, etc). Just need some guidance on how to get started.


Solution

  • Bob's answer is a static approach whose success depends on how good the word list is: if a word is missing from the list, it will never be handled.

    Moreover, as in many other languages, there are cases where two (or more) words exist with the same characters but different diacritics. For Romanian I found the following example: peste = over vs. peşte = fish. These cases cannot be handled in a straightforward way either. This is especially an issue if the text you're converting contains words which aren't used anymore in today's language, especially diacritised ones.
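    To make that limitation concrete, here's a minimal sketch of the word-list approach (the two dictionary entries are illustrative placeholders; a real word list would contain far more entries):

    import re

    # Illustrative mapping from bare (undiacritised) forms to diacritised forms,
    # e.g. loaded from a word list
    word_map = {"si": "şi", "tara": "ţară"}

    def restore_diacritics(text):
        # Replace each word with its diacritised form if the word list knows it;
        # unknown or ambiguous words ("peste" = over vs. "peşte" = fish)
        # are simply left untouched, which is exactly the limitation above.
        def replace(match):
            word = match.group(0)
            return word_map.get(word.lower(), word)
        return re.sub(r"\w+", replace, text)

    print(restore_diacritics("peste tara"))  # -> "peste ţară"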

    In this answer I will present an alternative using machine learning. The only caveat is that I couldn't find a publicly available trained model doing diacritic restoration for Romanian. You may have some luck contacting the authors of the papers I mention here to see if they'd be willing to send you their trained models. Otherwise, you'll have to train one yourself, which I'll give some pointers on. I will try to give a comprehensive overview to get you started, but further reading is encouraged.

    Although this process may be laborious, it can give you 99% accuracy with the right tools.

    Language Model

    A language model can be thought of as having a high-level "understanding" of the language. It's typically pre-trained on raw text corpora. Although you can train your own, be wary that these models are quite expensive to pre-train.

    Whilst multilingual models can be used, language-specific models typically fare better if trained with enough data. Luckily, there are publicly available language models for Romanian, such as RoBERT. This language model is based on BERT, an architecture used extensively in Natural Language Processing & more or less the standard in the field, as it attained state-of-the-art results in English & other languages.

    In fact, there are three variants: base, large, & small. The larger the model, the better the results, due to its larger representational power, but larger models will also have a higher memory footprint.

    Loading these models is very easy with the transformers library. For instance, the base model:

    from transformers import AutoModel, AutoTokenizer

    # Load the pre-trained Romanian language model & its tokenizer
    tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")
    model = AutoModel.from_pretrained("readerbench/RoBERT-base")

    # "exemplu de propoziție" = "example of a sentence"
    inputs = tokenizer("exemplu de propoziție", return_tensors="pt")
    outputs = model(**inputs)
    

    The outputs above will contain vector representations of the inputted texts, more commonly known as "word embeddings". Language models are then fine-tuned to a downstream task (in your case, diacritic restoration) and would take these embeddings as input.
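    For instance, you can check what these embeddings look like; the exact sequence length depends on how the tokenizer splits the input:

    # One vector per sub-word token:
    # shape is (batch_size, sequence_length, hidden_size)
    print(outputs.last_hidden_state.shape)
    # e.g. torch.Size([1, N, 768]) for the base model, where N is the
    # number of sub-word tokens produced for "exemplu de propoziție"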

    Fine-tuning

    I couldn't find any publicly available fine-tuned models, so you'll have to fine-tune one yourself unless you manage to obtain an existing one (e.g. from the authors mentioned above).

    To fine-tune a language model, we need to build a task-specific architecture which will be trained on some dataset. The dataset is used to show the model what the input looks like & what we'd like the output to be.

    Dataset

    From Diacritics Restoration using BERT with Analysis on Czech language, there's a publicly available dataset for a number of languages including Romanian. The dataset annotations will also depend on which fine-tuning architecture you use (more on that below).

    In general, you'd choose a dataset which you trust to have high-quality diacritics. From this text you can then build annotations automatically by producing the undiacritised variants of the words as well as the corresponding labels.
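    A minimal sketch of producing such pairs automatically (the character mapping below is the only assumption, covering both the cedilla & comma-below encodings you'll encounter in Romanian corpora):

    # Map each diacritised Romanian character to its bare form
    STRIP_MAP = str.maketrans({
        "ă": "a", "â": "a", "î": "i", "ş": "s", "ţ": "t", "ș": "s", "ț": "t",
        "Ă": "A", "Â": "A", "Î": "I", "Ş": "S", "Ţ": "T", "Ș": "S", "Ț": "T",
    })

    def strip_diacritics(text):
        return text.translate(STRIP_MAP)

    diacritised = "El a prins un peşte în ţara vecină"
    bare = strip_diacritics(diacritised)
    # (bare, diacritised) is now an automatically generated training pair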

    Keep in mind that this or any other dataset you'll use will contain biases especially in terms of the domain the annotated texts originate from. Depending on how much data you have already transcribed, you may also want to build a dataset using your texts.

    Architecture

    The architecture you choose will have a bearing on the downstream performance you get & the amount of custom code you'll have to write.

    Word-level

    The aforementioned work, Diacritics Restoration using BERT with Analysis on Czech language, uses a token-level classification mechanism where each word is labelled with a set of instructions describing which type of diacritic mark to insert at which character index. For example, the undiacritised word "dite" with instruction set 1:ACUTE;3:CARON indicates adding the appropriate diacritic marks at index 1 and index 3 to result in "dítě".
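    A minimal sketch of how such an instruction set could be applied to a bare word (the parsing & the set of mark names are my own illustration; the real label vocabulary comes from the dataset):

    import unicodedata

    # Combining characters for a few of the possible diacritic marks
    MARKS = {"ACUTE": "\u0301", "CARON": "\u030c", "BREVE": "\u0306",
             "CIRCUMFLEX": "\u0302", "COMMA_BELOW": "\u0326"}

    def apply_instructions(word, instructions):
        chars = list(word)
        for item in instructions.split(";"):
            index, mark = item.split(":")
            # Compose the base character with the combining mark
            chars[int(index)] = unicodedata.normalize(
                "NFC", chars[int(index)] + MARKS[mark])
        return "".join(chars)

    print(apply_instructions("dite", "1:ACUTE;3:CARON"))  # -> "dítě"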

    Since this is a token-level classification task, there's not much custom code you have to write, as you can directly use a BertForTokenClassification. Refer to the authors' code for a more complete example.

    One sidenote is that the authors use a multilingual language model. This can easily be replaced with another language model, such as the RoBERT mentioned above.
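    For example, swapping in RoBERT for this token-level setup only requires pointing the classification head at the Romanian checkpoint (the num_labels value is a placeholder that depends on how many distinct instruction-set labels your dataset produces):

    from transformers import AutoTokenizer, AutoModelForTokenClassification

    tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")
    # num_labels = size of your instruction-set label vocabulary (placeholder)
    model = AutoModelForTokenClassification.from_pretrained(
        "readerbench/RoBERT-base", num_labels=32)
    # The classification head is freshly initialised & still needs to be
    # fine-tuned on the diacritics dataset before it's useful.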

    Character-level

    Alternatively, the RoBERT paper uses a character-level model. From the paper, each character is annotated as one of the following:

    make no modification to the current character (e.g., a → a), add circumflex mark (e.g., a → â and i → î), add breve mark (e.g., a → ă), and two more classes for adding comma below (e.g., s → ş and t → ţ)
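    A minimal sketch of turning a diacritised word into such character-level labels (the class names are my own shorthand; the paper only describes the categories):

    # One label per character of the corresponding bare word
    CHAR_CLASS = {
        "ă": "BREVE", "â": "CIRCUMFLEX", "î": "CIRCUMFLEX",
        "ş": "COMMA_S", "ţ": "COMMA_T", "ș": "COMMA_S", "ț": "COMMA_T",
    }

    def char_labels(diacritised_word):
        return [CHAR_CLASS.get(c.lower(), "NONE") for c in diacritised_word]

    print(char_labels("ţară"))  # -> ['COMMA_T', 'NONE', 'NONE', 'BREVE']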

    Here you will have to build your own custom model (instead of the BertForTokenClassification above), but the rest of the training code will largely be the same. Here's a template for the model class which can be built with the transformers library (the classification head filled in below is just one illustrative choice):

    from transformers import BertModel, BertPreTrainedModel
    import torch.nn as nn

    class BertForDiacriticRestoration(BertPreTrainedModel):

        def __init__(self, config):
            super().__init__(config)
            self.bert = BertModel(config)
            # Example head (an assumption, not the paper's exact architecture):
            # classify each position into one of the diacritic classes above
            self.dropout = nn.Dropout(config.hidden_dropout_prob)
            self.classifier = nn.Linear(config.hidden_size, config.num_labels)
            self.init_weights()

        def forward(
            self,
            input_ids=None,
            attention_mask=None,
            token_type_ids=None
        ):
            outputs = self.bert(
                input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids
            )
            # Project each contextual embedding onto the diacritic classes
            sequence_output = self.dropout(outputs[0])
            return self.classifier(sequence_output)


    Evaluation

    In each section there's a plethora of options for you to choose from. A bit of pragmatic advice I'll offer is to start simple & only add complexity if you want to improve things further. Keep a testing set to measure whether the changes you're making result in improvements or degradation over your previous setup.

    Crucially, I'd suggest that at least a small part of your testing set comes from the texts you have transcribed yourself; the more you use, the better. Primarily, this is data you annotated yourself, so you are more sure of its quality than of any other publicly available source. Secondly, when you are testing on data coming from the target domain, you stand a better chance of evaluating your systems accurately on your target task, since other domains may carry their own biases.
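    As a concrete starting point for such measurements, word-level accuracy on a held-out test set is simple to compute (the parallel lists of gold & predicted sentences are assumed to be aligned):

    def word_accuracy(gold_sentences, predicted_sentences):
        # Fraction of words whose diacritics were restored exactly right
        correct = total = 0
        for gold, pred in zip(gold_sentences, predicted_sentences):
            for g_word, p_word in zip(gold.split(), pred.split()):
                correct += g_word == p_word
                total += 1
        return correct / total

    # e.g. word_accuracy(gold_test_lines, model_output_lines)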