
Using Arabert model with SpaCy


spaCy doesn't support the Arabic language, but can I use spaCy with the pretrained AraBERT model?

Is it possible to modify this code so that it accepts bert-large-arabertv02 instead of en_core_web_lg?

!python -m spacy download en_core_web_lg
import spacy
nlp = spacy.load("en_core_web_lg")

Here is how we can load AraBERTv02:

from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name="aubmindlab/bert-large-arabertv02"
arabert_prep = ArabertPreprocessor(model_name=model_name)  
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
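For context, a masked-LM head like the one above outputs one logit per vocabulary item at each token position, and the prediction for a masked slot is the argmax of the softmax over that row. A minimal NumPy sketch with mock logits (a tiny made-up vocabulary stands in for the real `(batch, seq_len, vocab_size)` tensor the model would return, so this runs without downloading anything):

```python
import numpy as np

# Mock logits for a single masked position. In reality these would come
# from the AutoModelForMaskedLM forward pass, shaped (batch, seq_len, vocab_size);
# here we fake one row with a tiny 10-item vocabulary.
rng = np.random.default_rng(42)
vocab_size = 10
mask_logits = rng.standard_normal(vocab_size)

# Softmax turns the logits into a probability distribution over the vocabulary.
probs = np.exp(mask_logits - mask_logits.max())
probs /= probs.sum()

# The fill-mask prediction is the highest-probability vocabulary id.
predicted_id = int(probs.argmax())
print(predicted_id)
```

With the real model, the tokenizer's `mask_token` would mark the slot to predict, and `predicted_id` would be decoded back to a wordpiece via the tokenizer's vocabulary.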

Solution

  • spaCy actually does support Arabic, though only at an alpha level, which basically just means tokenization support (see here). That's enough for loading external models or training your own, though, so in this case you should be able to load it like any HuggingFace model - see this FAQ.

    In this case this would look like:

    import spacy
    nlp = spacy.blank("ar") # blank Arabic pipeline
    # create the config with the name of your model
    # values omitted will get default values
    config = {
        "model": {
            "@architectures": "spacy-transformers.TransformerModel.v3",
            "name": "aubmindlab/bert-large-arabertv02"
        }
    }
    nlp.add_pipe("transformer", config=config)
    nlp.initialize() # XXX don't forget this step!
    doc = nlp("فريك الذرة لذيذة") # Arabic: "corn freekeh is delicious"
    print(doc._.trf_data) # all the Transformer output is stored here
    

    I don't speak Arabic, so I can't check the output thoroughly, but that code ran and produced an embedding for me.
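    Note that `trf_data` holds the raw wordpiece tensors, not a ready-made sentence vector. A common next step is to mean-pool the token embeddings into a single vector; here is a minimal NumPy sketch using a mock array in place of the real output (in spacy-transformers the last hidden states typically live in `doc._.trf_data.tensors[0]`, and bert-large models use a hidden size of 1024):

    ```python
    import numpy as np

    # Mock stand-in for doc._.trf_data.tensors[0]: shaped
    # (n_spans, n_wordpieces, hidden_size). The span count, wordpiece
    # count, and hidden size here are illustrative only.
    rng = np.random.default_rng(0)
    last_hidden = rng.standard_normal((1, 7, 1024))  # 1 span, 7 wordpieces

    # Mean-pool over the wordpiece axis to get one vector per span --
    # a simple way to turn transformer output into a sentence embedding.
    sentence_vector = last_hidden.mean(axis=1)

    print(sentence_vector.shape)
    ```

    With the real pipeline you would substitute `doc._.trf_data.tensors[0]` for the mock array; the pooling step is unchanged.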