Tags: python, nlp, filtering, spacy, vocabulary

Getting a word list for a Python package (spaCy) that covers only the core vocabulary of a specific language


I need to filter out non-core German words from a text using spaCy. However, I couldn't find a suitable approach or a word list that covers only the essential vocabulary of German.

I have tried the spaCy checks nlp(word).has_vector and nlp(word).vector_norm == 0, as well as word lists such as list(nlp.vocab.strings) from 'de_core_news_sm' or 'de_core_news_lg'. They either treat irrelevant words as German or fail to recognize basic German words. I'm looking for recommendations on how to obtain or create a word list that accurately covers only the core vocabulary of German and can be used with spaCy (preferably) or another NLP package. I would prefer a universal package rather than a German-specific one, so that I can extend the approach to other languages just as easily.


Solution

  • One option is a frequency-based approach: use a frequency list that ranks words by how often they occur in written or spoken German. Here is an example repo. Alternatively, you can build such a list yourself from a large corpus.

    I can show a very basic version using spaCy:

    • Define a function to filter out non-core German words. The function should check if a token is in the frequency list.
    • Process your text and apply the function to each token in the processed text.
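    import spacy
    import pandas as pd
    from nltk.stem import Cistem
    
    nlp = spacy.load("de_core_news_sm")
    stemmer = Cistem()
    
    # Load a frequency list of German words, indexed by their CISTEM stem
    df = pd.read_csv('~/decow_wordfreq_cistem.csv', index_col=['word'])
    
    # Collect all stems with a positive frequency into a set for fast lookup
    core_stems = set(df[df['freq'] > 0].index)
    
    # Define a function to filter out non-core German words
    def is_core_german_word(token):
        # A direct df.at[...] lookup raises KeyError for stems that are not
        # in the list (e.g. punctuation), so test membership instead
        return stemmer.stem(token.text.lower()) in core_stems
    
    # Process your text
    text = "Lass uns ein bisschen Spaß haben!"
    doc = nlp(text)
    
    # Keep only the tokens whose stem is in the frequency list
    core_german_words = [token.text for token in doc if is_core_german_word(token)]
    
    print(core_german_words)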
    

    Note that the quality of the results depends on the quality and coverage of the frequency list you use. You may need to combine several approaches, such as CEFR levels or word embeddings, to get a word list that accurately covers only the core vocabulary of German.
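A check like `freq > 0` only tests whether a word appears in the list at all, so what counts as "core" usually comes down to a cutoff. Below is a minimal sketch of a top-N cutoff. The helper name `top_n_vocabulary`, the toy counts, and the value of N are all made up for illustration and would need tuning on a real frequency list.

```python
from collections import Counter

def top_n_vocabulary(freqs, n):
    """Return the n most frequent words of a word -> count mapping as a set."""
    return {word for word, _ in Counter(freqs).most_common(n)}

# Toy counts, invented for illustration only (not taken from a real corpus)
toy_freqs = {"und": 900, "haben": 400, "spaß": 50, "bisschen": 40, "quark": 1}

# Treat the 3 most frequent words as the "core" vocabulary
core = top_n_vocabulary(toy_freqs, 3)

# Keep only tokens whose lowercased form is in the core set
tokens = ["Spaß", "haben", "und", "Quark"]
kept = [t for t in tokens if t.lower() in core]
print(kept)
```

With a real list you would pass the stemmed or lemmatized form of each token, matching however the list itself was built.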

    I am aware that this is very language-specific, but I thought it might be helpful if no other answer came up.
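To stay closer to the language-independent setup asked for in the question, the frequency list can also be built from a corpus of your own, as mentioned above. Here is a rough sketch using only the standard library. The function name `build_frequency_list` and the two-sentence corpus are placeholders, and the regex tokenizer is an assumption that works for space-separated languages; it could be swapped for a spaCy tokenizer.

```python
import re
from collections import Counter

# Runs of Unicode letters (no digits, no underscores); an assumed tokenizer
WORD_RE = re.compile(r"[^\W\d_]+")

def build_frequency_list(lines):
    """Count lowercased word occurrences over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return counts

# Placeholder corpus; in practice, iterate over the lines of a large text file
corpus = [
    "Wir haben Spaß.",
    "Wir haben Zeit und wir haben Lust.",
]

freqs = build_frequency_list(corpus)
print(freqs.most_common(2))
```

The resulting counts are only as representative as the corpus, so a small or domain-specific corpus will give a skewed "core" vocabulary.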