Tags: python, pandas, dictionary, nlp, tuples

Remove word not in dictionary


I have a data table containing tuples of words from online reviews. The reviews contain many typos, so I'm trying to remove every word that is not in the dictionary. The dictionary I'm trying to use is KBBI (the Indonesian dictionary), https://pypi.org/project/kbbi/, installed and imported with...

pip install kbbi
from kbbi import KBBI

I have trouble matching my data against the dictionary because I am not familiar with its data type. The function shown in the original resource lets us search for a word and returns its definition. I only need to check whether a word is in the dictionary (another way might be to extract all the text in the dictionary into a txt file). Here's an example of input...

# Look up "anjing" in the dictionary. "Anjing" is Indonesian for "dog".
anjing = KBBI('anjing')
print(anjing)

And its output

an.jing
1. (n)  mamalia yang biasa dipelihara untuk menjaga rumah, berburu, dan sebagainya 〔Canis familiaris〕
2. (n)  anjing yang biasa dipelihara untuk menjaga rumah, berburu, dan sebagainya 〔Canis familiaris〕

This is what I expect the result to look like (the words that are not in the dictionary are removed)...

before                                                    after
[masih, blom, cair, jugagmn, in]                          [masih, cair]
[alhmdllh, sangat, membantu, meski, bunga, cukup, besar]  [alhmdllh, sangat, membantu, meski, bunga, cukup, besar]

Here is what I've tried so far...

def remove_typo(text):
    text = [word for word in text if word in KBBI]
    return text

df['after'] = df['before'].apply(lambda x: remove_typo(x))

I got the error "TypeError: argument of type 'type' is not iterable" on the second line (the list comprehension inside remove_typo).
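The error can be reproduced without kbbi at all, which suggests it comes from how the in operator is used rather than from the library itself. A minimal sketch, assuming only that KBBI is a class; Kamus and contains below are hypothetical names for illustration and are not part of kbbi:

```python
class Kamus:
    """Hypothetical stand-in for a lookup class such as KBBI."""

    def __init__(self, word):
        self.word = word


def contains(container, item):
    """Return the result of `item in container`, or the TypeError it raises."""
    try:
        return item in container
    except TypeError as exc:
        return exc


# `in` asks the right-hand operand whether it holds the item. A plain class
# object is neither a container nor an iterable, so the test raises TypeError.
result_class = contains(Kamus, "anjing")

# A real container, such as a set of known words, supports `in`.
result_set = contains({"anjing", "masih", "cair"}, "anjing")

print(type(result_class).__name__)
print(result_set)
```

So a membership test needs either a real collection of words or a per-word lookup call.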


Solution

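  • I checked the kbbi docs. KBBI is a class, not a collection of words, so a membership test like "word in KBBI" cannot work. Calling KBBI(word) raises TidakDitemukan when the word is not found, so the solution is to wrap the lookup in try-except: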

    import pandas as pd
    from kbbi import KBBI, TidakDitemukan
    
    L = [['masih', 'blom', 'cair', 'jugagmn', 'in'], 
         ['alhmdllh', 'sangat', 'membantu', 'meski', 'bunga', 'cukup', 'besar']]
    
    df = pd.DataFrame({'before': L})
    
    def remove_typo(text):
        # Keep only the words that KBBI can find
        out = []
        for word in text:
            try:
                # KBBI(word) raises TidakDitemukan if the word has no entry
                if KBBI(word):
                    out.append(word)
            except TidakDitemukan:
                pass
        return out
    
    df['after'] = df['before'].apply(remove_typo)
    
    print(df)
                                                  before  \
    0                   [masih, blom, cair, jugagmn, in]   
    1  [alhmdllh, sangat, membantu, meski, bunga, cuk...   
    
                                                after  
    0                                   [masih, cair]  
    1  [sangat, membantu, meski, bunga, cukup, besar]
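Reviews tend to repeat the same words, and as far as I can tell each KBBI(word) call queries the online dictionary, so it may be worth remembering the answer for every word already checked. A minimal sketch using functools.lru_cache; make_checker, fake_lookup, NotFound and KNOWN are hypothetical names so that the example runs offline. With the real library one would presumably pass KBBI and TidakDitemukan instead.

```python
from functools import lru_cache


class NotFound(Exception):
    """Hypothetical stand-in for kbbi's TidakDitemukan, so this runs offline."""


def make_checker(lookup, not_found_exc):
    """Build a cached word -> bool function from a lookup that raises on a miss."""

    @lru_cache(maxsize=None)
    def in_dictionary(word):
        try:
            lookup(word)
            return True
        except not_found_exc:
            return False

    return in_dictionary


calls = []  # records every real lookup, to show the cache at work
KNOWN = {"masih", "cair", "sangat"}  # hypothetical tiny dictionary


def fake_lookup(word):
    """Hypothetical offline lookup; swap in KBBI for real use."""
    calls.append(word)
    if word not in KNOWN:
        raise NotFound(word)
    return word


in_dictionary = make_checker(fake_lookup, NotFound)

rows = [["masih", "blom", "cair"], ["masih", "blom", "sangat"]]
cleaned = [[w for w in row if in_dictionary(w)] for row in rows]

print(cleaned)  # [['masih', 'cair'], ['masih', 'sangat']]
print(calls)    # ['masih', 'blom', 'cair', 'sangat']
```

The repeated words in the second row never reach the lookup. With the DataFrame above, the same checker could be applied as df['before'].apply(lambda words: [w for w in words if in_dictionary(w)]).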