I'm working on code that detects the language of a tweet and applies the lexical matching for that language. The code used to work just fine. Then it started throwing KeyError: 'en' even though 'en' exists in the dictionary. I've looked at multiple questions that already have answers, and nothing in them seems to work. I'll provide only the part of the code that deals with German (so not including the other languages). The code is written so that if the detected language isn't in the dictionary, the tweet is automatically classified as English.
from langdetect import detect
import glob
import re

rsc_lg = {
    "de" : {"pos" : "ressources/positive_words_de.txt",
            "neg" : "ressources/negative_words_de.txt"},
    "en" : {"pos" : "ressources/positive_words_en.txt",
            "neg" : "ressources/negative_words_en.txt"}
}
dic = {}
liste_resultats = []
for path in glob.glob("corpus/*/*/*"):
    f = open(path, errors="ignore")
    read = f.read().lower()
    lang = detect(read)
    if lang not in dic:
        dic[lang] = {}
    if lang not in rsc_lg :
        lang = "en"
    ###german###
    f_de_pos = open(rsc_lg[lang]["pos"])
    f_de_neg = open(rsc_lg[lang]["neg"])
    de_pos = f_de_pos.read().lower().split()
    de_neg = f_de_neg.read().lower().split()
    f_de_pos.close()
    f_de_neg.close()
    words = read.split()
    pos_words_de = set(words) & set(de_pos)
    neg_words_de = set(words) & set(de_neg)
    if len(pos_words_de) > len(neg_words_de):
        diagnostic = "positive"
    if len(pos_words_de) == len(neg_words_de):
        diagnostic = "mixed"
    if len(pos_words_de) < len(neg_words_de):
        diagnostic = "negative"
    # print("this german tweet is ", diagnostic)
    dic[lang][path] = diagnostic
    corpus, lang, classe, nom = re.split("\\\\", path)
    liste_resultats.append([nom, lang, classe, diagnostic])
import json
w = open("resultats_langdetect_german.json", "w")
w.write(json.dumps(liste_resultats, indent= 2))
w.close()
f.close()
print("done")
The error is raised on the line dic[lang][path] = diagnostic, just after the tweet is classified as positive, mixed or negative.
Like I said, this worked just fine before and suddenly stopped working despite me making absolutely no changes whatsoever to the code.
The problem is that when you encounter an unknown language, you first run dic[lang] = {} and immediately afterwards set lang = "en". If lang was, for example, "es", you end up with dic == {"es": {}} and lang == "en". Later in the code you run dic[lang][path] = diagnostic, but at that point "en" is not in dic, because the entry was created under the unknown language code ("es"). This can also explain why the code seemed to work before: as long as a file detected as "en" was processed before any file in an unknown language, dic["en"] already existed. You probably want to switch the order of the two statements, i.e. first set lang = "en" and then do dic[lang] = {}.
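To make the difference concrete, here is a minimal sketch that reproduces the error and shows the swapped order. It leaves out langdetect and the word-list files entirely. The helper names record_buggy and record_fixed, the example paths, and the "detected" language codes are made up for illustration, and using dict.setdefault is just one possible way to create the bucket after the fallback has been applied.

```python
# Word-list paths copied from the question; the files are not opened here.
rsc_lg = {
    "de": {"pos": "ressources/positive_words_de.txt",
           "neg": "ressources/negative_words_de.txt"},
    "en": {"pos": "ressources/positive_words_en.txt",
           "neg": "ressources/negative_words_en.txt"},
}


def record_buggy(dic, detected, path, diagnostic):
    """Original order: create the bucket first, then fall back to "en"."""
    lang = detected
    if lang not in dic:
        dic[lang] = {}
    if lang not in rsc_lg:
        lang = "en"
    dic[lang][path] = diagnostic  # KeyError if no "en" bucket exists yet


def record_fixed(dic, detected, path, diagnostic):
    """Swapped order: fall back to "en" first, then create the bucket."""
    lang = detected if detected in rsc_lg else "en"
    dic.setdefault(lang, {})[path] = diagnostic


# Hypothetical input: a file detected as Spanish ("es") is processed first.
buggy = {}
try:
    record_buggy(buggy, "es", "corpus/a.txt", "mixed")
    error = None
except KeyError as exc:
    error = exc  # the same KeyError: 'en' as in the question

fixed = {}
record_fixed(fixed, "es", "corpus/a.txt", "mixed")
record_fixed(fixed, "de", "corpus/b.txt", "positive")
```

With the original order, buggy ends up holding only an empty "es" entry and the assignment fails. With the swapped order, the Spanish file is stored under "en" and no entry for "es" is ever created.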