I want to remove only the sentences that are written entirely in Greek capital letters. Here are some examples:
input1 = 'Καλημέρα κόσμε'
output1 = 'Καλημέρα κόσμε'
input2 = 'ΚΑΝΕΙ ΠΟΛΥ ΖΕΣΤΗ. Καθε ΣΚ.'
output2 = 'Καθε ΣΚ.'
input3 = 'Ο ΑΛΕΞΑΝΔΡος σπουδαζει στατιστικη'
output3 = 'Ο ΑΛΕΞΑΝΔΡος σπουδαζει στατιστικη'
I looked at this previous question, https://stackoverflow.com/questions/60738190/regular-expression-to-find-a-series-of-uppercase-words-in-a-string, and wrote the function below, but it isn't working. I would be grateful for any help.
def remove_sent_capital(input):
    greek_capital_chars = set(chr(cp) for cp in range(0x0370, 0x1FFF) if "GREEK CAPITAL" in unicodedata.name(chr(cp), ""))
    chars_class = re.escape("".join(greek_capital_chars.union(string.ascii_uppercase)))
    input = re.sub('\b[{chars_class}\s]+(?:\s+[{chars_class}\s]+)*\b', '', input)
    return input
As I understand it, you want to remove the sentences in which ALL letters are Greek capitals. If a sentence is defined as a sequence of characters ending with '.', you can do the following:
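import unicodedata

def remove_sent_capital(x):
    # Collect every character whose Unicode name contains "GREEK CAPITAL".
    greek_capital_chars = set(chr(cp) for cp in range(0x0370, 0x1FFF) if "GREEK CAPITAL" in unicodedata.name(chr(cp), ""))
    # Split on '.', then drop the pieces made up only of Greek capitals (spaces ignored).
    # Empty pieces, such as the one after a trailing '.', are dropped as well.
    s = x.split('.')
    s = [i for i in s if not all([k in greek_capital_chars for k in i if k!=' '])]
    return '.'.join(s)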
Examples:
>>> remove_sent_capital('Καλημέρα κόσμε')
#'Καλημέρα κόσμε'
>>> remove_sent_capital('ΚΑΝΕΙ ΠΟΛΥ ΖΕΣΤΗ. Καθε ΣΚ.')
#' Καθε ΣΚ'
>>> remove_sent_capital('Ο ΑΛΕΞΑΝΔΡος σπουδαζει στατιστικη')
#'Ο ΑΛΕΞΑΝΔΡος σπουδαζει στατιστικη'
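Note that the second result above, ' Καθε ΣΚ', keeps a leading space and loses the final '.', while the question asks for 'Καθε ΣΚ.'. If that matters, here is a rough sketch of a variant that keeps each surviving sentence together with its period. The names remove_capital_sentences and GREEK_CAPITALS are made up for this example, and it still assumes that '.' is the only sentence terminator:

```python
import re
import unicodedata

# Same character set as in the function above.
GREEK_CAPITALS = {
    chr(cp)
    for cp in range(0x0370, 0x1FFF)
    if "GREEK CAPITAL" in unicodedata.name(chr(cp), "")
}


def remove_capital_sentences(text):
    """Drop sentences whose letters are all Greek capitals; keep the '.' of the rest."""
    # Each match is a run of non-period characters plus an optional closing '.'.
    sentences = re.findall(r"[^.]+\.?", text)
    kept = []
    for sentence in sentences:
        letters = [ch for ch in sentence if ch.isalpha()]
        # Keep a sentence if it has no letters at all,
        # or if at least one letter is not a Greek capital.
        if not letters or any(ch not in GREEK_CAPITALS for ch in letters):
            kept.append(sentence)
    # Trim the whitespace left at the edges by removed sentences.
    return "".join(kept).strip()
```

With the second example from the question, remove_capital_sentences('ΚΑΝΕΙ ΠΟΛΥ ΖΕΣΤΗ. Καθε ΣΚ.') should give 'Καθε ΣΚ.'. Only alphabetic characters are checked, so digits and punctuation inside a sentence do not affect whether it counts as all capitals.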
I hope this helped :)