python, regex, nlp

REGEX: Remove sentences written entirely in Greek capital letters


I want to remove only the sentences that consist entirely of Greek capital letters. Here are some examples:

input1 = 'Καλημέρα κόσμε'
output1 = 'Καλημέρα κόσμε'
input2 = 'ΚΑΝΕΙ ΠΟΛΥ ΖΕΣΤΗ. Καθε ΣΚ.'
output2 = 'Καθε ΣΚ.'
input3 = 'Ο ΑΛΕΞΑΝΔΡος σπουδαζει στατιστικη'
output3 = 'Ο ΑΛΕΞΑΝΔΡος σπουδαζει στατιστικη'

I checked this previous question, https://stackoverflow.com/questions/60738190/regular-expression-to-find-a-series-of-uppercase-words-in-a-string, and wrote the function below, but it isn't working. I would be grateful for any help.

def remove_sent_capital(input):

  greek_capital_chars = set(chr(cp) for cp in range(0x0370, 0x1FFF) if "GREEK CAPITAL" in unicodedata.name(chr(cp), "")) 
  chars_class = re.escape("".join(greek_capital_chars.union(string.ascii_uppercase)))
  input = re.sub('\b[{chars_class}\s]+(?:\s+[{chars_class}\s]+)*\b', '', input)
  
  return input

EDIT: Maybe this is helpful: [image not included]


Solution

  • As I understand it, you want to remove the sentences in which ALL letters are Greek capitals. If a sentence is defined as a run of characters ending with '.', you can do the following:

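    import unicodedata

    def remove_sent_capital(x):
        # Every code point in the range whose Unicode name contains "GREEK CAPITAL"
        greek_capital_chars = set(chr(cp) for cp in range(0x0370, 0x1FFF) if "GREEK CAPITAL" in unicodedata.name(chr(cp), ""))
        # Split on '.', then drop the pieces made up only of Greek capitals and spaces
        # (the empty piece after a trailing '.' is dropped too, since all() of nothing is True)
        s = x.split('.')
        s = [i for i in s if not all(k in greek_capital_chars for k in i if k != ' ')]
        return '.'.join(s)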
    

    Examples:

    >>> remove_sent_capital('Καλημέρα κόσμε')
    #'Καλημέρα κόσμε'
    >>> remove_sent_capital('ΚΑΝΕΙ ΠΟΛΥ ΖΕΣΤΗ. Καθε ΣΚ.')
    #' Καθε ΣΚ'
    >>> remove_sent_capital('Ο ΑΛΕΞΑΝΔΡος σπουδαζει στατιστικη')
    #'Ο ΑΛΕΞΑΝΔΡος σπουδαζει στατιστικη'
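
    Note that the second result, ' Καθε ΣΚ', has a leading space and has lost its final '.', while the question asks for 'Καθε ΣΚ.'. If that matters, one possible variant is sketched below. The name remove_capital_sentences and the constant GREEK_CAPITALS are my own, hypothetical choices, and the sketch assumes that a sentence ends at '.' only and that a sentence should be dropped only when it contains at least one letter and every letter is a Greek capital:

```python
import re
import unicodedata

# Same character set as in remove_sent_capital: code points whose
# Unicode name contains "GREEK CAPITAL".
GREEK_CAPITALS = {
    chr(cp)
    for cp in range(0x0370, 0x1FFF)
    if "GREEK CAPITAL" in unicodedata.name(chr(cp), "")
}


def remove_capital_sentences(text):
    """Hypothetical variant: drop all-capital sentences, keep the '.' of the rest."""
    # Each match is either a chunk ending in '.' or a final chunk without one,
    # so the period stays attached to its sentence.
    sentences = re.findall(r"[^.]*\.|[^.]+", text)
    kept = []
    for sentence in sentences:
        letters = [ch for ch in sentence if ch.isalpha()]
        # Skip the sentence only if it has letters and all of them are Greek capitals.
        if letters and all(ch in GREEK_CAPITALS for ch in letters):
            continue
        kept.append(sentence)
    # strip() removes the space left behind when the first sentence is dropped
    # (and any other leading or trailing whitespace of the whole text).
    return "".join(kept).strip()


print(remove_capital_sentences('ΚΑΝΕΙ ΠΟΛΥ ΖΕΣΤΗ. Καθε ΣΚ.'))
```

    With this sketch the three inputs from the question give 'Καλημέρα κόσμε', 'Καθε ΣΚ.' and 'Ο ΑΛΕΞΑΝΔΡος σπουδαζει στατιστικη'. Other sentence terminators such as '!' or ';' are not handled.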
    
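
    As for why the attempt in the question does nothing useful, two things stand out to me, though there may be more. The pattern is a plain string rather than an f-string, so {chars_class} is never substituted, and it is not a raw string, so '\b' is a backspace character instead of a word boundary. The small demonstration below uses 'ΓΔΘ' as an illustrative stand-in for the real character class:

```python
import re

# '\b' in a normal string literal is the backspace character (U+0008);
# only the raw string r'\b' reaches the regex engine as a word boundary.
backspace = '\b'
word_boundary = r'\b'

chars_class = 'ΓΔΘ'  # illustrative stand-in for the escaped capital-letter class

# Without the f prefix the braces stay in the pattern as literal text.
plain_pattern = r'[{chars_class}]+'
formatted_pattern = rf'[{chars_class}]+'

print(len(backspace), len(word_boundary))          # 1 2
print(plain_pattern)                               # [{chars_class}]+
print(formatted_pattern)                           # [ΓΔΘ]+
print(re.findall(plain_pattern, 'ΓΔ γδ ΘΘ'))       # []
print(re.findall(formatted_pattern, 'ΓΔ γδ ΘΘ'))   # ['ΓΔ', 'ΘΘ']
```

    Even with an rf'...' pattern, a regex that matches runs of capital words would probably also remove capitalised words inside mixed sentences (such as 'ΣΚ' in 'Καθε ΣΚ.'), which is why I went with splitting into sentences instead.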

    Hope this helps :)