python regex nltk text-extraction stringtokenizer

Regex to find the all-uppercase words in a sentence

I need help with a regex. This is the code I am currently using:

    altbaslik = []
    for line in sentenceIndex:
        finded = re.match(r"\w*[A-Z]\w*[A-Z]\w*|[Ö|Ç|Ş|Ü|Ğ|İ]", line)
        if finded != None:
          finded2 = finded.group()
          altbaslik.append(finded2)


    print(altbaslik)

sentenceIndex is a list of sentences tokenized from a paragraph. For example:

Sample Paragraph:

VODOFONE ARENA ŞANSI Ama asıl önemli olan nokta Murat Çetinkaya, Cumhurbaşkanı Erdoğan ve Başbakan Davutoğlu’nun ittifakıyla seçildi. O süreci ayrıntılı olarak aktaracağım. Hatta Cumhurbaşkanı ve Başbakan’ı aynı isim üzerinde ittifak etmeye götüren kriterlere de değineceğim. Ama bir şey var ki aktarmasam olmaz. Merkez Bankası Başkanı’nın kaderi Dolmabahçe ile Vodafone Arena arasındaki yolculukta belirleniyor.

sentenceIndex:

['VODOFONE ARENA ŞANSI Ama asıl önemli olan nokta Murat Çetinkaya, Cumhurbaşkanı Erdoğan ve Başbakan Davutoğlu’nun ittifakıyla seçildi.','...................','.................']

I need a regex that finds the words written entirely in capital letters in each sentence.

I need to find and extract the "VODOFONE ARENA ŞANSI" part. The regex I am currently using does not work.

NOTE: I am working on Turkish text, so the regex also has to handle the letters Ö, Ç, Ş, Ü, Ğ and İ (the [Ö|Ç|Ş|Ü|Ğ|İ] part of my pattern).

Thanks to everyone who takes the time to help with this :)

Solution

You may use re.findall with

    r'\b[A-ZÖÇŞÜĞİ]+(?:\W+[A-ZÖÇŞÜĞİ]+)*\b'

With the PyPI regex library (installable with pip install regex), you can use \p{Lu} instead of listing the uppercase letters by hand:

    r'\b\p{Lu}+(?:\W+\p{Lu}+)*\b'

See the regex demo.
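To make the difference from the code in the question concrete, here is a small sketch that runs both patterns on a shortened version of the sample sentence. The variable names are only illustrative. As far as I can tell, the original attempt stops after the first word for two reasons: re.match only tries at the start of the string, and \w cannot cross a space.

```python
import re

# Shortened version of the sample sentence from the question.
sentence = "VODOFONE ARENA ŞANSI Ama asıl önemli olan nokta Murat Çetinkaya."

# Pattern from the question: re.match tries only at the start of the string,
# and \w cannot cross the space, so at most one word is returned.
old_pattern = r"\w*[A-Z]\w*[A-Z]\w*|[Ö|Ç|Ş|Ü|Ğ|İ]"
old_match = re.match(old_pattern, sentence)
first_word = old_match.group() if old_match else None

# Pattern from this answer: findall scans the whole string, and the
# (?:\W+...)* group lets one match continue across spaces.
new_pattern = r"\b[A-ZÖÇŞÜĞİ]+(?:\W+[A-ZÖÇŞÜĞİ]+)*\b"
runs = re.findall(new_pattern, sentence)

print(first_word)  # VODOFONE
print(runs)        # ['VODOFONE ARENA ŞANSI']
```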

Details

\b - a word boundary
[A-ZÖÇŞÜĞİ]+ - one or more uppercase letters, Basic Latin plus the listed Turkish ones (\p{Lu} matches any Unicode uppercase letter)
(?:\W+[A-ZÖÇŞÜĞİ]+)* - zero or more repetitions of
- \W+ - one or more non-word characters
- [A-ZÖÇŞÜĞİ]+ - one or more uppercase letters, as above
\b - a word boundary
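The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is just one possible way to write it; the name UPPER_RUN is made up for this sketch.

```python
import re

# Same pattern as above, spelled out with inline comments.
UPPER_RUN = re.compile(
    r"""
    \b                      # word boundary before the first capital
    [A-ZÖÇŞÜĞİ]+            # one all-uppercase word
    (?:                     # then, zero or more times:
        \W+                 #   one or more non-word characters
        [A-ZÖÇŞÜĞİ]+        #   another all-uppercase word
    )*
    \b                      # word boundary after the last capital
    """,
    re.VERBOSE,
)

sample = "VODOFONE ARENA ŞANSI Ama asıl önemli olan nokta Murat Çetinkaya."
print(UPPER_RUN.findall(sample))  # ['VODOFONE ARENA ŞANSI']
```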

See the Python demo:

    import re

    altbaslik = []
    sentenceIndex = ['VODOFONE ARENA ŞANSI Ama asıl önemli olan nokta Murat Çetinkaya, Cumhurbaşkanı Erdoğan ve Başbakan Davutoğlu’nun ittifakıyla seçildi.','...................','.................']
    for line in sentenceIndex:
        # findall returns every run of all-uppercase words found in the line
        found = re.findall(r"\b[A-ZÖÇŞÜĞİ]+(?:\W+[A-ZÖÇŞÜĞİ]+)*\b", line)
        altbaslik.extend(found)  # extending with an empty list is a no-op

    print(altbaslik) # => ['VODOFONE ARENA ŞANSI']

Or, with the PyPI regex library:

    import regex

    altbaslik = []
    sentenceIndex = ['VODOFONE ARENA ŞANSI Ama asıl önemli olan nokta Murat Çetinkaya, Cumhurbaşkanı Erdoğan ve Başbakan Davutoğlu’nun ittifakıyla seçildi.','...................','.................']
    for line in sentenceIndex:
        # \p{Lu} matches any Unicode uppercase letter
        found = regex.findall(r'\b\p{Lu}+(?:\W+\p{Lu}+)*\b', line)
        altbaslik.extend(found)  # extending with an empty list is a no-op

    print(altbaslik) # => ['VODOFONE ARENA ŞANSI']
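One caveat: the pattern also accepts a single capital letter that stands alone as a word, for example the "O" that opens the second sentence of the sample paragraph. If that is not wanted, one option is to require at least two letters per word with {2,}. This is an assumption about what counts as a heading word, not part of the pattern above, and it would also skip genuine one-letter words inside an all-caps heading.

```python
import re

# Pattern from the answer: accepts uppercase words of any length.
any_length = re.compile(r"\b[A-ZÖÇŞÜĞİ]+(?:\W+[A-ZÖÇŞÜĞİ]+)*\b")
# Stricter variant (an assumption): every word needs at least 2 letters.
two_plus = re.compile(r"\b[A-ZÖÇŞÜĞİ]{2,}(?:\W+[A-ZÖÇŞÜĞİ]{2,})*\b")

sentences = [
    "VODOFONE ARENA ŞANSI Ama asıl önemli olan nokta Murat Çetinkaya.",
    "O süreci ayrıntılı olarak aktaracağım.",
]

loose, strict = [], []
for line in sentences:
    loose.extend(any_length.findall(line))
    strict.extend(two_plus.findall(line))

print(loose)   # ['VODOFONE ARENA ŞANSI', 'O']
print(strict)  # ['VODOFONE ARENA ŞANSI']
```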