Printing of abbreviations and hyphenated words

I need to identify all abbreviations and hyphenated words in my sentences to start. They need to be printed as they get identified. My code does not seem to be functioning well for this identification.

import re

sentence_stream2=df1['Open End Text']
for sent in sentence_stream2:
    abbs_ = re.findall(r'(?:[A-Z]\.)+', sent) #abbreviations
    hypns_= re.findall(r'\w+(?:-\w+)*', sent) #hyphenated words

    print("new sentence:")
    print(sent)
    print(abbs_)
    print(hypns_)

One of the sentences in my corpus is: DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI

The output for this is:

new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
[]
['DevOps', 'with', 'APIs', 'event-driven', 'architecture', 'using', 'cloud', 'Data', 'Analytics', 'environment', 'Self-service', 'BI']

expected output is:

new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
['APIs','BI']
['event-driven','Self-service']

Solution

Your rule for abbreviations does not match. You want to find any words with more then 1 consecutive capital letter, a rule you could use would be:

abbs_ = re.findall(r'(?:[A-Z]{2,}s?\.?)', sent) #abbreviations

This would match APIs and BI.

t = "DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI"

import re

abbs_ = re.findall(r'(?:[A-Z]\.)+', t) #abbreviations
cap_ = re.findall(r'(?:[A-Z]{2,}s?\.?)', t) #abbreviations
hypns_= re.findall(r'\w+-\w+', t) #hyphenated words fixed

print("new sentence:")
print(t)
print(abbs_)
print(cap_)
print(hypns_)

Output:

DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
[]  # your abbreviation rule - does not find any capital letter followed by .
['APIs', 'BI'] # cap_ rule
['event-driven', 'Self-service']  # fixed hyphen rule

This will most probably not find all abbreviations like

t = "Prof. Dr. S. Quakernack"

so you might need to tweak it using some more data and f.e. http://www.regex101.com