As a first step, I need to identify all abbreviations and hyphenated words in my sentences and print them as they are found. My code does not identify them correctly.
import re
sentence_stream2=df1['Open End Text']
for sent in sentence_stream2:
    abbs_ = re.findall(r'(?:[A-Z]\.)+', sent)  # abbreviations
    hypns_ = re.findall(r'\w+(?:-\w+)*', sent)  # hyphenated words
    print("new sentence:")
    print(sent)
    print(abbs_)
    print(hypns_)
One of the sentences in my corpus is: DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
The output for this is:
new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
[]
['DevOps', 'with', 'APIs', 'event-driven', 'architecture', 'using', 'cloud', 'Data', 'Analytics', 'environment', 'Self-service', 'BI']
The expected output is:
new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
['APIs','BI']
['event-driven','Self-service']
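Your rule for abbreviations does not match here: it only finds capital letters that are each followed by a period, and this sentence contains none. You want to find any word with more than one consecutive capital letter. A rule you could use would be: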
abbs_ = re.findall(r'(?:[A-Z]{2,}s?\.?)', sent) #abbreviations
This would match APIs and BI.
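One thing to watch for: without word boundaries, the rule can also match a run of capitals inside a longer word. The sketch below compares the rule above with a `\b`-anchored variant. The anchored pattern, the name `CAPS_ABBREV` and the sample string are illustrative assumptions, not part of the original code, and the variant drops the optional trailing period.

```python
import re

# Hypothetical name: \b stops a match from starting or ending inside a longer word.
CAPS_ABBREV = re.compile(r"\b[A-Z]{2,}s?\b")

sample = "iOS apps call APIs for BI"  # made-up sentence for illustration

unanchored = re.findall(r"(?:[A-Z]{2,}s?\.?)", sample)  # rule from above
anchored = CAPS_ABBREV.findall(sample)                  # word-boundary variant

print(unanchored)  # ['OS', 'APIs', 'BI']  ("OS" comes from inside "iOS")
print(anchored)    # ['APIs', 'BI']
```

Whether "iOS" should count as an abbreviation depends on your data, so treat the anchored form as one option to try.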
import re

t = "DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI"
abbs_ = re.findall(r'(?:[A-Z]\.)+', t)  # your rule: capital letters each followed by a period
cap_ = re.findall(r'(?:[A-Z]{2,}s?\.?)', t)  # abbreviations: two or more consecutive capitals
hypns_ = re.findall(r'\w+-\w+', t)  # hyphenated words, fixed
print("new sentence:")
print(t)
print(abbs_)
print(cap_)
print(hypns_)
Output:
new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
[] # your abbreviation rule: finds no capital letter followed by a period
['APIs', 'BI'] # cap_ rule
['event-driven', 'Self-service'] # fixed hyphen rule
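The original hyphen rule returned every word because `(?:-\w+)*` lets the hyphenated part repeat zero times, so a plain `\w+` is enough for a match. The sketch below compares three quantifier choices on a made-up sentence. The third pattern, `\w+(?:-\w+)+`, is a suggested variant for words with more than one hyphen and is not part of the code above.

```python
import re

text = "a state-of-the-art event-driven tool"  # made-up sentence for illustration

optional = re.findall(r"\w+(?:-\w+)*", text)  # hyphen part may occur zero times
single = re.findall(r"\w+-\w+", text)         # exactly one hyphen per match
repeated = re.findall(r"\w+(?:-\w+)+", text)  # one or more hyphenated parts

print(optional)  # ['a', 'state-of-the-art', 'event-driven', 'tool']
print(single)    # ['state-of', 'the-art', 'event-driven']
print(repeated)  # ['state-of-the-art', 'event-driven']
```

If your corpus only has single-hyphen words, `\w+-\w+` is sufficient. The `+` variant just keeps longer compounds in one piece.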
This will most probably not find all abbreviations, for example those in
t = "Prof. Dr. S. Quakernack"
so you might need to tweak it using some more data and a tool such as http://www.regex101.com
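As a starting point for that tweaking, here is a minimal sketch that combines three alternatives in one pattern: runs of capitals, dotted initials, and short capitalized words ending in a period. The function name `find_abbreviations` and the length limit of one to three lowercase letters are assumptions chosen for this example. The last alternative will also catch a short capitalized word at the end of a sentence (such as "Yes."), so check it against your own data.

```python
import re

# Hypothetical combined pattern; each alternative is an assumption to tune.
ABBREV = re.compile(
    r"\b(?:"
    r"[A-Z]{2,}s?\b"        # APIs, BI
    r"|(?:[A-Z]\.)+"        # S., U.S.A.
    r"|[A-Z][a-z]{1,3}\."   # Prof., Dr. (also short words ending a sentence)
    r")"
)

def find_abbreviations(sentence):
    """Return all substrings matching the combined abbreviation pattern."""
    return ABBREV.findall(sentence)

print(find_abbreviations("Prof. Dr. S. Quakernack"))
# ['Prof.', 'Dr.', 'S.']
print(find_abbreviations("DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI"))
# ['APIs', 'BI']
```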