Printing of abbreviations and hyphenated words


I need to identify all abbreviations and hyphenated words in my sentences and print them as they are identified. My code does not identify them correctly.

import re

sentence_stream2=df1['Open End Text']
for sent in sentence_stream2:
    abbs_ = re.findall(r'(?:[A-Z]\.)+', sent) #abbreviations
    hypns_= re.findall(r'\w+(?:-\w+)*', sent) #hyphenated words

    print("new sentence:")
    print(sent)
    print(abbs_)
    print(hypns_)

One of the sentences in my corpus is: DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI

The output for this is:

new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
[]
['DevOps', 'with', 'APIs', 'event-driven', 'architecture', 'using', 'cloud', 'Data', 'Analytics', 'environment', 'Self-service', 'BI']

The expected output is:

new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
['APIs','BI']
['event-driven','Self-service']

Solution

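  • Your rule for abbreviations does not match anything here, because (?:[A-Z]\.)+ only finds capital letters that are each followed by a period. You want to find any word with two or more consecutive capital letters, so a rule you could use would be: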

    abbs_ = re.findall(r'(?:[A-Z]{2,}s?\.?)', sent) #abbreviations
    

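    This would match APIs and BI. Your hyphen rule \w+(?:-\w+)* matches every word because the trailing * makes the hyphenated part optional; a rule that requires the hyphen, such as \w+-\w+, returns only the hyphenated words.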

    t = "DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI"
    
    import re
    
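    abbs_ = re.findall(r'(?:[A-Z]\.)+', t)  # your original rule: capitals each followed by a period
    cap_ = re.findall(r'(?:[A-Z]{2,}s?\.?)', t)  # runs of two or more capitals, optional plural s and period
    hypns_ = re.findall(r'\w+-\w+', t)  # hyphenated words, fixed: the hyphen is now required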
    
    print("new sentence:")
    print(t)
    print(abbs_)
    print(cap_)
    print(hypns_)
    

    Output:

    new sentence:
    DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
    []  # your abbreviation rule - finds no capital letter followed by a period
    ['APIs', 'BI'] # cap_ rule
    ['event-driven', 'Self-service']  # fixed hyphen rule
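
One caveat on the fixed hyphen rule: \w+-\w+ allows exactly one hyphen per match, so a longer compound would be split into pairs. The sketch below compares the three variants on a made-up sentence (the sentence and variable names are assumptions for illustration, not taken from the question's data). Using + instead of * on the hyphen group is one possible way to keep such compounds whole.

```python
import re

# Hypothetical test sentence, not from the original corpus.
text = "A state-of-the-art, event-driven design with self-service BI"

# Question's rule: the hyphen group is optional (*), so every word matches.
optional_hyphen = re.findall(r'\w+(?:-\w+)*', text)

# Exactly one hyphen: a longer compound is split into pairs.
single_hyphen = re.findall(r'\w+-\w+', text)

# One or more hyphen groups (+): each compound is kept whole.
multi_hyphen = re.findall(r'\w+(?:-\w+)+', text)

print(optional_hyphen)
print(single_hyphen)  # ['state-of', 'the-art', 'event-driven', 'self-service']
print(multi_hyphen)   # ['state-of-the-art', 'event-driven', 'self-service']
```

If the data never contains words with more than one hyphen, the two fixed variants behave the same.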
    

    This will most probably not find every abbreviation, for example those in

    t = "Prof. Dr. S. Quakernack"
    

    so you might need to tweak it against more data, e.g. with a tester such as http://www.regex101.com
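
As a starting point for that tweaking, here is a minimal sketch that combines three alternatives in one pattern: a short list of known titles, dotted initials, and runs of capitals. The title list (Prof, Dr, Mr, Mrs, Ms) and the function name find_abbreviations are assumptions chosen for illustration; real data will likely need a longer list and further adjustment.

```python
import re

# Assumed, deliberately short list of title abbreviations; extend for your data.
TITLES = r'(?:Prof|Dr|Mrs|Mr|Ms)\.'
# Dotted initials such as "S." or "U.S.A."
INITIALS = r'(?:[A-Z]\.)+'
# Two or more capitals with an optional plural s, ending at a word boundary.
CAPS = r'[A-Z]{2,}s?\b'

# \b at the start keeps matches from beginning in the middle of a word.
ABBREVIATION = re.compile(r'\b(?:' + TITLES + '|' + INITIALS + '|' + CAPS + ')')


def find_abbreviations(sentence):
    """Return all abbreviation-like tokens found in the sentence."""
    return ABBREVIATION.findall(sentence)


print(find_abbreviations("Prof. Dr. S. Quakernack"))
# ['Prof.', 'Dr.', 'S.']
print(find_abbreviations(
    "DevOps with APIs & event-driven architecture using cloud "
    "Data Analytics environment Self-service BI"
))
# ['APIs', 'BI']
```

A fixed title list avoids treating every short capitalized word at the end of a sentence as an abbreviation, at the cost of missing titles that are not listed.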