python cpu-word capitalize

Find capitalized words in a text


How can I find the words in a text that start with a capital letter, together with each word's position in the text? If no such word is found, the output should be None. Words at the beginning of a sentence should not be considered. Numbers should not be considered, and if a word ends with a semicolon (or, as in the example, a comma), that punctuation mark should be omitted.

Like the following example:

Input:

The University of Edinburgh is a public research university in Edinburgh, Scotland. The University of Texas was included in the Association of American Universities in 1929.

Output:

2:University
4:Edinburgh
11:Edinburgh
12:Scotland
14:University
16:Texas
21:Association
23:American
24:Universities

Solution

  • The words at the beginning of the sentence should not be considered

    This makes the task harder because you first have to determine how sentences are separated. A sentence can end with a punctuation mark such as ., ! or ?. However, the last sentence in your example is not closed with a full stop, so your corpus must be preprocessed first.
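
As a minimal sketch of that preprocessing step, the snippet below appends a full stop when the text does not already end with ., ! or ?. The helper name `ensure_terminated` and its `terminator` parameter are hypothetical, introduced only for illustration.

```python
import re

def ensure_terminated(text, terminator="."):
    """Append `terminator` if the text does not already end with '.', '!' or '?'."""
    stripped = text.rstrip()
    # Only non-empty text that lacks closing punctuation gets a terminator.
    if stripped and not re.search(r"[.!?]$", stripped):
        return stripped + terminator
    return stripped

example = "The University of Texas was included in the Association of American Universities in 1929"
print(ensure_terminated(example))
```

This only repairs the end of the text. A missing full stop between two sentences in the middle of the corpus cannot be detected this way.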


    Putting this issue aside, suppose this scenario:

    import re

    inp = "The University of Edinburgh is a public research university in Edinburgh, Scotland. The University of Texas was included in the Association of American Universities in 1929! The last Sentence."

    # Split the text into sentences; each one must end with '.', '!' or '?'.
    sentences = re.findall(r"[\w\s,]*[.!?]", inp)
    counter = 0    # number of words in the sentences processed so far
    found = False  # becomes True once at least one word is printed
    for sentence in sentences:
        # Replace punctuation with spaces, then split on whitespace.
        words = re.sub(r"\W", " ", sentence).split()
        for i, word in enumerate(words):
            # Skip the first word of each sentence; keep words that start with a capital letter.
            if i != 0 and re.match(r"[A-Z]", word):
                print("%d:%s" % (counter + i + 1, word))
                found = True
        counter += len(words)
    if not found:
        print(None)
    

    This code produces the output you asked for. It is not best practice, but it is short and simple. Note that every sentence in the input must end with a punctuation mark (., ! or ?), otherwise the sentence is not matched.


    The output:

    2:University
    4:Edinburgh
    11:Edinburgh
    12:Scotland
    14:University
    16:Texas
    21:Association
    23:American
    24:Universities
    29:Sentence
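
For reuse, the same idea can be wrapped in a function. The sketch below is one possible variant, not the only way to do it: the name `capitalized_words` is hypothetical, and it assumes that sentences are separated by ., ! or ? followed by whitespace. Because it splits after the punctuation instead of matching it, a final sentence without a full stop is still processed, and the None case from the question is handled by the caller.

```python
import re

def capitalized_words(text):
    """Return (position, word) pairs for capitalized words that do not start a sentence."""
    results = []
    position = 0  # 1-based index of the last word counted so far
    # Split after '.', '!' or '?' so an unterminated final sentence is kept too.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"\w+", sentence)  # drops commas, semicolons, etc.
        for i, word in enumerate(words):
            position += 1
            # Skip the first word of each sentence; digits are never uppercase.
            if i > 0 and word[0].isupper():
                results.append((position, word))
    return results

text = ("The University of Edinburgh is a public research university in Edinburgh, Scotland. "
        "The University of Texas was included in the Association of American Universities in 1929")
matches = capitalized_words(text)
if matches:
    for position, word in matches:
        print(f"{position}:{word}")
else:
    print(None)
```

Two caveats apply to this sketch: abbreviations such as "Dr. Smith" are treated as sentence boundaries, and `str.isupper` also accepts non-ASCII capitals, unlike the `[A-Z]` pattern.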