Tags: python, file, dictionary, count, histogram

Counting sentences by their ending punctuation and plotting a frequency histogram


I am trying to plot a frequency histogram of the sentences in a text that end with an exclamation mark, a question mark, or a period. So far I simply count how many times each of these characters occurs in the text, which is read from a file. My code looks like this:

import matplotlib.pyplot as plt
 
text_file = 'text.txt'
 
marks = '?!.'
lcount = {mark: 0 for mark in marks}

# Count every occurrence of each mark, character by character.
with open(text_file, encoding='utf8') as f:
    for ch in f.read():
        if ch in lcount:
            lcount[ch] += 1
norm = sum(lcount.values())
 
fig = plt.figure()
ax = fig.add_subplot(111)
x = range(3)
ax.bar(x, [lcount[l]/norm * 100 for l in marks], width=0.8,
       color='g', alpha=0.5, align='center')
ax.set_xticks(x)
ax.set_xticklabels(marks)
ax.tick_params(axis='x', direction='out')
ax.set_xlim(-0.5, 2.5)
ax.yaxis.grid(True)
ax.set_ylabel('Sentences with different ending frequency, %')
plt.show()

However, a sentence that ends with an ellipsis (...) is counted as three characters, and therefore as three sentences. More generally, this code counts symbols, not sentences. How can I count sentences instead of marks, and also count the sentences that end with an ellipsis?

Example of file: Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this.


Solution

  • You could split the text into sentences with a regular expression; re.split() works well here. Sample code:

    import re
    string = "Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this."
    print(re.split(r'\.+\s*|!+\s*|\?+\s*', string))
    
    

    Output:

    ['Wanna play', "Let's go", 'It will be definitely good', 'My friend also think so', "However, I don't know", "I don't like this", '']
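
The trailing '' in the output above appears because the text ends with a punctuation mark, so re.split() also returns the empty remainder after the last separator. A minimal sketch of dropping such empty pieces before counting (the names parts and sentences are illustrative, not from the original answer):

```python
import re

text = ("Wanna play? Let's go! It will be definitely good! "
        "My friend also think so. However, I don't know... I don't like this.")

# Same pattern as above, written as a raw string.
parts = re.split(r'\.+\s*|!+\s*|\?+\s*', text)

# Drop empty pieces such as the one left after the final period.
sentences = [p for p in parts if p]

print(len(sentences))
```

Note that this approach discards the punctuation, so it gives the number of sentences but not how each one ends.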
    

    Edited answer: to keep the closing punctuation with each sentence, use re.findall() instead: re.findall(r'([^.?!]+\.+|[^.?!]+!+|[^.?!]+\?+)\s*', string)

    Explanation:

    1. Parentheses () capture a group. When the pattern contains a group, re.findall() returns only the captured part of each match. E.g.:
    import re
    
    a = 'Hello World'
    print(re.findall('l+o', a))   # Match = llo, Output = llo
    print(re.findall('(l+)o', a)) # Match = llo, output = ll
    

    Output:

    ['llo']
    ['ll']  # With parentheses, only the part inside them is returned
    
    2. [^.?!]+ matches one or more characters other than ., ? and !. It consumes the words of a sentence, and as soon as it reaches one of the three punctuation marks the condition fails and the match stops. E.g.:
    import re
    
    a = 'Hello World! My name is Anshumaan. What is Your name?'
    print(re.findall('[^.?!]+', a))
    print(re.findall(r'([^.?!]+)\!+\s*', a))
    print(re.findall(r'([^.?!]+\!+)\s*', a))
    

    Output:

    ['Hello World', ' My name is Anshumaan', ' What is Your name']
    ['Hello World']
    ['Hello World!']
    

    In the first call, the search starts from the left: every character up to the first ! matches, so that run is returned. The search then resumes at the space after the !, which also matches, and continues until the ., and so on.
    In the second call, ! is matched as well, but since only the word-matching part is inside the parentheses, the ! is not returned (\s* matches zero or more whitespace characters). Only 'Hello World' is returned because it is the only run followed by !. In the third call, \!+ is inside the parentheses too, so the ! is returned as well.

    3. Finally, there is the or block. Since we have three punctuation marks, there are three alternatives: a word/phrase ending with ., a word/phrase ending with ! and a word/phrase ending with ?. They are joined with the or character (|), and to filter out the whitespace between sentences, \s* is placed outside the parentheses.

    So,

    re.findall(r'([^.?!]+\.+|[^.?!]+!+|[^.?!]+\?+)\s*', string)
    

    Can be interpreted as:

    find: ('(<one or more characters other than . ! ?> followed by <one or more .>)' or '(<one or more characters other than . ! ?> followed by <one or more !>)' or '(<one or more characters other than . ! ?> followed by <one or more ?>)') <then skip any whitespace, which is not returned because it is outside the parentheses>
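
To tie this back to the question, the sentences found by this pattern can be grouped by their closing mark. The following is a minimal sketch, not part of the original answer: the helper name ending_of is hypothetical, and it assumes an ellipsis is written as three or more ASCII dots (the single character … is not handled).

```python
import re
from collections import Counter

text = ("Wanna play? Let's go! It will be definitely good! "
        "My friend also think so. However, I don't know... I don't like this.")

# One sentence body plus its run of closing marks; whitespace is skipped.
pattern = r'([^.?!]+\.+|[^.?!]+!+|[^.?!]+\?+)\s*'
sentences = re.findall(pattern, text)


def ending_of(sentence):
    """Classify a sentence by its closing punctuation (hypothetical helper)."""
    if sentence.endswith('...'):
        return '...'
    return sentence[-1]


# Count sentences, not characters, per ending type.
counts = Counter(ending_of(s) for s in sentences)
total = sum(counts.values())

for mark in ('?', '!', '.', '...'):
    share = counts[mark] / total * 100 if total else 0.0
    print(f'{mark!r}: {counts[mark]} sentence(s), {share:.1f}%')
```

The resulting percentages could then be passed to ax.bar() in place of the character-based values from the question, using four x positions instead of three so that the ellipsis gets its own bar.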