Counting sentences by their ending punctuation mark and plotting a frequency histogram

I am trying to plot a frequency histogram of the sentences in a text that end with an exclamation mark, a question mark, or a period. So far I simply count how many times each of these characters occurs in the text. The text is read from a file. My code looks like this:

import matplotlib.pyplot as plt
 
text_file = 'text.txt'
 
marks = '?!.'
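# One counter per punctuation mark.
lcount = {mark: 0 for mark in marks}

# Note: this counts characters, not sentences, so "..." adds 3 to '.'.
with open(text_file, encoding='utf8') as f:
    for ch in f.read():
        if ch in lcount:
            lcount[ch] += 1
norm = sum(lcount.values())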
 
fig = plt.figure()
ax = fig.add_subplot(111)
x = range(3)
ax.bar(x, [lcount[l]/norm * 100 for l in marks], width=0.8,
       color='g', alpha=0.5, align='center')
ax.set_xticks(x)
ax.set_xticklabels(marks)
ax.tick_params(axis='x', direction='out')
ax.set_xlim(-0.5, 2.5)
ax.yaxis.grid(True)
ax.set_ylabel('Sentences with different ending frequency, %')
plt.show()

However, I can't count sentences that end with an ellipsis (...): my code counts it as three characters, and therefore as three sentences. Moreover, this counts symbols, not actual sentences. How can I count sentences rather than marks, and also count the sentences that end with an ellipsis? Example of the file contents: Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this.

Solution

You could split the text into sentences using a regex. The re.split() function works here. Sample code:

import re
string = "Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this."
print(re.split(r'\.+\s*|!+\s*|\?+\s*', string))  # raw string avoids invalid-escape warnings

Output:

['Wanna play', "Let's go", 'It will be definitely good', 'My friend also think so', "However, I don't know", "I don't like this", '']
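Note that the list ends with an empty string: the text ends with a mark, and re.split() also returns whatever follows the last separator. A minimal sketch of dropping that empty entry before counting (the names parts and sentences are only illustrative):

```python
import re

string = ("Wanna play? Let's go! It will be definitely good! "
          "My friend also think so. However, I don't know... "
          "I don't like this.")

# Same pattern as above, written as a raw string.
parts = re.split(r'\.+\s*|!+\s*|\?+\s*', string)

# Drop empty pieces such as the trailing '' after the final period.
sentences = [part for part in parts if part]
print(len(sentences))  # 6
```

This gives the number of sentences, but the marks themselves are gone, so it cannot tell which mark ended each sentence.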

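Edited answer: re.split() throws the ending marks away. To keep each sentence together with its mark, use re.findall() instead:

re.findall(r'([^.?!]+\.+|[^.?!]+!+|[^.?!]+\?+)\s*', string)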
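For reference, a short sketch of what this call returns for the sample text from the question (the variable name sentences is only illustrative):

```python
import re

string = ("Wanna play? Let's go! It will be definitely good! "
          "My friend also think so. However, I don't know... "
          "I don't like this.")

# Each match is one sentence including its ending mark(s);
# the whitespace between sentences is consumed but not returned.
sentences = re.findall(r'([^.?!]+\.+|[^.?!]+!+|[^.?!]+\?+)\s*', string)
for sentence in sentences:
    print(repr(sentence))
```

The ellipsis stays attached to its sentence as a single ending, so "However, I don't know..." comes back as one item rather than three.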

Explanation:

Parentheses () capture and group whatever the sequence inside them matches. E.g.:

import re

a = 'Hello World'
print(re.findall('l+o', a))   # Match = llo, Output = llo
print(re.findall('(l+)o', a)) # Match = llo, output = ll

Output:

['llo']
['ll']  # With parentheses, only the part inside them is returned

[^.?!]+ matches a run of one or more characters other than ., ?, and !. It consumes the words, and as soon as it meets one of these three punctuation marks the match stops. E.g.:

import re

a = 'Hello World! My name is Anshumaan. What is Your name?'
print(re.findall('[^.?!]+', a))
print(re.findall(r'([^.?!]+)\!+\s*', a))
print(re.findall(r'([^.?!]+\!+)\s*', a))

Output:

['Hello World', ' My name is Anshumaan', ' What is Your name']
['Hello World']
['Hello World!']

In the first case, the search starts from the left: all characters up to the first ! match, so they are returned. It then resumes from the space after the !, since a space also satisfies the pattern, and continues until the ., and so on.
In the second case, the ! is matched as well, but since only the word-matching part is inside the parentheses, the ! is not returned (\s* matches zero or more whitespace characters). In the third case, \! is inside the parentheses too, so the ! is returned as well.

Finally, there is the or block. Since we have three punctuation marks, there are three alternatives: a word/phrase followed by ., a word/phrase followed by !, and a word/phrase followed by ?. They are joined with the or character (|), and in order to filter out whitespace, \s* is placed outside the parentheses.

So,

re.findall(r'([^.?!]+\.+|[^.?!]+!+|[^.?!]+\?+)\s*', string)

can be interpreted as:

find: ('(<any character except . ! ? once or more> then <. once or more>)' or '(<any character except . ! ? once or more> then <! once or more>)' or '(<any character except . ! ? once or more> then <? once or more>)') <then any whitespace, which is matched but not returned since it is outside the parentheses>
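To get from the list of sentences to the counts the question asks for, one option is to classify each match by how it ends. The following is only a sketch under stated assumptions: count_endings and SENTENCE_RE are hypothetical names, and a sentence is treated as ending with an ellipsis when it ends with three or more dots.

```python
import re
from collections import Counter

# Pattern from the answer above, compiled once.
SENTENCE_RE = re.compile(r'([^.?!]+\.+|[^.?!]+!+|[^.?!]+\?+)\s*')

def count_endings(text):
    """Count sentences by ending; '...' is tracked separately from '.'."""
    counts = Counter({'?': 0, '!': 0, '.': 0, '...': 0})
    for sentence in SENTENCE_RE.findall(text):
        # Assumption: three or more trailing dots mean an ellipsis.
        if sentence.endswith('...'):
            counts['...'] += 1
        else:
            # Every match ends with '.', '!' or '?'.
            counts[sentence[-1]] += 1
    return counts

sample = ("Wanna play? Let's go! It will be definitely good! "
          "My friend also think so. However, I don't know... "
          "I don't like this.")
counts = count_endings(sample)
total = sum(counts.values())
for ending, n in counts.items():
    share = n / total * 100 if total else 0.0
    print(ending, n, round(share, 1))
```

The keys and values of counts could then stand in for marks and lcount in the plotting code from the question, with x and the x-axis limits adjusted for four bars. Keep in mind that the pattern treats every period as the end of a sentence, so abbreviations or decimals such as 3.14 would be split as well.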