I tried to make a frequency histogram of sentences that end with exclamation marks, question marks, or periods (I just counted the number of these characters in the text). The text is read from a file. My code looks like this:
import matplotlib.pyplot as plt
text_file = 'text.txt'
marks = '?!.'
lcount = dict([(l, 0) for l in marks])
for l in open(text_file, encoding='utf8').read():
    try:
        lcount[l.upper()] += 1
    except KeyError:
        pass
norm = sum(lcount.values())
fig = plt.figure()
ax = fig.add_subplot(111)
x = range(3)
ax.bar(x, [lcount[l]/norm * 100 for l in marks], width=0.8,
       color='g', alpha=0.5, align='center')
ax.set_xticks(x)
ax.set_xticklabels(marks)
ax.tick_params(axis='x', direction='out')
ax.set_xlim(-0.5, 2.5)
ax.yaxis.grid(True)
ax.set_ylabel('Sentences with different ending frequency, %')
plt.show()
But I can't count sentences that end with an ellipsis (...): my code counts it as three characters, and therefore as three sentences. Moreover, this counts symbols, not actual sentences. How can I count sentences rather than marks, including sentences that end with an ellipsis? Example of file: Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this.
You could try splitting the sentences using a regex. The re.split() function works fine here:
Sample code:
import re
string = "Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this."
print(re.split(r'\.+\s*|!+\s*|\?+\s*', string))
Output:
['Wanna play', "Let's go", 'It will be definitely good', 'My friend also think so', "However, I don't know", "I don't like this", '']
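Note that re.split() discards the punctuation and leaves an empty string after the final mark. If only the total number of sentences is needed, a minimal sketch (the name `sentences` is just illustrative) could drop the empty pieces and count the rest:

```python
import re

string = "Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this."

# Split on runs of '.', '!' or '?' followed by optional whitespace,
# then drop the empty string left after the last punctuation mark.
sentences = [s for s in re.split(r'\.+\s*|!+\s*|\?+\s*', string) if s]

# Number of sentences, with the ellipsis treated as a single ending
print(len(sentences))
```

Because a run of dots is consumed as one separator, the sentence ending in "..." is counted once rather than three times.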
Edited answer: re.findall(r'([^.?!]+\.+|[^.?!]+!+|[^.?!]+\?+)\s*', string)
Parentheses () capture and group whatever the part of the pattern inside them matches. E.g.:
import re
a = 'Hello World'
print(re.findall('l+o', a)) # Match = llo, Output = llo
print(re.findall('(l+)o', a)) # Match = llo, output = ll
Output:
['llo']
['ll'] # With parentheses, only the part inside them is returned
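As a side note, here is a small sketch of two related re.findall() behaviours that explain why the pattern above uses a single outer group: with more than one capturing group the result is a list of tuples, and a non-capturing group (?:...) groups without changing what is returned. The variable names are only for illustration.

```python
import re

a = 'Hello World'

# Two capturing groups: each match is returned as a tuple of the groups
two_groups = re.findall(r'(l+)(o)', a)
print(two_groups)

# Non-capturing group: grouping only, so the whole match is returned
non_capturing = re.findall(r'(?:l+)o', a)
print(non_capturing)
```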
[^.?!]+ refers to a run of one or more characters other than ., ?, and !. It matches all the words, and as soon as it encounters one of the three punctuation marks, the match fails and the search stops. E.g.:
import re
a = 'Hello World! My name is Anshumaan. What is Your name?'
print(re.findall('[^.?!]+', a))
print(re.findall(r'([^.?!]+)\!+\s*', a))
print(re.findall(r'([^.?!]+\!+)\s*', a))
Output:
['Hello World', ' My name is Anshumaan', ' What is Your name']
['Hello World']
['Hello World!']
In the first case, it starts from the left; all characters up to the ! match, so it returns them. Then it starts again from the space after the !, since that also matches the criteria, and goes until the . and so on.
In the second case, ! is also matched, but since only the word-matching part is inside the parentheses, ! is not returned (\s* matches 0 or more whitespace characters).
In the third case, since \! is also inside the parentheses, ! is returned as well.
The or block: since we have 3 punctuation marks, there are 3 criteria: a word/phrase ending with ., a word/phrase ending with !, and a word/phrase ending with ?. They are all joined using the or character (|), and then, in order to filter out whitespace, \s* is placed outside the parentheses. So,
re.findall(r'([^.?!]+\.+|[^.?!]+!+|[^.?!]+\?+)\s*', string)
Can be interpreted as:
find: ('(<character except [.!?] once or more> and <. once or more>)' or '(<character except [.!?] once or more> and <! once or more>)' or '(<character except [.!?] once or more> and <? once or more>)')<also look for whitespace but don't return it since it is not in the parentheses>
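To get back to the original goal of counting sentences by how they end, the findall pattern can be combined with a counter. The following is only a sketch: the function name `count_endings` is hypothetical, and it assumes that a sentence ending in two or more dots should count as an ellipsis.

```python
import re
from collections import Counter

def count_endings(text):
    """Count sentences by their ending: '.', '!', '?' or '...'."""
    # Each match is one sentence together with its closing punctuation
    sentences = re.findall(r'([^.?!]+\.+|[^.?!]+!+|[^.?!]+\?+)\s*', text)
    counts = Counter()
    for sentence in sentences:
        if sentence.endswith('..'):
            # Assumption: two or more trailing dots mean an ellipsis
            counts['...'] += 1
        else:
            # Otherwise classify by the last character
            counts[sentence[-1]] += 1
    return counts

text = ("Wanna play? Let's go! It will be definitely good! "
        "My friend also think so. However, I don't know... "
        "I don't like this.")
counts = count_endings(text)
print(counts)
```

For the sample text this gives two sentences ending with a period, two with !, one with ? and one with an ellipsis. These counts could be used in place of lcount in the plotting code from the question, with the x-range and tick labels extended to four categories.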