I am trying to split a text into sentences correctly, based on normal grammatical rules, in Python.
The sentence I want to split is
s = """Mr. Smith bought cheapsite.com for 1.5 million dollars,
i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a
probability of .9 it isn't."""
The expected output is
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
To achieve this I am using regular expressions. After a lot of searching I came upon the following regex, which does the trick. new_str is just s with some \n removed.
import re
m = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', new_str)
for i in m:
    print(i)
Mr. Smith bought cheapsite.com for 1.5 million dollars,i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with aprobability of .9 it isn't.
So the way I understand the regex is that we are first selecting
1) All the character sequences like i.e.
2) From the spaces left after the first selection, we select those that do not follow words like Mr., Mrs., etc.
3) From the result of step 2, we select only those spaces that come right after either a dot or a question mark.
So I tried to change the order as below
1) Filter out all the titles first.
2) From the filtered step, select the spaces that come right after a dot or a question mark.
3) Remove all phrases like i.e.
But when I do that, the text is also split at the blank after i.e.
m = re.split(r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)', new_str)
for i in m:
    print(i)
Mr. Smith bought cheapsite.com for 1.5 million dollars,i.e.
he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with aprobability of .9 it isn't.
Shouldn't the last step in the modified procedure be capable of identifying phrases like i.e.? Why is it failing to detect it?
First, the last . in (?<!\w\.\w.) looks suspicious. If you need to match a literal dot with it, escape it: (?<!\w\.\w\.).
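To see what the unescaped dot changes, here is a minimal sketch. The test string "a.bc d" is made up for illustration and is not from the question; it only isolates the first lookbehind.

```python
import re

# Hypothetical test string: "a.bc" looks like \w\.\w followed by ANY character.
text = "a.bc d"

# Unescaped final dot: "a.bc" matches \w\.\w. so the negative lookbehind
# rejects the whitespace and nothing is split.
loose = re.split(r'(?<!\w\.\w.)\s', text)

# Escaped final dot: "a.bc" does not end in a literal dot, so the lookbehind
# lets the whitespace match and the string is split.
strict = re.split(r'(?<!\w\.\w\.)\s', text)

print(loose)   # ['a.bc d']
print(strict)  # ['a.bc', 'd']
```

With the unescaped version, any four-character run shaped like "x.yz" suppresses a split, which is probably broader than intended.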
Coming back to the question: when you use the r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)' regex, the last negative lookbehind checks that the position after the whitespace is not preceded by a word char, a dot, a word char and any char (since the . is unescaped). This condition is true, because the four characters before that position are a dot, e, another . and a space, and the leading dot is not a word char.
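The shift by one character can be checked by hand. The sketch below slices out the four characters each lookbehind placement would inspect around the blank after "i.e."; the shortened sample text and the variable names are illustrative assumptions only.

```python
import re

# Shortened, illustrative sample: only the part around "i.e." matters here.
text = "dollars,i.e. he paid"
space = text.index(" ")  # the blank right after "i.e."

# Four characters seen by a lookbehind placed BEFORE \s (original regex).
before_space = text[space - 4:space]
# Four characters seen by a lookbehind placed AFTER \s (reordered regex).
after_space = text[space - 3:space + 1]

abbrev = re.compile(r'\w\.\w.')

print(repr(before_space), bool(abbrev.fullmatch(before_space)))  # 'i.e.' True
print(repr(after_space), bool(abbrev.fullmatch(after_space)))    # '.e. ' False
```

Before the whitespace, the window is "i.e." and the abbreviation pattern matches, so the negative lookbehind blocks the split. After the whitespace, the window has moved to ".e. ", the pattern no longer matches, and the split goes through.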
To make the lookbehind work the same way as when it was before \s, put the \s into the lookbehind pattern, too:
(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.\s)
See the regex demo
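As a quick local check, the regex can be run on the sample text. This sketch assumes the line breaks in s were replaced with single spaces (the question's new_str dropped them entirely, which glues some words together), so text here is an approximation of the asker's input.

```python
import re

# Assumption: the sample text with its line breaks replaced by spaces.
text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, "
        "i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks "
        "he didn't. In any case, this isn't true... Well, with a "
        "probability of .9 it isn't.")

# The reordered regex, with \s added to the trailing lookbehind.
pattern = r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.\s)'

parts = re.split(pattern, text)
for part in parts:
    print(part)
```

Under that assumption the text splits into five sentences, and the blank after "i.e." is left alone.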
Another enhancement is to use a character class in the second lookbehind: (?<=\.|\?) -> (?<=[.?]).
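A small sketch comparing the two spellings on a made-up sentence (the test string is an assumption, not from the question). Both should split at the same places, since [.?] matches exactly the same single characters as \.|\?.

```python
import re

# Hypothetical test string covering an abbreviation, a title and a question mark.
text = "He left, i.e. he quit. Did he mind? Dr. Who knows."

with_alternation = r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.\s)'
with_char_class = r'(?<![A-Z][a-z]\.)(?<=[.?])\s(?<!\w\.\w.\s)'

parts_alt = re.split(with_alternation, text)
parts_cls = re.split(with_char_class, text)

print(parts_cls)             # ['He left, i.e. he quit.', 'Did he mind?', 'Dr. Who knows.']
print(parts_alt == parts_cls)  # True
```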