My AutoIt script parses text sentence by sentence. Because sentences most likely end in a period, question mark, or exclamation point, I used this to split the text into sentences:
$LineArray = StringSplit($displayed_file, "!?.", 2)
The problem: it deletes the delimiters (the periods, question marks, and exclamation points at the end of sentences). For example, the string "One. Two. Three." is split into "One", "Two", and "Three".
How can I split into sentences while retaining the periods, question marks, and exclamation points that end these sentences?
With StringSplit(), the delimiters are consumed in the process and so are missing from the result. Use StringRegExp() instead:
#include <array.au3>
$string="This is a text. It has several sentences. Really? Of Course!"
$a = stringregexp($string,"(?U)(.*[.?!])",3)
_ArrayDisplay($a)
To remove leading space(s), change the pattern to "(?U)[ ]*?(.*[.?!])". Or change it to "(?U) *?(.*[.?!] )" to split at [.!?] plus a space (this requires adding a space after the last sentence):
#include <array.au3>
$string = "Do you know Pi? Yes! What's it? It's 3.14159! That's correct."
$a = StringRegExp($string & " ", "(?U)[ ]*?(.*[.?!] )", 3)
_ArrayDisplay($a)
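The reason for appending a space can be checked directly. The following is a minimal sketch (the variable names and the short sample text are illustrative assumptions, not part of the original answer) that counts the matches with and without the trailing space:

```autoit
; Pattern from above: a sentence must end in ".", "?" or "!" followed by a space.
Local $sPattern = "(?U)[ ]*?(.*[.?!] )"
Local $sText = "One. Two. Three."

; Without the appended space, the last sentence has no space after its period,
; so it is not matched.
Local $aWithout = StringRegExp($sText, $sPattern, 3)
ConsoleWrite("Without trailing space: " & UBound($aWithout) & " sentences" & @CRLF)

; With the appended space, all three sentences are matched.
Local $aWith = StringRegExp($sText & " ", $sPattern, 3)
ConsoleWrite("With trailing space: " & UBound($aWith) & " sentences" & @CRLF)
```

Each captured sentence includes the space that follows it; StringStripWS($sSentence, 2) removes it if it is not wanted.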
To preserve @CRLF (\r\n) inside sentences:
#include <array.au3>
$string = "Do you " & @CRLF & "know Pi? Yes! What's it? It's" & @CRLF & "3.14159! That's correct."
$a = StringRegExp($string & " ", "(?s)(?U)[ ]*?(.*[.?!][ \R] )", 3)
_ArrayDisplay($a,"Sentences") ;_ArrayDisplay doesn't show @CRLF
For $i In $a
;MsgBox(0,"",$i)
ConsoleWrite(StringStripWS($i, 3) & @CRLF & "---------" & @CRLF)
Next
This does not keep @CRLF when the end of a line is also the end of a sentence: ...line end!" & @CRLF & "Next line...
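One possible way around that case is to swap the trailing-space requirement for a lookahead, so that any whitespace (a space or @CRLF) or the end of the text can close a sentence. This is only a sketch under stated assumptions: the helper name _SplitSentences and the sample text are hypothetical, and the pattern is a different technique from the ones above, not the original answer's method. A line break between two sentences is treated as a separator and skipped, while a line break inside a sentence is kept.

```autoit
; Hypothetical helper (not from the original answer): returns an array of
; sentences, each keeping its closing ".", "?" or "!".
Func _SplitSentences($sText)
    ; (?s)        lets "." match line breaks, so a sentence may span lines
    ; \s*         skips whitespace (including @CRLF) before a sentence
    ; (.+?[.?!])  captures the shortest run that ends in ".", "?" or "!"
    ; (?=\s|$)    accepts it only if whitespace or the end of the text follows
    Local $aResult = StringRegExp($sText, "(?s)\s*(.+?[.?!])(?=\s|$)", 3)
    If @error Then Return SetError(1, 0, 0) ; no sentence found
    Return $aResult
EndFunc

Local $sText = "Is this the line end!" & @CRLF & "Next line. It's 3.14159! Done?"
Local $aSentences = _SplitSentences($sText)
If Not @error Then
    For $sSentence In $aSentences
        ConsoleWrite($sSentence & @CRLF & "---------" & @CRLF)
    Next
EndIf
```

Because the lookahead does not consume anything, no space needs to be appended to the text. Text after the last delimiter is dropped, and abbreviations such as "e.g." followed by a space are still split, just as with the patterns above.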