I am looking to break down a paragraph into sentences and then into 'exploded' strings, but I need to keep the punctuation marks as elements of the array.
Example text:
$meta = 'I am looking to break this paragraph into chunks.
I have researched, tried and tested various combinations; however, I cannot
seem to make it work. Would anyone help me figure this out?
I thank you in advance...';
Desired output:
Array ( [0] =>
Array ( [0] => I [1] => am [2] => looking [3] => to [4] => break [5] => this [6] => paragraph [7] => into [8] => chunks [9] => . )
[1] =>
Array ( [0] => I [1] => have [2] => researched [3] => , [4] => tried [......
......] [4] => figure [5] => this [6] => out [7] => ? )
[3] =>
Array ( [0] => I [1] => thank [2] => you [3] => in [4] => advance [5] => ... )
)
I have tried
$s = preg_split('/\s*[!?.]\s*/u', $meta, -1, PREG_SPLIT_NO_EMPTY);
to separate out the sentences, but whilst this works, the punctuation disappears.
I would really appreciate help with building this two-level array with the punctuation preserved.
You could do what you want using preg_match_all:
$meta = 'I am looking to break this paragraph into chunks.
I have researched, tried and tested various combinations; however, I cannot
seem to make it work. Would anyone help me figure this out?
I thank you in advance...';
preg_match_all('/(\w+|[.;?,]+)/', $meta, $m);
print_r($m);
Explanation:
/ : regex delimiter
( : begin group 1
\w+ : 1 or more word characters (alphanumeric or underscore) <=> [a-zA-Z0-9_]
| : OR
[.;?,]+ : 1 or more punctuation characters from the set . ; ? ,
) : end of group 1
/ : regex delimiter
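This will match, and store in group 1, every word and every run of punctuation characters. Because the whole pattern is a single capturing group, $m[0] (the full matches) and $m[1] (group 1) hold the same values.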
If you want to be Unicode compatible, you could use \p{L} for any letter and \p{P} for punctuation, together with the u modifier so that the pattern and the subject are treated as UTF-8:
/(\p{L}+|\p{P}+)/u
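As a rough illustration of the Unicode variant, here is a minimal sketch run against a made-up sample string ($text is hypothetical and not taken from the question). It assumes the u modifier is set and drops the capturing group, since $m[0] already holds every match:

```php
<?php
// Hypothetical sample text with accented letters (not from the question).
$text = 'Ça va, très bien!';

// \p{L}+ matches a run of letters, \p{P}+ a run of punctuation.
// The "u" modifier makes PCRE read the pattern and subject as UTF-8.
preg_match_all('/\p{L}+|\p{P}+/u', $text, $m);

// $m[0] holds the tokens: Ça, va, ",", très, bien, "!"
print_r($m[0]);
```

Note that \p{L} matches letters only, so unlike \w it skips digits and underscores; adding \p{N} to the first alternative would be one way to keep numbers.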
Output (of the first pattern, as printed by print_r($m)):
Array
(
[0] => Array
(
[0] => I
[1] => am
[2] => looking
[3] => to
[4] => break
[5] => this
[6] => paragraph
[7] => into
[8] => chunks
[9] => .
[10] => I
[11] => have
[12] => researched
[13] => ,
[14] => tried
[15] => and
[16] => tested
[17] => various
[18] => combinations
[19] => ;
[20] => however
[21] => ,
[22] => I
[23] => cannot
[24] => seem
[25] => to
[26] => make
[27] => it
[28] => work
[29] => .
[30] => Would
[31] => anyone
[32] => help
[33] => me
[34] => figure
[35] => this
[36] => out
[37] => ?
[38] => I
[39] => thank
[40] => you
[41] => in
[42] => advance
[43] => ...
)
[1] => Array
(
[0] => I
[1] => am
[2] => looking
[3] => to
[4] => break
[5] => this
[6] => paragraph
[7] => into
[8] => chunks
[9] => .
[10] => I
[11] => have
[12] => researched
[13] => ,
[14] => tried
[15] => and
[16] => tested
[17] => various
[18] => combinations
[19] => ;
[20] => however
[21] => ,
[22] => I
[23] => cannot
[24] => seem
[25] => to
[26] => make
[27] => it
[28] => work
[29] => .
[30] => Would
[31] => anyone
[32] => help
[33] => me
[34] => figure
[35] => this
[36] => out
[37] => ?
[38] => I
[39] => thank
[40] => you
[41] => in
[42] => advance
[43] => ...
)
)
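The result above is one flat list of tokens. To get the two-level array asked for in the question, one possible approach is to split the text into sentences first and then tokenize each sentence. The sketch below is only an illustration under a few assumptions: the names $sentences and $result are made up here, ! is added to the punctuation set, and a sentence is taken to end at any ., ? or ! followed by whitespace (so abbreviations such as "e.g. " would be split incorrectly).

```php
<?php
$meta = 'I am looking to break this paragraph into chunks.
I have researched, tried and tested various combinations; however, I cannot
seem to make it work. Would anyone help me figure this out?
I thank you in advance...';

// Step 1: split on whitespace that follows sentence-ending punctuation.
// The lookbehind (?<=...) is zero-width, so the punctuation is not consumed
// and stays attached to the end of its sentence.
$sentences = preg_split('/(?<=[.?!])\s+/', $meta, -1, PREG_SPLIT_NO_EMPTY);

// Step 2: tokenize each sentence into words and runs of punctuation.
// No capturing group is needed: $m[0] already contains every full match.
$result = [];
foreach ($sentences as $sentence) {
    preg_match_all('/\w+|[.;?!,]+/', $sentence, $m);
    $result[] = $m[0];
}

print_r($result);
```

With the sample text this gives four inner arrays, one per sentence, each ending in its punctuation token (for instance the last one ends with "...").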