
Split text containing sentences and whitelisted phrases into separate lines and prepend each line with a counter


I have the following script to split text into sentences. In addition to punctuation, there are a few phrases that I would like to treat as the end of a sentence. This works fine when the phrase is a single word, but not when it contains a space.

This is the code I have that works:

$re = '/# Split sentences on whitespace between them.
(?<=                # Begin positive lookbehind.
  [.!?:\#*]         # Either an end of sentence punct,
| [.!?:][\'"]       # or end of sentence punct and quote,
| [\r\t\n]          # or a line break or tab.
| HYPERLINK
| \.org
| \.gov
| \.aspx
| \.com
| Date
| Dear  
)                   # End positive lookbehind.
(?<!                # Begin negative lookbehind.
  Mr\.              # Skip either "Mr."
| Mrs\.             # or "Mrs.",    
| Ms\.              # or "Ms.",
| Jr\.              # or "Jr.",
| Dr\.              # or "Dr.",
| Prof\.            # or "Prof.",
| U\.S\.A\.
| U\.S\.
| Sr\.              # or "Sr.",
| T\.V\.A\.         # or "T.V.A.",
| a\.m\.            # or "a.m.",
| p\.m\.            # or "p.m.",
| a€¢\.
| :\.

                    # or... (you get the idea).
)                   # End negative lookbehind.
\s+                 # Split on whitespace between sentences.

/ix';

This is an example phrase I have tried to add: "Total Gross Income"

I have tried formatting it in these ways, but none of them work:

$re = '/# Split sentences on whitespace between them.
(?<=                # Begin positive lookbehind.
  [.!?:\#*]         # Either an end of sentence punct,
| [.!?:][\'"]       # or end of sentence punct and quote,
| [\r\t\n]          # or a line break or tab.
| HYPERLINK
| \.org
| \.gov
| \.aspx
| \.com
| Date
| Dear  
| "Total Gross Income"
| Total[ X]Gross[ X]Income
| Total" "Gross" "Income
)  

For example, if I have the following code:

$block_o_text = "You could receive the wrong amount. If you receive more benefits than you    should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross Income Total ResourcesMedical ProgramsHousehold.";

$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_NO_EMPTY);

for ($i = 0; $i < count($sentences); ++$i) {
    echo $i . " - " . $sentences[$i] . "<BR>";
}

The results I get are:

77 - You could receive the wrong amount.
78 - If you receive more benefits than you should, you must pay them back.
79 - When will we review your case?
80 - An eligibility review form will be sent before your benefits stop.
81 - 01/201502/2015
82 - Total Gross Income Total ResourcesMedical ProgramsHousehold 

What I want to get is:

77 - You could receive the wrong amount.
78 - If you receive more benefits than you should, you must pay them back.
79 - When will we review your case?
80 - An eligibility review form will be sent before your benefits stop.
81 - 01/201502/2015
82 - Total Gross Income
83 - Total ResourcesMedical ProgramsHousehold 

What am I doing wrong?


Solution

  • Your problem is the whitespace requirement (\s+) that follows your lookbehinds: the pattern needs at least one whitespace character in order to split. If you remove it, you end up capturing the preceding letter and breaking the whole thing.

    As far as I can tell, you can't do this entirely with lookarounds. Part of the expression can still rely on lookarounds (whitespace preceded by punctuation, etc.), but the specific phrases have to be matched directly.
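
    A related detail that may explain why the quoted variants of the phrase never matched: the pattern in the question uses the x modifier, and in that mode unescaped whitespace in the pattern is ignored. A minimal sketch of the effect, using throwaway variable names ($plain, $escaped, $bracketed) that are not part of the original script:

    ```php
    <?php
    // With the x (extended) modifier, literal whitespace in the pattern is ignored
    // unless it is escaped or sits inside a character class.
    $text = 'Total Gross Income';

    // The spaces are dropped, so this pattern is effectively "TotalGrossIncome".
    $plain = preg_match('/Total Gross Income/x', $text);

    // \s is an escape sequence, not literal whitespace, so it is kept.
    $escaped = preg_match('/Total\sGross\sIncome/x', $text);

    // A space inside a character class is kept as well.
    $bracketed = preg_match('/Total[ ]Gross[ ]Income/x', $text);

    var_dump($plain, $escaped, $bracketed); // int(0), int(1), int(1)
    ```

    So any phrase with spaces has to spell them as \s or [ ] while the x modifier is active.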

    You can also use the PREG_SPLIT_DELIM_CAPTURE flag so that the delimiters you split on are kept in the result. Something like this should get you started:

    $re = '/((?<=[\.\?\!])\s+|Total\sGross\sIncome)/ix';
    
    $block_o_text = "You could receive the wrong amount. If you receive more benefits than you    should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross IncomeTotal ResourcesMedical ProgramsHousehold.";
    
    $sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    
    for ($i = 0; $i < count($sentences); ++$i) {
        if (!ctype_space($sentences[$i])) {
            echo $i . " - " . $sentences[$i] . "<br>";
        }
    }
    

    Output:

    0 - You could receive the wrong amount.
    2 - If you receive more benefits than you should, you must pay them back.
    4 - When will we review your case?
    6 - An eligibility review form will be sent before your benefits stop.
    8 - Total Gross Income
    9 - Total ResourcesMedical ProgramsHousehold.
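
    The counters above skip numbers because the whitespace delimiters still occupy slots in the array. If a consecutive counter is wanted, one option is to filter and reindex before printing. This is only a sketch: the shortened sample text, the $start value and the $lines array are assumptions made for illustration, not part of the original script.

    ```php
    <?php
    $re = '/((?<=[.?!])\s+|Total\sGross\sIncome)/ix';
    $text = "You could receive the wrong amount. When will we review your case? "
          . "Total Gross IncomeTotal Resources.";

    $parts = preg_split($re, $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    // Trim every piece, drop the ones that were whitespace-only delimiters,
    // then reindex so the keys run 0, 1, 2, ...
    $sentences = array_values(array_filter(
        array_map('trim', $parts),
        function ($part) {
            return $part !== '';
        }
    ));

    $start = 1; // hypothetical first counter value; the question starts at 77
    $lines = [];
    foreach ($sentences as $i => $sentence) {
        $lines[] = ($start + $i) . ' - ' . $sentence;
    }

    echo implode("<br>\n", $lines), "\n";
    ```

    With this sample text, $lines holds four entries numbered 1 to 4, the third being "3 - Total Gross Income".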