Split text into array elements based on opening and closing HTML tags


I have a string such as the following:

Are you looking for a quality real estate company? 

<s>Josh's real estate firm specializes in helping people find homes from          
[city][State].</s>

<s>Josh's real estate company is a boutique real estate firm serving clients 
locally.</s> 

In [city][state] I am sure you know how difficult it is
to find a great home, but we work closely with you to give you exactly 
what you need

I would like to have this paragraph split into an array based on the <s> </s> tags, so I have the following array as the result:

[0] Are you looking for a quality real estate company?
[1] Josh's real estate firm 
    specializes in helping people find homes from [city][State].
[2] Josh's real estate company is a boutique real estate firm serving clients 
    locally.
[3] In [city][state] I am sure you know how difficult it is
    to find a great home, but we work closely with you to give you exactly 
    what you need

This is a regex I'm currently using:

$matches = array();
preg_match_all(":<s>(.*?)</s>:is", $string, $matches);
$result = $matches[1];
print_r($result);

But this only returns an array containing the text found between the <s> </s> tags; it ignores the text before and after them. (In the example above it would return only array elements 1 and 2.)


Solution

  • The closest I could get was using preg_split() instead:

    $string = <<< STR
    Are you looking for a quality real estate company? <s>Josh's real estate firm 
    specializes in helping people find homes from [city][State].</s>
    <s>Josh's real estate company is a boutique real estate firm serving clients 
    locally.</s> In [city][state] I am sure you know how difficult it is
    to find a great home, but we work closely with you to give you exactly 
    what you need
    STR;
    
    print_r(preg_split(':</?s>:is', $string));
    

    And got this output:

    Array
    (
        [0] => Are you looking for a quality real estate company? 
        [1] => Josh's real estate firm 
    specializes in helping people find homes from [city][State].
        [2] => 
    
        [3] => Josh's real estate company is a boutique real estate firm serving clients 
    locally.
        [4] =>  In [city][state] I am sure you know how difficult it is
    to find a great home, but we work closely with you to give you exactly 
    what you need
    )
    

    This is close, except that it produces an extra array element (index 2) containing the newline that sits between the fragments [city][State].</s> and <s>Josh's real estate company.

    It would be trivial to add some code that removes the whitespace-only matches, but I'm not sure whether you want that.
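
One possible way to drop the whitespace-only element is sketched below. It lets the delimiter pattern swallow any whitespace around each tag and passes PREG_SPLIT_NO_EMPTY so that zero-length pieces are discarded. The function name splitOnSTags and the sample string are made up for illustration; they are not from the original post, and this is a sketch under those assumptions rather than a definitive fix.

```php
<?php
// Hypothetical helper (name is an assumption, not from the original post).
// Splits $text on <s> and </s> tags and returns only the non-empty pieces.
function splitOnSTags(string $text): array
{
    // \s* on both sides of the tag makes surrounding whitespace part of the
    // delimiter, so the newline between "</s>" and "<s>" no longer forms
    // its own element. PREG_SPLIT_NO_EMPTY drops the empty pieces left over.
    $parts = preg_split(':\s*</?s>\s*:i', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure; fall back to an empty array.
    if ($parts === false) {
        return [];
    }

    // Trim whitespace at the very start and end of the input as well.
    return array_map('trim', $parts);
}

// Shortened sample input, shaped like the string in the question.
$example = "Intro? <s>One\ntwo.</s>\n<s>Three.</s> Tail";
print_r(splitOnSTags($example));
```

For the shortened sample this yields four elements: "Intro?", "One\ntwo.", "Three." and "Tail". Line breaks inside a fragment are kept as they are; only the whitespace next to the tags is removed.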