
Using PHP, how do I search a longer string for a shorter string that begins and ends with something specific?


I'm working on a PHP ticket system where I pipe in emails, grab their HTML, and insert it into the database.

I've added this line to my outgoing emails:

## If you reply, text above this line is added to the request ##

I saw this type of thing in an Upwork email, and it was easy enough to grab only the email/HTML BEFORE that unique string, using:

//now, get only the stuff before our "dividing" line starts
$html = strstr($html, '## If', true) ?: $html;

Anyway, I've noticed Gmail adds the following automatically to all email replies:

On Fri, Jun 7, 2019 at 2:40 PM Carson Wentz<carson.wentz@gmail.com> wrote:

So after I do step one to only keep things before "## If you reply...," I now would like to search the remaining text/html to see if it has a string starting with "On" and ending with "wrote:". And if so, only grab the stuff before that (similar to step 1).

I'm having trouble finding anything clearly explaining how to search a longer string for a shorter string that BEGINS WITH something AND ENDS WITH something specific, regardless of what's in the middle. I imagine it would have to use REGEX?

However, as I write this, I just realized that it's pretty likely that at some point someone might start their reply with "On" in which case EVERYTHING would be removed. Ugh.

If anyone has any ideas on how this can be handled, please let me know. The more I think about it, the more I suspect I'll just have to let that Gmail-included line appear in all replies within the ticket system, since I don't think there's a reliable way to match that exact string: it includes a date/time and a name, which are obviously different every time.

Thanks for your time.


Solution

  • You can use preg_match with the following pattern:

    /^(?:On .+?> wrote:)?((\R|.)+?)## If you reply, text above this line is added to the request ##/
    

    Anchored at the start of the body string, this optionally matches a literal On followed by any characters up to > wrote: (all on one line, since . does not match a newline by default), then captures everything up to the termination message, using \R to include newlines.

    Of course, you can go further and make the header pattern stricter, but it seems pretty unlikely that someone will write On [any characters...]> wrote: on exactly the first line, which would be a false positive and cause information to be lost. Going the strict route risks the opposite edge case, where an unusual email address causes a false negative and the header is incorrectly treated as part of the body.

    The example below shows that if this header appears anywhere after the first line, it is treated as part of the body.

    Use ^\s*On if there might be spaces before the On... begins.

    <?php
    
    $withGmailHeader = "On Fri, Jun 7, 2019 at 2:40 PM Carson Wentz<carson.wentz@gmail.com> wrote:
    
    Here's the text content of the email. We'd like to extract it.
    
    On Fri, Jun 6, 2019 at 2:53 AM Bob Smith<bob@gmail.com> wrote:
    'hello'
    
    ## If you reply, text above this line is added to the request ##";
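    // Same body, but with two spaces after ">" on the first line, so the header pattern does not match.
    $withoutGmailHeader = "On Fri, Jun 7, 2019 at 2:40 PM Carson Wentz<carson.wentz@gmail.com>  wrote: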
    
    Here's the text content of the email. We'd like to extract it.
    
    On Fri, Jun 6, 2019 at 2:53 AM Bob Smith<bob@gmail.com> wrote:
    'hello'
    
    ## If you reply, text above this line is added to the request ##";
    
    $pattern = "/^(?:On .+?> wrote:)?((\R|.)+?)## If you reply, text above this line is added to the request ##/";
    
    preg_match($pattern, $withGmailHeader, $match);
    echo "\n=> With Gmail header:\n";
    var_export($match[1]);
    echo "\n\n=> Without Gmail header: (note the extra space after >)\n";
    preg_match($pattern, $withoutGmailHeader, $match);
    var_export($match[1]);
    

    Output:

    => With Gmail header:
    '
    
    Here\'s the text content of the email. We\'d like to extract it.
    
    On Fri, Jun 6, 2019 at 2:53 AM Bob Smith<bob@gmail.com> wrote:
    \'hello\'
    
    '
    
    => Without Gmail header: (note the extra space after >)
    'On Fri, Jun 7, 2019 at 2:40 PM Carson Wentz<carson.wentz@gmail.com>  wrote:
    
    Here\'s the text content of the email. We\'d like to extract it.
    
    On Fri, Jun 6, 2019 at 2:53 AM Bob Smith<bob@gmail.com> wrote:
    \'hello\'
    
    '
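
For reference, the two steps could also be wrapped in one helper. This is only a sketch under a few assumptions: the function name extractReplyBody is made up for illustration, the divider text is the one from the question, and, as in the answer above, the "On ... wrote:" line is only stripped when it is the first line of the body. It swaps the single combined pattern for strstr plus a shorter preg_replace, so the (\R|.)+? alternation is not needed.

```php
<?php

// Hypothetical helper (name chosen for illustration, not from the original answer).
// Returns only the new reply text from a piped email body.
function extractReplyBody(string $body): string
{
    // Step 1: keep only what precedes the divider line, if it is present.
    $divider = '## If you reply, text above this line is added to the request ##';
    $reply = strstr($body, $divider, true);
    if ($reply === false) {
        // No divider found: fall back to the whole body.
        $reply = $body;
    }

    // Step 2: strip an optional leading "On ... > wrote:" line.
    // \A anchors at the very start of the string. Without the s flag,
    // "." does not match a newline, so the header must fit on one line.
    // preg_replace returns null on error, in which case $reply is kept as-is.
    $reply = preg_replace('/\A\s*On .+?> wrote:/', '', $reply, 1) ?? $reply;

    return trim($reply);
}

// Usage with a made-up body:
$body = "On Fri, Jun 7, 2019 at 2:40 PM Carson Wentz<carson.wentz@gmail.com> wrote:\n\n"
      . "Thanks, that fixed it.\n\n"
      . "## If you reply, text above this line is added to the request ##\n"
      . "(quoted original)";
var_export(extractReplyBody($body)); // 'Thanks, that fixed it.'
```

Unlike the ?: shorthand from the question, the === false check keeps an empty result when the divider is the very first thing in the body, instead of falling back to the full text.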