regex perl rt

What is this regex substitution "$content =~ s/\n-- \n.*?$//s" actually doing?


I am working through some Perl code in Request Tracker 4.0 and have encountered an error where the ticket requestor's message is cut off. I am new to Perl. I have done some work with regular expressions, but I'm having trouble with this one even after reading quite a bit.

I have narrowed my problem down to this line of code:

$content =~ s/\n-- \n.*?$//s

I don't fully understand what it is doing and would like a better explanation.

I understand that s/ / is matching the pattern \n-- \n.*?$ and replacing it with nothing.

I don't understand what .*?$ does. Here is my basic understanding:

  • . is any character except \n
  • * is 0 or more times of the preceding character
  • ? is 0 or 1 times of the preceding character
  • $ is the end of the string

Then, from what I understand, the final s makes the . match newlines.

So, roughly, we're replacing any text beginning with \n-- \n. This line of code is causing some questionable behavior that I'd love to get sorted out if someone can explain what's going on here.

Can someone explain what this line is doing? Is it just removing all text after the first \n-- \n or is there more to it?

Long-winded part / real-life issue (you don't need to read this to answer the question)

My exact problem is that it is cutting the quoted content at the signature.

So if email A from a customer says:

What is going on with order ABCD?
-- Some Customer

The staff reply says (note the loss of the customer's signature)

It is shipping today

What is going on with order ABCD?

The customer replies

I did not get it, it did not ship!!!
-- Some Customer

It is shipping today

What is going on with order ABCD?

When we reply, their message will cut at the -- which kills all the context.

It shipped today, tracking number 12345

I did not get it, it did not ship!!!

And leads to more work explaining what order it is, etc.


Solution

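  • You're correct: it removes everything from the first occurrence of "\n-- \n" to the end of the string. The regex engine always starts a match at the leftmost position where one can succeed, so the first separator wins. The ? after .* is the non-greedy modifier: it tells the regex engine to match the shortest possible form of the preceding pattern (.*). But because $ (without /m) only matches at the end of the string, or just before a final newline, even the shortest match has to run to the end. The only practical difference from a greedy .* is that a final trailing newline is left in place.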

    What this does: In email communication the signature is usually separated from the message body by exactly this pattern: a line consisting of exactly two dashes and a single trailing space. Therefore what the regex does is remove everything beginning with the signature separator to the end.

    Now what your customer does (either manually or via their email client) is add the quoted reply of the email after the signature separator. This is highly unusual: the quoted reply should be located before the signature separator. I don't know of a single email client that does this on purpose, but alas there are tons of programs out there that simply get email wrong (from charset issues through quoting to SMTP non-conformance, you can make an incredible number of mistakes), so I wouldn't be surprised to learn that there are indeed such clients.

    Another possibility is that this is an affectation of the client -- like signing his own name after --. However, I suspect this is not done manually as humans seldom insert a trailing space after two dashes followed by a line break.
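
To check the behavior yourself, here is a minimal sketch that runs the substitution from the question on a made-up message. The helper names strip_signature and strip_signature_greedy and the sample text are assumptions for illustration only; they are not part of Request Tracker. The sample deliberately contains two separators and places the quoted text after the first one.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the substitution from the question.
sub strip_signature {
    my ($content) = @_;
    $content =~ s/\n-- \n.*?$//s;
    return $content;
}

# Hypothetical helper: same pattern with a greedy .* for comparison.
sub strip_signature_greedy {
    my ($content) = @_;
    $content =~ s/\n-- \n.*$//s;
    return $content;
}

# Made-up message: new text, a signature separator, then quoted text
# and a second separator further down.
my $mail = "I did not get it, it did not ship!!!\n"
         . "-- \n"
         . "Some Customer\n"
         . "\n"
         . "> It is shipping today\n"
         . "-- \n"
         . "Staff\n";

my $stripped = strip_signature($mail);
my $greedy   = strip_signature_greedy($mail);

# Brackets make a leftover trailing newline visible in the output.
print "non-greedy: [$stripped]\n";
print "greedy:     [$greedy]\n";
```

Both variants cut at the first separator, so the quoted text that follows it is lost. The non-greedy version stops just before the final newline and leaves it in place, while the greedy version removes it as well.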