I am using the following regex to extract the filename from an rfc822 multipart email.
private static Pattern filenamePattern = Pattern.compile("(?<=filename=\").*?(?=\")");
This extracts quoted filenames, including ones that contain a space, as in:
Content-Type : application/pdf; name="Key.Enrollment_Final.pdf"
but cannot extract filenames that are not quoted, like:
Content-Type : application/octet-stream; name=.config
I cannot quite figure out how to get both. For the first quote, I think I can check for (?<=filename=\"?), but how should I check for a space or an end of line or a quote?
I have only seen the filename attribute specified in the Content-Disposition header, not in the Content-Type header.
Either way, this is a regex that correctly matches the filename attribute according to RFC 1806 (which references RFC 1521 and RFC 822):
"filename=(?:([\\x21-\\x7E&&[^\\Q()<>[]@,;:\\\"/?=\\E]]++)|\"((?:(?:(?:\r\n)?[\t ])+|[^\r\"\\\\]|\\\\[\\x00-\\x7f])*)\")"
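As a rough sketch of how the pattern above might be used: group 1 captures an unquoted token and group 2 captures the inside of a quoted string, so the caller picks whichever one matched. The class name FilenameMatch, the method rawFilename, and the sample headers are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FilenameMatch {
    // Same pattern as above, split over two lines for readability.
    // Group 1: unquoted token. Group 2: contents of a quoted-string.
    static final Pattern FILENAME = Pattern.compile(
        "filename=(?:([\\x21-\\x7E&&[^\\Q()<>[]@,;:\\\"/?=\\E]]++)"
        + "|\"((?:(?:(?:\r\n)?[\t ])+|[^\r\"\\\\]|\\\\[\\x00-\\x7f])*)\")");

    // Returns the raw (still escaped) value, or null if no filename parameter is found.
    static String rawFilename(String header) {
        Matcher m = FILENAME.matcher(header);
        if (!m.find()) {
            return null;
        }
        return m.group(1) != null ? m.group(1) : m.group(2);
    }

    public static void main(String[] args) {
        // prints: Key Enrollment.pdf
        System.out.println(rawFilename(
            "Content-Disposition: attachment; filename=\"Key Enrollment.pdf\""));
        // prints: .config
        System.out.println(rawFilename(
            "Content-Disposition: attachment; filename=.config"));
    }
}
```

To match the name parameter of a Content-Type header as well, the literal prefix would have to be adjusted; that variation is not shown here.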
Matching is one thing, but you still have to process the file name in the second (quoted) case, at least to unquote special characters. You also need to collapse linear-white-space, defined in RFC 822 as (?:(?:\r\n)?[\t ])+, to a single space, and replace non-printable characters.
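The post-processing of a quoted value could look roughly like the sketch below. It assumes the input is the raw text captured between the quotes. The class name FilenameCleanup, the method clean, and the choice of '_' as the replacement for non-printable characters are all arbitrary, not something the RFCs prescribe. Quoted pairs and linear white space are handled in a single pass so that an escaped space is not swallowed by the whitespace collapsing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FilenameCleanup {
    // Either a quoted-pair (backslash + any ASCII char, captured in group 1)
    // or a run of linear-white-space as defined above.
    static final Pattern PART =
        Pattern.compile("\\\\([\\x00-\\x7f])|(?:(?:\r\n)?[\t ])+");

    // Anything outside printable ASCII (0x20 to 0x7E).
    static final Pattern NON_PRINTABLE = Pattern.compile("[^\\x20-\\x7E]");

    static String clean(String raw) {
        Matcher m = PART.matcher(raw);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Quoted-pair: keep only the escaped character.
            // Linear-white-space: collapse to a single space.
            String replacement = m.group(1) != null ? m.group(1) : " ";
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        // Assumption: non-printable characters are replaced with '_'.
        return NON_PRINTABLE.matcher(sb).replaceAll("_");
    }

    public static void main(String[] args) {
        // prints: Key Enrollment "Final".pdf
        System.out.println(clean("Key\r\n \t Enrollment \\\"Final\\\".pdf"));
    }
}
```

An unquoted token needs none of this, since the token grammar already excludes whitespace, quotes, and backslashes.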