I have a huge mbox file, with maybe 500 emails in it.
It looks like the following:
From x@blah.com Fri Aug 12 09:34:09 2005
Message-ID: <42FBEE81.9090701@blah.com>
Date: Fri, 12 Aug 2005 09:34:09 +0900
From: me <x@blah.com>
User-Agent: Mozilla Thunderbird 1.0.6 (Windows/20050716)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: someone <someone@hotmail.com>
Subject: Re: (no subject)
References: <BAY101-F9353854000A4758A7E2CCA9BD0@phx.gbl>
In-Reply-To: <BAY101-F9353854000A4758A7E2CCA9BD0@phx.gbl>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Status: RO
X-Status:
X-Keywords:
X-UID: 371
X-Evolution-Source: imap://x+blah.com@blah.com/
X-Evolution: 00000002-0010

Hey
the actual content of the email
someone wrote:
> lines of quotedtext
I would like to know how I can remove all of the quoted text, strip most of the headers except the To, From and Date lines, and still have it somewhat continuous.
My goal is to print these emails in something like a book format. At the moment every program I have tried either prints one email per page or includes all of the headers and quoted text. Any suggestions for where to start on whipping up a small program using shell tools?
Mail::Box::Mbox will let you easily parse the file into separate messages. Mark Overmeer's slides from YAPC::Europe 2002 go into quite a bit of detail as to why parsing is much more difficult than it seems. Using this library also means you can handle mh, IMAP and many other formats, not just mbox.
#!/usr/bin/perl
use warnings;
use strict;
use Mail::Box::Manager;

my $file = shift || $ENV{MAIL};
my $mgr  = Mail::Box::Manager->new(
    access => 'r',
);
my $folder = $mgr->open( folder => $file )
    or die "$file: Unable to open: $!\n";

for my $msg ($folder->messages)
{
    my $to      = join( ', ', map { $_->format } $msg->to );
    my $from    = join( ', ', map { $_->format } $msg->from );
    my $date    = localtime( $msg->timestamp );
    my $subject = $msg->subject;    # not printed below; add a Subject: line if you want it
    my $body    = $msg->body->string;

    # Strip all quoted text. No /s flag here: with it, .* would also match
    # newlines and swallow everything from the first quoted line onwards.
    $body =~ s/^>.*\n?//mg;

    print <<"END";
From: $from
To: $to
Date: $date

$body
END
}
You may want to reconsider your request to strip the quoted text: what if you have email that is formatted with interleaved replies? Stripping the quoted text would make this sort of email very hard to understand:
Foo wrote:
> I like bar.
Bar? Who likes bar?
> It is better than baz.
Everyone knows that.
-- Quux
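One possible middle ground, as a rough sketch only: remove the quoted block when it sits at the very end of a message (as in the sample email in the question) and leave interleaved quotes alone. The helper name strip_trailing_quote is made up for this example, and the assumption that the attribution line ends in "wrote:" will not hold for every mail client.

```perl
use strict;
use warnings;

# Hypothetical helper: remove only a trailing run of ">"-quoted lines,
# plus the attribution line directly above it.
sub strip_trailing_quote {
    my ($body) = @_;
    my @lines  = split /\n/, $body;
    my $quoted = 0;

    # Drop blank and quoted lines from the end of the message.
    while ( @lines && $lines[-1] =~ /^(?:>|\s*$)/ ) {
        $quoted++ if $lines[-1] =~ /^>/;
        pop @lines;
    }

    # Assumption: the attribution looks like "someone wrote:".
    pop @lines if $quoted && @lines && $lines[-1] =~ /wrote:\s*$/;

    # Trim blank lines that sat above the attribution.
    pop @lines while @lines && $lines[-1] =~ /^\s*$/;

    return join( "\n", @lines ) . "\n";
}

my $top_posted  = "Hey\nthe actual content of the email\nsomeone wrote:\n> lines of quotedtext\n";
my $interleaved = "Foo wrote:\n> I like bar.\nBar? Who likes bar?\n> It is better than baz.\nEveryone knows that.\n-- Quux\n";

print strip_trailing_quote($top_posted);     # quoted tail and attribution removed
print strip_trailing_quote($interleaved);    # printed unchanged
```

In the script above you could call this on the body string instead of running the substitution, if that trade-off suits your mail.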
Additionally, what do you plan to do with attachments, non-text/plain MIME types, encoded text entities and other oddities?
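To give an idea of what "encoded text" can involve, here is a small sketch using only core modules (Encode and MIME::QuotedPrint). The two helper names are invented for this example, and it assumes you have already read the charset out of the Content-Type header. Mail::Box may well do some of this for you, so check its documentation before rolling your own.

```perl
use strict;
use warnings;
use Encode qw(decode);
use MIME::QuotedPrint qw(decode_qp);

# Hypothetical helper: turn an RFC 2047 encoded header value
# (e.g. a name in From: or To:) into a character string.
sub decode_header_text {
    my ($raw) = @_;
    return decode( 'MIME-Header', $raw );
}

# Hypothetical helper: undo quoted-printable transfer encoding,
# then decode the bytes using the charset the message declared.
sub decode_qp_body {
    my ( $raw, $charset ) = @_;
    return decode( $charset, decode_qp($raw) );
}

# Print decoded characters as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print decode_header_text('=?ISO-8859-1?Q?J=F6rg?='), "\n";
print decode_qp_body( "caf=E9 au lait\n", 'ISO-8859-1' );
```

Attachments and non-text parts are a separate problem; for a printed book you will probably want to skip them or replace them with a one-line placeholder.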