Parsing text email with parsec


I just started learning Parsec and ... this is a bit brain bending. I have a text email. I need to extract the From: header and the body text. Now, I started searching for tutorials and examples from which to learn. I found three, all dealing with parsing CSV files as if there is nothing else in the world to parse.

In theory, it seems very simple: skip lines until you hit a line starting with "From:" and take the text between "From: " and the newline. In practice, I've been fighting with this for a couple of days.

Return-Path: <[email protected]>
X-Original-To: [email protected]
Delivered-To: [email protected]
blah ... blah ...
Subject: Test subject
From: John Doe <[email protected]>
To: [email protected]
Content-Type: multipart/alternative; boundary=047d7b2e4e3cdc627304eb094bfe

--047d7b2e4e3cdc627304eb094bfe
Content-Type: text/plain; charset=UTF-8

Email body

--047d7b2e4e3cdc627304eb094bfe

I can define a line like

let line = do{many1 (noneOf "\n"); many1 newline}

How do I cycle through lines until I hit one that starts with a certain string? This is my attempt:

p = do
  manyTill line (string "From:")
  string "From: "
  b <- many anyChar
  newline
  many line
  eof
  return b

This does not work. Can someone show me how to do it, or point me to a simple tutorial (not a CSV parsing tutorial)?

How do I extract the body, which is the text between boundary tokens and starts after the first empty line? I suppose extracting the body is even more complex so any help is appreciated.

Thanks


Solution

  • Parsec doesn't backtrack by default, so many anyChar will just slurp the rest of your text and leave nothing for the newline that follows. Instead consider something like

    manyTill line $ try (string "From: ")
    b <- manyTill anyChar newline
    many line
    eof
    return b
    

    Note that manyTill runs the end parser at the start of every line. If that parser fails after consuming part of a line, the whole parse fails unless the input is rewound first, which is why it is wrapped in try.

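    Now this still fails because your email doesn't end in a newline, so line starts to succeed on the last line, then fails after consuming input, causing the whole parser to fail rather than backtracking. If you can't change the input, then drop the eof (the unterminated last line will be left unconsumed) and change many line to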

    many (try line)
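
Putting the pieces so far together, here is a minimal sketch of the From: extraction as a self-contained module. The name fromHeader is made up for illustration, line is written with <* so that it returns the line text, and the trailing eof is left out on the assumption that the input may not end in a newline.

```haskell
import Control.Monad (void)
import Text.Parsec
import Text.Parsec.String (Parser)

-- A non-empty line followed by one or more newlines, as in the question.
line :: Parser String
line = many1 (noneOf "\n") <* many1 newline

-- Hypothetical name: returns the text after "From: " on the first line
-- that starts with it.
fromHeader :: Parser String
fromHeader = do
  void (manyTill line (try (string "From: ")))  -- skip earlier header lines
  b <- manyTill anyChar newline                 -- rest of the From: line
  void (many (try line))                        -- remaining complete lines
  return b
```

In GHCi, parse fromHeader "" input should give a Right with the sender text when a From: line is present, and a Left when no line starts with "From: ". If only the header is needed, the many (try line) step could be dropped as well, since parse does not require the whole input to be consumed.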
    

    To clarify: by default, Parsec only lets combinators such as <|>, many and manyTill recover from a failure if the failing parser consumed no input. If it consumes even one character and then fails, your whole parser dies. If you want backtracking behaviour so this doesn't happen, use try.
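
The difference can be seen in a small sketch. The names noTry and withTry and the "Foo:" header are invented for illustration; the behaviour described assumes the parsec package's string, which consumes the matching prefix before it fails.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- string "From:" consumes the 'F' of "Foo:" before failing, so <|>
-- never gets to run the second alternative.
noTry :: Parser String
noTry = string "From:" <|> string "Foo:"

-- try rewinds the input when its argument fails, so the second
-- alternative starts again from the 'F'.
withTry :: Parser String
withTry = try (string "From:") <|> string "Foo:"
```

Running parse noTry "" "Foo:" should produce a Left, while parse withTry "" "Foo:" should produce Right "Foo:".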

    For extracting the body,

    getBody = do
      manyTill anyChar (try $ string "boundary=")
      boundary <- manyTill anyChar newline
      manyTill anyChar (try $ string boundary) -- Skip to the first boundary line
      manyTill anyChar (try $ string boundary) -- Read up to the next boundary (part headers included)
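
If only the text after the part's first empty line is wanted, as the question asks, one possible refinement is sketched below. It is not the answer's code: firstPartBody is a made-up name, and the sketch assumes an unquoted boundary= value that is the last thing on its line, plain "\n" line endings, and that only the first part is of interest. It also relies on delimiter lines being the boundary value prefixed with "--", as in the sample email.

```haskell
import Control.Monad (void)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical name: body text of the first MIME part.
firstPartBody :: Parser String
firstPartBody = do
  void (manyTill anyChar (try (string "boundary=")))
  boundary <- manyTill anyChar newline
  let delim = "--" ++ boundary                    -- delimiter lines start with "--"
  void (manyTill anyChar (try (string delim)))    -- skip to the first delimiter
  void (manyTill anyChar (try (string "\n\n")))   -- skip the part's own headers
  manyTill anyChar (try (string ("\n" ++ delim))) -- body, up to the next delimiter line
```

For the sample email this should return "Email body" followed by a single newline; the newline directly before the closing delimiter is treated as part of the delimiter rather than the body.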