I am learning how to use the Haskell lexical analyzer tool called Alex [1].
I am trying to implement a lexical analyzer for this string (an email "From:" header):
From: "John Doe" <john@doe.org>
I want to break it up into this list of tokens:
[
From,
DisplayName "John Doe",
Email,
LocalPart "john",
Domain "doe.org"
]
Below is my implementation. It works fine as long as the string doesn't contain a display name; that is, this succeeds:
let s = "From: <john@doe.org>"
alexScanTokens s
However, when I include a display name, I get this error message:
[From*** Exception: lexical error
That is, this results in an error:
let s = "From: \"John Doe\" <john@doe.org>"
alexScanTokens s
I am guessing that this part of my Alex program is causing the error:
\"[a-zA-Z ]+\" { \s -> DisplayName (init (tail s)) }
In Alex, the left side is a regular expression:
\"[a-zA-Z ]+\"
and the right side is the action to be taken when a string matching the regular expression is found:
{ \s -> DisplayName (init (tail s)) }
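For example, given the matched string "\"John Doe\"", tail drops the opening quote and init drops the closing quote:

ghci> (init . tail) "\"John Doe\""
"John Doe"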
Any ideas on what the problem might be?
{
module Main (main) where
}
%wrapper "basic"
$digit = 0-9 -- digits
$alpha = [a-zA-Z] -- alphabetic characters
tokens :-

$white+           ;
From:             { \s -> From }
\"[a-zA-Z ]+\"    { \s -> DisplayName (init (tail s)) }
\<                { \s -> Email }
[$alpha]+@        { \s -> LocalPart (init s) }
[$alpha\.]+>      { \s -> Domain (init s) }
{
-- Each action has type :: String -> Token
-- The token type:
data Token
  = From
  | DisplayName String
  | Email
  | LocalPart String
  | Domain String
  deriving (Eq, Show)

main :: IO ()
main = do
  s <- getContents
  print (alexScanTokens s)
}
[1] The "Alex" lexical analyzer tool may be found at this URL: http://www.haskell.org/alex/doc/html/introduction.html
It's the space in "John Doe" that's causing trouble. Whitespace is ignored in character sets like [a-zA-Z ]. To include the space, you need to escape it with a backslash, e.g. [a-zA-Z\ ].
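With that one change, the display-name rule becomes:

\"[a-zA-Z\ ]+\"   { \s -> DisplayName (init (tail s)) }

and the failing input should then lex to something like:

[From,DisplayName "John Doe",Email,LocalPart "john",Domain "doe.org"]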
Also, I can't help but note that a lexer might be the wrong tool for this job. Consider writing a proper parser using e.g. Parsec.
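For example, here is a rough sketch of that approach (untested beyond the happy path; the FromHeader type and the parser names are just illustrative, and the character classes are as simplistic as the ones in your lexer):

import Text.Parsec
import Text.Parsec.String (Parser)

-- Parse the whole header into one structured value instead of a token list.
data FromHeader = FromHeader
  { displayName :: Maybe String  -- Nothing when the header has no display name
  , localPart   :: String
  , domain      :: String
  } deriving Show

fromHeader :: Parser FromHeader
fromHeader = do
  _    <- string "From:"
  spaces
  name <- optionMaybe (quoted <* spaces)   -- the display name is optional
  _    <- char '<'
  lp   <- many1 letter
  _    <- char '@'
  dom  <- many1 (letter <|> char '.')
  _    <- char '>'
  return (FromHeader name lp dom)
  where
    quoted = char '"' *> many1 (letter <|> char ' ') <* char '"'

main :: IO ()
main = parseTest fromHeader "From: \"John Doe\" <john@doe.org>"

parseTest prints either the parse error or the resulting FromHeader value, so running this should print FromHeader {displayName = Just "John Doe", localPart = "john", domain = "doe.org"}.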