Search code examples
parsinghaskellparsec

haskell parser with spell-checking


in order to learn a bit more of Haskell (especially Monads) I'm trying to build a spell-checker. My goal is to be able to go through a LaTeX document and do something on words that aren't in a dictionary list.

I've already written the parser (string to AST), which code i paste below. It basically returns the LaTeX source chunked in relevant pieces (text, formulas, commands, etc.). I would like to know how to build a program such that on every word not found in a list, we ask the user what word to substitute.

(What we really care about LaTeX is that we have some part of the source that are text and have to be spell-checked, and other parts that are formulas and not plain english)


Let me explain it more clearly with some examples of desired behaviour (for simplicity formulas are between $ HERE IS THE FORMULA $)

Source:

This is my frst file and here
we have a formula: $\forall x \quad x$

Desired behaviour:

In file 'first.tex' at line 1: 'frst' unknown
1 This is my **frst** file and here
2 we have a formula: $\forall x \quad x$
Action [Add word to dictionary / Change word]? 

The main problem is, after I've parsed the file, I'm left with an AST and have no more references to lines, so i could not display them like the above example.


Code for the parser:

import System.Environment
import Text.Parsec (ParseError)
import Text.Parsec.String (Parser, parseFromFile)
import Text.Parsec.String.Parsec (try)
import Text.Parsec.String.Char (oneOf, char, digit, string, letter, satisfy, noneOf, anyChar)
import Text.Parsec.String.Combinator (many1, choice, chainl1, between, count, option, optionMaybe, optional, manyTill, eof, lookAhead)
import Control.Applicative ((<$>), (<*>), (<*), (*>), (<|>), many, (<$))
import Control.Monad (void, ap, mzero)
import Data.Char (isLetter, isDigit)
import FunctionsAndTypesForParsing

data TexFile = Items [TexTerm]
              deriving (Eq, Show)

data TexTerm = Comment String
             | Formula String
             | Command String [TexFile]
             | Text String
             | Block TexFile
               deriving (Eq, Show)

-- We get the AST as output                                                                                                                                           
texFile :: Parser TexFile
texFile = Items <$> (many texTerm) <* (optional (try $ eof))

texTerm :: Parser TexTerm
texTerm = lexeme $ (try comment <|> text <|> formula <|> command <|> block)

whitespace :: Parser ()
whitespace = void $ try $ oneOf " \n\t"

lexeme :: Parser a -> Parser a
lexeme p = p <* (many $ whitespace)

comment :: Parser TexTerm
comment = Comment <$> between (string "%") (string "\n") (many $ noneOf "\n")

formula :: Parser TexTerm
formula = Formula <$> (try singledollar <|> doubledollar <|> equation <|> align)
  where
    singledollar = between (string "$") (string "$") (many1 $ noneOf "$")
    doubledollar = between (string "$$") (string "$$") (many1 $ noneOf "$$")
    equation = try $ between (try $ string "\\begin{equation}") (string "\\end{equation}") (manyTill anyChar (lookAhead $ try $ string "\\end{equation}"))
    align = try $ between (try $ string "\\begin{align*}") (string "\\end{align*}") (manyTill anyChar (lookAhead $ try $ string "\\end{align*}"))

command :: Parser TexTerm
command = Command <$> com <*> (many arg)
  where
    com = char '\\' *> (manyTill (try letter <|> oneOf "*") (lookAhead $ try $ oneOf "[{ \\\n\t"))
    arg = (try (between (string "{") (string "}") texFile)
           <|> (between (string "[") (string "]") texFile)
          )

text :: Parser TexTerm
text = Text <$> many1 textualchars
  where
    textualchars = try letter <|> digit <|> oneOf " \n\t\r,.*:;-<>#@()`_!'?"

block :: Parser TexTerm
block = Block <$> between (string "{") (string "}") texFile

Solution

  • You can use Parsec's getPosition action to get the current position in the input stream. You can then store it in your AST type (i.e. changing it to something like

    data TexFile = Items [(SourcePos, TexTerm)]
    

    )