How to ignore arbitrary tokens using parsec?


I want to replace sed and awk with Parsec, for example to extract the number from a string of unknown structure such as "unknown structure but containing the number 42 and maybe some other stuff".

With the parser below I run into "unexpected end of input". I'm looking for the equivalent of a non-greedy .*([0-9]+).*.

module Main where

import Text.Parsec

parser :: Parsec String () Int
parser = do
    _ <- many anyToken
    x <- read <$> many1 digit
    _ <- many anyToken
    return x

main :: IO ()
main = interact (show . parse parser "STDIN")

Solution

  • This can be easily done with my library regex-applicative. It gives you both the combinator interface and the features of regular expressions that you seem to want.

    Here's a working version that's closest to your example:

    {-# LANGUAGE ApplicativeDo #-}
    import Text.Regex.Applicative
    import Text.Regex.Applicative.Common (decimal)
    
    parser :: RE Char Int
    parser = do
        _ <- few anySym
        x <- decimal
        _ <- many anySym
        return x
    
    main :: IO ()
    main = interact (show . match parser)
    

    Here's an even shorter version, using findFirstInfix:

    import Text.Regex.Applicative
    import Text.Regex.Applicative.Common (decimal)
    
    main :: IO ()
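    -- findFirstInfix returns Maybe (prefix, match, suffix), so map over the Maybe and show the result
    main = interact (show . fmap snd3 . findFirstInfix (decimal :: RE Char Int))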
      where snd3 (_, r, _) = r
    

    If you want to perform actual tokenization (e.g. skip 93 in foo93bar), then take a look at lexer-applicative, a tokenizer based on regex-applicative.
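
If you would rather stay within Parsec: the parser in the question fails because many anyToken is greedy and Parsec never backtracks into it, so it consumes the whole input and many1 digit then hits end of input. A minimal sketch of one way around this is to stop skipping as soon as the next character is a digit, using manyTill together with lookAhead. The names firstNumber and extractNumber are made up for this illustration, and the sketch assumes the number fits in an Int.

```haskell
module Main where

import Text.Parsec
import Text.Parsec.String (Parser)

-- | Skip characters until the next one is a digit, read the whole run of
-- digits, then ignore whatever follows. (Illustrative name, not from the question.)
firstNumber :: Parser Int
firstNumber = do
    _ <- manyTill anyChar (lookAhead digit)  -- stops before the first digit, without consuming it
    x <- read <$> many1 digit                -- the digits themselves
    _ <- many anyChar                        -- discard the rest of the input
    return x

-- | Illustrative helper: run the parser on a plain String.
extractNumber :: String -> Either ParseError Int
extractNumber = parse firstNumber "STDIN"

main :: IO ()
main = interact (show . extractNumber)
```

For input such as foo42bar, extractNumber yields Right 42. If the input contains no digit at all, manyTill runs to the end of the input and the parse fails with a Left value, which is probably what you want from a sed/awk replacement.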