Search code examples
haskellregex-lookaroundsattoparsec

Parse identifiers that don't end with certain characters in attoparsec


I am stuck writing an attoparsec parser to parse what the Uniform Code for Units of Measure calls a <ATOM-SYMBOL>. It's defined to be the longest sequence of characters in a certain class (that class includes all the digits 0-9) which doesn't end with a digit.

So given the input foo27 I want to consume and return foo, for 237bar26 I want to consume and return 237bar, for 19 I want to fail without consuming anything.

I can't figure out how to build this out of takeWhile1 or takeTill or scan but I am probably missing something obvious.

Update: My best attempt so far was that I managed to exclude sequences that are entirely digits

atomSymbol :: Parser Text
atomSymbol = do
               r <- core
               if (P.all (inClass "0-9") . T.unpack $ r)
                 then fail "Expected an atom symbol but all characters were digits."
                 else return r
  where
    core = A.takeWhile1 $ inClass "!#-'*,0-<>-Z\\^-z|~"

I tried changing that to test if the last character was a digit instead of if they all were, but it doesn't seem to backtrack one character at a time.

Update 2:

The whole file is at https://github.com/dmcclean/dimensional-attoparsec/blob/master/src/Numeric/Units/Dimensional/Parsing/Attoparsec.hs. This only builds against the prefixes branch from https://github.com/dmcclean/dimensional.


Solution

  • You should reformulate the problem and treat spans of digits (0-9) and spans of non-digit characters (!#-'*,:-<>-Z\\^-z|~) separately. The syntactic element of interest can then be described as

    • an optional digit span, followed by
    • a non-digit span, followed by
    • zero or more {digit span followed by a non-digit span}.
    {-# LANGUAGE OverloadedStrings #-}
    
    module Main where
    
    import Control.Applicative ((<|>), many)
    import Data.Char (isDigit)
    
    import Data.Attoparsec.Combinator (option)
    import Data.Attoparsec.Text (Parser)
    import qualified Data.Attoparsec.Text as A
    import Data.Text (Text)
    import qualified Data.Text as T
    
    atomSymbol :: Parser Text
    atomSymbol = f <$> (option "" digitSpan)
                   <*> (nonDigitSpan <|> fail errorMsg)
                   <*> many (g <$> digitSpan <*> nonDigitSpan)
      where
        nonDigitSpan = A.takeWhile1 $ A.inClass "!#-'*,:-<>-Z\\^-z|~"
        digitSpan    = A.takeWhile1 isDigit
        f x y xss    = T.concat $ x : y : concat xss
        g x y        = [x,y]
        errorMsg     = "Expected an atom symbol but all characters (if any) were digits."
    

    Tests

    [...] given the input foo27 I want to consume and return foo, for 237bar26 I want to consume and return 237bar, for 19 I want to fail without consuming anything.

    λ> A.parseOnly atomSymbol "foo26"
    Right "foo"
    
    λ> A.parseOnly atomSymbol "237bar26"
    Right "237bar"
    
    λ> A.parseOnly atomSymbol "19"
    Left "Failed reading: Expected an atom symbol but all characters (if any) were digits."