Search code examples
haskellparser-combinatorsattoparsec

How can I write a more general (but efficient) version of attoparsec's takeWhile1?


Data.Attoparsec.Text exports takeWhile and takeWhile1:

takeWhile :: (Char -> Bool) -> Parser Text

Consume input as long as the predicate returns True, and return the consumed input.

This parser does not fail. It will return an empty string if the predicate returns False on the first character of input.

[...]

takeWhile1 :: (Char -> Bool) -> Parser Text

Consume input as long as the predicate returns True, and return the consumed input.

This parser requires the predicate to succeed on at least one character of input: it will fail if the predicate never returns True or if there is no input left.

attoparsec's documentation encourages the user to

Use the Text-oriented parsers whenever possible, e.g. takeWhile1 instead of many1 anyChar. There is about a factor of 100 difference in performance between the two kinds of parser.

Those two parsers are very useful, but I keep feeling the need for a more general version of takeWhile1, more specifically, some hypothetical parser

takeWhileLo :: (Char -> Bool) -> Int -> Parser Text
takeWhileLo f lo = undefined

that would parse at least lo characters satisfying predicate f, where lo is an arbitrary nonnegative integer.

I had a look at takeWhile1's implementation, but it uses a bunch of functions private to Data.Attoparsec.Text.Internal and doesn't seem easily generalizable.

I came up with the following applicative implementation:

{-# LANGUAGE OverloadedStrings #-}

import           Prelude                  hiding ( takeWhile )

import           Control.Applicative             ( (<*>) )
import           Data.Text                       ( Text )
import qualified Data.Text           as T

import           Data.Attoparsec.Text

takeWhileLo :: (Char -> Bool) -> Int -> Parser Text
takeWhileLo f lo =
  T.append . T.pack <$> count lo (satisfy f) <*> takeWhile f

It works as advertised,

λ> parseOnly (takeWhileLo (== 'a') 4) "aaa"
Left "not enough input"
λ> parseOnly (takeWhileLo (== 'a') 4) "aaaa"
Right "aaaa"
λ> parseOnly (takeWhileLo (== 'a') 4) "aaaaaaaaaaaaa"
Right "aaaaaaaaaaaaa"

but the need for packing the intermediate list of results returned by count worries me, especially for cases where lo is large... It seems to go against the recommendation to

use the Text-oriented parsers whenever possible [...]

Am I missing something? Is there a more efficient/idiomatic way of implementing such a takeWhileLo combinator?


Solution

  • Parser is a monad, so you can just inspect the return value and fail if the length's not right:

    takeWhileLo :: (Char -> Bool) -> Int -> Parser Text
    takeWhileLo f lo = do
      text <- takeWhile f
      case T.compareLength text lo of
        LT -> empty
        _  -> return text
    

    compareLength is from the text package. It's more efficient than comparing text's length, because compareLength may short-circuit.