I'm trying to make a parser to scan arrays of numbers separated by empty lines in a text file.
1 235 623 684
2 871 699 557
3 918 686 49
4 53 564 906
1 154
2 321
3 519
1 235 623 684
2 871 699 557
3 918 686 49
Here is the full text file
I wrote the following parser with parsec
:
import Text.ParserCombinators.Parsec
emptyLine = do
spaces
newline
emptyLines = many1 emptyLine
data1 = do
dat <- many1 digit
return (dat)
datan = do
many1 (oneOf " \t")
dat <- many1 digit
return (dat)
dataline = do
dat1 <- data1
dat2 <- many datan
many (oneOf " \t")
newline
return (dat1:dat2)
parseSeries = do
dat <- many1 dataline
return dat
parseParag = try parseSeries
parseListing = do
--cont <- parseSeries `sepBy` emptyLines
cont <- between emptyLines emptyLines parseSeries
eof
return cont
main = do
fichier <- readFile ("test_listtst.txt")
case parse parseListing "(test)" fichier of
Left error -> do putStrLn "!!! Error !!!"
print error
Right serie -> do
mapM_ print serie
but it fails with the following error :
!!! Error !!!
"(test)" (line 6, column 1):
unexpected "1"
expecting space or new-line
and I don't understand why.
Do you have any idea of what's wrong with my parser ?
Do you have an example on how to parse a structured bunch of data separated by empty lines ?
Do you have any idea of what's wrong with my parser ?
A few things:
As other answerers have already pointed out, the spaces
parser is designed to consume a sequence of characters that satisfy Data.Char.isSpace
; the newline ('\n'
) is such a character. Therefore, your emptyLine
parser always fails, because newline
expects a newline character that has already been consumed.
You probably shouldn't use the newline
parser in your "line" parsers anyway, because those parsers will fail on the last line of the file if the latter doesn't end with a newline.
Why not use parsec 3 (Text.Parsec.*
) rather than parsec 2 (Text.ParserCombinators.*
)?
Why not parse the numbers as Integer
s or Int
s as you go, rather than keep them as String
s?
Personal preference, but you rely too much on the do
notation for my taste, to the detriment of readability. For instance,
data1 = do
dat <- many1 digit
return (dat)
can be simplified to
data1 = many1 digit
You would do well to add a type signature to all your top-level bindings.
Be consistent in how you name your parsers: why "parseListing" instead of simply "listing"?
Have you considered using a different type of input stream (e.g. Text
) for better performance?
Do you have an example on how to parse a structured bunch of data separated by empty lines ?
Below is a much simplified version of the kind of parser you want. Note that the input is not supposed to begin with (but may end with) empty lines, and "data lines" are not supposed to contain leading spaces, but may contain trailing spaces (in the sense of the spaces
parser).
module Main where
import Data.Char ( isSpace )
import Text.Parsec
import Text.Parsec.String ( Parser )
eolChar :: Char
eolChar = '\n'
eol :: Parser Char
eol = char eolChar
whitespace :: Parser String
whitespace = many $ satisfy $ \c -> isSpace c && c /= eolChar
emptyLine :: Parser String
emptyLine = whitespace
emptyLines :: Parser [String]
emptyLines = sepEndBy1 emptyLine eol
cell :: Parser Integer
cell = read <$> many1 digit
dataLine :: Parser [Integer]
dataLine = sepEndBy1 cell whitespace
-- ^
-- replace by endBy1 if no trailing whitespace is allowed in a "data line"
dataLines :: Parser [[Integer]]
dataLines = sepEndBy1 dataLine eol
listing :: Parser [[[Integer]]]
listing = sepEndBy dataLines emptyLines
main :: IO ()
main = do
fichier <- readFile ("test_listtst.txt")
case parse listing "(test)" fichier of
Left error -> putStrLn "!!! Error !!!"
Right serie -> mapM_ print serie
Test:
λ> main
[[1,235,623,684],[2,871,699,557],[3,918,686,49],[4,53,564,906]]
[[1,154],[2,321],[3,519]]
[[1,235,623,684],[2,871,699,557],[3,918,686,49]]