
Parse arrays of numbers separated by empty lines


I'm trying to make a parser to scan arrays of numbers separated by empty lines in a text file.

1   235 623 684
2   871 699 557
3   918 686 49
4   53  564 906


1   154
2   321
3   519

1   235 623 684
2   871 699 557
3   918 686 49

Here is the full text file

I wrote the following parser with Parsec:

import Text.ParserCombinators.Parsec

emptyLine = do
  spaces
  newline

emptyLines = many1 emptyLine

data1 = do
  dat <- many1 digit
  return (dat)

datan = do
  many1 (oneOf " \t")
  dat <- many1 digit
  return (dat)

dataline  = do
  dat1 <- data1
  dat2 <- many datan
  many (oneOf " \t")
  newline
  return (dat1:dat2)

parseSeries = do 
    dat <- many1 dataline  
    return dat

parseParag =  try parseSeries

parseListing = do 
    --cont <- parseSeries `sepBy` emptyLines
    cont <- between emptyLines emptyLines parseSeries
    eof
    return cont

main = do
    fichier <- readFile ("test_listtst.txt")
    case parse parseListing "(test)" fichier of
            Left error -> do putStrLn "!!! Error !!!"
                             print error
            Right serie -> do  
                                mapM_ print serie

But it fails with the following error:

!!! Error !!!
"(test)" (line 6, column 1):
unexpected "1"
expecting space or new-line

and I don't understand why.

Do you have any idea of what's wrong with my parser?

Do you have an example of how to parse a structured bunch of data separated by empty lines?


Solution

  • Do you have any idea of what's wrong with my parser?

    A few things:

    1. As other answerers have already pointed out, the spaces parser skips every character that satisfies Data.Char.isSpace, and the newline ('\n') is such a character. Therefore, your emptyLine parser always fails: by the time newline runs, spaces has already consumed any newline that was there.

    2. You probably shouldn't use the newline parser in your "line" parsers anyway, because those parsers will fail on the last line of the file if the latter doesn't end with a newline.

    3. Why not use parsec 3 (Text.Parsec.*) rather than parsec 2 (Text.ParserCombinators.*)?

    4. Why not parse the numbers as Integers or Ints as you go, rather than keep them as Strings?

    5. Personal preference, but you rely too much on the do notation for my taste, to the detriment of readability. For instance,

      data1 = do
        dat <- many1 digit
        return (dat)
      

      can be simplified to

      data1 = many1 digit
      
    6. You would do well to add a type signature to all your top-level bindings.

    7. Be consistent in how you name your parsers: why "parseListing" instead of simply "listing"?

    8. Have you considered using a different type of input stream (e.g. Text) for better performance?
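
    Point 1 can be checked in isolation. The following is a minimal sketch, not part of your original code: emptyLineOrig reproduces your emptyLine, while emptyLineFixed is a hypothetical replacement that skips only spaces and tabs before requiring the newline.

```haskell
module Main where

import Data.Either (isLeft)
import Text.Parsec
import Text.Parsec.String (Parser)

-- The emptyLine parser from the question: 'spaces' skips every character
-- satisfying isSpace, including '\n', so 'newline' never finds one.
emptyLineOrig :: Parser Char
emptyLineOrig = spaces >> newline

-- Hypothetical replacement (an assumption, not from the question):
-- skip only spaces and tabs, then require the newline itself.
emptyLineFixed :: Parser Char
emptyLineFixed = skipMany (oneOf " \t") >> newline

main :: IO ()
main = do
  -- The original fails on a whitespace-only line followed by data.
  print (isLeft (parse emptyLineOrig "" "  \n1 2"))
  -- The replacement stops right after the newline.
  print (parse emptyLineFixed "" "  \n1 2")
```

    The first line printed should be True (the original parser fails), and the second should be Right '\n'.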

    Do you have an example of how to parse a structured bunch of data separated by empty lines?

    Below is a much simplified version of the kind of parser you want. Note that the input is not supposed to begin with (but may end with) empty lines, and "data lines" are not supposed to contain leading spaces, but may contain trailing spaces (in the sense of the spaces parser).

    module Main where
    
    import Data.Char ( isSpace )
    import Text.Parsec
    import Text.Parsec.String ( Parser )
    
    eolChar :: Char
    eolChar = '\n'
    
    eol :: Parser Char
    eol = char eolChar
    
    whitespace :: Parser String
    whitespace = many $ satisfy $ \c -> isSpace c && c /= eolChar
    
    emptyLine :: Parser String
    emptyLine = whitespace
    
    emptyLines :: Parser [String]
    emptyLines = sepEndBy1 emptyLine eol
    
    cell :: Parser Integer
    cell = read <$> many1 digit
    
    dataLine :: Parser [Integer]
    dataLine = sepEndBy1 cell whitespace
    --             ^
    -- replace with sepBy1 if no trailing whitespace is allowed in a "data line"
    
    dataLines :: Parser [[Integer]]
    dataLines = sepEndBy1 dataLine eol
    
    listing :: Parser [[[Integer]]]
    listing = sepEndBy dataLines emptyLines
    
    main :: IO ()
    main = do
        fichier <- readFile "test_listtst.txt"
        case parse listing "(test)" fichier of
            -- Print the ParseError as well, so that failures can be diagnosed.
            Left err    -> putStrLn "!!! Error !!!" >> print err
            Right serie -> mapM_ print serie
    

    Test:

    λ> main
    [[1,235,623,684],[2,871,699,557],[3,918,686,49],[4,53,564,906]]
    [[1,154],[2,321],[3,519]]
    [[1,235,623,684],[2,871,699,557],[3,918,686,49]]
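
    If your file may begin with empty lines, or if you want a parse error whenever something unparsable is left over, one possible variation is sketched below. It reuses the same building blocks; strictListing and sample are hypothetical names introduced here for illustration only, and the input is an in-memory string rather than your file.

```haskell
module Main where

import Data.Char (isSpace)
import Text.Parsec
import Text.Parsec.String (Parser)

eol :: Parser Char
eol = char '\n'

-- Horizontal whitespace only: anything isSpace except the newline.
whitespace :: Parser String
whitespace = many (satisfy (\c -> isSpace c && c /= '\n'))

emptyLines :: Parser [String]
emptyLines = sepEndBy1 whitespace eol

cell :: Parser Integer
cell = read <$> many1 digit

dataLine :: Parser [Integer]
dataLine = sepEndBy1 cell whitespace

dataLines :: Parser [[Integer]]
dataLines = sepEndBy1 dataLine eol

-- Hypothetical variant: skip any leading blank lines, then require
-- that the whole input has been consumed.
strictListing :: Parser [[[Integer]]]
strictListing = emptyLines *> sepEndBy dataLines emptyLines <* eof

-- Made-up sample: two leading blank lines, two blocks, no final newline.
sample :: String
sample = "\n\n1 235 623\n2 871 699\n\n\n1 154\n2 321"

main :: IO ()
main = either print (mapM_ print) (parse strictListing "(sample)" sample)
```

    Because emptyLines can succeed without consuming anything, putting it in front of the listing is harmless when the input starts directly with data. The trailing eof turns leftover garbage into a parse error instead of silently ignoring it.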