
Haskell Read Last Line with a Lazy mmap


I want to read the last line of my file and make sure it has the same number of fields as the first; I don't care about anything in the middle. I'm using mmap because it's fast for random access on large files, but I'm running into problems, probably because I don't fully understand Haskell or laziness.

λ> import qualified Data.ByteString.Lazy.Char8 as LB
λ> import System.IO.MMap
λ> outh <- mmapFileByteStringLazy fname Nothing 
λ> LB.length outh
87094896
λ> LB.takeWhile (`notElem` "\n") outh
"\"Field1\",\"Field2\",

Great.

From here, I know that

takeWhileR p xs is equivalent to reverse (takeWhileL p (reverse xs)).

So let's build that. That is, let's get the last line by reversing my lazy bytestring, taking characters while they are not "\n" just as before, then reversing the result back. Laziness makes me think the compiler will let me do this cheaply.

So trying this out:

LB.reverse (LB.takeWhile (`notElem` "\n") (LB.reverse outh))

What I expect to see is:

"\"val1\",\"val2\",

Instead, this crashes my session.

Segmentation fault (core dumped)

Questions:

  1. What am I doing wrong with laziness, or bytestrings, or the mmap library, or Haskell?
  2. How can I get this line correctly and memory-efficiently? (Perhaps by using foreign pointers instead of lazy bytestrings?)

For other readers, if you're looking to get the last line, you may find a very fast and suitable method described in the answer here: hSeek and SeekFromEnd in Haskell

In this thread, I'm looking specifically for a solution using mmap.


Solution

  • I would prefer bytestring-mmap, which is written by the same author as bytestring. In either case, all you need is

    import System.IO.Posix.MMap (unsafeMMapFile)
    import qualified Data.ByteString.Char8 as BS
    
    main :: IO ()
    main = do
      -- `mmapFileByteString "file.txt" Nothing` from `mmap` can be swapped in here
      bs <- unsafeMMapFile "file.txt"
    
      let (firstLine, _) = BS.break (== '\n') bs
          (_, lastLine) = BS.breakEnd (== '\n') bs
    
      putStrLn $ "First line: " ++ BS.unpack firstLine
      putStrLn $ "Last line: " ++ BS.unpack lastLine
    

    This runs instantly too, with no extra allocations. As before, there is the caveat that many files end in a newline, so one may want BS.breakEnd (== '\n') (BS.init bs) to ignore the final \n character.
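
    The trailing-newline caveat and the field check from the question can be sketched with pure functions over a strict bytestring. This is only a sketch: the names dropTrailingNewline, firstLine, lastLine, fieldCount and sameFieldCount are hypothetical helpers, not part of any library, and counting commas is a naive stand-in for real CSV parsing (it miscounts commas inside quoted fields).

```haskell
import qualified Data.ByteString.Char8 as BS

-- | Hypothetical helper: drop one trailing '\n' if there is one.
-- Unlike a bare BS.init, this is safe on an empty bytestring.
dropTrailingNewline :: BS.ByteString -> BS.ByteString
dropTrailingNewline bs
  | not (BS.null bs) && BS.last bs == '\n' = BS.init bs
  | otherwise                              = bs

-- | Everything before the first '\n'.
firstLine :: BS.ByteString -> BS.ByteString
firstLine = fst . BS.break (== '\n')

-- | Everything after the last '\n', ignoring a single trailing newline.
lastLine :: BS.ByteString -> BS.ByteString
lastLine = snd . BS.breakEnd (== '\n') . dropTrailingNewline

-- | Naive field count: number of commas plus one.
-- Assumption: no commas inside quoted fields.
fieldCount :: BS.ByteString -> Int
fieldCount line = BS.count ',' line + 1

-- | Do the first and last lines have the same (naive) field count?
sameFieldCount :: BS.ByteString -> Bool
sameFieldCount bs = fieldCount (firstLine bs) == fieldCount (lastLine bs)
```

    Applied to the bs obtained from unsafeMMapFile above, sameFieldCount bs only slices the existing buffer; break, breakEnd and init do not copy the underlying bytes.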

    Also, I would not recommend reversing the bytestring: that requires at least some allocations, which in this case are completely avoidable. Even if you use a lazy bytestring, you still pay the cost of going through all the chunks of the bytestring (which hopefully shouldn't even have been constructed at this point). That said, your reversing code should work. I reckon something is off with mmap (probably the package, as doing the same thing with a strict bytestring works just fine).
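
    If a lazy bytestring is unavoidable, one way to skip LB.reverse is to walk the chunk list from the back and stop at the first chunk that contains a newline. The sketch below is an illustration, not something from the mmap package: lastLineLazy is a hypothetical name, it does not strip a trailing newline, and it still forces the whole spine of chunks, so the cost mentioned above is reduced (no byte-by-byte reversal) but not eliminated.

```haskell
import qualified Data.ByteString.Char8 as BS
import qualified Data.ByteString.Lazy.Char8 as LB

-- | Hypothetical helper: last line of a lazy bytestring, found by
-- scanning chunks from the end rather than reversing the bytes.
lastLineLazy :: LB.ByteString -> BS.ByteString
lastLineLazy = go [] . reverse . LB.toChunks
  where
    -- acc holds the pieces of the last line found so far, in file order.
    go acc []       = BS.concat acc
    go acc (c : cs) =
      let (before, after) = BS.breakEnd (== '\n') c
      in if BS.null before
           then go (c : acc) cs            -- no newline in this chunk: keep walking back
           else BS.concat (after : acc)    -- newline found: keep only what follows it
```

    Only the chunks that overlap the last line are concatenated, so the copy is proportional to the length of that line rather than to the size of the file.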

    Previous answer, from before OP's edit

    I'm not sure what the problem is with the functions in System.IO. The following runs instantly on my laptop, with file.txt being almost 4 GB. It isn't elegant, but it is certainly efficient.

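    import System.IO
    
    -- Read the last line by seeking backwards from the end of the file,
    -- one character at a time, until a newline is found.
    -- Caveats: returns "" if the file ends in '\n', and hSeek fails if
    -- the file contains no '\n' at all.
    hGetLastLine :: Handle -> IO String
    hGetLastLine hdl = go "" (negate 1)
      where
      go s i = do
        hSeek hdl SeekFromEnd i
        c <- hGetChar hdl
        if c == '\n'
          then pure s
          else go (c:s) (i-1)  -- prepend, since we are reading right to left
    
    
    main :: IO ()
    main = do
      handle <- openFile "file.txt" ReadMode
    
      firstLine <- hGetLine handle
      putStrLn $ "First line: " ++ firstLine
    
      lastLine <- hGetLastLine handle
      putStrLn $ "Last line: " ++ lastLine
    
      hClose handle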
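
    As written, hGetLastLine returns an empty string when the file ends in a newline and seeks past the start of the file when there is no newline at all. A possible variant that handles both cases is sketched below; hGetLastLineSafe is a hypothetical name, and the sketch assumes single-byte characters and no newline translation on the handle.

```haskell
import System.IO

-- | Hypothetical variant of hGetLastLine: skips one trailing '\n' and
-- stops at the start of the file instead of seeking past it.
hGetLastLineSafe :: Handle -> IO String
hGetLastLineSafe hdl = do
  size <- hFileSize hdl
  go size "" 1
  where
    -- i is the distance from the end of the file, starting at 1.
    go size acc i
      | i > size  = pure acc                 -- reached the start of the file
      | otherwise = do
          hSeek hdl SeekFromEnd (negate i)
          c <- hGetChar hdl
          if c == '\n'
            then if i == 1
                   then go size acc (i + 1)  -- ignore a single trailing newline
                   else pure acc
            else go size (c : acc) (i + 1)
```

    It can be used in place of hGetLastLine in main above, for example via withFile "file.txt" ReadMode hGetLastLineSafe.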