I want to read the last line of my file and check that it has the same number of fields as the first line; I don't care about anything in the middle. I'm using mmap because it's fast for random access on large files, but I'm running into problems, probably because I don't fully understand Haskell or laziness.
λ> import qualified Data.ByteString.Lazy.Char8 as LB
λ> import System.IO.MMap
λ> outh <- mmapFileByteStringLazy fname Nothing
λ> LB.length outh
87094896
λ> LB.takeWhile (`notElem` "\n") outh
"\"Field1\",\"Field2\",
Great.
From here, I know that
takeWhileR p xs is equivalent to reverse (takeWhileL p (reverse xs)).
So let's build that. That is, let's get the last line by reversing my lazy bytestring, taking characters while they are not "\n" just as before, then reversing the result back. Laziness makes me think the compiler will let me do this cheaply.
So trying this out:
LB.reverse (LB.takeWhile (`notElem` "\n") (LB.reverse outh))
What I expect to see is:
"\"val1\",\"val2\",
Instead, this crashes my session.
Segmentation fault (core dumped)
Questions:
For other readers, if you're looking to get the last line, you may find a very fast and suitable method described in the answer here: hSeek and SeekFromEnd in Haskell
In this thread, I'm looking specifically for a solution using mmap.
I would prefer the use of bytestring-mmap, made by the same author as bytestring. In either case, all you need is:
import System.IO.Posix.MMap (unsafeMMapFile)
import qualified Data.ByteString.Char8 as BS

main = do
  -- can be swapped out for `mmapFileByteString` from `mmap`
  bs <- unsafeMMapFile "file.txt"
  let (firstLine, _) = BS.break (== '\n') bs
      (_, lastLine) = BS.breakEnd (== '\n') bs
  putStrLn $ "First line: " ++ BS.unpack firstLine
  putStrLn $ "Last line: " ++ BS.unpack lastLine
This runs instantly too, with no extra allocations. As before, there is the caveat that many files end in a newline, so one may want to use BS.breakEnd (== '\n') (BS.init bs) to ignore the final \n character.
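To tie the trailing-newline caveat back to the original goal of comparing field counts, here is a minimal sketch. The helper names (stripFinalNewline, firstAndLastLine, fieldCount, sameFieldCount) are made up for illustration, and the field count is a naive "commas plus one" that assumes no commas inside quoted fields. It works on an in-memory sample, but since unsafeMMapFile returns a strict ByteString, the same functions should apply to the mapped file.

```haskell
import qualified Data.ByteString.Char8 as BS

-- | Drop a single trailing newline, if there is one.
stripFinalNewline :: BS.ByteString -> BS.ByteString
stripFinalNewline bs = case BS.unsnoc bs of
  Just (rest, '\n') -> rest
  _                 -> bs

-- | First and last line of the input, ignoring one trailing newline.
firstAndLastLine :: BS.ByteString -> (BS.ByteString, BS.ByteString)
firstAndLastLine bs = (firstLine, lastLine)
  where
    body           = stripFinalNewline bs
    (firstLine, _) = BS.break (== '\n') body
    (_, lastLine)  = BS.breakEnd (== '\n') body

-- | Naive field count: commas plus one.
-- Assumption: no commas appear inside quoted fields.
fieldCount :: BS.ByteString -> Int
fieldCount line = BS.count ',' line + 1

-- | Do the first and last lines have the same number of fields?
sameFieldCount :: BS.ByteString -> Bool
sameFieldCount bs = fieldCount firstLine == fieldCount lastLine
  where
    (firstLine, lastLine) = firstAndLastLine bs

main :: IO ()
main = do
  -- Stand-in for the contents of the mmapped file.
  let sample = BS.pack "\"Field1\",\"Field2\"\n\"a\",\"b\"\n\"val1\",\"val2\"\n"
  print (firstAndLastLine sample)
  print (sameFieldCount sample)  -- True
```

Both BS.break and BS.breakEnd return slices of the original buffer, so nothing in the middle of the file is copied.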
Also, I would not recommend reversing the bytestring: that will require at least some allocations, which in this case are completely avoidable. Even if you use a lazy bytestring, you still pay the cost of going through all the chunks of the bytestring (which ideally shouldn't even have been constructed at this point). That said, your reversing code should work. I reckon something is off with mmap (probably the package, as doing the same thing with a strict bytestring works just fine).
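If you do stay with a lazy bytestring, one way to avoid reversing the bytes themselves is to walk the chunks from the end and stop at the first chunk that contains a newline. This is only a sketch: lastLineLazy is a made-up name, it does not handle a trailing newline, and reversing the chunk list still forces every chunk to be constructed, which is exactly the cost described above.

```haskell
import qualified Data.ByteString.Char8 as BS
import qualified Data.ByteString.Lazy.Char8 as LB

-- | Last line of a lazy ByteString, found by scanning chunks from the end.
-- The bytes are never reversed; only the list of chunks is.
lastLineLazy :: LB.ByteString -> BS.ByteString
lastLineLazy = go [] . reverse . LB.toChunks
  where
    -- acc holds the chunks after the current one, in file order.
    go acc [] = BS.concat acc
    go acc (c : cs) =
      let (before, after) = BS.breakEnd (== '\n') c
      in if BS.null before
           then go (c : acc) cs            -- no newline in this chunk
           else BS.concat (after : acc)    -- newline found: stop here

main :: IO ()
main = do
  -- Three chunks; the last line "e,f" spans the final two.
  let sample = LB.fromChunks (map BS.pack ["a,b\nc,", "d\ne,", "f"])
  BS.putStrLn (lastLineLazy sample)
```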
I'm not sure what the problem is with the functions in System.IO. The following runs instantly on my laptop, with file.txt being almost 4GB. It isn't elegant, but it is certainly efficient.
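import System.IO

-- Read the last line by stepping backwards from the end of the file,
-- one character at a time, until a newline is found.
-- Caveat: if the file ends with '\n', this returns the empty string.
hGetLastLine :: Handle -> IO String
hGetLastLine hdl = go "" (negate 1)
  where
    go s i = do
      hSeek hdl SeekFromEnd i
      c <- hGetChar hdl
      if c == '\n'
        then pure s
        else go (c:s) (i-1)

main = do
  handle <- openFile "file.txt" ReadMode
  firstLine <- hGetLine handle
  putStrLn $ "First line: " ++ firstLine
  lastLine <- hGetLastLine handle
  putStrLn $ "Last line: " ++ lastLine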
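As written, hGetLastLine returns an empty string when the file ends in a newline, and it seeks past the start of the file if there is no newline at all. Below is a hedged sketch of a variant that handles both cases. The name hGetLastLineSafe is hypothetical, the temporary file exists only for the demo, and the handle is put in binary mode on the assumption that the data uses a single-byte encoding (seeking into the middle of a multi-byte character would otherwise be a problem).

```haskell
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- | Hypothetical variant of hGetLastLine: walks backwards from the end,
-- ignores one trailing newline, and stops at the start of the file.
hGetLastLineSafe :: Handle -> IO String
hGetLastLineSafe hdl = do
  size <- hFileSize hdl
  let lastPos = size - 1
      go acc pos
        | pos < 0 = pure acc                     -- reached the start of the file
        | otherwise = do
            hSeek hdl AbsoluteSeek pos
            c <- hGetChar hdl
            case c of
              '\n' | pos == lastPos -> go acc (pos - 1)  -- skip the final newline
                   | otherwise      -> pure acc          -- end of the previous line
              _                     -> go (c : acc) (pos - 1)
  go "" lastPos

main :: IO ()
main = do
  -- Write a small sample file so the example is self-contained.
  tmpDir <- getTemporaryDirectory
  (path, tmp) <- openTempFile tmpDir "lastline.txt"
  hPutStr tmp "f1,f2\nmiddle,row\nv1,v2\n"
  hClose tmp
  lastLine <- withFile path ReadMode $ \hdl -> do
    hSetBinaryMode hdl True   -- assumption: single-byte characters
    hGetLastLineSafe hdl
  putStrLn ("Last line: " ++ lastLine)
  removeFile path
```

Using AbsoluteSeek with the file size keeps the position arithmetic explicit, which makes the start-of-file check straightforward.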