Search code examples
lazy-evaluationhaskelllazy-io

How does hGetContents achieve memory efficiency?


I want to add Haskell to my toolbox so I'm working my way through Real World Haskell.

In the chapter in Input and Output, in the section on hGetContents, I came across this example:

import System.IO
import Data.Char(toUpper)

main :: IO ()
main = do 
    inh <- openFile "input.txt" ReadMode
    outh <- openFile "output.txt" WriteMode
    inpStr <- hGetContents inh
    let result = processData inpStr
    hPutStr outh result
    hClose inh
    hClose outh

processData :: String -> String
processData = map toUpper

Following this code sample, the authors go on to say:

Notice that hGetContents handled all of the reading for us. Also, take a look at processData. It's a pure function since it has no side effects and always returns the same result each time it is called. It has no need to know—and no way to tell—that its input is being read lazily from a file in this case. It can work perfectly well with a 20-character literal or a 500GB data dump on disk. (N.B. Emphasis is mine)

My question is: how does hGetContents or its resultant values achieve this memory efficiency without – in this example – processData "being able to tell", and still maintain all benefits that accrue to pure code (i.e. processData), specifically memoization?

<- hGetContents inh returns a string so inpStr is bound to a value of type String, which is exactly the type that processData accepts. But if I understand the authors of Real World Haskell correctly, then this string isn't quite like other strings, in that it's not fully loaded into memory (or fully evaluated, if such a things as not-fully-evaluated strings exists...) by the time of the call to processData.

Therefore, another way to ask my question is: if inpStr is not fully evaluated or loaded into memory at the time of the call to processData, then how can it be used to lookup if a memoized call to processData exists, without first fully evaluating inpStr?

Are there instances of type String that each behave differently but cannot be told apart at this level of abstraction?


Solution

  • The String returned by hGetContents is no different from any other Haskell string. In general, Haskell data is not fully evaluated unless the programmer has taken extra steps to ensure that it is (e.g. seq, deepseq, bang patterns).

    Strings are defined as (essentially)

    data List a = Nil | Cons a (List a) -- Nil === [], Cons === :
    type String = List Char
    

    This means that a string is either empty, or a single character (the head) and another string (the tail). Due to laziness, the tail may not exist in memory, and may even be infinite. Upon processing a String, a Haskell program will typically check if it's Nil or Cons, then branch and proceed as necessary. If the function doesn't need to evaluate the tail, it won't. This function, for example, only needs to check the initial constructor:

    safeHead :: String -> Maybe Char
    safeHead [] = Nothing
    safeHead (x:_) = Just x
    

    This is a perfectly legitimate string

    allA's = repeat 'a' :: String
    

    that is infinite. You can manipulate this string sensibly, however if you try to print all of it, or calculate the length, or any sort of unbounded traversal your program won't terminate. But you can use functions like safeHead without any problem whatsoever, and even consume some finite initial substring.

    Your intuition that something strange is happening is correct, however. hGetContents is implemented using the special function unsafeInterleaveIO, which is essentially a compiler hook into IO behavior. The less said about this, the better.

    You're correct that it would be difficult to check if a memoized call to a function exists without having the argument fully evaluated. However, most compilers don't perform this optimization. The problem is that it's very difficult for a compiler to determine when it's worthwhile to memoize calls, and very easy to consume all of your memory by doing so. Fortunately there are several memoizing libraries you can use to add memoization when appropriate.