I have a csv file with the following structure :
I have written a little script to go through each line of the file and return a sequence of tuples containing the column header and the length of the largest string of data in that column :
let getColumnInfo (fileName:string) =
let delimiter = ','
let readLinesIntoColumns (sr:StreamReader) = seq {
while not sr.EndOfStream do
yield sr.ReadLine().Split(delimiter) |> Seq.map (fun c -> c.Length )
}
use sr = new StreamReader(fileName)
let headers = sr.ReadLine().Split(delimiter)
let columnSizes =
let initial = Seq.map ( fun h -> 0 ) headers
let toMaxColLengths (accumulator:seq<int>) (line:seq<int>) =
let chooseBigger a b = if a > b then a else b
Seq.map2 chooseBigger accumulator line
readLinesIntoColumns sr |> Seq.fold toMaxColLengths initial
Seq.zip headers columnSizes;
This works fine on a small file. However when it trys to process a large file (> 75 Mb) it blows fsi with a StackOverflow exception. If I remove the line
Seq.map2 chooseBigger accumulator line
the program completes.
Now, my question is this : why is F# using up the stack? My understanding of sequences in F# is that the entire sequence is not held in memory, only the elements that are being processed. Therefore I expected that the lines that had already been processed would not remain on the stack. Where is my misunderstanding?
I think this is a good question. Here's a simpler repro:
let test n =
[for i in 1 .. n -> Seq.empty]
|> List.fold (Seq.map2 max) Seq.empty
|> Seq.iter ignore
test
creats a sequence of empty sequences, calculates the max by rows, and then iterates over the resulting (empty) sequence. You'll find that with a high value of n
this will cause a stack overflow, even though there aren't any values to iterate over at all!
It's a bit tricky to explain why, but here's a stab at it. The problem is that as you fold over the sequences, Seq.map2
is returning a new sequence which defers its work until it's enumerated. Thus, when you try to iterate through the resulting sequence, you end up calling back into a chain of computations n
layers deep.
As Daniel explains, you can escape this by evaluating the resulting sequence eagerly (e.g. by converting it to a list).
EDIT
Here's an attempt to further explain what's going wrong. When you call Seq.map2 max s1 s2
, neither s1
nor s2
is actually enumerated; you get a new sequence which, when enumerated, will enumerate both of them and compare the yielded values. Thus, if we do something like the following:
let s0 = Seq.empty
let s1 = Seq.map2 max Seq.emtpy s0
let s2 = Seq.map2 max Seq.emtpy s1
let s3 = Seq.map2 max Seq.emtpy s2
let s4 = Seq.map2 max Seq.emtpy s3
let s5 = Seq.map2 max Seq.emtpy s4
...
Then the call to Seq.map2
always returns immediately and uses constant stack space. However, enumerating s5 requires enumerating s4, which requires enumerating s3, etc. This means that enumerating s99999 will build up a huge call stack that looks sort of like:
...
(s99996's enumerator).MoveNext()
(s99997's enumerator).MoveNext()
(s99998's enumerator).MoveNext()
(s99999's enumerator).MoveNext()
and we'll get a stack overflow.