I have been using the FSharp.Data CSV parser (CsvFile) to load files of about 300k to 1M rows (roughly 50 to 120 MB). It works very well and is very fast: it can load most files in under a second. Here is the output from 64-bit FSI on Windows, loading a file of about 400k rows and 25 fields.
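For context, both sessions below assume FSharp.Data and Deedle are already referenced and opened in FSI, along these lines (the package references are illustrative; on older FSI you would #r the DLLs from your packages folder instead):
#r "nuget: FSharp.Data"   // provides CsvFile
#r "nuget: Deedle"        // provides Frame.ReadCsv
open FSharp.Data
open Deedle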
#time
let Csv2 = CsvFile.Parse(testfile)
let parsedRows = Csv2.Rows |> Seq.toArray
#time
--> Timing now on
Real: 00:00:00.056, CPU: 00:00:00.093, GC gen0: 0, gen1: 0, gen2: 0
But when I load the same file into Deedle:
#time
let dCsv = Frame.ReadCsv(testfile)
#time;;
--> Timing now on
Real: 00:01:39.197, CPU: 00:01:41.119, GC gen0: 6324, gen1: 417, gen2: 13
It takes over 1 minute 40 seconds. I know some extra time is expected, since Deedle is doing much more than the plain CSV parser above, but over 1m 40s still seems high. Can I somehow shorten it?
By default, the Frame.ReadCsv function attempts to infer the type of each column by looking at its contents, and I think this is adding most of the overhead here. You can specify inferTypes=false to disable inference completely (the data then loads as strings; there is a sketch of that below), or you can use inferRows=10 to infer the types from just the first few rows. The latter should work well enough and be much faster:
let df = Frame.ReadCsv(testfile, inferRows=10)
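If you instead disable inference completely with inferTypes=false, everything comes back as strings, so you would convert the columns you need yourself. A rough sketch (the "Price" column name is made up; substitute one of your 25 fields):
let raw = Frame.ReadCsv(testfile, inferTypes=false)
// Read the column as strings, parse the values into floats,
// then swap the typed series back into the frame
let prices =
    raw.GetColumn<string>("Price")
    |> Series.map (fun _ v -> float v)
raw.ReplaceColumn("Price", prices)
Depending on your Deedle version, ReadCsv may also accept an explicit schema string, which lets you skip inference altogether when you already know the column types.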
Maybe we should make something like the inferRows option the default. If this does not fix the problem, please submit a GitHub issue and we'll look into it!