I have been using the FSharp.Data CSV parser (CsvFile) to load files of about 300k to 1M rows (roughly 50 to 120 MB). It works very well and is very fast: it can load most files in under a second. Here is the output from 64-bit FSI on Windows, loading a file of about 400k rows and 25 fields.
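For context, both sessions below assume FSharp.Data and Deedle are already referenced and opened in FSI, along these lines (the package references are illustrative; on older FSI you would #r the DLLs from your packages folder instead):
#r "nuget: FSharp.Data"   // provides CsvFile
#r "nuget: Deedle"        // provides Frame.ReadCsv
open FSharp.Data
open Deedle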
#time
let Csv2 = CsvFile.Parse(testfile)
let parsedRows = Csv2.Rows |> Seq.toArray
#time
--> Timing now on
Real: 00:00:00.056, CPU: 00:00:00.093, GC gen0: 0, gen1: 0, gen2: 0
But when I load the same file into Deedle:
#time
let dCsv = Frame.ReadCsv(testfile)
#time;;
--> Timing now on
Real: 00:01:39.197, CPU: 00:01:41.119, GC gen0: 6324, gen1: 417, gen2: 13
It takes over 1 minute 40 seconds. I know some extra time is expected, since Deedle is doing much more than the plain CSV parser above, but over 1m 40s still seems high. Can I somehow shorten it?
By default, the Frame.ReadCsv function attempts to infer the type of each column by looking at its contents, and I think this is adding most of the overhead here. You can specify inferTypes=false to disable inference completely (the data then loads as strings; there is a sketch of that below), or you can use inferRows=10 to infer the types from just the first few rows. The latter should work well enough and be much faster:
let df = Frame.ReadCsv(testfile, inferRows=10)
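If you instead disable inference completely with inferTypes=false, everything comes back as strings, so you would convert the columns you need yourself. A rough sketch (the "Price" column name is made up; substitute one of your 25 fields):
let raw = Frame.ReadCsv(testfile, inferTypes=false)
// Read the column as strings, parse the values into floats,
// then swap the typed series back into the frame
let prices =
    raw.GetColumn<string>("Price")
    |> Series.map (fun _ v -> float v)
raw.ReplaceColumn("Price", prices)
Depending on your Deedle version, ReadCsv may also accept an explicit schema string, which lets you skip inference altogether when you already know the column types.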
Maybe we should make something like the inferRows option the default. If this does not fix the problem, please submit a GitHub issue and we'll look into it!