I'm wondering whether the timings below make sense. The Data.csv file is about 8 GB, and my laptop has 64 GB of RAM and 12 threads. Is this the kind of improvement I should expect from multi-threading, or is there something else I should do here?
@time CSV.read(raw"Data.csv", DataFrame, ntasks=1); # one thread
139.160430 seconds
@time CSV.read(raw"Data.csv", DataFrame, ntasks=8); # 8 threads
113.964781 seconds
@time CSV.read(raw"Data.csv", DataFrame, ntasks=12); # 12 threads
112.279668 seconds
As shown above, I tried different ntasks= values to vary the thread count, but I am new to multi-threading and am trying to get a sense of how much improvement I should expect.
When reading data from disk, the bottleneck is typically the disk itself, so adding threads will not speed up that part. The modest speedup you see is probably a small gain in the parsing step, which is the only part that can run in parallel.
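One way to check this on your machine is to time the raw read separately from the parse. The sketch below is only an illustration: `raw_read_seconds` is a hypothetical helper name, not part of CSV.jl, and it runs on a tiny temporary file so that it works anywhere. It also prints `Threads.nthreads()`, because `ntasks` only controls how many chunks CSV.jl splits the file into; the number of threads itself is fixed when Julia starts (for example with `julia -t 12` or the `JULIA_NUM_THREADS` environment variable).

```julia
# Hypothetical helper (not part of CSV.jl): time how long it takes just to
# pull a file's bytes into memory, with no CSV parsing involved.
function raw_read_seconds(path::AbstractString)
    t0 = time_ns()
    bytes = read(path)                      # read the whole file as Vector{UInt8}
    seconds = (time_ns() - t0) / 1e9
    return (seconds = seconds, nbytes = length(bytes))
end

# Number of threads this Julia session was started with.
# If this prints 1, a larger ntasks value cannot give any parallelism.
println("Julia threads available: ", Threads.nthreads())

# Demo on a small temporary file so the snippet runs anywhere.
# For the real measurement, replace `path` with raw"Data.csv".
path, io = mktemp()
write(io, "a,b\n1,2\n3,4\n")
close(io)

r = raw_read_seconds(path)
println("Read ", r.nbytes, " bytes in ", r.seconds, " seconds")

rm(path)                                    # clean up the temporary file
```

If the raw read of Data.csv alone takes most of the ~112 seconds, the disk is the limit and more tasks will not help. If it is much faster than that, the time is going into parsing, and it is worth checking that Julia was actually started with more than one thread. Keep in mind that a second read of the same file may be served from the operating system's file cache, so the first timing after a reboot is the one that reflects the disk.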