multithreading, julia

Multi-threading improvement when reading in CSV using Julia


I'm wondering whether the results below make sense. The Data.csv file is about 8 GB, and my laptop has 64 GB of RAM and 12 threads. Is this the kind of improvement I should expect from multi-threading, or is there something else I should be doing here?

@time CSV.read(raw"Data.csv", DataFrame, ntasks=1); # one thread

139.160430 seconds

@time CSV.read(raw"Data.csv", DataFrame, ntasks=8); # 8 threads

113.964781 seconds

@time CSV.read(raw"Data.csv", DataFrame, ntasks=12); # 12 threads

112.279668 seconds

As shown above, I tried different ntasks= values to select different thread counts. I am new to multi-threading, so I am trying to get a sense of how much improvement I should expect.


Solution

  • When reading data from disk, the bottleneck is typically the disk itself, so adding threads does not speed up that part. The modest speedup you see (about 139 s down to 112 s) probably comes from the parsing step, which is the part that can run in parallel.
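
One way to check this on your own machine is to time the disk read and the parse separately. The sketch below is only an illustration: `time_read_and_parse` is a hypothetical helper name, not part of CSV.jl, and it assumes the whole file fits in RAM (which 8 GB does on a 64 GB machine). It also prints `Threads.nthreads()`, because `ntasks` can only spread parsing across the threads Julia was started with (for example via `julia --threads 12` or the `JULIA_NUM_THREADS` environment variable).

```julia
using CSV, DataFrames

# Hypothetical helper (not part of CSV.jl): times the raw disk read and the
# in-memory parse separately, so you can see which of the two dominates.
function time_read_and_parse(path::AbstractString; ntasks::Integer=Threads.nthreads())
    local bytes, df
    # Step 1: pull the file from disk into memory, no parsing yet.
    read_s = @elapsed bytes = read(path)
    # Step 2: parse the bytes that are already in memory.
    parse_s = @elapsed df = CSV.read(bytes, DataFrame; ntasks=ntasks)
    return (read_s=read_s, parse_s=parse_s, nrows=nrow(df))
end

# How many threads this Julia session actually has.
println("Julia threads available: ", Threads.nthreads())
```

Calling `time_read_and_parse("Data.csv"; ntasks=1)` and then again with `ntasks=12` should show whether `parse_s` shrinks while `read_s` stays roughly the same. Keep in mind that the operating system caches recently read files, so a second run may report a much smaller `read_s` than the first, and that the first call includes some compilation time.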