multithreading, r, hardware, multicore

Can parallel operations speed the availability of a file from a hard disk in R?


I have a huge data file (~4GB) that I am passing through R (to do some string clean-up) on its way into a MySQL database. Each row/line is independent of the others. Is there any speed advantage to be had by using parallel operations to finish this process? That is, could one thread start by skipping no lines and read every second line, while another starts with a skip of 1 line and reads every second line? If so, would it actually speed up the process, or would the two threads fighting for the 10K Western Digital hard drive (not SSD) negate any possible advantage?


Solution

  • The answer is maybe. At some point, disk access will become the limiting factor. Whether that happens with 2 cores running or with 8 depends on the characteristics of your hardware setup. It would be pretty easy to just try it out while watching your system with top. If %wa (the share of CPU time spent waiting on I/O) is consistently above zero, the CPUs are waiting for the disk to catch up, and adding more workers is likely slowing the whole process down.
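
One way to "just try it out" is to time the same job with one worker and with several. The sketch below is only an illustration under stated assumptions: it uses a single sequential reader and parallelises the string clean-up (rather than the interleaved reads described in the question), it relies on `parallel::mclapply`, which forks and therefore only runs in parallel on Unix-like systems, and `clean_lines` and `process_file` are hypothetical names standing in for the real clean-up code.

```r
library(parallel)

# Hypothetical clean-up step; a stand-in for the real string cleaning.
clean_lines <- function(x) toupper(trimws(x))

# Read the file sequentially in blocks, then clean each block's chunks
# on `workers` forked processes. Line order is preserved.
process_file <- function(path, workers = 1L, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  out <- character(0)
  repeat {
    lines <- readLines(con, n = chunk_size * workers)
    if (length(lines) == 0L) break
    # Split the block into consecutive chunks of at most `chunk_size` lines.
    groups <- split(lines, ceiling(seq_along(lines) / chunk_size))
    cleaned <- mclapply(groups, clean_lines, mc.cores = workers)
    out <- c(out, unlist(cleaned, use.names = FALSE))
  }
  out
}

# Small synthetic file for the demonstration (assumption: the real test
# would point `path` at the actual data file instead).
path <- tempfile(fileext = ".txt")
writeLines(sprintf("  row %d  ", 1:50000), path)

t_serial   <- system.time(res1 <- process_file(path, workers = 1L))
t_parallel <- system.time(res2 <- process_file(path, workers = 2L))

# Elapsed wall-clock seconds for each run.
print(c(serial   = t_serial[["elapsed"]],
        parallel = t_parallel[["elapsed"]]))
```

A small temporary file like this one will mostly be served from the operating system's cache, so the timings here say little about the disk. For a meaningful comparison, run it against the real file with different values of `workers` and keep top open in another terminal to watch %wa.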