Parallel disk I/O: Is it faster?

As the developer of GNU Parallel, I am often asked: Is it faster to do disk I/O in parallel or sequentially?

The answer is a very clear and resounding: It depends.

In the simplest case, where you have a single spinning disk (“one spindle”), it will normally be faster to do sequential I/O. But see http://unix.stackexchange.com/questions/124527/speed-up-copying-1000000-small-files for an example that contradicts this.

If you instead have a RAID6 over 40 spindles, things are different: On such a system I got a 6x speedup by running 10 jobs in parallel. Running fewer or more jobs gave a smaller speedup.

Network file systems are stored on physical servers, so the above applies to network file systems, too. On top of that, the network can introduce latency, so that opening a file is slow, but once the file is open you get the data fast. In this case running jobs in parallel will often give a speedup, too. Distributed network file systems that are spread over several nodes will in most cases work faster if you can keep all nodes busy.

In the general case: If there is long latency (which is not caused by the parallelization itself), then parallelizing may be a good idea, because while one job is waiting due to the latency, a different job can be receiving its data.
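
To make the latency argument concrete, here is a minimal Python sketch. It does no real disk or network I/O: the function fetch is a hypothetical stand-in that simply sleeps for an assumed latency of 50 ms, and run and LATENCY are names I made up for the illustration. It only shows that waiting periods overlap when several jobs run at once; it says nothing about how a real disk behaves.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY = 0.05  # assumed per-request latency in seconds (illustrative only)


def fetch(i):
    """Hypothetical stand-in for an I/O request: wait out the latency."""
    time.sleep(LATENCY)
    return i


def run(jobs, n_requests=8):
    """Return the wall-clock seconds needed for n_requests with `jobs` workers."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = list(pool.map(fetch, range(n_requests)))
    assert results == list(range(n_requests))
    return time.perf_counter() - start


if __name__ == "__main__":
    for jobs in (1, 2, 4, 8):
        print(f"{jobs} job(s): {run(jobs):.2f} s")
```

With one job the waits add up; with several jobs they overlap, so the total time shrinks roughly in proportion until something else becomes the bottleneck.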

With SSDs, RAID, different network file systems, and caching there is really only one safe answer: Test with different degrees of parallelism and measure.
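
A rough way to do such a measurement is sketched below in Python. The names read_file and benchmark are my own, the job counts are arbitrary, and the demo writes small temporary files just so the script runs anywhere. Treat it as a starting point, not a proper benchmark: files that were just written or recently read will likely be served from the cache, so for a meaningful result point it at your real files on the storage you care about.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Read a whole file and return the number of bytes read."""
    with open(path, "rb") as f:
        return len(f.read())


def benchmark(paths, job_counts=(1, 2, 4, 8)):
    """Time reading all paths at each job count.

    Returns a dict mapping jobs -> (seconds, total_bytes).
    """
    results = {}
    for jobs in job_counts:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=jobs) as pool:
            total = sum(pool.map(read_file, paths))
        results[jobs] = (time.perf_counter() - start, total)
    return results


if __name__ == "__main__":
    # Demo data only: 20 files of 1 MiB each in a temporary directory.
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for i in range(20):
            p = os.path.join(d, f"f{i}.bin")
            with open(p, "wb") as f:
                f.write(os.urandom(1 << 20))
            paths.append(p)
        for jobs, (secs, total) in benchmark(paths).items():
            print(f"{jobs} job(s): {secs:.3f} s for {total} bytes")
```

Run it several times and compare the job counts; the best number of jobs on one system may well be the worst on another.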
