Motivation
I work on machine learning projects that involve serializing and deserializing large amounts of data. The IO can often be the chief bottleneck in the programs that I write. Naturally, I wanted to determine the fastest methods for sequentially reading, writing, and copying files on OS X and Linux. This series of benchmarks was written towards this end.
This page briefly describes the methodology, and my observations based on the results from the two machines on which I ran the benchmarks. I also quickly threw up a Google form so people can run the benchmarks on different systems and share their results. The form submissions are publicly viewable.
Before proceeding, I would like to warn you that I am by no means a systems expert. I have tried to be cautious when writing the benchmarks, but it is very possible that I got a few details wrong, or missed some others entirely. Feel free to send me any comments or suggestions you have about the project.
Overview
All of the benchmarks rely on the test files generated by `tools/make_data.rb`. The implementations of the benchmarks can be found in the `src` directory. Some common IO wrapper functions, and a copy of the CCBase headers, are found in the `include` directory.
Each benchmark tests a series of IO methods that perform the same task. Each IO method is evaluated `num_trials` times for each file size in the range defined in `tools/test_read.sh` or `tools/test_write.sh`, where `num_trials` is defined in `include/configuration.hpp`. For each IO method, the test function defined in `include/test.hpp` prints out the following information:
- The file size, in megabytes.
- The name of the IO method used.
- The mean completion time, in milliseconds.
- The standard deviation, in milliseconds.
You can access the raw results for the two machines on which I ran the benchmarks here.
Each method used in the read benchmark counts the occurrences of `needle` (set to `0xFF` in `include/configuration.hpp`) as a check for correctness. To mitigate the effects of caching, the test function also purges the page cache after each method evaluation (see `purge_cache` in `include/io_common.hpp` for how this is done). The write benchmark writes a specified number of randomly-generated bytes to a destination file. The last benchmark, which tests copying, does not need much explanation.
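The actual purge logic lives in `include/io_common.hpp`; as a hedged sketch of what such a purge typically looks like (this is an illustration under my assumptions, not necessarily what `purge_cache` does), one can invoke the `purge(8)` utility on OS X or write to `/proc/sys/vm/drop_caches` on Linux:

```cpp
#include <cstdlib>
#include <fstream>
#include <unistd.h>

// Hypothetical sketch of a page-cache purge; the benchmark's own purge_cache
// may differ. Both paths may require elevated privileges.
void drop_page_cache()
{
    #if defined(__APPLE__)
        std::system("purge");                       // OS X ships a purge(8) utility
    #else
        ::sync();                                   // flush dirty pages first
        std::ofstream ctl{"/proc/sys/vm/drop_caches"};
        ctl << "3\n";                               // drop page cache, dentries, and inodes
    #endif
}
```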
Methods
The IO methods can be segregated into two broad categories: each IO method either uses a buffer to read or write one chunk at a time within a loop, or uses a special function to accomplish the same task. Methods in the first class use one of three approaches: synchronous IO, POSIX AIO, or lock-free asynchronous IO using C++ atomics. Examples of IO methods in the second class are `mmap`, `splice`, and `sendfile`, with the latter two being Linux-specific. I did not use Linux AIO, because I found earlier that it scales very poorly. To my knowledge, Linux AIO only works reliably for raw block devices.
At first glance, it may not be obvious why asynchronous IO would help in these benchmarks. But all three benchmarks perform work that can be pipelined using a double-buffering scheme: the read benchmark counts the occurrences of `needle`, the write benchmark generates random bytes, and the copy benchmark must read a chunk of data before it can write it. Using asynchronous double-buffering, we can produce frame `n + 1` while consuming frame `n`. The results show that, even for the simple tasks performed by the benchmarks, this double-buffering scheme often yields a speedup over synchronous IO.
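As a minimal sketch of the idea (using `std::async` here instead of the POSIX AIO or atomics-based schemes the benchmarks actually implement; `read_chunk` and `count_needle` are hypothetical names), the read benchmark's counting task could be double-buffered like this:

```cpp
#include <algorithm>
#include <cstdint>
#include <future>
#include <vector>
#include <unistd.h>

// Hypothetical helper: read up to n bytes into buf, returning the byte count.
static ssize_t read_chunk(int fd, uint8_t* buf, size_t n)
{
    return ::read(fd, buf, n);
}

// Double-buffered needle count: read chunk n + 1 while scanning chunk n.
size_t count_needle(int fd, uint8_t needle, size_t buf_size)
{
    std::vector<uint8_t> bufs[2]{
        std::vector<uint8_t>(buf_size), std::vector<uint8_t>(buf_size)
    };
    size_t count = 0;
    int cur = 0;
    ssize_t n = read_chunk(fd, bufs[cur].data(), buf_size);

    while (n > 0) {
        // Produce the next chunk asynchronously...
        auto next = std::async(std::launch::async, read_chunk, fd,
            bufs[1 - cur].data(), buf_size);
        // ...while consuming the current one.
        count += std::count(bufs[cur].begin(), bufs[cur].begin() + n, needle);
        n = next.get();
        cur = 1 - cur;
    }
    return count;
}
```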
Each IO method also uses zero or more of the following optimizations (which, in many cases, turn out to be pessimizations): direct IO, read ahead advice to the OS (for reading and copying), and preallocation (for writing and copying). Sometimes, there is more than one way to perform a particular optimization on the target platform. In such cases, all possible combinations of the available choices were tried. I omitted reporting the combinations that generally yielded slower execution times across the entire range of file sizes.
Read Optimizations
On OS X, one can disable caching using `fcntl` with the `F_NOCACHE` flag. On Linux, this is done by opening the file with the `O_DIRECT` flag. Further information I found online suggested that the IO buffer address and length should both be multiples of the page size, and that the request length should be a multiple of the file system's block size. I used `posix_memalign` to accommodate these constraints.
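A hedged sketch of opening a file for uncached reads under these constraints might look like the following; the hard-coded 4096-byte alignment and the `open_uncached` helper are assumptions for illustration, and real code should query the page and block sizes:

```cpp
#include <cstdlib>
#include <fcntl.h>

// Allocate a page-aligned buffer and open `path` with caching disabled.
// Returns the file descriptor, or -1 on failure.
int open_uncached(const char* path, size_t buf_size, void** buf)
{
    // Buffer address (and, by choice of buf_size, length) aligned to the page size.
    if (::posix_memalign(buf, 4096, buf_size) != 0) { return -1; }
    #if defined(__APPLE__)
        int fd = ::open(path, O_RDONLY);
        if (fd != -1) { ::fcntl(fd, F_NOCACHE, 1); }
    #else
        // O_DIRECT also requires each request length to respect the block size.
        int fd = ::open(path, O_RDONLY | O_DIRECT);
    #endif
    return fd;
}
```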
OS X offers two different kinds of read ahead optimizations via `fcntl`: `F_RDAHEAD` and `F_RDADVISE`. Based on the available documentation, I gathered that `F_RDADVISE` and `F_RDAHEAD` are analogous to the `FADV_WILLNEED` and `FADV_SEQUENTIAL` flags on Linux, respectively. The `FADV_WILLNEED` flag initiates a non-blocking read of the specified region into the page cache; experiments showed that `F_RDADVISE` has a similar effect on OS X. On Linux, the `FADV_SEQUENTIAL` flag doubles the size of the read ahead buffer. Presumably, the `F_RDAHEAD` flag does something similar on OS X, but the manual is sparse on details.
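To make the correspondence concrete, here is a rough sketch of issuing both kinds of advice for an entire file; the helper name is made up, and the `radvisory` arguments should be checked against your system's headers:

```cpp
#include <fcntl.h>
#include <unistd.h>

// Advise the kernel that `fd` will be read sequentially, and ask it to start
// pulling the whole file into the page cache.
void advise_sequential(int fd, off_t file_size)
{
    #if defined(__APPLE__)
        ::fcntl(fd, F_RDAHEAD, 1);                      // enable read ahead
        struct radvisory adv{0, static_cast<int>(file_size)};
        ::fcntl(fd, F_RDADVISE, &adv);                  // non-blocking prefetch
    #else
        ::posix_fadvise(fd, 0, file_size, POSIX_FADV_SEQUENTIAL);
        ::posix_fadvise(fd, 0, file_size, POSIX_FADV_WILLNEED);
    #endif
}
```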
Write Optimizations
On both OS X and Linux, one can preallocate a file before writing to it in order to accelerate IO. As with the other optimizations, OS X exposes this interface via `fcntl` and the `F_PREALLOCATE` flag. Linux has the `fallocate` and `ftruncate` functions, which do slightly different things. Using the `posix_fallocate` function on Linux is inadvisable, because glibc emulates the behavior (very inefficiently!) even if the underlying file system does not support the operation. The `fallocate` function fails with `EOPNOTSUPP` in this case, so the programmer has the option of falling back to other approaches. Finally, this Mozilla blog post warns that on OS X, `F_PREALLOCATE` needs to be followed by a truncate in order to force the data to be written to the file. I implemented both approaches in the benchmarks.
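Below is a hedged sketch of the two preallocation paths, assuming the destination file is already open for writing; the `fstore_t` flags follow the usual examples for `F_PREALLOCATE`, and the helper is illustrative rather than the benchmark's code:

```cpp
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Try to preallocate `size` bytes for `fd` before writing.
void preallocate(int fd, off_t size)
{
    #if defined(__APPLE__)
        fstore_t store{F_ALLOCATECONTIG, F_PEOFPOSMODE, 0, size, 0};
        if (::fcntl(fd, F_PREALLOCATE, &store) == -1) {
            // Contiguous allocation failed; retry without the contiguity hint.
            store.fst_flags = F_ALLOCATEALL;
            ::fcntl(fd, F_PREALLOCATE, &store);
        }
        // As the blog post above warns, follow up with a truncate so the new
        // length is actually reflected in the file.
        ::ftruncate(fd, size);
    #else
        if (::fallocate(fd, 0, 0, size) == -1 && errno == EOPNOTSUPP) {
            // File system does not support fallocate; fall back to ftruncate.
            ::ftruncate(fd, size);
        }
    #endif
}
```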
Copy Optimizations
OS X and Linux both allow you to copy files using a simple read-and-write loop or `mmap`. Linux also has the `splice` and `sendfile` functions, which avoid copying data from kernel buffers to userspace buffers.
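For example, a file-to-file copy with `splice` moves the data through a kernel pipe rather than a userspace buffer. The sketch below is illustrative (not the benchmark's `copy_splice`), and it omits cleanup of the pipe on the error paths:

```cpp
#include <fcntl.h>
#include <unistd.h>

// Copy one file descriptor to another using splice; the data moves through
// a pipe entirely inside the kernel. chunk_size is the amount requested per call.
bool splice_copy(int src, int dst, size_t chunk_size)
{
    int pipefd[2];
    if (::pipe(pipefd) == -1) { return false; }

    for (;;) {
        // Move a chunk from the source file into the pipe...
        ssize_t n = ::splice(src, nullptr, pipefd[1], nullptr, chunk_size, SPLICE_F_MOVE);
        if (n == 0) { break; }
        if (n < 0) { return false; }
        // ...and then drain the pipe into the destination file.
        for (ssize_t left = n; left > 0; ) {
            ssize_t m = ::splice(pipefd[0], nullptr, dst, nullptr, left, SPLICE_F_MOVE);
            if (m <= 0) { return false; }
            left -= m;
        }
    }
    ::close(pipefd[0]);
    ::close(pipefd[1]);
    return true;
}
```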
Results
Before I discuss the results for the two systems on which I ran the benchmarks, here is the exhaustive list of read methods that I tried. All buffer-based IO methods were evaluated using each of the following buffer sizes: 4 KB, 8 KB, 12 KB, 16 KB, 24 KB, 32 KB, 40 KB, 48 KB, 56 KB, 64 KB, 256 KB, 1024 KB, 4096 KB, 16384 KB, 65536 KB, and 262144 KB.
// OS X:
read_plain
read_nocache
read_rdahead
read_rdadvise
read_async_nocache
read_async_rdahead
read_async_rdadvise
read_mmap_plain
read_mmap_nocache
read_mmap_rdahead
read_mmap_rdadvise
// Linux:
read_plain
read_direct
read_fadvise
read_async_plain
read_async_direct
read_async_fadvise
read_mmap_plain
read_mmap_direct
read_mmap_fadvise
Here is the list of write methods:
// OS X:
write_plain
write_nocache
write_preallocate
write_preallocate_truncate
write_preallocate_truncate_nocache
async_write_preallocate_truncate_nocache
write_mmap
// Linux:
write_plain
write_direct
write_preallocate
write_truncate
write_direct_preallocate
write_direct_truncate
write_async_plain
write_async_direct
write_async_preallocate
write_async_truncate
write_async_direct_preallocate
write_async_direct_truncate
mmap_preallocate
mmap_preallocate_direct
mmap_truncate_direct
Finally, here is a list of the copy methods. Note that I avoided testing many combinations of optimizations that I expected to perform poorly based on the read and write results.
// OS X:
copy_plain
copy_nocache
copy_rdahead_preallocate
copy_rdadvise_preallocate
copy_mmap_nocache_plain
copy_mmap_nocache_nocache
// Linux:
copy_plain
copy_direct
copy_preallocate
copy_mmap_plain
copy_mmap_nocache
copy_mmap_fadvise
copy_splice
copy_splice_preallocate
copy_splice_preallocate_fadvise
copy_splice_fadvise
copy_sendfile
copy_sendfile_preallocate
copy_sendfile_preallocate_fadvise
copy_sendfile_fadvise
The benchmarks were run on two systems: a mid-2012 Macbook Pro with an SSD, running OS X 10.9, and a Linux server with a PCIe SSD, running Arch Linux with kernel version 3.14.4. The Macbook Pro was formatted to HFS+, while the Linux server was using ext4 for the partition on which the benchmark was run.
Results: Macbook Pro
Below, I have tabulated what I found to be the best schemes for IO on the Macbook Pro. The data from which these conclusions were drawn can be found here.
Task | IO Type | File Size | Method | Buffer Size
---|---|---|---|---
Reading | Synchronous | < 256 MB | `read_rdadvise` | 4 KB — 1024 KB
Reading | Synchronous | ≥ 256 MB | `read_rdahead` | ≥ 4096 KB
Reading | Asynchronous | < 128 MB | `read_async_rdadvise` | Based on workload
Reading | Asynchronous | ≥ 128 MB | `read_async_rdahead` | Based on workload
Writing | Any | Any | `write_preallocate_truncate_nocache` | ≥ 1024 KB
Copying | Any | Any | `copy_mmap` | N/A
For reading, there is a point after which it makes sense to stop using `F_RDADVISE` and instead use `F_RDAHEAD`. It would be nice to find out where this point occurs without actually running the benchmark on the machine. For writing, the methods tabulated above performed the fastest for all file sizes tested. For any given file size, the best asynchronous IO method was generally faster than the best synchronous IO method. This shows that, despite the modest synchronization overhead, it is still worth it to exploit parallelism where possible.
For reading and writing, the choice of buffer size does make a modest difference in performance. It may be worth the time to try out a range of buffer sizes if you know in advance that your program will only run on a particular platform. The difference is much more dramatic when `F_NOCACHE` is used for reading, but this approach did not yield good results anyway. For copying, the fastest method in all cases was a simple one that used `mmap` twice and `std::copy` to transfer the data.
Results: Linux Server
Below, I have tabulated what I found to be the best schemes for IO on the Linux server. The data from which these conclusions were drawn can be found here.
Task | IO Type | File Size | Method | Buffer Size
---|---|---|---|---
Reading | Any | Any | `read_fadvise` | 40 KB — 256 KB
Writing | Any | Any | `write_preallocate` | ≥ 4096 KB
Copying | Any | Any | `copy_sendfile` | N/A
The fastest methods for reading and writing on Linux used the same kinds of optimizations that were used by the fastest methods on OS X. The methods tabulated above were also the fastest across all file sizes. This is encouraging, since it suggests that providing read advice and preallocating space before writes are effective strategies on both kernels. One issue is that the `fallocate` function is not supported on all file systems (but it is supported by XFS, ext4, Btrfs, and tmpfs). One possible fallback is to use `ftruncate` in the event that `fallocate` is unsupported.
Interestingly, the implementation of `mmap` on Linux is much more competitive than the one on OS X: although `mmap` was never the fastest IO method, it was always only slightly slower than the fastest. Finally, `splice` and `sendfile` were the clear winners for copying files. I preferred `sendfile` to `splice`, since its usage is much simpler and the programmer does not have to choose a buffer size.
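A minimal sketch of a sendfile-based copy (assuming a kernel recent enough that the destination may be a regular file, and eliding most error handling):

```cpp
#include <sys/sendfile.h>
#include <sys/stat.h>

// Copy the whole of `src` to `dst` with sendfile; no userspace buffer, and
// no buffer size to choose.
bool sendfile_copy(int src, int dst)
{
    struct stat st;
    if (::fstat(src, &st) == -1) { return false; }

    off_t off = 0;
    while (off < st.st_size) {
        // A single call transfers at most about 2 GB, so loop until done.
        ssize_t n = ::sendfile(dst, src, &off, st.st_size - off);
        if (n <= 0) { return false; }
    }
    return true;
}
```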
Conclusion
Giving the kernel IO advice and preallocating space before writing were effective strategies on both platforms. On each platform, there was also a clear choice for the best method to use for copying files. One would expect the same methods to work well across platforms with different file systems, for the same reasons that these methods were effective on the platforms on which I ran the benchmarks. Benchmark results on other systems to back up this speculation would be useful.
One problem is that the best buffer size to use for each IO task will likely vary based on the type of storage device, file system, and so on. If more people run the benchmark on different systems and publish their results using the survey, then perhaps more definitive recommendations can be made in this regard. Until this happens, my recommendation is to iterate across a range of buffer sizes in order to determine which ones work best.