I work on machine learning projects that involve serializing and deserializing large amounts of data. The IO can often be the chief bottleneck in the programs that I write. Naturally, I wanted to determine the fastest methods for sequentially reading, writing, and copying files on OS X and Linux. This series of benchmarks was written towards this end.
This page briefly describes the methodology, and my observations based on the results from the two machines on which I ran the benchmarks. I also quickly threw up a Google form so people can run the benchmarks on different systems and share their results. The form submissions are publicly viewable.
Before proceeding, I would like to warn you that I am by no means a systems expert. I have tried to be cautious when writing the benchmarks, but it is very possible that I got a few details wrong, or missed some others entirely. Feel free to send me any comments or suggestions you have about the project.
All of the benchmarks rely on the test files generated by tools/make_data.rb. The implementations of the benchmarks can be found in the src directory. Some common IO wrapper functions, along with a copy of the CCBase headers, are found in the include directory.
Each benchmark tests a series of IO methods that perform the same task. Each IO method is evaluated num_trials times for each file size in the range defined in include/configuration.hpp, where num_trials is also defined. For each IO method, the test function in include/test.hpp prints out the following:
- The file size, in megabytes.
- The name of the IO method used.
- The mean completion time, in milliseconds.
- The standard deviation, in milliseconds.
You can access the raw results for the two machines on which I ran the benchmarks here.
Each method used in the read benchmark counts the occurrences of needle (defined in include/configuration.hpp) as a check for correctness. To mitigate
the effects of caching, the test function also purges the page cache after each
method evaluation (see
include/io_common.hpp for how this is done). The write benchmark
writes a specified number of randomly-generated bytes to a destination file. The
last benchmark, which tests copying, does not need much explanation.
The IO methods can be divided into two broad categories: those that use a buffer to read or write one chunk at a time within a loop, and those that use a special function to accomplish the same task. Methods in the first class fall into one of three subcategories: synchronous IO, POSIX AIO, or lock-free asynchronous IO using C++ atomics. Examples of IO methods in the second class include mmap, splice, and sendfile, with the latter two being Linux-specific. I did not use Linux AIO, because I found earlier that it scales very poorly. To my knowledge, Linux AIO only works reliably for raw block devices.
At first glance, it may not be obvious why one would benefit from using asynchronous IO for these benchmarks. But all three benchmarks perform work that can be pipelined using a double-buffering scheme: the read benchmark counts the occurrences of needle, the write benchmark generates random bytes, and the copy benchmark must read a chunk of data before it can write it. Using asynchronous double-buffering, we can produce frame n + 1 while consuming frame n. The results show that, even for the simple tasks performed by the benchmarks, this double-buffering scheme often results in a speedup over synchronous IO.
Each IO method also uses zero or more of the following optimizations (which, in many cases, turn out to be pessimizations): direct IO, read ahead advice to the OS (for reading and copying), and preallocation (for writing and copying). Sometimes, there is more than one way to perform a particular optimization on the target platform. In such cases, all possible combinations of the available choices were tried. I omitted reporting the combinations that generally yielded slower execution times across the entire range of file sizes.
On OS X, one can disable caching by calling fcntl with the F_NOCACHE flag. On Linux, this is done by opening the file with the O_DIRECT flag. Further information I found online suggested that the IO buffer address and length should both be multiples of the page size, and that the request length should be a multiple of the file system's block size. I used posix_memalign to accommodate these constraints.
OS X offers two different kinds of read ahead optimizations via fcntl: F_RDADVISE and F_RDAHEAD. Based on available documentation, I gathered that these are analogous to the FADV_WILLNEED and FADV_SEQUENTIAL flags on Linux, respectively. The FADV_WILLNEED flag initiates a non-blocking read of the specified region into the page cache. Experiment has shown that F_RDADVISE has a similar effect on OS X. On Linux, the FADV_SEQUENTIAL flag doubles the size of the read ahead buffer. Presumably, the F_RDAHEAD flag does something similar on OS X, but the manual was sparse on details.
On both OS X and Linux, one can preallocate a file before writing to it, in order to accelerate IO. As with the other optimizations, OS X exposes this functionality through fcntl and the F_PREALLOCATE flag. Linux provides the fallocate and ftruncate functions, which do slightly different things. Using the posix_fallocate function on Linux is inadvisable, because glibc emulates the behavior (very inefficiently!) even if the underlying file system does not support the operation. The fallocate function fails with EOPNOTSUPP in this case, so the programmer has the option of falling back to other approaches. Finally, this Mozilla blog post warns that on OS X, files preallocated with F_PREALLOCATE need to be extended with truncate in order to force the data to be written to the file. I implemented both approaches in the benchmarks.
OS X and Linux both allow you to copy files using a simple read-and-write loop, or by using mmap. Linux also has the splice and sendfile functions, which avoid copying data from kernel buffers to user space.
Before I discuss the results for the two systems on which I ran the benchmarks, here is the exhaustive list of read methods that I tried. All buffer-based IO methods were evaluated using each of the following buffer sizes: 4 KB, 8 KB, 12 KB, 16 KB, 24 KB, 32 KB, 40 KB, 48 KB, 56 KB, 64 KB, 256 KB, 1024 KB, 4096 KB, 16384 KB, 65536 KB, and 262144 KB.
// OS X:
read_plain read_nocache read_rdahead read_rdadvise read_async_nocache read_async_rdahead read_async_rdadvise read_mmap_plain read_mmap_nocache read_mmap_rdahead read_mmap_rdadvise
// Linux:
read_plain read_direct read_fadvise read_async_plain read_async_direct read_async_fadvise read_mmap_plain read_mmap_direct read_mmap_fadvise
Here is the list of write methods:
// OS X:
write_plain write_nocache write_preallocate write_preallocate_truncate write_preallocate_truncate_nocache async_write_preallocate_truncate_nocache write_mmap
// Linux:
write_plain write_direct write_preallocate write_truncate write_direct_preallocate write_direct_truncate write_async_plain write_async_direct write_async_preallocate write_async_truncate write_async_direct_preallocate write_async_direct_truncate mmap_preallocate mmap_preallocate_direct mmap_truncate_direct
Finally, here is a list of the copy methods. Note that I avoided testing many combinations of optimizations that I expected to perform poorly based on the read and write results.
// OS X:
copy_plain copy_nocache copy_rdahead_preallocate copy_rdadvise_preallocate copy_mmap_nocache_plain copy_mmap_nocache_nocache
// Linux:
copy_plain copy_direct copy_preallocate copy_mmap_plain copy_mmap_nocache copy_mmap_fadvise copy_splice copy_splice_preallocate copy_splice_preallocate_fadvise copy_splice_fadvise copy_sendfile copy_sendfile_preallocate copy_sendfile_preallocate_fadvise copy_sendfile_fadvise
The benchmarks were run on two systems: a mid-2012 Macbook Pro with an SSD, running OS X 10.9, and a Linux server with a PCIe SSD, running Arch Linux with kernel version 3.14.4. The Macbook Pro was formatted to HFS+, while the Linux server was using ext4 for the partition on which the benchmark was run.
Below, I have tabulated what I found to be the best schemes for IO on the Macbook Pro. The data from which these conclusions were drawn can be found here.
|Task||IO Type||File Size||Method||Buffer Size|
|Reading||Synchronous||< 256 MB|| ||4 KB — 1024 KB|
|Reading||Synchronous||≥ 256 MB|| ||≥ 4096 KB|
|Reading||Asynchronous||< 128 MB|| ||Based on workload.|
|Reading||Asynchronous||≥ 128 MB|| ||Based on workload.|
| || || || ||≥ 1024 KB|
For reading, there is a point after which it makes sense to stop using
F_RDADVISE and instead use
F_RDAHEAD. It would be nice
to find out where this point occurs without actually running the benchmark on
the machine. For writing, the methods tabulated above performed the fastest for all
file sizes tested. For any given file size, the best asynchronous IO method was
generally faster than the best synchronous IO method. This shows that despite
the modest synchronization overhead, it is still worth it to exploit parallelism.
For reading and writing, the choice of buffer size does make a modest difference in performance. It may be worth the time to try out a range of buffer sizes if you know in advance that your program will only run on a particular platform. The difference is much more dramatic when mmap is used for reading, but this approach did not yield good results anyway. For copying, the fastest method in all cases was a simple one that used mmap twice and std::copy to transfer the data.
Below, I have tabulated what I found to be the best schemes for IO on the Linux server. The data from which these conclusions were drawn can be found here.
|Task||IO Type||File Size||Method||Buffer Size|
||40 KB — 256 KB|
||≥ 4096 KB|
The fastest methods for reading and writing on Linux used the same kinds of optimizations that were used by the fastest methods on OS X. The methods tabulated above were also the fastest across all file sizes. This is encouraging, since it suggests that providing read advice and preallocating space before writes are effective strategies on both kernels. One issue is that the fallocate function is not supported on all filesystems (though it is supported by XFS, ext4, Btrfs, and tmpfs). One possible fallback is to use ftruncate in the event that fallocate fails with EOPNOTSUPP. Interestingly, the implementation of mmap on Linux is much more competitive than the one on OS X. Although mmap was never the fastest IO method, it was always only slightly slower than the best method. Finally, splice and sendfile were the clear winners for copying files. I preferred sendfile to splice, since its usage is much simpler, and the programmer does not have to choose a buffer size.
Giving the kernel IO advice and preallocating space before writing were effective strategies on both platforms. On each platform, there was also a clear choice for the best method to use for copying files. One would expect the same methods to work well across platforms with different file systems, for the same reasons that these methods were effective on the platforms on which I ran the benchmarks. Benchmark results on other systems to back up this speculation would be useful.
One problem is that the best buffer size to use for each IO task will likely vary based on the type of storage device, file system, and so on. If more people run the benchmark on different systems and publish their results using the survey, then perhaps more definitive recommendations can be made in this regard. Until this happens, my recommendation is to iterate across a range of buffer sizes in order to determine which ones work best.