Motivation
I work on machine learning projects that involve serializing and deserializing large amounts of data. The IO can often be the chief bottleneck in the programs that I write. Naturally, I wanted to determine the fastest methods for sequentially reading, writing, and copying files on OS X and Linux. This series of benchmarks was written towards this end.
This page briefly describes the methodology, and my observations based on the results from the two machines on which I ran the benchmarks. I also quickly threw up a Google form so people can run the benchmarks on different systems and share their results. The form submissions are publicly viewable.
Before proceeding, I would like to warn you that I am by no means a systems expert. I have tried to be cautious when writing the benchmarks, but it is very possible that I got a few details wrong, or missed some others entirely. Feel free to send me any comments or suggestions you have about the project.
Overview
All of the benchmarks rely on the test files generated by `tools/make_data.rb`. The implementations of the benchmarks can be found in the `src` directory. Some common IO wrapper functions, and a copy of the CCBase headers, are found in the `include` directory.
Each benchmark tests a series of IO methods that perform the same task. Each IO method is evaluated `num_trials` times for each file size in the range defined in `tools/test_read.sh` or `tools/test_write.sh`, where `num_trials` is defined in `include/configuration.hpp`. For each IO method, the test function defined in `include/test.hpp` prints out the following information:
- The file size, in megabytes.
- The name of the IO method used.
- The mean completion time, in milliseconds.
- The standard deviation, in milliseconds.
You can access the raw results for the two machines on which I ran the benchmarks here.
Each method used in the read benchmark counts the occurrences of `needle` (set to `0xFF` in `include/configuration.hpp`) as a check for correctness. To mitigate the effects of caching, the test function also purges the page cache after each method evaluation (see `purge_cache` in `include/io_common.hpp` for how this is done). The write benchmark writes a specified number of randomly-generated bytes to a destination file. The last benchmark, which tests copying, does not need much explanation.
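The actual purge logic lives in `include/io_common.hpp`; as a hedged sketch of what such a purge typically looks like (this is an illustration under my assumptions, not necessarily what `purge_cache` does), one can invoke the `purge(8)` utility on OS X or write to `/proc/sys/vm/drop_caches` on Linux:

```cpp
#include <cstdlib>
#include <fstream>
#include <unistd.h>

// Hypothetical sketch of a page-cache purge; the benchmark's own purge_cache
// may differ. Both paths may require elevated privileges.
void drop_page_cache()
{
    #if defined(__APPLE__)
        std::system("purge");                       // OS X ships a purge(8) utility
    #else
        ::sync();                                   // flush dirty pages first
        std::ofstream ctl{"/proc/sys/vm/drop_caches"};
        ctl << "3\n";                               // drop page cache, dentries, and inodes
    #endif
}
```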
Methods
The IO methods can be segregated into two broad categories: each IO method either uses a buffer to read or write one chunk at a time within a loop, or uses a special function to accomplish the same task. Methods in the first class use one of three approaches: synchronous IO, POSIX AIO, or lock-free asynchronous IO using C++ atomics. Examples of IO methods in the second class are `mmap`, `splice`, and `sendfile`, with the latter two being Linux-specific. I did not use Linux AIO, because I found earlier that it scales very poorly. To my knowledge, Linux AIO only works reliably for raw block devices.
At first glance, it may not be obvious why asynchronous IO would help in these benchmarks. But all three benchmarks perform work that can be pipelined using a double-buffering scheme: the read benchmark counts the occurrences of `needle`, the write benchmark generates random bytes, and the copy benchmark must read a chunk of data before it can write it. Using asynchronous double-buffering, we can produce frame `n + 1` while consuming frame `n`. The results show that, even for the simple tasks performed by the benchmarks, this double-buffering scheme often yields a speedup over synchronous IO.
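As a minimal sketch of the idea (using `std::async` here instead of the POSIX AIO or atomics-based schemes the benchmarks actually implement; `read_chunk` and `count_needle` are hypothetical names), the read benchmark's counting task could be double-buffered like this:

```cpp
#include <algorithm>
#include <cstdint>
#include <future>
#include <vector>
#include <unistd.h>

// Hypothetical helper: read up to n bytes into buf, returning the byte count.
static ssize_t read_chunk(int fd, uint8_t* buf, size_t n)
{
    return ::read(fd, buf, n);
}

// Double-buffered needle count: read chunk n + 1 while scanning chunk n.
size_t count_needle(int fd, uint8_t needle, size_t buf_size)
{
    std::vector<uint8_t> bufs[2]{
        std::vector<uint8_t>(buf_size), std::vector<uint8_t>(buf_size)
    };
    size_t count = 0;
    int cur = 0;
    ssize_t n = read_chunk(fd, bufs[cur].data(), buf_size);

    while (n > 0) {
        // Produce the next chunk asynchronously...
        auto next = std::async(std::launch::async, read_chunk, fd,
            bufs[1 - cur].data(), buf_size);
        // ...while consuming the current one.
        count += std::count(bufs[cur].begin(), bufs[cur].begin() + n, needle);
        n = next.get();
        cur = 1 - cur;
    }
    return count;
}
```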
Each IO method also uses zero or more of the following optimizations (which, in many cases, turn out to be pessimizations): direct IO, read ahead advice to the OS (for reading and copying), and preallocation (for writing and copying). Sometimes, there is more than one way to perform a particular optimization on the target platform. In such cases, all possible combinations of the available choices were tried. I omitted reporting the combinations that generally yielded slower execution times across the entire range of file sizes.
Read Optimizations
On OS X, one can disable caching using `fcntl` with the `F_NOCACHE` flag. On Linux, this is done by opening the file with the `O_DIRECT` flag. Further information I found online suggested that the IO buffer address and length should both be multiples of the page size, and that the request length should be a multiple of the file system's block size. I used `posix_memalign` to accommodate these constraints.
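A hedged sketch of opening a file for uncached reads under these constraints might look like the following; the hard-coded 4096-byte alignment and the `open_uncached` helper are assumptions for illustration, and real code should query the page and block sizes:

```cpp
#include <cstdlib>
#include <fcntl.h>

// Allocate a page-aligned buffer and open `path` with caching disabled.
// Returns the file descriptor, or -1 on failure.
int open_uncached(const char* path, size_t buf_size, void** buf)
{
    // Buffer address (and, by choice of buf_size, length) aligned to the page size.
    if (::posix_memalign(buf, 4096, buf_size) != 0) { return -1; }
    #if defined(__APPLE__)
        int fd = ::open(path, O_RDONLY);
        if (fd != -1) { ::fcntl(fd, F_NOCACHE, 1); }
    #else
        // O_DIRECT also requires each request length to respect the block size.
        int fd = ::open(path, O_RDONLY | O_DIRECT);
    #endif
    return fd;
}
```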
OS X offers two different kinds of read ahead optimizations via `fcntl`: `F_RDAHEAD` and `F_RDADVISE`. Based on the available documentation, I gathered that `F_RDADVISE` and `F_RDAHEAD` are analogous to the `FADV_WILLNEED` and `FADV_SEQUENTIAL` flags on Linux, respectively. The `FADV_WILLNEED` flag initiates a non-blocking read of the specified region into the page cache; experiments showed that `F_RDADVISE` has a similar effect on OS X. On Linux, the `FADV_SEQUENTIAL` flag doubles the size of the read ahead buffer. Presumably, the `F_RDAHEAD` flag does something similar on OS X, but the manual is sparse on details.
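To make the correspondence concrete, here is a rough sketch of issuing both kinds of advice for an entire file; the helper name is made up, and the `radvisory` arguments should be checked against your system's headers:

```cpp
#include <fcntl.h>
#include <unistd.h>

// Advise the kernel that `fd` will be read sequentially, and ask it to start
// pulling the whole file into the page cache.
void advise_sequential(int fd, off_t file_size)
{
    #if defined(__APPLE__)
        ::fcntl(fd, F_RDAHEAD, 1);                      // enable read ahead
        struct radvisory adv{0, static_cast<int>(file_size)};
        ::fcntl(fd, F_RDADVISE, &adv);                  // non-blocking prefetch
    #else
        ::posix_fadvise(fd, 0, file_size, POSIX_FADV_SEQUENTIAL);
        ::posix_fadvise(fd, 0, file_size, POSIX_FADV_WILLNEED);
    #endif
}
```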
Write Optimizations
On both OS X and Linux, one can preallocate a file before writing to it in order to accelerate IO. As with the other optimizations, OS X exposes this interface via `fcntl` and the `F_PREALLOCATE` flag. Linux has the `fallocate` and `ftruncate` functions, which do slightly different things. Using the `posix_fallocate` function on Linux is inadvisable, because glibc emulates the behavior (very inefficiently!) even if the underlying file system does not support the operation. The `fallocate` function fails with `EOPNOTSUPP` in this case, so the programmer has the option of falling back to other approaches. Finally, this Mozilla blog post warns that on OS X, `F_PREALLOCATE` needs to be followed by a truncate in order to force the data to be written to the file. I implemented both approaches in the benchmarks.
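Below is a hedged sketch of the two preallocation paths, assuming the destination file is already open for writing; the `fstore_t` flags follow the usual examples for `F_PREALLOCATE`, and the helper is illustrative rather than the benchmark's code:

```cpp
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Try to preallocate `size` bytes for `fd` before writing.
void preallocate(int fd, off_t size)
{
    #if defined(__APPLE__)
        fstore_t store{F_ALLOCATECONTIG, F_PEOFPOSMODE, 0, size, 0};
        if (::fcntl(fd, F_PREALLOCATE, &store) == -1) {
            // Contiguous allocation failed; retry without the contiguity hint.
            store.fst_flags = F_ALLOCATEALL;
            ::fcntl(fd, F_PREALLOCATE, &store);
        }
        // As the blog post above warns, follow up with a truncate so the new
        // length is actually reflected in the file.
        ::ftruncate(fd, size);
    #else
        if (::fallocate(fd, 0, 0, size) == -1 && errno == EOPNOTSUPP) {
            // File system does not support fallocate; fall back to ftruncate.
            ::ftruncate(fd, size);
        }
    #endif
}
```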
Copy Optimizations
OS X and Linux both allow you to copy files using a simple read-and-write loop or `mmap`. Linux also has the `splice` and `sendfile` functions, which avoid copying data from kernel buffers to userspace buffers.
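For example, a file-to-file copy with `splice` moves the data through a kernel pipe rather than a userspace buffer. The sketch below is illustrative (not the benchmark's `copy_splice`), and it omits cleanup of the pipe on the error paths:

```cpp
#include <fcntl.h>
#include <unistd.h>

// Copy one file descriptor to another using splice; the data moves through
// a pipe entirely inside the kernel. chunk_size is the amount requested per call.
bool splice_copy(int src, int dst, size_t chunk_size)
{
    int pipefd[2];
    if (::pipe(pipefd) == -1) { return false; }

    for (;;) {
        // Move a chunk from the source file into the pipe...
        ssize_t n = ::splice(src, nullptr, pipefd[1], nullptr, chunk_size, SPLICE_F_MOVE);
        if (n == 0) { break; }
        if (n < 0) { return false; }
        // ...and then drain the pipe into the destination file.
        for (ssize_t left = n; left > 0; ) {
            ssize_t m = ::splice(pipefd[0], nullptr, dst, nullptr, left, SPLICE_F_MOVE);
            if (m <= 0) { return false; }
            left -= m;
        }
    }
    ::close(pipefd[0]);
    ::close(pipefd[1]);
    return true;
}
```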
Results
Before I discuss the results for the two systems on which I ran the benchmarks, here is the exhaustive list of read methods that I tried. All buffer-based IO methods were evaluated using each of the following buffer sizes: 4 KB, 8 KB, 12 KB, 16 KB, 24 KB, 32 KB, 40 KB, 48 KB, 56 KB, 64 KB, 256 KB, 1024 KB, 4096 KB, 16384 KB, 65536 KB, and 262144 KB.
// OS X:
read_plain
read_nocache
read_rdahead
read_rdadvise
read_async_nocache
read_async_rdahead
read_async_rdadvise
read_mmap_plain
read_mmap_nocache
read_mmap_rdahead
read_mmap_rdadvise
// Linux:
read_plain
read_direct
read_fadvise
read_async_plain
read_async_direct
read_async_fadvise
read_mmap_plain
read_mmap_direct
read_mmap_fadvise
Here is the list of write methods:
// OS X:
write_plain
write_nocache
write_preallocate
write_preallocate_truncate
write_preallocate_truncate_nocache
async_write_preallocate_truncate_nocache
write_mmap
// Linux:
write_plain
write_direct
write_preallocate
write_truncate
write_direct_preallocate
write_direct_truncate
write_async_plain
write_async_direct
write_async_preallocate
write_async_truncate
write_async_direct_preallocate
write_async_direct_truncate
mmap_preallocate
mmap_preallocate_direct
mmap_truncate_direct
Finally, here is a list of the copy methods. Note that I avoided testing many combinations of optimizations that I expected to perform poorly based on the read and write results.
// OS X:
copy_plain
copy_nocache
copy_rdahead_preallocate
copy_rdadvise_preallocate
copy_mmap_nocache_plain
copy_mmap_nocache_nocache
// Linux:
copy_plain
copy_direct
copy_preallocate
copy_mmap_plain
copy_mmap_nocache
copy_mmap_fadvise
copy_splice
copy_splice_preallocate
copy_splice_preallocate_fadvise
copy_splice_fadvise
copy_sendfile
copy_sendfile_preallocate
copy_sendfile_preallocate_fadvise
copy_sendfile_fadvise
The benchmarks were run on two systems: a mid-2012 Macbook Pro with an SSD, running OS X 10.9, and a Linux server with a PCIe SSD, running Arch Linux with kernel version 3.14.4. The Macbook Pro was formatted to HFS+, while the Linux server was using ext4 for the partition on which the benchmark was run.
Results: Macbook Pro
Below, I have tabulated what I found to be the best schemes for IO on the Macbook Pro. The data from which these conclusions were drawn can be found here.
Task | IO Type | File Size | Method | Buffer Size
---|---|---|---|---
Reading | Synchronous | < 256 MB | `read_rdadvise` | 4 KB — 1024 KB
Reading | Synchronous | ≥ 256 MB | `read_rdahead` | ≥ 4096 KB
Reading | Asynchronous | < 128 MB | `read_async_rdadvise` | Based on workload
Reading | Asynchronous | ≥ 128 MB | `read_async_rdahead` | Based on workload
Writing | Any | Any | `write_preallocate_truncate_nocache` | ≥ 1024 KB
Copying | Any | Any | `copy_mmap` | N/A
For reading, there is a point after which it makes sense to stop using `F_RDADVISE` and instead use `F_RDAHEAD`. It would be nice to find out where this point occurs without actually running the benchmark on the machine. For writing, the methods tabulated above performed the fastest for all file sizes tested. For any given file size, the best asynchronous IO method was generally faster than the best synchronous IO method. This shows that, despite the modest synchronization overhead, it is still worth it to exploit parallelism where possible.
For reading and writing, the choice of buffer size does make a modest difference in performance. It may be worth the time to try out a range of buffer sizes if you know in advance that your program will only run on a particular platform. The difference is much more dramatic when `F_NOCACHE` is used for reading, but this approach did not yield good results anyway. For copying, the fastest method in all cases was a simple one that used `mmap` twice and `std::copy` to transfer the data.
Results: Linux Server
Below, I have tabulated what I found to be the best schemes for IO on the Linux server. The data from which these conclusions were drawn can be found here.
Task | IO Type | File Size | Method | Buffer Size
---|---|---|---|---
Reading | Any | Any | `read_fadvise` | 40 KB — 256 KB
Writing | Any | Any | `write_preallocate` | ≥ 4096 KB
Copying | Any | Any | `copy_sendfile` | N/A
The fastest methods for reading and writing on Linux used the same kinds of optimizations that were used by the fastest methods on OS X. The methods tabulated above were also the fastest across all file sizes. This is encouraging, since it suggests that providing read advice and preallocating space before writes are effective strategies on both kernels. One issue is that the `fallocate` function is not supported on all file systems (but it is supported by XFS, ext4, Btrfs, and tmpfs). One possible fallback is to use `ftruncate` in the event that `fallocate` is unsupported.
Interestingly, the implementation of `mmap` on Linux is much more competitive than the one on OS X: although `mmap` was never the fastest IO method, it was always only slightly slower than the fastest. Finally, `splice` and `sendfile` were the clear winners for copying files. I preferred `sendfile` to `splice`, since its usage is much simpler and the programmer does not have to choose a buffer size.
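A minimal sketch of a sendfile-based copy (assuming a kernel recent enough that the destination may be a regular file, and eliding most error handling):

```cpp
#include <sys/sendfile.h>
#include <sys/stat.h>

// Copy the whole of `src` to `dst` with sendfile; no userspace buffer, and
// no buffer size to choose.
bool sendfile_copy(int src, int dst)
{
    struct stat st;
    if (::fstat(src, &st) == -1) { return false; }

    off_t off = 0;
    while (off < st.st_size) {
        // A single call transfers at most about 2 GB, so loop until done.
        ssize_t n = ::sendfile(dst, src, &off, st.st_size - off);
        if (n <= 0) { return false; }
    }
    return true;
}
```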
Conclusion
Giving the kernel IO advice and preallocating space before writing were effective strategies on both platforms. On each platform, there was also a clear choice for the best method to use for copying files. One would expect the same methods to work well across platforms with different file systems, for the same reasons that these methods were effective on the platforms on which I ran the benchmarks. Benchmark results on other systems to back up this speculation would be useful.
One problem is that the best buffer size to use for each IO task will likely vary based on the type of storage device, file system, and so on. If more people run the benchmark on different systems and publish their results using the survey, then perhaps more definitive recommendations can be made in this regard. Until this happens, my recommendation is to iterate across a range of buffer sizes in order to determine which ones work best.