.. _first-steps:

First Steps with IOR
====================

This is a short tutorial on the basic usage of IOR, along with some tips on how
to handle the caching effects that are very likely to affect your measurements.

Running IOR
-----------
There are two ways of running IOR:

1) Command line with arguments -- executable followed by command line options.

   .. code-block:: shell

      $ ./IOR -w -r -o filename

   This performs a write and a read to the file 'filename'.

2) Command line with scripts -- any arguments on the command line will
   establish the default for the test run, but a script may be used in
   conjunction with this for varying specific tests during an execution of
   the code. Only arguments before the script will be used!

   .. code-block:: shell

      $ ./IOR -W -f script

   This defaults all tests in 'script' to use write data checking.
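
An IOR script is a plain text file that sets options by name and runs one or
more tests. As a purely illustrative sketch (the exact set of recognized
keywords depends on your IOR version), a script roughly equivalent to the
command-line runs used later in this tutorial might look like this:

.. code-block:: none

   IOR START
       testFile=testfile
       blockSize=16m
       transferSize=1m
       segmentCount=16
   RUN
       filePerProc=1
   RUN
   IOR STOP

Each ``RUN`` executes one test with the options set so far, so the second test
here repeats the first, but with one file per process.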

In this tutorial the first option is used, as it is much easier to toy around
with and get to know IOR. The second option, though, is much more useful for
saving benchmark setups to rerun later or for testing many different cases.

Getting Started with IOR
------------------------

IOR writes data sequentially with the following parameters:

* ``blockSize`` (``-b``)
* ``transferSize`` (``-t``)
* ``segmentCount`` (``-s``)
* ``numTasks`` (``-N``)

which are best illustrated with a diagram:

.. image:: tutorial-ior-io-pattern.png
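
In other words, each of the ``numTasks`` processes writes ``segmentCount``
blocks of ``blockSize`` bytes, issuing its I/O in ``transferSize`` chunks, so
the total volume moved is ``numTasks × segmentCount × blockSize``. For example,
the 64-process runs used below move 64 × 16 × 16 MiB = 16 GiB in total.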

These four parameters are all you need to get started with IOR. However,
naively running IOR usually gives disappointing results. For example, if we run
a four-node IOR test that writes a total of 16 GiB:

.. code-block:: shell

   $ mpirun -n 64 ./ior -t 1m -b 16m -s 16
   ...
   access bw(MiB/s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
   ------ --------- ---------- --------- -------- -------- -------- -------- ----
   write  427.36    16384      1024.00   0.107961 38.34    32.48    38.34    2
   read   239.08    16384      1024.00   0.005789 68.53    65.53    68.53    2
   remove -         -          -         -        -        -        0.534400 2

we can only get a couple hundred megabytes per second out of a Lustre file
system that should be capable of a lot more.

Switching from writing to a single shared file to one file per process using the
``-F`` (``filePerProcess=1``) option changes the performance dramatically:

.. code-block:: shell

   $ mpirun -n 64 ./ior -t 1m -b 16m -s 16 -F
   ...
   access bw(MiB/s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
   ------ --------- ---------- --------- -------- -------- -------- -------- ----
   write  33645     16384      1024.00   0.007693 0.486249 0.195494 0.486972 1
   read   149473    16384      1024.00   0.004936 0.108627 0.016479 0.109612 1
   remove -         -          -         -        -        -        6.08     1

This is in large part because letting each MPI process work on its own file cuts
out any contention that would arise because of file locking.

However, the performance difference between our naive test and the
file-per-process test is a bit extreme. In fact, the only way that a 146 GB/sec
read rate could be achieved on Lustre is if each of the four compute nodes had
over 45 GB/sec of network bandwidth to Lustre--that is, a 400 Gbit link on every
compute and storage node.

Effect of Page Cache on Benchmarking
------------------------------------
What's really happening is that the data being read by IOR isn't actually coming
from Lustre; rather, the files' contents are already cached, and IOR is able to
read them directly out of each compute node's DRAM. The data wound up getting
cached during the write phase of IOR as a result of Linux (and Lustre) using a
write-back cache to buffer I/O, so that instead of IOR writing and reading data
directly to Lustre, it's actually mostly talking to the memory on each compute
node.

To be more specific, although each IOR process thinks it is writing to a file on
Lustre and then reading back the contents of that file from Lustre, it is
actually

1) writing data to a copy of the file that is cached in memory. If there
   is no copy of the file cached in memory before this write, the parts
   being modified are loaded into memory first.
2) those parts of the file in memory (called "pages") that are now
   different from what's on Lustre are marked as being "dirty"
3) the write() call completes and IOR continues on, even though the written
   data still hasn't been committed to Lustre
4) independent of IOR, the OS kernel continually scans the file cache for
   files that have been updated in memory but not on Lustre ("dirty pages"),
   and then commits the cached modifications to Lustre
5) dirty pages are declared non-dirty since they are now in sync with
   what's on disk, but they remain in memory

Then when the read phase of IOR follows the write phase, IOR is able to just
retrieve the file's contents from memory instead of having to communicate with
Lustre over the network.
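
If you want to watch this happening on a compute node, the kernel reports how
much of the page cache is currently dirty or being written back. Checking these
counters during and shortly after the write phase of IOR shows the cached data
draining out to the file system (the values below are purely illustrative):

.. code-block:: shell

   $ grep -E '^(Dirty|Writeback):' /proc/meminfo
   Dirty:           8388608 kB
   Writeback:        262144 kB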

There are a couple of ways to measure the read performance of the underlying
Lustre file system. The crudest way is to simply write more data than will
fit into the total page cache so that by the time the write phase has completed,
the beginning of the file has already been evicted from cache. For example,
increasing the number of segments (``-s``) to write more data very clearly
reveals the point at which the nodes' page cache on my test system runs over:

.. image:: tutorial-ior-overflowing-cache.png

However, this can make running IOR on systems with a lot of on-node memory take
forever.

A better option would be to get the MPI processes on each node to only read data
that they didn't write. For example, on a four-process-per-node test, shifting
the mapping of MPI processes to blocks by four makes each node N read the data
written by node N-1.

.. image:: tutorial-ior-reorderTasks.png

Since page cache is not shared between compute nodes, shifting tasks this way
ensures that each MPI process is reading data it did not write.

IOR provides the ``-C`` option (``reorderTasks``) to do this, and it forces each MPI
process to read the data written by its neighboring node. Running IOR with
this option gives much more credible read performance:

.. code-block:: shell

   $ mpirun -n 64 ./ior -t 1m -b 16m -s 16 -F -C
   ...
   access bw(MiB/s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
   ------ --------- ---------- --------- -------- -------- -------- -------- ----
   write  41326     16384      1024.00   0.005756 0.395859 0.095360 0.396453 0
   read   3310.00   16384      1024.00   0.011786 4.95     4.20     4.95     1
   remove -         -          -         -        -        -        0.237291 1

But now it should seem obvious that the write performance is also ridiculously
high. And again, this is due to the page cache, which signals to IOR that writes
are complete when they have been committed to memory rather than the underlying
Lustre file system.

To work around the effects of the page cache on write performance, we can issue
an ``fsync()`` call immediately after all of the ``write()`` calls return to force
the dirty pages we just wrote to flush out to Lustre. Including the time it takes
for ``fsync()`` to finish gives us a measure of how long it takes for our data to
write to the page cache and for the page cache to write back to Lustre.

IOR provides another convenient option, ``-e`` (``fsync``), to do just this. And,
once again, using this option changes our performance measurement quite a bit:

.. code-block:: shell

   $ mpirun -n 64 ./ior -t 1m -b 16m -s 16 -F -C -e
   ...
   access bw(MiB/s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
   ------ --------- ---------- --------- -------- -------- -------- -------- ----
   write  2937.89   16384      1024.00   0.011841 5.56     4.93     5.58     0
   read   2712.55   16384      1024.00   0.005214 6.04     5.08     6.04     3
   remove -         -          -         -        -        -        0.037706 0

and we finally have a believable bandwidth measurement for our file system.

Defeating Page Cache
--------------------
Since IOR is specifically designed to benchmark I/O, it provides these options
to make it as easy as possible to ensure that you are actually measuring the
performance of your file system and not your compute nodes' memory. That being
said, the I/O patterns it generates are designed to demonstrate peak performance,
not to reflect what a real application might be trying to do, and as a result
there are plenty of cases where measuring I/O performance with IOR is not the
best choice. There are several ways in which we can get clever and defeat page
cache in a more general sense to get meaningful performance numbers.

When measuring write performance, bypassing the page cache is actually quite
simple: opening a file with the ``O_DIRECT`` flag causes reads and writes to go
directly to storage, bypassing the page cache entirely. In addition, an
``fsync()`` call can be inserted into applications, as is done with IOR's ``-e``
option.
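
As a purely illustrative sketch of the first approach (this is not how IOR itself
is implemented), a small C program that writes a buffer while bypassing the page
cache might look like the following. Note that ``O_DIRECT`` requires the buffer,
the file offset, and the transfer size to satisfy alignment requirements; 4096
bytes is a safe alignment on most Linux file systems:

.. code-block:: c

   #define _GNU_SOURCE                     /* exposes O_DIRECT on Linux */
   #include <fcntl.h>
   #include <stdlib.h>
   #include <string.h>
   #include <unistd.h>

   int main(int argc, char *argv[]) {
       const size_t xfer = 1 << 20;        /* 1 MiB per write() */
       void *buf;

       /* O_DIRECT needs an aligned buffer */
       if (argc < 2 || posix_memalign(&buf, 4096, xfer) != 0)
           return 1;
       memset(buf, 0xAA, xfer);

       /* O_DIRECT makes this write bypass the page cache entirely */
       int fd = open(argv[1], O_WRONLY | O_CREAT | O_DIRECT, 0644);
       if (fd < 0)
           return 1;
       if (write(fd, buf, xfer) != (ssize_t)xfer)
           return 1;

       close(fd);
       free(buf);
       return 0;
   }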

Measuring read performance is a lot trickier. If you are fortunate enough to
have root access on a test system, you can force the Linux kernel to empty out
its page cache by doing

.. code-block:: shell

   # echo 1 > /proc/sys/vm/drop_caches

and in fact, this is often good practice before running any benchmark
(e.g., Linpack) because it ensures that you aren't losing performance to the
kernel trying to evict pages as your benchmark application starts allocating
memory for its own use.
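
One caveat: ``drop_caches`` only evicts clean pages, so any data that is still
dirty stays in memory until the kernel writes it back. It is therefore common to
flush dirty pages to disk first:

.. code-block:: shell

   # sync; echo 1 > /proc/sys/vm/drop_caches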

Unfortunately, many of us do not have root on our systems, so we have to get
even more clever. As it turns out, there is a way to pass a hint to the kernel
that a file is no longer needed in page cache:

.. code-block:: c

   #define _XOPEN_SOURCE 600
   #include <unistd.h>
   #include <fcntl.h>

   int main(int argc, char *argv[]) {
       int fd;
       fd = open(argv[1], O_RDONLY);

       /* flush any dirty pages belonging to this file out to disk first */
       fdatasync(fd);

       /* then hint to the kernel that the file's pages can be dropped */
       posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

       close(fd);
       return 0;
   }

The effect of passing ``POSIX_FADV_DONTNEED`` using ``posix_fadvise()`` is usually
that all pages belonging to that file are evicted from page cache in Linux.
However, this is just a hint -- not a guarantee -- and the kernel evicts these
pages asynchronously, so it may take a second or two for pages to actually leave
page cache. Fortunately, Linux also provides a way to probe the pages of a file
to see whether they are resident in memory.
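
One such interface is ``mincore()``, which reports, page by page, whether a mapped
file is resident in the page cache. A minimal sketch that prints how much of a
file is currently cached:

.. code-block:: c

   #define _DEFAULT_SOURCE                 /* exposes mincore() on Linux */
   #include <fcntl.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <sys/mman.h>
   #include <sys/stat.h>
   #include <unistd.h>

   int main(int argc, char *argv[]) {
       if (argc < 2)
           return 1;

       int fd = open(argv[1], O_RDONLY);
       struct stat sb;
       if (fd < 0 || fstat(fd, &sb) != 0 || sb.st_size == 0)
           return 1;

       long pagesize = sysconf(_SC_PAGESIZE);
       size_t npages = (sb.st_size + pagesize - 1) / pagesize;

       /* map the file and ask the kernel which of its pages are resident */
       void *map = mmap(NULL, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
       unsigned char *vec = malloc(npages);
       if (map == MAP_FAILED || vec == NULL || mincore(map, sb.st_size, vec) != 0)
           return 1;

       size_t resident = 0;
       for (size_t i = 0; i < npages; i++)
           resident += vec[i] & 1;
       printf("%zu of %zu pages resident in page cache\n", resident, npages);

       munmap(map, sb.st_size);
       free(vec);
       close(fd);
       return 0;
   }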

Finally, it's often easiest to just limit the amount of memory available for
page cache. Because application memory always takes precedence over cache
memory, simply allocating most of the memory on a node will force most of the
cached pages to be evicted. Newer versions of IOR provide the ``memoryPerNode``
option, which does just that, and the effects are what one would expect:

.. image:: tutorial-ior-memPerNode-test.png

The above diagram shows the measured bandwidth from a single node with 128 GiB
of total DRAM. The first percentage on each x-label is the amount of this
128 GiB that was reserved by the benchmark as application memory, and the second
percentage is the total write volume. For example, the "50%/150%" data points
correspond to 50% of the node memory (64 GiB) being allocated for the
application, and a total of 192 GiB of data being written and then read back.
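
In many IOR versions ``memoryPerNode`` is exposed as the ``-M`` flag and accepts
either an absolute size or a percentage of the node's memory, so a run that
reserves half of each node's RAM for the benchmark itself might look something
like this (illustrative only; check ``./ior -h`` for the exact spelling in your
version):

.. code-block:: shell

   $ mpirun -n 64 ./ior -t 1m -b 16m -s 16 -F -C -e -M 50%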

This benchmark was run on a single spinning disk which is not capable of more
than 130 MB/sec, so the conditions that showed performance higher than this were
benefiting from some pages being served from cache. And this makes perfect
sense given that the anomalously high performance measurements were obtained
when there was plenty of memory to cache relative to the amount of data being
read.

Corollary
---------
Measuring I/O performance is a bit trickier than measuring CPU performance, in
large part due to the effects of page caching. That being said, page cache
exists for a reason, and there are many cases where an application's I/O
performance really is best represented by a benchmark that makes heavy use of
the cache.

For example, the BLAST bioinformatics application re-reads all of its input data
twice; the first time initializes data structures, and the second time fills
them up. Because the first read caches each page and allows the second read to
come out of cache rather than the file system, running this I/O pattern with
page cache disabled causes it to be about 2x slower:

.. image:: tutorial-cache-vs-nocache.png

Thus, letting the page cache do its thing is often the most representative way
to benchmark realistic application I/O patterns. Once you know how page cache
might be affecting your measurements, you stand a good chance of being able to
reason about what the most meaningful performance metrics are.