690 lines
31 KiB
Plaintext
Executable File
690 lines
31 KiB
Plaintext
Executable File
Copyright (c) 2003, The Regents of the University of California
|
|
See the file COPYRIGHT for a complete copyright notice and license.
|
|
|
|
IOR USER GUIDE
|
|
|
|
Index:
|
|
* Basics
|
|
1. Description
|
|
2. Running IOR
|
|
3. Options
|
|
|
|
* More Information
|
|
4. Option details
|
|
5. Verbosity levels
|
|
6. Using Scripts
|
|
|
|
* Troubleshooting
|
|
7. Compatibility with older versions
|
|
|
|
* Frequently Asked Questions
|
|
8. How do I . . . ?
|
|
|
|
*******************
|
|
* 1. DESCRIPTION *
|
|
*******************
|
|
IOR can be used for testing performance of parallel file systems using various
|
|
interfaces and access patterns. IOR uses MPI for process synchronization.
|
|
IOR version 2 is a complete rewrite of the original IOR (Interleaved-Or-Random)
|
|
version 1 code.
|
|
|
|
|
|
******************
|
|
* 2. RUNNING IOR *
|
|
******************
|
|
Two ways to run IOR:
|
|
|
|
* Command line with arguments -- executable followed by command line options.
|
|
|
|
E.g., to execute: IOR -w -r -o filename
|
|
This performs a write and a read to the file 'filename'.
|
|
|
|
* Command line with scripts -- any arguments on the command line will
|
|
establish the default for the test run, but a script may be used in
|
|
conjunction with this for varying specific tests during an execution of the
|
|
code.
|
|
|
|
E.g., to execute: IOR -W -f script
|
|
This defaults all tests in 'script' to use write data checking.
|
|
|
|
* The Command line supports to specify additional parameters for the chosen API.
|
|
For example, username and password for the storage.
|
|
Available options are listed in the help text after selecting the API when running with -h.
|
|
For example, 'IOR -a DUMMY -h' shows the supported options for the DUMMY backend.
|
|
The options for the backend must be specified at last and are separated with
|
|
two dashes '--'. Example: 'IOR -a DUMMY -- -c 1000' defines a delay for the
|
|
file creation for the plugin. Currently, it is not possible to set these
|
|
backend options using a command line script (-f option).
|
|
|
|
|
|
|
|
**************
|
|
* 3. OPTIONS *
|
|
**************
|
|
These options are to be used on the command line. E.g., 'IOR -a POSIX -b 4K'.
|
|
-a S api -- API for I/O, e.g., POSIX
|
|
-A N refNum -- user reference number to include in long summary
|
|
-b N blockSize -- contiguous bytes to write per task (e.g.: 8, 4k, 2m, 1g)
|
|
-c collective -- collective I/O
|
|
-C reorderTasksConstant -- changes task ordering to n+1 ordering for readback
|
|
-d N interTestDelay -- delay between reps in seconds
|
|
-D N deadlineForStonewalling -- seconds before stopping write or read phase
|
|
-e fsync -- perform fsync upon POSIX write close
|
|
-E useExistingTestFile -- do not remove test file before write access
|
|
-f S scriptFile -- test script name
|
|
-F filePerProc -- file-per-process
|
|
-g intraTestBarriers -- use barriers between open, write/read, and close
|
|
-G N setTimeStampSignature -- set value for time stamp signature
|
|
-h showHelp -- displays options and help
|
|
-H showHints -- show hints
|
|
-i N repetitions -- number of repetitions of test
|
|
-I individualDataSets -- datasets not shared by all procs [not working]
|
|
-j N outlierThreshold -- warn on outlier N seconds from mean
|
|
-J N setAlignment -- HDF5 alignment in bytes (e.g.: 8, 4k, 2m, 1g)
|
|
-k keepFile -- don't remove the test file(s) on program exit
|
|
-K keepFileWithError -- keep error-filled file(s) after data-checking
|
|
-l data packet type-- type of packet that will be created [offset|incompressible|timestamp|o|i|t]
|
|
-m multiFile -- use number of reps (-i) for multiple file count
|
|
-M N memoryPerNode -- hog memory on the node (e.g.: 2g, 75%)
|
|
-n noFill -- no fill in HDF5 file creation
|
|
-N N numTasks -- number of tasks that should participate in the test
|
|
-o S testFile -- full name for test
|
|
-O S string of IOR directives (e.g. -O checkRead=1,lustreStripeCount=32)
|
|
-p preallocate -- preallocate file size
|
|
-P useSharedFilePointer -- use shared file pointer [not working]
|
|
-q quitOnError -- during file error-checking, abort on error
|
|
-Q N taskPerNodeOffset for read tests use with -C & -Z options (-C constant N, -Z at least N) [!HDF5]
|
|
-r readFile -- read existing file
|
|
-R checkRead -- check read after read
|
|
-s N segmentCount -- number of segments
|
|
-S useStridedDatatype -- put strided access into datatype [not working]
|
|
-t N transferSize -- size of transfer in bytes (e.g.: 8, 4k, 2m, 1g)
|
|
-T N maxTimeDuration -- max time in minutes to run tests
|
|
-u uniqueDir -- use unique directory name for each file-per-process
|
|
-U S hintsFileName -- full name for hints file
|
|
-v verbose -- output information (repeating flag increases level)
|
|
-V useFileView -- use MPI_File_set_view
|
|
-w writeFile -- write file
|
|
-W checkWrite -- check read after write
|
|
-x singleXferAttempt -- do not retry transfer if incomplete
|
|
-X N reorderTasksRandomSeed -- random seed for -Z option
|
|
-Y fsyncPerWrite -- perform fsync after each POSIX write
|
|
-z randomOffset -- access is to random, not sequential, offsets within a file
|
|
-Z reorderTasksRandom -- changes task ordering to random ordering for readback
|
|
|
|
|
|
NOTES: * S is a string, N is an integer number.
|
|
* For transfer and block sizes, the case-insensitive K, M, and G
|
|
suffices are recognized. I.e., '4k' or '4K' is accepted as 4096.
|
|
|
|
Various options are only valid for specific modules, you can see details when running $ ./ior -h
|
|
These options are typically prefixed with the module name, an example is: --posix.odirect
|
|
|
|
*********************
|
|
* 4. OPTION DETAILS *
|
|
*********************
|
|
For each of the general settings, note the default is shown in brackets.
|
|
IMPORTANT NOTE: For all true/false options below [1]=true, [0]=false
|
|
IMPORTANT NOTE: Contrary to appearance, the script options below are NOT case sensitive
|
|
|
|
GENERAL:
|
|
========
|
|
* refNum - user supplied reference number, included in
|
|
long summary [0]
|
|
|
|
* api - must be set to one of POSIX, MPIIO, HDF5, HDFS, IME,
|
|
S3, S3_EMC, or NCMPI, depending on test [POSIX]
|
|
|
|
* testFile - name of the output file [testFile]
|
|
NOTE: with filePerProc set, the tasks can round
|
|
robin across multiple file names '-o S@S@S'
|
|
|
|
* hintsFileName - name of the hints file []
|
|
|
|
* repetitions - number of times to run each test [1]
|
|
|
|
* multiFile - creates multiple files for single-shared-file or
|
|
file-per-process modes; i.e. each iteration creates
|
|
a new file [0=FALSE]
|
|
|
|
* reorderTasksConstant - reorders tasks by a constant node offset for writing/reading neighbor's
|
|
data from different nodes [0=FALSE]
|
|
|
|
* taskPerNodeOffset - for read tests. Use with -C & -Z options. [1]
|
|
With reorderTasks, constant N. With reordertasksrandom, >= N
|
|
|
|
* reorderTasksRandom - reorders tasks to random ordering for readback [0=FALSE]
|
|
|
|
* reorderTasksRandomSeed - random seed for reordertasksrandom option. [0]
|
|
>0, same seed for all iterations. <0, different seed for each iteration
|
|
|
|
* quitOnError - upon error encountered on checkWrite or checkRead,
|
|
display current error and then stop execution;
|
|
if not set, count errors and continue [0=FALSE]
|
|
|
|
* numTasks - number of tasks that should participate in the test
|
|
[0]
|
|
NOTE: 0 denotes all tasks
|
|
|
|
* interTestDelay - this is the time in seconds to delay before
|
|
beginning a write or read in a series of tests [0]
|
|
NOTE: it does not delay before a check write or
|
|
check read
|
|
|
|
* interIODelay - this time in us (microseconds) after each I/O simulates computing time.
|
|
|
|
* outlierThreshold - gives warning if any task is more than this number
|
|
of seconds from the mean of all participating tasks.
|
|
If so, the task is identified, its time (start,
|
|
elapsed create, elapsed transfer, elapsed close, or
|
|
end) is reported, as is the mean and standard
|
|
deviation for all tasks. The default for this is 0,
|
|
which turns it off. If set to a positive value, for
|
|
example 3, any task not within 3 seconds of the mean
|
|
displays its times. [0]
|
|
|
|
* intraTestBarriers - use barrier between open, write/read, and close [0=FALSE]
|
|
|
|
* uniqueDir - create and use unique directory for each
|
|
file-per-process [0=FALSE]
|
|
|
|
* writeFile - writes file(s), first deleting any existing file [1=TRUE]
|
|
NOTE: the defaults for writeFile and readFile are
|
|
set such that if there is not at least one of
|
|
the following -w, -r, -W, or -R, it is assumed
|
|
that -w and -r are expected and are
|
|
consequently used -- this is only true with
|
|
the command line, and may be overridden in
|
|
a script
|
|
|
|
* readFile - reads existing file(s) (from current or previous
|
|
run) [1=TRUE]
|
|
NOTE: see writeFile notes
|
|
|
|
* filePerProc - accesses a single file for each processor; default
|
|
is a single file accessed by all processors [0=FALSE]
|
|
|
|
* checkWrite - read data back and check for errors against known
|
|
pattern; can be used independently of writeFile [0=FALSE]
|
|
NOTES: * data checking is not timed and does not
|
|
affect other performance timings
|
|
* all errors tallied and returned as program
|
|
exit code, unless quitOnError set
|
|
|
|
* checkRead - reread data and check for errors between reads; can
|
|
be used independently of readFile [0=FALSE]
|
|
NOTE: see checkWrite notes
|
|
|
|
* keepFile - stops removal of test file(s) on program exit [0=FALSE]
|
|
|
|
* keepFileWithError - ensures that with any error found in data-checking,
|
|
the error-filled file(s) will not be deleted [0=FALSE]
|
|
|
|
* useExistingTestFile - do not remove test file before write access [0=FALSE]
|
|
|
|
* segmentCount - number of segments in file [1]
|
|
NOTES: * a segment is a contiguous chunk of data
|
|
accessed by multiple clients each writing/
|
|
reading their own contiguous data;
|
|
comprised of blocks accessed by multiple
|
|
clients
|
|
* with HDF5 this repeats the pattern of an
|
|
entire shared dataset
|
|
|
|
* blockSize - size (in bytes) of a contiguous chunk of data
|
|
accessed by a single client; it is comprised of one
|
|
or more transfers [1048576]
|
|
|
|
* transferSize - size (in bytes) of a single data buffer to be
|
|
transferred in a single I/O call [262144]
|
|
|
|
* verbose - output information [0]
|
|
NOTE: this can be set to levels 0-5 on the command
|
|
line; repeating the -v flag will increase
|
|
verbosity level
|
|
|
|
* setTimeStampSignature - set value for time stamp signature [0]
|
|
NOTE: used to rerun tests with the exact data
|
|
pattern by setting data signature to contain
|
|
positive integer value as timestamp to be
|
|
written in data file; if set to 0, is
|
|
disabled
|
|
|
|
* showHelp - display options and help [0=FALSE]
|
|
|
|
* storeFileOffset - use file offset as stored signature when writing
|
|
file [0=FALSE]
|
|
NOTE: this will affect performance measurements
|
|
|
|
* memoryPerNode - Allocate memory on each node to simulate real
|
|
application memory usage. Accepts a percentage of
|
|
node memory (e.g. "50%") on machines that support
|
|
sysconf(_SC_PHYS_PAGES) or a size. Allocation will
|
|
be split between tasks that share the node.
|
|
|
|
* memoryPerTask - Allocate secified amount of memory per task to
|
|
simulate real application memory usage.
|
|
|
|
* maxTimeDuration - max time in minutes to run tests [0]
|
|
NOTES: * setting this to zero (0) unsets this option
|
|
* this option allows the current read/write
|
|
to complete without interruption
|
|
|
|
* deadlineForStonewalling - seconds before stopping write or read phase [0]
|
|
NOTES: * used for measuring the amount of data moved
|
|
in a fixed time. After the barrier, each
|
|
task starts its own timer, begins moving
|
|
data, and the stops moving data at a pre-
|
|
arranged time. Instead of measuring the
|
|
amount of time to move a fixed amount of
|
|
data, this option measures the amount of
|
|
data moved in a fixed amount of time. The
|
|
objective is to prevent tasks slow to
|
|
complete from skewing the performance.
|
|
* setting this to zero (0) unsets this option
|
|
* this option is incompatible w/data checking
|
|
|
|
* randomOffset - access is to random, not sequential, offsets within a file [0=FALSE]
|
|
NOTES: * this option is currently incompatible with:
|
|
-checkRead
|
|
-storeFileOffset
|
|
-MPIIO collective or useFileView
|
|
-HDF5 or NCMPI
|
|
* summaryAlways - Always print the long summary for each test.
|
|
Useful for long runs that may be interrupted, preventing
|
|
the final long summary for ALL tests to be printed.
|
|
* summaryFile=File - Output the summary to the file instead on stdout/stderr.
|
|
* summaryFormat=FMT - Choose the output format -- default, JSON, CSV
|
|
|
|
POSIX-ONLY:
|
|
===========
|
|
* useO_DIRECT - use O_DIRECT for POSIX, bypassing I/O buffers [0]
|
|
|
|
* singleXferAttempt - will not continue to retry transfer entire buffer
|
|
until it is transferred [0=FALSE]
|
|
NOTE: when performing a write() or read() in POSIX,
|
|
there is no guarantee that the entire
|
|
requested size of the buffer will be
|
|
transferred; this flag keeps the retrying a
|
|
single transfer until it completes or returns
|
|
an error
|
|
|
|
* fsyncPerWrite - perform fsync after each POSIX write [0=FALSE]
|
|
* fsync - perform fsync after POSIX write close [0=FALSE]
|
|
|
|
MPIIO-ONLY:
|
|
===========
|
|
* preallocate - preallocate the entire file before writing [0=FALSE]
|
|
|
|
* useFileView - use an MPI datatype for setting the file view option
|
|
to use individual file pointer [0=FALSE]
|
|
NOTE: default IOR uses explicit file pointers
|
|
|
|
* useSharedFilePointer - use a shared file pointer [0=FALSE] (not working)
|
|
NOTE: default IOR uses explicit file pointers
|
|
|
|
* useStridedDatatype - create a datatype (max=2GB) for strided access; akin
|
|
to MULTIBLOCK_REGION_SIZE [0] (not working)
|
|
|
|
HDF5-ONLY:
|
|
==========
|
|
* individualDataSets - within a single file each task will access its own
|
|
dataset [0=FALSE] (not working)
|
|
NOTE: default IOR creates a dataset the size of
|
|
numTasks * blockSize to be accessed by all
|
|
tasks
|
|
|
|
* noFill - no pre-filling of data in HDF5 file creation [0=FALSE]
|
|
|
|
* setAlignment - HDF5 alignment in bytes (e.g.: 8, 4k, 2m, 1g) [1]
|
|
|
|
MPIIO-, HDF5-, AND NCMPI-ONLY:
|
|
==============================
|
|
* collective - uses collective operations for access [0=FALSE]
|
|
|
|
* showHints - show hint/value pairs attached to open file [0=FALSE]
|
|
NOTE: available in NCMPI only if PnetCDF is 1.2.0 or later
|
|
|
|
LUSTRE-SPECIFIC:
|
|
================
|
|
* lustreStripeCount - set the lustre stripe count for the test file(s) [0]
|
|
|
|
* lustreStripeSize - set the lustre stripe size for the test file(s) [0]
|
|
|
|
* lustreStartOST - set the starting OST for the test file(s) [-1]
|
|
|
|
* lustreIgnoreLocks - disable lustre range locking [0]
|
|
|
|
GPFS-SPECIFIC:
|
|
================
|
|
* gpfsHintAccess - use gpfs_fcntl hints to pre-declare accesses
|
|
|
|
* gpfsReleaseToken - immediately after opening or creating file, release
|
|
all locks. Might help mitigate lock-revocation
|
|
traffic when many processes write/read to same file.
|
|
|
|
BeeGFS-SPECIFIC (POSIX only):
|
|
================
|
|
* beegfsNumTargets - set the number of storage targets to use
|
|
|
|
* beegfsChunkSize - set the striping chunk size. Must be a power of two,
|
|
and greater than 64kiB, (e.g.: 256k, 1M, ...)
|
|
|
|
|
|
***********************
|
|
* 5. VERBOSITY LEVELS *
|
|
***********************
|
|
The verbosity of output for IOR can be set with -v. Increasing the number of
|
|
-v instances on a command line sets the verbosity higher.
|
|
|
|
Here is an overview of the information shown for different verbosity levels:
|
|
0 - default; only bare essentials shown
|
|
1 - max clock deviation, participating tasks, free space, access pattern,
|
|
commence/verify access notification w/time
|
|
2 - rank/hostname, machine name, timer used, individual repetition
|
|
performance results, timestamp used for data signature
|
|
3 - full test details, transfer block/offset compared, individual data
|
|
checking errors, environment variables, task writing/reading file name,
|
|
all test operation times
|
|
4 - task id and offset for each transfer
|
|
5 - each 8-byte data signature comparison (WARNING: more data to STDOUT
|
|
than stored in file, use carefully)
|
|
|
|
|
|
********************
|
|
* 6. USING SCRIPTS *
|
|
********************
|
|
|
|
IOR can use a script with the command line. Any options on the command line
|
|
will be considered the default settings for running the script. (I.e.,
|
|
'IOR -W -f script' will have all tests in the script run with the -W option as
|
|
default.) The script itself can override these settings and may be set to run
|
|
run many different tests of IOR under a single execution. The command line is:
|
|
|
|
IOR/bin/IOR -f script
|
|
|
|
In IOR/scripts, there are scripts of testcases for simulating I/O behavior of
|
|
various application codes. Details are included in each script as necessary.
|
|
|
|
An example of a script:
|
|
===============> start script <===============
|
|
IOR START
|
|
api=[POSIX|MPIIO|HDF5|HDFS|IME|S3|S3_EMC|NCMPI]
|
|
testFile=testFile
|
|
hintsFileName=hintsFile
|
|
repetitions=8
|
|
multiFile=0
|
|
interTestDelay=5
|
|
readFile=1
|
|
writeFile=1
|
|
filePerProc=0
|
|
checkWrite=0
|
|
checkRead=0
|
|
keepFile=1
|
|
quitOnError=0
|
|
segmentCount=1
|
|
blockSize=32k
|
|
outlierThreshold=0
|
|
setAlignment=1
|
|
transferSize=32
|
|
singleXferAttempt=0
|
|
individualDataSets=0
|
|
verbose=0
|
|
numTasks=32
|
|
collective=1
|
|
preallocate=0
|
|
useFileView=0
|
|
keepFileWithError=0
|
|
setTimeStampSignature=0
|
|
useSharedFilePointer=0
|
|
useStridedDatatype=0
|
|
uniqueDir=0
|
|
fsync=0
|
|
storeFileOffset=0
|
|
maxTimeDuration=60
|
|
deadlineForStonewalling=0
|
|
useExistingTestFile=0
|
|
useO_DIRECT=0
|
|
showHints=0
|
|
showHelp=0
|
|
RUN
|
|
# additional tests are optional
|
|
<snip>
|
|
RUN
|
|
<snip>
|
|
RUN
|
|
IOR STOP
|
|
===============> stop script <===============
|
|
|
|
NOTES: * Not all test parameters need be set.
|
|
* White space is ignored in script, as are comments starting with '#'.
|
|
|
|
|
|
****************************************
|
|
* 7. COMPATIBILITY WITH OLDER VERSIONS *
|
|
****************************************
|
|
1) IOR version 1 (c. 1996-2002) and IOR version 2 (c. 2003-present) are
|
|
incompatible. Input decks from one will not work on the other. As version
|
|
1 is not included in this release, this shouldn't be case for concern. All
|
|
subsequent compatibility issues are for IOR version 2.
|
|
|
|
2) IOR versions prior to release 2.8 provided data size and rates in powers
|
|
of two. E.g., 1 MB/sec referred to 1,048,576 bytes per second. With the
|
|
IOR release 2.8 and later versions, MB is now defined as 1,000,000 bytes
|
|
and MiB is 1,048,576 bytes.
|
|
|
|
3) In IOR versions 2.5.3 to 2.8.7, IOR could be run without any command line
|
|
options. This assumed that if both write and read options (-w -r) were
|
|
omitted, the run with them both set as default. Later, it became clear
|
|
that in certain cases (data checking, e.g.) this caused difficulties. In
|
|
IOR versions 2.8.8 and later, if not one of the -w -r -W or -R options is
|
|
set, then -w and -r are set implicitly.
|
|
|
|
4) IOR version 3 (Jan 2012-present) has changed the output of IOR somewhat,
|
|
and the "testNum" option was renamed "refNum".
|
|
|
|
***************************
|
|
* 8. INCOMPRESSIBLE NOTES *
|
|
***************************
|
|
|
|
Please note that incompressibility is a factor of how large a block compression
|
|
algorithm uses. The incompressible buffer is filled only once before write times,
|
|
so if the compression algorithm takes in blocks larger than the transfer size,
|
|
there will be compression. Below are some baselines that I established for
|
|
zip, gzip, and bzip.
|
|
|
|
1) zip: For zipped files, a transfer size of 1k is sufficient.
|
|
|
|
2) gzip: For gzipped files, a transfer size of 1k is sufficient.
|
|
|
|
3) bzip2: For bziped files a transfer size of 1k is insufficient (~50% compressed).
|
|
To avoid compression a transfer size of greater than the bzip block size is required
|
|
(default = 900KB). I suggest a transfer size of greater than 1MB to avoid bzip2 compression.
|
|
|
|
Be aware of the block size your compression algorithm will look at, and adjust the transfer size
|
|
accordingly.
|
|
|
|
*********************************
|
|
* 9. FREQUENTLY ASKED QUESTIONS *
|
|
*********************************
|
|
HOW DO I PERFORM MULTIPLE DATA CHECKS ON AN EXISTING FILE?
|
|
|
|
Use this command line: IOR -k -E -W -i 5 -o file
|
|
|
|
-k keeps the file after the access rather than deleting it
|
|
-E uses the existing file rather than truncating it first
|
|
-W performs the writecheck
|
|
-i number of iterations of checking
|
|
-o filename
|
|
|
|
On versions of IOR prior to 2.8.8, you need the -r flag also, otherwise
|
|
you'll first overwrite the existing file. (In earlier versions, omitting -w
|
|
and -r implied using both. This semantic has been subsequently altered to be
|
|
omitting -w, -r, -W, and -R implied using both -w and -r.)
|
|
|
|
If you're running new tests to create a file and want repeat data checking on
|
|
this file multiple times, there is an undocumented option for this. It's -O
|
|
multiReRead=1, and you'd need to have an IOR version compiled with the
|
|
USE_UNDOC_OPT=1 (in iordef.h). The command line would look like this:
|
|
|
|
IOR -k -E -w -W -i 5 -o file -O multiReRead=1
|
|
|
|
For the first iteration, the file would be written (w/o data checking). Then
|
|
for any additional iterations (four, in this example) the file would be
|
|
reread for whatever data checking option is used.
|
|
|
|
|
|
HOW DOES IOR CALCULATE PERFORMANCE?
|
|
|
|
IOR performs get a time stamp START, then has all participating tasks open a
|
|
shared or independent file, transfer data, close the file(s), and then get a
|
|
STOP time. A stat() or MPI_File_get_size() is performed on the file(s) and
|
|
compared against the aggregate amount of data transferred. If this value
|
|
does not match, a warning is issued and the amount of data transferred as
|
|
calculated from write(), e.g., return codes is used. The calculated
|
|
bandwidth is the amount of data transferred divided by the elapsed
|
|
STOP-minus-START time.
|
|
|
|
IOR also gets time stamps to report the open, transfer, and close times.
|
|
Each of these times is based on the earliest start time for any task and the
|
|
latest stop time for any task. Without using barriers between these
|
|
operations (-g), the sum of the open, transfer, and close times may not equal
|
|
the elapsed time from the first open to the last close.
|
|
|
|
After each iteration (-i) IOR reports performance for that iteration, and
|
|
those numbers include:
|
|
|
|
- Bandwidth (described above)
|
|
|
|
- IOPS: I/O rate (operations per second) achieved by all tasks given the total
|
|
time spent in reading and writing the data.
|
|
|
|
- Latency: computed by taking the average latency of all I/O operations from a
|
|
single task. If ior is run with multiple tasks, then the latency reported is
|
|
the minimum that was computed between all tasks.
|
|
|
|
HOW DO I ACCESS MULTIPLE FILE SYSTEMS IN IOR?
|
|
|
|
It is possible when using the filePerProc option to have tasks round-robin
|
|
across multiple file names. Rather than use a single file name '-o file',
|
|
additional names '-o file1@file2@file3' may be used. In this case, a file
|
|
per process would have three different file names (which may be full path
|
|
names) to access. The '@' delimiter is arbitrary, and may be set in the
|
|
FILENAME_DELIMITER definition in iordef.h.
|
|
|
|
Note that this option of multiple filenames only works with the filePerProc
|
|
-F option. This will not work for shared files.
|
|
|
|
|
|
HOW DO I BALANCE LOAD ACROSS MULTIPLE FILE SYSTEMS?
|
|
|
|
As for the balancing of files per file system where different file systems
|
|
offer different performance, additional instances of the same destination
|
|
path can generally achieve good balance.
|
|
|
|
For example, with FS1 getting 50% better performance than FS2, set the '-o'
|
|
flag such that there are additional instances of the FS1 directory. In this
|
|
case, '-o FS1/file@FS1/file@FS1/file@FS2/file@FS2/file' should adjust for
|
|
the performance difference and balance accordingly.
|
|
|
|
|
|
HOW DO I USE STONEWALLING?
|
|
|
|
To use stonewalling (-D), it's generally best to separate write testing from
|
|
read testing. Start with writing a file with '-D 0' (stonewalling disabled)
|
|
to determine how long the file takes to be written. If it takes 10 seconds
|
|
for the data transfer, run again with a shorter duration, '-D 7' e.g., to
|
|
stop before the file would be completed without stonewalling. For reading,
|
|
it's best to create a full file (not an incompletely written file from a
|
|
stonewalling run) and then run with stonewalling set on this preexisting
|
|
file. If a write and read test are performed in the same run with
|
|
stonewalling, it's likely that the read will encounter an error upon hitting
|
|
the EOF. Separating the runs can correct for this. E.g.,
|
|
|
|
IOR -w -k -o file -D 10 # write and keep file, stonewall after 10 seconds
|
|
IOR -r -E -o file -D 7 # read existing file, stonewall after 7 seconds
|
|
|
|
Also, when running multiple iterations of a read-only stonewall test, it may
|
|
be necessary to set the -D value high enough so that each iteration is not
|
|
reading from cache. Otherwise, in some cases, the first iteration may show
|
|
100 MB/s, the next 200 MB/s, the third 300 MB/s. Each of these tests is
|
|
actually reading the same amount from disk in the allotted time, but they
|
|
are also reading the cached data from the previous test each time to get the
|
|
increased performance. Setting -D high enough so that the cache is
|
|
overfilled will prevent this.
|
|
|
|
|
|
HOW DO I BYPASS CACHING WHEN READING BACK A FILE I'VE JUST WRITTEN?
|
|
|
|
One issue with testing file systems is handling cached data. When a file is
|
|
written, that data may be stored locally on the node writing the file. When
|
|
the same node attempts to read the data back from the file system either for
|
|
performance or data integrity checking, it may be reading from its own cache
|
|
rather from the file system.
|
|
|
|
The reorderTasksConstant '-C' option attempts to address this by having a
|
|
different node read back data than wrote it. For example, node N writes the
|
|
data to file, node N+1 reads back the data for read performance, node N+2
|
|
reads back the data for write data checking, and node N+3 reads the data for
|
|
read data checking, comparing this with the reread data from node N+4. The
|
|
objective is to make sure on file access that the data is not being read from
|
|
cached data.
|
|
|
|
Node 0: writes data
|
|
Node 1: reads data
|
|
Node 2: reads written data for write checking
|
|
Node 3: reads written data for read checking
|
|
Node 4: reads written data for read checking, comparing with Node 3
|
|
|
|
The algorithm for skipping from N to N+1, e.g., expects consecutive task
|
|
numbers on nodes (block assignment), not those assigned round robin (cyclic
|
|
assignment). For example, a test running 6 tasks on 3 nodes would expect
|
|
tasks 0,1 on node 0; tasks 2,3 on node 1; and tasks 4,5 on node 2. Were the
|
|
assignment for tasks-to-node in round robin fashion, there would be tasks 0,3
|
|
on node 0; tasks 1,4 on node 1; and tasks 2,5 on node 2. In this case, there
|
|
would be no expectation that a task would not be reading from data cached on
|
|
a node.
|
|
|
|
|
|
HOW DO I USE HINTS?
|
|
|
|
It is possible to pass hints to the I/O library or file system layers
|
|
following this form:
|
|
'setenv IOR_HINT__<layer>__<hint> <value>'
|
|
For example:
|
|
'setenv IOR_HINT__MPI__IBM_largeblock_io true'
|
|
'setenv IOR_HINT__GPFS__important_hint true'
|
|
or, in a file in the form:
|
|
'IOR_HINT__<layer>__<hint>=<value>'
|
|
Note that hints to MPI from the HDF5 or NCMPI layers are of the form:
|
|
'setenv IOR_HINT__MPI__<hint> <value>'
|
|
|
|
|
|
HOW DO I EXPLICITLY SET THE FILE DATA SIGNATURE?
|
|
|
|
The data signature for a transfer contains the MPI task number, transfer-
|
|
buffer offset, and also timestamp for the start of iteration. As IOR works
|
|
with 8-byte long long ints, the even-numbered long longs written contain a
|
|
32-bit MPI task number and a 32-bit timestamp. The odd-numbered long longs
|
|
contain a 64-bit transferbuffer offset (or file offset if the '-l'
|
|
storeFileOffset option is used). To set the timestamp value, use '-G' or
|
|
setTimeStampSignature.
|
|
|
|
|
|
HOW DO I EASILY CHECK OR CHANGE A BYTE IN AN OUTPUT DATA FILE?
|
|
|
|
There is a simple utility IOR/src/C/cbif/cbif.c that may be built. This is a
|
|
stand-alone, serial application called cbif (Change Byte In File). The
|
|
utility allows a file offset to be checked, returning the data at that
|
|
location in IOR's data check format. It also allows a byte at that location
|
|
to be changed.
|
|
|
|
|
|
HOW DO I CORRECT FOR CLOCK SKEW BETWEEN NODES IN A CLUSTER?
|
|
|
|
To correct for clock skew between nodes, IOR compares times between nodes,
|
|
then broadcasts the root node's timestamp so all nodes can adjust by the
|
|
difference. To see an egregious outlier, use the '-j' option. Be sure
|
|
to set this value high enough to only show a node outside a certain time
|
|
from the mean.
|