Fix the configure check if --with-lustre is specified, but
the linux/lustre/lustre_user.h header is not present. Only
one of the headers needs to be included if both are found.
In some cases, FASYNC is not defined but forms part of the
O_LOV_DELAY_CREATE value; add a #define in that case.
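A minimal sketch of such a guard; the fallback value shown is the Linux asm-generic one and is an assumption, as is the exact placement in the IOR sources:

```c
#include <fcntl.h>

/* Sketch: some platforms' fcntl.h lacks FASYNC, yet it forms part of
 * the O_LOV_DELAY_CREATE flag value; provide a fallback definition.
 * The value 00020000 is the Linux asm-generic one (an assumption). */
#ifndef FASYNC
#define FASYNC 00020000
#endif
```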
Fixes #115.
Fixes: cb88c4c19a831d94b864c49a162e2635730540e5
Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Changed the parser to fix #98. Does _not_ yet have the HDF5 backend args updated. They were reverted to allow this to make the 3.2 release and should be re-applied as a separate commit.
Currently, the calculation for the barriers case looks a bit
surprising: for example, the Min statistic is taken as the
minimum over MPI processes of their Max values. This makes Min
and Max very similar in our testing, so we get
a small Std value too.
I am not an MPI expert, but the question is: why don't we
use the more natural calculation, which computes the statistics
from all results across iterations and MPI processes?
Signed-off-by: Wang Shilong <wshilong@ddn.com>
IOR uses the GetTimeStamp() wrapper to allow gettimeofday() to be used for timings instead of MPI_Wtime() when calculating throughput. This commit adds the same logic to mdtest timings.
The parser now supports concurrent parsing of all plugin options.
Moved HDF5 collective_md option into the backend as an example.
Example: ./src/ior -a dummy --dummy.delay-xfer=50000
This patch enables correct compilation on both MacOS and Linux using only
POSIX.1-2008-compliant C with XSI extensions.
Specifically, POSIX.1-2008 is the minimum version because we use strdup(3);
explicit XSI is required to expose putenv from glibc.
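A sketch of the feature-test macro this implies; `_XOPEN_SOURCE 700` requests POSIX.1-2008 with XSI, exposing strdup(3) and putenv(3) on both glibc and macOS (it must precede any #include):

```c
/* Request POSIX.1-2008 + XSI before any system header is included. */
#define _XOPEN_SOURCE 700

#include <stdlib.h>
#include <string.h>
```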
Leading whitespace is stripped from each line of the ior input file. This
allows indented comments to be treated as comments. However it does NOT allow
one to specify excessive whitespace inside of the `ior start` and `ior stop`
magic phrases.
Also added a test to catch regressions in this functionality.
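The stripping step can be sketched as follows (function name assumed, not IOR's actual parser code):

```c
#include <ctype.h>

/* Skip leading whitespace on a config-file line so that indented
 * comment lines are still recognized as comments. */
static const char *SkipLeadingWhitespace(const char *line)
{
    while (*line != '\0' && isspace((unsigned char)*line))
        line++;
    return line;
}
```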
Context: Some backends may have used different names in
the past (for example, the IME backend used to be called IM),
so legacy scripts may break.
This patch adds a legacy name option in the aiori structure.
Both the name and the legacy name work to select the interface,
but the following warning is printed if the legacy name is used:
ior WARNING: [legacy name] backend is deprecated, use [name] instead.
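A sketch of the lookup this describes; the struct fields and function are hypothetical shapes, not the exact aiori definitions:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical slice of the aiori structure: current and legacy names. */
typedef struct {
    const char *name;        /* e.g. "IME" */
    const char *name_legacy; /* e.g. "IM", or NULL if none */
} aiori_t;

/* Select a backend by either name; warn when the legacy name matches. */
static const aiori_t *SelectBackend(const aiori_t *list, size_t n,
                                    const char *api)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(list[i].name, api) == 0)
            return &list[i];
        if (list[i].name_legacy != NULL &&
            strcmp(list[i].name_legacy, api) == 0) {
            fprintf(stderr,
                    "ior WARNING: %s backend is deprecated, use %s instead.\n",
                    list[i].name_legacy, list[i].name);
            return &list[i];
        }
    }
    return NULL;
}
```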
- travis now tests the packaged source to detect missing source/headers
- basic tests are less sensitive to the directory from where they are run
- fixed some missing files from the `make dist` manifest
- updated the format of NEWS to work with `make dist`
Context: write and read results from the same iteration
use the same length value in bytes. When stonewalling is
used, the size varies depending on the performance of
the access. This leads to wrong max bandwidths being reported
for writes, as shown in the following example:
write 10052 ...
read 9910 ...
write 10022 ...
read 9880 ...
write 10052 ...
read 9894 ...
Max Write: 9371.43 MiB/sec (9826.66 MB/sec)
Max Read: 9910.48 MiB/sec (10391.89 MB/sec)
This patch makes IOR separate variables used for read
and write tests.
Context: IOR outputs errors when the '-w -W' flags are used
without '-r'. Writing a file with the write-check option
should be possible even without enabling read.
This patch fixes a condition, introduced for
HDFS, that removes the RDWR flag in some particular cases.
Write check was set with the write-only flag, but it
requires the read flag.
Context: IOR initializes all available backends. If one
backend fails to initialize, IOR cannot be used.
This patch makes IOR initialize only the backends
which will be used. The initialization is done after
the parameters are checked, so that the help message
can still be displayed if something goes wrong.
Context: IOR gets a segfault if an unsupported API string
is provided.
This patch checks that the I/O backend is supported, otherwise
IOR stops with an error.
- update cmd line options to add DAOS Pool and Container uuid and SVCL
- Add init/finalize backend functions.
Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@intel.com>
This patch adds initialize, finalize and stat calls to
the IME backend. statfs, mkdir and rmdir are currently
not supported. This patch also fixes the IME_GetVersion
call.
This patch makes a few adjustments to the automake file
in order to properly add include and library directories.
These are required to be able to compile some aiori backends.
Therefore, it divides -n (the number of items) by the -I argument (number of items per directory).
The result is the number of sub-directories that are created at the top level.
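The arithmetic can be sketched as follows (function and variable names are assumptions, not mdtest's actual code):

```c
#include <stddef.h>

/* Number of top-level sub-directories from -n (items) and -I
 * (items per directory), e.g. -n 1000 -I 10 gives 100 directories. */
static unsigned long long NumTopLevelDirs(unsigned long long items,
                                          unsigned long long items_per_dir)
{
    return items / items_per_dir;
}
```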
inform aiori interface about RADOS backend
stubbed out aiori backend for rados
additions to get RADOS backend compiling/linking
first cut at rados create/open paths
make sure to return RADOS oid on open/create
implement rados xfer path for WRITE
refactor + implement getfilesize and close
remember to use read_op interface for stat
implement RADOS delete function
don't error in RADOS_Delete for now
implement RADOS set_version
handle open/create flags appropriately
cleanup RADOS error handling
implement read/readcheck/writecheck for RADOS
rados doesn't support directio
implement unsupported aiori ops for RADOS
implement RADOS access call
define rados types if no rados support
It shares the create/open/delete/set_version/get_file_size
functions with POSIX backend.
The mmap backend also supports the fsync and fsyncPerWrite options,
but it will use msync() instead of fsync().
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Context: Some file systems require a prefix in the path.
The POSIX 'access' call fails and consequently files are
never deleted.
This patch implements an access function in the MPIIO
backend using MPI_File_open. Prefixes can now be parsed
by ROMIO.
The O_DIRECT option was not working because set_o_direct_flag() was moved to
utilities.c, where the #define _GNU_SOURCE was missing. As a result, not even
the warning "cannot use O_DIRECT" was produced.
Starting with PnetCDF 1.7, the MPI datatype corresponding to NC_BYTE changed from MPI_BYTE to MPI_SIGNED_CHAR. Without this change, running IOR with NCMPI causes a fatal error like the one below:
ERROR in aiori-NCMPI.c (line 287): cannot write to data set.
ERROR: NetCDF: Not a valid data type or _FillValue type mismatch.
The file "src/aiori-NCMPI.c" uses the numTasksWorld variable, which is declared in "src/ior.h". When IOR is configured with ncmpi enabled, the NCMPI backend is linked into mdtest, but numTasksWorld was not defined in "src/mdtest.c", causing a linker error like the one below:
mdtest-aiori-NCMPI.o: In function `NCMPI_Xfer':
/home/parallels/Documents/ior/src/aiori-NCMPI.c:272: undefined reference to `numTasksWorld'
Once a process hits the stonewall (time limit), all processes determine the maximum number of pairs read/written.
Each process then continues to read/write until that maximum number of pairs is reached; this simulates the wear-out.
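The agreement step can be sketched as follows. In IOR the maximum comes from an MPI reduction across ranks (MPI_Allreduce with MPI_MAX); here it is computed locally over an array of per-rank counts for illustration:

```c
#include <stddef.h>

/* Maximum pairs completed across ranks; each rank then keeps going
 * until it reaches this count (the wear-out phase). */
static size_t MaxPairsDone(const size_t *pairs_per_rank, size_t nranks)
{
    size_t max = 0;
    for (size_t i = 0; i < nranks; i++)
        if (pairs_per_rank[i] > max)
            max = pairs_per_rank[i];
    return max;
}
```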
This commit makes changes to the AIORI backends to add support for
abstracting statfs, mkdir, rmdir, stat, and access. These new
abstractions are used by a modified mdtest. Some changes:
- Require C99. It's 2017 and most compilers now support C11. The
benefits of using C99 include subobject naming (for aiori backend
structs), and fixed size integers (uint64_t). There is no reason to
use the non-standard long long type.
- Moved some of the aiori code into aiori.c so it can be used by both
mdtest and ior.
- Code cleanup of mdtest. This is mostly due to the usage of the IOR
backends rather than a mess of #if code.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Previously, -R ran another read phase and compared whether the output of both reads was identical.
-R now checks whether the data matches the expected signature (as set using -G <NUMBER>), so it reads the data once and then
directly compares the read data with the expected buffer.
This makes it possible to first run IOR with a write-only phase, then later with a read phase that checks whether the data is still correct.
Since the read can be repeated multiple times, there is no need for the old -R semantics.
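The new semantics can be sketched as: regenerate the expected buffer from the seed and compare it to the data read back. The pattern below is hypothetical; IOR's actual data signature differs:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Regenerate the expected data from the seed (the -G value).
 * Hypothetical pattern for illustration only. */
static void FillExpected(uint64_t *buf, size_t n, uint64_t seed)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = seed + i;
}

/* -R style check: one read, compared directly against the
 * regenerated expected buffer. Sketch assumes n <= 64. */
static int CheckRead(const uint64_t *read_buf, size_t n, uint64_t seed)
{
    uint64_t expect[64];
    FillExpected(expect, n, seed);
    return memcmp(read_buf, expect, n * sizeof(uint64_t)) == 0;
}
```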
If a hintfile contains e.g. cb_buffer_size = 1234, IOR will try to set
the hint "cb_buffer_size " (note trailing space), a hint that no MPI
implementation actually supports.
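The fix amounts to trimming trailing whitespace from the parsed hint name (function name assumed, not IOR's actual code):

```c
#include <ctype.h>
#include <string.h>

/* Strip trailing whitespace in place, so a hintfile line like
 * "cb_buffer_size = 1234" yields the hint name "cb_buffer_size". */
static void TrimTrailingWhitespace(char *s)
{
    size_t len = strlen(s);
    while (len > 0 && isspace((unsigned char)s[len - 1]))
        s[--len] = '\0';
}
```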
IOR was leaking a hint structure in MPI-IO case, two groups in
common code, and when we get the info member from the file we were
losing our reference to the hints we just passed in.
This makes it so that the buffers are only allocated once per test instead
of once per transfer. This also removes initial buffer set-up from the
timing window.
Added a new struct, IOR_io_buffers, into ior.h for the buffer, checkBuffer, and
readCheckBuffer, so only one pointer needs to be passed to XferBuffersSetup(),
XferBuffersFree(), and WriteOrRead().
Changed the logic in XferBuffersSetup() and XferBuffersFree() to not be transfer
dependent. If a test includes a write check or read check, the checkBuffer
and readCheckBuffer will be created once per test in TestIoSys(). The
argument now taken by both functions has changed from the access type to
a pointer to IOR_param_t.
Changed WriteOrRead to take as an additional parameter
the IOR_io_buffers struct, since it was no longer creating those
buffers.
Changed how the -l option works. Now you choose the type of data packet:
-l i incompressible data packets
-l incompressible incompressible data packets
-l timestamp timestamped data packets
-l t timestamped data packets
-l offset offset data packets
-l o offset data packets
The -G option is now either the seed for the incompressible random packets
or the timestamp, depending on the input to the -l option.
-G will no longer timestamp packets on its own without the addition of -l timestamp or -l t.
I kept shorter versions of the options for the sake of typing sanity.
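The option mapping can be sketched with a simple string comparison (enum and function names are assumptions, not IOR's actual parser):

```c
#include <string.h>

typedef enum {
    PACKET_INCOMPRESSIBLE,
    PACKET_TIMESTAMP,
    PACKET_OFFSET,
    PACKET_UNKNOWN
} packet_type_t;

/* Map a -l argument, long or one-letter form, to a packet type. */
static packet_type_t ParsePacketType(const char *arg)
{
    if (strcmp(arg, "i") == 0 || strcmp(arg, "incompressible") == 0)
        return PACKET_INCOMPRESSIBLE;
    if (strcmp(arg, "t") == 0 || strcmp(arg, "timestamp") == 0)
        return PACKET_TIMESTAMP;
    if (strcmp(arg, "o") == 0 || strcmp(arg, "offset") == 0)
        return PACKET_OFFSET;
    return PACKET_UNKNOWN;
}
```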
All ranks locally capture and accumulate ETags for the parts they are
writing. In the N:1 cases, these are then collected by rank 0 via
MPI_Gather. This is effectively an organization matching the "segmented"
layout. If data was written segmented, then rank 0 assigns part numbers
with appropriate offsets to correspond to what would have been used by each
rank when writing a given ETag. If data was written strided, then ETags
must also be accessed in strided order, to build the XML that will be sent.
TBD: Once the total volume of etag data exceeds the size of memory at rank
0, we'll need to impose a more-sophisticated technique. One idea is to
thread the MPI comms differently from the libcurl comms, so that multiple
gathers can be staged incrementally, while sending a single stream of XML
data to the servers. For example, the libcurl write-function could
interact with the MPI program to allow the appearance of a single stream of
data.
These are variants on S3. S3 uses the "pure" S3 interface, e.g. using
Multi-Part-Upload. The "plus" variant enables EMC-extensions in the aws4c
library. This allows the N:N case to use "append", in the case where
"transfer_size" != "block_size" for IOR. In pure S3, the N:N case will
fail, because the EMC-extensions won't be enabled, and appending (which
attempts to use the EMC byte-range tricks to do this) will throw an error.
In the S3_EMC alg, N:1 uses EMC's other byte-range tricks to write different
parts of an N:1 file, and also uses append to write the parts of an N:N
file. Preliminary tests show these EMC extensions look to improve BW by
~20%.
I put all three algs in aiori-S3.c, because it seemed some code was getting
reused. Not sure if that's still going to make sense after the TBD, below.
TBD: Recently realized that the "pure" S3 shouldn't be trying to use
appends for anything. In the N:N case, it should just use MPU, within each
file. Then, there's no need for S3_plus. We just have S3, which does MPU
for all writes where transfer_size != block_size, and uses (standard)
byte-range reads for reading. Then S3_EMC uses "append" for N:N writes,
and byte-range writes for N:1 writes. This separates the code for the two
algs a little more, but we might still want them in the same file.
Testing on our EMC ViPR installation. Therefore, we also have available
some EMC extensions. For example, EMC supports a special "byte-range"
header-option ("Range: bytes=-1-") which allows appending to an object.
This is not needed for N:1 (where every write creates an independent part),
but is vital for N:N (where every write is considered an append, unless
"transfer-size" is the same as "block-size").
We also use a LANL-extended implementation of aws4c 0.5, which provides
some special features, and allows greater efficiency. That is included in
this commit as a tarball. Untar it somewhere else and build it, to produce
a library, which is linked with IOR. (configure with --with-S3).
TBD: EMC also supports a simpler alternative to Multi-Part Upload, which
appears to have several advantages. We'll add that in next, but wanted to
capture this as is, before I break it.
Along the way, added a bunch of diagnostic output in the HDFS calls, which
only shows up at verbosity >= 4. I'll probably remove this stuff before
merging with master. Also, there's an #ifdef'ed-out sleep() in place,
which I used to attach gdb to a running MPI task. I'll get rid of that
later, too.
Also, added another hdfs-related parameter to the IOR_param_t structure;
hdfs_user_name gets the value of the USER environment-variable as the
default HDFS user for connections. Does this cause portability problems?
I saw a run in which I caught an MPI task hanging in ctime() here. Switching
to ctime_r() fixes that. This function is only called from rank == 0, but it
hangs anyway.
This is not a problem for most backends, but HDFS doesn't support opening
RDWR. If you use only write-oriented or read-oriented flags on the
command-line, CheckRunSettings() will undo the default IOR_RDWR flag and
install the appropriate IOR_WRONLY or IOR_RDONLY open-flags, respectively.
This provides an HDFS back-end, allowing IOR to exercise a Hadoop
Distributed File-System, plus corresponding changes throughout, to
integrate the new module into the build. The commit compiles at LANL, but
hasn't been run yet. We're currently waiting for some configuration on
machines that will eventually provide HDFS. By default, configure ignores
the HDFS module. You have to explicitly add --with-hdfs.
GPFS supports a "gpfs_fcntl" method for hinting various things,
including "I'm about to write this block of data". Let's see if, for
the cost of a few system calls, we can wrangle the GPFS locking system
into allowing concurrent access with less overhead. (new IOR parameter
gpfsHintAccess)
Also, drop all locks on a file immediately after open/creation in the
shared file case, since we know all processes will touch unique regions
of the file. It may or may not be a good idea to release all file locks
after opening. Processes will then have to re-acquire locks already
held. (new IOR parameter gpfsReleaseToken)
Improve the scalability of CountTasksPerNode() by using
a broadcast and all-reduce, rather than flooding task zero
with MPI_Send() messages.
Also change the hostname lookup function from MPI_Get_processor_name
to gethostname(), which should work on most systems that I know of,
including BlueGene/Q.
Allows every task to allocate a specified amount of memory as
a rough simulation of a real application's memory usage.
Every page of the allocated memory is touched to defeat lazy
memory allocation.
Original patch by Michael Kluge <michael.kluge@tu-dresden.de>
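A sketch of the page-touching idea (function name assumed): writing one byte per page forces the kernel to back every page, defeating demand-paged allocation:

```c
#include <stdlib.h>
#include <unistd.h>

/* Allocate `bytes` and touch one byte per page so every page is
 * actually backed, not just reserved. Caller frees the result. */
static void *HogMemory(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    char *buf = malloc(bytes);
    if (buf == NULL || page <= 0)
        return buf;
    for (size_t i = 0; i < bytes; i += (size_t)page)
        buf[i] = 1;
    return buf;
}
```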
Only print total summary after all tests run.
Put the calculated results from each iteration of a test in a separate
IOR_results_t structure. Clean up the allocation and freeing code
for these calculated bits, which allows us to hang onto the results
until the end of all tests. That in turn allows us to perform one
big summary at the end of all of the tests.
Clean up the header files to only contain those things that
need to be shared between .c files.
Functions that are not shared are now declared static to
make their file scope explicit. Functions that ARE shared
are declared in appropriate headers.
I am not going to claim that I caught everything, but at
least it is a good start.
It was a nice idea to use multi-line error messages to make them
really obvious, but unfortunately it is terrible in practice. Often
errors will occur on multiple nodes simultaneously and the error
messages will be interleaved and virtually unreadable.
I also prefixed each message with "ior", to make it clear that
it is ior that produced the message, and not the scheduler, MPI, or
something else.
Error out immediately if a Lustre option was specified
but no Lustre support was compiled in.
Set a flag when any Lustre string options are set, to
make the code cleaner.