New features:
- Data and metadata checksums!
- Metadata checksums are always used with new disk format
- Data checksums can be turned on with --data_csum_type crc32c for new OSDs
- Checksum block size can be configured
- inmemory_metadata now also affects keeping checksums in memory
- Linux page cache I/O caching support which can be enabled separately for
data, metadata (including checksums) and journal (O_SYNC instead of O_DIRECT)
- Details [here](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/docs/config/layout-osd.en.md#data_csum_type)
- Backwards compatibility is preserved, you can use new OSDs with old disks
Release also includes bug fixes from [0.9.6](https://git.yourcmc.ru/vitalif/vitastor/releases/tag/v0.9.6).
0.9.6 is moved to "-oldstable" repositories and will be available for some additional time.
Instead of it, just do not verify checksums of currently mutated objects.
When clean data modification during flush runs in parallel to a read request,
that request may read a mix of old and new data. It may even read a mix of
multiple flushed versions if it lasts too long... And attempts to verify it
using temporary copies of metadata make the algorithm too complex and creepy.
- Fix vitastor-disk partition zeroing (sometimes it was writing garbage instead of zeroes)
- Fix incorrect EC space statistics in `vitastor-cli status`
- Several bug fixes for NFS:
- Add . and .. in NFS directory listings
- Return FILE_SYNC from NFS writes if immediate_commit is enabled
- Return the same "verifier" in NFS COMMIT as in NFS WRITE
- Make parallel NFS extending writes work correctly, without conflicts
- Handle parallel NFS extending writes without imposing extra load on etcd
- Support UTF-8 in vitastor-cli table output
- Also allow "0" and "no" as false for inmemory_metadata and inmemory_journal
- Use HDD defaults for HDD-only in automatic `vitastor-disk prepare` mode
This fixes buffered (not O_DIRECT) NFS writes in Linux - previously they were
hanging in an infinite loop because COMMIT didn't return the same verifier as
previous WRITEs, and NFS kernel client was infinitely retrying the same writes.
Also this probably allows for correct NFS failover, at least for the same
buffered writes, because NFS clients repeat all write requests until a COMMIT
confirms them.
Previously, multiple parallel writes extending file size through NFS were
racing with each other and triggering deletions of part of the written data
I.e. if you mounted vitastor-nfs and just copied a file into it in MC then
you could end up with only a part of the file actually written
- Improve QEMU driver performance by integrating io_uring in it (up to 1.5x total iops improvement)
- Fix QEMU driver deadlocks which started to reproduce in qemu-img after iothread fixes
- Fix `vitastor-cli status` reporting more etcds than actually exists (fix etcd address duplication in config on reload)
- Fix `vitastor-cli ls` crashing on inodes in non-existing pools
- Delete old garbage /pool/stats/ keys for non-existing (deleted) pools
- Reduce memory usage of etcds initialized by make-etcd script
- Fix OSDs almost always crashing on etcd restart due to "revisions were compacted" (support reloading state from etcd)
- Fix a crash and a stall possible mostly in HDD setups with small journal and big (512k, 900k) random writes
- Add notes about HDDs to documentation. You are officially allowed to use HDD-only Vitastor with HGST/Toshiba/EXOS :)
Deadlock was caused by switching QEMU coroutines directly inside
vitastor_co_read_bitmap_cb() callback. The correct way is to schedule a BH
/BH is a QEMU term for setImmediate() :)/, same as in read and write callbacks.
Fixes two bugs found during HDD testing :-)
1) OSD crashed with "BUG: Attempt to overwrite used offset of the journal" during
`fio -bs=900k -iodepth=128` test with 16 MB journal
2) OSD stalled during `fio -bs=512k -iodepth=128` test with 64 MB journal
Before this change, OSDs almost always died when one of the etcds was restarted,
even though the rest of them was still in quorum and the lease was still active
- Add patch for libvirt 9.0
- Add support for Proxmox VE 8.0
- Fix compatibility of the QEMU driver with iothread (QEMU rebuilds are coming)
- Fix vitastor-cli rm-data/rm/merge hanging when some OSDs are down.
Allow deletions in unclean cluster at the cost of some data possibly
"reappearing" when those OSDs start back. In that case you can just repeat
the deletion request using rm-data.
- A bunch of bug fixes for snapshots:
- Fix snapshot reads often not working at all with snapshot chain size > 2
- Fix optimized snapshot data merge (children to parent)
- Fix updating of image name index key during optimized merge
- Fix auto-selection preventing the use of optimized merge with only 1 snapshot
- Fix incorrect CAS retries during snapshot merge
- Fix snapshot merge progress reporting
- Fix primary_read bitmap buffers use-after-free which could lead to
incorrect allocation map reads
- Remove /usr/local/bin path from make-etcd
- Some documentation fixes
- Measure and report scrub I/O statistics in vitastor-cli status
- Make aggregated statistics in vitastor-cli status much smoother
(first derive, then sum instead of first summing and then deriving)
- Fix an old rare bug leading to journal corruption
(try to use scrub if you think you're affected...)
- Do not start EC PGs without at least <data chunks> OSDs in each old set
(prevents spurious read errors with EC during reconnections/restarts)
- Fix failed assert(!scrub_list_op) on OSD restart with pending scrubs
- Fix future planned scrubs not starting because of incorrect time comparison
- Build packages for Debian 12 (Bookworm)
The consequence of this issue was that in some very rare cases (only reproduced
under load in CI when running 4+ tests in parallel) small write data written to
journal could overwrite journal entries.
Also add an assert-type safety check to be able to catch this issue in the
future again in case of a regression.
- Fix "Client XX command out of sync" messages sometimes happening on OSD reconnections
- Fix a bug where EC reads parallel with writes to the same object failed with -ERANGE error
- Slightly reduce the amount of metadata writes during journal flushing
- Correctly unmap NBD volumes when Proxmox forces map_volume use (with SWTPM and maybe some other cases)