- The tests are now stable and run in a CI system based on Gitea CI
- The release includes final bug fixes for EC:
- Implement missing EC recovery of allocation bitmap when built with ISA-L
- Fix broken snapshot export with EC (allocation bitmap reads were giving incorrect results previously)
- Also fixed bugs manifesting under heavy load:
- Fix monitor possibly applying incorrect PG history on retries
- Fix monitor incorrectly changing PG count when last_clean_pgs contains less PGs than the new number
- Allow writes to wait for free space again, but now correctly (previously dropped in 0.8.2)
- Fix a rare segfault in client (handle client stop during incoming stream handling in 1 more place)
- Make monitor correctly handle etcd connection errors - it could die instead of connecting to another etcd
- Fix OSD rarely being unable to report PG states after a PG was taken over by another OSD
- Fixed return code for incomplete EC objects (now EIO) and made cluster client retry this error
- Made other small changes for tests: timeouts, nice/ionice for etcd, waiting conditions, NBD device checks and so on
Monitor could deceive itself by immediately saving PG configuration changes
which weren't applied to etcd yet in memory, and apply incorrect PG history
changes next time if the first update fails.
This usually only happened under heavy load and was caught in CI. :-)
- Fix vitastor-cli rm/rm-data broken in 0.8.6 (missing messenger initialization)
- Prepare OSD read handler for upcoming version with scrub - allow "secondary reads" to return errors
- Fix OSDs re-peering PGs infinitely with a big number of PGs (reproduced in test_add_osd)
- Fix another variant of flusher sync-waiting stall (reproduced in test_write)
- Fix other tests in tests/ (will add them to Gitea CI soon)
- Add patches for QEMU 6.2-8.0
- Fix QEMU driver compatibility with QEMU 8.0
- Build packages for RHEL 9 clones (based on AlmaLinux 9)
This release includes a bunch of important bugfixes for erasure-coded setups
with disabled immediate_commit. After these fixes, "test_heal" OSD killing test
now passes fine with EC:
- Fix cluster write stalls with "Error while doing flush on OSD xx: -16 (Device or resource busy)"
in OSD logs possible in EC setups with disabled immediate_commit by selectively
syncing nonsynced objects on STABILIZE/ROLLBACK (https://github.com/vitalif/vitastor/issues/51)
- Fix other EC + disabled immediate_commit problems:
- Fix "opcode=5 retval=-2" errors happening on SYNC retries
- Fix non-working "pagination" during PG dirty object flushing
- Fix write operations not continued correctly after dirty object flushing
- Fix incorrect parity read-modify-write calculation when writing into a lost chunk
- Fix OSDs losing left_on_dead PG state of non-clean PGs and thus not removing junk data in the cluster
- Fix a small memory leak caused by bad indexing of EC recovery matrices
- Fix a rare use-after-free in cluster_client caused by a reenterability issue
- Fix vitastor-cli create command syntax in the CSI driver
- Allow to start OSDs without local store for tests
- Fix memory allocation error in disk_tool_meta for non-standard metadata block sizes
- Fix delete operations received before loading pool metadata crashing OSDs with "null pointer exception"
- Improve "theoretical performance" Russian documentation
New features:
- Implement online configuration update for some parameters. Documentation is coming soon :)