Test / test_write_xor (push) Successful in 1m19sDetails
Test / test_heal_pg_size_2 (push) Successful in 4m30sDetails
Test / test_heal_ec (push) Successful in 4m32sDetails
- The tests are now stable and run in a CI system based on Gitea CI
- The release includes final bug fixes for EC:
- Implement missing EC recovery of allocation bitmap when built with ISA-L
- Fix broken snapshot export with EC (allocation bitmap reads were giving incorrect results previously)
- Also fixed bugs manifesting under heavy load:
- Fix monitor possibly applying incorrect PG history on retries
- Fix monitor incorrectly changing PG count when last_clean_pgs contains less PGs than the new number
- Allow writes to wait for free space again, but now correctly (previously dropped in 0.8.2)
- Fix a rare segfault in client (handle client stop during incoming stream handling in 1 more place)
- Make monitor correctly handle etcd connection errors - it could die instead of connecting to another etcd
- Fix OSD rarely being unable to report PG states after a PG was taken over by another OSD
- Fixed return code for incomplete EC objects (now EIO) and made cluster client retry this error
- Made other small changes for tests: timeouts, nice/ionice for etcd, waiting conditions, NBD device checks and so on
- Fix vitastor-cli rm/rm-data broken in 0.8.6 (missing messenger initialization)
- Prepare OSD read handler for upcoming version with scrub - allow "secondary reads" to return errors
- Fix OSDs re-peering PGs infinitely with a big number of PGs (reproduced in test_add_osd)
- Fix another variant of flusher sync-waiting stall (reproduced in test_write)
- Fix other tests in tests/ (will add them to Gitea CI soon)
- Add patches for QEMU 6.2-8.0
- Fix QEMU driver compatibility with QEMU 8.0
- Build packages for RHEL 9 clones (based on AlmaLinux 9)
This release includes a bunch of important bugfixes for erasure-coded setups
with disabled immediate_commit. After these fixes, "test_heal" OSD killing test
now passes fine with EC:
- Fix cluster write stalls with "Error while doing flush on OSD xx: -16 (Device or resource busy)"
in OSD logs possible in EC setups with disabled immediate_commit by selectively
syncing nonsynced objects on STABILIZE/ROLLBACK (https://github.com/vitalif/vitastor/issues/51)
- Fix other EC + disabled immediate_commit problems:
- Fix "opcode=5 retval=-2" errors happening on SYNC retries
- Fix non-working "pagination" during PG dirty object flushing
- Fix write operations not continued correctly after dirty object flushing
- Fix incorrect parity read-modify-write calculation when writing into a lost chunk
- Fix OSDs losing left_on_dead PG state of non-clean PGs and thus not removing junk data in the cluster
- Fix a small memory leak caused by bad indexing of EC recovery matrices
- Fix a rare use-after-free in cluster_client caused by a reenterability issue
- Fix vitastor-cli create command syntax in the CSI driver
- Allow to start OSDs without local store for tests
- Fix memory allocation error in disk_tool_meta for non-standard metadata block sizes
- Fix delete operations received before loading pool metadata crashing OSDs with "null pointer exception"
- Improve "theoretical performance" Russian documentation
New features:
- Implement online configuration update for some parameters. Documentation is coming soon :)