This has 2 effects:
1) OSD sets aren't added into PG history until actual write attempts anymore
which removes unneeded extra osd_sets in PG history
2) New OSD sets are reported synchronously and can't be lost on PG restarts
happening at the same time with reconfiguration
- Implement a new "vitastor-disk purge" command to remove OSDs with safety checks
- Implement a new "vitastor-cli rm-osd" command to only remove OSD metadata from etcd
- Fix a bug where the monitor could ignore OSD removal and other /osd/stats key changes
- Fix a bug where garbage could be returned when reading objects being written at the same time
- Fix a rare write stall where journal space could be not reclaimed where there
were no new operations in the flush queue
- Fix a rare peering stall caused by a previous long listing operations queues limiting attempt
- Fix total object count statistic in OSD on object creation
- Add missing offset&len into vitastor-disk dump-journal for big_writes, fix JSON format
- Make vitastor-cli print help on missing command
- Make vitastor-cli translate all '-' to '_' in CLI options
"In-flight" versions are added into dirty_db when writes are enqueued. And they
weren't ignored by subsequent reads even though they didn't have data location yet.
This bug was leading to test_heal.sh not passing sometimes with replicated setups.
- Fix QEMU driver compatibility with QEMU 7.0 and < 2.9
- Add patches for pve-qemu-kvm 7.1 (PVE 7.3) and pve-qemu-kvm 6.2 (PVE 7.2)
- Fix Proxmox driver location in the pve-storage-vitastor package
- Disable HDD autodetection in non-hybrid mode
- Explicitly warn about a buggy kernels on -EAGAIN in io_uring
- Final fix for the lack of zeroing out of old metadata entries
(do not crash with "big_write journal_entry was allocated over another object"
in some cases after an unclean OSD shutdown)
- Wait for data writes before fsyncing data if data fsync is enabled
- Never try to wait for free space inside blockstore thus stalling OSDs
- Fix a rare crash in osd_peering due to callback ordering
- Fix a rare duplication of ping & op message IDs
- Fix a rare use-after-free during pings
- Add --force to vitastor-disk read-sb
- Make vitastor-disk dump metadata object IDs in hex, add forgotten commas
- Fix vitastor-disk SCSI disk cache check
If a crash occurs during flushing a redirect-write it may happen so that
the disk contains both old and new metadata entries. This is OK, but prior
to 0.8.0 after this situation OSDs started without problem, but then they
crashed after some more overwrites with a "tried to overwrite non-zero
metadata entry" error. 0.8.0 introduced a change that was intended to fix
this situation, but rather than fixing it it prevented OSDs from starting,
now because of a "big_write journal_entry was allocated over another object"
error... :-)
This change finally fixes the original issue.
Followup to 54ef2c389f
- Remove an additional data copy operation when flushing journal (should
slightly increase write performance)
- Fix a bug where new writes in the inmemory_journal=false mode could overwrite
the data currently read by a parallel read operation
- Fix degraded parity writes for EC N+K when K>1 where the bug could also lead
to an "assertion failed" error
- Fix missing journal space check for "big" writes which could lead to
"prefill_single_journal_entry(): assertion failed..." error in OSD
- Fix possible "assertion failed: next->prev_wait >= 0" in client in rare cases
- Fix missing "len" field in vitastor-disk write-journal big_writes
- Fix possible crash of a full OSD (ENOSPC)
- Fix CSI build scripts to include newest packages every time
- Fix CSI endpoint in the liveness probe manifest
- Implement automatic OSD activation via udev and simple on-disk superblock storage
- Add a new `vitastor-disk` tool and merge all disk-related functionality there.
Now it can prepare new OSD disks, upgrade plain old systemd units to the new scheme,
resize OSD data area, manage OSD services by disk paths, manage superblocks,
automatically check and disable disk cache, dump and write back journal and metadata.
- Add a documentation section about `vitastor-disk` (read it if you want details!)
- Install systemd services during package installation instead of the older method
of manually creating them via separate shell scripts
- Add a new `make-etcd` script that reuses /etc/vitastor/vitastor.conf to configure etcd
- Allow to configure block_size, bitmap_granularity and immediate_commit per-pool
- Fix "fatal error: tried to overwrite non-zero metadata entry" which was possible
in some cases after unclean OSD shutdown (caused by old metadata entries not being zeroed)
- Add ISA-L erasure code implementation, now used automatically instead of jerasure when available
- Fix listings sending too many parallel requests to OSDs
- Fix rm-data crashing with --wait-list
- Remove empty inodes from statistics and `ls` output, after <inode_vanish_time> seconds after deletion
- Make monitor delete pool statistics when the pool is deleted and thus remove them from `df` output
- Log multiple etcd addresses in OSD logs correctly
- Fix true/false parsing in json configs like no_recovery/no_rebalance
- Show no_recovery, no_rebalance, readonly flags in status
- Add documentation! :-) in Russian and English
- Implement an NFS proxy for file-based access emulation to Vitastor
images for non-QEMU based hypervisors like VMWare, as a better way
than iSCSI
- Implement "primary affinity tags"
- Add a patch for libvirt 6.0
- Fix free_down_raw in cli status
- Fix a rare bug where OSDs could drop unrelated connections on errors
Return results and errors in a variable instead of just printing them,
separate vitastor-cli main() from cli_tool_t, move positional argument
parsing to CLI main from command implementations.
- Fix incorrect reading of extra metadata block leading to extra unknown objects in stats
- Fix CSI driver volumeMode: Block support
- Add block PVC and pod examples
- Fix build under 32 bit architectures
- Fix slow connection ramp-up caused by up_wait_retry_interval
- Implement `vitastor-cli status` (print cluster status) command
- Add a new `make-osd-hybrid.js` script to quickly prepare a lot of hybrid (HDD+SSD) OSDs
- Implement snapshot deletion for Cinder driver (only works in a healthy cluster)
- Fix a huge :) bug causing reads to return all zeroes during rebalance. Add a test to prevent it in the future
- Disconnect NBD proxy correctly without leaving a zombie [vitastor-nbd] process in D state
- Fix a rare write hang appearing with small write throttling enabled
- Fix IPv6 address parsing
- Fix "cannot read bytes of undefined" in the monitor on a fresh DB
- Fix possible hangs of read requests on OSD restarts without immediate_commit=all mode
- Fix OSDs skipping misplaced recovery in some cases
- Fix OSDs possibly dying with "map::at" errors when other OSDs are stopped
- Fix division by zero in ls if all pool OSDs are down
- Fix client hangs possible on OSD restarts (bug affected versions from 0.5.11)
- Fix "Assertion `sqe != NULL' failed" io_uring-related crashes possible
on some kernels (0.6.11 increased probability of this bug)
- Fix timeout=0 in NBD proxy
- Fix build under centos 7
Problem is that in recent kernels io_uring may return completions BEFORE
clearing the submission queue. I.e. for example its capacity is 512, there
were 512 requests, one of them completed, so when the request completion is
processed the queue "should have" 1 free slot. But sometimes it doesn't because
io_uring doesn't always clear the submission queue before sending CQE :-/
Fixes client hangs possible after stopping & restarting an osd.
Hangs happened when a connection was closed in the middle of reading a READ
operation reply from the network. In this case the operation being read was
in read_op and the client didn't free it when closing the connection.
Test case for msgr_read.cpp:
- Partially read reply for a READ operation
- stop_client()
- Check that the READ operation returns EPIPE
The bug was actually introduced in 0.5.11.
etcd connection stability, clang & elbrus support
- Fix build under CLang and Elbrus LCC compilers, making Vitastor compatible
with Elbrus CPUs :)
- Completely fix the bug where OSDs didn't connect to peers and incorrectly marked
PGs as incomplete
- Limit I/O depth for deletes the same way as for small writes. Makes OSD crashes
with "Assertion failed: sqe != NULL" during image deletion go away
- Fix a very old, but rare, journaling bug (credits to https://github.com/mirrorll)
- Fix flushing of unclean journaled objects leading to OSDs sometimes hanging
after failover in EC setups (bug was introduced in 0.6.7)
- Fix several problems that could prevent smooth operation of a Vitastor cluster
under the condition of partial etcd failure:
- OSDs could randomly fail due to too strict error handling
- New clients and OSDs could be unable to start because of the lack of retries
- CLI could fail some commands because of the lack of retries
- Monitor could stop receiving state updates because of the lack of websocket pings
- Fix monitor being unable to rebalance PGs after a downscale of pool pg_size (3->2)
- Exit with failure when trying to nbd map or benchmark a non-existing image
- Use HTTP keep-alive for etcd connections
- Allow to configure etcd request timeouts and retries
- Allow to configure NBD timeout, max devices and partitions, and set default to
up to 64 devices with up to 3 partitions each
Build problems fixed:
- void* pointer arithmetic which is a GNU extension (works as byte*)
- "variable size object may not be initialized" which is OK under GCC
- nullptr_t related error in json11 (it lacks 'operator <' in clang)
Warnings fixed:
- empty nested struct initializer { 0 } replaced by {}
- removed several unused lambda captures
Prelimilary results:
- CPU usage drops significantly. For example, in T1Q8 128K write test against
stub_uring_osd with 10G network and Athlon X4 860k CPU it drops from 100% to 30%
- Latency becomes slightly worse. In T1Q1 4K write test in the same environment
latency increases from 56 to 63 us.
- Small write throughput also becomes slightly worse. In T1Q128 4K write test
against stub iops decreases from 138k to ~110k (unstable, fluctuates 100k..120k).
Note that this is without io_uring, of course.
- Slightly reduce journaling write amplification (requires no_same_sector_overwrites=false)
- Fix listen_backlog (it was 0) because it could more than halve OSD socket send speed
- Support IPv6 OSD addresses
- Do not try to initialize client in simple-offsets
- Fix OSDs sometimes marking PGs incomplete instead of trying to connect with peers
- Allow to configure OSD placement in node_placement
- Allow to run with 4k sector size block devices. Natural, but it was forbidden
Slightly reduces WA. For example, in 4K T1Q128 replicated randwrite tests
WA is reduced from ~3.6 to ~3.1, in T1Q64 from ~3.8 to ~3.4.
Only effective without no_same_sector_overwrites.
- Implement a storage plugin for Proxmox. Now you can use Vitastor with Proxmox!
- Implement `vitastor-cli df` (pool space usage statistics) command
- Add glob pattern support for `vitastor-cli ls`
- Fix several bugs in other CLI commands (resize, create --parent, modify --readonly)
- Use 512 byte logical block size in QEMU driver by default (and thus don't require to set it in QEMU options)
New features:
- Build Vitastor driver as part of QEMU
- Implement renaming images in CLI (vitastor-cli modify --rename)
- Add vitastor-cli alloc-osd and simple-offsets commands and use them in make-osd,
thus removing the dependency on etcdctl
- Make monitor remove stale deleted inode statistics from etcd automatically
- Implement OSD address selection from a subnet, thus removing the need to specify
OSD addresses in startup scripts explicitly
Bug fixes:
- Fix client failover in case of etcd shutdown or crash (make client survive etcd failures)
- Stick to the last live etcd in OSD and mon to prevent random failures when one of etcds is down
- Fix incorrect copying of data from journal to the data device which could lead to data corruption
- Prefer local etcd IPs in OSD
- Remove the total PG count restriction in optimize_change which was sometimes leading
to inability to redistribute PGs over OSDs
- Fix error response parsing on a failed pg state report
- Fix slow linear writes with RDMA by changing default buffer settings
- Fix possible 'TypeError' in openstack nova when using Vitastor cinder driver
- Fix bugs in vitastor-cli create, ls, rm, modify commands
Patch changes:
- Add a patch for libvirt 7.6
- Add patches for QEMU 6.0 and 6.1
- Fix config file path XML location parsing in libvirt patches
- Replace _ with - in QEMU options
- Fix possible 'TypeError' in openstack nova when using Vitastor cinder driver
- Fix possible crashes of QEMU block driver in case of incorrect options
129K to leave extra space for the header
The problem with 8x 1M buffers is that the following happens with,
for example, 2 OSDs and 4M T1Q1 write:
- Server posts 8 receives
- Client posts 8 sends
- WRs are processed by the RDMA stack, but the OSD doesn't have the time
to handle them and doesn't refill buffers
- Client posts 1 more send
- RNR retransmission happens and performance drops to zero
Overall it seems that RDMA support should be reworked to use real 'RDMA'
operations i.e. operations writing into remote memory. This has an
additional advantage of avoiding a copy at the receive side of the OSD.
- Implement CLI commands for listing, viewing I/O statistics, creating,
snapshotting, cloning, resizing and modifying images. All these operations
are covered by 3 commands: ls, create, modify
- Implement an important fix to prior OSD set tracking for PGs. The previous
version had an issue which could lead to data loss due to an OSD with older
copy of the data thinking it has the newest copy
- Fix I/O statistics aggregation in the monitor
- Several minor fixes for Cinder driver
- Fix QEMU driver to be compatible with QEMU 2.x > 2.0
- Fix stalls sometimes possible in configurations without immediate_commit due
to insufficient amount of automatic internal fsync operations
- Add `vita` alias for `vitastor-cli`
Required to prevent data loss due to activation of an OSD with older data
when PG OSD set change doesn't occur. I.e. fixes the simplest case:
- Run 2 OSDs with 1 PG
- Start writing into the PG
- Stop OSD 2
- Stop OSD 1
- Start OSD 2
After this change the PG will refuse to start after the last step.
- New command-line tool: vitastor-cli
- Implement layer (snapshot/clone) merge and delete
- Remove 'bool' from the C header
- Fix a very rare flusher stall
- More diagnostics now printed for slow ops in the log
- Basic support for OpenStack: Cinder driver, patches for Nova and libvirt
- Add missing "image" and "config_path" QEMU options
- Calculate aggregate per-pool statistics in monitor
- Implement writes with Check-And-Set semantics
- Add a C wrapper library with public header