New features:
- Add a new kernel device mounting method: [ublk](https://vitastor.io/docs/usage/ublk.html).
It's the fastest method for random IOPS, and it's on par with VDUSE for linear MB/s.
- Disable io_uring waits being reported as iowait on kernels which support it (6.15+)
- Allow enforcing permissions on the VitastorFS NFS server side
- Add a qemu_file_mirror_path option to the config to trick Veeam into working
- Speed up CRC32C calculation in OSD by fixing a bug and enabling AVX512 version
- Support QEMU 10
- Support Debian 13 Trixie and Proxmox 9.0
- Remove the dependency on system liburing in package builds (build it statically)
Bug fixes:
- Fix checksums NEVER BEING ENABLED in vitastor-disk prepare, even when explicitly requested :-D
- Use default uid and gid from NFS AUTH_SYS when creating files
- Fix object bitmaps possibly being corrupted in some rare cases with EC N+K, K >= 2
- Avoid multiple inflight overwrites to meta blocks - fixes possible data corruption with
one specific SSD model: Memblaze PBlaze5 910 (github #79)
- Fix a bug in antietcd which was leading to leases sometimes not expiring correctly
- Fix a bug in NFS where ".." entry had its `cookie` equal to 0 instead of 1 (github #78)
- Fix snapshots not being deleted during VM deletion in Proxmox plugin (github #85)
- Fix monitor not filtering OSDs by block size correctly
When qemu_file_mirror_path is set to "/dir/" in /etc/vitastor/vitastor.conf, the QEMU driver
returns "/dir/image_name" as the filename in qemu-img info and in QAPI to trick software
like Veeam which expects file-based access. vitastor-nfs can then be mounted to that
directory to make backups work :-)
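For illustration, a minimal /etc/vitastor/vitastor.conf using this option might look like this
(the etcd_address value is just a placeholder):
```
{
  "etcd_address": ["10.0.0.1:2379"],
  "qemu_file_mirror_path": "/dir/"
}
```
With such a config, qemu-img info reports images as /dir/<image_name>, and mounting vitastor-nfs
at /dir makes those reported paths actually resolve to the image contents.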
- Support Ubuntu 24.04 Noble
- Fix clients hanging on online PG count change with in-flight operations
- Fix volume_size_info in PVE VitastorPlugin
- Implement bdrv_refresh_filename for the QEMU driver
- Fix RDMA-CM broken in 2.2.0
- Fix possible "invalid %N$ use detected" OSD crashes
- Fix vitastor-nfs build without RDMA
- Fix docker build by adding the forgotten apt/preferences
- Fix libfio_blockstore.so
- Fix "trigger event loop automatically" API version check in the QEMU driver
- Add Content-Type header for prometheus metrics
- Add a patch for libvirt 11.5
- Fix a bug introduced in 2.2.0 - pg_locks weren't disabled for pools without local_reads
correctly which could lead to inactive pools during various operations
- Fix an old bug where OSDs could send sub-operations to incorrect peer OSDs when their
connections were stopped and reestablished quickly, in 2.2.0 it was usually leading
to "sequencing broken" messages in OSD logs
- Fix debug use_sync_send_recv mode
This is required to prevent disconnected peers from sometimes receiving messages
intended for other peers - stop_client was freeing the operations even though they
were still referenced by in-flight io_uring requests. This was leading to
OSDs sometimes receiving garbage and "broken sequencing" errors in logs, as the
memory was usually already reallocated for other operations
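A minimal sketch of the general idea behind the fix (illustrative names, not Vitastor's actual
messenger structures): an operation referenced by in-flight io_uring requests is only marked as
canceled on disconnect and freed later, when the last CQE referencing it arrives.
```
#include <cassert>

// An operation may be referenced by several in-flight io_uring requests;
// stop_client() must not free it while any of them is still in flight
struct msgr_op_t
{
    int in_flight_refs = 0;
    bool canceled = false;
};

// Called when a send/recv request referencing this op is submitted
void ref_op(msgr_op_t *op)
{
    op->in_flight_refs++;
}

// Called from the CQE handler: a canceled op is freed only when the last
// in-flight request referencing it completes
void unref_op(msgr_op_t *op)
{
    assert(op->in_flight_refs > 0);
    if (--op->in_flight_refs == 0 && op->canceled)
        delete op;
}

// stop_client() marks the op as canceled instead of freeing it immediately,
// so the kernel never completes a request into already-reallocated memory
void cancel_op(msgr_op_t *op)
{
    op->canceled = true;
    if (op->in_flight_refs == 0)
        delete op;
}
```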
- Fix vitastor-disk purge broken after adding the "OSD is still running" check
- Fix iothreads hanging after adding zero-copy send support
- Fix enabling localized reads online (without restarting OSDs) in the default PG lock mode
New features:
- [Localized read support](https://vitastor.io/docs/config/pool.html#local_reads) for multi-datacenter setups.
- io_uring-based [zero-copy send support](https://vitastor.io/docs/config/network.html#min_zerocopy_send_size) -
read the instruction carefully for optimal performance!
- Improve and speed up data distribution, especially in cases of very large hosts (100+ OSDs).
Previously, PG optimization speed depended on the number of OSDs; now it only depends on
the number of failure domains. Distribution over specific OSDs is now also more even and
becomes strictly more even when you increase the number of PGs.
- Add a [very interesting instruction](https://vitastor.io/en/docs/usage/nfs.html#linux-nfs-write-size) to change NFS_MAX_FILE_IO_SIZE
- Check operation sequencing and stop connections when it breaks - should help catch some
very rare RDMA packet loss problems.
- `vitastor-cli rm-osd` now refuses to remove OSDs which are still up and suggests using `vitastor-disk purge` instead.
- Allow removal of direntries referring to non-existent inodes in VitastorFS.
- Change default vitastor-etcd data dir to /var/lib/etcd/vitastor.
Bug fixes:
- Fix compatibility with ISA-L 2.31+. ⚠️Very important: please upgrade Vitastor before upgrading ISA-L to 2.31+.
- Fix in-memory state cleanup for incomplete PGs.
- Fix monitor crash with non-existent node_placement nodes.
- Slightly speedup `vitastor-kv dump` command by adding output buffering.
- Fix theoretically possible slowdowns in OSD sub-operation failure handling code.
- Fix very rare stack overflows in vitastor-kv.
- Fix a possible crash in VitastorFS during handling of file creation race condition.
- Fix modify-pool -s PG_SIZE which didn't work without --pg_minsize.
- Fix marking peer OSDs as alive on receiving data from them via RDMA - in theory,
the bug could result in instability with RDMA under high load with slow disks.
- Fix a rare OSD crash due to double handle_primary_subop() call.
- Fix latency aggregation in global stats (/vitastor/stats in etcd) - do not sum it.
- Hide "Ran out of journal space" log messages by default.
- Wait for RDMA-CM EVENT_ESTABLISHED after rdma_accept(), handle rdma_accept() before acking the event.
- Fix VitastorFS total & free numbers multiplied by extra 2.
- Fix systemd unit name in make-etcd.
- Do not allow reweight > 1 in vitastor-cli modify-osd.
- Fix docker build.
This greatly speeds up PG placement and makes it more uniform, both because the LP task
becomes simpler and because the distribution over individual OSDs is optimised manually.
New features:
- Support separate OSD cluster network - [osd_cluster_network](https://vitastor.io/docs/config/network.html#osd_cluster_network)
and, in general, multiple OSD networks, including RDMA
- Add an alternative RDMA implementation via RDMA-CM - [use_rdmacm](https://vitastor.io/docs/config/network.html#use_rdmacm),
required for iWARP and, maybe, for some IB setups (but not for RoCE)
- Change the default PG behaviour to wait for all "up" OSDs to be connected before starting the PG.
The old behaviour can be restored by enabling the new [allow_net_split](https://vitastor.io/docs/config/osd.html#allow_net_split)
option.
- Add a patch for QEMU 9.2
Bug fixes:
- Fix incorrect "has_xxx" PG state names in ls-pgs
- Fix possible QEMU crashes after detaching of Vitastor disks (and update all QEMU builds in Vitastor repos)
- Fix clients sometimes spamming OSDs with infinite reconnections when some PGs are offline
- Fall back to TCP on RDMA connection failures
- Add missing logging of RDMA ibv_modify_qp() errors
- Add a minimum interval for etcd_state_client to reload state
No breaking features, it's 2.0.0 just because it includes S3 and because
there are already too many 1.x releases :).
New features:
- S3 is finally available: https://vitastor.io/docs/installation/s3.html
- node.js addon is now packaged as a Debian package
- Support listing PGs by OSDs in `vitastor-cli ls-pgs`
- Implement offline TRIM support: [vitastor-disk trim](https://vitastor.io/docs/usage/disk.html#trim),
[discard_on_start](https://vitastor.io/docs/config/osd.html#discard_on_start)
- Change used_for_fs pool option to used_for_app
Bug fixes:
- Fix several bugs in the node.js addon (a memory leak, an incorrectly triggered event loop)
- Fix a client crash (vitastor-cli rm) during deletion when writeback is enabled
- Fix PG object count statistics on deletion of non-existing objects
- Fix vitastor-nbd crash when mapping by ID instead of inode name
- Fix a client memory leak with enabled immediate_commit and write-back cache
- Add seccomp=unconfined for vitastor docker OSDs to not break io_uring
- Add udev and systemd to vitastor docker image
- Fix upgrading from pre-0.7.1 (very old) systemd units O_o
- Fix total object count calculation in rm_data
New features:
- Support containerized Vitastor installations: http://vitastor.io/docs/installation/docker.html
- Add new functions to the node.js binding: delete(), get_immediate_commit(), on_ready(),
get_min_io_size(), get_max_atomic_write_size()
- S3 (Zenko Cloudserver with Vitastor support) is coming shortly and will be released separately
Bug fixes:
- Use IP-derived etcd node names in make-etcd
- Set short name of the OSD process to display in `top`
- Fix snap-create without pool_id failing when there are multiple pools
- Several bugs are fixed in the write-back cache, it should now be stable:
- Fix incorrect snapshot reads from dirty write-back cache
- Do not try to repeat pending writebacks on OSD reconnections
- Fix client hangs with multiple SYNCs in the writeback queue
- Fix client hangs due to incorrect calculation of the writeback queue size
- Several improvements for NBD mapping/unmapping:
- Add a workaround for race condition in the Linux kernel NBD driver leading
to vitastor-nbd sometimes breaking a previously mapped device instead of
setting up a new one
- Check if the device is actually mapped in vitastor-nbd unmap
- Fix device name/number validation in vitastor-nbd
- Fix OSD crashes after starting with corrupted metadata - from now on the OSD will skip
corrupted metadata entries and heal itself
- Fix scrubbing of misplaced objects and object state recalculation after
vitastor-cli fix - previously, an OSD restart could be required to fix object states
- Make primary OSD distribution more stable by using murmur3 hash instead of the old pseudo-rng
- Fix monitor sometimes racing with itself - do not touch /pool/stats from stats
aggregation if PG recheck is active
- Sort vitastor-cli ls output by name by default
- Update antietcd to 1.1.2
Should probably have been an obvious side effect :-)
The child process gets open file descriptors to the parent's epoll/timerfd,
and it's totally OK to just close() all of them, but it's absolutely NOT
OK to run destructors - they modify the kernel state of epoll/timerfd
before destroying the objects. So, basically, when we destroy the client in the child
process, we break it in the parent too. This also means that cluster_client_t
doesn't support fork(). :-)
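A minimal sketch of the pitfall, assuming a typical epoll-based event loop (names are
illustrative): the child may close() inherited descriptors, but must not run destructors
that call epoll_ctl(), because the epoll instance is shared with the parent after fork().
```
#include <sys/epoll.h>
#include <unistd.h>

struct poller_t
{
    int epfd, watched_fd;
    ~poller_t()
    {
        // Dangerous in a forked child: epoll_ctl() mutates the epoll
        // instance shared with the parent, so the parent silently
        // stops receiving events for watched_fd
        epoll_ctl(epfd, EPOLL_CTL_DEL, watched_fd, NULL);
        close(epfd);
    }
};

// Safe cleanup for the child: close() only drops the child's references
// to the shared epoll/timerfd objects and leaves their state intact
void child_cleanup(int epfd, int timer_fd)
{
    close(epfd);
    close(timer_fd);
}
```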
Do all NBD configuration in the child process, after the last fork.
Why? Because there is a race condition in the Linux kernel nbd driver
in nbd_add_socket() - it saves the `current` task pointer as `nbd->task_setup` and
then rechecks if the new `current` is the same. The problem is that if that process
is already dead, `current` may be freed and then replaced by another process
with the same pointer value. So the check passes and NBD allows a different process
to set up a device which is already set up. A proper fix would have to be made in the
kernel code, but the obvious workaround is to perform NBD setup from the process
which will then actually call NBD_DO_IT. That process stays alive for the whole
lifetime of the NBD device, so the (nbd->task_setup != current) check always
works correctly, and we don't accidentally break previous NBD devices while setting
up a new one. Forking to check every device is of course rather slow, so we also
do an additional check by calling list_mapped() before searching for a free NBD device.
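A rough sketch of this workaround (simplified, not Vitastor's actual code): all NBD ioctl()
configuration is done in the child process which then blocks in NBD_DO_IT, so the kernel's
`nbd->task_setup` pointer refers to a live task for the device's whole lifetime.
```
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <linux/nbd.h>

// Returns the child PID in the parent; the child configures the device
// and stays alive inside NBD_DO_IT until the device is disconnected
pid_t map_nbd_in_child(const char *dev_path, int sock, uint64_t size_bytes)
{
    pid_t pid = fork();
    if (pid != 0)
        return pid;
    int nbd = open(dev_path, O_RDWR);
    ioctl(nbd, NBD_SET_BLKSIZE, 4096);
    ioctl(nbd, NBD_SET_SIZE_BLOCKS, size_bytes / 4096);
    ioctl(nbd, NBD_SET_SOCK, sock);
    // Blocks here; (nbd->task_setup == current) stays true in the kernel
    // for the whole lifetime of the device
    ioctl(nbd, NBD_DO_IT);
    ioctl(nbd, NBD_CLEAR_SOCK);
    _exit(0);
}
```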
New features:
- Add "deleted" image flag which is set when vitastor-cli rm starts to delete an image,
but can't delete it fully due to inactive PGs or stopped OSDs
- Support JSON output in vitastor-disk prepare and purge
- Show backfillfull pools in vitastor-cli status
- Make object listings consistent (used in vitastor-cli rm/rm-data/merge/etc).
This means that there is now a guarantee that if a data block is present when you invoke rm,
rm will attempt to delete it, even if rm is invoked when the PG switches state. Previously in
such cases rm could skip and leave some objects behind as garbage, and merge probably could
incorrectly move data between snapshots.
- Make deletions (rm/rm-data) consistent. This means that rm/rm-data will either complete
successfully and delete all requested image data or complete with an error if some objects
could not be deleted or if there is a possibility that some data is left on stopped OSDs.
Previously, when some PGs or OSDs were inactive at the moment of deletion, rm-data
behaved incorrectly: it wasn't retrying deletions failed due to dropped OSD connections,
it could hang waiting for PGs to activate, and it could exit with a success code
while some garbage was possibly still left on some OSDs. Deletions are not yet fully atomic
cluster-wide, which means that you still have to repeat the deletion request after you
bring stopped OSDs back, but now you always know for sure whether you have to repeat it.
Bug fixes:
- Fix vitastor-cli rm --exact / --matching command not working
- Finally fix "Unexpected status" in the Proxmox plugin
- Fix vitastor-cli create-snap incorrectly linking multiple snapshots in a different pool
- Fix incomplete image parent_id loop check in OSD
- Fix reads from snapshots in a different pool not working if there are more than 2 snapshots
- Fix append of VITASTOR_CONF to cmdline in the opennebula prebackup script
- Fix OSDs crashing again when the cluster is full with EC (was meant to work since 1.6.0 but didn't)
- Improve logging of subop failures
This still doesn't make listings 100% consistent; for that, listings would have to be
received only from the primary OSD rather than from all peer OSDs, but this issue
will be fixed separately.
New features:
- Implement basic VitastorFS support in [CSI](https://vitastor.io/docs/installation/kubernetes.html)
- Implement [NFS RDMA](https://vitastor.io/docs/usage/nfs.html#rdma) support
- Pause pool rebalance when monitor detects that it can lead to any OSD becoming full ([osd_backfillfull_ratio](https://vitastor.io/docs/config/monitor.html#osd_backfillfull_ratio))
- Auto-select correct [RDMA device and GID](https://vitastor.io/docs/config/network.html#rdma_device) based on osd_network and RoCEv2 priority
- Report slow ops in OSD stats in etcd and show them in vitastor-cli status
Bug fixes:
- Fix possibly incorrect linked list deserialization in NFS
- Fix possible crash in vitastor-nfs --block READDIR operation
- Map netlink after forking to show correct PID in vitastor-nbd ls
- Simplify and fix create-pool OSD count checks for the case of hosts split into sub-nodes
- Make monitor print "Waiting to become master" just once, not every 5s
- Take out_size from oimg if not specified in vitastor-cli dd
- Do not report OSDs with empty statistics as "full" in status
- Trigger double autosync when switching PG state to prevent leaving garbage in non-immediate_commit clusters
- Fix a lack of connection timeout for etcd websockets in OSD leading to slower etcd failover (~70s instead of ~10s)
- Fix a rare OSD crash during client disconnect
- Fix PGs sometimes sticking until OSD restart in the "has_unclean" state with EC pools
- Fix metadata partition zeroing in vitastor-disk prepare
- Add patches for qemu 9.1 and pve-qemu 9.0 and 9.1
- Fix libvirt 8 patch
The Linux NFS RDMA transport has a stupid bug - when the reply doesn't contain
post_op_attr, the data gets offset by 84 bytes (the size of the attributes) and the
first 84 bytes are filled with what is probably random data.
- Support custom hybrid OSD creation (`vitastor-disk prepare --hybrid --fast-devices /dev/xxx,/dev/yyy`)
- Auto-change partition paths to /dev/disk/by-partuuid/ in `vitastor-disk prepare`
- Allow to select cached I/O in vitastor-disk commands
- Fix multiple bugs in vitastor-disk resize & add tests for them
- Fix vitastor-disk write-meta/write-journal in superblock-based mode writing it to an incorrect device
- Fix vitastor-disk prepare sometimes again not seeing new partitions
- Cleanup PG history and stats of deleted pools
- Fix "is already mounted" checks in CSI
New features:
- Support resizing normal vitastor-disk partitions and moving journal/metadata: [vitastor-disk resize](https://vitastor.io/docs/usage/disk.html#resize)
- Support simple forms of vitastor-disk {dump,write}-{meta,journal} for OSD partitions
Bug fixes:
- Fix block RWX volumes broken after introducing stage/unstage support
- Do not allow to create non-block RWX volumes in CSI
- Fix vitastor-disk prepare not seeing the newly created partition in rare cases
- Fix non-array tags not showing up in ls-osd/osd-tree
- Make OpenNebula oned.conf patching during installation smarter
- Fix iseek option in vitastor-cli dd not working
- Validate conv=, iflag=, oflag= options in vitastor-cli dd
- Fix vitastor-disk write-meta not writing header checksum to the disk
- Fix JSON format in vitastor-disk dump-meta
- Fix read_chain_bitmap not working for snapshot in another pool
- Fix a possible OSD crash during parallel read & write to an image with snapshots
- Several followups to the READ_CHAIN_BITMAP fix: avoid data reads, fix possible overflow in is_zero(), fix bitmap size
OSDs could crash with the following "assertion failed" message (the crash didn't affect data
and was caused by the OSD thinking upper blocks were full while they weren't). Reproduction
without introducing artificial delays is hard because you have to force the OSD to read an
object with an enqueued but not yet handled write which fills a previously non-full bitmap. O_o.
```
vitastor-osd: ./src/osd/osd_primary_chain.cpp:613: void osd_t::send_chained_read_results(pg_t&, osd_op_t*): Assertion `stripes[role].read_buf' failed.
```
Hotfixes for OpenNebula and upgrade hotfix for 1.7
- Fix deploy.vitastor, save.vitastor, restore.vitastor scripts not working for nodes other than master oned
- Fix deploy.vitastor not working for VMs without Vitastor disks
- Disable clearing old PG configuration when upgrading from 1.7 or older versions (it was breaking old clients)
Bugfix release, would be 1.7.2, but etcd layout changes mandate it to be 1.8.0. :-)
- Change etcd layout: /config/pgs is now /pg/config, /pg/stats/* is now /pgstats/*.
This is required to fix a rare PG history tracking issue caused by non-atomic
delivery of etcd events sometimes resulting in `incomplete` objects in EC pools
after mass OSD restarts. Upgrading can be performed freely, downgrade requires
additional action: [1.8.0 to 1.7.1](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/docs/usage/admin.en.md#1-8-0-to-1-7-1)
- Fix a rare client hang on PG primary OSD switch
- Fix vitastor-nfs servers started via the mount command sometimes not stopping automatically after unmount
- Fix vitastor-nfs mounts started via the mount command sometimes hanging after daemonizing
- Fix merge/flatten into a pool with different object size (image migration between pools case)
- Do not print extra "PG disappeared after reload" verbose log messages for non-existing PGs
- Fix clustered Antietcd support and persistence filter
- Do not try to purge the same OSD multiple times if its multiple devices are passed to purge
- Various node.js binding fixes
Required mainly for clients; allows scaling parallel client I/O over TCP
from 100-150k iops to ~400k iops and from 2-3 GB/s to at least 7-8 GB/s
with 4 I/O threads, at the cost of increasing Q=1 latency by twice the thread
switching delay, which is ~10 us when CPU powersaving is disabled and may
be as high as 200 us when it's enabled.
A SYNC operation fsyncs only completed operations, so treating writes as "eligible
for fsync" before actually completing them is incorrect.
It affected the SCHEME=ec test_heal.sh test (with immediate_commit=none) - it was
flapping with lost writes: some non-fsynced writes were legitimately lost by
the OSD, but weren't repeated by the client.
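In sketch form, the invariant is simply that a write may only become eligible for fsync after
its completion callback has fired (illustrative code, not the actual OSD implementation):
```
#include <vector>

struct write_op_t
{
    bool completed = false; // set only by the completion callback
    bool synced = false;
};

// A SYNC must only cover completed writes: a still-in-flight write may
// legitimately be lost by the OSD and must be repeated by the client,
// so counting it as "synced" would hide the loss
void collect_for_fsync(std::vector<write_op_t*> & ops, std::vector<write_op_t*> & to_sync)
{
    for (write_op_t *op: ops)
        if (op->completed && !op->synced)
            to_sync.push_back(op);
}
```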
A bunch of monitor fixes
- Add noout flag for OSDs (/vitastor/config/osd/xx)
- Fix "effective" size of degraded PGs (and thus "used space") calculation in monitor
- Fix monitor not clearing PGs of deleted pools
- Fix incorrect PG generation with hosts with 0 OSDs
- Fix monitor crashing during primary OSD recheck when pool has no PGs
- Fix monitor crashing when node_placement included non-existing OSDs
- Fix possible data movement after removing OSDs reweighted to 0
- Remove extra empty keys from pool configurations created by vitastor-cli create-pool
- Fix 32-bit build
New features:
- Implement "hierarchical failure domains" and other complex distribution rules, for example
EC 4+2 over 3 DC, with 2 chunks per each DC ([documentation](docs/config/pool.en.md#level_placement))
- Make OSDs handle ENOSPC - now the cluster stays online even if some OSDs fill up
to 100%, only writes requiring free space hang
- Implement Stage/Unstage & volume locking for CSI to prevent parallel mounting
and/or modifications of the same volume
- Warn about full and almost full OSDs in vitastor-cli status
- Add an experimental NBD netlink map mode as an option ([documentation](docs/usage/nbd.en.md))
- Add --pg parameter to vitastor-cli describe, print objects with 0x in human-readable format too
- Add [administration docs](docs/usage/admin.en.md)
Bug fixes:
- Fix client operation retry timeout - previously the timeout wasn't applied and writes were
retried almost instantly
- Fix monitors crashing on invalid pool configurations
- Fix journaling - make each journal write wait for all previous journal writes
- Fix monitor thinking that OSD weight is 0 after deleting /osd/config/ key online
- Fix a write stall caused by flusher possibly not trimming journal on rollback
- Set 32k csum_block_size for HDD by default in vitastor-disk
After half a year of hard work, VitastorFS is finally here! :-)
New features:
- VitastorFS, a full-featured clustered (read-write-many) file system.
Documentation: [VitastorFS](docs/usage/nfs.en.md)
- Embedded key-value database implementation based on Parallel Optimistic B-Tree
algorithm and used for the metadata of VitastorFS
- Pool management commands in vitastor-cli (create-pool, list-pools, rm-pool, modify-pool).
Thanks MIND Software (https://mindsw.io) for their contribution!
[Documentation](docs/usage/cli.en.md#create-pool)
Bug fixes:
- Fix a very rare "infinite loop" in the client library
- Fix a rare OSD hang during start when zeroing out bad metadata entries left from the previous run
- Do not try to allocate more DB blocks in an inode block until it's "confirmed" and "locked" by the first write
- Do not recheck for new zero DB blocks on first write into an inode block - a CAS failure means someone else is already writing into it
- Throw new allocation blocks away regardless of whether the known_version is 0 on a CAS failure
- do not overwrite a block with older version if known version is newer
(read may start before update and end after update)
- invalidated block versions can't be remembered and trusted
- right boundary for split blocks is right_half when diving down, not key_lt
- restart update also when block is "invalidated", not just on version mismatch
- copy callback in listings to avoid closure destruction bugs too
- Prevent _split types of new blocks
- Stop updating new blocks only after the whole update, otherwise pointers
may become invalid
- Use recheck_none for updates initially
- Use UINT64_MAX as initial block version when postponing ops, otherwise the
check fails when the block is initially empty. This for example leads to
writing both leaf items & block pointers (which is incorrect) into the root
block when starting stress-test with --parallelism 32
- Fix -EINTR comparison
- track block versions correctly - per inode block (128kb) instead of tree block (4kb)
- prevent multiple parallel CAS writes of the same inode block
- add logging for EILSEQ which means invalid data in the tree
- fix get_block updated flag which was true for blocks already in cache and was leading to infinite loops on "unrelated block" errors
- apply changes to blocks in cache only after successful writes (using "virtual changes")
- do not replace cached block with an older version from disk
- recheck "unrelated blocks" (read/update collisions) until data stops changing
- track tree path correctly - do not treat split block as parent of its right half
- correctly move blocks when finding new empty place on disk
- restart updates from the beginning when one of blocks is changed by a parallel update
- fix delete using SET opcode and setting key to the empty value instead
- prevent changing the same key more than 1 time in parallel
- fix listing verification
- resume continue_updates in update_find (required because it uses continue_update itself)
- add allow_old_cached parameter to get()
- Do not use \r if output is not a terminal (should fix unexpected job output in proxmox)
- Fix rm/rm-data error return code, add --down-ok option to bypass the error
- Add EIO retry timeout and allow to disable these retries, rename up_wait_retry_interval to client_retry_interval
- Add ubuntu jammy build
- Wait for blockstore initialisation before starting OSD (prevent timeouts when init takes time)
- Fix a rare use-after-free in automatic sync after delete in blockstore
ASan report: [0] READ of size 16 at operator() /root/vitastor/src/blockstore_write.cpp:100
...[5] blockstore_impl_t::ack_sync(blockstore_op_t*) /root/vitastor/src/blockstore_sync.cpp:232
- Fix another old "BUG: Attempt to overwrite used offset" in a very simple
case: bs=4k rw=write iodepth=16 from OSD start; add this case to tests
- Fix a rare crash with "unexpected state during flush: 0x51" possible with
EC since 1.4.2 during rebalance and OSD outages
- Fix a rare write stall with EC & immediate_commit=none caused by sync
operations reserving unneeded space in the journal
- Fix 32-bit build warnings, most in printf/scanf format strings
Unwavering stabilization of 1.4.x, continued :-)
- Include the accidentally lost part of 1.4.5 journal trimming fix
- Fix a possible OSD crash with "BUG: Attempt to overwrite used offset"
which was probably present for long time, but became apparent after
fixing flapping tests in CI
- Fix remaining flapping tests in CI. It was the first time when tests
actually passed without retries :-)
- Fix a write stall caused by incorrect journal trimming introduced in 1.4.4 :)
- Fix PGs sometimes hanging in "starting" state on mass OSD restarts
- Fix a rare crash with "map::at" during OSD pings
- Use new defaults for non-capacitor (desktop) SSDs - improves T1Q256 random write from ~6k iops to ~45k iops
- Make journal_trim_interval configurable
A couple of fixes for EC pools
- Fix a segfault possible on partial EC overwrite in 1234 -> 5030 rebalance scenario
- Fix two problems leading to EC pools stalling on rebalance & parallel sudden stops
of OSDs, for example during a sudden poweroff of a host:
- Recovery auto-tuning (1.4.0 feature) could apply too large delays and stall
the EC journal - fixed by limiting delays with a new recovery_tune_sleep_cutoff_us
parameter (10 seconds by default) and applying recovery pauses before write
operations, not after them, to not occupy space in the journal for long time
- Dynamic journal space reservation (1.3.0 feature) wasn't accounting new writes
when checking the limit so OSDs could still fill the journal fully and stall -
fixed by including new writes into the limit
- Print etcd dbSize instead of dbSizeInUse in status
Hotfix for hotfix O:-)
- "Write stall fix" was incomplete and EC write stalls could
continue even on 1.4.2. Now they're finally fixed O:-)
- Make monitor ignore statistics of stopped OSDs. Previously if you stopped all
OSDs the last total I/O numbers would remain the same indefinitely
- Log to systemd by default
- Fix excessive autosyncs after every operation with disabled immediate_commit (introduced in 1.1.0)
- Fix a possible write stall with EC due to the lack of OSD wakeup after stabilizing previous writes
- Change sync operation semantics as a final fix to possible write stalls with EC and disabled immediate_commit
- Sync after deleting data in CLI rm / rm-data if immediate_commit is disabled
- Fix OSDs ignoring syncs & autosyncs for delete operations
- Fix OSD space reporting sometimes adding garbage zeros for deleted inodes (causing extra pool/stats etcd keys for deleted pools)
- Speed up monitor failover - change default etcd_mon_ttl from 30 to 5 seconds
- Speed up operation retries - change default up_wait_retry_interval to 50 ms
- Add patch for libvirt 9.10
Should be a final remaining fix to EC + non-capacitor (non-immediate-commit) write hangs :).
First it was breaking non-EC ("instantly stable") writes because they sometimes
complete out of order, which was leading to the following error:
terminate called after throwing an instance of 'std::runtime_error'
what(): BUG: Unexpected dirty_entry 1000000000001:29480000 v65540 unstable state during flush: 0x151
But it is easily fixed by scanning previous and next dirty_entries in mark_stable.
- Fix a monitor crash on primary OSD switching introduced in 1.4.0
- Fix "partly outside array bounds" warnings for GCC 12 in cpp-btree
- Fix a realloc memory leak in theory possible with too large listings (OSD_OP_LIST)
New features:
- Intelligent recovery/rebalance speed auto-tuning to reduce its impact on clients (see README -> Features)
- Auto-restoration of dead VDUSE daemons in CSI plugin
- Add vitastor-disk update-sb command
- Update QEMU for Debian Bookworm to 8.1 and use it for CSI plugin
Bug fixes:
- Fix pools SOMETIMES staying inactive after stopping a node due to OSDs not reacting
to PG state changes caused by incorrect full reload of state from etcd on reconnection
- Make monitors retry pool configuration changes more quickly, which fixes them being unable
to apply changes when an ongoing rebalance is quickly making a lot of PGs clean
- Fix CSI plugin not accepting array of strings as etcd address in /etc/vitastor/vitastor.conf
- Allow multiple interfaces with the same IP address, for "simple routed" full mesh network
- Do not ignore loopback addresses for OSD network (to make ECMP setups with frr possible)
- Fix a rare client crash during OSD reconnections
- Only treat data partitions as existing OSDs in vitastor-disk prepare
- Remove etcd parameter from default command examples
- Fix reported free space sometimes changing non-immediately after deletion of data from OSDs
- Fix a possible OSD crash on print_slow when bs_op is NULL
- Use the same etcd_ws_keepalive_interval in mon as in OSD
- Fix mon not using values from config when /config/global is not present
- Remove pve-storage-portal-dns-list format for vitastor_etcd_address
- Parse log_level in cluster_client
- Fix vitastor-nbd image existence check not working because of non-zeroed inode_watch fields
- Do not warn on EPIPE in client unless log_level is raised explicitly
- Fix incorrect error in CSI when searching for the device in /sys
- Remove 2 last prints to stdout in etcd_state_client
- Fix a possible OSD crash when checking corrupted journal entries
Fixes a possible use-after-free in case of continue_ops() calling try_send(),
then connect_peer() -> set_timer() -> trigger_nearest() -> handle_op_part() -> continue_ops() again
OSDs often change their /pg/history keys during rebalance, so the monitor receives additional
transaction failures from etcd if it re-runs lpsolve, which sometimes may even lead to the
monitor being unable to apply PG changes at all until the rebalance completes
Also add protection from etcd watcher messages being split into multiple websocket
messages - I'm not sure if etcd actually does that, but it's better to have extra
protection anyway.
Also check that all etcd watchers are started in the keepalive routine, otherwise
it sometimes tries to revive etcd watchers starting with revision=1 which obviously
always fails because this revision is nearly always compacted.
All these changes should fix an old rarely reproduced bug where SOMETIMES OSDs
didn't react to PG config changes which was leading to offline pools on node reboot.
It happened on the full reload of state from etcd.
New features:
- RDMA without ODP - much faster and all cards are now supported, not just Mellanox
- VDUSE in CSI - faster, more stable and can even recover after CSI pod restart!
- Reserve journal space for stabilize requests dynamically to prevent stalls under load with EC
- Raise default NBD timeout from 30 to 300 seconds and allow taking it from /etc/vitastor/vitastor.conf
- Remove explicit etcdUrl/etcdPrefix K8S storage class parameter support to prevent
etcd migration issues for volumes created with these parameters
- Support QEMU 8.1 and pve-qemu 8.1
Bug fixes:
- Fix RDMA connection (and thus memory) leak
- Fix rare crashes under load due to incorrect io_uring queue size tracking
- Fix monitor statistics aggregation in case of empty /osd/stats keys
- Fix crash on unknown long argument to vitastor-disk
- Allow trailing comma in JSONs again
- Fix crash on attempts to dump a long listing of objects "to stabilize" or "to rollback" in a slow op
VDUSE has multiple advantages:
- Better performance
- Lack of timeout problems
- And even the ability to recover after restart of the vitastor-csi pod!
ODP is slower than regular RDMA even with the memory copy overhead of the non-ODP mode
Example numbers:
- 3950000 random read iops without ODP vs 240000 iops with ODP
- 1447000 random write iops without ODP vs 101000 iops with ODP
Reference: https://tkygtr6.github.io/pub/ISPASS21_slides.pdf
New features:
- Implement CSI volume expansion
- Implement CSI volume snapshots
- CSI driver now requires Kubernetes >= 1.20
Bug fixes:
- Important bug fix for EC: fix EC n+k, k>=2 read recovery in ISA-L version returning
incorrect data when reading at least the second chunk out of multiple missing chunks
without reading the first one. All users of EC n+k, k>=2 should upgrade as soon as
possible, and upgrade should be conducted with downtime: first stop all clients
(VMs/containers), then all OSDs, then upgrade and restart everything.
- Fix unstable statistics aggregation in monitor (affecting vitastor-cli status and df)
- Make udev not wait for OSDs to start during boot
- Do not report negative numbers of offline PGs in vitastor-cli status when changing PG count
- Report both old and new PG counts in vitastor-cli df when changing it
- Fix OSDs sometimes not starting with "The code only supports journal versions 1 and 2,
but it is 2 on disk" error after upgrading from pre-1.0 versions and letting OSDs run
for some time
- Fix monitors sometimes returning old PG count back after OSD configuration changes
- Make monitor PG changes more stable and timeout errors less probable
New features:
- Implement [client writeback cache](docs/config/client.en.md#client_enable_writeback)
- Add the third I/O mode: [O_DIRECT|O_SYNC](docs/config/osd.en.md#data_io) (good for Optane)
- Reduce load on etcd by splitting OSD lease and statistics reporting intervals:
[etcd_stats_interval](docs/config/osd.en.md#etcd_stats_interval) (default 30 sec)
- Make MON automatically filter OSDs by layout (block_size/immediate_commit/bitmap_granularity)
to prevent "refusing to start PGs of this pool" errors on misconfiguration
- Support running fio benchmarks on systems without io_uring
- Make QEMU driver compatible with QEMU 8.1
- Document usage of [vhost-user-blk](docs/usage/qemu.en.md#vhost-user-blk)
Bug fixes:
- Fix resizing disks in QEMU driver (for example, in Proxmox)
- Fix "unexpected result" in Proxmox driver by making CLI flush output on exit
- Remove unneeded block_size mismatch warnings on pools without matching PGs
- Fix possible segfault in vitastor-cli ls -l (usually with deleted pools)
- Fix QEMU driver compatibility with systems without io_uring
- Fix monitor eating 100% CPU when etcd is down (caused by infinite retries)
- Fix potential incorrect write processing with snapshots (not caught in tests
but could probably lead to client hangs)
- Fix buffer insertion in cluster_client (not caught in tests but could
probably lead to incorrect writes in rare cases)
- Fix rare OSD crash during sync operation processing
- Fix a reenterability issue in cluster_client not reproducible in QEMU/fio,
but reproducible with the currently developed K/V database implementation
- Fix deletion of the first modified object - OSDs could crash if you modified
the same object a lot of times, then deleted it, and then modified it again
- Fix the fio_sec_osd test tool
- Disabled by default, enable with client_enable_writeback=true
- Even then only enabled in FIO when -direct is disabled and in QEMU when
block device cache is enabled in settings
- Can also be enabled in other clients like vitastor-cli using parameter
client_writeback_allowed=true, but not recommended
New features:
- Data and metadata checksums!
- Metadata checksums are always used with new disk format
- Data checksums can be turned on with --data_csum_type crc32c for new OSDs
- Checksum block size can be configured
- inmemory_metadata now also affects keeping checksums in memory
- Linux page cache I/O caching support which can be enabled separately for
data, metadata (including checksums) and journal (O_SYNC instead of O_DIRECT)
- Details [here](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/docs/config/layout-osd.en.md#data_csum_type)
- Backwards compatibility is preserved, you can use new OSDs with old disks
Release also includes bug fixes from [0.9.6](https://git.yourcmc.ru/vitalif/vitastor/releases/tag/v0.9.6).
0.9.6 is moved to "-oldstable" repositories and will be available for some additional time.
Instead, just do not verify checksums of currently mutated objects.
When clean data modification during flush runs in parallel to a read request,
that request may read a mix of old and new data. It may even read a mix of
multiple flushed versions if it lasts too long... And attempts to verify it
using temporary copies of metadata make the algorithm too complex and creepy.
- Fix vitastor-disk partition zeroing (sometimes it was writing garbage instead of zeroes)
- Fix incorrect EC space statistics in `vitastor-cli status`
- Several bug fixes for NFS:
- Add . and .. in NFS directory listings
- Return FILE_SYNC from NFS writes if immediate_commit is enabled
- Return the same "verifier" in NFS COMMIT as in NFS WRITE
- Make parallel NFS extending writes work correctly, without conflicts
- Handle parallel NFS extending writes without imposing extra load on etcd
- Support UTF-8 in vitastor-cli table output
- Also allow "0" and "no" as false for inmemory_metadata and inmemory_journal
- Use HDD defaults for HDD-only setups in automatic `vitastor-disk prepare` mode
This fixes buffered (non-O_DIRECT) NFS writes in Linux - previously they were
hanging in an infinite loop because COMMIT didn't return the same verifier as
previous WRITEs, and the NFS kernel client was infinitely retrying the same writes.
This also probably allows for correct NFS failover, at least for the same
buffered writes, because NFS clients repeat all write requests until a COMMIT
confirms them.
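In sketch form, the NFSv3 rule this fix follows is that the server returns one stable write
verifier from both WRITE and COMMIT replies, changing it only on restart (a simplified
illustration, not Vitastor's actual code):
```
#include <stdint.h>
#include <time.h>

// One 8-byte verifier per server instance: the client compares the verifier
// from COMMIT replies with the one from WRITE replies and resends all
// uncommitted writes if it differs (i.e. if the server restarted)
static uint64_t write_verifier;

void nfs_server_init()
{
    write_verifier = (uint64_t)time(NULL); // changes only on server restart
}

uint64_t reply_verifier_for_write()
{
    return write_verifier;
}

uint64_t reply_verifier_for_commit()
{
    return write_verifier; // must match the WRITE verifier
}
```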
Previously, multiple parallel writes extending file size through NFS were
racing with each other and triggering deletions of part of the written data.
I.e. if you mounted vitastor-nfs and just copied a file into it in MC, then
you could end up with only a part of the file actually written.
- Improve QEMU driver performance by integrating io_uring in it (up to 1.5x total iops improvement)
- Fix QEMU driver deadlocks which started to reproduce in qemu-img after iothread fixes
- Fix `vitastor-cli status` reporting more etcds than actually exist (fix etcd address duplication in config on reload)
- Fix `vitastor-cli ls` crashing on inodes in non-existing pools
- Delete old garbage /pool/stats/ keys for non-existing (deleted) pools
- Reduce memory usage of etcds initialized by make-etcd script
- Fix OSDs almost always crashing on etcd restart due to "revisions were compacted" (support reloading state from etcd)
- Fix a crash and a stall possible mostly in HDD setups with small journal and big (512k, 900k) random writes
- Add notes about HDDs to documentation. You are officially allowed to use HDD-only Vitastor with HGST/Toshiba/EXOS :)
The deadlock was caused by switching QEMU coroutines directly inside the
vitastor_co_read_bitmap_cb() callback. The correct way is to schedule a BH
(a BH is the QEMU term for setImmediate() :)), same as in the read and write callbacks.
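A sketch of the safe pattern inside a QEMU block driver (this only compiles inside the QEMU
source tree; aio_bh_schedule_oneshot() and qemu_coroutine_enter() are real QEMU APIs, while
the request struct and function names here are hypothetical):
```
#include "qemu/osdep.h"
#include "qemu/coroutine.h"
#include "block/aio.h"

// Hypothetical request state - the real driver's fields differ
typedef struct VitastorBHRequest {
    AioContext *ctx;
    Coroutine *co;
    long ret;
} VitastorBHRequest;

static void vitastor_bh_wake(void *opaque)
{
    VitastorBHRequest *req = opaque;
    qemu_coroutine_enter(req->co);
}

// Completion callback: scheduling a BH here instead of entering the
// coroutine directly is what avoids the deadlock
static void vitastor_read_bitmap_done(VitastorBHRequest *req, long retval)
{
    req->ret = retval;
    aio_bh_schedule_oneshot(req->ctx, vitastor_bh_wake, req);
}
```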
Fixes two bugs found during HDD testing :-)
1) OSD crashed with "BUG: Attempt to overwrite used offset of the journal" during
`fio -bs=900k -iodepth=128` test with 16 MB journal
2) OSD stalled during `fio -bs=512k -iodepth=128` test with 64 MB journal
Before this change, OSDs almost always died when one of the etcds was restarted,
even though the rest of them were still in quorum and the lease was still active
- Add patch for libvirt 9.0
- Add support for Proxmox VE 8.0
- Fix compatibility of the QEMU driver with iothread (QEMU rebuilds are coming)
- Fix vitastor-cli rm-data/rm/merge hanging when some OSDs are down.
Allow deletions in unclean cluster at the cost of some data possibly
"reappearing" when those OSDs start back. In that case you can just repeat
the deletion request using rm-data.
- A bunch of bug fixes for snapshots:
- Fix snapshot reads often not working at all with snapshot chain size > 2
- Fix optimized snapshot data merge (children to parent)
- Fix updating of image name index key during optimized merge
- Fix auto-selection preventing the use of optimized merge with only 1 snapshot
- Fix incorrect CAS retries during snapshot merge
- Fix snapshot merge progress reporting
- Fix primary_read bitmap buffers use-after-free which could lead to
incorrect allocation map reads
- Remove /usr/local/bin path from make-etcd
- Some documentation fixes
- Measure and report scrub I/O statistics in vitastor-cli status
- Make aggregated statistics in vitastor-cli status much smoother
(first derive, then sum instead of first summing and then deriving)
- Fix an old rare bug leading to journal corruption
(try to use scrub if you think you're affected...)
- Do not start EC PGs without at least <data chunks> OSDs in each old set
(prevents spurious read errors with EC during reconnections/restarts)
- Fix failed assert(!scrub_list_op) on OSD restart with pending scrubs
- Fix future planned scrubs not starting because of incorrect time comparison
- Build packages for Debian 12 (Bookworm)
The consequence of this issue was that in some very rare cases (only reproduced
under load in CI when running 4+ tests in parallel) small write data written to
journal could overwrite journal entries.
Also add an assert-type safety check to be able to catch this issue in the
future again in case of a regression.
- Fix "Client XX command out of sync" messages sometimes happening on OSD reconnections
- Fix a bug where EC reads parallel with writes to the same object failed with -ERANGE error
- Slightly reduce the amount of metadata writes during journal flushing
- Correctly unmap NBD volumes when Proxmox forces map_volume use (with SWTPM and maybe some other cases)
New features:
- Scrubbing! Check documentation: [auto_scrub](src/branch/master/docs/config/osd.en.md#auto_scrub)
- Document online-updatable configuration parameters
Bug fixes:
- Fix NaN during PG optimisation if there are nonexisting OSDs in node_placement
- Fix monitor crash on pool deletion
- Clear journal_device and meta_device before initialising the next OSD in automatic mode
- Sync unsynced deletes before overwriting them with a lower version
(reproduced mostly/only after scrubbing)
- The tests are now stable and run in a CI system based on Gitea CI
- The release includes final bug fixes for EC:
- Implement missing EC recovery of allocation bitmap when built with ISA-L
- Fix broken snapshot export with EC (allocation bitmap reads were giving incorrect results previously)
- Also fixed bugs manifesting under heavy load:
- Fix monitor possibly applying incorrect PG history on retries
- Fix monitor incorrectly changing PG count when last_clean_pgs contains less PGs than the new number
- Allow writes to wait for free space again, but now correctly (previously dropped in 0.8.2)
- Fix a rare segfault in client (handle client stop during incoming stream handling in 1 more place)
- Make monitor correctly handle etcd connection errors - it could die instead of connecting to another etcd
- Fix OSD rarely being unable to report PG states after a PG was taken over by another OSD
- Fixed return code for incomplete EC objects (now EIO) and made cluster client retry this error
- Made other small changes for tests: timeouts, nice/ionice for etcd, waiting conditions, NBD device checks and so on
Monitor could deceive itself by immediately saving PG configuration changes
which weren't applied to etcd yet in memory, and apply incorrect PG history
changes next time if the first update fails.
This usually only happened under heavy load and was caught in CI. :-)
- Fix vitastor-cli rm/rm-data broken in 0.8.6 (missing messenger initialization)
- Prepare OSD read handler for upcoming version with scrub - allow "secondary reads" to return errors
- Fix OSDs re-peering PGs infinitely with a big number of PGs (reproduced in test_add_osd)
- Fix another variant of flusher sync-waiting stall (reproduced in test_write)
- Fix other tests in tests/ (will add them to Gitea CI soon)
- Add patches for QEMU 6.2-8.0
- Fix QEMU driver compatibility with QEMU 8.0
- Build packages for RHEL 9 clones (based on AlmaLinux 9)
This release includes a bunch of important bugfixes for erasure-coded setups
with disabled immediate_commit. After these fixes, "test_heal" OSD killing test
now passes fine with EC:
- Fix cluster write stalls with "Error while doing flush on OSD xx: -16 (Device or resource busy)"
in OSD logs possible in EC setups with disabled immediate_commit by selectively
syncing nonsynced objects on STABILIZE/ROLLBACK (https://github.com/vitalif/vitastor/issues/51)
- Fix other EC + disabled immediate_commit problems:
- Fix "opcode=5 retval=-2" errors happening on SYNC retries
- Fix non-working "pagination" during PG dirty object flushing
- Fix write operations not continued correctly after dirty object flushing
- Fix incorrect parity read-modify-write calculation when writing into a lost chunk
- Fix OSDs losing left_on_dead PG state of non-clean PGs and thus not removing junk data in the cluster
- Fix a small memory leak caused by bad indexing of EC recovery matrices
- Fix a rare use-after-free in cluster_client caused by a reenterability issue
- Fix vitastor-cli create command syntax in the CSI driver
- Allow to start OSDs without local store for tests
- Fix memory allocation error in disk_tool_meta for non-standard metadata block sizes
- Fix delete operations received before loading pool metadata crashing OSDs with "null pointer exception"
- Improve "theoretical performance" Russian documentation
New features:
- Implement online configuration update for some parameters. Documentation is coming soon :)
Important fixes:
- Fix possibly incorrect EC parity chunk updates with EC n+k, k > 1 and when
the first parity chunk is missing
Minor fixes and improvements:
- Fix incorrect EC free space statistics in vitastor-cli df output
- Speedup vitastor-cli startup in clusters with RDMA
- Remove unused PG "peered" state (previously used to update PG epoch)
- Use sfdisk with just --json in vitastor-disk (--dump --json isn't needed)
- Allow trailing comma in sfdisk output (fixes sfdisk 2.36 compatibility)
- Slightly improve RDMA send/receive code
- Reduce RDMA memory consumption by default (rdma_max_recv/send = 16/8)
- Use vitastor-cli instead of direct etcd interaction in the CSI driver
- Fix a possible "double free" bug in the client library happening on OSD restart
- Fix a possible write hang on PG history update when only epoch is changed
- Fix incorrect systemd target "local.target" in mon/make-etcd
- Allow "content" option in PVE storage plugin to allow to enable containers
- Build client library without tcmalloc which fixes "attempt to free invalid pointer"
errors when, for example, trying to run QEMU with both Vitastor and Ceph RBD disks
Fixes "[src/tcmalloc.cc:332] Attempt to free invalid pointer ..." when trying
to run QEMU with both Vitastor and Ceph RBD disks and other possible allocator
collisions.
New features:
- Implement QCOW2 image/snapshot export via qemu-img (bdrv_co_block_status in the driver)
- Remove OSDs from PG history during `vitastor-cli rm-osd` to prevent `left_on_dead` PG states after deletion
- Add a new recovery_pg_switch setting to mix all PGs during recovery, to almost
fully reduce the probability of ENOSPC during rebalance
- Introduce partial ENOSPC ("OSD is full") handling - now ENOSPC doesn't turn
into cascades of crashes
- Add migration support to Proxmox VE Vitastor driver
- Track last_clean_pgs on a per-pool basis thus reducing data movement in a cluster
with pools remaining unclean/degraded for a long time
Bug fixes:
- Fix a bug where monitor could generate degraded PGs if one of the hosts had no OSDs
- Fix a bug where monitor could skip PG redistribution with a lot of OSDs in cluster
- Report PG history synchronously on the first write, which improves PG consistency
and availability at the same time, because history now gets reported correctly
and doesn't get reported without the need for it
- Fix possible write and recovery stalls which could happen in a cluster with both EC and replicated pools
- Make OSD and monitors sanitize & deduplicate PG history items in etcd
- Fix non-working OSD peer config safety check
- Fix a rare journal flush stall where flushing wasn't activated with full journal, but with empty flush queue
- Fix builds without ISA-L (jerasure-only) crashing with EC N+K, K>=2 due to the lack of 16-byte buffer alignment
- Fix a possible crash for EC N+K, K>=2 when calculating a parity chunk with previous parity chunk missing
- Fix a bug where vitastor-disk purge with suppressed warnings didn't work
Sync before listing was added to wait for all PG writes possibly left in queue
from the previous master to finish before listing it
But in fact it may block the cluster when EC is used and some unstable writes
are left in the queue - they block journal flushing, rollback/stabilize is
required to unblock them, but rollback/stabilize may only happen after PG is
peered. But peering needs listings, listings are requested only after sync, and
sync itself waits for currently blocked writes waiting in the queue
This has 2 effects:
1) OSD sets aren't added into PG history until actual write attempts anymore
which removes unneeded extra osd_sets in PG history
2) New OSD sets are reported synchronously and can't be lost on PG restarts
happening at the same time with reconfiguration
- Implement a new "vitastor-disk purge" command to remove OSDs with safety checks
- Implement a new "vitastor-cli rm-osd" command to only remove OSD metadata from etcd
- Fix a bug where the monitor could ignore OSD removal and other /osd/stats key changes
- Fix a bug where garbage could be returned when reading objects being written at the same time
- Fix a rare write stall where journal space could fail to be reclaimed when there
were no new operations in the flush queue
- Fix a rare peering stall caused by a previous attempt to limit the queues of long listing operations
- Fix total object count statistic in OSD on object creation
- Add missing offset&len into vitastor-disk dump-journal for big_writes, fix JSON format
- Make vitastor-cli print help on missing command
- Make vitastor-cli translate all '-' to '_' in CLI options
"In-flight" versions are added into dirty_db when writes are enqueued. And they
weren't ignored by subsequent reads even though they didn't have data location yet.
This bug was leading to test_heal.sh not passing sometimes with replicated setups.
- Fix QEMU driver compatibility with QEMU 7.0 and < 2.9
- Add patches for pve-qemu-kvm 7.1 (PVE 7.3) and pve-qemu-kvm 6.2 (PVE 7.2)
- Fix Proxmox driver location in the pve-storage-vitastor package
- Disable HDD autodetection in non-hybrid mode
- Explicitly warn about buggy kernels on -EAGAIN in io_uring
- Final fix for the lack of zeroing out of old metadata entries
(do not crash with "big_write journal_entry was allocated over another object"
in some cases after an unclean OSD shutdown)
- Wait for data writes before fsyncing data if data fsync is enabled
- Never try to wait for free space inside blockstore thus stalling OSDs
- Fix a rare crash in osd_peering due to callback ordering
- Fix a rare duplication of ping & op message IDs
- Fix a rare use-after-free during pings
- Add --force to vitastor-disk read-sb
- Make vitastor-disk dump metadata object IDs in hex, add forgotten commas
- Fix vitastor-disk SCSI disk cache check
If a crash occurs while flushing a redirect-write, it may happen that the disk
contains both old and new metadata entries. This is OK, but prior to 0.8.0,
OSDs started without problems after this situation, but then they crashed after
some more overwrites with a "tried to overwrite non-zero metadata entry" error.
0.8.0 introduced a change that was intended to fix this situation, but rather
than fixing it, it prevented OSDs from starting, now because of a "big_write
journal_entry was allocated over another object" error... :-)
This change finally fixes the original issue.
Followup to 54ef2c389f
- Remove an additional data copy operation when flushing journal (should
slightly increase write performance)
- Fix a bug where new writes in the inmemory_journal=false mode could overwrite
the data currently read by a parallel read operation
- Fix degraded parity writes for EC N+K when K>1 where the bug could also lead
to an "assertion failed" error
- Fix missing journal space check for "big" writes which could lead to
"prefill_single_journal_entry(): assertion failed..." error in OSD
- Fix possible "assertion failed: next->prev_wait >= 0" in client in rare cases
- Fix missing "len" field in vitastor-disk write-journal big_writes
- Fix possible crash of a full OSD (ENOSPC)
- Fix CSI build scripts to include newest packages every time
- Fix CSI endpoint in the liveness probe manifest
- Implement automatic OSD activation via udev and simple on-disk superblock storage
- Add a new `vitastor-disk` tool and merge all disk-related functionality there.
Now it can prepare new OSD disks, upgrade plain old systemd units to the new scheme,
resize OSD data area, manage OSD services by disk paths, manage superblocks,
automatically check and disable disk cache, dump and write back journal and metadata.
- Add a documentation section about `vitastor-disk` (read it if you want details!)
- Install systemd services during package installation instead of the older method
of manually creating them via separate shell scripts
- Add a new `make-etcd` script that reuses /etc/vitastor/vitastor.conf to configure etcd
- Allow to configure block_size, bitmap_granularity and immediate_commit per-pool
- Fix "fatal error: tried to overwrite non-zero metadata entry" which was possible
in some cases after unclean OSD shutdown (caused by old metadata entries not being zeroed)
- Add ISA-L erasure code implementation, now used automatically instead of jerasure when available
- Fix listings sending too many parallel requests to OSDs
- Fix rm-data crashing with --wait-list
- Remove empty inodes from statistics and `ls` output, after <inode_vanish_time> seconds after deletion
- Make monitor delete pool statistics when the pool is deleted and thus remove them from `df` output
- Log multiple etcd addresses in OSD logs correctly
- Fix true/false parsing in json configs like no_recovery/no_rebalance
- Show no_recovery, no_rebalance, readonly flags in status
- Add documentation! :-) in Russian and English
- Implement an NFS proxy that emulates file-based access to Vitastor
images for non-QEMU based hypervisors like VMWare, as a better way
than iSCSI
- Implement "primary affinity tags"
- Add a patch for libvirt 6.0
- Fix free_down_raw in cli status
- Fix a rare bug where OSDs could drop unrelated connections on errors
Return results and errors in a variable instead of just printing them,
separate vitastor-cli main() from cli_tool_t, move positional argument
parsing to CLI main from command implementations.
- Fix incorrect reading of extra metadata block leading to extra unknown objects in stats
- Fix CSI driver volumeMode: Block support
- Add block PVC and pod examples
- Fix build under 32 bit architectures
- Fix slow connection ramp-up caused by up_wait_retry_interval
- Implement `vitastor-cli status` (print cluster status) command
- Add a new `make-osd-hybrid.js` script to quickly prepare a lot of hybrid (HDD+SSD) OSDs
- Implement snapshot deletion for Cinder driver (only works in a healthy cluster)
- Fix a huge :) bug causing reads to return all zeroes during rebalance. Add a test to prevent it in the future
- Disconnect NBD proxy correctly without leaving a zombie [vitastor-nbd] process in D state
- Fix a rare write hang appearing with small write throttling enabled
- Fix IPv6 address parsing
- Fix "cannot read bytes of undefined" in the monitor on a fresh DB
- Fix possible hangs of read requests on OSD restarts without immediate_commit=all mode
- Fix OSDs skipping misplaced recovery in some cases
- Fix OSDs possibly dying with "map::at" errors when other OSDs are stopped
- Fix division by zero in ls if all pool OSDs are down
- Fix client hangs possible on OSD restarts (bug affected versions from 0.5.11)
- Fix "Assertion `sqe != NULL' failed" io_uring-related crashes possible
on some kernels (0.6.11 increased probability of this bug)
- Fix timeout=0 in NBD proxy
- Fix build under CentOS 7
The problem is that in recent kernels io_uring may return completions BEFORE
clearing the submission queue. For example, if its capacity is 512 and there
were 512 requests, then when one completion is processed the queue "should
have" 1 free slot, but sometimes it doesn't, because io_uring doesn't always
clear the submission queue before sending the CQE :-/
Fixes client hangs possible after stopping & restarting an OSD.
Hangs happened when a connection was closed in the middle of reading a READ
operation reply from the network. In this case the operation being read was
in read_op and the client didn't free it when closing the connection.
Test case for msgr_read.cpp:
- Partially read reply for a READ operation
- stop_client()
- Check that the READ operation returns EPIPE
The bug was actually introduced in 0.5.11.
etcd connection stability, Clang & Elbrus support
- Fix build under CLang and Elbrus LCC compilers, making Vitastor compatible
with Elbrus CPUs :)
- Completely fix the bug where OSDs didn't connect to peers and incorrectly marked
PGs as incomplete
- Limit I/O depth for deletes the same way as for small writes. This makes OSD
crashes with "Assertion failed: sqe != NULL" during image deletion go away
- Fix a very old, but rare, journaling bug (credits to https://github.com/mirrorll)
- Fix flushing of unclean journaled objects leading to OSDs sometimes hanging
after failover in EC setups (bug was introduced in 0.6.7)
- Fix several problems that could prevent smooth operation of a Vitastor cluster
under the condition of partial etcd failure:
- OSDs could randomly fail due to too strict error handling
- New clients and OSDs could be unable to start because of the lack of retries
- CLI could fail some commands because of the lack of retries
- Monitor could stop receiving state updates because of the lack of websocket pings
- Fix monitor being unable to rebalance PGs after a downscale of pool pg_size (3->2)
- Exit with failure when trying to nbd map or benchmark a non-existing image
- Use HTTP keep-alive for etcd connections
- Allow to configure etcd request timeouts and retries
- Allow to configure NBD timeout, max devices and partitions, and set default to
up to 64 devices with up to 3 partitions each
Build problems fixed:
- void* pointer arithmetic which is a GNU extension (works as byte*)
- "variable size object may not be initialized" which is OK under GCC
- nullptr_t related error in json11 (it lacks 'operator <' in clang)
Warnings fixed:
- empty nested struct initializer { 0 } replaced by {}
- removed several unused lambda captures
Preliminary results:
- CPU usage drops significantly. For example, in T1Q8 128K write test against
stub_uring_osd with 10G network and Athlon X4 860k CPU it drops from 100% to 30%
- Latency becomes slightly worse. In T1Q1 4K write test in the same environment
latency increases from 56 to 63 us.
- Small write throughput also becomes slightly worse. In a T1Q128 4K write test
against the stub, iops decrease from 138k to ~110k (unstable, fluctuating between 100k and 120k).
Note that this is without io_uring, of course.
- Slightly reduce journaling write amplification (requires no_same_sector_overwrites=false)
- Fix listen_backlog (it was 0) because it could more than halve OSD socket send speed
- Support IPv6 OSD addresses
- Do not try to initialize client in simple-offsets
- Fix OSDs sometimes marking PGs incomplete instead of trying to connect with peers
- Allow to configure OSD placement in node_placement
- Allow to run with 4k sector size block devices. It's natural, but was previously forbidden
Slightly reduces WA. For example, in 4K T1Q128 replicated randwrite tests
WA is reduced from ~3.6 to ~3.1, in T1Q64 from ~3.8 to ~3.4.
Only effective without no_same_sector_overwrites.
- Implement a storage plugin for Proxmox. Now you can use Vitastor with Proxmox!
- Implement `vitastor-cli df` (pool space usage statistics) command
- Add glob pattern support for `vitastor-cli ls`
- Fix several bugs in other CLI commands (resize, create --parent, modify --readonly)
- Use 512 byte logical block size in QEMU driver by default (and thus don't require to set it in QEMU options)
New features:
- Build Vitastor driver as part of QEMU
- Implement renaming images in CLI (vitastor-cli modify --rename)
- Add vitastor-cli alloc-osd and simple-offsets commands and use them in make-osd,
thus removing the dependency on etcdctl
- Make monitor remove stale deleted inode statistics from etcd automatically
- Implement OSD address selection from a subnet, thus removing the need to specify
OSD addresses in startup scripts explicitly
Bug fixes:
- Fix client failover in case of etcd shutdown or crash (make client survive etcd failures)
- Stick to the last live etcd in OSD and mon to prevent random failures when one of etcds is down
- Fix incorrect copying of data from journal to the data device which could lead to data corruption
- Prefer local etcd IPs in OSD
- Remove the total PG count restriction in optimize_change which sometimes made it
impossible to redistribute PGs over OSDs
- Fix error response parsing on a failed pg state report
- Fix slow linear writes with RDMA by changing default buffer settings
- Fix possible 'TypeError' in OpenStack Nova when using the Vitastor Cinder driver
- Fix bugs in vitastor-cli create, ls, rm, modify commands
Patch changes:
- Add a patch for libvirt 7.6
- Add patches for QEMU 6.0 and 6.1
- Fix config file path XML location parsing in libvirt patches
- Replace _ with - in QEMU options
- Fix possible crashes of QEMU block driver in case of incorrect options
The buffer size was changed to 129K to leave extra space for the header.
The problem with 8x 1M buffers is that the following happens with,
for example, 2 OSDs and a 4M T1Q1 write:
- Server posts 8 receives
- Client posts 8 sends
- WRs are processed by the RDMA stack, but the OSD doesn't have the time
to handle them and doesn't refill buffers
- Client posts 1 more send
- RNR retransmission happens and performance drops to zero
Overall it seems that RDMA support should be reworked to use real "RDMA"
operations, i.e. operations writing into remote memory. This has the
additional advantage of avoiding a copy on the receive side of the OSD.
- Implement CLI commands for listing, viewing I/O statistics, creating,
snapshotting, cloning, resizing and modifying images. All these operations
are covered by 3 commands: ls, create, modify
- Implement an important fix to prior OSD set tracking for PGs. The previous
version had an issue which could lead to data loss due to an OSD with an older
copy of the data thinking it had the newest copy
- Fix I/O statistics aggregation in the monitor
- Several minor fixes for Cinder driver
- Fix QEMU driver to be compatible with QEMU 2.x > 2.0
- Fix stalls sometimes possible in configurations without immediate_commit due
to insufficient amount of automatic internal fsync operations
- Add `vita` alias for `vitastor-cli`
Required to prevent data loss due to activation of an OSD with older data
when the PG OSD set change doesn't occur. I.e. it fixes the simplest case:
- Run 2 OSDs with 1 PG
- Start writing into the PG
- Stop OSD 2
- Stop OSD 1
- Start OSD 2
After this change the PG will refuse to start after the last step.
- New command-line tool: vitastor-cli
- Implement layer (snapshot/clone) merge and delete
- Remove 'bool' from the C header
- Fix a very rare flusher stall
- More diagnostics now printed for slow ops in the log
- Basic support for OpenStack: Cinder driver, patches for Nova and libvirt
- Add missing "image" and "config_path" QEMU options
- Calculate aggregate per-pool statistics in monitor
- Implement writes with Check-And-Set semantics
- Add a C wrapper library with public header
It can't delete snapshots yet because Vitastor layer merge isn't
implemented yet. You can only delete volumes with all snapshots.
This will be fixed in the near future.
From now on, reads will return the server-side object version numbers,
and writes and deletes will have an additional "version" parameter
which, if set to a non-zero value, is atomically compared with
the current version of the object plus 1; the modification
fails if they don't match.
This feature opens the road to correct online flattening of snapshot
layers and other interesting things.
* For QEMU 3.0+: `<qemu>/qapi` → `<vitastor>/qemu/b/qemu/qapi`
* For QEMU 2.0+: `<qemu>/qapi-types.h` → `<vitastor>/qemu/b/qemu/qapi-types.h`
- `config-host.h` and `qapi` are needed because they contain autogenerated headers
- Install fio 3.7 or newer, take its source package and symlink it to `<vitastor>/fio`.
- Build and install Vitastor with `mkdir build && cd build && cmake .. && make -j8 && make install`.
  Note the cmake variable `QEMU_PLUGINDIR` - on RHEL it must be set to `qemu-kvm`.
## Running

Warning: the procedure is still quite non-trivial - the configuration and on-disk
offsets have to be set almost manually. This will be fixed in the near future.
- SATA SSDs or NVMe drives with capacitors (server-grade SSDs) are recommended. Desktop SSDs
  can also be used with lazy fsync mode enabled, but single-threaded write performance
  will suffer in that case.
- A fast network, at least 10 Gbit/s.
- For the best performance, disable CPU power saving: `cpupower idle-set -D 0 && cpupower frequency-set -g performance`.
- Put the values you need into the variables at the top of `/usr/lib/vitastor/mon/make-units.sh` and `/usr/lib/vitastor/mon/make-osd.sh`.
- Create systemd units for etcd and the monitors: `/usr/lib/vitastor/mon/make-units.sh`
- Create units for the OSDs: `/usr/lib/vitastor/mon/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
- You can change OSD parameters in the systemd units (see the sketch after this list). The meaning of some parameters:
  - `disable_data_fsync 1` - disables fsync, used with SSDs with capacitors.
  - `immediate_commit all` - used with SSDs with capacitors.
  - `disable_device_lock 1` - disables device file locking; only needed if you run
    multiple OSDs on one block device.
  - `flusher_count 256` - a "flusher" is a micro-thread that removes old data from the journal.
    Don't worry about this setting, 256 is now enough in almost all cases.
  - `disk_alignment`, `journal_block_size`, `meta_block_size` should be set to the internal
    block size of the SSD. This is almost always 4096.
  - `journal_no_same_sector_overwrites true` prohibits repeated overwrites of the same journal
    sector during writes. Most (99%) SSDs don't need this option, but it turned out that the
    drives used on one of the test stands - Intel D3-S4510 - strongly dislike such overwrites,
    so this option was added for them. When this mode is enabled you also need to raise
    `journal_sector_buffer_count`, because otherwise Vitastor won't have enough buffers for
    journal writes.
- Start all etcds: `systemctl start etcd`
- Create the global configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
  (if all your drives are server-grade with capacitors).
- Create the pools: `etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.
  For jerasure EC pools the configuration should look like this: `2:{"name":"ecpool","scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}`.
- Start all OSDs: `systemctl start vitastor.target`
- Your cluster should now be ready - one of the monitors should have already configured the PGs, and the OSDs should have started them.
- You can check PG states directly in etcd: `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. All PGs should be 'active'.
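As a hypothetical illustration of how these parameters combine, here is a sketch of an OSD command line (the `vitastor-osd` binary name and the `--osd_num`/`--data_device` flags are assumptions here, not copied from make-osd.sh):

```
# Illustrative sketch only: an OSD started with the options discussed above,
# suitable for a server-grade SSD with capacitors and 4096-byte internal blocks.
# The binary name and the --osd_num/--data_device flags are assumptions.
vitastor-osd \
    --osd_num 1 \
    --disable_data_fsync 1 \
    --immediate_commit all \
    --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \
    --journal_no_same_sector_overwrites true \
    --journal_sector_buffer_count 1024 \
    --data_device /dev/disk/by-partuuid/XXX
```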
### Name an image
```
etcdctl --endpoints=<etcd> put /vitastor/config/inode/<pool>/<inode> '{"name":"<name>","size":<size>[,"parent_id":<parent_inode_number>][,"readonly":true]}'
```
For example:
```
etcdctl --endpoints=http://10.115.0.10:2379/v3 put /vitastor/config/inode/1/1 '{"name":"testimg","size":2147483648}'
```
If you specify parent_id, the image becomes a CoW clone, i.e. all new write requests go to the new
inode, while reads check it first and then the parent layers up the chain. To avoid accidentally
overwriting data in the parent layer, you can switch it into read-only mode by adding the
`"readonly":true` flag to its metadata entry. The parent image then effectively becomes a snapshot.
So, to create a snapshot you just rename the previous inode (for example, from testimg to testimg@0),
make it readonly and create a new layer with the original image name (testimg) that references the
just-renamed one as its parent.
- Saturated parallel write iops: min(network bandwidth, sum(disk write iops * number of data drives / (number of data + parity drives) / write amplification)).
In fact, you should use the disk write iops measured under a ~10% read / ~90% write load in this formula.
Write amplification for 4 KB blocks is usually 3-5 in Vitastor:
1. Journal block write
2. Journal data write
3. Metadata block write
4. Another journal block write for EC/XOR setups
5. Data block write
If you manage to get an SSD which handles 512 byte blocks well (Optane?) you may
lower 1, 3 and 4 to 512 bytes (1/8 of data size) and get WA as low as 2.375.
Lazy fsync also reduces WA for parallel workloads because journal blocks are only
written when they fill up or fsync is requested.
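To make the formula above concrete, here is a hedged worked example with made-up numbers: 24 drives in a 4+2 EC pool, each sustaining 60000 write iops under a mixed load, a write amplification of 4, and a network that is not the bottleneck:

```
\text{saturated write iops}
  = \min\left(\text{network},\ \sum_{\text{drives}} \frac{iops_{disk} \cdot \frac{N_{data}}{N_{data}+N_{parity}}}{WA}\right)
  = \frac{24 \cdot 60000 \cdot \frac{4}{6}}{4}
  = 240000\ \text{iops}
```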
## Example Comparison with Ceph
Hardware configuration: 4 nodes, each with:
- 6x SATA SSD Intel D3-S4510 3.84 TB
- 2x Xeon Gold 6242 (16 cores @ 2.8 GHz)
- 384 GB RAM
- 1x 25 GbE network interface (Mellanox ConnectX-4 LX), connected to a Juniper QFX5200 switch
CPU powersaving was disabled. Both Vitastor and Ceph were configured with 2 OSDs per 1 SSD.
All of the results below apply to 4 KB blocks and random access (unless indicated otherwise).
Raw drive performance:
- T1Q1 write ~27000 iops (~0.037ms latency)
- T1Q1 read ~9800 iops (~0.101ms latency)
- T1Q32 write ~60000 iops
- T1Q32 read ~81700 iops
Ceph 15.2.4 (Bluestore):
- T1Q1 write ~1000 iops (~1ms latency)
- T1Q1 read ~1750 iops (~0.57ms latency)
- T8Q64 write ~100000 iops, total CPU usage by OSDs about 40 virtual cores on each node
- T8Q64 read ~480000 iops, total CPU usage by OSDs about 40 virtual cores on each node
T8Q64 tests were conducted over 8 400GB RBD images from all hosts (every host was running 2 instances of fio).
This is because Ceph has performance penalties related to running multiple clients over a single RBD image.
cephx_sign_messages was set to false during tests, RocksDB and Bluestore settings were left at defaults.
In fact, not that bad for Ceph. These servers are an example of well-balanced Ceph nodes.
However, CPU usage and I/O latency were through the roof, as usual.
Vitastor:
- T1Q1 write: 7087 iops (0.14ms latency)
- T1Q1 read: 6838 iops (0.145ms latency)
- T2Q64 write: 162000 iops, total CPU usage by OSDs about 3 virtual cores on each node
- T8Q64 read: 895000 iops, total CPU usage by OSDs about 4 virtual cores on each node
- Linear write (4M T1Q32): 2800 MB/s
- Linear read (4M T1Q32): 1500 MB/s
T8Q64 read test was conducted over 1 larger inode (3.2T) from all hosts (every host was running 2 instances of fio).
Vitastor has no performance penalties related to running multiple clients over a single inode.
If conducted from one node with all primary OSDs moved to other nodes the result was slightly lower (689000 iops),
this is because all operations resulted in network roundtrips between the client and the primary OSD.
When fio was colocated with OSDs (like in Ceph benchmarks above), 1/4 of the read workload actually
used the loopback network.
Vitastor was configured with: `--disable_data_fsync true --immediate_commit all --flusher_count 8
- Check `/usr/lib/vitastor/mon/make-units.sh` and `/usr/lib/vitastor/mon/make-osd.sh` and
put desired values into the variables at the top of these files.
- Create systemd units for the monitor and etcd: `/usr/lib/vitastor/mon/make-units.sh`
- Create systemd units for your OSDs: `/usr/lib/vitastor/mon/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
- You can edit the units and change OSD configuration. Notable configuration variables:
  - `disable_data_fsync 1` - only safe with server-grade drives with capacitors.
  - `immediate_commit all` - use this if all your drives are server-grade.
  - `disable_device_lock 1` - only required if you run multiple OSDs on one block device.
  - `flusher_count 256` - a flusher is a micro-thread that removes old data from the journal.
    You don't have to worry about this parameter anymore, 256 is enough.
  - `disk_alignment`, `journal_block_size`, `meta_block_size` should be set to the internal
    block size of your SSDs, which is 4096 on most drives.
  - `journal_no_same_sector_overwrites true` prevents multiple overwrites of the same journal sector.
    Most (99%) SSDs don't need this option, but the Intel D3-S4510 does, because it doesn't like when you
    overwrite the same sector twice in a short period of time. The setting forces Vitastor to never
    overwrite the same journal sector twice in a row, which makes the D3-S4510 almost happy. Not totally
    happy, because overwrites of the same block can still happen in the metadata area... When this
    setting is enabled, it is also required to raise the `journal_sector_buffer_count` setting, which is the
    number of dirty journal sectors that may be written to at the same time.
- `systemctl start vitastor.target` everywhere.
- Create global configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
(if all your drives have capacitors).
- Create pool configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.
For jerasure pools the configuration should look like the following: `2:{"name":"ecpool","scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}`.
- At this point, one of the monitors will configure PGs and OSDs will start them.
- You can check PG states with `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. All PGs should become 'active'.
### Name an image
```
etcdctl --endpoints=<etcd> put /vitastor/config/inode/<pool>/<inode> '{"name":"<name>","size":<size>[,"parent_id":<parent_inode_number>][,"readonly":true]}'
```
For example:
```
etcdctl --endpoints=http://10.115.0.10:2379/v3 put /vitastor/config/inode/1/1 '{"name":"testimg","size":2147483648}'
```
If you specify parent_id, the image becomes a CoW clone, i.e. all writes go to the new inode while reads
first check it and then the upper layers. You can then make the parent read-only by updating its entry
with `"readonly":true` for safety and basically treat it as a snapshot.
So, to create a snapshot you basically rename the previous upper layer (for example from testimg to testimg@0),
make it read-only and create a new top layer with the original name (testimg) and the previous one as its parent.
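For illustration, the rename-and-recreate sequence could be performed with raw etcdctl commands like the following sketch (the inode numbers 1 and 2 in pool 1 are assumed values; normally an ID is allocated for the new layer):

```
# 1. Rename the current top layer to testimg@0 and make it read-only
etcdctl --endpoints=<etcd> put /vitastor/config/inode/1/1 \
  '{"name":"testimg@0","size":2147483648,"readonly":true}'
# 2. Create a new top layer with the original name, pointing at the snapshot
etcdctl --endpoints=<etcd> put /vitastor/config/inode/1/2 \
  '{"name":"testimg","size":2147483648,"parent_id":1}'
```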
[Documentation](../../README.md#documentation) → [Configuration](../config.en.md) → Image metadata in etcd
-----
[Читать на русском](inode.ru.md)
# Image metadata in etcd
Image list is stored in etcd in `/vitastor/config/inode/<pool>/<inode>` keys.
You can even create images manually:
```
etcdctl --endpoints=<etcd> put /vitastor/config/inode/<pool>/<inode> '{"name":"<name>","size":<size>[,"parent_id":<parent_inode_number>][,"readonly":true]}'
```
For example:
```
etcdctl --endpoints=http://10.115.0.10:2379/v3 put /vitastor/config/inode/1/1 '{"name":"testimg","size":2147483648}'
```
If you specify parent_id the image becomes a CoW clone. I.e. all writes go to the new inode and reads first check it
and then upper layers. You can then make parent readonly by updating its entry with `"readonly":true` for safety and
basically treat it as a snapshot.
So to create a snapshot you basically rename the previous upper layer (for example from testimg to testimg@0), make it readonly
and create a new top layer with the original name (testimg) and the previous one as a parent.
vitastor-cli, K8s, OpenStack and other drivers also store the reverse mapping in `/vitastor/index/image/<name>` keys
in JSON format: `{"id":<inode>,"pool_id":<pool>}`, and ID counters in `/vitastor/index/maxid/<pool>` as plain numbers.
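For example, the index entries for a `testimg` image with inode 1 in pool 1 (hypothetical values) would look like:

```
etcdctl --endpoints=<etcd> put /vitastor/index/image/testimg '{"id":1,"pool_id":1}'
etcdctl --endpoints=<etcd> put /vitastor/index/maxid/1 '1'
```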
[Documentation](../../README.md#documentation) → [Configuration](../config.en.md) → Cluster-Wide Disk Layout Parameters
-----
[Читать на русском](layout-cluster.ru.md)
# Cluster-Wide Disk Layout Parameters
These parameters apply to clients and OSDs, are fixed at the moment of OSD drive
initialization and can't be changed afterwards without losing data.
OSDs with different values of these parameters (for example, SSD and SSD+HDD
OSDs) can coexist in one Vitastor cluster within different pools. Each pool can
only include OSDs with identical settings of these parameters.
These parameters, when set to a non-default value, must also be specified in
etcd for clients to be aware of their values, either in /vitastor/config/global
or in pool configuration. Pool configuration overrides the global setting.
If the value for a pool in etcd doesn't match on-disk OSD configuration, the
OSD will refuse to start PGs of that pool.
- [block_size](#block_size)
- [bitmap_granularity](#bitmap_granularity)
- [immediate_commit](#immediate_commit)
## block_size
- Type: integer
- Default: 131072
Size of objects (data blocks) into which all physical and virtual drives
(within a pool) are subdivided in Vitastor. One of the main settings
in Vitastor, it affects memory usage, write amplification and the
effectiveness of I/O load distribution.
Recommended default block size is 128 KB for SSD and 1 MB for HDD. In fact,
it's possible to use 1 MB for SSD too - it will lower memory usage, but
may increase average WA and reduce linear performance.
OSD memory usage is roughly (SIZE / BLOCK * 68 bytes) which is roughly
544 MB per 1 TB of used disk space with the default 128 KB block size.
With 1 MB it's 8 times lower.
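The 544 MB figure is just this formula evaluated for 1 TB with 128 KB blocks:

```
\frac{1\ \text{TB}}{128\ \text{KB}} \cdot 68\ \text{B}
  = 2^{23} \cdot 68\ \text{B}
  = 570425344\ \text{B} \approx 544\ \text{MB}
```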
## bitmap_granularity
- Type: integer
- Default: 4096
Required virtual disk write alignment ("sector size"). Must be a multiple
of disk_alignment. It's called bitmap granularity because Vitastor tracks
an allocation bitmap for each object, containing 2 bits per
(bitmap_granularity) bytes.
Can't be smaller than the OSD data device sector size.
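A quick worked example: with the default 128 KB block_size and 4096-byte bitmap_granularity, the allocation bitmap of one object takes:

```
\frac{131072}{4096} \cdot 2\ \text{bits} = 32 \cdot 2\ \text{bits} = 64\ \text{bits} = 8\ \text{bytes}
```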
## immediate_commit
- Type: string
- Default: all
One of "none", "all" or "small". Global value, may be overriden [at pool level](pool.en.md#immediate_commit).
This parameter is also really important for performance.
TLDR: default "all" is optimal for server-grade SSDs with supercapacitor-based
power loss protection (nonvolatile write-through cache) and also for most HDDs.
"none" or "small" should be only selected if you use desktop SSDs without
capacitors or drives with slow write-back cache that can't be disabled. Check
immediate_commit of your OSDs in [ls-osd](../usage/cli.en.md#ls-osd).
Detailed explanation:
Desktop SSDs are very fast (100000+ iops) for simple random writes
without cache flush. However, they are really slow (only around 1000 iops)
if you try to fsync() each write, that is, if you want to guarantee that
each change gets actually persisted to the physical media.
Server-grade SSDs with "Advanced/Enhanced Power Loss Protection" or with
"Supercapacitor-based Power Loss Protection", on the other hand, are equally
fast with and without fsync because their cache is protected from sudden
power loss by a built-in supercapacitor-based "UPS".
Some software-defined storage systems always fsync each write and thus are
really slow when used with desktop SSDs. Vitastor, however, can also
efficiently utilize desktop SSDs by postponing fsync until the client calls
it explicitly.
This is what this parameter regulates. When it's set to "all" Vitastor
cluster commits each change to disks immediately and clients just
ignore fsyncs because they know for sure that they're unneeded. This reduces
the amount of network roundtrips performed by clients and improves
performance. So it's always better to use server-grade SSDs with
supercapacitors even with Vitastor, especially given that they cost only
a bit more than desktop models.
There is also a common SATA SSD (and HDD too!) firmware bug (or feature)
that makes server SSDs which have supercapacitors slow with fsync. To check
if your SSDs are affected, compare benchmark results from `fio -name=test
-ioengine=libaio -direct=1 -bs=4k -rw=randwrite -iodepth=1` with and without
`-fsync=1`. Results should be the same. If fsync=1 result is worse you can
try to work around this bug by "disabling" drive write-back cache by running
`hdparm -W 0 /dev/sdXX` or `echo write through > /sys/block/sdXX/device/scsi_disk/*/cache_type`
(IMPORTANT: don't mistake it with `/sys/block/sdXX/queue/write_cache` - it's
unsafe to change by hand). The same may apply to newer HDDs with internal
SSD cache or "media-cache" - for example, a lot of Seagate EXOS drives have
it (they have internal SSD cache even though it's not stated in datasheets).
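Collected into one copy-pasteable sketch (the device path is a placeholder, and the runtime limit is an addition for convenience; note that this test writes directly to the raw device and destroys data on it):

```
# WARNING: destroys data on the device!
DEV=/dev/sdXX
# Baseline: T1Q1 random writes without flushes
fio -name=test -ioengine=libaio -direct=1 -bs=4k -rw=randwrite -iodepth=1 \
    -runtime=10 -time_based=1 -filename=$DEV
# Same test with an fsync after each write; on drives with capacitors
# the results should be roughly equal to the baseline
fio -name=test -ioengine=libaio -direct=1 -bs=4k -rw=randwrite -iodepth=1 \
    -fsync=1 -runtime=10 -time_based=1 -filename=$DEV
# If the fsync=1 result is much worse, try "disabling" the write-back cache:
hdparm -W 0 $DEV
# ...or:
echo write through > /sys/block/sdXX/device/scsi_disk/*/cache_type
```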
Setting this parameter to "all" or "small" in OSD parameters requires enabling
[disable_journal_fsync](layout-osd.en.md#disable_journal_fsync) and
[disable_meta_fsync](layout-osd.en.md#disable_meta_fsync); setting it to
"all" also requires enabling [disable_data_fsync](layout-osd.en.md#disable_data_fsync).
vitastor-disk tries to do that by default, first checking and disabling the drive cache.
If it can't disable the drive cache, the OSD is initialized with "none".