Compare commits

114 Commits
kv ... v1.4.8

Author SHA1 Message Date
38b8963330 Release 1.4.8
- Do not use \r if output is not a terminal (should fix unexpected job output in proxmox)
- Fix rm/rm-data error return code, add --down-ok option to bypass the error
- Add EIO retry timeout and allow disabling these retries; rename up_wait_retry_interval to client_retry_interval
- Add ubuntu jammy build
- Wait for blockstore initialisation before starting OSD (prevent timeouts when init takes time)
- Fix a rare use-after-free in automatic sync after delete in blockstore
2024-02-29 09:58:34 +03:00
77167e2920 Do not use \r if output is not a terminal 2024-02-29 00:21:17 +03:00
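A minimal sketch of the idea behind this change, not the actual Vitastor code: progress output ends with `\r` only when the stream is a terminal and falls back to `\n` otherwise, so redirected job output stays line-oriented. The helper names `format_progress` and `print_progress` are hypothetical; `isatty()` and `fileno()` are the standard POSIX calls.

```cpp
#include <cassert>
#include <stdio.h>   // snprintf, fputs, fflush, fileno (POSIX)
#include <string>
#include <unistd.h>  // isatty (POSIX)

// Hypothetical helper: build one progress line.
// On a terminal the line ends with '\r' so the next update overwrites it;
// otherwise it ends with '\n' so logs and captured job output stay readable.
std::string format_progress(unsigned long long done, unsigned long long total, bool is_terminal)
{
    assert(total > 0);
    char buf[64];
    snprintf(buf, sizeof(buf), "%llu/%llu%c", done, total, is_terminal ? '\r' : '\n');
    return std::string(buf);
}

// Hypothetical helper: detect whether `out` is a terminal and print accordingly.
void print_progress(FILE *out, unsigned long long done, unsigned long long total)
{
    bool is_terminal = isatty(fileno(out)) != 0;
    std::string line = format_progress(done, total, is_terminal);
    fputs(line.c_str(), out);
    fflush(out);
}
```

Calling `print_progress(stdout, done, total)` in a loop then rewrites a single line on a terminal and emits one line per update when the output is piped to a file.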
5af23672d0 Fix rm/rm-data error return code, add --down-ok option to bypass the error 2024-02-29 00:20:10 +03:00
6bf1f539a6 Add EIO retry timeout and allow disabling these retries; rename up_wait_retry_interval to client_retry_interval 2024-02-28 13:10:02 +03:00
02d1f16bbd Add ubuntu jammy build
PR #62 vitalif/vitastor#62

I accept Vitastor CLA agreement: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md
2024-02-28 11:43:54 +03:00
fc413038d1 Wait for blockstore initialisation before starting OSD 2024-02-27 02:20:04 +03:00
1bc0b5aab3 Fix a rare use-after-free in automatic sync after delete in blockstore
ASan report: [0] READ of size 16 at operator() /root/vitastor/src/blockstore_write.cpp:100
...[5] blockstore_impl_t::ack_sync(blockstore_op_t*) /root/vitastor/src/blockstore_sync.cpp:232
2024-02-24 00:06:36 +03:00
5e934264cf Release 1.4.7
- Fix another old "BUG: Attempt to overwrite used offset" in a very simple
  case: bs=4k rw=write iodepth=16 from OSD start; add this case to tests
- Fix a rare crash with "unexpected state during flush: 0x51" possible with
  EC since 1.4.2 during rebalance and OSD outages
- Fix a rare write stall with EC & immediate_commit=none caused by sync
  operations reserving unneeded space in the journal
- Fix 32-bit build warnings, most in printf/scanf format strings
2024-02-22 12:45:52 +03:00
f20564b44b Fix 32-bit build warnings (99.9% in printf) 2024-02-22 12:22:16 +03:00
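Warnings of this kind typically come from passing a `uint64_t` to a `%lu` conversion, which only matches on platforms where `unsigned long` is 64 bits wide. The sketch below is an illustration rather than the project's actual patch: it uses the standard `PRIu64` macro from `<cinttypes>`, and the function name `format_u64` is hypothetical.

```cpp
#include <cassert>
#include <cinttypes>  // PRIu64
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: format a 64-bit value without depending on the
// width of `unsigned long`, so it builds cleanly on 32-bit and 64-bit targets.
std::string format_u64(uint64_t value)
{
    char buf[32]; // 20 decimal digits at most, plus the terminating NUL
    int len = std::snprintf(buf, sizeof(buf), "%" PRIu64, value);
    assert(len > 0 && static_cast<std::size_t>(len) < sizeof(buf));
    return std::string(buf, static_cast<std::size_t>(len));
}
```

The same macro family (`PRId64`, `PRIx64`, `SCNu64` and so on) covers the other printf/scanf conversions.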
b3c15db331 32M journal by default in simple-offsets 2024-02-21 15:25:02 +03:00
685bcd6ef9 Do not reserve extra space for big_writes during sync - sync itself is needed to commit and clear them 2024-02-21 13:00:14 +03:00
3eb389b321 Supposed fix for "unexpected state during flush: 0x51" with EC 2024-02-21 01:32:06 +03:00
3d16cde23c Fix assertions, add small sequential write test 2024-02-20 19:41:48 +03:00
c6406d67fc Fix journal space_check incorrectly checking for space at the beginning 2024-02-20 19:40:56 +03:00
f87964861d Release 1.4.6
Unwavering stabilization of 1.4.x, continued :-)

- Include the accidentally lost part of 1.4.5 journal trimming fix
- Fix a possible OSD crash with "BUG: Attempt to overwrite used offset"
  which was probably present for a long time, but became apparent after
  fixing flapping tests in CI
- Fix remaining flapping tests in CI. This was the first time the tests
  actually passed without retries :-)
2024-02-20 17:01:26 +03:00
62a4f45160 Raise test_scrub waiting timeout 2024-02-20 16:26:09 +03:00
7048228678 Supposed fix for "BUG: Attempt to overwrite used offset" 2024-02-20 15:56:48 +03:00
ea73857450 Add asserts to catch "BUG: Attempt to overwrite used offset" 2024-02-20 15:56:48 +03:00
6cfe38ec04 Followup to empty cur.oid as stop condition for forced trim fix 2024-02-20 15:56:38 +03:00
7ae5766fdb Wait to clear has_degraded in test_heal - should fix flaps of test_heal_* in CI 2024-02-20 15:56:27 +03:00
f882c7dd87 Release 1.4.5
- Fix a write stall caused by incorrect journal trimming introduced in 1.4.4 :)
- Fix PGs sometimes hanging in "starting" state on mass OSD restarts
- Fix a rare crash with "map::at" during OSD pings
- Use new defaults for non-capacitor (desktop) SSDs - improves T1Q256 random write from ~6k iops to ~45k iops
- Make journal_trim_interval configurable
2024-02-16 10:13:33 +03:00
26dd863c8d Fix sometimes possible crash on clients.at() during pings 2024-02-16 10:13:33 +03:00
2ae859fbc6 Use min/max_flusher_count=32/256, 128M journal and autosync_writes=512 for non-capacitor SSDs by default 2024-02-16 10:13:33 +03:00
f6cd9f9153 Add a note about pg_minsize 2024-02-15 23:38:52 +03:00
8389c0f33b Fix PGs sometimes hanging in "starting" state on mass OSD restarts 2024-02-15 23:38:52 +03:00
9db2196aef Make journal_trim_interval configurable 2024-02-15 23:38:51 +03:00
8d6ae662fe Use empty cur.oid as stop condition for forced trim, not journal_trim_counter 2024-02-15 23:27:17 +03:00
c777a0041a Release 1.4.4
A couple of fixes for EC pools

- Fix a segfault possible on partial EC overwrite in 1234 -> 5030 rebalance scenario
- Fix two problems leading to EC pools stalling on rebalance & parallel sudden stops
  of OSDs, for example during a sudden poweroff of a host:
  - Recovery auto-tuning (1.4.0 feature) could apply too large delays and stall
    the EC journal - fixed by limiting delays with a new recovery_tune_sleep_cutoff_us
    parameter (10 seconds by default) and applying recovery pauses before write
    operations, not after them, so that they do not occupy journal space for a long time
  - Dynamic journal space reservation (1.3.0 feature) wasn't accounting for new writes
    when checking the limit, so OSDs could still fill the journal fully and stall -
    fixed by including new writes in the limit
- Print etcd dbSize instead of dbSizeInUse in status
2024-02-11 16:23:08 +03:00
2947ea93e8 Raise test_snapshot_chain_ec timeout to 6 minutes 2024-02-11 16:13:52 +03:00
978bdc128a Apply recovery pause before writes, after commits, and do not apply it to syncs to not block EC pools from functioning 2024-02-11 16:13:52 +03:00
bb2f395f1e Add cutoff threshold for recovery auto-tuning 2024-02-11 16:13:52 +03:00
b127da40f7 Add a FIXME about incomplete PGs 2024-02-11 13:42:51 +03:00
ca34a6047a Fix dynamic journal space reservation: include the new write itself, too 2024-02-11 13:42:51 +03:00
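The fix described here amounts to counting the incoming write against the limit before accepting it. The sketch below illustrates that check under assumed names (`can_accept_write`, `used`, `new_write`, `limit`); the real blockstore accounting is more involved and is not shown in this log.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical admission check for a journal with a space limit.
// used      - journal space already occupied or reserved
// new_write - space the incoming write would occupy
// limit     - maximum space writes are allowed to occupy
bool can_accept_write(uint64_t used, uint64_t new_write, uint64_t limit)
{
    assert(limit > 0);
    // Checking only `used <= limit` ignores the write being admitted,
    // which lets the journal fill up completely.
    // Written as a subtraction to avoid overflow in `used + new_write`.
    return new_write <= limit && used <= limit - new_write;
}
```

For instance, with a limit of 100 units and 95 already used, a 10-unit write is rejected, while with 90 used it is accepted.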
38ba76e893 Fix flusher sometimes being unable to trim journal when the flush queue is empty 2024-02-11 13:42:51 +03:00
1e3c4edea0 Print etcd dbSize instead of dbSizeInUse in status 2024-02-11 13:42:51 +03:00
e7ac855b07 Fix that EC segfault (1234 -> 5030 partial overwrite) 2024-02-11 13:42:51 +03:00
c53357ac45 Add a test for EC segfault with partial overwrite in 1234 -> 5030 rebalance scenario 2024-02-11 13:42:51 +03:00
27e9f244ec Release 1.4.3
Hotfix for hotfix O:-)

- "Write stall fix" was incomplete and EC write stalls could
  continue even on 1.4.2. Now they're finally fixed O:-)
- Make monitor ignore statistics of stopped OSDs. Previously if you stopped all
  OSDs the last total I/O numbers would remain the same indefinitely
2024-02-09 00:29:31 +03:00
8e25a28a08 Ignore down OSDs in monitor statistics aggregation 2024-02-09 00:22:36 +03:00
5d3317e4f2 Followup to 1.4.2 write stall fix - sadly, the previous version was not working correctly :) 2024-02-08 19:34:29 +03:00
016115c0d4 Release 1.4.2
- Log to systemd by default
- Fix excessive autosyncs after every operation with disabled immediate_commit (introduced in 1.1.0)
- Fix a possible write stall with EC due to the lack of OSD wakeup after stabilizing previous writes
- Change sync operation semantics as a final fix to possible write stalls with EC and disabled immediate_commit
- Sync after deleting data in CLI rm / rm-data if immediate_commit is disabled
- Fix OSDs ignoring syncs & autosyncs for delete operations
- Fix OSD space reporting sometimes adding garbage zeros for deleted inodes (causing extra pool/stats etcd keys for deleted pools)
- Speed up monitor failover - change default etcd_mon_ttl from 30 to 5 seconds
- Speed up operation retries - change default up_wait_retry_interval to 50 ms
- Add patch for libvirt 9.10
2024-02-04 02:23:49 +03:00
e026de95d5 Log to systemd by default 2024-02-04 01:21:31 +03:00
77c10fd1f8 In fact, do not autosync blockstore when autosync_writes=0 2024-02-03 20:37:36 +03:00
581d02e581 Mark secondary OSDs with deletions as dirty to not forget to sync & autosync them 2024-02-03 20:31:08 +03:00
f03a9db4d9 Fix OSD space reporting sometimes adding garbage zeros for deleted inodes (causing extra pool/stats etcd keys for deleted pools) 2024-02-03 20:31:08 +03:00
cb9c30bc31 Sync after sending all deletes to each PG in cli rm-data 2024-02-03 20:31:08 +03:00
a86a380d20 Fix invalid parsing of autosync_writes in blockstore leading to autosyncs after every operation with disabled immediate_commit :D 2024-02-03 20:31:08 +03:00
d2b43cb118 Change default etcd_mon_ttl 2024-01-29 23:45:19 +03:00
cc76e6876b Fix flapping "scrub" test 2024-01-28 14:59:33 +03:00
1cec62d25d Sync only completed writes
Should be a final remaining fix to EC + non-capacitor (non-immediate-commit) write hangs :).

First it was breaking non-EC ("instantly stable") writes because they sometimes
complete out of order which was leading to the following error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  BUG: Unexpected dirty_entry 1000000000001:29480000 v65540 unstable state during flush: 0x151

But it is easily fixed by scanning previous and next dirty_entries in mark_stable.
2024-01-27 15:17:22 +03:00
1c322b33ed Change default up_wait_retry_interval to 50 ms 2024-01-26 01:51:08 +03:00
d27524f441 Add patch for libvirt 9.10 2024-01-25 01:09:12 +03:00
ba55f91409 Release 1.4.1
- Fix a monitor crash on primary OSD switching introduced in 1.4.0
- Fix "partly outside array bounds" warnings for GCC 12 in cpp-btree
- Fix a realloc memory leak theoretically possible with very large listings (OSD_OP_LIST)
2024-01-18 02:31:42 +03:00
80aac39513 Add detailed formula for theoretical EC N+K random write performance 2024-01-18 00:36:32 +03:00
2aa5aa7ab6 Add a test for simple master switching without PG reconfiguration
Also use osd_out_time:1 only in select tests and restart mon in tests only on connection errors
2024-01-17 00:19:01 +03:00
3ca3b8a8d8 Fix recheck_pgs bug introduced in 1.4.0 2024-01-16 23:49:21 +03:00
2cf649eba6 Fix "partly outside array bounds" warnings for GCC 12 in cpp-btree 2024-01-15 03:04:33 +03:00
5935640a4a Add CLA PR form 2024-01-14 16:48:24 +03:00
d00d4dbac0 Initialize mod_revision field in etcd_state_client 2024-01-13 01:30:28 +03:00
5d9d6f32a0 Fix common realloc memory leak mistakes found by cppcheck 2024-01-13 01:30:28 +03:00
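The commit does not list the affected call sites. The mistake cppcheck usually reports in this area is assigning the result of `realloc` straight back to the only pointer that owns the buffer, which loses (and leaks) the old block if the call fails. A hedged sketch of the safe pattern, with the hypothetical helper name `grow_buffer`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: resize *buf to new_size bytes.
// Returns false on allocation failure and leaves *buf valid and unchanged,
// so the caller can still use or free the old block.
bool grow_buffer(void **buf, std::size_t new_size)
{
    assert(buf != nullptr && new_size > 0);
    // Leaky pattern: *buf = realloc(*buf, new_size);
    // If realloc returns NULL, the old pointer is overwritten and lost.
    void *tmp = std::realloc(*buf, new_size);
    if (tmp == nullptr)
        return false;
    *buf = tmp;
    return true;
}
```

The caller keeps ownership of the original block on failure and decides whether to free it or continue with the smaller buffer.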
5280d1d561 Release 1.4.0
New features:
- Intelligent recovery/rebalance speed auto-tuning to reduce its impact on clients (see README -> Features)
- Auto-restoration of dead VDUSE daemons in CSI plugin
- Add vitastor-disk update-sb command
- Update QEMU for Debian Bookworm to 8.1 and use it for CSI plugin

Bug fixes:
- Fix pools SOMETIMES staying inactive after stopping a node due to OSDs not reacting
  to PG state changes caused by incorrect full reload of state from etcd on reconnection
- Make monitors retry pool configuration changes more quickly, which fixes them being unable
  to apply changes when an ongoing rebalance is quickly making a lot of PGs clean
- Fix CSI plugin not accepting array of strings as etcd address in /etc/vitastor/vitastor.conf
- Allow multiple interfaces with the same IP address, for "simple routed" full mesh network
- Do not ignore loopback addresses for OSD network (to make ECMP setups with frr possible)
- Fix a rare client crash during OSD reconnections
- Only treat data partitions as existing OSDs in vitastor-disk prepare
- Remove etcd parameter from default command examples
- Fix reported free space sometimes not changing immediately after deletion of data from OSDs
- Fix a possible OSD crash on print_slow when bs_op is NULL
- Use the same etcd_ws_keepalive_interval in mon as in OSD
- Fix mon not using values from config when /config/global is not present
- Remove pve-storage-portal-dns-list format for vitastor_etcd_address
- Parse log_level in cluster_client
- Fix vitastor-nbd image existence check not working because of non-zeroed inode_watch fields
- Do not warn on EPIPE in client unless log_level is raised explicitly
- Fix incorrect error in CSI when searching for the device in /sys
- Remove 2 last prints to stdout in etcd_state_client
- Fix a possible OSD crash when checking corrupted journal entries
2024-01-12 01:28:33 +03:00
317b0feb0a Add a note about VDUSE daemon auto-restart 2024-01-12 01:27:36 +03:00
247f0552db Fix debug log "killing..." in CSI 2024-01-10 01:19:34 +03:00
2f228fa96a Only treat data partitions as existing OSDs in vitastor-disk prepare 2023-12-31 11:46:47 +03:00
2f6b9c0306 Remove etcd parameter from default command examples 2023-12-31 02:50:41 +03:00
48b5f871e0 Add Contributor License Agreement in Russian and English 2023-12-31 01:23:52 +03:00
c17f76a3e4 Add documentation for recovery auto-tuning 2023-12-31 01:23:17 +03:00
a6ab54b1ba Do not allow negative util_low/high 2023-12-31 01:23:17 +03:00
99ee8596ea Rename min/max_util to util_low/high 2023-12-31 01:23:17 +03:00
c4928e6ecd Protect from try_send completing the operation immediately
Fixes a possible use-after-free in case of continue_ops() calling try_send(),
then connect_peer() -> set_timer() -> trigger_nearest() -> handle_op_part() -> continue_ops() again
2023-12-31 01:23:17 +03:00
ec7dcd1be5 Do not apply very large recovery pauses during tests 2023-12-31 01:23:17 +03:00
e600bbc151 Fix flapping move_reappear test by adding an fsync before stopping PG 2023-12-31 01:23:17 +03:00
8b8c1179a7 Use a separate used_blocks counter for free space stats to hide possibly delayed on-flush deallocation 2023-12-31 01:23:17 +03:00
d5a6fa6dd7 Fix possible crash on print_slow when bs_op is NULL 2023-12-31 01:23:17 +03:00
f757a35a8d Retry PG changes without re-running lpsolve when pool configuration and OSD tree don't change
OSDs often change their /pg/history keys during rebalance, so monitor receives additional
transaction failures from etcd if it re-runs lpsolve which sometimes may even lead to monitor
being unable to apply PG changes at all until rebalance completes
2023-12-31 01:23:17 +03:00
1edf86ed26 Aggregate recovery delay using simple mean over last 10 observations (EWMA is shit) 2023-12-31 01:23:17 +03:00
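The commit subject describes replacing an exponentially weighted moving average with a plain mean over the last 10 observations. A minimal sketch of such a fixed-window mean, assuming a hypothetical `window_mean_t` class that is not taken from the Vitastor sources:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical fixed-window mean: keeps the last `window` observations
// and reports their arithmetic mean.
class window_mean_t
{
    std::deque<double> values; // observations currently inside the window
    double sum = 0;            // running sum of `values`
    std::size_t window;        // maximum number of observations kept

public:
    explicit window_mean_t(std::size_t window = 10): window(window)
    {
        assert(window > 0);
    }

    // Add an observation, evicting the oldest one when the window is full
    void add(double v)
    {
        values.push_back(v);
        sum += v;
        if (values.size() > window)
        {
            sum -= values.front();
            values.pop_front();
        }
    }

    // Mean of the stored observations, 0 when there are none yet
    double mean() const
    {
        return values.empty() ? 0 : sum / values.size();
    }
};
```

Unlike an EWMA, every observation inside the window has equal weight and an old outlier disappears completely once it leaves the window.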
5ca7cde612 Experiment/WIP: Try to track "secondary" recovery ops separately 2023-12-31 01:23:17 +03:00
751935ddd8 WIP Auto-tune recovery speed 2023-12-31 01:23:17 +03:00
d84dee7098 Track recovery op latencies + refactor into a structure 2023-12-31 01:23:17 +03:00
dcc76eee15 Add a parity chunk count change test script 2023-12-26 23:48:41 +03:00
2f38adeb3d Restart dead VDUSE daemons at regular intervals 2023-12-24 12:58:50 +03:00
f72f14e6a7 Clear old PG states, history, and OSD states on etcd state reload
Also add protection from etcd watcher messages being split into multiple websocket
messages - I'm not sure if etcd actually does that, but it's better to have extra
protection anyway.

Also check that all etcd watchers are started in the keepalive routine, otherwise
it sometimes tries to revive etcd watchers starting with revision=1 which obviously
always fails because this revision is nearly always compacted.

All these changes should fix an old rarely reproduced bug where SOMETIMES OSDs
didn't react to PG config changes which was leading to offline pools on node reboot.
It happened on the full reload of state from etcd.
2023-12-24 02:02:13 +03:00
1299373988 Use the same etcd_ws_keepalive_interval in OSD and mon 2023-12-23 20:07:29 +03:00
178bb0e701 Prevent re-entry into timerfd set_nearest 2023-12-22 02:32:40 +03:00
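The commit gives no detail on how re-entry is prevented. One common way to guard a function that may be called again from its own callbacks is a busy flag that defers the nested call; the sketch below shows that general technique with hypothetical names and is not the actual `set_nearest` code.

```cpp
#include <cassert>
#include <functional>

// Hypothetical re-entrancy guard: a nested call to process() made from
// inside the callback is recorded and replayed after the current pass,
// instead of running on top of the pass that is still in progress.
struct reentry_guard_t
{
    bool busy = false;   // true while process() is on the call stack
    bool rerun = false;  // set when a nested call was deferred
    int passes = 0;      // number of passes actually executed

    void process(const std::function<void()> & callback)
    {
        assert(static_cast<bool>(callback));
        if (busy)
        {
            rerun = true; // defer instead of re-entering
            return;
        }
        busy = true;
        do
        {
            rerun = false;
            passes++;
            callback();   // may call process() again
        } while (rerun);
        busy = false;
    }
};
```

The deferred call still takes effect, but only after the state touched by the outer pass is consistent again.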
4ece4dfdd0 Fix mon not using values from config when /config/global is not present 2023-12-22 02:25:09 +03:00
95631773b6 Remove pve-storage-portal-dns-list format for vitastor_etcd_address 2023-12-20 02:22:06 +03:00
7239cfb91a Parse log_level in cluster_client 2023-12-20 02:21:23 +03:00
7cea642f4a Fix vitastor-nbd image existence check not working because of non-zeroed inode_watch fields 2023-12-19 01:11:37 +03:00
dc615403d9 Do not warn on EPIPE in client unless log_level is raised explicitly 2023-12-17 13:42:26 +03:00
1a704e06ab Allow multiple interfaces with the same IP address, for "simple routed" full mesh network 2023-12-17 13:25:56 +03:00
575475de71 Do not ignore loopback addresses for OSD network (to make ECMP setups with frr possible) 2023-12-17 11:55:13 +03:00
aca2bef15f Add vitastor-disk update-sb command 2023-12-14 01:11:42 +03:00
4dd6e89263 Change qemu to qemu-system-x86 in docs 2023-12-14 01:01:00 +03:00
9bac99ffb6 Fix incorrect error in CSI when searching for the device in /sys 2023-12-14 01:00:32 +03:00
62ed130960 Support building qemu 8.1 from bookworm-backports 2023-12-10 00:34:13 +03:00
9c7755b6e8 Use qemu-storage-daemon from QEMU 8.1.2 for CSI 2023-12-08 00:10:12 +03:00
691ebd991a Move 2 last log printfs to stderr from stdout in etcd_state_client 2023-12-08 00:01:52 +03:00
6d5df908a3 Fix possible out of bounds when checking invalid journal entries 2023-12-08 00:01:07 +03:00
fa87769ed8 Correct config options in vduse docs 2023-12-06 02:09:04 +03:00
2ce8292803 Also log when killing process 2023-12-06 01:06:53 +03:00
7f8f7ded52 Check for empty output of vitastor-nbd map (just in case) 2023-12-06 01:01:14 +03:00
68553eabbb Log executed CLI commands 2023-12-06 00:48:12 +03:00
3147c5c8d5 Remove internal error wrapping 2023-12-06 00:39:42 +03:00
576e2ae608 Fix etcd_address check in CSI 2023-12-06 00:28:21 +03:00
a1c7cc3d8d Release 1.3.1
Hotfix to 1.3.0 - new "journal space reservation" had a bug which
caused OSDs to crash with EC and without immediate_commit.
2023-12-04 18:35:09 +03:00
a5e3dfbc5a Oops, 1.3.0 needs a hotfix 2023-12-04 13:45:54 +03:00
7972502eaf Release 1.3.0
New features:
- RDMA without ODP - much faster and all cards are now supported, not just Mellanox
- VDUSE in CSI - faster, more stable and can even recover after CSI pod restart!
- Reserve journal space for stabilize requests dynamically to prevent stalls under load with EC
- Raise default NBD timeout from 30 to 300 seconds and allow reading it from /etc/vitastor/vitastor.conf
- Remove explicit etcdUrl/etcdPrefix K8S storage class parameter support to prevent
  etcd migration issues for volumes created with these parameters
- Support QEMU 8.1 and pve-qemu 8.1

Bug fixes:
- Fix RDMA connection (and thus memory) leak
- Fix rare crashes under load due to incorrect io_uring queue size tracking
- Fix monitor statistics aggregation in case of empty /osd/stats keys
- Fix crash on unknown long argument to vitastor-disk
- Allow trailing comma in JSONs again
- Fix crash on attempts to dump a long listing of objects "to stabilize" or "to rollback" in a slow op
2023-12-04 02:36:43 +03:00
e57b7203b8 Use cmake3 on RHEL 7 2023-12-04 02:36:29 +03:00
c8a179dcda Note that Proxmox 8.1 is supported 2023-12-04 02:20:33 +03:00
845454742d Fix warning with QEMU 8.1 2023-12-04 01:59:07 +03:00
d65512bd80 Add patches for QEMU 8.1 2023-12-04 01:56:17 +03:00
53de2bbd0f Support VDUSE in CSI
VDUSE has multiple advantages:
- Better performance
- Lack of timeout problems
- And even the ability to recover after restart of the vitastor-csi pod!
2023-12-04 00:41:24 +03:00
628aa59574 Raise default NBD timeout from 30 to 300 seconds and allow reading it from /etc/vitastor/vitastor.conf 2023-12-02 14:11:14 +03:00
037cf64a47 Remove explicit etcdUrl/etcdPrefix from volume parameters 2023-12-02 13:26:00 +03:00
182 changed files with 5123 additions and 4344 deletions

@@ -395,7 +395,7 @@ jobs:
     steps:
     - name: Run test
       id: test
-      timeout-minutes: 3
+      timeout-minutes: 6
       run: SCHEME=ec /root/vitastor/tests/test_snapshot_chain.sh
     - name: Print logs
       if: always() && steps.test.outcome == 'failure'
@@ -532,6 +532,24 @@ jobs:
           echo ""
         done
 
+  test_switch_primary:
+    runs-on: ubuntu-latest
+    needs: build
+    container: ${{env.TEST_IMAGE}}:${{github.sha}}
+    steps:
+    - name: Run test
+      id: test
+      timeout-minutes: 3
+      run: /root/vitastor/tests/test_switch_primary.sh
+    - name: Print logs
+      if: always() && steps.test.outcome == 'failure'
+      run: |
+        for i in /root/vitastor/testdata/*.log /root/vitastor/testdata/*.txt; do
+          echo "-------- $i --------"
+          cat $i
+          echo ""
+        done
+
   test_write:
     runs-on: ubuntu-latest
     needs: build

@@ -39,6 +39,10 @@ for my $line (<>)
             $test_name .= '_'.lc($1).'_'.$2;
         }
     }
+    if ($test_name eq 'test_snapshot_chain_ec')
+    {
+        $timeout = 6;
+    }
     $line =~ s!\./test_!/root/vitastor/tests/test_!;
     # Gitea CI doesn't support artifacts yet, lol
     #- name: Upload results

CLA-en.md Normal file

@@ -0,0 +1,115 @@
## Contributor License Agreement
> This Agreement is made in the Russian and English languages. **The English
text of Agreement is for informational purposes only** and is not binding
for the Parties.
>
> In the event of a conflict between the provisions of the Russian and
English versions of this Agreement, the **Russian version shall prevail**.
>
> Russian version is published at https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md
This document represents the offer of Filippov Vitaliy Vladimirovich
("Author"), author and copyright holder of Vitastor software ("Program"),
acknowledged by a certificate of Federal Service for Intellectual
Property of Russian Federation (Rospatent) # 2021617829 dated 20 May 2021,
to "Contributors" to conclude this license agreement as follows
("Agreement" or "Offer").
In accordance with Art. 435, Art. 438 of the Civil Code of the Russian
Federation, this Agreement is an offer and in case of acceptance of the
offer, an agreement is considered concluded on the conditions specified
in the offer.
1. Applicable Terms. \
1.1. "Official Repository" shall mean the computer storage, operated by
the Author, containing all prior and future versions of the Source
Code of the Program, at Internet addresses https://git.yourcmc.ru/vitalif/vitastor/
or https://github.com/vitalif/vitastor/. \
1.2. "Contributions" shall mean results of intellectual activity
(including, but not limited to, source code, libraries, components,
texts, documentation) which can be software or elements of the software
and which are provided by Contributors to the Author for inclusion
in the Program. \
1.3. "Contributor" shall mean a person who provides Contributions to
the Author and agrees with all provisions of this Agreement.
A Contributor can be: 1) an individual; or 2) a legal entity or an
individual entrepreneur in case when an individual provides Contributions
on behalf of third parties, including on behalf of his employer.
2. Subject of the Agreement. \
2.1. Subject of the Agreement shall be the Contributions sent to the Author by Contributors. \
2.2. The Contributor grants to the Author the right to use Contributions at his own
discretion and without any necessity to get a prior approval from Contributor or
any other third party in any way, under a simple (non-exclusive), royalty-free,
irrevocable license throughout the world by all means not contrary to law, in whole
or as a part of the Program, or other open-source or closed-source computer programs,
products or services (hereinafter -- the "License"), including, but not limited to: \
2.2.1. to execute Contributions and use them for any tasks; \
2.2.2. to publish and distribute Contributions in modified or unmodified form and/or to rent them; \
2.2.3. to modify Contributions, add comments, illustrations or any explanations to Contributions while using them; \
2.2.4. to create other results of intellectual activity based on Contributions, including derivative works and composite works; \
2.2.5. to translate Contributions into other languages, including other programming languages; \
2.2.6. to carry out rental and public display of Contributions; \
2.2.7. to use Contributions under the trade name and/or any trademark or any other label, or without it, as the Author thinks fit; \
2.3. The Contributor grants to the Author the right to sublicense any of the aforementioned
rights to third parties on any terms at the Author's discretion. \
2.4. The License is provided for the entire duration of Contributor's
exclusive intellectual property rights to the Contributions. \
2.5. The Contributor grants to the Author the right to decide how and where to mention,
or to not mention at all, the fact of his authorship, name, nickname and/or company
details when including Contributions into the Program or in any other computer
programs, products or services.
3. Acceptance of the Offer \
3.1. The Contributor may provide Contributions to the Author in the form of
a "Pull Request" in an Official Repository of the Program or by any
other electronic means of communication, including, but not limited to,
E-mail or messenger applications. \
3.2. The acceptance of the Offer shall be the fact of provision of Contributions
to the Author by the Contributor by any means with the following remark:
“I accept Vitastor CLA agreement: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md”
or “Я принимаю соглашение Vitastor CLA: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md”. \
3.3. Date of acceptance of the Offer shall be the date of such provision.
4. Rights and obligations of the parties. \
4.1. The Contributor reserves the right to use Contributions by any lawful means
not contrary to this Agreement. \
4.2. The Author has the right to refuse to include Contributions into the Program
at any moment with no explanation to the Contributor.
5. Representations and Warranties. \
5.1. The person providing Contributions for the purpose of their inclusion
in the Program represents and warrants that he is the Contributor
or legally acts on the Contributor's behalf. Name or company details
of the Contributor shall be provided with the Contribution at the moment
of their provision to the Author. \
5.2. The Contributor represents and warrants that he legally owns exclusive
intellectual property rights to the Contributions. \
5.3. The Contributor represents and warrants that any further use of
Contributions by the Author as provided by Contributor under the terms
of the Agreement does not infringe on intellectual and other rights and
legitimate interests of third parties. \
5.4. The Contributor represents and warrants that he has all rights and legal
capacity needed to accept this Offer. \
5.5. The Contributor represents and warrants that Contributions don't
contain malware or any information considered illegal under the law
of Russian Federation.
6. Termination of the Agreement \
6.1. The Agreement may be terminated at will of both Author and Contributor,
formalised in the written form or if the Agreement is terminated on
reasons prescribed by the law of Russian Federation.
7. Final Clauses \
7.1. The Contributor may optionally sign the Agreement in the written form. \
7.2. The Agreement is deemed to become effective from the Date of signing of
the Agreement and until the expiration of Contributor's exclusive
intellectual property rights to the Contributions. \
7.3. The Author may unilaterally alter the Agreement without informing Contributors.
The new version of the document shall come into effect 3 (three) days after
being published in the Official Repository of the Program at Internet address
[https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md).
Contributors should keep informed about the actual version of the Agreement themselves. \
7.4. If the Author and the Contributor fail to agree on disputable issues,
disputes shall be referred to the Moscow Arbitration court.

CLA-ru.md Normal file

@@ -0,0 +1,108 @@
## Contributor License Agreement
> This Offer is written in Russian and English versions. **The English
version is provided for informational purposes** and does not bind the parties to the agreement.
>
> In the event of discrepancies between the provisions of the Russian and English versions of the Agreement,
**the Russian version takes precedence**.
>
> The English version is published at https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md
This offer agreement (hereinafter the Offer, the Agreement) is addressed to individuals
and legal entities (hereinafter Contributors) and is an official public offer by
Filippov Vitaliy Vladimirovich (hereinafter the Author) of the Vitastor software,
certificate of the Federal Service for Intellectual Property (Rospatent) No. 2021617829
dated 20 May 2021 (hereinafter the Program), as follows:
1. Terms and definitions \
1.1. Repository: the electronic storage containing the source code of the Program. \
1.2. Contribution: a result of the Contributor's intellectual activity comprising
changes or additions to the source code of the Program which the Contributor
wishes to include in the Program for further use and distribution
by the Author, and sends to the Author for that purpose. \
1.3. Contributor: an individual or legal entity that makes Contributions to the code of the Program. \
1.4. Civil Code: the Civil Code of the Russian Federation.
2. Subject of the Offer \
2.1. The subject of this Offer is the Contributions sent by the Contributor to the Author. \
2.2. The Contributor grants the Author the right to use the Contributions at the Author's own discretion
and without the need for prior approval from the Contributor or any other third party,
under a simple (non-exclusive), royalty-free, irrevocable license, in whole
or in part, as part of the Program or of other programs, products or services,
whether open-source or closed-source, by any means not contrary to
law, including but not limited to the following: \
2.2.1. To run and use the Contributions for any tasks; \
2.2.2. To distribute, import, and make the Contributions available to the public; \
2.2.3. To make changes, abridgements and additions to the Contributions, and to supply the Contributions
with comments, illustrations or explanations when using them; \
2.2.4. To create other results of intellectual activity based on the Contributions,
including derivative and composite works; \
2.2.5. To translate the Contributions into other languages, including other programming languages; \
2.2.6. To carry out rental and public display of the Contributions; \
2.2.7. To use the Contributions under any trade name, trademark
(service mark) or other designation, or without one. \
2.3. The Contributor grants the Author the right to sublicense the rights received to the Contributions
to third parties on any terms at the Author's discretion. \
2.4. The Contributor grants the Author the rights to the Contributions throughout the world. \
2.5. The Contributor grants the Author the rights for the entire duration of the Contributor's
exclusive right to the Contributions. \
2.6. The Contributor grants the Author the rights to the Contributions free of charge. \
2.7. The Contributor permits the Author to determine independently the manner, method and
place of stating the Contributor's name, company details and/or pseudonym when including
the Contributions in the Program or in other programs, products or services.
3. Acceptance of the Offer \
3.1. The Contributor may send Contributions to the Author via the mirrors of the official
Repository of the Program at https://git.yourcmc.ru/vitalif/vitastor/ or
https://github.com/vitalif/vitastor/ in the form of a pull request,
or in written form, or by any other electronic means of communication,
for example e-mail or messenger applications. \
3.2. The fact of the Contributor sending Contributions to the Author by any means with one of the remarks
“I accept Vitastor CLA agreement: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md”
or “Я принимаю соглашение Vitastor CLA: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md”
constitutes full and unconditional acceptance by the Contributor of the terms of this
Offer, i.e. the Contributor is deemed to have read this public agreement and,
in accordance with the Civil Code, is recognised as a person who has entered into contractual relations
with the Author on the basis of this Offer. \
3.3. The date of acceptance of this Offer is the date of such sending.
4. Rights and obligations of the Parties \
4.1. The Contributor retains the right to use the Contributions by any lawful
means not contrary to this Agreement. \
4.2. The Author may refuse to include the Contributions in the
Program, without giving reasons, at any moment at the Author's discretion.
5. Warranties and representations \
5.1. The person sending Contributions for the purpose of their inclusion in the Program
warrants that they are the Contributor or a representative of the Contributor. The name or company details
of the Contributor must be stated when the Contributions are sent to the Author of the Program. \
5.2. The Contributor warrants that they are the lawful holder of the exclusive rights
to the Contributions. \
5.3. The Contributor warrants that at the moment of accepting this Offer they
know nothing (and could not have known anything) of any third-party rights to
the Contributions sent to the Author, or to any part of them, that may be infringed
by sending the Contributions under this Agreement. \
5.4. The Contributor warrants that they have legal capacity and all
the rights necessary to conclude the Agreement. \
5.5. The Contributor warrants that the Contributions contain no malware or
any other information whose distribution is prohibited under the laws of the Russian
Federation.
6. Termination of the Offer \
6.1. This Agreement may be terminated by agreement of the parties
made in written form, or as a result of its termination on grounds
provided by law.
7. Final provisions \
7.1. The Contributor may, if desired, sign this Agreement in written form. \
7.2. This Agreement is effective from the moment it is concluded until the expiry
of the Contributor's exclusive rights to the Contributions. \
7.3. The Author may unilaterally amend and supplement the Agreement
without special notice to Contributors. The new version of the document comes
into effect 3 (three) calendar days after its publication in the official Repository
of the Program at the Internet address
[https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md).
Contributors keep track of the current terms of the Offer themselves. \
7.4. Все споры, возникающие между сторонами в процессе их взаимодействия по настоящему
договору, решаются путём переговоров. В случае невозможности урегулирования споров
переговорным порядком стороны разрешают их в Арбитражном суде г.Москвы.

View File

@@ -2,6 +2,6 @@ cmake_minimum_required(VERSION 2.8.12)
project(vitastor)
set(VERSION "1.2.0")
set(VERSION "1.4.8")
add_subdirectory(src)

View File

@@ -1,14 +1,15 @@
# Compile stage
FROM golang:buster AS build
FROM golang:bookworm AS build
ADD go.sum go.mod /app/
RUN cd /app; CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go mod download -x
ADD . /app
RUN perl -i -e '$/ = undef; while(<>) { s/\n\s*(\{\s*\n)/$1\n/g; s/\}(\s*\n\s*)else\b/$1} else/g; print; }' `find /app -name '*.go'`
RUN cd /app; CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go build -o vitastor-csi
RUN perl -i -e '$/ = undef; while(<>) { s/\n\s*(\{\s*\n)/$1\n/g; s/\}(\s*\n\s*)else\b/$1} else/g; print; }' `find /app -name '*.go'` && \
cd /app && \
CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go build -o vitastor-csi
# Final stage
FROM debian:buster
FROM debian:bookworm
LABEL maintainers="Vitaliy Filippov <vitalif@yourcmc.ru>"
LABEL description="Vitastor CSI Driver"
@@ -18,19 +19,30 @@ ENV CSI_ENDPOINT=""
RUN apt-get update && \
apt-get install -y wget && \
(echo deb http://deb.debian.org/debian buster-backports main > /etc/apt/sources.list.d/backports.list) && \
(echo "APT::Install-Recommends false;" > /etc/apt/apt.conf) && \
apt-get update && \
apt-get install -y e2fsprogs xfsprogs kmod && \
apt-get install -y e2fsprogs xfsprogs kmod iproute2 \
# dependencies of qemu-storage-daemon
libnuma1 liburing2 libglib2.0-0 libfuse3-3 libaio1 libzstd1 libnettle8 \
libgmp10 libhogweed6 libp11-kit0 libidn2-0 libunistring2 libtasn1-6 libpcre2-8-0 libffi8 && \
apt-get clean && \
(echo options nbd nbds_max=128 > /etc/modprobe.d/nbd.conf)
COPY --from=build /app/vitastor-csi /bin/
RUN (echo deb http://vitastor.io/debian buster main > /etc/apt/sources.list.d/vitastor.list) && \
RUN (echo deb http://vitastor.io/debian bookworm main > /etc/apt/sources.list.d/vitastor.list) && \
((echo 'Package: *'; echo 'Pin: origin "vitastor.io"'; echo 'Pin-Priority: 1000') > /etc/apt/preferences.d/vitastor.pref) && \
wget -q -O /etc/apt/trusted.gpg.d/vitastor.gpg https://vitastor.io/debian/pubkey.gpg && \
apt-get update && \
apt-get install -y vitastor-client && \
wget https://vitastor.io/archive/qemu/qemu-bookworm-8.1.2%2Bds-1%2Bvitastor1/qemu-utils_8.1.2%2Bds-1%2Bvitastor1_amd64.deb && \
wget https://vitastor.io/archive/qemu/qemu-bookworm-8.1.2%2Bds-1%2Bvitastor1/qemu-block-extra_8.1.2%2Bds-1%2Bvitastor1_amd64.deb && \
dpkg -x qemu-utils*.deb tmp1 && \
dpkg -x qemu-block-extra*.deb tmp1 && \
cp -a tmp1/usr/bin/qemu-storage-daemon /usr/bin/ && \
mkdir -p /usr/lib/x86_64-linux-gnu/qemu && \
cp -a tmp1/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so /usr/lib/x86_64-linux-gnu/qemu/ && \
rm -rf tmp1 *.deb && \
apt-get clean
ENTRYPOINT ["/bin/vitastor-csi"]

View File

@@ -1,4 +1,4 @@
VERSION ?= v1.2.0
VERSION ?= v1.4.8
all: build push

View File

@@ -2,6 +2,7 @@
apiVersion: v1
kind: ConfigMap
data:
# You can add multiple configuration files here to use a multi-cluster setup
vitastor.conf: |-
{"etcd_address":"http://192.168.7.2:2379","etcd_prefix":"/vitastor"}
metadata:

View File

@@ -49,7 +49,7 @@ spec:
capabilities:
add: ["SYS_ADMIN"]
allowPrivilegeEscalation: true
image: vitalif/vitastor-csi:v1.2.0
image: vitalif/vitastor-csi:v1.4.8
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"
@@ -82,6 +82,8 @@ spec:
name: host-sys
- mountPath: /run/mount
name: host-mount
- mountPath: /run/vitastor-csi
name: run-vitastor-csi
- mountPath: /lib/modules
name: lib-modules
readOnly: true
@@ -132,6 +134,9 @@ spec:
- name: host-mount
hostPath:
path: /run/mount
- name: run-vitastor-csi
hostPath:
path: /run/vitastor-csi
- name: lib-modules
hostPath:
path: /lib/modules

View File

@@ -121,7 +121,7 @@ spec:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
image: vitalif/vitastor-csi:v1.2.0
image: vitalif/vitastor-csi:v1.4.8
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"

View File

@@ -12,9 +12,6 @@ parameters:
etcdVolumePrefix: ""
poolId: "1"
# you can choose another configuration file if you have it in the config map
# different etcd URLs and prefixes should also be put in that config
#configPath: "/etc/vitastor/vitastor.conf"
# you can also specify etcdUrl here, maybe to connect to another Vitastor cluster
# multiple etcdUrls may be specified, delimited by comma
#etcdUrl: "http://192.168.7.2:2379"
#etcdPrefix: "/vitastor"
allowVolumeExpansion: true

View File

@@ -5,7 +5,7 @@ package vitastor
const (
vitastorCSIDriverName = "csi.vitastor.io"
vitastorCSIDriverVersion = "1.2.0"
vitastorCSIDriverVersion = "1.4.8"
)
// Config struct fills the parameters of request or user input

View File

@@ -62,7 +62,7 @@ func NewControllerServer(driver *Driver) *ControllerServer
}
}
func GetConnectionParams(params map[string]string) (map[string]string, []string, string)
func GetConnectionParams(params map[string]string) (map[string]string, error)
{
ctxVars := make(map[string]string)
configPath := params["configPath"]
@@ -75,71 +75,69 @@ func GetConnectionParams(params map[string]string) (map[string]string, []string,
ctxVars["configPath"] = configPath
}
config := make(map[string]interface{})
if configFD, err := os.Open(configPath); err == nil
configFD, err := os.Open(configPath)
if (err != nil)
{
defer configFD.Close()
data, _ := ioutil.ReadAll(configFD)
json.Unmarshal(data, &config)
return nil, err
}
// Try to load prefix & etcd URL from the config
defer configFD.Close()
data, _ := ioutil.ReadAll(configFD)
json.Unmarshal(data, &config)
// Check etcd URL in the config, but do not use the explicit etcdUrl
// parameter for CLI calls, otherwise users won't be able to later
// change them - storage class parameters are saved in volume IDs
var etcdUrl []string
if (params["etcdUrl"] != "")
switch config["etcd_address"].(type)
{
ctxVars["etcdUrl"] = params["etcdUrl"]
etcdUrl = strings.Split(params["etcdUrl"], ",")
case string:
url := strings.TrimSpace(config["etcd_address"].(string))
if (url != "")
{
etcdUrl = strings.Split(url, ",")
}
case []string:
etcdUrl = config["etcd_address"].([]string)
case []interface{}:
for _, url := range config["etcd_address"].([]interface{})
{
s, ok := url.(string)
if (ok)
{
etcdUrl = append(etcdUrl, s)
}
}
}
if (len(etcdUrl) == 0)
{
switch config["etcd_address"].(type)
{
case string:
etcdUrl = strings.Split(config["etcd_address"].(string), ",")
case []string:
etcdUrl = config["etcd_address"].([]string)
}
return nil, status.Error(codes.InvalidArgument, "etcd_address is missing in "+configPath)
}
etcdPrefix := params["etcdPrefix"]
if (etcdPrefix == "")
return ctxVars, nil
}
func system(program string, args ...string) ([]byte, []byte, error)
{
klog.Infof("Running "+program+" "+strings.Join(args, " "))
c := exec.Command(program, args...)
var stdout, stderr bytes.Buffer
c.Stdout, c.Stderr = &stdout, &stderr
err := c.Run()
if (err != nil)
{
etcdPrefix, _ = config["etcd_prefix"].(string)
if (etcdPrefix == "")
{
etcdPrefix = "/vitastor"
}
stdoutStr, stderrStr := string(stdout.Bytes()), string(stderr.Bytes())
klog.Errorf(program+" "+strings.Join(args, " ")+" failed: %s, status %s\n", stdoutStr+stderrStr, err)
return nil, nil, status.Error(codes.Internal, stdoutStr+stderrStr+" (status "+err.Error()+")")
}
else
{
ctxVars["etcdPrefix"] = etcdPrefix
}
return ctxVars, etcdUrl, etcdPrefix
return stdout.Bytes(), stderr.Bytes(), nil
}
func invokeCLI(ctxVars map[string]string, args []string) ([]byte, error)
{
if (ctxVars["etcdUrl"] != "")
{
args = append(args, "--etcd_address", ctxVars["etcdUrl"])
}
if (ctxVars["etcdPrefix"] != "")
{
args = append(args, "--etcd_prefix", ctxVars["etcdPrefix"])
}
if (ctxVars["configPath"] != "")
{
args = append(args, "--config_path", ctxVars["configPath"])
}
c := exec.Command("/usr/bin/vitastor-cli", args...)
var stdout, stderr bytes.Buffer
c.Stdout = &stdout
c.Stderr = &stderr
err := c.Run()
stderrStr := string(stderr.Bytes())
if (err != nil)
{
klog.Errorf("vitastor-cli %s failed: %s, status %s\n", strings.Join(args, " "), stderrStr, err)
return nil, status.Error(codes.Internal, stderrStr+" (status "+err.Error()+")")
}
return stdout.Bytes(), nil
stdout, _, err := system("/usr/bin/vitastor-cli", args...)
return stdout, err
}
// Create the volume
@@ -174,10 +172,10 @@ func (cs *ControllerServer) CreateVolume(ctx context.Context, req *csi.CreateVol
volSize = ((capRange.GetRequiredBytes() + MB - 1) / MB) * MB
}
ctxVars, etcdUrl, _ := GetConnectionParams(req.Parameters)
if (len(etcdUrl) == 0)
ctxVars, err := GetConnectionParams(req.Parameters)
if (err != nil)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
return nil, err
}
args := []string{ "create", volName, "-s", fmt.Sprintf("%v", volSize), "--pool", fmt.Sprintf("%v", poolId) }
@@ -207,7 +205,7 @@ func (cs *ControllerServer) CreateVolume(ctx context.Context, req *csi.CreateVol
}
// Create image using vitastor-cli
_, err := invokeCLI(ctxVars, args)
_, err = invokeCLI(ctxVars, args)
if (err != nil)
{
if (strings.Index(err.Error(), "already exists") > 0)
@@ -257,7 +255,11 @@ func (cs *ControllerServer) DeleteVolume(ctx context.Context, req *csi.DeleteVol
}
volName := volVars["name"]
ctxVars, _, _ := GetConnectionParams(volVars)
ctxVars, err := GetConnectionParams(volVars)
if (err != nil)
{
return nil, err
}
_, err = invokeCLI(ctxVars, []string{ "rm", volName })
if (err != nil)
@@ -469,7 +471,11 @@ func (cs *ControllerServer) DeleteSnapshot(ctx context.Context, req *csi.DeleteS
volName := volVars["name"]
snapName := volVars["snapshot"]
ctxVars, _, _ := GetConnectionParams(volVars)
ctxVars, err := GetConnectionParams(volVars)
if (err != nil)
{
return nil, err
}
_, err = invokeCLI(ctxVars, []string{ "rm", volName+"@"+snapName })
if (err != nil)
@@ -496,7 +502,11 @@ func (cs *ControllerServer) ListSnapshots(ctx context.Context, req *csi.ListSnap
return nil, status.Error(codes.Internal, "volume ID not in JSON format")
}
volName := volVars["name"]
ctxVars, _, _ := GetConnectionParams(volVars)
ctxVars, err := GetConnectionParams(volVars)
if (err != nil)
{
return nil, err
}
inodeCfg, err := invokeList(ctxVars, volName+"@*", false)
if (err != nil)
@@ -555,7 +565,11 @@ func (cs *ControllerServer) ControllerExpandVolume(ctx context.Context, req *csi
return nil, status.Error(codes.Internal, "volume ID not in JSON format")
}
volName := volVars["name"]
ctxVars, _, _ := GetConnectionParams(volVars)
ctxVars, err := GetConnectionParams(volVars)
if (err != nil)
{
return nil, err
}
inodeCfg, err := invokeList(ctxVars, volName, true)
if (err != nil)
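
The reworked `GetConnectionParams` above reads `etcd_address` from the configuration file and accepts either a comma-separated string or a JSON array. The following is a minimal standalone sketch of that normalization, not the driver's code: `parseEtcdAddress` is a hypothetical helper name introduced only for illustration, and the error text is simplified.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// parseEtcdAddress is a hypothetical helper (not part of the driver).
// It mirrors the type switch in GetConnectionParams: etcd_address may be
// a comma-separated string or a JSON array of strings.
func parseEtcdAddress(raw []byte) ([]string, error) {
	config := make(map[string]interface{})
	if err := json.Unmarshal(raw, &config); err != nil {
		return nil, err
	}
	var urls []string
	switch v := config["etcd_address"].(type) {
	case string:
		if s := strings.TrimSpace(v); s != "" {
			urls = strings.Split(s, ",")
		}
	case []interface{}:
		// encoding/json decodes JSON arrays into []interface{},
		// so each element has to be checked individually.
		for _, item := range v {
			if s, ok := item.(string); ok {
				urls = append(urls, s)
			}
		}
	}
	if len(urls) == 0 {
		return nil, fmt.Errorf("etcd_address is missing")
	}
	return urls, nil
}

func main() {
	urls, err := parseEtcdAddress([]byte(`{"etcd_address":"http://192.168.7.2:2379","etcd_prefix":"/vitastor"}`))
	fmt.Println(urls, err)
}
```

When a document is unmarshalled into `map[string]interface{}`, arrays always arrive as `[]interface{}`, which is why the sketch omits a separate `[]string` case.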

View File

@@ -5,11 +5,16 @@ package vitastor
import (
"context"
"errors"
"encoding/json"
"fmt"
"os"
"os/exec"
"encoding/json"
"path/filepath"
"strconv"
"strings"
"bytes"
"syscall"
"time"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
@@ -25,16 +30,102 @@ import (
type NodeServer struct
{
*Driver
useVduse bool
stateDir string
mounter mount.Interface
restartInterval time.Duration
}
type DeviceState struct
{
ConfigPath string `json:"configPath"`
VdpaId string `json:"vdpaId"`
Image string `json:"image"`
Blockdev string `json:"blockdev"`
Readonly bool `json:"readonly"`
PidFile string `json:"pidFile"`
}
// NewNodeServer create new instance node
func NewNodeServer(driver *Driver) *NodeServer
{
return &NodeServer{
stateDir := os.Getenv("STATE_DIR")
if (stateDir == "")
{
stateDir = "/run/vitastor-csi"
}
if (stateDir[len(stateDir)-1] != '/')
{
stateDir += "/"
}
ns := &NodeServer{
Driver: driver,
useVduse: checkVduseSupport(),
stateDir: stateDir,
mounter: mount.New(""),
}
if (ns.useVduse)
{
ns.restoreVduseDaemons()
dur, err := time.ParseDuration(os.Getenv("RESTART_INTERVAL"))
if (err != nil)
{
dur = 10 * time.Second
}
ns.restartInterval = dur
if (ns.restartInterval != time.Duration(0))
{
go ns.restarter()
}
}
return ns
}
func checkVduseSupport() bool
{
// Check VDUSE support (vdpa, vduse, virtio-vdpa kernel modules)
vduse := true
for _, mod := range []string{"vdpa", "vduse", "virtio-vdpa"}
{
_, err := os.Stat("/sys/module/"+mod)
if (err != nil)
{
if (!errors.Is(err, os.ErrNotExist))
{
klog.Errorf("failed to check /sys/module/%s: %v", mod, err)
}
c := exec.Command("/sbin/modprobe", mod)
c.Stdout = os.Stderr
c.Stderr = os.Stderr
err := c.Run()
if (err != nil)
{
klog.Errorf("/sbin/modprobe %s failed: %v", mod, err)
vduse = false
break
}
}
}
// Check that vdpa tool functions
if (vduse)
{
c := exec.Command("/sbin/vdpa", "-j", "dev")
c.Stderr = os.Stderr
err := c.Run()
if (err != nil)
{
klog.Errorf("/sbin/vdpa -j dev failed: %v", err)
vduse = false
}
}
if (!vduse)
{
klog.Errorf(
"Your host apparently has no VDUSE support. VDUSE support disabled, NBD will be used to map devices."+
" For VDUSE you need at least Linux 5.15 and the following kernel modules: vdpa, virtio-vdpa, vduse.",
)
}
return vduse
}
// NodeStageVolume mounts the volume to a staging path on the node.
@@ -61,6 +152,318 @@ func Contains(list []string, s string) bool
return false
}
func (ns *NodeServer) mapNbd(volName string, ctxVars map[string]string, readonly bool) (string, error)
{
// Map NBD device
// FIXME: Check if already mapped
args := []string{
"map", "--image", volName,
}
if (ctxVars["configPath"] != "")
{
args = append(args, "--config_path", ctxVars["configPath"])
}
if (readonly)
{
args = append(args, "--readonly", "1")
}
stdout, stderr, err := system("/usr/bin/vitastor-nbd", args...)
// system() returns nil output on failure, so report its error first
if (err != nil)
{
return "", err
}
dev := strings.TrimSpace(string(stdout))
if (dev == "")
{
return "", fmt.Errorf("vitastor-nbd did not return the name of NBD device. output: %s", stderr)
}
return dev, nil
}
func (ns *NodeServer) unmapNbd(devicePath string)
{
// unmap NBD device
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
}
func findByPidFile(pidFile string) (*os.Process, error)
{
pidBuf, err := os.ReadFile(pidFile)
if (err != nil)
{
return nil, err
}
pid, err := strconv.ParseInt(strings.TrimSpace(string(pidBuf)), 0, 64)
if (err != nil)
{
return nil, err
}
proc, err := os.FindProcess(int(pid))
if (err != nil)
{
return nil, err
}
return proc, nil
}
func killByPidFile(pidFile string) error
{
klog.Infof("killing process with PID from file %s", pidFile)
proc, err := findByPidFile(pidFile)
if (err != nil)
{
return err
}
return proc.Signal(syscall.SIGTERM)
}
func startStorageDaemon(vdpaId, volName, pidFile, configPath string, readonly bool) error
{
// Start qemu-storage-daemon
blockSpec := map[string]interface{}{
"node-name": "disk1",
"driver": "vitastor",
"image": volName,
"cache": map[string]bool{
"direct": true,
"no-flush": false,
},
"discard": "unmap",
}
if (configPath != "")
{
blockSpec["config-path"] = configPath
}
blockSpecJson, _ := json.Marshal(blockSpec)
writable := "true"
if (readonly)
{
writable = "false"
}
_, _, err := system(
"/usr/bin/qemu-storage-daemon", "--daemonize", "--pidfile", pidFile, "--blockdev", string(blockSpecJson),
"--export", "vduse-blk,id="+vdpaId+",node-name=disk1,name="+vdpaId+",num-queues=16,queue-size=128,writable="+writable,
)
return err
}
func (ns *NodeServer) mapVduse(volName string, ctxVars map[string]string, readonly bool) (string, string, error)
{
// Generate state file
stateFd, err := os.CreateTemp(ns.stateDir, "vitastor-vduse-*.json")
if (err != nil)
{
return "", "", err
}
stateFile := stateFd.Name()
stateFd.Close()
vdpaId := filepath.Base(stateFile)
vdpaId = vdpaId[0:len(vdpaId)-5] // remove ".json"
pidFile := ns.stateDir + vdpaId + ".pid"
// Map VDUSE device via qemu-storage-daemon
err = startStorageDaemon(vdpaId, volName, pidFile, ctxVars["configPath"], readonly)
if (err == nil)
{
// Add device to VDPA bus
_, _, err = system("/sbin/vdpa", "-j", "dev", "add", "name", vdpaId, "mgmtdev", "vduse")
if (err == nil)
{
// Find block device name
var matches []string
matches, err = filepath.Glob("/sys/bus/vdpa/devices/"+vdpaId+"/virtio*/block/*")
if (err == nil && len(matches) == 0)
{
err = errors.New("/sys/bus/vdpa/devices/"+vdpaId+"/virtio*/block/* is not found")
}
if (err == nil)
{
blockdev := "/dev/"+filepath.Base(matches[0])
_, err = os.Stat(blockdev)
if (err == nil)
{
// Generate state file
stateJSON, _ := json.Marshal(&DeviceState{
ConfigPath: ctxVars["configPath"],
VdpaId: vdpaId,
Image: volName,
Blockdev: blockdev,
Readonly: readonly,
PidFile: pidFile,
})
err = os.WriteFile(stateFile, stateJSON, 0600)
if (err == nil)
{
return blockdev, vdpaId, nil
}
}
}
}
killErr := killByPidFile(pidFile)
if (killErr != nil)
{
klog.Errorf("Failed to kill started qemu-storage-daemon: %v", killErr)
}
os.Remove(stateFile)
os.Remove(pidFile)
}
return "", "", err
}
func (ns *NodeServer) unmapVduse(devicePath string)
{
if (len(devicePath) < 6 || devicePath[0:6] != "/dev/v")
{
klog.Errorf("%s does not start with /dev/v", devicePath)
return
}
vduseDev, err := os.Readlink("/sys/block/"+devicePath[5:])
if (err != nil)
{
klog.Errorf("%s is not a symbolic link to VDUSE device (../devices/virtual/vduse/xxx): %v", devicePath, err)
return
}
vdpaId := ""
p := strings.Index(vduseDev, "/vduse/")
if (p >= 0)
{
vduseDev = vduseDev[p+7:]
p = strings.Index(vduseDev, "/")
if (p >= 0)
{
vdpaId = vduseDev[0:p]
}
}
if (vdpaId == "")
{
klog.Errorf("%s is not a symbolic link to VDUSE device (../devices/virtual/vduse/xxx), but is %v", devicePath, vduseDev)
return
}
ns.unmapVduseById(vdpaId)
}
func (ns *NodeServer) unmapVduseById(vdpaId string)
{
_, err := os.Stat("/sys/bus/vdpa/devices/"+vdpaId)
if (err != nil)
{
klog.Errorf("failed to stat /sys/bus/vdpa/devices/"+vdpaId+": %v", err)
}
else
{
_, _, _ = system("/sbin/vdpa", "-j", "dev", "del", vdpaId)
}
stateFile := ns.stateDir + vdpaId + ".json"
os.Remove(stateFile)
pidFile := ns.stateDir + vdpaId + ".pid"
_, err = os.Stat(pidFile)
if (os.IsNotExist(err))
{
// ok, already killed
}
else if (err != nil)
{
klog.Errorf("Failed to stat %v: %v", pidFile, err)
return
}
else
{
err = killByPidFile(pidFile)
if (err != nil)
{
klog.Errorf("Failed to kill started qemu-storage-daemon: %v", err)
}
os.Remove(pidFile)
}
}
func (ns *NodeServer) restarter()
{
// Restart dead VDUSE daemons at regular intervals.
// Otherwise volume I/O may hang in case of a qemu-storage-daemon crash.
// Moreover, it may lead to a kernel panic if the kernel is configured to
// panic on hung tasks.
ticker := time.NewTicker(ns.restartInterval)
defer ticker.Stop()
for
{
<-ticker.C
ns.restoreVduseDaemons()
}
}
func (ns *NodeServer) restoreVduseDaemons()
{
pattern := ns.stateDir+"vitastor-vduse-*.json"
matches, err := filepath.Glob(pattern)
if (err != nil)
{
klog.Errorf("failed to list %s: %v", pattern, err)
}
if (len(matches) == 0)
{
return
}
devList := make(map[string]interface{})
// example output: {"dev":{"test1":{"type":"block","mgmtdev":"vduse","vendor_id":0,"max_vqs":16,"max_vq_size":128}}}
devListJSON, _, err := system("/sbin/vdpa", "-j", "dev", "list")
if (err != nil)
{
return
}
err = json.Unmarshal(devListJSON, &devList)
devs, ok := devList["dev"].(map[string]interface{})
if (err != nil || !ok)
{
klog.Errorf("/sbin/vdpa -j dev list returned bad JSON (error %v): %v", err, string(devListJSON))
return
}
for _, stateFile := range matches
{
vdpaId := filepath.Base(stateFile)
vdpaId = vdpaId[0:len(vdpaId)-5]
// Check if VDPA device is still added to the bus
if (devs[vdpaId] != nil)
{
// Check if the storage daemon is still active
pidFile := ns.stateDir + vdpaId + ".pid"
exists := false
proc, err := findByPidFile(pidFile)
if (err == nil)
{
exists = proc.Signal(syscall.Signal(0)) == nil
}
if (!exists)
{
// Restart daemon
stateJSON, err := os.ReadFile(stateFile)
if (err != nil)
{
klog.Warningf("error reading state file %v: %v", stateFile, err)
}
else
{
var state DeviceState
err := json.Unmarshal(stateJSON, &state)
if (err != nil)
{
klog.Warningf("state file %v contains invalid JSON (error %v): %v", stateFile, err, string(stateJSON))
}
else
{
klog.Warningf("restarting storage daemon for volume %v (VDPA ID %v)", state.Image, vdpaId)
_ = startStorageDaemon(vdpaId, state.Image, pidFile, state.ConfigPath, state.Readonly)
}
}
}
}
else
{
// Unused, clean it up
ns.unmapVduseById(vdpaId)
}
}
}
// NodePublishVolume mounts the volume mounted to the staging path to the target path
func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error)
{
@@ -81,13 +484,13 @@ func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis
if (err != nil)
{
klog.Errorf("failed to create block device mount target %s with error: %v", targetPath, err)
return nil, status.Error(codes.Internal, err.Error())
return nil, err
}
err = pathFile.Close()
if (err != nil)
{
klog.Errorf("failed to close %s with error: %v", targetPath, err)
return nil, status.Error(codes.Internal, err.Error())
return nil, err
}
}
else
@@ -96,13 +499,13 @@ func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis
if (err != nil)
{
klog.Errorf("failed to create fs mount target %s with error: %v", targetPath, err)
return nil, status.Error(codes.Internal, err.Error())
return nil, err
}
}
}
else
{
return nil, status.Error(codes.Internal, err.Error())
return nil, err
}
}
@@ -114,38 +517,25 @@ func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis
}
volName := ctxVars["name"]
_, etcdUrl, etcdPrefix := GetConnectionParams(ctxVars)
if (len(etcdUrl) == 0)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
}
// Map NBD device
// FIXME: Check if already mapped
args := []string{
"map", "--etcd_address", strings.Join(etcdUrl, ","),
"--etcd_prefix", etcdPrefix,
"--image", volName,
};
if (ctxVars["configPath"] != "")
{
args = append(args, "--config_path", ctxVars["configPath"])
}
if (req.GetReadonly())
{
args = append(args, "--readonly", "1")
}
c := exec.Command("/usr/bin/vitastor-nbd", args...)
var stdout, stderr bytes.Buffer
c.Stdout, c.Stderr = &stdout, &stderr
err = c.Run()
stdoutStr, stderrStr := string(stdout.Bytes()), string(stderr.Bytes())
_, err = GetConnectionParams(ctxVars)
if (err != nil)
{
klog.Errorf("vitastor-nbd map failed: %s, status %s\n", stdoutStr+stderrStr, err)
return nil, status.Error(codes.Internal, stdoutStr+stderrStr+" (status "+err.Error()+")")
return nil, err
}
var devicePath, vdpaId string
if (!ns.useVduse)
{
devicePath, err = ns.mapNbd(volName, ctxVars, req.GetReadonly())
}
else
{
devicePath, vdpaId, err = ns.mapVduse(volName, ctxVars, req.GetReadonly())
}
if (err != nil)
{
return nil, err
}
devicePath := strings.TrimSpace(stdoutStr)
diskMounter := &mount.SafeFormatAndMount{Interface: ns.mounter, Exec: utilexec.New()}
if (isBlock)
@@ -227,13 +617,15 @@ func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis
return &csi.NodePublishVolumeResponse{}, nil
unmap:
// unmap NBD device
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
if (!ns.useVduse || len(devicePath) >= 8 && devicePath[0:8] == "/dev/nbd")
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
ns.unmapNbd(devicePath)
}
return nil, status.Error(codes.Internal, err.Error())
else
{
ns.unmapVduseById(vdpaId)
}
return nil, err
}
// NodeUnpublishVolume unmounts the volume from the target path
@@ -248,25 +640,31 @@ func (ns *NodeServer) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpu
{
return nil, status.Error(codes.NotFound, "Target path not found")
}
return nil, status.Error(codes.Internal, err.Error())
return nil, err
}
if (devicePath == "")
{
return nil, status.Error(codes.NotFound, "Volume not mounted")
// volume not mounted
klog.Warningf("%s is not a mountpoint, deleting", targetPath)
os.Remove(targetPath)
return &csi.NodeUnpublishVolumeResponse{}, nil
}
// unmount
err = mount.CleanupMountPoint(targetPath, ns.mounter, false)
if (err != nil)
{
return nil, status.Error(codes.Internal, err.Error())
return nil, err
}
// unmap NBD device
if (refCount == 1)
{
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
if (!ns.useVduse)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
ns.unmapNbd(devicePath)
}
else
{
ns.unmapVduse(devicePath)
}
}
return &csi.NodeUnpublishVolumeResponse{}, nil
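
`restoreVduseDaemons` above derives the VDPA device ID from each state file name and then looks it up in the output of `vdpa -j dev list`. The sketch below isolates those two steps. The helper names `vdpaIdFromStateFile` and `deviceOnBus` are hypothetical and do not exist in the driver. The sample JSON is the example output quoted in the source comment.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path/filepath"
	"strings"
)

// vdpaIdFromStateFile (hypothetical helper) derives the VDPA device ID
// from a state file path by dropping the directory and the ".json" suffix.
func vdpaIdFromStateFile(stateFile string) string {
	return strings.TrimSuffix(filepath.Base(stateFile), ".json")
}

// deviceOnBus (hypothetical helper) reports whether id appears under the
// "dev" key of the JSON printed by `vdpa -j dev list`.
func deviceOnBus(devListJSON []byte, id string) (bool, error) {
	var devList struct {
		Dev map[string]json.RawMessage `json:"dev"`
	}
	if err := json.Unmarshal(devListJSON, &devList); err != nil {
		return false, err
	}
	_, ok := devList.Dev[id]
	return ok, nil
}

func main() {
	out := []byte(`{"dev":{"test1":{"type":"block","mgmtdev":"vduse","vendor_id":0,"max_vqs":16,"max_vq_size":128}}}`)
	id := vdpaIdFromStateFile("/run/vitastor-csi/test1.json")
	ok, err := deviceOnBus(out, id)
	fmt.Println(id, ok, err)
}
```

Unlike the driver code, this sketch treats a missing `dev` key as "device not present" rather than as malformed output. Whether that is preferable depends on how strictly the `vdpa` output should be validated.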

View File

@@ -3,5 +3,5 @@
cat < vitastor.Dockerfile > ../Dockerfile
cd ..
mkdir -p packages
sudo podman build --build-arg REL=bookworm -v `pwd`/packages:/root/packages -f Dockerfile .
sudo podman build --build-arg DISTRO=debian --build-arg REL=bookworm -v `pwd`/packages:/root/packages -f Dockerfile .
rm Dockerfile

View File

@@ -3,5 +3,5 @@
cat < vitastor.Dockerfile > ../Dockerfile
cd ..
mkdir -p packages
sudo podman build --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f Dockerfile .
sudo podman build --build-arg DISTRO=debian --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f Dockerfile .
rm Dockerfile

View File

@@ -3,5 +3,5 @@
cat < vitastor.Dockerfile > ../Dockerfile
cd ..
mkdir -p packages
sudo podman build --build-arg REL=buster -v `pwd`/packages:/root/packages -f Dockerfile .
sudo podman build --build-arg DISTRO=debian --build-arg REL=buster -v `pwd`/packages:/root/packages -f Dockerfile .
rm Dockerfile

debian/build-vitastor-ubuntu-jammy.sh (new executable file)
View File

@@ -0,0 +1,7 @@
#!/bin/bash
cat < vitastor.Dockerfile > ../Dockerfile
cd ..
mkdir -p packages
sudo podman build --build-arg DISTRO=ubuntu --build-arg REL=jammy -v `pwd`/packages:/root/packages -f Dockerfile .
rm Dockerfile

debian/changelog
View File

@@ -1,10 +1,10 @@
vitastor (1.2.0-1) unstable; urgency=medium
vitastor (1.4.8-1) unstable; urgency=medium
* Bugfixes
-- Vitaliy Filippov <vitalif@yourcmc.ru> Fri, 03 Jun 2022 02:09:44 +0300
vitastor (1.2.0-1) unstable; urgency=medium
vitastor (0.7.0-1) unstable; urgency=medium
* Implement NFS proxy
* Add documentation

View File

@@ -1,13 +1,14 @@
# Build patched libvirt for Debian Buster or Bullseye/Sid inside a container
# cd ..; podman build --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/libvirt.Dockerfile .
# cd ..; podman build --build-arg DISTRO=debian --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/libvirt.Dockerfile .
ARG DISTRO=
ARG REL=
FROM debian:$REL
FROM $DISTRO:$REL
ARG REL=
WORKDIR /root
RUN if [ "$REL" = "buster" -o "$REL" = "bullseye" ]; then \
RUN if ([ "${DISTRO}" = "debian" ]) && ( [ "${REL}" = "buster" -o "${REL}" = "bullseye" ] ); then \
echo "deb http://deb.debian.org/debian $REL-backports main" >> /etc/apt/sources.list; \
echo >> /etc/apt/preferences; \
echo 'Package: *' >> /etc/apt/preferences; \
@@ -23,7 +24,7 @@ RUN apt-get -y build-dep libvirt0
RUN apt-get -y install libglusterfs-dev
RUN apt-get --download-only source libvirt
ADD patches/libvirt-5.0-vitastor.diff patches/libvirt-7.0-vitastor.diff patches/libvirt-7.5-vitastor.diff patches/libvirt-7.6-vitastor.diff /root
ADD patches/libvirt-5.0-vitastor.diff patches/libvirt-7.0-vitastor.diff patches/libvirt-7.5-vitastor.diff patches/libvirt-7.6-vitastor.diff patches/libvirt-8.0-vitastor.diff /root
RUN set -e; \
mkdir -p /root/packages/libvirt-$REL; \
rm -rf /root/packages/libvirt-$REL/*; \

View File

@@ -7,7 +7,7 @@ ARG REL=
WORKDIR /root
RUN if [ "$REL" = "buster" -o "$REL" = "bullseye" ]; then \
RUN if [ "$REL" = "buster" -o "$REL" = "bullseye" -o "$REL" = "bookworm" ]; then \
echo "deb http://deb.debian.org/debian $REL-backports main" >> /etc/apt/sources.list; \
echo >> /etc/apt/preferences; \
echo 'Package: *' >> /etc/apt/preferences; \
@@ -45,7 +45,7 @@ RUN set -e; \
rm -rf /root/packages/qemu-$REL/*; \
cd /root/packages/qemu-$REL; \
dpkg-source -x /root/qemu*.dsc; \
QEMU_VER=$(ls -d qemu*/ | perl -pe 's!^.*(\d+\.\d+).*!$1!'); \
QEMU_VER=$(ls -d qemu*/ | perl -pe 's!^.*?(\d+\.\d+).*!$1!'); \
D=$(ls -d qemu*/); \
cp /root/vitastor/patches/qemu-$QEMU_VER-vitastor.patch ./qemu-*/debian/patches; \
echo qemu-$QEMU_VER-vitastor.patch >> $D/debian/patches/series; \

View File

@@ -1,8 +1,10 @@
# Build Vitastor packages for Debian inside a container
# cd ..; podman build --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/vitastor.Dockerfile .
# cd ..; podman build --build-arg DISTRO=debian --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/vitastor.Dockerfile .
ARG DISTRO=debian
ARG REL=
FROM debian:$REL
FROM $DISTRO:$REL
ARG DISTRO=debian
ARG REL=
WORKDIR /root
@@ -35,8 +37,8 @@ RUN set -e -x; \
mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/packages/vitastor-$REL/*; \
cd /root/packages/vitastor-$REL; \
cp -r /root/vitastor vitastor-1.2.0; \
cd vitastor-1.2.0; \
cp -r /root/vitastor vitastor-1.4.8; \
cd vitastor-1.4.8; \
ln -s /root/fio-build/fio-*/ ./fio; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
ls /usr/include/linux/raw.h || cp ./debian/raw.h /usr/include/linux/raw.h; \
@@ -49,8 +51,8 @@ RUN set -e -x; \
rm -rf a b; \
echo "dep:fio=$FIO" > debian/fio_version; \
cd /root/packages/vitastor-$REL; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_1.2.0.orig.tar.xz vitastor-1.2.0; \
cd vitastor-1.2.0; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_1.4.8.orig.tar.xz vitastor-1.4.8; \
cd vitastor-1.4.8; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \

View File

@@ -6,15 +6,40 @@
# Client Parameters
These parameters apply only to clients and affect their interaction with
the cluster.
These parameters apply only to Vitastor clients (QEMU, fio, NBD and so on) and
affect their interaction with the cluster.
- [client_retry_interval](#client_retry_interval)
- [client_eio_retry_interval](#client_eio_retry_interval)
- [client_max_dirty_bytes](#client_max_dirty_bytes)
- [client_max_dirty_ops](#client_max_dirty_ops)
- [client_enable_writeback](#client_enable_writeback)
- [client_max_buffered_bytes](#client_max_buffered_bytes)
- [client_max_buffered_ops](#client_max_buffered_ops)
- [client_max_writeback_iodepth](#client_max_writeback_iodepth)
- [nbd_timeout](#nbd_timeout)
- [nbd_max_devices](#nbd_max_devices)
- [nbd_max_part](#nbd_max_part)
## client_retry_interval
- Type: milliseconds
- Default: 50
- Minimum: 10
- Can be changed online: yes
Retry time for I/O requests failed due to inactive PGs or network
connectivity errors.
## client_eio_retry_interval
- Type: milliseconds
- Default: 1000
- Can be changed online: yes
Retry time for I/O requests failed due to data corruption or unfinished
EC object deletions (has_incomplete PG state). 0 disables such retries
and clients are not blocked and just get EIO error code instead.
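The split between the two retry intervals can be pictured with a minimal sketch. This is not the Vitastor client code: the function name and the error-kind labels (`inactive_pg`, `network`, `eio`) are hypothetical and only mirror the two descriptions above.

```python
from typing import Optional


def retry_delay_ms(
    error_kind: str,
    client_retry_interval: int = 50,       # default from the text above
    client_eio_retry_interval: int = 1000,  # default from the text above
) -> Optional[int]:
    """Return the delay before the next retry in ms, or None to give up.

    error_kind values are illustrative labels, not real Vitastor codes.
    """
    if error_kind in ("inactive_pg", "network"):
        # Inactive PGs and connectivity errors use client_retry_interval.
        return client_retry_interval
    if error_kind == "eio":
        # Data corruption / has_incomplete: 0 disables retries, so the
        # caller gets EIO back instead of being blocked.
        if client_eio_retry_interval > 0:
            return client_eio_retry_interval
        return None
    # Anything else is outside the scope of this sketch.
    return None


if __name__ == "__main__":
    print(retry_delay_ms("network"))
    print(retry_delay_ms("eio"))
    print(retry_delay_ms("eio", client_eio_retry_interval=0))
```

With `client_eio_retry_interval` set to 0 the sketch returns `None` for EIO, which corresponds to the client receiving the error code immediately.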
## client_max_dirty_bytes
@@ -101,3 +126,34 @@ Multiple consecutive modified data regions are counted as 1 write here.
- Can be changed online: yes
Maximum number of parallel writes when flushing buffered data to the server.
## nbd_timeout
- Type: seconds
- Default: 300
Timeout for I/O operations for [NBD](../usage/nbd.en.md). If an operation
executes for longer than this timeout, including when your cluster is just
temporarily down for more than timeout, the NBD device will detach by itself
(and possibly break the mounted file system).
You can set timeout to 0 to never detach, but in that case you won't be
able to remove the kernel device at all if the NBD process dies - you'll have
to reboot the host.
## nbd_max_devices
- Type: integer
- Default: 64
Maximum number of NBD devices in the system. This value is passed as
`nbds_max` parameter for the nbd kernel module when vitastor-nbd autoloads it.
## nbd_max_part
- Type: integer
- Default: 3
Maximum number of partitions per NBD device. This value is passed as
`max_part` parameter for the nbd kernel module when vitastor-nbd autoloads it.
Note that (nbds_max)*(1+max_part) usually can't exceed 256.
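As a quick sanity check of the rule above, the arithmetic can be sketched as follows. The helper names are hypothetical and the limit of 256 is taken from the note above, which itself says "usually"; this is not how vitastor-nbd or the kernel module validates the values.

```python
# Hypothetical helpers, not part of Vitastor: they only restate the rule of
# thumb (nbds_max) * (1 + max_part) <= 256 quoted in the text above.
NBD_SLOT_LIMIT = 256  # taken from the note above; "usually", not guaranteed


def nbd_slots_needed(nbd_max_devices: int, nbd_max_part: int) -> int:
    """One slot for each device itself plus one per allowed partition."""
    return nbd_max_devices * (1 + nbd_max_part)


def nbd_limits_ok(nbd_max_devices: int, nbd_max_part: int) -> bool:
    """True if the combination stays within the assumed limit."""
    return nbd_slots_needed(nbd_max_devices, nbd_max_part) <= NBD_SLOT_LIMIT


if __name__ == "__main__":
    # Defaults from this page: 64 devices with up to 3 partitions each.
    print(nbd_slots_needed(64, 3), nbd_limits_ok(64, 3))
    # Doubling the device count without lowering max_part exceeds the limit.
    print(nbd_slots_needed(128, 3), nbd_limits_ok(128, 3))
```

The defaults (64 and 3) land exactly on the limit, so raising `nbd_max_devices` normally means lowering `nbd_max_part` as well.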


@@ -6,15 +6,41 @@
# Параметры клиентского кода
Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD) и
Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD и т.п.) и
затрагивают логику их работы с кластером.
- [client_retry_interval](#client_retry_interval)
- [client_eio_retry_interval](#client_eio_retry_interval)
- [client_max_dirty_bytes](#client_max_dirty_bytes)
- [client_max_dirty_ops](#client_max_dirty_ops)
- [client_enable_writeback](#client_enable_writeback)
- [client_max_buffered_bytes](#client_max_buffered_bytes)
- [client_max_buffered_ops](#client_max_buffered_ops)
- [client_max_writeback_iodepth](#client_max_writeback_iodepth)
- [nbd_timeout](#nbd_timeout)
- [nbd_max_devices](#nbd_max_devices)
- [nbd_max_part](#nbd_max_part)
## client_retry_interval
- Тип: миллисекунды
- Значение по умолчанию: 50
- Минимальное значение: 10
- Можно менять на лету: да
Время повтора запросов ввода-вывода, неудачных из-за неактивных PG или
ошибок сети.
## client_eio_retry_interval
- Тип: миллисекунды
- Значение по умолчанию: 1000
- Можно менять на лету: да
Время повтора запросов ввода-вывода, неудачных из-за повреждения данных
или незавершённых удалений EC-объектов (состояния PG has_incomplete).
0 отключает повторы таких запросов и клиенты не блокируются, а вместо
этого просто получают код ошибки EIO.
## client_max_dirty_bytes
@@ -101,3 +127,34 @@
- Можно менять на лету: да
Максимальное число параллельных операций записи при сбросе буферов на сервер.
## nbd_timeout
- Тип: секунды
- Значение по умолчанию: 300
Таймаут для операций чтения/записи через [NBD](../usage/nbd.ru.md). Если
операция выполняется дольше таймаута, включая временную недоступность
кластера на время, большее таймаута, NBD-устройство отключится само собой
(и, возможно, сломает примонтированную ФС).
Вы можете установить таймаут в 0, чтобы никогда не отключать устройство по
таймауту, но в этом случае вы вообще не сможете удалить устройство, если
процесс NBD умрёт - вам придётся перезагружать сервер.
## nbd_max_devices
- Тип: целое число
- Значение по умолчанию: 64
Максимальное число NBD-устройств в системе. Данное значение передаётся
модулю ядра nbd как параметр `nbds_max`, когда его загружает vitastor-nbd.
## nbd_max_part
- Тип: целое число
- Значение по умолчанию: 3
Максимальное число разделов на одном NBD-устройстве. Данное значение передаётся
модулю ядра nbd как параметр `max_part`, когда его загружает vitastor-nbd.
Имейте в виду, что (nbds_max)*(1+max_part) обычно не может превышать 256.


@@ -19,8 +19,8 @@ These parameters only apply to Monitors.
## etcd_mon_ttl
- Type: seconds
- Default: 30
- Minimum: 10
- Default: 1
- Minimum: 5
Monitor etcd lease refresh interval in seconds


@@ -19,8 +19,8 @@
## etcd_mon_ttl
- Тип: секунды
- Значение по умолчанию: 30
- Минимальное значение: 10
- Значение по умолчанию: 1
- Минимальное значение: 5
Интервал обновления etcd резервации (lease) монитором


@@ -25,7 +25,6 @@ between clients, OSDs and etcd.
- [peer_connect_timeout](#peer_connect_timeout)
- [osd_idle_timeout](#osd_idle_timeout)
- [osd_ping_timeout](#osd_ping_timeout)
- [up_wait_retry_interval](#up_wait_retry_interval)
- [max_etcd_attempts](#max_etcd_attempts)
- [etcd_quick_timeout](#etcd_quick_timeout)
- [etcd_slow_timeout](#etcd_slow_timeout)
@@ -212,17 +211,6 @@ Maximum time to wait for OSD keepalive responses. If an OSD doesn't respond
within this time, the connection to it is dropped and a reconnection attempt
is scheduled.
## up_wait_retry_interval
- Type: milliseconds
- Default: 500
- Minimum: 50
- Can be changed online: yes
OSDs respond to clients with a special error code when they receive I/O
requests for a PG that's not synchronized and started. This parameter sets
the time for the clients to wait before re-attempting such I/O requests.
## max_etcd_attempts
- Type: integer


@@ -25,7 +25,6 @@
- [peer_connect_timeout](#peer_connect_timeout)
- [osd_idle_timeout](#osd_idle_timeout)
- [osd_ping_timeout](#osd_ping_timeout)
- [up_wait_retry_interval](#up_wait_retry_interval)
- [max_etcd_attempts](#max_etcd_attempts)
- [etcd_quick_timeout](#etcd_quick_timeout)
- [etcd_slow_timeout](#etcd_slow_timeout)
@@ -221,19 +220,6 @@ OSD в любом случае согласовывают реальное зн
Если OSD не отвечает за это время, соединение отключается и производится
повторная попытка соединения.
## up_wait_retry_interval
- Тип: миллисекунды
- Значение по умолчанию: 500
- Минимальное значение: 50
- Можно менять на лету: да
Когда OSD получают от клиентов запросы ввода-вывода, относящиеся к не
поднятым на данный момент на них PG, либо к PG в процессе синхронизации,
они отвечают клиентам специальным кодом ошибки, означающим, что клиент
должен некоторое время подождать перед повторением запроса. Именно это время
ожидания задаёт данный параметр.
## max_etcd_attempts
- Тип: целое число


@@ -19,6 +19,7 @@ them, even without restarting by updating configuration in etcd.
- [autosync_interval](#autosync_interval)
- [autosync_writes](#autosync_writes)
- [recovery_queue_depth](#recovery_queue_depth)
- [recovery_sleep_us](#recovery_sleep_us)
- [recovery_pg_switch](#recovery_pg_switch)
- [recovery_sync_batch](#recovery_sync_batch)
- [readonly](#readonly)
@@ -51,6 +52,14 @@ them, even without restarting by updating configuration in etcd.
- [scrub_list_limit](#scrub_list_limit)
- [scrub_find_best](#scrub_find_best)
- [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce)
- [recovery_tune_interval](#recovery_tune_interval)
- [recovery_tune_util_low](#recovery_tune_util_low)
- [recovery_tune_util_high](#recovery_tune_util_high)
- [recovery_tune_client_util_low](#recovery_tune_client_util_low)
- [recovery_tune_client_util_high](#recovery_tune_client_util_high)
- [recovery_tune_agg_interval](#recovery_tune_agg_interval)
- [recovery_tune_sleep_min_us](#recovery_tune_sleep_min_us)
- [recovery_tune_sleep_cutoff_us](#recovery_tune_sleep_cutoff_us)
## etcd_report_interval
@@ -135,12 +144,24 @@ operations before issuing an fsync operation internally.
## recovery_queue_depth
- Type: integer
- Default: 4
- Default: 1
- Can be changed online: yes
Maximum recovery operations per one primary OSD at any given moment of time.
Currently it's the only parameter available to tune the speed of recovery
and rebalancing, but it's planned to implement more.
Maximum recovery and rebalance operations initiated by each OSD in parallel.
Note that each OSD talks to a lot of other OSDs so actual number of parallel
recovery operations per each OSD is greater than just recovery_queue_depth.
Increasing this parameter can speed up recovery if [auto-tuning](#recovery_tune_interval)
allows it or if it is disabled.
## recovery_sleep_us
- Type: microseconds
- Default: 0
- Can be changed online: yes
Delay for all recovery- and rebalance- related operations. If non-zero,
such operations are artificially slowed down to reduce the impact on
client I/O.
## recovery_pg_switch
@@ -508,3 +529,90 @@ the variant with most available equal copies is correct. For example, if
you have 3 replicas and 1 of them differs, this one is considered to be
corrupted. But if there is no "best" version with more copies than all
others have then the object is also marked as inconsistent.
## recovery_tune_interval
- Type: seconds
- Default: 1
- Can be changed online: yes
Interval at which OSD re-considers client and recovery load and automatically
adjusts [recovery_sleep_us](#recovery_sleep_us). Recovery auto-tuning is
disabled if recovery_tune_interval is set to 0.
Auto-tuning targets utilization. Utilization is a measure of load and is
equal to the product of iops and average latency (so it may be greater
than 1). You set "low" and "high" client utilization thresholds and two
corresponding target recovery utilization levels. OSD calculates desired
recovery utilization from client utilization using linear interpolation
and auto-tunes recovery operation delay to make actual recovery utilization
match desired.
This reduces the impact of recovery and rebalance on client operations. It is
of course impossible to remove that impact completely, but it should become
acceptable. In some tests, rebalance previously dropped client write speed from
1.5 GB/s to 50-100 MB/s; with default auto-tuning settings it now only drops it
to ~1 GB/s.
## recovery_tune_util_low
- Type: number
- Default: 0.1
- Can be changed online: yes
Desired recovery/rebalance utilization when client load is high, i.e. when
it is at or above recovery_tune_client_util_high.
## recovery_tune_util_high
- Type: number
- Default: 1
- Can be changed online: yes
Desired recovery/rebalance utilization when client load is low, i.e. when
it is at or below recovery_tune_client_util_low.
## recovery_tune_client_util_low
- Type: number
- Default: 0
- Can be changed online: yes
Client utilization considered "low".
## recovery_tune_client_util_high
- Type: number
- Default: 0.5
- Can be changed online: yes
Client utilization considered "high".
## recovery_tune_agg_interval
- Type: integer
- Default: 10
- Can be changed online: yes
The number of last auto-tuning iterations to use for calculating the
delay as average. Lower values result in quicker response to client
load change, higher values result in more stable delay. Default value of 10
is usually fine.
## recovery_tune_sleep_min_us
- Type: microseconds
- Default: 10
- Can be changed online: yes
Minimum possible value for auto-tuned recovery_sleep_us. Lower values
are changed to 0.
## recovery_tune_sleep_cutoff_us
- Type: microseconds
- Default: 10000000
- Can be changed online: yes
Maximum possible value for auto-tuned recovery_sleep_us. Higher values
are treated as outliers and ignored in aggregation.
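The relationship between the four utilization parameters can be illustrated with a minimal sketch. It follows only the description above (utilization as iops times average latency, linear interpolation between the "low" and "high" thresholds); the function names are hypothetical, and the actual OSD logic, including averaging over recovery_tune_agg_interval and the conversion to recovery_sleep_us, is not shown and may differ.

```python
def utilization(iops: float, avg_latency_s: float) -> float:
    """Utilization as described above: iops times average latency (seconds)."""
    return iops * avg_latency_s


def desired_recovery_util(
    client_util: float,
    client_util_low: float = 0.0,   # recovery_tune_client_util_low default
    client_util_high: float = 0.5,  # recovery_tune_client_util_high default
    util_low: float = 0.1,          # recovery_tune_util_low default
    util_high: float = 1.0,         # recovery_tune_util_high default
) -> float:
    """Interpolate the target recovery utilization from client utilization."""
    if client_util <= client_util_low:
        # Clients are idle enough: recovery may use the "high" target.
        return util_high
    if client_util >= client_util_high:
        # Clients are busy: recovery is held to the "low" target.
        return util_low
    # Linear interpolation between the two thresholds.
    frac = (client_util - client_util_low) / (client_util_high - client_util_low)
    return util_high + (util_low - util_high) * frac


if __name__ == "__main__":
    # 2000 iops at 0.5 ms average latency is a utilization of about 1.
    print(utilization(2000, 0.0005))
    # Halfway between the default client thresholds (0 and 0.5).
    print(desired_recovery_util(0.25))
```

With the defaults, a client utilization halfway between the thresholds gives a recovery target roughly halfway between 1 and 0.1, and any client utilization of 0.5 or more pins the target at 0.1.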


@@ -20,6 +20,7 @@
- [autosync_interval](#autosync_interval)
- [autosync_writes](#autosync_writes)
- [recovery_queue_depth](#recovery_queue_depth)
- [recovery_sleep_us](#recovery_sleep_us)
- [recovery_pg_switch](#recovery_pg_switch)
- [recovery_sync_batch](#recovery_sync_batch)
- [readonly](#readonly)
@@ -52,6 +53,14 @@
- [scrub_list_limit](#scrub_list_limit)
- [scrub_find_best](#scrub_find_best)
- [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce)
- [recovery_tune_interval](#recovery_tune_interval)
- [recovery_tune_util_low](#recovery_tune_util_low)
- [recovery_tune_util_high](#recovery_tune_util_high)
- [recovery_tune_client_util_low](#recovery_tune_client_util_low)
- [recovery_tune_client_util_high](#recovery_tune_client_util_high)
- [recovery_tune_agg_interval](#recovery_tune_agg_interval)
- [recovery_tune_sleep_min_us](#recovery_tune_sleep_min_us)
- [recovery_tune_sleep_cutoff_us](#recovery_tune_sleep_cutoff_us)
## etcd_report_interval
@@ -138,13 +147,25 @@ OSD, чтобы успевать очищать журнал - без них OSD
## recovery_queue_depth
- Тип: целое число
- Значение по умолчанию: 4
- Значение по умолчанию: 1
- Можно менять на лету: да
Максимальное число операций восстановления на одном первичном OSD в любой
момент времени. На данный момент единственный параметр, который можно менять
для ускорения или замедления восстановления и перебалансировки данных, но
в планах реализация других параметров.
Максимальное число параллельных операций восстановления, инициируемых одним
OSD в любой момент времени. Имейте в виду, что каждый OSD обычно работает с
многими другими OSD, так что на практике параллелизм восстановления больше,
чем просто recovery_queue_depth. Увеличение значения этого параметра может
ускорить восстановление если [автотюнинг скорости](#recovery_tune_interval)
разрешает это или если он отключён.
## recovery_sleep_us
- Тип: микросекунды
- Значение по умолчанию: 0
- Можно менять на лету: да
Delay for all recovery- and rebalance- related operations. If non-zero,
such operations are artificially slowed down to reduce the impact on
client I/O.
## recovery_pg_switch
@@ -535,3 +556,93 @@ EC (кодов коррекции ошибок) с более, чем 1 диск
считается некорректной. Однако, если "лучшую" версию с числом доступных
копий большим, чем у всех других версий, найти невозможно, то объект тоже
маркируется неконсистентным.
## recovery_tune_interval
- Тип: секунды
- Значение по умолчанию: 1
- Можно менять на лету: да
Интервал, с которым OSD пересматривает клиентскую нагрузку и нагрузку
восстановления и автоматически подстраивает [recovery_sleep_us](#recovery_sleep_us).
Автотюнинг (автоподстройка) отключается, если recovery_tune_interval
устанавливается в значение 0.
Автотюнинг регулирует утилизацию. Утилизация является мерой нагрузки
и равна произведению числа операций в секунду и средней задержки
(то есть, она может быть выше 1). Вы задаёте два уровня клиентской
утилизации - "низкий" и "высокий" (low и high) и два соответствующих
целевых уровня утилизации операциями восстановления. OSD рассчитывает
желаемый уровень утилизации восстановления линейной интерполяцией от
клиентской утилизации и подстраивает задержку операций восстановления
так, чтобы фактическая утилизация восстановления совпадала с желаемой.
Это позволяет снизить влияние восстановления и ребаланса на клиентские
операции. Конечно, невозможно исключить такое влияние полностью, но оно
должно становиться адекватнее. В некоторых тестах перебалансировка могла
снижать клиентскую скорость записи с 1.5 ГБ/с до 50-100 МБ/с, а теперь, с
настройками автотюнинга по умолчанию, она снижается только до ~1 ГБ/с.
## recovery_tune_util_low
- Тип: число
- Значение по умолчанию: 0.1
- Можно менять на лету: да
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
высокая, то есть, находится на уровне или выше recovery_tune_client_util_high.
## recovery_tune_util_high
- Тип: число
- Значение по умолчанию: 1
- Можно менять на лету: да
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
низкая, то есть, находится на уровне или ниже recovery_tune_client_util_low.
## recovery_tune_client_util_low
- Тип: число
- Значение по умолчанию: 0
- Можно менять на лету: да
Клиентская утилизация, которая считается "низкой".
## recovery_tune_client_util_high
- Тип: число
- Значение по умолчанию: 0.5
- Можно менять на лету: да
Клиентская утилизация, которая считается "высокой".
## recovery_tune_agg_interval
- Тип: целое число
- Значение по умолчанию: 10
- Можно менять на лету: да
Число последних итераций автоподстройки для расчёта задержки как среднего
значения. Меньшие значения параметра ускоряют отклик на изменение нагрузки,
большие значения делают задержку стабильнее. Значение по умолчанию 10
обычно нормальное и не требует изменений.
## recovery_tune_sleep_min_us
- Тип: микросекунды
- Значение по умолчанию: 10
- Можно менять на лету: да
Минимальное возможное значение авто-подстроенного recovery_sleep_us.
Меньшие значения заменяются на 0.
## recovery_tune_sleep_cutoff_us
- Тип: микросекунды
- Значение по умолчанию: 10000000
- Можно менять на лету: да
Максимальное возможное значение авто-подстроенного recovery_sleep_us.
Большие значения считаются случайными выбросами и игнорируются в
усреднении.


@@ -154,6 +154,9 @@ That is, if it becomes impossible to place PG data on at least (pg_minsize)
OSDs, PG is deactivated for both read and write. So you know that a fresh
write always goes to at least (pg_minsize) OSDs (disks).
That is, pg_size minus pg_minsize sets the number of disk failures to tolerate
without temporary downtime (for [osd_out_time](monitor.en.md#osd_out_time)).
FIXME: pg_minsize behaviour may be changed in the future to only make PGs
read-only instead of deactivating them.


@@ -157,6 +157,10 @@
OSD, PG деактивируется на чтение и запись. Иными словами, всегда известно,
что новые блоки данных всегда записываются как минимум на pg_minsize дисков.
По сути, разница pg_size и pg_minsize задаёт число отказов дисков, которые пул
может пережить без временной (на [osd_out_time](monitor.ru.md#osd_out_time))
остановки обслуживания.
FIXME: Поведение pg_minsize может быть изменено в будущем с полной деактивации
PG на перевод их в режим только для чтения.


@@ -1,4 +1,4 @@
# Client Parameters
These parameters apply only to clients and affect their interaction with
the cluster.
These parameters apply only to Vitastor clients (QEMU, fio, NBD and so on) and
affect their interaction with the cluster.


@@ -1,4 +1,4 @@
# Параметры клиентского кода
Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD) и
Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD и т.п.) и
затрагивают логику их работы с кластером.


@@ -1,3 +1,27 @@
- name: client_retry_interval
type: ms
min: 10
default: 50
online: true
info: |
Retry time for I/O requests failed due to inactive PGs or network
connectivity errors.
info_ru: |
Время повтора запросов ввода-вывода, неудачных из-за неактивных PG или
ошибок сети.
- name: client_eio_retry_interval
type: ms
default: 1000
online: true
info: |
Retry time for I/O requests failed due to data corruption or unfinished
EC object deletions (has_incomplete PG state). 0 disables such retries
and clients are not blocked and just get EIO error code instead.
info_ru: |
Время повтора запросов ввода-вывода, неудачных из-за повреждения данных
или незавершённых удалений EC-объектов (состояния PG has_incomplete).
0 отключает повторы таких запросов и клиенты не блокируются, а вместо
этого просто получают код ошибки EIO.
- name: client_max_dirty_bytes
type: int
default: 33554432
@@ -122,3 +146,47 @@
Maximum number of parallel writes when flushing buffered data to the server.
info_ru: |
Максимальное число параллельных операций записи при сбросе буферов на сервер.
- name: nbd_timeout
type: sec
default: 300
online: false
info: |
Timeout for I/O operations for [NBD](../usage/nbd.en.md). If an operation
executes for longer than this timeout, including when your cluster is just
temporarily down for more than timeout, the NBD device will detach by itself
(and possibly break the mounted file system).
You can set timeout to 0 to never detach, but in that case you won't be
able to remove the kernel device at all if the NBD process dies - you'll have
to reboot the host.
info_ru: |
Таймаут для операций чтения/записи через [NBD](../usage/nbd.ru.md). Если
операция выполняется дольше таймаута, включая временную недоступность
кластера на время, большее таймаута, NBD-устройство отключится само собой
(и, возможно, сломает примонтированную ФС).
Вы можете установить таймаут в 0, чтобы никогда не отключать устройство по
таймауту, но в этом случае вы вообще не сможете удалить устройство, если
процесс NBD умрёт - вам придётся перезагружать сервер.
- name: nbd_max_devices
type: int
default: 64
online: false
info: |
Maximum number of NBD devices in the system. This value is passed as
`nbds_max` parameter for the nbd kernel module when vitastor-nbd autoloads it.
info_ru: |
Максимальное число NBD-устройств в системе. Данное значение передаётся
модулю ядра nbd как параметр `nbds_max`, когда его загружает vitastor-nbd.
- name: nbd_max_part
type: int
default: 3
online: false
info: |
Maximum number of partitions per NBD device. This value is passed as
`max_part` parameter for the nbd kernel module when vitastor-nbd autoloads it.
Note that (nbds_max)*(1+max_part) usually can't exceed 256.
info_ru: |
Максимальное число разделов на одном NBD-устройстве. Данное значение передаётся
модулю ядра nbd как параметр `max_part`, когда его загружает vitastor-nbd.
Имейте в виду, что (nbds_max)*(1+max_part) обычно не может превышать 256.


@@ -38,6 +38,7 @@ const types = {
bool: 'boolean',
int: 'integer',
sec: 'seconds',
float: 'number',
ms: 'milliseconds',
us: 'microseconds',
},
@@ -46,6 +47,7 @@ const types = {
bool: 'булево (да/нет)',
int: 'целое число',
sec: 'секунды',
float: 'число',
ms: 'миллисекунды',
us: 'микросекунды',
},


@@ -1,7 +1,7 @@
- name: etcd_mon_ttl
type: sec
min: 10
default: 30
min: 5
default: 1
info: Monitor etcd lease refresh interval in seconds
info_ru: Интервал обновления etcd резервации (lease) монитором
- name: etcd_mon_timeout

View File

@@ -243,21 +243,6 @@
Максимальное время ожидания ответа на запрос проверки состояния соединения.
Если OSD не отвечает за это время, соединение отключается и производится
повторная попытка соединения.
- name: up_wait_retry_interval
type: ms
min: 50
default: 500
online: true
info: |
OSDs respond to clients with a special error code when they receive I/O
requests for a PG that's not synchronized and started. This parameter sets
the time for the clients to wait before re-attempting such I/O requests.
info_ru: |
Когда OSD получают от клиентов запросы ввода-вывода, относящиеся к не
поднятым на данный момент на них PG, либо к PG в процессе синхронизации,
они отвечают клиентам специальным кодом ошибки, означающим, что клиент
должен некоторое время подождать перед повторением запроса. Именно это время
ожидания задаёт данный параметр.
- name: max_etcd_attempts
type: int
default: 5

View File

@@ -107,17 +107,29 @@
принудительной отправкой fsync-а.
- name: recovery_queue_depth
type: int
default: 4
default: 1
online: true
info: |
Maximum recovery operations per one primary OSD at any given moment of time.
Currently it's the only parameter available to tune the speed of recovery
and rebalancing, but it's planned to implement more.
Maximum recovery and rebalance operations initiated by each OSD in parallel.
Note that each OSD talks to a lot of other OSDs so actual number of parallel
recovery operations per each OSD is greater than just recovery_queue_depth.
Increasing this parameter can speed up recovery if [auto-tuning](#recovery_tune_interval)
allows it or if it is disabled.
info_ru: |
Максимальное число операций восстановления на одном первичном OSD в любой
момент времени. На данный момент единственный параметр, который можно менять
для ускорения или замедления восстановления и перебалансировки данных, но
в планах реализация других параметров.
Максимальное число параллельных операций восстановления, инициируемых одним
OSD в любой момент времени. Имейте в виду, что каждый OSD обычно работает с
многими другими OSD, так что на практике параллелизм восстановления больше,
чем просто recovery_queue_depth. Увеличение значения этого параметра может
ускорить восстановление если [автотюнинг скорости](#recovery_tune_interval)
разрешает это или если он отключён.
- name: recovery_sleep_us
type: us
default: 0
online: true
info: |
Delay for all recovery- and rebalance- related operations. If non-zero,
such operations are artificially slowed down to reduce the impact on
client I/O.
- name: recovery_pg_switch
type: int
default: 128
@@ -626,3 +638,112 @@
считается некорректной. Однако, если "лучшую" версию с числом доступных
копий большим, чем у всех других версий, найти невозможно, то объект тоже
маркируется неконсистентным.
- name: recovery_tune_interval
type: sec
default: 1
online: true
info: |
Interval at which OSD re-considers client and recovery load and automatically
adjusts [recovery_sleep_us](#recovery_sleep_us). Recovery auto-tuning is
disabled if recovery_tune_interval is set to 0.
Auto-tuning targets utilization. Utilization is a measure of load and is
equal to the product of iops and average latency (so it may be greater
than 1). You set "low" and "high" client utilization thresholds and two
corresponding target recovery utilization levels. OSD calculates desired
recovery utilization from client utilization using linear interpolation
and auto-tunes recovery operation delay to make actual recovery utilization
match desired.
This reduces the impact of recovery and rebalance on client operations. It is
of course impossible to remove that impact completely, but it should become
acceptable. In some tests, rebalance previously dropped client write speed from
1.5 GB/s to 50-100 MB/s; with default auto-tuning settings it now only drops it
to ~1 GB/s.
info_ru: |
Интервал, с которым OSD пересматривает клиентскую нагрузку и нагрузку
восстановления и автоматически подстраивает [recovery_sleep_us](#recovery_sleep_us).
Автотюнинг (автоподстройка) отключается, если recovery_tune_interval
устанавливается в значение 0.
Автотюнинг регулирует утилизацию. Утилизация является мерой нагрузки
и равна произведению числа операций в секунду и средней задержки
(то есть, она может быть выше 1). Вы задаёте два уровня клиентской
утилизации - "низкий" и "высокий" (low и high) и два соответствующих
целевых уровня утилизации операциями восстановления. OSD рассчитывает
желаемый уровень утилизации восстановления линейной интерполяцией от
клиентской утилизации и подстраивает задержку операций восстановления
так, чтобы фактическая утилизация восстановления совпадала с желаемой.
Это позволяет снизить влияние восстановления и ребаланса на клиентские
операции. Конечно, невозможно исключить такое влияние полностью, но оно
должно становиться адекватнее. В некоторых тестах перебалансировка могла
снижать клиентскую скорость записи с 1.5 ГБ/с до 50-100 МБ/с, а теперь, с
настройками автотюнинга по умолчанию, она снижается только до ~1 ГБ/с.
- name: recovery_tune_util_low
type: float
default: 0.1
online: true
info: |
Desired recovery/rebalance utilization when client load is high, i.e. when
it is at or above recovery_tune_client_util_high.
info_ru: |
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
высокая, то есть, находится на уровне или выше recovery_tune_client_util_high.
- name: recovery_tune_util_high
type: float
default: 1
online: true
info: |
Desired recovery/rebalance utilization when client load is low, i.e. when
it is at or below recovery_tune_client_util_low.
info_ru: |
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
низкая, то есть, находится на уровне или ниже recovery_tune_client_util_low.
- name: recovery_tune_client_util_low
type: float
default: 0
online: true
info: Client utilization considered "low".
info_ru: Клиентская утилизация, которая считается "низкой".
- name: recovery_tune_client_util_high
type: float
default: 0.5
online: true
info: Client utilization considered "high".
info_ru: Клиентская утилизация, которая считается "высокой".
- name: recovery_tune_agg_interval
type: int
default: 10
online: true
info: |
The number of last auto-tuning iterations to use for calculating the
delay as average. Lower values result in quicker response to client
load change, higher values result in more stable delay. Default value of 10
is usually fine.
info_ru: |
Число последних итераций автоподстройки для расчёта задержки как среднего
значения. Меньшие значения параметра ускоряют отклик на изменение нагрузки,
большие значения делают задержку стабильнее. Значение по умолчанию 10
обычно нормальное и не требует изменений.
- name: recovery_tune_sleep_min_us
type: us
default: 10
online: true
info: |
Minimum possible value for auto-tuned recovery_sleep_us. Lower values
are changed to 0.
info_ru: |
Минимальное возможное значение авто-подстроенного recovery_sleep_us.
Меньшие значения заменяются на 0.
- name: recovery_tune_sleep_cutoff_us
type: us
default: 10000000
online: true
info: |
Maximum possible value for auto-tuned recovery_sleep_us. Higher values
are treated as outliers and ignored in aggregation.
info_ru: |
Максимальное возможное значение авто-подстроенного recovery_sleep_us.
Большие значения считаются случайными выбросами и игнорируются в
усреднении.


@@ -19,6 +19,14 @@ for i in ./???-*.yaml; do kubectl apply -f $i; done
After that you'll be able to create PersistentVolumes.
**Important:** For best experience, use Linux kernel at least 5.15 with [VDUSE](../usage/qemu.en.md#vduse)
kernel modules enabled (vdpa, vduse, virtio-vdpa). If your distribution doesn't
have them pre-built - build them yourself ([instructions](../usage/qemu.en.md#vduse)),
I promise it's worth it :-). When VDUSE is unavailable, CSI driver uses [NBD](../usage/nbd.en.md)
to map Vitastor devices. NBD is slower and prone to timeout issues: if Vitastor
cluster becomes unresponsive for more than [nbd_timeout](../config/client.en.md#nbd_timeout),
the NBD device detaches and breaks pods using it.
## Features
Vitastor CSI supports:
@@ -27,5 +35,9 @@ Vitastor CSI supports:
- Raw block RWX (ReadWriteMany) volumes. Example: [PVC](../../csi/deploy/example-pvc-block.yaml), [pod](../../csi/deploy/example-test-pod-block.yaml)
- Volume expansion
- Volume snapshots. Example: [snapshot class](../../csi/deploy/example-snapshot-class.yaml), [snapshot](../../csi/deploy/example-snapshot.yaml), [clone](../../csi/deploy/example-snapshot-clone.yaml)
- [VDUSE](../usage/qemu.en.md#vduse) (preferred) and [NBD](../usage/nbd.en.md) device mapping methods
- Upgrades with VDUSE - new handler processes are restarted when CSI pods are restarted themselves
- VDUSE daemon auto-restart - handler processes are automatically restarted if they crash due to a bug in Vitastor client code
- Multiple clusters by using multiple configuration files in ConfigMap.
Remember that to use snapshots with CSI you also have to install [Snapshot Controller and CRDs](https://kubernetes-csi.github.io/docs/snapshot-controller.html#deployment).


@@ -19,6 +19,14 @@ for i in ./???-*.yaml; do kubectl apply -f $i; done
After that you will be able to create PersistentVolumes.
**Important:** It is best to use a Linux kernel version 5.15 or later with the
[VDUSE](../usage/qemu.ru.md#vduse) modules enabled (vdpa, vduse, virtio-vdpa). If your distribution
does not build them out of the box, build them yourself; I promise it is worth it ([instructions](../usage/qemu.ru.md#vduse)) :-).
When VDUSE is unavailable, the CSI plugin uses [NBD](../usage/nbd.ru.md) to map
disks, and NBD is slower and has a timeout problem: if the cluster stays unavailable
for longer than [nbd_timeout](../config/client.ru.md#nbd_timeout), the NBD device detaches
and breaks the pods using it.
## Features
The Vitastor CSI plugin supports:
@@ -27,5 +35,9 @@ The Vitastor CSI plugin supports:
- Raw block RWX (ReadWriteMany) volumes. Example: [PVC](../../csi/deploy/example-pvc-block.yaml), [pod](../../csi/deploy/example-test-pod-block.yaml)
- Volume expansion
- Volume snapshots. Example: [snapshot class](../../csi/deploy/example-snapshot-class.yaml), [snapshot](../../csi/deploy/example-snapshot.yaml), [snapshot clone](../../csi/deploy/example-snapshot-clone.yaml)
- [VDUSE](../usage/qemu.ru.md#vduse) (preferred) and [NBD](../usage/nbd.ru.md) device mapping methods
- Upgrades with VDUSE - new device handler processes are restarted successfully together with the CSI pods themselves
- VDUSE daemon auto-restart - a handler process is restarted automatically if it crashes unexpectedly due to a bug in Vitastor client code
- Multiple clusters by specifying multiple configuration files in ConfigMap.
Remember that to use snapshots you first have to install [Snapshot Controller and CRDs](https://kubernetes-csi.github.io/docs/snapshot-controller.html#deployment).

View File

@@ -18,7 +18,7 @@
stable version from 0.9.x branch instead of 1.x
- For Debian 10 (Buster) also enable backports repository:
`deb http://deb.debian.org/debian buster-backports main`
- Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu`
- Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu-system-x86`
## CentOS

View File

@@ -18,7 +18,7 @@
install the latest stable version from the 0.9.x branch instead of 1.x
- For Debian 10 (Buster) also enable the backports repository:
`deb http://deb.debian.org/debian buster-backports main`
- Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu`
- Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu-system-x86`
## CentOS

View File

@@ -6,10 +6,10 @@
# Proxmox VE
To enable Vitastor support in Proxmox Virtual Environment (6.4-8.0 are supported):
To enable Vitastor support in Proxmox Virtual Environment (6.4-8.1 are supported):
- Add the corresponding Vitastor Debian repository into sources.list on Proxmox hosts:
bookworm for 8.0, bullseye for 7.4, pve7.3 for 7.3, pve7.2 for 7.2, pve7.1 for 7.1, buster for 6.4
bookworm for 8.1, pve8.0 for 8.0, bullseye for 7.4, pve7.3 for 7.3, pve7.2 for 7.2, pve7.1 for 7.1, buster for 6.4
- Install vitastor-client, pve-qemu-kvm, pve-storage-vitastor (* or see note) packages from Vitastor repository
- Define storage in `/etc/pve/storage.cfg` (see below)
- Block network access from VMs to Vitastor network (to OSDs and etcd),
@@ -25,7 +25,7 @@ vitastor: vitastor
vitastor_pool testpool
# path to the configuration file
vitastor_config_path /etc/vitastor/vitastor.conf
# etcd address(es), required only if missing in the configuration file
# etcd address(es), OPTIONAL, required only if missing in the configuration file
vitastor_etcd_address 192.168.7.2:2379/v3
# prefix for keys in etcd
vitastor_etcd_prefix /vitastor

View File

@@ -6,10 +6,10 @@
# Proxmox VE
To connect Vitastor to Proxmox Virtual Environment (versions 6.4-8.0 are supported):
To connect Vitastor to Proxmox Virtual Environment (versions 6.4-8.1 are supported):
- Add the corresponding Vitastor Debian repository to sources.list on the Proxmox hosts:
bookworm for 8.0, bullseye for 7.4, pve7.3 for 7.3, pve7.2 for 7.2, pve7.1 for 7.1, buster for 6.4
bookworm for 8.1, pve8.0 for 8.0, bullseye for 7.4, pve7.3 for 7.3, pve7.2 for 7.2, pve7.1 for 7.1, buster for 6.4
- Install the vitastor-client, pve-qemu-kvm, pve-storage-vitastor (* or see note) packages from the Vitastor repository
- Define the storage in `/etc/pve/storage.cfg` (see below)
- Be sure to block access from virtual machines to the Vitastor network (OSDs and etcd), because Vitastor does not (yet) support authentication
@@ -24,7 +24,7 @@ vitastor: vitastor
vitastor_pool testpool
# Path to the configuration file
vitastor_config_path /etc/vitastor/vitastor.conf
# etcd address(es), required only if not specified in vitastor.conf
# etcd address(es), OPTIONAL, required only if not specified in vitastor.conf
vitastor_etcd_address 192.168.7.2:2379/v3
# Prefix for metadata keys in etcd
vitastor_etcd_prefix /vitastor

View File

@@ -54,7 +54,8 @@
virtual disks, their snapshots and clones.
- **QEMU driver** - a QEMU plugin that lets QEMU/KVM virtual machines work
with Vitastor virtual disks directly from userspace via the client
library, without having to map the disks as block devices.
library, without having to map the disks as block devices. The same driver
also allows attaching disks to the system via [VDUSE](../usage/qemu.ru.md#vduse).
- **vitastor-nbd** - a utility that allows mounting Vitastor images as block devices
using NBD (Network Block Device), which in fact works more like "BUSE"
(Block Device In Userspace). Vitastor has no Linux kernel module for the same task

View File

@@ -32,6 +32,7 @@
- [Scrubbing](../config/osd.en.md#auto_scrub) (verification of copies)
- [Checksums](../config/layout-osd.en.md#data_csum_type)
- [Client write-back cache](../config/client.en.md#client_enable_writeback)
- [Intelligent recovery auto-tuning](../config/osd.en.md#recovery_tune_interval)
## Plugins and tools

View File

@@ -34,6 +34,7 @@
- [Scrubbing](../config/osd.ru.md#auto_scrub) (verification of copies)
- [Checksums](../config/layout-osd.ru.md#data_csum_type)
- [Client write-back cache](../config/client.ru.md#client_enable_writeback)
- [Intelligent recovery auto-tuning](../config/osd.ru.md#recovery_tune_interval)
## Plugins and tools

View File

@@ -11,19 +11,26 @@ Replicated setups:
- Single-threaded write+fsync latency:
- With immediate commit: 2 network roundtrips + 1 disk write.
- With lazy commit: 4 network roundtrips + 1 disk write + 1 disk flush.
- Saturated parallel read iops: min(network bandwidth, sum(disk read iops)).
- Saturated parallel write iops: min(network bandwidth, sum(disk write iops / number of replicas / write amplification)).
- Linear read: `min(total network bandwidth, sum(disk read MB/s))`.
- Linear write: `min(total network bandwidth, sum(disk write MB/s / number of replicas))`.
- Saturated parallel read iops: `min(total network bandwidth, sum(disk read iops))`.
- Saturated parallel write iops: `min(total network bandwidth / number of replicas, sum(disk write iops / number of replicas / (write amplification = 4)))`.
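The linear write estimate for replicated pools can be tried out with a small calculation. This is only a sketch: the function name and all the throughput figures are hypothetical, and both inputs are assumed to be in MB/s.

```javascript
// Hypothetical helper: upper-bound estimate of linear write throughput
// for a replicated pool, following
//   min(total network bandwidth, sum(disk write MB/s / number of replicas)).
function replicatedLinearWriteMBs(totalNetworkMBs, diskWriteMBs, replicas)
{
    // Every client byte is written `replicas` times, so each disk
    // contributes only 1/replicas of its raw write speed.
    const diskLimit = diskWriteMBs.reduce((sum, mbs) => sum + mbs / replicas, 0);
    return Math.min(totalNetworkMBs, diskLimit);
}

// Illustrative numbers only: 6 disks at 1500 MB/s each, 3 replicas
const disks = [1500, 1500, 1500, 1500, 1500, 1500];
console.log(replicatedLinearWriteMBs(12000, disks, 3)); // disk-bound case
console.log(replicatedLinearWriteMBs(2000, disks, 3));  // network-bound case
```

Whichever term is smaller tells you whether the disks or the network limit the pool.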
EC/XOR setups:
EC/XOR setups (EC N+K):
- Single-threaded (T1Q1) read latency: 1.5 network roundtrips + 1 disk read.
- Single-threaded write+fsync latency:
- With immediate commit: 3.5 network roundtrips + 1 disk read + 2 disk writes.
- With lazy commit: 5.5 network roundtrips + 1 disk read + 2 disk writes + 2 disk fsyncs.
- 0.5 is actually (k-1)/k, which means that an additional roundtrip doesn't happen when
- 0.5 is actually `(N-1)/N`, which means that an additional roundtrip doesn't happen when
the read sub-operation can be served locally.
- Saturated parallel read iops: min(network bandwidth, sum(disk read iops)).
- Saturated parallel write iops: min(network bandwidth, sum(disk write iops * number of data drives / (number of data + parity drives) / write amplification)).
In fact, you should put disk write iops under the condition of ~10% reads / ~90% writes in this formula.
- Linear read: `min(total network bandwidth, sum(disk read MB/s))`.
- Linear write: `min(total network bandwidth, sum(disk write MB/s * N/(N+K)))`.
- Saturated parallel read iops: `min(total network bandwidth, sum(disk read iops))`.
- Saturated parallel write iops: roughly `total iops / (N+K) / WA`. More exactly,
`min(total network bandwidth * N/(N+K), sum(disk randrw iops / (N*4 + K*5 + 1)))` with
random read/write mix corresponding to `(N-1)/(N*4 + K*5 + 1)*100 % reads`.
- For example, with EC 2+1 it is: `(7% randrw iops) / 14`.
- With EC 6+3 it is: `(12.5% randrw iops) / 40`.
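The divisor and read share in the EC write formula can be checked with a few lines of code. This is a minimal sketch: the function names are hypothetical, the constants come from the formula above, and the network limit is left out.

```javascript
// Hypothetical helper: divisor and read share of the mixed workload
// for an EC N+K pool, using the constants from the formula above.
function ecRandomWriteModel(n, k)
{
    // Divisor from the formula: N*4 + K*5 + 1
    const divisor = n * 4 + k * 5 + 1;
    // Share of reads in the random read/write mix, in percent
    const readPercent = (n - 1) / divisor * 100;
    return { divisor, readPercent };
}

// Estimated cluster write iops from the summed mixed (randrw) iops of all
// disks, measured at the read share returned by ecRandomWriteModel().
// The network limit from the formula is not applied here.
function ecRandomWriteIops(sumDiskRandrwIops, n, k)
{
    return sumDiskRandrwIops / ecRandomWriteModel(n, k).divisor;
}

console.log(ecRandomWriteModel(2, 1)); // divisor 14, roughly 7% reads
console.log(ecRandomWriteModel(6, 3)); // divisor 40, 12.5% reads
```

The two calls reproduce the EC 2+1 and EC 6+3 figures quoted in the list.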
Write amplification for 4 KB blocks is usually 3-5 in Vitastor:
1. Journal block write

View File

@@ -11,20 +11,27 @@
- Single-threaded write+fsync:
- With immediate commit: 2 RTT + 1 write.
- With lazy commit: 4 RTT + 1 write + 1 fsync.
- Parallel read: sum of IOPS of all disks, or network performance if the network becomes the bottleneck first.
- Parallel write: sum of IOPS of all disks / number of replicas / WA, or network performance if the network becomes the bottleneck first.
- Linear read: sum of read MB/s of all disks, or total network performance (sum of the network bandwidth of all nodes) if the network becomes the bottleneck first.
- Linear write: sum of write MB/s of all disks / number of replicas, or network performance / number of replicas if the network becomes the bottleneck first.
- Parallel small random read: sum of read IOPS of all disks, or network performance if the network becomes the bottleneck first.
- Parallel small random write: sum of write IOPS of all disks / number of replicas / WA, or network performance / number of replicas if the network becomes the bottleneck first.
With erasure codes (EC):
With erasure codes (EC N+K):
- Single-threaded (T1Q1) read latency: 1.5 RTT + 1 read.
- Single-threaded write+fsync:
- With immediate commit: 3.5 RTT + 1 read + 2 writes.
- With lazy commit: 5.5 RTT + 1 read + 2 writes + 2 fsyncs.
- 0.5 actually stands for (k-1)/k, where k is the number of data disks,
- 0.5 actually stands for (N-1)/N, where N is the number of data disks,
which means that the additional network roundtrip is not needed when the read
operation is served locally.
- Parallel read: sum of IOPS of all disks, or network performance if the network becomes the bottleneck first.
- Parallel write: sum of IOPS of all disks / total number of data and parity disks / WA, or network performance if the network becomes the bottleneck first.
Note: disk IOPS here should be taken in mixed read/write mode, in a proportion similar to the formulas above.
- Linear read: sum of read MB/s of all disks, or total network performance if the network becomes the bottleneck first.
- Linear write: sum of write MB/s of all disks * N/(N+K), or network performance * N / (N+K) if the network becomes the bottleneck first.
- Parallel small random read: sum of read IOPS of all disks, or network performance if the network becomes the bottleneck first.
- Parallel small random write: roughly `(sum of IOPS / (N+K) / WA)`. More exactly:
the sum of mixed IOPS of all disks at `(N-1)/(N*4 + K*5 + 1)*100 %` reads, divided by `(N*4 + K*5 + 1)`.
Or network performance * N/(N+K) if the network becomes the bottleneck first.
- For example, with EC 2+1 it is: `(sum of IOPS at 7% reads) / 14`.
- With EC 6+3 it is: `(sum of IOPS at 12.5% reads) / 40`.
WA (write amplification) for 4 KB blocks in Vitastor is usually 3-5:
1. Journal metadata write

View File

@@ -28,7 +28,8 @@ It supports the following commands:
Global options:
```
--etcd_address ADDR Etcd connection address
--config_file FILE Path to Vitastor configuration file
--etcd_address URL Etcd connection address
--iodepth N Send N operations in parallel to each OSD when possible (default 32)
--parallel_osds M Work with M osds in parallel when possible (default 4)
--progress 1|0 Report progress (default 1)
@@ -130,19 +131,18 @@ See also about [how to export snapshots](qemu.en.md#exporting-snapshots).
## modify
`vitastor-cli modify <name> [--rename <new-name>] [--resize <size>] [--readonly | --readwrite] [-f|--force]`
`vitastor-cli modify <name> [--rename <new-name>] [--resize <size>] [--readonly | --readwrite] [-f|--force] [--down-ok]`
Rename, resize image or change its readonly status. Images with children can't be made read-write.
If the new size is smaller than the old size, extra data will be purged.
You should resize file system in the image, if present, before shrinking it.
```
-f|--force Proceed with shrinking or setting readwrite flag even if the image has children.
```
| `-f|--force` | Proceed with shrinking or setting readwrite flag even if the image has children. |
| `--down-ok` | Proceed with shrinking even if some data will be left on unavailable OSDs. |
## rm
`vitastor-cli rm <from> [<to>] [--writers-stopped]`
`vitastor-cli rm <from> [<to>] [--writers-stopped] [--down-ok]`
Remove `<from>` or all layers between `<from>` and `<to>` (`<to>` must be a child of `<from>`),
rebasing all their children accordingly. --writers-stopped allows merging to be a bit
@@ -150,6 +150,10 @@ more effective in case of a single 'slim' read-write child and 'fat' removed par
the child is merged into parent and parent is renamed to child in that case.
In other cases parent layers are always merged into children.
Other options:
| `--down-ok` | Continue deletion/merging even if some data will be left on unavailable OSDs. |
## flatten
`vitastor-cli flatten <layer>`

View File

@@ -27,7 +27,8 @@ vitastor-cli - интерфейс командной строки для адм
Global options:
```
--etcd_address ADDR Etcd connection address
--config_file FILE Path to Vitastor configuration file
--etcd_address URL Etcd connection address
--iodepth N Send N operations in parallel to each OSD (default 32)
--parallel_osds M Work with M OSDs in parallel (default 4)
--progress 1|0 Report progress (default 1)
@@ -131,7 +132,7 @@ vitastor-cli snap-create [-p|--pool <id|name>] <image>@<snapshot>
## modify
`vitastor-cli modify <name> [--rename <new-name>] [--resize <size>] [--readonly | --readwrite] [-f|--force]`
`vitastor-cli modify <name> [--rename <new-name>] [--resize <size>] [--readonly | --readwrite] [-f|--force] [--down-ok]`
Resize or rename an image, or change its read-only flag. Clearing the read-only flag
and shrinking images that have child clones is not allowed without `--force`.
@@ -139,13 +140,12 @@ vitastor-cli snap-create [-p|--pool <id|name>] <image>@<snapshot>
If the new size is smaller than the old one, the "extra" data will be removed, so shrink
the file system inside the image first before shrinking the image.
```
-f|--force Allow shrinking or switching to read-write for an image that has clones.
```
| -f|--force | Allow shrinking or switching to read-write for an image that has clones. |
| --down-ok | Allow shrinking even if some data is left undeleted on unavailable OSDs. |
## rm
`vitastor-cli rm <from> [<to>] [--writers-stopped]`
`vitastor-cli rm <from> [<to>] [--writers-stopped] [--down-ok]`
Remove the image `<from>` or all layers from `<from>` to `<to>` (`<to>` must be a child
image of `<from>`), changing the parent images of their clones (if any) at the same time.
@@ -157,6 +157,10 @@ vitastor-cli snap-create [-p|--pool <id|name>] <image>@<snapshot>
In other cases parent layers are merged into children.
Other options:
| `--down-ok` | Continue deletion/merging even if some data is left undeleted on unavailable OSDs. |
## flatten
`vitastor-cli flatten <layer>`

View File

@@ -17,6 +17,7 @@ It supports the following commands:
- [purge](#purge)
- [read-sb](#read-sb)
- [write-sb](#write-sb)
- [update-sb](#update-sb)
- [udev](#udev)
- [exec-osd](#exec-osd)
- [pre-exec](#pre-exec)
@@ -182,6 +183,14 @@ Try to read Vitastor OSD superblock from `<device>` and print it in JSON format.
Read JSON from STDIN and write it into Vitastor OSD superblock on `<device>`.
## update-sb
`vitastor-disk update-sb <device> [--force] [--<parameter> <value>] [...]`
Read the Vitastor OSD superblock from `<device>`, update parameters in it and write it back.
`--force` allows ignoring validation errors.
## udev
`vitastor-disk udev <device>`
@@ -252,7 +261,7 @@ Options (see also [Cluster-Wide Disk Layout Parameters](../config/layout-cluster
```
--object_size 128k Set blockstore block size
--bitmap_granularity 4k Set bitmap granularity
--journal_size 16M Set journal size
--journal_size 32M Set journal size
--data_csum_type none Set data checksum type (crc32c or none)
--csum_block_size 4k Set data checksum block size
--device_block_size 4k Set device block size

View File

@@ -17,6 +17,7 @@ vitastor-disk - инструмент командной строки для уп
- [purge](#purge)
- [read-sb](#read-sb)
- [write-sb](#write-sb)
- [update-sb](#update-sb)
- [udev](#udev)
- [exec-osd](#exec-osd)
- [pre-exec](#pre-exec)
@@ -187,6 +188,15 @@ throttle_target_mbs, throttle_target_parallelism, throttle_threshold_us.
Read JSON from standard input and write it into the OSD superblock on disk `<device>`.
## update-sb
`vitastor-disk update-sb <device> [--force] [--<parameter> <value>] [...]`
Read the OSD superblock from disk `<device>`, change the given parameters in it and write it back.
The `--force` option allows reading the superblock even if it is considered invalid
due to validation errors.
## udev
`vitastor-disk udev <device>`
@@ -257,7 +267,7 @@ OSD отключены fsync-и.
```
--object_size 128k Blockstore block size
--bitmap_granularity 4k Bitmap granularity
--journal_size 16M Journal size
--journal_size 32M Journal size
--data_csum_type none Set data checksum type (crc32c or none)
--csum_block_size 4k Set checksum calculation block size
--device_block_size 4k Device block size

View File

@@ -14,10 +14,13 @@ Vitastor has a fio driver which can be installed from the package vitastor-fio.
Use the following command as an example to run tests with fio against a Vitastor cluster:
```
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -image=testimg
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -image=testimg
```
If you don't want to access your image by name, you can specify pool number, inode number and size
(`-pool=1 -inode=1 -size=400G`) instead of the image name (`-image=testimg`).
See exact fio commands to use for benchmarking [here](../performance/understanding.en.md#команды-fio).
You can also specify etcd address(es) explicitly by adding `-etcd=10.115.0.10:2379/v3`, or you
can override configuration file path by adding `-conf=/etc/vitastor/vitastor.conf`.
See exact fio commands to use for benchmarking [here](../performance/understanding.en.md#fio-commands).

View File

@@ -14,10 +14,13 @@
Use the following command as an example to run tests against a Vitastor cluster with fio:
```
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -image=testimg
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -image=testimg
```
Instead of accessing the image by name (`-image=testimg`) you can specify the pool number, inode number and size:
`-pool=1 -inode=1 -size=400G`.
You can also specify etcd address(es) explicitly by adding `-etcd=10.115.0.10:2379/v3`,
or override the configuration file path by adding `-conf=/etc/vitastor/vitastor.conf`.
See the exact fio commands for performance testing [here](../performance/understanding.ru.md#команды-fio).

View File

@@ -11,25 +11,25 @@ NBD stands for "Network Block Device", but in fact it also functions as "BUSE"
NBD slightly lowers the performance due to additional overhead, but performance still
remains decent (see an example [here](../performance/comparison1.en.md#vitastor-0-4-0-nbd)).
Vitastor Kubernetes CSI driver is based on NBD.
See also [VDUSE](qemu.en.md#vduse) as a better alternative to NBD.
See also [VDUSE](qemu.en.md#vduse).
Vitastor Kubernetes CSI driver uses NBD when VDUSE is unavailable.
## Map image
To create a local block device for a Vitastor image run:
```
vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg
vitastor-nbd map --image testimg
```
It will output a block device name like /dev/nbd0 which you can then use as a normal disk.
You can also use `--pool <POOL> --inode <INODE> --size <SIZE>` instead of `--image <IMAGE>` if you want.
Additional options for map command:
vitastor-nbd supports all usual Vitastor configuration options like `--config_file <path_to_config>` plus NBD-specific:
* `--nbd_timeout 30` \
* `--nbd_timeout 300` \
Timeout for I/O operations in seconds after exceeding which the kernel stops
the device. You can set it to 0 to disable the timeout, but beware that you
won't be able to stop the device at all if vitastor-nbd process dies.
@@ -44,6 +44,9 @@ Additional options for map command:
* `--foreground 1` \
Stay in foreground, do not daemonize.
Note that `nbd_timeout`, `nbd_max_devices` and `nbd_max_part` options may also be specified
in `/etc/vitastor/vitastor.conf` or in other configuration file specified with `--config_file`.
## Unmap image
To unmap the device run:

View File

@@ -14,16 +14,16 @@ NBD на данный момент необходимо, чтобы монтир
NBD slightly lowers performance due to additional memory copying,
but it still remains at a decent level (see an example [test](../performance/comparison1.ru.md#vitastor-0-4-0-nbd)).
Vitastor Kubernetes CSI driver is based on NBD.
See also [VDUSE](qemu.ru.md#vduse) as a better alternative to NBD.
See also [VDUSE](qemu.ru.md#vduse).
Vitastor Kubernetes CSI driver uses NBD when VDUSE is unavailable.
## Map a device
To create a local block device for an image, run:
```
vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg
vitastor-nbd map --image testimg
```
The command prints a block device name like /dev/nbd0, which you can then
@@ -32,7 +32,8 @@ vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg
To address the image by inode number, as with other commands, you can use the options
`--pool <POOL> --inode <INODE> --size <SIZE>` instead of `--image testimg`.
Additional options for the NBD device map command:
vitastor-nbd supports all usual Vitastor options, for example `--config_file <path_to_config>`,
plus NBD-specific ones:
* `--nbd_timeout 30` \
Maximum execution time of any read/write operation in seconds, upon
@@ -53,6 +54,10 @@ vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg
* `--foreground 1` \
Do not send the process to the background.
Note that the `nbd_timeout`, `nbd_max_devices` and `nbd_max_part` options can
also be set in `/etc/vitastor/vitastor.conf` or in another configuration file
specified with the `--config_file` option.
## Unmap a device
To unmap the device, run:

View File

@@ -23,7 +23,7 @@ balancer or any failover method you want to in that case.
vitastor-nfs usage:
```
vitastor-nfs [--etcd_address ADDR] [OTHER OPTIONS]
vitastor-nfs [STANDARD OPTIONS] [OTHER OPTIONS]
--subdir <DIR> export images prefixed <DIR>/ (default empty - export all images)
--portmap 0 do not listen on port 111 (portmap/rpcbind, requires root)
@@ -34,7 +34,7 @@ vitastor-nfs [--etcd_address ADDR] [OTHER OPTIONS]
--foreground 1 stay in foreground, do not daemonize
```
Example start and mount commands:
Example start and mount commands (etcd_address is optional):
```
vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool

View File

@@ -22,7 +22,7 @@
vitastor-nfs usage:
```
vitastor-nfs [--etcd_address ADDR] [OTHER OPTIONS]
vitastor-nfs [STANDARD OPTIONS] [OTHER OPTIONS]
--subdir <DIR> export a "subdirectory" - images with the name prefix <DIR>/ (default empty - export all images)
--portmap 0 disable the portmap/rpcbind service on port 111 (enabled by default, requires root privileges)
@@ -33,7 +33,7 @@ vitastor-nfs [--etcd_address ADDR] [OTHER OPTIONS]
--foreground 1 do not go into the background after start
```
Example of mounting Vitastor via NFS:
Example of mounting Vitastor via NFS (etcd_address is optional):
```
vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool

View File

@@ -16,13 +16,16 @@ Old syntax (-drive):
```
qemu-system-x86_64 -enable-kvm -m 1024 \
-drive 'file=vitastor:etcd_host=192.168.7.2\:2379/v3:image=debian9',
-drive 'file=vitastor:image=debian9',
format=raw,if=none,id=drive-virtio-disk0,cache=none \
-device 'virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,
id=virtio-disk0,bootindex=1,write-cache=off' \
-vnc 0.0.0.0:0
```
Etcd address may be specified explicitly by adding `:etcd_host=192.168.7.2\:2379/v3` to `file=`.
Configuration file path may be overridden by adding `:config_path=/etc/vitastor/vitastor.conf`.
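The `file=vitastor:...` string is a colon-separated list of `key=value` pairs, with colons inside values written as `\:`. A minimal sketch of building such a string follows. The helper name is hypothetical and only colons are escaped, since no other escaping rules are described here.

```javascript
// Hypothetical helper: build a "vitastor:" filename for QEMU's -drive file=...
// or for qemu-img from key/value options, escaping ':' inside values as '\:'
// the way the etcd_host example does.
function vitastorFilename(options)
{
    const parts = Object.entries(options).map(
        ([key, value]) => key + '=' + String(value).replace(/:/g, '\\:')
    );
    return ['vitastor', ...parts].join(':');
}

console.log(vitastorFilename({ image: 'debian9' }));
// vitastor:image=debian9
console.log(vitastorFilename({ image: 'debian9', etcd_host: '192.168.7.2:2379/v3' }));
// vitastor:image=debian9:etcd_host=192.168.7.2\:2379/v3
```

Remember to quote the resulting string in the shell, as the examples do, so that the backslash reaches QEMU unchanged.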
New syntax (-blockdev):
```
@@ -50,12 +53,12 @@ You can also specify inode ID, pool and size manually instead of `:image=<IMAGE>
## qemu-img
For qemu-img, you should use `vitastor:etcd_host=<HOST>:image=<IMAGE>` as filename.
For qemu-img, you should use `vitastor:image=<IMAGE>[:etcd_host=<HOST>]` as filename.
For example, to upload a VM image into Vitastor, run:
```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=debian10'
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:image=debian10'
```
You can also specify `:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>`
@@ -72,10 +75,10 @@ the snapshot separately using the following commands (key points are using `skip
`-B backing_file` option):
```
qemu-img convert -f raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=testimg@0' \
qemu-img convert -f raw 'vitastor:image=testimg@0' \
-O qcow2 testimg_0.qcow2
qemu-img convert -f raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=testimg:skip-parents=1' \
qemu-img convert -f raw 'vitastor:image=testimg:skip-parents=1' \
-O qcow2 -o 'cluster_size=4k' -B testimg_0.qcow2 testimg.qcow2
```
@@ -146,7 +149,7 @@ Example performance comparison:
| 4k random read Q1 | 9600 iops | 7640 iops | 7780 iops |
To try VDUSE you need at least Linux 5.15, built with VDUSE support
(CONFIG_VIRTIO_VDPA=m, CONFIG_VDPA_USER=m, CONFIG_VIRTIO_VDPA=m).
(CONFIG_VDPA=m, CONFIG_VDPA_USER=m, CONFIG_VIRTIO_VDPA=m).
Debian Linux kernels have these options disabled for now, so if you want to try it on Debian,
use a kernel from Ubuntu [kernel-ppa/mainline](https://kernel.ubuntu.com/~kernel-ppa/mainline/), Proxmox,

View File

@@ -18,13 +18,16 @@
```
qemu-system-x86_64 -enable-kvm -m 1024 \
-drive 'file=vitastor:etcd_host=192.168.7.2\:2379/v3:image=debian9',
-drive 'file=vitastor:image=debian9',
format=raw,if=none,id=drive-virtio-disk0,cache=none \
-device 'virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,
id=virtio-disk0,bootindex=1,write-cache=off' \
-vnc 0.0.0.0:0
```
The etcd address may be specified explicitly by adding `:etcd_host=192.168.7.2\:2379/v3` to `file=`.
The configuration file path may be overridden by adding `:config_path=/etc/vitastor/vitastor.conf`.
New syntax (-blockdev):
```
@@ -52,12 +55,12 @@ qemu-system-x86_64 -enable-kvm -m 1024 \
## qemu-img
For qemu-img, use the string `vitastor:etcd_host=<HOST>:image=<IMAGE>` as the disk file name.
For qemu-img, use the string `vitastor:image=<IMAGE>[:etcd_host=<HOST>]` as the disk file name.
For example, to upload a disk image into Vitastor:
```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg'
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:image=testimg'
```
If you don't want to access the image by name, you can specify the pool number, inode number and size instead of `:image=<IMAGE>`:
@@ -73,10 +76,10 @@ qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.
using the following commands (the key points are using `skip-parents=1` and the `-B backing_file.qcow2` option):
```
qemu-img convert -f raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=testimg@0' \
qemu-img convert -f raw 'vitastor:image=testimg@0' \
-O qcow2 testimg_0.qcow2
qemu-img convert -f raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=testimg:skip-parents=1' \
qemu-img convert -f raw 'vitastor:image=testimg:skip-parents=1' \
-O qcow2 -o 'cluster_size=4k' -B testimg_0.qcow2 testimg.qcow2
```
@@ -149,7 +152,7 @@ VDUSE - на данный момент лучший интерфейс для п
| 4k random read Q1 | 9600 iops | 7640 iops | 7780 iops |
To try VDUSE, you need a Linux kernel of at least version 5.15, built with support for
VDUSE (CONFIG_VIRTIO_VDPA=m, CONFIG_VDPA_USER=m, CONFIG_VIRTIO_VDPA=m).
VDUSE (CONFIG_VDPA=m, CONFIG_VDPA_USER=m, CONFIG_VIRTIO_VDPA=m).
Debian Linux kernels still have this support disabled by default, so to try VDUSE
on Debian, install a kernel from Ubuntu [kernel-ppa/mainline](https://kernel.ubuntu.com/~kernel-ppa/mainline/),

View File

@@ -3,6 +3,7 @@
module.exports = {
scale_pg_count,
scale_pg_history,
};
function add_pg_history(new_pg_history, new_pg, prev_pgs, prev_pg_history, old_pg)
@@ -43,16 +44,18 @@ function finish_pg_history(merged_history)
merged_history.all_peers = Object.values(merged_history.all_peers);
}
function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
function scale_pg_history(prev_pg_history, prev_pgs, new_pgs)
{
const old_pg_count = real_prev_pgs.length;
const new_pg_history = [];
const old_pg_count = prev_pgs.length;
const new_pg_count = new_pgs.length;
// Add all possibly intersecting PGs to the history of new PGs
if (!(new_pg_count % old_pg_count))
{
// New PG count is a multiple of old PG count
for (let i = 0; i < new_pg_count; i++)
{
add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i % old_pg_count);
add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i % old_pg_count);
finish_pg_history(new_pg_history[i]);
}
}
@@ -64,7 +67,7 @@ function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history
{
for (let j = 0; j < mul; j++)
{
add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i+j*new_pg_count);
add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i+j*new_pg_count);
}
finish_pg_history(new_pg_history[i]);
}
@@ -76,7 +79,7 @@ function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history
let merged_history = {};
for (let i = 0; i < old_pg_count; i++)
{
add_pg_history(merged_history, 1, real_prev_pgs, prev_pg_history, i);
add_pg_history(merged_history, 1, prev_pgs, prev_pg_history, i);
}
finish_pg_history(merged_history[1]);
for (let i = 0; i < new_pg_count; i++)
@@ -89,6 +92,12 @@ function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history
{
new_pg_history[i] = null;
}
return new_pg_history;
}
function scale_pg_count(prev_pgs, new_pg_count)
{
const old_pg_count = prev_pgs.length;
// Just for the lp_solve optimizer - pick a "previous" PG for each "new" one
if (prev_pgs.length < new_pg_count)
{
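
The index arithmetic used by `scale_pg_history` in the hunks above can be shown on its own. This is a simplified sketch, not the monitor's code: the function name is hypothetical, indices are 0-based, and the condition for the merge branch is an assumption because that part of the hunk is not shown.

```javascript
// Sketch: which old PGs may intersect each new PG when the PG count changes.
// Returns an array where element i lists the 0-based old PG indices
// whose history new PG i would inherit.
function oldPgsForNewPgs(oldPgCount, newPgCount)
{
    const result = [];
    for (let i = 0; i < newPgCount; i++)
    {
        if (!(newPgCount % oldPgCount))
        {
            // Split: the new PG count is a multiple of the old PG count
            result.push([ i % oldPgCount ]);
        }
        else if (!(oldPgCount % newPgCount))
        {
            // Merge (assumed condition): the old count is a multiple of the new one
            const mul = oldPgCount / newPgCount;
            const olds = [];
            for (let j = 0; j < mul; j++)
                olds.push(i + j * newPgCount);
            result.push(olds);
        }
        else
        {
            // Otherwise any old PG may intersect any new PG
            result.push(Array.from({ length: oldPgCount }, (_, idx) => idx));
        }
    }
    return result;
}

console.log(JSON.stringify(oldPgsForNewPgs(2, 4))); // [[0],[1],[0],[1]]
console.log(JSON.stringify(oldPgsForNewPgs(4, 2))); // [[0,2],[1,3]]
console.log(JSON.stringify(oldPgsForNewPgs(2, 3))); // [[0,1],[0,1],[0,1]]
```

The real function also merges OSD sets and peer lists through `add_pg_history` and `finish_pg_history`, which this sketch leaves out.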

View File

@@ -55,10 +55,11 @@ const etcd_tree = {
// etcd connection - configurable online
etcd_address: "10.0.115.10:2379/v3",
// mon
etcd_mon_ttl: 30, // min: 10
etcd_mon_ttl: 5, // min: 1
etcd_mon_timeout: 1000, // ms. min: 0
etcd_mon_retries: 5, // min: 0
mon_change_timeout: 1000, // ms. min: 100
mon_retry_change_timeout: 50, // ms. min: 10
mon_stats_timeout: 1000, // ms. min: 100
osd_out_time: 600, // seconds. min: 0
placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
@@ -85,13 +86,14 @@ const etcd_tree = {
client_max_buffered_bytes: 33554432,
client_max_buffered_ops: 1024,
client_max_writeback_iodepth: 256,
client_retry_interval: 50, // ms. min: 10
client_eio_retry_interval: 1000, // ms
// client and osd - configurable online
log_level: 0,
peer_connect_interval: 5, // seconds. min: 1
peer_connect_timeout: 5, // seconds. min: 1
osd_idle_timeout: 5, // seconds. min: 1
osd_ping_timeout: 5, // seconds. min: 1
up_wait_retry_interval: 500, // ms. min: 50
max_etcd_attempts: 5,
etcd_quick_timeout: 1000, // ms
etcd_slow_timeout: 5000, // ms
@@ -110,7 +112,15 @@ const etcd_tree = {
autosync_interval: 5,
autosync_writes: 128,
client_queue_depth: 128, // unused
recovery_queue_depth: 4,
recovery_queue_depth: 1,
recovery_sleep_us: 0,
recovery_tune_util_low: 0.1,
recovery_tune_client_util_low: 0,
recovery_tune_util_high: 1.0,
recovery_tune_client_util_high: 0.5,
recovery_tune_interval: 1,
recovery_tune_agg_interval: 10, // 10 times recovery_tune_interval
recovery_tune_sleep_min_us: 10, // 10 microseconds
recovery_pg_switch: 128,
recovery_sync_batch: 16,
no_recovery: false,
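
The `min:` annotations in `etcd_tree` are enforced by `check_config()`, which first falls back to a default and then clamps to the minimum (for example `etcd_mon_ttl`: default 5, minimum 1). A minimal sketch of that pattern follows; `clampSetting` is a hypothetical helper name used only for illustration and does not exist in mon.js.

```javascript
// Hypothetical helper (not part of mon.js): apply a default, then clamp to a minimum.
function clampSetting(raw, defaultValue, minValue)
{
    // Number(undefined) is NaN and Number('') is 0; both are falsy, so the default wins
    const value = Number(raw) || defaultValue;
    return value < minValue ? minValue : value;
}

// Values mirror the etcd_mon_ttl comment above: default 5, min 1
const examples = {
    unset: clampSetting(undefined, 5, 1),
    tooLow: clampSetting(0.5, 5, 1),
    fromString: clampSetting('30', 5, 1),
};
console.log(examples);
```

One consequence of the `||` fallback is that an explicit `0` is treated the same as an unset value, so it becomes the default rather than the minimum.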
@@ -381,7 +391,8 @@ class Mon
{
constructor(config)
{
this.die = (e) => this._die(e);
this.failconnect = (e) => this._die(e, 2);
this.die = (e) => this._die(e, 1);
if (fs.existsSync(config.config_path||'/etc/vitastor/vitastor.conf'))
{
config = {
@@ -392,7 +403,7 @@ class Mon
this.parse_etcd_addresses(config.etcd_address||config.etcd_url);
this.verbose = config.verbose || 0;
this.initConfig = config;
this.config = {};
this.config = { ...config };
this.etcd_prefix = config.etcd_prefix || '/vitastor';
this.etcd_prefix = this.etcd_prefix.replace(/\/\/+/g, '/').replace(/^\/?(.*[^\/])\/?$/, '/$1');
this.etcd_start_timeout = (config.etcd_start_timeout || 5) * 1000;
@@ -470,10 +481,10 @@ class Mon
check_config()
{
this.config.etcd_mon_ttl = Number(this.config.etcd_mon_ttl) || 30;
if (this.config.etcd_mon_ttl < 10)
this.config.etcd_mon_ttl = Number(this.config.etcd_mon_ttl) || 5;
if (this.config.etcd_mon_ttl < 1)
{
this.config.etcd_mon_ttl = 10;
this.config.etcd_mon_ttl = 1;
}
this.config.etcd_mon_timeout = Number(this.config.etcd_mon_timeout) || 0;
if (this.config.etcd_mon_timeout <= 0)
@@ -490,6 +501,11 @@ class Mon
{
this.config.mon_change_timeout = 100;
}
this.config.mon_retry_change_timeout = Number(this.config.mon_retry_change_timeout) || 50;
if (this.config.mon_retry_change_timeout < 50)
{
this.config.mon_retry_change_timeout = 50;
}
this.config.mon_stats_timeout = Number(this.config.mon_stats_timeout) || 1000;
if (this.config.mon_stats_timeout < 100)
{
@@ -590,7 +606,7 @@ class Mon
}
if (!this.ws)
{
this.die('Failed to open etcd watch websocket');
this.failconnect('Failed to open etcd watch websocket');
}
const cur_addr = this.selected_etcd_url;
this.ws_alive = true;
@@ -606,7 +622,7 @@ class Mon
console.log('etcd websocket timed out, restarting it');
this.restart_watcher(cur_addr);
}
}, (Number(this.config.etcd_keepalive_interval) || 30)*1000);
}, (Number(this.config.etcd_ws_keepalive_interval) || 30)*1000);
this.ws.on('error', () => this.restart_watcher(cur_addr));
this.ws.send(JSON.stringify({
create_request: {
@@ -660,7 +676,12 @@ class Mon
{
this.parse_kv(e.kv);
const key = e.kv.key.substr(this.etcd_prefix.length);
if (key.substr(0, 11) == '/osd/stats/' || key.substr(0, 10) == '/pg/stats/' || key.substr(0, 16) == '/osd/inodestats/')
if (key.substr(0, 11) == '/osd/state/')
{
stats_changed = true;
changed = true;
}
else if (key.substr(0, 11) == '/osd/stats/' || key.substr(0, 10) == '/pg/stats/' || key.substr(0, 16) == '/osd/inodestats/')
{
stats_changed = true;
}
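
The hunk above makes `/osd/state/` keys trigger both a PG recheck and a statistics update, while the pure statistics prefixes still trigger only the latter. As a rough standalone sketch: `classifyKey` is a hypothetical name, and the fall-through branch is not visible in the hunk, so the sketch assumes that any other key sets only `changed`.

```javascript
// Hypothetical standalone version of the prefix checks in the watcher callback.
// `key` is assumed to already have the etcd prefix stripped.
function classifyKey(key)
{
    if (key.startsWith('/osd/state/'))
    {
        // An OSD going up or down affects both PG placement and summed statistics
        return { changed: true, stats_changed: true };
    }
    if (key.startsWith('/osd/stats/') || key.startsWith('/pg/stats/') ||
        key.startsWith('/osd/inodestats/'))
    {
        return { changed: false, stats_changed: true };
    }
    // ASSUMPTION: this branch is not shown in the diff
    return { changed: true, stats_changed: false };
}

console.log(classifyKey('/osd/state/1'));
console.log(classifyKey('/pg/stats/1/5'));
```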
@@ -777,9 +798,9 @@ class Mon
const res = await this.etcd_call('/lease/keepalive', { ID: this.etcd_lease_id }, this.config.etcd_mon_timeout, this.config.etcd_mon_retries);
if (!res.result.TTL)
{
this.die('Lease expired');
this.failconnect('Lease expired');
}
}, this.config.etcd_mon_timeout);
}, this.config.etcd_mon_ttl*1000);
if (!this.signals_set)
{
process.on('SIGINT', this.on_stop_cb);
@@ -1222,6 +1243,89 @@ class Mon
return aff_osds;
}
async generate_pool_pgs(pool_id, osd_tree, levels)
{
const pool_cfg = this.state.config.pools[pool_id];
if (!this.validate_pool_cfg(pool_id, pool_cfg, false))
{
return null;
}
let pool_tree = osd_tree[pool_cfg.root_node || ''];
pool_tree = pool_tree ? pool_tree.children : [];
pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
this.filter_osds_by_block_layout(
pool_tree,
pool_cfg.block_size || this.config.block_size || 131072,
pool_cfg.bitmap_granularity || this.config.bitmap_granularity || 4096,
pool_cfg.immediate_commit || this.config.immediate_commit || 'none'
);
// First try last_clean_pgs to minimize data movement
let prev_pgs = [];
for (const pg in ((this.state.history.last_clean_pgs.items||{})[pool_id]||{}))
{
prev_pgs[pg-1] = [ ...this.state.history.last_clean_pgs.items[pool_id][pg].osd_set ];
}
if (!prev_pgs.length)
{
// Fall back to config/pgs if it's empty
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
{
prev_pgs[pg-1] = [ ...this.state.config.pgs.items[pool_id][pg].osd_set ];
}
}
const old_pg_count = prev_pgs.length;
const optimize_cfg = {
osd_tree: pool_tree,
pg_count: pool_cfg.pg_count,
pg_size: pool_cfg.pg_size,
pg_minsize: pool_cfg.pg_minsize,
max_combinations: pool_cfg.max_osd_combinations,
ordered: pool_cfg.scheme != 'replicated',
};
let optimize_result;
// Re-shuffle PGs if config/pgs.hash is empty
if (old_pg_count > 0 && this.state.config.pgs.hash)
{
if (prev_pgs.length != pool_cfg.pg_count)
{
// Scale PG count
// Do it even if old_pg_count is already equal to pool_cfg.pg_count,
// because last_clean_pgs may still contain the old number of PGs
PGUtil.scale_pg_count(prev_pgs, pool_cfg.pg_count);
}
for (const pg of prev_pgs)
{
while (pg.length < pool_cfg.pg_size)
{
pg.push(0);
}
}
optimize_result = await LPOptimizer.optimize_change({
prev_pgs,
...optimize_cfg,
});
}
else
{
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
}
console.log(`Pool ${pool_id} (${pool_cfg.name || 'unnamed'}):`);
LPOptimizer.print_change_stats(optimize_result);
const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
return {
pool_id,
pgs: optimize_result.int_pgs,
stats: {
total_raw_tb: optimize_result.space,
pg_real_size: pg_effsize || pool_cfg.pg_size,
raw_to_usable: (pg_effsize || pool_cfg.pg_size) / (pool_cfg.scheme === 'replicated'
? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0))),
space_efficiency: optimize_result.space/(optimize_result.total_space||1),
},
};
}
async recheck_pgs()
{
if (this.recheck_pgs_active)
@@ -1236,158 +1340,47 @@ class Mon
const { up_osds, levels, osd_tree } = this.get_osd_tree();
const tree_cfg = {
osd_tree,
levels,
pools: this.state.config.pools,
};
const tree_hash = sha1hex(stableStringify(tree_cfg));
if (this.state.config.pgs.hash != tree_hash)
{
// Something has changed
const new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
const etcd_request = { compare: [], success: [] };
for (const pool_id in (this.state.config.pgs||{}).items||{})
console.log('Pool configuration or OSD tree changed, re-optimizing');
// First re-optimize PGs, but don't look at history yet
const optimize_results = await Promise.all(Object.keys(this.state.config.pools)
.map(pool_id => this.generate_pool_pgs(pool_id, osd_tree, levels)));
// Then apply the modification in the form of an optimistic transaction,
// each time considering new pg/history modifications (OSDs modify it during rebalance)
while (!await this.apply_pool_pgs(optimize_results, up_osds, osd_tree, tree_hash))
{
if (!this.state.config.pools[pool_id])
{
// Pool deleted. Delete all PGs, but first stop them.
if (!await this.stop_all_pgs(pool_id))
{
this.recheck_pgs_active = false;
this.schedule_recheck();
return;
}
const prev_pgs = [];
for (const pg in this.state.config.pgs.items[pool_id]||{})
{
prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
}
// Also delete pool statistics
etcd_request.success.push({ requestDeleteRange: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
} });
this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
}
}
for (const pool_id in this.state.config.pools)
{
const pool_cfg = this.state.config.pools[pool_id];
if (!this.validate_pool_cfg(pool_id, pool_cfg, false))
{
continue;
}
let pool_tree = osd_tree[pool_cfg.root_node || ''];
pool_tree = pool_tree ? pool_tree.children : [];
pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
this.filter_osds_by_block_layout(
pool_tree,
pool_cfg.block_size || this.config.block_size || 131072,
pool_cfg.bitmap_granularity || this.config.bitmap_granularity || 4096,
pool_cfg.immediate_commit || this.config.immediate_commit || 'none'
console.log(
'Someone changed PG configuration while we also tried to change it.'+
' Retrying in '+this.config.mon_retry_change_timeout+' ms'
);
// These are for the purpose of building history.osd_sets
const real_prev_pgs = [];
let pg_history = [];
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
// Failed to apply - parallel change detected. Wait a bit and retry
const old_rev = this.etcd_watch_revision;
while (this.etcd_watch_revision === old_rev)
{
real_prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
if (this.state.pg.history[pool_id] &&
this.state.pg.history[pool_id][pg])
{
pg_history[pg-1] = this.state.pg.history[pool_id][pg];
}
await new Promise(ok => setTimeout(ok, this.config.mon_retry_change_timeout));
}
// And these are for the purpose of minimizing data movement
let prev_pgs = [];
for (const pg in ((this.state.history.last_clean_pgs.items||{})[pool_id]||{}))
{
prev_pgs[pg-1] = this.state.history.last_clean_pgs.items[pool_id][pg].osd_set;
}
prev_pgs = JSON.parse(JSON.stringify(prev_pgs.length ? prev_pgs : real_prev_pgs));
const old_pg_count = real_prev_pgs.length;
const optimize_cfg = {
osd_tree: pool_tree,
pg_count: pool_cfg.pg_count,
pg_size: pool_cfg.pg_size,
pg_minsize: pool_cfg.pg_minsize,
max_combinations: pool_cfg.max_osd_combinations,
ordered: pool_cfg.scheme != 'replicated',
const new_ot = this.get_osd_tree();
const new_tcfg = {
osd_tree: new_ot.osd_tree,
levels: new_ot.levels,
pools: this.state.config.pools,
};
let optimize_result;
if (old_pg_count > 0)
if (sha1hex(stableStringify(new_tcfg)) !== tree_hash)
{
if (old_pg_count != pool_cfg.pg_count)
{
// PG count changed. Need to bring all PGs down.
if (!await this.stop_all_pgs(pool_id))
{
this.recheck_pgs_active = false;
this.schedule_recheck();
return;
}
}
if (prev_pgs.length != pool_cfg.pg_count)
{
// Scale PG count
// Do it even if old_pg_count is already equal to pool_cfg.pg_count,
// because last_clean_pgs may still contain the old number of PGs
const new_pg_history = [];
PGUtil.scale_pg_count(prev_pgs, real_prev_pgs, pg_history, new_pg_history, pool_cfg.pg_count);
pg_history = new_pg_history;
}
for (const pg of prev_pgs)
{
while (pg.length < pool_cfg.pg_size)
{
pg.push(0);
}
}
if (!this.state.config.pgs.hash)
{
// Re-shuffle PGs
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
}
else
{
optimize_result = await LPOptimizer.optimize_change({
prev_pgs,
...optimize_cfg,
});
}
// Configuration actually changed, restart from the beginning
this.recheck_pgs_active = false;
setImmediate(() => this.recheck_pgs().catch(this.die));
return;
}
else
{
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
}
if (old_pg_count != optimize_result.int_pgs.length)
{
console.log(
`PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+
` changed from: ${old_pg_count} to ${optimize_result.int_pgs.length}`
);
// Drop stats
etcd_request.success.push({ requestDeleteRange: {
key: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'/'),
range_end: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'0'),
} });
}
LPOptimizer.print_change_stats(optimize_result);
const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
this.state.pool.stats[pool_id] = {
used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
total_raw_tb: optimize_result.space,
pg_real_size: pg_effsize || pool_cfg.pg_size,
raw_to_usable: (pg_effsize || pool_cfg.pg_size) / (pool_cfg.scheme === 'replicated'
? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0))),
space_efficiency: optimize_result.space/(optimize_result.total_space||1),
};
etcd_request.success.push({ requestPut: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
} });
this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, optimize_result.int_pgs, pg_history);
// Configuration didn't change, PG history probably changed, so just retry
}
new_config_pgs.hash = tree_hash;
await this.save_pg_config(new_config_pgs, etcd_request);
console.log('PG configuration successfully changed');
}
else
{
@@ -1428,12 +1421,97 @@ class Mon
}
if (changed)
{
await this.save_pg_config(new_config_pgs);
const ok = await this.save_pg_config(new_config_pgs);
if (ok)
console.log('PG configuration successfully changed');
else
{
console.log('Someone changed PG configuration while we also tried to change it. Retrying in '+this.config.mon_change_timeout+' ms');
this.schedule_recheck();
}
}
}
this.recheck_pgs_active = false;
}
async apply_pool_pgs(results, up_osds, osd_tree, tree_hash)
{
for (const pool_id in (this.state.config.pgs||{}).items||{})
{
// We should stop all PGs when deleting a pool or changing its PG count
if (!this.state.config.pools[pool_id] ||
this.state.config.pgs.items[pool_id] && this.state.config.pools[pool_id].pg_count !=
Object.keys(this.state.config.pgs.items[pool_id]).reduce((a, c) => (a < (0|c) ? (0|c) : a), 0))
{
if (!await this.stop_all_pgs(pool_id))
{
return false;
}
}
}
const new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
const etcd_request = { compare: [], success: [] };
for (const pool_id in (new_config_pgs||{}).items||{})
{
if (!this.state.config.pools[pool_id])
{
const prev_pgs = [];
for (const pg in new_config_pgs.items[pool_id]||{})
{
prev_pgs[pg-1] = new_config_pgs.items[pool_id][pg].osd_set;
}
// Also delete pool statistics
etcd_request.success.push({ requestDeleteRange: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
} });
this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
}
}
for (const pool_res of results)
{
const pool_id = pool_res.pool_id;
const pool_cfg = this.state.config.pools[pool_id];
let pg_history = [];
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
{
if (this.state.pg.history[pool_id] &&
this.state.pg.history[pool_id][pg])
{
pg_history[pg-1] = this.state.pg.history[pool_id][pg];
}
}
const real_prev_pgs = [];
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
{
real_prev_pgs[pg-1] = [ ...this.state.config.pgs.items[pool_id][pg].osd_set ];
}
if (real_prev_pgs.length > 0 && real_prev_pgs.length != pool_res.pgs.length)
{
console.log(
`Changing PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+
` from: ${real_prev_pgs.length} to ${pool_res.pgs.length}`
);
pg_history = PGUtil.scale_pg_history(pg_history, real_prev_pgs, pool_res.pgs);
// Drop stats
etcd_request.success.push({ requestDeleteRange: {
key: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'/'),
range_end: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'0'),
} });
}
const stats = {
used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
...pool_res.stats,
};
etcd_request.success.push({ requestPut: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
value: b64(JSON.stringify(stats)),
} });
this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, pool_res.pgs, pg_history);
}
new_config_pgs.hash = tree_hash;
return await this.save_pg_config(new_config_pgs, etcd_request);
}
async save_pg_config(new_config_pgs, etcd_request = { compare: [], success: [] })
{
etcd_request.compare.push(
@@ -1443,14 +1521,8 @@ class Mon
etcd_request.success.push(
{ requestPut: { key: b64(this.etcd_prefix+'/config/pgs'), value: b64(JSON.stringify(new_config_pgs)) } },
);
const res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0);
if (!res.succeeded)
{
console.log('Someone changed PG configuration while we also tried to change it. Retrying in '+this.config.mon_change_timeout+' ms');
this.schedule_recheck();
return;
}
console.log('PG configuration successfully changed');
const txn_res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0);
return txn_res.succeeded;
}
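
`save_pg_config()` now only reports whether the etcd transaction succeeded, and the caller decides how to retry. The idea can be sketched with an in-memory stand-in for etcd. Everything here (`FakeStore`, `updateWithRetry`, the `beforeCommit` hook) is hypothetical, and because the real compare clause is elided in the hunk, the sketch simply assumes a revision comparison. The real monitor also waits `mon_retry_change_timeout` ms between attempts, which is left out to keep the example synchronous.

```javascript
// In-memory stand-in for a single etcd key with a modification revision (hypothetical).
class FakeStore
{
    constructor(value)
    {
        this.value = value;
        this.revision = 1;
    }

    // Compare-and-put: succeeds only if nobody has written since expectedRevision
    txn(expectedRevision, newValue)
    {
        if (this.revision !== expectedRevision)
        {
            return { succeeded: false };
        }
        this.value = newValue;
        this.revision++;
        return { succeeded: true };
    }
}

// Re-read, recompute and retry until the optimistic transaction goes through.
// beforeCommit is a test hook that lets a "parallel writer" interfere.
function updateWithRetry(store, modify, beforeCommit = () => {})
{
    for (let attempt = 1; ; attempt++)
    {
        const seen = { value: store.value, revision: store.revision };
        beforeCommit(attempt);
        if (store.txn(seen.revision, modify(seen.value)).succeeded)
        {
            return attempt;
        }
    }
}

const store = new FakeStore(0);
const attempts = updateWithRetry(store, v => v + 1, (n) =>
{
    // Simulate another writer changing the key during the first attempt only
    if (n === 1)
    {
        store.txn(store.revision, 100);
    }
});
console.log(attempts, store.value);
```

The modification is recomputed from the freshly read value on every attempt, which is why the expensive optimization step can stay outside the loop while the cheap "apply" step is repeated.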
// Schedule next recheck at least at <unixtime>
@@ -1569,9 +1641,13 @@ class Mon
}
const sum_diff = { op_stats: {}, subop_stats: {}, recovery_stats: {} };
// Sum derived values instead of deriving summed
for (const osd in this.state.osd.stats)
for (const osd in this.state.osd.state)
{
const derived = this.prev_stats.osd_diff[osd];
if (!this.state.osd.state[osd] || !derived)
{
continue;
}
for (const type in sum_diff)
{
for (const op in derived[type]||{})
@@ -1672,9 +1748,13 @@ class Mon
const used = this.state.pool.stats[pool_id].used_raw_tb;
this.state.pool.stats[pool_id].used_raw_tb = Number(used)/1024/1024/1024/1024;
}
for (const osd_num in this.state.osd.inodestats)
for (const osd_num in this.state.osd.state)
{
const ist = this.state.osd.inodestats[osd_num];
if (!ist || !this.state.osd.state[osd_num])
{
continue;
}
for (const pool_id in ist)
{
inode_stats[pool_id] = inode_stats[pool_id] || {};
@@ -1690,9 +1770,14 @@ class Mon
}
}
}
for (const osd in this.prev_stats.osd_diff)
for (const osd in this.state.osd.state)
{
for (const pool_id in this.prev_stats.osd_diff[osd].inode_stats)
const osd_diff = this.prev_stats.osd_diff[osd];
if (!osd_diff || !this.state.osd.state[osd])
{
continue;
}
for (const pool_id in osd_diff.inode_stats)
{
for (const inode_num in this.prev_stats.osd_diff[osd].inode_stats[pool_id])
{
@@ -1932,14 +2017,14 @@ class Mon
return res.json;
}
}
this.die();
this.failconnect();
}
_die(err)
_die(err, code)
{
// In fact we can just try to rejoin
console.error(new Error(err || 'Cluster connection failed'));
process.exit(1);
process.exit(code || 2);
}
local_ips(all)
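
The change above splits fatal errors into two exit codes: `die` exits with 1 and `failconnect` with 2. A minimal sketch of that split follows; `makeHandlers` and the injectable `exit` parameter are hypothetical and exist only so the behaviour can be exercised without terminating the process.

```javascript
// Hypothetical factory mirroring the die/failconnect split in the Mon constructor.
function makeHandlers(exit = process.exit)
{
    const _die = (err, code) =>
    {
        console.error(new Error(err || 'Cluster connection failed'));
        // As in the diff, a missing code falls back to 2
        exit(code || 2);
    };
    return {
        failconnect: (e) => _die(e, 2),
        die: (e) => _die(e, 1),
    };
}

// Record exit codes instead of exiting
const codes = [];
const handlers = makeHandlers((code) => codes.push(code));
handlers.die('Unexpected error');
handlers.failconnect();
console.log(codes);
```

A supervisor could in principle use the two codes to tell connection failures from other fatal errors; whether the shipped service files do so is not shown in this diff.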


@@ -1,6 +1,6 @@
{
"name": "vitastor-mon",
"version": "1.2.0",
"version": "1.4.8",
"description": "Vitastor SDS monitor service",
"main": "mon-main.js",
"scripts": {


@@ -8,7 +8,9 @@ PartOf=vitastor.target
LimitNOFILE=1048576
LimitNPROC=1048576
LimitMEMLOCK=infinity
ExecStart=bash -c 'exec vitastor-disk exec-osd /dev/vitastor/osd%i-data >>/var/log/vitastor/osd%i.log 2>&1'
# Use the following for direct logs to files
#ExecStart=bash -c 'exec vitastor-disk exec-osd /dev/vitastor/osd%i-data >>/var/log/vitastor/osd%i.log 2>&1'
ExecStart=vitastor-disk exec-osd /dev/vitastor/osd%i-data
ExecStartPre=+vitastor-disk pre-exec /dev/vitastor/osd%i-data
WorkingDirectory=/
User=vitastor


@@ -110,7 +110,6 @@ sub properties
vitastor_etcd_address => {
description => 'IP address(es) of etcd.',
type => 'string',
format => 'pve-storage-portal-dns-list',
},
vitastor_etcd_prefix => {
description => 'Prefix for Vitastor etcd metadata',


@@ -50,7 +50,7 @@ from cinder.volume import configuration
from cinder.volume import driver
from cinder.volume import volume_utils
VERSION = '1.2.0'
VERSION = '1.4.8'
LOG = logging.getLogger(__name__)


@@ -0,0 +1,692 @@
commit d85024bd803b3b91f15578ed22de4ce31856626f
Author: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Wed Jan 24 18:07:43 2024 +0300
Add Vitastor support
diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index 7fa5c2b8b5..2d77f391e7 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -1898,6 +1898,35 @@
</element>
</define>
+ <define name="diskSourceNetworkProtocolVitastor">
+ <element name="source">
+ <interleave>
+ <attribute name="protocol">
+ <value>vitastor</value>
+ </attribute>
+ <ref name="diskSourceCommon"/>
+ <optional>
+ <attribute name="name"/>
+ </optional>
+ <optional>
+ <attribute name="query"/>
+ </optional>
+ <zeroOrMore>
+ <ref name="diskSourceNetworkHost"/>
+ </zeroOrMore>
+ <optional>
+ <element name="config">
+ <attribute name="file">
+ <ref name="absFilePath"/>
+ </attribute>
+ <empty/>
+ </element>
+ </optional>
+ <empty/>
+ </interleave>
+ </element>
+ </define>
+
<define name="diskSourceNetworkProtocolISCSI">
<element name="source">
<attribute name="protocol">
@@ -2154,6 +2183,7 @@
<ref name="diskSourceNetworkProtocolSimple"/>
<ref name="diskSourceNetworkProtocolVxHS"/>
<ref name="diskSourceNetworkProtocolNFS"/>
+ <ref name="diskSourceNetworkProtocolVitastor"/>
</choice>
</define>
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
index f89856b93e..a8cb9387e2 100644
--- a/include/libvirt/libvirt-storage.h
+++ b/include/libvirt/libvirt-storage.h
@@ -246,6 +246,7 @@ typedef enum {
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS = 1 << 17,
VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE = 1 << 18,
VIR_CONNECT_LIST_STORAGE_POOLS_ISCSI_DIRECT = 1 << 19,
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR = 1 << 20,
} virConnectListAllStoragePoolsFlags;
int virConnectListAllStoragePools(virConnectPtr conn,
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 5691b8d2d5..6669e8451d 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -8293,7 +8293,8 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
src->configFile = virXPathString("string(./config/@file)", ctxt);
if (src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTP ||
- src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS)
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS ||
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR)
src->query = virXMLPropString(node, "query");
if (virDomainStorageNetworkParseHosts(node, ctxt, &src->hosts, &src->nhosts) < 0)
@@ -31267,6 +31268,7 @@ virDomainStorageSourceTranslateSourcePool(virStorageSource *src,
case VIR_STORAGE_POOL_MPATH:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/conf/domain_validate.c b/src/conf/domain_validate.c
index a4271f1247..621c1b7b31 100644
--- a/src/conf/domain_validate.c
+++ b/src/conf/domain_validate.c
@@ -508,7 +508,7 @@ virDomainDiskDefValidateSourceChainOne(const virStorageSource *src)
}
}
- /* internal snapshots and config files are currently supported only with rbd: */
+ /* internal snapshots are currently supported only with rbd: */
if (virStorageSourceGetActualType(src) != VIR_STORAGE_TYPE_NETWORK &&
src->protocol != VIR_STORAGE_NET_PROTOCOL_RBD) {
if (src->snapshot) {
@@ -517,11 +517,15 @@ virDomainDiskDefValidateSourceChainOne(const virStorageSource *src)
"only with 'rbd' disks"));
return -1;
}
-
+ }
+ /* config files are currently supported only with rbd and vitastor: */
+ if (virStorageSourceGetActualType(src) != VIR_STORAGE_TYPE_NETWORK &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_RBD &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR) {
if (src->configFile) {
virReportError(VIR_ERR_XML_ERROR, "%s",
_("<config> element is currently supported "
- "only with 'rbd' disks"));
+ "only with 'rbd' and 'vitastor' disks"));
return -1;
}
}
diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
index 6690d26ffd..2255df9d28 100644
--- a/src/conf/storage_conf.c
+++ b/src/conf/storage_conf.c
@@ -60,7 +60,7 @@ VIR_ENUM_IMPL(virStoragePool,
"logical", "disk", "iscsi",
"iscsi-direct", "scsi", "mpath",
"rbd", "sheepdog", "gluster",
- "zfs", "vstorage",
+ "zfs", "vstorage", "vitastor",
);
VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
@@ -246,6 +246,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
.formatToString = virStorageFileFormatTypeToString,
}
},
+ {.poolType = VIR_STORAGE_POOL_VITASTOR,
+ .poolOptions = {
+ .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
+ VIR_STORAGE_POOL_SOURCE_NETWORK |
+ VIR_STORAGE_POOL_SOURCE_NAME),
+ },
+ .volOptions = {
+ .defaultFormat = VIR_STORAGE_FILE_RAW,
+ .formatFromString = virStorageVolumeFormatFromString,
+ .formatToString = virStorageFileFormatTypeToString,
+ }
+ },
{.poolType = VIR_STORAGE_POOL_SHEEPDOG,
.poolOptions = {
.flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -546,6 +558,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
_("element 'name' is mandatory for RBD pool"));
return -1;
}
+ if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
+ virReportError(VIR_ERR_XML_ERROR, "%s",
+ _("element 'name' is mandatory for Vitastor pool"));
+ return -1;
+ }
if (options->formatFromString) {
g_autofree char *format = NULL;
@@ -1176,6 +1193,7 @@ virStoragePoolDefFormatBuf(virBuffer *buf,
/* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
* files, so they don't have a target */
if (def->type != VIR_STORAGE_POOL_RBD &&
+ def->type != VIR_STORAGE_POOL_VITASTOR &&
def->type != VIR_STORAGE_POOL_SHEEPDOG &&
def->type != VIR_STORAGE_POOL_GLUSTER &&
def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
index aaecf138d6..97172db38b 100644
--- a/src/conf/storage_conf.h
+++ b/src/conf/storage_conf.h
@@ -106,6 +106,7 @@ typedef enum {
VIR_STORAGE_POOL_GLUSTER, /* Gluster device */
VIR_STORAGE_POOL_ZFS, /* ZFS */
VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
+ VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
VIR_STORAGE_POOL_LAST,
} virStoragePoolType;
@@ -466,6 +467,7 @@ VIR_ENUM_DECL(virStoragePartedFs);
VIR_CONNECT_LIST_STORAGE_POOLS_SCSI | \
VIR_CONNECT_LIST_STORAGE_POOLS_MPATH | \
VIR_CONNECT_LIST_STORAGE_POOLS_RBD | \
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER | \
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS | \
diff --git a/src/conf/storage_source_conf.c b/src/conf/storage_source_conf.c
index d42f715f26..29d8da3d10 100644
--- a/src/conf/storage_source_conf.c
+++ b/src/conf/storage_source_conf.c
@@ -86,6 +86,7 @@ VIR_ENUM_IMPL(virStorageNetProtocol,
"ssh",
"vxhs",
"nfs",
+ "vitastor",
);
@@ -1265,6 +1266,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
return 24007;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_RBD:
/* we don't provide a default for RBD */
return 0;
diff --git a/src/conf/storage_source_conf.h b/src/conf/storage_source_conf.h
index c4a026881c..67568e9181 100644
--- a/src/conf/storage_source_conf.h
+++ b/src/conf/storage_source_conf.h
@@ -128,6 +128,7 @@ typedef enum {
VIR_STORAGE_NET_PROTOCOL_SSH,
VIR_STORAGE_NET_PROTOCOL_VXHS,
VIR_STORAGE_NET_PROTOCOL_NFS,
+ VIR_STORAGE_NET_PROTOCOL_VITASTOR,
VIR_STORAGE_NET_PROTOCOL_LAST
} virStorageNetProtocol;
diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
index 02903ac487..504df599fb 100644
--- a/src/conf/virstorageobj.c
+++ b/src/conf/virstorageobj.c
@@ -1481,6 +1481,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
return 1;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_LAST:
break;
@@ -1978,6 +1979,8 @@ virStoragePoolObjMatch(virStoragePoolObj *obj,
(obj->def->type == VIR_STORAGE_POOL_MPATH)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
(obj->def->type == VIR_STORAGE_POOL_RBD)) ||
+ (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
+ (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
(obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
index cbc522b300..b4760fa58d 100644
--- a/src/libvirt-storage.c
+++ b/src/libvirt-storage.c
@@ -92,6 +92,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
* VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
* VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
* VIR_CONNECT_LIST_STORAGE_POOLS_RBD
+ * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
* VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
* VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER
* VIR_CONNECT_LIST_STORAGE_POOLS_ZFS
diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
index 1ac6253ad7..abe4587f94 100644
--- a/src/libxl/libxl_conf.c
+++ b/src/libxl/libxl_conf.c
@@ -962,6 +962,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSource *src,
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/libxl/xen_xl.c b/src/libxl/xen_xl.c
index 7604e3d534..6453bb9776 100644
--- a/src/libxl/xen_xl.c
+++ b/src/libxl/xen_xl.c
@@ -1506,6 +1506,7 @@ xenFormatXLDiskSrcNet(virStorageSource *src)
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
index e5ff653a60..884ecc79ea 100644
--- a/src/qemu/qemu_block.c
+++ b/src/qemu/qemu_block.c
@@ -943,6 +943,38 @@ qemuBlockStorageSourceGetRBDProps(virStorageSource *src,
}
+static virJSONValue *
+qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
+{
+ virJSONValue *ret = NULL;
+ virStorageNetHostDef *host;
+ size_t i;
+ g_auto(virBuffer) buf = VIR_BUFFER_INITIALIZER;
+ g_autofree char *etcd = NULL;
+
+ for (i = 0; i < src->nhosts; i++) {
+ host = src->hosts + i;
+ if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
+ return NULL;
+ }
+ virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
+ }
+ if (src->nhosts > 0) {
+ etcd = virBufferContentAndReset(&buf);
+ }
+
+ if (virJSONValueObjectCreate(&ret,
+ "S:etcd-host", etcd,
+ "S:etcd-prefix", src->query,
+ "S:config-path", src->configFile,
+ "s:image", src->path,
+ NULL) < 0)
+ return NULL;
+
+ return ret;
+}
+
+
static virJSONValue *
qemuBlockStorageSourceGetSheepdogProps(virStorageSource *src)
{
@@ -1233,6 +1265,12 @@ qemuBlockStorageSourceGetBackendProps(virStorageSource *src,
return NULL;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
+ return NULL;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
@@ -2244,6 +2282,7 @@ qemuBlockGetBackingStoreString(virStorageSource *src,
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
case VIR_STORAGE_NET_PROTOCOL_SSH:
@@ -2626,6 +2665,12 @@ qemuBlockStorageSourceCreateGetStorageProps(virStorageSource *src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(location = qemuBlockStorageSourceGetVitastorProps(src)))
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(location = qemuBlockStorageSourceGetSheepdogProps(src)))
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index d822533ccb..afe2087303 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -1723,6 +1723,43 @@ qemuBuildNetworkDriveStr(virStorageSource *src,
ret = virBufferContentAndReset(&buf);
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (strchr(src->path, ':')) {
+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+ _("':' not allowed in Vitastor source volume name '%s'"),
+ src->path);
+ return NULL;
+ }
+
+ virBufferStrcat(&buf, "vitastor:image=", src->path, NULL);
+
+ if (src->nhosts > 0) {
+ virBufferAddLit(&buf, ":etcd-host=");
+ for (i = 0; i < src->nhosts; i++) {
+ if (i)
+ virBufferAddLit(&buf, ",");
+
+ /* assume host containing : is ipv6 */
+ if (strchr(src->hosts[i].name, ':'))
+ virBufferEscape(&buf, '\\', ":", "[%s]",
+ src->hosts[i].name);
+ else
+ virBufferAsprintf(&buf, "%s", src->hosts[i].name);
+
+ if (src->hosts[i].port)
+ virBufferAsprintf(&buf, "\\:%u", src->hosts[i].port);
+ }
+ }
+
+ if (src->configFile)
+ virBufferEscape(&buf, '\\', ":", ":config-path=%s", src->configFile);
+
+ if (src->query)
+ virBufferEscape(&buf, '\\', ":", ":etcd-prefix=%s", src->query);
+
+ ret = virBufferContentAndReset(&buf);
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_VXHS:
virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
_("VxHS protocol does not support URI syntax"));
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index a8401bac30..3dc1fe6db0 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -4731,7 +4731,8 @@ qemuDomainValidateStorageSource(virStorageSource *src,
if (src->query &&
(actualType != VIR_STORAGE_TYPE_NETWORK ||
(src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTPS &&
- src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP))) {
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR))) {
virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
_("query is supported only with HTTP(S) protocols"));
return -1;
@@ -9919,6 +9920,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSource *src,
break;
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c
index f92e00f9c0..854a3fbc90 100644
--- a/src/qemu/qemu_snapshot.c
+++ b/src/qemu/qemu_snapshot.c
@@ -393,6 +393,7 @@ qemuSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDef *snapdisk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
@@ -485,6 +486,7 @@ qemuSnapshotPrepareDiskExternalActive(virDomainObj *vm,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
case VIR_STORAGE_NET_PROTOCOL_HTTP:
@@ -638,6 +640,7 @@ qemuSnapshotPrepareDiskInternal(virDomainDiskDef *disk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
index 4df2c75a2b..5a5e48ef71 100644
--- a/src/storage/storage_driver.c
+++ b/src/storage/storage_driver.c
@@ -1643,6 +1643,7 @@ storageVolLookupByPathCallback(virStoragePoolObj *obj,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/storage_file/storage_source_backingstore.c b/src/storage_file/storage_source_backingstore.c
index e48ae725ab..2017ccc88c 100644
--- a/src/storage_file/storage_source_backingstore.c
+++ b/src/storage_file/storage_source_backingstore.c
@@ -284,6 +284,75 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
}
+static int
+virStorageSourceParseVitastorColonString(const char *colonstr,
+ virStorageSource *src)
+{
+ char *p, *e, *next;
+ g_autofree char *options = NULL;
+
+ /* optionally skip the "vitastor:" prefix if provided */
+ if (STRPREFIX(colonstr, "vitastor:"))
+ colonstr += strlen("vitastor:");
+
+ options = g_strdup(colonstr);
+
+ p = options;
+ while (*p) {
+ /* find : delimiter or end of string */
+ for (e = p; *e && *e != ':'; ++e) {
+ if (*e == '\\') {
+ e++;
+ if (*e == '\0')
+ break;
+ }
+ }
+ if (*e == '\0') {
+ next = e; /* last kv pair */
+ } else {
+ next = e + 1;
+ *e = '\0';
+ }
+
+ if (STRPREFIX(p, "image=")) {
+ src->path = g_strdup(p + strlen("image="));
+ } else if (STRPREFIX(p, "etcd-prefix=")) {
+ src->query = g_strdup(p + strlen("etcd-prefix="));
+ } else if (STRPREFIX(p, "config-path=")) {
+ src->configFile = g_strdup(p + strlen("config-path="));
+ } else if (STRPREFIX(p, "etcd-host=")) {
+ char *h, *sep;
+
+ h = p + strlen("etcd-host=");
+ while (h < e) {
+ for (sep = h; sep < e; ++sep) {
+ if (*sep == '\\' && (sep[1] == ',' ||
+ sep[1] == ';' ||
+ sep[1] == ' ')) {
+ *sep = '\0';
+ sep += 2;
+ break;
+ }
+ }
+
+ if (virStorageSourceRBDAddHost(src, h) < 0)
+ return -1;
+
+ h = sep;
+ }
+ }
+
+ p = next;
+ }
+
+ if (!src->path) {
+ return -1;
+ }
+
+ return 0;
+}
+
+
static int
virStorageSourceParseNBDColonString(const char *nbdstr,
virStorageSource *src)
@@ -396,6 +465,11 @@ virStorageSourceParseBackingColon(virStorageSource *src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (virStorageSourceParseVitastorColonString(path, src) < 0)
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -984,6 +1058,54 @@ virStorageSourceParseBackingJSONRBD(virStorageSource *src,
return 0;
}
+static int
+virStorageSourceParseBackingJSONVitastor(virStorageSource *src,
+ virJSONValue *json,
+ const char *jsonstr G_GNUC_UNUSED,
+ int opaque G_GNUC_UNUSED)
+{
+ const char *filename;
+ const char *image = virJSONValueObjectGetString(json, "image");
+ const char *conf = virJSONValueObjectGetString(json, "config-path");
+ const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd-prefix");
+ virJSONValue *servers = virJSONValueObjectGetArray(json, "server");
+ size_t nservers;
+ size_t i;
+
+ src->type = VIR_STORAGE_TYPE_NETWORK;
+ src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
+
+ /* legacy syntax passed via 'filename' option */
+ if ((filename = virJSONValueObjectGetString(json, "filename")))
+ return virStorageSourceParseVitastorColonString(filename, src);
+
+ if (!image) {
+ virReportError(VIR_ERR_INVALID_ARG, "%s",
+ _("missing image name in Vitastor backing volume "
+ "JSON specification"));
+ return -1;
+ }
+
+ src->path = g_strdup(image);
+ src->configFile = g_strdup(conf);
+ src->query = g_strdup(etcd_prefix);
+
+ if (servers) {
+ nservers = virJSONValueArraySize(servers);
+
+ src->hosts = g_new0(virStorageNetHostDef, nservers);
+ src->nhosts = nservers;
+
+ for (i = 0; i < nservers; i++) {
+ if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
+ virJSONValueArrayGet(servers, i)) < 0)
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
static int
virStorageSourceParseBackingJSONRaw(virStorageSource *src,
virJSONValue *json,
@@ -1162,6 +1284,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
{"sheepdog", false, virStorageSourceParseBackingJSONSheepdog, 0},
{"ssh", false, virStorageSourceParseBackingJSONSSH, 0},
{"rbd", false, virStorageSourceParseBackingJSONRBD, 0},
+ {"vitastor", false, virStorageSourceParseBackingJSONVitastor, 0},
{"raw", true, virStorageSourceParseBackingJSONRaw, 0},
{"nfs", false, virStorageSourceParseBackingJSONNFS, 0},
{"vxhs", false, virStorageSourceParseBackingJSONVxHS, 0},
diff --git a/src/test/test_driver.c b/src/test/test_driver.c
index 0e93b79922..b4d33f5f56 100644
--- a/src/test/test_driver.c
+++ b/src/test/test_driver.c
@@ -7367,6 +7367,7 @@ testStorageVolumeTypeForPool(int pooltype)
case VIR_STORAGE_POOL_ISCSI_DIRECT:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
return VIR_STORAGE_VOL_NETWORK;
case VIR_STORAGE_POOL_LOGICAL:
case VIR_STORAGE_POOL_DISK:
diff --git a/tests/storagepoolcapsschemadata/poolcaps-fs.xml b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
index eee75af746..8bd0a57bdd 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-fs.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='no'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolcapsschemadata/poolcaps-full.xml b/tests/storagepoolcapsschemadata/poolcaps-full.xml
index 805950a937..852df0de16 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-full.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-full.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='yes'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolxml2argvtest.c b/tests/storagepoolxml2argvtest.c
index 449b745519..7f95cc8e08 100644
--- a/tests/storagepoolxml2argvtest.c
+++ b/tests/storagepoolxml2argvtest.c
@@ -68,6 +68,7 @@ testCompareXMLToArgvFiles(bool shouldFail,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_VSTORAGE:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_LAST:
default:
VIR_TEST_DEBUG("pool type '%s' has no xml2argv test", defTypeStr);
diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
index d391257f6e..46799c4a90 100644
--- a/tools/virsh-pool.c
+++ b/tools/virsh-pool.c
@@ -1213,6 +1213,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd G_GNUC_UNUSED)
case VIR_STORAGE_POOL_VSTORAGE:
flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
+ flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
+ break;
case VIR_STORAGE_POOL_LAST:
break;
}
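A minimal standalone sketch of the legacy drive string that the qemu_command.c hunk in this patch assembles: `vitastor:image=NAME`, followed by optional `:etcd-host=`, `:config-path=` and `:etcd-prefix=` fields, with any ':' inside a value escaped by a backslash. It uses plain libc rather than libvirt's virBuffer API, so it only approximates the patch's behavior. `struct vt_host`, `vt_append`, `vt_append_escaped` and `vt_format_drive` are hypothetical names introduced for illustration; they are not part of libvirt or Vitastor.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical host record; libvirt's real type is virStorageNetHostDef. */
struct vt_host {
    const char *name;
    unsigned int port; /* 0 means "no port given" */
};

/* Append s to out without escaping. Returns -1 if it would not fit. */
static int vt_append(char *out, size_t cap, const char *s)
{
    size_t len = strlen(out);
    size_t add = strlen(s);

    if (len + add >= cap)
        return -1;
    memcpy(out + len, s, add + 1);
    return 0;
}

/* Append s to out, prefixing every ':' with a backslash. */
static int vt_append_escaped(char *out, size_t cap, const char *s)
{
    size_t len = strlen(out);

    for (; *s; s++) {
        size_t need = (*s == ':') ? 2 : 1;

        if (len + need >= cap)
            return -1;
        if (*s == ':')
            out[len++] = '\\';
        out[len++] = *s;
    }
    out[len] = '\0';
    return 0;
}

/* Build "vitastor:image=..." plus the optional fields (assumed layout,
 * modelled on the VIR_STORAGE_NET_PROTOCOL_VITASTOR case in the patch). */
int vt_format_drive(char *out, size_t cap, const char *image,
                    const struct vt_host *hosts, size_t nhosts,
                    const char *config_path, const char *etcd_prefix)
{
    char port[16];
    size_t i;

    assert(out && cap > 0 && image);
    out[0] = '\0';

    /* the patch rejects ':' in the image name instead of escaping it */
    if (strchr(image, ':'))
        return -1;
    if (vt_append(out, cap, "vitastor:image=") < 0 ||
        vt_append(out, cap, image) < 0)
        return -1;

    for (i = 0; i < nhosts; i++) {
        /* like the patch, assume a host name containing ':' is IPv6 */
        int v6 = strchr(hosts[i].name, ':') != NULL;

        if (vt_append(out, cap, i ? "," : ":etcd-host=") < 0 ||
            (v6 && vt_append(out, cap, "[") < 0) ||
            vt_append_escaped(out, cap, hosts[i].name) < 0 ||
            (v6 && vt_append(out, cap, "]") < 0))
            return -1;
        if (hosts[i].port) {
            snprintf(port, sizeof(port), "\\:%u", hosts[i].port);
            if (vt_append(out, cap, port) < 0)
                return -1;
        }
    }

    if (config_path &&
        (vt_append(out, cap, ":config-path=") < 0 ||
         vt_append_escaped(out, cap, config_path) < 0))
        return -1;
    if (etcd_prefix &&
        (vt_append(out, cap, ":etcd-prefix=") < 0 ||
         vt_append_escaped(out, cap, etcd_prefix) < 0))
        return -1;
    return 0;
}
```

The host/port separator is written as `\:` because an unescaped ':' would otherwise end the `etcd-host` field when the string is split back into options.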


@@ -0,0 +1,643 @@
commit c1cd026e211e94b120028e7c98a6e4ce5afe9846
Author: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Wed Jan 24 22:04:50 2024 +0300
Add Vitastor support
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
index aaad4a3da1..5f5daa8341 100644
--- a/include/libvirt/libvirt-storage.h
+++ b/include/libvirt/libvirt-storage.h
@@ -326,6 +326,7 @@ typedef enum {
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS = 1 << 17, /* (Since: 1.2.8) */
VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE = 1 << 18, /* (Since: 3.1.0) */
VIR_CONNECT_LIST_STORAGE_POOLS_ISCSI_DIRECT = 1 << 19, /* (Since: 5.6.0) */
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR = 1 << 20, /* (Since: 5.0.0) */
} virConnectListAllStoragePoolsFlags;
int virConnectListAllStoragePools(virConnectPtr conn,
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 22ad43e1d7..56c81d6852 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -7185,7 +7185,8 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
src->configFile = virXPathString("string(./config/@file)", ctxt);
if (src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTP ||
- src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS)
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS ||
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR)
src->query = virXMLPropString(node, "query");
if (virDomainStorageNetworkParseHosts(node, ctxt, &src->hosts, &src->nhosts) < 0)
@@ -30618,6 +30619,7 @@ virDomainStorageSourceTranslateSourcePool(virStorageSource *src,
case VIR_STORAGE_POOL_MPATH:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/conf/domain_validate.c b/src/conf/domain_validate.c
index c72108886e..c739ed6c43 100644
--- a/src/conf/domain_validate.c
+++ b/src/conf/domain_validate.c
@@ -495,6 +495,7 @@ virDomainDiskDefValidateSourceChainOne(const virStorageSource *src)
case VIR_STORAGE_NET_PROTOCOL_RBD:
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
@@ -541,7 +542,7 @@ virDomainDiskDefValidateSourceChainOne(const virStorageSource *src)
}
}
- /* internal snapshots and config files are currently supported only with rbd: */
+ /* internal snapshots are currently supported only with rbd: */
if (virStorageSourceGetActualType(src) != VIR_STORAGE_TYPE_NETWORK &&
src->protocol != VIR_STORAGE_NET_PROTOCOL_RBD) {
if (src->snapshot) {
@@ -549,10 +550,15 @@ virDomainDiskDefValidateSourceChainOne(const virStorageSource *src)
_("<snapshot> element is currently supported only with 'rbd' disks"));
return -1;
}
+ }
+ /* config files are currently supported only with rbd and vitastor: */
+ if (virStorageSourceGetActualType(src) != VIR_STORAGE_TYPE_NETWORK &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_RBD &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR) {
if (src->configFile) {
virReportError(VIR_ERR_XML_ERROR, "%s",
- _("<config> element is currently supported only with 'rbd' disks"));
+ _("<config> element is currently supported only with 'rbd' and 'vitastor' disks"));
return -1;
}
}
diff --git a/src/conf/schemas/domaincommon.rng b/src/conf/schemas/domaincommon.rng
index b98a2ae602..7d7a872e01 100644
--- a/src/conf/schemas/domaincommon.rng
+++ b/src/conf/schemas/domaincommon.rng
@@ -1997,6 +1997,35 @@
</element>
</define>
+ <define name="diskSourceNetworkProtocolVitastor">
+ <element name="source">
+ <interleave>
+ <attribute name="protocol">
+ <value>vitastor</value>
+ </attribute>
+ <ref name="diskSourceCommon"/>
+ <optional>
+ <attribute name="name"/>
+ </optional>
+ <optional>
+ <attribute name="query"/>
+ </optional>
+ <zeroOrMore>
+ <ref name="diskSourceNetworkHost"/>
+ </zeroOrMore>
+ <optional>
+ <element name="config">
+ <attribute name="file">
+ <ref name="absFilePath"/>
+ </attribute>
+ <empty/>
+ </element>
+ </optional>
+ <empty/>
+ </interleave>
+ </element>
+ </define>
+
<define name="diskSourceNetworkProtocolISCSI">
<element name="source">
<attribute name="protocol">
@@ -2347,6 +2376,7 @@
<ref name="diskSourceNetworkProtocolSimple"/>
<ref name="diskSourceNetworkProtocolVxHS"/>
<ref name="diskSourceNetworkProtocolNFS"/>
+ <ref name="diskSourceNetworkProtocolVitastor"/>
</choice>
</define>
diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
index 68842004b7..1d69a788b6 100644
--- a/src/conf/storage_conf.c
+++ b/src/conf/storage_conf.c
@@ -56,7 +56,7 @@ VIR_ENUM_IMPL(virStoragePool,
"logical", "disk", "iscsi",
"iscsi-direct", "scsi", "mpath",
"rbd", "sheepdog", "gluster",
- "zfs", "vstorage",
+ "zfs", "vstorage", "vitastor",
);
VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
@@ -242,6 +242,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
.formatToString = virStorageFileFormatTypeToString,
}
},
+ {.poolType = VIR_STORAGE_POOL_VITASTOR,
+ .poolOptions = {
+ .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
+ VIR_STORAGE_POOL_SOURCE_NETWORK |
+ VIR_STORAGE_POOL_SOURCE_NAME),
+ },
+ .volOptions = {
+ .defaultFormat = VIR_STORAGE_FILE_RAW,
+ .formatFromString = virStorageVolumeFormatFromString,
+ .formatToString = virStorageFileFormatTypeToString,
+ }
+ },
{.poolType = VIR_STORAGE_POOL_SHEEPDOG,
.poolOptions = {
.flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -538,6 +550,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
_("element 'name' is mandatory for RBD pool"));
return -1;
}
+ if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
+ virReportError(VIR_ERR_XML_ERROR, "%s",
+ _("element 'name' is mandatory for Vitastor pool"));
+ return -1;
+ }
if (options->formatFromString) {
g_autofree char *format = NULL;
@@ -1127,6 +1144,7 @@ virStoragePoolDefFormatBuf(virBuffer *buf,
/* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
* files, so they don't have a target */
if (def->type != VIR_STORAGE_POOL_RBD &&
+ def->type != VIR_STORAGE_POOL_VITASTOR &&
def->type != VIR_STORAGE_POOL_SHEEPDOG &&
def->type != VIR_STORAGE_POOL_GLUSTER &&
def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
index fc67957cfe..720c07ef74 100644
--- a/src/conf/storage_conf.h
+++ b/src/conf/storage_conf.h
@@ -103,6 +103,7 @@ typedef enum {
VIR_STORAGE_POOL_GLUSTER, /* Gluster device */
VIR_STORAGE_POOL_ZFS, /* ZFS */
VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
+ VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
VIR_STORAGE_POOL_LAST,
} virStoragePoolType;
@@ -454,6 +455,7 @@ VIR_ENUM_DECL(virStoragePartedFs);
VIR_CONNECT_LIST_STORAGE_POOLS_SCSI | \
VIR_CONNECT_LIST_STORAGE_POOLS_MPATH | \
VIR_CONNECT_LIST_STORAGE_POOLS_RBD | \
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER | \
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS | \
diff --git a/src/conf/storage_source_conf.c b/src/conf/storage_source_conf.c
index f974a521b1..cd394d0a9f 100644
--- a/src/conf/storage_source_conf.c
+++ b/src/conf/storage_source_conf.c
@@ -88,6 +88,7 @@ VIR_ENUM_IMPL(virStorageNetProtocol,
"ssh",
"vxhs",
"nfs",
+ "vitastor",
);
@@ -1301,6 +1302,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
return 24007;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_RBD:
/* we don't provide a default for RBD */
return 0;
diff --git a/src/conf/storage_source_conf.h b/src/conf/storage_source_conf.h
index 5e7d127453..283709eeb3 100644
--- a/src/conf/storage_source_conf.h
+++ b/src/conf/storage_source_conf.h
@@ -129,6 +129,7 @@ typedef enum {
VIR_STORAGE_NET_PROTOCOL_SSH,
VIR_STORAGE_NET_PROTOCOL_VXHS,
VIR_STORAGE_NET_PROTOCOL_NFS,
+ VIR_STORAGE_NET_PROTOCOL_VITASTOR,
VIR_STORAGE_NET_PROTOCOL_LAST
} virStorageNetProtocol;
diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
index 59fa5da372..4739167f5f 100644
--- a/src/conf/virstorageobj.c
+++ b/src/conf/virstorageobj.c
@@ -1438,6 +1438,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
return 1;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_ISCSI_DIRECT:
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_LAST:
@@ -1921,6 +1922,8 @@ virStoragePoolObjMatch(virStoragePoolObj *obj,
(obj->def->type == VIR_STORAGE_POOL_MPATH)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
(obj->def->type == VIR_STORAGE_POOL_RBD)) ||
+ (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
+ (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
(obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
index db7660aac4..561df34709 100644
--- a/src/libvirt-storage.c
+++ b/src/libvirt-storage.c
@@ -94,6 +94,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
* VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
* VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
* VIR_CONNECT_LIST_STORAGE_POOLS_RBD
+ * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
* VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
* VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER
* VIR_CONNECT_LIST_STORAGE_POOLS_ZFS
diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
index 62e1be6672..71a1d42896 100644
--- a/src/libxl/libxl_conf.c
+++ b/src/libxl/libxl_conf.c
@@ -979,6 +979,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSource *src,
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/libxl/xen_xl.c b/src/libxl/xen_xl.c
index f175359307..8efcf4c329 100644
--- a/src/libxl/xen_xl.c
+++ b/src/libxl/xen_xl.c
@@ -1456,6 +1456,7 @@ xenFormatXLDiskSrcNet(virStorageSource *src)
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
index 7e9daf0bdc..825b4a3006 100644
--- a/src/qemu/qemu_block.c
+++ b/src/qemu/qemu_block.c
@@ -758,6 +758,38 @@ qemuBlockStorageSourceGetRBDProps(virStorageSource *src,
}
+static virJSONValue *
+qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
+{
+ virJSONValue *ret = NULL;
+ virStorageNetHostDef *host;
+ size_t i;
+ g_auto(virBuffer) buf = VIR_BUFFER_INITIALIZER;
+ g_autofree char *etcd = NULL;
+
+ for (i = 0; i < src->nhosts; i++) {
+ host = src->hosts + i;
+ if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
+ return NULL;
+ }
+ virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
+ }
+ if (src->nhosts > 0) {
+ etcd = virBufferContentAndReset(&buf);
+ }
+
+ if (virJSONValueObjectAdd(&ret,
+ "S:etcd-host", etcd,
+ "S:etcd-prefix", src->query,
+ "S:config-path", src->configFile,
+ "s:image", src->path,
+ NULL) < 0)
+ return NULL;
+
+ return ret;
+}
+
+
static virJSONValue *
qemuBlockStorageSourceGetSheepdogProps(virStorageSource *src)
{
@@ -1140,6 +1172,12 @@ qemuBlockStorageSourceGetBackendProps(virStorageSource *src,
return NULL;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
+ return NULL;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
@@ -2032,6 +2070,7 @@ qemuBlockGetBackingStoreString(virStorageSource *src,
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
case VIR_STORAGE_NET_PROTOCOL_SSH:
@@ -2415,6 +2454,12 @@ qemuBlockStorageSourceCreateGetStorageProps(virStorageSource *src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(location = qemuBlockStorageSourceGetVitastorProps(src)))
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(location = qemuBlockStorageSourceGetSheepdogProps(src)))
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 953808fcfe..62860283d8 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -5215,7 +5215,8 @@ qemuDomainValidateStorageSource(virStorageSource *src,
if (src->query &&
(actualType != VIR_STORAGE_TYPE_NETWORK ||
(src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTPS &&
- src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP))) {
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR))) {
virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
_("query is supported only with HTTP(S) protocols"));
return -1;
@@ -10340,6 +10341,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSource *src,
break;
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c
index 73ff533827..e9c799ca8f 100644
--- a/src/qemu/qemu_snapshot.c
+++ b/src/qemu/qemu_snapshot.c
@@ -423,6 +423,7 @@ qemuSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDef *snapdisk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
@@ -648,6 +649,7 @@ qemuSnapshotPrepareDiskInternal(virDomainDiskDef *disk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
index 314fe930e0..fb615a8b4e 100644
--- a/src/storage/storage_driver.c
+++ b/src/storage/storage_driver.c
@@ -1626,6 +1626,7 @@ storageVolLookupByPathCallback(virStoragePoolObj *obj,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/storage_file/storage_source_backingstore.c b/src/storage_file/storage_source_backingstore.c
index 80681924ea..8a3ade9ec0 100644
--- a/src/storage_file/storage_source_backingstore.c
+++ b/src/storage_file/storage_source_backingstore.c
@@ -287,6 +287,75 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
}
+static int
+virStorageSourceParseVitastorColonString(const char *colonstr,
+ virStorageSource *src)
+{
+ char *p, *e, *next;
+ g_autofree char *options = NULL;
+
+ /* optionally skip the "vitastor:" prefix if provided */
+ if (STRPREFIX(colonstr, "vitastor:"))
+ colonstr += strlen("vitastor:");
+
+ options = g_strdup(colonstr);
+
+ p = options;
+ while (*p) {
+ /* find : delimiter or end of string */
+ for (e = p; *e && *e != ':'; ++e) {
+ if (*e == '\\') {
+ e++;
+ if (*e == '\0')
+ break;
+ }
+ }
+ if (*e == '\0') {
+ next = e; /* last kv pair */
+ } else {
+ next = e + 1;
+ *e = '\0';
+ }
+
+ if (STRPREFIX(p, "image=")) {
+ src->path = g_strdup(p + strlen("image="));
+ } else if (STRPREFIX(p, "etcd-prefix=")) {
+ src->query = g_strdup(p + strlen("etcd-prefix="));
+ } else if (STRPREFIX(p, "config-path=")) {
+ src->configFile = g_strdup(p + strlen("config-path="));
+ } else if (STRPREFIX(p, "etcd-host=")) {
+ char *h, *sep;
+
+ h = p + strlen("etcd-host=");
+ while (h < e) {
+ for (sep = h; sep < e; ++sep) {
+ if (*sep == '\\' && (sep[1] == ',' ||
+ sep[1] == ';' ||
+ sep[1] == ' ')) {
+ *sep = '\0';
+ sep += 2;
+ break;
+ }
+ }
+
+ if (virStorageSourceRBDAddHost(src, h) < 0)
+ return -1;
+
+ h = sep;
+ }
+ }
+
+ p = next;
+ }
+
+ if (!src->path) {
+ return -1;
+ }
+
+ return 0;
+}
+
+
static int
virStorageSourceParseNBDColonString(const char *nbdstr,
virStorageSource *src)
@@ -399,6 +468,11 @@ virStorageSourceParseBackingColon(virStorageSource *src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (virStorageSourceParseVitastorColonString(path, src) < 0)
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -975,6 +1049,54 @@ virStorageSourceParseBackingJSONRBD(virStorageSource *src,
return 0;
}
+static int
+virStorageSourceParseBackingJSONVitastor(virStorageSource *src,
+ virJSONValue *json,
+ const char *jsonstr G_GNUC_UNUSED,
+ int opaque G_GNUC_UNUSED)
+{
+ const char *filename;
+ const char *image = virJSONValueObjectGetString(json, "image");
+ const char *conf = virJSONValueObjectGetString(json, "config-path");
+ const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd-prefix");
+ virJSONValue *servers = virJSONValueObjectGetArray(json, "server");
+ size_t nservers;
+ size_t i;
+
+ src->type = VIR_STORAGE_TYPE_NETWORK;
+ src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
+
+ /* legacy syntax passed via 'filename' option */
+ if ((filename = virJSONValueObjectGetString(json, "filename")))
+ return virStorageSourceParseVitastorColonString(filename, src);
+
+ if (!image) {
+ virReportError(VIR_ERR_INVALID_ARG, "%s",
+ _("missing image name in Vitastor backing volume "
+ "JSON specification"));
+ return -1;
+ }
+
+ src->path = g_strdup(image);
+ src->configFile = g_strdup(conf);
+ src->query = g_strdup(etcd_prefix);
+
+ if (servers) {
+ nservers = virJSONValueArraySize(servers);
+
+ src->hosts = g_new0(virStorageNetHostDef, nservers);
+ src->nhosts = nservers;
+
+ for (i = 0; i < nservers; i++) {
+ if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
+ virJSONValueArrayGet(servers, i)) < 0)
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
static int
virStorageSourceParseBackingJSONRaw(virStorageSource *src,
virJSONValue *json,
@@ -1152,6 +1274,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
{"sheepdog", false, virStorageSourceParseBackingJSONSheepdog, 0},
{"ssh", false, virStorageSourceParseBackingJSONSSH, 0},
{"rbd", false, virStorageSourceParseBackingJSONRBD, 0},
+ {"vitastor", false, virStorageSourceParseBackingJSONVitastor, 0},
{"raw", true, virStorageSourceParseBackingJSONRaw, 0},
{"nfs", false, virStorageSourceParseBackingJSONNFS, 0},
{"vxhs", false, virStorageSourceParseBackingJSONVxHS, 0},
diff --git a/src/test/test_driver.c b/src/test/test_driver.c
index e87d7cfd44..ccc05d7aae 100644
--- a/src/test/test_driver.c
+++ b/src/test/test_driver.c
@@ -7335,6 +7335,7 @@ testStorageVolumeTypeForPool(int pooltype)
case VIR_STORAGE_POOL_ISCSI_DIRECT:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
return VIR_STORAGE_VOL_NETWORK;
case VIR_STORAGE_POOL_LOGICAL:
case VIR_STORAGE_POOL_DISK:
diff --git a/tests/storagepoolcapsschemadata/poolcaps-fs.xml b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
index eee75af746..8bd0a57bdd 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-fs.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='no'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolcapsschemadata/poolcaps-full.xml b/tests/storagepoolcapsschemadata/poolcaps-full.xml
index 805950a937..852df0de16 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-full.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-full.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='yes'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolxml2argvtest.c b/tests/storagepoolxml2argvtest.c
index e8e40d695e..db55fe5f3a 100644
--- a/tests/storagepoolxml2argvtest.c
+++ b/tests/storagepoolxml2argvtest.c
@@ -65,6 +65,7 @@ testCompareXMLToArgvFiles(bool shouldFail,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_VSTORAGE:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_LAST:
default:
VIR_TEST_DEBUG("pool type '%s' has no xml2argv test", defTypeStr);
diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
index 36f00cf643..5f5bd3464e 100644
--- a/tools/virsh-pool.c
+++ b/tools/virsh-pool.c
@@ -1223,6 +1223,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd G_GNUC_UNUSED)
case VIR_STORAGE_POOL_VSTORAGE:
flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
+ flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
+ break;
case VIR_STORAGE_POOL_LAST:
break;
}
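The colon-string parser added in storage_source_backingstore.c above can be approximated outside libvirt as follows. This is a sketch under simplifying assumptions: it splits the string in place on unescaped ':' and records the raw values, but it does not break `etcd-host` into individual hosts and does not strip backslash escapes. `struct vt_opts`, `vt_value` and `vt_parse_colon` are hypothetical names, not libvirt API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result holder; the pointers reference the caller's buffer. */
struct vt_opts {
    const char *image;
    const char *etcd_host;
    const char *config_path;
    const char *etcd_prefix;
};

/* Return the value part if kv starts with key (e.g. "image="), else NULL. */
static const char *vt_value(const char *kv, const char *key)
{
    size_t n = strlen(key);

    return strncmp(kv, key, n) == 0 ? kv + n : NULL;
}

/* Split a mutable "key=value:key=value" string on unescaped ':'.
 * Returns 0 on success, -1 if no image= field was found. */
int vt_parse_colon(char *s, struct vt_opts *o)
{
    static const char prefix[] = "vitastor:";
    char *p, *e, *next;
    const char *v;

    assert(s && o);
    memset(o, 0, sizeof(*o));

    /* optionally skip the "vitastor:" prefix, as the patch does */
    if (strncmp(s, prefix, sizeof(prefix) - 1) == 0)
        s += sizeof(prefix) - 1;

    for (p = s; *p; p = next) {
        /* find the next ':' that is not escaped by a backslash */
        for (e = p; *e && *e != ':'; e++) {
            if (*e == '\\' && e[1] != '\0')
                e++; /* skip the escaped character */
        }
        next = *e ? e + 1 : e; /* stay on the terminator for the last pair */
        *e = '\0';

        if ((v = vt_value(p, "image=")))
            o->image = v;
        else if ((v = vt_value(p, "etcd-host=")))
            o->etcd_host = v;
        else if ((v = vt_value(p, "config-path=")))
            o->config_path = v;
        else if ((v = vt_value(p, "etcd-prefix=")))
            o->etcd_prefix = v;
        /* unknown keys are ignored, matching the patch */
    }

    return o->image ? 0 : -1;
}
```

As in the patch, a missing `image=` field is the only condition treated as an error; every other field is optional.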


@@ -0,0 +1,190 @@
Index: pve-qemu-kvm-8.1.2/block/meson.build
===================================================================
--- pve-qemu-kvm-8.1.2.orig/block/meson.build
+++ pve-qemu-kvm-8.1.2/block/meson.build
@@ -123,6 +123,7 @@ foreach m : [
[libnfs, 'nfs', files('nfs.c')],
[libssh, 'ssh', files('ssh.c')],
[rbd, 'rbd', files('rbd.c')],
+ [vitastor, 'vitastor', files('vitastor.c')],
]
if m[0].found()
module_ss = ss.source_set()
Index: pve-qemu-kvm-8.1.2/meson.build
===================================================================
--- pve-qemu-kvm-8.1.2.orig/meson.build
+++ pve-qemu-kvm-8.1.2/meson.build
@@ -1303,6 +1303,26 @@ if not get_option('rbd').auto() or have_
endif
endif
+vitastor = not_found
+if not get_option('vitastor').auto() or have_block
+ libvitastor_client = cc.find_library('vitastor_client', has_headers: ['vitastor_c.h'],
+ required: get_option('vitastor'))
+ if libvitastor_client.found()
+ if cc.links('''
+ #include <vitastor_c.h>
+ int main(void) {
+ vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+ return 0;
+ }''', dependencies: libvitastor_client)
+ vitastor = declare_dependency(dependencies: libvitastor_client)
+ elif get_option('vitastor').enabled()
+ error('could not link libvitastor_client')
+ else
+ warning('could not link libvitastor_client, disabling')
+ endif
+ endif
+endif
+
glusterfs = not_found
glusterfs_ftruncate_has_stat = false
glusterfs_iocb_has_stat = false
@@ -2123,6 +2143,7 @@ if numa.found()
endif
config_host_data.set('CONFIG_OPENGL', opengl.found())
config_host_data.set('CONFIG_RBD', rbd.found())
+config_host_data.set('CONFIG_VITASTOR', vitastor.found())
config_host_data.set('CONFIG_RDMA', rdma.found())
config_host_data.set('CONFIG_SAFESTACK', get_option('safe_stack'))
config_host_data.set('CONFIG_SDL', sdl.found())
@@ -4298,6 +4319,7 @@ summary_info += {'fdt support': fd
summary_info += {'libcap-ng support': libcap_ng}
summary_info += {'bpf support': libbpf}
summary_info += {'rbd support': rbd}
+summary_info += {'vitastor support': vitastor}
summary_info += {'smartcard support': cacard}
summary_info += {'U2F support': u2f}
summary_info += {'libusb': libusb}
Index: pve-qemu-kvm-8.1.2/meson_options.txt
===================================================================
--- pve-qemu-kvm-8.1.2.orig/meson_options.txt
+++ pve-qemu-kvm-8.1.2/meson_options.txt
@@ -186,6 +186,8 @@ option('lzo', type : 'feature', value :
description: 'lzo compression support')
option('rbd', type : 'feature', value : 'auto',
description: 'Ceph block device driver')
+option('vitastor', type : 'feature', value : 'auto',
+ description: 'Vitastor block device driver')
option('opengl', type : 'feature', value : 'auto',
description: 'OpenGL support')
option('rdma', type : 'feature', value : 'auto',
Index: pve-qemu-kvm-8.1.2/qapi/block-core.json
===================================================================
--- pve-qemu-kvm-8.1.2.orig/qapi/block-core.json
+++ pve-qemu-kvm-8.1.2/qapi/block-core.json
@@ -3403,7 +3403,7 @@
'raw', 'rbd',
{ 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
'pbs',
- 'ssh', 'throttle', 'vdi', 'vhdx',
+ 'ssh', 'throttle', 'vdi', 'vhdx', 'vitastor',
{ 'name': 'virtio-blk-vfio-pci', 'if': 'CONFIG_BLKIO' },
{ 'name': 'virtio-blk-vhost-user', 'if': 'CONFIG_BLKIO' },
{ 'name': 'virtio-blk-vhost-vdpa', 'if': 'CONFIG_BLKIO' },
@@ -4465,6 +4465,28 @@
'*server': ['InetSocketAddressBase'] } }
##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host: etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config-path': 'str',
+ '*etcd-host': 'str',
+ '*etcd-prefix': 'str' } }
+
+##
# @ReplicationMode:
#
# An enumeration of replication modes.
@@ -4923,6 +4945,7 @@
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
'vhdx': 'BlockdevOptionsGenericFormat',
+ 'vitastor': 'BlockdevOptionsVitastor',
'virtio-blk-vfio-pci':
{ 'type': 'BlockdevOptionsVirtioBlkVfioPci',
'if': 'CONFIG_BLKIO' },
@@ -5360,6 +5383,17 @@
'*encrypt' : 'RbdEncryptionCreateOptions' } }
##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
+##
# @BlockdevVmdkSubformat:
#
# Subformat options for VMDK images
@@ -5581,6 +5615,7 @@
'ssh': 'BlockdevCreateOptionsSsh',
'vdi': 'BlockdevCreateOptionsVdi',
'vhdx': 'BlockdevCreateOptionsVhdx',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'vmdk': 'BlockdevCreateOptionsVmdk',
'vpc': 'BlockdevCreateOptionsVpc'
} }
Index: pve-qemu-kvm-8.1.2/scripts/ci/org.centos/stream/8/x86_64/configure
===================================================================
--- pve-qemu-kvm-8.1.2.orig/scripts/ci/org.centos/stream/8/x86_64/configure
+++ pve-qemu-kvm-8.1.2/scripts/ci/org.centos/stream/8/x86_64/configure
@@ -30,7 +30,7 @@
--with-suffix="qemu-kvm" \
--firmwarepath=/usr/share/qemu-firmware \
--target-list="x86_64-softmmu" \
---block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
+--block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,vitastor,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
--audio-drv-list="" \
--block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
--with-coroutine=ucontext \
@@ -176,6 +176,7 @@
--enable-opengl \
--enable-pie \
--enable-rbd \
+--enable-vitastor \
--enable-rdma \
--enable-seccomp \
--enable-snappy \
Index: pve-qemu-kvm-8.1.2/scripts/meson-buildoptions.sh
===================================================================
--- pve-qemu-kvm-8.1.2.orig/scripts/meson-buildoptions.sh
+++ pve-qemu-kvm-8.1.2/scripts/meson-buildoptions.sh
@@ -153,6 +153,7 @@ meson_options_help() {
printf "%s\n" ' qed qed image format support'
printf "%s\n" ' qga-vss build QGA VSS support (broken with MinGW)'
printf "%s\n" ' rbd Ceph block device driver'
+ printf "%s\n" ' vitastor Vitastor block device driver'
printf "%s\n" ' rdma Enable RDMA-based migration'
printf "%s\n" ' replication replication support'
printf "%s\n" ' sdl SDL user interface'
@@ -416,6 +417,8 @@ _meson_option_parse() {
--disable-qom-cast-debug) printf "%s" -Dqom_cast_debug=false ;;
--enable-rbd) printf "%s" -Drbd=enabled ;;
--disable-rbd) printf "%s" -Drbd=disabled ;;
+ --enable-vitastor) printf "%s" -Dvitastor=enabled ;;
+ --disable-vitastor) printf "%s" -Dvitastor=disabled ;;
--enable-rdma) printf "%s" -Drdma=enabled ;;
--disable-rdma) printf "%s" -Drdma=disabled ;;
--enable-replication) printf "%s" -Dreplication=enabled ;;


@@ -0,0 +1,190 @@
diff --git a/block/meson.build b/block/meson.build
index 529fc172c6..d542dc0609 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -110,6 +110,7 @@ foreach m : [
[libnfs, 'nfs', files('nfs.c')],
[libssh, 'ssh', files('ssh.c')],
[rbd, 'rbd', files('rbd.c')],
+ [vitastor, 'vitastor', files('vitastor.c')],
]
if m[0].found()
module_ss = ss.source_set()
diff --git a/meson.build b/meson.build
index a9c4f28247..8496cf13f1 100644
--- a/meson.build
+++ b/meson.build
@@ -1303,6 +1303,26 @@ if not get_option('rbd').auto() or have_block
endif
endif
+vitastor = not_found
+if not get_option('vitastor').auto() or have_block
+ libvitastor_client = cc.find_library('vitastor_client', has_headers: ['vitastor_c.h'],
+ required: get_option('vitastor'))
+ if libvitastor_client.found()
+ if cc.links('''
+ #include <vitastor_c.h>
+ int main(void) {
+ vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+ return 0;
+ }''', dependencies: libvitastor_client)
+ vitastor = declare_dependency(dependencies: libvitastor_client)
+ elif get_option('vitastor').enabled()
+ error('could not link libvitastor_client')
+ else
+ warning('could not link libvitastor_client, disabling')
+ endif
+ endif
+endif
+
glusterfs = not_found
glusterfs_ftruncate_has_stat = false
glusterfs_iocb_has_stat = false
@@ -2119,6 +2139,7 @@ if numa.found()
endif
config_host_data.set('CONFIG_OPENGL', opengl.found())
config_host_data.set('CONFIG_RBD', rbd.found())
+config_host_data.set('CONFIG_VITASTOR', vitastor.found())
config_host_data.set('CONFIG_RDMA', rdma.found())
config_host_data.set('CONFIG_SAFESTACK', get_option('safe_stack'))
config_host_data.set('CONFIG_SDL', sdl.found())
@@ -4286,6 +4307,7 @@ summary_info += {'fdt support': fdt_opt == 'disabled' ? false : fdt_opt}
summary_info += {'libcap-ng support': libcap_ng}
summary_info += {'bpf support': libbpf}
summary_info += {'rbd support': rbd}
+summary_info += {'vitastor support': vitastor}
summary_info += {'smartcard support': cacard}
summary_info += {'U2F support': u2f}
summary_info += {'libusb': libusb}
diff --git a/meson_options.txt b/meson_options.txt
index ae6d8f469d..e3d9f8404d 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -186,6 +186,8 @@ option('lzo', type : 'feature', value : 'auto',
description: 'lzo compression support')
option('rbd', type : 'feature', value : 'auto',
description: 'Ceph block device driver')
+option('vitastor', type : 'feature', value : 'auto',
+ description: 'Vitastor block device driver')
option('opengl', type : 'feature', value : 'auto',
description: 'OpenGL support')
option('rdma', type : 'feature', value : 'auto',
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 2b1d493d6e..90673fdbdc 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3146,7 +3146,7 @@
'parallels', 'preallocate', 'qcow', 'qcow2', 'qed', 'quorum',
'raw', 'rbd',
{ 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
- 'ssh', 'throttle', 'vdi', 'vhdx',
+ 'ssh', 'throttle', 'vdi', 'vhdx', 'vitastor',
{ 'name': 'virtio-blk-vfio-pci', 'if': 'CONFIG_BLKIO' },
{ 'name': 'virtio-blk-vhost-user', 'if': 'CONFIG_BLKIO' },
{ 'name': 'virtio-blk-vhost-vdpa', 'if': 'CONFIG_BLKIO' },
@@ -4196,6 +4196,28 @@
'*key-secret': 'str',
'*server': ['InetSocketAddressBase'] } }
+##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host: etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config-path': 'str',
+ '*etcd-host': 'str',
+ '*etcd-prefix': 'str' } }
+
##
# @ReplicationMode:
#
@@ -4654,6 +4676,7 @@
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
'vhdx': 'BlockdevOptionsGenericFormat',
+ 'vitastor': 'BlockdevOptionsVitastor',
'virtio-blk-vfio-pci':
{ 'type': 'BlockdevOptionsVirtioBlkVfioPci',
'if': 'CONFIG_BLKIO' },
@@ -5089,6 +5112,17 @@
'*cluster-size' : 'size',
'*encrypt' : 'RbdEncryptionCreateOptions' } }
+##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
##
# @BlockdevVmdkSubformat:
#
@@ -5311,6 +5345,7 @@
'ssh': 'BlockdevCreateOptionsSsh',
'vdi': 'BlockdevCreateOptionsVdi',
'vhdx': 'BlockdevCreateOptionsVhdx',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'vmdk': 'BlockdevCreateOptionsVmdk',
'vpc': 'BlockdevCreateOptionsVpc'
} }
diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure b/scripts/ci/org.centos/stream/8/x86_64/configure
index d02b09a4b9..f0b5fbfef3 100755
--- a/scripts/ci/org.centos/stream/8/x86_64/configure
+++ b/scripts/ci/org.centos/stream/8/x86_64/configure
@@ -30,7 +30,7 @@
--with-suffix="qemu-kvm" \
--firmwarepath=/usr/share/qemu-firmware \
--target-list="x86_64-softmmu" \
---block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
+--block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,vitastor,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
--audio-drv-list="" \
--block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
--with-coroutine=ucontext \
@@ -176,6 +176,7 @@
--enable-opengl \
--enable-pie \
--enable-rbd \
+--enable-vitastor \
--enable-rdma \
--enable-seccomp \
--enable-snappy \
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index d7020af175..94958eb6fa 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -153,6 +153,7 @@ meson_options_help() {
printf "%s\n" ' qed qed image format support'
printf "%s\n" ' qga-vss build QGA VSS support (broken with MinGW)'
printf "%s\n" ' rbd Ceph block device driver'
+ printf "%s\n" ' vitastor Vitastor block device driver'
printf "%s\n" ' rdma Enable RDMA-based migration'
printf "%s\n" ' replication replication support'
printf "%s\n" ' sdl SDL user interface'
@@ -416,6 +417,8 @@ _meson_option_parse() {
--disable-qom-cast-debug) printf "%s" -Dqom_cast_debug=false ;;
--enable-rbd) printf "%s" -Drbd=enabled ;;
--disable-rbd) printf "%s" -Drbd=disabled ;;
+ --enable-vitastor) printf "%s" -Dvitastor=enabled ;;
+ --disable-vitastor) printf "%s" -Dvitastor=disabled ;;
--enable-rdma) printf "%s" -Drdma=enabled ;;
--disable-rdma) printf "%s" -Drdma=disabled ;;
--enable-replication) printf "%s" -Dreplication=enabled ;;

pull_request_template.yml (new file)

@@ -0,0 +1,28 @@
name: Pull Request
about: Submit a pull request
body:
  - type: textarea
    id: description
    attributes:
      label: Description
      description: Describe your pull request
      placeholder: ""
      value: ""
    validations:
      required: true
  - type: input
    id: author
    attributes:
      label: Contributor Name
      description: Contributor Name or Company Details if the Contributor is a company
      placeholder: ""
    validations:
      required: false
  - type: checkboxes
    id: terms
    attributes:
      label: CLA
      description: By submitting this pull request, I accept [Vitastor CLA](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md)
      options:
        - label: "I accept Vitastor CLA agreement: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md"
          required: true


@@ -24,4 +24,4 @@ rm fio
mv fio-copy fio
FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
-tar --transform 's#^#vitastor-1.2.0/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-1.2.0$(rpm --eval '%dist').tar.gz *
+tar --transform 's#^#vitastor-1.4.8/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-1.4.8$(rpm --eval '%dist').tar.gz *


@@ -15,6 +15,7 @@ RUN yumdownloader --disablerepo=centos-sclo-rh --source fio
RUN rpm --nomd5 -i fio*.src.rpm
RUN rm -f /etc/yum.repos.d/CentOS-Media.repo
RUN cd ~/rpmbuild/SPECS && yum-builddep -y fio.spec
+RUN yum -y install cmake3
ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root
@@ -35,7 +36,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
-cp /root/vitastor-1.2.0.el7.tar.gz ~/rpmbuild/SOURCES; \
+cp /root/vitastor-1.4.8.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \


@@ -1,11 +1,11 @@
Name: vitastor
-Version: 1.2.0
+Version: 1.4.8
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
-Source0: vitastor-1.2.0.el7.tar.gz
+Source0: vitastor-1.4.8.el7.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
@@ -16,7 +16,7 @@ BuildRequires: jerasure-devel
BuildRequires: libisa-l-devel
BuildRequires: gf-complete-devel
BuildRequires: libibverbs-devel
-BuildRequires: cmake
+BuildRequires: cmake3
Requires: vitastor-osd = %{version}-%{release}
Requires: vitastor-mon = %{version}-%{release}
Requires: vitastor-client = %{version}-%{release}
@@ -94,7 +94,7 @@ Vitastor fio drivers for benchmarking.
%build
. /opt/rh/devtoolset-9/enable
-%cmake .
+%cmake3 .
%make_build


@@ -35,7 +35,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
-cp /root/vitastor-1.2.0.el8.tar.gz ~/rpmbuild/SOURCES; \
+cp /root/vitastor-1.4.8.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \


@@ -1,11 +1,11 @@
Name: vitastor
-Version: 1.2.0
+Version: 1.4.8
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
-Source0: vitastor-1.2.0.el8.tar.gz
+Source0: vitastor-1.4.8.el8.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel


@@ -18,7 +18,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
-cp /root/vitastor-1.2.0.el9.tar.gz ~/rpmbuild/SOURCES; \
+cp /root/vitastor-1.4.8.el9.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el9.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \


@@ -1,11 +1,11 @@
Name: vitastor
-Version: 1.2.0
+Version: 1.4.8
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
-Source0: vitastor-1.2.0.el9.tar.gz
+Source0: vitastor-1.4.8.el9.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel


@@ -16,8 +16,8 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
endif()
-add_definitions(-DVERSION="1.2.0")
-add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -fno-omit-frame-pointer -I ${CMAKE_SOURCE_DIR}/src)
+add_definitions(-DVERSION="1.4.8")
+add_definitions(-D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -fno-omit-frame-pointer -I ${CMAKE_SOURCE_DIR}/src)
add_link_options(-fno-omit-frame-pointer)
if (${WITH_ASAN})
add_definitions(-fsanitize=address)
@@ -181,25 +181,6 @@ target_link_libraries(vitastor-nbd
vitastor_client
)
-# vitastor-kv
-add_executable(vitastor-kv
-kv_cli.cpp
-kv_db.cpp
-kv_db.h
-)
-target_link_libraries(vitastor-kv
-vitastor_client
-)
-add_executable(vitastor-kv-stress
-kv_stress.cpp
-kv_db.cpp
-kv_db.h
-)
-target_link_libraries(vitastor-kv-stress
-vitastor_client
-)
# vitastor-nfs
add_executable(vitastor-nfs
nfs_proxy.cpp


@@ -8,6 +8,7 @@
#include <stdio.h>
#include <stdexcept>
+#include <set>
#include "addr_util.h"
@@ -135,7 +136,7 @@ std::vector<std::string> getifaddr_list(std::vector<std::string> mask_cfg, bool
throw std::runtime_error((include_v6 ? "Invalid IPv4 address mask: " : "Invalid IP address mask: ") + mask);
}
}
-std::vector<std::string> addresses;
+std::set<std::string> addresses;
ifaddrs *list, *ifa;
if (getifaddrs(&list) == -1)
{
@@ -149,7 +150,8 @@ std::vector<std::string> getifaddr_list(std::vector<std::string> mask_cfg, bool
}
int family = ifa->ifa_addr->sa_family;
if ((family == AF_INET || family == AF_INET6 && include_v6) &&
-(ifa->ifa_flags & (IFF_UP | IFF_RUNNING | IFF_LOOPBACK)) == (IFF_UP | IFF_RUNNING))
+// Do not skip loopback addresses if the address filter is specified
+(ifa->ifa_flags & (IFF_UP | IFF_RUNNING | (masks.size() ? 0 : IFF_LOOPBACK))) == (IFF_UP | IFF_RUNNING))
{
void *addr_ptr;
if (family == AF_INET)
@@ -182,11 +184,11 @@ std::vector<std::string> getifaddr_list(std::vector<std::string> mask_cfg, bool
{
throw std::runtime_error(std::string("inet_ntop: ") + strerror(errno));
}
-addresses.push_back(std::string(addr));
+addresses.insert(std::string(addr));
}
}
freeifaddrs(list);
-return addresses;
+return std::vector<std::string>(addresses.begin(), addresses.end());
}
int create_and_bind_socket(std::string bind_address, int bind_port, int listen_backlog, int *listening_port)
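
Reviewer note on the getifaddr_list() hunks above: switching the accumulator from std::vector to std::set means an address reported more than once is returned only once, and the returned list comes back in sorted order. The following is a minimal standalone sketch of that pattern; the helper name dedup_addresses is hypothetical and not part of Vitastor.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not in the Vitastor sources): collect address strings
// into a std::set and copy them back into a vector, as the patched
// getifaddr_list() does with its local `addresses` variable.
std::vector<std::string> dedup_addresses(const std::vector<std::string> & found)
{
    std::set<std::string> addresses;
    for (const auto & addr: found)
    {
        // Inserting a value that is already present leaves the set unchanged
        addresses.insert(addr);
    }
    // std::set iterates in ascending order, so the result is sorted and unique
    return std::vector<std::string>(addresses.begin(), addresses.end());
}
```

One consequence worth keeping in mind when reading the patch: callers can no longer rely on the order in which getifaddrs() reported the interfaces.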


@@ -108,6 +108,10 @@ void blockstore_disk_t::parse_config(std::map<std::string, std::string> & config
{
throw std::runtime_error("journal_block_size must be a multiple of "+std::to_string(DIRECT_IO_ALIGNMENT));
}
+else if (journal_block_size > MAX_DATA_BLOCK_SIZE)
+{
+throw std::runtime_error("journal_block_size must not exceed "+std::to_string(MAX_DATA_BLOCK_SIZE));
+}
if (!meta_block_size)
{
meta_block_size = 4096;
@@ -116,6 +120,10 @@ void blockstore_disk_t::parse_config(std::map<std::string, std::string> & config
{
throw std::runtime_error("meta_block_size must be a multiple of "+std::to_string(DIRECT_IO_ALIGNMENT));
}
+else if (meta_block_size > MAX_DATA_BLOCK_SIZE)
+{
+throw std::runtime_error("meta_block_size must not exceed "+std::to_string(MAX_DATA_BLOCK_SIZE));
+}
if (data_offset % disk_alignment)
{
throw std::runtime_error("data_offset must be a multiple of disk_alignment = "+std::to_string(disk_alignment));


@@ -19,7 +19,6 @@ journal_flusher_t::journal_flusher_t(blockstore_impl_t *bs)
syncing_flushers = 0;
// FIXME: allow to configure flusher_start_threshold and journal_trim_interval
flusher_start_threshold = bs->dsk.journal_block_size / sizeof(journal_entry_stable);
-journal_trim_interval = 512;
journal_trim_counter = bs->journal.flush_journal ? 1 : 0;
trim_wanted = bs->journal.flush_journal ? 1 : 0;
journal_superblock = bs->journal.inmemory ? bs->journal.buffer : memalign_or_die(MEM_ALIGNMENT, bs->dsk.journal_block_size);
@@ -94,7 +93,7 @@ void journal_flusher_t::loop()
void journal_flusher_t::enqueue_flush(obj_ver_id ov)
{
#ifdef BLOCKSTORE_DEBUG
-printf("enqueue_flush %lx:%lx v%lu\n", ov.oid.inode, ov.oid.stripe, ov.version);
+printf("enqueue_flush %jx:%jx v%ju\n", ov.oid.inode, ov.oid.stripe, ov.version);
#endif
auto it = flush_versions.find(ov.oid);
if (it != flush_versions.end())
@@ -117,7 +116,7 @@ void journal_flusher_t::enqueue_flush(obj_ver_id ov)
void journal_flusher_t::unshift_flush(obj_ver_id ov, bool force)
{
#ifdef BLOCKSTORE_DEBUG
-printf("unshift_flush %lx:%lx v%lu\n", ov.oid.inode, ov.oid.stripe, ov.version);
+printf("unshift_flush %jx:%jx v%ju\n", ov.oid.inode, ov.oid.stripe, ov.version);
#endif
auto it = flush_versions.find(ov.oid);
if (it != flush_versions.end())
@@ -143,7 +142,7 @@ void journal_flusher_t::unshift_flush(obj_ver_id ov, bool force)
void journal_flusher_t::remove_flush(object_id oid)
{
#ifdef BLOCKSTORE_DEBUG
-printf("undo_flush %lx:%lx\n", oid.inode, oid.stripe);
+printf("undo_flush %jx:%jx\n", oid.inode, oid.stripe);
#endif
auto v_it = flush_versions.find(oid);
if (v_it != flush_versions.end())
@@ -184,8 +183,7 @@ void journal_flusher_t::mark_trim_possible()
if (trim_wanted > 0)
{
dequeuing = true;
-if (!journal_trim_counter)
-journal_trim_counter = journal_trim_interval;
+journal_trim_counter = 0;
bs->ringloop->wakeup();
}
}
@@ -235,7 +233,7 @@ void journal_flusher_t::dump_diagnostics()
break;
}
printf(
-"Flusher: queued=%ld first=%s%lx:%lx trim_wanted=%d dequeuing=%d trimming=%d cur=%d target=%d active=%d syncing=%d\n",
+"Flusher: queued=%zd first=%s%jx:%jx trim_wanted=%d dequeuing=%d trimming=%d cur=%d target=%d active=%d syncing=%d\n",
flush_queue.size(), unflushable_type, unflushable.oid.inode, unflushable.oid.stripe,
trim_wanted, dequeuing, trimming, cur_flusher_count, target_flusher_count,
active_flushers, syncing_flushers
@@ -268,7 +266,7 @@ bool journal_flusher_t::try_find_other(std::map<obj_ver_id, dirty_entry>::iterat
{
int search_left = flush_queue.size() - 1;
#ifdef BLOCKSTORE_DEBUG
-printf("Flusher overran writers (%lx:%lx v%lu, dirty_start=%08lx) - searching for older flushes (%d left)\n",
+printf("Flusher overran writers (%jx:%jx v%ju, dirty_start=%08jx) - searching for older flushes (%d left)\n",
cur.oid.inode, cur.oid.stripe, cur.version, bs->journal.dirty_start, search_left);
#endif
while (search_left > 0)
@@ -285,7 +283,7 @@ bool journal_flusher_t::try_find_other(std::map<obj_ver_id, dirty_entry>::iterat
dirty_end->second.journal_sector < bs->journal.used_start))
{
#ifdef BLOCKSTORE_DEBUG
-printf("Write %lx:%lx v%lu is too new: offset=%08lx\n", cur.oid.inode, cur.oid.stripe, cur.version, dirty_end->second.journal_sector);
+printf("Write %jx:%jx v%ju is too new: offset=%08jx\n", cur.oid.inode, cur.oid.stripe, cur.version, dirty_end->second.journal_sector);
#endif
enqueue_flush(cur);
}
@@ -366,9 +364,10 @@ resume_0:
!flusher->flush_queue.size() || !flusher->dequeuing)
{
stop_flusher:
-if (flusher->trim_wanted > 0 && flusher->journal_trim_counter > 0)
+if (flusher->trim_wanted > 0 && cur.oid.inode != 0)
{
// Attempt forced trim
+cur.oid = {};
flusher->active_flushers++;
goto trim_journal;
}
@@ -387,7 +386,7 @@ stop_flusher:
if (repeat_it != flusher->sync_to_repeat.end())
{
#ifdef BLOCKSTORE_DEBUG
-printf("Postpone %lx:%lx v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
+printf("Postpone %jx:%jx v%ju\n", cur.oid.inode, cur.oid.stripe, cur.version);
#endif
// We don't flush different parts of history of the same object in parallel
// So we check if someone is already flushing this object
@@ -416,12 +415,13 @@ stop_flusher:
flusher->sync_to_repeat.erase(cur.oid);
if (!flusher->try_find_other(dirty_end, cur))
{
+cur.oid = {};
goto stop_flusher;
}
}
}
#ifdef BLOCKSTORE_DEBUG
-printf("Flushing %lx:%lx v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
+printf("Flushing %jx:%jx v%ju\n", cur.oid.inode, cur.oid.stripe, cur.version);
#endif
flusher->active_flushers++;
// Find it in clean_db
@@ -448,7 +448,7 @@ stop_flusher:
// Object not allocated. This is a bug.
char err[1024];
snprintf(
-err, 1024, "BUG: Object %lx:%lx v%lu that we are trying to flush is not allocated on the data device",
+err, 1024, "BUG: Object %jx:%jx v%ju that we are trying to flush is not allocated on the data device",
cur.oid.inode, cur.oid.stripe, cur.version
);
throw std::runtime_error(err);
@@ -538,7 +538,7 @@ resume_2:
clean_disk_entry *old_entry = (clean_disk_entry*)((uint8_t*)meta_old.buf + meta_old.pos*bs->dsk.clean_entry_size);
if (old_entry->oid.inode != 0 && old_entry->oid != cur.oid)
{
-printf("Fatal error (metadata corruption or bug): tried to wipe metadata entry %lu (%lx:%lx v%lu) as old location of %lx:%lx\n",
+printf("Fatal error (metadata corruption or bug): tried to wipe metadata entry %ju (%jx:%jx v%ju) as old location of %jx:%jx\n",
old_clean_loc >> bs->dsk.block_order, old_entry->oid.inode, old_entry->oid.stripe,
old_entry->version, cur.oid.inode, cur.oid.stripe);
exit(1);
@@ -571,7 +571,7 @@ resume_2:
// Erase dirty_db entries
bs->erase_dirty(dirty_start, std::next(dirty_end), clean_loc);
#ifdef BLOCKSTORE_DEBUG
-printf("Flushed %lx:%lx v%lu (%d copies, wr:%d, del:%d), %ld left\n", cur.oid.inode, cur.oid.stripe, cur.version,
+printf("Flushed %jx:%jx v%ju (%d copies, wr:%d, del:%d), %jd left\n", cur.oid.inode, cur.oid.stripe, cur.version,
copy_count, has_writes, has_delete, flusher->flush_queue.size());
#endif
release_oid:
@@ -584,7 +584,8 @@ resume_2:
flusher->sync_to_repeat.erase(repeat_it);
trim_journal:
// Clear unused part of the journal every <journal_trim_interval> flushes
-if (!((++flusher->journal_trim_counter) % flusher->journal_trim_interval) || flusher->trim_wanted > 0)
+if (bs->journal_trim_interval && !((++flusher->journal_trim_counter) % bs->journal_trim_interval) ||
+flusher->trim_wanted > 0)
{
resume_26:
resume_27:
@@ -609,8 +610,8 @@ void journal_flusher_co::update_metadata_entry()
{
printf(
has_delete
-? "Fatal error (metadata corruption or bug): tried to delete metadata entry %lu (%lx:%lx v%lu) while deleting %lx:%lx v%lu\n"
-: "Fatal error (metadata corruption or bug): tried to overwrite non-zero metadata entry %lu (%lx:%lx v%lu) with %lx:%lx v%lu\n",
+? "Fatal error (metadata corruption or bug): tried to delete metadata entry %ju (%jx:%jx v%ju) while deleting %jx:%jx v%ju\n"
+: "Fatal error (metadata corruption or bug): tried to overwrite non-zero metadata entry %ju (%jx:%jx v%ju) with %jx:%jx v%ju\n",
clean_loc >> bs->dsk.block_order, new_entry->oid.inode, new_entry->oid.stripe,
new_entry->version, cur.oid.inode, cur.oid.stripe, cur.version
);
@@ -710,7 +711,7 @@ bool journal_flusher_co::write_meta_block(flusher_meta_write_t & meta_block, int
if (wait_state == wait_base)
goto resume_0;
await_sqe(0);
-data->iov = (struct iovec){ meta_block.buf, bs->dsk.meta_block_size };
+data->iov = (struct iovec){ meta_block.buf, (size_t)bs->dsk.meta_block_size };
data->callback = simple_callback_w;
my_uring_prep_writev(
sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + bs->dsk.meta_block_size + meta_block.sector
@@ -760,7 +761,7 @@ bool journal_flusher_co::clear_incomplete_csum_block_bits(int wait_base)
{
// If we encounter bad checksums during flush, we still update the bad block,
// but intentionally mangle checksums to avoid hiding the corruption.
-iovec iov = { .iov_base = v[i].buf, .iov_len = v[i].len };
+iovec iov = { .iov_base = v[i].buf, .iov_len = (size_t)v[i].len };
if (!(v[i].copy_flags & COPY_BUF_JOURNAL))
{
assert(!(v[i].offset % bs->dsk.csum_block_size));
@@ -768,7 +769,7 @@ bool journal_flusher_co::clear_incomplete_csum_block_bits(int wait_base)
bs->verify_padded_checksums(new_clean_bitmap, new_clean_bitmap + 2*bs->dsk.clean_entry_bitmap_size,
v[i].offset, &iov, 1, [&](uint32_t bad_block, uint32_t calc_csum, uint32_t stored_csum)
{
-printf("Checksum mismatch in object %lx:%lx v%lu in data area at offset 0x%lx+0x%x: got %08x, expected %08x\n",
+printf("Checksum mismatch in object %jx:%jx v%ju in data area at offset 0x%jx+0x%x: got %08x, expected %08x\n",
cur.oid.inode, cur.oid.stripe, old_clean_ver, old_clean_loc, bad_block, calc_csum, stored_csum);
for (uint32_t j = 0; j < bs->dsk.csum_block_size; j += bs->dsk.bitmap_granularity)
{
@@ -781,7 +782,7 @@ bool journal_flusher_co::clear_incomplete_csum_block_bits(int wait_base)
{
bs->verify_journal_checksums(v[i].csum_buf, v[i].offset, &iov, 1, [&](uint32_t bad_block, uint32_t calc_csum, uint32_t stored_csum)
{
-printf("Checksum mismatch in object %lx:%lx v%lu in journal at offset 0x%lx+0x%x (block offset 0x%lx): got %08x, expected %08x\n",
+printf("Checksum mismatch in object %jx:%jx v%ju in journal at offset 0x%jx+0x%x (block offset 0x%jx): got %08x, expected %08x\n",
cur.oid.inode, cur.oid.stripe, old_clean_ver,
v[i].disk_offset, bad_block, v[i].offset, calc_csum, stored_csum);
bad_block += (v[i].offset/bs->dsk.csum_block_size) * bs->dsk.csum_block_size;
@@ -805,7 +806,7 @@ bool journal_flusher_co::clear_incomplete_csum_block_bits(int wait_base)
if (new_entry->oid != cur.oid)
{
printf(
-"Fatal error (metadata corruption or bug): tried to make holes in %lu (%lx:%lx v%lu) with %lx:%lx v%lu\n",
+"Fatal error (metadata corruption or bug): tried to make holes in %ju (%jx:%jx v%ju) with %jx:%jx v%ju\n",
clean_loc >> bs->dsk.block_order, new_entry->oid.inode, new_entry->oid.stripe,
new_entry->version, cur.oid.inode, cur.oid.stripe, cur.version
);
@@ -925,7 +926,7 @@ void journal_flusher_co::scan_dirty()
{
char err[1024];
snprintf(
-err, 1024, "BUG: Unexpected dirty_entry %lx:%lx v%lu unstable state during flush: 0x%x",
+err, 1024, "BUG: Unexpected dirty_entry %jx:%jx v%ju unstable state during flush: 0x%x",
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version, dirty_it->second.state
);
throw std::runtime_error(err);
@@ -1021,7 +1022,7 @@ void journal_flusher_co::scan_dirty()
// May happen if the metadata entry is corrupt, but journal isn't
// FIXME: Report corrupted object to the upper layer (OSD)
printf(
-"Warning: object %lx:%lx has overwrites, but doesn't have a clean version."
+"Warning: object %jx:%jx has overwrites, but doesn't have a clean version."
" Metadata is likely corrupted. Dropping object from the DB.\n",
cur.oid.inode, cur.oid.stripe
);
@@ -1056,7 +1057,7 @@ void journal_flusher_co::scan_dirty()
flusher->enqueue_flush(cur);
cur.version = dirty_end->first.version;
#ifdef BLOCKSTORE_DEBUG
-printf("Partial checksum block overwrites found - rewinding flush back to %lx:%lx v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
+printf("Partial checksum block overwrites found - rewinding flush back to %jx:%jx v%ju\n", cur.oid.inode, cur.oid.stripe, cur.version);
#endif
v.clear();
copy_count = 0;
@@ -1084,7 +1085,7 @@ bool journal_flusher_co::read_dirty(int wait_base)
auto & vi = v[v.size()-i];
assert(vi.len != 0);
vi.buf = memalign_or_die(MEM_ALIGNMENT, vi.len);
-data->iov = (struct iovec){ vi.buf, vi.len };
+data->iov = (struct iovec){ vi.buf, (size_t)vi.len };
data->callback = simple_callback_r;
my_uring_prep_readv(
sqe, bs->dsk.data_fd, &data->iov, 1, bs->dsk.data_offset + old_clean_loc + vi.offset
@@ -1208,7 +1209,7 @@ bool journal_flusher_co::modify_meta_read(uint64_t meta_loc, flusher_meta_write_
.usage_count = 1,
}).first;
await_sqe(0);
-data->iov = (struct iovec){ wr.it->second.buf, bs->dsk.meta_block_size };
+data->iov = (struct iovec){ wr.it->second.buf, (size_t)bs->dsk.meta_block_size };
data->callback = simple_callback_r;
wr.submitted = true;
my_uring_prep_readv(
@@ -1247,7 +1248,7 @@ void journal_flusher_co::free_data_blocks()
auto uo_it = bs->used_clean_objects.find(old_clean_loc);
bool used = uo_it != bs->used_clean_objects.end();
#ifdef BLOCKSTORE_DEBUG
-printf("%s block %lu from %lx:%lx v%lu (new location is %lu)\n",
+printf("%s block %ju from %jx:%jx v%ju (new location is %ju)\n",
used ? "Postpone free" : "Free",
old_clean_loc >> bs->dsk.block_order,
cur.oid.inode, cur.oid.stripe, cur.version,
@@ -1264,7 +1265,7 @@ void journal_flusher_co::free_data_blocks()
auto uo_it = bs->used_clean_objects.find(old_clean_loc);
bool used = uo_it != bs->used_clean_objects.end();
#ifdef BLOCKSTORE_DEBUG
-printf("%s block %lu from %lx:%lx v%lu (delete)\n",
+printf("%s block %ju from %jx:%jx v%ju (delete)\n",
used ? "Postpone free" : "Free",
old_clean_loc >> bs->dsk.block_order,
cur.oid.inode, cur.oid.stripe, cur.version);
@@ -1346,7 +1347,6 @@ bool journal_flusher_co::trim_journal(int wait_base)
else if (wait_state == wait_base+2) goto resume_2;
else if (wait_state == wait_base+3) goto resume_3;
else if (wait_state == wait_base+4) goto resume_4;
flusher->journal_trim_counter = 0;
new_trim_pos = bs->journal.get_trim_pos();
if (new_trim_pos != bs->journal.used_start)
{
@@ -1378,7 +1378,7 @@ bool journal_flusher_co::trim_journal(int wait_base)
.csum_block_size = bs->dsk.csum_block_size,
};
((journal_entry_start*)flusher->journal_superblock)->crc32 = je_crc32((journal_entry*)flusher->journal_superblock);
data->iov = (struct iovec){ flusher->journal_superblock, bs->dsk.journal_block_size };
data->iov = (struct iovec){ flusher->journal_superblock, (size_t)bs->dsk.journal_block_size };
data->callback = simple_callback_w;
my_uring_prep_writev(sqe, bs->dsk.journal_fd, &data->iov, 1, bs->journal.offset);
wait_count++;
@@ -1410,7 +1410,7 @@ bool journal_flusher_co::trim_journal(int wait_base)
}
bs->journal.used_start = new_trim_pos;
#ifdef BLOCKSTORE_DEBUG
printf("Journal trimmed to %08lx (next_free=%08lx dirty_start=%08lx)\n", bs->journal.used_start, bs->journal.next_free, bs->journal.dirty_start);
printf("Journal trimmed to %08jx (next_free=%08jx dirty_start=%08jx)\n", bs->journal.used_start, bs->journal.next_free, bs->journal.dirty_start);
#endif
if (bs->journal.flush_journal && !flusher->flush_queue.size())
{
@@ -1419,6 +1419,7 @@ bool journal_flusher_co::trim_journal(int wait_base)
exit(0);
}
}
flusher->journal_trim_counter = 0;
flusher->trimming = false;
}
return true;
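
Note on the format string changes above: the hunks replace %lu/%lx with %ju/%jx wherever a uint64_t is printed, which matches the "Fix 32-bit build warnings" item in the 1.4.7 release notes. The sketch below shows the same idea in isolation. The helper names are hypothetical and not part of Vitastor, and the explicit uintmax_t casts are an extra precaution added here; the diff itself passes the values uncast.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper (not from the Vitastor sources): formats an object id
// and version in the "inode:stripe vN" style used by the debug messages.
std::string format_obj_ver(uint64_t inode, uint64_t stripe, uint64_t version)
{
    char buf[64];
    // %jx and %ju expect uintmax_t, so the arguments are cast explicitly.
    // %lx would be wrong on targets where long is 32 bits wide.
    snprintf(buf, sizeof(buf), "%jx:%jx v%ju",
        (uintmax_t)inode, (uintmax_t)stripe, (uintmax_t)version);
    return std::string(buf);
}

// The same output using the fixed-width macros from <cinttypes>,
// which need no casts because they are defined for uint64_t itself.
std::string format_obj_ver_pri(uint64_t inode, uint64_t stripe, uint64_t version)
{
    char buf[64];
    snprintf(buf, sizeof(buf), "%" PRIx64 ":%" PRIx64 " v%" PRIu64,
        inode, stripe, version);
    return std::string(buf);
}
```

Both variants should give identical text on 32-bit and 64-bit builds; the PRI macros are wordier but tie the format to the exact argument type.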


@@ -107,7 +107,7 @@ class journal_flusher_t
blockstore_impl_t *bs;
friend class journal_flusher_co;
int journal_trim_counter, journal_trim_interval;
int journal_trim_counter;
bool trimming;
void* journal_superblock;


@@ -163,20 +163,10 @@ void blockstore_impl_t::loop()
}
else if (op->opcode == BS_OP_SYNC)
{
// wait for all small writes to be submitted
// wait for all big writes to complete, submit data device fsync
// sync only completed writes?
// wait for the data device fsync to complete, then submit journal writes for big writes
// then submit an fsync operation
if (has_writes)
{
// Can't submit SYNC before previous writes
continue;
}
wr_st = continue_sync(op);
if (wr_st != 2)
{
has_writes = wr_st > 0 ? 1 : 2;
}
}
else if (op->opcode == BS_OP_STABLE)
{
@@ -205,6 +195,10 @@ void blockstore_impl_t::loop()
// ring is full, stop submission
break;
}
else if (PRIV(op)->wait_for == WAIT_JOURNAL)
{
PRIV(op)->wait_detail2 = (unstable_writes.size()+unstable_unsynced);
}
}
}
if (op_idx != new_idx)
@@ -275,7 +269,7 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
{
// stop submission if there's still no free space
#ifdef BLOCKSTORE_DEBUG
printf("Still waiting for %lu SQE(s)\n", PRIV(op)->wait_detail);
printf("Still waiting for %ju SQE(s)\n", PRIV(op)->wait_detail);
#endif
return;
}
@@ -283,11 +277,12 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
}
else if (PRIV(op)->wait_for == WAIT_JOURNAL)
{
if (journal.used_start == PRIV(op)->wait_detail)
if (journal.used_start == PRIV(op)->wait_detail &&
(unstable_writes.size()+unstable_unsynced) == PRIV(op)->wait_detail2)
{
// do not submit
#ifdef BLOCKSTORE_DEBUG
printf("Still waiting to flush journal offset %08lx\n", PRIV(op)->wait_detail);
printf("Still waiting to flush journal offset %08jx\n", PRIV(op)->wait_detail);
#endif
return;
}
@@ -558,13 +553,14 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
if (stable_count >= stable_alloc)
{
stable_alloc *= 2;
stable = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
if (!stable)
obj_ver_id* nst = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
if (!nst)
{
op->retval = -ENOMEM;
FINISH_OP(op);
return;
}
stable = nst;
}
stable[stable_count++] = {
.oid = clean_it->first,
@@ -642,8 +638,8 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
if (stable_count >= stable_alloc)
{
stable_alloc += 32768;
stable = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
if (!stable)
obj_ver_id *nst = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
if (!nst)
{
if (unstable)
free(unstable);
@@ -651,6 +647,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
FINISH_OP(op);
return;
}
stable = nst;
}
stable[stable_count++] = dirty_it->first;
}
@@ -666,8 +663,8 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
if (unstable_count >= unstable_alloc)
{
unstable_alloc += 32768;
unstable = (obj_ver_id*)realloc(unstable, sizeof(obj_ver_id) * unstable_alloc);
if (!unstable)
obj_ver_id *nst = (obj_ver_id*)realloc(unstable, sizeof(obj_ver_id) * unstable_alloc);
if (!nst)
{
if (stable)
free(stable);
@@ -675,6 +672,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
FINISH_OP(op);
return;
}
unstable = nst;
}
unstable[unstable_count++] = dirty_it->first;
}
@@ -694,8 +692,8 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
if (stable_count+unstable_count > stable_alloc)
{
stable_alloc = stable_count+unstable_count;
stable = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
if (!stable)
obj_ver_id *nst = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
if (!nst)
{
if (unstable)
free(unstable);
@@ -703,6 +701,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
FINISH_OP(op);
return;
}
stable = nst;
}
// Copy unstable entries
for (int i = 0; i < unstable_count; i++)
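
Note on the realloc changes in process_list() above: realloc() leaves the old block allocated when it fails, so assigning its result straight back to the only pointer loses that block. The hunks avoid this by storing the result in a temporary (nst) first. A minimal sketch of the pattern, using a hypothetical helper that is not part of the diff:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: grows *arr to hold new_count elements of elem_size bytes.
// Returns true on success. On failure *arr is left untouched, so the caller
// still owns the old block and can free it or keep using it.
bool grow_array(void **arr, size_t elem_size, size_t new_count)
{
    void *grown = realloc(*arr, elem_size * new_count);
    if (!grown)
    {
        // The old block is still valid here; do not overwrite *arr
        return false;
    }
    *arr = grown;
    return true;
}
```

The sketch does not guard against overflow in elem_size * new_count; a production version would probably want that check as well.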


@@ -55,6 +55,7 @@
#define IS_JOURNAL(st) (((st) & 0x0F) == BS_ST_SMALL_WRITE)
#define IS_BIG_WRITE(st) (((st) & 0x0F) == BS_ST_BIG_WRITE)
#define IS_DELETE(st) (((st) & 0x0F) == BS_ST_DELETE)
#define IS_INSTANT(st) (((st) & BS_ST_TYPE_MASK) == BS_ST_DELETE || ((st) & BS_ST_INSTANT))
#define BS_SUBMIT_CHECK_SQES(n) \
if (ringloop->sqes_left() < (n))\
@@ -201,7 +202,7 @@ struct blockstore_op_private_t
{
// Wait status
int wait_for;
uint64_t wait_detail;
uint64_t wait_detail, wait_detail2;
int pending_ops;
int op_state;
@@ -252,6 +253,7 @@ class blockstore_impl_t
bool inmemory_meta = false;
// Maximum and minimum flusher count
unsigned max_flusher_count, min_flusher_count;
unsigned journal_trim_interval;
// Maximum queue depth
unsigned max_write_iodepth = 128;
// Enable small (journaled) write throttling, useful for the SSD+HDD case
@@ -277,6 +279,7 @@ class blockstore_impl_t
int unsynced_big_write_count = 0, unstable_unsynced = 0;
int unsynced_queued_ops = 0;
allocator *data_alloc = NULL;
uint64_t used_blocks = 0;
uint8_t *zero_object;
void *metadata_buffer = NULL;
@@ -376,7 +379,7 @@ class blockstore_impl_t
// Stabilize
int dequeue_stable(blockstore_op_t *op);
int continue_stable(blockstore_op_t *op);
void mark_stable(const obj_ver_id & ov, bool forget_dirty = false);
void mark_stable(obj_ver_id ov, bool forget_dirty = false);
void stabilize_object(object_id oid, uint64_t max_ver);
blockstore_op_t* selective_sync(blockstore_op_t *op);
int split_stab_op(blockstore_op_t *op, std::function<int(obj_ver_id v)> decider);
@@ -430,7 +433,7 @@ public:
inline uint32_t get_block_size() { return dsk.data_block_size; }
inline uint64_t get_block_count() { return dsk.block_count; }
inline uint64_t get_free_block_count() { return data_alloc->get_free_count(); }
inline uint64_t get_free_block_count() { return dsk.block_count - used_blocks; }
inline uint32_t get_bitmap_granularity() { return dsk.disk_alignment; }
inline uint64_t get_journal_size() { return dsk.journal_len; }
};


@@ -63,7 +63,7 @@ int blockstore_init_meta::loop()
throw std::runtime_error("Failed to allocate metadata read buffer");
// Read superblock
GET_SQE();
data->iov = { metadata_buffer, bs->dsk.meta_block_size };
data->iov = { metadata_buffer, (size_t)bs->dsk.meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
my_uring_prep_readv(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset);
bs->ringloop->submit();
@@ -100,7 +100,7 @@ resume_1:
{
printf("Initializing metadata area\n");
GET_SQE();
data->iov = (struct iovec){ metadata_buffer, bs->dsk.meta_block_size };
data->iov = (struct iovec){ metadata_buffer, (size_t)bs->dsk.meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
my_uring_prep_writev(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset);
bs->ringloop->submit();
@@ -153,7 +153,7 @@ resume_1:
else if (hdr->version > BLOCKSTORE_META_FORMAT_V2)
{
printf(
"Metadata format is too new for me (stored version is %lu, max supported %u).\n",
"Metadata format is too new for me (stored version is %ju, max supported %u).\n",
hdr->version, BLOCKSTORE_META_FORMAT_V2
);
exit(1);
@@ -167,7 +167,7 @@ resume_1:
printf(
"Configuration stored in metadata superblock"
" (meta_block_size=%u, data_block_size=%u, bitmap_granularity=%u, data_csum_type=%u, csum_block_size=%u)"
" differs from OSD configuration (%lu/%u/%lu, %u/%u).\n",
" differs from OSD configuration (%ju/%u/%ju, %u/%u).\n",
hdr->meta_block_size, hdr->data_block_size, hdr->bitmap_granularity,
hdr->data_csum_type, hdr->csum_block_size,
bs->dsk.meta_block_size, bs->dsk.data_block_size, bs->dsk.bitmap_granularity,
@@ -199,7 +199,8 @@ resume_2:
submitted++;
next_offset += bufs[i].size;
GET_SQE();
data->iov = { bufs[i].buf, bufs[i].size };
assert(bufs[i].size <= 0x7fffffff);
data->iov = { bufs[i].buf, (size_t)bufs[i].size };
data->callback = [this, i](ring_data_t *data) { handle_event(data, i); };
if (!zero_on_init)
my_uring_prep_readv(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + bufs[i].offset);
@@ -231,7 +232,8 @@ resume_2:
{
// write the modified buffer back
GET_SQE();
data->iov = { bufs[i].buf, bufs[i].size };
assert(bufs[i].size <= 0x7fffffff);
data->iov = { bufs[i].buf, (size_t)bufs[i].size };
data->callback = [this, i](ring_data_t *data) { handle_event(data, i); };
my_uring_prep_writev(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + bufs[i].offset);
bufs[i].state = INIT_META_WRITING;
@@ -257,7 +259,7 @@ resume_2:
next_offset = entries_to_zero[i]/entries_per_block;
for (j = i; j < entries_to_zero.size() && entries_to_zero[j]/entries_per_block == next_offset; j++) {}
GET_SQE();
data->iov = { metadata_buffer, bs->dsk.meta_block_size };
data->iov = { metadata_buffer, (size_t)bs->dsk.meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
my_uring_prep_readv(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + (1+next_offset)*bs->dsk.meta_block_size);
submitted++;
@@ -273,7 +275,7 @@ resume_5:
memset((uint8_t*)metadata_buffer + pos*bs->dsk.clean_entry_size, 0, bs->dsk.clean_entry_size);
}
GET_SQE();
data->iov = { metadata_buffer, bs->dsk.meta_block_size };
data->iov = { metadata_buffer, (size_t)bs->dsk.meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
my_uring_prep_writev(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + (1+next_offset)*bs->dsk.meta_block_size);
submitted++;
@@ -287,7 +289,7 @@ resume_6:
entries_to_zero.clear();
}
// metadata read finished
printf("Metadata entries loaded: %lu, free blocks: %lu / %lu\n", entries_loaded, bs->data_alloc->get_free_count(), bs->dsk.block_count);
printf("Metadata entries loaded: %ju, free blocks: %ju / %ju\n", entries_loaded, bs->data_alloc->get_free_count(), bs->dsk.block_count);
if (!bs->inmemory_meta)
{
free(metadata_buffer);
@@ -328,7 +330,7 @@ bool blockstore_init_meta::handle_meta_block(uint8_t *buf, uint64_t entries_per_
uint32_t *entry_csum = (uint32_t*)((uint8_t*)entry + bs->dsk.clean_entry_size - 4);
if (*entry_csum != crc32c(0, entry, bs->dsk.clean_entry_size - 4))
{
printf("Metadata entry %lu is corrupt (checksum mismatch), skipping\n", done_cnt+i);
printf("Metadata entry %ju is corrupt (checksum mismatch), skipping\n", done_cnt+i);
continue;
}
}
@@ -366,7 +368,7 @@ bool blockstore_init_meta::handle_meta_block(uint8_t *buf, uint64_t entries_per_
entries_to_zero.push_back(clean_it->second.location >> bs->dsk.block_order);
}
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu from %lx:%lx v%lu (new location is %lu)\n",
printf("Free block %ju from %jx:%jx v%ju (new location is %ju)\n",
old_clean_loc,
clean_it->first.inode, clean_it->first.stripe, clean_it->second.version,
done_cnt+i);
@@ -376,10 +378,11 @@ bool blockstore_init_meta::handle_meta_block(uint8_t *buf, uint64_t entries_per_
else
{
bs->inode_space_stats[entry->oid.inode] += bs->dsk.data_block_size;
bs->used_blocks++;
}
entries_loaded++;
#ifdef BLOCKSTORE_DEBUG
printf("Allocate block (clean entry) %lu: %lx:%lx v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
printf("Allocate block (clean entry) %ju: %jx:%jx v%ju\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
#endif
bs->data_alloc->set(done_cnt+i, true);
clean_db[entry->oid] = (struct clean_entry){
@@ -393,7 +396,7 @@ bool blockstore_init_meta::handle_meta_block(uint8_t *buf, uint64_t entries_per_
updated = true;
memset(entry, 0, bs->dsk.clean_entry_size);
#ifdef BLOCKSTORE_DEBUG
printf("Old clean entry %lu: %lx:%lx v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
printf("Old clean entry %ju: %jx:%jx v%ju\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
#endif
}
}
@@ -465,7 +468,7 @@ int blockstore_init_journal::loop()
if (!sqe)
throw std::runtime_error("io_uring is full while trying to read journal");
data = ((ring_data_t*)sqe->user_data);
data->iov = { submitted_buf, bs->journal.block_size };
data->iov = { submitted_buf, (size_t)bs->journal.block_size };
data->callback = simple_callback;
my_uring_prep_readv(sqe, bs->dsk.journal_fd, &data->iov, 1, bs->journal.offset);
bs->ringloop->submit();
@@ -506,7 +509,7 @@ resume_1:
// FIXME: Randomize initial crc32. Track crc32 when trimming.
printf("Resetting journal\n");
GET_SQE();
data->iov = (struct iovec){ submitted_buf, 2*bs->journal.block_size };
data->iov = (struct iovec){ submitted_buf, (size_t)(2*bs->journal.block_size) };
data->callback = simple_callback;
my_uring_prep_writev(sqe, bs->dsk.journal_fd, &data->iov, 1, bs->journal.offset);
wait_count++;
@@ -556,7 +559,7 @@ resume_1:
(je_start->version != JOURNAL_VERSION_V2 || je_start->size != JE_START_V2_SIZE && je_start->size != JE_START_V1_SIZE))
{
fprintf(
stderr, "The code only supports journal versions 2 and 1, but it is %lu on disk."
stderr, "The code only supports journal versions 2 and 1, but it is %ju on disk."
" Please use vitastor-disk to rewrite the journal\n",
je_start->size == JE_START_V0_SIZE ? 0 : je_start->version
);
@@ -605,7 +608,7 @@ resume_1:
submitted_buf = (uint8_t*)bs->journal.buffer + journal_pos;
data->iov = {
submitted_buf,
end - journal_pos < JOURNAL_BUFFER_SIZE ? end - journal_pos : JOURNAL_BUFFER_SIZE,
(size_t)(end - journal_pos < JOURNAL_BUFFER_SIZE ? end - journal_pos : JOURNAL_BUFFER_SIZE),
};
data->callback = [this](ring_data_t *data1) { handle_event(data1); };
my_uring_prep_readv(sqe, bs->dsk.journal_fd, &data->iov, 1, bs->journal.offset + journal_pos);
@@ -621,7 +624,7 @@ resume_1:
if (init_write_buf && !bs->readonly)
{
GET_SQE();
data->iov = { init_write_buf, bs->journal.block_size };
data->iov = { init_write_buf, (size_t)bs->journal.block_size };
data->callback = simple_callback;
my_uring_prep_writev(sqe, bs->dsk.journal_fd, &data->iov, 1, bs->journal.offset + init_write_sector);
wait_count++;
@@ -690,7 +693,7 @@ resume_1:
IS_BIG_WRITE(dirty_it->second.state) &&
dirty_it->second.location == UINT64_MAX)
{
printf("Fatal error (bug): %lx:%lx v%lu big_write journal_entry was allocated over another object\n",
printf("Fatal error (bug): %jx:%jx v%ju big_write journal_entry was allocated over another object\n",
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
exit(1);
}
@@ -698,7 +701,7 @@ resume_1:
bs->flusher->mark_trim_possible();
bs->journal.dirty_start = bs->journal.next_free;
printf(
"Journal entries loaded: %lu, free journal space: %lu bytes (%08lx..%08lx is used), free blocks: %lu / %lu\n",
"Journal entries loaded: %ju, free journal space: %ju bytes (%08jx..%08jx is used), free blocks: %ju / %ju\n",
entries_loaded,
(bs->journal.next_free >= bs->journal.used_start
? bs->journal.len-bs->journal.block_size - (bs->journal.next_free-bs->journal.used_start)
@@ -732,8 +735,9 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
resume:
while (pos < bs->journal.block_size)
{
journal_entry *je = (journal_entry*)((uint8_t*)buf + proc_pos - done_pos + pos);
if (je->magic != JOURNAL_MAGIC || je_crc32(je) != je->crc32 ||
auto buf_pos = proc_pos - done_pos + pos;
journal_entry *je = (journal_entry*)((uint8_t*)buf + buf_pos);
if (je->magic != JOURNAL_MAGIC || buf_pos+je->size > len || je_crc32(je) != je->crc32 ||
je->type < JE_MIN || je->type > JE_MAX || started && je->crc32_prev != crc32_last)
{
if (pos == 0)
@@ -752,7 +756,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
{
#ifdef BLOCKSTORE_DEBUG
printf(
"je_small_write%s oid=%lx:%lx ver=%lu offset=%u len=%u\n",
"je_small_write%s oid=%jx:%jx ver=%ju offset=%u len=%u\n",
je->type == JE_SMALL_WRITE_INSTANT ? "_instant" : "",
je->small_write.oid.inode, je->small_write.oid.stripe, je->small_write.version,
je->small_write.offset, je->small_write.len
@@ -774,7 +778,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
if (location != je->small_write.data_offset)
{
char err[1024];
snprintf(err, 1024, "BUG: calculated journal data offset (%08lx) != stored journal data offset (%08lx)", location, je->small_write.data_offset);
snprintf(err, 1024, "BUG: calculated journal data offset (%08jx) != stored journal data offset (%08jx)", location, je->small_write.data_offset);
throw std::runtime_error(err);
}
small_write_data.clear();
@@ -801,7 +805,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
covered += part_end - part_begin;
small_write_data.push_back((iovec){
.iov_base = (uint8_t*)done[i].buf + part_begin - done[i].pos,
.iov_len = part_end - part_begin,
.iov_len = (size_t)(part_end - part_begin),
});
}
}
@@ -824,7 +828,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
if (!data_csum_valid)
{
printf(
"Journal entry data is corrupt for small_write%s oid=%lx:%lx ver=%lu offset=%u len=%u - data crc32 %x != %x\n",
"Journal entry data is corrupt for small_write%s oid=%jx:%jx ver=%ju offset=%u len=%u - data crc32 %x != %x\n",
je->type == JE_SMALL_WRITE_INSTANT ? "_instant" : "",
je->small_write.oid.inode, je->small_write.oid.stripe, je->small_write.version,
je->small_write.offset, je->small_write.len,
@@ -843,7 +847,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
if (je->size != required_size)
{
printf(
"Journal entry data has invalid size for small_write%s oid=%lx:%lx ver=%lu offset=%u len=%u - should be %u bytes but is %u bytes\n",
"Journal entry data has invalid size for small_write%s oid=%jx:%jx ver=%ju offset=%u len=%u - should be %u bytes but is %u bytes\n",
je->type == JE_SMALL_WRITE_INSTANT ? "_instant" : "",
je->small_write.oid.inode, je->small_write.oid.stripe, je->small_write.version,
je->small_write.offset, je->small_write.len,
@@ -891,7 +895,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
if (block_crc32 != *block_csums)
{
printf(
"Journal entry data is corrupt for small_write%s oid=%lx:%lx ver=%lu offset=%u len=%u - block %u crc32 %x != %x\n",
"Journal entry data is corrupt for small_write%s oid=%jx:%jx ver=%ju offset=%u len=%u - block %u crc32 %x != %x\n",
je->type == JE_SMALL_WRITE_INSTANT ? "_instant" : "",
je->small_write.oid.inode, je->small_write.oid.stripe, je->small_write.version,
je->small_write.offset, je->small_write.len,
@@ -954,7 +958,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
bs->journal.used_sectors[proc_pos]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
"journal offset %08jx is used by %jx:%jx v%ju (%ju refs)\n",
proc_pos, ov.oid.inode, ov.oid.stripe, ov.version, bs->journal.used_sectors[proc_pos]
);
#endif
@@ -970,7 +974,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
{
#ifdef BLOCKSTORE_DEBUG
printf(
"je_big_write%s oid=%lx:%lx ver=%lu loc=%lu\n",
"je_big_write%s oid=%jx:%jx ver=%ju loc=%ju\n",
je->type == JE_BIG_WRITE_INSTANT ? "_instant" : "",
je->big_write.oid.inode, je->big_write.oid.stripe, je->big_write.version, je->big_write.location >> bs->dsk.block_order
);
@@ -1047,7 +1051,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
{
#ifdef BLOCKSTORE_DEBUG
printf(
"Allocate block (journal) %lu: %lx:%lx v%lu\n",
"Allocate block (journal) %ju: %jx:%jx v%ju\n",
je->big_write.location >> bs->dsk.block_order,
ov.oid.inode, ov.oid.stripe, ov.version
);
@@ -1057,7 +1061,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
bs->journal.used_sectors[proc_pos]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
"journal offset %08jx is used by %jx:%jx v%ju (%ju refs)\n",
proc_pos, ov.oid.inode, ov.oid.stripe, ov.version, bs->journal.used_sectors[proc_pos]
);
#endif
@@ -1072,7 +1076,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
else if (je->type == JE_STABLE)
{
#ifdef BLOCKSTORE_DEBUG
printf("je_stable oid=%lx:%lx ver=%lu\n", je->stable.oid.inode, je->stable.oid.stripe, je->stable.version);
printf("je_stable oid=%jx:%jx ver=%ju\n", je->stable.oid.inode, je->stable.oid.stripe, je->stable.version);
#endif
// oid, version
obj_ver_id ov = {
@@ -1084,7 +1088,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
else if (je->type == JE_ROLLBACK)
{
#ifdef BLOCKSTORE_DEBUG
printf("je_rollback oid=%lx:%lx ver=%lu\n", je->rollback.oid.inode, je->rollback.oid.stripe, je->rollback.version);
printf("je_rollback oid=%jx:%jx ver=%ju\n", je->rollback.oid.inode, je->rollback.oid.stripe, je->rollback.version);
#endif
// rollback dirty writes of <oid> up to <version>
obj_ver_id ov = {
@@ -1096,7 +1100,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
else if (je->type == JE_DELETE)
{
#ifdef BLOCKSTORE_DEBUG
printf("je_delete oid=%lx:%lx ver=%lu\n", je->del.oid.inode, je->del.oid.stripe, je->del.version);
printf("je_delete oid=%jx:%jx ver=%ju\n", je->del.oid.inode, je->del.oid.stripe, je->del.version);
#endif
bool dirty_exists = false;
auto dirty_it = bs->dirty_db.upper_bound((obj_ver_id){
@@ -1180,6 +1184,7 @@ void blockstore_init_journal::erase_dirty_object(blockstore_dirty_db_t::iterator
sp -= bs->dsk.data_block_size;
else
bs->inode_space_stats.erase(oid.inode);
bs->used_blocks--;
}
bs->erase_dirty(dirty_it, dirty_end, clean_loc);
// Remove it from the flusher's queue, too
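
Note on the handle_journal_part() change above: the new condition buf_pos+je->size > len rejects an entry whose declared size would run past the end of the buffer before its checksum is computed over that size. The sketch below illustrates the same ordering of checks with a simplified, hypothetical header layout and magic value; the real journal_entry has more fields and the real code also verifies crc32 and the entry type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical, simplified entry header used only for this illustration
struct entry_header
{
    uint16_t magic;
    uint16_t size; // total entry size in bytes, header included
};

// Assumed magic value, chosen arbitrarily for the example
const uint16_t ENTRY_MAGIC = 0x4A33;

// Returns true if the entry starting at buf_pos has the right magic and
// lies entirely inside the first len bytes of buf.
bool entry_fits(const uint8_t *buf, size_t len, size_t buf_pos)
{
    // The header itself must be inside the buffer before it is read
    if (buf_pos > len || len - buf_pos < sizeof(entry_header))
        return false;
    entry_header hdr;
    memcpy(&hdr, buf + buf_pos, sizeof(hdr));
    if (hdr.magic != ENTRY_MAGIC)
        return false;
    // Reject sizes smaller than the header or reaching past the buffer end
    return hdr.size >= sizeof(entry_header) && hdr.size <= len - buf_pos;
}
```

Only after such a check passes is it safe to compute a checksum over hdr.size bytes.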


@@ -90,8 +90,8 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
}
// In fact, it's even more rare than "ran out of journal space", so print a warning
printf(
"Ran out of journal sector buffers: %d/%lu buffers used (%d dirty), next buffer (%ld)"
" is %s and flushed %lu times. Consider increasing \'journal_sector_buffer_count\'\n",
"Ran out of journal sector buffers: %d/%ju buffers used (%d dirty), next buffer (%jd)"
" is %s and flushed %ju times. Consider increasing \'journal_sector_buffer_count\'\n",
used, bs->journal.sector_count, dirty, next_sector,
bs->journal.sector_info[next_sector].dirty ? "dirty" : "not dirty",
bs->journal.sector_info[next_sector].flush_count
@@ -103,7 +103,7 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
if (data_after > 0)
{
next_pos = next_pos + data_after;
if (next_pos > bs->journal.len)
if (next_pos >= bs->journal.len)
{
if (right_dir)
next_pos = bs->journal.block_size + data_after;
@@ -114,7 +114,7 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
{
// No space in the journal. Wait until used_start changes.
printf(
"Ran out of journal space (used_start=%08lx, next_free=%08lx, dirty_start=%08lx)\n",
"Ran out of journal space (used_start=%08jx, next_free=%08jx, dirty_start=%08jx)\n",
bs->journal.used_start, bs->journal.next_free, bs->journal.dirty_start
);
PRIV(op)->wait_for = WAIT_JOURNAL;
@@ -144,8 +144,10 @@ journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type,
journal.sector_info[journal.cur_sector].written = false;
journal.sector_info[journal.cur_sector].offset = journal.next_free;
journal.in_sector_pos = 0;
journal.next_free = (journal.next_free+journal.block_size) < journal.len ? journal.next_free + journal.block_size : journal.block_size;
assert(journal.next_free != journal.used_start);
auto next_next_free = (journal.next_free+journal.block_size) < journal.len ? journal.next_free + journal.block_size : journal.block_size;
// double check that next_free doesn't cross used_start from the left
assert(journal.next_free >= journal.used_start && next_next_free >= journal.next_free || next_next_free < journal.used_start);
journal.next_free = next_next_free;
memset(journal.inmemory
? (uint8_t*)journal.buffer + journal.sector_info[journal.cur_sector].offset
: (uint8_t*)journal.sector_buf + journal.block_size*journal.cur_sector, 0, journal.block_size);
@@ -181,7 +183,7 @@ void blockstore_impl_t::prepare_journal_sector_write(int cur_sector, blockstore_
(journal.inmemory
? (uint8_t*)journal.buffer + journal.sector_info[cur_sector].offset
: (uint8_t*)journal.sector_buf + journal.block_size*cur_sector),
journal.block_size
(size_t)journal.block_size
};
data->callback = [this, flush_id = journal.submit_id](ring_data_t *data) { handle_journal_write(data, flush_id); };
my_uring_prep_writev(
@@ -261,7 +263,7 @@ uint64_t journal_t::get_trim_pos()
// next_free does not need updating during trim
#ifdef BLOCKSTORE_DEBUG
printf(
"Trimming journal (used_start=%08lx, next_free=%08lx, dirty_start=%08lx, new_start=%08lx, new_refcount=%ld)\n",
"Trimming journal (used_start=%08jx, next_free=%08jx, dirty_start=%08jx, new_start=%08jx, new_refcount=%jd)\n",
used_start, next_free, dirty_start,
journal_used_it->first, journal_used_it->second
);
@@ -274,7 +276,7 @@ uint64_t journal_t::get_trim_pos()
// Journal is cleared up to <journal_used_it>
#ifdef BLOCKSTORE_DEBUG
printf(
"Trimming journal (used_start=%08lx, next_free=%08lx, dirty_start=%08lx, new_start=%08lx, new_refcount=%ld)\n",
"Trimming journal (used_start=%08jx, next_free=%08jx, dirty_start=%08jx, new_start=%08jx, new_refcount=%jd)\n",
used_start, next_free, dirty_start,
journal_used_it->first, journal_used_it->second
);
@@ -294,7 +296,7 @@ void journal_t::dump_diagnostics()
journal_used_it = used_sectors.begin();
}
printf(
"Journal: used_start=%08lx next_free=%08lx dirty_start=%08lx trim_to=%08lx trim_to_refs=%ld\n",
"Journal: used_start=%08jx next_free=%08jx dirty_start=%08jx trim_to=%08jx trim_to_refs=%jd\n",
used_start, next_free, dirty_start,
journal_used_it == used_sectors.end() ? 0 : journal_used_it->first,
journal_used_it == used_sectors.end() ? 0 : journal_used_it->second
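
Note on the prefill_single_journal_entry() change above: the old assertion only checked that next_free did not land exactly on used_start, while the new one checks that the advance does not cross used_start either. The sketch below restates that condition as two free functions. The function names are hypothetical; the arithmetic and the boolean expression follow the hunk.

```cpp
#include <cassert>
#include <cstdint>

// Position of the sector following pos in a ring that occupies
// [block_size, len). The first block is skipped on wrap-around,
// as in the expression used for next_next_free in the diff.
uint64_t next_sector(uint64_t pos, uint64_t block_size, uint64_t len)
{
    return pos + block_size < len ? pos + block_size : block_size;
}

// True if moving next_free to candidate does not reach or pass used_start.
// Either the write position is at or ahead of used_start and does not wrap,
// or it ends up strictly below used_start.
bool advance_is_safe(uint64_t used_start, uint64_t next_free, uint64_t candidate)
{
    return (next_free >= used_start && candidate >= next_free)
        || candidate < used_start;
}
```

For example, with 4096-byte blocks in a 16384-byte journal, wrapping from 12288 back to 4096 is unsafe when used_start is also 4096, and safe when used_start is 8192.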


@@ -13,13 +13,14 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config, bool init)
max_flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10);
}
min_flusher_count = strtoull(config["min_flusher_count"].c_str(), NULL, 10);
journal_trim_interval = strtoull(config["journal_trim_interval"].c_str(), NULL, 10);
max_write_iodepth = strtoull(config["max_write_iodepth"].c_str(), NULL, 10);
throttle_small_writes = config["throttle_small_writes"] == "true" || config["throttle_small_writes"] == "1" || config["throttle_small_writes"] == "yes";
throttle_target_iops = strtoull(config["throttle_target_iops"].c_str(), NULL, 10);
throttle_target_mbs = strtoull(config["throttle_target_mbs"].c_str(), NULL, 10);
throttle_target_parallelism = strtoull(config["throttle_target_parallelism"].c_str(), NULL, 10);
throttle_threshold_us = strtoull(config["throttle_threshold_us"].c_str(), NULL, 10);
if (config.find("autosync_writes") != config.end())
if (config["autosync_writes"] != "")
{
autosync_writes = strtoull(config["autosync_writes"].c_str(), NULL, 10);
}
@@ -31,6 +32,10 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config, bool init)
{
min_flusher_count = 1;
}
if (!journal_trim_interval)
{
journal_trim_interval = 512;
}
if (!max_write_iodepth)
{
max_write_iodepth = 128;
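
Note on the parse_config() changes above: journal_trim_interval is read with strtoull() and falls back to 512 when the result is zero, and autosync_writes is now only parsed when the stored string is non-empty. A minimal sketch of that "parse, then default" pattern follows. It assumes the configuration is a std::map of strings, and config_uint is a hypothetical helper, not a function from the diff. Unlike config["key"], it uses find() so that a lookup never inserts an empty entry.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <map>
#include <string>

// Hypothetical helper: returns the numeric value of config[key], or def when
// the key is missing, empty, or parses to zero (the same rule the diff applies
// to journal_trim_interval).
uint64_t config_uint(const std::map<std::string, std::string> & config,
    const std::string & key, uint64_t def)
{
    auto it = config.find(key);
    if (it == config.end() || it->second.empty())
        return def;
    uint64_t value = strtoull(it->second.c_str(), NULL, 10);
    return value ? value : def;
}
```

This rule is unsuitable for options where zero is a meaningful setting; those need the explicit "is it set" test that the diff uses for autosync_writes.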


@@ -25,7 +25,7 @@ int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_
return 1;
}
BS_SUBMIT_GET_SQE(sqe, data);
data->iov = (struct iovec){ buf, len };
data->iov = (struct iovec){ buf, (size_t)len };
PRIV(op)->pending_ops++;
my_uring_prep_readv(
sqe,
@@ -505,7 +505,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
for (auto & rv: PRIV(read_op)->read_vec)
{
if (rv.journal_sector)
journal.used_sectors[rv.journal_sector-1]++;
journal.used_sectors.at(rv.journal_sector-1)++;
}
}
read_op->retval = 0;
@@ -700,7 +700,7 @@ uint8_t* blockstore_impl_t::read_clean_meta_block(blockstore_op_t *op, uint64_t
.buf = buf,
});
BS_SUBMIT_GET_SQE(sqe, data);
data->iov = (struct iovec){ buf, dsk.meta_block_size };
data->iov = (struct iovec){ buf, (size_t)dsk.meta_block_size };
PRIV(op)->pending_ops++;
my_uring_prep_readv(sqe, dsk.meta_fd, &data->iov, 1, dsk.meta_offset + dsk.meta_block_size + sector);
data->callback = [this, op](ring_data_t *data) { handle_read_event(data, op); };
@@ -855,7 +855,7 @@ void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op
{
ok = false;
printf(
"Checksum mismatch in object %lx:%lx v%lu in journal at 0x%lx, checksum block #%u: got %08x, expected %08x\n",
"Checksum mismatch in object %jx:%jx v%ju in journal at 0x%jx, checksum block #%u: got %08x, expected %08x\n",
op->oid.inode, op->oid.stripe, op->version,
rv[i].disk_offset, bad_block / dsk.csum_block_size, calc_csum, stored_csum
);
@@ -875,7 +875,7 @@ void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op
{
ok = false;
printf(
"Checksum mismatch in object %lx:%lx v%lu in %s data at 0x%lx, checksum block #%u: got %08x, expected %08x\n",
"Checksum mismatch in object %jx:%jx v%ju in %s data at 0x%jx, checksum block #%u: got %08x, expected %08x\n",
op->oid.inode, op->oid.stripe, op->version,
(rv[i].copy_flags & COPY_BUF_JOURNALED_BIG ? "redirect-write" : "clean"),
rv[i].disk_offset, bad_block / dsk.csum_block_size, calc_csum, stored_csum
@@ -918,7 +918,7 @@ void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op
{
// checksum error
printf(
"Checksum mismatch in object %lx:%lx v%lu in %s area at offset 0x%lx+0x%lx: %08x vs %08x\n",
"Checksum mismatch in object %jx:%jx v%ju in %s area at offset 0x%jx+0x%zx: %08x vs %08x\n",
op->oid.inode, op->oid.stripe, op->version,
(vec.copy_flags & COPY_BUF_JOURNAL) ? "journal" : "data", vec.disk_offset, p,
crc32c(0, (uint8_t*)op->buf + vec.offset - op->offset + p, dsk.csum_block_size), *csum
@@ -966,7 +966,7 @@ void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op
{
if (rv.journal_sector)
{
auto used = --journal.used_sectors[rv.journal_sector-1];
auto used = --journal.used_sectors.at(rv.journal_sector-1);
if (used == 0)
{
journal.used_sectors.erase(rv.journal_sector-1);
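
Note on the (size_t) casts added to the iovec initializers throughout this diff: iov_len is a size_t, which is 32 bits wide on 32-bit targets, so brace-initializing it from a 64-bit length is a narrowing conversion that compilers diagnose. The casts silence that, and two hunks in the metadata init code additionally assert that the size fits. The sketch below folds the check and the cast into one place. make_iovec is a hypothetical helper and not something the diff introduces.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sys/uio.h>

// Hypothetical helper: builds an iovec from a 64-bit length, checking that
// the length is representable as size_t before the explicit cast.
inline struct iovec make_iovec(void *buf, uint64_t len)
{
    // Always true where size_t is 64 bits; a real guard on 32-bit targets
    assert(len <= SIZE_MAX);
    struct iovec iov = { buf, (size_t)len };
    return iov;
}
```

A call such as data->iov = make_iovec(buf, len); would then replace the repeated casts, at the cost of one more indirection for readers of the code.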


@@ -179,7 +179,7 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
{
object_id oid = dirty_it->first.oid;
#ifdef BLOCKSTORE_DEBUG
printf("Unblock writes-after-delete %lx:%lx v%lu\n", oid.inode, oid.stripe, dirty_it->first.version);
printf("Unblock writes-after-delete %jx:%jx v%ju\n", oid.inode, oid.stripe, dirty_it->first.version);
#endif
dirty_it = dirty_end;
// Unblock operations blocked by delete flushing
@@ -210,21 +210,26 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
dirty_it->second.location != UINT64_MAX)
{
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu from %lx:%lx v%lu\n", dirty_it->second.location >> dsk.block_order,
printf("Free block %ju from %jx:%jx v%ju\n", dirty_it->second.location >> dsk.block_order,
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
#endif
data_alloc->set(dirty_it->second.location >> dsk.block_order, false);
}
auto used = --journal.used_sectors[dirty_it->second.journal_sector];
auto used = --journal.used_sectors.at(dirty_it->second.journal_sector);
#ifdef BLOCKSTORE_DEBUG
printf(
"remove usage of journal offset %08lx by %lx:%lx v%lu (%lu refs)\n", dirty_it->second.journal_sector,
"remove usage of journal offset %08jx by %jx:%jx v%ju (%ju refs)\n", dirty_it->second.journal_sector,
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version, used
);
#endif
if (used == 0)
{
journal.used_sectors.erase(dirty_it->second.journal_sector);
if (dirty_it->second.journal_sector == journal.sector_info[journal.cur_sector].offset)
{
// Mark current sector as "full" to select the new one
journal.in_sector_pos = dsk.journal_block_size;
}
flusher->mark_trim_possible();
}
free_dirty_dyn_data(dirty_it->second);


@@ -298,7 +298,7 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
if (clean_it == clean_db.end() || clean_it->second.version < ov.version)
{
// No such object version
printf("Error: %lx:%lx v%lu not found while stabilizing\n", ov.oid.inode, ov.oid.stripe, ov.version);
printf("Error: %jx:%jx v%ju not found while stabilizing\n", ov.oid.inode, ov.oid.stripe, ov.version);
return -ENOENT;
}
else
@@ -307,35 +307,49 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
return STAB_SPLIT_DONE;
}
}
else if (IS_IN_FLIGHT(dirty_it->second.state))
{
// Object write is still in progress. Wait until the write request completes
return STAB_SPLIT_WAIT;
}
else if (!IS_SYNCED(dirty_it->second.state))
{
// Object not synced yet - sync it
// In previous versions we returned EBUSY here and required
// the caller (OSD) to issue a global sync first. But a global sync
// waits for all writes in the queue including inflight writes. And
// inflight writes may themselves be blocked by unstable writes being
// still present in the journal and not flushed away from it.
// So we must sync specific objects here.
//
// Even more, we have to process "stabilize" request in parts. That is,
// we must stabilize all objects which are already synced. Otherwise
// they may block objects which are NOT synced yet.
return STAB_SPLIT_SYNC;
}
else if (IS_STABLE(dirty_it->second.state))
{
// Already stable
return STAB_SPLIT_DONE;
}
else
while (true)
{
return STAB_SPLIT_TODO;
if (IS_IN_FLIGHT(dirty_it->second.state))
{
// Object write is still in progress. Wait until the write request completes
return STAB_SPLIT_WAIT;
}
else if (!IS_SYNCED(dirty_it->second.state))
{
// Object not synced yet - sync it
// In previous versions we returned EBUSY here and required
// the caller (OSD) to issue a global sync first. But a global sync
// waits for all writes in the queue including inflight writes. And
// inflight writes may themselves be blocked by unstable writes being
// still present in the journal and not flushed away from it.
// So we must sync specific objects here.
//
// Even more, we have to process "stabilize" request in parts. That is,
// we must stabilize all objects which are already synced. Otherwise
// they may block objects which are NOT synced yet.
return STAB_SPLIT_SYNC;
}
else if (IS_STABLE(dirty_it->second.state))
{
break;
}
// Check previous versions too
if (dirty_it == dirty_db.begin())
{
break;
}
dirty_it--;
if (dirty_it->first.oid != ov.oid)
{
break;
}
}
return STAB_SPLIT_TODO;
});
if (r != 1)
{
@@ -402,7 +416,7 @@ resume_4:
{
// Mark all dirty_db entries up to op->version as stable
#ifdef BLOCKSTORE_DEBUG
printf("Stabilize %lx:%lx v%lu\n", v->oid.inode, v->oid.stripe, v->version);
printf("Stabilize %jx:%jx v%ju\n", v->oid.inode, v->oid.stripe, v->version);
#endif
mark_stable(*v);
}
@@ -412,11 +426,40 @@ resume_4:
return 2;
}
void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
void blockstore_impl_t::mark_stable(obj_ver_id v, bool forget_dirty)
{
auto dirty_it = dirty_db.find(v);
if (dirty_it != dirty_db.end())
{
if (IS_INSTANT(dirty_it->second.state))
{
// 'Instant' (non-EC) operations may complete and try to become stable out of order. Prevent it.
auto back_it = dirty_it;
while (back_it != dirty_db.begin())
{
back_it--;
if (back_it->first.oid != v.oid)
{
break;
}
if (!IS_STABLE(back_it->second.state))
{
// There are preceding unstable versions, can't flush <v>
return;
}
}
while (true)
{
dirty_it++;
if (dirty_it == dirty_db.end() || dirty_it->first.oid != v.oid ||
!IS_SYNCED(dirty_it->second.state))
{
dirty_it--;
break;
}
v.version = dirty_it->first.version;
}
}
while (1)
{
bool was_stable = IS_STABLE(dirty_it->second.state);
@@ -445,6 +488,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
if (!exists)
{
inode_space_stats[dirty_it->first.oid.inode] += dsk.data_block_size;
used_blocks++;
}
big_to_flush++;
}
@@ -455,6 +499,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
sp -= dsk.data_block_size;
else
inode_space_stats.erase(dirty_it->first.oid.inode);
used_blocks--;
big_to_flush++;
}
}
@@ -462,7 +507,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
{
// mark_stable should never be called for in-flight or submitted writes
printf(
"BUG: Attempt to mark_stable object %lx:%lx v%lu state of which is %x\n",
"BUG: Attempt to mark_stable object %jx:%jx v%ju state of which is %x\n",
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version,
dirty_it->second.state
);
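
Note on the new `IS_INSTANT` branch in `mark_stable`: it appears to do two passes over the versions of one object, first refusing to stabilize while an earlier version is still unstable, then extending the target version forward over later versions that are already synced. The sketch below models that logic on a plain `std::map`. The names `demo_state`, `demo_key` and `pick_stable_version` are hypothetical, and the three-value state enum is a simplification of the real state bit masks, so treat it as an illustration of the ordering rule rather than the actual implementation.

```cpp
#include <cstdint>
#include <map>
#include <utility>

// Simplified stand-ins for the blockstore state bits (assumption).
enum demo_state { DEMO_WRITTEN, DEMO_SYNCED, DEMO_STABLE };
typedef std::pair<uint64_t, uint64_t> demo_key; // (object id, version)

// Returns false if <version> must not be stabilized yet. Otherwise returns
// true and stores the highest version that can be stabilized in out_version.
bool pick_stable_version(const std::map<demo_key, demo_state> & db,
    uint64_t oid, uint64_t version, uint64_t & out_version)
{
    auto it = db.find(demo_key(oid, version));
    if (it == db.end())
        return false;
    // Backward pass: every earlier version of the same object must be stable
    auto back = it;
    while (back != db.begin())
    {
        --back;
        if (back->first.first != oid)
            break;
        if (back->second != DEMO_STABLE)
            return false;
    }
    // Forward pass: absorb later versions that are already synced
    out_version = version;
    for (++it; it != db.end() && it->first.first == oid && it->second != DEMO_WRITTEN; ++it)
        out_version = it->first.second;
    return true;
}
```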


@@ -85,16 +85,14 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
left--;
auto & dirty_entry = dirty_db.at(sbw);
uint64_t dyn_size = dsk.dirty_dyn_size(dirty_entry.offset, dirty_entry.len);
if (!space_check.check_available(op, 1, sizeof(journal_entry_big_write) + dyn_size,
(unstable_writes.size()+unstable_unsynced)*journal.block_size))
if (!space_check.check_available(op, 1, sizeof(journal_entry_big_write) + dyn_size, 0))
{
return 0;
}
}
}
else if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(),
sizeof(journal_entry_big_write) + dsk.clean_entry_bitmap_size,
(unstable_writes.size()+unstable_unsynced)*journal.block_size))
sizeof(journal_entry_big_write) + dsk.clean_entry_bitmap_size, 0))
{
return 0;
}
@@ -117,11 +115,14 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
journal, (dirty_entry.state & BS_ST_INSTANT) ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write) + dyn_size
);
dirty_entry.journal_sector = journal.sector_info[journal.cur_sector].offset;
auto jsec = dirty_entry.journal_sector = journal.sector_info[journal.cur_sector].offset;
assert(journal.next_free >= journal.used_start
? (jsec >= journal.used_start && jsec < journal.next_free)
: (jsec >= journal.used_start || jsec < journal.next_free));
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
"journal offset %08jx is used by %jx:%jx v%ju (%ju refs)\n",
dirty_entry.journal_sector, it->oid.inode, it->oid.stripe, it->version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
@@ -175,7 +176,7 @@ void blockstore_impl_t::ack_sync(blockstore_op_t *op)
for (auto it = PRIV(op)->sync_big_writes.begin(); it != PRIV(op)->sync_big_writes.end(); it++)
{
#ifdef BLOCKSTORE_DEBUG
printf("Ack sync big %lx:%lx v%lu\n", it->oid.inode, it->oid.stripe, it->version);
printf("Ack sync big %jx:%jx v%ju\n", it->oid.inode, it->oid.stripe, it->version);
#endif
auto & unstab = unstable_writes[it->oid];
unstab = unstab < it->version ? it->version : unstab;
@@ -203,7 +204,7 @@ void blockstore_impl_t::ack_sync(blockstore_op_t *op)
for (auto it = PRIV(op)->sync_small_writes.begin(); it != PRIV(op)->sync_small_writes.end(); it++)
{
#ifdef BLOCKSTORE_DEBUG
printf("Ack sync small %lx:%lx v%lu\n", it->oid.inode, it->oid.stripe, it->version);
printf("Ack sync small %jx:%jx v%ju\n", it->oid.inode, it->oid.stripe, it->version);
#endif
auto & unstab = unstable_writes[it->oid];
unstab = unstab < it->version ? it->version : unstab;
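
Note on the `assert` added in `continue_sync`: it checks that the journal sector being referenced lies inside the currently used part of the journal, which is a circular range that may wrap around the end of the device. The same test, pulled out into a standalone function, could look like this. `in_used_range` is a hypothetical name, and the sketch assumes a half-open range `[used_start, next_free)` as the asserted expression suggests.

```cpp
#include <cstdint>

// Hypothetical helper: true if pos lies in the half-open circular range
// [used_start, next_free) of a ring buffer.
bool in_used_range(uint64_t used_start, uint64_t next_free, uint64_t pos)
{
    if (next_free >= used_start)
    {
        // Not wrapped: the used area is one contiguous span
        return pos >= used_start && pos < next_free;
    }
    // Wrapped: the used area is the tail of the buffer plus its head
    return pos >= used_start || pos < next_free;
}
```

With `used_start == next_free` the function reports every position as outside the range, which corresponds to an empty journal under this reading.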


@@ -85,7 +85,7 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
// It's allowed to write versions with low numbers over deletes
// However, we have to flush those deletes first as we use version number for ordering
#ifdef BLOCKSTORE_DEBUG
printf("Write %lx:%lx v%lu over delete (real v%lu) offset=%u len=%u\n", op->oid.inode, op->oid.stripe, version, op->version, op->offset, op->len);
printf("Write %jx:%jx v%ju over delete (real v%ju) offset=%u len=%u\n", op->oid.inode, op->oid.stripe, version, op->version, op->offset, op->len);
#endif
wait_del = true;
PRIV(op)->real_version = op->version;
@@ -95,11 +95,13 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
// Issue an additional sync so the delete reaches the journal
blockstore_op_t *sync_op = new blockstore_op_t;
sync_op->opcode = BS_OP_SYNC;
sync_op->callback = [this, op](blockstore_op_t *sync_op)
sync_op->oid = op->oid;
sync_op->version = op->version;
sync_op->callback = [this](blockstore_op_t *sync_op)
{
flusher->unshift_flush((obj_ver_id){
.oid = op->oid,
.version = op->version-1,
.oid = sync_op->oid,
.version = sync_op->version-1,
}, true);
delete sync_op;
};
@@ -117,7 +119,7 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
{
// Invalid version requested
#ifdef BLOCKSTORE_DEBUG
printf("Write %lx:%lx v%lu requested, but we already have v%lu\n", op->oid.inode, op->oid.stripe, op->version, version);
printf("Write %jx:%jx v%ju requested, but we already have v%ju\n", op->oid.inode, op->oid.stripe, op->version, version);
#endif
op->retval = -EEXIST;
if (!is_del && alloc_dyn_data)
@@ -129,7 +131,7 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
}
bool imm = (op->len < dsk.data_block_size ? (immediate_commit != IMMEDIATE_NONE) : (immediate_commit == IMMEDIATE_ALL));
if (wait_big && !is_del && !deleted && op->len < dsk.data_block_size && !imm ||
!imm && unsynced_queued_ops >= autosync_writes)
!imm && autosync_writes && unsynced_queued_ops >= autosync_writes)
{
// Issue an additional sync so that the previous big write can reach the journal
blockstore_op_t *sync_op = new blockstore_op_t;
@@ -144,9 +146,9 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
unsynced_queued_ops++;
#ifdef BLOCKSTORE_DEBUG
if (is_del)
printf("Delete %lx:%lx v%lu\n", op->oid.inode, op->oid.stripe, op->version);
printf("Delete %jx:%jx v%ju\n", op->oid.inode, op->oid.stripe, op->version);
else if (!wait_del)
printf("Write %lx:%lx v%lu offset=%u len=%u\n", op->oid.inode, op->oid.stripe, op->version, op->offset, op->len);
printf("Write %jx:%jx v%ju offset=%u len=%u\n", op->oid.inode, op->oid.stripe, op->version, op->offset, op->len);
#endif
// No strict need to add it into dirty_db here except maybe for listings to return
// correct data when there are inflight operations in the queue
@@ -286,7 +288,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
}
// Restore original low version number for unblocked operations
#ifdef BLOCKSTORE_DEBUG
printf("Restoring %lx:%lx version: v%lu -> v%lu\n", op->oid.inode, op->oid.stripe, op->version, PRIV(op)->real_version);
printf("Restoring %jx:%jx version: v%ju -> v%ju\n", op->oid.inode, op->oid.stripe, op->version, PRIV(op)->real_version);
#endif
auto prev_it = dirty_it;
if (prev_it != dirty_db.begin())
@@ -296,7 +298,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
{
// Original version is still invalid
// All subsequent writes to the same object must be canceled too
printf("Tried to write %lx:%lx v%lu after delete (old version v%lu), but already have v%lu\n",
printf("Tried to write %jx:%jx v%ju after delete (old version v%ju), but already have v%ju\n",
op->oid.inode, op->oid.stripe, PRIV(op)->real_version, op->version, prev_it->first.version);
cancel_all_writes(op, dirty_it, -EEXIST);
return 2;
@@ -320,7 +322,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, unsynced_big_write_count + 1,
sizeof(journal_entry_big_write) + dsk.clean_dyn_size,
(unstable_writes.size()+unstable_unsynced)*journal.block_size))
(unstable_writes.size()+unstable_unsynced+((dirty_it->second.state & BS_ST_INSTANT) ? 0 : 1))*journal.block_size))
{
return 0;
}
@@ -348,8 +350,8 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
if (entry->oid.inode || entry->oid.stripe || entry->version)
{
printf(
"Fatal error (metadata corruption or bug): tried to write object %lx:%lx v%lu"
" over a non-zero metadata entry %lu with %lx:%lx v%lu\n", op->oid.inode,
"Fatal error (metadata corruption or bug): tried to write object %jx:%jx v%ju"
" over a non-zero metadata entry %ju with %jx:%jx v%ju\n", op->oid.inode,
op->oid.stripe, op->version, loc, entry->oid.inode, entry->oid.stripe, entry->version
);
exit(1);
@@ -361,7 +363,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED;
#ifdef BLOCKSTORE_DEBUG
printf(
"Allocate block %lu for %lx:%lx v%lu\n",
"Allocate block %ju for %jx:%jx v%ju\n",
loc, op->oid.inode, op->oid.stripe, op->version
);
#endif
@@ -372,13 +374,13 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
int vcnt = 0;
if (stripe_offset)
{
PRIV(op)->iov_zerofill[vcnt++] = (struct iovec){ zero_object, stripe_offset };
PRIV(op)->iov_zerofill[vcnt++] = (struct iovec){ zero_object, (size_t)stripe_offset };
}
PRIV(op)->iov_zerofill[vcnt++] = (struct iovec){ op->buf, op->len };
if (stripe_end)
{
stripe_end = dsk.bitmap_granularity - stripe_end;
PRIV(op)->iov_zerofill[vcnt++] = (struct iovec){ zero_object, stripe_end };
PRIV(op)->iov_zerofill[vcnt++] = (struct iovec){ zero_object, (size_t)stripe_end };
}
data->iov.iov_len = op->len + stripe_offset + stripe_end; // to check it in the callback
data->callback = [this, op](ring_data_t *data) { handle_write_event(data, op); };
@@ -386,7 +388,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
sqe, dsk.data_fd, PRIV(op)->iov_zerofill, vcnt, dsk.data_offset + (loc << dsk.block_order) + op->offset - stripe_offset
);
PRIV(op)->pending_ops = 1;
if (immediate_commit != IMMEDIATE_ALL && !(dirty_it->second.state & BS_ST_INSTANT))
if (!(dirty_it->second.state & BS_ST_INSTANT))
{
unstable_unsynced++;
}
@@ -412,7 +414,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
sizeof(journal_entry_big_write) + dsk.clean_dyn_size, 0)
|| !space_check.check_available(op, 1,
sizeof(journal_entry_small_write) + dyn_size,
(unstable_writes.size()+unstable_unsynced)*journal.block_size))
op->len + (unstable_writes.size()+unstable_unsynced+((dirty_it->second.state & BS_ST_INSTANT) ? 0 : 1))*journal.block_size))
{
return 0;
}
@@ -436,11 +438,23 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_SMALL_WRITE_INSTANT : JE_SMALL_WRITE,
sizeof(journal_entry_small_write) + dyn_size
);
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
auto jsec = dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
if (!(journal.next_free >= journal.used_start
? (jsec >= journal.used_start && jsec < journal.next_free)
: (jsec >= journal.used_start || jsec < journal.next_free)))
{
printf(
"BUG: journal offset %08jx is used by %jx:%jx v%ju (%ju refs) BUT used_start=%jx next_free=%jx\n",
dirty_it->second.journal_sector, dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset],
journal.used_start, journal.next_free
);
abort();
}
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
"journal offset %08jx is used by %jx:%jx v%ju (%ju refs)\n",
dirty_it->second.journal_sector, dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
@@ -454,14 +468,16 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
journal_used_it->first < next_next_free + op->len)
{
printf(
"BUG: Attempt to overwrite used offset (%lx, %lu refs) of the journal with the object %lx:%lx v%lu: data at %lx, len %x!"
" Journal used_start=%08lx (%lu refs), next_free=%08lx, dirty_start=%08lx\n",
"BUG: Attempt to overwrite used offset (%jx, %ju refs) of the journal with the object %jx:%jx v%ju: data at %jx, len %x!"
" Journal used_start=%08jx (%ju refs), next_free=%08jx, dirty_start=%08jx\n",
journal_used_it->first, journal_used_it->second, op->oid.inode, op->oid.stripe, op->version, next_next_free, op->len,
journal.used_start, journal.used_sectors[journal.used_start], journal.next_free, journal.dirty_start
);
exit(1);
}
}
// double check that next_free doesn't cross used_start from the left
assert(journal.next_free >= journal.used_start && next_next_free >= journal.next_free || next_next_free < journal.used_start);
journal.next_free = next_next_free;
je->oid = op->oid;
je->version = op->version;
@@ -499,13 +515,13 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
}
dirty_it->second.location = journal.next_free;
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED;
journal.next_free += op->len;
if (journal.next_free >= journal.len)
{
journal.next_free = dsk.journal_block_size;
assert(journal.next_free != journal.used_start);
}
if (immediate_commit == IMMEDIATE_NONE && !(dirty_it->second.state & BS_ST_INSTANT))
next_next_free = journal.next_free + op->len;
if (next_next_free >= journal.len)
next_next_free = dsk.journal_block_size;
// double check that next_free doesn't cross used_start from the left
assert(journal.next_free >= journal.used_start && next_next_free >= journal.next_free || next_next_free < journal.used_start);
journal.next_free = next_next_free;
if (!(dirty_it->second.state & BS_ST_INSTANT))
{
unstable_unsynced++;
}
@@ -547,7 +563,7 @@ resume_2:
uint64_t dyn_size = dsk.dirty_dyn_size(op->offset, op->len);
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, 1, sizeof(journal_entry_big_write) + dyn_size,
(unstable_writes.size()+unstable_unsynced)*journal.block_size))
(unstable_writes.size()+unstable_unsynced+((dirty_it->second.state & BS_ST_INSTANT) ? 0 : 1))*journal.block_size))
{
return 0;
}
@@ -556,11 +572,23 @@ resume_2:
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write) + dyn_size
);
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
auto jsec = dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
if (!(journal.next_free >= journal.used_start
? (jsec >= journal.used_start && jsec < journal.next_free)
: (jsec >= journal.used_start || jsec < journal.next_free)))
{
printf(
"BUG: journal offset %08jx is used by %jx:%jx v%ju (%ju refs) BUT used_start=%jx next_free=%jx\n",
dirty_it->second.journal_sector, dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset],
journal.used_start, journal.next_free
);
abort();
}
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
"journal offset %08jx is used by %jx:%jx v%ju (%ju refs)\n",
journal.sector_info[journal.cur_sector].offset, op->oid.inode, op->oid.stripe, op->version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
@@ -587,20 +615,20 @@ resume_4:
});
assert(dirty_it != dirty_db.end());
#ifdef BLOCKSTORE_DEBUG
printf("Ack write %lx:%lx v%lu = state 0x%x\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
printf("Ack write %jx:%jx v%ju = state 0x%x\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
#endif
bool is_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE;
bool imm = is_big ? (immediate_commit == IMMEDIATE_ALL) : (immediate_commit != IMMEDIATE_NONE);
bool is_instant = ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_DELETE || (dirty_it->second.state & BS_ST_INSTANT));
bool is_instant = IS_INSTANT(dirty_it->second.state);
if (imm)
{
auto & unstab = unstable_writes[op->oid];
unstab = unstab < op->version ? op->version : unstab;
}
else if (!is_instant)
{
unstable_unsynced--;
assert(unstable_unsynced >= 0);
if (!is_instant)
{
unstable_unsynced--;
assert(unstable_unsynced >= 0);
}
}
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK)
| (imm ? BS_ST_SYNCED : BS_ST_WRITTEN);
@@ -780,7 +808,7 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
"journal offset %08jx is used by %jx:%jx v%ju (%ju refs)\n",
dirty_it->second.journal_sector, dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
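
Note on the `sync_op` callback change in `enqueue_write`: the old lambda captured `op` and read `op->oid` and `op->version` when the sync completed, which is presumably the use-after-free mentioned in the release notes if `op` had already been freed by then. The new code copies both fields into `sync_op` and reads them from there. A reduced sketch of that pattern is shown below. `demo_op_t` and `make_followup` are hypothetical names, and unlike the real callback this one leaves deletion to the caller.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical, heavily reduced operation structure.
struct demo_op_t
{
    uint64_t oid = 0;
    uint64_t version = 0;
    std::function<void(demo_op_t*)> callback;
};

// Builds a follow-up operation that carries its own copy of the fields it
// needs, so its callback never dereferences the original operation.
demo_op_t *make_followup(const demo_op_t *op, uint64_t *flushed_version)
{
    demo_op_t *sync_op = new demo_op_t;
    sync_op->oid = op->oid;
    sync_op->version = op->version;
    sync_op->callback = [flushed_version](demo_op_t *self)
    {
        // Reads only from the follow-up op itself, never from <op>
        *flushed_version = self->version - 1;
    };
    return sync_op;
}
```

The original operation can then be destroyed before the follow-up completes without affecting the callback.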


@@ -46,18 +46,21 @@ static const char* help_text =
"vitastor-cli snap-create [-p|--pool <id|name>] <image>@<snapshot>\n"
" Create a snapshot of image <name>. May be used live if only a single writer is active.\n"
"\n"
"vitastor-cli modify <name> [--rename <new-name>] [--resize <size>] [--readonly | --readwrite] [-f|--force]\n"
"vitastor-cli modify <name> [--rename <new-name>] [--resize <size>] [--readonly | --readwrite] [-f|--force] [--down-ok]\n"
" Rename, resize image or change its readonly status. Images with children can't be made read-write.\n"
" If the new size is smaller than the old size, extra data will be purged.\n"
" You should resize file system in the image, if present, before shrinking it.\n"
" -f|--force Proceed with shrinking or setting readwrite flag even if the image has children.\n"
" --down-ok Proceed with shrinking even if some data will be left on unavailable OSDs.\n"
"\n"
"vitastor-cli rm <from> [<to>] [--writers-stopped]\n"
"vitastor-cli rm <from> [<to>] [--writers-stopped] [--down-ok]\n"
" Remove <from> or all layers between <from> and <to> (<to> must be a child of <from>),\n"
" rebasing all their children accordingly. --writers-stopped allows merging to be a bit\n"
" more effective in case of a single 'slim' read-write child and 'fat' removed parent:\n"
" the child is merged into parent and parent is renamed to child in that case.\n"
" In other cases parent layers are always merged into children.\n"
" Other options:\n"
" --down-ok Continue deletion/merging even if some data will be left on unavailable OSDs.\n"
"\n"
"vitastor-cli flatten <layer>\n"
" Flatten a layer, i.e. merge data and detach it from parents.\n"
@@ -116,12 +119,13 @@ static const char* help_text =
"Use vitastor-cli --help <command> for command details or vitastor-cli --help --all for all details.\n"
"\n"
"GLOBAL OPTIONS:\n"
" --etcd_address <etcd_address>\n"
" --config_file FILE Path to Vitastor configuration file\n"
" --etcd_address URL Etcd connection address\n"
" --iodepth N Send N operations in parallel to each OSD when possible (default 32)\n"
" --parallel_osds M Work with M osds in parallel when possible (default 4)\n"
" --progress 1|0 Report progress (default 1)\n"
" --cas 1|0 Use CAS writes for flatten, merge, rm (default is decide automatically)\n"
" --no-color Disable colored output\n"
" --color 1|0 Enable/disable colored output and CR symbols (default 1 if stdout is a terminal)\n"
" --json JSON output\n"
;
@@ -170,6 +174,7 @@ static json11::Json::object parse_args(int narg, const char *args[])
!strcmp(opt, "readonly") || !strcmp(opt, "readwrite") ||
!strcmp(opt, "force") || !strcmp(opt, "reverse") ||
!strcmp(opt, "allow-data-loss") || !strcmp(opt, "allow_data_loss") ||
!strcmp(opt, "down-ok") || !strcmp(opt, "down_ok") ||
!strcmp(opt, "dry-run") || !strcmp(opt, "dry_run") ||
!strcmp(opt, "help") || !strcmp(opt, "all") ||
(!strcmp(opt, "writers-stopped") || !strcmp(opt, "writers_stopped")) && strcmp("1", args[i+1]) != 0


@@ -77,7 +77,7 @@ struct alloc_osd_t
std::string key = base64_decode(kv["key"].string_value());
osd_num_t cur_osd;
char null_byte = 0;
int scanned = sscanf(key.c_str() + parent->cli->st_cli.etcd_prefix.length(), "/osd/stats/%lu%c", &cur_osd, &null_byte);
int scanned = sscanf(key.c_str() + parent->cli->st_cli.etcd_prefix.length(), "/osd/stats/%ju%c", &cur_osd, &null_byte);
if (scanned != 1 || !cur_osd)
{
fprintf(stderr, "Invalid key in etcd: %s\n", key.c_str());


@@ -1,6 +1,7 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include <unistd.h>
#include "str_util.h"
#include "cluster_client.h"
#include "cli.h"
@@ -11,7 +12,7 @@ void cli_tool_t::change_parent(inode_t cur, inode_t new_parent, cli_result_t *re
if (cur_cfg_it == cli->st_cli.inode_config.end())
{
char buf[128];
snprintf(buf, 128, "Inode 0x%lx disappeared", cur);
snprintf(buf, 128, "Inode 0x%jx disappeared", cur);
*result = (cli_result_t){ .err = EIO, .text = buf };
return;
}
@@ -113,7 +114,12 @@ void cli_tool_t::parse_config(json11::Json::object & cfg)
else
kv_it++;
}
color = !cfg["no_color"].bool_value();
if (cfg.find("no_color") != cfg.end())
color = !cfg["no_color"].bool_value();
else if (cfg.find("color") != cfg.end())
color = cfg["color"].bool_value();
else
color = isatty(1);
json_output = cfg["json"].bool_value();
iodepth = cfg["iodepth"].uint64_value();
if (!iodepth)
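
Note on the color handling in `parse_config`: the new logic gives `no_color` precedence over `color` and falls back to `isatty(1)` when neither is set, which matches the "Do not use \r if output is not a terminal" item in the release notes. The precedence can be expressed as a small pure function. `want_color` is a hypothetical name, and a plain `std::map<std::string, bool>` stands in for the `json11` config object used in the real code.

```cpp
#include <map>
#include <string>

// Hypothetical helper mirroring the precedence in parse_config:
// explicit no_color first, then explicit color, then terminal detection.
bool want_color(const std::map<std::string, bool> & cfg, bool stdout_is_tty)
{
    auto it = cfg.find("no_color");
    if (it != cfg.end())
        return !it->second;
    it = cfg.find("color");
    if (it != cfg.end())
        return it->second;
    // Neither option given: enable color only for interactive output
    return stdout_is_tty;
}
```

In the real code the last argument would come from `isatty(1)`, so piping the output to a file or a job log disables both colors and carriage returns by default.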


@@ -160,14 +160,14 @@ struct cli_describe_t
if (op->reply.hdr.retval < 0)
{
fprintf(
stderr, "Failed to describe objects on OSD %lu (retval=%ld)\n",
stderr, "Failed to describe objects on OSD %ju (retval=%jd)\n",
osd_num, op->reply.hdr.retval
);
}
else if (op->reply.describe.result_bytes != op->reply.hdr.retval * sizeof(osd_reply_describe_item_t))
{
fprintf(
stderr, "Invalid response size from OSD %lu (expected %lu bytes, got %lu bytes)\n",
stderr, "Invalid response size from OSD %ju (expected %ju bytes, got %ju bytes)\n",
osd_num, op->reply.hdr.retval * sizeof(osd_reply_describe_item_t), op->reply.describe.result_bytes
);
}
@@ -178,11 +178,11 @@ struct cli_describe_t
{
if (!parent->json_output || parent->is_command_line)
{
#define FMT "{\"inode\":\"0x%lx\",\"stripe\":\"0x%lx\",\"part\":%u,\"osd_num\":%lu%s%s%s}"
#define FMT "{\"inode\":\"0x%jx\",\"stripe\":\"0x%jx\",\"part\":%u,\"osd_num\":%ju%s%s%s}"
printf(
(parent->json_output
? (count > 0 ? ",\n " FMT : " " FMT)
: "%lx:%lx part %u on OSD %lu%s%s%s\n"),
: "%jx:%jx part %u on OSD %ju%s%s%s\n"),
#undef FMT
items[i].inode, items[i].stripe,
items[i].role, items[i].osd_num,

Some files were not shown because too many files have changed in this diff.