Compare commits

...

84 Commits

Author SHA1 Message Date
Vitaliy Filippov 25832cb7e4 Fix eviction when random_pos selects the end
2024-01-02 13:24:15 +03:00
Vitaliy Filippov e6326c6539 Implement min/max list_count to make listings during performance test reasonable 2024-01-02 13:24:15 +03:00
Vitaliy Filippov e32f382815 Fix and improve parallel allocation
- Do not try to allocate more DB blocks in an inode block until it's "confirmed" and "locked" by the first write
- Do not recheck for new zero DB blocks on first write into an inode block - a CAS failure means someone else is already writing into it
- Throw new allocation blocks away regardless of whether the known_version is 0 on a CAS failure
2024-01-02 13:24:15 +03:00
Vitaliy Filippov fb23d94000 Implement key_prefix for K/V stress test 2024-01-02 13:24:15 +03:00
Vitaliy Filippov ee462c2dad More fixes
- do not overwrite a block with older version if known version is newer
  (read may start before update and end after update)
- invalidated block versions can't be remembered and trusted
- right boundary for split blocks is right_half when diving down, not key_lt
- restart update also when block is "invalidated", not just on version mismatch
- copy callback in listings to avoid closure destruction bugs too
2024-01-02 13:24:15 +03:00
Vitaliy Filippov 16e4c767f1 Add logging and one more assert 2024-01-02 13:24:15 +03:00
Vitaliy Filippov 9e2b677499 Make get_block() wait for updating when unrelated block is found along the path 2024-01-02 13:24:15 +03:00
Vitaliy Filippov fd57096d2d Fix a race condition where changed blocks were parsed over existing cached blocks, resulting in a mix of data 2024-01-02 13:24:15 +03:00
Vitaliy Filippov e5ae907256 Simplify code by removing an unneeded "optimisation" 2024-01-02 13:24:15 +03:00
Vitaliy Filippov 64fd6f1c56 Add kv_log_level, print warnings on level 1, trace ops on level 10 2024-01-02 13:24:15 +03:00
Vitaliy Filippov 6e76e09d16 Fix duplicate keys in listings on parallel updates -- do not rewind key "iterator position" 2024-01-02 13:24:15 +03:00
Vitaliy Filippov 0964aeebd2 Implement key suffix to avoid collisions of multiple test workers 2024-01-02 13:24:15 +03:00
Vitaliy Filippov facff20ca1 Do not complain on empty first block 2024-01-02 13:24:15 +03:00
Vitaliy Filippov 16e09745f0 Add JSON output for stress-tester 2024-01-02 13:24:15 +03:00
Vitaliy Filippov 442f44a64f Print total stats 2024-01-02 13:24:15 +03:00
Vitaliy Filippov c67e3d56cb Do not send more than op_count operations (fix segfault on finish) 2024-01-02 13:24:15 +03:00
Vitaliy Filippov de41e46335 Add some more resiliency to serialize() 2024-01-02 13:24:15 +03:00
Vitaliy Filippov bf9a279ff9 Invalidate blocks being updated too 2024-01-02 13:24:15 +03:00
Vitaliy Filippov b7a41e6394 Change new block allocation method: make each writer choose multiple empty PG blocks and place blocks in them 2024-01-02 13:24:15 +03:00
Vitaliy Filippov 4175cb3720 Remove blocks from cache on unsuccessful updates 2024-01-02 13:24:15 +03:00
Vitaliy Filippov 3a4b71b0cd Allow to track multiple updates per block (it should never happen though) 2024-01-02 13:24:14 +03:00
Vitaliy Filippov 34969c5919 Do not call stop_updating after failed write_new_block and after clear_block (both delete the item) 2024-01-02 13:24:14 +03:00
Vitaliy Filippov 02a8df6586 Track versions of parent blocks and recheck if changed during update 2024-01-02 13:24:14 +03:00
Vitaliy Filippov 86c6482cf3 Fix resume_split condition (key_lt can also be "") 2024-01-02 13:24:14 +03:00
Vitaliy Filippov 4dd68c543c Experiment: transform offsets for better sharding 2024-01-02 13:24:14 +03:00
Vitaliy Filippov 10ad96c56c More post-stress-test fixes
- Prevent _split types of new blocks
- Stop updating new blocks only after the whole update, otherwise pointers
  may become invalid
- Use recheck_none for updates initially
- Use UINT64_MAX as initial block version when postponing ops, otherwise the
  check fails when the block is initially empty. This for example leads to
  writing both leaf items & block pointers (which is incorrect) into the root
  block when starting stress-test with --parallelism 32
- Fix -EINTR comparison
2024-01-02 13:24:14 +03:00
Vitaliy Filippov 6e451117ce Print operation statistics 2024-01-02 13:24:14 +03:00
Vitaliy Filippov 85f35bdf30 K/V fixes after stress-test :-)
- track block versions correctly - per inode block (128kb) instead of tree block (4kb)
- prevent multiple parallel CAS writes of the same inode block
- add logging for EILSEQ which means invalid data in the tree
- fix get_block updated flag which was true for blocks already in cache and was leading to infinite loops on "unrelated block" errors
- apply changes to blocks in cache only after successful writes (using "virtual changes")
- do not replace cached block with an older version from disk
- recheck "unrelated blocks" (read/update collisions) until data stops changing
- track tree path correctly - do not treat split block as parent of its right half
- correctly move blocks when finding new empty place on disk
- restart updates from the beginning when one of blocks is changed by a parallel update
- fix delete using SET opcode and setting key to the empty value instead
- prevent changing the same key more than 1 time in parallel
- fix listing verification
- resume continue_updates in update_find (required because it uses continue_update itself)
- add allow_old_cached parameter to get()
2024-01-02 13:24:14 +03:00
Vitaliy Filippov 09adaf62fd Implement K/V DB stress tester 2024-01-02 13:24:14 +03:00
Vitaliy Filippov af93f8323c Evict blocks based on memory limit & block usage 2024-01-02 13:24:14 +03:00
Vitaliy Filippov 3d29c76ff4 Track blocks per level 2024-01-02 13:24:14 +03:00
Vitaliy Filippov 19275379c1 Track block level 2024-01-02 13:24:14 +03:00
Vitaliy Filippov 96ad3c7c50 Experimental B-Tree Vitastor embedded K/V database implementation! 2024-01-02 13:24:14 +03:00
Vitaliy Filippov 2f228fa96a Only treat data partitions as existing OSDs in vitastor-disk prepare
2023-12-31 11:46:47 +03:00
Vitaliy Filippov 2f6b9c0306 Remove etcd parameter from default command examples 2023-12-31 02:50:41 +03:00
Vitaliy Filippov 48b5f871e0 Add Contributor License Agreement in Russian and English 2023-12-31 01:23:52 +03:00
Vitaliy Filippov c17f76a3e4 Add documentation for recovery auto-tuning
2023-12-31 01:23:17 +03:00
Vitaliy Filippov a6ab54b1ba Do not allow negative util_low/high 2023-12-31 01:23:17 +03:00
Vitaliy Filippov 99ee8596ea Rename min/max_util to util_low/high 2023-12-31 01:23:17 +03:00
Vitaliy Filippov c4928e6ecd Protect from try_send completing the operation immediately
Fixes a possible use-after-free in case of continue_ops() calling try_send(),
then connect_peer() -> set_timer() -> trigger_nearest() -> handle_op_part() -> continue_ops() again
2023-12-31 01:23:17 +03:00
Vitaliy Filippov ec7dcd1be5 Do not apply very large recovery pauses during tests 2023-12-31 01:23:17 +03:00
Vitaliy Filippov e600bbc151 Fix flapping move_reappear test by adding an fsync before stopping PG 2023-12-31 01:23:17 +03:00
Vitaliy Filippov 8b8c1179a7 Use a separate used_blocks counter for free space stats to hide possibly delayed on-flush deallocation 2023-12-31 01:23:17 +03:00
Vitaliy Filippov d5a6fa6dd7 Fix possible crash on print_slow when bs_op is NULL 2023-12-31 01:23:17 +03:00
Vitaliy Filippov f757a35a8d Retry PG changes without re-running lpsolve when pool configuration and OSD tree don't change
OSDs often change their /pg/history keys during rebalance, so monitor receives additional
transaction failures from etcd if it re-runs lpsolve which sometimes may even lead to monitor
being unable to apply PG changes at all until rebalance completes
2023-12-31 01:23:17 +03:00
Vitaliy Filippov 1edf86ed26 Aggregate recovery delay using simple mean over last 10 observations (EWMA is shit) 2023-12-31 01:23:17 +03:00
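As an illustration of the "simple mean over last 10 observations" mentioned in the commit above, here is a minimal C++ sketch of a fixed-size sliding-window mean. The class name `window_mean_t` and its methods are hypothetical and are not taken from the Vitastor source; the actual implementation may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical helper (not from the Vitastor source tree):
// keeps the last `capacity` observations and reports their arithmetic mean.
class window_mean_t
{
    std::deque<uint64_t> samples;
    uint64_t sum = 0;
    size_t capacity;
public:
    explicit window_mean_t(size_t capacity = 10): capacity(capacity)
    {
        assert(capacity > 0);
    }
    // Add one observation and drop the oldest one when the window is full
    void add(uint64_t value)
    {
        samples.push_back(value);
        sum += value;
        if (samples.size() > capacity)
        {
            sum -= samples.front();
            samples.pop_front();
        }
    }
    // Mean of the observations currently in the window (0 if there are none)
    double mean() const
    {
        return samples.empty() ? 0.0 : (double)sum / samples.size();
    }
};
```

Unlike an exponentially weighted average, an observation stops affecting the result entirely once it leaves the window.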
Vitaliy Filippov 5ca7cde612 Experiment/WIP: Try to track "secondary" recovery ops separately 2023-12-31 01:23:17 +03:00
Vitaliy Filippov 751935ddd8 WIP Auto-tune recovery speed 2023-12-31 01:23:17 +03:00
Vitaliy Filippov d84dee7098 Track recovery op latencies + refactor into a structure 2023-12-31 01:23:17 +03:00
Vitaliy Filippov dcc76eee15 Add a parity chunk count change test script 2023-12-26 23:48:41 +03:00
Vitaliy Filippov 2f38adeb3d Restart dead VDUSE daemons at regular intervals 2023-12-24 12:58:50 +03:00
Vitaliy Filippov f72f14e6a7 Clear old PG states, history, and OSD states on etcd state reload
Also add protection from etcd watcher messages being split into multiple websocket
messages - I'm not sure if etcd actually does that, but it's better to have extra
protection anyway.

Also check that all etcd watchers are started in the keepalive routine, otherwise
it sometimes tries to revive etcd watchers starting with revision=1 which obviously
always fails because this revision is nearly always compacted.

All these changes should fix an old rarely reproduced bug where SOMETIMES OSDs
didn't react to PG config changes which was leading to offline pools on node reboot.
It happened on the full reload of state from etcd.
2023-12-24 02:02:13 +03:00
Vitaliy Filippov 1299373988 Use the same etcd_ws_keepalive_interval in OSD and mon
2023-12-23 20:07:29 +03:00
Vitaliy Filippov 178bb0e701 Prevent re-entry into timerfd set_nearest
2023-12-22 02:32:40 +03:00
Vitaliy Filippov 4ece4dfdd0 Fix mon not using values from config when /config/global is not present
2023-12-22 02:25:09 +03:00
Vitaliy Filippov 95631773b6 Remove pve-storage-portal-dns-list format for vitastor_etcd_address 2023-12-20 02:22:06 +03:00
Vitaliy Filippov 7239cfb91a Parse log_level in cluster_client
2023-12-20 02:21:23 +03:00
Vitaliy Filippov 7cea642f4a Fix vitastor-nbd image existence check not working because of non-zeroed inode_watch fields
2023-12-19 01:11:37 +03:00
Vitaliy Filippov dc615403d9 Do not warn on EPIPE in client unless log_level is raised explicitly
2023-12-17 13:42:26 +03:00
Vitaliy Filippov 1a704e06ab Allow multiple interfaces with the same IP address, for "simple routed" full mesh network
2023-12-17 13:25:56 +03:00
Vitaliy Filippov 575475de71 Do not ignore loopback addresses for OSD network (to make ECMP setups with frr possible)
2023-12-17 11:55:13 +03:00
Vitaliy Filippov aca2bef15f Add vitastor-disk update-sb command
2023-12-14 01:11:42 +03:00
Vitaliy Filippov 4dd6e89263 Change qemu to qemu-system-x86 in docs 2023-12-14 01:01:00 +03:00
Vitaliy Filippov 9bac99ffb6 Fix incorrect error in CSI when searching for the device in /sys 2023-12-14 01:00:32 +03:00
Vitaliy Filippov 62ed130960 Support building qemu 8.1 from bookworm-backports 2023-12-10 00:34:13 +03:00
Vitaliy Filippov 9c7755b6e8 Use qemu-storage-daemon from QEMU 8.1.2 for CSI 2023-12-08 00:10:12 +03:00
Vitaliy Filippov 691ebd991a Move 2 last log printfs to stderr from stdout in etcd_state_client
2023-12-08 00:01:52 +03:00
Vitaliy Filippov 6d5df908a3 Fix possible out of bounds when checking invalid journal entries 2023-12-08 00:01:07 +03:00
Vitaliy Filippov fa87769ed8 Correct config options in vduse docs 2023-12-06 02:09:04 +03:00
Vitaliy Filippov 2ce8292803 Also log when killing process 2023-12-06 01:06:53 +03:00
Vitaliy Filippov 7f8f7ded52 Check for empty output of vitastor-nbd map (just in case) 2023-12-06 01:01:14 +03:00
Vitaliy Filippov 68553eabbb Log executed CLI commands 2023-12-06 00:48:12 +03:00
Vitaliy Filippov 3147c5c8d5 Remove internal error wrapping 2023-12-06 00:39:42 +03:00
Vitaliy Filippov 576e2ae608 Fix etcd_address check in CSI 2023-12-06 00:28:21 +03:00
Vitaliy Filippov a1c7cc3d8d Release 1.3.1
Hotfix to 1.3.0 - new "journal space reservation" had a bug which
caused OSDs to crash with EC and without immediate_commit.
2023-12-04 18:35:09 +03:00
Vitaliy Filippov a5e3dfbc5a Oops, 1.3.0 needs a hotfix
2023-12-04 13:45:54 +03:00
Vitaliy Filippov 7972502eaf Release 1.3.0
New features:
- RDMA without ODP - much faster and all cards are now supported, not just Mellanox
- VDUSE in CSI - faster, more stable and can even recover after CSI pod restart!
- Reserve journal space for stabilize requests dynamically to prevent stalls under load with EC
- Raise default NBD timeout from 30 to 300 seconds and allow to take it from /etc/vitastor/vitastor.conf
- Remove explicit etcdUrl/etcdPrefix K8S storage class parameter support to prevent
  etcd migration issues for volumes created with these parameters
- Support QEMU 8.1 and pve-qemu 8.1

Bug fixes:
- Fix RDMA connection (and thus memory) leak
- Fix rare crashes under load due to incorrect io_uring queue size tracking
- Fix monitor statistics aggregation in case of empty /osd/stats keys
- Fix crash on unknown long argument to vitastor-disk
- Allow trailing comma in JSONs again
- Fix crash on attempts to dump a long listing of objects "to stabilize" or "to rollback" in a slow op
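The NBD timeout item above says the value can now be taken from /etc/vitastor/vitastor.conf, with a default of 300 seconds. A minimal Python sketch of such a lookup follows. It assumes the file holds a single JSON object (as the `vitastor.conf` entry in the ConfigMap later in this diff does) and that the option key is `nbd_timeout`. Both the key name and the helper name `nbd_timeout_from_config` are assumptions made for illustration, not taken from the Vitastor sources.

```python
import json

# New default from the release notes: 300 seconds (previously 30).
DEFAULT_NBD_TIMEOUT = 300


def nbd_timeout_from_config(text, default=DEFAULT_NBD_TIMEOUT):
    """Return the NBD timeout (seconds) from JSON config text.

    Falls back to `default` when the text is empty or the key is absent.
    The key name "nbd_timeout" is an assumption for this sketch.
    """
    if not text.strip():
        return default
    config = json.loads(text)
    return int(config.get("nbd_timeout", default))


def load_nbd_timeout(path="/etc/vitastor/vitastor.conf"):
    """Read the config file if it exists; otherwise use the default."""
    try:
        with open(path, encoding="utf-8") as f:
            return nbd_timeout_from_config(f.read())
    except FileNotFoundError:
        return DEFAULT_NBD_TIMEOUT
```

Keeping the parsing separate from the file access makes the fallback behaviour easy to check without touching /etc.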
2023-12-04 02:36:43 +03:00
Vitaliy Filippov e57b7203b8 Use cmake3 on RHEL 7 2023-12-04 02:36:29 +03:00
Vitaliy Filippov c8a179dcda Note that Proxmox 8.1 is supported 2023-12-04 02:20:33 +03:00
Vitaliy Filippov 845454742d Fix warning with QEMU 8.1 2023-12-04 01:59:07 +03:00
Vitaliy Filippov d65512bd80 Add patches for QEMU 8.1 2023-12-04 01:56:17 +03:00
Vitaliy Filippov 53de2bbd0f Support VDUSE in CSI
VDUSE has multiple advantages:
- Better performance
- Lack of timeout problems
- And even the ability to recover after restart of the vitastor-csi pod!
2023-12-04 00:41:24 +03:00
Vitaliy Filippov 628aa59574 Raise default NBD timeout from 30 to 300 seconds and allow to take it from /etc/vitastor/vitastor.conf
2023-12-02 14:11:14 +03:00
Vitaliy Filippov 037cf64a47 Remove explicit etcdUrl/etcdPrefix from volume parameters 2023-12-02 13:26:00 +03:00
103 changed files with 5770 additions and 531 deletions

115
CLA-en.md Normal file

@@ -0,0 +1,115 @@
## Contributor License Agreement
> This Agreement is made in the Russian and English languages. **The English
text of Agreement is for informational purposes only** and is not binding
for the Parties.
>
> In the event of a conflict between the provisions of the Russian and
English versions of this Agreement, the **Russian version shall prevail**.
>
> Russian version is published at https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md
This document represents the offer of Filippov Vitaliy Vladimirovich
("Author"), author and copyright holder of Vitastor software ("Program"),
acknowledged by a certificate of Federal Service for Intellectual
Property of Russian Federation (Rospatent) # 2021617829 dated 20 May 2021,
to "Contributors" to conclude this license agreement as follows
("Agreement" or "Offer").
In accordance with Art. 435, Art. 438 of the Civil Code of the Russian
Federation, this Agreement is an offer and in case of acceptance of the
offer, an agreement is considered concluded on the conditions specified
in the offer.
1. Applicable Terms. \
1.1. "Official Repository" shall mean the computer storage, operated by
the Author, containing all prior and future versions of the Source
Code of the Program, at Internet addresses https://git.yourcmc.ru/vitalif/vitastor/
or https://github.com/vitalif/vitastor/. \
1.2. "Contributions" shall mean results of intellectual activity
(including, but not limited to, source code, libraries, components,
texts, documentation) which can be software or elements of the software
and which are provided by Contributors to the Author for inclusion
in the Program. \
1.3. "Contributor" shall mean a person who provides Contributions to
the Author and agrees with all provisions of this Agreement.
A Contributor can be: 1) an individual; or 2) a legal entity or an
individual entrepreneur in case when an individual provides Contributions
on behalf of third parties, including on behalf of his employer.
2. Subject of the Agreement. \
2.1. Subject of the Agreement shall be the Contributions sent to the Author by Contributors. \
2.2. The Contributor grants to the Author the right to use Contributions at his own
discretion and without any necessity to get a prior approval from Contributor or
any other third party in any way, under a simple (non-exclusive), royalty-free,
irrevocable license throughout the world by all means not contrary to law, in whole
or as a part of the Program, or other open-source or closed-source computer programs,
products or services (hereinafter -- the "License"), including, but not limited to: \
2.2.1. to execute Contributions and use them for any tasks; \
2.2.2. to publish and distribute Contributions in modified or unmodified form and/or to rent them; \
2.2.3. to modify Contributions, add comments, illustrations or any explanations to Contributions while using them; \
2.2.4. to create other results of intellectual activity based on Contributions, including derivative works and composite works; \
2.2.5. to translate Contributions into other languages, including other programming languages; \
2.2.6. to carry out rental and public display of Contributions; \
2.2.7. to use Contributions under the trade name and/or any trademark or any other label, or without it, as the Author thinks fit. \
2.3. The Contributor grants to the Author the right to sublicense any of the aforementioned
rights to third parties on any terms at the Author's discretion. \
2.4. The License is provided for the entire duration of Contributor's
exclusive intellectual property rights to the Contributions. \
2.5. The Contributor grants to the Author the right to decide how and where to mention,
or to not mention at all, the fact of his authorship, name, nickname and/or company
details when including Contributions into the Program or in any other computer
programs, products or services.
3. Acceptance of the Offer \
3.1. The Contributor may provide Contributions to the Author in the form of
a "Pull Request" in an Official Repository of the Program or by any
other electronic means of communication, including, but not limited to,
E-mail or messenger applications. \
3.2. The acceptance of the Offer shall be the fact of provision of Contributions
to the Author by the Contributor by any means with the following remark:
“I accept Vitastor CLA agreement: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md”
or “Я принимаю соглашение Vitastor CLA: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md”. \
3.3. Date of acceptance of the Offer shall be the date of such provision.
4. Rights and obligations of the parties. \
4.1. The Contributor reserves the right to use Contributions by any lawful means
not contrary to this Agreement. \
4.2. The Author has the right to refuse to include Contributions into the Program
at any moment with no explanation to the Contributor.
5. Representations and Warranties. \
5.1. The person providing Contributions for the purpose of their inclusion
in the Program represents and warrants that he is the Contributor
or legally acts on the Contributor's behalf. Name or company details
of the Contributor shall be provided with the Contribution at the moment
of their provision to the Author. \
5.2. The Contributor represents and warrants that he legally owns exclusive
intellectual property rights to the Contributions. \
5.3. The Contributor represents and warrants that any further use of
Contributions by the Author as provided by Contributor under the terms
of the Agreement does not infringe on intellectual and other rights and
legitimate interests of third parties. \
5.4. The Contributor represents and warrants that he has all rights and legal
capacity needed to accept this Offer. \
5.5. The Contributor represents and warrants that Contributions don't
contain malware or any information considered illegal under the law
of Russian Federation.
6. Termination of the Agreement \
6.1. The Agreement may be terminated by mutual agreement of the Author and
the Contributor, formalised in written form, or on grounds prescribed
by the law of the Russian Federation.
7. Final Clauses \
7.1. The Contributor may optionally sign the Agreement in the written form. \
7.2. The Agreement is deemed to become effective from the Date of signing of
the Agreement and until the expiration of Contributor's exclusive
intellectual property rights to the Contributions. \
7.3. The Author may unilaterally alter the Agreement without informing Contributors.
The new version of the document shall come into effect 3 (three) days after
being published in the Official Repository of the Program at Internet address
[https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md).
Contributors should keep informed about the actual version of the Agreement themselves. \
7.4. If the Author and the Contributor fail to agree on disputable issues,
disputes shall be referred to the Moscow Arbitration court.

108
CLA-ru.md Normal file

@@ -0,0 +1,108 @@
## Contributor License Agreement
> This Offer is written in Russian and English versions. **The English-language
> version is provided for informational purposes** and does not bind the parties to the agreement.
>
> In the event of discrepancies between the provisions of the Russian and English versions of the Agreement,
> the **Russian version takes precedence**.
>
> The English version is published at https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md
This offer agreement (hereinafter the Offer or the Agreement) is addressed to individuals
and legal entities (hereinafter the Contributors) and is an official public offer
by Filippov Vitaliy Vladimirovich (hereinafter the Author), author of the Vitastor software,
certificate of the Federal Service for Intellectual Property (Rospatent) No. 2021617829
dated 20 May 2021 (hereinafter the Program), as follows:
1. Terms and definitions \
1.1. Repository: an electronic storage containing the source code of the Program. \
1.2. Contribution: a result of the Contributor's intellectual activity comprising
changes or additions to the source code of the Program which the Contributor
wishes to have included in the Program for further use and distribution
by the Author, and which the Contributor sends to the Author for that purpose. \
1.3. Contributor: an individual or legal entity that makes Contributions to the code of the Program. \
1.4. Civil Code: the Civil Code of the Russian Federation.
2. Subject of the Offer \
2.1. The subject of this Offer is the Contributions sent by the Contributor to the Author. \
2.2. The Contributor grants the Author the right to use the Contributions at the Author's own discretion
and without the need for prior approval from the Contributor or any other third party,
under a simple (non-exclusive), royalty-free, irrevocable license, in whole
or in part, as part of the Program or of other programs, products or services,
whether open-source or closed-source, by any means not contrary
to law, including but not limited to the following: \
2.2.1. To run and use the Contributions for any tasks; \
2.2.2. To distribute, import and make the Contributions available to the public; \
2.2.3. To make changes, abridgements and additions to the Contributions, and to supply the Contributions
with comments, illustrations or explanations when using them; \
2.2.4. To create other results of intellectual activity based on the Contributions,
including derivative and composite works; \
2.2.5. To translate the Contributions into other languages, including other programming languages; \
2.2.6. To rent out and publicly display the Contributions; \
2.2.7. To use the Contributions under any trade name, trademark
(service mark) or other designation, or without one. \
2.3. The Contributor grants the Author the right to sublicense the rights obtained to the Contributions
to third parties on any terms at the Author's discretion. \
2.4. The Contributor grants the Author the rights to the Contributions worldwide. \
2.5. The Contributor grants the Author the rights for the entire term of the Contributor's
exclusive right to the Contributions. \
2.6. The Contributor grants the Author the rights to the Contributions free of charge. \
2.7. The Contributor permits the Author to determine independently the manner, method and
place of indicating the Contributor's name, details and/or pseudonym when including
the Contributions in the Program or in other programs, products or services.
3. Acceptance of the Offer \
3.1. The Contributor may submit Contributions to the Author through the mirrors of the official
Repository of the Program at https://git.yourcmc.ru/vitalif/vitastor/ or
https://github.com/vitalif/vitastor/ in the form of a pull request,
or in writing, or by any other electronic means of communication,
for example e-mail or messengers. \
3.2. The fact of the Contributor submitting Contributions to the Author by any means with one of the remarks
“I accept Vitastor CLA agreement: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md”
or “Я принимаю соглашение Vitastor CLA: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md”
constitutes full and unconditional acceptance by the Contributor of the terms of this
Offer; that is, the Contributor is deemed to have read this public agreement and,
in accordance with the Civil Code, is recognised as a person who has entered into contractual relations
with the Author on the basis of this Offer. \
3.3. The date of acceptance of this Offer is the date of such submission.
4. Rights and obligations of the Parties \
4.1. The Contributor retains the right to use the Contributions by any lawful
means not contrary to this Agreement. \
4.2. The Author may refuse to include Contributions in the Program
at any moment, at the Author's discretion and without giving reasons.
5. Warranties and representations \
5.1. The person sending Contributions for inclusion in the Program
warrants that they are the Contributor or a representative of the Contributor. The name or details
of the Contributor must be stated when the Contributions are submitted to the Author of the Program. \
5.2. The Contributor warrants that the Contributor is the lawful holder of the exclusive rights
to the Contributions. \
5.3. The Contributor warrants that at the time of acceptance of this Offer the Contributor
is not aware (and could not have been aware) of any rights of third parties to
the Contributions submitted to the Author, or to any part of them, that may be infringed
in connection with the transfer of the Contributions under this Agreement. \
5.4. The Contributor warrants that the Contributor has legal capacity and holds all
rights necessary to conclude the Agreement. \
5.5. The Contributor warrants that the Contributions do not contain malware or
any other information whose distribution is prohibited under the laws of the Russian
Federation.
6. Termination of the Offer \
6.1. This Agreement may be terminated by agreement of the parties,
formalised in writing, or through its termination on grounds
provided for by law.
7. Final provisions \
7.1. The Contributor may, if desired, sign this Agreement in written form. \
7.2. This Agreement is effective from the moment it is concluded until the expiration
of the Contributor's exclusive rights to the Contributions. \
7.3. The Author has the right to unilaterally amend and supplement the Agreement
without special notice to the Contributors. The new version of the document comes
into force 3 (three) calendar days after its publication in the official Repository
of the Program at the Internet address
[https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md).
Contributors are responsible for keeping track of the current terms of the Offer. \
7.4. All disputes arising between the parties in the course of their interaction under this
Agreement shall be resolved through negotiation. If a dispute cannot be settled
through negotiation, the parties shall refer it to the Arbitration Court of the City of Moscow.


@@ -2,6 +2,6 @@ cmake_minimum_required(VERSION 2.8.12)
 project(vitastor)
-set(VERSION "1.2.0")
+set(VERSION "1.3.1")
 add_subdirectory(src)


@@ -1,14 +1,15 @@
 # Compile stage
-FROM golang:buster AS build
+FROM golang:bookworm AS build
 ADD go.sum go.mod /app/
 RUN cd /app; CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go mod download -x
 ADD . /app
-RUN perl -i -e '$/ = undef; while(<>) { s/\n\s*(\{\s*\n)/$1\n/g; s/\}(\s*\n\s*)else\b/$1} else/g; print; }' `find /app -name '*.go'`
-RUN cd /app; CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go build -o vitastor-csi
+RUN perl -i -e '$/ = undef; while(<>) { s/\n\s*(\{\s*\n)/$1\n/g; s/\}(\s*\n\s*)else\b/$1} else/g; print; }' `find /app -name '*.go'` && \
+cd /app && \
+CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go build -o vitastor-csi
 # Final stage
-FROM debian:buster
+FROM debian:bookworm
 LABEL maintainers="Vitaliy Filippov <vitalif@yourcmc.ru>"
 LABEL description="Vitastor CSI Driver"
@@ -18,19 +19,30 @@ ENV CSI_ENDPOINT=""
 RUN apt-get update && \
 apt-get install -y wget && \
-(echo deb http://deb.debian.org/debian buster-backports main > /etc/apt/sources.list.d/backports.list) && \
 (echo "APT::Install-Recommends false;" > /etc/apt/apt.conf) && \
 apt-get update && \
-apt-get install -y e2fsprogs xfsprogs kmod && \
+apt-get install -y e2fsprogs xfsprogs kmod iproute2 \
+# dependencies of qemu-storage-daemon
+libnuma1 liburing2 libglib2.0-0 libfuse3-3 libaio1 libzstd1 libnettle8 \
+libgmp10 libhogweed6 libp11-kit0 libidn2-0 libunistring2 libtasn1-6 libpcre2-8-0 libffi8 && \
 apt-get clean && \
 (echo options nbd nbds_max=128 > /etc/modprobe.d/nbd.conf)
 COPY --from=build /app/vitastor-csi /bin/
-RUN (echo deb http://vitastor.io/debian buster main > /etc/apt/sources.list.d/vitastor.list) && \
+RUN (echo deb http://vitastor.io/debian bookworm main > /etc/apt/sources.list.d/vitastor.list) && \
+((echo 'Package: *'; echo 'Pin: origin "vitastor.io"'; echo 'Pin-Priority: 1000') > /etc/apt/preferences.d/vitastor.pref) && \
 wget -q -O /etc/apt/trusted.gpg.d/vitastor.gpg https://vitastor.io/debian/pubkey.gpg && \
 apt-get update && \
 apt-get install -y vitastor-client && \
+wget https://vitastor.io/archive/qemu/qemu-bookworm-8.1.2%2Bds-1%2Bvitastor1/qemu-utils_8.1.2%2Bds-1%2Bvitastor1_amd64.deb && \
+wget https://vitastor.io/archive/qemu/qemu-bookworm-8.1.2%2Bds-1%2Bvitastor1/qemu-block-extra_8.1.2%2Bds-1%2Bvitastor1_amd64.deb && \
+dpkg -x qemu-utils*.deb tmp1 && \
+dpkg -x qemu-block-extra*.deb tmp1 && \
+cp -a tmp1/usr/bin/qemu-storage-daemon /usr/bin/ && \
+mkdir -p /usr/lib/x86_64-linux-gnu/qemu && \
+cp -a tmp1/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so /usr/lib/x86_64-linux-gnu/qemu/ && \
+rm -rf tmp1 *.deb && \
 apt-get clean
 ENTRYPOINT ["/bin/vitastor-csi"]


@ -1,4 +1,4 @@
VERSION ?= v1.2.0 VERSION ?= v1.3.1
all: build push all: build push


@ -2,6 +2,7 @@
apiVersion: v1 apiVersion: v1
kind: ConfigMap kind: ConfigMap
data: data:
# You can add multiple configuration files here to use a multi-cluster setup
vitastor.conf: |- vitastor.conf: |-
{"etcd_address":"http://192.168.7.2:2379","etcd_prefix":"/vitastor"} {"etcd_address":"http://192.168.7.2:2379","etcd_prefix":"/vitastor"}
metadata: metadata:


@ -49,7 +49,7 @@ spec:
capabilities: capabilities:
add: ["SYS_ADMIN"] add: ["SYS_ADMIN"]
allowPrivilegeEscalation: true allowPrivilegeEscalation: true
image: vitalif/vitastor-csi:v1.2.0 image: vitalif/vitastor-csi:v1.3.1
args: args:
- "--node=$(NODE_ID)" - "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)" - "--endpoint=$(CSI_ENDPOINT)"
@ -82,6 +82,8 @@ spec:
name: host-sys name: host-sys
- mountPath: /run/mount - mountPath: /run/mount
name: host-mount name: host-mount
- mountPath: /run/vitastor-csi
name: run-vitastor-csi
- mountPath: /lib/modules - mountPath: /lib/modules
name: lib-modules name: lib-modules
readOnly: true readOnly: true
@ -132,6 +134,9 @@ spec:
- name: host-mount - name: host-mount
hostPath: hostPath:
path: /run/mount path: /run/mount
- name: run-vitastor-csi
hostPath:
path: /run/vitastor-csi
- name: lib-modules - name: lib-modules
hostPath: hostPath:
path: /lib/modules path: /lib/modules


@ -121,7 +121,7 @@ spec:
privileged: true privileged: true
capabilities: capabilities:
add: ["SYS_ADMIN"] add: ["SYS_ADMIN"]
image: vitalif/vitastor-csi:v1.2.0 image: vitalif/vitastor-csi:v1.3.1
args: args:
- "--node=$(NODE_ID)" - "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)" - "--endpoint=$(CSI_ENDPOINT)"


@ -12,9 +12,6 @@ parameters:
etcdVolumePrefix: "" etcdVolumePrefix: ""
poolId: "1" poolId: "1"
# you can choose other configuration file if you have it in the config map # you can choose other configuration file if you have it in the config map
# different etcd URLs and prefixes should also be put in the config
#configPath: "/etc/vitastor/vitastor.conf" #configPath: "/etc/vitastor/vitastor.conf"
# you can also specify etcdUrl here, maybe to connect to another Vitastor cluster
# multiple etcdUrls may be specified, delimited by comma
#etcdUrl: "http://192.168.7.2:2379"
#etcdPrefix: "/vitastor"
allowVolumeExpansion: true allowVolumeExpansion: true


@ -5,7 +5,7 @@ package vitastor
const ( const (
vitastorCSIDriverName = "csi.vitastor.io" vitastorCSIDriverName = "csi.vitastor.io"
vitastorCSIDriverVersion = "1.2.0" vitastorCSIDriverVersion = "1.3.1"
) )
// Config struct fills the parameters of request or user input // Config struct fills the parameters of request or user input


@ -62,7 +62,7 @@ func NewControllerServer(driver *Driver) *ControllerServer
} }
} }
func GetConnectionParams(params map[string]string) (map[string]string, []string, string) func GetConnectionParams(params map[string]string) (map[string]string, error)
{ {
ctxVars := make(map[string]string) ctxVars := make(map[string]string)
configPath := params["configPath"] configPath := params["configPath"]
@ -75,71 +75,69 @@ func GetConnectionParams(params map[string]string) (map[string]string, []string,
ctxVars["configPath"] = configPath ctxVars["configPath"] = configPath
} }
config := make(map[string]interface{}) config := make(map[string]interface{})
if configFD, err := os.Open(configPath); err == nil configFD, err := os.Open(configPath)
if (err != nil)
{ {
defer configFD.Close() return nil, err
data, _ := ioutil.ReadAll(configFD)
json.Unmarshal(data, &config)
} }
// Try to load prefix & etcd URL from the config defer configFD.Close()
data, _ := ioutil.ReadAll(configFD)
json.Unmarshal(data, &config)
// Check etcd URL in the config, but do not use the explicit etcdUrl
// parameter for CLI calls, otherwise users won't be able to later
// change them - storage class parameters are saved in volume IDs
var etcdUrl []string var etcdUrl []string
if (params["etcdUrl"] != "") switch config["etcd_address"].(type)
{ {
ctxVars["etcdUrl"] = params["etcdUrl"] case string:
etcdUrl = strings.Split(params["etcdUrl"], ",") url := strings.TrimSpace(config["etcd_address"].(string))
if (url != "")
{
etcdUrl = strings.Split(url, ",")
}
case []string:
etcdUrl = config["etcd_address"].([]string)
case []interface{}:
for _, url := range config["etcd_address"].([]interface{})
{
s, ok := url.(string)
if (ok)
{
etcdUrl = append(etcdUrl, s)
}
}
} }
if (len(etcdUrl) == 0) if (len(etcdUrl) == 0)
{ {
switch config["etcd_address"].(type) return nil, status.Error(codes.InvalidArgument, "etcd_address is missing in "+configPath)
{
case string:
etcdUrl = strings.Split(config["etcd_address"].(string), ",")
case []string:
etcdUrl = config["etcd_address"].([]string)
}
} }
etcdPrefix := params["etcdPrefix"] return ctxVars, nil
if (etcdPrefix == "") }
func system(program string, args ...string) ([]byte, []byte, error)
{
klog.Infof("Running "+program+" "+strings.Join(args, " "))
c := exec.Command(program, args...)
var stdout, stderr bytes.Buffer
c.Stdout, c.Stderr = &stdout, &stderr
err := c.Run()
if (err != nil)
{ {
etcdPrefix, _ = config["etcd_prefix"].(string) stdoutStr, stderrStr := string(stdout.Bytes()), string(stderr.Bytes())
if (etcdPrefix == "") klog.Errorf(program+" "+strings.Join(args, " ")+" failed: %s, status %s\n", stdoutStr+stderrStr, err)
{ return nil, nil, status.Error(codes.Internal, stdoutStr+stderrStr+" (status "+err.Error()+")")
etcdPrefix = "/vitastor"
}
} }
else return stdout.Bytes(), stderr.Bytes(), nil
{
ctxVars["etcdPrefix"] = etcdPrefix
}
return ctxVars, etcdUrl, etcdPrefix
} }
func invokeCLI(ctxVars map[string]string, args []string) ([]byte, error) func invokeCLI(ctxVars map[string]string, args []string) ([]byte, error)
{ {
if (ctxVars["etcdUrl"] != "")
{
args = append(args, "--etcd_address", ctxVars["etcdUrl"])
}
if (ctxVars["etcdPrefix"] != "")
{
args = append(args, "--etcd_prefix", ctxVars["etcdPrefix"])
}
if (ctxVars["configPath"] != "") if (ctxVars["configPath"] != "")
{ {
args = append(args, "--config_path", ctxVars["configPath"]) args = append(args, "--config_path", ctxVars["configPath"])
} }
c := exec.Command("/usr/bin/vitastor-cli", args...) stdout, _, err := system("/usr/bin/vitastor-cli", args...)
var stdout, stderr bytes.Buffer return stdout, err
c.Stdout = &stdout
c.Stderr = &stderr
err := c.Run()
stderrStr := string(stderr.Bytes())
if (err != nil)
{
klog.Errorf("vitastor-cli %s failed: %s, status %s\n", strings.Join(args, " "), stderrStr, err)
return nil, status.Error(codes.Internal, stderrStr+" (status "+err.Error()+")")
}
return stdout.Bytes(), nil
} }
// Create the volume // Create the volume
@ -174,10 +172,10 @@ func (cs *ControllerServer) CreateVolume(ctx context.Context, req *csi.CreateVol
volSize = ((capRange.GetRequiredBytes() + MB - 1) / MB) * MB volSize = ((capRange.GetRequiredBytes() + MB - 1) / MB) * MB
} }
ctxVars, etcdUrl, _ := GetConnectionParams(req.Parameters) ctxVars, err := GetConnectionParams(req.Parameters)
if (len(etcdUrl) == 0) if (err != nil)
{ {
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf") return nil, err
} }
args := []string{ "create", volName, "-s", fmt.Sprintf("%v", volSize), "--pool", fmt.Sprintf("%v", poolId) } args := []string{ "create", volName, "-s", fmt.Sprintf("%v", volSize), "--pool", fmt.Sprintf("%v", poolId) }
@ -207,7 +205,7 @@ func (cs *ControllerServer) CreateVolume(ctx context.Context, req *csi.CreateVol
} }
// Create image using vitastor-cli // Create image using vitastor-cli
_, err := invokeCLI(ctxVars, args) _, err = invokeCLI(ctxVars, args)
if (err != nil) if (err != nil)
{ {
if (strings.Index(err.Error(), "already exists") > 0) if (strings.Index(err.Error(), "already exists") > 0)
@ -257,7 +255,11 @@ func (cs *ControllerServer) DeleteVolume(ctx context.Context, req *csi.DeleteVol
} }
volName := volVars["name"] volName := volVars["name"]
ctxVars, _, _ := GetConnectionParams(volVars) ctxVars, err := GetConnectionParams(volVars)
if (err != nil)
{
return nil, err
}
_, err = invokeCLI(ctxVars, []string{ "rm", volName }) _, err = invokeCLI(ctxVars, []string{ "rm", volName })
if (err != nil) if (err != nil)
@ -469,7 +471,11 @@ func (cs *ControllerServer) DeleteSnapshot(ctx context.Context, req *csi.DeleteS
volName := volVars["name"] volName := volVars["name"]
snapName := volVars["snapshot"] snapName := volVars["snapshot"]
ctxVars, _, _ := GetConnectionParams(volVars) ctxVars, err := GetConnectionParams(volVars)
if (err != nil)
{
return nil, err
}
_, err = invokeCLI(ctxVars, []string{ "rm", volName+"@"+snapName }) _, err = invokeCLI(ctxVars, []string{ "rm", volName+"@"+snapName })
if (err != nil) if (err != nil)
@ -496,7 +502,11 @@ func (cs *ControllerServer) ListSnapshots(ctx context.Context, req *csi.ListSnap
return nil, status.Error(codes.Internal, "volume ID not in JSON format") return nil, status.Error(codes.Internal, "volume ID not in JSON format")
} }
volName := volVars["name"] volName := volVars["name"]
ctxVars, _, _ := GetConnectionParams(volVars) ctxVars, err := GetConnectionParams(volVars)
if (err != nil)
{
return nil, err
}
inodeCfg, err := invokeList(ctxVars, volName+"@*", false) inodeCfg, err := invokeList(ctxVars, volName+"@*", false)
if (err != nil) if (err != nil)
@ -555,7 +565,11 @@ func (cs *ControllerServer) ControllerExpandVolume(ctx context.Context, req *csi
return nil, status.Error(codes.Internal, "volume ID not in JSON format") return nil, status.Error(codes.Internal, "volume ID not in JSON format")
} }
volName := volVars["name"] volName := volVars["name"]
ctxVars, _, _ := GetConnectionParams(volVars) ctxVars, err := GetConnectionParams(volVars)
if (err != nil)
{
return nil, err
}
inodeCfg, err := invokeList(ctxVars, volName, true) inodeCfg, err := invokeList(ctxVars, volName, true)
if (err != nil) if (err != nil)
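
Reviewer sketch of the `etcd_address` handling introduced in `GetConnectionParams` above. This is a minimal standalone illustration, not the driver's code: `parseEtcdAddress` is a hypothetical helper and the `10.0.0.x` addresses are made up. It relies on one fact about `encoding/json`: when decoding into `map[string]interface{}`, a JSON array always arrives as `[]interface{}`, so only the `string` and `[]interface{}` branches can match after `json.Unmarshal`. Unlike the diff, the sketch also trims each comma-separated piece.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// parseEtcdAddress is a hypothetical helper (not part of the driver).
// It extracts etcd URLs from the JSON text of a vitastor.conf, accepting
// either a comma-separated string or a JSON array of strings.
func parseEtcdAddress(data []byte) ([]string, error) {
	config := make(map[string]interface{})
	if err := json.Unmarshal(data, &config); err != nil {
		return nil, err
	}
	var urls []string
	switch v := config["etcd_address"].(type) {
	case string:
		for _, u := range strings.Split(v, ",") {
			if u = strings.TrimSpace(u); u != "" {
				urls = append(urls, u)
			}
		}
	case []interface{}:
		// encoding/json decodes JSON arrays into []interface{}, never []string
		for _, item := range v {
			if s, ok := item.(string); ok && s != "" {
				urls = append(urls, s)
			}
		}
	}
	if len(urls) == 0 {
		return nil, fmt.Errorf("etcd_address is missing")
	}
	return urls, nil
}

func main() {
	for _, conf := range []string{
		`{"etcd_address":"http://192.168.7.2:2379","etcd_prefix":"/vitastor"}`,
		`{"etcd_address":["http://10.0.0.1:2379","http://10.0.0.2:2379"]}`,
		`{"etcd_prefix":"/vitastor"}`,
	} {
		urls, err := parseEtcdAddress([]byte(conf))
		fmt.Println(urls, err)
	}
}
```

The third configuration has no `etcd_address`, so it takes the error path, which corresponds to the `InvalidArgument` return in the diff.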


@ -5,11 +5,16 @@ package vitastor
import ( import (
"context" "context"
"errors"
"encoding/json"
"fmt"
"os" "os"
"os/exec" "os/exec"
"encoding/json" "path/filepath"
"strconv"
"strings" "strings"
"bytes" "syscall"
"time"
"google.golang.org/grpc/codes" "google.golang.org/grpc/codes"
"google.golang.org/grpc/status" "google.golang.org/grpc/status"
@ -25,16 +30,102 @@ import (
type NodeServer struct type NodeServer struct
{ {
*Driver *Driver
useVduse bool
stateDir string
mounter mount.Interface mounter mount.Interface
restartInterval time.Duration
}
type DeviceState struct
{
ConfigPath string `json:"configPath"`
VdpaId string `json:"vdpaId"`
Image string `json:"image"`
Blockdev string `json:"blockdev"`
Readonly bool `json:"readonly"`
PidFile string `json:"pidFile"`
} }
// NewNodeServer create new instance node // NewNodeServer create new instance node
func NewNodeServer(driver *Driver) *NodeServer func NewNodeServer(driver *Driver) *NodeServer
{ {
return &NodeServer{ stateDir := os.Getenv("STATE_DIR")
if (stateDir == "")
{
stateDir = "/run/vitastor-csi"
}
if (stateDir[len(stateDir)-1] != '/')
{
stateDir += "/"
}
ns := &NodeServer{
Driver: driver, Driver: driver,
useVduse: checkVduseSupport(),
stateDir: stateDir,
mounter: mount.New(""), mounter: mount.New(""),
} }
if (ns.useVduse)
{
ns.restoreVduseDaemons()
dur, err := time.ParseDuration(os.Getenv("RESTART_INTERVAL"))
if (err != nil)
{
dur = 10 * time.Second
}
ns.restartInterval = dur
if (ns.restartInterval != time.Duration(0))
{
go ns.restarter()
}
}
return ns
}
func checkVduseSupport() bool
{
// Check VDUSE support (vdpa, vduse, virtio-vdpa kernel modules)
vduse := true
for _, mod := range []string{"vdpa", "vduse", "virtio-vdpa"}
{
_, err := os.Stat("/sys/module/"+mod)
if (err != nil)
{
if (!errors.Is(err, os.ErrNotExist))
{
klog.Errorf("failed to check /sys/module/%s: %v", mod, err)
}
c := exec.Command("/sbin/modprobe", mod)
c.Stdout = os.Stderr
c.Stderr = os.Stderr
err := c.Run()
if (err != nil)
{
klog.Errorf("/sbin/modprobe %s failed: %v", mod, err)
vduse = false
break
}
}
}
// Check that vdpa tool functions
if (vduse)
{
c := exec.Command("/sbin/vdpa", "-j", "dev")
c.Stderr = os.Stderr
err := c.Run()
if (err != nil)
{
klog.Errorf("/sbin/vdpa -j dev failed: %v", err)
vduse = false
}
}
if (!vduse)
{
klog.Errorf(
"Your host apparently has no VDUSE support. VDUSE support disabled, NBD will be used to map devices."+
" For VDUSE you need at least Linux 5.15 and the following kernel modules: vdpa, virtio-vdpa, vduse.",
)
}
return vduse
} }
// NodeStageVolume mounts the volume to a staging path on the node. // NodeStageVolume mounts the volume to a staging path on the node.
@ -61,6 +152,318 @@ func Contains(list []string, s string) bool
return false return false
} }
func (ns *NodeServer) mapNbd(volName string, ctxVars map[string]string, readonly bool) (string, error)
{
// Map NBD device
// FIXME: Check if already mapped
args := []string{
"map", "--image", volName,
}
if (ctxVars["configPath"] != "")
{
args = append(args, "--config_path", ctxVars["configPath"])
}
if (readonly)
{
args = append(args, "--readonly", "1")
}
stdout, stderr, err := system("/usr/bin/vitastor-nbd", args...)
if (err != nil)
{
// system() returns nil output on failure, so report its error as is
return "", err
}
dev := strings.TrimSpace(string(stdout))
if (dev == "")
{
return "", fmt.Errorf("vitastor-nbd did not return the name of NBD device. output: %s", stderr)
}
return dev, nil
}
func (ns *NodeServer) unmapNbd(devicePath string)
{
// unmap NBD device
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
}
func findByPidFile(pidFile string) (*os.Process, error)
{
klog.Infof("looking up process with PID from file %s", pidFile)
pidBuf, err := os.ReadFile(pidFile)
if (err != nil)
{
return nil, err
}
pid, err := strconv.ParseInt(strings.TrimSpace(string(pidBuf)), 0, 64)
if (err != nil)
{
return nil, err
}
proc, err := os.FindProcess(int(pid))
if (err != nil)
{
return nil, err
}
return proc, nil
}
func killByPidFile(pidFile string) error
{
proc, err := findByPidFile(pidFile)
if (err != nil)
{
return err
}
return proc.Signal(syscall.SIGTERM)
}
func startStorageDaemon(vdpaId, volName, pidFile, configPath string, readonly bool) error
{
// Start qemu-storage-daemon
blockSpec := map[string]interface{}{
"node-name": "disk1",
"driver": "vitastor",
"image": volName,
"cache": map[string]bool{
"direct": true,
"no-flush": false,
},
"discard": "unmap",
}
if (configPath != "")
{
blockSpec["config-path"] = configPath
}
blockSpecJson, _ := json.Marshal(blockSpec)
writable := "true"
if (readonly)
{
writable = "false"
}
_, _, err := system(
"/usr/bin/qemu-storage-daemon", "--daemonize", "--pidfile", pidFile, "--blockdev", string(blockSpecJson),
"--export", "vduse-blk,id="+vdpaId+",node-name=disk1,name="+vdpaId+",num-queues=16,queue-size=128,writable="+writable,
)
return err
}
func (ns *NodeServer) mapVduse(volName string, ctxVars map[string]string, readonly bool) (string, string, error)
{
// Generate state file
stateFd, err := os.CreateTemp(ns.stateDir, "vitastor-vduse-*.json")
if (err != nil)
{
return "", "", err
}
stateFile := stateFd.Name()
stateFd.Close()
vdpaId := filepath.Base(stateFile)
vdpaId = vdpaId[0:len(vdpaId)-5] // remove ".json"
pidFile := ns.stateDir + vdpaId + ".pid"
// Map VDUSE device via qemu-storage-daemon
err = startStorageDaemon(vdpaId, volName, pidFile, ctxVars["configPath"], readonly)
if (err == nil)
{
// Add device to VDPA bus
_, _, err = system("/sbin/vdpa", "-j", "dev", "add", "name", vdpaId, "mgmtdev", "vduse")
if (err == nil)
{
// Find block device name
var matches []string
matches, err = filepath.Glob("/sys/bus/vdpa/devices/"+vdpaId+"/virtio*/block/*")
if (err == nil && len(matches) == 0)
{
err = errors.New("/sys/bus/vdpa/devices/"+vdpaId+"/virtio*/block/* is not found")
}
if (err == nil)
{
blockdev := "/dev/"+filepath.Base(matches[0])
_, err = os.Stat(blockdev)
if (err == nil)
{
// Generate state file
stateJSON, _ := json.Marshal(&DeviceState{
ConfigPath: ctxVars["configPath"],
VdpaId: vdpaId,
Image: volName,
Blockdev: blockdev,
Readonly: readonly,
PidFile: pidFile,
})
err = os.WriteFile(stateFile, stateJSON, 0600)
if (err == nil)
{
return blockdev, vdpaId, nil
}
}
}
}
killErr := killByPidFile(pidFile)
if (killErr != nil)
{
klog.Errorf("Failed to kill started qemu-storage-daemon: %v", killErr)
}
os.Remove(stateFile)
os.Remove(pidFile)
}
return "", "", err
}
func (ns *NodeServer) unmapVduse(devicePath string)
{
if (len(devicePath) < 6 || devicePath[0:6] != "/dev/v")
{
klog.Errorf("%s does not start with /dev/v", devicePath)
return
}
vduseDev, err := os.Readlink("/sys/block/"+devicePath[5:])
if (err != nil)
{
klog.Errorf("%s is not a symbolic link to VDUSE device (../devices/virtual/vduse/xxx): %v", devicePath, err)
return
}
vdpaId := ""
p := strings.Index(vduseDev, "/vduse/")
if (p >= 0)
{
vduseDev = vduseDev[p+7:]
p = strings.Index(vduseDev, "/")
if (p >= 0)
{
vdpaId = vduseDev[0:p]
}
}
if (vdpaId == "")
{
klog.Errorf("%s is not a symbolic link to VDUSE device (../devices/virtual/vduse/xxx), but is %v", devicePath, vduseDev)
return
}
ns.unmapVduseById(vdpaId)
}
func (ns *NodeServer) unmapVduseById(vdpaId string)
{
_, err := os.Stat("/sys/bus/vdpa/devices/"+vdpaId)
if (err != nil)
{
klog.Errorf("failed to stat /sys/bus/vdpa/devices/"+vdpaId+": %v", err)
}
else
{
_, _, _ = system("/sbin/vdpa", "-j", "dev", "del", vdpaId)
}
stateFile := ns.stateDir + vdpaId + ".json"
os.Remove(stateFile)
pidFile := ns.stateDir + vdpaId + ".pid"
_, err = os.Stat(pidFile)
if (os.IsNotExist(err))
{
// ok, already killed
}
else if (err != nil)
{
klog.Errorf("Failed to stat %v: %v", pidFile, err)
return
}
else
{
err = killByPidFile(pidFile)
if (err != nil)
{
klog.Errorf("Failed to kill started qemu-storage-daemon: %v", err)
}
os.Remove(pidFile)
}
}
func (ns *NodeServer) restarter()
{
// Restart dead VDUSE daemons at regular intervals
// Otherwise volume I/O may hang in case of a qemu-storage-daemon crash
// Moreover, it may lead to a kernel panic if the kernel is configured to
// panic on hung tasks
ticker := time.NewTicker(ns.restartInterval)
defer ticker.Stop()
for
{
<-ticker.C
ns.restoreVduseDaemons()
}
}
func (ns *NodeServer) restoreVduseDaemons()
{
pattern := ns.stateDir+"vitastor-vduse-*.json"
matches, err := filepath.Glob(pattern)
if (err != nil)
{
klog.Errorf("failed to list %s: %v", pattern, err)
}
if (len(matches) == 0)
{
return
}
devList := make(map[string]interface{})
// example output: {"dev":{"test1":{"type":"block","mgmtdev":"vduse","vendor_id":0,"max_vqs":16,"max_vq_size":128}}}
devListJSON, _, err := system("/sbin/vdpa", "-j", "dev", "list")
if (err != nil)
{
return
}
err = json.Unmarshal(devListJSON, &devList)
devs, ok := devList["dev"].(map[string]interface{})
if (err != nil || !ok)
{
klog.Errorf("/sbin/vdpa -j dev list returned bad JSON (error %v): %v", err, string(devListJSON))
return
}
for _, stateFile := range matches
{
vdpaId := filepath.Base(stateFile)
vdpaId = vdpaId[0:len(vdpaId)-5]
// Check if VDPA device is still added to the bus
if (devs[vdpaId] != nil)
{
// Check if the storage daemon is still active
pidFile := ns.stateDir + vdpaId + ".pid"
exists := false
proc, err := findByPidFile(pidFile)
if (err == nil)
{
exists = proc.Signal(syscall.Signal(0)) == nil
}
if (!exists)
{
// Restart daemon
stateJSON, err := os.ReadFile(stateFile)
if (err != nil)
{
klog.Warningf("error reading state file %v: %v", stateFile, err)
}
else
{
var state DeviceState
err := json.Unmarshal(stateJSON, &state)
if (err != nil)
{
klog.Warningf("state file %v contains invalid JSON (error %v): %v", stateFile, err, string(stateJSON))
}
else
{
klog.Warningf("restarting storage daemon for volume %v (VDPA ID %v)", state.Image, vdpaId)
_ = startStorageDaemon(vdpaId, state.Image, pidFile, state.ConfigPath, state.Readonly)
}
}
}
}
else
{
// Unused, clean it up
ns.unmapVduseById(vdpaId)
}
}
}
// NodePublishVolume mounts the volume mounted to the staging path to the target path // NodePublishVolume mounts the volume mounted to the staging path to the target path
func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error)
{ {
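
Reviewer sketch of the device-list check used by `restoreVduseDaemons` above. It is a minimal illustration: `vdpaDeviceNames` is a hypothetical helper, and the only input format assumed is the example output quoted in the diff's comment. It decodes the output of `vdpa -j dev list` into a set of device names, which is what the restore loop consults to decide between restarting a daemon and cleaning up stale state.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// vdpaDeviceNames is a hypothetical helper: it returns the set of device
// names found under the "dev" key of `vdpa -j dev list` output.
func vdpaDeviceNames(out []byte) (map[string]bool, error) {
	var parsed struct {
		Dev map[string]json.RawMessage `json:"dev"`
	}
	if err := json.Unmarshal(out, &parsed); err != nil {
		return nil, err
	}
	if parsed.Dev == nil {
		return nil, fmt.Errorf("no \"dev\" object in vdpa output")
	}
	names := make(map[string]bool, len(parsed.Dev))
	for name := range parsed.Dev {
		names[name] = true
	}
	return names, nil
}

func main() {
	// Example output copied from the comment in restoreVduseDaemons
	out := []byte(`{"dev":{"test1":{"type":"block","mgmtdev":"vduse","vendor_id":0,"max_vqs":16,"max_vq_size":128}}}`)
	names, err := vdpaDeviceNames(out)
	fmt.Println(names, err)
}
```

Decoding into a typed struct with `json.RawMessage` values avoids the chain of `interface{}` type assertions while leaving the per-device fields unparsed.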
@ -81,13 +484,13 @@ func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis
if (err != nil) if (err != nil)
{ {
klog.Errorf("failed to create block device mount target %s with error: %v", targetPath, err) klog.Errorf("failed to create block device mount target %s with error: %v", targetPath, err)
return nil, status.Error(codes.Internal, err.Error()) return nil, err
} }
err = pathFile.Close() err = pathFile.Close()
if (err != nil) if (err != nil)
{ {
klog.Errorf("failed to close %s with error: %v", targetPath, err) klog.Errorf("failed to close %s with error: %v", targetPath, err)
return nil, status.Error(codes.Internal, err.Error()) return nil, err
} }
} }
else else
@ -96,13 +499,13 @@ func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis
if (err != nil) if (err != nil)
{ {
klog.Errorf("failed to create fs mount target %s with error: %v", targetPath, err) klog.Errorf("failed to create fs mount target %s with error: %v", targetPath, err)
return nil, status.Error(codes.Internal, err.Error()) return nil, err
} }
} }
} }
else else
{ {
return nil, status.Error(codes.Internal, err.Error()) return nil, err
} }
} }
@ -114,38 +517,25 @@ func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis
} }
volName := ctxVars["name"] volName := ctxVars["name"]
_, etcdUrl, etcdPrefix := GetConnectionParams(ctxVars) _, err = GetConnectionParams(ctxVars)
if (len(etcdUrl) == 0)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
}
// Map NBD device
// FIXME: Check if already mapped
args := []string{
"map", "--etcd_address", strings.Join(etcdUrl, ","),
"--etcd_prefix", etcdPrefix,
"--image", volName,
};
if (ctxVars["configPath"] != "")
{
args = append(args, "--config_path", ctxVars["configPath"])
}
if (req.GetReadonly())
{
args = append(args, "--readonly", "1")
}
c := exec.Command("/usr/bin/vitastor-nbd", args...)
var stdout, stderr bytes.Buffer
c.Stdout, c.Stderr = &stdout, &stderr
err = c.Run()
stdoutStr, stderrStr := string(stdout.Bytes()), string(stderr.Bytes())
if (err != nil) if (err != nil)
{ {
klog.Errorf("vitastor-nbd map failed: %s, status %s\n", stdoutStr+stderrStr, err) return nil, err
return nil, status.Error(codes.Internal, stdoutStr+stderrStr+" (status "+err.Error()+")") }
var devicePath, vdpaId string
if (!ns.useVduse)
{
devicePath, err = ns.mapNbd(volName, ctxVars, req.GetReadonly())
}
else
{
devicePath, vdpaId, err = ns.mapVduse(volName, ctxVars, req.GetReadonly())
}
if (err != nil)
{
return nil, err
} }
devicePath := strings.TrimSpace(stdoutStr)
diskMounter := &mount.SafeFormatAndMount{Interface: ns.mounter, Exec: utilexec.New()} diskMounter := &mount.SafeFormatAndMount{Interface: ns.mounter, Exec: utilexec.New()}
if (isBlock) if (isBlock)
@ -227,13 +617,15 @@ func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis
return &csi.NodePublishVolumeResponse{}, nil return &csi.NodePublishVolumeResponse{}, nil
unmap: unmap:
// unmap NBD device if (!ns.useVduse || len(devicePath) >= 8 && devicePath[0:8] == "/dev/nbd")
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{ {
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr) ns.unmapNbd(devicePath)
} }
return nil, status.Error(codes.Internal, err.Error()) else
{
ns.unmapVduseById(vdpaId)
}
return nil, err
} }
// NodeUnpublishVolume unmounts the volume from the target path // NodeUnpublishVolume unmounts the volume from the target path
@ -248,25 +640,31 @@ func (ns *NodeServer) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpu
{ {
return nil, status.Error(codes.NotFound, "Target path not found") return nil, status.Error(codes.NotFound, "Target path not found")
} }
return nil, status.Error(codes.Internal, err.Error()) return nil, err
} }
if (devicePath == "") if (devicePath == "")
{ {
return nil, status.Error(codes.NotFound, "Volume not mounted") // volume not mounted
klog.Warningf("%s is not a mountpoint, deleting", targetPath)
os.Remove(targetPath)
return &csi.NodeUnpublishVolumeResponse{}, nil
} }
// unmount // unmount
err = mount.CleanupMountPoint(targetPath, ns.mounter, false) err = mount.CleanupMountPoint(targetPath, ns.mounter, false)
if (err != nil) if (err != nil)
{ {
return nil, status.Error(codes.Internal, err.Error()) return nil, err
} }
// unmap NBD device // unmap NBD device
if (refCount == 1) if (refCount == 1)
{ {
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput() if (!ns.useVduse)
if (unmapErr != nil)
{ {
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr) ns.unmapNbd(devicePath)
}
else
{
ns.unmapVduse(devicePath)
} }
} }
return &csi.NodeUnpublishVolumeResponse{}, nil return &csi.NodeUnpublishVolumeResponse{}, nil
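
Reviewer sketch of how `unmapVduse` above recovers the VDPA ID from the `/sys/block/...` symlink target. It is an illustration only: `vduseIDFromLink` is a hypothetical helper, and the sample link targets are assumptions modelled on the `../devices/virtual/vduse/xxx` shape mentioned in the diff's log messages. It uses `strings.Cut`, which requires Go 1.18 or later.

```go
package main

import (
	"fmt"
	"strings"
)

// vduseIDFromLink is a hypothetical helper: it returns the path component
// that follows "/vduse/" in a sysfs link target. As in the diff, the ID must
// be followed by another "/" to be accepted.
func vduseIDFromLink(target string) (string, bool) {
	_, rest, found := strings.Cut(target, "/vduse/")
	if !found {
		return "", false
	}
	id, _, found := strings.Cut(rest, "/")
	if !found || id == "" {
		return "", false
	}
	return id, true
}

func main() {
	// Both link targets below are made-up examples
	for _, target := range []string{
		"../devices/virtual/vduse/vitastor-vduse-123/vdpa0/virtio1/block/vda",
		"../devices/virtual/block/nbd0",
	} {
		id, ok := vduseIDFromLink(target)
		fmt.Printf("%q -> %q %v\n", target, id, ok)
	}
}
```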

debian/changelog vendored

@ -1,10 +1,10 @@
vitastor (1.2.0-1) unstable; urgency=medium vitastor (1.3.1-1) unstable; urgency=medium
* Bugfixes * Bugfixes
-- Vitaliy Filippov <vitalif@yourcmc.ru> Fri, 03 Jun 2022 02:09:44 +0300 -- Vitaliy Filippov <vitalif@yourcmc.ru> Fri, 03 Jun 2022 02:09:44 +0300
vitastor (1.2.0-1) unstable; urgency=medium vitastor (0.7.0-1) unstable; urgency=medium
* Implement NFS proxy * Implement NFS proxy
* Add documentation * Add documentation


@ -7,7 +7,7 @@ ARG REL=
WORKDIR /root WORKDIR /root
RUN if [ "$REL" = "buster" -o "$REL" = "bullseye" ]; then \ RUN if [ "$REL" = "buster" -o "$REL" = "bullseye" -o "$REL" = "bookworm" ]; then \
echo "deb http://deb.debian.org/debian $REL-backports main" >> /etc/apt/sources.list; \ echo "deb http://deb.debian.org/debian $REL-backports main" >> /etc/apt/sources.list; \
echo >> /etc/apt/preferences; \ echo >> /etc/apt/preferences; \
echo 'Package: *' >> /etc/apt/preferences; \ echo 'Package: *' >> /etc/apt/preferences; \
@ -45,7 +45,7 @@ RUN set -e; \
rm -rf /root/packages/qemu-$REL/*; \ rm -rf /root/packages/qemu-$REL/*; \
cd /root/packages/qemu-$REL; \ cd /root/packages/qemu-$REL; \
dpkg-source -x /root/qemu*.dsc; \ dpkg-source -x /root/qemu*.dsc; \
QEMU_VER=$(ls -d qemu*/ | perl -pe 's!^.*(\d+\.\d+).*!$1!'); \ QEMU_VER=$(ls -d qemu*/ | perl -pe 's!^.*?(\d+\.\d+).*!$1!'); \
D=$(ls -d qemu*/); \ D=$(ls -d qemu*/); \
cp /root/vitastor/patches/qemu-$QEMU_VER-vitastor.patch ./qemu-*/debian/patches; \ cp /root/vitastor/patches/qemu-$QEMU_VER-vitastor.patch ./qemu-*/debian/patches; \
echo qemu-$QEMU_VER-vitastor.patch >> $D/debian/patches/series; \ echo qemu-$QEMU_VER-vitastor.patch >> $D/debian/patches/series; \


@ -35,8 +35,8 @@ RUN set -e -x; \
mkdir -p /root/packages/vitastor-$REL; \ mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/packages/vitastor-$REL/*; \ rm -rf /root/packages/vitastor-$REL/*; \
cd /root/packages/vitastor-$REL; \ cd /root/packages/vitastor-$REL; \
cp -r /root/vitastor vitastor-1.2.0; \ cp -r /root/vitastor vitastor-1.3.1; \
cd vitastor-1.2.0; \ cd vitastor-1.3.1; \
ln -s /root/fio-build/fio-*/ ./fio; \ ln -s /root/fio-build/fio-*/ ./fio; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \ FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
ls /usr/include/linux/raw.h || cp ./debian/raw.h /usr/include/linux/raw.h; \ ls /usr/include/linux/raw.h || cp ./debian/raw.h /usr/include/linux/raw.h; \
@ -49,8 +49,8 @@ RUN set -e -x; \
rm -rf a b; \ rm -rf a b; \
echo "dep:fio=$FIO" > debian/fio_version; \ echo "dep:fio=$FIO" > debian/fio_version; \
cd /root/packages/vitastor-$REL; \ cd /root/packages/vitastor-$REL; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_1.2.0.orig.tar.xz vitastor-1.2.0; \ tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_1.3.1.orig.tar.xz vitastor-1.3.1; \
cd vitastor-1.2.0; \ cd vitastor-1.3.1; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \ V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \ DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \ DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \


@ -6,8 +6,8 @@
# Client Parameters # Client Parameters
These parameters apply only to clients and affect their interaction with These parameters apply only to Vitastor clients (QEMU, fio, NBD and so on) and
the cluster. affect their interaction with the cluster.
- [client_max_dirty_bytes](#client_max_dirty_bytes) - [client_max_dirty_bytes](#client_max_dirty_bytes)
- [client_max_dirty_ops](#client_max_dirty_ops) - [client_max_dirty_ops](#client_max_dirty_ops)
@ -15,6 +15,9 @@ the cluster.
- [client_max_buffered_bytes](#client_max_buffered_bytes) - [client_max_buffered_bytes](#client_max_buffered_bytes)
- [client_max_buffered_ops](#client_max_buffered_ops) - [client_max_buffered_ops](#client_max_buffered_ops)
- [client_max_writeback_iodepth](#client_max_writeback_iodepth) - [client_max_writeback_iodepth](#client_max_writeback_iodepth)
- [nbd_timeout](#nbd_timeout)
- [nbd_max_devices](#nbd_max_devices)
- [nbd_max_part](#nbd_max_part)
## client_max_dirty_bytes ## client_max_dirty_bytes
@ -101,3 +104,34 @@ Multiple consecutive modified data regions are counted as 1 write here.
- Can be changed online: yes - Can be changed online: yes
Maximum number of parallel writes when flushing buffered data to the server. Maximum number of parallel writes when flushing buffered data to the server.
## nbd_timeout
- Type: seconds
- Default: 300
Timeout for I/O operations on [NBD](../usage/nbd.en.md) devices. If an operation
takes longer than this timeout, for example because the cluster is temporarily
unavailable for longer than the timeout, the NBD device detaches by itself
(and possibly breaks the mounted file system).
You can set the timeout to 0 to never detach, but in that case you won't be
able to remove the kernel device at all if the NBD process dies: you'll have
to reboot the host.
## nbd_max_devices
- Type: integer
- Default: 64
Maximum number of NBD devices in the system. This value is passed as the
`nbds_max` parameter to the nbd kernel module when vitastor-nbd autoloads it.
## nbd_max_part
- Type: integer
- Default: 3
Maximum number of partitions per NBD device. This value is passed as the
`max_part` parameter to the nbd kernel module when vitastor-nbd autoloads it.
Note that (nbds_max)*(1+max_part) usually can't exceed 256.
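
The constraint in the note above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical and the limit of 256 is taken from the note as an assumption, not read from the kernel. With the documented defaults, 64 devices and 3 partitions give 64 * (1 + 3) = 256, which exactly fits; raising `nbd_max_devices` to 128 without lowering `nbd_max_part` gives 512, which does not.

```go
package main

import "fmt"

// assumedLimit is the value of 256 quoted in the note; treat it as an assumption.
const assumedLimit = 256

// nbdSlotsNeeded is a hypothetical helper computing (nbds_max)*(1+max_part).
func nbdSlotsNeeded(nbdsMax, maxPart int) int {
	return nbdsMax * (1 + maxPart)
}

// fitsLimit reports whether the combination stays within the assumed limit.
func fitsLimit(nbdsMax, maxPart int) bool {
	return nbdSlotsNeeded(nbdsMax, maxPart) <= assumedLimit
}

func main() {
	fmt.Println(nbdSlotsNeeded(64, 3), fitsLimit(64, 3))   // 256 true
	fmt.Println(nbdSlotsNeeded(128, 3), fitsLimit(128, 3)) // 512 false
	fmt.Println(nbdSlotsNeeded(128, 1), fitsLimit(128, 1)) // 256 true
}
```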


@ -6,7 +6,7 @@
# Client Code Parameters # Client Code Parameters
These parameters apply only to Vitastor clients (QEMU, fio, NBD) and These parameters apply only to Vitastor clients (QEMU, fio, NBD and so on) and
affect the logic of their interaction with the cluster. affect the logic of their interaction with the cluster.
- [client_max_dirty_bytes](#client_max_dirty_bytes) - [client_max_dirty_bytes](#client_max_dirty_bytes)
@ -15,6 +15,9 @@
- [client_max_buffered_bytes](#client_max_buffered_bytes) - [client_max_buffered_bytes](#client_max_buffered_bytes)
- [client_max_buffered_ops](#client_max_buffered_ops) - [client_max_buffered_ops](#client_max_buffered_ops)
- [client_max_writeback_iodepth](#client_max_writeback_iodepth) - [client_max_writeback_iodepth](#client_max_writeback_iodepth)
- [nbd_timeout](#nbd_timeout)
- [nbd_max_devices](#nbd_max_devices)
- [nbd_max_part](#nbd_max_part)
## client_max_dirty_bytes ## client_max_dirty_bytes
@ -101,3 +104,34 @@
- Можно менять на лету: да - Можно менять на лету: да
Максимальное число параллельных операций записи при сбросе буферов на сервер. Максимальное число параллельных операций записи при сбросе буферов на сервер.
## nbd_timeout
- Тип: секунды
- Значение по умолчанию: 300
Таймаут для операций чтения/записи через [NBD](../usage/nbd.ru.md). Если
операция выполняется дольше таймаута, включая временную недоступность
кластера на время, большее таймаута, NBD-устройство отключится само собой
(и, возможно, сломает примонтированную ФС).
Вы можете установить таймаут в 0, чтобы никогда не отключать устройство по
таймауту, но в этом случае вы вообще не сможете удалить устройство, если
процесс NBD умрёт - вам придётся перезагружать сервер.
## nbd_max_devices
- Тип: целое число
- Значение по умолчанию: 64
Максимальное число NBD-устройств в системе. Данное значение передаётся
модулю ядра nbd как параметр `nbds_max`, когда его загружает vitastor-nbd.
## nbd_max_part
- Тип: целое число
- Значение по умолчанию: 3
Максимальное число разделов на одном NBD-устройстве. Данное значение передаётся
модулю ядра nbd как параметр `max_part`, когда его загружает vitastor-nbd.
Имейте в виду, что (nbds_max)*(1+max_part) обычно не может превышать 256.


@ -19,6 +19,7 @@ them, even without restarting by updating configuration in etcd.
- [autosync_interval](#autosync_interval) - [autosync_interval](#autosync_interval)
- [autosync_writes](#autosync_writes) - [autosync_writes](#autosync_writes)
- [recovery_queue_depth](#recovery_queue_depth) - [recovery_queue_depth](#recovery_queue_depth)
- [recovery_sleep_us](#recovery_sleep_us)
- [recovery_pg_switch](#recovery_pg_switch) - [recovery_pg_switch](#recovery_pg_switch)
- [recovery_sync_batch](#recovery_sync_batch) - [recovery_sync_batch](#recovery_sync_batch)
- [readonly](#readonly) - [readonly](#readonly)
@ -51,6 +52,13 @@ them, even without restarting by updating configuration in etcd.
- [scrub_list_limit](#scrub_list_limit) - [scrub_list_limit](#scrub_list_limit)
- [scrub_find_best](#scrub_find_best) - [scrub_find_best](#scrub_find_best)
- [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce) - [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce)
- [recovery_tune_interval](#recovery_tune_interval)
- [recovery_tune_util_low](#recovery_tune_util_low)
- [recovery_tune_util_high](#recovery_tune_util_high)
- [recovery_tune_client_util_low](#recovery_tune_client_util_low)
- [recovery_tune_client_util_high](#recovery_tune_client_util_high)
- [recovery_tune_agg_interval](#recovery_tune_agg_interval)
- [recovery_tune_sleep_min_us](#recovery_tune_sleep_min_us)
## etcd_report_interval ## etcd_report_interval
@ -135,12 +143,24 @@ operations before issuing an fsync operation internally.
 ## recovery_queue_depth
 - Type: integer
-- Default: 4
+- Default: 1
 - Can be changed online: yes
-Maximum recovery operations per one primary OSD at any given moment of time.
-Currently it's the only parameter available to tune the speed or recovery
-and rebalancing, but it's planned to implement more.
+Maximum number of recovery and rebalance operations initiated by each OSD in parallel.
+Note that each OSD talks to a lot of other OSDs, so the actual number of parallel
+recovery operations per OSD is greater than just recovery_queue_depth.
+Increasing this parameter can speed up recovery if [auto-tuning](#recovery_tune_interval)
+allows it or if auto-tuning is disabled.
+## recovery_sleep_us
+- Type: microseconds
+- Default: 0
+- Can be changed online: yes
+Delay for all recovery- and rebalance- related operations. If non-zero,
+such operations are artificially slowed down to reduce the impact on
+client I/O.
 ## recovery_pg_switch
@ -508,3 +528,81 @@ the variant with most available equal copies is correct. For example, if
you have 3 replicas and 1 of them differs, this one is considered to be you have 3 replicas and 1 of them differs, this one is considered to be
corrupted. But if there is no "best" version with more copies than all corrupted. But if there is no "best" version with more copies than all
others have then the object is also marked as inconsistent. others have then the object is also marked as inconsistent.
## recovery_tune_interval
- Type: seconds
- Default: 1
- Can be changed online: yes
Interval at which the OSD re-evaluates client and recovery load and automatically
adjusts [recovery_sleep_us](#recovery_sleep_us). Recovery auto-tuning is
disabled if recovery_tune_interval is set to 0.
Auto-tuning targets utilization. Utilization is a measure of load and is
equal to the product of iops and average latency (so it may be greater
than 1). You set "low" and "high" client utilization thresholds and two
corresponding target recovery utilization levels. The OSD calculates the desired
recovery utilization from client utilization using linear interpolation
and auto-tunes the recovery operation delay to make the actual recovery
utilization match the desired one.
This reduces the impact of recovery/rebalance on client operations. It is
of course impossible to remove it completely, but it should become acceptable.
In some tests, rebalance previously dropped client write speed from 1.5 GB/s
to 50-100 MB/s; with default auto-tuning settings it now only drops
to ~1 GB/s.
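
The interpolation described above can be sketched as follows. This is an illustration under stated assumptions, not the actual OSD code: the function names are hypothetical, latency is assumed to be in seconds, and the default arguments are the documented defaults of the four `recovery_tune_*util*` parameters described in the sections below.

```python
# Illustrative sketch only; the real OSD implementation may differ.

def utilization(iops: float, avg_latency_s: float) -> float:
    """Utilization as defined above: iops times average latency (seconds assumed)."""
    return iops * avg_latency_s


def desired_recovery_util(
    client_util: float,
    client_util_low: float = 0.0,   # recovery_tune_client_util_low default
    client_util_high: float = 0.5,  # recovery_tune_client_util_high default
    util_low: float = 0.1,          # recovery_tune_util_low default
    util_high: float = 1.0,         # recovery_tune_util_high default
) -> float:
    """Linearly interpolate the target recovery utilization.

    Low client load maps to util_high, high client load maps to util_low.
    """
    if client_util <= client_util_low:
        return util_high
    if client_util >= client_util_high:
        return util_low
    # Position of client_util between the two thresholds, in the range (0, 1).
    frac = (client_util - client_util_low) / (client_util_high - client_util_low)
    return util_high + frac * (util_low - util_high)


# Example: 500 client iops at 0.5 ms average latency.
client = utilization(iops=500, avg_latency_s=0.0005)
print(client, desired_recovery_util(client))
```

In this sketch the recovery target falls from 1 to 0.1 as client utilization rises from 0 to 0.5, and stays at 0.1 beyond that. The auto-tuner would then adjust recovery_sleep_us until the measured recovery utilization approaches the target.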
## recovery_tune_util_low
- Type: number
- Default: 0.1
- Can be changed online: yes
Desired recovery/rebalance utilization when client load is high, i.e. when
it is at or above recovery_tune_client_util_high.
## recovery_tune_util_high
- Type: number
- Default: 1
- Can be changed online: yes
Desired recovery/rebalance utilization when client load is low, i.e. when
it is at or below recovery_tune_client_util_low.
## recovery_tune_client_util_low
- Type: number
- Default: 0
- Can be changed online: yes
Client utilization considered "low".
## recovery_tune_client_util_high
- Type: number
- Default: 0.5
- Can be changed online: yes
Client utilization considered "high".
## recovery_tune_agg_interval
- Type: integer
- Default: 10
- Can be changed online: yes
The number of most recent auto-tuning iterations averaged to calculate the
delay. Lower values result in a quicker response to client load changes,
higher values result in a more stable delay. The default value of 10
is usually fine.
## recovery_tune_sleep_min_us
- Type: microseconds
- Default: 10
- Can be changed online: yes
Minimum possible value for auto-tuned recovery_sleep_us. Values lower
than this value are changed to 0.
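
The last two parameters can be pictured as a moving average followed by a cut-off. The sketch below assumes that the averaged quantity is the per-iteration delay estimate; the class `SleepSmoother` and its method names are hypothetical and not taken from the Vitastor source.

```python
from collections import deque


class SleepSmoother:
    """Hypothetical sketch: average recent delay estimates and zero out tiny values."""

    def __init__(self, agg_interval: int = 10, sleep_min_us: float = 10):
        # Keeps only the last `agg_interval` estimates (recovery_tune_agg_interval).
        self.history = deque(maxlen=agg_interval)
        # Averages below this threshold become 0 (recovery_tune_sleep_min_us).
        self.sleep_min_us = sleep_min_us

    def update(self, estimated_sleep_us: float) -> float:
        """Add one per-iteration estimate and return the smoothed delay."""
        self.history.append(estimated_sleep_us)
        avg = sum(self.history) / len(self.history)
        return avg if avg >= self.sleep_min_us else 0


smoother = SleepSmoother(agg_interval=2, sleep_min_us=10)
for estimate in (4, 40, 100):
    print(smoother.update(estimate))
```

A smaller window makes the result follow each new estimate more closely, while a larger one smooths out spikes, which matches the trade-off described for recovery_tune_agg_interval.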


@ -20,6 +20,7 @@
- [autosync_interval](#autosync_interval) - [autosync_interval](#autosync_interval)
- [autosync_writes](#autosync_writes) - [autosync_writes](#autosync_writes)
- [recovery_queue_depth](#recovery_queue_depth) - [recovery_queue_depth](#recovery_queue_depth)
- [recovery_sleep_us](#recovery_sleep_us)
- [recovery_pg_switch](#recovery_pg_switch) - [recovery_pg_switch](#recovery_pg_switch)
- [recovery_sync_batch](#recovery_sync_batch) - [recovery_sync_batch](#recovery_sync_batch)
- [readonly](#readonly) - [readonly](#readonly)
@ -52,6 +53,13 @@
- [scrub_list_limit](#scrub_list_limit) - [scrub_list_limit](#scrub_list_limit)
- [scrub_find_best](#scrub_find_best) - [scrub_find_best](#scrub_find_best)
- [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce) - [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce)
- [recovery_tune_interval](#recovery_tune_interval)
- [recovery_tune_util_low](#recovery_tune_util_low)
- [recovery_tune_util_high](#recovery_tune_util_high)
- [recovery_tune_client_util_low](#recovery_tune_client_util_low)
- [recovery_tune_client_util_high](#recovery_tune_client_util_high)
- [recovery_tune_agg_interval](#recovery_tune_agg_interval)
- [recovery_tune_sleep_min_us](#recovery_tune_sleep_min_us)
## etcd_report_interval ## etcd_report_interval
@ -138,13 +146,25 @@ OSD, чтобы успевать очищать журнал - без них OSD
 ## recovery_queue_depth
 - Тип: целое число
-- Значение по умолчанию: 4
+- Значение по умолчанию: 1
 - Можно менять на лету: да
-Максимальное число операций восстановления на одном первичном OSD в любой
-момент времени. На данный момент единственный параметр, который можно менять
-для ускорения или замедления восстановления и перебалансировки данных, но
-в планах реализация других параметров.
+Максимальное число параллельных операций восстановления, инициируемых одним
+OSD в любой момент времени. Имейте в виду, что каждый OSD обычно работает с
+многими другими OSD, так что на практике параллелизм восстановления больше,
+чем просто recovery_queue_depth. Увеличение значения этого параметра может
+ускорить восстановление если [автотюнинг скорости](#recovery_tune_interval)
+разрешает это или если он отключён.
+## recovery_sleep_us
+- Тип: микросекунды
+- Значение по умолчанию: 0
+- Можно менять на лету: да
+Delay for all recovery- and rebalance- related operations. If non-zero,
+such operations are artificially slowed down to reduce the impact on
+client I/O.
 ## recovery_pg_switch
@ -535,3 +555,83 @@ EC (кодов коррекции ошибок) с более, чем 1 диск
считается некорректной. Однако, если "лучшую" версию с числом доступных считается некорректной. Однако, если "лучшую" версию с числом доступных
копий большим, чем у всех других версий, найти невозможно, то объект тоже копий большим, чем у всех других версий, найти невозможно, то объект тоже
маркируется неконсистентным. маркируется неконсистентным.
## recovery_tune_interval
- Тип: секунды
- Значение по умолчанию: 1
- Можно менять на лету: да
Интервал, с которым OSD пересматривает клиентскую нагрузку и нагрузку
восстановления и автоматически подстраивает [recovery_sleep_us](#recovery_sleep_us).
Автотюнинг (автоподстройка) отключается, если recovery_tune_interval
устанавливается в значение 0.
Автотюнинг регулирует утилизацию. Утилизация является мерой нагрузки
и равна произведению числа операций в секунду и средней задержки
(то есть, она может быть выше 1). Вы задаёте два уровня клиентской
утилизации - "низкий" и "высокий" (low и high) и два соответствующих
целевых уровня утилизации операциями восстановления. OSD рассчитывает
желаемый уровень утилизации восстановления линейной интерполяцией от
клиентской утилизации и подстраивает задержку операций восстановления
так, чтобы фактическая утилизация восстановления совпадала с желаемой.
Это позволяет снизить влияние восстановления и ребаланса на клиентские
операции. Конечно, невозможно исключить такое влияние полностью, но оно
должно становиться адекватнее. В некоторых тестах перебалансировка могла
снижать клиентскую скорость записи с 1.5 ГБ/с до 50-100 МБ/с, а теперь, с
настройками автотюнинга по умолчанию, она снижается только до ~1 ГБ/с.
## recovery_tune_util_low
- Тип: число
- Значение по умолчанию: 0.1
- Можно менять на лету: да
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
высокая, то есть, находится на уровне или выше recovery_tune_client_util_high.
## recovery_tune_util_high
- Тип: число
- Значение по умолчанию: 1
- Можно менять на лету: да
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
низкая, то есть, находится на уровне или ниже recovery_tune_client_util_low.
## recovery_tune_client_util_low
- Тип: число
- Значение по умолчанию: 0
- Можно менять на лету: да
Клиентская утилизация, которая считается "низкой".
## recovery_tune_client_util_high
- Тип: число
- Значение по умолчанию: 0.5
- Можно менять на лету: да
Клиентская утилизация, которая считается "высокой".
## recovery_tune_agg_interval
- Тип: целое число
- Значение по умолчанию: 10
- Можно менять на лету: да
Число последних итераций автоподстройки для расчёта задержки как среднего
значения. Меньшие значения параметра ускоряют отклик на изменение нагрузки,
большие значения делают задержку стабильнее. Значение по умолчанию 10
обычно нормальное и не требует изменений.
## recovery_tune_sleep_min_us
- Тип: микросекунды
- Значение по умолчанию: 10
- Можно менять на лету: да
Минимальное возможное значение авто-подстроенного recovery_sleep_us.
Значения ниже данного заменяются на 0.


@ -1,4 +1,4 @@
 # Client Parameters
-These parameters apply only to clients and affect their interaction with
-the cluster.
+These parameters apply only to Vitastor clients (QEMU, fio, NBD and so on) and
+affect their interaction with the cluster.


@ -1,4 +1,4 @@
 # Параметры клиентского кода
-Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD) и
+Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD и т.п.) и
 затрагивают логику их работы с кластером.


@ -122,3 +122,47 @@
Maximum number of parallel writes when flushing buffered data to the server. Maximum number of parallel writes when flushing buffered data to the server.
info_ru: | info_ru: |
Максимальное число параллельных операций записи при сбросе буферов на сервер. Максимальное число параллельных операций записи при сбросе буферов на сервер.
- name: nbd_timeout
type: sec
default: 300
online: false
info: |
Timeout for I/O operations on [NBD](../usage/nbd.en.md) devices. If an operation
takes longer than this timeout, including when your cluster is just temporarily
down for longer than the timeout, the NBD device will detach by itself
(and possibly break the mounted file system).
You can set the timeout to 0 to never detach, but in that case you won't be
able to remove the kernel device at all if the NBD process dies - you'll have
to reboot the host.
info_ru: |
Таймаут для операций чтения/записи через [NBD](../usage/nbd.ru.md). Если
операция выполняется дольше таймаута, включая временную недоступность
кластера на время, большее таймаута, NBD-устройство отключится само собой
(и, возможно, сломает примонтированную ФС).
Вы можете установить таймаут в 0, чтобы никогда не отключать устройство по
таймауту, но в этом случае вы вообще не сможете удалить устройство, если
процесс NBD умрёт - вам придётся перезагружать сервер.
- name: nbd_max_devices
type: int
default: 64
online: false
info: |
Maximum number of NBD devices in the system. This value is passed as the
`nbds_max` parameter to the nbd kernel module when vitastor-nbd autoloads it.
info_ru: |
Максимальное число NBD-устройств в системе. Данное значение передаётся
модулю ядра nbd как параметр `nbds_max`, когда его загружает vitastor-nbd.
- name: nbd_max_part
type: int
default: 3
online: false
info: |
Maximum number of partitions per NBD device. This value is passed as the
`max_part` parameter to the nbd kernel module when vitastor-nbd autoloads it.
Note that (nbds_max)*(1+max_part) usually can't exceed 256.
info_ru: |
Максимальное число разделов на одном NBD-устройстве. Данное значение передаётся
модулю ядра nbd как параметр `max_part`, когда его загружает vitastor-nbd.
Имейте в виду, что (nbds_max)*(1+max_part) обычно не может превышать 256.


@ -38,6 +38,7 @@ const types = {
bool: 'boolean', bool: 'boolean',
int: 'integer', int: 'integer',
sec: 'seconds', sec: 'seconds',
float: 'number',
ms: 'milliseconds', ms: 'milliseconds',
us: 'microseconds', us: 'microseconds',
}, },
@ -46,6 +47,7 @@ const types = {
bool: 'булево (да/нет)', bool: 'булево (да/нет)',
int: 'целое число', int: 'целое число',
sec: 'секунды', sec: 'секунды',
float: 'число',
ms: 'миллисекунды', ms: 'миллисекунды',
us: 'микросекунды', us: 'микросекунды',
}, },


@ -107,17 +107,29 @@
 принудительной отправкой fsync-а.
 - name: recovery_queue_depth
 type: int
-default: 4
+default: 1
 online: true
 info: |
-Maximum recovery operations per one primary OSD at any given moment of time.
-Currently it's the only parameter available to tune the speed or recovery
-and rebalancing, but it's planned to implement more.
+Maximum number of recovery and rebalance operations initiated by each OSD in parallel.
+Note that each OSD talks to a lot of other OSDs, so the actual number of parallel
+recovery operations per OSD is greater than just recovery_queue_depth.
+Increasing this parameter can speed up recovery if [auto-tuning](#recovery_tune_interval)
+allows it or if auto-tuning is disabled.
 info_ru: |
-Максимальное число операций восстановления на одном первичном OSD в любой
-момент времени. На данный момент единственный параметр, который можно менять
-для ускорения или замедления восстановления и перебалансировки данных, но
-в планах реализация других параметров.
+Максимальное число параллельных операций восстановления, инициируемых одним
+OSD в любой момент времени. Имейте в виду, что каждый OSD обычно работает с
+многими другими OSD, так что на практике параллелизм восстановления больше,
+чем просто recovery_queue_depth. Увеличение значения этого параметра может
+ускорить восстановление если [автотюнинг скорости](#recovery_tune_interval)
+разрешает это или если он отключён.
+- name: recovery_sleep_us
+type: us
+default: 0
+online: true
+info: |
+Delay for all recovery- and rebalance- related operations. If non-zero,
+such operations are artificially slowed down to reduce the impact on
+client I/O.
 - name: recovery_pg_switch
 type: int
 default: 128
@ -626,3 +638,101 @@
считается некорректной. Однако, если "лучшую" версию с числом доступных считается некорректной. Однако, если "лучшую" версию с числом доступных
копий большим, чем у всех других версий, найти невозможно, то объект тоже копий большим, чем у всех других версий, найти невозможно, то объект тоже
маркируется неконсистентным. маркируется неконсистентным.
- name: recovery_tune_interval
type: sec
default: 1
online: true
info: |
Interval at which the OSD re-evaluates client and recovery load and automatically
adjusts [recovery_sleep_us](#recovery_sleep_us). Recovery auto-tuning is
disabled if recovery_tune_interval is set to 0.
Auto-tuning targets utilization. Utilization is a measure of load and is
equal to the product of iops and average latency (so it may be greater
than 1). You set "low" and "high" client utilization thresholds and two
corresponding target recovery utilization levels. The OSD calculates the desired
recovery utilization from client utilization using linear interpolation
and auto-tunes the recovery operation delay to make the actual recovery
utilization match the desired one.
This reduces the impact of recovery/rebalance on client operations. It is
of course impossible to remove it completely, but it should become acceptable.
In some tests, rebalance previously dropped client write speed from 1.5 GB/s
to 50-100 MB/s; with default auto-tuning settings it now only drops
to ~1 GB/s.
info_ru: |
Интервал, с которым OSD пересматривает клиентскую нагрузку и нагрузку
восстановления и автоматически подстраивает [recovery_sleep_us](#recovery_sleep_us).
Автотюнинг (автоподстройка) отключается, если recovery_tune_interval
устанавливается в значение 0.
Автотюнинг регулирует утилизацию. Утилизация является мерой нагрузки
и равна произведению числа операций в секунду и средней задержки
(то есть, она может быть выше 1). Вы задаёте два уровня клиентской
утилизации - "низкий" и "высокий" (low и high) и два соответствующих
целевых уровня утилизации операциями восстановления. OSD рассчитывает
желаемый уровень утилизации восстановления линейной интерполяцией от
клиентской утилизации и подстраивает задержку операций восстановления
так, чтобы фактическая утилизация восстановления совпадала с желаемой.
Это позволяет снизить влияние восстановления и ребаланса на клиентские
операции. Конечно, невозможно исключить такое влияние полностью, но оно
должно становиться адекватнее. В некоторых тестах перебалансировка могла
снижать клиентскую скорость записи с 1.5 ГБ/с до 50-100 МБ/с, а теперь, с
настройками автотюнинга по умолчанию, она снижается только до ~1 ГБ/с.
- name: recovery_tune_util_low
type: float
default: 0.1
online: true
info: |
Desired recovery/rebalance utilization when client load is high, i.e. when
it is at or above recovery_tune_client_util_high.
info_ru: |
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
высокая, то есть, находится на уровне или выше recovery_tune_client_util_high.
- name: recovery_tune_util_high
type: float
default: 1
online: true
info: |
Desired recovery/rebalance utilization when client load is low, i.e. when
it is at or below recovery_tune_client_util_low.
info_ru: |
Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
низкая, то есть, находится на уровне или ниже recovery_tune_client_util_low.
- name: recovery_tune_client_util_low
type: float
default: 0
online: true
info: Client utilization considered "low".
info_ru: Клиентская утилизация, которая считается "низкой".
- name: recovery_tune_client_util_high
type: float
default: 0.5
online: true
info: Client utilization considered "high".
info_ru: Клиентская утилизация, которая считается "высокой".
- name: recovery_tune_agg_interval
type: int
default: 10
online: true
info: |
The number of most recent auto-tuning iterations averaged to calculate the
delay. Lower values result in a quicker response to client load changes,
higher values result in a more stable delay. The default value of 10
is usually fine.
info_ru: |
Число последних итераций автоподстройки для расчёта задержки как среднего
значения. Меньшие значения параметра ускоряют отклик на изменение нагрузки,
большие значения делают задержку стабильнее. Значение по умолчанию 10
обычно нормальное и не требует изменений.
- name: recovery_tune_sleep_min_us
type: us
default: 10
online: true
info: |
Minimum possible value for auto-tuned recovery_sleep_us. Values lower
than this value are changed to 0.
info_ru: |
Минимальное возможное значение авто-подстроенного recovery_sleep_us.
Значения ниже данного заменяются на 0.


@ -19,6 +19,14 @@ for i in ./???-*.yaml; do kubectl apply -f $i; done
After that you'll be able to create PersistentVolumes. After that you'll be able to create PersistentVolumes.
**Important:** For the best experience, use Linux kernel 5.15 or newer with the [VDUSE](../usage/qemu.en.md#vduse)
kernel modules enabled (vdpa, vduse, virtio-vdpa). If your distribution doesn't
have them pre-built - build them yourself ([instructions](../usage/qemu.en.md#vduse)),
I promise it's worth it :-). When VDUSE is unavailable, the CSI driver uses [NBD](../usage/nbd.en.md)
to map Vitastor devices. NBD is slower and prone to timeout issues: if the Vitastor
cluster becomes unresponsive for longer than [nbd_timeout](../config/client.en.md#nbd_timeout),
the NBD device detaches and breaks the pods using it.
## Features ## Features
Vitastor CSI supports: Vitastor CSI supports:
@ -27,5 +35,8 @@ Vitastor CSI supports:
- Raw block RWX (ReadWriteMany) volumes. Example: [PVC](../../csi/deploy/example-pvc-block.yaml), [pod](../../csi/deploy/example-test-pod-block.yaml) - Raw block RWX (ReadWriteMany) volumes. Example: [PVC](../../csi/deploy/example-pvc-block.yaml), [pod](../../csi/deploy/example-test-pod-block.yaml)
- Volume expansion - Volume expansion
- Volume snapshots. Example: [snapshot class](../../csi/deploy/example-snapshot-class.yaml), [snapshot](../../csi/deploy/example-snapshot.yaml), [clone](../../csi/deploy/example-snapshot-clone.yaml) - Volume snapshots. Example: [snapshot class](../../csi/deploy/example-snapshot-class.yaml), [snapshot](../../csi/deploy/example-snapshot.yaml), [clone](../../csi/deploy/example-snapshot-clone.yaml)
- [VDUSE](../usage/qemu.en.md#vduse) (preferred) and [NBD](../usage/nbd.en.md) device mapping methods
- Upgrades with VDUSE - handler processes are restarted together with the CSI pods themselves
- Multiple clusters by using multiple configuration files in ConfigMap.
Remember that to use snapshots with CSI you also have to install [Snapshot Controller and CRDs](https://kubernetes-csi.github.io/docs/snapshot-controller.html#deployment). Remember that to use snapshots with CSI you also have to install [Snapshot Controller and CRDs](https://kubernetes-csi.github.io/docs/snapshot-controller.html#deployment).


@ -19,6 +19,14 @@ for i in ./???-*.yaml; do kubectl apply -f $i; done
После этого вы сможете создавать PersistentVolume. После этого вы сможете создавать PersistentVolume.
**Важно:** Лучше всего использовать ядро Linux версии не менее 5.15 с включёнными модулями
[VDUSE](../usage/qemu.ru.md#vduse) (vdpa, vduse, virtio-vdpa). Если в вашем дистрибутиве
они не собраны из коробки - соберите их сами, обещаю, что это стоит того ([инструкция](../usage/qemu.ru.md#vduse)) :-).
Когда VDUSE недоступно, CSI-плагин использует [NBD](../usage/nbd.ru.md) для подключения
дисков, а NBD медленнее и имеет проблему таймаута - если кластер остаётся недоступным
дольше, чем [nbd_timeout](../config/client.ru.md#nbd_timeout), NBD-устройство отключается
и ломает поды, использующие его.
## Возможности ## Возможности
CSI-плагин Vitastor поддерживает: CSI-плагин Vitastor поддерживает:
@ -27,5 +35,8 @@ CSI-плагин Vitastor поддерживает:
- Сырые блочные RWX (ReadWriteMany) тома. Пример: [PVC](../../csi/deploy/example-pvc-block.yaml), [под](../../csi/deploy/example-test-pod-block.yaml) - Сырые блочные RWX (ReadWriteMany) тома. Пример: [PVC](../../csi/deploy/example-pvc-block.yaml), [под](../../csi/deploy/example-test-pod-block.yaml)
- Расширение размера томов - Расширение размера томов
- Снимки томов. Пример: [класс снимков](../../csi/deploy/example-snapshot-class.yaml), [снимок](../../csi/deploy/example-snapshot.yaml), [клон снимка](../../csi/deploy/example-snapshot-clone.yaml) - Снимки томов. Пример: [класс снимков](../../csi/deploy/example-snapshot-class.yaml), [снимок](../../csi/deploy/example-snapshot.yaml), [клон снимка](../../csi/deploy/example-snapshot-clone.yaml)
- Способы подключения устройств [VDUSE](../usage/qemu.ru.md#vduse) (предпочитаемый) и [NBD](../usage/nbd.ru.md)
- Обновление при использовании VDUSE - новые процессы-обработчики устройств успешно перезапускаются вместе с самими подами CSI
- Несколько кластеров через задание нескольких файлов конфигурации в ConfigMap.
Не забывайте, что для использования снимков нужно сначала установить [контроллер снимков и CRD](https://kubernetes-csi.github.io/docs/snapshot-controller.html#deployment). Не забывайте, что для использования снимков нужно сначала установить [контроллер снимков и CRD](https://kubernetes-csi.github.io/docs/snapshot-controller.html#deployment).


@ -18,7 +18,7 @@
 stable version from 0.9.x branch instead of 1.x
 - For Debian 10 (Buster) also enable backports repository:
 `deb http://deb.debian.org/debian buster-backports main`
-- Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu`
+- Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu-system-x86`
 ## CentOS


@ -18,7 +18,7 @@
 установить последнюю стабильную версию из ветки 0.9.x вместо 1.x
 - Для Debian 10 (Buster) также включите репозиторий backports:
 `deb http://deb.debian.org/debian buster-backports main`
-- Установите пакеты: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu`
+- Установите пакеты: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu-system-x86`
 ## CentOS


@ -6,10 +6,10 @@
 # Proxmox VE
-To enable Vitastor support in Proxmox Virtual Environment (6.4-8.0 are supported):
+To enable Vitastor support in Proxmox Virtual Environment (6.4-8.1 are supported):
 - Add the corresponding Vitastor Debian repository into sources.list on Proxmox hosts:
-bookworm for 8.0, bullseye for 7.4, pve7.3 for 7.3, pve7.2 for 7.2, pve7.1 for 7.1, buster for 6.4
+bookworm for 8.1, pve8.0 for 8.0, bullseye for 7.4, pve7.3 for 7.3, pve7.2 for 7.2, pve7.1 for 7.1, buster for 6.4
 - Install vitastor-client, pve-qemu-kvm, pve-storage-vitastor (* or see note) packages from Vitastor repository
 - Define storage in `/etc/pve/storage.cfg` (see below)
 - Block network access from VMs to Vitastor network (to OSDs and etcd),
@ -25,7 +25,7 @@ vitastor: vitastor
 vitastor_pool testpool
 # path to the configuration file
 vitastor_config_path /etc/vitastor/vitastor.conf
-# etcd address(es), required only if missing in the configuration file
+# etcd address(es), OPTIONAL, required only if missing in the configuration file
 vitastor_etcd_address 192.168.7.2:2379/v3
 # prefix for keys in etcd
 vitastor_etcd_prefix /vitastor


@ -6,10 +6,10 @@
 # Proxmox VE
-Чтобы подключить Vitastor к Proxmox Virtual Environment (поддерживаются версии 6.4-8.0):
+Чтобы подключить Vitastor к Proxmox Virtual Environment (поддерживаются версии 6.4-8.1):
 - Добавьте соответствующий Debian-репозиторий Vitastor в sources.list на хостах Proxmox:
-bookworm для 8.0, bullseye для 7.4, pve7.3 для 7.3, pve7.2 для 7.2, pve7.1 для 7.1, buster для 6.4
+bookworm для 8.1, pve8.0 для 8.0, bullseye для 7.4, pve7.3 для 7.3, pve7.2 для 7.2, pve7.1 для 7.1, buster для 6.4
 - Установите пакеты vitastor-client, pve-qemu-kvm, pve-storage-vitastor (* или см. сноску) из репозитория Vitastor
 - Определите тип хранилища в `/etc/pve/storage.cfg` (см. ниже)
 - Обязательно заблокируйте доступ от виртуальных машин к сети Vitastor (OSD и etcd), т.к. Vitastor (пока) не поддерживает аутентификацию
@ -24,7 +24,7 @@ vitastor: vitastor
 vitastor_pool testpool
 # Путь к файлу конфигурации
 vitastor_config_path /etc/vitastor/vitastor.conf
-# Адрес(а) etcd, нужны, только если не указаны в vitastor.conf
+# Адрес(а) etcd, ОПЦИОНАЛЬНЫ, нужны, только если не указаны в vitastor.conf
 vitastor_etcd_address 192.168.7.2:2379/v3
 # Префикс ключей метаданных в etcd
 vitastor_etcd_prefix /vitastor


@ -54,7 +54,8 @@
виртуальные диски, их снимки и клоны. виртуальные диски, их снимки и клоны.
- **Драйвер QEMU** — подключаемый модуль QEMU, позволяющий QEMU/KVM виртуальным машинам работать - **Драйвер QEMU** — подключаемый модуль QEMU, позволяющий QEMU/KVM виртуальным машинам работать
с виртуальными дисками Vitastor напрямую из пространства пользователя с помощью клиентской с виртуальными дисками Vitastor напрямую из пространства пользователя с помощью клиентской
-библиотеки, без необходимости отображения дисков в виде блочных устройств.
+библиотеки, без необходимости отображения дисков в виде блочных устройств. Тот же драйвер
+позволяет подключать диски в систему через [VDUSE](../usage/qemu.ru.md#vduse).
- **vitastor-nbd** — утилита, позволяющая монтировать образы Vitastor в виде блочных устройств - **vitastor-nbd** — утилита, позволяющая монтировать образы Vitastor в виде блочных устройств
с помощью NBD (Network Block Device), на самом деле скорее работающего как "BUSE" с помощью NBD (Network Block Device), на самом деле скорее работающего как "BUSE"
(Block Device In Userspace). Модуля ядра Linux для выполнения той же задачи в Vitastor нет (Block Device In Userspace). Модуля ядра Linux для выполнения той же задачи в Vitastor нет

View File

@ -32,6 +32,7 @@
- [Scrubbing](../config/osd.en.md#auto_scrub) (verification of copies) - [Scrubbing](../config/osd.en.md#auto_scrub) (verification of copies)
- [Checksums](../config/layout-osd.en.md#data_csum_type) - [Checksums](../config/layout-osd.en.md#data_csum_type)
- [Client write-back cache](../config/client.en.md#client_enable_writeback) - [Client write-back cache](../config/client.en.md#client_enable_writeback)
- [Intelligent recovery auto-tuning](../config/osd.en.md#recovery_tune_interval)
## Plugins and tools ## Plugins and tools

View File

@ -34,6 +34,7 @@
- [Scrubbing](../config/osd.ru.md#auto_scrub) (verification of copies) - [Scrubbing](../config/osd.ru.md#auto_scrub) (verification of copies)
- [Checksums](../config/layout-osd.ru.md#data_csum_type) - [Checksums](../config/layout-osd.ru.md#data_csum_type)
- [Client write-back cache](../config/client.ru.md#client_enable_writeback) - [Client write-back cache](../config/client.ru.md#client_enable_writeback)
- [Intelligent recovery auto-tuning](../config/osd.ru.md#recovery_tune_interval)
## Plugins and tools ## Plugins and tools

View File

@ -28,7 +28,8 @@ It supports the following commands:
Global options: Global options:
``` ```
--etcd_address ADDR Etcd connection address --config_file FILE Path to Vitastor configuration file
--etcd_address URL Etcd connection address
--iodepth N Send N operations in parallel to each OSD when possible (default 32) --iodepth N Send N operations in parallel to each OSD when possible (default 32)
--parallel_osds M Work with M osds in parallel when possible (default 4) --parallel_osds M Work with M osds in parallel when possible (default 4)
--progress 1|0 Report progress (default 1) --progress 1|0 Report progress (default 1)

View File

@ -27,7 +27,8 @@ vitastor-cli - интерфейс командной строки для адм
Global options: Global options:
``` ```
--etcd_address ADDR Etcd connection address --config_file FILE Path to Vitastor configuration file
--etcd_address URL Etcd connection address
--iodepth N Send N operations in parallel to each OSD (default 32) --iodepth N Send N operations in parallel to each OSD (default 32)
--parallel_osds M Work with M OSDs in parallel (default 4) --parallel_osds M Work with M OSDs in parallel (default 4)
--progress 1|0 Print progress (default 1) --progress 1|0 Print progress (default 1)

View File

@ -17,6 +17,7 @@ It supports the following commands:
- [purge](#purge) - [purge](#purge)
- [read-sb](#read-sb) - [read-sb](#read-sb)
- [write-sb](#write-sb) - [write-sb](#write-sb)
- [update-sb](#update-sb)
- [udev](#udev) - [udev](#udev)
- [exec-osd](#exec-osd) - [exec-osd](#exec-osd)
- [pre-exec](#pre-exec) - [pre-exec](#pre-exec)
@ -182,6 +183,14 @@ Try to read Vitastor OSD superblock from `<device>` and print it in JSON format.
Read JSON from STDIN and write it into Vitastor OSD superblock on `<device>`. Read JSON from STDIN and write it into Vitastor OSD superblock on `<device>`.
## update-sb
`vitastor-disk update-sb <device> [--force] [--<parameter> <value>] [...]`
Read the Vitastor OSD superblock from `<device>`, update parameters in it and write it back.
`--force` allows ignoring validation errors.
## udev ## udev
`vitastor-disk udev <device>` `vitastor-disk udev <device>`

View File

@ -17,6 +17,7 @@ vitastor-disk - инструмент командной строки для уп
- [purge](#purge) - [purge](#purge)
- [read-sb](#read-sb) - [read-sb](#read-sb)
- [write-sb](#write-sb) - [write-sb](#write-sb)
- [update-sb](#update-sb)
- [udev](#udev) - [udev](#udev)
- [exec-osd](#exec-osd) - [exec-osd](#exec-osd)
- [pre-exec](#pre-exec) - [pre-exec](#pre-exec)
@ -187,6 +188,15 @@ throttle_target_mbs, throttle_target_parallelism, throttle_threshold_us.
Read JSON from standard input and write it into the OSD superblock on disk `<device>`. Read JSON from standard input and write it into the OSD superblock on disk `<device>`.
## update-sb
`vitastor-disk update-sb <device> [--force] [--<parameter> <value>] [...]`
Read the OSD superblock from disk `<device>`, change the given parameters in it and write it back.
The `--force` option allows reading the superblock even if it is considered invalid
because of validation errors.
## udev ## udev
`vitastor-disk udev <device>` `vitastor-disk udev <device>`

View File

@ -14,10 +14,13 @@ Vitastor has a fio driver which can be installed from the package vitastor-fio.
Use the following command as an example to run tests with fio against a Vitastor cluster: Use the following command as an example to run tests with fio against a Vitastor cluster:
``` ```
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -image=testimg fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -image=testimg
``` ```
If you don't want to access your image by name, you can specify pool number, inode number and size If you don't want to access your image by name, you can specify pool number, inode number and size
(`-pool=1 -inode=1 -size=400G`) instead of the image name (`-image=testimg`). (`-pool=1 -inode=1 -size=400G`) instead of the image name (`-image=testimg`).
See exact fio commands to use for benchmarking [here](../performance/understanding.en.md#команды-fio). You can also specify etcd address(es) explicitly by adding `-etcd=10.115.0.10:2379/v3`, or you
can override configuration file path by adding `-conf=/etc/vitastor/vitastor.conf`.
See exact fio commands to use for benchmarking [here](../performance/understanding.en.md#fio-commands).

View File

@ -14,10 +14,13 @@
Use the following command as an example to run Vitastor cluster tests with fio: Use the following command as an example to run Vitastor cluster tests with fio:
``` ```
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -image=testimg fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -image=testimg
``` ```
Instead of accessing the image by name (`-image=testimg`), you can specify the pool number, inode number and size: Instead of accessing the image by name (`-image=testimg`), you can specify the pool number, inode number and size:
`-pool=1 -inode=1 -size=400G`. `-pool=1 -inode=1 -size=400G`.
You can also specify the etcd address(es) explicitly by adding `-etcd=10.115.0.10:2379/v3`,
or override the configuration file path by adding `-conf=/etc/vitastor/vitastor.conf`.
See the exact fio commands for benchmarking [here](../performance/understanding.ru.md#команды-fio). See the exact fio commands for benchmarking [here](../performance/understanding.ru.md#команды-fio).

View File

@ -11,25 +11,25 @@ NBD stands for "Network Block Device", but in fact it also functions as "BUSE"
NBD slightly lowers the performance due to additional overhead, but performance still NBD slightly lowers the performance due to additional overhead, but performance still
remains decent (see an example [here](../performance/comparison1.en.md#vitastor-0-4-0-nbd)). remains decent (see an example [here](../performance/comparison1.en.md#vitastor-0-4-0-nbd)).
Vitastor Kubernetes CSI driver is based on NBD. See also [VDUSE](qemu.en.md#vduse) as a better alternative to NBD.
See also [VDUSE](qemu.en.md#vduse). Vitastor Kubernetes CSI driver uses NBD when VDUSE is unavailable.
## Map image ## Map image
To create a local block device for a Vitastor image run: To create a local block device for a Vitastor image run:
``` ```
vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg vitastor-nbd map --image testimg
``` ```
It will output a block device name like /dev/nbd0 which you can then use as a normal disk. It will output a block device name like /dev/nbd0 which you can then use as a normal disk.
You can also use `--pool <POOL> --inode <INODE> --size <SIZE>` instead of `--image <IMAGE>` if you want. You can also use `--pool <POOL> --inode <INODE> --size <SIZE>` instead of `--image <IMAGE>` if you want.
Additional options for map command: vitastor-nbd supports all usual Vitastor configuration options like `--config_file <path_to_config>` plus NBD-specific:
* `--nbd_timeout 30` \ * `--nbd_timeout 300` \
Timeout for I/O operations in seconds after exceeding which the kernel stops Timeout for I/O operations in seconds after exceeding which the kernel stops
the device. You can set it to 0 to disable the timeout, but beware that you the device. You can set it to 0 to disable the timeout, but beware that you
won't be able to stop the device at all if vitastor-nbd process dies. won't be able to stop the device at all if vitastor-nbd process dies.
@ -44,6 +44,9 @@ Additional options for map command:
* `--foreground 1` \ * `--foreground 1` \
Stay in foreground, do not daemonize. Stay in foreground, do not daemonize.
Note that `nbd_timeout`, `nbd_max_devices` and `nbd_max_part` options may also be specified
in `/etc/vitastor/vitastor.conf` or in other configuration file specified with `--config_file`.
## Unmap image ## Unmap image
To unmap the device run: To unmap the device run:

View File

@ -14,16 +14,16 @@ NBD на данный момент необходимо, чтобы монтир
NBD slightly reduces performance because of additional memory copying, NBD slightly reduces performance because of additional memory copying,
but it still remains at a decent level (see a [test](../performance/comparison1.ru.md#vitastor-0-4-0-nbd) for an example). but it still remains at a decent level (see a [test](../performance/comparison1.ru.md#vitastor-0-4-0-nbd) for an example).
The Vitastor Kubernetes CSI driver is based on NBD. See also [VDUSE](qemu.ru.md#vduse) as a better alternative to NBD.
See also [VDUSE](qemu.ru.md#vduse). The Vitastor Kubernetes CSI driver uses NBD when VDUSE is unavailable.
## Map image ## Map image
To create a local block device for an image, run: To create a local block device for an image, run:
``` ```
vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg vitastor-nbd map --image testimg
``` ```
The command prints a block device name like /dev/nbd0, which you can then The command prints a block device name like /dev/nbd0, which you can then
@ -32,7 +32,8 @@ vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg
To access an image by inode number, as with other commands, you can use the options To access an image by inode number, as with other commands, you can use the options
`--pool <POOL> --inode <INODE> --size <SIZE>` instead of `--image testimg`. `--pool <POOL> --inode <INODE> --size <SIZE>` instead of `--image testimg`.
Additional options for the NBD device map command: vitastor-nbd supports all the usual Vitastor options, for example `--config_file <path_to_config>`,
plus NBD-specific ones:
* `--nbd_timeout 30` \ * `--nbd_timeout 30` \
Maximum time for any read/write operation in seconds, after Maximum time for any read/write operation in seconds, after
@ -53,6 +54,10 @@ vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg
* `--foreground 1` \ * `--foreground 1` \
Do not send the process to the background. Do not send the process to the background.
Note that the `nbd_timeout`, `nbd_max_devices` and `nbd_max_part` options can
also be set in `/etc/vitastor/vitastor.conf` or in another configuration file
specified with the `--config_file` option.
## Unmap image ## Unmap image
To unmap the device, run: To unmap the device, run:

View File

@ -23,7 +23,7 @@ balancer or any failover method you want to in that case.
vitastor-nfs usage: vitastor-nfs usage:
``` ```
vitastor-nfs [--etcd_address ADDR] [OTHER OPTIONS] vitastor-nfs [STANDARD OPTIONS] [OTHER OPTIONS]
--subdir <DIR> export images prefixed <DIR>/ (default empty - export all images) --subdir <DIR> export images prefixed <DIR>/ (default empty - export all images)
--portmap 0 do not listen on port 111 (portmap/rpcbind, requires root) --portmap 0 do not listen on port 111 (portmap/rpcbind, requires root)
@ -34,7 +34,7 @@ vitastor-nfs [--etcd_address ADDR] [OTHER OPTIONS]
--foreground 1 stay in foreground, do not daemonize --foreground 1 stay in foreground, do not daemonize
``` ```
Example start and mount commands: Example start and mount commands (etcd_address is optional):
``` ```
vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool

View File

@ -22,7 +22,7 @@
vitastor-nfs usage: vitastor-nfs usage:
``` ```
vitastor-nfs [--etcd_address ADDR] [OTHER OPTIONS] vitastor-nfs [STANDARD OPTIONS] [OTHER OPTIONS]
--subdir <DIR> export a "subdirectory": images whose names are prefixed with <DIR>/ (default empty, export all images) --subdir <DIR> export a "subdirectory": images whose names are prefixed with <DIR>/ (default empty, export all images)
--portmap 0 disable the portmap/rpcbind service on port 111 (enabled by default, requires root privileges) --portmap 0 disable the portmap/rpcbind service on port 111 (enabled by default, requires root privileges)
@ -33,7 +33,7 @@ vitastor-nfs [--etcd_address ADDR] [OTHER OPTIONS]
--foreground 1 do not go to the background after start --foreground 1 do not go to the background after start
``` ```
Example of mounting Vitastor over NFS: Example of mounting Vitastor over NFS (etcd_address is optional):
``` ```
vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool

View File

@ -16,13 +16,16 @@ Old syntax (-drive):
``` ```
qemu-system-x86_64 -enable-kvm -m 1024 \ qemu-system-x86_64 -enable-kvm -m 1024 \
-drive 'file=vitastor:etcd_host=192.168.7.2\:2379/v3:image=debian9', -drive 'file=vitastor:image=debian9',
format=raw,if=none,id=drive-virtio-disk0,cache=none \ format=raw,if=none,id=drive-virtio-disk0,cache=none \
-device 'virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0, -device 'virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,
id=virtio-disk0,bootindex=1,write-cache=off' \ id=virtio-disk0,bootindex=1,write-cache=off' \
-vnc 0.0.0.0:0 -vnc 0.0.0.0:0
``` ```
Etcd address may be specified explicitly by adding `:etcd_host=192.168.7.2\:2379/v3` to `file=`.
Configuration file path may be overridden by adding `:config_path=/etc/vitastor/vitastor.conf`.
New syntax (-blockdev): New syntax (-blockdev):
``` ```
@ -50,12 +53,12 @@ You can also specify inode ID, pool and size manually instead of `:image=<IMAGE>
## qemu-img ## qemu-img
For qemu-img, you should use `vitastor:etcd_host=<HOST>:image=<IMAGE>` as filename. For qemu-img, you should use `vitastor:image=<IMAGE>[:etcd_host=<HOST>]` as filename.
For example, to upload a VM image into Vitastor, run: For example, to upload a VM image into Vitastor, run:
``` ```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=debian10' qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:image=debian10'
``` ```
You can also specify `:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>` You can also specify `:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>`
@ -72,10 +75,10 @@ the snapshot separately using the following commands (key points are using `skip
`-B backing_file` option): `-B backing_file` option):
``` ```
qemu-img convert -f raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=testimg@0' \ qemu-img convert -f raw 'vitastor:image=testimg@0' \
-O qcow2 testimg_0.qcow2 -O qcow2 testimg_0.qcow2
qemu-img convert -f raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=testimg:skip-parents=1' \ qemu-img convert -f raw 'vitastor:image=testimg:skip-parents=1' \
-O qcow2 -o 'cluster_size=4k' -B testimg_0.qcow2 testimg.qcow2 -O qcow2 -o 'cluster_size=4k' -B testimg_0.qcow2 testimg.qcow2
``` ```
@ -146,7 +149,7 @@ Example performance comparison:
| 4k random read Q1 | 9600 iops | 7640 iops | 7780 iops | | 4k random read Q1 | 9600 iops | 7640 iops | 7780 iops |
To try VDUSE you need at least Linux 5.15, built with VDUSE support To try VDUSE you need at least Linux 5.15, built with VDUSE support
(CONFIG_VIRTIO_VDPA=m, CONFIG_VDPA_USER=m, CONFIG_VIRTIO_VDPA=m). (CONFIG_VDPA=m, CONFIG_VDPA_USER=m, CONFIG_VIRTIO_VDPA=m).
Debian Linux kernels have these options disabled for now, so if you want to try it on Debian, Debian Linux kernels have these options disabled for now, so if you want to try it on Debian,
use a kernel from Ubuntu [kernel-ppa/mainline](https://kernel.ubuntu.com/~kernel-ppa/mainline/), Proxmox, use a kernel from Ubuntu [kernel-ppa/mainline](https://kernel.ubuntu.com/~kernel-ppa/mainline/), Proxmox,

View File

@ -18,13 +18,16 @@
``` ```
qemu-system-x86_64 -enable-kvm -m 1024 \ qemu-system-x86_64 -enable-kvm -m 1024 \
-drive 'file=vitastor:etcd_host=192.168.7.2\:2379/v3:image=debian9', -drive 'file=vitastor:image=debian9',
format=raw,if=none,id=drive-virtio-disk0,cache=none \ format=raw,if=none,id=drive-virtio-disk0,cache=none \
-device 'virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0, -device 'virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,
id=virtio-disk0,bootindex=1,write-cache=off' \ id=virtio-disk0,bootindex=1,write-cache=off' \
-vnc 0.0.0.0:0 -vnc 0.0.0.0:0
``` ```
The etcd connection address can be specified explicitly by adding `:etcd_host=192.168.7.2\:2379/v3` to `file=`.
The configuration file path can be overridden by adding `:config_path=/etc/vitastor/vitastor.conf`.
New syntax (-blockdev): New syntax (-blockdev):
``` ```
@ -52,12 +55,12 @@ qemu-system-x86_64 -enable-kvm -m 1024 \
## qemu-img ## qemu-img
For qemu-img, use the string `vitastor:etcd_host=<HOST>:image=<IMAGE>` as the disk file name. For qemu-img, use the string `vitastor:image=<IMAGE>[:etcd_host=<HOST>]` as the disk file name.
For example, to upload a disk image into Vitastor: For example, to upload a disk image into Vitastor:
``` ```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg' qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:image=testimg'
``` ```
If you don't want to access the image by name, you can specify the pool number, inode number and size instead of `:image=<IMAGE>`: If you don't want to access the image by name, you can specify the pool number, inode number and size instead of `:image=<IMAGE>`:
@ -73,10 +76,10 @@ qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.
using the following commands (the key points are using `skip-parents=1` and the `-B backing_file.qcow2` option): using the following commands (the key points are using `skip-parents=1` and the `-B backing_file.qcow2` option):
``` ```
qemu-img convert -f raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=testimg@0' \ qemu-img convert -f raw 'vitastor:image=testimg@0' \
-O qcow2 testimg_0.qcow2 -O qcow2 testimg_0.qcow2
qemu-img convert -f raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=testimg:skip-parents=1' \ qemu-img convert -f raw 'vitastor:image=testimg:skip-parents=1' \
-O qcow2 -o 'cluster_size=4k' -B testimg_0.qcow2 testimg.qcow2 -O qcow2 -o 'cluster_size=4k' -B testimg_0.qcow2 testimg.qcow2
``` ```
@ -149,7 +152,7 @@ VDUSE - на данный момент лучший интерфейс для п
| 4k random read Q1 | 9600 iops | 7640 iops | 7780 iops | | 4k random read Q1 | 9600 iops | 7640 iops | 7780 iops |
To try VDUSE, you need a Linux kernel of at least version 5.15, built with support for To try VDUSE, you need a Linux kernel of at least version 5.15, built with support for
VDUSE (CONFIG_VIRTIO_VDPA=m, CONFIG_VDPA_USER=m, CONFIG_VIRTIO_VDPA=m). VDUSE (CONFIG_VDPA=m, CONFIG_VDPA_USER=m, CONFIG_VIRTIO_VDPA=m).
VDUSE support is still disabled by default in Debian Linux kernels, so to try VDUSE VDUSE support is still disabled by default in Debian Linux kernels, so to try VDUSE
on Debian, install a kernel from Ubuntu [kernel-ppa/mainline](https://kernel.ubuntu.com/~kernel-ppa/mainline/), on Debian, install a kernel from Ubuntu [kernel-ppa/mainline](https://kernel.ubuntu.com/~kernel-ppa/mainline/),

View File

@ -3,6 +3,7 @@
module.exports = { module.exports = {
scale_pg_count, scale_pg_count,
scale_pg_history,
}; };
function add_pg_history(new_pg_history, new_pg, prev_pgs, prev_pg_history, old_pg) function add_pg_history(new_pg_history, new_pg, prev_pgs, prev_pg_history, old_pg)
@ -43,16 +44,18 @@ function finish_pg_history(merged_history)
merged_history.all_peers = Object.values(merged_history.all_peers); merged_history.all_peers = Object.values(merged_history.all_peers);
} }
function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history, new_pg_count) function scale_pg_history(prev_pg_history, prev_pgs, new_pgs)
{ {
const old_pg_count = real_prev_pgs.length; const new_pg_history = [];
const old_pg_count = prev_pgs.length;
const new_pg_count = new_pgs.length;
// Add all possibly intersecting PGs to the history of new PGs // Add all possibly intersecting PGs to the history of new PGs
if (!(new_pg_count % old_pg_count)) if (!(new_pg_count % old_pg_count))
{ {
// New PG count is a multiple of old PG count // New PG count is a multiple of old PG count
for (let i = 0; i < new_pg_count; i++) for (let i = 0; i < new_pg_count; i++)
{ {
add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i % old_pg_count); add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i % old_pg_count);
finish_pg_history(new_pg_history[i]); finish_pg_history(new_pg_history[i]);
} }
} }
@ -64,7 +67,7 @@ function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history
{ {
for (let j = 0; j < mul; j++) for (let j = 0; j < mul; j++)
{ {
add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i+j*new_pg_count); add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i+j*new_pg_count);
} }
finish_pg_history(new_pg_history[i]); finish_pg_history(new_pg_history[i]);
} }
@ -76,7 +79,7 @@ function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history
let merged_history = {}; let merged_history = {};
for (let i = 0; i < old_pg_count; i++) for (let i = 0; i < old_pg_count; i++)
{ {
add_pg_history(merged_history, 1, real_prev_pgs, prev_pg_history, i); add_pg_history(merged_history, 1, prev_pgs, prev_pg_history, i);
} }
finish_pg_history(merged_history[1]); finish_pg_history(merged_history[1]);
for (let i = 0; i < new_pg_count; i++) for (let i = 0; i < new_pg_count; i++)
@ -89,6 +92,12 @@ function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history
{ {
new_pg_history[i] = null; new_pg_history[i] = null;
} }
return new_pg_history;
}
function scale_pg_count(prev_pgs, new_pg_count)
{
const old_pg_count = prev_pgs.length;
// Just for the lp_solve optimizer - pick a "previous" PG for each "new" one // Just for the lp_solve optimizer - pick a "previous" PG for each "new" one
if (prev_pgs.length < new_pg_count) if (prev_pgs.length < new_pg_count)
{ {
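
The index mapping that `scale_pg_history` applies when the PG count changes can be sketched in isolation. This is a minimal illustration and not the project's code: `parent_pgs` is a hypothetical helper, indexes are 0-based, and the condition guarding the merge branch (old count divisible by new count) is an assumption, because that line is not visible in the hunks above.

```javascript
// Hypothetical helper (not part of Vitastor): for each new PG index,
// list the old PG indexes whose history the new PG would inherit.
function parent_pgs(old_pg_count, new_pg_count)
{
    const parents = [];
    if (!(new_pg_count % old_pg_count))
    {
        // New count is a multiple of the old one:
        // new PG i descends from old PG (i % old_pg_count)
        for (let i = 0; i < new_pg_count; i++)
        {
            parents.push([ i % old_pg_count ]);
        }
    }
    else if (!(old_pg_count % new_pg_count))
    {
        // Assumed condition: old count is a multiple of the new one,
        // so new PG i absorbs old PGs i, i+new, i+2*new, ...
        const mul = old_pg_count / new_pg_count;
        for (let i = 0; i < new_pg_count; i++)
        {
            const merged = [];
            for (let j = 0; j < mul; j++)
            {
                merged.push(i + j*new_pg_count);
            }
            parents.push(merged);
        }
    }
    else
    {
        // Neither count divides the other: any new PG may intersect
        // any old PG, so every new PG gets the merged history of all
        const all = [];
        for (let i = 0; i < old_pg_count; i++)
        {
            all.push(i);
        }
        for (let i = 0; i < new_pg_count; i++)
        {
            parents.push([ ...all ]);
        }
    }
    return parents;
}
```

For example, `parent_pgs(2, 4)` maps the four new PGs to old PGs 0, 1, 0, 1, while `parent_pgs(4, 2)` merges old PGs 0 and 2 into new PG 0 and old PGs 1 and 3 into new PG 1.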

View File

@ -59,6 +59,7 @@ const etcd_tree = {
etcd_mon_timeout: 1000, // ms. min: 0 etcd_mon_timeout: 1000, // ms. min: 0
etcd_mon_retries: 5, // min: 0 etcd_mon_retries: 5, // min: 0
mon_change_timeout: 1000, // ms. min: 100 mon_change_timeout: 1000, // ms. min: 100
mon_retry_change_timeout: 50, // ms. min: 10
mon_stats_timeout: 1000, // ms. min: 100 mon_stats_timeout: 1000, // ms. min: 100
osd_out_time: 600, // seconds. min: 0 osd_out_time: 600, // seconds. min: 0
placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... }, placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
@ -110,7 +111,15 @@ const etcd_tree = {
autosync_interval: 5, autosync_interval: 5,
autosync_writes: 128, autosync_writes: 128,
client_queue_depth: 128, // unused client_queue_depth: 128, // unused
recovery_queue_depth: 4, recovery_queue_depth: 1,
recovery_sleep_us: 0,
recovery_tune_util_low: 0.1,
recovery_tune_client_util_low: 0,
recovery_tune_util_high: 1.0,
recovery_tune_client_util_high: 0.5,
recovery_tune_interval: 1,
recovery_tune_agg_interval: 10, // 10 times recovery_tune_interval
recovery_tune_sleep_min_us: 10, // 10 microseconds
recovery_pg_switch: 128, recovery_pg_switch: 128,
recovery_sync_batch: 16, recovery_sync_batch: 16,
no_recovery: false, no_recovery: false,
@ -392,7 +401,7 @@ class Mon
this.parse_etcd_addresses(config.etcd_address||config.etcd_url); this.parse_etcd_addresses(config.etcd_address||config.etcd_url);
this.verbose = config.verbose || 0; this.verbose = config.verbose || 0;
this.initConfig = config; this.initConfig = config;
this.config = {}; this.config = { ...config };
this.etcd_prefix = config.etcd_prefix || '/vitastor'; this.etcd_prefix = config.etcd_prefix || '/vitastor';
this.etcd_prefix = this.etcd_prefix.replace(/\/\/+/g, '/').replace(/^\/?(.*[^\/])\/?$/, '/$1'); this.etcd_prefix = this.etcd_prefix.replace(/\/\/+/g, '/').replace(/^\/?(.*[^\/])\/?$/, '/$1');
this.etcd_start_timeout = (config.etcd_start_timeout || 5) * 1000; this.etcd_start_timeout = (config.etcd_start_timeout || 5) * 1000;
@ -490,6 +499,11 @@ class Mon
{ {
this.config.mon_change_timeout = 100; this.config.mon_change_timeout = 100;
} }
this.config.mon_retry_change_timeout = Number(this.config.mon_retry_change_timeout) || 50;
if (this.config.mon_retry_change_timeout < 50)
{
this.config.mon_retry_change_timeout = 50;
}
this.config.mon_stats_timeout = Number(this.config.mon_stats_timeout) || 1000; this.config.mon_stats_timeout = Number(this.config.mon_stats_timeout) || 1000;
if (this.config.mon_stats_timeout < 100) if (this.config.mon_stats_timeout < 100)
{ {
@ -606,7 +620,7 @@ class Mon
console.log('etcd websocket timed out, restarting it'); console.log('etcd websocket timed out, restarting it');
this.restart_watcher(cur_addr); this.restart_watcher(cur_addr);
} }
}, (Number(this.config.etcd_keepalive_interval) || 30)*1000); }, (Number(this.config.etcd_ws_keepalive_interval) || 30)*1000);
this.ws.on('error', () => this.restart_watcher(cur_addr)); this.ws.on('error', () => this.restart_watcher(cur_addr));
this.ws.send(JSON.stringify({ this.ws.send(JSON.stringify({
create_request: { create_request: {
@ -1222,6 +1236,89 @@ class Mon
return aff_osds; return aff_osds;
} }
async generate_pool_pgs(pool_id, osd_tree, levels)
{
const pool_cfg = this.state.config.pools[pool_id];
if (!this.validate_pool_cfg(pool_id, pool_cfg, false))
{
return null;
}
let pool_tree = osd_tree[pool_cfg.root_node || ''];
pool_tree = pool_tree ? pool_tree.children : [];
pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
this.filter_osds_by_block_layout(
pool_tree,
pool_cfg.block_size || this.config.block_size || 131072,
pool_cfg.bitmap_granularity || this.config.bitmap_granularity || 4096,
pool_cfg.immediate_commit || this.config.immediate_commit || 'none'
);
// First try last_clean_pgs to minimize data movement
let prev_pgs = [];
for (const pg in ((this.state.history.last_clean_pgs.items||{})[pool_id]||{}))
{
prev_pgs[pg-1] = [ ...this.state.history.last_clean_pgs.items[pool_id][pg].osd_set ];
}
if (!prev_pgs.length)
{
// Fall back to config/pgs if it's empty
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
{
prev_pgs[pg-1] = [ ...this.state.config.pgs.items[pool_id][pg].osd_set ];
}
}
const old_pg_count = prev_pgs.length;
const optimize_cfg = {
osd_tree: pool_tree,
pg_count: pool_cfg.pg_count,
pg_size: pool_cfg.pg_size,
pg_minsize: pool_cfg.pg_minsize,
max_combinations: pool_cfg.max_osd_combinations,
ordered: pool_cfg.scheme != 'replicated',
};
let optimize_result;
// Re-shuffle PGs if config/pgs.hash is empty
if (old_pg_count > 0 && this.state.config.pgs.hash)
{
if (prev_pgs.length != pool_cfg.pg_count)
{
// Scale PG count
// Do it even if old_pg_count is already equal to pool_cfg.pg_count,
// because last_clean_pgs may still contain the old number of PGs
PGUtil.scale_pg_count(prev_pgs, pool_cfg.pg_count);
}
for (const pg of prev_pgs)
{
while (pg.length < pool_cfg.pg_size)
{
pg.push(0);
}
}
optimize_result = await LPOptimizer.optimize_change({
prev_pgs,
...optimize_cfg,
});
}
else
{
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
}
console.log(`Pool ${pool_id} (${pool_cfg.name || 'unnamed'}):`);
LPOptimizer.print_change_stats(optimize_result);
const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
return {
pool_id,
pgs: optimize_result.int_pgs,
stats: {
total_raw_tb: optimize_result.space,
pg_real_size: pg_effsize || pool_cfg.pg_size,
raw_to_usable: (pg_effsize || pool_cfg.pg_size) / (pool_cfg.scheme === 'replicated'
? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0))),
space_efficiency: optimize_result.space/(optimize_result.total_space||1),
},
};
}
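
The space statistics returned at the end of `generate_pool_pgs` can be checked by hand with a small standalone sketch. It is only an illustration under stated assumptions: `pool_space_stats` is a hypothetical name, and `tree_node_count` stands in for `Object.keys(pool_tree).length`, the number of entries left in the flattened pool tree.

```javascript
// Hypothetical helper (not part of Vitastor) mirroring the arithmetic
// of the `stats` object: effective PG size and raw-to-usable ratio.
function pool_space_stats(pool_cfg, tree_node_count)
{
    // A PG cannot span more tree entries than are available
    const pg_effsize = Math.min(pool_cfg.pg_size, tree_node_count);
    // Fall back to the configured size when the tree is empty
    const pg_real_size = pg_effsize || pool_cfg.pg_size;
    // Replicated pools store one data copy per chunk,
    // EC pools store (pg_size - parity_chunks) data chunks
    const data_chunks = pool_cfg.scheme === 'replicated'
        ? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0));
    return {
        pg_real_size,
        raw_to_usable: pg_real_size / data_chunks,
    };
}
```

With these formulas a replicated pool with `pg_size: 3` on five entries needs 3 raw bytes per usable byte, and an EC pool with `pg_size: 3, parity_chunks: 1` needs 1.5.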
async recheck_pgs() async recheck_pgs()
{ {
if (this.recheck_pgs_active) if (this.recheck_pgs_active)
@ -1236,158 +1333,47 @@ class Mon
const { up_osds, levels, osd_tree } = this.get_osd_tree(); const { up_osds, levels, osd_tree } = this.get_osd_tree();
const tree_cfg = { const tree_cfg = {
osd_tree, osd_tree,
levels,
pools: this.state.config.pools, pools: this.state.config.pools,
}; };
const tree_hash = sha1hex(stableStringify(tree_cfg)); const tree_hash = sha1hex(stableStringify(tree_cfg));
if (this.state.config.pgs.hash != tree_hash)
{
// Something has changed
-const new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
-const etcd_request = { compare: [], success: [] };
-for (const pool_id in (this.state.config.pgs||{}).items||{})
+console.log('Pool configuration or OSD tree changed, re-optimizing');
+// First re-optimize PGs, but don't look at history yet
+const optimize_results = await Promise.all(Object.keys(this.state.config.pools)
+.map(pool_id => this.generate_pool_pgs(pool_id, osd_tree, levels)));
+// Then apply the modification in the form of an optimistic transaction,
+// each time considering new pg/history modifications (OSDs modify it during rebalance)
+while (!await this.apply_pool_pgs(optimize_results, up_osds, osd_tree, tree_hash))
{
-if (!this.state.config.pools[pool_id])
-{
-// Pool deleted. Delete all PGs, but first stop them.
-if (!await this.stop_all_pgs(pool_id))
-{
-this.recheck_pgs_active = false;
-this.schedule_recheck();
-return;
-}
-const prev_pgs = [];
-for (const pg in this.state.config.pgs.items[pool_id]||{})
-{
-prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
-}
-// Also delete pool statistics
-etcd_request.success.push({ requestDeleteRange: {
-key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
-} });
-this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
-}
-}
-for (const pool_id in this.state.config.pools)
-{
-const pool_cfg = this.state.config.pools[pool_id];
-if (!this.validate_pool_cfg(pool_id, pool_cfg, false))
-{
-continue;
-}
-let pool_tree = osd_tree[pool_cfg.root_node || ''];
-pool_tree = pool_tree ? pool_tree.children : [];
-pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
-this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
-this.filter_osds_by_block_layout(
-pool_tree,
-pool_cfg.block_size || this.config.block_size || 131072,
-pool_cfg.bitmap_granularity || this.config.bitmap_granularity || 4096,
-pool_cfg.immediate_commit || this.config.immediate_commit || 'none'
+console.log(
+'Someone changed PG configuration while we also tried to change it.'+
+' Retrying in '+this.config.mon_retry_change_timeout+' ms'
);
-// These are for the purpose of building history.osd_sets
-const real_prev_pgs = [];
-let pg_history = [];
-for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
+// Failed to apply - parallel change detected. Wait a bit and retry
+const old_rev = this.etcd_watch_revision;
+while (this.etcd_watch_revision === old_rev)
{
-real_prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
-if (this.state.pg.history[pool_id] &&
-this.state.pg.history[pool_id][pg])
-{
-pg_history[pg-1] = this.state.pg.history[pool_id][pg];
-}
+await new Promise(ok => setTimeout(ok, this.config.mon_retry_change_timeout));
}
-// And these are for the purpose of minimizing data movement
-let prev_pgs = [];
-for (const pg in ((this.state.history.last_clean_pgs.items||{})[pool_id]||{}))
-{
-prev_pgs[pg-1] = this.state.history.last_clean_pgs.items[pool_id][pg].osd_set;
-}
-prev_pgs = JSON.parse(JSON.stringify(prev_pgs.length ? prev_pgs : real_prev_pgs));
-const old_pg_count = real_prev_pgs.length;
-const optimize_cfg = {
-osd_tree: pool_tree,
-pg_count: pool_cfg.pg_count,
-pg_size: pool_cfg.pg_size,
-pg_minsize: pool_cfg.pg_minsize,
-max_combinations: pool_cfg.max_osd_combinations,
-ordered: pool_cfg.scheme != 'replicated',
+const new_ot = this.get_osd_tree();
+const new_tcfg = {
+osd_tree: new_ot.osd_tree,
+levels: new_ot.levels,
+pools: this.state.config.pools,
};
-let optimize_result;
-if (old_pg_count > 0)
+if (sha1hex(stableStringify(new_tcfg)) !== tree_hash)
{
-if (old_pg_count != pool_cfg.pg_count)
-{
-// PG count changed. Need to bring all PGs down.
-if (!await this.stop_all_pgs(pool_id))
-{
-this.recheck_pgs_active = false;
-this.schedule_recheck();
-return;
-}
-}
-if (prev_pgs.length != pool_cfg.pg_count)
-{
-// Scale PG count
-// Do it even if old_pg_count is already equal to pool_cfg.pg_count,
-// because last_clean_pgs may still contain the old number of PGs
-const new_pg_history = [];
-PGUtil.scale_pg_count(prev_pgs, real_prev_pgs, pg_history, new_pg_history, pool_cfg.pg_count);
-pg_history = new_pg_history;
-}
-for (const pg of prev_pgs)
-{
-while (pg.length < pool_cfg.pg_size)
-{
-pg.push(0);
-}
-}
-if (!this.state.config.pgs.hash)
-{
-// Re-shuffle PGs
-optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
-}
-else
-{
-optimize_result = await LPOptimizer.optimize_change({
-prev_pgs,
-...optimize_cfg,
-});
-}
+// Configuration actually changed, restart from the beginning
+this.recheck_pgs_active = false;
+setImmediate(() => this.recheck_pgs().catch(this.die));
+return;
}
-else
-{
-optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
-}
-if (old_pg_count != optimize_result.int_pgs.length)
-{
-console.log(
-`PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+
-` changed from: ${old_pg_count} to ${optimize_result.int_pgs.length}`
-);
-// Drop stats
-etcd_request.success.push({ requestDeleteRange: {
-key: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'/'),
-range_end: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'0'),
-} });
-}
-LPOptimizer.print_change_stats(optimize_result);
-const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
-this.state.pool.stats[pool_id] = {
-used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
-total_raw_tb: optimize_result.space,
-pg_real_size: pg_effsize || pool_cfg.pg_size,
-raw_to_usable: (pg_effsize || pool_cfg.pg_size) / (pool_cfg.scheme === 'replicated'
-? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0))),
-space_efficiency: optimize_result.space/(optimize_result.total_space||1),
-};
-etcd_request.success.push({ requestPut: {
-key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
-value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
-} });
-this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, optimize_result.int_pgs, pg_history);
+// Configuration didn't change, PG history probably changed, so just retry
}
-new_config_pgs.hash = tree_hash;
-await this.save_pg_config(new_config_pgs, etcd_request);
+console.log('PG configuration successfully changed');
}
else
{
@@ -1434,8 +1420,81 @@ class Mon
this.recheck_pgs_active = false;
}
-async save_pg_config(new_config_pgs, etcd_request = { compare: [], success: [] })
+async apply_pool_pgs(results, up_osds, osd_tree, tree_hash)
{
+for (const pool_id in (this.state.config.pgs||{}).items||{})
+{
+// We should stop all PGs when deleting a pool or changing its PG count
+if (!this.state.config.pools[pool_id] ||
+this.state.config.pgs.items[pool_id] && this.state.config.pools[pool_id].pg_count !=
+Object.keys(this.state.config.pgs.items[pool_id]).reduce((a, c) => (a < (0|c) ? (0|c) : a), 0))
+{
+if (!await this.stop_all_pgs(pool_id))
+{
+return false;
+}
+}
+}
+const new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
+const etcd_request = { compare: [], success: [] };
+for (const pool_id in (new_config_pgs||{}).items||{})
+{
+if (!this.state.config.pools[pool_id])
+{
+const prev_pgs = [];
+for (const pg in new_config_pgs.items[pool_id]||{})
+{
+prev_pgs[pg-1] = new_config_pgs.items[pool_id][pg].osd_set;
+}
+// Also delete pool statistics
+etcd_request.success.push({ requestDeleteRange: {
+key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
+} });
+this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
+}
+}
+for (const pool_res of results)
+{
+const pool_id = pool_res.pool_id;
+const pool_cfg = this.state.config.pools[pool_id];
+let pg_history = [];
+for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
+{
+if (this.state.pg.history[pool_id] &&
+this.state.pg.history[pool_id][pg])
+{
+pg_history[pg-1] = this.state.pg.history[pool_id][pg];
+}
+}
+const real_prev_pgs = [];
+for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
+{
+real_prev_pgs[pg-1] = [ ...this.state.config.pgs.items[pool_id][pg].osd_set ];
+}
+if (real_prev_pgs.length > 0 && real_prev_pgs.length != pool_res.pgs.length)
+{
+console.log(
+`Changing PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+
+` from: ${real_prev_pgs.length} to ${pool_res.pgs.length}`
+);
+pg_history = PGUtil.scale_pg_history(pg_history, real_prev_pgs, pool_res.pgs);
+// Drop stats
+etcd_request.success.push({ requestDeleteRange: {
+key: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'/'),
+range_end: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'0'),
+} });
+}
+const stats = {
+used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
+...pool_res.stats,
+};
+etcd_request.success.push({ requestPut: {
+key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
+value: b64(JSON.stringify(stats)),
+} });
+this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, pool_res.pgs, pg_history);
+}
+new_config_pgs.hash = tree_hash;
etcd_request.compare.push(
{ key: b64(this.etcd_prefix+'/mon/master'), target: 'LEASE', lease: ''+this.etcd_lease_id },
{ key: b64(this.etcd_prefix+'/config/pgs'), target: 'MOD', mod_revision: ''+this.etcd_watch_revision, result: 'LESS' },
@@ -1443,14 +1502,8 @@ class Mon
etcd_request.success.push(
{ requestPut: { key: b64(this.etcd_prefix+'/config/pgs'), value: b64(JSON.stringify(new_config_pgs)) } },
);
-const res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0);
-if (!res.succeeded)
-{
-console.log('Someone changed PG configuration while we also tried to change it. Retrying in '+this.config.mon_change_timeout+' ms');
-this.schedule_recheck();
-return;
-}
-console.log('PG configuration successfully changed');
+const txn_res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0);
+return txn_res.succeeded;
}
// Schedule next recheck at least at <unixtime>
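
Note on the mon.js hunks above: the change moves the retry into a loop around `apply_pool_pgs`, which reports whether the etcd transaction succeeded. A minimal sketch of that optimistic-retry pattern follows, assuming a hypothetical in-memory `FakeStore` in place of etcd and a hypothetical `applyWithRetry` helper. Neither name exists in Vitastor, and the wait between attempts is left out so the sketch stays synchronous.

```javascript
// Hypothetical stand-in for etcd: one value guarded by a revision counter.
class FakeStore {
    constructor() {
        this.value = null;
        this.revision = 0;
    }
    // Write only if nobody modified the value after expectedRevision.
    txn(expectedRevision, newValue) {
        if (this.revision > expectedRevision) {
            return { succeeded: false };
        }
        this.value = newValue;
        this.revision++;
        return { succeeded: true };
    }
}

// Hypothetical helper: rebuild the value and retry until the write applies.
function applyWithRetry(store, makeValue, maxAttempts = 5) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const seen = store.revision;       // revision observed before computing
        const value = makeValue(attempt);  // may race with another writer
        if (store.txn(seen, value).succeeded) {
            return attempt;                // number of attempts that were needed
        }
    }
    throw new Error('gave up after ' + maxAttempts + ' attempts');
}
```

The point of the design is that a failed write costs only a recomputation against fresh state; no lock is held while the new value is being built.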

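Note on the `requestDeleteRange` calls in mon.js: the range runs from `'/pg/stats/'+pool_id+'/'` to `'/pg/stats/'+pool_id+'0'`. This works as a prefix delete because `'0'` is the character immediately after `'/'` in ASCII, so the half-open range covers exactly the keys that start with the prefix. A small sketch, assuming ASCII keys and a hypothetical `/vitastor` etcd prefix (`prefixRangeEnd` and `inRange` are illustrative names, not Vitastor functions):

```javascript
// Exclusive upper bound for "every key that starts with prefix":
// bump the last character of the prefix by one code point.
function prefixRangeEnd(prefix) {
    const last = prefix.charCodeAt(prefix.length - 1);
    return prefix.slice(0, -1) + String.fromCharCode(last + 1);
}

// Half-open range check [start, end), using plain string ordering.
function inRange(key, start, end) {
    return key >= start && key < end;
}

const start = '/vitastor/pg/stats/1/';   // hypothetical prefix for pool 1
const end = prefixRangeEnd(start);       // last '/' becomes '0'
```

With these bounds a key of pool 1 such as `/vitastor/pg/stats/1/5` falls inside the range, while `/vitastor/pg/stats/10/5` (pool 10) does not.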
View File

@@ -1,6 +1,6 @@
{
"name": "vitastor-mon",
-"version": "1.2.0",
+"version": "1.3.1",
"description": "Vitastor SDS monitor service",
"main": "mon-main.js",
"scripts": {

View File

@@ -110,7 +110,6 @@ sub properties
vitastor_etcd_address => {
description => 'IP address(es) of etcd.',
type => 'string',
-format => 'pve-storage-portal-dns-list',
},
vitastor_etcd_prefix => {
description => 'Prefix for Vitastor etcd metadata',

View File

@@ -50,7 +50,7 @@ from cinder.volume import configuration
from cinder.volume import driver
from cinder.volume import volume_utils
-VERSION = '1.2.0'
+VERSION = '1.3.1'
LOG = logging.getLogger(__name__)

View File

@@ -0,0 +1,190 @@
Index: pve-qemu-kvm-8.1.2/block/meson.build
===================================================================
--- pve-qemu-kvm-8.1.2.orig/block/meson.build
+++ pve-qemu-kvm-8.1.2/block/meson.build
@@ -123,6 +123,7 @@ foreach m : [
[libnfs, 'nfs', files('nfs.c')],
[libssh, 'ssh', files('ssh.c')],
[rbd, 'rbd', files('rbd.c')],
+ [vitastor, 'vitastor', files('vitastor.c')],
]
if m[0].found()
module_ss = ss.source_set()
Index: pve-qemu-kvm-8.1.2/meson.build
===================================================================
--- pve-qemu-kvm-8.1.2.orig/meson.build
+++ pve-qemu-kvm-8.1.2/meson.build
@@ -1303,6 +1303,26 @@ if not get_option('rbd').auto() or have_
endif
endif
+vitastor = not_found
+if not get_option('vitastor').auto() or have_block
+ libvitastor_client = cc.find_library('vitastor_client', has_headers: ['vitastor_c.h'],
+ required: get_option('vitastor'))
+ if libvitastor_client.found()
+ if cc.links('''
+ #include <vitastor_c.h>
+ int main(void) {
+ vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+ return 0;
+ }''', dependencies: libvitastor_client)
+ vitastor = declare_dependency(dependencies: libvitastor_client)
+ elif get_option('vitastor').enabled()
+ error('could not link libvitastor_client')
+ else
+ warning('could not link libvitastor_client, disabling')
+ endif
+ endif
+endif
+
glusterfs = not_found
glusterfs_ftruncate_has_stat = false
glusterfs_iocb_has_stat = false
@@ -2123,6 +2143,7 @@ if numa.found()
endif
config_host_data.set('CONFIG_OPENGL', opengl.found())
config_host_data.set('CONFIG_RBD', rbd.found())
+config_host_data.set('CONFIG_VITASTOR', vitastor.found())
config_host_data.set('CONFIG_RDMA', rdma.found())
config_host_data.set('CONFIG_SAFESTACK', get_option('safe_stack'))
config_host_data.set('CONFIG_SDL', sdl.found())
@@ -4298,6 +4319,7 @@ summary_info += {'fdt support': fd
summary_info += {'libcap-ng support': libcap_ng}
summary_info += {'bpf support': libbpf}
summary_info += {'rbd support': rbd}
+summary_info += {'vitastor support': vitastor}
summary_info += {'smartcard support': cacard}
summary_info += {'U2F support': u2f}
summary_info += {'libusb': libusb}
Index: pve-qemu-kvm-8.1.2/meson_options.txt
===================================================================
--- pve-qemu-kvm-8.1.2.orig/meson_options.txt
+++ pve-qemu-kvm-8.1.2/meson_options.txt
@@ -186,6 +186,8 @@ option('lzo', type : 'feature', value :
description: 'lzo compression support')
option('rbd', type : 'feature', value : 'auto',
description: 'Ceph block device driver')
+option('vitastor', type : 'feature', value : 'auto',
+ description: 'Vitastor block device driver')
option('opengl', type : 'feature', value : 'auto',
description: 'OpenGL support')
option('rdma', type : 'feature', value : 'auto',
Index: pve-qemu-kvm-8.1.2/qapi/block-core.json
===================================================================
--- pve-qemu-kvm-8.1.2.orig/qapi/block-core.json
+++ pve-qemu-kvm-8.1.2/qapi/block-core.json
@@ -3403,7 +3403,7 @@
'raw', 'rbd',
{ 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
'pbs',
- 'ssh', 'throttle', 'vdi', 'vhdx',
+ 'ssh', 'throttle', 'vdi', 'vhdx', 'vitastor',
{ 'name': 'virtio-blk-vfio-pci', 'if': 'CONFIG_BLKIO' },
{ 'name': 'virtio-blk-vhost-user', 'if': 'CONFIG_BLKIO' },
{ 'name': 'virtio-blk-vhost-vdpa', 'if': 'CONFIG_BLKIO' },
@@ -4465,6 +4465,28 @@
'*server': ['InetSocketAddressBase'] } }
##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host: etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config-path': 'str',
+ '*etcd-host': 'str',
+ '*etcd-prefix': 'str' } }
+
+##
# @ReplicationMode:
#
# An enumeration of replication modes.
@@ -4923,6 +4945,7 @@
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
'vhdx': 'BlockdevOptionsGenericFormat',
+ 'vitastor': 'BlockdevOptionsVitastor',
'virtio-blk-vfio-pci':
{ 'type': 'BlockdevOptionsVirtioBlkVfioPci',
'if': 'CONFIG_BLKIO' },
@@ -5360,6 +5383,17 @@
'*encrypt' : 'RbdEncryptionCreateOptions' } }
##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
+##
# @BlockdevVmdkSubformat:
#
# Subformat options for VMDK images
@@ -5581,6 +5615,7 @@
'ssh': 'BlockdevCreateOptionsSsh',
'vdi': 'BlockdevCreateOptionsVdi',
'vhdx': 'BlockdevCreateOptionsVhdx',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'vmdk': 'BlockdevCreateOptionsVmdk',
'vpc': 'BlockdevCreateOptionsVpc'
} }
Index: pve-qemu-kvm-8.1.2/scripts/ci/org.centos/stream/8/x86_64/configure
===================================================================
--- pve-qemu-kvm-8.1.2.orig/scripts/ci/org.centos/stream/8/x86_64/configure
+++ pve-qemu-kvm-8.1.2/scripts/ci/org.centos/stream/8/x86_64/configure
@@ -30,7 +30,7 @@
--with-suffix="qemu-kvm" \
--firmwarepath=/usr/share/qemu-firmware \
--target-list="x86_64-softmmu" \
---block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
+--block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,vitastor,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
--audio-drv-list="" \
--block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
--with-coroutine=ucontext \
@@ -176,6 +176,7 @@
--enable-opengl \
--enable-pie \
--enable-rbd \
+--enable-vitastor \
--enable-rdma \
--enable-seccomp \
--enable-snappy \
Index: pve-qemu-kvm-8.1.2/scripts/meson-buildoptions.sh
===================================================================
--- pve-qemu-kvm-8.1.2.orig/scripts/meson-buildoptions.sh
+++ pve-qemu-kvm-8.1.2/scripts/meson-buildoptions.sh
@@ -153,6 +153,7 @@ meson_options_help() {
printf "%s\n" ' qed qed image format support'
printf "%s\n" ' qga-vss build QGA VSS support (broken with MinGW)'
printf "%s\n" ' rbd Ceph block device driver'
+ printf "%s\n" ' vitastor Vitastor block device driver'
printf "%s\n" ' rdma Enable RDMA-based migration'
printf "%s\n" ' replication replication support'
printf "%s\n" ' sdl SDL user interface'
@@ -416,6 +417,8 @@ _meson_option_parse() {
--disable-qom-cast-debug) printf "%s" -Dqom_cast_debug=false ;;
--enable-rbd) printf "%s" -Drbd=enabled ;;
--disable-rbd) printf "%s" -Drbd=disabled ;;
+ --enable-vitastor) printf "%s" -Dvitastor=enabled ;;
+ --disable-vitastor) printf "%s" -Dvitastor=disabled ;;
--enable-rdma) printf "%s" -Drdma=enabled ;;
--disable-rdma) printf "%s" -Drdma=disabled ;;
--enable-replication) printf "%s" -Dreplication=enabled ;;

View File

@@ -0,0 +1,190 @@
diff --git a/block/meson.build b/block/meson.build
index 529fc172c6..d542dc0609 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -110,6 +110,7 @@ foreach m : [
[libnfs, 'nfs', files('nfs.c')],
[libssh, 'ssh', files('ssh.c')],
[rbd, 'rbd', files('rbd.c')],
+ [vitastor, 'vitastor', files('vitastor.c')],
]
if m[0].found()
module_ss = ss.source_set()
diff --git a/meson.build b/meson.build
index a9c4f28247..8496cf13f1 100644
--- a/meson.build
+++ b/meson.build
@@ -1303,6 +1303,26 @@ if not get_option('rbd').auto() or have_block
endif
endif
+vitastor = not_found
+if not get_option('vitastor').auto() or have_block
+ libvitastor_client = cc.find_library('vitastor_client', has_headers: ['vitastor_c.h'],
+ required: get_option('vitastor'))
+ if libvitastor_client.found()
+ if cc.links('''
+ #include <vitastor_c.h>
+ int main(void) {
+ vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+ return 0;
+ }''', dependencies: libvitastor_client)
+ vitastor = declare_dependency(dependencies: libvitastor_client)
+ elif get_option('vitastor').enabled()
+ error('could not link libvitastor_client')
+ else
+ warning('could not link libvitastor_client, disabling')
+ endif
+ endif
+endif
+
glusterfs = not_found
glusterfs_ftruncate_has_stat = false
glusterfs_iocb_has_stat = false
@@ -2119,6 +2139,7 @@ if numa.found()
endif
config_host_data.set('CONFIG_OPENGL', opengl.found())
config_host_data.set('CONFIG_RBD', rbd.found())
+config_host_data.set('CONFIG_VITASTOR', vitastor.found())
config_host_data.set('CONFIG_RDMA', rdma.found())
config_host_data.set('CONFIG_SAFESTACK', get_option('safe_stack'))
config_host_data.set('CONFIG_SDL', sdl.found())
@@ -4286,6 +4307,7 @@ summary_info += {'fdt support': fdt_opt == 'disabled' ? false : fdt_opt}
summary_info += {'libcap-ng support': libcap_ng}
summary_info += {'bpf support': libbpf}
summary_info += {'rbd support': rbd}
+summary_info += {'vitastor support': vitastor}
summary_info += {'smartcard support': cacard}
summary_info += {'U2F support': u2f}
summary_info += {'libusb': libusb}
diff --git a/meson_options.txt b/meson_options.txt
index ae6d8f469d..e3d9f8404d 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -186,6 +186,8 @@ option('lzo', type : 'feature', value : 'auto',
description: 'lzo compression support')
option('rbd', type : 'feature', value : 'auto',
description: 'Ceph block device driver')
+option('vitastor', type : 'feature', value : 'auto',
+ description: 'Vitastor block device driver')
option('opengl', type : 'feature', value : 'auto',
description: 'OpenGL support')
option('rdma', type : 'feature', value : 'auto',
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 2b1d493d6e..90673fdbdc 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3146,7 +3146,7 @@
'parallels', 'preallocate', 'qcow', 'qcow2', 'qed', 'quorum',
'raw', 'rbd',
{ 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
- 'ssh', 'throttle', 'vdi', 'vhdx',
+ 'ssh', 'throttle', 'vdi', 'vhdx', 'vitastor',
{ 'name': 'virtio-blk-vfio-pci', 'if': 'CONFIG_BLKIO' },
{ 'name': 'virtio-blk-vhost-user', 'if': 'CONFIG_BLKIO' },
{ 'name': 'virtio-blk-vhost-vdpa', 'if': 'CONFIG_BLKIO' },
@@ -4196,6 +4196,28 @@
'*key-secret': 'str',
'*server': ['InetSocketAddressBase'] } }
+##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host: etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config-path': 'str',
+ '*etcd-host': 'str',
+ '*etcd-prefix': 'str' } }
+
##
# @ReplicationMode:
#
@@ -4654,6 +4676,7 @@
'throttle': 'BlockdevOptionsThrottle',
'vdi': 'BlockdevOptionsGenericFormat',
'vhdx': 'BlockdevOptionsGenericFormat',
+ 'vitastor': 'BlockdevOptionsVitastor',
'virtio-blk-vfio-pci':
{ 'type': 'BlockdevOptionsVirtioBlkVfioPci',
'if': 'CONFIG_BLKIO' },
@@ -5089,6 +5112,17 @@
'*cluster-size' : 'size',
'*encrypt' : 'RbdEncryptionCreateOptions' } }
+##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
##
# @BlockdevVmdkSubformat:
#
@@ -5311,6 +5345,7 @@
'ssh': 'BlockdevCreateOptionsSsh',
'vdi': 'BlockdevCreateOptionsVdi',
'vhdx': 'BlockdevCreateOptionsVhdx',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
'vmdk': 'BlockdevCreateOptionsVmdk',
'vpc': 'BlockdevCreateOptionsVpc'
} }
diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure b/scripts/ci/org.centos/stream/8/x86_64/configure
index d02b09a4b9..f0b5fbfef3 100755
--- a/scripts/ci/org.centos/stream/8/x86_64/configure
+++ b/scripts/ci/org.centos/stream/8/x86_64/configure
@@ -30,7 +30,7 @@
--with-suffix="qemu-kvm" \
--firmwarepath=/usr/share/qemu-firmware \
--target-list="x86_64-softmmu" \
---block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
+--block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,vitastor,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
--audio-drv-list="" \
--block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
--with-coroutine=ucontext \
@@ -176,6 +176,7 @@
--enable-opengl \
--enable-pie \
--enable-rbd \
+--enable-vitastor \
--enable-rdma \
--enable-seccomp \
--enable-snappy \
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index d7020af175..94958eb6fa 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -153,6 +153,7 @@ meson_options_help() {
printf "%s\n" ' qed qed image format support'
printf "%s\n" ' qga-vss build QGA VSS support (broken with MinGW)'
printf "%s\n" ' rbd Ceph block device driver'
+ printf "%s\n" ' vitastor Vitastor block device driver'
printf "%s\n" ' rdma Enable RDMA-based migration'
printf "%s\n" ' replication replication support'
printf "%s\n" ' sdl SDL user interface'
@@ -416,6 +417,8 @@ _meson_option_parse() {
--disable-qom-cast-debug) printf "%s" -Dqom_cast_debug=false ;;
--enable-rbd) printf "%s" -Drbd=enabled ;;
--disable-rbd) printf "%s" -Drbd=disabled ;;
+ --enable-vitastor) printf "%s" -Dvitastor=enabled ;;
+ --disable-vitastor) printf "%s" -Dvitastor=disabled ;;
--enable-rdma) printf "%s" -Drdma=enabled ;;
--disable-rdma) printf "%s" -Drdma=disabled ;;
--enable-replication) printf "%s" -Dreplication=enabled ;;

View File

@@ -24,4 +24,4 @@ rm fio
mv fio-copy fio
FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
-tar --transform 's#^#vitastor-1.2.0/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-1.2.0$(rpm --eval '%dist').tar.gz *
+tar --transform 's#^#vitastor-1.3.1/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-1.3.1$(rpm --eval '%dist').tar.gz *

View File

@@ -15,6 +15,7 @@ RUN yumdownloader --disablerepo=centos-sclo-rh --source fio
RUN rpm --nomd5 -i fio*.src.rpm
RUN rm -f /etc/yum.repos.d/CentOS-Media.repo
RUN cd ~/rpmbuild/SPECS && yum-builddep -y fio.spec
+RUN yum -y install cmake3
ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root
@@ -35,7 +36,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
-cp /root/vitastor-1.2.0.el7.tar.gz ~/rpmbuild/SOURCES; \
+cp /root/vitastor-1.3.1.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor
-Version: 1.2.0
+Version: 1.3.1
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
-Source0: vitastor-1.2.0.el7.tar.gz
+Source0: vitastor-1.3.1.el7.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
@@ -16,7 +16,7 @@ BuildRequires: jerasure-devel
BuildRequires: libisa-l-devel
BuildRequires: gf-complete-devel
BuildRequires: libibverbs-devel
-BuildRequires: cmake
+BuildRequires: cmake3
Requires: vitastor-osd = %{version}-%{release}
Requires: vitastor-mon = %{version}-%{release}
Requires: vitastor-client = %{version}-%{release}
@@ -94,7 +94,7 @@ Vitastor fio drivers for benchmarking.
%build
. /opt/rh/devtoolset-9/enable
-%cmake .
+%cmake3 .
%make_build

View File

@@ -35,7 +35,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
-cp /root/vitastor-1.2.0.el8.tar.gz ~/rpmbuild/SOURCES; \
+cp /root/vitastor-1.3.1.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor
-Version: 1.2.0
+Version: 1.3.1
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
-Source0: vitastor-1.2.0.el8.tar.gz
+Source0: vitastor-1.3.1.el8.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel

View File

@@ -18,7 +18,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
-cp /root/vitastor-1.2.0.el9.tar.gz ~/rpmbuild/SOURCES; \
+cp /root/vitastor-1.3.1.el9.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el9.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor
-Version: 1.2.0
+Version: 1.3.1
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
-Source0: vitastor-1.2.0.el9.tar.gz
+Source0: vitastor-1.3.1.el9.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel

View File

@@ -16,7 +16,7 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
endif()
-add_definitions(-DVERSION="1.2.0")
+add_definitions(-DVERSION="1.3.1")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -fno-omit-frame-pointer -I ${CMAKE_SOURCE_DIR}/src)
add_link_options(-fno-omit-frame-pointer)
if (${WITH_ASAN})
@@ -181,6 +181,25 @@ target_link_libraries(vitastor-nbd
vitastor_client
)
+# vitastor-kv
+add_executable(vitastor-kv
+kv_cli.cpp
+kv_db.cpp
+kv_db.h
+)
+target_link_libraries(vitastor-kv
+vitastor_client
+)
+add_executable(vitastor-kv-stress
+kv_stress.cpp
+kv_db.cpp
+kv_db.h
+)
+target_link_libraries(vitastor-kv-stress
+vitastor_client
+)
# vitastor-nfs
add_executable(vitastor-nfs
nfs_proxy.cpp

View File

@@ -8,6 +8,7 @@
#include <stdio.h>
#include <stdexcept>
+#include <set>
#include "addr_util.h"
@@ -135,7 +136,7 @@ std::vector<std::string> getifaddr_list(std::vector<std::string> mask_cfg, bool
throw std::runtime_error((include_v6 ? "Invalid IPv4 address mask: " : "Invalid IP address mask: ") + mask);
}
}
-std::vector<std::string> addresses;
+std::set<std::string> addresses;
ifaddrs *list, *ifa;
if (getifaddrs(&list) == -1)
{
@@ -149,7 +150,8 @@ std::vector<std::string> getifaddr_list(std::vector<std::string> mask_cfg, bool
}
int family = ifa->ifa_addr->sa_family;
if ((family == AF_INET || family == AF_INET6 && include_v6) &&
-(ifa->ifa_flags & (IFF_UP | IFF_RUNNING | IFF_LOOPBACK)) == (IFF_UP | IFF_RUNNING))
+// Do not skip loopback addresses if the address filter is specified
+(ifa->ifa_flags & (IFF_UP | IFF_RUNNING | (masks.size() ? 0 : IFF_LOOPBACK))) == (IFF_UP | IFF_RUNNING))
{
void *addr_ptr;
if (family == AF_INET)
@@ -182,11 +184,11 @@ std::vector<std::string> getifaddr_list(std::vector<std::string> mask_cfg, bool
{
throw std::runtime_error(std::string("inet_ntop: ") + strerror(errno));
}
-addresses.push_back(std::string(addr));
+addresses.insert(std::string(addr));
}
}
freeifaddrs(list);
-return addresses;
+return std::vector<std::string>(addresses.begin(), addresses.end());
}
int create_and_bind_socket(std::string bind_address, int bind_port, int listen_backlog, int *listening_port)

View File

@@ -277,6 +277,7 @@ class blockstore_impl_t
int unsynced_big_write_count = 0, unstable_unsynced = 0;
int unsynced_queued_ops = 0;
allocator *data_alloc = NULL;
+uint64_t used_blocks = 0;
uint8_t *zero_object;
void *metadata_buffer = NULL;
@@ -430,7 +431,7 @@ public:
inline uint32_t get_block_size() { return dsk.data_block_size; }
inline uint64_t get_block_count() { return dsk.block_count; }
-inline uint64_t get_free_block_count() { return data_alloc->get_free_count(); }
+inline uint64_t get_free_block_count() { return dsk.block_count - used_blocks; }
inline uint32_t get_bitmap_granularity() { return dsk.disk_alignment; }
inline uint64_t get_journal_size() { return dsk.journal_len; }
};

@ -376,6 +376,7 @@ bool blockstore_init_meta::handle_meta_block(uint8_t *buf, uint64_t entries_per_
else else
{ {
bs->inode_space_stats[entry->oid.inode] += bs->dsk.data_block_size; bs->inode_space_stats[entry->oid.inode] += bs->dsk.data_block_size;
bs->used_blocks++;
} }
entries_loaded++; entries_loaded++;
#ifdef BLOCKSTORE_DEBUG #ifdef BLOCKSTORE_DEBUG
@ -732,8 +733,9 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
resume: resume:
while (pos < bs->journal.block_size) while (pos < bs->journal.block_size)
{ {
journal_entry *je = (journal_entry*)((uint8_t*)buf + proc_pos - done_pos + pos); auto buf_pos = proc_pos - done_pos + pos;
if (je->magic != JOURNAL_MAGIC || je_crc32(je) != je->crc32 || journal_entry *je = (journal_entry*)((uint8_t*)buf + buf_pos);
if (je->magic != JOURNAL_MAGIC || buf_pos+je->size > len || je_crc32(je) != je->crc32 ||
je->type < JE_MIN || je->type > JE_MAX || started && je->crc32_prev != crc32_last) je->type < JE_MIN || je->type > JE_MAX || started && je->crc32_prev != crc32_last)
{ {
if (pos == 0) if (pos == 0)
@ -1180,6 +1182,7 @@ void blockstore_init_journal::erase_dirty_object(blockstore_dirty_db_t::iterator
sp -= bs->dsk.data_block_size; sp -= bs->dsk.data_block_size;
else else
bs->inode_space_stats.erase(oid.inode); bs->inode_space_stats.erase(oid.inode);
bs->used_blocks--;
} }
bs->erase_dirty(dirty_it, dirty_end, clean_loc); bs->erase_dirty(dirty_it, dirty_end, clean_loc);
// Remove it from the flusher's queue, too // Remove it from the flusher's queue, too
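
Note on the `handle_journal_part` hunk above: it adds a `buf_pos+je->size > len` test ahead of the CRC comparison, so an entry whose declared size runs past the loaded buffer is rejected before its checksum is computed. A minimal sketch of that ordering follows. The `entry_hdr` layout, the magic value and the byte-sum checksum are simplified stand-ins chosen for illustration, not the real `journal_entry` or `je_crc32`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical, simplified header; the real journal_entry layout differs.
struct entry_hdr
{
    uint16_t magic;
    uint16_t size;     // total entry size in bytes, including this header
    uint32_t checksum; // stand-in for crc32: plain byte sum of the payload
};

static const uint16_t ENTRY_MAGIC = 0x4A33; // arbitrary value for the sketch

static uint32_t byte_sum(const uint8_t *data, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += data[i];
    return s;
}

// Returns true only if the entry at buf+pos lies fully inside [buf, buf+len)
// and its checksum matches. The size checks run first, so the checksum is
// never computed over bytes beyond the end of the buffer.
bool entry_valid(const uint8_t *buf, size_t len, size_t pos)
{
    if (pos + sizeof(entry_hdr) > len)
        return false;
    entry_hdr h;
    std::memcpy(&h, buf + pos, sizeof(h));
    if (h.magic != ENTRY_MAGIC || h.size < sizeof(entry_hdr) || pos + h.size > len)
        return false;
    return byte_sum(buf + pos + sizeof(entry_hdr), h.size - sizeof(entry_hdr)) == h.checksum;
}
```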

@ -144,8 +144,10 @@ journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type,
journal.sector_info[journal.cur_sector].written = false; journal.sector_info[journal.cur_sector].written = false;
journal.sector_info[journal.cur_sector].offset = journal.next_free; journal.sector_info[journal.cur_sector].offset = journal.next_free;
journal.in_sector_pos = 0; journal.in_sector_pos = 0;
journal.next_free = (journal.next_free+journal.block_size) < journal.len ? journal.next_free + journal.block_size : journal.block_size; auto next_next_free = (journal.next_free+journal.block_size) < journal.len ? journal.next_free + journal.block_size : journal.block_size;
assert(journal.next_free != journal.used_start); // double check that next_free doesn't cross used_start from the left
assert(journal.next_free >= journal.used_start || next_next_free < journal.used_start);
journal.next_free = next_next_free;
memset(journal.inmemory memset(journal.inmemory
? (uint8_t*)journal.buffer + journal.sector_info[journal.cur_sector].offset ? (uint8_t*)journal.buffer + journal.sector_info[journal.cur_sector].offset
: (uint8_t*)journal.sector_buf + journal.block_size*journal.cur_sector, 0, journal.block_size); : (uint8_t*)journal.sector_buf + journal.block_size*journal.cur_sector, 0, journal.block_size);
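
Note on the `prefill_single_journal_entry` hunk above: the new position is first computed into `next_next_free`, checked against `used_start`, and only then stored. The sketch below models that with plain integers. The function names are hypothetical; the predicate is copied from the `assert()` in the diff, and it only constrains the case where `next_free` starts to the left of `used_start`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the journal ring. Advancing by one block wraps to
// block_size (not 0) at the end of the journal, as in the diff.
uint64_t advance_one_block(uint64_t next_free, uint64_t block_size, uint64_t len)
{
    return (next_free + block_size) < len ? next_free + block_size : block_size;
}

// Same predicate as the assert() in the diff: if next_free is to the left of
// used_start, the new position must remain strictly to the left of it.
bool stays_behind_used_start(uint64_t next_free, uint64_t new_next_free, uint64_t used_start)
{
    return next_free >= used_start || new_next_free < used_start;
}
```

Computing the candidate position before assigning it is what makes the check possible: once `next_free` has been overwritten, the "before" value needed by the predicate is gone.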

@ -445,6 +445,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
if (!exists) if (!exists)
{ {
inode_space_stats[dirty_it->first.oid.inode] += dsk.data_block_size; inode_space_stats[dirty_it->first.oid.inode] += dsk.data_block_size;
used_blocks++;
} }
big_to_flush++; big_to_flush++;
} }
@ -455,6 +456,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
sp -= dsk.data_block_size; sp -= dsk.data_block_size;
else else
inode_space_stats.erase(dirty_it->first.oid.inode); inode_space_stats.erase(dirty_it->first.oid.inode);
used_blocks--;
big_to_flush++; big_to_flush++;
} }
} }

@ -386,7 +386,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
sqe, dsk.data_fd, PRIV(op)->iov_zerofill, vcnt, dsk.data_offset + (loc << dsk.block_order) + op->offset - stripe_offset sqe, dsk.data_fd, PRIV(op)->iov_zerofill, vcnt, dsk.data_offset + (loc << dsk.block_order) + op->offset - stripe_offset
); );
PRIV(op)->pending_ops = 1; PRIV(op)->pending_ops = 1;
if (immediate_commit != IMMEDIATE_ALL && !(dirty_it->second.state & BS_ST_INSTANT)) if (!(dirty_it->second.state & BS_ST_INSTANT))
{ {
unstable_unsynced++; unstable_unsynced++;
} }
@ -412,7 +412,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
sizeof(journal_entry_big_write) + dsk.clean_dyn_size, 0) sizeof(journal_entry_big_write) + dsk.clean_dyn_size, 0)
|| !space_check.check_available(op, 1, || !space_check.check_available(op, 1,
sizeof(journal_entry_small_write) + dyn_size, sizeof(journal_entry_small_write) + dyn_size,
(unstable_writes.size()+unstable_unsynced)*journal.block_size)) op->len + (unstable_writes.size()+unstable_unsynced)*journal.block_size))
{ {
return 0; return 0;
} }
@ -462,6 +462,8 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
exit(1); exit(1);
} }
} }
// double check that next_free doesn't cross used_start from the left
assert(journal.next_free >= journal.used_start || next_next_free < journal.used_start);
journal.next_free = next_next_free; journal.next_free = next_next_free;
je->oid = op->oid; je->oid = op->oid;
je->version = op->version; je->version = op->version;
@ -499,13 +501,13 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
} }
dirty_it->second.location = journal.next_free; dirty_it->second.location = journal.next_free;
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED; dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED;
journal.next_free += op->len; next_next_free = journal.next_free + op->len;
if (journal.next_free >= journal.len) if (next_next_free >= journal.len)
{ next_next_free = dsk.journal_block_size;
journal.next_free = dsk.journal_block_size; // double check that next_free doesn't cross used_start from the left
assert(journal.next_free != journal.used_start); assert(journal.next_free >= journal.used_start || next_next_free < journal.used_start);
} journal.next_free = next_next_free;
if (immediate_commit == IMMEDIATE_NONE && !(dirty_it->second.state & BS_ST_INSTANT)) if (!(dirty_it->second.state & BS_ST_INSTANT))
{ {
unstable_unsynced++; unstable_unsynced++;
} }
@ -596,11 +598,11 @@ resume_4:
{ {
auto & unstab = unstable_writes[op->oid]; auto & unstab = unstable_writes[op->oid];
unstab = unstab < op->version ? op->version : unstab; unstab = unstab < op->version ? op->version : unstab;
} if (!is_instant)
else if (!is_instant) {
{ unstable_unsynced--;
unstable_unsynced--; assert(unstable_unsynced >= 0);
assert(unstable_unsynced >= 0); }
} }
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK)
| (imm ? BS_ST_SYNCED : BS_ST_WRITTEN); | (imm ? BS_ST_SYNCED : BS_ST_WRITTEN);

@ -116,7 +116,8 @@ static const char* help_text =
"Use vitastor-cli --help <command> for command details or vitastor-cli --help --all for all details.\n" "Use vitastor-cli --help <command> for command details or vitastor-cli --help --all for all details.\n"
"\n" "\n"
"GLOBAL OPTIONS:\n" "GLOBAL OPTIONS:\n"
" --etcd_address <etcd_address>\n" " --config_file FILE Path to Vitastor configuration file\n"
" --etcd_address URL Etcd connection address\n"
" --iodepth N Send N operations in parallel to each OSD when possible (default 32)\n" " --iodepth N Send N operations in parallel to each OSD when possible (default 32)\n"
" --parallel_osds M Work with M osds in parallel when possible (default 4)\n" " --parallel_osds M Work with M osds in parallel when possible (default 4)\n"
" --progress 1|0 Report progress (default 1)\n" " --progress 1|0 Report progress (default 1)\n"

@ -6,7 +6,7 @@
#include "cluster_client_impl.h" #include "cluster_client_impl.h"
#include "http_client.h" // json_is_true #include "http_client.h" // json_is_true
cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config) cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json config)
{ {
wb = new writeback_cache_t(); wb = new writeback_cache_t();
@ -359,6 +359,8 @@ void cluster_client_t::on_load_config_hook(json11::Json::object & etcd_global_co
{ {
up_wait_retry_interval = 50; up_wait_retry_interval = 50;
} }
// log_level
log_level = config["log_level"].uint64_value();
msgr.parse_config(config); msgr.parse_config(config);
st_cli.parse_config(config); st_cli.parse_config(config);
st_cli.load_pgs(); st_cli.load_pgs();
@ -532,7 +534,7 @@ void cluster_client_t::execute_internal(cluster_op_t *op)
return; return;
} }
if (op->opcode == OSD_OP_WRITE && enable_writeback && !(op->flags & OP_FLUSH_BUFFER) && if (op->opcode == OSD_OP_WRITE && enable_writeback && !(op->flags & OP_FLUSH_BUFFER) &&
!op->version /* FIXME no CAS writeback */) !op->version /* no CAS writeback */)
{ {
if (wb->writebacks_active >= client_max_writeback_iodepth) if (wb->writebacks_active >= client_max_writeback_iodepth)
{ {
@ -553,7 +555,7 @@ void cluster_client_t::execute_internal(cluster_op_t *op)
} }
if (op->opcode == OSD_OP_WRITE && !(op->flags & OP_IMMEDIATE_COMMIT)) if (op->opcode == OSD_OP_WRITE && !(op->flags & OP_IMMEDIATE_COMMIT))
{ {
if (!(op->flags & OP_FLUSH_BUFFER)) if (!(op->flags & OP_FLUSH_BUFFER) && !op->version /* no CAS write-repeat */)
{ {
wb->copy_write(op, CACHE_WRITTEN); wb->copy_write(op, CACHE_WRITTEN);
} }
@ -703,6 +705,8 @@ resume_1:
} }
goto resume_2; goto resume_2;
} }
// Protect from try_send completing the operation immediately
op->inflight_count++;
for (int i = 0; i < op->parts.size(); i++) for (int i = 0; i < op->parts.size(); i++)
{ {
if (!(op->parts[i].flags & PART_SENT)) if (!(op->parts[i].flags & PART_SENT))
@ -726,8 +730,10 @@ resume_1:
} }
} }
} }
op->inflight_count--;
if (op->state == 1) if (op->state == 1)
{ {
// Some suboperations have to be resent
return 0; return 0;
} }
resume_2: resume_2:
@ -1147,12 +1153,15 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
if (op->retval != -EINTR && op->retval != -EIO && op->retval != -ENOSPC) if (op->retval != -EINTR && op->retval != -EIO && op->retval != -ENOSPC)
{ {
stop_fd = part->op.peer_fd; stop_fd = part->op.peer_fd;
fprintf( if (op->retval != -EPIPE || log_level > 0)
stderr, "%s operation failed on OSD %lu: retval=%ld (expected %d), dropping connection\n", {
osd_op_names[part->op.req.hdr.opcode], part->osd_num, part->op.reply.hdr.retval, expected fprintf(
); stderr, "%s operation failed on OSD %lu: retval=%ld (expected %d), dropping connection\n",
osd_op_names[part->op.req.hdr.opcode], part->osd_num, part->op.reply.hdr.retval, expected
);
}
} }
else else if (log_level > 0)
{ {
fprintf( fprintf(
stderr, "%s operation failed on OSD %lu: retval=%ld (expected %d)\n", stderr, "%s operation failed on OSD %lu: retval=%ld (expected %d)\n",
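
Note on the `resume_1` hunks above: `op->inflight_count` is incremented before the send loop and decremented after it, with the comment "Protect from try_send completing the operation immediately". The toy model below shows why an extra count held during the loop prevents the operation from being seen as finished while parts are still being sent. `op_t`, `part_done` and `send_all` are hypothetical and far simpler than the real `cluster_op_t` handling.

```cpp
#include <cassert>

// Hypothetical, reduced operation state
struct op_t
{
    int inflight_count = 0;
    bool finished = false;
};

// Called when one part completes; the op finishes when nothing is in flight
void part_done(op_t & op)
{
    op.inflight_count--;
    if (op.inflight_count == 0)
        op.finished = true;
}

// Sends n parts, each of which completes synchronously (the worst case).
// Returns true if the op was marked finished before the last part was sent.
bool send_all(op_t & op, int n, bool guard)
{
    bool finished_early = false;
    if (guard)
        op.inflight_count++; // extra count held for the duration of the loop
    for (int i = 0; i < n; i++)
    {
        op.inflight_count++;
        part_done(op);
        if (op.finished && i < n-1)
            finished_early = true;
    }
    if (guard)
        op.inflight_count--; // released; the caller continues the op itself
    return finished_early;
}
```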

@ -91,7 +91,7 @@ class cluster_client_t
uint64_t client_max_buffered_ops = 0; uint64_t client_max_buffered_ops = 0;
uint64_t client_max_writeback_iodepth = 0; uint64_t client_max_writeback_iodepth = 0;
int log_level; int log_level = 0;
int up_wait_retry_interval = 500; // ms int up_wait_retry_interval = 500; // ms
int retry_timeout_id = 0; int retry_timeout_id = 0;
@ -121,7 +121,7 @@ public:
json11::Json::object cli_config, file_config, etcd_global_config; json11::Json::object cli_config, file_config, etcd_global_config;
json11::Json::object config; json11::Json::object config;
cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config); cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json config);
~cluster_client_t(); ~cluster_client_t();
void execute(cluster_op_t *op); void execute(cluster_op_t *op);
void execute_raw(osd_num_t osd_num, osd_op_t *op); void execute_raw(osd_num_t osd_num, osd_op_t *op);

@ -127,6 +127,10 @@ static const char *help_text =
"vitastor-disk write-sb <device>\n" "vitastor-disk write-sb <device>\n"
" Read JSON from STDIN and write it into Vitastor OSD superblock on <device>.\n" " Read JSON from STDIN and write it into Vitastor OSD superblock on <device>.\n"
"\n" "\n"
"vitastor-disk update-sb <device> [--force] [--<parameter> <value>] [...]\n"
" Read Vitastor OSD superblock from <device>, update parameters in it and write it back.\n"
" --force allows to ignore validation errors.\n"
"\n"
"vitastor-disk udev <device>\n" "vitastor-disk udev <device>\n"
" Try to read Vitastor OSD superblock from <device> and print variables for udev.\n" " Try to read Vitastor OSD superblock from <device> and print variables for udev.\n"
"\n" "\n"
@ -363,6 +367,15 @@ int main(int argc, char *argv[])
} }
return self.write_sb(cmd[1]); return self.write_sb(cmd[1]);
} }
else if (!strcmp(cmd[0], "update-sb"))
{
if (cmd.size() != 2)
{
fprintf(stderr, "Exactly 1 device path argument is required\n");
return 1;
}
return self.update_sb(cmd[1]);
}
else if (!strcmp(cmd[0], "start") || !strcmp(cmd[0], "stop") || else if (!strcmp(cmd[0], "start") || !strcmp(cmd[0], "stop") ||
!strcmp(cmd[0], "restart") || !strcmp(cmd[0], "enable") || !strcmp(cmd[0], "disable")) !strcmp(cmd[0], "restart") || !strcmp(cmd[0], "enable") || !strcmp(cmd[0], "disable"))
{ {

@ -109,6 +109,7 @@ struct disk_tool_t
int udev_import(std::string device); int udev_import(std::string device);
int read_sb(std::string device); int read_sb(std::string device);
int write_sb(std::string device); int write_sb(std::string device);
int update_sb(std::string device);
int exec_osd(std::string device); int exec_osd(std::string device);
int systemd_start_stop_osds(const std::vector<std::string> & cmd, const std::vector<std::string> & devices); int systemd_start_stop_osds(const std::vector<std::string> & cmd, const std::vector<std::string> & devices);
int pre_exec_osd(std::string device); int pre_exec_osd(std::string device);

@ -440,16 +440,25 @@ std::vector<std::string> disk_tool_t::get_new_data_parts(vitastor_dev_info_t & d
{ {
// Use this partition // Use this partition
use_parts.push_back(part["uuid"].string_value()); use_parts.push_back(part["uuid"].string_value());
osds_exist++;
} }
else else
{ {
std::string part_path = "/dev/disk/by-partuuid/"+strtolower(part["uuid"].string_value());
bool is_meta = sb["params"]["meta_device"].string_value() == part_path;
bool is_journal = sb["params"]["journal_device"].string_value() == part_path;
bool is_data = sb["params"]["data_device"].string_value() == part_path;
fprintf( fprintf(
stderr, "%s is already initialized for OSD %lu, skipping\n", stderr, "%s is already initialized for OSD %lu%s, skipping\n",
part["node"].string_value().c_str(), sb["params"]["osd_num"].uint64_value() part["node"].string_value().c_str(), sb["params"]["osd_num"].uint64_value(),
(is_data ? " data" : (is_meta ? " meta" : (is_journal ? " journal" : "")))
); );
osds_size += part["size"].uint64_value()*dev.pt["sectorsize"].uint64_value(); if (is_data || sb["params"]["data_device"].string_value().substr(0, 22) != "/dev/disk/by-partuuid/")
{
osds_size += part["size"].uint64_value()*dev.pt["sectorsize"].uint64_value();
osds_exist++;
}
} }
osds_exist++;
} }
} }
// Still create OSD(s) if a disk has no more than (max_other_percent) other data // Still create OSD(s) if a disk has no more than (max_other_percent) other data

@ -86,6 +86,24 @@ int disk_tool_t::write_sb(std::string device)
return !write_osd_superblock(device, params); return !write_osd_superblock(device, params);
} }
int disk_tool_t::update_sb(std::string device)
{
json11::Json sb = read_osd_superblock(device, true, options.find("force") != options.end());
if (sb.is_null())
{
return 1;
}
auto sb_obj = sb["params"].object_items();
for (auto & kv: options)
{
if (kv.first != "force")
{
sb_obj[kv.first] = kv.second;
}
}
return !write_osd_superblock(device, sb_obj);
}
uint32_t disk_tool_t::write_osd_superblock(std::string device, json11::Json params) uint32_t disk_tool_t::write_osd_superblock(std::string device, json11::Json params)
{ {
std::string json_data = params.dump(); std::string json_data = params.dump();

@ -135,8 +135,8 @@ void etcd_state_client_t::etcd_call(std::string api, json11::Json payload, int t
{ {
if (this->log_level > 0) if (this->log_level > 0)
{ {
printf( fprintf(
"Warning: etcd request failed: %s, retrying %d more times\n", stderr, "Warning: etcd request failed: %s, retrying %d more times\n",
err.c_str(), retries err.c_str(), retries
); );
} }
@ -333,7 +333,7 @@ void etcd_state_client_t::start_etcd_watcher()
etcd_watch_ws = NULL; etcd_watch_ws = NULL;
} }
if (this->log_level > 1) if (this->log_level > 1)
printf("Trying to connect to etcd websocket at %s\n", etcd_address.c_str()); fprintf(stderr, "Trying to connect to etcd websocket at %s, watch from revision %lu\n", etcd_address.c_str(), etcd_watch_revision);
etcd_watch_ws = open_websocket(tfd, etcd_address, etcd_api_path+"/watch", etcd_slow_timeout, etcd_watch_ws = open_websocket(tfd, etcd_address, etcd_api_path+"/watch", etcd_slow_timeout,
[this, cur_addr = selected_etcd_address](const http_response_t *msg) [this, cur_addr = selected_etcd_address](const http_response_t *msg)
{ {
@ -356,8 +356,8 @@ void etcd_state_client_t::start_etcd_watcher()
watch_id == ETCD_PG_HISTORY_WATCH_ID || watch_id == ETCD_PG_HISTORY_WATCH_ID ||
watch_id == ETCD_OSD_STATE_WATCH_ID) watch_id == ETCD_OSD_STATE_WATCH_ID)
etcd_watches_initialised++; etcd_watches_initialised++;
if (etcd_watches_initialised == 4 && this->log_level > 0) if (etcd_watches_initialised == ETCD_TOTAL_WATCHES && this->log_level > 0)
fprintf(stderr, "Successfully subscribed to etcd at %s\n", cur_addr.c_str()); fprintf(stderr, "Successfully subscribed to etcd at %s, revision %lu\n", cur_addr.c_str(), etcd_watch_revision);
} }
if (data["result"]["canceled"].bool_value()) if (data["result"]["canceled"].bool_value())
{ {
@ -393,9 +393,13 @@ void etcd_state_client_t::start_etcd_watcher()
exit(1); exit(1);
} }
} }
if (etcd_watches_initialised == 4) if (etcd_watches_initialised == ETCD_TOTAL_WATCHES && !data["result"]["header"]["revision"].is_null())
{ {
etcd_watch_revision = data["result"]["header"]["revision"].uint64_value()+1; // Protect against a revision being split into multiple messages and some
// of them being lost. Even though I'm not sure if etcd actually splits them.
// Also sometimes etcd sends something without a header, like:
// {"error": {"grpc_code": 14, "http_code": 503, "http_status": "Service Unavailable", "message": "error reading from server: EOF"}}
etcd_watch_revision = data["result"]["header"]["revision"].uint64_value();
addresses_to_try.clear(); addresses_to_try.clear();
} }
// First gather all changes into a hash to remove multiple overwrites // First gather all changes into a hash to remove multiple overwrites
@ -507,7 +511,7 @@ void etcd_state_client_t::start_ws_keepalive()
{ {
ws_keepalive_timer = tfd->set_timer(etcd_ws_keepalive_interval*1000, true, [this](int) ws_keepalive_timer = tfd->set_timer(etcd_ws_keepalive_interval*1000, true, [this](int)
{ {
if (!etcd_watch_ws) if (!etcd_watch_ws || etcd_watches_initialised < ETCD_TOTAL_WATCHES)
{ {
// Do nothing // Do nothing
} }
@ -636,18 +640,28 @@ void etcd_state_client_t::load_pgs()
on_load_pgs_hook(false); on_load_pgs_hook(false);
return; return;
} }
reset_pg_exists();
if (!etcd_watch_revision) if (!etcd_watch_revision)
{ {
etcd_watch_revision = data["header"]["revision"].uint64_value()+1; etcd_watch_revision = data["header"]["revision"].uint64_value()+1;
if (this->log_level > 3)
{
fprintf(stderr, "Loaded revision %lu of PG configuration\n", etcd_watch_revision-1);
}
} }
for (auto & res: data["responses"].array_items()) for (auto & res: data["responses"].array_items())
{ {
for (auto & kv_json: res["response_range"]["kvs"].array_items()) for (auto & kv_json: res["response_range"]["kvs"].array_items())
{ {
auto kv = parse_etcd_kv(kv_json); auto kv = parse_etcd_kv(kv_json);
if (this->log_level > 3)
{
fprintf(stderr, "Loaded key: %s -> %s\n", kv.key.c_str(), kv.value.dump().c_str());
}
parse_state(kv); parse_state(kv);
} }
} }
clean_nonexistent_pgs();
on_load_pgs_hook(true); on_load_pgs_hook(true);
start_etcd_watcher(); start_etcd_watcher();
}); });
@ -668,6 +682,73 @@ void etcd_state_client_t::load_pgs()
} }
#endif #endif
void etcd_state_client_t::reset_pg_exists()
{
for (auto & pool_item: pool_config)
{
for (auto & pg_item: pool_item.second.pg_config)
{
pg_item.second.state_exists = false;
pg_item.second.history_exists = false;
}
}
seen_peers.clear();
}
void etcd_state_client_t::clean_nonexistent_pgs()
{
for (auto & pool_item: pool_config)
{
for (auto pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); )
{
auto & pg_cfg = pg_it->second;
if (!pg_cfg.config_exists && !pg_cfg.state_exists && !pg_cfg.history_exists)
{
if (this->log_level > 3)
{
fprintf(stderr, "PG %u/%u disappeared after reload, forgetting it\n", pool_item.first, pg_it->first);
}
pool_item.second.pg_config.erase(pg_it++);
}
else
{
if (!pg_cfg.state_exists)
{
if (this->log_level > 3)
{
fprintf(stderr, "PG %u/%u primary OSD disappeared after reload, forgetting it\n", pool_item.first, pg_it->first);
}
parse_state((etcd_kv_t){
.key = etcd_prefix+"/pg/state/"+std::to_string(pool_item.first)+"/"+std::to_string(pg_it->first),
});
}
if (!pg_cfg.history_exists)
{
if (this->log_level > 3)
{
fprintf(stderr, "PG %u/%u history disappeared after reload, forgetting it\n", pool_item.first, pg_it->first);
}
parse_state((etcd_kv_t){
.key = etcd_prefix+"/pg/history/"+std::to_string(pool_item.first)+"/"+std::to_string(pg_it->first),
});
}
pg_it++;
}
}
}
for (auto & peer_item: peer_states)
{
if (seen_peers.find(peer_item.first) == seen_peers.end())
{
fprintf(stderr, "OSD %lu state disappeared after reload, forgetting it\n", peer_item.first);
parse_state((etcd_kv_t){
.key = etcd_prefix+"/osd/state/"+std::to_string(peer_item.first),
});
}
}
seen_peers.clear();
}
void etcd_state_client_t::parse_state(const etcd_kv_t & kv) void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
{ {
const std::string & key = kv.key; const std::string & key = kv.key;
@ -822,7 +903,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
{ {
for (auto & pg_item: pool_item.second.pg_config) for (auto & pg_item: pool_item.second.pg_config)
{ {
pg_item.second.exists = false; pg_item.second.config_exists = false;
} }
} }
for (auto & pool_item: value["items"].object_items()) for (auto & pool_item: value["items"].object_items())
@ -845,7 +926,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
continue; continue;
} }
auto & parsed_cfg = this->pool_config[pool_id].pg_config[pg_num]; auto & parsed_cfg = this->pool_config[pool_id].pg_config[pg_num];
parsed_cfg.exists = true; parsed_cfg.config_exists = true;
parsed_cfg.pause = pg_item.second["pause"].bool_value(); parsed_cfg.pause = pg_item.second["pause"].bool_value();
parsed_cfg.primary = pg_item.second["primary"].uint64_value(); parsed_cfg.primary = pg_item.second["primary"].uint64_value();
parsed_cfg.target_set.clear(); parsed_cfg.target_set.clear();
@ -866,7 +947,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
int n = 0; int n = 0;
for (auto pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); pg_it++) for (auto pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); pg_it++)
{ {
if (pg_it->second.exists && pg_it->first != ++n) if (pg_it->second.config_exists && pg_it->first != ++n)
{ {
fprintf( fprintf(
stderr, "Invalid pool %u PG configuration: PG numbers don't cover whole 1..%lu range\n", stderr, "Invalid pool %u PG configuration: PG numbers don't cover whole 1..%lu range\n",
@ -874,7 +955,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
); );
for (pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); pg_it++) for (pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); pg_it++)
{ {
pg_it->second.exists = false; pg_it->second.config_exists = false;
} }
n = 0; n = 0;
break; break;
@ -899,6 +980,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
auto & pg_cfg = this->pool_config[pool_id].pg_config[pg_num]; auto & pg_cfg = this->pool_config[pool_id].pg_config[pg_num];
pg_cfg.target_history.clear(); pg_cfg.target_history.clear();
pg_cfg.all_peers.clear(); pg_cfg.all_peers.clear();
pg_cfg.history_exists = !value.is_null();
// Refuse to start PG if any set of the <osd_sets> has no live OSDs // Refuse to start PG if any set of the <osd_sets> has no live OSDs
for (auto & hist_item: value["osd_sets"].array_items()) for (auto & hist_item: value["osd_sets"].array_items())
{ {
@ -951,11 +1033,15 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
} }
else if (value.is_null()) else if (value.is_null())
{ {
this->pool_config[pool_id].pg_config[pg_num].cur_primary = 0; auto & pg_cfg = this->pool_config[pool_id].pg_config[pg_num];
this->pool_config[pool_id].pg_config[pg_num].cur_state = 0; pg_cfg.state_exists = false;
pg_cfg.cur_primary = 0;
pg_cfg.cur_state = 0;
} }
else else
{ {
auto & pg_cfg = this->pool_config[pool_id].pg_config[pg_num];
pg_cfg.state_exists = true;
osd_num_t cur_primary = value["primary"].uint64_value(); osd_num_t cur_primary = value["primary"].uint64_value();
int state = 0; int state = 0;
for (auto & e: value["state"].array_items()) for (auto & e: value["state"].array_items())
@ -983,8 +1069,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
fprintf(stderr, "Unexpected pool %u PG %u state in etcd: primary=%lu, state=%s\n", pool_id, pg_num, cur_primary, value["state"].dump().c_str()); fprintf(stderr, "Unexpected pool %u PG %u state in etcd: primary=%lu, state=%s\n", pool_id, pg_num, cur_primary, value["state"].dump().c_str());
return; return;
} }
this->pool_config[pool_id].pg_config[pg_num].cur_primary = cur_primary; pg_cfg.cur_primary = cur_primary;
this->pool_config[pool_id].pg_config[pg_num].cur_state = state; pg_cfg.cur_state = state;
} }
} }
else if (key.substr(0, etcd_prefix.length()+11) == etcd_prefix+"/osd/state/") else if (key.substr(0, etcd_prefix.length()+11) == etcd_prefix+"/osd/state/")
@ -998,6 +1084,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
value["port"].int64_value() > 0 && value["port"].int64_value() < 65536) value["port"].int64_value() > 0 && value["port"].int64_value() < 65536)
{ {
this->peer_states[peer_osd] = value; this->peer_states[peer_osd] = value;
this->seen_peers.insert(peer_osd);
} }
else else
{ {
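
Note on `reset_pg_exists()` and `clean_nonexistent_pgs()` above: together they form a mark-and-sweep pass around a full reload. Flags are cleared first, the reload sets them again for every key it sees, and entries with no flag left are erased. The sketch below reduces this to a single map. `pg_flags_t`, `reset_flags` and `sweep` are hypothetical names, and the real code additionally feeds empty values through `parse_state()` for PGs that lost only their state or history.

```cpp
#include <cassert>
#include <map>

// Hypothetical, reduced PG record: only the three flags from the diff
struct pg_flags_t
{
    bool config_exists = false, state_exists = false, history_exists = false;
};

// Before the reload: clear the flags that the reload is expected to set again
void reset_flags(std::map<int, pg_flags_t> & pgs)
{
    for (auto & p: pgs)
    {
        p.second.state_exists = false;
        p.second.history_exists = false;
    }
}

// After the reload: erase entries that nothing refers to any more.
// Returns the number of erased entries.
int sweep(std::map<int, pg_flags_t> & pgs)
{
    int removed = 0;
    for (auto it = pgs.begin(); it != pgs.end(); )
    {
        auto & f = it->second;
        if (!f.config_exists && !f.state_exists && !f.history_exists)
        {
            it = pgs.erase(it); // erase() returns the next valid iterator
            removed++;
        }
        else
            it++;
    }
    return removed;
}
```

The sketch uses `it = pgs.erase(it)`, while the diff uses `erase(pg_it++)`; both keep the iterator valid when erasing from a `std::map` during iteration.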

@ -3,6 +3,8 @@
#pragma once #pragma once
#include <set>
#include "json11/json11.hpp" #include "json11/json11.hpp"
#include "osd_id.h" #include "osd_id.h"
#include "timerfd_manager.h" #include "timerfd_manager.h"
@ -11,6 +13,7 @@
#define ETCD_PG_STATE_WATCH_ID 2 #define ETCD_PG_STATE_WATCH_ID 2
#define ETCD_PG_HISTORY_WATCH_ID 3 #define ETCD_PG_HISTORY_WATCH_ID 3
#define ETCD_OSD_STATE_WATCH_ID 4 #define ETCD_OSD_STATE_WATCH_ID 4
#define ETCD_TOTAL_WATCHES 4
#define DEFAULT_BLOCK_SIZE 128*1024 #define DEFAULT_BLOCK_SIZE 128*1024
#define MIN_DATA_BLOCK_SIZE 4*1024 #define MIN_DATA_BLOCK_SIZE 4*1024
@ -30,7 +33,7 @@ struct etcd_kv_t
struct pg_config_t struct pg_config_t
{ {
bool exists; bool config_exists, history_exists, state_exists;
osd_num_t primary; osd_num_t primary;
std::vector<osd_num_t> target_set; std::vector<osd_num_t> target_set;
std::vector<std::vector<osd_num_t>> target_history; std::vector<std::vector<osd_num_t>> target_history;
@ -61,21 +64,21 @@ struct pool_config_t
struct inode_config_t struct inode_config_t
{ {
uint64_t num; uint64_t num = 0;
std::string name; std::string name;
uint64_t size; uint64_t size = 0;
inode_t parent_id; inode_t parent_id = 0;
bool readonly; bool readonly = false;
// Arbitrary metadata // Arbitrary metadata
json11::Json meta; json11::Json meta;
// Change revision of the metadata in etcd // Change revision of the metadata in etcd
uint64_t mod_revision; uint64_t mod_revision = 0;
}; };
struct inode_watch_t struct inode_watch_t
{ {
std::string name; std::string name;
inode_config_t cfg; inode_config_t cfg = {};
}; };
struct http_co_t; struct http_co_t;
@ -113,6 +116,7 @@ public:
uint64_t etcd_watch_revision = 0; uint64_t etcd_watch_revision = 0;
std::map<pool_id_t, pool_config_t> pool_config; std::map<pool_id_t, pool_config_t> pool_config;
std::map<osd_num_t, json11::Json> peer_states; std::map<osd_num_t, json11::Json> peer_states;
std::set<osd_num_t> seen_peers;
std::map<inode_t, inode_config_t> inode_config; std::map<inode_t, inode_config_t> inode_config;
std::map<std::string, inode_t> inode_by_name; std::map<std::string, inode_t> inode_by_name;
@ -138,6 +142,8 @@ public:
void start_ws_keepalive(); void start_ws_keepalive();
void load_global_config(); void load_global_config();
void load_pgs(); void load_pgs();
void reset_pg_exists();
void clean_nonexistent_pgs();
void parse_state(const etcd_kv_t & kv); void parse_state(const etcd_kv_t & kv);
void parse_config(const json11::Json & config); void parse_config(const json11::Json & config);
void insert_inode_config(const inode_config_t & cfg); void insert_inode_config(const inode_config_t & cfg);

src/kv_cli.cpp (new file)

@ -0,0 +1,401 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// Vitastor shared key/value database test CLI
#define _XOPEN_SOURCE
#include <limits.h>
#include <netinet/tcp.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <fcntl.h>
//#include <signal.h>
#include "epoll_manager.h"
#include "str_util.h"
#include "kv_db.h"
const char *exe_name = NULL;
class kv_cli_t
{
public:
kv_dbw_t *db = NULL;
ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL;
cluster_client_t *cli = NULL;
bool interactive = false;
int in_progress = 0;
char *cur_cmd = NULL;
int cur_cmd_size = 0, cur_cmd_alloc = 0;
bool finished = false, eof = false;
json11::Json::object cfg;
~kv_cli_t();
static json11::Json::object parse_args(int narg, const char *args[]);
void run(const json11::Json::object & cfg);
void read_cmd();
void next_cmd();
void handle_cmd(const std::string & cmd, std::function<void()> cb);
};
kv_cli_t::~kv_cli_t()
{
if (cur_cmd)
{
free(cur_cmd);
cur_cmd = NULL;
}
cur_cmd_alloc = 0;
if (db)
delete db;
if (cli)
{
cli->flush();
delete cli;
}
if (epmgr)
delete epmgr;
if (ringloop)
delete ringloop;
}
json11::Json::object kv_cli_t::parse_args(int narg, const char *args[])
{
json11::Json::object cfg;
for (int i = 1; i < narg; i++)
{
if (!strcmp(args[i], "-h") || !strcmp(args[i], "--help"))
{
printf(
"Vitastor Key/Value CLI\n"
"(c) Vitaliy Filippov, 2023+ (VNPL-1.1)\n"
"\n"
"USAGE: %s [--etcd_address ADDR] [OTHER OPTIONS]\n",
exe_name
);
exit(0);
}
else if (args[i][0] == '-' && args[i][1] == '-')
{
const char *opt = args[i]+2;
cfg[opt] = !strcmp(opt, "json") || i == narg-1 ? "1" : args[++i];
}
}
return cfg;
}
void kv_cli_t::run(const json11::Json::object & cfg)
{
// Create client
ringloop = new ring_loop_t(512);
epmgr = new epoll_manager_t(ringloop);
cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
db = new kv_dbw_t(cli);
// Load image metadata
while (!cli->is_ready())
{
ringloop->loop();
if (cli->is_ready())
break;
ringloop->wait();
}
// Run
fcntl(0, F_SETFL, fcntl(0, F_GETFL, 0) | O_NONBLOCK);
try
{
epmgr->tfd->set_fd_handler(0, false, [this](int fd, int events)
{
if (events & EPOLLIN)
{
read_cmd();
}
if (events & EPOLLRDHUP)
{
epmgr->tfd->set_fd_handler(0, false, NULL);
finished = true;
}
});
interactive = true;
printf("> ");
}
catch (std::exception & e)
{
// Can't add to epoll, STDIN is probably a file
read_cmd();
}
while (!finished)
{
ringloop->loop();
if (!finished)
ringloop->wait();
}
// Destroy the client
delete db;
db = NULL;
cli->flush();
delete cli;
delete epmgr;
delete ringloop;
cli = NULL;
epmgr = NULL;
ringloop = NULL;
}
void kv_cli_t::read_cmd()
{
if (!cur_cmd_alloc)
{
cur_cmd_alloc = 65536;
cur_cmd = (char*)malloc_or_die(cur_cmd_alloc);
}
while (cur_cmd_size < cur_cmd_alloc)
{
int r = read(0, cur_cmd+cur_cmd_size, cur_cmd_alloc-cur_cmd_size);
if (r < 0 && errno != EAGAIN)
fprintf(stderr, "Error reading from stdin: %s\n", strerror(errno));
if (r > 0)
cur_cmd_size += r;
if (r == 0)
eof = true;
if (r <= 0)
break;
}
next_cmd();
}
void kv_cli_t::next_cmd()
{
if (in_progress > 0)
{
return;
}
int pos = 0;
for (; pos < cur_cmd_size; pos++)
{
if (cur_cmd[pos] == '\n' || cur_cmd[pos] == '\r')
{
auto cmd = trim(std::string(cur_cmd, pos));
pos++;
memmove(cur_cmd, cur_cmd+pos, cur_cmd_size-pos);
cur_cmd_size -= pos;
in_progress++;
handle_cmd(cmd, [this]()
{
in_progress--;
if (interactive)
printf("> ");
next_cmd();
if (!in_progress)
read_cmd();
});
break;
}
}
if (eof && !in_progress)
{
finished = true;
}
}
void kv_cli_t::handle_cmd(const std::string & cmd, std::function<void()> cb)
{
if (cmd == "")
{
cb();
return;
}
auto pos = cmd.find_first_of(" \t");
if (pos != std::string::npos)
{
while (pos < cmd.size()-1 && (cmd[pos+1] == ' ' || cmd[pos+1] == '\t'))
pos++;
}
auto opname = strtolower(pos == std::string::npos ? cmd : cmd.substr(0, pos));
if (opname == "open")
{
uint64_t pool_id = 0;
inode_t inode_id = 0;
uint32_t kv_block_size = 0;
int scanned = sscanf(cmd.c_str() + pos+1, "%lu %lu %u", &pool_id, &inode_id, &kv_block_size);
if (scanned == 2)
{
kv_block_size = 4096;
}
if (scanned < 2 || !pool_id || !inode_id || !kv_block_size || (kv_block_size & (kv_block_size-1)) != 0)
{
fprintf(stderr, "Usage: open <pool_id> <inode_id> [block_size]. Block size must be a power of 2. Default is 4096.\n");
cb();
return;
}
cfg["kv_block_size"] = (uint64_t)kv_block_size;
db->open(INODE_WITH_POOL(pool_id, inode_id), cfg, [=](int res)
{
if (res < 0)
fprintf(stderr, "Error opening index: %s (code %d)\n", strerror(-res), res);
else
printf("Index opened. Current size: %lu bytes\n", db->get_size());
cb();
});
}
else if (opname == "config")
{
auto pos2 = cmd.find_first_of(" \t", pos+1);
if (pos2 == std::string::npos)
{
fprintf(stderr, "Usage: config <property> <value>\n");
cb();
return;
}
auto key = trim(cmd.substr(pos+1, pos2-pos-1));
auto value = parse_size(trim(cmd.substr(pos2+1)));
if (key != "kv_memory_limit" &&
key != "kv_allocate_blocks" &&
key != "kv_evict_max_misses" &&
key != "kv_evict_attempts_per_level" &&
key != "kv_evict_unused_age" &&
key != "kv_log_level")
{
fprintf(
stderr, "Allowed properties: kv_memory_limit, kv_allocate_blocks,"
" kv_evict_max_misses, kv_evict_attempts_per_level, kv_evict_unused_age, kv_log_level\n"
);
}
else
{
cfg[key] = value;
db->set_config(cfg);
}
cb();
}
else if (opname == "get" || opname == "set" || opname == "del")
{
if (opname == "get" || opname == "del")
{
if (pos == std::string::npos)
{
fprintf(stderr, "Usage: %s <key>\n", opname.c_str());
cb();
return;
}
auto key = trim(cmd.substr(pos+1));
if (opname == "get")
{
db->get(key, [this, cb](int res, const std::string & value)
{
if (res < 0)
fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
else
{
write(1, value.c_str(), value.size());
write(1, "\n", 1);
}
cb();
});
}
else
{
db->del(key, [this, cb](int res)
{
if (res < 0)
fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
else
printf("OK\n");
cb();
});
}
}
else
{
auto pos2 = cmd.find_first_of(" \t", pos+1);
if (pos2 == std::string::npos)
{
fprintf(stderr, "Usage: set <key> <value>\n");
cb();
return;
}
auto key = trim(cmd.substr(pos+1, pos2-pos-1));
auto value = trim(cmd.substr(pos2+1));
db->set(key, value, [this, cb](int res)
{
if (res < 0)
fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
else
printf("OK\n");
cb();
});
}
}
else if (opname == "list")
{
std::string start, end;
if (pos != std::string::npos)
{
auto pos2 = cmd.find_first_of(" \t", pos+1);
if (pos2 != std::string::npos)
{
start = trim(cmd.substr(pos+1, pos2-pos-1));
end = trim(cmd.substr(pos2+1));
}
else
{
start = trim(cmd.substr(pos+1));
}
}
void *handle = db->list_start(start);
db->list_next(handle, [=](int res, const std::string & key, const std::string & value)
{
if (res < 0)
{
if (res != -ENOENT)
{
fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
}
db->list_close(handle);
cb();
}
else
{
printf("%s = %s\n", key.c_str(), value.c_str());
db->list_next(handle, NULL);
}
});
}
else if (opname == "close")
{
db->close([=]()
{
printf("Index closed\n");
cb();
});
}
else if (opname == "quit" || opname == "q")
{
::close(0);
finished = true;
}
else
{
fprintf(
stderr, "Unknown operation: %s. Supported operations:\n"
"open <pool_id> <inode_id> [block_size]\n"
"config <property> <value>\n"
"get <key>\nset <key> <value>\ndel <key>\nlist [<start> [end]]\n"
"close\nquit\n", opname.c_str()
);
cb();
}
}
int main(int narg, const char *args[])
{
setvbuf(stdout, NULL, _IONBF, 0);
setvbuf(stderr, NULL, _IONBF, 0);
exe_name = args[0];
kv_cli_t *p = new kv_cli_t();
p->run(kv_cli_t::parse_args(narg, args));
delete p;
return 0;
}
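
The way `next_cmd()` peels one command off the input buffer can be looked at in isolation. Below is a minimal sketch, assuming a `std::string` buffer in place of the malloc'ed `cur_cmd` array. `take_line` is a hypothetical helper and is not part of Vitastor. Unlike the original, it does not trim the command or track `in_progress`.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: removes the first complete line from `buf` and stores it in `line`.
// Returns false (leaving `buf` untouched) when no '\n' or '\r' terminator has arrived yet.
bool take_line(std::string & buf, std::string & line)
{
    auto pos = buf.find_first_of("\r\n");
    if (pos == std::string::npos)
        return false;
    line = buf.substr(0, pos);
    // Drop the consumed line together with its terminator, keep the rest for the next call
    buf.erase(0, pos+1);
    return true;
}
```

A "\r\n" pair yields one extra empty line, which matches the original code: `handle_cmd()` treats an empty command as a no-op and calls the callback immediately.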

2037
src/kv_db.cpp Normal file

File diff suppressed because it is too large

36
src/kv_db.h Normal file

@ -0,0 +1,36 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// Vitastor shared key/value database
// Parallel optimistic B-Tree O:-)
#pragma once
#include "cluster_client.h"
struct kv_db_t;
struct kv_dbw_t
{
kv_dbw_t(cluster_client_t *cli);
~kv_dbw_t();
void open(inode_t inode_id, json11::Json cfg, std::function<void(int)> cb);
void set_config(json11::Json cfg);
void close(std::function<void()> cb);
uint64_t get_size();
void get(const std::string & key, std::function<void(int res, const std::string & value)> cb,
bool allow_old_cached = false);
void set(const std::string & key, const std::string & value, std::function<void(int res)> cb,
std::function<bool(int res, const std::string & value)> cas_compare = NULL);
void del(const std::string & key, std::function<void(int res)> cb,
std::function<bool(int res, const std::string & value)> cas_compare = NULL);
void* list_start(const std::string & start);
void list_next(void *handle, std::function<void(int res, const std::string & key, const std::string & value)> cb);
void list_close(void *handle);
kv_db_t *db;
};
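
The listing calls declared above are driven in a callback loop by the CLI's `list` command: the first `list_next()` installs the callback, and each later `list_next(handle, NULL)` asks for one more key until a negative `res` arrives. A rough sketch of that calling convention follows. `mock_kv_t` and `list_keys` are hypothetical stand-ins backed by a `std::map`. They run synchronously and assume that `start` is inclusive, neither of which is confirmed for the real `kv_dbw_t`.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in that mimics only the calling convention
// of list_start / list_next / list_close as kv_cli.cpp uses it
struct mock_kv_t
{
    typedef std::function<void(int res, const std::string & key, const std::string & value)> list_cb_t;
    struct listing_t
    {
        std::map<std::string, std::string>::const_iterator it;
        list_cb_t cb;
    };
    std::map<std::string, std::string> data;

    void* list_start(const std::string & start)
    {
        auto lst = new listing_t;
        lst->it = data.lower_bound(start); // assumption: start key is inclusive
        return lst;
    }
    void list_next(void *handle, list_cb_t cb)
    {
        auto lst = (listing_t*)handle;
        if (cb)
            lst->cb = cb; // an empty callback means "reuse the previous one"
        // Invoke a copy so that the callback may call list_close() safely
        list_cb_t cur_cb = lst->cb;
        if (lst->it == data.end())
        {
            cur_cb(-ENOENT, "", "");
            return;
        }
        auto cur = lst->it++;
        cur_cb(0, cur->first, cur->second);
    }
    void list_close(void *handle)
    {
        delete (listing_t*)handle;
    }
};

// Collects all keys starting from `start`, driving the listing like the CLI "list" command
std::vector<std::string> list_keys(mock_kv_t & db, const std::string & start)
{
    std::vector<std::string> keys;
    void *handle = db.list_start(start);
    db.list_next(handle, [&](int res, const std::string & key, const std::string &)
    {
        if (res < 0)
        {
            // End of listing (or an error): release the handle and stop
            db.list_close(handle);
            return;
        }
        keys.push_back(key);
        db.list_next(handle, nullptr);
    });
    return keys;
}
```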

697
src/kv_stress.cpp Normal file

@ -0,0 +1,697 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
//
// Vitastor shared key/value database stress tester / benchmark
#define _XOPEN_SOURCE
#include <limits.h>
#include <netinet/tcp.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <fcntl.h>
//#include <signal.h>
#include "epoll_manager.h"
#include "str_util.h"
#include "kv_db.h"
const char *exe_name = NULL;
struct kv_test_listing_t
{
uint64_t count = 0, done = 0;
void *handle = NULL;
std::string next_after;
std::set<std::string> inflights;
timespec tv_begin;
bool error = false;
};
struct kv_test_lat_t
{
const char *name = NULL;
uint64_t usec = 0, count = 0;
};
struct kv_test_stat_t
{
kv_test_lat_t get, add, update, del, list;
uint64_t list_keys = 0;
};
class kv_test_t
{
public:
// Config
json11::Json::object kv_cfg;
std::string key_prefix, key_suffix;
uint64_t inode_id = 0;
uint64_t op_count = 1000000;
uint64_t runtime_sec = 0;
uint64_t parallelism = 4;
uint64_t reopen_prob = 1;
uint64_t get_prob = 30000;
uint64_t add_prob = 20000;
uint64_t update_prob = 20000;
uint64_t del_prob = 5000;
uint64_t list_prob = 300;
uint64_t min_key_len = 10;
uint64_t max_key_len = 70;
uint64_t min_value_len = 50;
uint64_t max_value_len = 300;
uint64_t min_list_count = 10;
uint64_t max_list_count = 1000;
uint64_t print_stats_interval = 1;
bool json_output = false;
uint64_t log_level = 1;
bool trace = false;
bool stop_on_error = false;
// FIXME: Multiple clients
kv_test_stat_t stat, prev_stat;
timespec prev_stat_time, start_stat_time;
// State
kv_dbw_t *db = NULL;
ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL;
cluster_client_t *cli = NULL;
ring_consumer_t consumer;
bool finished = false;
uint64_t total_prob = 0;
uint64_t ops_sent = 0, ops_done = 0;
int stat_timer_id = -1;
int in_progress = 0;
bool reopening = false;
std::set<kv_test_listing_t*> listings;
std::set<std::string> changing_keys;
std::map<std::string, std::string> values;
~kv_test_t();
static json11::Json::object parse_args(int narg, const char *args[]);
void parse_config(json11::Json cfg);
void run(json11::Json cfg);
void loop();
void print_stats(kv_test_stat_t & prev_stat, timespec & prev_stat_time);
void print_total_stats();
void start_change(const std::string & key);
void stop_change(const std::string & key);
void add_stat(kv_test_lat_t & stat, timespec tv_begin);
};
kv_test_t::~kv_test_t()
{
if (db)
delete db;
if (cli)
{
cli->flush();
delete cli;
}
if (epmgr)
delete epmgr;
if (ringloop)
delete ringloop;
}
json11::Json::object kv_test_t::parse_args(int narg, const char *args[])
{
json11::Json::object cfg;
for (int i = 1; i < narg; i++)
{
if (!strcmp(args[i], "-h") || !strcmp(args[i], "--help"))
{
printf(
"Vitastor Key/Value DB stress tester / benchmark\n"
"(c) Vitaliy Filippov, 2023+ (VNPL-1.1)\n"
"\n"
"USAGE: %s --pool_id POOL_ID --inode_id INODE_ID [OPTIONS]\n"
" --op_count 1000000\n"
" Total operations to run during test. 0 means unlimited\n"
" --key_prefix \"\"\n"
" Prefix for all keys read or written (to avoid collisions)\n"
" --key_suffix \"\"\n"
" Suffix for all keys read or written (to avoid collisions, but scan all DB)\n"
" --runtime 0\n"
" Run for this number of seconds. 0 means unlimited\n"
" --parallelism 4\n"
" Run this number of operations in parallel\n"
" --get_prob 30000\n"
" Fraction of key retrieve operations\n"
" --add_prob 20000\n"
" Fraction of key addition operations\n"
" --update_prob 20000\n"
" Fraction of key update operations\n"
" --del_prob 5000\n"
" Fraction of key delete operations\n"
" --list_prob 300\n"
" Fraction of listing operations\n"
" --min_key_len 10\n"
" Minimum key size in bytes\n"
" --max_key_len 70\n"
" Maximum key size in bytes\n"
" --min_value_len 50\n"
" Minimum value size in bytes\n"
" --max_value_len 300\n"
" Maximum value size in bytes\n"
" --min_list_count 10\n"
" Minimum number of keys read in listing (0 = all keys)\n"
" --max_list_count 1000\n"
" Maximum number of keys read in listing\n"
" --print_stats 1\n"
" Print operation statistics every this number of seconds\n"
" --json\n"
" JSON output\n"
" --stop_on_error 0\n"
" Stop on first execution error, mismatch, lost key or extra key during listing\n"
" --kv_memory_limit 128M\n"
" Maximum memory to use for vitastor-kv index cache\n"
" --kv_allocate_blocks 4\n"
" Number of PG blocks used for new tree block allocation in parallel\n"
" --kv_evict_max_misses 10\n"
" Eviction algorithm parameter: retry eviction from another random spot\n"
" if this number of keys is used currently or was used recently\n"
" --kv_evict_attempts_per_level 3\n"
" Retry eviction at most this number of times per tree level, starting\n"
" with bottom-most levels\n"
" --kv_evict_unused_age 1000\n"
" Evict only keys unused during this number of last operations\n"
" --kv_log_level 1\n"
" Log level. 0 = errors, 1 = warnings, 10 = trace operations\n",
exe_name
);
exit(0);
}
else if (args[i][0] == '-' && args[i][1] == '-')
{
const char *opt = args[i]+2;
cfg[opt] = !strcmp(opt, "json") || i == narg-1 ? "1" : args[++i];
}
}
return cfg;
}
void kv_test_t::parse_config(json11::Json cfg)
{
inode_id = INODE_WITH_POOL(cfg["pool_id"].uint64_value(), cfg["inode_id"].uint64_value());
if (cfg["op_count"].uint64_value() > 0)
op_count = cfg["op_count"].uint64_value();
key_prefix = cfg["key_prefix"].string_value();
key_suffix = cfg["key_suffix"].string_value();
if (cfg["runtime"].uint64_value() > 0)
runtime_sec = cfg["runtime"].uint64_value();
if (cfg["parallelism"].uint64_value() > 0)
parallelism = cfg["parallelism"].uint64_value();
if (!cfg["reopen_prob"].is_null())
reopen_prob = cfg["reopen_prob"].uint64_value();
if (!cfg["get_prob"].is_null())
get_prob = cfg["get_prob"].uint64_value();
if (!cfg["add_prob"].is_null())
add_prob = cfg["add_prob"].uint64_value();
if (!cfg["update_prob"].is_null())
update_prob = cfg["update_prob"].uint64_value();
if (!cfg["del_prob"].is_null())
del_prob = cfg["del_prob"].uint64_value();
if (!cfg["list_prob"].is_null())
list_prob = cfg["list_prob"].uint64_value();
if (!cfg["min_key_len"].is_null())
min_key_len = cfg["min_key_len"].uint64_value();
if (cfg["max_key_len"].uint64_value() > 0)
max_key_len = cfg["max_key_len"].uint64_value();
if (!cfg["min_value_len"].is_null())
min_value_len = cfg["min_value_len"].uint64_value();
if (cfg["max_value_len"].uint64_value() > 0)
max_value_len = cfg["max_value_len"].uint64_value();
if (!cfg["min_list_count"].is_null())
min_list_count = cfg["min_list_count"].uint64_value();
if (!cfg["max_list_count"].is_null())
max_list_count = cfg["max_list_count"].uint64_value();
if (!cfg["print_stats"].is_null())
print_stats_interval = cfg["print_stats"].uint64_value();
if (!cfg["json"].is_null())
json_output = true;
if (!cfg["stop_on_error"].is_null())
stop_on_error = cfg["stop_on_error"].bool_value();
if (!cfg["kv_memory_limit"].is_null())
kv_cfg["kv_memory_limit"] = cfg["kv_memory_limit"];
if (!cfg["kv_allocate_blocks"].is_null())
kv_cfg["kv_allocate_blocks"] = cfg["kv_allocate_blocks"];
if (!cfg["kv_evict_max_misses"].is_null())
kv_cfg["kv_evict_max_misses"] = cfg["kv_evict_max_misses"];
if (!cfg["kv_evict_attempts_per_level"].is_null())
kv_cfg["kv_evict_attempts_per_level"] = cfg["kv_evict_attempts_per_level"];
if (!cfg["kv_evict_unused_age"].is_null())
kv_cfg["kv_evict_unused_age"] = cfg["kv_evict_unused_age"];
if (!cfg["kv_log_level"].is_null())
{
log_level = cfg["kv_log_level"].uint64_value();
trace = log_level >= 10;
kv_cfg["kv_log_level"] = cfg["kv_log_level"];
}
total_prob = reopen_prob+get_prob+add_prob+update_prob+del_prob+list_prob;
stat.get.name = "get";
stat.add.name = "add";
stat.update.name = "update";
stat.del.name = "del";
stat.list.name = "list";
}
void kv_test_t::run(json11::Json cfg)
{
srand48(time(NULL));
parse_config(cfg);
// Create client
ringloop = new ring_loop_t(512);
epmgr = new epoll_manager_t(ringloop);
cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
db = new kv_dbw_t(cli);
// Load image metadata
while (!cli->is_ready())
{
ringloop->loop();
if (cli->is_ready())
break;
ringloop->wait();
}
// Run
reopening = true;
db->open(inode_id, kv_cfg, [this](int res)
{
reopening = false;
if (res < 0)
{
fprintf(stderr, "ERROR: Open index: %d (%s)\n", res, strerror(-res));
exit(1);
}
if (trace)
printf("Index opened\n");
ringloop->wakeup();
});
consumer.loop = [this]() { loop(); };
ringloop->register_consumer(&consumer);
if (print_stats_interval)
stat_timer_id = epmgr->tfd->set_timer(print_stats_interval*1000, true, [this](int) { print_stats(prev_stat, prev_stat_time); });
clock_gettime(CLOCK_REALTIME, &start_stat_time);
prev_stat_time = start_stat_time;
while (!finished)
{
ringloop->loop();
if (!finished)
ringloop->wait();
}
if (stat_timer_id >= 0)
epmgr->tfd->clear_timer(stat_timer_id);
ringloop->unregister_consumer(&consumer);
// Print total stats
print_total_stats();
// Destroy the client
delete db;
db = NULL;
cli->flush();
delete cli;
delete epmgr;
delete ringloop;
cli = NULL;
epmgr = NULL;
ringloop = NULL;
}
static const char *base64_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@+/";
std::string random_str(int len)
{
std::string str;
str.resize(len);
for (int i = 0; i < len; i++)
{
str[i] = base64_chars[lrand48() % 64];
}
return str;
}
void kv_test_t::loop()
{
if (reopening)
{
return;
}
if (ops_done >= op_count)
{
finished = true;
}
while (!finished && ops_sent < op_count && in_progress < parallelism)
{
uint64_t dice = (lrand48() % total_prob);
if (dice < reopen_prob)
{
reopening = true;
db->close([this]()
{
if (trace)
printf("Index closed\n");
db->open(inode_id, kv_cfg, [this](int res)
{
reopening = false;
if (res < 0)
{
fprintf(stderr, "ERROR: Reopen index: %d (%s)\n", res, strerror(-res));
finished = true;
return;
}
if (trace)
printf("Index reopened\n");
ringloop->wakeup();
});
});
return;
}
else if (dice < reopen_prob+get_prob)
{
// get existing
auto key = random_str(max_key_len);
auto k_it = values.lower_bound(key);
if (k_it == values.end())
continue;
key = k_it->first;
if (changing_keys.find(key) != changing_keys.end())
continue;
in_progress++;
ops_sent++;
if (trace)
printf("get %s\n", key.c_str());
timespec tv_begin;
clock_gettime(CLOCK_REALTIME, &tv_begin);
db->get(key, [this, key, tv_begin](int res, const std::string & value)
{
add_stat(stat.get, tv_begin);
ops_done++;
in_progress--;
auto it = values.find(key);
if (res != (it == values.end() ? -ENOENT : 0))
{
fprintf(stderr, "ERROR: get %s: %d (%s)\n", key.c_str(), res, strerror(-res));
if (stop_on_error)
exit(1);
}
else if (it != values.end() && value != it->second)
{
fprintf(stderr, "ERROR: get %s: mismatch: %s vs %s\n", key.c_str(), value.c_str(), it->second.c_str());
if (stop_on_error)
exit(1);
}
ringloop->wakeup();
});
}
else if (dice < reopen_prob+get_prob+add_prob+update_prob)
{
bool is_add = false;
std::string key;
if (dice < reopen_prob+get_prob+add_prob)
{
// add
is_add = true;
uint64_t key_len = min_key_len + (max_key_len > min_key_len ? lrand48() % (max_key_len-min_key_len) : 0);
key = key_prefix + random_str(key_len) + key_suffix;
}
else
{
// update
key = random_str(max_key_len);
auto k_it = values.lower_bound(key);
if (k_it == values.end())
continue;
key = k_it->first;
}
if (changing_keys.find(key) != changing_keys.end())
continue;
uint64_t value_len = min_value_len + (max_value_len > min_value_len ? lrand48() % (max_value_len-min_value_len) : 0);
auto value = random_str(value_len);
start_change(key);
ops_sent++;
in_progress++;
if (trace)
printf("set %s = %s\n", key.c_str(), value.c_str());
timespec tv_begin;
clock_gettime(CLOCK_REALTIME, &tv_begin);
db->set(key, value, [this, key, value, tv_begin, is_add](int res)
{
add_stat(is_add ? stat.add : stat.update, tv_begin);
stop_change(key);
ops_done++;
in_progress--;
if (res != 0)
{
fprintf(stderr, "ERROR: set %s = %s: %d (%s)\n", key.c_str(), value.c_str(), res, strerror(-res));
if (stop_on_error)
exit(1);
}
else
{
values[key] = value;
}
ringloop->wakeup();
}, NULL);
}
else if (dice < reopen_prob+get_prob+add_prob+update_prob+del_prob)
{
// delete
auto key = random_str(max_key_len);
auto k_it = values.lower_bound(key);
if (k_it == values.end())
continue;
key = k_it->first;
if (changing_keys.find(key) != changing_keys.end())
continue;
start_change(key);
ops_sent++;
in_progress++;
if (trace)
printf("del %s\n", key.c_str());
timespec tv_begin;
clock_gettime(CLOCK_REALTIME, &tv_begin);
db->del(key, [this, key, tv_begin](int res)
{
add_stat(stat.del, tv_begin);
stop_change(key);
ops_done++;
in_progress--;
if (res != 0)
{
fprintf(stderr, "ERROR: del %s: %d (%s)\n", key.c_str(), res, strerror(-res));
if (stop_on_error)
exit(1);
}
else
{
values.erase(key);
}
ringloop->wakeup();
}, NULL);
}
else if (dice < reopen_prob+get_prob+add_prob+update_prob+del_prob+list_prob)
{
// list
ops_sent++;
in_progress++;
auto key = random_str(max_key_len);
auto lst = new kv_test_listing_t;
auto k_it = values.lower_bound(key);
lst->count = min_list_count + (max_list_count > min_list_count ? lrand48() % (max_list_count-min_list_count) : 0);
lst->handle = db->list_start(k_it == values.begin() ? key_prefix : key);
lst->next_after = k_it == values.begin() ? key_prefix : key;
lst->inflights = changing_keys;
listings.insert(lst);
if (trace)
printf("list from %s\n", key.c_str());
clock_gettime(CLOCK_REALTIME, &lst->tv_begin);
db->list_next(lst->handle, [this, lst](int res, const std::string & key, const std::string & value)
{
if (log_level >= 11)
printf("list: %s = %s\n", key.c_str(), value.c_str());
if (res >= 0 && key_prefix.size() && (key.size() < key_prefix.size() ||
key.substr(0, key_prefix.size()) != key_prefix))
{
// stop at this key
res = -ENOENT;
}
if (res < 0 || (lst->count > 0 && lst->done >= lst->count))
{
add_stat(stat.list, lst->tv_begin);
if (res == 0)
{
// ok (done >= count)
}
else if (res != -ENOENT)
{
fprintf(stderr, "ERROR: list: %d (%s)\n", res, strerror(-res));
lst->error = true;
}
else
{
auto k_it = lst->next_after == "" ? values.begin() : values.upper_bound(lst->next_after);
while (k_it != values.end())
{
while (k_it != values.end() && lst->inflights.find(k_it->first) != lst->inflights.end())
k_it++;
if (k_it != values.end())
{
fprintf(stderr, "ERROR: list: missing key %s\n", (k_it++)->first.c_str());
lst->error = true;
}
}
}
if (lst->error && stop_on_error)
exit(1);
ops_done++;
in_progress--;
db->list_close(lst->handle);
// Remove the pointer from the set before freeing it
listings.erase(lst);
delete lst;
ringloop->wakeup();
}
else
{
stat.list_keys++;
// Do not check modified keys in listing
// Listing may return their old or new state
if ((!key_suffix.size() || key.size() >= key_suffix.size() &&
key.substr(key.size()-key_suffix.size()) == key_suffix) &&
lst->inflights.find(key) == lst->inflights.end())
{
lst->done++;
auto k_it = lst->next_after == "" ? values.begin() : values.upper_bound(lst->next_after);
while (true)
{
while (k_it != values.end() && lst->inflights.find(k_it->first) != lst->inflights.end())
{
k_it++;
}
if (k_it == values.end() || k_it->first > key)
{
fprintf(stderr, "ERROR: list: extra key %s\n", key.c_str());
lst->error = true;
break;
}
else if (k_it->first < key)
{
fprintf(stderr, "ERROR: list: missing key %s\n", k_it->first.c_str());
lst->error = true;
lst->next_after = k_it->first;
k_it++;
}
else
{
if (k_it->second != value)
{
fprintf(stderr, "ERROR: list: mismatch: %s = %s but should be %s\n",
key.c_str(), value.c_str(), k_it->second.c_str());
lst->error = true;
}
lst->next_after = k_it->first;
break;
}
}
}
db->list_next(lst->handle, NULL);
}
});
}
}
}
void kv_test_t::add_stat(kv_test_lat_t & stat, timespec tv_begin)
{
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
int64_t usec = (tv_end.tv_sec - tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - tv_begin.tv_nsec)/1000;
if (usec > 0)
{
stat.usec += usec;
stat.count++;
}
}
void kv_test_t::print_stats(kv_test_stat_t & prev_stat, timespec & prev_stat_time)
{
timespec cur_stat_time;
clock_gettime(CLOCK_REALTIME, &cur_stat_time);
int64_t usec = (cur_stat_time.tv_sec - prev_stat_time.tv_sec)*1000000 +
(cur_stat_time.tv_nsec - prev_stat_time.tv_nsec)/1000;
if (usec > 0)
{
kv_test_lat_t *lats[] = { &stat.get, &stat.add, &stat.update, &stat.del, &stat.list };
kv_test_lat_t *prev[] = { &prev_stat.get, &prev_stat.add, &prev_stat.update, &prev_stat.del, &prev_stat.list };
if (!json_output)
{
char buf[128] = { 0 };
for (int i = 0; i < sizeof(lats)/sizeof(lats[0]); i++)
{
snprintf(buf, sizeof(buf)-1, "%.1f %s/s (%lu us)", (lats[i]->count-prev[i]->count)*1000000.0/usec,
lats[i]->name, (lats[i]->usec-prev[i]->usec)/(lats[i]->count-prev[i]->count > 0 ? lats[i]->count-prev[i]->count : 1));
int k;
for (k = strlen(buf); k < strlen(lats[i]->name)+21; k++)
buf[k] = ' ';
buf[k] = 0;
printf("%s", buf);
}
printf("\n");
}
else
{
int64_t runtime = (cur_stat_time.tv_sec - start_stat_time.tv_sec)*1000000 +
(cur_stat_time.tv_nsec - start_stat_time.tv_nsec)/1000;
printf("{\"runtime\":%.1f", (double)runtime/1000000.0);
for (int i = 0; i < sizeof(lats)/sizeof(lats[0]); i++)
{
if (lats[i]->count > prev[i]->count)
{
printf(
",\"%s\":{\"avg\":{\"iops\":%.1f,\"usec\":%lu},\"total\":{\"count\":%lu,\"usec\":%lu}}",
lats[i]->name, (lats[i]->count-prev[i]->count)*1000000.0/usec,
(lats[i]->usec-prev[i]->usec)/(lats[i]->count-prev[i]->count),
lats[i]->count, lats[i]->usec
);
}
}
printf("}\n");
}
}
prev_stat = stat;
prev_stat_time = cur_stat_time;
}
void kv_test_t::print_total_stats()
{
if (!json_output)
printf("Total:\n");
kv_test_stat_t start_stats;
timespec start_stat_time = this->start_stat_time;
print_stats(start_stats, start_stat_time);
}
void kv_test_t::start_change(const std::string & key)
{
changing_keys.insert(key);
for (auto lst: listings)
{
lst->inflights.insert(key);
}
}
void kv_test_t::stop_change(const std::string & key)
{
changing_keys.erase(key);
}
int main(int narg, const char *args[])
{
setvbuf(stdout, NULL, _IONBF, 0);
setvbuf(stderr, NULL, _IONBF, 0);
exe_name = args[0];
kv_test_t *p = new kv_test_t();
p->run(kv_test_t::parse_args(narg, args));
delete p;
return 0;
}
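
`kv_test_t::loop()` picks each operation by comparing one random `dice` value against running sums of the `*_prob` weights. The same selection can be sketched on its own. `pick_op` is a hypothetical helper written for illustration, not a function from the source.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: maps a dice value in [0, sum of weights) to the name of
// the operation whose cumulative weight range contains it
std::string pick_op(uint64_t dice, const std::vector<std::pair<std::string, uint64_t>> & weights)
{
    uint64_t upper = 0;
    for (auto & w: weights)
    {
        upper += w.second;
        if (dice < upper)
            return w.first;
    }
    // dice is outside of the total range
    return "";
}
```

With the default weights from the class definition (reopen 1, get 30000, add 20000, update 20000, del 5000, list 300) the total is 75301, so a reopen happens on roughly one draw in 75301.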


@ -149,7 +149,7 @@ public:
 std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
 std::map<uint64_t, int> osd_peer_fds;
 // op statistics
-osd_op_stats_t stats;
+osd_op_stats_t stats, recovery_stats;
 void init();
 void parse_config(const json11::Json & config);
@ -175,6 +175,7 @@ public:
 bool connect_rdma(int peer_fd, std::string rdma_address, uint64_t client_max_msg);
 #endif
+void inc_op_stats(osd_op_stats_t & stats, uint64_t opcode, timespec & tv_begin, timespec & tv_end, uint64_t len);
 void measure_exec(osd_op_t *cur_op);
 protected:


@ -24,3 +24,17 @@ osd_op_t::~osd_op_t()
 free(buf);
 }
 }
+bool osd_op_t::is_recovery_related()
+{
+return (req.hdr.opcode == OSD_OP_SEC_READ ||
+req.hdr.opcode == OSD_OP_SEC_WRITE ||
+req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) &&
+(req.sec_rw.flags & OSD_OP_RECOVERY_RELATED) ||
+req.hdr.opcode == OSD_OP_SEC_DELETE &&
+(req.sec_del.flags & OSD_OP_RECOVERY_RELATED) ||
+req.hdr.opcode == OSD_OP_SEC_STABILIZE &&
+(req.sec_stab.flags & OSD_OP_RECOVERY_RELATED) ||
+req.hdr.opcode == OSD_OP_SEC_SYNC &&
+(req.sec_sync.flags & OSD_OP_RECOVERY_RELATED);
+}


@ -173,4 +173,6 @@ struct osd_op_t
 osd_op_buf_list_t iov;
 ~osd_op_t();
+bool is_recovery_related();
 };


@ -131,6 +131,23 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
 }
 }
+void osd_messenger_t::inc_op_stats(osd_op_stats_t & stats, uint64_t opcode, timespec & tv_begin, timespec & tv_end, uint64_t len)
+{
+uint64_t usecs = (
+(tv_end.tv_sec - tv_begin.tv_sec)*1000000 +
+(tv_end.tv_nsec - tv_begin.tv_nsec)/1000
+);
+stats.op_stat_count[opcode]++;
+if (!stats.op_stat_count[opcode])
+{
+stats.op_stat_count[opcode] = 1;
+stats.op_stat_sum[opcode] = 0;
+stats.op_stat_bytes[opcode] = 0;
+}
+stats.op_stat_sum[opcode] += usecs;
+stats.op_stat_bytes[opcode] += len;
+}
 void osd_messenger_t::measure_exec(osd_op_t *cur_op)
 {
 // Measure execution latency
@ -142,29 +159,24 @@ void osd_messenger_t::measure_exec(osd_op_t *cur_op)
 {
 clock_gettime(CLOCK_REALTIME, &cur_op->tv_end);
 }
-stats.op_stat_count[cur_op->req.hdr.opcode]++;
-if (!stats.op_stat_count[cur_op->req.hdr.opcode])
-{
-stats.op_stat_count[cur_op->req.hdr.opcode]++;
-stats.op_stat_sum[cur_op->req.hdr.opcode] = 0;
-stats.op_stat_bytes[cur_op->req.hdr.opcode] = 0;
-}
-stats.op_stat_sum[cur_op->req.hdr.opcode] += (
-(cur_op->tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
-(cur_op->tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
-);
+uint64_t len = 0;
 if (cur_op->req.hdr.opcode == OSD_OP_READ ||
 cur_op->req.hdr.opcode == OSD_OP_WRITE ||
 cur_op->req.hdr.opcode == OSD_OP_SCRUB)
 {
 // req.rw.len is internally set to the full object size for scrubs
-stats.op_stat_bytes[cur_op->req.hdr.opcode] += cur_op->req.rw.len;
+len = cur_op->req.rw.len;
 }
 else if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ ||
 cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
 cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
 {
-stats.op_stat_bytes[cur_op->req.hdr.opcode] += cur_op->req.sec_rw.len;
+len = cur_op->req.sec_rw.len;
+}
+inc_op_stats(stats, cur_op->req.hdr.opcode, cur_op->tv_begin, cur_op->tv_end, len);
+if (cur_op->is_recovery_related())
+{
+inc_op_stats(recovery_stats, cur_op->req.hdr.opcode, cur_op->tv_begin, cur_op->tv_end, len);
 }
 }
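
The accumulation that `inc_op_stats()` performs for one opcode can be tried standalone. The sketch below assumes a simplified, hypothetical `op_stat_t` with a single set of counters instead of the per-opcode arrays of `osd_op_stats_t`, whose definition is not shown in this diff. `elapsed_usec` and `add_op` are likewise illustrative names.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

// Elapsed microseconds between two timestamps, using the same formula as inc_op_stats()
uint64_t elapsed_usec(const timespec & tv_begin, const timespec & tv_end)
{
    return (tv_end.tv_sec - tv_begin.tv_sec)*1000000 +
        (tv_end.tv_nsec - tv_begin.tv_nsec)/1000;
}

// Hypothetical single-opcode counter set
struct op_stat_t
{
    uint64_t count = 0, usec = 0, bytes = 0;
};

void add_op(op_stat_t & st, const timespec & tv_begin, const timespec & tv_end, uint64_t len)
{
    st.count++;
    if (!st.count)
    {
        // The counter wrapped around: restart all sums so averages stay meaningful
        st.count = 1;
        st.usec = 0;
        st.bytes = 0;
    }
    st.usec += elapsed_usec(tv_begin, tv_end);
    st.bytes += len;
}
```

Dividing `usec` by `count` then gives the average latency, which is why the sums are reset together with the counter on wraparound.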


@ -30,7 +30,7 @@ protected:
 std::string image_name;
 uint64_t inode = 0;
 uint64_t device_size = 0;
-int nbd_timeout = 30;
+int nbd_timeout = 300;
 int nbd_max_devices = 64;
 int nbd_max_part = 3;
 inode_watch_t *watch = NULL;
@ -135,14 +135,16 @@ public:
 " %s unmap /dev/nbd0\n"
 " %s ls [--json]\n"
 "OPTIONS:\n"
-" All usual Vitastor config options like --etcd_address <etcd_address> plus NBD-specific:\n"
-" --nbd_timeout 30\n"
+" All usual Vitastor config options like --config_file <path_to_config> plus NBD-specific:\n"
+" --nbd_timeout 300\n"
 " Timeout for I/O operations in seconds after exceeding which the kernel stops\n"
 " the device. You can set it to 0 to disable the timeout, but beware that you\n"
 " won't be able to stop the device at all if vitastor-nbd process dies.\n"
 " --nbd_max_devices 64 --nbd_max_part 3\n"
 " Options for the \"nbd\" kernel module when modprobing it (nbds_max and max_part).\n"
 " note that maximum allowed (nbds_max)*(1+max_part) is 256.\n"
+" Note that nbd_timeout, nbd_max_devices and nbd_max_part options may also be specified\n"
+" in /etc/vitastor/vitastor.conf or in other configuration file specified with --config_file.\n"
 " --logfile /path/to/log/file.txt\n"
 " Write log messages to the specified file instead of dropping them (in background mode)\n"
 " or printing them to the standard output (in foreground mode).\n"
@ -204,17 +206,18 @@ public:
 exit(1);
 }
 }
-if (cfg["nbd_max_devices"].is_number() || cfg["nbd_max_devices"].is_string())
+auto file_config = osd_messenger_t::read_config(cfg);
+if (file_config["nbd_max_devices"].is_number() || file_config["nbd_max_devices"].is_string())
 {
-nbd_max_devices = cfg["nbd_max_devices"].uint64_value();
+nbd_max_devices = file_config["nbd_max_devices"].uint64_value();
 }
-if (cfg["nbd_max_part"].is_number() || cfg["nbd_max_part"].is_string())
+if (file_config["nbd_max_part"].is_number() || file_config["nbd_max_part"].is_string())
 {
-nbd_max_part = cfg["nbd_max_part"].uint64_value();
+nbd_max_part = file_config["nbd_max_part"].uint64_value();
 }
-if (cfg["nbd_timeout"].is_number() || cfg["nbd_timeout"].is_string())
+if (file_config["nbd_timeout"].is_number() || file_config["nbd_timeout"].is_string())
 {
-nbd_timeout = cfg["nbd_timeout"].uint64_value();
+nbd_timeout = file_config["nbd_timeout"].uint64_value();
 }
 if (cfg["client_writeback_allowed"].is_null())
 {
@ -272,7 +275,7 @@ public:
 int i = 0;
 while (true)
 {
-int r = run_nbd(sockfd, i, device_size, NBD_FLAG_SEND_FLUSH, 30, bg);
+int r = run_nbd(sockfd, i, device_size, NBD_FLAG_SEND_FLUSH, nbd_timeout, bg);
 if (r == 0)
 {
 printf("/dev/nbd%d\n", i);


@ -56,7 +56,7 @@ json11::Json::object nfs_proxy_t::parse_args(int narg, const char *args[])
"(c) Vitaliy Filippov, 2021-2022 (VNPL-1.1)\n" "(c) Vitaliy Filippov, 2021-2022 (VNPL-1.1)\n"
"\n" "\n"
"USAGE:\n" "USAGE:\n"
" %s [--etcd_address ADDR] [OTHER OPTIONS]\n" " %s [STANDARD OPTIONS] [OTHER OPTIONS]\n"
" --subdir <DIR> export images prefixed <DIR>/ (default empty - export all images)\n" " --subdir <DIR> export images prefixed <DIR>/ (default empty - export all images)\n"
" --portmap 0 do not listen on port 111 (portmap/rpcbind, requires root)\n" " --portmap 0 do not listen on port 111 (portmap/rpcbind, requires root)\n"
" --bind <IP> bind service to <IP> address (default 0.0.0.0)\n" " --bind <IP> bind service to <IP> address (default 0.0.0.0)\n"


@ -68,14 +68,21 @@ osd_t::osd_t(const json11::Json & config, ring_loop_t *ringloop)
} }
} }
print_stats_timer_id = this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id) if (print_stats_timer_id == -1)
{ {
print_stats(); print_stats_timer_id = this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id)
}); {
slow_log_timer_id = this->tfd->set_timer(slow_log_interval*1000, true, [this](int timer_id) print_stats();
});
}
if (slow_log_timer_id == -1)
{ {
print_slow(); slow_log_timer_id = this->tfd->set_timer(slow_log_interval*1000, true, [this](int timer_id)
}); {
print_slow();
});
}
apply_recovery_tune_interval();
msgr.tfd = this->tfd; msgr.tfd = this->tfd;
msgr.ringloop = this->ringloop; msgr.ringloop = this->ringloop;
@ -97,6 +104,11 @@ osd_t::~osd_t()
tfd->clear_timer(slow_log_timer_id); tfd->clear_timer(slow_log_timer_id);
slow_log_timer_id = -1; slow_log_timer_id = -1;
} }
if (rtune_timer_id >= 0)
{
tfd->clear_timer(rtune_timer_id);
rtune_timer_id = -1;
}
if (print_stats_timer_id >= 0) if (print_stats_timer_id >= 0)
{ {
tfd->clear_timer(print_stats_timer_id); tfd->clear_timer(print_stats_timer_id);
@ -196,6 +208,30 @@ void osd_t::parse_config(bool init)
recovery_queue_depth = config["recovery_queue_depth"].uint64_value(); recovery_queue_depth = config["recovery_queue_depth"].uint64_value();
if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE) if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE)
recovery_queue_depth = DEFAULT_RECOVERY_QUEUE; recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
recovery_sleep_us = config["recovery_sleep_us"].uint64_value();
recovery_tune_util_low = config["recovery_tune_util_low"].is_null()
? 0.1 : config["recovery_tune_util_low"].number_value();
if (recovery_tune_util_low < 0.01)
recovery_tune_util_low = 0.01;
recovery_tune_util_high = config["recovery_tune_util_high"].is_null()
? 1.0 : config["recovery_tune_util_high"].number_value();
if (recovery_tune_util_high < 0.01)
recovery_tune_util_high = 0.01;
recovery_tune_client_util_low = config["recovery_tune_client_util_low"].is_null()
? 0 : config["recovery_tune_client_util_low"].number_value();
if (recovery_tune_client_util_low < 0.01)
recovery_tune_client_util_low = 0.01;
recovery_tune_client_util_high = config["recovery_tune_client_util_high"].is_null()
? 0.5 : config["recovery_tune_client_util_high"].number_value();
if (recovery_tune_client_util_high < 0.01)
recovery_tune_client_util_high = 0.01;
auto old_recovery_tune_interval = recovery_tune_interval;
recovery_tune_interval = config["recovery_tune_interval"].is_null()
? 1 : config["recovery_tune_interval"].uint64_value();
recovery_tune_agg_interval = config["recovery_tune_agg_interval"].is_null()
? 10 : config["recovery_tune_agg_interval"].uint64_value();
recovery_tune_sleep_min_us = config["recovery_tune_sleep_min_us"].is_null()
? 10 : config["recovery_tune_sleep_min_us"].uint64_value();
recovery_pg_switch = config["recovery_pg_switch"].uint64_value(); recovery_pg_switch = config["recovery_pg_switch"].uint64_value();
if (recovery_pg_switch < 1) if (recovery_pg_switch < 1)
recovery_pg_switch = DEFAULT_RECOVERY_PG_SWITCH; recovery_pg_switch = DEFAULT_RECOVERY_PG_SWITCH;
@ -274,6 +310,10 @@ void osd_t::parse_config(bool init)
print_slow(); print_slow();
}); });
} }
if (old_recovery_tune_interval != recovery_tune_interval)
{
apply_recovery_tune_interval();
}
} }
void osd_t::bind_socket() void osd_t::bind_socket()
@ -421,14 +461,6 @@ void osd_t::exec_op(osd_op_t *cur_op)
} }
} }
void osd_t::reset_stats()
{
msgr.stats = {};
prev_stats = {};
memset(recovery_stat_count, 0, sizeof(recovery_stat_count));
memset(recovery_stat_bytes, 0, sizeof(recovery_stat_bytes));
}
void osd_t::print_stats() void osd_t::print_stats()
{ {
for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++) for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
@ -466,19 +498,20 @@ void osd_t::print_stats()
} }
for (int i = 0; i < 2; i++) for (int i = 0; i < 2; i++)
{ {
if (recovery_stat_count[0][i] != recovery_stat_count[1][i]) if (recovery_stat[i].count > recovery_print_prev[i].count)
{ {
uint64_t bw = (recovery_stat_bytes[0][i] - recovery_stat_bytes[1][i]) / print_stats_interval; uint64_t bw = (recovery_stat[i].bytes - recovery_print_prev[i].bytes) / print_stats_interval;
printf( printf(
"[OSD %lu] %s recovery: %.1f op/s, B/W: %.2f %s\n", osd_num, recovery_stat_names[i], "[OSD %lu] %s recovery: %.1f op/s, B/W: %.2f %s, avg latency %ld us, delay %ld us\n", osd_num, recovery_stat_names[i],
(recovery_stat_count[0][i] - recovery_stat_count[1][i]) * 1.0 / print_stats_interval, (recovery_stat[i].count - recovery_print_prev[i].count) * 1.0 / print_stats_interval,
(bw > 1024*1024*1024 ? bw/1024.0/1024/1024 : (bw > 1024*1024 ? bw/1024.0/1024 : bw/1024.0)), (bw > 1024*1024*1024 ? bw/1024.0/1024/1024 : (bw > 1024*1024 ? bw/1024.0/1024 : bw/1024.0)),
(bw > 1024*1024*1024 ? "GB/s" : (bw > 1024*1024 ? "MB/s" : "KB/s")) (bw > 1024*1024*1024 ? "GB/s" : (bw > 1024*1024 ? "MB/s" : "KB/s")),
(recovery_stat[i].usec - recovery_print_prev[i].usec) / (recovery_stat[i].count - recovery_print_prev[i].count),
recovery_target_sleep_us
); );
recovery_stat_count[1][i] = recovery_stat_count[0][i];
recovery_stat_bytes[1][i] = recovery_stat_bytes[0][i];
} }
} }
memcpy(recovery_print_prev, recovery_stat, sizeof(recovery_stat));
if (corrupted_objects > 0) if (corrupted_objects > 0)
{ {
printf("[OSD %lu] %lu object(s) corrupted\n", osd_num, corrupted_objects); printf("[OSD %lu] %lu object(s) corrupted\n", osd_num, corrupted_objects);
@ -572,8 +605,8 @@ void osd_t::print_slow()
op->req.hdr.opcode == OSD_OP_SEC_STABILIZE || op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK || op->req.hdr.opcode == OSD_OP_SEC_STABILIZE || op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK ||
op->req.hdr.opcode == OSD_OP_SEC_READ_BMP) op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
{ {
bufprintf(" state=%d", PRIV(op->bs_op)->op_state); bufprintf(" state=%d", op->bs_op ? PRIV(op->bs_op)->op_state : -1);
int wait_for = PRIV(op->bs_op)->wait_for; int wait_for = op->bs_op ? PRIV(op->bs_op)->wait_for : 0;
if (wait_for) if (wait_for)
{ {
bufprintf(" wait=%d (detail=%lu)", wait_for, PRIV(op->bs_op)->wait_detail); bufprintf(" wait=%d (detail=%lu)", wait_for, PRIV(op->bs_op)->wait_detail);


@ -34,7 +34,7 @@
#define DEFAULT_AUTOSYNC_INTERVAL 5 #define DEFAULT_AUTOSYNC_INTERVAL 5
#define DEFAULT_AUTOSYNC_WRITES 128 #define DEFAULT_AUTOSYNC_WRITES 128
#define MAX_RECOVERY_QUEUE 2048 #define MAX_RECOVERY_QUEUE 2048
#define DEFAULT_RECOVERY_QUEUE 4 #define DEFAULT_RECOVERY_QUEUE 1
#define DEFAULT_RECOVERY_PG_SWITCH 128 #define DEFAULT_RECOVERY_PG_SWITCH 128
#define DEFAULT_RECOVERY_BATCH 16 #define DEFAULT_RECOVERY_BATCH 16
@ -87,6 +87,11 @@ struct osd_chain_read_t
struct osd_rmw_stripe_t; struct osd_rmw_stripe_t;
struct recovery_stat_t
{
uint64_t count, usec, bytes;
};
class osd_t class osd_t
{ {
// config // config
@ -111,7 +116,15 @@ class osd_t
int immediate_commit = IMMEDIATE_NONE; int immediate_commit = IMMEDIATE_NONE;
int autosync_interval = DEFAULT_AUTOSYNC_INTERVAL; // "emergency" sync every 5 seconds int autosync_interval = DEFAULT_AUTOSYNC_INTERVAL; // "emergency" sync every 5 seconds
int autosync_writes = DEFAULT_AUTOSYNC_WRITES; int autosync_writes = DEFAULT_AUTOSYNC_WRITES;
int recovery_queue_depth = DEFAULT_RECOVERY_QUEUE; uint64_t recovery_queue_depth = 1;
uint64_t recovery_sleep_us = 0;
double recovery_tune_util_low = 0.1;
double recovery_tune_client_util_low = 0;
double recovery_tune_util_high = 1.0;
double recovery_tune_client_util_high = 0.5;
int recovery_tune_interval = 1;
int recovery_tune_agg_interval = 10;
int recovery_tune_sleep_min_us = 10;
int recovery_pg_switch = DEFAULT_RECOVERY_PG_SWITCH; int recovery_pg_switch = DEFAULT_RECOVERY_PG_SWITCH;
int recovery_sync_batch = DEFAULT_RECOVERY_BATCH; int recovery_sync_batch = DEFAULT_RECOVERY_BATCH;
int inode_vanish_time = 60; int inode_vanish_time = 60;
@ -189,8 +202,18 @@ class osd_t
std::map<uint64_t, inode_stats_t> inode_stats; std::map<uint64_t, inode_stats_t> inode_stats;
std::map<uint64_t, timespec> vanishing_inodes; std::map<uint64_t, timespec> vanishing_inodes;
const char* recovery_stat_names[2] = { "degraded", "misplaced" }; const char* recovery_stat_names[2] = { "degraded", "misplaced" };
uint64_t recovery_stat_count[2][2] = {}; recovery_stat_t recovery_stat[2];
uint64_t recovery_stat_bytes[2][2] = {}; recovery_stat_t recovery_print_prev[2];
// recovery auto-tuning
int rtune_timer_id = -1;
uint64_t rtune_avg_lat = 0;
double rtune_client_util = 0, rtune_target_util = 1;
osd_op_stats_t rtune_prev_stats, rtune_prev_recovery_stats;
std::vector<uint64_t> recovery_target_sleep_items;
uint64_t recovery_target_sleep_us = 0;
uint64_t recovery_target_sleep_total = 0;
int recovery_target_sleep_cur = 0, recovery_target_sleep_count = 0;
// cluster connection // cluster connection
void parse_config(bool init); void parse_config(bool init);
@ -208,8 +231,9 @@ class osd_t
void create_osd_state(); void create_osd_state();
void renew_lease(bool reload); void renew_lease(bool reload);
void print_stats(); void print_stats();
void tune_recovery();
void apply_recovery_tune_interval();
void print_slow(); void print_slow();
void reset_stats();
json11::Json get_statistics(); json11::Json get_statistics();
void report_statistics(); void report_statistics();
void report_pg_state(pg_t & pg); void report_pg_state(pg_t & pg);
@ -238,6 +262,7 @@ class osd_t
bool submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t *fb, bool rollback, osd_num_t peer_osd, int count, obj_ver_id *data); bool submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t *fb, bool rollback, osd_num_t peer_osd, int count, obj_ver_id *data);
bool pick_next_recovery(osd_recovery_op_t &op); bool pick_next_recovery(osd_recovery_op_t &op);
void submit_recovery_op(osd_recovery_op_t *op); void submit_recovery_op(osd_recovery_op_t *op);
void finish_recovery_op(osd_recovery_op_t *op);
bool continue_recovery(); bool continue_recovery();
pg_osd_set_state_t* change_osd_set(pg_osd_set_state_t *st, pg_t *pg); pg_osd_set_state_t* change_osd_set(pg_osd_set_state_t *st, pg_t *pg);
@ -279,7 +304,7 @@ class osd_t
bool remember_unstable_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t & loc_set, int base_state); bool remember_unstable_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t & loc_set, int base_state);
void handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op); void handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op);
void handle_primary_bs_subop(osd_op_t *subop); void handle_primary_bs_subop(osd_op_t *subop);
void add_bs_subop_stats(osd_op_t *subop); void add_bs_subop_stats(osd_op_t *subop, bool recovery_related = false);
void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval); void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval);
void submit_primary_subops(int submit_type, uint64_t op_version, const uint64_t* osd_set, osd_op_t *cur_op); void submit_primary_subops(int submit_type, uint64_t op_version, const uint64_t* osd_set, osd_op_t *cur_op);


@ -213,12 +213,14 @@ json11::Json osd_t::get_statistics()
st["subop_stats"] = subop_stats; st["subop_stats"] = subop_stats;
st["recovery_stats"] = json11::Json::object { st["recovery_stats"] = json11::Json::object {
{ recovery_stat_names[0], json11::Json::object { { recovery_stat_names[0], json11::Json::object {
{ "count", recovery_stat_count[0][0] }, { "count", recovery_stat[0].count },
{ "bytes", recovery_stat_bytes[0][0] }, { "bytes", recovery_stat[0].bytes },
{ "usec", recovery_stat[0].usec },
} }, } },
{ recovery_stat_names[1], json11::Json::object { { recovery_stat_names[1], json11::Json::object {
{ "count", recovery_stat_count[0][1] }, { "count", recovery_stat[1].count },
{ "bytes", recovery_stat_bytes[0][1] }, { "bytes", recovery_stat[1].bytes },
{ "usec", recovery_stat[1].usec },
} }, } },
}; };
return st; return st;
@ -649,7 +651,7 @@ void osd_t::apply_pg_config()
{ {
pg_num_t pg_num = kv.first; pg_num_t pg_num = kv.first;
auto & pg_cfg = kv.second; auto & pg_cfg = kv.second;
bool take = pg_cfg.exists && pg_cfg.primary == this->osd_num && bool take = pg_cfg.config_exists && pg_cfg.primary == this->osd_num &&
!pg_cfg.pause && (!pg_cfg.cur_primary || pg_cfg.cur_primary == this->osd_num); !pg_cfg.pause && (!pg_cfg.cur_primary || pg_cfg.cur_primary == this->osd_num);
auto pg_it = this->pgs.find({ .pool_id = pool_id, .pg_num = pg_num }); auto pg_it = this->pgs.find({ .pool_id = pool_id, .pg_num = pg_num });
bool currently_taken = pg_it != this->pgs.end() && pg_it->second.state != PG_OFFLINE; bool currently_taken = pg_it != this->pgs.end() && pg_it->second.state != PG_OFFLINE;


@ -325,26 +325,129 @@ void osd_t::submit_recovery_op(osd_recovery_op_t *op)
{ {
printf("Recovery operation done for %lx:%lx\n", op->oid.inode, op->oid.stripe); printf("Recovery operation done for %lx:%lx\n", op->oid.inode, op->oid.stripe);
} }
// CAREFUL! op = &recovery_ops[op->oid]. Don't access op->* after recovery_ops.erase() finish_recovery_op(op);
op->osd_op = NULL;
recovery_ops.erase(op->oid);
delete osd_op;
if (immediate_commit != IMMEDIATE_ALL)
{
recovery_done++;
if (recovery_done >= recovery_sync_batch)
{
// Force sync every <recovery_sync_batch> operations
// This is required not to pile up an excessive amount of delete operations
autosync();
recovery_done = 0;
}
}
continue_recovery();
}; };
exec_op(op->osd_op); exec_op(op->osd_op);
} }
void osd_t::apply_recovery_tune_interval()
{
if (rtune_timer_id >= 0)
{
tfd->clear_timer(rtune_timer_id);
rtune_timer_id = -1;
}
if (recovery_tune_interval != 0)
{
rtune_timer_id = this->tfd->set_timer(recovery_tune_interval*1000, true, [this](int timer_id)
{
tune_recovery();
});
}
else
{
recovery_target_sleep_us = recovery_sleep_us;
}
}
void osd_t::finish_recovery_op(osd_recovery_op_t *op)
{
// CAREFUL! op = &recovery_ops[op->oid]. Don't access op->* after recovery_ops.erase()
delete op->osd_op;
op->osd_op = NULL;
recovery_ops.erase(op->oid);
if (immediate_commit != IMMEDIATE_ALL)
{
recovery_done++;
if (recovery_done >= recovery_sync_batch)
{
// Force sync every <recovery_sync_batch> operations
// This is required not to pile up an excessive amount of delete operations
autosync();
recovery_done = 0;
}
}
continue_recovery();
}
void osd_t::tune_recovery()
{
static int accounted_ops[] = {
OSD_OP_SEC_READ, OSD_OP_SEC_WRITE, OSD_OP_SEC_WRITE_STABLE,
OSD_OP_SEC_STABILIZE, OSD_OP_SEC_SYNC, OSD_OP_SEC_DELETE
};
uint64_t total_client_usec = 0, total_recovery_usec = 0, recovery_count = 0;
for (int i = 0; i < sizeof(accounted_ops)/sizeof(accounted_ops[0]); i++)
{
total_client_usec += (msgr.stats.op_stat_sum[accounted_ops[i]]
- rtune_prev_stats.op_stat_sum[accounted_ops[i]]);
total_recovery_usec += (msgr.recovery_stats.op_stat_sum[accounted_ops[i]]
- rtune_prev_recovery_stats.op_stat_sum[accounted_ops[i]]);
recovery_count += (msgr.recovery_stats.op_stat_count[accounted_ops[i]]
- rtune_prev_recovery_stats.op_stat_count[accounted_ops[i]]);
rtune_prev_stats.op_stat_sum[accounted_ops[i]] = msgr.stats.op_stat_sum[accounted_ops[i]];
rtune_prev_recovery_stats.op_stat_sum[accounted_ops[i]] = msgr.recovery_stats.op_stat_sum[accounted_ops[i]];
rtune_prev_recovery_stats.op_stat_count[accounted_ops[i]] = msgr.recovery_stats.op_stat_count[accounted_ops[i]];
}
total_client_usec -= total_recovery_usec;
if (recovery_count == 0)
{
return;
}
// example:
// total 3 GB/s
// recovery queue 1
// 120 OSDs
// EC 5+3
// 128kb block_size => 640kb object
// 3000*1024/640/120 = 40 MB/s per OSD = 64 recovered objects per OSD
// = 64*8*2 subops = 1024 recovery subop iops
// 8 recovery subop queue
// => subop avg latency = 0.0078125 sec
// utilisation = 8
// target util 1
// intuitively target latency should be 8x of real
// target_lat = rtune_avg_lat * utilisation / target_util
// = rtune_avg_lat * rtune_avg_lat * rtune_avg_iops / target_util
// = 0.0625
// recovery utilisation will be 1
rtune_client_util = total_client_usec/1000000.0/recovery_tune_interval;
rtune_target_util = (rtune_client_util < recovery_tune_client_util_low
? recovery_tune_util_high
: recovery_tune_util_low + (rtune_client_util >= recovery_tune_client_util_high
? 0 : (recovery_tune_util_high-recovery_tune_util_low)*
(recovery_tune_client_util_high-rtune_client_util)/(recovery_tune_client_util_high-recovery_tune_client_util_low)
)
);
rtune_avg_lat = total_recovery_usec/recovery_count;
uint64_t target_lat = rtune_avg_lat * rtune_avg_lat/1000000.0 * recovery_count/recovery_tune_interval / rtune_target_util;
auto sleep_us = target_lat > rtune_avg_lat+recovery_tune_sleep_min_us ? target_lat-rtune_avg_lat : 0;
if (recovery_target_sleep_items.size() != recovery_tune_agg_interval)
{
recovery_target_sleep_items.resize(recovery_tune_agg_interval);
for (int i = 0; i < recovery_tune_agg_interval; i++)
recovery_target_sleep_items[i] = 0;
recovery_target_sleep_total = 0;
recovery_target_sleep_cur = 0;
recovery_target_sleep_count = 0;
}
recovery_target_sleep_total -= recovery_target_sleep_items[recovery_target_sleep_cur];
recovery_target_sleep_items[recovery_target_sleep_cur] = sleep_us;
recovery_target_sleep_cur = (recovery_target_sleep_cur+1) % recovery_tune_agg_interval;
recovery_target_sleep_total += sleep_us;
if (recovery_target_sleep_count < recovery_tune_agg_interval)
recovery_target_sleep_count++;
recovery_target_sleep_us = recovery_target_sleep_total / recovery_target_sleep_count;
if (log_level > 4)
{
printf(
"[OSD %lu] auto-tune: client util: %.2f, recovery util: %.2f, lat: %lu us -> target util %.2f, delay %lu us\n",
osd_num, rtune_client_util, total_recovery_usec/1000000.0/recovery_tune_interval,
rtune_avg_lat, rtune_target_util, recovery_target_sleep_us
);
}
}
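The auto-tuning arithmetic in `tune_recovery()` above can be sketched in isolation. This is a minimal, hypothetical restatement, not the OSD code itself: the `*_sketch` names and the parameter struct are assumptions made for illustration, and the defaults are the ones that appear in the `parse_config()` hunk (with `recovery_tune_client_util_low` already clamped to 0.01). It shows the linear interpolation of the target utilisation, the conversion of that target into a per-subop delay, and the fixed-window average that smooths the delay.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the tuning options named in the diff.
struct tune_params_sketch_t
{
    double util_low = 0.1, util_high = 1.0;
    double client_util_low = 0.01, client_util_high = 0.5;
    uint64_t sleep_min_us = 10;
};

// Target recovery utilisation: util_high when clients are idle, util_low when
// client utilisation reaches client_util_high, linear in between.
double target_util_sketch(const tune_params_sketch_t & p, double client_util)
{
    if (client_util < p.client_util_low)
        return p.util_high;
    if (client_util >= p.client_util_high)
        return p.util_low;
    return p.util_low + (p.util_high - p.util_low) *
        (p.client_util_high - client_util) / (p.client_util_high - p.client_util_low);
}

// utilisation = avg latency (sec) * ops per second;
// target latency = avg latency * utilisation / target utilisation;
// the delay is the difference, or 0 if it would not exceed sleep_min_us.
uint64_t recovery_sleep_us_sketch(const tune_params_sketch_t & p, uint64_t avg_lat_us,
    uint64_t ops, uint64_t interval_sec, double target_util)
{
    uint64_t target_lat = avg_lat_us * (avg_lat_us / 1000000.0) * ops / interval_sec / target_util;
    return target_lat > avg_lat_us + p.sleep_min_us ? target_lat - avg_lat_us : 0;
}

// Moving average over the last `window` samples (window must be > 0).
struct sleep_avg_sketch_t
{
    std::vector<uint64_t> items;
    uint64_t total = 0;
    size_t cur = 0, count = 0;

    explicit sleep_avg_sketch_t(size_t window): items(window, 0) {}

    uint64_t add(uint64_t sleep_us)
    {
        total -= items[cur];           // drop the sample being overwritten
        items[cur] = sleep_us;
        cur = (cur + 1) % items.size();
        total += sleep_us;
        if (count < items.size())
            count++;                   // average only over samples seen so far
        return total / count;
    }
};
```

With the numbers from the comment in the diff (about 7812 us average subop latency, 1024 subops per second, target utilisation 1), the sketch gives a target latency near 62.5 ms and therefore a delay of roughly 54.7 ms per subop.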
// Just trigger write requests for degraded objects. They'll be recovered during writing // Just trigger write requests for degraded objects. They'll be recovered during writing
bool osd_t::continue_recovery() bool osd_t::continue_recovery()
{ {


@ -34,6 +34,7 @@
#define OSD_OP_MAX 18 #define OSD_OP_MAX 18
#define OSD_RW_MAX 64*1024*1024 #define OSD_RW_MAX 64*1024*1024
#define OSD_PROTOCOL_VERSION 1 #define OSD_PROTOCOL_VERSION 1
#define OSD_OP_RECOVERY_RELATED (uint32_t)1
// Memory alignment for direct I/O (usually 512 bytes) // Memory alignment for direct I/O (usually 512 bytes)
#ifndef DIRECT_IO_ALIGNMENT #ifndef DIRECT_IO_ALIGNMENT
@ -88,7 +89,8 @@ struct __attribute__((__packed__)) osd_op_sec_rw_t
uint32_t len; uint32_t len;
// bitmap/attribute length - bitmap comes after header, but before data // bitmap/attribute length - bitmap comes after header, but before data
uint32_t attr_len; uint32_t attr_len;
uint32_t pad0; // the only possible flag is OSD_OP_RECOVERY_RELATED
uint32_t flags;
}; };
struct __attribute__((__packed__)) osd_reply_sec_rw_t struct __attribute__((__packed__)) osd_reply_sec_rw_t
@ -109,6 +111,9 @@ struct __attribute__((__packed__)) osd_op_sec_del_t
object_id oid; object_id oid;
// delete version (automatic or specific) // delete version (automatic or specific)
uint64_t version; uint64_t version;
// the only possible flag is OSD_OP_RECOVERY_RELATED
uint32_t flags;
uint32_t pad0;
}; };
struct __attribute__((__packed__)) osd_reply_sec_del_t struct __attribute__((__packed__)) osd_reply_sec_del_t
@ -121,6 +126,9 @@ struct __attribute__((__packed__)) osd_reply_sec_del_t
struct __attribute__((__packed__)) osd_op_sec_sync_t struct __attribute__((__packed__)) osd_op_sec_sync_t
{ {
osd_op_header_t header; osd_op_header_t header;
// the only possible flag is OSD_OP_RECOVERY_RELATED
uint32_t flags;
uint32_t pad0;
}; };
struct __attribute__((__packed__)) osd_reply_sec_sync_t struct __attribute__((__packed__)) osd_reply_sec_sync_t
@ -134,6 +142,9 @@ struct __attribute__((__packed__)) osd_op_sec_stab_t
osd_op_header_t header; osd_op_header_t header;
// obj_ver_id array length in bytes // obj_ver_id array length in bytes
uint64_t len; uint64_t len;
// the only possible flag is OSD_OP_RECOVERY_RELATED
uint32_t flags;
uint32_t pad0;
}; };
typedef osd_op_sec_stab_t osd_op_sec_rollback_t; typedef osd_op_sec_stab_t osd_op_sec_rollback_t;
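The hunks above replace the `pad0` field of `osd_op_sec_rw_t` with `flags` and append a `flags` plus `pad0` pair to the delete, sync and stabilize requests. A minimal sketch of that layout follows. The 24-byte header stand-in and all `*_sketch` names are assumptions for illustration only (the real `osd_op_header_t` is not shown in this diff), and `is_recovery_related_sketch` is a hypothetical helper, not the `is_recovery_related()` method referenced elsewhere in the diff.

```cpp
#include <cstdint>

// Mirrors OSD_OP_RECOVERY_RELATED from the diff; renamed to mark it as a sketch.
#define OSD_OP_RECOVERY_RELATED_SKETCH ((uint32_t)1)

// Stand-in for a sync request: header, then flags, then explicit padding.
struct __attribute__((__packed__)) sec_sync_sketch_t
{
    uint8_t header_placeholder[24]; // assumed size, stands in for osd_op_header_t
    uint32_t flags;                 // the only flag is "recovery related"
    uint32_t pad0;                  // keeps the struct a multiple of 8 bytes
};

static_assert(sizeof(sec_sync_sketch_t) == 32, "packed layout: 24 + 4 + 4 bytes");

// Hypothetical helper: test the flag bit on the receiving side.
inline bool is_recovery_related_sketch(uint32_t flags)
{
    return (flags & OSD_OP_RECOVERY_RELATED_SKETCH) != 0;
}
```

Pairing the 32-bit `flags` with a 32-bit `pad0` keeps every packed request struct a multiple of 8 bytes, which is presumably why the padding field is retained.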


@ -3,13 +3,15 @@
#include "osd_primary.h" #include "osd_primary.h"
#define SELF_FD -1
void osd_t::autosync() void osd_t::autosync()
{ {
if (immediate_commit != IMMEDIATE_ALL && !autosync_op) if (immediate_commit != IMMEDIATE_ALL && !autosync_op)
{ {
autosync_op = new osd_op_t(); autosync_op = new osd_op_t();
autosync_op->op_type = OSD_OP_IN; autosync_op->op_type = OSD_OP_IN;
autosync_op->peer_fd = -1; autosync_op->peer_fd = SELF_FD;
autosync_op->req = (osd_any_op_t){ autosync_op->req = (osd_any_op_t){
.sync = { .sync = {
.header = { .header = {
@ -85,9 +87,13 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
cur_op->reply.hdr.id = cur_op->req.hdr.id; cur_op->reply.hdr.id = cur_op->req.hdr.id;
cur_op->reply.hdr.opcode = cur_op->req.hdr.opcode; cur_op->reply.hdr.opcode = cur_op->req.hdr.opcode;
cur_op->reply.hdr.retval = retval; cur_op->reply.hdr.retval = retval;
if (cur_op->peer_fd == -1) if (cur_op->peer_fd == SELF_FD)
{ {
msgr.measure_exec(cur_op); // Do not include internal primary writes (recovery/rebalance) into client op statistics
if (cur_op->req.hdr.opcode != OSD_OP_WRITE)
{
msgr.measure_exec(cur_op);
}
// Copy lambda to be unaffected by `delete op` // Copy lambda to be unaffected by `delete op`
std::function<void(osd_op_t*)>(cur_op->callback)(cur_op); std::function<void(osd_op_t*)>(cur_op->callback)(cur_op);
} }
@ -215,6 +221,7 @@ int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t o
.offset = wr ? si->write_start : si->read_start, .offset = wr ? si->write_start : si->read_start,
.len = subop_len, .len = subop_len,
.attr_len = wr ? clean_entry_bitmap_size : 0, .attr_len = wr ? clean_entry_bitmap_size : 0,
.flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
}; };
#ifdef OSD_DEBUG #ifdef OSD_DEBUG
printf( printf(
@ -294,7 +301,8 @@ void osd_t::handle_primary_bs_subop(osd_op_t *subop)
" retval = "+std::to_string(bs_op->retval)+")" " retval = "+std::to_string(bs_op->retval)+")"
); );
} }
add_bs_subop_stats(subop); bool recovery_related = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB;
add_bs_subop_stats(subop, recovery_related);
subop->req.hdr.opcode = bs_op_to_osd_op[bs_op->opcode]; subop->req.hdr.opcode = bs_op_to_osd_op[bs_op->opcode];
subop->reply.hdr.retval = bs_op->retval; subop->reply.hdr.retval = bs_op->retval;
if (bs_op->opcode == BS_OP_READ || bs_op->opcode == BS_OP_WRITE || bs_op->opcode == BS_OP_WRITE_STABLE) if (bs_op->opcode == BS_OP_READ || bs_op->opcode == BS_OP_WRITE || bs_op->opcode == BS_OP_WRITE_STABLE)
@ -306,30 +314,33 @@ void osd_t::handle_primary_bs_subop(osd_op_t *subop)
} }
delete bs_op; delete bs_op;
subop->bs_op = NULL; subop->bs_op = NULL;
subop->peer_fd = -1; subop->peer_fd = SELF_FD;
handle_primary_subop(subop, cur_op); if (recovery_related && recovery_target_sleep_us)
{
tfd->set_timer_us(recovery_target_sleep_us, false, [=](int timer_id)
{
handle_primary_subop(subop, cur_op);
});
}
else
{
handle_primary_subop(subop, cur_op);
}
} }
void osd_t::add_bs_subop_stats(osd_op_t *subop) void osd_t::add_bs_subop_stats(osd_op_t *subop, bool recovery_related)
{ {
// Include local blockstore ops in statistics // Include local blockstore ops in statistics
uint64_t opcode = bs_op_to_osd_op[subop->bs_op->opcode]; uint64_t opcode = bs_op_to_osd_op[subop->bs_op->opcode];
timespec tv_end; timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end); clock_gettime(CLOCK_REALTIME, &tv_end);
msgr.stats.op_stat_count[opcode]++; uint64_t len = (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE)
if (!msgr.stats.op_stat_count[opcode]) ? subop->bs_op->len : 0;
msgr.inc_op_stats(msgr.stats, opcode, subop->tv_begin, tv_end, len);
if (recovery_related)
{ {
msgr.stats.op_stat_count[opcode] = 1; // It is OSD_OP_RECOVERY_RELATED
msgr.stats.op_stat_sum[opcode] = 0; msgr.inc_op_stats(msgr.recovery_stats, opcode, subop->tv_begin, tv_end, len);
msgr.stats.op_stat_bytes[opcode] = 0;
}
msgr.stats.op_stat_sum[opcode] += (
(tv_end.tv_sec - subop->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - subop->tv_begin.tv_nsec)/1000
);
if (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE)
{
msgr.stats.op_stat_bytes[opcode] += subop->bs_op->len;
} }
} }
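The change to `add_bs_subop_stats()` above counts every local blockstore subop in the general statistics and, when it is recovery-related, counts it a second time in a separate recovery set. The sketch below illustrates that double accounting under stated assumptions: `op_stats_sketch_t`, `inc_stats_sketch` and `account_subop_sketch` are hypothetical names, and the real `msgr.inc_op_stats()` (whose body is not part of this diff) may differ. The elapsed-time formula is the one visible in the removed lines.

```cpp
#include <cstdint>
#include <ctime>

// Hypothetical per-opcode counters; the real stats struct keeps arrays indexed by opcode.
struct op_stats_sketch_t
{
    uint64_t count = 0, usec = 0, bytes = 0;
};

// Elapsed microseconds between two timespecs. The nanosecond difference may be
// negative, so the arithmetic is done in signed 64-bit integers.
uint64_t elapsed_usec_sketch(const timespec & begin, const timespec & end)
{
    int64_t usec = (int64_t)(end.tv_sec - begin.tv_sec) * 1000000
        + (int64_t)(end.tv_nsec - begin.tv_nsec) / 1000;
    return (uint64_t)usec;
}

void inc_stats_sketch(op_stats_sketch_t & st, const timespec & begin, const timespec & end, uint64_t len)
{
    st.count++;
    st.usec += elapsed_usec_sketch(begin, end);
    st.bytes += len;
}

// Every subop goes into `total`; recovery-related ones are also added to `recovery`,
// so client load can later be derived as total minus recovery.
void account_subop_sketch(op_stats_sketch_t & total, op_stats_sketch_t & recovery,
    bool recovery_related, const timespec & begin, const timespec & end, uint64_t len)
{
    inc_stats_sketch(total, begin, end, len);
    if (recovery_related)
        inc_stats_sketch(recovery, begin, end, len);
}
```

This matches how `tune_recovery()` uses the two sets: it subtracts the recovery time from the total time to estimate client utilisation.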
@ -552,6 +563,7 @@ void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_
}, },
.oid = chunk.oid, .oid = chunk.oid,
.version = chunk.version, .version = chunk.version,
.flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
} }; } };
subops[i].callback = [cur_op, this](osd_op_t *subop) subops[i].callback = [cur_op, this](osd_op_t *subop)
{ {
@ -609,6 +621,7 @@ int osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
.id = msgr.next_subop_id++, .id = msgr.next_subop_id++,
.opcode = OSD_OP_SEC_SYNC, .opcode = OSD_OP_SEC_SYNC,
}, },
.flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
} }; } };
subops[i].callback = [cur_op, this](osd_op_t *subop) subops[i].callback = [cur_op, this](osd_op_t *subop)
{ {
@ -668,6 +681,7 @@ void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
.opcode = OSD_OP_SEC_STABILIZE, .opcode = OSD_OP_SEC_STABILIZE,
}, },
.len = (uint64_t)(stab_osd.len * sizeof(obj_ver_id)), .len = (uint64_t)(stab_osd.len * sizeof(obj_ver_id)),
.flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
} }; } };
subops[i].iov.push_back(op_data->unstable_writes + stab_osd.start, stab_osd.len * sizeof(obj_ver_id)); subops[i].iov.push_back(op_data->unstable_writes + stab_osd.start, stab_osd.len * sizeof(obj_ver_id));
subops[i].callback = [cur_op, this](osd_op_t *subop) subops[i].callback = [cur_op, this](osd_op_t *subop)


@ -292,16 +292,26 @@ resume_7:
{ {
{ {
int recovery_type = op_data->object_state->state & (OBJ_DEGRADED|OBJ_INCOMPLETE) ? 0 : 1; int recovery_type = op_data->object_state->state & (OBJ_DEGRADED|OBJ_INCOMPLETE) ? 0 : 1;
recovery_stat_count[0][recovery_type]++; recovery_stat[recovery_type].count++;
if (!recovery_stat_count[0][recovery_type]) if (!recovery_stat[recovery_type].count) // wrapped
{ {
recovery_stat_count[0][recovery_type]++; memset(&recovery_print_prev[recovery_type], 0, sizeof(recovery_print_prev[recovery_type]));
recovery_stat_bytes[0][recovery_type] = 0; memset(&recovery_stat[recovery_type], 0, sizeof(recovery_stat[recovery_type]));
recovery_stat[recovery_type].count++;
} }
for (int role = 0; role < (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size); role++) for (int role = 0; role < (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size); role++)
{ {
recovery_stat_bytes[0][recovery_type] += op_data->stripes[role].write_end - op_data->stripes[role].write_start; recovery_stat[recovery_type].bytes += op_data->stripes[role].write_end - op_data->stripes[role].write_start;
} }
if (!cur_op->tv_end.tv_sec)
{
clock_gettime(CLOCK_REALTIME, &cur_op->tv_end);
}
uint64_t usec = (
(cur_op->tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
(cur_op->tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
);
recovery_stat[recovery_type].usec += usec;
} }
// Any kind of a non-clean object can have extra chunks, because we don't record objects // Any kind of a non-clean object can have extra chunks, because we don't record objects
// as degraded & misplaced or incomplete & misplaced at the same time. So try to remove extra chunks // as degraded & misplaced or incomplete & misplaced at the same time. So try to remove extra chunks


@ -42,7 +42,21 @@ void osd_t::secondary_op_callback(osd_op_t *op)
int retval = op->bs_op->retval; int retval = op->bs_op->retval;
delete op->bs_op; delete op->bs_op;
op->bs_op = NULL; op->bs_op = NULL;
finish_op(op, retval); if (op->is_recovery_related() && recovery_target_sleep_us)
{
if (!op->tv_end.tv_sec)
{
clock_gettime(CLOCK_REALTIME, &op->tv_end);
}
tfd->set_timer_us(recovery_target_sleep_us, false, [this, op, retval](int timer_id)
{
finish_op(op, retval);
});
}
else
{
finish_op(op, retval);
}
} }
void osd_t::exec_secondary(osd_op_t *cur_op) void osd_t::exec_secondary(osd_op_t *cur_op)


@ -196,10 +196,11 @@ static void vitastor_parse_filename(const char *filename, QDict *options, Error
!strcmp(name, "rdma-gid-index") || !strcmp(name, "rdma-gid-index") ||
!strcmp(name, "rdma-mtu")) !strcmp(name, "rdma-mtu"))
{ {
unsigned long long num_val;
#if QEMU_VERSION_MAJOR < 8 || QEMU_VERSION_MAJOR == 8 && QEMU_VERSION_MINOR < 1 #if QEMU_VERSION_MAJOR < 8 || QEMU_VERSION_MAJOR == 8 && QEMU_VERSION_MINOR < 1
unsigned long long num_val;
if (parse_uint_full(value, &num_val, 0)) if (parse_uint_full(value, &num_val, 0))
#else #else
uint64_t num_val;
if (parse_uint_full(value, 0, &num_val)) if (parse_uint_full(value, 0, &num_val))
#endif #endif
{ {


@ -90,6 +90,12 @@ void timerfd_manager_t::clear_timer(int timer_id)
void timerfd_manager_t::set_nearest() void timerfd_manager_t::set_nearest()
{ {
if (onstack > 0)
{
// Prevent re-entry
return;
}
onstack++;
again: again:
if (!timers.size()) if (!timers.size())
{ {
@ -139,6 +145,7 @@ again:
} }
wait_state = wait_state | 1; wait_state = wait_state | 1;
} }
onstack--;
} }
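The `onstack` counter added to `set_nearest()` above is a plain re-entry guard. The sketch below shows the pattern on its own. The class and the `fire` callback are hypothetical, and it is only an assumption that the nested call in the real code comes from a timer callback that schedules another timer.

```cpp
#include <functional>

// Minimal re-entry guard: a nested call made while the function is already
// on the stack returns immediately instead of running the body again.
struct reentry_guard_sketch_t
{
    int onstack = 0;
    int runs = 0; // how many times the guarded body actually executed

    void set_nearest(const std::function<void()> & fire)
    {
        if (onstack > 0)
        {
            // Prevent re-entry
            return;
        }
        onstack++;
        runs++;
        fire(); // may call set_nearest() again; that call is a no-op
        onstack--;
    }
};
```

A counter rather than a boolean leaves room for nested increments later; every return path after `onstack++` has to decrement it again, which is why the diff places `onstack--` at the very end of the function.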
void timerfd_manager_t::handle_readable() void timerfd_manager_t::handle_readable()


@ -22,6 +22,7 @@ class timerfd_manager_t
int timerfd; int timerfd;
int nearest = -1; int nearest = -1;
int id = 1; int id = 1;
int onstack = 0;
std::vector<timerfd_timer_t> timers; std::vector<timerfd_timer_t> timers;
void inc_timer(timerfd_timer_t & t); void inc_timer(timerfd_timer_t & t);


@ -6,7 +6,7 @@ includedir=${prefix}/@CMAKE_INSTALL_INCLUDEDIR@
Name: Vitastor Name: Vitastor
Description: Vitastor client library Description: Vitastor client library
Version: 1.2.0 Version: 1.3.1
Libs: -L${libdir} -lvitastor_client Libs: -L${libdir} -lvitastor_client
Cflags: -I${includedir} Cflags: -I${includedir}


@@ -19,10 +19,10 @@ fi
 if [ "$IMMEDIATE_COMMIT" != "" ]; then
     NO_SAME="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all --log_level 10 --etcd_stats_interval 5"
-    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"immediate_commit":"all","client_enable_writeback":true}'
+    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"recovery_tune_util_low":1,"osd_out_time":1,"immediate_commit":"all","client_enable_writeback":true}'
 else
     NO_SAME="--journal_sector_buffer_count 1024 --log_level 10 --etcd_stats_interval 5"
-    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"client_enable_writeback":true}'
+    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"recovery_tune_util_low":1,"osd_out_time":1,"client_enable_writeback":true}'
 fi
 start_osd_on()
@@ -53,7 +53,7 @@ for i in $(seq 1 $OSD_COUNT); do
     start_osd $i
 done
-(while true; do node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) &>./testdata/mon.log &
+(while true; do node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) >>./testdata/mon.log 2>&1 &
 MON_PID=$!
 if [ "$SCHEME" = "ec" ]; then

@@ -18,6 +18,7 @@ try_change()
     for i in {1..6}; do
         echo --- Change PG count to $n --- >>testdata/osd$i.log
     done
+    echo --- Change PG count to $n --- >>testdata/mon.log
     $ETCDCTL put /vitastor/config/pools '{"1":{'$POOLCFG',"pg_size":'$PG_SIZE',"pg_minsize":'$PG_MINSIZE',"pg_count":'$n'}}'

@@ -15,7 +15,7 @@ $ETCDCTL put /vitastor/osd/stats/7 '{"host":"host4","size":1073741824,"time":"'$
 $ETCDCTL put /vitastor/osd/stats/8 '{"host":"host4","size":1073741824,"time":"'$TIME'"}'
 $ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":4,"failure_domain":"rack"}}'
-node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
+node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" >>./testdata/mon.log 2>&1 &
 MON_PID=$!
 sleep 2

Some files were not shown because too many files have changed in this diff