Compare commits

147 Commits

Author SHA1 Message Date
9ad6822353 Release 1.5.0
After half a year of hard work, VitastorFS is finally here! :-)

New features:
- VitastorFS, a full-featured clustered (read-write-many) file system.
  Documentation: [VitastorFS](docs/usage/nfs.en.md)
- Embedded key-value database implementation based on the Parallel Optimistic B-Tree
  algorithm, used to store the metadata of VitastorFS
- Pool management commands in vitastor-cli (create-pool, list-pools, rm-pool, modify-pool).
  Thanks to MIND Software (https://mindsw.io) for their contribution!
  [Documentation](docs/usage/cli.en.md#create-pool)

Bug fixes:
- Fix a very rare "infinite loop" in the client library
- Fix a rare OSD hang during start when zeroing out bad metadata entries left from the previous run
2024-03-16 15:35:10 +03:00
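For readers who want to try the new pool management commands, a minimal usage sketch follows. It assumes the syntax documented in docs/usage/cli.en.md#create-pool (-s for pg_size, -n for pg_count, --ec for the EC scheme); pool names and values below are placeholders, so check `vitastor-cli --help` before copying.

```
# Replicated pool with 2 copies and 256 PGs (placeholder name/values)
vitastor-cli create-pool testpool -s 2 -n 256
# Erasure-coded 2+1 pool
vitastor-cli create-pool ecpool --ec 2+1 -n 256
# Inspect and remove pools
vitastor-cli list-pools
vitastor-cli rm-pool testpool
```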
2043b4e374 Fix build errors for gcc 8 2024-03-16 15:35:10 +03:00
de840e6fe3 Reduce kv-cli loadjson load parallelism to 16 2024-03-16 15:35:10 +03:00
b5e04bf809 Fix build warning 2024-03-16 15:35:10 +03:00
8807a1623b Fix markdown tables 2024-03-16 15:35:10 +03:00
f12855c31b Add vitastor-kv to packages 2024-03-16 15:35:10 +03:00
e75dcc9a71 Add documentation for VitastorFS 2024-03-16 15:16:43 +03:00
88516ab4bd Remove extra log 2024-03-16 13:24:36 +03:00
6221126b4f Allow to print simple-offsets just given the device size 2024-03-16 13:24:36 +03:00
6783d4a13c Implement fool protection for FS pools 2024-03-16 13:24:36 +03:00
dcbe1afac3 Store pool ID in inode metadata 2024-03-16 13:24:36 +03:00
0bde28c24a Make nfs_do_rmw a library function 2024-03-16 13:24:36 +03:00
bb8ca6184e Support setattr guard 2024-03-16 13:24:36 +03:00
87310ef7bb Support ctime 2024-03-16 13:24:36 +03:00
4f4b2dab80 Log NFS liveness checks 2024-03-16 13:24:36 +03:00
f70da82317 Add loadjson command to vitastor-kv 2024-03-16 13:24:36 +03:00
e42148f347 Allow to specify KV commands on command line 2024-03-16 13:24:36 +03:00
c289584469 Add JSON dump format 2024-03-16 13:24:36 +03:00
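Taken together, the dump, loadjson and command-line command support in the commits above allow a simple backup/restore cycle for a K/V database image. The invocation below is only a hypothetical sketch: the argument order and option names should be verified with `vitastor-kv --help`, and the image names are made up.

```
# Hypothetical sketch: dump a K/V database to JSON and load it into another image
vitastor-kv kvtest dump > kvtest.json
vitastor-kv kvtest_copy loadjson < kvtest.json
```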
018e89f867 Erase verf key left from creation from ientries on every modification 2024-03-16 13:24:36 +03:00
603dc68f11 Implement async mtime change 2024-03-16 13:24:36 +03:00
7b12342933 Allow to specify additional NFS mount options 2024-03-16 13:24:36 +03:00
44bf0f16ee Fix malloc/free in nfs_kv_read/write 2024-03-16 13:24:36 +03:00
8840c84572 Fix "bad key in etcd" in mon for FS pools 2024-03-16 13:24:36 +03:00
5b747c12ec Check if already mounted before mounting 2024-03-16 13:24:36 +03:00
05f5f46162 Fix zero used space, update mtime when moving/changing inode 2024-03-16 13:24:36 +03:00
b5604191c8 Ignore ECANCELED in nfs-proxy (happens in io_uring on fork) 2024-03-16 13:24:36 +03:00
e871de27de Support unaligned shared_offsets, align shared file data instead of header 2024-03-16 13:24:36 +03:00
f600ce98e2 Implement auto-unmount local NFS server mode for vitastor-nfs 2024-03-16 13:24:36 +03:00
57605a5c13 Return error on failed shrink 2024-03-16 13:24:36 +03:00
29bd4561bb Implement rename over an existing file/directory 2024-03-16 13:24:36 +03:00
7142460ec8 Support --logfile in nfs-proxy 2024-03-16 13:24:36 +03:00
d03f19ebe5 Fix shared file overlap, add FIXMEs 2024-03-16 13:24:36 +03:00
88f9d18be3 Create inode, then direntry, not direntry, then inode; retry ID collisions 2024-03-16 13:24:36 +03:00
6213fbd8c6 Fix NFS shared/aligned write FIXMEs 2024-03-16 13:24:36 +03:00
3aee37eadd Allow to disable per-inode stats for VitastorFS pools 2024-03-16 13:24:36 +03:00
ecfc753e93 Add basic NFS tests, fix bugs 2024-03-16 13:24:36 +03:00
a574f9ad71 Return block NFS implementation back as an option too 2024-03-16 13:24:36 +03:00
7c235c9103 Move KV FS header into a separate file 2024-03-16 13:24:36 +03:00
e5bb986164 Implement packing small files into shared inodes 2024-03-16 13:24:36 +03:00
181795d748 Split new NFS proxy implementation into multiple files 2024-03-16 13:24:36 +03:00
8cdc38805b WIP VitastorFS with metadata storage in VitastorKV 2024-03-16 13:24:36 +03:00
0cd455d17f First just recheck version without actually re-reading block in vitastor-kv 2024-03-16 13:24:36 +03:00
32ba653ba6 Fix vitastor-kv hang on reopen & unfinished closed listing 2024-03-16 13:24:36 +03:00
231d4b15fc Add loadable dump format to vitastor-kv (dump) 2024-03-16 13:24:36 +03:00
9dc4d5fd7b Fix freeing r/w buffers on errors in kv_db 2024-03-16 13:24:36 +03:00
e58538fa47 Fix eviction when random_pos selects the end 2024-03-16 13:24:36 +03:00
11ac9e7024 Implement min/max list_count to make listings during performance test reasonable 2024-03-16 13:24:36 +03:00
511bc3df1c Fix and improve parallel allocation
- Do not try to allocate more DB blocks in an inode block until it's "confirmed" and "locked" by the first write
- Do not recheck for new zero DB blocks on first write into an inode block - a CAS failure means someone else is already writing into it
- Throw new allocation blocks away regardless of whether the known_version is 0 on a CAS failure
2024-03-16 13:24:36 +03:00
a64f0d1f73 Implement key_prefix for K/V stress test 2024-03-16 13:24:36 +03:00
ec5f7c6b87 More fixes
- do not overwrite a block with older version if known version is newer
  (read may start before update and end after update)
- invalidated block versions can't be remembered and trusted
- right boundary for split blocks is right_half when diving down, not key_lt
- restart update also when block is "invalidated", not just on version mismatch
- copy callback in listings to avoid closure destruction bugs too
2024-03-16 13:24:36 +03:00
3ebed9a749 Add logging and one more assert 2024-03-16 13:24:36 +03:00
eab67a6e8f Make get_block() wait for updating when unrelated block is found along the path 2024-03-16 13:24:36 +03:00
20993d9b7a Fix a race condition where changed blocks were parsed over existing cached blocks and getting a mix of data 2024-03-16 13:24:36 +03:00
5cf9b343c0 Simplify code by removing an unneeded "optimisation" 2024-03-16 13:24:36 +03:00
79ae0aadcd Add kv_log_level, print warnings on level 1, trace ops on level 10 2024-03-16 13:24:36 +03:00
605afc3583 Fix duplicate keys in listings on parallel updates -- do not rewind key "iterator position" 2024-03-16 13:24:36 +03:00
c0681d8242 Implement key suffix to avoid collisions of multiple test workers 2024-03-16 13:24:36 +03:00
763e77b4f4 Do not complain on empty first block 2024-03-16 13:24:36 +03:00
19426aa4c5 Add JSON output for stress-tester 2024-03-16 13:24:36 +03:00
08f586bcec Print total stats 2024-03-16 13:24:36 +03:00
f1cd87473a Do not send more than op_count operations (fix segfault on finish) 2024-03-16 13:24:36 +03:00
1bd8d2da56 Add some more resiliency to serialize() 2024-03-16 13:24:36 +03:00
a7396d2baf Invalidate blocks being updated too 2024-03-16 13:24:36 +03:00
e98a38810d Change new block allocation method: make each writer choose multiple empty PG blocks and place blocks in them 2024-03-16 13:24:36 +03:00
28c4324c36 Remove blocks from cache on unsuccessful updates 2024-03-16 13:24:36 +03:00
31ec3fa8f5 Allow to track multiple updates per block (it should never happen though) 2024-03-16 13:24:36 +03:00
e4fa26f60a Do not call stop_updating after failed write_new_block and after clear_block (both delete the item) 2024-03-16 13:24:36 +03:00
59ae27f9e5 Track versions of parent blocks and recheck if changed during update 2024-03-16 13:24:36 +03:00
2c6a301d9b Fix resume_split condition (key_lt can also be "") 2024-03-16 13:24:36 +03:00
01558349f8 Experiment: transform offsets for better sharding 2024-03-16 13:24:36 +03:00
36f4717d0d More post-stress-test fixes
- Prevent _split types of new blocks
- Stop updating new blocks only after the whole update, otherwise pointers
  may become invalid
- Use recheck_none for updates initially
- Use UINT64_MAX as initial block version when postponing ops, otherwise the
  check fails when the block is initially empty. This for example leads to
  writing both leaf items & block pointers (which is incorrect) into the root
  block when starting stress-test with --parallelism 32
- Fix -EINTR comparison
2024-03-16 13:24:36 +03:00
babaf2a0ce Print operation statistics 2024-03-16 13:24:36 +03:00
5773f1a375 K/V fixes after stress-test :-)
- track block versions correctly - per inode block (128kb) instead of tree block (4kb)
- prevent multiple parallel CAS writes of the same inode block
- add logging for EILSEQ which means invalid data in the tree
- fix get_block updated flag which was true for blocks already in cache and was leading to infinite loops on "unrelated block" errors
- apply changes to blocks in cache only after successful writes (using "virtual changes")
- do not replace cached block with an older version from disk
- recheck "unrelated blocks" (read/update collisions) until data stops changing
- track tree path correctly - do not treat split block as parent of its right half
- correctly move blocks when finding new empty place on disk
- restart updates from the beginning when one of blocks is changed by a parallel update
- fix delete, which was using the SET opcode and setting the key to an empty value instead of removing it
- prevent changing the same key more than once in parallel
- fix listing verification
- resume continue_updates in update_find (required because it uses continue_update itself)
- add allow_old_cached parameter to get()
2024-03-16 13:24:36 +03:00
57222a9f79 Implement K/V DB stress tester 2024-03-16 13:24:36 +03:00
61ef000c6e Evict blocks based on memory limit & block usage 2024-03-16 13:24:36 +03:00
7d5e1cc393 Track blocks per level 2024-03-16 13:24:36 +03:00
5e7f27a02d Track block level 2024-03-16 13:24:36 +03:00
fd1d8a8520 Experimental B-Tree Vitastor embedded K/V database implementation! 2024-03-16 13:24:36 +03:00
c364e14c40 Stop then retry, not retry then stop 2024-03-16 13:24:36 +03:00
3ebbfa0428 Fix another rare OSD hang on zeroing out entries on start 2024-03-16 13:24:36 +03:00
aa79d1db1c Fix incorrect "changing scheme" message in modify-pool
2024-03-06 00:41:35 +03:00
a1fecb7eff Move callback away when calling it in cluster_client 2024-03-06 00:41:35 +03:00
ff74b19423 Fix rare OSD hang on zeroing out bad entries on start 2024-03-06 00:41:35 +03:00
4cf6dceed7 Merge branch 'rel-1.4'
2024-02-29 09:59:01 +03:00
38b8963330 Release 1.4.8
- Do not use \r if output is not a terminal (should fix unexpected job output in Proxmox)
- Fix rm/rm-data error return code, add --down-ok option to bypass the error
- Add an EIO retry timeout and allow disabling these retries; rename up_wait_retry_interval to client_retry_interval
- Add Ubuntu Jammy build
- Wait for blockstore initialisation before starting OSD (prevents timeouts when init takes time)
- Fix a rare use-after-free in automatic sync after delete in blockstore
2024-02-29 09:58:34 +03:00
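A brief sketch of the new --down-ok flag from the 1.4.8 notes above, using the standard rm-data invocation (pool and inode numbers are placeholders):

```
# Without --down-ok, rm-data now returns a non-zero exit code if some data
# could not be removed because OSDs are down; --down-ok bypasses that error.
vitastor-cli rm-data --pool 2 --inode 1 --down-ok
```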
77167e2920 Do not use \r if output is not a terminal 2024-02-29 00:21:17 +03:00
5af23672d0 Fix rm/rm-data error return code, add --down-ok option to bypass the error 2024-02-29 00:20:10 +03:00
6bf1f539a6 Add EIO retry timeout and allow to disable these retries, rename up_wait_retry_interval to client_retry_interval 2024-02-28 13:10:02 +03:00
4eab26f968 Add documentation and a very basic test for pool management commands
2024-02-28 13:08:04 +03:00
86243b7101 Rework & fix pool-create / pool-modify / pool-ls 2024-02-28 13:08:04 +03:00
idelson
dc92851322 vitastor-cli: add commands to control pools: pool-create, pool-ls, pool-modify, pool-rm
PR #59 - https://github.com/vitalif/vitastor/pull/58/commits

By MIND Software LLC

By submitting this pull request, I accept Vitastor CLA
2024-02-28 13:08:04 +03:00
02d1f16bbd Add ubuntu jammy build
PR #62

I accept Vitastor CLA agreement: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md
2024-02-28 11:43:54 +03:00
fc413038d1 Wait for blockstore initialisation before starting OSD
2024-02-27 02:20:04 +03:00
1bc0b5aab3 Fix a rare use-after-free in automatic sync after delete in blockstore
ASan report: [0] READ of size 16 at operator() /root/vitastor/src/blockstore_write.cpp:100
...[5] blockstore_impl_t::ack_sync(blockstore_op_t*) /root/vitastor/src/blockstore_sync.cpp:232
2024-02-24 00:06:36 +03:00
5e934264cf Release 1.4.7
- Fix another old "BUG: Attempt to overwrite used offset" in a very simple
  case: bs=4k rw=write iodepth=16 from OSD start; add this case to tests
- Fix a rare crash with "unexpected state during flush: 0x51" possible with
  EC since 1.4.2 during rebalance and OSD outages
- Fix a rare write stall with EC & immediate_commit=none caused by sync
  operations reserving unneeded space in the journal
- Fix 32-bit build warnings, most in printf/scanf format strings
2024-02-22 12:45:52 +03:00
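The "very simple case" from the 1.4.7 notes (bs=4k rw=write iodepth=16 against a freshly started OSD) roughly corresponds to an fio run like the sketch below; the etcd address and image name are placeholders, and the engine options should be checked against the Vitastor fio engine documentation.

```
# Sketch of the reproducing workload via the Vitastor cluster fio engine
fio -thread -ioengine=libfio_vitastor.so -name=test -direct=1 \
    -bs=4k -rw=write -iodepth=16 \
    -etcd=10.0.0.1:2379/v3 -image=testimg
```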
f20564b44b Fix 32-bit build warnings (99.9% in printf) 2024-02-22 12:22:16 +03:00
b3c15db331 32M journal by default in simple-offsets
2024-02-21 15:25:02 +03:00
685bcd6ef9 Do not reserve extra space for big_writes during sync - sync itself is needed to commit and clear them 2024-02-21 13:00:14 +03:00
3eb389b321 Supposed fix for "unexpected state during flush: 0x51" with EC
2024-02-21 01:32:06 +03:00
3d16cde23c Fix assertions, add small sequential write test
2024-02-20 19:41:48 +03:00
c6406d67fc Fix journal space_check incorrectly checking for space at the beginning 2024-02-20 19:40:56 +03:00
f87964861d Release 1.4.6
Unwavering stabilization of 1.4.x, continued :-)

- Include the accidentally lost part of 1.4.5 journal trimming fix
- Fix a possible OSD crash with "BUG: Attempt to overwrite used offset"
  which was probably present for a long time, but became apparent after
  fixing flapping tests in CI
- Fix the remaining flapping tests in CI. It was the first time that tests
  actually passed without retries :-)
2024-02-20 17:01:26 +03:00
62a4f45160 Raise test_scrub waiting timeout
2024-02-20 16:26:09 +03:00
7048228678 Supposed fix for "BUG: Attempt to overwrite used offset" 2024-02-20 15:56:48 +03:00
ea73857450 Add asserts to catch "BUG: Attempt to overwrite used offset" 2024-02-20 15:56:48 +03:00
6cfe38ec04 Followup to empty cur.oid as stop condition for forced trim fix 2024-02-20 15:56:38 +03:00
7ae5766fdb Wait to clear has_degraded in test_heal - should fix flaps of test_heal_* in CI 2024-02-20 15:56:27 +03:00
f882c7dd87 Release 1.4.5
- Fix a write stall caused by incorrect journal trimming introduced in 1.4.4 :)
- Fix PGs sometimes hanging in "starting" state on mass OSD restarts
- Fix a rare crash with "map::at" during OSD pings
- Use new defaults for non-capacitor (desktop) SSDs - improves T1Q256 random write from ~6k iops to ~45k iops
- Make journal_trim_interval configurable
2024-02-16 10:13:33 +03:00
26dd863c8d Fix sometimes possible crash on clients.at() during pings 2024-02-16 10:13:33 +03:00
2ae859fbc6 Use min/max_flusher_count=32/256, 128M journal and autosync_writes=512 for non-capacitor SSDs by default 2024-02-16 10:13:33 +03:00
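For reference, the new non-capacitor SSD defaults named in the commit above correspond to the following parameter values if written out explicitly (a sketch only: these are applied as defaults for such drives, and the 128M figure refers to the default journal size chosen when the disk is prepared):

```
{
  "min_flusher_count": 32,
  "max_flusher_count": 256,
  "autosync_writes": 512
}
```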
f6cd9f9153 Add a note about pg_minsize 2024-02-15 23:38:52 +03:00
8389c0f33b Fix PGs sometimes hanging in "starting" state on mass OSD restarts 2024-02-15 23:38:52 +03:00
9db2196aef Make journal_trim_interval configurable 2024-02-15 23:38:51 +03:00
8d6ae662fe Use empty cur.oid as stop condition for forced trim, not journal_trim_counter 2024-02-15 23:27:17 +03:00
c777a0041a Release 1.4.4
A couple of fixes for EC pools

- Fix a segfault possible on partial EC overwrite in 1234 -> 5030 rebalance scenario
- Fix two problems leading to EC pools stalling on rebalance & parallel sudden stops
  of OSDs, for example during a sudden poweroff of a host:
  - Recovery auto-tuning (1.4.0 feature) could apply too large delays and stall
    the EC journal - fixed by limiting delays with a new recovery_tune_sleep_cutoff_us
    parameter (10 seconds by default) and applying recovery pauses before write
    operations, not after them, so that journal space is not occupied for a long time
  - Dynamic journal space reservation (1.3.0 feature) wasn't accounting for new writes
    when checking the limit, so OSDs could still fill the journal completely and stall -
    fixed by including new writes into the limit
- Print etcd dbSize instead of dbSizeInUse in status
2024-02-11 16:23:08 +03:00
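The recovery_tune_sleep_cutoff_us parameter mentioned above can presumably be overridden like any other OSD option, e.g. in /etc/vitastor/vitastor.conf; the sketch below simply restates the default (the value is in microseconds, 10 seconds = 10000000):

```
{
  "recovery_tune_sleep_cutoff_us": 10000000
}
```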
2947ea93e8 Raise test_snapshot_chain_ec timeout to 6 minutes 2024-02-11 16:13:52 +03:00
978bdc128a Apply recovery pause before writes, after commits, and do not apply it to syncs to not block EC pools from functioning 2024-02-11 16:13:52 +03:00
bb2f395f1e Add cutoff threshold for recovery auto-tuning 2024-02-11 16:13:52 +03:00
b127da40f7 Add a FIXME about incomplete PGs 2024-02-11 13:42:51 +03:00
ca34a6047a Fix dynamic journal space reservation: include the new write itself, too 2024-02-11 13:42:51 +03:00
38ba76e893 Fix flusher sometimes being unable to trim journal when the flush queue is empty 2024-02-11 13:42:51 +03:00
1e3c4edea0 Print etcd dbSize instead of dbSizeInUse in status 2024-02-11 13:42:51 +03:00
e7ac855b07 Fix that EC segfault (1234 -> 5030 partial overwrite) 2024-02-11 13:42:51 +03:00
c53357ac45 Add a test for EC segfault with partial overwrite in 1234 -> 5030 rebalance scenario 2024-02-11 13:42:51 +03:00
27e9f244ec Release 1.4.3
Hotfix for hotfix O:-)

- "Write stall fix" was incomplete and EC write stalls could
  continue even on 1.4.2. Now they're finally fixed O:-)
- Make monitor ignore statistics of stopped OSDs. Previously, if you stopped all
  OSDs, the last total I/O numbers would remain the same indefinitely
2024-02-09 00:29:31 +03:00
8e25a28a08 Ignore down OSDs in monitor statistics aggregation
2024-02-09 00:22:36 +03:00
5d3317e4f2 Followup to 1.4.2 write stall fix - sadly, the previous version was not working correctly :)
2024-02-08 19:34:29 +03:00
016115c0d4 Release 1.4.2
- Log to systemd by default
- Fix excessive autosyncs after every operation with disabled immediate_commit (introduced in 1.1.0)
- Fix a possible write stall with EC due to the lack of OSD wakeup after stabilizing previous writes
- Change sync operation semantics as a final fix to possible write stalls with EC and disabled immediate_commit
- Sync after deleting data in CLI rm / rm-data if immediate_commit is disabled
- Fix OSDs ignoring syncs & autosyncs for delete operations
- Fix OSD space reporting sometimes adding garbage zeros for deleted inodes (causing extra pool/stats etcd keys for deleted pools)
- Speed up monitor failover - change default etcd_mon_ttl from 30 to 5 seconds
- Speed up operation retries - change default up_wait_retry_interval to 50 ms
- Add patch for libvirt 9.10
2024-02-04 02:23:49 +03:00
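Both timing changes in the 1.4.2 notes above are new defaults, so nothing needs to be configured; the sketch below only shows how they would look if pinned explicitly in the JSON config (etcd_mon_ttl is in seconds, up_wait_retry_interval in milliseconds):

```
{
  "etcd_mon_ttl": 5,
  "up_wait_retry_interval": 50
}
```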
e026de95d5 Log to systemd by default
2024-02-04 01:21:31 +03:00
77c10fd1f8 In fact, do not autosync blockstore when autosync_writes=0
2024-02-03 20:37:36 +03:00
581d02e581 Mark secondary OSDs with deletions as dirty to not forget to sync & autosync them
2024-02-03 20:31:08 +03:00
f03a9db4d9 Fix OSD space reporting sometimes adding garbage zeros for deleted inodes (causing extra pool/stats etcd keys for deleted pools) 2024-02-03 20:31:08 +03:00
cb9c30bc31 Sync after sending all deletes to each PG in cli rm-data 2024-02-03 20:31:08 +03:00
a86a380d20 Fix invalid parsing of autosync_writes in blockstore leading to autosyncs after every operation with disabled immediate_commit :D 2024-02-03 20:31:08 +03:00
d2b43cb118 Change default etcd_mon_ttl
2024-01-29 23:45:19 +03:00
cc76e6876b Fix flapping "scrub" test
2024-01-28 14:59:33 +03:00
1cec62d25d Sync only completed writes
This should be the final remaining fix for EC + non-capacitor (non-immediate-commit) write hangs :).

At first it was breaking non-EC ("instantly stable") writes because they sometimes
complete out of order, which led to the following error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  BUG: Unexpected dirty_entry 1000000000001:29480000 v65540 unstable state during flush: 0x151

But it is easily fixed by scanning previous and next dirty_entries in mark_stable.
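As a purely illustrative reading of that fix, here is a minimal Python sketch; the entry layout, state names and the exact propagation rule are assumptions made for the example, the real blockstore code is C++ and more involved:

```python
from dataclasses import dataclass

UNSTABLE, STABLE = "unstable", "stable"   # invented state names for the sketch

@dataclass
class DirtyEntry:
    oid: int      # object id
    version: int  # write version
    state: str

def mark_stable(entries, oid, version):
    # entries is kept sorted by (oid, version); besides stabilizing the target
    # version, also scan the neighbouring entries of the same object so that
    # writes which completed out of order are not left behind in a state the
    # flusher does not expect.
    idx = next(i for i, e in enumerate(entries)
               if e.oid == oid and e.version == version)
    entries[idx].state = STABLE
    # older versions of the same object: a stable newer version implies the
    # completed older ones can be treated as stable as well
    i = idx - 1
    while i >= 0 and entries[i].oid == oid and entries[i].state == UNSTABLE:
        entries[i].state = STABLE
        i -= 1
    # (the actual fix also inspects the following, newer entries; that
    # direction is omitted in this simplified sketch)
```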
2024-01-27 15:17:22 +03:00
1c322b33ed Change default up_wait_retry_interval to 50 ms
Some checks failed
Test / test_rm (push) Successful in 14s
Test / test_interrupted_rebalance_ec (push) Successful in 3m59s
Test / test_snapshot_chain (push) Successful in 1m34s
Test / test_snapshot_down (push) Successful in 25s
Test / test_snapshot_down_ec (push) Successful in 29s
Test / test_splitbrain (push) Successful in 19s
Test / test_snapshot_chain_ec (push) Successful in 2m35s
Test / test_interrupted_rebalance (push) Successful in 8m15s
Test / test_rebalance_verify_imm (push) Successful in 3m54s
Test / test_switch_primary (push) Successful in 36s
Test / test_write (push) Successful in 35s
Test / test_rebalance_verify_ec (push) Successful in 4m48s
Test / test_rebalance_verify_ec_imm (push) Successful in 2m51s
Test / test_write_no_same (push) Successful in 14s
Test / test_write_xor (push) Failing after 3m9s
Test / test_heal_pg_size_2 (push) Successful in 3m55s
Test / test_heal_ec (push) Successful in 3m50s
Test / test_rebalance_verify (push) Failing after 9m30s
Test / test_heal_csum_32k_dmj (push) Failing after 5m40s
Test / test_heal_csum_32k_dj (push) Successful in 6m12s
Test / test_heal_csum_32k (push) Successful in 6m25s
Test / test_heal_csum_4k_dmj (push) Successful in 6m56s
Test / test_scrub (push) Successful in 1m4s
Test / test_scrub_zero_osd_2 (push) Successful in 55s
Test / test_scrub_xor (push) Successful in 56s
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 1m19s
Test / test_scrub_pg_size_3 (push) Failing after 2m14s
Test / test_heal_csum_4k_dj (push) Successful in 5m53s
Test / test_scrub_ec (push) Successful in 1m1s
Test / test_heal_csum_4k (push) Successful in 5m17s
2024-01-26 01:51:08 +03:00
d27524f441 Add patch for libvirt 9.10 2024-01-25 01:09:12 +03:00
ba55f91409 Release 1.4.1
Some checks failed
Test / test_move_reappear (push) Successful in 22s
Test / test_snapshot_chain (push) Successful in 1m27s
Test / test_interrupted_rebalance_ec (push) Successful in 4m41s
Test / test_snapshot_down (push) Successful in 25s
Test / test_snapshot_chain_ec (push) Successful in 2m0s
Test / test_splitbrain (push) Successful in 18s
Test / test_snapshot_down_ec (push) Successful in 25s
Test / test_rebalance_verify_ec (push) Failing after 2m21s
Test / test_rebalance_verify_imm (push) Successful in 2m30s
Test / test_switch_primary (push) Successful in 39s
Test / test_write (push) Successful in 35s
Test / test_interrupted_rebalance (push) Failing after 10m8s
Test / test_write_xor (push) Successful in 36s
Test / test_write_no_same (push) Successful in 17s
Test / test_rebalance_verify_ec_imm (push) Successful in 4m4s
Test / test_heal_pg_size_2 (push) Successful in 3m55s
Test / test_rebalance_verify (push) Successful in 8m31s
Test / test_heal_ec (push) Successful in 5m9s
Test / test_heal_csum_32k_dmj (push) Successful in 4m27s
Test / test_heal_csum_32k (push) Successful in 5m42s
Test / test_heal_csum_32k_dj (push) Successful in 6m1s
Test / test_scrub (push) Successful in 59s
Test / test_scrub_zero_osd_2 (push) Successful in 38s
Test / test_heal_csum_4k_dmj (push) Successful in 7m5s
Test / test_scrub_xor (push) Successful in 58s
Test / test_heal_csum_4k_dj (push) Successful in 6m25s
Test / test_scrub_ec (push) Failing after 42s
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 1m32s
Test / test_scrub_pg_size_3 (push) Successful in 1m38s
Test / test_heal_csum_4k (push) Successful in 5m38s
- Fix a monitor crash on primary OSD switching introduced in 1.4.0
- Fix "partly outside array bounds" warnings for GCC 12 in cpp-btree
- Fix a realloc memory leak theoretically possible with too large listings (OSD_OP_LIST)
2024-01-18 02:31:42 +03:00
80aac39513 Add detailed formula for theoretical EC N+K random write performance 2024-01-18 00:36:32 +03:00
2aa5aa7ab6 Add a test for simple master switching without PG reconfiguration
All checks were successful
Test / test_move_reappear (push) Successful in 20s
Test / test_snapshot_chain (push) Successful in 1m27s
Test / test_snapshot_down (push) Successful in 23s
Test / test_snapshot_chain_ec (push) Successful in 1m56s
Test / test_snapshot_down_ec (push) Successful in 23s
Test / test_splitbrain (push) Successful in 17s
Test / test_interrupted_rebalance_ec (push) Successful in 6m40s
Test / test_interrupted_rebalance (push) Successful in 8m12s
Test / test_rebalance_verify_imm (push) Successful in 3m12s
Test / test_switch_primary (push) Successful in 34s
Test / test_write (push) Successful in 46s
Test / test_rebalance_verify_ec (push) Successful in 3m18s
Test / test_rebalance_verify_ec_imm (push) Successful in 2m42s
Test / test_write_no_same (push) Successful in 15s
Test / test_rebalance_verify (push) Successful in 6m36s
Test / test_heal_ec (push) Successful in 5m2s
Test / test_heal_csum_32k_dmj (push) Successful in 4m33s
Test / test_heal_csum_32k_dj (push) Successful in 5m58s
Test / test_heal_csum_32k (push) Successful in 6m6s
Test / test_scrub (push) Successful in 47s
Test / test_heal_csum_4k_dmj (push) Successful in 6m17s
Test / test_scrub_zero_osd_2 (push) Successful in 43s
Test / test_scrub_xor (push) Successful in 47s
Test / test_heal_csum_4k_dj (push) Successful in 6m44s
Test / test_scrub_ec (push) Successful in 41s
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 1m18s
Test / test_scrub_pg_size_3 (push) Successful in 2m11s
Test / test_heal_csum_4k (push) Successful in 6m12s
Test / test_heal_pg_size_2 (push) Successful in 3m16s
Test / test_write_xor (push) Successful in 34s
Also use osd_out_time:1 only in select tests, and restart mon in tests only on connection errors
2024-01-17 00:19:01 +03:00
3ca3b8a8d8 Fix recheck_pgs bug introduced in 1.4.0
Some checks failed
Test / test_rm (push) Successful in 14s
Test / test_interrupted_rebalance_ec (push) Successful in 3m27s
Test / test_snapshot_chain (push) Successful in 1m24s
Test / test_snapshot_down (push) Successful in 25s
Test / test_snapshot_chain_ec (push) Successful in 1m54s
Test / test_snapshot_down_ec (push) Successful in 20s
Test / test_splitbrain (push) Successful in 15s
Test / test_rebalance_verify_imm (push) Successful in 2m42s
Test / test_etcd_fail (push) Failing after 10m8s
Test / test_interrupted_rebalance (push) Failing after 10m9s
Test / test_write (push) Successful in 1m22s
Test / test_rebalance_verify_ec (push) Failing after 1m51s
Test / test_write_no_same (push) Successful in 16s
Test / test_rebalance_verify_ec_imm (push) Successful in 3m27s
Test / test_write_xor (push) Failing after 3m13s
Test / test_heal_pg_size_2 (push) Successful in 3m22s
Test / test_rebalance_verify (push) Failing after 10m9s
Test / test_heal_ec (push) Successful in 4m41s
Test / test_heal_csum_32k_dmj (push) Successful in 4m42s
Test / test_heal_csum_32k_dj (push) Successful in 4m58s
Test / test_heal_csum_32k (push) Successful in 6m34s
Test / test_scrub (push) Successful in 54s
Test / test_heal_csum_4k_dmj (push) Successful in 6m56s
Test / test_scrub_zero_osd_2 (push) Successful in 49s
Test / test_heal_csum_4k_dj (push) Successful in 6m1s
Test / test_scrub_ec (push) Has been cancelled
Test / test_heal_csum_4k (push) Has been cancelled
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Has been cancelled
Test / test_scrub_xor (push) Has been cancelled
Test / test_scrub_pg_size_3 (push) Has been cancelled
2024-01-16 23:49:21 +03:00
2cf649eba6 Fix "partly outside array bounds" warnings for GCC 12 in cpp-btree
Some checks failed
Test / test_move_reappear (push) Successful in 19s
Test / test_rm (push) Successful in 13s
Test / test_snapshot_down (push) Successful in 25s
Test / test_snapshot_down_ec (push) Successful in 25s
Test / test_splitbrain (push) Successful in 19s
Test / test_snapshot_chain (push) Successful in 2m13s
Test / test_interrupted_rebalance (push) Successful in 7m36s
Test / test_rebalance_verify (push) Successful in 3m35s
Test / test_write (push) Successful in 1m0s
Test / test_rebalance_verify_imm (push) Successful in 4m4s
Test / test_rebalance_verify_ec (push) Successful in 3m13s
Test / test_rebalance_verify_ec_imm (push) Successful in 2m35s
Test / test_write_no_same (push) Successful in 14s
Test / test_heal_pg_size_2 (push) Successful in 4m32s
Test / test_heal_csum_32k_dmj (push) Successful in 4m29s
Test / test_heal_ec (push) Successful in 5m47s
Test / test_heal_csum_32k_dj (push) Successful in 5m47s
Test / test_heal_csum_4k_dmj (push) Successful in 6m4s
Test / test_heal_csum_32k (push) Successful in 6m19s
Test / test_scrub (push) Successful in 56s
Test / test_scrub_zero_osd_2 (push) Successful in 43s
Test / test_heal_csum_4k_dj (push) Successful in 6m14s
Test / test_scrub_xor (push) Successful in 53s
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 57s
Test / test_scrub_ec (push) Successful in 47s
Test / test_heal_csum_4k (push) Successful in 5m56s
Test / test_minsize_1 (push) Successful in 14s
Test / test_scrub_pg_size_3 (push) Successful in 46s
Test / test_snapshot_chain_ec (push) Successful in 1m40s
Test / test_write_xor (push) Failing after 3m6s
2024-01-15 03:04:33 +03:00
5935640a4a Add CLA PR form 2024-01-14 16:48:24 +03:00
d00d4dbac0 Initialize mod_revision field in etcd_state_client
Some checks failed
Test / test_interrupted_rebalance_ec (push) Successful in 2m28s
Test / test_rm (push) Successful in 17s
Test / test_move_reappear (push) Successful in 29s
Test / test_snapshot_down (push) Successful in 26s
Test / test_snapshot_down_ec (push) Successful in 26s
Test / test_splitbrain (push) Successful in 16s
Test / test_snapshot_chain (push) Successful in 2m0s
Test / test_rebalance_verify_imm (push) Successful in 2m28s
Test / test_rebalance_verify (push) Successful in 3m0s
Test / test_rebalance_verify_ec (push) Successful in 3m14s
Test / test_write_no_same (push) Successful in 13s
Test / test_rebalance_verify_ec_imm (push) Successful in 3m7s
Test / test_heal_pg_size_2 (push) Successful in 3m33s
Test / test_heal_ec (push) Successful in 4m40s
Test / test_heal_csum_32k_dj (push) Successful in 5m40s
Test / test_heal_csum_32k (push) Successful in 6m8s
Test / test_scrub (push) Successful in 1m4s
Test / test_scrub_zero_osd_2 (push) Successful in 47s
Test / test_heal_csum_4k_dmj (push) Successful in 6m33s
Test / test_heal_csum_4k_dj (push) Successful in 6m28s
Test / test_scrub_xor (push) Successful in 44s
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 1m2s
Test / test_scrub_ec (push) Successful in 42s
Test / test_scrub_pg_size_3 (push) Successful in 1m38s
Test / test_heal_csum_4k (push) Successful in 5m56s
Test / test_interrupted_rebalance (push) Successful in 1m53s
Test / test_snapshot_chain_ec (push) Failing after 3m17s
Test / test_write (push) Failing after 3m15s
Test / test_heal_csum_32k_dmj (push) Successful in 4m6s
Test / test_write_xor (push) Failing after 3m11s
2024-01-13 01:30:28 +03:00
5d9d6f32a0 Fix common realloc memory leak mistakes found by cppcheck 2024-01-13 01:30:28 +03:00
181 changed files with 14065 additions and 1572 deletions

View File

@@ -22,7 +22,7 @@ RUN apt-get update
RUN apt-get -y install etcd qemu-system-x86 qemu-block-extra qemu-utils fio libasan5 \
liburing1 liburing-dev libgoogle-perftools-dev devscripts libjerasure-dev cmake libibverbs-dev libisal-dev
RUN apt-get -y build-dep fio qemu=`dpkg -s qemu-system-x86|grep ^Version:|awk '{print $2}'`
RUN apt-get -y install jq lp-solve sudo
RUN apt-get -y install jq lp-solve sudo nfs-common
RUN apt-get --download-only source fio qemu=`dpkg -s qemu-system-x86|grep ^Version:|awk '{print $2}'`
RUN set -ex; \

View File

@@ -395,7 +395,7 @@ jobs:
steps:
- name: Run test
id: test
timeout-minutes: 3
timeout-minutes: 6
run: SCHEME=ec /root/vitastor/tests/test_snapshot_chain.sh
- name: Print logs
if: always() && steps.test.outcome == 'failure'
@@ -532,6 +532,24 @@ jobs:
echo ""
done
test_switch_primary:
runs-on: ubuntu-latest
needs: build
container: ${{env.TEST_IMAGE}}:${{github.sha}}
steps:
- name: Run test
id: test
timeout-minutes: 3
run: /root/vitastor/tests/test_switch_primary.sh
- name: Print logs
if: always() && steps.test.outcome == 'failure'
run: |
for i in /root/vitastor/testdata/*.log /root/vitastor/testdata/*.txt; do
echo "-------- $i --------"
cat $i
echo ""
done
test_write:
runs-on: ubuntu-latest
needs: build
@@ -838,3 +856,21 @@ jobs:
echo ""
done
test_nfs:
runs-on: ubuntu-latest
needs: build
container: ${{env.TEST_IMAGE}}:${{github.sha}}
steps:
- name: Run test
id: test
timeout-minutes: 3
run: /root/vitastor/tests/test_nfs.sh
- name: Print logs
if: always() && steps.test.outcome == 'failure'
run: |
for i in /root/vitastor/testdata/*.log /root/vitastor/testdata/*.txt; do
echo "-------- $i --------"
cat $i
echo ""
done

View File

@@ -39,6 +39,10 @@ for my $line (<>)
$test_name .= '_'.lc($1).'_'.$2;
}
}
if ($test_name eq 'test_snapshot_chain_ec')
{
$timeout = 6;
}
$line =~ s!\./test_!/root/vitastor/tests/test_!;
# Gitea CI doesn't support artifacts yet, lol
#- name: Upload results

View File

@@ -38,7 +38,7 @@ in the offer.
on behalf of third parties, including on behalf of his employer.
2. Subject of the Agreement. \
2.1. Subject of the Agreement shall be the Contributions sent to the Author by Contributors.
2.1. Subject of the Agreement shall be the Contributions sent to the Author by Contributors. \
2.2. The Contributor grants to the Author the right to use Contributions at his own
discretion and without any necessity to get a prior approval from Contributor or
any other third party in any way, under a simple (non-exclusive), royalty-free,
@@ -86,7 +86,7 @@ in the offer.
of their provision to the Author. \
5.2. The Contributor represents and warrants that he legally owns exclusive
intellectual property rights to the Contributions. \
5.3. The Contributor represents and warrants that any further use of \
5.3. The Contributor represents and warrants that any further use of
Contributions by the Author as provided by Contributor under the terms
of the Agreement does not infringe on intellectual and other rights and
legitimate interests of third parties. \

View File

@@ -2,6 +2,6 @@ cmake_minimum_required(VERSION 2.8.12)
project(vitastor)
set(VERSION "1.4.0")
set(VERSION "1.5.0")
add_subdirectory(src)

View File

@@ -6,8 +6,8 @@
Вернём былую скорость кластерному блочному хранилищу!
Vitastor - распределённая блочная SDS (программная СХД), прямой аналог Ceph RBD и
внутренних СХД популярных облачных провайдеров. Однако, в отличие от них, Vitastor
Vitastor - распределённая блочная и файловая SDS (программная СХД), прямой аналог Ceph RBD и CephFS,
а также внутренних СХД популярных облачных провайдеров. Однако, в отличие от них, Vitastor
быстрый и при этом простой. Только пока маленький :-).
Vitastor архитектурно похож на Ceph, что означает атомарность и строгую консистентность,
@@ -63,7 +63,7 @@ Vitastor поддерживает QEMU-драйвер, протоколы NBD и
- [fio](docs/usage/fio.ru.md) для тестов производительности
- [NBD](docs/usage/nbd.ru.md) для монтирования ядром
- [QEMU и qemu-img](docs/usage/qemu.ru.md)
- [NFS](docs/usage/nfs.ru.md)-прокси для VMWare и подобных
- [NFS](docs/usage/nfs.ru.md) кластерная файловая система и псевдо-ФС прокси
- Производительность
- [Понимание сути производительности](docs/performance/understanding.ru.md)
- [Теоретический максимум](docs/performance/theoretical.ru.md)

View File

@@ -6,9 +6,9 @@
Make Clustered Block Storage Fast Again.
Vitastor is a distributed block SDS, direct replacement of Ceph RBD and internal SDS's
of public clouds. However, in contrast to them, Vitastor is fast and simple at the same time.
The only thing is it's slightly young :-).
Vitastor is a distributed block and file SDS, direct replacement of Ceph RBD and CephFS,
and also internal SDS's of public clouds. However, in contrast to them, Vitastor is fast
and simple at the same time. The only thing is it's slightly young :-).
Vitastor is architecturally similar to Ceph which means strong consistency,
primary-replication, symmetric clustering and automatic data distribution over any
@@ -63,7 +63,7 @@ Read more details below in the documentation.
- [fio](docs/usage/fio.en.md) for benchmarks
- [NBD](docs/usage/nbd.en.md) for kernel mounts
- [QEMU and qemu-img](docs/usage/qemu.en.md)
- [NFS](docs/usage/nfs.en.md) emulator for VMWare and similar
- [NFS](docs/usage/nfs.en.md) clustered file system and pseudo-FS proxy
- Performance
- [Understanding storage performance](docs/performance/understanding.en.md)
- [Theoretical performance](docs/performance/theoretical.en.md)

View File

@@ -1,4 +1,4 @@
VERSION ?= v1.4.0
VERSION ?= v1.5.0
all: build push

View File

@@ -49,7 +49,7 @@ spec:
capabilities:
add: ["SYS_ADMIN"]
allowPrivilegeEscalation: true
image: vitalif/vitastor-csi:v1.4.0
image: vitalif/vitastor-csi:v1.5.0
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"

View File

@@ -121,7 +121,7 @@ spec:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
image: vitalif/vitastor-csi:v1.4.0
image: vitalif/vitastor-csi:v1.5.0
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"

View File

@@ -5,7 +5,7 @@ package vitastor
const (
vitastorCSIDriverName = "csi.vitastor.io"
vitastorCSIDriverVersion = "1.4.0"
vitastorCSIDriverVersion = "1.5.0"
)
// Config struct fills the parameters of request or user input

View File

@@ -3,5 +3,5 @@
cat < vitastor.Dockerfile > ../Dockerfile
cd ..
mkdir -p packages
sudo podman build --build-arg REL=bookworm -v `pwd`/packages:/root/packages -f Dockerfile .
sudo podman build --build-arg DISTRO=debian --build-arg REL=bookworm -v `pwd`/packages:/root/packages -f Dockerfile .
rm Dockerfile

View File

@@ -3,5 +3,5 @@
cat < vitastor.Dockerfile > ../Dockerfile
cd ..
mkdir -p packages
sudo podman build --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f Dockerfile .
sudo podman build --build-arg DISTRO=debian --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f Dockerfile .
rm Dockerfile

View File

@@ -3,5 +3,5 @@
cat < vitastor.Dockerfile > ../Dockerfile
cd ..
mkdir -p packages
sudo podman build --build-arg REL=buster -v `pwd`/packages:/root/packages -f Dockerfile .
sudo podman build --build-arg DISTRO=debian --build-arg REL=buster -v `pwd`/packages:/root/packages -f Dockerfile .
rm Dockerfile

7
debian/build-vitastor-ubuntu-jammy.sh vendored Executable file
View File

@@ -0,0 +1,7 @@
#!/bin/bash
cat < vitastor.Dockerfile > ../Dockerfile
cd ..
mkdir -p packages
sudo podman build --build-arg DISTRO=ubuntu --build-arg REL=jammy -v `pwd`/packages:/root/packages -f Dockerfile .
rm Dockerfile

2
debian/changelog vendored
View File

@@ -1,4 +1,4 @@
vitastor (1.4.0-1) unstable; urgency=medium
vitastor (1.5.0-1) unstable; urgency=medium
* Bugfixes

View File

@@ -1,13 +1,14 @@
# Build patched libvirt for Debian Buster or Bullseye/Sid inside a container
# cd ..; podman build --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/libvirt.Dockerfile .
# cd ..; podman build --build-arg DISTRO=debian --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/libvirt.Dockerfile .
ARG DISTRO=
ARG REL=
FROM debian:$REL
FROM $DISTRO:$REL
ARG REL=
WORKDIR /root
RUN if [ "$REL" = "buster" -o "$REL" = "bullseye" ]; then \
RUN if ([ "${DISTRO}" = "debian" ]) && ( [ "${REL}" = "buster" -o "${REL}" = "bullseye" ] ); then \
echo "deb http://deb.debian.org/debian $REL-backports main" >> /etc/apt/sources.list; \
echo >> /etc/apt/preferences; \
echo 'Package: *' >> /etc/apt/preferences; \
@@ -23,7 +24,7 @@ RUN apt-get -y build-dep libvirt0
RUN apt-get -y install libglusterfs-dev
RUN apt-get --download-only source libvirt
ADD patches/libvirt-5.0-vitastor.diff patches/libvirt-7.0-vitastor.diff patches/libvirt-7.5-vitastor.diff patches/libvirt-7.6-vitastor.diff /root
ADD patches/libvirt-5.0-vitastor.diff patches/libvirt-7.0-vitastor.diff patches/libvirt-7.5-vitastor.diff patches/libvirt-7.6-vitastor.diff patches/libvirt-8.0-vitastor.diff /root
RUN set -e; \
mkdir -p /root/packages/libvirt-$REL; \
rm -rf /root/packages/libvirt-$REL/*; \

View File

@@ -3,4 +3,6 @@ usr/bin/vitastor-cli
usr/bin/vitastor-rm
usr/bin/vitastor-nbd
usr/bin/vitastor-nfs
usr/bin/vitastor-kv
usr/bin/vitastor-kv-stress
usr/lib/*/libvitastor*.so*

View File

@@ -1,8 +1,10 @@
# Build Vitastor packages for Debian inside a container
# cd ..; podman build --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/vitastor.Dockerfile .
# cd ..; podman build --build-arg DISTRO=debian --build-arg REL=bullseye -v `pwd`/packages:/root/packages -f debian/vitastor.Dockerfile .
ARG DISTRO=debian
ARG REL=
FROM debian:$REL
FROM $DISTRO:$REL
ARG DISTRO=debian
ARG REL=
WORKDIR /root
@@ -35,8 +37,8 @@ RUN set -e -x; \
mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/packages/vitastor-$REL/*; \
cd /root/packages/vitastor-$REL; \
cp -r /root/vitastor vitastor-1.4.0; \
cd vitastor-1.4.0; \
cp -r /root/vitastor vitastor-1.5.0; \
cd vitastor-1.5.0; \
ln -s /root/fio-build/fio-*/ ./fio; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
ls /usr/include/linux/raw.h || cp ./debian/raw.h /usr/include/linux/raw.h; \
@@ -49,8 +51,8 @@ RUN set -e -x; \
rm -rf a b; \
echo "dep:fio=$FIO" > debian/fio_version; \
cd /root/packages/vitastor-$REL; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_1.4.0.orig.tar.xz vitastor-1.4.0; \
cd vitastor-1.4.0; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_1.5.0.orig.tar.xz vitastor-1.5.0; \
cd vitastor-1.5.0; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \

View File

@@ -9,6 +9,8 @@
These parameters apply only to Vitastor clients (QEMU, fio, NBD and so on) and
affect their interaction with the cluster.
- [client_retry_interval](#client_retry_interval)
- [client_eio_retry_interval](#client_eio_retry_interval)
- [client_max_dirty_bytes](#client_max_dirty_bytes)
- [client_max_dirty_ops](#client_max_dirty_ops)
- [client_enable_writeback](#client_enable_writeback)
@@ -19,6 +21,26 @@ affect their interaction with the cluster.
- [nbd_max_devices](#nbd_max_devices)
- [nbd_max_part](#nbd_max_part)
## client_retry_interval
- Type: milliseconds
- Default: 50
- Minimum: 10
- Can be changed online: yes
Retry time for I/O requests failed due to inactive PGs or network
connectivity errors.
## client_eio_retry_interval
- Type: milliseconds
- Default: 1000
- Can be changed online: yes
Retry time for I/O requests failed due to data corruption or unfinished
EC object deletions (has_incomplete PG state). 0 disables such retries
and clients are not blocked and just get EIO error code instead.
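
As an illustrative aside for the two retry parameters above, a toy Python sketch of the documented behaviour; the function and error-kind names are invented here, this is not Vitastor's actual client code:

```python
def retry_delay_ms(error_kind,
                   client_retry_interval=50,         # inactive PG / network errors
                   client_eio_retry_interval=1000):  # corruption / has_incomplete
    if error_kind in ("pg_inactive", "network"):
        return client_retry_interval
    if error_kind == "eio":
        # 0 disables such retries: the client is not blocked and just gets EIO
        return client_eio_retry_interval if client_eio_retry_interval else None
    return None

print(retry_delay_ms("network"))                            # 50
print(retry_delay_ms("eio", client_eio_retry_interval=0))   # None -> EIO to caller
```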
## client_max_dirty_bytes
- Type: integer

View File

@@ -9,6 +9,8 @@
Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD и т.п.) и
затрагивают логику их работы с кластером.
- [client_retry_interval](#client_retry_interval)
- [client_eio_retry_interval](#client_eio_retry_interval)
- [client_max_dirty_bytes](#client_max_dirty_bytes)
- [client_max_dirty_ops](#client_max_dirty_ops)
- [client_enable_writeback](#client_enable_writeback)
@@ -19,6 +21,27 @@
- [nbd_max_devices](#nbd_max_devices)
- [nbd_max_part](#nbd_max_part)
## client_retry_interval
- Тип: миллисекунды
- Значение по умолчанию: 50
- Минимальное значение: 10
- Можно менять на лету: да
Время повтора запросов ввода-вывода, неудачных из-за неактивных PG или
ошибок сети.
## client_eio_retry_interval
- Тип: миллисекунды
- Значение по умолчанию: 1000
- Можно менять на лету: да
Время повтора запросов ввода-вывода, неудачных из-за повреждения данных
или незавершённых удалений EC-объектов (состояния PG has_incomplete).
0 отключает повторы таких запросов и клиенты не блокируются, а вместо
этого просто получают код ошибки EIO.
## client_max_dirty_bytes
- Тип: целое число

View File

@@ -19,8 +19,8 @@ These parameters only apply to Monitors.
## etcd_mon_ttl
- Type: seconds
- Default: 30
- Minimum: 10
- Default: 1
- Minimum: 5
Monitor etcd lease refresh interval in seconds

View File

@@ -19,8 +19,8 @@
## etcd_mon_ttl
- Тип: секунды
- Значение по умолчанию: 30
- Минимальное значение: 10
- Значение по умолчанию: 1
- Минимальное значение: 5
Интервал обновления etcd резервации (lease) монитором

View File

@@ -25,7 +25,6 @@ between clients, OSDs and etcd.
- [peer_connect_timeout](#peer_connect_timeout)
- [osd_idle_timeout](#osd_idle_timeout)
- [osd_ping_timeout](#osd_ping_timeout)
- [up_wait_retry_interval](#up_wait_retry_interval)
- [max_etcd_attempts](#max_etcd_attempts)
- [etcd_quick_timeout](#etcd_quick_timeout)
- [etcd_slow_timeout](#etcd_slow_timeout)
@@ -212,17 +211,6 @@ Maximum time to wait for OSD keepalive responses. If an OSD doesn't respond
within this time, the connection to it is dropped and a reconnection attempt
is scheduled.
## up_wait_retry_interval
- Type: milliseconds
- Default: 500
- Minimum: 50
- Can be changed online: yes
OSDs respond to clients with a special error code when they receive I/O
requests for a PG that's not synchronized and started. This parameter sets
the time for the clients to wait before re-attempting such I/O requests.
## max_etcd_attempts
- Type: integer

View File

@@ -25,7 +25,6 @@
- [peer_connect_timeout](#peer_connect_timeout)
- [osd_idle_timeout](#osd_idle_timeout)
- [osd_ping_timeout](#osd_ping_timeout)
- [up_wait_retry_interval](#up_wait_retry_interval)
- [max_etcd_attempts](#max_etcd_attempts)
- [etcd_quick_timeout](#etcd_quick_timeout)
- [etcd_slow_timeout](#etcd_slow_timeout)
@@ -221,19 +220,6 @@ OSD в любом случае согласовывают реальное зн
Если OSD не отвечает за это время, соединение отключается и производится
повторная попытка соединения.
## up_wait_retry_interval
- Тип: миллисекунды
- Значение по умолчанию: 500
- Минимальное значение: 50
- Можно менять на лету: да
Когда OSD получают от клиентов запросы ввода-вывода, относящиеся к не
поднятым на данный момент на них PG, либо к PG в процессе синхронизации,
они отвечают клиентам специальным кодом ошибки, означающим, что клиент
должен некоторое время подождать перед повторением запроса. Именно это время
ожидания задаёт данный параметр.
## max_etcd_attempts
- Тип: целое число

View File

@@ -59,6 +59,7 @@ them, even without restarting by updating configuration in etcd.
- [recovery_tune_client_util_high](#recovery_tune_client_util_high)
- [recovery_tune_agg_interval](#recovery_tune_agg_interval)
- [recovery_tune_sleep_min_us](#recovery_tune_sleep_min_us)
- [recovery_tune_sleep_cutoff_us](#recovery_tune_sleep_cutoff_us)
## etcd_report_interval
@@ -604,5 +605,14 @@ is usually fine.
- Default: 10
- Can be changed online: yes
Minimum possible value for auto-tuned recovery_sleep_us. Values lower
than this value are changed to 0.
Minimum possible value for auto-tuned recovery_sleep_us. Lower values
are changed to 0.
## recovery_tune_sleep_cutoff_us
- Type: microseconds
- Default: 10000000
- Can be changed online: yes
Maximum possible value for auto-tuned recovery_sleep_us. Higher values
are treated as outliers and ignored in aggregation.
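
Just to illustrate the two limits above with a toy sketch (the aggregation itself is an assumption; only the defaults and the described clamping come from the text):

```python
def filter_tuned_sleep(samples_us, sleep_min_us=10, sleep_cutoff_us=10_000_000):
    # samples above the cutoff are treated as outliers and ignored in aggregation
    kept = [s for s in samples_us if s <= sleep_cutoff_us]
    avg = sum(kept) / len(kept) if kept else 0
    # a tuned value below the minimum is changed to 0
    return avg if avg >= sleep_min_us else 0

print(filter_tuned_sleep([5, 8, 20_000_000]))  # outlier dropped, below minimum -> 0
print(filter_tuned_sleep([500, 1500]))         # -> 1000.0
```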

View File

@@ -60,6 +60,7 @@
- [recovery_tune_client_util_high](#recovery_tune_client_util_high)
- [recovery_tune_agg_interval](#recovery_tune_agg_interval)
- [recovery_tune_sleep_min_us](#recovery_tune_sleep_min_us)
- [recovery_tune_sleep_cutoff_us](#recovery_tune_sleep_cutoff_us)
## etcd_report_interval
@@ -634,4 +635,14 @@ EC (кодов коррекции ошибок) с более, чем 1 диск
- Можно менять на лету: да
Минимальное возможное значение авто-подстроенного recovery_sleep_us.
Значения ниже данного заменяются на 0.
Меньшие значения заменяются на 0.
## recovery_tune_sleep_cutoff_us
- Тип: микросекунды
- Значение по умолчанию: 10000000
- Можно менять на лету: да
Максимальное возможное значение авто-подстроенного recovery_sleep_us.
Большие значения считаются случайными выбросами и игнорируются в
усреднении.

View File

@@ -41,6 +41,7 @@ Parameters:
- [osd_tags](#osd_tags)
- [primary_affinity_tags](#primary_affinity_tags)
- [scrub_interval](#scrub_interval)
- [used_for_fs](#used_for_fs)
Examples:
@@ -154,6 +155,26 @@ That is, if it becomes impossible to place PG data on at least (pg_minsize)
OSDs, PG is deactivated for both read and write. So you know that a fresh
write always goes to at least (pg_minsize) OSDs (disks).
For example, the difference between pg_minsize 2 and 1 in a 3-way replicated
pool (pg_size=3) is:
- If 2 hosts go down with pg_minsize=2, the pool becomes inactive and remains
inactive for [osd_out_time](monitor.en.md#osd_out_time) (10 minutes). After
this timeout, the monitor selects replacement hosts/OSDs and the pool comes
up and starts to heal. Therefore, if you don't have replacement OSDs, i.e.
if you only have 3 hosts with OSDs and 2 of them are down, the pool remains
inactive until you add or return at least 1 host (or change failure_domain
to "osd").
- If 2 hosts go down with pg_minsize=1, the pool only experiences a short
I/O pause until the monitor notices that OSDs are down (5-10 seconds with
the default [etcd_report_interval](osd.en.md#etcd_report_interval)). After
this pause, I/O resumes, but new data is temporarily written in only 1 copy.
Then, after osd_out_time, the monitor also selects replacement OSDs and the
pool starts to heal.
So, pg_minsize regulates the number of failures that a pool can tolerate
without temporary downtime for [osd_out_time](monitor.en.md#osd_out_time),
but at a cost of slightly reduced storage reliability.
FIXME: pg_minsize behaviour may be changed in the future to only make PGs
read-only instead of deactivating them.
@@ -165,8 +186,8 @@ read-only instead of deactivating them.
Number of PGs for this pool. The value should be big enough for the monitor /
LP solver to be able to optimize data placement.
"Enough" is usually around 64-128 PGs per OSD, i.e. you set pg_count for pool
to (total OSD count * 100 / pg_size). You can round it to the closest power of 2,
"Enough" is usually around 10-100 PGs per OSD, i.e. you set pg_count for pool
to (total OSD count * 10 / pg_size). You can round it to the closest power of 2,
because it makes it easier to reduce or increase PG count later by dividing or
multiplying it by 2.
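
A quick worked example of the pg_count rule of thumb above (the cluster size is just an assumed number):

```python
import math

osd_count, pg_size = 30, 3                 # assumed example cluster
target = osd_count * 10 / pg_size          # 100.0 PGs by the rule of thumb
pg_count = 2 ** round(math.log2(target))   # 128: the closest power of 2
print(target, pg_count)                    # 100.0 128
```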
@@ -279,6 +300,25 @@ of the OSDs containing a data chunk for a PG.
Automatic scrubbing interval for this pool. Overrides
[global scrub_interval setting](osd.en.md#scrub_interval).
## used_for_fs
- Type: string
If non-empty, the pool is marked as used for VitastorFS with metadata stored
in block image (regular Vitastor volume) named as the value of this pool parameter.
When a pool is marked as used for VitastorFS, regular block volume creation in it
is disabled (vitastor-cli refuses to create images without --force) to protect
the user from block volume and FS file ID collisions and data loss.
[vitastor-nfs](../usage/nfs.ru.md), in its turn, refuses to use pools not marked
for the corresponding FS when starting. This also implies that you can use one
pool only for one VitastorFS.
The second thing that is disabled for VitastorFS pools is reporting per-inode space
usage statistics in etcd because a FS pool may store a very large number of files
and statistics for them all would take a lot of space in etcd.
# Examples
## Replicated pool

View File

@@ -40,6 +40,7 @@
- [osd_tags](#osd_tags)
- [primary_affinity_tags](#primary_affinity_tags)
- [scrub_interval](#scrub_interval)
- [used_for_fs](#used_for_fs)
Примеры:
@@ -157,6 +158,26 @@
OSD, PG деактивируется на чтение и запись. Иными словами, всегда известно,
что новые блоки данных всегда записываются как минимум на pg_minsize дисков.
Для примера, разница между pg_minsize 2 и 1 в реплицированном пуле с 3 копиями
данных (pg_size=3), проявляется следующим образом:
- Если 2 сервера отключаются при pg_minsize=2, пул становится неактивным и
остаётся неактивным в течение [osd_out_time](monitor.en.md#osd_out_time)
(10 минут), после чего монитор назначает другие OSD/серверы на замену, пул
поднимается и начинает восстанавливать недостающие копии данных. Соответственно,
если OSD на замену нет - то есть, если у вас всего 3 сервера с OSD и 2 из них
недоступны - пул так и остаётся недоступным до тех пор, пока вы не вернёте
или не добавите хотя бы 1 сервер (или не переключите failure_domain на "osd").
- Если 2 сервера отключаются при pg_minsize=1, ввод-вывод лишь приостанавливается
на короткое время, до тех пор, пока монитор не поймёт, что OSD отключены
(что занимает 5-10 секунд при стандартном [etcd_report_interval](osd.en.md#etcd_report_interval)).
После этого ввод-вывод восстанавливается, но новые данные временно пишутся
всего в 1 копии. Когда же проходит osd_out_time, монитор точно так же назначает
другие OSD на замену выбывшим и пул начинает восстанавливать копии данных.
То есть, pg_minsize регулирует число отказов, которые пул может пережить без
временной остановки обслуживания на [osd_out_time](monitor.ru.md#osd_out_time),
но ценой немного пониженных гарантий надёжности.
FIXME: Поведение pg_minsize может быть изменено в будущем с полной деактивации
PG на перевод их в режим только для чтения.
@@ -168,8 +189,8 @@ PG на перевод их в режим только для чтения.
Число PG для данного пула. Число должно быть достаточно большим, чтобы монитор
мог равномерно распределить по ним данные.
Обычно это означает примерно 64-128 PG на 1 OSD, т.е. pg_count можно устанавливать
равным (общему числу OSD * 100 / pg_size). Значение можно округлить до ближайшей
Обычно это означает примерно 10-100 PG на 1 OSD, т.е. pg_count можно устанавливать
равным (общему числу OSD * 10 / pg_size). Значение можно округлить до ближайшей
степени 2, чтобы потом было легче уменьшать или увеличивать число PG, умножая
или деля его на 2.
@@ -286,6 +307,27 @@ OSD с "all".
Интервал скраба, то есть, автоматической фоновой проверки данных для данного пула.
Переопределяет [глобальную настройку scrub_interval](osd.ru.md#scrub_interval).
## used_for_fs
- Type: string
Если непусто, пул помечается как используемый для файловой системы VitastorFS с
метаданными, хранимыми в блочном образе Vitastor с именем, равным значению
этого параметра.
Когда пул помечается как используемый для VitastorFS, создание обычных блочных
образов в нём отключается (vitastor-cli отказывается создавать образы без --force),
чтобы защитить пользователя от коллизий ID файлов и блочных образов и, таким
образом, от потери данных.
[vitastor-nfs](../usage/nfs.ru.md), в свою очередь, при запуске отказывается
использовать для ФС пулы, не выделенные для неё. Это также означает, что один
пул может использоваться только для одной VitastorFS.
Также для ФС-пулов отключается передача статистики в etcd по отдельным инодам,
так как ФС-пул может содержать очень много файлов и статистика по ним всем
заняла бы очень много места в etcd.
# Примеры
## Реплицированный пул

View File

@@ -1,3 +1,27 @@
- name: client_retry_interval
type: ms
min: 10
default: 50
online: true
info: |
Retry time for I/O requests failed due to inactive PGs or network
connectivity errors.
info_ru: |
Время повтора запросов ввода-вывода, неудачных из-за неактивных PG или
ошибок сети.
- name: client_eio_retry_interval
type: ms
default: 1000
online: true
info: |
Retry time for I/O requests failed due to data corruption or unfinished
EC object deletions (has_incomplete PG state). 0 disables such retries
and clients are not blocked and just get EIO error code instead.
info_ru: |
Время повтора запросов ввода-вывода, неудачных из-за повреждения данных
или незавершённых удалений EC-объектов (состояния PG has_incomplete).
0 отключает повторы таких запросов и клиенты не блокируются, а вместо
этого просто получают код ошибки EIO.
- name: client_max_dirty_bytes
type: int
default: 33554432

View File

@@ -1,7 +1,7 @@
- name: etcd_mon_ttl
type: sec
min: 10
default: 30
min: 5
default: 1
info: Monitor etcd lease refresh interval in seconds
info_ru: Интервал обновления etcd резервации (lease) монитором
- name: etcd_mon_timeout

View File

@@ -243,21 +243,6 @@
Максимальное время ожидания ответа на запрос проверки состояния соединения.
Если OSD не отвечает за это время, соединение отключается и производится
повторная попытка соединения.
- name: up_wait_retry_interval
type: ms
min: 50
default: 500
online: true
info: |
OSDs respond to clients with a special error code when they receive I/O
requests for a PG that's not synchronized and started. This parameter sets
the time for the clients to wait before re-attempting such I/O requests.
info_ru: |
Когда OSD получают от клиентов запросы ввода-вывода, относящиеся к не
поднятым на данный момент на них PG, либо к PG в процессе синхронизации,
они отвечают клиентам специальным кодом ошибки, означающим, что клиент
должен некоторое время подождать перед повторением запроса. Именно это время
ожидания задаёт данный параметр.
- name: max_etcd_attempts
type: int
default: 5

View File

@@ -731,8 +731,19 @@
default: 10
online: true
info: |
Minimum possible value for auto-tuned recovery_sleep_us. Values lower
than this value are changed to 0.
Minimum possible value for auto-tuned recovery_sleep_us. Lower values
are changed to 0.
info_ru: |
Минимальное возможное значение авто-подстроенного recovery_sleep_us.
Значения ниже данного заменяются на 0.
Меньшие значения заменяются на 0.
- name: recovery_tune_sleep_cutoff_us
type: us
default: 10000000
online: true
info: |
Maximum possible value for auto-tuned recovery_sleep_us. Higher values
are treated as outliers and ignored in aggregation.
info_ru: |
Максимальное возможное значение авто-подстроенного recovery_sleep_us.
Большие значения считаются случайными выбросами и игнорируются в
усреднении.

View File

@@ -33,6 +33,7 @@
- [Checksums](../config/layout-osd.en.md#data_csum_type)
- [Client write-back cache](../config/client.en.md#client_enable_writeback)
- [Intelligent recovery auto-tuning](../config/osd.en.md#recovery_tune_interval)
- [Clustered file system](../usage/nfs.en.md#vitastorfs)
## Plugins and tools
@@ -46,13 +47,12 @@
- [CSI plugin for Kubernetes](../installation/kubernetes.en.md)
- [OpenStack support: Cinder driver, Nova and libvirt patches](../installation/openstack.en.md)
- [Proxmox storage plugin and packages](../installation/proxmox.en.md)
- [Simplified NFS proxy for file-based image access emulation (suitable for VMWare)](../usage/nfs.en.md)
- [Simplified NFS proxy for file-based image access emulation (suitable for VMWare)](../usage/nfs.en.md#pseudo-fs)
## Roadmap
The following features are planned for the future:
- File system
- Control plane optimisation
- Other administrative tools
- Web GUI

View File

@@ -35,6 +35,7 @@
- [Контрольные суммы](../config/layout-osd.ru.md#data_csum_type)
- [Буферизация записи на стороне клиента](../config/client.ru.md#client_enable_writeback)
- [Интеллектуальная автоподстройка скорости восстановления](../config/osd.ru.md#recovery_tune_interval)
- [Кластерная файловая система](../usage/nfs.ru.md#vitastorfs)
## Драйверы и инструменты
@@ -48,11 +49,10 @@
- [CSI-плагин для Kubernetes](../installation/kubernetes.ru.md)
- [Базовая поддержка OpenStack: драйвер Cinder, патчи для Nova и libvirt](../installation/openstack.ru.md)
- [Плагин для Proxmox](../installation/proxmox.ru.md)
- [Упрощённая NFS-прокси для эмуляции файлового доступа к образам (подходит для VMWare)](../usage/nfs.ru.md)
- [Упрощённая NFS-прокси для эмуляции файлового доступа к образам (подходит для VMWare)](../usage/nfs.ru.md#псевдо-фс)
## Планы развития
- Файловая система
- Оптимизация слоя управления
- Другие инструменты администрирования
- Web-интерфейс

View File

@@ -14,6 +14,7 @@
- [Check cluster status](#check-cluster-status)
- [Create an image](#create-an-image)
- [Install plugins](#install-plugins)
- [Create VitastorFS](#create-vitastorfs)
## Preparation
@@ -75,18 +76,16 @@ On the monitor hosts:
## Create a pool
Create pool configuration in etcd:
Create a pool using vitastor-cli:
```
etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool",
"scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'
vitastor-cli create-pool testpool --pg_size 2 --pg_count 256
```
For EC pools the configuration should look like the following:
```
etcdctl --endpoints=... put /vitastor/config/pools '{"2":{"name":"ecpool",
"scheme":"ec","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}}'
vitastor-cli create-pool testpool --ec 2+2 --pg_count 256
```
After you do this, one of the monitors will configure PGs and OSDs will start them.
@@ -116,3 +115,9 @@ After that, you can [run benchmarks](../usage/fio.en.md) or [start QEMU manually
- [Proxmox](../installation/proxmox.en.md)
- [OpenStack](../installation/openstack.en.md)
- [Kubernetes CSI](../installation/kubernetes.en.md)
## Create VitastorFS
If you want to use clustered file system in addition to VM or container images:
- [Follow the instructions here](../usage/nfs.en.md#vitastorfs)

View File

@@ -14,6 +14,7 @@
- [Проверьте состояние кластера](#проверьте-состояние-кластера)
- [Создайте образ](#создайте-образ)
- [Установите плагины](#установите-плагины)
- [Создайте VitastorFS](#создайте-vitastorfs)
## Подготовка
@@ -77,18 +78,16 @@
## Создайте пул
Создайте конфигурацию пула с помощью etcdctl:
Создайте пул с помощью vitastor-cli:
```
etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool",
"scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'
vitastor-cli create-pool testpool --pg_size 2 --pg_count 256
```
Для пулов с кодами коррекции ошибок конфигурация должна выглядеть примерно так:
```
etcdctl --endpoints=... put /vitastor/config/pools '{"2":{"name":"ecpool",
"scheme":"ec","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}}'
vitastor-cli create-pool testpool --ec 2+2 --pg_count 256
```
После этого один из мониторов должен сконфигурировать PG, а OSD должны запустить их.
@@ -118,3 +117,10 @@ vitastor-cli create -s 10G testimg
- [Proxmox](../installation/proxmox.ru.md)
- [OpenStack](../installation/openstack.ru.md)
- [Kubernetes CSI](../installation/kubernetes.ru.md)
## Создайте VitastorFS
Если вы хотите использовать не только блочные образы виртуальных машин или контейнеров,
а также кластерную файловую систему, то:
- [Следуйте инструкциям](../usage/nfs.en.md#vitastorfs)

View File

@@ -11,19 +11,26 @@ Replicated setups:
- Single-threaded write+fsync latency:
- With immediate commit: 2 network roundtrips + 1 disk write.
- With lazy commit: 4 network roundtrips + 1 disk write + 1 disk flush.
- Saturated parallel read iops: min(network bandwidth, sum(disk read iops)).
- Saturated parallel write iops: min(network bandwidth, sum(disk write iops / number of replicas / write amplification)).
- Linear read: `min(total network bandwidth, sum(disk read MB/s))`.
- Linear write: `min(total network bandwidth, sum(disk write MB/s / number of replicas))`.
- Saturated parallel read iops: `min(total network bandwidth, sum(disk read iops))`.
- Saturated parallel write iops: `min(total network bandwidth / number of replicas, sum(disk write iops / number of replicas / (write amplification = 4)))`.
EC/XOR setups:
EC/XOR setups (EC N+K):
- Single-threaded (T1Q1) read latency: 1.5 network roundtrips + 1 disk read.
- Single-threaded write+fsync latency:
- With immediate commit: 3.5 network roundtrips + 1 disk read + 2 disk writes.
- With lazy commit: 5.5 network roundtrips + 1 disk read + 2 disk writes + 2 disk fsyncs.
- 0.5 in actually (k-1)/k which means that an additional roundtrip doesn't happen when
- 0.5 in actually `(N-1)/N` which means that an additional roundtrip doesn't happen when
the read sub-operation can be served locally.
- Saturated parallel read iops: min(network bandwidth, sum(disk read iops)).
- Saturated parallel write iops: min(network bandwidth, sum(disk write iops * number of data drives / (number of data + parity drives) / write amplification)).
In fact, you should put disk write iops under the condition of ~10% reads / ~90% writes in this formula.
- Linear read: `min(total network bandwidth, sum(disk read MB/s))`.
- Linear write: `min(total network bandwidth, sum(disk write MB/s * N/(N+K)))`.
- Saturated parallel read iops: `min(total network bandwidth, sum(disk read iops))`.
- Saturated parallel write iops: roughly `total iops / (N+K) / WA`. More exactly,
`min(total network bandwidth * N/(N+K), sum(disk randrw iops / (N*4 + K*5 + 1)))` with
random read/write mix corresponding to `(N-1)/(N*4 + K*5 + 1)*100 % reads`.
- For example, with EC 2+1 it is: `(7% randrw iops) / 14`.
- With EC 6+3 it is: `(12.5% randrw iops) / 40`.
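
Plugging assumed numbers into the saturated random-write formula above (per-disk iops and cluster size are hypothetical; the network term of the min() is ignored for brevity):

```python
def ec_saturated_write_iops(n, k, disk_randrw_iops, disks):
    denom = n * 4 + k * 5 + 1              # 14 for EC 2+1, 40 for EC 6+3
    read_share = (n - 1) / denom * 100     # % of reads in the random r/w mix
    return disk_randrw_iops * disks / denom, read_share

print(ec_saturated_write_iops(2, 1, 20000, 12))   # ~17143 iops at ~7.1 % reads
print(ec_saturated_write_iops(6, 3, 20000, 12))   # 6000.0 iops at 12.5 % reads
```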
Write amplification for 4 KB blocks is usually 3-5 in Vitastor:
1. Journal block write

View File

@@ -11,20 +11,27 @@
- Запись+fsync в 1 поток:
- С мгновенным сбросом: 2 RTT + 1 запись.
- С отложенным ("ленивым") сбросом: 4 RTT + 1 запись + 1 fsync.
- Параллельное чтение: сумма IOPS всех дисков либо производительность сети, если в сеть упрётся раньше.
- Параллельная запись: сумма IOPS всех дисков / число реплик / WA либо производительность сети, если в сеть упрётся раньше.
- Линейное чтение: сумма МБ/с чтения всех дисков, либо общая производительность сети (сумма пропускной способности сети всех нод), если в сеть упрётся раньше.
- Линейная запись: сумма МБ/с записи всех дисков / число реплик, либо производительность сети / число реплик, если в сеть упрётся раньше.
- Параллельное случайное мелкое чтение: сумма IOPS чтения всех дисков, либо производительность сети, если в сеть упрётся раньше.
- Параллельная случайная мелкая запись: сумма IOPS записи всех дисков / число реплик / WA, либо производительность сети / число реплик, если в сеть упрётся раньше.
При использовании кодов коррекции ошибок (EC):
При использовании кодов коррекции ошибок (EC N+K):
- Задержка чтения в 1 поток (T1Q1): 1.5 RTT + 1 чтение.
- Запись+fsync в 1 поток:
- С мгновенным сбросом: 3.5 RTT + 1 чтение + 2 записи.
- С отложенным ("ленивым") сбросом: 5.5 RTT + 1 чтение + 2 записи + 2 fsync.
- Под 0.5 на самом деле подразумевается (k-1)/k, где k - число дисков данных,
- Под 0.5 на самом деле подразумевается (N-1)/N, где N - число дисков данных,
что означает, что дополнительное обращение по сети не нужно, когда операция
чтения обслуживается локально.
- Параллельное чтение: сумма IOPS всех дисков либо производительность сети, если в сеть упрётся раньше.
- Параллельная запись: сумма IOPS всех дисков / общее число дисков данных и чётности / WA либо производительность сети, если в сеть упрётся раньше.
Примечание: IOPS дисков в данном случае надо брать в смешанном режиме чтения/записи в пропорции, аналогичной формулам выше.
- Линейное чтение: сумма МБ/с чтения всех дисков, либо общая производительность сети, если в сеть упрётся раньше.
- Линейная запись: сумма МБ/с записи всех дисков * N/(N+K), либо производительность сети * N / (N+K), если в сеть упрётся раньше.
- Параллельное случайное мелкое чтение: сумма IOPS чтения всех дисков либо производительность сети, если в сеть упрётся раньше.
- Параллельная случайная мелкая запись: грубо `(сумма IOPS / (N+K) / WA)`. Если точнее, то:
сумма смешанного IOPS всех дисков при `(N-1)/(N*4 + K*5 + 1)*100 %` чтения, делённая на `(N*4 + K*5 + 1)`.
Либо, производительность сети * N/(N+K), если в сеть упрётся раньше.
- Например, при EC 2+1 это: `(сумма IOPS при 7% чтения) / 14`.
- При EC 6+3 это: `(сумма IOPS при 12.5% чтения) / 40`.
WA (мультипликатор записи) для 4 КБ блоков в Vitastor обычно составляет 3-5:
1. Запись метаданных в журнал

View File

@@ -24,6 +24,10 @@ It supports the following commands:
- [fix](#fix)
- [alloc-osd](#alloc-osd)
- [rm-osd](#rm-osd)
- [create-pool](#create-pool)
- [modify-pool](#modify-pool)
- [ls-pools](#ls-pools)
- [rm-pool](#rm-pool)
Global options:
@@ -131,19 +135,18 @@ See also about [how to export snapshots](qemu.en.md#exporting-snapshots).
## modify
`vitastor-cli modify <name> [--rename <new-name>] [--resize <size>] [--readonly | --readwrite] [-f|--force]`
`vitastor-cli modify <name> [--rename <new-name>] [--resize <size>] [--readonly | --readwrite] [-f|--force] [--down-ok]`
Rename, resize image or change its readonly status. Images with children can't be made read-write.
If the new size is smaller than the old size, extra data will be purged.
You should resize file system in the image, if present, before shrinking it.
```
-f|--force Proceed with shrinking or setting readwrite flag even if the image has children.
```
* `-f|--force` - Proceed with shrinking or setting readwrite flag even if the image has children.
* `--down-ok` - Proceed with shrinking even if some data will be left on unavailable OSDs.
## rm
`vitastor-cli rm <from> [<to>] [--writers-stopped]`
`vitastor-cli rm <from> [<to>] [--writers-stopped] [--down-ok]`
Remove `<from>` or all layers between `<from>` and `<to>` (`<to>` must be a child of `<from>`),
rebasing all their children accordingly. --writers-stopped allows merging to be a bit
@@ -151,6 +154,10 @@ more effective in case of a single 'slim' read-write child and 'fat' removed par
the child is merged into parent and parent is renamed to child in that case.
In other cases parent layers are always merged into children.
Other options:
* `--down-ok` - Continue deletion/merging even if some data will be left on unavailable OSDs.
## flatten
`vitastor-cli flatten <layer>`
@@ -238,3 +245,91 @@ Refuses to remove OSDs with data without `--force` and `--allow-data-loss`.
With `--dry-run` only checks if deletion is possible without data loss and
redundancy degradation.
## create-pool
`vitastor-cli create-pool|pool-create <name> (-s <pg_size>|--ec <N>+<K>) -n <pg_count> [OPTIONS]`
Create a pool. Required parameters:
| <!-- --> | <!-- --> |
|--------------------------|---------------------------------------------------------------------------------------|
| `-s R` or `--pg_size R` | Number of replicas for replicated pools |
| `--ec N+K` | Number of data (N) and parity (K) chunks for erasure-coded pools |
| `-n N` or `--pg_count N` | PG count for the new pool (start with 10*<OSD count>/pg_size rounded to a power of 2) |
Optional parameters:
| <!-- --> | <!-- --> |
|--------------------------------|----------------------------------------------------------------------------|
| `--pg_minsize <number>` | R or N+K minus number of failures to tolerate without downtime ([details](../config/pool.en.md#pg_minsize)) |
| `--failure_domain host` | Failure domain: host, osd or a level from placement_levels. Default: host |
| `--root_node <node>` | Put pool only on child OSDs of this placement tree node |
| `--osd_tags <tag>[,<tag>]...` | Put pool only on OSDs tagged with all specified tags |
| `--block_size 128k` | Put pool only on OSDs with this data block size |
| `--bitmap_granularity 4k` | Put pool only on OSDs with this logical sector size |
| `--immediate_commit none` | Put pool only on OSDs with this or larger immediate_commit (none < small < all) |
| `--primary_affinity_tags tags` | Prefer to put primary copies on OSDs with all specified tags |
| `--scrub_interval <time>` | Enable regular scrubbing for this pool. Format: number + unit s/m/h/d/M/y |
| `--used_for_fs <name>` | Mark pool as used for VitastorFS with metadata in image <name> |
| `--pg_stripe_size <number>` | Increase object grouping stripe |
| `--max_osd_combinations 10000` | Maximum number of random combinations for LP solver input |
| `--wait` | Wait for the new pool to come online |
| `-f` or `--force` | Do not check that cluster has enough OSDs to create the pool |
See also [Pool configuration](../config/pool.en.md) for detailed parameter descriptions.
Examples:
`vitastor-cli create-pool test_x4 -s 4 -n 32`
`vitastor-cli create-pool test_ec42 --ec 4+2 -n 32`
## modify-pool
`vitastor-cli modify-pool|pool-modify <id|name> [--name <new_name>] [PARAMETERS...]`
Modify an existing pool. Modifiable parameters:
```
[-s|--pg_size <number>] [--pg_minsize <number>] [-n|--pg_count <count>]
[--failure_domain <level>] [--root_node <node>] [--osd_tags <tags>] [--no_inode_stats 0|1]
[--max_osd_combinations <number>] [--primary_affinity_tags <tags>] [--scrub_interval <time>]
```
Non-modifiable parameters (changing them WILL lead to data loss):
```
[--block_size <size>] [--bitmap_granularity <size>]
[--immediate_commit <all|small|none>] [--pg_stripe_size <size>]
```
These, however, can still be modified with -f|--force.
See [create-pool](#create-pool) for parameter descriptions.
Examples:
`vitastor-cli modify-pool pool_A --name pool_B`
`vitastor-cli modify-pool 2 --pg_size 4 -n 128`
## rm-pool
`vitastor-cli rm-pool|pool-rm [--force] <id|name>`
Remove a pool. Refuses to remove pools with images without `--force`.
## ls-pools
`vitastor-cli ls-pools|pool-ls|ls-pool|pools [-l] [--detail] [--sort FIELD] [-r] [-n N] [--stats] [<glob> ...]`
List pools (only matching <glob> patterns if passed).
| <!-- --> | <!-- --> |
|----------------------|-------------------------------------------------------|
| `-l` or `--long` | Also report I/O statistics |
| `--detail` | Use list format (not table), show all details |
| `--sort FIELD` | Sort by specified field (see fields in --json output) |
| `-r` or `--reverse` | Sort in descending order |
| `-n` or `--count N` | Only list first N items |

View File

@@ -23,6 +23,10 @@ vitastor-cli - интерфейс командной строки для адм
- [merge-data](#merge-data)
- [alloc-osd](#alloc-osd)
- [rm-osd](#rm-osd)
- [create-pool](#create-pool)
- [modify-pool](#modify-pool)
- [ls-pools](#ls-pools)
- [rm-pool](#rm-pool)
Глобальные опции:
@@ -85,8 +89,8 @@ kaveri 2/1 32 0 B 10 G 0 B 100% 0%
`vitastor-cli ls [-l] [-p POOL] [--sort FIELD] [-r] [-n N] [<glob> ...]`
Показать список образов, если переданы шаблоны `<glob>`, то только с именами,
соответствующими этим шаблонам (стандартные ФС-шаблоны с * и ?).
Показать список образов, если передан(ы) шаблон(ы) `<glob>`, то только с именами,
соответствующими одному из шаблонов (стандартные ФС-шаблоны с * и ?).
Опции:
@@ -132,7 +136,7 @@ vitastor-cli snap-create [-p|--pool <id|name>] <image>@<snapshot>
## modify
`vitastor-cli modify <name> [--rename <new-name>] [--resize <size>] [--readonly | --readwrite] [-f|--force]`
`vitastor-cli modify <name> [--rename <new-name>] [--resize <size>] [--readonly | --readwrite] [-f|--force] [--down-ok]`
Изменить размер, имя образа или флаг "только для чтения". Снимать флаг "только для чтения"
и уменьшать размер образов, у которых есть дочерние клоны, без `--force` нельзя.
@@ -140,13 +144,12 @@ vitastor-cli snap-create [-p|--pool <id|name>] <image>@<snapshot>
Если новый размер меньше старого, "лишние" данные будут удалены, поэтому перед уменьшением
образа сначала уменьшите файловую систему в нём.
```
-f|--force Разрешить уменьшение или перевод в чтение-запись образа, у которого есть клоны.
```
* `-f|--force` - Разрешить уменьшение или перевод в чтение-запись образа, у которого есть клоны.
* `--down-ok` - Разрешить уменьшение, даже если часть данных останется неудалённой на недоступных OSD.
## rm
`vitastor-cli rm <from> [<to>] [--writers-stopped]`
`vitastor-cli rm <from> [<to>] [--writers-stopped] [--down-ok]`
Удалить образ `<from>` или все слои от `<from>` до `<to>` (`<to>` должен быть дочерним
образом `<from>`), одновременно меняя родительские образы их клонов (если таковые есть).
@@ -158,6 +161,10 @@ vitastor-cli snap-create [-p|--pool <id|name>] <image>@<snapshot>
В других случаях родительские слои вливаются в дочерние.
Другие опции:
* `--down-ok` - Продолжать удаление/слияние, даже если часть данных останется неудалённой на недоступных OSD.
## flatten
`vitastor-cli flatten <layer>`
@@ -255,3 +262,91 @@ vitastor-cli snap-create [-p|--pool <id|name>] <image>@<snapshot>
С опцией `--dry-run` только проверяет, возможно ли удаление без потери данных и деградации
избыточности.
## create-pool
`vitastor-cli create-pool|pool-create <name> (-s <pg_size>|--ec <N>+<K>) -n <pg_count> [OPTIONS]`
Создать пул. Обязательные параметры:
| <!-- --> | <!-- --> |
|---------------------------|---------------------------------------------------------------------------------------------|
| `-s R` или `--pg_size R` | Число копий данных для реплицированных пулов |
| `--ec N+K` | Число частей данных (N) и чётности (K) для пулов с кодами коррекции ошибок |
| `-n N` или `--pg_count N` | Число PG для нового пула (начните с 10*<число OSD>/pg_size, округлённого до степени двойки) |
Необязательные параметры:
| <!-- --> | <!-- --> |
|--------------------------------|----------------------------------------------------------------------------|
| `--pg_minsize <number>` | (R или N+K) минус число разрешённых отказов без остановки пула ([подробнее](../config/pool.ru.md#pg_minsize)) |
| `--failure_domain host` | Домен отказа: host, osd или другой из placement_levels. По умолчанию: host |
| `--root_node <node>` | Использовать для пула только дочерние OSD этого узла дерева размещения |
| `--osd_tags <tag>[,<tag>]...` | ...только OSD со всеми заданными тегами |
| `--block_size 128k` | ...только OSD с данным размером блока |
| `--bitmap_granularity 4k` | ...только OSD с данным размером логического сектора |
| `--immediate_commit none` | ...только OSD с этим или большим immediate_commit (none < small < all) |
| `--primary_affinity_tags tags` | Предпочитать OSD со всеми данными тегами для роли первичных |
| `--scrub_interval <time>` | Включить скрабы с заданным интервалом времени (число + единица s/m/h/d/M/y) |
| `--pg_stripe_size <number>` | Увеличить блок группировки объектов по PG |
| `--max_osd_combinations 10000` | Максимальное число случайных комбинаций OSD для ЛП-солвера |
| `--wait` | Подождать, пока новый пул будет активирован |
| `-f` или `--force` | Не проверять, что в кластере достаточно доменов отказа для создания пула |
Подробно о параметрах см. [Конфигурация пулов](../config/pool.ru.md).
Примеры:
`vitastor-cli create-pool test_x4 -s 4 -n 32`
`vitastor-cli create-pool test_ec42 --ec 4+2 -n 32`
## modify-pool
`vitastor-cli modify-pool|pool-modify <id|name> [--name <new_name>] [PARAMETERS...]`
Изменить настройки существующего пула. Изменяемые параметры:
```
[-s|--pg_size <number>] [--pg_minsize <number>] [-n|--pg_count <count>]
[--failure_domain <level>] [--root_node <node>] [--osd_tags <tags>]
[--max_osd_combinations <number>] [--primary_affinity_tags <tags>] [--scrub_interval <time>]
```
Неизменяемые параметры (их изменение ПРИВЕДЁТ к потере данных):
```
[--block_size <size>] [--bitmap_granularity <size>]
[--immediate_commit <all|small|none>] [--pg_stripe_size <size>]
```
Эти параметры можно изменить, только если явно передать опцию -f или --force.
Описания параметров смотрите в [create-pool](#create-pool).
Примеры:
`vitastor-cli modify-pool pool_A --name pool_B`
`vitastor-cli modify-pool 2 --pg_size 4 -n 128`
## rm-pool
`vitastor-cli rm-pool|pool-rm [--force] <id|name>`
Удалить пул. Отказывается удалять пул, в котором ещё есть образы, без `--force`.
## ls-pools
`vitastor-cli ls-pools|pool-ls|ls-pool|pools [-l] [--detail] [--sort FIELD] [-r] [-n N] [--stats] [<glob> ...]`
Показать список пулов. Если передан(ы) шаблон(ы) `<glob>`, то только с именами,
соответствующими одному из шаблонов (стандартные ФС-шаблоны с * и ?).
| <!-- --> | <!-- --> |
|-----------------------|------------------------------------------------------------|
| `-l` или `--long` | Вывести также статистику ввода-вывода |
| `--detail` | Максимально подробный вывод в виде списка (а не таблицы) |
| `--sort FIELD` | Сортировать по заданному полю (поля см. в выводе с --json) |
| `-r` или `--reverse` | Сортировать в обратном порядке |
| `-n` или `--count N` | Выводить только первые N записей |

View File

@@ -261,7 +261,7 @@ Options (see also [Cluster-Wide Disk Layout Parameters](../config/layout-cluster
```
--object_size 128k Set blockstore block size
--bitmap_granularity 4k Set bitmap granularity
--journal_size 16M Set journal size
--journal_size 32M Set journal size
--data_csum_type none Set data checksum type (crc32c or none)
--csum_block_size 4k Set data checksum block size
--device_block_size 4k Set device block size

View File

@@ -267,7 +267,7 @@ OSD отключены fsync-и.
```
--object_size 128k Размер блока хранилища
--bitmap_granularity 4k Гранулярность битовых карт
--journal_size 16M Размер журнала
--journal_size 32M Размер журнала
--data_csum_type none Задать тип контрольных сумм (crc32c или none)
--csum_block_size 4k Задать размер блока расчёта контрольных сумм
--device_block_size 4k Размер блока устройства

View File

@@ -4,42 +4,150 @@
[Читать на русском](nfs.ru.md)
# NFS
# VitastorFS and pseudo-FS
Vitastor has a simplified NFS 3.0 proxy for file-based image access emulation. It's not
suitable as a full-featured file system, at least because all file/image metadata is stored
in etcd and kept in memory all the time - thus you can't put a lot of files in it.
Vitastor has two file system implementations. Both can be used via `vitastor-nfs`.
However, NFS proxy is totally fine as a method to provide VM image access and allows to
plug Vitastor into, for example, VMWare. It's important to note that for VMWare it's a much
better access method than iSCSI, because with iSCSI we'd have to put all VM images into one
Vitastor image exported as a LUN to VMWare and formatted with VMFS. VMWare doesn't use VMFS
over NFS.
Commands:
- [mount](#mount)
- [start](#start)
NFS proxy is stateless if you use immediate_commit=all mode (for SSD with capacitors or
HDDs with disabled cache), so you can run multiple NFS proxies and use a network load
balancer or any failover method you want to in that case.
## Pseudo-FS
vitastor-nfs usage:
The simplified pseudo-FS proxy is used to emulate file-based access to block images. It's not
suitable as a full-featured file system: it lacks many FS features and stores all
file/image metadata in memory and in etcd, so it's fine for hundreds or thousands
of large files/images, but not for millions.
The pseudo-FS proxy is intended for environments where other block volume access methods
can't be used or impose additional restrictions - for example, VMWare. NFS is a better
fit for VMWare than, for example, iSCSI, because with iSCSI, VMWare puts all VM images
into one large shared block image in its own VMFS file system, while with NFS, VMWare
doesn't use VMFS and puts each VM disk in a regular file which maps to exactly one
Vitastor block image, just as originally intended.
To use Vitastor pseudo-FS locally, run `vitastor-nfs mount --block /mnt/vita`.
You can also start the network server instead:
```
vitastor-nfs [STANDARD OPTIONS] [OTHER OPTIONS]
--subdir <DIR> export images prefixed <DIR>/ (default empty - export all images)
--portmap 0 do not listen on port 111 (portmap/rpcbind, requires root)
--bind <IP> bind service to <IP> address (default 0.0.0.0)
--nfspath <PATH> set NFS export path to <PATH> (default is /)
--port <PORT> use port <PORT> for NFS services (default is 2049)
--pool <POOL> use <POOL> as default pool for new files (images)
--foreground 1 stay in foreground, do not daemonize
vitastor-nfs start --block --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
```
Example start and mount commands (etcd_address is optional):
To mount the FS exported by this server, run:
```
vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
mount server:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
```
```
mount localhost:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
```
## VitastorFS
VitastorFS is a full-featured clustered (Read-Write-Many) file system. It supports most POSIX
features like hierarchical organization, symbolic links, hard links, quick renames and so on.
VitastorFS metadata is stored in a Parallel Optimistic B-Tree key-value database,
implemented over a regular Vitastor block volume. Directory entries and inodes
are stored in the B-Tree in a simple human-readable JSON format. The `vitastor-kv` tool
can be used to inspect the database.
To use VitastorFS:
1. Create a pool or choose an existing empty pool for FS data
2. Create an image for FS metadata, preferably in a faster (SSD or replica-HDD) pool,
but you can create it in the data pool too if you want (image size doesn't matter):
`vitastor-cli create -s 10G -p fastpool testfs`
3. Mark the data pool as an FS pool: `vitastor-cli modify-pool --used-for-fs testfs data-pool`
4. Either mount the FS: `vitastor-nfs mount --fs testfs --pool data-pool /mnt/vita`
5. Or start the NFS server: `vitastor-nfs start --fs testfs --pool data-pool`
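Put together, a minimal sketch of the whole setup (the pool and image names follow the steps above and are only examples):
```
vitastor-cli create-pool data-pool --ec 4+2 -n 32          # or pick an existing empty pool
vitastor-cli create -s 10G -p fastpool testfs              # metadata image in a faster pool
vitastor-cli modify-pool --used-for-fs testfs data-pool    # mark the data pool as an FS pool
vitastor-nfs mount --fs testfs --pool data-pool /mnt/vita  # or: vitastor-nfs start --fs testfs --pool data-pool
```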
### Supported POSIX features
- Read-after-write semantics (read returns new data immediately after write)
- Linear and random read and write
- Writing outside current file size
- Hierarchical structure, immediate rename of files and directories
- File size change support (truncate)
- Permissions (chmod/chown)
- Flushing data to stable storage (if required) (fsync)
- Symbolic links
- Hard links
- Special files (devices, sockets, named pipes)
- File modification and attribute change time tracking (mtime and ctime)
- Modification time (mtime) and last access time (atime) change support (utimes)
- Correct handling of directory listing during file creation/deletion
### Limitations
POSIX features currently not implemented in VitastorFS:
- File locking is not supported
- Actually used space is not counted, so `du` always reports apparent file sizes
instead of actually allocated space
- Access times (`atime`) are not tracked (like `-o noatime`)
- Modification time (`mtime`) is updated lazily every second (like `-o lazytime`)
Other notable missing features which should be addressed in the future:
- Defragmentation of "shared" inodes. Files smaller than the pool object size (block_size
  multiplied by the data part count if the pool is EC) are internally stored in large block
  volumes sequentially, one after another, and leave garbage behind after deletion or resizing.
  A defragmentation tool will be implemented to collect this garbage.
- Inode ID reuse. Inode IDs currently only grow; the limit is 2^48 inodes, so in theory
  you may hit it if you create and delete a very large number of files.
- Compaction of the key-value B-Tree. The current implementation never merges or deletes
  B-Tree blocks, so the B-Tree may become bloated over time. For now you can use the
  `vitastor-kv dumpjson` & `loadjson` commands to recreate the index in such situations
  (see the sketch after this list).
- Filesystem check tool. VitastorFS doesn't have a journal because it would impose a
  severe performance hit; optimistic CAS-based transactions are used instead. So, again,
  in theory an abnormal shutdown of the FS server may leave some garbage in the DB.
  The FS is implemented in such a way that this garbage doesn't affect its function,
  but having a tool to clean it up still seems the right thing to do.
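A minimal sketch of the dump & reload workaround from the list above. The metadata image name `testfs` comes from the setup example, and the exact `vitastor-kv` invocation syntax (image name before the command) is an assumption; stop all FS servers and clients before doing this:
```
# argument order is assumed - verify against your vitastor-kv build
vitastor-kv testfs dumpjson > fs-meta.json     # dump the whole metadata B-Tree to a file
# re-create or empty the metadata image, then load the dump back,
# which rebuilds the B-Tree without the accumulated garbage:
vitastor-kv testfs loadjson < fs-meta.json
```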
## Horizontal scaling
The Linux NFS 3.0 client doesn't support built-in scaling or failover, i.e. you can't
specify multiple server addresses when mounting the FS.
However, you can use any regular TCP load balancing over multiple NFS servers.
It's absolutely safe with the `immediate_commit=all` and `client_enable_writeback=false`
settings, because the Vitastor NFS proxy doesn't keep uncommitted data in memory
with these settings. It may even work without `immediate_commit=all`, because
the Linux NFS client repeats all uncommitted writes if it loses the connection.
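For example, a minimal sketch with two proxies behind a TCP balancer (all addresses are hypothetical; any TCP balancer, such as a VIP managed by keepalived or an haproxy `mode tcp` listener, will do):
```
# on server 1 (192.168.5.11) and on server 2 (192.168.5.12), respectively:
vitastor-nfs start --fs testfs --pool data-pool --bind 192.168.5.11
vitastor-nfs start --fs testfs --pool data-pool --bind 192.168.5.12
# on clients, mount through the balancer address (192.168.5.100 here):
mount 192.168.5.100:/ /mnt/ -o port=2049,mountport=2049,nfsvers=3,soft,nolock,tcp
```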
## Commands
### mount
`vitastor-nfs (--fs <NAME> | --block) [-o <OPT>] mount <MOUNTPOINT>`
Start a local file system server and mount the file system at <MOUNTPOINT>.
Use the regular `umount <MOUNTPOINT>` command to unmount the FS.
The server will be stopped automatically when the FS is unmounted.
- `-o|--options <OPT>` - Pass additional NFS mount options (for example, `-o async`).
### start
`vitastor-nfs (--fs <NAME> | --block) start`
Start network NFS server. Options:
| <!-- --> | <!-- --> |
|-----------------|------------------------------------------------------------|
| `--bind <IP>` | bind service to \<IP> address (default 0.0.0.0) |
| `--port <PORT>` | use port \<PORT> for NFS services (default is 2049) |
| `--portmap 0` | do not listen on port 111 (portmap/rpcbind, requires root) |
## Common options
| <!-- --> | <!-- --> |
|--------------------|----------------------------------------------------------|
| `--fs <NAME>` | use VitastorFS with metadata in image \<NAME> |
| `--block` | use pseudo-FS presenting images as files |
| `--pool <POOL>` | use \<POOL> as default pool for new files |
| `--subdir <DIR>` | export \<DIR> instead of root directory (pseudo-FS only) |
| `--nfspath <PATH>` | set NFS export path to \<PATH> (default is /) |
| `--pidfile <FILE>` | write process ID to the specified file |
| `--logfile <FILE>` | log to the specified file |
| `--foreground 1` | stay in foreground, do not daemonize |
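For example, a daemonized server combining several of these options (the address and file paths are illustrative):
```
vitastor-nfs start --fs testfs --pool data-pool \
    --bind 192.168.5.10 --port 2050 \
    --pidfile /run/vitastor-nfs.pid --logfile /var/log/vitastor/nfs.log
```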

View File

@@ -4,41 +4,156 @@
[Read in English](nfs.en.md)
# NFS
# VitastorFS и псевдо-ФС
В Vitastor реализована упрощённая NFS 3.0 прокси для эмуляции файлового доступа к образам.
Это не полноценная файловая система, т.к. метаданные всех файлов (образов) сохраняются
в etcd и всё время хранятся в оперативной памяти - то есть, положить туда много файлов
не получится.
В Vitastor есть две реализации файловой системы. Обе используются через `vitastor-nfs`.
Однако в качестве способа доступа к образам виртуальных машин NFS прокси прекрасно подходит
и позволяет подключить Vitastor, например, к VMWare.
Команды:
- [mount](#mount)
- [start](#start)
При этом, если вы используете режим immediate_commit=all (для SSD с конденсаторами или HDD
с отключённым кэшем), то NFS-сервер не имеет состояния и вы можете свободно поднять
его в нескольких экземплярах и использовать поверх них сетевой балансировщик нагрузки или
схему с отказоустойчивостью.
## Псевдо-ФС
Использование vitastor-nfs:
Упрощённая реализация псевдо-ФС используется для эмуляции файлового доступа к блочным
образам Vitastor. Это не полноценная файловая система - в ней отсутствуют многие функции
POSIX ФС, а метаданные всех файлов (образов) сохраняются в etcd и всё время хранятся в
оперативной памяти - то есть, псевдо-ФС подходит для сотен или тысяч файлов, но не миллионов.
Псевдо-ФС предназначена для доступа к образам виртуальных машин в средах, где другие
способы невозможны или неудобны - например, в VMWare. Для VMWare это лучшая опция, чем
iSCSI, так как при использовании iSCSI VMWare размещает все виртуальные машины в одном
большом блочном образе внутри собственной ФС VMFS, а с NFS VMFS не используется и каждый
диск ВМ представляется в виде одного файла, то есть, соответствует одному блочному образу
Vitastor, как это и задумано изначально.
Чтобы подключить псевдо-ФС Vitastor, выполните команду `vitastor-nfs mount --block /mnt/vita`.
Либо же запустите сетевой вариант сервера:
```
vitastor-nfs [СТАНДАРТНЫЕ ОПЦИИ] [ДРУГИЕ ОПЦИИ]
--subdir <DIR> экспортировать "поддиректорию" - образы с префиксом имени <DIR>/ (по умолчанию пусто - экспортировать все образы)
--portmap 0 отключить сервис portmap/rpcbind на порту 111 (по умолчанию включён и требует root привилегий)
--bind <IP> принимать соединения по адресу <IP> (по умолчанию 0.0.0.0 - на всех)
--nfspath <PATH> установить путь NFS-экспорта в <PATH> (по умолчанию /)
--port <PORT> использовать порт <PORT> для NFS-сервисов (по умолчанию 2049)
--pool <POOL> использовать пул <POOL> для новых образов (обязательно, если пул в кластере не один)
--foreground 1 не уходить в фон после запуска
vitastor-nfs start --block --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
```
Пример монтирования Vitastor через NFS (etcd_address необязателен):
Примонтировать ФС, запущенную с такими опциями, можно следующей командой:
```
vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
mount server:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
```
```
mount localhost:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
```
## VitastorFS
VitastorFS - полноценная кластерная (Read-Write-Many) файловая система. Она поддерживает
большую часть функций POSIX - иерархическую организацию, символические ссылки, жёсткие
ссылки, быстрые переименования и так далее.
Метаданные VitastorFS хранятся в собственной реализации БД формата ключ-значения,
основанной на Параллельном Оптимистичном Б-дереве поверх обычного блочного образа Vitastor.
И записи каталогов, и иноды, как обычно в Vitastor, хранятся в простом человекочитаемом
JSON-формате :-). Для инспекции содержимого БД можно использовать инструмент `vitastor-kv`.
Чтобы использовать VitastorFS:
1. Создайте пул для данных ФС или выберите существующий пустой пул
2. Создайте блочный образ для метаданных ФС, желательно, в более быстром пуле (на SSD
или по крайней мере на HDD, но без EC), но можно и в том же пуле, что данные
(размер образа значения не имеет):
`vitastor-cli create -s 10G -p fastpool testfs`
3. Пометьте пул данных как ФС-пул: `vitastor-cli modify-pool --used-for-fs testfs data-pool`
4. Либо примонтируйте ФС: `vitastor-nfs mount --fs testfs --pool data-pool /mnt/vita`
5. Либо запустите сетевой NFS-сервер: `vitastor-nfs start --fs testfs --pool data-pool`
### Поддерживаемые функции POSIX
- Чтение актуальной версии данных сразу после записи
- Последовательное и произвольное чтение и запись
- Запись за пределами текущего размера файла
- Иерархическая организация, мгновенное переименование файлов и каталогов
- Изменение размера файла (truncate)
- Права на файлы (chmod/chown)
- Фиксация данных на диски (когда необходимо) (fsync)
- Символические ссылки
- Жёсткие ссылки
- Специальные файлы (устройства, сокеты, каналы)
- Отслеживание времён модификации (mtime), изменения атрибутов (ctime)
- Ручное изменение времён модификации (mtime), последнего доступа (atime)
- Корректная обработка изменений списка файлов во время листинга
### Ограничения
Отсутствующие на данный момент в VitastorFS функции POSIX:
- Блокировки файлов не поддерживаются
- Фактически занятое файлами место не подсчитывается и не возвращается вызовами
stat(2), так что `du` всегда показывает сумму размеров файлов, а не фактически занятое место
- Времена доступа (`atime`) не отслеживаются (как будто ФС смонтирована с `-o noatime`)
- Времена модификации (`mtime`) отслеживаются асинхронно (как будто ФС смонтирована с `-o lazytime`)
Другие недостающие функции, которые нужно добавить в будущем:
- Дефрагментация "общих инодов". На уровне реализации ФС файлы, меньшие, чем размер
объекта пула (block_size умножить на число частей данных, если пул EC),
упаковываются друг за другом в большие "общие" иноды/тома. Если такие файлы удалять
или увеличивать, они перемещаются и оставляют за собой "мусор", вот тут-то и нужен
дефрагментатор.
- Переиспользование номеров инодов. В текущей реализации номера инодов всё время
увеличиваются, так что в теории вы можете упереться в лимит, если насоздаёте
и наудаляете больше, чем 2^48 файлов.
- Очистка места в Б-дереве метаданных. Текущая реализация никогда не сливает и не
удаляет блоки Б-дерева, так что в теории дерево может разрастись и стать неоптимальным.
Если вы столкнётесь с такой ситуацией сейчас, вы можете решить её с помощью
команд `vitastor-kv dumpjson` и `loadjson` (т.е. пересоздав и загрузив обратно все метаданные ФС).
- Инструмент проверки метаданных файловой системы. У VitastorFS нет журнала, так как
журнал бы сильно замедлил реализацию, вместо него используются оптимистичные
транзакции на основе CAS (сравнить-и-записать), и теоретически при нештатном
завершении сервера ФС в БД также могут оставаться неконсистентные "мусорные"
записи. ФС устроена так, что на работу они не влияют, но для порядка и их стоит
уметь подчищать.
## Горизонтальное масштабирование
Клиент Linux NFS 3.0 не поддерживает встроенное масштабирование или отказоустойчивость.
То есть, вы не можете задать несколько адресов серверов при монтировании ФС.
Однако вы можете использовать любые стандартные сетевые балансировщики нагрузки
или схемы с отказоустойчивостью. Это точно безопасно при настройках `immediate_commit=all` и
`client_enable_writeback=false`, так как с ними NFS-сервер Vitastor вообще не хранит
в памяти ещё не зафиксированные на дисках данные; и вполне вероятно безопасно
даже без `immediate_commit=all`, потому что NFS-клиент ядра Linux повторяет все
незафиксированные запросы при потере соединения.
## Команды
### mount
`vitastor-nfs (--fs <NAME> | --block) mount [-o <OPT>] <MOUNTPOINT>`
Запустить локальный сервер и примонтировать ФС в директорию <MOUNTPOINT>.
Чтобы отмонтировать ФС, используйте обычную команду `umount <MOUNTPOINT>`.
Сервер автоматически останавливается при отмонтировании ФС.
- `-o|--options <OPT>` - Передать дополнительные опции монтирования NFS (пример: -o async).
### start
`vitastor-nfs (--fs <NAME> | --block) start`
Запустить сетевой NFS-сервер. Опции:
| <!-- --> | <!-- --> |
|-----------------|-----------------------------------------------------------------------|
| `--bind <IP>` | принимать соединения по адресу \<IP> (по умолчанию 0.0.0.0 - на всех) |
| `--port <PORT>` | использовать порт \<PORT> для NFS-сервисов (по умолчанию 2049) |
| `--portmap 0` | отключить сервис portmap/rpcbind на порту 111 (по умолчанию включён и требует root привилегий) |
## Общие опции
| <!-- --> | <!-- --> |
|--------------------|---------------------------------------------------------|
| `--fs <NAME>` | использовать VitastorFS с метаданными в образе \<NAME> |
| `--block` | использовать псевдо-ФС для доступа к блочным образам |
| `--pool <POOL>` | использовать пул \<POOL> для новых файлов (обязательно, если пул в кластере не один) |
| `--subdir <DIR>` | экспортировать подкаталог \<DIR>, а не корень (только для псевдо-ФС) |
| `--nfspath <PATH>` | установить путь NFS-экспорта в \<PATH> (по умолчанию /) |
| `--pidfile <FILE>` | записать ID процесса в заданный файл |
| `--logfile <FILE>` | записывать логи в заданный файл |
| `--foreground 1` | не уходить в фон после запуска |

View File

@@ -37,7 +37,7 @@ const etcd_allow = new RegExp('^'+[
'pg/history/[1-9]\\d*/[1-9]\\d*',
'pool/stats/[1-9]\\d*',
'history/last_clean_pgs',
'inode/stats/[1-9]\\d*/[1-9]\\d*',
'inode/stats/[1-9]\\d*/\\d+',
'pool/stats/[1-9]\\d*',
'stats',
'index/image/.*',
@@ -55,7 +55,7 @@ const etcd_tree = {
// etcd connection - configurable online
etcd_address: "10.0.115.10:2379/v3",
// mon
etcd_mon_ttl: 30, // min: 10
etcd_mon_ttl: 5, // min: 1
etcd_mon_timeout: 1000, // ms. min: 0
etcd_mon_retries: 5, // min: 0
mon_change_timeout: 1000, // ms. min: 100
@@ -86,13 +86,14 @@ const etcd_tree = {
client_max_buffered_bytes: 33554432,
client_max_buffered_ops: 1024,
client_max_writeback_iodepth: 256,
client_retry_interval: 50, // ms. min: 10
client_eio_retry_interval: 1000, // ms
// client and osd - configurable online
log_level: 0,
peer_connect_interval: 5, // seconds. min: 1
peer_connect_timeout: 5, // seconds. min: 1
osd_idle_timeout: 5, // seconds. min: 1
osd_ping_timeout: 5, // seconds. min: 1
up_wait_retry_interval: 500, // ms. min: 50
max_etcd_attempts: 5,
etcd_quick_timeout: 1000, // ms
etcd_slow_timeout: 5000, // ms
@@ -390,7 +391,8 @@ class Mon
{
constructor(config)
{
this.die = (e) => this._die(e);
this.failconnect = (e) => this._die(e, 2);
this.die = (e) => this._die(e, 1);
if (fs.existsSync(config.config_path||'/etc/vitastor/vitastor.conf'))
{
config = {
@@ -479,10 +481,10 @@ class Mon
check_config()
{
this.config.etcd_mon_ttl = Number(this.config.etcd_mon_ttl) || 30;
if (this.config.etcd_mon_ttl < 10)
this.config.etcd_mon_ttl = Number(this.config.etcd_mon_ttl) || 5;
if (this.config.etcd_mon_ttl < 1)
{
this.config.etcd_mon_ttl = 10;
this.config.etcd_mon_ttl = 1;
}
this.config.etcd_mon_timeout = Number(this.config.etcd_mon_timeout) || 0;
if (this.config.etcd_mon_timeout <= 0)
@@ -604,7 +606,7 @@ class Mon
}
if (!this.ws)
{
this.die('Failed to open etcd watch websocket');
this.failconnect('Failed to open etcd watch websocket');
}
const cur_addr = this.selected_etcd_url;
this.ws_alive = true;
@@ -674,7 +676,12 @@ class Mon
{
this.parse_kv(e.kv);
const key = e.kv.key.substr(this.etcd_prefix.length);
if (key.substr(0, 11) == '/osd/stats/' || key.substr(0, 10) == '/pg/stats/' || key.substr(0, 16) == '/osd/inodestats/')
if (key.substr(0, 11) == '/osd/state/')
{
stats_changed = true;
changed = true;
}
else if (key.substr(0, 11) == '/osd/stats/' || key.substr(0, 10) == '/pg/stats/' || key.substr(0, 16) == '/osd/inodestats/')
{
stats_changed = true;
}
@@ -791,9 +798,9 @@ class Mon
const res = await this.etcd_call('/lease/keepalive', { ID: this.etcd_lease_id }, this.config.etcd_mon_timeout, this.config.etcd_mon_retries);
if (!res.result.TTL)
{
this.die('Lease expired');
this.failconnect('Lease expired');
}
}, this.config.etcd_mon_timeout);
}, this.config.etcd_mon_ttl*1000);
if (!this.signals_set)
{
process.on('SIGINT', this.on_stop_cb);
@@ -1414,7 +1421,14 @@ class Mon
}
if (changed)
{
await this.save_pg_config(new_config_pgs);
const ok = await this.save_pg_config(new_config_pgs);
if (ok)
console.log('PG configuration successfully changed');
else
{
console.log('Someone changed PG configuration while we also tried to change it. Retrying in '+this.config.mon_change_timeout+' ms');
this.schedule_recheck();
}
}
}
this.recheck_pgs_active = false;
@@ -1495,6 +1509,11 @@ class Mon
this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, pool_res.pgs, pg_history);
}
new_config_pgs.hash = tree_hash;
return await this.save_pg_config(new_config_pgs, etcd_request);
}
async save_pg_config(new_config_pgs, etcd_request = { compare: [], success: [] })
{
etcd_request.compare.push(
{ key: b64(this.etcd_prefix+'/mon/master'), target: 'LEASE', lease: ''+this.etcd_lease_id },
{ key: b64(this.etcd_prefix+'/config/pgs'), target: 'MOD', mod_revision: ''+this.etcd_watch_revision, result: 'LESS' },
@@ -1622,9 +1641,13 @@ class Mon
}
const sum_diff = { op_stats: {}, subop_stats: {}, recovery_stats: {} };
// Sum derived values instead of deriving summed
for (const osd in this.state.osd.stats)
for (const osd in this.state.osd.state)
{
const derived = this.prev_stats.osd_diff[osd];
if (!this.state.osd.state[osd] || !derived)
{
continue;
}
for (const type in sum_diff)
{
for (const op in derived[type]||{})
@@ -1714,8 +1737,11 @@ class Mon
for (const inode_num in this.state.osd.space[osd_num][pool_id])
{
const u = BigInt(this.state.osd.space[osd_num][pool_id][inode_num]||0);
inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
inode_stats[pool_id][inode_num].raw_used += u;
if (inode_num)
{
inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
inode_stats[pool_id][inode_num].raw_used += u;
}
this.state.pool.stats[pool_id].used_raw_tb += u;
}
}
@@ -1725,9 +1751,13 @@ class Mon
const used = this.state.pool.stats[pool_id].used_raw_tb;
this.state.pool.stats[pool_id].used_raw_tb = Number(used)/1024/1024/1024/1024;
}
for (const osd_num in this.state.osd.inodestats)
for (const osd_num in this.state.osd.state)
{
const ist = this.state.osd.inodestats[osd_num];
if (!ist || !this.state.osd.state[osd_num])
{
continue;
}
for (const pool_id in ist)
{
inode_stats[pool_id] = inode_stats[pool_id] || {};
@@ -1743,9 +1773,14 @@ class Mon
}
}
}
for (const osd in this.prev_stats.osd_diff)
for (const osd in this.state.osd.state)
{
for (const pool_id in this.prev_stats.osd_diff[osd].inode_stats)
const osd_diff = this.prev_stats.osd_diff[osd];
if (!osd_diff || !this.state.osd.state[osd])
{
continue;
}
for (const pool_id in osd_diff.inode_stats)
{
for (const inode_num in this.prev_stats.osd_diff[osd].inode_stats[pool_id])
{
@@ -1985,14 +2020,14 @@ class Mon
return res.json;
}
}
this.die();
this.failconnect();
}
_die(err)
_die(err, code)
{
// In fact we can just try to rejoin
console.error(new Error(err || 'Cluster connection failed'));
process.exit(1);
process.exit(code || 2);
}
local_ips(all)

View File

@@ -1,6 +1,6 @@
{
"name": "vitastor-mon",
"version": "1.4.0",
"version": "1.5.0",
"description": "Vitastor SDS monitor service",
"main": "mon-main.js",
"scripts": {

View File

@@ -8,7 +8,9 @@ PartOf=vitastor.target
LimitNOFILE=1048576
LimitNPROC=1048576
LimitMEMLOCK=infinity
ExecStart=bash -c 'exec vitastor-disk exec-osd /dev/vitastor/osd%i-data >>/var/log/vitastor/osd%i.log 2>&1'
# Use the following for direct logs to files
#ExecStart=bash -c 'exec vitastor-disk exec-osd /dev/vitastor/osd%i-data >>/var/log/vitastor/osd%i.log 2>&1'
ExecStart=vitastor-disk exec-osd /dev/vitastor/osd%i-data
ExecStartPre=+vitastor-disk pre-exec /dev/vitastor/osd%i-data
WorkingDirectory=/
User=vitastor

View File

@@ -50,7 +50,7 @@ from cinder.volume import configuration
from cinder.volume import driver
from cinder.volume import volume_utils
VERSION = '1.4.0'
VERSION = '1.5.0'
LOG = logging.getLogger(__name__)

View File

@@ -0,0 +1,692 @@
commit d85024bd803b3b91f15578ed22de4ce31856626f
Author: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Wed Jan 24 18:07:43 2024 +0300
Add Vitastor support
diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index 7fa5c2b8b5..2d77f391e7 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -1898,6 +1898,35 @@
</element>
</define>
+ <define name="diskSourceNetworkProtocolVitastor">
+ <element name="source">
+ <interleave>
+ <attribute name="protocol">
+ <value>vitastor</value>
+ </attribute>
+ <ref name="diskSourceCommon"/>
+ <optional>
+ <attribute name="name"/>
+ </optional>
+ <optional>
+ <attribute name="query"/>
+ </optional>
+ <zeroOrMore>
+ <ref name="diskSourceNetworkHost"/>
+ </zeroOrMore>
+ <optional>
+ <element name="config">
+ <attribute name="file">
+ <ref name="absFilePath"/>
+ </attribute>
+ <empty/>
+ </element>
+ </optional>
+ <empty/>
+ </interleave>
+ </element>
+ </define>
+
<define name="diskSourceNetworkProtocolISCSI">
<element name="source">
<attribute name="protocol">
@@ -2154,6 +2183,7 @@
<ref name="diskSourceNetworkProtocolSimple"/>
<ref name="diskSourceNetworkProtocolVxHS"/>
<ref name="diskSourceNetworkProtocolNFS"/>
+ <ref name="diskSourceNetworkProtocolVitastor"/>
</choice>
</define>
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
index f89856b93e..a8cb9387e2 100644
--- a/include/libvirt/libvirt-storage.h
+++ b/include/libvirt/libvirt-storage.h
@@ -246,6 +246,7 @@ typedef enum {
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS = 1 << 17,
VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE = 1 << 18,
VIR_CONNECT_LIST_STORAGE_POOLS_ISCSI_DIRECT = 1 << 19,
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR = 1 << 20,
} virConnectListAllStoragePoolsFlags;
int virConnectListAllStoragePools(virConnectPtr conn,
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 5691b8d2d5..6669e8451d 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -8293,7 +8293,8 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
src->configFile = virXPathString("string(./config/@file)", ctxt);
if (src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTP ||
- src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS)
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS ||
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR)
src->query = virXMLPropString(node, "query");
if (virDomainStorageNetworkParseHosts(node, ctxt, &src->hosts, &src->nhosts) < 0)
@@ -31267,6 +31268,7 @@ virDomainStorageSourceTranslateSourcePool(virStorageSource *src,
case VIR_STORAGE_POOL_MPATH:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/conf/domain_validate.c b/src/conf/domain_validate.c
index a4271f1247..621c1b7b31 100644
--- a/src/conf/domain_validate.c
+++ b/src/conf/domain_validate.c
@@ -508,7 +508,7 @@ virDomainDiskDefValidateSourceChainOne(const virStorageSource *src)
}
}
- /* internal snapshots and config files are currently supported only with rbd: */
+ /* internal snapshots are currently supported only with rbd: */
if (virStorageSourceGetActualType(src) != VIR_STORAGE_TYPE_NETWORK &&
src->protocol != VIR_STORAGE_NET_PROTOCOL_RBD) {
if (src->snapshot) {
@@ -517,11 +517,15 @@ virDomainDiskDefValidateSourceChainOne(const virStorageSource *src)
"only with 'rbd' disks"));
return -1;
}
-
+ }
+ /* config files are currently supported only with rbd and vitastor: */
+ if (virStorageSourceGetActualType(src) != VIR_STORAGE_TYPE_NETWORK &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_RBD &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR) {
if (src->configFile) {
virReportError(VIR_ERR_XML_ERROR, "%s",
_("<config> element is currently supported "
- "only with 'rbd' disks"));
+ "only with 'rbd' and 'vitastor' disks"));
return -1;
}
}
diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
index 6690d26ffd..2255df9d28 100644
--- a/src/conf/storage_conf.c
+++ b/src/conf/storage_conf.c
@@ -60,7 +60,7 @@ VIR_ENUM_IMPL(virStoragePool,
"logical", "disk", "iscsi",
"iscsi-direct", "scsi", "mpath",
"rbd", "sheepdog", "gluster",
- "zfs", "vstorage",
+ "zfs", "vstorage", "vitastor",
);
VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
@@ -246,6 +246,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
.formatToString = virStorageFileFormatTypeToString,
}
},
+ {.poolType = VIR_STORAGE_POOL_VITASTOR,
+ .poolOptions = {
+ .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
+ VIR_STORAGE_POOL_SOURCE_NETWORK |
+ VIR_STORAGE_POOL_SOURCE_NAME),
+ },
+ .volOptions = {
+ .defaultFormat = VIR_STORAGE_FILE_RAW,
+ .formatFromString = virStorageVolumeFormatFromString,
+ .formatToString = virStorageFileFormatTypeToString,
+ }
+ },
{.poolType = VIR_STORAGE_POOL_SHEEPDOG,
.poolOptions = {
.flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -546,6 +558,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
_("element 'name' is mandatory for RBD pool"));
return -1;
}
+ if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
+ virReportError(VIR_ERR_XML_ERROR, "%s",
+ _("element 'name' is mandatory for Vitastor pool"));
+ return -1;
+ }
if (options->formatFromString) {
g_autofree char *format = NULL;
@@ -1176,6 +1193,7 @@ virStoragePoolDefFormatBuf(virBuffer *buf,
/* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
* files, so they don't have a target */
if (def->type != VIR_STORAGE_POOL_RBD &&
+ def->type != VIR_STORAGE_POOL_VITASTOR &&
def->type != VIR_STORAGE_POOL_SHEEPDOG &&
def->type != VIR_STORAGE_POOL_GLUSTER &&
def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
index aaecf138d6..97172db38b 100644
--- a/src/conf/storage_conf.h
+++ b/src/conf/storage_conf.h
@@ -106,6 +106,7 @@ typedef enum {
VIR_STORAGE_POOL_GLUSTER, /* Gluster device */
VIR_STORAGE_POOL_ZFS, /* ZFS */
VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
+ VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
VIR_STORAGE_POOL_LAST,
} virStoragePoolType;
@@ -466,6 +467,7 @@ VIR_ENUM_DECL(virStoragePartedFs);
VIR_CONNECT_LIST_STORAGE_POOLS_SCSI | \
VIR_CONNECT_LIST_STORAGE_POOLS_MPATH | \
VIR_CONNECT_LIST_STORAGE_POOLS_RBD | \
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER | \
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS | \
diff --git a/src/conf/storage_source_conf.c b/src/conf/storage_source_conf.c
index d42f715f26..29d8da3d10 100644
--- a/src/conf/storage_source_conf.c
+++ b/src/conf/storage_source_conf.c
@@ -86,6 +86,7 @@ VIR_ENUM_IMPL(virStorageNetProtocol,
"ssh",
"vxhs",
"nfs",
+ "vitastor",
);
@@ -1265,6 +1266,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
return 24007;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_RBD:
/* we don't provide a default for RBD */
return 0;
diff --git a/src/conf/storage_source_conf.h b/src/conf/storage_source_conf.h
index c4a026881c..67568e9181 100644
--- a/src/conf/storage_source_conf.h
+++ b/src/conf/storage_source_conf.h
@@ -128,6 +128,7 @@ typedef enum {
VIR_STORAGE_NET_PROTOCOL_SSH,
VIR_STORAGE_NET_PROTOCOL_VXHS,
VIR_STORAGE_NET_PROTOCOL_NFS,
+ VIR_STORAGE_NET_PROTOCOL_VITASTOR,
VIR_STORAGE_NET_PROTOCOL_LAST
} virStorageNetProtocol;
diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
index 02903ac487..504df599fb 100644
--- a/src/conf/virstorageobj.c
+++ b/src/conf/virstorageobj.c
@@ -1481,6 +1481,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
return 1;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_LAST:
break;
@@ -1978,6 +1979,8 @@ virStoragePoolObjMatch(virStoragePoolObj *obj,
(obj->def->type == VIR_STORAGE_POOL_MPATH)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
(obj->def->type == VIR_STORAGE_POOL_RBD)) ||
+ (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
+ (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
(obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
index cbc522b300..b4760fa58d 100644
--- a/src/libvirt-storage.c
+++ b/src/libvirt-storage.c
@@ -92,6 +92,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
* VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
* VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
* VIR_CONNECT_LIST_STORAGE_POOLS_RBD
+ * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
* VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
* VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER
* VIR_CONNECT_LIST_STORAGE_POOLS_ZFS
diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
index 1ac6253ad7..abe4587f94 100644
--- a/src/libxl/libxl_conf.c
+++ b/src/libxl/libxl_conf.c
@@ -962,6 +962,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSource *src,
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/libxl/xen_xl.c b/src/libxl/xen_xl.c
index 7604e3d534..6453bb9776 100644
--- a/src/libxl/xen_xl.c
+++ b/src/libxl/xen_xl.c
@@ -1506,6 +1506,7 @@ xenFormatXLDiskSrcNet(virStorageSource *src)
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
index e5ff653a60..884ecc79ea 100644
--- a/src/qemu/qemu_block.c
+++ b/src/qemu/qemu_block.c
@@ -943,6 +943,38 @@ qemuBlockStorageSourceGetRBDProps(virStorageSource *src,
}
+static virJSONValue *
+qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
+{
+ virJSONValue *ret = NULL;
+ virStorageNetHostDef *host;
+ size_t i;
+ g_auto(virBuffer) buf = VIR_BUFFER_INITIALIZER;
+ g_autofree char *etcd = NULL;
+
+ for (i = 0; i < src->nhosts; i++) {
+ host = src->hosts + i;
+ if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
+ return NULL;
+ }
+ virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
+ }
+ if (src->nhosts > 0) {
+ etcd = virBufferContentAndReset(&buf);
+ }
+
+ if (virJSONValueObjectCreate(&ret,
+ "S:etcd-host", etcd,
+ "S:etcd-prefix", src->query,
+ "S:config-path", src->configFile,
+ "s:image", src->path,
+ NULL) < 0)
+ return NULL;
+
+ return ret;
+}
+
+
static virJSONValue *
qemuBlockStorageSourceGetSheepdogProps(virStorageSource *src)
{
@@ -1233,6 +1265,12 @@ qemuBlockStorageSourceGetBackendProps(virStorageSource *src,
return NULL;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
+ return NULL;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
@@ -2244,6 +2282,7 @@ qemuBlockGetBackingStoreString(virStorageSource *src,
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
case VIR_STORAGE_NET_PROTOCOL_SSH:
@@ -2626,6 +2665,12 @@ qemuBlockStorageSourceCreateGetStorageProps(virStorageSource *src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(location = qemuBlockStorageSourceGetVitastorProps(src)))
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(location = qemuBlockStorageSourceGetSheepdogProps(src)))
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index d822533ccb..afe2087303 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -1723,6 +1723,43 @@ qemuBuildNetworkDriveStr(virStorageSource *src,
ret = virBufferContentAndReset(&buf);
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (strchr(src->path, ':')) {
+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+ _("':' not allowed in Vitastor source volume name '%s'"),
+ src->path);
+ return NULL;
+ }
+
+ virBufferStrcat(&buf, "vitastor:image=", src->path, NULL);
+
+ if (src->nhosts > 0) {
+ virBufferAddLit(&buf, ":etcd-host=");
+ for (i = 0; i < src->nhosts; i++) {
+ if (i)
+ virBufferAddLit(&buf, ",");
+
+ /* assume host containing : is ipv6 */
+ if (strchr(src->hosts[i].name, ':'))
+ virBufferEscape(&buf, '\\', ":", "[%s]",
+ src->hosts[i].name);
+ else
+ virBufferAsprintf(&buf, "%s", src->hosts[i].name);
+
+ if (src->hosts[i].port)
+ virBufferAsprintf(&buf, "\\:%u", src->hosts[i].port);
+ }
+ }
+
+ if (src->configFile)
+ virBufferEscape(&buf, '\\', ":", ":config-path=%s", src->configFile);
+
+ if (src->query)
+ virBufferEscape(&buf, '\\', ":", ":etcd-prefix=%s", src->query);
+
+ ret = virBufferContentAndReset(&buf);
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_VXHS:
virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
_("VxHS protocol does not support URI syntax"));
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index a8401bac30..3dc1fe6db0 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -4731,7 +4731,8 @@ qemuDomainValidateStorageSource(virStorageSource *src,
if (src->query &&
(actualType != VIR_STORAGE_TYPE_NETWORK ||
(src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTPS &&
- src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP))) {
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR))) {
virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
_("query is supported only with HTTP(S) protocols"));
return -1;
@@ -9919,6 +9920,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSource *src,
break;
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c
index f92e00f9c0..854a3fbc90 100644
--- a/src/qemu/qemu_snapshot.c
+++ b/src/qemu/qemu_snapshot.c
@@ -393,6 +393,7 @@ qemuSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDef *snapdisk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
@@ -485,6 +486,7 @@ qemuSnapshotPrepareDiskExternalActive(virDomainObj *vm,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
case VIR_STORAGE_NET_PROTOCOL_HTTP:
@@ -638,6 +640,7 @@ qemuSnapshotPrepareDiskInternal(virDomainDiskDef *disk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
index 4df2c75a2b..5a5e48ef71 100644
--- a/src/storage/storage_driver.c
+++ b/src/storage/storage_driver.c
@@ -1643,6 +1643,7 @@ storageVolLookupByPathCallback(virStoragePoolObj *obj,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/storage_file/storage_source_backingstore.c b/src/storage_file/storage_source_backingstore.c
index e48ae725ab..2017ccc88c 100644
--- a/src/storage_file/storage_source_backingstore.c
+++ b/src/storage_file/storage_source_backingstore.c
@@ -284,6 +284,75 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
}
+static int
+virStorageSourceParseVitastorColonString(const char *colonstr,
+ virStorageSource *src)
+{
+ char *p, *e, *next;
+ g_autofree char *options = NULL;
+
+ /* optionally skip the "vitastor:" prefix if provided */
+ if (STRPREFIX(colonstr, "vitastor:"))
+ colonstr += strlen("vitastor:");
+
+ options = g_strdup(colonstr);
+
+ p = options;
+ while (*p) {
+ /* find : delimiter or end of string */
+ for (e = p; *e && *e != ':'; ++e) {
+ if (*e == '\\') {
+ e++;
+ if (*e == '\0')
+ break;
+ }
+ }
+ if (*e == '\0') {
+ next = e; /* last kv pair */
+ } else {
+ next = e + 1;
+ *e = '\0';
+ }
+
+ if (STRPREFIX(p, "image=")) {
+ src->path = g_strdup(p + strlen("image="));
+ } else if (STRPREFIX(p, "etcd-prefix=")) {
+ src->query = g_strdup(p + strlen("etcd-prefix="));
+ } else if (STRPREFIX(p, "config-path=")) {
+ src->configFile = g_strdup(p + strlen("config-path="));
+ } else if (STRPREFIX(p, "etcd-host=")) {
+ char *h, *sep;
+
+ h = p + strlen("etcd-host=");
+ while (h < e) {
+ for (sep = h; sep < e; ++sep) {
+ if (*sep == '\\' && (sep[1] == ',' ||
+ sep[1] == ';' ||
+ sep[1] == ' ')) {
+ *sep = '\0';
+ sep += 2;
+ break;
+ }
+ }
+
+ if (virStorageSourceRBDAddHost(src, h) < 0)
+ return -1;
+
+ h = sep;
+ }
+ }
+
+ p = next;
+ }
+
+ if (!src->path) {
+ return -1;
+ }
+
+ return 0;
+}
+
+
static int
virStorageSourceParseNBDColonString(const char *nbdstr,
virStorageSource *src)
@@ -396,6 +465,11 @@ virStorageSourceParseBackingColon(virStorageSource *src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (virStorageSourceParseVitastorColonString(path, src) < 0)
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -984,6 +1058,54 @@ virStorageSourceParseBackingJSONRBD(virStorageSource *src,
return 0;
}
+static int
+virStorageSourceParseBackingJSONVitastor(virStorageSource *src,
+ virJSONValue *json,
+ const char *jsonstr G_GNUC_UNUSED,
+ int opaque G_GNUC_UNUSED)
+{
+ const char *filename;
+ const char *image = virJSONValueObjectGetString(json, "image");
+ const char *conf = virJSONValueObjectGetString(json, "config-path");
+ const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd-prefix");
+ virJSONValue *servers = virJSONValueObjectGetArray(json, "server");
+ size_t nservers;
+ size_t i;
+
+ src->type = VIR_STORAGE_TYPE_NETWORK;
+ src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
+
+ /* legacy syntax passed via 'filename' option */
+ if ((filename = virJSONValueObjectGetString(json, "filename")))
+ return virStorageSourceParseVitastorColonString(filename, src);
+
+ if (!image) {
+ virReportError(VIR_ERR_INVALID_ARG, "%s",
+ _("missing image name in Vitastor backing volume "
+ "JSON specification"));
+ return -1;
+ }
+
+ src->path = g_strdup(image);
+ src->configFile = g_strdup(conf);
+ src->query = g_strdup(etcd_prefix);
+
+ if (servers) {
+ nservers = virJSONValueArraySize(servers);
+
+ src->hosts = g_new0(virStorageNetHostDef, nservers);
+ src->nhosts = nservers;
+
+ for (i = 0; i < nservers; i++) {
+ if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
+ virJSONValueArrayGet(servers, i)) < 0)
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
static int
virStorageSourceParseBackingJSONRaw(virStorageSource *src,
virJSONValue *json,
@@ -1162,6 +1284,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
{"sheepdog", false, virStorageSourceParseBackingJSONSheepdog, 0},
{"ssh", false, virStorageSourceParseBackingJSONSSH, 0},
{"rbd", false, virStorageSourceParseBackingJSONRBD, 0},
+ {"vitastor", false, virStorageSourceParseBackingJSONVitastor, 0},
{"raw", true, virStorageSourceParseBackingJSONRaw, 0},
{"nfs", false, virStorageSourceParseBackingJSONNFS, 0},
{"vxhs", false, virStorageSourceParseBackingJSONVxHS, 0},
diff --git a/src/test/test_driver.c b/src/test/test_driver.c
index 0e93b79922..b4d33f5f56 100644
--- a/src/test/test_driver.c
+++ b/src/test/test_driver.c
@@ -7367,6 +7367,7 @@ testStorageVolumeTypeForPool(int pooltype)
case VIR_STORAGE_POOL_ISCSI_DIRECT:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
return VIR_STORAGE_VOL_NETWORK;
case VIR_STORAGE_POOL_LOGICAL:
case VIR_STORAGE_POOL_DISK:
diff --git a/tests/storagepoolcapsschemadata/poolcaps-fs.xml b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
index eee75af746..8bd0a57bdd 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-fs.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='no'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolcapsschemadata/poolcaps-full.xml b/tests/storagepoolcapsschemadata/poolcaps-full.xml
index 805950a937..852df0de16 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-full.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-full.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='yes'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolxml2argvtest.c b/tests/storagepoolxml2argvtest.c
index 449b745519..7f95cc8e08 100644
--- a/tests/storagepoolxml2argvtest.c
+++ b/tests/storagepoolxml2argvtest.c
@@ -68,6 +68,7 @@ testCompareXMLToArgvFiles(bool shouldFail,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_VSTORAGE:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_LAST:
default:
VIR_TEST_DEBUG("pool type '%s' has no xml2argv test", defTypeStr);
diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
index d391257f6e..46799c4a90 100644
--- a/tools/virsh-pool.c
+++ b/tools/virsh-pool.c
@@ -1213,6 +1213,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd G_GNUC_UNUSED)
case VIR_STORAGE_POOL_VSTORAGE:
flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
+ flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
+ break;
case VIR_STORAGE_POOL_LAST:
break;
}

View File

@@ -0,0 +1,643 @@
commit c1cd026e211e94b120028e7c98a6e4ce5afe9846
Author: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Wed Jan 24 22:04:50 2024 +0300
Add Vitastor support
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
index aaad4a3da1..5f5daa8341 100644
--- a/include/libvirt/libvirt-storage.h
+++ b/include/libvirt/libvirt-storage.h
@@ -326,6 +326,7 @@ typedef enum {
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS = 1 << 17, /* (Since: 1.2.8) */
VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE = 1 << 18, /* (Since: 3.1.0) */
VIR_CONNECT_LIST_STORAGE_POOLS_ISCSI_DIRECT = 1 << 19, /* (Since: 5.6.0) */
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR = 1 << 20, /* (Since: 5.0.0) */
} virConnectListAllStoragePoolsFlags;
int virConnectListAllStoragePools(virConnectPtr conn,
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 22ad43e1d7..56c81d6852 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -7185,7 +7185,8 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
src->configFile = virXPathString("string(./config/@file)", ctxt);
if (src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTP ||
- src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS)
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS ||
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR)
src->query = virXMLPropString(node, "query");
if (virDomainStorageNetworkParseHosts(node, ctxt, &src->hosts, &src->nhosts) < 0)
@@ -30618,6 +30619,7 @@ virDomainStorageSourceTranslateSourcePool(virStorageSource *src,
case VIR_STORAGE_POOL_MPATH:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/conf/domain_validate.c b/src/conf/domain_validate.c
index c72108886e..c739ed6c43 100644
--- a/src/conf/domain_validate.c
+++ b/src/conf/domain_validate.c
@@ -495,6 +495,7 @@ virDomainDiskDefValidateSourceChainOne(const virStorageSource *src)
case VIR_STORAGE_NET_PROTOCOL_RBD:
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
@@ -541,7 +542,7 @@ virDomainDiskDefValidateSourceChainOne(const virStorageSource *src)
}
}
- /* internal snapshots and config files are currently supported only with rbd: */
+ /* internal snapshots are currently supported only with rbd: */
if (virStorageSourceGetActualType(src) != VIR_STORAGE_TYPE_NETWORK &&
src->protocol != VIR_STORAGE_NET_PROTOCOL_RBD) {
if (src->snapshot) {
@@ -549,10 +550,15 @@ virDomainDiskDefValidateSourceChainOne(const virStorageSource *src)
_("<snapshot> element is currently supported only with 'rbd' disks"));
return -1;
}
+ }
+ /* config files are currently supported only with rbd and vitastor: */
+ if (virStorageSourceGetActualType(src) != VIR_STORAGE_TYPE_NETWORK &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_RBD &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR) {
if (src->configFile) {
virReportError(VIR_ERR_XML_ERROR, "%s",
- _("<config> element is currently supported only with 'rbd' disks"));
+ _("<config> element is currently supported only with 'rbd' and 'vitastor' disks"));
return -1;
}
}
diff --git a/src/conf/schemas/domaincommon.rng b/src/conf/schemas/domaincommon.rng
index b98a2ae602..7d7a872e01 100644
--- a/src/conf/schemas/domaincommon.rng
+++ b/src/conf/schemas/domaincommon.rng
@@ -1997,6 +1997,35 @@
</element>
</define>
+ <define name="diskSourceNetworkProtocolVitastor">
+ <element name="source">
+ <interleave>
+ <attribute name="protocol">
+ <value>vitastor</value>
+ </attribute>
+ <ref name="diskSourceCommon"/>
+ <optional>
+ <attribute name="name"/>
+ </optional>
+ <optional>
+ <attribute name="query"/>
+ </optional>
+ <zeroOrMore>
+ <ref name="diskSourceNetworkHost"/>
+ </zeroOrMore>
+ <optional>
+ <element name="config">
+ <attribute name="file">
+ <ref name="absFilePath"/>
+ </attribute>
+ <empty/>
+ </element>
+ </optional>
+ <empty/>
+ </interleave>
+ </element>
+ </define>
+
<define name="diskSourceNetworkProtocolISCSI">
<element name="source">
<attribute name="protocol">
@@ -2347,6 +2376,7 @@
<ref name="diskSourceNetworkProtocolSimple"/>
<ref name="diskSourceNetworkProtocolVxHS"/>
<ref name="diskSourceNetworkProtocolNFS"/>
+ <ref name="diskSourceNetworkProtocolVitastor"/>
</choice>
</define>
diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
index 68842004b7..1d69a788b6 100644
--- a/src/conf/storage_conf.c
+++ b/src/conf/storage_conf.c
@@ -56,7 +56,7 @@ VIR_ENUM_IMPL(virStoragePool,
"logical", "disk", "iscsi",
"iscsi-direct", "scsi", "mpath",
"rbd", "sheepdog", "gluster",
- "zfs", "vstorage",
+ "zfs", "vstorage", "vitastor",
);
VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
@@ -242,6 +242,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
.formatToString = virStorageFileFormatTypeToString,
}
},
+ {.poolType = VIR_STORAGE_POOL_VITASTOR,
+ .poolOptions = {
+ .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
+ VIR_STORAGE_POOL_SOURCE_NETWORK |
+ VIR_STORAGE_POOL_SOURCE_NAME),
+ },
+ .volOptions = {
+ .defaultFormat = VIR_STORAGE_FILE_RAW,
+ .formatFromString = virStorageVolumeFormatFromString,
+ .formatToString = virStorageFileFormatTypeToString,
+ }
+ },
{.poolType = VIR_STORAGE_POOL_SHEEPDOG,
.poolOptions = {
.flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -538,6 +550,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
_("element 'name' is mandatory for RBD pool"));
return -1;
}
+ if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
+ virReportError(VIR_ERR_XML_ERROR, "%s",
+ _("element 'name' is mandatory for Vitastor pool"));
+ return -1;
+ }
if (options->formatFromString) {
g_autofree char *format = NULL;
@@ -1127,6 +1144,7 @@ virStoragePoolDefFormatBuf(virBuffer *buf,
/* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
* files, so they don't have a target */
if (def->type != VIR_STORAGE_POOL_RBD &&
+ def->type != VIR_STORAGE_POOL_VITASTOR &&
def->type != VIR_STORAGE_POOL_SHEEPDOG &&
def->type != VIR_STORAGE_POOL_GLUSTER &&
def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
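Given the pool options registered above (host, network and name sources, with the source 'name' element mandatory), a minimal Vitastor pool definition would look roughly like this sketch; the pool name, source name and address are placeholders:

  <pool type='vitastor'>
    <name>vitastor-pool</name>
    <source>
      <host name='192.168.7.2' port='5598'/>
      <name>testpool</name>
    </source>
  </pool>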
diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
index fc67957cfe..720c07ef74 100644
--- a/src/conf/storage_conf.h
+++ b/src/conf/storage_conf.h
@@ -103,6 +103,7 @@ typedef enum {
VIR_STORAGE_POOL_GLUSTER, /* Gluster device */
VIR_STORAGE_POOL_ZFS, /* ZFS */
VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
+ VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
VIR_STORAGE_POOL_LAST,
} virStoragePoolType;
@@ -454,6 +455,7 @@ VIR_ENUM_DECL(virStoragePartedFs);
VIR_CONNECT_LIST_STORAGE_POOLS_SCSI | \
VIR_CONNECT_LIST_STORAGE_POOLS_MPATH | \
VIR_CONNECT_LIST_STORAGE_POOLS_RBD | \
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER | \
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS | \
diff --git a/src/conf/storage_source_conf.c b/src/conf/storage_source_conf.c
index f974a521b1..cd394d0a9f 100644
--- a/src/conf/storage_source_conf.c
+++ b/src/conf/storage_source_conf.c
@@ -88,6 +88,7 @@ VIR_ENUM_IMPL(virStorageNetProtocol,
"ssh",
"vxhs",
"nfs",
+ "vitastor",
);
@@ -1301,6 +1302,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
return 24007;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_RBD:
/* we don't provide a default for RBD */
return 0;
diff --git a/src/conf/storage_source_conf.h b/src/conf/storage_source_conf.h
index 5e7d127453..283709eeb3 100644
--- a/src/conf/storage_source_conf.h
+++ b/src/conf/storage_source_conf.h
@@ -129,6 +129,7 @@ typedef enum {
VIR_STORAGE_NET_PROTOCOL_SSH,
VIR_STORAGE_NET_PROTOCOL_VXHS,
VIR_STORAGE_NET_PROTOCOL_NFS,
+ VIR_STORAGE_NET_PROTOCOL_VITASTOR,
VIR_STORAGE_NET_PROTOCOL_LAST
} virStorageNetProtocol;
diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
index 59fa5da372..4739167f5f 100644
--- a/src/conf/virstorageobj.c
+++ b/src/conf/virstorageobj.c
@@ -1438,6 +1438,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
return 1;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_ISCSI_DIRECT:
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_LAST:
@@ -1921,6 +1922,8 @@ virStoragePoolObjMatch(virStoragePoolObj *obj,
(obj->def->type == VIR_STORAGE_POOL_MPATH)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
(obj->def->type == VIR_STORAGE_POOL_RBD)) ||
+ (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
+ (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
(obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
index db7660aac4..561df34709 100644
--- a/src/libvirt-storage.c
+++ b/src/libvirt-storage.c
@@ -94,6 +94,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
* VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
* VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
* VIR_CONNECT_LIST_STORAGE_POOLS_RBD
+ * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
* VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
* VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER
* VIR_CONNECT_LIST_STORAGE_POOLS_ZFS
diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
index 62e1be6672..71a1d42896 100644
--- a/src/libxl/libxl_conf.c
+++ b/src/libxl/libxl_conf.c
@@ -979,6 +979,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSource *src,
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/libxl/xen_xl.c b/src/libxl/xen_xl.c
index f175359307..8efcf4c329 100644
--- a/src/libxl/xen_xl.c
+++ b/src/libxl/xen_xl.c
@@ -1456,6 +1456,7 @@ xenFormatXLDiskSrcNet(virStorageSource *src)
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
index 7e9daf0bdc..825b4a3006 100644
--- a/src/qemu/qemu_block.c
+++ b/src/qemu/qemu_block.c
@@ -758,6 +758,38 @@ qemuBlockStorageSourceGetRBDProps(virStorageSource *src,
}
+static virJSONValue *
+qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
+{
+ virJSONValue *ret = NULL;
+ virStorageNetHostDef *host;
+ size_t i;
+ g_auto(virBuffer) buf = VIR_BUFFER_INITIALIZER;
+ g_autofree char *etcd = NULL;
+
+ for (i = 0; i < src->nhosts; i++) {
+ host = src->hosts + i;
+ if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
+ return NULL;
+ }
+ virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
+ }
+ if (src->nhosts > 0) {
+ etcd = virBufferContentAndReset(&buf);
+ }
+
+ if (virJSONValueObjectAdd(&ret,
+ "S:etcd-host", etcd,
+ "S:etcd-prefix", src->query,
+ "S:config-path", src->configFile,
+ "s:image", src->path,
+ NULL) < 0)
+ return NULL;
+
+ return ret;
+}
+
+
static virJSONValue *
qemuBlockStorageSourceGetSheepdogProps(virStorageSource *src)
{
@@ -1140,6 +1172,12 @@ qemuBlockStorageSourceGetBackendProps(virStorageSource *src,
return NULL;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
+ return NULL;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
@@ -2032,6 +2070,7 @@ qemuBlockGetBackingStoreString(virStorageSource *src,
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
case VIR_STORAGE_NET_PROTOCOL_SSH:
@@ -2415,6 +2454,12 @@ qemuBlockStorageSourceCreateGetStorageProps(virStorageSource *src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(location = qemuBlockStorageSourceGetVitastorProps(src)))
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(location = qemuBlockStorageSourceGetSheepdogProps(src)))
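qemuBlockStorageSourceGetVitastorProps() above maps such a source to the file properties that the caller wires up under the "vitastor" blockdev driver; keys declared with "S:" are simply omitted when unset. With one host, the result would be roughly (all values are placeholders):

  { "etcd-host": "192.168.7.2:2379", "etcd-prefix": "/vitastor",
    "config-path": "/etc/vitastor/vitastor.conf", "image": "debian9" }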
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 953808fcfe..62860283d8 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -5215,7 +5215,8 @@ qemuDomainValidateStorageSource(virStorageSource *src,
if (src->query &&
(actualType != VIR_STORAGE_TYPE_NETWORK ||
(src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTPS &&
- src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP))) {
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR))) {
virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
_("query is supported only with HTTP(S) protocols"));
return -1;
@@ -10340,6 +10341,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSource *src,
break;
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c
index 73ff533827..e9c799ca8f 100644
--- a/src/qemu/qemu_snapshot.c
+++ b/src/qemu/qemu_snapshot.c
@@ -423,6 +423,7 @@ qemuSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDef *snapdisk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
@@ -648,6 +649,7 @@ qemuSnapshotPrepareDiskInternal(virDomainDiskDef *disk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
index 314fe930e0..fb615a8b4e 100644
--- a/src/storage/storage_driver.c
+++ b/src/storage/storage_driver.c
@@ -1626,6 +1626,7 @@ storageVolLookupByPathCallback(virStoragePoolObj *obj,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/storage_file/storage_source_backingstore.c b/src/storage_file/storage_source_backingstore.c
index 80681924ea..8a3ade9ec0 100644
--- a/src/storage_file/storage_source_backingstore.c
+++ b/src/storage_file/storage_source_backingstore.c
@@ -287,6 +287,75 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
}
+static int
+virStorageSourceParseVitastorColonString(const char *colonstr,
+ virStorageSource *src)
+{
+ char *p, *e, *next;
+ g_autofree char *options = NULL;
+
+ /* optionally skip the "vitastor:" prefix if provided */
+ if (STRPREFIX(colonstr, "vitastor:"))
+ colonstr += strlen("vitastor:");
+
+ options = g_strdup(colonstr);
+
+ p = options;
+ while (*p) {
+ /* find : delimiter or end of string */
+ for (e = p; *e && *e != ':'; ++e) {
+ if (*e == '\\') {
+ e++;
+ if (*e == '\0')
+ break;
+ }
+ }
+ if (*e == '\0') {
+ next = e; /* last kv pair */
+ } else {
+ next = e + 1;
+ *e = '\0';
+ }
+
+ if (STRPREFIX(p, "image=")) {
+ src->path = g_strdup(p + strlen("image="));
+ } else if (STRPREFIX(p, "etcd-prefix=")) {
+ src->query = g_strdup(p + strlen("etcd-prefix="));
+ } else if (STRPREFIX(p, "config-path=")) {
+ src->configFile = g_strdup(p + strlen("config-path="));
+ } else if (STRPREFIX(p, "etcd-host=")) {
+ char *h, *sep;
+
+ h = p + strlen("etcd-host=");
+ while (h < e) {
+ for (sep = h; sep < e; ++sep) {
+ if (*sep == '\\' && (sep[1] == ',' ||
+ sep[1] == ';' ||
+ sep[1] == ' ')) {
+ *sep = '\0';
+ sep += 2;
+ break;
+ }
+ }
+
+ if (virStorageSourceRBDAddHost(src, h) < 0)
+ return -1;
+
+ h = sep;
+ }
+ }
+
+ p = next;
+ }
+
+ if (!src->path) {
+ return -1;
+ }
+
+ return 0;
+}
+
+
static int
virStorageSourceParseNBDColonString(const char *nbdstr,
virStorageSource *src)
@@ -399,6 +468,11 @@ virStorageSourceParseBackingColon(virStorageSource *src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (virStorageSourceParseVitastorColonString(path, src) < 0)
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -975,6 +1049,54 @@ virStorageSourceParseBackingJSONRBD(virStorageSource *src,
return 0;
}
+static int
+virStorageSourceParseBackingJSONVitastor(virStorageSource *src,
+ virJSONValue *json,
+ const char *jsonstr G_GNUC_UNUSED,
+ int opaque G_GNUC_UNUSED)
+{
+ const char *filename;
+ const char *image = virJSONValueObjectGetString(json, "image");
+ const char *conf = virJSONValueObjectGetString(json, "config-path");
+ const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd-prefix");
+ virJSONValue *servers = virJSONValueObjectGetArray(json, "server");
+ size_t nservers;
+ size_t i;
+
+ src->type = VIR_STORAGE_TYPE_NETWORK;
+ src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
+
+ /* legacy syntax passed via 'filename' option */
+ if ((filename = virJSONValueObjectGetString(json, "filename")))
+ return virStorageSourceParseVitastorColonString(filename, src);
+
+ if (!image) {
+ virReportError(VIR_ERR_INVALID_ARG, "%s",
+ _("missing image name in Vitastor backing volume "
+ "JSON specification"));
+ return -1;
+ }
+
+ src->path = g_strdup(image);
+ src->configFile = g_strdup(conf);
+ src->query = g_strdup(etcd_prefix);
+
+ if (servers) {
+ nservers = virJSONValueArraySize(servers);
+
+ src->hosts = g_new0(virStorageNetHostDef, nservers);
+ src->nhosts = nservers;
+
+ for (i = 0; i < nservers; i++) {
+ if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
+ virJSONValueArrayGet(servers, i)) < 0)
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
static int
virStorageSourceParseBackingJSONRaw(virStorageSource *src,
virJSONValue *json,
@@ -1152,6 +1274,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
{"sheepdog", false, virStorageSourceParseBackingJSONSheepdog, 0},
{"ssh", false, virStorageSourceParseBackingJSONSSH, 0},
{"rbd", false, virStorageSourceParseBackingJSONRBD, 0},
+ {"vitastor", false, virStorageSourceParseBackingJSONVitastor, 0},
{"raw", true, virStorageSourceParseBackingJSONRaw, 0},
{"nfs", false, virStorageSourceParseBackingJSONNFS, 0},
{"vxhs", false, virStorageSourceParseBackingJSONVxHS, 0},
diff --git a/src/test/test_driver.c b/src/test/test_driver.c
index e87d7cfd44..ccc05d7aae 100644
--- a/src/test/test_driver.c
+++ b/src/test/test_driver.c
@@ -7335,6 +7335,7 @@ testStorageVolumeTypeForPool(int pooltype)
case VIR_STORAGE_POOL_ISCSI_DIRECT:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
return VIR_STORAGE_VOL_NETWORK;
case VIR_STORAGE_POOL_LOGICAL:
case VIR_STORAGE_POOL_DISK:
diff --git a/tests/storagepoolcapsschemadata/poolcaps-fs.xml b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
index eee75af746..8bd0a57bdd 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-fs.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='no'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolcapsschemadata/poolcaps-full.xml b/tests/storagepoolcapsschemadata/poolcaps-full.xml
index 805950a937..852df0de16 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-full.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-full.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='yes'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolxml2argvtest.c b/tests/storagepoolxml2argvtest.c
index e8e40d695e..db55fe5f3a 100644
--- a/tests/storagepoolxml2argvtest.c
+++ b/tests/storagepoolxml2argvtest.c
@@ -65,6 +65,7 @@ testCompareXMLToArgvFiles(bool shouldFail,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_VSTORAGE:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_LAST:
default:
VIR_TEST_DEBUG("pool type '%s' has no xml2argv test", defTypeStr);
diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
index 36f00cf643..5f5bd3464e 100644
--- a/tools/virsh-pool.c
+++ b/tools/virsh-pool.c
@@ -1223,6 +1223,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd G_GNUC_UNUSED)
case VIR_STORAGE_POOL_VSTORAGE:
flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
+ flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
+ break;
case VIR_STORAGE_POOL_LAST:
break;
}
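With the flag added to cmdPoolList above, Vitastor pools can be filtered like any other backend, e.g. (on a libvirt build carrying this patch):

  virsh pool-list --all --type vitastor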

pull_request_template.yml Normal file

@@ -0,0 +1,28 @@
name: Pull Request
about: Submit a pull request
body:
- type: textarea
id: description
attributes:
label: Description
description: Describe your pull request
placeholder: ""
value: ""
validations:
required: true
- type: input
id: author
attributes:
label: Contributor Name
description: Contributor Name or Company Details if the Contributor is a company
placeholder: ""
validations:
required: false
- type: checkboxes
id: terms
attributes:
label: CLA
description: By submitting this pull request, I accept [Vitastor CLA](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md)
options:
- label: "I accept Vitastor CLA agreement: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md"
required: true


@@ -24,4 +24,4 @@ rm fio
mv fio-copy fio
FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
tar --transform 's#^#vitastor-1.4.0/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-1.4.0$(rpm --eval '%dist').tar.gz *
tar --transform 's#^#vitastor-1.5.0/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-1.5.0$(rpm --eval '%dist').tar.gz *


@@ -36,7 +36,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-1.4.0.el7.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-1.5.0.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \


@@ -1,11 +1,11 @@
Name: vitastor
Version: 1.4.0
Version: 1.5.0
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-1.4.0.el7.tar.gz
Source0: vitastor-1.5.0.el7.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
@@ -149,9 +149,12 @@ mkdir -p /etc/vitastor
%_bindir/vitastor-nfs
%_bindir/vitastor-cli
%_bindir/vitastor-rm
%_bindir/vitastor-kv
%_bindir/vitastor-kv-stress
%_bindir/vita
%_libdir/libvitastor_blk.so*
%_libdir/libvitastor_client.so*
%_libdir/libvitastor_kv.so*
%files -n vitastor-client-devel


@@ -35,7 +35,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-1.4.0.el8.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-1.5.0.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \


@@ -1,11 +1,11 @@
Name: vitastor
Version: 1.4.0
Version: 1.5.0
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-1.4.0.el8.tar.gz
Source0: vitastor-1.5.0.el8.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
@@ -146,9 +146,12 @@ mkdir -p /etc/vitastor
%_bindir/vitastor-nfs
%_bindir/vitastor-cli
%_bindir/vitastor-rm
%_bindir/vitastor-kv
%_bindir/vitastor-kv-stress
%_bindir/vita
%_libdir/libvitastor_blk.so*
%_libdir/libvitastor_client.so*
%_libdir/libvitastor_kv.so*
%files -n vitastor-client-devel


@@ -18,7 +18,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-1.4.0.el9.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-1.5.0.el9.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el9.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \


@@ -1,11 +1,11 @@
Name: vitastor
Version: 1.4.0
Version: 1.5.0
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-1.4.0.el9.tar.gz
Source0: vitastor-1.5.0.el9.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
@@ -139,9 +139,12 @@ mkdir -p /etc/vitastor
%_bindir/vitastor-nfs
%_bindir/vitastor-cli
%_bindir/vitastor-rm
%_bindir/vitastor-kv
%_bindir/vitastor-kv-stress
%_bindir/vita
%_libdir/libvitastor_blk.so*
%_libdir/libvitastor_client.so*
%_libdir/libvitastor_kv.so*
%files -n vitastor-client-devel


@@ -16,8 +16,8 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
endif()
add_definitions(-DVERSION="1.4.0")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -fno-omit-frame-pointer -I ${CMAKE_SOURCE_DIR}/src)
add_definitions(-DVERSION="1.5.0")
add_definitions(-D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -fno-omit-frame-pointer -I ${CMAKE_SOURCE_DIR}/src)
add_link_options(-fno-omit-frame-pointer)
if (${WITH_ASAN})
add_definitions(-fsanitize=address)
@@ -145,7 +145,6 @@ add_library(vitastor_client SHARED
cli_status.cpp
cli_describe.cpp
cli_fix.cpp
cli_df.cpp
cli_ls.cpp
cli_create.cpp
cli_modify.cpp
@@ -154,6 +153,11 @@ add_library(vitastor_client SHARED
cli_rm_data.cpp
cli_rm.cpp
cli_rm_osd.cpp
cli_pool_cfg.cpp
cli_pool_create.cpp
cli_pool_ls.cpp
cli_pool_modify.cpp
cli_pool_rm.cpp
)
set_target_properties(vitastor_client PROPERTIES PUBLIC_HEADER "vitastor_c.h")
target_link_libraries(vitastor_client
@@ -181,10 +185,48 @@ target_link_libraries(vitastor-nbd
vitastor_client
)
# libvitastor_kv.so
add_library(vitastor_kv SHARED
kv_db.cpp
kv_db.h
)
target_link_libraries(vitastor_kv
vitastor_client
)
set_target_properties(vitastor_kv PROPERTIES VERSION ${VERSION} SOVERSION 0)
# vitastor-kv
add_executable(vitastor-kv
kv_cli.cpp
)
target_link_libraries(vitastor-kv
vitastor_kv
)
add_executable(vitastor-kv-stress
kv_stress.cpp
)
target_link_libraries(vitastor-kv-stress
vitastor_kv
)
# vitastor-nfs
add_executable(vitastor-nfs
nfs_proxy.cpp
nfs_conn.cpp
nfs_block.cpp
nfs_kv.cpp
nfs_kv_create.cpp
nfs_kv_getattr.cpp
nfs_kv_link.cpp
nfs_kv_lookup.cpp
nfs_kv_read.cpp
nfs_kv_readdir.cpp
nfs_kv_remove.cpp
nfs_kv_rename.cpp
nfs_kv_setattr.cpp
nfs_kv_write.cpp
nfs_fsstat.cpp
nfs_mount.cpp
nfs_portmap.cpp
sha256.c
nfs/xdr_impl.cpp
@@ -194,6 +236,7 @@ add_executable(vitastor-nfs
)
target_link_libraries(vitastor-nfs
vitastor_client
vitastor_kv
)
# vitastor-cli
@@ -318,12 +361,12 @@ add_test(NAME test_cluster_client COMMAND test_cluster_client)
### Install
install(TARGETS vitastor-osd vitastor-disk vitastor-nbd vitastor-nfs vitastor-cli RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
install(TARGETS vitastor-osd vitastor-disk vitastor-nbd vitastor-nfs vitastor-cli vitastor-kv vitastor-kv-stress RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
install_symlink(vitastor-disk ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_BINDIR}/vitastor-dump-journal)
install_symlink(vitastor-cli ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_BINDIR}/vitastor-rm)
install_symlink(vitastor-cli ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_BINDIR}/vita)
install(
TARGETS vitastor_blk vitastor_client
TARGETS vitastor_blk vitastor_client vitastor_kv
LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
PUBLIC_HEADER DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
)


@@ -82,3 +82,8 @@ uint32_t blockstore_t::get_bitmap_granularity()
{
return impl->get_bitmap_granularity();
}
void blockstore_t::set_no_inode_stats(const std::vector<uint64_t> & pool_ids)
{
impl->set_no_inode_stats(pool_ids);
}


@@ -216,6 +216,9 @@ public:
// Get per-inode space usage statistics
std::map<uint64_t, uint64_t> & get_inode_space_stats();
// Set per-pool no_inode_stats
void set_no_inode_stats(const std::vector<uint64_t> & pool_ids);
// Print diagnostics to stdout
void dump_diagnostics();


@@ -108,6 +108,10 @@ void blockstore_disk_t::parse_config(std::map<std::string, std::string> & config
{
throw std::runtime_error("journal_block_size must be a multiple of "+std::to_string(DIRECT_IO_ALIGNMENT));
}
else if (journal_block_size > MAX_DATA_BLOCK_SIZE)
{
throw std::runtime_error("journal_block_size must not exceed "+std::to_string(MAX_DATA_BLOCK_SIZE));
}
if (!meta_block_size)
{
meta_block_size = 4096;
@@ -116,6 +120,10 @@ void blockstore_disk_t::parse_config(std::map<std::string, std::string> & config
{
throw std::runtime_error("meta_block_size must be a multiple of "+std::to_string(DIRECT_IO_ALIGNMENT));
}
else if (meta_block_size > MAX_DATA_BLOCK_SIZE)
{
throw std::runtime_error("meta_block_size must not exceed "+std::to_string(MAX_DATA_BLOCK_SIZE));
}
if (data_offset % disk_alignment)
{
throw std::runtime_error("data_offset must be a multiple of disk_alignment = "+std::to_string(disk_alignment));


@@ -19,7 +19,6 @@ journal_flusher_t::journal_flusher_t(blockstore_impl_t *bs)
syncing_flushers = 0;
// FIXME: allow to configure flusher_start_threshold and journal_trim_interval
flusher_start_threshold = bs->dsk.journal_block_size / sizeof(journal_entry_stable);
journal_trim_interval = 512;
journal_trim_counter = bs->journal.flush_journal ? 1 : 0;
trim_wanted = bs->journal.flush_journal ? 1 : 0;
journal_superblock = bs->journal.inmemory ? bs->journal.buffer : memalign_or_die(MEM_ALIGNMENT, bs->dsk.journal_block_size);
@@ -94,7 +93,7 @@ void journal_flusher_t::loop()
void journal_flusher_t::enqueue_flush(obj_ver_id ov)
{
#ifdef BLOCKSTORE_DEBUG
printf("enqueue_flush %lx:%lx v%lu\n", ov.oid.inode, ov.oid.stripe, ov.version);
printf("enqueue_flush %jx:%jx v%ju\n", ov.oid.inode, ov.oid.stripe, ov.version);
#endif
auto it = flush_versions.find(ov.oid);
if (it != flush_versions.end())
@@ -117,7 +116,7 @@ void journal_flusher_t::enqueue_flush(obj_ver_id ov)
void journal_flusher_t::unshift_flush(obj_ver_id ov, bool force)
{
#ifdef BLOCKSTORE_DEBUG
printf("unshift_flush %lx:%lx v%lu\n", ov.oid.inode, ov.oid.stripe, ov.version);
printf("unshift_flush %jx:%jx v%ju\n", ov.oid.inode, ov.oid.stripe, ov.version);
#endif
auto it = flush_versions.find(ov.oid);
if (it != flush_versions.end())
@@ -143,7 +142,7 @@ void journal_flusher_t::unshift_flush(obj_ver_id ov, bool force)
void journal_flusher_t::remove_flush(object_id oid)
{
#ifdef BLOCKSTORE_DEBUG
printf("undo_flush %lx:%lx\n", oid.inode, oid.stripe);
printf("undo_flush %jx:%jx\n", oid.inode, oid.stripe);
#endif
auto v_it = flush_versions.find(oid);
if (v_it != flush_versions.end())
@@ -184,8 +183,7 @@ void journal_flusher_t::mark_trim_possible()
if (trim_wanted > 0)
{
dequeuing = true;
if (!journal_trim_counter)
journal_trim_counter = journal_trim_interval;
journal_trim_counter = 0;
bs->ringloop->wakeup();
}
}
@@ -235,7 +233,7 @@ void journal_flusher_t::dump_diagnostics()
break;
}
printf(
"Flusher: queued=%ld first=%s%lx:%lx trim_wanted=%d dequeuing=%d trimming=%d cur=%d target=%d active=%d syncing=%d\n",
"Flusher: queued=%zd first=%s%jx:%jx trim_wanted=%d dequeuing=%d trimming=%d cur=%d target=%d active=%d syncing=%d\n",
flush_queue.size(), unflushable_type, unflushable.oid.inode, unflushable.oid.stripe,
trim_wanted, dequeuing, trimming, cur_flusher_count, target_flusher_count,
active_flushers, syncing_flushers
@@ -268,7 +266,7 @@ bool journal_flusher_t::try_find_other(std::map<obj_ver_id, dirty_entry>::iterat
{
int search_left = flush_queue.size() - 1;
#ifdef BLOCKSTORE_DEBUG
printf("Flusher overran writers (%lx:%lx v%lu, dirty_start=%08lx) - searching for older flushes (%d left)\n",
printf("Flusher overran writers (%jx:%jx v%ju, dirty_start=%08jx) - searching for older flushes (%d left)\n",
cur.oid.inode, cur.oid.stripe, cur.version, bs->journal.dirty_start, search_left);
#endif
while (search_left > 0)
@@ -285,7 +283,7 @@ bool journal_flusher_t::try_find_other(std::map<obj_ver_id, dirty_entry>::iterat
dirty_end->second.journal_sector < bs->journal.used_start))
{
#ifdef BLOCKSTORE_DEBUG
printf("Write %lx:%lx v%lu is too new: offset=%08lx\n", cur.oid.inode, cur.oid.stripe, cur.version, dirty_end->second.journal_sector);
printf("Write %jx:%jx v%ju is too new: offset=%08jx\n", cur.oid.inode, cur.oid.stripe, cur.version, dirty_end->second.journal_sector);
#endif
enqueue_flush(cur);
}
@@ -366,9 +364,10 @@ resume_0:
!flusher->flush_queue.size() || !flusher->dequeuing)
{
stop_flusher:
if (flusher->trim_wanted > 0 && flusher->journal_trim_counter > 0)
if (flusher->trim_wanted > 0 && cur.oid.inode != 0)
{
// Attempt forced trim
cur.oid = {};
flusher->active_flushers++;
goto trim_journal;
}
@@ -387,7 +386,7 @@ stop_flusher:
if (repeat_it != flusher->sync_to_repeat.end())
{
#ifdef BLOCKSTORE_DEBUG
printf("Postpone %lx:%lx v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
printf("Postpone %jx:%jx v%ju\n", cur.oid.inode, cur.oid.stripe, cur.version);
#endif
// We don't flush different parts of history of the same object in parallel
// So we check if someone is already flushing this object
@@ -416,12 +415,13 @@ stop_flusher:
flusher->sync_to_repeat.erase(cur.oid);
if (!flusher->try_find_other(dirty_end, cur))
{
cur.oid = {};
goto stop_flusher;
}
}
}
#ifdef BLOCKSTORE_DEBUG
printf("Flushing %lx:%lx v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
printf("Flushing %jx:%jx v%ju\n", cur.oid.inode, cur.oid.stripe, cur.version);
#endif
flusher->active_flushers++;
// Find it in clean_db
@@ -448,7 +448,7 @@ stop_flusher:
// Object not allocated. This is a bug.
char err[1024];
snprintf(
err, 1024, "BUG: Object %lx:%lx v%lu that we are trying to flush is not allocated on the data device",
err, 1024, "BUG: Object %jx:%jx v%ju that we are trying to flush is not allocated on the data device",
cur.oid.inode, cur.oid.stripe, cur.version
);
throw std::runtime_error(err);
@@ -538,7 +538,7 @@ resume_2:
clean_disk_entry *old_entry = (clean_disk_entry*)((uint8_t*)meta_old.buf + meta_old.pos*bs->dsk.clean_entry_size);
if (old_entry->oid.inode != 0 && old_entry->oid != cur.oid)
{
printf("Fatal error (metadata corruption or bug): tried to wipe metadata entry %lu (%lx:%lx v%lu) as old location of %lx:%lx\n",
printf("Fatal error (metadata corruption or bug): tried to wipe metadata entry %ju (%jx:%jx v%ju) as old location of %jx:%jx\n",
old_clean_loc >> bs->dsk.block_order, old_entry->oid.inode, old_entry->oid.stripe,
old_entry->version, cur.oid.inode, cur.oid.stripe);
exit(1);
@@ -571,7 +571,7 @@ resume_2:
// Erase dirty_db entries
bs->erase_dirty(dirty_start, std::next(dirty_end), clean_loc);
#ifdef BLOCKSTORE_DEBUG
printf("Flushed %lx:%lx v%lu (%d copies, wr:%d, del:%d), %ld left\n", cur.oid.inode, cur.oid.stripe, cur.version,
printf("Flushed %jx:%jx v%ju (%d copies, wr:%d, del:%d), %jd left\n", cur.oid.inode, cur.oid.stripe, cur.version,
copy_count, has_writes, has_delete, flusher->flush_queue.size());
#endif
release_oid:
@@ -584,7 +584,8 @@ resume_2:
flusher->sync_to_repeat.erase(repeat_it);
trim_journal:
// Clear unused part of the journal every <journal_trim_interval> flushes
if (!((++flusher->journal_trim_counter) % flusher->journal_trim_interval) || flusher->trim_wanted > 0)
if (bs->journal_trim_interval && !((++flusher->journal_trim_counter) % bs->journal_trim_interval) ||
flusher->trim_wanted > 0)
{
resume_26:
resume_27:
@@ -609,8 +610,8 @@ void journal_flusher_co::update_metadata_entry()
{
printf(
has_delete
? "Fatal error (metadata corruption or bug): tried to delete metadata entry %lu (%lx:%lx v%lu) while deleting %lx:%lx v%lu\n"
: "Fatal error (metadata corruption or bug): tried to overwrite non-zero metadata entry %lu (%lx:%lx v%lu) with %lx:%lx v%lu\n",
? "Fatal error (metadata corruption or bug): tried to delete metadata entry %ju (%jx:%jx v%ju) while deleting %jx:%jx v%ju\n"
: "Fatal error (metadata corruption or bug): tried to overwrite non-zero metadata entry %ju (%jx:%jx v%ju) with %jx:%jx v%ju\n",
clean_loc >> bs->dsk.block_order, new_entry->oid.inode, new_entry->oid.stripe,
new_entry->version, cur.oid.inode, cur.oid.stripe, cur.version
);
@@ -710,7 +711,7 @@ bool journal_flusher_co::write_meta_block(flusher_meta_write_t & meta_block, int
if (wait_state == wait_base)
goto resume_0;
await_sqe(0);
data->iov = (struct iovec){ meta_block.buf, bs->dsk.meta_block_size };
data->iov = (struct iovec){ meta_block.buf, (size_t)bs->dsk.meta_block_size };
data->callback = simple_callback_w;
my_uring_prep_writev(
sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + bs->dsk.meta_block_size + meta_block.sector
@@ -760,7 +761,7 @@ bool journal_flusher_co::clear_incomplete_csum_block_bits(int wait_base)
{
// If we encounter bad checksums during flush, we still update the bad block,
// but intentionally mangle checksums to avoid hiding the corruption.
iovec iov = { .iov_base = v[i].buf, .iov_len = v[i].len };
iovec iov = { .iov_base = v[i].buf, .iov_len = (size_t)v[i].len };
if (!(v[i].copy_flags & COPY_BUF_JOURNAL))
{
assert(!(v[i].offset % bs->dsk.csum_block_size));
@@ -768,7 +769,7 @@ bool journal_flusher_co::clear_incomplete_csum_block_bits(int wait_base)
bs->verify_padded_checksums(new_clean_bitmap, new_clean_bitmap + 2*bs->dsk.clean_entry_bitmap_size,
v[i].offset, &iov, 1, [&](uint32_t bad_block, uint32_t calc_csum, uint32_t stored_csum)
{
printf("Checksum mismatch in object %lx:%lx v%lu in data area at offset 0x%lx+0x%x: got %08x, expected %08x\n",
printf("Checksum mismatch in object %jx:%jx v%ju in data area at offset 0x%jx+0x%x: got %08x, expected %08x\n",
cur.oid.inode, cur.oid.stripe, old_clean_ver, old_clean_loc, bad_block, calc_csum, stored_csum);
for (uint32_t j = 0; j < bs->dsk.csum_block_size; j += bs->dsk.bitmap_granularity)
{
@@ -781,7 +782,7 @@ bool journal_flusher_co::clear_incomplete_csum_block_bits(int wait_base)
{
bs->verify_journal_checksums(v[i].csum_buf, v[i].offset, &iov, 1, [&](uint32_t bad_block, uint32_t calc_csum, uint32_t stored_csum)
{
printf("Checksum mismatch in object %lx:%lx v%lu in journal at offset 0x%lx+0x%x (block offset 0x%lx): got %08x, expected %08x\n",
printf("Checksum mismatch in object %jx:%jx v%ju in journal at offset 0x%jx+0x%x (block offset 0x%jx): got %08x, expected %08x\n",
cur.oid.inode, cur.oid.stripe, old_clean_ver,
v[i].disk_offset, bad_block, v[i].offset, calc_csum, stored_csum);
bad_block += (v[i].offset/bs->dsk.csum_block_size) * bs->dsk.csum_block_size;
@@ -805,7 +806,7 @@ bool journal_flusher_co::clear_incomplete_csum_block_bits(int wait_base)
if (new_entry->oid != cur.oid)
{
printf(
"Fatal error (metadata corruption or bug): tried to make holes in %lu (%lx:%lx v%lu) with %lx:%lx v%lu\n",
"Fatal error (metadata corruption or bug): tried to make holes in %ju (%jx:%jx v%ju) with %jx:%jx v%ju\n",
clean_loc >> bs->dsk.block_order, new_entry->oid.inode, new_entry->oid.stripe,
new_entry->version, cur.oid.inode, cur.oid.stripe, cur.version
);
@@ -925,7 +926,7 @@ void journal_flusher_co::scan_dirty()
{
char err[1024];
snprintf(
err, 1024, "BUG: Unexpected dirty_entry %lx:%lx v%lu unstable state during flush: 0x%x",
err, 1024, "BUG: Unexpected dirty_entry %jx:%jx v%ju unstable state during flush: 0x%x",
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version, dirty_it->second.state
);
throw std::runtime_error(err);
@@ -1021,7 +1022,7 @@ void journal_flusher_co::scan_dirty()
// May happen if the metadata entry is corrupt, but journal isn't
// FIXME: Report corrupted object to the upper layer (OSD)
printf(
"Warning: object %lx:%lx has overwrites, but doesn't have a clean version."
"Warning: object %jx:%jx has overwrites, but doesn't have a clean version."
" Metadata is likely corrupted. Dropping object from the DB.\n",
cur.oid.inode, cur.oid.stripe
);
@@ -1056,7 +1057,7 @@ void journal_flusher_co::scan_dirty()
flusher->enqueue_flush(cur);
cur.version = dirty_end->first.version;
#ifdef BLOCKSTORE_DEBUG
printf("Partial checksum block overwrites found - rewinding flush back to %lx:%lx v%lu\n", cur.oid.inode, cur.oid.stripe, cur.version);
printf("Partial checksum block overwrites found - rewinding flush back to %jx:%jx v%ju\n", cur.oid.inode, cur.oid.stripe, cur.version);
#endif
v.clear();
copy_count = 0;
@@ -1084,7 +1085,7 @@ bool journal_flusher_co::read_dirty(int wait_base)
auto & vi = v[v.size()-i];
assert(vi.len != 0);
vi.buf = memalign_or_die(MEM_ALIGNMENT, vi.len);
data->iov = (struct iovec){ vi.buf, vi.len };
data->iov = (struct iovec){ vi.buf, (size_t)vi.len };
data->callback = simple_callback_r;
my_uring_prep_readv(
sqe, bs->dsk.data_fd, &data->iov, 1, bs->dsk.data_offset + old_clean_loc + vi.offset
@@ -1208,7 +1209,7 @@ bool journal_flusher_co::modify_meta_read(uint64_t meta_loc, flusher_meta_write_
.usage_count = 1,
}).first;
await_sqe(0);
data->iov = (struct iovec){ wr.it->second.buf, bs->dsk.meta_block_size };
data->iov = (struct iovec){ wr.it->second.buf, (size_t)bs->dsk.meta_block_size };
data->callback = simple_callback_r;
wr.submitted = true;
my_uring_prep_readv(
@@ -1247,7 +1248,7 @@ void journal_flusher_co::free_data_blocks()
auto uo_it = bs->used_clean_objects.find(old_clean_loc);
bool used = uo_it != bs->used_clean_objects.end();
#ifdef BLOCKSTORE_DEBUG
printf("%s block %lu from %lx:%lx v%lu (new location is %lu)\n",
printf("%s block %ju from %jx:%jx v%ju (new location is %ju)\n",
used ? "Postpone free" : "Free",
old_clean_loc >> bs->dsk.block_order,
cur.oid.inode, cur.oid.stripe, cur.version,
@@ -1264,7 +1265,7 @@ void journal_flusher_co::free_data_blocks()
auto uo_it = bs->used_clean_objects.find(old_clean_loc);
bool used = uo_it != bs->used_clean_objects.end();
#ifdef BLOCKSTORE_DEBUG
printf("%s block %lu from %lx:%lx v%lu (delete)\n",
printf("%s block %ju from %jx:%jx v%ju (delete)\n",
used ? "Postpone free" : "Free",
old_clean_loc >> bs->dsk.block_order,
cur.oid.inode, cur.oid.stripe, cur.version);
@@ -1346,7 +1347,6 @@ bool journal_flusher_co::trim_journal(int wait_base)
else if (wait_state == wait_base+2) goto resume_2;
else if (wait_state == wait_base+3) goto resume_3;
else if (wait_state == wait_base+4) goto resume_4;
flusher->journal_trim_counter = 0;
new_trim_pos = bs->journal.get_trim_pos();
if (new_trim_pos != bs->journal.used_start)
{
@@ -1378,7 +1378,7 @@ bool journal_flusher_co::trim_journal(int wait_base)
.csum_block_size = bs->dsk.csum_block_size,
};
((journal_entry_start*)flusher->journal_superblock)->crc32 = je_crc32((journal_entry*)flusher->journal_superblock);
data->iov = (struct iovec){ flusher->journal_superblock, bs->dsk.journal_block_size };
data->iov = (struct iovec){ flusher->journal_superblock, (size_t)bs->dsk.journal_block_size };
data->callback = simple_callback_w;
my_uring_prep_writev(sqe, bs->dsk.journal_fd, &data->iov, 1, bs->journal.offset);
wait_count++;
@@ -1410,7 +1410,7 @@ bool journal_flusher_co::trim_journal(int wait_base)
}
bs->journal.used_start = new_trim_pos;
#ifdef BLOCKSTORE_DEBUG
printf("Journal trimmed to %08lx (next_free=%08lx dirty_start=%08lx)\n", bs->journal.used_start, bs->journal.next_free, bs->journal.dirty_start);
printf("Journal trimmed to %08jx (next_free=%08jx dirty_start=%08jx)\n", bs->journal.used_start, bs->journal.next_free, bs->journal.dirty_start);
#endif
if (bs->journal.flush_journal && !flusher->flush_queue.size())
{
@@ -1419,6 +1419,7 @@ bool journal_flusher_co::trim_journal(int wait_base)
exit(0);
}
}
flusher->journal_trim_counter = 0;
flusher->trimming = false;
}
return true;

View File

@@ -107,7 +107,7 @@ class journal_flusher_t
blockstore_impl_t *bs;
friend class journal_flusher_co;
int journal_trim_counter, journal_trim_interval;
int journal_trim_counter;
bool trimming;
void* journal_superblock;


@@ -163,20 +163,10 @@ void blockstore_impl_t::loop()
}
else if (op->opcode == BS_OP_SYNC)
{
// wait for all small writes to be submitted
// wait for all big writes to complete, submit data device fsync
// sync only completed writes?
// wait for the data device fsync to complete, then submit journal writes for big writes
// then submit an fsync operation
if (has_writes)
{
// Can't submit SYNC before previous writes
continue;
}
wr_st = continue_sync(op);
if (wr_st != 2)
{
has_writes = wr_st > 0 ? 1 : 2;
}
}
else if (op->opcode == BS_OP_STABLE)
{
@@ -205,6 +195,10 @@ void blockstore_impl_t::loop()
// ring is full, stop submission
break;
}
else if (PRIV(op)->wait_for == WAIT_JOURNAL)
{
PRIV(op)->wait_detail2 = (unstable_writes.size()+unstable_unsynced);
}
}
}
if (op_idx != new_idx)
@@ -275,7 +269,7 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
{
// stop submission if there's still no free space
#ifdef BLOCKSTORE_DEBUG
printf("Still waiting for %lu SQE(s)\n", PRIV(op)->wait_detail);
printf("Still waiting for %ju SQE(s)\n", PRIV(op)->wait_detail);
#endif
return;
}
@@ -283,11 +277,12 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
}
else if (PRIV(op)->wait_for == WAIT_JOURNAL)
{
if (journal.used_start == PRIV(op)->wait_detail)
if (journal.used_start == PRIV(op)->wait_detail &&
(unstable_writes.size()+unstable_unsynced) == PRIV(op)->wait_detail2)
{
// do not submit
#ifdef BLOCKSTORE_DEBUG
printf("Still waiting to flush journal offset %08lx\n", PRIV(op)->wait_detail);
printf("Still waiting to flush journal offset %08jx\n", PRIV(op)->wait_detail);
#endif
return;
}
@@ -558,13 +553,14 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
if (stable_count >= stable_alloc)
{
stable_alloc *= 2;
stable = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
if (!stable)
obj_ver_id* nst = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
if (!nst)
{
op->retval = -ENOMEM;
FINISH_OP(op);
return;
}
stable = nst;
}
stable[stable_count++] = {
.oid = clean_it->first,
@@ -642,8 +638,8 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
if (stable_count >= stable_alloc)
{
stable_alloc += 32768;
stable = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
if (!stable)
obj_ver_id *nst = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
if (!nst)
{
if (unstable)
free(unstable);
@@ -651,6 +647,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
FINISH_OP(op);
return;
}
stable = nst;
}
stable[stable_count++] = dirty_it->first;
}
@@ -666,8 +663,8 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
if (unstable_count >= unstable_alloc)
{
unstable_alloc += 32768;
unstable = (obj_ver_id*)realloc(unstable, sizeof(obj_ver_id) * unstable_alloc);
if (!unstable)
obj_ver_id *nst = (obj_ver_id*)realloc(unstable, sizeof(obj_ver_id) * unstable_alloc);
if (!nst)
{
if (stable)
free(stable);
@@ -675,6 +672,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
FINISH_OP(op);
return;
}
unstable = nst;
}
unstable[unstable_count++] = dirty_it->first;
}
@@ -694,8 +692,8 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
if (stable_count+unstable_count > stable_alloc)
{
stable_alloc = stable_count+unstable_count;
stable = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
if (!stable)
obj_ver_id *nst = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
if (!nst)
{
if (unstable)
free(unstable);
@@ -703,6 +701,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
FINISH_OP(op);
return;
}
stable = nst;
}
// Copy unstable entries
for (int i = 0; i < unstable_count; i++)
@@ -734,3 +733,86 @@ void blockstore_impl_t::disk_error_abort(const char *op, int retval, int expecte
fprintf(stderr, "Disk %s failed: result is %d, expected %d. Can't continue, sorry :-(\n", op, retval, expected);
exit(1);
}
void blockstore_impl_t::set_no_inode_stats(const std::vector<uint64_t> & pool_ids)
{
for (auto & np: no_inode_stats)
{
np.second = 2;
}
for (auto pool_id: pool_ids)
{
if (!no_inode_stats[pool_id])
recalc_inode_space_stats(pool_id, false);
no_inode_stats[pool_id] = 1;
}
for (auto np_it = no_inode_stats.begin(); np_it != no_inode_stats.end(); )
{
if (np_it->second == 2)
{
recalc_inode_space_stats(np_it->first, true);
no_inode_stats.erase(np_it++);
}
else
np_it++;
}
}
void blockstore_impl_t::recalc_inode_space_stats(uint64_t pool_id, bool per_inode)
{
auto sp_begin = inode_space_stats.lower_bound((pool_id << (64-POOL_ID_BITS)));
auto sp_end = inode_space_stats.lower_bound(((pool_id+1) << (64-POOL_ID_BITS)));
inode_space_stats.erase(sp_begin, sp_end);
auto sh_it = clean_db_shards.lower_bound((pool_id << (64-POOL_ID_BITS)));
while (sh_it != clean_db_shards.end() &&
(sh_it->first >> (64-POOL_ID_BITS)) == pool_id)
{
for (auto & pair: sh_it->second)
{
uint64_t space_id = per_inode ? pair.first.inode : (pool_id << (64-POOL_ID_BITS));
inode_space_stats[space_id] += dsk.data_block_size;
}
sh_it++;
}
object_id last_oid = {};
bool last_exists = false;
auto dirty_it = dirty_db.lower_bound((obj_ver_id){ .oid = { .inode = (pool_id << (64-POOL_ID_BITS)) } });
while (dirty_it != dirty_db.end() && (dirty_it->first.oid.inode >> (64-POOL_ID_BITS)) == pool_id)
{
if (IS_STABLE(dirty_it->second.state) && (IS_BIG_WRITE(dirty_it->second.state) || IS_DELETE(dirty_it->second.state)))
{
bool exists = false;
if (last_oid == dirty_it->first.oid)
{
exists = last_exists;
}
else
{
auto & clean_db = clean_db_shard(dirty_it->first.oid);
auto clean_it = clean_db.find(dirty_it->first.oid);
exists = clean_it != clean_db.end();
}
uint64_t space_id = per_inode ? dirty_it->first.oid.inode : (pool_id << (64-POOL_ID_BITS));
if (IS_BIG_WRITE(dirty_it->second.state))
{
if (!exists)
inode_space_stats[space_id] += dsk.data_block_size;
last_exists = true;
}
else
{
if (exists)
{
auto & sp = inode_space_stats[space_id];
if (sp > dsk.data_block_size)
sp -= dsk.data_block_size;
else
inode_space_stats.erase(space_id);
}
last_exists = false;
}
last_oid = dirty_it->first.oid;
}
dirty_it++;
}
}


@@ -55,6 +55,7 @@
#define IS_JOURNAL(st) (((st) & 0x0F) == BS_ST_SMALL_WRITE)
#define IS_BIG_WRITE(st) (((st) & 0x0F) == BS_ST_BIG_WRITE)
#define IS_DELETE(st) (((st) & 0x0F) == BS_ST_DELETE)
#define IS_INSTANT(st) (((st) & BS_ST_TYPE_MASK) == BS_ST_DELETE || ((st) & BS_ST_INSTANT))
#define BS_SUBMIT_CHECK_SQES(n) \
if (ringloop->sqes_left() < (n))\
@@ -201,7 +202,7 @@ struct blockstore_op_private_t
{
// Wait status
int wait_for;
uint64_t wait_detail;
uint64_t wait_detail, wait_detail2;
int pending_ops;
int op_state;
@@ -252,6 +253,7 @@ class blockstore_impl_t
bool inmemory_meta = false;
// Maximum and minimum flusher count
unsigned max_flusher_count, min_flusher_count;
unsigned journal_trim_interval;
// Maximum queue depth
unsigned max_write_iodepth = 128;
// Enable small (journaled) write throttling, useful for the SSD+HDD case
@@ -270,6 +272,7 @@ class blockstore_impl_t
std::map<pool_id_t, pool_shard_settings_t> clean_db_settings;
std::map<pool_pg_id_t, blockstore_clean_db_t> clean_db_shards;
std::map<uint64_t, int> no_inode_stats;
uint8_t *clean_bitmaps = NULL;
blockstore_dirty_db_t dirty_db;
std::vector<blockstore_op_t*> submit_queue;
@@ -316,6 +319,7 @@ class blockstore_impl_t
blockstore_clean_db_t& clean_db_shard(object_id oid);
void reshard_clean_db(pool_id_t pool_id, uint32_t pg_count, uint32_t pg_stripe_size);
void recalc_inode_space_stats(uint64_t pool_id, bool per_inode);
// Journaling
void prepare_journal_sector_write(int sector, blockstore_op_t *op);
@@ -377,7 +381,7 @@ class blockstore_impl_t
// Stabilize
int dequeue_stable(blockstore_op_t *op);
int continue_stable(blockstore_op_t *op);
void mark_stable(const obj_ver_id & ov, bool forget_dirty = false);
void mark_stable(obj_ver_id ov, bool forget_dirty = false);
void stabilize_object(object_id oid, uint64_t max_ver);
blockstore_op_t* selective_sync(blockstore_op_t *op);
int split_stab_op(blockstore_op_t *op, std::function<int(obj_ver_id v)> decider);
@@ -426,6 +430,9 @@ public:
// Space usage statistics
std::map<uint64_t, uint64_t> inode_space_stats;
// Set per-pool no_inode_stats
void set_no_inode_stats(const std::vector<uint64_t> & pool_ids);
// Print diagnostics to stdout
void dump_diagnostics();


@@ -32,7 +32,7 @@ void blockstore_init_meta::handle_event(ring_data_t *data, int buf_num)
if (data->res < 0)
{
throw std::runtime_error(
std::string("read metadata failed at offset ") + std::to_string(bufs[buf_num].offset) +
std::string("read metadata failed at offset ") + std::to_string(buf_num >= 0 ? bufs[buf_num].offset : last_read_offset) +
std::string(": ") + strerror(-data->res)
);
}
@@ -63,7 +63,8 @@ int blockstore_init_meta::loop()
throw std::runtime_error("Failed to allocate metadata read buffer");
// Read superblock
GET_SQE();
data->iov = { metadata_buffer, bs->dsk.meta_block_size };
last_read_offset = 0;
data->iov = { metadata_buffer, (size_t)bs->dsk.meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
my_uring_prep_readv(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset);
bs->ringloop->submit();
@@ -100,7 +101,8 @@ resume_1:
{
printf("Initializing metadata area\n");
GET_SQE();
data->iov = (struct iovec){ metadata_buffer, bs->dsk.meta_block_size };
last_read_offset = 0;
data->iov = (struct iovec){ metadata_buffer, (size_t)bs->dsk.meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
my_uring_prep_writev(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset);
bs->ringloop->submit();
@@ -153,7 +155,7 @@ resume_1:
else if (hdr->version > BLOCKSTORE_META_FORMAT_V2)
{
printf(
"Metadata format is too new for me (stored version is %lu, max supported %u).\n",
"Metadata format is too new for me (stored version is %ju, max supported %u).\n",
hdr->version, BLOCKSTORE_META_FORMAT_V2
);
exit(1);
@@ -167,7 +169,7 @@ resume_1:
printf(
"Configuration stored in metadata superblock"
" (meta_block_size=%u, data_block_size=%u, bitmap_granularity=%u, data_csum_type=%u, csum_block_size=%u)"
" differs from OSD configuration (%lu/%u/%lu, %u/%u).\n",
" differs from OSD configuration (%ju/%u/%ju, %u/%u).\n",
hdr->meta_block_size, hdr->data_block_size, hdr->bitmap_granularity,
hdr->data_csum_type, hdr->csum_block_size,
bs->dsk.meta_block_size, bs->dsk.data_block_size, bs->dsk.bitmap_granularity,
@@ -199,7 +201,8 @@ resume_2:
submitted++;
next_offset += bufs[i].size;
GET_SQE();
data->iov = { bufs[i].buf, bufs[i].size };
assert(bufs[i].size <= 0x7fffffff);
data->iov = { bufs[i].buf, (size_t)bufs[i].size };
data->callback = [this, i](ring_data_t *data) { handle_event(data, i); };
if (!zero_on_init)
my_uring_prep_readv(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + bufs[i].offset);
@@ -231,9 +234,11 @@ resume_2:
{
// write the modified buffer back
GET_SQE();
data->iov = { bufs[i].buf, bufs[i].size };
assert(bufs[i].size <= 0x7fffffff);
data->iov = { bufs[i].buf, (size_t)bufs[i].size };
data->callback = [this, i](ring_data_t *data) { handle_event(data, i); };
my_uring_prep_writev(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + bufs[i].offset);
bs->ringloop->submit();
bufs[i].state = INIT_META_WRITING;
submitted++;
}
@@ -257,9 +262,11 @@ resume_2:
next_offset = entries_to_zero[i]/entries_per_block;
for (j = i; j < entries_to_zero.size() && entries_to_zero[j]/entries_per_block == next_offset; j++) {}
GET_SQE();
data->iov = { metadata_buffer, bs->dsk.meta_block_size };
last_read_offset = (1+next_offset)*bs->dsk.meta_block_size;
data->iov = { metadata_buffer, (size_t)bs->dsk.meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
my_uring_prep_readv(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + (1+next_offset)*bs->dsk.meta_block_size);
bs->ringloop->submit();
submitted++;
resume_5:
if (submitted > 0)
@@ -273,9 +280,10 @@ resume_5:
memset((uint8_t*)metadata_buffer + pos*bs->dsk.clean_entry_size, 0, bs->dsk.clean_entry_size);
}
GET_SQE();
data->iov = { metadata_buffer, bs->dsk.meta_block_size };
data->iov = { metadata_buffer, (size_t)bs->dsk.meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
my_uring_prep_writev(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + (1+next_offset)*bs->dsk.meta_block_size);
bs->ringloop->submit();
submitted++;
resume_6:
if (submitted > 0)
@@ -287,7 +295,7 @@ resume_6:
entries_to_zero.clear();
}
// metadata read finished
printf("Metadata entries loaded: %lu, free blocks: %lu / %lu\n", entries_loaded, bs->data_alloc->get_free_count(), bs->dsk.block_count);
printf("Metadata entries loaded: %ju, free blocks: %ju / %ju\n", entries_loaded, bs->data_alloc->get_free_count(), bs->dsk.block_count);
if (!bs->inmemory_meta)
{
free(metadata_buffer);
@@ -297,6 +305,7 @@ resume_6:
{
GET_SQE();
my_uring_prep_fsync(sqe, bs->dsk.meta_fd, IORING_FSYNC_DATASYNC);
last_read_offset = 0;
data->iov = { 0 };
data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
submitted++;
@@ -328,7 +337,7 @@ bool blockstore_init_meta::handle_meta_block(uint8_t *buf, uint64_t entries_per_
uint32_t *entry_csum = (uint32_t*)((uint8_t*)entry + bs->dsk.clean_entry_size - 4);
if (*entry_csum != crc32c(0, entry, bs->dsk.clean_entry_size - 4))
{
printf("Metadata entry %lu is corrupt (checksum mismatch), skipping\n", done_cnt+i);
printf("Metadata entry %ju is corrupt (checksum mismatch), skipping\n", done_cnt+i);
continue;
}
}
@@ -366,7 +375,7 @@ bool blockstore_init_meta::handle_meta_block(uint8_t *buf, uint64_t entries_per_
entries_to_zero.push_back(clean_it->second.location >> bs->dsk.block_order);
}
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu from %lx:%lx v%lu (new location is %lu)\n",
printf("Free block %ju from %jx:%jx v%ju (new location is %ju)\n",
old_clean_loc,
clean_it->first.inode, clean_it->first.stripe, clean_it->second.version,
done_cnt+i);
@@ -380,7 +389,7 @@ bool blockstore_init_meta::handle_meta_block(uint8_t *buf, uint64_t entries_per_
}
entries_loaded++;
#ifdef BLOCKSTORE_DEBUG
printf("Allocate block (clean entry) %lu: %lx:%lx v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
printf("Allocate block (clean entry) %ju: %jx:%jx v%ju\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
#endif
bs->data_alloc->set(done_cnt+i, true);
clean_db[entry->oid] = (struct clean_entry){
@@ -394,7 +403,7 @@ bool blockstore_init_meta::handle_meta_block(uint8_t *buf, uint64_t entries_per_
updated = true;
memset(entry, 0, bs->dsk.clean_entry_size);
#ifdef BLOCKSTORE_DEBUG
printf("Old clean entry %lu: %lx:%lx v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
printf("Old clean entry %ju: %jx:%jx v%ju\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
#endif
}
}
@@ -466,7 +475,7 @@ int blockstore_init_journal::loop()
if (!sqe)
throw std::runtime_error("io_uring is full while trying to read journal");
data = ((ring_data_t*)sqe->user_data);
data->iov = { submitted_buf, bs->journal.block_size };
data->iov = { submitted_buf, (size_t)bs->journal.block_size };
data->callback = simple_callback;
my_uring_prep_readv(sqe, bs->dsk.journal_fd, &data->iov, 1, bs->journal.offset);
bs->ringloop->submit();
@@ -507,7 +516,7 @@ resume_1:
// FIXME: Randomize initial crc32. Track crc32 when trimming.
printf("Resetting journal\n");
GET_SQE();
data->iov = (struct iovec){ submitted_buf, 2*bs->journal.block_size };
data->iov = (struct iovec){ submitted_buf, (size_t)(2*bs->journal.block_size) };
data->callback = simple_callback;
my_uring_prep_writev(sqe, bs->dsk.journal_fd, &data->iov, 1, bs->journal.offset);
wait_count++;
@@ -557,7 +566,7 @@ resume_1:
(je_start->version != JOURNAL_VERSION_V2 || je_start->size != JE_START_V2_SIZE && je_start->size != JE_START_V1_SIZE))
{
fprintf(
stderr, "The code only supports journal versions 2 and 1, but it is %lu on disk."
stderr, "The code only supports journal versions 2 and 1, but it is %ju on disk."
" Please use vitastor-disk to rewrite the journal\n",
je_start->size == JE_START_V0_SIZE ? 0 : je_start->version
);
@@ -606,7 +615,7 @@ resume_1:
submitted_buf = (uint8_t*)bs->journal.buffer + journal_pos;
data->iov = {
submitted_buf,
end - journal_pos < JOURNAL_BUFFER_SIZE ? end - journal_pos : JOURNAL_BUFFER_SIZE,
(size_t)(end - journal_pos < JOURNAL_BUFFER_SIZE ? end - journal_pos : JOURNAL_BUFFER_SIZE),
};
data->callback = [this](ring_data_t *data1) { handle_event(data1); };
my_uring_prep_readv(sqe, bs->dsk.journal_fd, &data->iov, 1, bs->journal.offset + journal_pos);
@@ -622,7 +631,7 @@ resume_1:
if (init_write_buf && !bs->readonly)
{
GET_SQE();
data->iov = { init_write_buf, bs->journal.block_size };
data->iov = { init_write_buf, (size_t)bs->journal.block_size };
data->callback = simple_callback;
my_uring_prep_writev(sqe, bs->dsk.journal_fd, &data->iov, 1, bs->journal.offset + init_write_sector);
wait_count++;
@@ -691,7 +700,7 @@ resume_1:
IS_BIG_WRITE(dirty_it->second.state) &&
dirty_it->second.location == UINT64_MAX)
{
printf("Fatal error (bug): %lx:%lx v%lu big_write journal_entry was allocated over another object\n",
printf("Fatal error (bug): %jx:%jx v%ju big_write journal_entry was allocated over another object\n",
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
exit(1);
}
@@ -699,7 +708,7 @@ resume_1:
bs->flusher->mark_trim_possible();
bs->journal.dirty_start = bs->journal.next_free;
printf(
"Journal entries loaded: %lu, free journal space: %lu bytes (%08lx..%08lx is used), free blocks: %lu / %lu\n",
"Journal entries loaded: %ju, free journal space: %ju bytes (%08jx..%08jx is used), free blocks: %ju / %ju\n",
entries_loaded,
(bs->journal.next_free >= bs->journal.used_start
? bs->journal.len-bs->journal.block_size - (bs->journal.next_free-bs->journal.used_start)
@@ -754,7 +763,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
{
#ifdef BLOCKSTORE_DEBUG
printf(
"je_small_write%s oid=%lx:%lx ver=%lu offset=%u len=%u\n",
"je_small_write%s oid=%jx:%jx ver=%ju offset=%u len=%u\n",
je->type == JE_SMALL_WRITE_INSTANT ? "_instant" : "",
je->small_write.oid.inode, je->small_write.oid.stripe, je->small_write.version,
je->small_write.offset, je->small_write.len
@@ -776,7 +785,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
if (location != je->small_write.data_offset)
{
char err[1024];
snprintf(err, 1024, "BUG: calculated journal data offset (%08lx) != stored journal data offset (%08lx)", location, je->small_write.data_offset);
snprintf(err, 1024, "BUG: calculated journal data offset (%08jx) != stored journal data offset (%08jx)", location, je->small_write.data_offset);
throw std::runtime_error(err);
}
small_write_data.clear();
@@ -803,7 +812,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
covered += part_end - part_begin;
small_write_data.push_back((iovec){
.iov_base = (uint8_t*)done[i].buf + part_begin - done[i].pos,
.iov_len = part_end - part_begin,
.iov_len = (size_t)(part_end - part_begin),
});
}
}
@@ -826,7 +835,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
if (!data_csum_valid)
{
printf(
"Journal entry data is corrupt for small_write%s oid=%lx:%lx ver=%lu offset=%u len=%u - data crc32 %x != %x\n",
"Journal entry data is corrupt for small_write%s oid=%jx:%jx ver=%ju offset=%u len=%u - data crc32 %x != %x\n",
je->type == JE_SMALL_WRITE_INSTANT ? "_instant" : "",
je->small_write.oid.inode, je->small_write.oid.stripe, je->small_write.version,
je->small_write.offset, je->small_write.len,
@@ -845,7 +854,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
if (je->size != required_size)
{
printf(
"Journal entry data has invalid size for small_write%s oid=%lx:%lx ver=%lu offset=%u len=%u - should be %u bytes but is %u bytes\n",
"Journal entry data has invalid size for small_write%s oid=%jx:%jx ver=%ju offset=%u len=%u - should be %u bytes but is %u bytes\n",
je->type == JE_SMALL_WRITE_INSTANT ? "_instant" : "",
je->small_write.oid.inode, je->small_write.oid.stripe, je->small_write.version,
je->small_write.offset, je->small_write.len,
@@ -893,7 +902,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
if (block_crc32 != *block_csums)
{
printf(
"Journal entry data is corrupt for small_write%s oid=%lx:%lx ver=%lu offset=%u len=%u - block %u crc32 %x != %x\n",
"Journal entry data is corrupt for small_write%s oid=%jx:%jx ver=%ju offset=%u len=%u - block %u crc32 %x != %x\n",
je->type == JE_SMALL_WRITE_INSTANT ? "_instant" : "",
je->small_write.oid.inode, je->small_write.oid.stripe, je->small_write.version,
je->small_write.offset, je->small_write.len,
@@ -956,7 +965,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
bs->journal.used_sectors[proc_pos]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
"journal offset %08jx is used by %jx:%jx v%ju (%ju refs)\n",
proc_pos, ov.oid.inode, ov.oid.stripe, ov.version, bs->journal.used_sectors[proc_pos]
);
#endif
@@ -972,7 +981,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
{
#ifdef BLOCKSTORE_DEBUG
printf(
"je_big_write%s oid=%lx:%lx ver=%lu loc=%lu\n",
"je_big_write%s oid=%jx:%jx ver=%ju loc=%ju\n",
je->type == JE_BIG_WRITE_INSTANT ? "_instant" : "",
je->big_write.oid.inode, je->big_write.oid.stripe, je->big_write.version, je->big_write.location >> bs->dsk.block_order
);
@@ -1049,7 +1058,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
{
#ifdef BLOCKSTORE_DEBUG
printf(
"Allocate block (journal) %lu: %lx:%lx v%lu\n",
"Allocate block (journal) %ju: %jx:%jx v%ju\n",
je->big_write.location >> bs->dsk.block_order,
ov.oid.inode, ov.oid.stripe, ov.version
);
@@ -1059,7 +1068,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
bs->journal.used_sectors[proc_pos]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
"journal offset %08jx is used by %jx:%jx v%ju (%ju refs)\n",
proc_pos, ov.oid.inode, ov.oid.stripe, ov.version, bs->journal.used_sectors[proc_pos]
);
#endif
@@ -1074,7 +1083,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
else if (je->type == JE_STABLE)
{
#ifdef BLOCKSTORE_DEBUG
printf("je_stable oid=%lx:%lx ver=%lu\n", je->stable.oid.inode, je->stable.oid.stripe, je->stable.version);
printf("je_stable oid=%jx:%jx ver=%ju\n", je->stable.oid.inode, je->stable.oid.stripe, je->stable.version);
#endif
// oid, version
obj_ver_id ov = {
@@ -1086,7 +1095,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
else if (je->type == JE_ROLLBACK)
{
#ifdef BLOCKSTORE_DEBUG
printf("je_rollback oid=%lx:%lx ver=%lu\n", je->rollback.oid.inode, je->rollback.oid.stripe, je->rollback.version);
printf("je_rollback oid=%jx:%jx ver=%ju\n", je->rollback.oid.inode, je->rollback.oid.stripe, je->rollback.version);
#endif
// rollback dirty writes of <oid> up to <version>
obj_ver_id ov = {
@@ -1098,7 +1107,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
else if (je->type == JE_DELETE)
{
#ifdef BLOCKSTORE_DEBUG
printf("je_delete oid=%lx:%lx ver=%lu\n", je->del.oid.inode, je->del.oid.stripe, je->del.version);
printf("je_delete oid=%jx:%jx ver=%ju\n", je->del.oid.inode, je->del.oid.stripe, je->del.version);
#endif
bool dirty_exists = false;
auto dirty_it = bs->dirty_db.upper_bound((obj_ver_id){


@@ -23,6 +23,7 @@ class blockstore_init_meta
struct ring_data_t *data;
uint64_t md_offset = 0;
uint64_t next_offset = 0;
uint64_t last_read_offset = 0;
uint64_t entries_loaded = 0;
unsigned entries_per_block = 0;
int i = 0, j = 0;


@@ -90,8 +90,8 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
}
// In fact, it's even more rare than "ran out of journal space", so print a warning
printf(
"Ran out of journal sector buffers: %d/%lu buffers used (%d dirty), next buffer (%ld)"
" is %s and flushed %lu times. Consider increasing \'journal_sector_buffer_count\'\n",
"Ran out of journal sector buffers: %d/%ju buffers used (%d dirty), next buffer (%jd)"
" is %s and flushed %ju times. Consider increasing \'journal_sector_buffer_count\'\n",
used, bs->journal.sector_count, dirty, next_sector,
bs->journal.sector_info[next_sector].dirty ? "dirty" : "not dirty",
bs->journal.sector_info[next_sector].flush_count
@@ -103,7 +103,7 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
if (data_after > 0)
{
next_pos = next_pos + data_after;
if (next_pos > bs->journal.len)
if (next_pos >= bs->journal.len)
{
if (right_dir)
next_pos = bs->journal.block_size + data_after;
@@ -114,7 +114,7 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
{
// No space in the journal. Wait until used_start changes.
printf(
"Ran out of journal space (used_start=%08lx, next_free=%08lx, dirty_start=%08lx)\n",
"Ran out of journal space (used_start=%08jx, next_free=%08jx, dirty_start=%08jx)\n",
bs->journal.used_start, bs->journal.next_free, bs->journal.dirty_start
);
PRIV(op)->wait_for = WAIT_JOURNAL;
@@ -146,7 +146,7 @@ journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type,
journal.in_sector_pos = 0;
auto next_next_free = (journal.next_free+journal.block_size) < journal.len ? journal.next_free + journal.block_size : journal.block_size;
// double check that next_free doesn't cross used_start from the left
assert(journal.next_free >= journal.used_start || next_next_free < journal.used_start);
assert(journal.next_free >= journal.used_start && next_next_free >= journal.next_free || next_next_free < journal.used_start);
journal.next_free = next_next_free;
memset(journal.inmemory
? (uint8_t*)journal.buffer + journal.sector_info[journal.cur_sector].offset
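
Two journal fixes in this file are easy to miss: the next_pos > bs->journal.len check becomes >=, since a position equal to the journal length is already one past the end and must wrap; and the "doesn't cross used_start from the left" assertion gains a next_next_free >= journal.next_free term, so a next_free that wraps back to the start of the journal is also required to land below used_start. A minimal model of the strengthened check, assuming [used_start, next_free) is the still-referenced region of the ring, possibly wrapping over the end of the buffer:

// true if advancing next_free to next_next_free cannot overwrite the used region
static bool advance_is_safe(uint64_t used_start, uint64_t next_free, uint64_t next_next_free)
{
    return (next_free >= used_start && next_next_free >= next_free) // not wrapped: may only move forward
        || next_next_free < used_start;                             // wrapped: must stay below used_start
}
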
@@ -183,7 +183,7 @@ void blockstore_impl_t::prepare_journal_sector_write(int cur_sector, blockstore_
(journal.inmemory
? (uint8_t*)journal.buffer + journal.sector_info[cur_sector].offset
: (uint8_t*)journal.sector_buf + journal.block_size*cur_sector),
journal.block_size
(size_t)journal.block_size
};
data->callback = [this, flush_id = journal.submit_id](ring_data_t *data) { handle_journal_write(data, flush_id); };
my_uring_prep_writev(
@@ -263,7 +263,7 @@ uint64_t journal_t::get_trim_pos()
// next_free does not need updating during trim
#ifdef BLOCKSTORE_DEBUG
printf(
"Trimming journal (used_start=%08lx, next_free=%08lx, dirty_start=%08lx, new_start=%08lx, new_refcount=%ld)\n",
"Trimming journal (used_start=%08jx, next_free=%08jx, dirty_start=%08jx, new_start=%08jx, new_refcount=%jd)\n",
used_start, next_free, dirty_start,
journal_used_it->first, journal_used_it->second
);
@@ -276,7 +276,7 @@ uint64_t journal_t::get_trim_pos()
// Journal is cleared up to <journal_used_it>
#ifdef BLOCKSTORE_DEBUG
printf(
"Trimming journal (used_start=%08lx, next_free=%08lx, dirty_start=%08lx, new_start=%08lx, new_refcount=%ld)\n",
"Trimming journal (used_start=%08jx, next_free=%08jx, dirty_start=%08jx, new_start=%08jx, new_refcount=%jd)\n",
used_start, next_free, dirty_start,
journal_used_it->first, journal_used_it->second
);
@@ -296,7 +296,7 @@ void journal_t::dump_diagnostics()
journal_used_it = used_sectors.begin();
}
printf(
"Journal: used_start=%08lx next_free=%08lx dirty_start=%08lx trim_to=%08lx trim_to_refs=%ld\n",
"Journal: used_start=%08jx next_free=%08jx dirty_start=%08jx trim_to=%08jx trim_to_refs=%jd\n",
used_start, next_free, dirty_start,
journal_used_it == used_sectors.end() ? 0 : journal_used_it->first,
journal_used_it == used_sectors.end() ? 0 : journal_used_it->second


@@ -13,13 +13,14 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config, bool init)
max_flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10);
}
min_flusher_count = strtoull(config["min_flusher_count"].c_str(), NULL, 10);
journal_trim_interval = strtoull(config["journal_trim_interval"].c_str(), NULL, 10);
max_write_iodepth = strtoull(config["max_write_iodepth"].c_str(), NULL, 10);
throttle_small_writes = config["throttle_small_writes"] == "true" || config["throttle_small_writes"] == "1" || config["throttle_small_writes"] == "yes";
throttle_target_iops = strtoull(config["throttle_target_iops"].c_str(), NULL, 10);
throttle_target_mbs = strtoull(config["throttle_target_mbs"].c_str(), NULL, 10);
throttle_target_parallelism = strtoull(config["throttle_target_parallelism"].c_str(), NULL, 10);
throttle_threshold_us = strtoull(config["throttle_threshold_us"].c_str(), NULL, 10);
if (config.find("autosync_writes") != config.end())
if (config["autosync_writes"] != "")
{
autosync_writes = strtoull(config["autosync_writes"].c_str(), NULL, 10);
}
@@ -31,6 +32,10 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config, bool init)
{
min_flusher_count = 1;
}
if (!journal_trim_interval)
{
journal_trim_interval = 512;
}
if (!max_write_iodepth)
{
max_write_iodepth = 128;
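
The autosync_writes check switches from testing key presence to testing for a non-empty value, and the new journal_trim_interval option is parsed the same way with a default of 512. One plausible reason for the presence-to-value change (an assumption, not stated in the patch): the config object is a string map read through operator[], and operator[] inserts an empty entry for every key it is asked about, so presence checks degrade after the first read while value checks stay meaningful. A tiny illustration:

#include <map>
#include <string>
#include <cassert>

int main()
{
    std::map<std::string, std::string> config;
    (void)config["flusher_count"];                        // reading through [] inserts an empty entry
    assert(config.find("flusher_count") != config.end()); // presence check is now true although nothing was set
    assert(config["autosync_writes"] == "");              // value check still reads as "not set"
    return 0;
}
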


@@ -25,7 +25,7 @@ int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_
return 1;
}
BS_SUBMIT_GET_SQE(sqe, data);
data->iov = (struct iovec){ buf, len };
data->iov = (struct iovec){ buf, (size_t)len };
PRIV(op)->pending_ops++;
my_uring_prep_readv(
sqe,
@@ -505,7 +505,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
for (auto & rv: PRIV(read_op)->read_vec)
{
if (rv.journal_sector)
journal.used_sectors[rv.journal_sector-1]++;
journal.used_sectors.at(rv.journal_sector-1)++;
}
}
read_op->retval = 0;
@@ -700,7 +700,7 @@ uint8_t* blockstore_impl_t::read_clean_meta_block(blockstore_op_t *op, uint64_t
.buf = buf,
});
BS_SUBMIT_GET_SQE(sqe, data);
data->iov = (struct iovec){ buf, dsk.meta_block_size };
data->iov = (struct iovec){ buf, (size_t)dsk.meta_block_size };
PRIV(op)->pending_ops++;
my_uring_prep_readv(sqe, dsk.meta_fd, &data->iov, 1, dsk.meta_offset + dsk.meta_block_size + sector);
data->callback = [this, op](ring_data_t *data) { handle_read_event(data, op); };
@@ -855,7 +855,7 @@ void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op
{
ok = false;
printf(
"Checksum mismatch in object %lx:%lx v%lu in journal at 0x%lx, checksum block #%u: got %08x, expected %08x\n",
"Checksum mismatch in object %jx:%jx v%ju in journal at 0x%jx, checksum block #%u: got %08x, expected %08x\n",
op->oid.inode, op->oid.stripe, op->version,
rv[i].disk_offset, bad_block / dsk.csum_block_size, calc_csum, stored_csum
);
@@ -875,7 +875,7 @@ void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op
{
ok = false;
printf(
"Checksum mismatch in object %lx:%lx v%lu in %s data at 0x%lx, checksum block #%u: got %08x, expected %08x\n",
"Checksum mismatch in object %jx:%jx v%ju in %s data at 0x%jx, checksum block #%u: got %08x, expected %08x\n",
op->oid.inode, op->oid.stripe, op->version,
(rv[i].copy_flags & COPY_BUF_JOURNALED_BIG ? "redirect-write" : "clean"),
rv[i].disk_offset, bad_block / dsk.csum_block_size, calc_csum, stored_csum
@@ -918,7 +918,7 @@ void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op
{
// checksum error
printf(
"Checksum mismatch in object %lx:%lx v%lu in %s area at offset 0x%lx+0x%lx: %08x vs %08x\n",
"Checksum mismatch in object %jx:%jx v%ju in %s area at offset 0x%jx+0x%zx: %08x vs %08x\n",
op->oid.inode, op->oid.stripe, op->version,
(vec.copy_flags & COPY_BUF_JOURNAL) ? "journal" : "data", vec.disk_offset, p,
crc32c(0, (uint8_t*)op->buf + vec.offset - op->offset + p, dsk.csum_block_size), *csum
@@ -966,7 +966,7 @@ void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op
{
if (rv.journal_sector)
{
auto used = --journal.used_sectors[rv.journal_sector-1];
auto used = --journal.used_sectors.at(rv.journal_sector-1);
if (used == 0)
{
journal.used_sectors.erase(rv.journal_sector-1);


@@ -179,7 +179,7 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
{
object_id oid = dirty_it->first.oid;
#ifdef BLOCKSTORE_DEBUG
printf("Unblock writes-after-delete %lx:%lx v%lu\n", oid.inode, oid.stripe, dirty_it->first.version);
printf("Unblock writes-after-delete %jx:%jx v%ju\n", oid.inode, oid.stripe, dirty_it->first.version);
#endif
dirty_it = dirty_end;
// Unblock operations blocked by delete flushing
@@ -210,21 +210,26 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
dirty_it->second.location != UINT64_MAX)
{
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu from %lx:%lx v%lu\n", dirty_it->second.location >> dsk.block_order,
printf("Free block %ju from %jx:%jx v%ju\n", dirty_it->second.location >> dsk.block_order,
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
#endif
data_alloc->set(dirty_it->second.location >> dsk.block_order, false);
}
auto used = --journal.used_sectors[dirty_it->second.journal_sector];
auto used = --journal.used_sectors.at(dirty_it->second.journal_sector);
#ifdef BLOCKSTORE_DEBUG
printf(
"remove usage of journal offset %08lx by %lx:%lx v%lu (%lu refs)\n", dirty_it->second.journal_sector,
"remove usage of journal offset %08jx by %jx:%jx v%ju (%ju refs)\n", dirty_it->second.journal_sector,
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version, used
);
#endif
if (used == 0)
{
journal.used_sectors.erase(dirty_it->second.journal_sector);
if (dirty_it->second.journal_sector == journal.sector_info[journal.cur_sector].offset)
{
// Mark current sector as "full" to select the new one
journal.in_sector_pos = dsk.journal_block_size;
}
flusher->mark_trim_possible();
}
free_dirty_dyn_data(dirty_it->second);


@@ -298,7 +298,7 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
if (clean_it == clean_db.end() || clean_it->second.version < ov.version)
{
// No such object version
printf("Error: %lx:%lx v%lu not found while stabilizing\n", ov.oid.inode, ov.oid.stripe, ov.version);
printf("Error: %jx:%jx v%ju not found while stabilizing\n", ov.oid.inode, ov.oid.stripe, ov.version);
return -ENOENT;
}
else
@@ -307,35 +307,49 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
return STAB_SPLIT_DONE;
}
}
else if (IS_IN_FLIGHT(dirty_it->second.state))
{
// Object write is still in progress. Wait until the write request completes
return STAB_SPLIT_WAIT;
}
else if (!IS_SYNCED(dirty_it->second.state))
{
// Object not synced yet - sync it
// In previous versions we returned EBUSY here and required
// the caller (OSD) to issue a global sync first. But a global sync
// waits for all writes in the queue including inflight writes. And
// inflight writes may themselves be blocked by unstable writes being
// still present in the journal and not flushed away from it.
// So we must sync specific objects here.
//
// Even more, we have to process "stabilize" request in parts. That is,
// we must stabilize all objects which are already synced. Otherwise
// they may block objects which are NOT synced yet.
return STAB_SPLIT_SYNC;
}
else if (IS_STABLE(dirty_it->second.state))
{
// Already stable
return STAB_SPLIT_DONE;
}
else
while (true)
{
return STAB_SPLIT_TODO;
if (IS_IN_FLIGHT(dirty_it->second.state))
{
// Object write is still in progress. Wait until the write request completes
return STAB_SPLIT_WAIT;
}
else if (!IS_SYNCED(dirty_it->second.state))
{
// Object not synced yet - sync it
// In previous versions we returned EBUSY here and required
// the caller (OSD) to issue a global sync first. But a global sync
// waits for all writes in the queue including inflight writes. And
// inflight writes may themselves be blocked by unstable writes being
// still present in the journal and not flushed away from it.
// So we must sync specific objects here.
//
// Even more, we have to process "stabilize" request in parts. That is,
// we must stabilize all objects which are already synced. Otherwise
// they may block objects which are NOT synced yet.
return STAB_SPLIT_SYNC;
}
else if (IS_STABLE(dirty_it->second.state))
{
break;
}
// Check previous versions too
if (dirty_it == dirty_db.begin())
{
break;
}
dirty_it--;
if (dirty_it->first.oid != ov.oid)
{
break;
}
}
return STAB_SPLIT_TODO;
});
if (r != 1)
{
@@ -402,7 +416,7 @@ resume_4:
{
// Mark all dirty_db entries up to op->version as stable
#ifdef BLOCKSTORE_DEBUG
printf("Stabilize %lx:%lx v%lu\n", v->oid.inode, v->oid.stripe, v->version);
printf("Stabilize %jx:%jx v%ju\n", v->oid.inode, v->oid.stripe, v->version);
#endif
mark_stable(*v);
}
@@ -412,11 +426,40 @@ resume_4:
return 2;
}
void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
void blockstore_impl_t::mark_stable(obj_ver_id v, bool forget_dirty)
{
auto dirty_it = dirty_db.find(v);
if (dirty_it != dirty_db.end())
{
if (IS_INSTANT(dirty_it->second.state))
{
// 'Instant' (non-EC) operations may complete and try to become stable out of order. Prevent it.
auto back_it = dirty_it;
while (back_it != dirty_db.begin())
{
back_it--;
if (back_it->first.oid != v.oid)
{
break;
}
if (!IS_STABLE(back_it->second.state))
{
// There are preceding unstable versions, can't flush <v>
return;
}
}
while (true)
{
dirty_it++;
if (dirty_it == dirty_db.end() || dirty_it->first.oid != v.oid ||
!IS_SYNCED(dirty_it->second.state))
{
dirty_it--;
break;
}
v.version = dirty_it->first.version;
}
}
while (1)
{
bool was_stable = IS_STABLE(dirty_it->second.state);
@@ -444,18 +487,24 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
}
if (!exists)
{
inode_space_stats[dirty_it->first.oid.inode] += dsk.data_block_size;
uint64_t space_id = dirty_it->first.oid.inode;
if (no_inode_stats[dirty_it->first.oid.inode >> (64-POOL_ID_BITS)])
space_id = space_id & ~(((uint64_t)1 << (64-POOL_ID_BITS)) - 1);
inode_space_stats[space_id] += dsk.data_block_size;
used_blocks++;
}
big_to_flush++;
}
else if (IS_DELETE(dirty_it->second.state))
{
auto & sp = inode_space_stats[dirty_it->first.oid.inode];
uint64_t space_id = dirty_it->first.oid.inode;
if (no_inode_stats[dirty_it->first.oid.inode >> (64-POOL_ID_BITS)])
space_id = space_id & ~(((uint64_t)1 << (64-POOL_ID_BITS)) - 1);
auto & sp = inode_space_stats[space_id];
if (sp > dsk.data_block_size)
sp -= dsk.data_block_size;
else
inode_space_stats.erase(dirty_it->first.oid.inode);
inode_space_stats.erase(space_id);
used_blocks--;
big_to_flush++;
}
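
The mark_stable hunk makes space accounting switchable between per-inode and per-pool: when no_inode_stats is set for the pool an object belongs to, its usage is charged to a synthetic ID that keeps only the pool bits (the top POOL_ID_BITS of the inode number) and zeroes the per-inode part, and the delete branch releases space against the same ID. A sketch of the derivation shared by both branches, assuming (as elsewhere in the code) that the pool ID occupies the top POOL_ID_BITS bits and that no_inode_stats can be indexed by pool ID:

static uint64_t space_stat_id(uint64_t inode, const std::vector<bool> & no_inode_stats)
{
    if (no_inode_stats[inode >> (64 - POOL_ID_BITS)])
        return inode & ~(((uint64_t)1 << (64 - POOL_ID_BITS)) - 1); // keep the pool bits, zero the inode bits
    return inode;
}
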
@@ -464,7 +513,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
{
// mark_stable should never be called for in-flight or submitted writes
printf(
"BUG: Attempt to mark_stable object %lx:%lx v%lu state of which is %x\n",
"BUG: Attempt to mark_stable object %jx:%jx v%ju state of which is %x\n",
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version,
dirty_it->second.state
);


@@ -85,16 +85,14 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
left--;
auto & dirty_entry = dirty_db.at(sbw);
uint64_t dyn_size = dsk.dirty_dyn_size(dirty_entry.offset, dirty_entry.len);
if (!space_check.check_available(op, 1, sizeof(journal_entry_big_write) + dyn_size,
(unstable_writes.size()+unstable_unsynced)*journal.block_size))
if (!space_check.check_available(op, 1, sizeof(journal_entry_big_write) + dyn_size, 0))
{
return 0;
}
}
}
else if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(),
sizeof(journal_entry_big_write) + dsk.clean_entry_bitmap_size,
(unstable_writes.size()+unstable_unsynced)*journal.block_size))
sizeof(journal_entry_big_write) + dsk.clean_entry_bitmap_size, 0))
{
return 0;
}
@@ -117,11 +115,14 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
journal, (dirty_entry.state & BS_ST_INSTANT) ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write) + dyn_size
);
dirty_entry.journal_sector = journal.sector_info[journal.cur_sector].offset;
auto jsec = dirty_entry.journal_sector = journal.sector_info[journal.cur_sector].offset;
assert(journal.next_free >= journal.used_start
? (jsec >= journal.used_start && jsec < journal.next_free)
: (jsec >= journal.used_start || jsec < journal.next_free));
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
"journal offset %08jx is used by %jx:%jx v%ju (%ju refs)\n",
dirty_entry.journal_sector, it->oid.inode, it->oid.stripe, it->version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
@@ -175,7 +176,7 @@ void blockstore_impl_t::ack_sync(blockstore_op_t *op)
for (auto it = PRIV(op)->sync_big_writes.begin(); it != PRIV(op)->sync_big_writes.end(); it++)
{
#ifdef BLOCKSTORE_DEBUG
printf("Ack sync big %lx:%lx v%lu\n", it->oid.inode, it->oid.stripe, it->version);
printf("Ack sync big %jx:%jx v%ju\n", it->oid.inode, it->oid.stripe, it->version);
#endif
auto & unstab = unstable_writes[it->oid];
unstab = unstab < it->version ? it->version : unstab;
@@ -203,7 +204,7 @@ void blockstore_impl_t::ack_sync(blockstore_op_t *op)
for (auto it = PRIV(op)->sync_small_writes.begin(); it != PRIV(op)->sync_small_writes.end(); it++)
{
#ifdef BLOCKSTORE_DEBUG
printf("Ack sync small %lx:%lx v%lu\n", it->oid.inode, it->oid.stripe, it->version);
printf("Ack sync small %jx:%jx v%ju\n", it->oid.inode, it->oid.stripe, it->version);
#endif
auto & unstab = unstable_writes[it->oid];
unstab = unstab < it->version ? it->version : unstab;


@@ -85,7 +85,7 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
// It's allowed to write versions with low numbers over deletes
// However, we have to flush those deletes first as we use version number for ordering
#ifdef BLOCKSTORE_DEBUG
printf("Write %lx:%lx v%lu over delete (real v%lu) offset=%u len=%u\n", op->oid.inode, op->oid.stripe, version, op->version, op->offset, op->len);
printf("Write %jx:%jx v%ju over delete (real v%ju) offset=%u len=%u\n", op->oid.inode, op->oid.stripe, version, op->version, op->offset, op->len);
#endif
wait_del = true;
PRIV(op)->real_version = op->version;
@@ -95,11 +95,13 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
// Issue an additional sync so the delete reaches the journal
blockstore_op_t *sync_op = new blockstore_op_t;
sync_op->opcode = BS_OP_SYNC;
sync_op->callback = [this, op](blockstore_op_t *sync_op)
sync_op->oid = op->oid;
sync_op->version = op->version;
sync_op->callback = [this](blockstore_op_t *sync_op)
{
flusher->unshift_flush((obj_ver_id){
.oid = op->oid,
.version = op->version-1,
.oid = sync_op->oid,
.version = sync_op->version-1,
}, true);
delete sync_op;
};
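
The callback of the extra sync operation used to capture the original op and read op->oid / op->version when the sync completed; now both values are copied into sync_op up front and only this is captured. Presumably (an assumption, the commit message isn't shown here) this avoids dereferencing op at a point where it may already have been retried or freed. The pattern in isolation:

#include <cstdint>
#include <functional>

struct request
{
    uint64_t id = 0;
    std::function<void(request*)> callback;
};

static void schedule_helper(request *parent)
{
    auto *helper = new request;
    helper->id = parent->id;              // copy what the callback will need, while parent is still valid
    helper->callback = [](request *self)  // capture nothing that can dangle
    {
        // use self->id here; parent may be long gone by now
        delete self;
    };
    // submit(helper);  (hypothetical submission step)
}
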
@@ -117,7 +119,7 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
{
// Invalid version requested
#ifdef BLOCKSTORE_DEBUG
printf("Write %lx:%lx v%lu requested, but we already have v%lu\n", op->oid.inode, op->oid.stripe, op->version, version);
printf("Write %jx:%jx v%ju requested, but we already have v%ju\n", op->oid.inode, op->oid.stripe, op->version, version);
#endif
op->retval = -EEXIST;
if (!is_del && alloc_dyn_data)
@@ -129,7 +131,7 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
}
bool imm = (op->len < dsk.data_block_size ? (immediate_commit != IMMEDIATE_NONE) : (immediate_commit == IMMEDIATE_ALL));
if (wait_big && !is_del && !deleted && op->len < dsk.data_block_size && !imm ||
!imm && unsynced_queued_ops >= autosync_writes)
!imm && autosync_writes && unsynced_queued_ops >= autosync_writes)
{
// Issue an additional sync so that the previous big write can reach the journal
blockstore_op_t *sync_op = new blockstore_op_t;
@@ -144,9 +146,9 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
unsynced_queued_ops++;
#ifdef BLOCKSTORE_DEBUG
if (is_del)
printf("Delete %lx:%lx v%lu\n", op->oid.inode, op->oid.stripe, op->version);
printf("Delete %jx:%jx v%ju\n", op->oid.inode, op->oid.stripe, op->version);
else if (!wait_del)
printf("Write %lx:%lx v%lu offset=%u len=%u\n", op->oid.inode, op->oid.stripe, op->version, op->offset, op->len);
printf("Write %jx:%jx v%ju offset=%u len=%u\n", op->oid.inode, op->oid.stripe, op->version, op->offset, op->len);
#endif
// No strict need to add it into dirty_db here except maybe for listings to return
// correct data when there are inflight operations in the queue
@@ -286,7 +288,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
}
// Restore original low version number for unblocked operations
#ifdef BLOCKSTORE_DEBUG
printf("Restoring %lx:%lx version: v%lu -> v%lu\n", op->oid.inode, op->oid.stripe, op->version, PRIV(op)->real_version);
printf("Restoring %jx:%jx version: v%ju -> v%ju\n", op->oid.inode, op->oid.stripe, op->version, PRIV(op)->real_version);
#endif
auto prev_it = dirty_it;
if (prev_it != dirty_db.begin())
@@ -296,7 +298,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
{
// Original version is still invalid
// All subsequent writes to the same object must be canceled too
printf("Tried to write %lx:%lx v%lu after delete (old version v%lu), but already have v%lu\n",
printf("Tried to write %jx:%jx v%ju after delete (old version v%ju), but already have v%ju\n",
op->oid.inode, op->oid.stripe, PRIV(op)->real_version, op->version, prev_it->first.version);
cancel_all_writes(op, dirty_it, -EEXIST);
return 2;
@@ -320,7 +322,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, unsynced_big_write_count + 1,
sizeof(journal_entry_big_write) + dsk.clean_dyn_size,
(unstable_writes.size()+unstable_unsynced)*journal.block_size))
(unstable_writes.size()+unstable_unsynced+((dirty_it->second.state & BS_ST_INSTANT) ? 0 : 1))*journal.block_size))
{
return 0;
}
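
The space checks in dequeue_write now reserve one extra journal block when the entry is not instant (while the checks in continue_sync above drop the unstable-writes term entirely). Presumably (again an assumption) a non-instant write leaves one more unstable entry behind that will later need journal room of its own before it can be stabilized. The reservation spelled out as a helper, with illustrative names not taken from the patch:

static uint64_t unstable_reserve_bytes(uint64_t unstable_writes, uint64_t unstable_unsynced,
    bool is_instant, uint64_t journal_block_size)
{
    return (unstable_writes + unstable_unsynced + (is_instant ? 0 : 1)) * journal_block_size;
}
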
@@ -348,8 +350,8 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
if (entry->oid.inode || entry->oid.stripe || entry->version)
{
printf(
"Fatal error (metadata corruption or bug): tried to write object %lx:%lx v%lu"
" over a non-zero metadata entry %lu with %lx:%lx v%lu\n", op->oid.inode,
"Fatal error (metadata corruption or bug): tried to write object %jx:%jx v%ju"
" over a non-zero metadata entry %ju with %jx:%jx v%ju\n", op->oid.inode,
op->oid.stripe, op->version, loc, entry->oid.inode, entry->oid.stripe, entry->version
);
exit(1);
@@ -361,7 +363,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED;
#ifdef BLOCKSTORE_DEBUG
printf(
"Allocate block %lu for %lx:%lx v%lu\n",
"Allocate block %ju for %jx:%jx v%ju\n",
loc, op->oid.inode, op->oid.stripe, op->version
);
#endif
@@ -372,13 +374,13 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
int vcnt = 0;
if (stripe_offset)
{
PRIV(op)->iov_zerofill[vcnt++] = (struct iovec){ zero_object, stripe_offset };
PRIV(op)->iov_zerofill[vcnt++] = (struct iovec){ zero_object, (size_t)stripe_offset };
}
PRIV(op)->iov_zerofill[vcnt++] = (struct iovec){ op->buf, op->len };
if (stripe_end)
{
stripe_end = dsk.bitmap_granularity - stripe_end;
PRIV(op)->iov_zerofill[vcnt++] = (struct iovec){ zero_object, stripe_end };
PRIV(op)->iov_zerofill[vcnt++] = (struct iovec){ zero_object, (size_t)stripe_end };
}
data->iov.iov_len = op->len + stripe_offset + stripe_end; // to check it in the callback
data->callback = [this, op](ring_data_t *data) { handle_write_event(data, op); };
@@ -412,7 +414,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
sizeof(journal_entry_big_write) + dsk.clean_dyn_size, 0)
|| !space_check.check_available(op, 1,
sizeof(journal_entry_small_write) + dyn_size,
op->len + (unstable_writes.size()+unstable_unsynced)*journal.block_size))
op->len + (unstable_writes.size()+unstable_unsynced+((dirty_it->second.state & BS_ST_INSTANT) ? 0 : 1))*journal.block_size))
{
return 0;
}
@@ -436,11 +438,23 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_SMALL_WRITE_INSTANT : JE_SMALL_WRITE,
sizeof(journal_entry_small_write) + dyn_size
);
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
auto jsec = dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
if (!(journal.next_free >= journal.used_start
? (jsec >= journal.used_start && jsec < journal.next_free)
: (jsec >= journal.used_start || jsec < journal.next_free)))
{
printf(
"BUG: journal offset %08jx is used by %jx:%jx v%ju (%ju refs) BUT used_start=%jx next_free=%jx\n",
dirty_it->second.journal_sector, dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset],
journal.used_start, journal.next_free
);
abort();
}
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
"journal offset %08jx is used by %jx:%jx v%ju (%ju refs)\n",
dirty_it->second.journal_sector, dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
@@ -454,8 +468,8 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
journal_used_it->first < next_next_free + op->len)
{
printf(
"BUG: Attempt to overwrite used offset (%lx, %lu refs) of the journal with the object %lx:%lx v%lu: data at %lx, len %x!"
" Journal used_start=%08lx (%lu refs), next_free=%08lx, dirty_start=%08lx\n",
"BUG: Attempt to overwrite used offset (%jx, %ju refs) of the journal with the object %jx:%jx v%ju: data at %jx, len %x!"
" Journal used_start=%08jx (%ju refs), next_free=%08jx, dirty_start=%08jx\n",
journal_used_it->first, journal_used_it->second, op->oid.inode, op->oid.stripe, op->version, next_next_free, op->len,
journal.used_start, journal.used_sectors[journal.used_start], journal.next_free, journal.dirty_start
);
@@ -463,7 +477,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
}
}
// double check that next_free doesn't cross used_start from the left
assert(journal.next_free >= journal.used_start || next_next_free < journal.used_start);
assert(journal.next_free >= journal.used_start && next_next_free >= journal.next_free || next_next_free < journal.used_start);
journal.next_free = next_next_free;
je->oid = op->oid;
je->version = op->version;
@@ -505,7 +519,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
if (next_next_free >= journal.len)
next_next_free = dsk.journal_block_size;
// double check that next_free doesn't cross used_start from the left
assert(journal.next_free >= journal.used_start || next_next_free < journal.used_start);
assert(journal.next_free >= journal.used_start && next_next_free >= journal.next_free || next_next_free < journal.used_start);
journal.next_free = next_next_free;
if (!(dirty_it->second.state & BS_ST_INSTANT))
{
@@ -549,7 +563,7 @@ resume_2:
uint64_t dyn_size = dsk.dirty_dyn_size(op->offset, op->len);
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, 1, sizeof(journal_entry_big_write) + dyn_size,
(unstable_writes.size()+unstable_unsynced)*journal.block_size))
(unstable_writes.size()+unstable_unsynced+((dirty_it->second.state & BS_ST_INSTANT) ? 0 : 1))*journal.block_size))
{
return 0;
}
@@ -558,11 +572,23 @@ resume_2:
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write) + dyn_size
);
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
auto jsec = dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
if (!(journal.next_free >= journal.used_start
? (jsec >= journal.used_start && jsec < journal.next_free)
: (jsec >= journal.used_start || jsec < journal.next_free)))
{
printf(
"BUG: journal offset %08jx is used by %jx:%jx v%ju (%ju refs) BUT used_start=%jx next_free=%jx\n",
dirty_it->second.journal_sector, dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset],
journal.used_start, journal.next_free
);
abort();
}
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
"journal offset %08jx is used by %jx:%jx v%ju (%ju refs)\n",
journal.sector_info[journal.cur_sector].offset, op->oid.inode, op->oid.stripe, op->version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
@@ -589,11 +615,11 @@ resume_4:
});
assert(dirty_it != dirty_db.end());
#ifdef BLOCKSTORE_DEBUG
printf("Ack write %lx:%lx v%lu = state 0x%x\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
printf("Ack write %jx:%jx v%ju = state 0x%x\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
#endif
bool is_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE;
bool imm = is_big ? (immediate_commit == IMMEDIATE_ALL) : (immediate_commit != IMMEDIATE_NONE);
bool is_instant = ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_DELETE || (dirty_it->second.state & BS_ST_INSTANT));
bool is_instant = IS_INSTANT(dirty_it->second.state);
if (imm)
{
auto & unstab = unstable_writes[op->oid];
@@ -782,7 +808,7 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
"journal offset %08jx is used by %jx:%jx v%ju (%ju refs)\n",
dirty_it->second.journal_sector, dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);


@@ -46,18 +46,21 @@ static const char* help_text =
"vitastor-cli snap-create [-p|--pool <id|name>] <image>@<snapshot>\n"
" Create a snapshot of image <name>. May be used live if only a single writer is active.\n"
"\n"
"vitastor-cli modify <name> [--rename <new-name>] [--resize <size>] [--readonly | --readwrite] [-f|--force]\n"
"vitastor-cli modify <name> [--rename <new-name>] [--resize <size>] [--readonly | --readwrite] [-f|--force] [--down-ok]\n"
" Rename, resize image or change its readonly status. Images with children can't be made read-write.\n"
" If the new size is smaller than the old size, extra data will be purged.\n"
" You should resize file system in the image, if present, before shrinking it.\n"
" -f|--force Proceed with shrinking or setting readwrite flag even if the image has children.\n"
" --down-ok Proceed with shrinking even if some data will be left on unavailable OSDs.\n"
"\n"
"vitastor-cli rm <from> [<to>] [--writers-stopped]\n"
"vitastor-cli rm <from> [<to>] [--writers-stopped] [--down-ok]\n"
" Remove <from> or all layers between <from> and <to> (<to> must be a child of <from>),\n"
" rebasing all their children accordingly. --writers-stopped allows merging to be a bit\n"
" more effective in case of a single 'slim' read-write child and 'fat' removed parent:\n"
" the child is merged into parent and parent is renamed to child in that case.\n"
" In other cases parent layers are always merged into children.\n"
" Other options:\n"
" --down-ok Continue deletion/merging even if some data will be left on unavailable OSDs.\n"
"\n"
"vitastor-cli flatten <layer>\n"
" Flatten a layer, i.e. merge data and detach it from parents.\n"
@@ -113,6 +116,55 @@ static const char* help_text =
" With --dry-run only checks if deletion is possible without data loss and\n"
" redundancy degradation.\n"
"\n"
"vitastor-cli create-pool|pool-create <name> (-s <pg_size>|--ec <N>+<K>) -n <pg_count> [OPTIONS]\n"
" Create a pool. Required parameters:\n"
" -s|--pg_size R Number of replicas for replicated pools\n"
" --ec N+K Number of data (N) and parity (K) chunks for erasure-coded pools\n"
" -n|--pg_count N PG count for the new pool (start with 10*<OSD count>/pg_size rounded to a power of 2)\n"
" Optional parameters:\n"
" --pg_minsize <number> R or N+K minus number of failures to tolerate without downtime\n"
" --failure_domain host Failure domain: host, osd or a level from placement_levels. Default: host\n"
" --root_node <node> Put pool only on child OSDs of this placement tree node\n"
" --osd_tags <tag>[,<tag>]... Put pool only on OSDs tagged with all specified tags\n"
" --block_size 128k Put pool only on OSDs with this data block size\n"
" --bitmap_granularity 4k Put pool only on OSDs with this logical sector size\n"
" --immediate_commit none Put pool only on OSDs with this or larger immediate_commit (none < small < all)\n"
" --primary_affinity_tags tags Prefer to put primary copies on OSDs with all specified tags\n"
" --scrub_interval <time> Enable regular scrubbing for this pool. Format: number + unit s/m/h/d/M/y\n"
" --used_for_fs <name> Mark pool as used for VitastorFS with metadata in image <name>\n"
" --pg_stripe_size <number> Increase object grouping stripe\n"
" --max_osd_combinations 10000 Maximum number of random combinations for LP solver input\n"
" --wait Wait for the new pool to come online\n"
" -f|--force Do not check that cluster has enough OSDs to create the pool\n"
" Examples:\n"
" vitastor-cli create-pool test_x4 -s 4 -n 32\n"
" vitastor-cli create-pool test_ec42 --ec 4+2 -n 32\n"
"\n"
"vitastor-cli modify-pool|pool-modify <id|name> [--name <new_name>] [PARAMETERS...]\n"
" Modify an existing pool. Modifiable parameters:\n"
" [-s|--pg_size <number>] [--pg_minsize <number>] [-n|--pg_count <count>]\n"
" [--failure_domain <level>] [--root_node <node>] [--osd_tags <tags>] [--used_for_fs <name>]\n"
" [--max_osd_combinations <number>] [--primary_affinity_tags <tags>] [--scrub_interval <time>]\n"
" Non-modifiable parameters (changing them WILL lead to data loss):\n"
" [--block_size <size>] [--bitmap_granularity <size>]\n"
" [--immediate_commit <all|small|none>] [--pg_stripe_size <size>]\n"
" These, however, can still be modified with -f|--force.\n"
" See create-pool for parameter descriptions.\n"
" Examples:\n"
" vitastor-cli modify-pool pool_A --name pool_B\n"
" vitastor-cli modify-pool 2 --pg_size 4 -n 128\n"
"\n"
"vitastor-cli rm-pool|pool-rm [--force] <id|name>\n"
" Remove a pool. Refuses to remove pools with images without --force.\n"
"\n"
"vitastor-cli ls-pools|pool-ls|ls-pool|pools [-l] [--detail] [--sort FIELD] [-r] [-n N] [--stats] [<glob> ...]\n"
" List pools (only matching <glob> patterns if passed).\n"
" -l|--long Also report I/O statistics\n"
" --detail Use list format (not table), show all details\n"
" --sort FIELD Sort by specified field (see fields in --json output)\n"
" -r|--reverse Sort in descending order\n"
" -n|--count N Only list first N items\n"
"\n"
"Use vitastor-cli --help <command> for command details or vitastor-cli --help --all for all details.\n"
"\n"
"GLOBAL OPTIONS:\n"
@@ -122,7 +174,7 @@ static const char* help_text =
" --parallel_osds M Work with M osds in parallel when possible (default 4)\n"
" --progress 1|0 Report progress (default 1)\n"
" --cas 1|0 Use CAS writes for flatten, merge, rm (default is decide automatically)\n"
" --no-color Disable colored output\n"
" --color 1|0 Enable/disable colored output and CR symbols (default 1 if stdout is a terminal)\n"
" --json JSON output\n"
;
@@ -133,6 +185,7 @@ static json11::Json::object parse_args(int narg, const char *args[])
cfg["progress"] = "1";
for (int i = 1; i < narg; i++)
{
bool argHasValue = (!(i == narg-1) && (args[i+1][0] != '-'));
if (args[i][0] == '-' && args[i][1] == 'h' && args[i][2] == 0)
{
cfg["help"] = "1";
@@ -143,15 +196,15 @@ static json11::Json::object parse_args(int narg, const char *args[])
}
else if (args[i][0] == '-' && args[i][1] == 'n' && args[i][2] == 0)
{
cfg["count"] = args[++i];
cfg["count"] = argHasValue ? args[++i] : "";
}
else if (args[i][0] == '-' && args[i][1] == 'p' && args[i][2] == 0)
{
cfg["pool"] = args[++i];
cfg["pool"] = argHasValue ? args[++i] : "";
}
else if (args[i][0] == '-' && args[i][1] == 's' && args[i][2] == 0)
{
cfg["size"] = args[++i];
cfg["size"] = argHasValue ? args[++i] : "";
}
else if (args[i][0] == '-' && args[i][1] == 'r' && args[i][2] == 0)
{
@@ -164,17 +217,24 @@ static json11::Json::object parse_args(int narg, const char *args[])
else if (args[i][0] == '-' && args[i][1] == '-')
{
const char *opt = args[i]+2;
cfg[opt] = i == narg-1 || !strcmp(opt, "json") ||
if (!strcmp(opt, "json") || !strcmp(opt, "wait") ||
!strcmp(opt, "wait-list") || !strcmp(opt, "wait_list") ||
!strcmp(opt, "long") || !strcmp(opt, "del") ||
!strcmp(opt, "long") || !strcmp(opt, "detail") || !strcmp(opt, "del") ||
!strcmp(opt, "no-color") || !strcmp(opt, "no_color") ||
!strcmp(opt, "readonly") || !strcmp(opt, "readwrite") ||
!strcmp(opt, "force") || !strcmp(opt, "reverse") ||
!strcmp(opt, "allow-data-loss") || !strcmp(opt, "allow_data_loss") ||
!strcmp(opt, "down-ok") || !strcmp(opt, "down_ok") ||
!strcmp(opt, "dry-run") || !strcmp(opt, "dry_run") ||
!strcmp(opt, "help") || !strcmp(opt, "all") ||
(!strcmp(opt, "writers-stopped") || !strcmp(opt, "writers_stopped")) && strcmp("1", args[i+1]) != 0
? "1" : args[++i];
!strcmp(opt, "writers-stopped") || !strcmp(opt, "writers_stopped"))
{
cfg[opt] = "1";
}
else
{
cfg[opt] = argHasValue ? args[++i] : "";
}
}
else
{
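
With the rewritten parser, the long options listed above are boolean flags (stored as "1" without consuming the next argument), every other option takes a value, and a value option at the very end of the command line now receives an empty value instead of reading the argument after the last one. A few illustrative invocations, not taken from the patch:

vitastor-cli rm image1 --down-ok    # flag: cfg["down-ok"] = "1", "image1" stays a positional argument
vitastor-cli ls-pools --sort name   # value option: cfg["sort"] = "name"
vitastor-cli ls -n                  # trailing value option: cfg["count"] = "" instead of an out-of-range read
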
@@ -217,7 +277,7 @@ static int run(cli_tool_t *p, json11::Json::object cfg)
else if (cmd[0] == "df")
{
// Show pool space stats
action_cb = p->start_df(cfg);
action_cb = p->start_pool_ls(cfg);
}
else if (cmd[0] == "ls")
{
@@ -324,6 +384,44 @@ static int run(cli_tool_t *p, json11::Json::object cfg)
// Allocate a new OSD number
action_cb = p->start_alloc_osd(cfg);
}
else if (cmd[0] == "create-pool" || cmd[0] == "pool-create")
{
// Create a new pool
if (cmd.size() > 1 && cfg["name"].is_null())
{
cfg["name"] = cmd[1];
}
action_cb = p->start_pool_create(cfg);
}
else if (cmd[0] == "modify-pool" || cmd[0] == "pool-modify")
{
// Modify existing pool
if (cmd.size() > 1)
{
cfg["old_name"] = cmd[1];
}
action_cb = p->start_pool_modify(cfg);
}
else if (cmd[0] == "rm-pool" || cmd[0] == "pool-rm")
{
// Remove existing pool
if (cmd.size() > 1)
{
cfg["pool"] = cmd[1];
}
action_cb = p->start_pool_rm(cfg);
}
else if (cmd[0] == "ls-pool" || cmd[0] == "pool-ls" || cmd[0] == "ls-pools" || cmd[0] == "pools")
{
// Show pool list
cfg["show_recovery"] = 1;
if (cmd.size() > 1)
{
cmd.erase(cmd.begin(), cmd.begin()+1);
cfg["names"] = cmd;
}
action_cb = p->start_pool_ls(cfg);
}
else
{
result = { .err = EINVAL, .text = "unknown command: "+cmd[0].string_value() };


@@ -46,6 +46,7 @@ public:
json11::Json etcd_result;
void parse_config(json11::Json::object & cfg);
json11::Json parse_tags(std::string tags);
void change_parent(inode_t cur, inode_t new_parent, cli_result_t *result);
inode_config_t* get_inode_cfg(const std::string & name);
@@ -58,7 +59,6 @@ public:
std::function<bool(cli_result_t &)> start_status(json11::Json);
std::function<bool(cli_result_t &)> start_describe(json11::Json);
std::function<bool(cli_result_t &)> start_fix(json11::Json);
std::function<bool(cli_result_t &)> start_df(json11::Json);
std::function<bool(cli_result_t &)> start_ls(json11::Json);
std::function<bool(cli_result_t &)> start_create(json11::Json);
std::function<bool(cli_result_t &)> start_modify(json11::Json);
@@ -68,6 +68,10 @@ public:
std::function<bool(cli_result_t &)> start_rm(json11::Json);
std::function<bool(cli_result_t &)> start_rm_osd(json11::Json cfg);
std::function<bool(cli_result_t &)> start_alloc_osd(json11::Json cfg);
std::function<bool(cli_result_t &)> start_pool_create(json11::Json);
std::function<bool(cli_result_t &)> start_pool_modify(json11::Json);
std::function<bool(cli_result_t &)> start_pool_rm(json11::Json);
std::function<bool(cli_result_t &)> start_pool_ls(json11::Json);
// Should be called like loop_and_wait(start_status(), <completion callback>)
void loop_and_wait(std::function<bool(cli_result_t &)> loop_cb, std::function<void(const cli_result_t &)> complete_cb);
@@ -77,8 +81,13 @@ public:
std::string print_table(json11::Json items, json11::Json header, bool use_esc);
size_t print_detail_title_len(json11::Json item, std::vector<std::pair<std::string, std::string>> names, size_t prev_len);
std::string print_detail(json11::Json item, std::vector<std::pair<std::string, std::string>> names, size_t title_len, bool use_esc);
std::string format_lat(uint64_t lat);
std::string format_q(double depth);
bool stupid_glob(const std::string str, const std::string glob);
std::string implode(const std::string & sep, json11::Json array);


@@ -77,7 +77,7 @@ struct alloc_osd_t
std::string key = base64_decode(kv["key"].string_value());
osd_num_t cur_osd;
char null_byte = 0;
int scanned = sscanf(key.c_str() + parent->cli->st_cli.etcd_prefix.length(), "/osd/stats/%lu%c", &cur_osd, &null_byte);
int scanned = sscanf(key.c_str() + parent->cli->st_cli.etcd_prefix.length(), "/osd/stats/%ju%c", &cur_osd, &null_byte);
if (scanned != 1 || !cur_osd)
{
fprintf(stderr, "Invalid key in etcd: %s\n", key.c_str());


@@ -1,6 +1,7 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include <unistd.h>
#include "str_util.h"
#include "cluster_client.h"
#include "cli.h"
@@ -11,7 +12,7 @@ void cli_tool_t::change_parent(inode_t cur, inode_t new_parent, cli_result_t *re
if (cur_cfg_it == cli->st_cli.inode_config.end())
{
char buf[128];
snprintf(buf, 128, "Inode 0x%lx disappeared", cur);
snprintf(buf, 128, "Inode 0x%jx disappeared", cur);
*result = (cli_result_t){ .err = EIO, .text = buf };
return;
}
@@ -113,7 +114,12 @@ void cli_tool_t::parse_config(json11::Json::object & cfg)
else
kv_it++;
}
color = !cfg["no_color"].bool_value();
if (cfg.find("no_color") != cfg.end())
color = !cfg["no_color"].bool_value();
else if (cfg.find("color") != cfg.end())
color = cfg["color"].bool_value();
else
color = isatty(1);
json_output = cfg["json"].bool_value();
iodepth = cfg["iodepth"].uint64_value();
if (!iodepth)
@@ -147,6 +153,7 @@ void cli_tool_t::loop_and_wait(std::function<bool(cli_result_t &)> loop_cb, std:
ringloop->unregister_consumer(&looper->consumer);
looper->loop_cb = NULL;
looper->complete_cb(looper->result);
ringloop->submit();
delete looper;
return;
}


@@ -27,6 +27,7 @@ struct image_creator_t
std::string image_name, new_snap, new_parent;
json11::Json new_meta;
uint64_t size;
bool force = false;
bool force_size = false;
pool_id_t old_pool_id = 0;
@@ -45,6 +46,7 @@ struct image_creator_t
void loop()
{
auto & pools = parent->cli->st_cli.pool_config;
if (state >= 1)
goto resume_1;
if (image_name == "")
@@ -62,7 +64,6 @@ struct image_creator_t
}
if (new_pool_id)
{
auto & pools = parent->cli->st_cli.pool_config;
if (pools.find(new_pool_id) == pools.end())
{
result = (cli_result_t){ .err = ENOENT, .text = "Pool "+std::to_string(new_pool_id)+" does not exist" };
@@ -72,7 +73,7 @@ struct image_creator_t
}
else if (new_pool_name != "")
{
for (auto & ic: parent->cli->st_cli.pool_config)
for (auto & ic: pools)
{
if (ic.second.name == new_pool_name)
{
@@ -87,10 +88,20 @@ struct image_creator_t
return;
}
}
else if (parent->cli->st_cli.pool_config.size() == 1)
else if (pools.size() == 1)
{
auto it = parent->cli->st_cli.pool_config.begin();
new_pool_id = it->first;
new_pool_id = pools.begin()->first;
}
if (new_pool_id && !pools.at(new_pool_id).used_for_fs.empty() && !force)
{
result = (cli_result_t){
.err = EINVAL,
.text = "Pool "+pools.at(new_pool_id).name+
" is used for VitastorFS "+pools.at(new_pool_id).used_for_fs+
". Use --force if you really know what you are doing",
};
state = 100;
return;
}
state = 1;
resume_1:
@@ -183,7 +194,16 @@ resume_3:
// Save into inode_config for library users to be able to take it from there immediately
new_cfg.mod_revision = parent->etcd_result["responses"][0]["response_put"]["header"]["revision"].uint64_value();
parent->cli->st_cli.insert_inode_config(new_cfg);
result = (cli_result_t){ .err = 0, .text = "Image "+image_name+" created" };
result = (cli_result_t){
.err = 0,
.text = "Image "+image_name+" created",
.data = json11::Json::object {
{ "name", image_name },
{ "pool", new_pool_name },
{ "parent", new_parent },
{ "size", size },
}
};
state = 100;
}
@@ -251,7 +271,16 @@ resume_4:
// Save into inode_config for library users to be able to take it from there immediately
new_cfg.mod_revision = parent->etcd_result["responses"][0]["response_put"]["header"]["revision"].uint64_value();
parent->cli->st_cli.insert_inode_config(new_cfg);
result = (cli_result_t){ .err = 0, .text = "Snapshot "+image_name+"@"+new_snap+" created" };
result = (cli_result_t){
.err = 0,
.text = "Snapshot "+image_name+"@"+new_snap+" created",
.data = json11::Json::object {
{ "name", image_name+"@"+new_snap },
{ "pool", (uint64_t)new_pool_id },
{ "parent", new_parent },
{ "size", size },
}
};
state = 100;
}
@@ -514,6 +543,7 @@ std::function<bool(cli_result_t &)> cli_tool_t::start_create(json11::Json cfg)
image_creator->image_name = cfg["image"].string_value();
image_creator->new_pool_id = cfg["pool"].uint64_value();
image_creator->new_pool_name = cfg["pool"].string_value();
image_creator->force = cfg["force"].bool_value();
image_creator->force_size = cfg["force_size"].bool_value();
if (cfg["image_meta"].is_object())
{


@@ -160,14 +160,14 @@ struct cli_describe_t
if (op->reply.hdr.retval < 0)
{
fprintf(
stderr, "Failed to describe objects on OSD %lu (retval=%ld)\n",
stderr, "Failed to describe objects on OSD %ju (retval=%jd)\n",
osd_num, op->reply.hdr.retval
);
}
else if (op->reply.describe.result_bytes != op->reply.hdr.retval * sizeof(osd_reply_describe_item_t))
{
fprintf(
stderr, "Invalid response size from OSD %lu (expected %lu bytes, got %lu bytes)\n",
stderr, "Invalid response size from OSD %ju (expected %ju bytes, got %ju bytes)\n",
osd_num, op->reply.hdr.retval * sizeof(osd_reply_describe_item_t), op->reply.describe.result_bytes
);
}
@@ -178,11 +178,11 @@ struct cli_describe_t
{
if (!parent->json_output || parent->is_command_line)
{
#define FMT "{\"inode\":\"0x%lx\",\"stripe\":\"0x%lx\",\"part\":%u,\"osd_num\":%lu%s%s%s}"
#define FMT "{\"inode\":\"0x%jx\",\"stripe\":\"0x%jx\",\"part\":%u,\"osd_num\":%ju%s%s%s}"
printf(
(parent->json_output
? (count > 0 ? ",\n " FMT : " " FMT)
: "%lx:%lx part %u on OSD %lu%s%s%s\n"),
: "%jx:%jx part %u on OSD %ju%s%s%s\n"),
#undef FMT
items[i].inode, items[i].stripe,
items[i].role, items[i].osd_num,


@@ -1,243 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include "cli.h"
#include "cluster_client.h"
#include "str_util.h"
// List pools with space statistics
struct pool_lister_t
{
cli_tool_t *parent;
int state = 0;
json11::Json space_info;
cli_result_t result;
std::map<pool_id_t, json11::Json::object> pool_stats;
bool is_done()
{
return state == 100;
}
void get_stats()
{
if (state == 1)
goto resume_1;
// Space statistics - pool/stats/<pool>
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(
parent->cli->st_cli.etcd_prefix+"/pool/stats/"
) },
{ "range_end", base64_encode(
parent->cli->st_cli.etcd_prefix+"/pool/stats0"
) },
} },
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(
parent->cli->st_cli.etcd_prefix+"/osd/stats/"
) },
{ "range_end", base64_encode(
parent->cli->st_cli.etcd_prefix+"/osd/stats0"
) },
} },
},
} },
});
state = 1;
resume_1:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
space_info = parent->etcd_result;
std::map<pool_id_t, uint64_t> osd_free;
for (auto & kv_item: space_info["responses"][0]["response_range"]["kvs"].array_items())
{
auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
// pool ID
pool_id_t pool_id;
char null_byte = 0;
int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/pool/stats/%u%c", &pool_id, &null_byte);
if (scanned != 1 || !pool_id || pool_id >= POOL_ID_MAX)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
continue;
}
// pool/stats/<N>
pool_stats[pool_id] = kv.value.object_items();
}
for (auto & kv_item: space_info["responses"][1]["response_range"]["kvs"].array_items())
{
auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
// osd ID
osd_num_t osd_num;
char null_byte = 0;
int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/osd/stats/%lu%c", &osd_num, &null_byte);
if (scanned != 1 || !osd_num || osd_num >= POOL_ID_MAX)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
continue;
}
// osd/stats/<N>::free
osd_free[osd_num] = kv.value["free"].uint64_value();
}
// Calculate max_avail for each pool
for (auto & pp: parent->cli->st_cli.pool_config)
{
auto & pool_cfg = pp.second;
uint64_t pool_avail = UINT64_MAX;
std::map<osd_num_t, uint64_t> pg_per_osd;
for (auto & pgp: pool_cfg.pg_config)
{
for (auto pg_osd: pgp.second.target_set)
{
if (pg_osd != 0)
{
pg_per_osd[pg_osd]++;
}
}
}
for (auto pg_per_pair: pg_per_osd)
{
uint64_t pg_free = osd_free[pg_per_pair.first] * pool_cfg.real_pg_count / pg_per_pair.second;
if (pool_avail > pg_free)
{
pool_avail = pg_free;
}
}
if (pool_avail == UINT64_MAX)
{
pool_avail = 0;
}
if (pool_cfg.scheme != POOL_SCHEME_REPLICATED)
{
pool_avail *= (pool_cfg.pg_size - pool_cfg.parity_chunks);
}
pool_stats[pool_cfg.id] = json11::Json::object {
{ "id", (uint64_t)pool_cfg.id },
{ "name", pool_cfg.name },
{ "pg_count", pool_cfg.pg_count },
{ "real_pg_count", pool_cfg.real_pg_count },
{ "scheme", pool_cfg.scheme == POOL_SCHEME_REPLICATED ? "replicated" : "ec" },
{ "scheme_name", pool_cfg.scheme == POOL_SCHEME_REPLICATED
? std::to_string(pool_cfg.pg_size)+"/"+std::to_string(pool_cfg.pg_minsize)
: "EC "+std::to_string(pool_cfg.pg_size-pool_cfg.parity_chunks)+"+"+std::to_string(pool_cfg.parity_chunks) },
{ "used_raw", (uint64_t)(pool_stats[pool_cfg.id]["used_raw_tb"].number_value() * ((uint64_t)1<<40)) },
{ "total_raw", (uint64_t)(pool_stats[pool_cfg.id]["total_raw_tb"].number_value() * ((uint64_t)1<<40)) },
{ "max_available", pool_avail },
{ "raw_to_usable", pool_stats[pool_cfg.id]["raw_to_usable"].number_value() },
{ "space_efficiency", pool_stats[pool_cfg.id]["space_efficiency"].number_value() },
{ "pg_real_size", pool_stats[pool_cfg.id]["pg_real_size"].uint64_value() },
{ "failure_domain", pool_cfg.failure_domain },
};
}
}
json11::Json::array to_list()
{
json11::Json::array list;
for (auto & kv: pool_stats)
{
list.push_back(kv.second);
}
return list;
}
void loop()
{
get_stats();
if (parent->waiting > 0)
return;
if (state == 100)
return;
if (parent->json_output)
{
// JSON output
result.data = to_list();
state = 100;
return;
}
// Table output: name, scheme_name, pg_count, total, used, max_avail, used%, efficiency
json11::Json::array cols;
cols.push_back(json11::Json::object{
{ "key", "name" },
{ "title", "NAME" },
});
cols.push_back(json11::Json::object{
{ "key", "scheme_name" },
{ "title", "SCHEME" },
});
cols.push_back(json11::Json::object{
{ "key", "pg_count_fmt" },
{ "title", "PGS" },
});
cols.push_back(json11::Json::object{
{ "key", "total_fmt" },
{ "title", "TOTAL" },
});
cols.push_back(json11::Json::object{
{ "key", "used_fmt" },
{ "title", "USED" },
});
cols.push_back(json11::Json::object{
{ "key", "max_avail_fmt" },
{ "title", "AVAILABLE" },
});
cols.push_back(json11::Json::object{
{ "key", "used_pct" },
{ "title", "USED%" },
});
cols.push_back(json11::Json::object{
{ "key", "eff_fmt" },
{ "title", "EFFICIENCY" },
});
json11::Json::array list;
for (auto & kv: pool_stats)
{
double raw_to = kv.second["raw_to_usable"].number_value();
if (raw_to < 0.000001 && raw_to > -0.000001)
raw_to = 1;
kv.second["pg_count_fmt"] = kv.second["real_pg_count"] == kv.second["pg_count"]
? kv.second["real_pg_count"].as_string()
: kv.second["real_pg_count"].as_string()+"->"+kv.second["pg_count"].as_string();
kv.second["total_fmt"] = format_size(kv.second["total_raw"].uint64_value() / raw_to);
kv.second["used_fmt"] = format_size(kv.second["used_raw"].uint64_value() / raw_to);
kv.second["max_avail_fmt"] = format_size(kv.second["max_available"].uint64_value());
kv.second["used_pct"] = format_q(kv.second["total_raw"].uint64_value()
? (100 - 100*kv.second["max_available"].uint64_value() *
kv.second["raw_to_usable"].number_value() / kv.second["total_raw"].uint64_value())
: 100)+"%";
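// Sanity check of the used% formula with assumed numbers: total_raw = 10 TiB,
// raw_to_usable = 2 (2 replicas), max_available = 2 TiB usable
// => used% = 100 - 100*2*2/10 = 60%.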
kv.second["eff_fmt"] = format_q(kv.second["space_efficiency"].number_value()*100)+"%";
}
result.data = to_list();
result.text = print_table(result.data, cols, parent->color);
state = 100;
}
};
std::function<bool(cli_result_t &)> cli_tool_t::start_df(json11::Json cfg)
{
auto lister = new pool_lister_t();
lister->parent = this;
return [lister](cli_result_t & result)
{
lister->loop();
if (lister->is_done())
{
result = lister->result;
delete lister;
return true;
}
return false;
};
}


@@ -136,7 +136,7 @@ struct cli_fix_t
auto pool_cfg_it = parent->cli->st_cli.pool_config.find(INODE_POOL(obj.inode));
if (pool_cfg_it == parent->cli->st_cli.pool_config.end())
{
fprintf(stderr, "Object %lx:%lx is from unknown pool\n", obj.inode, obj.stripe);
fprintf(stderr, "Object %jx:%jx is from unknown pool\n", obj.inode, obj.stripe);
continue;
}
auto & pool_cfg = pool_cfg_it->second;
@@ -146,7 +146,7 @@ struct cli_fix_t
!pg_it->second.cur_primary || !(pg_it->second.cur_state & PG_ACTIVE))
{
fprintf(
stderr, "Object %lx:%lx is from PG %u/%u which is not currently active\n",
stderr, "Object %jx:%jx is from PG %u/%u which is not currently active\n",
obj.inode, obj.stripe, pool_cfg_it->first, pg_num
);
continue;
@@ -171,7 +171,7 @@ struct cli_fix_t
{
if (op->reply.hdr.retval < 0 || op->reply.describe.result_bytes != op->reply.hdr.retval * sizeof(osd_reply_describe_item_t))
{
fprintf(stderr, "Failed to describe objects on OSD %lu (retval=%ld)\n", primary_osd, op->reply.hdr.retval);
fprintf(stderr, "Failed to describe objects on OSD %ju (retval=%jd)\n", primary_osd, op->reply.hdr.retval);
parent->waiting--;
loop();
}
@@ -209,7 +209,7 @@ struct cli_fix_t
if (rm_op->reply.hdr.retval < 0)
{
fprintf(
stderr, "Failed to remove object %lx:%lx from OSD %lu (retval=%ld)\n",
stderr, "Failed to remove object %jx:%jx from OSD %ju (retval=%jd)\n",
rm_op->req.sec_del.oid.inode, rm_op->req.sec_del.oid.stripe,
rm_osd_num, rm_op->reply.hdr.retval
);
@@ -226,7 +226,7 @@ struct cli_fix_t
else
{
printf(
"Removed %lx:%lx (part %lu) from OSD %lu\n",
"Removed %jx:%jx (part %ju) from OSD %ju\n",
rm_op->req.sec_del.oid.inode, rm_op->req.sec_del.oid.stripe & ~STRIPE_MASK,
rm_op->req.sec_del.oid.stripe & STRIPE_MASK, rm_osd_num
);
@@ -254,7 +254,7 @@ struct cli_fix_t
if (scrub_op->reply.hdr.retval < 0 && scrub_op->reply.hdr.retval != -ENOENT)
{
fprintf(
stderr, "Failed to scrub %lx:%lx on OSD %lu (retval=%ld)\n",
stderr, "Failed to scrub %jx:%jx on OSD %ju (retval=%jd)\n",
obj.inode, obj.stripe, primary_osd, scrub_op->reply.hdr.retval
);
}


@@ -150,7 +150,7 @@ resume_1:
inode_t only_inode_num;
char null_byte = 0;
int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(),
"/inode/stats/%u/%lu%c", &pool_id, &only_inode_num, &null_byte);
"/inode/stats/%u/%ju%c", &pool_id, &only_inode_num, &null_byte);
if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX || INODE_POOL(only_inode_num) != 0)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
@@ -456,7 +456,7 @@ std::string format_lat(uint64_t lat)
char buf[256];
int l = 0;
if (lat < 100)
l = snprintf(buf, sizeof(buf), "%lu us", lat);
l = snprintf(buf, sizeof(buf), "%ju us", lat);
else if (lat < 500000)
l = snprintf(buf, sizeof(buf), "%.2f ms", (double)lat/1000);
else


@@ -202,7 +202,7 @@ struct snap_merger_t
if (parent->progress)
{
printf(
"Merging %ld layer(s) into target %s%s (inode %lu in pool %u)\n",
"Merging %zd layer(s) into target %s%s (inode %ju in pool %u)\n",
sources.size(), target_cfg->name.c_str(),
use_cas ? " online (with CAS)" : "", INODE_NO_POOL(target), INODE_POOL(target)
);
@@ -275,7 +275,9 @@ struct snap_merger_t
processed++;
if (parent->progress && !(processed % 128))
{
printf("\rFiltering target blocks: %lu/%lu", processed, to_process);
fprintf(stderr, parent->color
? "\rFiltering target blocks: %ju/%ju"
: "Filtering target blocks: %ju/%ju\n", processed, to_process);
}
}
if (in_flight > 0 || oit != merge_offsets.end())
@@ -285,7 +287,9 @@ struct snap_merger_t
}
if (parent->progress)
{
printf("\r%lu full blocks of target filtered out\n", to_process-merge_offsets.size());
fprintf(stderr, parent->color
? "\r%ju full blocks of target filtered out\n"
: "%ju full blocks of target filtered out\n", to_process-merge_offsets.size());
}
}
state = 3;
@@ -320,7 +324,9 @@ struct snap_merger_t
processed++;
if (parent->progress && !(processed % 128))
{
printf("\rOverwriting blocks: %lu/%lu", processed, to_process);
fprintf(stderr, parent->color
? "\rOverwriting blocks: %ju/%ju"
: "Overwriting blocks: %ju/%ju\n", processed, to_process);
}
}
if (in_flight == 0 && rwo_error.size())
@@ -339,10 +345,16 @@ struct snap_merger_t
}
if (parent->progress)
{
printf("\rOverwriting blocks: %lu/%lu\n", to_process, to_process);
fprintf(stderr, parent->color
? "\rOverwriting blocks: %ju/%ju\n"
: "Overwriting blocks: %ju/%ju\n", to_process, to_process);
}
// Done
result = (cli_result_t){ .text = "Done, layers from "+from_name+" to "+to_name+" merged into "+target_name };
result = (cli_result_t){ .text = "Done, layers from "+from_name+" to "+to_name+" merged into "+target_name, .data = json11::Json::object {
{ "from", from_name },
{ "to", to_name },
{ "into", target_name },
}};
state = 100;
resume_100:
return;
@@ -384,7 +396,7 @@ struct snap_merger_t
auto & name = parent->cli->st_cli.inode_config.at(src).name;
if (parent->progress)
{
printf("Got listing of layer %s (inode %lu in pool %u)\n", name.c_str(), INODE_NO_POOL(src), INODE_POOL(src));
printf("Got listing of layer %s (inode %ju in pool %u)\n", name.c_str(), INODE_NO_POOL(src), INODE_POOL(src));
}
if (delete_source)
{
@@ -416,7 +428,7 @@ struct snap_merger_t
{
if (op->retval < 0)
{
fprintf(stderr, "error reading target bitmap at offset %lx: %s\n", op->offset, strerror(-op->retval));
fprintf(stderr, "error reading target bitmap at offset %jx: %s\n", op->offset, strerror(-op->retval));
}
else
{
@@ -571,7 +583,7 @@ struct snap_merger_t
{
if (subop->retval != 0)
{
fprintf(stderr, "error deleting from layer 0x%lx at offset %lx: %s", subop->inode, subop->offset, strerror(-subop->retval));
fprintf(stderr, "error deleting from layer 0x%jx at offset %jx: %s", subop->inode, subop->offset, strerror(-subop->retval));
}
delete subop;
};
@@ -620,7 +632,7 @@ struct snap_merger_t
if (rwo->error_code)
{
char buf[1024];
snprintf(buf, 1024, "Error %s target at offset %lx: %s",
snprintf(buf, 1024, "Error %s target at offset %jx: %s",
rwo->error_read ? "reading" : "writing", rwo->error_offset, strerror(rwo->error_code));
rwo_error = std::string(buf);
}


@@ -15,6 +15,7 @@ struct image_changer_t
uint64_t new_size = 0;
bool force_size = false, inc_size = false;
bool set_readonly = false, set_readwrite = false, force = false;
+ bool down_ok = false;
// interval between fsyncs
int fsync_interval = 128;
@@ -84,7 +85,10 @@ struct image_changer_t
(!new_size && !force_size || cfg.size == new_size || cfg.size >= new_size && inc_size) &&
(new_name == "" || new_name == image_name))
{
result = (cli_result_t){ .text = "No change" };
result = (cli_result_t){ .err = 0, .text = "No change", .data = json11::Json::object {
{ "error_code", 0 },
{ "error_text", "No change" },
}};
state = 100;
return;
}
@@ -105,6 +109,7 @@ struct image_changer_t
{ "pool", (uint64_t)INODE_POOL(inode_num) },
{ "fsync-interval", fsync_interval },
{ "min-offset", ((new_size+4095)/4096)*4096 },
{ "down-ok", down_ok },
});
resume_1:
while (!cb(result))
@@ -220,7 +225,16 @@ resume_2:
parent->cli->st_cli.inode_by_name.erase(image_name);
}
parent->cli->st_cli.insert_inode_config(cfg);
result = (cli_result_t){ .err = 0, .text = "Image "+image_name+" modified" };
result = (cli_result_t){
.err = 0,
.text = "Image "+image_name+" modified",
.data = json11::Json::object {
{ "name", image_name },
{ "inode", INODE_NO_POOL(inode_num) },
{ "pool", (uint64_t)INODE_POOL(inode_num) },
{ "size", new_size },
}
};
state = 100;
}
};
@@ -240,6 +254,7 @@ std::function<bool(cli_result_t &)> cli_tool_t::start_modify(json11::Json cfg)
changer->fsync_interval = cfg["fsync_interval"].uint64_value();
if (!changer->fsync_interval)
changer->fsync_interval = 128;
+ changer->down_ok = cfg["down_ok"].bool_value();
// FIXME Check that the image doesn't have children when shrinking
return [changer](cli_result_t & result)
{

src/cli_pool_cfg.cpp (new file, 270 lines)

@@ -0,0 +1,270 @@
// Copyright (c) Vitaliy Filippov, 2024
// License: VNPL-1.1 (see README.md for details)
#include "cli_pool_cfg.h"
#include "etcd_state_client.h"
#include "str_util.h"
std::string validate_pool_config(json11::Json::object & new_cfg, json11::Json old_cfg,
uint64_t global_block_size, uint64_t global_bitmap_granularity, bool force)
{
// short option names
if (new_cfg.find("count") != new_cfg.end())
{
new_cfg["pg_count"] = new_cfg["count"];
new_cfg.erase("count");
}
if (new_cfg.find("size") != new_cfg.end())
{
new_cfg["pg_size"] = new_cfg["size"];
new_cfg.erase("size");
}
// --ec shortcut
if (new_cfg.find("ec") != new_cfg.end())
{
if (new_cfg.find("scheme") != new_cfg.end() ||
new_cfg.find("pg_size") != new_cfg.end() ||
new_cfg.find("parity_chunks") != new_cfg.end())
{
return "--ec can't be used with --pg_size, --parity_chunks or --scheme";
}
// pg_size = N+K
// parity_chunks = K
uint64_t data_chunks = 0, parity_chunks = 0;
char null_byte = 0;
int ret = sscanf(new_cfg["ec"].string_value().c_str(), "%ju+%ju%c", &data_chunks, &parity_chunks, &null_byte);
if (ret != 2 || !data_chunks || !parity_chunks)
{
return "--ec should be <N>+<K> format (<N>, <K> - numbers)";
}
new_cfg.erase("ec");
new_cfg["scheme"] = "ec";
new_cfg["pg_size"] = data_chunks+parity_chunks;
new_cfg["parity_chunks"] = parity_chunks;
}
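// Example: an assumed invocation with "--ec 2+1" is rewritten into
// scheme = "ec", pg_size = 3 (2 data + 1 parity), parity_chunks = 1.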
if (old_cfg.is_null() && new_cfg["scheme"].string_value() == "")
{
// Default scheme
new_cfg["scheme"] = "replicated";
}
if (new_cfg.find("pg_minsize") == new_cfg.end() && (old_cfg.is_null() || new_cfg.find("pg_size") != new_cfg.end()))
{
// Default pg_minsize
if (new_cfg["scheme"] == "replicated")
{
// pg_minsize = (N+K > 2) ? 2 : 1
new_cfg["pg_minsize"] = new_cfg["pg_size"].uint64_value() > 2 ? 2 : 1;
}
else // ec or xor
{
// pg_minsize = (K > 1) ? N + 1 : N
new_cfg["pg_minsize"] = new_cfg["pg_size"].uint64_value() - new_cfg["parity_chunks"].uint64_value() +
(new_cfg["parity_chunks"].uint64_value() > 1 ? 1 : 0);
}
}
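// Examples of the defaults above: a replicated pool with pg_size 3 gets pg_minsize 2,
// one with pg_size 2 gets 1; an EC 4+2 pool gets pg_minsize 4+1 = 5,
// an EC 4+1 or XOR pool gets pg_minsize 4 (just N).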
// Check integer values and unknown keys
for (auto kv_it = new_cfg.begin(); kv_it != new_cfg.end(); )
{
auto & key = kv_it->first;
auto & value = kv_it->second;
if (key == "pg_size" || key == "parity_chunks" || key == "pg_minsize" ||
key == "pg_count" || key == "max_osd_combinations" || key == "block_size" ||
key == "bitmap_granularity" || key == "pg_stripe_size")
{
if (value.is_number() && value.uint64_value() != value.number_value() ||
value.is_string() && !value.uint64_value() && value.string_value() != "0")
{
return key+" must be a non-negative integer";
}
value = value.uint64_value();
}
else if (key == "name" || key == "scheme" || key == "immediate_commit" ||
key == "failure_domain" || key == "root_node" || key == "scrub_interval" || key == "used_for_fs")
{
// OK
}
else if (key == "osd_tags" || key == "primary_affinity_tags")
{
if (value.is_string())
{
value = explode(",", value.string_value(), true);
}
}
else
{
// Unknown parameter
new_cfg.erase(kv_it++);
continue;
}
kv_it++;
}
// Merge with the old config
if (!old_cfg.is_null())
{
for (auto & kv: old_cfg.object_items())
{
if (new_cfg.find(kv.first) == new_cfg.end())
{
new_cfg[kv.first] = kv.second;
}
}
}
// Check after merging
if (new_cfg["scheme"] != "ec")
{
new_cfg.erase("parity_chunks");
}
if (new_cfg.find("used_for_fs") != new_cfg.end() && new_cfg["used_for_fs"].string_value() == "")
{
new_cfg.erase("used_for_fs");
}
// Copy into an immutable JSON value to prevent autovivification of object keys; from here on the config is only checked, not modified
json11::Json cfg = new_cfg;
// Validate changes
if (!old_cfg.is_null() && !force)
{
if (old_cfg["scheme"] != cfg["scheme"])
{
return "Changing scheme for an existing pool will lead to data loss. Use --force to proceed";
}
if (etcd_state_client_t::parse_scheme(old_cfg["scheme"].string_value()) == POOL_SCHEME_EC)
{
uint64_t old_data_chunks = old_cfg["pg_size"].uint64_value() - old_cfg["parity_chunks"].uint64_value();
uint64_t new_data_chunks = cfg["pg_size"].uint64_value() - cfg["parity_chunks"].uint64_value();
if (old_data_chunks != new_data_chunks)
{
return "Changing EC data chunk count for an existing pool will lead to data loss. Use --force to proceed";
}
}
if (old_cfg["block_size"] != cfg["block_size"] ||
old_cfg["bitmap_granularity"] != cfg["bitmap_granularity"] ||
old_cfg["immediate_commit"] != cfg["immediate_commit"])
{
return "Changing block_size, bitmap_granularity or immediate_commit"
" for an existing pool will lead to incomplete PGs. Use --force to proceed";
}
if (old_cfg["pg_stripe_size"] != cfg["pg_stripe_size"])
{
return "Changing pg_stripe_size for an existing pool will lead to data loss. Use --force to proceed";
}
}
// Validate values
if (cfg["name"].string_value() == "")
{
return "Non-empty pool name is required";
}
// scheme
auto scheme = etcd_state_client_t::parse_scheme(cfg["scheme"].string_value());
if (!scheme)
{
return "Scheme must be one of \"replicated\", \"ec\" or \"xor\"";
}
// pg_size
auto pg_size = cfg["pg_size"].uint64_value();
if (!pg_size)
{
return "Non-zero PG size is required";
}
if (scheme != POOL_SCHEME_REPLICATED && pg_size < 3)
{
return "PG size can't be smaller than 3 for EC/XOR pools";
}
if (pg_size > 256)
{
return "PG size can't be greater than 256";
}
// parity_chunks
uint64_t parity_chunks = 1;
if (scheme == POOL_SCHEME_EC)
{
parity_chunks = cfg["parity_chunks"].uint64_value();
if (!parity_chunks)
{
return "Non-zero parity_chunks is required";
}
if (parity_chunks > pg_size-2)
{
return "parity_chunks can't be greater than "+std::to_string(pg_size-2)+" (PG size - 2)";
}
}
// pg_minsize
auto pg_minsize = cfg["pg_minsize"].uint64_value();
if (!pg_minsize)
{
return "Non-zero pg_minsize is required";
}
else if (pg_minsize > pg_size)
{
return "pg_minsize can't be greater than "+std::to_string(pg_size)+" (PG size)";
}
else if (scheme != POOL_SCHEME_REPLICATED && pg_minsize < pg_size-parity_chunks)
{
return "pg_minsize can't be smaller than "+std::to_string(pg_size-parity_chunks)+
" (pg_size - parity_chunks) for XOR/EC pool";
}
// pg_count
if (!cfg["pg_count"].uint64_value())
{
return "Non-zero pg_count is required";
}
// max_osd_combinations
if (!cfg["max_osd_combinations"].is_null() && cfg["max_osd_combinations"].uint64_value() < 100)
{
return "max_osd_combinations must be at least 100, but it is "+cfg["max_osd_combinations"].as_string();
}
// block_size
auto block_size = cfg["block_size"].uint64_value();
if (!cfg["block_size"].is_null() && ((block_size & (block_size-1)) ||
block_size < MIN_DATA_BLOCK_SIZE || block_size > MAX_DATA_BLOCK_SIZE))
{
return "block_size must be a power of two between "+std::to_string(MIN_DATA_BLOCK_SIZE)+
" and "+std::to_string(MAX_DATA_BLOCK_SIZE)+", but it is "+std::to_string(block_size);
}
block_size = (block_size ? block_size : global_block_size);
// bitmap_granularity
auto bitmap_granularity = cfg["bitmap_granularity"].uint64_value();
if (!cfg["bitmap_granularity"].is_null() && (!bitmap_granularity || (bitmap_granularity % 512)))
{
return "bitmap_granularity must be a multiple of 512, but it is "+std::to_string(bitmap_granularity);
}
bitmap_granularity = (bitmap_granularity ? bitmap_granularity : global_bitmap_granularity);
if (block_size % bitmap_granularity)
{
return "bitmap_granularity must divide data block size ("+std::to_string(block_size)+"), but it is "+std::to_string(bitmap_granularity);
}
// immediate_commit
if (!cfg["immediate_commit"].is_null() && !etcd_state_client_t::parse_immediate_commit(cfg["immediate_commit"].string_value()))
{
return "immediate_commit must be one of \"all\", \"small\", or \"none\", but it is "+cfg["immediate_commit"].as_string();
}
// scrub_interval
if (!cfg["scrub_interval"].is_null())
{
bool ok;
parse_time(cfg["scrub_interval"].string_value(), &ok);
if (!ok)
{
return "scrub_interval must be a time interval (number + unit s/m/h/d/M/y), but it is "+cfg["scrub_interval"].as_string();
}
}
return "";
}

src/cli_pool_cfg.h (new file, 10 lines)

@@ -0,0 +1,10 @@
// Copyright (c) Vitaliy Filippov, 2024
// License: VNPL-1.1 (see README.md for details)
#pragma once
#include "json11/json11.hpp"
#include <stdint.h>
std::string validate_pool_config(json11::Json::object & new_cfg, json11::Json old_cfg,
uint64_t global_block_size, uint64_t global_bitmap_granularity, bool force);

src/cli_pool_create.cpp (new file, 622 lines)

@@ -0,0 +1,622 @@
// Copyright (c) MIND Software LLC, 2023 (info@mindsw.io)
// I accept Vitastor CLA: see CLA-en.md for details
// Copyright (c) Vitaliy Filippov, 2024
// License: VNPL-1.1 (see README.md for details)
#include <ctype.h>
#include "cli.h"
#include "cli_pool_cfg.h"
#include "cluster_client.h"
#include "epoll_manager.h"
#include "pg_states.h"
#include "str_util.h"
struct pool_creator_t
{
cli_tool_t *parent;
json11::Json::object cfg;
bool force = false;
bool wait = false;
int state = 0;
cli_result_t result;
struct {
uint32_t retries = 5;
uint32_t interval = 0;
bool passed = false;
} create_check;
uint64_t new_id = 1;
uint64_t new_pools_mod_rev;
json11::Json state_node_tree;
json11::Json new_pools;
bool is_done() { return state == 100; }
void loop()
{
if (state == 1)
goto resume_1;
else if (state == 2)
goto resume_2;
else if (state == 3)
goto resume_3;
else if (state == 4)
goto resume_4;
else if (state == 5)
goto resume_5;
else if (state == 6)
goto resume_6;
else if (state == 7)
goto resume_7;
else if (state == 8)
goto resume_8;
// Validate pool parameters
result.text = validate_pool_config(cfg, json11::Json(), parent->cli->st_cli.global_block_size,
parent->cli->st_cli.global_bitmap_granularity, force);
if (result.text != "")
{
result.err = EINVAL;
state = 100;
return;
}
state = 1;
resume_1:
// If not forced, check that we have enough osds for pg_size
if (!force)
{
// Get node_placement configuration from etcd
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/node_placement") },
} }
},
} },
});
state = 2;
resume_2:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
// Get state_node_tree based on node_placement and osd peer states
{
auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
state_node_tree = get_state_node_tree(kv.value.object_items());
}
// Skip tag checks if the pool has no OSD tags
if (cfg["osd_tags"].array_items().size())
{
// Get osd configs (for tags) of osds in state_node_tree
{
json11::Json::array osd_configs;
for (auto osd_num: state_node_tree["osds"].array_items())
{
osd_configs.push_back(json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/osd/"+osd_num.as_string()) },
} }
});
}
parent->etcd_txn(json11::Json::object { { "success", osd_configs, }, });
}
state = 3;
resume_3:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
// Filter out osds from state_node_tree based on pool/osd tags
{
std::vector<json11::Json> osd_configs;
for (auto & ocr: parent->etcd_result["responses"].array_items())
{
auto kv = parent->cli->st_cli.parse_etcd_kv(ocr["response_range"]["kvs"][0]);
osd_configs.push_back(kv.value);
}
state_node_tree = filter_state_node_tree_by_tags(state_node_tree, osd_configs);
}
}
// Get stats (for block_size, bitmap_granularity, ...) of osds in state_node_tree
{
json11::Json::array osd_stats;
for (auto osd_num: state_node_tree["osds"].array_items())
{
osd_stats.push_back(json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/osd/stats/"+osd_num.as_string()) },
} }
});
}
parent->etcd_txn(json11::Json::object { { "success", osd_stats, }, });
}
state = 4;
resume_4:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
// Filter osds from state_node_tree based on pool parameters and osd stats
{
std::vector<json11::Json> osd_stats;
for (auto & ocr: parent->etcd_result["responses"].array_items())
{
auto kv = parent->cli->st_cli.parse_etcd_kv(ocr["response_range"]["kvs"][0]);
osd_stats.push_back(kv.value);
}
state_node_tree = filter_state_node_tree_by_stats(state_node_tree, osd_stats);
}
// Check that pg_size <= max_pg_size
{
auto failure_domain = cfg["failure_domain"].string_value() == ""
? "host" : cfg["failure_domain"].string_value();
uint64_t max_pg_size = get_max_pg_size(state_node_tree["nodes"].object_items(),
failure_domain, cfg["root_node"].string_value());
if (cfg["pg_size"].uint64_value() > max_pg_size)
{
result = (cli_result_t){
.err = EINVAL,
.text =
"There are "+std::to_string(max_pg_size)+" \""+failure_domain+"\" failure domains with OSDs matching tags and"
" block_size/bitmap_granularity/immediate_commit parameters, but you want to create a"
" pool with "+cfg["pg_size"].as_string()+" OSDs from different failure domains in a PG."
" Change parameters or add --force if you want to create a degraded pool and add OSDs later."
};
state = 100;
return;
}
}
}
// Create pool
state = 5;
resume_5:
// Get pools from etcd
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
} }
},
} },
});
state = 6;
resume_6:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
{
// Add new pool
auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
new_pools = create_pool(kv);
if (new_pools.is_string())
{
result = (cli_result_t){ .err = EEXIST, .text = new_pools.string_value() };
state = 100;
return;
}
new_pools_mod_rev = kv.mod_revision;
}
// Update pools in etcd
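// The "compare" clause below turns this into a compare-and-swap: the put only succeeds if
// /config/pools still has mod_revision <= new_pools_mod_rev, i.e. nobody modified it since
// we read it; otherwise the transaction fails and the new pool is not written.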
parent->etcd_txn(json11::Json::object {
{ "compare", json11::Json::array {
json11::Json::object {
{ "target", "MOD" },
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
{ "result", "LESS" },
{ "mod_revision", new_pools_mod_rev+1 },
}
} },
{ "success", json11::Json::array {
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
{ "value", base64_encode(new_pools.dump()) },
} },
},
} },
});
state = 7;
resume_7:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
// Perform final create-check
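// (poll up to create_check.retries times, waiting create_check.interval ms between attempts,
// until every PG of the new pool reports PG_ACTIVE)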
create_check.interval = parent->cli->config["mon_change_timeout"].uint64_value();
if (!create_check.interval)
create_check.interval = 1000;
state = 8;
resume_8:
if (parent->waiting > 0)
return;
// Without --wait, skip the check; with --wait, poll until the pool is created and all its PGs are active
if (!wait)
{
create_check.passed = true;
}
else if (create_check.retries)
{
create_check.retries--;
parent->waiting++;
parent->epmgr->tfd->set_timer(create_check.interval, false, [this](int timer_id)
{
if (parent->cli->st_cli.pool_config.find(new_id) != parent->cli->st_cli.pool_config.end())
{
auto & pool_cfg = parent->cli->st_cli.pool_config[new_id];
create_check.passed = pool_cfg.real_pg_count > 0;
for (auto pg_it = pool_cfg.pg_config.begin(); pg_it != pool_cfg.pg_config.end(); pg_it++)
{
if (!(pg_it->second.cur_state & PG_ACTIVE))
{
create_check.passed = false;
break;
}
}
if (create_check.passed)
create_check.retries = 0;
}
parent->waiting--;
parent->ringloop->wakeup();
});
return;
}
if (!create_check.passed)
{
result = (cli_result_t) {
.err = EAGAIN,
.text = "Pool "+cfg["name"].string_value()+" was created, but failed to become active."
" This may indicate that cluster state has changed while the pool was being created."
" Please check the current state and adjust the pool configuration if necessary.",
};
}
else
{
result = (cli_result_t){
.err = 0,
.text = "Pool "+cfg["name"].string_value()+" created",
.data = new_pools[std::to_string(new_id)],
};
}
state = 100;
}
// Returns a JSON object of form {"nodes": {...}, "osds": [...]} that
// contains: all nodes (osds, hosts, ...) based on node_placement config
// and current peer state, and a list of active peer osds.
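// An assumed example of the returned shape:
// { "osds": ["1", "2"], "nodes": { "host1": { "level": "host" },
//   "1": { "parent": "host1" }, "2": { "parent": "host1" } } }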
json11::Json get_state_node_tree(json11::Json::object node_placement)
{
// Erase non-peer osd nodes from node_placement
for (auto np_it = node_placement.begin(); np_it != node_placement.end();)
{
// Numeric nodes are osds
osd_num_t osd_num = stoull_full(np_it->first);
// If node is osd and it is not in peer states, erase it
if (osd_num > 0 &&
parent->cli->st_cli.peer_states.find(osd_num) == parent->cli->st_cli.peer_states.end())
{
node_placement.erase(np_it++);
}
else
np_it++;
}
// List of peer osds
std::vector<std::string> peer_osds;
// Record peer osds and add missing osds/hosts to np
for (auto & ps: parent->cli->st_cli.peer_states)
{
std::string osd_num = std::to_string(ps.first);
// Record peer osd
peer_osds.push_back(osd_num);
// Add osd, if necessary
if (node_placement.find(osd_num) == node_placement.end())
{
std::string osd_host = ps.second["host"].as_string();
// Add host, if necessary
if (node_placement.find(osd_host) == node_placement.end())
{
node_placement[osd_host] = json11::Json::object {
{ "level", "host" }
};
}
node_placement[osd_num] = json11::Json::object {
{ "parent", osd_host }
};
}
}
return json11::Json::object { { "osds", peer_osds }, { "nodes", node_placement } };
}
// Returns new state_node_tree based on given state_node_tree with osds
// filtered out by tags in given osd_configs and current pool config.
// Requires: state_node_tree["osds"] must match osd_configs 1-1
json11::Json filter_state_node_tree_by_tags(const json11::Json & state_node_tree, std::vector<json11::Json> & osd_configs)
{
auto & osds = state_node_tree["osds"].array_items();
// Accepted state_node_tree nodes
auto accepted_nodes = state_node_tree["nodes"].object_items();
// List of accepted osds
std::vector<std::string> accepted_osds;
for (size_t i = 0; i < osd_configs.size(); i++)
{
auto & oc = osd_configs[i].object_items();
// Get osd number
auto osd_num = osds[i].as_string();
// We need tags in config to check against pool tags
if (oc.find("tags") == oc.end())
{
// Exclude osd from state_node_tree nodes
accepted_nodes.erase(osd_num);
continue;
}
else
{
// If all pool tags are in osd tags, accept osd
if (all_in_tags(osd_configs[i]["tags"], cfg["osd_tags"]))
{
accepted_osds.push_back(osd_num);
}
// Otherwise, exclude osd
else
{
// Exclude osd from state_node_tree nodes
accepted_nodes.erase(osd_num);
}
}
}
return json11::Json::object { { "osds", accepted_osds }, { "nodes", accepted_nodes } };
}
// Returns new state_node_tree based on given state_node_tree with osds
// filtered out by stats parameters (block_size, bitmap_granularity) in
// given osd_stats and current pool config.
// Requires: state_node_tree["osds"] must match osd_stats 1-1
json11::Json filter_state_node_tree_by_stats(const json11::Json & state_node_tree, std::vector<json11::Json> & osd_stats)
{
auto & osds = state_node_tree["osds"].array_items();
// Accepted state_node_tree nodes
auto accepted_nodes = state_node_tree["nodes"].object_items();
// List of accepted osds
std::vector<std::string> accepted_osds;
uint64_t p_block_size = cfg["block_size"].uint64_value()
? cfg["block_size"].uint64_value()
: parent->cli->st_cli.global_block_size;
uint64_t p_bitmap_granularity = cfg["bitmap_granularity"].uint64_value()
? cfg["bitmap_granularity"].uint64_value()
: parent->cli->st_cli.global_bitmap_granularity;
uint32_t p_immediate_commit = cfg["immediate_commit"].is_string()
? etcd_state_client_t::parse_immediate_commit(cfg["immediate_commit"].string_value())
: parent->cli->st_cli.global_immediate_commit;
for (size_t i = 0; i < osd_stats.size(); i++)
{
auto & os = osd_stats[i];
// Get osd number
auto osd_num = osds[i].as_string();
if (!os["data_block_size"].is_null() && os["data_block_size"] != p_block_size ||
!os["bitmap_granularity"].is_null() && os["bitmap_granularity"] != p_bitmap_granularity ||
!os["immediate_commit"].is_null() &&
etcd_state_client_t::parse_immediate_commit(os["immediate_commit"].string_value()) < p_immediate_commit)
{
accepted_nodes.erase(osd_num);
}
else
{
accepted_osds.push_back(osd_num);
}
}
return json11::Json::object { { "osds", accepted_osds }, { "nodes", accepted_nodes } };
}
// Returns maximum pg_size possible for given node_tree and failure_domain, starting at parent_node
uint64_t get_max_pg_size(json11::Json::object node_tree, const std::string & level, const std::string & parent_node)
{
uint64_t max_pg_sz = 0;
std::vector<std::string> nodes;
// Check if parent node is an osd (numeric)
if (parent_node != "" && stoull_full(parent_node))
{
// Add it to node list if osd is in node tree
if (node_tree.find(parent_node) != node_tree.end())
nodes.push_back(parent_node);
}
// If parent node given, ...
else if (parent_node != "")
{
// ... look for children nodes of this parent
for (auto & sn: node_tree)
{
auto & props = sn.second.object_items();
auto parent_prop = props.find("parent");
if (parent_prop != props.end() && (parent_prop->second.as_string() == parent_node))
{
nodes.push_back(sn.first);
// If we're not looking for all osds, we only need a single
// child osd node
if (level != "osd" && stoull_full(sn.first))
break;
}
}
}
// No parent node given, and we're not looking for all osds
else if (level != "osd")
{
// ... look for all level nodes
for (auto & sn: node_tree)
{
auto & props = sn.second.object_items();
auto level_prop = props.find("level");
if (level_prop != props.end() && (level_prop->second.as_string() == level))
{
nodes.push_back(sn.first);
}
}
}
// Otherwise, ...
else
{
// ... we're looking for osd nodes only
for (auto & sn: node_tree)
{
if (stoull_full(sn.first))
{
nodes.push_back(sn.first);
}
}
}
// Process gathered nodes
for (auto & node: nodes)
{
// An OSD node always contributes exactly 1 to the maximum PG size
if (stoull_full(node))
{
max_pg_sz += 1;
}
// Otherwise, ...
else
{
// ... exclude parent node from tree, and ...
node_tree.erase(parent_node);
// ... descend into the resulting tree
max_pg_sz += get_max_pg_size(node_tree, level, node);
}
}
return max_pg_sz;
}
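// Assumed example: with nodes host1 (osds 1 and 2), host2 (osd 3) and failure domain "host",
// get_max_pg_size(tree, "host", "") returns 2, so a pool wanting 3 OSDs from different hosts
// in a PG would be rejected without --force.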
json11::Json create_pool(const etcd_kv_t & kv)
{
for (auto & p: kv.value.object_items())
{
// ID
uint64_t pool_id = stoull_full(p.first);
new_id = std::max(pool_id+1, new_id);
// Name
if (p.second["name"].string_value() == cfg["name"].string_value())
{
return "Pool with name \""+cfg["name"].string_value()+"\" already exists (ID "+std::to_string(pool_id)+")";
}
}
auto res = kv.value.object_items();
res[std::to_string(new_id)] = cfg;
return res;
}
// Checks whether all tags from tags2 are present in tags1
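// e.g. (assumed values) all_in_tags(["ssd","rack1"], "ssd") == true,
// all_in_tags(["ssd"], ["ssd","rack1"]) == false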
bool all_in_tags(json11::Json tags1, json11::Json tags2)
{
if (!tags2.is_array())
{
tags2 = json11::Json::array{ tags2.string_value() };
}
if (!tags1.is_array())
{
tags1 = json11::Json::array{ tags1.string_value() };
}
for (auto & tag2: tags2.array_items())
{
bool found = false;
for (auto & tag1: tags1.array_items())
{
if (tag1 == tag2)
{
found = true;
break;
}
}
if (!found)
{
return false;
}
}
return true;
}
};
std::function<bool(cli_result_t &)> cli_tool_t::start_pool_create(json11::Json cfg)
{
auto pool_creator = new pool_creator_t();
pool_creator->parent = this;
pool_creator->cfg = cfg.object_items();
pool_creator->force = cfg["force"].bool_value();
pool_creator->wait = cfg["wait"].bool_value();
return [pool_creator](cli_result_t & result)
{
pool_creator->loop();
if (pool_creator->is_done())
{
result = pool_creator->result;
delete pool_creator;
return true;
}
return false;
};
}

src/cli_pool_ls.cpp (new file, 723 lines)

@@ -0,0 +1,723 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include <algorithm>
#include "cli.h"
#include "cluster_client.h"
#include "str_util.h"
#include "pg_states.h"
// List pools with space statistics
// - df - minimal list with % used space
// - pool-ls - same but with PG state and recovery %
// - pool-ls -l - same but also include I/O statistics
// - pool-ls --detail - use list format, include PG states, I/O stats and all pool parameters
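// Assumed example invocations of the commands implemented below:
//   vitastor-cli df
//   vitastor-cli pool-ls -l
//   vitastor-cli pool-ls --detail mypool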
struct pool_lister_t
{
cli_tool_t *parent;
std::string sort_field;
std::set<std::string> only_names;
bool reverse = false;
int max_count = 0;
bool show_recovery = false;
bool show_stats = false;
bool detailed = false;
int state = 0;
cli_result_t result;
std::map<pool_id_t, json11::Json::object> pool_stats;
struct io_stats_t
{
uint64_t count = 0;
uint64_t read_iops = 0;
uint64_t read_bps = 0;
uint64_t read_lat = 0;
uint64_t write_iops = 0;
uint64_t write_bps = 0;
uint64_t write_lat = 0;
uint64_t delete_iops = 0;
uint64_t delete_bps = 0;
uint64_t delete_lat = 0;
};
struct object_counts_t
{
uint64_t object_count = 0;
uint64_t misplaced_count = 0;
uint64_t degraded_count = 0;
uint64_t incomplete_count = 0;
};
bool is_done()
{
return state == 100;
}
void get_pool_stats(int base_state)
{
if (state == base_state+1)
goto resume_1;
// Space statistics - pool/stats/<pool>
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(
parent->cli->st_cli.etcd_prefix+"/pool/stats/"
) },
{ "range_end", base64_encode(
parent->cli->st_cli.etcd_prefix+"/pool/stats0"
) },
} },
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(
parent->cli->st_cli.etcd_prefix+"/osd/stats/"
) },
{ "range_end", base64_encode(
parent->cli->st_cli.etcd_prefix+"/osd/stats0"
) },
} },
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(
parent->cli->st_cli.etcd_prefix+"/config/pools"
) },
} },
},
} },
});
state = base_state+1;
resume_1:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
auto space_info = parent->etcd_result;
auto config_pools = space_info["responses"][2]["response_range"]["kvs"][0];
if (!config_pools.is_null())
{
config_pools = parent->cli->st_cli.parse_etcd_kv(config_pools).value;
}
for (auto & kv_item: space_info["responses"][0]["response_range"]["kvs"].array_items())
{
auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
// pool ID
pool_id_t pool_id;
char null_byte = 0;
int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/pool/stats/%u%c", &pool_id, &null_byte);
if (scanned != 1 || !pool_id || pool_id >= POOL_ID_MAX)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
continue;
}
// pool/stats/<N>
pool_stats[pool_id] = kv.value.object_items();
}
std::map<pool_id_t, uint64_t> osd_free;
for (auto & kv_item: space_info["responses"][1]["response_range"]["kvs"].array_items())
{
auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
// osd ID
osd_num_t osd_num;
char null_byte = 0;
int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/osd/stats/%ju%c", &osd_num, &null_byte);
if (scanned != 1 || !osd_num || osd_num >= POOL_ID_MAX)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
continue;
}
// osd/stats/<N>::free
osd_free[osd_num] = kv.value["free"].uint64_value();
}
// Calculate max_avail for each pool
for (auto & pp: parent->cli->st_cli.pool_config)
{
auto & pool_cfg = pp.second;
uint64_t pool_avail = UINT64_MAX;
std::map<osd_num_t, uint64_t> pg_per_osd;
bool active = pool_cfg.real_pg_count > 0;
uint64_t pg_states = 0;
for (auto & pgp: pool_cfg.pg_config)
{
if (!(pgp.second.cur_state & PG_ACTIVE))
{
active = false;
}
pg_states |= pgp.second.cur_state;
for (auto pg_osd: pgp.second.target_set)
{
if (pg_osd != 0)
{
pg_per_osd[pg_osd]++;
}
}
}
for (auto pg_per_pair: pg_per_osd)
{
uint64_t pg_free = osd_free[pg_per_pair.first] * pool_cfg.real_pg_count / pg_per_pair.second;
if (pool_avail > pg_free)
{
pool_avail = pg_free;
}
}
if (pool_avail == UINT64_MAX)
{
pool_avail = 0;
}
if (pool_cfg.scheme != POOL_SCHEME_REPLICATED)
{
pool_avail *= (pool_cfg.pg_size - pool_cfg.parity_chunks);
}
// incomplete > has_incomplete > degraded > has_degraded > has_misplaced
std::string status;
if (!active)
status = "inactive";
else if (pg_states & PG_INCOMPLETE)
status = "incomplete";
else if (pg_states & PG_HAS_INCOMPLETE)
status = "has_incomplete";
else if (pg_states & PG_DEGRADED)
status = "degraded";
else if (pg_states & PG_HAS_DEGRADED)
status = "has_degraded";
else if (pg_states & PG_HAS_MISPLACED)
status = "has_misplaced";
else
status = "active";
pool_stats[pool_cfg.id] = json11::Json::object {
{ "id", (uint64_t)pool_cfg.id },
{ "name", pool_cfg.name },
{ "status", status },
{ "pg_count", pool_cfg.pg_count },
{ "real_pg_count", pool_cfg.real_pg_count },
{ "scheme_name", pool_cfg.scheme == POOL_SCHEME_REPLICATED
? std::to_string(pool_cfg.pg_size)+"/"+std::to_string(pool_cfg.pg_minsize)
: "EC "+std::to_string(pool_cfg.pg_size-pool_cfg.parity_chunks)+"+"+std::to_string(pool_cfg.parity_chunks) },
{ "used_raw", (uint64_t)(pool_stats[pool_cfg.id]["used_raw_tb"].number_value() * ((uint64_t)1<<40)) },
{ "total_raw", (uint64_t)(pool_stats[pool_cfg.id]["total_raw_tb"].number_value() * ((uint64_t)1<<40)) },
{ "max_available", pool_avail },
{ "raw_to_usable", pool_stats[pool_cfg.id]["raw_to_usable"].number_value() },
{ "space_efficiency", pool_stats[pool_cfg.id]["space_efficiency"].number_value() },
{ "pg_real_size", pool_stats[pool_cfg.id]["pg_real_size"].uint64_value() },
{ "osd_count", pg_per_osd.size() },
};
}
// Include full pool config
for (auto & pp: config_pools.object_items())
{
if (!pp.second.is_object())
{
continue;
}
auto pool_id = stoull_full(pp.first);
auto & st = pool_stats[pool_id];
for (auto & kv: pp.second.object_items())
{
if (st.find(kv.first) == st.end())
st[kv.first] = kv.second;
}
}
}
void get_pg_stats(int base_state)
{
if (state == base_state+1)
goto resume_1;
// PG statistics - pg/stats/<pool>/<pg>
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(
parent->cli->st_cli.etcd_prefix+"/pg/stats/"
) },
{ "range_end", base64_encode(
parent->cli->st_cli.etcd_prefix+"/pg/stats0"
) },
} },
},
} },
});
state = base_state+1;
resume_1:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
auto pg_stats = parent->etcd_result["responses"][0]["response_range"]["kvs"];
// Calculate recovery percent
std::map<pool_id_t, object_counts_t> counts;
for (auto & kv_item: pg_stats.array_items())
{
auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
// pool ID & pg number
pool_id_t pool_id;
pg_num_t pg_num = 0;
char null_byte = 0;
int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(),
"/pg/stats/%u/%u%c", &pool_id, &pg_num, &null_byte);
if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
continue;
}
auto & cnt = counts[pool_id];
cnt.object_count += kv.value["object_count"].uint64_value();
cnt.misplaced_count += kv.value["misplaced_count"].uint64_value();
cnt.degraded_count += kv.value["degraded_count"].uint64_value();
cnt.incomplete_count += kv.value["incomplete_count"].uint64_value();
}
for (auto & pp: pool_stats)
{
auto & cnt = counts[pp.first];
auto & st = pp.second;
st["object_count"] = cnt.object_count;
st["misplaced_count"] = cnt.misplaced_count;
st["degraded_count"] = cnt.degraded_count;
st["incomplete_count"] = cnt.incomplete_count;
}
}
void get_inode_stats(int base_state)
{
if (state == base_state+1)
goto resume_1;
// Inode statistics - inode/stats/<pool>/<inode>
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(
parent->cli->st_cli.etcd_prefix+"/inode/stats/"
) },
{ "range_end", base64_encode(
parent->cli->st_cli.etcd_prefix+"/inode/stats0"
) },
} },
},
} },
});
state = base_state+1;
resume_1:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
auto inode_stats = parent->etcd_result["responses"][0]["response_range"]["kvs"];
// Performance statistics
std::map<pool_id_t, io_stats_t> pool_io;
for (auto & kv_item: inode_stats.array_items())
{
auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
// pool ID & inode number
pool_id_t pool_id;
inode_t only_inode_num;
char null_byte = 0;
int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(),
"/inode/stats/%u/%ju%c", &pool_id, &only_inode_num, &null_byte);
if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX || INODE_POOL(only_inode_num) != 0)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
continue;
}
auto & io = pool_io[pool_id];
io.read_iops += kv.value["read"]["iops"].uint64_value();
io.read_bps += kv.value["read"]["bps"].uint64_value();
io.read_lat += kv.value["read"]["lat"].uint64_value();
io.write_iops += kv.value["write"]["iops"].uint64_value();
io.write_bps += kv.value["write"]["bps"].uint64_value();
io.write_lat += kv.value["write"]["lat"].uint64_value();
io.delete_iops += kv.value["delete"]["iops"].uint64_value();
io.delete_bps += kv.value["delete"]["bps"].uint64_value();
io.delete_lat += kv.value["delete"]["lat"].uint64_value();
io.count++;
}
for (auto & pp: pool_stats)
{
auto & io = pool_io[pp.first];
if (io.count > 0)
{
io.read_lat /= io.count;
io.write_lat /= io.count;
io.delete_lat /= io.count;
}
auto & st = pp.second;
st["read_iops"] = io.read_iops;
st["read_bps"] = io.read_bps;
st["read_lat"] = io.read_lat;
st["write_iops"] = io.write_iops;
st["write_bps"] = io.write_bps;
st["write_lat"] = io.write_lat;
st["delete_iops"] = io.delete_iops;
st["delete_bps"] = io.delete_bps;
st["delete_lat"] = io.delete_lat;
}
}
json11::Json::array to_list()
{
json11::Json::array list;
for (auto & kv: pool_stats)
{
if (!only_names.size())
{
list.push_back(kv.second);
}
else
{
for (auto glob: only_names)
{
if (stupid_glob(kv.second["name"].string_value(), glob))
{
list.push_back(kv.second);
break;
}
}
}
}
if (sort_field == "name" || sort_field == "scheme" ||
sort_field == "scheme_name" || sort_field == "status")
{
std::sort(list.begin(), list.end(), [this](json11::Json a, json11::Json b)
{
auto av = a[sort_field].as_string();
auto bv = b[sort_field].as_string();
return reverse ? av > bv : av < bv;
});
}
else
{
std::sort(list.begin(), list.end(), [this](json11::Json a, json11::Json b)
{
auto av = a[sort_field].number_value();
auto bv = b[sort_field].number_value();
return reverse ? av > bv : av < bv;
});
}
if (max_count > 0 && list.size() > max_count)
{
list.resize(max_count);
}
return list;
}
void loop()
{
if (state == 1)
goto resume_1;
if (state == 2)
goto resume_2;
if (state == 3)
goto resume_3;
if (state == 100)
return;
show_stats = show_stats || detailed;
show_recovery = show_recovery || detailed;
resume_1:
get_pool_stats(0);
if (parent->waiting > 0)
return;
if (show_stats)
{
resume_2:
get_inode_stats(1);
if (parent->waiting > 0)
return;
}
if (show_recovery)
{
resume_3:
get_pg_stats(2);
if (parent->waiting > 0)
return;
}
if (parent->json_output)
{
// JSON output
result.data = to_list();
state = 100;
return;
}
json11::Json::array list;
for (auto & kv: pool_stats)
{
auto & st = kv.second;
double raw_to = st["raw_to_usable"].number_value();
if (raw_to < 0.000001 && raw_to > -0.000001)
raw_to = 1;
st["pg_count_fmt"] = st["real_pg_count"] == st["pg_count"]
? st["real_pg_count"].as_string()
: st["real_pg_count"].as_string()+"->"+st["pg_count"].as_string();
st["total_fmt"] = format_size(st["total_raw"].uint64_value() / raw_to);
st["used_fmt"] = format_size(st["used_raw"].uint64_value() / raw_to);
st["max_avail_fmt"] = format_size(st["max_available"].uint64_value());
st["used_pct"] = format_q(st["total_raw"].uint64_value()
? (100 - 100*st["max_available"].uint64_value() *
st["raw_to_usable"].number_value() / st["total_raw"].uint64_value())
: 100)+"%";
st["eff_fmt"] = format_q(st["space_efficiency"].number_value()*100)+"%";
if (show_stats)
{
st["read_bw"] = format_size(st["read_bps"].uint64_value())+"/s";
st["write_bw"] = format_size(st["write_bps"].uint64_value())+"/s";
st["delete_bw"] = format_size(st["delete_bps"].uint64_value())+"/s";
st["read_iops"] = format_q(st["read_iops"].number_value());
st["write_iops"] = format_q(st["write_iops"].number_value());
st["delete_iops"] = format_q(st["delete_iops"].number_value());
st["read_lat_f"] = format_lat(st["read_lat"].uint64_value());
st["write_lat_f"] = format_lat(st["write_lat"].uint64_value());
st["delete_lat_f"] = format_lat(st["delete_lat"].uint64_value());
}
if (show_recovery)
{
auto object_count = st["object_count"].uint64_value();
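// recovery% = share of objects that are neither misplaced, degraded nor incomplete;
// e.g. (assumed) 1000 objects with 30 degraded and 20 misplaced => 95%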
auto recovery_pct = 100.0 * (object_count - (st["misplaced_count"].uint64_value() +
st["degraded_count"].uint64_value() + st["incomplete_count"].uint64_value())) /
(object_count ? object_count : 1);
st["recovery_fmt"] = format_q(recovery_pct)+"%";
}
}
if (detailed)
{
for (auto & kv: pool_stats)
{
auto & st = kv.second;
auto total = st["object_count"].uint64_value();
auto obj_size = st["block_size"].uint64_value();
if (!obj_size)
obj_size = parent->cli->st_cli.global_block_size;
if (st["scheme"] == "ec")
obj_size *= st["pg_size"].uint64_value() - st["parity_chunks"].uint64_value();
else if (st["scheme"] == "xor")
obj_size *= st["pg_size"].uint64_value() - 1;
auto n = st["misplaced_count"].uint64_value();
if (n > 0)
st["misplaced_fmt"] = format_size(n * obj_size) + " / " + format_q(100.0 * n / total);
n = st["degraded_count"].uint64_value();
if (n > 0)
st["degraded_fmt"] = format_size(n * obj_size) + " / " + format_q(100.0 * n / total);
n = st["incomplete_count"].uint64_value();
if (n > 0)
st["incomplete_fmt"] = format_size(n * obj_size) + " / " + format_q(100.0 * n / total);
st["read_fmt"] = st["read_bw"].string_value()+", "+st["read_iops"].string_value()+" op/s, "+
st["read_lat_f"].string_value()+" lat";
st["write_fmt"] = st["write_bw"].string_value()+", "+st["write_iops"].string_value()+" op/s, "+
st["write_lat_f"].string_value()+" lat";
st["delete_fmt"] = st["delete_bw"].string_value()+", "+st["delete_iops"].string_value()+" op/s, "+
st["delete_lat_f"].string_value()+" lat";
if (st["scheme"] == "replicated")
st["scheme_name"] = "x"+st["pg_size"].as_string();
if (st["failure_domain"].string_value() == "")
st["failure_domain"] = "host";
st["osd_tags_fmt"] = implode(", ", st["osd_tags"]);
st["primary_affinity_tags_fmt"] = implode(", ", st["primary_affinity_tags"]);
if (st["block_size"].uint64_value())
st["block_size_fmt"] = format_size(st["block_size"].uint64_value());
if (st["bitmap_granularity"].uint64_value())
st["bitmap_granularity_fmt"] = format_size(st["bitmap_granularity"].uint64_value());
}
// All pool parameters are only displayed in the "detailed" mode
// because there are too many of them to show in the table
auto cols = std::vector<std::pair<std::string, std::string>>{
{ "name", "Name" },
{ "id", "ID" },
{ "scheme_name", "Scheme" },
{ "used_for_fs", "Used for VitastorFS" },
{ "status", "Status" },
{ "pg_count_fmt", "PGs" },
{ "pg_minsize", "PG minsize" },
{ "failure_domain", "Failure domain" },
{ "root_node", "Root node" },
{ "osd_tags_fmt", "OSD tags" },
{ "primary_affinity_tags_fmt", "Primary affinity" },
{ "block_size_fmt", "Block size" },
{ "bitmap_granularity_fmt", "Bitmap granularity" },
{ "immediate_commit", "Immediate commit" },
{ "scrub_interval", "Scrub interval" },
{ "inode_stats_fmt", "Per-inode stats" },
{ "pg_stripe_size", "PG stripe size" },
{ "max_osd_combinations", "Max OSD combinations" },
{ "total_fmt", "Total" },
{ "used_fmt", "Used" },
{ "max_avail_fmt", "Available" },
{ "used_pct", "Used%" },
{ "eff_fmt", "Efficiency" },
{ "osd_count", "OSD count" },
{ "misplaced_fmt", "Misplaced" },
{ "degraded_fmt", "Degraded" },
{ "incomplete_fmt", "Incomplete" },
{ "read_fmt", "Read" },
{ "write_fmt", "Write" },
{ "delete_fmt", "Delete" },
};
auto list = to_list();
size_t title_len = 0;
for (auto & item: list)
{
title_len = print_detail_title_len(item, cols, title_len);
}
for (auto & item: list)
{
if (result.text != "")
result.text += "\n";
result.text += print_detail(item, cols, title_len, parent->color);
}
state = 100;
return;
}
// Table output: name, scheme_name, pg_count, total, used, max_avail, used%, efficiency
json11::Json::array cols;
cols.push_back(json11::Json::object{
{ "key", "name" },
{ "title", "NAME" },
});
cols.push_back(json11::Json::object{
{ "key", "scheme_name" },
{ "title", "SCHEME" },
});
cols.push_back(json11::Json::object{
{ "key", "status" },
{ "title", "STATUS" },
});
cols.push_back(json11::Json::object{
{ "key", "pg_count_fmt" },
{ "title", "PGS" },
});
cols.push_back(json11::Json::object{
{ "key", "total_fmt" },
{ "title", "TOTAL" },
});
cols.push_back(json11::Json::object{
{ "key", "used_fmt" },
{ "title", "USED" },
});
cols.push_back(json11::Json::object{
{ "key", "max_avail_fmt" },
{ "title", "AVAILABLE" },
});
cols.push_back(json11::Json::object{
{ "key", "used_pct" },
{ "title", "USED%" },
});
cols.push_back(json11::Json::object{
{ "key", "eff_fmt" },
{ "title", "EFFICIENCY" },
});
if (show_recovery)
{
cols.push_back(json11::Json::object{ { "key", "recovery_fmt" }, { "title", "RECOVERY" } });
}
if (show_stats)
{
cols.push_back(json11::Json::object{ { "key", "read_bw" }, { "title", "READ" } });
cols.push_back(json11::Json::object{ { "key", "read_iops" }, { "title", "IOPS" } });
cols.push_back(json11::Json::object{ { "key", "read_lat_f" }, { "title", "LAT" } });
cols.push_back(json11::Json::object{ { "key", "write_bw" }, { "title", "WRITE" } });
cols.push_back(json11::Json::object{ { "key", "write_iops" }, { "title", "IOPS" } });
cols.push_back(json11::Json::object{ { "key", "write_lat_f" }, { "title", "LAT" } });
cols.push_back(json11::Json::object{ { "key", "delete_bw" }, { "title", "DELETE" } });
cols.push_back(json11::Json::object{ { "key", "delete_iops" }, { "title", "IOPS" } });
cols.push_back(json11::Json::object{ { "key", "delete_lat_f" }, { "title", "LAT" } });
}
result.data = to_list();
result.text = print_table(result.data, cols, parent->color);
state = 100;
}
};
size_t print_detail_title_len(json11::Json item, std::vector<std::pair<std::string, std::string>> names, size_t prev_len)
{
size_t title_len = prev_len;
for (auto & kv: names)
{
if (!item[kv.first].is_null() && (!item[kv.first].is_string() || item[kv.first].string_value() != ""))
{
size_t len = utf8_length(kv.second);
title_len = title_len < len ? len : title_len;
}
}
return title_len;
}
std::string print_detail(json11::Json item, std::vector<std::pair<std::string, std::string>> names, size_t title_len, bool use_esc)
{
std::string str;
for (auto & kv: names)
{
if (!item[kv.first].is_null() && (!item[kv.first].is_string() || item[kv.first].string_value() != ""))
{
str += kv.second;
str += ": ";
size_t len = utf8_length(kv.second);
for (int j = 0; j < title_len-len; j++)
str += ' ';
if (use_esc)
str += "\033[1m";
str += item[kv.first].as_string();
if (use_esc)
str += "\033[0m";
str += "\n";
}
}
return str;
}
std::function<bool(cli_result_t &)> cli_tool_t::start_pool_ls(json11::Json cfg)
{
auto lister = new pool_lister_t();
lister->parent = this;
lister->show_recovery = cfg["show_recovery"].bool_value();
lister->show_stats = cfg["long"].bool_value();
lister->detailed = cfg["detail"].bool_value();
lister->sort_field = cfg["sort"].string_value();
if ((lister->sort_field == "osd_tags") ||
(lister->sort_field == "primary_affinity_tags" ))
lister->sort_field = lister->sort_field + "_fmt";
lister->reverse = cfg["reverse"].bool_value();
lister->max_count = cfg["count"].uint64_value();
for (auto & item: cfg["names"].array_items())
{
lister->only_names.insert(item.string_value());
}
return [lister](cli_result_t & result)
{
lister->loop();
if (lister->is_done())
{
result = lister->result;
delete lister;
return true;
}
return false;
};
}
std::string implode(const std::string & sep, json11::Json array)
{
if (array.is_number() || array.is_bool() || array.is_string())
{
return array.as_string();
}
std::string res;
bool first = true;
for (auto & item: array.array_items())
{
res += (first ? item.as_string() : sep+item.as_string());
first = false;
}
return res;
}

src/cli_pool_modify.cpp (new file, 203 lines)

@@ -0,0 +1,203 @@
// Copyright (c) MIND Software LLC, 2023 (info@mindsw.io)
// I accept Vitastor CLA: see CLA-en.md for details
// Copyright (c) Vitaliy Filippov, 2024
// License: VNPL-1.1 (see README.md for details)
#include <ctype.h>
#include "cli.h"
#include "cli_pool_cfg.h"
#include "cluster_client.h"
#include "str_util.h"
struct pool_changer_t
{
cli_tool_t *parent;
// Required parameters (id/name)
pool_id_t pool_id = 0;
std::string pool_name;
json11::Json::object cfg;
json11::Json::object new_cfg;
bool force = false;
json11::Json old_cfg;
int state = 0;
cli_result_t result;
// Updated pools
json11::Json new_pools;
// Expected pools mod revision
uint64_t pools_mod_rev;
bool is_done() { return state == 100; }
void loop()
{
if (state == 1)
goto resume_1;
else if (state == 2)
goto resume_2;
pool_id = stoull_full(cfg["old_name"].string_value());
if (!pool_id)
{
pool_name = cfg["old_name"].string_value();
if (pool_name == "")
{
result = (cli_result_t){ .err = ENOENT, .text = "Pool ID or name is required to modify it" };
state = 100;
return;
}
}
resume_0:
// Get pools from etcd
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
} }
},
} },
});
state = 1;
resume_1:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
{
// Parse received pools from etcd
auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
// Get pool by name or ID
old_cfg = json11::Json();
if (pool_name != "")
{
for (auto & pce: kv.value.object_items())
{
if (pce.second["name"] == pool_name)
{
pool_id = stoull_full(pce.first);
old_cfg = pce.second;
break;
}
}
}
else
{
pool_name = std::to_string(pool_id);
old_cfg = kv.value[pool_name];
}
if (!old_cfg.is_object())
{
result = (cli_result_t){ .err = ENOENT, .text = "Pool "+pool_name+" does not exist" };
state = 100;
return;
}
// Update pool
new_cfg = cfg;
result.text = validate_pool_config(new_cfg, old_cfg, parent->cli->st_cli.global_block_size,
parent->cli->st_cli.global_bitmap_granularity, force);
if (result.text != "")
{
result.err = EINVAL;
state = 100;
return;
}
if (new_cfg.find("used_for_fs") != new_cfg.end() && !force)
{
// Check that pool doesn't have images
auto img_it = parent->cli->st_cli.inode_config.lower_bound(INODE_WITH_POOL(pool_id, 0));
if (img_it != parent->cli->st_cli.inode_config.end() && INODE_POOL(img_it->first) == pool_id &&
img_it->second.name == new_cfg["used_for_fs"].string_value())
{
// Only allow metadata image to exist in the FS pool
img_it++;
}
if (img_it != parent->cli->st_cli.inode_config.end() && INODE_POOL(img_it->first) == pool_id)
{
result = (cli_result_t){ .err = ENOENT, .text = "Pool "+pool_name+" has block images, delete them before using it for VitastorFS" };
state = 100;
return;
}
}
// Update pool
auto pls = kv.value.object_items();
pls[std::to_string(pool_id)] = new_cfg;
new_pools = pls;
// Expected pools mod revision
pools_mod_rev = kv.mod_revision;
}
// Update pools in etcd
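// The "compare" clause makes this a CAS: the put only succeeds if /config/pools has not
// been modified since it was read (its mod_revision is still less than pools_mod_rev+1).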
parent->etcd_txn(json11::Json::object {
{ "compare", json11::Json::array {
json11::Json::object {
{ "target", "MOD" },
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
{ "result", "LESS" },
{ "mod_revision", pools_mod_rev+1 },
}
} },
{ "success", json11::Json::array {
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
{ "value", base64_encode(new_pools.dump()) },
} },
},
} },
});
state = 2;
resume_2:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
if (!parent->etcd_result["succeeded"].bool_value())
{
// CAS failure - retry
fprintf(stderr, "Warning: pool configuration was modified in the meantime by someone else\n");
goto resume_0;
}
// Successfully updated pool
result = (cli_result_t){
.err = 0,
.text = "Pool "+pool_name+" updated",
.data = new_pools,
};
state = 100;
}
};
std::function<bool(cli_result_t &)> cli_tool_t::start_pool_modify(json11::Json cfg)
{
auto pool_changer = new pool_changer_t();
pool_changer->parent = this;
pool_changer->cfg = cfg.object_items();
pool_changer->force = cfg["force"].bool_value();
return [pool_changer](cli_result_t & result)
{
pool_changer->loop();
if (pool_changer->is_done())
{
result = pool_changer->result;
delete pool_changer;
return true;
}
return false;
};
}

src/cli_pool_rm.cpp (new file, 226 lines)
@@ -0,0 +1,226 @@
// Copyright (c) MIND Software LLC, 2023 (info@mindsw.io)
// I accept Vitastor CLA: see CLA-en.md for details
// Copyright (c) Vitaliy Filippov, 2024
// License: VNPL-1.1 (see README.md for details)
#include <ctype.h>
#include "cli.h"
#include "cluster_client.h"
#include "str_util.h"
struct pool_remover_t
{
cli_tool_t *parent;
// Required parameters (id/name)
pool_id_t pool_id = 0;
std::string pool_name;
// Force removal
bool force = false;
int state = 0;
cli_result_t result;
// Is pool valid?
bool pool_valid = false;
// Updated pools
json11::Json new_pools;
// Expected pools mod revision
uint64_t pools_mod_rev;
bool is_done() { return state == 100; }
void loop()
{
if (state == 1)
goto resume_1;
else if (state == 2)
goto resume_2;
else if (state == 3)
goto resume_3;
// Pool name (or id) required
if (!pool_id && pool_name == "")
{
result = (cli_result_t){ .err = EINVAL, .text = "Pool name or id must be given" };
state = 100;
return;
}
// Validate pool name/id
// Get pool id by name (if name given)
if (pool_name != "")
{
for (auto & ic: parent->cli->st_cli.pool_config)
{
if (ic.second.name == pool_name)
{
pool_id = ic.first;
pool_valid = 1;
break;
}
}
}
// Otherwise, check if given pool id is valid
else
{
// Set pool name from id (for easier logging)
pool_name = "id " + std::to_string(pool_id);
// Look-up pool id in pool_config
if (parent->cli->st_cli.pool_config.find(pool_id) != parent->cli->st_cli.pool_config.end())
{
pool_valid = 1;
}
}
// Need a valid pool to proceed
if (!pool_valid)
{
result = (cli_result_t){ .err = ENOENT, .text = "Pool "+pool_name+" does not exist" };
state = 100;
return;
}
// Unless forced, check if pool has associated Images/Snapshots
if (!force)
{
std::string images;
for (auto & ic: parent->cli->st_cli.inode_config)
{
if (pool_id && INODE_POOL(ic.second.num) != pool_id)
{
continue;
}
images += ((images != "") ? ", " : "") + ic.second.name;
}
if (images != "")
{
result = (cli_result_t){
.err = ENOTEMPTY,
.text =
"Pool "+pool_name+" cannot be removed as it still has the following "
"images/snapshots associated with it: "+images
};
state = 100;
return;
}
}
// Proceed to deleting the pool
state = 1;
do
{
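// Read-modify-write cycle: repeated until the etcd CAS transaction succeeds.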
resume_1:
// Get pools from etcd
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
} }
},
} },
});
state = 2;
resume_2:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
{
// Parse received pools from etcd
auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
// Remove pool
auto p = kv.value.object_items();
if (p.erase(std::to_string(pool_id)) != 1)
{
result = (cli_result_t){
.err = ENOENT,
.text = "Failed to erase pool "+pool_name+" from: "+kv.value.string_value()
};
state = 100;
return;
}
// Record updated pools
new_pools = p;
// Expected pools mod revision
pools_mod_rev = kv.mod_revision;
}
// Update pools in etcd
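// CAS write: the put only succeeds if /config/pools still has the mod_revision that was
// read above; otherwise the do-while loop retries the whole read-modify-write cycle.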
parent->etcd_txn(json11::Json::object {
{ "compare", json11::Json::array {
json11::Json::object {
{ "target", "MOD" },
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
{ "result", "LESS" },
{ "mod_revision", pools_mod_rev+1 },
}
} },
{ "success", json11::Json::array {
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
{ "value", base64_encode(new_pools.dump()) },
} },
},
} },
});
state = 3;
resume_3:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
} while (!parent->etcd_result["succeeded"].bool_value());
// Successfully deleted pool
result = (cli_result_t){
.err = 0,
.text = "Pool "+pool_name+" deleted",
.data = new_pools
};
state = 100;
}
};
std::function<bool(cli_result_t &)> cli_tool_t::start_pool_rm(json11::Json cfg)
{
auto pool_remover = new pool_remover_t();
pool_remover->parent = this;
pool_remover->pool_id = cfg["pool"].uint64_value();
pool_remover->pool_name = pool_remover->pool_id ? "" : cfg["pool"].as_string();
pool_remover->force = !cfg["force"].is_null();
return [pool_remover](cli_result_t & result)
{
pool_remover->loop();
if (pool_remover->is_done())
{
result = pool_remover->result;
delete pool_remover;
return true;
}
return false;
};
}


@@ -53,6 +53,8 @@ struct snap_remover_t
int use_cas = 1;
// interval between fsyncs
int fsync_interval = 128;
// ignore deletion errors
bool down_ok = false;
std::map<inode_t,int> sources;
std::map<inode_t,uint64_t> inode_used;
@@ -245,6 +247,7 @@ resume_8:
}
state = 100;
result = (cli_result_t){
.err = 0,
.text = "",
.data = my_result(result.data),
};
@@ -291,7 +294,7 @@ resume_100:
if (it == parent->cli->st_cli.inode_config.end())
{
char buf[1024];
snprintf(buf, 1024, "Parent inode of layer %s (id 0x%lx) not found", cur->name.c_str(), cur->parent_id);
snprintf(buf, 1024, "Parent inode of layer %s (id 0x%jx) not found", cur->name.c_str(), cur->parent_id);
state = 100;
return;
}
@@ -384,7 +387,7 @@ resume_100:
pool_id_t pool_id = 0;
inode_t inode = 0;
char null_byte = 0;
int scanned = sscanf(kv.key.c_str() + parent->cli->st_cli.etcd_prefix.length()+13, "%u/%lu%c", &pool_id, &inode, &null_byte);
int scanned = sscanf(kv.key.c_str() + parent->cli->st_cli.etcd_prefix.length()+13, "%u/%ju%c", &pool_id, &inode, &null_byte);
if (scanned != 2 || !inode)
{
result = (cli_result_t){ .err = EIO, .text = "Bad key returned from etcd: "+kv.key };
@@ -439,7 +442,7 @@ resume_100:
if (child_it == parent->cli->st_cli.inode_config.end())
{
char buf[1024];
snprintf(buf, 1024, "Inode 0x%lx disappeared", inverse_child);
snprintf(buf, 1024, "Inode 0x%jx disappeared", inverse_child);
result = (cli_result_t){ .err = EIO, .text = std::string(buf) };
state = 100;
return;
@@ -448,7 +451,7 @@ resume_100:
if (target_it == parent->cli->st_cli.inode_config.end())
{
char buf[1024];
snprintf(buf, 1024, "Inode 0x%lx disappeared", inverse_parent);
snprintf(buf, 1024, "Inode 0x%jx disappeared", inverse_parent);
result = (cli_result_t){ .err = EIO, .text = std::string(buf) };
state = 100;
return;
@@ -576,7 +579,7 @@ resume_100:
if (cur_cfg_it == parent->cli->st_cli.inode_config.end())
{
char buf[1024];
snprintf(buf, 1024, "Inode 0x%lx disappeared", cur);
snprintf(buf, 1024, "Inode 0x%jx disappeared", cur);
result = (cli_result_t){ .err = EIO, .text = std::string(buf) };
state = 100;
return;
@@ -640,7 +643,7 @@ resume_100:
if (child_it == parent->cli->st_cli.inode_config.end())
{
char buf[1024];
snprintf(buf, 1024, "Inode 0x%lx disappeared", child_inode);
snprintf(buf, 1024, "Inode 0x%jx disappeared", child_inode);
result = (cli_result_t){ .err = EIO, .text = std::string(buf) };
state = 100;
return;
@@ -649,7 +652,7 @@ resume_100:
if (target_it == parent->cli->st_cli.inode_config.end())
{
char buf[1024];
snprintf(buf, 1024, "Inode 0x%lx disappeared", target_inode);
snprintf(buf, 1024, "Inode 0x%jx disappeared", target_inode);
result = (cli_result_t){ .err = EIO, .text = std::string(buf) };
state = 100;
return;
@@ -670,7 +673,7 @@ resume_100:
if (source == parent->cli->st_cli.inode_config.end())
{
char buf[1024];
snprintf(buf, 1024, "Inode 0x%lx disappeared", inode);
snprintf(buf, 1024, "Inode 0x%jx disappeared", inode);
result = (cli_result_t){ .err = EIO, .text = std::string(buf) };
state = 100;
return;
@@ -679,6 +682,7 @@ resume_100:
{ "inode", inode },
{ "pool", (uint64_t)INODE_POOL(inode) },
{ "fsync-interval", fsync_interval },
{ "down-ok", down_ok },
});
}
};
@@ -690,6 +694,7 @@ std::function<bool(cli_result_t &)> cli_tool_t::start_rm(json11::Json cfg)
snap_remover->from_name = cfg["from"].string_value();
snap_remover->to_name = cfg["to"].string_value();
snap_remover->fsync_interval = cfg["fsync_interval"].uint64_value();
snap_remover->down_ok = cfg["down_ok"].bool_value();
if (!snap_remover->fsync_interval)
snap_remover->fsync_interval = 128;
if (!cfg["cas"].is_null())


@@ -17,6 +17,7 @@ struct rm_pg_t
uint64_t obj_count = 0, obj_done = 0;
int state = 0;
int in_flight = 0;
bool synced = false;
};
struct rm_inode_t
@@ -24,6 +25,7 @@ struct rm_inode_t
uint64_t inode = 0;
pool_id_t pool_id = 0;
uint64_t min_offset = 0;
bool down_ok = false;
cli_tool_t *parent = NULL;
inode_list_t *lister = NULL;
@@ -48,6 +50,7 @@ struct rm_inode_t
.objects = objects,
.obj_count = objects.size(),
.obj_done = 0,
.synced = parent->cli->get_immediate_commit(inode),
});
if (min_offset == 0)
{
@@ -93,7 +96,7 @@ struct rm_inode_t
fprintf(stderr, "Some data may remain after delete on OSDs which are currently down: ");
for (int i = 0; i < inactive_osds.size(); i++)
{
fprintf(stderr, i > 0 ? ", %lu" : "%lu", inactive_osds[i]);
fprintf(stderr, i > 0 ? ", %ju" : "%ju", inactive_osds[i]);
}
fprintf(stderr, "\n");
}
@@ -136,7 +139,7 @@ struct rm_inode_t
cur_list->in_flight--;
if (op->reply.hdr.retval < 0)
{
fprintf(stderr, "Failed to remove object %lx:%lx from PG %u (OSD %lu) (retval=%ld)\n",
fprintf(stderr, "Failed to remove object %jx:%jx from PG %u (OSD %ju) (retval=%jd)\n",
op->req.rw.inode, op->req.rw.offset,
cur_list->pg_num, cur_list->rm_osd_num, op->reply.hdr.retval);
error_count++;
@@ -151,6 +154,37 @@ struct rm_inode_t
}
cur_list->obj_pos++;
}
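// When all deletions for this PG have completed and the OSD does not use immediate_commit,
// send an explicit SYNC so the removals are made durable before the PG is considered done.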
if (cur_list->in_flight == 0 && cur_list->obj_pos == cur_list->objects.end() &&
!cur_list->synced)
{
osd_op_t *op = new osd_op_t();
op->op_type = OSD_OP_OUT;
op->peer_fd = parent->cli->msgr.osd_peer_fds.at(cur_list->rm_osd_num);
op->req = (osd_any_op_t){
.sync = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = parent->cli->next_op_id(),
.opcode = OSD_OP_SYNC,
},
},
};
op->callback = [this, cur_list](osd_op_t *op)
{
cur_list->in_flight--;
cur_list->synced = true;
if (op->reply.hdr.retval < 0)
{
fprintf(stderr, "Failed to sync OSD %ju (retval=%jd)\n",
cur_list->rm_osd_num, op->reply.hdr.retval);
error_count++;
}
delete op;
continue_delete();
};
cur_list->in_flight++;
parent->cli->msgr.outbox_push(op);
}
}
void continue_delete()
@@ -161,7 +195,8 @@ struct rm_inode_t
}
for (int i = 0; i < lists.size(); i++)
{
if (!lists[i]->in_flight && lists[i]->obj_pos == lists[i]->objects.end())
if (!lists[i]->in_flight && lists[i]->obj_pos == lists[i]->objects.end() &&
lists[i]->synced)
{
delete lists[i];
lists.erase(lists.begin()+i, lists.begin()+i+1);
@@ -178,7 +213,9 @@ struct rm_inode_t
}
if (parent->progress && total_count > 0 && total_done*1000/total_count != total_prev_pct)
{
fprintf(stderr, "\rRemoved %lu/%lu objects, %lu more PGs to list...", total_done, total_count, pgs_to_list);
fprintf(stderr, parent->color
? "\rRemoved %ju/%ju objects, %ju more PGs to list..."
: "Removed %ju/%ju objects, %ju more PGs to list...\n", total_done, total_count, pgs_to_list);
total_prev_pct = total_done*1000/total_count;
}
if (lists_done && !lists.size())
@@ -187,17 +224,18 @@ struct rm_inode_t
{
fprintf(stderr, "\n");
}
if (parent->progress && (total_done < total_count || inactive_osds.size() > 0))
bool is_error = (total_done < total_count || inactive_osds.size() > 0 || error_count > 0);
if (parent->progress && is_error)
{
fprintf(
stderr, "Warning: Pool:%u,ID:%lu inode data may not have been fully removed.\n"
" Use `vitastor-cli rm-data --pool %u --inode %lu` if you encounter it in listings.\n",
stderr, "Warning: Pool:%u,ID:%ju inode data may not have been fully removed.\n"
"Use `vitastor-cli rm-data --pool %u --inode %ju` if you encounter it in listings.\n",
pool_id, INODE_NO_POOL(inode), pool_id, INODE_NO_POOL(inode)
);
}
result = (cli_result_t){
.err = error_count > 0 ? EIO : 0,
.text = error_count > 0 ? "Some blocks were not removed" : (
.err = is_error && !down_ok ? EIO : 0,
.text = is_error ? "Some blocks were not removed" : (
"Done, inode "+std::to_string(INODE_NO_POOL(inode))+" from pool "+
std::to_string(pool_id)+" removed"),
.data = json11::Json::object {
@@ -246,6 +284,7 @@ std::function<bool(cli_result_t &)> cli_tool_t::start_rm_data(json11::Json cfg)
{
remover->inode = (remover->inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1)) | (((uint64_t)remover->pool_id) << (64-POOL_ID_BITS));
}
remover->down_ok = cfg["down_ok"].bool_value();
remover->pool_id = INODE_POOL(remover->inode);
remover->min_offset = cfg["min_offset"].uint64_value();
return [remover](cli_result_t & result)


@@ -106,7 +106,7 @@ resume_2:
if (etcd_states[i]["error"].is_null())
{
etcd_alive++;
etcd_db_size = etcd_states[i]["dbSizeInUse"].uint64_value();
etcd_db_size = etcd_states[i]["dbSize"].uint64_value();
}
}
int mon_count = 0;
@@ -132,7 +132,7 @@ resume_2:
auto kv = parent->cli->st_cli.parse_etcd_kv(osd_stats[i]);
osd_num_t stat_osd_num = 0;
char null_byte = 0;
int scanned = sscanf(kv.key.c_str() + parent->cli->st_cli.etcd_prefix.size(), "/osd/stats/%lu%c", &stat_osd_num, &null_byte);
int scanned = sscanf(kv.key.c_str() + parent->cli->st_cli.etcd_prefix.size(), "/osd/stats/%ju%c", &stat_osd_num, &null_byte);
if (scanned != 1 || !stat_osd_num)
{
fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
@@ -283,7 +283,7 @@ resume_2:
}
printf(
" cluster:\n"
" etcd: %d / %ld up, %s database size\n"
" etcd: %d / %zd up, %s database size\n"
" mon: %d up%s\n"
" osd: %d / %d up\n"
" \n"


@@ -6,7 +6,7 @@
#include "cluster_client_impl.h"
#include "http_client.h" // json_is_true
cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config)
cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json config)
{
wb = new writeback_cache_t();
@@ -238,7 +238,8 @@ void cluster_client_t::erase_op(cluster_op_t *op)
// which may continue following SYNCs, but these SYNCs
// should know about the changed buffer state
// This is ugly but this is the way we do it
std::function<void(cluster_op_t*)>(op->callback)(op);
auto cb = std::move(op->callback);
cb(op);
}
if (!(flags & OP_IMMEDIATE_COMMIT) || enable_writeback)
{
@@ -248,7 +249,8 @@ void cluster_client_t::erase_op(cluster_op_t *op)
{
// Call callback at the end to avoid inconsistencies in prev_wait
// if the callback adds more operations itself
std::function<void(cluster_op_t*)>(op->callback)(op);
auto cb = std::move(op->callback);
cb(op);
}
if (flags & OP_FLUSH_BUFFER)
{
@@ -265,7 +267,7 @@ void cluster_client_t::erase_op(cluster_op_t *op)
}
}
void cluster_client_t::continue_ops(bool up_retry)
void cluster_client_t::continue_ops(int time_passed)
{
if (!pgs_loaded)
{
@@ -277,22 +279,27 @@ void cluster_client_t::continue_ops(bool up_retry)
// Attempt to reenter the function
return;
}
int reset_duration = 0;
restart:
continuing_ops = 1;
for (auto op = op_queue_head; op; )
{
cluster_op_t *next_op = op->next;
if (!op->up_wait || up_retry)
if (op->retry_after && time_passed)
{
op->up_wait = false;
if (!op->prev_wait)
op->retry_after = op->retry_after > time_passed ? op->retry_after-time_passed : 0;
if (op->retry_after && (!reset_duration || op->retry_after < reset_duration))
{
if (op->opcode == OSD_OP_SYNC)
continue_sync(op);
else
continue_rw(op);
reset_duration = op->retry_after;
}
}
if (!op->retry_after && !op->prev_wait)
{
if (op->opcode == OSD_OP_SYNC)
continue_sync(op);
else
continue_rw(op);
}
op = next_op;
if (continuing_ops == 2)
{
@@ -300,6 +307,27 @@ restart:
}
}
continuing_ops = 0;
reset_retry_timer(reset_duration);
}
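// Arm a single shared timer for the smallest pending retry interval; when it fires,
// continue_ops(time_passed) subtracts the elapsed time from every op's retry_after.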
void cluster_client_t::reset_retry_timer(int new_duration)
{
if ((retry_timeout_duration && retry_timeout_duration <= new_duration) || !new_duration)
{
return;
}
if (retry_timeout_id)
{
tfd->clear_timer(retry_timeout_id);
}
retry_timeout_duration = new_duration;
retry_timeout_id = tfd->set_timer(retry_timeout_duration, false, [this](int)
{
int time_passed = retry_timeout_duration;
retry_timeout_id = 0;
retry_timeout_duration = 0;
continue_ops(time_passed);
});
}
void cluster_client_t::on_load_config_hook(json11::Json::object & etcd_global_config)
@@ -349,15 +377,25 @@ void cluster_client_t::on_load_config_hook(json11::Json::object & etcd_global_co
{
client_max_writeback_iodepth = DEFAULT_CLIENT_MAX_WRITEBACK_IODEPTH;
}
// up_wait_retry_interval
up_wait_retry_interval = config["up_wait_retry_interval"].uint64_value();
if (!up_wait_retry_interval)
// client_retry_interval
client_retry_interval = config["client_retry_interval"].uint64_value();
if (!client_retry_interval)
{
up_wait_retry_interval = 500;
client_retry_interval = 50;
}
else if (up_wait_retry_interval < 50)
else if (client_retry_interval < 10)
{
up_wait_retry_interval = 50;
client_retry_interval = 10;
}
// client_eio_retry_interval
client_eio_retry_interval = 1000;
if (!config["client_eio_retry_interval"].is_null())
{
client_eio_retry_interval = config["client_eio_retry_interval"].uint64_value();
if (client_eio_retry_interval && client_eio_retry_interval < 10)
{
client_eio_retry_interval = 10;
}
}
// log_level
log_level = config["log_level"].uint64_value();
@@ -512,7 +550,8 @@ void cluster_client_t::execute(cluster_op_t *op)
op->opcode != OSD_OP_READ_BITMAP && op->opcode != OSD_OP_READ_CHAIN_BITMAP && op->opcode != OSD_OP_WRITE)
{
op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op);
auto cb = std::move(op->callback);
cb(op);
return;
}
if (!pgs_loaded)
@@ -534,7 +573,7 @@ void cluster_client_t::execute_internal(cluster_op_t *op)
return;
}
if (op->opcode == OSD_OP_WRITE && enable_writeback && !(op->flags & OP_FLUSH_BUFFER) &&
!op->version /* FIXME no CAS writeback */)
!op->version /* no CAS writeback */)
{
if (wb->writebacks_active >= client_max_writeback_iodepth)
{
@@ -550,12 +589,13 @@ void cluster_client_t::execute_internal(cluster_op_t *op)
wb->start_writebacks(this, 1);
}
op->retval = op->len;
std::function<void(cluster_op_t*)>(op->callback)(op);
auto cb = std::move(op->callback);
cb(op);
return;
}
if (op->opcode == OSD_OP_WRITE && !(op->flags & OP_IMMEDIATE_COMMIT))
{
if (!(op->flags & OP_FLUSH_BUFFER))
if (!(op->flags & OP_FLUSH_BUFFER) && !op->version /* no CAS write-repeat */)
{
wb->copy_write(op, CACHE_WRITTEN);
}
@@ -619,7 +659,8 @@ bool cluster_client_t::check_rw(cluster_op_t *op)
if (!pool_id)
{
op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op);
auto cb = std::move(op->callback);
cb(op);
return false;
}
auto pool_it = st_cli.pool_config.find(pool_id);
@@ -627,15 +668,17 @@ bool cluster_client_t::check_rw(cluster_op_t *op)
{
// Pools are loaded, but this one is unknown
op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op);
auto cb = std::move(op->callback);
cb(op);
return false;
}
// Check alignment
if (!op->len && (op->opcode == OSD_OP_READ || op->opcode == OSD_OP_READ_BITMAP || op->opcode == OSD_OP_READ_CHAIN_BITMAP || op->opcode == OSD_OP_WRITE) ||
if (!op->len && (op->opcode == OSD_OP_READ_BITMAP || op->opcode == OSD_OP_READ_CHAIN_BITMAP || op->opcode == OSD_OP_WRITE) ||
op->offset % pool_it->second.bitmap_granularity || op->len % pool_it->second.bitmap_granularity)
{
op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op);
auto cb = std::move(op->callback);
cb(op);
return false;
}
if (pool_it->second.immediate_commit == IMMEDIATE_ALL)
@@ -648,7 +691,8 @@ bool cluster_client_t::check_rw(cluster_op_t *op)
if (ino_it != st_cli.inode_config.end() && ino_it->second.readonly)
{
op->retval = -EROFS;
std::function<void(cluster_op_t*)>(op->callback)(op);
auto cb = std::move(op->callback);
cb(op);
return false;
}
}
@@ -716,15 +760,8 @@ resume_1:
// We'll need to retry again
if (op->parts[i].flags & PART_RETRY)
{
op->up_wait = true;
if (!retry_timeout_id)
{
retry_timeout_id = tfd->set_timer(up_wait_retry_interval, false, [this](int)
{
retry_timeout_id = 0;
continue_ops(true);
});
}
op->retry_after = client_retry_interval;
reset_retry_timer(client_retry_interval);
}
op->state = 1;
}
@@ -780,10 +817,9 @@ resume_2:
return 1;
}
else if (op->retval != 0 && !(op->flags & OP_FLUSH_BUFFER) &&
op->retval != -EPIPE && op->retval != -EIO && op->retval != -ENOSPC)
op->retval != -EPIPE && (op->retval != -EIO || !client_eio_retry_interval) && op->retval != -ENOSPC)
{
// Fatal error (neither -EPIPE, -EIO nor -ENOSPC)
// FIXME: Add a parameter to allow to not wait for EIOs (incomplete or corrupted objects) to heal
erase_op(op);
return 1;
}
@@ -1138,7 +1174,6 @@ static inline void mem_or(void *res, const void *r2, unsigned int len)
void cluster_client_t::handle_op_part(cluster_op_part_t *part)
{
cluster_op_t *op = part->parent;
op->inflight_count--;
int expected = part->op.req.hdr.opcode == OSD_OP_SYNC ? 0 : part->op.req.rw.len;
if (part->op.reply.hdr.retval != expected)
{
@@ -1156,31 +1191,32 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
if (op->retval != -EPIPE || log_level > 0)
{
fprintf(
stderr, "%s operation failed on OSD %lu: retval=%ld (expected %d), dropping connection\n",
stderr, "%s operation failed on OSD %ju: retval=%jd (expected %d), dropping connection\n",
osd_op_names[part->op.req.hdr.opcode], part->osd_num, part->op.reply.hdr.retval, expected
);
}
}
else
else if (log_level > 0)
{
fprintf(
stderr, "%s operation failed on OSD %lu: retval=%ld (expected %d)\n",
stderr, "%s operation failed on OSD %ju: retval=%jd (expected %d)\n",
osd_op_names[part->op.req.hdr.opcode], part->osd_num, part->op.reply.hdr.retval, expected
);
}
// All next things like timer, continue_sync/rw and stop_client may affect the operation again
// So do all these things after modifying operation state, otherwise we may hit reenterability bugs
// FIXME postpone such things to set_immediate here to avoid bugs
// Mark op->up_wait = true to retry operation after a short pause (not immediately)
op->up_wait = true;
if (!retry_timeout_id)
// Set op->retry_after to retry operation after a short pause (not immediately)
if (!op->retry_after)
{
retry_timeout_id = tfd->set_timer(up_wait_retry_interval, false, [this](int)
{
retry_timeout_id = 0;
continue_ops(true);
});
op->retry_after = op->retval == -EIO ? client_eio_retry_interval : client_retry_interval;
}
reset_retry_timer(op->retry_after);
if (stop_fd >= 0)
{
msgr.stop_client(stop_fd);
}
op->inflight_count--;
if (op->inflight_count == 0)
{
if (op->opcode == OSD_OP_SYNC)
@@ -1188,14 +1224,11 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
else
continue_rw(op);
}
if (stop_fd >= 0)
{
msgr.stop_client(stop_fd);
}
}
else
{
// OK
op->inflight_count--;
if ((op->opcode == OSD_OP_WRITE || op->opcode == OSD_OP_DELETE) && !(op->flags & OP_IMMEDIATE_COMMIT))
dirty_osds.insert(part->osd_num);
part->flags |= PART_DONE;


@@ -59,7 +59,7 @@ protected:
void *buf = NULL;
cluster_op_t *orig_op = NULL;
bool needs_reslice = false;
bool up_wait = false;
int retry_after = 0;
int inflight_count = 0, done_count = 0;
std::vector<cluster_op_part_t> parts;
void *part_bitmaps = NULL;
@@ -92,9 +92,11 @@ class cluster_client_t
uint64_t client_max_writeback_iodepth = 0;
int log_level = 0;
int up_wait_retry_interval = 500; // ms
int client_retry_interval = 50; // ms
int client_eio_retry_interval = 1000; // ms
int retry_timeout_id = 0;
int retry_timeout_duration = 0;
std::vector<cluster_op_t*> offline_ops;
cluster_op_t *op_queue_head = NULL, *op_queue_tail = NULL;
writeback_cache_t *wb = NULL;
@@ -121,7 +123,7 @@ public:
json11::Json::object cli_config, file_config, etcd_global_config;
json11::Json::object config;
cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config);
cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json config);
~cluster_client_t();
void execute(cluster_op_t *op);
void execute_raw(osd_num_t osd_num, osd_op_t *op);
@@ -131,7 +133,7 @@ public:
bool get_immediate_commit(uint64_t inode);
void continue_ops(bool up_retry = false);
void continue_ops(int time_passed = 0);
inode_list_t *list_inode_start(inode_t inode,
std::function<void(inode_list_t* lst, std::set<object_id>&& objects, pg_num_t pg_num, osd_num_t primary_osd, int status)> callback);
int list_pg_count(inode_list_t *lst);
@@ -152,6 +154,7 @@ protected:
int continue_rw(cluster_op_t *op);
bool check_rw(cluster_op_t *op);
void slice_rw(cluster_op_t *op);
void reset_retry_timer(int new_duration);
bool try_send(cluster_op_t *op, int i);
int continue_sync(cluster_op_t *op);
void send_sync(cluster_op_t *op, cluster_op_part_t *part);


@@ -226,7 +226,7 @@ void cluster_client_t::send_list(inode_list_osd_t *cur_list)
{
if (op->reply.hdr.retval < 0)
{
fprintf(stderr, "Failed to get PG %u/%u object list from OSD %lu (retval=%ld), skipping\n",
fprintf(stderr, "Failed to get PG %u/%u object list from OSD %ju (retval=%jd), skipping\n",
cur_list->pg->lst->pool_id, cur_list->pg->pg_num, cur_list->osd_num, op->reply.hdr.retval);
}
else
@@ -236,7 +236,7 @@ void cluster_client_t::send_list(inode_list_osd_t *cur_list)
// Unstable objects, if present, mean that someone still writes into the inode. Warn the user about it.
cur_list->pg->has_unstable = true;
fprintf(
stderr, "[PG %u/%u] Inode still has %lu unstable object versions out of total %lu - is it still open?\n",
stderr, "[PG %u/%u] Inode still has %ju unstable object versions out of total %ju - is it still open?\n",
cur_list->pg->lst->pool_id, cur_list->pg->pg_num, op->reply.hdr.retval - op->reply.sec_list.stable_count,
op->reply.hdr.retval
);
@@ -244,7 +244,7 @@ void cluster_client_t::send_list(inode_list_osd_t *cur_list)
if (log_level > 0)
{
fprintf(
stderr, "[PG %u/%u] Got inode object list from OSD %lu: %ld object versions\n",
stderr, "[PG %u/%u] Got inode object list from OSD %ju: %jd object versions\n",
cur_list->pg->lst->pool_id, cur_list->pg->pg_num, cur_list->osd_num, op->reply.hdr.retval
);
}

Some files were not shown because too many files have changed in this diff.