Compare commits

..

11 Commits

Author SHA1 Message Date
Vitaliy Filippov 90a56a0519 WIP NFS RDMA support
Test / test_dd (push) Has been skipped Details
Test / test_root_node (push) Has been skipped Details
Test / test_switch_primary (push) Has been skipped Details
Test / test_write (push) Has been skipped Details
Test / test_write_xor (push) Has been skipped Details
Test / test_write_no_same (push) Has been skipped Details
Test / test_heal_pg_size_2 (push) Has been skipped Details
Test / test_heal_ec (push) Has been skipped Details
Test / test_heal_antietcd (push) Has been skipped Details
Test / test_heal_csum_32k_dmj (push) Has been skipped Details
Test / test_heal_csum_32k_dj (push) Has been skipped Details
Test / test_heal_csum_32k (push) Has been skipped Details
Test / test_heal_csum_4k_dmj (push) Has been skipped Details
Test / test_heal_csum_4k_dj (push) Has been skipped Details
Test / test_heal_csum_4k (push) Has been skipped Details
Test / test_resize (push) Has been skipped Details
Test / test_resize_auto (push) Has been skipped Details
Test / test_snapshot_pool2 (push) Has been skipped Details
Test / test_osd_tags (push) Has been skipped Details
Test / test_enospc (push) Has been skipped Details
Test / test_enospc_xor (push) Has been skipped Details
Test / test_enospc_imm (push) Has been skipped Details
Test / test_enospc_imm_xor (push) Has been skipped Details
Test / test_scrub (push) Has been skipped Details
Test / test_scrub_zero_osd_2 (push) Has been skipped Details
Test / test_scrub_xor (push) Has been skipped Details
Test / test_scrub_pg_size_3 (push) Has been skipped Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Has been skipped Details
Test / test_scrub_ec (push) Has been skipped Details
Test / test_nfs (push) Has been skipped Details
2024-11-23 13:57:05 +03:00
Vitaliy Filippov d84b84f58d Fix new backfillfull feature, add more logs
Test / test_rebalance_verify_ec_imm (push) Successful in 1m41s Details
Test / test_write_no_same (push) Successful in 9s Details
Test / test_write (push) Successful in 31s Details
Test / test_switch_primary (push) Successful in 34s Details
Test / test_write_xor (push) Successful in 35s Details
Test / test_heal_pg_size_2 (push) Successful in 2m18s Details
Test / test_heal_ec (push) Successful in 2m19s Details
Test / test_heal_antietcd (push) Successful in 2m17s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m31s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m17s Details
Test / test_heal_csum_32k (push) Successful in 2m21s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m21s Details
Test / test_resize_auto (push) Successful in 9s Details
Test / test_resize (push) Successful in 14s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m18s Details
Test / test_osd_tags (push) Successful in 10s Details
Test / test_snapshot_pool2 (push) Successful in 15s Details
Test / test_enospc (push) Successful in 11s Details
Test / test_enospc_imm (push) Successful in 9s Details
Test / test_enospc_xor (push) Successful in 15s Details
Test / test_enospc_imm_xor (push) Successful in 14s Details
Test / test_scrub (push) Successful in 14s Details
Test / test_scrub_zero_osd_2 (push) Successful in 15s Details
Test / test_scrub_xor (push) Successful in 14s Details
Test / test_scrub_pg_size_3 (push) Successful in 17s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 15s Details
Test / test_scrub_ec (push) Successful in 16s Details
Test / test_nfs (push) Successful in 12s Details
Test / test_heal_csum_4k (push) Successful in 2m17s Details
Test / test_etcd_fail_antietcd (push) Successful in 42s Details
2024-11-23 01:08:13 +03:00
Vitaliy Filippov 8cfe705d7a Map netlink after forking to show correct PID in vitastor-nbd ls
Test / test_rebalance_verify_ec (push) Successful in 1m35s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 1m36s Details
Test / test_write_no_same (push) Successful in 9s Details
Test / test_switch_primary (push) Successful in 33s Details
Test / test_write (push) Successful in 31s Details
Test / test_write_xor (push) Successful in 35s Details
Test / test_heal_pg_size_2 (push) Successful in 2m16s Details
Test / test_heal_ec (push) Successful in 2m18s Details
Test / test_heal_antietcd (push) Successful in 2m16s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m19s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m25s Details
Test / test_heal_csum_32k (push) Successful in 2m20s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m19s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m18s Details
Test / test_resize_auto (push) Successful in 8s Details
Test / test_resize (push) Successful in 14s Details
Test / test_osd_tags (push) Successful in 8s Details
Test / test_snapshot_pool2 (push) Successful in 15s Details
Test / test_enospc (push) Successful in 11s Details
Test / test_enospc_xor (push) Successful in 12s Details
Test / test_enospc_imm (push) Successful in 10s Details
Test / test_enospc_imm_xor (push) Successful in 14s Details
Test / test_scrub (push) Successful in 15s Details
Test / test_scrub_zero_osd_2 (push) Successful in 14s Details
Test / test_scrub_xor (push) Successful in 15s Details
Test / test_scrub_pg_size_3 (push) Successful in 16s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 17s Details
Test / test_scrub_ec (push) Successful in 15s Details
Test / test_nfs (push) Successful in 11s Details
Test / test_heal_csum_4k (push) Successful in 2m19s Details
2024-11-23 00:46:44 +03:00
Vitaliy Filippov 66c9271cbd Radically simplify create-pool pg_size check
Test / test_rebalance_verify_ec (push) Successful in 1m38s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 1m40s Details
Test / test_write_no_same (push) Successful in 10s Details
Test / test_write (push) Successful in 32s Details
Test / test_switch_primary (push) Successful in 34s Details
Test / test_write_xor (push) Successful in 35s Details
Test / test_heal_pg_size_2 (push) Successful in 2m17s Details
Test / test_heal_ec (push) Successful in 2m16s Details
Test / test_heal_antietcd (push) Successful in 2m18s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m19s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m20s Details
Test / test_heal_csum_32k (push) Successful in 2m18s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m20s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m20s Details
Test / test_resize_auto (push) Successful in 9s Details
Test / test_resize (push) Successful in 15s Details
Test / test_osd_tags (push) Successful in 10s Details
Test / test_snapshot_pool2 (push) Successful in 15s Details
Test / test_enospc (push) Successful in 11s Details
Test / test_enospc_xor (push) Successful in 13s Details
Test / test_enospc_imm (push) Successful in 12s Details
Test / test_enospc_imm_xor (push) Successful in 13s Details
Test / test_scrub (push) Successful in 16s Details
Test / test_scrub_zero_osd_2 (push) Successful in 14s Details
Test / test_scrub_xor (push) Successful in 15s Details
Test / test_scrub_pg_size_3 (push) Successful in 16s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 15s Details
Test / test_scrub_ec (push) Successful in 16s Details
Test / test_nfs (push) Successful in 12s Details
Test / test_heal_csum_4k (push) Successful in 2m18s Details
2024-11-22 01:44:14 +03:00
Vitaliy Filippov 7b37ba921d Pause pool rebalance when monitor detects that it can lead to any OSD becoming full
Test / test_root_node (push) Successful in 10s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 1m39s Details
Test / test_write_no_same (push) Successful in 9s Details
Test / test_switch_primary (push) Successful in 32s Details
Test / test_write (push) Successful in 32s Details
Test / test_write_xor (push) Successful in 37s Details
Test / test_heal_pg_size_2 (push) Successful in 2m17s Details
Test / test_heal_ec (push) Successful in 2m18s Details
Test / test_heal_antietcd (push) Successful in 2m17s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m20s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m27s Details
Test / test_heal_csum_32k (push) Successful in 2m17s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m19s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m19s Details
Test / test_resize_auto (push) Successful in 9s Details
Test / test_resize (push) Successful in 14s Details
Test / test_osd_tags (push) Successful in 8s Details
Test / test_snapshot_pool2 (push) Failing after 11s Details
Test / test_enospc (push) Successful in 11s Details
Test / test_enospc_imm (push) Successful in 11s Details
Test / test_enospc_xor (push) Successful in 13s Details
Test / test_enospc_imm_xor (push) Successful in 14s Details
Test / test_scrub (push) Successful in 14s Details
Test / test_scrub_zero_osd_2 (push) Successful in 15s Details
Test / test_scrub_xor (push) Successful in 15s Details
Test / test_scrub_pg_size_3 (push) Successful in 16s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 16s Details
Test / test_scrub_ec (push) Successful in 14s Details
Test / test_nfs (push) Successful in 12s Details
Test / test_heal_csum_4k (push) Successful in 2m19s Details
2024-11-22 01:01:07 +03:00
Vitaliy Filippov 262c581400 Fix create-pool for the case of hosts split into sub-nodes
Test / test_rebalance_verify_ec (push) Successful in 1m38s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 1m40s Details
Test / test_switch_primary (push) Successful in 23s Details
Test / test_write_no_same (push) Successful in 9s Details
Test / test_write (push) Successful in 31s Details
Test / test_write_xor (push) Successful in 35s Details
Test / test_heal_pg_size_2 (push) Successful in 2m18s Details
Test / test_heal_ec (push) Successful in 2m16s Details
Test / test_heal_antietcd (push) Successful in 2m17s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m19s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m19s Details
Test / test_heal_csum_32k (push) Successful in 2m20s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m19s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m17s Details
Test / test_resize_auto (push) Successful in 9s Details
Test / test_resize (push) Successful in 16s Details
Test / test_snapshot_pool2 (push) Failing after 11s Details
Test / test_osd_tags (push) Successful in 7s Details
Test / test_enospc (push) Successful in 12s Details
Test / test_enospc_imm (push) Successful in 11s Details
Test / test_enospc_xor (push) Successful in 14s Details
Test / test_enospc_imm_xor (push) Successful in 13s Details
Test / test_scrub (push) Successful in 13s Details
Test / test_scrub_zero_osd_2 (push) Successful in 13s Details
Test / test_scrub_xor (push) Successful in 14s Details
Test / test_scrub_pg_size_3 (push) Successful in 17s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 17s Details
Test / test_scrub_ec (push) Successful in 14s Details
Test / test_nfs (push) Successful in 11s Details
Test / test_heal_csum_4k (push) Successful in 2m14s Details
2024-11-22 01:01:07 +03:00
Vitaliy Filippov ad3b6b7267 Add a note about GID and RDMA device auto-selection 2024-11-21 23:54:05 +03:00
Vitaliy Filippov 1f6a061283 Move ibv_query_gid under #ifdef to only build it with libibverbs 32+
Test / test_rebalance_verify_ec (push) Successful in 1m39s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 1m39s Details
Test / test_write_no_same (push) Successful in 13s Details
Test / test_switch_primary (push) Successful in 33s Details
Test / test_write (push) Successful in 35s Details
Test / test_write_xor (push) Successful in 37s Details
Test / test_heal_pg_size_2 (push) Successful in 2m17s Details
Test / test_heal_ec (push) Successful in 2m19s Details
Test / test_heal_antietcd (push) Successful in 2m19s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m18s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m21s Details
Test / test_heal_csum_32k (push) Successful in 2m21s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m18s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m19s Details
Test / test_resize (push) Successful in 13s Details
Test / test_resize_auto (push) Successful in 9s Details
Test / test_osd_tags (push) Successful in 8s Details
Test / test_snapshot_pool2 (push) Successful in 14s Details
Test / test_enospc (push) Successful in 11s Details
Test / test_enospc_xor (push) Successful in 12s Details
Test / test_enospc_imm (push) Successful in 10s Details
Test / test_enospc_imm_xor (push) Successful in 14s Details
Test / test_scrub_zero_osd_2 (push) Successful in 12s Details
Test / test_scrub (push) Successful in 14s Details
Test / test_scrub_xor (push) Successful in 14s Details
Test / test_scrub_pg_size_3 (push) Successful in 15s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 17s Details
Test / test_scrub_ec (push) Successful in 15s Details
Test / test_nfs (push) Successful in 13s Details
Test / test_heal_csum_4k (push) Successful in 2m26s Details
2024-11-21 23:47:57 +03:00
Vitaliy Filippov fc4d97da10 Print "Waiting to become master" just once
Test / test_rebalance_verify_ec (push) Successful in 1m38s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 1m40s Details
Test / test_write_no_same (push) Successful in 8s Details
Test / test_write (push) Successful in 30s Details
Test / test_switch_primary (push) Successful in 33s Details
Test / test_write_xor (push) Successful in 36s Details
Test / test_heal_pg_size_2 (push) Successful in 2m15s Details
Test / test_heal_ec (push) Successful in 2m16s Details
Test / test_heal_antietcd (push) Successful in 2m17s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m19s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m26s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m18s Details
Test / test_heal_csum_32k (push) Successful in 2m22s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m20s Details
Test / test_resize_auto (push) Successful in 9s Details
Test / test_resize (push) Successful in 14s Details
Test / test_osd_tags (push) Successful in 9s Details
Test / test_enospc (push) Successful in 11s Details
Test / test_snapshot_pool2 (push) Successful in 16s Details
Test / test_enospc_xor (push) Successful in 12s Details
Test / test_enospc_imm (push) Successful in 12s Details
Test / test_enospc_imm_xor (push) Successful in 13s Details
Test / test_scrub (push) Successful in 15s Details
Test / test_scrub_zero_osd_2 (push) Successful in 14s Details
Test / test_scrub_xor (push) Successful in 15s Details
Test / test_scrub_pg_size_3 (push) Successful in 16s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 15s Details
Test / test_scrub_ec (push) Successful in 15s Details
Test / test_nfs (push) Successful in 13s Details
Test / test_heal_csum_4k (push) Successful in 2m18s Details
2024-11-21 00:55:22 +03:00
Vitaliy Filippov c7a4ce7341 Take out_size from dd oimg if not specified
Test / test_rebalance_verify_ec (push) Successful in 1m41s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 1m43s Details
Test / test_write_no_same (push) Successful in 9s Details
Test / test_write (push) Successful in 32s Details
Test / test_switch_primary (push) Successful in 34s Details
Test / test_write_xor (push) Successful in 36s Details
Test / test_heal_pg_size_2 (push) Successful in 2m15s Details
Test / test_heal_ec (push) Successful in 2m19s Details
Test / test_heal_antietcd (push) Successful in 2m17s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m18s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m18s Details
Test / test_heal_csum_32k (push) Successful in 2m19s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m17s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m20s Details
Test / test_resize_auto (push) Successful in 9s Details
Test / test_resize (push) Successful in 13s Details
Test / test_osd_tags (push) Successful in 8s Details
Test / test_snapshot_pool2 (push) Successful in 14s Details
Test / test_enospc (push) Successful in 12s Details
Test / test_enospc_xor (push) Successful in 12s Details
Test / test_enospc_imm (push) Successful in 11s Details
Test / test_enospc_imm_xor (push) Successful in 12s Details
Test / test_scrub (push) Successful in 14s Details
Test / test_scrub_zero_osd_2 (push) Successful in 13s Details
Test / test_scrub_xor (push) Successful in 14s Details
Test / test_scrub_pg_size_3 (push) Successful in 16s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 17s Details
Test / test_scrub_ec (push) Successful in 14s Details
Test / test_nfs (push) Successful in 12s Details
Test / test_heal_csum_4k (push) Successful in 2m8s Details
2024-11-19 02:13:34 +03:00
Vitaliy Filippov ddea31d86d Auto-select first RDMA device only if RoCE is not found, add rocev2->rocev1->ib priority
Test / test_rebalance_verify_ec (push) Successful in 1m43s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 1m43s Details
Test / test_write_no_same (push) Successful in 10s Details
Test / test_write (push) Successful in 29s Details
Test / test_switch_primary (push) Successful in 32s Details
Test / test_write_xor (push) Successful in 35s Details
Test / test_heal_pg_size_2 (push) Successful in 2m17s Details
Test / test_heal_ec (push) Successful in 2m15s Details
Test / test_heal_antietcd (push) Successful in 2m17s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m19s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m19s Details
Test / test_heal_csum_32k (push) Successful in 2m17s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m19s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m17s Details
Test / test_resize (push) Successful in 13s Details
Test / test_resize_auto (push) Successful in 9s Details
Test / test_osd_tags (push) Successful in 9s Details
Test / test_snapshot_pool2 (push) Successful in 16s Details
Test / test_enospc (push) Successful in 11s Details
Test / test_enospc_xor (push) Successful in 12s Details
Test / test_enospc_imm (push) Successful in 10s Details
Test / test_enospc_imm_xor (push) Successful in 13s Details
Test / test_scrub_zero_osd_2 (push) Successful in 14s Details
Test / test_scrub (push) Successful in 17s Details
Test / test_scrub_xor (push) Successful in 15s Details
Test / test_scrub_pg_size_3 (push) Successful in 17s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 15s Details
Test / test_scrub_ec (push) Successful in 14s Details
Test / test_nfs (push) Successful in 12s Details
Test / test_heal_csum_4k (push) Successful in 2m19s Details
2024-11-19 01:54:00 +03:00
23 changed files with 423 additions and 227 deletions

View File

@ -288,6 +288,24 @@ jobs:
echo ""
done
test_create_halfhost:
runs-on: ubuntu-latest
needs: build
container: ${{env.TEST_IMAGE}}:${{github.sha}}
steps:
- name: Run test
id: test
timeout-minutes: 3
run: /root/vitastor/tests/test_create_halfhost.sh
- name: Print logs
if: always() && steps.test.outcome == 'failure'
run: |
for i in /root/vitastor/testdata/*.log /root/vitastor/testdata/*.txt; do
echo "-------- $i --------"
cat $i
echo ""
done
test_failure_domain:
runs-on: ubuntu-latest
needs: build

View File

@ -68,11 +68,17 @@ but they are not connected to the cluster.
- Type: string
RDMA device name to use for Vitastor OSD communications (for example,
"rocep5s0f0"). Now Vitastor supports all adapters, even ones without
ODP support, like Mellanox ConnectX-3 and non-Mellanox cards.
"rocep5s0f0"). If not specified, Vitastor will try to find an RoCE
device matching [osd_network](osd.en.md#osd_network), preferring RoCEv2,
or choose the first available RDMA device if no RoCE devices are
found or if `osd_network` is not specified. Auto-selection is also
unsupported with old libibverbs < v32, like in Debian 10 Buster or
CentOS 7.
Versions up to Vitastor 1.2.0 required ODP which is only present in
Mellanox ConnectX >= 4. See also [rdma_odp](#rdma_odp).
Vitastor supports all adapters, even ones without ODP support, like
Mellanox ConnectX-3 and non-Mellanox cards. Versions up to Vitastor
1.2.0 required ODP which is only present in Mellanox ConnectX >= 4.
See also [rdma_odp](#rdma_odp).
Run `ibv_devinfo -v` as root to list available RDMA devices and their
features.
@ -95,15 +101,17 @@ your device has.
## rdma_gid_index
- Type: integer
- Default: 0
Global address identifier index of the RDMA device to use. Different GID
indexes may correspond to different protocols like RoCEv1, RoCEv2 and iWARP.
Search for "GID" in `ibv_devinfo -v` output to determine which GID index
you need.
**IMPORTANT:** If you want to use RoCEv2 (as recommended) then the correct
rdma_gid_index is usually 1 (IPv6) or 3 (IPv4).
If not specified, Vitastor will try to auto-select a RoCEv2 IPv4 GID, then
RoCEv2 IPv6 GID, then RoCEv1 IPv4 GID, then RoCEv1 IPv6 GID, then IB GID.
GID auto-selection is unsupported with libibverbs < v32.
A correct rdma_gid_index for RoCEv2 is usually 1 (IPv6) or 3 (IPv4).
## rdma_mtu

View File

@ -71,12 +71,17 @@ RDMA может быть нужно только если у клиентов е
- Тип: строка
Название RDMA-устройства для связи с Vitastor OSD (например, "rocep5s0f0").
Сейчас Vitastor поддерживает все модели адаптеров, включая те, у которых
нет поддержки ODP, то есть вы можете использовать RDMA с ConnectX-3 и
картами производства не Mellanox.
Если не указано, Vitastor попробует найти RoCE-устройство, соответствующее
[osd_network](osd.en.md#osd_network), предпочитая RoCEv2, или выбрать первое
попавшееся RDMA-устройство, если RoCE-устройств нет или если сеть `osd_network`
не задана. Также автовыбор не поддерживается со старыми версиями библиотеки
libibverbs < v32, например в Debian 10 Buster или CentOS 7.
Версии Vitastor до 1.2.0 включительно требовали ODP, который есть только
на Mellanox ConnectX 4 и более новых. См. также [rdma_odp](#rdma_odp).
Vitastor поддерживает все модели адаптеров, включая те, у которых
нет поддержки ODP, то есть вы можете использовать RDMA с ConnectX-3 и
картами производства не Mellanox. Версии Vitastor до 1.2.0 включительно
требовали ODP, который есть только на Mellanox ConnectX 4 и более новых.
См. также [rdma_odp](#rdma_odp).
Запустите `ibv_devinfo -v` от имени суперпользователя, чтобы посмотреть
список доступных RDMA-устройств, их параметры и возможности.
@ -101,15 +106,18 @@ Control) и ECN (Explicit Congestion Notification).
## rdma_gid_index
- Тип: целое число
- Значение по умолчанию: 0
Номер глобального идентификатора адреса RDMA-устройства, который следует
использовать. Разным gid_index могут соответствовать разные протоколы связи:
RoCEv1, RoCEv2, iWARP. Чтобы понять, какой нужен вам - смотрите строчки со
словом "GID" в выводе команды `ibv_devinfo -v`.
**ВАЖНО:** Если вы хотите использовать RoCEv2 (как мы и рекомендуем), то
правильный rdma_gid_index, как правило, 1 (IPv6) или 3 (IPv4).
Если не указан, Vitastor попробует автоматически выбрать сначала GID,
соответствующий RoCEv2 IPv4, потом RoCEv2 IPv6, потом RoCEv1 IPv4, потом
RoCEv1 IPv6, потом IB. Авто-выбор GID не поддерживается со старыми версиями
libibverbs < v32.
Правильный rdma_gid_index для RoCEv2, как правило, 1 (IPv6) или 3 (IPv4).
## rdma_mtu

View File

@ -48,11 +48,17 @@
type: string
info: |
RDMA device name to use for Vitastor OSD communications (for example,
"rocep5s0f0"). Now Vitastor supports all adapters, even ones without
ODP support, like Mellanox ConnectX-3 and non-Mellanox cards.
"rocep5s0f0"). If not specified, Vitastor will try to find an RoCE
device matching [osd_network](osd.en.md#osd_network), preferring RoCEv2,
or choose the first available RDMA device if no RoCE devices are
found or if `osd_network` is not specified. Auto-selection is also
unsupported with old libibverbs < v32, like in Debian 10 Buster or
CentOS 7.
Versions up to Vitastor 1.2.0 required ODP which is only present in
Mellanox ConnectX >= 4. See also [rdma_odp](#rdma_odp).
Vitastor supports all adapters, even ones without ODP support, like
Mellanox ConnectX-3 and non-Mellanox cards. Versions up to Vitastor
1.2.0 required ODP which is only present in Mellanox ConnectX >= 4.
See also [rdma_odp](#rdma_odp).
Run `ibv_devinfo -v` as root to list available RDMA devices and their
features.
@ -64,12 +70,17 @@
PFC (Priority Flow Control) and ECN (Explicit Congestion Notification).
info_ru: |
Название RDMA-устройства для связи с Vitastor OSD (например, "rocep5s0f0").
Сейчас Vitastor поддерживает все модели адаптеров, включая те, у которых
нет поддержки ODP, то есть вы можете использовать RDMA с ConnectX-3 и
картами производства не Mellanox.
Если не указано, Vitastor попробует найти RoCE-устройство, соответствующее
[osd_network](osd.en.md#osd_network), предпочитая RoCEv2, или выбрать первое
попавшееся RDMA-устройство, если RoCE-устройств нет или если сеть `osd_network`
не задана. Также автовыбор не поддерживается со старыми версиями библиотеки
libibverbs < v32, например в Debian 10 Buster или CentOS 7.
Версии Vitastor до 1.2.0 включительно требовали ODP, который есть только
на Mellanox ConnectX 4 и более новых. См. также [rdma_odp](#rdma_odp).
Vitastor поддерживает все модели адаптеров, включая те, у которых
нет поддержки ODP, то есть вы можете использовать RDMA с ConnectX-3 и
картами производства не Mellanox. Версии Vitastor до 1.2.0 включительно
требовали ODP, который есть только на Mellanox ConnectX 4 и более новых.
См. также [rdma_odp](#rdma_odp).
Запустите `ibv_devinfo -v` от имени суперпользователя, чтобы посмотреть
список доступных RDMA-устройств, их параметры и возможности.
@ -94,23 +105,29 @@
`ibv_devinfo -v`.
- name: rdma_gid_index
type: int
default: 0
info: |
Global address identifier index of the RDMA device to use. Different GID
indexes may correspond to different protocols like RoCEv1, RoCEv2 and iWARP.
Search for "GID" in `ibv_devinfo -v` output to determine which GID index
you need.
**IMPORTANT:** If you want to use RoCEv2 (as recommended) then the correct
rdma_gid_index is usually 1 (IPv6) or 3 (IPv4).
If not specified, Vitastor will try to auto-select a RoCEv2 IPv4 GID, then
RoCEv2 IPv6 GID, then RoCEv1 IPv4 GID, then RoCEv1 IPv6 GID, then IB GID.
GID auto-selection is unsupported with libibverbs < v32.
A correct rdma_gid_index for RoCEv2 is usually 1 (IPv6) or 3 (IPv4).
info_ru: |
Номер глобального идентификатора адреса RDMA-устройства, который следует
использовать. Разным gid_index могут соответствовать разные протоколы связи:
RoCEv1, RoCEv2, iWARP. Чтобы понять, какой нужен вам - смотрите строчки со
словом "GID" в выводе команды `ibv_devinfo -v`.
**ВАЖНО:** Если вы хотите использовать RoCEv2 (как мы и рекомендуем), то
правильный rdma_gid_index, как правило, 1 (IPv6) или 3 (IPv4).
Если не указан, Vitastor попробует автоматически выбрать сначала GID,
соответствующий RoCEv2 IPv4, потом RoCEv2 IPv6, потом RoCEv1 IPv4, потом
RoCEv1 IPv6, потом IB. Авто-выбор GID не поддерживается со старыми версиями
libibverbs < v32.
Правильный rdma_gid_index для RoCEv2, как правило, 1 (IPv6) или 3 (IPv4).
- name: rdma_mtu
type: int
default: 4096

View File

@ -232,6 +232,7 @@ class EtcdAdapter
async become_master()
{
const state = { ...this.mon.get_mon_state(), id: ''+this.mon.etcd_lease_id };
console.log('Waiting to become master');
// eslint-disable-next-line no-constant-condition
while (1)
{
@ -243,7 +244,6 @@ class EtcdAdapter
{
break;
}
console.log('Waiting to become master');
await new Promise(ok => setTimeout(ok, this.mon.config.etcd_start_timeout));
}
console.log('Became master');

View File

@ -56,6 +56,7 @@ const etcd_tree = {
osd_out_time: 600, // seconds. min: 0
placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
use_old_pg_combinator: false,
osd_backfillfull_ratio: 0.99,
// client and osd
tcp_header_buffer_size: 65536,
use_sync_send_recv: false,

View File

@ -74,6 +74,7 @@ class Mon
this.state = JSON.parse(JSON.stringify(etcd_tree));
this.prev_stats = { osd_stats: {}, osd_diff: {} };
this.recheck_pgs_active = false;
this.updating_total_stats = false;
this.watcher_active = false;
this.old_pg_config = false;
this.old_pg_stats_seen = false;
@ -658,7 +659,19 @@ class Mon
this.etcd_watch_revision, pool_id, up_osds, osd_tree, real_prev_pgs, pool_res.pgs, pg_history);
}
new_pg_config.hash = tree_hash;
return await this.save_pg_config(new_pg_config, etcd_request);
const { backfillfull_pools, backfillfull_osds } = sum_object_counts(
{ ...this.state, pg: { ...this.state.pg, config: new_pg_config } }, this.config
);
if (backfillfull_pools.join(',') != ((this.state.pg.config||{}).backfillfull_pools||[]).join(','))
{
this.log_backfillfull(backfillfull_osds, backfillfull_pools);
}
new_pg_config.backfillfull_pools = backfillfull_pools.length ? backfillfull_pools : undefined;
if (!await this.save_pg_config(new_pg_config, etcd_request))
{
return false;
}
return true;
}
async save_pg_config(new_pg_config, etcd_request = { compare: [], success: [] })
@ -730,7 +743,7 @@ class Mon
async update_total_stats()
{
const txn = [];
const { object_counts, object_bytes } = sum_object_counts(this.state, this.config);
const { object_counts, object_bytes, backfillfull_pools, backfillfull_osds } = sum_object_counts(this.state, this.config);
let stats = sum_op_stats(this.state.osd, this.prev_stats);
let { inode_stats, seen_pools } = sum_inode_stats(this.state, this.prev_stats);
stats.object_counts = object_counts;
@ -783,6 +796,27 @@ class Mon
{
await this.etcd.etcd_call('/kv/txn', { success: txn }, this.config.etcd_mon_timeout, 0);
}
if (!this.recheck_pgs_active &&
backfillfull_pools.join(',') != ((this.state.pg.config||{}).backfillfull_pools||[]).join(','))
{
this.log_backfillfull(backfillfull_osds, backfillfull_pools);
const new_pg_config = { ...this.state.pg.config, backfillfull_pools: backfillfull_pools.length ? backfillfull_pools : undefined };
await this.save_pg_config(new_pg_config);
}
}
log_backfillfull(osds, pools)
{
for (const osd in osds)
{
const bf = osds[osd];
console.log('OSD '+osd+' may fill up during rebalance: capacity '+(bf.cap/1024n/1024n)+
' MB, target user data '+(bf.clean/1024n/1024n)+' MB');
}
console.log(
(pools.length ? 'Pool(s) '+pools.join(', ') : 'No pools')+
' are backfillfull now, applying rebalance configuration'
);
}
schedule_update_stats()
@ -794,7 +828,21 @@ class Mon
this.stats_timer = setTimeout(() =>
{
this.stats_timer = null;
this.update_total_stats().catch(console.error);
if (this.updating_total_stats)
{
this.schedule_update_stats();
return;
}
this.updating_total_stats = true;
try
{
this.update_total_stats().catch(console.error);
}
catch (e)
{
console.error(e);
}
this.updating_total_stats = false;
}, this.config.mon_stats_timeout);
}

View File

@ -109,6 +109,8 @@ function sum_object_counts(state, global_config)
pgstats[pool_id] = { ...(state.pg.stats[pool_id] || {}), ...(pgstats[pool_id] || {}) };
}
}
const pool_per_osd = {};
const clean_per_osd = {};
for (const pool_id in pgstats)
{
let object_size = 0;
@ -143,10 +145,45 @@ function sum_object_counts(state, global_config)
object_bytes[k] += BigInt(st[k+'_count']) * object_size;
}
}
if (st.object_count)
{
for (const pg_osd of (((state.pg.config.items||{})[pool_id]||{})[pg_num]||{}).osd_set||[])
{
if (!(pg_osd in clean_per_osd))
{
clean_per_osd[pg_osd] = 0n;
}
clean_per_osd[pg_osd] += BigInt(st.object_count);
pool_per_osd[pg_osd] = pool_per_osd[pg_osd]||{};
pool_per_osd[pg_osd][pool_id] = true;
}
}
}
}
}
return { object_counts, object_bytes };
// If clean_per_osd[osd] is larger than osd capacity then it will fill up during rebalance
let backfillfull_pools = {};
let backfillfull_osds = {};
for (const osd in clean_per_osd)
{
const st = state.osd.stats[osd];
if (!st || !st.size || !st.data_block_size)
{
continue;
}
let cap = BigInt(st.size)/BigInt(st.data_block_size);
cap = cap * BigInt((global_config.osd_backfillfull_ratio||0.99)*1000000) / 1000000n;
if (cap < clean_per_osd[osd])
{
backfillfull_osds[osd] = { cap: BigInt(st.size), clean: clean_per_osd[osd]*BigInt(st.data_block_size) };
for (const pool_id in pool_per_osd[osd])
{
backfillfull_pools[pool_id] = true;
}
}
}
backfillfull_pools = Object.keys(backfillfull_pools).sort();
return { object_counts, object_bytes, backfillfull_pools, backfillfull_osds };
}
// sum_inode_stats(this.state, this.prev_stats)

View File

@ -785,7 +785,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
}
for (auto & pool_item: value.object_items())
{
pool_config_t pc;
pool_config_t pc = {};
// ID
pool_id_t pool_id;
char null_byte = 0;
@ -931,12 +931,28 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
// Ignore old key if the new one is present
return;
}
for (auto & pool_id_json: value["backfillfull_pools"].array_items())
{
auto pool_id = pool_id_json.uint64_value();
auto pool_it = this->pool_config.find(pool_id);
if (pool_it != this->pool_config.end())
{
pool_it->second.backfillfull |= 2;
}
}
for (auto & pool_item: this->pool_config)
{
for (auto & pg_item: pool_item.second.pg_config)
{
pg_item.second.config_exists = false;
}
// 3 = was 1 and became 1, 0 = was 0 and became 0
if (pool_item.second.backfillfull == 2 || pool_item.second.backfillfull == 1)
{
if (on_change_backfillfull_hook)
on_change_backfillfull_hook(pool_item.first);
}
pool_item.second.backfillfull = pool_item.second.backfillfull >> 1;
}
for (auto & pool_item: value["items"].object_items())
{

View File

@ -62,6 +62,7 @@ struct pool_config_t
std::map<pg_num_t, pg_config_t> pg_config;
uint64_t scrub_interval;
std::string used_for_fs;
int backfillfull;
};
struct inode_config_t
@ -131,6 +132,7 @@ public:
std::function<json11::Json()> load_pgs_checks_hook;
std::function<void(bool)> on_load_pgs_hook;
std::function<void()> on_change_pool_config_hook;
std::function<void(pool_id_t)> on_change_backfillfull_hook;
std::function<void(pool_id_t, pg_num_t, osd_num_t)> on_change_pg_state_hook;
std::function<void(pool_id_t, pg_num_t)> on_change_pg_history_hook;
std::function<void(osd_num_t)> on_change_osd_state_hook;

View File

@ -70,6 +70,7 @@ msgr_rdma_connection_t::~msgr_rdma_connection_t()
send_out_size = 0;
}
#ifdef IBV_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT
static bool is_ipv4_gid(ibv_gid_entry *gidx)
{
return (((uint64_t*)gidx->gid.raw)[0] == 0 &&
@ -131,6 +132,7 @@ static matched_dev match_device(ibv_device **dev_list, addr_mask_t *networks, in
ibv_port_attr portinfo;
ibv_gid_entry best_gidx;
int res;
bool have_non_roce = false, have_roce = false;
for (int i = 0; dev_list[i]; ++i)
{
auto dev = dev_list[i];
@ -161,6 +163,11 @@ static matched_dev match_device(ibv_device **dev_list, addr_mask_t *networks, in
else
break;
}
if (gidx.gid_type != IBV_GID_TYPE_ROCE_V1 &&
gidx.gid_type != IBV_GID_TYPE_ROCE_V2)
have_non_roce = true;
else
have_roce = true;
if (match_gid(&gidx, networks, nnet))
{
// Prefer RoCEv2
@ -186,8 +193,13 @@ cleanup:
{
log_rdma_dev_port_gid(dev_list[best.dev], best.port, best.gid, best_gidx);
}
if (best.dev < 0 && have_non_roce && !have_roce)
{
best.dev = -2;
}
return best;
}
#endif
msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_networks, const char *ib_devname, uint8_t ib_port, uint8_t gid_index, uint32_t mtu, bool odp, int log_level)
{
@ -229,6 +241,7 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_ne
goto cleanup;
}
}
#ifdef IBV_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT
else if (osd_networks.size())
{
std::vector<addr_mask_t> nets;
@ -237,11 +250,17 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_ne
nets.push_back(cidr_parse(netstr));
}
auto best = match_device(dev_list, nets.data(), nets.size(), log_level);
if (best.dev < 0)
if (best.dev == -2)
{
best.dev = 0;
if (log_level > 0)
fprintf(stderr, "No RoCE devices found, using first available RDMA device %s\n", ibv_get_device_name(*dev_list));
}
else if (best.dev < 0)
{
if (log_level > 0)
fprintf(stderr, "RDMA device matching osd_network is not found, using first available device\n");
best.dev = 0;
fprintf(stderr, "RDMA device matching osd_network is not found, disabling RDMA\n");
goto cleanup;
}
else
{
@ -250,6 +269,7 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_ne
}
ctx->dev = dev_list[best.dev];
}
#endif
else
{
ctx->dev = *dev_list;
@ -275,18 +295,22 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_ne
goto cleanup;
}
#ifdef IBV_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT
if (gid_index != -1)
#endif
{
ctx->gid_index = gid_index;
if (ibv_query_gid_ex(ctx->context, ib_port, gid_index, &ctx->my_gid, 0))
ctx->gid_index = gid_index < 0 ? 0 : gid_index;
if (ibv_query_gid(ctx->context, ib_port, gid_index, &ctx->my_gid))
{
fprintf(stderr, "Couldn't read RDMA device %s GID index %d\n", ibv_get_device_name(ctx->dev), gid_index);
goto cleanup;
}
}
#ifdef IBV_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT
else
{
// Auto-guess GID
ibv_gid_entry best_gidx;
for (int k = 0; k < ctx->portinfo.gid_tbl_len; k++)
{
ibv_gid_entry gidx;
@ -301,21 +325,24 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_ne
{
continue;
}
// Prefer IPv4 RoCEv2 GID by default
// Prefer IPv4 RoCEv2 -> IPv6 RoCEv2 -> IPv4 RoCEv1 -> IPv6 RoCEv1 -> IB
if (gid_index == -1 ||
gidx.gid_type == IBV_GID_TYPE_ROCE_V2 &&
(ctx->my_gid.gid_type != IBV_GID_TYPE_ROCE_V2 || is_ipv4_gid(&gidx)))
gidx.gid_type == IBV_GID_TYPE_ROCE_V2 && best_gidx.gid_type != IBV_GID_TYPE_ROCE_V2 ||
gidx.gid_type == IBV_GID_TYPE_ROCE_V1 && best_gidx.gid_type == IBV_GID_TYPE_IB ||
gidx.gid_type == best_gidx.gid_type && is_ipv4_gid(&gidx))
{
gid_index = k;
ctx->my_gid = gidx;
best_gidx = gidx;
}
}
ctx->gid_index = gid_index = (gid_index == -1 ? 0 : gid_index);
if (log_level > 0)
{
log_rdma_dev_port_gid(ctx->dev, ctx->ib_port, ctx->gid_index, ctx->my_gid);
log_rdma_dev_port_gid(ctx->dev, ctx->ib_port, ctx->gid_index, best_gidx);
}
ctx->my_gid = best_gidx.gid;
}
#endif
ctx->pd = ibv_alloc_pd(ctx->context);
if (!ctx->pd)
@ -431,7 +458,7 @@ msgr_rdma_connection_t *msgr_rdma_connection_t::create(msgr_rdma_context_t *ctx,
}
conn->addr.lid = ctx->my_lid;
conn->addr.gid = ctx->my_gid.gid;
conn->addr.gid = ctx->my_gid;
conn->addr.qpn = conn->qp->qp_num;
conn->addr.psn = lrand48() & 0xffffff;

View File

@ -31,7 +31,7 @@ struct msgr_rdma_context_t
uint8_t ib_port;
uint8_t gid_index;
uint16_t my_lid;
ibv_gid_entry my_gid;
ibv_gid my_gid;
uint32_t mtu;
int max_cqe = 0;
int used_max_cqe = 0;

View File

@ -64,7 +64,7 @@ static void netlink_sock_alloc(struct netlink_ctx *ctx)
if (nl_driver_id < 0)
{
nl_socket_free(sk);
fail("Couldn't resolve the nbd netlink family\n");
fail("Couldn't resolve the nbd netlink family: %s (code %d)\n", nl_geterror(nl_driver_id), nl_driver_id);
}
ctx->driver_id = nl_driver_id;
@ -555,7 +555,12 @@ help:
fcntl(sockfd[0], F_SETFL, fcntl(sockfd[0], F_GETFL, 0) | O_NONBLOCK);
nbd_fd = sockfd[0];
load_module();
bool bg = cfg["foreground"].is_null();
if (cfg["logfile"].string_value() != "")
{
logfile = cfg["logfile"].string_value();
}
if (netlink)
{
@ -588,6 +593,10 @@ help:
}
close(sockfd[1]);
printf("/dev/nbd%d\n", err);
if (bg)
{
daemonize_reopen_stdio();
}
#else
fprintf(stderr, "netlink support is disabled in this build\n");
exit(1);
@ -631,14 +640,10 @@ help:
}
}
}
}
if (cfg["logfile"].string_value() != "")
{
logfile = cfg["logfile"].string_value();
}
if (bg)
{
daemonize();
if (bg)
{
daemonize();
}
}
// Initialize read state
read_state = CL_READ_HDR;
@ -716,13 +721,17 @@ help:
}
}
void daemonize()
void daemonize_fork()
{
if (fork())
exit(0);
setsid();
if (fork())
exit(0);
}
void daemonize_reopen_stdio()
{
close(0);
close(1);
close(2);
@ -733,6 +742,12 @@ help:
fprintf(stderr, "Warning: Failed to chdir into /\n");
}
void daemonize()
{
daemonize_fork();
daemonize_reopen_stdio();
}
json11::Json::object list_mapped()
{
const char *self_filename = exe_name;
@ -783,8 +798,9 @@ help:
if (!strcmp(pid_filename, self_filename))
{
json11::Json::object cfg = nbd_proxy::parse_args(argv.size(), argv.data());
if (cfg["command"] == "map")
if (cfg["command"] == "map" || cfg["command"] == "netlink-map")
{
cfg["interface"] = (cfg["command"] == "netlink-map") ? "netlink" : "nbd";
cfg.erase("command");
cfg["pid"] = pid;
mapped["/dev/nbd"+std::to_string(dev_num)] = cfg;

View File

@ -261,6 +261,7 @@ struct dd_out_info_t
else
{
// ok
out_size = owatch->cfg.size;
return true;
}
// Wait for sub-command
@ -882,6 +883,8 @@ resume_2:
oinfo.end_fsync = oinfo.end_fsync && oinfo.out_seekable;
read_offset = 0;
read_end = iinfo.in_seekable ? iinfo.in_size-iseek : 0;
if (oinfo.out_size && (!read_end || read_end > oinfo.out_size-oseek))
read_end = oinfo.out_size-oseek;
if (bytelimit && (!read_end || read_end > bytelimit))
read_end = bytelimit;
clock_gettime(CLOCK_REALTIME, &tv_begin);

View File

@ -35,6 +35,7 @@ struct pool_creator_t
uint64_t new_pools_mod_rev;
json11::Json state_node_tree;
json11::Json new_pools;
std::map<osd_num_t, json11::Json> osd_stats;
bool is_done() { return state == 100; }
@ -46,8 +47,6 @@ struct pool_creator_t
goto resume_2;
else if (state == 3)
goto resume_3;
else if (state == 4)
goto resume_4;
else if (state == 5)
goto resume_5;
else if (state == 6)
@ -90,13 +89,19 @@ resume_1:
// If not forced, check that we have enough osds for pg_size
if (!force)
{
// Get node_placement configuration from etcd
// Get node_placement configuration from etcd and OSD stats
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/node_placement") },
} }
} },
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/osd/stats/") },
{ "range_end", base64_encode(parent->cli->st_cli.etcd_prefix+"/osd/stats0") },
} },
},
} },
});
@ -112,10 +117,21 @@ resume_2:
return;
}
// Get state_node_tree based on node_placement and osd peer states
// Get state_node_tree based on node_placement and osd stats
{
auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
state_node_tree = get_state_node_tree(kv.value.object_items());
auto node_placement_kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
timespec tv_now;
clock_gettime(CLOCK_REALTIME, &tv_now);
uint64_t osd_out_time = parent->cli->config["osd_out_time"].uint64_value();
if (!osd_out_time)
osd_out_time = 600;
osd_stats.clear();
parent->iterate_kvs_1(parent->etcd_result["responses"][1]["response_range"]["kvs"], "/osd/stats/", [&](uint64_t cur_osd, json11::Json value)
{
if ((uint64_t)value["time"].number_value()+osd_out_time >= tv_now.tv_sec)
osd_stats[cur_osd] = value;
});
state_node_tree = get_state_node_tree(node_placement_kv.value.object_items(), osd_stats);
}
// Skip tag checks, if pool has none
@ -158,42 +174,18 @@ resume_3:
}
}
// Get stats (for block_size, bitmap_granularity, ...) of osds in state_node_tree
{
json11::Json::array osd_stats;
for (auto osd_num: state_node_tree["osds"].array_items())
{
osd_stats.push_back(json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/osd/stats/"+osd_num.as_string()) },
} }
});
}
parent->etcd_txn(json11::Json::object{ { "success", osd_stats } });
}
state = 4;
resume_4:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
// Filter osds from state_node_tree based on pool parameters and osd stats
{
std::vector<json11::Json> osd_stats;
for (auto & ocr: parent->etcd_result["responses"].array_items())
std::vector<json11::Json> filtered_osd_stats;
for (auto & osd_num: state_node_tree["osds"].array_items())
{
auto kv = parent->cli->st_cli.parse_etcd_kv(ocr["response_range"]["kvs"][0]);
osd_stats.push_back(kv.value);
auto st_it = osd_stats.find(osd_num.uint64_value());
if (st_it != osd_stats.end())
{
filtered_osd_stats.push_back(st_it->second);
}
}
guess_block_size(osd_stats);
guess_block_size(filtered_osd_stats);
state_node_tree = filter_state_node_tree_by_stats(state_node_tree, osd_stats);
}
@ -201,8 +193,7 @@ resume_4:
{
auto failure_domain = cfg["failure_domain"].string_value() == ""
? "host" : cfg["failure_domain"].string_value();
uint64_t max_pg_size = get_max_pg_size(state_node_tree["nodes"].object_items(),
failure_domain, cfg["root_node"].string_value());
uint64_t max_pg_size = get_max_pg_size(state_node_tree, failure_domain, cfg["root_node"].string_value());
if (cfg["pg_size"].uint64_value() > max_pg_size)
{
@ -358,56 +349,50 @@ resume_8:
// Returns a JSON object of form {"nodes": {...}, "osds": [...]} that
// contains: all nodes (osds, hosts, ...) based on node_placement config
// and current peer state, and a list of active peer osds.
json11::Json get_state_node_tree(json11::Json::object node_placement)
// and current osd stats.
json11::Json get_state_node_tree(json11::Json::object node_placement, std::map<osd_num_t, json11::Json> & osd_stats)
{
// Erase non-peer osd nodes from node_placement
// Erase non-existing osd nodes from node_placement
for (auto np_it = node_placement.begin(); np_it != node_placement.end();)
{
// Numeric nodes are osds
osd_num_t osd_num = stoull_full(np_it->first);
// If node is osd and it is not in peer states, erase it
if (osd_num > 0 &&
parent->cli->st_cli.peer_states.find(osd_num) == parent->cli->st_cli.peer_states.end())
{
// If node is osd and its stats do not exist, erase it
if (osd_num > 0 && osd_stats.find(osd_num) == osd_stats.end())
node_placement.erase(np_it++);
}
else
np_it++;
}
// List of peer osds
std::vector<std::string> peer_osds;
// List of osds
std::vector<std::string> existing_osds;
// Record peer osds and add missing osds/hosts to np
for (auto & ps: parent->cli->st_cli.peer_states)
// Record osds and add missing osds/hosts to np
for (auto & ps: osd_stats)
{
std::string osd_num = std::to_string(ps.first);
// Record peer osd
peer_osds.push_back(osd_num);
// Record osd
existing_osds.push_back(osd_num);
// Add osd, if necessary
if (node_placement.find(osd_num) == node_placement.end())
// Add host if necessary
std::string osd_host = ps.second["host"].as_string();
if (node_placement.find(osd_host) == node_placement.end())
{
std::string osd_host = ps.second["host"].as_string();
// Add host, if necessary
if (node_placement.find(osd_host) == node_placement.end())
{
node_placement[osd_host] = json11::Json::object {
{ "level", "host" }
};
}
node_placement[osd_num] = json11::Json::object {
{ "parent", osd_host }
node_placement[osd_host] = json11::Json::object {
{ "level", "host" }
};
}
// Add osd
node_placement[osd_num] = json11::Json::object {
{ "parent", node_placement[osd_num]["parent"].is_null() ? osd_host : node_placement[osd_num]["parent"] },
{ "level", "osd" },
};
}
return json11::Json::object { { "osds", peer_osds }, { "nodes", node_placement } };
return json11::Json::object { { "osds", existing_osds }, { "nodes", node_placement } };
}
// Returns new state_node_tree based on given state_node_tree with osds
@ -534,15 +519,13 @@ resume_8:
// filtered out by stats parameters (block_size, bitmap_granularity) in
// given osd_stats and current pool config.
// Requires: state_node_tree["osds"] must match osd_stats 1-1
json11::Json filter_state_node_tree_by_stats(const json11::Json & state_node_tree, std::vector<json11::Json> & osd_stats)
json11::Json filter_state_node_tree_by_stats(const json11::Json & state_node_tree, std::map<osd_num_t, json11::Json> & osd_stats)
{
auto & osds = state_node_tree["osds"].array_items();
// Accepted state_node_tree nodes
auto accepted_nodes = state_node_tree["nodes"].object_items();
// List of accepted osds
std::vector<std::string> accepted_osds;
json11::Json::array accepted_osds;
block_size = cfg["block_size"].uint64_value()
? cfg["block_size"].uint64_value()
@ -554,21 +537,25 @@ resume_8:
? etcd_state_client_t::parse_immediate_commit(cfg["immediate_commit"].string_value(), IMMEDIATE_ALL)
: parent->cli->st_cli.global_immediate_commit;
for (size_t i = 0; i < osd_stats.size(); i++)
for (auto osd_num_json: state_node_tree["osds"].array_items())
{
auto & os = osd_stats[i];
// Get osd number
auto osd_num = osds[i].as_string();
auto osd_num = osd_num_json.uint64_value();
auto os_it = osd_stats.find(osd_num);
if (os_it == osd_stats.end())
{
continue;
}
auto & os = os_it->second;
if (!os["data_block_size"].is_null() && os["data_block_size"] != block_size ||
!os["bitmap_granularity"].is_null() && os["bitmap_granularity"] != bitmap_granularity ||
!os["immediate_commit"].is_null() &&
etcd_state_client_t::parse_immediate_commit(os["immediate_commit"].string_value(), IMMEDIATE_NONE) < immediate_commit)
{
accepted_nodes.erase(osd_num);
accepted_nodes.erase(osd_num_json.as_string());
}
else
{
accepted_osds.push_back(osd_num);
accepted_osds.push_back(osd_num_json);
}
}
@ -576,87 +563,28 @@ resume_8:
}
// Returns maximum pg_size possible for given node_tree and failure_domain, starting at parent_node
uint64_t get_max_pg_size(json11::Json::object node_tree, const std::string & level, const std::string & parent_node)
uint64_t get_max_pg_size(json11::Json state_node_tree, const std::string & level, const std::string & root_node)
{
uint64_t max_pg_sz = 0;
std::vector<std::string> nodes;
// Check if parent node is an osd (numeric)
if (parent_node != "" && stoull_full(parent_node))
std::set<std::string> level_seen;
for (auto & osd: state_node_tree["osds"].array_items())
{
// Add it to node list if osd is in node tree
if (node_tree.find(parent_node) != node_tree.end())
nodes.push_back(parent_node);
}
// If parent node given, ...
else if (parent_node != "")
{
// ... look for children nodes of this parent
for (auto & sn: node_tree)
// find OSD parent at <level>, but stop at <root_node>
auto cur_id = osd.string_value();
auto cur = state_node_tree["nodes"][cur_id];
while (!cur.is_null())
{
auto & props = sn.second.object_items();
auto parent_prop = props.find("parent");
if (parent_prop != props.end() && (parent_prop->second.as_string() == parent_node))
if (cur["level"] == level)
{
nodes.push_back(sn.first);
// If we're not looking for all osds, we only need a single
// child osd node
if (level != "osd" && stoull_full(sn.first))
break;
level_seen.insert(cur_id);
break;
}
if (cur_id == root_node)
break;
cur_id = cur["parent"].string_value();
cur = state_node_tree["nodes"][cur_id];
}
}
// No parent node given, and we're not looking for all osds
else if (level != "osd")
{
// ... look for all level nodes
for (auto & sn: node_tree)
{
auto & props = sn.second.object_items();
auto level_prop = props.find("level");
if (level_prop != props.end() && (level_prop->second.as_string() == level))
{
nodes.push_back(sn.first);
}
}
}
// Otherwise, ...
else
{
// ... we're looking for osd nodes only
for (auto & sn: node_tree)
{
if (stoull_full(sn.first))
{
nodes.push_back(sn.first);
}
}
}
// Process gathered nodes
for (auto & node: nodes)
{
// Check for osd node, return constant max size
if (stoull_full(node))
{
max_pg_sz += 1;
}
// Otherwise, ...
else
{
// ... exclude parent node from tree, and ...
node_tree.erase(parent_node);
// ... descend onto the resulting tree
max_pg_sz += get_max_pg_size(node_tree, level, node);
}
}
return max_pg_sz;
return level_seen.size();
}
json11::Json create_pool(const etcd_kv_t & kv)

View File

@ -499,6 +499,19 @@ void nfs_proxy_t::check_default_pool()
}
}
void nfs_proxy_t::create_client()
{
auto cli = new nfs_client_t();
cli->parent = this->proxy;
if (kvfs)
nfs_kv_procs(cli);
else
nfs_block_procs(cli);
for (auto & fn: pmap.proc_table)
cli->proc_table.insert(fn);
rpc_clients.insert(cli);
}
void nfs_proxy_t::do_accept(int listen_fd)
{
struct sockaddr_storage addr;
@ -512,18 +525,8 @@ void nfs_proxy_t::do_accept(int listen_fd)
fcntl(nfs_fd, F_SETFL, fcntl(nfs_fd, F_GETFL, 0) | O_NONBLOCK);
int one = 1;
setsockopt(nfs_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
auto cli = new nfs_client_t();
if (kvfs)
nfs_kv_procs(cli);
else
nfs_block_procs(cli);
cli->parent = this;
auto cli = this->create_client();
cli->nfs_fd = nfs_fd;
for (auto & fn: pmap.proc_table)
{
cli->proc_table.insert(fn);
}
rpc_clients[nfs_fd] = cli;
epmgr->tfd->set_fd_handler(nfs_fd, true, [cli](int nfs_fd, int epoll_events)
{
// Handle incoming event
@ -781,7 +784,7 @@ void nfs_client_t::stop()
if (refs <= 0)
{
auto parent = this->parent;
parent->rpc_clients.erase(nfs_fd);
parent->rpc_clients.erase(this);
parent->active_connections--;
parent->epmgr->tfd->set_fd_handler(nfs_fd, true, NULL);
close(nfs_fd);
@ -1085,8 +1088,8 @@ void nfs_proxy_t::daemonize()
// Stop all clients because client I/O sometimes breaks during daemonize
// I.e. the new process stops receiving events on the old FD
// It doesn't happen if we call sleep(1) here, but we don't want to call sleep(1)...
for (auto & clp: rpc_clients)
clp.second->stop();
for (auto & cli: rpc_clients)
cli->stop();
if (fork())
exit(0);
setsid();

View File

@ -55,7 +55,7 @@ public:
vitastorkv_dbw_t *db = NULL;
kv_fs_state_t *kvfs = NULL;
block_fs_state_t *blockfs = NULL;
std::map<int, nfs_client_t*> rpc_clients;
std::set<nfs_client_t*> rpc_clients;
std::vector<XDR*> xdr_pool;
@ -72,6 +72,7 @@ public:
void watch_stats();
void parse_stats(etcd_kv_t & kv);
void check_default_pool();
nfs_client_t* create_client();
void do_accept(int listen_fd);
void daemonize();
void write_pid();
@ -101,6 +102,8 @@ struct rpc_free_buffer_t
unsigned size;
};
struct nfs_rdma_conn_t;
class nfs_client_t
{
public:
@ -108,6 +111,7 @@ public:
int refs = 0;
bool stopped = false;
std::set<rpc_service_proc_t> proc_table;
nfs_rdma_conn_t *rdma_conn = NULL;
// <TCP>
int nfs_fd;

View File

@ -45,6 +45,7 @@ struct nfs_rdma_context_t
int max_send = 8, max_recv = 8; --- FIXME max_send and max_recv should probably be equal
uint64_t max_send_size = 256*1024, max_recv_size = 256*1024;
nfs_proxy_t *proxy = NULL;
epoll_manager_t *epmgr = NULL;
int max_cqe = 0, used_max_cqe = 0;
@ -76,6 +77,7 @@ struct nfs_rdma_conn_t
nfs_rdma_context_t* nfs_proxy_t::create_rdma(const std::string & bind_address, int rdmacm_port)
{
nfs_rdma_context_t* self = new nfs_rdma_context_t;
self->proxy = this;
self->epmgr = epmgr;
self->bind_address = bind_address;
self->rdmacm_port = rdmacm_port;
@ -301,6 +303,8 @@ void nfs_rdma_context_t::rdmacm_accept(rdma_cm_event *ev)
conn->id = ctx->id;
rdma_connections[ctx->id] = conn;
rdma_connections_by_qp[conn->id->qp->qp_num];
auto cli = this->proxy->create_client();
cli->rdma_conn = conn;
}
}

View File

@ -226,6 +226,7 @@ class osd_t
void parse_config(bool init);
void init_cluster();
void on_change_osd_state_hook(osd_num_t peer_osd);
void on_change_backfillfull_hook(pool_id_t pool_id);
void on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num);
void on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes);
void on_load_config_hook(json11::Json::object & changes);

View File

@ -65,6 +65,7 @@ void osd_t::init_cluster()
st_cli.tfd = tfd;
st_cli.log_level = log_level;
st_cli.on_change_osd_state_hook = [this](osd_num_t peer_osd) { on_change_osd_state_hook(peer_osd); };
st_cli.on_change_backfillfull_hook = [this](pool_id_t pool_id) { on_change_backfillfull_hook(pool_id); };
st_cli.on_change_pg_history_hook = [this](pool_id_t pool_id, pg_num_t pg_num) { on_change_pg_history_hook(pool_id, pg_num); };
st_cli.on_change_hook = [this](std::map<std::string, etcd_kv_t> & changes) { on_change_etcd_state_hook(changes); };
st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); };
@ -414,6 +415,14 @@ void osd_t::on_change_osd_state_hook(osd_num_t peer_osd)
}
}
void osd_t::on_change_backfillfull_hook(pool_id_t pool_id)
{
if (!(peering_state & (OSD_RECOVERING | OSD_FLUSHING_PGS)))
{
peering_state = peering_state | OSD_RECOVERING;
}
}
void osd_t::on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes)
{
if (changes.find(st_cli.etcd_prefix+"/config/global") != changes.end())

View File

@ -252,10 +252,18 @@ bool osd_t::pick_next_recovery(osd_recovery_op_t &op)
auto mask = recovery_last_degraded ? (PG_ACTIVE | PG_HAS_DEGRADED) : (PG_ACTIVE | PG_DEGRADED | PG_HAS_MISPLACED);
auto check = recovery_last_degraded ? (PG_ACTIVE | PG_HAS_DEGRADED) : (PG_ACTIVE | PG_HAS_MISPLACED);
// Restart scanning from the same PG as the last time
restart:
for (auto pg_it = pgs.lower_bound(recovery_last_pg); pg_it != pgs.end(); pg_it++)
{
if ((pg_it->second.state & mask) == check)
{
auto pool_it = st_cli.pool_config.find(pg_it->first.pool_id);
if (pool_it != st_cli.pool_config.end() && pool_it->second.backfillfull)
{
// Skip the pool
recovery_last_pg.pool_id++;
goto restart;
}
auto & src = recovery_last_degraded ? pg_it->second.degraded_objects : pg_it->second.misplaced_objects;
assert(src.size() > 0);
// Restart scanning from the next object

View File

@ -19,9 +19,12 @@ ANTIETCD=1 ./test_etcd_fail.sh
./test_interrupted_rebalance.sh
IMMEDIATE_COMMIT=1 ./test_interrupted_rebalance.sh
SCHEME=ec ./test_interrupted_rebalance.sh
SCHEME=ec IMMEDIATE_COMMIT=1 ./test_interrupted_rebalance.sh
./test_create_halfhost.sh
./test_failure_domain.sh
./test_snapshot.sh

35
tests/test_create_halfhost.sh Executable file
View File

@ -0,0 +1,35 @@
#!/bin/bash -ex
. `dirname $0`/common.sh
node mon/mon-main.js $MON_PARAMS --etcd_address $ETCD_URL --etcd_prefix "/vitastor" >>./testdata/mon.log 2>&1 &
MON_PID=$!
wait_etcd
TIME=$(date '+%s')
$ETCDCTL put /vitastor/config/global '{"placement_levels":{"dc":10,"host":100,"half":105,"osd":110}}'
$ETCDCTL put /vitastor/config/node_placement '{
"h11":{"level":"half","parent":"host1"},
"h12":{"level":"half","parent":"host1"},
"h21":{"level":"half","parent":"host2"},
"h22":{"level":"half","parent":"host2"},
"h31":{"level":"half","parent":"host3"},
"h32":{"level":"half","parent":"host3"},
"1":{"parent":"h11"},
"2":{"parent":"h12"},
"3":{"parent":"h21"},
"4":{"parent":"h22"},
"5":{"parent":"h31"},
"6":{"parent":"h32"}
}'
$ETCDCTL put /vitastor/osd/stats/1 '{"host":"host1","size":1073741824,"time":"'$TIME'"}'
$ETCDCTL put /vitastor/osd/stats/2 '{"host":"host1","size":1073741824,"time":"'$TIME'"}'
$ETCDCTL put /vitastor/osd/stats/3 '{"host":"host2","size":1073741824,"time":"'$TIME'"}'
$ETCDCTL put /vitastor/osd/stats/4 '{"host":"host2","size":1073741824,"time":"'$TIME'"}'
$ETCDCTL put /vitastor/osd/stats/5 '{"host":"host3","size":1073741824,"time":"'$TIME'"}'
$ETCDCTL put /vitastor/osd/stats/6 '{"host":"host3","size":1073741824,"time":"'$TIME'"}'
build/src/cmd/vitastor-cli --etcd_address $ETCD_URL osd-tree
# check that it doesn't fail
build/src/cmd/vitastor-cli --etcd_address $ETCD_URL create-pool testpool --ec 2+1 -n 32
format_green OK