Compare commits

..

12 Commits

Author SHA1 Message Date
Vitaliy Filippov d84b84f58d Fix new backfillfull feature, add more logs
Test / test_rebalance_verify_ec_imm (push) Successful in 1m41s Details
Test / test_write_no_same (push) Successful in 9s Details
Test / test_write (push) Successful in 31s Details
Test / test_switch_primary (push) Successful in 34s Details
Test / test_write_xor (push) Successful in 35s Details
Test / test_heal_pg_size_2 (push) Successful in 2m18s Details
Test / test_heal_ec (push) Successful in 2m19s Details
Test / test_heal_antietcd (push) Successful in 2m17s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m31s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m17s Details
Test / test_heal_csum_32k (push) Successful in 2m21s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m21s Details
Test / test_resize_auto (push) Successful in 9s Details
Test / test_resize (push) Successful in 14s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m18s Details
Test / test_osd_tags (push) Successful in 10s Details
Test / test_snapshot_pool2 (push) Successful in 15s Details
Test / test_enospc (push) Successful in 11s Details
Test / test_enospc_imm (push) Successful in 9s Details
Test / test_enospc_xor (push) Successful in 15s Details
Test / test_enospc_imm_xor (push) Successful in 14s Details
Test / test_scrub (push) Successful in 14s Details
Test / test_scrub_zero_osd_2 (push) Successful in 15s Details
Test / test_scrub_xor (push) Successful in 14s Details
Test / test_scrub_pg_size_3 (push) Successful in 17s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 15s Details
Test / test_scrub_ec (push) Successful in 16s Details
Test / test_nfs (push) Successful in 12s Details
Test / test_heal_csum_4k (push) Successful in 2m17s Details
Test / test_etcd_fail_antietcd (push) Successful in 42s Details
2024-11-23 01:08:13 +03:00
Vitaliy Filippov 8cfe705d7a Map netlink after forking to show correct PID in vitastor-nbd ls
Test / test_rebalance_verify_ec (push) Successful in 1m35s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 1m36s Details
Test / test_write_no_same (push) Successful in 9s Details
Test / test_switch_primary (push) Successful in 33s Details
Test / test_write (push) Successful in 31s Details
Test / test_write_xor (push) Successful in 35s Details
Test / test_heal_pg_size_2 (push) Successful in 2m16s Details
Test / test_heal_ec (push) Successful in 2m18s Details
Test / test_heal_antietcd (push) Successful in 2m16s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m19s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m25s Details
Test / test_heal_csum_32k (push) Successful in 2m20s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m19s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m18s Details
Test / test_resize_auto (push) Successful in 8s Details
Test / test_resize (push) Successful in 14s Details
Test / test_osd_tags (push) Successful in 8s Details
Test / test_snapshot_pool2 (push) Successful in 15s Details
Test / test_enospc (push) Successful in 11s Details
Test / test_enospc_xor (push) Successful in 12s Details
Test / test_enospc_imm (push) Successful in 10s Details
Test / test_enospc_imm_xor (push) Successful in 14s Details
Test / test_scrub (push) Successful in 15s Details
Test / test_scrub_zero_osd_2 (push) Successful in 14s Details
Test / test_scrub_xor (push) Successful in 15s Details
Test / test_scrub_pg_size_3 (push) Successful in 16s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 17s Details
Test / test_scrub_ec (push) Successful in 15s Details
Test / test_nfs (push) Successful in 11s Details
Test / test_heal_csum_4k (push) Successful in 2m19s Details
2024-11-23 00:46:44 +03:00
Vitaliy Filippov 66c9271cbd Radically simplify create-pool pg_size check
Test / test_rebalance_verify_ec (push) Successful in 1m38s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 1m40s Details
Test / test_write_no_same (push) Successful in 10s Details
Test / test_write (push) Successful in 32s Details
Test / test_switch_primary (push) Successful in 34s Details
Test / test_write_xor (push) Successful in 35s Details
Test / test_heal_pg_size_2 (push) Successful in 2m17s Details
Test / test_heal_ec (push) Successful in 2m16s Details
Test / test_heal_antietcd (push) Successful in 2m18s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m19s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m20s Details
Test / test_heal_csum_32k (push) Successful in 2m18s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m20s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m20s Details
Test / test_resize_auto (push) Successful in 9s Details
Test / test_resize (push) Successful in 15s Details
Test / test_osd_tags (push) Successful in 10s Details
Test / test_snapshot_pool2 (push) Successful in 15s Details
Test / test_enospc (push) Successful in 11s Details
Test / test_enospc_xor (push) Successful in 13s Details
Test / test_enospc_imm (push) Successful in 12s Details
Test / test_enospc_imm_xor (push) Successful in 13s Details
Test / test_scrub (push) Successful in 16s Details
Test / test_scrub_zero_osd_2 (push) Successful in 14s Details
Test / test_scrub_xor (push) Successful in 15s Details
Test / test_scrub_pg_size_3 (push) Successful in 16s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 15s Details
Test / test_scrub_ec (push) Successful in 16s Details
Test / test_nfs (push) Successful in 12s Details
Test / test_heal_csum_4k (push) Successful in 2m18s Details
2024-11-22 01:44:14 +03:00
Vitaliy Filippov 7b37ba921d Pause pool rebalance when monitor detects that it can lead to any OSD becoming full
Test / test_root_node (push) Successful in 10s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 1m39s Details
Test / test_write_no_same (push) Successful in 9s Details
Test / test_switch_primary (push) Successful in 32s Details
Test / test_write (push) Successful in 32s Details
Test / test_write_xor (push) Successful in 37s Details
Test / test_heal_pg_size_2 (push) Successful in 2m17s Details
Test / test_heal_ec (push) Successful in 2m18s Details
Test / test_heal_antietcd (push) Successful in 2m17s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m20s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m27s Details
Test / test_heal_csum_32k (push) Successful in 2m17s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m19s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m19s Details
Test / test_resize_auto (push) Successful in 9s Details
Test / test_resize (push) Successful in 14s Details
Test / test_osd_tags (push) Successful in 8s Details
Test / test_snapshot_pool2 (push) Failing after 11s Details
Test / test_enospc (push) Successful in 11s Details
Test / test_enospc_imm (push) Successful in 11s Details
Test / test_enospc_xor (push) Successful in 13s Details
Test / test_enospc_imm_xor (push) Successful in 14s Details
Test / test_scrub (push) Successful in 14s Details
Test / test_scrub_zero_osd_2 (push) Successful in 15s Details
Test / test_scrub_xor (push) Successful in 15s Details
Test / test_scrub_pg_size_3 (push) Successful in 16s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 16s Details
Test / test_scrub_ec (push) Successful in 14s Details
Test / test_nfs (push) Successful in 12s Details
Test / test_heal_csum_4k (push) Successful in 2m19s Details
2024-11-22 01:01:07 +03:00
Vitaliy Filippov 262c581400 Fix create-pool for the case of hosts split into sub-nodes
Test / test_rebalance_verify_ec (push) Successful in 1m38s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 1m40s Details
Test / test_switch_primary (push) Successful in 23s Details
Test / test_write_no_same (push) Successful in 9s Details
Test / test_write (push) Successful in 31s Details
Test / test_write_xor (push) Successful in 35s Details
Test / test_heal_pg_size_2 (push) Successful in 2m18s Details
Test / test_heal_ec (push) Successful in 2m16s Details
Test / test_heal_antietcd (push) Successful in 2m17s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m19s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m19s Details
Test / test_heal_csum_32k (push) Successful in 2m20s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m19s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m17s Details
Test / test_resize_auto (push) Successful in 9s Details
Test / test_resize (push) Successful in 16s Details
Test / test_snapshot_pool2 (push) Failing after 11s Details
Test / test_osd_tags (push) Successful in 7s Details
Test / test_enospc (push) Successful in 12s Details
Test / test_enospc_imm (push) Successful in 11s Details
Test / test_enospc_xor (push) Successful in 14s Details
Test / test_enospc_imm_xor (push) Successful in 13s Details
Test / test_scrub (push) Successful in 13s Details
Test / test_scrub_zero_osd_2 (push) Successful in 13s Details
Test / test_scrub_xor (push) Successful in 14s Details
Test / test_scrub_pg_size_3 (push) Successful in 17s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 17s Details
Test / test_scrub_ec (push) Successful in 14s Details
Test / test_nfs (push) Successful in 11s Details
Test / test_heal_csum_4k (push) Successful in 2m14s Details
2024-11-22 01:01:07 +03:00
Vitaliy Filippov ad3b6b7267 Add a note about GID and RDMA device auto-selection 2024-11-21 23:54:05 +03:00
Vitaliy Filippov 1f6a061283 Move ibv_query_gid under #ifdef to only build it with libibverbs 32+
Test / test_rebalance_verify_ec (push) Successful in 1m39s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 1m39s Details
Test / test_write_no_same (push) Successful in 13s Details
Test / test_switch_primary (push) Successful in 33s Details
Test / test_write (push) Successful in 35s Details
Test / test_write_xor (push) Successful in 37s Details
Test / test_heal_pg_size_2 (push) Successful in 2m17s Details
Test / test_heal_ec (push) Successful in 2m19s Details
Test / test_heal_antietcd (push) Successful in 2m19s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m18s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m21s Details
Test / test_heal_csum_32k (push) Successful in 2m21s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m18s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m19s Details
Test / test_resize (push) Successful in 13s Details
Test / test_resize_auto (push) Successful in 9s Details
Test / test_osd_tags (push) Successful in 8s Details
Test / test_snapshot_pool2 (push) Successful in 14s Details
Test / test_enospc (push) Successful in 11s Details
Test / test_enospc_xor (push) Successful in 12s Details
Test / test_enospc_imm (push) Successful in 10s Details
Test / test_enospc_imm_xor (push) Successful in 14s Details
Test / test_scrub_zero_osd_2 (push) Successful in 12s Details
Test / test_scrub (push) Successful in 14s Details
Test / test_scrub_xor (push) Successful in 14s Details
Test / test_scrub_pg_size_3 (push) Successful in 15s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 17s Details
Test / test_scrub_ec (push) Successful in 15s Details
Test / test_nfs (push) Successful in 13s Details
Test / test_heal_csum_4k (push) Successful in 2m26s Details
2024-11-21 23:47:57 +03:00
Vitaliy Filippov fc4d97da10 Print "Waiting to become master" just once
Test / test_rebalance_verify_ec (push) Successful in 1m38s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 1m40s Details
Test / test_write_no_same (push) Successful in 8s Details
Test / test_write (push) Successful in 30s Details
Test / test_switch_primary (push) Successful in 33s Details
Test / test_write_xor (push) Successful in 36s Details
Test / test_heal_pg_size_2 (push) Successful in 2m15s Details
Test / test_heal_ec (push) Successful in 2m16s Details
Test / test_heal_antietcd (push) Successful in 2m17s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m19s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m26s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m18s Details
Test / test_heal_csum_32k (push) Successful in 2m22s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m20s Details
Test / test_resize_auto (push) Successful in 9s Details
Test / test_resize (push) Successful in 14s Details
Test / test_osd_tags (push) Successful in 9s Details
Test / test_enospc (push) Successful in 11s Details
Test / test_snapshot_pool2 (push) Successful in 16s Details
Test / test_enospc_xor (push) Successful in 12s Details
Test / test_enospc_imm (push) Successful in 12s Details
Test / test_enospc_imm_xor (push) Successful in 13s Details
Test / test_scrub (push) Successful in 15s Details
Test / test_scrub_zero_osd_2 (push) Successful in 14s Details
Test / test_scrub_xor (push) Successful in 15s Details
Test / test_scrub_pg_size_3 (push) Successful in 16s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 15s Details
Test / test_scrub_ec (push) Successful in 15s Details
Test / test_nfs (push) Successful in 13s Details
Test / test_heal_csum_4k (push) Successful in 2m18s Details
2024-11-21 00:55:22 +03:00
Vitaliy Filippov c7a4ce7341 Take out_size from dd oimg if not specified
Test / test_rebalance_verify_ec (push) Successful in 1m41s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 1m43s Details
Test / test_write_no_same (push) Successful in 9s Details
Test / test_write (push) Successful in 32s Details
Test / test_switch_primary (push) Successful in 34s Details
Test / test_write_xor (push) Successful in 36s Details
Test / test_heal_pg_size_2 (push) Successful in 2m15s Details
Test / test_heal_ec (push) Successful in 2m19s Details
Test / test_heal_antietcd (push) Successful in 2m17s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m18s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m18s Details
Test / test_heal_csum_32k (push) Successful in 2m19s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m17s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m20s Details
Test / test_resize_auto (push) Successful in 9s Details
Test / test_resize (push) Successful in 13s Details
Test / test_osd_tags (push) Successful in 8s Details
Test / test_snapshot_pool2 (push) Successful in 14s Details
Test / test_enospc (push) Successful in 12s Details
Test / test_enospc_xor (push) Successful in 12s Details
Test / test_enospc_imm (push) Successful in 11s Details
Test / test_enospc_imm_xor (push) Successful in 12s Details
Test / test_scrub (push) Successful in 14s Details
Test / test_scrub_zero_osd_2 (push) Successful in 13s Details
Test / test_scrub_xor (push) Successful in 14s Details
Test / test_scrub_pg_size_3 (push) Successful in 16s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 17s Details
Test / test_scrub_ec (push) Successful in 14s Details
Test / test_nfs (push) Successful in 12s Details
Test / test_heal_csum_4k (push) Successful in 2m8s Details
2024-11-19 02:13:34 +03:00
Vitaliy Filippov ddea31d86d Auto-select first RDMA device only if RoCE is not found, add rocev2->rocev1->ib priority
Test / test_rebalance_verify_ec (push) Successful in 1m43s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 1m43s Details
Test / test_write_no_same (push) Successful in 10s Details
Test / test_write (push) Successful in 29s Details
Test / test_switch_primary (push) Successful in 32s Details
Test / test_write_xor (push) Successful in 35s Details
Test / test_heal_pg_size_2 (push) Successful in 2m17s Details
Test / test_heal_ec (push) Successful in 2m15s Details
Test / test_heal_antietcd (push) Successful in 2m17s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m19s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m19s Details
Test / test_heal_csum_32k (push) Successful in 2m17s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m19s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m17s Details
Test / test_resize (push) Successful in 13s Details
Test / test_resize_auto (push) Successful in 9s Details
Test / test_osd_tags (push) Successful in 9s Details
Test / test_snapshot_pool2 (push) Successful in 16s Details
Test / test_enospc (push) Successful in 11s Details
Test / test_enospc_xor (push) Successful in 12s Details
Test / test_enospc_imm (push) Successful in 10s Details
Test / test_enospc_imm_xor (push) Successful in 13s Details
Test / test_scrub_zero_osd_2 (push) Successful in 14s Details
Test / test_scrub (push) Successful in 17s Details
Test / test_scrub_xor (push) Successful in 15s Details
Test / test_scrub_pg_size_3 (push) Successful in 17s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 15s Details
Test / test_scrub_ec (push) Successful in 14s Details
Test / test_nfs (push) Successful in 12s Details
Test / test_heal_csum_4k (push) Successful in 2m19s Details
2024-11-19 01:54:00 +03:00
Vitaliy Filippov 156d005412 Add serialize_overlap to test_heal
Test / test_rebalance_verify_ec (push) Successful in 1m44s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 1m44s Details
Test / test_write_no_same (push) Successful in 10s Details
Test / test_switch_primary (push) Successful in 33s Details
Test / test_write (push) Successful in 32s Details
Test / test_write_xor (push) Successful in 35s Details
Test / test_heal_pg_size_2 (push) Successful in 2m16s Details
Test / test_heal_ec (push) Successful in 2m18s Details
Test / test_heal_antietcd (push) Successful in 2m16s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m18s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m18s Details
Test / test_heal_csum_32k (push) Successful in 2m18s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m17s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m17s Details
Test / test_resize_auto (push) Successful in 8s Details
Test / test_resize (push) Successful in 13s Details
Test / test_osd_tags (push) Successful in 10s Details
Test / test_snapshot_pool2 (push) Successful in 16s Details
Test / test_enospc (push) Successful in 11s Details
Test / test_enospc_xor (push) Successful in 13s Details
Test / test_enospc_imm (push) Successful in 12s Details
Test / test_enospc_imm_xor (push) Successful in 13s Details
Test / test_scrub (push) Successful in 15s Details
Test / test_scrub_zero_osd_2 (push) Successful in 15s Details
Test / test_scrub_xor (push) Successful in 14s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 15s Details
Test / test_scrub_pg_size_3 (push) Successful in 18s Details
Test / test_scrub_ec (push) Successful in 14s Details
Test / test_nfs (push) Successful in 14s Details
Test / test_heal_csum_4k (push) Successful in 2m22s Details
2024-11-17 01:26:35 +03:00
Vitaliy Filippov 7e076c7049 Do not report OSDs with empty statistics as full
Test / test_rebalance_verify_ec (push) Successful in 1m36s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 1m38s Details
Test / test_write_no_same (push) Successful in 7s Details
Test / test_switch_primary (push) Successful in 32s Details
Test / test_write (push) Successful in 32s Details
Test / test_write_xor (push) Successful in 34s Details
Test / test_heal_pg_size_2 (push) Successful in 2m16s Details
Test / test_heal_ec (push) Successful in 2m16s Details
Test / test_heal_antietcd (push) Successful in 2m18s Details
Test / test_heal_csum_32k_dmj (push) Successful in 2m18s Details
Test / test_heal_csum_32k_dj (push) Successful in 2m22s Details
Test / test_heal_csum_32k (push) Successful in 2m18s Details
Test / test_heal_csum_4k_dmj (push) Successful in 2m15s Details
Test / test_heal_csum_4k_dj (push) Successful in 2m14s Details
Test / test_resize_auto (push) Successful in 8s Details
Test / test_resize (push) Successful in 14s Details
Test / test_osd_tags (push) Successful in 7s Details
Test / test_snapshot_pool2 (push) Successful in 16s Details
Test / test_enospc (push) Successful in 12s Details
Test / test_enospc_xor (push) Successful in 12s Details
Test / test_enospc_imm (push) Successful in 11s Details
Test / test_enospc_imm_xor (push) Successful in 13s Details
Test / test_scrub (push) Successful in 12s Details
Test / test_scrub_zero_osd_2 (push) Successful in 13s Details
Test / test_scrub_xor (push) Successful in 16s Details
Test / test_scrub_pg_size_3 (push) Successful in 15s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 15s Details
Test / test_scrub_ec (push) Successful in 16s Details
Test / test_nfs (push) Successful in 12s Details
Test / test_heal_csum_4k (push) Successful in 2m19s Details
2024-11-16 23:36:16 +03:00
26 changed files with 413 additions and 408 deletions

View File

@ -288,6 +288,24 @@ jobs:
echo "" echo ""
done done
test_create_halfhost:
runs-on: ubuntu-latest
needs: build
container: ${{env.TEST_IMAGE}}:${{github.sha}}
steps:
- name: Run test
id: test
timeout-minutes: 3
run: /root/vitastor/tests/test_create_halfhost.sh
- name: Print logs
if: always() && steps.test.outcome == 'failure'
run: |
for i in /root/vitastor/testdata/*.log /root/vitastor/testdata/*.txt; do
echo "-------- $i --------"
cat $i
echo ""
done
test_failure_domain: test_failure_domain:
runs-on: ubuntu-latest runs-on: ubuntu-latest
needs: build needs: build

View File

@ -68,11 +68,17 @@ but they are not connected to the cluster.
- Type: string - Type: string
RDMA device name to use for Vitastor OSD communications (for example, RDMA device name to use for Vitastor OSD communications (for example,
"rocep5s0f0"). Now Vitastor supports all adapters, even ones without "rocep5s0f0"). If not specified, Vitastor will try to find an RoCE
ODP support, like Mellanox ConnectX-3 and non-Mellanox cards. device matching [osd_network](osd.en.md#osd_network), preferring RoCEv2,
or choose the first available RDMA device if no RoCE devices are
found or if `osd_network` is not specified. Auto-selection is also
unsupported with old libibverbs < v32, like in Debian 10 Buster or
CentOS 7.
Versions up to Vitastor 1.2.0 required ODP which is only present in Vitastor supports all adapters, even ones without ODP support, like
Mellanox ConnectX >= 4. See also [rdma_odp](#rdma_odp). Mellanox ConnectX-3 and non-Mellanox cards. Versions up to Vitastor
1.2.0 required ODP which is only present in Mellanox ConnectX >= 4.
See also [rdma_odp](#rdma_odp).
Run `ibv_devinfo -v` as root to list available RDMA devices and their Run `ibv_devinfo -v` as root to list available RDMA devices and their
features. features.
@ -95,15 +101,17 @@ your device has.
## rdma_gid_index ## rdma_gid_index
- Type: integer - Type: integer
- Default: 0
Global address identifier index of the RDMA device to use. Different GID Global address identifier index of the RDMA device to use. Different GID
indexes may correspond to different protocols like RoCEv1, RoCEv2 and iWARP. indexes may correspond to different protocols like RoCEv1, RoCEv2 and iWARP.
Search for "GID" in `ibv_devinfo -v` output to determine which GID index Search for "GID" in `ibv_devinfo -v` output to determine which GID index
you need. you need.
**IMPORTANT:** If you want to use RoCEv2 (as recommended) then the correct If not specified, Vitastor will try to auto-select a RoCEv2 IPv4 GID, then
rdma_gid_index is usually 1 (IPv6) or 3 (IPv4). RoCEv2 IPv6 GID, then RoCEv1 IPv4 GID, then RoCEv1 IPv6 GID, then IB GID.
GID auto-selection is unsupported with libibverbs < v32.
A correct rdma_gid_index for RoCEv2 is usually 1 (IPv6) or 3 (IPv4).
## rdma_mtu ## rdma_mtu

View File

@ -71,12 +71,17 @@ RDMA может быть нужно только если у клиентов е
- Тип: строка - Тип: строка
Название RDMA-устройства для связи с Vitastor OSD (например, "rocep5s0f0"). Название RDMA-устройства для связи с Vitastor OSD (например, "rocep5s0f0").
Сейчас Vitastor поддерживает все модели адаптеров, включая те, у которых Если не указано, Vitastor попробует найти RoCE-устройство, соответствующее
нет поддержки ODP, то есть вы можете использовать RDMA с ConnectX-3 и [osd_network](osd.en.md#osd_network), предпочитая RoCEv2, или выбрать первое
картами производства не Mellanox. попавшееся RDMA-устройство, если RoCE-устройств нет или если сеть `osd_network`
не задана. Также автовыбор не поддерживается со старыми версиями библиотеки
libibverbs < v32, например в Debian 10 Buster или CentOS 7.
Версии Vitastor до 1.2.0 включительно требовали ODP, который есть только Vitastor поддерживает все модели адаптеров, включая те, у которых
на Mellanox ConnectX 4 и более новых. См. также [rdma_odp](#rdma_odp). нет поддержки ODP, то есть вы можете использовать RDMA с ConnectX-3 и
картами производства не Mellanox. Версии Vitastor до 1.2.0 включительно
требовали ODP, который есть только на Mellanox ConnectX 4 и более новых.
См. также [rdma_odp](#rdma_odp).
Запустите `ibv_devinfo -v` от имени суперпользователя, чтобы посмотреть Запустите `ibv_devinfo -v` от имени суперпользователя, чтобы посмотреть
список доступных RDMA-устройств, их параметры и возможности. список доступных RDMA-устройств, их параметры и возможности.
@ -101,15 +106,18 @@ Control) и ECN (Explicit Congestion Notification).
## rdma_gid_index ## rdma_gid_index
- Тип: целое число - Тип: целое число
- Значение по умолчанию: 0
Номер глобального идентификатора адреса RDMA-устройства, который следует Номер глобального идентификатора адреса RDMA-устройства, который следует
использовать. Разным gid_index могут соответствовать разные протоколы связи: использовать. Разным gid_index могут соответствовать разные протоколы связи:
RoCEv1, RoCEv2, iWARP. Чтобы понять, какой нужен вам - смотрите строчки со RoCEv1, RoCEv2, iWARP. Чтобы понять, какой нужен вам - смотрите строчки со
словом "GID" в выводе команды `ibv_devinfo -v`. словом "GID" в выводе команды `ibv_devinfo -v`.
**ВАЖНО:** Если вы хотите использовать RoCEv2 (как мы и рекомендуем), то Если не указан, Vitastor попробует автоматически выбрать сначала GID,
правильный rdma_gid_index, как правило, 1 (IPv6) или 3 (IPv4). соответствующий RoCEv2 IPv4, потом RoCEv2 IPv6, потом RoCEv1 IPv4, потом
RoCEv1 IPv6, потом IB. Авто-выбор GID не поддерживается со старыми версиями
libibverbs < v32.
Правильный rdma_gid_index для RoCEv2, как правило, 1 (IPv6) или 3 (IPv4).
## rdma_mtu ## rdma_mtu

View File

@ -48,11 +48,17 @@
type: string type: string
info: | info: |
RDMA device name to use for Vitastor OSD communications (for example, RDMA device name to use for Vitastor OSD communications (for example,
"rocep5s0f0"). Now Vitastor supports all adapters, even ones without "rocep5s0f0"). If not specified, Vitastor will try to find an RoCE
ODP support, like Mellanox ConnectX-3 and non-Mellanox cards. device matching [osd_network](osd.en.md#osd_network), preferring RoCEv2,
or choose the first available RDMA device if no RoCE devices are
found or if `osd_network` is not specified. Auto-selection is also
unsupported with old libibverbs < v32, like in Debian 10 Buster or
CentOS 7.
Versions up to Vitastor 1.2.0 required ODP which is only present in Vitastor supports all adapters, even ones without ODP support, like
Mellanox ConnectX >= 4. See also [rdma_odp](#rdma_odp). Mellanox ConnectX-3 and non-Mellanox cards. Versions up to Vitastor
1.2.0 required ODP which is only present in Mellanox ConnectX >= 4.
See also [rdma_odp](#rdma_odp).
Run `ibv_devinfo -v` as root to list available RDMA devices and their Run `ibv_devinfo -v` as root to list available RDMA devices and their
features. features.
@ -64,12 +70,17 @@
PFC (Priority Flow Control) and ECN (Explicit Congestion Notification). PFC (Priority Flow Control) and ECN (Explicit Congestion Notification).
info_ru: | info_ru: |
Название RDMA-устройства для связи с Vitastor OSD (например, "rocep5s0f0"). Название RDMA-устройства для связи с Vitastor OSD (например, "rocep5s0f0").
Сейчас Vitastor поддерживает все модели адаптеров, включая те, у которых Если не указано, Vitastor попробует найти RoCE-устройство, соответствующее
нет поддержки ODP, то есть вы можете использовать RDMA с ConnectX-3 и [osd_network](osd.en.md#osd_network), предпочитая RoCEv2, или выбрать первое
картами производства не Mellanox. попавшееся RDMA-устройство, если RoCE-устройств нет или если сеть `osd_network`
не задана. Также автовыбор не поддерживается со старыми версиями библиотеки
libibverbs < v32, например в Debian 10 Buster или CentOS 7.
Версии Vitastor до 1.2.0 включительно требовали ODP, который есть только Vitastor поддерживает все модели адаптеров, включая те, у которых
на Mellanox ConnectX 4 и более новых. См. также [rdma_odp](#rdma_odp). нет поддержки ODP, то есть вы можете использовать RDMA с ConnectX-3 и
картами производства не Mellanox. Версии Vitastor до 1.2.0 включительно
требовали ODP, который есть только на Mellanox ConnectX 4 и более новых.
См. также [rdma_odp](#rdma_odp).
Запустите `ibv_devinfo -v` от имени суперпользователя, чтобы посмотреть Запустите `ibv_devinfo -v` от имени суперпользователя, чтобы посмотреть
список доступных RDMA-устройств, их параметры и возможности. список доступных RDMA-устройств, их параметры и возможности.
@ -94,23 +105,29 @@
`ibv_devinfo -v`. `ibv_devinfo -v`.
- name: rdma_gid_index - name: rdma_gid_index
type: int type: int
default: 0
info: | info: |
Global address identifier index of the RDMA device to use. Different GID Global address identifier index of the RDMA device to use. Different GID
indexes may correspond to different protocols like RoCEv1, RoCEv2 and iWARP. indexes may correspond to different protocols like RoCEv1, RoCEv2 and iWARP.
Search for "GID" in `ibv_devinfo -v` output to determine which GID index Search for "GID" in `ibv_devinfo -v` output to determine which GID index
you need. you need.
**IMPORTANT:** If you want to use RoCEv2 (as recommended) then the correct If not specified, Vitastor will try to auto-select a RoCEv2 IPv4 GID, then
rdma_gid_index is usually 1 (IPv6) or 3 (IPv4). RoCEv2 IPv6 GID, then RoCEv1 IPv4 GID, then RoCEv1 IPv6 GID, then IB GID.
GID auto-selection is unsupported with libibverbs < v32.
A correct rdma_gid_index for RoCEv2 is usually 1 (IPv6) or 3 (IPv4).
info_ru: | info_ru: |
Номер глобального идентификатора адреса RDMA-устройства, который следует Номер глобального идентификатора адреса RDMA-устройства, который следует
использовать. Разным gid_index могут соответствовать разные протоколы связи: использовать. Разным gid_index могут соответствовать разные протоколы связи:
RoCEv1, RoCEv2, iWARP. Чтобы понять, какой нужен вам - смотрите строчки со RoCEv1, RoCEv2, iWARP. Чтобы понять, какой нужен вам - смотрите строчки со
словом "GID" в выводе команды `ibv_devinfo -v`. словом "GID" в выводе команды `ibv_devinfo -v`.
**ВАЖНО:** Если вы хотите использовать RoCEv2 (как мы и рекомендуем), то Если не указан, Vitastor попробует автоматически выбрать сначала GID,
правильный rdma_gid_index, как правило, 1 (IPv6) или 3 (IPv4). соответствующий RoCEv2 IPv4, потом RoCEv2 IPv6, потом RoCEv1 IPv4, потом
RoCEv1 IPv6, потом IB. Авто-выбор GID не поддерживается со старыми версиями
libibverbs < v32.
Правильный rdma_gid_index для RoCEv2, как правило, 1 (IPv6) или 3 (IPv4).
- name: rdma_mtu - name: rdma_mtu
type: int type: int
default: 4096 default: 4096

View File

@ -232,6 +232,7 @@ class EtcdAdapter
async become_master() async become_master()
{ {
const state = { ...this.mon.get_mon_state(), id: ''+this.mon.etcd_lease_id }; const state = { ...this.mon.get_mon_state(), id: ''+this.mon.etcd_lease_id };
console.log('Waiting to become master');
// eslint-disable-next-line no-constant-condition // eslint-disable-next-line no-constant-condition
while (1) while (1)
{ {
@ -243,7 +244,6 @@ class EtcdAdapter
{ {
break; break;
} }
console.log('Waiting to become master');
await new Promise(ok => setTimeout(ok, this.mon.config.etcd_start_timeout)); await new Promise(ok => setTimeout(ok, this.mon.config.etcd_start_timeout));
} }
console.log('Became master'); console.log('Became master');

View File

@ -56,6 +56,7 @@ const etcd_tree = {
osd_out_time: 600, // seconds. min: 0 osd_out_time: 600, // seconds. min: 0
placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... }, placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
use_old_pg_combinator: false, use_old_pg_combinator: false,
osd_backfillfull_ratio: 0.99,
// client and osd // client and osd
tcp_header_buffer_size: 65536, tcp_header_buffer_size: 65536,
use_sync_send_recv: false, use_sync_send_recv: false,

View File

@ -74,6 +74,7 @@ class Mon
this.state = JSON.parse(JSON.stringify(etcd_tree)); this.state = JSON.parse(JSON.stringify(etcd_tree));
this.prev_stats = { osd_stats: {}, osd_diff: {} }; this.prev_stats = { osd_stats: {}, osd_diff: {} };
this.recheck_pgs_active = false; this.recheck_pgs_active = false;
this.updating_total_stats = false;
this.watcher_active = false; this.watcher_active = false;
this.old_pg_config = false; this.old_pg_config = false;
this.old_pg_stats_seen = false; this.old_pg_stats_seen = false;
@ -658,7 +659,19 @@ class Mon
this.etcd_watch_revision, pool_id, up_osds, osd_tree, real_prev_pgs, pool_res.pgs, pg_history); this.etcd_watch_revision, pool_id, up_osds, osd_tree, real_prev_pgs, pool_res.pgs, pg_history);
} }
new_pg_config.hash = tree_hash; new_pg_config.hash = tree_hash;
return await this.save_pg_config(new_pg_config, etcd_request); const { backfillfull_pools, backfillfull_osds } = sum_object_counts(
{ ...this.state, pg: { ...this.state.pg, config: new_pg_config } }, this.config
);
if (backfillfull_pools.join(',') != ((this.state.pg.config||{}).backfillfull_pools||[]).join(','))
{
this.log_backfillfull(backfillfull_osds, backfillfull_pools);
}
new_pg_config.backfillfull_pools = backfillfull_pools.length ? backfillfull_pools : undefined;
if (!await this.save_pg_config(new_pg_config, etcd_request))
{
return false;
}
return true;
} }
async save_pg_config(new_pg_config, etcd_request = { compare: [], success: [] }) async save_pg_config(new_pg_config, etcd_request = { compare: [], success: [] })
@ -730,7 +743,7 @@ class Mon
async update_total_stats() async update_total_stats()
{ {
const txn = []; const txn = [];
const { object_counts, object_bytes } = sum_object_counts(this.state, this.config); const { object_counts, object_bytes, backfillfull_pools, backfillfull_osds } = sum_object_counts(this.state, this.config);
let stats = sum_op_stats(this.state.osd, this.prev_stats); let stats = sum_op_stats(this.state.osd, this.prev_stats);
let { inode_stats, seen_pools } = sum_inode_stats(this.state, this.prev_stats); let { inode_stats, seen_pools } = sum_inode_stats(this.state, this.prev_stats);
stats.object_counts = object_counts; stats.object_counts = object_counts;
@ -783,6 +796,27 @@ class Mon
{ {
await this.etcd.etcd_call('/kv/txn', { success: txn }, this.config.etcd_mon_timeout, 0); await this.etcd.etcd_call('/kv/txn', { success: txn }, this.config.etcd_mon_timeout, 0);
} }
if (!this.recheck_pgs_active &&
backfillfull_pools.join(',') != ((this.state.pg.config||{}).backfillfull_pools||[]).join(','))
{
this.log_backfillfull(backfillfull_osds, backfillfull_pools);
const new_pg_config = { ...this.state.pg.config, backfillfull_pools: backfillfull_pools.length ? backfillfull_pools : undefined };
await this.save_pg_config(new_pg_config);
}
}
log_backfillfull(osds, pools)
{
for (const osd in osds)
{
const bf = osds[osd];
console.log('OSD '+osd+' may fill up during rebalance: capacity '+(bf.cap/1024n/1024n)+
' MB, target user data '+(bf.clean/1024n/1024n)+' MB');
}
console.log(
(pools.length ? 'Pool(s) '+pools.join(', ') : 'No pools')+
' are backfillfull now, applying rebalance configuration'
);
} }
schedule_update_stats() schedule_update_stats()
@ -794,7 +828,21 @@ class Mon
this.stats_timer = setTimeout(() => this.stats_timer = setTimeout(() =>
{ {
this.stats_timer = null; this.stats_timer = null;
this.update_total_stats().catch(console.error); if (this.updating_total_stats)
{
this.schedule_update_stats();
return;
}
this.updating_total_stats = true;
try
{
this.update_total_stats().catch(console.error);
}
catch (e)
{
console.error(e);
}
this.updating_total_stats = false;
}, this.config.mon_stats_timeout); }, this.config.mon_stats_timeout);
} }

View File

@ -109,6 +109,8 @@ function sum_object_counts(state, global_config)
pgstats[pool_id] = { ...(state.pg.stats[pool_id] || {}), ...(pgstats[pool_id] || {}) }; pgstats[pool_id] = { ...(state.pg.stats[pool_id] || {}), ...(pgstats[pool_id] || {}) };
} }
} }
const pool_per_osd = {};
const clean_per_osd = {};
for (const pool_id in pgstats) for (const pool_id in pgstats)
{ {
let object_size = 0; let object_size = 0;
@ -143,10 +145,45 @@ function sum_object_counts(state, global_config)
object_bytes[k] += BigInt(st[k+'_count']) * object_size; object_bytes[k] += BigInt(st[k+'_count']) * object_size;
} }
} }
if (st.object_count)
{
for (const pg_osd of (((state.pg.config.items||{})[pool_id]||{})[pg_num]||{}).osd_set||[])
{
if (!(pg_osd in clean_per_osd))
{
clean_per_osd[pg_osd] = 0n;
}
clean_per_osd[pg_osd] += BigInt(st.object_count);
pool_per_osd[pg_osd] = pool_per_osd[pg_osd]||{};
pool_per_osd[pg_osd][pool_id] = true;
}
}
} }
} }
} }
return { object_counts, object_bytes }; // If clean_per_osd[osd] is larger than osd capacity then it will fill up during rebalance
let backfillfull_pools = {};
let backfillfull_osds = {};
for (const osd in clean_per_osd)
{
const st = state.osd.stats[osd];
if (!st || !st.size || !st.data_block_size)
{
continue;
}
let cap = BigInt(st.size)/BigInt(st.data_block_size);
cap = cap * BigInt((global_config.osd_backfillfull_ratio||0.99)*1000000) / 1000000n;
if (cap < clean_per_osd[osd])
{
backfillfull_osds[osd] = { cap: BigInt(st.size), clean: clean_per_osd[osd]*BigInt(st.data_block_size) };
for (const pool_id in pool_per_osd[osd])
{
backfillfull_pools[pool_id] = true;
}
}
}
backfillfull_pools = Object.keys(backfillfull_pools).sort();
return { object_counts, object_bytes, backfillfull_pools, backfillfull_osds };
} }
// sum_inode_stats(this.state, this.prev_stats) // sum_inode_stats(this.state, this.prev_stats)

View File

@ -785,7 +785,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
} }
for (auto & pool_item: value.object_items()) for (auto & pool_item: value.object_items())
{ {
pool_config_t pc; pool_config_t pc = {};
// ID // ID
pool_id_t pool_id; pool_id_t pool_id;
char null_byte = 0; char null_byte = 0;
@ -931,12 +931,28 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
// Ignore old key if the new one is present // Ignore old key if the new one is present
return; return;
} }
for (auto & pool_id_json: value["backfillfull_pools"].array_items())
{
auto pool_id = pool_id_json.uint64_value();
auto pool_it = this->pool_config.find(pool_id);
if (pool_it != this->pool_config.end())
{
pool_it->second.backfillfull |= 2;
}
}
for (auto & pool_item: this->pool_config) for (auto & pool_item: this->pool_config)
{ {
for (auto & pg_item: pool_item.second.pg_config) for (auto & pg_item: pool_item.second.pg_config)
{ {
pg_item.second.config_exists = false; pg_item.second.config_exists = false;
} }
// 3 = was 1 and became 1, 0 = was 0 and became 0
if (pool_item.second.backfillfull == 2 || pool_item.second.backfillfull == 1)
{
if (on_change_backfillfull_hook)
on_change_backfillfull_hook(pool_item.first);
}
pool_item.second.backfillfull = pool_item.second.backfillfull >> 1;
} }
for (auto & pool_item: value["items"].object_items()) for (auto & pool_item: value["items"].object_items())
{ {

View File

@ -62,6 +62,7 @@ struct pool_config_t
std::map<pg_num_t, pg_config_t> pg_config; std::map<pg_num_t, pg_config_t> pg_config;
uint64_t scrub_interval; uint64_t scrub_interval;
std::string used_for_fs; std::string used_for_fs;
int backfillfull;
}; };
struct inode_config_t struct inode_config_t
@ -131,6 +132,7 @@ public:
std::function<json11::Json()> load_pgs_checks_hook; std::function<json11::Json()> load_pgs_checks_hook;
std::function<void(bool)> on_load_pgs_hook; std::function<void(bool)> on_load_pgs_hook;
std::function<void()> on_change_pool_config_hook; std::function<void()> on_change_pool_config_hook;
std::function<void(pool_id_t)> on_change_backfillfull_hook;
std::function<void(pool_id_t, pg_num_t, osd_num_t)> on_change_pg_state_hook; std::function<void(pool_id_t, pg_num_t, osd_num_t)> on_change_pg_state_hook;
std::function<void(pool_id_t, pg_num_t)> on_change_pg_history_hook; std::function<void(pool_id_t, pg_num_t)> on_change_pg_history_hook;
std::function<void(osd_num_t)> on_change_osd_state_hook; std::function<void(osd_num_t)> on_change_osd_state_hook;

View File

@ -70,6 +70,7 @@ msgr_rdma_connection_t::~msgr_rdma_connection_t()
send_out_size = 0; send_out_size = 0;
} }
#ifdef IBV_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT
static bool is_ipv4_gid(ibv_gid_entry *gidx) static bool is_ipv4_gid(ibv_gid_entry *gidx)
{ {
return (((uint64_t*)gidx->gid.raw)[0] == 0 && return (((uint64_t*)gidx->gid.raw)[0] == 0 &&
@ -131,6 +132,7 @@ static matched_dev match_device(ibv_device **dev_list, addr_mask_t *networks, in
ibv_port_attr portinfo; ibv_port_attr portinfo;
ibv_gid_entry best_gidx; ibv_gid_entry best_gidx;
int res; int res;
bool have_non_roce = false, have_roce = false;
for (int i = 0; dev_list[i]; ++i) for (int i = 0; dev_list[i]; ++i)
{ {
auto dev = dev_list[i]; auto dev = dev_list[i];
@ -161,6 +163,11 @@ static matched_dev match_device(ibv_device **dev_list, addr_mask_t *networks, in
else else
break; break;
} }
if (gidx.gid_type != IBV_GID_TYPE_ROCE_V1 &&
gidx.gid_type != IBV_GID_TYPE_ROCE_V2)
have_non_roce = true;
else
have_roce = true;
if (match_gid(&gidx, networks, nnet)) if (match_gid(&gidx, networks, nnet))
{ {
// Prefer RoCEv2 // Prefer RoCEv2
@ -186,8 +193,13 @@ cleanup:
{ {
log_rdma_dev_port_gid(dev_list[best.dev], best.port, best.gid, best_gidx); log_rdma_dev_port_gid(dev_list[best.dev], best.port, best.gid, best_gidx);
} }
if (best.dev < 0 && have_non_roce && !have_roce)
{
best.dev = -2;
}
return best; return best;
} }
#endif
msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_networks, const char *ib_devname, uint8_t ib_port, uint8_t gid_index, uint32_t mtu, bool odp, int log_level) msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_networks, const char *ib_devname, uint8_t ib_port, uint8_t gid_index, uint32_t mtu, bool odp, int log_level)
{ {
@ -229,6 +241,7 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_ne
goto cleanup; goto cleanup;
} }
} }
#ifdef IBV_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT
else if (osd_networks.size()) else if (osd_networks.size())
{ {
std::vector<addr_mask_t> nets; std::vector<addr_mask_t> nets;
@ -237,11 +250,17 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_ne
nets.push_back(cidr_parse(netstr)); nets.push_back(cidr_parse(netstr));
} }
auto best = match_device(dev_list, nets.data(), nets.size(), log_level); auto best = match_device(dev_list, nets.data(), nets.size(), log_level);
if (best.dev < 0) if (best.dev == -2)
{
best.dev = 0;
if (log_level > 0)
fprintf(stderr, "No RoCE devices found, using first available RDMA device %s\n", ibv_get_device_name(*dev_list));
}
else if (best.dev < 0)
{ {
if (log_level > 0) if (log_level > 0)
fprintf(stderr, "RDMA device matching osd_network is not found, using first available device\n"); fprintf(stderr, "RDMA device matching osd_network is not found, disabling RDMA\n");
best.dev = 0; goto cleanup;
} }
else else
{ {
@ -250,6 +269,7 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_ne
} }
ctx->dev = dev_list[best.dev]; ctx->dev = dev_list[best.dev];
} }
#endif
else else
{ {
ctx->dev = *dev_list; ctx->dev = *dev_list;
@ -275,18 +295,22 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_ne
goto cleanup; goto cleanup;
} }
#ifdef IBV_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT
if (gid_index != -1) if (gid_index != -1)
#endif
{ {
ctx->gid_index = gid_index; ctx->gid_index = gid_index < 0 ? 0 : gid_index;
if (ibv_query_gid_ex(ctx->context, ib_port, gid_index, &ctx->my_gid, 0)) if (ibv_query_gid(ctx->context, ib_port, gid_index, &ctx->my_gid))
{ {
fprintf(stderr, "Couldn't read RDMA device %s GID index %d\n", ibv_get_device_name(ctx->dev), gid_index); fprintf(stderr, "Couldn't read RDMA device %s GID index %d\n", ibv_get_device_name(ctx->dev), gid_index);
goto cleanup; goto cleanup;
} }
} }
#ifdef IBV_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT
else else
{ {
// Auto-guess GID // Auto-guess GID
ibv_gid_entry best_gidx;
for (int k = 0; k < ctx->portinfo.gid_tbl_len; k++) for (int k = 0; k < ctx->portinfo.gid_tbl_len; k++)
{ {
ibv_gid_entry gidx; ibv_gid_entry gidx;
@ -301,21 +325,24 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_ne
{ {
continue; continue;
} }
// Prefer IPv4 RoCEv2 GID by default // Prefer IPv4 RoCEv2 -> IPv6 RoCEv2 -> IPv4 RoCEv1 -> IPv6 RoCEv1 -> IB
if (gid_index == -1 || if (gid_index == -1 ||
gidx.gid_type == IBV_GID_TYPE_ROCE_V2 && gidx.gid_type == IBV_GID_TYPE_ROCE_V2 && best_gidx.gid_type != IBV_GID_TYPE_ROCE_V2 ||
(ctx->my_gid.gid_type != IBV_GID_TYPE_ROCE_V2 || is_ipv4_gid(&gidx))) gidx.gid_type == IBV_GID_TYPE_ROCE_V1 && best_gidx.gid_type == IBV_GID_TYPE_IB ||
gidx.gid_type == best_gidx.gid_type && is_ipv4_gid(&gidx))
{ {
gid_index = k; gid_index = k;
ctx->my_gid = gidx; best_gidx = gidx;
} }
} }
ctx->gid_index = gid_index = (gid_index == -1 ? 0 : gid_index); ctx->gid_index = gid_index = (gid_index == -1 ? 0 : gid_index);
if (log_level > 0) if (log_level > 0)
{ {
log_rdma_dev_port_gid(ctx->dev, ctx->ib_port, ctx->gid_index, ctx->my_gid); log_rdma_dev_port_gid(ctx->dev, ctx->ib_port, ctx->gid_index, best_gidx);
} }
ctx->my_gid = best_gidx.gid;
} }
#endif
ctx->pd = ibv_alloc_pd(ctx->context); ctx->pd = ibv_alloc_pd(ctx->context);
if (!ctx->pd) if (!ctx->pd)
@ -431,7 +458,7 @@ msgr_rdma_connection_t *msgr_rdma_connection_t::create(msgr_rdma_context_t *ctx,
} }
conn->addr.lid = ctx->my_lid; conn->addr.lid = ctx->my_lid;
conn->addr.gid = ctx->my_gid.gid; conn->addr.gid = ctx->my_gid;
conn->addr.qpn = conn->qp->qp_num; conn->addr.qpn = conn->qp->qp_num;
conn->addr.psn = lrand48() & 0xffffff; conn->addr.psn = lrand48() & 0xffffff;

View File

@ -31,7 +31,7 @@ struct msgr_rdma_context_t
uint8_t ib_port; uint8_t ib_port;
uint8_t gid_index; uint8_t gid_index;
uint16_t my_lid; uint16_t my_lid;
ibv_gid_entry my_gid; ibv_gid my_gid;
uint32_t mtu; uint32_t mtu;
int max_cqe = 0; int max_cqe = 0;
int used_max_cqe = 0; int used_max_cqe = 0;

View File

@ -64,7 +64,7 @@ static void netlink_sock_alloc(struct netlink_ctx *ctx)
if (nl_driver_id < 0) if (nl_driver_id < 0)
{ {
nl_socket_free(sk); nl_socket_free(sk);
fail("Couldn't resolve the nbd netlink family\n"); fail("Couldn't resolve the nbd netlink family: %s (code %d)\n", nl_geterror(nl_driver_id), nl_driver_id);
} }
ctx->driver_id = nl_driver_id; ctx->driver_id = nl_driver_id;
@ -555,7 +555,12 @@ help:
fcntl(sockfd[0], F_SETFL, fcntl(sockfd[0], F_GETFL, 0) | O_NONBLOCK); fcntl(sockfd[0], F_SETFL, fcntl(sockfd[0], F_GETFL, 0) | O_NONBLOCK);
nbd_fd = sockfd[0]; nbd_fd = sockfd[0];
load_module(); load_module();
bool bg = cfg["foreground"].is_null(); bool bg = cfg["foreground"].is_null();
if (cfg["logfile"].string_value() != "")
{
logfile = cfg["logfile"].string_value();
}
if (netlink) if (netlink)
{ {
@ -588,6 +593,10 @@ help:
} }
close(sockfd[1]); close(sockfd[1]);
printf("/dev/nbd%d\n", err); printf("/dev/nbd%d\n", err);
if (bg)
{
daemonize_reopen_stdio();
}
#else #else
fprintf(stderr, "netlink support is disabled in this build\n"); fprintf(stderr, "netlink support is disabled in this build\n");
exit(1); exit(1);
@ -631,14 +640,10 @@ help:
} }
} }
} }
} if (bg)
if (cfg["logfile"].string_value() != "") {
{ daemonize();
logfile = cfg["logfile"].string_value(); }
}
if (bg)
{
daemonize();
} }
// Initialize read state // Initialize read state
read_state = CL_READ_HDR; read_state = CL_READ_HDR;
@ -716,13 +721,17 @@ help:
} }
} }
void daemonize() void daemonize_fork()
{ {
if (fork()) if (fork())
exit(0); exit(0);
setsid(); setsid();
if (fork()) if (fork())
exit(0); exit(0);
}
void daemonize_reopen_stdio()
{
close(0); close(0);
close(1); close(1);
close(2); close(2);
@ -733,6 +742,12 @@ help:
fprintf(stderr, "Warning: Failed to chdir into /\n"); fprintf(stderr, "Warning: Failed to chdir into /\n");
} }
void daemonize()
{
daemonize_fork();
daemonize_reopen_stdio();
}
json11::Json::object list_mapped() json11::Json::object list_mapped()
{ {
const char *self_filename = exe_name; const char *self_filename = exe_name;
@ -783,8 +798,9 @@ help:
if (!strcmp(pid_filename, self_filename)) if (!strcmp(pid_filename, self_filename))
{ {
json11::Json::object cfg = nbd_proxy::parse_args(argv.size(), argv.data()); json11::Json::object cfg = nbd_proxy::parse_args(argv.size(), argv.data());
if (cfg["command"] == "map") if (cfg["command"] == "map" || cfg["command"] == "netlink-map")
{ {
cfg["interface"] = (cfg["command"] == "netlink-map") ? "netlink" : "nbd";
cfg.erase("command"); cfg.erase("command");
cfg["pid"] = pid; cfg["pid"] = pid;
mapped["/dev/nbd"+std::to_string(dev_num)] = cfg; mapped["/dev/nbd"+std::to_string(dev_num)] = cfg;

View File

@ -261,6 +261,7 @@ struct dd_out_info_t
else else
{ {
// ok // ok
out_size = owatch->cfg.size;
return true; return true;
} }
// Wait for sub-command // Wait for sub-command
@ -882,6 +883,8 @@ resume_2:
oinfo.end_fsync = oinfo.end_fsync && oinfo.out_seekable; oinfo.end_fsync = oinfo.end_fsync && oinfo.out_seekable;
read_offset = 0; read_offset = 0;
read_end = iinfo.in_seekable ? iinfo.in_size-iseek : 0; read_end = iinfo.in_seekable ? iinfo.in_size-iseek : 0;
if (oinfo.out_size && (!read_end || read_end > oinfo.out_size-oseek))
read_end = oinfo.out_size-oseek;
if (bytelimit && (!read_end || read_end > bytelimit)) if (bytelimit && (!read_end || read_end > bytelimit))
read_end = bytelimit; read_end = bytelimit;
clock_gettime(CLOCK_REALTIME, &tv_begin); clock_gettime(CLOCK_REALTIME, &tv_begin);

View File

@ -35,6 +35,7 @@ struct pool_creator_t
uint64_t new_pools_mod_rev; uint64_t new_pools_mod_rev;
json11::Json state_node_tree; json11::Json state_node_tree;
json11::Json new_pools; json11::Json new_pools;
std::map<osd_num_t, json11::Json> osd_stats;
bool is_done() { return state == 100; } bool is_done() { return state == 100; }
@ -46,8 +47,6 @@ struct pool_creator_t
goto resume_2; goto resume_2;
else if (state == 3) else if (state == 3)
goto resume_3; goto resume_3;
else if (state == 4)
goto resume_4;
else if (state == 5) else if (state == 5)
goto resume_5; goto resume_5;
else if (state == 6) else if (state == 6)
@ -90,13 +89,19 @@ resume_1:
// If not forced, check that we have enough osds for pg_size // If not forced, check that we have enough osds for pg_size
if (!force) if (!force)
{ {
// Get node_placement configuration from etcd // Get node_placement configuration from etcd and OSD stats
parent->etcd_txn(json11::Json::object { parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array { { "success", json11::Json::array {
json11::Json::object { json11::Json::object {
{ "request_range", json11::Json::object { { "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/node_placement") }, { "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/node_placement") },
} } } },
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/osd/stats/") },
{ "range_end", base64_encode(parent->cli->st_cli.etcd_prefix+"/osd/stats0") },
} },
}, },
} }, } },
}); });
@ -112,10 +117,21 @@ resume_2:
return; return;
} }
// Get state_node_tree based on node_placement and osd peer states // Get state_node_tree based on node_placement and osd stats
{ {
auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]); auto node_placement_kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
state_node_tree = get_state_node_tree(kv.value.object_items()); timespec tv_now;
clock_gettime(CLOCK_REALTIME, &tv_now);
uint64_t osd_out_time = parent->cli->config["osd_out_time"].uint64_value();
if (!osd_out_time)
osd_out_time = 600;
osd_stats.clear();
parent->iterate_kvs_1(parent->etcd_result["responses"][1]["response_range"]["kvs"], "/osd/stats/", [&](uint64_t cur_osd, json11::Json value)
{
if ((uint64_t)value["time"].number_value()+osd_out_time >= tv_now.tv_sec)
osd_stats[cur_osd] = value;
});
state_node_tree = get_state_node_tree(node_placement_kv.value.object_items(), osd_stats);
} }
// Skip tag checks, if pool has none // Skip tag checks, if pool has none
@ -158,42 +174,18 @@ resume_3:
} }
} }
// Get stats (for block_size, bitmap_granularity, ...) of osds in state_node_tree
{
json11::Json::array osd_stats;
for (auto osd_num: state_node_tree["osds"].array_items())
{
osd_stats.push_back(json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/osd/stats/"+osd_num.as_string()) },
} }
});
}
parent->etcd_txn(json11::Json::object{ { "success", osd_stats } });
}
state = 4;
resume_4:
if (parent->waiting > 0)
return;
if (parent->etcd_err.err)
{
result = parent->etcd_err;
state = 100;
return;
}
// Filter osds from state_node_tree based on pool parameters and osd stats // Filter osds from state_node_tree based on pool parameters and osd stats
{ {
std::vector<json11::Json> osd_stats; std::vector<json11::Json> filtered_osd_stats;
for (auto & ocr: parent->etcd_result["responses"].array_items()) for (auto & osd_num: state_node_tree["osds"].array_items())
{ {
auto kv = parent->cli->st_cli.parse_etcd_kv(ocr["response_range"]["kvs"][0]); auto st_it = osd_stats.find(osd_num.uint64_value());
osd_stats.push_back(kv.value); if (st_it != osd_stats.end())
{
filtered_osd_stats.push_back(st_it->second);
}
} }
guess_block_size(osd_stats); guess_block_size(filtered_osd_stats);
state_node_tree = filter_state_node_tree_by_stats(state_node_tree, osd_stats); state_node_tree = filter_state_node_tree_by_stats(state_node_tree, osd_stats);
} }
@ -201,8 +193,7 @@ resume_4:
{ {
auto failure_domain = cfg["failure_domain"].string_value() == "" auto failure_domain = cfg["failure_domain"].string_value() == ""
? "host" : cfg["failure_domain"].string_value(); ? "host" : cfg["failure_domain"].string_value();
uint64_t max_pg_size = get_max_pg_size(state_node_tree["nodes"].object_items(), uint64_t max_pg_size = get_max_pg_size(state_node_tree, failure_domain, cfg["root_node"].string_value());
failure_domain, cfg["root_node"].string_value());
if (cfg["pg_size"].uint64_value() > max_pg_size) if (cfg["pg_size"].uint64_value() > max_pg_size)
{ {
@ -358,56 +349,50 @@ resume_8:
// Returns a JSON object of form {"nodes": {...}, "osds": [...]} that // Returns a JSON object of form {"nodes": {...}, "osds": [...]} that
// contains: all nodes (osds, hosts, ...) based on node_placement config // contains: all nodes (osds, hosts, ...) based on node_placement config
// and current peer state, and a list of active peer osds. // and current osd stats.
json11::Json get_state_node_tree(json11::Json::object node_placement) json11::Json get_state_node_tree(json11::Json::object node_placement, std::map<osd_num_t, json11::Json> & osd_stats)
{ {
// Erase non-peer osd nodes from node_placement // Erase non-existing osd nodes from node_placement
for (auto np_it = node_placement.begin(); np_it != node_placement.end();) for (auto np_it = node_placement.begin(); np_it != node_placement.end();)
{ {
// Numeric nodes are osds // Numeric nodes are osds
osd_num_t osd_num = stoull_full(np_it->first); osd_num_t osd_num = stoull_full(np_it->first);
// If node is osd and it is not in peer states, erase it // If node is osd and its stats do not exist, erase it
if (osd_num > 0 && if (osd_num > 0 && osd_stats.find(osd_num) == osd_stats.end())
parent->cli->st_cli.peer_states.find(osd_num) == parent->cli->st_cli.peer_states.end())
{
node_placement.erase(np_it++); node_placement.erase(np_it++);
}
else else
np_it++; np_it++;
} }
// List of peer osds // List of osds
std::vector<std::string> peer_osds; std::vector<std::string> existing_osds;
// Record peer osds and add missing osds/hosts to np // Record osds and add missing osds/hosts to np
for (auto & ps: parent->cli->st_cli.peer_states) for (auto & ps: osd_stats)
{ {
std::string osd_num = std::to_string(ps.first); std::string osd_num = std::to_string(ps.first);
// Record peer osd // Record osd
peer_osds.push_back(osd_num); existing_osds.push_back(osd_num);
// Add osd, if necessary // Add host if necessary
if (node_placement.find(osd_num) == node_placement.end()) std::string osd_host = ps.second["host"].as_string();
if (node_placement.find(osd_host) == node_placement.end())
{ {
std::string osd_host = ps.second["host"].as_string(); node_placement[osd_host] = json11::Json::object {
{ "level", "host" }
// Add host, if necessary
if (node_placement.find(osd_host) == node_placement.end())
{
node_placement[osd_host] = json11::Json::object {
{ "level", "host" }
};
}
node_placement[osd_num] = json11::Json::object {
{ "parent", osd_host }
}; };
} }
// Add osd
node_placement[osd_num] = json11::Json::object {
{ "parent", node_placement[osd_num]["parent"].is_null() ? osd_host : node_placement[osd_num]["parent"] },
{ "level", "osd" },
};
} }
return json11::Json::object { { "osds", peer_osds }, { "nodes", node_placement } }; return json11::Json::object { { "osds", existing_osds }, { "nodes", node_placement } };
} }
// Returns new state_node_tree based on given state_node_tree with osds // Returns new state_node_tree based on given state_node_tree with osds
@ -534,15 +519,13 @@ resume_8:
// filtered out by stats parameters (block_size, bitmap_granularity) in // filtered out by stats parameters (block_size, bitmap_granularity) in
// given osd_stats and current pool config. // given osd_stats and current pool config.
// Requires: state_node_tree["osds"] must match osd_stats 1-1 // Requires: state_node_tree["osds"] must match osd_stats 1-1
json11::Json filter_state_node_tree_by_stats(const json11::Json & state_node_tree, std::vector<json11::Json> & osd_stats) json11::Json filter_state_node_tree_by_stats(const json11::Json & state_node_tree, std::map<osd_num_t, json11::Json> & osd_stats)
{ {
auto & osds = state_node_tree["osds"].array_items();
// Accepted state_node_tree nodes // Accepted state_node_tree nodes
auto accepted_nodes = state_node_tree["nodes"].object_items(); auto accepted_nodes = state_node_tree["nodes"].object_items();
// List of accepted osds // List of accepted osds
std::vector<std::string> accepted_osds; json11::Json::array accepted_osds;
block_size = cfg["block_size"].uint64_value() block_size = cfg["block_size"].uint64_value()
? cfg["block_size"].uint64_value() ? cfg["block_size"].uint64_value()
@ -554,21 +537,25 @@ resume_8:
? etcd_state_client_t::parse_immediate_commit(cfg["immediate_commit"].string_value(), IMMEDIATE_ALL) ? etcd_state_client_t::parse_immediate_commit(cfg["immediate_commit"].string_value(), IMMEDIATE_ALL)
: parent->cli->st_cli.global_immediate_commit; : parent->cli->st_cli.global_immediate_commit;
for (size_t i = 0; i < osd_stats.size(); i++) for (auto osd_num_json: state_node_tree["osds"].array_items())
{ {
auto & os = osd_stats[i]; auto osd_num = osd_num_json.uint64_value();
// Get osd number auto os_it = osd_stats.find(osd_num);
auto osd_num = osds[i].as_string(); if (os_it == osd_stats.end())
{
continue;
}
auto & os = os_it->second;
if (!os["data_block_size"].is_null() && os["data_block_size"] != block_size || if (!os["data_block_size"].is_null() && os["data_block_size"] != block_size ||
!os["bitmap_granularity"].is_null() && os["bitmap_granularity"] != bitmap_granularity || !os["bitmap_granularity"].is_null() && os["bitmap_granularity"] != bitmap_granularity ||
!os["immediate_commit"].is_null() && !os["immediate_commit"].is_null() &&
etcd_state_client_t::parse_immediate_commit(os["immediate_commit"].string_value(), IMMEDIATE_NONE) < immediate_commit) etcd_state_client_t::parse_immediate_commit(os["immediate_commit"].string_value(), IMMEDIATE_NONE) < immediate_commit)
{ {
accepted_nodes.erase(osd_num); accepted_nodes.erase(osd_num_json.as_string());
} }
else else
{ {
accepted_osds.push_back(osd_num); accepted_osds.push_back(osd_num_json);
} }
} }
@ -576,87 +563,28 @@ resume_8:
} }
// Returns maximum pg_size possible for given node_tree and failure_domain, starting at parent_node // Returns maximum pg_size possible for given node_tree and failure_domain, starting at parent_node
uint64_t get_max_pg_size(json11::Json::object node_tree, const std::string & level, const std::string & parent_node) uint64_t get_max_pg_size(json11::Json state_node_tree, const std::string & level, const std::string & root_node)
{ {
uint64_t max_pg_sz = 0; std::set<std::string> level_seen;
for (auto & osd: state_node_tree["osds"].array_items())
std::vector<std::string> nodes;
// Check if parent node is an osd (numeric)
if (parent_node != "" && stoull_full(parent_node))
{ {
// Add it to node list if osd is in node tree // find OSD parent at <level>, but stop at <root_node>
if (node_tree.find(parent_node) != node_tree.end()) auto cur_id = osd.string_value();
nodes.push_back(parent_node); auto cur = state_node_tree["nodes"][cur_id];
} while (!cur.is_null())
// If parent node given, ...
else if (parent_node != "")
{
// ... look for children nodes of this parent
for (auto & sn: node_tree)
{ {
auto & props = sn.second.object_items(); if (cur["level"] == level)
auto parent_prop = props.find("parent");
if (parent_prop != props.end() && (parent_prop->second.as_string() == parent_node))
{ {
nodes.push_back(sn.first); level_seen.insert(cur_id);
break;
// If we're not looking for all osds, we only need a single
// child osd node
if (level != "osd" && stoull_full(sn.first))
break;
} }
if (cur_id == root_node)
break;
cur_id = cur["parent"].string_value();
cur = state_node_tree["nodes"][cur_id];
} }
} }
// No parent node given, and we're not looking for all osds return level_seen.size();
else if (level != "osd")
{
// ... look for all level nodes
for (auto & sn: node_tree)
{
auto & props = sn.second.object_items();
auto level_prop = props.find("level");
if (level_prop != props.end() && (level_prop->second.as_string() == level))
{
nodes.push_back(sn.first);
}
}
}
// Otherwise, ...
else
{
// ... we're looking for osd nodes only
for (auto & sn: node_tree)
{
if (stoull_full(sn.first))
{
nodes.push_back(sn.first);
}
}
}
// Process gathered nodes
for (auto & node: nodes)
{
// Check for osd node, return constant max size
if (stoull_full(node))
{
max_pg_sz += 1;
}
// Otherwise, ...
else
{
// ... exclude parent node from tree, and ...
node_tree.erase(parent_node);
// ... descend onto the resulting tree
max_pg_sz += get_max_pg_size(node_tree, level, node);
}
}
return max_pg_sz;
} }
json11::Json create_pool(const etcd_kv_t & kv) json11::Json create_pool(const etcd_kv_t & kv)

View File

@ -142,13 +142,16 @@ resume_2:
auto osd_free = value["free"].uint64_value(); auto osd_free = value["free"].uint64_value();
total_raw += osd_size; total_raw += osd_size;
free_raw += osd_free; free_raw += osd_free;
if (!osd_free) if (osd_size)
{ {
osds_full++; if (!osd_free)
} {
else if (osd_free < (uint64_t)(osd_size*(1-osd_nearfull_ratio))) osds_full++;
{ }
osds_nearfull++; else if (osd_free < (uint64_t)(osd_size*(1-osd_nearfull_ratio)))
{
osds_nearfull++;
}
} }
auto peer_it = parent->cli->st_cli.peer_states.find(stat_osd_num); auto peer_it = parent->cli->st_cli.peer_states.find(stat_osd_num);
if (peer_it != parent->cli->st_cli.peer_states.end()) if (peer_it != parent->cli->st_cli.peer_states.end())

View File

@ -226,6 +226,7 @@ class osd_t
void parse_config(bool init); void parse_config(bool init);
void init_cluster(); void init_cluster();
void on_change_osd_state_hook(osd_num_t peer_osd); void on_change_osd_state_hook(osd_num_t peer_osd);
void on_change_backfillfull_hook(pool_id_t pool_id);
void on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num); void on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num);
void on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes); void on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes);
void on_load_config_hook(json11::Json::object & changes); void on_load_config_hook(json11::Json::object & changes);

View File

@ -65,6 +65,7 @@ void osd_t::init_cluster()
st_cli.tfd = tfd; st_cli.tfd = tfd;
st_cli.log_level = log_level; st_cli.log_level = log_level;
st_cli.on_change_osd_state_hook = [this](osd_num_t peer_osd) { on_change_osd_state_hook(peer_osd); }; st_cli.on_change_osd_state_hook = [this](osd_num_t peer_osd) { on_change_osd_state_hook(peer_osd); };
st_cli.on_change_backfillfull_hook = [this](pool_id_t pool_id) { on_change_backfillfull_hook(pool_id); };
st_cli.on_change_pg_history_hook = [this](pool_id_t pool_id, pg_num_t pg_num) { on_change_pg_history_hook(pool_id, pg_num); }; st_cli.on_change_pg_history_hook = [this](pool_id_t pool_id, pg_num_t pg_num) { on_change_pg_history_hook(pool_id, pg_num); };
st_cli.on_change_hook = [this](std::map<std::string, etcd_kv_t> & changes) { on_change_etcd_state_hook(changes); }; st_cli.on_change_hook = [this](std::map<std::string, etcd_kv_t> & changes) { on_change_etcd_state_hook(changes); };
st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); }; st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); };
@ -414,6 +415,14 @@ void osd_t::on_change_osd_state_hook(osd_num_t peer_osd)
} }
} }
void osd_t::on_change_backfillfull_hook(pool_id_t pool_id)
{
if (!(peering_state & (OSD_RECOVERING | OSD_FLUSHING_PGS)))
{
peering_state = peering_state | OSD_RECOVERING;
}
}
void osd_t::on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes) void osd_t::on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes)
{ {
if (changes.find(st_cli.etcd_prefix+"/config/global") != changes.end()) if (changes.find(st_cli.etcd_prefix+"/config/global") != changes.end())

View File

@ -252,10 +252,18 @@ bool osd_t::pick_next_recovery(osd_recovery_op_t &op)
auto mask = recovery_last_degraded ? (PG_ACTIVE | PG_HAS_DEGRADED) : (PG_ACTIVE | PG_DEGRADED | PG_HAS_MISPLACED); auto mask = recovery_last_degraded ? (PG_ACTIVE | PG_HAS_DEGRADED) : (PG_ACTIVE | PG_DEGRADED | PG_HAS_MISPLACED);
auto check = recovery_last_degraded ? (PG_ACTIVE | PG_HAS_DEGRADED) : (PG_ACTIVE | PG_HAS_MISPLACED); auto check = recovery_last_degraded ? (PG_ACTIVE | PG_HAS_DEGRADED) : (PG_ACTIVE | PG_HAS_MISPLACED);
// Restart scanning from the same PG as the last time // Restart scanning from the same PG as the last time
restart:
for (auto pg_it = pgs.lower_bound(recovery_last_pg); pg_it != pgs.end(); pg_it++) for (auto pg_it = pgs.lower_bound(recovery_last_pg); pg_it != pgs.end(); pg_it++)
{ {
if ((pg_it->second.state & mask) == check) if ((pg_it->second.state & mask) == check)
{ {
auto pool_it = st_cli.pool_config.find(pg_it->first.pool_id);
if (pool_it != st_cli.pool_config.end() && pool_it->second.backfillfull)
{
// Skip the pool
recovery_last_pg.pool_id++;
goto restart;
}
auto & src = recovery_last_degraded ? pg_it->second.degraded_objects : pg_it->second.misplaced_objects; auto & src = recovery_last_degraded ? pg_it->second.degraded_objects : pg_it->second.misplaced_objects;
assert(src.size() > 0); assert(src.size() > 0);
// Restart scanning from the next object // Restart scanning from the next object

View File

@ -12,11 +12,6 @@ target_link_libraries(stub_bench tcmalloc_minimal)
add_executable(osd_test osd_test.cpp ../util/rw_blocking.cpp ../util/addr_util.cpp) add_executable(osd_test osd_test.cpp ../util/rw_blocking.cpp ../util/addr_util.cpp)
target_link_libraries(osd_test tcmalloc_minimal) target_link_libraries(osd_test tcmalloc_minimal)
# bindiff
add_executable(bindiff
bindiff.c
)
# stub_uring_osd # stub_uring_osd
add_executable(stub_uring_osd add_executable(stub_uring_osd
stub_uring_osd.cpp stub_uring_osd.cpp

View File

@ -1,177 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2004+
// License: VNPL-1.1 (see README.md for details)
#ifndef _LARGEFILE64_SOURCE
#define _LARGEFILE64_SOURCE
#endif
#include <string.h>
#include <sys/stat.h>
#include <errno.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>
#define BUFSIZE 0x100000
uint64_t filelength(int fd)
{
struct stat st;
if (fstat(fd, &st) < 0)
{
fprintf(stderr, "fstat failed: %s\n", strerror(errno));
return 0;
}
if (st.st_size < 0)
{
return 0;
}
return (uint64_t)st.st_size;
}
size_t read_blocking(int fd, void *read_buf, size_t remaining)
{
size_t done = 0;
while (done < remaining)
{
ssize_t r = read(fd, read_buf, remaining-done);
if (r <= 0)
{
if (!errno)
{
// EOF
return done;
}
else if (errno != EINTR && errno != EAGAIN && errno != EPIPE)
{
perror("read");
exit(1);
}
continue;
}
done += (size_t)r;
read_buf = (uint8_t*)read_buf + r;
}
return done;
}
size_t write_blocking(int fd, void *write_buf, size_t remaining)
{
size_t done = 0;
while (done < remaining)
{
ssize_t r = write(fd, write_buf, remaining-done);
if (r < 0)
{
if (errno != EINTR && errno != EAGAIN && errno != EPIPE)
{
perror("write");
exit(1);
}
continue;
}
done += (size_t)r;
write_buf = (uint8_t*)write_buf + r;
}
return done;
}
int main(int narg, char *args[])
{
int fd1 = -1, fd2 = -1;
uint8_t *buf1 = NULL, *buf2 = NULL;
uint64_t addr = 0, l1 = 0, l2 = 0, l = 0, diffl = 0;
size_t buf1_len = 0, buf2_len = 0, i = 0, j = 0, dl = 0;
int argoff = 0;
int nosource = 0;
fprintf(stderr, "VMX HexDiff v2.1\nLicense: GPLv3.0+, (c) 2005+, Vitaliy Filippov\n");
argoff = 1;
if (narg > argoff && strcmp(args[argoff], "-n") == 0)
{
nosource = 1;
argoff++;
}
if (narg < argoff+2)
{
fprintf(stderr, "USAGE: bindiff [-n] <file1> <file2>\n"
"This will create hex patch file1->file2 and write it to stdout.\n"
"[-n] = do not write file1 data in patch, only file2.\n");
return -1;
}
fd1 = open(args[argoff], O_RDONLY);
if (fd1 < 0)
{
fprintf(stderr, "Couldn't open %s: %s\n", args[argoff], strerror(errno));
return -1;
}
fd2 = open(args[argoff+1], O_RDONLY);
if (fd2 < 0)
{
fprintf(stderr, "Couldn't open %s: %s\n", args[argoff+1], strerror(errno));
close(fd1);
return -1;
}
l1 = filelength(fd1);
l2 = filelength(fd2);
if (l1 < l2)
l = l1;
else
l = l2;
addr = diffl = 0;
buf1 = malloc(BUFSIZE+1);
buf2 = malloc(BUFSIZE+1);
while ((buf1_len = read_blocking(fd1, buf1, BUFSIZE)) > 0 && (buf2_len = read_blocking(fd2, buf2, BUFSIZE)) > 0)
{
buf1[buf1_len] = buf2[buf2_len] = 0;
for (dl = 0, i = 0; i <= buf1_len && i <= buf2_len; i++, addr++)
{
if (buf1[i] != buf2[i])
{
dl++;
}
else if (dl)
{
printf("%08jX: ", addr-dl);
if (!nosource)
{
for (j = i-dl; j < i; j++)
printf("%02X", buf1[j]);
printf(" ");
}
for (j = i-dl; j < i; j++)
printf("%02X", buf2[j]);
printf("\n");
diffl += dl;
dl = 0;
}
}
addr--;
}
if (l1 < l2)
{
printf("%08zX: ", i);
while ((buf2_len = read_blocking(fd2, buf2, BUFSIZE)) > 0)
{
for (j = 0; j < buf2_len; j++, i++)
printf("%02X", buf2[j]);
}
printf("\n");
}
else if (l1 > l2)
{
printf("SIZE %08zX\n", l2);
}
if (diffl != 0 || l1 != l2)
{
fprintf(stderr, "Difference in %zu of %zu common bytes\n", diffl, l);
if (l1 != l2)
fprintf(stderr, "Length difference!\nFile \"%s\": %zu\nFile \"%s\": %zu\n", args [1], l1, args [2], l2);
}
else
{
fprintf(stderr, "Files are equal\n");
}
return 0;
}

View File

@ -10,7 +10,7 @@
#include "rw_blocking.h" #include "rw_blocking.h"
size_t read_blocking(int fd, void *read_buf, size_t remaining) int read_blocking(int fd, void *read_buf, size_t remaining)
{ {
size_t done = 0; size_t done = 0;
while (done < remaining) while (done < remaining)
@ -30,13 +30,13 @@ size_t read_blocking(int fd, void *read_buf, size_t remaining)
} }
continue; continue;
} }
done += (size_t)r; done += r;
read_buf = (uint8_t*)read_buf + r; read_buf = (uint8_t*)read_buf + r;
} }
return done; return done;
} }
size_t write_blocking(int fd, void *write_buf, size_t remaining) int write_blocking(int fd, void *write_buf, size_t remaining)
{ {
size_t done = 0; size_t done = 0;
while (done < remaining) while (done < remaining)
@ -51,7 +51,7 @@ size_t write_blocking(int fd, void *write_buf, size_t remaining)
} }
continue; continue;
} }
done += (size_t)r; done += r;
write_buf = (uint8_t*)write_buf + r; write_buf = (uint8_t*)write_buf + r;
} }
return done; return done;

View File

@ -6,8 +6,8 @@
#include <unistd.h> #include <unistd.h>
#include <sys/uio.h> #include <sys/uio.h>
size_t read_blocking(int fd, void *read_buf, size_t remaining); int read_blocking(int fd, void *read_buf, size_t remaining);
size_t write_blocking(int fd, void *write_buf, size_t remaining); int write_blocking(int fd, void *write_buf, size_t remaining);
int readv_blocking(int fd, iovec *iov, int iovcnt); int readv_blocking(int fd, iovec *iov, int iovcnt);
int writev_blocking(int fd, iovec *iov, int iovcnt); int writev_blocking(int fd, iovec *iov, int iovcnt);
int sendv_blocking(int fd, iovec *iov, int iovcnt, int flags); int sendv_blocking(int fd, iovec *iov, int iovcnt, int flags);

View File

@ -19,9 +19,12 @@ ANTIETCD=1 ./test_etcd_fail.sh
./test_interrupted_rebalance.sh ./test_interrupted_rebalance.sh
IMMEDIATE_COMMIT=1 ./test_interrupted_rebalance.sh IMMEDIATE_COMMIT=1 ./test_interrupted_rebalance.sh
SCHEME=ec ./test_interrupted_rebalance.sh SCHEME=ec ./test_interrupted_rebalance.sh
SCHEME=ec IMMEDIATE_COMMIT=1 ./test_interrupted_rebalance.sh SCHEME=ec IMMEDIATE_COMMIT=1 ./test_interrupted_rebalance.sh
./test_create_halfhost.sh
./test_failure_domain.sh ./test_failure_domain.sh
./test_snapshot.sh ./test_snapshot.sh

35
tests/test_create_halfhost.sh Executable file
View File

@ -0,0 +1,35 @@
#!/bin/bash -ex
. `dirname $0`/common.sh
node mon/mon-main.js $MON_PARAMS --etcd_address $ETCD_URL --etcd_prefix "/vitastor" >>./testdata/mon.log 2>&1 &
MON_PID=$!
wait_etcd
TIME=$(date '+%s')
$ETCDCTL put /vitastor/config/global '{"placement_levels":{"dc":10,"host":100,"half":105,"osd":110}}'
$ETCDCTL put /vitastor/config/node_placement '{
"h11":{"level":"half","parent":"host1"},
"h12":{"level":"half","parent":"host1"},
"h21":{"level":"half","parent":"host2"},
"h22":{"level":"half","parent":"host2"},
"h31":{"level":"half","parent":"host3"},
"h32":{"level":"half","parent":"host3"},
"1":{"parent":"h11"},
"2":{"parent":"h12"},
"3":{"parent":"h21"},
"4":{"parent":"h22"},
"5":{"parent":"h31"},
"6":{"parent":"h32"}
}'
$ETCDCTL put /vitastor/osd/stats/1 '{"host":"host1","size":1073741824,"time":"'$TIME'"}'
$ETCDCTL put /vitastor/osd/stats/2 '{"host":"host1","size":1073741824,"time":"'$TIME'"}'
$ETCDCTL put /vitastor/osd/stats/3 '{"host":"host2","size":1073741824,"time":"'$TIME'"}'
$ETCDCTL put /vitastor/osd/stats/4 '{"host":"host2","size":1073741824,"time":"'$TIME'"}'
$ETCDCTL put /vitastor/osd/stats/5 '{"host":"host3","size":1073741824,"time":"'$TIME'"}'
$ETCDCTL put /vitastor/osd/stats/6 '{"host":"host3","size":1073741824,"time":"'$TIME'"}'
build/src/cmd/vitastor-cli --etcd_address $ETCD_URL osd-tree
# check that it doesn't fail
build/src/cmd/vitastor-cli --etcd_address $ETCD_URL create-pool testpool --ec 2+1 -n 32
format_green OK

View File

@ -53,14 +53,13 @@ kill_osds &
LD_PRELOAD="build/src/client/libfio_vitastor.so" \ LD_PRELOAD="build/src/client/libfio_vitastor.so" \
fio -thread -name=test -ioengine=build/src/client/libfio_vitastor.so -bsrange=4k-128k -blockalign=4k -direct=1 -iodepth=32 -fsync=256 -rw=randrw \ fio -thread -name=test -ioengine=build/src/client/libfio_vitastor.so -bsrange=4k-128k -blockalign=4k -direct=1 -iodepth=32 -fsync=256 -rw=randrw \
-randrepeat=0 -refill_buffers=1 -mirror_file=./testdata/bin/mirror.bin -etcd=$ETCD_URL -image=testimg -loops=10 -runtime=120 -serialize_overlap=1 -randrepeat=0 -refill_buffers=1 -mirror_file=./testdata/bin/mirror.bin -etcd=$ETCD_URL -image=testimg -loops=10 -runtime=120
qemu-img convert -S 4096 -p \ qemu-img convert -S 4096 -p \
-f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:image=testimg" \ -f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:image=testimg" \
-O raw ./testdata/bin/read.bin -O raw ./testdata/bin/read.bin
if ! diff -q ./testdata/bin/read.bin ./testdata/bin/mirror.bin; then if ! diff -q ./testdata/bin/read.bin ./testdata/bin/mirror.bin; then
build/src/test/bindiff ./testdata/bin/read.bin ./testdata/bin/mirror.bin
format_error Data lost during self-heal format_error Data lost during self-heal
fi fi