Compare commits


1 commit

Author: Vitaliy Filippov
SHA1: a6bf6a2cf0
Message: WIP NFS RDMA support
Date: 2024-11-19 00:57:14 +03:00
23 changed files with 227 additions and 423 deletions

View File

@@ -288,24 +288,6 @@ jobs:
        echo ""
      done

-  test_create_halfhost:
-    runs-on: ubuntu-latest
-    needs: build
-    container: ${{env.TEST_IMAGE}}:${{github.sha}}
-    steps:
-    - name: Run test
-      id: test
-      timeout-minutes: 3
-      run: /root/vitastor/tests/test_create_halfhost.sh
-    - name: Print logs
-      if: always() && steps.test.outcome == 'failure'
-      run: |
-        for i in /root/vitastor/testdata/*.log /root/vitastor/testdata/*.txt; do
-          echo "-------- $i --------"
-          cat $i
-          echo ""
-        done
-
  test_failure_domain:
    runs-on: ubuntu-latest
    needs: build

View File

@@ -68,17 +68,11 @@ but they are not connected to the cluster.
 - Type: string
 
 RDMA device name to use for Vitastor OSD communications (for example,
-"rocep5s0f0"). If not specified, Vitastor will try to find an RoCE
-device matching [osd_network](osd.en.md#osd_network), preferring RoCEv2,
-or choose the first available RDMA device if no RoCE devices are
-found or if `osd_network` is not specified. Auto-selection is also
-unsupported with old libibverbs < v32, like in Debian 10 Buster or
-CentOS 7.
+"rocep5s0f0"). Now Vitastor supports all adapters, even ones without
+ODP support, like Mellanox ConnectX-3 and non-Mellanox cards.
 
-Vitastor supports all adapters, even ones without ODP support, like
-Mellanox ConnectX-3 and non-Mellanox cards. Versions up to Vitastor
-1.2.0 required ODP which is only present in Mellanox ConnectX >= 4.
-See also [rdma_odp](#rdma_odp).
+Versions up to Vitastor 1.2.0 required ODP which is only present in
+Mellanox ConnectX >= 4. See also [rdma_odp](#rdma_odp).
 
 Run `ibv_devinfo -v` as root to list available RDMA devices and their
 features.
@@ -101,17 +95,15 @@ your device has.
 ## rdma_gid_index
 
 - Type: integer
+- Default: 0
 
 Global address identifier index of the RDMA device to use. Different GID
 indexes may correspond to different protocols like RoCEv1, RoCEv2 and iWARP.
 Search for "GID" in `ibv_devinfo -v` output to determine which GID index
 you need.
 
-If not specified, Vitastor will try to auto-select a RoCEv2 IPv4 GID, then
-RoCEv2 IPv6 GID, then RoCEv1 IPv4 GID, then RoCEv1 IPv6 GID, then IB GID.
-GID auto-selection is unsupported with libibverbs < v32.
-
-A correct rdma_gid_index for RoCEv2 is usually 1 (IPv6) or 3 (IPv4).
+**IMPORTANT:** If you want to use RoCEv2 (as recommended) then the correct
+rdma_gid_index is usually 1 (IPv6) or 3 (IPv4).
 
 ## rdma_mtu
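For reference, here is a minimal sketch of how the two options documented above might be set in a client/OSD configuration file (for example /etc/vitastor/vitastor.conf). The device name and GID index below are placeholders: take them from the `ibv_devinfo -v` output on your own hosts.

```json
{
  "rdma_device": "rocep5s0f0",
  "rdma_gid_index": 3
}
```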

View File

@@ -71,17 +71,12 @@ RDMA может быть нужно только если у клиентов е
 - Тип: строка
 
 Название RDMA-устройства для связи с Vitastor OSD (например, "rocep5s0f0").
-Если не указано, Vitastor попробует найти RoCE-устройство, соответствующее
-[osd_network](osd.en.md#osd_network), предпочитая RoCEv2, или выбрать первое
-попавшееся RDMA-устройство, если RoCE-устройств нет или если сеть `osd_network`
-не задана. Также автовыбор не поддерживается со старыми версиями библиотеки
-libibverbs < v32, например в Debian 10 Buster или CentOS 7.
-
-Vitastor поддерживает все модели адаптеров, включая те, у которых
+Сейчас Vitastor поддерживает все модели адаптеров, включая те, у которых
 нет поддержки ODP, то есть вы можете использовать RDMA с ConnectX-3 и
-картами производства не Mellanox. Версии Vitastor до 1.2.0 включительно
-требовали ODP, который есть только на Mellanox ConnectX 4 и более новых.
-См. также [rdma_odp](#rdma_odp).
+картами производства не Mellanox.
+
+Версии Vitastor до 1.2.0 включительно требовали ODP, который есть только
+на Mellanox ConnectX 4 и более новых. См. также [rdma_odp](#rdma_odp).
 
 Запустите `ibv_devinfo -v` от имени суперпользователя, чтобы посмотреть
 список доступных RDMA-устройств, их параметры и возможности.
@@ -106,18 +101,15 @@ Control) и ECN (Explicit Congestion Notification).
 ## rdma_gid_index
 
 - Тип: целое число
+- Значение по умолчанию: 0
 
 Номер глобального идентификатора адреса RDMA-устройства, который следует
 использовать. Разным gid_index могут соответствовать разные протоколы связи:
 RoCEv1, RoCEv2, iWARP. Чтобы понять, какой нужен вам - смотрите строчки со
 словом "GID" в выводе команды `ibv_devinfo -v`.
 
-Если не указан, Vitastor попробует автоматически выбрать сначала GID,
-соответствующий RoCEv2 IPv4, потом RoCEv2 IPv6, потом RoCEv1 IPv4, потом
-RoCEv1 IPv6, потом IB. Авто-выбор GID не поддерживается со старыми версиями
-libibverbs < v32.
-
-Правильный rdma_gid_index для RoCEv2, как правило, 1 (IPv6) или 3 (IPv4).
+**ВАЖНО:** Если вы хотите использовать RoCEv2 (как мы и рекомендуем), то
+правильный rdma_gid_index, как правило, 1 (IPv6) или 3 (IPv4).
 
 ## rdma_mtu

View File

@@ -48,17 +48,11 @@
   type: string
   info: |
     RDMA device name to use for Vitastor OSD communications (for example,
-    "rocep5s0f0"). If not specified, Vitastor will try to find an RoCE
-    device matching [osd_network](osd.en.md#osd_network), preferring RoCEv2,
-    or choose the first available RDMA device if no RoCE devices are
-    found or if `osd_network` is not specified. Auto-selection is also
-    unsupported with old libibverbs < v32, like in Debian 10 Buster or
-    CentOS 7.
+    "rocep5s0f0"). Now Vitastor supports all adapters, even ones without
+    ODP support, like Mellanox ConnectX-3 and non-Mellanox cards.
 
-    Vitastor supports all adapters, even ones without ODP support, like
-    Mellanox ConnectX-3 and non-Mellanox cards. Versions up to Vitastor
-    1.2.0 required ODP which is only present in Mellanox ConnectX >= 4.
-    See also [rdma_odp](#rdma_odp).
+    Versions up to Vitastor 1.2.0 required ODP which is only present in
+    Mellanox ConnectX >= 4. See also [rdma_odp](#rdma_odp).
 
     Run `ibv_devinfo -v` as root to list available RDMA devices and their
     features.
@@ -70,17 +64,12 @@
     PFC (Priority Flow Control) and ECN (Explicit Congestion Notification).
   info_ru: |
     Название RDMA-устройства для связи с Vitastor OSD (например, "rocep5s0f0").
-    Если не указано, Vitastor попробует найти RoCE-устройство, соответствующее
-    [osd_network](osd.en.md#osd_network), предпочитая RoCEv2, или выбрать первое
-    попавшееся RDMA-устройство, если RoCE-устройств нет или если сеть `osd_network`
-    не задана. Также автовыбор не поддерживается со старыми версиями библиотеки
-    libibverbs < v32, например в Debian 10 Buster или CentOS 7.
-
-    Vitastor поддерживает все модели адаптеров, включая те, у которых
+    Сейчас Vitastor поддерживает все модели адаптеров, включая те, у которых
     нет поддержки ODP, то есть вы можете использовать RDMA с ConnectX-3 и
-    картами производства не Mellanox. Версии Vitastor до 1.2.0 включительно
-    требовали ODP, который есть только на Mellanox ConnectX 4 и более новых.
-    См. также [rdma_odp](#rdma_odp).
+    картами производства не Mellanox.
+
+    Версии Vitastor до 1.2.0 включительно требовали ODP, который есть только
+    на Mellanox ConnectX 4 и более новых. См. также [rdma_odp](#rdma_odp).
 
     Запустите `ibv_devinfo -v` от имени суперпользователя, чтобы посмотреть
     список доступных RDMA-устройств, их параметры и возможности.
@@ -105,29 +94,23 @@
     `ibv_devinfo -v`.
 - name: rdma_gid_index
   type: int
+  default: 0
   info: |
     Global address identifier index of the RDMA device to use. Different GID
     indexes may correspond to different protocols like RoCEv1, RoCEv2 and iWARP.
     Search for "GID" in `ibv_devinfo -v` output to determine which GID index
     you need.
 
-    If not specified, Vitastor will try to auto-select a RoCEv2 IPv4 GID, then
-    RoCEv2 IPv6 GID, then RoCEv1 IPv4 GID, then RoCEv1 IPv6 GID, then IB GID.
-    GID auto-selection is unsupported with libibverbs < v32.
-
-    A correct rdma_gid_index for RoCEv2 is usually 1 (IPv6) or 3 (IPv4).
+    **IMPORTANT:** If you want to use RoCEv2 (as recommended) then the correct
+    rdma_gid_index is usually 1 (IPv6) or 3 (IPv4).
   info_ru: |
     Номер глобального идентификатора адреса RDMA-устройства, который следует
     использовать. Разным gid_index могут соответствовать разные протоколы связи:
     RoCEv1, RoCEv2, iWARP. Чтобы понять, какой нужен вам - смотрите строчки со
     словом "GID" в выводе команды `ibv_devinfo -v`.
 
-    Если не указан, Vitastor попробует автоматически выбрать сначала GID,
-    соответствующий RoCEv2 IPv4, потом RoCEv2 IPv6, потом RoCEv1 IPv4, потом
-    RoCEv1 IPv6, потом IB. Авто-выбор GID не поддерживается со старыми версиями
-    libibverbs < v32.
-
-    Правильный rdma_gid_index для RoCEv2, как правило, 1 (IPv6) или 3 (IPv4).
+    **ВАЖНО:** Если вы хотите использовать RoCEv2 (как мы и рекомендуем), то
+    правильный rdma_gid_index, как правило, 1 (IPv6) или 3 (IPv4).
 - name: rdma_mtu
   type: int
   default: 4096

View File

@@ -232,7 +232,6 @@ class EtcdAdapter
     async become_master()
     {
         const state = { ...this.mon.get_mon_state(), id: ''+this.mon.etcd_lease_id };
-        console.log('Waiting to become master');
         // eslint-disable-next-line no-constant-condition
         while (1)
         {
@@ -244,6 +243,7 @@ class EtcdAdapter
             {
                 break;
             }
+            console.log('Waiting to become master');
             await new Promise(ok => setTimeout(ok, this.mon.config.etcd_start_timeout));
         }
         console.log('Became master');

View File

@@ -56,7 +56,6 @@ const etcd_tree = {
         osd_out_time: 600, // seconds. min: 0
         placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
         use_old_pg_combinator: false,
-        osd_backfillfull_ratio: 0.99,
         // client and osd
         tcp_header_buffer_size: 65536,
         use_sync_send_recv: false,

View File

@@ -74,7 +74,6 @@ class Mon
         this.state = JSON.parse(JSON.stringify(etcd_tree));
         this.prev_stats = { osd_stats: {}, osd_diff: {} };
         this.recheck_pgs_active = false;
-        this.updating_total_stats = false;
         this.watcher_active = false;
         this.old_pg_config = false;
         this.old_pg_stats_seen = false;
@@ -659,19 +658,7 @@ class Mon
                 this.etcd_watch_revision, pool_id, up_osds, osd_tree, real_prev_pgs, pool_res.pgs, pg_history);
         }
         new_pg_config.hash = tree_hash;
-        const { backfillfull_pools, backfillfull_osds } = sum_object_counts(
-            { ...this.state, pg: { ...this.state.pg, config: new_pg_config } }, this.config
-        );
-        if (backfillfull_pools.join(',') != ((this.state.pg.config||{}).backfillfull_pools||[]).join(','))
-        {
-            this.log_backfillfull(backfillfull_osds, backfillfull_pools);
-        }
-        new_pg_config.backfillfull_pools = backfillfull_pools.length ? backfillfull_pools : undefined;
-        if (!await this.save_pg_config(new_pg_config, etcd_request))
-        {
-            return false;
-        }
-        return true;
+        return await this.save_pg_config(new_pg_config, etcd_request);
     }
 
     async save_pg_config(new_pg_config, etcd_request = { compare: [], success: [] })
@@ -743,7 +730,7 @@ class Mon
     async update_total_stats()
     {
         const txn = [];
-        const { object_counts, object_bytes, backfillfull_pools, backfillfull_osds } = sum_object_counts(this.state, this.config);
+        const { object_counts, object_bytes } = sum_object_counts(this.state, this.config);
         let stats = sum_op_stats(this.state.osd, this.prev_stats);
         let { inode_stats, seen_pools } = sum_inode_stats(this.state, this.prev_stats);
         stats.object_counts = object_counts;
@@ -796,27 +783,6 @@ class Mon
         {
             await this.etcd.etcd_call('/kv/txn', { success: txn }, this.config.etcd_mon_timeout, 0);
         }
-        if (!this.recheck_pgs_active &&
-            backfillfull_pools.join(',') != ((this.state.pg.config||{}).backfillfull_pools||[]).join(','))
-        {
-            this.log_backfillfull(backfillfull_osds, backfillfull_pools);
-            const new_pg_config = { ...this.state.pg.config, backfillfull_pools: backfillfull_pools.length ? backfillfull_pools : undefined };
-            await this.save_pg_config(new_pg_config);
-        }
-    }
-
-    log_backfillfull(osds, pools)
-    {
-        for (const osd in osds)
-        {
-            const bf = osds[osd];
-            console.log('OSD '+osd+' may fill up during rebalance: capacity '+(bf.cap/1024n/1024n)+
-                ' MB, target user data '+(bf.clean/1024n/1024n)+' MB');
-        }
-        console.log(
-            (pools.length ? 'Pool(s) '+pools.join(', ') : 'No pools')+
-            ' are backfillfull now, applying rebalance configuration'
-        );
     }
 
     schedule_update_stats()
@@ -828,21 +794,7 @@ class Mon
         this.stats_timer = setTimeout(() =>
         {
             this.stats_timer = null;
-            if (this.updating_total_stats)
-            {
-                this.schedule_update_stats();
-                return;
-            }
-            this.updating_total_stats = true;
-            try
-            {
             this.update_total_stats().catch(console.error);
-            }
-            catch (e)
-            {
-                console.error(e);
-            }
-            this.updating_total_stats = false;
         }, this.config.mon_stats_timeout);
     }

View File

@@ -109,8 +109,6 @@ function sum_object_counts(state, global_config)
             pgstats[pool_id] = { ...(state.pg.stats[pool_id] || {}), ...(pgstats[pool_id] || {}) };
         }
     }
-    const pool_per_osd = {};
-    const clean_per_osd = {};
     for (const pool_id in pgstats)
     {
         let object_size = 0;
@@ -145,45 +143,10 @@ function sum_object_counts(state, global_config)
                     object_bytes[k] += BigInt(st[k+'_count']) * object_size;
                 }
             }
-            if (st.object_count)
-            {
-                for (const pg_osd of (((state.pg.config.items||{})[pool_id]||{})[pg_num]||{}).osd_set||[])
-                {
-                    if (!(pg_osd in clean_per_osd))
-                    {
-                        clean_per_osd[pg_osd] = 0n;
-                    }
-                    clean_per_osd[pg_osd] += BigInt(st.object_count);
-                    pool_per_osd[pg_osd] = pool_per_osd[pg_osd]||{};
-                    pool_per_osd[pg_osd][pool_id] = true;
-                }
-            }
         }
     }
-    // If clean_per_osd[osd] is larger than osd capacity then it will fill up during rebalance
-    let backfillfull_pools = {};
-    let backfillfull_osds = {};
-    for (const osd in clean_per_osd)
-    {
-        const st = state.osd.stats[osd];
-        if (!st || !st.size || !st.data_block_size)
-        {
-            continue;
-        }
-        let cap = BigInt(st.size)/BigInt(st.data_block_size);
-        cap = cap * BigInt((global_config.osd_backfillfull_ratio||0.99)*1000000) / 1000000n;
-        if (cap < clean_per_osd[osd])
-        {
-            backfillfull_osds[osd] = { cap: BigInt(st.size), clean: clean_per_osd[osd]*BigInt(st.data_block_size) };
-            for (const pool_id in pool_per_osd[osd])
-            {
-                backfillfull_pools[pool_id] = true;
-            }
-        }
-    }
-    backfillfull_pools = Object.keys(backfillfull_pools).sort();
-    return { object_counts, object_bytes, backfillfull_pools, backfillfull_osds };
+    return { object_counts, object_bytes };
 }
 
 // sum_inode_stats(this.state, this.prev_stats)
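The arithmetic behind the backfillfull check removed above is easy to lose in diff form, so here is a self-contained sketch of the same idea (illustration only; `osd_size`, `data_block_size` and `target_clean_objects` are hypothetical inputs, not names from the monitor code):

```js
// Sketch of the removed check: would this OSD overflow during rebalance?
function is_backfillfull(osd_size, data_block_size, target_clean_objects, ratio = 0.99)
{
    // OSD capacity measured in data blocks
    let cap = BigInt(osd_size) / BigInt(data_block_size);
    // apply the safety margin (osd_backfillfull_ratio, 0.99 by default)
    cap = cap * BigInt(Math.round(ratio*1000000)) / 1000000n;
    // "backfillfull" means the target placement puts more clean objects on the OSD than it can hold
    return cap < BigInt(target_clean_objects);
}

// Example: a 1 TB OSD with 128 KiB blocks holds ~7.6M blocks, ~7.55M with the 0.99 margin,
// so a target of 9M clean objects would mark it (and the pools using it) as backfillfull.
console.log(is_backfillfull(1000000000000, 131072, 9000000)); // true
```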

View File

@@ -785,7 +785,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
         }
         for (auto & pool_item: value.object_items())
         {
-            pool_config_t pc = {};
+            pool_config_t pc;
             // ID
             pool_id_t pool_id;
             char null_byte = 0;
@@ -931,28 +931,12 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
             // Ignore old key if the new one is present
             return;
         }
-        for (auto & pool_id_json: value["backfillfull_pools"].array_items())
-        {
-            auto pool_id = pool_id_json.uint64_value();
-            auto pool_it = this->pool_config.find(pool_id);
-            if (pool_it != this->pool_config.end())
-            {
-                pool_it->second.backfillfull |= 2;
-            }
-        }
         for (auto & pool_item: this->pool_config)
         {
             for (auto & pg_item: pool_item.second.pg_config)
             {
                 pg_item.second.config_exists = false;
             }
-            // 3 = was 1 and became 1, 0 = was 0 and became 0
-            if (pool_item.second.backfillfull == 2 || pool_item.second.backfillfull == 1)
-            {
-                if (on_change_backfillfull_hook)
-                    on_change_backfillfull_hook(pool_item.first);
-            }
-            pool_item.second.backfillfull = pool_item.second.backfillfull >> 1;
         }
         for (auto & pool_item: value["items"].object_items())
         {

View File

@@ -62,7 +62,6 @@ struct pool_config_t
     std::map<pg_num_t, pg_config_t> pg_config;
     uint64_t scrub_interval;
     std::string used_for_fs;
-    int backfillfull;
 };
 
 struct inode_config_t
@@ -132,7 +131,6 @@ public:
     std::function<json11::Json()> load_pgs_checks_hook;
     std::function<void(bool)> on_load_pgs_hook;
     std::function<void()> on_change_pool_config_hook;
-    std::function<void(pool_id_t)> on_change_backfillfull_hook;
     std::function<void(pool_id_t, pg_num_t, osd_num_t)> on_change_pg_state_hook;
     std::function<void(pool_id_t, pg_num_t)> on_change_pg_history_hook;
     std::function<void(osd_num_t)> on_change_osd_state_hook;

View File

@@ -70,7 +70,6 @@ msgr_rdma_connection_t::~msgr_rdma_connection_t()
     send_out_size = 0;
 }
 
-#ifdef IBV_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT
 static bool is_ipv4_gid(ibv_gid_entry *gidx)
 {
     return (((uint64_t*)gidx->gid.raw)[0] == 0 &&
@@ -132,7 +131,6 @@ static matched_dev match_device(ibv_device **dev_list, addr_mask_t *networks, in
     ibv_port_attr portinfo;
     ibv_gid_entry best_gidx;
     int res;
-    bool have_non_roce = false, have_roce = false;
     for (int i = 0; dev_list[i]; ++i)
     {
         auto dev = dev_list[i];
@@ -163,11 +161,6 @@ static matched_dev match_device(ibv_device **dev_list, addr_mask_t *networks, in
                 else
                     break;
             }
-            if (gidx.gid_type != IBV_GID_TYPE_ROCE_V1 &&
-                gidx.gid_type != IBV_GID_TYPE_ROCE_V2)
-                have_non_roce = true;
-            else
-                have_roce = true;
             if (match_gid(&gidx, networks, nnet))
             {
                 // Prefer RoCEv2
@@ -193,13 +186,8 @@ cleanup:
     {
         log_rdma_dev_port_gid(dev_list[best.dev], best.port, best.gid, best_gidx);
     }
-    if (best.dev < 0 && have_non_roce && !have_roce)
-    {
-        best.dev = -2;
-    }
     return best;
 }
-#endif
 
 msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_networks, const char *ib_devname, uint8_t ib_port, uint8_t gid_index, uint32_t mtu, bool odp, int log_level)
 {
@@ -241,7 +229,6 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_ne
             goto cleanup;
         }
     }
-#ifdef IBV_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT
     else if (osd_networks.size())
     {
         std::vector<addr_mask_t> nets;
@@ -250,17 +237,11 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_ne
             nets.push_back(cidr_parse(netstr));
         }
         auto best = match_device(dev_list, nets.data(), nets.size(), log_level);
-        if (best.dev == -2)
+        if (best.dev < 0)
         {
+            if (log_level > 0)
+                fprintf(stderr, "RDMA device matching osd_network is not found, using first available device\n");
             best.dev = 0;
-            if (log_level > 0)
-                fprintf(stderr, "No RoCE devices found, using first available RDMA device %s\n", ibv_get_device_name(*dev_list));
-        }
-        else if (best.dev < 0)
-        {
-            if (log_level > 0)
-                fprintf(stderr, "RDMA device matching osd_network is not found, disabling RDMA\n");
-            goto cleanup;
         }
         else
         {
@@ -269,7 +250,6 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_ne
         }
         ctx->dev = dev_list[best.dev];
     }
-#endif
     else
     {
         ctx->dev = *dev_list;
@@ -295,22 +275,18 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_ne
         goto cleanup;
     }
 
-#ifdef IBV_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT
     if (gid_index != -1)
-#endif
     {
-        ctx->gid_index = gid_index < 0 ? 0 : gid_index;
-        if (ibv_query_gid(ctx->context, ib_port, gid_index, &ctx->my_gid))
+        ctx->gid_index = gid_index;
+        if (ibv_query_gid_ex(ctx->context, ib_port, gid_index, &ctx->my_gid, 0))
         {
             fprintf(stderr, "Couldn't read RDMA device %s GID index %d\n", ibv_get_device_name(ctx->dev), gid_index);
             goto cleanup;
         }
     }
-#ifdef IBV_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT
     else
     {
         // Auto-guess GID
-        ibv_gid_entry best_gidx;
         for (int k = 0; k < ctx->portinfo.gid_tbl_len; k++)
         {
             ibv_gid_entry gidx;
@@ -325,24 +301,21 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_ne
             {
                 continue;
             }
-            // Prefer IPv4 RoCEv2 -> IPv6 RoCEv2 -> IPv4 RoCEv1 -> IPv6 RoCEv1 -> IB
+            // Prefer IPv4 RoCEv2 GID by default
             if (gid_index == -1 ||
-                gidx.gid_type == IBV_GID_TYPE_ROCE_V2 && best_gidx.gid_type != IBV_GID_TYPE_ROCE_V2 ||
-                gidx.gid_type == IBV_GID_TYPE_ROCE_V1 && best_gidx.gid_type == IBV_GID_TYPE_IB ||
-                gidx.gid_type == best_gidx.gid_type && is_ipv4_gid(&gidx))
+                gidx.gid_type == IBV_GID_TYPE_ROCE_V2 &&
+                (ctx->my_gid.gid_type != IBV_GID_TYPE_ROCE_V2 || is_ipv4_gid(&gidx)))
             {
                 gid_index = k;
-                best_gidx = gidx;
+                ctx->my_gid = gidx;
             }
         }
         ctx->gid_index = gid_index = (gid_index == -1 ? 0 : gid_index);
         if (log_level > 0)
         {
-            log_rdma_dev_port_gid(ctx->dev, ctx->ib_port, ctx->gid_index, best_gidx);
+            log_rdma_dev_port_gid(ctx->dev, ctx->ib_port, ctx->gid_index, ctx->my_gid);
         }
-        ctx->my_gid = best_gidx.gid;
     }
-#endif
 
     ctx->pd = ibv_alloc_pd(ctx->context);
     if (!ctx->pd)
@@ -458,7 +431,7 @@ msgr_rdma_connection_t *msgr_rdma_connection_t::create(msgr_rdma_context_t *ctx,
     }
 
     conn->addr.lid = ctx->my_lid;
-    conn->addr.gid = ctx->my_gid;
+    conn->addr.gid = ctx->my_gid.gid;
     conn->addr.qpn = conn->qp->qp_num;
     conn->addr.psn = lrand48() & 0xffffff;

View File

@@ -31,7 +31,7 @@ struct msgr_rdma_context_t
     uint8_t ib_port;
     uint8_t gid_index;
     uint16_t my_lid;
-    ibv_gid my_gid;
+    ibv_gid_entry my_gid;
     uint32_t mtu;
     int max_cqe = 0;
     int used_max_cqe = 0;

View File

@@ -64,7 +64,7 @@ static void netlink_sock_alloc(struct netlink_ctx *ctx)
     if (nl_driver_id < 0)
     {
         nl_socket_free(sk);
-        fail("Couldn't resolve the nbd netlink family: %s (code %d)\n", nl_geterror(nl_driver_id), nl_driver_id);
+        fail("Couldn't resolve the nbd netlink family\n");
     }
 
     ctx->driver_id = nl_driver_id;
@@ -555,12 +555,7 @@ help:
         fcntl(sockfd[0], F_SETFL, fcntl(sockfd[0], F_GETFL, 0) | O_NONBLOCK);
         nbd_fd = sockfd[0];
         load_module();
         bool bg = cfg["foreground"].is_null();
-        if (cfg["logfile"].string_value() != "")
-        {
-            logfile = cfg["logfile"].string_value();
-        }
 
         if (netlink)
         {
@@ -593,10 +588,6 @@ help:
             }
             close(sockfd[1]);
             printf("/dev/nbd%d\n", err);
-            if (bg)
-            {
-                daemonize_reopen_stdio();
-            }
 #else
             fprintf(stderr, "netlink support is disabled in this build\n");
             exit(1);
@@ -640,11 +631,15 @@ help:
                 }
             }
         }
+        }
+        if (cfg["logfile"].string_value() != "")
+        {
+            logfile = cfg["logfile"].string_value();
+        }
         if (bg)
         {
             daemonize();
         }
-        }
         // Initialize read state
         read_state = CL_READ_HDR;
         recv_buf = malloc_or_die(receive_buffer_size);
@@ -721,17 +716,13 @@ help:
         }
     }
 
-    void daemonize_fork()
+    void daemonize()
     {
         if (fork())
             exit(0);
         setsid();
         if (fork())
             exit(0);
-    }
-
-    void daemonize_reopen_stdio()
-    {
         close(0);
         close(1);
         close(2);
@@ -742,12 +733,6 @@ help:
             fprintf(stderr, "Warning: Failed to chdir into /\n");
     }
 
-    void daemonize()
-    {
-        daemonize_fork();
-        daemonize_reopen_stdio();
-    }
-
     json11::Json::object list_mapped()
     {
         const char *self_filename = exe_name;
@@ -798,9 +783,8 @@ help:
         if (!strcmp(pid_filename, self_filename))
         {
             json11::Json::object cfg = nbd_proxy::parse_args(argv.size(), argv.data());
-            if (cfg["command"] == "map" || cfg["command"] == "netlink-map")
+            if (cfg["command"] == "map")
             {
-                cfg["interface"] = (cfg["command"] == "netlink-map") ? "netlink" : "nbd";
                 cfg.erase("command");
                 cfg["pid"] = pid;
                 mapped["/dev/nbd"+std::to_string(dev_num)] = cfg;

View File

@@ -261,7 +261,6 @@ struct dd_out_info_t
         else
         {
             // ok
-            out_size = owatch->cfg.size;
             return true;
         }
         // Wait for sub-command
@@ -883,8 +882,6 @@ resume_2:
     oinfo.end_fsync = oinfo.end_fsync && oinfo.out_seekable;
     read_offset = 0;
     read_end = iinfo.in_seekable ? iinfo.in_size-iseek : 0;
-    if (oinfo.out_size && (!read_end || read_end > oinfo.out_size-oseek))
-        read_end = oinfo.out_size-oseek;
     if (bytelimit && (!read_end || read_end > bytelimit))
         read_end = bytelimit;
     clock_gettime(CLOCK_REALTIME, &tv_begin);

View File

@@ -35,7 +35,6 @@ struct pool_creator_t
     uint64_t new_pools_mod_rev;
     json11::Json state_node_tree;
     json11::Json new_pools;
-    std::map<osd_num_t, json11::Json> osd_stats;
 
     bool is_done() { return state == 100; }
@@ -47,6 +46,8 @@ struct pool_creator_t
             goto resume_2;
         else if (state == 3)
             goto resume_3;
+        else if (state == 4)
+            goto resume_4;
         else if (state == 5)
             goto resume_5;
         else if (state == 6)
@@ -89,19 +90,13 @@ resume_1:
         // If not forced, check that we have enough osds for pg_size
         if (!force)
        {
-            // Get node_placement configuration from etcd and OSD stats
+            // Get node_placement configuration from etcd
             parent->etcd_txn(json11::Json::object {
                 { "success", json11::Json::array {
                     json11::Json::object {
                         { "request_range", json11::Json::object {
                             { "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/node_placement") },
-                        } },
-                    },
-                    json11::Json::object {
-                        { "request_range", json11::Json::object {
-                            { "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/osd/stats/") },
-                            { "range_end", base64_encode(parent->cli->st_cli.etcd_prefix+"/osd/stats0") },
                         } },
                     },
                 } },
             });
@@ -117,21 +112,10 @@ resume_2:
             return;
         }
-        // Get state_node_tree based on node_placement and osd stats
+        // Get state_node_tree based on node_placement and osd peer states
         {
-            auto node_placement_kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
-            timespec tv_now;
-            clock_gettime(CLOCK_REALTIME, &tv_now);
-            uint64_t osd_out_time = parent->cli->config["osd_out_time"].uint64_value();
-            if (!osd_out_time)
-                osd_out_time = 600;
-            osd_stats.clear();
-            parent->iterate_kvs_1(parent->etcd_result["responses"][1]["response_range"]["kvs"], "/osd/stats/", [&](uint64_t cur_osd, json11::Json value)
-            {
-                if ((uint64_t)value["time"].number_value()+osd_out_time >= tv_now.tv_sec)
-                    osd_stats[cur_osd] = value;
-            });
-            state_node_tree = get_state_node_tree(node_placement_kv.value.object_items(), osd_stats);
+            auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
+            state_node_tree = get_state_node_tree(kv.value.object_items());
         }
 
         // Skip tag checks, if pool has none
@@ -174,18 +158,42 @@ resume_3:
             }
        }
 
+        // Get stats (for block_size, bitmap_granularity, ...) of osds in state_node_tree
+        {
+            json11::Json::array osd_stats;
+            for (auto osd_num: state_node_tree["osds"].array_items())
+            {
+                osd_stats.push_back(json11::Json::object {
+                    { "request_range", json11::Json::object {
+                        { "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/osd/stats/"+osd_num.as_string()) },
+                    } }
+                });
+            }
+            parent->etcd_txn(json11::Json::object{ { "success", osd_stats } });
+        }
+
+        state = 4;
+resume_4:
+        if (parent->waiting > 0)
+            return;
+        if (parent->etcd_err.err)
+        {
+            result = parent->etcd_err;
+            state = 100;
+            return;
+        }
+
         // Filter osds from state_node_tree based on pool parameters and osd stats
         {
-            std::vector<json11::Json> filtered_osd_stats;
-            for (auto & osd_num: state_node_tree["osds"].array_items())
+            std::vector<json11::Json> osd_stats;
+            for (auto & ocr: parent->etcd_result["responses"].array_items())
             {
-                auto st_it = osd_stats.find(osd_num.uint64_value());
-                if (st_it != osd_stats.end())
-                {
-                    filtered_osd_stats.push_back(st_it->second);
-                }
+                auto kv = parent->cli->st_cli.parse_etcd_kv(ocr["response_range"]["kvs"][0]);
+                osd_stats.push_back(kv.value);
             }
-            guess_block_size(filtered_osd_stats);
+            guess_block_size(osd_stats);
             state_node_tree = filter_state_node_tree_by_stats(state_node_tree, osd_stats);
         }
@@ -193,7 +201,8 @@ resume_3:
         {
             auto failure_domain = cfg["failure_domain"].string_value() == ""
                 ? "host" : cfg["failure_domain"].string_value();
-            uint64_t max_pg_size = get_max_pg_size(state_node_tree, failure_domain, cfg["root_node"].string_value());
+            uint64_t max_pg_size = get_max_pg_size(state_node_tree["nodes"].object_items(),
+                failure_domain, cfg["root_node"].string_value());
 
             if (cfg["pg_size"].uint64_value() > max_pg_size)
             {
@@ -349,35 +358,42 @@ resume_8:
     // Returns a JSON object of form {"nodes": {...}, "osds": [...]} that
     // contains: all nodes (osds, hosts, ...) based on node_placement config
-    // and current osd stats.
-    json11::Json get_state_node_tree(json11::Json::object node_placement, std::map<osd_num_t, json11::Json> & osd_stats)
+    // and current peer state, and a list of active peer osds.
+    json11::Json get_state_node_tree(json11::Json::object node_placement)
     {
-        // Erase non-existing osd nodes from node_placement
+        // Erase non-peer osd nodes from node_placement
         for (auto np_it = node_placement.begin(); np_it != node_placement.end();)
         {
             // Numeric nodes are osds
            osd_num_t osd_num = stoull_full(np_it->first);
-            // If node is osd and its stats do not exist, erase it
-            if (osd_num > 0 && osd_stats.find(osd_num) == osd_stats.end())
+            // If node is osd and it is not in peer states, erase it
+            if (osd_num > 0 &&
+                parent->cli->st_cli.peer_states.find(osd_num) == parent->cli->st_cli.peer_states.end())
+            {
                 node_placement.erase(np_it++);
+            }
             else
                 np_it++;
         }
 
-        // List of osds
-        std::vector<std::string> existing_osds;
-        // Record osds and add missing osds/hosts to np
-        for (auto & ps: osd_stats)
+        // List of peer osds
+        std::vector<std::string> peer_osds;
+        // Record peer osds and add missing osds/hosts to np
+        for (auto & ps: parent->cli->st_cli.peer_states)
         {
             std::string osd_num = std::to_string(ps.first);
-            // Record osd
-            existing_osds.push_back(osd_num);
-            // Add host if necessary
+            // Record peer osd
+            peer_osds.push_back(osd_num);
+            // Add osd, if necessary
+            if (node_placement.find(osd_num) == node_placement.end())
+            {
             std::string osd_host = ps.second["host"].as_string();
+            // Add host, if necessary
            if (node_placement.find(osd_host) == node_placement.end())
            {
                node_placement[osd_host] = json11::Json::object {
@@ -385,14 +401,13 @@ resume_8:
                };
            }
-            // Add osd
            node_placement[osd_num] = json11::Json::object {
-                { "parent", node_placement[osd_num]["parent"].is_null() ? osd_host : node_placement[osd_num]["parent"] },
-                { "level", "osd" },
+                { "parent", osd_host }
            };
        }
+        }
 
-        return json11::Json::object { { "osds", existing_osds }, { "nodes", node_placement } };
+        return json11::Json::object { { "osds", peer_osds }, { "nodes", node_placement } };
     }
 
     // Returns new state_node_tree based on given state_node_tree with osds
@@ -519,13 +534,15 @@ resume_8:
     // filtered out by stats parameters (block_size, bitmap_granularity) in
     // given osd_stats and current pool config.
-    json11::Json filter_state_node_tree_by_stats(const json11::Json & state_node_tree, std::map<osd_num_t, json11::Json> & osd_stats)
+    // Requires: state_node_tree["osds"] must match osd_stats 1-1
+    json11::Json filter_state_node_tree_by_stats(const json11::Json & state_node_tree, std::vector<json11::Json> & osd_stats)
     {
+        auto & osds = state_node_tree["osds"].array_items();
         // Accepted state_node_tree nodes
         auto accepted_nodes = state_node_tree["nodes"].object_items();
         // List of accepted osds
-        json11::Json::array accepted_osds;
+        std::vector<std::string> accepted_osds;
 
         block_size = cfg["block_size"].uint64_value()
             ? cfg["block_size"].uint64_value()
@@ -537,25 +554,21 @@ resume_8:
             ? etcd_state_client_t::parse_immediate_commit(cfg["immediate_commit"].string_value(), IMMEDIATE_ALL)
             : parent->cli->st_cli.global_immediate_commit;
 
-        for (auto osd_num_json: state_node_tree["osds"].array_items())
+        for (size_t i = 0; i < osd_stats.size(); i++)
         {
-            auto osd_num = osd_num_json.uint64_value();
-            auto os_it = osd_stats.find(osd_num);
-            if (os_it == osd_stats.end())
-            {
-                continue;
-            }
-            auto & os = os_it->second;
+            auto & os = osd_stats[i];
+            // Get osd number
+            auto osd_num = osds[i].as_string();
             if (!os["data_block_size"].is_null() && os["data_block_size"] != block_size ||
                 !os["bitmap_granularity"].is_null() && os["bitmap_granularity"] != bitmap_granularity ||
                 !os["immediate_commit"].is_null() &&
                 etcd_state_client_t::parse_immediate_commit(os["immediate_commit"].string_value(), IMMEDIATE_NONE) < immediate_commit)
            {
-                accepted_nodes.erase(osd_num_json.as_string());
+                accepted_nodes.erase(osd_num);
            }
            else
            {
-                accepted_osds.push_back(osd_num_json);
+                accepted_osds.push_back(osd_num);
            }
        }
@@ -563,28 +576,87 @@ resume_8:
     }
 
     // Returns maximum pg_size possible for given node_tree and failure_domain, starting at parent_node
-    uint64_t get_max_pg_size(json11::Json state_node_tree, const std::string & level, const std::string & root_node)
-    {
-        std::set<std::string> level_seen;
-        for (auto & osd: state_node_tree["osds"].array_items())
-        {
-            // find OSD parent at <level>, but stop at <root_node>
-            auto cur_id = osd.string_value();
-            auto cur = state_node_tree["nodes"][cur_id];
-            while (!cur.is_null())
-            {
-                if (cur["level"] == level)
-                {
-                    level_seen.insert(cur_id);
-                    break;
-                }
-                if (cur_id == root_node)
-                    break;
-                cur_id = cur["parent"].string_value();
-                cur = state_node_tree["nodes"][cur_id];
-            }
-        }
-        return level_seen.size();
-    }
+    uint64_t get_max_pg_size(json11::Json::object node_tree, const std::string & level, const std::string & parent_node)
+    {
+        uint64_t max_pg_sz = 0;
+
+        std::vector<std::string> nodes;
+
+        // Check if parent node is an osd (numeric)
+        if (parent_node != "" && stoull_full(parent_node))
+        {
+            // Add it to node list if osd is in node tree
+            if (node_tree.find(parent_node) != node_tree.end())
+                nodes.push_back(parent_node);
+        }
+        // If parent node given, ...
+        else if (parent_node != "")
+        {
+            // ... look for children nodes of this parent
+            for (auto & sn: node_tree)
+            {
+                auto & props = sn.second.object_items();
+
+                auto parent_prop = props.find("parent");
+                if (parent_prop != props.end() && (parent_prop->second.as_string() == parent_node))
+                {
+                    nodes.push_back(sn.first);
+
+                    // If we're not looking for all osds, we only need a single
+                    // child osd node
+                    if (level != "osd" && stoull_full(sn.first))
+                        break;
+                }
+            }
+        }
+        // No parent node given, and we're not looking for all osds
+        else if (level != "osd")
+        {
+            // ... look for all level nodes
+            for (auto & sn: node_tree)
+            {
+                auto & props = sn.second.object_items();
+
+                auto level_prop = props.find("level");
+                if (level_prop != props.end() && (level_prop->second.as_string() == level))
+                {
+                    nodes.push_back(sn.first);
+                }
+            }
+        }
+        // Otherwise, ...
+        else
+        {
+            // ... we're looking for osd nodes only
+            for (auto & sn: node_tree)
+            {
+                if (stoull_full(sn.first))
+                {
+                    nodes.push_back(sn.first);
+                }
+            }
+        }
+
+        // Process gathered nodes
+        for (auto & node: nodes)
+        {
+            // Check for osd node, return constant max size
+            if (stoull_full(node))
+            {
+                max_pg_sz += 1;
+            }
+            // Otherwise, ...
+            else
+            {
+                // ... exclude parent node from tree, and ...
+                node_tree.erase(parent_node);
+
+                // ... descend onto the resulting tree
+                max_pg_sz += get_max_pg_size(node_tree, level, node);
+            }
+        }
+
+        return max_pg_sz;
+    }
 
     json11::Json create_pool(const etcd_kv_t & kv)
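As a sanity check on the two get_max_pg_size() variants above: both answer "how many replicas can this tree hold with one replica per failure domain". For instance, with failure_domain = "host" and a tree of three hosts carrying two OSDs each, the removed tree-walking version counts the three distinct host-level ancestors of the OSDs, and the added recursive version takes at most one OSD per host subtree, so both return a maximum pg_size of 3.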

View File

@@ -499,19 +499,6 @@ void nfs_proxy_t::check_default_pool()
     }
 }
 
-void nfs_proxy_t::create_client()
-{
-    auto cli = new nfs_client_t();
-    cli->parent = this->proxy;
-    if (kvfs)
-        nfs_kv_procs(cli);
-    else
-        nfs_block_procs(cli);
-    for (auto & fn: pmap.proc_table)
-        cli->proc_table.insert(fn);
-    rpc_clients.insert(cli);
-}
-
 void nfs_proxy_t::do_accept(int listen_fd)
 {
     struct sockaddr_storage addr;
@@ -525,8 +512,18 @@ void nfs_proxy_t::do_accept(int listen_fd)
         fcntl(nfs_fd, F_SETFL, fcntl(nfs_fd, F_GETFL, 0) | O_NONBLOCK);
         int one = 1;
         setsockopt(nfs_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
-        auto cli = this->create_client();
+        auto cli = new nfs_client_t();
+        if (kvfs)
+            nfs_kv_procs(cli);
+        else
+            nfs_block_procs(cli);
+        cli->parent = this;
         cli->nfs_fd = nfs_fd;
+        for (auto & fn: pmap.proc_table)
+        {
+            cli->proc_table.insert(fn);
+        }
+        rpc_clients[nfs_fd] = cli;
         epmgr->tfd->set_fd_handler(nfs_fd, true, [cli](int nfs_fd, int epoll_events)
         {
             // Handle incoming event
@@ -784,7 +781,7 @@ void nfs_client_t::stop()
     if (refs <= 0)
     {
         auto parent = this->parent;
-        parent->rpc_clients.erase(this);
+        parent->rpc_clients.erase(nfs_fd);
         parent->active_connections--;
         parent->epmgr->tfd->set_fd_handler(nfs_fd, true, NULL);
         close(nfs_fd);
@@ -1088,8 +1085,8 @@ void nfs_proxy_t::daemonize()
     // Stop all clients because client I/O sometimes breaks during daemonize
     // I.e. the new process stops receiving events on the old FD
     // It doesn't happen if we call sleep(1) here, but we don't want to call sleep(1)...
-    for (auto & cli: rpc_clients)
-        cli->stop();
+    for (auto & clp: rpc_clients)
+        clp.second->stop();
     if (fork())
         exit(0);
     setsid();

View File

@@ -55,7 +55,7 @@ public:
     vitastorkv_dbw_t *db = NULL;
     kv_fs_state_t *kvfs = NULL;
     block_fs_state_t *blockfs = NULL;
-    std::set<nfs_client_t*> rpc_clients;
+    std::map<int, nfs_client_t*> rpc_clients;
 
     std::vector<XDR*> xdr_pool;
@@ -72,7 +71,6 @@ public:
     void watch_stats();
     void parse_stats(etcd_kv_t & kv);
     void check_default_pool();
-    nfs_client_t* create_client();
     void do_accept(int listen_fd);
     void daemonize();
     void write_pid();
@@ -102,8 +101,6 @@ struct rpc_free_buffer_t
     unsigned size;
 };
 
-struct nfs_rdma_conn_t;
-
 class nfs_client_t
 {
 public:
@@ -111,7 +108,6 @@ public:
     int refs = 0;
     bool stopped = false;
     std::set<rpc_service_proc_t> proc_table;
-    nfs_rdma_conn_t *rdma_conn = NULL;
 
     // <TCP>
     int nfs_fd;

View File

@@ -45,7 +45,6 @@ struct nfs_rdma_context_t
     int max_send = 8, max_recv = 8; // FIXME max_send and max_recv should probably be equal
     uint64_t max_send_size = 256*1024, max_recv_size = 256*1024;
 
-    nfs_proxy_t *proxy = NULL;
     epoll_manager_t *epmgr = NULL;
 
     int max_cqe = 0, used_max_cqe = 0;
@@ -77,7 +76,6 @@ struct nfs_rdma_conn_t
 nfs_rdma_context_t* nfs_proxy_t::create_rdma(const std::string & bind_address, int rdmacm_port)
 {
     nfs_rdma_context_t* self = new nfs_rdma_context_t;
-    self->proxy = this;
     self->epmgr = epmgr;
     self->bind_address = bind_address;
     self->rdmacm_port = rdmacm_port;
@@ -303,8 +301,6 @@ void nfs_rdma_context_t::rdmacm_accept(rdma_cm_event *ev)
         conn->id = ctx->id;
         rdma_connections[ctx->id] = conn;
         rdma_connections_by_qp[conn->id->qp->qp_num];
-        auto cli = this->proxy->create_client();
-        cli->rdma_conn = conn;
     }
 }

View File

@@ -226,7 +226,6 @@ class osd_t
     void parse_config(bool init);
     void init_cluster();
     void on_change_osd_state_hook(osd_num_t peer_osd);
-    void on_change_backfillfull_hook(pool_id_t pool_id);
     void on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num);
     void on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes);
     void on_load_config_hook(json11::Json::object & changes);

View File

@@ -65,7 +65,6 @@ void osd_t::init_cluster()
     st_cli.tfd = tfd;
     st_cli.log_level = log_level;
     st_cli.on_change_osd_state_hook = [this](osd_num_t peer_osd) { on_change_osd_state_hook(peer_osd); };
-    st_cli.on_change_backfillfull_hook = [this](pool_id_t pool_id) { on_change_backfillfull_hook(pool_id); };
     st_cli.on_change_pg_history_hook = [this](pool_id_t pool_id, pg_num_t pg_num) { on_change_pg_history_hook(pool_id, pg_num); };
     st_cli.on_change_hook = [this](std::map<std::string, etcd_kv_t> & changes) { on_change_etcd_state_hook(changes); };
     st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); };
@@ -415,14 +414,6 @@ void osd_t::on_change_osd_state_hook(osd_num_t peer_osd)
     }
 }
 
-void osd_t::on_change_backfillfull_hook(pool_id_t pool_id)
-{
-    if (!(peering_state & (OSD_RECOVERING | OSD_FLUSHING_PGS)))
-    {
-        peering_state = peering_state | OSD_RECOVERING;
-    }
-}
-
 void osd_t::on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes)
 {
     if (changes.find(st_cli.etcd_prefix+"/config/global") != changes.end())

View File

@@ -252,18 +252,10 @@ bool osd_t::pick_next_recovery(osd_recovery_op_t &op)
     auto mask = recovery_last_degraded ? (PG_ACTIVE | PG_HAS_DEGRADED) : (PG_ACTIVE | PG_DEGRADED | PG_HAS_MISPLACED);
     auto check = recovery_last_degraded ? (PG_ACTIVE | PG_HAS_DEGRADED) : (PG_ACTIVE | PG_HAS_MISPLACED);
     // Restart scanning from the same PG as the last time
-restart:
     for (auto pg_it = pgs.lower_bound(recovery_last_pg); pg_it != pgs.end(); pg_it++)
     {
         if ((pg_it->second.state & mask) == check)
         {
-            auto pool_it = st_cli.pool_config.find(pg_it->first.pool_id);
-            if (pool_it != st_cli.pool_config.end() && pool_it->second.backfillfull)
-            {
-                // Skip the pool
-                recovery_last_pg.pool_id++;
-                goto restart;
-            }
             auto & src = recovery_last_degraded ? pg_it->second.degraded_objects : pg_it->second.misplaced_objects;
             assert(src.size() > 0);
             // Restart scanning from the next object

View File

@@ -19,12 +19,9 @@ ANTIETCD=1 ./test_etcd_fail.sh
 ./test_interrupted_rebalance.sh
 IMMEDIATE_COMMIT=1 ./test_interrupted_rebalance.sh
 SCHEME=ec ./test_interrupted_rebalance.sh
 SCHEME=ec IMMEDIATE_COMMIT=1 ./test_interrupted_rebalance.sh
 
-./test_create_halfhost.sh
-
 ./test_failure_domain.sh
 
 ./test_snapshot.sh

View File

@@ -1,35 +0,0 @@
-#!/bin/bash -ex
-
-. `dirname $0`/common.sh
-
-node mon/mon-main.js $MON_PARAMS --etcd_address $ETCD_URL --etcd_prefix "/vitastor" >>./testdata/mon.log 2>&1 &
-MON_PID=$!
-wait_etcd
-
-TIME=$(date '+%s')
-$ETCDCTL put /vitastor/config/global '{"placement_levels":{"dc":10,"host":100,"half":105,"osd":110}}'
-$ETCDCTL put /vitastor/config/node_placement '{
-"h11":{"level":"half","parent":"host1"},
-"h12":{"level":"half","parent":"host1"},
-"h21":{"level":"half","parent":"host2"},
-"h22":{"level":"half","parent":"host2"},
-"h31":{"level":"half","parent":"host3"},
-"h32":{"level":"half","parent":"host3"},
-"1":{"parent":"h11"},
-"2":{"parent":"h12"},
-"3":{"parent":"h21"},
-"4":{"parent":"h22"},
-"5":{"parent":"h31"},
-"6":{"parent":"h32"}
-}'
-$ETCDCTL put /vitastor/osd/stats/1 '{"host":"host1","size":1073741824,"time":"'$TIME'"}'
-$ETCDCTL put /vitastor/osd/stats/2 '{"host":"host1","size":1073741824,"time":"'$TIME'"}'
-$ETCDCTL put /vitastor/osd/stats/3 '{"host":"host2","size":1073741824,"time":"'$TIME'"}'
-$ETCDCTL put /vitastor/osd/stats/4 '{"host":"host2","size":1073741824,"time":"'$TIME'"}'
-$ETCDCTL put /vitastor/osd/stats/5 '{"host":"host3","size":1073741824,"time":"'$TIME'"}'
-$ETCDCTL put /vitastor/osd/stats/6 '{"host":"host3","size":1073741824,"time":"'$TIME'"}'
-
-build/src/cmd/vitastor-cli --etcd_address $ETCD_URL osd-tree
-
-# check that it doesn't fail
-build/src/cmd/vitastor-cli --etcd_address $ETCD_URL create-pool testpool --ec 2+1 -n 32
-
-format_green OK