Compare commits
26 Commits
Author | SHA1 | Date |
---|---|---|
Vitaliy Filippov | d84b84f58d | |
Vitaliy Filippov | 8cfe705d7a | |
Vitaliy Filippov | 66c9271cbd | |
Vitaliy Filippov | 7b37ba921d | |
Vitaliy Filippov | 262c581400 | |
Vitaliy Filippov | ad3b6b7267 | |
Vitaliy Filippov | 1f6a061283 | |
Vitaliy Filippov | fc4d97da10 | |
Vitaliy Filippov | c7a4ce7341 | |
Vitaliy Filippov | ddea31d86d | |
Vitaliy Filippov | 156d005412 | |
Vitaliy Filippov | 7e076c7049 | |
Vitaliy Filippov | 7de38250ad | |
Vitaliy Filippov | 9c59d30e83 | |
Vitaliy Filippov | 5db02cdf6e | |
Vitaliy Filippov | 8202ee9d74 | |
Vitaliy Filippov | 5864bd067c | |
Vitaliy Filippov | c312557ace | |
Vitaliy Filippov | 5ce20116d8 | |
Vitaliy Filippov | be66791e59 | |
Vitaliy Filippov | 141cec2383 | |
Vitaliy Filippov | 1ce4b1b417 | |
Vitaliy Filippov | ebf24bac9a | |
Vitaliy Filippov | edd9051f81 | |
Vitaliy Filippov | 662ca86dc0 | |
Vitaliy Filippov | a1ca573168 | |
@@ -288,6 +288,24 @@ jobs:
           echo ""
         done
 
+  test_create_halfhost:
+    runs-on: ubuntu-latest
+    needs: build
+    container: ${{env.TEST_IMAGE}}:${{github.sha}}
+    steps:
+    - name: Run test
+      id: test
+      timeout-minutes: 3
+      run: /root/vitastor/tests/test_create_halfhost.sh
+    - name: Print logs
+      if: always() && steps.test.outcome == 'failure'
+      run: |
+        for i in /root/vitastor/testdata/*.log /root/vitastor/testdata/*.txt; do
+          echo "-------- $i --------"
+          cat $i
+          echo ""
+        done
+
   test_failure_domain:
     runs-on: ubuntu-latest
     needs: build
@@ -68,11 +68,17 @@ but they are not connected to the cluster.
 - Type: string
 
 RDMA device name to use for Vitastor OSD communications (for example,
-"rocep5s0f0"). Now Vitastor supports all adapters, even ones without
-ODP support, like Mellanox ConnectX-3 and non-Mellanox cards.
+"rocep5s0f0"). If not specified, Vitastor will try to find an RoCE
+device matching [osd_network](osd.en.md#osd_network), preferring RoCEv2,
+or choose the first available RDMA device if no RoCE devices are
+found or if `osd_network` is not specified. Auto-selection is also
+unsupported with old libibverbs < v32, like in Debian 10 Buster or
+CentOS 7.
 
-Versions up to Vitastor 1.2.0 required ODP which is only present in
-Mellanox ConnectX >= 4. See also [rdma_odp](#rdma_odp).
+Vitastor supports all adapters, even ones without ODP support, like
+Mellanox ConnectX-3 and non-Mellanox cards. Versions up to Vitastor
+1.2.0 required ODP which is only present in Mellanox ConnectX >= 4.
+See also [rdma_odp](#rdma_odp).
 
 Run `ibv_devinfo -v` as root to list available RDMA devices and their
 features.
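For reference, the two modes described above boil down to configurations like the following minimal sketch (placed in the cluster configuration, e.g. /etc/vitastor/vitastor.conf; the device name and the CIDR are placeholders for your environment). Explicit device selection:

    { "rdma_device": "rocep5s0f0" }

Or auto-selection against the OSD network, which assumes libibverbs >= v32:

    { "osd_network": "10.200.0.0/24" }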
@@ -95,15 +101,17 @@ your device has.
 ## rdma_gid_index
 
 - Type: integer
-- Default: 0
 
 Global address identifier index of the RDMA device to use. Different GID
 indexes may correspond to different protocols like RoCEv1, RoCEv2 and iWARP.
 Search for "GID" in `ibv_devinfo -v` output to determine which GID index
 you need.
 
-**IMPORTANT:** If you want to use RoCEv2 (as recommended) then the correct
-rdma_gid_index is usually 1 (IPv6) or 3 (IPv4).
+If not specified, Vitastor will try to auto-select a RoCEv2 IPv4 GID, then
+RoCEv2 IPv6 GID, then RoCEv1 IPv4 GID, then RoCEv1 IPv6 GID, then IB GID.
+GID auto-selection is unsupported with libibverbs < v32.
+
+A correct rdma_gid_index for RoCEv2 is usually 1 (IPv6) or 3 (IPv4).
 
 ## rdma_mtu
 
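If auto-selection picks an unwanted GID on some hardware, or an older libibverbs is in use, the index can still be pinned explicitly; a sketch with placeholder values:

    { "rdma_device": "rocep5s0f0", "rdma_gid_index": 3 }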
@@ -71,12 +71,17 @@ RDMA может быть нужно только если у клиентов е
 - Тип: строка
 
 Название RDMA-устройства для связи с Vitastor OSD (например, "rocep5s0f0").
-Сейчас Vitastor поддерживает все модели адаптеров, включая те, у которых
-нет поддержки ODP, то есть вы можете использовать RDMA с ConnectX-3 и
-картами производства не Mellanox.
+Если не указано, Vitastor попробует найти RoCE-устройство, соответствующее
+[osd_network](osd.en.md#osd_network), предпочитая RoCEv2, или выбрать первое
+попавшееся RDMA-устройство, если RoCE-устройств нет или если сеть `osd_network`
+не задана. Также автовыбор не поддерживается со старыми версиями библиотеки
+libibverbs < v32, например в Debian 10 Buster или CentOS 7.
 
-Версии Vitastor до 1.2.0 включительно требовали ODP, который есть только
-на Mellanox ConnectX 4 и более новых. См. также [rdma_odp](#rdma_odp).
+Vitastor поддерживает все модели адаптеров, включая те, у которых
+нет поддержки ODP, то есть вы можете использовать RDMA с ConnectX-3 и
+картами производства не Mellanox. Версии Vitastor до 1.2.0 включительно
+требовали ODP, который есть только на Mellanox ConnectX 4 и более новых.
+См. также [rdma_odp](#rdma_odp).
 
 Запустите `ibv_devinfo -v` от имени суперпользователя, чтобы посмотреть
 список доступных RDMA-устройств, их параметры и возможности.
@@ -101,15 +106,18 @@ Control) и ECN (Explicit Congestion Notification).
 ## rdma_gid_index
 
 - Тип: целое число
-- Значение по умолчанию: 0
 
 Номер глобального идентификатора адреса RDMA-устройства, который следует
 использовать. Разным gid_index могут соответствовать разные протоколы связи:
 RoCEv1, RoCEv2, iWARP. Чтобы понять, какой нужен вам - смотрите строчки со
 словом "GID" в выводе команды `ibv_devinfo -v`.
 
-**ВАЖНО:** Если вы хотите использовать RoCEv2 (как мы и рекомендуем), то
-правильный rdma_gid_index, как правило, 1 (IPv6) или 3 (IPv4).
+Если не указан, Vitastor попробует автоматически выбрать сначала GID,
+соответствующий RoCEv2 IPv4, потом RoCEv2 IPv6, потом RoCEv1 IPv4, потом
+RoCEv1 IPv6, потом IB. Авто-выбор GID не поддерживается со старыми версиями
+libibverbs < v32.
+
+Правильный rdma_gid_index для RoCEv2, как правило, 1 (IPv6) или 3 (IPv4).
 
 ## rdma_mtu
 
@@ -48,11 +48,17 @@
   type: string
   info: |
     RDMA device name to use for Vitastor OSD communications (for example,
-    "rocep5s0f0"). Now Vitastor supports all adapters, even ones without
-    ODP support, like Mellanox ConnectX-3 and non-Mellanox cards.
+    "rocep5s0f0"). If not specified, Vitastor will try to find an RoCE
+    device matching [osd_network](osd.en.md#osd_network), preferring RoCEv2,
+    or choose the first available RDMA device if no RoCE devices are
+    found or if `osd_network` is not specified. Auto-selection is also
+    unsupported with old libibverbs < v32, like in Debian 10 Buster or
+    CentOS 7.
 
-    Versions up to Vitastor 1.2.0 required ODP which is only present in
-    Mellanox ConnectX >= 4. See also [rdma_odp](#rdma_odp).
+    Vitastor supports all adapters, even ones without ODP support, like
+    Mellanox ConnectX-3 and non-Mellanox cards. Versions up to Vitastor
+    1.2.0 required ODP which is only present in Mellanox ConnectX >= 4.
+    See also [rdma_odp](#rdma_odp).
 
     Run `ibv_devinfo -v` as root to list available RDMA devices and their
     features.
@@ -64,12 +70,17 @@
     PFC (Priority Flow Control) and ECN (Explicit Congestion Notification).
   info_ru: |
     Название RDMA-устройства для связи с Vitastor OSD (например, "rocep5s0f0").
-    Сейчас Vitastor поддерживает все модели адаптеров, включая те, у которых
-    нет поддержки ODP, то есть вы можете использовать RDMA с ConnectX-3 и
-    картами производства не Mellanox.
+    Если не указано, Vitastor попробует найти RoCE-устройство, соответствующее
+    [osd_network](osd.en.md#osd_network), предпочитая RoCEv2, или выбрать первое
+    попавшееся RDMA-устройство, если RoCE-устройств нет или если сеть `osd_network`
+    не задана. Также автовыбор не поддерживается со старыми версиями библиотеки
+    libibverbs < v32, например в Debian 10 Buster или CentOS 7.
 
-    Версии Vitastor до 1.2.0 включительно требовали ODP, который есть только
-    на Mellanox ConnectX 4 и более новых. См. также [rdma_odp](#rdma_odp).
+    Vitastor поддерживает все модели адаптеров, включая те, у которых
+    нет поддержки ODP, то есть вы можете использовать RDMA с ConnectX-3 и
+    картами производства не Mellanox. Версии Vitastor до 1.2.0 включительно
+    требовали ODP, который есть только на Mellanox ConnectX 4 и более новых.
+    См. также [rdma_odp](#rdma_odp).
 
     Запустите `ibv_devinfo -v` от имени суперпользователя, чтобы посмотреть
     список доступных RDMA-устройств, их параметры и возможности.
@@ -94,23 +105,29 @@
     `ibv_devinfo -v`.
 - name: rdma_gid_index
   type: int
-  default: 0
   info: |
     Global address identifier index of the RDMA device to use. Different GID
     indexes may correspond to different protocols like RoCEv1, RoCEv2 and iWARP.
     Search for "GID" in `ibv_devinfo -v` output to determine which GID index
     you need.
 
-    **IMPORTANT:** If you want to use RoCEv2 (as recommended) then the correct
-    rdma_gid_index is usually 1 (IPv6) or 3 (IPv4).
+    If not specified, Vitastor will try to auto-select a RoCEv2 IPv4 GID, then
+    RoCEv2 IPv6 GID, then RoCEv1 IPv4 GID, then RoCEv1 IPv6 GID, then IB GID.
+    GID auto-selection is unsupported with libibverbs < v32.
+
+    A correct rdma_gid_index for RoCEv2 is usually 1 (IPv6) or 3 (IPv4).
   info_ru: |
     Номер глобального идентификатора адреса RDMA-устройства, который следует
     использовать. Разным gid_index могут соответствовать разные протоколы связи:
     RoCEv1, RoCEv2, iWARP. Чтобы понять, какой нужен вам - смотрите строчки со
     словом "GID" в выводе команды `ibv_devinfo -v`.
 
-    **ВАЖНО:** Если вы хотите использовать RoCEv2 (как мы и рекомендуем), то
-    правильный rdma_gid_index, как правило, 1 (IPv6) или 3 (IPv4).
+    Если не указан, Vitastor попробует автоматически выбрать сначала GID,
+    соответствующий RoCEv2 IPv4, потом RoCEv2 IPv6, потом RoCEv1 IPv4, потом
+    RoCEv1 IPv6, потом IB. Авто-выбор GID не поддерживается со старыми версиями
+    libibverbs < v32.
+
+    Правильный rdma_gid_index для RoCEv2, как правило, 1 (IPv6) или 3 (IPv4).
 - name: rdma_mtu
   type: int
   default: 4096
@@ -6,7 +6,12 @@
 
 # Architecture
 
+- [Server-side components](#server-side-components)
 - [Basic concepts](#basic-concepts)
+- [Client-side components](#client-side-components)
+- [Additional utilities](#additional-utilities)
+- [Overall read/write process](#overall-read-write-process)
+- [Nuances of request handling](#nuances-of-request-handling)
 - [Similarities to Ceph](#similarities-to-ceph)
 - [Differences from Ceph](#differences-from-ceph)
 - [Implementation Principles](#implementation-principles)
@@ -39,10 +44,10 @@
 
 ## Client-side components
 
-- **Client library** incapsulates client I/O logic. Client library connects to etcd and to all OSDs,
+- **Client library** encapsulates client I/O logic. Client library connects to etcd and to all OSDs,
   receives cluster state from etcd, sends read and write requests directly to all OSDs. Due
   to the symmetric distributed architecture, all data blocks (each 128 KB by default) are placed
-  to different OSDs, but clients always knows where each data block is stored and connects directly
+  to different OSDs, but clients always know where each data block is stored and connect directly
   to the right OSD.
 
 All other client-side components are based on the client library:
@@ -77,8 +82,8 @@ All other client-side components are based on the client library:
 
 ## Additional utilities
 
-- **vitastor-disk** — утилита для разметки дисков под Vitastor OSD. С её помощью можно
-  создавать, удалять, менять размеры или перемещать разделы OSD.
+- **vitastor-disk** — a Vitastor OSD disk management tool. You can create, remove,
+  resize and move OSD partitions with it.
 
 ## Overall read/write process
 
@@ -171,7 +171,14 @@ to make them use the new version of the client library.
 
 ### 1.7.x to 1.8.0
 
-After upgrading version <= 1.7.x to version >= 1.8.0, BUT <= 1.9.0: restart all clients
+It's recommended to upgrade from version <= 1.7.x to version >= 1.8.0 with full downtime,
+i.e. you should first stop clients and then the cluster (OSDs and monitor), because 1.8.0
+includes a fix for etcd event stream inconsistency which could lead to "incomplete" objects
+appearing in EC pools, and in rare cases, probably, even to data corruption during mass OSD
+restarts. It doesn't mean that you WILL hit this problem if you upgrade without full downtime,
+but it's better to secure yourself against it.
+
+Also, if you upgrade version from <= 1.7.x to version >= 1.8.0, BUT <= 1.9.0: restart all clients
 (VMs and so on), otherwise they will hang when monitor clears old PG configuration key,
 which happens 24 hours after upgrade.
 
@@ -168,7 +168,14 @@ done
 
 ### 1.7.x -> 1.8.0
 
-После обновления с версий <= 1.7.x до версий >= 1.8.0, НО <= 1.9.0: перезапустите всех
+Обновляться с версий <= 1.7.x до версий >= 1.8.0 рекомендуется с полной остановкой
+сначала клиентов, а затем кластера, так как в 1.8.0 исправлена проблема (неконсистентность
+потоков событий от etcd), способная приводить к появлению incomplete объектов в EC-пулах
+и, хоть и редко, но даже к повреждению данных при массовых перезапусках OSD. Если вы
+обновляетесь без полной остановки - это не значит, что вы обязательно столкнётесь с этой
+проблемой, но лучше подстраховаться.
+
+Также, если вы обновляетесь с версии <= 1.7.x до версии >= 1.8.0, НО <= 1.9.0: перезапустите всех
 клиентов (процессы виртуальных машин можно перезапустить путём миграции на другой сервер),
 иначе они зависнут, когда монитор удалит старый ключ конфигурации PG, что происходит через
 24 часа после обновления.
@@ -232,6 +232,7 @@ class EtcdAdapter
     async become_master()
     {
         const state = { ...this.mon.get_mon_state(), id: ''+this.mon.etcd_lease_id };
+        console.log('Waiting to become master');
         // eslint-disable-next-line no-constant-condition
         while (1)
         {
@@ -243,7 +244,6 @@ class EtcdAdapter
             {
                 break;
             }
-            console.log('Waiting to become master');
             await new Promise(ok => setTimeout(ok, this.mon.config.etcd_start_timeout));
         }
         console.log('Became master');
@@ -56,6 +56,7 @@ const etcd_tree = {
         osd_out_time: 600, // seconds. min: 0
         placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
         use_old_pg_combinator: false,
+        osd_backfillfull_ratio: 0.99,
         // client and osd
         tcp_header_buffer_size: 65536,
         use_sync_send_recv: false,
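The new osd_backfillfull_ratio default feeds the capacity check added to mon/stats.js further down in this comparison. In essence the check works as in the following simplified sketch (made-up numbers, same BigInt arithmetic as the real code):

    // Sketch: when is an OSD considered "backfillfull"?
    // cap   = number of data blocks the OSD can hold, scaled by osd_backfillfull_ratio
    // clean = number of clean objects the target PG configuration places on this OSD
    const st = { size: 960n*1024n*1024n*1024n, data_block_size: 128n*1024n }; // example OSD stats
    const osd_backfillfull_ratio = 0.99;
    let cap = st.size / st.data_block_size;                        // 7864320 blocks
    cap = cap * BigInt(osd_backfillfull_ratio*1000000) / 1000000n;  // 7785676 blocks usable
    const clean_per_osd = 8000000n;                                 // example target object count
    const backfillfull = cap < clean_per_osd;                       // true: rebalance would overfill this OSD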
mon/mon.js (54 lines changed)
@@ -74,6 +74,7 @@ class Mon
         this.state = JSON.parse(JSON.stringify(etcd_tree));
         this.prev_stats = { osd_stats: {}, osd_diff: {} };
         this.recheck_pgs_active = false;
+        this.updating_total_stats = false;
         this.watcher_active = false;
         this.old_pg_config = false;
         this.old_pg_stats_seen = false;
@@ -658,7 +659,19 @@ class Mon
                 this.etcd_watch_revision, pool_id, up_osds, osd_tree, real_prev_pgs, pool_res.pgs, pg_history);
         }
         new_pg_config.hash = tree_hash;
-        return await this.save_pg_config(new_pg_config, etcd_request);
+        const { backfillfull_pools, backfillfull_osds } = sum_object_counts(
+            { ...this.state, pg: { ...this.state.pg, config: new_pg_config } }, this.config
+        );
+        if (backfillfull_pools.join(',') != ((this.state.pg.config||{}).backfillfull_pools||[]).join(','))
+        {
+            this.log_backfillfull(backfillfull_osds, backfillfull_pools);
+        }
+        new_pg_config.backfillfull_pools = backfillfull_pools.length ? backfillfull_pools : undefined;
+        if (!await this.save_pg_config(new_pg_config, etcd_request))
+        {
+            return false;
+        }
+        return true;
     }
 
     async save_pg_config(new_pg_config, etcd_request = { compare: [], success: [] })
@@ -730,7 +743,7 @@ class Mon
     async update_total_stats()
     {
         const txn = [];
-        const { object_counts, object_bytes } = sum_object_counts(this.state, this.config);
+        const { object_counts, object_bytes, backfillfull_pools, backfillfull_osds } = sum_object_counts(this.state, this.config);
         let stats = sum_op_stats(this.state.osd, this.prev_stats);
         let { inode_stats, seen_pools } = sum_inode_stats(this.state, this.prev_stats);
         stats.object_counts = object_counts;
@@ -783,6 +796,27 @@ class Mon
         {
             await this.etcd.etcd_call('/kv/txn', { success: txn }, this.config.etcd_mon_timeout, 0);
         }
+        if (!this.recheck_pgs_active &&
+            backfillfull_pools.join(',') != ((this.state.pg.config||{}).backfillfull_pools||[]).join(','))
+        {
+            this.log_backfillfull(backfillfull_osds, backfillfull_pools);
+            const new_pg_config = { ...this.state.pg.config, backfillfull_pools: backfillfull_pools.length ? backfillfull_pools : undefined };
+            await this.save_pg_config(new_pg_config);
+        }
+    }
+
+    log_backfillfull(osds, pools)
+    {
+        for (const osd in osds)
+        {
+            const bf = osds[osd];
+            console.log('OSD '+osd+' may fill up during rebalance: capacity '+(bf.cap/1024n/1024n)+
+                ' MB, target user data '+(bf.clean/1024n/1024n)+' MB');
+        }
+        console.log(
+            (pools.length ? 'Pool(s) '+pools.join(', ') : 'No pools')+
+            ' are backfillfull now, applying rebalance configuration'
+        );
     }
 
     schedule_update_stats()
@@ -794,7 +828,21 @@ class Mon
         this.stats_timer = setTimeout(() =>
         {
             this.stats_timer = null;
-            this.update_total_stats().catch(console.error);
+            if (this.updating_total_stats)
+            {
+                this.schedule_update_stats();
+                return;
+            }
+            this.updating_total_stats = true;
+            try
+            {
+                this.update_total_stats().catch(console.error);
+            }
+            catch (e)
+            {
+                console.error(e);
+            }
+            this.updating_total_stats = false;
         }, this.config.mon_stats_timeout);
     }
 
mon/stats.js (39 lines changed)
@@ -109,6 +109,8 @@ function sum_object_counts(state, global_config)
             pgstats[pool_id] = { ...(state.pg.stats[pool_id] || {}), ...(pgstats[pool_id] || {}) };
         }
     }
+    const pool_per_osd = {};
+    const clean_per_osd = {};
     for (const pool_id in pgstats)
     {
         let object_size = 0;
@@ -143,10 +145,45 @@ function sum_object_counts(state, global_config)
                         object_bytes[k] += BigInt(st[k+'_count']) * object_size;
                     }
                 }
+                if (st.object_count)
+                {
+                    for (const pg_osd of (((state.pg.config.items||{})[pool_id]||{})[pg_num]||{}).osd_set||[])
+                    {
+                        if (!(pg_osd in clean_per_osd))
+                        {
+                            clean_per_osd[pg_osd] = 0n;
+                        }
+                        clean_per_osd[pg_osd] += BigInt(st.object_count);
+                        pool_per_osd[pg_osd] = pool_per_osd[pg_osd]||{};
+                        pool_per_osd[pg_osd][pool_id] = true;
+                    }
+                }
             }
         }
     }
-    return { object_counts, object_bytes };
+    // If clean_per_osd[osd] is larger than osd capacity then it will fill up during rebalance
+    let backfillfull_pools = {};
+    let backfillfull_osds = {};
+    for (const osd in clean_per_osd)
+    {
+        const st = state.osd.stats[osd];
+        if (!st || !st.size || !st.data_block_size)
+        {
+            continue;
+        }
+        let cap = BigInt(st.size)/BigInt(st.data_block_size);
+        cap = cap * BigInt((global_config.osd_backfillfull_ratio||0.99)*1000000) / 1000000n;
+        if (cap < clean_per_osd[osd])
+        {
+            backfillfull_osds[osd] = { cap: BigInt(st.size), clean: clean_per_osd[osd]*BigInt(st.data_block_size) };
+            for (const pool_id in pool_per_osd[osd])
+            {
+                backfillfull_pools[pool_id] = true;
+            }
+        }
+    }
+    backfillfull_pools = Object.keys(backfillfull_pools).sort();
+    return { object_counts, object_bytes, backfillfull_pools, backfillfull_osds };
 }
 
 // sum_inode_stats(this.state, this.prev_stats)
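To recap what the extended function now returns to the monitor (a usage sketch; variable names mirror the mon.js hunks above):

    const { object_counts, object_bytes, backfillfull_pools, backfillfull_osds } =
        sum_object_counts(this.state, this.config);
    // backfillfull_pools: sorted array of IDs of pools that place data on an OSD
    //                     which would overfill during rebalance
    // backfillfull_osds:  map of OSD number to { cap, clean } byte counts (BigInt),
    //                     used by log_backfillfull() for the warning message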
@@ -306,12 +306,12 @@ index e5ff653a60..884ecc79ea 100644
 +        etcd = virBufferContentAndReset(&buf);
 +    }
 +
-+    if (virJSONValueObjectCreate(&ret,
++    if (virJSONValueObjectAdd(&ret,
 +                                "S:etcd-host", etcd,
 +                                "S:etcd-prefix", src->query,
 +                                "S:config-path", src->configFile,
 +                                "s:image", src->path,
 +                                NULL) < 0)
 +        return NULL;
 +
 +    return ret;
@@ -0,0 +1,193 @@
Index: pve-qemu-kvm-9.0.0/block/meson.build
===================================================================
--- pve-qemu-kvm-9.0.0.orig/block/meson.build
+++ pve-qemu-kvm-9.0.0/block/meson.build
@@ -126,6 +126,7 @@ foreach m : [
  [libnfs, 'nfs', files('nfs.c')],
  [libssh, 'ssh', files('ssh.c')],
  [rbd, 'rbd', files('rbd.c')],
+ [vitastor, 'vitastor', files('vitastor.c')],
 ]
 if m[0].found()
 module_ss = ss.source_set()
Index: pve-qemu-kvm-9.0.0/meson.build
===================================================================
--- pve-qemu-kvm-9.0.0.orig/meson.build
+++ pve-qemu-kvm-9.0.0/meson.build
@@ -1452,6 +1452,26 @@ if not get_option('rbd').auto() or have_
 endif
 endif

+vitastor = not_found
+if not get_option('vitastor').auto() or have_block
+ libvitastor_client = cc.find_library('vitastor_client', has_headers: ['vitastor_c.h'],
+ required: get_option('vitastor'))
+ if libvitastor_client.found()
+ if cc.links('''
+ #include <vitastor_c.h>
+ int main(void) {
+ vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+ return 0;
+ }''', dependencies: libvitastor_client)
+ vitastor = declare_dependency(dependencies: libvitastor_client)
+ elif get_option('vitastor').enabled()
+ error('could not link libvitastor_client')
+ else
+ warning('could not link libvitastor_client, disabling')
+ endif
+ endif
+endif
+
 glusterfs = not_found
 glusterfs_ftruncate_has_stat = false
 glusterfs_iocb_has_stat = false
@@ -2254,6 +2274,7 @@ endif
 config_host_data.set('CONFIG_OPENGL', opengl.found())
 config_host_data.set('CONFIG_PLUGIN', get_option('plugins'))
 config_host_data.set('CONFIG_RBD', rbd.found())
+config_host_data.set('CONFIG_VITASTOR', vitastor.found())
 config_host_data.set('CONFIG_RDMA', rdma.found())
 config_host_data.set('CONFIG_RELOCATABLE', get_option('relocatable'))
 config_host_data.set('CONFIG_SAFESTACK', get_option('safe_stack'))
@@ -4454,6 +4475,7 @@ summary_info += {'fdt support': fd
 summary_info += {'libcap-ng support': libcap_ng}
 summary_info += {'bpf support': libbpf}
 summary_info += {'rbd support': rbd}
+summary_info += {'vitastor support': vitastor}
 summary_info += {'smartcard support': cacard}
 summary_info += {'U2F support': u2f}
 summary_info += {'libusb': libusb}
Index: pve-qemu-kvm-9.0.0/meson_options.txt
===================================================================
--- pve-qemu-kvm-9.0.0.orig/meson_options.txt
+++ pve-qemu-kvm-9.0.0/meson_options.txt
@@ -194,6 +194,8 @@ option('lzo', type : 'feature', value :
 description: 'lzo compression support')
 option('rbd', type : 'feature', value : 'auto',
 description: 'Ceph block device driver')
+option('vitastor', type : 'feature', value : 'auto',
+ description: 'Vitastor block device driver')
 option('opengl', type : 'feature', value : 'auto',
 description: 'OpenGL support')
 option('rdma', type : 'feature', value : 'auto',
Index: pve-qemu-kvm-9.0.0/qapi/block-core.json
===================================================================
--- pve-qemu-kvm-9.0.0.orig/qapi/block-core.json
+++ pve-qemu-kvm-9.0.0/qapi/block-core.json
@@ -3481,7 +3481,7 @@
 'raw', 'rbd',
 { 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
 'pbs',
- 'ssh', 'throttle', 'vdi', 'vhdx',
+ 'ssh', 'throttle', 'vdi', 'vhdx', 'vitastor',
 { 'name': 'virtio-blk-vfio-pci', 'if': 'CONFIG_BLKIO' },
 { 'name': 'virtio-blk-vhost-user', 'if': 'CONFIG_BLKIO' },
 { 'name': 'virtio-blk-vhost-vdpa', 'if': 'CONFIG_BLKIO' },
@@ -4591,6 +4591,28 @@
 '*server': ['InetSocketAddressBase'] } }

 ##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host: etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config-path': 'str',
+ '*etcd-host': 'str',
+ '*etcd-prefix': 'str' } }
+
+##
 # @ReplicationMode:
 #
 # An enumeration of replication modes.
@@ -5053,6 +5075,7 @@
 'throttle': 'BlockdevOptionsThrottle',
 'vdi': 'BlockdevOptionsGenericFormat',
 'vhdx': 'BlockdevOptionsGenericFormat',
+ 'vitastor': 'BlockdevOptionsVitastor',
 'virtio-blk-vfio-pci':
 { 'type': 'BlockdevOptionsVirtioBlkVfioPci',
 'if': 'CONFIG_BLKIO' },
@@ -5498,6 +5521,20 @@
 '*encrypt' : 'RbdEncryptionCreateOptions' } }

 ##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @location: Where to store the new image file. This location cannot
+# point to a snapshot.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
+##
 # @BlockdevVmdkSubformat:
 #
 # Subformat options for VMDK images
@@ -5719,6 +5753,7 @@
 'ssh': 'BlockdevCreateOptionsSsh',
 'vdi': 'BlockdevCreateOptionsVdi',
 'vhdx': 'BlockdevCreateOptionsVhdx',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
 'vmdk': 'BlockdevCreateOptionsVmdk',
 'vpc': 'BlockdevCreateOptionsVpc'
 } }
Index: pve-qemu-kvm-9.0.0/scripts/ci/org.centos/stream/8/x86_64/configure
===================================================================
--- pve-qemu-kvm-9.0.0.orig/scripts/ci/org.centos/stream/8/x86_64/configure
+++ pve-qemu-kvm-9.0.0/scripts/ci/org.centos/stream/8/x86_64/configure
@@ -30,7 +30,7 @@
 --with-suffix="qemu-kvm" \
 --firmwarepath=/usr/share/qemu-firmware \
 --target-list="x86_64-softmmu" \
---block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
+--block-drv-rw-whitelist="qcow2,raw,file,host_device,nbd,iscsi,rbd,vitastor,blkdebug,luks,null-co,nvme,copy-on-read,throttle,gluster" \
 --audio-drv-list="" \
 --block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
 --with-coroutine=ucontext \
@@ -176,6 +176,7 @@
 --enable-opengl \
 --enable-pie \
 --enable-rbd \
+--enable-vitastor \
 --enable-rdma \
 --enable-seccomp \
 --enable-snappy \
Index: pve-qemu-kvm-9.0.0/scripts/meson-buildoptions.sh
===================================================================
--- pve-qemu-kvm-9.0.0.orig/scripts/meson-buildoptions.sh
+++ pve-qemu-kvm-9.0.0/scripts/meson-buildoptions.sh
@@ -168,6 +168,7 @@ meson_options_help() {
 printf "%s\n" ' qed qed image format support'
 printf "%s\n" ' qga-vss build QGA VSS support (broken with MinGW)'
 printf "%s\n" ' rbd Ceph block device driver'
+ printf "%s\n" ' vitastor Vitastor block device driver'
 printf "%s\n" ' rdma Enable RDMA-based migration'
 printf "%s\n" ' replication replication support'
 printf "%s\n" ' rutabaga-gfx rutabaga_gfx support'
@@ -445,6 +446,8 @@ _meson_option_parse() {
 --disable-qom-cast-debug) printf "%s" -Dqom_cast_debug=false ;;
 --enable-rbd) printf "%s" -Drbd=enabled ;;
 --disable-rbd) printf "%s" -Drbd=disabled ;;
+ --enable-vitastor) printf "%s" -Dvitastor=enabled ;;
+ --disable-vitastor) printf "%s" -Dvitastor=disabled ;;
 --enable-rdma) printf "%s" -Drdma=enabled ;;
 --disable-rdma) printf "%s" -Drdma=disabled ;;
 --enable-relocatable) printf "%s" -Drelocatable=true ;;
@@ -0,0 +1,172 @@
diff --git a/block/meson.build b/block/meson.build
index f1262ec2ba..3cf3e23f16 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -114,6 +114,7 @@ foreach m : [
  [libnfs, 'nfs', files('nfs.c')],
  [libssh, 'ssh', files('ssh.c')],
  [rbd, 'rbd', files('rbd.c')],
+ [vitastor, 'vitastor', files('vitastor.c')],
 ]
 if m[0].found()
 module_ss = ss.source_set()
diff --git a/meson.build b/meson.build
index fbda17c987..3edac22aff 100644
--- a/meson.build
+++ b/meson.build
@@ -1510,6 +1510,26 @@ if not get_option('rbd').auto() or have_block
 endif
 endif

+vitastor = not_found
+if not get_option('vitastor').auto() or have_block
+ libvitastor_client = cc.find_library('vitastor_client', has_headers: ['vitastor_c.h'],
+ required: get_option('vitastor'))
+ if libvitastor_client.found()
+ if cc.links('''
+ #include <vitastor_c.h>
+ int main(void) {
+ vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+ return 0;
+ }''', dependencies: libvitastor_client)
+ vitastor = declare_dependency(dependencies: libvitastor_client)
+ elif get_option('vitastor').enabled()
+ error('could not link libvitastor_client')
+ else
+ warning('could not link libvitastor_client, disabling')
+ endif
+ endif
+endif
+
 glusterfs = not_found
 glusterfs_ftruncate_has_stat = false
 glusterfs_iocb_has_stat = false
@@ -2351,6 +2371,7 @@ endif
 config_host_data.set('CONFIG_OPENGL', opengl.found())
 config_host_data.set('CONFIG_PLUGIN', get_option('plugins'))
 config_host_data.set('CONFIG_RBD', rbd.found())
+config_host_data.set('CONFIG_VITASTOR', vitastor.found())
 config_host_data.set('CONFIG_RDMA', rdma.found())
 config_host_data.set('CONFIG_RELOCATABLE', get_option('relocatable'))
 config_host_data.set('CONFIG_SAFESTACK', get_option('safe_stack'))
@@ -4510,6 +4531,7 @@ summary_info += {'fdt support': fdt_opt == 'internal' ? 'internal' : fdt}
 summary_info += {'libcap-ng support': libcap_ng}
 summary_info += {'bpf support': libbpf}
 summary_info += {'rbd support': rbd}
+summary_info += {'vitastor support': vitastor}
 summary_info += {'smartcard support': cacard}
 summary_info += {'U2F support': u2f}
 summary_info += {'libusb': libusb}
diff --git a/meson_options.txt b/meson_options.txt
index 0269fa0f16..4740ffdc27 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -194,6 +194,8 @@ option('lzo', type : 'feature', value : 'auto',
 description: 'lzo compression support')
 option('rbd', type : 'feature', value : 'auto',
 description: 'Ceph block device driver')
+option('vitastor', type : 'feature', value : 'auto',
+ description: 'Vitastor block device driver')
 option('opengl', type : 'feature', value : 'auto',
 description: 'OpenGL support')
 option('rdma', type : 'feature', value : 'auto',
diff --git a/qapi/block-core.json b/qapi/block-core.json
index aa40d44f1d..bbee6a0e9c 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3203,7 +3203,7 @@
 'parallels', 'preallocate', 'qcow', 'qcow2', 'qed', 'quorum',
 'raw', 'rbd',
 { 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
- 'ssh', 'throttle', 'vdi', 'vhdx',
+ 'ssh', 'throttle', 'vdi', 'vhdx', 'vitastor',
 { 'name': 'virtio-blk-vfio-pci', 'if': 'CONFIG_BLKIO' },
 { 'name': 'virtio-blk-vhost-user', 'if': 'CONFIG_BLKIO' },
 { 'name': 'virtio-blk-vhost-vdpa', 'if': 'CONFIG_BLKIO' },
@@ -4286,6 +4286,28 @@
 '*key-secret': 'str',
 '*server': ['InetSocketAddressBase'] } }

+##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host: etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config-path': 'str',
+ '*etcd-host': 'str',
+ '*etcd-prefix': 'str' } }
+
 ##
 # @ReplicationMode:
 #
 # An enumeration of replication modes.
@@ -4742,6 +4764,7 @@
 'throttle': 'BlockdevOptionsThrottle',
 'vdi': 'BlockdevOptionsGenericFormat',
 'vhdx': 'BlockdevOptionsGenericFormat',
+ 'vitastor': 'BlockdevOptionsVitastor',
 'virtio-blk-vfio-pci':
 { 'type': 'BlockdevOptionsVirtioBlkVfioPci',
 'if': 'CONFIG_BLKIO' },
@@ -5183,6 +5206,20 @@
 '*cluster-size' : 'size',
 '*encrypt' : 'RbdEncryptionCreateOptions' } }

+##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @location: Where to store the new image file. This location cannot
+# point to a snapshot.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+ 'data': { 'location': 'BlockdevOptionsVitastor',
+ 'size': 'size' } }
+
 ##
 # @BlockdevVmdkSubformat:
 #
 # Subformat options for VMDK images
@@ -5405,6 +5442,7 @@
 'ssh': 'BlockdevCreateOptionsSsh',
 'vdi': 'BlockdevCreateOptionsVdi',
 'vhdx': 'BlockdevCreateOptionsVhdx',
+ 'vitastor': 'BlockdevCreateOptionsVitastor',
 'vmdk': 'BlockdevCreateOptionsVmdk',
 'vpc': 'BlockdevCreateOptionsVpc'
 } }
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index c97079a38c..4623f552ec 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -168,6 +168,7 @@ meson_options_help() {
 printf "%s\n" ' qga-vss build QGA VSS support (broken with MinGW)'
 printf "%s\n" ' qpl Query Processing Library support'
 printf "%s\n" ' rbd Ceph block device driver'
+ printf "%s\n" ' vitastor Vitastor block device driver'
 printf "%s\n" ' rdma Enable RDMA-based migration'
 printf "%s\n" ' replication replication support'
 printf "%s\n" ' rutabaga-gfx rutabaga_gfx support'
@@ -444,6 +445,8 @@ _meson_option_parse() {
 --disable-qpl) printf "%s" -Dqpl=disabled ;;
 --enable-rbd) printf "%s" -Drbd=enabled ;;
 --disable-rbd) printf "%s" -Drbd=disabled ;;
+ --enable-vitastor) printf "%s" -Dvitastor=enabled ;;
+ --disable-vitastor) printf "%s" -Dvitastor=disabled ;;
 --enable-rdma) printf "%s" -Drdma=enabled ;;
 --disable-rdma) printf "%s" -Drdma=disabled ;;
 --enable-relocatable) printf "%s" -Drelocatable=true ;;
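For orientation, the QAPI schema added by both patches implies QEMU usage roughly like the following sketch (the node name, image name, etcd address and device wiring are placeholders; the option spelling comes from the BlockdevOptionsVitastor struct above):

    qemu-system-x86_64 ... \
      -blockdev '{"node-name":"vitastor0","driver":"vitastor","image":"testimg","etcd-host":"192.168.0.10:2379","config-path":"/etc/vitastor/vitastor.conf"}' \
      -device virtio-blk-pci,drive=vitastor0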
@@ -176,7 +176,7 @@ void etcd_state_client_t::add_etcd_url(std::string addr)
         exit(1);
     }
     if (!local_ips.size())
-        local_ips = getifaddr_list();
+        local_ips = getifaddr_list(std::vector<std::string>(), true);
     std::string check_addr;
     int pos = addr.find('/');
     int pos2 = addr.find(':');
@@ -785,7 +785,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
         }
         for (auto & pool_item: value.object_items())
         {
-            pool_config_t pc;
+            pool_config_t pc = {};
             // ID
             pool_id_t pool_id;
             char null_byte = 0;
@@ -931,12 +931,28 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
             // Ignore old key if the new one is present
             return;
         }
+        for (auto & pool_id_json: value["backfillfull_pools"].array_items())
+        {
+            auto pool_id = pool_id_json.uint64_value();
+            auto pool_it = this->pool_config.find(pool_id);
+            if (pool_it != this->pool_config.end())
+            {
+                pool_it->second.backfillfull |= 2;
+            }
+        }
         for (auto & pool_item: this->pool_config)
         {
             for (auto & pg_item: pool_item.second.pg_config)
             {
                 pg_item.second.config_exists = false;
             }
+            // 3 = was 1 and became 1, 0 = was 0 and became 0
+            if (pool_item.second.backfillfull == 2 || pool_item.second.backfillfull == 1)
+            {
+                if (on_change_backfillfull_hook)
+                    on_change_backfillfull_hook(pool_item.first);
+            }
+            pool_item.second.backfillfull = pool_item.second.backfillfull >> 1;
         }
         for (auto & pool_item: value["items"].object_items())
         {
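A note on the two-bit encoding used here, as far as it can be read from this hunk: while parsing the new PG configuration key, bit 1 is set for every pool listed in backfillfull_pools, and bit 0 still holds the previous state. The values 1 (was set, now cleared) and 2 (was clear, now set) therefore mean the flag changed, so on_change_backfillfull_hook fires; 0 and 3 mean no change. The final right shift turns the freshly parsed bit into the stored state for the next update.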
@@ -62,6 +62,7 @@ struct pool_config_t
     std::map<pg_num_t, pg_config_t> pg_config;
     uint64_t scrub_interval;
     std::string used_for_fs;
+    int backfillfull;
 };
 
 struct inode_config_t
@@ -131,6 +132,7 @@ public:
     std::function<json11::Json()> load_pgs_checks_hook;
     std::function<void(bool)> on_load_pgs_hook;
     std::function<void()> on_change_pool_config_hook;
+    std::function<void(pool_id_t)> on_change_backfillfull_hook;
     std::function<void(pool_id_t, pg_num_t, osd_num_t)> on_change_pg_state_hook;
     std::function<void(pool_id_t, pg_num_t)> on_change_pg_history_hook;
     std::function<void(osd_num_t)> on_change_osd_state_hook;
@@ -62,6 +62,7 @@ struct http_co_t
     inline void end() { ended = true; if (!onstack) { delete this; } }
     void run_cb_and_clear();
     void start_connection();
+    void start_ws_connection();
     void close_connection();
     void next_request();
     void handle_events();
@@ -112,7 +113,7 @@ http_co_t* open_websocket(timerfd_manager_t *tfd, const std::string & host, cons
     handler->keepalive = false;
     handler->request = request;
     handler->response_callback = response_callback;
-    handler->start_connection();
+    handler->start_ws_connection();
     return handler;
 }
 
@@ -282,6 +283,27 @@ void http_co_t::close_connection()
     epoll_events = 0;
 }
 
+void http_co_t::start_ws_connection()
+{
+    stackin();
+    start_connection();
+    if (request_timeout > 0)
+    {
+        timeout_id = tfd->set_timer(request_timeout, false, [this](int timer_id)
+        {
+            stackin();
+            if (state != HTTP_CO_WEBSOCKET)
+            {
+                close_connection();
+                parsed = { .error = "Websocket connection timed out" };
+                run_cb_and_clear();
+            }
+            stackout();
+        });
+    }
+    stackout();
+}
+
 void http_co_t::start_connection()
 {
     stackin();
@@ -121,7 +121,7 @@ void osd_messenger_t::init()
     if (use_rdma)
     {
         rdma_context = msgr_rdma_context_t::create(
-            rdma_device != "" ? rdma_device.c_str() : NULL,
+            osd_networks, rdma_device != "" ? rdma_device.c_str() : NULL,
             rdma_port_num, rdma_gid_index, rdma_mtu, rdma_odp, log_level
         );
         if (!rdma_context)
@@ -266,7 +266,8 @@ void osd_messenger_t::parse_config(const json11::Json & config)
     this->rdma_port_num = (uint8_t)config["rdma_port_num"].uint64_value();
     if (!this->rdma_port_num)
         this->rdma_port_num = 1;
-    this->rdma_gid_index = (uint8_t)config["rdma_gid_index"].uint64_value();
+    if (!config["rdma_gid_index"].is_null())
+        this->rdma_gid_index = (uint8_t)config["rdma_gid_index"].uint64_value();
     this->rdma_mtu = (uint32_t)config["rdma_mtu"].uint64_value();
     this->rdma_max_sge = config["rdma_max_sge"].uint64_value();
     if (!this->rdma_max_sge)
@@ -281,6 +282,15 @@ void osd_messenger_t::parse_config(const json11::Json & config)
     if (!this->rdma_max_msg || this->rdma_max_msg > 128*1024*1024)
         this->rdma_max_msg = 129*1024;
     this->rdma_odp = config["rdma_odp"].bool_value();
+    std::vector<std::string> mask;
+    if (config["bind_address"].is_string())
+        mask.push_back(config["bind_address"].string_value());
+    else if (config["osd_network"].is_string())
+        mask.push_back(config["osd_network"].string_value());
+    else
+        for (auto v: config["osd_network"].array_items())
+            mask.push_back(v.string_value());
+    this->osd_networks = mask;
 #endif
     if (!osd_num)
         this->iothread_count = (uint32_t)config["client_iothread_count"].uint64_value();
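As the parsing above shows, the address mask used for RDMA device and GID matching is taken from bind_address when it is a string, otherwise from osd_network, which may be either a single string or an array. Two configuration sketches with placeholder networks:

    { "osd_network": "10.200.0.0/24" }

    { "osd_network": ["10.200.0.0/24", "fd00:200::/64"] }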
@@ -165,8 +165,9 @@ protected:
 
 #ifdef WITH_RDMA
     bool use_rdma = true;
+    std::vector<std::string> osd_networks;
     std::string rdma_device;
-    uint64_t rdma_port_num = 1, rdma_gid_index = 0, rdma_mtu = 0;
+    uint64_t rdma_port_num = 1, rdma_gid_index = -1, rdma_mtu = 0;
     msgr_rdma_context_t *rdma_context = NULL;
     uint64_t rdma_max_sge = 0, rdma_max_send = 0, rdma_max_recv = 0;
     uint64_t rdma_max_msg = 0;
@@ -177,7 +178,7 @@ protected:
     std::vector<int> read_ready_clients;
     std::vector<int> write_ready_clients;
     // We don't use ringloop->set_immediate here because we may have no ringloop in client :)
-    std::vector<std::function<void()>> set_immediate;
+    std::vector<osd_op_t*> set_immediate_ops;
 
 public:
     timerfd_manager_t *tfd;
@@ -237,6 +238,8 @@ protected:
     void handle_op_hdr(osd_client_t *cl);
     bool handle_reply_hdr(osd_client_t *cl);
     void handle_reply_ready(osd_op_t *op);
+    void handle_immediate_ops();
+    void clear_immediate_ops(int peer_fd);
 
 #ifdef WITH_RDMA
     void try_send_rdma(osd_client_t *cl);
@@ -3,6 +3,7 @@
 
 #include <stdio.h>
 #include <stdlib.h>
+#include "addr_util.h"
 #include "msgr_rdma.h"
 #include "messenger.h"
 
@@ -69,7 +70,138 @@ msgr_rdma_connection_t::~msgr_rdma_connection_t()
         send_out_size = 0;
 }
 
-msgr_rdma_context_t *msgr_rdma_context_t::create(const char *ib_devname, uint8_t ib_port, uint8_t gid_index, uint32_t mtu, bool odp, int log_level)
+#ifdef IBV_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT
+static bool is_ipv4_gid(ibv_gid_entry *gidx)
+{
+    return (((uint64_t*)gidx->gid.raw)[0] == 0 &&
+        ((uint32_t*)gidx->gid.raw)[2] == 0xffff0000);
+}
+
+static bool match_gid(ibv_gid_entry *gidx, addr_mask_t *networks, int nnet)
+{
+    if (gidx->gid_type != IBV_GID_TYPE_ROCE_V1 &&
+        gidx->gid_type != IBV_GID_TYPE_ROCE_V2 ||
+        ((uint64_t*)gidx->gid.raw)[0] == 0 &&
+        ((uint64_t*)gidx->gid.raw)[1] == 0)
+    {
+        return false;
+    }
+    if (is_ipv4_gid(gidx))
+    {
+        for (int i = 0; i < nnet; i++)
+        {
+            if (networks[i].family == AF_INET && cidr_match(*(in_addr*)(gidx->gid.raw+12), networks[i].ipv4, networks[i].bits))
+                return true;
+        }
+    }
+    else
+    {
+        for (int i = 0; i < nnet; i++)
+        {
+            if (networks[i].family == AF_INET6 && cidr6_match(*(in6_addr*)gidx->gid.raw, networks[i].ipv6, networks[i].bits))
+                return true;
+        }
+    }
+    return false;
+}
+
+struct matched_dev
+{
+    int dev = -1;
+    int port = -1;
+    int gid = -1;
+    bool rocev2 = false;
+};
+
+static void log_rdma_dev_port_gid(ibv_device *dev, int ib_port, int gid_index, ibv_gid_entry & gidx)
+{
+    bool is4 = ((uint64_t*)gidx.gid.raw)[0] == 0 && ((uint32_t*)gidx.gid.raw)[2] == 0xffff0000;
+    char buf[256];
+    inet_ntop(is4 ? AF_INET : AF_INET6, is4 ? gidx.gid.raw+12 : gidx.gid.raw, buf, sizeof(buf));
+    fprintf(
+        stderr, "Auto-selected RDMA device %s port %d GID %d - ROCEv%d IPv%d %s\n",
+        ibv_get_device_name(dev), ib_port, gid_index,
+        gidx.gid_type == IBV_GID_TYPE_ROCE_V2 ? 2 : 1, is4 ? 4 : 6, buf
+    );
+}
+
+static matched_dev match_device(ibv_device **dev_list, addr_mask_t *networks, int nnet, int log_level)
+{
+    matched_dev best;
+    ibv_device_attr attr;
+    ibv_port_attr portinfo;
+    ibv_gid_entry best_gidx;
+    int res;
+    bool have_non_roce = false, have_roce = false;
+    for (int i = 0; dev_list[i]; ++i)
+    {
+        auto dev = dev_list[i];
+        ibv_context *context = ibv_open_device(dev_list[i]);
+        if ((res = ibv_query_device(context, &attr)) != 0)
+        {
+            fprintf(stderr, "Couldn't query RDMA device %s for its features: %s\n", ibv_get_device_name(dev_list[i]), strerror(res));
+            goto cleanup;
+        }
+        for (int j = 1; j <= attr.phys_port_cnt; j++)
+        {
+            // Try to find a port with matching address
+            if ((res = ibv_query_port(context, j, &portinfo)) != 0)
+            {
+                fprintf(stderr, "Couldn't get RDMA device %s port %d info: %s\n", ibv_get_device_name(dev), j, strerror(res));
+                goto cleanup;
+            }
+            for (int k = 0; k < portinfo.gid_tbl_len; k++)
+            {
+                ibv_gid_entry gidx;
+                if ((res = ibv_query_gid_ex(context, j, k, &gidx, 0)) != 0)
+                {
+                    if (res != ENODATA)
+                    {
+                        fprintf(stderr, "Couldn't read RDMA device %s GID index %d: %s\n", ibv_get_device_name(dev), k, strerror(res));
+                        goto cleanup;
+                    }
+                    else
+                        break;
+                }
+                if (gidx.gid_type != IBV_GID_TYPE_ROCE_V1 &&
+                    gidx.gid_type != IBV_GID_TYPE_ROCE_V2)
+                    have_non_roce = true;
+                else
+                    have_roce = true;
+                if (match_gid(&gidx, networks, nnet))
+                {
+                    // Prefer RoCEv2
+                    if (!best.rocev2)
+                    {
+                        best.dev = i;
+                        best.port = j;
+                        best.gid = k;
+                        best.rocev2 = (gidx.gid_type == IBV_GID_TYPE_ROCE_V2);
+                        best_gidx = gidx;
+                    }
+                }
+            }
+        }
+cleanup:
+        ibv_close_device(context);
+        if (best.rocev2)
+        {
+            break;
+        }
+    }
+    if (best.dev >= 0 && log_level > 0)
+    {
+        log_rdma_dev_port_gid(dev_list[best.dev], best.port, best.gid, best_gidx);
+    }
+    if (best.dev < 0 && have_non_roce && !have_roce)
+    {
+        best.dev = -2;
+    }
+    return best;
+}
+#endif
+
+msgr_rdma_context_t *msgr_rdma_context_t::create(std::vector<std::string> osd_networks, const char *ib_devname, uint8_t ib_port, uint8_t gid_index, uint32_t mtu, bool odp, int log_level)
 {
     int res;
     ibv_device **dev_list = NULL;
@ -80,28 +212,23 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(const char *ib_devname, uint8_t
|
||||||
clock_gettime(CLOCK_REALTIME, &tv);
|
clock_gettime(CLOCK_REALTIME, &tv);
|
||||||
srand48(tv.tv_sec*1000000000 + tv.tv_nsec);
|
srand48(tv.tv_sec*1000000000 + tv.tv_nsec);
|
||||||
dev_list = ibv_get_device_list(NULL);
|
dev_list = ibv_get_device_list(NULL);
|
||||||
if (!dev_list)
|
if (!dev_list || !*dev_list)
|
||||||
{
|
{
|
||||||
if (errno == -ENOSYS || errno == ENOSYS)
|
if (errno == -ENOSYS || errno == ENOSYS)
|
||||||
{
|
{
|
||||||
if (log_level > 0)
|
if (log_level > 0)
|
||||||
fprintf(stderr, "No RDMA devices found (RDMA device list returned ENOSYS)\n");
|
fprintf(stderr, "No RDMA devices found (RDMA device list returned ENOSYS)\n");
|
||||||
}
|
}
|
||||||
|
else if (!*dev_list)
|
||||||
|
{
|
||||||
|
if (log_level > 0)
|
||||||
|
fprintf(stderr, "No RDMA devices found\n");
|
||||||
|
}
|
||||||
else
|
else
|
||||||
fprintf(stderr, "Failed to get RDMA device list: %s\n", strerror(errno));
|
fprintf(stderr, "Failed to get RDMA device list: %s\n", strerror(errno));
|
||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
if (!ib_devname)
|
if (ib_devname)
|
||||||
{
|
|
||||||
ctx->dev = *dev_list;
|
|
||||||
if (!ctx->dev)
|
|
||||||
{
|
|
||||||
if (log_level > 0)
|
|
||||||
fprintf(stderr, "No RDMA devices found\n");
|
|
||||||
goto cleanup;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
else
|
|
||||||
{
|
{
|
||||||
int i;
|
int i;
|
||||||
for (i = 0; dev_list[i]; ++i)
|
for (i = 0; dev_list[i]; ++i)
|
||||||
|
@ -114,6 +241,39 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(const char *ib_devname, uint8_t
|
||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
#ifdef IBV_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT
|
||||||
|
else if (osd_networks.size())
|
||||||
|
{
|
||||||
|
std::vector<addr_mask_t> nets;
|
||||||
|
for (auto & netstr: osd_networks)
|
||||||
|
{
|
||||||
|
nets.push_back(cidr_parse(netstr));
|
||||||
|
}
|
||||||
|
auto best = match_device(dev_list, nets.data(), nets.size(), log_level);
|
||||||
|
if (best.dev == -2)
|
||||||
|
{
|
||||||
|
best.dev = 0;
|
||||||
|
if (log_level > 0)
|
||||||
|
fprintf(stderr, "No RoCE devices found, using first available RDMA device %s\n", ibv_get_device_name(*dev_list));
|
||||||
|
}
|
||||||
|
else if (best.dev < 0)
|
||||||
|
{
|
||||||
|
if (log_level > 0)
|
||||||
|
fprintf(stderr, "RDMA device matching osd_network is not found, disabling RDMA\n");
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
else
|
||||||
|
{
|
||||||
|
ib_port = best.port;
|
||||||
|
gid_index = best.gid;
|
||||||
|
}
|
||||||
|
ctx->dev = dev_list[best.dev];
|
||||||
|
}
|
||||||
|
#endif
|
||||||
|
else
|
||||||
|
{
|
||||||
|
ctx->dev = *dev_list;
|
||||||
|
}
|
||||||
|
|
||||||
ctx->context = ibv_open_device(ctx->dev);
|
ctx->context = ibv_open_device(ctx->dev);
|
||||||
if (!ctx->context)
|
if (!ctx->context)
|
||||||
|
@ -123,7 +283,6 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(const char *ib_devname, uint8_t
|
||||||
}
|
}
|
||||||
|
|
||||||
ctx->ib_port = ib_port;
|
ctx->ib_port = ib_port;
|
||||||
ctx->gid_index = gid_index;
|
|
||||||
if ((res = ibv_query_port(ctx->context, ib_port, &ctx->portinfo)) != 0)
|
if ((res = ibv_query_port(ctx->context, ib_port, &ctx->portinfo)) != 0)
|
||||||
{
|
{
|
||||||
fprintf(stderr, "Couldn't get RDMA device %s port %d info: %s\n", ibv_get_device_name(ctx->dev), ib_port, strerror(res));
|
fprintf(stderr, "Couldn't get RDMA device %s port %d info: %s\n", ibv_get_device_name(ctx->dev), ib_port, strerror(res));
|
||||||
|
@ -135,11 +294,55 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(const char *ib_devname, uint8_t
|
||||||
fprintf(stderr, "RDMA device %s must have local LID because it's not Ethernet, but LID is zero\n", ibv_get_device_name(ctx->dev));
|
fprintf(stderr, "RDMA device %s must have local LID because it's not Ethernet, but LID is zero\n", ibv_get_device_name(ctx->dev));
|
||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
if (ibv_query_gid(ctx->context, ib_port, gid_index, &ctx->my_gid))
|
|
||||||
|
#ifdef IBV_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT
|
||||||
|
if (gid_index != -1)
|
||||||
|
#endif
|
||||||
{
|
{
|
||||||
fprintf(stderr, "Couldn't read RDMA device %s GID index %d\n", ibv_get_device_name(ctx->dev), gid_index);
|
ctx->gid_index = gid_index < 0 ? 0 : gid_index;
|
||||||
goto cleanup;
|
if (ibv_query_gid(ctx->context, ib_port, gid_index, &ctx->my_gid))
|
||||||
|
{
|
||||||
|
fprintf(stderr, "Couldn't read RDMA device %s GID index %d\n", ibv_get_device_name(ctx->dev), gid_index);
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
#ifdef IBV_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT
|
||||||
|
else
|
||||||
|
{
|
||||||
|
// Auto-guess GID
|
||||||
|
ibv_gid_entry best_gidx;
|
||||||
|
for (int k = 0; k < ctx->portinfo.gid_tbl_len; k++)
|
||||||
|
{
|
||||||
|
ibv_gid_entry gidx;
|
||||||
|
if (ibv_query_gid_ex(ctx->context, ib_port, k, &gidx, 0) != 0)
|
||||||
|
{
|
||||||
|
fprintf(stderr, "Couldn't read RDMA device %s GID index %d\n", ibv_get_device_name(ctx->dev), k);
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
// Skip empty GID
|
||||||
|
if (((uint64_t*)gidx.gid.raw)[0] == 0 &&
|
||||||
|
((uint64_t*)gidx.gid.raw)[1] == 0)
|
||||||
|
{
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
// Prefer IPv4 RoCEv2 -> IPv6 RoCEv2 -> IPv4 RoCEv1 -> IPv6 RoCEv1 -> IB
|
||||||
|
if (gid_index == -1 ||
|
||||||
|
gidx.gid_type == IBV_GID_TYPE_ROCE_V2 && best_gidx.gid_type != IBV_GID_TYPE_ROCE_V2 ||
|
||||||
|
gidx.gid_type == IBV_GID_TYPE_ROCE_V1 && best_gidx.gid_type == IBV_GID_TYPE_IB ||
|
||||||
|
gidx.gid_type == best_gidx.gid_type && is_ipv4_gid(&gidx))
|
||||||
|
{
|
||||||
|
gid_index = k;
|
||||||
|
best_gidx = gidx;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
ctx->gid_index = gid_index = (gid_index == -1 ? 0 : gid_index);
|
||||||
|
if (log_level > 0)
|
||||||
|
{
|
||||||
|
log_rdma_dev_port_gid(ctx->dev, ctx->ib_port, ctx->gid_index, best_gidx);
|
||||||
|
}
|
||||||
|
ctx->my_gid = best_gidx.gid;
|
||||||
|
}
|
||||||
|
#endif
|
||||||
|
|
||||||
ctx->pd = ibv_alloc_pd(ctx->context);
|
ctx->pd = ibv_alloc_pd(ctx->context);
|
||||||
if (!ctx->pd)
|
if (!ctx->pd)
|
||||||
|
@ -598,6 +801,7 @@ void osd_messenger_t::handle_rdma_events()
|
||||||
}
|
}
|
||||||
fprintf(stderr, " with status: %s, stopping client\n", ibv_wc_status_str(wc[i].status));
|
fprintf(stderr, " with status: %s, stopping client\n", ibv_wc_status_str(wc[i].status));
|
||||||
stop_client(client_id);
|
stop_client(client_id);
|
||||||
|
clear_immediate_ops(client_id);
|
||||||
continue;
|
continue;
|
||||||
}
|
}
|
||||||
if (!is_send)
|
if (!is_send)
|
||||||
|
@ -606,6 +810,7 @@ void osd_messenger_t::handle_rdma_events()
|
||||||
if (!handle_read_buffer(cl, rc->recv_buffers[rc->next_recv_buf].buf, wc[i].byte_len))
|
if (!handle_read_buffer(cl, rc->recv_buffers[rc->next_recv_buf].buf, wc[i].byte_len))
|
||||||
{
|
{
|
||||||
// handle_read_buffer may stop the client
|
// handle_read_buffer may stop the client
|
||||||
|
clear_immediate_ops(client_id);
|
||||||
continue;
|
continue;
|
||||||
}
|
}
|
||||||
try_recv_rdma_wr(cl, rc->recv_buffers[rc->next_recv_buf]);
|
try_recv_rdma_wr(cl, rc->recv_buffers[rc->next_recv_buf]);
|
||||||
|
@ -666,9 +871,5 @@ void osd_messenger_t::handle_rdma_events()
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
} while (event_count > 0);
|
} while (event_count > 0);
|
||||||
for (auto cb: set_immediate)
|
handle_immediate_ops();
|
||||||
{
|
|
||||||
cb();
|
|
||||||
}
|
|
||||||
set_immediate.clear();
|
|
||||||
}
|
}
|
||||||
|
|
|
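Side note on the auto-selection above (commentary, not part of the diff): match_device() only keeps a candidate whose GID is a RoCE v1/v2 address inside one of the configured osd_network CIDRs, short-circuiting on the first RoCEv2 match, and the per-port GID auto-guess prefers IPv4 RoCEv2 over IPv6 RoCEv2 over RoCEv1 over plain IB. A standalone sketch of that preference order, using simplified stand-in types rather than the real libibverbs `ibv_gid_entry`:

```cpp
// Hypothetical stand-ins for illustration only; not libibverbs types.
#include <cstdio>
#include <vector>

enum GidKind { GID_IB = 0, GID_ROCE_V1 = 1, GID_ROCE_V2 = 2 };

struct GidCandidate
{
    GidKind kind;
    bool ipv4;
};

// Higher score wins: IPv4 RoCEv2 > IPv6 RoCEv2 > IPv4 RoCEv1 > IPv6 RoCEv1 > IB
static int score(const GidCandidate & c)
{
    return (int)c.kind*2 + (c.ipv4 ? 1 : 0);
}

int main()
{
    std::vector<GidCandidate> gids = {
        { GID_IB, false }, { GID_ROCE_V1, true }, { GID_ROCE_V2, false }, { GID_ROCE_V2, true },
    };
    int best = -1;
    for (size_t k = 0; k < gids.size(); k++)
        if (best < 0 || score(gids[k]) > score(gids[best]))
            best = (int)k;
    printf("selected GID index %d\n", best); // prints 3: the IPv4 RoCEv2 entry
    return 0;
}
```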
@@ -36,7 +36,7 @@ struct msgr_rdma_context_t
 int max_cqe = 0;
 int used_max_cqe = 0;

-static msgr_rdma_context_t *create(const char *ib_devname, uint8_t ib_port, uint8_t gid_index, uint32_t mtu, bool odp, int log_level);
+static msgr_rdma_context_t *create(std::vector<std::string> osd_networks, const char *ib_devname, uint8_t ib_port, uint8_t gid_index, uint32_t mtu, bool odp, int log_level);
 ~msgr_rdma_context_t();
 };
@@ -65,6 +65,7 @@ void osd_messenger_t::read_requests()
 bool osd_messenger_t::handle_read(int result, osd_client_t *cl)
 {
 bool ret = false;
+int peer_fd = cl->peer_fd;
 cl->read_msg.msg_iovlen = 0;
 cl->refs--;
 if (cl->peer_state == PEER_STOPPED)
@@ -101,7 +102,8 @@ bool osd_messenger_t::handle_read(int result, osd_client_t *cl)
 {
 if (!handle_read_buffer(cl, cl->in_buf, result))
 {
-goto fin;
+clear_immediate_ops(peer_fd);
+return false;
 }
 }
 else
@@ -113,7 +115,8 @@ bool osd_messenger_t::handle_read(int result, osd_client_t *cl)
 {
 if (!handle_finished_read(cl))
 {
-goto fin;
+clear_immediate_ops(peer_fd);
+return false;
 }
 }
 }
@@ -122,15 +125,47 @@ bool osd_messenger_t::handle_read(int result, osd_client_t *cl)
 ret = true;
 }
 }
-fin:
-for (auto cb: set_immediate)
-{
-cb();
-}
-set_immediate.clear();
+handle_immediate_ops();
 return ret;
 }

+void osd_messenger_t::clear_immediate_ops(int peer_fd)
+{
+size_t i = 0, j = 0;
+while (i < set_immediate_ops.size())
+{
+if (set_immediate_ops[i]->peer_fd == peer_fd)
+{
+delete set_immediate_ops[i];
+}
+else
+{
+if (i != j)
+set_immediate_ops[j] = set_immediate_ops[i];
+j++;
+}
+i++;
+}
+set_immediate_ops.resize(j);
+}
+
+void osd_messenger_t::handle_immediate_ops()
+{
+for (auto op: set_immediate_ops)
+{
+if (op->op_type == OSD_OP_IN)
+{
+exec_op(op);
+}
+else
+{
+// Copy lambda to be unaffected by `delete op`
+std::function<void(osd_op_t*)>(op->callback)(op);
+}
+}
+set_immediate_ops.clear();
+}
+
 bool osd_messenger_t::handle_read_buffer(osd_client_t *cl, void *curbuf, int remain)
 {
 // Compose operation(s) from the buffer
@@ -199,7 +234,7 @@ bool osd_messenger_t::handle_finished_read(osd_client_t *cl)
 {
 // Operation is ready
 cl->received_ops.push_back(cl->read_op);
-set_immediate.push_back([this, op = cl->read_op]() { exec_op(op); });
+set_immediate_ops.push_back(cl->read_op);
 cl->read_op = NULL;
 cl->read_state = 0;
 }
@@ -295,7 +330,7 @@ void osd_messenger_t::handle_op_hdr(osd_client_t *cl)
 {
 // Operation is ready
 cl->received_ops.push_back(cur_op);
-set_immediate.push_back([this, cur_op]() { exec_op(cur_op); });
+set_immediate_ops.push_back(cur_op);
 cl->read_op = NULL;
 cl->read_state = 0;
 }
@@ -416,9 +451,5 @@ void osd_messenger_t::handle_reply_ready(osd_op_t *op)
 (tv_end.tv_sec - op->tv_begin.tv_sec)*1000000 +
 (tv_end.tv_nsec - op->tv_begin.tv_nsec)/1000
 );
-set_immediate.push_back([op]()
-{
-// Copy lambda to be unaffected by `delete op`
-std::function<void(osd_op_t*)>(op->callback)(op);
-});
+set_immediate_ops.push_back(op);
 }
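A note on the refactor above (commentary, not part of the diff): pending operations are now queued as raw osd_op_t pointers instead of captured lambdas, which is what makes it possible to drop the queued ops of a stopped client by peer_fd. clear_immediate_ops() uses a classic two-index in-place compaction. A minimal standalone sketch of that pattern, with int* standing in for osd_op_t*:

```cpp
// Minimal illustration of the in-place compaction used by clear_immediate_ops():
// delete elements that match, keep the rest packed at the front, then shrink.
#include <cassert>
#include <vector>

static void erase_matching(std::vector<int*> & ops, int peer_fd)
{
    size_t i = 0, j = 0;
    while (i < ops.size())
    {
        if (*ops[i] == peer_fd)
            delete ops[i];       // drop ops of the stopped client
        else
        {
            if (i != j)
                ops[j] = ops[i]; // compact survivors to the front
            j++;
        }
        i++;
    }
    ops.resize(j);
}

int main()
{
    std::vector<int*> ops = { new int(1), new int(2), new int(1) };
    erase_matching(ops, 1);
    assert(ops.size() == 1 && *ops[0] == 2);
    delete ops[0];
    return 0;
}
```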
@@ -64,7 +64,7 @@ static void netlink_sock_alloc(struct netlink_ctx *ctx)
 if (nl_driver_id < 0)
 {
 nl_socket_free(sk);
-fail("Couldn't resolve the nbd netlink family\n");
+fail("Couldn't resolve the nbd netlink family: %s (code %d)\n", nl_geterror(nl_driver_id), nl_driver_id);
 }

 ctx->driver_id = nl_driver_id;
@@ -555,7 +555,12 @@ help:
 fcntl(sockfd[0], F_SETFL, fcntl(sockfd[0], F_GETFL, 0) | O_NONBLOCK);
 nbd_fd = sockfd[0];
 load_module();

 bool bg = cfg["foreground"].is_null();
+if (cfg["logfile"].string_value() != "")
+{
+logfile = cfg["logfile"].string_value();
+}
+
 if (netlink)
 {
@@ -588,6 +593,10 @@ help:
 }
 close(sockfd[1]);
 printf("/dev/nbd%d\n", err);
+if (bg)
+{
+daemonize_reopen_stdio();
+}
 #else
 fprintf(stderr, "netlink support is disabled in this build\n");
 exit(1);
@@ -631,14 +640,10 @@ help:
 }
 }
 }
-}
-if (cfg["logfile"].string_value() != "")
-{
-logfile = cfg["logfile"].string_value();
-}
-if (bg)
-{
-daemonize();
-}
+if (bg)
+{
+daemonize();
+}
 }
 // Initialize read state
 read_state = CL_READ_HDR;
@@ -716,13 +721,17 @@ help:
 }
 }

-void daemonize()
+void daemonize_fork()
 {
 if (fork())
 exit(0);
 setsid();
 if (fork())
 exit(0);
+}
+
+void daemonize_reopen_stdio()
+{
 close(0);
 close(1);
 close(2);
@@ -733,6 +742,12 @@ help:
 fprintf(stderr, "Warning: Failed to chdir into /\n");
 }

+void daemonize()
+{
+daemonize_fork();
+daemonize_reopen_stdio();
+}
+
 json11::Json::object list_mapped()
 {
 const char *self_filename = exe_name;
@@ -783,8 +798,9 @@ help:
 if (!strcmp(pid_filename, self_filename))
 {
 json11::Json::object cfg = nbd_proxy::parse_args(argv.size(), argv.data());
-if (cfg["command"] == "map")
+if (cfg["command"] == "map" || cfg["command"] == "netlink-map")
 {
+cfg["interface"] = (cfg["command"] == "netlink-map") ? "netlink" : "nbd";
 cfg.erase("command");
 cfg["pid"] = pid;
 mapped["/dev/nbd"+std::to_string(dev_num)] = cfg;
@@ -16,7 +16,6 @@
 #include "qapi/error.h"
 #include "qapi/qmp/qdict.h"
 #include "qapi/qmp/qerror.h"
-#include "qemu/uri.h"
 #include "qemu/error-report.h"
 #include "qemu/module.h"
 #include "qemu/option.h"
@@ -1021,7 +1020,11 @@ static BlockDriver bdrv_vitastor = {
 // FIXME: Implement it along with per-inode statistics
 //.bdrv_get_allocated_file_size = vitastor_get_allocated_file_size,

+#if QEMU_VERSION_MAJOR > 9 || QEMU_VERSION_MAJOR == 9 && QEMU_VERSION_MINOR > 0
+.bdrv_open = vitastor_file_open,
+#else
 .bdrv_file_open = vitastor_file_open,
+#endif
 .bdrv_close = vitastor_close,

 // Option list for the create operation
@@ -261,6 +261,7 @@ struct dd_out_info_t
 else
 {
 // ok
+out_size = owatch->cfg.size;
 return true;
 }
 // Wait for sub-command
@@ -882,6 +883,8 @@ resume_2:
 oinfo.end_fsync = oinfo.end_fsync && oinfo.out_seekable;
 read_offset = 0;
 read_end = iinfo.in_seekable ? iinfo.in_size-iseek : 0;
+if (oinfo.out_size && (!read_end || read_end > oinfo.out_size-oseek))
+read_end = oinfo.out_size-oseek;
 if (bytelimit && (!read_end || read_end > bytelimit))
 read_end = bytelimit;
 clock_gettime(CLOCK_REALTIME, &tv_begin);
@@ -216,7 +216,7 @@ resume_1:
 for (uint64_t osd_num: node.child_osds)
 {
 auto & osd = placement_tree->osds.at(osd_num);
-fmt_items.push_back(json11::Json::object{
+auto json_osd = json11::Json::object{
 { "type", "osd" },
 { "name", osd.num },
 { "parent", node.name },
@@ -230,7 +230,16 @@ resume_1:
 { "bitmap", (uint64_t)osd.bitmap_granularity },
 { "commit", osd.immediate_commit == IMMEDIATE_NONE ? "none" : (osd.immediate_commit == IMMEDIATE_ALL ? "all" : "small") },
 { "op_stats", osd_stats[osd_num]["op_stats"] },
-});
+};
+if (osd_stats[osd_num]["slow_ops_primary"].uint64_value() > 0)
+{
+json_osd["slow_ops_primary"] = osd_stats[osd_num]["slow_ops_primary"];
+}
+if (osd_stats[osd_num]["slow_ops_secondary"].uint64_value() > 0)
+{
+json_osd["slow_ops_secondary"] = osd_stats[osd_num]["slow_ops_secondary"];
+}
+fmt_items.push_back(json_osd);
 }
 }
 result.data = fmt_items;
@@ -35,6 +35,7 @@ struct pool_creator_t
 uint64_t new_pools_mod_rev;
 json11::Json state_node_tree;
 json11::Json new_pools;
+std::map<osd_num_t, json11::Json> osd_stats;

 bool is_done() { return state == 100; }

@@ -46,8 +47,6 @@ struct pool_creator_t
 goto resume_2;
 else if (state == 3)
 goto resume_3;
-else if (state == 4)
-goto resume_4;
 else if (state == 5)
 goto resume_5;
 else if (state == 6)
@@ -90,13 +89,19 @@ resume_1:
 // If not forced, check that we have enough osds for pg_size
 if (!force)
 {
-// Get node_placement configuration from etcd
+// Get node_placement configuration from etcd and OSD stats
 parent->etcd_txn(json11::Json::object {
 { "success", json11::Json::array {
 json11::Json::object {
 { "request_range", json11::Json::object {
 { "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/node_placement") },
-} }
+} },
+},
+json11::Json::object {
+{ "request_range", json11::Json::object {
+{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/osd/stats/") },
+{ "range_end", base64_encode(parent->cli->st_cli.etcd_prefix+"/osd/stats0") },
+} },
 },
 } },
 });
@@ -112,10 +117,21 @@ resume_2:
 return;
 }

-// Get state_node_tree based on node_placement and osd peer states
+// Get state_node_tree based on node_placement and osd stats
 {
-auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
-state_node_tree = get_state_node_tree(kv.value.object_items());
+auto node_placement_kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
+timespec tv_now;
+clock_gettime(CLOCK_REALTIME, &tv_now);
+uint64_t osd_out_time = parent->cli->config["osd_out_time"].uint64_value();
+if (!osd_out_time)
+osd_out_time = 600;
+osd_stats.clear();
+parent->iterate_kvs_1(parent->etcd_result["responses"][1]["response_range"]["kvs"], "/osd/stats/", [&](uint64_t cur_osd, json11::Json value)
+{
+if ((uint64_t)value["time"].number_value()+osd_out_time >= tv_now.tv_sec)
+osd_stats[cur_osd] = value;
+});
+state_node_tree = get_state_node_tree(node_placement_kv.value.object_items(), osd_stats);
 }

 // Skip tag checks, if pool has none
@@ -158,42 +174,18 @@ resume_3:
 }
 }

-// Get stats (for block_size, bitmap_granularity, ...) of osds in state_node_tree
-{
-json11::Json::array osd_stats;
-
-for (auto osd_num: state_node_tree["osds"].array_items())
-{
-osd_stats.push_back(json11::Json::object {
-{ "request_range", json11::Json::object {
-{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/osd/stats/"+osd_num.as_string()) },
-} }
-});
-}
-
-parent->etcd_txn(json11::Json::object{ { "success", osd_stats } });
-}
-
-state = 4;
-resume_4:
-if (parent->waiting > 0)
-return;
-if (parent->etcd_err.err)
-{
-result = parent->etcd_err;
-state = 100;
-return;
-}
-
 // Filter osds from state_node_tree based on pool parameters and osd stats
 {
-std::vector<json11::Json> osd_stats;
-for (auto & ocr: parent->etcd_result["responses"].array_items())
+std::vector<json11::Json> filtered_osd_stats;
+for (auto & osd_num: state_node_tree["osds"].array_items())
 {
-auto kv = parent->cli->st_cli.parse_etcd_kv(ocr["response_range"]["kvs"][0]);
-osd_stats.push_back(kv.value);
+auto st_it = osd_stats.find(osd_num.uint64_value());
+if (st_it != osd_stats.end())
+{
+filtered_osd_stats.push_back(st_it->second);
+}
 }
-guess_block_size(osd_stats);
+guess_block_size(filtered_osd_stats);
 state_node_tree = filter_state_node_tree_by_stats(state_node_tree, osd_stats);
 }

@@ -201,8 +193,7 @@ resume_4:
 {
 auto failure_domain = cfg["failure_domain"].string_value() == ""
 ? "host" : cfg["failure_domain"].string_value();
-uint64_t max_pg_size = get_max_pg_size(state_node_tree["nodes"].object_items(),
-failure_domain, cfg["root_node"].string_value());
+uint64_t max_pg_size = get_max_pg_size(state_node_tree, failure_domain, cfg["root_node"].string_value());

 if (cfg["pg_size"].uint64_value() > max_pg_size)
 {
@@ -358,56 +349,50 @@ resume_8:

 // Returns a JSON object of form {"nodes": {...}, "osds": [...]} that
 // contains: all nodes (osds, hosts, ...) based on node_placement config
-// and current peer state, and a list of active peer osds.
-json11::Json get_state_node_tree(json11::Json::object node_placement)
+// and current osd stats.
+json11::Json get_state_node_tree(json11::Json::object node_placement, std::map<osd_num_t, json11::Json> & osd_stats)
 {
-// Erase non-peer osd nodes from node_placement
+// Erase non-existing osd nodes from node_placement
 for (auto np_it = node_placement.begin(); np_it != node_placement.end();)
 {
 // Numeric nodes are osds
 osd_num_t osd_num = stoull_full(np_it->first);

-// If node is osd and it is not in peer states, erase it
-if (osd_num > 0 &&
-parent->cli->st_cli.peer_states.find(osd_num) == parent->cli->st_cli.peer_states.end())
-{
+// If node is osd and its stats do not exist, erase it
+if (osd_num > 0 && osd_stats.find(osd_num) == osd_stats.end())
 node_placement.erase(np_it++);
-}
 else
 np_it++;
 }

-// List of peer osds
-std::vector<std::string> peer_osds;
+// List of osds
+std::vector<std::string> existing_osds;

-// Record peer osds and add missing osds/hosts to np
-for (auto & ps: parent->cli->st_cli.peer_states)
+// Record osds and add missing osds/hosts to np
+for (auto & ps: osd_stats)
 {
 std::string osd_num = std::to_string(ps.first);

-// Record peer osd
-peer_osds.push_back(osd_num);
+// Record osd
+existing_osds.push_back(osd_num);

-// Add osd, if necessary
-if (node_placement.find(osd_num) == node_placement.end())
+// Add host if necessary
+std::string osd_host = ps.second["host"].as_string();
+if (node_placement.find(osd_host) == node_placement.end())
 {
-std::string osd_host = ps.second["host"].as_string();
-
-// Add host, if necessary
-if (node_placement.find(osd_host) == node_placement.end())
-{
-node_placement[osd_host] = json11::Json::object {
-{ "level", "host" }
-};
-}
-
-node_placement[osd_num] = json11::Json::object {
-{ "parent", osd_host }
+node_placement[osd_host] = json11::Json::object {
+{ "level", "host" }
 };
 }
+
+// Add osd
+node_placement[osd_num] = json11::Json::object {
+{ "parent", node_placement[osd_num]["parent"].is_null() ? osd_host : node_placement[osd_num]["parent"] },
+{ "level", "osd" },
+};
 }

-return json11::Json::object { { "osds", peer_osds }, { "nodes", node_placement } };
+return json11::Json::object { { "osds", existing_osds }, { "nodes", node_placement } };
 }

 // Returns new state_node_tree based on given state_node_tree with osds
@@ -534,15 +519,13 @@ resume_8:
 // filtered out by stats parameters (block_size, bitmap_granularity) in
 // given osd_stats and current pool config.
 // Requires: state_node_tree["osds"] must match osd_stats 1-1
-json11::Json filter_state_node_tree_by_stats(const json11::Json & state_node_tree, std::vector<json11::Json> & osd_stats)
+json11::Json filter_state_node_tree_by_stats(const json11::Json & state_node_tree, std::map<osd_num_t, json11::Json> & osd_stats)
 {
-auto & osds = state_node_tree["osds"].array_items();
-
 // Accepted state_node_tree nodes
 auto accepted_nodes = state_node_tree["nodes"].object_items();

 // List of accepted osds
-std::vector<std::string> accepted_osds;
+json11::Json::array accepted_osds;

 block_size = cfg["block_size"].uint64_value()
 ? cfg["block_size"].uint64_value()
@@ -554,21 +537,25 @@ resume_8:
 ? etcd_state_client_t::parse_immediate_commit(cfg["immediate_commit"].string_value(), IMMEDIATE_ALL)
 : parent->cli->st_cli.global_immediate_commit;

-for (size_t i = 0; i < osd_stats.size(); i++)
+for (auto osd_num_json: state_node_tree["osds"].array_items())
 {
-auto & os = osd_stats[i];
-// Get osd number
-auto osd_num = osds[i].as_string();
+auto osd_num = osd_num_json.uint64_value();
+auto os_it = osd_stats.find(osd_num);
+if (os_it == osd_stats.end())
+{
+continue;
+}
+auto & os = os_it->second;
 if (!os["data_block_size"].is_null() && os["data_block_size"] != block_size ||
 !os["bitmap_granularity"].is_null() && os["bitmap_granularity"] != bitmap_granularity ||
 !os["immediate_commit"].is_null() &&
 etcd_state_client_t::parse_immediate_commit(os["immediate_commit"].string_value(), IMMEDIATE_NONE) < immediate_commit)
 {
-accepted_nodes.erase(osd_num);
+accepted_nodes.erase(osd_num_json.as_string());
 }
 else
 {
-accepted_osds.push_back(osd_num);
+accepted_osds.push_back(osd_num_json);
 }
 }

@@ -576,87 +563,28 @@ resume_8:
 }

 // Returns maximum pg_size possible for given node_tree and failure_domain, starting at parent_node
-uint64_t get_max_pg_size(json11::Json::object node_tree, const std::string & level, const std::string & parent_node)
+uint64_t get_max_pg_size(json11::Json state_node_tree, const std::string & level, const std::string & root_node)
 {
-uint64_t max_pg_sz = 0;
-
-std::vector<std::string> nodes;
-
-// Check if parent node is an osd (numeric)
-if (parent_node != "" && stoull_full(parent_node))
-{
-// Add it to node list if osd is in node tree
-if (node_tree.find(parent_node) != node_tree.end())
-nodes.push_back(parent_node);
-}
-// If parent node given, ...
-else if (parent_node != "")
-{
-// ... look for children nodes of this parent
-for (auto & sn: node_tree)
-{
-auto & props = sn.second.object_items();
-
-auto parent_prop = props.find("parent");
-if (parent_prop != props.end() && (parent_prop->second.as_string() == parent_node))
-{
-nodes.push_back(sn.first);
-
-// If we're not looking for all osds, we only need a single
-// child osd node
-if (level != "osd" && stoull_full(sn.first))
-break;
-}
-}
-}
-// No parent node given, and we're not looking for all osds
-else if (level != "osd")
-{
-// ... look for all level nodes
-for (auto & sn: node_tree)
-{
-auto & props = sn.second.object_items();
-
-auto level_prop = props.find("level");
-if (level_prop != props.end() && (level_prop->second.as_string() == level))
-{
-nodes.push_back(sn.first);
-}
-}
-}
-// Otherwise, ...
-else
-{
-// ... we're looking for osd nodes only
-for (auto & sn: node_tree)
-{
-if (stoull_full(sn.first))
-{
-nodes.push_back(sn.first);
-}
-}
-}
-
-// Process gathered nodes
-for (auto & node: nodes)
-{
-// Check for osd node, return constant max size
-if (stoull_full(node))
-{
-max_pg_sz += 1;
-}
-// Otherwise, ...
-else
-{
-// ... exclude parent node from tree, and ...
-node_tree.erase(parent_node);
-
-// ... descend onto the resulting tree
-max_pg_sz += get_max_pg_size(node_tree, level, node);
-}
-}
-
-return max_pg_sz;
+std::set<std::string> level_seen;
+for (auto & osd: state_node_tree["osds"].array_items())
+{
+// find OSD parent at <level>, but stop at <root_node>
+auto cur_id = osd.string_value();
+auto cur = state_node_tree["nodes"][cur_id];
+while (!cur.is_null())
+{
+if (cur["level"] == level)
+{
+level_seen.insert(cur_id);
+break;
+}
+if (cur_id == root_node)
+break;
+cur_id = cur["parent"].string_value();
+cur = state_node_tree["nodes"][cur_id];
+}
+}
+return level_seen.size();
 }

 json11::Json create_pool(const etcd_kv_t & kv)
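Commentary on the rewritten get_max_pg_size() above (not part of the diff): instead of recursively enumerating child nodes, it now walks each OSD up the placement tree until it hits a node at the failure-domain level (or the configured root node) and counts the distinct ancestors found, so the maximum pg_size equals the number of distinct failure domains that contain usable OSDs. A standalone sketch of that idea, with a simplified node map instead of the json11 tree used in the real code:

```cpp
// Simplified stand-in for the placement tree: each node has a parent and a level.
#include <cstdio>
#include <map>
#include <set>
#include <string>

struct Node { std::string parent, level; };

static size_t max_pg_size(const std::map<std::string, Node> & nodes,
    const std::set<std::string> & osds, const std::string & level, const std::string & root)
{
    std::set<std::string> seen;
    for (auto & osd: osds)
    {
        std::string cur = osd;
        while (nodes.count(cur))
        {
            if (nodes.at(cur).level == level)
            {
                seen.insert(cur); // found this OSD's failure domain
                break;
            }
            if (cur == root)
                break;
            cur = nodes.at(cur).parent;
        }
    }
    return seen.size();
}

int main()
{
    std::map<std::string, Node> nodes = {
        { "host1", { "", "host" } }, { "host2", { "", "host" } },
        { "1", { "host1", "osd" } }, { "2", { "host1", "osd" } }, { "3", { "host2", "osd" } },
    };
    // 3 OSDs spread over 2 hosts -> max pg_size 2 with failure_domain=host
    printf("%zu\n", max_pg_size(nodes, { "1", "2", "3" }, "host", ""));
    return 0;
}
```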
@@ -134,6 +134,7 @@ resume_2:
 }
 int osd_count = 0, osd_up = 0;
 uint64_t total_raw = 0, free_raw = 0, free_down_raw = 0, down_raw = 0;
+std::vector<uint64_t> slow_op_primary_osds, slow_op_secondary_osds;
 parent->iterate_kvs_1(osd_stats, "/osd/stats/", [&](uint64_t stat_osd_num, json11::Json value)
 {
 osd_count++;
@@ -141,18 +142,29 @@ resume_2:
 auto osd_free = value["free"].uint64_value();
 total_raw += osd_size;
 free_raw += osd_free;
-if (!osd_free)
-{
-osds_full++;
-}
-else if (osd_free < (uint64_t)(osd_size*(1-osd_nearfull_ratio)))
-{
-osds_nearfull++;
+if (osd_size)
+{
+if (!osd_free)
+{
+osds_full++;
+}
+else if (osd_free < (uint64_t)(osd_size*(1-osd_nearfull_ratio)))
+{
+osds_nearfull++;
+}
 }
 auto peer_it = parent->cli->st_cli.peer_states.find(stat_osd_num);
 if (peer_it != parent->cli->st_cli.peer_states.end())
 {
 osd_up++;
+if (value["slow_ops_primary"].uint64_value() > 0)
+{
+slow_op_primary_osds.push_back(stat_osd_num);
+}
+if (value["slow_ops_secondary"].uint64_value() > 0)
+{
+slow_op_secondary_osds.push_back(stat_osd_num);
+}
 }
 else
 {
@@ -216,6 +228,10 @@ resume_2:
 { "mon_master", mon_master },
 { "osd_up", osd_up },
 { "osd_count", osd_count },
+{ "osds_full", osds_full },
+{ "osds_nearfull", osds_nearfull },
+{ "osds_primary_slow_ops", slow_op_primary_osds },
+{ "osds_secondary_slow_ops", slow_op_secondary_osds },
 { "total_raw", total_raw },
 { "free_raw", free_raw },
 { "down_raw", down_raw },
@@ -300,6 +316,26 @@ resume_2:
 warning_str += " "+std::to_string(osds_nearfull)+
 (osds_nearfull > 1 ? " osds are almost full\n" : " osd is almost full\n");
 }
+if (slow_op_primary_osds.size() > 0)
+{
+warning_str += " "+std::to_string(slow_op_primary_osds.size());
+warning_str += (slow_op_primary_osds.size() > 1 ? " osds have" : " osd has");
+warning_str += " slow client ops: ";
+for (int i = 0; i < slow_op_primary_osds.size(); i++)
+{
+warning_str += (i > 0 ? ", " : "")+std::to_string(slow_op_primary_osds[i])+"\n";
+}
+}
+if (slow_op_secondary_osds.size() > 0)
+{
+warning_str += " "+std::to_string(slow_op_secondary_osds.size());
+warning_str += (slow_op_secondary_osds.size() > 1 ? " osds have" : " osd has");
+warning_str += " slow replication ops: ";
+for (int i = 0; i < slow_op_secondary_osds.size(); i++)
+{
+warning_str += (i > 0 ? ", " : "")+std::to_string(slow_op_secondary_osds[i])+"\n";
+}
+}
 if (warning_str != "")
 {
 warning_str = "\n warning:\n"+warning_str;
@@ -108,6 +108,10 @@ int disk_tool_t::prepare_one(std::map<std::string, std::string> options, int is_
 try
 {
 dsk.parse_config(options);
+// Set all offsets to 4096 to calculate metadata size with excess
+dsk.journal_offset = 4096;
+dsk.meta_offset = 4096;
+dsk.data_offset = 4096;
 dsk.data_io = dsk.meta_io = dsk.journal_io = (options["io"] == "cached" ? "cached" : "direct");
 dsk.open_data();
 dsk.open_meta();
@@ -171,8 +175,8 @@ int disk_tool_t::prepare_one(std::map<std::string, std::string> options, int is_
 }
 sb["osd_num"] = osd_num;
 // Zero out metadata and journal
-if (write_zero(dsk.meta_fd, dsk.meta_offset, dsk.meta_len) != 0 ||
-write_zero(dsk.journal_fd, dsk.journal_offset, dsk.journal_len) != 0)
+if (write_zero(dsk.meta_fd, sb["meta_offset"].uint64_value(), dsk.meta_len) != 0 ||
+write_zero(dsk.journal_fd, sb["journal_offset"].uint64_value(), dsk.journal_len) != 0)
 {
 fprintf(stderr, "Failed to zero out metadata or journal: %s\n", strerror(errno));
 dsk.close_all();
@@ -498,6 +502,9 @@ int disk_tool_t::get_meta_partition(std::vector<vitastor_dev_info_t> & ssds, std
 {
 blockstore_disk_t dsk;
 dsk.parse_config(options);
+dsk.journal_offset = 4096;
+dsk.meta_offset = 4096;
+dsk.data_offset = 4096;
 dsk.data_io = dsk.meta_io = dsk.journal_io = "cached";
 dsk.open_data();
 dsk.open_meta();
@@ -535,10 +535,12 @@ void osd_t::print_stats()

 void osd_t::print_slow()
 {
-bool has_slow = false;
+cur_slow_op_primary = 0;
+cur_slow_op_secondary = 0;
 char alloc[1024];
 timespec now;
 clock_gettime(CLOCK_REALTIME, &now);
+// FIXME: Also track slow local blockstore ops and recovery/flush/scrub ops
 for (auto & kv: msgr.clients)
 {
 for (auto op: kv.second->received_ops)
@@ -608,6 +610,7 @@ void osd_t::print_slow()
 op->req.hdr.opcode == OSD_OP_SEC_STABILIZE || op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK ||
 op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
 {
+cur_slow_op_secondary++;
 bufprintf(" state=%d", op->bs_op ? PRIV(op->bs_op)->op_state : -1);
 int wait_for = op->bs_op ? PRIV(op->bs_op)->wait_for : 0;
 if (wait_for)
@@ -618,15 +621,19 @@ void osd_t::print_slow()
 else if (op->req.hdr.opcode == OSD_OP_READ || op->req.hdr.opcode == OSD_OP_WRITE ||
 op->req.hdr.opcode == OSD_OP_SYNC || op->req.hdr.opcode == OSD_OP_DELETE)
 {
+cur_slow_op_primary++;
 bufprintf(" state=%d", !op->op_data ? -1 : op->op_data->st);
 }
+else
+{
+cur_slow_op_primary++;
+}
 #undef bufprintf
 printf("%s\n", alloc);
-has_slow = true;
 }
 }
 }
-if (has_slow && bs)
+if ((cur_slow_op_primary+cur_slow_op_secondary) > 0 && bs)
 {
 bs->dump_diagnostics();
 }
@@ -150,7 +150,9 @@ class osd_t
 bool pg_config_applied = false;
 bool etcd_reporting_pg_state = false;
 bool etcd_reporting_stats = false;
-int autosync_timer_id = -1, print_stats_timer_id = -1, slow_log_timer_id = -1;
+int print_stats_timer_id = -1, slow_log_timer_id = -1;
+uint64_t cur_slow_op_primary = 0;
+uint64_t cur_slow_op_secondary = 0;

 // peers and PGs

@@ -168,6 +170,8 @@ class osd_t
 object_id recovery_last_oid;
 int recovery_pg_done = 0, recovery_done = 0;
 osd_op_t *autosync_op = NULL;
+int autosync_copies_to_delete = 0;
+int autosync_timer_id = -1;

 // Scrubbing
 uint64_t scrub_nearest_ts = 0;
@@ -222,6 +226,7 @@ class osd_t
 void parse_config(bool init);
 void init_cluster();
 void on_change_osd_state_hook(osd_num_t peer_osd);
+void on_change_backfillfull_hook(pool_id_t pool_id);
 void on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num);
 void on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes);
 void on_load_config_hook(json11::Json::object & changes);
@@ -65,6 +65,7 @@ void osd_t::init_cluster()
 st_cli.tfd = tfd;
 st_cli.log_level = log_level;
 st_cli.on_change_osd_state_hook = [this](osd_num_t peer_osd) { on_change_osd_state_hook(peer_osd); };
+st_cli.on_change_backfillfull_hook = [this](pool_id_t pool_id) { on_change_backfillfull_hook(pool_id); };
 st_cli.on_change_pg_history_hook = [this](pool_id_t pool_id, pg_num_t pg_num) { on_change_pg_history_hook(pool_id, pg_num); };
 st_cli.on_change_hook = [this](std::map<std::string, etcd_kv_t> & changes) { on_change_etcd_state_hook(changes); };
 st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); };
@@ -201,6 +202,14 @@ json11::Json osd_t::get_statistics()
 st["immediate_commit"] = immediate_commit == IMMEDIATE_ALL ? "all" : (immediate_commit == IMMEDIATE_SMALL ? "small" : "none");
 st["host"] = self_state["host"];
 st["version"] = VITASTOR_VERSION;
+if (cur_slow_op_primary > 0)
+{
+st["slow_ops_primary"] = cur_slow_op_primary;
+}
+if (cur_slow_op_secondary > 0)
+{
+st["slow_ops_secondary"] = cur_slow_op_secondary;
+}
 json11::Json::object op_stats, subop_stats;
 for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
 {
@@ -406,6 +415,14 @@ void osd_t::on_change_osd_state_hook(osd_num_t peer_osd)
 }
 }

+void osd_t::on_change_backfillfull_hook(pool_id_t pool_id)
+{
+if (!(peering_state & (OSD_RECOVERING | OSD_FLUSHING_PGS)))
+{
+peering_state = peering_state | OSD_RECOVERING;
+}
+}
+
 void osd_t::on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes)
 {
 if (changes.find(st_cli.etcd_prefix+"/config/global") != changes.end())
@@ -13,10 +13,11 @@ void osd_t::submit_pg_flush_ops(pg_t & pg)
 bool first = true;
 while (it != pg.flush_actions.end())
 {
-if (!first && (it->first.oid.inode != prev_it->first.oid.inode ||
-(it->first.oid.stripe & ~STRIPE_MASK) != (prev_it->first.oid.stripe & ~STRIPE_MASK)) &&
-fb->rollback_lists[it->first.osd_num].size() >= FLUSH_BATCH ||
-fb->stable_lists[it->first.osd_num].size() >= FLUSH_BATCH)
+if (!first &&
+(it->first.oid.inode != prev_it->first.oid.inode ||
+(it->first.oid.stripe & ~STRIPE_MASK) != (prev_it->first.oid.stripe & ~STRIPE_MASK)) &&
+(fb->rollback_lists[it->first.osd_num].size() >= FLUSH_BATCH ||
+fb->stable_lists[it->first.osd_num].size() >= FLUSH_BATCH))
 {
 // Stop only at the object boundary
 break;
@@ -75,6 +76,7 @@ void osd_t::handle_flush_op(bool rollback, pool_id_t pool_id, pg_num_t pg_num, p
 // Throw the result away
 return;
 }
+fb->flush_done++;
 if (retval != 0)
 {
 if (peer_osd == this->osd_num)
@@ -92,12 +94,11 @@ void osd_t::handle_flush_op(bool rollback, pool_id_t pool_id, pg_num_t pg_num, p
 auto fd_it = msgr.osd_peer_fds.find(peer_osd);
 if (fd_it != msgr.osd_peer_fds.end())
 {
+// Will repeer/stop this PG
 msgr.stop_client(fd_it->second);
 }
-return;
 }
 }
-fb->flush_done++;
 if (fb->flush_done == fb->flush_ops)
 {
 // This flush batch is done
@@ -251,10 +252,18 @@ bool osd_t::pick_next_recovery(osd_recovery_op_t &op)
 auto mask = recovery_last_degraded ? (PG_ACTIVE | PG_HAS_DEGRADED) : (PG_ACTIVE | PG_DEGRADED | PG_HAS_MISPLACED);
 auto check = recovery_last_degraded ? (PG_ACTIVE | PG_HAS_DEGRADED) : (PG_ACTIVE | PG_HAS_MISPLACED);
 // Restart scanning from the same PG as the last time
+restart:
 for (auto pg_it = pgs.lower_bound(recovery_last_pg); pg_it != pgs.end(); pg_it++)
 {
 if ((pg_it->second.state & mask) == check)
 {
+auto pool_it = st_cli.pool_config.find(pg_it->first.pool_id);
+if (pool_it != st_cli.pool_config.end() && pool_it->second.backfillfull)
+{
+// Skip the pool
+recovery_last_pg.pool_id++;
+goto restart;
+}
 auto & src = recovery_last_degraded ? pg_it->second.degraded_objects : pg_it->second.misplaced_objects;
 assert(src.size() > 0);
 // Restart scanning from the next object
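Commentary on pick_next_recovery() above (not part of the diff): when a PG belongs to a pool marked backfillfull, the scan bumps the pool id in recovery_last_pg and restarts the lower_bound() scan, effectively skipping the whole pool instead of visiting each of its remaining PGs. A minimal standalone sketch of that skip pattern; the pool/pg ids and the map layout here are simplified stand-ins:

```cpp
// Sketch: skip every PG of a "backfillfull" pool by jumping to the next pool id
// and re-running lower_bound(), mirroring the goto-restart structure above.
#include <cstdio>
#include <map>
#include <set>
#include <utility>

int main()
{
    typedef std::pair<int, int> pg_id_t; // (pool_id, pg_num)
    std::map<pg_id_t, const char*> pgs = {
        { { 1, 1 }, "pool1/pg1" }, { { 1, 2 }, "pool1/pg2" }, { { 2, 1 }, "pool2/pg1" },
    };
    std::set<int> backfillfull_pools = { 1 };
    pg_id_t last = { 1, 1 };
restart:
    for (auto it = pgs.lower_bound(last); it != pgs.end(); it++)
    {
        if (backfillfull_pools.count(it->first.first))
        {
            last = { it->first.first+1, 0 }; // skip the whole pool
            goto restart;
        }
        printf("recover %s\n", it->second); // prints pool2/pg1 only
        break;
    }
    return 0;
}
```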
@ -645,6 +645,18 @@ void osd_t::remove_object_from_state(object_id & oid, pg_osd_set_state_t **objec
|
||||||
{
|
{
|
||||||
throw std::runtime_error("BUG: Invalid object state: "+std::to_string((*object_state)->state));
|
throw std::runtime_error("BUG: Invalid object state: "+std::to_string((*object_state)->state));
|
||||||
}
|
}
|
||||||
|
if (changed && immediate_commit != IMMEDIATE_ALL)
|
||||||
|
{
|
||||||
|
// Trigger double automatic sync after changing PG state when we're running with fsyncs.
|
||||||
|
// First autosync commits all written objects and applies copies_to_delete_after_sync;
|
||||||
|
// Second autosync commits all deletions run by the first sync.
|
||||||
|
// Without it, rebalancing in a cluster without load may result in some small amount of
|
||||||
|
// garbage left on "extra" OSDs of the PG, because last deletions are not synced at all.
|
||||||
|
// FIXME: 1000% correct way is to switch PG state only after copies_to_delete_after_sync.
|
||||||
|
// But it's much more complicated.
|
||||||
|
unstable_write_count += autosync_writes;
|
||||||
|
autosync_copies_to_delete = 2;
|
||||||
|
}
|
||||||
if (changed && report)
|
if (changed && report)
|
||||||
{
|
{
|
||||||
report_pg_state(pg);
|
report_pg_state(pg);
|
||||||
@@ -9,6 +9,10 @@ void osd_t::autosync()
 {
     if (immediate_commit != IMMEDIATE_ALL && !autosync_op)
     {
+        if (autosync_copies_to_delete > 0)
+        {
+            autosync_copies_to_delete--;
+        }
         autosync_op = new osd_op_t();
         autosync_op->op_type = OSD_OP_IN;
         autosync_op->peer_fd = SELF_FD;
@@ -29,6 +33,11 @@ void osd_t::autosync()
             }
             delete autosync_op;
             autosync_op = NULL;
+            if (autosync_copies_to_delete > 0)
+            {
+                // Trigger the second "copies_to_delete" autosync
+                autosync();
+            }
         };
         exec_op(autosync_op);
     }
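Taken together with the `remove_object_from_state()` hunk above, `autosync_copies_to_delete` forms a small handshake: the PG state change sets the counter to 2, each autosync decrements it, and the sync completion callback re-triggers `autosync()` once more so the deletions scheduled by the first sync get committed by the second. A reduced sketch of that handshake; `osd_sketch_t` and its methods are illustrative, not the real `osd_t` interface:

```cpp
// Reduced sketch of the double-autosync handshake: the first sync commits
// writes and applies scheduled replica deletions, the second sync commits
// those deletions.
#include <cstdio>

struct osd_sketch_t
{
    int autosync_copies_to_delete = 0;

    void on_pg_state_changed()
    {
        // Rebalancing finished for a PG: request two consecutive autosyncs
        autosync_copies_to_delete = 2;
        autosync();
    }

    void autosync()
    {
        if (autosync_copies_to_delete > 0)
            autosync_copies_to_delete--;
        printf("SYNC issued\n");
        on_sync_finished(); // in the real OSD this is the sync op callback
    }

    void on_sync_finished()
    {
        if (autosync_copies_to_delete > 0)
        {
            // Trigger the second "copies_to_delete" autosync
            autosync();
        }
    }
};

int main()
{
    osd_sketch_t osd;
    osd.on_pg_state_changed(); // prints "SYNC issued" twice
    return 0;
}
```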
@@ -213,6 +213,15 @@ resume_8:
         {
             goto resume_6;
         }
+        if (immediate_commit == IMMEDIATE_NONE)
+        {
+            // Mark OSDs as dirty because deletions have to be synced too!
+            for (int i = 0; i < op_data->copies_to_delete_count; i++)
+            {
+                auto & chunk = op_data->copies_to_delete[i];
+                this->dirty_osds.insert(chunk.osd_num);
+            }
+        }
     }
     for (int i = 0; i < op_data->dirty_pg_count; i++)
     {
@@ -227,7 +236,7 @@ resume_8:
             start_pg_peering(pg);
         }
     }
-    // FIXME: Free those in the destructor?
+    // FIXME: Free those in the destructor (not here)?
     free(op_data->dirty_pgs);
     op_data->dirty_pgs = NULL;
     op_data->dirty_osds = NULL;
@@ -7,6 +7,12 @@
 bool osd_t::check_write_queue(osd_op_t *cur_op, pg_t & pg)
 {
     osd_primary_op_data_t *op_data = cur_op->op_data;
+    // First check if PG is not active anymore
+    if (!(pg.state & PG_ACTIVE))
+    {
+        pg_cancel_write_queue(pg, cur_op, op_data->oid, -EPIPE);
+        return false;
+    }
     // Check if actions are pending for this object
     auto act_it = pg.flush_actions.lower_bound((obj_piece_id_t){
         .oid = op_data->oid,
@@ -65,7 +65,7 @@ std::string addr_to_string(const sockaddr_storage &addr)
     return std::string(peer_str)+":"+std::to_string(port);
 }
 
-static bool cidr_match(const in_addr &addr, const in_addr &net, uint8_t bits)
+bool cidr_match(const in_addr &addr, const in_addr &net, uint8_t bits)
 {
     if (bits == 0)
     {
@@ -75,7 +75,7 @@ static bool cidr_match(const in_addr &addr, const in_addr &net, uint8_t bits)
     return !((addr.s_addr ^ net.s_addr) & htonl(0xFFFFFFFFu << (32 - bits)));
 }
 
-static bool cidr6_match(const in6_addr &address, const in6_addr &network, uint8_t bits)
+bool cidr6_match(const in6_addr &address, const in6_addr &network, uint8_t bits)
 {
     const uint32_t *a = address.s6_addr32;
     const uint32_t *n = network.s6_addr32;
@@ -93,47 +93,49 @@ static bool cidr6_match(const in6_addr &address, const in6_addr &network, uint8_
     return true;
 }
 
-struct addr_mask_t
+addr_mask_t cidr_parse(std::string mask)
 {
-    sa_family_t family;
+    unsigned bits = 255;
+    int p = mask.find('/');
+    if (p != std::string::npos)
+    {
+        char null_byte = 0;
+        if (sscanf(mask.c_str()+p+1, "%u%c", &bits, &null_byte) != 1 || bits > 128)
+            throw std::runtime_error("Invalid IP address mask: " + mask);
+        mask = mask.substr(0, p);
+    }
     in_addr ipv4;
     in6_addr ipv6;
-    uint8_t bits;
-};
+    if (inet_pton(AF_INET, mask.c_str(), &ipv4) == 1)
+    {
+        if (bits == 255)
+            bits = 32;
+        if (bits > 32)
+            throw std::runtime_error("Invalid IP address mask: " + mask);
+        return (addr_mask_t){ .family = AF_INET, .ipv4 = ipv4, .bits = (uint8_t)(bits ? bits : 32) };
+    }
+    else if (inet_pton(AF_INET6, mask.c_str(), &ipv6) == 1)
+    {
+        if (bits == 255)
+            bits = 128;
+        return (addr_mask_t){ .family = AF_INET6, .ipv6 = ipv6, .bits = (uint8_t)bits };
+    }
+    else
+    {
+        throw std::runtime_error("Invalid IP address mask: " + mask);
+    }
+}
 
 std::vector<std::string> getifaddr_list(std::vector<std::string> mask_cfg, bool include_v6)
 {
     std::vector<addr_mask_t> masks;
     for (auto mask: mask_cfg)
     {
-        unsigned bits = 0;
-        int p = mask.find('/');
-        if (p != std::string::npos)
+        masks.push_back(cidr_parse(mask));
+        if (masks[masks.size()-1].family == AF_INET6)
         {
-            char null_byte = 0;
-            if (sscanf(mask.c_str()+p+1, "%u%c", &bits, &null_byte) != 1 || bits > 128)
-            {
-                throw std::runtime_error((include_v6 ? "Invalid IPv4 address mask: " : "Invalid IP address mask: ") + mask);
-            }
-            mask = mask.substr(0, p);
-        }
-        in_addr ipv4;
-        in6_addr ipv6;
-        if (inet_pton(AF_INET, mask.c_str(), &ipv4) == 1)
-        {
-            if (bits > 32)
-            {
-                throw std::runtime_error((include_v6 ? "Invalid IPv4 address mask: " : "Invalid IP address mask: ") + mask);
-            }
-            masks.push_back((addr_mask_t){ .family = AF_INET, .ipv4 = ipv4, .bits = (uint8_t)bits });
-        }
-        else if (include_v6 && inet_pton(AF_INET6, mask.c_str(), &ipv6) == 1)
-        {
-            masks.push_back((addr_mask_t){ .family = AF_INET6, .ipv6 = ipv6, .bits = (uint8_t)bits });
-        }
-        else
-        {
-            throw std::runtime_error((include_v6 ? "Invalid IPv4 address mask: " : "Invalid IP address mask: ") + mask);
+            // Auto-enable IPv6 addresses
+            include_v6 = true;
         }
     }
     std::set<std::string> addresses;
@@ -1,10 +1,22 @@
 #pragma once
 
+#include <netinet/in.h>
 #include <sys/socket.h>
 #include <string>
 #include <vector>
 
+struct addr_mask_t
+{
+    sa_family_t family;
+    in_addr ipv4;
+    in6_addr ipv6;
+    uint8_t bits;
+};
+
 bool string_to_addr(std::string str, bool parse_port, int default_port, struct sockaddr_storage *addr);
 std::string addr_to_string(const sockaddr_storage &addr);
+addr_mask_t cidr_parse(std::string mask);
+bool cidr_match(const in_addr &address, const in_addr &network, uint8_t bits);
+bool cidr6_match(const in6_addr &address, const in6_addr &network, uint8_t bits);
 std::vector<std::string> getifaddr_list(std::vector<std::string> mask_cfg = std::vector<std::string>(), bool include_v6 = false);
 int create_and_bind_socket(std::string bind_address, int bind_port, int listen_backlog, int *listening_port);
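With `addr_mask_t` and the `cidr_*` helpers now exported from this header, other code can parse a CIDR string once and test addresses against it. A usage sketch, assuming the header above is `addr_util.h` and the implementation from the previous hunk is linked in:

```cpp
// Usage sketch for the now-public CIDR helpers.
#include <arpa/inet.h>
#include <stdio.h>
#include "addr_util.h"

int main()
{
    // Throws std::runtime_error on malformed input
    addr_mask_t net = cidr_parse("10.115.0.0/24");
    in_addr addr = {};
    inet_pton(AF_INET, "10.115.0.42", &addr);
    if (net.family == AF_INET && cidr_match(addr, net.ipv4, net.bits))
        printf("10.115.0.42 is inside 10.115.0.0/24\n");
    return 0;
}
```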
@@ -62,7 +62,7 @@ int timerfd_manager_t::set_timer_us(uint64_t micros, bool repeat, std::function<
         .callback = callback,
     });
     inc_timer(timers[timers.size()-1]);
-    set_nearest();
+    set_nearest(false);
     return timer_id;
 }
 
@@ -82,13 +82,13 @@ void timerfd_manager_t::clear_timer(int timer_id)
             {
                 nearest--;
             }
-            set_nearest();
+            set_nearest(false);
             break;
         }
     }
 }
 
-void timerfd_manager_t::set_nearest()
+void timerfd_manager_t::set_nearest(bool trigger_inline)
 {
     if (onstack > 0)
     {
@@ -134,10 +134,13 @@ again:
     }
     if (exp.it_value.tv_sec < 0 || exp.it_value.tv_sec == 0 && exp.it_value.tv_nsec <= 0)
     {
-        // It already happened
-        // FIXME: Postpone to setImmediate/BH to avoid reenterability problems
-        trigger_nearest();
-        goto again;
+        // It already happened - set minimal timeout
+        if (trigger_inline)
+        {
+            trigger_nearest();
+            goto again;
+        }
+        exp.it_value = { .tv_sec = 0, .tv_nsec = 1 };
     }
     if (timerfd_settime(timerfd, 0, &exp, NULL))
     {
@@ -157,7 +160,7 @@ void timerfd_manager_t::handle_readable()
         trigger_nearest();
     }
     wait_state = 0;
-    set_nearest();
+    set_nearest(true);
 }
 
 void timerfd_manager_t::trigger_nearest()

@@ -26,7 +26,7 @@ class timerfd_manager_t
     std::vector<timerfd_timer_t> timers;
 
     void inc_timer(timerfd_timer_t & t);
-    void set_nearest();
+    void set_nearest(bool trigger_inline);
     void trigger_nearest();
     void handle_readable();
 public:
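The `trigger_inline` flag keeps already-expired timers from being fired recursively out of `set_timer_us()`/`clear_timer()`: those callers now arm the timerfd with a minimal timeout and let the event loop deliver the expiration, while `handle_readable()` may still trigger inline. A standalone sketch of the "minimal timeout" arming, using the plain timerfd API rather than the `timerfd_manager_t` class:

```cpp
// Standalone sketch: instead of calling the timer callback inline, arm the
// timerfd with 1 ns so the already-expired timer is delivered on the next
// event loop iteration.
#include <stdint.h>
#include <stdio.h>
#include <sys/timerfd.h>
#include <unistd.h>

int main()
{
    int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);
    itimerspec exp = {};
    exp.it_value.tv_sec = 0;
    exp.it_value.tv_nsec = 1; // the "it already happened" case
    timerfd_settime(tfd, 0, &exp, NULL);
    // A real event loop would now get an epoll readiness event for tfd;
    // here we just wait a moment and read the expiration count.
    usleep(1000);
    uint64_t expirations = 0;
    if (read(tfd, &expirations, sizeof(expirations)) == (ssize_t)sizeof(expirations))
        printf("timer fired %lu time(s) without re-entering the caller\n", (unsigned long)expirations);
    close(tfd);
    return 0;
}
```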
@@ -19,9 +19,12 @@ ANTIETCD=1 ./test_etcd_fail.sh
 
 ./test_interrupted_rebalance.sh
 IMMEDIATE_COMMIT=1 ./test_interrupted_rebalance.sh
 
 SCHEME=ec ./test_interrupted_rebalance.sh
 SCHEME=ec IMMEDIATE_COMMIT=1 ./test_interrupted_rebalance.sh
 
+./test_create_halfhost.sh
+
 ./test_failure_domain.sh
 
 ./test_snapshot.sh
@@ -0,0 +1,35 @@
+#!/bin/bash -ex
+
+. `dirname $0`/common.sh
+
+node mon/mon-main.js $MON_PARAMS --etcd_address $ETCD_URL --etcd_prefix "/vitastor" >>./testdata/mon.log 2>&1 &
+MON_PID=$!
+wait_etcd
+
+TIME=$(date '+%s')
+$ETCDCTL put /vitastor/config/global '{"placement_levels":{"dc":10,"host":100,"half":105,"osd":110}}'
+$ETCDCTL put /vitastor/config/node_placement '{
+"h11":{"level":"half","parent":"host1"},
+"h12":{"level":"half","parent":"host1"},
+"h21":{"level":"half","parent":"host2"},
+"h22":{"level":"half","parent":"host2"},
+"h31":{"level":"half","parent":"host3"},
+"h32":{"level":"half","parent":"host3"},
+"1":{"parent":"h11"},
+"2":{"parent":"h12"},
+"3":{"parent":"h21"},
+"4":{"parent":"h22"},
+"5":{"parent":"h31"},
+"6":{"parent":"h32"}
+}'
+$ETCDCTL put /vitastor/osd/stats/1 '{"host":"host1","size":1073741824,"time":"'$TIME'"}'
+$ETCDCTL put /vitastor/osd/stats/2 '{"host":"host1","size":1073741824,"time":"'$TIME'"}'
+$ETCDCTL put /vitastor/osd/stats/3 '{"host":"host2","size":1073741824,"time":"'$TIME'"}'
+$ETCDCTL put /vitastor/osd/stats/4 '{"host":"host2","size":1073741824,"time":"'$TIME'"}'
+$ETCDCTL put /vitastor/osd/stats/5 '{"host":"host3","size":1073741824,"time":"'$TIME'"}'
+$ETCDCTL put /vitastor/osd/stats/6 '{"host":"host3","size":1073741824,"time":"'$TIME'"}'
+build/src/cmd/vitastor-cli --etcd_address $ETCD_URL osd-tree
+# check that it doesn't fail
+build/src/cmd/vitastor-cli --etcd_address $ETCD_URL create-pool testpool --ec 2+1 -n 32
+
+format_green OK
@@ -53,7 +53,7 @@ kill_osds &
 
 LD_PRELOAD="build/src/client/libfio_vitastor.so" \
     fio -thread -name=test -ioengine=build/src/client/libfio_vitastor.so -bsrange=4k-128k -blockalign=4k -direct=1 -iodepth=32 -fsync=256 -rw=randrw \
-    -randrepeat=0 -refill_buffers=1 -mirror_file=./testdata/bin/mirror.bin -etcd=$ETCD_URL -image=testimg -loops=10 -runtime=120
+    -serialize_overlap=1 -randrepeat=0 -refill_buffers=1 -mirror_file=./testdata/bin/mirror.bin -etcd=$ETCD_URL -image=testimg -loops=10 -runtime=120
 
 qemu-img convert -S 4096 -p \
     -f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:image=testimg" \