Compare commits
32 Commits
SHA1: b7322a405a, 5692630005, 00ced7cea7, ebdb75e287, f397fe9c6a, 28560b4ae5, 2d07449e74, 80c4e8c20f,
2ab0ae3bc9, 05e59c1b4f, e6e1c5b962, 9556eeae45, 96b5a72630, ef80f121f6, bbdd1f3aa7, 5dd37f519a,
a2278be84d, 1393a2671c, 9fa8ae5384, 169a35a067, 2b2a10581d, 10fd51862a, 15d0204f96, 21d6e88a1b,
df2847df2d, 327c98a4b6, 3cc0abfd81, 80e5f8ba76, 4b660f1ce8, dfde0e60f0, 013f688ffe, cf9738ddbe
Changed paths: docker, etc/apt/sources.list.d, src, test/mock
@@ -3,7 +3,7 @@ VITASTOR_VERSION ?= v2.1.0
 all: build push
 
 build:
-	@docker build --rm -t vitalif/vitastor:$(VITASTOR_VERSION) .
+	@docker build --no-cache --rm -t vitalif/vitastor:$(VITASTOR_VERSION) .
 
 push:
 	@docker push vitalif/vitastor:$(VITASTOR_VERSION)
@@ -1 +1,2 @@
 deb http://vitastor.io/debian bookworm main
+deb http://http.debian.net/debian/ bookworm-backports main
@@ -34,6 +34,7 @@ between clients, OSDs and etcd.
 - [etcd_ws_keepalive_interval](#etcd_ws_keepalive_interval)
 - [etcd_min_reload_interval](#etcd_min_reload_interval)
 - [tcp_header_buffer_size](#tcp_header_buffer_size)
+- [min_zerocopy_send_size](#min_zerocopy_send_size)
 - [use_sync_send_recv](#use_sync_send_recv)
 
 ## osd_network
@@ -313,6 +314,34 @@ is received without an additional copy. You can try to play with this
 parameter and see how it affects random iops and linear bandwidth if you
 want.
 
+## min_zerocopy_send_size
+
+- Type: integer
+- Default: 32768
+
+OSDs and clients will attempt to use io_uring-based zero-copy TCP send
+for buffers larger than this number of bytes. Zero-copy send with io_uring is
+supported since Linux kernel version 6.1. Support is auto-detected and disabled
+automatically when not available. It can also be disabled explicitly by setting
+this parameter to a negative value.
+
+⚠️ Warning! Zero-copy send performance may vary greatly from CPU to CPU and from
+one kernel version to another. Generally, it tends to only make benefit with larger
+messages. With smaller messages (say, 4 KB), it may actually be slower. 32 KB is
+enough for almost all CPUs, but even smaller values are optimal for some of them.
+For example, 4 KB is OK for EPYC Milan/Genoa and 12 KB is OK for Xeon Ice Lake
+(but verify it yourself please).
+
+Verification instructions:
+1. Add `iommu=pt` into your Linux kernel command line and reboot.
+2. Upgrade your kernel. For example, it's very important to use 6.11+ with recent AMD EPYCs.
+3. Run some tests with the [send-zerocopy liburing example](https://github.com/axboe/liburing/blob/master/examples/send-zerocopy.c)
+   to find the minimal message size for which zero-copy is optimal.
+   Use `./send-zerocopy tcp -4 -R` at the server side and
+   `time ./send-zerocopy tcp -4 -b 0 -s BUFFER_SIZE -D SERVER_IP` at the client side with
+   `-z 0` (no zero-copy) and `-z 1` (zero-copy), and compare MB/s and used CPU time
+   (user+system).
+
 ## use_sync_send_recv
 
 - Type: boolean
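For readers tuning this option, here is a minimal JavaScript sketch (illustrative only, not part of the patch; the helper name is ours) of the rule the new parameter follows, matching the `parse_config()` change further down in this compare: an absent value means the 32 KB default, and a negative value disables zero-copy send entirely.

```js
// Sketch only: how min_zerocopy_send_size from vitastor.conf is interpreted.
const DEFAULT_MIN_ZEROCOPY_SEND_SIZE = 32*1024;

function effective_zerocopy_threshold(config)
{
    const v = config["min_zerocopy_send_size"];
    return (v === undefined || v === null) ? DEFAULT_MIN_ZEROCOPY_SEND_SIZE : Number(v);
}

console.log(effective_zerocopy_threshold({}));                                // 32768 (default)
console.log(effective_zerocopy_threshold({ min_zerocopy_send_size: 12288 })); // 12288, e.g. for Xeon Ice Lake
console.log(effective_zerocopy_threshold({ min_zerocopy_send_size: -1 }));    // negative => zero-copy send disabled
```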
@ -34,6 +34,7 @@
|
|||
- [etcd_ws_keepalive_interval](#etcd_ws_keepalive_interval)
|
||||
- [etcd_min_reload_interval](#etcd_min_reload_interval)
|
||||
- [tcp_header_buffer_size](#tcp_header_buffer_size)
|
||||
- [min_zerocopy_send_size](#min_zerocopy_send_size)
|
||||
- [use_sync_send_recv](#use_sync_send_recv)
|
||||
|
||||
## osd_network
|
||||
|
@ -321,6 +322,34 @@ Vitastor содержат 128-байтные заголовки, за котор
|
|||
поменять этот параметр и посмотреть, как он влияет на производительность
|
||||
случайного и линейного доступа.
|
||||
|
||||
## min_zerocopy_send_size
|
||||
|
||||
- Тип: целое число
|
||||
- Значение по умолчанию: 32768
|
||||
|
||||
OSD и клиенты будут пробовать использовать TCP-отправку без копирования (zero-copy) на
|
||||
основе io_uring для буферов, больших, чем это число байт. Отправка без копирования
|
||||
поддерживается в io_uring, начиная с версии ядра Linux 6.1. Наличие поддержки
|
||||
проверяется автоматически и zero-copy отключается, когда поддержки нет. Также
|
||||
её можно отключить явно, установив данный параметр в отрицательное значение.
|
||||
|
||||
⚠️ Внимание! Производительность данной функции может сильно отличаться на разных
|
||||
процессорах и на разных версиях ядра Linux. В целом, zero-copy обычно быстрее с
|
||||
большими сообщениями, а с мелкими (например, 4 КБ) zero-copy может быть даже
|
||||
медленнее. 32 КБ достаточно почти для всех процессоров, но для каких-то можно
|
||||
использовать даже меньшие значения. Например, для EPYC Milan/Genoa подходит 4 КБ,
|
||||
а для Xeon Ice Lake - 12 КБ (но, пожалуйста, перепроверьте это сами).
|
||||
|
||||
Инструкция по проверке:
|
||||
1. Добавьте `iommu=pt` в командную строку загрузки вашего ядра Linux и перезагрузитесь.
|
||||
2. Обновите ядро. Например, для AMD EPYC очень важно использовать версию 6.11+.
|
||||
3. Позапускайте тесты с помощью [send-zerocopy из примеров liburing](https://github.com/axboe/liburing/blob/master/examples/send-zerocopy.c),
|
||||
чтобы найти минимальный размер сообщения, для которого zero-copy отправка оптимальна.
|
||||
Запускайте `./send-zerocopy tcp -4 -R` на стороне сервера и
|
||||
`time ./send-zerocopy tcp -4 -b 0 -s РАЗМЕР_БУФЕРА -D АДРЕС_СЕРВЕРА` на стороне клиента
|
||||
с опцией `-z 0` (обычная отправка) и `-z 1` (отправка без копирования), и сравнивайте
|
||||
скорость в МБ/с и занятое процессорное время (user+system).
|
||||
|
||||
## use_sync_send_recv
|
||||
|
||||
- Тип: булево (да/нет)
|
||||
|
|
|
@ -373,6 +373,55 @@
|
|||
параметра читается без дополнительного копирования. Вы можете попробовать
|
||||
поменять этот параметр и посмотреть, как он влияет на производительность
|
||||
случайного и линейного доступа.
|
||||
- name: min_zerocopy_send_size
|
||||
type: int
|
||||
default: 32768
|
||||
info: |
|
||||
OSDs and clients will attempt to use io_uring-based zero-copy TCP send
|
||||
for buffers larger than this number of bytes. Zero-copy send with io_uring is
|
||||
supported since Linux kernel version 6.1. Support is auto-detected and disabled
|
||||
automatically when not available. It can also be disabled explicitly by setting
|
||||
this parameter to a negative value.
|
||||
|
||||
⚠️ Warning! Zero-copy send performance may vary greatly from CPU to CPU and from
|
||||
one kernel version to another. Generally, it tends to only make benefit with larger
|
||||
messages. With smaller messages (say, 4 KB), it may actually be slower. 32 KB is
|
||||
enough for almost all CPUs, but even smaller values are optimal for some of them.
|
||||
For example, 4 KB is OK for EPYC Milan/Genoa and 12 KB is OK for Xeon Ice Lake
|
||||
(but verify it yourself please).
|
||||
|
||||
Verification instructions:
|
||||
1. Add `iommu=pt` into your Linux kernel command line and reboot.
|
||||
2. Upgrade your kernel. For example, it's very important to use 6.11+ with recent AMD EPYCs.
|
||||
3. Run some tests with the [send-zerocopy liburing example](https://github.com/axboe/liburing/blob/master/examples/send-zerocopy.c)
|
||||
to find the minimal message size for which zero-copy is optimal.
|
||||
Use `./send-zerocopy tcp -4 -R` at the server side and
|
||||
`time ./send-zerocopy tcp -4 -b 0 -s BUFFER_SIZE -D SERVER_IP` at the client side with
|
||||
`-z 0` (no zero-copy) and `-z 1` (zero-copy), and compare MB/s and used CPU time
|
||||
(user+system).
|
||||
info_ru: |
|
||||
OSD и клиенты будут пробовать использовать TCP-отправку без копирования (zero-copy) на
|
||||
основе io_uring для буферов, больших, чем это число байт. Отправка без копирования
|
||||
поддерживается в io_uring, начиная с версии ядра Linux 6.1. Наличие поддержки
|
||||
проверяется автоматически и zero-copy отключается, когда поддержки нет. Также
|
||||
её можно отключить явно, установив данный параметр в отрицательное значение.
|
||||
|
||||
⚠️ Внимание! Производительность данной функции может сильно отличаться на разных
|
||||
процессорах и на разных версиях ядра Linux. В целом, zero-copy обычно быстрее с
|
||||
большими сообщениями, а с мелкими (например, 4 КБ) zero-copy может быть даже
|
||||
медленнее. 32 КБ достаточно почти для всех процессоров, но для каких-то можно
|
||||
использовать даже меньшие значения. Например, для EPYC Milan/Genoa подходит 4 КБ,
|
||||
а для Xeon Ice Lake - 12 КБ (но, пожалуйста, перепроверьте это сами).
|
||||
|
||||
Инструкция по проверке:
|
||||
1. Добавьте `iommu=pt` в командную строку загрузки вашего ядра Linux и перезагрузитесь.
|
||||
2. Обновите ядро. Например, для AMD EPYC очень важно использовать версию 6.11+.
|
||||
3. Позапускайте тесты с помощью [send-zerocopy из примеров liburing](https://github.com/axboe/liburing/blob/master/examples/send-zerocopy.c),
|
||||
чтобы найти минимальный размер сообщения, для которого zero-copy отправка оптимальна.
|
||||
Запускайте `./send-zerocopy tcp -4 -R` на стороне сервера и
|
||||
`time ./send-zerocopy tcp -4 -b 0 -s РАЗМЕР_БУФЕРА -D АДРЕС_СЕРВЕРА` на стороне клиента
|
||||
с опцией `-z 0` (обычная отправка) и `-z 1` (отправка без копирования), и сравнивайте
|
||||
скорость в МБ/с и занятое процессорное время (user+system).
|
||||
- name: use_sync_send_recv
|
||||
type: bool
|
||||
default: false
|
||||
|
|
|
@@ -26,9 +26,9 @@ at Vitastor Kubernetes operator: https://github.com/Antilles7227/vitastor-operat
 The instruction is very simple.
 
 1. Download a Docker image of the desired version: \
-   `docker pull vitastor:2.1.0`
+   `docker pull vitastor:v2.1.0`
 2. Install scripts to the host system: \
-   `docker run --rm -it -v /etc:/host-etc -v /usr/bin:/host-bin vitastor:2.1.0 install.sh`
+   `docker run --rm -it -v /etc:/host-etc -v /usr/bin:/host-bin vitastor:v2.1.0 install.sh`
 3. Reload udev rules: \
    `udevadm control --reload-rules`
 
@@ -25,9 +25,9 @@ Vitastor можно установить в Docker/Podman. При этом etcd,
 Инструкция по установке максимально простая.
 
 1. Скачайте Docker-образ желаемой версии: \
-   `docker pull vitastor:2.1.0`
+   `docker pull vitastor:v2.1.0`
 2. Установите скрипты в хост-систему командой: \
-   `docker run --rm -it -v /etc:/host-etc -v /usr/bin:/host-bin vitastor:2.1.0 install.sh`
+   `docker run --rm -it -v /etc:/host-etc -v /usr/bin:/host-bin vitastor:v2.1.0 install.sh`
 3. Перезагрузите правила udev: \
    `udevadm control --reload-rules`
 
@@ -14,6 +14,7 @@
 - [Removing a failed disk](#removing-a-failed-disk)
 - [Adding a disk](#adding-a-disk)
 - [Restoring from lost pool configuration](#restoring-from-lost-pool-configuration)
+- [Incompatibility problems](#Incompatibility-problems)
 - [Upgrading Vitastor](#upgrading-vitastor)
 - [OSD memory usage](#osd-memory-usage)
 
@@ -166,6 +167,17 @@ done
 
 After that all PGs should peer and find all previous data.
 
+## Incompatibility problems
+
+### ISA-L 2.31
+
+⚠ It is FORBIDDEN to use Vitastor 2.1.0 and earlier versions with ISA-L 2.31 and newer if
+you use EC N+K pools and K > 1 on a CPU with GF-NI instruction support, because it WILL
+lead to **data loss** during EC recovery.
+
+If you accidentally upgraded ISA-L to 2.31 but didn't upgrade Vitastor and restarted OSDs,
+then stop them as soon as possible and either update Vitastor or roll back ISA-L.
+
 ## Upgrading Vitastor
 
 Every upcoming Vitastor version is usually compatible with previous both forward
@@ -14,6 +14,7 @@
 - [Удаление неисправного диска](#удаление-неисправного-диска)
 - [Добавление диска](#добавление-диска)
 - [Восстановление потерянной конфигурации пулов](#восстановление-потерянной-конфигурации-пулов)
+- [Проблемы несовместимости](#проблемы-несовместимости)
 - [Обновление Vitastor](#обновление-vitastor)
 - [Потребление памяти OSD](#потребление-памяти-osd)
 
@@ -163,6 +164,17 @@ done
 
 После этого все PG должны пройти peering и найти все предыдущие данные.
 
+## Проблемы несовместимости
+
+### ISA-L 2.31
+
+⚠ ЗАПРЕЩЕНО использовать Vitastor 2.1.0 и более ранних версий с библиотекой ISA-L версии 2.31
+или более новой, если вы используете EC-пулы N+K и K > 1 на CPU с поддержкой инструкций GF-NI,
+так как это приведёт к **потере данных** при восстановлении из EC.
+
+Если вы случайно обновили ISA-L до 2.31, но не обновили Vitastor, и успели перезапустить OSD,
+то как можно скорее остановите их все и либо обновите Vitastor, либо откатите ISA-L.
+
 ## Обновление Vitastor
 
 Обычно каждая следующая версия Vitastor совместима с предыдущими и "вперёд", и "назад"
@@ -14,6 +14,9 @@ Commands:
 - [upgrade](#upgrade)
 - [defrag](#defrag)
 
+⚠️ Important: follow the instructions from [Linux NFS write size](#linux-nfs-write-size)
+for optimal Vitastor NFS performance if you use EC and HDD and mount your NFS from Linux.
+
 ## Pseudo-FS
 
 Simplified pseudo-FS proxy is used for file-based image access emulation. It's not
@@ -100,6 +103,62 @@ Other notable missing features which should be addressed in the future:
   in the DB. The FS is implemented is such way that this garbage doesn't affect its
   function, but having a tool to clean it up still seems a right thing to do.
 
+## Linux NFS write size
+
+Linux NFS client (nfs/nfsv3/nfsv4 kernel modules) has a hard-coded maximum I/O size,
+currently set to 1 MB - see `rsize` and `wsize` in [man 5 nfs](https://linux.die.net/man/5/nfs).
+
+This means that when you write to a file in an FS mounted over NFS, the maximum write
+request size is 1 MB, even in the O_DIRECT mode and even if the original write request
+is larger.
+
+However, for optimal linear write performance in Vitastor EC (erasure-coded) pools,
+the size of write requests should be a multiple of [block_size](../config/layout-cluster.en.md#block_size),
+multiplied by the data chunk count of the pool ([pg_size](../config/pool.en.md#pg_size)-[parity_chunks](../config/pool.en.md#parity_chunks)).
+When write requests are smaller or not a multiple of this number, Vitastor has to first
+read paired data blocks from disks, calculate new parity blocks and only then write them
+back. Obviously this is 2-3 times slower than a simple disk write.
+
+Vitastor HDD setups use 1 MB block_size by default. So, for optimal performance, if
+you use EC 2+1 and HDD, you need your NFS client to send 2 MB write requests, if you
+use EC 4+1 - 4 MB and so on.
+
+But Linux NFS client only writes in 1 MB chunks. 😢
+
+The good news is that you can fix it by rebuilding Linux NFS kernel modules 😉 🤩!
+You need to change NFS_MAX_FILE_IO_SIZE in nfs_xdr.h and then rebuild and reload modules.
+
+The instruction, using Debian as an example (should be ran under root):
+
+```
+# download current Linux kernel headers required to build modules
+apt-get install linux-headers-`uname -r`
+
+# replace NFS_MAX_FILE_IO_SIZE with a desired number (here it's 4194304 - 4 MB)
+sed -i 's/NFS_MAX_FILE_IO_SIZE\s*.*/NFS_MAX_FILE_IO_SIZE\t(4194304U)/' /lib/modules/`uname -r`/source/include/linux/nfs_xdr.h
+
+# download current Linux kernel source
+mkdir linux_src
+cd linux_src
+apt-get source linux-image-`uname -r`-unsigned
+
+# build NFS modules
+cd linux-*/fs/nfs
+make -C /lib/modules/`uname -r`/build M=$PWD -j8 modules
+make -C /lib/modules/`uname -r`/build M=$PWD modules_install
+
+# move default NFS modules away
+mv /lib/modules/`uname -r`/kernel/fs/nfs ~/nfs_orig_`uname -r`
+depmod -a
+
+# unload old modules and load the new ones
+rmmod nfsv3 nfs
+modprobe nfsv3
+```
+
+After these (not much complicated 🙂) manipulations NFS begins to be mounted
+with new wsize and rsize by default and it fixes Vitastor-NFS linear write performance.
+
 ## Horizontal scaling
 
 Linux NFS 3.0 client doesn't support built-in scaling or failover, i.e. you can't
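As a worked example of the rule described above (a sketch, not part of the patch; the function name is illustrative), the optimal NFS write size is just block_size multiplied by the number of data chunks in the pool:

```js
// Optimal write size for an EC pool mounted over NFS = block_size * data chunk count.
function optimal_nfs_write_size(block_size, pg_size, parity_chunks)
{
    return block_size * (pg_size - parity_chunks);
}

console.log(optimal_nfs_write_size(1024*1024, 3, 1)); // EC 2+1 on HDD => 2097152 (2 MB)
console.log(optimal_nfs_write_size(1024*1024, 5, 1)); // EC 4+1 on HDD => 4194304 (4 MB)
```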
@ -14,6 +14,9 @@
|
|||
- [upgrade](#upgrade)
|
||||
- [defrag](#defrag)
|
||||
|
||||
⚠️ Важно: для оптимальной производительности Vitastor NFS в Linux при использовании
|
||||
HDD и EC (erasure кодов) выполните инструкции из раздела [Размер записи Linux NFS](#размер-записи-linux-nfs).
|
||||
|
||||
## Псевдо-ФС
|
||||
|
||||
Упрощённая реализация псевдо-ФС используется для эмуляции файлового доступа к блочным
|
||||
|
@ -104,6 +107,66 @@ JSON-формате :-). Для инспекции содержимого БД
|
|||
записи. ФС устроена так, что на работу они не влияют, но для порядка и их стоит
|
||||
уметь подчищать.
|
||||
|
||||
## Размер записи Linux NFS
|
||||
|
||||
Клиент Linux NFS (модули ядра nfs/nfsv3/nfsv4) имеет фиксированный в коде максимальный
|
||||
размер запроса ввода-вывода, равный 1 МБ - см. `rsize` и `wsize` в [man 5 nfs](https://linux.die.net/man/5/nfs).
|
||||
|
||||
Это означает, что когда вы записываете в файл в примонтированной по NFS файловой системе,
|
||||
максимальный размер запроса записи составляет 1 МБ, даже в режиме O_DIRECT и даже если
|
||||
исходный запрос записи был больше.
|
||||
|
||||
Однако для оптимальной скорости линейной записи в Vitastor при использовании EC-пулов
|
||||
(пулов с кодами коррекции ошибок) запросы записи должны быть по размеру кратны
|
||||
[block_size](../config/layout-cluster.ru.md#block_size), умноженному на число частей
|
||||
данных пула ([pg_size](../config/pool.ru.md#pg_size)-[parity_chunks](../config/pool.ru.md#parity_chunks)).
|
||||
Если запросы записи меньше или не кратны, то Vitastor приходится сначала прочитать
|
||||
с дисков старые версии парных блоков данных, рассчитать новые блоки чётности и только
|
||||
после этого записать их на диски. Естественно, это в 2-3 раза медленнее простой записи
|
||||
на диск.
|
||||
|
||||
При этом block_size на жёстких дисках по умолчанию устанавливается равным 1 МБ.
|
||||
Таким образом, если вы используете EC 2+1 и HDD, для оптимальной скорости записи вам
|
||||
нужно, чтобы NFS-клиент писал по 2 МБ, если EC 4+1 и HDD - то по 4 МБ, и т.п.
|
||||
|
||||
А Linux NFS-клиент пишет только по 1 МБ. 😢
|
||||
|
||||
Но это можно исправить, пересобрав модули ядра Linux NFS 😉 🤩! Для этого нужно
|
||||
поменять значение переменной NFS_MAX_FILE_IO_SIZE в заголовочном файле nfs_xdr.h,
|
||||
после чего пересобрать модули NFS.
|
||||
|
||||
Инструкция по пересборке на примере Debian (выполнять под root):
|
||||
|
||||
```
|
||||
# скачиваем заголовки для сборки модулей для текущего ядра Linux
|
||||
apt-get install linux-headers-`uname -r`
|
||||
|
||||
# заменяем в заголовках NFS_MAX_FILE_IO_SIZE на желаемый (здесь 4194304 - 4 МБ)
|
||||
sed -i 's/NFS_MAX_FILE_IO_SIZE\s*.*/NFS_MAX_FILE_IO_SIZE\t(4194304U)/' /lib/modules/`uname -r`/source/include/linux/nfs_xdr.h
|
||||
|
||||
# скачиваем исходный код текущего ядра
|
||||
mkdir linux_src
|
||||
cd linux_src
|
||||
apt-get source linux-image-`uname -r`-unsigned
|
||||
|
||||
# собираем модули NFS
|
||||
cd linux-*/fs/nfs
|
||||
make -C /lib/modules/`uname -r`/build M=$PWD -j8 modules
|
||||
make -C /lib/modules/`uname -r`/build M=$PWD modules_install
|
||||
|
||||
# убираем в сторону штатные модули NFS
|
||||
mv /lib/modules/`uname -r`/kernel/fs/nfs ~/nfs_orig_`uname -r`
|
||||
depmod -a
|
||||
|
||||
# выгружаем старые модули и загружаем новые
|
||||
rmmod nfsv3 nfs
|
||||
modprobe nfsv3
|
||||
```
|
||||
|
||||
После такой (относительно нехитрой 🙂) манипуляции NFS начинает по умолчанию
|
||||
монтироваться с новыми wsize и rsize, и производительность линейной записи в Vitastor-NFS
|
||||
исправляется.
|
||||
|
||||
## Горизонтальное масштабирование
|
||||
|
||||
Клиент Linux NFS 3.0 не поддерживает встроенное масштабирование или отказоустойчивость.
|
||||
|
|
|
@@ -162,10 +162,12 @@ apt-get install linux-headers-`uname -r`
 apt-get build-dep linux-image-`uname -r`-unsigned
 apt-get source linux-image-`uname -r`-unsigned
 cd linux*/drivers/vdpa
-make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules modules_install
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m modules_install
 cat Module.symvers >> /lib/modules/`uname -r`/build/Module.symvers
 cd ../virtio
-make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules modules_install
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m modules_install
 depmod -a
 ```
 
@@ -165,10 +165,12 @@ apt-get install linux-headers-`uname -r`
 apt-get build-dep linux-image-`uname -r`-unsigned
 apt-get source linux-image-`uname -r`-unsigned
 cd linux*/drivers/vdpa
-make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules modules_install
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m modules_install
 cat Module.symvers >> /lib/modules/`uname -r`/build/Module.symvers
 cd ../virtio
-make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules modules_install
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m modules_install
 depmod -a
 ```
 
@@ -253,7 +253,7 @@ function random_custom_combinations(osd_tree, rules, count, ordered)
         for (let i = 1; i < rules.length; i++)
         {
             const filtered = filter_tree_by_rules(osd_tree, rules[i], selected);
-            const idx = select_murmur3(filtered.length, i => 'p:'+f.id+':'+filtered[i].id);
+            const idx = select_murmur3(filtered.length, i => 'p:'+f.id+':'+(filtered[i].name || filtered[i].id));
             selected.push(idx == null ? { levels: {}, id: null } : filtered[idx]);
         }
         const size = selected.filter(s => s.id !== null).length;
@@ -270,7 +270,7 @@ function random_custom_combinations(osd_tree, rules, count, ordered)
         for (const item_rules of rules)
         {
             const filtered = selected.length ? filter_tree_by_rules(osd_tree, item_rules, selected) : first;
-            const idx = select_murmur3(filtered.length, i => n+':'+filtered[i].id);
+            const idx = select_murmur3(filtered.length, i => n+':'+(filtered[i].name || filtered[i].id));
             selected.push(idx == null ? { levels: {}, id: null } : filtered[idx]);
         }
         const size = selected.filter(s => s.id !== null).length;
@@ -340,9 +340,9 @@ function filter_tree_by_rules(osd_tree, rules, selected)
 }
 
 // Convert from
-// node_list = { id: string|number, level: string, size?: number, parent?: string|number }[]
+// node_list = { id: string|number, name?: string, level: string, size?: number, parent?: string|number }[]
 // to
-// node_tree = { [node_id]: { id, level, size?, parent?, children?: child_node_id[], levels: { [level]: id, ... } } }
+// node_tree = { [node_id]: { id, name?, level, size?, parent?, children?: child_node[], levels: { [level]: id, ... } } }
 function index_tree(node_list)
 {
     const tree = { '': { children: [], levels: {} } };
@@ -357,7 +357,7 @@ function index_tree(node_list)
         tree[parent_id].children = tree[parent_id].children || [];
         tree[parent_id].children.push(tree[node.id]);
     }
-    const cur = tree[''].children;
+    const cur = [ ...tree[''].children ];
     for (let i = 0; i < cur.length; i++)
     {
         cur[i].levels[cur[i].level] = cur[i].id;

@@ -0,0 +1,244 @@ new file mon/lp_optimizer/fold.js

```js
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)

// Extract OSDs from the lowest affected tree level into a separate (flat) map
// to run PG optimisation on failure domains instead of individual OSDs
//
// node_list = same input as for index_tree()
// rules = [ level, operator, value ][][]
// returns { nodes: new_node_list, leaves: { new_folded_node_id: [ extracted_leaf_nodes... ] } }
function fold_failure_domains(node_list, rules)
{
    const interest = {};
    for (const level_rules of rules)
    {
        for (const rule of level_rules)
            interest[rule[0]] = true;
    }
    const max_numeric_id = node_list.reduce((a, c) => a < (0|c.id) ? (0|c.id) : a, 0);
    let next_id = max_numeric_id;
    const node_map = node_list.reduce((a, c) => { a[c.id||''] = c; return a; }, {});
    const old_ids_by_new = {};
    const extracted_nodes = {};
    let folded = true;
    while (folded)
    {
        const per_parent = {};
        for (const node_id in node_map)
        {
            const node = node_map[node_id];
            const p = node.parent || '';
            per_parent[p] = per_parent[p]||[];
            per_parent[p].push(node);
        }
        folded = false;
        for (const node_id in per_parent)
        {
            const fold_node = node_id !== '' && per_parent[node_id].length > 0 && per_parent[node_id].filter(child => per_parent[child.id||''] || interest[child.level]).length == 0;
            if (fold_node)
            {
                const old_node = node_map[node_id];
                const new_id = ++next_id;
                node_map[new_id] = {
                    ...old_node,
                    id: new_id,
                    name: node_id, // for use in murmur3 hashes
                    size: per_parent[node_id].reduce((a, c) => a + (Number(c.size)||0), 0),
                };
                delete node_map[node_id];
                old_ids_by_new[new_id] = node_id;
                extracted_nodes[new_id] = [];
                for (const child of per_parent[node_id])
                {
                    if (old_ids_by_new[child.id])
                    {
                        extracted_nodes[new_id].push(...extracted_nodes[child.id]);
                        delete extracted_nodes[child.id];
                    }
                    else
                        extracted_nodes[new_id].push(child);
                    delete node_map[child.id];
                }
                folded = true;
            }
        }
    }
    return { nodes: Object.values(node_map), leaves: extracted_nodes };
}

// Distribute PGs mapped to "folded" nodes to individual OSDs according to their weights
// folded_pgs = optimize_result.int_pgs before folding
// prev_pgs = optional previous PGs from optimize_change() input
// extracted_nodes = output from fold_failure_domains
function unfold_failure_domains(folded_pgs, prev_pgs, extracted_nodes)
{
    const maps = {};
    let found = false;
    for (const new_id in extracted_nodes)
    {
        const weights = {};
        for (const sub_node of extracted_nodes[new_id])
        {
            weights[sub_node.id] = sub_node.size;
        }
        maps[new_id] = { weights, prev: [], next: [], pos: 0 };
        found = true;
    }
    if (!found)
    {
        return folded_pgs;
    }
    for (let i = 0; i < folded_pgs.length; i++)
    {
        for (let j = 0; j < folded_pgs[i].length; j++)
        {
            if (maps[folded_pgs[i][j]])
            {
                maps[folded_pgs[i][j]].prev.push(prev_pgs && prev_pgs[i] && prev_pgs[i][j] || 0);
            }
        }
    }
    for (const new_id in maps)
    {
        maps[new_id].next = adjust_distribution(maps[new_id].weights, maps[new_id].prev);
    }
    const mapped_pgs = [];
    for (let i = 0; i < folded_pgs.length; i++)
    {
        mapped_pgs.push(folded_pgs[i].map(osd => (maps[osd] ? maps[osd].next[maps[osd].pos++] : osd)));
    }
    return mapped_pgs;
}

// Return the new array of items re-distributed as close as possible to weights in wanted_weights
// wanted_weights = { [key]: weight }
// cur_items = key[]
function adjust_distribution(wanted_weights, cur_items)
{
    const item_map = {};
    for (let i = 0; i < cur_items.length; i++)
    {
        const item = cur_items[i];
        item_map[item] = (item_map[item] || { target: 0, cur: [] });
        item_map[item].cur.push(i);
    }
    let total_weight = 0;
    for (const item in wanted_weights)
    {
        total_weight += Number(wanted_weights[item]) || 0;
    }
    for (const item in wanted_weights)
    {
        const weight = wanted_weights[item] / total_weight * cur_items.length;
        if (weight > 0)
        {
            item_map[item] = (item_map[item] || { target: 0, cur: [] });
            item_map[item].target = weight;
        }
    }
    const diff = (item) => (item_map[item].cur.length - item_map[item].target);
    const most_underweighted = Object.keys(item_map)
        .filter(item => item_map[item].target > 0)
        .sort((a, b) => diff(a) - diff(b));
    // Items with zero target weight MUST never be selected - remove them
    // and remap each of them to a most underweighted item
    for (const item in item_map)
    {
        if (!item_map[item].target)
        {
            const prev = item_map[item];
            delete item_map[item];
            for (const idx of prev.cur)
            {
                const move_to = most_underweighted[0];
                item_map[move_to].cur.push(idx);
                move_leftmost(most_underweighted, diff);
            }
        }
    }
    // Other over-weighted items are only moved if it improves the distribution
    while (most_underweighted.length > 1)
    {
        const first = most_underweighted[0];
        const last = most_underweighted[most_underweighted.length-1];
        const first_diff = diff(first);
        const last_diff = diff(last);
        if (Math.abs(first_diff+1)+Math.abs(last_diff-1) < Math.abs(first_diff)+Math.abs(last_diff))
        {
            item_map[first].cur.push(item_map[last].cur.pop());
            move_leftmost(most_underweighted, diff);
            move_rightmost(most_underweighted, diff);
        }
        else
        {
            break;
        }
    }
    const new_items = new Array(cur_items.length);
    for (const item in item_map)
    {
        for (const idx of item_map[item].cur)
        {
            new_items[idx] = item;
        }
    }
    return new_items;
}

function move_leftmost(sorted_array, diff)
{
    // Re-sort by moving the leftmost item to the right if it changes position
    const first = sorted_array[0];
    const new_diff = diff(first);
    let r = 0;
    while (r < sorted_array.length-1 && diff(sorted_array[r+1]) <= new_diff)
        r++;
    if (r > 0)
    {
        for (let i = 0; i < r; i++)
            sorted_array[i] = sorted_array[i+1];
        sorted_array[r] = first;
    }
}

function move_rightmost(sorted_array, diff)
{
    // Re-sort by moving the rightmost item to the left if it changes position
    const last = sorted_array[sorted_array.length-1];
    const new_diff = diff(last);
    let r = sorted_array.length-1;
    while (r > 0 && diff(sorted_array[r-1]) > new_diff)
        r--;
    if (r < sorted_array.length-1)
    {
        for (let i = sorted_array.length-1; i > r; i--)
            sorted_array[i] = sorted_array[i-1];
        sorted_array[r] = last;
    }
}

// map previous PGs to folded nodes
function fold_prev_pgs(pgs, extracted_nodes)
{
    const unmap = {};
    for (const new_id in extracted_nodes)
    {
        for (const sub_node of extracted_nodes[new_id])
        {
            unmap[sub_node.id] = new_id;
        }
    }
    const mapped_pgs = [];
    for (let i = 0; i < pgs.length; i++)
    {
        mapped_pgs.push(pgs[i].map(osd => (unmap[osd] || osd)));
    }
    return mapped_pgs;
}

module.exports = {
    fold_failure_domains,
    unfold_failure_domains,
    adjust_distribution,
    fold_prev_pgs,
};
```
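A standalone usage sketch of the new module (illustrative only, not part of the patch; the toy tree and PG lists are made up) showing the fold → optimise → unfold flow that mon/pg_gen.js adopts further below:

```js
const { fold_failure_domains, unfold_failure_domains } = require('./fold.js');

const nodes = [
    { id: 1, level: 'osd', size: 1, parent: 'host1' },
    { id: 2, level: 'osd', size: 1, parent: 'host1' },
    { id: 3, level: 'osd', size: 1, parent: 'host2' },
    { id: 4, level: 'osd', size: 1, parent: 'host2' },
    { id: 'host1', level: 'host' },
    { id: 'host2', level: 'host' },
];
// Fold OSDs into their hosts (failure domain = host): the hosts become numeric
// pseudo-OSDs (ids 5 and 6 here) whose size is the sum of their children.
const folded = fold_failure_domains(nodes, [ [ [ 'host' ] ] ]);
// PG optimisation would run on folded.nodes; afterwards PGs over folded ids
// are mapped back to individual OSDs, spread according to OSD weights:
const folded_pgs = [ [ 5, 6 ], [ 6, 5 ] ];
console.log(unfold_failure_domains(folded_pgs, null, folded.leaves));
```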
@@ -98,6 +98,7 @@ async function optimize_initial({ osd_weights, combinator, pg_count, pg_size = 3
         score: lp_result.score,
         weights: lp_result.vars,
         int_pgs,
+        pg_effsize,
         space: eff * pg_effsize,
         total_space: total_weight,
     };
@@ -409,6 +410,7 @@ async function optimize_change({ prev_pgs: prev_int_pgs, osd_weights, combinator
         int_pgs: new_pgs,
         differs,
         osd_differs,
+        pg_effsize,
         space: pg_effsize * pg_list_space_efficiency(new_pgs, osd_weights, pg_minsize, parity_space),
         total_space: total_weight,
     };
@ -0,0 +1,108 @@
|
|||
// Copyright (c) Vitaliy Filippov, 2019+
|
||||
// License: VNPL-1.1 (see README.md for details)
|
||||
|
||||
const assert = require('assert');
|
||||
const { fold_failure_domains, unfold_failure_domains, adjust_distribution } = require('./fold.js');
|
||||
const DSL = require('./dsl_pgs.js');
|
||||
const LPOptimizer = require('./lp_optimizer.js');
|
||||
const stableStringify = require('../stable-stringify.js');
|
||||
|
||||
async function run()
|
||||
{
|
||||
// Test run adjust_distribution
|
||||
console.log('adjust_distribution');
|
||||
const rand = [];
|
||||
for (let i = 0; i < 100; i++)
|
||||
{
|
||||
rand.push(1 + Math.floor(10*Math.random()));
|
||||
// or rand.push(0);
|
||||
}
|
||||
const adj = adjust_distribution({ 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1 }, rand);
|
||||
//console.log(rand.join(' '));
|
||||
console.log(rand.reduce((a, c) => { a[c] = (a[c]||0)+1; return a; }, {}));
|
||||
//console.log(adj.join(' '));
|
||||
console.log(adj.reduce((a, c) => { a[c] = (a[c]||0)+1; return a; }, {}));
|
||||
console.log('Movement: '+rand.reduce((a, c, i) => a+(rand[i] != adj[i] ? 1 : 0), 0)+'/'+rand.length);
|
||||
|
||||
console.log('\nfold_failure_domains');
|
||||
console.log(JSON.stringify(fold_failure_domains(
|
||||
[
|
||||
{ id: 1, level: 'osd', size: 1, parent: 'disk1' },
|
||||
{ id: 2, level: 'osd', size: 2, parent: 'disk1' },
|
||||
{ id: 'disk1', level: 'disk', parent: 'host1' },
|
||||
{ id: 'host1', level: 'host', parent: 'dc1' },
|
||||
{ id: 'dc1', level: 'dc' },
|
||||
],
|
||||
[ [ [ 'dc' ], [ 'host' ] ] ]
|
||||
), 0, 2));
|
||||
|
||||
console.log('\nfold_failure_domains empty rules');
|
||||
console.log(JSON.stringify(fold_failure_domains(
|
||||
[
|
||||
{ id: 1, level: 'osd', size: 1, parent: 'disk1' },
|
||||
{ id: 2, level: 'osd', size: 2, parent: 'disk1' },
|
||||
{ id: 'disk1', level: 'disk', parent: 'host1' },
|
||||
{ id: 'host1', level: 'host', parent: 'dc1' },
|
||||
{ id: 'dc1', level: 'dc' },
|
||||
],
|
||||
[]
|
||||
), 0, 2));
|
||||
|
||||
console.log('\noptimize_folded');
|
||||
// 5 DCs, 2 hosts per DC, 10 OSD per host
|
||||
const nodes = [];
|
||||
for (let i = 1; i <= 100; i++)
|
||||
{
|
||||
nodes.push({ id: i, level: 'osd', size: 1, parent: 'host'+(1+(0|((i-1)/10))) });
|
||||
}
|
||||
for (let i = 1; i <= 10; i++)
|
||||
{
|
||||
nodes.push({ id: 'host'+i, level: 'host', parent: 'dc'+(1+(0|((i-1)/2))) });
|
||||
}
|
||||
for (let i = 1; i <= 5; i++)
|
||||
{
|
||||
nodes.push({ id: 'dc'+i, level: 'dc' });
|
||||
}
|
||||
|
||||
// Check rules
|
||||
const rules = DSL.parse_level_indexes({ dc: '112233', host: '123456' }, [ 'dc', 'host', 'osd' ]);
|
||||
assert.deepEqual(rules, [[],[["dc","=",1],["host","!=",[1]]],[["dc","!=",[1]]],[["dc","=",3],["host","!=",[3]]],[["dc","!=",[1,3]]],[["dc","=",5],["host","!=",[5]]]]);
|
||||
|
||||
// Check tree folding
|
||||
const { nodes: folded_nodes, leaves: folded_leaves } = fold_failure_domains(nodes, rules);
|
||||
const expected_folded = [];
|
||||
const expected_leaves = {};
|
||||
for (let i = 1; i <= 10; i++)
|
||||
{
|
||||
expected_folded.push({ id: 100+i, name: 'host'+i, level: 'host', size: 10, parent: 'dc'+(1+(0|((i-1)/2))) });
|
||||
expected_leaves[100+i] = [ ...new Array(10).keys() ].map(k => ({ id: 10*(i-1)+k+1, level: 'osd', size: 1, parent: 'host'+i }));
|
||||
}
|
||||
for (let i = 1; i <= 5; i++)
|
||||
{
|
||||
expected_folded.push({ id: 'dc'+i, level: 'dc' });
|
||||
}
|
||||
assert.equal(stableStringify(folded_nodes), stableStringify(expected_folded));
|
||||
assert.equal(stableStringify(folded_leaves), stableStringify(expected_leaves));
|
||||
|
||||
// Now optimise it
|
||||
console.log('1000 PGs, EC 112233');
|
||||
const leaf_weights = folded_nodes.reduce((a, c) => { if (Number(c.id)) { a[c.id] = c.size; } return a; }, {});
|
||||
let res = await LPOptimizer.optimize_initial({
|
||||
osd_weights: leaf_weights,
|
||||
combinator: new DSL.RuleCombinator(folded_nodes, rules, 10000, false),
|
||||
pg_size: 6,
|
||||
pg_count: 1000,
|
||||
ordered: false,
|
||||
});
|
||||
LPOptimizer.print_change_stats(res, false);
|
||||
assert.equal(res.space, 100, 'Initial distribution');
|
||||
|
||||
const unfolded_res = { ...res };
|
||||
unfolded_res.int_pgs = unfold_failure_domains(res.int_pgs, null, folded_leaves);
|
||||
const osd_weights = nodes.reduce((a, c) => { if (Number(c.id)) { a[c.id] = c.size; } return a; }, {});
|
||||
unfolded_res.space = unfolded_res.pg_effsize * LPOptimizer.pg_list_space_efficiency(unfolded_res.int_pgs, osd_weights, 0, 1);
|
||||
LPOptimizer.print_change_stats(unfolded_res, false);
|
||||
assert.equal(res.space, 100, 'Initial distribution');
|
||||
}
|
||||
|
||||
run().catch(console.error);
|
|
@@ -15,7 +15,7 @@ function get_osd_tree(global_config, state)
         const stat = state.osd.stats[osd_num];
         const osd_cfg = state.config.osd[osd_num];
         let reweight = osd_cfg == null ? 1 : Number(osd_cfg.reweight);
-        if (reweight < 0 || isNaN(reweight))
+        if (isNaN(reweight) || reweight < 0 || reweight > 1)
             reweight = 1;
         if (stat && stat.size && reweight && (state.osd.state[osd_num] || Number(stat.time) >= down_time ||
             osd_cfg && osd_cfg.noout))
@@ -87,7 +87,7 @@ function make_hier_tree(global_config, tree)
     tree[''] = { children: [] };
     for (const node_id in tree)
     {
-        if (node_id === '' || tree[node_id].level === 'osd' && (!tree[node_id].size || tree[node_id].size <= 0))
+        if (node_id === '' || !(tree[node_id].children||[]).length && (tree[node_id].size||0) <= 0)
         {
             continue;
         }
@@ -107,10 +107,10 @@ function make_hier_tree(global_config, tree)
         deleted = 0;
         for (const node_id in tree)
         {
-            if (tree[node_id].level !== 'osd' && (!tree[node_id].children || !tree[node_id].children.length))
+            if (!(tree[node_id].children||[]).length && (tree[node_id].size||0) <= 0)
             {
                 const parent = tree[node_id].parent;
-                if (parent)
+                if (parent && tree[parent])
                 {
                     tree[parent].children = tree[parent].children.filter(c => c != tree[node_id]);
                 }
@@ -3,6 +3,7 @@
 
 const { RuleCombinator } = require('./lp_optimizer/dsl_pgs.js');
 const { SimpleCombinator, flatten_tree } = require('./lp_optimizer/simple_pgs.js');
+const { fold_failure_domains, unfold_failure_domains, fold_prev_pgs } = require('./lp_optimizer/fold.js');
 const { validate_pool_cfg, get_pg_rules } = require('./pool_config.js');
 const LPOptimizer = require('./lp_optimizer/lp_optimizer.js');
 const { scale_pg_count } = require('./pg_utils.js');
@@ -160,7 +161,6 @@ async function generate_pool_pgs(state, global_config, pool_id, osd_tree, levels
         pool_cfg.bitmap_granularity || global_config.bitmap_granularity || 4096,
         pool_cfg.immediate_commit || global_config.immediate_commit || 'all'
     );
-    pool_tree = make_hier_tree(global_config, pool_tree);
     // First try last_clean_pgs to minimize data movement
     let prev_pgs = [];
     for (const pg in ((state.history.last_clean_pgs.items||{})[pool_id]||{}))
@@ -175,14 +175,19 @@ async function generate_pool_pgs(state, global_config, pool_id, osd_tree, levels
             prev_pgs[pg-1] = [ ...state.pg.config.items[pool_id][pg].osd_set ];
         }
     }
+    const use_rules = !global_config.use_old_pg_combinator || pool_cfg.level_placement || pool_cfg.raw_placement;
+    const rules = use_rules ? get_pg_rules(pool_id, pool_cfg, global_config.placement_levels) : null;
+    const folded = fold_failure_domains(Object.values(pool_tree), use_rules ? rules : [ [ [ pool_cfg.failure_domain ] ] ]);
+    // FIXME: Remove/merge make_hier_tree() step somewhere, however it's needed to remove empty nodes
+    const folded_tree = make_hier_tree(global_config, folded.nodes);
     const old_pg_count = prev_pgs.length;
     const optimize_cfg = {
-        osd_weights: Object.values(pool_tree).filter(item => item.level === 'osd').reduce((a, c) => { a[c.id] = c.size; return a; }, {}),
-        combinator: !global_config.use_old_pg_combinator || pool_cfg.level_placement || pool_cfg.raw_placement
+        osd_weights: folded.nodes.reduce((a, c) => { if (Number(c.id)) { a[c.id] = c.size; } return a; }, {}),
+        combinator: use_rules
             // new algorithm:
-            ? new RuleCombinator(pool_tree, get_pg_rules(pool_id, pool_cfg, global_config.placement_levels), pool_cfg.max_osd_combinations)
+            ? new RuleCombinator(folded_tree, rules, pool_cfg.max_osd_combinations)
             // old algorithm:
-            : new SimpleCombinator(flatten_tree(pool_tree[''].children, levels, pool_cfg.failure_domain, 'osd'), pool_cfg.pg_size, pool_cfg.max_osd_combinations),
+            : new SimpleCombinator(flatten_tree(folded_tree[''].children, levels, pool_cfg.failure_domain, 'osd'), pool_cfg.pg_size, pool_cfg.max_osd_combinations),
         pg_count: pool_cfg.pg_count,
         pg_size: pool_cfg.pg_size,
         pg_minsize: pool_cfg.pg_minsize,
@@ -202,12 +207,11 @@ async function generate_pool_pgs(state, global_config, pool_id, osd_tree, levels
         for (const pg of prev_pgs)
         {
             while (pg.length < pool_cfg.pg_size)
             {
                 pg.push(0);
             }
         }
+        const folded_prev_pgs = fold_prev_pgs(prev_pgs, folded.leaves);
         optimize_result = await LPOptimizer.optimize_change({
-            prev_pgs,
+            prev_pgs: folded_prev_pgs,
             ...optimize_cfg,
         });
     }
@@ -215,6 +219,10 @@ async function generate_pool_pgs(state, global_config, pool_id, osd_tree, levels
     {
         optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
     }
+    optimize_result.int_pgs = unfold_failure_domains(optimize_result.int_pgs, prev_pgs, folded.leaves);
+    const osd_weights = Object.values(pool_tree).reduce((a, c) => { if (c.level === 'osd') { a[c.id] = c.size; } return a; }, {});
+    optimize_result.space = optimize_result.pg_effsize * LPOptimizer.pg_list_space_efficiency(optimize_result.int_pgs,
+        osd_weights, optimize_cfg.pg_minsize, 1);
     console.log(`Pool ${pool_id} (${pool_cfg.name || 'unnamed'}):`);
     LPOptimizer.print_change_stats(optimize_result);
     let pg_effsize = pool_cfg.pg_size;
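A small illustration (not part of the patch) of why `osd_weights` is now built with the `Number(c.id)` filter: after folding, every selectable leaf (a real OSD or a folded failure domain) carries a numeric id, while inner tree nodes keep their string ids and must not receive a weight.

```js
const folded_nodes = [
    { id: 101, name: 'host1', level: 'host', size: 10, parent: 'dc1' }, // folded host => numeric id
    { id: 'dc1', level: 'dc' },                                         // inner node => string id
];
const osd_weights = folded_nodes.reduce((a, c) => { if (Number(c.id)) { a[c.id] = c.size; } return a; }, {});
console.log(osd_weights); // { '101': 10 }
```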
@@ -40,6 +40,11 @@ async function run()
         console.log("/etc/systemd/system/vitastor-etcd.service already exists");
         process.exit(1);
     }
+    if (!in_docker && fs.existsSync("/etc/systemd/system/etcd.service"))
+    {
+        console.log("/etc/systemd/system/etcd.service already exists");
+        process.exit(1);
+    }
     const config = JSON.parse(fs.readFileSync(config_path, { encoding: 'utf-8' }));
     if (!config.etcd_address)
     {
@@ -66,7 +71,7 @@ async function run()
         console.log('etcd for Vitastor configured. Run `systemctl enable --now vitastor-etcd` to start etcd');
         process.exit(0);
     }
-    await system(`mkdir -p /var/lib/etcd`);
+    await system(`mkdir -p /var/lib/etcd/vitastor`);
     fs.writeFileSync(
         "/etc/systemd/system/vitastor-etcd.service",
         `[Unit]
@@ -77,14 +82,14 @@ Wants=network-online.target local-fs.target time-sync.target
 [Service]
 Restart=always
 Environment=GOGC=50
-ExecStart=etcd --name ${etcd_name} --data-dir /var/lib/etcd \\
+ExecStart=etcd --name ${etcd_name} --data-dir /var/lib/etcd/vitastor \\
 --snapshot-count 10000 --advertise-client-urls http://${etcds[num]}:2379 --listen-client-urls http://${etcds[num]}:2379 \\
 --initial-advertise-peer-urls http://${etcds[num]}:2380 --listen-peer-urls http://${etcds[num]}:2380 \\
 --initial-cluster-token vitastor-etcd-1 --initial-cluster ${etcd_cluster} \\
 --initial-cluster-state new --max-txn-ops=100000 --max-request-bytes=104857600 \\
 --auto-compaction-retention=10 --auto-compaction-mode=revision
-WorkingDirectory=/var/lib/etcd
-ExecStartPre=+chown -R etcd /var/lib/etcd
+WorkingDirectory=/var/lib/etcd/vitastor
+ExecStartPre=+chown -R etcd /var/lib/etcd/vitastor
 User=etcd
 PrivateTmp=false
 TasksMax=infinity
@@ -97,8 +102,9 @@ WantedBy=multi-user.target
 `);
     await system(`useradd etcd`);
     await system(`systemctl daemon-reload`);
-    await system(`systemctl enable etcd`);
-    await system(`systemctl start etcd`);
+    // Disable distribution etcd unit and enable our one
+    await system(`systemctl disable --now etcd`);
+    await system(`systemctl enable --now vitastor-etcd`);
     process.exit(0);
 }
 
mon/stats.js
@@ -87,11 +87,25 @@ function sum_op_stats(all_osd, prev_stats)
             for (const k in derived[type][op])
             {
                 sum_diff[type][op] = sum_diff[type][op] || {};
-                sum_diff[type][op][k] = (sum_diff[type][op][k] || 0n) + derived[type][op][k];
+                if (k == 'lat')
+                    sum_diff[type][op].lat = (sum_diff[type][op].lat || 0n) + derived[type][op].lat*derived[type][op].iops;
+                else
+                    sum_diff[type][op][k] = (sum_diff[type][op][k] || 0n) + derived[type][op][k];
             }
         }
     }
+    // Calculate average (weighted by iops) op latency across all OSDs
+    for (const type in sum_diff)
+    {
+        for (const op in sum_diff[type])
+        {
+            if (sum_diff[type][op].lat)
+            {
+                sum_diff[type][op].lat /= sum_diff[type][op].iops;
+            }
+        }
+    }
     return sum_diff;
 }
 
@@ -271,8 +285,7 @@ function sum_inode_stats(state, prev_stats)
                 const op_st = inode_stats[pool_id][inode_num][op];
                 op_st.bps += op_diff.bps;
                 op_st.iops += op_diff.iops;
-                op_st.lat += op_diff.lat;
-                op_st.n_osd = (op_st.n_osd || 0) + 1;
+                op_st.lat += op_diff.lat*op_diff.iops;
             }
         }
     }
@@ -285,11 +298,8 @@ function sum_inode_stats(state, prev_stats)
             for (const op of [ 'read', 'write', 'delete' ])
             {
                 const op_st = inode_stats[pool_id][inode_num][op];
-                if (op_st.n_osd)
-                {
-                    op_st.lat /= BigInt(op_st.n_osd);
-                    delete op_st.n_osd;
-                }
+                if (op_st.lat)
+                    op_st.lat /= op_st.iops;
                 if (op_st.bps > 0 || op_st.iops > 0)
                     nonzero = true;
             }
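The idea behind the stats change above, as a small self-contained sketch (plain numbers instead of the BigInt values the monitor uses): per-OSD latencies are now averaged weighted by iops, so a nearly idle OSD with a huge latency no longer skews the cluster-wide figure.

```js
// Weighted average latency: sum(lat*iops) / sum(iops).
function weighted_lat(per_osd)
{
    let lat = 0, iops = 0;
    for (const st of per_osd)
    {
        lat += st.lat*st.iops;
        iops += st.iops;
    }
    return iops ? lat/iops : 0;
}

// One busy OSD at 100 us and one idle OSD at 10000 us:
console.log(weighted_lat([ { lat: 100, iops: 9000 }, { lat: 10000, iops: 10 } ])); // ~111, not ~5050
```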
@@ -266,6 +266,8 @@ class blockstore_impl_t
     int throttle_threshold_us = 50;
     // Maximum writes between automatically added fsync operations
     uint64_t autosync_writes = 128;
+    // Log level (0-10)
+    int log_level = 0;
     /******* END OF OPTIONS *******/
 
     struct ring_consumer_t ring_consumer;
@@ -113,10 +113,13 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
     if (!right_dir && next_pos >= bs->journal.used_start-bs->journal.block_size)
     {
         // No space in the journal. Wait until used_start changes.
-        printf(
-            "Ran out of journal space (used_start=%08jx, next_free=%08jx, dirty_start=%08jx)\n",
-            bs->journal.used_start, bs->journal.next_free, bs->journal.dirty_start
-        );
+        if (bs->log_level > 5)
+        {
+            printf(
+                "Ran out of journal space (used_start=%08jx, next_free=%08jx, dirty_start=%08jx)\n",
+                bs->journal.used_start, bs->journal.next_free, bs->journal.dirty_start
+            );
+        }
         PRIV(op)->wait_for = WAIT_JOURNAL;
         bs->flusher->request_trim();
         PRIV(op)->wait_detail = bs->journal.used_start;
@@ -101,6 +101,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config, bool init)
         config["journal_no_same_sector_overwrites"] == "1" || config["journal_no_same_sector_overwrites"] == "yes";
     journal.inmemory = config["inmemory_journal"] != "false" && config["inmemory_journal"] != "0" &&
         config["inmemory_journal"] != "no";
+    log_level = strtoull(config["log_level"].c_str(), NULL, 10);
     // Validate
     if (journal.sector_count < 2)
     {
@@ -93,7 +93,7 @@ add_executable(test_cluster_client
 	EXCLUDE_FROM_ALL
 	../test/test_cluster_client.cpp
 	pg_states.cpp osd_ops.cpp cluster_client.cpp cluster_client_list.cpp cluster_client_wb.cpp msgr_op.cpp ../test/mock/messenger.cpp msgr_stop.cpp
-	etcd_state_client.cpp ../util/timerfd_manager.cpp ../util/str_util.cpp ../util/json_util.cpp ../../json11/json11.cpp
+	etcd_state_client.cpp ../util/timerfd_manager.cpp ../util/addr_util.cpp ../util/str_util.cpp ../util/json_util.cpp ../../json11/json11.cpp
 )
 target_compile_definitions(test_cluster_client PUBLIC -D__MOCK__)
 target_include_directories(test_cluster_client BEFORE PUBLIC ${CMAKE_SOURCE_DIR}/src/test/mock)
@@ -1244,7 +1244,6 @@ int cluster_client_t::try_send(cluster_op_t *op, int i)
         .req = { .rw = {
             .header = {
                 .magic = SECONDARY_OSD_OP_MAGIC,
-                .id = next_op_id(),
                 .opcode = op->opcode == OSD_OP_READ_BITMAP || op->opcode == OSD_OP_READ_CHAIN_BITMAP ? OSD_OP_READ : op->opcode,
             },
             .inode = op->cur_inode,
@@ -1353,7 +1352,6 @@ void cluster_client_t::send_sync(cluster_op_t *op, cluster_op_part_t *part)
         .req = {
             .hdr = {
                 .magic = SECONDARY_OSD_OP_MAGIC,
-                .id = next_op_id(),
                 .opcode = OSD_OP_SYNC,
             },
         },
@@ -1498,8 +1496,3 @@ void cluster_client_t::copy_part_bitmap(cluster_op_t *op, cluster_op_part_t *par
         part_len--;
     }
 }
-
-uint64_t cluster_client_t::next_op_id()
-{
-    return msgr.next_subop_id++;
-}
@@ -86,8 +86,8 @@ class cluster_client_t
 #ifdef __MOCK__
 public:
 #endif
-    timerfd_manager_t *tfd;
-    ring_loop_t *ringloop;
+    timerfd_manager_t *tfd = NULL;
+    ring_loop_t *ringloop = NULL;
 
     std::map<pool_id_t, uint64_t> pg_counts;
     std::map<pool_pg_num_t, osd_num_t> pg_primary;
@@ -152,7 +152,6 @@ public:
 
     //inline uint32_t get_bs_bitmap_granularity() { return st_cli.global_bitmap_granularity; }
     //inline uint64_t get_bs_block_size() { return st_cli.global_block_size; }
-    uint64_t next_op_id();
 
 #ifndef __MOCK__
 protected:
@@ -342,7 +342,6 @@ void cluster_client_t::send_list(inode_list_osd_t *cur_list)
         .sec_list = {
             .header = {
                 .magic = SECONDARY_OSD_OP_MAGIC,
-                .id = next_op_id(),
                 .opcode = OSD_OP_SEC_LIST,
             },
             .list_pg = cur_list->pg->pg_num,
@ -167,6 +167,10 @@ void osd_messenger_t::init()
|
|||
}
|
||||
}
|
||||
#endif
|
||||
if (ringloop)
|
||||
{
|
||||
has_sendmsg_zc = ringloop->has_sendmsg_zc();
|
||||
}
|
||||
if (ringloop && iothread_count > 0)
|
||||
{
|
||||
for (int i = 0; i < iothread_count; i++)
|
||||
|
@ -213,7 +217,6 @@ void osd_messenger_t::init()
|
|||
op->req = (osd_any_op_t){
|
||||
.hdr = {
|
||||
.magic = SECONDARY_OSD_OP_MAGIC,
|
||||
.id = this->next_subop_id++,
|
||||
.opcode = OSD_OP_PING,
|
||||
},
|
||||
};
|
||||
|
@ -329,6 +332,9 @@ void osd_messenger_t::parse_config(const json11::Json & config)
|
|||
this->receive_buffer_size = 65536;
|
||||
this->use_sync_send_recv = config["use_sync_send_recv"].bool_value() ||
|
||||
config["use_sync_send_recv"].uint64_value();
|
||||
this->min_zerocopy_send_size = config["min_zerocopy_send_size"].is_null()
|
||||
? DEFAULT_MIN_ZEROCOPY_SEND_SIZE
|
||||
: (int)config["min_zerocopy_send_size"].int64_value();
|
||||
this->peer_connect_interval = config["peer_connect_interval"].uint64_value();
|
||||
if (!this->peer_connect_interval)
|
||||
this->peer_connect_interval = 5;
|
||||
|
@ -622,13 +628,19 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
|
|||
.show_conf = {
|
||||
.header = {
|
||||
.magic = SECONDARY_OSD_OP_MAGIC,
|
||||
.id = this->next_subop_id++,
|
||||
.opcode = OSD_OP_SHOW_CONFIG,
|
||||
},
|
||||
},
|
||||
};
|
||||
json11::Json::object payload;
|
||||
if (osd_num)
|
||||
{
|
||||
// Inform that we're OSD <osd_num>
|
||||
payload["osd_num"] = osd_num;
|
||||
}
|
||||
payload["features"] = json11::Json::object{ { "check_sequencing", true } };
|
||||
#ifdef WITH_RDMA
|
||||
if (rdma_contexts.size())
|
||||
if (!use_rdmacm && rdma_contexts.size())
|
||||
{
|
||||
// Choose the right context for the selected network
|
||||
msgr_rdma_context_t *selected_ctx = choose_rdma_context(cl);
|
||||
|
@ -642,19 +654,20 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
|
|||
cl->rdma_conn = msgr_rdma_connection_t::create(selected_ctx, rdma_max_send, rdma_max_recv, rdma_max_sge, rdma_max_msg);
|
||||
if (cl->rdma_conn)
|
||||
{
|
||||
json11::Json payload = json11::Json::object {
|
||||
{ "connect_rdma", cl->rdma_conn->addr.to_string() },
|
||||
{ "rdma_max_msg", cl->rdma_conn->max_msg },
|
||||
};
|
||||
std::string payload_str = payload.dump();
|
||||
op->req.show_conf.json_len = payload_str.size();
|
||||
op->buf = malloc_or_die(payload_str.size());
|
||||
op->iov.push_back(op->buf, payload_str.size());
|
||||
memcpy(op->buf, payload_str.c_str(), payload_str.size());
|
||||
payload["connect_rdma"] = cl->rdma_conn->addr.to_string();
|
||||
payload["rdma_max_msg"] = cl->rdma_conn->max_msg;
|
||||
}
|
||||
}
|
||||
}
|
||||
#endif
|
||||
if (payload.size())
|
||||
{
|
||||
std::string payload_str = json11::Json(payload).dump();
|
||||
op->req.show_conf.json_len = payload_str.size();
|
||||
op->buf = malloc_or_die(payload_str.size());
|
||||
op->iov.push_back(op->buf, payload_str.size());
|
||||
memcpy(op->buf, payload_str.c_str(), payload_str.size());
|
||||
}
|
||||
op->callback = [this, cl](osd_op_t *op)
|
||||
{
|
||||
std::string json_err;
|
||||
|
@ -701,7 +714,7 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
|
|||
return;
|
||||
}
|
||||
#ifdef WITH_RDMA
|
||||
if (cl->rdma_conn && config["rdma_address"].is_string())
|
||||
if (!use_rdmacm && cl->rdma_conn && config["rdma_address"].is_string())
|
||||
{
|
||||
msgr_rdma_address_t addr;
|
||||
if (!msgr_rdma_address_t::from_string(config["rdma_address"].string_value().c_str(), &addr) ||
|
||||
|
@ -800,7 +813,8 @@ bool osd_messenger_t::is_rdma_enabled()
|
|||
{
|
||||
return rdma_contexts.size() > 0;
|
||||
}
|
||||
|
||||
#endif
|
||||
#ifdef WITH_RDMACM
|
||||
bool osd_messenger_t::is_use_rdmacm()
|
||||
{
|
||||
return use_rdmacm;
|
||||
|
@ -896,6 +910,7 @@ static const char* local_only_params[] = {
|
|||
"tcp_header_buffer_size",
|
||||
"use_rdma",
|
||||
"use_sync_send_recv",
|
||||
"min_zerocopy_send_size",
|
||||
};
|
||||
|
||||
static const char **local_only_end = local_only_params + (sizeof(local_only_params)/sizeof(local_only_params[0]));
|
||||
|
|
|
@@ -32,6 +32,8 @@
#define VITASTOR_CONFIG_PATH "/etc/vitastor/vitastor.conf"

#define DEFAULT_MIN_ZEROCOPY_SEND_SIZE 32*1024

#define MSGR_SENDP_HDR 1
#define MSGR_SENDP_FREE 2

@@ -73,12 +75,15 @@ struct osd_client_t
int read_remaining = 0;
int read_state = 0;
osd_op_buf_list_t recv_list;
uint64_t read_op_id = 1;
bool check_sequencing = false;

// Incoming operations
std::vector<osd_op_t*> received_ops;

// Outbound operations
std::map<uint64_t, osd_op_t*> sent_ops;
uint64_t send_op_id = 0;

// PGs dirtied by this client's primary-writes
std::set<pool_pg_num_t> dirty_pgs;

@@ -88,6 +93,7 @@ struct osd_client_t
int write_state = 0;
std::vector<iovec> send_list, next_send_list;
std::vector<msgr_sendp_t> outbox, next_outbox;
std::vector<osd_op_t*> zc_free_list;

~osd_client_t();
};

@@ -97,6 +103,7 @@ struct osd_wanted_peer_t
json11::Json raw_address_list;
json11::Json address_list;
int port = 0;
// FIXME: Remove separate WITH_RDMACM?
#ifdef WITH_RDMACM
int rdmacm_port = 0;
#endif

@@ -175,6 +182,7 @@ protected:
int osd_ping_timeout = 0;
int log_level = 0;
bool use_sync_send_recv = false;
int min_zerocopy_send_size = DEFAULT_MIN_ZEROCOPY_SEND_SIZE;
int iothread_count = 0;

#ifdef WITH_RDMA

@@ -201,11 +209,11 @@ protected:
std::vector<osd_op_t*> set_immediate_ops;

public:
timerfd_manager_t *tfd;
ring_loop_t *ringloop;
timerfd_manager_t *tfd = NULL;
ring_loop_t *ringloop = NULL;
bool has_sendmsg_zc = false;
// osd_num_t is only for logging and asserts
osd_num_t osd_num;
uint64_t next_subop_id = 1;
std::map<int, osd_client_t*> clients;
std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
std::map<uint64_t, int> osd_peer_fds;

@@ -261,7 +269,7 @@ protected:
void cancel_op(osd_op_t *op);

bool try_send(osd_client_t *cl);
void handle_send(int result, osd_client_t *cl);
void handle_send(int result, bool prev, bool more, osd_client_t *cl);

bool handle_read(int result, osd_client_t *cl);
bool handle_read_buffer(osd_client_t *cl, void *curbuf, int remain);

@@ -286,6 +294,7 @@ protected:
msgr_rdma_context_t* rdmacm_create_qp(rdma_cm_id *cmid);
void rdmacm_accept(rdma_cm_event *ev);
void rdmacm_try_connect_peer(uint64_t peer_osd, const std::string & addr, int rdmacm_port, int fallback_tcp_port);
void rdmacm_set_conn_timeout(rdmacm_connecting_t *conn);
void rdmacm_on_connect_peer_error(rdma_cm_id *cmid, int res);
void rdmacm_address_resolved(rdma_cm_event *ev);
void rdmacm_route_resolved(rdma_cm_event *ev);
@@ -70,6 +70,7 @@ msgr_rdma_context_t::~msgr_rdma_context_t()
msgr_rdma_connection_t::~msgr_rdma_connection_t()
{
ctx->reserve_cqe(-max_send-max_recv);
#ifdef WITH_RDMACM
if (qp && !cmid)
ibv_destroy_qp(qp);
if (cmid)

@@ -79,6 +80,10 @@ msgr_rdma_connection_t::~msgr_rdma_connection_t()
rdma_destroy_qp(cmid);
rdma_destroy_id(cmid);
}
#else
if (qp)
ibv_destroy_qp(qp);
#endif
if (recv_buffers.size())
{
for (auto b: recv_buffers)

@@ -798,6 +803,9 @@ void osd_messenger_t::handle_rdma_events(msgr_rdma_context_t *rdma_context)
}
if (!is_send)
{
// Reset OSD ping state - client is obviously alive
cl->ping_time_remaining = 0;
cl->idle_time_remaining = osd_idle_timeout;
rc->cur_recv--;
if (!handle_read_buffer(cl, rc->recv_buffers[rc->next_recv_buf].buf, wc[i].byte_len))
{
@@ -70,7 +70,7 @@ void osd_messenger_t::rdmacm_destroy_listener(rdma_cm_id *listener)
void osd_messenger_t::handle_rdmacm_events()
{
// rdma_destroy_id infinitely waits for pthread_cond if called before all events are acked :-(
// rdma_destroy_id infinitely waits for pthread_cond if called before all events are acked :-(...
std::vector<rdma_cm_event> events_copy;
while (1)
{

@@ -83,7 +83,15 @@ void osd_messenger_t::handle_rdmacm_events()
fprintf(stderr, "Failed to get RDMA-CM event: %s (code %d)\n", strerror(errno), errno);
exit(1);
}
events_copy.push_back(*ev);
// ...so we save a copy of all events EXCEPT connection requests, otherwise they sometimes fail with EVENT_DISCONNECT
if (ev->event == RDMA_CM_EVENT_CONNECT_REQUEST)
{
rdmacm_accept(ev);
}
else
{
events_copy.push_back(*ev);
}
r = rdma_ack_cm_event(ev);
if (r != 0)
{

@@ -96,7 +104,7 @@ void osd_messenger_t::handle_rdmacm_events()
auto ev = &evl;
if (ev->event == RDMA_CM_EVENT_CONNECT_REQUEST)
{
rdmacm_accept(ev);
// Do nothing, handled above
}
else if (ev->event == RDMA_CM_EVENT_CONNECT_ERROR ||
ev->event == RDMA_CM_EVENT_REJECTED ||

@@ -287,29 +295,34 @@ void osd_messenger_t::rdmacm_accept(rdma_cm_event *ev)
rdma_destroy_id(ev->id);
return;
}
rdma_context->cm_refs++;
// Wrap into a new msgr_rdma_connection_t
msgr_rdma_connection_t *conn = new msgr_rdma_connection_t;
conn->ctx = rdma_context;
conn->max_send = rdma_max_send;
conn->max_recv = rdma_max_recv;
conn->max_sge = rdma_max_sge > rdma_context->attrx.orig_attr.max_sge
? rdma_context->attrx.orig_attr.max_sge : rdma_max_sge;
conn->max_msg = rdma_max_msg;
// Wait for RDMA_CM_ESTABLISHED, and enable the connection only after it
auto conn = new rdmacm_connecting_t;
conn->cmid = ev->id;
conn->qp = ev->id->qp;
auto cl = new osd_client_t();
cl->peer_fd = fake_fd;
cl->peer_state = PEER_RDMA;
cl->peer_addr = *(sockaddr_storage*)rdma_get_peer_addr(ev->id);
cl->in_buf = malloc_or_die(receive_buffer_size);
cl->rdma_conn = conn;
clients[fake_fd] = cl;
rdmacm_connections[ev->id] = cl;
// Add initial receive request(s)
try_recv_rdma(cl);
fprintf(stderr, "[OSD %ju] new client %d: connection from %s via RDMA-CM\n", this->osd_num, fake_fd,
addr_to_string(cl->peer_addr).c_str());
conn->peer_fd = fake_fd;
conn->parsed_addr = *(sockaddr_storage*)rdma_get_peer_addr(ev->id);
conn->rdma_context = rdma_context;
rdmacm_set_conn_timeout(conn);
rdmacm_connecting[ev->id] = conn;
fprintf(stderr, "[OSD %ju] new client %d: connection from %s via RDMA-CM\n", this->osd_num, conn->peer_fd,
addr_to_string(conn->parsed_addr).c_str());
}

void osd_messenger_t::rdmacm_set_conn_timeout(rdmacm_connecting_t *conn)
{
conn->timeout_ms = peer_connect_timeout*1000;
if (peer_connect_timeout > 0)
{
conn->timeout_id = tfd->set_timer(1000*peer_connect_timeout, false, [this, cmid = conn->cmid](int timer_id)
{
auto conn = rdmacm_connecting.at(cmid);
conn->timeout_id = -1;
if (conn->peer_osd)
fprintf(stderr, "RDMA-CM connection to %s timed out\n", conn->addr.c_str());
else
fprintf(stderr, "Incoming RDMA-CM connection from %s timed out\n", addr_to_string(conn->parsed_addr).c_str());
rdmacm_on_connect_peer_error(cmid, -EPIPE);
});
}
}

void osd_messenger_t::rdmacm_on_connect_peer_error(rdma_cm_id *cmid, int res)

@@ -332,15 +345,18 @@ void osd_messenger_t::rdmacm_on_connect_peer_error(rdma_cm_id *cmid, int res)
}
rdmacm_connecting.erase(cmid);
delete conn;
if (!disable_tcp)
if (peer_osd)
{
// Fall back to TCP instead of just reporting the error to on_connect_peer()
try_connect_peer_tcp(peer_osd, addr.c_str(), tcp_port);
}
else
{
// TCP is disabled
on_connect_peer(peer_osd, res == 0 ? -EINVAL : (res > 0 ? -res : res));
if (!disable_tcp)
{
// Fall back to TCP instead of just reporting the error to on_connect_peer()
try_connect_peer_tcp(peer_osd, addr.c_str(), tcp_port);
}
else
{
// TCP is disabled
on_connect_peer(peer_osd, res == 0 ? -EINVAL : (res > 0 ? -res : res));
}
}
}

@@ -374,6 +390,8 @@ void osd_messenger_t::rdmacm_try_connect_peer(uint64_t peer_osd, const std::stri
on_connect_peer(peer_osd, res);
return;
}
if (log_level > 0)
fprintf(stderr, "Trying to connect to OSD %ju at %s:%d via RDMA-CM\n", peer_osd, addr.c_str(), rdmacm_port);
auto conn = new rdmacm_connecting_t;
rdmacm_connecting[cmid] = conn;
conn->cmid = cmid;

@@ -383,19 +401,7 @@ void osd_messenger_t::rdmacm_try_connect_peer(uint64_t peer_osd, const std::stri
conn->parsed_addr = sa;
conn->rdmacm_port = rdmacm_port;
conn->tcp_port = fallback_tcp_port;
conn->timeout_ms = peer_connect_timeout*1000;
conn->timeout_id = -1;
if (peer_connect_timeout > 0)
{
conn->timeout_id = tfd->set_timer(1000*peer_connect_timeout, false, [this, cmid](int timer_id)
{
auto conn = rdmacm_connecting.at(cmid);
conn->timeout_id = -1;
fprintf(stderr, "RDMA-CM connection to %s timed out\n", conn->addr.c_str());
rdmacm_on_connect_peer_error(cmid, -EPIPE);
return;
});
}
rdmacm_set_conn_timeout(conn);
if (rdma_resolve_addr(cmid, NULL, (sockaddr*)&conn->parsed_addr, conn->timeout_ms) != 0)
{
auto res = -errno;

@@ -494,7 +500,7 @@ void osd_messenger_t::rdmacm_established(rdma_cm_event *ev)
// Wrap into a new msgr_rdma_connection_t
msgr_rdma_connection_t *rc = new msgr_rdma_connection_t;
rc->ctx = conn->rdma_context;
rc->ctx->cm_refs++;
rc->ctx->cm_refs++; // FIXME now unused, count also connecting_t's when used
rc->max_send = rdma_max_send;
rc->max_recv = rdma_max_recv;
rc->max_sge = rdma_max_sge > rc->ctx->attrx.orig_attr.max_sge

@@ -514,14 +520,20 @@ void osd_messenger_t::rdmacm_established(rdma_cm_event *ev)
cl->rdma_conn = rc;
clients[conn->peer_fd] = cl;
if (conn->timeout_id >= 0)
{
tfd->clear_timer(conn->timeout_id);
}
delete conn;
rdmacm_connecting.erase(cmid);
rdmacm_connections[cmid] = cl;
if (log_level > 0)
if (log_level > 0 && peer_osd)
{
fprintf(stderr, "Successfully connected with OSD %ju using RDMA-CM\n", peer_osd);
}
// Add initial receive request(s)
try_recv_rdma(cl);
osd_peer_fds[peer_osd] = cl->peer_fd;
on_connect_peer(peer_osd, cl->peer_fd);
if (peer_osd)
{
check_peer_config(cl);
}
}
@@ -214,6 +214,7 @@ bool osd_messenger_t::handle_read_buffer(osd_client_t *cl, void *curbuf, int rem
bool osd_messenger_t::handle_finished_read(osd_client_t *cl)
{
// Reset OSD ping state
cl->ping_time_remaining = 0;
cl->idle_time_remaining = osd_idle_timeout;
cl->recv_list.reset();

@@ -222,7 +223,19 @@ bool osd_messenger_t::handle_finished_read(osd_client_t *cl)
if (cl->read_op->req.hdr.magic == SECONDARY_OSD_REPLY_MAGIC)
return handle_reply_hdr(cl);
else if (cl->read_op->req.hdr.magic == SECONDARY_OSD_OP_MAGIC)
{
if (cl->check_sequencing)
{
if (cl->read_op->req.hdr.id != cl->read_op_id)
{
fprintf(stderr, "Warning: operation sequencing is broken on client %d, stopping client\n", cl->peer_fd);
stop_client(cl->peer_fd);
return false;
}
cl->read_op_id++;
}
handle_op_hdr(cl);
}
else
{
fprintf(stderr, "Received garbage: magic=%jx id=%ju opcode=%jx from %d\n", cl->read_op->req.hdr.magic, cl->read_op->req.hdr.id, cl->read_op->req.hdr.opcode, cl->peer_fd);
@@ -14,6 +14,7 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
if (cur_op->op_type == OSD_OP_OUT)
{
clock_gettime(CLOCK_REALTIME, &cur_op->tv_begin);
cur_op->req.hdr.id = ++cl->send_op_id;
}
else
{

@@ -203,8 +204,24 @@ bool osd_messenger_t::try_send(osd_client_t *cl)
cl->write_msg.msg_iovlen = cl->send_list.size() < IOV_MAX ? cl->send_list.size() : IOV_MAX;
cl->refs++;
ring_data_t* data = ((ring_data_t*)sqe->user_data);
data->callback = [this, cl](ring_data_t *data) { handle_send(data->res, cl); };
my_uring_prep_sendmsg(sqe, peer_fd, &cl->write_msg, 0);
data->callback = [this, cl](ring_data_t *data) { handle_send(data->res, data->prev, data->more, cl); };
bool use_zc = has_sendmsg_zc && min_zerocopy_send_size >= 0;
if (use_zc && min_zerocopy_send_size > 0)
{
size_t avg_size = 0;
for (size_t i = 0; i < cl->write_msg.msg_iovlen; i++)
avg_size += cl->write_msg.msg_iov[i].iov_len;
if (avg_size/cl->write_msg.msg_iovlen < min_zerocopy_send_size)
use_zc = false;
}
if (use_zc)
{
my_uring_prep_sendmsg_zc(sqe, peer_fd, &cl->write_msg, MSG_WAITALL);
}
else
{
my_uring_prep_sendmsg(sqe, peer_fd, &cl->write_msg, MSG_WAITALL);
}
if (iothread)
{
iothread->add_sqe(sqe_local);

@@ -220,7 +237,7 @@ bool osd_messenger_t::try_send(osd_client_t *cl)
{
result = -errno;
}
handle_send(result, cl);
handle_send(result, false, false, cl);
}
return true;
}

@@ -240,10 +257,16 @@ void osd_messenger_t::send_replies()
write_ready_clients.clear();
}

void osd_messenger_t::handle_send(int result, osd_client_t *cl)
void osd_messenger_t::handle_send(int result, bool prev, bool more, osd_client_t *cl)
{
cl->write_msg.msg_iovlen = 0;
cl->refs--;
if (!prev)
{
cl->write_msg.msg_iovlen = 0;
}
if (!more)
{
cl->refs--;
}
if (cl->peer_state == PEER_STOPPED)
{
if (cl->refs <= 0)

@@ -261,6 +284,16 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
}
if (result >= 0)
{
if (prev)
{
// Second notification - only free a batch of postponed ops
int i = 0;
for (; i < cl->zc_free_list.size() && cl->zc_free_list[i]; i++)
delete cl->zc_free_list[i];
if (i > 0)
cl->zc_free_list.erase(cl->zc_free_list.begin(), cl->zc_free_list.begin()+i+1);
return;
}
int done = 0;
while (result > 0 && done < cl->send_list.size())
{

@@ -270,7 +303,10 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
if (cl->outbox[done].flags & MSGR_SENDP_FREE)
{
// Reply fully sent
delete cl->outbox[done].op;
if (more)
cl->zc_free_list.push_back(cl->outbox[done].op);
else
delete cl->outbox[done].op;
}
result -= iov.iov_len;
done++;

@@ -282,6 +318,12 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
break;
}
}
if (more)
{
auto expected = cl->send_list.size() < IOV_MAX ? cl->send_list.size() : IOV_MAX;
assert(done == expected);
cl->zc_free_list.push_back(NULL); // end marker
}
if (done > 0)
{
cl->send_list.erase(cl->send_list.begin(), cl->send_list.begin()+done);
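For context, a minimal sketch (not part of the patch) of the two-completion model these changes handle: on kernel 6.1+ with liburing 2.3+, one IORING_OP_SENDMSG_ZC request produces a first CQE with IORING_CQE_F_MORE set (the send result) and a later notification CQE once the kernel no longer references the buffers — the same distinction the patch tracks with the `more`/`prev` flags and `zc_free_list`. The function name below is illustrative only.

    // Hedged example: observing both completions of a zero-copy sendmsg.
    #include <liburing.h>
    #include <sys/socket.h>
    #include <cstdio>

    static int zc_send_once(struct io_uring *ring, int fd, struct msghdr *msg)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_sendmsg_zc(sqe, fd, msg, MSG_WAITALL);
        io_uring_submit(ring);
        bool buffers_busy = true;
        while (buffers_busy)
        {
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(ring, &cqe) < 0)
                return -1;
            if (cqe->flags & IORING_CQE_F_MORE)
                // First completion: res = bytes sent, kernel may still read the buffers
                printf("sent %d bytes, waiting for notification\n", cqe->res);
            else
                // Second completion: buffers may now be reused or freed
                buffers_busy = false;
            io_uring_cqe_seen(ring, cqe);
        }
        return 0;
    }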
@@ -147,7 +147,6 @@ struct cli_describe_t
.describe = (osd_op_describe_t){
.header = (osd_op_header_t){
.magic = SECONDARY_OSD_OP_MAGIC,
.id = parent->cli->next_op_id(),
.opcode = OSD_OP_DESCRIBE,
},
.object_state = object_state,
@@ -159,7 +159,6 @@ struct cli_fix_t
.describe = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = parent->cli->next_op_id(),
.opcode = OSD_OP_DESCRIBE,
},
.min_inode = obj.inode,

@@ -194,7 +193,6 @@ struct cli_fix_t
.sec_del = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = parent->cli->next_op_id(),
.opcode = OSD_OP_SEC_DELETE,
},
.oid = {

@@ -242,7 +240,6 @@ struct cli_fix_t
.rw = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = parent->cli->next_op_id(),
.opcode = OSD_OP_SCRUB,
},
.inode = obj.inode,
@@ -58,6 +58,12 @@ struct osd_changer_t
state = 100;
return;
}
if (set_reweight && new_reweight > 1)
{
result = (cli_result_t){ .err = EINVAL, .text = "Reweight can't be larger than 1" };
state = 100;
return;
}
parent->etcd_txn(json11::Json::object {
{ "success", json11::Json::array {
json11::Json::object {
@@ -44,10 +44,10 @@ std::string validate_pool_config(json11::Json::object & new_cfg, json11::Json old_cfg)
new_cfg["parity_chunks"] = parity_chunks;
}

if (old_cfg.is_null() && new_cfg["scheme"].string_value() == "")
if (new_cfg["scheme"].string_value() == "")
{
// Default scheme
new_cfg["scheme"] = "replicated";
new_cfg["scheme"] = old_cfg.is_null() ? "replicated" : old_cfg["scheme"];
}
if (new_cfg.find("pg_minsize") == new_cfg.end() && (old_cfg.is_null() || new_cfg.find("pg_size") != new_cfg.end()))
{
@@ -5,6 +5,7 @@
#include "cli.h"
#include "cluster_client.h"
#include "str_util.h"
#include "json_util.h"
#include "epoll_manager.h"

#include <algorithm>

@@ -22,8 +23,8 @@ struct rm_osd_t
int state = 0;
cli_result_t result;

std::set<uint64_t> to_remove;
std::set<uint64_t> to_restart;
std::set<osd_num_t> to_remove;
std::vector<osd_num_t> still_up;
json11::Json::array pool_effects;
json11::Json::array history_updates, history_checks;
json11::Json new_pgs, new_clean_pgs;

@@ -63,8 +64,17 @@ struct rm_osd_t
}
to_remove.insert(osd_id);
}
// Check if OSDs are still used in data distribution
is_warning = is_dataloss = false;
// Check if OSDs are still up
for (auto osd_id: to_remove)
{
if (parent->cli->st_cli.peer_states.find(osd_id) != parent->cli->st_cli.peer_states.end())
{
is_warning = true;
still_up.push_back(osd_id);
}
}
// Check if OSDs are still used in data distribution
for (auto & pp: parent->cli->st_cli.pool_config)
{
// Will OSD deletion make pool incomplete / down / degraded?

@@ -158,6 +168,9 @@ struct rm_osd_t
: strtoupper(e["effect"].string_value())+" PGs"))
)+" after deleting OSD(s).\n";
}
if (still_up.size())
error += (still_up.size() == 1 ? "OSD " : "OSDs ") + implode(", ", still_up) +
(still_up.size() == 1 ? "is" : "are") + " still up. Use `vitastor-disk purge` to delete them.\n";
if (is_dataloss && !force_dataloss && !dry_run)
error += "OSDs not deleted. Please move data to other OSDs or bypass this check with --allow-data-loss if you know what you are doing.\n";
else if (is_warning && !force_warning && !dry_run)
@@ -17,6 +17,8 @@
#include "str_util.h"
#include "vitastor_kv.h"

#define KV_LIST_BUF_SIZE 65536

const char *exe_name = NULL;

class kv_cli_t

@@ -290,10 +292,26 @@ void kv_cli_t::next_cmd()
struct kv_cli_list_t
{
vitastorkv_dbw_t *db = NULL;
std::string buf;
void *handle = NULL;
int format = 0;
int n = 0;
std::function<void(int)> cb;

void write(const std::string & str)
{
if (buf.capacity() < KV_LIST_BUF_SIZE)
buf.reserve(KV_LIST_BUF_SIZE);
if (buf.size() + str.size() > buf.capacity())
flush();
buf.append(str.data(), str.size());
}

void flush()
{
::write(1, buf.data(), buf.size());
buf.resize(0);
}
};

std::vector<std::string> kv_cli_t::parse_cmd(const std::string & str)

@@ -604,11 +622,10 @@ void kv_cli_t::handle_cmd(const std::vector<std::string> & cmd, std::function<void(int)> cb)
if (res < 0)
{
if (res != -ENOENT)
{
fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
}
if (lst->format == 2)
printf("\n}\n");
lst->write("\n}\n");
lst->flush();
lst->db->list_close(lst->handle);
lst->cb(res == -ENOENT ? 0 : res);
delete lst;

@@ -616,11 +633,27 @@ void kv_cli_t::handle_cmd(const std::vector<std::string> & cmd, std::function<void(int)> cb)
else
{
if (lst->format == 2)
printf(lst->n ? ",\n %s: %s" : "{\n %s: %s", addslashes(key).c_str(), addslashes(value).c_str());
{
lst->write(lst->n ? ",\n " : "{\n ");
lst->write(addslashes(key));
lst->write(": ");
lst->write(addslashes(value));
}
else if (lst->format == 1)
printf("set %s %s\n", auto_addslashes(key).c_str(), value.c_str());
{
lst->write("set ");
lst->write(auto_addslashes(key));
lst->write(" ");
lst->write(value);
lst->write("\n");
}
else
printf("%s = %s\n", key.c_str(), value.c_str());
{
lst->write(key);
lst->write(" = ");
lst->write(value);
lst->write("\n");
}
lst->n++;
lst->db->list_next(lst->handle, NULL);
}
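The listing path above replaces per-entry printf() with a small append buffer that is flushed to stdout in KV_LIST_BUF_SIZE chunks. A standalone sketch of the same pattern, with hypothetical names, assuming plain POSIX write():

    #include <unistd.h>
    #include <string>

    struct stdout_buffer_t
    {
        static const size_t reserve_size = 65536; // mirrors KV_LIST_BUF_SIZE
        std::string buf;

        void write(const std::string & str)
        {
            if (buf.capacity() < reserve_size)
                buf.reserve(reserve_size);
            if (buf.size() + str.size() > buf.capacity())
                flush(); // drain before the buffer would have to grow
            buf.append(str);
        }

        void flush()
        {
            if (buf.size())
                ::write(1, buf.data(), buf.size());
            buf.resize(0);
        }
    };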
@@ -870,7 +870,7 @@ static void get_block(kv_db_t *db, uint64_t offset, int cur_level, int recheck_policy, ...)
}
// Block already in cache, we can proceed
blk->usage = db->usage_counter;
cb(0, BLK_UPDATING);
db->cli->msgr.ringloop->set_immediate([=] { cb(0, BLK_UPDATING); });
return;
}
cluster_op_t *op = new cluster_op_t;
@@ -22,8 +22,8 @@ int nfs3_fsstat_proc(void *opaque, rpc_op_t *rop)
{
auto ttb = pst_it->second["total_raw_tb"].number_value();
auto ftb = (pst_it->second["total_raw_tb"].number_value() - pst_it->second["used_raw_tb"].number_value());
tbytes = ttb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)2<<40);
fbytes = ftb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)2<<40);
tbytes = ttb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)1<<40);
fbytes = ftb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)1<<40);
}
*reply = (FSSTAT3res){
.status = NFS3_OK,
@@ -210,6 +210,7 @@ resume_4:
st->res = res;
kv_continue_create(st, 5);
});
return;
resume_5:
if (st->res < 0)
{
@@ -13,6 +13,12 @@ void kv_read_inode(nfs_proxy_t *proxy, uint64_t ino,
std::function<void(int res, const std::string & value, json11::Json ientry)> cb,
bool allow_cache)
{
if (!ino)
{
// Zero value can not exist
cb(-ENOENT, "", json11::Json());
return;
}
auto key = kv_inode_key(ino);
proxy->db->get(key, [=](int res, const std::string & value)
{

@@ -49,7 +55,7 @@ int kv_nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
auto ino = kv_fh_inode(fh);
if (self->parent->trace)
fprintf(stderr, "[%d] GETATTR %ju\n", self->nfs_fd, ino);
if (!kv_fh_valid(fh))
if (!kv_fh_valid(fh) || !ino)
{
*reply = (GETATTR3res){ .status = NFS3ERR_INVAL };
rpc_queue_reply(rop);
@@ -43,9 +43,30 @@ int kv_nfs3_lookup_proc(void *opaque, rpc_op_t *rop)
uint64_t ino = direntry["ino"].uint64_value();
kv_read_inode(self->parent, ino, [=](int res, const std::string & value, json11::Json ientry)
{
if (res < 0)
if (res == -ENOENT)
{
*reply = (LOOKUP3res){ .status = vitastor_nfs_map_err(res == -ENOENT ? -EIO : res) };
*reply = (LOOKUP3res){
.status = NFS3_OK,
.resok = (LOOKUP3resok){
.object = xdr_copy_string(rop->xdrs, kv_fh(ino)),
.obj_attributes = {
.attributes_follow = 1,
.attributes = (fattr3){
.type = (ftype3)0,
.mode = 0666,
.nlink = 1,
.fsid = self->parent->fsid,
.fileid = ino,
},
},
},
};
rpc_queue_reply(rop);
return;
}
else if (res < 0)
{
*reply = (LOOKUP3res){ .status = vitastor_nfs_map_err(res) };
rpc_queue_reply(rop);
return;
}
@@ -89,12 +89,23 @@ resume_1:
resume_2:
if (st->res < 0)
{
fprintf(stderr, "error reading inode %s: %s (code %d)\n",
kv_inode_key(st->ino).c_str(), strerror(-st->res), st->res);
auto cb = std::move(st->cb);
cb(st->res);
return;
if (st->res == -ENOENT)
{
// Just delete direntry and skip inode
fprintf(stderr, "direntry %s references a non-existing inode %ju, deleting\n",
kv_direntry_key(st->dir_ino, st->filename).c_str(), st->ino);
st->ino = 0;
}
else
{
fprintf(stderr, "error reading inode %s: %s (code %d)\n",
kv_inode_key(st->ino).c_str(), strerror(-st->res), st->res);
auto cb = std::move(st->cb);
cb(st->res);
return;
}
}
else
{
std::string err;
st->ientry = json11::Json::parse(st->ientry_text, err);
@@ -158,18 +158,13 @@ bool osd_t::check_peer_config(osd_client_t *cl, json11::Json conf)

json11::Json osd_t::get_osd_state()
{
std::vector<char> hostname;
hostname.resize(1024);
while (gethostname(hostname.data(), hostname.size()) < 0 && errno == ENAMETOOLONG)
hostname.resize(hostname.size()+1024);
hostname.resize(strnlen(hostname.data(), hostname.size()));
json11::Json::object st;
st["state"] = "up";
if (bind_addresses.size() != 1 || bind_addresses[0] != "0.0.0.0")
st["addresses"] = bind_addresses;
else
st["addresses"] = getifaddr_list();
st["host"] = std::string(hostname.data(), hostname.size());
st["host"] = gethostname_str();
st["version"] = VITASTOR_VERSION;
st["port"] = listening_port;
#ifdef WITH_RDMACM
@@ -209,7 +209,6 @@ bool osd_t::submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t *fb, bool rollback, osd_num_t peer_osd, int count, obj_ver_id *data)
.sec_stab = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = msgr.next_subop_id++,
.opcode = (uint64_t)(rollback ? OSD_OP_SEC_ROLLBACK : OSD_OP_SEC_STABILIZE),
},
.len = count * sizeof(obj_ver_id),
@@ -391,7 +391,6 @@ void osd_t::submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps)
.sec_list = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = msgr.next_subop_id++,
.opcode = OSD_OP_SEC_LIST,
},
.list_pg = ps->pg_num,
@@ -266,7 +266,6 @@ int osd_t::submit_bitmap_subops(osd_op_t *cur_op, pg_t & pg)
.sec_read_bmp = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = msgr.next_subop_id++,
.opcode = OSD_OP_SEC_READ_BMP,
},
.len = sizeof(obj_ver_id)*(i+1-prev),
@@ -233,7 +233,6 @@ void osd_t::submit_primary_subop(osd_op_t *cur_op, osd_op_t *subop,
subop->req.sec_rw = (osd_op_sec_rw_t){
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = msgr.next_subop_id++,
.opcode = (uint64_t)(wr ? (cur_op->op_data->pg->scheme == POOL_SCHEME_REPLICATED ? OSD_OP_SEC_WRITE_STABLE : OSD_OP_SEC_WRITE) : OSD_OP_SEC_READ),
},
.oid = {

@@ -452,15 +451,16 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
{
op_data->errcode = retval;
}
op_data->errors++;
if (subop->peer_fd >= 0 && retval != -EDOM && retval != -ERANGE &&
(retval != -ENOSPC || opcode != OSD_OP_SEC_WRITE && opcode != OSD_OP_SEC_WRITE_STABLE) &&
(retval != -EIO || opcode != OSD_OP_SEC_READ))
{
// Drop connection on unexpected errors
op_data->drops++;
msgr.stop_client(subop->peer_fd);
op_data->drops++;
}
// Increase op_data->errors after stop_client to prevent >= n_subops running twice
op_data->errors++;
}
else
{

@@ -593,7 +593,6 @@ void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_delete, int chunks_to_delete_count)
subops[i].req = (osd_any_op_t){ .sec_del = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = msgr.next_subop_id++,
.opcode = OSD_OP_SEC_DELETE,
},
.oid = chunk.oid,

@@ -653,7 +652,6 @@ int osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
subops[i].req = (osd_any_op_t){ .sec_sync = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = msgr.next_subop_id++,
.opcode = OSD_OP_SEC_SYNC,
},
.flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,

@@ -712,7 +710,6 @@ void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
subops[i].req = (osd_any_op_t){ .sec_stab = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = msgr.next_subop_id++,
.opcode = OSD_OP_SEC_STABILIZE,
},
.len = (uint64_t)(stab_osd.len * sizeof(obj_ver_id)),

@@ -806,7 +803,6 @@ void osd_t::submit_primary_rollback_subops(osd_op_t *cur_op, const uint64_t* osd_set)
subop->req = (osd_any_op_t){ .sec_stab = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = msgr.next_subop_id++,
.opcode = OSD_OP_SEC_ROLLBACK,
},
.len = sizeof(obj_ver_id),
@@ -162,6 +162,7 @@ struct reed_sol_matrix_t
int refs = 0;
int *je_data;
uint8_t *isal_data;
int isal_item_size;
// 32 bytes = 256/8 = max pg_size/8
std::map<std::array<uint8_t, 32>, void*> subdata;
std::map<reed_sol_erased_t, void*> decodings;

@@ -181,20 +182,42 @@ void use_ec(int pg_size, int pg_minsize, bool use)
}
int *matrix = reed_sol_vandermonde_coding_matrix(pg_minsize, pg_size-pg_minsize, OSD_JERASURE_W);
uint8_t *isal_table = NULL;
int item_size = 8;
#ifdef WITH_ISAL
uint8_t *isal_matrix = (uint8_t*)malloc_or_die(pg_minsize*(pg_size-pg_minsize));
for (int i = 0; i < pg_minsize*(pg_size-pg_minsize); i++)
{
isal_matrix[i] = matrix[i];
}
isal_table = (uint8_t*)malloc_or_die(pg_minsize*(pg_size-pg_minsize)*32);
isal_table = (uint8_t*)calloc_or_die(1, pg_minsize*(pg_size-pg_minsize)*32);
ec_init_tables(pg_minsize, pg_size-pg_minsize, isal_matrix, isal_table);
free(isal_matrix);
for (int i = pg_minsize*(pg_size-pg_minsize)*8; i < pg_minsize*(pg_size-pg_minsize)*32; i++)
{
if (isal_table[i] != 0)
{
// ISA-L GF-NI version uses 8-byte table items
item_size = 32;
break;
}
}
// Sanity check: rows should never consist of all zeroes
uint8_t zero_row[pg_minsize*item_size];
memset(zero_row, 0, pg_minsize*item_size);
for (int i = 0; i < (pg_size-pg_minsize); i++)
{
if (memcmp(isal_table + i*pg_minsize*item_size, zero_row, pg_minsize*item_size) == 0)
{
fprintf(stderr, "BUG or ISA-L incompatibility: EC tables shouldn't have all-zero rows\n");
abort();
}
}
#endif
matrices[key] = (reed_sol_matrix_t){
.refs = 0,
.je_data = matrix,
.isal_data = isal_table,
.isal_item_size = item_size,
};
rs_it = matrices.find(key);
}

@@ -235,7 +258,7 @@ static reed_sol_matrix_t* get_ec_matrix(int pg_size, int pg_minsize)
// we don't need it. also it makes an extra allocation of int *erased on every call and doesn't cache
// the decoding matrix.
// all these flaws are fixed in this function:
static void* get_jerasure_decoding_matrix(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize)
static void* get_jerasure_decoding_matrix(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, int *item_size)
{
int edd = 0;
int erased[pg_size];

@@ -292,6 +315,7 @@ static void* get_jerasure_decoding_matrix(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize)
int *erased_copy = (int*)(rectable + 32*smrow*pg_minsize);
memcpy(erased_copy, erased, pg_size*sizeof(int));
matrix->decodings.emplace((reed_sol_erased_t){ .data = erased_copy, .size = pg_size }, rectable);
*item_size = matrix->isal_item_size;
return rectable;
#else
int *dm_ids = (int*)malloc_or_die(sizeof(int)*(pg_minsize + pg_minsize*pg_minsize + pg_size));

@@ -355,7 +379,8 @@ static void jerasure_matrix_encode_unaligned(int k, int m, int w, int *matrix, char **data_ptrs, char **coding_ptrs, int size)
#ifdef WITH_ISAL
void reconstruct_stripes_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, uint32_t bitmap_size)
{
uint8_t *dectable = (uint8_t*)get_jerasure_decoding_matrix(stripes, pg_size, pg_minsize);
int item_size = 0;
uint8_t *dectable = (uint8_t*)get_jerasure_decoding_matrix(stripes, pg_size, pg_minsize, &item_size);
if (!dectable)
{
return;

@@ -378,7 +403,7 @@ void reconstruct_stripes_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, uint32_t bitmap_size)
}
}
ec_encode_data(
read_end-read_start, pg_minsize, wanted, dectable + wanted_base*32*pg_minsize,
read_end-read_start, pg_minsize, wanted, dectable + wanted_base*item_size*pg_minsize,
data_ptrs, data_ptrs + pg_minsize
);
}

@@ -433,7 +458,7 @@ void reconstruct_stripes_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, uint32_t bitmap_size)
#else
void reconstruct_stripes_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, uint32_t bitmap_size)
{
int *dm_ids = (int*)get_jerasure_decoding_matrix(stripes, pg_size, pg_minsize);
int *dm_ids = (int*)get_jerasure_decoding_matrix(stripes, pg_size, pg_minsize, NULL);
if (!dm_ids)
{
return;

@@ -980,7 +1005,7 @@ void calc_rmw_parity_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, ...)
{
int item_size =
#ifdef WITH_ISAL
32;
matrix->isal_item_size;
#else
sizeof(int);
#endif
@@ -65,7 +65,6 @@ void osd_t::scrub_list(pool_pg_num_t pg_id, osd_num_t role_osd, object_id min_oid)
.sec_list = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = msgr.next_subop_id++,
.opcode = OSD_OP_SEC_LIST,
},
.list_pg = pg_num,
@@ -198,6 +198,12 @@ void osd_t::exec_show_config(osd_op_t *cur_op)
json11::Json req_json = cur_op->req.show_conf.json_len > 0
? json11::Json::parse(std::string((char *)cur_op->buf), json_err)
: json11::Json();
if (req_json["features"]["check_sequencing"].bool_value())
{
auto cl = msgr.clients.at(cur_op->peer_fd);
cl->check_sequencing = true;
cl->read_op_id = cur_op->req.hdr.id + 1;
}
// Expose sensitive configuration values so peers can check them
json11::Json::object wire_config = json11::Json::object {
{ "osd_num", osd_num },
@@ -21,7 +21,9 @@ osd_messenger_t::~osd_messenger_t()

void osd_messenger_t::outbox_push(osd_op_t *cur_op)
{
clients[cur_op->peer_fd]->sent_ops[cur_op->req.hdr.id] = cur_op;
auto cl = clients.at(cur_op->peer_fd);
cur_op->req.hdr.id = ++cl->send_op_id;
cl->sent_ops[cur_op->req.hdr.id] = cur_op;
}

void osd_messenger_t::parse_config(const json11::Json & config)
@@ -245,3 +245,13 @@ int create_and_bind_socket(std::string bind_address, int bind_port, int listen_backlog, int *listening_port)

return listen_fd;
}

std::string gethostname_str()
{
std::string hostname;
hostname.resize(1024);
while (gethostname((char*)hostname.data(), hostname.size()) < 0 && errno == ENAMETOOLONG)
hostname.resize(hostname.size()+1024);
hostname.resize(strnlen(hostname.data(), hostname.size()));
return hostname;
}
@@ -21,3 +21,4 @@ bool cidr6_match(const in6_addr &address, const in6_addr &network, uint8_t bits)
bool cidr_sockaddr_match(const sockaddr_storage &addr, const addr_mask_t &mask);
std::vector<std::string> getifaddr_list(const std::vector<addr_mask_t> & masks = std::vector<addr_mask_t>(), bool include_v6 = false);
int create_and_bind_socket(std::string bind_address, int bind_port, int listen_backlog, int *listening_port);
std::string gethostname_str();
@@ -10,6 +10,10 @@
#include "ringloop.h"

#ifndef IORING_CQE_F_MORE
#define IORING_CQE_F_MORE (1U << 1)
#endif

ring_loop_t::ring_loop_t(int qd, bool multithreaded)
{
mt = multithreaded;

@@ -30,6 +34,16 @@ ring_loop_t::ring_loop_t(int qd, bool multithreaded)
free_ring_data[i] = i;
}
in_loop = false;
auto probe = io_uring_get_probe();
if (probe)
{
support_zc = io_uring_opcode_supported(probe, IORING_OP_SENDMSG_ZC);
#ifdef IORING_SETUP_R_DISABLED /* liburing 2.0 check */
io_uring_free_probe(probe);
#else
free(probe);
#endif
}
}

ring_loop_t::~ring_loop_t()

@@ -108,7 +122,17 @@ void ring_loop_t::loop()
if (mt)
mu.lock();
struct ring_data_t *d = (struct ring_data_t*)cqe->user_data;
if (d->callback)
if (cqe->flags & IORING_CQE_F_MORE)
{
// There will be a second notification
d->res = cqe->res;
d->more = true;
if (d->callback)
d->callback(d);
d->prev = true;
d->more = false;
}
else if (d->callback)
{
// First free ring_data item, then call the callback
// so it has at least 1 free slot for the next event

@@ -116,7 +140,10 @@ void ring_loop_t::loop()
struct ring_data_t dl;
dl.iov = d->iov;
dl.res = cqe->res;
dl.more = false;
dl.prev = d->prev;
dl.callback.swap(d->callback);
d->prev = d->more = false;
free_ring_data[free_ring_data_ptr++] = d - ring_datas;
if (mt)
mu.unlock();
@@ -18,6 +18,10 @@
#define RINGLOOP_DEFAULT_SIZE 1024

#ifndef IORING_RECV_MULTISHOT /* liburing-2.3 check */
#define IORING_OP_SENDMSG_ZC 48
#endif

static inline void my_uring_prep_rw(int op, struct io_uring_sqe *sqe, int fd, const void *addr, unsigned len, off_t offset)
{
// Prepare a read/write operation without clearing user_data

@@ -62,6 +66,12 @@ static inline void my_uring_prep_sendmsg(struct io_uring_sqe *sqe, int fd, const struct msghdr *msg, unsigned flags)
sqe->msg_flags = flags;
}

static inline void my_uring_prep_sendmsg_zc(struct io_uring_sqe *sqe, int fd, const struct msghdr *msg, unsigned flags)
{
my_uring_prep_rw(IORING_OP_SENDMSG_ZC, sqe, fd, msg, 1, 0);
sqe->msg_flags = flags;
}

static inline void my_uring_prep_poll_add(struct io_uring_sqe *sqe, int fd, short poll_mask)
{
my_uring_prep_rw(IORING_OP_POLL_ADD, sqe, fd, NULL, 0, 0);

@@ -112,6 +122,8 @@ struct ring_data_t
{
struct iovec iov; // for single-entry read/write operations
int res;
bool prev: 1;
bool more: 1;
std::function<void(ring_data_t*)> callback;
};

@@ -133,6 +145,7 @@ class ring_loop_t
bool loop_again;
struct io_uring ring;
int ring_eventfd = -1;
bool support_zc = false;
public:
ring_loop_t(int qd, bool multithreaded = false);
~ring_loop_t();

@@ -144,6 +157,7 @@ public:
inline void set_immediate(const std::function<void()> cb)
{
immediate_queue.push_back(cb);
wakeup();
}
inline int submit()
{

@@ -163,6 +177,10 @@ public:
{
return loop_again;
}
inline bool has_sendmsg_zc()
{
return support_zc;
}

void loop();
void wakeup();
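For context, a condensed sketch (not verbatim from the patch) of the decision the send path makes with the pieces above: the runtime probe exposed through `has_sendmsg_zc()` plus the configurable threshold decide between `my_uring_prep_sendmsg_zc()` and the regular `my_uring_prep_sendmsg()`. The helper name below is hypothetical.

    // Hypothetical helper mirroring the try_send() logic: fall back to ordinary
    // sendmsg when zero-copy is unsupported, disabled (negative threshold), or
    // the average iovec is smaller than min_zerocopy_send_size.
    #include <sys/socket.h>
    #include <cstddef>

    static bool should_use_zerocopy(bool has_sendmsg_zc, const msghdr & msg, int min_zerocopy_send_size)
    {
        if (!has_sendmsg_zc || min_zerocopy_send_size < 0)
            return false;
        if (min_zerocopy_send_size == 0 || !msg.msg_iovlen)
            return min_zerocopy_send_size == 0;
        size_t total = 0;
        for (size_t i = 0; i < msg.msg_iovlen; i++)
            total += msg.msg_iov[i].iov_len;
        return total / msg.msg_iovlen >= (size_t)min_zerocopy_send_size;
    }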
@@ -87,6 +87,22 @@ wait_etcd()
done
}

wait_condition()
{
sec=$1
check=$2
proc=$3
i=0
while [[ $i -lt $sec ]]; do
eval "$check" && break
if [ $i -eq $sec ]; then
format_error "$proc couldn't finish in $sec seconds"
fi
sleep 1
i=$((i+1))
done
}

if [[ -n "$ANTIETCD" ]]; then
ETCDCTL="node mon/node_modules/.bin/anticli -e $ETCD_URL"
MON_PARAMS="--use_antietcd 1 --antietcd_data_dir ./testdata --antietcd_persist_interval 500 $MON_PARAMS"
@@ -127,22 +127,6 @@ try_reweight()
sleep 3
}

wait_condition()
{
sec=$1
check=$2
proc=$3
i=0
while [[ $i -lt $sec ]]; do
eval "$check" && break
if [ $i -eq $sec ]; then
format_error "$proc couldn't finish in $sec seconds"
fi
sleep 1
i=$((i+1))
done
}

wait_finish_rebalance()
{
sec=$1
@@ -31,4 +31,10 @@ sleep 2
$ETCDCTL get --prefix /vitastor/pg/config --print-value-only | \
jq -s -e '([ .[0].items["1"] | .[].osd_set | map_values(. | tonumber) | select((.[0] <= 4) != (.[1] <= 4)) ] | length) == 4'

# test pool with size 1
build/src/cmd/vitastor-cli --etcd_address $ETCD_URL create-pool size1pool -s 1 -n 1 --force
wait_condition 10 "$ETCDCTL get --prefix /vitastor/pg/config --print-value-only | jq -s -e '.[0].items["'"'"2"'"'"]'"

build/src/cmd/vitastor-cli --etcd_address $ETCD_URL modify-pool size1pool -s 2 --force

format_green OK
@@ -165,4 +165,24 @@ if ls ./testdata/nfs | grep over1; then false; fi
[[ "`cat ./testdata/nfs/linked1`" = "BABABA" ]]
format_green "rename over existing file ok"

# check listing and removal of a bad direntry
sudo umount ./testdata/nfs/
build/src/kv/vitastor-kv --etcd_address $ETCD_URL fsmeta set d11/settings.jsonLGNmGn '{"ino": 123}'
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
ls -l ./testdata/nfs
ls -l ./testdata/nfs/settings.jsonLGNmGn
rm ./testdata/nfs/settings.jsonLGNmGn
build/src/kv/vitastor-kv --etcd_address $ETCD_URL fsmeta get d11/settings.jsonLGNmGn 2>&1 | grep '(code -2)'
ls -l ./testdata/nfs

# repeat with ino=0
sudo umount ./testdata/nfs/
build/src/kv/vitastor-kv --etcd_address $ETCD_URL fsmeta set d11/settings.jsonLGNmGn '{"ino": 0}'
sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
ls -l ./testdata/nfs
ls -l ./testdata/nfs/settings.jsonLGNmGn
rm ./testdata/nfs/settings.jsonLGNmGn
build/src/kv/vitastor-kv --etcd_address $ETCD_URL fsmeta get d11/settings.jsonLGNmGn 2>&1 | grep '(code -2)'
ls -l ./testdata/nfs

format_green OK