Compare commits


32 Commits

Author SHA1 Message Date
Vitaliy Filippov b7322a405a Move gethostname_str to utils 2025-05-05 02:16:06 +03:00
Vitaliy Filippov 5692630005 Move check_sequencing indication into config response features subkey 2025-05-05 02:07:50 +03:00
Vitaliy Filippov 00ced7cea7 Fix monitor crash with non-existing node_placement nodes 2025-05-05 02:05:04 +03:00
Vitaliy Filippov ebdb75e287 Fix typo in docker docs 2025-05-04 18:09:33 +03:00
Vitaliy Filippov f397fe9c6a Add compatibility with ISA-L 2.31+ 2025-05-04 18:09:33 +03:00
Vitaliy Filippov 28560b4ae5 Write K/V listings in buffered manner 2025-05-03 15:06:59 +03:00
Vitaliy Filippov 2d07449e74 Postpone cb() to set_immediate() to prevent stack overflows in kv_db 2025-05-03 15:06:59 +03:00
Vitaliy Filippov 80c4e8c20f Add missing wakeup in ringloop->set_immediate to prevent slowdowns in code using set_immediate 2025-05-03 14:40:48 +03:00
Vitaliy Filippov 2ab0ae3bc9 Check operation sequencing and stop clients when it breaks 2025-05-02 17:01:50 +03:00
Vitaliy Filippov 05e59c1b4f Fix MSG_WAITALL assertion added in the zero-copy patch 2025-05-02 17:01:43 +03:00
Vitaliy Filippov e6e1c5b962 Check if OSDs are still up in rm-osd 2025-05-02 13:05:59 +03:00
Vitaliy Filippov 9556eeae45 Implement io_uring zero-copy send support 2025-05-01 18:47:10 +03:00
Vitaliy Filippov 96b5a72630 Allow removal of bad direntries in VitastorFS (direntries referring non-existent inodes) 2025-05-01 01:14:23 +03:00
Vitaliy Filippov ef80f121f6 Fix "duplicate inode during create" deletion in VitastorFS 2025-04-30 20:37:49 +03:00
Vitaliy Filippov bbdd1f3aa7 Fix modify-pool -s PG_SIZE without --pg_minsize 2025-04-28 02:20:54 +03:00
Vitaliy Filippov 5dd37f519a Fix node folding in case of empty rules (pool with size 1), add a test 2025-04-28 02:16:49 +03:00
Vitaliy Filippov a2278be84d Improve data distribution: solve LP task on failure domains instead of individual OSDs
This greatly speeds up PG placement and makes it more uniform both because the LP task
becomes simpler and because the distribution of individual OSDs is optimised manually
2025-04-27 01:44:46 +03:00
Vitaliy Filippov 1393a2671c Change default vitastor-etcd data dir to /var/lib/etcd/vitastor 2025-04-27 01:44:46 +03:00
Vitaliy Filippov 9fa8ae5384 Reset OSD ping state on receiving data from it via RDMA 2025-04-26 14:17:09 +03:00
Vitaliy Filippov 169a35a067 Followup to latency aggregation fix 2025-04-26 01:32:46 +03:00
Vitaliy Filippov 2b2a10581d Prevent double handle_primary_subop in rare cases 2025-04-26 01:16:53 +03:00
Vitaliy Filippov 10fd51862a Fix latency aggregation in global stats (/vitastor/stats in etcd) 2025-04-25 00:08:10 +03:00
Vitaliy Filippov 15d0204f96 Hide "Ran out of journal space" log messages by default 2025-04-20 01:21:03 +03:00
Vitaliy Filippov 21d6e88a1b Add instructions to change NFS_MAX_FILE_IO_SIZE 2025-04-18 13:39:32 +03:00
Vitaliy Filippov df2847df2d Wait for RDMA-CM EVENT_ESTABLISHED after rdma_accept(), handle rdma_accept() before acking the event 2025-04-15 15:19:36 +03:00
Vitaliy Filippov 327c98a4b6 Fix index_tree 2025-04-13 16:13:44 +03:00
Vitaliy Filippov 3cc0abfd81 Fix NFS total & free multiplied by extra 2 2025-04-12 19:09:26 +03:00
Vitaliy Filippov 80e5f8ba76 Add missing WITH_RDMACM defines 2025-04-12 19:06:27 +03:00
Vitaliy Filippov 4b660f1ce8 Fix systemd unit name in make-etcd 2025-04-11 02:05:08 +03:00
Vitaliy Filippov dfde0e60f0 Do not allow reweight > 1 2025-04-05 12:21:14 +03:00
Vitaliy Filippov 013f688ffe Run check_peer_config on RDMA-CM connections too 2025-04-02 01:32:28 +03:00
Vitaliy Filippov cf9738ddbe Fix docker 2.1.0 build :) 2025-04-01 22:46:22 +03:00
63 changed files with 1119 additions and 198 deletions

View File

@@ -3,7 +3,7 @@ VITASTOR_VERSION ?= v2.1.0

 all: build push

 build:
-	@docker build --rm -t vitalif/vitastor:$(VITASTOR_VERSION) .
+	@docker build --no-cache --rm -t vitalif/vitastor:$(VITASTOR_VERSION) .

 push:
 	@docker push vitalif/vitastor:$(VITASTOR_VERSION)

View File

@@ -1 +1,2 @@
 deb http://vitastor.io/debian bookworm main
+deb http://http.debian.net/debian/ bookworm-backports main

View File

@@ -34,6 +34,7 @@ between clients, OSDs and etcd.
 - [etcd_ws_keepalive_interval](#etcd_ws_keepalive_interval)
 - [etcd_min_reload_interval](#etcd_min_reload_interval)
 - [tcp_header_buffer_size](#tcp_header_buffer_size)
+- [min_zerocopy_send_size](#min_zerocopy_send_size)
 - [use_sync_send_recv](#use_sync_send_recv)

 ## osd_network

@@ -313,6 +314,34 @@ is received without an additional copy. You can try to play with this
 parameter and see how it affects random iops and linear bandwidth if you
 want.

+## min_zerocopy_send_size
+
+- Type: integer
+- Default: 32768
+
+OSDs and clients will attempt to use io_uring-based zero-copy TCP send
+for buffers larger than this number of bytes. Zero-copy send with io_uring is
+supported since Linux kernel version 6.1. Support is auto-detected and disabled
+automatically when not available. It can also be disabled explicitly by setting
+this parameter to a negative value.
+
+⚠️ Warning! Zero-copy send performance may vary greatly from CPU to CPU and from
+one kernel version to another. Generally, it tends to be beneficial only with larger
+messages. With smaller messages (say, 4 KB), it may actually be slower. 32 KB is
+enough for almost all CPUs, but even smaller values are optimal for some of them.
+For example, 4 KB is OK for EPYC Milan/Genoa and 12 KB is OK for Xeon Ice Lake
+(but please verify it yourself).
+
+Verification instructions:
+1. Add `iommu=pt` into your Linux kernel command line and reboot.
+2. Upgrade your kernel. For example, it's very important to use 6.11+ with recent AMD EPYCs.
+3. Run some tests with the [send-zerocopy liburing example](https://github.com/axboe/liburing/blob/master/examples/send-zerocopy.c)
+   to find the minimal message size for which zero-copy is optimal.
+   Use `./send-zerocopy tcp -4 -R` at the server side and
+   `time ./send-zerocopy tcp -4 -b 0 -s BUFFER_SIZE -D SERVER_IP` at the client side with
+   `-z 0` (no zero-copy) and `-z 1` (zero-copy), and compare MB/s and used CPU time
+   (user+system).
+
 ## use_sync_send_recv

 - Type: boolean

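For reference, the new parameter is set like any other client/OSD option in the cluster configuration. A minimal sketch, assuming the standard `/etc/vitastor/vitastor.conf` location (the etcd address and the 64 KB threshold here are purely illustrative, not recommendations):

```
{
    "etcd_address": "10.0.0.1:2379",
    "min_zerocopy_send_size": 65536
}
```

Setting the parameter to a negative value (e.g. `-1`) instead would disable zero-copy send explicitly, as described above.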
View File

@@ -34,6 +34,7 @@
 - [etcd_ws_keepalive_interval](#etcd_ws_keepalive_interval)
 - [etcd_min_reload_interval](#etcd_min_reload_interval)
 - [tcp_header_buffer_size](#tcp_header_buffer_size)
+- [min_zerocopy_send_size](#min_zerocopy_send_size)
 - [use_sync_send_recv](#use_sync_send_recv)

 ## osd_network

@@ -321,6 +322,34 @@ Vitastor contain 128-byte headers, behind which
 change this parameter and see how it affects the performance
 of random and linear access.

+## min_zerocopy_send_size
+
+- Type: integer
+- Default: 32768
+
+OSDs and clients will try to use io_uring-based zero-copy TCP send for buffers
+larger than this number of bytes. Zero-copy send is supported in io_uring starting
+from Linux kernel version 6.1. Support is detected automatically, and zero-copy is
+disabled when it's not available. It can also be disabled explicitly by setting
+this parameter to a negative value.
+
+⚠️ Warning! The performance of this feature may differ greatly on different CPUs
+and on different Linux kernel versions. In general, zero-copy is usually faster with
+large messages, while with small ones (for example, 4 KB) it may even be slower.
+32 KB is enough for almost all CPUs, but for some even smaller values can be used.
+For example, 4 KB suits EPYC Milan/Genoa and 12 KB suits Xeon Ice Lake
+(but please double-check it yourself).
+
+Verification instructions:
+1. Add `iommu=pt` to your Linux kernel boot command line and reboot.
+2. Update the kernel. For example, for AMD EPYC it's very important to use version 6.11+.
+3. Run tests with the [send-zerocopy liburing example](https://github.com/axboe/liburing/blob/master/examples/send-zerocopy.c)
+   to find the minimal message size for which zero-copy send is optimal.
+   Run `./send-zerocopy tcp -4 -R` on the server side and
+   `time ./send-zerocopy tcp -4 -b 0 -s BUFFER_SIZE -D SERVER_ADDRESS` on the client side
+   with the `-z 0` option (normal send) and `-z 1` (zero-copy send), and compare
+   the speed in MB/s and the consumed CPU time (user+system).
+
 ## use_sync_send_recv

 - Type: boolean (yes/no)

View File

@@ -373,6 +373,55 @@
     of the parameter is read without an additional copy. You can try to
     change this parameter and see how it affects the performance
     of random and linear access.
+- name: min_zerocopy_send_size
+  type: int
+  default: 32768
+  info: |
+    OSDs and clients will attempt to use io_uring-based zero-copy TCP send
+    for buffers larger than this number of bytes. Zero-copy send with io_uring is
+    supported since Linux kernel version 6.1. Support is auto-detected and disabled
+    automatically when not available. It can also be disabled explicitly by setting
+    this parameter to a negative value.
+
+    ⚠️ Warning! Zero-copy send performance may vary greatly from CPU to CPU and from
+    one kernel version to another. Generally, it tends to be beneficial only with larger
+    messages. With smaller messages (say, 4 KB), it may actually be slower. 32 KB is
+    enough for almost all CPUs, but even smaller values are optimal for some of them.
+    For example, 4 KB is OK for EPYC Milan/Genoa and 12 KB is OK for Xeon Ice Lake
+    (but please verify it yourself).
+
+    Verification instructions:
+    1. Add `iommu=pt` into your Linux kernel command line and reboot.
+    2. Upgrade your kernel. For example, it's very important to use 6.11+ with recent AMD EPYCs.
+    3. Run some tests with the [send-zerocopy liburing example](https://github.com/axboe/liburing/blob/master/examples/send-zerocopy.c)
+       to find the minimal message size for which zero-copy is optimal.
+       Use `./send-zerocopy tcp -4 -R` at the server side and
+       `time ./send-zerocopy tcp -4 -b 0 -s BUFFER_SIZE -D SERVER_IP` at the client side with
+       `-z 0` (no zero-copy) and `-z 1` (zero-copy), and compare MB/s and used CPU time
+       (user+system).
+  info_ru: |
+    OSDs and clients will try to use io_uring-based zero-copy TCP send for buffers
+    larger than this number of bytes. Zero-copy send is supported in io_uring starting
+    from Linux kernel version 6.1. Support is detected automatically, and zero-copy is
+    disabled when it's not available. It can also be disabled explicitly by setting
+    this parameter to a negative value.
+
+    ⚠️ Warning! The performance of this feature may differ greatly on different CPUs
+    and on different Linux kernel versions. In general, zero-copy is usually faster with
+    large messages, while with small ones (for example, 4 KB) it may even be slower.
+    32 KB is enough for almost all CPUs, but for some even smaller values can be used.
+    For example, 4 KB suits EPYC Milan/Genoa and 12 KB suits Xeon Ice Lake
+    (but please double-check it yourself).
+
+    Verification instructions:
+    1. Add `iommu=pt` to your Linux kernel boot command line and reboot.
+    2. Update the kernel. For example, for AMD EPYC it's very important to use version 6.11+.
+    3. Run tests with the [send-zerocopy liburing example](https://github.com/axboe/liburing/blob/master/examples/send-zerocopy.c)
+       to find the minimal message size for which zero-copy send is optimal.
+       Run `./send-zerocopy tcp -4 -R` on the server side and
+       `time ./send-zerocopy tcp -4 -b 0 -s BUFFER_SIZE -D SERVER_ADDRESS` on the client side
+       with the `-z 0` option (normal send) and `-z 1` (zero-copy send), and compare
+       the speed in MB/s and the consumed CPU time (user+system).
 - name: use_sync_send_recv
   type: bool
   default: false

View File

@@ -26,9 +26,9 @@ at Vitastor Kubernetes operator: https://github.com/Antilles7227/vitastor-operat

 The instructions are very simple.

 1. Download a Docker image of the desired version: \
-   `docker pull vitastor:2.1.0`
+   `docker pull vitastor:v2.1.0`
 2. Install scripts to the host system: \
-   `docker run --rm -it -v /etc:/host-etc -v /usr/bin:/host-bin vitastor:2.1.0 install.sh`
+   `docker run --rm -it -v /etc:/host-etc -v /usr/bin:/host-bin vitastor:v2.1.0 install.sh`
 3. Reload udev rules: \
    `udevadm control --reload-rules`

View File

@@ -25,9 +25,9 @@ Vitastor can be installed in Docker/Podman. In this case etcd,

 The installation instructions are as simple as possible.

 1. Download the Docker image of the desired version: \
-   `docker pull vitastor:2.1.0`
+   `docker pull vitastor:v2.1.0`
 2. Install the scripts into the host system with: \
-   `docker run --rm -it -v /etc:/host-etc -v /usr/bin:/host-bin vitastor:2.1.0 install.sh`
+   `docker run --rm -it -v /etc:/host-etc -v /usr/bin:/host-bin vitastor:v2.1.0 install.sh`
 3. Reload the udev rules: \
    `udevadm control --reload-rules`

View File

@@ -14,6 +14,7 @@
 - [Removing a failed disk](#removing-a-failed-disk)
 - [Adding a disk](#adding-a-disk)
 - [Restoring from lost pool configuration](#restoring-from-lost-pool-configuration)
+- [Incompatibility problems](#incompatibility-problems)
 - [Upgrading Vitastor](#upgrading-vitastor)
 - [OSD memory usage](#osd-memory-usage)

@@ -166,6 +167,17 @@ done

 After that all PGs should peer and find all previous data.

+## Incompatibility problems
+
+### ISA-L 2.31
+
+⚠ It is FORBIDDEN to use Vitastor 2.1.0 and earlier versions with ISA-L 2.31 and newer if
+you use EC N+K pools with K > 1 on a CPU with GF-NI instruction support, because it WILL
+lead to **data loss** during EC recovery.
+
+If you accidentally upgraded ISA-L to 2.31 but didn't upgrade Vitastor and restarted OSDs,
+then stop them as soon as possible and either update Vitastor or roll back ISA-L.
+
 ## Upgrading Vitastor

 Every upcoming Vitastor version is usually compatible with previous both forward

View File

@@ -14,6 +14,7 @@
 - [Removing a failed disk](#удаление-неисправного-диска)
 - [Adding a disk](#добавление-диска)
 - [Restoring from lost pool configuration](#восстановление-потерянной-конфигурации-пулов)
+- [Incompatibility problems](#проблемы-несовместимости)
 - [Upgrading Vitastor](#обновление-vitastor)
 - [OSD memory usage](#потребление-памяти-osd)

@@ -163,6 +164,17 @@ done

 After that, all PGs should go through peering and find all their previous data.

+## Incompatibility problems
+
+### ISA-L 2.31
+
+⚠ It is FORBIDDEN to use Vitastor 2.1.0 and earlier versions with ISA-L version 2.31
+or newer if you use EC N+K pools with K > 1 on a CPU with GF-NI instruction support,
+as this will lead to **data loss** during recovery from EC.
+
+If you accidentally upgraded ISA-L to 2.31 but didn't upgrade Vitastor and have already
+restarted the OSDs, stop them all as soon as possible and either update Vitastor or roll back ISA-L.
+
 ## Upgrading Vitastor

 Usually each subsequent Vitastor version is compatible with previous ones both "forward" and "backward"

View File

@@ -14,6 +14,9 @@ Commands:
 - [upgrade](#upgrade)
 - [defrag](#defrag)

+⚠️ Important: follow the instructions from [Linux NFS write size](#linux-nfs-write-size)
+for optimal Vitastor NFS performance if you use EC and HDD and mount your NFS from Linux.
+
 ## Pseudo-FS

 Simplified pseudo-FS proxy is used for file-based image access emulation. It's not
@@ -100,6 +103,62 @@ Other notable missing features which should be addressed in the future:
 in the DB. The FS is implemented in such a way that this garbage doesn't affect its
 function, but having a tool to clean it up still seems the right thing to do.

+## Linux NFS write size
+
+Linux NFS client (nfs/nfsv3/nfsv4 kernel modules) has a hard-coded maximum I/O size,
+currently set to 1 MB - see `rsize` and `wsize` in [man 5 nfs](https://linux.die.net/man/5/nfs).
+
+This means that when you write to a file in an FS mounted over NFS, the maximum write
+request size is 1 MB, even in the O_DIRECT mode and even if the original write request
+is larger.
+
+However, for optimal linear write performance in Vitastor EC (erasure-coded) pools,
+the size of write requests should be a multiple of [block_size](../config/layout-cluster.en.md#block_size),
+multiplied by the data chunk count of the pool ([pg_size](../config/pool.en.md#pg_size)-[parity_chunks](../config/pool.en.md#parity_chunks)).
+When write requests are smaller than or not a multiple of this number, Vitastor has to first
+read the paired data blocks from disk, calculate new parity blocks and only then write them
+back. Obviously this is 2-3 times slower than a simple disk write.
+
+Vitastor HDD setups use 1 MB block_size by default. So, for optimal performance, if
+you use EC 2+1 and HDD, you need your NFS client to send 2 MB write requests; if you
+use EC 4+1 - 4 MB, and so on.
+
+But the Linux NFS client only writes in 1 MB chunks. 😢
+
+The good news is that you can fix it by rebuilding the Linux NFS kernel modules 😉 🤩!
+You need to change NFS_MAX_FILE_IO_SIZE in nfs_xdr.h and then rebuild and reload the modules.
+
+The instructions, using Debian as an example (run them as root):
+
+```
+# download current Linux kernel headers required to build modules
+apt-get install linux-headers-`uname -r`
+# replace NFS_MAX_FILE_IO_SIZE with a desired number (here it's 4194304 - 4 MB)
+sed -i 's/NFS_MAX_FILE_IO_SIZE\s*.*/NFS_MAX_FILE_IO_SIZE\t(4194304U)/' /lib/modules/`uname -r`/source/include/linux/nfs_xdr.h
+# download current Linux kernel source
+mkdir linux_src
+cd linux_src
+apt-get source linux-image-`uname -r`-unsigned
+# build NFS modules
+cd linux-*/fs/nfs
+make -C /lib/modules/`uname -r`/build M=$PWD -j8 modules
+make -C /lib/modules/`uname -r`/build M=$PWD modules_install
+# move default NFS modules away
+mv /lib/modules/`uname -r`/kernel/fs/nfs ~/nfs_orig_`uname -r`
+depmod -a
+# unload old modules and load the new ones
+rmmod nfsv3 nfs
+modprobe nfsv3
+```
+
+After these (not too complicated 🙂) manipulations, NFS is mounted with the new wsize
+and rsize by default, which fixes Vitastor-NFS linear write performance.
+
 ## Horizontal scaling

 Linux NFS 3.0 client doesn't support built-in scaling or failover, i.e. you can't

View File

@@ -14,6 +14,9 @@
 - [upgrade](#upgrade)
 - [defrag](#defrag)

+⚠️ Important: for optimal Vitastor NFS performance in Linux when using HDD and EC
+(erasure codes), follow the instructions from the [Linux NFS write size](#размер-записи-linux-nfs) section.
+
 ## Pseudo-FS

 A simplified pseudo-FS implementation is used to emulate file access to block
@@ -104,6 +107,66 @@ JSON format :-). For inspecting the DB contents
 records. The FS is designed so that they don't affect its operation, but for
 tidiness it's still worth being able to clean them up.

+## Linux NFS write size
+
+The Linux NFS client (nfs/nfsv3/nfsv4 kernel modules) has a maximum I/O request size
+fixed in the code, equal to 1 MB - see `rsize` and `wsize` in [man 5 nfs](https://linux.die.net/man/5/nfs).
+
+This means that when you write to a file in an NFS-mounted file system, the maximum
+write request size is 1 MB, even in O_DIRECT mode and even if the original write
+request was larger.
+
+However, for optimal linear write speed in Vitastor with EC pools (pools with erasure
+codes), write requests must be a multiple of [block_size](../config/layout-cluster.ru.md#block_size),
+multiplied by the number of data chunks of the pool ([pg_size](../config/pool.ru.md#pg_size)-[parity_chunks](../config/pool.ru.md#parity_chunks)).
+If write requests are smaller than or not a multiple of this, Vitastor first has to read
+the old versions of the paired data blocks from the disks, compute new parity blocks and
+only after that write them to the disks. Naturally, this is 2-3 times slower than a
+simple disk write.
+
+At the same time, block_size on hard disks is set to 1 MB by default. So if you use
+EC 2+1 and HDD, for optimal write speed you need the NFS client to write in 2 MB chunks;
+with EC 4+1 and HDD - in 4 MB chunks, and so on.
+
+But the Linux NFS client only writes in 1 MB chunks. 😢
+
+This can be fixed, though, by rebuilding the Linux NFS kernel modules 😉 🤩! To do so,
+change the value of the NFS_MAX_FILE_IO_SIZE variable in the nfs_xdr.h header file and
+then rebuild the NFS modules.
+
+The rebuild instructions, using Debian as an example (run as root):
+
+```
+# download the headers to build modules for the current Linux kernel
+apt-get install linux-headers-`uname -r`
+# replace NFS_MAX_FILE_IO_SIZE in the headers with the desired value (here 4194304 - 4 MB)
+sed -i 's/NFS_MAX_FILE_IO_SIZE\s*.*/NFS_MAX_FILE_IO_SIZE\t(4194304U)/' /lib/modules/`uname -r`/source/include/linux/nfs_xdr.h
+# download the source code of the current kernel
+mkdir linux_src
+cd linux_src
+apt-get source linux-image-`uname -r`-unsigned
+# build the NFS modules
+cd linux-*/fs/nfs
+make -C /lib/modules/`uname -r`/build M=$PWD -j8 modules
+make -C /lib/modules/`uname -r`/build M=$PWD modules_install
+# move the stock NFS modules out of the way
+mv /lib/modules/`uname -r`/kernel/fs/nfs ~/nfs_orig_`uname -r`
+depmod -a
+# unload the old modules and load the new ones
+rmmod nfsv3 nfs
+modprobe nfsv3
+```
+
+After this (relatively simple 🙂) manipulation, NFS starts to be mounted with the new
+wsize and rsize by default, and Vitastor-NFS linear write performance is fixed.
+
 ## Horizontal scaling

 The Linux NFS 3.0 client doesn't support built-in scaling or failover.

View File

@@ -162,10 +162,12 @@ apt-get install linux-headers-`uname -r`
 apt-get build-dep linux-image-`uname -r`-unsigned
 apt-get source linux-image-`uname -r`-unsigned
 cd linux*/drivers/vdpa
-make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules modules_install
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m modules_install
 cat Module.symvers >> /lib/modules/`uname -r`/build/Module.symvers
 cd ../virtio
-make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules modules_install
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m modules_install
 depmod -a
 ```

View File

@@ -165,10 +165,12 @@ apt-get install linux-headers-`uname -r`
 apt-get build-dep linux-image-`uname -r`-unsigned
 apt-get source linux-image-`uname -r`-unsigned
 cd linux*/drivers/vdpa
-make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules modules_install
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m modules_install
 cat Module.symvers >> /lib/modules/`uname -r`/build/Module.symvers
 cd ../virtio
-make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules modules_install
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m modules_install
 depmod -a
 ```

View File

@@ -253,7 +253,7 @@ function random_custom_combinations(osd_tree, rules, count, ordered)
         for (let i = 1; i < rules.length; i++)
         {
             const filtered = filter_tree_by_rules(osd_tree, rules[i], selected);
-            const idx = select_murmur3(filtered.length, i => 'p:'+f.id+':'+filtered[i].id);
+            const idx = select_murmur3(filtered.length, i => 'p:'+f.id+':'+(filtered[i].name || filtered[i].id));
             selected.push(idx == null ? { levels: {}, id: null } : filtered[idx]);
         }
         const size = selected.filter(s => s.id !== null).length;
@@ -270,7 +270,7 @@ function random_custom_combinations(osd_tree, rules, count, ordered)
         for (const item_rules of rules)
         {
             const filtered = selected.length ? filter_tree_by_rules(osd_tree, item_rules, selected) : first;
-            const idx = select_murmur3(filtered.length, i => n+':'+filtered[i].id);
+            const idx = select_murmur3(filtered.length, i => n+':'+(filtered[i].name || filtered[i].id));
             selected.push(idx == null ? { levels: {}, id: null } : filtered[idx]);
         }
         const size = selected.filter(s => s.id !== null).length;
@@ -340,9 +340,9 @@ function filter_tree_by_rules(osd_tree, rules, selected)
 }

 // Convert from
-// node_list = { id: string|number, level: string, size?: number, parent?: string|number }[]
+// node_list = { id: string|number, name?: string, level: string, size?: number, parent?: string|number }[]
 // to
-// node_tree = { [node_id]: { id, level, size?, parent?, children?: child_node_id[], levels: { [level]: id, ... } } }
+// node_tree = { [node_id]: { id, name?, level, size?, parent?, children?: child_node[], levels: { [level]: id, ... } } }
 function index_tree(node_list)
 {
     const tree = { '': { children: [], levels: {} } };
@@ -357,7 +357,7 @@ function index_tree(node_list)
         tree[parent_id].children = tree[parent_id].children || [];
         tree[parent_id].children.push(tree[node.id]);
     }
-    const cur = tree[''].children;
+    const cur = [ ...tree[''].children ];
     for (let i = 0; i < cur.length; i++)
     {
         cur[i].levels[cur[i].level] = cur[i].id;

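To make the updated `index_tree()` comment concrete, here is a rough illustration with hypothetical input data (the output shape is only sketched from the comment above, not verified against the implementation):

```js
// Input: flat node list with parent references (name is optional, see above)
const node_list = [
    { id: 'host1', level: 'host' },
    { id: 1, name: 'osd1', level: 'osd', size: 1, parent: 'host1' },
];
// index_tree(node_list) should then produce roughly:
// {
//     '':      { children: [ <host1> ], levels: {} },
//     'host1': { id: 'host1', level: 'host', children: [ <osd 1> ], levels: { host: 'host1' } },
//     '1':     { id: 1, name: 'osd1', level: 'osd', size: 1, parent: 'host1',
//                levels: { host: 'host1', osd: 1 } },
// }
```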
mon/lp_optimizer/fold.js (new file, 244 lines)
View File

@@ -0,0 +1,244 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)

// Extract OSDs from the lowest affected tree level into a separate (flat) map
// to run PG optimisation on failure domains instead of individual OSDs
//
// node_list = same input as for index_tree()
// rules = [ level, operator, value ][][]
// returns { nodes: new_node_list, leaves: { new_folded_node_id: [ extracted_leaf_nodes... ] } }
function fold_failure_domains(node_list, rules)
{
    const interest = {};
    for (const level_rules of rules)
    {
        for (const rule of level_rules)
            interest[rule[0]] = true;
    }
    const max_numeric_id = node_list.reduce((a, c) => a < (0|c.id) ? (0|c.id) : a, 0);
    let next_id = max_numeric_id;
    const node_map = node_list.reduce((a, c) => { a[c.id||''] = c; return a; }, {});
    const old_ids_by_new = {};
    const extracted_nodes = {};
    let folded = true;
    while (folded)
    {
        const per_parent = {};
        for (const node_id in node_map)
        {
            const node = node_map[node_id];
            const p = node.parent || '';
            per_parent[p] = per_parent[p]||[];
            per_parent[p].push(node);
        }
        folded = false;
        for (const node_id in per_parent)
        {
            const fold_node = node_id !== '' && per_parent[node_id].length > 0 &&
                per_parent[node_id].filter(child => per_parent[child.id||''] || interest[child.level]).length == 0;
            if (fold_node)
            {
                const old_node = node_map[node_id];
                const new_id = ++next_id;
                node_map[new_id] = {
                    ...old_node,
                    id: new_id,
                    name: node_id, // for use in murmur3 hashes
                    size: per_parent[node_id].reduce((a, c) => a + (Number(c.size)||0), 0),
                };
                delete node_map[node_id];
                old_ids_by_new[new_id] = node_id;
                extracted_nodes[new_id] = [];
                for (const child of per_parent[node_id])
                {
                    if (old_ids_by_new[child.id])
                    {
                        extracted_nodes[new_id].push(...extracted_nodes[child.id]);
                        delete extracted_nodes[child.id];
                    }
                    else
                        extracted_nodes[new_id].push(child);
                    delete node_map[child.id];
                }
                folded = true;
            }
        }
    }
    return { nodes: Object.values(node_map), leaves: extracted_nodes };
}

// Distribute PGs mapped to "folded" nodes to individual OSDs according to their weights
// folded_pgs = optimize_result.int_pgs before folding
// prev_pgs = optional previous PGs from optimize_change() input
// extracted_nodes = output from fold_failure_domains
function unfold_failure_domains(folded_pgs, prev_pgs, extracted_nodes)
{
    const maps = {};
    let found = false;
    for (const new_id in extracted_nodes)
    {
        const weights = {};
        for (const sub_node of extracted_nodes[new_id])
        {
            weights[sub_node.id] = sub_node.size;
        }
        maps[new_id] = { weights, prev: [], next: [], pos: 0 };
        found = true;
    }
    if (!found)
    {
        return folded_pgs;
    }
    for (let i = 0; i < folded_pgs.length; i++)
    {
        for (let j = 0; j < folded_pgs[i].length; j++)
        {
            if (maps[folded_pgs[i][j]])
            {
                maps[folded_pgs[i][j]].prev.push(prev_pgs && prev_pgs[i] && prev_pgs[i][j] || 0);
            }
        }
    }
    for (const new_id in maps)
    {
        maps[new_id].next = adjust_distribution(maps[new_id].weights, maps[new_id].prev);
    }
    const mapped_pgs = [];
    for (let i = 0; i < folded_pgs.length; i++)
    {
        mapped_pgs.push(folded_pgs[i].map(osd => (maps[osd] ? maps[osd].next[maps[osd].pos++] : osd)));
    }
    return mapped_pgs;
}

// Return the new array of items re-distributed as close as possible to weights in wanted_weights
// wanted_weights = { [key]: weight }
// cur_items = key[]
function adjust_distribution(wanted_weights, cur_items)
{
    const item_map = {};
    for (let i = 0; i < cur_items.length; i++)
    {
        const item = cur_items[i];
        item_map[item] = (item_map[item] || { target: 0, cur: [] });
        item_map[item].cur.push(i);
    }
    let total_weight = 0;
    for (const item in wanted_weights)
    {
        total_weight += Number(wanted_weights[item]) || 0;
    }
    for (const item in wanted_weights)
    {
        const weight = wanted_weights[item] / total_weight * cur_items.length;
        if (weight > 0)
        {
            item_map[item] = (item_map[item] || { target: 0, cur: [] });
            item_map[item].target = weight;
        }
    }
    const diff = (item) => (item_map[item].cur.length - item_map[item].target);
    const most_underweighted = Object.keys(item_map)
        .filter(item => item_map[item].target > 0)
        .sort((a, b) => diff(a) - diff(b));
    // Items with zero target weight MUST never be selected - remove them
    // and remap each of them to a most underweighted item
    for (const item in item_map)
    {
        if (!item_map[item].target)
        {
            const prev = item_map[item];
            delete item_map[item];
            for (const idx of prev.cur)
            {
                const move_to = most_underweighted[0];
                item_map[move_to].cur.push(idx);
                move_leftmost(most_underweighted, diff);
            }
        }
    }
    // Other over-weighted items are only moved if it improves the distribution
    while (most_underweighted.length > 1)
    {
        const first = most_underweighted[0];
        const last = most_underweighted[most_underweighted.length-1];
        const first_diff = diff(first);
        const last_diff = diff(last);
        if (Math.abs(first_diff+1)+Math.abs(last_diff-1) < Math.abs(first_diff)+Math.abs(last_diff))
        {
            item_map[first].cur.push(item_map[last].cur.pop());
            move_leftmost(most_underweighted, diff);
            move_rightmost(most_underweighted, diff);
        }
        else
        {
            break;
        }
    }
    const new_items = new Array(cur_items.length);
    for (const item in item_map)
    {
        for (const idx of item_map[item].cur)
        {
            new_items[idx] = item;
        }
    }
    return new_items;
}

function move_leftmost(sorted_array, diff)
{
    // Re-sort by moving the leftmost item to the right if it changes position
    const first = sorted_array[0];
    const new_diff = diff(first);
    let r = 0;
    while (r < sorted_array.length-1 && diff(sorted_array[r+1]) <= new_diff)
        r++;
    if (r > 0)
    {
        for (let i = 0; i < r; i++)
            sorted_array[i] = sorted_array[i+1];
        sorted_array[r] = first;
    }
}

function move_rightmost(sorted_array, diff)
{
    // Re-sort by moving the rightmost item to the left if it changes position
    const last = sorted_array[sorted_array.length-1];
    const new_diff = diff(last);
    let r = sorted_array.length-1;
    while (r > 0 && diff(sorted_array[r-1]) > new_diff)
        r--;
    if (r < sorted_array.length-1)
    {
        for (let i = sorted_array.length-1; i > r; i--)
            sorted_array[i] = sorted_array[i-1];
        sorted_array[r] = last;
    }
}

// Map previous PGs to folded nodes
function fold_prev_pgs(pgs, extracted_nodes)
{
    const unmap = {};
    for (const new_id in extracted_nodes)
    {
        for (const sub_node of extracted_nodes[new_id])
        {
            unmap[sub_node.id] = new_id;
        }
    }
    const mapped_pgs = [];
    for (let i = 0; i < pgs.length; i++)
    {
        mapped_pgs.push(pgs[i].map(osd => (unmap[osd] || osd)));
    }
    return mapped_pgs;
}

module.exports = {
    fold_failure_domains,
    unfold_failure_domains,
    adjust_distribution,
    fold_prev_pgs,
};

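A small usage sketch of the new module (hypothetical data; the `require` path assumes the repository layout shown above). Two hosts with two OSDs each are folded into two synthetic host-level nodes, and a PG mapping produced on those nodes is then unfolded back to individual OSDs:

```js
const { fold_failure_domains, unfold_failure_domains } = require('./mon/lp_optimizer/fold.js');

// Two hosts with two OSDs each; the only rule level is 'host',
// so whole hosts can be folded into single nodes
const nodes = [
    { id: 1, level: 'osd', size: 1, parent: 'host1' },
    { id: 2, level: 'osd', size: 1, parent: 'host1' },
    { id: 3, level: 'osd', size: 1, parent: 'host2' },
    { id: 4, level: 'osd', size: 1, parent: 'host2' },
    { id: 'host1', level: 'host' },
    { id: 'host2', level: 'host' },
];
const folded = fold_failure_domains(nodes, [ [ [ 'host' ] ] ]);
// folded.nodes now contains two synthetic nodes of size 2 (ids 5 and 6,
// numbered after the largest numeric OSD id), and folded.leaves maps each
// synthetic id back to the original OSD objects

// Suppose the optimiser mapped two PGs to the folded ids:
const folded_pgs = [ [ 5, 6 ], [ 6, 5 ] ];
// Each folded id is replaced by one of its OSDs according to their weights:
console.log(unfold_failure_domains(folded_pgs, null, folded.leaves));
```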
View File

@@ -98,6 +98,7 @@ async function optimize_initial({ osd_weights, combinator, pg_count, pg_size = 3
         score: lp_result.score,
         weights: lp_result.vars,
         int_pgs,
+        pg_effsize,
         space: eff * pg_effsize,
         total_space: total_weight,
     };
@@ -409,6 +410,7 @@ async function optimize_change({ prev_pgs: prev_int_pgs, osd_weights, combinator
         int_pgs: new_pgs,
         differs,
         osd_differs,
+        pg_effsize,
         space: pg_effsize * pg_list_space_efficiency(new_pgs, osd_weights, pg_minsize, parity_space),
         total_space: total_weight,
     };

View File

@@ -0,0 +1,108 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)

const assert = require('assert');

const { fold_failure_domains, unfold_failure_domains, adjust_distribution } = require('./fold.js');
const DSL = require('./dsl_pgs.js');
const LPOptimizer = require('./lp_optimizer.js');
const stableStringify = require('../stable-stringify.js');

async function run()
{
    // Test run adjust_distribution
    console.log('adjust_distribution');
    const rand = [];
    for (let i = 0; i < 100; i++)
    {
        rand.push(1 + Math.floor(10*Math.random()));
        // or rand.push(0);
    }
    const adj = adjust_distribution({ 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1 }, rand);
    //console.log(rand.join(' '));
    console.log(rand.reduce((a, c) => { a[c] = (a[c]||0)+1; return a; }, {}));
    //console.log(adj.join(' '));
    console.log(adj.reduce((a, c) => { a[c] = (a[c]||0)+1; return a; }, {}));
    console.log('Movement: '+rand.reduce((a, c, i) => a+(rand[i] != adj[i] ? 1 : 0), 0)+'/'+rand.length);
    console.log('\nfold_failure_domains');
    console.log(JSON.stringify(fold_failure_domains(
        [
            { id: 1, level: 'osd', size: 1, parent: 'disk1' },
            { id: 2, level: 'osd', size: 2, parent: 'disk1' },
            { id: 'disk1', level: 'disk', parent: 'host1' },
            { id: 'host1', level: 'host', parent: 'dc1' },
            { id: 'dc1', level: 'dc' },
        ],
        [ [ [ 'dc' ], [ 'host' ] ] ]
    ), 0, 2));
    console.log('\nfold_failure_domains empty rules');
    console.log(JSON.stringify(fold_failure_domains(
        [
            { id: 1, level: 'osd', size: 1, parent: 'disk1' },
            { id: 2, level: 'osd', size: 2, parent: 'disk1' },
            { id: 'disk1', level: 'disk', parent: 'host1' },
            { id: 'host1', level: 'host', parent: 'dc1' },
            { id: 'dc1', level: 'dc' },
        ],
        []
    ), 0, 2));
    console.log('\noptimize_folded');
    // 5 DCs, 2 hosts per DC, 10 OSD per host
    const nodes = [];
    for (let i = 1; i <= 100; i++)
    {
        nodes.push({ id: i, level: 'osd', size: 1, parent: 'host'+(1+(0|((i-1)/10))) });
    }
    for (let i = 1; i <= 10; i++)
    {
        nodes.push({ id: 'host'+i, level: 'host', parent: 'dc'+(1+(0|((i-1)/2))) });
    }
    for (let i = 1; i <= 5; i++)
    {
        nodes.push({ id: 'dc'+i, level: 'dc' });
    }
    // Check rules
    const rules = DSL.parse_level_indexes({ dc: '112233', host: '123456' }, [ 'dc', 'host', 'osd' ]);
    assert.deepEqual(rules, [[],[["dc","=",1],["host","!=",[1]]],[["dc","!=",[1]]],[["dc","=",3],["host","!=",[3]]],[["dc","!=",[1,3]]],[["dc","=",5],["host","!=",[5]]]]);
    // Check tree folding
    const { nodes: folded_nodes, leaves: folded_leaves } = fold_failure_domains(nodes, rules);
    const expected_folded = [];
    const expected_leaves = {};
    for (let i = 1; i <= 10; i++)
    {
        expected_folded.push({ id: 100+i, name: 'host'+i, level: 'host', size: 10, parent: 'dc'+(1+(0|((i-1)/2))) });
        expected_leaves[100+i] = [ ...new Array(10).keys() ].map(k => ({ id: 10*(i-1)+k+1, level: 'osd', size: 1, parent: 'host'+i }));
    }
    for (let i = 1; i <= 5; i++)
    {
        expected_folded.push({ id: 'dc'+i, level: 'dc' });
    }
    assert.equal(stableStringify(folded_nodes), stableStringify(expected_folded));
    assert.equal(stableStringify(folded_leaves), stableStringify(expected_leaves));
    // Now optimise it
    console.log('1000 PGs, EC 112233');
    const leaf_weights = folded_nodes.reduce((a, c) => { if (Number(c.id)) { a[c.id] = c.size; } return a; }, {});
    let res = await LPOptimizer.optimize_initial({
        osd_weights: leaf_weights,
        combinator: new DSL.RuleCombinator(folded_nodes, rules, 10000, false),
        pg_size: 6,
        pg_count: 1000,
        ordered: false,
    });
    LPOptimizer.print_change_stats(res, false);
    assert.equal(res.space, 100, 'Initial distribution');
    const unfolded_res = { ...res };
    unfolded_res.int_pgs = unfold_failure_domains(res.int_pgs, null, folded_leaves);
    const osd_weights = nodes.reduce((a, c) => { if (Number(c.id)) { a[c.id] = c.size; } return a; }, {});
    unfolded_res.space = unfolded_res.pg_effsize * LPOptimizer.pg_list_space_efficiency(unfolded_res.int_pgs, osd_weights, 0, 1);
    LPOptimizer.print_change_stats(unfolded_res, false);
    assert.equal(res.space, 100, 'Initial distribution');
}

run().catch(console.error);

View File

@@ -15,7 +15,7 @@ function get_osd_tree(global_config, state)
         const stat = state.osd.stats[osd_num];
         const osd_cfg = state.config.osd[osd_num];
         let reweight = osd_cfg == null ? 1 : Number(osd_cfg.reweight);
-        if (reweight < 0 || isNaN(reweight))
+        if (isNaN(reweight) || reweight < 0 || reweight > 1)
             reweight = 1;
         if (stat && stat.size && reweight && (state.osd.state[osd_num] || Number(stat.time) >= down_time ||
             osd_cfg && osd_cfg.noout))
@@ -87,7 +87,7 @@ function make_hier_tree(global_config, tree)
     tree[''] = { children: [] };
     for (const node_id in tree)
     {
-        if (node_id === '' || tree[node_id].level === 'osd' && (!tree[node_id].size || tree[node_id].size <= 0))
+        if (node_id === '' || !(tree[node_id].children||[]).length && (tree[node_id].size||0) <= 0)
         {
             continue;
         }
@@ -107,10 +107,10 @@ function make_hier_tree(global_config, tree)
         deleted = 0;
         for (const node_id in tree)
         {
-            if (tree[node_id].level !== 'osd' && (!tree[node_id].children || !tree[node_id].children.length))
+            if (!(tree[node_id].children||[]).length && (tree[node_id].size||0) <= 0)
             {
                 const parent = tree[node_id].parent;
-                if (parent)
+                if (parent && tree[parent])
                 {
                     tree[parent].children = tree[parent].children.filter(c => c != tree[node_id]);
                 }

View File

@@ -3,6 +3,7 @@
 const { RuleCombinator } = require('./lp_optimizer/dsl_pgs.js');
 const { SimpleCombinator, flatten_tree } = require('./lp_optimizer/simple_pgs.js');
+const { fold_failure_domains, unfold_failure_domains, fold_prev_pgs } = require('./lp_optimizer/fold.js');
 const { validate_pool_cfg, get_pg_rules } = require('./pool_config.js');
 const LPOptimizer = require('./lp_optimizer/lp_optimizer.js');
 const { scale_pg_count } = require('./pg_utils.js');
@@ -160,7 +161,6 @@ async function generate_pool_pgs(state, global_config, pool_id, osd_tree, levels
         pool_cfg.bitmap_granularity || global_config.bitmap_granularity || 4096,
         pool_cfg.immediate_commit || global_config.immediate_commit || 'all'
     );
-    pool_tree = make_hier_tree(global_config, pool_tree);
     // First try last_clean_pgs to minimize data movement
     let prev_pgs = [];
     for (const pg in ((state.history.last_clean_pgs.items||{})[pool_id]||{}))
@@ -175,14 +175,19 @@ async function generate_pool_pgs(state, global_config, pool_id, osd_tree, levels
             prev_pgs[pg-1] = [ ...state.pg.config.items[pool_id][pg].osd_set ];
         }
     }
+    const use_rules = !global_config.use_old_pg_combinator || pool_cfg.level_placement || pool_cfg.raw_placement;
+    const rules = use_rules ? get_pg_rules(pool_id, pool_cfg, global_config.placement_levels) : null;
+    const folded = fold_failure_domains(Object.values(pool_tree), use_rules ? rules : [ [ [ pool_cfg.failure_domain ] ] ]);
+    // FIXME: Remove/merge make_hier_tree() step somewhere, however it's needed to remove empty nodes
+    const folded_tree = make_hier_tree(global_config, folded.nodes);
     const old_pg_count = prev_pgs.length;
     const optimize_cfg = {
-        osd_weights: Object.values(pool_tree).filter(item => item.level === 'osd').reduce((a, c) => { a[c.id] = c.size; return a; }, {}),
-        combinator: !global_config.use_old_pg_combinator || pool_cfg.level_placement || pool_cfg.raw_placement
+        osd_weights: folded.nodes.reduce((a, c) => { if (Number(c.id)) { a[c.id] = c.size; } return a; }, {}),
+        combinator: use_rules
             // new algorithm:
-            ? new RuleCombinator(pool_tree, get_pg_rules(pool_id, pool_cfg, global_config.placement_levels), pool_cfg.max_osd_combinations)
+            ? new RuleCombinator(folded_tree, rules, pool_cfg.max_osd_combinations)
             // old algorithm:
-            : new SimpleCombinator(flatten_tree(pool_tree[''].children, levels, pool_cfg.failure_domain, 'osd'), pool_cfg.pg_size, pool_cfg.max_osd_combinations),
+            : new SimpleCombinator(flatten_tree(folded_tree[''].children, levels, pool_cfg.failure_domain, 'osd'), pool_cfg.pg_size, pool_cfg.max_osd_combinations),
         pg_count: pool_cfg.pg_count,
         pg_size: pool_cfg.pg_size,
         pg_minsize: pool_cfg.pg_minsize,
@@ -202,12 +207,11 @@ async function generate_pool_pgs(state, global_config, pool_id, osd_tree, levels
         for (const pg of prev_pgs)
         {
             while (pg.length < pool_cfg.pg_size)
-            {
                 pg.push(0);
-            }
         }
+        const folded_prev_pgs = fold_prev_pgs(prev_pgs, folded.leaves);
         optimize_result = await LPOptimizer.optimize_change({
-            prev_pgs,
+            prev_pgs: folded_prev_pgs,
             ...optimize_cfg,
         });
     }
@@ -215,6 +219,10 @@ async function generate_pool_pgs(state, global_config, pool_id, osd_tree, levels
     {
         optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
     }
+    optimize_result.int_pgs = unfold_failure_domains(optimize_result.int_pgs, prev_pgs, folded.leaves);
+    const osd_weights = Object.values(pool_tree).reduce((a, c) => { if (c.level === 'osd') { a[c.id] = c.size; } return a; }, {});
+    optimize_result.space = optimize_result.pg_effsize * LPOptimizer.pg_list_space_efficiency(optimize_result.int_pgs,
+        osd_weights, optimize_cfg.pg_minsize, 1);
     console.log(`Pool ${pool_id} (${pool_cfg.name || 'unnamed'}):`);
     LPOptimizer.print_change_stats(optimize_result);
     let pg_effsize = pool_cfg.pg_size;

View File

@@ -40,6 +40,11 @@ async function run()
         console.log("/etc/systemd/system/vitastor-etcd.service already exists");
         process.exit(1);
     }
+    if (!in_docker && fs.existsSync("/etc/systemd/system/etcd.service"))
+    {
+        console.log("/etc/systemd/system/etcd.service already exists");
+        process.exit(1);
+    }
     const config = JSON.parse(fs.readFileSync(config_path, { encoding: 'utf-8' }));
     if (!config.etcd_address)
     {
@@ -66,7 +71,7 @@ async function run()
         console.log('etcd for Vitastor configured. Run `systemctl enable --now vitastor-etcd` to start etcd');
         process.exit(0);
     }
-    await system(`mkdir -p /var/lib/etcd`);
+    await system(`mkdir -p /var/lib/etcd/vitastor`);
     fs.writeFileSync(
         "/etc/systemd/system/vitastor-etcd.service",
         `[Unit]
@@ -77,14 +82,14 @@ Wants=network-online.target local-fs.target time-sync.target
 [Service]
 Restart=always
 Environment=GOGC=50
-ExecStart=etcd --name ${etcd_name} --data-dir /var/lib/etcd \\
+ExecStart=etcd --name ${etcd_name} --data-dir /var/lib/etcd/vitastor \\
     --snapshot-count 10000 --advertise-client-urls http://${etcds[num]}:2379 --listen-client-urls http://${etcds[num]}:2379 \\
     --initial-advertise-peer-urls http://${etcds[num]}:2380 --listen-peer-urls http://${etcds[num]}:2380 \\
    --initial-cluster-token vitastor-etcd-1 --initial-cluster ${etcd_cluster} \\
     --initial-cluster-state new --max-txn-ops=100000 --max-request-bytes=104857600 \\
     --auto-compaction-retention=10 --auto-compaction-mode=revision
-WorkingDirectory=/var/lib/etcd
-ExecStartPre=+chown -R etcd /var/lib/etcd
+WorkingDirectory=/var/lib/etcd/vitastor
+ExecStartPre=+chown -R etcd /var/lib/etcd/vitastor
 User=etcd
 PrivateTmp=false
 TasksMax=infinity
@@ -97,8 +102,9 @@ WantedBy=multi-user.target
 `);
     await system(`useradd etcd`);
     await system(`systemctl daemon-reload`);
-    await system(`systemctl enable etcd`);
-    await system(`systemctl start etcd`);
+    // Disable distribution etcd unit and enable our one
+    await system(`systemctl disable --now etcd`);
+    await system(`systemctl enable --now vitastor-etcd`);
     process.exit(0);
 }

View File

@@ -87,11 +87,25 @@ function sum_op_stats(all_osd, prev_stats)
                 for (const k in derived[type][op])
                 {
                     sum_diff[type][op] = sum_diff[type][op] || {};
-                    sum_diff[type][op][k] = (sum_diff[type][op][k] || 0n) + derived[type][op][k];
+                    if (k == 'lat')
+                        sum_diff[type][op].lat = (sum_diff[type][op].lat || 0n) + derived[type][op].lat*derived[type][op].iops;
+                    else
+                        sum_diff[type][op][k] = (sum_diff[type][op][k] || 0n) + derived[type][op][k];
                 }
             }
         }
     }
+    // Calculate average (weighted by iops) op latency across all OSDs
+    for (const type in sum_diff)
+    {
+        for (const op in sum_diff[type])
+        {
+            if (sum_diff[type][op].lat)
+            {
+                sum_diff[type][op].lat /= sum_diff[type][op].iops;
+            }
+        }
+    }
     return sum_diff;
 }

@@ -271,8 +285,7 @@ function sum_inode_stats(state, prev_stats)
                 const op_st = inode_stats[pool_id][inode_num][op];
                 op_st.bps += op_diff.bps;
                 op_st.iops += op_diff.iops;
-                op_st.lat += op_diff.lat;
-                op_st.n_osd = (op_st.n_osd || 0) + 1;
+                op_st.lat += op_diff.lat*op_diff.iops;
             }
         }
     }
@@ -285,11 +298,8 @@ function sum_inode_stats(state, prev_stats)
             for (const op of [ 'read', 'write', 'delete' ])
             {
                 const op_st = inode_stats[pool_id][inode_num][op];
-                if (op_st.n_osd)
-                {
-                    op_st.lat /= BigInt(op_st.n_osd);
-                    delete op_st.n_osd;
-                }
+                if (op_st.lat)
+                    op_st.lat /= op_st.iops;
                 if (op_st.bps > 0 || op_st.iops > 0)
                     nonzero = true;
             }

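The stats hunks above switch latency aggregation from a plain per-OSD mean to an iops-weighted mean. A standalone sketch of the idea (hypothetical numbers, BigInt values as in the monitor code):

```js
// Weighted-mean latency: each source contributes in proportion to the
// number of operations it actually served during the interval.
function weighted_lat(sources)
{
    let lat_sum = 0n, iops_sum = 0n;
    for (const { lat, iops } of sources)
    {
        lat_sum += lat*iops; // accumulate lat*iops instead of plain lat
        iops_sum += iops;
    }
    return iops_sum > 0n ? lat_sum/iops_sum : 0n;
}

// A busy OSD (10000 iops at 100 us) should dominate a nearly idle one
// (10 iops at 5000 us): the unweighted mean would report 2550 us,
// while the weighted mean stays near 105 us (prints 104n here).
console.log(weighted_lat([ { lat: 100n, iops: 10000n }, { lat: 5000n, iops: 10n } ]));
```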
View File

@@ -266,6 +266,8 @@ class blockstore_impl_t
     int throttle_threshold_us = 50;
     // Maximum writes between automatically added fsync operations
     uint64_t autosync_writes = 128;
+    // Log level (0-10)
+    int log_level = 0;
     /******* END OF OPTIONS *******/

     struct ring_consumer_t ring_consumer;

View File

@@ -113,10 +113,13 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
     if (!right_dir && next_pos >= bs->journal.used_start-bs->journal.block_size)
     {
         // No space in the journal. Wait until used_start changes.
-        printf(
-            "Ran out of journal space (used_start=%08jx, next_free=%08jx, dirty_start=%08jx)\n",
-            bs->journal.used_start, bs->journal.next_free, bs->journal.dirty_start
-        );
+        if (bs->log_level > 5)
+        {
+            printf(
+                "Ran out of journal space (used_start=%08jx, next_free=%08jx, dirty_start=%08jx)\n",
+                bs->journal.used_start, bs->journal.next_free, bs->journal.dirty_start
+            );
+        }
         PRIV(op)->wait_for = WAIT_JOURNAL;
         bs->flusher->request_trim();
         PRIV(op)->wait_detail = bs->journal.used_start;

View File

@@ -101,6 +101,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config, bool init)
         config["journal_no_same_sector_overwrites"] == "1" || config["journal_no_same_sector_overwrites"] == "yes";
     journal.inmemory = config["inmemory_journal"] != "false" && config["inmemory_journal"] != "0" &&
         config["inmemory_journal"] != "no";
+    log_level = strtoull(config["log_level"].c_str(), NULL, 10);
     // Validate
     if (journal.sector_count < 2)
     {

View File

@@ -93,7 +93,7 @@ add_executable(test_cluster_client
     EXCLUDE_FROM_ALL
     ../test/test_cluster_client.cpp
     pg_states.cpp osd_ops.cpp cluster_client.cpp cluster_client_list.cpp cluster_client_wb.cpp msgr_op.cpp ../test/mock/messenger.cpp msgr_stop.cpp
-    etcd_state_client.cpp ../util/timerfd_manager.cpp ../util/str_util.cpp ../util/json_util.cpp ../../json11/json11.cpp
+    etcd_state_client.cpp ../util/timerfd_manager.cpp ../util/addr_util.cpp ../util/str_util.cpp ../util/json_util.cpp ../../json11/json11.cpp
 )
 target_compile_definitions(test_cluster_client PUBLIC -D__MOCK__)
 target_include_directories(test_cluster_client BEFORE PUBLIC ${CMAKE_SOURCE_DIR}/src/test/mock)

View File

@@ -1244,7 +1244,6 @@ int cluster_client_t::try_send(cluster_op_t *op, int i)
         .req = { .rw = {
             .header = {
                 .magic = SECONDARY_OSD_OP_MAGIC,
-                .id = next_op_id(),
                 .opcode = op->opcode == OSD_OP_READ_BITMAP || op->opcode == OSD_OP_READ_CHAIN_BITMAP ? OSD_OP_READ : op->opcode,
             },
             .inode = op->cur_inode,
@@ -1353,7 +1352,6 @@ void cluster_client_t::send_sync(cluster_op_t *op, cluster_op_part_t *part)
         .req = {
             .hdr = {
                 .magic = SECONDARY_OSD_OP_MAGIC,
-                .id = next_op_id(),
                 .opcode = OSD_OP_SYNC,
             },
         },
@@ -1498,8 +1496,3 @@ void cluster_client_t::copy_part_bitmap(cluster_op_t *op, cluster_op_part_t *par
         part_len--;
     }
 }
-
-uint64_t cluster_client_t::next_op_id()
-{
-    return msgr.next_subop_id++;
-}

View File

@@ -86,8 +86,8 @@ class cluster_client_t
 #ifdef __MOCK__
 public:
 #endif
-    timerfd_manager_t *tfd;
-    ring_loop_t *ringloop;
+    timerfd_manager_t *tfd = NULL;
+    ring_loop_t *ringloop = NULL;

     std::map<pool_id_t, uint64_t> pg_counts;
     std::map<pool_pg_num_t, osd_num_t> pg_primary;
@@ -152,7 +152,6 @@ public:
     //inline uint32_t get_bs_bitmap_granularity() { return st_cli.global_bitmap_granularity; }
     //inline uint64_t get_bs_block_size() { return st_cli.global_block_size; }
-    uint64_t next_op_id();

 #ifndef __MOCK__
 protected:

View File

@@ -342,7 +342,6 @@ void cluster_client_t::send_list(inode_list_osd_t *cur_list)
         .sec_list = {
             .header = {
                 .magic = SECONDARY_OSD_OP_MAGIC,
-                .id = next_op_id(),
                 .opcode = OSD_OP_SEC_LIST,
             },
             .list_pg = cur_list->pg->pg_num,

View File

@@ -167,6 +167,10 @@ void osd_messenger_t::init()
         }
     }
 #endif
+    if (ringloop)
+    {
+        has_sendmsg_zc = ringloop->has_sendmsg_zc();
+    }
     if (ringloop && iothread_count > 0)
     {
         for (int i = 0; i < iothread_count; i++)
@@ -213,7 +217,6 @@ void osd_messenger_t::init()
         op->req = (osd_any_op_t){
             .hdr = {
                 .magic = SECONDARY_OSD_OP_MAGIC,
-                .id = this->next_subop_id++,
                 .opcode = OSD_OP_PING,
             },
         };
@@ -329,6 +332,9 @@ void osd_messenger_t::parse_config(const json11::Json & config)
         this->receive_buffer_size = 65536;
     this->use_sync_send_recv = config["use_sync_send_recv"].bool_value() ||
         config["use_sync_send_recv"].uint64_value();
+    this->min_zerocopy_send_size = config["min_zerocopy_send_size"].is_null()
+        ? DEFAULT_MIN_ZEROCOPY_SEND_SIZE
+        : (int)config["min_zerocopy_send_size"].int64_value();
     this->peer_connect_interval = config["peer_connect_interval"].uint64_value();
     if (!this->peer_connect_interval)
         this->peer_connect_interval = 5;
@@ -622,13 +628,19 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
         .show_conf = {
             .header = {
                 .magic = SECONDARY_OSD_OP_MAGIC,
-                .id = this->next_subop_id++,
                 .opcode = OSD_OP_SHOW_CONFIG,
             },
         },
     };
+    json11::Json::object payload;
+    if (osd_num)
+    {
+        // Inform that we're OSD <osd_num>
+        payload["osd_num"] = osd_num;
+    }
+    payload["features"] = json11::Json::object{ { "check_sequencing", true } };
 #ifdef WITH_RDMA
-    if (rdma_contexts.size())
+    if (!use_rdmacm && rdma_contexts.size())
     {
         // Choose the right context for the selected network
         msgr_rdma_context_t *selected_ctx = choose_rdma_context(cl);
@@ -642,19 +654,20 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
             cl->rdma_conn = msgr_rdma_connection_t::create(selected_ctx, rdma_max_send, rdma_max_recv, rdma_max_sge, rdma_max_msg);
             if (cl->rdma_conn)
             {
-                json11::Json payload = json11::Json::object {
-                    { "connect_rdma", cl->rdma_conn->addr.to_string() },
-                    { "rdma_max_msg", cl->rdma_conn->max_msg },
-                };
-                std::string payload_str = payload.dump();
-                op->req.show_conf.json_len = payload_str.size();
-                op->buf = malloc_or_die(payload_str.size());
-                op->iov.push_back(op->buf, payload_str.size());
-                memcpy(op->buf, payload_str.c_str(), payload_str.size());
+                payload["connect_rdma"] = cl->rdma_conn->addr.to_string();
+                payload["rdma_max_msg"] = cl->rdma_conn->max_msg;
             }
         }
     }
 #endif
+    if (payload.size())
+    {
+        std::string payload_str = json11::Json(payload).dump();
+        op->req.show_conf.json_len = payload_str.size();
+        op->buf = malloc_or_die(payload_str.size());
+        op->iov.push_back(op->buf, payload_str.size());
+        memcpy(op->buf, payload_str.c_str(), payload_str.size());
+    }
     op->callback = [this, cl](osd_op_t *op)
     {
         std::string json_err;
@@ -701,7 +714,7 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
             return;
         }
 #ifdef WITH_RDMA
-        if (cl->rdma_conn && config["rdma_address"].is_string())
+        if (!use_rdmacm && cl->rdma_conn && config["rdma_address"].is_string())
         {
             msgr_rdma_address_t addr;
             if (!msgr_rdma_address_t::from_string(config["rdma_address"].string_value().c_str(), &addr) ||
@@ -800,7 +813,8 @@ bool osd_messenger_t::is_rdma_enabled()
 {
     return rdma_contexts.size() > 0;
 }
 #endif
+#ifdef WITH_RDMACM
 bool osd_messenger_t::is_use_rdmacm()
 {
     return use_rdmacm;
@@ -896,6 +910,7 @@ static const char* local_only_params[] = {
     "tcp_header_buffer_size",
     "use_rdma",
     "use_sync_send_recv",
+    "min_zerocopy_send_size",
 };
 static const char **local_only_end = local_only_params + (sizeof(local_only_params)/sizeof(local_only_params[0]));
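
A note on the new option: parse_config() above defaults min_zerocopy_send_size to DEFAULT_MIN_ZEROCOPY_SEND_SIZE (32 KiB). A minimal usage sketch, assuming an already-constructed osd_messenger_t named msgr (the 64 KiB value is purely illustrative):

    #include "json11/json11.hpp"

    // Semantics per the diffs in this changeset:
    //   < 0 - never use IORING_OP_SENDMSG_ZC
    //   = 0 - use it for every send batch
    //   > 0 - use it only when the average iovec size reaches the threshold
    json11::Json zc_config()
    {
        return json11::Json::object{
            { "min_zerocopy_send_size", 64*1024 }, // illustrative value
        };
    }
    // Later: msgr.parse_config(zc_config());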

View File

@@ -32,6 +32,8 @@
 #define VITASTOR_CONFIG_PATH "/etc/vitastor/vitastor.conf"
 
+#define DEFAULT_MIN_ZEROCOPY_SEND_SIZE 32*1024
+
 #define MSGR_SENDP_HDR 1
 #define MSGR_SENDP_FREE 2
@@ -73,12 +75,15 @@ struct osd_client_t
     int read_remaining = 0;
     int read_state = 0;
     osd_op_buf_list_t recv_list;
+    uint64_t read_op_id = 1;
+    bool check_sequencing = false;
 
     // Incoming operations
     std::vector<osd_op_t*> received_ops;
 
     // Outbound operations
     std::map<uint64_t, osd_op_t*> sent_ops;
+    uint64_t send_op_id = 0;
 
     // PGs dirtied by this client's primary-writes
     std::set<pool_pg_num_t> dirty_pgs;
@@ -88,6 +93,7 @@ struct osd_client_t
     int write_state = 0;
     std::vector<iovec> send_list, next_send_list;
     std::vector<msgr_sendp_t> outbox, next_outbox;
+    std::vector<osd_op_t*> zc_free_list;
 
     ~osd_client_t();
 };
@@ -97,6 +103,7 @@ struct osd_wanted_peer_t
     json11::Json raw_address_list;
     json11::Json address_list;
     int port = 0;
+    // FIXME: Remove separate WITH_RDMACM?
 #ifdef WITH_RDMACM
     int rdmacm_port = 0;
 #endif
@@ -175,6 +182,7 @@ protected:
     int osd_ping_timeout = 0;
     int log_level = 0;
     bool use_sync_send_recv = false;
+    int min_zerocopy_send_size = DEFAULT_MIN_ZEROCOPY_SEND_SIZE;
     int iothread_count = 0;
 
 #ifdef WITH_RDMA
@@ -201,11 +209,11 @@ protected:
     std::vector<osd_op_t*> set_immediate_ops;
 
 public:
-    timerfd_manager_t *tfd;
-    ring_loop_t *ringloop;
+    timerfd_manager_t *tfd = NULL;
+    ring_loop_t *ringloop = NULL;
+    bool has_sendmsg_zc = false;
     // osd_num_t is only for logging and asserts
     osd_num_t osd_num;
-    uint64_t next_subop_id = 1;
     std::map<int, osd_client_t*> clients;
     std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
     std::map<uint64_t, int> osd_peer_fds;
@@ -261,7 +269,7 @@ protected:
     void cancel_op(osd_op_t *op);
 
     bool try_send(osd_client_t *cl);
-    void handle_send(int result, osd_client_t *cl);
+    void handle_send(int result, bool prev, bool more, osd_client_t *cl);
     bool handle_read(int result, osd_client_t *cl);
     bool handle_read_buffer(osd_client_t *cl, void *curbuf, int remain);
@@ -286,6 +294,7 @@ protected:
     msgr_rdma_context_t* rdmacm_create_qp(rdma_cm_id *cmid);
     void rdmacm_accept(rdma_cm_event *ev);
     void rdmacm_try_connect_peer(uint64_t peer_osd, const std::string & addr, int rdmacm_port, int fallback_tcp_port);
+    void rdmacm_set_conn_timeout(rdmacm_connecting_t *conn);
     void rdmacm_on_connect_peer_error(rdma_cm_id *cmid, int res);
     void rdmacm_address_resolved(rdma_cm_event *ev);
     void rdmacm_route_resolved(rdma_cm_event *ev);

View File

@@ -70,6 +70,7 @@ msgr_rdma_context_t::~msgr_rdma_context_t()
 msgr_rdma_connection_t::~msgr_rdma_connection_t()
 {
     ctx->reserve_cqe(-max_send-max_recv);
+#ifdef WITH_RDMACM
     if (qp && !cmid)
         ibv_destroy_qp(qp);
     if (cmid)
@@ -79,6 +80,10 @@ msgr_rdma_connection_t::~msgr_rdma_connection_t()
         rdma_destroy_qp(cmid);
         rdma_destroy_id(cmid);
     }
+#else
+    if (qp)
+        ibv_destroy_qp(qp);
+#endif
     if (recv_buffers.size())
     {
         for (auto b: recv_buffers)
@@ -798,6 +803,9 @@ void osd_messenger_t::handle_rdma_events(msgr_rdma_context_t *rdma_context)
             }
             if (!is_send)
             {
+                // Reset OSD ping state - client is obviously alive
+                cl->ping_time_remaining = 0;
+                cl->idle_time_remaining = osd_idle_timeout;
                 rc->cur_recv--;
                 if (!handle_read_buffer(cl, rc->recv_buffers[rc->next_recv_buf].buf, wc[i].byte_len))
                 {

View File

@@ -70,7 +70,7 @@ void osd_messenger_t::rdmacm_destroy_listener(rdma_cm_id *listener)
 
 void osd_messenger_t::handle_rdmacm_events()
 {
-    // rdma_destroy_id infinitely waits for pthread_cond if called before all events are acked :-(
+    // rdma_destroy_id infinitely waits for pthread_cond if called before all events are acked :-(...
     std::vector<rdma_cm_event> events_copy;
     while (1)
     {
@@ -83,7 +83,15 @@ void osd_messenger_t::handle_rdmacm_events()
             fprintf(stderr, "Failed to get RDMA-CM event: %s (code %d)\n", strerror(errno), errno);
             exit(1);
         }
-        events_copy.push_back(*ev);
+        // ...so we save a copy of all events EXCEPT connection requests, otherwise they sometimes fail with EVENT_DISCONNECT
+        if (ev->event == RDMA_CM_EVENT_CONNECT_REQUEST)
+        {
+            rdmacm_accept(ev);
+        }
+        else
+        {
+            events_copy.push_back(*ev);
+        }
         r = rdma_ack_cm_event(ev);
         if (r != 0)
         {
@@ -96,7 +104,7 @@ void osd_messenger_t::handle_rdmacm_events()
         auto ev = &evl;
         if (ev->event == RDMA_CM_EVENT_CONNECT_REQUEST)
         {
-            rdmacm_accept(ev);
+            // Do nothing, handled above
         }
         else if (ev->event == RDMA_CM_EVENT_CONNECT_ERROR ||
             ev->event == RDMA_CM_EVENT_REJECTED ||
@@ -287,29 +295,34 @@ void osd_messenger_t::rdmacm_accept(rdma_cm_event *ev)
         rdma_destroy_id(ev->id);
        return;
     }
-    rdma_context->cm_refs++;
-    // Wrap into a new msgr_rdma_connection_t
-    msgr_rdma_connection_t *conn = new msgr_rdma_connection_t;
-    conn->ctx = rdma_context;
-    conn->max_send = rdma_max_send;
-    conn->max_recv = rdma_max_recv;
-    conn->max_sge = rdma_max_sge > rdma_context->attrx.orig_attr.max_sge
-        ? rdma_context->attrx.orig_attr.max_sge : rdma_max_sge;
-    conn->max_msg = rdma_max_msg;
+    // Wait for RDMA_CM_ESTABLISHED, and enable the connection only after it
+    auto conn = new rdmacm_connecting_t;
     conn->cmid = ev->id;
-    conn->qp = ev->id->qp;
-    auto cl = new osd_client_t();
-    cl->peer_fd = fake_fd;
-    cl->peer_state = PEER_RDMA;
-    cl->peer_addr = *(sockaddr_storage*)rdma_get_peer_addr(ev->id);
-    cl->in_buf = malloc_or_die(receive_buffer_size);
-    cl->rdma_conn = conn;
-    clients[fake_fd] = cl;
-    rdmacm_connections[ev->id] = cl;
-    // Add initial receive request(s)
-    try_recv_rdma(cl);
-    fprintf(stderr, "[OSD %ju] new client %d: connection from %s via RDMA-CM\n", this->osd_num, fake_fd,
-        addr_to_string(cl->peer_addr).c_str());
+    conn->peer_fd = fake_fd;
+    conn->parsed_addr = *(sockaddr_storage*)rdma_get_peer_addr(ev->id);
+    conn->rdma_context = rdma_context;
+    rdmacm_set_conn_timeout(conn);
+    rdmacm_connecting[ev->id] = conn;
+    fprintf(stderr, "[OSD %ju] new client %d: connection from %s via RDMA-CM\n", this->osd_num, conn->peer_fd,
+        addr_to_string(conn->parsed_addr).c_str());
+}
+
+void osd_messenger_t::rdmacm_set_conn_timeout(rdmacm_connecting_t *conn)
+{
+    conn->timeout_ms = peer_connect_timeout*1000;
+    if (peer_connect_timeout > 0)
+    {
+        conn->timeout_id = tfd->set_timer(1000*peer_connect_timeout, false, [this, cmid = conn->cmid](int timer_id)
+        {
+            auto conn = rdmacm_connecting.at(cmid);
+            conn->timeout_id = -1;
+            if (conn->peer_osd)
+                fprintf(stderr, "RDMA-CM connection to %s timed out\n", conn->addr.c_str());
+            else
+                fprintf(stderr, "Incoming RDMA-CM connection from %s timed out\n", addr_to_string(conn->parsed_addr).c_str());
+            rdmacm_on_connect_peer_error(cmid, -EPIPE);
+        });
+    }
 }
 
 void osd_messenger_t::rdmacm_on_connect_peer_error(rdma_cm_id *cmid, int res)
@@ -332,15 +345,18 @@ void osd_messenger_t::rdmacm_on_connect_peer_error(rdma_cm_id *cmid, int res)
     }
     rdmacm_connecting.erase(cmid);
     delete conn;
-    if (!disable_tcp)
+    if (peer_osd)
     {
-        // Fall back to TCP instead of just reporting the error to on_connect_peer()
-        try_connect_peer_tcp(peer_osd, addr.c_str(), tcp_port);
-    }
-    else
-    {
-        // TCP is disabled
-        on_connect_peer(peer_osd, res == 0 ? -EINVAL : (res > 0 ? -res : res));
+        if (!disable_tcp)
+        {
+            // Fall back to TCP instead of just reporting the error to on_connect_peer()
+            try_connect_peer_tcp(peer_osd, addr.c_str(), tcp_port);
+        }
+        else
+        {
+            // TCP is disabled
+            on_connect_peer(peer_osd, res == 0 ? -EINVAL : (res > 0 ? -res : res));
+        }
     }
 }
@@ -374,6 +390,8 @@ void osd_messenger_t::rdmacm_try_connect_peer(uint64_t peer_osd, const std::stri
         on_connect_peer(peer_osd, res);
         return;
     }
+    if (log_level > 0)
+        fprintf(stderr, "Trying to connect to OSD %ju at %s:%d via RDMA-CM\n", peer_osd, addr.c_str(), rdmacm_port);
     auto conn = new rdmacm_connecting_t;
     rdmacm_connecting[cmid] = conn;
     conn->cmid = cmid;
@@ -383,19 +401,7 @@ void osd_messenger_t::rdmacm_try_connect_peer(uint64_t peer_osd, const std::stri
     conn->parsed_addr = sa;
     conn->rdmacm_port = rdmacm_port;
     conn->tcp_port = fallback_tcp_port;
-    conn->timeout_ms = peer_connect_timeout*1000;
-    conn->timeout_id = -1;
-    if (peer_connect_timeout > 0)
-    {
-        conn->timeout_id = tfd->set_timer(1000*peer_connect_timeout, false, [this, cmid](int timer_id)
-        {
-            auto conn = rdmacm_connecting.at(cmid);
-            conn->timeout_id = -1;
-            fprintf(stderr, "RDMA-CM connection to %s timed out\n", conn->addr.c_str());
-            rdmacm_on_connect_peer_error(cmid, -EPIPE);
-            return;
-        });
-    }
+    rdmacm_set_conn_timeout(conn);
     if (rdma_resolve_addr(cmid, NULL, (sockaddr*)&conn->parsed_addr, conn->timeout_ms) != 0)
     {
         auto res = -errno;
@@ -494,7 +500,7 @@ void osd_messenger_t::rdmacm_established(rdma_cm_event *ev)
     // Wrap into a new msgr_rdma_connection_t
     msgr_rdma_connection_t *rc = new msgr_rdma_connection_t;
     rc->ctx = conn->rdma_context;
-    rc->ctx->cm_refs++;
+    rc->ctx->cm_refs++; // FIXME now unused, count also connecting_t's when used
     rc->max_send = rdma_max_send;
     rc->max_recv = rdma_max_recv;
     rc->max_sge = rdma_max_sge > rc->ctx->attrx.orig_attr.max_sge
@@ -514,14 +520,20 @@ void osd_messenger_t::rdmacm_established(rdma_cm_event *ev)
     cl->rdma_conn = rc;
     clients[conn->peer_fd] = cl;
     if (conn->timeout_id >= 0)
+    {
         tfd->clear_timer(conn->timeout_id);
+    }
     delete conn;
     rdmacm_connecting.erase(cmid);
     rdmacm_connections[cmid] = cl;
-    if (log_level > 0)
+    if (log_level > 0 && peer_osd)
+    {
         fprintf(stderr, "Successfully connected with OSD %ju using RDMA-CM\n", peer_osd);
+    }
     // Add initial receive request(s)
     try_recv_rdma(cl);
-    osd_peer_fds[peer_osd] = cl->peer_fd;
-    on_connect_peer(peer_osd, cl->peer_fd);
+    if (peer_osd)
+    {
+        check_peer_config(cl);
+    }
 }
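
Taken together, the rdmacm_accept()/rdmacm_established() changes above mean an incoming connection is no longer wrapped into an osd_client_t straight from the CONNECT_REQUEST event: the request is accepted immediately, before the event is acked, the half-open connection then waits in rdmacm_connecting under the same rdmacm_set_conn_timeout() as outgoing connects, and only RDMA_CM_EVENT_ESTABLISHED promotes it to a client. Outgoing connections (peer_osd != 0) now also go through check_peer_config() instead of being reported up via on_connect_peer() directly, so peer configuration is validated on RDMA-CM connections too.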

View File

@@ -214,6 +214,7 @@ bool osd_messenger_t::handle_read_buffer(osd_client_t *cl, void *curbuf, int remain)
 
 bool osd_messenger_t::handle_finished_read(osd_client_t *cl)
 {
+    // Reset OSD ping state
     cl->ping_time_remaining = 0;
     cl->idle_time_remaining = osd_idle_timeout;
     cl->recv_list.reset();
@@ -222,7 +223,19 @@ bool osd_messenger_t::handle_finished_read(osd_client_t *cl)
     if (cl->read_op->req.hdr.magic == SECONDARY_OSD_REPLY_MAGIC)
         return handle_reply_hdr(cl);
     else if (cl->read_op->req.hdr.magic == SECONDARY_OSD_OP_MAGIC)
+    {
+        if (cl->check_sequencing)
+        {
+            if (cl->read_op->req.hdr.id != cl->read_op_id)
+            {
+                fprintf(stderr, "Warning: operation sequencing is broken on client %d, stopping client\n", cl->peer_fd);
+                stop_client(cl->peer_fd);
+                return false;
+            }
+            cl->read_op_id++;
+        }
         handle_op_hdr(cl);
+    }
     else
     {
         fprintf(stderr, "Received garbage: magic=%jx id=%ju opcode=%jx from %d\n", cl->read_op->req.hdr.magic, cl->read_op->req.hdr.id, cl->read_op->req.hdr.opcode, cl->peer_fd);

View File

@@ -14,6 +14,7 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
     if (cur_op->op_type == OSD_OP_OUT)
     {
         clock_gettime(CLOCK_REALTIME, &cur_op->tv_begin);
+        cur_op->req.hdr.id = ++cl->send_op_id;
     }
     else
     {
@@ -203,8 +204,24 @@ bool osd_messenger_t::try_send(osd_client_t *cl)
     cl->write_msg.msg_iovlen = cl->send_list.size() < IOV_MAX ? cl->send_list.size() : IOV_MAX;
     cl->refs++;
     ring_data_t* data = ((ring_data_t*)sqe->user_data);
-    data->callback = [this, cl](ring_data_t *data) { handle_send(data->res, cl); };
-    my_uring_prep_sendmsg(sqe, peer_fd, &cl->write_msg, 0);
+    data->callback = [this, cl](ring_data_t *data) { handle_send(data->res, data->prev, data->more, cl); };
+    bool use_zc = has_sendmsg_zc && min_zerocopy_send_size >= 0;
+    if (use_zc && min_zerocopy_send_size > 0)
+    {
+        size_t avg_size = 0;
+        for (size_t i = 0; i < cl->write_msg.msg_iovlen; i++)
+            avg_size += cl->write_msg.msg_iov[i].iov_len;
+        if (avg_size/cl->write_msg.msg_iovlen < min_zerocopy_send_size)
+            use_zc = false;
+    }
+    if (use_zc)
+    {
+        my_uring_prep_sendmsg_zc(sqe, peer_fd, &cl->write_msg, MSG_WAITALL);
+    }
+    else
+    {
+        my_uring_prep_sendmsg(sqe, peer_fd, &cl->write_msg, MSG_WAITALL);
+    }
     if (iothread)
     {
         iothread->add_sqe(sqe_local);
@@ -220,7 +237,7 @@ bool osd_messenger_t::try_send(osd_client_t *cl)
         {
            result = -errno;
         }
-        handle_send(result, cl);
+        handle_send(result, false, false, cl);
     }
     return true;
 }
@@ -240,10 +257,16 @@ void osd_messenger_t::send_replies()
     write_ready_clients.clear();
 }
 
-void osd_messenger_t::handle_send(int result, osd_client_t *cl)
+void osd_messenger_t::handle_send(int result, bool prev, bool more, osd_client_t *cl)
 {
-    cl->write_msg.msg_iovlen = 0;
-    cl->refs--;
+    if (!prev)
+    {
+        cl->write_msg.msg_iovlen = 0;
+    }
+    if (!more)
+    {
+        cl->refs--;
+    }
     if (cl->peer_state == PEER_STOPPED)
     {
         if (cl->refs <= 0)
@@ -261,6 +284,16 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
     }
     if (result >= 0)
     {
+        if (prev)
+        {
+            // Second notification - only free a batch of postponed ops
+            int i = 0;
+            for (; i < cl->zc_free_list.size() && cl->zc_free_list[i]; i++)
+                delete cl->zc_free_list[i];
+            if (i > 0)
+                cl->zc_free_list.erase(cl->zc_free_list.begin(), cl->zc_free_list.begin()+i+1);
+            return;
+        }
         int done = 0;
         while (result > 0 && done < cl->send_list.size())
         {
@@ -270,7 +303,10 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
             if (cl->outbox[done].flags & MSGR_SENDP_FREE)
             {
                 // Reply fully sent
-                delete cl->outbox[done].op;
+                if (more)
+                    cl->zc_free_list.push_back(cl->outbox[done].op);
+                else
+                    delete cl->outbox[done].op;
             }
             result -= iov.iov_len;
             done++;
@@ -282,6 +318,12 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
                 break;
             }
         }
+        if (more)
+        {
+            auto expected = cl->send_list.size() < IOV_MAX ? cl->send_list.size() : IOV_MAX;
+            assert(done == expected);
+            cl->zc_free_list.push_back(NULL); // end marker
+        }
         if (done > 0)
        {
             cl->send_list.erase(cl->send_list.begin(), cl->send_list.begin()+done);
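
Why ops are parked instead of deleted: with a zero-copy send the kernel may still be reading the reply buffers after the first completion, so fully-sent ops go into zc_free_list and are only destroyed when the second (notification) completion arrives. A sketch of the invariant, under the assumption of two in-flight batches with one reply op each:

    // Shape of cl->zc_free_list per the logic above:
    //   [ op1, NULL, op2, NULL ]
    // Each batch is terminated by a NULL marker pushed after the first
    // completion; the matching notification (prev == true) then deletes ops
    // up to the first NULL and erases them together with that marker, so
    // later notifications free the following batches in order.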

View File

@@ -147,7 +147,6 @@ struct cli_describe_t
         .describe = (osd_op_describe_t){
             .header = (osd_op_header_t){
                 .magic = SECONDARY_OSD_OP_MAGIC,
-                .id = parent->cli->next_op_id(),
                 .opcode = OSD_OP_DESCRIBE,
             },
             .object_state = object_state,

View File

@@ -159,7 +159,6 @@ struct cli_fix_t
         .describe = {
             .header = {
                 .magic = SECONDARY_OSD_OP_MAGIC,
-                .id = parent->cli->next_op_id(),
                 .opcode = OSD_OP_DESCRIBE,
             },
             .min_inode = obj.inode,
@@ -194,7 +193,6 @@
         .sec_del = {
             .header = {
                 .magic = SECONDARY_OSD_OP_MAGIC,
-                .id = parent->cli->next_op_id(),
                 .opcode = OSD_OP_SEC_DELETE,
             },
             .oid = {
@@ -242,7 +240,6 @@
         .rw = {
             .header = {
                 .magic = SECONDARY_OSD_OP_MAGIC,
-                .id = parent->cli->next_op_id(),
                 .opcode = OSD_OP_SCRUB,
             },
             .inode = obj.inode,

View File

@@ -58,6 +58,12 @@ struct osd_changer_t
             state = 100;
             return;
         }
+        if (set_reweight && new_reweight > 1)
+        {
+            result = (cli_result_t){ .err = EINVAL, .text = "Reweight can't be larger than 1" };
+            state = 100;
+            return;
+        }
         parent->etcd_txn(json11::Json::object {
             { "success", json11::Json::array {
                 json11::Json::object {

View File

@@ -44,10 +44,10 @@ std::string validate_pool_config(json11::Json::object & new_cfg, json11::Json old
         new_cfg["parity_chunks"] = parity_chunks;
     }
 
-    if (old_cfg.is_null() && new_cfg["scheme"].string_value() == "")
+    if (new_cfg["scheme"].string_value() == "")
     {
         // Default scheme
-        new_cfg["scheme"] = "replicated";
+        new_cfg["scheme"] = old_cfg.is_null() ? "replicated" : old_cfg["scheme"];
     }
     if (new_cfg.find("pg_minsize") == new_cfg.end() && (old_cfg.is_null() || new_cfg.find("pg_size") != new_cfg.end()))
     {

View File

@@ -5,6 +5,7 @@
 #include "cli.h"
 #include "cluster_client.h"
 #include "str_util.h"
+#include "json_util.h"
 #include "epoll_manager.h"
 
 #include <algorithm>
@@ -22,8 +23,8 @@ struct rm_osd_t
     int state = 0;
     cli_result_t result;
 
-    std::set<uint64_t> to_remove;
-    std::set<uint64_t> to_restart;
+    std::set<osd_num_t> to_remove;
+    std::vector<osd_num_t> still_up;
     json11::Json::array pool_effects;
     json11::Json::array history_updates, history_checks;
     json11::Json new_pgs, new_clean_pgs;
@@ -63,8 +64,17 @@ struct rm_osd_t
             }
             to_remove.insert(osd_id);
         }
-        // Check if OSDs are still used in data distribution
         is_warning = is_dataloss = false;
+        // Check if OSDs are still up
+        for (auto osd_id: to_remove)
+        {
+            if (parent->cli->st_cli.peer_states.find(osd_id) != parent->cli->st_cli.peer_states.end())
+            {
+                is_warning = true;
+                still_up.push_back(osd_id);
+            }
+        }
+        // Check if OSDs are still used in data distribution
         for (auto & pp: parent->cli->st_cli.pool_config)
         {
             // Will OSD deletion make pool incomplete / down / degraded?
@@ -158,6 +168,9 @@ struct rm_osd_t
                     : strtoupper(e["effect"].string_value())+" PGs"))
                 )+" after deleting OSD(s).\n";
             }
+            if (still_up.size())
+                error += (still_up.size() == 1 ? "OSD " : "OSDs ") + implode(", ", still_up) +
+                    (still_up.size() == 1 ? "is" : "are") + " still up. Use `vitastor-disk purge` to delete them.\n";
             if (is_dataloss && !force_dataloss && !dry_run)
                 error += "OSDs not deleted. Please move data to other OSDs or bypass this check with --allow-data-loss if you know what you are doing.\n";
             else if (is_warning && !force_warning && !dry_run)

View File

@@ -17,6 +17,8 @@
 #include "str_util.h"
 #include "vitastor_kv.h"
 
+#define KV_LIST_BUF_SIZE 65536
+
 const char *exe_name = NULL;
 
 class kv_cli_t
@@ -290,10 +292,26 @@ void kv_cli_t::next_cmd()
 struct kv_cli_list_t
 {
     vitastorkv_dbw_t *db = NULL;
+    std::string buf;
     void *handle = NULL;
     int format = 0;
     int n = 0;
     std::function<void(int)> cb;
+
+    void write(const std::string & str)
+    {
+        if (buf.capacity() < KV_LIST_BUF_SIZE)
+            buf.reserve(KV_LIST_BUF_SIZE);
+        if (buf.size() + str.size() > buf.capacity())
+            flush();
+        buf.append(str.data(), str.size());
+    }
+
+    void flush()
+    {
+        ::write(1, buf.data(), buf.size());
+        buf.resize(0);
+    }
 };
 
 std::vector<std::string> kv_cli_t::parse_cmd(const std::string & str)
@@ -604,11 +622,10 @@ void kv_cli_t::handle_cmd(const std::vector<std::string> & cmd, std::function<vo
             if (res < 0)
             {
                 if (res != -ENOENT)
-                {
                     fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
-                }
                 if (lst->format == 2)
-                    printf("\n}\n");
+                    lst->write("\n}\n");
+                lst->flush();
                 lst->db->list_close(lst->handle);
                 lst->cb(res == -ENOENT ? 0 : res);
                 delete lst;
@@ -616,11 +633,27 @@
             else
             {
                 if (lst->format == 2)
-                    printf(lst->n ? ",\n %s: %s" : "{\n %s: %s", addslashes(key).c_str(), addslashes(value).c_str());
+                {
+                    lst->write(lst->n ? ",\n " : "{\n ");
+                    lst->write(addslashes(key));
+                    lst->write(": ");
+                    lst->write(addslashes(value));
+                }
                 else if (lst->format == 1)
-                    printf("set %s %s\n", auto_addslashes(key).c_str(), value.c_str());
+                {
+                    lst->write("set ");
+                    lst->write(auto_addslashes(key));
+                    lst->write(" ");
+                    lst->write(value);
+                    lst->write("\n");
+                }
                 else
-                    printf("%s = %s\n", key.c_str(), value.c_str());
+                {
+                    lst->write(key);
+                    lst->write(" = ");
+                    lst->write(value);
+                    lst->write("\n");
+                }
                 lst->n++;
                 lst->db->list_next(lst->handle, NULL);
             }

View File

@@ -870,7 +870,7 @@ static void get_block(kv_db_t *db, uint64_t offset, int cur_level, int recheck_p
         }
         // Block already in cache, we can proceed
         blk->usage = db->usage_counter;
-        cb(0, BLK_UPDATING);
+        db->cli->msgr.ringloop->set_immediate([=] { cb(0, BLK_UPDATING); });
         return;
     }
     cluster_op_t *op = new cluster_op_t;
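
The reason for deferring the callback: when the block is already cached, a synchronous cb() completes get_block() inside its caller's stack frame, so a long chain of cached tree levels recurses once per level and can overflow the stack; set_immediate() turns that recursion into a flat queue drained by the ring loop. A toy model of the pattern (run_deferred stands in for ringloop->set_immediate(); everything here is illustrative):

    #include <functional>
    #include <vector>

    std::vector<std::function<void()>> immediate_queue;

    void run_deferred(std::function<void()> cb) { immediate_queue.push_back(cb); }

    // Instead of cb() -> get_block() -> cb() -> ... nesting one frame per
    // cached level, each callback runs from the queue at constant stack depth.
    void drain()
    {
        while (!immediate_queue.empty())
        {
            auto cb = immediate_queue.front();
            immediate_queue.erase(immediate_queue.begin());
            cb();
        }
    }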

View File

@@ -22,8 +22,8 @@ int nfs3_fsstat_proc(void *opaque, rpc_op_t *rop)
     {
         auto ttb = pst_it->second["total_raw_tb"].number_value();
         auto ftb = (pst_it->second["total_raw_tb"].number_value() - pst_it->second["used_raw_tb"].number_value());
-        tbytes = ttb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)2<<40);
-        fbytes = ftb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)2<<40);
+        tbytes = ttb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)1<<40);
+        fbytes = ftb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)1<<40);
     }
     *reply = (FSSTAT3res){
         .status = NFS3_OK,
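
The arithmetic behind this one-character fix: (uint64_t)2<<40 equals 2^41 bytes, i.e. 2 TiB per raw terabyte, so both the total and free figures came out doubled; (uint64_t)1<<40 is the intended 2^40 = 1 TiB multiplier (matching the commit "Fix NFS total & free multiplied by extra 2").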

View File

@@ -210,6 +210,7 @@ resume_4:
         st->res = res;
         kv_continue_create(st, 5);
     });
+    return;
 resume_5:
     if (st->res < 0)
     {

View File

@@ -13,6 +13,12 @@ void kv_read_inode(nfs_proxy_t *proxy, uint64_t ino,
     std::function<void(int res, const std::string & value, json11::Json ientry)> cb,
     bool allow_cache)
 {
+    if (!ino)
+    {
+        // Zero value can not exist
+        cb(-ENOENT, "", json11::Json());
+        return;
+    }
     auto key = kv_inode_key(ino);
     proxy->db->get(key, [=](int res, const std::string & value)
     {
@@ -49,7 +55,7 @@ int kv_nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
     auto ino = kv_fh_inode(fh);
     if (self->parent->trace)
         fprintf(stderr, "[%d] GETATTR %ju\n", self->nfs_fd, ino);
-    if (!kv_fh_valid(fh))
+    if (!kv_fh_valid(fh) || !ino)
     {
         *reply = (GETATTR3res){ .status = NFS3ERR_INVAL };
         rpc_queue_reply(rop);

View File

@@ -43,9 +43,30 @@ int kv_nfs3_lookup_proc(void *opaque, rpc_op_t *rop)
         uint64_t ino = direntry["ino"].uint64_value();
         kv_read_inode(self->parent, ino, [=](int res, const std::string & value, json11::Json ientry)
         {
-            if (res < 0)
+            if (res == -ENOENT)
             {
-                *reply = (LOOKUP3res){ .status = vitastor_nfs_map_err(res == -ENOENT ? -EIO : res) };
+                *reply = (LOOKUP3res){
+                    .status = NFS3_OK,
+                    .resok = (LOOKUP3resok){
+                        .object = xdr_copy_string(rop->xdrs, kv_fh(ino)),
+                        .obj_attributes = {
+                            .attributes_follow = 1,
+                            .attributes = (fattr3){
+                                .type = (ftype3)0,
+                                .mode = 0666,
+                                .nlink = 1,
+                                .fsid = self->parent->fsid,
+                                .fileid = ino,
+                            },
+                        },
+                    },
+                };
+                rpc_queue_reply(rop);
+                return;
+            }
+            else if (res < 0)
+            {
+                *reply = (LOOKUP3res){ .status = vitastor_nfs_map_err(res) };
                 rpc_queue_reply(rop);
                 return;
             }

View File

@@ -89,12 +89,23 @@
 resume_2:
     if (st->res < 0)
     {
-        fprintf(stderr, "error reading inode %s: %s (code %d)\n",
-            kv_inode_key(st->ino).c_str(), strerror(-st->res), st->res);
-        auto cb = std::move(st->cb);
-        cb(st->res);
-        return;
+        if (st->res == -ENOENT)
+        {
+            // Just delete direntry and skip inode
+            fprintf(stderr, "direntry %s references a non-existing inode %ju, deleting\n",
+                kv_direntry_key(st->dir_ino, st->filename).c_str(), st->ino);
+            st->ino = 0;
+        }
+        else
+        {
+            fprintf(stderr, "error reading inode %s: %s (code %d)\n",
+                kv_inode_key(st->ino).c_str(), strerror(-st->res), st->res);
+            auto cb = std::move(st->cb);
+            cb(st->res);
+            return;
+        }
     }
+    else
     {
         std::string err;
         st->ientry = json11::Json::parse(st->ientry_text, err);

View File

@@ -158,18 +158,13 @@ bool osd_t::check_peer_config(osd_client_t *cl, json11::Json conf)
 
 json11::Json osd_t::get_osd_state()
 {
-    std::vector<char> hostname;
-    hostname.resize(1024);
-    while (gethostname(hostname.data(), hostname.size()) < 0 && errno == ENAMETOOLONG)
-        hostname.resize(hostname.size()+1024);
-    hostname.resize(strnlen(hostname.data(), hostname.size()));
     json11::Json::object st;
     st["state"] = "up";
     if (bind_addresses.size() != 1 || bind_addresses[0] != "0.0.0.0")
         st["addresses"] = bind_addresses;
     else
         st["addresses"] = getifaddr_list();
-    st["host"] = std::string(hostname.data(), hostname.size());
+    st["host"] = gethostname_str();
     st["version"] = VITASTOR_VERSION;
     st["port"] = listening_port;
 #ifdef WITH_RDMACM

View File

@@ -209,7 +209,6 @@ bool osd_t::submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t
         .sec_stab = {
             .header = {
                 .magic = SECONDARY_OSD_OP_MAGIC,
-                .id = msgr.next_subop_id++,
                 .opcode = (uint64_t)(rollback ? OSD_OP_SEC_ROLLBACK : OSD_OP_SEC_STABILIZE),
             },
             .len = count * sizeof(obj_ver_id),

View File

@@ -391,7 +391,6 @@ void osd_t::submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps)
         .sec_list = {
             .header = {
                 .magic = SECONDARY_OSD_OP_MAGIC,
-                .id = msgr.next_subop_id++,
                 .opcode = OSD_OP_SEC_LIST,
             },
             .list_pg = ps->pg_num,

View File

@@ -266,7 +266,6 @@ int osd_t::submit_bitmap_subops(osd_op_t *cur_op, pg_t & pg)
         .sec_read_bmp = {
             .header = {
                 .magic = SECONDARY_OSD_OP_MAGIC,
-                .id = msgr.next_subop_id++,
                 .opcode = OSD_OP_SEC_READ_BMP,
             },
             .len = sizeof(obj_ver_id)*(i+1-prev),

View File

@@ -233,7 +233,6 @@ void osd_t::submit_primary_subop(osd_op_t *cur_op, osd_op_t *subop,
     subop->req.sec_rw = (osd_op_sec_rw_t){
         .header = {
             .magic = SECONDARY_OSD_OP_MAGIC,
-            .id = msgr.next_subop_id++,
             .opcode = (uint64_t)(wr ? (cur_op->op_data->pg->scheme == POOL_SCHEME_REPLICATED ? OSD_OP_SEC_WRITE_STABLE : OSD_OP_SEC_WRITE) : OSD_OP_SEC_READ),
         },
         .oid = {
@@ -452,15 +451,16 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
         {
             op_data->errcode = retval;
         }
-        op_data->errors++;
         if (subop->peer_fd >= 0 && retval != -EDOM && retval != -ERANGE &&
             (retval != -ENOSPC || opcode != OSD_OP_SEC_WRITE && opcode != OSD_OP_SEC_WRITE_STABLE) &&
             (retval != -EIO || opcode != OSD_OP_SEC_READ))
         {
             // Drop connection on unexpected errors
+            op_data->drops++;
             msgr.stop_client(subop->peer_fd);
-            op_data->drops++;
         }
+        // Increase op_data->errors after stop_client to prevent >= n_subops running twice
+        op_data->errors++;
     }
     else
     {
@@ -593,7 +593,6 @@ void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_
     subops[i].req = (osd_any_op_t){ .sec_del = {
         .header = {
             .magic = SECONDARY_OSD_OP_MAGIC,
-            .id = msgr.next_subop_id++,
             .opcode = OSD_OP_SEC_DELETE,
         },
         .oid = chunk.oid,
@@ -653,7 +652,6 @@ int osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
     subops[i].req = (osd_any_op_t){ .sec_sync = {
         .header = {
             .magic = SECONDARY_OSD_OP_MAGIC,
-            .id = msgr.next_subop_id++,
             .opcode = OSD_OP_SEC_SYNC,
         },
         .flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
@@ -712,7 +710,6 @@ void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
     subops[i].req = (osd_any_op_t){ .sec_stab = {
         .header = {
             .magic = SECONDARY_OSD_OP_MAGIC,
-            .id = msgr.next_subop_id++,
             .opcode = OSD_OP_SEC_STABILIZE,
         },
         .len = (uint64_t)(stab_osd.len * sizeof(obj_ver_id)),
@@ -806,7 +803,6 @@ void osd_t::submit_primary_rollback_subops(osd_op_t *cur_op, const uint64_t* osd
     subop->req = (osd_any_op_t){ .sec_stab = {
         .header = {
             .magic = SECONDARY_OSD_OP_MAGIC,
-            .id = msgr.next_subop_id++,
             .opcode = OSD_OP_SEC_ROLLBACK,
         },
         .len = sizeof(obj_ver_id),

View File

@@ -162,6 +162,7 @@ struct reed_sol_matrix_t
     int refs = 0;
     int *je_data;
     uint8_t *isal_data;
+    int isal_item_size;
     // 32 bytes = 256/8 = max pg_size/8
     std::map<std::array<uint8_t, 32>, void*> subdata;
     std::map<reed_sol_erased_t, void*> decodings;
@@ -181,20 +182,42 @@ void use_ec(int pg_size, int pg_minsize, bool use)
         }
         int *matrix = reed_sol_vandermonde_coding_matrix(pg_minsize, pg_size-pg_minsize, OSD_JERASURE_W);
         uint8_t *isal_table = NULL;
+        int item_size = 8;
 #ifdef WITH_ISAL
         uint8_t *isal_matrix = (uint8_t*)malloc_or_die(pg_minsize*(pg_size-pg_minsize));
         for (int i = 0; i < pg_minsize*(pg_size-pg_minsize); i++)
         {
            isal_matrix[i] = matrix[i];
         }
-        isal_table = (uint8_t*)malloc_or_die(pg_minsize*(pg_size-pg_minsize)*32);
+        isal_table = (uint8_t*)calloc_or_die(1, pg_minsize*(pg_size-pg_minsize)*32);
         ec_init_tables(pg_minsize, pg_size-pg_minsize, isal_matrix, isal_table);
         free(isal_matrix);
+        for (int i = pg_minsize*(pg_size-pg_minsize)*8; i < pg_minsize*(pg_size-pg_minsize)*32; i++)
+        {
+            if (isal_table[i] != 0)
+            {
+                // ISA-L GF-NI version uses 8-byte table items
+                item_size = 32;
+                break;
+            }
+        }
+        // Sanity check: rows should never consist of all zeroes
+        uint8_t zero_row[pg_minsize*item_size];
+        memset(zero_row, 0, pg_minsize*item_size);
+        for (int i = 0; i < (pg_size-pg_minsize); i++)
+        {
+            if (memcmp(isal_table + i*pg_minsize*item_size, zero_row, pg_minsize*item_size) == 0)
+            {
+                fprintf(stderr, "BUG or ISA-L incompatibility: EC tables shouldn't have all-zero rows\n");
+                abort();
+            }
+        }
 #endif
         matrices[key] = (reed_sol_matrix_t){
             .refs = 0,
             .je_data = matrix,
             .isal_data = isal_table,
+            .isal_item_size = item_size,
         };
         rs_it = matrices.find(key);
     }
@@ -235,7 +258,7 @@ static reed_sol_matrix_t* get_ec_matrix(int pg_size, int pg_minsize)
 // we don't need it. also it makes an extra allocation of int *erased on every call and doesn't cache
 // the decoding matrix.
 // all these flaws are fixed in this function:
-static void* get_jerasure_decoding_matrix(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize)
+static void* get_jerasure_decoding_matrix(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, int *item_size)
 {
     int edd = 0;
     int erased[pg_size];
@@ -292,6 +315,7 @@ static void* get_jerasure_decoding_matrix(osd_rmw_stripe_t *stripes, int pg_size
     int *erased_copy = (int*)(rectable + 32*smrow*pg_minsize);
     memcpy(erased_copy, erased, pg_size*sizeof(int));
     matrix->decodings.emplace((reed_sol_erased_t){ .data = erased_copy, .size = pg_size }, rectable);
+    *item_size = matrix->isal_item_size;
     return rectable;
 #else
     int *dm_ids = (int*)malloc_or_die(sizeof(int)*(pg_minsize + pg_minsize*pg_minsize + pg_size));
@@ -355,7 +379,8 @@ static void jerasure_matrix_encode_unaligned(int k, int m, int w, int *matrix, c
 #ifdef WITH_ISAL
 void reconstruct_stripes_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, uint32_t bitmap_size)
 {
-    uint8_t *dectable = (uint8_t*)get_jerasure_decoding_matrix(stripes, pg_size, pg_minsize);
+    int item_size = 0;
+    uint8_t *dectable = (uint8_t*)get_jerasure_decoding_matrix(stripes, pg_size, pg_minsize, &item_size);
     if (!dectable)
     {
         return;
@@ -378,7 +403,7 @@ void reconstruct_stripes_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsi
         }
     }
     ec_encode_data(
-        read_end-read_start, pg_minsize, wanted, dectable + wanted_base*32*pg_minsize,
+        read_end-read_start, pg_minsize, wanted, dectable + wanted_base*item_size*pg_minsize,
         data_ptrs, data_ptrs + pg_minsize
     );
 }
@@ -433,7 +458,7 @@ void reconstruct_stripes_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsi
 #else
 void reconstruct_stripes_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, uint32_t bitmap_size)
 {
-    int *dm_ids = (int*)get_jerasure_decoding_matrix(stripes, pg_size, pg_minsize);
+    int *dm_ids = (int*)get_jerasure_decoding_matrix(stripes, pg_size, pg_minsize, NULL);
     if (!dm_ids)
     {
         return;
@@ -980,7 +1005,7 @@ void calc_rmw_parity_ec(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize,
     {
         int item_size =
 #ifdef WITH_ISAL
-            32;
+            matrix->isal_item_size;
 #else
             sizeof(int);
 #endif
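
Background for the item_size detection above: classic ISA-L builds of ec_init_tables() expand each matrix coefficient into 32 bytes of multiplication tables, which Vitastor previously indexed with a hard-coded *32; the GF-NI path in ISA-L 2.31+ emits compact 8-byte items instead. Allocating the table zero-filled makes the layout observable, which the probe exploits (a self-contained restatement of the detection logic, not the upstream API):

    #include <stdint.h>

    // With k data and m parity chunks, a GF-NI build writes only k*m*8 bytes
    // into the zero-initialized table, a classic build writes k*m*32.
    static int detect_item_size(const uint8_t *isal_table, int k, int m)
    {
        for (int i = k*m*8; i < k*m*32; i++)
            if (isal_table[i] != 0)
                return 32; // data beyond the 8-byte region: classic tables
        return 8;          // nothing there: compact GF-NI tables
    }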

View File

@@ -65,7 +65,6 @@ void osd_t::scrub_list(pool_pg_num_t pg_id, osd_num_t role_osd, object_id min_oi
         .sec_list = {
             .header = {
                 .magic = SECONDARY_OSD_OP_MAGIC,
-                .id = msgr.next_subop_id++,
                 .opcode = OSD_OP_SEC_LIST,
             },
             .list_pg = pg_num,

View File

@@ -198,6 +198,12 @@ void osd_t::exec_show_config(osd_op_t *cur_op)
     json11::Json req_json = cur_op->req.show_conf.json_len > 0
         ? json11::Json::parse(std::string((char *)cur_op->buf), json_err)
         : json11::Json();
+    if (req_json["features"]["check_sequencing"].bool_value())
+    {
+        auto cl = msgr.clients.at(cur_op->peer_fd);
+        cl->check_sequencing = true;
+        cl->read_op_id = cur_op->req.hdr.id + 1;
+    }
     // Expose sensitive configuration values so peers can check them
     json11::Json::object wire_config = json11::Json::object {
         { "osd_num", osd_num },

View File

@@ -21,7 +21,9 @@ osd_messenger_t::~osd_messenger_t()
 
 void osd_messenger_t::outbox_push(osd_op_t *cur_op)
 {
-    clients[cur_op->peer_fd]->sent_ops[cur_op->req.hdr.id] = cur_op;
+    auto cl = clients.at(cur_op->peer_fd);
+    cur_op->req.hdr.id = ++cl->send_op_id;
+    cl->sent_ops[cur_op->req.hdr.id] = cur_op;
 }
 
 void osd_messenger_t::parse_config(const json11::Json & config)

View File

@@ -245,3 +245,13 @@ int create_and_bind_socket(std::string bind_address, int bind_port, int listen_b
     return listen_fd;
 }
+
+std::string gethostname_str()
+{
+    std::string hostname;
+    hostname.resize(1024);
+    while (gethostname((char*)hostname.data(), hostname.size()) < 0 && errno == ENAMETOOLONG)
+        hostname.resize(hostname.size()+1024);
+    hostname.resize(strnlen(hostname.data(), hostname.size()));
+    return hostname;
+}

View File

@@ -21,3 +21,4 @@ bool cidr6_match(const in6_addr &address, const in6_addr &network, uint8_t bits)
 bool cidr_sockaddr_match(const sockaddr_storage &addr, const addr_mask_t &mask);
 std::vector<std::string> getifaddr_list(const std::vector<addr_mask_t> & masks = std::vector<addr_mask_t>(), bool include_v6 = false);
 int create_and_bind_socket(std::string bind_address, int bind_port, int listen_backlog, int *listening_port);
+std::string gethostname_str();

View File

@@ -10,6 +10,10 @@
 #include "ringloop.h"
 
+#ifndef IORING_CQE_F_MORE
+#define IORING_CQE_F_MORE (1U << 1)
+#endif
+
 ring_loop_t::ring_loop_t(int qd, bool multithreaded)
 {
     mt = multithreaded;
@@ -30,6 +34,16 @@ ring_loop_t::ring_loop_t(int qd, bool multithreaded)
         free_ring_data[i] = i;
     }
     in_loop = false;
+    auto probe = io_uring_get_probe();
+    if (probe)
+    {
+        support_zc = io_uring_opcode_supported(probe, IORING_OP_SENDMSG_ZC);
+#ifdef IORING_SETUP_R_DISABLED /* liburing 2.0 check */
+        io_uring_free_probe(probe);
+#else
+        free(probe);
+#endif
+    }
 }
 
 ring_loop_t::~ring_loop_t()
@@ -108,7 +122,17 @@ void ring_loop_t::loop()
             if (mt)
                 mu.lock();
             struct ring_data_t *d = (struct ring_data_t*)cqe->user_data;
-            if (d->callback)
+            if (cqe->flags & IORING_CQE_F_MORE)
+            {
+                // There will be a second notification
+                d->res = cqe->res;
+                d->more = true;
+                if (d->callback)
+                    d->callback(d);
+                d->prev = true;
+                d->more = false;
+            }
+            else if (d->callback)
             {
                 // First free ring_data item, then call the callback
                 // so it has at least 1 free slot for the next event
@@ -116,7 +140,10 @@ void ring_loop_t::loop()
                 struct ring_data_t dl;
                 dl.iov = d->iov;
                 dl.res = cqe->res;
+                dl.more = false;
+                dl.prev = d->prev;
                 dl.callback.swap(d->callback);
+                d->prev = d->more = false;
                 free_ring_data[free_ring_data_ptr++] = d - ring_datas;
                 if (mt)
                     mu.unlock();
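
What the loop change implements, in outline: an IORING_OP_SENDMSG_ZC request produces two CQEs, and the ring_data_t slot must survive the first one because the kernel may still be reading the send buffers. A comment-form sketch of the sequence as handled above:

    // Completion sequence of one SENDMSG_ZC sqe:
    //   CQE 1: cqe->flags has IORING_CQE_F_MORE, cqe->res = bytes sent
    //          -> callback fires with more = true, prev = false,
    //             and the ring_data_t slot is intentionally NOT recycled;
    //   CQE 2: the zero-copy notification, no F_MORE
    //          -> callback fires with more = false, prev = true,
    //             buffers may finally be reused and the slot is freed.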

View File

@@ -18,6 +18,10 @@
 #define RINGLOOP_DEFAULT_SIZE 1024
 
+#ifndef IORING_RECV_MULTISHOT /* liburing-2.3 check */
+#define IORING_OP_SENDMSG_ZC 48
+#endif
+
 static inline void my_uring_prep_rw(int op, struct io_uring_sqe *sqe, int fd, const void *addr, unsigned len, off_t offset)
 {
     // Prepare a read/write operation without clearing user_data
@@ -62,6 +66,12 @@ static inline void my_uring_prep_sendmsg(struct io_uring_sqe *sqe, int fd, const
     sqe->msg_flags = flags;
 }
 
+static inline void my_uring_prep_sendmsg_zc(struct io_uring_sqe *sqe, int fd, const struct msghdr *msg, unsigned flags)
+{
+    my_uring_prep_rw(IORING_OP_SENDMSG_ZC, sqe, fd, msg, 1, 0);
+    sqe->msg_flags = flags;
+}
+
 static inline void my_uring_prep_poll_add(struct io_uring_sqe *sqe, int fd, short poll_mask)
 {
     my_uring_prep_rw(IORING_OP_POLL_ADD, sqe, fd, NULL, 0, 0);
@@ -112,6 +122,8 @@ struct ring_data_t
 {
     struct iovec iov; // for single-entry read/write operations
     int res;
+    bool prev: 1;
+    bool more: 1;
     std::function<void(ring_data_t*)> callback;
 };
@@ -133,6 +145,7 @@ class ring_loop_t
     bool loop_again;
     struct io_uring ring;
     int ring_eventfd = -1;
+    bool support_zc = false;
 public:
     ring_loop_t(int qd, bool multithreaded = false);
     ~ring_loop_t();
@@ -144,6 +157,7 @@ public:
     inline void set_immediate(const std::function<void()> cb)
     {
         immediate_queue.push_back(cb);
+        wakeup();
     }
     inline int submit()
     {
@@ -163,6 +177,10 @@ public:
     {
         return loop_again;
     }
+    inline bool has_sendmsg_zc()
+    {
+        return support_zc;
+    }
     void loop();
     void wakeup();

View File

@@ -87,6 +87,22 @@ wait_etcd()
     done
 }
 
+wait_condition()
+{
+    sec=$1
+    check=$2
+    proc=$3
+    i=0
+    while [[ $i -lt $sec ]]; do
+        eval "$check" && break
+        if [ $i -eq $sec ]; then
+            format_error "$proc couldn't finish in $sec seconds"
+        fi
+        sleep 1
+        i=$((i+1))
+    done
+}
+
 if [[ -n "$ANTIETCD" ]]; then
     ETCDCTL="node mon/node_modules/.bin/anticli -e $ETCD_URL"
     MON_PARAMS="--use_antietcd 1 --antietcd_data_dir ./testdata --antietcd_persist_interval 500 $MON_PARAMS"

View File

@@ -127,22 +127,6 @@ try_reweight()
     sleep 3
 }
 
-wait_condition()
-{
-    sec=$1
-    check=$2
-    proc=$3
-    i=0
-    while [[ $i -lt $sec ]]; do
-        eval "$check" && break
-        if [ $i -eq $sec ]; then
-            format_error "$proc couldn't finish in $sec seconds"
-        fi
-        sleep 1
-        i=$((i+1))
-    done
-}
-
 wait_finish_rebalance()
 {
     sec=$1

View File

@@ -31,4 +31,10 @@ sleep 2
 $ETCDCTL get --prefix /vitastor/pg/config --print-value-only | \
     jq -s -e '([ .[0].items["1"] | .[].osd_set | map_values(. | tonumber) | select((.[0] <= 4) != (.[1] <= 4)) ] | length) == 4'
 
+# test pool with size 1
+build/src/cmd/vitastor-cli --etcd_address $ETCD_URL create-pool size1pool -s 1 -n 1 --force
+wait_condition 10 "$ETCDCTL get --prefix /vitastor/pg/config --print-value-only | jq -s -e '.[0].items["'"'"2"'"'"]'"
+build/src/cmd/vitastor-cli --etcd_address $ETCD_URL modify-pool size1pool -s 2 --force
+
 format_green OK

View File

@@ -165,4 +165,24 @@ if ls ./testdata/nfs | grep over1; then false; fi
 [[ "`cat ./testdata/nfs/linked1`" = "BABABA" ]]
 format_green "rename over existing file ok"
 
+# check listing and removal of a bad direntry
+sudo umount ./testdata/nfs/
+build/src/kv/vitastor-kv --etcd_address $ETCD_URL fsmeta set d11/settings.jsonLGNmGn '{"ino": 123}'
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+ls -l ./testdata/nfs
+ls -l ./testdata/nfs/settings.jsonLGNmGn
+rm ./testdata/nfs/settings.jsonLGNmGn
+build/src/kv/vitastor-kv --etcd_address $ETCD_URL fsmeta get d11/settings.jsonLGNmGn 2>&1 | grep '(code -2)'
+ls -l ./testdata/nfs
+
+# repeat with ino=0
+sudo umount ./testdata/nfs/
+build/src/kv/vitastor-kv --etcd_address $ETCD_URL fsmeta set d11/settings.jsonLGNmGn '{"ino": 0}'
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+ls -l ./testdata/nfs
+ls -l ./testdata/nfs/settings.jsonLGNmGn
+rm ./testdata/nfs/settings.jsonLGNmGn
+build/src/kv/vitastor-kv --etcd_address $ETCD_URL fsmeta get d11/settings.jsonLGNmGn 2>&1 | grep '(code -2)'
+ls -l ./testdata/nfs
+
 format_green OK