Compare commits

..

105 Commits

Author SHA1 Message Date
cb282d25e0 Release 0.6.5
- Basic support for OpenStack: Cinder driver, patches for Nova and libvirt
- Add missing "image" and "config_path" QEMU options
- Calculate aggregate per-pool statistics in monitor
- Implement writes with Check-And-Set semantics
- Add a C wrapper library with public header
2021-07-10 11:01:21 +03:00
8b2a4c9539 Fix centos builds (yum-builddep stopped working in el7, cmake in el8..) 2021-07-10 11:01:21 +03:00
b66a079892 State basic OpenStack support 2021-07-10 01:11:20 +03:00
e90bbe6385 Implement OpenStack Cinder driver for Vitastor
It can't delete snapshots yet because Vitastor layer merge isn't
implemented yet. You can only delete volumes with all snapshots.
This will be fixed in the near future.
2021-07-10 01:06:29 +03:00
4be761254c Move patches to patches/ 2021-07-09 21:51:19 +03:00
7a45c5f86c buster-backports has broken mesa 2021-07-09 12:29:39 +03:00
bff413584d Fix qemuBlockStorageSourceGetVitastorProps 2021-07-09 02:09:47 +03:00
bb31050ab5 Add missing image, config_path options to QEMU QAPI 2021-07-09 02:09:47 +03:00
b52dd6843a Rename qemu_rbd_unescape and qemu_rbd_next_tok to *_vitastor_* 2021-07-03 23:14:44 +03:00
b66160a7ad Aggregate per-pool statistics in mon 2021-07-03 23:14:44 +03:00
30bb602681 Add _VITASTOR to missing switches in libvirt 7.0 patch 2021-06-28 22:00:23 +03:00
eb0a3adafc Patch libvirt schema, add an example to test libvirt 2021-06-28 01:20:55 +03:00
24301b116c Add libvirt 5.0 patch 2021-06-27 18:43:29 +03:00
1d00c17d68 Add libvirt 7.5 patch 2021-06-27 10:58:12 +03:00
24f19c4b80 Add libvirt 7.0 patch 2021-06-27 00:58:56 +03:00
dfdf5c1f9c Fix comments in mon.js 2021-06-20 00:23:56 +03:00
aad7792d3f Check for loops in parent inode chains 2021-06-20 00:23:03 +03:00
6ca8afffe5 Add CAS version parameter to the C wrapper 2021-06-19 01:00:52 +03:00
511a89948b Rework qemu_proxy into a C wrapper library with public header 2021-06-19 00:39:11 +03:00
3de553ecd7 Add a test for CAS write operation 2021-06-15 00:12:35 +03:00
9c45d43e74 Extract common 3 OSD code from several test scripts 2021-06-15 00:12:35 +03:00
891250d355 Implement CAS writes
From now on, reads will return the server-side object version numbers
and writes and deletes will have an additional "version" parameter
which, if set to a non-zero value, will be atomically compared with
the current version of the object plus 1 and the modification will
fail if it doesn't match.

This feature opens the road to correct online flattening of snapshot
layers and other interesting things.
2021-06-15 00:12:35 +03:00
f9fe72d40a Release 0.6.4
- Implement a basic Kubernetes CSI driver
- Minor fixes for vitastor-nbd
- Fix build without RDMA broken in 0.6.3
2021-05-16 01:38:01 +03:00
10ee4f7c1d Add notes about CSI to README 2021-05-16 01:38:01 +03:00
fd8244699b Implement basic CSI driver
Currently can create and remove volumes, but resizing and snapshots is not supported yet
2021-05-16 01:15:43 +03:00
eaac1fc5d1 Log to stderr in etcd_state_client, too 2021-05-16 01:09:25 +03:00
57be1923d3 Daemonize NBD_DO_IT process, correctly cleanup unmounted NBD clients 2021-05-16 01:09:25 +03:00
c467acc388 Fix /v3 appendage to etcd URLs without /v3 2021-05-15 19:22:24 +03:00
bf591ba3ee Fix nbd module load check 2021-05-15 19:22:24 +03:00
699a0fbbc7 Log to stderr instead of stdout in client 2021-05-15 19:22:24 +03:00
6b2dd50f27 Fix build without RDMA 2021-05-08 18:20:43 +03:00
caf2f3c56f Release 0.6.3
- RDMA support
- Client performance optimisations (4k randread ~120k -> ~180k on 1 core)
- JSON configuration file (/etc/vitastor/vitastor.conf) support
- Bug fixes
2021-05-02 17:47:43 +03:00
9174f188b1 Build packages with libibverbs
For CentOS 7 it also requires newer rdma-core as CentOS 7's native version doesn't have
implicit ODP support. The updated version is already uploaded into the vitastor repo.
2021-05-02 17:47:16 +03:00
d3978c6d0e Do not print RDMA connection messages when log_level=0
By the way, it's 1 by default in the OSD, so these messages will still be there in OSD logs
2021-05-01 00:26:09 +03:00
4a7365660d Do not wait for down OSDs during sync
Fixes a hang introduced in 0.5.11 in the non-immediate_commit mode
2021-05-01 00:26:07 +03:00
818ae5d61d Some config parsing fixes 2021-05-01 00:20:01 +03:00
6810e93c3f Add RDMA options to mon.js list 2021-04-30 01:23:22 +03:00
f6f35f4127 Pass options correctly to not override /etc/vitastor/vitastor.conf 2021-04-30 01:17:44 +03:00
72aa2fd819 Make OSD and client read common configuration from /etc/vitastor/vitastor.conf 2021-04-30 01:11:27 +03:00
5010b0dd75 Use json11 instead of blockstore_config_t 2021-04-30 00:52:46 +03:00
483c5ab380 Negotiate max_msg instead of max_sge, make buffer settings more conservative :-) 2021-04-29 11:10:35 +03:00
6a6fd6544d Add RDMA options to the QEMU driver 2021-04-29 11:02:49 +03:00
971aa4ae4f Implement RDMA receive with memory copying (send remains zero-copy)
This is the simplest and, as usual, the best implementation :)

100% zero-copy implementation is also possible (see rdma-zerocopy branch),
but it requires to create A LOT of queues (~128 per client) to use QPN as a 'tag'
because of the lack of receive tags and the server may simply run out of queues.
Hardware limit is 262144 on Mellanox ConnectX-4 which amounts to only 2048
'connections' per host. And even with that amount of queues it's still less optimal
than the non-zerocopy one.

In fact, newest hardware like Mellanox ConnectX-5 does have Tag Matching
support, but it's still unsuitable for us because it doesn't support scatter/gather
(tm_caps.max_sge=1).
2021-04-29 02:34:45 +03:00
9e6cbc6ebc Negotiate max_sge between RDMA client & server 2021-04-29 02:15:20 +03:00
ce777319c3 WIP RDMA support
Basic naive implementation works, but it's highly non-optimal as
RNR retransmissions occur all the time. RDMA expects the receiver
to always have place for incoming WRs...
2021-04-29 02:03:54 +03:00
f8ff39b0ab Rework continue_ops() to remove a CPU hot spot
This rework increases fio -rw=randread -iodepth=128 result from ~120k to ~180k iops :)
2021-04-29 01:50:13 +03:00
d749159585 Linked list experiment
Rework client operation queue from a vector to a linked list.
This is required to rework continue_ops() as its current implementation
consumes ~25% of client process CPU.
2021-04-29 01:47:33 +03:00
9703773a63 Fix has_flushes setting 2021-04-28 23:40:44 +03:00
5d8d486f7c Add SOVERSION 2021-04-20 01:01:32 +03:00
2b546cdd55 Link vitastor_blk with vitastor_common for timerfd_manager_t
Not really required to operate, but fixes a verify-elf error
2021-04-20 00:51:53 +03:00
bd7b177707 Report sensitive configuration values instead of the configuration source 2021-04-17 23:11:16 +03:00
33f9d03d22 Update documentation regarding image names and vitastor-nbd 2021-04-17 17:40:12 +03:00
82e6aff17b Support mapping NBD by the image name 2021-04-17 17:39:55 +03:00
57e2c503f7 Rename osd_t::c_cli to msgr 2021-04-17 16:32:09 +03:00
715bc8d53d Release 0.6.2
- Fix a possible crash during SYNC when journal fsyncs are enabled
- Fix a memory leak in the chained read implementation
2021-04-15 23:40:06 +03:00
0af077701c Fix a possible crash during SYNC when journal fsyncs are enabled 2021-04-15 02:01:50 +03:00
cac976ce25 Fix a memory leak in the chained read implementation 2021-04-15 01:42:18 +03:00
acf0646542 Build common sources once 2021-04-15 01:13:34 +03:00
ede1c1d667 Release 0.6.1
A bugfix for the new "chained read from snapshot" feature
2021-04-14 22:32:23 +03:00
38bd51c97f Remove aio_context assertion, it seems it is unneeded 2021-04-14 22:32:15 +03:00
8c9f32cd45 Add run_vm test bash scripts 2021-04-13 16:21:21 +03:00
966fb763ca Oooops, fix chained reads 2021-04-13 16:19:21 +03:00
0b41ffc08d Release 0.6.0
Warning: upgrading from 0.5.x is currently not supported!
Please create an issue if you really need upgrade capability.

New features:
- Snapshots and Copy-on-Write clones
- Inode (image) names
- Inode I/O and space statistics
- Write throttling for smoothing random write workloads in SSD+HDD configurations
2021-04-11 00:49:18 +03:00
64eeb79051 Prevent 0.6.x OSDs from talking to 0.5.x
The new protocol is almost compatible - it has bitmaps, but also it has
a "bitmap_length" field. It's not hard to make 0.5-0.6 OSDs and clients
compatible, but for now I just assume nobody needs it.

If I'm wrong and anybody requests to upgrade their production 0.5.x system
to 0.6.x I'll fix it.
2021-04-10 22:26:17 +03:00
2a02f3c4c7 Add metadata superblock and check it on start
Refuse to start if the superblock is missing or bad version;
zero out the metadata area when initializing superblock.
2021-04-10 22:26:17 +03:00
f684d9101a Refuse to start with old journal version 2021-04-10 17:44:12 +03:00
c72fddd714 Notes about master/0.5.x 2021-04-10 17:44:12 +03:00
a1f2f19489 Do not increment inode statistics if the object already exists 2021-04-10 17:44:12 +03:00
82c1a7ec67 Fix statistics reporting, split inode number into pool & inode 2021-04-10 17:44:12 +03:00
2ab423d4ef Implement journaled write throttling for the SSD+HDD case 2021-04-10 17:44:12 +03:00
4694811eab Add microsecond accuracy to set_timer 2021-04-10 17:44:12 +03:00
6b988de17d Remove timerfd_interval 2021-04-10 17:44:12 +03:00
37efdc2a83 Fix bitmap_set for replicated pools 2021-04-10 17:44:12 +03:00
591cad09c9 Fix bitmaps for objects larger than 128K 2021-04-10 17:44:12 +03:00
b907ad50aa Oops, forgot to add external bitmaps to blockstore in some places 2021-04-10 17:44:12 +03:00
7308d6a6c0 Note about etcd 3.4.15 2021-04-10 17:44:12 +03:00
5f5b6ef150 Enable chained reads in the client 2021-04-10 17:44:12 +03:00
38a3df4a0e Implement chained (optimized) read in the primary OSD code 2021-04-10 17:44:12 +03:00
6950b8e3a0 Watch inode metadata revisions 2021-04-10 17:44:12 +03:00
0cea3576fb Add "read bitmaps" operation to secondary OSD protocol 2021-04-10 17:44:12 +03:00
f01eea07d3 Add simplified interface to read blockstore bitmaps synchronously 2021-04-10 17:44:12 +03:00
2c2f08aca2 Shorten some structure names 2021-04-10 17:44:12 +03:00
d6524670e1 Introduce data distribution locality 2021-04-10 17:44:12 +03:00
879ecfa74d Fix wording 2021-04-10 17:44:12 +03:00
aea2d19d35 Change Telegram chat link 2021-04-10 17:44:12 +03:00
04f86dc00b Fix Russian README for CMake build 2021-04-10 17:44:12 +03:00
7aeb2cbac7 Capture all by value in qemu_proxy 2021-04-10 17:44:12 +03:00
519f081006 Add LICENSE 2021-04-10 17:44:12 +03:00
e50f703e1d Add Russian version of the README 2021-04-10 17:44:12 +03:00
2612d3198a Introduce image names and metadata storage in etcd
Each inode has: image name, parent inode number & pool, size and readonly flag

Snapshots are created by switching image name to a different inode number
while using the older inode as parent.
2021-04-10 17:44:12 +03:00
ab39ce2bbb Use clean_entry_bitmap_size instead of entry_attr_size back because of changed bitmap handling 2021-04-10 17:44:12 +03:00
d0c2e31312 Add a test for snapshots, fix bugs. Now the test passes 2021-04-10 17:44:12 +03:00
9038d42327 Fix several snapshot I/O bugs 2021-04-10 17:44:12 +03:00
691f066055 Actual snapshot support (untested) 2021-04-10 17:44:12 +03:00
ffe1cd4c79 Report inode I/O statistics, aggregate it in the monitor 2021-04-10 17:44:12 +03:00
4ae1b84c67 Report inode space usage statistics to etcd, aggregate it in the monitor 2021-04-10 17:44:12 +03:00
c35963967f Add inode space usage statistics tracking to blockstore 2021-04-10 17:44:12 +03:00
0aa2dd2890 Send bitmaps with primary-reads, actually read bitmaps for READ ops 2021-04-10 17:44:12 +03:00
6bf88883ac Allocate bitmaps along with stripes to avoid memory fragmentation 2021-04-10 17:44:12 +03:00
004f265393 Remove cryptic bitmap inlining from bs_op_t and osd_op_t, use bitmap in primary OSD code 2021-04-10 17:44:12 +03:00
860ac24762 Add "external" bitmap support to the secondary OSD protocol 2021-04-10 17:44:12 +03:00
6107a4d07b Add "external" bitmap support to blockstore 2021-04-10 17:44:12 +03:00
95c29b9dc3 Add "external" bitmap support to osd_rmw 2021-04-10 17:44:12 +03:00
d99407dcec Check QEMU block-vitastor.so during the test 2021-04-10 17:44:12 +03:00
6909807068 Allow to start the OSD just to flush the journal completely 2021-04-10 17:44:12 +03:00
125 changed files with 10906 additions and 1461 deletions

View File

@@ -2,4 +2,6 @@ cmake_minimum_required(VERSION 2.8)
project(vitastor)
set(VERSION "0.6.5")
add_subdirectory(src)

27
LICENSE Normal file
View File

@@ -0,0 +1,27 @@
Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
All server-side code (OSD, Monitor and so on) is licensed under the terms of
Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
GNU GPLv3.0 with the additional "Network Interaction" clause which requires
opensourcing all programs directly or indirectly interacting with Vitastor
through a computer network and expressly designed to be used in conjunction
with it ("Proxy Programs"). Proxy Programs may be made public not only under
the terms of the same license, but also under the terms of any GPL-Compatible
Free Software License, as listed by the Free Software Foundation.
This is a stricter copyleft license than the Affero GPL.
Please note that VNPL doesn't require you to open the code of proprietary
software running inside a VM if it's not specially designed to be used with
Vitastor.
Basically, you can't use the software in a proprietary environment to provide
its functionality to users without opensourcing all intermediary components
standing between the user and Vitastor or purchasing a commercial license
from the author 😀.
Client libraries (cluster_client and so on) are dual-licensed under the same
VNPL 1.1 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
software like QEMU and fio.
You can find the full text of VNPL-1.1 in the file [VNPL-1.1.txt](VNPL-1.1.txt).
GPL 2.0 is also included in this repository as [GPL-2.0.txt](GPL-2.0.txt).

579
README-ru.md Normal file
View File

@@ -0,0 +1,579 @@
## Vitastor
[Read English version](README.md)
## Идея
Я всего лишь хочу сделать качественную блочную SDS!
Vitastor - распределённая блочная SDS, прямой аналог Ceph RBD и внутренних СХД популярных
облачных провайдеров. Однако, в отличие от них, Vitastor быстрый и при этом простой.
Только пока маленький :-).
Архитектурная схожесть с Ceph означает заложенную на уровне алгоритмов записи строгую консистентность,
репликацию через первичный OSD, симметричную кластеризацию без единой точки отказа
и автоматическое распределение данных по любому числу дисков любого размера с настраиваемыми схемами
избыточности - репликацией или с произвольными кодами коррекции ошибок.
## Возможности
Vitastor на данный момент находится в статусе предварительного выпуска, расширенные
возможности пока отсутствуют, а в будущих версиях вероятны "ломающие" изменения.
Однако следующее уже реализовано:
- Базовая часть - надёжное кластерное блочное хранилище без единой точки отказа
- Производительность ;-D
- Несколько схем отказоустойчивости: репликация, XOR n+1 (1 диск чётности), коды коррекции ошибок
Рида-Соломона на основе библиотеки jerasure с любым числом дисков данных и чётности в группе
- Конфигурация через простые человекочитаемые JSON-структуры в etcd
- Автоматическое распределение данных по OSD, с поддержкой:
- Математической оптимизации для лучшей равномерности распределения и минимизации перемещений данных
- Нескольких пулов с разными схемами избыточности
- Дерева распределения, выбора OSD по тегам / классам устройств (только SSD, только HDD) и по поддереву
- Настраиваемых доменов отказа (диск/сервер/стойка и т.п.)
- Восстановление деградированных блоков
- Ребаланс, то есть перемещение данных между OSD (дисками)
- Поддержка "ленивого" fsync (fsync не на каждую операцию)
- Сбор статистики ввода/вывода в etcd
- Клиентская библиотека режима пользователя для ввода/вывода
- Драйвер диска для QEMU (собирается вне дерева исходников QEMU)
- Драйвер диска для утилиты тестирования производительности fio (также собирается вне дерева исходников fio)
- NBD-прокси для монтирования образов ядром ("блочное устройство в режиме пользователя")
- Утилита удаления образов/инодов (vitastor-rm)
- Пакеты для Debian и CentOS
- Статистика операций ввода/вывода и занятого места в разрезе инодов
- Именование инодов через хранение их метаданных в etcd
- Снапшоты и copy-on-write клоны
- Сглаживание производительности случайной записи в SSD+HDD конфигурациях
- Поддержка RDMA/RoCEv2 через libibverbs
- CSI-плагин для Kubernetes
- Базовая поддержка OpenStack: драйвер Cinder, патчи для Nova и libvirt
## Планы развития
- Поддержка удаления снапшотов (слияния слоёв)
- Более корректные скрипты разметки дисков и автоматического запуска OSD
- Другие инструменты администрирования
- Плагины для OpenNebula, Proxmox и других облачных систем
- iSCSI-прокси
- Более быстрое переключение при отказах
- Фоновая проверка целостности без контрольных сумм (сверка реплик)
- Контрольные суммы
- Поддержка SSD-кэширования (tiered storage)
- Поддержка NVDIMM
- Web-интерфейс
- Возможно, сжатие
- Возможно, поддержка кэширования данных через системный page cache
## Архитектура
Так же, как и в Ceph, в Vitastor:
- Есть пулы (pools), PG, OSD, мониторы, домены отказа, дерево распределения (аналог crush-дерева).
- Образы делятся на блоки фиксированного размера (объекты), и эти объекты распределяются по OSD.
- У OSD есть журнал и метаданные и они тоже могут размещаться на отдельных быстрых дисках.
- Все операции записи тоже транзакционны. В Vitastor, правда, есть режим отложенного/ленивого fsync
(коммита), в котором fsync не вызывается на каждую операцию записи, что делает его более
пригодным для использования на "плохих" (десктопных) SSD. Однако все операции записи
в любом случае атомарны.
- Клиентская библиотека тоже старается ждать восстановления после любого отказа кластера, то есть,
вы тоже можете перезагрузить хоть весь кластер разом, и клиенты только на время зависнут,
но не отключатся.
Некоторые базовые термины для тех, кто не знаком с Ceph:
- OSD (Object Storage Daemon) - процесс, который хранит данные на одном диске и обрабатывает
запросы чтения/записи от клиентов.
- Пул (Pool) - контейнер для данных, имеющих одну и ту же схему избыточности и правила распределения по OSD.
- PG (Placement Group) - группа объектов, хранимых на одном и том же наборе реплик (OSD).
Несколько PG могут храниться на одном и том же наборе реплик, но объекты одной PG
в норме не хранятся на разных наборах OSD.
- Монитор - демон, хранящий состояние кластера.
- Домен отказа (Failure Domain) - группа OSD, которым вы разрешаете "упасть" всем вместе.
Иными словами, это группа OSD, в которые СХД не помещает разные копии одного и того же
блока данных. Например, если домен отказа - сервер, то на двух дисках одного сервера
никогда не окажется 2 и более копий одного и того же блока данных, а значит, даже
если в этом сервере откажут все диски, это будет равносильно потере только 1 копии
любого блока данных.
- Дерево распределения (Placement Tree / CRUSH Tree) - иерархическая группировка OSD
в узлы, которые далее можно использовать как домены отказа. То есть, диск (OSD) входит в
сервер, сервер входит в стойку, стойка входит в ряд, ряд в датацентр и т.п.
Чем Vitastor отличается от Ceph:
- Vitastor в первую очередь сфокусирован на SSD. Также Vitastor, вероятно, должен неплохо работать
с комбинацией SSD и HDD через bcache, а в будущем, возможно, будут добавлены и нативные способы
оптимизации под SSD+HDD. Однако хранилище на основе одних лишь жёстких дисков, вообще без SSD,
не в приоритете, поэтому оптимизации под этот кейс могут вообще не состояться.
- OSD Vitastor однопоточный и всегда таким останется, так как это самый оптимальный способ работы.
Если вам не хватает 1 ядра на 1 диск, просто делите диск на разделы и запускайте на нём несколько OSD.
Но, скорее всего, вам хватит и 1 ядра - Vitastor не так прожорлив к ресурсам CPU, как Ceph.
- Журнал и метаданные всегда размещаются в памяти, благодаря чему никогда не тратится лишнее время
на чтение метаданных с диска. Размер метаданных линейно зависит от размера диска и блока данных,
который задаётся в конфигурации кластера и по умолчанию составляет 128 КБ. С блоком 128 КБ метаданные
занимают примерно 512 МБ памяти на 1 ТБ дискового пространства (и это всё равно меньше, чем нужно Ceph-у).
Журнал вообще не должен быть большим, например, тесты производительности в данном документе проводились
с журналом размером всего 16 МБ. Большой журнал, вероятно, даже вреден, т.к. "грязные" записи (записи,
не сброшенные из журнала) тоже занимают память и могут немного замедлять работу.
- В Vitastor нет внутреннего copy-on-write. Я считаю, что реализация CoW-хранилища гораздо сложнее,
поэтому сложнее добиться устойчиво хороших результатов. Возможно, в один прекрасный день
я придумаю красивый алгоритм для CoW-хранилища, но пока нет - внутреннего CoW в Vitastor не будет.
Всё это не относится к "внешнему" CoW (снапшотам и клонам).
- Базовый слой Vitastor - простое блочное хранилище с блоками фиксированного размера, а не сложное
объектное хранилище с расширенными возможностями, как в Ceph (RADOS).
- В Vitastor есть режим "ленивых fsync", в котором OSD группирует запросы записи перед сбросом их
на диск, что позволяет получить лучшую производительность с дешёвыми настольными SSD без конденсаторов
("Advanced Power Loss Protection" / "Capacitor-Based Power Loss Protection").
Тем не менее, такой режим всё равно медленнее использования нормальных серверных SSD и мгновенного
fsync, так как приводит к дополнительным операциям передачи данных по сети, поэтому рекомендуется
всё-таки использовать хорошие серверные диски, тем более, стоят они почти так же, как десктопные.
- PG эфемерны. Это означает, что они не хранятся на дисках и существуют только в памяти работающих OSD.
- Процессы восстановления оперируют отдельными объектами, а не целыми PG.
- PGLOG-ов нет.
- "Мониторы" не хранят данные. Конфигурация и состояние кластера хранятся в etcd в простых человекочитаемых
JSON-структурах. Мониторы Vitastor только следят за состоянием кластера и управляют перемещением данных.
В этом смысле монитор Vitastor не является критичным компонентом системы и больше похож на Ceph-овский
менеджер (MGR). Монитор Vitastor написан на node.js.
- Распределение PG не основано на консистентных хешах. Вместо этого все маппинги PG хранятся прямо в etcd
(ибо нет никакой проблемы сохранить несколько сотен-тысяч записей в памяти, а не считать каждый раз хеши).
Перераспределение PG по OSD выполняется через математическую оптимизацию,
а конкретно, сведение задачи к ЛП (задаче линейного программирования) и решение оной с помощью утилиты
lp_solve. Такой подход позволяет обычно выравнивать распределение места почти идеально - равномерность
обычно составляет 96-99%, в отличие от Ceph, где на голом CRUSH-е без балансировщика обычно выходит 80-90%.
Также это позволяет минимизировать объём перемещения данных и случайность связей между OSD, а также менять
распределение вручную, не боясь сломать логику перебалансировки. В таком подходе есть и потенциальный
недостаток - есть предположение, что в очень большом кластере он может сломаться - однако вплоть до
нескольких сотен OSD подход точно работает нормально. Ну и, собственно, при необходимости легко
реализовать и консистентные хеши.
- Отдельный слой, подобный слою "CRUSH-правил", отсутствует. Вы настраиваете схемы отказоустойчивости,
домены отказа и правила выбора OSD напрямую в конфигурации пулов.
## Понимание сути производительности систем хранения
Вкратце: для быстрой хранилки задержки важнее, чем пиковые iops-ы.
Лучшая возможная задержка достигается при тестировании в 1 поток с глубиной очереди 1,
что приблизительно означает минимально нагруженное состояние кластера. В данном случае
IOPS = 1/задержка. Ни числом серверов, ни дисков, ни серверных процессов/потоков
задержка не масштабируется... Она зависит только от того, насколько быстро один
серверный процесс (и клиент) обрабатывают одну операцию.
Почему задержки важны? Потому, что некоторые приложения *не могут* использовать глубину
очереди больше 1, ибо их задача не параллелизуется. Важный пример - это все СУБД
с поддержкой консистентности (ACID), потому что все они обеспечивают её через
журналирование, а журналы пишутся последовательно и с fsync() после каждой операции.
fsync, кстати - это ещё одна очень важная вещь, про которую почти всегда забывают в тестах.
Смысл в том, что все современные диски имеют кэши/буферы записи и не гарантируют, что
данные реально физически записываются на носитель до того, как вы делаете fsync(),
который транслируется в команду сброса кэша операционной системой.
Дешёвые SSD для настольных ПК и ноутбуков очень быстрые без fsync - NVMe диски, например,
могут обработать порядка 80000 операций записи в секунду с глубиной очереди 1 без fsync.
Однако с fsync, когда они реально вынуждены писать каждый блок данных во флеш-память,
они выжимают лишь 1000-2000 операций записи в секунду (число практически постоянное
для всех моделей SSD).
Серверные SSD часто имеют суперконденсаторы, работающие как встроенный источник
бесперебойного питания и дающие дискам успеть сбросить их DRAM-кэш в постоянную
флеш-память при отключении питания. Благодаря этому диски с чистой совестью
*игнорируют fsync*, так как точно знают, что данные из кэша доедут до постоянной
памяти.
Все наиболее известные программные СХД, например, Ceph и внутренние СХД, используемые
такими облачными провайдерами, как Amazon, Google, Яндекс, медленные в смысле задержки.
В лучшем случае они дают задержки от 0.3мс на чтение и 0.6мс на запись 4 КБ блоками
даже при условии использования наилучшего возможного железа.
И это в эпоху SSD, когда вы можете пойти на рынок и купить там SSD, задержка которого
на чтение будет 0.1мс, а на запись - 0.04мс, за 100$ или даже дешевле.
Когда мне нужно быстро протестировать производительность дисковой подсистемы, я
использую следующие 6 команд, с небольшими вариациями:
- Линейная запись:
`fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX`
- Линейное чтение:
`fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=read -runtime=60 -filename=/dev/sdX`
- Запись в 1 поток (T1Q1):
`fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -fsync=1 -rw=randwrite -runtime=60 -filename=/dev/sdX`
- Чтение в 1 поток (T1Q1):
`fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randread -runtime=60 -filename=/dev/sdX`
- Параллельная запись (numjobs используется, когда 1 ядро CPU не может насытить диск):
`fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randwrite -runtime=60 -filename=/dev/sdX`
- Параллельное чтение (numjobs - аналогично):
`fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randread -runtime=60 -filename=/dev/sdX`
## Теоретическая максимальная производительность Vitastor
При использовании репликации:
- Задержка чтения в 1 поток (T1Q1): 1 сетевой RTT + 1 чтение с диска.
- Запись+fsync в 1 поток:
- С мгновенным сбросом: 2 RTT + 1 запись.
- С отложенным ("ленивым") сбросом: 4 RTT + 1 запись + 1 fsync.
- Параллельное чтение: сумма IOPS всех дисков либо производительность сети, если в сеть упрётся раньше.
- Параллельная запись: сумма IOPS всех дисков / число реплик / WA либо производительность сети, если в сеть упрётся раньше.
При использовании кодов коррекции ошибок (EC):
- Задержка чтения в 1 поток (T1Q1): 1.5 RTT + 1 чтение.
- Запись+fsync в 1 поток:
- С мгновенным сбросом: 3.5 RTT + 1 чтение + 2 записи.
- С отложенным ("ленивым") сбросом: 5.5 RTT + 1 чтение + 2 записи + 2 fsync.
- Под 0.5 на самом деле подразумевается (k-1)/k, где k - число дисков данных,
что означает, что дополнительное обращение по сети не нужно, когда операция
чтения обслуживается локально.
- Параллельное чтение: сумма IOPS всех дисков либо производительность сети, если в сеть упрётся раньше.
- Параллельная запись: сумма IOPS всех дисков / общее число дисков данных и чётности / WA либо производительность сети, если в сеть упрётся раньше.
Примечание: IOPS дисков в данном случае надо брать в смешанном режиме чтения/записи в пропорции, аналогичной формулам выше.
WA (мультипликатор записи) для 4 КБ блоков в Vitastor обычно составляет 3-5:
1. Запись метаданных в журнал
2. Запись блока данных в журнал
3. Запись метаданных в БД
4. Ещё одна запись метаданных в журнал при использовании EC
5. Запись блока данных на диск данных
Если вы найдёте SSD, хорошо работающий с 512-байтными блоками данных (Optane?),
то 1, 3 и 4 можно снизить до 512 байт (1/8 от размера данных) и получить WA всего 2.375.
Кроме того, WA снижается при использовании отложенного/ленивого сброса при параллельной
нагрузке, т.к. блоки журнала записываются на диск только когда они заполняются или явным
образом запрашивается fsync.
## Пример сравнения с Ceph
Железо - 4 сервера, в каждом:
- 6x SATA SSD Intel D3-4510 3.84 TB
- 2x Xeon Gold 6242 (16 cores @ 2.8 GHz)
- 384 GB RAM
- 1x 25 GbE сетевая карта (Mellanox ConnectX-4 LX), подключённая к свитчу Juniper QFX5200
Экономия энергии CPU отключена. В тестах и Vitastor, и Ceph развёрнуто по 2 OSD на 1 SSD.
Все результаты ниже относятся к случайной нагрузке 4 КБ блоками (если явно не указано обратное).
Производительность голых дисков:
- T1Q1 запись ~27000 iops (задержка ~0.037ms)
- T1Q1 чтение ~9800 iops (задержка ~0.101ms)
- T1Q32 запись ~60000 iops
- T1Q32 чтение ~81700 iops
Ceph 15.2.4 (Bluestore):
- T1Q1 запись ~1000 iops (задержка ~1ms)
- T1Q1 чтение ~1750 iops (задержка ~0.57ms)
- T8Q64 запись ~100000 iops, потребление CPU процессами OSD около 40 ядер на каждом сервере
- T8Q64 чтение ~480000 iops, потребление CPU процессами OSD около 40 ядер на каждом сервере
Тесты в 8 потоков проводились на 8 400GB RBD образах со всех хостов (с каждого хоста запускалось 2 процесса fio).
Это нужно потому, что в Ceph несколько RBD-клиентов, пишущих в 1 образ, очень сильно замедляются.
Настройки RocksDB и Bluestore в Ceph не менялись, единственным изменением было отключение cephx_sign_messages.
На самом деле, результаты теста не такие уж и плохие для Ceph (могло быть хуже).
Собственно говоря, эти серверы как раз хорошо сбалансированы для Ceph - 6 SATA SSD как раз
утилизируют 25-гигабитную сеть, а без 2 мощных процессоров Ceph-у бы не хватило ядер,
чтобы выдать пристойный результат. Собственно, что и показывает жор 40 ядер в процессе
параллельного теста.
Vitastor:
- T1Q1 запись: 7087 iops (задержка 0.14ms)
- T1Q1 чтение: 6838 iops (задержка 0.145ms)
- T2Q64 запись: 162000 iops, потребление CPU - 3 ядра на каждом сервере
- T8Q64 чтение: 895000 iops, потребление CPU - 4 ядра на каждом сервере
- Линейная запись (4M T1Q32): 2800 МБ/с
- Линейное чтение (4M T1Q32): 1500 МБ/с
Тест на чтение в 8 потоков проводился на 1 большом образе (3.2 ТБ) со всех хостов (опять же, по 2 fio с каждого).
В Vitastor никакой разницы между 1 образом и 8-ю нет. Естественно, примерно 1/4 запросов чтения
в такой конфигурации, как и в тестах Ceph выше, обслуживалась с локальной машины. Если проводить
тест так, чтобы все операции всегда обращались к первичным OSD по сети - тест сильнее упирался
в сеть и результат составлял примерно 689000 iops.
Настройки Vitastor: `--disable_data_fsync true --immediate_commit all --flusher_count 8
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096
--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024
--journal_size 16777216`.
### EC/XOR 2+1
Vitastor:
- T1Q1 запись: 2808 iops (задержка ~0.355ms)
- T1Q1 чтение: 6190 iops (задержка ~0.16ms)
- T2Q64 запись: 85500 iops, потребление CPU - 3.4 ядра на каждом сервере
- T8Q64 чтение: 812000 iops, потребление CPU - 4.7 ядра на каждом сервере
- Линейная запись (4M T1Q32): 3200 МБ/с
- Линейное чтение (4M T1Q32): 1800 МБ/с
Ceph:
- T1Q1 запись: 730 iops (задержка ~1.37ms latency)
- T1Q1 чтение: 1500 iops с холодным кэшем метаданных (задержка ~0.66ms), 2300 iops через 2 минуты прогрева (задержка ~0.435ms)
- T4Q128 запись (4 RBD images): 45300 iops, потребление CPU - 30 ядер на каждом сервере
- T8Q64 чтение (4 RBD images): 278600 iops, потребление CPU - 40 ядер на каждом сервере
- Линейная запись (4M T1Q32): 1950 МБ/с в пустой образ, 2500 МБ/с в заполненный образ
- Линейное чтение (4M T1Q32): 2400 МБ/с
### NBD
NBD расшифровывается как "сетевое блочное устройство", но на самом деле оно также
работает просто как аналог FUSE для блочных устройств, то есть, представляет собой
"блочное устройство в пространстве пользователя".
NBD - на данный момент единственный способ монтировать Vitastor ядром Linux.
NBD немного снижает производительность, так как приводит к дополнительным копированиям
данных между ядром и пространством пользователя. Тем не менее, способ достаточно оптимален,
а производительность случайного доступа вообще затрагивается слабо.
Vitastor с однопоточной NBD прокси на том же стенде:
- T1Q1 запись: 6000 iops (задержка 0.166ms)
- T1Q1 чтение: 5518 iops (задержка 0.18ms)
- T1Q128 запись: 94400 iops
- T1Q128 чтение: 103000 iops
- Линейная запись (4M T1Q128): 1266 МБ/с (в сравнении с 2800 МБ/с через fio)
- Линейное чтение (4M T1Q128): 975 МБ/с (в сравнении с 1500 МБ/с через fio)
## Установка
### Debian
- Добавьте ключ репозитория Vitastor:
`wget -q -O - https://vitastor.io/debian/pubkey | sudo apt-key add -`
- Добавьте репозиторий Vitastor в /etc/apt/sources.list:
- Debian 11 (Bullseye/Sid): `deb https://vitastor.io/debian bullseye main`
- Debian 10 (Buster): `deb https://vitastor.io/debian buster main`
- Для Debian 10 (Buster) также включите репозиторий backports:
`deb http://deb.debian.org/debian buster-backports main`
- Установите пакеты: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu`
### CentOS
- Добавьте в систему репозиторий Vitastor:
- CentOS 7: `yum install https://vitastor.io/rpms/centos/7/vitastor-release-1.0-1.el7.noarch.rpm`
- CentOS 8: `dnf install https://vitastor.io/rpms/centos/8/vitastor-release-1.0-1.el8.noarch.rpm`
- Включите EPEL: `yum/dnf install epel-release`
- Включите дополнительные репозитории CentOS:
- CentOS 7: `yum install centos-release-scl`
- CentOS 8: `dnf install centos-release-advanced-virtualization`
- Включите elrepo-kernel:
- CentOS 7: `yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm`
- CentOS 8: `dnf install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm`
- Установите пакеты: `yum/dnf install vitastor lpsolve etcd kernel-ml qemu-kvm`
### Установка из исходников
- Установите ядро 5.4 или более новое, для поддержки io_uring. Желательно 5.8 или даже новее,
так как в 5.4 есть как минимум 1 известный баг, ведущий к зависанию с io_uring и контроллером HP SmartArray.
- Установите liburing 0.4 или более новый и его заголовки.
- Установите lp_solve.
- Установите etcd, версии не ниже 3.4.15. Более ранние версии работать не будут из-за различных багов,
например [#12402](https://github.com/etcd-io/etcd/pull/12402). Также вы можете взять версию 3.4.13 с
этим конкретным исправлением из ветки release-3.4 репозитория https://github.com/vitalif/etcd/.
- Установите node.js 10 или новее.
- Установите gcc и g++ 8.x или новее.
- Склонируйте данный репозиторий с подмодулями: `git clone https://yourcmc.ru/git/vitalif/vitastor/`.
- Желательно пересобрать QEMU с патчем, который делает необязательным запуск через LD_PRELOAD.
См `patches/qemu-*.*-vitastor.patch` - выберите версию, наиболее близкую вашей версии QEMU.
- Установите QEMU 3.0 или новее, возьмите исходные коды установленного пакета, начните его пересборку,
через некоторое время остановите её и скопируйте следующие заголовки:
- `<qemu>/include` &rarr; `<vitastor>/qemu/include`
- Debian:
* Берите qemu из основного репозитория
* `<qemu>/b/qemu/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
* `<qemu>/b/qemu/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
- CentOS 8:
* Берите qemu из репозитория Advanced-Virtualization. Чтобы включить его, запустите
`yum install centos-release-advanced-virtualization.noarch` и далее `yum install qemu`
* `<qemu>/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
* Для QEMU 3.0+: `<qemu>/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
* Для QEMU 2.0+: `<qemu>/qapi-types.h` &rarr; `<vitastor>/qemu/b/qemu/qapi-types.h`
- `config-host.h` и `qapi` нужны, т.к. в них содержатся автогенерируемые заголовки
- Установите fio 3.7 или новее, возьмите исходники пакета и сделайте на них симлинк с `<vitastor>/fio`.
- Соберите и установите Vitastor командой `mkdir build && cd build && cmake .. && make -j8 && make install`.
Обратите внимание на переменную cmake `QEMU_PLUGINDIR` - под RHEL её нужно установить равной `qemu-kvm`.
## Запуск
Внимание: процедура пока что достаточно нетривиальная, задавать конфигурацию и смещения
на диске нужно почти вручную. Это будет исправлено в ближайшем будущем.
- Желательны SATA SSD или NVMe диски с конденсаторами (серверные SSD). Можно использовать и
десктопные SSD, включив режим отложенного fsync, но производительность однопоточной записи
в этом случае пострадает.
- Быстрая сеть, минимум 10 гбит/с
- Для наилучшей производительности нужно отключить энергосбережение CPU: `cpupower idle-set -D 0 && cpupower frequency-set -g performance`.
- Пропишите нужные вам значения вверху файлов `/usr/lib/vitastor/mon/make-units.sh` и `/usr/lib/vitastor/mon/make-osd.sh`.
- Создайте юниты systemd для etcd и мониторов: `/usr/lib/vitastor/mon/make-units.sh`
- Создайте юниты для OSD: `/usr/lib/vitastor/mon/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
- Вы можете поменять параметры OSD в юнитах systemd. Смысл некоторых параметров:
- `disable_data_fsync 1` - отключает fsync, используется с SSD с конденсаторами.
- `immediate_commit all` - используется с SSD с конденсаторами.
- `disable_device_lock 1` - отключает блокировку файла устройства, нужно, только если вы запускаете
несколько OSD на одном блочном устройстве.
- `flusher_count 256` - "flusher" - микропоток, удаляющий старые данные из журнала.
Не волнуйтесь об этой настройке, 256 теперь достаточно практически всегда.
- `disk_alignment`, `journal_block_size`, `meta_block_size` следует установить равными размеру
внутреннего блока SSD. Это почти всегда 4096.
- `journal_no_same_sector_overwrites true` запрещает перезапись одного и того же сектора журнала подряд
много раз в процессе записи. Большинство (99%) SSD не нуждаются в данной опции. Однако выяснилось, что
диски, используемые на одном из тестовых стендов - Intel D3-S4510 - очень сильно не любят такую
перезапись, и для них была добавлена эта опция. Когда данный режим включён, также нужно поднимать
значение `journal_sector_buffer_count`, так как иначе Vitastor не хватит буферов для записи в журнал.
- Запустите все etcd: `systemctl start etcd`
- Создайте глобальную конфигурацию в etcd: `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
(если все ваши диски - серверные с конденсаторами).
- Создайте пулы: `etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.
Для jerasure EC-пулов конфигурация должна выглядеть так: `2:{"name":"ecpool","scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}`.
- Запустите все OSD: `systemctl start vitastor.target`
- Ваш кластер должен быть готов - один из мониторов должен уже сконфигурировать PG, а OSD должны запустить их.
- Вы можете проверить состояние PG прямо в etcd: `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. Все PG должны быть 'active'.
### Задать имя образу
```
etcdctl --endpoints=<etcd> put /vitastor/config/inode/<pool>/<inode> '{"name":"<name>","size":<size>[,"parent_id":<parent_inode_number>][,"readonly":true]}'
```
Например:
```
etcdctl --endpoints=http://10.115.0.10:2379/v3 put /vitastor/config/inode/1/1 '{"name":"testimg","size":2147483648}'
```
Если вы зададите parent_id, то образ станет CoW-клоном, т.е. все новые запросы записи пойдут в новый инод, а запросы
чтения будут проверять сначала его, а потом родительские слои по цепочке вверх. Чтобы случайно не перезаписать данные
в родительском слое, вы можете переключить его в режим "только чтение", добавив флаг `"readonly":true` в его запись
метаданных. В таком случае родительский образ становится просто снапшотом.
Таким образом, для создания снапшота вам нужно просто переименовать предыдущий inode (например, из testimg в testimg@0),
сделать его readonly и создать новый слой с исходным именем образа (testimg), ссылающийся на только что переименованный
в качестве родительского.
### Запуск тестов с fio
Пример команды для запуска тестов:
```
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -image=testimg
```
Если вы не хотите обращаться к образу по имени, вместо `-image=testimg` можно указать номер пула, номер инода и размер:
`-pool=1 -inode=1 -size=400G`.
### Загрузить образ диска ВМ в/из Vitastor
Используйте qemu-img и строку `vitastor:etcd_host=<HOST>:image=<IMAGE>` в качестве имени файла диска. Например:
```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg'
```
Обратите внимание, что если вы используете немодифицированный QEMU, потребуется установить переменную окружения
`LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so`.
Если вы не хотите обращаться к образу по имени, вместо `:image=<IMAGE>` можно указать номер пула, номер инода и размер:
`:pool=<POOL>:inode=<INODE>:size=<SIZE>`.
### Запустить ВМ
Для запуска QEMU используйте опцию `-drive file=vitastor:etcd_host=<HOST>:image=<IMAGE>` (аналогично qemu-img)
и физический размер блока 4 KB.
Например:
```
qemu-system-x86_64 -enable-kvm -m 1024
-drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg',format=raw,if=none,id=drive-virtio-disk0,cache=none
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512
-vnc 0.0.0.0:0
```
Обращение по номерам (`:pool=<POOL>:inode=<INODE>:size=<SIZE>` вместо `:image=<IMAGE>`) работает аналогично qemu-img.
### Удалить образ
Используйте утилиту vitastor-rm. Например:
```
vitastor-rm --etcd_address 10.115.0.10:2379/v3 --pool 1 --inode 1 --parallel_osds 16 --iodepth 32
```
### NBD
Чтобы создать локальное блочное устройство, используйте NBD. Например:
```
vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg
```
Команда напечатает название устройства вида /dev/nbd0, которое потом можно будет форматировать
и использовать как обычное блочное устройство.
Для обращения по номеру инода, аналогично другим командам, можно использовать опции
`--pool <POOL> --inode <INODE> --size <SIZE>` вместо `--image testimg`.
### Kubernetes
У Vitastor есть CSI-плагин для Kubernetes, поддерживающий RWO-тома.
Для установки возьмите манифесты из директории [csi/deploy/](csi/deploy/), поместите
вашу конфигурацию подключения к Vitastor в [csi/deploy/001-csi-config-map.yaml](001-csi-config-map.yaml),
настройте StorageClass в [csi/deploy/009-storage-class.yaml](009-storage-class.yaml)
и примените все `NNN-*.yaml` к вашей инсталляции Kubernetes.
```
for i in ./???-*.yaml; do kubectl apply -f $i; done
```
После этого вы сможете создавать PersistentVolume. Пример смотрите в файле [csi/deploy/example-pvc.yaml](csi/deploy/example-pvc.yaml).
## Известные проблемы
- Запросы удаления объектов могут в данный момент приводить к "неполным" объектам в EC-пулах,
если в процессе удаления произойдут отказы OSD или серверов, потому что правильная обработка
запросов удаления в кластере должна быть "трёхфазной", а это пока не реализовано. Если вы
столкнётесь с такой ситуацией, просто повторите запрос удаления.
## Принципы реализации
- Я люблю архитектурно простые решения. Vitastor проектируется именно так и я намерен
и далее следовать данному принципу.
- Если вы пришли сюда за идеальным кодом на C++, вы, вероятно, не по адресу. "Общепринятые"
практики написания C++ кода меня не очень волнуют, так как зачастую, опять-таки, ведут к
излишним усложнениям и код получается красивый... но медленный.
- По той же причине в коде иногда можно встретить велосипеды типа собственного упрощённого
HTTP-клиента для работы с etcd. Зато эти велосипеды маленькие и компактные и не требуют
использования десятка внешних библиотек.
- node.js для монитора - не случайный выбор. Он очень быстрый, имеет встроенную событийную
машину, приятный нейтральный C-подобный язык программирования и развитую инфраструктуру.
## Автор и лицензия
Автор: Виталий Филиппов (vitalif [at] yourcmc.ru), 2019+
Заходите в Telegram-чат Vitastor: https://t.me/vitastor
Лицензия: VNPL 1.1 на серверный код и двойная VNPL 1.1 + GPL 2.0+ на клиентский.
VNPL - "сетевой копилефт", собственная свободная копилефт-лицензия
Vitastor Network Public License 1.1, основанная на GNU GPL 3.0 с дополнительным
условием "Сетевого взаимодействия", требующим распространять все программы,
специально разработанные для использования вместе с Vitastor и взаимодействующие
с ним по сети, под лицензией VNPL или под любой другой свободной лицензией.
Идея VNPL - расширение действия копилефта не только на модули, явным образом
связываемые с кодом Vitastor, но также на модули, оформленные в виде микросервисов
и взаимодействующие с ним по сети.
Таким образом, если вы хотите построить на основе Vitastor сервис, содержаший
компоненты с закрытым кодом, взаимодействующие с Vitastor, вам нужна коммерческая
лицензия от автора 😀.
На Windows и любое другое ПО, не разработанное *специально* для использования
вместе с Vitastor, никакие ограничения не накладываются.
Клиентские библиотеки распространяются на условиях двойной лицензии VNPL 1.0
и также на условиях GNU GPL 2.0 или более поздней версии. Так сделано в целях
совместимости с таким ПО, как QEMU и fio.
Вы можете найти полный текст VNPL 1.1 в файле [VNPL-1.1.txt](VNPL-1.1.txt),
а GPL 2.0 в файле [GPL-2.0.txt](GPL-2.0.txt).

159
README.md
View File

@@ -1,5 +1,7 @@
## Vitastor
[Читать на русском](README-ru.md)
## The Idea
Make Software-Defined Block Storage Great Again.
@@ -34,21 +36,26 @@ breaking changes in the future. However, the following is implemented:
- NBD proxy for kernel mounts
- Inode removal tool (vitastor-rm)
- Packaging for Debian and CentOS
- Per-inode I/O and space usage statistics
- Inode metadata storage in etcd
- Snapshots and copy-on-write image clones
- Write throttling to smooth random write workloads in SSD+HDD configurations
- RDMA/RoCEv2 support via libibverbs
- CSI plugin for Kubernetes
- Basic OpenStack support: Cinder driver, Nova and libvirt patches
## Roadmap
- OSD creation tool (OSDs currently have to be created by hand)
- Snapshot deletion (layer merge) support
- Better OSD creation and auto-start tools
- Other administrative tools
- Per-inode I/O and space usage statistics
- Proxmox and OpenNebula plugins
- Plugins for OpenNebula, Proxmox and other cloud systems
- iSCSI proxy
- Inode metadata storage in etcd
- Snapshots and copy-on-write image clones
- Operation timeouts and better failure detection
- Faster failover
- Scrubbing without checksums (verification of replicas)
- Checksums
- SSD+HDD optimizations, possibly including tiered storage and soft journal flushes
- RDMA and NVDIMM support
- Tiered storage
- NVDIMM support
- Web GUI
- Compression (possibly)
- Read caching using system page cache (possibly)
@@ -291,7 +298,7 @@ Vitastor with single-thread NBD on the same hardware:
- Debian 10 (Buster): `deb https://vitastor.io/debian buster main`
- For Debian 10 (Buster) also enable backports repository:
`deb http://deb.debian.org/debian buster-backports main`
- Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64`
- Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu`
### CentOS
@@ -313,10 +320,9 @@ Vitastor with single-thread NBD on the same hardware:
there is at least one known io_uring hang with 5.4 and an HP SmartArray controller.
- Install liburing 0.4 or newer and its headers.
- Install lp_solve.
- Install etcd. Attention: you need a fixed version from here: https://github.com/vitalif/etcd/,
branch release-3.4, because there is a bug in upstream etcd which makes Vitastor OSDs fail to
move PGs out of "starting" state if you have at least around ~500 PGs or so. The custom build
will be unnecessary when etcd merges the fix: https://github.com/etcd-io/etcd/pull/12402.
- Install etcd, at least version 3.4.15. Earlier versions won't work because of various bugs,
for example [#12402](https://github.com/etcd-io/etcd/pull/12402). You can also take 3.4.13
with this specific fix from here: https://github.com/vitalif/etcd/, branch release-3.4.
- Install node.js 10 or newer.
- Install gcc and g++ 8.x or newer.
- Clone https://yourcmc.ru/git/vitalif/vitastor/ with submodules.
@@ -334,7 +340,7 @@ Vitastor with single-thread NBD on the same hardware:
* For QEMU 2.0+: `<qemu>/qapi-types.h` &rarr; `<vitastor>/qemu/b/qemu/qapi-types.h`
- `config-host.h` and `qapi` are required because they contain generated headers
- You can also rebuild QEMU with a patch that makes LD_PRELOAD unnecessary to load vitastor driver.
See `qemu-*.*-vitastor.patch`.
See `patches/qemu-*.*-vitastor.patch`.
- Install fio 3.7 or later, get its source and symlink it into `<vitastor>/fio`.
- Build & install Vitastor with `mkdir build && cd build && cmake .. && make -j8 && make install`.
Pay attention to the `QEMU_PLUGINDIR` cmake option - it must be set to `qemu-kvm` on RHEL.
@@ -374,34 +380,113 @@ and calculate disk offsets almost by hand. This will be fixed in near future.
For jerasure pools the configuration should look like the following: `2:{"name":"ecpool","scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}`.
- At this point, one of the monitors will configure PGs and OSDs will start them.
- You can check PG states with `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. All PGs should become 'active'.
- Run tests with (for example): `fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -pool=1 -inode=1 -size=400G`.
- Upload VM disk image with qemu-img (for example):
```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648'
```
Note that the command requires to be run with `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so qemu-img ...`
if you use unmodified QEMU.
- Run QEMU with (for example):
```
qemu-system-x86_64 -enable-kvm -m 1024
-drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648',format=raw,if=none,id=drive-virtio-disk0,cache=none
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512
-vnc 0.0.0.0:0
```
- Remove inode with (for example):
```
vitastor-rm --etcd_address 10.115.0.10:2379/v3 --pool 1 --inode 1 --parallel_osds 16 --iodepth 32
```
### Name an image
```
etcdctl --endpoints=<etcd> put /vitastor/config/inode/<pool>/<inode> '{"name":"<name>","size":<size>[,"parent_id":<parent_inode_number>][,"readonly":true]}'
```
For example:
```
etcdctl --endpoints=http://10.115.0.10:2379/v3 put /vitastor/config/inode/1/1 '{"name":"testimg","size":2147483648}'
```
If you specify parent_id the image becomes a CoW clone. I.e. all writes go to the new inode and reads first check it
and then upper layers. You can then make parent readonly by updating its entry with `"readonly":true` for safety and
basically treat it as a snapshot.
So to create a snapshot you basically rename the previous upper layer (for example from testimg to testimg@0), make it readonly
and create a new top layer with the original name (testimg) and the previous one as a parent.
### Run fio benchmarks
fio command example:
```
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -image=testimg
```
If you don't want to access your image by name, you can specify pool number, inode number and size
(`-pool=1 -inode=1 -size=400G`) instead of the image name (`-image=testimg`).
### Upload VM image
Use qemu-img and `vitastor:etcd_host=<HOST>:image=<IMAGE>` disk filename. For example:
```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg'
```
Note that the command requires to be run with `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so qemu-img ...`
if you use unmodified QEMU.
You can also specify `:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>`
if you don't want to use inode metadata.
### Start a VM
Run QEMU with `-drive file=vitastor:etcd_host=<HOST>:image=<IMAGE>` and use 4 KB physical block size.
For example:
```
qemu-system-x86_64 -enable-kvm -m 1024
-drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg',format=raw,if=none,id=drive-virtio-disk0,cache=none
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512
-vnc 0.0.0.0:0
```
You can also specify `:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>`,
just like in qemu-img.
### Remove inode
Use vitastor-rm. For example:
```
vitastor-rm --etcd_address 10.115.0.10:2379/v3 --pool 1 --inode 1 --parallel_osds 16 --iodepth 32
```
### NBD
To create a local block device for a Vitastor image, use NBD. For example:
```
vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg
```
It will output the device name, like /dev/nbd0 which you can then format and mount as a normal block device.
Again, you can use `--pool <POOL> --inode <INODE> --size <SIZE>` insteaf of `--image <IMAGE>` if you want.
### Kubernetes
Vitastor has a CSI plugin for Kubernetes which supports RWO volumes.
To deploy it, take manifests from [csi/deploy/](csi/deploy/) directory, put your
Vitastor configuration in [csi/deploy/001-csi-config-map.yaml](001-csi-config-map.yaml),
configure storage class in [csi/deploy/009-storage-class.yaml](009-storage-class.yaml)
and apply all `NNN-*.yaml` manifests to your Kubernetes installation:
```
for i in ./???-*.yaml; do kubectl apply -f $i; done
```
After that you'll be able to create PersistentVolumes. See example in [csi/deploy/example-pvc.yaml](csi/deploy/example-pvc.yaml).
## Known Problems
- Object deletion requests may currently lead to 'incomplete' objects if your OSDs crash during
deletion because proper handling of object cleanup in a cluster should be "three-phase"
and it's currently not implemented. Just to repeat the removal again in this case.
- Object deletion requests may currently lead to 'incomplete' objects in EC pools
if your OSDs crash during deletion because proper handling of object cleanup
in a cluster should be "three-phase" and it's currently not implemented.
Just repeat the removal request again in this case.
## Implementation Principles
- I like simple and stupid solutions, so expect Vitastor to stay simple.
- I like architecturally simple solutions. Vitastor is and will always be designed
exactly like that.
- I also like reinventing the wheel to some extent, like writing my own HTTP client
for etcd interaction instead of using prebuilt libraries, because in this case
I'm confident about what my code does and what it doesn't do.
@@ -416,7 +501,7 @@ and calculate disk offsets almost by hand. This will be fixed in near future.
Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
You can also find me in the Russian Telegram Ceph chat: https://t.me/ceph_ru
Join Vitastor Telegram Chat: https://t.me/vitastor
All server-side code (OSD, Monitor and so on) is licensed under the terms of
Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on

3
csi/.dockerignore Normal file
View File

@@ -0,0 +1,3 @@
vitastor-csi
go.sum
Dockerfile

32
csi/Dockerfile Normal file
View File

@@ -0,0 +1,32 @@
# Compile stage
FROM golang:buster AS build
ADD go.mod /app/
RUN cd /app; CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go mod download -x
ADD . /app
RUN perl -i -e '$/ = undef; while(<>) { s/\n\s*(\{\s*\n)/$1\n/g; s/\}(\s*\n\s*)else\b/$1} else/g; print; }' `find /app -name '*.go'`
RUN cd /app; CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go build -o vitastor-csi
# Final stage
FROM debian:buster
LABEL maintainers="Vitaliy Filippov <vitalif@yourcmc.ru>"
LABEL description="Vitastor CSI Driver"
ENV NODE_ID=""
ENV CSI_ENDPOINT=""
RUN apt-get update && \
apt-get install -y wget && \
wget -q -O /etc/apt/trusted.gpg.d/vitastor.gpg https://vitastor.io/debian/pubkey.gpg && \
(echo deb http://vitastor.io/debian buster main > /etc/apt/sources.list.d/vitastor.list) && \
(echo deb http://deb.debian.org/debian buster-backports main > /etc/apt/sources.list.d/backports.list) && \
(echo "APT::Install-Recommends false;" > /etc/apt/apt.conf) && \
apt-get update && \
apt-get install -y e2fsprogs xfsprogs vitastor kmod && \
apt-get clean && \
(echo options nbd nbds_max=128 > /etc/modprobe.d/nbd.conf)
COPY --from=build /app/vitastor-csi /bin/
ENTRYPOINT ["/bin/vitastor-csi"]

9
csi/Makefile Normal file
View File

@@ -0,0 +1,9 @@
VERSION ?= v0.6.5
all: build push
build:
@docker build --rm -t vitalif/vitastor-csi:$(VERSION) .
push:
@docker push vitalif/vitastor-csi:$(VERSION)

View File

@@ -0,0 +1,5 @@
---
apiVersion: v1
kind: Namespace
metadata:
name: vitastor-system

View File

@@ -0,0 +1,9 @@
---
apiVersion: v1
kind: ConfigMap
data:
vitastor.conf: |-
{"etcd_address":"http://192.168.7.2:2379","etcd_prefix":"/vitastor"}
metadata:
namespace: vitastor-system
name: vitastor-config

View File

@@ -0,0 +1,37 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get"]
# allow to read Vault Token and connection options from the Tenants namespace
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin
subjects:
- kind: ServiceAccount
name: vitastor-csi-nodeplugin
namespace: vitastor-system
roleRef:
kind: ClusterRole
name: vitastor-csi-nodeplugin
apiGroup: rbac.authorization.k8s.io

View File

@@ -0,0 +1,72 @@
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin-psp
spec:
allowPrivilegeEscalation: true
allowedCapabilities:
- 'SYS_ADMIN'
fsGroup:
rule: RunAsAny
privileged: true
hostNetwork: true
hostPID: true
runAsUser:
rule: RunAsAny
seLinux:
rule: RunAsAny
supplementalGroups:
rule: RunAsAny
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
- 'hostPath'
allowedHostPaths:
- pathPrefix: '/dev'
readOnly: false
- pathPrefix: '/run/mount'
readOnly: false
- pathPrefix: '/sys'
readOnly: false
- pathPrefix: '/lib/modules'
readOnly: true
- pathPrefix: '/var/lib/kubelet/pods'
readOnly: false
- pathPrefix: '/var/lib/kubelet/plugins/csi.vitastor.io'
readOnly: false
- pathPrefix: '/var/lib/kubelet/plugins_registry'
readOnly: false
- pathPrefix: '/var/lib/kubelet/plugins'
readOnly: false
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin-psp
rules:
- apiGroups: ['policy']
resources: ['podsecuritypolicies']
verbs: ['use']
resourceNames: ['vitastor-csi-nodeplugin-psp']
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin-psp
subjects:
- kind: ServiceAccount
name: vitastor-csi-nodeplugin
namespace: vitastor-system
roleRef:
kind: Role
name: vitastor-csi-nodeplugin-psp
apiGroup: rbac.authorization.k8s.io

View File

@@ -0,0 +1,140 @@
---
kind: DaemonSet
apiVersion: apps/v1
metadata:
namespace: vitastor-system
name: csi-vitastor
spec:
selector:
matchLabels:
app: csi-vitastor
template:
metadata:
namespace: vitastor-system
labels:
app: csi-vitastor
spec:
serviceAccountName: vitastor-csi-nodeplugin
hostNetwork: true
hostPID: true
priorityClassName: system-node-critical
# to use e.g. Rook orchestrated cluster, and mons' FQDN is
# resolved through k8s service, set dns policy to cluster first
dnsPolicy: ClusterFirstWithHostNet
containers:
- name: driver-registrar
# This is necessary only for systems with SELinux, where
# non-privileged sidecar containers cannot access unix domain socket
# created by privileged CSI driver container.
securityContext:
privileged: true
image: k8s.gcr.io/sig-storage/csi-node-driver-registrar:v2.2.0
args:
- "--v=5"
- "--csi-address=/csi/csi.sock"
- "--kubelet-registration-path=/var/lib/kubelet/plugins/csi.vitastor.io/csi.sock"
env:
- name: KUBE_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: registration-dir
mountPath: /registration
- name: csi-vitastor
securityContext:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
allowPrivilegeEscalation: true
image: vitalif/vitastor-csi:v0.6.5
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"
env:
- name: NODE_ID
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: CSI_ENDPOINT
value: unix:///csi/csi.sock
imagePullPolicy: "IfNotPresent"
ports:
- containerPort: 9898
name: healthz
protocol: TCP
livenessProbe:
failureThreshold: 5
httpGet:
path: /healthz
port: healthz
initialDelaySeconds: 10
timeoutSeconds: 3
periodSeconds: 2
volumeMounts:
- name: socket-dir
mountPath: /csi
- mountPath: /dev
name: host-dev
- mountPath: /sys
name: host-sys
- mountPath: /run/mount
name: host-mount
- mountPath: /lib/modules
name: lib-modules
readOnly: true
- name: vitastor-config
mountPath: /etc/vitastor
- name: plugin-dir
mountPath: /var/lib/kubelet/plugins
mountPropagation: "Bidirectional"
- name: mountpoint-dir
mountPath: /var/lib/kubelet/pods
mountPropagation: "Bidirectional"
- name: liveness-probe
securityContext:
privileged: true
image: quay.io/k8scsi/livenessprobe:v1.1.0
args:
- "--csi-address=$(CSI_ENDPOINT)"
- "--health-port=9898"
env:
- name: CSI_ENDPOINT
value: unix://csi/csi.sock
volumeMounts:
- mountPath: /csi
name: socket-dir
volumes:
- name: socket-dir
hostPath:
path: /var/lib/kubelet/plugins/csi.vitastor.io
type: DirectoryOrCreate
- name: plugin-dir
hostPath:
path: /var/lib/kubelet/plugins
type: Directory
- name: mountpoint-dir
hostPath:
path: /var/lib/kubelet/pods
type: DirectoryOrCreate
- name: registration-dir
hostPath:
path: /var/lib/kubelet/plugins_registry/
type: Directory
- name: host-dev
hostPath:
path: /dev
- name: host-sys
hostPath:
path: /sys
- name: host-mount
hostPath:
path: /run/mount
- name: lib-modules
hostPath:
path: /lib/modules
- name: vitastor-config
configMap:
name: vitastor-config

View File

@@ -0,0 +1,102 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-external-provisioner-runner
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["events"]
verbs: ["list", "watch", "create", "update", "patch"]
- apiGroups: [""]
resources: ["persistentvolumes"]
verbs: ["get", "list", "watch", "create", "update", "delete", "patch"]
- apiGroups: [""]
resources: ["persistentvolumeclaims"]
verbs: ["get", "list", "watch", "update"]
- apiGroups: [""]
resources: ["persistentvolumeclaims/status"]
verbs: ["update", "patch"]
- apiGroups: ["storage.k8s.io"]
resources: ["storageclasses"]
verbs: ["get", "list", "watch"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshots"]
verbs: ["get", "list"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshotcontents"]
verbs: ["create", "get", "list", "watch", "update", "delete"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshotclasses"]
verbs: ["get", "list", "watch"]
- apiGroups: ["storage.k8s.io"]
resources: ["volumeattachments"]
verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: ["storage.k8s.io"]
resources: ["volumeattachments/status"]
verbs: ["patch"]
- apiGroups: ["storage.k8s.io"]
resources: ["csinodes"]
verbs: ["get", "list", "watch"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshotcontents/status"]
verbs: ["update"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner-role
subjects:
- kind: ServiceAccount
name: vitastor-csi-provisioner
namespace: vitastor-system
roleRef:
kind: ClusterRole
name: vitastor-external-provisioner-runner
apiGroup: rbac.authorization.k8s.io
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-external-provisioner-cfg
rules:
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "list", "watch", "create", "update", "delete"]
- apiGroups: ["coordination.k8s.io"]
resources: ["leases"]
verbs: ["get", "watch", "list", "delete", "update", "create"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: vitastor-csi-provisioner-role-cfg
namespace: vitastor-system
subjects:
- kind: ServiceAccount
name: vitastor-csi-provisioner
namespace: vitastor-system
roleRef:
kind: Role
name: vitastor-external-provisioner-cfg
apiGroup: rbac.authorization.k8s.io

View File

@@ -0,0 +1,60 @@
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner-psp
spec:
allowPrivilegeEscalation: true
allowedCapabilities:
- 'SYS_ADMIN'
fsGroup:
rule: RunAsAny
privileged: true
runAsUser:
rule: RunAsAny
seLinux:
rule: RunAsAny
supplementalGroups:
rule: RunAsAny
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
- 'hostPath'
allowedHostPaths:
- pathPrefix: '/dev'
readOnly: false
- pathPrefix: '/sys'
readOnly: false
- pathPrefix: '/lib/modules'
readOnly: true
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner-psp
rules:
- apiGroups: ['policy']
resources: ['podsecuritypolicies']
verbs: ['use']
resourceNames: ['vitastor-csi-provisioner-psp']
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: vitastor-csi-provisioner-psp
namespace: vitastor-system
subjects:
- kind: ServiceAccount
name: vitastor-csi-provisioner
namespace: vitastor-system
roleRef:
kind: Role
name: vitastor-csi-provisioner-psp
apiGroup: rbac.authorization.k8s.io

View File

@@ -0,0 +1,159 @@
---
kind: Service
apiVersion: v1
metadata:
namespace: vitastor-system
name: csi-vitastor-provisioner
labels:
app: csi-metrics
spec:
selector:
app: csi-vitastor-provisioner
ports:
- name: http-metrics
port: 8080
protocol: TCP
targetPort: 8680
---
kind: Deployment
apiVersion: apps/v1
metadata:
namespace: vitastor-system
name: csi-vitastor-provisioner
spec:
replicas: 3
selector:
matchLabels:
app: csi-vitastor-provisioner
template:
metadata:
namespace: vitastor-system
labels:
app: csi-vitastor-provisioner
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- csi-vitastor-provisioner
topologyKey: "kubernetes.io/hostname"
serviceAccountName: vitastor-csi-provisioner
priorityClassName: system-cluster-critical
containers:
- name: csi-provisioner
image: k8s.gcr.io/sig-storage/csi-provisioner:v2.2.0
args:
- "--csi-address=$(ADDRESS)"
- "--v=5"
- "--timeout=150s"
- "--retry-interval-start=500ms"
- "--leader-election=true"
# set it to true to use topology based provisioning
- "--feature-gates=Topology=false"
# if fstype is not specified in storageclass, ext4 is default
- "--default-fstype=ext4"
- "--extra-create-metadata=true"
env:
- name: ADDRESS
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-snapshotter
image: k8s.gcr.io/sig-storage/csi-snapshotter:v4.0.0
args:
- "--csi-address=$(ADDRESS)"
- "--v=5"
- "--timeout=150s"
- "--leader-election=true"
env:
- name: ADDRESS
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
securityContext:
privileged: true
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-attacher
image: k8s.gcr.io/sig-storage/csi-attacher:v3.1.0
args:
- "--v=5"
- "--csi-address=$(ADDRESS)"
- "--leader-election=true"
- "--retry-interval-start=500ms"
env:
- name: ADDRESS
value: /csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-resizer
image: k8s.gcr.io/sig-storage/csi-resizer:v1.1.0
args:
- "--csi-address=$(ADDRESS)"
- "--v=5"
- "--timeout=150s"
- "--leader-election"
- "--retry-interval-start=500ms"
- "--handle-volume-inuse-error=false"
env:
- name: ADDRESS
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-vitastor
securityContext:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
image: vitalif/vitastor-csi:v0.6.5
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"
env:
- name: NODE_ID
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: CSI_ENDPOINT
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- mountPath: /dev
name: host-dev
- mountPath: /sys
name: host-sys
- mountPath: /lib/modules
name: lib-modules
readOnly: true
- name: vitastor-config
mountPath: /etc/vitastor
volumes:
- name: host-dev
hostPath:
path: /dev
- name: host-sys
hostPath:
path: /sys
- name: lib-modules
hostPath:
path: /lib/modules
- name: socket-dir
emptyDir: {
medium: "Memory"
}
- name: vitastor-config
configMap:
name: vitastor-config

View File

@@ -0,0 +1,11 @@
---
# if Kubernetes version is less than 1.18 change
# apiVersion to storage.k8s.io/v1betav1
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
namespace: vitastor-system
name: csi.vitastor.io
spec:
attachRequired: true
podInfoOnMount: false

View File

@@ -0,0 +1,19 @@
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
namespace: vitastor-system
name: vitastor
annotations:
storageclass.kubernetes.io/is-default-class: "true"
provisioner: csi.vitastor.io
volumeBindingMode: Immediate
parameters:
etcdVolumePrefix: ""
poolId: "1"
# you can choose other configuration file if you have it in the config map
#configPath: "/etc/vitastor/vitastor.conf"
# you can also specify etcdUrl here, maybe to connect to another Vitastor cluster
# multiple etcdUrls may be specified, delimited by comma
#etcdUrl: "http://192.168.7.2:2379"
#etcdPrefix: "/vitastor"

View File

@@ -0,0 +1,12 @@
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-vitastor-pvc
spec:
storageClassName: vitastor
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi

35
csi/go.mod Normal file
View File

@@ -0,0 +1,35 @@
module vitastor.io/csi
go 1.15
require (
github.com/container-storage-interface/spec v1.4.0
github.com/coreos/bbolt v0.0.0-00010101000000-000000000000 // indirect
github.com/coreos/etcd v3.3.25+incompatible // indirect
github.com/coreos/go-semver v0.3.0 // indirect
github.com/coreos/go-systemd v0.0.0-20191104093116-d3cd4ed1dbcf // indirect
github.com/coreos/pkg v0.0.0-20180928190104-399ea9e2e55f // indirect
github.com/dustin/go-humanize v1.0.0 // indirect
github.com/golang/glog v0.0.0-20160126235308-23def4e6c14b
github.com/gorilla/websocket v1.4.2 // indirect
github.com/grpc-ecosystem/go-grpc-middleware v1.3.0 // indirect
github.com/grpc-ecosystem/go-grpc-prometheus v1.2.0 // indirect
github.com/grpc-ecosystem/grpc-gateway v1.16.0 // indirect
github.com/jonboulle/clockwork v0.2.2 // indirect
github.com/kubernetes-csi/csi-lib-utils v0.9.1
github.com/soheilhy/cmux v0.1.5 // indirect
github.com/tmc/grpc-websocket-proxy v0.0.0-20201229170055-e5319fda7802 // indirect
github.com/xiang90/probing v0.0.0-20190116061207-43a291ad63a2 // indirect
go.etcd.io/bbolt v0.0.0-00010101000000-000000000000 // indirect
go.etcd.io/etcd v3.3.25+incompatible
golang.org/x/net v0.0.0-20201202161906-c7110b5ffcbb
google.golang.org/grpc v1.33.1
k8s.io/klog v1.0.0
k8s.io/utils v0.0.0-20210305010621-2afb4311ab10
)
replace github.com/coreos/bbolt => go.etcd.io/bbolt v1.3.5
replace go.etcd.io/bbolt => github.com/coreos/bbolt v1.3.5
replace google.golang.org/grpc => google.golang.org/grpc v1.25.1

22
csi/src/config.go Normal file
View File

@@ -0,0 +1,22 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
const (
vitastorCSIDriverName = "csi.vitastor.io"
vitastorCSIDriverVersion = "0.6.5"
)
// Config struct fills the parameters of request or user input
type Config struct
{
Endpoint string
NodeID string
}
// NewConfig returns config struct to initialize new driver
func NewConfig() *Config
{
return &Config{}
}

530
csi/src/controllerserver.go Normal file
View File

@@ -0,0 +1,530 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"context"
"encoding/json"
"strings"
"bytes"
"strconv"
"time"
"fmt"
"os"
"os/exec"
"io/ioutil"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
"k8s.io/klog"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
"go.etcd.io/etcd/clientv3"
"github.com/container-storage-interface/spec/lib/go/csi"
)
const (
KB int64 = 1024
MB int64 = 1024 * KB
GB int64 = 1024 * MB
TB int64 = 1024 * GB
ETCD_TIMEOUT time.Duration = 15*time.Second
)
type InodeIndex struct
{
Id uint64 `json:"id"`
PoolId uint64 `json:"pool_id"`
}
type InodeConfig struct
{
Name string `json:"name"`
Size uint64 `json:"size,omitempty"`
ParentPool uint64 `json:"parent_pool,omitempty"`
ParentId uint64 `json:"parent_id,omitempty"`
Readonly bool `json:"readonly,omitempty"`
}
type ControllerServer struct
{
*Driver
}
// NewControllerServer create new instance controller
func NewControllerServer(driver *Driver) *ControllerServer
{
return &ControllerServer{
Driver: driver,
}
}
func GetConnectionParams(params map[string]string) (map[string]string, []string, string)
{
ctxVars := make(map[string]string)
configPath := params["configPath"]
if (configPath == "")
{
configPath = "/etc/vitastor/vitastor.conf"
}
else
{
ctxVars["configPath"] = configPath
}
config := make(map[string]interface{})
if configFD, err := os.Open(configPath); err == nil
{
defer configFD.Close()
data, _ := ioutil.ReadAll(configFD)
json.Unmarshal(data, &config)
}
// Try to load prefix & etcd URL from the config
var etcdUrl []string
if (params["etcdUrl"] != "")
{
ctxVars["etcdUrl"] = params["etcdUrl"]
etcdUrl = strings.Split(params["etcdUrl"], ",")
}
if (len(etcdUrl) == 0)
{
switch config["etcd_address"].(type)
{
case string:
etcdUrl = strings.Split(config["etcd_address"].(string), ",")
case []string:
etcdUrl = config["etcd_address"].([]string)
}
}
etcdPrefix := params["etcdPrefix"]
if (etcdPrefix == "")
{
etcdPrefix, _ = config["etcd_prefix"].(string)
if (etcdPrefix == "")
{
etcdPrefix = "/vitastor"
}
}
else
{
ctxVars["etcdPrefix"] = etcdPrefix
}
return ctxVars, etcdUrl, etcdPrefix
}
// Create the volume
func (cs *ControllerServer) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error)
{
klog.Infof("received controller create volume request %+v", protosanitizer.StripSecrets(req))
if (req == nil)
{
return nil, status.Errorf(codes.InvalidArgument, "request cannot be empty")
}
if (req.GetName() == "")
{
return nil, status.Error(codes.InvalidArgument, "name is a required field")
}
volumeCapabilities := req.GetVolumeCapabilities()
if (volumeCapabilities == nil)
{
return nil, status.Error(codes.InvalidArgument, "volume capabilities is a required field")
}
etcdVolumePrefix := req.Parameters["etcdVolumePrefix"]
poolId, _ := strconv.ParseUint(req.Parameters["poolId"], 10, 64)
if (poolId == 0)
{
return nil, status.Error(codes.InvalidArgument, "poolId is missing in storage class configuration")
}
volName := etcdVolumePrefix + req.GetName()
volSize := 1 * GB
if capRange := req.GetCapacityRange(); capRange != nil
{
volSize = ((capRange.GetRequiredBytes() + MB - 1) / MB) * MB
}
// FIXME: The following should PROBABLY be implemented externally in a management tool
ctxVars, etcdUrl, etcdPrefix := GetConnectionParams(req.Parameters)
if (len(etcdUrl) == 0)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
}
// Connect to etcd
cli, err := clientv3.New(clientv3.Config{
DialTimeout: ETCD_TIMEOUT,
Endpoints: etcdUrl,
})
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to connect to etcd at "+strings.Join(etcdUrl, ",")+": "+err.Error())
}
defer cli.Close()
var imageId uint64 = 0
for
{
// Check if the image exists
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, etcdPrefix+"/index/image/"+volName)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) > 0)
{
kv := resp.Kvs[0]
var v InodeIndex
err := json.Unmarshal(kv.Value, &v)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid /index/image/"+volName+" key in etcd: "+err.Error())
}
poolId = v.PoolId
imageId = v.Id
inodeCfgKey := fmt.Sprintf("/config/inode/%d/%d", poolId, imageId)
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, etcdPrefix+inodeCfgKey)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) == 0)
{
return nil, status.Error(codes.Internal, "missing "+inodeCfgKey+" key in etcd")
}
var inodeCfg InodeConfig
err = json.Unmarshal(resp.Kvs[0].Value, &inodeCfg)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid "+inodeCfgKey+" key in etcd: "+err.Error())
}
if (inodeCfg.Size < uint64(volSize))
{
return nil, status.Error(codes.Internal, "image "+volName+" is already created, but size is less than expected")
}
}
else
{
// Find a free ID
// Create image metadata in a transaction verifying that the image doesn't exist yet AND ID is still free
maxIdKey := fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId)
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, maxIdKey)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
var modRev int64
var nextId uint64
if (len(resp.Kvs) > 0)
{
var err error
nextId, err = strconv.ParseUint(string(resp.Kvs[0].Value), 10, 64)
if (err != nil)
{
return nil, status.Error(codes.Internal, maxIdKey+" contains invalid ID")
}
modRev = resp.Kvs[0].ModRevision
nextId++
}
else
{
nextId = 1
}
inodeIdxJson, _ := json.Marshal(InodeIndex{
Id: nextId,
PoolId: poolId,
})
inodeCfgJson, _ := json.Marshal(InodeConfig{
Name: volName,
Size: uint64(volSize),
})
ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
txnResp, err := cli.Txn(ctx).If(
clientv3.Compare(clientv3.ModRevision(fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId)), "=", modRev),
clientv3.Compare(clientv3.CreateRevision(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName)), "=", 0),
clientv3.Compare(clientv3.CreateRevision(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, poolId, nextId)), "=", 0),
).Then(
clientv3.OpPut(fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId), fmt.Sprintf("%d", nextId)),
clientv3.OpPut(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName), string(inodeIdxJson)),
clientv3.OpPut(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, poolId, nextId), string(inodeCfgJson)),
).Commit()
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to commit transaction in etcd: "+err.Error())
}
if (txnResp.Succeeded)
{
imageId = nextId
break
}
// Start over if the transaction fails
}
}
ctxVars["name"] = volName
volumeIdJson, _ := json.Marshal(ctxVars)
return &csi.CreateVolumeResponse{
Volume: &csi.Volume{
// Ugly, but VolumeContext isn't passed to DeleteVolume :-(
VolumeId: string(volumeIdJson),
CapacityBytes: volSize,
},
}, nil
}
// DeleteVolume deletes the given volume
func (cs *ControllerServer) DeleteVolume(ctx context.Context, req *csi.DeleteVolumeRequest) (*csi.DeleteVolumeResponse, error)
{
klog.Infof("received controller delete volume request %+v", protosanitizer.StripSecrets(req))
if (req == nil)
{
return nil, status.Error(codes.InvalidArgument, "request cannot be empty")
}
ctxVars := make(map[string]string)
err := json.Unmarshal([]byte(req.VolumeId), &ctxVars)
if (err != nil)
{
return nil, status.Error(codes.Internal, "volume ID not in JSON format")
}
volName := ctxVars["name"]
_, etcdUrl, etcdPrefix := GetConnectionParams(ctxVars)
if (len(etcdUrl) == 0)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
}
cli, err := clientv3.New(clientv3.Config{
DialTimeout: ETCD_TIMEOUT,
Endpoints: etcdUrl,
})
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to connect to etcd at "+strings.Join(etcdUrl, ",")+": "+err.Error())
}
defer cli.Close()
// Find inode by name
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, etcdPrefix+"/index/image/"+volName)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) == 0)
{
return nil, status.Error(codes.NotFound, "volume "+volName+" does not exist")
}
var idx InodeIndex
err = json.Unmarshal(resp.Kvs[0].Value, &idx)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid /index/image/"+volName+" key in etcd: "+err.Error())
}
// Get inode config
inodeCfgKey := fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, idx.PoolId, idx.Id)
ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err = cli.Get(ctx, inodeCfgKey)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) == 0)
{
return nil, status.Error(codes.NotFound, "volume "+volName+" does not exist")
}
var inodeCfg InodeConfig
err = json.Unmarshal(resp.Kvs[0].Value, &inodeCfg)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid "+inodeCfgKey+" key in etcd: "+err.Error())
}
// Delete inode data by invoking vitastor-rm
args := []string{
"--etcd_address", strings.Join(etcdUrl, ","),
"--pool", fmt.Sprintf("%d", idx.PoolId),
"--inode", fmt.Sprintf("%d", idx.Id),
}
if (ctxVars["configPath"] != "")
{
args = append(args, "--config_path", ctxVars["configPath"])
}
c := exec.Command("/usr/bin/vitastor-rm", args...)
var stderr bytes.Buffer
c.Stdout = nil
c.Stderr = &stderr
err = c.Run()
stderrStr := string(stderr.Bytes())
if (err != nil)
{
klog.Errorf("vitastor-rm failed: %s, status %s\n", stderrStr, err)
return nil, status.Error(codes.Internal, stderrStr+" (status "+err.Error()+")")
}
// Delete inode config in etcd
ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
txnResp, err := cli.Txn(ctx).Then(
clientv3.OpDelete(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName)),
clientv3.OpDelete(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, idx.PoolId, idx.Id)),
).Commit()
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to delete keys in etcd: "+err.Error())
}
if (!txnResp.Succeeded)
{
return nil, status.Error(codes.Internal, "failed to delete keys in etcd: transaction failed")
}
return &csi.DeleteVolumeResponse{}, nil
}
// ControllerPublishVolume return Unimplemented error
func (cs *ControllerServer) ControllerPublishVolume(ctx context.Context, req *csi.ControllerPublishVolumeRequest) (*csi.ControllerPublishVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerUnpublishVolume return Unimplemented error
func (cs *ControllerServer) ControllerUnpublishVolume(ctx context.Context, req *csi.ControllerUnpublishVolumeRequest) (*csi.ControllerUnpublishVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ValidateVolumeCapabilities checks whether the volume capabilities requested are supported.
func (cs *ControllerServer) ValidateVolumeCapabilities(ctx context.Context, req *csi.ValidateVolumeCapabilitiesRequest) (*csi.ValidateVolumeCapabilitiesResponse, error)
{
klog.Infof("received controller validate volume capability request %+v", protosanitizer.StripSecrets(req))
if (req == nil)
{
return nil, status.Errorf(codes.InvalidArgument, "request is nil")
}
volumeID := req.GetVolumeId()
if (volumeID == "")
{
return nil, status.Error(codes.InvalidArgument, "volumeId is nil")
}
volumeCapabilities := req.GetVolumeCapabilities()
if (volumeCapabilities == nil)
{
return nil, status.Error(codes.InvalidArgument, "volumeCapabilities is nil")
}
var volumeCapabilityAccessModes []*csi.VolumeCapability_AccessMode
for _, mode := range []csi.VolumeCapability_AccessMode_Mode{
csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER,
} {
volumeCapabilityAccessModes = append(volumeCapabilityAccessModes, &csi.VolumeCapability_AccessMode{Mode: mode})
}
capabilitySupport := false
for _, capability := range volumeCapabilities
{
for _, volumeCapabilityAccessMode := range volumeCapabilityAccessModes
{
if (volumeCapabilityAccessMode.Mode == capability.AccessMode.Mode)
{
capabilitySupport = true
}
}
}
if (!capabilitySupport)
{
return nil, status.Errorf(codes.NotFound, "%v not supported", req.GetVolumeCapabilities())
}
return &csi.ValidateVolumeCapabilitiesResponse{
Confirmed: &csi.ValidateVolumeCapabilitiesResponse_Confirmed{
VolumeCapabilities: req.VolumeCapabilities,
},
}, nil
}
// ListVolumes returns a list of volumes
func (cs *ControllerServer) ListVolumes(ctx context.Context, req *csi.ListVolumesRequest) (*csi.ListVolumesResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// GetCapacity returns the capacity of the storage pool
func (cs *ControllerServer) GetCapacity(ctx context.Context, req *csi.GetCapacityRequest) (*csi.GetCapacityResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerGetCapabilities returns the capabilities of the controller service.
func (cs *ControllerServer) ControllerGetCapabilities(ctx context.Context, req *csi.ControllerGetCapabilitiesRequest) (*csi.ControllerGetCapabilitiesResponse, error)
{
functionControllerServerCapabilities := func(cap csi.ControllerServiceCapability_RPC_Type) *csi.ControllerServiceCapability
{
return &csi.ControllerServiceCapability{
Type: &csi.ControllerServiceCapability_Rpc{
Rpc: &csi.ControllerServiceCapability_RPC{
Type: cap,
},
},
}
}
var controllerServerCapabilities []*csi.ControllerServiceCapability
for _, capability := range []csi.ControllerServiceCapability_RPC_Type{
csi.ControllerServiceCapability_RPC_CREATE_DELETE_VOLUME,
csi.ControllerServiceCapability_RPC_LIST_VOLUMES,
csi.ControllerServiceCapability_RPC_EXPAND_VOLUME,
csi.ControllerServiceCapability_RPC_CREATE_DELETE_SNAPSHOT,
} {
controllerServerCapabilities = append(controllerServerCapabilities, functionControllerServerCapabilities(capability))
}
return &csi.ControllerGetCapabilitiesResponse{
Capabilities: controllerServerCapabilities,
}, nil
}
// CreateSnapshot create snapshot of an existing PV
func (cs *ControllerServer) CreateSnapshot(ctx context.Context, req *csi.CreateSnapshotRequest) (*csi.CreateSnapshotResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// DeleteSnapshot delete provided snapshot of a PV
func (cs *ControllerServer) DeleteSnapshot(ctx context.Context, req *csi.DeleteSnapshotRequest) (*csi.DeleteSnapshotResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ListSnapshots list the snapshots of a PV
func (cs *ControllerServer) ListSnapshots(ctx context.Context, req *csi.ListSnapshotsRequest) (*csi.ListSnapshotsResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerExpandVolume resizes a volume
func (cs *ControllerServer) ControllerExpandVolume(ctx context.Context, req *csi.ControllerExpandVolumeRequest) (*csi.ControllerExpandVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerGetVolume get volume info
func (cs *ControllerServer) ControllerGetVolume(ctx context.Context, req *csi.ControllerGetVolumeRequest) (*csi.ControllerGetVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}

137
csi/src/grpc.go Normal file
View File

@@ -0,0 +1,137 @@
/*
Copyright 2017 The Kubernetes Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package vitastor
import (
"fmt"
"net"
"os"
"strings"
"sync"
"github.com/golang/glog"
"golang.org/x/net/context"
"google.golang.org/grpc"
"github.com/container-storage-interface/spec/lib/go/csi"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
)
// Defines Non blocking GRPC server interfaces
type NonBlockingGRPCServer interface {
// Start services at the endpoint
Start(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer)
// Waits for the service to stop
Wait()
// Stops the service gracefully
Stop()
// Stops the service forcefully
ForceStop()
}
func NewNonBlockingGRPCServer() NonBlockingGRPCServer {
return &nonBlockingGRPCServer{}
}
// NonBlocking server
type nonBlockingGRPCServer struct {
wg sync.WaitGroup
server *grpc.Server
}
func (s *nonBlockingGRPCServer) Start(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer) {
s.wg.Add(1)
go s.serve(endpoint, ids, cs, ns)
return
}
func (s *nonBlockingGRPCServer) Wait() {
s.wg.Wait()
}
func (s *nonBlockingGRPCServer) Stop() {
s.server.GracefulStop()
}
func (s *nonBlockingGRPCServer) ForceStop() {
s.server.Stop()
}
func (s *nonBlockingGRPCServer) serve(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer) {
proto, addr, err := ParseEndpoint(endpoint)
if err != nil {
glog.Fatal(err.Error())
}
if proto == "unix" {
addr = "/" + addr
if err := os.Remove(addr); err != nil && !os.IsNotExist(err) {
glog.Fatalf("Failed to remove %s, error: %s", addr, err.Error())
}
}
listener, err := net.Listen(proto, addr)
if err != nil {
glog.Fatalf("Failed to listen: %v", err)
}
opts := []grpc.ServerOption{
grpc.UnaryInterceptor(logGRPC),
}
server := grpc.NewServer(opts...)
s.server = server
if ids != nil {
csi.RegisterIdentityServer(server, ids)
}
if cs != nil {
csi.RegisterControllerServer(server, cs)
}
if ns != nil {
csi.RegisterNodeServer(server, ns)
}
glog.Infof("Listening for connections on address: %#v", listener.Addr())
server.Serve(listener)
}
func ParseEndpoint(ep string) (string, string, error) {
if strings.HasPrefix(strings.ToLower(ep), "unix://") || strings.HasPrefix(strings.ToLower(ep), "tcp://") {
s := strings.SplitN(ep, "://", 2)
if s[1] != "" {
return s[0], s[1], nil
}
}
return "", "", fmt.Errorf("Invalid endpoint: %v", ep)
}
func logGRPC(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
glog.V(3).Infof("GRPC call: %s", info.FullMethod)
glog.V(5).Infof("GRPC request: %s", protosanitizer.StripSecrets(req))
resp, err := handler(ctx, req)
if err != nil {
glog.Errorf("GRPC error: %v", err)
} else {
glog.V(5).Infof("GRPC response: %s", protosanitizer.StripSecrets(resp))
}
return resp, err
}

60
csi/src/identityserver.go Normal file
View File

@@ -0,0 +1,60 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"context"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
"k8s.io/klog"
"github.com/container-storage-interface/spec/lib/go/csi"
)
// IdentityServer struct of Vitastor CSI driver with supported methods of CSI identity server spec.
type IdentityServer struct
{
*Driver
}
// NewIdentityServer create new instance identity
func NewIdentityServer(driver *Driver) *IdentityServer
{
return &IdentityServer{
Driver: driver,
}
}
// GetPluginInfo returns metadata of the plugin
func (is *IdentityServer) GetPluginInfo(ctx context.Context, req *csi.GetPluginInfoRequest) (*csi.GetPluginInfoResponse, error)
{
klog.Infof("received identity plugin info request %+v", protosanitizer.StripSecrets(req))
return &csi.GetPluginInfoResponse{
Name: vitastorCSIDriverName,
VendorVersion: vitastorCSIDriverVersion,
}, nil
}
// GetPluginCapabilities returns available capabilities of the plugin
func (is *IdentityServer) GetPluginCapabilities(ctx context.Context, req *csi.GetPluginCapabilitiesRequest) (*csi.GetPluginCapabilitiesResponse, error)
{
klog.Infof("received identity plugin capabilities request %+v", protosanitizer.StripSecrets(req))
return &csi.GetPluginCapabilitiesResponse{
Capabilities: []*csi.PluginCapability{
{
Type: &csi.PluginCapability_Service_{
Service: &csi.PluginCapability_Service{
Type: csi.PluginCapability_Service_CONTROLLER_SERVICE,
},
},
},
},
}, nil
}
// Probe returns the health and readiness of the plugin
func (is *IdentityServer) Probe(ctx context.Context, req *csi.ProbeRequest) (*csi.ProbeResponse, error)
{
return &csi.ProbeResponse{}, nil
}

279
csi/src/nodeserver.go Normal file
View File

@@ -0,0 +1,279 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"context"
"os"
"os/exec"
"encoding/json"
"strings"
"bytes"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
"k8s.io/utils/mount"
utilexec "k8s.io/utils/exec"
"github.com/container-storage-interface/spec/lib/go/csi"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
"k8s.io/klog"
)
// NodeServer struct of Vitastor CSI driver with supported methods of CSI node server spec.
type NodeServer struct
{
*Driver
mounter mount.Interface
}
// NewNodeServer create new instance node
func NewNodeServer(driver *Driver) *NodeServer
{
return &NodeServer{
Driver: driver,
mounter: mount.New(""),
}
}
// NodeStageVolume mounts the volume to a staging path on the node.
func (ns *NodeServer) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error)
{
return &csi.NodeStageVolumeResponse{}, nil
}
// NodeUnstageVolume unstages the volume from the staging path
func (ns *NodeServer) NodeUnstageVolume(ctx context.Context, req *csi.NodeUnstageVolumeRequest) (*csi.NodeUnstageVolumeResponse, error)
{
return &csi.NodeUnstageVolumeResponse{}, nil
}
func Contains(list []string, s string) bool
{
for i := 0; i < len(list); i++
{
if (list[i] == s)
{
return true
}
}
return false
}
// NodePublishVolume mounts the volume mounted to the staging path to the target path
func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error)
{
klog.Infof("received node publish volume request %+v", protosanitizer.StripSecrets(req))
targetPath := req.GetTargetPath()
// Check that it's not already mounted
free, error := mount.IsNotMountPoint(ns.mounter, targetPath)
if (error != nil)
{
if (os.IsNotExist(error))
{
error := os.MkdirAll(targetPath, 0777)
if (error != nil)
{
return nil, status.Error(codes.Internal, error.Error())
}
free = true
}
else
{
return nil, status.Error(codes.Internal, error.Error())
}
}
if (!free)
{
return &csi.NodePublishVolumeResponse{}, nil
}
ctxVars := make(map[string]string)
err := json.Unmarshal([]byte(req.VolumeId), &ctxVars)
if (err != nil)
{
return nil, status.Error(codes.Internal, "volume ID not in JSON format")
}
volName := ctxVars["name"]
_, etcdUrl, etcdPrefix := GetConnectionParams(ctxVars)
if (len(etcdUrl) == 0)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
}
// Map NBD device
// FIXME: Check if already mapped
args := []string{
"map", "--etcd_address", strings.Join(etcdUrl, ","),
"--etcd_prefix", etcdPrefix,
"--image", volName,
};
if (ctxVars["configPath"] != "")
{
args = append(args, "--config_path", ctxVars["configPath"])
}
if (req.GetReadonly())
{
args = append(args, "--readonly", "1")
}
c := exec.Command("/usr/bin/vitastor-nbd", args...)
var stdout, stderr bytes.Buffer
c.Stdout, c.Stderr = &stdout, &stderr
err = c.Run()
stdoutStr, stderrStr := string(stdout.Bytes()), string(stderr.Bytes())
if (err != nil)
{
klog.Errorf("vitastor-nbd map failed: %s, status %s\n", stdoutStr+stderrStr, err)
return nil, status.Error(codes.Internal, stdoutStr+stderrStr+" (status "+err.Error()+")")
}
devicePath := strings.TrimSpace(stdoutStr)
// Check existing format
diskMounter := &mount.SafeFormatAndMount{Interface: ns.mounter, Exec: utilexec.New()}
existingFormat, err := diskMounter.GetDiskFormat(devicePath)
if (err != nil)
{
klog.Errorf("failed to get disk format for path %s, error: %v", err)
// unmap NBD device
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
return nil, err
}
// Format the device (ext4 or xfs)
fsType := req.GetVolumeCapability().GetMount().GetFsType()
isBlock := req.GetVolumeCapability().GetBlock() != nil
opt := req.GetVolumeCapability().GetMount().GetMountFlags()
opt = append(opt, "_netdev")
if ((req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY ||
req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY) &&
!Contains(opt, "ro"))
{
opt = append(opt, "ro")
}
if (fsType == "xfs")
{
opt = append(opt, "nouuid")
}
readOnly := Contains(opt, "ro")
if (existingFormat == "" && !readOnly)
{
args := []string{}
switch fsType
{
case "ext4":
args = []string{"-m0", "-Enodiscard,lazy_itable_init=1,lazy_journal_init=1", devicePath}
case "xfs":
args = []string{"-K", devicePath}
}
if (len(args) > 0)
{
cmdOut, cmdErr := diskMounter.Exec.Command("mkfs."+fsType, args...).CombinedOutput()
if (cmdErr != nil)
{
klog.Errorf("failed to run mkfs error: %v, output: %v", cmdErr, string(cmdOut))
// unmap NBD device
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
return nil, status.Error(codes.Internal, cmdErr.Error())
}
}
}
if (isBlock)
{
opt = append(opt, "bind")
err = diskMounter.Mount(devicePath, targetPath, fsType, opt)
}
else
{
err = diskMounter.FormatAndMount(devicePath, targetPath, fsType, opt)
}
if (err != nil)
{
klog.Errorf(
"failed to mount device path (%s) to path (%s) for volume (%s) error: %s",
devicePath, targetPath, volName, err,
)
// unmap NBD device
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
return nil, status.Error(codes.Internal, err.Error())
}
return &csi.NodePublishVolumeResponse{}, nil
}
// NodeUnpublishVolume unmounts the volume from the target path
func (ns *NodeServer) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpublishVolumeRequest) (*csi.NodeUnpublishVolumeResponse, error)
{
klog.Infof("received node unpublish volume request %+v", protosanitizer.StripSecrets(req))
targetPath := req.GetTargetPath()
devicePath, refCount, err := mount.GetDeviceNameFromMount(ns.mounter, targetPath)
if (err != nil)
{
if (os.IsNotExist(err))
{
return nil, status.Error(codes.NotFound, "Target path not found")
}
return nil, status.Error(codes.Internal, err.Error())
}
if (devicePath == "")
{
return nil, status.Error(codes.NotFound, "Volume not mounted")
}
// unmount
err = mount.CleanupMountPoint(targetPath, ns.mounter, false)
if (err != nil)
{
return nil, status.Error(codes.Internal, err.Error())
}
// unmap NBD device
if (refCount == 1)
{
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
}
return &csi.NodeUnpublishVolumeResponse{}, nil
}
// NodeGetVolumeStats returns volume capacity statistics available for the volume
func (ns *NodeServer) NodeGetVolumeStats(ctx context.Context, req *csi.NodeGetVolumeStatsRequest) (*csi.NodeGetVolumeStatsResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// NodeExpandVolume expanding the file system on the node
func (ns *NodeServer) NodeExpandVolume(ctx context.Context, req *csi.NodeExpandVolumeRequest) (*csi.NodeExpandVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// NodeGetCapabilities returns the supported capabilities of the node server
func (ns *NodeServer) NodeGetCapabilities(ctx context.Context, req *csi.NodeGetCapabilitiesRequest) (*csi.NodeGetCapabilitiesResponse, error)
{
return &csi.NodeGetCapabilitiesResponse{}, nil
}
// NodeGetInfo returns NodeGetInfoResponse for CO.
func (ns *NodeServer) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error)
{
klog.Infof("received node get info request %+v", protosanitizer.StripSecrets(req))
return &csi.NodeGetInfoResponse{
NodeId: ns.NodeID,
}, nil
}

36
csi/src/server.go Normal file
View File

@@ -0,0 +1,36 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"k8s.io/klog"
)
type Driver struct
{
*Config
}
// NewDriver create new instance driver
func NewDriver(config *Config) (*Driver, error)
{
if (config == nil)
{
klog.Errorf("Vitastor CSI driver initialization failed")
return nil, nil
}
driver := &Driver{
Config: config,
}
klog.Infof("Vitastor CSI driver initialized")
return driver, nil
}
// Start server
func (driver *Driver) Run()
{
server := NewNonBlockingGRPCServer()
server.Start(driver.Endpoint, NewIdentityServer(driver), NewControllerServer(driver), NewNodeServer(driver))
server.Wait()
}

39
csi/vitastor-csi.go Normal file
View File

@@ -0,0 +1,39 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package main
import (
"flag"
"fmt"
"os"
"k8s.io/klog"
"vitastor.io/csi/src"
)
func main()
{
var config = vitastor.NewConfig()
flag.StringVar(&config.Endpoint, "endpoint", "", "CSI endpoint")
flag.StringVar(&config.NodeID, "node", "", "Node ID")
flag.Parse()
if (config.Endpoint == "")
{
config.Endpoint = os.Getenv("CSI_ENDPOINT")
}
if (config.NodeID == "")
{
config.NodeID = os.Getenv("NODE_ID")
}
if (config.Endpoint == "" && config.NodeID == "")
{
fmt.Fprintf(os.Stderr, "Please set -endpoint and -node / CSI_ENDPOINT & NODE_ID env vars\n")
os.Exit(1)
}
drv, err := vitastor.NewDriver(config)
if (err != nil)
{
klog.Fatalln(err)
}
drv.Run()
}

14
debian/changelog vendored
View File

@@ -1,8 +1,18 @@
vitastor (0.5.13-1) unstable; urgency=medium
vitastor (0.6.5-1) unstable; urgency=medium
* RDMA support
* Bugfixes
-- Vitaliy Filippov <vitalif@yourcmc.ru> Tue, 02 Feb 2021 23:01:24 +0300
-- Vitaliy Filippov <vitalif@yourcmc.ru> Sat, 01 May 2021 18:46:10 +0300
vitastor (0.6.0-1) unstable; urgency=medium
* Snapshots and Copy-on-Write clones
* Image metadata in etcd (name, size)
* Image I/O and space statistics in etcd
* Write throttling for smoothing random write workloads in SSD+HDD configurations
-- Vitaliy Filippov <vitalif@yourcmc.ru> Sun, 11 Apr 2021 00:49:18 +0300
vitastor (0.5.1-1) unstable; urgency=medium

2
debian/control vendored
View File

@@ -2,7 +2,7 @@ Source: vitastor
Section: admin
Priority: optional
Maintainer: Vitaliy Filippov <vitalif@yourcmc.ru>
Build-Depends: debhelper, liburing-dev (>= 0.6), g++ (>= 8), libstdc++6 (>= 8), linux-libc-dev, libgoogle-perftools-dev, libjerasure-dev, libgf-complete-dev
Build-Depends: debhelper, liburing-dev (>= 0.6), g++ (>= 8), libstdc++6 (>= 8), linux-libc-dev, libgoogle-perftools-dev, libjerasure-dev, libgf-complete-dev, libibverbs-dev
Standards-Version: 4.5.0
Homepage: https://vitastor.io/
Rules-Requires-Root: no

View File

@@ -11,6 +11,10 @@ RUN if [ "$REL" = "buster" ]; then \
echo 'Package: *' >> /etc/apt/preferences; \
echo 'Pin: release a=buster-backports' >> /etc/apt/preferences; \
echo 'Pin-Priority: 500' >> /etc/apt/preferences; \
echo >> /etc/apt/preferences; \
echo 'Package: libglvnd* libgles* libglx* libgl1 libegl* libopengl* mesa*' >> /etc/apt/preferences; \
echo 'Pin: release a=buster-backports' >> /etc/apt/preferences; \
echo 'Pin-Priority: 50' >> /etc/apt/preferences; \
fi; \
grep '^deb ' /etc/apt/sources.list | perl -pe 's/^deb/deb-src/' >> /etc/apt/sources.list; \
echo 'APT::Install-Recommends false;' >> /etc/apt/apt.conf; \
@@ -20,20 +24,22 @@ RUN apt-get update
RUN apt-get -y install qemu fio liburing1 liburing-dev libgoogle-perftools-dev devscripts
RUN apt-get -y build-dep qemu
RUN apt-get -y build-dep fio
# To build a custom version
#RUN cp /root/packages/qemu-orig/* /root
RUN apt-get --download-only source qemu
RUN apt-get --download-only source fio
ADD qemu-5.0-vitastor.patch qemu-5.1-vitastor.patch /root/vitastor/
ADD patches/qemu-5.0-vitastor.patch patches/qemu-5.1-vitastor.patch /root/vitastor/patches/
RUN set -e; \
mkdir -p /root/packages/qemu-$REL; \
rm -rf /root/packages/qemu-$REL/*; \
cd /root/packages/qemu-$REL; \
dpkg-source -x /root/qemu*.dsc; \
if [ -d /root/packages/qemu-$REL/qemu-5.0 ]; then \
cp /root/vitastor/qemu-5.0-vitastor.patch /root/packages/qemu-$REL/qemu-5.0/debian/patches; \
cp /root/vitastor/patches/qemu-5.0-vitastor.patch /root/packages/qemu-$REL/qemu-5.0/debian/patches; \
echo qemu-5.0-vitastor.patch >> /root/packages/qemu-$REL/qemu-5.0/debian/patches/series; \
else \
cp /root/vitastor/qemu-5.1-vitastor.patch /root/packages/qemu-$REL/qemu-*/debian/patches; \
cp /root/vitastor/patches/qemu-5.1-vitastor.patch /root/packages/qemu-$REL/qemu-*/debian/patches; \
P=`ls -d /root/packages/qemu-$REL/qemu-*/debian/patches`; \
echo qemu-5.1-vitastor.patch >> $P/series; \
fi; \

View File

@@ -22,7 +22,7 @@ RUN apt-get -y build-dep qemu
RUN apt-get -y build-dep fio
RUN apt-get --download-only source qemu
RUN apt-get --download-only source fio
RUN apt-get -y install libjerasure-dev cmake
RUN apt-get update && apt-get -y install libjerasure-dev cmake libibverbs-dev
ADD . /root/vitastor
RUN set -e -x; \
@@ -40,10 +40,10 @@ RUN set -e -x; \
mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/packages/vitastor-$REL/*; \
cd /root/packages/vitastor-$REL; \
cp -r /root/vitastor vitastor-0.5.13; \
ln -s /root/packages/qemu-$REL/qemu-*/ vitastor-0.5.13/qemu; \
ln -s /root/fio-build/fio-*/ vitastor-0.5.13/fio; \
cd vitastor-0.5.13; \
cp -r /root/vitastor vitastor-0.6.5; \
ln -s /root/packages/qemu-$REL/qemu-*/ vitastor-0.6.5/qemu; \
ln -s /root/fio-build/fio-*/ vitastor-0.6.5/fio; \
cd vitastor-0.6.5; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
QEMU=$(head -n1 qemu/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
sh copy-qemu-includes.sh; \
@@ -59,8 +59,8 @@ RUN set -e -x; \
echo "dep:fio=$FIO" > debian/substvars; \
echo "dep:qemu=$QEMU" >> debian/substvars; \
cd /root/packages/vitastor-$REL; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.5.13.orig.tar.xz vitastor-0.5.13; \
cd vitastor-0.5.13; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.6.5.orig.tar.xz vitastor-0.6.5; \
cd vitastor-0.6.5; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \

View File

@@ -244,6 +244,7 @@ async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3,
{
return null;
}
// FIXME: use parity_chunks with parity_space instead of pg_minsize
const pg_effsize = Math.min(pg_minsize, Object.keys(osd_tree).length)
+ Math.max(0, Math.min(pg_size, Object.keys(osd_tree).length) - pg_minsize) * parity_space;
const pg_count = prev_int_pgs.length;

23
mon/merge.js Normal file
View File

@@ -0,0 +1,23 @@
const fsp = require('fs').promises;
async function merge(file1, file2, out)
{
if (!out)
{
console.error('USAGE: nodejs merge.js layer1 layer2 output');
process.exit();
}
const layer1 = await fsp.readFile(file1);
const layer2 = await fsp.readFile(file2);
const zero = Buffer.alloc(4096);
for (let i = 0; i < layer2.length; i += 4096)
{
if (zero.compare(layer2, i, i+4096) != 0)
{
layer2.copy(layer1, i, i, i+4096);
}
}
await fsp.writeFile(out, layer1);
}
merge(process.argv[2], process.argv[3], process.argv[4]);

View File

@@ -24,19 +24,31 @@ const etcd_allow = new RegExp('^'+[
'config/pools',
'config/osd/[1-9]\\d*',
'config/pgs',
'config/inode/[1-9]\\d*/[1-9]\\d*',
'osd/state/[1-9]\\d*',
'osd/stats/[1-9]\\d*',
'osd/inodestats/[1-9]\\d*',
'osd/space/[1-9]\\d*',
'mon/master',
'pg/state/[1-9]\\d*/[1-9]\\d*',
'pg/stats/[1-9]\\d*/[1-9]\\d*',
'pg/history/[1-9]\\d*/[1-9]\\d*',
'history/last_clean_pgs',
'inode/stats/[1-9]\\d*/[1-9]\\d*',
'stats',
'index/image/.*',
'index/maxid/[1-9]\\d*',
].join('$|^')+'$');
const etcd_tree = {
config: {
/* global: {
// WARNING: NOT ALL OF THESE ARE ACTUALLY CONFIGURABLE HERE
// THIS IS JUST A POOR MAN'S CONFIG DOCUMENTATION
// etcd connection
config_path: "/etc/vitastor/vitastor.conf",
etcd_address: "10.0.115.10:2379/v3",
etcd_prefix: "/vitastor",
// mon
etcd_mon_ttl: 30, // min: 10
etcd_mon_timeout: 1000, // ms. min: 0
@@ -46,7 +58,17 @@ const etcd_tree = {
osd_out_time: 600, // seconds. min: 0
placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
// client and osd
tcp_header_buffer_size: 65536,
use_sync_send_recv: false,
use_rdma: true,
rdma_device: null, // for example, "rocep5s0f0"
rdma_port_num: 1,
rdma_gid_index: 0,
rdma_mtu: 4096,
rdma_max_sge: 128,
rdma_max_send: 32,
rdma_max_recv: 8,
rdma_max_msg: 1048576,
log_level: 0,
block_size: 131072,
disk_alignment: 4096,
@@ -141,6 +163,18 @@ const etcd_tree = {
}
}, */
pgs: {},
/* inode: {
<pool_id>: {
<inode_t>: {
name: string,
size?: uint64_t, // bytes
parent_pool?: <pool_id>,
parent_id?: <inode_t>,
readonly?: boolean,
}
}
}, */
inode: {},
},
osd: {
state: {
@@ -172,6 +206,18 @@ const etcd_tree = {
},
}, */
},
inodestats: {
/* <inode_t>: {
read: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
write: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
delete: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
}, */
},
space: {
/* <osd_num_t>: {
<inode_t>: uint64_t, // bytes
}, */
},
},
mon: {
master: {
@@ -211,6 +257,28 @@ const etcd_tree = {
}, */
},
},
inode: {
stats: {
/* <pool_id>: {
<inode_t>: {
raw_used: uint64_t, // raw used bytes on OSDs
read: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
write: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
delete: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
},
}, */
},
},
pool: {
stats: {
/* <pool_id>: {
used_raw_tb: float, // used raw space in the pool
total_raw_tb: float, // maximum amount of space in the pool
raw_to_usable: float, // raw to usable ratio
space_efficiency: float, // 0..1
} */
},
},
stats: {
/* op_stats: {
<string>: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
@@ -233,6 +301,17 @@ const etcd_tree = {
history: {
last_clean_pgs: {},
},
index: {
image: {
/* <name>: {
id: uint64_t,
pool_id: uint64_t,
}, */
},
maxid: {
/* <pool_id>: uint64_t, */
},
},
};
// FIXME Split into several files
@@ -307,6 +386,11 @@ class Mon
{
this.config.mon_stats_timeout = 100;
}
this.config.mon_stats_interval = Number(this.config.mon_stats_interval) || 5000;
if (this.config.mon_stats_interval < 100)
{
this.config.mon_stats_interval = 100;
}
// After this number of seconds, a dead OSD will be removed from PG distribution
this.config.osd_out_time = Number(this.config.osd_out_time) || 0;
if (!this.config.osd_out_time)
@@ -395,7 +479,7 @@ class Mon
{
this.parse_kv(e.kv);
const key = e.kv.key.substr(this.etcd_prefix.length);
if (key.substr(0, 11) == '/osd/stats/' || key.substr(0, 10) == '/pg/stats/')
if (key.substr(0, 11) == '/osd/stats/' || key.substr(0, 10) == '/pg/stats/' || key.substr(0, 16) == '/osd/inodestats/')
{
stats_changed = true;
}
@@ -403,7 +487,7 @@ class Mon
{
pg_states_changed = true;
}
else if (key != '/stats')
else if (key != '/stats' && key.substr(0, 13) != '/inode/stats/')
{
changed = true;
}
@@ -971,6 +1055,17 @@ class Mon
} });
}
LPOptimizer.print_change_stats(optimize_result);
const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
this.state.pool.stats[pool_id] = {
used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
total_raw_tb: optimize_result.space,
raw_to_usable: pg_effsize / (pool_cfg.pg_size - (pool_cfg.parity_chunks||0)),
space_efficiency: optimize_result.space/(optimize_result.total_space||1),
};
etcd_request.success.push({ requestPut: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
} });
this.save_new_pgs_txn(etcd_request, pool_id, up_osds, real_prev_pgs, optimize_result.int_pgs, pg_history);
}
this.state.config.pgs.hash = tree_hash;
@@ -1077,10 +1172,8 @@ class Mon
}, this.config.mon_change_timeout || 1000);
}
sum_stats()
sum_op_stats()
{
let overflow = false;
this.prev_stats = this.prev_stats || { op_stats: {}, subop_stats: {}, recovery_stats: {} };
const op_stats = {}, subop_stats = {}, recovery_stats = {};
for (const osd in this.state.osd.stats)
{
@@ -1105,52 +1198,11 @@ class Mon
recovery_stats[op].bytes += BigInt(st.recovery_stats[op].bytes||0);
}
}
for (const op in op_stats)
{
if (op_stats[op].count >= 0x10000000000000000n)
{
if (!this.prev_stats.op_stats[op])
{
overflow = true;
}
else
{
op_stats[op].count -= this.prev_stats.op_stats[op].count;
op_stats[op].usec -= this.prev_stats.op_stats[op].usec;
op_stats[op].bytes -= this.prev_stats.op_stats[op].bytes;
}
}
}
for (const op in subop_stats)
{
if (subop_stats[op].count >= 0x10000000000000000n)
{
if (!this.prev_stats.subop_stats[op])
{
overflow = true;
}
else
{
subop_stats[op].count -= this.prev_stats.subop_stats[op].count;
subop_stats[op].usec -= this.prev_stats.subop_stats[op].usec;
}
}
}
for (const op in recovery_stats)
{
if (recovery_stats[op].count >= 0x10000000000000000n)
{
if (!this.prev_stats.recovery_stats[op])
{
overflow = true;
}
else
{
recovery_stats[op].count -= this.prev_stats.recovery_stats[op].count;
recovery_stats[op].bytes -= this.prev_stats.recovery_stats[op].bytes;
}
}
}
return { op_stats, subop_stats, recovery_stats };
}
sum_object_counts()
{
const object_counts = { object: 0n, clean: 0n, misplaced: 0n, degraded: 0n, incomplete: 0n };
for (const pool_id in this.state.pg.stats)
{
@@ -1169,36 +1221,143 @@ class Mon
}
}
}
return (this.prev_stats = { overflow, op_stats, subop_stats, recovery_stats, object_counts });
return object_counts;
}
sum_inode_stats()
{
const inode_stats = {};
const inode_stub = () => ({
raw_used: 0n,
read: { count: 0n, usec: 0n, bytes: 0n },
write: { count: 0n, usec: 0n, bytes: 0n },
delete: { count: 0n, usec: 0n, bytes: 0n },
});
for (const pool_id in this.state.config.pools)
{
this.state.pool.stats[pool_id] = this.state.pool.stats[pool_id] || {};
this.state.pool.stats[pool_id].used_raw_tb = 0n;
}
for (const osd_num in this.state.osd.space)
{
for (const pool_id in this.state.osd.space[osd_num])
{
this.state.pool.stats[pool_id] = this.state.pool.stats[pool_id] || { used_raw_tb: 0n };
inode_stats[pool_id] = inode_stats[pool_id] || {};
for (const inode_num in this.state.osd.space[osd_num][pool_id])
{
const u = BigInt(this.state.osd.space[osd_num][pool_id][inode_num]||0);
inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
inode_stats[pool_id][inode_num].raw_used += u;
this.state.pool.stats[pool_id].used_raw_tb += u;
}
}
}
for (const pool_id in this.state.config.pools)
{
const used = this.state.pool.stats[pool_id].used_raw_tb;
this.state.pool.stats[pool_id].used_raw_tb = Number(used)/1024/1024/1024/1024;
}
for (const osd_num in this.state.osd.inodestats)
{
const ist = this.state.osd.inodestats[osd_num];
for (const pool_id in ist)
{
inode_stats[pool_id] = inode_stats[pool_id] || {};
for (const inode_num in ist[pool_id])
{
inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
for (const op of [ 'read', 'write', 'delete' ])
{
inode_stats[pool_id][inode_num][op].count += BigInt(ist[pool_id][inode_num][op].count||0);
inode_stats[pool_id][inode_num][op].usec += BigInt(ist[pool_id][inode_num][op].usec||0);
inode_stats[pool_id][inode_num][op].bytes += BigInt(ist[pool_id][inode_num][op].bytes||0);
}
}
}
}
return inode_stats;
}
fix_stat_overflows(obj, scratch)
{
for (const k in obj)
{
if (typeof obj[k] == 'bigint')
{
if (obj[k] >= 0x10000000000000000n)
{
if (scratch[k])
{
for (const k2 in scratch)
{
obj[k2] -= scratch[k2];
scratch[k2] = 0n;
}
}
else
{
for (const k2 in obj)
{
scratch[k2] = obj[k2];
}
}
}
}
else if (typeof obj[k] == 'object')
{
this.fix_stat_overflows(obj[k], scratch[k] = (scratch[k] || {}));
}
}
}
serialize_bigints(obj)
{
for (const k in obj)
{
if (typeof obj[k] == 'bigint')
{
obj[k] = ''+obj[k];
}
else if (typeof obj[k] == 'object')
{
this.serialize_bigints(obj[k]);
}
}
}
async update_total_stats()
{
const stats = this.sum_stats();
if (!stats.overflow)
const txn = [];
const stats = this.sum_op_stats();
const object_counts = this.sum_object_counts();
const inode_stats = this.sum_inode_stats();
this.fix_stat_overflows(stats, (this.prev_stats = this.prev_stats || {}));
this.fix_stat_overflows(inode_stats, (this.prev_inode_stats = this.prev_inode_stats || {}));
stats.object_counts = object_counts;
this.serialize_bigints(stats);
this.serialize_bigints(inode_stats);
txn.push({ requestPut: { key: b64(this.etcd_prefix+'/stats'), value: b64(JSON.stringify(stats)) } });
for (const pool_id in inode_stats)
{
// Convert to strings, serialize and save
const ser = {};
for (const st of [ 'op_stats', 'subop_stats', 'recovery_stats' ])
for (const inode_num in inode_stats[pool_id])
{
ser[st] = {};
for (const op in stats[st])
{
ser[st][op] = {};
for (const k in stats[st][op])
{
ser[st][op][k] = ''+stats[st][op][k];
}
}
txn.push({ requestPut: {
key: b64(this.etcd_prefix+'/inode/stats/'+pool_id+'/'+inode_num),
value: b64(JSON.stringify(inode_stats[pool_id][inode_num])),
} });
}
ser.object_counts = {};
for (const k in stats.object_counts)
{
ser.object_counts[k] = ''+stats.object_counts[k];
}
await this.etcd_call('/kv/txn', {
success: [ { requestPut: { key: b64(this.etcd_prefix+'/stats'), value: b64(JSON.stringify(ser)) } } ],
}, this.config.etcd_mon_timeout, 0);
}
for (const pool_id in this.state.pool.stats)
{
txn.push({ requestPut: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
} });
}
if (txn.length)
{
await this.etcd_call('/kv/txn', { success: txn }, this.config.etcd_mon_timeout, 0);
}
}
@@ -1209,11 +1368,17 @@ class Mon
clearTimeout(this.stats_timer);
this.stats_timer = null;
}
let sleep = (this.stats_update_next||0) - Date.now();
if (sleep < this.config.mon_stats_timeout)
{
sleep = this.config.mon_stats_timeout;
}
this.stats_timer = setTimeout(() =>
{
this.stats_timer = null;
this.stats_update_next = Date.now() + this.config.mon_stats_interval;
this.update_total_stats().catch(console.error);
}, this.config.mon_stats_timeout || 1000);
}, sleep);
}
parse_kv(kv)

View File

@@ -51,7 +51,7 @@ async function run()
const meta_offset = options.journal_offset + Math.ceil(options.journal_size/options.device_block_size)*options.device_block_size;
const entries_per_block = Math.floor(options.device_block_size / (24 + 2*options.object_size/options.bitmap_granularity/8));
const object_count = Math.floor((device_size-meta_offset)/options.object_size);
const meta_size = Math.ceil(object_count / entries_per_block) * options.device_block_size;
const meta_size = Math.ceil(1 + object_count / entries_per_block) * options.device_block_size;
const data_offset = meta_offset + meta_size;
const meta_size_fmt = (meta_size > 1024*1024*1024 ? Math.round(meta_size/1024/1024/1024*100)/100+" GB"
: Math.round(meta_size/1024/1024*100)/100+" MB");
@@ -65,6 +65,9 @@ async function run()
);
}
process.stdout.write(
(options.device_block_size != 4096 ?
` --meta_block_size ${options.device}\n`+
` --journal_block-size ${options.device}\n` : '')+
` --data_device ${options.device}\n`+
` --journal_offset ${options.journal_offset}\n`+
` --meta_offset ${meta_offset}\n`+

948
patches/cinder-vitastor.py Normal file
View File

@@ -0,0 +1,948 @@
# Vitastor Driver for OpenStack Cinder
#
# --------------------------------------------
# Install as cinder/volume/drivers/vitastor.py
# --------------------------------------------
#
# Copyright 2020 Vitaliy Filippov
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
"""Cinder Vitastor Driver"""
import binascii
import base64
import errno
import json
import math
import os
import tempfile
from castellan import key_manager
from oslo_config import cfg
from oslo_log import log as logging
from oslo_service import loopingcall
from oslo_concurrency import processutils
from oslo_utils import encodeutils
from oslo_utils import excutils
from oslo_utils import fileutils
from oslo_utils import units
import six
from six.moves.urllib import request
from cinder import exception
from cinder.i18n import _
from cinder.image import image_utils
from cinder import interface
from cinder import objects
from cinder.objects import fields
from cinder import utils
from cinder.volume import configuration
from cinder.volume import driver
from cinder.volume import volume_utils
VERSION = '0.6.5'
LOG = logging.getLogger(__name__)
VITASTOR_OPTS = [
cfg.StrOpt(
'vitastor_config_path',
default='/etc/vitastor/vitastor.conf',
help='Vitastor configuration file path'
),
cfg.StrOpt(
'vitastor_etcd_address',
default='',
help='Vitastor etcd address(es)'),
cfg.StrOpt(
'vitastor_etcd_prefix',
default='/vitastor',
help='Vitastor etcd prefix'
),
cfg.StrOpt(
'vitastor_pool_id',
default='',
help='Vitastor pool ID to use for volumes'
),
# FIXME exclusive_cinder_pool ?
]
CONF = cfg.CONF
CONF.register_opts(VITASTOR_OPTS, group = configuration.SHARED_CONF_GROUP)
class VitastorDriverException(exception.VolumeDriverException):
message = _("Vitastor Cinder driver failure: %(reason)s")
@interface.volumedriver
class VitastorDriver(driver.CloneableImageVD,
driver.ManageableVD, driver.ManageableSnapshotsVD,
driver.BaseVD):
"""Implements Vitastor volume commands."""
cfg = {}
_etcd_urls = []
def __init__(self, active_backend_id = None, *args, **kwargs):
super(VitastorDriver, self).__init__(*args, **kwargs)
self.configuration.append_config_values(VITASTOR_OPTS)
@classmethod
def get_driver_options(cls):
additional_opts = cls._get_oslo_driver_opts(
'reserved_percentage',
'max_over_subscription_ratio',
'volume_dd_blocksize'
)
return VITASTOR_OPTS + additional_opts
def do_setup(self, context):
"""Performs initialization steps that could raise exceptions."""
super(VitastorDriver, self).do_setup(context)
# Make sure configuration is in UTF-8
for attr in [ 'config_path', 'etcd_address', 'etcd_prefix', 'pool_id' ]:
val = self.configuration.safe_get('vitastor_'+attr)
if val is not None:
self.cfg[attr] = utils.convert_str(val)
self.cfg = self._load_config(self.cfg)
def _load_config(self, cfg):
# Try to load configuration file
try:
f = open(cfg['config_path'] or '/etc/vitastor/vitastor.conf')
conf = json.loads(f.read())
f.close()
for k in conf:
cfg[k] = cfg.get(k, conf[k])
except:
pass
if isinstance(cfg['etcd_address'], str):
cfg['etcd_address'] = cfg['etcd_address'].split(',')
# Sanitize etcd URLs
for i, etcd_url in enumerate(cfg['etcd_address']):
ssl = False
if etcd_url.lower().startswith('http://'):
etcd_url = etcd_url[7:]
elif etcd_url.lower().startswith('https://'):
etcd_url = etcd_url[8:]
ssl = True
if etcd_url.find('/') < 0:
etcd_url += '/v3'
if ssl:
etcd_url = 'https://'+etcd_url
else:
etcd_url = 'http://'+etcd_url
cfg['etcd_address'][i] = etcd_url
return cfg
def check_for_setup_error(self):
"""Returns an error if prerequisites aren't met."""
def _encode_etcd_key(self, key):
if not isinstance(key, bytes):
key = str(key).encode('utf-8')
return base64.b64encode(self.cfg['etcd_prefix'].encode('utf-8')+b'/'+key).decode('utf-8')
def _encode_etcd_value(self, value):
if not isinstance(value, bytes):
value = str(value).encode('utf-8')
return base64.b64encode(value).decode('utf-8')
def _encode_etcd_requests(self, obj):
for v in obj:
for rt in v:
if 'key' in v[rt]:
v[rt]['key'] = self._encode_etcd_key(v[rt]['key'])
if 'range_end' in v[rt]:
v[rt]['range_end'] = self._encode_etcd_key(v[rt]['range_end'])
if 'value' in v[rt]:
v[rt]['value'] = self._encode_etcd_value(v[rt]['value'])
def _etcd_txn(self, params):
if 'compare' in params:
for v in params['compare']:
if 'key' in v:
v['key'] = self._encode_etcd_key(v['key'])
if 'failure' in params:
self._encode_etcd_requests(params['failure'])
if 'success' in params:
self._encode_etcd_requests(params['success'])
body = json.dumps(params).encode('utf-8')
headers = {
'Content-Type': 'application/json'
}
err = None
for etcd_url in self.cfg['etcd_address']:
try:
resp = request.urlopen(request.Request(etcd_url+'/kv/txn', body, headers), timeout = 5)
data = json.loads(resp.read())
if 'responses' not in data:
data['responses'] = []
for i, resp in enumerate(data['responses']):
if 'response_range' in resp:
if 'kvs' not in resp['response_range']:
resp['response_range']['kvs'] = []
for kv in resp['response_range']['kvs']:
kv['key'] = base64.b64decode(kv['key'].encode('utf-8')).decode('utf-8')
if kv['key'].startswith(self.cfg['etcd_prefix']+'/'):
kv['key'] = kv['key'][len(self.cfg['etcd_prefix'])+1 : ]
kv['value'] = json.loads(base64.b64decode(kv['value'].encode('utf-8')))
if len(resp.keys()) != 1:
LOG.exception('unknown responses['+str(i)+'] format: '+json.dumps(resp))
else:
resp = data['responses'][i] = resp[list(resp.keys())[0]]
return data
except Exception as e:
LOG.exception('error calling etcd transaction: '+body.decode('utf-8')+'\nerror: '+str(e))
err = e
raise err
def _etcd_foreach(self, prefix, add_fn):
total = 0
batch = 1000
begin = prefix+'/'
while True:
resp = self._etcd_txn({ 'success': [
{ 'request_range': {
'key': begin,
'range_end': prefix+'0',
'limit': batch+1,
} },
] })
i = 0
while i < batch and i < len(resp['responses'][0]['kvs']):
kv = resp['responses'][0]['kvs'][i]
add_fn(kv)
i += 1
if len(resp['responses'][0]['kvs']) <= batch:
break
begin = resp['responses'][0]['kvs'][batch]['key']
return total
def _update_volume_stats(self):
location_info = json.dumps({
'config': self.configuration.vitastor_config_path,
'etcd_address': self.configuration.vitastor_etcd_address,
'etcd_prefix': self.configuration.vitastor_etcd_prefix,
'pool_id': self.configuration.vitastor_pool_id,
})
stats = {
'vendor_name': 'Vitastor',
'driver_version': self.VERSION,
'storage_protocol': 'vitastor',
'total_capacity_gb': 'unknown',
'free_capacity_gb': 'unknown',
# FIXME check if safe_get is required
'reserved_percentage': self.configuration.safe_get('reserved_percentage'),
'multiattach': True,
'thin_provisioning_support': True,
'max_over_subscription_ratio': self.configuration.safe_get('max_over_subscription_ratio'),
'location_info': location_info,
'backend_state': 'down',
'volume_backend_name': self.configuration.safe_get('volume_backend_name') or 'vitastor',
'replication_enabled': False,
}
try:
pool_stats = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'pool/stats/'+str(self.cfg['pool_id']) } }
] })
total_provisioned = 0
def add_total(kv):
nonlocal total_provisioned
if kv['key'].find('@') >= 0:
total_provisioned += kv['value']['size']
self._etcd_foreach('config/inode/'+str(self.cfg['pool_id']), lambda kv: add_total(kv))
stats['provisioned_capacity_gb'] = round(total_provisioned/1024.0/1024.0/1024.0, 2)
pool_stats = pool_stats['responses'][0]['kvs']
if len(pool_stats):
pool_stats = pool_stats[0]
stats['free_capacity_gb'] = round(1024.0*(pool_stats['total_raw_tb']-pool_stats['used_raw_tb'])/pool_stats['raw_to_usable'], 2)
stats['total_capacity_gb'] = round(1024.0*pool_stats['total_raw_tb'], 2)
stats['backend_state'] = 'up'
except Exception as e:
# just log and return unknown capacities
LOG.exception('error getting vitastor pool stats: '+str(e))
self._stats = stats
def _next_id(self, resp):
if len(resp['kvs']) == 0:
return (1, 0)
else:
return (1 + resp['kvs'][0]['value'], resp['kvs'][0]['mod_revision'])
def create_volume(self, volume):
"""Creates a logical volume."""
size = int(volume.size) * units.Gi
# FIXME: Check if convert_str is really required
vol_name = utils.convert_str(volume.name)
if vol_name.find('@') >= 0 or vol_name.find('/') >= 0:
raise exception.VolumeBackendAPIException(data = '@ and / are forbidden in volume and snapshot names')
LOG.debug("creating volume '%s'", vol_name)
self._create_image(vol_name, { 'size': size })
if volume.encryption_key_id:
self._create_encrypted_volume(volume, volume.obj_context)
volume_update = {}
return volume_update
def _create_encrypted_volume(self, volume, context):
"""Create a new LUKS encrypted image directly in Vitastor."""
vol_name = utils.convert_str(volume.name)
f, opts = self._encrypt_opts(volume, context)
# FIXME: Check if it works at all :-)
self._execute(
'qemu-img', 'convert', '-f', 'luks', *opts,
'vitastor:image='+vol_name.replace(':', '\\:')+self._qemu_args(),
'%sM' % (volume.size * 1024)
)
f.close()
def _encrypt_opts(self, volume, context):
encryption = volume_utils.check_encryption_provider(self.db, volume, context)
# Fetch the key associated with the volume and decode the passphrase
keymgr = key_manager.API(CONF)
key = keymgr.get(context, encryption['encryption_key_id'])
passphrase = binascii.hexlify(key.get_encoded()).decode('utf-8')
# Decode the dm-crypt style cipher spec into something qemu-img can use
cipher_spec = image_utils.decode_cipher(encryption['cipher'], encryption['key_size'])
tmp_dir = volume_utils.image_conversion_dir()
f = tempfile.NamedTemporaryFile(prefix = 'luks_', dir = tmp_dir)
f.write(passphrase)
f.flush()
return (f, [
'--object', 'secret,id=luks_sec,format=raw,file=%(passfile)s' % {'passfile': f.name},
'-o', 'key-secret=luks_sec,cipher-alg=%(cipher_alg)s,cipher-mode=%(cipher_mode)s,ivgen-alg=%(ivgen_alg)s' % cipher_spec,
])
def create_snapshot(self, snapshot):
"""Creates a volume snapshot."""
vol_name = utils.convert_str(snapshot.volume_name)
snap_name = utils.convert_str(snapshot.name)
if snap_name.find('@') >= 0 or snap_name.find('/') >= 0:
raise exception.VolumeBackendAPIException(data = '@ and / are forbidden in volume and snapshot names')
self._create_snapshot(vol_name, vol_name+'@'+snap_name)
def snapshot_revert_use_temp_snapshot(self):
"""Disable the use of a temporary snapshot on revert."""
return False
def revert_to_snapshot(self, context, volume, snapshot):
"""Revert a volume to a given snapshot."""
# FIXME Delete the image, then recreate it from the snapshot
def delete_snapshot(self, snapshot):
"""Deletes a snapshot."""
vol_name = utils.convert_str(snapshot.volume_name)
snap_name = utils.convert_str(snapshot.name)
# Find the snapshot
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name+'@'+snap_name } },
] })
if len(resp['responses'][0]['kvs']) == 0:
raise exception.SnapshotNotFound(snapshot_id = snap_name)
inode_id = int(resp['responses'][0]['kvs'][0]['value']['id'])
pool_id = int(resp['responses'][0]['kvs'][0]['value']['pool_id'])
parents = {}
parents[(pool_id << 48) | (inode_id & 0xffffffffffff)] = True
# Check if there are child volumes
children = self._child_count(parents)
if children > 0:
raise exception.SnapshotIsBusy(snapshot_name = snap_name)
# FIXME: We can't delete snapshots because we can't merge layers yet
raise exception.VolumeBackendAPIException(data = 'Snapshot delete (layer merge) is not implemented yet')
def _child_count(self, parents):
children = 0
def add_child(kv):
nonlocal children
children += self._check_parent(kv, parents)
self._etcd_foreach('config/inode', lambda kv: add_child(kv))
return children
def _check_parent(self, kv, parents):
if 'parent_id' not in kv['value']:
return 0
parent_id = kv['value']['parent_id']
_, _, pool_id, inode_id = kv['key'].split('/')
parent_pool_id = pool_id
if 'parent_pool_id' in kv['value'] and kv['value']['parent_pool_id']:
parent_pool_id = kv['value']['parent_pool_id']
inode = (int(pool_id) << 48) | (int(inode_id) & 0xffffffffffff)
parent = (int(parent_pool_id) << 48) | (int(parent_id) & 0xffffffffffff)
if parent in parents and inode not in parents:
return 1
return 0
def create_cloned_volume(self, volume, src_vref):
"""Create a cloned volume from another volume."""
size = int(volume.size) * units.Gi
src_name = utils.convert_str(src_vref.name)
dest_name = utils.convert_str(volume.name)
if dest_name.find('@') >= 0 or dest_name.find('/') >= 0:
raise exception.VolumeBackendAPIException(data = '@ and / are forbidden in volume and snapshot names')
# FIXME Do full copy if requested (cfg.disable_clone)
if src_vref.admin_metadata.get('readonly') == 'True':
# source volume is a volume-image cache entry or other readonly volume
# clone without intermediate snapshot
src = self._get_image(src_name)
LOG.debug("creating image '%s' from '%s'", dest_name, src_name)
new_cfg = self._create_image(dest_name, {
'size': size,
'parent_id': src['idx']['id'],
'parent_pool_id': src['idx']['pool_id'],
})
return {}
clone_snap = "%s@%s.clone_snap" % (src_name, dest_name)
make_img = True
if (volume.display_name and
volume.display_name.startswith('image-') and
src_vref.project_id != volume.project_id):
# idiotic openstack creates image-volume cache entries
# as clones of normal VM volumes... :-X prevent it :-D
clone_snap = dest_name
make_img = False
LOG.debug("creating layer '%s' under '%s'", clone_snap, src_name)
new_cfg = self._create_snapshot(src_name, clone_snap, True)
if make_img:
# Then create a clone from it
new_cfg = self._create_image(dest_name, {
'size': size,
'parent_id': new_cfg['parent_id'],
'parent_pool_id': new_cfg['parent_pool_id'],
})
return {}
def create_volume_from_snapshot(self, volume, snapshot):
"""Creates a cloned volume from an existing snapshot."""
vol_name = utils.convert_str(volume.name)
snap_name = utils.convert_str(snapshot.name)
snap = self._get_image(vol_name+'@'+snap_name)
if not snap:
raise exception.SnapshotNotFound(snapshot_id = snap_name)
snap_inode_id = int(resp['responses'][0]['kvs'][0]['value']['id'])
snap_pool_id = int(resp['responses'][0]['kvs'][0]['value']['pool_id'])
size = snap['cfg']['size']
if int(volume.size):
size = int(volume.size) * units.Gi
new_cfg = self._create_image(vol_name, {
'size': size,
'parent_id': snap['idx']['id'],
'parent_pool_id': snap['idx']['pool_id'],
})
return {}
def _vitastor_args(self):
args = []
for k in [ 'config_path', 'etcd_address', 'etcd_prefix' ]:
v = self.configuration.safe_get('vitastor_'+k)
if v:
args.extend(['--'+k, v])
return args
def _qemu_args(self):
args = ''
for k in [ 'config_path', 'etcd_address', 'etcd_prefix' ]:
v = self.configuration.safe_get('vitastor_'+k)
kk = k
if kk == 'etcd_address':
# FIXME use etcd_address in qemu driver
kk = 'etcd_host'
if v:
args += ':'+kk+'='+v.replace(':', '\\:')
return args
def delete_volume(self, volume):
"""Deletes a logical volume."""
vol_name = utils.convert_str(volume.name)
# Find the volume and all its snapshots
range_end = b'index/image/' + vol_name.encode('utf-8')
range_end = range_end[0 : len(range_end)-1] + six.int2byte(range_end[len(range_end)-1] + 1)
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name, 'range_end': range_end } },
] })
if len(resp['responses'][0]['kvs']) == 0:
# already deleted
LOG.info("volume %s no longer exists in backend", vol_name)
return
layers = resp['responses'][0]['kvs']
layer_ids = {}
for kv in layers:
inode_id = int(kv['value']['id'])
pool_id = int(kv['value']['pool_id'])
inode_pool_id = (pool_id << 48) | (inode_id & 0xffffffffffff)
layer_ids[inode_pool_id] = True
# Check if the volume has clones and raise 'busy' if so
children = self._child_count(layer_ids)
if children > 0:
raise exception.VolumeIsBusy(volume_name = vol_name)
# Clear data
for kv in layers:
args = [
'vitastor-rm', '--pool', str(kv['value']['pool_id']),
'--inode', str(kv['value']['id']), '--progress', '0',
*(self._vitastor_args())
]
try:
self._execute(*args)
except processutils.ProcessExecutionError as exc:
LOG.error("Failed to remove layer "+kv['key']+": "+exc)
raise exception.VolumeBackendAPIException(data = exc.stderr)
# Delete all layers from etcd
requests = []
for kv in layers:
requests.append({ 'request_delete_range': { 'key': kv['key'] } })
requests.append({ 'request_delete_range': { 'key': 'config/inode/'+str(kv['value']['pool_id'])+'/'+str(kv['value']['id']) } })
self._etcd_txn({ 'success': requests })
def retype(self, context, volume, new_type, diff, host):
"""Change extra type specifications for a volume."""
# FIXME Maybe (in the future) support multiple pools as different types
return True, {}
def ensure_export(self, context, volume):
"""Synchronously recreates an export for a logical volume."""
pass
def create_export(self, context, volume, connector):
"""Exports the volume."""
pass
def remove_export(self, context, volume):
"""Removes an export for a logical volume."""
pass
def _create_image(self, vol_name, cfg):
pool_s = str(self.cfg['pool_id'])
image_id = 0
while image_id == 0:
# check if the image already exists and find a free ID
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name } },
{ 'request_range': { 'key': 'index/maxid/'+pool_s } },
] })
if len(resp['responses'][0]['kvs']) > 0:
# already exists
raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' already exists')
image_id, id_mod = self._next_id(resp['responses'][1])
# try to create the image
resp = self._etcd_txn({ 'compare': [
{ 'target': 'MOD', 'mod_revision': id_mod, 'key': 'index/maxid/'+pool_s },
{ 'target': 'VERSION', 'version': 0, 'key': 'index/image/'+vol_name },
{ 'target': 'VERSION', 'version': 0, 'key': 'config/inode/'+pool_s+'/'+str(image_id) },
], 'success': [
{ 'request_put': { 'key': 'index/maxid/'+pool_s, 'value': image_id } },
{ 'request_put': { 'key': 'index/image/'+vol_name, 'value': json.dumps({
'id': image_id, 'pool_id': self.cfg['pool_id']
}) } },
{ 'request_put': { 'key': 'config/inode/'+pool_s+'/'+str(image_id), 'value': json.dumps({
**cfg, 'name': vol_name,
}) } },
] })
if not resp.get('succeeded'):
# repeat
image_id = 0
def _create_snapshot(self, vol_name, snap_vol_name, allow_existing = False):
while True:
# check if the image already exists and snapshot doesn't
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name } },
{ 'request_range': { 'key': 'index/image/'+snap_vol_name } },
] })
if len(resp['responses'][0]['kvs']) == 0:
raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' does not exist')
if len(resp['responses'][1]['kvs']) > 0:
if allow_existing:
snap_idx = resp['responses'][1]['kvs'][0]['value']
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'config/inode/'+str(snap_idx['pool_id'])+'/'+str(snap_idx['id']) } },
] })
if len(resp['responses'][0]['kvs']) == 0:
raise exception.VolumeBackendAPIException(data =
'Volume '+snap_vol_name+' is already indexed, but does not exist'
)
return resp['responses'][0]['kvs'][0]['value']
raise exception.VolumeBackendAPIException(
data = 'Volume '+snap_vol_name+' already exists'
)
vol_idx = resp['responses'][0]['kvs'][0]['value']
vol_idx_mod = resp['responses'][0]['kvs'][0]['mod_revision']
# get image inode config and find a new ID
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']) } },
{ 'request_range': { 'key': 'index/maxid/'+str(self.cfg['pool_id']) } },
] })
if len(resp['responses'][0]['kvs']) == 0:
raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' does not exist')
vol_cfg = resp['responses'][0]['kvs'][0]['value']
vol_mod = resp['responses'][0]['kvs'][0]['mod_revision']
new_id, id_mod = self._next_id(resp['responses'][1])
# try to redirect image to the new inode
new_cfg = {
**vol_cfg, 'name': vol_name, 'parent_id': vol_idx['id'], 'parent_pool_id': vol_idx['pool_id']
}
resp = self._etcd_txn({ 'compare': [
{ 'target': 'MOD', 'mod_revision': vol_idx_mod, 'key': 'index/image/'+vol_name },
{ 'target': 'MOD', 'mod_revision': vol_mod, 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']) },
{ 'target': 'MOD', 'mod_revision': id_mod, 'key': 'index/maxid/'+str(self.cfg['pool_id']) },
{ 'target': 'VERSION', 'version': 0, 'key': 'index/image/'+snap_vol_name },
{ 'target': 'VERSION', 'version': 0, 'key': 'config/inode/'+str(self.cfg['pool_id'])+'/'+str(new_id) },
], 'success': [
{ 'request_put': { 'key': 'index/maxid/'+str(self.cfg['pool_id']), 'value': new_id } },
{ 'request_put': { 'key': 'index/image/'+vol_name, 'value': json.dumps({
'id': new_id, 'pool_id': self.cfg['pool_id']
}) } },
{ 'request_put': { 'key': 'config/inode/'+str(self.cfg['pool_id'])+'/'+str(new_id), 'value': json.dumps(new_cfg) } },
{ 'request_put': { 'key': 'index/image/'+snap_vol_name, 'value': json.dumps({
'id': vol_idx['id'], 'pool_id': vol_idx['pool_id']
}) } },
{ 'request_put': { 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']), 'value': json.dumps({
**vol_cfg, 'name': snap_vol_name, 'readonly': True
}) } }
] })
if resp.get('succeeded'):
return new_cfg
def initialize_connection(self, volume, connector):
data = {
'driver_volume_type': 'vitastor',
'data': {
'config_path': self.configuration.vitastor_config_path,
'etcd_address': self.configuration.vitastor_etcd_address,
'etcd_prefix': self.configuration.vitastor_etcd_prefix,
'name': volume.name,
'logical_block_size': 512,
'physical_block_size': 4096,
}
}
LOG.debug('connection data: %s', data)
return data
def terminate_connection(self, volume, connector, **kwargs):
pass
def clone_image(self, context, volume, image_location, image_meta, image_service):
if image_location:
# Note: image_location[0] is glance image direct_url.
# image_location[1] contains the list of all locations (including
# direct_url) or None if show_multiple_locations is False in
# glance configuration.
if image_location[1]:
url_locations = [location['url'] for location in image_location[1]]
else:
url_locations = [image_location[0]]
# iterate all locations to look for a cloneable one.
for url_location in url_locations:
if url_location and url_location.startswith('cinder://'):
# The idea is to use cinder://<volume-id> Glance volumes as base images
base_vol = self.db.volume_get(context, url_location[len('cinder://') : ])
if not base_vol or base_vol.volume_type_id != volume.volume_type_id:
continue
size = int(volume.size) * units.Gi
dest_name = utils.convert_str(volume.name)
# Find or create the base snapshot
snap_cfg = self._create_snapshot(base_vol.name, base_vol.name+'@.clone_snap', True)
# Then create a clone from it
new_cfg = self._create_image(dest_name, {
'size': size,
'parent_id': snap_cfg['parent_id'],
'parent_pool_id': snap_cfg['parent_pool_id'],
})
return ({}, True)
return ({}, False)
def copy_image_to_encrypted_volume(self, context, volume, image_service, image_id):
self.copy_image_to_volume(context, volume, image_service, image_id, encrypted = True)
def copy_image_to_volume(self, context, volume, image_service, image_id, encrypted = False):
tmp_dir = volume_utils.image_conversion_dir()
with tempfile.NamedTemporaryFile(dir = tmp_dir) as tmp:
image_utils.fetch_to_raw(
context, image_service, image_id, tmp.name,
self.configuration.volume_dd_blocksize, size = volume.size
)
out_format = [ '-O', 'raw' ]
if encrypted:
key_file, opts = self._encrypt_opts(volume, context)
out_format = [ '-O', 'luks', *opts ]
dest_name = utils.convert_str(volume.name)
self._try_execute(
'qemu-img', 'convert', '-f', 'raw', tmp.name, *out_format,
'vitastor:image='+dest_name.replace(':', '\\:')+self._qemu_args()
)
if encrypted:
key_file.close()
def copy_volume_to_image(self, context, volume, image_service, image_meta):
tmp_dir = volume_utils.image_conversion_dir()
tmp_file = os.path.join(tmp_dir, volume.name + '-' + image_meta['id'])
with fileutils.remove_path_on_error(tmp_file):
vol_name = utils.convert_str(volume.name)
self._try_execute(
'qemu-img', 'convert', '-f', 'raw',
'vitastor:image='+vol_name.replace(':', '\\:')+self._qemu_args(),
'-O', 'raw', tmp_file
)
# FIXME: Copy directly if the destination image is also in Vitastor
volume_utils.upload_volume(context, image_service, image_meta, tmp_file, volume)
os.unlink(tmp_file)
def _get_image(self, vol_name):
# find the image
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name } },
] })
if len(resp['responses'][0]['kvs']) == 0:
return None
vol_idx = resp['responses'][0]['kvs'][0]['value']
vol_idx_mod = resp['responses'][0]['kvs'][0]['mod_revision']
# get image inode config
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']) } },
] })
if len(resp['responses'][0]['kvs']) == 0:
return None
vol_cfg = resp['responses'][0]['kvs'][0]['value']
vol_cfg_mod = resp['responses'][0]['kvs'][0]['mod_revision']
return {
'cfg': vol_cfg,
'cfg_mod': vol_cfg_mod,
'idx': vol_idx,
'idx_mod': vol_idx_mod,
}
def extend_volume(self, volume, new_size):
"""Extend an existing volume."""
vol_name = utils.convert_str(volume.name)
while True:
vol = self._get_image(vol_name)
if not vol:
raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' does not exist')
# change size
size = int(new_size) * units.Gi
if size == vol['cfg']['size']:
break
resp = self._etcd_txn({ 'compare': [ {
'target': 'MOD',
'mod_revision': vol['cfg_mod'],
'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']),
} ], 'success': [
{ 'request_put': {
'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']),
'value': json.dumps({ **vol['cfg'], 'size': size }),
} },
] })
if resp.get('succeeded'):
break
LOG.debug(
"Extend volume from %(old_size)s GB to %(new_size)s GB.",
{'old_size': volume.size, 'new_size': new_size}
)
def _add_manageable_volume(self, kv, manageable_volumes, cinder_ids):
cfg = kv['value']
if kv['key'].find('@') >= 0:
# snapshot
return
image_id = volume_utils.extract_id_from_volume_name(cfg['name'])
image_info = {
'reference': {'source-name': image_name},
'size': int(math.ceil(float(cfg['size']) / units.Gi)),
'cinder_id': None,
'extra_info': None,
}
if image_id in cinder_ids:
image_info['cinder_id'] = image_id
image_info['safe_to_manage'] = False
image_info['reason_not_safe'] = 'already managed'
else:
image_info['safe_to_manage'] = True
image_info['reason_not_safe'] = None
manageable_volumes.append(image_info)
def get_manageable_volumes(self, cinder_volumes, marker, limit, offset, sort_keys, sort_dirs):
manageable_volumes = []
cinder_ids = [resource['id'] for resource in cinder_volumes]
# List all volumes
# FIXME: It's possible to use pagination in our case, but.. do we want it?
self._etcd_foreach('config/inode/'+str(self.cfg['pool_id']),
lambda kv: self._add_manageable_volume(kv, manageable_volumes, cinder_ids))
return volume_utils.paginate_entries_list(
manageable_volumes, marker, limit, offset, sort_keys, sort_dirs)
def _get_existing_name(existing_ref):
if not isinstance(existing_ref, dict):
existing_ref = {"source-name": existing_ref}
if 'source-name' not in existing_ref:
reason = _('Reference must contain source-name element.')
raise exception.ManageExistingInvalidReference(existing_ref=existing_ref, reason=reason)
src_name = utils.convert_str(existing_ref['source-name'])
if not src_name:
reason = _('Reference must contain source-name element.')
raise exception.ManageExistingInvalidReference(existing_ref=existing_ref, reason=reason)
return src_name
def manage_existing_get_size(self, volume, existing_ref):
"""Return size of an existing image for manage_existing.
:param volume: volume ref info to be set
:param existing_ref: {'source-name': <image name>}
"""
src_name = self._get_existing_name(existing_ref)
vol = self._get_image(src_name)
if not vol:
raise exception.VolumeBackendAPIException(data = 'Volume '+src_name+' does not exist')
return int(math.ceil(float(vol['cfg']['size']) / units.Gi))
def manage_existing(self, volume, existing_ref):
"""Manages an existing image.
Renames the image name to match the expected name for the volume.
:param volume: volume ref info to be set
:param existing_ref: {'source-name': <image name>}
"""
from_name = self._get_existing_name(existing_ref)
to_name = utils.convert_str(volume.name)
self._rename(from_name, to_name)
def _rename(self, from_name, to_name):
while True:
vol = self._get_image(from_name)
if not vol:
raise exception.VolumeBackendAPIException(data = 'Volume '+from_name+' does not exist')
to = self._get_image(to_name)
if to:
raise exception.VolumeBackendAPIException(data = 'Volume '+to_name+' already exists')
resp = self._etcd_txn({ 'compare': [
{ 'target': 'MOD', 'mod_revision': vol['idx_mod'], 'key': 'index/image/'+vol['cfg']['name'] },
{ 'target': 'MOD', 'mod_revision': vol['cfg_mod'], 'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']) },
{ 'target': 'VERSION', 'version': 0, 'key': 'index/image/'+to_name },
], 'success': [
{ 'request_delete_range': { 'key': 'index/image/'+vol['cfg']['name'] } },
{ 'request_put': { 'key': 'index/image/'+to_name, 'value': json.dumps(vol['idx']) } },
{ 'request_put': { 'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']),
'value': json.dumps({ **vol['cfg'], 'name': to_name }) } },
] })
if resp.get('succeeded'):
break
def unmanage(self, volume):
pass
def _add_manageable_snapshot(self, kv, manageable_snapshots, cinder_ids):
cfg = kv['value']
dog = kv['key'].find('@')
if dog < 0:
# snapshot
return
image_name = kv['key'][0 : dog]
snap_name = kv['key'][dog+1 : ]
snapshot_id = volume_utils.extract_id_from_snapshot_name(snap_name)
snapshot_info = {
'reference': {'source-name': snap_name},
'size': int(math.ceil(float(cfg['size']) / units.Gi)),
'cinder_id': None,
'extra_info': None,
'safe_to_manage': False,
'reason_not_safe': None,
'source_reference': {'source-name': image_name}
}
if snapshot_id in cinder_ids:
# Exclude snapshots already managed.
snapshot_info['reason_not_safe'] = ('already managed')
snapshot_info['cinder_id'] = snapshot_id
elif snap_name.endswith('.clone_snap'):
# Exclude clone snapshot.
snapshot_info['reason_not_safe'] = ('used for clone snap')
else:
snapshot_info['safe_to_manage'] = True
manageable_snapshots.append(snapshot_info)
def get_manageable_snapshots(self, cinder_snapshots, marker, limit, offset, sort_keys, sort_dirs):
"""List manageable snapshots in Vitastor."""
manageable_snapshots = []
cinder_snapshot_ids = [resource['id'] for resource in cinder_snapshots]
# List all volumes
# FIXME: It's possible to use pagination in our case, but.. do we want it?
self._etcd_foreach('config/inode/'+str(self.cfg['pool_id']),
lambda kv: self._add_manageable_volume(kv, manageable_snapshots, cinder_snapshot_ids))
return volume_utils.paginate_entries_list(
manageable_snapshots, marker, limit, offset, sort_keys, sort_dirs)
def manage_existing_snapshot_get_size(self, snapshot, existing_ref):
"""Return size of an existing image for manage_existing.
:param snapshot: snapshot ref info to be set
:param existing_ref: {'source-name': <name of snapshot>}
"""
vol_name = utils.convert_str(snapshot.volume_name)
snap_name = self._get_existing_name(existing_ref)
vol = self._get_image(vol_name+'@'+snap_name)
if not vol:
raise exception.ManageExistingInvalidReference(
existing_ref=snapshot_name, reason='Specified snapshot does not exist.'
)
return int(math.ceil(float(vol['cfg']['size']) / units.Gi))
def manage_existing_snapshot(self, snapshot, existing_ref):
"""Manages an existing snapshot.
Renames the snapshot name to match the expected name for the snapshot.
Error checking done by manage_existing_get_size is not repeated.
:param snapshot: snapshot ref info to be set
:param existing_ref: {'source-name': <name of snapshot>}
"""
vol_name = utils.convert_str(snapshot.volume_name)
snap_name = self._get_existing_name(existing_ref)
from_name = vol_name+'@'+snap_name
to_name = vol_name+'@'+utils.convert_str(snapshot.name)
self._rename(from_name, to_name)
def unmanage_snapshot(self, snapshot):
"""Removes the specified snapshot from Cinder management."""
pass
def _dumps(self, obj):
return json.dumps(obj, separators=(',', ':'), sort_keys=True)

View File

@@ -0,0 +1,23 @@
# Devstack configuration for bridged networking
[[local|localrc]]
ADMIN_PASSWORD=secret
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD
HOST_IP=10.0.2.15
Q_USE_SECGROUP=True
FLOATING_RANGE="10.0.2.0/24"
IPV4_ADDRS_SAFE_TO_USE="10.0.5.0/24"
Q_FLOATING_ALLOCATION_POOL=start=10.0.2.50,end=10.0.2.100
PUBLIC_NETWORK_GATEWAY=10.0.2.2
PUBLIC_INTERFACE=ens3
Q_USE_PROVIDERNET_FOR_PUBLIC=True
Q_AGENT=linuxbridge
Q_ML2_PLUGIN_MECHANISM_DRIVERS=linuxbridge
LB_PHYSICAL_INTERFACE=ens3
PUBLIC_PHYSICAL_NETWORK=default
LB_INTERFACE_MAPPINGS=default:ens3
Q_SERVICE_PLUGIN_CLASSES=
Q_ML2_PLUGIN_TYPE_DRIVERS=flat
Q_ML2_PLUGIN_EXT_DRIVERS=

View File

@@ -0,0 +1,609 @@
commit bd283191b3e7a4c6d1c100d3d96e348a1ebffe55
Author: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Sun Jun 27 12:52:40 2021 +0300
Add Vitastor support
diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index aa50eac..082b4f8 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -1728,6 +1728,35 @@
</element>
</define>
+ <define name="diskSourceNetworkProtocolVitastor">
+ <element name="source">
+ <interleave>
+ <attribute name="protocol">
+ <value>vitastor</value>
+ </attribute>
+ <ref name="diskSourceCommon"/>
+ <optional>
+ <attribute name="name"/>
+ </optional>
+ <optional>
+ <attribute name="query"/>
+ </optional>
+ <zeroOrMore>
+ <ref name="diskSourceNetworkHost"/>
+ </zeroOrMore>
+ <optional>
+ <element name="config">
+ <attribute name="file">
+ <ref name="absFilePath"/>
+ </attribute>
+ <empty/>
+ </element>
+ </optional>
+ <empty/>
+ </interleave>
+ </element>
+ </define>
+
<define name="diskSourceNetworkProtocolISCSI">
<element name="source">
<attribute name="protocol">
@@ -1851,6 +1880,7 @@
<ref name="diskSourceNetworkProtocolHTTP"/>
<ref name="diskSourceNetworkProtocolSimple"/>
<ref name="diskSourceNetworkProtocolVxHS"/>
+ <ref name="diskSourceNetworkProtocolVitastor"/>
</choice>
</define>
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
index 4bf2b5f..dbc011b 100644
--- a/include/libvirt/libvirt-storage.h
+++ b/include/libvirt/libvirt-storage.h
@@ -240,6 +240,7 @@ typedef enum {
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER = 1 << 16,
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS = 1 << 17,
VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE = 1 << 18,
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR = 1 << 20,
} virConnectListAllStoragePoolsFlags;
int virConnectListAllStoragePools(virConnectPtr conn,
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 222bb8c..685d255 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -8653,6 +8653,10 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
goto cleanup;
}
+ if (src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR) {
+ src->relPath = virXMLPropString(node, "query");
+ }
+
if ((haveTLS = virXMLPropString(node, "tls")) &&
(src->haveTLS = virTristateBoolTypeFromString(haveTLS)) <= 0) {
virReportError(VIR_ERR_XML_ERROR,
@@ -23849,6 +23853,10 @@ virDomainDiskSourceFormatNetwork(virBufferPtr attrBuf,
virBufferEscapeString(attrBuf, " name='%s'", path ? path : src->path);
+ if (src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR && src->relPath != NULL) {
+ virBufferEscapeString(attrBuf, " query='%s'", src->relPath);
+ }
+
VIR_FREE(path);
if (src->haveTLS != VIR_TRISTATE_BOOL_ABSENT &&
@@ -30930,6 +30938,7 @@ virDomainDiskTranslateSourcePool(virDomainDiskDefPtr def)
case VIR_STORAGE_POOL_MPATH:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
index 55db7a9..7cbe937 100644
--- a/src/conf/storage_conf.c
+++ b/src/conf/storage_conf.c
@@ -58,7 +58,7 @@ VIR_ENUM_IMPL(virStoragePool,
"logical", "disk", "iscsi",
"iscsi-direct", "scsi", "mpath",
"rbd", "sheepdog", "gluster",
- "zfs", "vstorage")
+ "zfs", "vstorage", "vitastor")
VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
VIR_STORAGE_POOL_FS_LAST,
@@ -232,6 +232,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
.formatToString = virStorageFileFormatTypeToString,
}
},
+ {.poolType = VIR_STORAGE_POOL_VITASTOR,
+ .poolOptions = {
+ .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
+ VIR_STORAGE_POOL_SOURCE_NETWORK |
+ VIR_STORAGE_POOL_SOURCE_NAME),
+ },
+ .volOptions = {
+ .defaultFormat = VIR_STORAGE_FILE_RAW,
+ .formatFromString = virStorageVolumeFormatFromString,
+ .formatToString = virStorageFileFormatTypeToString,
+ }
+ },
{.poolType = VIR_STORAGE_POOL_SHEEPDOG,
.poolOptions = {
.flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -434,6 +446,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
_("element 'name' is mandatory for RBD pool"));
goto cleanup;
}
+ if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
+ virReportError(VIR_ERR_XML_ERROR, "%s",
+ _("element 'name' is mandatory for Vitastor pool"));
+ return -1;
+ }
if (options->formatFromString) {
char *format = virXPathString("string(./format/@type)", ctxt);
@@ -1009,6 +1026,7 @@ virStoragePoolDefFormatBuf(virBufferPtr buf,
/* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
* files, so they don't have a target */
if (def->type != VIR_STORAGE_POOL_RBD &&
+ def->type != VIR_STORAGE_POOL_VITASTOR &&
def->type != VIR_STORAGE_POOL_SHEEPDOG &&
def->type != VIR_STORAGE_POOL_GLUSTER &&
def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
index dc0aa2a..ed4983d 100644
--- a/src/conf/storage_conf.h
+++ b/src/conf/storage_conf.h
@@ -91,6 +91,7 @@ typedef enum {
VIR_STORAGE_POOL_GLUSTER, /* Gluster device */
VIR_STORAGE_POOL_ZFS, /* ZFS */
VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
+ VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
VIR_STORAGE_POOL_LAST,
} virStoragePoolType;
@@ -422,6 +423,7 @@ VIR_ENUM_DECL(virStoragePartedFs)
VIR_CONNECT_LIST_STORAGE_POOLS_SCSI | \
VIR_CONNECT_LIST_STORAGE_POOLS_MPATH | \
VIR_CONNECT_LIST_STORAGE_POOLS_RBD | \
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER | \
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS | \
diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
index 6ea6a97..3ba45b9 100644
--- a/src/conf/virstorageobj.c
+++ b/src/conf/virstorageobj.c
@@ -1478,6 +1478,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
return 1;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_LAST:
break;
@@ -1971,6 +1972,8 @@ virStoragePoolObjMatch(virStoragePoolObjPtr obj,
(obj->def->type == VIR_STORAGE_POOL_MPATH)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
(obj->def->type == VIR_STORAGE_POOL_RBD)) ||
+ (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
+ (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
(obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
index 2ea3e94..d5d2273 100644
--- a/src/libvirt-storage.c
+++ b/src/libvirt-storage.c
@@ -92,6 +92,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
* VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
* VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
* VIR_CONNECT_LIST_STORAGE_POOLS_RBD
+ * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
* VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
*
* Returns the number of storage pools found or -1 and sets @pools to
diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
index 73e988a..ab7bb81 100644
--- a/src/libxl/libxl_conf.c
+++ b/src/libxl/libxl_conf.c
@@ -905,6 +905,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSourcePtr src,
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
index cbf0aa4..096700d 100644
--- a/src/qemu/qemu_block.c
+++ b/src/qemu/qemu_block.c
@@ -959,6 +959,42 @@ qemuBlockStorageSourceGetRBDProps(virStorageSourcePtr src)
}
+static virJSONValuePtr
+qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
+{
+ virJSONValuePtr ret = NULL;
+ virStorageNetHostDefPtr host;
+ size_t i;
+ virBuffer buf = VIR_BUFFER_INITIALIZER;
+ char *etcd = NULL;
+
+ for (i = 0; i < src->nhosts; i++) {
+ host = src->hosts + i;
+ if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
+ goto cleanup;
+ }
+ virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
+ }
+ if (src->nhosts > 0) {
+ etcd = virBufferContentAndReset(&buf);
+ }
+
+ if (virJSONValueObjectCreate(&ret,
+ "s:driver", "vitastor",
+ "S:etcd_host", etcd,
+ "S:etcd_prefix", src->relPath,
+ "S:config_path", src->configFile,
+ "s:image", src->path,
+ NULL) < 0)
+ goto cleanup;
+
+cleanup:
+ VIR_FREE(etcd);
+ virBufferFreeAndReset(&buf);
+ return ret;
+}
+
+
static virJSONValuePtr
qemuBlockStorageSourceGetSheepdogProps(virStorageSourcePtr src)
{
@@ -1174,6 +1210,11 @@ qemuBlockStorageSourceGetBackendProps(virStorageSourcePtr src,
return NULL;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
+ return NULL;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
return NULL;
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 822d5f8..e375cef 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -975,6 +975,43 @@ qemuBuildNetworkDriveStr(virStorageSourcePtr src,
ret = virBufferContentAndReset(&buf);
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (strchr(src->path, ':')) {
+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+ _("':' not allowed in Vitastor source volume name '%s'"),
+ src->path);
+ return NULL;
+ }
+
+ virBufferStrcat(&buf, "vitastor:image=", src->path, NULL);
+
+ if (src->nhosts > 0) {
+ virBufferAddLit(&buf, ":etcd_host=");
+ for (i = 0; i < src->nhosts; i++) {
+ if (i)
+ virBufferAddLit(&buf, ",");
+
+ /* assume host containing : is ipv6 */
+ if (strchr(src->hosts[i].name, ':'))
+ virBufferEscape(&buf, '\\', ":", "[%s]",
+ src->hosts[i].name);
+ else
+ virBufferAsprintf(&buf, "%s", src->hosts[i].name);
+
+ if (src->hosts[i].port)
+ virBufferAsprintf(&buf, "\\:%u", src->hosts[i].port);
+ }
+ }
+
+ if (src->configFile)
+ virBufferEscape(&buf, '\\', ":", ":config_path=%s", src->configFile);
+
+ if (src->relPath)
+ virBufferEscape(&buf, '\\', ":", ":etcd_prefix=%s", src->relPath);
+
+ ret = virBufferContentAndReset(&buf);
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_VXHS:
virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
_("VxHS protocol does not support URI syntax"));
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index ec6b340..f399efa 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10881,6 +10881,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSourcePtr src,
break;
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 1d96170..2d24396 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -14687,6 +14687,7 @@ qemuDomainSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDefPtr snapdi
case VIR_STORAGE_NET_PROTOCOL_TFTP:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
virReportError(VIR_ERR_INTERNAL_ERROR,
_("external inactive snapshots are not supported on "
@@ -14764,6 +14765,7 @@ qemuDomainSnapshotPrepareDiskExternalActive(virDomainSnapshotDiskDefPtr snapdisk
case VIR_STORAGE_NET_PROTOCOL_TFTP:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
virReportError(VIR_ERR_INTERNAL_ERROR,
_("external active snapshots are not supported on "
@@ -14887,6 +14889,7 @@ qemuDomainSnapshotPrepareDiskInternal(virDomainDiskDefPtr disk,
case VIR_STORAGE_NET_PROTOCOL_TFTP:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
virReportError(VIR_ERR_INTERNAL_ERROR,
_("internal inactive snapshots are not supported on "
diff --git a/src/qemu/qemu_parse_command.c b/src/qemu/qemu_parse_command.c
index c4650f0..551da41 100644
--- a/src/qemu/qemu_parse_command.c
+++ b/src/qemu/qemu_parse_command.c
@@ -2184,6 +2184,7 @@ qemuParseCommandLine(virFileCachePtr capsCache,
case VIR_STORAGE_NET_PROTOCOL_TFTP:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_LAST:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_NONE:
/* ignored for now */
break;
diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
index 4a13e90..33301c7 100644
--- a/src/storage/storage_driver.c
+++ b/src/storage/storage_driver.c
@@ -1568,6 +1568,7 @@ storageVolLookupByPathCallback(virStoragePoolObjPtr obj,
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_ZFS:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_LAST:
ignore_value(VIR_STRDUP(stable_path, data->path));
break;
diff --git a/src/util/virstoragefile.c b/src/util/virstoragefile.c
index bd4b027..b323cd6 100644
--- a/src/util/virstoragefile.c
+++ b/src/util/virstoragefile.c
@@ -84,7 +84,8 @@ VIR_ENUM_IMPL(virStorageNetProtocol, VIR_STORAGE_NET_PROTOCOL_LAST,
"ftps",
"tftp",
"ssh",
- "vxhs")
+ "vxhs",
+ "vitastor")
VIR_ENUM_IMPL(virStorageNetHostTransport, VIR_STORAGE_NET_HOST_TRANS_LAST,
"tcp",
@@ -2839,6 +2840,83 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
}
+static int
+virStorageSourceParseVitastorColonString(const char *colonstr,
+ virStorageSourcePtr src)
+{
+ char *p, *e, *next;
+ char *options = NULL;
+
+ /* optionally skip the "vitastor:" prefix if provided */
+ if (STRPREFIX(colonstr, "vitastor:"))
+ colonstr += strlen("vitastor:");
+
+ if (VIR_STRDUP(options, colonstr) < 0)
+ return -1;
+
+ p = options;
+ while (*p) {
+ /* find : delimiter or end of string */
+ for (e = p; *e && *e != ':'; ++e) {
+ if (*e == '\\') {
+ e++;
+ if (*e == '\0')
+ break;
+ }
+ }
+ if (*e == '\0') {
+ next = e; /* last kv pair */
+ } else {
+ next = e + 1;
+ *e = '\0';
+ }
+
+ if (STRPREFIX(p, "image=")) {
+ if (VIR_STRDUP(src->path, p + strlen("image=")) < 0)
+ return -1;
+ } else if (STRPREFIX(p, "etcd_prefix=")) {
+ if (VIR_STRDUP(src->relPath, p + strlen("etcd_prefix=")) < 0)
+ return -1;
+ } else if (STRPREFIX(p, "config_file=")) {
+ if (VIR_STRDUP(src->configFile, p + strlen("config_file=")) < 0)
+ return -1;
+ } else if (STRPREFIX(p, "etcd_host=")) {
+ char *h, *sep;
+
+ h = p + strlen("etcd_host=");
+ while (h < e) {
+ for (sep = h; sep < e; ++sep) {
+ if (*sep == '\\' && (sep[1] == ',' ||
+ sep[1] == ';' ||
+ sep[1] == ' ')) {
+ *sep = '\0';
+ sep += 2;
+ break;
+ }
+ }
+
+ if (virStorageSourceRBDAddHost(src, h) < 0)
+ goto error;
+
+ h = sep;
+ }
+ }
+
+ p = next;
+ }
+
+ if (!src->path) {
+ goto error;
+ }
+
+ return 0;
+
+error:
+ VIR_FREE(options);
+ return -1;
+}
+
+
static int
virStorageSourceParseNBDColonString(const char *nbdstr,
virStorageSourcePtr src)
@@ -2942,6 +3020,11 @@ virStorageSourceParseBackingColon(virStorageSourcePtr src,
goto cleanup;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (virStorageSourceParseVitastorColonString(path, src) < 0)
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -3441,6 +3524,56 @@ virStorageSourceParseBackingJSONRBD(virStorageSourcePtr src,
return ret;
}
+static int
+virStorageSourceParseBackingJSONVitastor(virStorageSourcePtr src,
+ virJSONValuePtr json,
+ int opaque ATTRIBUTE_UNUSED)
+{
+ const char *filename;
+ const char *image = virJSONValueObjectGetString(json, "image");
+ const char *conf = virJSONValueObjectGetString(json, "config_path");
+ const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd_prefix");
+ virJSONValuePtr servers = virJSONValueObjectGetArray(json, "server");
+ size_t nservers;
+ size_t i;
+
+ src->type = VIR_STORAGE_TYPE_NETWORK;
+ src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
+
+ /* legacy syntax passed via 'filename' option */
+ if ((filename = virJSONValueObjectGetString(json, "filename")))
+ return virStorageSourceParseVitastorColonString(filename, src);
+
+ if (!image) {
+ virReportError(VIR_ERR_INVALID_ARG, "%s",
+ _("missing image name in Vitastor backing volume "
+ "JSON specification"));
+ return -1;
+ }
+
+ if (VIR_STRDUP(src->path, image) < 0 ||
+ VIR_STRDUP(src->configFile, conf) < 0 ||
+ VIR_STRDUP(src->relPath, etcd_prefix) < 0)
+ return -1;
+
+ if (servers) {
+ nservers = virJSONValueArraySize(servers);
+
+ if (VIR_ALLOC_N(src->hosts, nservers) < 0)
+ return -1;
+
+ src->nhosts = nservers;
+
+ for (i = 0; i < nservers; i++) {
+ if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
+ virJSONValueArrayGet(servers, i)) < 0)
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
static int
virStorageSourceParseBackingJSONRaw(virStorageSourcePtr src,
virJSONValuePtr json,
@@ -3507,6 +3640,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
{"sheepdog", virStorageSourceParseBackingJSONSheepdog, 0},
{"ssh", virStorageSourceParseBackingJSONSSH, 0},
{"rbd", virStorageSourceParseBackingJSONRBD, 0},
+ {"vitastor", virStorageSourceParseBackingJSONVitastor, 0},
{"raw", virStorageSourceParseBackingJSONRaw, 0},
{"vxhs", virStorageSourceParseBackingJSONVxHS, 0},
};
@@ -4276,6 +4410,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
return 24007;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_RBD:
/* we don't provide a default for RBD */
return 0;
diff --git a/src/util/virstoragefile.h b/src/util/virstoragefile.h
index 1d6161a..8d83bf3 100644
--- a/src/util/virstoragefile.h
+++ b/src/util/virstoragefile.h
@@ -134,6 +134,7 @@ typedef enum {
VIR_STORAGE_NET_PROTOCOL_TFTP,
VIR_STORAGE_NET_PROTOCOL_SSH,
VIR_STORAGE_NET_PROTOCOL_VXHS,
+ VIR_STORAGE_NET_PROTOCOL_VITASTOR,
VIR_STORAGE_NET_PROTOCOL_LAST
} virStorageNetProtocol;
diff --git a/src/xenconfig/xen_xl.c b/src/xenconfig/xen_xl.c
index accfc3a..a18f9c3 100644
--- a/src/xenconfig/xen_xl.c
+++ b/src/xenconfig/xen_xl.c
@@ -1535,6 +1535,7 @@ xenFormatXLDiskSrcNet(virStorageSourcePtr src)
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
index 70ca39b..9caef51 100644
--- a/tools/virsh-pool.c
+++ b/tools/virsh-pool.c
@@ -1219,6 +1219,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd ATTRIBUTE_UNUSED)
case VIR_STORAGE_POOL_VSTORAGE:
flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
+ flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
+ break;
case VIR_STORAGE_POOL_LAST:
break;
}

View File

@@ -0,0 +1,657 @@
commit 41cdfe8317d98f70aadedfdbb381effed2641bdd
Author: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Fri Jul 9 01:31:57 2021 +0300
Add Vitastor support
diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index 7dc419b..875433b 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -1827,6 +1827,35 @@
</element>
</define>
+ <define name="diskSourceNetworkProtocolVitastor">
+ <element name="source">
+ <interleave>
+ <attribute name="protocol">
+ <value>vitastor</value>
+ </attribute>
+ <ref name="diskSourceCommon"/>
+ <optional>
+ <attribute name="name"/>
+ </optional>
+ <optional>
+ <attribute name="query"/>
+ </optional>
+ <zeroOrMore>
+ <ref name="diskSourceNetworkHost"/>
+ </zeroOrMore>
+ <optional>
+ <element name="config">
+ <attribute name="file">
+ <ref name="absFilePath"/>
+ </attribute>
+ <empty/>
+ </element>
+ </optional>
+ <empty/>
+ </interleave>
+ </element>
+ </define>
+
<define name="diskSourceNetworkProtocolISCSI">
<element name="source">
<attribute name="protocol">
@@ -2083,6 +2112,7 @@
<ref name="diskSourceNetworkProtocolSimple"/>
<ref name="diskSourceNetworkProtocolVxHS"/>
<ref name="diskSourceNetworkProtocolNFS"/>
+ <ref name="diskSourceNetworkProtocolVitastor"/>
</choice>
</define>
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
index 089e1e0..d7e7ef4 100644
--- a/include/libvirt/libvirt-storage.h
+++ b/include/libvirt/libvirt-storage.h
@@ -245,6 +245,7 @@ typedef enum {
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS = 1 << 17,
VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE = 1 << 18,
VIR_CONNECT_LIST_STORAGE_POOLS_ISCSI_DIRECT = 1 << 19,
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR = 1 << 20,
} virConnectListAllStoragePoolsFlags;
int virConnectListAllStoragePools(virConnectPtr conn,
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 01b7187..c6e9702 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -8261,7 +8261,8 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
src->configFile = virXPathString("string(./config/@file)", ctxt);
if (src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTP ||
- src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS)
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS ||
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR)
src->query = virXMLPropString(node, "query");
if (virDomainStorageNetworkParseHosts(node, ctxt, &src->hosts, &src->nhosts) < 0)
@@ -31392,6 +31393,7 @@ virDomainStorageSourceTranslateSourcePool(virStorageSourcePtr src,
case VIR_STORAGE_POOL_MPATH:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
index 0c50529..fe97574 100644
--- a/src/conf/storage_conf.c
+++ b/src/conf/storage_conf.c
@@ -60,7 +60,7 @@ VIR_ENUM_IMPL(virStoragePool,
"logical", "disk", "iscsi",
"iscsi-direct", "scsi", "mpath",
"rbd", "sheepdog", "gluster",
- "zfs", "vstorage",
+ "zfs", "vstorage", "vitastor",
);
VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
@@ -249,6 +249,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
.formatToString = virStorageFileFormatTypeToString,
}
},
+ {.poolType = VIR_STORAGE_POOL_VITASTOR,
+ .poolOptions = {
+ .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
+ VIR_STORAGE_POOL_SOURCE_NETWORK |
+ VIR_STORAGE_POOL_SOURCE_NAME),
+ },
+ .volOptions = {
+ .defaultFormat = VIR_STORAGE_FILE_RAW,
+ .formatFromString = virStorageVolumeFormatFromString,
+ .formatToString = virStorageFileFormatTypeToString,
+ }
+ },
{.poolType = VIR_STORAGE_POOL_SHEEPDOG,
.poolOptions = {
.flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -551,6 +563,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
_("element 'name' is mandatory for RBD pool"));
goto cleanup;
}
+ if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
+ virReportError(VIR_ERR_XML_ERROR, "%s",
+ _("element 'name' is mandatory for Vitastor pool"));
+ return -1;
+ }
if (options->formatFromString) {
g_autofree char *format = NULL;
@@ -1217,6 +1234,7 @@ virStoragePoolDefFormatBuf(virBufferPtr buf,
/* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
* files, so they don't have a target */
if (def->type != VIR_STORAGE_POOL_RBD &&
+ def->type != VIR_STORAGE_POOL_VITASTOR &&
def->type != VIR_STORAGE_POOL_SHEEPDOG &&
def->type != VIR_STORAGE_POOL_GLUSTER &&
def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
index ffd406e..8868a05 100644
--- a/src/conf/storage_conf.h
+++ b/src/conf/storage_conf.h
@@ -110,6 +110,7 @@ typedef enum {
VIR_STORAGE_POOL_GLUSTER, /* Gluster device */
VIR_STORAGE_POOL_ZFS, /* ZFS */
VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
+ VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
VIR_STORAGE_POOL_LAST,
} virStoragePoolType;
@@ -474,6 +475,7 @@ VIR_ENUM_DECL(virStoragePartedFs);
VIR_CONNECT_LIST_STORAGE_POOLS_SCSI | \
VIR_CONNECT_LIST_STORAGE_POOLS_MPATH | \
VIR_CONNECT_LIST_STORAGE_POOLS_RBD | \
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER | \
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS | \
diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
index 9fe8b3f..bf595b0 100644
--- a/src/conf/virstorageobj.c
+++ b/src/conf/virstorageobj.c
@@ -1491,6 +1491,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
return 1;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_LAST:
break;
@@ -1990,6 +1991,8 @@ virStoragePoolObjMatch(virStoragePoolObjPtr obj,
(obj->def->type == VIR_STORAGE_POOL_MPATH)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
(obj->def->type == VIR_STORAGE_POOL_RBD)) ||
+ (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
+ (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
(obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
index 2a7cdca..f756be1 100644
--- a/src/libvirt-storage.c
+++ b/src/libvirt-storage.c
@@ -92,6 +92,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
* VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
* VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
* VIR_CONNECT_LIST_STORAGE_POOLS_RBD
+ * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
* VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
* VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER
* VIR_CONNECT_LIST_STORAGE_POOLS_ZFS
diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
index 6a8ae27..a735bc6 100644
--- a/src/libxl/libxl_conf.c
+++ b/src/libxl/libxl_conf.c
@@ -942,6 +942,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSourcePtr src,
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/libxl/xen_xl.c b/src/libxl/xen_xl.c
index 17b93d0..c5a0084 100644
--- a/src/libxl/xen_xl.c
+++ b/src/libxl/xen_xl.c
@@ -1601,6 +1601,7 @@ xenFormatXLDiskSrcNet(virStorageSourcePtr src)
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
index f9c6da2..922dde5 100644
--- a/src/qemu/qemu_block.c
+++ b/src/qemu/qemu_block.c
@@ -938,6 +938,38 @@ qemuBlockStorageSourceGetRBDProps(virStorageSourcePtr src,
}
+static virJSONValuePtr
+qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
+{
+ virJSONValuePtr ret = NULL;
+ virStorageNetHostDefPtr host;
+ size_t i;
+ g_auto(virBuffer) buf = VIR_BUFFER_INITIALIZER;
+ g_autofree char *etcd = NULL;
+
+ for (i = 0; i < src->nhosts; i++) {
+ host = src->hosts + i;
+ if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
+ return NULL;
+ }
+ virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
+ }
+ if (src->nhosts > 0) {
+ etcd = virBufferContentAndReset(&buf);
+ }
+
+ if (virJSONValueObjectCreate(&ret,
+ "S:etcd_host", etcd,
+ "S:etcd_prefix", src->query,
+ "S:config_path", src->configFile,
+ "s:image", src->path,
+ NULL) < 0)
+ return NULL;
+
+ return ret;
+}
+
+
static virJSONValuePtr
qemuBlockStorageSourceGetSheepdogProps(virStorageSourcePtr src)
{
@@ -1224,6 +1256,12 @@ qemuBlockStorageSourceGetBackendProps(virStorageSourcePtr src,
return NULL;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
+ return NULL;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
@@ -2183,6 +2221,7 @@ qemuBlockGetBackingStoreString(virStorageSourcePtr src,
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
case VIR_STORAGE_NET_PROTOCOL_SSH:
@@ -2560,6 +2599,12 @@ qemuBlockStorageSourceCreateGetStorageProps(virStorageSourcePtr src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(location = qemuBlockStorageSourceGetVitastorProps(src)))
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(location = qemuBlockStorageSourceGetSheepdogProps(src)))
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 6f970a3..10b39ca 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -1034,6 +1034,43 @@ qemuBuildNetworkDriveStr(virStorageSourcePtr src,
ret = virBufferContentAndReset(&buf);
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (strchr(src->path, ':')) {
+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+ _("':' not allowed in Vitastor source volume name '%s'"),
+ src->path);
+ return NULL;
+ }
+
+ virBufferStrcat(&buf, "vitastor:image=", src->path, NULL);
+
+ if (src->nhosts > 0) {
+ virBufferAddLit(&buf, ":etcd_host=");
+ for (i = 0; i < src->nhosts; i++) {
+ if (i)
+ virBufferAddLit(&buf, ",");
+
+ /* assume host containing : is ipv6 */
+ if (strchr(src->hosts[i].name, ':'))
+ virBufferEscape(&buf, '\\', ":", "[%s]",
+ src->hosts[i].name);
+ else
+ virBufferAsprintf(&buf, "%s", src->hosts[i].name);
+
+ if (src->hosts[i].port)
+ virBufferAsprintf(&buf, "\\:%u", src->hosts[i].port);
+ }
+ }
+
+ if (src->configFile)
+ virBufferEscape(&buf, '\\', ":", ":config_path=%s", src->configFile);
+
+ if (src->query)
+ virBufferEscape(&buf, '\\', ":", ":etcd_prefix=%s", src->query);
+
+ ret = virBufferContentAndReset(&buf);
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_VXHS:
virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
_("VxHS protocol does not support URI syntax"));
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 0765dc7..4cff344 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -4610,7 +4610,8 @@ qemuDomainValidateStorageSource(virStorageSourcePtr src,
if (src->query &&
(actualType != VIR_STORAGE_TYPE_NETWORK ||
(src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTPS &&
- src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP))) {
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR))) {
virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
_("query is supported only with HTTP(S) protocols"));
return -1;
@@ -9704,6 +9705,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSourcePtr src,
break;
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c
index ee333c3..674aa58 100644
--- a/src/qemu/qemu_snapshot.c
+++ b/src/qemu/qemu_snapshot.c
@@ -403,6 +403,7 @@ qemuSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDefPtr snapdisk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
@@ -493,6 +494,7 @@ qemuSnapshotPrepareDiskExternalActive(virDomainObjPtr vm,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
case VIR_STORAGE_NET_PROTOCOL_HTTP:
@@ -623,6 +625,7 @@ qemuSnapshotPrepareDiskInternal(virDomainDiskDefPtr disk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
index 16bc53a..1e5d820 100644
--- a/src/storage/storage_driver.c
+++ b/src/storage/storage_driver.c
@@ -1645,6 +1645,7 @@ storageVolLookupByPathCallback(virStoragePoolObjPtr obj,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/test/test_driver.c b/src/test/test_driver.c
index 29c4c86..a27ad94 100644
--- a/src/test/test_driver.c
+++ b/src/test/test_driver.c
@@ -7096,6 +7096,7 @@ testStorageVolumeTypeForPool(int pooltype)
case VIR_STORAGE_POOL_ISCSI_DIRECT:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
return VIR_STORAGE_VOL_NETWORK;
case VIR_STORAGE_POOL_LOGICAL:
case VIR_STORAGE_POOL_DISK:
diff --git a/src/util/virstoragefile.c b/src/util/virstoragefile.c
index 0d3c2af..36e3afc 100644
--- a/src/util/virstoragefile.c
+++ b/src/util/virstoragefile.c
@@ -91,6 +91,7 @@ VIR_ENUM_IMPL(virStorageNetProtocol,
"ssh",
"vxhs",
"nfs",
+ "vitastor",
);
VIR_ENUM_IMPL(virStorageNetHostTransport,
@@ -2880,6 +2881,75 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
}
+static int
+virStorageSourceParseVitastorColonString(const char *colonstr,
+ virStorageSourcePtr src)
+{
+ char *p, *e, *next;
+ g_autofree char *options = NULL;
+
+ /* optionally skip the "vitastor:" prefix if provided */
+ if (STRPREFIX(colonstr, "vitastor:"))
+ colonstr += strlen("vitastor:");
+
+ options = g_strdup(colonstr);
+
+ p = options;
+ while (*p) {
+ /* find : delimiter or end of string */
+ for (e = p; *e && *e != ':'; ++e) {
+ if (*e == '\\') {
+ e++;
+ if (*e == '\0')
+ break;
+ }
+ }
+ if (*e == '\0') {
+ next = e; /* last kv pair */
+ } else {
+ next = e + 1;
+ *e = '\0';
+ }
+
+ if (STRPREFIX(p, "image=")) {
+ src->path = g_strdup(p + strlen("image="));
+ } else if (STRPREFIX(p, "etcd_prefix=")) {
+ src->query = g_strdup(p + strlen("etcd_prefix="));
+ } else if (STRPREFIX(p, "config_file=")) {
+ src->configFile = g_strdup(p + strlen("config_file="));
+ } else if (STRPREFIX(p, "etcd_host=")) {
+ char *h, *sep;
+
+ h = p + strlen("etcd_host=");
+ while (h < e) {
+ for (sep = h; sep < e; ++sep) {
+ if (*sep == '\\' && (sep[1] == ',' ||
+ sep[1] == ';' ||
+ sep[1] == ' ')) {
+ *sep = '\0';
+ sep += 2;
+ break;
+ }
+ }
+
+ if (virStorageSourceRBDAddHost(src, h) < 0)
+ return -1;
+
+ h = sep;
+ }
+ }
+
+ p = next;
+ }
+
+ if (!src->path) {
+ return -1;
+ }
+
+ return 0;
+}
+
+
static int
virStorageSourceParseNBDColonString(const char *nbdstr,
virStorageSourcePtr src)
@@ -2992,6 +3062,11 @@ virStorageSourceParseBackingColon(virStorageSourcePtr src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (virStorageSourceParseVitastorColonString(path, src) < 0)
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -3581,6 +3656,54 @@ virStorageSourceParseBackingJSONRBD(virStorageSourcePtr src,
return 0;
}
+static int
+virStorageSourceParseBackingJSONVitastor(virStorageSourcePtr src,
+ virJSONValuePtr json,
+ const char *jsonstr G_GNUC_UNUSED,
+ int opaque G_GNUC_UNUSED)
+{
+ const char *filename;
+ const char *image = virJSONValueObjectGetString(json, "image");
+ const char *conf = virJSONValueObjectGetString(json, "config_path");
+ const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd_prefix");
+ virJSONValuePtr servers = virJSONValueObjectGetArray(json, "server");
+ size_t nservers;
+ size_t i;
+
+ src->type = VIR_STORAGE_TYPE_NETWORK;
+ src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
+
+ /* legacy syntax passed via 'filename' option */
+ if ((filename = virJSONValueObjectGetString(json, "filename")))
+ return virStorageSourceParseVitastorColonString(filename, src);
+
+ if (!image) {
+ virReportError(VIR_ERR_INVALID_ARG, "%s",
+ _("missing image name in Vitastor backing volume "
+ "JSON specification"));
+ return -1;
+ }
+
+ src->path = g_strdup(image);
+ src->configFile = g_strdup(conf);
+ src->query = g_strdup(etcd_prefix);
+
+ if (servers) {
+ nservers = virJSONValueArraySize(servers);
+
+ src->hosts = g_new0(virStorageNetHostDef, nservers);
+ src->nhosts = nservers;
+
+ for (i = 0; i < nservers; i++) {
+ if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
+ virJSONValueArrayGet(servers, i)) < 0)
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
static int
virStorageSourceParseBackingJSONRaw(virStorageSourcePtr src,
virJSONValuePtr json,
@@ -3759,6 +3882,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
{"sheepdog", false, virStorageSourceParseBackingJSONSheepdog, 0},
{"ssh", false, virStorageSourceParseBackingJSONSSH, 0},
{"rbd", false, virStorageSourceParseBackingJSONRBD, 0},
+ {"vitastor", false, virStorageSourceParseBackingJSONVitastor, 0},
{"raw", true, virStorageSourceParseBackingJSONRaw, 0},
{"nfs", false, virStorageSourceParseBackingJSONNFS, 0},
{"vxhs", false, virStorageSourceParseBackingJSONVxHS, 0},
@@ -4503,6 +4627,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
return 24007;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_RBD:
/* we don't provide a default for RBD */
return 0;
diff --git a/src/util/virstoragefile.h b/src/util/virstoragefile.h
index 5689c39..3eb4e3c 100644
--- a/src/util/virstoragefile.h
+++ b/src/util/virstoragefile.h
@@ -136,6 +136,7 @@ typedef enum {
VIR_STORAGE_NET_PROTOCOL_SSH,
VIR_STORAGE_NET_PROTOCOL_VXHS,
VIR_STORAGE_NET_PROTOCOL_NFS,
+ VIR_STORAGE_NET_PROTOCOL_VITASTOR,
VIR_STORAGE_NET_PROTOCOL_LAST
} virStorageNetProtocol;
diff --git a/tests/storagepoolcapsschemadata/poolcaps-fs.xml b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
index eee75af..8bd0a57 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-fs.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='no'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolcapsschemadata/poolcaps-full.xml b/tests/storagepoolcapsschemadata/poolcaps-full.xml
index 805950a..852df0d 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-full.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-full.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='yes'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolxml2argvtest.c b/tests/storagepoolxml2argvtest.c
index 967d1f2..1e8ff7a 100644
--- a/tests/storagepoolxml2argvtest.c
+++ b/tests/storagepoolxml2argvtest.c
@@ -68,6 +68,7 @@ testCompareXMLToArgvFiles(bool shouldFail,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_VSTORAGE:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_LAST:
default:
VIR_TEST_DEBUG("pool type '%s' has no xml2argv test", defTypeStr);
diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
index 7835fa6..8841fcf 100644
--- a/tools/virsh-pool.c
+++ b/tools/virsh-pool.c
@@ -1237,6 +1237,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd G_GNUC_UNUSED)
case VIR_STORAGE_POOL_VSTORAGE:
flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
+ flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
+ break;
case VIR_STORAGE_POOL_LAST:
break;
}

View File

@@ -0,0 +1,661 @@
commit c6e1958a1b4974828e8e5852beb252ce6594e670
Author: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Mon Jun 28 01:20:19 2021 +0300
Add Vitastor support
diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index 5ea14b6..a9df168 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -1859,6 +1859,35 @@
</element>
</define>
+ <define name="diskSourceNetworkProtocolVitastor">
+ <element name="source">
+ <interleave>
+ <attribute name="protocol">
+ <value>vitastor</value>
+ </attribute>
+ <ref name="diskSourceCommon"/>
+ <optional>
+ <attribute name="name"/>
+ </optional>
+ <optional>
+ <attribute name="query"/>
+ </optional>
+ <zeroOrMore>
+ <ref name="diskSourceNetworkHost"/>
+ </zeroOrMore>
+ <optional>
+ <element name="config">
+ <attribute name="file">
+ <ref name="absFilePath"/>
+ </attribute>
+ <empty/>
+ </element>
+ </optional>
+ <empty/>
+ </interleave>
+ </element>
+ </define>
+
<define name="diskSourceNetworkProtocolISCSI">
<element name="source">
<attribute name="protocol">
@@ -2115,6 +2144,7 @@
<ref name="diskSourceNetworkProtocolSimple"/>
<ref name="diskSourceNetworkProtocolVxHS"/>
<ref name="diskSourceNetworkProtocolNFS"/>
+ <ref name="diskSourceNetworkProtocolVitastor"/>
</choice>
</define>
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
index 089e1e0..d7e7ef4 100644
--- a/include/libvirt/libvirt-storage.h
+++ b/include/libvirt/libvirt-storage.h
@@ -245,6 +245,7 @@ typedef enum {
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS = 1 << 17,
VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE = 1 << 18,
VIR_CONNECT_LIST_STORAGE_POOLS_ISCSI_DIRECT = 1 << 19,
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR = 1 << 20,
} virConnectListAllStoragePoolsFlags;
int virConnectListAllStoragePools(virConnectPtr conn,
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index d78f846..f7222e3 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -8251,7 +8251,8 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
src->configFile = virXPathString("string(./config/@file)", ctxt);
if (src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTP ||
- src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS)
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS ||
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR)
src->query = virXMLPropString(node, "query");
if (virDomainStorageNetworkParseHosts(node, ctxt, &src->hosts, &src->nhosts) < 0)
@@ -30775,6 +30776,7 @@ virDomainStorageSourceTranslateSourcePool(virStorageSource *src,
case VIR_STORAGE_POOL_MPATH:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
index 2aa9a3d..166ca1f 100644
--- a/src/conf/storage_conf.c
+++ b/src/conf/storage_conf.c
@@ -60,7 +60,7 @@ VIR_ENUM_IMPL(virStoragePool,
"logical", "disk", "iscsi",
"iscsi-direct", "scsi", "mpath",
"rbd", "sheepdog", "gluster",
- "zfs", "vstorage",
+ "zfs", "vstorage", "vitastor",
);
VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
@@ -246,6 +246,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
.formatToString = virStorageFileFormatTypeToString,
}
},
+ {.poolType = VIR_STORAGE_POOL_VITASTOR,
+ .poolOptions = {
+ .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
+ VIR_STORAGE_POOL_SOURCE_NETWORK |
+ VIR_STORAGE_POOL_SOURCE_NAME),
+ },
+ .volOptions = {
+ .defaultFormat = VIR_STORAGE_FILE_RAW,
+ .formatFromString = virStorageVolumeFormatFromString,
+ .formatToString = virStorageFileFormatTypeToString,
+ }
+ },
{.poolType = VIR_STORAGE_POOL_SHEEPDOG,
.poolOptions = {
.flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -546,6 +558,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
_("element 'name' is mandatory for RBD pool"));
return -1;
}
+ if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
+ virReportError(VIR_ERR_XML_ERROR, "%s",
+ _("element 'name' is mandatory for Vitastor pool"));
+ return -1;
+ }
if (options->formatFromString) {
g_autofree char *format = NULL;
@@ -1182,6 +1199,7 @@ virStoragePoolDefFormatBuf(virBuffer *buf,
/* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
* files, so they don't have a target */
if (def->type != VIR_STORAGE_POOL_RBD &&
+ def->type != VIR_STORAGE_POOL_VITASTOR &&
def->type != VIR_STORAGE_POOL_SHEEPDOG &&
def->type != VIR_STORAGE_POOL_GLUSTER &&
def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
index 76efaac..928149a 100644
--- a/src/conf/storage_conf.h
+++ b/src/conf/storage_conf.h
@@ -106,6 +106,7 @@ typedef enum {
VIR_STORAGE_POOL_GLUSTER, /* Gluster device */
VIR_STORAGE_POOL_ZFS, /* ZFS */
VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
+ VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
VIR_STORAGE_POOL_LAST,
} virStoragePoolType;
@@ -465,6 +466,7 @@ VIR_ENUM_DECL(virStoragePartedFs);
VIR_CONNECT_LIST_STORAGE_POOLS_SCSI | \
VIR_CONNECT_LIST_STORAGE_POOLS_MPATH | \
VIR_CONNECT_LIST_STORAGE_POOLS_RBD | \
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER | \
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS | \
diff --git a/src/conf/storage_source_conf.c b/src/conf/storage_source_conf.c
index 5ca06fa..05ded49 100644
--- a/src/conf/storage_source_conf.c
+++ b/src/conf/storage_source_conf.c
@@ -85,6 +85,7 @@ VIR_ENUM_IMPL(virStorageNetProtocol,
"ssh",
"vxhs",
"nfs",
+ "vitastor",
);
@@ -1262,6 +1263,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
return 24007;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_RBD:
/* we don't provide a default for RBD */
return 0;
diff --git a/src/conf/storage_source_conf.h b/src/conf/storage_source_conf.h
index 389c7b5..dbf02e3 100644
--- a/src/conf/storage_source_conf.h
+++ b/src/conf/storage_source_conf.h
@@ -127,6 +127,7 @@ typedef enum {
VIR_STORAGE_NET_PROTOCOL_SSH,
VIR_STORAGE_NET_PROTOCOL_VXHS,
VIR_STORAGE_NET_PROTOCOL_NFS,
+ VIR_STORAGE_NET_PROTOCOL_VITASTOR,
VIR_STORAGE_NET_PROTOCOL_LAST
} virStorageNetProtocol;
diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
index 24957d6..4520a73 100644
--- a/src/conf/virstorageobj.c
+++ b/src/conf/virstorageobj.c
@@ -1487,6 +1487,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
return 1;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_LAST:
break;
@@ -1986,6 +1987,8 @@ virStoragePoolObjMatch(virStoragePoolObj *obj,
(obj->def->type == VIR_STORAGE_POOL_MPATH)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
(obj->def->type == VIR_STORAGE_POOL_RBD)) ||
+ (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
+ (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
(obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
index 2a7cdca..f756be1 100644
--- a/src/libvirt-storage.c
+++ b/src/libvirt-storage.c
@@ -92,6 +92,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
* VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
* VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
* VIR_CONNECT_LIST_STORAGE_POOLS_RBD
+ * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
* VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
* VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER
* VIR_CONNECT_LIST_STORAGE_POOLS_ZFS
diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
index 56cb9ab..dfb31b9 100644
--- a/src/libxl/libxl_conf.c
+++ b/src/libxl/libxl_conf.c
@@ -972,6 +972,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSource *src,
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/libxl/xen_xl.c b/src/libxl/xen_xl.c
index c0905b0..c172378 100644
--- a/src/libxl/xen_xl.c
+++ b/src/libxl/xen_xl.c
@@ -1540,6 +1540,7 @@ xenFormatXLDiskSrcNet(virStorageSource *src)
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
index 6627d04..c33f428 100644
--- a/src/qemu/qemu_block.c
+++ b/src/qemu/qemu_block.c
@@ -928,6 +928,38 @@ qemuBlockStorageSourceGetRBDProps(virStorageSource *src,
}
+static virJSONValue *
+qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
+{
+ virJSONValuePtr ret = NULL;
+ virStorageNetHostDefPtr host;
+ size_t i;
+ g_auto(virBuffer) buf = VIR_BUFFER_INITIALIZER;
+ g_autofree char *etcd = NULL;
+
+ for (i = 0; i < src->nhosts; i++) {
+ host = src->hosts + i;
+ if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
+ return NULL;
+ }
+ virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
+ }
+ if (src->nhosts > 0) {
+ etcd = virBufferContentAndReset(&buf);
+ }
+
+ if (virJSONValueObjectCreate(&ret,
+ "S:etcd_host", etcd,
+ "S:etcd_prefix", src->query,
+ "S:config_path", src->configFile,
+ "s:image", src->path,
+ NULL) < 0)
+ return NULL;
+
+ return ret;
+}
+
+
static virJSONValue *
qemuBlockStorageSourceGetSheepdogProps(virStorageSource *src)
{
@@ -1218,6 +1250,12 @@ qemuBlockStorageSourceGetBackendProps(virStorageSource *src,
return NULL;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
+ return NULL;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
@@ -2231,6 +2269,7 @@ qemuBlockGetBackingStoreString(virStorageSource *src,
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
case VIR_STORAGE_NET_PROTOCOL_SSH:
@@ -2608,6 +2647,12 @@ qemuBlockStorageSourceCreateGetStorageProps(virStorageSource *src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(location = qemuBlockStorageSourceGetVitastorProps(src)))
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(location = qemuBlockStorageSourceGetSheepdogProps(src)))
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index ea51369..8258632 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -1074,6 +1074,43 @@ qemuBuildNetworkDriveStr(virStorageSource *src,
ret = virBufferContentAndReset(&buf);
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (strchr(src->path, ':')) {
+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+ _("':' not allowed in Vitastor source volume name '%s'"),
+ src->path);
+ return NULL;
+ }
+
+ virBufferStrcat(&buf, "vitastor:image=", src->path, NULL);
+
+ if (src->nhosts > 0) {
+ virBufferAddLit(&buf, ":etcd_host=");
+ for (i = 0; i < src->nhosts; i++) {
+ if (i)
+ virBufferAddLit(&buf, ",");
+
+ /* assume host containing : is ipv6 */
+ if (strchr(src->hosts[i].name, ':'))
+ virBufferEscape(&buf, '\\', ":", "[%s]",
+ src->hosts[i].name);
+ else
+ virBufferAsprintf(&buf, "%s", src->hosts[i].name);
+
+ if (src->hosts[i].port)
+ virBufferAsprintf(&buf, "\\:%u", src->hosts[i].port);
+ }
+ }
+
+ if (src->configFile)
+ virBufferEscape(&buf, '\\', ":", ":config_path=%s", src->configFile);
+
+ if (src->query)
+ virBufferEscape(&buf, '\\', ":", ":etcd_prefix=%s", src->query);
+
+ ret = virBufferContentAndReset(&buf);
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_VXHS:
virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
_("VxHS protocol does not support URI syntax"));
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index fc60e15..5ab410d 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -4829,7 +4829,8 @@ qemuDomainValidateStorageSource(virStorageSource *src,
if (src->query &&
(actualType != VIR_STORAGE_TYPE_NETWORK ||
(src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTPS &&
- src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP))) {
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR))) {
virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
_("query is supported only with HTTP(S) protocols"));
return -1;
@@ -10027,6 +10028,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSource *src,
break;
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c
index 4e74ddd..14e5f2e 100644
--- a/src/qemu/qemu_snapshot.c
+++ b/src/qemu/qemu_snapshot.c
@@ -402,6 +402,7 @@ qemuSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDef *snapdisk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
@@ -494,6 +495,7 @@ qemuSnapshotPrepareDiskExternalActive(virDomainObj *vm,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
case VIR_STORAGE_NET_PROTOCOL_HTTP:
@@ -647,6 +649,7 @@ qemuSnapshotPrepareDiskInternal(virDomainDiskDef *disk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
index c2ff4b8..70d0689 100644
--- a/src/storage/storage_driver.c
+++ b/src/storage/storage_driver.c
@@ -1644,6 +1644,7 @@ storageVolLookupByPathCallback(virStoragePoolObj *obj,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/storage_file/storage_source_backingstore.c b/src/storage_file/storage_source_backingstore.c
index e48ae72..d7a9b72 100644
--- a/src/storage_file/storage_source_backingstore.c
+++ b/src/storage_file/storage_source_backingstore.c
@@ -284,6 +284,75 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
}
+static int
+virStorageSourceParseVitastorColonString(const char *colonstr,
+ virStorageSource *src)
+{
+ char *p, *e, *next;
+ g_autofree char *options = NULL;
+
+ /* optionally skip the "vitastor:" prefix if provided */
+ if (STRPREFIX(colonstr, "vitastor:"))
+ colonstr += strlen("vitastor:");
+
+ options = g_strdup(colonstr);
+
+ p = options;
+ while (*p) {
+ /* find : delimiter or end of string */
+ for (e = p; *e && *e != ':'; ++e) {
+ if (*e == '\\') {
+ e++;
+ if (*e == '\0')
+ break;
+ }
+ }
+ if (*e == '\0') {
+ next = e; /* last kv pair */
+ } else {
+ next = e + 1;
+ *e = '\0';
+ }
+
+ if (STRPREFIX(p, "image=")) {
+ src->path = g_strdup(p + strlen("image="));
+ } else if (STRPREFIX(p, "etcd_prefix=")) {
+ src->query = g_strdup(p + strlen("etcd_prefix="));
+ } else if (STRPREFIX(p, "config_file=")) {
+ src->configFile = g_strdup(p + strlen("config_file="));
+ } else if (STRPREFIX(p, "etcd_host=")) {
+ char *h, *sep;
+
+ h = p + strlen("etcd_host=");
+ while (h < e) {
+ for (sep = h; sep < e; ++sep) {
+ if (*sep == '\\' && (sep[1] == ',' ||
+ sep[1] == ';' ||
+ sep[1] == ' ')) {
+ *sep = '\0';
+ sep += 2;
+ break;
+ }
+ }
+
+ if (virStorageSourceRBDAddHost(src, h) < 0)
+ return -1;
+
+ h = sep;
+ }
+ }
+
+ p = next;
+ }
+
+ if (!src->path) {
+ return -1;
+ }
+
+ return 0;
+}
+
+
static int
virStorageSourceParseNBDColonString(const char *nbdstr,
virStorageSource *src)
@@ -396,6 +465,11 @@ virStorageSourceParseBackingColon(virStorageSource *src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (virStorageSourceParseVitastorColonString(path, src) < 0)
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -984,6 +1058,54 @@ virStorageSourceParseBackingJSONRBD(virStorageSource *src,
return 0;
}
+static int
+virStorageSourceParseBackingJSONVitastor(virStorageSource *src,
+ virJSONValue *json,
+ const char *jsonstr G_GNUC_UNUSED,
+ int opaque G_GNUC_UNUSED)
+{
+ const char *filename;
+ const char *image = virJSONValueObjectGetString(json, "image");
+ const char *conf = virJSONValueObjectGetString(json, "config_path");
+ const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd_prefix");
+ virJSONValue *servers = virJSONValueObjectGetArray(json, "server");
+ size_t nservers;
+ size_t i;
+
+ src->type = VIR_STORAGE_TYPE_NETWORK;
+ src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
+
+ /* legacy syntax passed via 'filename' option */
+ if ((filename = virJSONValueObjectGetString(json, "filename")))
+ return virStorageSourceParseVitastorColonString(filename, src);
+
+ if (!image) {
+ virReportError(VIR_ERR_INVALID_ARG, "%s",
+ _("missing image name in Vitastor backing volume "
+ "JSON specification"));
+ return -1;
+ }
+
+ src->path = g_strdup(image);
+ src->configFile = g_strdup(conf);
+ src->query = g_strdup(etcd_prefix);
+
+ if (servers) {
+ nservers = virJSONValueArraySize(servers);
+
+ src->hosts = g_new0(virStorageNetHostDef, nservers);
+ src->nhosts = nservers;
+
+ for (i = 0; i < nservers; i++) {
+ if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
+ virJSONValueArrayGet(servers, i)) < 0)
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
static int
virStorageSourceParseBackingJSONRaw(virStorageSource *src,
virJSONValue *json,
@@ -1162,6 +1284,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
{"sheepdog", false, virStorageSourceParseBackingJSONSheepdog, 0},
{"ssh", false, virStorageSourceParseBackingJSONSSH, 0},
{"rbd", false, virStorageSourceParseBackingJSONRBD, 0},
+ {"vitastor", false, virStorageSourceParseBackingJSONVitastor, 0},
{"raw", true, virStorageSourceParseBackingJSONRaw, 0},
{"nfs", false, virStorageSourceParseBackingJSONNFS, 0},
{"vxhs", false, virStorageSourceParseBackingJSONVxHS, 0},
diff --git a/src/test/test_driver.c b/src/test/test_driver.c
index ef0ddab..2173dc3 100644
--- a/src/test/test_driver.c
+++ b/src/test/test_driver.c
@@ -7131,6 +7131,7 @@ testStorageVolumeTypeForPool(int pooltype)
case VIR_STORAGE_POOL_ISCSI_DIRECT:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
return VIR_STORAGE_VOL_NETWORK;
case VIR_STORAGE_POOL_LOGICAL:
case VIR_STORAGE_POOL_DISK:
diff --git a/tests/storagepoolcapsschemadata/poolcaps-fs.xml b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
index eee75af..8bd0a57 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-fs.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='no'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolcapsschemadata/poolcaps-full.xml b/tests/storagepoolcapsschemadata/poolcaps-full.xml
index 805950a..852df0d 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-full.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-full.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='yes'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolxml2argvtest.c b/tests/storagepoolxml2argvtest.c
index 449b745..7f95cc8 100644
--- a/tests/storagepoolxml2argvtest.c
+++ b/tests/storagepoolxml2argvtest.c
@@ -68,6 +68,7 @@ testCompareXMLToArgvFiles(bool shouldFail,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_VSTORAGE:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_LAST:
default:
VIR_TEST_DEBUG("pool type '%s' has no xml2argv test", defTypeStr);
diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
index 18f3839..c8e1436 100644
--- a/tools/virsh-pool.c
+++ b/tools/virsh-pool.c
@@ -1231,6 +1231,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd G_GNUC_UNUSED)
case VIR_STORAGE_POOL_VSTORAGE:
flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
+ flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
+ break;
case VIR_STORAGE_POOL_LAST:
break;
}

View File

@@ -0,0 +1,32 @@
<!-- Example libvirt VM configuration with Vitastor disk -->
<domain type='kvm'>
<name>debian9</name>
<uuid>96f277fb-fd9c-49da-bf21-a5cfd54eb162</uuid>
<memory unit="KiB">524288</memory>
<currentMemory>524288</currentMemory>
<vcpu>1</vcpu>
<os>
<type arch='x86_64'>hvm</type>
<boot dev='hd' />
</os>
<devices>
<emulator>/usr/bin/qemu-system-x86_64</emulator>
<disk type='network' device='disk'>
<target dev='vda' bus='virtio' />
<driver name='qemu' type='raw' />
<!-- name is Vitastor image name -->
<!-- config (optional) is the path to Vitastor's configuration file -->
<!-- query (optional) is Vitastor's etcd_prefix -->
<source protocol='vitastor' name='debian9' query='/vitastor' config='/etc/vitastor/vitastor.conf'>
<!-- hosts = etcd addresses -->
<host name='192.168.7.2' port='2379' />
</source>
<!-- required because Vitastor only supports 4k physical sectors -->
<blockio physical_block_size="4096" logical_block_size="512" />
</disk>
<interface type='network'>
<source network='default' />
</interface>
<graphics type='vnc' port='-1' />
</devices>
</domain>

287
patches/nova-20.diff Normal file
View File

@@ -0,0 +1,287 @@
diff --git a/nova/virt/image/model.py b/nova/virt/image/model.py
index 971f7e9c07..70ed70d5e2 100644
--- a/nova/virt/image/model.py
+++ b/nova/virt/image/model.py
@@ -129,3 +129,22 @@ class RBDImage(Image):
self.user = user
self.password = password
self.servers = servers
+
+
+class VitastorImage(Image):
+ """Class for images in a remote Vitastor cluster"""
+
+ def __init__(self, name, etcd_address = None, etcd_prefix = None, config_path = None):
+ """Create a new Vitastor image object
+
+ :param name: name of the image
+ :param etcd_address: etcd URL(s) (optional)
+ :param etcd_prefix: etcd prefix (optional)
+ :param config_path: path to the configuration (optional)
+ """
+ super(RBDImage, self).__init__(FORMAT_RAW)
+
+ self.name = name
+ self.etcd_address = etcd_address
+ self.etcd_prefix = etcd_prefix
+ self.config_path = config_path
diff --git a/nova/virt/images.py b/nova/virt/images.py
index 5358f3766a..ebe3d6effb 100644
--- a/nova/virt/images.py
+++ b/nova/virt/images.py
@@ -41,7 +41,7 @@ IMAGE_API = glance.API()
def qemu_img_info(path, format=None):
"""Return an object containing the parsed output from qemu-img info."""
- if not os.path.exists(path) and not path.startswith('rbd:'):
+ if not os.path.exists(path) and not path.startswith('rbd:') and not path.startswith('vitastor:'):
raise exception.DiskNotFound(location=path)
info = nova.privsep.qemu.unprivileged_qemu_img_info(path, format=format)
@@ -50,7 +50,7 @@ def qemu_img_info(path, format=None):
def privileged_qemu_img_info(path, format=None, output_format='json'):
"""Return an object containing the parsed output from qemu-img info."""
- if not os.path.exists(path) and not path.startswith('rbd:'):
+ if not os.path.exists(path) and not path.startswith('rbd:') and not path.startswith('vitastor:'):
raise exception.DiskNotFound(location=path)
info = nova.privsep.qemu.privileged_qemu_img_info(path, format=format)
diff --git a/nova/virt/libvirt/config.py b/nova/virt/libvirt/config.py
index f9475776b3..51573fe41d 100644
--- a/nova/virt/libvirt/config.py
+++ b/nova/virt/libvirt/config.py
@@ -1060,6 +1060,8 @@ class LibvirtConfigGuestDisk(LibvirtConfigGuestDevice):
self.driver_iommu = False
self.source_path = None
self.source_protocol = None
+ self.source_query = None
+ self.source_config = None
self.source_name = None
self.source_hosts = []
self.source_ports = []
@@ -1186,7 +1188,8 @@ class LibvirtConfigGuestDisk(LibvirtConfigGuestDevice):
elif self.source_type == "mount":
dev.append(etree.Element("source", dir=self.source_path))
elif self.source_type == "network" and self.source_protocol:
- source = etree.Element("source", protocol=self.source_protocol)
+ source = etree.Element("source", protocol=self.source_protocol,
+ query=self.source_query, config=self.source_config)
if self.source_name is not None:
source.set('name', self.source_name)
hosts_info = zip(self.source_hosts, self.source_ports)
diff --git a/nova/virt/libvirt/driver.py b/nova/virt/libvirt/driver.py
index 391231c527..34dc60dcdd 100644
--- a/nova/virt/libvirt/driver.py
+++ b/nova/virt/libvirt/driver.py
@@ -179,6 +179,7 @@ VOLUME_DRIVERS = {
'local': 'nova.virt.libvirt.volume.volume.LibvirtVolumeDriver',
'fake': 'nova.virt.libvirt.volume.volume.LibvirtFakeVolumeDriver',
'rbd': 'nova.virt.libvirt.volume.net.LibvirtNetVolumeDriver',
+ 'vitastor': 'nova.virt.libvirt.volume.vitastor.LibvirtVitastorVolumeDriver',
'nfs': 'nova.virt.libvirt.volume.nfs.LibvirtNFSVolumeDriver',
'smbfs': 'nova.virt.libvirt.volume.smbfs.LibvirtSMBFSVolumeDriver',
'fibre_channel': 'nova.virt.libvirt.volume.fibrechannel.LibvirtFibreChannelVolumeDriver', # noqa:E501
@@ -385,10 +386,10 @@ class LibvirtDriver(driver.ComputeDriver):
# This prevents the risk of one test setting a capability
# which bleeds over into other tests.
- # LVM and RBD require raw images. If we are not configured to
+ # LVM, RBD, Vitastor require raw images. If we are not configured to
# force convert images into raw format, then we _require_ raw
# images only.
- raw_only = ('rbd', 'lvm')
+ raw_only = ('rbd', 'lvm', 'vitastor')
requires_raw_image = (CONF.libvirt.images_type in raw_only and
not CONF.force_raw_images)
requires_ploop_image = CONF.libvirt.virt_type == 'parallels'
@@ -775,12 +776,12 @@ class LibvirtDriver(driver.ComputeDriver):
# Some imagebackends are only able to import raw disk images,
# and will fail if given any other format. See the bug
# https://bugs.launchpad.net/nova/+bug/1816686 for more details.
- if CONF.libvirt.images_type in ('rbd',):
+ if CONF.libvirt.images_type in ('rbd', 'vitastor'):
if not CONF.force_raw_images:
msg = _("'[DEFAULT]/force_raw_images = False' is not "
- "allowed with '[libvirt]/images_type = rbd'. "
+ "allowed with '[libvirt]/images_type = rbd' or 'vitastor'. "
"Please check the two configs and if you really "
- "do want to use rbd as images_type, set "
+ "do want to use rbd or vitastor as images_type, set "
"force_raw_images to True.")
raise exception.InvalidConfiguration(msg)
@@ -2603,6 +2604,16 @@ class LibvirtDriver(driver.ComputeDriver):
if connection_info['data'].get('auth_enabled'):
username = connection_info['data']['auth_username']
path = f"rbd:{volume_name}:id={username}"
+ elif connection_info['driver_volume_type'] == 'vitastor':
+ volume_name = connection_info['data']['name']
+ path = 'vitastor:image='+volume_name.replace(':', '\\:')
+ for k in [ 'config_path', 'etcd_address', 'etcd_prefix' ]:
+ if k in connection_info['data']:
+ kk = k
+ if kk == 'etcd_address':
+ # FIXME use etcd_address in qemu driver
+ kk = 'etcd_host'
+ path += ":"+kk+"="+connection_info['data'][k].replace(':', '\\:')
else:
path = 'unknown'
raise exception.DiskNotFound(location='unknown')
@@ -2827,8 +2838,8 @@ class LibvirtDriver(driver.ComputeDriver):
image_format = CONF.libvirt.snapshot_image_format or source_type
- # NOTE(bfilippov): save lvm and rbd as raw
- if image_format == 'lvm' or image_format == 'rbd':
+ # NOTE(bfilippov): save lvm and rbd and vitastor as raw
+ if image_format == 'lvm' or image_format == 'rbd' or image_format == 'vitastor':
image_format = 'raw'
metadata = self._create_snapshot_metadata(instance.image_meta,
@@ -2899,7 +2910,7 @@ class LibvirtDriver(driver.ComputeDriver):
expected_state=task_states.IMAGE_UPLOADING)
# TODO(nic): possibly abstract this out to the root_disk
- if source_type == 'rbd' and live_snapshot:
+ if (source_type == 'rbd' or source_type == 'vitastor') and live_snapshot:
# Standard snapshot uses qemu-img convert from RBD which is
# not safe to run with live_snapshot.
live_snapshot = False
@@ -4099,7 +4110,7 @@ class LibvirtDriver(driver.ComputeDriver):
# cleanup rescue volume
lvm.remove_volumes([lvmdisk for lvmdisk in self._lvm_disks(instance)
if lvmdisk.endswith('.rescue')])
- if CONF.libvirt.images_type == 'rbd':
+ if CONF.libvirt.images_type == 'rbd' or CONF.libvirt.images_type == 'vitastor':
filter_fn = lambda disk: (disk.startswith(instance.uuid) and
disk.endswith('.rescue'))
rbd_utils.RBDDriver().cleanup_volumes(filter_fn)
@@ -4356,6 +4367,8 @@ class LibvirtDriver(driver.ComputeDriver):
# TODO(mikal): there is a bug here if images_type has
# changed since creation of the instance, but I am pretty
# sure that this bug already exists.
+ if CONF.libvirt.images_type == 'vitastor':
+ return 'vitastor'
return 'rbd' if CONF.libvirt.images_type == 'rbd' else 'raw'
@staticmethod
@@ -4764,10 +4777,10 @@ class LibvirtDriver(driver.ComputeDriver):
finally:
# NOTE(mikal): if the config drive was imported into RBD,
# then we no longer need the local copy
- if CONF.libvirt.images_type == 'rbd':
+ if CONF.libvirt.images_type == 'rbd' or CONF.libvirt.images_type == 'vitastor':
LOG.info('Deleting local config drive %(path)s '
- 'because it was imported into RBD.',
- {'path': config_disk_local_path},
+ 'because it was imported into %(type).',
+ {'path': config_disk_local_path, 'type': CONF.libvirt.images_type},
instance=instance)
os.unlink(config_disk_local_path)
diff --git a/nova/virt/libvirt/utils.py b/nova/virt/libvirt/utils.py
index da2a6e8b8a..52c02e72f1 100644
--- a/nova/virt/libvirt/utils.py
+++ b/nova/virt/libvirt/utils.py
@@ -340,6 +340,10 @@ def find_disk(guest: libvirt_guest.Guest) -> ty.Tuple[str, ty.Optional[str]]:
disk_path = disk.source_name
if disk_path:
disk_path = 'rbd:' + disk_path
+ elif not disk_path and disk.source_protocol == 'vitastor':
+ disk_path = disk.source_name
+ if disk_path:
+ disk_path = 'vitastor:' + disk_path
if not disk_path:
raise RuntimeError(_("Can't retrieve root device path "
@@ -354,6 +358,8 @@ def get_disk_type_from_path(path: str) -> ty.Optional[str]:
return 'lvm'
elif path.startswith('rbd:'):
return 'rbd'
+ elif path.startswith('vitastor:'):
+ return 'vitastor'
elif (os.path.isdir(path) and
os.path.exists(os.path.join(path, "DiskDescriptor.xml"))):
return 'ploop'
diff --git a/nova/virt/libvirt/volume/vitastor.py b/nova/virt/libvirt/volume/vitastor.py
new file mode 100644
index 0000000000..0256df62c1
--- /dev/null
+++ b/nova/virt/libvirt/volume/vitastor.py
@@ -0,0 +1,75 @@
+# Copyright (c) 2021+, Vitaliy Filippov <vitalif@yourcmc.ru>
+#
+# Licensed under the Apache License, Version 2.0 (the "License"); you may
+# not use this file except in compliance with the License. You may obtain
+# a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+# License for the specific language governing permissions and limitations
+# under the License.
+
+from os_brick import exception as os_brick_exception
+from os_brick import initiator
+from os_brick.initiator import connector
+from oslo_log import log as logging
+
+import nova.conf
+from nova import utils
+from nova.virt.libvirt.volume import volume as libvirt_volume
+
+
+CONF = nova.conf.CONF
+LOG = logging.getLogger(__name__)
+
+
+class LibvirtVitastorVolumeDriver(libvirt_volume.LibvirtBaseVolumeDriver):
+ """Driver to attach Vitastor volumes to libvirt."""
+ def __init__(self, host):
+ super(LibvirtVitastorVolumeDriver, self).__init__(host, is_block_dev=False)
+
+ def connect_volume(self, connection_info, instance):
+ pass
+
+ def disconnect_volume(self, connection_info, instance):
+ pass
+
+ def get_config(self, connection_info, disk_info):
+ """Returns xml for libvirt."""
+ conf = super(LibvirtVitastorVolumeDriver, self).get_config(connection_info, disk_info)
+ conf.source_type = 'network'
+ conf.source_protocol = 'vitastor'
+ conf.source_name = connection_info['data'].get('name')
+ conf.source_query = connection_info['data'].get('etcd_prefix') or None
+ conf.source_config = connection_info['data'].get('config_path') or None
+ conf.source_hosts = []
+ conf.source_ports = []
+ addresses = connection_info['data'].get('etcd_address', '')
+ if addresses:
+ if not isinstance(addresses, list):
+ addresses = addresses.split(',')
+ for addr in addresses:
+ if addr.startswith('https://'):
+ raise NotImplementedError('Vitastor block driver does not support SSL for etcd communication yet')
+ if addr.startswith('http://'):
+ addr = addr[7:]
+ addr = addr.rstrip('/')
+ if addr.endswith('/v3'):
+ addr = addr[0:-3]
+ p = addr.find('/')
+ if p > 0:
+ raise NotImplementedError('libvirt does not support custom URL paths for Vitastor etcd yet. Use /etc/vitastor/vitastor.conf')
+ p = addr.find(':')
+ port = '2379'
+ if p > 0:
+ port = addr[p+1:]
+ addr = addr[0:p]
+ conf.source_hosts.append(addr)
+ conf.source_ports.append(port)
+ return conf
+
+ def extend_volume(self, connection_info, instance, requested_size):
+ raise NotImplementedError

View File

@@ -11,7 +11,7 @@ Index: qemu-3.1+dfsg/qapi/block-core.json
'host_cdrom', 'host_device', 'http', 'https', 'iscsi', 'luks',
'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels', 'qcow',
'qcow2', 'qed', 'quorum', 'raw', 'rbd', 'replication', 'sheepdog',
@@ -3367,6 +3367,24 @@
@@ -3367,6 +3367,28 @@
'*tag': 'str' } }
##
@@ -19,17 +19,21 @@ Index: qemu-3.1+dfsg/qapi/block-core.json
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @etcd_host: etcd connection address
+# @config_path: Path to Vitastor configuration
+# @etcd_host: etcd connection address(es)
+# @etcd_prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { 'inode': 'uint64',
+ 'pool': 'uint64',
+ 'size': 'uint64',
+ 'etcd_host': 'str',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config_path': 'str',
+ '*etcd_host': 'str',
+ '*etcd_prefix': 'str' } }
+
+##

View File

@@ -11,7 +11,7 @@ Index: qemu/qapi/block-core.json
'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
##
@@ -3725,6 +3725,24 @@
@@ -3725,6 +3725,28 @@
'*tag': 'str' } }
##
@@ -19,17 +19,21 @@ Index: qemu/qapi/block-core.json
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @etcd_host: etcd connection address
+# @config_path: Path to Vitastor configuration
+# @etcd_host: etcd connection address(es)
+# @etcd_prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { 'inode': 'uint64',
+ 'pool': 'uint64',
+ 'size': 'uint64',
+ 'etcd_host': 'str',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config_path': 'str',
+ '*etcd_host': 'str',
+ '*etcd_prefix': 'str' } }
+
+##

View File

@@ -11,7 +11,7 @@ Index: qemu/qapi/block-core.json
'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
##
@@ -3635,6 +3635,24 @@
@@ -3635,6 +3635,28 @@
'*tag': 'str' } }
##
@@ -19,17 +19,21 @@ Index: qemu/qapi/block-core.json
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @etcd_host: etcd connection address
+# @config_path: Path to Vitastor configuration
+# @etcd_host: etcd connection address(es)
+# @etcd_prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { 'inode': 'uint64',
+ 'pool': 'uint64',
+ 'size': 'uint64',
+ 'etcd_host': 'str',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config_path': 'str',
+ '*etcd_host': 'str',
+ '*etcd_prefix': 'str' } }
+
+##

View File

@@ -11,7 +11,7 @@ Index: qemu-5.1+dfsg/qapi/block-core.json
'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
##
@@ -3644,6 +3644,24 @@
@@ -3644,6 +3644,28 @@
'*tag': 'str' } }
##
@@ -19,17 +19,21 @@ Index: qemu-5.1+dfsg/qapi/block-core.json
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @etcd_host: etcd connection address
+# @config_path: Path to Vitastor configuration
+# @etcd_host: etcd connection address(es)
+# @etcd_prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { 'inode': 'uint64',
+ 'pool': 'uint64',
+ 'size': 'uint64',
+ 'etcd_host': 'str',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config_path': 'str',
+ '*etcd_host': 'str',
+ '*etcd_prefix': 'str' } }
+
+##

View File

@@ -48,4 +48,4 @@ FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Ve
QEMU=`rpm -qi qemu qemu-kvm | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
perl -i -pe 's/(Requires:\s*qemu(?:-kvm)?)([^\n]+)?/$1 = '$QEMU'/' $VITASTOR/rpm/vitastor-el$EL.spec
tar --transform 's#^#vitastor-0.5.13/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.5.13$(rpm --eval '%dist').tar.gz *
tar --transform 's#^#vitastor-0.6.5/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.6.5$(rpm --eval '%dist').tar.gz *

View File

@@ -11,7 +11,7 @@ RUN rm -rf /var/lib/dnf/*; dnf download --disablerepo='*' --enablerepo='centos-a
RUN rpm --nomd5 -i qemu*.src.rpm
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=PowerTools --spec qemu-kvm.spec
ADD qemu-*-vitastor.patch /root/vitastor/
ADD patches/qemu-*-vitastor.patch /root/vitastor/patches/
RUN set -e; \
mkdir -p /root/packages/qemu-el8; \
@@ -25,7 +25,7 @@ RUN set -e; \
echo "Patch$((PN+1)): qemu-4.2-vitastor.patch" >> qemu-kvm.spec; \
tail -n +2 xx01 >> qemu-kvm.spec; \
perl -i -pe 's/(^Release:\s*\d+)/$1.vitastor/' qemu-kvm.spec; \
cp /root/vitastor/qemu-4.2-vitastor.patch ~/rpmbuild/SOURCES; \
cp /root/vitastor/patches/qemu-4.2-vitastor.patch ~/rpmbuild/SOURCES; \
rpmbuild --nocheck -ba qemu-kvm.spec; \
cp ~/rpmbuild/RPMS/*/*qemu* /root/packages/qemu-el8/; \
cp ~/rpmbuild/SRPMS/*qemu* /root/packages/qemu-el8/

View File

@@ -15,8 +15,9 @@ RUN yumdownloader --disablerepo=centos-sclo-rh --source fio
RUN rpm --nomd5 -i qemu*.src.rpm
RUN rpm --nomd5 -i fio*.src.rpm
RUN rm -f /etc/yum.repos.d/CentOS-Media.repo
RUN cd ~/rpmbuild/SPECS && yum-builddep -y --enablerepo='*' --disablerepo=centos-sclo-rh --disablerepo=centos-sclo-rh-source --disablerepo=centos-sclo-sclo-testing qemu-kvm.spec
RUN cd ~/rpmbuild/SPECS && yum-builddep -y --enablerepo='*' --disablerepo=centos-sclo-rh --disablerepo=centos-sclo-rh-source --disablerepo=centos-sclo-sclo-testing fio.spec
RUN cd ~/rpmbuild/SPECS && yum-builddep -y qemu-kvm.spec
RUN cd ~/rpmbuild/SPECS && yum-builddep -y fio.spec
RUN yum -y install rdma-core-devel
ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root
@@ -37,7 +38,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.5.13.el7.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-0.6.5.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor
Version: 0.5.13
Version: 0.6.5
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.5.13.el7.tar.gz
Source0: vitastor-0.6.5.el7.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
@@ -14,6 +14,7 @@ BuildRequires: rh-nodejs12
BuildRequires: rh-nodejs12-npm
BuildRequires: jerasure-devel
BuildRequires: gf-complete-devel
BuildRequires: libibverbs-devel
BuildRequires: cmake
Requires: fio = 3.7-1.el7
Requires: qemu-kvm = 2.0.0-1.el7.6
@@ -61,8 +62,9 @@ cp -r mon %buildroot/usr/lib/vitastor/mon
%_libdir/libfio_vitastor.so
%_libdir/libfio_vitastor_blk.so
%_libdir/libfio_vitastor_sec.so
%_libdir/libvitastor_blk.so
%_libdir/libvitastor_client.so
%_libdir/libvitastor_blk.so*
%_libdir/libvitastor_client.so*
%_includedir/vitastor_c.h
/usr/lib/vitastor

View File

@@ -15,6 +15,7 @@ RUN rpm --nomd5 -i qemu*.src.rpm
RUN rpm --nomd5 -i fio*.src.rpm
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=powertools --spec qemu-kvm.spec
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=powertools --spec fio.spec && dnf install -y cmake
RUN yum -y install libibverbs-devel libarchive
ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root
@@ -35,7 +36,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.5.13.el8.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-0.6.5.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor
Version: 0.5.13
Version: 0.6.5
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.5.13.el8.tar.gz
Source0: vitastor-0.6.5.el8.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
@@ -13,6 +13,7 @@ BuildRequires: gcc-toolset-9-gcc-c++
BuildRequires: nodejs >= 10
BuildRequires: jerasure-devel
BuildRequires: gf-complete-devel
BuildRequires: libibverbs-devel
BuildRequires: cmake
Requires: fio = 3.7-3.el8
Requires: qemu-kvm = 4.2.0-29.el8.6
@@ -58,8 +59,9 @@ cp -r mon %buildroot/usr/lib/vitastor
%_libdir/libfio_vitastor.so
%_libdir/libfio_vitastor_blk.so
%_libdir/libfio_vitastor_sec.so
%_libdir/libvitastor_blk.so
%_libdir/libvitastor_client.so
%_libdir/libvitastor_blk.so*
%_libdir/libvitastor_client.so*
%_includedir/vitastor_c.h
/usr/lib/vitastor

View File

@@ -4,6 +4,8 @@ project(vitastor)
include(GNUInstallDirs)
set(WITH_QEMU true CACHE BOOL "Build QEMU driver")
set(WITH_FIO true CACHE BOOL "Build FIO driver")
set(QEMU_PLUGINDIR qemu CACHE STRING "QEMU plugin directory suffix (qemu-kvm on RHEL)")
set(WITH_ASAN false CACHE BOOL "Build with AddressSanitizer")
if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
@@ -13,7 +15,7 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
endif()
add_definitions(-DVERSION="0.6-dev")
add_definitions(-DVERSION="0.6.5")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -I ${CMAKE_SOURCE_DIR}/src)
if (${WITH_ASAN})
add_definitions(-fsanitize=address -fno-omit-frame-pointer)
@@ -36,12 +38,19 @@ string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_C_FLAGS_RELWITHDEBINFO "${CM
find_package(PkgConfig)
pkg_check_modules(LIBURING REQUIRED liburing)
pkg_check_modules(GLIB REQUIRED glib-2.0)
if (${WITH_QEMU})
pkg_check_modules(GLIB REQUIRED glib-2.0)
endif (${WITH_QEMU})
pkg_check_modules(IBVERBS libibverbs)
if (IBVERBS_LIBRARIES)
add_definitions(-DWITH_RDMA)
endif (IBVERBS_LIBRARIES)
include_directories(
../
/usr/include/jerasure
${LIBURING_INCLUDE_DIRS}
${IBVERBS_INCLUDE_DIRS}
)
# libvitastor_blk.so
@@ -52,56 +61,81 @@ add_library(vitastor_blk SHARED
target_link_libraries(vitastor_blk
${LIBURING_LIBRARIES}
tcmalloc_minimal
# for timerfd_manager
vitastor_common
)
set_target_properties(vitastor_blk PROPERTIES VERSION ${VERSION} SOVERSION 0)
# libfio_vitastor_blk.so
add_library(fio_vitastor_blk SHARED
fio_engine.cpp
../json11/json11.cpp
)
target_link_libraries(fio_vitastor_blk
vitastor_blk
if (${WITH_FIO})
# libfio_vitastor_blk.so
add_library(fio_vitastor_blk SHARED
fio_engine.cpp
../json11/json11.cpp
)
target_link_libraries(fio_vitastor_blk
vitastor_blk
)
endif (${WITH_FIO})
# libvitastor_common.a
set(MSGR_RDMA "")
if (IBVERBS_LIBRARIES)
set(MSGR_RDMA "msgr_rdma.cpp")
endif (IBVERBS_LIBRARIES)
add_library(vitastor_common STATIC
epoll_manager.cpp etcd_state_client.cpp
messenger.cpp msgr_stop.cpp msgr_op.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp
http_client.cpp osd_ops.cpp pg_states.cpp timerfd_manager.cpp base64.cpp ${MSGR_RDMA}
)
target_compile_options(vitastor_common PUBLIC -fPIC)
# vitastor-osd
add_executable(vitastor-osd
osd_main.cpp osd.cpp osd_secondary.cpp msgr_receive.cpp msgr_send.cpp osd_peering.cpp osd_flush.cpp osd_peering_pg.cpp
osd_primary.cpp osd_primary_sync.cpp osd_primary_write.cpp osd_primary_subops.cpp
etcd_state_client.cpp messenger.cpp msgr_stop.cpp msgr_op.cpp osd_cluster.cpp http_client.cpp osd_ops.cpp pg_states.cpp
osd_rmw.cpp base64.cpp timerfd_manager.cpp epoll_manager.cpp ../json11/json11.cpp
osd_main.cpp osd.cpp osd_secondary.cpp osd_peering.cpp osd_flush.cpp osd_peering_pg.cpp
osd_primary.cpp osd_primary_chain.cpp osd_primary_sync.cpp osd_primary_write.cpp osd_primary_subops.cpp
osd_cluster.cpp osd_rmw.cpp
)
target_link_libraries(vitastor-osd
vitastor_common
vitastor_blk
Jerasure
${IBVERBS_LIBRARIES}
)
# libfio_vitastor_sec.so
add_library(fio_vitastor_sec SHARED
fio_sec_osd.cpp
rw_blocking.cpp
)
target_link_libraries(fio_vitastor_sec
tcmalloc_minimal
)
if (${WITH_FIO})
# libfio_vitastor_sec.so
add_library(fio_vitastor_sec SHARED
fio_sec_osd.cpp
rw_blocking.cpp
)
target_link_libraries(fio_vitastor_sec
tcmalloc_minimal
)
endif (${WITH_FIO})
# libvitastor_client.so
add_library(vitastor_client SHARED
cluster_client.cpp epoll_manager.cpp etcd_state_client.cpp
messenger.cpp msgr_stop.cpp msgr_op.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp
http_client.cpp osd_ops.cpp pg_states.cpp timerfd_manager.cpp base64.cpp
cluster_client.cpp
vitastor_c.cpp
)
set_target_properties(vitastor_client PROPERTIES PUBLIC_HEADER "vitastor_c.h")
target_link_libraries(vitastor_client
vitastor_common
tcmalloc_minimal
${LIBURING_LIBRARIES}
${IBVERBS_LIBRARIES}
)
set_target_properties(vitastor_client PROPERTIES VERSION ${VERSION} SOVERSION 0)
# libfio_vitastor.so
add_library(fio_vitastor SHARED
fio_cluster.cpp
)
target_link_libraries(fio_vitastor
vitastor_client
)
if (${WITH_FIO})
# libfio_vitastor.so
add_library(fio_vitastor SHARED
fio_cluster.cpp
)
target_link_libraries(fio_vitastor
vitastor_client
)
endif (${WITH_FIO})
# vitastor-nbd
add_executable(vitastor-nbd
@@ -124,27 +158,24 @@ add_executable(vitastor-dump-journal
dump_journal.cpp crc32c.c
)
# qemu_driver.so
add_library(qemu_proxy STATIC qemu_proxy.cpp)
target_compile_options(qemu_proxy PUBLIC -fPIC)
target_include_directories(qemu_proxy PUBLIC
../qemu/b/qemu
../qemu/include
${GLIB_INCLUDE_DIRS}
)
target_link_libraries(qemu_proxy
vitastor_client
)
add_library(qemu_vitastor SHARED
qemu_driver.c
)
target_link_libraries(qemu_vitastor
qemu_proxy
)
set_target_properties(qemu_vitastor PROPERTIES
PREFIX ""
OUTPUT_NAME "block-vitastor"
)
if (${WITH_QEMU})
# qemu_driver.so
add_library(qemu_vitastor SHARED
qemu_driver.c
)
target_include_directories(qemu_vitastor PUBLIC
../qemu/b/qemu
../qemu/include
${GLIB_INCLUDE_DIRS}
)
target_link_libraries(qemu_vitastor
vitastor_client
)
set_target_properties(qemu_vitastor PROPERTIES
PREFIX ""
OUTPUT_NAME "block-vitastor"
)
endif (${WITH_QEMU})
### Test stubs
@@ -162,11 +193,12 @@ target_link_libraries(osd_rmw_test Jerasure tcmalloc_minimal)
# stub_uring_osd
add_executable(stub_uring_osd
stub_uring_osd.cpp epoll_manager.cpp messenger.cpp msgr_stop.cpp msgr_op.cpp
msgr_send.cpp msgr_receive.cpp ringloop.cpp timerfd_manager.cpp ../json11/json11.cpp
stub_uring_osd.cpp
)
target_link_libraries(stub_uring_osd
vitastor_common
${LIBURING_LIBRARIES}
${IBVERBS_LIBRARIES}
tcmalloc_minimal
)
@@ -177,6 +209,14 @@ target_link_libraries(osd_peering_pg_test tcmalloc_minimal)
# test_allocator
add_executable(test_allocator test_allocator.cpp allocator.cpp)
# test_cas
add_executable(test_cas
test_cas.cpp
)
target_link_libraries(test_cas
vitastor_client
)
# test_cluster_client
add_executable(test_cluster_client
test_cluster_client.cpp
@@ -187,7 +227,7 @@ target_compile_definitions(test_cluster_client PUBLIC -D__MOCK__)
target_include_directories(test_cluster_client PUBLIC ${CMAKE_SOURCE_DIR}/src/mock)
## test_blockstore, test_shit
#add_executable(test_blockstore test_blockstore.cpp timerfd_interval.cpp)
#add_executable(test_blockstore test_blockstore.cpp)
#target_link_libraries(test_blockstore blockstore)
#add_executable(test_shit test_shit.cpp osd_peering_pg.cpp)
#target_link_libraries(test_shit ${LIBURING_LIBRARIES} m)
@@ -195,5 +235,14 @@ target_include_directories(test_cluster_client PUBLIC ${CMAKE_SOURCE_DIR}/src/mo
### Install
install(TARGETS vitastor-osd vitastor-dump-journal vitastor-nbd vitastor-rm RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
install(TARGETS fio_vitastor fio_vitastor_blk fio_vitastor_sec vitastor_blk vitastor_client LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR})
install(TARGETS qemu_vitastor LIBRARY DESTINATION /usr/${CMAKE_INSTALL_LIBDIR}/${QEMU_PLUGINDIR})
install(
TARGETS vitastor_blk vitastor_client
LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
PUBLIC_HEADER DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
)
if (${WITH_FIO})
install(TARGETS fio_vitastor fio_vitastor_blk fio_vitastor_sec LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR})
endif (${WITH_FIO})
if (${WITH_QEMU})
install(TARGETS qemu_vitastor LIBRARY DESTINATION /usr/${CMAKE_INSTALL_LIBDIR}/${QEMU_PLUGINDIR})
endif (${WITH_QEMU})

View File

@@ -142,3 +142,35 @@ uint64_t allocator::get_free_count()
{
return free;
}
void bitmap_set(void *bitmap, uint64_t start, uint64_t len, uint64_t bitmap_granularity)
{
if (start == 0)
{
if (len == 32*bitmap_granularity)
{
*((uint32_t*)bitmap) = UINT32_MAX;
return;
}
else if (len == 64*bitmap_granularity)
{
*((uint64_t*)bitmap) = UINT64_MAX;
return;
}
}
unsigned bit_start = start / bitmap_granularity;
unsigned bit_end = ((start + len) + bitmap_granularity - 1) / bitmap_granularity;
while (bit_start < bit_end)
{
if (!(bit_start & 7) && bit_end >= bit_start+8)
{
((uint8_t*)bitmap)[bit_start / 8] = UINT8_MAX;
bit_start += 8;
}
else
{
((uint8_t*)bitmap)[bit_start / 8] |= 1 << (bit_start % 8);
bit_start++;
}
}
}

View File

@@ -21,3 +21,5 @@ public:
uint64_t find_free();
uint64_t get_free_count();
};
void bitmap_set(void *bitmap, uint64_t start, uint64_t len, uint64_t bitmap_granularity);

View File

@@ -3,9 +3,9 @@
#include "blockstore_impl.h"
blockstore_t::blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop)
blockstore_t::blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd)
{
impl = new blockstore_impl_t(config, ringloop);
impl = new blockstore_impl_t(config, ringloop, tfd);
}
blockstore_t::~blockstore_t()
@@ -38,9 +38,14 @@ void blockstore_t::enqueue_op(blockstore_op_t *op)
impl->enqueue_op(op);
}
std::unordered_map<object_id, uint64_t> & blockstore_t::get_unstable_writes()
int blockstore_t::read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version)
{
return impl->unstable_writes;
return impl->read_bitmap(oid, target_version, bitmap, result_version);
}
std::map<uint64_t, uint64_t> & blockstore_t::get_inode_space_stats()
{
return impl->inode_space_stats;
}
uint32_t blockstore_t::get_block_size()

View File

@@ -16,6 +16,7 @@
#include "object_id.h"
#include "ringloop.h"
#include "timerfd_manager.h"
// Memory alignment for direct I/O (usually 512 bytes)
// All other alignments must be a multiple of this one
@@ -27,6 +28,7 @@
#define DEFAULT_ORDER 17
#define MIN_BLOCK_SIZE 4*1024
#define MAX_BLOCK_SIZE 128*1024*1024
#define DEFAULT_BITMAP_GRANULARITY 4096
#define BS_OP_MIN 1
#define BS_OP_READ 1
@@ -64,6 +66,8 @@ Input:
- offset, len = offset and length within object. length may be zero, in that case
read operation only returns the version / write operation only bumps the version
- buf = pre-allocated buffer for data (read) / with data (write). may be NULL if len == 0.
- bitmap = pointer to the new 'external' object bitmap data. Its part which is respective to the
write request is copied into the metadata area bitwise and stored there.
Output:
- retval = number of bytes actually read/written or negative error number (-EINVAL or -ENOSPC)
@@ -141,6 +145,7 @@ struct blockstore_op_t
uint32_t offset;
uint32_t len;
void *buf;
void *bitmap;
int retval;
uint8_t private_data[BS_OP_PRIVATE_DATA_SIZE];
@@ -154,7 +159,7 @@ class blockstore_t
{
blockstore_impl_t *impl;
public:
blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop);
blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd);
~blockstore_t();
// Event loop
@@ -175,8 +180,11 @@ public:
// Submission
void enqueue_op(blockstore_op_t *op);
// Unstable writes are added here (map of object_id -> version)
std::unordered_map<object_id, uint64_t> & get_unstable_writes();
// Simplified synchronous operation: get object bitmap & current version
int read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version = NULL);
// Get per-inode space usage statistics
std::map<uint64_t, uint64_t> & get_inode_space_stats();
// FIXME rename to object_size
uint32_t get_block_size();

View File

@@ -17,8 +17,8 @@ journal_flusher_t::journal_flusher_t(blockstore_impl_t *bs)
// FIXME: allow to configure flusher_start_threshold and journal_trim_interval
flusher_start_threshold = bs->journal_block_size / sizeof(journal_entry_stable);
journal_trim_interval = 512;
journal_trim_counter = 0;
trim_wanted = 0;
journal_trim_counter = bs->journal.flush_journal ? 1 : 0;
trim_wanted = bs->journal.flush_journal ? 1 : 0;
journal_superblock = bs->journal.inmemory ? bs->journal.buffer : memalign_or_die(MEM_ALIGNMENT, bs->journal_block_size);
co = new journal_flusher_co[max_flusher_count];
for (int i = 0; i < max_flusher_count; i++)
@@ -428,18 +428,18 @@ resume_1:
{
new_clean_bitmap = (bs->inmemory_meta
? meta_new.buf + meta_new.pos*bs->clean_entry_size + sizeof(clean_disk_entry)
: bs->clean_bitmap + (clean_loc >> bs->block_order)*bs->clean_entry_bitmap_size);
: bs->clean_bitmap + (clean_loc >> bs->block_order)*(2*bs->clean_entry_bitmap_size));
if (clean_init_bitmap)
{
memset(new_clean_bitmap, 0, bs->clean_entry_bitmap_size);
bitmap_set(new_clean_bitmap, clean_bitmap_offset, clean_bitmap_len);
bitmap_set(new_clean_bitmap, clean_bitmap_offset, clean_bitmap_len, bs->bitmap_granularity);
}
}
for (it = v.begin(); it != v.end(); it++)
{
if (new_clean_bitmap)
{
bitmap_set(new_clean_bitmap, it->offset, it->len);
bitmap_set(new_clean_bitmap, it->offset, it->len, bs->bitmap_granularity);
}
await_sqe(4);
data->iov = (struct iovec){ it->buf, (size_t)it->len };
@@ -473,6 +473,7 @@ resume_1:
wait_state = 5;
return false;
}
// zero out old metadata entry
memset(meta_old.buf + meta_old.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
await_sqe(15);
data->iov = (struct iovec){ meta_old.buf, bs->meta_block_size };
@@ -509,6 +510,12 @@ resume_1:
{
memcpy(&new_entry->bitmap, new_clean_bitmap, bs->clean_entry_bitmap_size);
}
// copy latest external bitmap/attributes
if (bs->clean_entry_bitmap_size)
{
void *bmp_ptr = bs->clean_entry_bitmap_size > sizeof(void*) ? dirty_end->second.bitmap : &dirty_end->second.bitmap;
memcpy((void*)(new_entry+1) + bs->clean_entry_bitmap_size, bmp_ptr, bs->clean_entry_bitmap_size);
}
}
await_sqe(6);
data->iov = (struct iovec){ meta_new.buf, bs->meta_block_size };
@@ -595,6 +602,7 @@ resume_1:
.size = sizeof(journal_entry_start),
.reserved = 0,
.journal_start = new_trim_pos,
.version = JOURNAL_VERSION,
};
((journal_entry_start*)flusher->journal_superblock)->crc32 = je_crc32((journal_entry*)flusher->journal_superblock);
data->iov = (struct iovec){ flusher->journal_superblock, bs->journal_block_size };
@@ -626,6 +634,12 @@ resume_1:
#endif
flusher->trimming = false;
}
if (bs->journal.flush_journal && !flusher->flush_queue.size())
{
assert(bs->journal.used_start == bs->journal.next_free);
printf("Journal flushed\n");
exit(0);
}
}
// All done
flusher->active_flushers--;
@@ -879,35 +893,3 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
}
return true;
}
void journal_flusher_co::bitmap_set(void *bitmap, uint64_t start, uint64_t len)
{
if (start == 0)
{
if (len == 32*bs->bitmap_granularity)
{
*((uint32_t*)bitmap) = UINT32_MAX;
return;
}
else if (len == 64*bs->bitmap_granularity)
{
*((uint64_t*)bitmap) = UINT64_MAX;
return;
}
}
unsigned bit_start = start / bs->bitmap_granularity;
unsigned bit_end = ((start + len) + bs->bitmap_granularity - 1) / bs->bitmap_granularity;
while (bit_start < bit_end)
{
if (!(bit_start & 7) && bit_end >= bit_start+8)
{
((uint8_t*)bitmap)[bit_start / 8] = UINT8_MAX;
bit_start += 8;
}
else
{
((uint8_t*)bitmap)[bit_start / 8] |= 1 << (bit_start % 8);
bit_start++;
}
}
}

View File

@@ -69,7 +69,6 @@ class journal_flusher_co
bool modify_meta_read(uint64_t meta_loc, flusher_meta_write_t &wr, int wait_base);
void update_clean_db();
bool fsync_batch(bool fsync_meta, int wait_base);
void bitmap_set(void *bitmap, uint64_t start, uint64_t len);
public:
journal_flusher_co();
bool loop();

View File

@@ -3,9 +3,10 @@
#include "blockstore_impl.h"
blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop)
blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd)
{
assert(sizeof(blockstore_op_private_t) <= BS_OP_PRIVATE_DATA_SIZE);
this->tfd = tfd;
this->ringloop = ringloop;
ring_consumer.loop = [this]() { loop(); };
ringloop->register_consumer(&ring_consumer);
@@ -92,10 +93,23 @@ void blockstore_impl_t::loop()
{
delete journal_init_reader;
journal_init_reader = NULL;
initialized = 10;
if (journal.flush_journal)
initialized = 3;
else
initialized = 10;
ringloop->wakeup();
}
}
if (initialized == 3)
{
if (readonly)
{
printf("Can't flush the journal in readonly mode\n");
exit(1);
}
flusher->loop();
ringloop->submit();
}
}
else
{
@@ -443,7 +457,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
}
for (; clean_it != clean_end; clean_it++)
{
if (!pg_count || ((clean_it->first.inode + clean_it->first.stripe / pg_stripe_size) % pg_count) == list_pg)
if (!pg_count || ((clean_it->first.stripe / pg_stripe_size) % pg_count) == list_pg) // like map_to_pg()
{
if (stable_count >= stable_alloc)
{
@@ -488,7 +502,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
}
for (; dirty_it != dirty_end; dirty_it++)
{
if (!pg_count || ((dirty_it->first.oid.inode + dirty_it->first.oid.stripe / pg_stripe_size) % pg_count) == list_pg)
if (!pg_count || ((dirty_it->first.oid.stripe / pg_stripe_size) % pg_count) == list_pg) // like map_to_pg()
{
if (IS_DELETE(dirty_it->second.state))
{

View File

@@ -9,6 +9,7 @@
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>
#include <linux/fs.h>
@@ -77,7 +78,25 @@
#include "blockstore_journal.h"
// 24 bytes + block bitmap per "clean" entry on disk with fixed metadata tables
// "VITAstor"
#define BLOCKSTORE_META_MAGIC 0x726F747341544956l
#define BLOCKSTORE_META_VERSION 1
// metadata header (superblock)
// FIXME: After adding the OSD superblock, add a key to metadata
// and journal headers to check if they belong to the same OSD
struct __attribute__((__packed__)) blockstore_meta_header_t
{
uint64_t zero;
uint64_t magic;
uint64_t version;
uint32_t meta_block_size;
uint32_t data_block_size;
uint32_t bitmap_granularity;
};
// 32 bytes = 24 bytes + block bitmap (4 bytes by default) + external attributes (also bitmap, 4 bytes by default)
// per "clean" entry on disk with fixed metadata tables
// FIXME: maybe add crc32's to metadata
struct __attribute__((__packed__)) clean_disk_entry
{
@@ -93,7 +112,7 @@ struct __attribute__((__packed__)) clean_entry
uint64_t location;
};
// 56 = 24 + 32 bytes per dirty entry in memory (obj_ver_id => dirty_entry)
// 64 = 24 + 40 bytes per dirty entry in memory (obj_ver_id => dirty_entry)
struct __attribute__((__packed__)) dirty_entry
{
uint32_t state;
@@ -102,6 +121,7 @@ struct __attribute__((__packed__)) dirty_entry
uint32_t offset; // data offset within object (stripe)
uint32_t len; // data length
uint64_t journal_sector; // journal sector used for this entry
void* bitmap; // either external bitmap itself when it fits, or a pointer to it when it doesn't
};
// - Sync must be submitted after previous writes/deletes (not before!)
@@ -156,6 +176,7 @@ struct blockstore_op_private_t
struct iovec iov_zerofill[3];
// Warning: must not have a default value here because it's written to before calling constructor in blockstore_write.cpp O_o
uint64_t real_version;
timespec tv_begin;
// Sync
std::vector<obj_ver_id> sync_big_writes, sync_small_writes;
@@ -201,6 +222,14 @@ class blockstore_impl_t
unsigned max_flusher_count, min_flusher_count;
// Maximum queue depth
unsigned max_write_iodepth = 128;
// Enable small (journaled) write throttling, useful for the SSD+HDD case
bool throttle_small_writes = false;
// Target data device iops, bandwidth and parallelism for throttling (100/100/1 is the default for HDD)
int throttle_target_iops = 100;
int throttle_target_mbs = 100;
int throttle_target_parallelism = 1;
// Minimum difference in microseconds between target and real execution times to throttle the response
int throttle_threshold_us = 50;
/******* END OF OPTIONS *******/
struct ring_consumer_t ring_consumer;
@@ -231,6 +260,7 @@ class blockstore_impl_t
bool live = false, queue_stall = false;
ring_loop_t *ringloop;
timerfd_manager_t *tfd;
bool stop_sync_submitted;
@@ -250,6 +280,7 @@ class blockstore_impl_t
void open_data();
void open_meta();
void open_journal();
uint8_t* get_clean_entry_bitmap(uint64_t block_loc, int offset);
// Asynchronous init
int initialized;
@@ -300,7 +331,7 @@ class blockstore_impl_t
public:
blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop);
blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd);
~blockstore_impl_t();
// Event loop
@@ -321,9 +352,15 @@ public:
// Submission
void enqueue_op(blockstore_op_t *op);
// Simplified synchronous operation: get object bitmap & current version
int read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version = NULL);
// Unstable writes are added here (map of object_id -> version)
std::unordered_map<object_id, uint64_t> unstable_writes;
// Space usage statistics
std::map<uint64_t, uint64_t> inode_space_stats;
inline uint32_t get_block_size() { return block_size; }
inline uint64_t get_block_count() { return block_count; }
inline uint64_t get_free_block_count() { return data_alloc->get_free_count(); }

View File

@@ -3,6 +3,20 @@
#include "blockstore_impl.h"
#define GET_SQE() \
sqe = bs->get_sqe();\
if (!sqe)\
throw std::runtime_error("io_uring is full during initialization");\
data = ((ring_data_t*)sqe->user_data)
static bool iszero(uint64_t *buf, int len)
{
for (int i = 0; i < len; i++)
if (buf[i] != 0)
return false;
return true;
}
blockstore_init_meta::blockstore_init_meta(blockstore_impl_t *bs)
{
this->bs = bs;
@@ -10,7 +24,7 @@ blockstore_init_meta::blockstore_init_meta(blockstore_impl_t *bs)
void blockstore_init_meta::handle_event(ring_data_t *data)
{
if (data->res <= 0)
if (data->res < 0)
{
throw std::runtime_error(
std::string("read metadata failed at offset ") + std::to_string(metadata_read) +
@@ -28,6 +42,12 @@ int blockstore_init_meta::loop()
{
if (wait_state == 1)
goto resume_1;
else if (wait_state == 2)
goto resume_2;
else if (wait_state == 3)
goto resume_3;
else if (wait_state == 4)
goto resume_4;
printf("Reading blockstore metadata\n");
if (bs->inmemory_meta)
metadata_buffer = bs->metadata_buffer;
@@ -35,22 +55,98 @@ int blockstore_init_meta::loop()
metadata_buffer = memalign(MEM_ALIGNMENT, 2*bs->metadata_buf_size);
if (!metadata_buffer)
throw std::runtime_error("Failed to allocate metadata read buffer");
// Read superblock
GET_SQE();
data->iov = { metadata_buffer, bs->meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data); };
my_uring_prep_readv(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset);
bs->ringloop->submit();
submitted = 1;
resume_1:
if (submitted)
{
wait_state = 1;
return 1;
}
if (iszero((uint64_t*)metadata_buffer, bs->meta_block_size / sizeof(uint64_t)))
{
{
blockstore_meta_header_t *hdr = (blockstore_meta_header_t *)metadata_buffer;
hdr->zero = 0;
hdr->magic = BLOCKSTORE_META_MAGIC;
hdr->version = BLOCKSTORE_META_VERSION;
hdr->meta_block_size = bs->meta_block_size;
hdr->data_block_size = bs->block_size;
hdr->bitmap_granularity = bs->bitmap_granularity;
}
if (bs->readonly)
{
printf("Skipping metadata initialization because blockstore is readonly\n");
}
else
{
printf("Initializing metadata area\n");
GET_SQE();
data->iov = (struct iovec){ metadata_buffer, bs->meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data); };
my_uring_prep_writev(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset);
bs->ringloop->submit();
submitted = 1;
resume_3:
if (submitted > 0)
{
wait_state = 3;
return 1;
}
zero_on_init = true;
}
}
else
{
blockstore_meta_header_t *hdr = (blockstore_meta_header_t *)metadata_buffer;
if (hdr->zero != 0 ||
hdr->magic != BLOCKSTORE_META_MAGIC ||
hdr->version != BLOCKSTORE_META_VERSION)
{
printf(
"Metadata is corrupt or old version.\n"
" If this is a new OSD please zero out the metadata area before starting it.\n"
" If you need to upgrade from 0.5.x please request it via the issue tracker.\n"
);
exit(1);
}
if (hdr->meta_block_size != bs->meta_block_size ||
hdr->data_block_size != bs->block_size ||
hdr->bitmap_granularity != bs->bitmap_granularity)
{
printf(
"Configuration stored in metadata superblock"
" (meta_block_size=%u, data_block_size=%u, bitmap_granularity=%u)"
" differs from OSD configuration (%lu/%u/%lu).\n",
hdr->meta_block_size, hdr->data_block_size, hdr->bitmap_granularity,
bs->meta_block_size, bs->block_size, bs->bitmap_granularity
);
exit(1);
}
}
// Skip superblock
bs->meta_offset += bs->meta_block_size;
prev_done = 0;
done_len = 0;
done_pos = 0;
metadata_read = 0;
// Read the rest of the metadata
while (1)
{
resume_1:
resume_2:
if (submitted)
{
wait_state = 1;
wait_state = 2;
return 1;
}
if (metadata_read < bs->meta_len)
{
sqe = bs->get_sqe();
if (!sqe)
{
throw std::runtime_error("io_uring is full while trying to read metadata");
}
data = ((ring_data_t*)sqe->user_data);
GET_SQE();
data->iov = {
metadata_buffer + (bs->inmemory_meta
? metadata_read
@@ -58,7 +154,14 @@ int blockstore_init_meta::loop()
bs->meta_len - metadata_read > bs->metadata_buf_size ? bs->metadata_buf_size : bs->meta_len - metadata_read,
};
data->callback = [this](ring_data_t *data) { handle_event(data); };
my_uring_prep_readv(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset + metadata_read);
if (!zero_on_init)
my_uring_prep_readv(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset + metadata_read);
else
{
// Fill metadata with zeroes
memset(data->iov.iov_base, 0, data->iov.iov_len);
my_uring_prep_writev(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset + metadata_read);
}
bs->ringloop->submit();
submitted = (prev == 1 ? 2 : 1);
prev = submitted;
@@ -90,6 +193,21 @@ int blockstore_init_meta::loop()
free(metadata_buffer);
metadata_buffer = NULL;
}
if (zero_on_init && !bs->disable_meta_fsync)
{
GET_SQE();
my_uring_prep_fsync(sqe, bs->meta_fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
data->callback = [this](ring_data_t *data) { handle_event(data); };
submitted = 1;
bs->ringloop->submit();
resume_4:
if (submitted > 0)
{
wait_state = 4;
return 1;
}
}
return 0;
}
@@ -100,7 +218,7 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
clean_disk_entry *entry = (clean_disk_entry*)(entries + i*bs->clean_entry_size);
if (!bs->inmemory_meta && bs->clean_entry_bitmap_size)
{
memcpy(bs->clean_bitmap + (done_cnt+i)*bs->clean_entry_bitmap_size, &entry->bitmap, bs->clean_entry_bitmap_size);
memcpy(bs->clean_bitmap + (done_cnt+i)*2*bs->clean_entry_bitmap_size, &entry->bitmap, 2*bs->clean_entry_bitmap_size);
}
if (entry->oid.inode > 0)
{
@@ -118,6 +236,10 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
#endif
bs->data_alloc->set(clean_it->second.location >> block_order, false);
}
else
{
bs->inode_space_stats[entry->oid.inode] += bs->block_size;
}
entries_loaded++;
#ifdef BLOCKSTORE_DEBUG
printf("Allocate block (clean entry) %lu: %lx:%lx v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
@@ -152,14 +274,6 @@ blockstore_init_journal::blockstore_init_journal(blockstore_impl_t *bs)
};
}
bool iszero(uint64_t *buf, int len)
{
for (int i = 0; i < len; i++)
if (buf[i] != 0)
return false;
return true;
}
void blockstore_init_journal::handle_event(ring_data_t *data1)
{
if (data1->res <= 0)
@@ -184,12 +298,6 @@ void blockstore_init_journal::handle_event(ring_data_t *data1)
submitted_buf = NULL;
}
#define GET_SQE() \
sqe = bs->get_sqe();\
if (!sqe)\
throw std::runtime_error("io_uring is full while trying to read journal");\
data = ((ring_data_t*)sqe->user_data)
int blockstore_init_journal::loop()
{
if (wait_state == 1)
@@ -227,7 +335,7 @@ resume_1:
wait_state = 1;
return 1;
}
if (iszero((uint64_t*)submitted_buf, 3))
if (iszero((uint64_t*)submitted_buf, bs->journal.block_size / sizeof(uint64_t)))
{
// Journal is empty
// FIXME handle this wrapping to journal_block_size better (maybe)
@@ -242,6 +350,7 @@ resume_1:
.size = sizeof(journal_entry_start),
.reserved = 0,
.journal_start = bs->journal.block_size,
.version = JOURNAL_VERSION,
};
((journal_entry_start*)submitted_buf)->crc32 = je_crc32((journal_entry*)submitted_buf);
if (bs->readonly)
@@ -292,11 +401,21 @@ resume_1:
je_start = (journal_entry_start*)submitted_buf;
if (je_start->magic != JOURNAL_MAGIC ||
je_start->type != JE_START ||
je_start->size != sizeof(journal_entry_start) ||
je_crc32((journal_entry*)je_start) != je_start->crc32)
je_crc32((journal_entry*)je_start) != je_start->crc32 ||
je_start->size != sizeof(journal_entry_start) && je_start->size != JE_START_LEGACY_SIZE)
{
// Entry is corrupt
throw std::runtime_error("first entry of the journal is corrupt");
fprintf(stderr, "First entry of the journal is corrupt\n");
exit(1);
}
if (je_start->size == JE_START_LEGACY_SIZE || je_start->version != JOURNAL_VERSION)
{
fprintf(
stderr, "The code only supports journal version %d, but it is %lu on disk."
" Please use the previous version to flush the journal before upgrading OSD\n",
JOURNAL_VERSION, je_start->size == JE_START_LEGACY_SIZE ? 0 : je_start->version
);
exit(1);
}
next_free = journal_pos = bs->journal.used_start = je_start->journal_start;
if (!bs->journal.inmemory)
@@ -545,6 +664,21 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.oid = je->small_write.oid,
.version = je->small_write.version,
};
void *bmp = NULL;
void *bmp_from = (void*)je + sizeof(journal_entry_small_write);
if (bs->clean_entry_bitmap_size <= sizeof(void*))
{
memcpy(&bmp, bmp_from, bs->clean_entry_bitmap_size);
}
else
{
// FIXME Using large blockstore objects will result in a lot of small
// allocations for entry bitmaps. This can only be fixed by using
// a patched map with dynamic entry size, but not the btree_map,
// because it doesn't keep iterators valid all the time.
bmp = malloc_or_die(bs->clean_entry_bitmap_size);
memcpy(bmp, bmp_from, bs->clean_entry_bitmap_size);
}
bs->dirty_db.emplace(ov, (dirty_entry){
.state = (BS_ST_SMALL_WRITE | BS_ST_SYNCED),
.flags = 0,
@@ -552,6 +686,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.offset = je->small_write.offset,
.len = je->small_write.len,
.journal_sector = proc_pos,
.bitmap = bmp,
});
bs->journal.used_sectors[proc_pos]++;
#ifdef BLOCKSTORE_DEBUG
@@ -609,6 +744,21 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.oid = je->big_write.oid,
.version = je->big_write.version,
};
void *bmp = NULL;
void *bmp_from = (void*)je + sizeof(journal_entry_big_write);
if (bs->clean_entry_bitmap_size <= sizeof(void*))
{
memcpy(&bmp, bmp_from, bs->clean_entry_bitmap_size);
}
else
{
// FIXME Using large blockstore objects will result in a lot of small
// allocations for entry bitmaps. This can only be fixed by using
// a patched map with dynamic entry size, but not the btree_map,
// because it doesn't keep iterators valid all the time.
bmp = malloc_or_die(bs->clean_entry_bitmap_size);
memcpy(bmp, bmp_from, bs->clean_entry_bitmap_size);
}
auto dirty_it = bs->dirty_db.emplace(ov, (dirty_entry){
.state = (BS_ST_BIG_WRITE | BS_ST_SYNCED),
.flags = 0,
@@ -616,6 +766,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.offset = je->big_write.offset,
.len = je->big_write.len,
.journal_sector = proc_pos,
.bitmap = bmp,
}).first;
if (bs->data_alloc->get(je->big_write.location >> bs->block_order))
{
@@ -734,6 +885,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
void blockstore_init_journal::erase_dirty_object(blockstore_dirty_db_t::iterator dirty_it)
{
auto oid = dirty_it->first.oid;
bool exists = !IS_DELETE(dirty_it->second.state);
auto dirty_end = dirty_it;
dirty_end++;
while (1)
@@ -752,6 +904,10 @@ void blockstore_init_journal::erase_dirty_object(blockstore_dirty_db_t::iterator
auto clean_it = bs->clean_db.find(oid);
uint64_t clean_loc = clean_it != bs->clean_db.end()
? clean_it->second.location : UINT64_MAX;
if (exists && clean_loc == UINT64_MAX)
{
bs->inode_space_stats[oid.inode] -= bs->block_size;
}
bs->erase_dirty(dirty_it, dirty_end, clean_loc);
// Remove it from the flusher's queue, too
// Otherwise it may end up referring to a small unstable write after reading the rest of the journal

View File

@@ -7,6 +7,7 @@ class blockstore_init_meta
{
blockstore_impl_t *bs;
int wait_state = 0, wait_count = 0;
bool zero_on_init = false;
void *metadata_buffer = NULL;
uint64_t metadata_read = 0;
int prev = 0, prev_done = 0, done_len = 0, submitted = 0;

View File

@@ -7,6 +7,7 @@
#define MIN_JOURNAL_SIZE 4*1024*1024
#define JOURNAL_MAGIC 0x4A33
#define JOURNAL_VERSION 1
#define JOURNAL_BUFFER_SIZE 4*1024*1024
// We reserve some extra space for future stabilize requests during writes
@@ -37,7 +38,9 @@ struct __attribute__((__packed__)) journal_entry_start
uint32_t size;
uint32_t reserved;
uint64_t journal_start;
uint64_t version;
};
#define JE_START_LEGACY_SIZE 24
struct __attribute__((__packed__)) journal_entry_small_write
{
@@ -54,6 +57,9 @@ struct __attribute__((__packed__)) journal_entry_small_write
// data_offset is its offset within journal
uint64_t data_offset;
uint32_t crc32_data;
// small_write and big_write entries are followed by the "external" bitmap
// its size is dynamic and included in journal entry's <size> field
uint8_t bitmap[];
};
struct __attribute__((__packed__)) journal_entry_big_write
@@ -68,6 +74,9 @@ struct __attribute__((__packed__)) journal_entry_big_write
uint32_t offset;
uint32_t len;
uint64_t location;
// small_write and big_write entries are followed by the "external" bitmap
// its size is dynamic and included in journal entry's <size> field
uint8_t bitmap[];
};
struct __attribute__((__packed__)) journal_entry_stable
@@ -143,6 +152,7 @@ struct journal_t
int fd;
uint64_t device_size;
bool inmemory = false;
bool flush_journal = false;
void *buffer = NULL;
uint64_t block_size;

View File

@@ -42,6 +42,11 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{
disable_flock = true;
}
if (config["flush_journal"] == "true" || config["flush_journal"] == "1" || config["flush_journal"] == "yes")
{
// Only flush journal and exit
journal.flush_journal = true;
}
if (config["immediate_commit"] == "all")
{
immediate_commit = IMMEDIATE_ALL;
@@ -74,6 +79,11 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
max_flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10);
min_flusher_count = strtoull(config["min_flusher_count"].c_str(), NULL, 10);
max_write_iodepth = strtoull(config["max_write_iodepth"].c_str(), NULL, 10);
throttle_small_writes = config["throttle_small_writes"] == "true" || config["throttle_small_writes"] == "1" || config["throttle_small_writes"] == "yes";
throttle_target_iops = strtoull(config["throttle_target_iops"].c_str(), NULL, 10);
throttle_target_mbs = strtoull(config["throttle_target_mbs"].c_str(), NULL, 10);
throttle_target_parallelism = strtoull(config["throttle_target_parallelism"].c_str(), NULL, 10);
throttle_threshold_us = strtoull(config["throttle_threshold_us"].c_str(), NULL, 10);
// Validate
if (!block_size)
{
@@ -87,7 +97,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{
max_flusher_count = 256;
}
if (!min_flusher_count)
if (!min_flusher_count || journal.flush_journal)
{
min_flusher_count = 1;
}
@@ -101,7 +111,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
}
else if (disk_alignment % MEM_ALIGNMENT)
{
throw std::runtime_error("disk_alingment must be a multiple of "+std::to_string(MEM_ALIGNMENT));
throw std::runtime_error("disk_alignment must be a multiple of "+std::to_string(MEM_ALIGNMENT));
}
if (!journal_block_size)
{
@@ -125,7 +135,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
}
if (!bitmap_granularity)
{
bitmap_granularity = 4096;
bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
}
else if (bitmap_granularity % disk_alignment)
{
@@ -175,9 +185,25 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{
throw std::runtime_error("immediate_commit=all requires disable_journal_fsync and disable_data_fsync");
}
if (!throttle_target_iops)
{
throttle_target_iops = 100;
}
if (!throttle_target_mbs)
{
throttle_target_mbs = 100;
}
if (!throttle_target_parallelism)
{
throttle_target_parallelism = 1;
}
if (!throttle_threshold_us)
{
throttle_threshold_us = 50;
}
// init some fields
clean_entry_bitmap_size = block_size / bitmap_granularity / 8;
clean_entry_size = sizeof(clean_disk_entry) + clean_entry_bitmap_size;
clean_entry_size = sizeof(clean_disk_entry) + 2*clean_entry_bitmap_size;
journal.block_size = journal_block_size;
journal.next_free = journal_block_size;
journal.used_start = journal_block_size;
@@ -231,7 +257,7 @@ void blockstore_impl_t::calc_lengths()
}
// required metadata size
block_count = data_len / block_size;
meta_len = ((block_count - 1 + meta_block_size / clean_entry_size) / (meta_block_size / clean_entry_size)) * meta_block_size;
meta_len = (1 + (block_count - 1 + meta_block_size / clean_entry_size) / (meta_block_size / clean_entry_size)) * meta_block_size;
if (meta_area < meta_len)
{
throw std::runtime_error("Metadata area is too small, need at least "+std::to_string(meta_len)+" bytes");
@@ -244,7 +270,7 @@ void blockstore_impl_t::calc_lengths()
}
else if (clean_entry_bitmap_size)
{
clean_bitmap = (uint8_t*)malloc(block_count * clean_entry_bitmap_size);
clean_bitmap = (uint8_t*)malloc(block_count * 2*clean_entry_bitmap_size);
if (!clean_bitmap)
throw std::runtime_error("Failed to allocate memory for the metadata sparse write bitmap");
}

View File

@@ -94,6 +94,21 @@ endwhile:
return 1;
}
uint8_t* blockstore_impl_t::get_clean_entry_bitmap(uint64_t block_loc, int offset)
{
uint8_t *clean_entry_bitmap;
uint64_t meta_loc = block_loc >> block_order;
if (inmemory_meta)
{
uint64_t sector = (meta_loc / (meta_block_size / clean_entry_size)) * meta_block_size;
uint64_t pos = (meta_loc % (meta_block_size / clean_entry_size));
clean_entry_bitmap = (uint8_t*)(metadata_buffer + sector + pos*clean_entry_size + sizeof(clean_disk_entry) + offset);
}
else
clean_entry_bitmap = (uint8_t*)(clean_bitmap + meta_loc*2*clean_entry_bitmap_size + offset);
return clean_entry_bitmap;
}
int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
{
auto clean_it = clean_db.find(read_op->oid);
@@ -134,6 +149,11 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
if (!result_version)
{
result_version = dirty_it->first.version;
if (read_op->bitmap)
{
void *bmp_ptr = (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap);
memcpy(read_op->bitmap, bmp_ptr, clean_entry_bitmap_size);
}
}
if (!fulfill_read(read_op, fulfilled, dirty.offset, dirty.offset + dirty.len,
dirty.state, dirty_it->first.version, dirty.location + (IS_JOURNAL(dirty.state) ? 0 : dirty.offset)))
@@ -155,6 +175,11 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
if (!result_version)
{
result_version = clean_it->second.version;
if (read_op->bitmap)
{
void *bmp_ptr = get_clean_entry_bitmap(clean_it->second.location, clean_entry_bitmap_size);
memcpy(read_op->bitmap, bmp_ptr, clean_entry_bitmap_size);
}
}
if (fulfilled < read_op->len)
{
@@ -169,18 +194,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
}
else
{
uint64_t meta_loc = clean_it->second.location >> block_order;
uint8_t *clean_entry_bitmap;
if (inmemory_meta)
{
uint64_t sector = (meta_loc / (meta_block_size / clean_entry_size)) * meta_block_size;
uint64_t pos = (meta_loc % (meta_block_size / clean_entry_size));
clean_entry_bitmap = (uint8_t*)(metadata_buffer + sector + pos*clean_entry_size + sizeof(clean_disk_entry));
}
else
{
clean_entry_bitmap = (uint8_t*)(clean_bitmap + meta_loc*clean_entry_bitmap_size);
}
uint8_t *clean_entry_bitmap = get_clean_entry_bitmap(clean_it->second.location, 0);
uint64_t bmp_start = 0, bmp_end = 0, bmp_size = block_size/bitmap_granularity;
while (bmp_start < bmp_size)
{
@@ -254,3 +268,50 @@ void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op
FINISH_OP(op);
}
}
int blockstore_impl_t::read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version)
{
auto dirty_it = dirty_db.upper_bound((obj_ver_id){
.oid = oid,
.version = UINT64_MAX,
});
if (dirty_it != dirty_db.begin())
dirty_it--;
if (dirty_it != dirty_db.end())
{
while (dirty_it->first.oid == oid)
{
if (target_version >= dirty_it->first.version)
{
if (result_version)
*result_version = dirty_it->first.version;
if (bitmap)
{
void *bmp_ptr = (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap);
memcpy(bitmap, bmp_ptr, clean_entry_bitmap_size);
}
return 0;
}
if (dirty_it == dirty_db.begin())
break;
dirty_it--;
}
}
auto clean_it = clean_db.find(oid);
if (clean_it != clean_db.end())
{
if (result_version)
*result_version = clean_it->second.version;
if (bitmap)
{
void *bmp_ptr = get_clean_entry_bitmap(clean_it->second.location, clean_entry_bitmap_size);
memcpy(bitmap, bmp_ptr, clean_entry_bitmap_size);
}
return 0;
}
if (result_version)
*result_version = 0;
if (bitmap)
memset(bitmap, 0, clean_entry_bitmap_size);
return -ENOENT;
}

View File

@@ -268,6 +268,11 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
{
journal.used_sectors.erase(dirty_it->second.journal_sector);
}
if (clean_entry_bitmap_size > sizeof(void*))
{
free(dirty_it->second.bitmap);
dirty_it->second.bitmap = NULL;
}
if (dirty_it == dirty_start)
{
break;

View File

@@ -190,6 +190,33 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_SYNCED)
{
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_STABLE;
// Allocations and deletions are counted when they're stabilized
if (IS_BIG_WRITE(dirty_it->second.state))
{
int exists = -1;
if (dirty_it != dirty_db.begin())
{
auto prev_it = dirty_it;
prev_it--;
if (prev_it->first.oid == v.oid)
{
exists = IS_DELETE(prev_it->second.state) ? 0 : 1;
}
}
if (exists == -1)
{
auto clean_it = clean_db.find(v.oid);
exists = clean_it != clean_db.end() ? 1 : 0;
}
if (!exists)
{
inode_space_stats[dirty_it->first.oid.inode] += block_size;
}
}
else if (IS_DELETE(dirty_it->second.state))
{
inode_space_stats[dirty_it->first.oid.inode] -= block_size;
}
}
if (forget_dirty && (IS_BIG_WRITE(dirty_it->second.state) ||
IS_DELETE(dirty_it->second.state)))

View File

@@ -80,7 +80,8 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
// 2nd step: Data device is synced, prepare & write journal entries
// Check space in the journal and journal memory buffers
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(), sizeof(journal_entry_big_write), JOURNAL_STABILIZE_RESERVATION))
if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(),
sizeof(journal_entry_big_write) + clean_entry_bitmap_size, JOURNAL_STABILIZE_RESERVATION))
{
return 0;
}
@@ -95,7 +96,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
int s = 0, cur_sector = -1;
while (it != PRIV(op)->sync_big_writes.end())
{
if (!journal.entry_fits(sizeof(journal_entry_big_write)) &&
if (!journal.entry_fits(sizeof(journal_entry_big_write) + clean_entry_bitmap_size) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
@@ -103,24 +104,27 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], [this, op](ring_data_t *data) { handle_sync_event(data, op); });
cur_sector = journal.cur_sector;
}
auto & dirty_entry = dirty_db.at(*it);
journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, (dirty_db[*it].state & BS_ST_INSTANT) ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write)
journal, (dirty_entry.state & BS_ST_INSTANT) ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write) + clean_entry_bitmap_size
);
dirty_db[*it].journal_sector = journal.sector_info[journal.cur_sector].offset;
dirty_entry.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
dirty_db[*it].journal_sector, it->oid.inode, it->oid.stripe, it->version,
dirty_entry.journal_sector, it->oid.inode, it->oid.stripe, it->version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
#endif
je->oid = it->oid;
je->version = it->version;
je->offset = dirty_db[*it].offset;
je->len = dirty_db[*it].len;
je->location = dirty_db[*it].location;
je->offset = dirty_entry.offset;
je->len = dirty_entry.len;
je->location = dirty_entry.location;
memcpy((void*)(je+1), (clean_entry_bitmap_size > sizeof(void*)
? dirty_entry.bitmap : &dirty_entry.bitmap), clean_entry_bitmap_size);
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
it++;
@@ -142,6 +146,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
data->callback = [this, op](ring_data_t *data) { handle_sync_event(data, op); };
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = SYNC_JOURNAL_SYNC_SENT;
return 1;

View File

@@ -8,7 +8,12 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
// Check or assign version number
bool found = false, deleted = false, is_del = (op->opcode == BS_OP_DELETE);
bool wait_big = false, wait_del = false;
void *bmp = NULL;
uint64_t version = 1;
if (!is_del && clean_entry_bitmap_size > sizeof(void*))
{
bmp = calloc_or_die(1, clean_entry_bitmap_size);
}
if (dirty_db.size() > 0)
{
auto dirty_it = dirty_db.upper_bound((obj_ver_id){
@@ -25,6 +30,13 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
wait_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE
? !IS_SYNCED(dirty_it->second.state)
: ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG);
if (!is_del && !deleted)
{
if (clean_entry_bitmap_size > sizeof(void*))
memcpy(bmp, dirty_it->second.bitmap, clean_entry_bitmap_size);
else
bmp = dirty_it->second.bitmap;
}
}
}
if (!found)
@@ -33,6 +45,11 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
if (clean_it != clean_db.end())
{
version = clean_it->second.version + 1;
if (!is_del)
{
void *bmp_ptr = get_clean_entry_bitmap(clean_it->second.location, clean_entry_bitmap_size);
memcpy((clean_entry_bitmap_size > sizeof(void*) ? bmp : &bmp), bmp_ptr, clean_entry_bitmap_size);
}
}
else
{
@@ -72,6 +89,10 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
{
// Invalid version requested
op->retval = -EEXIST;
if (!is_del && clean_entry_bitmap_size > sizeof(void*))
{
free(bmp);
}
return false;
}
}
@@ -101,6 +122,8 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
else
{
state = (op->len == block_size || deleted ? BS_ST_BIG_WRITE : BS_ST_SMALL_WRITE);
if (state == BS_ST_SMALL_WRITE && throttle_small_writes)
clock_gettime(CLOCK_REALTIME, &PRIV(op)->tv_begin);
if (wait_del)
state |= BS_ST_WAIT_DEL;
else if (state == BS_ST_SMALL_WRITE && wait_big)
@@ -109,6 +132,28 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
state |= BS_ST_IN_FLIGHT;
if (op->opcode == BS_OP_WRITE_STABLE)
state |= BS_ST_INSTANT;
if (op->bitmap)
{
// Only allow to overwrite part of the object bitmap respective to the write's offset/len
uint8_t *bmp_ptr = (uint8_t*)(clean_entry_bitmap_size > sizeof(void*) ? bmp : &bmp);
uint32_t bit = op->offset/bitmap_granularity;
uint32_t bits_left = op->len/bitmap_granularity;
while (!(bit % 8) && bits_left > 8)
{
// Copy bytes
bmp_ptr[bit/8] = ((uint8_t*)op->bitmap)[bit/8];
bit += 8;
bits_left -= 8;
}
while (bits_left > 0)
{
// Copy bits
bmp_ptr[bit/8] = (bmp_ptr[bit/8] & ~(1 << (bit%8)))
| (((uint8_t*)op->bitmap)[bit/8] & (1 << bit%8));
bit++;
bits_left--;
}
}
}
dirty_db.emplace((obj_ver_id){
.oid = op->oid,
@@ -120,6 +165,7 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
.offset = is_del ? 0 : op->offset,
.len = is_del ? 0 : op->len,
.journal_sector = 0,
.bitmap = bmp,
});
return true;
}
@@ -128,6 +174,8 @@ void blockstore_impl_t::cancel_all_writes(blockstore_op_t *op, blockstore_dirty_
{
while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
{
if (clean_entry_bitmap_size > sizeof(void*))
free(dirty_it->second.bitmap);
dirty_db.erase(dirty_it++);
}
bool found = false;
@@ -201,7 +249,8 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
if ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
{
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, unsynced_big_write_count + 1, sizeof(journal_entry_big_write), JOURNAL_STABILIZE_RESERVATION))
if (!space_check.check_available(op, unsynced_big_write_count + 1,
sizeof(journal_entry_big_write) + clean_entry_bitmap_size, JOURNAL_STABILIZE_RESERVATION))
{
return 0;
}
@@ -267,8 +316,11 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
// Small (journaled) write
// First check if the journal has sufficient space
blockstore_journal_check_t space_check(this);
if (unsynced_big_write_count && !space_check.check_available(op, unsynced_big_write_count, sizeof(journal_entry_big_write), 0)
|| !space_check.check_available(op, 1, sizeof(journal_entry_small_write), op->len + JOURNAL_STABILIZE_RESERVATION))
if (unsynced_big_write_count &&
!space_check.check_available(op, unsynced_big_write_count,
sizeof(journal_entry_big_write) + clean_entry_bitmap_size, 0)
|| !space_check.check_available(op, 1,
sizeof(journal_entry_small_write) + clean_entry_bitmap_size, op->len + JOURNAL_STABILIZE_RESERVATION))
{
return 0;
}
@@ -276,8 +328,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
// There is sufficient space. Get SQE(s)
struct io_uring_sqe *sqe1 = NULL;
if (immediate_commit != IMMEDIATE_NONE ||
(journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_small_write) &&
journal.sector_info[journal.cur_sector].dirty)
!journal.entry_fits(sizeof(journal_entry_small_write) + clean_entry_bitmap_size))
{
// Write current journal sector only if it's dirty and full, or in the immediate_commit mode
BS_SUBMIT_GET_SQE_DECL(sqe1);
@@ -305,7 +356,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
// Then pre-fill journal entry
journal_entry_small_write *je = (journal_entry_small_write*)prefill_single_journal_entry(
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_SMALL_WRITE_INSTANT : JE_SMALL_WRITE,
sizeof(journal_entry_small_write)
sizeof(journal_entry_small_write) + clean_entry_bitmap_size
);
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
@@ -324,6 +375,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
je->len = op->len;
je->data_offset = journal.next_free;
je->crc32_data = crc32c(0, op->buf, op->len);
memcpy((void*)(je+1), (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap), clean_entry_bitmap_size);
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
if (immediate_commit != IMMEDIATE_NONE)
@@ -374,104 +426,148 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
int blockstore_impl_t::continue_write(blockstore_op_t *op)
{
io_uring_sqe *sqe = NULL;
journal_entry_big_write *je;
int op_state = PRIV(op)->op_state;
if (op_state != 2 && op_state != 4)
{
// In progress
return 1;
}
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
assert(dirty_it != dirty_db.end());
if (op_state == 2)
goto resume_2;
else if (op_state == 4)
goto resume_4;
else if (op_state == 6)
goto resume_6;
else
{
// In progress
return 1;
}
resume_2:
// Only for the immediate_commit mode: prepare and submit big_write journal entry
BS_SUBMIT_GET_SQE_DECL(sqe);
je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write)
);
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
{
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
assert(dirty_it != dirty_db.end());
io_uring_sqe *sqe = NULL;
BS_SUBMIT_GET_SQE_DECL(sqe);
journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write) + clean_entry_bitmap_size
);
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
journal.sector_info[journal.cur_sector].offset, op->oid.inode, op->oid.stripe, op->version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
journal.sector_info[journal.cur_sector].offset, op->oid.inode, op->oid.stripe, op->version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
#endif
je->oid = op->oid;
je->version = op->version;
je->offset = op->offset;
je->len = op->len;
je->location = dirty_it->second.location;
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
prepare_journal_sector_write(journal, journal.cur_sector, sqe,
[this, op](ring_data_t *data) { handle_write_event(data, op); });
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = 3;
return 1;
je->oid = op->oid;
je->version = op->version;
je->offset = op->offset;
je->len = op->len;
je->location = dirty_it->second.location;
memcpy((void*)(je+1), (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap), clean_entry_bitmap_size);
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
prepare_journal_sector_write(journal, journal.cur_sector, sqe,
[this, op](ring_data_t *data) { handle_write_event(data, op); });
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = 3;
return 1;
}
resume_4:
// Switch object state
#ifdef BLOCKSTORE_DEBUG
printf("Ack write %lx:%lx v%lu = state 0x%x\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
#endif
bool imm = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE
? (immediate_commit == IMMEDIATE_ALL)
: (immediate_commit != IMMEDIATE_NONE);
if (imm)
{
auto & unstab = unstable_writes[op->oid];
unstab = unstab < op->version ? op->version : unstab;
}
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK)
| (imm ? BS_ST_SYNCED : BS_ST_WRITTEN);
if (imm && ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_DELETE || (dirty_it->second.state & BS_ST_INSTANT)))
{
// Deletions and 'instant' operations are treated as immediately stable
mark_stable(dirty_it->first);
}
if (!imm)
{
if ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
assert(dirty_it != dirty_db.end());
bool is_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE;
bool imm = is_big ? (immediate_commit == IMMEDIATE_ALL) : (immediate_commit != IMMEDIATE_NONE);
if (imm)
{
// Remember big write as unsynced
unsynced_big_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
auto & unstab = unstable_writes[op->oid];
unstab = unstab < op->version ? op->version : unstab;
}
else
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK)
| (imm ? BS_ST_SYNCED : BS_ST_WRITTEN);
if (imm && ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_DELETE || (dirty_it->second.state & BS_ST_INSTANT)))
{
// Remember small write as unsynced
unsynced_small_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
// Deletions and 'instant' operations are treated as immediately stable
mark_stable(dirty_it->first);
}
}
if (imm && (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
{
// Unblock small writes
dirty_it++;
while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
if (!imm)
{
if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG)
if (is_big)
{
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_IN_FLIGHT;
// Remember big write as unsynced
unsynced_big_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
}
else
{
// Remember small write as unsynced
unsynced_small_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
}
}
if (imm && (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
{
// Unblock small writes
dirty_it++;
while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
{
if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG)
{
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_IN_FLIGHT;
}
dirty_it++;
}
}
// Apply throttling to not fill the journal too fast for the SSD+HDD case
if (!is_big && throttle_small_writes)
{
// Apply throttling
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
uint64_t exec_us =
(tv_end.tv_sec - PRIV(op)->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - PRIV(op)->tv_begin.tv_nsec)/1000;
// Compare with target execution time
// 100% free -> target time = 0
// 0% free -> target time = iodepth/parallelism * (iops + size/bw) / write per second
uint64_t used_start = journal.get_trim_pos();
uint64_t journal_free_space = journal.next_free < used_start
? (used_start - journal.next_free)
: (journal.len - journal.next_free + used_start - journal.block_size);
uint64_t ref_us =
(write_iodepth <= throttle_target_parallelism ? 100 : 100*write_iodepth/throttle_target_parallelism)
* (1000000/throttle_target_iops + op->len*1000000/throttle_target_mbs/1024/1024)
/ 100;
ref_us -= ref_us * journal_free_space / journal.len;
if (ref_us > exec_us + throttle_threshold_us)
{
// Pause reply
tfd->set_timer_us(ref_us-exec_us, false, [this, op](int timer_id)
{
PRIV(op)->op_state++;
ringloop->wakeup();
});
PRIV(op)->op_state = 5;
return 1;
}
}
}
resume_6:
// Acknowledge write
op->retval = op->len;
write_iodepth--;

View File

@@ -5,6 +5,7 @@
#include <assert.h>
#include "cluster_client.h"
#define SCRAP_BUFFER_SIZE 4*1024*1024
#define PART_SENT 1
#define PART_DONE 2
#define PART_ERROR 4
@@ -15,6 +16,8 @@
cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config)
{
config = osd_messenger_t::read_config(config);
this->ringloop = ringloop;
this->tfd = tfd;
this->config = config;
@@ -48,21 +51,25 @@ cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd
msgr.exec_op = [this](osd_op_t *op)
{
// Garbage in
printf("Incoming garbage from peer %d\n", op->peer_fd);
fprintf(stderr, "Incoming garbage from peer %d\n", op->peer_fd);
msgr.stop_client(op->peer_fd);
delete op;
};
msgr.parse_config(this->config);
msgr.init();
st_cli.tfd = tfd;
st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); };
st_cli.on_change_osd_state_hook = [this](uint64_t peer_osd) { on_change_osd_state_hook(peer_osd); };
st_cli.on_change_hook = [this](json11::Json::object & changes) { on_change_hook(changes); };
st_cli.on_change_hook = [this](std::map<std::string, etcd_kv_t> & changes) { on_change_hook(changes); };
st_cli.on_load_pgs_hook = [this](bool success) { on_load_pgs_hook(success); };
st_cli.parse_config(config);
st_cli.load_global_config();
scrap_buffer_size = SCRAP_BUFFER_SIZE;
scrap_buffer = malloc_or_die(scrap_buffer_size);
if (ringloop)
{
consumer.loop = [this]()
@@ -86,6 +93,131 @@ cluster_client_t::~cluster_client_t()
{
ringloop->unregister_consumer(&consumer);
}
free(scrap_buffer);
}
cluster_op_t::~cluster_op_t()
{
if (buf)
{
free(buf);
buf = NULL;
}
if (bitmap_buf)
{
free(bitmap_buf);
part_bitmaps = NULL;
bitmap_buf = NULL;
}
}
void cluster_client_t::calc_wait(cluster_op_t *op)
{
op->prev_wait = 0;
if (op->opcode == OSD_OP_WRITE)
{
for (auto prev = op->prev; prev; prev = prev->prev)
{
if (prev->opcode == OSD_OP_SYNC ||
prev->opcode == OSD_OP_WRITE && !(op->flags & OP_FLUSH_BUFFER) && (prev->flags & OP_FLUSH_BUFFER))
{
op->prev_wait++;
}
}
if (!op->prev_wait && pgs_loaded)
continue_rw(op);
}
else if (op->opcode == OSD_OP_SYNC)
{
for (auto prev = op->prev; prev; prev = prev->prev)
{
if (prev->opcode == OSD_OP_SYNC || prev->opcode == OSD_OP_WRITE)
{
op->prev_wait++;
}
}
if (!op->prev_wait && pgs_loaded)
continue_sync(op);
}
else
{
for (auto prev = op->prev; prev; prev = prev->prev)
{
if (prev->opcode == OSD_OP_WRITE && prev->flags & OP_FLUSH_BUFFER)
{
op->prev_wait++;
}
else if (prev->opcode == OSD_OP_WRITE || prev->opcode == OSD_OP_READ)
{
// Flushes are always in the beginning
break;
}
}
if (!op->prev_wait && pgs_loaded)
continue_rw(op);
}
}
void cluster_client_t::inc_wait(uint64_t opcode, uint64_t flags, cluster_op_t *next, int inc)
{
if (opcode == OSD_OP_WRITE)
{
while (next)
{
auto n2 = next->next;
if (next->opcode == OSD_OP_SYNC ||
next->opcode == OSD_OP_WRITE && (flags & OP_FLUSH_BUFFER) && !(next->flags & OP_FLUSH_BUFFER) ||
next->opcode == OSD_OP_READ && (flags & OP_FLUSH_BUFFER))
{
next->prev_wait += inc;
if (!next->prev_wait)
{
if (next->opcode == OSD_OP_SYNC)
continue_sync(next);
else
continue_rw(next);
}
}
next = n2;
}
}
else if (opcode == OSD_OP_SYNC)
{
while (next)
{
auto n2 = next->next;
if (next->opcode == OSD_OP_SYNC || next->opcode == OSD_OP_WRITE)
{
next->prev_wait += inc;
if (!next->prev_wait)
{
if (next->opcode == OSD_OP_SYNC)
continue_sync(next);
else
continue_rw(next);
}
}
next = n2;
}
}
}
void cluster_client_t::erase_op(cluster_op_t *op)
{
uint64_t opcode = op->opcode, flags = op->flags;
cluster_op_t *next = op->next;
if (op->prev)
op->prev->next = op->next;
if (op->next)
op->next->prev = op->prev;
if (op_queue_head == op)
op_queue_head = op->next;
if (op_queue_tail == op)
op_queue_tail = op->prev;
op->next = op->prev = NULL;
std::function<void(cluster_op_t*)>(op->callback)(op);
if (!immediate_commit)
inc_wait(opcode, flags, next, -1);
}
void cluster_client_t::continue_ops(bool up_retry)
@@ -98,60 +230,25 @@ void cluster_client_t::continue_ops(bool up_retry)
if (continuing_ops)
{
// Attempt to reenter the function
continuing_ops = 2;
return;
}
restart:
continuing_ops = 1;
op_queue_pos = 0;
bool has_flushes = false, has_writes = false;
while (op_queue_pos < op_queue.size())
for (auto op = op_queue_head; op; )
{
auto op = op_queue[op_queue_pos];
bool rm = false, is_flush = op->flags & OP_FLUSH_BUFFER;
auto opcode = op->opcode;
cluster_op_t *next_op = op->next;
if (!op->up_wait || up_retry)
{
op->up_wait = false;
if (opcode == OSD_OP_READ || opcode == OSD_OP_WRITE)
if (!op->prev_wait)
{
if (is_flush || !has_flushes)
{
// Regular writes can't proceed before buffer flushes
rm = continue_rw(op);
}
}
else if (opcode == OSD_OP_SYNC)
{
if (!has_writes)
{
// SYNC can't proceed before previous writes
rm = continue_sync(op);
}
if (op->opcode == OSD_OP_SYNC)
continue_sync(op);
else
continue_rw(op);
}
}
if (opcode == OSD_OP_WRITE)
{
has_writes = has_writes || !rm;
if (is_flush)
{
has_flushes = has_writes || !rm;
}
}
else if (opcode == OSD_OP_SYNC)
{
// Postpone writes until previous SYNC completes
// ...so dirty_writes can't contain anything newer than SYNC
has_flushes = has_writes || !rm;
}
if (rm)
{
op_queue.erase(op_queue.begin()+op_queue_pos, op_queue.begin()+op_queue_pos+1);
}
else
{
op_queue_pos++;
}
op = next_op;
if (continuing_ops == 2)
{
goto restart;
@@ -187,16 +284,14 @@ void cluster_client_t::on_load_config_hook(json11::Json::object & config)
{
bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
}
bs_bitmap_size = bs_block_size / bs_bitmap_granularity / 8;
uint32_t block_order;
if ((block_order = is_power_of_two(bs_block_size)) >= 64 || bs_block_size < MIN_BLOCK_SIZE || bs_block_size >= MAX_BLOCK_SIZE)
{
throw std::runtime_error("Bad block size");
}
if (config["immediate_commit"] == "all")
{
// Cluster-wide immediate_commit mode
immediate_commit = true;
}
// Cluster-wide immediate_commit mode
immediate_commit = (config["immediate_commit"] == "all");
if (config.find("client_max_dirty_bytes") != config.end())
{
client_max_dirty_bytes = config["client_max_dirty_bytes"].uint64_value();
@@ -252,7 +347,7 @@ void cluster_client_t::on_load_pgs_hook(bool success)
continue_ops();
}
void cluster_client_t::on_change_hook(json11::Json::object & changes)
void cluster_client_t::on_change_hook(std::map<std::string, etcd_kv_t> & changes)
{
for (auto pool_item: st_cli.pool_config)
{
@@ -260,10 +355,10 @@ void cluster_client_t::on_change_hook(json11::Json::object & changes)
{
// At this point, all pool operations should have been suspended
// And now they have to be resliced!
for (auto op: op_queue)
for (auto op = op_queue_head; op; op = op->next)
{
if ((op->opcode == OSD_OP_WRITE || op->opcode == OSD_OP_READ) &&
INODE_POOL(op->inode) == pool_item.first)
INODE_POOL(op->cur_inode) == pool_item.first)
{
op->needs_reslice = true;
}
@@ -328,6 +423,7 @@ void cluster_client_t::execute(cluster_op_t *op)
std::function<void(cluster_op_t*)>(op->callback)(op);
return;
}
op->cur_inode = op->inode;
op->retval = 0;
if (op->opcode == OSD_OP_WRITE && !immediate_commit)
{
@@ -340,9 +436,17 @@ void cluster_client_t::execute(cluster_op_t *op)
{
delete sync_op;
};
op_queue.push_back(sync_op);
sync_op->prev = op_queue_tail;
if (op_queue_tail)
{
op_queue_tail->next = sync_op;
op_queue_tail = sync_op;
}
else
op_queue_tail = op_queue_head = sync_op;
dirty_bytes = 0;
dirty_ops = 0;
calc_wait(sync_op);
}
dirty_bytes += op->len;
dirty_ops++;
@@ -352,8 +456,23 @@ void cluster_client_t::execute(cluster_op_t *op)
dirty_bytes = 0;
dirty_ops = 0;
}
op_queue.push_back(op);
continue_ops();
op->prev = op_queue_tail;
if (op_queue_tail)
{
op_queue_tail->next = op;
op_queue_tail = op;
}
else
op_queue_tail = op_queue_head = op;
if (!immediate_commit)
calc_wait(op);
else if (pgs_loaded)
{
if (op->opcode == OSD_OP_SYNC)
continue_sync(op);
else
continue_rw(op);
}
}
void cluster_client_t::copy_write(cluster_op_t *op, std::map<object_id, cluster_buffer_t> & dirty_buffers)
@@ -440,7 +559,7 @@ void cluster_client_t::flush_buffer(const object_id & oid, cluster_buffer_t *wr)
cluster_op_t *op = new cluster_op_t;
op->flags = OP_FLUSH_BUFFER;
op->opcode = OSD_OP_WRITE;
op->inode = oid.inode;
op->cur_inode = op->inode = oid.inode;
op->offset = oid.stripe;
op->len = wr->len;
op->iov.push_back(wr->buf, wr->len);
@@ -452,12 +571,16 @@ void cluster_client_t::flush_buffer(const object_id & oid, cluster_buffer_t *wr)
}
delete op;
};
op_queue.insert(op_queue.begin(), op);
if (continuing_ops)
op->next = op_queue_head;
if (op_queue_head)
{
continuing_ops = 2;
op_queue_pos++;
op_queue_head->prev = op;
op_queue_head = op;
}
else
op_queue_tail = op_queue_head = op;
inc_wait(op->opcode, op->flags, op->next, 1);
continue_rw(op);
}
int cluster_client_t::continue_rw(cluster_op_t *op)
@@ -474,15 +597,15 @@ resume_0:
if (!op->len || op->offset % bs_bitmap_granularity || op->len % bs_bitmap_granularity)
{
op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op);
erase_op(op);
return 1;
}
{
pool_id_t pool_id = INODE_POOL(op->inode);
pool_id_t pool_id = INODE_POOL(op->cur_inode);
if (!pool_id)
{
op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op);
erase_op(op);
return 1;
}
if (st_cli.pool_config.find(pool_id) == st_cli.pool_config.end() ||
@@ -494,6 +617,13 @@ resume_0:
}
if (op->opcode == OSD_OP_WRITE)
{
auto ino_it = st_cli.inode_config.find(op->inode);
if (ino_it != st_cli.inode_config.end() && ino_it->second.readonly)
{
op->retval = -EINVAL;
erase_op(op);
return 1;
}
if (!immediate_commit && !(op->flags & OP_FLUSH_BUFFER))
{
copy_write(op, dirty_buffers);
@@ -503,6 +633,13 @@ resume_1:
// Slice the operation into parts
slice_rw(op);
op->needs_reslice = false;
if (op->opcode == OSD_OP_WRITE && op->version && op->parts.size() > 1)
{
// Atomic writes to multiple stripes are unsupported
op->retval = -EINVAL;
erase_op(op);
return 1;
}
resume_2:
// Send unsent parts, if they're not subject to change
op->state = 3;
@@ -552,14 +689,38 @@ resume_3:
// Finished successfully
// Even if the PG count has changed in meanwhile we treat it as success
// because if some operations were invalid for the new PG count we'd get errors
bool is_read = op->opcode == OSD_OP_READ;
if (is_read)
{
// Check parent inode
auto ino_it = st_cli.inode_config.find(op->cur_inode);
while (ino_it != st_cli.inode_config.end() && ino_it->second.parent_id &&
INODE_POOL(ino_it->second.parent_id) == INODE_POOL(op->cur_inode) &&
// Check for loops
ino_it->second.parent_id != op->inode)
{
// Skip parents from the same pool
ino_it = st_cli.inode_config.find(ino_it->second.parent_id);
}
if (ino_it != st_cli.inode_config.end() &&
ino_it->second.parent_id &&
ino_it->second.parent_id != op->inode)
{
// Continue reading from the parent inode
op->cur_inode = ino_it->second.parent_id;
op->parts.clear();
op->done_count = 0;
goto resume_1;
}
}
op->retval = op->len;
std::function<void(cluster_op_t*)>(op->callback)(op);
erase_op(op);
return 1;
}
else if (op->retval != 0 && op->retval != -EPIPE)
{
// Fatal error (not -EPIPE)
std::function<void(cluster_op_t*)>(op->callback)(op);
erase_op(op);
return 1;
}
else
@@ -584,51 +745,134 @@ resume_3:
return 0;
}
static void add_iov(int size, bool skip, cluster_op_t *op, int &iov_idx, size_t &iov_pos, osd_op_buf_list_t &iov, void *scrap, int scrap_len)
{
int left = size;
while (left > 0 && iov_idx < op->iov.count)
{
int cur_left = op->iov.buf[iov_idx].iov_len - iov_pos;
if (cur_left < left)
{
if (!skip)
{
iov.push_back(op->iov.buf[iov_idx].iov_base + iov_pos, cur_left);
}
left -= cur_left;
iov_pos = 0;
iov_idx++;
}
else
{
if (!skip)
{
iov.push_back(op->iov.buf[iov_idx].iov_base + iov_pos, left);
}
iov_pos += left;
left = 0;
}
}
assert(left == 0);
if (skip && scrap_len > 0)
{
// All skipped ranges are read into the same useless buffer
left = size;
while (left > 0)
{
int cur_left = scrap_len < left ? scrap_len : left;
iov.push_back(scrap, cur_left);
left -= cur_left;
}
}
}
void cluster_client_t::slice_rw(cluster_op_t *op)
{
// Slice the request into individual object stripe requests
// Primary OSDs still operate individual stripes, but their size is multiplied by PG minsize in case of EC
auto & pool_cfg = st_cli.pool_config.at(INODE_POOL(op->inode));
auto & pool_cfg = st_cli.pool_config.at(INODE_POOL(op->cur_inode));
uint32_t pg_data_size = (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks);
uint64_t pg_block_size = bs_block_size * pg_data_size;
uint64_t first_stripe = (op->offset / pg_block_size) * pg_block_size;
uint64_t last_stripe = ((op->offset + op->len + pg_block_size - 1) / pg_block_size - 1) * pg_block_size;
op->retval = 0;
op->parts.resize((last_stripe - first_stripe) / pg_block_size + 1);
if (op->opcode == OSD_OP_READ)
{
// Allocate memory for the bitmap
unsigned object_bitmap_size = ((op->len / bs_bitmap_granularity + 7) / 8);
object_bitmap_size = (object_bitmap_size < 8 ? 8 : object_bitmap_size);
unsigned bitmap_mem = object_bitmap_size + (bs_bitmap_size * pg_data_size) * op->parts.size();
if (op->bitmap_buf_size < bitmap_mem)
{
op->bitmap_buf = realloc_or_die(op->bitmap_buf, bitmap_mem);
if (!op->bitmap_buf_size)
{
// First allocation
memset(op->bitmap_buf, 0, object_bitmap_size);
}
op->part_bitmaps = op->bitmap_buf + object_bitmap_size;
op->bitmap_buf_size = bitmap_mem;
}
}
int iov_idx = 0;
size_t iov_pos = 0;
int i = 0;
for (uint64_t stripe = first_stripe; stripe <= last_stripe; stripe += pg_block_size)
{
pg_num_t pg_num = (op->inode + stripe/pool_cfg.pg_stripe_size) % pool_cfg.real_pg_count + 1;
pg_num_t pg_num = (stripe/pool_cfg.pg_stripe_size) % pool_cfg.real_pg_count + 1; // like map_to_pg()
uint64_t begin = (op->offset < stripe ? stripe : op->offset);
uint64_t end = (op->offset + op->len) > (stripe + pg_block_size)
? (stripe + pg_block_size) : (op->offset + op->len);
op->parts[i] = (cluster_op_part_t){
.parent = op,
.offset = begin,
.len = (uint32_t)(end - begin),
.pg_num = pg_num,
.flags = 0,
};
int left = end-begin;
while (left > 0 && iov_idx < op->iov.count)
op->parts[i].iov.reset();
if (op->cur_inode != op->inode)
{
if (op->iov.buf[iov_idx].iov_len - iov_pos < left)
// Read remaining parts from upper layers
uint64_t prev = begin, cur = begin;
bool skip_prev = true;
while (cur < end)
{
op->parts[i].iov.push_back(op->iov.buf[iov_idx].iov_base + iov_pos, op->iov.buf[iov_idx].iov_len - iov_pos);
left -= (op->iov.buf[iov_idx].iov_len - iov_pos);
iov_pos = 0;
iov_idx++;
unsigned bmp_loc = (cur - op->offset)/bs_bitmap_granularity;
bool skip = (((*(uint8_t*)(op->bitmap_buf + bmp_loc/8)) >> (bmp_loc%8)) & 0x1);
if (skip_prev != skip)
{
if (cur > prev)
{
if (prev == begin && skip_prev)
{
begin = cur;
// Just advance iov_idx & iov_pos
add_iov(cur-prev, true, op, iov_idx, iov_pos, op->parts[i].iov, NULL, 0);
}
else
add_iov(cur-prev, skip_prev, op, iov_idx, iov_pos, op->parts[i].iov, scrap_buffer, scrap_buffer_size);
}
skip_prev = skip;
prev = cur;
}
cur += bs_bitmap_granularity;
}
assert(cur > prev);
if (skip_prev)
{
// Just advance iov_idx & iov_pos
add_iov(end-prev, true, op, iov_idx, iov_pos, op->parts[i].iov, NULL, 0);
end = prev;
}
else
{
op->parts[i].iov.push_back(op->iov.buf[iov_idx].iov_base + iov_pos, left);
iov_pos += left;
left = 0;
}
add_iov(cur-prev, skip_prev, op, iov_idx, iov_pos, op->parts[i].iov, scrap_buffer, scrap_buffer_size);
if (end == begin)
op->done_count++;
}
assert(left == 0);
else
{
add_iov(end-begin, false, op, iov_idx, iov_pos, op->parts[i].iov, NULL, 0);
}
op->parts[i].parent = op;
op->parts[i].offset = begin;
op->parts[i].len = (uint32_t)(end - begin);
op->parts[i].pg_num = pg_num;
op->parts[i].osd_num = 0;
op->parts[i].flags = 0;
i++;
}
}
@@ -655,7 +899,7 @@ bool cluster_client_t::affects_osd(uint64_t inode, uint64_t offset, uint64_t len
bool cluster_client_t::try_send(cluster_op_t *op, int i)
{
auto part = &op->parts[i];
auto & pool_cfg = st_cli.pool_config[INODE_POOL(op->inode)];
auto & pool_cfg = st_cli.pool_config.at(INODE_POOL(op->cur_inode));
auto pg_it = pool_cfg.pg_config.find(part->pg_num);
if (pg_it != pool_cfg.pg_config.end() &&
!pg_it->second.pause && pg_it->second.cur_primary)
@@ -668,6 +912,13 @@ bool cluster_client_t::try_send(cluster_op_t *op, int i)
part->osd_num = primary_osd;
part->flags |= PART_SENT;
op->inflight_count++;
uint64_t pg_bitmap_size = bs_bitmap_size * (
pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks
);
uint64_t meta_rev = 0;
auto ino_it = st_cli.inode_config.find(op->inode);
if (ino_it != st_cli.inode_config.end())
meta_rev = ino_it->second.mod_revision;
part->op = (osd_op_t){
.op_type = OSD_OP_OUT,
.peer_fd = peer_fd,
@@ -677,10 +928,14 @@ bool cluster_client_t::try_send(cluster_op_t *op, int i)
.id = op_id++,
.opcode = op->opcode,
},
.inode = op->inode,
.inode = op->cur_inode,
.offset = part->offset,
.len = part->len,
.meta_revision = meta_rev,
.version = op->opcode == OSD_OP_WRITE ? op->version : 0,
} },
.bitmap = op->opcode == OSD_OP_WRITE ? NULL : op->part_bitmaps + pg_bitmap_size*i,
.bitmap_len = (unsigned)(op->opcode == OSD_OP_WRITE ? 0 : pg_bitmap_size),
.callback = [this, part](osd_op_t *op_part)
{
handle_op_part(part);
@@ -706,17 +961,18 @@ int cluster_client_t::continue_sync(cluster_op_t *op)
{
// Sync is not required in the immediate_commit mode or if there are no dirty_osds
op->retval = 0;
std::function<void(cluster_op_t*)>(op->callback)(op);
erase_op(op);
return 1;
}
// Check that all OSD connections are still alive
for (auto sync_osd: dirty_osds)
for (auto do_it = dirty_osds.begin(); do_it != dirty_osds.end(); )
{
osd_num_t sync_osd = *do_it;
auto peer_it = msgr.osd_peer_fds.find(sync_osd);
if (peer_it == msgr.osd_peer_fds.end())
{
return 0;
}
dirty_osds.erase(do_it++);
else
do_it++;
}
// Post sync to affected OSDs
for (auto & prev_op: dirty_buffers)
@@ -781,7 +1037,7 @@ resume_1:
uw_it++;
}
}
std::function<void(cluster_op_t*)>(op->callback)(op);
erase_op(op);
return 1;
}
@@ -809,6 +1065,16 @@ void cluster_client_t::send_sync(cluster_op_t *op, cluster_op_part_t *part)
msgr.outbox_push(&part->op);
}
static inline void mem_or(void *res, const void *r2, unsigned int len)
{
unsigned int i;
for (i = 0; i < len; ++i)
{
// Hope the compiler vectorizes this
((uint8_t*)res)[i] = ((uint8_t*)res)[i] | ((uint8_t*)r2)[i];
}
}
void cluster_client_t::handle_op_part(cluster_op_part_t *part)
{
cluster_op_t *op = part->parent;
@@ -817,10 +1083,6 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
if (part->op.reply.hdr.retval != expected)
{
// Operation failed, retry
printf(
"%s operation failed on OSD %lu: retval=%ld (expected %d), dropping connection\n",
osd_op_names[part->op.req.hdr.opcode], part->osd_num, part->op.reply.hdr.retval, expected
);
if (part->op.reply.hdr.retval == -EPIPE)
{
// Mark op->up_wait = true before stopping the client
@@ -839,7 +1101,14 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
// Don't overwrite other errors with -EPIPE
op->retval = part->op.reply.hdr.retval;
}
msgr.stop_client(part->op.peer_fd);
if (op->retval != -EINTR && op->retval != -EIO)
{
fprintf(
stderr, "%s operation failed on OSD %lu: retval=%ld (expected %d), dropping connection\n",
osd_op_names[part->op.req.hdr.opcode], part->osd_num, part->op.reply.hdr.retval, expected
);
msgr.stop_client(part->op.peer_fd);
}
part->flags |= PART_ERROR;
}
else
@@ -848,9 +1117,47 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
dirty_osds.insert(part->osd_num);
part->flags |= PART_DONE;
op->done_count++;
if (op->opcode == OSD_OP_READ)
{
copy_part_bitmap(op, part);
op->version = op->parts.size() == 1 ? part->op.reply.rw.version : 0;
}
}
if (op->inflight_count == 0)
{
continue_ops();
if (op->opcode == OSD_OP_SYNC)
continue_sync(op);
else
continue_rw(op);
}
}
void cluster_client_t::copy_part_bitmap(cluster_op_t *op, cluster_op_part_t *part)
{
// Copy (OR) bitmap
auto & pool_cfg = st_cli.pool_config.at(INODE_POOL(op->cur_inode));
uint32_t pg_block_size = bs_block_size * (
pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks
);
uint32_t object_offset = (part->op.req.rw.offset - op->offset) / bs_bitmap_granularity;
uint32_t part_offset = (part->op.req.rw.offset % pg_block_size) / bs_bitmap_granularity;
uint32_t part_len = part->op.req.rw.len / bs_bitmap_granularity;
if (!(object_offset & 0x7) && !(part_offset & 0x7) && (part_len >= 8))
{
// Copy bytes
mem_or(op->bitmap_buf + object_offset/8, part->op.bitmap + part_offset/8, part_len/8);
object_offset += (part_len & ~0x7);
part_offset += (part_len & ~0x7);
part_len = (part_len & 0x7);
}
while (part_len > 0)
{
// Copy bits
(*(uint8_t*)(op->bitmap_buf + (object_offset >> 3))) |= (
(((*(uint8_t*)(part->op.bitmap + (part_offset >> 3))) >> (part_offset & 0x7)) & 0x1) << (object_offset & 0x7)
);
part_offset++;
object_offset++;
part_len--;
}
}

View File

@@ -8,8 +8,6 @@
#define MIN_BLOCK_SIZE 4*1024
#define MAX_BLOCK_SIZE 128*1024*1024
#define DEFAULT_DISK_ALIGNMENT 4096
#define DEFAULT_BITMAP_GRANULARITY 4096
#define DEFAULT_CLIENT_MAX_DIRTY_BYTES 32*1024*1024
#define DEFAULT_CLIENT_MAX_DIRTY_OPS 1024
@@ -33,18 +31,27 @@ struct cluster_op_t
uint64_t inode;
uint64_t offset;
uint64_t len;
// for reads and writes within a single object (stripe),
// reads can return current version and writes can use "CAS" semantics
uint64_t version = 0;
int retval;
osd_op_buf_list_t iov;
std::function<void(cluster_op_t*)> callback;
~cluster_op_t();
protected:
int flags = 0;
uint64_t flags = 0;
int state = 0;
uint64_t cur_inode; // for snapshot reads
void *buf = NULL;
cluster_op_t *orig_op = NULL;
bool needs_reslice = false;
bool up_wait = false;
int inflight_count = 0, done_count = 0;
std::vector<cluster_op_part_t> parts;
void *bitmap_buf = NULL, *part_bitmaps = NULL;
unsigned bitmap_buf_size = 0;
cluster_op_t *prev = NULL, *next = NULL;
int prev_wait = 0;
friend class cluster_client_t;
};
@@ -62,9 +69,10 @@ class cluster_client_t
ring_loop_t *ringloop;
uint64_t bs_block_size = 0;
uint64_t bs_bitmap_granularity = 0;
uint32_t bs_bitmap_granularity = 0, bs_bitmap_size = 0;
std::map<pool_id_t, uint64_t> pg_counts;
bool immediate_commit = false;
// WARNING: initially true so execute() doesn't create fake sync
bool immediate_commit = true;
// FIXME: Implement inmemory_commit mode. Note that it requires to return overlapping reads from memory.
uint64_t client_max_dirty_bytes = 0;
uint64_t client_max_dirty_ops = 0;
@@ -74,16 +82,18 @@ class cluster_client_t
int retry_timeout_id = 0;
uint64_t op_id = 1;
std::vector<cluster_op_t*> offline_ops;
std::vector<cluster_op_t*> op_queue;
cluster_op_t *op_queue_head = NULL, *op_queue_tail = NULL;
std::map<object_id, cluster_buffer_t> dirty_buffers;
std::set<osd_num_t> dirty_osds;
uint64_t dirty_bytes = 0, dirty_ops = 0;
void *scrap_buffer = NULL;
unsigned scrap_buffer_size = 0;
bool pgs_loaded = false;
ring_consumer_t consumer;
std::vector<std::function<void(void)>> on_ready_hooks;
int continuing_ops = 0;
int op_queue_pos = 0;
public:
etcd_state_client_t st_cli;
@@ -103,7 +113,7 @@ protected:
void flush_buffer(const object_id & oid, cluster_buffer_t *wr);
void on_load_config_hook(json11::Json::object & config);
void on_load_pgs_hook(bool success);
void on_change_hook(json11::Json::object & changes);
void on_change_hook(std::map<std::string, etcd_kv_t> & changes);
void on_change_osd_state_hook(uint64_t peer_osd);
int continue_rw(cluster_op_t *op);
void slice_rw(cluster_op_t *op);
@@ -111,4 +121,8 @@ protected:
int continue_sync(cluster_op_t *op);
void send_sync(cluster_op_t *op, cluster_op_part_t *part);
void handle_op_part(cluster_op_part_t *part);
void copy_part_bitmap(cluster_op_t *op, cluster_op_part_t *part);
void erase_op(cluster_op_t *op);
void calc_wait(cluster_op_t *op);
void inc_wait(uint64_t opcode, uint64_t flags, cluster_op_t *next, int inc);
};

View File

@@ -11,6 +11,11 @@
etcd_state_client_t::~etcd_state_client_t()
{
for (auto watch: watches)
{
delete watch;
}
watches.clear();
etcd_watches_initialised = -1;
#ifndef __MOCK__
if (etcd_watch_ws)
@@ -22,17 +27,19 @@ etcd_state_client_t::~etcd_state_client_t()
}
#ifndef __MOCK__
json_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json)
etcd_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json)
{
json_kv_t kv;
etcd_kv_t kv;
kv.key = base64_decode(kv_json["key"].string_value());
std::string json_err, json_text = base64_decode(kv_json["value"].string_value());
kv.value = json_text == "" ? json11::Json() : json11::Json::parse(json_text, json_err);
if (json_err != "")
{
printf("Bad JSON in etcd key %s: %s (value: %s)\n", kv.key.c_str(), json_err.c_str(), json_text.c_str());
fprintf(stderr, "Bad JSON in etcd key %s: %s (value: %s)\n", kv.key.c_str(), json_err.c_str(), json_text.c_str());
kv.key = "";
}
else
kv.mod_revision = kv_json["mod_revision"].uint64_value();
return kv;
}
@@ -43,6 +50,11 @@ void etcd_state_client_t::etcd_txn(json11::Json txn, int timeout, std::function<
void etcd_state_client_t::etcd_call(std::string api, json11::Json payload, int timeout, std::function<void(std::string, json11::Json)> callback)
{
if (!etcd_addresses.size())
{
fprintf(stderr, "etcd_address is missing in Vitastor configuration\n");
exit(1);
}
std::string etcd_address = etcd_addresses[rand() % etcd_addresses.size()];
std::string etcd_api_path;
int pos = etcd_address.find('/');
@@ -69,16 +81,16 @@ void etcd_state_client_t::add_etcd_url(std::string addr)
addr = addr.substr(7);
else if (strtolower(addr.substr(0, 8)) == "https://")
{
printf("HTTPS is unsupported for etcd. Either use plain HTTP or setup a local proxy for etcd interaction\n");
fprintf(stderr, "HTTPS is unsupported for etcd. Either use plain HTTP or setup a local proxy for etcd interaction\n");
exit(1);
}
if (addr.find('/') < 0)
if (addr.find('/') == std::string::npos)
addr += "/v3";
this->etcd_addresses.push_back(addr);
}
}
void etcd_state_client_t::parse_config(json11::Json & config)
void etcd_state_client_t::parse_config(const json11::Json & config)
{
this->etcd_addresses.clear();
if (config["etcd_address"].is_string())
@@ -115,6 +127,11 @@ void etcd_state_client_t::parse_config(json11::Json & config)
void etcd_state_client_t::start_etcd_watcher()
{
if (!etcd_addresses.size())
{
fprintf(stderr, "etcd_address is missing in Vitastor configuration\n");
exit(1);
}
std::string etcd_address = etcd_addresses[rand() % etcd_addresses.size()];
std::string etcd_api_path;
int pos = etcd_address.find('/');
@@ -132,7 +149,7 @@ void etcd_state_client_t::start_etcd_watcher()
json11::Json data = json11::Json::parse(msg->body, json_err);
if (json_err != "")
{
printf("Bad JSON in etcd event: %s, ignoring event\n", json_err.c_str());
fprintf(stderr, "Bad JSON in etcd event: %s, ignoring event\n", json_err.c_str());
}
else
{
@@ -145,22 +162,22 @@ void etcd_state_client_t::start_etcd_watcher()
etcd_watch_revision = data["result"]["header"]["revision"].uint64_value();
}
// First gather all changes into a hash to remove multiple overwrites
json11::Json::object changes;
std::map<std::string, etcd_kv_t> changes;
for (auto & ev: data["result"]["events"].array_items())
{
auto kv = parse_etcd_kv(ev["kv"]);
if (kv.key != "")
{
changes[kv.key] = kv.value;
changes[kv.key] = kv;
}
}
for (auto & kv: changes)
{
if (this->log_level > 3)
{
printf("Incoming event: %s -> %s\n", kv.first.c_str(), kv.second.dump().c_str());
fprintf(stderr, "Incoming event: %s -> %s\n", kv.first.c_str(), kv.second.value.dump().c_str());
}
parse_state(kv.first, kv.second);
parse_state(kv.second);
}
// React to changes
if (on_change_hook != NULL)
@@ -233,7 +250,7 @@ void etcd_state_client_t::load_global_config()
{
if (err != "")
{
printf("Error reading OSD configuration from etcd: %s\n", err.c_str());
fprintf(stderr, "Error reading OSD configuration from etcd: %s\n", err.c_str());
tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int timer_id)
{
load_global_config();
@@ -271,6 +288,12 @@ void etcd_state_client_t::load_pgs()
{ "key", base64_encode(etcd_prefix+"/config/pgs") },
} }
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(etcd_prefix+"/config/inode/") },
{ "range_end", base64_encode(etcd_prefix+"/config/inode0") },
} }
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(etcd_prefix+"/pg/history/") },
@@ -300,7 +323,7 @@ void etcd_state_client_t::load_pgs()
{
if (err != "")
{
printf("Error loading PGs from etcd: %s\n", err.c_str());
fprintf(stderr, "Error loading PGs from etcd: %s\n", err.c_str());
tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int timer_id)
{
load_pgs();
@@ -321,7 +344,7 @@ void etcd_state_client_t::load_pgs()
for (auto & kv_json: res["response_range"]["kvs"].array_items())
{
auto kv = parse_etcd_kv(kv_json);
parse_state(kv.key, kv.value);
parse_state(kv);
}
}
on_load_pgs_hook(true);
@@ -329,7 +352,7 @@ void etcd_state_client_t::load_pgs()
});
}
#else
void etcd_state_client_t::parse_config(json11::Json & config)
void etcd_state_client_t::parse_config(const json11::Json & config)
{
}
@@ -344,13 +367,10 @@ void etcd_state_client_t::load_pgs()
}
#endif
void etcd_state_client_t::parse_state(const json_kv_t & kv)
{
parse_state(kv.key, kv.value);
}
void etcd_state_client_t::parse_state(const std::string & key, const json11::Json & value)
void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
{
const std::string & key = kv.key;
const json11::Json & value = kv.value;
if (key == etcd_prefix+"/config/pools")
{
for (auto & pool_item: this->pool_config)
@@ -366,7 +386,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
{
printf("Pool ID %s is invalid (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
fprintf(stderr, "Pool ID %s is invalid (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
continue;
}
pc.id = pool_id;
@@ -374,7 +394,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
pc.name = pool_item.second["name"].string_value();
if (pc.name == "")
{
printf("Pool %u has empty name, skipping pool\n", pool_id);
fprintf(stderr, "Pool %u has empty name, skipping pool\n", pool_id);
continue;
}
// Failure Domain
@@ -388,7 +408,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
pc.scheme = POOL_SCHEME_JERASURE;
else
{
printf("Pool %u has invalid coding scheme (one of \"xor\", \"replicated\" or \"jerasure\" required), skipping pool\n", pool_id);
fprintf(stderr, "Pool %u has invalid coding scheme (one of \"xor\", \"replicated\" or \"jerasure\" required), skipping pool\n", pool_id);
continue;
}
// PG Size
@@ -398,7 +418,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
(pc.scheme == POOL_SCHEME_XOR || pc.scheme == POOL_SCHEME_JERASURE) ||
pool_item.second["pg_size"].uint64_value() > 256)
{
printf("Pool %u has invalid pg_size, skipping pool\n", pool_id);
fprintf(stderr, "Pool %u has invalid pg_size, skipping pool\n", pool_id);
continue;
}
// Parity Chunks
@@ -407,7 +427,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
{
if (pc.parity_chunks > 1)
{
printf("Pool %u has invalid parity_chunks (must be 1), skipping pool\n", pool_id);
fprintf(stderr, "Pool %u has invalid parity_chunks (must be 1), skipping pool\n", pool_id);
continue;
}
pc.parity_chunks = 1;
@@ -415,7 +435,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
if (pc.scheme == POOL_SCHEME_JERASURE &&
(pc.parity_chunks < 1 || pc.parity_chunks > pc.pg_size-2))
{
printf("Pool %u has invalid parity_chunks (must be between 1 and pg_size-2), skipping pool\n", pool_id);
fprintf(stderr, "Pool %u has invalid parity_chunks (must be between 1 and pg_size-2), skipping pool\n", pool_id);
continue;
}
// PG MinSize
@@ -424,14 +444,14 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
(pc.scheme == POOL_SCHEME_XOR || pc.scheme == POOL_SCHEME_JERASURE) &&
pc.pg_minsize < (pc.pg_size-pc.parity_chunks))
{
printf("Pool %u has invalid pg_minsize, skipping pool\n", pool_id);
fprintf(stderr, "Pool %u has invalid pg_minsize, skipping pool\n", pool_id);
continue;
}
// PG Count
pc.pg_count = pool_item.second["pg_count"].uint64_value();
if (pc.pg_count < 1)
{
printf("Pool %u has invalid pg_count, skipping pool\n", pool_id);
fprintf(stderr, "Pool %u has invalid pg_count, skipping pool\n", pool_id);
continue;
}
// Max OSD Combinations
@@ -440,7 +460,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
pc.max_osd_combinations = 10000;
if (pc.max_osd_combinations > 0 && pc.max_osd_combinations < 100)
{
printf("Pool %u has invalid max_osd_combinations (must be at least 100), skipping pool\n", pool_id);
fprintf(stderr, "Pool %u has invalid max_osd_combinations (must be at least 100), skipping pool\n", pool_id);
continue;
}
// PG Stripe Size
@@ -458,7 +478,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
{
if (pg_item.second.target_set.size() != parsed_cfg.pg_size)
{
printf("Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n",
fprintf(stderr, "Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n",
pool_id, pg_item.first, pg_item.second.target_set.size(), parsed_cfg.pg_size);
pg_item.second.pause = true;
}
@@ -481,7 +501,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
{
printf("Pool ID %s is invalid in PG configuration (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
fprintf(stderr, "Pool ID %s is invalid in PG configuration (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
continue;
}
for (auto & pg_item: pool_item.second.object_items())
@@ -490,7 +510,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
sscanf(pg_item.first.c_str(), "%u%c", &pg_num, &null_byte);
if (!pg_num || null_byte != 0)
{
printf("Bad key in pool %u PG configuration: %s (must be a number), skipped\n", pool_id, pg_item.first.c_str());
fprintf(stderr, "Bad key in pool %u PG configuration: %s (must be a number), skipped\n", pool_id, pg_item.first.c_str());
continue;
}
auto & parsed_cfg = this->pool_config[pool_id].pg_config[pg_num];
@@ -504,7 +524,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
}
if (parsed_cfg.target_set.size() != pool_config[pool_id].pg_size)
{
printf("Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n",
fprintf(stderr, "Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n",
pool_id, pg_num, parsed_cfg.target_set.size(), pool_config[pool_id].pg_size);
parsed_cfg.pause = true;
}
@@ -517,8 +537,8 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
{
if (pg_it->second.exists && pg_it->first != ++n)
{
printf(
"Invalid pool %u PG configuration: PG numbers don't cover whole 1..%lu range\n",
fprintf(
stderr, "Invalid pool %u PG configuration: PG numbers don't cover whole 1..%lu range\n",
pool_item.second.id, pool_item.second.pg_config.size()
);
for (pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); pg_it++)
@@ -541,7 +561,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
sscanf(key.c_str() + etcd_prefix.length()+12, "%u/%u%c", &pool_id, &pg_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !pg_num || null_byte != 0)
{
printf("Bad etcd key %s, ignoring\n", key.c_str());
fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
}
else
{
@@ -580,7 +600,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
sscanf(key.c_str() + etcd_prefix.length()+10, "%u/%u%c", &pool_id, &pg_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !pg_num || null_byte != 0)
{
printf("Bad etcd key %s, ignoring\n", key.c_str());
fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
}
else if (value.is_null())
{
@@ -604,7 +624,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
}
if (i >= pg_state_bit_count)
{
printf("Unexpected pool %u PG %u state keyword in etcd: %s\n", pool_id, pg_num, e.dump().c_str());
fprintf(stderr, "Unexpected pool %u PG %u state keyword in etcd: %s\n", pool_id, pg_num, e.dump().c_str());
return;
}
}
@@ -613,7 +633,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
(state & PG_PEERING) && state != PG_PEERING ||
(state & PG_INCOMPLETE) && state != PG_INCOMPLETE)
{
printf("Unexpected pool %u PG %u state in etcd: primary=%lu, state=%s\n", pool_id, pg_num, cur_primary, value["state"].dump().c_str());
fprintf(stderr, "Unexpected pool %u PG %u state in etcd: primary=%lu, state=%s\n", pool_id, pg_num, cur_primary, value["state"].dump().c_str());
return;
}
this->pool_config[pool_id].pg_config[pg_num].cur_primary = cur_primary;
@@ -642,4 +662,106 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
}
}
}
else if (key.substr(0, etcd_prefix.length()+14) == etcd_prefix+"/config/inode/")
{
// <etcd_prefix>/config/inode/%d/%d
uint64_t pool_id = 0;
uint64_t inode_num = 0;
char null_byte = 0;
sscanf(key.c_str() + etcd_prefix.length()+14, "%lu/%lu%c", &pool_id, &inode_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !inode_num || (inode_num >> (64-POOL_ID_BITS)) || null_byte != 0)
{
fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
}
else
{
inode_num |= (pool_id << (64-POOL_ID_BITS));
auto it = this->inode_config.find(inode_num);
if (it != this->inode_config.end() && it->second.name != "")
{
auto n_it = this->inode_by_name.find(it->second.name);
if (n_it->second == inode_num)
{
this->inode_by_name.erase(n_it);
for (auto w: watches)
{
if (w->name == it->second.name)
{
w->cfg = { 0 };
}
}
}
}
if (!value.is_object())
{
this->inode_config.erase(inode_num);
}
else
{
inode_t parent_inode_num = value["parent_id"].uint64_value();
if (parent_inode_num && !(parent_inode_num >> (64-POOL_ID_BITS)))
{
uint64_t parent_pool_id = value["parent_pool"].uint64_value();
if (!parent_pool_id)
parent_inode_num |= pool_id << (64-POOL_ID_BITS);
else if (parent_pool_id >= POOL_ID_MAX)
{
fprintf(
stderr, "Inode %lu/%lu parent_pool value is invalid, ignoring parent setting\n",
inode_num >> (64-POOL_ID_BITS), inode_num & ((1l << (64-POOL_ID_BITS)) - 1)
);
parent_inode_num = 0;
}
else
parent_inode_num |= parent_pool_id << (64-POOL_ID_BITS);
}
inode_config_t cfg = (inode_config_t){
.num = inode_num,
.name = value["name"].string_value(),
.size = value["size"].uint64_value(),
.parent_id = parent_inode_num,
.readonly = value["readonly"].bool_value(),
.mod_revision = kv.mod_revision,
};
this->inode_config[inode_num] = cfg;
if (cfg.name != "")
{
this->inode_by_name[cfg.name] = inode_num;
for (auto w: watches)
{
if (w->name == value["name"].string_value())
{
w->cfg = cfg;
}
}
}
}
}
}
}
inode_watch_t* etcd_state_client_t::watch_inode(std::string name)
{
inode_watch_t *watch = new inode_watch_t;
watch->name = name;
watches.push_back(watch);
auto it = inode_by_name.find(name);
if (it != inode_by_name.end())
{
watch->cfg = inode_config[it->second];
}
return watch;
}
void etcd_state_client_t::close_watch(inode_watch_t* watch)
{
for (int i = 0; i < watches.size(); i++)
{
if (watches[i] == watch)
{
watches.erase(watches.begin()+i, watches.begin()+i+1);
break;
}
}
delete watch;
}

View File

@@ -18,10 +18,11 @@
#define DEFAULT_BLOCK_SIZE 128*1024
struct json_kv_t
struct etcd_kv_t
{
std::string key;
json11::Json value;
uint64_t mod_revision;
};
struct pg_config_t
@@ -52,11 +53,29 @@ struct pool_config_t
std::map<pg_num_t, pg_config_t> pg_config;
};
struct inode_config_t
{
uint64_t num;
std::string name;
uint64_t size;
inode_t parent_id;
bool readonly;
// Change revision of the metadata in etcd
uint64_t mod_revision;
};
struct inode_watch_t
{
std::string name;
inode_config_t cfg;
};
struct websocket_t;
struct etcd_state_client_t
{
protected:
std::vector<inode_watch_t*> watches;
websocket_t *etcd_watch_ws = NULL;
uint64_t bs_block_size = DEFAULT_BLOCK_SIZE;
void add_etcd_url(std::string);
@@ -70,22 +89,25 @@ public:
uint64_t etcd_watch_revision = 0;
std::map<pool_id_t, pool_config_t> pool_config;
std::map<osd_num_t, json11::Json> peer_states;
std::map<inode_t, inode_config_t> inode_config;
std::map<std::string, inode_t> inode_by_name;
std::function<void(json11::Json::object &)> on_change_hook;
std::function<void(std::map<std::string, etcd_kv_t> &)> on_change_hook;
std::function<void(json11::Json::object &)> on_load_config_hook;
std::function<json11::Json()> load_pgs_checks_hook;
std::function<void(bool)> on_load_pgs_hook;
std::function<void(pool_id_t, pg_num_t)> on_change_pg_history_hook;
std::function<void(osd_num_t)> on_change_osd_state_hook;
json_kv_t parse_etcd_kv(const json11::Json & kv_json);
etcd_kv_t parse_etcd_kv(const json11::Json & kv_json);
void etcd_call(std::string api, json11::Json payload, int timeout, std::function<void(std::string, json11::Json)> callback);
void etcd_txn(json11::Json txn, int timeout, std::function<void(std::string, json11::Json)> callback);
void start_etcd_watcher();
void load_global_config();
void load_pgs();
void parse_state(const json_kv_t & kv);
void parse_state(const std::string & key, const json11::Json & value);
void parse_config(json11::Json & config);
void parse_state(const etcd_kv_t & kv);
void parse_config(const json11::Json & config);
inode_watch_t* watch_inode(std::string name);
void close_watch(inode_watch_t* watch);
~etcd_state_client_t();
};

View File

@@ -6,17 +6,17 @@
// Random write:
//
// fio -thread -ioengine=./libfio_cluster.so -name=test -bs=4k -direct=1 -fsync=16 -iodepth=16 -rw=randwrite \
// -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -pool=1 -inode=1 -size=1000M
// -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] (-image=testimg | -pool=1 -inode=1 -size=1000M)
//
// Linear write:
//
// fio -thread -ioengine=./libfio_cluster.so -name=test -bs=128k -direct=1 -fsync=32 -iodepth=32 -rw=write \
// -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -pool=1 -inode=1 -size=1000M
// -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -image=testimg
//
// Random read (run with -iodepth=32 or -iodepth=1):
//
// fio -thread -ioengine=./libfio_cluster.so -name=test -bs=4k -direct=1 -iodepth=32 -rw=randread \
// -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -pool=1 -inode=1 -size=1000M
// -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -image=testimg
#include <sys/types.h>
#include <sys/socket.h>
@@ -24,36 +24,49 @@
#include <netinet/tcp.h>
#include <vector>
#include <unordered_map>
#include "epoll_manager.h"
#include "cluster_client.h"
#include "vitastor_c.h"
#include "fio_headers.h"
struct sec_data
{
ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL;
cluster_client_t *cli = NULL;
vitastor_c *cli = NULL;
void *watch = NULL;
bool last_sync = false;
/* The list of completed io_u structs. */
std::vector<io_u*> completed;
uint64_t op_n = 0, inflight = 0;
uint64_t inflight = 0;
bool trace = false;
};
struct sec_options
{
int __pad;
char *config_path = NULL;
char *etcd_host = NULL;
char *etcd_prefix = NULL;
char *image = NULL;
uint64_t pool = 0;
uint64_t inode = 0;
int cluster_log = 0;
int trace = 0;
int use_rdma = 0;
char *rdma_device = NULL;
int rdma_port_num = 0;
int rdma_gid_index = 0;
int rdma_mtu = 0;
};
static struct fio_option options[] = {
{
.name = "conf",
.lname = "Vitastor config path",
.type = FIO_OPT_STR_STORE,
.off1 = offsetof(struct sec_options, config_path),
.help = "Vitastor config path",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "etcd",
.lname = "etcd address",
@@ -64,7 +77,7 @@ static struct fio_option options[] = {
.group = FIO_OPT_G_FILENAME,
},
{
.name = "etcd",
.name = "etcd_prefix",
.lname = "etcd key prefix",
.type = FIO_OPT_STR_STORE,
.off1 = offsetof(struct sec_options, etcd_prefix),
@@ -72,6 +85,15 @@ static struct fio_option options[] = {
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "image",
.lname = "Vitastor image name",
.type = FIO_OPT_STR_STORE,
.off1 = offsetof(struct sec_options, image),
.help = "Vitastor image name to run tests on",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "pool",
.lname = "pool number for the inode",
@@ -86,7 +108,7 @@ static struct fio_option options[] = {
.lname = "inode to run tests on",
.type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, inode),
.help = "inode to run tests on (1 by default)",
.help = "inode number to run tests on",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
@@ -110,22 +132,71 @@ static struct fio_option options[] = {
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "use_rdma",
.lname = "Use RDMA",
.type = FIO_OPT_BOOL,
.off1 = offsetof(struct sec_options, use_rdma),
.help = "Use RDMA",
.def = "-1",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "rdma_device",
.lname = "RDMA device name",
.type = FIO_OPT_STR_STORE,
.off1 = offsetof(struct sec_options, rdma_device),
.help = "RDMA device name",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "rdma_port_num",
.lname = "RDMA port number",
.type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, rdma_port_num),
.help = "RDMA port number",
.def = "0",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "rdma_gid_index",
.lname = "RDMA gid index",
.type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, rdma_gid_index),
.help = "RDMA gid index",
.def = "0",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "rdma_mtu",
.lname = "RDMA path MTU",
.type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, rdma_mtu),
.help = "RDMA path MTU",
.def = "0",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = NULL,
},
};
static void watch_callback(void *opaque, long watch)
{
struct sec_data *bsd = (struct sec_data*)opaque;
bsd->watch = (void*)watch;
}
static int sec_setup(struct thread_data *td)
{
sec_options *o = (sec_options*)td->eo;
sec_data *bsd;
if (!o->etcd_host)
{
td_verror(td, EINVAL, "etcd address is missing");
return 1;
}
bsd = new sec_data;
if (!bsd)
{
@@ -141,6 +212,45 @@ static int sec_setup(struct thread_data *td)
td->o.open_files++;
}
if (!o->image)
{
if (!(o->inode & ((1l << (64-POOL_ID_BITS)) - 1)))
{
td_verror(td, EINVAL, "inode number is missing");
return 1;
}
if (o->pool)
{
o->inode = (o->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (o->pool << (64-POOL_ID_BITS));
}
if (!(o->inode >> (64-POOL_ID_BITS)))
{
td_verror(td, EINVAL, "pool is missing");
return 1;
}
}
else
{
o->inode = 0;
}
bsd->cli = vitastor_c_create_uring(o->config_path, o->etcd_host, o->etcd_prefix,
o->use_rdma, o->rdma_device, o->rdma_port_num, o->rdma_gid_index, o->rdma_mtu, o->cluster_log);
if (o->image)
{
bsd->watch = NULL;
vitastor_c_watch_inode(bsd->cli, o->image, watch_callback, bsd);
while (true)
{
vitastor_c_uring_handle_events(bsd->cli);
if (bsd->watch)
break;
vitastor_c_uring_wait_events(bsd->cli);
}
td->files[0]->real_file_size = vitastor_c_inode_get_size(bsd->watch);
}
bsd->trace = o->trace ? true : false;
return 0;
}
@@ -149,9 +259,11 @@ static void sec_cleanup(struct thread_data *td)
sec_data *bsd = (sec_data*)td->io_ops_data;
if (bsd)
{
delete bsd->cli;
delete bsd->epmgr;
delete bsd->ringloop;
if (bsd->watch)
{
vitastor_c_close_watch(bsd->cli, bsd->watch);
}
vitastor_c_destroy(bsd->cli);
delete bsd;
}
}
@@ -159,37 +271,34 @@ static void sec_cleanup(struct thread_data *td)
/* Connect to the server from each thread. */
static int sec_init(struct thread_data *td)
{
sec_options *o = (sec_options*)td->eo;
sec_data *bsd = (sec_data*)td->io_ops_data;
json11::Json cfg = json11::Json::object {
{ "etcd_address", std::string(o->etcd_host) },
{ "etcd_prefix", std::string(o->etcd_prefix ? o->etcd_prefix : "/vitastor") },
{ "log_level", o->cluster_log },
};
if (o->pool)
o->inode = (o->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (o->pool << (64-POOL_ID_BITS));
if (!(o->inode >> (64-POOL_ID_BITS)))
{
td_verror(td, EINVAL, "pool is missing");
return 1;
}
bsd->ringloop = new ring_loop_t(512);
bsd->epmgr = new epoll_manager_t(bsd->ringloop);
bsd->cli = new cluster_client_t(bsd->ringloop, bsd->epmgr->tfd, cfg);
bsd->trace = o->trace ? true : false;
return 0;
}
static void io_callback(void *opaque, long retval)
{
struct io_u *io = (struct io_u*)opaque;
io->error = retval < 0 ? -retval : 0;
sec_data *bsd = (sec_data*)io->engine_data;
bsd->inflight--;
bsd->completed.push_back(io);
if (bsd->trace)
{
printf("--- %s 0x%lx retval=%ld\n", io->ddir == DDIR_READ ? "READ" :
(io->ddir == DDIR_WRITE ? "WRITE" : "SYNC"), (uint64_t)io, retval);
}
}
static void read_callback(void *opaque, long retval, uint64_t version)
{
io_callback(opaque, retval);
}
/* Begin read or write request. */
static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
{
sec_options *opt = (sec_options*)td->eo;
sec_data *bsd = (sec_data*)td->io_ops_data;
int n = bsd->op_n;
struct iovec iov;
fio_ro_check(td, io);
if (io->ddir == DDIR_SYNC && bsd->last_sync)
@@ -198,28 +307,29 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
}
io->engine_data = bsd;
cluster_op_t *op = new cluster_op_t;
io->error = 0;
bsd->inflight++;
uint64_t inode = opt->image ? vitastor_c_inode_get_num(bsd->watch) : opt->inode;
switch (io->ddir)
{
case DDIR_READ:
op->opcode = OSD_OP_READ;
op->inode = opt->inode;
op->offset = io->offset;
op->len = io->xfer_buflen;
op->iov.push_back(io->xfer_buf, io->xfer_buflen);
iov = { .iov_base = io->xfer_buf, .iov_len = io->xfer_buflen };
vitastor_c_read(bsd->cli, inode, io->offset, io->xfer_buflen, &iov, 1, read_callback, io);
bsd->last_sync = false;
break;
case DDIR_WRITE:
op->opcode = OSD_OP_WRITE;
op->inode = opt->inode;
op->offset = io->offset;
op->len = io->xfer_buflen;
op->iov.push_back(io->xfer_buf, io->xfer_buflen);
if (opt->image && vitastor_c_inode_get_readonly(bsd->watch))
{
io->error = EROFS;
return FIO_Q_COMPLETED;
}
iov = { .iov_base = io->xfer_buf, .iov_len = io->xfer_buflen };
vitastor_c_write(bsd->cli, inode, io->offset, io->xfer_buflen, 0, &iov, 1, io_callback, io);
bsd->last_sync = false;
break;
case DDIR_SYNC:
op->opcode = OSD_OP_SYNC;
vitastor_c_sync(bsd->cli, io_callback, io);
bsd->last_sync = true;
break;
default:
@@ -227,39 +337,20 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
return FIO_Q_COMPLETED;
}
op->callback = [io, n](cluster_op_t *op)
{
io->error = op->retval < 0 ? -op->retval : 0;
sec_data *bsd = (sec_data*)io->engine_data;
bsd->inflight--;
bsd->completed.push_back(io);
if (bsd->trace)
{
printf("--- %s n=%d retval=%d\n", io->ddir == DDIR_READ ? "READ" :
(io->ddir == DDIR_WRITE ? "WRITE" : "SYNC"), n, op->retval);
}
delete op;
};
if (opt->trace)
{
if (io->ddir == DDIR_SYNC)
{
printf("+++ SYNC # %d\n", n);
printf("+++ SYNC 0x%lx\n", (uint64_t)io);
}
else
{
printf("+++ %s # %d 0x%llx+%llx\n",
printf("+++ %s 0x%lx 0x%llx+%llx\n",
io->ddir == DDIR_READ ? "READ" : "WRITE",
n, io->offset, io->xfer_buflen);
(uint64_t)io, io->offset, io->xfer_buflen);
}
}
io->error = 0;
bsd->inflight++;
bsd->op_n++;
bsd->cli->execute(op);
if (io->error != 0)
return FIO_Q_COMPLETED;
return FIO_Q_QUEUED;
@@ -270,10 +361,10 @@ static int sec_getevents(struct thread_data *td, unsigned int min, unsigned int
sec_data *bsd = (sec_data*)td->io_ops_data;
while (true)
{
bsd->ringloop->loop();
vitastor_c_uring_handle_events(bsd->cli);
if (bsd->completed.size() >= min)
break;
bsd->ringloop->wait();
vitastor_c_uring_wait_events(bsd->cli);
}
return bsd->completed.size();
}

View File

@@ -25,6 +25,7 @@
// -bs_config='{"data_device":"./test_data.bin"}' -size=1000M
#include "blockstore.h"
#include "epoll_manager.h"
#include "fio_headers.h"
#include "json11/json11.hpp"
@@ -32,6 +33,7 @@
struct bs_data
{
blockstore_t *bs;
epoll_manager_t *epmgr;
ring_loop_t *ringloop;
/* The list of completed io_u structs. */
std::vector<io_u*> completed;
@@ -104,6 +106,7 @@ static void bs_cleanup(struct thread_data *td)
}
safe:
delete bsd->bs;
delete bsd->epmgr;
delete bsd->ringloop;
delete bsd;
}
@@ -129,7 +132,8 @@ static int bs_init(struct thread_data *td)
}
}
bsd->ringloop = new ring_loop_t(512);
bsd->bs = new blockstore_t(config, bsd->ringloop);
bsd->epmgr = new epoll_manager_t(bsd->ringloop);
bsd->bs = new blockstore_t(config, bsd->ringloop, bsd->epmgr->tfd);
while (1)
{
bsd->ringloop->loop();

View File

@@ -12,6 +12,31 @@
void osd_messenger_t::init()
{
#ifdef WITH_RDMA
if (use_rdma)
{
rdma_context = msgr_rdma_context_t::create(
rdma_device != "" ? rdma_device.c_str() : NULL,
rdma_port_num, rdma_gid_index, rdma_mtu
);
if (!rdma_context)
{
fprintf(stderr, "[OSD %lu] Couldn't initialize RDMA, proceeding with TCP only\n", osd_num);
}
else
{
rdma_max_sge = rdma_max_sge < rdma_context->attrx.orig_attr.max_sge
? rdma_max_sge : rdma_context->attrx.orig_attr.max_sge;
fprintf(stderr, "[OSD %lu] RDMA initialized successfully\n", osd_num);
fcntl(rdma_context->channel->fd, F_SETFL, fcntl(rdma_context->channel->fd, F_GETFL, 0) | O_NONBLOCK);
tfd->set_fd_handler(rdma_context->channel->fd, false, [this](int notify_fd, int epoll_events)
{
handle_rdma_events();
});
handle_rdma_events();
}
}
#endif
keepalive_timer_id = tfd->set_timer(1000, true, [this](int)
{
std::vector<int> to_stop;
@@ -19,7 +44,7 @@ void osd_messenger_t::init()
for (auto cl_it = clients.begin(); cl_it != clients.end(); cl_it++)
{
auto cl = cl_it->second;
if (!cl->osd_num || cl->peer_state != PEER_CONNECTED)
if (!cl->osd_num || cl->peer_state != PEER_CONNECTED && cl->peer_state != PEER_RDMA)
{
// Do not run keepalive on regular clients
continue;
@@ -30,7 +55,7 @@ void osd_messenger_t::init()
if (!cl->ping_time_remaining)
{
// Ping timed out, stop the client
printf("Ping timed out for OSD %lu (client %d), disconnecting peer\n", cl->osd_num, cl->peer_fd);
fprintf(stderr, "Ping timed out for OSD %lu (client %d), disconnecting peer\n", cl->osd_num, cl->peer_fd);
to_stop.push_back(cl->peer_fd);
}
}
@@ -57,7 +82,7 @@ void osd_messenger_t::init()
delete op;
if (fail_fd >= 0)
{
printf("Ping failed for OSD %lu (client %d), disconnecting peer\n", cl->osd_num, cl->peer_fd);
fprintf(stderr, "Ping failed for OSD %lu (client %d), disconnecting peer\n", cl->osd_num, cl->peer_fd);
stop_client(fail_fd, true);
}
};
@@ -94,32 +119,58 @@ osd_messenger_t::~osd_messenger_t()
{
stop_client(clients.begin()->first, true);
}
#ifdef WITH_RDMA
if (rdma_context)
{
delete rdma_context;
}
#endif
}
void osd_messenger_t::parse_config(const json11::Json & config)
{
#ifdef WITH_RDMA
if (!config["use_rdma"].is_null())
{
// RDMA is on by default in RDMA-enabled builds
this->use_rdma = config["use_rdma"].bool_value() || config["use_rdma"].uint64_value() != 0;
}
this->rdma_device = config["rdma_device"].string_value();
this->rdma_port_num = (uint8_t)config["rdma_port_num"].uint64_value();
if (!this->rdma_port_num)
this->rdma_port_num = 1;
this->rdma_gid_index = (uint8_t)config["rdma_gid_index"].uint64_value();
this->rdma_mtu = (uint32_t)config["rdma_mtu"].uint64_value();
this->rdma_max_sge = config["rdma_max_sge"].uint64_value();
if (!this->rdma_max_sge)
this->rdma_max_sge = 128;
this->rdma_max_send = config["rdma_max_send"].uint64_value();
if (!this->rdma_max_send)
this->rdma_max_send = 32;
this->rdma_max_recv = config["rdma_max_recv"].uint64_value();
if (!this->rdma_max_recv)
this->rdma_max_recv = 8;
this->rdma_max_msg = config["rdma_max_msg"].uint64_value();
if (!this->rdma_max_msg || this->rdma_max_msg > 128*1024*1024)
this->rdma_max_msg = 1024*1024;
#endif
this->receive_buffer_size = (uint32_t)config["tcp_header_buffer_size"].uint64_value();
if (!this->receive_buffer_size || this->receive_buffer_size > 1024*1024*1024)
this->receive_buffer_size = 65536;
this->use_sync_send_recv = config["use_sync_send_recv"].bool_value() ||
config["use_sync_send_recv"].uint64_value();
this->peer_connect_interval = config["peer_connect_interval"].uint64_value();
if (!this->peer_connect_interval)
{
this->peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
}
this->peer_connect_interval = 5;
this->peer_connect_timeout = config["peer_connect_timeout"].uint64_value();
if (!this->peer_connect_timeout)
{
this->peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
}
this->peer_connect_timeout = 5;
this->osd_idle_timeout = config["osd_idle_timeout"].uint64_value();
if (!this->osd_idle_timeout)
{
this->osd_idle_timeout = DEFAULT_OSD_PING_TIMEOUT;
}
this->osd_idle_timeout = 5;
this->osd_ping_timeout = config["osd_ping_timeout"].uint64_value();
if (!this->osd_ping_timeout)
{
this->osd_ping_timeout = DEFAULT_OSD_PING_TIMEOUT;
}
this->osd_ping_timeout = 5;
this->log_level = config["log_level"].uint64_value();
}
@@ -210,7 +261,7 @@ void osd_messenger_t::try_connect_peer_addr(osd_num_t peer_osd, const char *peer
{
osd_num_t peer_osd = clients.at(peer_fd)->osd_num;
stop_client(peer_fd, true);
on_connect_peer(peer_osd, -EIO);
on_connect_peer(peer_osd, -EPIPE);
return;
});
}
@@ -254,7 +305,7 @@ void osd_messenger_t::handle_peer_epoll(int peer_fd, int epoll_events)
if (epoll_events & EPOLLRDHUP)
{
// Stop client
printf("[OSD %lu] client %d disconnected\n", this->osd_num, peer_fd);
fprintf(stderr, "[OSD %lu] client %d disconnected\n", this->osd_num, peer_fd);
stop_client(peer_fd, true);
}
else if (epoll_events & EPOLLIN)
@@ -279,7 +330,7 @@ void osd_messenger_t::on_connect_peer(osd_num_t peer_osd, int peer_fd)
wp.connecting = false;
if (peer_fd < 0)
{
printf("Failed to connect to peer OSD %lu address %s port %d: %s\n", peer_osd, wp.cur_addr.c_str(), wp.cur_port, strerror(-peer_fd));
fprintf(stderr, "Failed to connect to peer OSD %lu address %s port %d: %s\n", peer_osd, wp.cur_addr.c_str(), wp.cur_port, strerror(-peer_fd));
if (wp.address_changed)
{
wp.address_changed = false;
@@ -306,7 +357,7 @@ void osd_messenger_t::on_connect_peer(osd_num_t peer_osd, int peer_fd)
}
if (log_level > 0)
{
printf("[OSD %lu] Connected with peer OSD %lu (client %d)\n", osd_num, peer_osd, peer_fd);
fprintf(stderr, "[OSD %lu] Connected with peer OSD %lu (client %d)\n", osd_num, peer_osd, peer_fd);
}
wanted_peers.erase(peer_osd);
repeer_pgs(peer_osd);
@@ -326,6 +377,24 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
},
},
};
#ifdef WITH_RDMA
if (rdma_context)
{
cl->rdma_conn = msgr_rdma_connection_t::create(rdma_context, rdma_max_send, rdma_max_recv, rdma_max_sge, rdma_max_msg);
if (cl->rdma_conn)
{
json11::Json payload = json11::Json::object {
{ "connect_rdma", cl->rdma_conn->addr.to_string() },
{ "rdma_max_msg", cl->rdma_conn->max_msg },
};
std::string payload_str = payload.dump();
op->req.show_conf.json_len = payload_str.size();
op->buf = malloc_or_die(payload_str.size());
op->iov.push_back(op->buf, payload_str.size());
memcpy(op->buf, payload_str.c_str(), payload_str.size());
}
}
#endif
op->callback = [this, cl](osd_op_t *op)
{
std::string json_err;
@@ -334,7 +403,7 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
if (op->reply.hdr.retval < 0)
{
err = true;
printf("Failed to get config from OSD %lu (retval=%ld), disconnecting peer\n", cl->osd_num, op->reply.hdr.retval);
fprintf(stderr, "Failed to get config from OSD %lu (retval=%ld), disconnecting peer\n", cl->osd_num, op->reply.hdr.retval);
}
else
{
@@ -342,22 +411,69 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
if (json_err != "")
{
err = true;
printf("Failed to get config from OSD %lu: bad JSON: %s, disconnecting peer\n", cl->osd_num, json_err.c_str());
fprintf(stderr, "Failed to get config from OSD %lu: bad JSON: %s, disconnecting peer\n", cl->osd_num, json_err.c_str());
}
else if (config["osd_num"].uint64_value() != cl->osd_num)
{
err = true;
printf("Connected to OSD %lu instead of OSD %lu, peer state is outdated, disconnecting peer\n", config["osd_num"].uint64_value(), cl->osd_num);
fprintf(stderr, "Connected to OSD %lu instead of OSD %lu, peer state is outdated, disconnecting peer\n", config["osd_num"].uint64_value(), cl->osd_num);
}
else if (config["protocol_version"].uint64_value() != OSD_PROTOCOL_VERSION)
{
err = true;
fprintf(
stderr, "OSD %lu protocol version is %lu, but only version %u is supported.\n"
" If you need to upgrade from 0.5.x please request it via the issue tracker.\n",
cl->osd_num, config["protocol_version"].uint64_value(), OSD_PROTOCOL_VERSION
);
}
}
if (err)
{
osd_num_t osd_num = cl->osd_num;
osd_num_t peer_osd = cl->osd_num;
stop_client(op->peer_fd);
on_connect_peer(osd_num, -1);
on_connect_peer(peer_osd, -1);
delete op;
return;
}
#ifdef WITH_RDMA
if (config["rdma_address"].is_string())
{
msgr_rdma_address_t addr;
if (!msgr_rdma_address_t::from_string(config["rdma_address"].string_value().c_str(), &addr) ||
cl->rdma_conn->connect(&addr) != 0)
{
fprintf(
stderr, "Failed to connect to OSD %lu (address %s) using RDMA\n",
cl->osd_num, config["rdma_address"].string_value().c_str()
);
delete cl->rdma_conn;
cl->rdma_conn = NULL;
// FIXME: Keep TCP connection in this case
osd_num_t peer_osd = cl->osd_num;
stop_client(cl->peer_fd);
on_connect_peer(peer_osd, -1);
delete op;
return;
}
else
{
uint64_t server_max_msg = config["rdma_max_msg"].uint64_value();
if (cl->rdma_conn->max_msg > server_max_msg)
{
cl->rdma_conn->max_msg = server_max_msg;
}
if (log_level > 0)
{
fprintf(stderr, "Connected to OSD %lu using RDMA\n", cl->osd_num);
}
cl->peer_state = PEER_RDMA;
tfd->set_fd_handler(cl->peer_fd, false, NULL);
// Add the initial receive request
try_recv_rdma(cl);
}
}
#endif
osd_peer_fds[cl->osd_num] = cl->peer_fd;
on_connect_peer(cl->osd_num, cl->peer_fd);
delete op;
@@ -375,7 +491,7 @@ void osd_messenger_t::accept_connections(int listen_fd)
{
assert(peer_fd != 0);
char peer_str[256];
printf("[OSD %lu] new client %d: connection from %s port %d\n", this->osd_num, peer_fd,
fprintf(stderr, "[OSD %lu] new client %d: connection from %s port %d\n", this->osd_num, peer_fd,
inet_ntop(AF_INET, &addr.sin_addr, peer_str, 256), ntohs(addr.sin_port));
fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK);
int one = 1;
@@ -399,3 +515,59 @@ void osd_messenger_t::accept_connections(int listen_fd)
throw std::runtime_error(std::string("accept: ") + strerror(errno));
}
}
#ifdef WITH_RDMA
bool osd_messenger_t::is_rdma_enabled()
{
return rdma_context != NULL;
}
#endif
json11::Json osd_messenger_t::read_config(const json11::Json & config)
{
const char *config_path = config["config_path"].string_value() != ""
? config["config_path"].string_value().c_str() : VITASTOR_CONFIG_PATH;
int fd = open(config_path, O_RDONLY);
if (fd < 0)
{
if (errno != ENOENT)
fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno));
return config;
}
struct stat st;
if (fstat(fd, &st) != 0)
{
fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno));
close(fd);
return config;
}
std::string buf;
buf.resize(st.st_size);
int done = 0;
while (done < st.st_size)
{
int r = read(fd, (void*)buf.data()+done, st.st_size-done);
if (r < 0)
{
fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno));
close(fd);
return config;
}
done += r;
}
close(fd);
std::string json_err;
json11::Json::object file_config = json11::Json::parse(buf, json_err).object_items();
if (json_err != "")
{
fprintf(stderr, "Invalid JSON in %s: %s\n", config_path, json_err.c_str());
return config;
}
file_config.erase("config_path");
file_config.erase("osd_num");
for (auto kv: config.object_items())
{
file_config[kv.first] = kv.second;
}
return file_config;
}

View File

@@ -18,19 +18,32 @@
#include "timerfd_manager.h"
#include <ringloop.h>
#ifdef WITH_RDMA
#include "msgr_rdma.h"
#endif
#define CL_READ_HDR 1
#define CL_READ_DATA 2
#define CL_READ_REPLY_DATA 3
#define CL_WRITE_READY 1
#define CL_WRITE_REPLY 2
#define PEER_CONNECTING 1
#define PEER_CONNECTED 2
#define PEER_STOPPED 3
#define PEER_RDMA_CONNECTING 3
#define PEER_RDMA 4
#define PEER_STOPPED 5
#define DEFAULT_PEER_CONNECT_INTERVAL 5
#define DEFAULT_PEER_CONNECT_TIMEOUT 5
#define DEFAULT_OSD_PING_TIMEOUT 5
#define DEFAULT_BITMAP_GRANULARITY 4096
#define VITASTOR_CONFIG_PATH "/etc/vitastor/vitastor.conf"
#define MSGR_SENDP_HDR 1
#define MSGR_SENDP_FREE 2
struct msgr_sendp_t
{
osd_op_t *op;
int flags;
};
struct osd_client_t
{
@@ -47,6 +60,10 @@ struct osd_client_t
void *in_buf = NULL;
#ifdef WITH_RDMA
msgr_rdma_connection_t *rdma_conn = NULL;
#endif
// Read state
int read_ready = 0;
osd_op_t *read_op = NULL;
@@ -69,7 +86,7 @@ struct osd_client_t
msghdr write_msg = { 0 };
int write_state = 0;
std::vector<iovec> send_list, next_send_list;
std::vector<osd_op_t*> outbox, next_outbox;
std::vector<msgr_sendp_t> outbox, next_outbox;
~osd_client_t()
{
@@ -103,15 +120,23 @@ struct osd_messenger_t
protected:
int keepalive_timer_id = -1;
// FIXME: make receive_buffer_size configurable
int receive_buffer_size = 64*1024;
int peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
int peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
int osd_idle_timeout = DEFAULT_OSD_PING_TIMEOUT;
int osd_ping_timeout = DEFAULT_OSD_PING_TIMEOUT;
uint32_t receive_buffer_size = 0;
int peer_connect_interval = 0;
int peer_connect_timeout = 0;
int osd_idle_timeout = 0;
int osd_ping_timeout = 0;
int log_level = 0;
bool use_sync_send_recv = false;
#ifdef WITH_RDMA
bool use_rdma = true;
std::string rdma_device;
uint64_t rdma_port_num = 1, rdma_gid_index = 0, rdma_mtu = 0;
msgr_rdma_context_t *rdma_context = NULL;
uint64_t rdma_max_sge = 0, rdma_max_send = 0, rdma_max_recv = 8;
uint64_t rdma_max_msg = 0;
#endif
std::vector<int> read_ready_clients;
std::vector<int> write_ready_clients;
std::vector<std::function<void()>> set_immediate;
@@ -140,6 +165,13 @@ public:
void accept_connections(int listen_fd);
~osd_messenger_t();
static json11::Json read_config(const json11::Json & config);
#ifdef WITH_RDMA
bool is_rdma_enabled();
bool connect_rdma(int peer_fd, std::string rdma_address, uint64_t client_max_msg);
#endif
protected:
void try_connect_peer(uint64_t osd_num);
void try_connect_peer_addr(osd_num_t peer_osd, const char *peer_host, int peer_port);
@@ -155,8 +187,15 @@ protected:
void handle_send(int result, osd_client_t *cl);
bool handle_read(int result, osd_client_t *cl);
bool handle_read_buffer(osd_client_t *cl, void *curbuf, int remain);
bool handle_finished_read(osd_client_t *cl);
void handle_op_hdr(osd_client_t *cl);
bool handle_reply_hdr(osd_client_t *cl);
void handle_reply_ready(osd_op_t *op);
#ifdef WITH_RDMA
bool try_send_rdma(osd_client_t *cl);
bool try_recv_rdma(osd_client_t *cl);
void handle_rdma_events();
#endif
};

View File

@@ -42,3 +42,8 @@ void osd_messenger_t::read_requests()
void osd_messenger_t::send_replies()
{
}
json11::Json osd_messenger_t::read_config(const json11::Json & config)
{
return config;
}

View File

@@ -76,7 +76,7 @@ struct osd_op_buf_list_t
buf = (iovec*)malloc(sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
fprintf(stderr, "Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
memcpy(buf, inline_buf, sizeof(iovec) * old);
@@ -87,7 +87,7 @@ struct osd_op_buf_list_t
buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
fprintf(stderr, "Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
}
@@ -109,7 +109,7 @@ struct osd_op_buf_list_t
buf = (iovec*)malloc(sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
fprintf(stderr, "Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
memcpy(buf, inline_buf, sizeof(iovec)*old);
@@ -120,7 +120,7 @@ struct osd_op_buf_list_t
buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
fprintf(stderr, "Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
}
@@ -154,13 +154,17 @@ struct osd_primary_op_data_t;
struct osd_op_t
{
timespec tv_begin;
timespec tv_begin = { 0 }, tv_end = { 0 };
uint64_t op_type = OSD_OP_IN;
int peer_fd;
osd_any_op_t req;
osd_any_reply_t reply;
blockstore_op_t *bs_op = NULL;
void *buf = NULL;
// bitmap, bitmap_len, bmp_data are only meaningful for reads
void *bitmap = NULL;
unsigned bitmap_len = 0;
unsigned bmp_data = 0;
void *rmw_buf = NULL;
osd_primary_op_data_t* op_data = NULL;
std::function<void(osd_op_t*)> callback;

521
src/msgr_rdma.cpp Normal file
View File

@@ -0,0 +1,521 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <stdio.h>
#include <stdlib.h>
#include "msgr_rdma.h"
#include "messenger.h"
std::string msgr_rdma_address_t::to_string()
{
char msg[sizeof "0000:00000000:00000000:00000000000000000000000000000000"];
sprintf(
msg, "%04x:%06x:%06x:%016lx%016lx", lid, qpn, psn,
htobe64(((uint64_t*)&gid)[0]), htobe64(((uint64_t*)&gid)[1])
);
return std::string(msg);
}
bool msgr_rdma_address_t::from_string(const char *str, msgr_rdma_address_t *dest)
{
uint64_t* gid = (uint64_t*)&dest->gid;
int n = sscanf(
str, "%hx:%x:%x:%16lx%16lx", &dest->lid, &dest->qpn, &dest->psn, gid, gid+1
);
gid[0] = be64toh(gid[0]);
gid[1] = be64toh(gid[1]);
return n == 5;
}
msgr_rdma_context_t::~msgr_rdma_context_t()
{
if (cq)
ibv_destroy_cq(cq);
if (channel)
ibv_destroy_comp_channel(channel);
if (mr)
ibv_dereg_mr(mr);
if (pd)
ibv_dealloc_pd(pd);
if (context)
ibv_close_device(context);
}
msgr_rdma_connection_t::~msgr_rdma_connection_t()
{
ctx->used_max_cqe -= max_send+max_recv;
if (qp)
ibv_destroy_qp(qp);
}
msgr_rdma_context_t *msgr_rdma_context_t::create(const char *ib_devname, uint8_t ib_port, uint8_t gid_index, uint32_t mtu)
{
int res;
ibv_device **dev_list = NULL;
msgr_rdma_context_t *ctx = new msgr_rdma_context_t();
ctx->mtu = mtu;
dev_list = ibv_get_device_list(NULL);
if (!dev_list)
{
fprintf(stderr, "Failed to get RDMA device list: %s\n", strerror(errno));
goto cleanup;
}
if (!ib_devname)
{
ctx->dev = *dev_list;
if (!ctx->dev)
{
fprintf(stderr, "No RDMA devices found\n");
goto cleanup;
}
}
else
{
int i;
for (i = 0; dev_list[i]; ++i)
if (!strcmp(ibv_get_device_name(dev_list[i]), ib_devname))
break;
ctx->dev = dev_list[i];
if (!ctx->dev)
{
fprintf(stderr, "RDMA device %s not found\n", ib_devname);
goto cleanup;
}
}
ctx->context = ibv_open_device(ctx->dev);
if (!ctx->context)
{
fprintf(stderr, "Couldn't get RDMA context for %s\n", ibv_get_device_name(ctx->dev));
goto cleanup;
}
ctx->ib_port = ib_port;
ctx->gid_index = gid_index;
if ((res = ibv_query_port(ctx->context, ib_port, &ctx->portinfo)) != 0)
{
fprintf(stderr, "Couldn't get RDMA device %s port %d info: %s\n", ibv_get_device_name(ctx->dev), ib_port, strerror(res));
goto cleanup;
}
ctx->my_lid = ctx->portinfo.lid;
if (ctx->portinfo.link_layer != IBV_LINK_LAYER_ETHERNET && !ctx->my_lid)
{
fprintf(stderr, "RDMA device %s must have local LID because it's not Ethernet, but LID is zero\n", ibv_get_device_name(ctx->dev));
goto cleanup;
}
if (ibv_query_gid(ctx->context, ib_port, gid_index, &ctx->my_gid))
{
fprintf(stderr, "Couldn't read RDMA device %s GID index %d\n", ibv_get_device_name(ctx->dev), gid_index);
goto cleanup;
}
ctx->pd = ibv_alloc_pd(ctx->context);
if (!ctx->pd)
{
fprintf(stderr, "Couldn't allocate RDMA protection domain\n");
goto cleanup;
}
{
if (ibv_query_device_ex(ctx->context, NULL, &ctx->attrx))
{
fprintf(stderr, "Couldn't query RDMA device for its features\n");
goto cleanup;
}
if (!(ctx->attrx.odp_caps.general_caps & IBV_ODP_SUPPORT) ||
!(ctx->attrx.odp_caps.general_caps & IBV_ODP_SUPPORT_IMPLICIT) ||
!(ctx->attrx.odp_caps.per_transport_caps.rc_odp_caps & IBV_ODP_SUPPORT_SEND) ||
!(ctx->attrx.odp_caps.per_transport_caps.rc_odp_caps & IBV_ODP_SUPPORT_RECV))
{
fprintf(stderr, "The RDMA device isn't implicit ODP (On-Demand Paging) capable or does not support RC send and receive with ODP\n");
goto cleanup;
}
}
ctx->mr = ibv_reg_mr(ctx->pd, NULL, SIZE_MAX, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_ON_DEMAND);
if (!ctx->mr)
{
fprintf(stderr, "Couldn't register RDMA memory region\n");
goto cleanup;
}
ctx->channel = ibv_create_comp_channel(ctx->context);
if (!ctx->channel)
{
fprintf(stderr, "Couldn't create RDMA completion channel\n");
goto cleanup;
}
ctx->max_cqe = 4096;
ctx->cq = ibv_create_cq(ctx->context, ctx->max_cqe, NULL, ctx->channel, 0);
if (!ctx->cq)
{
fprintf(stderr, "Couldn't create RDMA completion queue\n");
goto cleanup;
}
if (dev_list)
ibv_free_device_list(dev_list);
return ctx;
cleanup:
delete ctx;
if (dev_list)
ibv_free_device_list(dev_list);
return NULL;
}
msgr_rdma_connection_t *msgr_rdma_connection_t::create(msgr_rdma_context_t *ctx, uint32_t max_send,
uint32_t max_recv, uint32_t max_sge, uint32_t max_msg)
{
msgr_rdma_connection_t *conn = new msgr_rdma_connection_t;
max_sge = max_sge > ctx->attrx.orig_attr.max_sge ? ctx->attrx.orig_attr.max_sge : max_sge;
conn->ctx = ctx;
conn->max_send = max_send;
conn->max_recv = max_recv;
conn->max_sge = max_sge;
conn->max_msg = max_msg;
ctx->used_max_cqe += max_send+max_recv;
if (ctx->used_max_cqe > ctx->max_cqe)
{
// Resize CQ
// Mellanox ConnectX-4 supports up to 4194303 CQEs, so it's fine to put everything into a single CQ
int new_max_cqe = ctx->max_cqe;
while (ctx->used_max_cqe > new_max_cqe)
{
new_max_cqe *= 2;
}
if (ibv_resize_cq(ctx->cq, new_max_cqe) != 0)
{
fprintf(stderr, "Couldn't resize RDMA completion queue to %d entries\n", new_max_cqe);
delete conn;
return NULL;
}
ctx->max_cqe = new_max_cqe;
}
ibv_qp_init_attr init_attr = {
.send_cq = ctx->cq,
.recv_cq = ctx->cq,
.cap = {
.max_send_wr = max_send,
.max_recv_wr = max_recv,
.max_send_sge = max_sge,
.max_recv_sge = max_sge,
},
.qp_type = IBV_QPT_RC,
};
conn->qp = ibv_create_qp(ctx->pd, &init_attr);
if (!conn->qp)
{
fprintf(stderr, "Couldn't create RDMA queue pair\n");
delete conn;
return NULL;
}
conn->addr.lid = ctx->my_lid;
conn->addr.gid = ctx->my_gid;
conn->addr.qpn = conn->qp->qp_num;
conn->addr.psn = lrand48() & 0xffffff;
ibv_qp_attr attr = {
.qp_state = IBV_QPS_INIT,
.qp_access_flags = 0,
.pkey_index = 0,
.port_num = ctx->ib_port,
};
if (ibv_modify_qp(conn->qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS))
{
fprintf(stderr, "Failed to switch RDMA queue pair to INIT state\n");
delete conn;
return NULL;
}
return conn;
}
static ibv_mtu mtu_to_ibv_mtu(uint32_t mtu)
{
switch (mtu)
{
case 256: return IBV_MTU_256;
case 512: return IBV_MTU_512;
case 1024: return IBV_MTU_1024;
case 2048: return IBV_MTU_2048;
case 4096: return IBV_MTU_4096;
}
return IBV_MTU_4096;
}
int msgr_rdma_connection_t::connect(msgr_rdma_address_t *dest)
{
auto conn = this;
ibv_qp_attr attr = {
.qp_state = IBV_QPS_RTR,
.path_mtu = mtu_to_ibv_mtu(conn->ctx->mtu),
.rq_psn = dest->psn,
.sq_psn = conn->addr.psn,
.dest_qp_num = dest->qpn,
.ah_attr = {
.grh = {
.dgid = dest->gid,
.sgid_index = conn->ctx->gid_index,
.hop_limit = 1, // FIXME can it vary?
},
.dlid = dest->lid,
.sl = 0, // service level
.src_path_bits = 0,
.is_global = (uint8_t)(dest->gid.global.interface_id ? 1 : 0),
.port_num = conn->ctx->ib_port,
},
.max_rd_atomic = 1,
.max_dest_rd_atomic = 1,
// Timeout and min_rnr_timer actual values seem to be 4.096us*2^(timeout+1)
.min_rnr_timer = 1,
.timeout = 14,
.retry_cnt = 7,
.rnr_retry = 7,
};
// FIXME No idea if ibv_modify_qp is a blocking operation or not. No idea if it has a timeout and what it is.
if (ibv_modify_qp(conn->qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
IBV_QP_DEST_QPN | IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
{
fprintf(stderr, "Failed to switch RDMA queue pair to RTR (ready-to-receive) state\n");
return 1;
}
attr.qp_state = IBV_QPS_RTS;
if (ibv_modify_qp(conn->qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT |
IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC))
{
fprintf(stderr, "Failed to switch RDMA queue pair to RTS (ready-to-send) state\n");
return 1;
}
return 0;
}
bool osd_messenger_t::connect_rdma(int peer_fd, std::string rdma_address, uint64_t client_max_msg)
{
// Try to connect to the peer using RDMA
msgr_rdma_address_t addr;
if (msgr_rdma_address_t::from_string(rdma_address.c_str(), &addr))
{
if (client_max_msg > rdma_max_msg)
{
client_max_msg = rdma_max_msg;
}
auto rdma_conn = msgr_rdma_connection_t::create(rdma_context, rdma_max_send, rdma_max_recv, rdma_max_sge, client_max_msg);
if (rdma_conn)
{
int r = rdma_conn->connect(&addr);
if (r != 0)
{
delete rdma_conn;
fprintf(
stderr, "Failed to connect RDMA queue pair to %s (client %d)\n",
addr.to_string().c_str(), peer_fd
);
}
else
{
// Remember connection, but switch to RDMA only after sending the configuration response
auto cl = clients.at(peer_fd);
cl->rdma_conn = rdma_conn;
cl->peer_state = PEER_RDMA_CONNECTING;
return true;
}
}
}
return false;
}
static void try_send_rdma_wr(osd_client_t *cl, ibv_sge *sge, int op_sge)
{
ibv_send_wr *bad_wr = NULL;
ibv_send_wr wr = {
.wr_id = (uint64_t)(cl->peer_fd*2+1),
.sg_list = sge,
.num_sge = op_sge,
.opcode = IBV_WR_SEND,
.send_flags = IBV_SEND_SIGNALED,
};
int err = ibv_post_send(cl->rdma_conn->qp, &wr, &bad_wr);
if (err || bad_wr)
{
fprintf(stderr, "RDMA send failed: %s\n", strerror(err));
exit(1);
}
cl->rdma_conn->cur_send++;
}
bool osd_messenger_t::try_send_rdma(osd_client_t *cl)
{
auto rc = cl->rdma_conn;
if (!cl->send_list.size() || rc->cur_send > 0)
{
// Only send one batch at a time
return true;
}
uint64_t op_size = 0, op_sge = 0;
ibv_sge sge[rc->max_sge];
while (rc->send_pos < cl->send_list.size())
{
iovec & iov = cl->send_list[rc->send_pos];
if (op_size >= rc->max_msg || op_sge >= rc->max_sge)
{
try_send_rdma_wr(cl, sge, op_sge);
op_sge = 0;
op_size = 0;
if (rc->cur_send >= rc->max_send)
{
break;
}
}
uint32_t len = (uint32_t)(op_size+iov.iov_len-rc->send_buf_pos < rc->max_msg
? iov.iov_len-rc->send_buf_pos : rc->max_msg-op_size);
sge[op_sge++] = {
.addr = (uintptr_t)(iov.iov_base+rc->send_buf_pos),
.length = len,
.lkey = rc->ctx->mr->lkey,
};
op_size += len;
rc->send_buf_pos += len;
if (rc->send_buf_pos >= iov.iov_len)
{
rc->send_pos++;
rc->send_buf_pos = 0;
}
}
if (op_sge > 0)
{
try_send_rdma_wr(cl, sge, op_sge);
}
return true;
}
static void try_recv_rdma_wr(osd_client_t *cl, ibv_sge *sge, int op_sge)
{
ibv_recv_wr *bad_wr = NULL;
ibv_recv_wr wr = {
.wr_id = (uint64_t)(cl->peer_fd*2),
.sg_list = sge,
.num_sge = op_sge,
};
int err = ibv_post_recv(cl->rdma_conn->qp, &wr, &bad_wr);
if (err || bad_wr)
{
fprintf(stderr, "RDMA receive failed: %s\n", strerror(err));
exit(1);
}
cl->rdma_conn->cur_recv++;
}
bool osd_messenger_t::try_recv_rdma(osd_client_t *cl)
{
auto rc = cl->rdma_conn;
while (rc->cur_recv < rc->max_recv)
{
void *buf = malloc_or_die(rc->max_msg);
rc->recv_buffers.push_back(buf);
ibv_sge sge = {
.addr = (uintptr_t)buf,
.length = (uint32_t)rc->max_msg,
.lkey = rc->ctx->mr->lkey,
};
try_recv_rdma_wr(cl, &sge, 1);
}
return true;
}
#define RDMA_EVENTS_AT_ONCE 32
void osd_messenger_t::handle_rdma_events()
{
// Request next notification
ibv_cq *ev_cq;
void *ev_ctx;
// FIXME: This is inefficient as it calls read()...
if (ibv_get_cq_event(rdma_context->channel, &ev_cq, &ev_ctx) == 0)
{
ibv_ack_cq_events(rdma_context->cq, 1);
}
if (ibv_req_notify_cq(rdma_context->cq, 0) != 0)
{
fprintf(stderr, "Failed to request RDMA completion notification, exiting\n");
exit(1);
}
ibv_wc wc[RDMA_EVENTS_AT_ONCE];
int event_count;
do
{
event_count = ibv_poll_cq(rdma_context->cq, RDMA_EVENTS_AT_ONCE, wc);
for (int i = 0; i < event_count; i++)
{
int client_id = wc[i].wr_id >> 1;
bool is_send = wc[i].wr_id & 1;
auto cl_it = clients.find(client_id);
if (cl_it == clients.end())
{
continue;
}
osd_client_t *cl = cl_it->second;
if (wc[i].status != IBV_WC_SUCCESS)
{
fprintf(stderr, "RDMA work request failed for client %d", client_id);
if (cl->osd_num)
{
fprintf(stderr, " (OSD %lu)", cl->osd_num);
}
fprintf(stderr, " with status: %s, stopping client\n", ibv_wc_status_str(wc[i].status));
stop_client(client_id);
continue;
}
if (!is_send)
{
cl->rdma_conn->cur_recv--;
handle_read_buffer(cl, cl->rdma_conn->recv_buffers[0], wc[i].byte_len);
free(cl->rdma_conn->recv_buffers[0]);
cl->rdma_conn->recv_buffers.erase(cl->rdma_conn->recv_buffers.begin(), cl->rdma_conn->recv_buffers.begin()+1);
try_recv_rdma(cl);
}
else
{
cl->rdma_conn->cur_send--;
if (!cl->rdma_conn->cur_send)
{
// Wait for the whole batch
for (int i = 0; i < cl->rdma_conn->send_pos; i++)
{
if (cl->outbox[i].flags & MSGR_SENDP_FREE)
{
// Reply fully sent
delete cl->outbox[i].op;
}
}
if (cl->rdma_conn->send_pos > 0)
{
cl->send_list.erase(cl->send_list.begin(), cl->send_list.begin()+cl->rdma_conn->send_pos);
cl->outbox.erase(cl->outbox.begin(), cl->outbox.begin()+cl->rdma_conn->send_pos);
cl->rdma_conn->send_pos = 0;
}
if (cl->rdma_conn->send_buf_pos > 0)
{
cl->send_list[0].iov_base += cl->rdma_conn->send_buf_pos;
cl->send_list[0].iov_len -= cl->rdma_conn->send_buf_pos;
cl->rdma_conn->send_buf_pos = 0;
}
try_send_rdma(cl);
}
}
}
} while (event_count > 0);
for (auto cb: set_immediate)
{
cb();
}
set_immediate.clear();
}

58
src/msgr_rdma.h Normal file
View File

@@ -0,0 +1,58 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include <infiniband/verbs.h>
#include <string>
#include <vector>
struct msgr_rdma_address_t
{
ibv_gid gid;
uint16_t lid;
uint32_t qpn;
uint32_t psn;
std::string to_string();
static bool from_string(const char *str, msgr_rdma_address_t *dest);
};
struct msgr_rdma_context_t
{
ibv_context *context = NULL;
ibv_device *dev = NULL;
ibv_device_attr_ex attrx;
ibv_pd *pd = NULL;
ibv_mr *mr = NULL;
ibv_comp_channel *channel = NULL;
ibv_cq *cq = NULL;
ibv_port_attr portinfo;
uint8_t ib_port;
uint8_t gid_index;
uint16_t my_lid;
ibv_gid my_gid;
uint32_t mtu;
int max_cqe = 0;
int used_max_cqe = 0;
static msgr_rdma_context_t *create(const char *ib_devname, uint8_t ib_port, uint8_t gid_index, uint32_t mtu);
~msgr_rdma_context_t();
};
struct msgr_rdma_connection_t
{
msgr_rdma_context_t *ctx = NULL;
ibv_qp *qp = NULL;
msgr_rdma_address_t addr;
int max_send = 0, max_recv = 0, max_sge = 0;
int cur_send = 0, cur_recv = 0;
uint64_t max_msg = 0;
int send_pos = 0, send_buf_pos = 0;
int recv_pos = 0, recv_buf_pos = 0;
std::vector<void*> recv_buffers;
~msgr_rdma_connection_t();
static msgr_rdma_connection_t *create(msgr_rdma_context_t *ctx, uint32_t max_send, uint32_t max_recv, uint32_t max_sge, uint32_t max_msg);
int connect(msgr_rdma_address_t *dest);
};

View File

@@ -72,7 +72,7 @@ bool osd_messenger_t::handle_read(int result, osd_client_t *cl)
// this is a client socket, so don't panic on error. just disconnect it
if (result != 0)
{
printf("Client %d socket read error: %d (%s). Disconnecting client\n", cl->peer_fd, -result, strerror(-result));
fprintf(stderr, "Client %d socket read error: %d (%s). Disconnecting client\n", cl->peer_fd, -result, strerror(-result));
}
stop_client(cl->peer_fd);
return false;
@@ -91,48 +91,9 @@ bool osd_messenger_t::handle_read(int result, osd_client_t *cl)
{
if (cl->read_iov.iov_base == cl->in_buf)
{
// Compose operation(s) from the buffer
int remain = result;
void *curbuf = cl->in_buf;
while (remain > 0)
if (!handle_read_buffer(cl, cl->in_buf, result))
{
if (!cl->read_op)
{
cl->read_op = new osd_op_t;
cl->read_op->peer_fd = cl->peer_fd;
cl->read_op->op_type = OSD_OP_IN;
cl->recv_list.push_back(cl->read_op->req.buf, OSD_PACKET_SIZE);
cl->read_remaining = OSD_PACKET_SIZE;
cl->read_state = CL_READ_HDR;
}
while (cl->recv_list.done < cl->recv_list.count && remain > 0)
{
iovec* cur = cl->recv_list.get_iovec();
if (cur->iov_len > remain)
{
memcpy(cur->iov_base, curbuf, remain);
cl->read_remaining -= remain;
cur->iov_len -= remain;
cur->iov_base += remain;
remain = 0;
}
else
{
memcpy(cur->iov_base, curbuf, cur->iov_len);
curbuf += cur->iov_len;
cl->read_remaining -= cur->iov_len;
remain -= cur->iov_len;
cur->iov_len = 0;
cl->recv_list.done++;
}
}
if (cl->recv_list.done >= cl->recv_list.count)
{
if (!handle_finished_read(cl))
{
goto fin;
}
}
goto fin;
}
}
else
@@ -159,6 +120,52 @@ fin:
return ret;
}
bool osd_messenger_t::handle_read_buffer(osd_client_t *cl, void *curbuf, int remain)
{
// Compose operation(s) from the buffer
while (remain > 0)
{
if (!cl->read_op)
{
cl->read_op = new osd_op_t;
cl->read_op->peer_fd = cl->peer_fd;
cl->read_op->op_type = OSD_OP_IN;
cl->recv_list.push_back(cl->read_op->req.buf, OSD_PACKET_SIZE);
cl->read_remaining = OSD_PACKET_SIZE;
cl->read_state = CL_READ_HDR;
}
while (cl->recv_list.done < cl->recv_list.count && remain > 0)
{
iovec* cur = cl->recv_list.get_iovec();
if (cur->iov_len > remain)
{
memcpy(cur->iov_base, curbuf, remain);
cl->read_remaining -= remain;
cur->iov_len -= remain;
cur->iov_base += remain;
remain = 0;
}
else
{
memcpy(cur->iov_base, curbuf, cur->iov_len);
curbuf += cur->iov_len;
cl->read_remaining -= cur->iov_len;
remain -= cur->iov_len;
cur->iov_len = 0;
cl->recv_list.done++;
}
}
if (cl->recv_list.done >= cl->recv_list.count)
{
if (!handle_finished_read(cl))
{
return false;
}
}
}
return true;
}
bool osd_messenger_t::handle_finished_read(osd_client_t *cl)
{
cl->recv_list.reset();
@@ -170,7 +177,7 @@ bool osd_messenger_t::handle_finished_read(osd_client_t *cl)
handle_op_hdr(cl);
else
{
printf("Received garbage: magic=%lx id=%lu opcode=%lx from %d\n", cl->read_op->req.hdr.magic, cl->read_op->req.hdr.id, cl->read_op->req.hdr.opcode, cl->peer_fd);
fprintf(stderr, "Received garbage: magic=%lx id=%lu opcode=%lx from %d\n", cl->read_op->req.hdr.magic, cl->read_op->req.hdr.id, cl->read_op->req.hdr.opcode, cl->peer_fd);
stop_client(cl->peer_fd);
return false;
}
@@ -202,24 +209,45 @@ void osd_messenger_t::handle_op_hdr(osd_client_t *cl)
osd_op_t *cur_op = cl->read_op;
if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ)
{
if (cur_op->req.sec_rw.len > 0)
cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_rw.len);
cl->read_remaining = 0;
}
else if (cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
{
if (cur_op->req.sec_rw.attr_len > 0)
{
if (cur_op->req.sec_rw.attr_len > sizeof(unsigned))
cur_op->bitmap = cur_op->rmw_buf = malloc_or_die(cur_op->req.sec_rw.attr_len);
else
cur_op->bitmap = &cur_op->bmp_data;
cl->recv_list.push_back(cur_op->bitmap, cur_op->req.sec_rw.attr_len);
}
if (cur_op->req.sec_rw.len > 0)
{
cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_rw.len);
cl->read_remaining = cur_op->req.sec_rw.len;
cl->recv_list.push_back(cur_op->buf, cur_op->req.sec_rw.len);
}
cl->read_remaining = cur_op->req.sec_rw.len + cur_op->req.sec_rw.attr_len;
}
else if (cur_op->req.hdr.opcode == OSD_OP_SEC_STABILIZE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK)
{
if (cur_op->req.sec_stab.len > 0)
{
cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_stab.len);
cl->recv_list.push_back(cur_op->buf, cur_op->req.sec_stab.len);
}
cl->read_remaining = cur_op->req.sec_stab.len;
}
else if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
{
if (cur_op->req.sec_read_bmp.len > 0)
{
cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_read_bmp.len);
cl->recv_list.push_back(cur_op->buf, cur_op->req.sec_read_bmp.len);
}
cl->read_remaining = cur_op->req.sec_read_bmp.len;
}
else if (cur_op->req.hdr.opcode == OSD_OP_READ)
{
cl->read_remaining = 0;
@@ -227,13 +255,25 @@ void osd_messenger_t::handle_op_hdr(osd_client_t *cl)
else if (cur_op->req.hdr.opcode == OSD_OP_WRITE)
{
if (cur_op->req.rw.len > 0)
{
cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.rw.len);
cl->recv_list.push_back(cur_op->buf, cur_op->req.rw.len);
}
cl->read_remaining = cur_op->req.rw.len;
}
else if (cur_op->req.hdr.opcode == OSD_OP_SHOW_CONFIG)
{
if (cur_op->req.show_conf.json_len > 0)
{
cur_op->buf = malloc_or_die(cur_op->req.show_conf.json_len+1);
((uint8_t*)cur_op->buf)[cur_op->req.show_conf.json_len] = 0;
cl->recv_list.push_back(cur_op->buf, cur_op->req.show_conf.json_len);
}
cl->read_remaining = cur_op->req.show_conf.json_len;
}
if (cl->read_remaining > 0)
{
// Read data
cl->recv_list.push_back(cur_op->buf, cl->read_remaining);
cl->read_state = CL_READ_DATA;
}
else
@@ -252,31 +292,45 @@ bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
if (req_it == cl->sent_ops.end())
{
// Command out of sync. Drop connection
printf("Client %d command out of sync: id %lu\n", cl->peer_fd, cl->read_op->req.hdr.id);
fprintf(stderr, "Client %d command out of sync: id %lu\n", cl->peer_fd, cl->read_op->req.hdr.id);
stop_client(cl->peer_fd);
return false;
}
osd_op_t *op = req_it->second;
memcpy(op->reply.buf, cl->read_op->req.buf, OSD_PACKET_SIZE);
cl->sent_ops.erase(req_it);
if ((op->reply.hdr.opcode == OSD_OP_SEC_READ || op->reply.hdr.opcode == OSD_OP_READ) &&
op->reply.hdr.retval > 0)
if (op->reply.hdr.opcode == OSD_OP_SEC_READ || op->reply.hdr.opcode == OSD_OP_READ)
{
// Read data. In this case we assume that the buffer is preallocated by the caller (!)
assert(op->iov.count > 0);
if (op->reply.hdr.retval != (op->reply.hdr.opcode == OSD_OP_SEC_READ ? op->req.sec_rw.len : op->req.rw.len))
unsigned bmp_len = (op->reply.hdr.opcode == OSD_OP_SEC_READ ? op->reply.sec_rw.attr_len : op->reply.rw.bitmap_len);
unsigned expected_size = (op->reply.hdr.opcode == OSD_OP_SEC_READ ? op->req.sec_rw.len : op->req.rw.len);
if (op->reply.hdr.retval >= 0 && (op->reply.hdr.retval != expected_size || bmp_len > op->bitmap_len))
{
// Check reply length to not overflow the buffer
printf("Client %d read reply of different length\n", cl->peer_fd);
fprintf(stderr, "Client %d read reply of different length: expected %u+%u, got %ld+%u\n",
cl->peer_fd, expected_size, op->bitmap_len, op->reply.hdr.retval, bmp_len);
cl->sent_ops[op->req.hdr.id] = op;
stop_client(cl->peer_fd);
return false;
}
cl->recv_list.append(op->iov);
if (op->reply.hdr.retval >= 0 && bmp_len > 0)
{
assert(op->bitmap);
cl->recv_list.push_back(op->bitmap, bmp_len);
}
if (op->reply.hdr.retval > 0)
{
assert(op->iov.count > 0);
cl->recv_list.append(op->iov);
}
cl->read_remaining = op->reply.hdr.retval + bmp_len;
if (cl->read_remaining == 0)
{
goto reuse;
}
delete cl->read_op;
cl->read_op = op;
cl->read_state = CL_READ_REPLY_DATA;
cl->read_remaining = op->reply.hdr.retval;
}
else if (op->reply.hdr.opcode == OSD_OP_SEC_LIST && op->reply.hdr.retval > 0)
{
@@ -288,18 +342,30 @@ bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
op->buf = memalign_or_die(MEM_ALIGNMENT, cl->read_remaining);
cl->recv_list.push_back(op->buf, cl->read_remaining);
}
else if (op->reply.hdr.opcode == OSD_OP_SHOW_CONFIG && op->reply.hdr.retval > 0)
else if (op->reply.hdr.opcode == OSD_OP_SEC_READ_BMP && op->reply.hdr.retval > 0)
{
assert(!op->iov.count);
delete cl->read_op;
cl->read_op = op;
cl->read_state = CL_READ_REPLY_DATA;
cl->read_remaining = op->reply.hdr.retval;
free(op->buf);
op->buf = memalign_or_die(MEM_ALIGNMENT, cl->read_remaining);
cl->recv_list.push_back(op->buf, cl->read_remaining);
}
else if (op->reply.hdr.opcode == OSD_OP_SHOW_CONFIG && op->reply.hdr.retval > 0)
{
delete cl->read_op;
cl->read_op = op;
cl->read_state = CL_READ_REPLY_DATA;
cl->read_remaining = op->reply.hdr.retval;
free(op->buf);
op->buf = malloc_or_die(op->reply.hdr.retval);
cl->recv_list.push_back(op->buf, op->reply.hdr.retval);
}
else
{
reuse:
// It's fine to reuse cl->read_op for the next reply
handle_reply_ready(op);
cl->recv_list.push_back(cl->read_op->req.buf, OSD_PACKET_SIZE);

View File

@@ -46,7 +46,28 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
to_send_list.push_back((iovec){ .iov_base = cur_op->req.buf, .iov_len = OSD_PACKET_SIZE });
cl->sent_ops[cur_op->req.hdr.id] = cur_op;
}
to_outbox.push_back(NULL);
to_outbox.push_back((msgr_sendp_t){ .op = cur_op, .flags = MSGR_SENDP_HDR });
// Bitmap
if (cur_op->op_type == OSD_OP_IN &&
cur_op->req.hdr.opcode == OSD_OP_SEC_READ &&
cur_op->reply.sec_rw.attr_len > 0)
{
to_send_list.push_back((iovec){
.iov_base = cur_op->bitmap,
.iov_len = cur_op->reply.sec_rw.attr_len,
});
to_outbox.push_back((msgr_sendp_t){ .op = cur_op, .flags = 0 });
}
else if (cur_op->op_type == OSD_OP_OUT &&
(cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE || cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) &&
cur_op->req.sec_rw.attr_len > 0)
{
to_send_list.push_back((iovec){
.iov_base = cur_op->bitmap,
.iov_len = cur_op->req.sec_rw.attr_len,
});
to_outbox.push_back((msgr_sendp_t){ .op = cur_op, .flags = 0 });
}
// Operation data
if ((cur_op->op_type == OSD_OP_IN
? (cur_op->req.hdr.opcode == OSD_OP_READ ||
@@ -57,20 +78,35 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_STABILIZE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK)) && cur_op->iov.count > 0)
cur_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK ||
cur_op->req.hdr.opcode == OSD_OP_SHOW_CONFIG)) && cur_op->iov.count > 0)
{
for (int i = 0; i < cur_op->iov.count; i++)
{
assert(cur_op->iov.buf[i].iov_base);
to_send_list.push_back(cur_op->iov.buf[i]);
to_outbox.push_back(NULL);
to_outbox.push_back((msgr_sendp_t){ .op = cur_op, .flags = 0 });
}
}
if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
{
if (cur_op->op_type == OSD_OP_IN && cur_op->reply.hdr.retval > 0)
to_send_list.push_back((iovec){ .iov_base = cur_op->buf, .iov_len = (size_t)cur_op->reply.hdr.retval });
else if (cur_op->op_type == OSD_OP_OUT && cur_op->req.sec_read_bmp.len > 0)
to_send_list.push_back((iovec){ .iov_base = cur_op->buf, .iov_len = (size_t)cur_op->req.sec_read_bmp.len });
to_outbox.push_back((msgr_sendp_t){ .op = cur_op, .flags = 0 });
}
if (cur_op->op_type == OSD_OP_IN)
{
// To free it later
to_outbox[to_outbox.size()-1] = cur_op;
to_outbox[to_outbox.size()-1].flags |= MSGR_SENDP_FREE;
}
#ifdef WITH_RDMA
if (cl->peer_state == PEER_RDMA)
{
try_send_rdma(cl);
return;
}
#endif
if (!ringloop)
{
// FIXME: It's worse because it doesn't allow batching
@@ -97,8 +133,10 @@ void osd_messenger_t::measure_exec(osd_op_t *cur_op)
{
return;
}
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
if (!cur_op->tv_end.tv_sec)
{
clock_gettime(CLOCK_REALTIME, &cur_op->tv_end);
}
stats.op_stat_count[cur_op->req.hdr.opcode]++;
if (!stats.op_stat_count[cur_op->req.hdr.opcode])
{
@@ -107,8 +145,8 @@ void osd_messenger_t::measure_exec(osd_op_t *cur_op)
stats.op_stat_bytes[cur_op->req.hdr.opcode] = 0;
}
stats.op_stat_sum[cur_op->req.hdr.opcode] += (
(tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
(cur_op->tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
(cur_op->tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
);
if (cur_op->req.hdr.opcode == OSD_OP_READ ||
cur_op->req.hdr.opcode == OSD_OP_WRITE)
@@ -189,7 +227,7 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
if (result < 0 && result != -EAGAIN)
{
// this is a client socket, so don't panic. just disconnect it
printf("Client %d socket write error: %d (%s). Disconnecting client\n", cl->peer_fd, -result, strerror(-result));
fprintf(stderr, "Client %d socket write error: %d (%s). Disconnecting client\n", cl->peer_fd, -result, strerror(-result));
stop_client(cl->peer_fd);
return;
}
@@ -201,10 +239,10 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
iovec & iov = cl->send_list[done];
if (iov.iov_len <= result)
{
if (cl->outbox[done])
if (cl->outbox[done].flags & MSGR_SENDP_FREE)
{
// Reply fully sent
delete cl->outbox[done];
delete cl->outbox[done].op;
}
result -= iov.iov_len;
done++;
@@ -229,6 +267,21 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
cl->next_outbox.clear();
}
cl->write_state = cl->outbox.size() > 0 ? CL_WRITE_READY : 0;
#ifdef WITH_RDMA
if (cl->rdma_conn && !cl->outbox.size() && cl->peer_state == PEER_RDMA_CONNECTING)
{
// FIXME: Do something better than just forgetting the FD
// FIXME: Ignore pings during RDMA state transition
if (log_level > 0)
{
fprintf(stderr, "Successfully connected with client %d using RDMA\n", cl->peer_fd);
}
cl->peer_state = PEER_RDMA;
tfd->set_fd_handler(cl->peer_fd, false, NULL);
// Add the initial receive request
try_recv_rdma(cl);
}
#endif
}
if (cl->write_state != 0)
{

View File

@@ -58,11 +58,11 @@ void osd_messenger_t::stop_client(int peer_fd, bool force)
{
if (cl->osd_num)
{
printf("[OSD %lu] Stopping client %d (OSD peer %lu)\n", osd_num, peer_fd, cl->osd_num);
fprintf(stderr, "[OSD %lu] Stopping client %d (OSD peer %lu)\n", osd_num, peer_fd, cl->osd_num);
}
else
{
printf("[OSD %lu] Stopping client %d (regular client)\n", osd_num, peer_fd);
fprintf(stderr, "[OSD %lu] Stopping client %d (regular client)\n", osd_num, peer_fd);
}
}
// First set state to STOPPED so another stop_client() call doesn't try to free it again
@@ -122,6 +122,12 @@ void osd_messenger_t::stop_client(int peer_fd, bool force)
// And close the FD only when everything is done
// ...because peer_fd number can get reused after close()
close(peer_fd);
#ifdef WITH_RDMA
if (cl->rdma_conn)
{
delete cl->rdma_conn;
}
#endif
#endif
// Find the item again because it can be invalidated at this point
it = clients.find(peer_fd);

View File

@@ -10,6 +10,7 @@
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <sys/un.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <fcntl.h>
#include <signal.h>
@@ -26,7 +27,10 @@ const char *exe_name = NULL;
class nbd_proxy
{
protected:
std::string image_name;
uint64_t inode = 0;
uint64_t device_size = 0;
inode_watch_t *watch = NULL;
ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL;
@@ -111,9 +115,9 @@ public:
{
printf(
"Vitastor NBD proxy\n"
"(c) Vitaliy Filippov, 2020 (VNPL-1.1)\n\n"
"(c) Vitaliy Filippov, 2020-2021 (VNPL-1.1)\n\n"
"USAGE:\n"
" %s map --etcd_address <etcd_address> --pool <pool> --inode <inode> --size <size in bytes>\n"
" %s map [--etcd_address <etcd_address>] (--image <image> | --pool <pool> --inode <inode> --size <size in bytes>)\n"
" %s unmap /dev/nbd0\n"
" %s list [--json]\n",
exe_name, exe_name, exe_name
@@ -143,26 +147,49 @@ public:
void start(json11::Json cfg)
{
// Check options
if (cfg["etcd_address"].string_value() == "")
if (cfg["image"].string_value() != "")
{
fprintf(stderr, "etcd_address is missing\n");
exit(1);
// Use image name
image_name = cfg["image"].string_value();
inode = 0;
}
if (!cfg["size"].uint64_value())
else
{
fprintf(stderr, "device size is missing\n");
exit(1);
// Use pool, inode number and size
if (!cfg["size"].uint64_value())
{
fprintf(stderr, "device size is missing\n");
exit(1);
}
device_size = cfg["size"].uint64_value();
inode = cfg["inode"].uint64_value();
uint64_t pool = cfg["pool"].uint64_value();
if (pool)
{
inode = (inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (pool << (64-POOL_ID_BITS));
}
if (!(inode >> (64-POOL_ID_BITS)))
{
fprintf(stderr, "pool is missing\n");
exit(1);
}
}
inode = cfg["inode"].uint64_value();
uint64_t pool = cfg["pool"].uint64_value();
if (pool)
// Create client
ringloop = new ring_loop_t(512);
epmgr = new epoll_manager_t(ringloop);
cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
if (!inode)
{
inode = (inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (pool << (64-POOL_ID_BITS));
}
if (!(inode >> (64-POOL_ID_BITS)))
{
fprintf(stderr, "pool is missing\n");
exit(1);
// Load image metadata
while (!cli->is_ready())
{
ringloop->loop();
if (cli->is_ready())
break;
ringloop->wait();
}
watch = cli->st_cli.watch_inode(image_name);
device_size = watch->cfg.size;
}
// Initialize NBD
int sockfd[2];
@@ -174,9 +201,10 @@ public:
fcntl(sockfd[0], F_SETFL, fcntl(sockfd[0], F_GETFL, 0) | O_NONBLOCK);
nbd_fd = sockfd[0];
load_module();
bool bg = cfg["foreground"].is_null();
if (!cfg["dev_num"].is_null())
{
if (run_nbd(sockfd, cfg["dev_num"].int64_value(), cfg["size"].uint64_value(), NBD_FLAG_SEND_FLUSH, 30) < 0)
if (run_nbd(sockfd, cfg["dev_num"].int64_value(), device_size, NBD_FLAG_SEND_FLUSH, 30, bg) < 0)
{
perror("run_nbd");
exit(1);
@@ -188,7 +216,7 @@ public:
int i = 0;
while (true)
{
int r = run_nbd(sockfd, i, cfg["size"].uint64_value(), NBD_FLAG_SEND_FLUSH, 30);
int r = run_nbd(sockfd, i, device_size, NBD_FLAG_SEND_FLUSH, 30, bg);
if (r == 0)
{
printf("/dev/nbd%d\n", i);
@@ -211,14 +239,10 @@ public:
}
}
}
if (cfg["foreground"].is_null())
if (bg)
{
daemonize();
}
// Create client
ringloop = new ring_loop_t(512);
epmgr = new epoll_manager_t(ringloop);
cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
// Initialize read state
read_state = CL_READ_HDR;
recv_buf = malloc_or_die(receive_buffer_size);
@@ -232,21 +256,47 @@ public:
};
ringloop->register_consumer(&consumer);
// Add FD to epoll
epmgr->tfd->set_fd_handler(sockfd[0], false, [this](int peer_fd, int epoll_events)
bool stop = false;
epmgr->tfd->set_fd_handler(sockfd[0], false, [this, &stop](int peer_fd, int epoll_events)
{
read_ready++;
submit_read();
if (epoll_events & EPOLLRDHUP)
{
close(peer_fd);
stop = true;
}
else
{
read_ready++;
submit_read();
}
});
while (1)
while (!stop)
{
ringloop->loop();
ringloop->wait();
}
stop = false;
cluster_op_t *close_sync = new cluster_op_t;
close_sync->opcode = OSD_OP_SYNC;
close_sync->callback = [this, &stop](cluster_op_t *op)
{
stop = true;
delete op;
};
cli->execute(close_sync);
while (!stop)
{
ringloop->loop();
ringloop->wait();
}
delete cli;
delete epmgr;
delete ringloop;
}
void load_module()
{
if (access("/sys/module/nbd", F_OK))
if (access("/sys/module/nbd", F_OK) == 0)
{
return;
}
@@ -388,7 +438,7 @@ public:
}
protected:
int run_nbd(int sockfd[2], int dev_num, uint64_t size, uint64_t flags, unsigned timeout)
int run_nbd(int sockfd[2], int dev_num, uint64_t size, uint64_t flags, unsigned timeout, bool bg)
{
// Check handle size
assert(sizeof(cur_req.handle) == 8);
@@ -436,11 +486,14 @@ protected:
{
// Run in child
close(sockfd[0]);
if (bg)
{
daemonize();
}
r = ioctl(nbd, NBD_DO_IT);
if (r < 0)
{
fprintf(stderr, "NBD device terminated with error: %s\n", strerror(errno));
kill(getppid(), SIGTERM);
}
close(sockfd[1]);
ioctl(nbd, NBD_CLEAR_QUE);
@@ -610,7 +663,7 @@ protected:
if (req_type == NBD_CMD_READ || req_type == NBD_CMD_WRITE)
{
op->opcode = req_type == NBD_CMD_READ ? OSD_OP_READ : OSD_OP_WRITE;
op->inode = inode;
op->inode = inode ? inode : watch->cfg.num;
op->offset = be64toh(cur_req.from);
op->len = be32toh(cur_req.len);
buf = malloc_or_die(sizeof(nbd_reply) + op->len);
@@ -657,7 +710,15 @@ protected:
}
else
{
cli->execute(cur_op);
if (cur_op->opcode == OSD_OP_WRITE && watch->cfg.readonly)
{
cur_op->retval = -EROFS;
std::function<void(cluster_op_t*)>(cur_op->callback)(cur_op);
}
else
{
cli->execute(cur_op);
}
cur_op = NULL;
cur_buf = &cur_req;
cur_left = sizeof(nbd_request);

View File

@@ -6,12 +6,14 @@
#include <stdint.h>
#include <functional>
typedef uint64_t inode_t;
// 16 bytes per object/stripe id
// stripe = (start of the parity stripe + peer role)
// i.e. for example (256KB + one of 0,1,2)
struct __attribute__((__packed__)) object_id
{
uint64_t inode;
inode_t inode;
uint64_t stripe;
};

View File

@@ -10,24 +10,40 @@
#include "osd.h"
#include "http_client.h"
osd_t::osd_t(blockstore_config_t & config, ring_loop_t *ringloop)
static blockstore_config_t json_to_bs(const json11::Json::object & config)
{
config["entry_attr_size"] = "0";
blockstore_config_t bs;
for (auto kv: config)
{
if (kv.second.is_string())
bs[kv.first] = kv.second.string_value();
else
bs[kv.first] = kv.second.dump();
}
return bs;
}
osd_t::osd_t(const json11::Json & config, ring_loop_t *ringloop)
{
zero_buffer_size = 1<<20;
zero_buffer = malloc_or_die(zero_buffer_size);
memset(zero_buffer, 0, zero_buffer_size);
this->config = config;
this->ringloop = ringloop;
// FIXME: Create Blockstore from on-disk superblock config and check it against the OSD cluster config
this->bs = new blockstore_t(config, ringloop);
this->bs_block_size = bs->get_block_size();
this->bs_bitmap_granularity = bs->get_bitmap_granularity();
parse_config(config);
this->config = msgr.read_config(config).object_items();
if (this->config.find("log_level") == this->config.end())
this->config["log_level"] = 1;
parse_config(this->config);
epmgr = new epoll_manager_t(ringloop);
// FIXME: Use timerfd_interval based directly on io_uring
this->tfd = epmgr->tfd;
// FIXME: Create Blockstore from on-disk superblock config and check it against the OSD cluster config
auto bs_cfg = json_to_bs(this->config);
this->bs = new blockstore_t(bs_cfg, ringloop, tfd);
this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id)
{
print_stats();
@@ -37,11 +53,11 @@ osd_t::osd_t(blockstore_config_t & config, ring_loop_t *ringloop)
print_slow();
});
c_cli.tfd = this->tfd;
c_cli.ringloop = this->ringloop;
c_cli.exec_op = [this](osd_op_t *op) { exec_op(op); };
c_cli.repeer_pgs = [this](osd_num_t peer_osd) { repeer_pgs(peer_osd); };
c_cli.init();
msgr.tfd = this->tfd;
msgr.ringloop = this->ringloop;
msgr.exec_op = [this](osd_op_t *op) { exec_op(op); };
msgr.repeer_pgs = [this](osd_num_t peer_osd) { repeer_pgs(peer_osd); };
msgr.init();
init_cluster();
@@ -55,64 +71,74 @@ osd_t::~osd_t()
delete epmgr;
delete bs;
close(listen_fd);
free(zero_buffer);
}
void osd_t::parse_config(blockstore_config_t & config)
void osd_t::parse_config(const json11::Json & config)
{
if (config.find("log_level") == config.end())
config["log_level"] = "1";
log_level = strtoull(config["log_level"].c_str(), NULL, 10);
// Initial startup configuration
json11::Json json_config = json11::Json(config);
st_cli.parse_config(json_config);
etcd_report_interval = strtoull(config["etcd_report_interval"].c_str(), NULL, 10);
if (etcd_report_interval <= 0)
etcd_report_interval = 30;
osd_num = strtoull(config["osd_num"].c_str(), NULL, 10);
st_cli.parse_config(config);
msgr.parse_config(config);
// OSD number
osd_num = config["osd_num"].uint64_value();
if (!osd_num)
throw std::runtime_error("osd_num is required in the configuration");
c_cli.osd_num = osd_num;
msgr.osd_num = osd_num;
// Vital Blockstore parameters
bs_block_size = config["block_size"].uint64_value();
if (!bs_block_size)
bs_block_size = DEFAULT_BLOCK_SIZE;
bs_bitmap_granularity = config["bitmap_granularity"].uint64_value();
if (!bs_bitmap_granularity)
bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
clean_entry_bitmap_size = bs_block_size / bs_bitmap_granularity / 8;
// Bind address
bind_address = config["bind_address"].string_value();
if (bind_address == "")
bind_address = "0.0.0.0";
bind_port = config["bind_port"].uint64_value();
if (bind_port <= 0 || bind_port > 65535)
bind_port = 0;
// OSD configuration
log_level = config["log_level"].uint64_value();
etcd_report_interval = config["etcd_report_interval"].uint64_value();
if (etcd_report_interval <= 0)
etcd_report_interval = 30;
readonly = config["readonly"] == "true" || config["readonly"] == "1" || config["readonly"] == "yes";
run_primary = config["run_primary"] != "false" && config["run_primary"] != "0" && config["run_primary"] != "no";
no_rebalance = config["no_rebalance"] == "true" || config["no_rebalance"] == "1" || config["no_rebalance"] == "yes";
no_recovery = config["no_recovery"] == "true" || config["no_recovery"] == "1" || config["no_recovery"] == "yes";
// Cluster configuration
bind_address = config["bind_address"];
if (bind_address == "")
bind_address = "0.0.0.0";
bind_port = stoull_full(config["bind_port"]);
if (bind_port <= 0 || bind_port > 65535)
bind_port = 0;
allow_test_ops = config["allow_test_ops"] == "true" || config["allow_test_ops"] == "1" || config["allow_test_ops"] == "yes";
if (config["immediate_commit"] == "all")
immediate_commit = IMMEDIATE_ALL;
else if (config["immediate_commit"] == "small")
immediate_commit = IMMEDIATE_SMALL;
if (config.find("autosync_interval") != config.end())
else
immediate_commit = IMMEDIATE_NONE;
if (!config["autosync_interval"].is_null())
{
autosync_interval = strtoull(config["autosync_interval"].c_str(), NULL, 10);
// Allow to set it to 0
autosync_interval = config["autosync_interval"].uint64_value();
if (autosync_interval > MAX_AUTOSYNC_INTERVAL)
autosync_interval = DEFAULT_AUTOSYNC_INTERVAL;
}
if (config.find("client_queue_depth") != config.end())
if (!config["client_queue_depth"].is_null())
{
client_queue_depth = strtoull(config["client_queue_depth"].c_str(), NULL, 10);
client_queue_depth = config["client_queue_depth"].uint64_value();
if (client_queue_depth < 128)
client_queue_depth = 128;
}
recovery_queue_depth = strtoull(config["recovery_queue_depth"].c_str(), NULL, 10);
recovery_queue_depth = config["recovery_queue_depth"].uint64_value();
if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE)
recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
recovery_sync_batch = strtoull(config["recovery_sync_batch"].c_str(), NULL, 10);
recovery_sync_batch = config["recovery_sync_batch"].uint64_value();
if (recovery_sync_batch < 1 || recovery_sync_batch > MAX_RECOVERY_QUEUE)
recovery_sync_batch = DEFAULT_RECOVERY_BATCH;
if (config["readonly"] == "true" || config["readonly"] == "1" || config["readonly"] == "yes")
readonly = true;
print_stats_interval = strtoull(config["print_stats_interval"].c_str(), NULL, 10);
print_stats_interval = config["print_stats_interval"].uint64_value();
if (!print_stats_interval)
print_stats_interval = 3;
slow_log_interval = strtoull(config["slow_log_interval"].c_str(), NULL, 10);
slow_log_interval = config["slow_log_interval"].uint64_value();
if (!slow_log_interval)
slow_log_interval = 10;
c_cli.parse_config(json_config);
}
void osd_t::bind_socket()
@@ -165,7 +191,7 @@ void osd_t::bind_socket()
epmgr->set_fd_handler(listen_fd, false, [this](int fd, int events)
{
c_cli.accept_connections(listen_fd);
msgr.accept_connections(listen_fd);
});
}
@@ -182,8 +208,8 @@ bool osd_t::shutdown()
void osd_t::loop()
{
handle_peers();
c_cli.read_requests();
c_cli.send_replies();
msgr.read_requests();
msgr.send_replies();
ringloop->submit();
}
@@ -228,6 +254,7 @@ void osd_t::exec_op(osd_op_t *cur_op)
cur_op->req.hdr.opcode != OSD_OP_SEC_READ &&
cur_op->req.hdr.opcode != OSD_OP_SEC_LIST &&
cur_op->req.hdr.opcode != OSD_OP_READ &&
cur_op->req.hdr.opcode != OSD_OP_SEC_READ_BMP &&
cur_op->req.hdr.opcode != OSD_OP_SHOW_CONFIG)
{
// Readonly mode
@@ -266,7 +293,7 @@ void osd_t::exec_op(osd_op_t *cur_op)
void osd_t::reset_stats()
{
c_cli.stats = { 0 };
msgr.stats = { 0 };
prev_stats = { 0 };
memset(recovery_stat_count, 0, sizeof(recovery_stat_count));
memset(recovery_stat_bytes, 0, sizeof(recovery_stat_bytes));
@@ -276,11 +303,11 @@ void osd_t::print_stats()
{
for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
{
if (c_cli.stats.op_stat_count[i] != prev_stats.op_stat_count[i] && i != OSD_OP_PING)
if (msgr.stats.op_stat_count[i] != prev_stats.op_stat_count[i] && i != OSD_OP_PING)
{
uint64_t avg = (c_cli.stats.op_stat_sum[i] - prev_stats.op_stat_sum[i])/(c_cli.stats.op_stat_count[i] - prev_stats.op_stat_count[i]);
uint64_t bw = (c_cli.stats.op_stat_bytes[i] - prev_stats.op_stat_bytes[i]) / print_stats_interval;
if (c_cli.stats.op_stat_bytes[i] != 0)
uint64_t avg = (msgr.stats.op_stat_sum[i] - prev_stats.op_stat_sum[i])/(msgr.stats.op_stat_count[i] - prev_stats.op_stat_count[i]);
uint64_t bw = (msgr.stats.op_stat_bytes[i] - prev_stats.op_stat_bytes[i]) / print_stats_interval;
if (msgr.stats.op_stat_bytes[i] != 0)
{
printf(
"[OSD %lu] avg latency for op %d (%s): %lu us, B/W: %.2f %s\n", osd_num, i, osd_op_names[i], avg,
@@ -292,19 +319,19 @@ void osd_t::print_stats()
{
printf("[OSD %lu] avg latency for op %d (%s): %lu us\n", osd_num, i, osd_op_names[i], avg);
}
prev_stats.op_stat_count[i] = c_cli.stats.op_stat_count[i];
prev_stats.op_stat_sum[i] = c_cli.stats.op_stat_sum[i];
prev_stats.op_stat_bytes[i] = c_cli.stats.op_stat_bytes[i];
prev_stats.op_stat_count[i] = msgr.stats.op_stat_count[i];
prev_stats.op_stat_sum[i] = msgr.stats.op_stat_sum[i];
prev_stats.op_stat_bytes[i] = msgr.stats.op_stat_bytes[i];
}
}
for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
{
if (c_cli.stats.subop_stat_count[i] != prev_stats.subop_stat_count[i])
if (msgr.stats.subop_stat_count[i] != prev_stats.subop_stat_count[i])
{
uint64_t avg = (c_cli.stats.subop_stat_sum[i] - prev_stats.subop_stat_sum[i])/(c_cli.stats.subop_stat_count[i] - prev_stats.subop_stat_count[i]);
uint64_t avg = (msgr.stats.subop_stat_sum[i] - prev_stats.subop_stat_sum[i])/(msgr.stats.subop_stat_count[i] - prev_stats.subop_stat_count[i]);
printf("[OSD %lu] avg latency for subop %d (%s): %ld us\n", osd_num, i, osd_op_names[i], avg);
prev_stats.subop_stat_count[i] = c_cli.stats.subop_stat_count[i];
prev_stats.subop_stat_sum[i] = c_cli.stats.subop_stat_sum[i];
prev_stats.subop_stat_count[i] = msgr.stats.subop_stat_count[i];
prev_stats.subop_stat_sum[i] = msgr.stats.subop_stat_sum[i];
}
}
for (int i = 0; i < 2; i++)
@@ -341,7 +368,7 @@ void osd_t::print_slow()
char alloc[1024];
timespec now;
clock_gettime(CLOCK_REALTIME, &now);
for (auto & kv: c_cli.clients)
for (auto & kv: msgr.clients)
{
for (auto op: kv.second->received_ops)
{

View File

@@ -55,11 +55,44 @@ struct osd_recovery_op_t
osd_op_t *osd_op = NULL;
};
// Posted as /osd/inodestats/$osd, then accumulated by the monitor
#define INODE_STATS_READ 0
#define INODE_STATS_WRITE 1
#define INODE_STATS_DELETE 2
struct inode_stats_t
{
uint64_t op_sum[3] = { 0 };
uint64_t op_count[3] = { 0 };
uint64_t op_bytes[3] = { 0 };
};
struct bitmap_request_t
{
osd_num_t osd_num;
object_id oid;
uint64_t version;
void *bmp_buf;
};
inline bool operator < (const bitmap_request_t & a, const bitmap_request_t & b)
{
return a.osd_num < b.osd_num || a.osd_num == b.osd_num && a.oid < b.oid;
}
struct osd_chain_read_t
{
int chain_pos;
inode_t inode;
uint32_t offset, len;
};
struct osd_rmw_stripe_t;
class osd_t
{
// config
blockstore_config_t config;
json11::Json::object config;
int etcd_report_interval = 30;
bool readonly = false;
@@ -71,7 +104,7 @@ class osd_t
int bind_port, listen_backlog;
// FIXME: Implement client queue depth limit
int client_queue_depth = 128;
bool allow_test_ops = true;
bool allow_test_ops = false;
int print_stats_interval = 3;
int slow_log_interval = 10;
int immediate_commit = IMMEDIATE_NONE;
@@ -83,7 +116,7 @@ class osd_t
// cluster state
etcd_state_client_t st_cli;
osd_messenger_t c_cli;
osd_messenger_t msgr;
int etcd_failed_attempts = 0;
std::string etcd_lease_id;
json11::Json self_state;
@@ -115,7 +148,9 @@ class osd_t
bool stopping = false;
int inflight_ops = 0;
blockstore_t *bs;
uint32_t bs_block_size, bs_bitmap_granularity;
void *zero_buffer = NULL;
uint64_t zero_buffer_size = 0;
uint32_t bs_block_size, bs_bitmap_granularity, clean_entry_bitmap_size;
ring_loop_t *ringloop;
timerfd_manager_t *tfd = NULL;
epoll_manager_t *epmgr = NULL;
@@ -126,16 +161,17 @@ class osd_t
// op statistics
osd_op_stats_t prev_stats;
std::map<uint64_t, inode_stats_t> inode_stats;
const char* recovery_stat_names[2] = { "degraded", "misplaced" };
uint64_t recovery_stat_count[2][2] = { 0 };
uint64_t recovery_stat_bytes[2][2] = { 0 };
// cluster connection
void parse_config(blockstore_config_t & config);
void parse_config(const json11::Json & config);
void init_cluster();
void on_change_osd_state_hook(osd_num_t peer_osd);
void on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num);
void on_change_etcd_state_hook(json11::Json::object & changes);
void on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes);
void on_load_config_hook(json11::Json::object & changes);
json11::Json on_load_pgs_checks_hook();
void on_load_pgs_hook(bool success);
@@ -204,7 +240,10 @@ class osd_t
void handle_primary_bs_subop(osd_op_t *subop);
void add_bs_subop_stats(osd_op_t *subop);
void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval);
void submit_primary_subops(int submit_type, uint64_t op_version, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op);
void submit_primary_subops(int submit_type, uint64_t op_version, const uint64_t* osd_set, osd_op_t *cur_op);
int submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t op_version,
osd_rmw_stripe_t *stripes, const uint64_t* osd_set, osd_op_t *cur_op, int subop_idx, int zero_read);
void submit_primary_del_subops(osd_op_t *cur_op, uint64_t *cur_set, uint64_t set_size, pg_osd_set_t & loc_set);
void submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_delete, int chunks_to_delete_count);
int submit_primary_sync_subops(osd_op_t *cur_op);
@@ -212,16 +251,24 @@ class osd_t
uint64_t* get_object_osd_set(pg_t &pg, object_id &oid, uint64_t *def, pg_osd_set_state_t **object_state);
void continue_chained_read(osd_op_t *cur_op);
int submit_chained_read_requests(pg_t & pg, osd_op_t *cur_op);
void send_chained_read_results(pg_t & pg, osd_op_t *cur_op);
std::vector<osd_chain_read_t> collect_chained_read_requests(osd_op_t *cur_op);
int collect_bitmap_requests(osd_op_t *cur_op, pg_t & pg, std::vector<bitmap_request_t> & bitmap_requests);
int submit_bitmap_subops(osd_op_t *cur_op, pg_t & pg);
int read_bitmaps(osd_op_t *cur_op, pg_t & pg, int base_state);
inline pg_num_t map_to_pg(object_id oid, uint64_t pg_stripe_size)
{
uint64_t pg_count = pg_counts[INODE_POOL(oid.inode)];
if (!pg_count)
pg_count = 1;
return (oid.inode + oid.stripe / pg_stripe_size) % pg_count + 1;
return (oid.stripe / pg_stripe_size) % pg_count + 1;
}
public:
osd_t(blockstore_config_t & config, ring_loop_t *ringloop);
osd_t(const json11::Json & config, ring_loop_t *ringloop);
~osd_t();
void force_stop(int exitcode);
bool shutdown();

View File

@@ -21,7 +21,7 @@ void osd_t::init_cluster()
{
// Test version of clustering code with 1 pool, 1 PG and 2 peers
// Example: peers = 2:127.0.0.1:11204,3:127.0.0.1:11205
std::string peerstr = config["peers"];
std::string peerstr = config["peers"].string_value();
while (peerstr.size())
{
int pos = peerstr.find(',');
@@ -65,7 +65,7 @@ void osd_t::init_cluster()
st_cli.log_level = log_level;
st_cli.on_change_osd_state_hook = [this](osd_num_t peer_osd) { on_change_osd_state_hook(peer_osd); };
st_cli.on_change_pg_history_hook = [this](pool_id_t pool_id, pg_num_t pg_num) { on_change_pg_history_hook(pool_id, pg_num); };
st_cli.on_change_hook = [this](json11::Json::object & changes) { on_change_etcd_state_hook(changes); };
st_cli.on_change_hook = [this](std::map<std::string, etcd_kv_t> & changes) { on_change_etcd_state_hook(changes); };
st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); };
st_cli.load_pgs_checks_hook = [this]() { return on_load_pgs_checks_hook(); };
st_cli.on_load_pgs_hook = [this](bool success) { on_load_pgs_hook(success); };
@@ -104,7 +104,7 @@ void osd_t::parse_test_peer(std::string peer)
{ "addresses", json11::Json::array { addr } },
{ "port", port },
};
c_cli.connect_peer(peer_osd, st_cli.peer_states[peer_osd]);
msgr.connect_peer(peer_osd, st_cli.peer_states[peer_osd]);
}
json11::Json osd_t::get_osd_state()
@@ -146,16 +146,16 @@ json11::Json osd_t::get_statistics()
for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
{
op_stats[osd_op_names[i]] = json11::Json::object {
{ "count", c_cli.stats.op_stat_count[i] },
{ "usec", c_cli.stats.op_stat_sum[i] },
{ "bytes", c_cli.stats.op_stat_bytes[i] },
{ "count", msgr.stats.op_stat_count[i] },
{ "usec", msgr.stats.op_stat_sum[i] },
{ "bytes", msgr.stats.op_stat_bytes[i] },
};
}
for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
{
subop_stats[osd_op_names[i]] = json11::Json::object {
{ "count", c_cli.stats.subop_stat_count[i] },
{ "usec", c_cli.stats.subop_stat_sum[i] },
{ "count", msgr.stats.subop_stat_count[i] },
{ "usec", msgr.stats.subop_stat_sum[i] },
};
}
st["op_stats"] = op_stats;
@@ -180,12 +180,80 @@ void osd_t::report_statistics()
return;
}
etcd_reporting_stats = true;
json11::Json::array txn = { json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(st_cli.etcd_prefix+"/osd/stats/"+std::to_string(osd_num)) },
{ "value", base64_encode(get_statistics().dump()) },
} }
} };
// Report space usage statistics as a whole
// Maybe we'll report it using deltas if we tune for a lot of inodes at some point
json11::Json::object inode_space;
json11::Json::object last_stat;
pool_id_t last_pool = 0;
for (auto kv: bs->get_inode_space_stats())
{
pool_id_t pool_id = INODE_POOL(kv.first);
uint64_t only_inode_num = (kv.first & ((1l << (64-POOL_ID_BITS)) - 1));
if (!last_pool || pool_id != last_pool)
{
if (last_pool)
inode_space[std::to_string(last_pool)] = last_stat;
last_stat = json11::Json::object();
last_pool = pool_id;
}
last_stat[std::to_string(only_inode_num)] = kv.second;
}
if (last_pool)
inode_space[std::to_string(last_pool)] = last_stat;
last_stat = json11::Json::object();
last_pool = 0;
json11::Json::object inode_ops;
for (auto kv: inode_stats)
{
pool_id_t pool_id = INODE_POOL(kv.first);
uint64_t only_inode_num = (kv.first & ((1l << (64-POOL_ID_BITS)) - 1));
if (!last_pool || pool_id != last_pool)
{
if (last_pool)
inode_ops[std::to_string(last_pool)] = last_stat;
last_stat = json11::Json::object();
last_pool = pool_id;
}
last_stat[std::to_string(only_inode_num)] = json11::Json::object {
{ "read", json11::Json::object {
{ "count", kv.second.op_count[INODE_STATS_READ] },
{ "usec", kv.second.op_sum[INODE_STATS_READ] },
{ "bytes", kv.second.op_bytes[INODE_STATS_READ] },
} },
{ "write", json11::Json::object {
{ "count", kv.second.op_count[INODE_STATS_WRITE] },
{ "usec", kv.second.op_sum[INODE_STATS_WRITE] },
{ "bytes", kv.second.op_bytes[INODE_STATS_WRITE] },
} },
{ "delete", json11::Json::object {
{ "count", kv.second.op_count[INODE_STATS_DELETE] },
{ "usec", kv.second.op_sum[INODE_STATS_DELETE] },
{ "bytes", kv.second.op_bytes[INODE_STATS_DELETE] },
} },
};
}
if (last_pool)
inode_ops[std::to_string(last_pool)] = last_stat;
json11::Json::array txn = {
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(st_cli.etcd_prefix+"/osd/stats/"+std::to_string(osd_num)) },
{ "value", base64_encode(get_statistics().dump()) },
} },
},
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(st_cli.etcd_prefix+"/osd/space/"+std::to_string(osd_num)) },
{ "value", base64_encode(json11::Json(inode_space).dump()) },
} },
},
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(st_cli.etcd_prefix+"/osd/inodestats/"+std::to_string(osd_num)) },
{ "value", base64_encode(json11::Json(inode_ops).dump()) },
} },
},
};
for (auto & p: pgs)
{
auto & pg = p.second;
@@ -230,13 +298,13 @@ void osd_t::report_statistics()
void osd_t::on_change_osd_state_hook(osd_num_t peer_osd)
{
if (c_cli.wanted_peers.find(peer_osd) != c_cli.wanted_peers.end())
if (msgr.wanted_peers.find(peer_osd) != msgr.wanted_peers.end())
{
c_cli.connect_peer(peer_osd, st_cli.peer_states[peer_osd]);
msgr.connect_peer(peer_osd, st_cli.peer_states[peer_osd]);
}
}
void osd_t::on_change_etcd_state_hook(json11::Json::object & changes)
void osd_t::on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes)
{
// FIXME apply config changes in runtime (maybe, some)
if (run_primary)
@@ -272,21 +340,10 @@ void osd_t::on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num)
void osd_t::on_load_config_hook(json11::Json::object & global_config)
{
blockstore_config_t osd_config = this->config;
for (auto & cfg_var: global_config)
{
if (this->config.find(cfg_var.first) == this->config.end())
{
if (cfg_var.second.is_string())
{
osd_config[cfg_var.first] = cfg_var.second.string_value();
}
else
{
osd_config[cfg_var.first] = cfg_var.second.dump();
}
}
}
json11::Json::object osd_config = this->config;
for (auto & kv: global_config)
if (osd_config.find(kv.first) == osd_config.end())
osd_config[kv.first] = kv.second;
parse_config(osd_config);
bind_socket();
acquire_lease();
@@ -312,7 +369,7 @@ void osd_t::acquire_lease()
etcd_lease_id = data["ID"].string_value();
create_osd_state();
});
printf("[OSD %lu] reporting to etcd at %s every %d seconds\n", this->osd_num, config["etcd_address"].c_str(), etcd_report_interval);
printf("[OSD %lu] reporting to etcd at %s every %d seconds\n", this->osd_num, config["etcd_address"].string_value().c_str(), etcd_report_interval);
tfd->set_timer(etcd_report_interval*1000, true, [this](int timer_id)
{
renew_lease();
@@ -627,9 +684,9 @@ void osd_t::apply_pg_config()
// Add peers
for (auto pg_osd: all_peers)
{
if (pg_osd != this->osd_num && c_cli.osd_peer_fds.find(pg_osd) == c_cli.osd_peer_fds.end())
if (pg_osd != this->osd_num && msgr.osd_peer_fds.find(pg_osd) == msgr.osd_peer_fds.end())
{
c_cli.connect_peer(pg_osd, st_cli.peer_states[pg_osd]);
msgr.connect_peer(pg_osd, st_cli.peer_states[pg_osd]);
}
}
start_pg_peering(pg);

View File

@@ -82,10 +82,10 @@ void osd_t::handle_flush_op(bool rollback, pool_id_t pool_id, pg_num_t pg_num, p
else
{
printf("Error while doing flush on OSD %lu: %d (%s)\n", osd_num, retval, strerror(-retval));
auto fd_it = c_cli.osd_peer_fds.find(peer_osd);
if (fd_it != c_cli.osd_peer_fds.end())
auto fd_it = msgr.osd_peer_fds.find(peer_osd);
if (fd_it != msgr.osd_peer_fds.end())
{
c_cli.stop_client(fd_it->second);
msgr.stop_client(fd_it->second);
}
return;
}
@@ -188,7 +188,7 @@ void osd_t::submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t
else
{
// Peer
int peer_fd = c_cli.osd_peer_fds[peer_osd];
int peer_fd = msgr.osd_peer_fds[peer_osd];
op->op_type = OSD_OP_OUT;
op->iov.push_back(op->buf, count * sizeof(obj_ver_id));
op->peer_fd = peer_fd;
@@ -196,7 +196,7 @@ void osd_t::submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t
.sec_stab = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.id = msgr.next_subop_id++,
.opcode = (uint64_t)(rollback ? OSD_OP_SEC_ROLLBACK : OSD_OP_SEC_STABILIZE),
},
.len = count * sizeof(obj_ver_id),
@@ -207,7 +207,7 @@ void osd_t::submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t
handle_flush_op(op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK, pool_id, pg_num, fb, peer_osd, op->reply.hdr.retval);
delete op;
};
c_cli.outbox_push(op);
msgr.outbox_push(op);
}
}

View File

@@ -29,13 +29,13 @@ int main(int narg, char *args[])
perror("BUG: too small packet size");
return 1;
}
blockstore_config_t config;
json11::Json::object config;
for (int i = 1; i < narg; i++)
{
if (args[i][0] == '-' && args[i][1] == '-' && i < narg-1)
{
char *opt = args[i]+2;
config[opt] = args[++i];
config[std::string(opt)] = std::string(args[++i]);
}
}
signal(SIGINT, handle_sigint);

View File

@@ -20,4 +20,5 @@ const char* osd_op_names[] = {
"primary_sync",
"primary_delete",
"ping",
"sec_read_bmp",
};

View File

@@ -28,12 +28,14 @@
#define OSD_OP_SYNC 13
#define OSD_OP_DELETE 14
#define OSD_OP_PING 15
#define OSD_OP_MAX 15
#define OSD_OP_SEC_READ_BMP 16
#define OSD_OP_MAX 16
// Alignment & limit for read/write operations
#ifndef MEM_ALIGNMENT
#define MEM_ALIGNMENT 512
#endif
#define OSD_RW_MAX 64*1024*1024
#define OSD_PROTOCOL_VERSION 1
// common request and reply headers
struct __attribute__((__packed__)) osd_op_header_t
@@ -59,7 +61,7 @@ struct __attribute__((__packed__)) osd_reply_header_t
};
// read or write to the secondary OSD
struct __attribute__((__packed__)) osd_op_secondary_rw_t
struct __attribute__((__packed__)) osd_op_sec_rw_t
{
osd_op_header_t header;
// object
@@ -71,17 +73,23 @@ struct __attribute__((__packed__)) osd_op_secondary_rw_t
uint32_t offset;
// length
uint32_t len;
// bitmap/attribute length - bitmap comes after header, but before data
uint32_t attr_len;
uint32_t pad0;
};
struct __attribute__((__packed__)) osd_reply_secondary_rw_t
struct __attribute__((__packed__)) osd_reply_sec_rw_t
{
osd_reply_header_t header;
// for reads and writes: assigned or read version number
uint64_t version;
// for reads: bitmap/attribute length (just to double-check)
uint32_t attr_len;
uint32_t pad0;
};
// delete object on the secondary OSD
struct __attribute__((__packed__)) osd_op_secondary_del_t
struct __attribute__((__packed__)) osd_op_sec_del_t
{
osd_op_header_t header;
// object
@@ -90,42 +98,58 @@ struct __attribute__((__packed__)) osd_op_secondary_del_t
uint64_t version;
};
struct __attribute__((__packed__)) osd_reply_secondary_del_t
struct __attribute__((__packed__)) osd_reply_sec_del_t
{
osd_reply_header_t header;
uint64_t version;
};
// sync to the secondary OSD
struct __attribute__((__packed__)) osd_op_secondary_sync_t
struct __attribute__((__packed__)) osd_op_sec_sync_t
{
osd_op_header_t header;
};
struct __attribute__((__packed__)) osd_reply_secondary_sync_t
struct __attribute__((__packed__)) osd_reply_sec_sync_t
{
osd_reply_header_t header;
};
// stabilize or rollback objects on the secondary OSD
struct __attribute__((__packed__)) osd_op_secondary_stabilize_t
struct __attribute__((__packed__)) osd_op_sec_stab_t
{
osd_op_header_t header;
// obj_ver_id array length in bytes
uint64_t len;
};
typedef osd_op_secondary_stabilize_t osd_op_secondary_rollback_t;
typedef osd_op_sec_stab_t osd_op_sec_rollback_t;
struct __attribute__((__packed__)) osd_reply_secondary_stabilize_t
struct __attribute__((__packed__)) osd_reply_sec_stab_t
{
osd_reply_header_t header;
};
typedef osd_reply_secondary_stabilize_t osd_reply_secondary_rollback_t;
typedef osd_reply_sec_stab_t osd_reply_sec_rollback_t;
// bulk read bitmaps from a secondary OSD
struct __attribute__((__packed__)) osd_op_sec_read_bmp_t
{
osd_op_header_t header;
// obj_ver_id array length in bytes
uint64_t len;
};
struct __attribute__((__packed__)) osd_reply_sec_read_bmp_t
{
// retval is payload length in bytes. payload is {version,bitmap}[]
osd_reply_header_t header;
};
// show configuration
struct __attribute__((__packed__)) osd_op_show_config_t
{
osd_op_header_t header;
// JSON request length
uint64_t json_len;
};
struct __attribute__((__packed__)) osd_reply_show_config_t
@@ -134,7 +158,7 @@ struct __attribute__((__packed__)) osd_reply_show_config_t
};
// list objects on replica
struct __attribute__((__packed__)) osd_op_secondary_list_t
struct __attribute__((__packed__)) osd_op_sec_list_t
{
osd_op_header_t header;
// placement group total number and total count
@@ -145,7 +169,7 @@ struct __attribute__((__packed__)) osd_op_secondary_list_t
uint64_t min_inode, max_inode;
};
struct __attribute__((__packed__)) osd_reply_secondary_list_t
struct __attribute__((__packed__)) osd_reply_sec_list_t
{
osd_reply_header_t header;
// stable object version count. header.retval = total object version count
@@ -154,7 +178,6 @@ struct __attribute__((__packed__)) osd_reply_secondary_list_t
};
// read or write to the primary OSD (must be within individual stripe)
// FIXME: allow to return used block bitmap (required for snapshots)
struct __attribute__((__packed__)) osd_op_rw_t
{
osd_op_header_t header;
@@ -164,11 +187,23 @@ struct __attribute__((__packed__)) osd_op_rw_t
uint64_t offset;
// length
uint32_t len;
// flags (for future)
uint32_t flags;
// inode metadata revision
uint64_t meta_revision;
// object version for atomic "CAS" (compare-and-set) writes
// writes and deletes fail with -EINTR if object version differs from (version-1)
uint64_t version;
};
struct __attribute__((__packed__)) osd_reply_rw_t
{
osd_reply_header_t header;
// for reads: bitmap length
uint32_t bitmap_len;
uint32_t pad0;
// for reads: object version
uint64_t version;
};
// sync to the primary OSD
@@ -186,11 +221,12 @@ struct __attribute__((__packed__)) osd_reply_sync_t
union osd_any_op_t
{
osd_op_header_t hdr;
osd_op_secondary_rw_t sec_rw;
osd_op_secondary_del_t sec_del;
osd_op_secondary_sync_t sec_sync;
osd_op_secondary_stabilize_t sec_stab;
osd_op_secondary_list_t sec_list;
osd_op_sec_rw_t sec_rw;
osd_op_sec_del_t sec_del;
osd_op_sec_sync_t sec_sync;
osd_op_sec_stab_t sec_stab;
osd_op_sec_read_bmp_t sec_read_bmp;
osd_op_sec_list_t sec_list;
osd_op_show_config_t show_conf;
osd_op_rw_t rw;
osd_op_sync_t sync;
@@ -200,11 +236,12 @@ union osd_any_op_t
union osd_any_reply_t
{
osd_reply_header_t hdr;
osd_reply_secondary_rw_t sec_rw;
osd_reply_secondary_del_t sec_del;
osd_reply_secondary_sync_t sec_sync;
osd_reply_secondary_stabilize_t sec_stab;
osd_reply_secondary_list_t sec_list;
osd_reply_sec_rw_t sec_rw;
osd_reply_sec_del_t sec_del;
osd_reply_sec_sync_t sec_sync;
osd_reply_sec_stab_t sec_stab;
osd_reply_sec_read_bmp_t sec_read_bmp;
osd_reply_sec_list_t sec_list;
osd_reply_show_config_t show_conf;
osd_reply_rw_t rw;
osd_reply_sync_t sync;

View File

@@ -156,7 +156,7 @@ void osd_t::start_pg_peering(pg_t & pg)
if (immediate_commit != IMMEDIATE_ALL)
{
std::vector<int> to_stop;
for (auto & cp: c_cli.clients)
for (auto & cp: msgr.clients)
{
if (cp.second->dirty_pgs.find({ .pool_id = pg.pool_id, .pg_num = pg.pg_num }) != cp.second->dirty_pgs.end())
{
@@ -165,7 +165,7 @@ void osd_t::start_pg_peering(pg_t & pg)
}
for (auto peer_fd: to_stop)
{
c_cli.stop_client(peer_fd);
msgr.stop_client(peer_fd);
}
}
// Calculate current write OSD set
@@ -175,7 +175,7 @@ void osd_t::start_pg_peering(pg_t & pg)
for (int role = 0; role < pg.target_set.size(); role++)
{
pg.cur_set[role] = pg.target_set[role] == this->osd_num ||
c_cli.osd_peer_fds.find(pg.target_set[role]) != c_cli.osd_peer_fds.end() ? pg.target_set[role] : 0;
msgr.osd_peer_fds.find(pg.target_set[role]) != msgr.osd_peer_fds.end() ? pg.target_set[role] : 0;
if (pg.cur_set[role] != 0)
{
pg.pg_cursize++;
@@ -199,7 +199,7 @@ void osd_t::start_pg_peering(pg_t & pg)
{
found = false;
if (history_osd == this->osd_num ||
c_cli.osd_peer_fds.find(history_osd) != c_cli.osd_peer_fds.end())
msgr.osd_peer_fds.find(history_osd) != msgr.osd_peer_fds.end())
{
found = true;
break;
@@ -223,13 +223,13 @@ void osd_t::start_pg_peering(pg_t & pg)
std::set<osd_num_t> cur_peers;
for (auto pg_osd: pg.all_peers)
{
if (pg_osd == this->osd_num || c_cli.osd_peer_fds.find(pg_osd) != c_cli.osd_peer_fds.end())
if (pg_osd == this->osd_num || msgr.osd_peer_fds.find(pg_osd) != msgr.osd_peer_fds.end())
{
cur_peers.insert(pg_osd);
}
else if (c_cli.wanted_peers.find(pg_osd) == c_cli.wanted_peers.end())
else if (msgr.wanted_peers.find(pg_osd) == msgr.wanted_peers.end())
{
c_cli.connect_peer(pg_osd, st_cli.peer_states[pg_osd]);
msgr.connect_peer(pg_osd, st_cli.peer_states[pg_osd]);
}
}
pg.cur_peers.insert(pg.cur_peers.begin(), cur_peers.begin(), cur_peers.end());
@@ -325,7 +325,7 @@ void osd_t::submit_sync_and_list_subop(osd_num_t role_osd, pg_peering_state_t *p
else
{
// Peer
auto & cl = c_cli.clients.at(c_cli.osd_peer_fds[role_osd]);
auto & cl = msgr.clients.at(msgr.osd_peer_fds[role_osd]);
osd_op_t *op = new osd_op_t();
op->op_type = OSD_OP_OUT;
op->peer_fd = cl->peer_fd;
@@ -333,7 +333,7 @@ void osd_t::submit_sync_and_list_subop(osd_num_t role_osd, pg_peering_state_t *p
.sec_sync = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.id = msgr.next_subop_id++,
.opcode = OSD_OP_SEC_SYNC,
},
},
@@ -347,14 +347,14 @@ void osd_t::submit_sync_and_list_subop(osd_num_t role_osd, pg_peering_state_t *p
int fail_fd = op->peer_fd;
ps->list_ops.erase(role_osd);
delete op;
c_cli.stop_client(fail_fd);
msgr.stop_client(fail_fd);
return;
}
delete op;
ps->list_ops.erase(role_osd);
submit_list_subop(role_osd, ps);
};
c_cli.outbox_push(op);
msgr.outbox_push(op);
ps->list_ops[role_osd] = op;
}
}
@@ -404,12 +404,12 @@ void osd_t::submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps)
// Peer
osd_op_t *op = new osd_op_t();
op->op_type = OSD_OP_OUT;
op->peer_fd = c_cli.osd_peer_fds[role_osd];
op->peer_fd = msgr.osd_peer_fds[role_osd];
op->req = (osd_any_op_t){
.sec_list = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.id = msgr.next_subop_id++,
.opcode = OSD_OP_SEC_LIST,
},
.list_pg = ps->pg_num,
@@ -427,7 +427,7 @@ void osd_t::submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps)
int fail_fd = op->peer_fd;
ps->list_ops.erase(role_osd);
delete op;
c_cli.stop_client(fail_fd);
msgr.stop_client(fail_fd);
return;
}
printf(
@@ -444,7 +444,7 @@ void osd_t::submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps)
ps->list_ops.erase(role_osd);
delete op;
};
c_cli.outbox_push(op);
msgr.outbox_push(op);
ps->list_ops[role_osd] = op;
}
}

View File

@@ -2,6 +2,7 @@
// License: VNPL-1.1 (see README.md for details)
#include "osd_primary.h"
#include "allocator.h"
// read: read directly or read paired stripe(s), reconstruct, return
// write: read paired stripe(s), reconstruct, modify, calculate parity, write
@@ -27,6 +28,7 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
return false;
}
auto & pool_cfg = pool_cfg_it->second;
// FIXME: op_data->pg_data_size can probably be removed (there's pg.pg_data_size)
uint64_t pg_data_size = (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks);
uint64_t pg_block_size = bs_block_size * pg_data_size;
object_id oid = {
@@ -34,7 +36,7 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
// oid.stripe = starting offset of the parity stripe
.stripe = (cur_op->req.rw.offset/pg_block_size)*pg_block_size,
};
pg_num_t pg_num = (cur_op->req.rw.inode + oid.stripe/pool_cfg.pg_stripe_size) % pg_counts[pool_id] + 1;
pg_num_t pg_num = (oid.stripe/pool_cfg.pg_stripe_size) % pg_counts[pool_id] + 1; // like map_to_pg()
auto pg_it = pgs.find({ .pool_id = pool_id, .pg_num = pg_num });
if (pg_it == pgs.end() || !(pg_it->second.state & PG_ACTIVE))
{
@@ -50,16 +52,88 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
finish_op(cur_op, -EINVAL);
return false;
}
int stripe_count = (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg_it->second.pg_size);
int chain_size = 0;
if (cur_op->req.hdr.opcode == OSD_OP_READ && cur_op->req.rw.meta_revision > 0)
{
// Chained read
auto inode_it = st_cli.inode_config.find(cur_op->req.rw.inode);
if (inode_it->second.mod_revision != cur_op->req.rw.meta_revision)
{
// Client view of the metadata differs from OSD's view
// Operation can't be completed correctly, client should retry later
finish_op(cur_op, -EPIPE);
return false;
}
// Find parents from the same pool. Optimized reads only work within pools
while (inode_it != st_cli.inode_config.end() && inode_it->second.parent_id &&
INODE_POOL(inode_it->second.parent_id) == pg_it->second.pool_id &&
// Check for loops
inode_it->second.parent_id != cur_op->req.rw.inode)
{
chain_size++;
inode_it = st_cli.inode_config.find(inode_it->second.parent_id);
}
if (chain_size)
{
// Add the original inode
chain_size++;
}
}
osd_primary_op_data_t *op_data = (osd_primary_op_data_t*)calloc_or_die(
1, sizeof(osd_primary_op_data_t) + sizeof(osd_rmw_stripe_t) * (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg_it->second.pg_size)
// Allocate:
// - op_data
1, sizeof(osd_primary_op_data_t) +
// - stripes
// - resulting bitmap buffers
stripe_count * (clean_entry_bitmap_size + sizeof(osd_rmw_stripe_t)) +
chain_size * (
// - copy of the chain
sizeof(inode_t) +
// - bitmap buffers for chained read
stripe_count * clean_entry_bitmap_size +
// - 'missing' flags for chained reads
(pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 0 : pg_it->second.pg_size)
)
);
void *data_buf = ((void*)op_data) + sizeof(osd_primary_op_data_t);
op_data->pg_num = pg_num;
op_data->oid = oid;
op_data->stripes = ((osd_rmw_stripe_t*)(op_data+1));
op_data->stripes = (osd_rmw_stripe_t*)data_buf;
data_buf += sizeof(osd_rmw_stripe_t) * stripe_count;
op_data->scheme = pool_cfg.scheme;
op_data->pg_data_size = pg_data_size;
op_data->pg_size = pg_it->second.pg_size;
cur_op->op_data = op_data;
split_stripes(pg_data_size, bs_block_size, (uint32_t)(cur_op->req.rw.offset - oid.stripe), cur_op->req.rw.len, op_data->stripes);
// Allocate bitmaps along with stripes to avoid extra allocations and fragmentation
for (int i = 0; i < stripe_count; i++)
{
op_data->stripes[i].bmp_buf = data_buf;
data_buf += clean_entry_bitmap_size;
}
op_data->chain_size = chain_size;
if (chain_size > 0)
{
op_data->read_chain = (inode_t*)data_buf;
data_buf += sizeof(inode_t) * chain_size;
op_data->snapshot_bitmaps = data_buf;
data_buf += chain_size * stripe_count * clean_entry_bitmap_size;
op_data->missing_flags = (uint8_t*)data_buf;
data_buf += chain_size * (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 0 : pg_it->second.pg_size);
// Copy chain
int chain_num = 0;
op_data->read_chain[chain_num++] = cur_op->req.rw.inode;
auto inode_it = st_cli.inode_config.find(cur_op->req.rw.inode);
while (inode_it != st_cli.inode_config.end() && inode_it->second.parent_id &&
INODE_POOL(inode_it->second.parent_id) == pg_it->second.pool_id &&
// Check for loops
inode_it->second.parent_id != cur_op->req.rw.inode)
{
op_data->read_chain[chain_num++] = inode_it->second.parent_id;
inode_it = st_cli.inode_config.find(inode_it->second.parent_id);
}
}
pg_it->second.inflight++;
return true;
}
@@ -100,8 +174,16 @@ void osd_t::continue_primary_read(osd_op_t *cur_op)
return;
}
osd_primary_op_data_t *op_data = cur_op->op_data;
if (op_data->st == 1) goto resume_1;
else if (op_data->st == 2) goto resume_2;
if (op_data->chain_size)
{
continue_chained_read(cur_op);
return;
}
if (op_data->st == 1)
goto resume_1;
else if (op_data->st == 2)
goto resume_2;
cur_op->reply.rw.bitmap_len = 0;
{
auto & pg = pgs.at({ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num });
for (int role = 0; role < op_data->pg_data_size; role++)
@@ -116,8 +198,7 @@ void osd_t::continue_primary_read(osd_op_t *cur_op)
{
// Fast happy-path
cur_op->buf = alloc_read_buffer(op_data->stripes, op_data->pg_data_size, 0);
submit_primary_subops(SUBMIT_READ, op_data->target_ver,
(op_data->scheme == POOL_SCHEME_REPLICATED ? pg.pg_size : op_data->pg_data_size), pg.cur_set.data(), cur_op);
submit_primary_subops(SUBMIT_READ, op_data->target_ver, pg.cur_set.data(), cur_op);
op_data->st = 1;
}
else
@@ -134,7 +215,7 @@ void osd_t::continue_primary_read(osd_op_t *cur_op)
op_data->scheme = pg.scheme;
op_data->degraded = 1;
cur_op->buf = alloc_read_buffer(op_data->stripes, pg.pg_size, 0);
submit_primary_subops(SUBMIT_READ, op_data->target_ver, pg.pg_size, cur_set, cur_op);
submit_primary_subops(SUBMIT_READ, op_data->target_ver, cur_set, cur_op);
op_data->st = 1;
}
}
@@ -146,18 +227,21 @@ resume_2:
finish_op(cur_op, op_data->epipe > 0 ? -EPIPE : -EIO);
return;
}
cur_op->reply.rw.version = op_data->fact_ver;
cur_op->reply.rw.bitmap_len = op_data->pg_data_size * clean_entry_bitmap_size;
if (op_data->degraded)
{
// Reconstruct missing stripes
osd_rmw_stripe_t *stripes = op_data->stripes;
if (op_data->scheme == POOL_SCHEME_XOR)
{
reconstruct_stripes_xor(stripes, op_data->pg_size);
reconstruct_stripes_xor(stripes, op_data->pg_size, clean_entry_bitmap_size);
}
else if (op_data->scheme == POOL_SCHEME_JERASURE)
{
reconstruct_stripes_jerasure(stripes, op_data->pg_size, op_data->pg_data_size);
reconstruct_stripes_jerasure(stripes, op_data->pg_size, op_data->pg_data_size, clean_entry_bitmap_size);
}
cur_op->iov.push_back(op_data->stripes[0].bmp_buf, cur_op->reply.rw.bitmap_len);
for (int role = 0; role < op_data->pg_size; role++)
{
if (stripes[role].req_end != 0)
@@ -172,6 +256,7 @@ resume_2:
}
else
{
cur_op->iov.push_back(op_data->stripes[0].bmp_buf, cur_op->reply.rw.bitmap_len);
cur_op->iov.push_back(cur_op->buf, cur_op->req.rw.len);
}
finish_op(cur_op, cur_op->req.rw.len);
@@ -254,7 +339,7 @@ resume_1:
// Determine which OSDs contain this object and delete it
op_data->prev_set = get_object_osd_set(pg, op_data->oid, pg.cur_set.data(), &op_data->object_state);
// Submit 1 read to determine the actual version number
submit_primary_subops(SUBMIT_RMW_READ, UINT64_MAX, pg.pg_size, op_data->prev_set, cur_op);
submit_primary_subops(SUBMIT_RMW_READ, UINT64_MAX, op_data->prev_set, cur_op);
resume_2:
op_data->st = 2;
return;
@@ -264,6 +349,12 @@ resume_3:
pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
return;
}
// Check CAS version
if (cur_op->req.rw.version && op_data->fact_ver != (cur_op->req.rw.version-1))
{
cur_op->reply.hdr.retval = -EINTR;
goto continue_others;
}
// Save version override for parallel reads
pg.ver_override[op_data->oid] = op_data->fact_ver;
// Submit deletes
@@ -291,6 +382,8 @@ resume_5:
free_object_state(pg, &op_data->object_state);
}
pg.total_count--;
cur_op->reply.hdr.retval = 0;
continue_others:
osd_op_t *next_op = NULL;
auto next_it = pg.write_queue.find(op_data->oid);
if (next_it != pg.write_queue.end() && next_it->second == cur_op)
@@ -299,7 +392,7 @@ resume_5:
if (next_it != pg.write_queue.end() && next_it->first == op_data->oid)
next_op = next_it->second;
}
finish_op(cur_op, cur_op->req.rw.len);
finish_op(cur_op, cur_op->reply.hdr.retval);
if (next_op)
{
// Continue next write to the same object

View File

@@ -31,15 +31,31 @@ struct osd_primary_op_data_t
uint64_t *prev_set = NULL;
pg_osd_set_state_t *object_state = NULL;
// for sync. oops, requires freeing
std::vector<unstable_osd_num_t> *unstable_write_osds = NULL;
pool_pg_num_t *dirty_pgs = NULL;
int dirty_pg_count = 0;
osd_num_t *dirty_osds = NULL;
int dirty_osd_count = 0;
obj_ver_id *unstable_writes = NULL;
obj_ver_osd_t *copies_to_delete = NULL;
int copies_to_delete_count = 0;
union
{
struct
{
// for sync. oops, requires freeing
std::vector<unstable_osd_num_t> *unstable_write_osds;
pool_pg_num_t *dirty_pgs;
int dirty_pg_count;
osd_num_t *dirty_osds;
int dirty_osd_count;
obj_ver_id *unstable_writes;
obj_ver_osd_t *copies_to_delete;
int copies_to_delete_count;
};
struct
{
// for read_bitmaps
void *snapshot_bitmaps;
inode_t *read_chain;
uint8_t *missing_flags;
int chain_size;
osd_chain_read_t *chain_reads;
int chain_read_count;
};
};
};
bool contains_osd(osd_num_t *osd_set, uint64_t size, osd_num_t osd_num);

564
src/osd_primary_chain.cpp Normal file
View File

@@ -0,0 +1,564 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include "osd_primary.h"
#include "allocator.h"
void osd_t::continue_chained_read(osd_op_t *cur_op)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
auto & pg = pgs.at({ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num });
if (op_data->st == 1)
goto resume_1;
else if (op_data->st == 2)
goto resume_2;
else if (op_data->st == 3)
goto resume_3;
else if (op_data->st == 4)
goto resume_4;
cur_op->reply.rw.bitmap_len = 0;
for (int role = 0; role < op_data->pg_data_size; role++)
{
op_data->stripes[role].read_start = op_data->stripes[role].req_start;
op_data->stripes[role].read_end = op_data->stripes[role].req_end;
}
resume_1:
resume_2:
// Read bitmaps
if (read_bitmaps(cur_op, pg, 1) != 0)
return;
// Prepare & submit reads
if (submit_chained_read_requests(pg, cur_op) != 0)
return;
if (op_data->n_subops > 0)
{
// Wait for reads
op_data->st = 3;
resume_3:
return;
}
resume_4:
if (op_data->errors > 0)
{
free(op_data->chain_reads);
op_data->chain_reads = NULL;
finish_op(cur_op, op_data->epipe > 0 ? -EPIPE : -EIO);
return;
}
send_chained_read_results(pg, cur_op);
finish_op(cur_op, cur_op->req.rw.len);
}
int osd_t::read_bitmaps(osd_op_t *cur_op, pg_t & pg, int base_state)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
if (op_data->st == base_state)
goto resume_0;
else if (op_data->st == base_state+1)
goto resume_1;
if (pg.state == PG_ACTIVE && pg.scheme == POOL_SCHEME_REPLICATED)
{
// Happy path for clean replicated PGs (all bitmaps are available locally)
for (int chain_num = 0; chain_num < op_data->chain_size; chain_num++)
{
object_id cur_oid = { .inode = op_data->read_chain[chain_num], .stripe = op_data->oid.stripe };
auto vo_it = pg.ver_override.find(cur_oid);
auto read_version = (vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX);
// Read bitmap synchronously from the local database
bs->read_bitmap(
cur_oid, read_version, op_data->snapshot_bitmaps + chain_num*clean_entry_bitmap_size,
!chain_num ? &cur_op->reply.rw.version : NULL
);
}
}
else
{
if (submit_bitmap_subops(cur_op, pg) < 0)
{
// Failure
finish_op(cur_op, -EIO);
return -1;
}
resume_0:
if (op_data->n_subops > 0)
{
// Wait for subops
op_data->st = base_state;
return 1;
}
resume_1:
if (pg.scheme != POOL_SCHEME_REPLICATED)
{
for (int chain_num = 0; chain_num < op_data->chain_size; chain_num++)
{
// Check if we need to reconstruct any bitmaps
for (int i = 0; i < pg.pg_size; i++)
{
if (op_data->missing_flags[chain_num*pg.pg_size + i])
{
osd_rmw_stripe_t local_stripes[pg.pg_size] = { 0 };
for (i = 0; i < pg.pg_size; i++)
{
local_stripes[i].missing = op_data->missing_flags[chain_num*pg.pg_size + i] && true;
local_stripes[i].bmp_buf = op_data->snapshot_bitmaps + (chain_num*pg.pg_size + i)*clean_entry_bitmap_size;
local_stripes[i].read_start = local_stripes[i].read_end = 1;
}
if (pg.scheme == POOL_SCHEME_XOR)
{
reconstruct_stripes_xor(local_stripes, pg.pg_size, clean_entry_bitmap_size);
}
else if (pg.scheme == POOL_SCHEME_JERASURE)
{
reconstruct_stripes_jerasure(local_stripes, pg.pg_size, pg.pg_data_size, clean_entry_bitmap_size);
}
break;
}
}
}
}
}
return 0;
}
int osd_t::collect_bitmap_requests(osd_op_t *cur_op, pg_t & pg, std::vector<bitmap_request_t> & bitmap_requests)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
for (int chain_num = 0; chain_num < op_data->chain_size; chain_num++)
{
object_id cur_oid = { .inode = op_data->read_chain[chain_num], .stripe = op_data->oid.stripe };
auto vo_it = pg.ver_override.find(cur_oid);
uint64_t target_version = vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX;
pg_osd_set_state_t *object_state;
uint64_t* cur_set = get_object_osd_set(pg, cur_oid, pg.cur_set.data(), &object_state);
if (pg.scheme == POOL_SCHEME_REPLICATED)
{
osd_num_t read_target = 0;
for (int i = 0; i < pg.pg_size; i++)
{
if (cur_set[i] == this->osd_num || cur_set[i] != 0 && read_target == 0)
{
// Select local or any other available OSD for reading
read_target = cur_set[i];
}
}
assert(read_target != 0);
bitmap_requests.push_back((bitmap_request_t){
.osd_num = read_target,
.oid = cur_oid,
.version = target_version,
.bmp_buf = op_data->snapshot_bitmaps + chain_num*clean_entry_bitmap_size,
});
}
else
{
osd_rmw_stripe_t local_stripes[pg.pg_size];
memcpy(local_stripes, op_data->stripes, sizeof(osd_rmw_stripe_t) * pg.pg_size);
if (extend_missing_stripes(local_stripes, cur_set, pg.pg_data_size, pg.pg_size) < 0)
{
free(op_data->snapshot_bitmaps);
return -1;
}
int need_at_least = 0;
for (int i = 0; i < pg.pg_size; i++)
{
if (local_stripes[i].read_end != 0 && cur_set[i] == 0)
{
// We need this part of the bitmap, but it's unavailable
need_at_least = pg.pg_data_size;
op_data->missing_flags[chain_num*pg.pg_size + i] = 1;
}
else
{
op_data->missing_flags[chain_num*pg.pg_size + i] = 0;
}
}
int found = 0;
for (int i = 0; i < pg.pg_size; i++)
{
if (cur_set[i] != 0 && (local_stripes[i].read_end != 0 || found < need_at_least))
{
// Read part of the bitmap
bitmap_requests.push_back((bitmap_request_t){
.osd_num = cur_set[i],
.oid = {
.inode = cur_oid.inode,
.stripe = cur_oid.stripe | i,
},
.version = target_version,
.bmp_buf = op_data->snapshot_bitmaps + (chain_num*pg.pg_size + i)*clean_entry_bitmap_size,
});
found++;
}
}
// Already checked by extend_missing_stripes, so it's fine to use assert
assert(found >= need_at_least);
}
}
std::sort(bitmap_requests.begin(), bitmap_requests.end());
return 0;
}
int osd_t::submit_bitmap_subops(osd_op_t *cur_op, pg_t & pg)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
std::vector<bitmap_request_t> *bitmap_requests = new std::vector<bitmap_request_t>();
if (collect_bitmap_requests(cur_op, pg, *bitmap_requests) < 0)
{
return -1;
}
op_data->n_subops = 0;
for (int i = 0; i < bitmap_requests->size(); i++)
{
if ((i == bitmap_requests->size()-1 || (*bitmap_requests)[i+1].osd_num != (*bitmap_requests)[i].osd_num) &&
(*bitmap_requests)[i].osd_num != this->osd_num)
{
op_data->n_subops++;
}
}
if (op_data->n_subops)
{
op_data->fact_ver = 0;
op_data->done = op_data->errors = 0;
op_data->subops = new osd_op_t[op_data->n_subops];
}
for (int i = 0, subop_idx = 0, prev = 0; i < bitmap_requests->size(); i++)
{
if (i == bitmap_requests->size()-1 || (*bitmap_requests)[i+1].osd_num != (*bitmap_requests)[i].osd_num)
{
osd_num_t subop_osd_num = (*bitmap_requests)[i].osd_num;
if (subop_osd_num == this->osd_num)
{
// Read bitmap synchronously from the local database
for (int j = prev; j <= i; j++)
{
bs->read_bitmap(
(*bitmap_requests)[j].oid, (*bitmap_requests)[j].version, (*bitmap_requests)[j].bmp_buf,
(*bitmap_requests)[j].oid.inode == cur_op->req.rw.inode ? &cur_op->reply.rw.version : NULL
);
}
}
else
{
// Send to a remote OSD
osd_op_t *subop = op_data->subops+subop_idx;
subop->op_type = OSD_OP_OUT;
subop->peer_fd = msgr.osd_peer_fds.at(subop_osd_num);
// FIXME: Use the pre-allocated buffer
subop->buf = malloc_or_die(sizeof(obj_ver_id)*(i+1-prev));
subop->req = (osd_any_op_t){
.sec_read_bmp = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = msgr.next_subop_id++,
.opcode = OSD_OP_SEC_READ_BMP,
},
.len = sizeof(obj_ver_id)*(i+1-prev),
}
};
obj_ver_id *ov = (obj_ver_id*)subop->buf;
for (int j = prev; j <= i; j++, ov++)
{
ov->oid = (*bitmap_requests)[j].oid;
ov->version = (*bitmap_requests)[j].version;
}
subop->callback = [cur_op, bitmap_requests, prev, i, this](osd_op_t *subop)
{
int requested_count = subop->req.sec_read_bmp.len / sizeof(obj_ver_id);
if (subop->reply.hdr.retval == requested_count * (8 + clean_entry_bitmap_size))
{
void *cur_buf = subop->buf + 8;
for (int j = prev; j <= i; j++)
{
memcpy((*bitmap_requests)[j].bmp_buf, cur_buf, clean_entry_bitmap_size);
if ((*bitmap_requests)[j].oid.inode == cur_op->req.rw.inode)
{
memcpy(&cur_op->reply.rw.version, cur_buf-8, 8);
}
cur_buf += 8 + clean_entry_bitmap_size;
}
}
if ((cur_op->op_data->errors + cur_op->op_data->done + 1) >= cur_op->op_data->n_subops)
{
delete bitmap_requests;
}
handle_primary_subop(subop, cur_op);
};
msgr.outbox_push(subop);
subop_idx++;
}
prev = i+1;
}
}
if (!op_data->n_subops)
{
delete bitmap_requests;
}
return 0;
}
std::vector<osd_chain_read_t> osd_t::collect_chained_read_requests(osd_op_t *cur_op)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
std::vector<osd_chain_read_t> chain_reads;
int stripe_count = (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : op_data->pg_size);
memset(op_data->stripes[0].bmp_buf, 0, stripe_count * clean_entry_bitmap_size);
uint8_t *global_bitmap = (uint8_t*)op_data->stripes[0].bmp_buf;
// We always use at most 1 read request per layer
for (int chain_pos = 0; chain_pos < op_data->chain_size; chain_pos++)
{
uint8_t *part_bitmap = ((uint8_t*)op_data->snapshot_bitmaps) + chain_pos*stripe_count*clean_entry_bitmap_size;
int start = (cur_op->req.rw.offset - op_data->oid.stripe)/bs_bitmap_granularity;
int end = start + cur_op->req.rw.len/bs_bitmap_granularity;
// Skip unneeded part in the beginning
while (start < end && (
((global_bitmap[start>>3] >> (start&7)) & 1) ||
!((part_bitmap[start>>3] >> (start&7)) & 1)))
{
start++;
}
// Skip unneeded part in the end
while (start < end && (
((global_bitmap[(end-1)>>3] >> ((end-1)&7)) & 1) ||
!((part_bitmap[(end-1)>>3] >> ((end-1)&7)) & 1)))
{
end--;
}
if (start < end)
{
// Copy (OR) bits in between
int cur = start;
for (; cur < end && (cur & 0x7); cur++)
{
global_bitmap[cur>>3] = global_bitmap[cur>>3] | (part_bitmap[cur>>3] & (1 << (cur&7)));
}
for (; cur <= end-8; cur += 8)
{
global_bitmap[cur>>3] = global_bitmap[cur>>3] | part_bitmap[cur>>3];
}
for (; cur < end; cur++)
{
global_bitmap[cur>>3] = global_bitmap[cur>>3] | (part_bitmap[cur>>3] & (1 << (cur&7)));
}
// Add request
chain_reads.push_back((osd_chain_read_t){
.chain_pos = chain_pos,
.inode = op_data->read_chain[chain_pos],
.offset = start*bs_bitmap_granularity,
.len = (end-start)*bs_bitmap_granularity,
});
}
}
return chain_reads;
}
int osd_t::submit_chained_read_requests(pg_t & pg, osd_op_t *cur_op)
{
// Decide which parts of which objects we need to read based on bitmaps
osd_primary_op_data_t *op_data = cur_op->op_data;
auto chain_reads = collect_chained_read_requests(cur_op);
int stripe_count = (pg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size);
op_data->chain_read_count = chain_reads.size();
op_data->chain_reads = (osd_chain_read_t*)calloc_or_die(
1, sizeof(osd_chain_read_t) * chain_reads.size()
+ sizeof(osd_rmw_stripe_t) * stripe_count * op_data->chain_size
);
osd_rmw_stripe_t *chain_stripes = (osd_rmw_stripe_t*)(
((void*)op_data->chain_reads) + sizeof(osd_chain_read_t) * op_data->chain_read_count
);
// Now process each subrequest as a separate read, including reconstruction if needed
// Prepare reads
int n_subops = 0;
uint64_t read_buffer_size = 0;
for (int cri = 0; cri < chain_reads.size(); cri++)
{
op_data->chain_reads[cri] = chain_reads[cri];
object_id cur_oid = { .inode = chain_reads[cri].inode, .stripe = op_data->oid.stripe };
// FIXME: maybe introduce split_read_stripes to shorten these lines and to remove read_start=req_start
osd_rmw_stripe_t *stripes = chain_stripes + chain_reads[cri].chain_pos*stripe_count;
split_stripes(pg.pg_data_size, bs_block_size, chain_reads[cri].offset, chain_reads[cri].len, stripes);
if (op_data->scheme == POOL_SCHEME_REPLICATED && !stripes[0].req_end)
{
continue;
}
for (int role = 0; role < op_data->pg_data_size; role++)
{
stripes[role].read_start = stripes[role].req_start;
stripes[role].read_end = stripes[role].req_end;
}
uint64_t *cur_set = pg.cur_set.data();
if (pg.state != PG_ACTIVE && op_data->scheme != POOL_SCHEME_REPLICATED)
{
pg_osd_set_state_t *object_state;
cur_set = get_object_osd_set(pg, cur_oid, pg.cur_set.data(), &object_state);
if (extend_missing_stripes(stripes, cur_set, pg.pg_data_size, pg.pg_size) < 0)
{
free(op_data->chain_reads);
op_data->chain_reads = NULL;
finish_op(cur_op, -EIO);
return -1;
}
op_data->degraded = 1;
}
if (op_data->scheme == POOL_SCHEME_REPLICATED)
{
n_subops++;
read_buffer_size += stripes[0].read_end - stripes[0].read_start;
}
else
{
for (int role = 0; role < pg.pg_size; role++)
{
if (stripes[role].read_end > 0 && cur_set[role] != 0)
n_subops++;
if (stripes[role].read_end > 0)
read_buffer_size += stripes[role].read_end - stripes[role].read_start;
}
}
}
cur_op->buf = memalign_or_die(MEM_ALIGNMENT, read_buffer_size);
void *cur_buf = cur_op->buf;
for (int cri = 0; cri < chain_reads.size(); cri++)
{
osd_rmw_stripe_t *stripes = chain_stripes + chain_reads[cri].chain_pos*stripe_count;
for (int role = 0; role < stripe_count; role++)
{
if (stripes[role].read_end > 0)
{
stripes[role].read_buf = cur_buf;
stripes[role].bmp_buf = op_data->snapshot_bitmaps + (chain_reads[cri].chain_pos*stripe_count + role)*clean_entry_bitmap_size;
cur_buf += stripes[role].read_end - stripes[role].read_start;
}
}
}
// Submit all reads
op_data->fact_ver = UINT64_MAX;
op_data->done = op_data->errors = 0;
op_data->n_subops = n_subops;
if (!n_subops)
{
return 0;
}
op_data->subops = new osd_op_t[n_subops];
int cur_subops = 0;
for (int cri = 0; cri < chain_reads.size(); cri++)
{
osd_rmw_stripe_t *stripes = chain_stripes + chain_reads[cri].chain_pos*stripe_count;
if (op_data->scheme == POOL_SCHEME_REPLICATED && !stripes[0].req_end)
{
continue;
}
object_id cur_oid = { .inode = chain_reads[cri].inode, .stripe = op_data->oid.stripe };
auto vo_it = pg.ver_override.find(cur_oid);
uint64_t target_ver = vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX;
uint64_t *cur_set = pg.cur_set.data();
if (pg.state != PG_ACTIVE && op_data->scheme != POOL_SCHEME_REPLICATED)
{
pg_osd_set_state_t *object_state;
cur_set = get_object_osd_set(pg, cur_oid, pg.cur_set.data(), &object_state);
}
int zero_read = -1;
if (op_data->scheme == POOL_SCHEME_REPLICATED)
{
for (int role = 0; role < op_data->pg_size; role++)
if (cur_set[role] == this->osd_num || zero_read == -1)
zero_read = role;
}
cur_subops += submit_primary_subop_batch(SUBMIT_READ, chain_reads[cri].inode, target_ver, stripes, cur_set, cur_op, cur_subops, zero_read);
}
assert(cur_subops == n_subops);
return 0;
}
void osd_t::send_chained_read_results(pg_t & pg, osd_op_t *cur_op)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
int stripe_count = (pg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size);
osd_rmw_stripe_t *chain_stripes = (osd_rmw_stripe_t*)(
((void*)op_data->chain_reads) + sizeof(osd_chain_read_t) * op_data->chain_read_count
);
// Reconstruct parts if needed
if (op_data->degraded)
{
int stripe_count = (pg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size);
for (int cri = 0; cri < op_data->chain_read_count; cri++)
{
// Reconstruct missing stripes
osd_rmw_stripe_t *stripes = chain_stripes + op_data->chain_reads[cri].chain_pos*stripe_count;
if (op_data->scheme == POOL_SCHEME_XOR)
{
reconstruct_stripes_xor(stripes, pg.pg_size, clean_entry_bitmap_size);
}
else if (op_data->scheme == POOL_SCHEME_JERASURE)
{
reconstruct_stripes_jerasure(stripes, pg.pg_size, pg.pg_data_size, clean_entry_bitmap_size);
}
}
}
// Send bitmap
cur_op->reply.rw.bitmap_len = op_data->pg_data_size * clean_entry_bitmap_size;
cur_op->iov.push_back(op_data->stripes[0].bmp_buf, cur_op->reply.rw.bitmap_len);
// And finally compose the result
uint64_t sent = 0;
int prev_pos = 0, pos = 0;
bool prev_set = false;
int prev = (cur_op->req.rw.offset - op_data->oid.stripe) / bs_bitmap_granularity;
int end = prev + cur_op->req.rw.len/bs_bitmap_granularity;
int cur = prev;
while (cur <= end)
{
bool has_bit = false;
if (cur < end)
{
for (pos = 0; pos < op_data->chain_size; pos++)
{
has_bit = (((uint8_t*)op_data->snapshot_bitmaps)[pos*stripe_count*clean_entry_bitmap_size + cur/8] >> (cur%8)) & 1;
if (has_bit)
break;
}
}
if (has_bit != prev_set || pos != prev_pos || cur == end)
{
if (cur > prev)
{
// Send buffer in parts to avoid copying
if (!prev_set)
{
while ((cur-prev) > zero_buffer_size/bs_bitmap_granularity)
{
cur_op->iov.push_back(zero_buffer, zero_buffer_size);
sent += zero_buffer_size;
prev += zero_buffer_size/bs_bitmap_granularity;
}
cur_op->iov.push_back(zero_buffer, (cur-prev)*bs_bitmap_granularity);
sent += (cur-prev)*bs_bitmap_granularity;
}
else
{
osd_rmw_stripe_t *stripes = chain_stripes + prev_pos*stripe_count;
while (cur > prev)
{
int role = prev*bs_bitmap_granularity/bs_block_size;
int role_start = prev*bs_bitmap_granularity - role*bs_block_size;
int role_end = cur*bs_bitmap_granularity - role*bs_block_size;
if (role_end > bs_block_size)
role_end = bs_block_size;
assert(stripes[role].read_buf);
cur_op->iov.push_back(
stripes[role].read_buf + (role_start - stripes[role].read_start),
role_end - role_start
);
sent += role_end - role_start;
prev += (role_end - role_start)/bs_bitmap_granularity;
}
}
}
prev = cur;
prev_pos = pos;
prev_set = has_bit;
}
cur++;
}
assert(sent == cur_op->req.rw.len);
free(op_data->chain_reads);
op_data->chain_reads = NULL;
}

View File

@@ -36,6 +36,29 @@ void osd_t::autosync()
void osd_t::finish_op(osd_op_t *cur_op, int retval)
{
inflight_ops--;
if (cur_op->req.hdr.opcode == OSD_OP_READ ||
cur_op->req.hdr.opcode == OSD_OP_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_DELETE)
{
// Track inode statistics
if (!cur_op->tv_end.tv_sec)
{
clock_gettime(CLOCK_REALTIME, &cur_op->tv_end);
}
uint64_t usec = (
(cur_op->tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
(cur_op->tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
);
int inode_st_op = cur_op->req.hdr.opcode == OSD_OP_DELETE
? INODE_STATS_DELETE
: (cur_op->req.hdr.opcode == OSD_OP_READ ? INODE_STATS_READ : INODE_STATS_WRITE);
inode_stats[cur_op->req.rw.inode].op_count[inode_st_op]++;
inode_stats[cur_op->req.rw.inode].op_sum[inode_st_op] += usec;
if (cur_op->req.hdr.opcode == OSD_OP_DELETE)
inode_stats[cur_op->req.rw.inode].op_bytes[inode_st_op] += cur_op->op_data->pg_data_size * bs_block_size;
else
inode_stats[cur_op->req.rw.inode].op_bytes[inode_st_op] += cur_op->req.rw.len;
}
if (cur_op->op_data)
{
if (cur_op->op_data->pg_num > 0)
@@ -53,9 +76,6 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
}
}
assert(!cur_op->op_data->subops);
assert(!cur_op->op_data->unstable_write_osds);
assert(!cur_op->op_data->unstable_writes);
assert(!cur_op->op_data->dirty_pgs);
free(cur_op->op_data);
cur_op->op_data = NULL;
}
@@ -66,15 +86,15 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
}
else
{
// FIXME add separate magic number
auto cl_it = c_cli.clients.find(cur_op->peer_fd);
if (cl_it != c_cli.clients.end())
// FIXME add separate magic number for primary ops
auto cl_it = msgr.clients.find(cur_op->peer_fd);
if (cl_it != msgr.clients.end())
{
cur_op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
cur_op->reply.hdr.id = cur_op->req.hdr.id;
cur_op->reply.hdr.opcode = cur_op->req.hdr.opcode;
cur_op->reply.hdr.retval = retval;
c_cli.outbox_push(cur_op);
msgr.outbox_push(cur_op);
}
else
{
@@ -83,7 +103,7 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
}
}
void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op)
void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, const uint64_t* osd_set, osd_op_t *cur_op)
{
bool wr = submit_type == SUBMIT_WRITE;
osd_primary_op_data_t *op_data = cur_op->op_data;
@@ -91,32 +111,34 @@ void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, int pg_s
bool rep = op_data->scheme == POOL_SCHEME_REPLICATED;
// Allocate subops
int n_subops = 0, zero_read = -1;
for (int role = 0; role < pg_size; role++)
for (int role = 0; role < op_data->pg_size; role++)
{
if (osd_set[role] == this->osd_num || osd_set[role] != 0 && zero_read == -1)
{
zero_read = role;
}
if (osd_set[role] != 0 && (wr || !rep && stripes[role].read_end != 0))
{
n_subops++;
}
}
if (!n_subops && (submit_type == SUBMIT_RMW_READ || rep))
{
n_subops = 1;
}
else
{
zero_read = -1;
}
osd_op_t *subops = new osd_op_t[n_subops];
op_data->fact_ver = 0;
op_data->done = op_data->errors = 0;
op_data->n_subops = n_subops;
op_data->subops = subops;
int i = 0;
for (int role = 0; role < pg_size; role++)
int sent = submit_primary_subop_batch(submit_type, op_data->oid.inode, op_version, op_data->stripes, osd_set, cur_op, 0, zero_read);
assert(sent == n_subops);
}
int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t op_version,
osd_rmw_stripe_t *stripes, const uint64_t* osd_set, osd_op_t *cur_op, int subop_idx, int zero_read)
{
bool wr = submit_type == SUBMIT_WRITE;
osd_primary_op_data_t *op_data = cur_op->op_data;
bool rep = op_data->scheme == POOL_SCHEME_REPLICATED;
int i = subop_idx;
for (int role = 0; role < op_data->pg_size; role++)
{
// We always submit zero-length writes to all replicas, even if the stripe is not modified
if (!(wr || !rep && stripes[role].read_end != 0 || zero_read == role))
@@ -127,82 +149,90 @@ void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, int pg_s
if (role_osd_num != 0)
{
int stripe_num = rep ? 0 : role;
osd_op_t *subop = op_data->subops + i;
if (role_osd_num == this->osd_num)
{
clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin);
subops[i].op_type = (uint64_t)cur_op;
subops[i].bs_op = new blockstore_op_t({
clock_gettime(CLOCK_REALTIME, &subop->tv_begin);
subop->op_type = (uint64_t)cur_op;
subop->bitmap = stripes[stripe_num].bmp_buf;
subop->bitmap_len = clean_entry_bitmap_size;
subop->bs_op = new blockstore_op_t({
.opcode = (uint64_t)(wr ? (rep ? BS_OP_WRITE_STABLE : BS_OP_WRITE) : BS_OP_READ),
.callback = [subop = &subops[i], this](blockstore_op_t *bs_subop)
.callback = [subop, this](blockstore_op_t *bs_subop)
{
handle_primary_bs_subop(subop);
},
.oid = {
.inode = op_data->oid.inode,
.inode = inode,
.stripe = op_data->oid.stripe | stripe_num,
},
.version = op_version,
.offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start,
.len = wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start,
.buf = wr ? stripes[stripe_num].write_buf : stripes[stripe_num].read_buf,
.bitmap = stripes[stripe_num].bmp_buf,
});
#ifdef OSD_DEBUG
printf(
"Submit %s to local: %lx:%lx v%lu %u-%u\n", wr ? "write" : "read",
op_data->oid.inode, op_data->oid.stripe | stripe_num, op_version,
subops[i].bs_op->offset, subops[i].bs_op->len
inode, op_data->oid.stripe | stripe_num, op_version,
subop->bs_op->offset, subop->bs_op->len
);
#endif
bs->enqueue_op(subops[i].bs_op);
bs->enqueue_op(subop->bs_op);
}
else
{
subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = c_cli.osd_peer_fds.at(role_osd_num);
subops[i].req.sec_rw = {
subop->op_type = OSD_OP_OUT;
subop->peer_fd = msgr.osd_peer_fds.at(role_osd_num);
subop->bitmap = stripes[stripe_num].bmp_buf;
subop->bitmap_len = clean_entry_bitmap_size;
subop->req.sec_rw = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.id = msgr.next_subop_id++,
.opcode = (uint64_t)(wr ? (rep ? OSD_OP_SEC_WRITE_STABLE : OSD_OP_SEC_WRITE) : OSD_OP_SEC_READ),
},
.oid = {
.inode = op_data->oid.inode,
.inode = inode,
.stripe = op_data->oid.stripe | stripe_num,
},
.version = op_version,
.offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start,
.len = wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start,
.attr_len = wr ? clean_entry_bitmap_size : 0,
};
#ifdef OSD_DEBUG
printf(
"Submit %s to osd %lu: %lx:%lx v%lu %u-%u\n", wr ? "write" : "read", role_osd_num,
op_data->oid.inode, op_data->oid.stripe | stripe_num, op_version,
subops[i].req.sec_rw.offset, subops[i].req.sec_rw.len
inode, op_data->oid.stripe | stripe_num, op_version,
subop->req.sec_rw.offset, subop->req.sec_rw.len
);
#endif
if (wr)
{
if (stripes[stripe_num].write_end > stripes[stripe_num].write_start)
{
subops[i].iov.push_back(stripes[stripe_num].write_buf, stripes[stripe_num].write_end - stripes[stripe_num].write_start);
subop->iov.push_back(stripes[stripe_num].write_buf, stripes[stripe_num].write_end - stripes[stripe_num].write_start);
}
}
else
{
if (stripes[stripe_num].read_end > stripes[stripe_num].read_start)
{
subops[i].iov.push_back(stripes[stripe_num].read_buf, stripes[stripe_num].read_end - stripes[stripe_num].read_start);
subop->iov.push_back(stripes[stripe_num].read_buf, stripes[stripe_num].read_end - stripes[stripe_num].read_start);
}
}
subops[i].callback = [cur_op, this](osd_op_t *subop)
subop->callback = [cur_op, this](osd_op_t *subop)
{
handle_primary_subop(subop, cur_op);
};
c_cli.outbox_push(&subops[i]);
msgr.outbox_push(subop);
}
i++;
}
}
return i-subop_idx;
}
static uint64_t bs_op_to_osd_op[] = {
@@ -252,20 +282,20 @@ void osd_t::add_bs_subop_stats(osd_op_t *subop)
uint64_t opcode = bs_op_to_osd_op[subop->bs_op->opcode];
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
c_cli.stats.op_stat_count[opcode]++;
if (!c_cli.stats.op_stat_count[opcode])
msgr.stats.op_stat_count[opcode]++;
if (!msgr.stats.op_stat_count[opcode])
{
c_cli.stats.op_stat_count[opcode] = 1;
c_cli.stats.op_stat_sum[opcode] = 0;
c_cli.stats.op_stat_bytes[opcode] = 0;
msgr.stats.op_stat_count[opcode] = 1;
msgr.stats.op_stat_sum[opcode] = 0;
msgr.stats.op_stat_bytes[opcode] = 0;
}
c_cli.stats.op_stat_sum[opcode] += (
msgr.stats.op_stat_sum[opcode] += (
(tv_end.tv_sec - subop->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - subop->tv_begin.tv_nsec)/1000
);
if (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE)
{
c_cli.stats.op_stat_bytes[opcode] += subop->bs_op->len;
msgr.stats.op_stat_bytes[opcode] += subop->bs_op->len;
}
}
@@ -273,8 +303,13 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
{
uint64_t opcode = subop->req.hdr.opcode;
int retval = subop->reply.hdr.retval;
int expected = opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE
|| opcode == OSD_OP_SEC_WRITE_STABLE ? subop->req.sec_rw.len : 0;
int expected;
if (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE || opcode == OSD_OP_SEC_WRITE_STABLE)
expected = subop->req.sec_rw.len;
else if (opcode == OSD_OP_SEC_READ_BMP)
expected = subop->req.sec_read_bmp.len / sizeof(obj_ver_id) * (8 + clean_entry_bitmap_size);
else
expected = 0;
osd_primary_op_data_t *op_data = cur_op->op_data;
if (retval != expected)
{
@@ -287,7 +322,7 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
if (subop->peer_fd >= 0)
{
// Drop connection on any error
c_cli.stop_client(subop->peer_fd);
msgr.stop_client(subop->peer_fd);
}
}
else
@@ -297,18 +332,21 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
{
uint64_t version = subop->reply.sec_rw.version;
#ifdef OSD_DEBUG
uint64_t peer_osd = c_cli.clients.find(subop->peer_fd) != c_cli.clients.end()
? c_cli.clients[subop->peer_fd]->osd_num : osd_num;
uint64_t peer_osd = msgr.clients.find(subop->peer_fd) != msgr.clients.end()
? msgr.clients[subop->peer_fd]->osd_num : osd_num;
printf("subop %lu from osd %lu: version = %lu\n", opcode, peer_osd, version);
#endif
if (op_data->fact_ver != 0 && op_data->fact_ver != version)
if (op_data->fact_ver != UINT64_MAX)
{
throw std::runtime_error(
"different fact_versions returned from "+std::string(osd_op_names[opcode])+
" subops: "+std::to_string(version)+" vs "+std::to_string(op_data->fact_ver)
);
if (op_data->fact_ver != 0 && op_data->fact_ver != version)
{
throw std::runtime_error(
"different fact_versions returned from "+std::string(osd_op_names[opcode])+
" subops: "+std::to_string(version)+" vs "+std::to_string(op_data->fact_ver)
);
}
op_data->fact_ver = version;
}
op_data->fact_ver = version;
}
}
if ((op_data->errors + op_data->done) >= op_data->n_subops)
@@ -427,11 +465,11 @@ void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_
else
{
subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = c_cli.osd_peer_fds.at(chunk.osd_num);
subops[i].peer_fd = msgr.osd_peer_fds.at(chunk.osd_num);
subops[i].req = (osd_any_op_t){ .sec_del = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.id = msgr.next_subop_id++,
.opcode = OSD_OP_SEC_DELETE,
},
.oid = chunk.oid,
@@ -441,7 +479,7 @@ void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_
{
handle_primary_subop(subop, cur_op);
};
c_cli.outbox_push(&subops[i]);
msgr.outbox_push(&subops[i]);
}
}
}
@@ -471,14 +509,14 @@ int osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
});
bs->enqueue_op(subops[i].bs_op);
}
else if ((peer_it = c_cli.osd_peer_fds.find(sync_osd)) != c_cli.osd_peer_fds.end())
else if ((peer_it = msgr.osd_peer_fds.find(sync_osd)) != msgr.osd_peer_fds.end())
{
subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = peer_it->second;
subops[i].req = (osd_any_op_t){ .sec_sync = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.id = msgr.next_subop_id++,
.opcode = OSD_OP_SEC_SYNC,
},
} };
@@ -486,7 +524,7 @@ int osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
{
handle_primary_subop(subop, cur_op);
};
c_cli.outbox_push(&subops[i]);
msgr.outbox_push(&subops[i]);
}
else
{
@@ -531,11 +569,11 @@ void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
else
{
subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = c_cli.osd_peer_fds.at(stab_osd.osd_num);
subops[i].peer_fd = msgr.osd_peer_fds.at(stab_osd.osd_num);
subops[i].req = (osd_any_op_t){ .sec_stab = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.id = msgr.next_subop_id++,
.opcode = OSD_OP_SEC_STABILIZE,
},
.len = (uint64_t)(stab_osd.len * sizeof(obj_ver_id)),
@@ -545,7 +583,7 @@ void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
{
handle_primary_subop(subop, cur_op);
};
c_cli.outbox_push(&subops[i]);
msgr.outbox_push(&subops[i]);
}
}
}

View File

@@ -247,8 +247,8 @@ resume_8:
finish:
if (cur_op->peer_fd)
{
auto it = c_cli.clients.find(cur_op->peer_fd);
if (it != c_cli.clients.end())
auto it = msgr.clients.find(cur_op->peer_fd);
if (it != msgr.clients.end())
it->second->dirty_pgs.clear();
}
finish_op(cur_op, 0);

View File

@@ -77,7 +77,7 @@ resume_1:
else
{
cur_op->rmw_buf = calc_rmw(cur_op->buf, op_data->stripes, op_data->prev_set,
pg.pg_size, op_data->pg_data_size, pg.pg_cursize, pg.cur_set.data(), bs_block_size);
pg.pg_size, op_data->pg_data_size, pg.pg_cursize, pg.cur_set.data(), bs_block_size, clean_entry_bitmap_size);
if (!cur_op->rmw_buf)
{
// Refuse partial overwrite of an incomplete object
@@ -86,7 +86,7 @@ resume_1:
}
}
// Read required blocks
submit_primary_subops(SUBMIT_RMW_READ, UINT64_MAX, pg.pg_size, op_data->prev_set, cur_op);
submit_primary_subops(SUBMIT_RMW_READ, UINT64_MAX, op_data->prev_set, cur_op);
resume_2:
op_data->st = 2;
return;
@@ -96,9 +96,18 @@ resume_3:
pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
return;
}
// Check CAS version
if (cur_op->req.rw.version && op_data->fact_ver != (cur_op->req.rw.version-1))
{
cur_op->reply.hdr.retval = -EINTR;
goto continue_others;
}
if (op_data->scheme == POOL_SCHEME_REPLICATED)
{
// Only (possibly) copy new data from the request into the recovery buffer
// Set bitmap bits
bitmap_set(op_data->stripes[0].bmp_buf, op_data->stripes[0].write_start,
op_data->stripes[0].write_end-op_data->stripes[0].write_start, bs_bitmap_granularity);
// Possibly copy new data from the request into the recovery buffer
if (pg.cur_set.data() != op_data->prev_set && (op_data->stripes[0].write_start != 0 ||
op_data->stripes[0].write_end != bs_block_size))
{
@@ -120,11 +129,11 @@ resume_3:
// Recover missing stripes, calculate parity
if (pg.scheme == POOL_SCHEME_XOR)
{
calc_rmw_parity_xor(op_data->stripes, pg.pg_size, op_data->prev_set, pg.cur_set.data(), bs_block_size);
calc_rmw_parity_xor(op_data->stripes, pg.pg_size, op_data->prev_set, pg.cur_set.data(), bs_block_size, clean_entry_bitmap_size);
}
else if (pg.scheme == POOL_SCHEME_JERASURE)
{
calc_rmw_parity_jerasure(op_data->stripes, pg.pg_size, op_data->pg_data_size, op_data->prev_set, pg.cur_set.data(), bs_block_size);
calc_rmw_parity_jerasure(op_data->stripes, pg.pg_size, op_data->pg_data_size, op_data->prev_set, pg.cur_set.data(), bs_block_size, clean_entry_bitmap_size);
}
}
// Send writes
@@ -155,7 +164,7 @@ resume_10:
return;
}
}
submit_primary_subops(SUBMIT_WRITE, op_data->target_ver, pg.pg_size, pg.cur_set.data(), cur_op);
submit_primary_subops(SUBMIT_WRITE, op_data->target_ver, pg.cur_set.data(), cur_op);
resume_4:
op_data->st = 4;
return;
@@ -262,7 +271,7 @@ continue_others:
next_op = next_it->second;
}
// finish_op would invalidate next_it if it cleared pg.write_queue, but it doesn't do that :)
finish_op(cur_op, cur_op->req.rw.len);
finish_op(cur_op, cur_op->reply.hdr.retval);
if (next_op)
{
// Continue next write to the same object
@@ -367,8 +376,8 @@ lazy:
}
// Remember PG as dirty to drop the connection when PG goes offline
// (this is required because of the "lazy sync")
auto cl_it = c_cli.clients.find(cur_op->peer_fd);
if (cl_it != c_cli.clients.end())
auto cl_it = msgr.clients.find(cur_op->peer_fd);
if (cl_it != msgr.clients.end())
{
cl_it->second->dirty_pgs.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
}

Some files were not shown because too many files have changed in this diff Show More