Compare commits


29 Commits

Author SHA1 Message Date
c0f06dea56 [Draft] Optimized read 2021-03-17 02:14:41 +03:00
be49998c89 Add "read bitmaps" operation to secondary OSD protocol 2021-03-16 12:48:36 +03:00
15636dd3a2 Add simplified interface to read blockstore bitmaps synchronously 2021-03-16 12:48:36 +03:00
f78a959544 Shorten some structure names 2021-03-16 12:48:36 +03:00
ac6dba8ddc Introduce data distribution locality 2021-03-16 12:48:36 +03:00
a8d744ca0e Fix wording 2021-03-16 12:48:36 +03:00
b5ff44fb6f Change Telegram chat link 2021-03-16 12:48:36 +03:00
f918bc4543 Fix Russian README for CMake build 2021-03-16 12:48:36 +03:00
6875a838e0 Capture all by value in qemu_proxy 2021-03-16 12:48:36 +03:00
20781abd3d Add LICENSE 2021-03-16 12:48:36 +03:00
1f02f645c0 Add Russian version of the README 2021-03-16 12:48:36 +03:00
ee44f64927 Introduce image names and metadata storage in etcd
Each inode has: image name, parent inode number & pool, size and readonly flag

Snapshots are created by switching image name to a different inode number
while using the older inode as parent.
2021-03-16 12:48:36 +03:00
abf0611d93 Use clean_entry_bitmap_size instead of entry_attr_size back because of changed bitmap handling 2021-03-16 12:48:36 +03:00
edbf0eb040 Add a test for snapshots, fix bugs. Now the test passes 2021-03-16 12:48:36 +03:00
09725038e7 Begin snapshot test 2021-03-16 12:48:36 +03:00
18f71b059a Fix part bitmap addresses 2021-03-16 12:48:36 +03:00
2db2ed22ea Fix several snapshot I/O bugs 2021-03-16 12:48:36 +03:00
aa7699da24 Fix subop generation for snapshot implementation 2021-03-16 12:48:36 +03:00
853ecba780 Actual snapshot support (untested) 2021-03-16 12:48:36 +03:00
2f9c76b8fc Report inode I/O statistics, aggregate it in the monitor 2021-03-16 12:48:36 +03:00
8da7f26459 Report inode space usage statistics to etcd, aggregate it in the monitor 2021-03-16 12:48:36 +03:00
9998b50c7e Add inode space usage statistics tracking to blockstore 2021-03-16 12:48:36 +03:00
0422d94a70 Send bitmaps with primary-reads, actually read bitmaps for READ ops 2021-03-16 12:48:36 +03:00
ff2208ae70 Allocate bitmaps along with stripes to avoid memory fragmentation 2021-03-16 12:48:36 +03:00
ae54dddb0c Remove cryptic bitmap inlining from bs_op_t and osd_op_t, use bitmap in primary OSD code 2021-03-16 12:48:36 +03:00
bfc175fe0f Add "external" bitmap support to the secondary OSD protocol 2021-03-16 12:48:36 +03:00
07e10210b6 Use bitmap granularity for alignment checks 2021-03-16 12:48:36 +03:00
221b728fc9 Add "external" bitmap support to blockstore 2021-03-16 12:48:36 +03:00
6625aaae00 Add "external" bitmap support to osd_rmw 2021-03-16 12:48:36 +03:00
128 changed files with 2993 additions and 11572 deletions

View File

@@ -2,6 +2,4 @@ cmake_minimum_required(VERSION 2.8)
project(vitastor)
set(VERSION "0.6.5")
add_subdirectory(src)

View File

@@ -45,23 +45,18 @@ Vitastor is currently in pre-release status
- Per-inode I/O and space usage statistics
- Inode naming via metadata storage in etcd
- Snapshots and copy-on-write clones
- Write throttling to smooth random write workloads in SSD+HDD configurations
- RDMA/RoCEv2 support via libibverbs
- CSI plugin for Kubernetes
- Basic OpenStack support: Cinder driver, Nova and libvirt patches
## Roadmap
## Development plans
- Snapshot deletion (layer merge) support
- Better disk partitioning and OSD auto-start scripts
- Other administrative tools
- Plugins for OpenNebula, Proxmox and other cloud systems
- Plugins for OpenStack, Kubernetes, OpenNebula, Proxmox and other cloud systems
- iSCSI proxy
- Faster failover
- Operation timeouts and better failure detection
- Background integrity checks without checksums (replica verification)
- Checksums
- SSD caching support (tiered storage)
- NVDIMM support
- Optimizations for hybrid SSD+HDD storage
- RDMA and NVDIMM support
- Web GUI
- Compression (possibly)
- Read caching via the system page cache (possibly)
@@ -315,15 +310,14 @@ Ceph:
### NBD
NBD is currently the only way to mount Vitastor via the Linux kernel, but it
leads to extra data copies and therefore slightly reduces performance -
mostly sequential performance, though; random access is barely affected.
NBD stands for "Network Block Device", but in fact it also works simply as a
FUSE equivalent for block devices, i.e. it implements a
"block device in userspace".
NBD is currently the only way to mount Vitastor via the Linux kernel.
NBD slightly reduces performance because it leads to extra data copies
between the kernel and userspace. Still, the method is reasonably optimal,
and random access performance is barely affected at all.
Vitastor with a single-threaded NBD proxy on the same hardware:
- T1Q1 write: 6000 iops (0.166ms latency)
- T1Q1 read: 5518 iops (0.18ms latency)
@@ -365,14 +359,14 @@ Vitastor with single-thread NBD on the same hardware
since 5.4 has at least one known bug that leads to a hang with io_uring and an HP SmartArray controller.
- Install liburing 0.4 or newer and its headers.
- Install lp_solve.
- Install etcd, at least version 3.4.15. Earlier versions won't work because of various bugs,
for example [#12402](https://github.com/etcd-io/etcd/pull/12402). You can also take version 3.4.13
with this specific fix from the release-3.4 branch of https://github.com/vitalif/etcd/.
- Install etcd. Attention: you need a version with the fix from https://github.com/vitalif/etcd/,
branch release-3.4, because etcd has a bug that [will](https://github.com/etcd-io/etcd/pull/12402)
only be fixed in 3.4.15. The bug makes Vitastor unable to start PGs when there are at least around 500 of them.
- Install node.js 10 or newer.
- Install gcc and g++ 8.x or newer.
- Clone this repository with submodules: `git clone https://yourcmc.ru/git/vitalif/vitastor/`.
- It's recommended to rebuild QEMU with the patch that makes LD_PRELOAD unnecessary.
See `patches/qemu-*.*-vitastor.patch` - pick the version closest to your QEMU version.
See `qemu-*.*-vitastor.patch` - pick the version closest to your QEMU version.
- Install QEMU 3.0 or newer, take the source of the installed package, start rebuilding it,
stop the build after a while and copy the following headers:
- `<qemu>/include` &rarr; `<vitastor>/qemu/include`
@@ -426,105 +420,23 @@ Vitastor with single-thread NBD on the same hardware
- Start all OSDs: `systemctl start vitastor.target`
- Your cluster should be ready - one of the monitors should already have configured the PGs, and the OSDs should have started them.
- You can check PG states directly in etcd: `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. All PGs should be 'active'.
### Name an image
```
etcdctl --endpoints=<etcd> put /vitastor/config/inode/<pool>/<inode> '{"name":"<name>","size":<size>[,"parent_id":<parent_inode_number>][,"readonly":true]}'
```
For example:
```
etcdctl --endpoints=http://10.115.0.10:2379/v3 put /vitastor/config/inode/1/1 '{"name":"testimg","size":2147483648}'
```
If you specify parent_id, the image becomes a CoW clone, i.e. all new write requests go to the new
inode, while read requests check it first and then the parent layers up the chain. To avoid accidentally
overwriting data in the parent layer, you can switch it to read-only mode by adding the `"readonly":true`
flag to its metadata entry. In that case the parent image simply becomes a snapshot.
Thus, to create a snapshot you just rename the previous inode (for example, from testimg to testimg@0),
make it read-only and create a new layer with the original image name (testimg) that references the
just-renamed one as its parent (see the sketch below).
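An illustrative sketch of this flow using the etcdctl command format shown above (it assumes the image
from the example lives in pool 1 as inode 1, and that inode 2 is a free inode number in that pool):
```
# 1. Rename the existing inode to testimg@0 and make it read-only:
etcdctl --endpoints=http://10.115.0.10:2379/v3 put /vitastor/config/inode/1/1 \
    '{"name":"testimg@0","size":2147483648,"readonly":true}'
# 2. Create a new top layer named testimg that references the renamed inode as its parent:
etcdctl --endpoints=http://10.115.0.10:2379/v3 put /vitastor/config/inode/1/2 \
    '{"name":"testimg","size":2147483648,"parent_id":1}'
```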
### Run fio benchmarks
An example command to run tests:
```
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -image=testimg
```
If you don't want to access the image by name, you can specify the pool number, inode number and size
instead of `-image=testimg`: `-pool=1 -inode=1 -size=400G`.
### Upload a VM disk image to/from Vitastor
Use qemu-img with the `vitastor:etcd_host=<HOST>:image=<IMAGE>` string as the disk file name. For example:
```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg'
```
Note that if you use unmodified QEMU, you'll need to set the environment variable
`LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so`.
If you don't want to access the image by name, you can specify the pool number, inode number and size
instead of `:image=<IMAGE>`: `:pool=<POOL>:inode=<INODE>:size=<SIZE>`.
### Start a VM
To start QEMU, use the option `-drive file=vitastor:etcd_host=<HOST>:image=<IMAGE>` (same as with qemu-img)
and a 4 KB physical block size.
For example:
```
qemu-system-x86_64 -enable-kvm -m 1024
-drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg',format=raw,if=none,id=drive-virtio-disk0,cache=none
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512
-vnc 0.0.0.0:0
```
Addressing by numbers (`:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>`) works the same as with qemu-img.
### Remove an image
Use the vitastor-rm tool. For example:
```
vitastor-rm --etcd_address 10.115.0.10:2379/v3 --pool 1 --inode 1 --parallel_osds 16 --iodepth 32
```
### NBD
To create a local block device, use NBD. For example:
```
vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg
```
The command prints a device name like /dev/nbd0, which you can then format
and use as a normal block device.
To address an image by inode number, as with the other commands, you can use the options
`--pool <POOL> --inode <INODE> --size <SIZE>` instead of `--image testimg`.
### Kubernetes
Vitastor has a CSI plugin for Kubernetes which supports RWO volumes.
To install it, take the manifests from the [csi/deploy/](csi/deploy/) directory, put your
Vitastor connection configuration into [csi/deploy/001-csi-config-map.yaml](001-csi-config-map.yaml),
configure the StorageClass in [csi/deploy/009-storage-class.yaml](009-storage-class.yaml)
and apply all `NNN-*.yaml` manifests to your Kubernetes installation:
```
for i in ./???-*.yaml; do kubectl apply -f $i; done
```
After that you'll be able to create PersistentVolumes. See the example in [csi/deploy/example-pvc.yaml](csi/deploy/example-pvc.yaml).
- An example command to run tests: `fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -pool=1 -inode=1 -size=400G`.
- An example command to upload a VM image into vitastor via qemu-img:
```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648'
```
If you use unmodified QEMU, this command will require the environment variable `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so`.
- An example command to run QEMU:
```
qemu-system-x86_64 -enable-kvm -m 1024
-drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648',format=raw,if=none,id=drive-virtio-disk0,cache=none
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512
-vnc 0.0.0.0:0
```
- An example command to remove an image (inode) from Vitastor:
```
vitastor-rm --etcd_address 10.115.0.10:2379/v3 --pool 1 --inode 1 --parallel_osds 16 --iodepth 32
```
## Known Problems

README.md
View File

@@ -39,23 +39,18 @@ breaking changes in the future. However, the following is implemented:
- Per-inode I/O and space usage statistics
- Inode metadata storage in etcd
- Snapshots and copy-on-write image clones
- Write throttling to smooth random write workloads in SSD+HDD configurations
- RDMA/RoCEv2 support via libibverbs
- CSI plugin for Kubernetes
- Basic OpenStack support: Cinder driver, Nova and libvirt patches
## Roadmap
- Snapshot deletion (layer merge) support
- Better OSD creation and auto-start tools
- Other administrative tools
- Plugins for OpenNebula, Proxmox and other cloud systems
- Plugins for OpenStack, Kubernetes, OpenNebula, Proxmox and other cloud systems
- iSCSI proxy
- Faster failover
- Operation timeouts and better failure detection
- Scrubbing without checksums (verification of replicas)
- Checksums
- Tiered storage
- NVDIMM support
- SSD+HDD optimizations, possibly including tiered storage and soft journal flushes
- RDMA and NVDIMM support
- Web GUI
- Compression (possibly)
- Read caching using system page cache (possibly)
@@ -320,9 +315,10 @@ Vitastor with single-thread NBD on the same hardware:
there is at least one known io_uring hang with 5.4 and an HP SmartArray controller.
- Install liburing 0.4 or newer and its headers.
- Install lp_solve.
- Install etcd, at least version 3.4.15. Earlier versions won't work because of various bugs,
for example [#12402](https://github.com/etcd-io/etcd/pull/12402). You can also take 3.4.13
with this specific fix from here: https://github.com/vitalif/etcd/, branch release-3.4.
- Install etcd. Attention: you need a fixed version from here: https://github.com/vitalif/etcd/,
branch release-3.4, because there is a bug in upstream etcd which makes Vitastor OSDs fail to
move PGs out of "starting" state if you have at least around ~500 PGs or so. The custom build
will be unnecessary when etcd merges the fix: https://github.com/etcd-io/etcd/pull/12402.
- Install node.js 10 or newer.
- Install gcc and g++ 8.x or newer.
- Clone https://yourcmc.ru/git/vitalif/vitastor/ with submodules.
@@ -340,7 +336,7 @@ Vitastor with single-thread NBD on the same hardware:
* For QEMU 2.0+: `<qemu>/qapi-types.h` &rarr; `<vitastor>/qemu/b/qemu/qapi-types.h`
- `config-host.h` and `qapi` are required because they contain generated headers
- You can also rebuild QEMU with a patch that makes LD_PRELOAD unnecessary for loading the vitastor driver.
See `patches/qemu-*.*-vitastor.patch`.
See `qemu-*.*-vitastor.patch`.
- Install fio 3.7 or later, get its source and symlink it into `<vitastor>/fio`.
- Build & install Vitastor with `mkdir build && cd build && cmake .. && make -j8 && make install`.
Pay attention to the `QEMU_PLUGINDIR` cmake option - it must be set to `qemu-kvm` on RHEL.
@@ -380,101 +376,24 @@ and calculate disk offsets almost by hand. This will be fixed in the near future.
For jerasure pools the configuration should look like the following: `2:{"name":"ecpool","scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}`.
- At this point, one of the monitors will configure PGs and OSDs will start them.
- You can check PG states with `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. All PGs should become 'active'.
### Name an image
```
etcdctl --endpoints=<etcd> put /vitastor/config/inode/<pool>/<inode> '{"name":"<name>","size":<size>[,"parent_id":<parent_inode_number>][,"readonly":true]}'
```
For example:
```
etcdctl --endpoints=http://10.115.0.10:2379/v3 put /vitastor/config/inode/1/1 '{"name":"testimg","size":2147483648}'
```
If you specify parent_id, the image becomes a CoW clone, i.e. all writes go to the new inode, while reads
first check it and then the parent layers up the chain. You can then make the parent read-only by updating
its entry with `"readonly":true` for safety and basically treat it as a snapshot.
So to create a snapshot you simply rename the previous top layer (for example from testimg to testimg@0),
make it read-only and create a new top layer with the original name (testimg) and the previous one as its parent.
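To check the result, you can read the metadata back with a prefix query, just like the PG states earlier
(pool 1 is assumed here, matching the example); the output lists the whole layer chain:
```
etcdctl --endpoints=http://10.115.0.10:2379/v3 get --prefix /vitastor/config/inode/1/
```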
### Run fio benchmarks
fio command example:
```
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -image=testimg
```
If you don't want to access your image by name, you can specify pool number, inode number and size
(`-pool=1 -inode=1 -size=400G`) instead of the image name (`-image=testimg`).
### Upload VM image
Use qemu-img and `vitastor:etcd_host=<HOST>:image=<IMAGE>` disk filename. For example:
```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg'
```
Note that the command must be run as `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so qemu-img ...`
if you use unmodified QEMU.
You can also specify `:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>`
if you don't want to use inode metadata.
### Start a VM
Run QEMU with `-drive file=vitastor:etcd_host=<HOST>:image=<IMAGE>` and use 4 KB physical block size.
For example:
```
qemu-system-x86_64 -enable-kvm -m 1024
-drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg',format=raw,if=none,id=drive-virtio-disk0,cache=none
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512
-vnc 0.0.0.0:0
```
You can also specify `:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>`,
just like in qemu-img.
### Remove inode
Use vitastor-rm. For example:
```
vitastor-rm --etcd_address 10.115.0.10:2379/v3 --pool 1 --inode 1 --parallel_osds 16 --iodepth 32
```
### NBD
To create a local block device for a Vitastor image, use NBD. For example:
```
vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg
```
It will output the device name, like /dev/nbd0, which you can then format and mount as a normal block device.
Again, you can use `--pool <POOL> --inode <INODE> --size <SIZE>` instead of `--image <IMAGE>` if you want.
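A quick illustrative sketch, assuming the map command printed /dev/nbd0 and you want an ext4 filesystem on it:
```
mkfs.ext4 /dev/nbd0   # create a filesystem on the NBD-mapped Vitastor image
mount /dev/nbd0 /mnt  # use it like any other block device
```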
### Kubernetes
Vitastor has a CSI plugin for Kubernetes which supports RWO volumes.
To deploy it, take manifests from [csi/deploy/](csi/deploy/) directory, put your
Vitastor configuration in [csi/deploy/001-csi-config-map.yaml](001-csi-config-map.yaml),
configure storage class in [csi/deploy/009-storage-class.yaml](009-storage-class.yaml)
and apply all `NNN-*.yaml` manifests to your Kubernetes installation:
```
for i in ./???-*.yaml; do kubectl apply -f $i; done
```
After that you'll be able to create PersistentVolumes. See example in [csi/deploy/example-pvc.yaml](csi/deploy/example-pvc.yaml).
- Run tests with (for example): `fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -pool=1 -inode=1 -size=400G`.
- Upload VM disk image with qemu-img (for example):
```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648'
```
Note that the command must be run as `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so qemu-img ...`
if you use unmodified QEMU.
- Run QEMU with (for example):
```
qemu-system-x86_64 -enable-kvm -m 1024
-drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648',format=raw,if=none,id=drive-virtio-disk0,cache=none
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512
-vnc 0.0.0.0:0
```
- Remove inode with (for example):
```
vitastor-rm --etcd_address 10.115.0.10:2379/v3 --pool 1 --inode 1 --parallel_osds 16 --iodepth 32
```
## Known Problems

View File

@@ -1,3 +0,0 @@
vitastor-csi
go.sum
Dockerfile

View File

@@ -1,32 +0,0 @@
# Compile stage
FROM golang:buster AS build
ADD go.mod /app/
RUN cd /app; CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go mod download -x
ADD . /app
RUN perl -i -e '$/ = undef; while(<>) { s/\n\s*(\{\s*\n)/$1\n/g; s/\}(\s*\n\s*)else\b/$1} else/g; print; }' `find /app -name '*.go'`
RUN cd /app; CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go build -o vitastor-csi
# Final stage
FROM debian:buster
LABEL maintainers="Vitaliy Filippov <vitalif@yourcmc.ru>"
LABEL description="Vitastor CSI Driver"
ENV NODE_ID=""
ENV CSI_ENDPOINT=""
RUN apt-get update && \
apt-get install -y wget && \
wget -q -O /etc/apt/trusted.gpg.d/vitastor.gpg https://vitastor.io/debian/pubkey.gpg && \
(echo deb http://vitastor.io/debian buster main > /etc/apt/sources.list.d/vitastor.list) && \
(echo deb http://deb.debian.org/debian buster-backports main > /etc/apt/sources.list.d/backports.list) && \
(echo "APT::Install-Recommends false;" > /etc/apt/apt.conf) && \
apt-get update && \
apt-get install -y e2fsprogs xfsprogs vitastor kmod && \
apt-get clean && \
(echo options nbd nbds_max=128 > /etc/modprobe.d/nbd.conf)
COPY --from=build /app/vitastor-csi /bin/
ENTRYPOINT ["/bin/vitastor-csi"]

View File

@@ -1,9 +0,0 @@
VERSION ?= v0.6.5
all: build push
build:
@docker build --rm -t vitalif/vitastor-csi:$(VERSION) .
push:
@docker push vitalif/vitastor-csi:$(VERSION)

View File

@@ -1,5 +0,0 @@
---
apiVersion: v1
kind: Namespace
metadata:
name: vitastor-system

View File

@@ -1,9 +0,0 @@
---
apiVersion: v1
kind: ConfigMap
data:
vitastor.conf: |-
{"etcd_address":"http://192.168.7.2:2379","etcd_prefix":"/vitastor"}
metadata:
namespace: vitastor-system
name: vitastor-config

View File

@@ -1,37 +0,0 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get"]
# allow to read Vault Token and connection options from the Tenants namespace
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin
subjects:
- kind: ServiceAccount
name: vitastor-csi-nodeplugin
namespace: vitastor-system
roleRef:
kind: ClusterRole
name: vitastor-csi-nodeplugin
apiGroup: rbac.authorization.k8s.io

View File

@@ -1,72 +0,0 @@
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin-psp
spec:
allowPrivilegeEscalation: true
allowedCapabilities:
- 'SYS_ADMIN'
fsGroup:
rule: RunAsAny
privileged: true
hostNetwork: true
hostPID: true
runAsUser:
rule: RunAsAny
seLinux:
rule: RunAsAny
supplementalGroups:
rule: RunAsAny
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
- 'hostPath'
allowedHostPaths:
- pathPrefix: '/dev'
readOnly: false
- pathPrefix: '/run/mount'
readOnly: false
- pathPrefix: '/sys'
readOnly: false
- pathPrefix: '/lib/modules'
readOnly: true
- pathPrefix: '/var/lib/kubelet/pods'
readOnly: false
- pathPrefix: '/var/lib/kubelet/plugins/csi.vitastor.io'
readOnly: false
- pathPrefix: '/var/lib/kubelet/plugins_registry'
readOnly: false
- pathPrefix: '/var/lib/kubelet/plugins'
readOnly: false
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin-psp
rules:
- apiGroups: ['policy']
resources: ['podsecuritypolicies']
verbs: ['use']
resourceNames: ['vitastor-csi-nodeplugin-psp']
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin-psp
subjects:
- kind: ServiceAccount
name: vitastor-csi-nodeplugin
namespace: vitastor-system
roleRef:
kind: Role
name: vitastor-csi-nodeplugin-psp
apiGroup: rbac.authorization.k8s.io

View File

@@ -1,140 +0,0 @@
---
kind: DaemonSet
apiVersion: apps/v1
metadata:
namespace: vitastor-system
name: csi-vitastor
spec:
selector:
matchLabels:
app: csi-vitastor
template:
metadata:
namespace: vitastor-system
labels:
app: csi-vitastor
spec:
serviceAccountName: vitastor-csi-nodeplugin
hostNetwork: true
hostPID: true
priorityClassName: system-node-critical
# to use e.g. Rook orchestrated cluster, and mons' FQDN is
# resolved through k8s service, set dns policy to cluster first
dnsPolicy: ClusterFirstWithHostNet
containers:
- name: driver-registrar
# This is necessary only for systems with SELinux, where
# non-privileged sidecar containers cannot access unix domain socket
# created by privileged CSI driver container.
securityContext:
privileged: true
image: k8s.gcr.io/sig-storage/csi-node-driver-registrar:v2.2.0
args:
- "--v=5"
- "--csi-address=/csi/csi.sock"
- "--kubelet-registration-path=/var/lib/kubelet/plugins/csi.vitastor.io/csi.sock"
env:
- name: KUBE_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: registration-dir
mountPath: /registration
- name: csi-vitastor
securityContext:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
allowPrivilegeEscalation: true
image: vitalif/vitastor-csi:v0.6.5
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"
env:
- name: NODE_ID
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: CSI_ENDPOINT
value: unix:///csi/csi.sock
imagePullPolicy: "IfNotPresent"
ports:
- containerPort: 9898
name: healthz
protocol: TCP
livenessProbe:
failureThreshold: 5
httpGet:
path: /healthz
port: healthz
initialDelaySeconds: 10
timeoutSeconds: 3
periodSeconds: 2
volumeMounts:
- name: socket-dir
mountPath: /csi
- mountPath: /dev
name: host-dev
- mountPath: /sys
name: host-sys
- mountPath: /run/mount
name: host-mount
- mountPath: /lib/modules
name: lib-modules
readOnly: true
- name: vitastor-config
mountPath: /etc/vitastor
- name: plugin-dir
mountPath: /var/lib/kubelet/plugins
mountPropagation: "Bidirectional"
- name: mountpoint-dir
mountPath: /var/lib/kubelet/pods
mountPropagation: "Bidirectional"
- name: liveness-probe
securityContext:
privileged: true
image: quay.io/k8scsi/livenessprobe:v1.1.0
args:
- "--csi-address=$(CSI_ENDPOINT)"
- "--health-port=9898"
env:
- name: CSI_ENDPOINT
value: unix://csi/csi.sock
volumeMounts:
- mountPath: /csi
name: socket-dir
volumes:
- name: socket-dir
hostPath:
path: /var/lib/kubelet/plugins/csi.vitastor.io
type: DirectoryOrCreate
- name: plugin-dir
hostPath:
path: /var/lib/kubelet/plugins
type: Directory
- name: mountpoint-dir
hostPath:
path: /var/lib/kubelet/pods
type: DirectoryOrCreate
- name: registration-dir
hostPath:
path: /var/lib/kubelet/plugins_registry/
type: Directory
- name: host-dev
hostPath:
path: /dev
- name: host-sys
hostPath:
path: /sys
- name: host-mount
hostPath:
path: /run/mount
- name: lib-modules
hostPath:
path: /lib/modules
- name: vitastor-config
configMap:
name: vitastor-config

View File

@@ -1,102 +0,0 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-external-provisioner-runner
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["events"]
verbs: ["list", "watch", "create", "update", "patch"]
- apiGroups: [""]
resources: ["persistentvolumes"]
verbs: ["get", "list", "watch", "create", "update", "delete", "patch"]
- apiGroups: [""]
resources: ["persistentvolumeclaims"]
verbs: ["get", "list", "watch", "update"]
- apiGroups: [""]
resources: ["persistentvolumeclaims/status"]
verbs: ["update", "patch"]
- apiGroups: ["storage.k8s.io"]
resources: ["storageclasses"]
verbs: ["get", "list", "watch"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshots"]
verbs: ["get", "list"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshotcontents"]
verbs: ["create", "get", "list", "watch", "update", "delete"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshotclasses"]
verbs: ["get", "list", "watch"]
- apiGroups: ["storage.k8s.io"]
resources: ["volumeattachments"]
verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: ["storage.k8s.io"]
resources: ["volumeattachments/status"]
verbs: ["patch"]
- apiGroups: ["storage.k8s.io"]
resources: ["csinodes"]
verbs: ["get", "list", "watch"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshotcontents/status"]
verbs: ["update"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner-role
subjects:
- kind: ServiceAccount
name: vitastor-csi-provisioner
namespace: vitastor-system
roleRef:
kind: ClusterRole
name: vitastor-external-provisioner-runner
apiGroup: rbac.authorization.k8s.io
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-external-provisioner-cfg
rules:
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "list", "watch", "create", "update", "delete"]
- apiGroups: ["coordination.k8s.io"]
resources: ["leases"]
verbs: ["get", "watch", "list", "delete", "update", "create"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: vitastor-csi-provisioner-role-cfg
namespace: vitastor-system
subjects:
- kind: ServiceAccount
name: vitastor-csi-provisioner
namespace: vitastor-system
roleRef:
kind: Role
name: vitastor-external-provisioner-cfg
apiGroup: rbac.authorization.k8s.io

View File

@@ -1,60 +0,0 @@
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner-psp
spec:
allowPrivilegeEscalation: true
allowedCapabilities:
- 'SYS_ADMIN'
fsGroup:
rule: RunAsAny
privileged: true
runAsUser:
rule: RunAsAny
seLinux:
rule: RunAsAny
supplementalGroups:
rule: RunAsAny
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
- 'hostPath'
allowedHostPaths:
- pathPrefix: '/dev'
readOnly: false
- pathPrefix: '/sys'
readOnly: false
- pathPrefix: '/lib/modules'
readOnly: true
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner-psp
rules:
- apiGroups: ['policy']
resources: ['podsecuritypolicies']
verbs: ['use']
resourceNames: ['vitastor-csi-provisioner-psp']
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: vitastor-csi-provisioner-psp
namespace: vitastor-system
subjects:
- kind: ServiceAccount
name: vitastor-csi-provisioner
namespace: vitastor-system
roleRef:
kind: Role
name: vitastor-csi-provisioner-psp
apiGroup: rbac.authorization.k8s.io

View File

@@ -1,159 +0,0 @@
---
kind: Service
apiVersion: v1
metadata:
namespace: vitastor-system
name: csi-vitastor-provisioner
labels:
app: csi-metrics
spec:
selector:
app: csi-vitastor-provisioner
ports:
- name: http-metrics
port: 8080
protocol: TCP
targetPort: 8680
---
kind: Deployment
apiVersion: apps/v1
metadata:
namespace: vitastor-system
name: csi-vitastor-provisioner
spec:
replicas: 3
selector:
matchLabels:
app: csi-vitastor-provisioner
template:
metadata:
namespace: vitastor-system
labels:
app: csi-vitastor-provisioner
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- csi-vitastor-provisioner
topologyKey: "kubernetes.io/hostname"
serviceAccountName: vitastor-csi-provisioner
priorityClassName: system-cluster-critical
containers:
- name: csi-provisioner
image: k8s.gcr.io/sig-storage/csi-provisioner:v2.2.0
args:
- "--csi-address=$(ADDRESS)"
- "--v=5"
- "--timeout=150s"
- "--retry-interval-start=500ms"
- "--leader-election=true"
# set it to true to use topology based provisioning
- "--feature-gates=Topology=false"
# if fstype is not specified in storageclass, ext4 is default
- "--default-fstype=ext4"
- "--extra-create-metadata=true"
env:
- name: ADDRESS
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-snapshotter
image: k8s.gcr.io/sig-storage/csi-snapshotter:v4.0.0
args:
- "--csi-address=$(ADDRESS)"
- "--v=5"
- "--timeout=150s"
- "--leader-election=true"
env:
- name: ADDRESS
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
securityContext:
privileged: true
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-attacher
image: k8s.gcr.io/sig-storage/csi-attacher:v3.1.0
args:
- "--v=5"
- "--csi-address=$(ADDRESS)"
- "--leader-election=true"
- "--retry-interval-start=500ms"
env:
- name: ADDRESS
value: /csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-resizer
image: k8s.gcr.io/sig-storage/csi-resizer:v1.1.0
args:
- "--csi-address=$(ADDRESS)"
- "--v=5"
- "--timeout=150s"
- "--leader-election"
- "--retry-interval-start=500ms"
- "--handle-volume-inuse-error=false"
env:
- name: ADDRESS
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-vitastor
securityContext:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
image: vitalif/vitastor-csi:v0.6.5
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"
env:
- name: NODE_ID
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: CSI_ENDPOINT
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- mountPath: /dev
name: host-dev
- mountPath: /sys
name: host-sys
- mountPath: /lib/modules
name: lib-modules
readOnly: true
- name: vitastor-config
mountPath: /etc/vitastor
volumes:
- name: host-dev
hostPath:
path: /dev
- name: host-sys
hostPath:
path: /sys
- name: lib-modules
hostPath:
path: /lib/modules
- name: socket-dir
emptyDir: {
medium: "Memory"
}
- name: vitastor-config
configMap:
name: vitastor-config

View File

@@ -1,11 +0,0 @@
---
# if Kubernetes version is less than 1.18 change
# apiVersion to storage.k8s.io/v1betav1
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
namespace: vitastor-system
name: csi.vitastor.io
spec:
attachRequired: true
podInfoOnMount: false

View File

@@ -1,19 +0,0 @@
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
namespace: vitastor-system
name: vitastor
annotations:
storageclass.kubernetes.io/is-default-class: "true"
provisioner: csi.vitastor.io
volumeBindingMode: Immediate
parameters:
etcdVolumePrefix: ""
poolId: "1"
# you can choose other configuration file if you have it in the config map
#configPath: "/etc/vitastor/vitastor.conf"
# you can also specify etcdUrl here, maybe to connect to another Vitastor cluster
# multiple etcdUrls may be specified, delimited by comma
#etcdUrl: "http://192.168.7.2:2379"
#etcdPrefix: "/vitastor"

View File

@@ -1,12 +0,0 @@
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-vitastor-pvc
spec:
storageClassName: vitastor
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi

View File

@@ -1,35 +0,0 @@
module vitastor.io/csi
go 1.15
require (
github.com/container-storage-interface/spec v1.4.0
github.com/coreos/bbolt v0.0.0-00010101000000-000000000000 // indirect
github.com/coreos/etcd v3.3.25+incompatible // indirect
github.com/coreos/go-semver v0.3.0 // indirect
github.com/coreos/go-systemd v0.0.0-20191104093116-d3cd4ed1dbcf // indirect
github.com/coreos/pkg v0.0.0-20180928190104-399ea9e2e55f // indirect
github.com/dustin/go-humanize v1.0.0 // indirect
github.com/golang/glog v0.0.0-20160126235308-23def4e6c14b
github.com/gorilla/websocket v1.4.2 // indirect
github.com/grpc-ecosystem/go-grpc-middleware v1.3.0 // indirect
github.com/grpc-ecosystem/go-grpc-prometheus v1.2.0 // indirect
github.com/grpc-ecosystem/grpc-gateway v1.16.0 // indirect
github.com/jonboulle/clockwork v0.2.2 // indirect
github.com/kubernetes-csi/csi-lib-utils v0.9.1
github.com/soheilhy/cmux v0.1.5 // indirect
github.com/tmc/grpc-websocket-proxy v0.0.0-20201229170055-e5319fda7802 // indirect
github.com/xiang90/probing v0.0.0-20190116061207-43a291ad63a2 // indirect
go.etcd.io/bbolt v0.0.0-00010101000000-000000000000 // indirect
go.etcd.io/etcd v3.3.25+incompatible
golang.org/x/net v0.0.0-20201202161906-c7110b5ffcbb
google.golang.org/grpc v1.33.1
k8s.io/klog v1.0.0
k8s.io/utils v0.0.0-20210305010621-2afb4311ab10
)
replace github.com/coreos/bbolt => go.etcd.io/bbolt v1.3.5
replace go.etcd.io/bbolt => github.com/coreos/bbolt v1.3.5
replace google.golang.org/grpc => google.golang.org/grpc v1.25.1

View File

@@ -1,22 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
const (
vitastorCSIDriverName = "csi.vitastor.io"
vitastorCSIDriverVersion = "0.6.5"
)
// Config struct fills the parameters of request or user input
type Config struct
{
Endpoint string
NodeID string
}
// NewConfig returns a config struct used to initialize a new driver
func NewConfig() *Config
{
return &Config{}
}

View File

@@ -1,530 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"context"
"encoding/json"
"strings"
"bytes"
"strconv"
"time"
"fmt"
"os"
"os/exec"
"io/ioutil"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
"k8s.io/klog"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
"go.etcd.io/etcd/clientv3"
"github.com/container-storage-interface/spec/lib/go/csi"
)
const (
KB int64 = 1024
MB int64 = 1024 * KB
GB int64 = 1024 * MB
TB int64 = 1024 * GB
ETCD_TIMEOUT time.Duration = 15*time.Second
)
type InodeIndex struct
{
Id uint64 `json:"id"`
PoolId uint64 `json:"pool_id"`
}
type InodeConfig struct
{
Name string `json:"name"`
Size uint64 `json:"size,omitempty"`
ParentPool uint64 `json:"parent_pool,omitempty"`
ParentId uint64 `json:"parent_id,omitempty"`
Readonly bool `json:"readonly,omitempty"`
}
type ControllerServer struct
{
*Driver
}
// NewControllerServer creates a new controller server instance
func NewControllerServer(driver *Driver) *ControllerServer
{
return &ControllerServer{
Driver: driver,
}
}
func GetConnectionParams(params map[string]string) (map[string]string, []string, string)
{
ctxVars := make(map[string]string)
configPath := params["configPath"]
if (configPath == "")
{
configPath = "/etc/vitastor/vitastor.conf"
}
else
{
ctxVars["configPath"] = configPath
}
config := make(map[string]interface{})
if configFD, err := os.Open(configPath); err == nil
{
defer configFD.Close()
data, _ := ioutil.ReadAll(configFD)
json.Unmarshal(data, &config)
}
// Try to load prefix & etcd URL from the config
var etcdUrl []string
if (params["etcdUrl"] != "")
{
ctxVars["etcdUrl"] = params["etcdUrl"]
etcdUrl = strings.Split(params["etcdUrl"], ",")
}
if (len(etcdUrl) == 0)
{
switch config["etcd_address"].(type)
{
case string:
etcdUrl = strings.Split(config["etcd_address"].(string), ",")
case []string:
etcdUrl = config["etcd_address"].([]string)
}
}
etcdPrefix := params["etcdPrefix"]
if (etcdPrefix == "")
{
etcdPrefix, _ = config["etcd_prefix"].(string)
if (etcdPrefix == "")
{
etcdPrefix = "/vitastor"
}
}
else
{
ctxVars["etcdPrefix"] = etcdPrefix
}
return ctxVars, etcdUrl, etcdPrefix
}
// Create the volume
func (cs *ControllerServer) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error)
{
klog.Infof("received controller create volume request %+v", protosanitizer.StripSecrets(req))
if (req == nil)
{
return nil, status.Errorf(codes.InvalidArgument, "request cannot be empty")
}
if (req.GetName() == "")
{
return nil, status.Error(codes.InvalidArgument, "name is a required field")
}
volumeCapabilities := req.GetVolumeCapabilities()
if (volumeCapabilities == nil)
{
return nil, status.Error(codes.InvalidArgument, "volume capabilities is a required field")
}
etcdVolumePrefix := req.Parameters["etcdVolumePrefix"]
poolId, _ := strconv.ParseUint(req.Parameters["poolId"], 10, 64)
if (poolId == 0)
{
return nil, status.Error(codes.InvalidArgument, "poolId is missing in storage class configuration")
}
volName := etcdVolumePrefix + req.GetName()
volSize := 1 * GB
if capRange := req.GetCapacityRange(); capRange != nil
{
volSize = ((capRange.GetRequiredBytes() + MB - 1) / MB) * MB
}
// FIXME: The following should PROBABLY be implemented externally in a management tool
ctxVars, etcdUrl, etcdPrefix := GetConnectionParams(req.Parameters)
if (len(etcdUrl) == 0)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
}
// Connect to etcd
cli, err := clientv3.New(clientv3.Config{
DialTimeout: ETCD_TIMEOUT,
Endpoints: etcdUrl,
})
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to connect to etcd at "+strings.Join(etcdUrl, ",")+": "+err.Error())
}
defer cli.Close()
var imageId uint64 = 0
for
{
// Check if the image exists
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, etcdPrefix+"/index/image/"+volName)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) > 0)
{
kv := resp.Kvs[0]
var v InodeIndex
err := json.Unmarshal(kv.Value, &v)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid /index/image/"+volName+" key in etcd: "+err.Error())
}
poolId = v.PoolId
imageId = v.Id
inodeCfgKey := fmt.Sprintf("/config/inode/%d/%d", poolId, imageId)
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, etcdPrefix+inodeCfgKey)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) == 0)
{
return nil, status.Error(codes.Internal, "missing "+inodeCfgKey+" key in etcd")
}
var inodeCfg InodeConfig
err = json.Unmarshal(resp.Kvs[0].Value, &inodeCfg)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid "+inodeCfgKey+" key in etcd: "+err.Error())
}
if (inodeCfg.Size < uint64(volSize))
{
return nil, status.Error(codes.Internal, "image "+volName+" is already created, but size is less than expected")
}
}
else
{
// Find a free ID
// Create image metadata in a transaction verifying that the image doesn't exist yet AND ID is still free
maxIdKey := fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId)
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, maxIdKey)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
var modRev int64
var nextId uint64
if (len(resp.Kvs) > 0)
{
var err error
nextId, err = strconv.ParseUint(string(resp.Kvs[0].Value), 10, 64)
if (err != nil)
{
return nil, status.Error(codes.Internal, maxIdKey+" contains invalid ID")
}
modRev = resp.Kvs[0].ModRevision
nextId++
}
else
{
nextId = 1
}
inodeIdxJson, _ := json.Marshal(InodeIndex{
Id: nextId,
PoolId: poolId,
})
inodeCfgJson, _ := json.Marshal(InodeConfig{
Name: volName,
Size: uint64(volSize),
})
ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
txnResp, err := cli.Txn(ctx).If(
clientv3.Compare(clientv3.ModRevision(fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId)), "=", modRev),
clientv3.Compare(clientv3.CreateRevision(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName)), "=", 0),
clientv3.Compare(clientv3.CreateRevision(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, poolId, nextId)), "=", 0),
).Then(
clientv3.OpPut(fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId), fmt.Sprintf("%d", nextId)),
clientv3.OpPut(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName), string(inodeIdxJson)),
clientv3.OpPut(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, poolId, nextId), string(inodeCfgJson)),
).Commit()
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to commit transaction in etcd: "+err.Error())
}
if (txnResp.Succeeded)
{
imageId = nextId
break
}
// Start over if the transaction fails
}
}
ctxVars["name"] = volName
volumeIdJson, _ := json.Marshal(ctxVars)
return &csi.CreateVolumeResponse{
Volume: &csi.Volume{
// Ugly, but VolumeContext isn't passed to DeleteVolume :-(
VolumeId: string(volumeIdJson),
CapacityBytes: volSize,
},
}, nil
}
// DeleteVolume deletes the given volume
func (cs *ControllerServer) DeleteVolume(ctx context.Context, req *csi.DeleteVolumeRequest) (*csi.DeleteVolumeResponse, error)
{
klog.Infof("received controller delete volume request %+v", protosanitizer.StripSecrets(req))
if (req == nil)
{
return nil, status.Error(codes.InvalidArgument, "request cannot be empty")
}
ctxVars := make(map[string]string)
err := json.Unmarshal([]byte(req.VolumeId), &ctxVars)
if (err != nil)
{
return nil, status.Error(codes.Internal, "volume ID not in JSON format")
}
volName := ctxVars["name"]
_, etcdUrl, etcdPrefix := GetConnectionParams(ctxVars)
if (len(etcdUrl) == 0)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
}
cli, err := clientv3.New(clientv3.Config{
DialTimeout: ETCD_TIMEOUT,
Endpoints: etcdUrl,
})
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to connect to etcd at "+strings.Join(etcdUrl, ",")+": "+err.Error())
}
defer cli.Close()
// Find inode by name
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, etcdPrefix+"/index/image/"+volName)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) == 0)
{
return nil, status.Error(codes.NotFound, "volume "+volName+" does not exist")
}
var idx InodeIndex
err = json.Unmarshal(resp.Kvs[0].Value, &idx)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid /index/image/"+volName+" key in etcd: "+err.Error())
}
// Get inode config
inodeCfgKey := fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, idx.PoolId, idx.Id)
ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err = cli.Get(ctx, inodeCfgKey)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) == 0)
{
return nil, status.Error(codes.NotFound, "volume "+volName+" does not exist")
}
var inodeCfg InodeConfig
err = json.Unmarshal(resp.Kvs[0].Value, &inodeCfg)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid "+inodeCfgKey+" key in etcd: "+err.Error())
}
// Delete inode data by invoking vitastor-rm
args := []string{
"--etcd_address", strings.Join(etcdUrl, ","),
"--pool", fmt.Sprintf("%d", idx.PoolId),
"--inode", fmt.Sprintf("%d", idx.Id),
}
if (ctxVars["configPath"] != "")
{
args = append(args, "--config_path", ctxVars["configPath"])
}
c := exec.Command("/usr/bin/vitastor-rm", args...)
var stderr bytes.Buffer
c.Stdout = nil
c.Stderr = &stderr
err = c.Run()
stderrStr := string(stderr.Bytes())
if (err != nil)
{
klog.Errorf("vitastor-rm failed: %s, status %s\n", stderrStr, err)
return nil, status.Error(codes.Internal, stderrStr+" (status "+err.Error()+")")
}
// Delete inode config in etcd
ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
txnResp, err := cli.Txn(ctx).Then(
clientv3.OpDelete(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName)),
clientv3.OpDelete(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, idx.PoolId, idx.Id)),
).Commit()
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to delete keys in etcd: "+err.Error())
}
if (!txnResp.Succeeded)
{
return nil, status.Error(codes.Internal, "failed to delete keys in etcd: transaction failed")
}
return &csi.DeleteVolumeResponse{}, nil
}
// ControllerPublishVolume returns an Unimplemented error
func (cs *ControllerServer) ControllerPublishVolume(ctx context.Context, req *csi.ControllerPublishVolumeRequest) (*csi.ControllerPublishVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerUnpublishVolume returns an Unimplemented error
func (cs *ControllerServer) ControllerUnpublishVolume(ctx context.Context, req *csi.ControllerUnpublishVolumeRequest) (*csi.ControllerUnpublishVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ValidateVolumeCapabilities checks whether the volume capabilities requested are supported.
func (cs *ControllerServer) ValidateVolumeCapabilities(ctx context.Context, req *csi.ValidateVolumeCapabilitiesRequest) (*csi.ValidateVolumeCapabilitiesResponse, error)
{
klog.Infof("received controller validate volume capability request %+v", protosanitizer.StripSecrets(req))
if (req == nil)
{
return nil, status.Errorf(codes.InvalidArgument, "request is nil")
}
volumeID := req.GetVolumeId()
if (volumeID == "")
{
return nil, status.Error(codes.InvalidArgument, "volumeId is nil")
}
volumeCapabilities := req.GetVolumeCapabilities()
if (volumeCapabilities == nil)
{
return nil, status.Error(codes.InvalidArgument, "volumeCapabilities is nil")
}
var volumeCapabilityAccessModes []*csi.VolumeCapability_AccessMode
for _, mode := range []csi.VolumeCapability_AccessMode_Mode{
csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER,
} {
volumeCapabilityAccessModes = append(volumeCapabilityAccessModes, &csi.VolumeCapability_AccessMode{Mode: mode})
}
capabilitySupport := false
for _, capability := range volumeCapabilities
{
for _, volumeCapabilityAccessMode := range volumeCapabilityAccessModes
{
if (volumeCapabilityAccessMode.Mode == capability.AccessMode.Mode)
{
capabilitySupport = true
}
}
}
if (!capabilitySupport)
{
return nil, status.Errorf(codes.NotFound, "%v not supported", req.GetVolumeCapabilities())
}
return &csi.ValidateVolumeCapabilitiesResponse{
Confirmed: &csi.ValidateVolumeCapabilitiesResponse_Confirmed{
VolumeCapabilities: req.VolumeCapabilities,
},
}, nil
}
// ListVolumes returns a list of volumes
func (cs *ControllerServer) ListVolumes(ctx context.Context, req *csi.ListVolumesRequest) (*csi.ListVolumesResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// GetCapacity returns the capacity of the storage pool
func (cs *ControllerServer) GetCapacity(ctx context.Context, req *csi.GetCapacityRequest) (*csi.GetCapacityResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerGetCapabilities returns the capabilities of the controller service.
func (cs *ControllerServer) ControllerGetCapabilities(ctx context.Context, req *csi.ControllerGetCapabilitiesRequest) (*csi.ControllerGetCapabilitiesResponse, error)
{
functionControllerServerCapabilities := func(cap csi.ControllerServiceCapability_RPC_Type) *csi.ControllerServiceCapability
{
return &csi.ControllerServiceCapability{
Type: &csi.ControllerServiceCapability_Rpc{
Rpc: &csi.ControllerServiceCapability_RPC{
Type: cap,
},
},
}
}
var controllerServerCapabilities []*csi.ControllerServiceCapability
for _, capability := range []csi.ControllerServiceCapability_RPC_Type{
csi.ControllerServiceCapability_RPC_CREATE_DELETE_VOLUME,
csi.ControllerServiceCapability_RPC_LIST_VOLUMES,
csi.ControllerServiceCapability_RPC_EXPAND_VOLUME,
csi.ControllerServiceCapability_RPC_CREATE_DELETE_SNAPSHOT,
} {
controllerServerCapabilities = append(controllerServerCapabilities, functionControllerServerCapabilities(capability))
}
return &csi.ControllerGetCapabilitiesResponse{
Capabilities: controllerServerCapabilities,
}, nil
}
// CreateSnapshot creates a snapshot of an existing PV
func (cs *ControllerServer) CreateSnapshot(ctx context.Context, req *csi.CreateSnapshotRequest) (*csi.CreateSnapshotResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// DeleteSnapshot deletes the provided snapshot of a PV
func (cs *ControllerServer) DeleteSnapshot(ctx context.Context, req *csi.DeleteSnapshotRequest) (*csi.DeleteSnapshotResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ListSnapshots lists the snapshots of a PV
func (cs *ControllerServer) ListSnapshots(ctx context.Context, req *csi.ListSnapshotsRequest) (*csi.ListSnapshotsResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerExpandVolume resizes a volume
func (cs *ControllerServer) ControllerExpandVolume(ctx context.Context, req *csi.ControllerExpandVolumeRequest) (*csi.ControllerExpandVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerGetVolume gets volume info
func (cs *ControllerServer) ControllerGetVolume(ctx context.Context, req *csi.ControllerGetVolumeRequest) (*csi.ControllerGetVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}

View File

@@ -1,137 +0,0 @@
/*
Copyright 2017 The Kubernetes Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package vitastor
import (
"fmt"
"net"
"os"
"strings"
"sync"
"github.com/golang/glog"
"golang.org/x/net/context"
"google.golang.org/grpc"
"github.com/container-storage-interface/spec/lib/go/csi"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
)
// Defines Non blocking GRPC server interfaces
type NonBlockingGRPCServer interface {
// Start services at the endpoint
Start(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer)
// Waits for the service to stop
Wait()
// Stops the service gracefully
Stop()
// Stops the service forcefully
ForceStop()
}
func NewNonBlockingGRPCServer() NonBlockingGRPCServer {
return &nonBlockingGRPCServer{}
}
// NonBlocking server
type nonBlockingGRPCServer struct {
wg sync.WaitGroup
server *grpc.Server
}
func (s *nonBlockingGRPCServer) Start(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer) {
s.wg.Add(1)
go s.serve(endpoint, ids, cs, ns)
return
}
func (s *nonBlockingGRPCServer) Wait() {
s.wg.Wait()
}
func (s *nonBlockingGRPCServer) Stop() {
s.server.GracefulStop()
}
func (s *nonBlockingGRPCServer) ForceStop() {
s.server.Stop()
}
func (s *nonBlockingGRPCServer) serve(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer) {
proto, addr, err := ParseEndpoint(endpoint)
if err != nil {
glog.Fatal(err.Error())
}
if proto == "unix" {
addr = "/" + addr
if err := os.Remove(addr); err != nil && !os.IsNotExist(err) {
glog.Fatalf("Failed to remove %s, error: %s", addr, err.Error())
}
}
listener, err := net.Listen(proto, addr)
if err != nil {
glog.Fatalf("Failed to listen: %v", err)
}
opts := []grpc.ServerOption{
grpc.UnaryInterceptor(logGRPC),
}
server := grpc.NewServer(opts...)
s.server = server
if ids != nil {
csi.RegisterIdentityServer(server, ids)
}
if cs != nil {
csi.RegisterControllerServer(server, cs)
}
if ns != nil {
csi.RegisterNodeServer(server, ns)
}
glog.Infof("Listening for connections on address: %#v", listener.Addr())
server.Serve(listener)
}
func ParseEndpoint(ep string) (string, string, error) {
if strings.HasPrefix(strings.ToLower(ep), "unix://") || strings.HasPrefix(strings.ToLower(ep), "tcp://") {
s := strings.SplitN(ep, "://", 2)
if s[1] != "" {
return s[0], s[1], nil
}
}
return "", "", fmt.Errorf("Invalid endpoint: %v", ep)
}
func logGRPC(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
glog.V(3).Infof("GRPC call: %s", info.FullMethod)
glog.V(5).Infof("GRPC request: %s", protosanitizer.StripSecrets(req))
resp, err := handler(ctx, req)
if err != nil {
glog.Errorf("GRPC error: %v", err)
} else {
glog.V(5).Infof("GRPC response: %s", protosanitizer.StripSecrets(resp))
}
return resp, err
}

View File

@@ -1,60 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"context"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
"k8s.io/klog"
"github.com/container-storage-interface/spec/lib/go/csi"
)
// IdentityServer struct of Vitastor CSI driver with supported methods of CSI identity server spec.
type IdentityServer struct
{
*Driver
}
// NewIdentityServer creates a new identity server instance
func NewIdentityServer(driver *Driver) *IdentityServer
{
return &IdentityServer{
Driver: driver,
}
}
// GetPluginInfo returns metadata of the plugin
func (is *IdentityServer) GetPluginInfo(ctx context.Context, req *csi.GetPluginInfoRequest) (*csi.GetPluginInfoResponse, error)
{
klog.Infof("received identity plugin info request %+v", protosanitizer.StripSecrets(req))
return &csi.GetPluginInfoResponse{
Name: vitastorCSIDriverName,
VendorVersion: vitastorCSIDriverVersion,
}, nil
}
// GetPluginCapabilities returns available capabilities of the plugin
func (is *IdentityServer) GetPluginCapabilities(ctx context.Context, req *csi.GetPluginCapabilitiesRequest) (*csi.GetPluginCapabilitiesResponse, error)
{
klog.Infof("received identity plugin capabilities request %+v", protosanitizer.StripSecrets(req))
return &csi.GetPluginCapabilitiesResponse{
Capabilities: []*csi.PluginCapability{
{
Type: &csi.PluginCapability_Service_{
Service: &csi.PluginCapability_Service{
Type: csi.PluginCapability_Service_CONTROLLER_SERVICE,
},
},
},
},
}, nil
}
// Probe returns the health and readiness of the plugin
func (is *IdentityServer) Probe(ctx context.Context, req *csi.ProbeRequest) (*csi.ProbeResponse, error)
{
return &csi.ProbeResponse{}, nil
}


@@ -1,279 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"context"
"os"
"os/exec"
"encoding/json"
"strings"
"bytes"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
"k8s.io/utils/mount"
utilexec "k8s.io/utils/exec"
"github.com/container-storage-interface/spec/lib/go/csi"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
"k8s.io/klog"
)
// NodeServer struct of Vitastor CSI driver with supported methods of CSI node server spec.
type NodeServer struct
{
*Driver
mounter mount.Interface
}
// NewNodeServer creates a new node server instance
func NewNodeServer(driver *Driver) *NodeServer
{
return &NodeServer{
Driver: driver,
mounter: mount.New(""),
}
}
// NodeStageVolume mounts the volume to a staging path on the node.
func (ns *NodeServer) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error)
{
return &csi.NodeStageVolumeResponse{}, nil
}
// NodeUnstageVolume unstages the volume from the staging path
func (ns *NodeServer) NodeUnstageVolume(ctx context.Context, req *csi.NodeUnstageVolumeRequest) (*csi.NodeUnstageVolumeResponse, error)
{
return &csi.NodeUnstageVolumeResponse{}, nil
}
func Contains(list []string, s string) bool
{
for i := 0; i < len(list); i++
{
if (list[i] == s)
{
return true
}
}
return false
}
// NodePublishVolume mounts the volume mounted to the staging path to the target path
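// The flow is: ensure the target path exists and is not already mounted,
// map the image as an NBD block device with vitastor-nbd, detect any
// existing filesystem on it, run mkfs.ext4/mkfs.xfs if the device is
// blank and writable, then bind-mount (block mode) or FormatAndMount it
// to the target path, unmapping the NBD device again on any failure.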
func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error)
{
klog.Infof("received node publish volume request %+v", protosanitizer.StripSecrets(req))
targetPath := req.GetTargetPath()
// Check that it's not already mounted
free, error := mount.IsNotMountPoint(ns.mounter, targetPath)
if (error != nil)
{
if (os.IsNotExist(error))
{
error := os.MkdirAll(targetPath, 0777)
if (error != nil)
{
return nil, status.Error(codes.Internal, error.Error())
}
free = true
}
else
{
return nil, status.Error(codes.Internal, error.Error())
}
}
if (!free)
{
return &csi.NodePublishVolumeResponse{}, nil
}
ctxVars := make(map[string]string)
err := json.Unmarshal([]byte(req.VolumeId), &ctxVars)
if (err != nil)
{
return nil, status.Error(codes.Internal, "volume ID not in JSON format")
}
volName := ctxVars["name"]
_, etcdUrl, etcdPrefix := GetConnectionParams(ctxVars)
if (len(etcdUrl) == 0)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
}
// Map NBD device
// FIXME: Check if already mapped
args := []string{
"map", "--etcd_address", strings.Join(etcdUrl, ","),
"--etcd_prefix", etcdPrefix,
"--image", volName,
};
if (ctxVars["configPath"] != "")
{
args = append(args, "--config_path", ctxVars["configPath"])
}
if (req.GetReadonly())
{
args = append(args, "--readonly", "1")
}
c := exec.Command("/usr/bin/vitastor-nbd", args...)
var stdout, stderr bytes.Buffer
c.Stdout, c.Stderr = &stdout, &stderr
err = c.Run()
stdoutStr, stderrStr := string(stdout.Bytes()), string(stderr.Bytes())
if (err != nil)
{
klog.Errorf("vitastor-nbd map failed: %s, status %s\n", stdoutStr+stderrStr, err)
return nil, status.Error(codes.Internal, stdoutStr+stderrStr+" (status "+err.Error()+")")
}
devicePath := strings.TrimSpace(stdoutStr)
// Check existing format
diskMounter := &mount.SafeFormatAndMount{Interface: ns.mounter, Exec: utilexec.New()}
existingFormat, err := diskMounter.GetDiskFormat(devicePath)
if (err != nil)
{
klog.Errorf("failed to get disk format for path %s, error: %v", err)
// unmap NBD device
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
return nil, err
}
// Format the device (ext4 or xfs)
fsType := req.GetVolumeCapability().GetMount().GetFsType()
isBlock := req.GetVolumeCapability().GetBlock() != nil
opt := req.GetVolumeCapability().GetMount().GetMountFlags()
opt = append(opt, "_netdev")
if ((req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY ||
req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY) &&
!Contains(opt, "ro"))
{
opt = append(opt, "ro")
}
if (fsType == "xfs")
{
opt = append(opt, "nouuid")
}
readOnly := Contains(opt, "ro")
if (existingFormat == "" && !readOnly)
{
args := []string{}
switch fsType
{
case "ext4":
args = []string{"-m0", "-Enodiscard,lazy_itable_init=1,lazy_journal_init=1", devicePath}
case "xfs":
args = []string{"-K", devicePath}
}
if (len(args) > 0)
{
cmdOut, cmdErr := diskMounter.Exec.Command("mkfs."+fsType, args...).CombinedOutput()
if (cmdErr != nil)
{
klog.Errorf("failed to run mkfs error: %v, output: %v", cmdErr, string(cmdOut))
// unmap NBD device
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
return nil, status.Error(codes.Internal, cmdErr.Error())
}
}
}
if (isBlock)
{
opt = append(opt, "bind")
err = diskMounter.Mount(devicePath, targetPath, fsType, opt)
}
else
{
err = diskMounter.FormatAndMount(devicePath, targetPath, fsType, opt)
}
if (err != nil)
{
klog.Errorf(
"failed to mount device path (%s) to path (%s) for volume (%s) error: %s",
devicePath, targetPath, volName, err,
)
// unmap NBD device
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
return nil, status.Error(codes.Internal, err.Error())
}
return &csi.NodePublishVolumeResponse{}, nil
}
// NodeUnpublishVolume unmounts the volume from the target path
func (ns *NodeServer) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpublishVolumeRequest) (*csi.NodeUnpublishVolumeResponse, error)
{
klog.Infof("received node unpublish volume request %+v", protosanitizer.StripSecrets(req))
targetPath := req.GetTargetPath()
devicePath, refCount, err := mount.GetDeviceNameFromMount(ns.mounter, targetPath)
if (err != nil)
{
if (os.IsNotExist(err))
{
return nil, status.Error(codes.NotFound, "Target path not found")
}
return nil, status.Error(codes.Internal, err.Error())
}
if (devicePath == "")
{
return nil, status.Error(codes.NotFound, "Volume not mounted")
}
// unmount
err = mount.CleanupMountPoint(targetPath, ns.mounter, false)
if (err != nil)
{
return nil, status.Error(codes.Internal, err.Error())
}
// unmap NBD device
if (refCount == 1)
{
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
}
return &csi.NodeUnpublishVolumeResponse{}, nil
}
// NodeGetVolumeStats returns volume capacity statistics available for the volume
func (ns *NodeServer) NodeGetVolumeStats(ctx context.Context, req *csi.NodeGetVolumeStatsRequest) (*csi.NodeGetVolumeStatsResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// NodeExpandVolume expands the filesystem on the node
func (ns *NodeServer) NodeExpandVolume(ctx context.Context, req *csi.NodeExpandVolumeRequest) (*csi.NodeExpandVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// NodeGetCapabilities returns the supported capabilities of the node server
func (ns *NodeServer) NodeGetCapabilities(ctx context.Context, req *csi.NodeGetCapabilitiesRequest) (*csi.NodeGetCapabilitiesResponse, error)
{
return &csi.NodeGetCapabilitiesResponse{}, nil
}
// NodeGetInfo returns NodeGetInfoResponse for CO.
func (ns *NodeServer) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error)
{
klog.Infof("received node get info request %+v", protosanitizer.StripSecrets(req))
return &csi.NodeGetInfoResponse{
NodeId: ns.NodeID,
}, nil
}


@@ -1,36 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"k8s.io/klog"
)
type Driver struct
{
*Config
}
// NewDriver creates a new driver instance
func NewDriver(config *Config) (*Driver, error)
{
if (config == nil)
{
klog.Errorf("Vitastor CSI driver initialization failed")
return nil, errors.New("Vitastor CSI driver initialization failed: nil config")
}
driver := &Driver{
Config: config,
}
klog.Infof("Vitastor CSI driver initialized")
return driver, nil
}
// Start server
func (driver *Driver) Run()
{
server := NewNonBlockingGRPCServer()
server.Start(driver.Endpoint, NewIdentityServer(driver), NewControllerServer(driver), NewNodeServer(driver))
server.Wait()
}


@@ -1,39 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package main
import (
"flag"
"fmt"
"os"
"k8s.io/klog"
"vitastor.io/csi/src"
)
func main()
{
var config = vitastor.NewConfig()
flag.StringVar(&config.Endpoint, "endpoint", "", "CSI endpoint")
flag.StringVar(&config.NodeID, "node", "", "Node ID")
flag.Parse()
if (config.Endpoint == "")
{
config.Endpoint = os.Getenv("CSI_ENDPOINT")
}
if (config.NodeID == "")
{
config.NodeID = os.Getenv("NODE_ID")
}
if (config.Endpoint == "" || config.NodeID == "")
{
fmt.Fprintf(os.Stderr, "Please set -endpoint and -node / CSI_ENDPOINT & NODE_ID env vars\n")
os.Exit(1)
}
drv, err := vitastor.NewDriver(config)
if (err != nil)
{
klog.Fatalln(err)
}
drv.Run()
}

debian/changelog

@@ -1,18 +1,8 @@
vitastor (0.6.5-1) unstable; urgency=medium
vitastor (0.5.10-1) unstable; urgency=medium
* RDMA support
* Bugfixes
-- Vitaliy Filippov <vitalif@yourcmc.ru> Sat, 01 May 2021 18:46:10 +0300
vitastor (0.6.0-1) unstable; urgency=medium
* Snapshots and Copy-on-Write clones
* Image metadata in etcd (name, size)
* Image I/O and space statistics in etcd
* Write throttling for smoothing random write workloads in SSD+HDD configurations
-- Vitaliy Filippov <vitalif@yourcmc.ru> Sun, 11 Apr 2021 00:49:18 +0300
-- Vitaliy Filippov <vitalif@yourcmc.ru> Tue, 02 Feb 2021 23:01:24 +0300
vitastor (0.5.1-1) unstable; urgency=medium

debian/control

@@ -2,7 +2,7 @@ Source: vitastor
Section: admin
Priority: optional
Maintainer: Vitaliy Filippov <vitalif@yourcmc.ru>
Build-Depends: debhelper, liburing-dev (>= 0.6), g++ (>= 8), libstdc++6 (>= 8), linux-libc-dev, libgoogle-perftools-dev, libjerasure-dev, libgf-complete-dev, libibverbs-dev
Build-Depends: debhelper, liburing-dev (>= 0.6), g++ (>= 8), libstdc++6 (>= 8), linux-libc-dev, libgoogle-perftools-dev, libjerasure-dev, libgf-complete-dev
Standards-Version: 4.5.0
Homepage: https://vitastor.io/
Rules-Requires-Root: no


@@ -11,10 +11,6 @@ RUN if [ "$REL" = "buster" ]; then \
echo 'Package: *' >> /etc/apt/preferences; \
echo 'Pin: release a=buster-backports' >> /etc/apt/preferences; \
echo 'Pin-Priority: 500' >> /etc/apt/preferences; \
echo >> /etc/apt/preferences; \
echo 'Package: libglvnd* libgles* libglx* libgl1 libegl* libopengl* mesa*' >> /etc/apt/preferences; \
echo 'Pin: release a=buster-backports' >> /etc/apt/preferences; \
echo 'Pin-Priority: 50' >> /etc/apt/preferences; \
fi; \
grep '^deb ' /etc/apt/sources.list | perl -pe 's/^deb/deb-src/' >> /etc/apt/sources.list; \
echo 'APT::Install-Recommends false;' >> /etc/apt/apt.conf; \
@@ -24,22 +20,20 @@ RUN apt-get update
RUN apt-get -y install qemu fio liburing1 liburing-dev libgoogle-perftools-dev devscripts
RUN apt-get -y build-dep qemu
RUN apt-get -y build-dep fio
# To build a custom version
#RUN cp /root/packages/qemu-orig/* /root
RUN apt-get --download-only source qemu
RUN apt-get --download-only source fio
ADD patches/qemu-5.0-vitastor.patch patches/qemu-5.1-vitastor.patch /root/vitastor/patches/
ADD qemu-5.0-vitastor.patch qemu-5.1-vitastor.patch /root/vitastor/
RUN set -e; \
mkdir -p /root/packages/qemu-$REL; \
rm -rf /root/packages/qemu-$REL/*; \
cd /root/packages/qemu-$REL; \
dpkg-source -x /root/qemu*.dsc; \
if [ -d /root/packages/qemu-$REL/qemu-5.0 ]; then \
cp /root/vitastor/patches/qemu-5.0-vitastor.patch /root/packages/qemu-$REL/qemu-5.0/debian/patches; \
cp /root/vitastor/qemu-5.0-vitastor.patch /root/packages/qemu-$REL/qemu-5.0/debian/patches; \
echo qemu-5.0-vitastor.patch >> /root/packages/qemu-$REL/qemu-5.0/debian/patches/series; \
else \
cp /root/vitastor/patches/qemu-5.1-vitastor.patch /root/packages/qemu-$REL/qemu-*/debian/patches; \
cp /root/vitastor/qemu-5.1-vitastor.patch /root/packages/qemu-$REL/qemu-*/debian/patches; \
P=`ls -d /root/packages/qemu-$REL/qemu-*/debian/patches`; \
echo qemu-5.1-vitastor.patch >> $P/series; \
fi; \


@@ -22,7 +22,7 @@ RUN apt-get -y build-dep qemu
RUN apt-get -y build-dep fio
RUN apt-get --download-only source qemu
RUN apt-get --download-only source fio
RUN apt-get update && apt-get -y install libjerasure-dev cmake libibverbs-dev
RUN apt-get -y install libjerasure-dev cmake
ADD . /root/vitastor
RUN set -e -x; \
@@ -40,10 +40,10 @@ RUN set -e -x; \
mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/packages/vitastor-$REL/*; \
cd /root/packages/vitastor-$REL; \
cp -r /root/vitastor vitastor-0.6.5; \
ln -s /root/packages/qemu-$REL/qemu-*/ vitastor-0.6.5/qemu; \
ln -s /root/fio-build/fio-*/ vitastor-0.6.5/fio; \
cd vitastor-0.6.5; \
cp -r /root/vitastor vitastor-0.5.10; \
ln -s /root/packages/qemu-$REL/qemu-*/ vitastor-0.5.10/qemu; \
ln -s /root/fio-build/fio-*/ vitastor-0.5.10/fio; \
cd vitastor-0.5.10; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
QEMU=$(head -n1 qemu/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
sh copy-qemu-includes.sh; \
@@ -59,8 +59,8 @@ RUN set -e -x; \
echo "dep:fio=$FIO" > debian/substvars; \
echo "dep:qemu=$QEMU" >> debian/substvars; \
cd /root/packages/vitastor-$REL; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.6.5.orig.tar.xz vitastor-0.6.5; \
cd vitastor-0.6.5; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.5.10.orig.tar.xz vitastor-0.5.10; \
cd vitastor-0.5.10; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \


@@ -104,17 +104,6 @@ async function optimize_initial({ osd_tree, pg_count, pg_size = 3, pg_minsize =
return res;
}
function shuffle(array)
{
for (let i = array.length - 1, j, x; i > 0; i--)
{
j = Math.floor(Math.random() * (i + 1));
x = array[i];
array[i] = array[j];
array[j] = x;
}
}
function make_int_pgs(weights, pg_count)
{
const total_weight = Object.values(weights).reduce((a, c) => Number(a) + Number(c), 0);
@@ -131,7 +120,6 @@ function make_int_pgs(weights, pg_count)
weight_left -= weights[pg_name];
pg_left -= n;
}
shuffle(int_pgs);
return int_pgs;
}
@@ -244,7 +232,6 @@ async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3,
{
return null;
}
// FIXME: use parity_chunks with parity_space instead of pg_minsize
const pg_effsize = Math.min(pg_minsize, Object.keys(osd_tree).length)
+ Math.max(0, Math.min(pg_size, Object.keys(osd_tree).length) - pg_minsize) * parity_space;
const pg_count = prev_int_pgs.length;


@@ -53,6 +53,7 @@ ExecStart=/usr/bin/vitastor-osd \\
--osd_num $OSD_NUM \\
--disable_data_fsync 1 \\
--immediate_commit all \\
--flusher_count 256 \\
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
--journal_no_same_sector_overwrites true \\
--journal_sector_buffer_count 1024 \\


@@ -32,8 +32,7 @@ ExecStart=/usr/local/bin/etcd -name etcd$ETCD_NUM --data-dir /var/lib/etcd$ETCD_
--advertise-client-urls http://$IP:2379 --listen-client-urls http://$IP:2379 \\
--initial-advertise-peer-urls http://$IP:2380 --listen-peer-urls http://$IP:2380 \\
--initial-cluster-token vitastor-etcd-1 --initial-cluster $ETCD_HOSTS \\
--initial-cluster-state new --max-txn-ops=100000 --max-request-bytes=104857600 \\
--auto-compaction-retention=10 --auto-compaction-mode=revision
--initial-cluster-state new --max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision
WorkingDirectory=/var/lib/etcd$ETCD_NUM.etcd
ExecStartPre=+chown -R etcd /var/lib/etcd$ETCD_NUM.etcd
User=etcd


@@ -34,21 +34,13 @@ const etcd_allow = new RegExp('^'+[
'pg/stats/[1-9]\\d*/[1-9]\\d*',
'pg/history/[1-9]\\d*/[1-9]\\d*',
'history/last_clean_pgs',
'inode/stats/[1-9]\\d*/[1-9]\\d*',
'inode/stats/[1-9]\\d*',
'stats',
'index/image/.*',
'index/maxid/[1-9]\\d*',
].join('$|^')+'$');
const etcd_tree = {
config: {
/* global: {
// WARNING: NOT ALL OF THESE ARE ACTUALLY CONFIGURABLE HERE
// THIS IS JUST A POOR MAN'S CONFIG DOCUMENTATION
// etcd connection
config_path: "/etc/vitastor/vitastor.conf",
etcd_address: "10.0.115.10:2379/v3",
etcd_prefix: "/vitastor",
// mon
etcd_mon_ttl: 30, // min: 10
etcd_mon_timeout: 1000, // ms. min: 0
@@ -58,17 +50,7 @@ const etcd_tree = {
osd_out_time: 600, // seconds. min: 0
placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
// client and osd
tcp_header_buffer_size: 65536,
use_sync_send_recv: false,
use_rdma: true,
rdma_device: null, // for example, "rocep5s0f0"
rdma_port_num: 1,
rdma_gid_index: 0,
rdma_mtu: 4096,
rdma_max_sge: 128,
rdma_max_send: 32,
rdma_max_recv: 8,
rdma_max_msg: 1048576,
log_level: 0,
block_size: 131072,
disk_alignment: 4096,
@@ -114,8 +96,7 @@ const etcd_tree = {
disable_device_lock,
// blockstore - configurable
max_write_iodepth,
min_flusher_count: 1,
max_flusher_count: 256,
flusher_count,
inmemory_metadata,
inmemory_journal,
journal_sector_buffer_count,
@@ -229,7 +210,7 @@ const etcd_tree = {
/* <pool_id>: {
<pg_id>: {
primary: osd_num_t,
state: ("starting"|"peering"|"incomplete"|"active"|"repeering"|"stopping"|"offline"|
state: ("starting"|"peering"|"incomplete"|"active"|"stopping"|"offline"|
"degraded"|"has_incomplete"|"has_degraded"|"has_misplaced"|"has_unclean"|
"has_invalid"|"left_on_dead")[],
}
@@ -259,26 +240,14 @@ const etcd_tree = {
},
inode: {
stats: {
/* <pool_id>: {
<inode_t>: {
raw_used: uint64_t, // raw used bytes on OSDs
read: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
write: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
delete: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
},
/* <inode_t>: {
raw_used: uint64_t, // raw used bytes on OSDs
read: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
write: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
delete: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
}, */
},
},
pool: {
stats: {
/* <pool_id>: {
used_raw_tb: float, // used raw space in the pool
total_raw_tb: float, // maximum amount of space in the pool
raw_to_usable: float, // raw to usable ratio
space_efficiency: float, // 0..1
} */
},
},
stats: {
/* op_stats: {
<string>: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
@@ -301,17 +270,6 @@ const etcd_tree = {
history: {
last_clean_pgs: {},
},
index: {
image: {
/* <name>: {
id: uint64_t,
pool_id: uint64_t,
}, */
},
maxid: {
/* <pool_id>: uint64_t, */
},
},
};
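The keys sketched in etcd_tree above are plain JSON values stored under the etcd prefix. As an illustration, a client could fetch the per-inode statistics of pool 1, inode 2 through the etcd v3 HTTP gateway like this (a hypothetical sketch; the address and prefix are assumptions):

import base64, json, urllib.request

etcd = 'http://10.0.115.10:2379/v3'   # assumed etcd address
prefix = '/vitastor'                  # assumed etcd prefix

# the etcd v3 JSON gateway takes base64-encoded keys and values
key = base64.b64encode((prefix+'/inode/stats/1/2').encode()).decode()
req = urllib.request.Request(etcd+'/kv/range',
    json.dumps({ 'key': key }).encode(),
    { 'Content-Type': 'application/json' })
data = json.loads(urllib.request.urlopen(req).read())
for kv in data.get('kvs', []):
    # value decodes to JSON like {"raw_used":...,"read":{...},...}
    print(json.loads(base64.b64decode(kv['value'])))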
// FIXME Split into several files
@@ -386,11 +344,6 @@ class Mon
{
this.config.mon_stats_timeout = 100;
}
this.config.mon_stats_interval = Number(this.config.mon_stats_interval) || 5000;
if (this.config.mon_stats_interval < 100)
{
this.config.mon_stats_interval = 100;
}
// After this number of seconds, a dead OSD will be removed from PG distribution
this.config.osd_out_time = Number(this.config.osd_out_time) || 0;
if (!this.config.osd_out_time)
@@ -626,7 +579,7 @@ class Mon
for (const osd_num of this.all_osds().sort((a, b) => a - b))
{
const stat = this.state.osd.stats[osd_num];
if (stat && stat.size && (this.state.osd.state[osd_num] || Number(stat.time) >= down_time))
if (stat.size && (this.state.osd.state[osd_num] || Number(stat.time) >= down_time))
{
// Numeric IDs are reserved for OSDs
const osd_cfg = this.state.config.osd[osd_num];
@@ -777,11 +730,6 @@ class Mon
pg_history[i].osd_sets = pg_history[i].osd_sets || [];
pg_history[i].osd_sets.push(prev_pgs[i]);
}
if (pg_history[i] && pg_history[i].osd_sets)
{
pg_history[i].osd_sets = Object.values(pg_history[i].osd_sets
.reduce((a, c) => { a[c.join(' ')] = c; return a; }, {}));
}
});
for (let i = 0; i < new_pgs.length || i < prev_pgs.length; i++)
{
@@ -932,7 +880,7 @@ class Mon
{
// Take configuration and state, check it against the stored configuration hash
// Recalculate PGs and save them to etcd if the configuration is changed
// FIXME: Do not change anything if the distribution is good and random enough and no PGs are degraded
// FIXME: Also do not change anything if the distribution is good enough and no PGs are degraded
const { up_osds, levels, osd_tree } = this.get_osd_tree();
const tree_cfg = {
osd_tree,
@@ -991,14 +939,7 @@ class Mon
prev_pgs[pg-1] = this.state.history.last_clean_pgs.items[pool_id][pg].osd_set;
}
prev_pgs = JSON.parse(JSON.stringify(prev_pgs.length ? prev_pgs : real_prev_pgs));
const old_pg_count = real_prev_pgs.length;
const optimize_cfg = {
osd_tree: pool_tree,
pg_count: pool_cfg.pg_count,
pg_size: pool_cfg.pg_size,
pg_minsize: pool_cfg.pg_minsize,
max_combinations: pool_cfg.max_osd_combinations,
};
const old_pg_count = prev_pgs.length;
let optimize_result;
if (old_pg_count > 0)
{
@@ -1025,22 +966,23 @@ class Mon
pg.pop();
}
}
if (!this.state.config.pgs.hash)
{
// Re-shuffle PGs
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
}
else
{
optimize_result = await LPOptimizer.optimize_change({
prev_pgs,
...optimize_cfg,
});
}
optimize_result = await LPOptimizer.optimize_change({
prev_pgs,
osd_tree: pool_tree,
pg_size: pool_cfg.pg_size,
pg_minsize: pool_cfg.pg_minsize,
max_combinations: pool_cfg.max_osd_combinations,
});
}
else
{
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
optimize_result = await LPOptimizer.optimize_initial({
osd_tree: pool_tree,
pg_count: pool_cfg.pg_count,
pg_size: pool_cfg.pg_size,
pg_minsize: pool_cfg.pg_minsize,
max_combinations: pool_cfg.max_osd_combinations,
});
}
if (old_pg_count != optimize_result.int_pgs.length)
{
@@ -1055,17 +997,6 @@ class Mon
} });
}
LPOptimizer.print_change_stats(optimize_result);
const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
this.state.pool.stats[pool_id] = {
used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
total_raw_tb: optimize_result.space,
raw_to_usable: pg_effsize / (pool_cfg.pg_size - (pool_cfg.parity_chunks||0)),
space_efficiency: optimize_result.space/(optimize_result.total_space||1),
};
etcd_request.success.push({ requestPut: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
} });
this.save_new_pgs_txn(etcd_request, pool_id, up_osds, real_prev_pgs, optimize_result.int_pgs, pg_history);
}
this.state.config.pgs.hash = tree_hash;
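To make the raw_to_usable ratio computed above concrete, here is the same arithmetic in Python (a worked example with assumed pool shapes, mirroring the JS expression):

# raw_to_usable: how many raw bytes one usable byte costs in a pool.
def raw_to_usable(pg_size, parity_chunks, osd_count):
    pg_effsize = min(pg_size, osd_count)
    return pg_effsize / (pg_size - parity_chunks)

print(raw_to_usable(3, 1, 10))  # EC 2+1: 1.5 raw bytes per usable byte
print(raw_to_usable(5, 2, 10))  # EC 3+2: ~1.67
# Consumers (e.g. the Cinder driver below) derive free space as
# 1024 * (total_raw_tb - used_raw_tb) / raw_to_usable, in GB.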
@@ -1172,12 +1103,12 @@ class Mon
}, this.config.mon_change_timeout || 1000);
}
sum_op_stats()
sum_stats()
{
const op_stats = {}, subop_stats = {}, recovery_stats = {};
for (const osd in this.state.osd.stats)
{
const st = this.state.osd.stats[osd]||{};
const st = this.state.osd.stats[osd];
for (const op in st.op_stats||{})
{
op_stats[op] = op_stats[op] || { count: 0n, usec: 0n, bytes: 0n };
@@ -1233,46 +1164,25 @@ class Mon
write: { count: 0n, usec: 0n, bytes: 0n },
delete: { count: 0n, usec: 0n, bytes: 0n },
});
for (const pool_id in this.state.config.pools)
{
this.state.pool.stats[pool_id] = this.state.pool.stats[pool_id] || {};
this.state.pool.stats[pool_id].used_raw_tb = 0n;
}
for (const osd_num in this.state.osd.space)
{
for (const pool_id in this.state.osd.space[osd_num])
for (const inode_num in this.state.osd.space[osd_num])
{
this.state.pool.stats[pool_id] = this.state.pool.stats[pool_id] || { used_raw_tb: 0n };
inode_stats[pool_id] = inode_stats[pool_id] || {};
for (const inode_num in this.state.osd.space[osd_num][pool_id])
{
const u = BigInt(this.state.osd.space[osd_num][pool_id][inode_num]||0);
inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
inode_stats[pool_id][inode_num].raw_used += u;
this.state.pool.stats[pool_id].used_raw_tb += u;
}
inode_stats[inode_num] = inode_stats[inode_num] || inode_stub();
inode_stats[inode_num].raw_used += BigInt(this.state.osd.space[osd_num][inode_num]||0);
}
}
for (const pool_id in this.state.config.pools)
{
const used = this.state.pool.stats[pool_id].used_raw_tb;
this.state.pool.stats[pool_id].used_raw_tb = Number(used)/1024/1024/1024/1024;
}
for (const osd_num in this.state.osd.inodestats)
{
const ist = this.state.osd.inodestats[osd_num];
for (const pool_id in ist)
for (const inode_num in ist)
{
inode_stats[pool_id] = inode_stats[pool_id] || {};
for (const inode_num in ist[pool_id])
inode_stats[inode_num] = inode_stats[inode_num] || inode_stub();
for (const op of [ 'read', 'write', 'delete' ])
{
inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
for (const op of [ 'read', 'write', 'delete' ])
{
inode_stats[pool_id][inode_num][op].count += BigInt(ist[pool_id][inode_num][op].count||0);
inode_stats[pool_id][inode_num][op].usec += BigInt(ist[pool_id][inode_num][op].usec||0);
inode_stats[pool_id][inode_num][op].bytes += BigInt(ist[pool_id][inode_num][op].bytes||0);
}
inode_stats[inode_num][op].count += BigInt(ist[inode_num][op].count||0);
inode_stats[inode_num][op].usec += BigInt(ist[inode_num][op].usec||0);
inode_stats[inode_num][op].bytes += BigInt(ist[inode_num][op].bytes||0);
}
}
}
@@ -1329,7 +1239,7 @@ class Mon
async update_total_stats()
{
const txn = [];
const stats = this.sum_op_stats();
const stats = this.sum_stats();
const object_counts = this.sum_object_counts();
const inode_stats = this.sum_inode_stats();
this.fix_stat_overflows(stats, (this.prev_stats = this.prev_stats || {}));
@@ -1338,21 +1248,11 @@ class Mon
this.serialize_bigints(stats);
this.serialize_bigints(inode_stats);
txn.push({ requestPut: { key: b64(this.etcd_prefix+'/stats'), value: b64(JSON.stringify(stats)) } });
for (const pool_id in inode_stats)
{
for (const inode_num in inode_stats[pool_id])
{
txn.push({ requestPut: {
key: b64(this.etcd_prefix+'/inode/stats/'+pool_id+'/'+inode_num),
value: b64(JSON.stringify(inode_stats[pool_id][inode_num])),
} });
}
}
for (const pool_id in this.state.pool.stats)
for (const inode_num in inode_stats)
{
txn.push({ requestPut: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
key: b64(this.etcd_prefix+'/inode/stats/'+inode_num),
value: b64(JSON.stringify(inode_stats[inode_num])),
} });
}
if (txn.length)
@@ -1368,17 +1268,11 @@ class Mon
clearTimeout(this.stats_timer);
this.stats_timer = null;
}
let sleep = (this.stats_update_next||0) - Date.now();
if (sleep < this.config.mon_stats_timeout)
{
sleep = this.config.mon_stats_timeout;
}
this.stats_timer = setTimeout(() =>
{
this.stats_timer = null;
this.stats_update_next = Date.now() + this.config.mon_stats_interval;
this.update_total_stats().catch(console.error);
}, sleep);
}, this.config.mon_stats_timeout || 1000);
}
parse_kv(kv)


@@ -51,7 +51,7 @@ async function run()
const meta_offset = options.journal_offset + Math.ceil(options.journal_size/options.device_block_size)*options.device_block_size;
const entries_per_block = Math.floor(options.device_block_size / (24 + 2*options.object_size/options.bitmap_granularity/8));
const object_count = Math.floor((device_size-meta_offset)/options.object_size);
const meta_size = Math.ceil(1 + object_count / entries_per_block) * options.device_block_size;
const meta_size = Math.ceil(object_count / entries_per_block) * options.device_block_size;
const data_offset = meta_offset + meta_size;
const meta_size_fmt = (meta_size > 1024*1024*1024 ? Math.round(meta_size/1024/1024/1024*100)/100+" GB"
: Math.round(meta_size/1024/1024*100)/100+" MB");
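A quick sanity check of the metadata arithmetic above, with the defaults assumed to be 4096-byte device blocks, 128 KiB objects and 4096-byte bitmap granularity:

device_block_size = 4096
object_size = 128*1024
bitmap_granularity = 4096

# per-object metadata entry: 24-byte header plus two bitmaps
# (clean bitmap + "external" bitmap), one bit per granularity unit
entry_size = 24 + 2*object_size//bitmap_granularity//8  # = 32 bytes
entries_per_block = device_block_size // entry_size     # = 128 entries
# so metadata costs ~32 bytes per 128 KiB object, i.e. ~1 GB per 4 TB of data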
@@ -65,9 +65,6 @@ async function run()
);
}
process.stdout.write(
(options.device_block_size != 4096 ?
` --meta_block_size ${options.device_block_size}\n`+
` --journal_block_size ${options.device_block_size}\n` : '')+
` --data_device ${options.device}\n`+
` --journal_offset ${options.journal_offset}\n`+
` --meta_offset ${meta_offset}\n`+


@@ -1,948 +0,0 @@
# Vitastor Driver for OpenStack Cinder
#
# --------------------------------------------
# Install as cinder/volume/drivers/vitastor.py
# --------------------------------------------
#
# Copyright 2020 Vitaliy Filippov
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
"""Cinder Vitastor Driver"""
import binascii
import base64
import errno
import json
import math
import os
import tempfile
from castellan import key_manager
from oslo_config import cfg
from oslo_log import log as logging
from oslo_service import loopingcall
from oslo_concurrency import processutils
from oslo_utils import encodeutils
from oslo_utils import excutils
from oslo_utils import fileutils
from oslo_utils import units
import six
from six.moves.urllib import request
from cinder import exception
from cinder.i18n import _
from cinder.image import image_utils
from cinder import interface
from cinder import objects
from cinder.objects import fields
from cinder import utils
from cinder.volume import configuration
from cinder.volume import driver
from cinder.volume import volume_utils
VERSION = '0.6.5'
LOG = logging.getLogger(__name__)
VITASTOR_OPTS = [
cfg.StrOpt(
'vitastor_config_path',
default='/etc/vitastor/vitastor.conf',
help='Vitastor configuration file path'
),
cfg.StrOpt(
'vitastor_etcd_address',
default='',
help='Vitastor etcd address(es)'),
cfg.StrOpt(
'vitastor_etcd_prefix',
default='/vitastor',
help='Vitastor etcd prefix'
),
cfg.StrOpt(
'vitastor_pool_id',
default='',
help='Vitastor pool ID to use for volumes'
),
# FIXME exclusive_cinder_pool ?
]
CONF = cfg.CONF
CONF.register_opts(VITASTOR_OPTS, group = configuration.SHARED_CONF_GROUP)
class VitastorDriverException(exception.VolumeDriverException):
message = _("Vitastor Cinder driver failure: %(reason)s")
@interface.volumedriver
class VitastorDriver(driver.CloneableImageVD,
driver.ManageableVD, driver.ManageableSnapshotsVD,
driver.BaseVD):
"""Implements Vitastor volume commands."""
cfg = {}
_etcd_urls = []
def __init__(self, active_backend_id = None, *args, **kwargs):
super(VitastorDriver, self).__init__(*args, **kwargs)
self.configuration.append_config_values(VITASTOR_OPTS)
@classmethod
def get_driver_options(cls):
additional_opts = cls._get_oslo_driver_opts(
'reserved_percentage',
'max_over_subscription_ratio',
'volume_dd_blocksize'
)
return VITASTOR_OPTS + additional_opts
def do_setup(self, context):
"""Performs initialization steps that could raise exceptions."""
super(VitastorDriver, self).do_setup(context)
# Make sure configuration is in UTF-8
for attr in [ 'config_path', 'etcd_address', 'etcd_prefix', 'pool_id' ]:
val = self.configuration.safe_get('vitastor_'+attr)
if val is not None:
self.cfg[attr] = utils.convert_str(val)
self.cfg = self._load_config(self.cfg)
def _load_config(self, cfg):
# Try to load configuration file
try:
f = open(cfg['config_path'] or '/etc/vitastor/vitastor.conf')
conf = json.loads(f.read())
f.close()
for k in conf:
cfg[k] = cfg.get(k, conf[k])
except Exception:
pass
if isinstance(cfg['etcd_address'], str):
cfg['etcd_address'] = cfg['etcd_address'].split(',')
# Sanitize etcd URLs
for i, etcd_url in enumerate(cfg['etcd_address']):
ssl = False
if etcd_url.lower().startswith('http://'):
etcd_url = etcd_url[7:]
elif etcd_url.lower().startswith('https://'):
etcd_url = etcd_url[8:]
ssl = True
if etcd_url.find('/') < 0:
etcd_url += '/v3'
if ssl:
etcd_url = 'https://'+etcd_url
else:
etcd_url = 'http://'+etcd_url
cfg['etcd_address'][i] = etcd_url
return cfg
def check_for_setup_error(self):
"""Returns an error if prerequisites aren't met."""
def _encode_etcd_key(self, key):
if not isinstance(key, bytes):
key = str(key).encode('utf-8')
return base64.b64encode(self.cfg['etcd_prefix'].encode('utf-8')+b'/'+key).decode('utf-8')
def _encode_etcd_value(self, value):
if not isinstance(value, bytes):
value = str(value).encode('utf-8')
return base64.b64encode(value).decode('utf-8')
def _encode_etcd_requests(self, obj):
for v in obj:
for rt in v:
if 'key' in v[rt]:
v[rt]['key'] = self._encode_etcd_key(v[rt]['key'])
if 'range_end' in v[rt]:
v[rt]['range_end'] = self._encode_etcd_key(v[rt]['range_end'])
if 'value' in v[rt]:
v[rt]['value'] = self._encode_etcd_value(v[rt]['value'])
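# Worked example: with etcd_prefix = '/vitastor', a logical key such as
# 'index/image/testimg' (hypothetical name) travels to etcd as
# base64('/vitastor/index/image/testimg'), and values as base64 of their
# JSON text; _etcd_txn() below undoes both when decoding range responses.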
def _etcd_txn(self, params):
if 'compare' in params:
for v in params['compare']:
if 'key' in v:
v['key'] = self._encode_etcd_key(v['key'])
if 'failure' in params:
self._encode_etcd_requests(params['failure'])
if 'success' in params:
self._encode_etcd_requests(params['success'])
body = json.dumps(params).encode('utf-8')
headers = {
'Content-Type': 'application/json'
}
err = None
for etcd_url in self.cfg['etcd_address']:
try:
resp = request.urlopen(request.Request(etcd_url+'/kv/txn', body, headers), timeout = 5)
data = json.loads(resp.read())
if 'responses' not in data:
data['responses'] = []
for i, resp in enumerate(data['responses']):
if 'response_range' in resp:
if 'kvs' not in resp['response_range']:
resp['response_range']['kvs'] = []
for kv in resp['response_range']['kvs']:
kv['key'] = base64.b64decode(kv['key'].encode('utf-8')).decode('utf-8')
if kv['key'].startswith(self.cfg['etcd_prefix']+'/'):
kv['key'] = kv['key'][len(self.cfg['etcd_prefix'])+1 : ]
kv['value'] = json.loads(base64.b64decode(kv['value'].encode('utf-8')))
if len(resp.keys()) != 1:
LOG.exception('unknown responses['+str(i)+'] format: '+json.dumps(resp))
else:
resp = data['responses'][i] = resp[list(resp.keys())[0]]
return data
except Exception as e:
LOG.exception('error calling etcd transaction: '+body.decode('utf-8')+'\nerror: '+str(e))
err = e
raise err
def _etcd_foreach(self, prefix, add_fn):
total = 0
batch = 1000
begin = prefix+'/'
while True:
resp = self._etcd_txn({ 'success': [
{ 'request_range': {
'key': begin,
'range_end': prefix+'0',
'limit': batch+1,
} },
] })
i = 0
while i < batch and i < len(resp['responses'][0]['kvs']):
kv = resp['responses'][0]['kvs'][i]
add_fn(kv)
total += 1
i += 1
if len(resp['responses'][0]['kvs']) <= batch:
break
begin = resp['responses'][0]['kvs'][batch]['key']
return total
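# Note on the range above: '0' is the ASCII character right after '/',
# so scanning ['<prefix>/', '<prefix>0') enumerates exactly the keys under
# that path prefix, in pages of `batch` keys per request.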
def _update_volume_stats(self):
location_info = json.dumps({
'config': self.configuration.vitastor_config_path,
'etcd_address': self.configuration.vitastor_etcd_address,
'etcd_prefix': self.configuration.vitastor_etcd_prefix,
'pool_id': self.configuration.vitastor_pool_id,
})
stats = {
'vendor_name': 'Vitastor',
'driver_version': self.VERSION,
'storage_protocol': 'vitastor',
'total_capacity_gb': 'unknown',
'free_capacity_gb': 'unknown',
# FIXME check if safe_get is required
'reserved_percentage': self.configuration.safe_get('reserved_percentage'),
'multiattach': True,
'thin_provisioning_support': True,
'max_over_subscription_ratio': self.configuration.safe_get('max_over_subscription_ratio'),
'location_info': location_info,
'backend_state': 'down',
'volume_backend_name': self.configuration.safe_get('volume_backend_name') or 'vitastor',
'replication_enabled': False,
}
try:
pool_stats = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'pool/stats/'+str(self.cfg['pool_id']) } }
] })
total_provisioned = 0
def add_total(kv):
nonlocal total_provisioned
# count volumes (names without '@'), not snapshots
if kv['value']['name'].find('@') < 0:
total_provisioned += kv['value']['size']
self._etcd_foreach('config/inode/'+str(self.cfg['pool_id']), lambda kv: add_total(kv))
stats['provisioned_capacity_gb'] = round(total_provisioned/1024.0/1024.0/1024.0, 2)
pool_stats = pool_stats['responses'][0]['kvs']
if len(pool_stats):
pool_stats = pool_stats[0]['value']
stats['free_capacity_gb'] = round(1024.0*(pool_stats['total_raw_tb']-pool_stats['used_raw_tb'])/pool_stats['raw_to_usable'], 2)
stats['total_capacity_gb'] = round(1024.0*pool_stats['total_raw_tb'], 2)
stats['backend_state'] = 'up'
except Exception as e:
# just log and return unknown capacities
LOG.exception('error getting vitastor pool stats: '+str(e))
self._stats = stats
def _next_id(self, resp):
if len(resp['kvs']) == 0:
return (1, 0)
else:
return (1 + resp['kvs'][0]['value'], resp['kvs'][0]['mod_revision'])
def create_volume(self, volume):
"""Creates a logical volume."""
size = int(volume.size) * units.Gi
# FIXME: Check if convert_str is really required
vol_name = utils.convert_str(volume.name)
if vol_name.find('@') >= 0 or vol_name.find('/') >= 0:
raise exception.VolumeBackendAPIException(data = '@ and / are forbidden in volume and snapshot names')
LOG.debug("creating volume '%s'", vol_name)
self._create_image(vol_name, { 'size': size })
if volume.encryption_key_id:
self._create_encrypted_volume(volume, volume.obj_context)
volume_update = {}
return volume_update
def _create_encrypted_volume(self, volume, context):
"""Create a new LUKS encrypted image directly in Vitastor."""
vol_name = utils.convert_str(volume.name)
f, opts = self._encrypt_opts(volume, context)
# FIXME: Check if it works at all :-)
self._execute(
'qemu-img', 'convert', '-f', 'luks', *opts,
'vitastor:image='+vol_name.replace(':', '\\:')+self._qemu_args(),
'%sM' % (volume.size * 1024)
)
f.close()
def _encrypt_opts(self, volume, context):
encryption = volume_utils.check_encryption_provider(self.db, volume, context)
# Fetch the key associated with the volume and decode the passphrase
keymgr = key_manager.API(CONF)
key = keymgr.get(context, encryption['encryption_key_id'])
passphrase = binascii.hexlify(key.get_encoded()).decode('utf-8')
# Decode the dm-crypt style cipher spec into something qemu-img can use
cipher_spec = image_utils.decode_cipher(encryption['cipher'], encryption['key_size'])
tmp_dir = volume_utils.image_conversion_dir()
f = tempfile.NamedTemporaryFile(prefix = 'luks_', dir = tmp_dir)
f.write(passphrase.encode('utf-8'))
f.flush()
return (f, [
'--object', 'secret,id=luks_sec,format=raw,file=%(passfile)s' % {'passfile': f.name},
'-o', 'key-secret=luks_sec,cipher-alg=%(cipher_alg)s,cipher-mode=%(cipher_mode)s,ivgen-alg=%(ivgen_alg)s' % cipher_spec,
])
def create_snapshot(self, snapshot):
"""Creates a volume snapshot."""
vol_name = utils.convert_str(snapshot.volume_name)
snap_name = utils.convert_str(snapshot.name)
if snap_name.find('@') >= 0 or snap_name.find('/') >= 0:
raise exception.VolumeBackendAPIException(data = '@ and / are forbidden in volume and snapshot names')
self._create_snapshot(vol_name, vol_name+'@'+snap_name)
def snapshot_revert_use_temp_snapshot(self):
"""Disable the use of a temporary snapshot on revert."""
return False
def revert_to_snapshot(self, context, volume, snapshot):
"""Revert a volume to a given snapshot."""
# FIXME Delete the image, then recreate it from the snapshot
def delete_snapshot(self, snapshot):
"""Deletes a snapshot."""
vol_name = utils.convert_str(snapshot.volume_name)
snap_name = utils.convert_str(snapshot.name)
# Find the snapshot
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name+'@'+snap_name } },
] })
if len(resp['responses'][0]['kvs']) == 0:
raise exception.SnapshotNotFound(snapshot_id = snap_name)
inode_id = int(resp['responses'][0]['kvs'][0]['value']['id'])
pool_id = int(resp['responses'][0]['kvs'][0]['value']['pool_id'])
parents = {}
parents[(pool_id << 48) | (inode_id & 0xffffffffffff)] = True
# Check if there are child volumes
children = self._child_count(parents)
if children > 0:
raise exception.SnapshotIsBusy(snapshot_name = snap_name)
# FIXME: We can't delete snapshots because we can't merge layers yet
raise exception.VolumeBackendAPIException(data = 'Snapshot delete (layer merge) is not implemented yet')
def _child_count(self, parents):
children = 0
def add_child(kv):
nonlocal children
children += self._check_parent(kv, parents)
self._etcd_foreach('config/inode', lambda kv: add_child(kv))
return children
def _check_parent(self, kv, parents):
if 'parent_id' not in kv['value']:
return 0
parent_id = kv['value']['parent_id']
_, _, pool_id, inode_id = kv['key'].split('/')
parent_pool_id = pool_id
if 'parent_pool_id' in kv['value'] and kv['value']['parent_pool_id']:
parent_pool_id = kv['value']['parent_pool_id']
inode = (int(pool_id) << 48) | (int(inode_id) & 0xffffffffffff)
parent = (int(parent_pool_id) << 48) | (int(parent_id) & 0xffffffffffff)
if parent in parents and inode not in parents:
return 1
return 0
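# Worked example of the packing above: pool 2, inode 5 combine into
# (2 << 48) | 5 == 0x0002000000000005 - the upper 16 bits carry the pool
# ID, the lower 48 bits the inode number within the pool.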
def create_cloned_volume(self, volume, src_vref):
"""Create a cloned volume from another volume."""
size = int(volume.size) * units.Gi
src_name = utils.convert_str(src_vref.name)
dest_name = utils.convert_str(volume.name)
if dest_name.find('@') >= 0 or dest_name.find('/') >= 0:
raise exception.VolumeBackendAPIException(data = '@ and / are forbidden in volume and snapshot names')
# FIXME Do full copy if requested (cfg.disable_clone)
if src_vref.admin_metadata.get('readonly') == 'True':
# source volume is a volume-image cache entry or other readonly volume
# clone without intermediate snapshot
src = self._get_image(src_name)
LOG.debug("creating image '%s' from '%s'", dest_name, src_name)
new_cfg = self._create_image(dest_name, {
'size': size,
'parent_id': src['idx']['id'],
'parent_pool_id': src['idx']['pool_id'],
})
return {}
clone_snap = "%s@%s.clone_snap" % (src_name, dest_name)
make_img = True
if (volume.display_name and
volume.display_name.startswith('image-') and
src_vref.project_id != volume.project_id):
# idiotic openstack creates image-volume cache entries
# as clones of normal VM volumes... :-X prevent it :-D
clone_snap = dest_name
make_img = False
LOG.debug("creating layer '%s' under '%s'", clone_snap, src_name)
new_cfg = self._create_snapshot(src_name, clone_snap, True)
if make_img:
# Then create a clone from it
new_cfg = self._create_image(dest_name, {
'size': size,
'parent_id': new_cfg['parent_id'],
'parent_pool_id': new_cfg['parent_pool_id'],
})
return {}
def create_volume_from_snapshot(self, volume, snapshot):
"""Creates a cloned volume from an existing snapshot."""
vol_name = utils.convert_str(volume.name)
snap_name = utils.convert_str(snapshot.name)
snap = self._get_image(vol_name+'@'+snap_name)
if not snap:
raise exception.SnapshotNotFound(snapshot_id = snap_name)
snap_inode_id = int(snap['idx']['id'])
snap_pool_id = int(snap['idx']['pool_id'])
size = snap['cfg']['size']
if int(volume.size):
size = int(volume.size) * units.Gi
new_cfg = self._create_image(vol_name, {
'size': size,
'parent_id': snap['idx']['id'],
'parent_pool_id': snap['idx']['pool_id'],
})
return {}
def _vitastor_args(self):
args = []
for k in [ 'config_path', 'etcd_address', 'etcd_prefix' ]:
v = self.configuration.safe_get('vitastor_'+k)
if v:
args.extend(['--'+k, v])
return args
def _qemu_args(self):
args = ''
for k in [ 'config_path', 'etcd_address', 'etcd_prefix' ]:
v = self.configuration.safe_get('vitastor_'+k)
kk = k
if kk == 'etcd_address':
# FIXME use etcd_address in qemu driver
kk = 'etcd_host'
if v:
args += ':'+kk+'='+v.replace(':', '\\:')
return args
def delete_volume(self, volume):
"""Deletes a logical volume."""
vol_name = utils.convert_str(volume.name)
# Find the volume and all its snapshots
range_end = b'index/image/' + vol_name.encode('utf-8')
range_end = range_end[0 : len(range_end)-1] + six.int2byte(range_end[len(range_end)-1] + 1)
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name, 'range_end': range_end } },
] })
if len(resp['responses'][0]['kvs']) == 0:
# already deleted
LOG.info("volume %s no longer exists in backend", vol_name)
return
layers = resp['responses'][0]['kvs']
layer_ids = {}
for kv in layers:
inode_id = int(kv['value']['id'])
pool_id = int(kv['value']['pool_id'])
inode_pool_id = (pool_id << 48) | (inode_id & 0xffffffffffff)
layer_ids[inode_pool_id] = True
# Check if the volume has clones and raise 'busy' if so
children = self._child_count(layer_ids)
if children > 0:
raise exception.VolumeIsBusy(volume_name = vol_name)
# Clear data
for kv in layers:
args = [
'vitastor-rm', '--pool', str(kv['value']['pool_id']),
'--inode', str(kv['value']['id']), '--progress', '0',
*(self._vitastor_args())
]
try:
self._execute(*args)
except processutils.ProcessExecutionError as exc:
LOG.error("Failed to remove layer "+kv['key']+": "+exc)
raise exception.VolumeBackendAPIException(data = exc.stderr)
# Delete all layers from etcd
requests = []
for kv in layers:
requests.append({ 'request_delete_range': { 'key': kv['key'] } })
requests.append({ 'request_delete_range': { 'key': 'config/inode/'+str(kv['value']['pool_id'])+'/'+str(kv['value']['id']) } })
self._etcd_txn({ 'success': requests })
def retype(self, context, volume, new_type, diff, host):
"""Change extra type specifications for a volume."""
# FIXME Maybe (in the future) support multiple pools as different types
return True, {}
def ensure_export(self, context, volume):
"""Synchronously recreates an export for a logical volume."""
pass
def create_export(self, context, volume, connector):
"""Exports the volume."""
pass
def remove_export(self, context, volume):
"""Removes an export for a logical volume."""
pass
def _create_image(self, vol_name, cfg):
pool_s = str(self.cfg['pool_id'])
image_id = 0
while image_id == 0:
# check if the image already exists and find a free ID
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name } },
{ 'request_range': { 'key': 'index/maxid/'+pool_s } },
] })
if len(resp['responses'][0]['kvs']) > 0:
# already exists
raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' already exists')
image_id, id_mod = self._next_id(resp['responses'][1])
# try to create the image
resp = self._etcd_txn({ 'compare': [
{ 'target': 'MOD', 'mod_revision': id_mod, 'key': 'index/maxid/'+pool_s },
{ 'target': 'VERSION', 'version': 0, 'key': 'index/image/'+vol_name },
{ 'target': 'VERSION', 'version': 0, 'key': 'config/inode/'+pool_s+'/'+str(image_id) },
], 'success': [
{ 'request_put': { 'key': 'index/maxid/'+pool_s, 'value': image_id } },
{ 'request_put': { 'key': 'index/image/'+vol_name, 'value': json.dumps({
'id': image_id, 'pool_id': self.cfg['pool_id']
}) } },
{ 'request_put': { 'key': 'config/inode/'+pool_s+'/'+str(image_id), 'value': json.dumps({
**cfg, 'name': vol_name,
}) } },
] })
if not resp.get('succeeded'):
# repeat
image_id = 0
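# After a successful transaction above, a new volume 'vol1' (hypothetical)
# with ID 3 in pool 1 is described by three keys:
#   index/maxid/1    -> 3
#   index/image/vol1 -> {"id": 3, "pool_id": 1}
#   config/inode/1/3 -> {"name": "vol1", "size": ...}
# The 'compare' clauses turn the txn into a compare-and-swap: if another
# client grabbed the ID first, it fails and the while loop retries.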
def _create_snapshot(self, vol_name, snap_vol_name, allow_existing = False):
while True:
# check if the image already exists and snapshot doesn't
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name } },
{ 'request_range': { 'key': 'index/image/'+snap_vol_name } },
] })
if len(resp['responses'][0]['kvs']) == 0:
raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' does not exist')
if len(resp['responses'][1]['kvs']) > 0:
if allow_existing:
snap_idx = resp['responses'][1]['kvs'][0]['value']
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'config/inode/'+str(snap_idx['pool_id'])+'/'+str(snap_idx['id']) } },
] })
if len(resp['responses'][0]['kvs']) == 0:
raise exception.VolumeBackendAPIException(data =
'Volume '+snap_vol_name+' is already indexed, but does not exist'
)
return resp['responses'][0]['kvs'][0]['value']
raise exception.VolumeBackendAPIException(
data = 'Volume '+snap_vol_name+' already exists'
)
vol_idx = resp['responses'][0]['kvs'][0]['value']
vol_idx_mod = resp['responses'][0]['kvs'][0]['mod_revision']
# get image inode config and find a new ID
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']) } },
{ 'request_range': { 'key': 'index/maxid/'+str(self.cfg['pool_id']) } },
] })
if len(resp['responses'][0]['kvs']) == 0:
raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' does not exist')
vol_cfg = resp['responses'][0]['kvs'][0]['value']
vol_mod = resp['responses'][0]['kvs'][0]['mod_revision']
new_id, id_mod = self._next_id(resp['responses'][1])
# try to redirect image to the new inode
new_cfg = {
**vol_cfg, 'name': vol_name, 'parent_id': vol_idx['id'], 'parent_pool_id': vol_idx['pool_id']
}
resp = self._etcd_txn({ 'compare': [
{ 'target': 'MOD', 'mod_revision': vol_idx_mod, 'key': 'index/image/'+vol_name },
{ 'target': 'MOD', 'mod_revision': vol_mod, 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']) },
{ 'target': 'MOD', 'mod_revision': id_mod, 'key': 'index/maxid/'+str(self.cfg['pool_id']) },
{ 'target': 'VERSION', 'version': 0, 'key': 'index/image/'+snap_vol_name },
{ 'target': 'VERSION', 'version': 0, 'key': 'config/inode/'+str(self.cfg['pool_id'])+'/'+str(new_id) },
], 'success': [
{ 'request_put': { 'key': 'index/maxid/'+str(self.cfg['pool_id']), 'value': new_id } },
{ 'request_put': { 'key': 'index/image/'+vol_name, 'value': json.dumps({
'id': new_id, 'pool_id': self.cfg['pool_id']
}) } },
{ 'request_put': { 'key': 'config/inode/'+str(self.cfg['pool_id'])+'/'+str(new_id), 'value': json.dumps(new_cfg) } },
{ 'request_put': { 'key': 'index/image/'+snap_vol_name, 'value': json.dumps({
'id': vol_idx['id'], 'pool_id': vol_idx['pool_id']
}) } },
{ 'request_put': { 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']), 'value': json.dumps({
**vol_cfg, 'name': snap_vol_name, 'readonly': True
}) } }
] })
if resp.get('succeeded'):
return new_cfg
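# Snapshot example: snapshotting 'vol1' (inode 1/3, hypothetical) into
# 'vol1@snap1' allocates a new inode 1/4 and atomically rewires the index:
#   index/image/vol1       -> {"id": 4, "pool_id": 1}  (new writable head)
#   config/inode/1/4       -> {..., "parent_id": 3}
#   index/image/vol1@snap1 -> {"id": 3, "pool_id": 1}
#   config/inode/1/3       -> {..., "name": "vol1@snap1", "readonly": True}
# The old inode becomes the read-only snapshot; the image name now points
# at a copy-on-write child of it.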
def initialize_connection(self, volume, connector):
data = {
'driver_volume_type': 'vitastor',
'data': {
'config_path': self.configuration.vitastor_config_path,
'etcd_address': self.configuration.vitastor_etcd_address,
'etcd_prefix': self.configuration.vitastor_etcd_prefix,
'name': volume.name,
'logical_block_size': 512,
'physical_block_size': 4096,
}
}
LOG.debug('connection data: %s', data)
return data
def terminate_connection(self, volume, connector, **kwargs):
pass
def clone_image(self, context, volume, image_location, image_meta, image_service):
if image_location:
# Note: image_location[0] is glance image direct_url.
# image_location[1] contains the list of all locations (including
# direct_url) or None if show_multiple_locations is False in
# glance configuration.
if image_location[1]:
url_locations = [location['url'] for location in image_location[1]]
else:
url_locations = [image_location[0]]
# iterate all locations to look for a cloneable one.
for url_location in url_locations:
if url_location and url_location.startswith('cinder://'):
# The idea is to use cinder://<volume-id> Glance volumes as base images
base_vol = self.db.volume_get(context, url_location[len('cinder://') : ])
if not base_vol or base_vol.volume_type_id != volume.volume_type_id:
continue
size = int(volume.size) * units.Gi
dest_name = utils.convert_str(volume.name)
# Find or create the base snapshot
snap_cfg = self._create_snapshot(base_vol.name, base_vol.name+'@.clone_snap', True)
# Then create a clone from it
new_cfg = self._create_image(dest_name, {
'size': size,
'parent_id': snap_cfg['parent_id'],
'parent_pool_id': snap_cfg['parent_pool_id'],
})
return ({}, True)
return ({}, False)
def copy_image_to_encrypted_volume(self, context, volume, image_service, image_id):
self.copy_image_to_volume(context, volume, image_service, image_id, encrypted = True)
def copy_image_to_volume(self, context, volume, image_service, image_id, encrypted = False):
tmp_dir = volume_utils.image_conversion_dir()
with tempfile.NamedTemporaryFile(dir = tmp_dir) as tmp:
image_utils.fetch_to_raw(
context, image_service, image_id, tmp.name,
self.configuration.volume_dd_blocksize, size = volume.size
)
out_format = [ '-O', 'raw' ]
if encrypted:
key_file, opts = self._encrypt_opts(volume, context)
out_format = [ '-O', 'luks', *opts ]
dest_name = utils.convert_str(volume.name)
self._try_execute(
'qemu-img', 'convert', '-f', 'raw', tmp.name, *out_format,
'vitastor:image='+dest_name.replace(':', '\\:')+self._qemu_args()
)
if encrypted:
key_file.close()
def copy_volume_to_image(self, context, volume, image_service, image_meta):
tmp_dir = volume_utils.image_conversion_dir()
tmp_file = os.path.join(tmp_dir, volume.name + '-' + image_meta['id'])
with fileutils.remove_path_on_error(tmp_file):
vol_name = utils.convert_str(volume.name)
self._try_execute(
'qemu-img', 'convert', '-f', 'raw',
'vitastor:image='+vol_name.replace(':', '\\:')+self._qemu_args(),
'-O', 'raw', tmp_file
)
# FIXME: Copy directly if the destination image is also in Vitastor
volume_utils.upload_volume(context, image_service, image_meta, tmp_file, volume)
os.unlink(tmp_file)
def _get_image(self, vol_name):
# find the image
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name } },
] })
if len(resp['responses'][0]['kvs']) == 0:
return None
vol_idx = resp['responses'][0]['kvs'][0]['value']
vol_idx_mod = resp['responses'][0]['kvs'][0]['mod_revision']
# get image inode config
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']) } },
] })
if len(resp['responses'][0]['kvs']) == 0:
return None
vol_cfg = resp['responses'][0]['kvs'][0]['value']
vol_cfg_mod = resp['responses'][0]['kvs'][0]['mod_revision']
return {
'cfg': vol_cfg,
'cfg_mod': vol_cfg_mod,
'idx': vol_idx,
'idx_mod': vol_idx_mod,
}
def extend_volume(self, volume, new_size):
"""Extend an existing volume."""
vol_name = utils.convert_str(volume.name)
while True:
vol = self._get_image(vol_name)
if not vol:
raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' does not exist')
# change size
size = int(new_size) * units.Gi
if size == vol['cfg']['size']:
break
resp = self._etcd_txn({ 'compare': [ {
'target': 'MOD',
'mod_revision': vol['cfg_mod'],
'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']),
} ], 'success': [
{ 'request_put': {
'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']),
'value': json.dumps({ **vol['cfg'], 'size': size }),
} },
] })
if resp.get('succeeded'):
break
LOG.debug(
"Extend volume from %(old_size)s GB to %(new_size)s GB.",
{'old_size': volume.size, 'new_size': new_size}
)
def _add_manageable_volume(self, kv, manageable_volumes, cinder_ids):
cfg = kv['value']
if cfg['name'].find('@') >= 0:
# skip snapshots, they are listed separately
return
image_id = volume_utils.extract_id_from_volume_name(cfg['name'])
image_info = {
'reference': {'source-name': cfg['name']},
'size': int(math.ceil(float(cfg['size']) / units.Gi)),
'cinder_id': None,
'extra_info': None,
}
if image_id in cinder_ids:
image_info['cinder_id'] = image_id
image_info['safe_to_manage'] = False
image_info['reason_not_safe'] = 'already managed'
else:
image_info['safe_to_manage'] = True
image_info['reason_not_safe'] = None
manageable_volumes.append(image_info)
def get_manageable_volumes(self, cinder_volumes, marker, limit, offset, sort_keys, sort_dirs):
manageable_volumes = []
cinder_ids = [resource['id'] for resource in cinder_volumes]
# List all volumes
# FIXME: It's possible to use pagination in our case, but.. do we want it?
self._etcd_foreach('config/inode/'+str(self.cfg['pool_id']),
lambda kv: self._add_manageable_volume(kv, manageable_volumes, cinder_ids))
return volume_utils.paginate_entries_list(
manageable_volumes, marker, limit, offset, sort_keys, sort_dirs)
def _get_existing_name(self, existing_ref):
if not isinstance(existing_ref, dict):
existing_ref = {"source-name": existing_ref}
if 'source-name' not in existing_ref:
reason = _('Reference must contain source-name element.')
raise exception.ManageExistingInvalidReference(existing_ref=existing_ref, reason=reason)
src_name = utils.convert_str(existing_ref['source-name'])
if not src_name:
reason = _('Reference must contain source-name element.')
raise exception.ManageExistingInvalidReference(existing_ref=existing_ref, reason=reason)
return src_name
def manage_existing_get_size(self, volume, existing_ref):
"""Return size of an existing image for manage_existing.
:param volume: volume ref info to be set
:param existing_ref: {'source-name': <image name>}
"""
src_name = self._get_existing_name(existing_ref)
vol = self._get_image(src_name)
if not vol:
raise exception.VolumeBackendAPIException(data = 'Volume '+src_name+' does not exist')
return int(math.ceil(float(vol['cfg']['size']) / units.Gi))
def manage_existing(self, volume, existing_ref):
"""Manages an existing image.
Renames the image name to match the expected name for the volume.
:param volume: volume ref info to be set
:param existing_ref: {'source-name': <image name>}
"""
from_name = self._get_existing_name(existing_ref)
to_name = utils.convert_str(volume.name)
self._rename(from_name, to_name)
def _rename(self, from_name, to_name):
while True:
vol = self._get_image(from_name)
if not vol:
raise exception.VolumeBackendAPIException(data = 'Volume '+from_name+' does not exist')
to = self._get_image(to_name)
if to:
raise exception.VolumeBackendAPIException(data = 'Volume '+to_name+' already exists')
resp = self._etcd_txn({ 'compare': [
{ 'target': 'MOD', 'mod_revision': vol['idx_mod'], 'key': 'index/image/'+vol['cfg']['name'] },
{ 'target': 'MOD', 'mod_revision': vol['cfg_mod'], 'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']) },
{ 'target': 'VERSION', 'version': 0, 'key': 'index/image/'+to_name },
], 'success': [
{ 'request_delete_range': { 'key': 'index/image/'+vol['cfg']['name'] } },
{ 'request_put': { 'key': 'index/image/'+to_name, 'value': json.dumps(vol['idx']) } },
{ 'request_put': { 'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']),
'value': json.dumps({ **vol['cfg'], 'name': to_name }) } },
] })
if resp.get('succeeded'):
break
def unmanage(self, volume):
pass
def _add_manageable_snapshot(self, kv, manageable_snapshots, cinder_ids):
cfg = kv['value']
dog = kv['key'].find('@')
if dog < 0:
# snapshot
return
image_name = kv['key'][0 : dog]
snap_name = kv['key'][dog+1 : ]
snapshot_id = volume_utils.extract_id_from_snapshot_name(snap_name)
snapshot_info = {
'reference': {'source-name': snap_name},
'size': int(math.ceil(float(cfg['size']) / units.Gi)),
'cinder_id': None,
'extra_info': None,
'safe_to_manage': False,
'reason_not_safe': None,
'source_reference': {'source-name': image_name}
}
if snapshot_id in cinder_ids:
# Exclude snapshots already managed.
snapshot_info['reason_not_safe'] = ('already managed')
snapshot_info['cinder_id'] = snapshot_id
elif snap_name.endswith('.clone_snap'):
# Exclude clone snapshot.
snapshot_info['reason_not_safe'] = ('used for clone snap')
else:
snapshot_info['safe_to_manage'] = True
manageable_snapshots.append(snapshot_info)
def get_manageable_snapshots(self, cinder_snapshots, marker, limit, offset, sort_keys, sort_dirs):
"""List manageable snapshots in Vitastor."""
manageable_snapshots = []
cinder_snapshot_ids = [resource['id'] for resource in cinder_snapshots]
# List all volumes
# FIXME: It's possible to use pagination in our case, but.. do we want it?
self._etcd_foreach('config/inode/'+str(self.cfg['pool_id']),
lambda kv: self._add_manageable_volume(kv, manageable_snapshots, cinder_snapshot_ids))
return volume_utils.paginate_entries_list(
manageable_snapshots, marker, limit, offset, sort_keys, sort_dirs)
def manage_existing_snapshot_get_size(self, snapshot, existing_ref):
"""Return size of an existing image for manage_existing.
:param snapshot: snapshot ref info to be set
:param existing_ref: {'source-name': <name of snapshot>}
"""
vol_name = utils.convert_str(snapshot.volume_name)
snap_name = self._get_existing_name(existing_ref)
vol = self._get_image(vol_name+'@'+snap_name)
if not vol:
raise exception.ManageExistingInvalidReference(
existing_ref=snapshot_name, reason='Specified snapshot does not exist.'
)
return int(math.ceil(float(vol['cfg']['size']) / units.Gi))
def manage_existing_snapshot(self, snapshot, existing_ref):
"""Manages an existing snapshot.
Renames the snapshot name to match the expected name for the snapshot.
Error checking done by manage_existing_get_size is not repeated.
:param snapshot: snapshot ref info to be set
:param existing_ref: {'source-name': <name of snapshot>}
"""
vol_name = utils.convert_str(snapshot.volume_name)
snap_name = self._get_existing_name(existing_ref)
from_name = vol_name+'@'+snap_name
to_name = vol_name+'@'+utils.convert_str(snapshot.name)
self._rename(from_name, to_name)
def unmanage_snapshot(self, snapshot):
"""Removes the specified snapshot from Cinder management."""
pass
def _dumps(self, obj):
return json.dumps(obj, separators=(',', ':'), sort_keys=True)
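
Every metadata update in this driver (`extend_volume` and `_rename` above) follows the same lock-free pattern: read a key together with its `mod_revision`, then submit an etcd transaction that compares the revision before writing, and retry from scratch on conflict. The standalone sketch below illustrates the idea; the `etcd_txn` callable and the `update` signature are illustrative stand-ins for the driver's `_etcd_txn` helper, not part of the actual code:

import json

def cas_update(etcd_txn, key, update):
    # Optimistic concurrency: re-read and retry until the compare succeeds.
    while True:
        read = etcd_txn({ 'success': [ { 'request_range': { 'key': key } } ] })
        kvs = read['responses'][0]['kvs']
        if len(kvs) == 0:
            raise Exception(key+' does not exist')
        resp = etcd_txn({ 'compare': [ {
            'target': 'MOD', 'mod_revision': kvs[0]['mod_revision'], 'key': key,
        } ], 'success': [
            { 'request_put': { 'key': key, 'value': json.dumps(update(kvs[0]['value'])) } },
        ] })
        if resp.get('succeeded'):
            return

With such a helper, the size change in `extend_volume` reduces to `cas_update(self._etcd_txn, cfg_key, lambda cfg: { **cfg, 'size': size })`.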


@@ -1,23 +0,0 @@
# Devstack configuration for bridged networking
[[local|localrc]]
ADMIN_PASSWORD=secret
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD
HOST_IP=10.0.2.15
Q_USE_SECGROUP=True
FLOATING_RANGE="10.0.2.0/24"
IPV4_ADDRS_SAFE_TO_USE="10.0.5.0/24"
Q_FLOATING_ALLOCATION_POOL=start=10.0.2.50,end=10.0.2.100
PUBLIC_NETWORK_GATEWAY=10.0.2.2
PUBLIC_INTERFACE=ens3
Q_USE_PROVIDERNET_FOR_PUBLIC=True
Q_AGENT=linuxbridge
Q_ML2_PLUGIN_MECHANISM_DRIVERS=linuxbridge
LB_PHYSICAL_INTERFACE=ens3
PUBLIC_PHYSICAL_NETWORK=default
LB_INTERFACE_MAPPINGS=default:ens3
Q_SERVICE_PLUGIN_CLASSES=
Q_ML2_PLUGIN_TYPE_DRIVERS=flat
Q_ML2_PLUGIN_EXT_DRIVERS=


@@ -1,609 +0,0 @@
commit bd283191b3e7a4c6d1c100d3d96e348a1ebffe55
Author: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Sun Jun 27 12:52:40 2021 +0300
Add Vitastor support
diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index aa50eac..082b4f8 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -1728,6 +1728,35 @@
</element>
</define>
+ <define name="diskSourceNetworkProtocolVitastor">
+ <element name="source">
+ <interleave>
+ <attribute name="protocol">
+ <value>vitastor</value>
+ </attribute>
+ <ref name="diskSourceCommon"/>
+ <optional>
+ <attribute name="name"/>
+ </optional>
+ <optional>
+ <attribute name="query"/>
+ </optional>
+ <zeroOrMore>
+ <ref name="diskSourceNetworkHost"/>
+ </zeroOrMore>
+ <optional>
+ <element name="config">
+ <attribute name="file">
+ <ref name="absFilePath"/>
+ </attribute>
+ <empty/>
+ </element>
+ </optional>
+ <empty/>
+ </interleave>
+ </element>
+ </define>
+
<define name="diskSourceNetworkProtocolISCSI">
<element name="source">
<attribute name="protocol">
@@ -1851,6 +1880,7 @@
<ref name="diskSourceNetworkProtocolHTTP"/>
<ref name="diskSourceNetworkProtocolSimple"/>
<ref name="diskSourceNetworkProtocolVxHS"/>
+ <ref name="diskSourceNetworkProtocolVitastor"/>
</choice>
</define>
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
index 4bf2b5f..dbc011b 100644
--- a/include/libvirt/libvirt-storage.h
+++ b/include/libvirt/libvirt-storage.h
@@ -240,6 +240,7 @@ typedef enum {
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER = 1 << 16,
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS = 1 << 17,
VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE = 1 << 18,
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR = 1 << 20,
} virConnectListAllStoragePoolsFlags;
int virConnectListAllStoragePools(virConnectPtr conn,
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 222bb8c..685d255 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -8653,6 +8653,10 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
goto cleanup;
}
+ if (src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR) {
+ src->relPath = virXMLPropString(node, "query");
+ }
+
if ((haveTLS = virXMLPropString(node, "tls")) &&
(src->haveTLS = virTristateBoolTypeFromString(haveTLS)) <= 0) {
virReportError(VIR_ERR_XML_ERROR,
@@ -23849,6 +23853,10 @@ virDomainDiskSourceFormatNetwork(virBufferPtr attrBuf,
virBufferEscapeString(attrBuf, " name='%s'", path ? path : src->path);
+ if (src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR && src->relPath != NULL) {
+ virBufferEscapeString(attrBuf, " query='%s'", src->relPath);
+ }
+
VIR_FREE(path);
if (src->haveTLS != VIR_TRISTATE_BOOL_ABSENT &&
@@ -30930,6 +30938,7 @@ virDomainDiskTranslateSourcePool(virDomainDiskDefPtr def)
case VIR_STORAGE_POOL_MPATH:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
index 55db7a9..7cbe937 100644
--- a/src/conf/storage_conf.c
+++ b/src/conf/storage_conf.c
@@ -58,7 +58,7 @@ VIR_ENUM_IMPL(virStoragePool,
"logical", "disk", "iscsi",
"iscsi-direct", "scsi", "mpath",
"rbd", "sheepdog", "gluster",
- "zfs", "vstorage")
+ "zfs", "vstorage", "vitastor")
VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
VIR_STORAGE_POOL_FS_LAST,
@@ -232,6 +232,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
.formatToString = virStorageFileFormatTypeToString,
}
},
+ {.poolType = VIR_STORAGE_POOL_VITASTOR,
+ .poolOptions = {
+ .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
+ VIR_STORAGE_POOL_SOURCE_NETWORK |
+ VIR_STORAGE_POOL_SOURCE_NAME),
+ },
+ .volOptions = {
+ .defaultFormat = VIR_STORAGE_FILE_RAW,
+ .formatFromString = virStorageVolumeFormatFromString,
+ .formatToString = virStorageFileFormatTypeToString,
+ }
+ },
{.poolType = VIR_STORAGE_POOL_SHEEPDOG,
.poolOptions = {
.flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -434,6 +446,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
_("element 'name' is mandatory for RBD pool"));
goto cleanup;
}
+ if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
+ virReportError(VIR_ERR_XML_ERROR, "%s",
+ _("element 'name' is mandatory for Vitastor pool"));
+ return -1;
+ }
if (options->formatFromString) {
char *format = virXPathString("string(./format/@type)", ctxt);
@@ -1009,6 +1026,7 @@ virStoragePoolDefFormatBuf(virBufferPtr buf,
/* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
* files, so they don't have a target */
if (def->type != VIR_STORAGE_POOL_RBD &&
+ def->type != VIR_STORAGE_POOL_VITASTOR &&
def->type != VIR_STORAGE_POOL_SHEEPDOG &&
def->type != VIR_STORAGE_POOL_GLUSTER &&
def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
index dc0aa2a..ed4983d 100644
--- a/src/conf/storage_conf.h
+++ b/src/conf/storage_conf.h
@@ -91,6 +91,7 @@ typedef enum {
VIR_STORAGE_POOL_GLUSTER, /* Gluster device */
VIR_STORAGE_POOL_ZFS, /* ZFS */
VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
+ VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
VIR_STORAGE_POOL_LAST,
} virStoragePoolType;
@@ -422,6 +423,7 @@ VIR_ENUM_DECL(virStoragePartedFs)
VIR_CONNECT_LIST_STORAGE_POOLS_SCSI | \
VIR_CONNECT_LIST_STORAGE_POOLS_MPATH | \
VIR_CONNECT_LIST_STORAGE_POOLS_RBD | \
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER | \
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS | \
diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
index 6ea6a97..3ba45b9 100644
--- a/src/conf/virstorageobj.c
+++ b/src/conf/virstorageobj.c
@@ -1478,6 +1478,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
return 1;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_LAST:
break;
@@ -1971,6 +1972,8 @@ virStoragePoolObjMatch(virStoragePoolObjPtr obj,
(obj->def->type == VIR_STORAGE_POOL_MPATH)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
(obj->def->type == VIR_STORAGE_POOL_RBD)) ||
+ (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
+ (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
(obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
index 2ea3e94..d5d2273 100644
--- a/src/libvirt-storage.c
+++ b/src/libvirt-storage.c
@@ -92,6 +92,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
* VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
* VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
* VIR_CONNECT_LIST_STORAGE_POOLS_RBD
+ * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
* VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
*
* Returns the number of storage pools found or -1 and sets @pools to
diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
index 73e988a..ab7bb81 100644
--- a/src/libxl/libxl_conf.c
+++ b/src/libxl/libxl_conf.c
@@ -905,6 +905,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSourcePtr src,
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
index cbf0aa4..096700d 100644
--- a/src/qemu/qemu_block.c
+++ b/src/qemu/qemu_block.c
@@ -959,6 +959,42 @@ qemuBlockStorageSourceGetRBDProps(virStorageSourcePtr src)
}
+static virJSONValuePtr
+qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
+{
+ virJSONValuePtr ret = NULL;
+ virStorageNetHostDefPtr host;
+ size_t i;
+ virBuffer buf = VIR_BUFFER_INITIALIZER;
+ char *etcd = NULL;
+
+ for (i = 0; i < src->nhosts; i++) {
+ host = src->hosts + i;
+ if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
+ goto cleanup;
+ }
+ virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
+ }
+ if (src->nhosts > 0) {
+ etcd = virBufferContentAndReset(&buf);
+ }
+
+ if (virJSONValueObjectCreate(&ret,
+ "s:driver", "vitastor",
+ "S:etcd_host", etcd,
+ "S:etcd_prefix", src->relPath,
+ "S:config_path", src->configFile,
+ "s:image", src->path,
+ NULL) < 0)
+ goto cleanup;
+
+cleanup:
+ VIR_FREE(etcd);
+ virBufferFreeAndReset(&buf);
+ return ret;
+}
+
+
static virJSONValuePtr
qemuBlockStorageSourceGetSheepdogProps(virStorageSourcePtr src)
{
@@ -1174,6 +1210,11 @@ qemuBlockStorageSourceGetBackendProps(virStorageSourcePtr src,
return NULL;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
+ return NULL;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
return NULL;
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 822d5f8..e375cef 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -975,6 +975,43 @@ qemuBuildNetworkDriveStr(virStorageSourcePtr src,
ret = virBufferContentAndReset(&buf);
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (strchr(src->path, ':')) {
+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+ _("':' not allowed in Vitastor source volume name '%s'"),
+ src->path);
+ return NULL;
+ }
+
+ virBufferStrcat(&buf, "vitastor:image=", src->path, NULL);
+
+ if (src->nhosts > 0) {
+ virBufferAddLit(&buf, ":etcd_host=");
+ for (i = 0; i < src->nhosts; i++) {
+ if (i)
+ virBufferAddLit(&buf, ",");
+
+ /* assume host containing : is ipv6 */
+ if (strchr(src->hosts[i].name, ':'))
+ virBufferEscape(&buf, '\\', ":", "[%s]",
+ src->hosts[i].name);
+ else
+ virBufferAsprintf(&buf, "%s", src->hosts[i].name);
+
+ if (src->hosts[i].port)
+ virBufferAsprintf(&buf, "\\:%u", src->hosts[i].port);
+ }
+ }
+
+ if (src->configFile)
+ virBufferEscape(&buf, '\\', ":", ":config_path=%s", src->configFile);
+
+ if (src->relPath)
+ virBufferEscape(&buf, '\\', ":", ":etcd_prefix=%s", src->relPath);
+
+ ret = virBufferContentAndReset(&buf);
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_VXHS:
virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
_("VxHS protocol does not support URI syntax"));
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index ec6b340..f399efa 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10881,6 +10881,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSourcePtr src,
break;
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 1d96170..2d24396 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -14687,6 +14687,7 @@ qemuDomainSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDefPtr snapdi
case VIR_STORAGE_NET_PROTOCOL_TFTP:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
virReportError(VIR_ERR_INTERNAL_ERROR,
_("external inactive snapshots are not supported on "
@@ -14764,6 +14765,7 @@ qemuDomainSnapshotPrepareDiskExternalActive(virDomainSnapshotDiskDefPtr snapdisk
case VIR_STORAGE_NET_PROTOCOL_TFTP:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
virReportError(VIR_ERR_INTERNAL_ERROR,
_("external active snapshots are not supported on "
@@ -14887,6 +14889,7 @@ qemuDomainSnapshotPrepareDiskInternal(virDomainDiskDefPtr disk,
case VIR_STORAGE_NET_PROTOCOL_TFTP:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
virReportError(VIR_ERR_INTERNAL_ERROR,
_("internal inactive snapshots are not supported on "
diff --git a/src/qemu/qemu_parse_command.c b/src/qemu/qemu_parse_command.c
index c4650f0..551da41 100644
--- a/src/qemu/qemu_parse_command.c
+++ b/src/qemu/qemu_parse_command.c
@@ -2184,6 +2184,7 @@ qemuParseCommandLine(virFileCachePtr capsCache,
case VIR_STORAGE_NET_PROTOCOL_TFTP:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_LAST:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_NONE:
/* ignored for now */
break;
diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
index 4a13e90..33301c7 100644
--- a/src/storage/storage_driver.c
+++ b/src/storage/storage_driver.c
@@ -1568,6 +1568,7 @@ storageVolLookupByPathCallback(virStoragePoolObjPtr obj,
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_ZFS:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_LAST:
ignore_value(VIR_STRDUP(stable_path, data->path));
break;
diff --git a/src/util/virstoragefile.c b/src/util/virstoragefile.c
index bd4b027..b323cd6 100644
--- a/src/util/virstoragefile.c
+++ b/src/util/virstoragefile.c
@@ -84,7 +84,8 @@ VIR_ENUM_IMPL(virStorageNetProtocol, VIR_STORAGE_NET_PROTOCOL_LAST,
"ftps",
"tftp",
"ssh",
- "vxhs")
+ "vxhs",
+ "vitastor")
VIR_ENUM_IMPL(virStorageNetHostTransport, VIR_STORAGE_NET_HOST_TRANS_LAST,
"tcp",
@@ -2839,6 +2840,83 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
}
+static int
+virStorageSourceParseVitastorColonString(const char *colonstr,
+ virStorageSourcePtr src)
+{
+ char *p, *e, *next;
+ char *options = NULL;
+
+ /* optionally skip the "vitastor:" prefix if provided */
+ if (STRPREFIX(colonstr, "vitastor:"))
+ colonstr += strlen("vitastor:");
+
+ if (VIR_STRDUP(options, colonstr) < 0)
+ return -1;
+
+ p = options;
+ while (*p) {
+ /* find : delimiter or end of string */
+ for (e = p; *e && *e != ':'; ++e) {
+ if (*e == '\\') {
+ e++;
+ if (*e == '\0')
+ break;
+ }
+ }
+ if (*e == '\0') {
+ next = e; /* last kv pair */
+ } else {
+ next = e + 1;
+ *e = '\0';
+ }
+
+ if (STRPREFIX(p, "image=")) {
+ if (VIR_STRDUP(src->path, p + strlen("image=")) < 0)
+ goto error;
+ } else if (STRPREFIX(p, "etcd_prefix=")) {
+ if (VIR_STRDUP(src->relPath, p + strlen("etcd_prefix=")) < 0)
+ goto error;
+ } else if (STRPREFIX(p, "config_file=")) {
+ if (VIR_STRDUP(src->configFile, p + strlen("config_file=")) < 0)
+ goto error;
+ } else if (STRPREFIX(p, "etcd_host=")) {
+ char *h, *sep;
+
+ h = p + strlen("etcd_host=");
+ while (h < e) {
+ for (sep = h; sep < e; ++sep) {
+ if (*sep == '\\' && (sep[1] == ',' ||
+ sep[1] == ';' ||
+ sep[1] == ' ')) {
+ *sep = '\0';
+ sep += 2;
+ break;
+ }
+ }
+
+ if (virStorageSourceRBDAddHost(src, h) < 0)
+ goto error;
+
+ h = sep;
+ }
+ }
+
+ p = next;
+ }
+
+ if (!src->path)
+ goto error;
+
+ VIR_FREE(options);
+ return 0;
+
+error:
+ VIR_FREE(options);
+ return -1;
+}
+
+
static int
virStorageSourceParseNBDColonString(const char *nbdstr,
virStorageSourcePtr src)
@@ -2942,6 +3020,11 @@ virStorageSourceParseBackingColon(virStorageSourcePtr src,
goto cleanup;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (virStorageSourceParseVitastorColonString(path, src) < 0)
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -3441,6 +3524,56 @@ virStorageSourceParseBackingJSONRBD(virStorageSourcePtr src,
return ret;
}
+static int
+virStorageSourceParseBackingJSONVitastor(virStorageSourcePtr src,
+ virJSONValuePtr json,
+ int opaque ATTRIBUTE_UNUSED)
+{
+ const char *filename;
+ const char *image = virJSONValueObjectGetString(json, "image");
+ const char *conf = virJSONValueObjectGetString(json, "config_path");
+ const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd_prefix");
+ virJSONValuePtr servers = virJSONValueObjectGetArray(json, "server");
+ size_t nservers;
+ size_t i;
+
+ src->type = VIR_STORAGE_TYPE_NETWORK;
+ src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
+
+ /* legacy syntax passed via 'filename' option */
+ if ((filename = virJSONValueObjectGetString(json, "filename")))
+ return virStorageSourceParseVitastorColonString(filename, src);
+
+ if (!image) {
+ virReportError(VIR_ERR_INVALID_ARG, "%s",
+ _("missing image name in Vitastor backing volume "
+ "JSON specification"));
+ return -1;
+ }
+
+ if (VIR_STRDUP(src->path, image) < 0 ||
+ VIR_STRDUP(src->configFile, conf) < 0 ||
+ VIR_STRDUP(src->relPath, etcd_prefix) < 0)
+ return -1;
+
+ if (servers) {
+ nservers = virJSONValueArraySize(servers);
+
+ if (VIR_ALLOC_N(src->hosts, nservers) < 0)
+ return -1;
+
+ src->nhosts = nservers;
+
+ for (i = 0; i < nservers; i++) {
+ if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
+ virJSONValueArrayGet(servers, i)) < 0)
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
static int
virStorageSourceParseBackingJSONRaw(virStorageSourcePtr src,
virJSONValuePtr json,
@@ -3507,6 +3640,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
{"sheepdog", virStorageSourceParseBackingJSONSheepdog, 0},
{"ssh", virStorageSourceParseBackingJSONSSH, 0},
{"rbd", virStorageSourceParseBackingJSONRBD, 0},
+ {"vitastor", virStorageSourceParseBackingJSONVitastor, 0},
{"raw", virStorageSourceParseBackingJSONRaw, 0},
{"vxhs", virStorageSourceParseBackingJSONVxHS, 0},
};
@@ -4276,6 +4410,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
return 24007;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_RBD:
/* we don't provide a default for RBD */
return 0;
diff --git a/src/util/virstoragefile.h b/src/util/virstoragefile.h
index 1d6161a..8d83bf3 100644
--- a/src/util/virstoragefile.h
+++ b/src/util/virstoragefile.h
@@ -134,6 +134,7 @@ typedef enum {
VIR_STORAGE_NET_PROTOCOL_TFTP,
VIR_STORAGE_NET_PROTOCOL_SSH,
VIR_STORAGE_NET_PROTOCOL_VXHS,
+ VIR_STORAGE_NET_PROTOCOL_VITASTOR,
VIR_STORAGE_NET_PROTOCOL_LAST
} virStorageNetProtocol;
diff --git a/src/xenconfig/xen_xl.c b/src/xenconfig/xen_xl.c
index accfc3a..a18f9c3 100644
--- a/src/xenconfig/xen_xl.c
+++ b/src/xenconfig/xen_xl.c
@@ -1535,6 +1535,7 @@ xenFormatXLDiskSrcNet(virStorageSourcePtr src)
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
index 70ca39b..9caef51 100644
--- a/tools/virsh-pool.c
+++ b/tools/virsh-pool.c
@@ -1219,6 +1219,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd ATTRIBUTE_UNUSED)
case VIR_STORAGE_POOL_VSTORAGE:
flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
+ flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
+ break;
case VIR_STORAGE_POOL_LAST:
break;
}
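
The command-line builder (`qemuBuildNetworkDriveStr`) and the parser (`virStorageSourceParseVitastorColonString`) in the patch above agree on one small grammar: `key=value` pairs joined by `:`, with literal colons inside values escaped as `\:` (the image name itself may not contain `:` at all). A rough Python model of the builder side, with illustrative helper names that are not part of the patch:

def build_vitastor_filename(image, etcd_hosts=(), config_path=None, etcd_prefix=None):
    if ':' in image:
        raise ValueError("':' not allowed in Vitastor image names")
    def esc(s):
        return s.replace(':', '\\:')  # ':' separates pairs, so escape it inside values
    parts = ['vitastor:image='+image]
    if etcd_hosts:
        # hosts are comma-separated; any port is attached with an escaped colon
        parts.append('etcd_host='+','.join(esc(h) for h in etcd_hosts))
    if config_path:
        parts.append('config_path='+esc(config_path))
    if etcd_prefix:
        parts.append('etcd_prefix='+esc(etcd_prefix))
    return ':'.join(parts)

# build_vitastor_filename('testimg', ['10.0.0.1:2379'])
#   -> vitastor:image=testimg:etcd_host=10.0.0.1\:2379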


@@ -1,657 +0,0 @@
commit 41cdfe8317d98f70aadedfdbb381effed2641bdd
Author: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Fri Jul 9 01:31:57 2021 +0300
Add Vitastor support
diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index 7dc419b..875433b 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -1827,6 +1827,35 @@
</element>
</define>
+ <define name="diskSourceNetworkProtocolVitastor">
+ <element name="source">
+ <interleave>
+ <attribute name="protocol">
+ <value>vitastor</value>
+ </attribute>
+ <ref name="diskSourceCommon"/>
+ <optional>
+ <attribute name="name"/>
+ </optional>
+ <optional>
+ <attribute name="query"/>
+ </optional>
+ <zeroOrMore>
+ <ref name="diskSourceNetworkHost"/>
+ </zeroOrMore>
+ <optional>
+ <element name="config">
+ <attribute name="file">
+ <ref name="absFilePath"/>
+ </attribute>
+ <empty/>
+ </element>
+ </optional>
+ <empty/>
+ </interleave>
+ </element>
+ </define>
+
<define name="diskSourceNetworkProtocolISCSI">
<element name="source">
<attribute name="protocol">
@@ -2083,6 +2112,7 @@
<ref name="diskSourceNetworkProtocolSimple"/>
<ref name="diskSourceNetworkProtocolVxHS"/>
<ref name="diskSourceNetworkProtocolNFS"/>
+ <ref name="diskSourceNetworkProtocolVitastor"/>
</choice>
</define>
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
index 089e1e0..d7e7ef4 100644
--- a/include/libvirt/libvirt-storage.h
+++ b/include/libvirt/libvirt-storage.h
@@ -245,6 +245,7 @@ typedef enum {
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS = 1 << 17,
VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE = 1 << 18,
VIR_CONNECT_LIST_STORAGE_POOLS_ISCSI_DIRECT = 1 << 19,
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR = 1 << 20,
} virConnectListAllStoragePoolsFlags;
int virConnectListAllStoragePools(virConnectPtr conn,
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 01b7187..c6e9702 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -8261,7 +8261,8 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
src->configFile = virXPathString("string(./config/@file)", ctxt);
if (src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTP ||
- src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS)
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS ||
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR)
src->query = virXMLPropString(node, "query");
if (virDomainStorageNetworkParseHosts(node, ctxt, &src->hosts, &src->nhosts) < 0)
@@ -31392,6 +31393,7 @@ virDomainStorageSourceTranslateSourcePool(virStorageSourcePtr src,
case VIR_STORAGE_POOL_MPATH:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
index 0c50529..fe97574 100644
--- a/src/conf/storage_conf.c
+++ b/src/conf/storage_conf.c
@@ -60,7 +60,7 @@ VIR_ENUM_IMPL(virStoragePool,
"logical", "disk", "iscsi",
"iscsi-direct", "scsi", "mpath",
"rbd", "sheepdog", "gluster",
- "zfs", "vstorage",
+ "zfs", "vstorage", "vitastor",
);
VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
@@ -249,6 +249,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
.formatToString = virStorageFileFormatTypeToString,
}
},
+ {.poolType = VIR_STORAGE_POOL_VITASTOR,
+ .poolOptions = {
+ .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
+ VIR_STORAGE_POOL_SOURCE_NETWORK |
+ VIR_STORAGE_POOL_SOURCE_NAME),
+ },
+ .volOptions = {
+ .defaultFormat = VIR_STORAGE_FILE_RAW,
+ .formatFromString = virStorageVolumeFormatFromString,
+ .formatToString = virStorageFileFormatTypeToString,
+ }
+ },
{.poolType = VIR_STORAGE_POOL_SHEEPDOG,
.poolOptions = {
.flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -551,6 +563,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
_("element 'name' is mandatory for RBD pool"));
goto cleanup;
}
+ if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
+ virReportError(VIR_ERR_XML_ERROR, "%s",
+ _("element 'name' is mandatory for Vitastor pool"));
+ return -1;
+ }
if (options->formatFromString) {
g_autofree char *format = NULL;
@@ -1217,6 +1234,7 @@ virStoragePoolDefFormatBuf(virBufferPtr buf,
/* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
* files, so they don't have a target */
if (def->type != VIR_STORAGE_POOL_RBD &&
+ def->type != VIR_STORAGE_POOL_VITASTOR &&
def->type != VIR_STORAGE_POOL_SHEEPDOG &&
def->type != VIR_STORAGE_POOL_GLUSTER &&
def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
index ffd406e..8868a05 100644
--- a/src/conf/storage_conf.h
+++ b/src/conf/storage_conf.h
@@ -110,6 +110,7 @@ typedef enum {
VIR_STORAGE_POOL_GLUSTER, /* Gluster device */
VIR_STORAGE_POOL_ZFS, /* ZFS */
VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
+ VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
VIR_STORAGE_POOL_LAST,
} virStoragePoolType;
@@ -474,6 +475,7 @@ VIR_ENUM_DECL(virStoragePartedFs);
VIR_CONNECT_LIST_STORAGE_POOLS_SCSI | \
VIR_CONNECT_LIST_STORAGE_POOLS_MPATH | \
VIR_CONNECT_LIST_STORAGE_POOLS_RBD | \
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER | \
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS | \
diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
index 9fe8b3f..bf595b0 100644
--- a/src/conf/virstorageobj.c
+++ b/src/conf/virstorageobj.c
@@ -1491,6 +1491,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
return 1;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_LAST:
break;
@@ -1990,6 +1991,8 @@ virStoragePoolObjMatch(virStoragePoolObjPtr obj,
(obj->def->type == VIR_STORAGE_POOL_MPATH)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
(obj->def->type == VIR_STORAGE_POOL_RBD)) ||
+ (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
+ (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
(obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
index 2a7cdca..f756be1 100644
--- a/src/libvirt-storage.c
+++ b/src/libvirt-storage.c
@@ -92,6 +92,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
* VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
* VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
* VIR_CONNECT_LIST_STORAGE_POOLS_RBD
+ * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
* VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
* VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER
* VIR_CONNECT_LIST_STORAGE_POOLS_ZFS
diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
index 6a8ae27..a735bc6 100644
--- a/src/libxl/libxl_conf.c
+++ b/src/libxl/libxl_conf.c
@@ -942,6 +942,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSourcePtr src,
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/libxl/xen_xl.c b/src/libxl/xen_xl.c
index 17b93d0..c5a0084 100644
--- a/src/libxl/xen_xl.c
+++ b/src/libxl/xen_xl.c
@@ -1601,6 +1601,7 @@ xenFormatXLDiskSrcNet(virStorageSourcePtr src)
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
index f9c6da2..922dde5 100644
--- a/src/qemu/qemu_block.c
+++ b/src/qemu/qemu_block.c
@@ -938,6 +938,38 @@ qemuBlockStorageSourceGetRBDProps(virStorageSourcePtr src,
}
+static virJSONValuePtr
+qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
+{
+ virJSONValuePtr ret = NULL;
+ virStorageNetHostDefPtr host;
+ size_t i;
+ g_auto(virBuffer) buf = VIR_BUFFER_INITIALIZER;
+ g_autofree char *etcd = NULL;
+
+ for (i = 0; i < src->nhosts; i++) {
+ host = src->hosts + i;
+ if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
+ return NULL;
+ }
+ virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
+ }
+ if (src->nhosts > 0) {
+ etcd = virBufferContentAndReset(&buf);
+ }
+
+ if (virJSONValueObjectCreate(&ret,
+ "S:etcd_host", etcd,
+ "S:etcd_prefix", src->query,
+ "S:config_path", src->configFile,
+ "s:image", src->path,
+ NULL) < 0)
+ return NULL;
+
+ return ret;
+}
+
+
static virJSONValuePtr
qemuBlockStorageSourceGetSheepdogProps(virStorageSourcePtr src)
{
@@ -1224,6 +1256,12 @@ qemuBlockStorageSourceGetBackendProps(virStorageSourcePtr src,
return NULL;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
+ return NULL;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
@@ -2183,6 +2221,7 @@ qemuBlockGetBackingStoreString(virStorageSourcePtr src,
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
case VIR_STORAGE_NET_PROTOCOL_SSH:
@@ -2560,6 +2599,12 @@ qemuBlockStorageSourceCreateGetStorageProps(virStorageSourcePtr src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(location = qemuBlockStorageSourceGetVitastorProps(src)))
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(location = qemuBlockStorageSourceGetSheepdogProps(src)))
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 6f970a3..10b39ca 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -1034,6 +1034,43 @@ qemuBuildNetworkDriveStr(virStorageSourcePtr src,
ret = virBufferContentAndReset(&buf);
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (strchr(src->path, ':')) {
+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+ _("':' not allowed in Vitastor source volume name '%s'"),
+ src->path);
+ return NULL;
+ }
+
+ virBufferStrcat(&buf, "vitastor:image=", src->path, NULL);
+
+ if (src->nhosts > 0) {
+ virBufferAddLit(&buf, ":etcd_host=");
+ for (i = 0; i < src->nhosts; i++) {
+ if (i)
+ virBufferAddLit(&buf, ",");
+
+ /* assume host containing : is ipv6 */
+ if (strchr(src->hosts[i].name, ':'))
+ virBufferEscape(&buf, '\\', ":", "[%s]",
+ src->hosts[i].name);
+ else
+ virBufferAsprintf(&buf, "%s", src->hosts[i].name);
+
+ if (src->hosts[i].port)
+ virBufferAsprintf(&buf, "\\:%u", src->hosts[i].port);
+ }
+ }
+
+ if (src->configFile)
+ virBufferEscape(&buf, '\\', ":", ":config_path=%s", src->configFile);
+
+ if (src->query)
+ virBufferEscape(&buf, '\\', ":", ":etcd_prefix=%s", src->query);
+
+ ret = virBufferContentAndReset(&buf);
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_VXHS:
virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
_("VxHS protocol does not support URI syntax"));
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 0765dc7..4cff344 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -4610,7 +4610,8 @@ qemuDomainValidateStorageSource(virStorageSourcePtr src,
if (src->query &&
(actualType != VIR_STORAGE_TYPE_NETWORK ||
(src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTPS &&
- src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP))) {
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR))) {
virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
_("query is supported only with HTTP(S) protocols"));
return -1;
@@ -9704,6 +9705,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSourcePtr src,
break;
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c
index ee333c3..674aa58 100644
--- a/src/qemu/qemu_snapshot.c
+++ b/src/qemu/qemu_snapshot.c
@@ -403,6 +403,7 @@ qemuSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDefPtr snapdisk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
@@ -493,6 +494,7 @@ qemuSnapshotPrepareDiskExternalActive(virDomainObjPtr vm,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
case VIR_STORAGE_NET_PROTOCOL_HTTP:
@@ -623,6 +625,7 @@ qemuSnapshotPrepareDiskInternal(virDomainDiskDefPtr disk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
index 16bc53a..1e5d820 100644
--- a/src/storage/storage_driver.c
+++ b/src/storage/storage_driver.c
@@ -1645,6 +1645,7 @@ storageVolLookupByPathCallback(virStoragePoolObjPtr obj,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/test/test_driver.c b/src/test/test_driver.c
index 29c4c86..a27ad94 100644
--- a/src/test/test_driver.c
+++ b/src/test/test_driver.c
@@ -7096,6 +7096,7 @@ testStorageVolumeTypeForPool(int pooltype)
case VIR_STORAGE_POOL_ISCSI_DIRECT:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
return VIR_STORAGE_VOL_NETWORK;
case VIR_STORAGE_POOL_LOGICAL:
case VIR_STORAGE_POOL_DISK:
diff --git a/src/util/virstoragefile.c b/src/util/virstoragefile.c
index 0d3c2af..36e3afc 100644
--- a/src/util/virstoragefile.c
+++ b/src/util/virstoragefile.c
@@ -91,6 +91,7 @@ VIR_ENUM_IMPL(virStorageNetProtocol,
"ssh",
"vxhs",
"nfs",
+ "vitastor",
);
VIR_ENUM_IMPL(virStorageNetHostTransport,
@@ -2880,6 +2881,75 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
}
+static int
+virStorageSourceParseVitastorColonString(const char *colonstr,
+ virStorageSourcePtr src)
+{
+ char *p, *e, *next;
+ g_autofree char *options = NULL;
+
+ /* optionally skip the "vitastor:" prefix if provided */
+ if (STRPREFIX(colonstr, "vitastor:"))
+ colonstr += strlen("vitastor:");
+
+ options = g_strdup(colonstr);
+
+ p = options;
+ while (*p) {
+ /* find : delimiter or end of string */
+ for (e = p; *e && *e != ':'; ++e) {
+ if (*e == '\\') {
+ e++;
+ if (*e == '\0')
+ break;
+ }
+ }
+ if (*e == '\0') {
+ next = e; /* last kv pair */
+ } else {
+ next = e + 1;
+ *e = '\0';
+ }
+
+ if (STRPREFIX(p, "image=")) {
+ src->path = g_strdup(p + strlen("image="));
+ } else if (STRPREFIX(p, "etcd_prefix=")) {
+ src->query = g_strdup(p + strlen("etcd_prefix="));
+ } else if (STRPREFIX(p, "config_file=")) {
+ src->configFile = g_strdup(p + strlen("config_file="));
+ } else if (STRPREFIX(p, "etcd_host=")) {
+ char *h, *sep;
+
+ h = p + strlen("etcd_host=");
+ while (h < e) {
+ for (sep = h; sep < e; ++sep) {
+ if (*sep == '\\' && (sep[1] == ',' ||
+ sep[1] == ';' ||
+ sep[1] == ' ')) {
+ *sep = '\0';
+ sep += 2;
+ break;
+ }
+ }
+
+ if (virStorageSourceRBDAddHost(src, h) < 0)
+ return -1;
+
+ h = sep;
+ }
+ }
+
+ p = next;
+ }
+
+ if (!src->path) {
+ return -1;
+ }
+
+ return 0;
+}
+
+
static int
virStorageSourceParseNBDColonString(const char *nbdstr,
virStorageSourcePtr src)
@@ -2992,6 +3062,11 @@ virStorageSourceParseBackingColon(virStorageSourcePtr src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (virStorageSourceParseVitastorColonString(path, src) < 0)
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -3581,6 +3656,54 @@ virStorageSourceParseBackingJSONRBD(virStorageSourcePtr src,
return 0;
}
+static int
+virStorageSourceParseBackingJSONVitastor(virStorageSourcePtr src,
+ virJSONValuePtr json,
+ const char *jsonstr G_GNUC_UNUSED,
+ int opaque G_GNUC_UNUSED)
+{
+ const char *filename;
+ const char *image = virJSONValueObjectGetString(json, "image");
+ const char *conf = virJSONValueObjectGetString(json, "config_path");
+ const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd_prefix");
+ virJSONValuePtr servers = virJSONValueObjectGetArray(json, "server");
+ size_t nservers;
+ size_t i;
+
+ src->type = VIR_STORAGE_TYPE_NETWORK;
+ src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
+
+ /* legacy syntax passed via 'filename' option */
+ if ((filename = virJSONValueObjectGetString(json, "filename")))
+ return virStorageSourceParseVitastorColonString(filename, src);
+
+ if (!image) {
+ virReportError(VIR_ERR_INVALID_ARG, "%s",
+ _("missing image name in Vitastor backing volume "
+ "JSON specification"));
+ return -1;
+ }
+
+ src->path = g_strdup(image);
+ src->configFile = g_strdup(conf);
+ src->query = g_strdup(etcd_prefix);
+
+ if (servers) {
+ nservers = virJSONValueArraySize(servers);
+
+ src->hosts = g_new0(virStorageNetHostDef, nservers);
+ src->nhosts = nservers;
+
+ for (i = 0; i < nservers; i++) {
+ if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
+ virJSONValueArrayGet(servers, i)) < 0)
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
static int
virStorageSourceParseBackingJSONRaw(virStorageSourcePtr src,
virJSONValuePtr json,
@@ -3759,6 +3882,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
{"sheepdog", false, virStorageSourceParseBackingJSONSheepdog, 0},
{"ssh", false, virStorageSourceParseBackingJSONSSH, 0},
{"rbd", false, virStorageSourceParseBackingJSONRBD, 0},
+ {"vitastor", false, virStorageSourceParseBackingJSONVitastor, 0},
{"raw", true, virStorageSourceParseBackingJSONRaw, 0},
{"nfs", false, virStorageSourceParseBackingJSONNFS, 0},
{"vxhs", false, virStorageSourceParseBackingJSONVxHS, 0},
@@ -4503,6 +4627,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
return 24007;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_RBD:
/* we don't provide a default for RBD */
return 0;
diff --git a/src/util/virstoragefile.h b/src/util/virstoragefile.h
index 5689c39..3eb4e3c 100644
--- a/src/util/virstoragefile.h
+++ b/src/util/virstoragefile.h
@@ -136,6 +136,7 @@ typedef enum {
VIR_STORAGE_NET_PROTOCOL_SSH,
VIR_STORAGE_NET_PROTOCOL_VXHS,
VIR_STORAGE_NET_PROTOCOL_NFS,
+ VIR_STORAGE_NET_PROTOCOL_VITASTOR,
VIR_STORAGE_NET_PROTOCOL_LAST
} virStorageNetProtocol;
diff --git a/tests/storagepoolcapsschemadata/poolcaps-fs.xml b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
index eee75af..8bd0a57 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-fs.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='no'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolcapsschemadata/poolcaps-full.xml b/tests/storagepoolcapsschemadata/poolcaps-full.xml
index 805950a..852df0d 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-full.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-full.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='yes'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolxml2argvtest.c b/tests/storagepoolxml2argvtest.c
index 967d1f2..1e8ff7a 100644
--- a/tests/storagepoolxml2argvtest.c
+++ b/tests/storagepoolxml2argvtest.c
@@ -68,6 +68,7 @@ testCompareXMLToArgvFiles(bool shouldFail,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_VSTORAGE:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_LAST:
default:
VIR_TEST_DEBUG("pool type '%s' has no xml2argv test", defTypeStr);
diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
index 7835fa6..8841fcf 100644
--- a/tools/virsh-pool.c
+++ b/tools/virsh-pool.c
@@ -1237,6 +1237,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd G_GNUC_UNUSED)
case VIR_STORAGE_POOL_VSTORAGE:
flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
+ flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
+ break;
case VIR_STORAGE_POOL_LAST:
break;
}
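
Besides the colon string, these patches accept QEMU's JSON pseudo-protocol: `virStorageSourceParseBackingJSONVitastor` pulls `image`, `config_path`, `etcd_prefix` and the `server` array out of the JSON object, and falls back to the colon-string parser when a legacy `filename` key is present. A sketch of what such a backing filename could look like; the concrete field values below are made-up examples:

import json

backing = 'json:' + json.dumps({
    'driver': 'vitastor',
    'image': 'testimg',
    'etcd_prefix': '/vitastor',                    # stored into src->query
    'config_path': '/etc/vitastor/vitastor.conf',
    'server': [                                    # one InetSocketAddress per etcd host
        { 'host': '10.0.0.1', 'port': '2379' },
    ],
})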


@@ -1,661 +0,0 @@
commit c6e1958a1b4974828e8e5852beb252ce6594e670
Author: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Mon Jun 28 01:20:19 2021 +0300
Add Vitastor support
diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index 5ea14b6..a9df168 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -1859,6 +1859,35 @@
</element>
</define>
+ <define name="diskSourceNetworkProtocolVitastor">
+ <element name="source">
+ <interleave>
+ <attribute name="protocol">
+ <value>vitastor</value>
+ </attribute>
+ <ref name="diskSourceCommon"/>
+ <optional>
+ <attribute name="name"/>
+ </optional>
+ <optional>
+ <attribute name="query"/>
+ </optional>
+ <zeroOrMore>
+ <ref name="diskSourceNetworkHost"/>
+ </zeroOrMore>
+ <optional>
+ <element name="config">
+ <attribute name="file">
+ <ref name="absFilePath"/>
+ </attribute>
+ <empty/>
+ </element>
+ </optional>
+ <empty/>
+ </interleave>
+ </element>
+ </define>
+
<define name="diskSourceNetworkProtocolISCSI">
<element name="source">
<attribute name="protocol">
@@ -2115,6 +2144,7 @@
<ref name="diskSourceNetworkProtocolSimple"/>
<ref name="diskSourceNetworkProtocolVxHS"/>
<ref name="diskSourceNetworkProtocolNFS"/>
+ <ref name="diskSourceNetworkProtocolVitastor"/>
</choice>
</define>
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
index 089e1e0..d7e7ef4 100644
--- a/include/libvirt/libvirt-storage.h
+++ b/include/libvirt/libvirt-storage.h
@@ -245,6 +245,7 @@ typedef enum {
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS = 1 << 17,
VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE = 1 << 18,
VIR_CONNECT_LIST_STORAGE_POOLS_ISCSI_DIRECT = 1 << 19,
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR = 1 << 20,
} virConnectListAllStoragePoolsFlags;
int virConnectListAllStoragePools(virConnectPtr conn,
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index d78f846..f7222e3 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -8251,7 +8251,8 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
src->configFile = virXPathString("string(./config/@file)", ctxt);
if (src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTP ||
- src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS)
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS ||
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR)
src->query = virXMLPropString(node, "query");
if (virDomainStorageNetworkParseHosts(node, ctxt, &src->hosts, &src->nhosts) < 0)
@@ -30775,6 +30776,7 @@ virDomainStorageSourceTranslateSourcePool(virStorageSource *src,
case VIR_STORAGE_POOL_MPATH:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
index 2aa9a3d..166ca1f 100644
--- a/src/conf/storage_conf.c
+++ b/src/conf/storage_conf.c
@@ -60,7 +60,7 @@ VIR_ENUM_IMPL(virStoragePool,
"logical", "disk", "iscsi",
"iscsi-direct", "scsi", "mpath",
"rbd", "sheepdog", "gluster",
- "zfs", "vstorage",
+ "zfs", "vstorage", "vitastor",
);
VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
@@ -246,6 +246,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
.formatToString = virStorageFileFormatTypeToString,
}
},
+ {.poolType = VIR_STORAGE_POOL_VITASTOR,
+ .poolOptions = {
+ .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
+ VIR_STORAGE_POOL_SOURCE_NETWORK |
+ VIR_STORAGE_POOL_SOURCE_NAME),
+ },
+ .volOptions = {
+ .defaultFormat = VIR_STORAGE_FILE_RAW,
+ .formatFromString = virStorageVolumeFormatFromString,
+ .formatToString = virStorageFileFormatTypeToString,
+ }
+ },
{.poolType = VIR_STORAGE_POOL_SHEEPDOG,
.poolOptions = {
.flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -546,6 +558,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
_("element 'name' is mandatory for RBD pool"));
return -1;
}
+ if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
+ virReportError(VIR_ERR_XML_ERROR, "%s",
+ _("element 'name' is mandatory for Vitastor pool"));
+ return -1;
+ }
if (options->formatFromString) {
g_autofree char *format = NULL;
@@ -1182,6 +1199,7 @@ virStoragePoolDefFormatBuf(virBuffer *buf,
/* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
* files, so they don't have a target */
if (def->type != VIR_STORAGE_POOL_RBD &&
+ def->type != VIR_STORAGE_POOL_VITASTOR &&
def->type != VIR_STORAGE_POOL_SHEEPDOG &&
def->type != VIR_STORAGE_POOL_GLUSTER &&
def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
index 76efaac..928149a 100644
--- a/src/conf/storage_conf.h
+++ b/src/conf/storage_conf.h
@@ -106,6 +106,7 @@ typedef enum {
VIR_STORAGE_POOL_GLUSTER, /* Gluster device */
VIR_STORAGE_POOL_ZFS, /* ZFS */
VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
+ VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
VIR_STORAGE_POOL_LAST,
} virStoragePoolType;
@@ -465,6 +466,7 @@ VIR_ENUM_DECL(virStoragePartedFs);
VIR_CONNECT_LIST_STORAGE_POOLS_SCSI | \
VIR_CONNECT_LIST_STORAGE_POOLS_MPATH | \
VIR_CONNECT_LIST_STORAGE_POOLS_RBD | \
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER | \
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS | \
diff --git a/src/conf/storage_source_conf.c b/src/conf/storage_source_conf.c
index 5ca06fa..05ded49 100644
--- a/src/conf/storage_source_conf.c
+++ b/src/conf/storage_source_conf.c
@@ -85,6 +85,7 @@ VIR_ENUM_IMPL(virStorageNetProtocol,
"ssh",
"vxhs",
"nfs",
+ "vitastor",
);
@@ -1262,6 +1263,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
return 24007;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_RBD:
/* we don't provide a default for RBD */
return 0;
diff --git a/src/conf/storage_source_conf.h b/src/conf/storage_source_conf.h
index 389c7b5..dbf02e3 100644
--- a/src/conf/storage_source_conf.h
+++ b/src/conf/storage_source_conf.h
@@ -127,6 +127,7 @@ typedef enum {
VIR_STORAGE_NET_PROTOCOL_SSH,
VIR_STORAGE_NET_PROTOCOL_VXHS,
VIR_STORAGE_NET_PROTOCOL_NFS,
+ VIR_STORAGE_NET_PROTOCOL_VITASTOR,
VIR_STORAGE_NET_PROTOCOL_LAST
} virStorageNetProtocol;
diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
index 24957d6..4520a73 100644
--- a/src/conf/virstorageobj.c
+++ b/src/conf/virstorageobj.c
@@ -1487,6 +1487,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
return 1;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_LAST:
break;
@@ -1986,6 +1987,8 @@ virStoragePoolObjMatch(virStoragePoolObj *obj,
(obj->def->type == VIR_STORAGE_POOL_MPATH)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
(obj->def->type == VIR_STORAGE_POOL_RBD)) ||
+ (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
+ (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
(obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
index 2a7cdca..f756be1 100644
--- a/src/libvirt-storage.c
+++ b/src/libvirt-storage.c
@@ -92,6 +92,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
* VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
* VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
* VIR_CONNECT_LIST_STORAGE_POOLS_RBD
+ * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
* VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
* VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER
* VIR_CONNECT_LIST_STORAGE_POOLS_ZFS
diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
index 56cb9ab..dfb31b9 100644
--- a/src/libxl/libxl_conf.c
+++ b/src/libxl/libxl_conf.c
@@ -972,6 +972,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSource *src,
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/libxl/xen_xl.c b/src/libxl/xen_xl.c
index c0905b0..c172378 100644
--- a/src/libxl/xen_xl.c
+++ b/src/libxl/xen_xl.c
@@ -1540,6 +1540,7 @@ xenFormatXLDiskSrcNet(virStorageSource *src)
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
index 6627d04..c33f428 100644
--- a/src/qemu/qemu_block.c
+++ b/src/qemu/qemu_block.c
@@ -928,6 +928,38 @@ qemuBlockStorageSourceGetRBDProps(virStorageSource *src,
}
+static virJSONValue *
+qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
+{
+ virJSONValuePtr ret = NULL;
+ virStorageNetHostDefPtr host;
+ size_t i;
+ g_auto(virBuffer) buf = VIR_BUFFER_INITIALIZER;
+ g_autofree char *etcd = NULL;
+
+ for (i = 0; i < src->nhosts; i++) {
+ host = src->hosts + i;
+ if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
+ return NULL;
+ }
+ virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
+ }
+ if (src->nhosts > 0) {
+ etcd = virBufferContentAndReset(&buf);
+ }
+
+ if (virJSONValueObjectCreate(&ret,
+ "S:etcd_host", etcd,
+ "S:etcd_prefix", src->query,
+ "S:config_path", src->configFile,
+ "s:image", src->path,
+ NULL) < 0)
+ return NULL;
+
+ return ret;
+}
+
+
static virJSONValue *
qemuBlockStorageSourceGetSheepdogProps(virStorageSource *src)
{
@@ -1218,6 +1250,12 @@ qemuBlockStorageSourceGetBackendProps(virStorageSource *src,
return NULL;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
+ return NULL;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
@@ -2231,6 +2269,7 @@ qemuBlockGetBackingStoreString(virStorageSource *src,
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
case VIR_STORAGE_NET_PROTOCOL_SSH:
@@ -2608,6 +2647,12 @@ qemuBlockStorageSourceCreateGetStorageProps(virStorageSource *src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(location = qemuBlockStorageSourceGetVitastorProps(src)))
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(location = qemuBlockStorageSourceGetSheepdogProps(src)))
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index ea51369..8258632 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -1074,6 +1074,43 @@ qemuBuildNetworkDriveStr(virStorageSource *src,
ret = virBufferContentAndReset(&buf);
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (strchr(src->path, ':')) {
+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+ _("':' not allowed in Vitastor source volume name '%s'"),
+ src->path);
+ return NULL;
+ }
+
+ virBufferStrcat(&buf, "vitastor:image=", src->path, NULL);
+
+ if (src->nhosts > 0) {
+ virBufferAddLit(&buf, ":etcd_host=");
+ for (i = 0; i < src->nhosts; i++) {
+ if (i)
+ virBufferAddLit(&buf, ",");
+
+ /* assume host containing : is ipv6 */
+ if (strchr(src->hosts[i].name, ':'))
+ virBufferEscape(&buf, '\\', ":", "[%s]",
+ src->hosts[i].name);
+ else
+ virBufferAsprintf(&buf, "%s", src->hosts[i].name);
+
+ if (src->hosts[i].port)
+ virBufferAsprintf(&buf, "\\:%u", src->hosts[i].port);
+ }
+ }
+
+ if (src->configFile)
+ virBufferEscape(&buf, '\\', ":", ":config_path=%s", src->configFile);
+
+ if (src->query)
+ virBufferEscape(&buf, '\\', ":", ":etcd_prefix=%s", src->query);
+
+ ret = virBufferContentAndReset(&buf);
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_VXHS:
virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
_("VxHS protocol does not support URI syntax"));
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index fc60e15..5ab410d 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -4829,7 +4829,8 @@ qemuDomainValidateStorageSource(virStorageSource *src,
if (src->query &&
(actualType != VIR_STORAGE_TYPE_NETWORK ||
(src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTPS &&
- src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP))) {
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR))) {
virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
_("query is supported only with HTTP(S) protocols"));
return -1;
@@ -10027,6 +10028,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSource *src,
break;
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c
index 4e74ddd..14e5f2e 100644
--- a/src/qemu/qemu_snapshot.c
+++ b/src/qemu/qemu_snapshot.c
@@ -402,6 +402,7 @@ qemuSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDef *snapdisk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
@@ -494,6 +495,7 @@ qemuSnapshotPrepareDiskExternalActive(virDomainObj *vm,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
case VIR_STORAGE_NET_PROTOCOL_HTTP:
@@ -647,6 +649,7 @@ qemuSnapshotPrepareDiskInternal(virDomainDiskDef *disk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
index c2ff4b8..70d0689 100644
--- a/src/storage/storage_driver.c
+++ b/src/storage/storage_driver.c
@@ -1644,6 +1644,7 @@ storageVolLookupByPathCallback(virStoragePoolObj *obj,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/storage_file/storage_source_backingstore.c b/src/storage_file/storage_source_backingstore.c
index e48ae72..d7a9b72 100644
--- a/src/storage_file/storage_source_backingstore.c
+++ b/src/storage_file/storage_source_backingstore.c
@@ -284,6 +284,75 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
}
+static int
+virStorageSourceParseVitastorColonString(const char *colonstr,
+ virStorageSource *src)
+{
+ char *p, *e, *next;
+ g_autofree char *options = NULL;
+
+ /* optionally skip the "vitastor:" prefix if provided */
+ if (STRPREFIX(colonstr, "vitastor:"))
+ colonstr += strlen("vitastor:");
+
+ options = g_strdup(colonstr);
+
+ p = options;
+ while (*p) {
+ /* find : delimiter or end of string */
+ for (e = p; *e && *e != ':'; ++e) {
+ if (*e == '\\') {
+ e++;
+ if (*e == '\0')
+ break;
+ }
+ }
+ if (*e == '\0') {
+ next = e; /* last kv pair */
+ } else {
+ next = e + 1;
+ *e = '\0';
+ }
+
+ if (STRPREFIX(p, "image=")) {
+ src->path = g_strdup(p + strlen("image="));
+ } else if (STRPREFIX(p, "etcd_prefix=")) {
+ src->query = g_strdup(p + strlen("etcd_prefix="));
+ } else if (STRPREFIX(p, "config_file=")) {
+ src->configFile = g_strdup(p + strlen("config_file="));
+ } else if (STRPREFIX(p, "etcd_host=")) {
+ char *h, *sep;
+
+ h = p + strlen("etcd_host=");
+ while (h < e) {
+ for (sep = h; sep < e; ++sep) {
+ if (*sep == '\\' && (sep[1] == ',' ||
+ sep[1] == ';' ||
+ sep[1] == ' ')) {
+ *sep = '\0';
+ sep += 2;
+ break;
+ }
+ }
+
+ if (virStorageSourceRBDAddHost(src, h) < 0)
+ return -1;
+
+ h = sep;
+ }
+ }
+
+ p = next;
+ }
+
+ if (!src->path) {
+ return -1;
+ }
+
+ return 0;
+}
+
+
static int
virStorageSourceParseNBDColonString(const char *nbdstr,
virStorageSource *src)
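The parser above consumes one key=value pair at a time and treats a backslash as an escape, so '\:' inside a value does not terminate the pair. A simplified Python sketch of that tokenization (hypothetical helper; unlike the C code it also strips the backslashes):

    def split_unescaped(s, delim=':'):
        parts, cur, i = [], '', 0
        while i < len(s):
            if s[i] == '\\' and i + 1 < len(s):  # escaped char: keep it literally
                cur += s[i + 1]
                i += 2
            elif s[i] == delim:                  # unescaped delimiter ends the pair
                parts.append(cur)
                cur, i = '', i + 1
            else:
                cur += s[i]
                i += 1
        parts.append(cur)
        return parts

    # split_unescaped('image=debian9:etcd_host=192.168.7.2\\:2379')
    # -> ['image=debian9', 'etcd_host=192.168.7.2:2379']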
@@ -396,6 +465,11 @@ virStorageSourceParseBackingColon(virStorageSource *src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (virStorageSourceParseVitastorColonString(path, src) < 0)
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -984,6 +1058,54 @@ virStorageSourceParseBackingJSONRBD(virStorageSource *src,
return 0;
}
+static int
+virStorageSourceParseBackingJSONVitastor(virStorageSource *src,
+ virJSONValue *json,
+ const char *jsonstr G_GNUC_UNUSED,
+ int opaque G_GNUC_UNUSED)
+{
+ const char *filename;
+ const char *image = virJSONValueObjectGetString(json, "image");
+ const char *conf = virJSONValueObjectGetString(json, "config_path");
+ const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd_prefix");
+ virJSONValue *servers = virJSONValueObjectGetArray(json, "server");
+ size_t nservers;
+ size_t i;
+
+ src->type = VIR_STORAGE_TYPE_NETWORK;
+ src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
+
+ /* legacy syntax passed via 'filename' option */
+ if ((filename = virJSONValueObjectGetString(json, "filename")))
+ return virStorageSourceParseVitastorColonString(filename, src);
+
+ if (!image) {
+ virReportError(VIR_ERR_INVALID_ARG, "%s",
+ _("missing image name in Vitastor backing volume "
+ "JSON specification"));
+ return -1;
+ }
+
+ src->path = g_strdup(image);
+ src->configFile = g_strdup(conf);
+ src->query = g_strdup(etcd_prefix);
+
+ if (servers) {
+ nservers = virJSONValueArraySize(servers);
+
+ src->hosts = g_new0(virStorageNetHostDef, nservers);
+ src->nhosts = nservers;
+
+ for (i = 0; i < nservers; i++) {
+ if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
+ virJSONValueArrayGet(servers, i)) < 0)
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
static int
virStorageSourceParseBackingJSONRaw(virStorageSource *src,
virJSONValue *json,
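For reference, QEMU hands such definitions over as a json: pseudo-protocol string; shown below as a Python dict for readability, with all values being placeholders:

    backing = {
        "driver": "vitastor",
        "image": "debian9",
        "etcd_prefix": "/vitastor",
        "config_path": "/etc/vitastor/vitastor.conf",
        # parsed by virStorageSourceParseBackingJSONInetSocketAddress above
        "server": [{"host": "192.168.7.2", "port": "2379"}],
    }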
@@ -1162,6 +1284,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
{"sheepdog", false, virStorageSourceParseBackingJSONSheepdog, 0},
{"ssh", false, virStorageSourceParseBackingJSONSSH, 0},
{"rbd", false, virStorageSourceParseBackingJSONRBD, 0},
+ {"vitastor", false, virStorageSourceParseBackingJSONVitastor, 0},
{"raw", true, virStorageSourceParseBackingJSONRaw, 0},
{"nfs", false, virStorageSourceParseBackingJSONNFS, 0},
{"vxhs", false, virStorageSourceParseBackingJSONVxHS, 0},
diff --git a/src/test/test_driver.c b/src/test/test_driver.c
index ef0ddab..2173dc3 100644
--- a/src/test/test_driver.c
+++ b/src/test/test_driver.c
@@ -7131,6 +7131,7 @@ testStorageVolumeTypeForPool(int pooltype)
case VIR_STORAGE_POOL_ISCSI_DIRECT:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
return VIR_STORAGE_VOL_NETWORK;
case VIR_STORAGE_POOL_LOGICAL:
case VIR_STORAGE_POOL_DISK:
diff --git a/tests/storagepoolcapsschemadata/poolcaps-fs.xml b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
index eee75af..8bd0a57 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-fs.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='no'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolcapsschemadata/poolcaps-full.xml b/tests/storagepoolcapsschemadata/poolcaps-full.xml
index 805950a..852df0d 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-full.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-full.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='yes'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolxml2argvtest.c b/tests/storagepoolxml2argvtest.c
index 449b745..7f95cc8 100644
--- a/tests/storagepoolxml2argvtest.c
+++ b/tests/storagepoolxml2argvtest.c
@@ -68,6 +68,7 @@ testCompareXMLToArgvFiles(bool shouldFail,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_VSTORAGE:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_LAST:
default:
VIR_TEST_DEBUG("pool type '%s' has no xml2argv test", defTypeStr);
diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
index 18f3839..c8e1436 100644
--- a/tools/virsh-pool.c
+++ b/tools/virsh-pool.c
@@ -1231,6 +1231,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd G_GNUC_UNUSED)
case VIR_STORAGE_POOL_VSTORAGE:
flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
+ flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
+ break;
case VIR_STORAGE_POOL_LAST:
break;
}

View File

@@ -1,32 +0,0 @@
<!-- Example libvirt VM configuration with Vitastor disk -->
<domain type='kvm'>
<name>debian9</name>
<uuid>96f277fb-fd9c-49da-bf21-a5cfd54eb162</uuid>
<memory unit="KiB">524288</memory>
<currentMemory>524288</currentMemory>
<vcpu>1</vcpu>
<os>
<type arch='x86_64'>hvm</type>
<boot dev='hd' />
</os>
<devices>
<emulator>/usr/bin/qemu-system-x86_64</emulator>
<disk type='network' device='disk'>
<target dev='vda' bus='virtio' />
<driver name='qemu' type='raw' />
<!-- name is Vitastor image name -->
<!-- config (optional) is the path to Vitastor's configuration file -->
<!-- query (optional) is Vitastor's etcd_prefix -->
<source protocol='vitastor' name='debian9' query='/vitastor' config='/etc/vitastor/vitastor.conf'>
<!-- hosts = etcd addresses -->
<host name='192.168.7.2' port='2379' />
</source>
<!-- required because Vitastor only supports 4k physical sectors -->
<blockio physical_block_size="4096" logical_block_size="512" />
</disk>
<interface type='network'>
<source network='default' />
</interface>
<graphics type='vnc' port='-1' />
</devices>
</domain>
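For comparison, the same disk could be attached to QEMU directly through the legacy colon syntax generated by the patches above; a hedged sketch in Python string form, with values copied from the XML example:

    drive_arg = ('vitastor:image=debian9'
                 ':etcd_host=192.168.7.2\\:2379'
                 ':config_path=/etc/vitastor/vitastor.conf'
                 ':etcd_prefix=/vitastor')
    # passed to QEMU as: -drive format=raw,file=<drive_arg>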

View File

@@ -1,287 +0,0 @@
diff --git a/nova/virt/image/model.py b/nova/virt/image/model.py
index 971f7e9c07..70ed70d5e2 100644
--- a/nova/virt/image/model.py
+++ b/nova/virt/image/model.py
@@ -129,3 +129,22 @@ class RBDImage(Image):
self.user = user
self.password = password
self.servers = servers
+
+
+class VitastorImage(Image):
+ """Class for images in a remote Vitastor cluster"""
+
+ def __init__(self, name, etcd_address=None, etcd_prefix=None, config_path=None):
+ """Create a new Vitastor image object
+
+ :param name: name of the image
+ :param etcd_address: etcd URL(s) (optional)
+ :param etcd_prefix: etcd prefix (optional)
+ :param config_path: path to the configuration (optional)
+ """
+ super(VitastorImage, self).__init__(FORMAT_RAW)
+
+ self.name = name
+ self.etcd_address = etcd_address
+ self.etcd_prefix = etcd_prefix
+ self.config_path = config_path
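A hypothetical usage of the model class above (values are placeholders):

    img = VitastorImage('debian9',
                        etcd_address='192.168.7.2:2379',
                        etcd_prefix='/vitastor',
                        config_path='/etc/vitastor/vitastor.conf')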
diff --git a/nova/virt/images.py b/nova/virt/images.py
index 5358f3766a..ebe3d6effb 100644
--- a/nova/virt/images.py
+++ b/nova/virt/images.py
@@ -41,7 +41,7 @@ IMAGE_API = glance.API()
def qemu_img_info(path, format=None):
"""Return an object containing the parsed output from qemu-img info."""
- if not os.path.exists(path) and not path.startswith('rbd:'):
+ if not os.path.exists(path) and not path.startswith('rbd:') and not path.startswith('vitastor:'):
raise exception.DiskNotFound(location=path)
info = nova.privsep.qemu.unprivileged_qemu_img_info(path, format=format)
@@ -50,7 +50,7 @@ def qemu_img_info(path, format=None):
def privileged_qemu_img_info(path, format=None, output_format='json'):
"""Return an object containing the parsed output from qemu-img info."""
- if not os.path.exists(path) and not path.startswith('rbd:'):
+ if not os.path.exists(path) and not path.startswith('rbd:') and not path.startswith('vitastor:'):
raise exception.DiskNotFound(location=path)
info = nova.privsep.qemu.privileged_qemu_img_info(path, format=format)
diff --git a/nova/virt/libvirt/config.py b/nova/virt/libvirt/config.py
index f9475776b3..51573fe41d 100644
--- a/nova/virt/libvirt/config.py
+++ b/nova/virt/libvirt/config.py
@@ -1060,6 +1060,8 @@ class LibvirtConfigGuestDisk(LibvirtConfigGuestDevice):
self.driver_iommu = False
self.source_path = None
self.source_protocol = None
+ self.source_query = None
+ self.source_config = None
self.source_name = None
self.source_hosts = []
self.source_ports = []
@@ -1186,7 +1188,8 @@ class LibvirtConfigGuestDisk(LibvirtConfigGuestDevice):
elif self.source_type == "mount":
dev.append(etree.Element("source", dir=self.source_path))
elif self.source_type == "network" and self.source_protocol:
- source = etree.Element("source", protocol=self.source_protocol)
+ source = etree.Element("source", protocol=self.source_protocol,
+ query=self.source_query, config=self.source_config)
if self.source_name is not None:
source.set('name', self.source_name)
hosts_info = zip(self.source_hosts, self.source_ports)
diff --git a/nova/virt/libvirt/driver.py b/nova/virt/libvirt/driver.py
index 391231c527..34dc60dcdd 100644
--- a/nova/virt/libvirt/driver.py
+++ b/nova/virt/libvirt/driver.py
@@ -179,6 +179,7 @@ VOLUME_DRIVERS = {
'local': 'nova.virt.libvirt.volume.volume.LibvirtVolumeDriver',
'fake': 'nova.virt.libvirt.volume.volume.LibvirtFakeVolumeDriver',
'rbd': 'nova.virt.libvirt.volume.net.LibvirtNetVolumeDriver',
+ 'vitastor': 'nova.virt.libvirt.volume.vitastor.LibvirtVitastorVolumeDriver',
'nfs': 'nova.virt.libvirt.volume.nfs.LibvirtNFSVolumeDriver',
'smbfs': 'nova.virt.libvirt.volume.smbfs.LibvirtSMBFSVolumeDriver',
'fibre_channel': 'nova.virt.libvirt.volume.fibrechannel.LibvirtFibreChannelVolumeDriver', # noqa:E501
@@ -385,10 +386,10 @@ class LibvirtDriver(driver.ComputeDriver):
# This prevents the risk of one test setting a capability
# which bleeds over into other tests.
- # LVM and RBD require raw images. If we are not configured to
+ # LVM, RBD, Vitastor require raw images. If we are not configured to
# force convert images into raw format, then we _require_ raw
# images only.
- raw_only = ('rbd', 'lvm')
+ raw_only = ('rbd', 'lvm', 'vitastor')
requires_raw_image = (CONF.libvirt.images_type in raw_only and
not CONF.force_raw_images)
requires_ploop_image = CONF.libvirt.virt_type == 'parallels'
@@ -775,12 +776,12 @@ class LibvirtDriver(driver.ComputeDriver):
# Some imagebackends are only able to import raw disk images,
# and will fail if given any other format. See the bug
# https://bugs.launchpad.net/nova/+bug/1816686 for more details.
- if CONF.libvirt.images_type in ('rbd',):
+ if CONF.libvirt.images_type in ('rbd', 'vitastor'):
if not CONF.force_raw_images:
msg = _("'[DEFAULT]/force_raw_images = False' is not "
- "allowed with '[libvirt]/images_type = rbd'. "
+ "allowed with '[libvirt]/images_type = rbd' or 'vitastor'. "
"Please check the two configs and if you really "
- "do want to use rbd as images_type, set "
+ "do want to use rbd or vitastor as images_type, set "
"force_raw_images to True.")
raise exception.InvalidConfiguration(msg)
@@ -2603,6 +2604,16 @@ class LibvirtDriver(driver.ComputeDriver):
if connection_info['data'].get('auth_enabled'):
username = connection_info['data']['auth_username']
path = f"rbd:{volume_name}:id={username}"
+ elif connection_info['driver_volume_type'] == 'vitastor':
+ volume_name = connection_info['data']['name']
+ path = 'vitastor:image='+volume_name.replace(':', '\\:')
+ for k in [ 'config_path', 'etcd_address', 'etcd_prefix' ]:
+ if k in connection_info['data']:
+ kk = k
+ if kk == 'etcd_address':
+ # FIXME use etcd_address in qemu driver
+ kk = 'etcd_host'
+ path += ":"+kk+"="+connection_info['data'][k].replace(':', '\\:')
else:
path = 'unknown'
raise exception.DiskNotFound(location='unknown')
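The branch above renames 'etcd_address' to 'etcd_host' while building the QEMU path string, pending proper etcd_address support in the QEMU driver. A compact Python sketch with placeholder values:

    data = {'name': 'debian9', 'etcd_address': '192.168.7.2:2379'}
    path = 'vitastor:image=' + data['name'].replace(':', '\\:')
    for k in ('config_path', 'etcd_address', 'etcd_prefix'):
        if k in data:
            kk = 'etcd_host' if k == 'etcd_address' else k
            path += ':' + kk + '=' + data[k].replace(':', '\\:')
    # path == 'vitastor:image=debian9:etcd_host=192.168.7.2\:2379'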
@@ -2827,8 +2838,8 @@ class LibvirtDriver(driver.ComputeDriver):
image_format = CONF.libvirt.snapshot_image_format or source_type
- # NOTE(bfilippov): save lvm and rbd as raw
- if image_format == 'lvm' or image_format == 'rbd':
+ # NOTE(bfilippov): save lvm and rbd and vitastor as raw
+ if image_format == 'lvm' or image_format == 'rbd' or image_format == 'vitastor':
image_format = 'raw'
metadata = self._create_snapshot_metadata(instance.image_meta,
@@ -2899,7 +2910,7 @@ class LibvirtDriver(driver.ComputeDriver):
expected_state=task_states.IMAGE_UPLOADING)
# TODO(nic): possibly abstract this out to the root_disk
- if source_type == 'rbd' and live_snapshot:
+ if (source_type == 'rbd' or source_type == 'vitastor') and live_snapshot:
# Standard snapshot uses qemu-img convert from RBD which is
# not safe to run with live_snapshot.
live_snapshot = False
@@ -4099,7 +4110,7 @@ class LibvirtDriver(driver.ComputeDriver):
# cleanup rescue volume
lvm.remove_volumes([lvmdisk for lvmdisk in self._lvm_disks(instance)
if lvmdisk.endswith('.rescue')])
- if CONF.libvirt.images_type == 'rbd':
+ if CONF.libvirt.images_type == 'rbd' or CONF.libvirt.images_type == 'vitastor':
filter_fn = lambda disk: (disk.startswith(instance.uuid) and
disk.endswith('.rescue'))
rbd_utils.RBDDriver().cleanup_volumes(filter_fn)
@@ -4356,6 +4367,8 @@ class LibvirtDriver(driver.ComputeDriver):
# TODO(mikal): there is a bug here if images_type has
# changed since creation of the instance, but I am pretty
# sure that this bug already exists.
+ if CONF.libvirt.images_type == 'vitastor':
+ return 'vitastor'
return 'rbd' if CONF.libvirt.images_type == 'rbd' else 'raw'
@staticmethod
@@ -4764,10 +4777,10 @@ class LibvirtDriver(driver.ComputeDriver):
finally:
# NOTE(mikal): if the config drive was imported into RBD,
# then we no longer need the local copy
- if CONF.libvirt.images_type == 'rbd':
+ if CONF.libvirt.images_type == 'rbd' or CONF.libvirt.images_type == 'vitastor':
LOG.info('Deleting local config drive %(path)s '
- 'because it was imported into RBD.',
- {'path': config_disk_local_path},
+ 'because it was imported into %(type)s.',
+ {'path': config_disk_local_path, 'type': CONF.libvirt.images_type},
instance=instance)
os.unlink(config_disk_local_path)
diff --git a/nova/virt/libvirt/utils.py b/nova/virt/libvirt/utils.py
index da2a6e8b8a..52c02e72f1 100644
--- a/nova/virt/libvirt/utils.py
+++ b/nova/virt/libvirt/utils.py
@@ -340,6 +340,10 @@ def find_disk(guest: libvirt_guest.Guest) -> ty.Tuple[str, ty.Optional[str]]:
disk_path = disk.source_name
if disk_path:
disk_path = 'rbd:' + disk_path
+ elif not disk_path and disk.source_protocol == 'vitastor':
+ disk_path = disk.source_name
+ if disk_path:
+ disk_path = 'vitastor:' + disk_path
if not disk_path:
raise RuntimeError(_("Can't retrieve root device path "
@@ -354,6 +358,8 @@ def get_disk_type_from_path(path: str) -> ty.Optional[str]:
return 'lvm'
elif path.startswith('rbd:'):
return 'rbd'
+ elif path.startswith('vitastor:'):
+ return 'vitastor'
elif (os.path.isdir(path) and
os.path.exists(os.path.join(path, "DiskDescriptor.xml"))):
return 'ploop'
diff --git a/nova/virt/libvirt/volume/vitastor.py b/nova/virt/libvirt/volume/vitastor.py
new file mode 100644
index 0000000000..0256df62c1
--- /dev/null
+++ b/nova/virt/libvirt/volume/vitastor.py
@@ -0,0 +1,75 @@
+# Copyright (c) 2021+, Vitaliy Filippov <vitalif@yourcmc.ru>
+#
+# Licensed under the Apache License, Version 2.0 (the "License"); you may
+# not use this file except in compliance with the License. You may obtain
+# a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+# License for the specific language governing permissions and limitations
+# under the License.
+
+from os_brick import exception as os_brick_exception
+from os_brick import initiator
+from os_brick.initiator import connector
+from oslo_log import log as logging
+
+import nova.conf
+from nova import utils
+from nova.virt.libvirt.volume import volume as libvirt_volume
+
+
+CONF = nova.conf.CONF
+LOG = logging.getLogger(__name__)
+
+
+class LibvirtVitastorVolumeDriver(libvirt_volume.LibvirtBaseVolumeDriver):
+ """Driver to attach Vitastor volumes to libvirt."""
+ def __init__(self, host):
+ super(LibvirtVitastorVolumeDriver, self).__init__(host, is_block_dev=False)
+
+ def connect_volume(self, connection_info, instance):
+ pass
+
+ def disconnect_volume(self, connection_info, instance):
+ pass
+
+ def get_config(self, connection_info, disk_info):
+ """Returns xml for libvirt."""
+ conf = super(LibvirtVitastorVolumeDriver, self).get_config(connection_info, disk_info)
+ conf.source_type = 'network'
+ conf.source_protocol = 'vitastor'
+ conf.source_name = connection_info['data'].get('name')
+ conf.source_query = connection_info['data'].get('etcd_prefix') or None
+ conf.source_config = connection_info['data'].get('config_path') or None
+ conf.source_hosts = []
+ conf.source_ports = []
+ addresses = connection_info['data'].get('etcd_address', '')
+ if addresses:
+ if not isinstance(addresses, list):
+ addresses = addresses.split(',')
+ for addr in addresses:
+ if addr.startswith('https://'):
+ raise NotImplementedError('Vitastor block driver does not support SSL for etcd communication yet')
+ if addr.startswith('http://'):
+ addr = addr[7:]
+ addr = addr.rstrip('/')
+ if addr.endswith('/v3'):
+ addr = addr[0:-3]
+ p = addr.find('/')
+ if p > 0:
+ raise NotImplementedError('libvirt does not support custom URL paths for Vitastor etcd yet. Use /etc/vitastor/vitastor.conf')
+ p = addr.find(':')
+ port = '2379'
+ if p > 0:
+ port = addr[p+1:]
+ addr = addr[0:p]
+ conf.source_hosts.append(addr)
+ conf.source_ports.append(port)
+ return conf
+
+ def extend_volume(self, connection_info, instance, requested_size):
+ raise NotImplementedError
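
A few illustrative inputs and what the address normalization above makes of them (hypothetical addresses):

    # 'http://10.0.0.1:2379/v3'     -> host '10.0.0.1', port '2379'
    # '10.0.0.2'                    -> host '10.0.0.2', port '2379' (default)
    # 'https://10.0.0.3:2379'       -> NotImplementedError (no etcd SSL support yet)
    # 'http://10.0.0.4:2379/custom' -> NotImplementedError (custom URL paths unsupported)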

View File

@@ -11,7 +11,7 @@ Index: qemu-3.1+dfsg/qapi/block-core.json
'host_cdrom', 'host_device', 'http', 'https', 'iscsi', 'luks',
'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels', 'qcow',
'qcow2', 'qed', 'quorum', 'raw', 'rbd', 'replication', 'sheepdog',
@@ -3367,6 +3367,28 @@
@@ -3367,6 +3367,24 @@
'*tag': 'str' } }
##
@@ -19,21 +19,17 @@ Index: qemu-3.1+dfsg/qapi/block-core.json
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config_path: Path to Vitastor configuration
+# @etcd_host: etcd connection address(es)
+# @etcd_host: etcd connection address
+# @etcd_prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config_path': 'str',
+ '*etcd_host': 'str',
+ 'data': { 'inode': 'uint64',
+ 'pool': 'uint64',
+ 'size': 'uint64',
+ 'etcd_host': 'str',
+ '*etcd_prefix': 'str' } }
+
+##
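In QMP terms this struct corresponds to a blockdev-add argument set; a hedged sketch of the inode-addressed variant kept by this change, written as a Python dict (node-name and all values are placeholders):

    blockdev_add = {
        'driver': 'vitastor',
        'node-name': 'vitastor0',        # hypothetical node name
        'inode': 1,
        'pool': 1,
        'size': 10737418240,             # image size in bytes, placeholder
        'etcd_host': '192.168.7.2:2379',
    }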

View File

@@ -11,7 +11,7 @@ Index: qemu/qapi/block-core.json
'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
##
@@ -3725,6 +3725,28 @@
@@ -3725,6 +3725,24 @@
'*tag': 'str' } }
##
@@ -19,21 +19,17 @@ Index: qemu/qapi/block-core.json
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config_path: Path to Vitastor configuration
+# @etcd_host: etcd connection address(es)
+# @etcd_host: etcd connection address
+# @etcd_prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config_path': 'str',
+ '*etcd_host': 'str',
+ 'data': { 'inode': 'uint64',
+ 'pool': 'uint64',
+ 'size': 'uint64',
+ 'etcd_host': 'str',
+ '*etcd_prefix': 'str' } }
+
+##

View File

@@ -11,7 +11,7 @@ Index: qemu/qapi/block-core.json
'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
##
@@ -3635,6 +3635,28 @@
@@ -3635,6 +3635,24 @@
'*tag': 'str' } }
##
@@ -19,21 +19,17 @@ Index: qemu/qapi/block-core.json
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config_path: Path to Vitastor configuration
+# @etcd_host: etcd connection address(es)
+# @etcd_host: etcd connection address
+# @etcd_prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config_path': 'str',
+ '*etcd_host': 'str',
+ 'data': { 'inode': 'uint64',
+ 'pool': 'uint64',
+ 'size': 'uint64',
+ 'etcd_host': 'str',
+ '*etcd_prefix': 'str' } }
+
+##

View File

@@ -11,7 +11,7 @@ Index: qemu-5.1+dfsg/qapi/block-core.json
'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
##
@@ -3644,6 +3644,28 @@
@@ -3644,6 +3644,24 @@
'*tag': 'str' } }
##
@@ -19,21 +19,17 @@ Index: qemu-5.1+dfsg/qapi/block-core.json
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @config_path: Path to Vitastor configuration
+# @etcd_host: etcd connection address(es)
+# @etcd_host: etcd connection address
+# @etcd_prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config_path': 'str',
+ '*etcd_host': 'str',
+ 'data': { 'inode': 'uint64',
+ 'pool': 'uint64',
+ 'size': 'uint64',
+ 'etcd_host': 'str',
+ '*etcd_prefix': 'str' } }
+
+##

View File

@@ -48,4 +48,4 @@ FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Ve
QEMU=`rpm -qi qemu qemu-kvm | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
perl -i -pe 's/(Requires:\s*qemu(?:-kvm)?)([^\n]+)?/$1 = '$QEMU'/' $VITASTOR/rpm/vitastor-el$EL.spec
tar --transform 's#^#vitastor-0.6.5/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.6.5$(rpm --eval '%dist').tar.gz *
tar --transform 's#^#vitastor-0.5.10/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.5.10$(rpm --eval '%dist').tar.gz *

View File

@@ -11,7 +11,7 @@ RUN rm -rf /var/lib/dnf/*; dnf download --disablerepo='*' --enablerepo='centos-a
RUN rpm --nomd5 -i qemu*.src.rpm
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=PowerTools --spec qemu-kvm.spec
ADD patches/qemu-*-vitastor.patch /root/vitastor/patches/
ADD qemu-*-vitastor.patch /root/vitastor/
RUN set -e; \
mkdir -p /root/packages/qemu-el8; \
@@ -25,7 +25,7 @@ RUN set -e; \
echo "Patch$((PN+1)): qemu-4.2-vitastor.patch" >> qemu-kvm.spec; \
tail -n +2 xx01 >> qemu-kvm.spec; \
perl -i -pe 's/(^Release:\s*\d+)/$1.vitastor/' qemu-kvm.spec; \
cp /root/vitastor/patches/qemu-4.2-vitastor.patch ~/rpmbuild/SOURCES; \
cp /root/vitastor/qemu-4.2-vitastor.patch ~/rpmbuild/SOURCES; \
rpmbuild --nocheck -ba qemu-kvm.spec; \
cp ~/rpmbuild/RPMS/*/*qemu* /root/packages/qemu-el8/; \
cp ~/rpmbuild/SRPMS/*qemu* /root/packages/qemu-el8/

View File

@@ -15,9 +15,8 @@ RUN yumdownloader --disablerepo=centos-sclo-rh --source fio
RUN rpm --nomd5 -i qemu*.src.rpm
RUN rpm --nomd5 -i fio*.src.rpm
RUN rm -f /etc/yum.repos.d/CentOS-Media.repo
RUN cd ~/rpmbuild/SPECS && yum-builddep -y qemu-kvm.spec
RUN cd ~/rpmbuild/SPECS && yum-builddep -y fio.spec
RUN yum -y install rdma-core-devel
RUN cd ~/rpmbuild/SPECS && yum-builddep -y --enablerepo='*' --disablerepo=centos-sclo-rh --disablerepo=centos-sclo-rh-source --disablerepo=centos-sclo-sclo-testing qemu-kvm.spec
RUN cd ~/rpmbuild/SPECS && yum-builddep -y --enablerepo='*' --disablerepo=centos-sclo-rh --disablerepo=centos-sclo-rh-source --disablerepo=centos-sclo-sclo-testing fio.spec
ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root
@@ -38,7 +37,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.6.5.el7.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-0.5.10.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor
Version: 0.6.5
Version: 0.5.10
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.6.5.el7.tar.gz
Source0: vitastor-0.5.10.el7.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
@@ -14,7 +14,6 @@ BuildRequires: rh-nodejs12
BuildRequires: rh-nodejs12-npm
BuildRequires: jerasure-devel
BuildRequires: gf-complete-devel
BuildRequires: libibverbs-devel
BuildRequires: cmake
Requires: fio = 3.7-1.el7
Requires: qemu-kvm = 2.0.0-1.el7.6
@@ -62,9 +61,8 @@ cp -r mon %buildroot/usr/lib/vitastor/mon
%_libdir/libfio_vitastor.so
%_libdir/libfio_vitastor_blk.so
%_libdir/libfio_vitastor_sec.so
%_libdir/libvitastor_blk.so*
%_libdir/libvitastor_client.so*
%_includedir/vitastor_c.h
%_libdir/libvitastor_blk.so
%_libdir/libvitastor_client.so
/usr/lib/vitastor

View File

@@ -15,7 +15,6 @@ RUN rpm --nomd5 -i qemu*.src.rpm
RUN rpm --nomd5 -i fio*.src.rpm
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=powertools --spec qemu-kvm.spec
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=powertools --spec fio.spec && dnf install -y cmake
RUN yum -y install libibverbs-devel libarchive
ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root
@@ -36,7 +35,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.6.5.el8.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-0.5.10.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \

View File

@@ -1,11 +1,11 @@
Name: vitastor
Version: 0.6.5
Version: 0.5.10
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.6.5.el8.tar.gz
Source0: vitastor-0.5.10.el8.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
@@ -13,7 +13,6 @@ BuildRequires: gcc-toolset-9-gcc-c++
BuildRequires: nodejs >= 10
BuildRequires: jerasure-devel
BuildRequires: gf-complete-devel
BuildRequires: libibverbs-devel
BuildRequires: cmake
Requires: fio = 3.7-3.el8
Requires: qemu-kvm = 4.2.0-29.el8.6
@@ -59,9 +58,8 @@ cp -r mon %buildroot/usr/lib/vitastor
%_libdir/libfio_vitastor.so
%_libdir/libfio_vitastor_blk.so
%_libdir/libfio_vitastor_sec.so
%_libdir/libvitastor_blk.so*
%_libdir/libvitastor_client.so*
%_includedir/vitastor_c.h
%_libdir/libvitastor_blk.so
%_libdir/libvitastor_client.so
/usr/lib/vitastor

View File

@@ -4,8 +4,6 @@ project(vitastor)
include(GNUInstallDirs)
set(WITH_QEMU true CACHE BOOL "Build QEMU driver")
set(WITH_FIO true CACHE BOOL "Build FIO driver")
set(QEMU_PLUGINDIR qemu CACHE STRING "QEMU plugin directory suffix (qemu-kvm on RHEL)")
set(WITH_ASAN false CACHE BOOL "Build with AddressSanitizer")
if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
@@ -15,8 +13,8 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
endif()
add_definitions(-DVERSION="0.6.5")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -I ${CMAKE_SOURCE_DIR}/src)
add_definitions(-DVERSION="0.6-dev")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith)
if (${WITH_ASAN})
add_definitions(-fsanitize=address -fno-omit-frame-pointer)
add_link_options(-fsanitize=address -fno-omit-frame-pointer)
@@ -38,19 +36,12 @@ string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_C_FLAGS_RELWITHDEBINFO "${CM
find_package(PkgConfig)
pkg_check_modules(LIBURING REQUIRED liburing)
if (${WITH_QEMU})
pkg_check_modules(GLIB REQUIRED glib-2.0)
endif (${WITH_QEMU})
pkg_check_modules(IBVERBS libibverbs)
if (IBVERBS_LIBRARIES)
add_definitions(-DWITH_RDMA)
endif (IBVERBS_LIBRARIES)
pkg_check_modules(GLIB REQUIRED glib-2.0)
include_directories(
../
/usr/include/jerasure
${LIBURING_INCLUDE_DIRS}
${IBVERBS_INCLUDE_DIRS}
)
# libvitastor_blk.so
@@ -61,81 +52,55 @@ add_library(vitastor_blk SHARED
target_link_libraries(vitastor_blk
${LIBURING_LIBRARIES}
tcmalloc_minimal
# for timerfd_manager
vitastor_common
)
set_target_properties(vitastor_blk PROPERTIES VERSION ${VERSION} SOVERSION 0)
if (${WITH_FIO})
# libfio_vitastor_blk.so
add_library(fio_vitastor_blk SHARED
fio_engine.cpp
../json11/json11.cpp
)
target_link_libraries(fio_vitastor_blk
vitastor_blk
)
endif (${WITH_FIO})
# libvitastor_common.a
set(MSGR_RDMA "")
if (IBVERBS_LIBRARIES)
set(MSGR_RDMA "msgr_rdma.cpp")
endif (IBVERBS_LIBRARIES)
add_library(vitastor_common STATIC
epoll_manager.cpp etcd_state_client.cpp
messenger.cpp msgr_stop.cpp msgr_op.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp
http_client.cpp osd_ops.cpp pg_states.cpp timerfd_manager.cpp base64.cpp ${MSGR_RDMA}
# libfio_vitastor_blk.so
add_library(fio_vitastor_blk SHARED
fio_engine.cpp
../json11/json11.cpp
)
target_link_libraries(fio_vitastor_blk
vitastor_blk
)
target_compile_options(vitastor_common PUBLIC -fPIC)
# vitastor-osd
add_executable(vitastor-osd
osd_main.cpp osd.cpp osd_secondary.cpp osd_peering.cpp osd_flush.cpp osd_peering_pg.cpp
osd_primary.cpp osd_primary_chain.cpp osd_primary_sync.cpp osd_primary_write.cpp osd_primary_subops.cpp
osd_cluster.cpp osd_rmw.cpp
osd_main.cpp osd.cpp osd_secondary.cpp msgr_receive.cpp msgr_send.cpp osd_peering.cpp osd_flush.cpp osd_peering_pg.cpp
osd_primary.cpp osd_primary_subops.cpp etcd_state_client.cpp messenger.cpp osd_cluster.cpp http_client.cpp osd_ops.cpp pg_states.cpp
osd_rmw.cpp base64.cpp timerfd_manager.cpp epoll_manager.cpp ../json11/json11.cpp
)
target_link_libraries(vitastor-osd
vitastor_common
vitastor_blk
Jerasure
${IBVERBS_LIBRARIES}
)
if (${WITH_FIO})
# libfio_vitastor_sec.so
add_library(fio_vitastor_sec SHARED
fio_sec_osd.cpp
rw_blocking.cpp
)
target_link_libraries(fio_vitastor_sec
tcmalloc_minimal
)
endif (${WITH_FIO})
# libfio_vitastor_sec.so
add_library(fio_vitastor_sec SHARED
fio_sec_osd.cpp
rw_blocking.cpp
)
target_link_libraries(fio_vitastor_sec
tcmalloc_minimal
)
# libvitastor_client.so
add_library(vitastor_client SHARED
cluster_client.cpp
vitastor_c.cpp
cluster_client.cpp epoll_manager.cpp etcd_state_client.cpp
messenger.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp
http_client.cpp osd_ops.cpp pg_states.cpp timerfd_manager.cpp base64.cpp
)
set_target_properties(vitastor_client PROPERTIES PUBLIC_HEADER "vitastor_c.h")
target_link_libraries(vitastor_client
vitastor_common
tcmalloc_minimal
${LIBURING_LIBRARIES}
${IBVERBS_LIBRARIES}
)
set_target_properties(vitastor_client PROPERTIES VERSION ${VERSION} SOVERSION 0)
if (${WITH_FIO})
# libfio_vitastor.so
add_library(fio_vitastor SHARED
fio_cluster.cpp
)
target_link_libraries(fio_vitastor
vitastor_client
)
endif (${WITH_FIO})
# libfio_vitastor.so
add_library(fio_vitastor SHARED
fio_cluster.cpp
)
target_link_libraries(fio_vitastor
vitastor_client
)
# vitastor-nbd
add_executable(vitastor-nbd
@@ -158,24 +123,27 @@ add_executable(vitastor-dump-journal
dump_journal.cpp crc32c.c
)
if (${WITH_QEMU})
# qemu_driver.so
add_library(qemu_vitastor SHARED
qemu_driver.c
)
target_include_directories(qemu_vitastor PUBLIC
../qemu/b/qemu
../qemu/include
${GLIB_INCLUDE_DIRS}
)
target_link_libraries(qemu_vitastor
vitastor_client
)
set_target_properties(qemu_vitastor PROPERTIES
PREFIX ""
OUTPUT_NAME "block-vitastor"
)
endif (${WITH_QEMU})
# qemu_driver.so
add_library(qemu_proxy STATIC qemu_proxy.cpp)
target_compile_options(qemu_proxy PUBLIC -fPIC)
target_include_directories(qemu_proxy PUBLIC
../qemu/b/qemu
../qemu/include
${GLIB_INCLUDE_DIRS}
)
target_link_libraries(qemu_proxy
vitastor_client
)
add_library(qemu_vitastor SHARED
qemu_driver.c
)
target_link_libraries(qemu_vitastor
qemu_proxy
)
set_target_properties(qemu_vitastor PROPERTIES
PREFIX ""
OUTPUT_NAME "block-vitastor"
)
### Test stubs
@@ -193,12 +161,10 @@ target_link_libraries(osd_rmw_test Jerasure tcmalloc_minimal)
# stub_uring_osd
add_executable(stub_uring_osd
stub_uring_osd.cpp
stub_uring_osd.cpp epoll_manager.cpp messenger.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp timerfd_manager.cpp ../json11/json11.cpp
)
target_link_libraries(stub_uring_osd
vitastor_common
${LIBURING_LIBRARIES}
${IBVERBS_LIBRARIES}
tcmalloc_minimal
)
@@ -209,25 +175,8 @@ target_link_libraries(osd_peering_pg_test tcmalloc_minimal)
# test_allocator
add_executable(test_allocator test_allocator.cpp allocator.cpp)
# test_cas
add_executable(test_cas
test_cas.cpp
)
target_link_libraries(test_cas
vitastor_client
)
# test_cluster_client
add_executable(test_cluster_client
test_cluster_client.cpp
pg_states.cpp osd_ops.cpp cluster_client.cpp msgr_op.cpp mock/messenger.cpp msgr_stop.cpp
etcd_state_client.cpp timerfd_manager.cpp ../json11/json11.cpp
)
target_compile_definitions(test_cluster_client PUBLIC -D__MOCK__)
target_include_directories(test_cluster_client PUBLIC ${CMAKE_SOURCE_DIR}/src/mock)
## test_blockstore, test_shit
#add_executable(test_blockstore test_blockstore.cpp)
#add_executable(test_blockstore test_blockstore.cpp timerfd_interval.cpp)
#target_link_libraries(test_blockstore blockstore)
#add_executable(test_shit test_shit.cpp osd_peering_pg.cpp)
#target_link_libraries(test_shit ${LIBURING_LIBRARIES} m)
@@ -235,14 +184,5 @@ target_include_directories(test_cluster_client PUBLIC ${CMAKE_SOURCE_DIR}/src/mo
### Install
install(TARGETS vitastor-osd vitastor-dump-journal vitastor-nbd vitastor-rm RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
install(
TARGETS vitastor_blk vitastor_client
LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
PUBLIC_HEADER DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
)
if (${WITH_FIO})
install(TARGETS fio_vitastor fio_vitastor_blk fio_vitastor_sec LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR})
endif (${WITH_FIO})
if (${WITH_QEMU})
install(TARGETS qemu_vitastor LIBRARY DESTINATION /usr/${CMAKE_INSTALL_LIBDIR}/${QEMU_PLUGINDIR})
endif (${WITH_QEMU})
install(TARGETS fio_vitastor fio_vitastor_blk fio_vitastor_sec vitastor_blk vitastor_client LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR})
install(TARGETS qemu_vitastor LIBRARY DESTINATION /usr/${CMAKE_INSTALL_LIBDIR}/${QEMU_PLUGINDIR})

View File

@@ -37,21 +37,6 @@ allocator::~allocator()
delete[] mask;
}
bool allocator::get(uint64_t addr)
{
if (addr >= size)
{
return false;
}
uint64_t p2 = 1, offset = 0;
while (p2 * 64 < size)
{
offset += p2;
p2 = p2 * 64;
}
return ((mask[offset + addr/64] >> (addr % 64)) & 1);
}
void allocator::set(uint64_t addr, bool value)
{
if (addr >= size)

View File

@@ -16,7 +16,6 @@ class allocator
public:
allocator(uint64_t blocks);
~allocator();
bool get(uint64_t addr);
void set(uint64_t addr, bool value);
uint64_t find_free();
uint64_t get_free_count();
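
The removed get() walks the allocator's implicit 64-ary tree: bitmap levels are stored consecutively, so the leaf level starts after the sum of all smaller power-of-64 levels. A Python sketch of the same offset computation (mask as a list of 64-bit words; names taken from the header above):

    def leaf_bit(mask, size, addr):
        p2, offset = 1, 0
        while p2 * 64 < size:       # offset = 1 + 64 + 64^2 + ... below the leaf level
            offset += p2
            p2 *= 64
        word = mask[offset + addr // 64]   # one 64-bit word covers 64 blocks
        return (word >> (addr % 64)) & 1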

View File

@@ -3,9 +3,9 @@
#include "blockstore_impl.h"
blockstore_t::blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd)
blockstore_t::blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop)
{
impl = new blockstore_impl_t(config, ringloop, tfd);
impl = new blockstore_impl_t(config, ringloop);
}
blockstore_t::~blockstore_t()
@@ -43,6 +43,11 @@ int blockstore_t::read_bitmap(object_id oid, uint64_t target_version, void *bitm
return impl->read_bitmap(oid, target_version, bitmap, result_version);
}
std::unordered_map<object_id, uint64_t> & blockstore_t::get_unstable_writes()
{
return impl->unstable_writes;
}
std::map<uint64_t, uint64_t> & blockstore_t::get_inode_space_stats()
{
return impl->inode_space_stats;

View File

@@ -16,7 +16,6 @@
#include "object_id.h"
#include "ringloop.h"
#include "timerfd_manager.h"
// Memory alignment for direct I/O (usually 512 bytes)
// All other alignments must be a multiple of this one
@@ -159,7 +158,7 @@ class blockstore_t
{
blockstore_impl_t *impl;
public:
blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd);
blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop);
~blockstore_t();
// Event loop
@@ -183,6 +182,9 @@ public:
// Simplified synchronous operation: get object bitmap & current version
int read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version = NULL);
// Unstable writes are added here (map of object_id -> version)
std::unordered_map<object_id, uint64_t> & get_unstable_writes();
// Get per-inode space usage statistics
std::map<uint64_t, uint64_t> & get_inode_space_stats();

View File

@@ -3,13 +3,12 @@
#include "blockstore_impl.h"
journal_flusher_t::journal_flusher_t(blockstore_impl_t *bs)
journal_flusher_t::journal_flusher_t(int flusher_count, blockstore_impl_t *bs)
{
this->bs = bs;
this->max_flusher_count = bs->max_flusher_count;
this->min_flusher_count = bs->min_flusher_count;
this->cur_flusher_count = bs->min_flusher_count;
this->target_flusher_count = bs->min_flusher_count;
this->flusher_count = flusher_count;
this->cur_flusher_count = 1;
this->target_flusher_count = 1;
dequeuing = false;
trimming = false;
active_flushers = 0;
@@ -17,11 +16,11 @@ journal_flusher_t::journal_flusher_t(blockstore_impl_t *bs)
// FIXME: allow to configure flusher_start_threshold and journal_trim_interval
flusher_start_threshold = bs->journal_block_size / sizeof(journal_entry_stable);
journal_trim_interval = 512;
journal_trim_counter = bs->journal.flush_journal ? 1 : 0;
trim_wanted = bs->journal.flush_journal ? 1 : 0;
journal_trim_counter = 0;
trim_wanted = 0;
journal_superblock = bs->journal.inmemory ? bs->journal.buffer : memalign_or_die(MEM_ALIGNMENT, bs->journal_block_size);
co = new journal_flusher_co[max_flusher_count];
for (int i = 0; i < max_flusher_count; i++)
co = new journal_flusher_co[flusher_count];
for (int i = 0; i < flusher_count; i++)
{
co[i].bs = bs;
co[i].flusher = this;
@@ -72,10 +71,10 @@ bool journal_flusher_t::is_active()
void journal_flusher_t::loop()
{
target_flusher_count = bs->write_iodepth*2;
if (target_flusher_count < min_flusher_count)
target_flusher_count = min_flusher_count;
else if (target_flusher_count > max_flusher_count)
target_flusher_count = max_flusher_count;
if (target_flusher_count <= 0)
target_flusher_count = 1;
else if (target_flusher_count > flusher_count)
target_flusher_count = flusher_count;
if (target_flusher_count > cur_flusher_count)
cur_flusher_count = target_flusher_count;
else if (target_flusher_count < cur_flusher_count)
@@ -238,8 +237,7 @@ bool journal_flusher_co::loop()
else if (wait_state == 21)
goto resume_21;
resume_0:
if (flusher->flush_queue.size() < flusher->min_flusher_count && !flusher->trim_wanted ||
!flusher->flush_queue.size() || !flusher->dequeuing)
if (!flusher->flush_queue.size() || !flusher->dequeuing)
{
stop_flusher:
if (flusher->trim_wanted > 0 && flusher->journal_trim_counter > 0)
@@ -485,13 +483,6 @@ resume_1:
}
if (has_delete)
{
clean_disk_entry *new_entry = (clean_disk_entry*)(meta_new.buf + meta_new.pos*bs->clean_entry_size);
if (new_entry->oid.inode != 0 && new_entry->oid != cur.oid)
{
printf("Fatal error (metadata corruption or bug): tried to delete metadata entry %lu (%lx:%lx) while deleting %lx:%lx\n",
clean_loc >> bs->block_order, new_entry->oid.inode, new_entry->oid.stripe, cur.oid.inode, cur.oid.stripe);
exit(1);
}
// zero out new metadata entry
memset(meta_new.buf + meta_new.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
}
@@ -602,7 +593,6 @@ resume_1:
.size = sizeof(journal_entry_start),
.reserved = 0,
.journal_start = new_trim_pos,
.version = JOURNAL_VERSION,
};
((journal_entry_start*)flusher->journal_superblock)->crc32 = je_crc32((journal_entry*)flusher->journal_superblock);
data->iov = (struct iovec){ flusher->journal_superblock, bs->journal_block_size };
@@ -634,12 +624,6 @@ resume_1:
#endif
flusher->trimming = false;
}
if (bs->journal.flush_journal && !flusher->flush_queue.size())
{
assert(bs->journal.used_start == bs->journal.next_free);
printf("Journal flushed\n");
exit(0);
}
}
// All done
flusher->active_flushers--;
@@ -670,7 +654,7 @@ bool journal_flusher_co::scan_dirty(int wait_base)
{
char err[1024];
snprintf(
err, 1024, "BUG: Unexpected dirty_entry %lx:%lx v%lu unstable state during flush: 0x%x",
err, 1024, "BUG: Unexpected dirty_entry %lx:%lx v%lu unstable state during flush: %d",
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version, dirty_it->second.state
);
throw std::runtime_error(err);
@@ -799,10 +783,7 @@ void journal_flusher_co::update_clean_db()
if (old_clean_loc != UINT64_MAX && old_clean_loc != clean_loc)
{
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu from %lx:%lx v%lu (new location is %lu)\n",
old_clean_loc >> bs->block_order,
cur.oid.inode, cur.oid.stripe, cur.version,
clean_loc >> bs->block_order);
printf("Free block %lu (new location is %lu)\n", old_clean_loc >> bs->block_order, clean_loc >> bs->block_order);
#endif
bs->data_alloc->set(old_clean_loc >> bs->block_order, false);
}
@@ -810,11 +791,6 @@ void journal_flusher_co::update_clean_db()
{
auto clean_it = bs->clean_db.find(cur.oid);
bs->clean_db.erase(clean_it);
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu from %lx:%lx v%lu (delete)\n",
clean_loc >> bs->block_order,
cur.oid.inode, cur.oid.stripe, cur.version);
#endif
bs->data_alloc->set(clean_loc >> bs->block_order, false);
clean_loc = UINT64_MAX;
}
@@ -836,7 +812,7 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
goto resume_1;
else if (wait_state == wait_base+2)
goto resume_2;
if (!(fsync_meta ? bs->disable_meta_fsync : bs->disable_data_fsync))
if (!(fsync_meta ? bs->disable_meta_fsync : bs->disable_journal_fsync))
{
cur_sync = flusher->syncs.end();
while (cur_sync != flusher->syncs.begin())

View File

@@ -79,7 +79,7 @@ class journal_flusher_t
{
int trim_wanted = 0;
bool dequeuing;
int min_flusher_count, max_flusher_count, cur_flusher_count, target_flusher_count;
int flusher_count, cur_flusher_count, target_flusher_count;
int flusher_start_threshold;
journal_flusher_co *co;
blockstore_impl_t *bs;
@@ -98,7 +98,7 @@ class journal_flusher_t
std::deque<object_id> flush_queue;
std::map<object_id, uint64_t> flush_versions;
public:
journal_flusher_t(blockstore_impl_t *bs);
journal_flusher_t(int flusher_count, blockstore_impl_t *bs);
~journal_flusher_t();
void loop();
bool is_active();

View File

@@ -3,10 +3,9 @@
#include "blockstore_impl.h"
blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd)
blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop)
{
assert(sizeof(blockstore_op_private_t) <= BS_OP_PRIVATE_DATA_SIZE);
this->tfd = tfd;
this->ringloop = ringloop;
ring_consumer.loop = [this]() { loop(); };
ringloop->register_consumer(&ring_consumer);
@@ -32,7 +31,7 @@ blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *
close(journal.fd);
throw;
}
flusher = new journal_flusher_t(this);
flusher = new journal_flusher_t(flusher_count, this);
}
blockstore_impl_t::~blockstore_impl_t()
@@ -93,23 +92,10 @@ void blockstore_impl_t::loop()
{
delete journal_init_reader;
journal_init_reader = NULL;
if (journal.flush_journal)
initialized = 3;
else
initialized = 10;
initialized = 10;
ringloop->wakeup();
}
}
if (initialized == 3)
{
if (readonly)
{
printf("Can't flush the journal in readonly mode\n");
exit(1);
}
flusher->loop();
ringloop->submit();
}
}
else
{

View File

@@ -9,7 +9,6 @@
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>
#include <linux/fs.h>
@@ -78,23 +77,6 @@
#include "blockstore_journal.h"
// "VITAstor"
#define BLOCKSTORE_META_MAGIC 0x726F747341544956l
#define BLOCKSTORE_META_VERSION 1
// metadata header (superblock)
// FIXME: After adding the OSD superblock, add a key to metadata
// and journal headers to check if they belong to the same OSD
struct __attribute__((__packed__)) blockstore_meta_header_t
{
uint64_t zero;
uint64_t magic;
uint64_t version;
uint32_t meta_block_size;
uint32_t data_block_size;
uint32_t bitmap_granularity;
};
// 32 bytes = 24 bytes + block bitmap (4 bytes by default) + external attributes (also bitmap, 4 bytes by default)
// per "clean" entry on disk with fixed metadata tables
// FIXME: maybe add crc32's to metadata
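The packed superblock above is 36 bytes with no padding: three uint64 fields followed by three uint32 fields. A Python sketch of producing one, assuming a little-endian host and typical values (4 KB metadata blocks, 128 KB data blocks, 4 KB bitmap granularity; only the magic and version come from the code above):

    import struct

    BLOCKSTORE_META_MAGIC = 0x726F747341544956   # "VITAstor"
    hdr = struct.pack('<QQQIII',
                      0,                      # zero
                      BLOCKSTORE_META_MAGIC,  # magic
                      1,                      # BLOCKSTORE_META_VERSION
                      4096,                   # meta_block_size
                      131072,                 # data_block_size
                      4096)                   # bitmap_granularity
    assert len(hdr) == 36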
@@ -176,7 +158,6 @@ struct blockstore_op_private_t
struct iovec iov_zerofill[3];
// Warning: must not have a default value here because it's written to before calling constructor in blockstore_write.cpp O_o
uint64_t real_version;
timespec tv_begin;
// Sync
std::vector<obj_ver_id> sync_big_writes, sync_small_writes;
@@ -218,18 +199,10 @@ class blockstore_impl_t
// Suitable only for server SSDs with capacitors, requires disabled data and journal fsyncs
int immediate_commit = IMMEDIATE_NONE;
bool inmemory_meta = false;
// Maximum and minimum flusher count
unsigned max_flusher_count, min_flusher_count;
// Maximum flusher count
unsigned flusher_count;
// Maximum queue depth
unsigned max_write_iodepth = 128;
// Enable small (journaled) write throttling, useful for the SSD+HDD case
bool throttle_small_writes = false;
// Target data device iops, bandwidth and parallelism for throttling (100/100/1 is the default for HDD)
int throttle_target_iops = 100;
int throttle_target_mbs = 100;
int throttle_target_parallelism = 1;
// Minimum difference in microseconds between target and real execution times to throttle the response
int throttle_threshold_us = 50;
/******* END OF OPTIONS *******/
struct ring_consumer_t ring_consumer;
@@ -239,7 +212,6 @@ class blockstore_impl_t
blockstore_dirty_db_t dirty_db;
std::vector<blockstore_op_t*> submit_queue;
std::vector<obj_ver_id> unsynced_big_writes, unsynced_small_writes;
int unsynced_big_write_count = 0;
allocator *data_alloc = NULL;
uint8_t *zero_object;
@@ -260,7 +232,6 @@ class blockstore_impl_t
bool live = false, queue_stall = false;
ring_loop_t *ringloop;
timerfd_manager_t *tfd;
bool stop_sync_submitted;
@@ -315,7 +286,7 @@ class blockstore_impl_t
// Stabilize
int dequeue_stable(blockstore_op_t *op);
int continue_stable(blockstore_op_t *op);
void mark_stable(const obj_ver_id & ov, bool forget_dirty = false);
void mark_stable(const obj_ver_id & ov);
void handle_stable_event(ring_data_t *data, blockstore_op_t *op);
void stabilize_object(object_id oid, uint64_t max_ver);
@@ -331,7 +302,7 @@ class blockstore_impl_t
public:
blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd);
blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop);
~blockstore_impl_t();
// Event loop

View File

@@ -3,20 +3,6 @@
#include "blockstore_impl.h"
#define GET_SQE() \
sqe = bs->get_sqe();\
if (!sqe)\
throw std::runtime_error("io_uring is full during initialization");\
data = ((ring_data_t*)sqe->user_data)
static bool iszero(uint64_t *buf, int len)
{
for (int i = 0; i < len; i++)
if (buf[i] != 0)
return false;
return true;
}
blockstore_init_meta::blockstore_init_meta(blockstore_impl_t *bs)
{
this->bs = bs;
@@ -24,7 +10,7 @@ blockstore_init_meta::blockstore_init_meta(blockstore_impl_t *bs)
void blockstore_init_meta::handle_event(ring_data_t *data)
{
if (data->res < 0)
if (data->res <= 0)
{
throw std::runtime_error(
std::string("read metadata failed at offset ") + std::to_string(metadata_read) +
@@ -42,12 +28,6 @@ int blockstore_init_meta::loop()
{
if (wait_state == 1)
goto resume_1;
else if (wait_state == 2)
goto resume_2;
else if (wait_state == 3)
goto resume_3;
else if (wait_state == 4)
goto resume_4;
printf("Reading blockstore metadata\n");
if (bs->inmemory_meta)
metadata_buffer = bs->metadata_buffer;
@@ -55,98 +35,22 @@ int blockstore_init_meta::loop()
metadata_buffer = memalign(MEM_ALIGNMENT, 2*bs->metadata_buf_size);
if (!metadata_buffer)
throw std::runtime_error("Failed to allocate metadata read buffer");
// Read superblock
GET_SQE();
data->iov = { metadata_buffer, bs->meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data); };
my_uring_prep_readv(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset);
bs->ringloop->submit();
submitted = 1;
resume_1:
if (submitted)
{
wait_state = 1;
return 1;
}
if (iszero((uint64_t*)metadata_buffer, bs->meta_block_size / sizeof(uint64_t)))
{
{
blockstore_meta_header_t *hdr = (blockstore_meta_header_t *)metadata_buffer;
hdr->zero = 0;
hdr->magic = BLOCKSTORE_META_MAGIC;
hdr->version = BLOCKSTORE_META_VERSION;
hdr->meta_block_size = bs->meta_block_size;
hdr->data_block_size = bs->block_size;
hdr->bitmap_granularity = bs->bitmap_granularity;
}
if (bs->readonly)
{
printf("Skipping metadata initialization because blockstore is readonly\n");
}
else
{
printf("Initializing metadata area\n");
GET_SQE();
data->iov = (struct iovec){ metadata_buffer, bs->meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data); };
my_uring_prep_writev(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset);
bs->ringloop->submit();
submitted = 1;
resume_3:
if (submitted > 0)
{
wait_state = 3;
return 1;
}
zero_on_init = true;
}
}
else
{
blockstore_meta_header_t *hdr = (blockstore_meta_header_t *)metadata_buffer;
if (hdr->zero != 0 ||
hdr->magic != BLOCKSTORE_META_MAGIC ||
hdr->version != BLOCKSTORE_META_VERSION)
{
printf(
"Metadata is corrupt or old version.\n"
" If this is a new OSD please zero out the metadata area before starting it.\n"
" If you need to upgrade from 0.5.x please request it via the issue tracker.\n"
);
exit(1);
}
if (hdr->meta_block_size != bs->meta_block_size ||
hdr->data_block_size != bs->block_size ||
hdr->bitmap_granularity != bs->bitmap_granularity)
{
printf(
"Configuration stored in metadata superblock"
" (meta_block_size=%u, data_block_size=%u, bitmap_granularity=%u)"
" differs from OSD configuration (%lu/%u/%lu).\n",
hdr->meta_block_size, hdr->data_block_size, hdr->bitmap_granularity,
bs->meta_block_size, bs->block_size, bs->bitmap_granularity
);
exit(1);
}
}
// Skip superblock
bs->meta_offset += bs->meta_block_size;
prev_done = 0;
done_len = 0;
done_pos = 0;
metadata_read = 0;
// Read the rest of the metadata
while (1)
{
resume_2:
resume_1:
if (submitted)
{
wait_state = 2;
wait_state = 1;
return 1;
}
if (metadata_read < bs->meta_len)
{
GET_SQE();
sqe = bs->get_sqe();
if (!sqe)
{
throw std::runtime_error("io_uring is full while trying to read metadata");
}
data = ((ring_data_t*)sqe->user_data);
data->iov = {
metadata_buffer + (bs->inmemory_meta
? metadata_read
@@ -154,14 +58,7 @@ resume_1:
bs->meta_len - metadata_read > bs->metadata_buf_size ? bs->metadata_buf_size : bs->meta_len - metadata_read,
};
data->callback = [this](ring_data_t *data) { handle_event(data); };
if (!zero_on_init)
my_uring_prep_readv(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset + metadata_read);
else
{
// Fill metadata with zeroes
memset(data->iov.iov_base, 0, data->iov.iov_len);
my_uring_prep_writev(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset + metadata_read);
}
my_uring_prep_readv(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset + metadata_read);
bs->ringloop->submit();
submitted = (prev == 1 ? 2 : 1);
prev = submitted;
@@ -193,21 +90,6 @@ resume_1:
free(metadata_buffer);
metadata_buffer = NULL;
}
if (zero_on_init && !bs->disable_meta_fsync)
{
GET_SQE();
my_uring_prep_fsync(sqe, bs->meta_fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
data->callback = [this](ring_data_t *data) { handle_event(data); };
submitted = 1;
bs->ringloop->submit();
resume_4:
if (submitted > 0)
{
wait_state = 4;
return 1;
}
}
return 0;
}
@@ -229,10 +111,7 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
{
// free the previous block
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu from %lx:%lx v%lu (new location is %lu)\n",
clean_it->second.location >> block_order,
clean_it->first.inode, clean_it->first.stripe, clean_it->second.version,
done_cnt+i);
printf("Free block %lu (new location is %lu)\n", clean_it->second.location >> block_order, done_cnt+i);
#endif
bs->data_alloc->set(clean_it->second.location >> block_order, false);
}
@@ -274,6 +153,14 @@ blockstore_init_journal::blockstore_init_journal(blockstore_impl_t *bs)
};
}
bool iszero(uint64_t *buf, int len)
{
for (int i = 0; i < len; i++)
if (buf[i] != 0)
return false;
return true;
}
void blockstore_init_journal::handle_event(ring_data_t *data1)
{
if (data1->res <= 0)
@@ -298,6 +185,12 @@ void blockstore_init_journal::handle_event(ring_data_t *data1)
submitted_buf = NULL;
}
#define GET_SQE() \
sqe = bs->get_sqe();\
if (!sqe)\
throw std::runtime_error("io_uring is full while trying to read journal");\
data = ((ring_data_t*)sqe->user_data)
int blockstore_init_journal::loop()
{
if (wait_state == 1)
@@ -335,7 +228,7 @@ resume_1:
wait_state = 1;
return 1;
}
if (iszero((uint64_t*)submitted_buf, bs->journal.block_size / sizeof(uint64_t)))
if (iszero((uint64_t*)submitted_buf, 3))
{
// Journal is empty
// FIXME handle this wrapping to journal_block_size better (maybe)
@@ -350,7 +243,6 @@ resume_1:
.size = sizeof(journal_entry_start),
.reserved = 0,
.journal_start = bs->journal.block_size,
.version = JOURNAL_VERSION,
};
((journal_entry_start*)submitted_buf)->crc32 = je_crc32((journal_entry*)submitted_buf);
if (bs->readonly)
@@ -401,21 +293,11 @@ resume_1:
je_start = (journal_entry_start*)submitted_buf;
if (je_start->magic != JOURNAL_MAGIC ||
je_start->type != JE_START ||
je_crc32((journal_entry*)je_start) != je_start->crc32 ||
je_start->size != sizeof(journal_entry_start) && je_start->size != JE_START_LEGACY_SIZE)
je_start->size != sizeof(journal_entry_start) ||
je_crc32((journal_entry*)je_start) != je_start->crc32)
{
// Entry is corrupt
fprintf(stderr, "First entry of the journal is corrupt\n");
exit(1);
}
if (je_start->size == JE_START_LEGACY_SIZE || je_start->version != JOURNAL_VERSION)
{
fprintf(
stderr, "The code only supports journal version %d, but it is %lu on disk."
" Please use the previous version to flush the journal before upgrading OSD\n",
JOURNAL_VERSION, je_start->size == JE_START_LEGACY_SIZE ? 0 : je_start->version
);
exit(1);
throw std::runtime_error("first entry of the journal is corrupt");
}
next_free = journal_pos = bs->journal.used_start = je_start->journal_start;
if (!bs->journal.inmemory)
@@ -521,18 +403,6 @@ resume_1:
}
}
}
for (auto ov: double_allocs)
{
auto dirty_it = bs->dirty_db.find(ov);
if (dirty_it != bs->dirty_db.end() &&
IS_BIG_WRITE(dirty_it->second.state) &&
dirty_it->second.location == UINT64_MAX)
{
printf("Fatal error (bug): %lx:%lx v%lu big_write journal_entry was allocated over another object\n",
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
exit(1);
}
}
bs->flusher->mark_trim_possible();
bs->journal.dirty_start = bs->journal.next_free;
printf(
@@ -664,20 +534,20 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.oid = je->small_write.oid,
.version = je->small_write.version,
};
void *bmp = NULL;
void *bmp_from = (void*)je + sizeof(journal_entry_small_write);
void *bmp = (void*)je + sizeof(journal_entry_small_write);
if (bs->clean_entry_bitmap_size <= sizeof(void*))
{
memcpy(&bmp, bmp_from, bs->clean_entry_bitmap_size);
memcpy(&bmp, bmp, bs->clean_entry_bitmap_size);
}
else
else if (!bs->journal.inmemory)
{
// FIXME Using large blockstore objects will result in a lot of small
// allocations for entry bitmaps. This can only be fixed by using
// a patched map with dynamic entry size, but not the btree_map,
// because it doesn't keep iterators valid all the time.
bmp = malloc_or_die(bs->clean_entry_bitmap_size);
memcpy(bmp, bmp_from, bs->clean_entry_bitmap_size);
// FIXME Using large blockstore objects and not keeping journal in memory
// will result in a lot of small allocations for entry bitmaps. This can
// only be fixed by using a patched map with dynamic entry size, but not
// the btree_map, because it doesn't keep iterators valid all the time.
void *bmp_cp = malloc_or_die(bs->clean_entry_bitmap_size);
memcpy(bmp_cp, bmp, bs->clean_entry_bitmap_size);
bmp = bmp_cp;
}
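// The branch above "inlines" the bitmap into the pointer variable itself whenever
// it fits — a small-buffer optimization. The same idea in isolation (a sketch,
// names hypothetical; assumes <cstring> and malloc_or_die as used above):
//
//   void *slot = NULL;
//   if (n <= sizeof(void*))
//       memcpy(&slot, src, n);        // bytes live inside the pointer value itself
//   else
//   {
//       slot = malloc_or_die(n);      // too large: keep a separate heap copy
//       memcpy(slot, src, n);
//   }
//   // readers pick the address accordingly, as the write path does later:
//   void *bits = n > sizeof(void*) ? slot : (void*)&slot;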
bs->dirty_db.emplace(ov, (dirty_entry){
.state = (BS_ST_SMALL_WRITE | BS_ST_SYNCED),
@@ -699,7 +569,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
unstab = unstab < ov.version ? ov.version : unstab;
if (je->type == JE_SMALL_WRITE_INSTANT)
{
bs->mark_stable(ov, true);
bs->mark_stable(ov);
}
}
}
@@ -729,10 +599,32 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
// its data and metadata are already flushed.
// We don't know if newer versions are flushed, but
// the previous delete definitely is.
// So we forget previous dirty entries, but retain the clean one.
// So we flush previous dirty entries, but retain the clean one.
// This feature is required for writes happening shortly
// after deletes.
erase_dirty_object(dirty_it);
auto dirty_end = dirty_it;
dirty_end++;
while (1)
{
if (dirty_it == bs->dirty_db.begin())
{
break;
}
dirty_it--;
if (dirty_it->first.oid != je->big_write.oid)
{
dirty_it++;
break;
}
}
auto clean_it = bs->clean_db.find(je->big_write.oid);
bs->erase_dirty(
dirty_it, dirty_end,
clean_it != bs->clean_db.end() ? clean_it->second.location : UINT64_MAX
);
// Remove it from the flusher's queue, too
// Otherwise it may end up referring to a small unstable write after reading the rest of the journal
bs->flusher->remove_flush(je->big_write.oid);
}
}
auto clean_it = bs->clean_db.find(je->big_write.oid);
@@ -744,22 +636,22 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.oid = je->big_write.oid,
.version = je->big_write.version,
};
void *bmp = NULL;
void *bmp_from = (void*)je + sizeof(journal_entry_big_write);
void *bmp = (void*)je + sizeof(journal_entry_big_write);
if (bs->clean_entry_bitmap_size <= sizeof(void*))
{
memcpy(&bmp, bmp_from, bs->clean_entry_bitmap_size);
memcpy(&bmp, bmp, bs->clean_entry_bitmap_size);
}
else
else if (!bs->journal.inmemory)
{
// FIXME Using large blockstore objects will result in a lot of small
// allocations for entry bitmaps. This can only be fixed by using
// a patched map with dynamic entry size, but not the btree_map,
// because it doesn't keep iterators valid all the time.
bmp = malloc_or_die(bs->clean_entry_bitmap_size);
memcpy(bmp, bmp_from, bs->clean_entry_bitmap_size);
// FIXME Using large blockstore objects and not keeping journal in memory
// will result in a lot of small allocations for entry bitmaps. This can
// only be fixed by using a patched map with dynamic entry size, but not
// the btree_map, because it doesn't keep iterators valid all the time.
void *bmp_cp = malloc_or_die(bs->clean_entry_bitmap_size);
memcpy(bmp_cp, bmp, bs->clean_entry_bitmap_size);
bmp = bmp_cp;
}
auto dirty_it = bs->dirty_db.emplace(ov, (dirty_entry){
bs->dirty_db.emplace(ov, (dirty_entry){
.state = (BS_ST_BIG_WRITE | BS_ST_SYNCED),
.flags = 0,
.location = je->big_write.location,
@@ -767,26 +659,11 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.len = je->big_write.len,
.journal_sector = proc_pos,
.bitmap = bmp,
}).first;
if (bs->data_alloc->get(je->big_write.location >> bs->block_order))
{
// This is probably a big_write that's already flushed and freed, but it may
// also indicate a bug. So we remember such entries and recheck them afterwards.
// If it's not a bug they won't be present after reading the whole journal.
dirty_it->second.location = UINT64_MAX;
double_allocs.push_back(ov);
}
else
{
});
#ifdef BLOCKSTORE_DEBUG
printf(
"Allocate block (journal) %lu: %lx:%lx v%lu\n",
je->big_write.location >> bs->block_order,
ov.oid.inode, ov.oid.stripe, ov.version
);
printf("Allocate block %lu\n", je->big_write.location >> bs->block_order);
#endif
bs->data_alloc->set(je->big_write.location >> bs->block_order, true);
}
bs->data_alloc->set(je->big_write.location >> bs->block_order, true);
bs->journal.used_sectors[proc_pos]++;
#ifdef BLOCKSTORE_DEBUG
printf(
@@ -798,7 +675,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
unstab = unstab < ov.version ? ov.version : unstab;
if (je->type == JE_BIG_WRITE_INSTANT)
{
bs->mark_stable(ov, true);
bs->mark_stable(ov);
}
}
}
@@ -812,7 +689,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.oid = je->stable.oid,
.version = je->stable.version,
};
bs->mark_stable(ov, true);
bs->mark_stable(ov);
}
else if (je->type == JE_ROLLBACK)
{
@@ -831,26 +708,9 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
#ifdef BLOCKSTORE_DEBUG
printf("je_delete oid=%lx:%lx ver=%lu\n", je->del.oid.inode, je->del.oid.stripe, je->del.version);
#endif
bool dirty_exists = false;
auto dirty_it = bs->dirty_db.upper_bound((obj_ver_id){
.oid = je->del.oid,
.version = UINT64_MAX,
});
if (dirty_it != bs->dirty_db.begin())
{
dirty_it--;
dirty_exists = dirty_it->first.oid == je->del.oid;
}
auto clean_it = bs->clean_db.find(je->del.oid);
bool clean_exists = (clean_it != bs->clean_db.end() &&
clean_it->second.version < je->del.version);
if (!clean_exists && dirty_exists)
{
// Clean entry doesn't exist. This means that the delete is already flushed.
// So we must not flush this object anymore.
erase_dirty_object(dirty_it);
}
else if (clean_exists || dirty_exists)
if (clean_it != bs->clean_db.end() &&
clean_it->second.version < je->del.version)
{
// oid, version
obj_ver_id ov = {
@@ -868,9 +728,8 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
bs->journal.used_sectors[proc_pos]++;
// Deletions are treated as immediately stable, because
// "2-phase commit" (write->stabilize) isn't sufficient for them anyway
bs->mark_stable(ov, true);
bs->mark_stable(ov);
}
// Ignore the delete if neither the preceding dirty entries nor the clean entry is present
}
started = true;
pos += je->size;
@@ -881,35 +740,3 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
bs->journal.next_free = next_free;
return 1;
}
void blockstore_init_journal::erase_dirty_object(blockstore_dirty_db_t::iterator dirty_it)
{
auto oid = dirty_it->first.oid;
bool exists = !IS_DELETE(dirty_it->second.state);
auto dirty_end = dirty_it;
dirty_end++;
while (1)
{
if (dirty_it == bs->dirty_db.begin())
{
break;
}
dirty_it--;
if (dirty_it->first.oid != oid)
{
dirty_it++;
break;
}
}
auto clean_it = bs->clean_db.find(oid);
uint64_t clean_loc = clean_it != bs->clean_db.end()
? clean_it->second.location : UINT64_MAX;
if (exists && clean_loc == UINT64_MAX)
{
bs->inode_space_stats[oid.inode] -= bs->block_size;
}
bs->erase_dirty(dirty_it, dirty_end, clean_loc);
// Remove it from the flusher's queue, too
// Otherwise it may end up referring to a small unstable write after reading the rest of the journal
bs->flusher->remove_flush(oid);
}
View File
@@ -7,7 +7,6 @@ class blockstore_init_meta
{
blockstore_impl_t *bs;
int wait_state = 0, wait_count = 0;
bool zero_on_init = false;
void *metadata_buffer = NULL;
uint64_t metadata_read = 0;
int prev = 0, prev_done = 0, done_len = 0, submitted = 0;
@@ -37,7 +36,6 @@ class blockstore_init_journal
bool started = false;
uint64_t next_free;
std::vector<bs_init_journal_done> done;
std::vector<obj_ver_id> double_allocs;
uint64_t journal_pos = 0;
uint64_t continue_pos = 0;
void *init_write_buf = NULL;
@@ -50,7 +48,6 @@ class blockstore_init_journal
std::function<void(ring_data_t*)> simple_callback;
int handle_journal_part(void *buf, uint64_t done_pos, uint64_t len);
void handle_event(ring_data_t *data);
void erase_dirty_object(blockstore_dirty_db_t::iterator dirty_it);
public:
blockstore_init_journal(blockstore_impl_t* bs);
int loop();
View File
@@ -7,7 +7,6 @@
#define MIN_JOURNAL_SIZE 4*1024*1024
#define JOURNAL_MAGIC 0x4A33
#define JOURNAL_VERSION 1
#define JOURNAL_BUFFER_SIZE 4*1024*1024
// We reserve some extra space for future stabilize requests during writes
@@ -38,9 +37,7 @@ struct __attribute__((__packed__)) journal_entry_start
uint32_t size;
uint32_t reserved;
uint64_t journal_start;
uint64_t version;
};
#define JE_START_LEGACY_SIZE 24
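// Size arithmetic behind the constant above, assuming the usual packed entry header
// (crc32 4 + magic 2 + type 2 + size 4 + reserved 4 + journal_start 8 = 24 bytes):
// a pre-versioning JE_START entry is 24 bytes, and adding the uint64_t version
// field brings sizeof(journal_entry_start) to 32.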
struct __attribute__((__packed__)) journal_entry_small_write
{
@@ -152,7 +149,6 @@ struct journal_t
int fd;
uint64_t device_size;
bool inmemory = false;
bool flush_journal = false;
void *buffer = NULL;
uint64_t block_size;
View File
@@ -42,11 +42,6 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{
disable_flock = true;
}
if (config["flush_journal"] == "true" || config["flush_journal"] == "1" || config["flush_journal"] == "yes")
{
// Only flush journal and exit
journal.flush_journal = true;
}
if (config["immediate_commit"] == "all")
{
immediate_commit = IMMEDIATE_ALL;
@@ -74,16 +69,8 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
journal_block_size = strtoull(config["journal_block_size"].c_str(), NULL, 10);
meta_block_size = strtoull(config["meta_block_size"].c_str(), NULL, 10);
bitmap_granularity = strtoull(config["bitmap_granularity"].c_str(), NULL, 10);
max_flusher_count = strtoull(config["max_flusher_count"].c_str(), NULL, 10);
if (!max_flusher_count)
max_flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10);
min_flusher_count = strtoull(config["min_flusher_count"].c_str(), NULL, 10);
flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10);
max_write_iodepth = strtoull(config["max_write_iodepth"].c_str(), NULL, 10);
throttle_small_writes = config["throttle_small_writes"] == "true" || config["throttle_small_writes"] == "1" || config["throttle_small_writes"] == "yes";
throttle_target_iops = strtoull(config["throttle_target_iops"].c_str(), NULL, 10);
throttle_target_mbs = strtoull(config["throttle_target_mbs"].c_str(), NULL, 10);
throttle_target_parallelism = strtoull(config["throttle_target_parallelism"].c_str(), NULL, 10);
throttle_threshold_us = strtoull(config["throttle_threshold_us"].c_str(), NULL, 10);
// Validate
if (!block_size)
{
@@ -93,13 +80,9 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{
throw std::runtime_error("Bad block size");
}
if (!max_flusher_count)
if (!flusher_count)
{
max_flusher_count = 256;
}
if (!min_flusher_count || journal.flush_journal)
{
min_flusher_count = 1;
flusher_count = 32;
}
if (!max_write_iodepth)
{
@@ -185,22 +168,6 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{
throw std::runtime_error("immediate_commit=all requires disable_journal_fsync and disable_data_fsync");
}
if (!throttle_target_iops)
{
throttle_target_iops = 100;
}
if (!throttle_target_mbs)
{
throttle_target_mbs = 100;
}
if (!throttle_target_parallelism)
{
throttle_target_parallelism = 1;
}
if (!throttle_threshold_us)
{
throttle_threshold_us = 50;
}
// init some fields
clean_entry_bitmap_size = block_size / bitmap_granularity / 8;
clean_entry_size = sizeof(clean_disk_entry) + 2*clean_entry_bitmap_size;
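// Worked example (defaults assumed: block_size = 128 KiB, bitmap_granularity = 4 KiB):
// clean_entry_bitmap_size = 131072 / 4096 / 8 = 4 bytes, i.e. one bit per 4 KiB of
// the object; clean_entry_size then adds two such bitmaps (the object bitmap and
// the "external" one) to sizeof(clean_disk_entry).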
@@ -257,7 +224,7 @@ void blockstore_impl_t::calc_lengths()
}
// required metadata size
block_count = data_len / block_size;
meta_len = (1 + (block_count - 1 + meta_block_size / clean_entry_size) / (meta_block_size / clean_entry_size)) * meta_block_size;
meta_len = ((block_count - 1 + meta_block_size / clean_entry_size) / (meta_block_size / clean_entry_size)) * meta_block_size;
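// Worked example (values assumed): with meta_block_size = 4096 and clean_entry_size = 32,
// one metadata block holds 4096 / 32 = 128 entries; for block_count = 1000000 data blocks
// the rounded-up term is (1000000 - 1 + 128) / 128 = 7813 blocks, and the newer formula
// adds one block for the superblock: meta_len = 7814 * 4096 = 32006144 bytes (~30.5 MiB).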
if (meta_area < meta_len)
{
throw std::runtime_error("Metadata area is too small, need at least "+std::to_string(meta_len)+" bytes");
View File
@@ -248,12 +248,10 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
}
while (1)
{
if (IS_BIG_WRITE(dirty_it->second.state) && dirty_it->second.location != clean_loc &&
dirty_it->second.location != UINT64_MAX)
if (IS_BIG_WRITE(dirty_it->second.state) && dirty_it->second.location != clean_loc)
{
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu from %lx:%lx v%lu\n", dirty_it->second.location >> block_order,
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
printf("Free block %lu\n", dirty_it->second.location >> block_order);
#endif
data_alloc->set(dirty_it->second.location >> block_order, false);
}
View File
@@ -168,9 +168,6 @@ resume_5:
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
{
// Mark all dirty_db entries up to op->version as stable
#ifdef BLOCKSTORE_DEBUG
printf("Stabilize %lx:%lx v%lu\n", v->oid.inode, v->oid.stripe, v->version);
#endif
mark_stable(*v);
}
// Acknowledge op
@@ -179,66 +176,31 @@ resume_5:
return 2;
}
void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
void blockstore_impl_t::mark_stable(const obj_ver_id & v)
{
auto dirty_it = dirty_db.find(v);
if (dirty_it != dirty_db.end())
{
while (1)
{
bool was_stable = IS_STABLE(dirty_it->second.state);
if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_SYNCED)
{
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_STABLE;
// Allocations and deletions are counted when they're stabilized
if (IS_BIG_WRITE(dirty_it->second.state))
{
int exists = -1;
if (dirty_it != dirty_db.begin())
{
auto prev_it = dirty_it;
prev_it--;
if (prev_it->first.oid == v.oid)
{
exists = IS_DELETE(prev_it->second.state) ? 0 : 1;
}
}
if (exists == -1)
{
auto clean_it = clean_db.find(v.oid);
exists = clean_it != clean_db.end() ? 1 : 0;
}
if (!exists)
{
inode_space_stats[dirty_it->first.oid.inode] += block_size;
}
inode_space_stats[dirty_it->first.oid.inode] += block_size;
}
else if (IS_DELETE(dirty_it->second.state))
{
inode_space_stats[dirty_it->first.oid.inode] -= block_size;
}
}
if (forget_dirty && (IS_BIG_WRITE(dirty_it->second.state) ||
IS_DELETE(dirty_it->second.state)))
else if (IS_STABLE(dirty_it->second.state))
{
// Big write overrides all previous dirty entries
auto erase_end = dirty_it;
while (dirty_it != dirty_db.begin())
{
dirty_it--;
if (dirty_it->first.oid != v.oid)
{
dirty_it++;
break;
}
}
auto clean_it = clean_db.find(v.oid);
uint64_t clean_loc = clean_it != clean_db.end()
? clean_it->second.location : UINT64_MAX;
erase_dirty(dirty_it, erase_end, clean_loc);
break;
}
if (was_stable || dirty_it == dirty_db.begin())
if (dirty_it == dirty_db.begin())
{
break;
}
View File
@@ -24,7 +24,6 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
if (PRIV(op)->op_state == 0)
{
stop_sync_submitted = false;
unsynced_big_write_count -= unsynced_big_writes.size();
PRIV(op)->sync_big_writes.swap(unsynced_big_writes);
PRIV(op)->sync_small_writes.swap(unsynced_small_writes);
PRIV(op)->sync_small_checked = 0;
@@ -80,8 +79,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
// 2nd step: Data device is synced, prepare & write journal entries
// Check space in the journal and journal memory buffers
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(),
sizeof(journal_entry_big_write) + clean_entry_bitmap_size, JOURNAL_STABILIZE_RESERVATION))
if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(), sizeof(journal_entry_big_write), JOURNAL_STABILIZE_RESERVATION))
{
return 0;
}
@@ -96,7 +94,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
int s = 0, cur_sector = -1;
while (it != PRIV(op)->sync_big_writes.end())
{
if (!journal.entry_fits(sizeof(journal_entry_big_write) + clean_entry_bitmap_size) &&
if (!journal.entry_fits(sizeof(journal_entry_big_write)) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
@@ -104,27 +102,24 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], [this, op](ring_data_t *data) { handle_sync_event(data, op); });
cur_sector = journal.cur_sector;
}
auto & dirty_entry = dirty_db.at(*it);
journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, (dirty_entry.state & BS_ST_INSTANT) ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write) + clean_entry_bitmap_size
journal, (dirty_db[*it].state & BS_ST_INSTANT) ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write)
);
dirty_entry.journal_sector = journal.sector_info[journal.cur_sector].offset;
dirty_db[*it].journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
dirty_entry.journal_sector, it->oid.inode, it->oid.stripe, it->version,
dirty_db[*it].journal_sector, it->oid.inode, it->oid.stripe, it->version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
#endif
je->oid = it->oid;
je->version = it->version;
je->offset = dirty_entry.offset;
je->len = dirty_entry.len;
je->location = dirty_entry.location;
memcpy((void*)(je+1), (clean_entry_bitmap_size > sizeof(void*)
? dirty_entry.bitmap : &dirty_entry.bitmap), clean_entry_bitmap_size);
je->offset = dirty_db[*it].offset;
je->len = dirty_db[*it].len;
je->location = dirty_db[*it].location;
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
it++;
@@ -146,7 +141,6 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
data->callback = [this, op](ring_data_t *data) { handle_sync_event(data, op); };
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = SYNC_JOURNAL_SYNC_SENT;
return 1;
View File
@@ -30,13 +30,10 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
wait_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE
? !IS_SYNCED(dirty_it->second.state)
: ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG);
if (!is_del && !deleted)
{
if (clean_entry_bitmap_size > sizeof(void*))
memcpy(bmp, dirty_it->second.bitmap, clean_entry_bitmap_size);
else
bmp = dirty_it->second.bitmap;
}
if (clean_entry_bitmap_size > sizeof(void*))
memcpy(bmp, dirty_it->second.bitmap, clean_entry_bitmap_size);
else
bmp = dirty_it->second.bitmap;
}
}
if (!found)
@@ -45,11 +42,8 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
if (clean_it != clean_db.end())
{
version = clean_it->second.version + 1;
if (!is_del)
{
void *bmp_ptr = get_clean_entry_bitmap(clean_it->second.location, clean_entry_bitmap_size);
memcpy((clean_entry_bitmap_size > sizeof(void*) ? bmp : &bmp), bmp_ptr, clean_entry_bitmap_size);
}
void *bmp_ptr = get_clean_entry_bitmap(clean_it->second.location, clean_entry_bitmap_size);
memcpy((clean_entry_bitmap_size > sizeof(void*) ? bmp : &bmp), bmp_ptr, clean_entry_bitmap_size);
}
else
{
@@ -122,8 +116,6 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
else
{
state = (op->len == block_size || deleted ? BS_ST_BIG_WRITE : BS_ST_SMALL_WRITE);
if (state == BS_ST_SMALL_WRITE && throttle_small_writes)
clock_gettime(CLOCK_REALTIME, &PRIV(op)->tv_begin);
if (wait_del)
state |= BS_ST_WAIT_DEL;
else if (state == BS_ST_SMALL_WRITE && wait_big)
@@ -249,8 +241,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
if ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
{
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, unsynced_big_write_count + 1,
sizeof(journal_entry_big_write) + clean_entry_bitmap_size, JOURNAL_STABILIZE_RESERVATION))
if (!space_check.check_available(op, unsynced_big_writes.size() + 1, sizeof(journal_entry_big_write), JOURNAL_STABILIZE_RESERVATION))
{
return 0;
}
@@ -273,10 +264,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
dirty_it->second.location = loc << block_order;
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED;
#ifdef BLOCKSTORE_DEBUG
printf(
"Allocate block %lu for %lx:%lx v%lu\n",
loc, op->oid.inode, op->oid.stripe, op->version
);
printf("Allocate block %lu\n", loc);
#endif
data_alloc->set(loc, true);
uint64_t stripe_offset = (op->offset % bitmap_granularity);
@@ -302,8 +290,11 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
if (immediate_commit != IMMEDIATE_ALL)
{
// Increase the counter, but don't save into unsynced_writes yet (can't sync until the write is finished)
unsynced_big_write_count++;
// Remember big write as unsynced
unsynced_big_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
PRIV(op)->op_state = 3;
}
else
@@ -316,11 +307,8 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
// Small (journaled) write
// First check if the journal has sufficient space
blockstore_journal_check_t space_check(this);
if (unsynced_big_write_count &&
!space_check.check_available(op, unsynced_big_write_count,
sizeof(journal_entry_big_write) + clean_entry_bitmap_size, 0)
|| !space_check.check_available(op, 1,
sizeof(journal_entry_small_write) + clean_entry_bitmap_size, op->len + JOURNAL_STABILIZE_RESERVATION))
if (unsynced_big_writes.size() && !space_check.check_available(op, unsynced_big_writes.size(), sizeof(journal_entry_big_write), 0)
|| !space_check.check_available(op, 1, sizeof(journal_entry_small_write), op->len + JOURNAL_STABILIZE_RESERVATION))
{
return 0;
}
@@ -328,7 +316,8 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
// There is sufficient space. Get SQE(s)
struct io_uring_sqe *sqe1 = NULL;
if (immediate_commit != IMMEDIATE_NONE ||
!journal.entry_fits(sizeof(journal_entry_small_write) + clean_entry_bitmap_size))
(journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_small_write) &&
journal.sector_info[journal.cur_sector].dirty)
{
// Write current journal sector only if it's dirty and full, or in the immediate_commit mode
BS_SUBMIT_GET_SQE_DECL(sqe1);
@@ -411,6 +400,14 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
{
journal.next_free = journal_block_size;
}
if (immediate_commit == IMMEDIATE_NONE)
{
// Remember small write as unsynced
unsynced_small_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
}
if (!PRIV(op)->pending_ops)
{
PRIV(op)->op_state = 4;
@@ -426,148 +423,85 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
int blockstore_impl_t::continue_write(blockstore_op_t *op)
{
io_uring_sqe *sqe = NULL;
journal_entry_big_write *je;
int op_state = PRIV(op)->op_state;
if (op_state == 2)
goto resume_2;
else if (op_state == 4)
goto resume_4;
else if (op_state == 6)
goto resume_6;
else
if (op_state != 2 && op_state != 4)
{
// In progress
return 1;
}
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
assert(dirty_it != dirty_db.end());
if (op_state == 2)
goto resume_2;
else if (op_state == 4)
goto resume_4;
resume_2:
// Only for the immediate_commit mode: prepare and submit big_write journal entry
{
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
assert(dirty_it != dirty_db.end());
io_uring_sqe *sqe = NULL;
BS_SUBMIT_GET_SQE_DECL(sqe);
journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write) + clean_entry_bitmap_size
);
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
BS_SUBMIT_GET_SQE_DECL(sqe);
je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write) + clean_entry_bitmap_size
);
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
journal.sector_info[journal.cur_sector].offset, op->oid.inode, op->oid.stripe, op->version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
journal.sector_info[journal.cur_sector].offset, op->oid.inode, op->oid.stripe, op->version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
#endif
je->oid = op->oid;
je->version = op->version;
je->offset = op->offset;
je->len = op->len;
je->location = dirty_it->second.location;
memcpy((void*)(je+1), (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap), clean_entry_bitmap_size);
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
prepare_journal_sector_write(journal, journal.cur_sector, sqe,
[this, op](ring_data_t *data) { handle_write_event(data, op); });
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = 3;
return 1;
}
je->oid = op->oid;
je->version = op->version;
je->offset = op->offset;
je->len = op->len;
je->location = dirty_it->second.location;
memcpy((void*)(je+1), (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap), clean_entry_bitmap_size);
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
prepare_journal_sector_write(journal, journal.cur_sector, sqe,
[this, op](ring_data_t *data) { handle_write_event(data, op); });
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = 3;
return 1;
resume_4:
// Switch object state
#ifdef BLOCKSTORE_DEBUG
printf("Ack write %lx:%lx v%lu = state 0x%x\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
printf("Ack write %lx:%lx v%lu = state %x\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
#endif
bool imm = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE
? (immediate_commit == IMMEDIATE_ALL)
: (immediate_commit != IMMEDIATE_NONE);
if (imm)
{
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
assert(dirty_it != dirty_db.end());
bool is_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE;
bool imm = is_big ? (immediate_commit == IMMEDIATE_ALL) : (immediate_commit != IMMEDIATE_NONE);
if (imm)
auto & unstab = unstable_writes[op->oid];
unstab = unstab < op->version ? op->version : unstab;
}
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK)
| (imm ? BS_ST_SYNCED : BS_ST_WRITTEN);
if (imm && ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_DELETE || (dirty_it->second.state & BS_ST_INSTANT)))
{
// Deletions are treated as immediately stable
mark_stable(dirty_it->first);
}
if (immediate_commit == IMMEDIATE_ALL)
{
dirty_it++;
while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
{
auto & unstab = unstable_writes[op->oid];
unstab = unstab < op->version ? op->version : unstab;
}
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK)
| (imm ? BS_ST_SYNCED : BS_ST_WRITTEN);
if (imm && ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_DELETE || (dirty_it->second.state & BS_ST_INSTANT)))
{
// Deletions and 'instant' operations are treated as immediately stable
mark_stable(dirty_it->first);
}
if (!imm)
{
if (is_big)
if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG)
{
// Remember big write as unsynced
unsynced_big_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_IN_FLIGHT;
}
else
{
// Remember small write as unsynced
unsynced_small_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
}
}
if (imm && (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
{
// Unblock small writes
dirty_it++;
while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
{
if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG)
{
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_IN_FLIGHT;
}
dirty_it++;
}
}
// Apply throttling to not fill the journal too fast for the SSD+HDD case
if (!is_big && throttle_small_writes)
{
// Apply throttling
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
uint64_t exec_us =
(tv_end.tv_sec - PRIV(op)->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - PRIV(op)->tv_begin.tv_nsec)/1000;
// Compare with target execution time
// 100% free -> target time = 0
// 0% free -> target time = iodepth/parallelism * (iops + size/bw) / write per second
uint64_t used_start = journal.get_trim_pos();
uint64_t journal_free_space = journal.next_free < used_start
? (used_start - journal.next_free)
: (journal.len - journal.next_free + used_start - journal.block_size);
uint64_t ref_us =
(write_iodepth <= throttle_target_parallelism ? 100 : 100*write_iodepth/throttle_target_parallelism)
* (1000000/throttle_target_iops + op->len*1000000/throttle_target_mbs/1024/1024)
/ 100;
ref_us -= ref_us * journal_free_space / journal.len;
if (ref_us > exec_us + throttle_threshold_us)
{
// Pause reply
tfd->set_timer_us(ref_us-exec_us, false, [this, op](int timer_id)
{
PRIV(op)->op_state++;
ringloop->wakeup();
});
PRIV(op)->op_state = 5;
return 1;
}
}
}
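// Worked example with the defaults from parse_config (target_iops = 100, target_mbs = 100,
// target_parallelism = 1, threshold = 50 us), assuming a 4 KiB small write at
// write_iodepth = 1: ref_us = 1000000/100 + 4096*1000000/100/1024/1024 = 10000 + 39 us,
// then scaled by journal fullness — with the journal half full, ref_us ≈ 5020 us,
// so a write that completed in 100 us gets delayed by roughly 4.9 ms.
// journal_free_space above is plain ring-buffer arithmetic: used_start - next_free
// when the write pointer trails the trim position, otherwise the sum of the two
// remaining segments of the ring.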
resume_6:
// Acknowledge write
op->retval = op->len;
write_iodepth--;
@@ -691,6 +625,14 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops++;
}
else
{
// Remember delete as unsynced
unsynced_small_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
}
if (!PRIV(op)->pending_ops)
{
PRIV(op)->op_state = 4;
File diff suppressed because it is too large

View File
@@ -8,8 +8,7 @@
#define MIN_BLOCK_SIZE 4*1024
#define MAX_BLOCK_SIZE 128*1024*1024
#define DEFAULT_CLIENT_MAX_DIRTY_BYTES 32*1024*1024
#define DEFAULT_CLIENT_MAX_DIRTY_OPS 1024
#define DEFAULT_CLIENT_DIRTY_LIMIT 32*1024*1024
struct cluster_op_t;
@@ -21,7 +20,8 @@ struct cluster_op_part_t
pg_num_t pg_num;
osd_num_t osd_num;
osd_op_buf_list_t iov;
unsigned flags;
bool sent;
bool done;
osd_op_t op;
};
@@ -31,38 +31,24 @@ struct cluster_op_t
uint64_t inode;
uint64_t offset;
uint64_t len;
// for reads and writes within a single object (stripe),
// reads can return current version and writes can use "CAS" semantics
uint64_t version = 0;
int retval;
osd_op_buf_list_t iov;
std::function<void(cluster_op_t*)> callback;
~cluster_op_t();
protected:
uint64_t flags = 0;
int state = 0;
uint64_t cur_inode; // for snapshot reads
void *buf = NULL;
cluster_op_t *orig_op = NULL;
bool is_internal = false;
bool needs_reslice = false;
bool up_wait = false;
int inflight_count = 0, done_count = 0;
int sent_count = 0, done_count = 0;
std::vector<cluster_op_part_t> parts;
void *bitmap_buf = NULL, *part_bitmaps = NULL;
unsigned bitmap_buf_size = 0;
cluster_op_t *prev = NULL, *next = NULL;
int prev_wait = 0;
friend class cluster_client_t;
};
struct cluster_buffer_t
{
void *buf;
uint64_t len;
int state;
};
// FIXME: Split into public and private interfaces
class cluster_client_t
{
timerfd_manager_t *tfd;
@@ -71,29 +57,30 @@ class cluster_client_t
uint64_t bs_block_size = 0;
uint32_t bs_bitmap_granularity = 0, bs_bitmap_size = 0;
std::map<pool_id_t, uint64_t> pg_counts;
// WARNING: initially true so execute() doesn't create fake sync
bool immediate_commit = true;
bool immediate_commit = false;
// FIXME: Implement inmemory_commit mode. Note that it requires returning overlapping reads from memory.
uint64_t client_max_dirty_bytes = 0;
uint64_t client_max_dirty_ops = 0;
uint64_t client_dirty_limit = 0;
int log_level;
int up_wait_retry_interval = 500; // ms
int retry_timeout_id = 0;
uint64_t op_id = 1;
ring_consumer_t consumer;
// operations currently in progress
std::set<cluster_op_t*> cur_ops;
int retry_timeout_id = 0;
// unsynced operations are copied in memory to allow replay when the cluster isn't in immediate_commit mode
// unsynced_writes are replayed in any order (because only the SYNC operation guarantees ordering)
std::vector<cluster_op_t*> unsynced_writes;
std::vector<cluster_op_t*> syncing_writes;
cluster_op_t* cur_sync = NULL;
std::vector<cluster_op_t*> next_writes;
std::vector<cluster_op_t*> offline_ops;
cluster_op_t *op_queue_head = NULL, *op_queue_tail = NULL;
std::map<object_id, cluster_buffer_t> dirty_buffers;
std::set<osd_num_t> dirty_osds;
uint64_t dirty_bytes = 0, dirty_ops = 0;
uint64_t queued_bytes = 0;
void *scrap_buffer = NULL;
unsigned scrap_buffer_size = 0;
bool pgs_loaded = false;
ring_consumer_t consumer;
std::vector<std::function<void(void)>> on_ready_hooks;
int continuing_ops = 0;
public:
etcd_state_client_t st_cli;
@@ -106,23 +93,19 @@ public:
bool is_ready();
void on_ready(std::function<void(void)> fn);
static void copy_write(cluster_op_t *op, std::map<object_id, cluster_buffer_t> & dirty_buffers);
void continue_ops(bool up_retry = false);
protected:
bool affects_osd(uint64_t inode, uint64_t offset, uint64_t len, osd_num_t osd);
void flush_buffer(const object_id & oid, cluster_buffer_t *wr);
void continue_ops(bool up_retry = false);
void on_load_config_hook(json11::Json::object & config);
void on_load_pgs_hook(bool success);
void on_change_hook(std::map<std::string, etcd_kv_t> & changes);
void on_change_hook(json11::Json::object & changes);
void on_change_osd_state_hook(uint64_t peer_osd);
int continue_rw(cluster_op_t *op);
cluster_op_t *copy_write(cluster_op_t *op);
void continue_rw(cluster_op_t *op);
void slice_rw(cluster_op_t *op);
bool try_send(cluster_op_t *op, int i);
int continue_sync(cluster_op_t *op);
void execute_sync(cluster_op_t *op);
void continue_sync();
void finish_sync();
void send_sync(cluster_op_t *op, cluster_op_part_t *part);
void handle_op_part(cluster_op_part_t *part);
void copy_part_bitmap(cluster_op_t *op, cluster_op_part_t *part);
void erase_op(cluster_op_t *op);
void calc_wait(cluster_op_t *op);
void inc_wait(uint64_t opcode, uint64_t flags, cluster_op_t *next, int inc);
};
View File
@@ -4,10 +4,8 @@
#include "osd_ops.h"
#include "pg_states.h"
#include "etcd_state_client.h"
#ifndef __MOCK__
#include "http_client.h"
#include "base64.h"
#endif
etcd_state_client_t::~etcd_state_client_t()
{
@@ -17,29 +15,24 @@ etcd_state_client_t::~etcd_state_client_t()
}
watches.clear();
etcd_watches_initialised = -1;
#ifndef __MOCK__
if (etcd_watch_ws)
{
etcd_watch_ws->close();
etcd_watch_ws = NULL;
}
#endif
}
#ifndef __MOCK__
etcd_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json)
json_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json)
{
etcd_kv_t kv;
json_kv_t kv;
kv.key = base64_decode(kv_json["key"].string_value());
std::string json_err, json_text = base64_decode(kv_json["value"].string_value());
kv.value = json_text == "" ? json11::Json() : json11::Json::parse(json_text, json_err);
if (json_err != "")
{
fprintf(stderr, "Bad JSON in etcd key %s: %s (value: %s)\n", kv.key.c_str(), json_err.c_str(), json_text.c_str());
printf("Bad JSON in etcd key %s: %s (value: %s)\n", kv.key.c_str(), json_err.c_str(), json_text.c_str());
kv.key = "";
}
else
kv.mod_revision = kv_json["mod_revision"].uint64_value();
return kv;
}
@@ -50,11 +43,6 @@ void etcd_state_client_t::etcd_txn(json11::Json txn, int timeout, std::function<
void etcd_state_client_t::etcd_call(std::string api, json11::Json payload, int timeout, std::function<void(std::string, json11::Json)> callback)
{
if (!etcd_addresses.size())
{
fprintf(stderr, "etcd_address is missing in Vitastor configuration\n");
exit(1);
}
std::string etcd_address = etcd_addresses[rand() % etcd_addresses.size()];
std::string etcd_api_path;
int pos = etcd_address.find('/');
@@ -81,16 +69,16 @@ void etcd_state_client_t::add_etcd_url(std::string addr)
addr = addr.substr(7);
else if (strtolower(addr.substr(0, 8)) == "https://")
{
fprintf(stderr, "HTTPS is unsupported for etcd. Either use plain HTTP or setup a local proxy for etcd interaction\n");
printf("HTTPS is unsupported for etcd. Either use plain HTTP or setup a local proxy for etcd interaction\n");
exit(1);
}
if (addr.find('/') == std::string::npos)
if (addr.find('/') < 0)
addr += "/v3";
this->etcd_addresses.push_back(addr);
}
}
void etcd_state_client_t::parse_config(const json11::Json & config)
void etcd_state_client_t::parse_config(json11::Json & config)
{
this->etcd_addresses.clear();
if (config["etcd_address"].is_string())
@@ -127,11 +115,6 @@ void etcd_state_client_t::parse_config(const json11::Json & config)
void etcd_state_client_t::start_etcd_watcher()
{
if (!etcd_addresses.size())
{
fprintf(stderr, "etcd_address is missing in Vitastor configuration\n");
exit(1);
}
std::string etcd_address = etcd_addresses[rand() % etcd_addresses.size()];
std::string etcd_api_path;
int pos = etcd_address.find('/');
@@ -149,7 +132,7 @@ void etcd_state_client_t::start_etcd_watcher()
json11::Json data = json11::Json::parse(msg->body, json_err);
if (json_err != "")
{
fprintf(stderr, "Bad JSON in etcd event: %s, ignoring event\n", json_err.c_str());
printf("Bad JSON in etcd event: %s, ignoring event\n", json_err.c_str());
}
else
{
@@ -162,22 +145,22 @@ void etcd_state_client_t::start_etcd_watcher()
etcd_watch_revision = data["result"]["header"]["revision"].uint64_value();
}
// First gather all changes into a hash to remove multiple overwrites
std::map<std::string, etcd_kv_t> changes;
json11::Json::object changes;
for (auto & ev: data["result"]["events"].array_items())
{
auto kv = parse_etcd_kv(ev["kv"]);
if (kv.key != "")
{
changes[kv.key] = kv;
changes[kv.key] = kv.value;
}
}
for (auto & kv: changes)
{
if (this->log_level > 3)
{
fprintf(stderr, "Incoming event: %s -> %s\n", kv.first.c_str(), kv.second.value.dump().c_str());
printf("Incoming event: %s -> %s\n", kv.first.c_str(), kv.second.dump().c_str());
}
parse_state(kv.second);
parse_state(kv.first, kv.second);
}
// React to changes
if (on_change_hook != NULL)
@@ -250,7 +233,7 @@ void etcd_state_client_t::load_global_config()
{
if (err != "")
{
fprintf(stderr, "Error reading OSD configuration from etcd: %s\n", err.c_str());
printf("Error reading OSD configuration from etcd: %s\n", err.c_str());
tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int timer_id)
{
load_global_config();
@@ -323,7 +306,7 @@ void etcd_state_client_t::load_pgs()
{
if (err != "")
{
fprintf(stderr, "Error loading PGs from etcd: %s\n", err.c_str());
printf("Error loading PGs from etcd: %s\n", err.c_str());
tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int timer_id)
{
load_pgs();
@@ -344,33 +327,16 @@ void etcd_state_client_t::load_pgs()
for (auto & kv_json: res["response_range"]["kvs"].array_items())
{
auto kv = parse_etcd_kv(kv_json);
parse_state(kv);
parse_state(kv.key, kv.value);
}
}
on_load_pgs_hook(true);
start_etcd_watcher();
});
}
#else
void etcd_state_client_t::parse_config(const json11::Json & config)
{
}
void etcd_state_client_t::load_global_config()
void etcd_state_client_t::parse_state(const std::string & key, const json11::Json & value)
{
json11::Json::object global_config;
on_load_config_hook(global_config);
}
void etcd_state_client_t::load_pgs()
{
}
#endif
void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
{
const std::string & key = kv.key;
const json11::Json & value = kv.value;
if (key == etcd_prefix+"/config/pools")
{
for (auto & pool_item: this->pool_config)
@@ -381,12 +347,10 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
{
pool_config_t pc;
// ID
pool_id_t pool_id;
char null_byte = 0;
sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
pool_id_t pool_id = stoull_full(pool_item.first);
if (!pool_id || pool_id >= POOL_ID_MAX)
{
fprintf(stderr, "Pool ID %s is invalid (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
printf("Pool ID %s is invalid (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
continue;
}
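// Both parsers above insist on a full-string match: the sscanf form appends "%c",
// so null_byte becomes non-zero only when trailing garbage follows the number
// ("12x" is rejected, "12" passes); stoull_full presumably implements the same
// check internally and returns 0 on a partial parse.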
pc.id = pool_id;
@@ -394,7 +358,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
pc.name = pool_item.second["name"].string_value();
if (pc.name == "")
{
fprintf(stderr, "Pool %u has empty name, skipping pool\n", pool_id);
printf("Pool %u has empty name, skipping pool\n", pool_id);
continue;
}
// Failure Domain
@@ -408,7 +372,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
pc.scheme = POOL_SCHEME_JERASURE;
else
{
fprintf(stderr, "Pool %u has invalid coding scheme (one of \"xor\", \"replicated\" or \"jerasure\" required), skipping pool\n", pool_id);
printf("Pool %u has invalid coding scheme (one of \"xor\", \"replicated\" or \"jerasure\" required), skipping pool\n", pool_id);
continue;
}
// PG Size
@@ -418,7 +382,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
(pc.scheme == POOL_SCHEME_XOR || pc.scheme == POOL_SCHEME_JERASURE) ||
pool_item.second["pg_size"].uint64_value() > 256)
{
fprintf(stderr, "Pool %u has invalid pg_size, skipping pool\n", pool_id);
printf("Pool %u has invalid pg_size, skipping pool\n", pool_id);
continue;
}
// Parity Chunks
@@ -427,7 +391,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
{
if (pc.parity_chunks > 1)
{
fprintf(stderr, "Pool %u has invalid parity_chunks (must be 1), skipping pool\n", pool_id);
printf("Pool %u has invalid parity_chunks (must be 1), skipping pool\n", pool_id);
continue;
}
pc.parity_chunks = 1;
@@ -435,7 +399,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
if (pc.scheme == POOL_SCHEME_JERASURE &&
(pc.parity_chunks < 1 || pc.parity_chunks > pc.pg_size-2))
{
fprintf(stderr, "Pool %u has invalid parity_chunks (must be between 1 and pg_size-2), skipping pool\n", pool_id);
printf("Pool %u has invalid parity_chunks (must be between 1 and pg_size-2), skipping pool\n", pool_id);
continue;
}
// PG MinSize
@@ -444,14 +408,14 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
(pc.scheme == POOL_SCHEME_XOR || pc.scheme == POOL_SCHEME_JERASURE) &&
pc.pg_minsize < (pc.pg_size-pc.parity_chunks))
{
fprintf(stderr, "Pool %u has invalid pg_minsize, skipping pool\n", pool_id);
printf("Pool %u has invalid pg_minsize, skipping pool\n", pool_id);
continue;
}
// PG Count
pc.pg_count = pool_item.second["pg_count"].uint64_value();
if (pc.pg_count < 1)
{
fprintf(stderr, "Pool %u has invalid pg_count, skipping pool\n", pool_id);
printf("Pool %u has invalid pg_count, skipping pool\n", pool_id);
continue;
}
// Max OSD Combinations
@@ -460,7 +424,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
pc.max_osd_combinations = 10000;
if (pc.max_osd_combinations > 0 && pc.max_osd_combinations < 100)
{
fprintf(stderr, "Pool %u has invalid max_osd_combinations (must be at least 100), skipping pool\n", pool_id);
printf("Pool %u has invalid max_osd_combinations (must be at least 100), skipping pool\n", pool_id);
continue;
}
// PG Stripe Size
@@ -478,7 +442,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
{
if (pg_item.second.target_set.size() != parsed_cfg.pg_size)
{
fprintf(stderr, "Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n",
printf("Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n",
pool_id, pg_item.first, pg_item.second.target_set.size(), parsed_cfg.pg_size);
pg_item.second.pause = true;
}
@@ -496,21 +460,18 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
}
for (auto & pool_item: value["items"].object_items())
{
pool_id_t pool_id;
char null_byte = 0;
sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
pool_id_t pool_id = stoull_full(pool_item.first);
if (!pool_id || pool_id >= POOL_ID_MAX)
{
fprintf(stderr, "Pool ID %s is invalid in PG configuration (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
printf("Pool ID %s is invalid in PG configuration (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
continue;
}
for (auto & pg_item: pool_item.second.object_items())
{
pg_num_t pg_num = 0;
sscanf(pg_item.first.c_str(), "%u%c", &pg_num, &null_byte);
if (!pg_num || null_byte != 0)
pg_num_t pg_num = stoull_full(pg_item.first);
if (!pg_num)
{
fprintf(stderr, "Bad key in pool %u PG configuration: %s (must be a number), skipped\n", pool_id, pg_item.first.c_str());
printf("Bad key in pool %u PG configuration: %s (must be a number), skipped\n", pool_id, pg_item.first.c_str());
continue;
}
auto & parsed_cfg = this->pool_config[pool_id].pg_config[pg_num];
@@ -524,7 +485,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
}
if (parsed_cfg.target_set.size() != pool_config[pool_id].pg_size)
{
fprintf(stderr, "Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n",
printf("Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n",
pool_id, pg_num, parsed_cfg.target_set.size(), pool_config[pool_id].pg_size);
parsed_cfg.pause = true;
}
@@ -537,8 +498,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
{
if (pg_it->second.exists && pg_it->first != ++n)
{
fprintf(
stderr, "Invalid pool %u PG configuration: PG numbers don't cover whole 1..%lu range\n",
printf(
"Invalid pool %u PG configuration: PG numbers don't cover whole 1..%lu range\n",
pool_item.second.id, pool_item.second.pg_config.size()
);
for (pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); pg_it++)
@@ -561,7 +522,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
sscanf(key.c_str() + etcd_prefix.length()+12, "%u/%u%c", &pool_id, &pg_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !pg_num || null_byte != 0)
{
fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
printf("Bad etcd key %s, ignoring\n", key.c_str());
}
else
{
@@ -600,7 +561,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
sscanf(key.c_str() + etcd_prefix.length()+10, "%u/%u%c", &pool_id, &pg_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !pg_num || null_byte != 0)
{
fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
printf("Bad etcd key %s, ignoring\n", key.c_str());
}
else if (value.is_null())
{
@@ -624,7 +585,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
}
if (i >= pg_state_bit_count)
{
fprintf(stderr, "Unexpected pool %u PG %u state keyword in etcd: %s\n", pool_id, pg_num, e.dump().c_str());
printf("Unexpected pool %u PG %u state keyword in etcd: %s\n", pool_id, pg_num, e.dump().c_str());
return;
}
}
@@ -633,7 +594,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
(state & PG_PEERING) && state != PG_PEERING ||
(state & PG_INCOMPLETE) && state != PG_INCOMPLETE)
{
fprintf(stderr, "Unexpected pool %u PG %u state in etcd: primary=%lu, state=%s\n", pool_id, pg_num, cur_primary, value["state"].dump().c_str());
printf("Unexpected pool %u PG %u state in etcd: primary=%lu, state=%s\n", pool_id, pg_num, cur_primary, value["state"].dump().c_str());
return;
}
this->pool_config[pool_id].pg_config[pg_num].cur_primary = cur_primary;
@@ -671,7 +632,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
sscanf(key.c_str() + etcd_prefix.length()+14, "%lu/%lu%c", &pool_id, &inode_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !inode_num || (inode_num >> (64-POOL_ID_BITS)) || null_byte != 0)
{
fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
printf("Bad etcd key %s, ignoring\n", key.c_str());
}
else
{
@@ -706,8 +667,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
parent_inode_num |= pool_id << (64-POOL_ID_BITS);
else if (parent_pool_id >= POOL_ID_MAX)
{
fprintf(
stderr, "Inode %lu/%lu parent_pool value is invalid, ignoring parent setting\n",
printf(
"Inode %lu/%lu parent_pool value is invalid, ignoring parent setting\n",
inode_num >> (64-POOL_ID_BITS), inode_num & ((1l << (64-POOL_ID_BITS)) - 1)
);
parent_inode_num = 0;
@@ -721,7 +682,6 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
.size = value["size"].uint64_value(),
.parent_id = parent_inode_num,
.readonly = value["readonly"].bool_value(),
.mod_revision = kv.mod_revision,
};
this->inode_config[inode_num] = cfg;
if (cfg.name != "")
View File
@@ -3,8 +3,8 @@
#pragma once
#include "json11/json11.hpp"
#include "osd_id.h"
#include "http_client.h"
#include "timerfd_manager.h"
#define ETCD_CONFIG_WATCH_ID 1
@@ -18,11 +18,10 @@
#define DEFAULT_BLOCK_SIZE 128*1024
struct etcd_kv_t
struct json_kv_t
{
std::string key;
json11::Json value;
uint64_t mod_revision;
};
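// mod_revision is etcd's per-key modification revision. Keeping it next to the
// decoded value allows optimistic concurrency against etcd — e.g. (usage assumed)
// a transaction compare such as MOD(key) == mod_revision before rewriting inode
// metadata, so concurrent changes are detected instead of silently overwritten.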
struct pg_config_t
@@ -60,8 +59,6 @@ struct inode_config_t
uint64_t size;
inode_t parent_id;
bool readonly;
// Change revision of the metadata in etcd
uint64_t mod_revision;
};
struct inode_watch_t
@@ -70,14 +67,12 @@ struct inode_watch_t
inode_config_t cfg;
};
struct websocket_t;
struct etcd_state_client_t
{
protected:
std::vector<inode_watch_t*> watches;
websocket_t *etcd_watch_ws = NULL;
uint64_t bs_block_size = DEFAULT_BLOCK_SIZE;
uint64_t bs_block_size = 0;
void add_etcd_url(std::string);
public:
std::vector<std::string> etcd_addresses;
@@ -92,21 +87,21 @@ public:
std::map<inode_t, inode_config_t> inode_config;
std::map<std::string, inode_t> inode_by_name;
std::function<void(std::map<std::string, etcd_kv_t> &)> on_change_hook;
std::function<void(json11::Json::object &)> on_change_hook;
std::function<void(json11::Json::object &)> on_load_config_hook;
std::function<json11::Json()> load_pgs_checks_hook;
std::function<void(bool)> on_load_pgs_hook;
std::function<void(pool_id_t, pg_num_t)> on_change_pg_history_hook;
std::function<void(osd_num_t)> on_change_osd_state_hook;
etcd_kv_t parse_etcd_kv(const json11::Json & kv_json);
json_kv_t parse_etcd_kv(const json11::Json & kv_json);
void etcd_call(std::string api, json11::Json payload, int timeout, std::function<void(std::string, json11::Json)> callback);
void etcd_txn(json11::Json txn, int timeout, std::function<void(std::string, json11::Json)> callback);
void start_etcd_watcher();
void load_global_config();
void load_pgs();
void parse_state(const etcd_kv_t & kv);
void parse_config(const json11::Json & config);
void parse_state(const std::string & key, const json11::Json & value);
void parse_config(json11::Json & config);
inode_watch_t* watch_inode(std::string name);
void close_watch(inode_watch_t* watch);
~etcd_state_client_t();
View File
@@ -24,25 +24,28 @@
#include <netinet/tcp.h>
#include <vector>
#include <unordered_map>
#include "vitastor_c.h"
#include "epoll_manager.h"
#include "cluster_client.h"
#include "fio_headers.h"
struct sec_data
{
vitastor_c *cli = NULL;
void *watch = NULL;
ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL;
cluster_client_t *cli = NULL;
inode_watch_t *watch = NULL;
bool last_sync = false;
/* The list of completed io_u structs. */
std::vector<io_u*> completed;
uint64_t inflight = 0;
uint64_t op_n = 0, inflight = 0;
bool trace = false;
};
struct sec_options
{
int __pad;
char *config_path = NULL;
char *etcd_host = NULL;
char *etcd_prefix = NULL;
char *image = NULL;
@@ -50,23 +53,9 @@ struct sec_options
uint64_t inode = 0;
int cluster_log = 0;
int trace = 0;
int use_rdma = 0;
char *rdma_device = NULL;
int rdma_port_num = 0;
int rdma_gid_index = 0;
int rdma_mtu = 0;
};
static struct fio_option options[] = {
{
.name = "conf",
.lname = "Vitastor config path",
.type = FIO_OPT_STR_STORE,
.off1 = offsetof(struct sec_options, config_path),
.help = "Vitastor config path",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "etcd",
.lname = "etcd address",
@@ -132,71 +121,22 @@ static struct fio_option options[] = {
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "use_rdma",
.lname = "Use RDMA",
.type = FIO_OPT_BOOL,
.off1 = offsetof(struct sec_options, use_rdma),
.help = "Use RDMA",
.def = "-1",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "rdma_device",
.lname = "RDMA device name",
.type = FIO_OPT_STR_STORE,
.off1 = offsetof(struct sec_options, rdma_device),
.help = "RDMA device name",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "rdma_port_num",
.lname = "RDMA port number",
.type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, rdma_port_num),
.help = "RDMA port number",
.def = "0",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "rdma_gid_index",
.lname = "RDMA gid index",
.type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, rdma_gid_index),
.help = "RDMA gid index",
.def = "0",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "rdma_mtu",
.lname = "RDMA path MTU",
.type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, rdma_mtu),
.help = "RDMA path MTU",
.def = "0",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = NULL,
},
};
static void watch_callback(void *opaque, long watch)
{
struct sec_data *bsd = (struct sec_data*)opaque;
bsd->watch = (void*)watch;
}
static int sec_setup(struct thread_data *td)
{
sec_options *o = (sec_options*)td->eo;
sec_data *bsd;
if (!o->etcd_host)
{
td_verror(td, EINVAL, "etcd address is missing");
return 1;
}
bsd = new sec_data;
if (!bsd)
{
@@ -212,6 +152,12 @@ static int sec_setup(struct thread_data *td)
td->o.open_files++;
}
json11::Json cfg = json11::Json::object {
{ "etcd_address", std::string(o->etcd_host) },
{ "etcd_prefix", std::string(o->etcd_prefix ? o->etcd_prefix : "/vitastor") },
{ "log_level", o->cluster_log },
};
if (!o->image)
{
if (!(o->inode & ((1l << (64-POOL_ID_BITS)) - 1)))
@@ -233,20 +179,20 @@ static int sec_setup(struct thread_data *td)
{
o->inode = 0;
}
bsd->cli = vitastor_c_create_uring(o->config_path, o->etcd_host, o->etcd_prefix,
o->use_rdma, o->rdma_device, o->rdma_port_num, o->rdma_gid_index, o->rdma_mtu, o->cluster_log);
bsd->ringloop = new ring_loop_t(512);
bsd->epmgr = new epoll_manager_t(bsd->ringloop);
bsd->cli = new cluster_client_t(bsd->ringloop, bsd->epmgr->tfd, cfg);
if (o->image)
{
bsd->watch = NULL;
vitastor_c_watch_inode(bsd->cli, o->image, watch_callback, bsd);
while (true)
while (!bsd->cli->is_ready())
{
vitastor_c_uring_handle_events(bsd->cli);
if (bsd->watch)
bsd->ringloop->loop();
if (bsd->cli->is_ready())
break;
vitastor_c_uring_wait_events(bsd->cli);
bsd->ringloop->wait();
}
td->files[0]->real_file_size = vitastor_c_inode_get_size(bsd->watch);
bsd->watch = bsd->cli->st_cli.watch_inode(std::string(o->image));
td->files[0]->real_file_size = bsd->watch->cfg.size;
}
bsd->trace = o->trace ? true : false;
@@ -261,9 +207,11 @@ static void sec_cleanup(struct thread_data *td)
{
if (bsd->watch)
{
vitastor_c_close_watch(bsd->cli, bsd->watch);
bsd->cli->st_cli.close_watch(bsd->watch);
}
vitastor_c_destroy(bsd->cli);
delete bsd->cli;
delete bsd->epmgr;
delete bsd->ringloop;
delete bsd;
}
}
@@ -274,31 +222,12 @@ static int sec_init(struct thread_data *td)
return 0;
}
static void io_callback(void *opaque, long retval)
{
struct io_u *io = (struct io_u*)opaque;
io->error = retval < 0 ? -retval : 0;
sec_data *bsd = (sec_data*)io->engine_data;
bsd->inflight--;
bsd->completed.push_back(io);
if (bsd->trace)
{
printf("--- %s 0x%lx retval=%ld\n", io->ddir == DDIR_READ ? "READ" :
(io->ddir == DDIR_WRITE ? "WRITE" : "SYNC"), (uint64_t)io, retval);
}
}
static void read_callback(void *opaque, long retval, uint64_t version)
{
io_callback(opaque, retval);
}
/* Begin read or write request. */
static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
{
sec_options *opt = (sec_options*)td->eo;
sec_data *bsd = (sec_data*)td->io_ops_data;
struct iovec iov;
int n = bsd->op_n;
fio_ro_check(td, io);
if (io->ddir == DDIR_SYNC && bsd->last_sync)
@@ -307,29 +236,32 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
}
io->engine_data = bsd;
io->error = 0;
bsd->inflight++;
cluster_op_t *op = new cluster_op_t;
uint64_t inode = opt->image ? vitastor_c_inode_get_num(bsd->watch) : opt->inode;
op->inode = opt->image ? bsd->watch->cfg.num : opt->inode;
switch (io->ddir)
{
case DDIR_READ:
iov = { .iov_base = io->xfer_buf, .iov_len = io->xfer_buflen };
vitastor_c_read(bsd->cli, inode, io->offset, io->xfer_buflen, &iov, 1, read_callback, io);
op->opcode = OSD_OP_READ;
op->offset = io->offset;
op->len = io->xfer_buflen;
op->iov.push_back(io->xfer_buf, io->xfer_buflen);
bsd->last_sync = false;
break;
case DDIR_WRITE:
if (opt->image && vitastor_c_inode_get_readonly(bsd->watch))
if (opt->image && bsd->watch->cfg.readonly)
{
io->error = EROFS;
return FIO_Q_COMPLETED;
}
iov = { .iov_base = io->xfer_buf, .iov_len = io->xfer_buflen };
vitastor_c_write(bsd->cli, inode, io->offset, io->xfer_buflen, 0, &iov, 1, io_callback, io);
op->opcode = OSD_OP_WRITE;
op->offset = io->offset;
op->len = io->xfer_buflen;
op->iov.push_back(io->xfer_buf, io->xfer_buflen);
bsd->last_sync = false;
break;
case DDIR_SYNC:
vitastor_c_sync(bsd->cli, io_callback, io);
op->opcode = OSD_OP_SYNC;
bsd->last_sync = true;
break;
default:
@@ -337,20 +269,39 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
return FIO_Q_COMPLETED;
}
op->callback = [io, n](cluster_op_t *op)
{
io->error = op->retval < 0 ? -op->retval : 0;
sec_data *bsd = (sec_data*)io->engine_data;
bsd->inflight--;
bsd->completed.push_back(io);
if (bsd->trace)
{
printf("--- %s n=%d retval=%d\n", io->ddir == DDIR_READ ? "READ" :
(io->ddir == DDIR_WRITE ? "WRITE" : "SYNC"), n, op->retval);
}
delete op;
};
if (opt->trace)
{
if (io->ddir == DDIR_SYNC)
{
printf("+++ SYNC 0x%lx\n", (uint64_t)io);
printf("+++ SYNC # %d\n", n);
}
else
{
printf("+++ %s 0x%lx 0x%llx+%llx\n",
printf("+++ %s # %d 0x%llx+%llx\n",
io->ddir == DDIR_READ ? "READ" : "WRITE",
(uint64_t)io, io->offset, io->xfer_buflen);
n, io->offset, io->xfer_buflen);
}
}
io->error = 0;
bsd->inflight++;
bsd->op_n++;
bsd->cli->execute(op);
if (io->error != 0)
return FIO_Q_COMPLETED;
return FIO_Q_QUEUED;
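For context, a minimal sketch of the cluster_client_t request pattern that sec_queue() now relies on. Object names such as cli, ringloop, inode_num and read_buf are assumptions for illustration, not taken from the engine:

cluster_op_t *op = new cluster_op_t;
op->opcode = OSD_OP_READ;
op->inode = inode_num;              // pool+inode number, as in sec_queue()
op->offset = 0;
op->len = 4096;
op->iov.push_back(read_buf, 4096);
bool done = false;
op->callback = [&done](cluster_op_t *op)
{
    // op->retval < 0 carries a negative errno, mirroring the engine's callback
    done = true;
    delete op;
};
cli->execute(op);
while (!done)
{
    ringloop->loop();   // submit and reap io_uring events
    ringloop->wait();   // sleep until at least one event completes
}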
@@ -361,10 +312,10 @@ static int sec_getevents(struct thread_data *td, unsigned int min, unsigned int
sec_data *bsd = (sec_data*)td->io_ops_data;
while (true)
{
vitastor_c_uring_handle_events(bsd->cli);
bsd->ringloop->loop();
if (bsd->completed.size() >= min)
break;
vitastor_c_uring_wait_events(bsd->cli);
bsd->ringloop->wait();
}
return bsd->completed.size();
}

View File

@@ -25,7 +25,6 @@
// -bs_config='{"data_device":"./test_data.bin"}' -size=1000M
#include "blockstore.h"
#include "epoll_manager.h"
#include "fio_headers.h"
#include "json11/json11.hpp"
@@ -33,7 +32,6 @@
struct bs_data
{
blockstore_t *bs;
epoll_manager_t *epmgr;
ring_loop_t *ringloop;
/* The list of completed io_u structs. */
std::vector<io_u*> completed;
@@ -106,7 +104,6 @@ static void bs_cleanup(struct thread_data *td)
}
safe:
delete bsd->bs;
delete bsd->epmgr;
delete bsd->ringloop;
delete bsd;
}
@@ -132,8 +129,7 @@ static int bs_init(struct thread_data *td)
}
}
bsd->ringloop = new ring_loop_t(512);
bsd->epmgr = new epoll_manager_t(bsd->ringloop);
bsd->bs = new blockstore_t(config, bsd->ringloop, bsd->epmgr->tfd);
bsd->bs = new blockstore_t(config, bsd->ringloop);
while (1)
{
bsd->ringloop->loop();

View File

@@ -10,41 +10,30 @@
#include "messenger.h"
osd_op_t::~osd_op_t()
{
assert(!bs_op);
assert(!op_data);
if (rmw_buf)
{
free(rmw_buf);
}
if (buf)
{
// Note: reusing osd_op_t WILL currently lead to memory leaks
// So we don't reuse it, but free it every time
free(buf);
}
}
void osd_messenger_t::init()
{
#ifdef WITH_RDMA
if (use_rdma)
{
rdma_context = msgr_rdma_context_t::create(
rdma_device != "" ? rdma_device.c_str() : NULL,
rdma_port_num, rdma_gid_index, rdma_mtu
);
if (!rdma_context)
{
fprintf(stderr, "[OSD %lu] Couldn't initialize RDMA, proceeding with TCP only\n", osd_num);
}
else
{
rdma_max_sge = rdma_max_sge < rdma_context->attrx.orig_attr.max_sge
? rdma_max_sge : rdma_context->attrx.orig_attr.max_sge;
fprintf(stderr, "[OSD %lu] RDMA initialized successfully\n", osd_num);
fcntl(rdma_context->channel->fd, F_SETFL, fcntl(rdma_context->channel->fd, F_GETFL, 0) | O_NONBLOCK);
tfd->set_fd_handler(rdma_context->channel->fd, false, [this](int notify_fd, int epoll_events)
{
handle_rdma_events();
});
handle_rdma_events();
}
}
#endif
keepalive_timer_id = tfd->set_timer(1000, true, [this](int)
{
std::vector<int> to_stop;
std::vector<osd_op_t*> to_ping;
for (auto cl_it = clients.begin(); cl_it != clients.end(); cl_it++)
for (auto cl_it = clients.begin(); cl_it != clients.end();)
{
auto cl = cl_it->second;
if (!cl->osd_num || cl->peer_state != PEER_CONNECTED && cl->peer_state != PEER_RDMA)
auto cl = (cl_it++)->second;
if (!cl->osd_num)
{
// Do not run keepalive on regular clients
continue;
@@ -55,8 +44,7 @@ void osd_messenger_t::init()
if (!cl->ping_time_remaining)
{
// Ping timed out, stop the client
fprintf(stderr, "Ping timed out for OSD %lu (client %d), disconnecting peer\n", cl->osd_num, cl->peer_fd);
to_stop.push_back(cl->peer_fd);
stop_client(cl->peer_fd, true);
}
}
else if (cl->idle_time_remaining > 0)
@@ -82,11 +70,10 @@ void osd_messenger_t::init()
delete op;
if (fail_fd >= 0)
{
fprintf(stderr, "Ping failed for OSD %lu (client %d), disconnecting peer\n", cl->osd_num, cl->peer_fd);
stop_client(fail_fd, true);
}
};
to_ping.push_back(op);
outbox_push(op);
cl->ping_time_remaining = osd_ping_timeout;
cl->idle_time_remaining = osd_idle_timeout;
}
@@ -96,15 +83,6 @@ void osd_messenger_t::init()
cl->idle_time_remaining = osd_idle_timeout;
}
}
// Don't stop clients while a 'clients' iterator is still active
for (int peer_fd: to_stop)
{
stop_client(peer_fd, true);
}
for (auto op: to_ping)
{
outbox_push(op);
}
});
}
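Both versions of this keepalive loop dodge the same hazard: stop_client() erases entries from the clients map, and erasing a std::map element invalidates iterators pointing at it. A minimal sketch of the two safe idioms, using standard C++ only (should_stop() is a hypothetical predicate):

std::map<int, osd_client_t*> clients;
// Old approach: defer the mutation until iteration is over
std::vector<int> to_stop;
for (auto & kv: clients)
    if (should_stop(kv.second))
        to_stop.push_back(kv.first);
for (int fd: to_stop)
    stop_client(fd, true);
// New approach: advance the iterator before the erase can happen
for (auto it = clients.begin(); it != clients.end();)
{
    auto cl = (it++)->second;               // 'it' already points past 'cl'
    if (should_stop(cl))
        stop_client(cl->peer_fd, true);     // may erase 'cl' from the map
}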
@@ -119,58 +97,32 @@ osd_messenger_t::~osd_messenger_t()
{
stop_client(clients.begin()->first, true);
}
#ifdef WITH_RDMA
if (rdma_context)
{
delete rdma_context;
}
#endif
}
void osd_messenger_t::parse_config(const json11::Json & config)
{
#ifdef WITH_RDMA
if (!config["use_rdma"].is_null())
{
// RDMA is on by default in RDMA-enabled builds
this->use_rdma = config["use_rdma"].bool_value() || config["use_rdma"].uint64_value() != 0;
}
this->rdma_device = config["rdma_device"].string_value();
this->rdma_port_num = (uint8_t)config["rdma_port_num"].uint64_value();
if (!this->rdma_port_num)
this->rdma_port_num = 1;
this->rdma_gid_index = (uint8_t)config["rdma_gid_index"].uint64_value();
this->rdma_mtu = (uint32_t)config["rdma_mtu"].uint64_value();
this->rdma_max_sge = config["rdma_max_sge"].uint64_value();
if (!this->rdma_max_sge)
this->rdma_max_sge = 128;
this->rdma_max_send = config["rdma_max_send"].uint64_value();
if (!this->rdma_max_send)
this->rdma_max_send = 32;
this->rdma_max_recv = config["rdma_max_recv"].uint64_value();
if (!this->rdma_max_recv)
this->rdma_max_recv = 8;
this->rdma_max_msg = config["rdma_max_msg"].uint64_value();
if (!this->rdma_max_msg || this->rdma_max_msg > 128*1024*1024)
this->rdma_max_msg = 1024*1024;
#endif
this->receive_buffer_size = (uint32_t)config["tcp_header_buffer_size"].uint64_value();
if (!this->receive_buffer_size || this->receive_buffer_size > 1024*1024*1024)
this->receive_buffer_size = 65536;
this->use_sync_send_recv = config["use_sync_send_recv"].bool_value() ||
config["use_sync_send_recv"].uint64_value();
this->peer_connect_interval = config["peer_connect_interval"].uint64_value();
if (!this->peer_connect_interval)
this->peer_connect_interval = 5;
{
this->peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
}
this->peer_connect_timeout = config["peer_connect_timeout"].uint64_value();
if (!this->peer_connect_timeout)
this->peer_connect_timeout = 5;
{
this->peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
}
this->osd_idle_timeout = config["osd_idle_timeout"].uint64_value();
if (!this->osd_idle_timeout)
this->osd_idle_timeout = 5;
{
this->osd_idle_timeout = DEFAULT_OSD_PING_TIMEOUT;
}
this->osd_ping_timeout = config["osd_ping_timeout"].uint64_value();
if (!this->osd_ping_timeout)
this->osd_ping_timeout = 5;
{
this->osd_ping_timeout = DEFAULT_OSD_PING_TIMEOUT;
}
this->log_level = config["log_level"].uint64_value();
}
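Every option above follows the same fallback convention: an absent or zero JSON value selects the compiled-in default. A hedged json11 sketch (the key names are the ones parsed above; 'msgr' is an assumed osd_messenger_t pointer):

json11::Json cfg = json11::Json::object {
    { "peer_connect_interval", 10 },  // overrides DEFAULT_PEER_CONNECT_INTERVAL
    { "osd_ping_timeout", 0 },        // 0 falls back to DEFAULT_OSD_PING_TIMEOUT
    { "log_level", 1 },
};
msgr->parse_config(cfg);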
@@ -189,14 +141,17 @@ void osd_messenger_t::connect_peer(uint64_t peer_osd, json11::Json peer_state)
wanted_peers[peer_osd].port = (int)peer_state["port"].int64_value();
}
wanted_peers[peer_osd].address_changed = true;
try_connect_peer(peer_osd);
if (!wanted_peers[peer_osd].connecting &&
(time(NULL) - wanted_peers[peer_osd].last_connect_attempt) >= peer_connect_interval)
{
try_connect_peer(peer_osd);
}
}
void osd_messenger_t::try_connect_peer(uint64_t peer_osd)
{
auto wp_it = wanted_peers.find(peer_osd);
if (wp_it == wanted_peers.end() || wp_it->second.connecting ||
(time(NULL) - wp_it->second.last_connect_attempt) < peer_connect_interval)
if (wp_it == wanted_peers.end())
{
return;
}
@@ -242,29 +197,31 @@ void osd_messenger_t::try_connect_peer_addr(osd_num_t peer_osd, const char *peer
on_connect_peer(peer_osd, -errno);
return;
}
clients[peer_fd] = new osd_client_t();
clients[peer_fd]->peer_addr = addr;
clients[peer_fd]->peer_port = peer_port;
clients[peer_fd]->peer_fd = peer_fd;
clients[peer_fd]->peer_state = PEER_CONNECTING;
clients[peer_fd]->connect_timeout_id = -1;
clients[peer_fd]->osd_num = peer_osd;
clients[peer_fd]->in_buf = malloc_or_die(receive_buffer_size);
int timeout_id = -1;
if (peer_connect_timeout > 0)
{
timeout_id = tfd->set_timer(1000*peer_connect_timeout, false, [this, peer_fd](int timer_id)
{
osd_num_t peer_osd = clients.at(peer_fd)->osd_num;
stop_client(peer_fd, true);
on_connect_peer(peer_osd, -EIO);
return;
});
}
clients[peer_fd] = new osd_client_t((osd_client_t){
.peer_addr = addr,
.peer_port = peer_port,
.peer_fd = peer_fd,
.peer_state = PEER_CONNECTING,
.connect_timeout_id = timeout_id,
.osd_num = peer_osd,
.in_buf = malloc_or_die(receive_buffer_size),
});
tfd->set_fd_handler(peer_fd, true, [this](int peer_fd, int epoll_events)
{
// Either OUT (connected) or HUP
handle_connect_epoll(peer_fd);
});
if (peer_connect_timeout > 0)
{
clients[peer_fd]->connect_timeout_id = tfd->set_timer(1000*peer_connect_timeout, false, [this, peer_fd](int timer_id)
{
osd_num_t peer_osd = clients.at(peer_fd)->osd_num;
stop_client(peer_fd, true);
on_connect_peer(peer_osd, -EPIPE);
return;
});
}
}
void osd_messenger_t::handle_connect_epoll(int peer_fd)
@@ -305,7 +262,7 @@ void osd_messenger_t::handle_peer_epoll(int peer_fd, int epoll_events)
if (epoll_events & EPOLLRDHUP)
{
// Stop client
fprintf(stderr, "[OSD %lu] client %d disconnected\n", this->osd_num, peer_fd);
printf("[OSD %lu] client %d disconnected\n", this->osd_num, peer_fd);
stop_client(peer_fd, true);
}
else if (epoll_events & EPOLLIN)
@@ -330,7 +287,7 @@ void osd_messenger_t::on_connect_peer(osd_num_t peer_osd, int peer_fd)
wp.connecting = false;
if (peer_fd < 0)
{
fprintf(stderr, "Failed to connect to peer OSD %lu address %s port %d: %s\n", peer_osd, wp.cur_addr.c_str(), wp.cur_port, strerror(-peer_fd));
printf("Failed to connect to peer OSD %lu address %s port %d: %s\n", peer_osd, wp.cur_addr.c_str(), wp.cur_port, strerror(-peer_fd));
if (wp.address_changed)
{
wp.address_changed = false;
@@ -357,7 +314,7 @@ void osd_messenger_t::on_connect_peer(osd_num_t peer_osd, int peer_fd)
}
if (log_level > 0)
{
fprintf(stderr, "[OSD %lu] Connected with peer OSD %lu (client %d)\n", osd_num, peer_osd, peer_fd);
printf("[OSD %lu] Connected with peer OSD %lu (client %d)\n", osd_num, peer_osd, peer_fd);
}
wanted_peers.erase(peer_osd);
repeer_pgs(peer_osd);
@@ -377,24 +334,6 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
},
},
};
#ifdef WITH_RDMA
if (rdma_context)
{
cl->rdma_conn = msgr_rdma_connection_t::create(rdma_context, rdma_max_send, rdma_max_recv, rdma_max_sge, rdma_max_msg);
if (cl->rdma_conn)
{
json11::Json payload = json11::Json::object {
{ "connect_rdma", cl->rdma_conn->addr.to_string() },
{ "rdma_max_msg", cl->rdma_conn->max_msg },
};
std::string payload_str = payload.dump();
op->req.show_conf.json_len = payload_str.size();
op->buf = malloc_or_die(payload_str.size());
op->iov.push_back(op->buf, payload_str.size());
memcpy(op->buf, payload_str.c_str(), payload_str.size());
}
}
#endif
op->callback = [this, cl](osd_op_t *op)
{
std::string json_err;
@@ -403,7 +342,7 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
if (op->reply.hdr.retval < 0)
{
err = true;
fprintf(stderr, "Failed to get config from OSD %lu (retval=%ld), disconnecting peer\n", cl->osd_num, op->reply.hdr.retval);
printf("Failed to get config from OSD %lu (retval=%ld), disconnecting peer\n", cl->osd_num, op->reply.hdr.retval);
}
else
{
@@ -411,69 +350,22 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
if (json_err != "")
{
err = true;
fprintf(stderr, "Failed to get config from OSD %lu: bad JSON: %s, disconnecting peer\n", cl->osd_num, json_err.c_str());
printf("Failed to get config from OSD %lu: bad JSON: %s, disconnecting peer\n", cl->osd_num, json_err.c_str());
}
else if (config["osd_num"].uint64_value() != cl->osd_num)
{
err = true;
fprintf(stderr, "Connected to OSD %lu instead of OSD %lu, peer state is outdated, disconnecting peer\n", config["osd_num"].uint64_value(), cl->osd_num);
}
else if (config["protocol_version"].uint64_value() != OSD_PROTOCOL_VERSION)
{
err = true;
fprintf(
stderr, "OSD %lu protocol version is %lu, but only version %u is supported.\n"
" If you need to upgrade from 0.5.x please request it via the issue tracker.\n",
cl->osd_num, config["protocol_version"].uint64_value(), OSD_PROTOCOL_VERSION
);
printf("Connected to OSD %lu instead of OSD %lu, peer state is outdated, disconnecting peer\n", config["osd_num"].uint64_value(), cl->osd_num);
}
}
if (err)
{
osd_num_t peer_osd = cl->osd_num;
osd_num_t osd_num = cl->osd_num;
stop_client(op->peer_fd);
on_connect_peer(peer_osd, -1);
on_connect_peer(osd_num, -1);
delete op;
return;
}
#ifdef WITH_RDMA
if (config["rdma_address"].is_string())
{
msgr_rdma_address_t addr;
if (!msgr_rdma_address_t::from_string(config["rdma_address"].string_value().c_str(), &addr) ||
cl->rdma_conn->connect(&addr) != 0)
{
fprintf(
stderr, "Failed to connect to OSD %lu (address %s) using RDMA\n",
cl->osd_num, config["rdma_address"].string_value().c_str()
);
delete cl->rdma_conn;
cl->rdma_conn = NULL;
// FIXME: Keep TCP connection in this case
osd_num_t peer_osd = cl->osd_num;
stop_client(cl->peer_fd);
on_connect_peer(peer_osd, -1);
delete op;
return;
}
else
{
uint64_t server_max_msg = config["rdma_max_msg"].uint64_value();
if (cl->rdma_conn->max_msg > server_max_msg)
{
cl->rdma_conn->max_msg = server_max_msg;
}
if (log_level > 0)
{
fprintf(stderr, "Connected to OSD %lu using RDMA\n", cl->osd_num);
}
cl->peer_state = PEER_RDMA;
tfd->set_fd_handler(cl->peer_fd, false, NULL);
// Add the initial receive request
try_recv_rdma(cl);
}
}
#endif
osd_peer_fds[cl->osd_num] = cl->peer_fd;
on_connect_peer(cl->osd_num, cl->peer_fd);
delete op;
@@ -481,6 +373,123 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
outbox_push(op);
}
void osd_messenger_t::cancel_osd_ops(osd_client_t *cl)
{
for (auto p: cl->sent_ops)
{
cancel_op(p.second);
}
cl->sent_ops.clear();
cl->outbox.clear();
}
void osd_messenger_t::cancel_op(osd_op_t *op)
{
if (op->op_type == OSD_OP_OUT)
{
op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
op->reply.hdr.id = op->req.hdr.id;
op->reply.hdr.opcode = op->req.hdr.opcode;
op->reply.hdr.retval = -EPIPE;
// Copy lambda to be unaffected by `delete op`
std::function<void(osd_op_t*)>(op->callback)(op);
}
else
{
// This function is only called in stop_client(), so it's fine to destroy the operation
delete op;
}
}
void osd_messenger_t::stop_client(int peer_fd, bool force)
{
assert(peer_fd != 0);
auto it = clients.find(peer_fd);
if (it == clients.end())
{
return;
}
uint64_t repeer_osd = 0;
osd_client_t *cl = it->second;
if (cl->peer_state == PEER_CONNECTED)
{
if (cl->osd_num)
{
// Reload configuration from etcd when the connection is dropped
if (log_level > 0)
printf("[OSD %lu] Stopping client %d (OSD peer %lu)\n", osd_num, peer_fd, cl->osd_num);
repeer_osd = cl->osd_num;
}
else
{
if (log_level > 0)
printf("[OSD %lu] Stopping client %d (regular client)\n", osd_num, peer_fd);
}
}
else if (!force)
{
return;
}
cl->peer_state = PEER_STOPPED;
clients.erase(it);
tfd->set_fd_handler(peer_fd, false, NULL);
if (cl->connect_timeout_id >= 0)
{
tfd->clear_timer(cl->connect_timeout_id);
cl->connect_timeout_id = -1;
}
if (cl->osd_num)
{
osd_peer_fds.erase(cl->osd_num);
}
if (cl->read_op)
{
if (cl->read_op->callback)
{
cancel_op(cl->read_op);
}
else
{
delete cl->read_op;
}
cl->read_op = NULL;
}
for (auto rit = read_ready_clients.begin(); rit != read_ready_clients.end(); rit++)
{
if (*rit == peer_fd)
{
read_ready_clients.erase(rit);
break;
}
}
for (auto wit = write_ready_clients.begin(); wit != write_ready_clients.end(); wit++)
{
if (*wit == peer_fd)
{
write_ready_clients.erase(wit);
break;
}
}
free(cl->in_buf);
cl->in_buf = NULL;
close(peer_fd);
if (repeer_osd)
{
// First repeer PGs as canceling OSD ops may push new operations
// and we need correct PG states when we do that
repeer_pgs(repeer_osd);
}
if (cl->osd_num)
{
// Cancel outbound operations
cancel_osd_ops(cl);
}
if (cl->refs <= 0)
{
delete cl;
}
}
void osd_messenger_t::accept_connections(int listen_fd)
{
// Accept new connections
@@ -491,17 +500,18 @@ void osd_messenger_t::accept_connections(int listen_fd)
{
assert(peer_fd != 0);
char peer_str[256];
fprintf(stderr, "[OSD %lu] new client %d: connection from %s port %d\n", this->osd_num, peer_fd,
printf("[OSD %lu] new client %d: connection from %s port %d\n", this->osd_num, peer_fd,
inet_ntop(AF_INET, &addr.sin_addr, peer_str, 256), ntohs(addr.sin_port));
fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK);
int one = 1;
setsockopt(peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
clients[peer_fd] = new osd_client_t();
clients[peer_fd]->peer_addr = addr;
clients[peer_fd]->peer_port = ntohs(addr.sin_port);
clients[peer_fd]->peer_fd = peer_fd;
clients[peer_fd]->peer_state = PEER_CONNECTED;
clients[peer_fd]->in_buf = malloc_or_die(receive_buffer_size);
clients[peer_fd] = new osd_client_t((osd_client_t){
.peer_addr = addr,
.peer_port = ntohs(addr.sin_port),
.peer_fd = peer_fd,
.peer_state = PEER_CONNECTED,
.in_buf = malloc_or_die(receive_buffer_size),
});
// Add FD to epoll
tfd->set_fd_handler(peer_fd, false, [this](int peer_fd, int epoll_events)
{
@@ -515,59 +525,3 @@ void osd_messenger_t::accept_connections(int listen_fd)
throw std::runtime_error(std::string("accept: ") + strerror(errno));
}
}
#ifdef WITH_RDMA
bool osd_messenger_t::is_rdma_enabled()
{
return rdma_context != NULL;
}
#endif
json11::Json osd_messenger_t::read_config(const json11::Json & config)
{
const char *config_path = config["config_path"].string_value() != ""
? config["config_path"].string_value().c_str() : VITASTOR_CONFIG_PATH;
int fd = open(config_path, O_RDONLY);
if (fd < 0)
{
if (errno != ENOENT)
fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno));
return config;
}
struct stat st;
if (fstat(fd, &st) != 0)
{
fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno));
close(fd);
return config;
}
std::string buf;
buf.resize(st.st_size);
int done = 0;
while (done < st.st_size)
{
int r = read(fd, (void*)buf.data()+done, st.st_size-done);
if (r < 0)
{
fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno));
close(fd);
return config;
}
done += r;
}
close(fd);
std::string json_err;
json11::Json::object file_config = json11::Json::parse(buf, json_err).object_items();
if (json_err != "")
{
fprintf(stderr, "Invalid JSON in %s: %s\n", config_path, json_err.c_str());
return config;
}
file_config.erase("config_path");
file_config.erase("osd_num");
for (auto kv: config.object_items())
{
file_config[kv.first] = kv.second;
}
return file_config;
}
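A worked example of the merge semantics above, with hypothetical file contents:

// /etc/vitastor/vitastor.conf contains:
//   { "etcd_address": "10.0.0.1:2379", "log_level": 1, "osd_num": 5 }
// config passed to read_config() contains:
//   { "log_level": 3 }
// read_config() returns:
//   { "etcd_address": "10.0.0.1:2379", "log_level": 3 }
// Runtime keys always win over the file, and "config_path"/"osd_num"
// are stripped from the file before merging.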

View File

@@ -14,35 +14,185 @@
#include "malloc_or_die.h"
#include "json11/json11.hpp"
#include "msgr_op.h"
#include "osd_ops.h"
#include "timerfd_manager.h"
#include <ringloop.h>
#include "ringloop.h"
#ifdef WITH_RDMA
#include "msgr_rdma.h"
#endif
#define OSD_OP_IN 0
#define OSD_OP_OUT 1
#define CL_READ_HDR 1
#define CL_READ_DATA 2
#define CL_READ_REPLY_DATA 3
#define CL_WRITE_READY 1
#define CL_WRITE_REPLY 2
#define OSD_OP_INLINE_BUF_COUNT 16
#define PEER_CONNECTING 1
#define PEER_CONNECTED 2
#define PEER_RDMA_CONNECTING 3
#define PEER_RDMA 4
#define PEER_STOPPED 5
#define PEER_STOPPED 3
#define DEFAULT_PEER_CONNECT_INTERVAL 5
#define DEFAULT_PEER_CONNECT_TIMEOUT 5
#define DEFAULT_OSD_PING_TIMEOUT 5
#define DEFAULT_BITMAP_GRANULARITY 4096
#define VITASTOR_CONFIG_PATH "/etc/vitastor/vitastor.conf"
#define MSGR_SENDP_HDR 1
#define MSGR_SENDP_FREE 2
struct msgr_sendp_t
// A vector-like iovec list with a small-buffer optimisation (16 inline entries)
struct osd_op_buf_list_t
{
osd_op_t *op;
int flags;
int count = 0, alloc = OSD_OP_INLINE_BUF_COUNT, done = 0;
iovec *buf = NULL;
iovec inline_buf[OSD_OP_INLINE_BUF_COUNT];
inline osd_op_buf_list_t()
{
buf = inline_buf;
}
inline osd_op_buf_list_t(const osd_op_buf_list_t & other)
{
buf = inline_buf;
append(other);
}
inline osd_op_buf_list_t & operator = (const osd_op_buf_list_t & other)
{
reset();
append(other);
return *this;
}
inline ~osd_op_buf_list_t()
{
if (buf && buf != inline_buf)
{
free(buf);
}
}
inline void reset()
{
count = 0;
done = 0;
}
inline iovec* get_iovec()
{
return buf + done;
}
inline int get_size()
{
return count - done;
}
inline void append(const osd_op_buf_list_t & other)
{
if (count+other.count > alloc)
{
if (buf == inline_buf)
{
int old = alloc;
alloc = (((count+other.count+15)/16)*16);
buf = (iovec*)malloc(sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
memcpy(buf, inline_buf, sizeof(iovec) * old);
}
else
{
alloc = (((count+other.count+15)/16)*16);
buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
}
}
for (int i = 0; i < other.count; i++)
{
buf[count++] = other.buf[i];
}
}
inline void push_back(void *nbuf, size_t len)
{
if (count >= alloc)
{
if (buf == inline_buf)
{
int old = alloc;
alloc = ((alloc/16)*16 + 1);
buf = (iovec*)malloc(sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
memcpy(buf, inline_buf, sizeof(iovec)*old);
}
else
{
alloc = alloc < 16 ? 16 : (alloc+16);
buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
}
}
buf[count++] = { .iov_base = nbuf, .iov_len = len };
}
inline void eat(int result)
{
while (result > 0 && done < count)
{
iovec & iov = buf[done];
if (iov.iov_len <= result)
{
result -= iov.iov_len;
done++;
}
else
{
iov.iov_len -= result;
iov.iov_base += result;
break;
}
}
}
};
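A minimal usage sketch for the list above (a connected socket 'fd' and the buffers are assumed): the first 16 entries live in inline_buf, push_back() spills to the heap past that, and eat() consumes however many bytes a partial send managed to write:

osd_op_buf_list_t list;
list.push_back(hdr, OSD_PACKET_SIZE);   // stays inline (<= 16 entries)
list.push_back(payload, payload_len);
while (list.get_size() > 0)
{
    ssize_t r = writev(fd, list.get_iovec(), list.get_size());
    if (r <= 0)
        break;          // real code would handle EAGAIN etc.
    list.eat(r);        // skips fully sent iovecs, trims a partial one
}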
struct blockstore_op_t;
struct osd_primary_op_data_t;
struct osd_op_t
{
timespec tv_begin = { 0 }, tv_end = { 0 };
uint64_t op_type = OSD_OP_IN;
int peer_fd;
osd_any_op_t req;
osd_any_reply_t reply;
blockstore_op_t *bs_op = NULL;
void *buf = NULL;
// bitmap, bitmap_len, bmp_data are only meaningful for reads
void *bitmap = NULL;
unsigned bitmap_len = 0;
unsigned bmp_data = 0;
void *rmw_buf = NULL;
osd_primary_op_data_t* op_data = NULL;
std::function<void(osd_op_t*)> callback;
osd_op_buf_list_t iov;
~osd_op_t();
};
struct osd_client_t
@@ -60,10 +210,6 @@ struct osd_client_t
void *in_buf = NULL;
#ifdef WITH_RDMA
msgr_rdma_connection_t *rdma_conn = NULL;
#endif
// Read state
int read_ready = 0;
osd_op_t *read_op = NULL;
@@ -86,13 +232,7 @@ struct osd_client_t
msghdr write_msg = { 0 };
int write_state = 0;
std::vector<iovec> send_list, next_send_list;
std::vector<msgr_sendp_t> outbox, next_outbox;
~osd_client_t()
{
free(in_buf);
in_buf = NULL;
}
std::vector<osd_op_t*> outbox, next_outbox;
};
struct osd_wanted_peer_t
@@ -117,42 +257,34 @@ struct osd_op_stats_t
struct osd_messenger_t
{
protected:
timerfd_manager_t *tfd;
ring_loop_t *ringloop;
int keepalive_timer_id = -1;
uint32_t receive_buffer_size = 0;
int peer_connect_interval = 0;
int peer_connect_timeout = 0;
int osd_idle_timeout = 0;
int osd_ping_timeout = 0;
// osd_num_t is only for logging and asserts
osd_num_t osd_num;
// FIXME: make receive_buffer_size configurable
int receive_buffer_size = 64*1024;
int peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
int peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
int osd_idle_timeout = DEFAULT_OSD_PING_TIMEOUT;
int osd_ping_timeout = DEFAULT_OSD_PING_TIMEOUT;
int log_level = 0;
bool use_sync_send_recv = false;
#ifdef WITH_RDMA
bool use_rdma = true;
std::string rdma_device;
uint64_t rdma_port_num = 1, rdma_gid_index = 0, rdma_mtu = 0;
msgr_rdma_context_t *rdma_context = NULL;
uint64_t rdma_max_sge = 0, rdma_max_send = 0, rdma_max_recv = 8;
uint64_t rdma_max_msg = 0;
#endif
std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
std::map<uint64_t, int> osd_peer_fds;
uint64_t next_subop_id = 1;
std::map<int, osd_client_t*> clients;
std::vector<int> read_ready_clients;
std::vector<int> write_ready_clients;
std::vector<std::function<void()>> set_immediate;
public:
timerfd_manager_t *tfd;
ring_loop_t *ringloop;
// osd_num_t is only for logging and asserts
osd_num_t osd_num;
uint64_t next_subop_id = 1;
std::map<int, osd_client_t*> clients;
std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
std::map<uint64_t, int> osd_peer_fds;
// op statistics
osd_op_stats_t stats;
public:
void init();
void parse_config(const json11::Json & config);
void connect_peer(uint64_t osd_num, json11::Json peer_state);
@@ -160,22 +292,15 @@ public:
void outbox_push(osd_op_t *cur_op);
std::function<void(osd_op_t*)> exec_op;
std::function<void(osd_num_t)> repeer_pgs;
void handle_peer_epoll(int peer_fd, int epoll_events);
void read_requests();
void send_replies();
void accept_connections(int listen_fd);
~osd_messenger_t();
static json11::Json read_config(const json11::Json & config);
#ifdef WITH_RDMA
bool is_rdma_enabled();
bool connect_rdma(int peer_fd, std::string rdma_address, uint64_t client_max_msg);
#endif
protected:
void try_connect_peer(uint64_t osd_num);
void try_connect_peer_addr(osd_num_t peer_osd, const char *peer_host, int peer_port);
void handle_peer_epoll(int peer_fd, int epoll_events);
void handle_connect_epoll(int peer_fd);
void on_connect_peer(osd_num_t peer_osd, int peer_fd);
void check_peer_config(osd_client_t *cl);
@@ -187,15 +312,8 @@ protected:
void handle_send(int result, osd_client_t *cl);
bool handle_read(int result, osd_client_t *cl);
bool handle_read_buffer(osd_client_t *cl, void *curbuf, int remain);
bool handle_finished_read(osd_client_t *cl);
void handle_op_hdr(osd_client_t *cl);
bool handle_reply_hdr(osd_client_t *cl);
void handle_reply_ready(osd_op_t *op);
#ifdef WITH_RDMA
bool try_send_rdma(osd_client_t *cl);
bool try_recv_rdma(osd_client_t *cl);
void handle_rdma_events();
#endif
};

View File

@@ -1 +0,0 @@
g++ -D__MOCK__ -fsanitize=address -g -Wno-pointer-arith pg_states.cpp osd_ops.cpp test_cluster_client.cpp cluster_client.cpp msgr_op.cpp msgr_stop.cpp mock/messenger.cpp etcd_state_client.cpp timerfd_manager.cpp ../json11/json11.cpp -I mock -I . -I ..; ./a.out

View File

@@ -1,49 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <unistd.h>
#include <stdexcept>
#include <assert.h>
#include "messenger.h"
void osd_messenger_t::init()
{
}
osd_messenger_t::~osd_messenger_t()
{
while (clients.size() > 0)
{
stop_client(clients.begin()->first, true);
}
}
void osd_messenger_t::outbox_push(osd_op_t *cur_op)
{
clients[cur_op->peer_fd]->sent_ops[cur_op->req.hdr.id] = cur_op;
}
void osd_messenger_t::parse_config(const json11::Json & config)
{
}
void osd_messenger_t::connect_peer(uint64_t peer_osd, json11::Json peer_state)
{
wanted_peers[peer_osd] = (osd_wanted_peer_t){
.port = 1,
};
}
void osd_messenger_t::read_requests()
{
}
void osd_messenger_t::send_replies()
{
}
json11::Json osd_messenger_t::read_config(const json11::Json & config)
{
return config;
}

View File

@@ -1,25 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include <functional>
struct ring_consumer_t
{
std::function<void(void)> loop;
};
class ring_loop_t
{
public:
void register_consumer(ring_consumer_t *consumer)
{
}
void unregister_consumer(ring_consumer_t *consumer)
{
}
void submit()
{
}
};

View File

@@ -1,22 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <assert.h>
#include "msgr_op.h"
osd_op_t::~osd_op_t()
{
assert(!bs_op);
assert(!op_data);
if (rmw_buf)
{
free(rmw_buf);
}
if (buf)
{
// Note: reusing osd_op_t WILL currently lead to memory leaks
// So we don't reuse it, but free it every time
free(buf);
}
}

View File

@@ -1,175 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include <sys/uio.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "osd_ops.h"
#define OSD_OP_IN 0
#define OSD_OP_OUT 1
#define OSD_OP_INLINE_BUF_COUNT 16
// A vector-like iovec list with a small-buffer optimisation (16 inline entries)
struct osd_op_buf_list_t
{
int count = 0, alloc = OSD_OP_INLINE_BUF_COUNT, done = 0;
iovec *buf = NULL;
iovec inline_buf[OSD_OP_INLINE_BUF_COUNT];
inline osd_op_buf_list_t()
{
buf = inline_buf;
}
inline osd_op_buf_list_t(const osd_op_buf_list_t & other)
{
buf = inline_buf;
append(other);
}
inline osd_op_buf_list_t & operator = (const osd_op_buf_list_t & other)
{
reset();
append(other);
return *this;
}
inline ~osd_op_buf_list_t()
{
if (buf && buf != inline_buf)
{
free(buf);
}
}
inline void reset()
{
count = 0;
done = 0;
}
inline iovec* get_iovec()
{
return buf + done;
}
inline int get_size()
{
return count - done;
}
inline void append(const osd_op_buf_list_t & other)
{
if (count+other.count > alloc)
{
if (buf == inline_buf)
{
int old = alloc;
alloc = (((count+other.count+15)/16)*16);
buf = (iovec*)malloc(sizeof(iovec) * alloc);
if (!buf)
{
fprintf(stderr, "Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
memcpy(buf, inline_buf, sizeof(iovec) * old);
}
else
{
alloc = (((count+other.count+15)/16)*16);
buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
if (!buf)
{
fprintf(stderr, "Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
}
}
for (int i = 0; i < other.count; i++)
{
buf[count++] = other.buf[i];
}
}
inline void push_back(void *nbuf, size_t len)
{
if (count >= alloc)
{
if (buf == inline_buf)
{
int old = alloc;
alloc = ((alloc/16)*16 + 1);
buf = (iovec*)malloc(sizeof(iovec) * alloc);
if (!buf)
{
fprintf(stderr, "Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
memcpy(buf, inline_buf, sizeof(iovec)*old);
}
else
{
alloc = alloc < 16 ? 16 : (alloc+16);
buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
if (!buf)
{
fprintf(stderr, "Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
}
}
buf[count++] = { .iov_base = nbuf, .iov_len = len };
}
inline void eat(int result)
{
while (result > 0 && done < count)
{
iovec & iov = buf[done];
if (iov.iov_len <= result)
{
result -= iov.iov_len;
done++;
}
else
{
iov.iov_len -= result;
iov.iov_base += result;
break;
}
}
}
};
struct blockstore_op_t;
struct osd_primary_op_data_t;
struct osd_op_t
{
timespec tv_begin = { 0 }, tv_end = { 0 };
uint64_t op_type = OSD_OP_IN;
int peer_fd;
osd_any_op_t req;
osd_any_reply_t reply;
blockstore_op_t *bs_op = NULL;
void *buf = NULL;
// bitmap, bitmap_len, bmp_data are only meaningful for reads
void *bitmap = NULL;
unsigned bitmap_len = 0;
unsigned bmp_data = 0;
void *rmw_buf = NULL;
osd_primary_op_data_t* op_data = NULL;
std::function<void(osd_op_t*)> callback;
osd_op_buf_list_t iov;
~osd_op_t();
};

View File

@@ -1,521 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <stdio.h>
#include <stdlib.h>
#include "msgr_rdma.h"
#include "messenger.h"
std::string msgr_rdma_address_t::to_string()
{
char msg[sizeof "0000:00000000:00000000:00000000000000000000000000000000"];
sprintf(
msg, "%04x:%06x:%06x:%016lx%016lx", lid, qpn, psn,
htobe64(((uint64_t*)&gid)[0]), htobe64(((uint64_t*)&gid)[1])
);
return std::string(msg);
}
bool msgr_rdma_address_t::from_string(const char *str, msgr_rdma_address_t *dest)
{
uint64_t* gid = (uint64_t*)&dest->gid;
int n = sscanf(
str, "%hx:%x:%x:%16lx%16lx", &dest->lid, &dest->qpn, &dest->psn, gid, gid+1
);
gid[0] = be64toh(gid[0]);
gid[1] = be64toh(gid[1]);
return n == 5;
}
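A worked example of the fixed-width address format, with hypothetical values:

// lid = 0x12, qpn = 0xabc, psn = 0xdef0, gid = fe80::1
// to_string() emits 4+6+6+32 hex digits:
//   "0012:000abc:00def0:fe800000000000000000000000000001"
// from_string() parses the same layout back and restores the two
// 64-bit GID halves from big-endian with be64toh().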
msgr_rdma_context_t::~msgr_rdma_context_t()
{
if (cq)
ibv_destroy_cq(cq);
if (channel)
ibv_destroy_comp_channel(channel);
if (mr)
ibv_dereg_mr(mr);
if (pd)
ibv_dealloc_pd(pd);
if (context)
ibv_close_device(context);
}
msgr_rdma_connection_t::~msgr_rdma_connection_t()
{
ctx->used_max_cqe -= max_send+max_recv;
if (qp)
ibv_destroy_qp(qp);
}
msgr_rdma_context_t *msgr_rdma_context_t::create(const char *ib_devname, uint8_t ib_port, uint8_t gid_index, uint32_t mtu)
{
int res;
ibv_device **dev_list = NULL;
msgr_rdma_context_t *ctx = new msgr_rdma_context_t();
ctx->mtu = mtu;
dev_list = ibv_get_device_list(NULL);
if (!dev_list)
{
fprintf(stderr, "Failed to get RDMA device list: %s\n", strerror(errno));
goto cleanup;
}
if (!ib_devname)
{
ctx->dev = *dev_list;
if (!ctx->dev)
{
fprintf(stderr, "No RDMA devices found\n");
goto cleanup;
}
}
else
{
int i;
for (i = 0; dev_list[i]; ++i)
if (!strcmp(ibv_get_device_name(dev_list[i]), ib_devname))
break;
ctx->dev = dev_list[i];
if (!ctx->dev)
{
fprintf(stderr, "RDMA device %s not found\n", ib_devname);
goto cleanup;
}
}
ctx->context = ibv_open_device(ctx->dev);
if (!ctx->context)
{
fprintf(stderr, "Couldn't get RDMA context for %s\n", ibv_get_device_name(ctx->dev));
goto cleanup;
}
ctx->ib_port = ib_port;
ctx->gid_index = gid_index;
if ((res = ibv_query_port(ctx->context, ib_port, &ctx->portinfo)) != 0)
{
fprintf(stderr, "Couldn't get RDMA device %s port %d info: %s\n", ibv_get_device_name(ctx->dev), ib_port, strerror(res));
goto cleanup;
}
ctx->my_lid = ctx->portinfo.lid;
if (ctx->portinfo.link_layer != IBV_LINK_LAYER_ETHERNET && !ctx->my_lid)
{
fprintf(stderr, "RDMA device %s must have local LID because it's not Ethernet, but LID is zero\n", ibv_get_device_name(ctx->dev));
goto cleanup;
}
if (ibv_query_gid(ctx->context, ib_port, gid_index, &ctx->my_gid))
{
fprintf(stderr, "Couldn't read RDMA device %s GID index %d\n", ibv_get_device_name(ctx->dev), gid_index);
goto cleanup;
}
ctx->pd = ibv_alloc_pd(ctx->context);
if (!ctx->pd)
{
fprintf(stderr, "Couldn't allocate RDMA protection domain\n");
goto cleanup;
}
{
if (ibv_query_device_ex(ctx->context, NULL, &ctx->attrx))
{
fprintf(stderr, "Couldn't query RDMA device for its features\n");
goto cleanup;
}
if (!(ctx->attrx.odp_caps.general_caps & IBV_ODP_SUPPORT) ||
!(ctx->attrx.odp_caps.general_caps & IBV_ODP_SUPPORT_IMPLICIT) ||
!(ctx->attrx.odp_caps.per_transport_caps.rc_odp_caps & IBV_ODP_SUPPORT_SEND) ||
!(ctx->attrx.odp_caps.per_transport_caps.rc_odp_caps & IBV_ODP_SUPPORT_RECV))
{
fprintf(stderr, "The RDMA device isn't implicit ODP (On-Demand Paging) capable or does not support RC send and receive with ODP\n");
goto cleanup;
}
}
ctx->mr = ibv_reg_mr(ctx->pd, NULL, SIZE_MAX, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_ON_DEMAND);
if (!ctx->mr)
{
fprintf(stderr, "Couldn't register RDMA memory region\n");
goto cleanup;
}
ctx->channel = ibv_create_comp_channel(ctx->context);
if (!ctx->channel)
{
fprintf(stderr, "Couldn't create RDMA completion channel\n");
goto cleanup;
}
ctx->max_cqe = 4096;
ctx->cq = ibv_create_cq(ctx->context, ctx->max_cqe, NULL, ctx->channel, 0);
if (!ctx->cq)
{
fprintf(stderr, "Couldn't create RDMA completion queue\n");
goto cleanup;
}
if (dev_list)
ibv_free_device_list(dev_list);
return ctx;
cleanup:
delete ctx;
if (dev_list)
ibv_free_device_list(dev_list);
return NULL;
}
msgr_rdma_connection_t *msgr_rdma_connection_t::create(msgr_rdma_context_t *ctx, uint32_t max_send,
uint32_t max_recv, uint32_t max_sge, uint32_t max_msg)
{
msgr_rdma_connection_t *conn = new msgr_rdma_connection_t;
max_sge = max_sge > ctx->attrx.orig_attr.max_sge ? ctx->attrx.orig_attr.max_sge : max_sge;
conn->ctx = ctx;
conn->max_send = max_send;
conn->max_recv = max_recv;
conn->max_sge = max_sge;
conn->max_msg = max_msg;
ctx->used_max_cqe += max_send+max_recv;
if (ctx->used_max_cqe > ctx->max_cqe)
{
// Resize CQ
// Mellanox ConnectX-4 supports up to 4194303 CQEs, so it's fine to put everything into a single CQ
int new_max_cqe = ctx->max_cqe;
while (ctx->used_max_cqe > new_max_cqe)
{
new_max_cqe *= 2;
}
if (ibv_resize_cq(ctx->cq, new_max_cqe) != 0)
{
fprintf(stderr, "Couldn't resize RDMA completion queue to %d entries\n", new_max_cqe);
delete conn;
return NULL;
}
ctx->max_cqe = new_max_cqe;
}
ibv_qp_init_attr init_attr = {
.send_cq = ctx->cq,
.recv_cq = ctx->cq,
.cap = {
.max_send_wr = max_send,
.max_recv_wr = max_recv,
.max_send_sge = max_sge,
.max_recv_sge = max_sge,
},
.qp_type = IBV_QPT_RC,
};
conn->qp = ibv_create_qp(ctx->pd, &init_attr);
if (!conn->qp)
{
fprintf(stderr, "Couldn't create RDMA queue pair\n");
delete conn;
return NULL;
}
conn->addr.lid = ctx->my_lid;
conn->addr.gid = ctx->my_gid;
conn->addr.qpn = conn->qp->qp_num;
conn->addr.psn = lrand48() & 0xffffff;
ibv_qp_attr attr = {
.qp_state = IBV_QPS_INIT,
.qp_access_flags = 0,
.pkey_index = 0,
.port_num = ctx->ib_port,
};
if (ibv_modify_qp(conn->qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS))
{
fprintf(stderr, "Failed to switch RDMA queue pair to INIT state\n");
delete conn;
return NULL;
}
return conn;
}
static ibv_mtu mtu_to_ibv_mtu(uint32_t mtu)
{
switch (mtu)
{
case 256: return IBV_MTU_256;
case 512: return IBV_MTU_512;
case 1024: return IBV_MTU_1024;
case 2048: return IBV_MTU_2048;
case 4096: return IBV_MTU_4096;
}
return IBV_MTU_4096;
}
int msgr_rdma_connection_t::connect(msgr_rdma_address_t *dest)
{
auto conn = this;
ibv_qp_attr attr = {
.qp_state = IBV_QPS_RTR,
.path_mtu = mtu_to_ibv_mtu(conn->ctx->mtu),
.rq_psn = dest->psn,
.sq_psn = conn->addr.psn,
.dest_qp_num = dest->qpn,
.ah_attr = {
.grh = {
.dgid = dest->gid,
.sgid_index = conn->ctx->gid_index,
.hop_limit = 1, // FIXME can it vary?
},
.dlid = dest->lid,
.sl = 0, // service level
.src_path_bits = 0,
.is_global = (uint8_t)(dest->gid.global.interface_id ? 1 : 0),
.port_num = conn->ctx->ib_port,
},
.max_rd_atomic = 1,
.max_dest_rd_atomic = 1,
// The actual timeout and min_rnr_timer delays seem to be 4.096us*2^(value+1)
.min_rnr_timer = 1,
.timeout = 14,
.retry_cnt = 7,
.rnr_retry = 7,
};
// FIXME No idea if ibv_modify_qp is a blocking operation or not. No idea if it has a timeout and what it is.
if (ibv_modify_qp(conn->qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
IBV_QP_DEST_QPN | IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
{
fprintf(stderr, "Failed to switch RDMA queue pair to RTR (ready-to-receive) state\n");
return 1;
}
attr.qp_state = IBV_QPS_RTS;
if (ibv_modify_qp(conn->qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT |
IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC))
{
fprintf(stderr, "Failed to switch RDMA queue pair to RTS (ready-to-send) state\n");
return 1;
}
return 0;
}
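Taking the comment above at face value, the values chosen here work out to roughly:

// timeout = 14      => 4.096us * 2^(14+1) ~= 134 ms per ACK wait (retry_cnt = 7)
// min_rnr_timer = 1 => 4.096us * 2^(1+1)  ~= 16.4 us between RNR retries (rnr_retry = 7)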
bool osd_messenger_t::connect_rdma(int peer_fd, std::string rdma_address, uint64_t client_max_msg)
{
// Try to connect to the peer using RDMA
msgr_rdma_address_t addr;
if (msgr_rdma_address_t::from_string(rdma_address.c_str(), &addr))
{
if (client_max_msg > rdma_max_msg)
{
client_max_msg = rdma_max_msg;
}
auto rdma_conn = msgr_rdma_connection_t::create(rdma_context, rdma_max_send, rdma_max_recv, rdma_max_sge, client_max_msg);
if (rdma_conn)
{
int r = rdma_conn->connect(&addr);
if (r != 0)
{
delete rdma_conn;
fprintf(
stderr, "Failed to connect RDMA queue pair to %s (client %d)\n",
addr.to_string().c_str(), peer_fd
);
}
else
{
// Remember connection, but switch to RDMA only after sending the configuration response
auto cl = clients.at(peer_fd);
cl->rdma_conn = rdma_conn;
cl->peer_state = PEER_RDMA_CONNECTING;
return true;
}
}
}
return false;
}
static void try_send_rdma_wr(osd_client_t *cl, ibv_sge *sge, int op_sge)
{
ibv_send_wr *bad_wr = NULL;
ibv_send_wr wr = {
.wr_id = (uint64_t)(cl->peer_fd*2+1),
.sg_list = sge,
.num_sge = op_sge,
.opcode = IBV_WR_SEND,
.send_flags = IBV_SEND_SIGNALED,
};
int err = ibv_post_send(cl->rdma_conn->qp, &wr, &bad_wr);
if (err || bad_wr)
{
fprintf(stderr, "RDMA send failed: %s\n", strerror(err));
exit(1);
}
cl->rdma_conn->cur_send++;
}
bool osd_messenger_t::try_send_rdma(osd_client_t *cl)
{
auto rc = cl->rdma_conn;
if (!cl->send_list.size() || rc->cur_send > 0)
{
// Only send one batch at a time
return true;
}
uint64_t op_size = 0, op_sge = 0;
ibv_sge sge[rc->max_sge];
while (rc->send_pos < cl->send_list.size())
{
iovec & iov = cl->send_list[rc->send_pos];
if (op_size >= rc->max_msg || op_sge >= rc->max_sge)
{
try_send_rdma_wr(cl, sge, op_sge);
op_sge = 0;
op_size = 0;
if (rc->cur_send >= rc->max_send)
{
break;
}
}
uint32_t len = (uint32_t)(op_size+iov.iov_len-rc->send_buf_pos < rc->max_msg
? iov.iov_len-rc->send_buf_pos : rc->max_msg-op_size);
sge[op_sge++] = {
.addr = (uintptr_t)(iov.iov_base+rc->send_buf_pos),
.length = len,
.lkey = rc->ctx->mr->lkey,
};
op_size += len;
rc->send_buf_pos += len;
if (rc->send_buf_pos >= iov.iov_len)
{
rc->send_pos++;
rc->send_buf_pos = 0;
}
}
if (op_sge > 0)
{
try_send_rdma_wr(cl, sge, op_sge);
}
return true;
}
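The loop above packs send_list into as few work requests as max_sge and max_msg allow, splitting an oversized iovec across requests via send_buf_pos. A worked trace with hypothetical limits:

// max_sge = 2, max_msg = 8192, send_list = [4096, 4096, 4096] bytes:
//   WR #1: {iov0 (4096), iov1 (4096)}  - flushed when op_sge reaches max_sge
//   WR #2: {iov2 (4096)}               - flushed by the trailing op_sge > 0 check
// A single 10000-byte iovec with the same max_msg instead becomes:
//   WR #1: 8192 bytes (send_buf_pos = 8192), WR #2: the remaining 1808 bytes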
static void try_recv_rdma_wr(osd_client_t *cl, ibv_sge *sge, int op_sge)
{
ibv_recv_wr *bad_wr = NULL;
ibv_recv_wr wr = {
.wr_id = (uint64_t)(cl->peer_fd*2),
.sg_list = sge,
.num_sge = op_sge,
};
int err = ibv_post_recv(cl->rdma_conn->qp, &wr, &bad_wr);
if (err || bad_wr)
{
fprintf(stderr, "RDMA receive failed: %s\n", strerror(err));
exit(1);
}
cl->rdma_conn->cur_recv++;
}
bool osd_messenger_t::try_recv_rdma(osd_client_t *cl)
{
auto rc = cl->rdma_conn;
while (rc->cur_recv < rc->max_recv)
{
void *buf = malloc_or_die(rc->max_msg);
rc->recv_buffers.push_back(buf);
ibv_sge sge = {
.addr = (uintptr_t)buf,
.length = (uint32_t)rc->max_msg,
.lkey = rc->ctx->mr->lkey,
};
try_recv_rdma_wr(cl, &sge, 1);
}
return true;
}
#define RDMA_EVENTS_AT_ONCE 32
void osd_messenger_t::handle_rdma_events()
{
// Request next notification
ibv_cq *ev_cq;
void *ev_ctx;
// FIXME: This is inefficient as it calls read()...
if (ibv_get_cq_event(rdma_context->channel, &ev_cq, &ev_ctx) == 0)
{
ibv_ack_cq_events(rdma_context->cq, 1);
}
if (ibv_req_notify_cq(rdma_context->cq, 0) != 0)
{
fprintf(stderr, "Failed to request RDMA completion notification, exiting\n");
exit(1);
}
ibv_wc wc[RDMA_EVENTS_AT_ONCE];
int event_count;
do
{
event_count = ibv_poll_cq(rdma_context->cq, RDMA_EVENTS_AT_ONCE, wc);
for (int i = 0; i < event_count; i++)
{
int client_id = wc[i].wr_id >> 1;
bool is_send = wc[i].wr_id & 1;
auto cl_it = clients.find(client_id);
if (cl_it == clients.end())
{
continue;
}
osd_client_t *cl = cl_it->second;
if (wc[i].status != IBV_WC_SUCCESS)
{
fprintf(stderr, "RDMA work request failed for client %d", client_id);
if (cl->osd_num)
{
fprintf(stderr, " (OSD %lu)", cl->osd_num);
}
fprintf(stderr, " with status: %s, stopping client\n", ibv_wc_status_str(wc[i].status));
stop_client(client_id);
continue;
}
if (!is_send)
{
cl->rdma_conn->cur_recv--;
handle_read_buffer(cl, cl->rdma_conn->recv_buffers[0], wc[i].byte_len);
free(cl->rdma_conn->recv_buffers[0]);
cl->rdma_conn->recv_buffers.erase(cl->rdma_conn->recv_buffers.begin(), cl->rdma_conn->recv_buffers.begin()+1);
try_recv_rdma(cl);
}
else
{
cl->rdma_conn->cur_send--;
if (!cl->rdma_conn->cur_send)
{
// Wait for the whole batch
for (int i = 0; i < cl->rdma_conn->send_pos; i++)
{
if (cl->outbox[i].flags & MSGR_SENDP_FREE)
{
// Reply fully sent
delete cl->outbox[i].op;
}
}
if (cl->rdma_conn->send_pos > 0)
{
cl->send_list.erase(cl->send_list.begin(), cl->send_list.begin()+cl->rdma_conn->send_pos);
cl->outbox.erase(cl->outbox.begin(), cl->outbox.begin()+cl->rdma_conn->send_pos);
cl->rdma_conn->send_pos = 0;
}
if (cl->rdma_conn->send_buf_pos > 0)
{
cl->send_list[0].iov_base += cl->rdma_conn->send_buf_pos;
cl->send_list[0].iov_len -= cl->rdma_conn->send_buf_pos;
cl->rdma_conn->send_buf_pos = 0;
}
try_send_rdma(cl);
}
}
}
} while (event_count > 0);
for (auto cb: set_immediate)
{
cb();
}
set_immediate.clear();
}

View File

@@ -1,58 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#pragma once
#include <infiniband/verbs.h>
#include <string>
#include <vector>
struct msgr_rdma_address_t
{
ibv_gid gid;
uint16_t lid;
uint32_t qpn;
uint32_t psn;
std::string to_string();
static bool from_string(const char *str, msgr_rdma_address_t *dest);
};
struct msgr_rdma_context_t
{
ibv_context *context = NULL;
ibv_device *dev = NULL;
ibv_device_attr_ex attrx;
ibv_pd *pd = NULL;
ibv_mr *mr = NULL;
ibv_comp_channel *channel = NULL;
ibv_cq *cq = NULL;
ibv_port_attr portinfo;
uint8_t ib_port;
uint8_t gid_index;
uint16_t my_lid;
ibv_gid my_gid;
uint32_t mtu;
int max_cqe = 0;
int used_max_cqe = 0;
static msgr_rdma_context_t *create(const char *ib_devname, uint8_t ib_port, uint8_t gid_index, uint32_t mtu);
~msgr_rdma_context_t();
};
struct msgr_rdma_connection_t
{
msgr_rdma_context_t *ctx = NULL;
ibv_qp *qp = NULL;
msgr_rdma_address_t addr;
int max_send = 0, max_recv = 0, max_sge = 0;
int cur_send = 0, cur_recv = 0;
uint64_t max_msg = 0;
int send_pos = 0, send_buf_pos = 0;
int recv_pos = 0, recv_buf_pos = 0;
std::vector<void*> recv_buffers;
~msgr_rdma_connection_t();
static msgr_rdma_connection_t *create(msgr_rdma_context_t *ctx, uint32_t max_send, uint32_t max_recv, uint32_t max_sge, uint32_t max_msg);
int connect(msgr_rdma_address_t *dest);
};

View File

@@ -72,7 +72,7 @@ bool osd_messenger_t::handle_read(int result, osd_client_t *cl)
// this is a client socket, so don't panic on error. just disconnect it
if (result != 0)
{
fprintf(stderr, "Client %d socket read error: %d (%s). Disconnecting client\n", cl->peer_fd, -result, strerror(-result));
printf("Client %d socket read error: %d (%s). Disconnecting client\n", cl->peer_fd, -result, strerror(-result));
}
stop_client(cl->peer_fd);
return false;
@@ -91,9 +91,48 @@ bool osd_messenger_t::handle_read(int result, osd_client_t *cl)
{
if (cl->read_iov.iov_base == cl->in_buf)
{
if (!handle_read_buffer(cl, cl->in_buf, result))
// Compose operation(s) from the buffer
int remain = result;
void *curbuf = cl->in_buf;
while (remain > 0)
{
goto fin;
if (!cl->read_op)
{
cl->read_op = new osd_op_t;
cl->read_op->peer_fd = cl->peer_fd;
cl->read_op->op_type = OSD_OP_IN;
cl->recv_list.push_back(cl->read_op->req.buf, OSD_PACKET_SIZE);
cl->read_remaining = OSD_PACKET_SIZE;
cl->read_state = CL_READ_HDR;
}
while (cl->recv_list.done < cl->recv_list.count && remain > 0)
{
iovec* cur = cl->recv_list.get_iovec();
if (cur->iov_len > remain)
{
memcpy(cur->iov_base, curbuf, remain);
cl->read_remaining -= remain;
cur->iov_len -= remain;
cur->iov_base += remain;
remain = 0;
}
else
{
memcpy(cur->iov_base, curbuf, cur->iov_len);
curbuf += cur->iov_len;
cl->read_remaining -= cur->iov_len;
remain -= cur->iov_len;
cur->iov_len = 0;
cl->recv_list.done++;
}
}
if (cl->recv_list.done >= cl->recv_list.count)
{
if (!handle_finished_read(cl))
{
goto fin;
}
}
}
}
else
@@ -120,52 +159,6 @@ fin:
return ret;
}
bool osd_messenger_t::handle_read_buffer(osd_client_t *cl, void *curbuf, int remain)
{
// Compose operation(s) from the buffer
while (remain > 0)
{
if (!cl->read_op)
{
cl->read_op = new osd_op_t;
cl->read_op->peer_fd = cl->peer_fd;
cl->read_op->op_type = OSD_OP_IN;
cl->recv_list.push_back(cl->read_op->req.buf, OSD_PACKET_SIZE);
cl->read_remaining = OSD_PACKET_SIZE;
cl->read_state = CL_READ_HDR;
}
while (cl->recv_list.done < cl->recv_list.count && remain > 0)
{
iovec* cur = cl->recv_list.get_iovec();
if (cur->iov_len > remain)
{
memcpy(cur->iov_base, curbuf, remain);
cl->read_remaining -= remain;
cur->iov_len -= remain;
cur->iov_base += remain;
remain = 0;
}
else
{
memcpy(cur->iov_base, curbuf, cur->iov_len);
curbuf += cur->iov_len;
cl->read_remaining -= cur->iov_len;
remain -= cur->iov_len;
cur->iov_len = 0;
cl->recv_list.done++;
}
}
if (cl->recv_list.done >= cl->recv_list.count)
{
if (!handle_finished_read(cl))
{
return false;
}
}
}
return true;
}
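Whether inlined (above) or in this removed standalone form, the copy loop scatters one flat receive buffer into the pending iovec list. A worked trace with hypothetical sizes:

// remain = 100 bytes in curbuf, recv_list = [64-byte header, 64-byte payload]:
//   step 1: iov_len (64) <= remain       -> copy 64, done++, remain = 36
//   step 2: iov_len (64) >  remain (36)  -> copy 36, iov_base += 36, iov_len = 28,
//           remain = 0; the next read continues into the same trimmed iovec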
bool osd_messenger_t::handle_finished_read(osd_client_t *cl)
{
cl->recv_list.reset();
@@ -177,7 +170,7 @@ bool osd_messenger_t::handle_finished_read(osd_client_t *cl)
handle_op_hdr(cl);
else
{
fprintf(stderr, "Received garbage: magic=%lx id=%lu opcode=%lx from %d\n", cl->read_op->req.hdr.magic, cl->read_op->req.hdr.id, cl->read_op->req.hdr.opcode, cl->peer_fd);
printf("Received garbage: magic=%lx id=%lu opcode=%lx from %d\n", cl->read_op->req.hdr.magic, cl->read_op->req.hdr.id, cl->read_op->req.hdr.opcode, cl->peer_fd);
stop_client(cl->peer_fd);
return false;
}
@@ -261,16 +254,6 @@ void osd_messenger_t::handle_op_hdr(osd_client_t *cl)
}
cl->read_remaining = cur_op->req.rw.len;
}
else if (cur_op->req.hdr.opcode == OSD_OP_SHOW_CONFIG)
{
if (cur_op->req.show_conf.json_len > 0)
{
cur_op->buf = malloc_or_die(cur_op->req.show_conf.json_len+1);
((uint8_t*)cur_op->buf)[cur_op->req.show_conf.json_len] = 0;
cl->recv_list.push_back(cur_op->buf, cur_op->req.show_conf.json_len);
}
cl->read_remaining = cur_op->req.show_conf.json_len;
}
if (cl->read_remaining > 0)
{
// Read data
@@ -292,7 +275,7 @@ bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
if (req_it == cl->sent_ops.end())
{
// Command out of sync. Drop connection
fprintf(stderr, "Client %d command out of sync: id %lu\n", cl->peer_fd, cl->read_op->req.hdr.id);
printf("Client %d command out of sync: id %lu\n", cl->peer_fd, cl->read_op->req.hdr.id);
stop_client(cl->peer_fd);
return false;
}
@@ -303,19 +286,17 @@ bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
{
// Read data. In this case we assume that the buffer is preallocated by the caller (!)
unsigned bmp_len = (op->reply.hdr.opcode == OSD_OP_SEC_READ ? op->reply.sec_rw.attr_len : op->reply.rw.bitmap_len);
unsigned expected_size = (op->reply.hdr.opcode == OSD_OP_SEC_READ ? op->req.sec_rw.len : op->req.rw.len);
if (op->reply.hdr.retval >= 0 && (op->reply.hdr.retval != expected_size || bmp_len > op->bitmap_len))
if (op->reply.hdr.retval != (op->reply.hdr.opcode == OSD_OP_SEC_READ ? op->req.sec_rw.len : op->req.rw.len) ||
bmp_len > op->bitmap_len)
{
// Check reply length to not overflow the buffer
fprintf(stderr, "Client %d read reply of different length: expected %u+%u, got %ld+%u\n",
cl->peer_fd, expected_size, op->bitmap_len, op->reply.hdr.retval, bmp_len);
printf("Client %d read reply of different length\n", cl->peer_fd);
cl->sent_ops[op->req.hdr.id] = op;
stop_client(cl->peer_fd);
return false;
}
if (op->reply.hdr.retval >= 0 && bmp_len > 0)
if (bmp_len > 0)
{
assert(op->bitmap);
cl->recv_list.push_back(op->bitmap, bmp_len);
}
if (op->reply.hdr.retval > 0)
@@ -342,24 +323,13 @@ bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
op->buf = memalign_or_die(MEM_ALIGNMENT, cl->read_remaining);
cl->recv_list.push_back(op->buf, cl->read_remaining);
}
else if (op->reply.hdr.opcode == OSD_OP_SEC_READ_BMP && op->reply.hdr.retval > 0)
else if (op->reply.hdr.opcode == OSD_OP_SHOW_CONFIG && op->reply.hdr.retval > 0)
{
assert(!op->iov.count);
delete cl->read_op;
cl->read_op = op;
cl->read_state = CL_READ_REPLY_DATA;
cl->read_remaining = op->reply.hdr.retval;
free(op->buf);
op->buf = memalign_or_die(MEM_ALIGNMENT, cl->read_remaining);
cl->recv_list.push_back(op->buf, cl->read_remaining);
}
else if (op->reply.hdr.opcode == OSD_OP_SHOW_CONFIG && op->reply.hdr.retval > 0)
{
delete cl->read_op;
cl->read_op = op;
cl->read_state = CL_READ_REPLY_DATA;
cl->read_remaining = op->reply.hdr.retval;
free(op->buf);
op->buf = malloc_or_die(op->reply.hdr.retval);
cl->recv_list.push_back(op->buf, op->reply.hdr.retval);
}

View File

@@ -46,7 +46,7 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
to_send_list.push_back((iovec){ .iov_base = cur_op->req.buf, .iov_len = OSD_PACKET_SIZE });
cl->sent_ops[cur_op->req.hdr.id] = cur_op;
}
to_outbox.push_back((msgr_sendp_t){ .op = cur_op, .flags = MSGR_SENDP_HDR });
to_outbox.push_back(NULL);
// Bitmap
if (cur_op->op_type == OSD_OP_IN &&
cur_op->req.hdr.opcode == OSD_OP_SEC_READ &&
@@ -56,7 +56,7 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
.iov_base = cur_op->bitmap,
.iov_len = cur_op->reply.sec_rw.attr_len,
});
to_outbox.push_back((msgr_sendp_t){ .op = cur_op, .flags = 0 });
to_outbox.push_back(NULL);
}
else if (cur_op->op_type == OSD_OP_OUT &&
(cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE || cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) &&
@@ -66,47 +66,33 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
.iov_base = cur_op->bitmap,
.iov_len = cur_op->req.sec_rw.attr_len,
});
to_outbox.push_back((msgr_sendp_t){ .op = cur_op, .flags = 0 });
to_outbox.push_back(NULL);
}
// Operation data
if ((cur_op->op_type == OSD_OP_IN
? (cur_op->req.hdr.opcode == OSD_OP_READ ||
cur_op->req.hdr.opcode == OSD_OP_SEC_READ ||
cur_op->req.hdr.opcode == OSD_OP_SEC_LIST ||
cur_op->req.hdr.opcode == OSD_OP_SEC_READ_BMP ||
cur_op->req.hdr.opcode == OSD_OP_SHOW_CONFIG)
: (cur_op->req.hdr.opcode == OSD_OP_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_STABILIZE ||
cur_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK ||
cur_op->req.hdr.opcode == OSD_OP_SHOW_CONFIG)) && cur_op->iov.count > 0)
cur_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK)) && cur_op->iov.count > 0)
{
for (int i = 0; i < cur_op->iov.count; i++)
{
assert(cur_op->iov.buf[i].iov_base);
to_send_list.push_back(cur_op->iov.buf[i]);
to_outbox.push_back((msgr_sendp_t){ .op = cur_op, .flags = 0 });
to_outbox.push_back(NULL);
}
}
if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
{
if (cur_op->op_type == OSD_OP_IN && cur_op->reply.hdr.retval > 0)
to_send_list.push_back((iovec){ .iov_base = cur_op->buf, .iov_len = (size_t)cur_op->reply.hdr.retval });
else if (cur_op->op_type == OSD_OP_OUT && cur_op->req.sec_read_bmp.len > 0)
to_send_list.push_back((iovec){ .iov_base = cur_op->buf, .iov_len = (size_t)cur_op->req.sec_read_bmp.len });
to_outbox.push_back((msgr_sendp_t){ .op = cur_op, .flags = 0 });
}
if (cur_op->op_type == OSD_OP_IN)
{
to_outbox[to_outbox.size()-1].flags |= MSGR_SENDP_FREE;
// To free it later
to_outbox[to_outbox.size()-1] = cur_op;
}
#ifdef WITH_RDMA
if (cl->peer_state == PEER_RDMA)
{
try_send_rdma(cl);
return;
}
#endif
if (!ringloop)
{
// FIXME: The non-ringloop path is worse because it doesn't allow batching
@@ -218,7 +204,7 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
cl->refs--;
if (cl->peer_state == PEER_STOPPED)
{
if (cl->refs <= 0)
if (!cl->refs)
{
delete cl;
}
@@ -227,7 +213,7 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
if (result < 0 && result != -EAGAIN)
{
// this is a client socket, so don't panic. just disconnect it
fprintf(stderr, "Client %d socket write error: %d (%s). Disconnecting client\n", cl->peer_fd, -result, strerror(-result));
printf("Client %d socket write error: %d (%s). Disconnecting client\n", cl->peer_fd, -result, strerror(-result));
stop_client(cl->peer_fd);
return;
}
@@ -239,10 +225,10 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
iovec & iov = cl->send_list[done];
if (iov.iov_len <= result)
{
if (cl->outbox[done].flags & MSGR_SENDP_FREE)
if (cl->outbox[done])
{
// Reply fully sent
delete cl->outbox[done].op;
delete cl->outbox[done];
}
result -= iov.iov_len;
done++;
@@ -267,21 +253,6 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
cl->next_outbox.clear();
}
cl->write_state = cl->outbox.size() > 0 ? CL_WRITE_READY : 0;
#ifdef WITH_RDMA
if (cl->rdma_conn && !cl->outbox.size() && cl->peer_state == PEER_RDMA_CONNECTING)
{
// FIXME: Do something better than just forgetting the FD
// FIXME: Ignore pings during RDMA state transition
if (log_level > 0)
{
fprintf(stderr, "Successfully connected with client %d using RDMA\n", cl->peer_fd);
}
cl->peer_state = PEER_RDMA;
tfd->set_fd_handler(cl->peer_fd, false, NULL);
// Add the initial receive request
try_recv_rdma(cl);
}
#endif
}
if (cl->write_state != 0)
{
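
The two sides of this hunk differ in how the outbox tracks ownership of queued operations: one stores a bare osd_op_t* (NULL meaning "nothing to free"), the other the small tagged msgr_sendp_t struct seen above. A minimal standalone sketch of the tagged form (the flag values are assumptions, only the shape matters):

#include <vector>

struct osd_op_t;

#define MSGR_SENDP_HDR  1 // entry points at the op's header
#define MSGR_SENDP_FREE 2 // delete the op once this entry is fully sent

struct msgr_sendp_t
{
    osd_op_t *op;
    int flags;
};

// handle_send() can then release ownership exactly once per op:
//     if (outbox[done].flags & MSGR_SENDP_FREE)
//         delete outbox[done].op;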

View File

@@ -1,143 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <unistd.h>
#include <assert.h>
#include "messenger.h"
void osd_messenger_t::cancel_osd_ops(osd_client_t *cl)
{
std::vector<osd_op_t*> cancel_ops;
cancel_ops.resize(cl->sent_ops.size());
int i = 0;
for (auto p: cl->sent_ops)
{
cancel_ops[i++] = p.second;
}
cl->sent_ops.clear();
cl->outbox.clear();
for (auto op: cancel_ops)
{
cancel_op(op);
}
}
void osd_messenger_t::cancel_op(osd_op_t *op)
{
if (op->op_type == OSD_OP_OUT)
{
op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
op->reply.hdr.id = op->req.hdr.id;
op->reply.hdr.opcode = op->req.hdr.opcode;
op->reply.hdr.retval = -EPIPE;
// Copy lambda to be unaffected by `delete op`
std::function<void(osd_op_t*)>(op->callback)(op);
}
else
{
// This function is only called in stop_client(), so it's fine to destroy the operation
delete op;
}
}
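
cancel_op() copies the callback into a local std::function before invoking it because the callable usually lives inside the operation itself and may execute `delete op`. A stripped-down illustration of why the copy matters (hypothetical names):

#include <functional>

struct op_t
{
    std::function<void(op_t*)> callback;
};

void finish(op_t *op)
{
    // Calling op->callback(op) directly would destroy the std::function
    // we are executing if the callback deletes `op` mid-call.
    // A local copy keeps the callable alive until it returns.
    std::function<void(op_t*)> cb(op->callback);
    cb(op);
}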
void osd_messenger_t::stop_client(int peer_fd, bool force)
{
assert(peer_fd != 0);
auto it = clients.find(peer_fd);
if (it == clients.end())
{
return;
}
osd_client_t *cl = it->second;
if (cl->peer_state == PEER_CONNECTING && !force || cl->peer_state == PEER_STOPPED)
{
return;
}
if (log_level > 0)
{
if (cl->osd_num)
{
fprintf(stderr, "[OSD %lu] Stopping client %d (OSD peer %lu)\n", osd_num, peer_fd, cl->osd_num);
}
else
{
fprintf(stderr, "[OSD %lu] Stopping client %d (regular client)\n", osd_num, peer_fd);
}
}
// First set state to STOPPED so another stop_client() call doesn't try to free it again
cl->refs++;
cl->peer_state = PEER_STOPPED;
if (cl->osd_num)
{
// ...and forget OSD peer
osd_peer_fds.erase(cl->osd_num);
}
#ifndef __MOCK__
// Then remove FD from the eventloop so we don't accidentally read something
tfd->set_fd_handler(peer_fd, false, NULL);
if (cl->connect_timeout_id >= 0)
{
tfd->clear_timer(cl->connect_timeout_id);
cl->connect_timeout_id = -1;
}
for (auto rit = read_ready_clients.begin(); rit != read_ready_clients.end(); rit++)
{
if (*rit == peer_fd)
{
read_ready_clients.erase(rit);
break;
}
}
for (auto wit = write_ready_clients.begin(); wit != write_ready_clients.end(); wit++)
{
if (*wit == peer_fd)
{
write_ready_clients.erase(wit);
break;
}
}
#endif
if (cl->osd_num)
{
// Then repeer PGs because cancel_op() callbacks can try to perform
// some actions and we need correct PG states to not do something silly
repeer_pgs(cl->osd_num);
}
// Then cancel all operations
if (cl->read_op)
{
if (!cl->read_op->callback)
{
delete cl->read_op;
}
cl->read_op = NULL;
}
if (cl->osd_num)
{
// Cancel outbound operations
cancel_osd_ops(cl);
}
#ifndef __MOCK__
// And close the FD only when everything is done
// ...because peer_fd number can get reused after close()
close(peer_fd);
#ifdef WITH_RDMA
if (cl->rdma_conn)
{
delete cl->rdma_conn;
}
#endif
#endif
// Find the item again because it can be invalidated at this point
it = clients.find(peer_fd);
if (it != clients.end())
{
clients.erase(it);
}
cl->refs--;
if (cl->refs <= 0)
{
delete cl;
}
}

View File

@@ -10,7 +10,6 @@
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <sys/un.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <fcntl.h>
#include <signal.h>
@@ -27,10 +26,7 @@ const char *exe_name = NULL;
class nbd_proxy
{
protected:
std::string image_name;
uint64_t inode = 0;
uint64_t device_size = 0;
inode_watch_t *watch = NULL;
ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL;
@@ -115,9 +111,9 @@ public:
{
printf(
"Vitastor NBD proxy\n"
"(c) Vitaliy Filippov, 2020-2021 (VNPL-1.1)\n\n"
"(c) Vitaliy Filippov, 2020 (VNPL-1.1)\n\n"
"USAGE:\n"
" %s map [--etcd_address <etcd_address>] (--image <image> | --pool <pool> --inode <inode> --size <size in bytes>)\n"
" %s map --etcd_address <etcd_address> --pool <pool> --inode <inode> --size <size in bytes>\n"
" %s unmap /dev/nbd0\n"
" %s list [--json]\n",
exe_name, exe_name, exe_name
@@ -147,49 +143,26 @@ public:
void start(json11::Json cfg)
{
// Check options
if (cfg["image"].string_value() != "")
if (cfg["etcd_address"].string_value() == "")
{
// Use image name
image_name = cfg["image"].string_value();
inode = 0;
fprintf(stderr, "etcd_address is missing\n");
exit(1);
}
else
if (!cfg["size"].uint64_value())
{
// Use pool, inode number and size
if (!cfg["size"].uint64_value())
{
fprintf(stderr, "device size is missing\n");
exit(1);
}
device_size = cfg["size"].uint64_value();
inode = cfg["inode"].uint64_value();
uint64_t pool = cfg["pool"].uint64_value();
if (pool)
{
inode = (inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (pool << (64-POOL_ID_BITS));
}
if (!(inode >> (64-POOL_ID_BITS)))
{
fprintf(stderr, "pool is missing\n");
exit(1);
}
fprintf(stderr, "device size is missing\n");
exit(1);
}
// Create client
ringloop = new ring_loop_t(512);
epmgr = new epoll_manager_t(ringloop);
cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
if (!inode)
inode = cfg["inode"].uint64_value();
uint64_t pool = cfg["pool"].uint64_value();
if (pool)
{
// Load image metadata
while (!cli->is_ready())
{
ringloop->loop();
if (cli->is_ready())
break;
ringloop->wait();
}
watch = cli->st_cli.watch_inode(image_name);
device_size = watch->cfg.size;
inode = (inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (pool << (64-POOL_ID_BITS));
}
if (!(inode >> (64-POOL_ID_BITS)))
{
fprintf(stderr, "pool is missing\n");
exit(1);
}
// Initialize NBD
int sockfd[2];
@@ -201,10 +174,9 @@ public:
fcntl(sockfd[0], F_SETFL, fcntl(sockfd[0], F_GETFL, 0) | O_NONBLOCK);
nbd_fd = sockfd[0];
load_module();
bool bg = cfg["foreground"].is_null();
if (!cfg["dev_num"].is_null())
{
if (run_nbd(sockfd, cfg["dev_num"].int64_value(), device_size, NBD_FLAG_SEND_FLUSH, 30, bg) < 0)
if (run_nbd(sockfd, cfg["dev_num"].int64_value(), cfg["size"].uint64_value(), NBD_FLAG_SEND_FLUSH, 30) < 0)
{
perror("run_nbd");
exit(1);
@@ -216,7 +188,7 @@ public:
int i = 0;
while (true)
{
int r = run_nbd(sockfd, i, device_size, NBD_FLAG_SEND_FLUSH, 30, bg);
int r = run_nbd(sockfd, i, cfg["size"].uint64_value(), NBD_FLAG_SEND_FLUSH, 30);
if (r == 0)
{
printf("/dev/nbd%d\n", i);
@@ -239,10 +211,14 @@ public:
}
}
}
if (bg)
if (cfg["foreground"].is_null())
{
daemonize();
}
// Create client
ringloop = new ring_loop_t(512);
epmgr = new epoll_manager_t(ringloop);
cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
// Initialize read state
read_state = CL_READ_HDR;
recv_buf = malloc_or_die(receive_buffer_size);
@@ -256,47 +232,21 @@ public:
};
ringloop->register_consumer(&consumer);
// Add FD to epoll
bool stop = false;
epmgr->tfd->set_fd_handler(sockfd[0], false, [this, &stop](int peer_fd, int epoll_events)
epmgr->tfd->set_fd_handler(sockfd[0], false, [this](int peer_fd, int epoll_events)
{
if (epoll_events & EPOLLRDHUP)
{
close(peer_fd);
stop = true;
}
else
{
read_ready++;
submit_read();
}
read_ready++;
submit_read();
});
while (!stop)
while (1)
{
ringloop->loop();
ringloop->wait();
}
stop = false;
cluster_op_t *close_sync = new cluster_op_t;
close_sync->opcode = OSD_OP_SYNC;
close_sync->callback = [this, &stop](cluster_op_t *op)
{
stop = true;
delete op;
};
cli->execute(close_sync);
while (!stop)
{
ringloop->loop();
ringloop->wait();
}
delete cli;
delete epmgr;
delete ringloop;
}
void load_module()
{
if (access("/sys/module/nbd", F_OK) == 0)
if (access("/sys/module/nbd", F_OK))
{
return;
}
@@ -438,7 +388,7 @@ public:
}
protected:
int run_nbd(int sockfd[2], int dev_num, uint64_t size, uint64_t flags, unsigned timeout, bool bg)
int run_nbd(int sockfd[2], int dev_num, uint64_t size, uint64_t flags, unsigned timeout)
{
// Check handle size
assert(sizeof(cur_req.handle) == 8);
@@ -486,14 +436,11 @@ protected:
{
// Run in child
close(sockfd[0]);
if (bg)
{
daemonize();
}
r = ioctl(nbd, NBD_DO_IT);
if (r < 0)
{
fprintf(stderr, "NBD device terminated with error: %s\n", strerror(errno));
kill(getppid(), SIGTERM);
}
close(sockfd[1]);
ioctl(nbd, NBD_CLEAR_QUE);
@@ -663,7 +610,7 @@ protected:
if (req_type == NBD_CMD_READ || req_type == NBD_CMD_WRITE)
{
op->opcode = req_type == NBD_CMD_READ ? OSD_OP_READ : OSD_OP_WRITE;
op->inode = inode ? inode : watch->cfg.num;
op->inode = inode;
op->offset = be64toh(cur_req.from);
op->len = be32toh(cur_req.len);
buf = malloc_or_die(sizeof(nbd_reply) + op->len);
@@ -710,15 +657,7 @@ protected:
}
else
{
if (cur_op->opcode == OSD_OP_WRITE && watch->cfg.readonly)
{
cur_op->retval = -EROFS;
std::function<void(cluster_op_t*)>(cur_op->callback)(cur_op);
}
else
{
cli->execute(cur_op);
}
cli->execute(cur_op);
cur_op = NULL;
cur_buf = &cur_req;
cur_left = sizeof(nbd_request);
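
Both versions of the proxy pack the pool ID into the top POOL_ID_BITS of the 64-bit inode number, which is what the `(inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (pool << (64-POOL_ID_BITS))` expressions above compute. The same packing in isolation (a sketch, assuming POOL_ID_BITS = 16):

#include <cstdint>
#include <cassert>

const int POOL_ID_BITS = 16;

uint64_t pack_inode(uint64_t pool, uint64_t inode)
{
    // Keep the low (64-POOL_ID_BITS) bits of the inode, put the pool on top
    return (inode & ((1ul << (64-POOL_ID_BITS)) - 1)) | (pool << (64-POOL_ID_BITS));
}

uint64_t inode_pool(uint64_t packed)
{
    return packed >> (64-POOL_ID_BITS);
}

int main()
{
    uint64_t n = pack_inode(2, 5);
    assert(inode_pool(n) == 2);
    assert((n & ((1ul << 48) - 1)) == 5);
}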

View File

@@ -8,42 +8,28 @@
#include <arpa/inet.h>
#include "osd.h"
#include "http_client.h"
static blockstore_config_t json_to_bs(const json11::Json::object & config)
osd_t::osd_t(blockstore_config_t & config, ring_loop_t *ringloop)
{
blockstore_config_t bs;
for (auto kv: config)
{
if (kv.second.is_string())
bs[kv.first] = kv.second.string_value();
else
bs[kv.first] = kv.second.dump();
}
return bs;
}
osd_t::osd_t(const json11::Json & config, ring_loop_t *ringloop)
{
zero_buffer_size = 1<<20;
zero_buffer = malloc_or_die(zero_buffer_size);
memset(zero_buffer, 0, zero_buffer_size);
bs_block_size = strtoull(config["block_size"].c_str(), NULL, 10);
bs_bitmap_granularity = strtoull(config["bitmap_granularity"].c_str(), NULL, 10);
if (!bs_block_size)
bs_block_size = DEFAULT_BLOCK_SIZE;
if (!bs_bitmap_granularity)
bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
clean_entry_bitmap_size = bs_block_size / bs_bitmap_granularity / 8;
this->config = config;
this->ringloop = ringloop;
this->config = msgr.read_config(config).object_items();
if (this->config.find("log_level") == this->config.end())
this->config["log_level"] = 1;
parse_config(this->config);
// FIXME: Create Blockstore from on-disk superblock config and check it against the OSD cluster config
this->bs = new blockstore_t(config, ringloop);
parse_config(config);
epmgr = new epoll_manager_t(ringloop);
// FIXME: Use timerfd_interval based directly on io_uring
this->tfd = epmgr->tfd;
// FIXME: Create Blockstore from on-disk superblock config and check it against the OSD cluster config
auto bs_cfg = json_to_bs(this->config);
this->bs = new blockstore_t(bs_cfg, ringloop, tfd);
this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id)
{
print_stats();
@@ -53,11 +39,11 @@ osd_t::osd_t(const json11::Json & config, ring_loop_t *ringloop)
print_slow();
});
msgr.tfd = this->tfd;
msgr.ringloop = this->ringloop;
msgr.exec_op = [this](osd_op_t *op) { exec_op(op); };
msgr.repeer_pgs = [this](osd_num_t peer_osd) { repeer_pgs(peer_osd); };
msgr.init();
c_cli.tfd = this->tfd;
c_cli.ringloop = this->ringloop;
c_cli.exec_op = [this](osd_op_t *op) { exec_op(op); };
c_cli.repeer_pgs = [this](osd_num_t peer_osd) { repeer_pgs(peer_osd); };
c_cli.init();
init_cluster();
@@ -71,74 +57,64 @@ osd_t::~osd_t()
delete epmgr;
delete bs;
close(listen_fd);
free(zero_buffer);
}
void osd_t::parse_config(const json11::Json & config)
void osd_t::parse_config(blockstore_config_t & config)
{
st_cli.parse_config(config);
msgr.parse_config(config);
// OSD number
osd_num = config["osd_num"].uint64_value();
if (!osd_num)
throw std::runtime_error("osd_num is required in the configuration");
msgr.osd_num = osd_num;
// Vital Blockstore parameters
bs_block_size = config["block_size"].uint64_value();
if (!bs_block_size)
bs_block_size = DEFAULT_BLOCK_SIZE;
bs_bitmap_granularity = config["bitmap_granularity"].uint64_value();
if (!bs_bitmap_granularity)
bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
clean_entry_bitmap_size = bs_block_size / bs_bitmap_granularity / 8;
// Bind address
bind_address = config["bind_address"].string_value();
if (bind_address == "")
bind_address = "0.0.0.0";
bind_port = config["bind_port"].uint64_value();
if (bind_port <= 0 || bind_port > 65535)
bind_port = 0;
// OSD configuration
log_level = config["log_level"].uint64_value();
etcd_report_interval = config["etcd_report_interval"].uint64_value();
if (config.find("log_level") == config.end())
config["log_level"] = "1";
log_level = strtoull(config["log_level"].c_str(), NULL, 10);
// Initial startup configuration
json11::Json json_config = json11::Json(config);
st_cli.parse_config(json_config);
etcd_report_interval = strtoull(config["etcd_report_interval"].c_str(), NULL, 10);
if (etcd_report_interval <= 0)
etcd_report_interval = 30;
readonly = config["readonly"] == "true" || config["readonly"] == "1" || config["readonly"] == "yes";
osd_num = strtoull(config["osd_num"].c_str(), NULL, 10);
if (!osd_num)
throw std::runtime_error("osd_num is required in the configuration");
c_cli.osd_num = osd_num;
run_primary = config["run_primary"] != "false" && config["run_primary"] != "0" && config["run_primary"] != "no";
no_rebalance = config["no_rebalance"] == "true" || config["no_rebalance"] == "1" || config["no_rebalance"] == "yes";
no_recovery = config["no_recovery"] == "true" || config["no_recovery"] == "1" || config["no_recovery"] == "yes";
allow_test_ops = config["allow_test_ops"] == "true" || config["allow_test_ops"] == "1" || config["allow_test_ops"] == "yes";
// Cluster configuration
bind_address = config["bind_address"];
if (bind_address == "")
bind_address = "0.0.0.0";
bind_port = stoull_full(config["bind_port"]);
if (bind_port <= 0 || bind_port > 65535)
bind_port = 0;
if (config["immediate_commit"] == "all")
immediate_commit = IMMEDIATE_ALL;
else if (config["immediate_commit"] == "small")
immediate_commit = IMMEDIATE_SMALL;
else
immediate_commit = IMMEDIATE_NONE;
if (!config["autosync_interval"].is_null())
if (config.find("autosync_interval") != config.end())
{
// Allow to set it to 0
autosync_interval = config["autosync_interval"].uint64_value();
autosync_interval = strtoull(config["autosync_interval"].c_str(), NULL, 10);
if (autosync_interval > MAX_AUTOSYNC_INTERVAL)
autosync_interval = DEFAULT_AUTOSYNC_INTERVAL;
}
if (!config["client_queue_depth"].is_null())
if (config.find("client_queue_depth") != config.end())
{
client_queue_depth = config["client_queue_depth"].uint64_value();
client_queue_depth = strtoull(config["client_queue_depth"].c_str(), NULL, 10);
if (client_queue_depth < 128)
client_queue_depth = 128;
}
recovery_queue_depth = config["recovery_queue_depth"].uint64_value();
recovery_queue_depth = strtoull(config["recovery_queue_depth"].c_str(), NULL, 10);
if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE)
recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
recovery_sync_batch = config["recovery_sync_batch"].uint64_value();
recovery_sync_batch = strtoull(config["recovery_sync_batch"].c_str(), NULL, 10);
if (recovery_sync_batch < 1 || recovery_sync_batch > MAX_RECOVERY_QUEUE)
recovery_sync_batch = DEFAULT_RECOVERY_BATCH;
print_stats_interval = config["print_stats_interval"].uint64_value();
if (config["readonly"] == "true" || config["readonly"] == "1" || config["readonly"] == "yes")
readonly = true;
print_stats_interval = strtoull(config["print_stats_interval"].c_str(), NULL, 10);
if (!print_stats_interval)
print_stats_interval = 3;
slow_log_interval = config["slow_log_interval"].uint64_value();
slow_log_interval = strtoull(config["slow_log_interval"].c_str(), NULL, 10);
if (!slow_log_interval)
slow_log_interval = 10;
c_cli.parse_config(json_config);
}
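
The string-map variant of parse_config() tests boolean options by comparing against the spellings "true", "1" and "yes". A tiny helper in the same spirit (hypothetical, not part of the diff) would keep those checks in one place:

#include <string>

static bool parse_bool_flag(const std::string & v)
{
    return v == "true" || v == "1" || v == "yes";
}

// e.g.: readonly = parse_bool_flag(config["readonly"]);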
void osd_t::bind_socket()
@@ -191,7 +167,7 @@ void osd_t::bind_socket()
epmgr->set_fd_handler(listen_fd, false, [this](int fd, int events)
{
msgr.accept_connections(listen_fd);
c_cli.accept_connections(listen_fd);
});
}
@@ -208,8 +184,8 @@ bool osd_t::shutdown()
void osd_t::loop()
{
handle_peers();
msgr.read_requests();
msgr.send_replies();
c_cli.read_requests();
c_cli.send_replies();
ringloop->submit();
}
@@ -222,8 +198,6 @@ void osd_t::exec_op(osd_op_t *cur_op)
delete cur_op;
return;
}
// Clear the reply buffer
memset(cur_op->reply.buf, 0, OSD_PACKET_SIZE);
inflight_ops++;
if (cur_op->req.hdr.magic != SECONDARY_OSD_OP_MAGIC ||
cur_op->req.hdr.opcode < OSD_OP_MIN || cur_op->req.hdr.opcode > OSD_OP_MAX ||
@@ -293,7 +267,7 @@ void osd_t::exec_op(osd_op_t *cur_op)
void osd_t::reset_stats()
{
msgr.stats = { 0 };
c_cli.stats = { 0 };
prev_stats = { 0 };
memset(recovery_stat_count, 0, sizeof(recovery_stat_count));
memset(recovery_stat_bytes, 0, sizeof(recovery_stat_bytes));
@@ -303,11 +277,11 @@ void osd_t::print_stats()
{
for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
{
if (msgr.stats.op_stat_count[i] != prev_stats.op_stat_count[i] && i != OSD_OP_PING)
if (c_cli.stats.op_stat_count[i] != prev_stats.op_stat_count[i] && i != OSD_OP_PING)
{
uint64_t avg = (msgr.stats.op_stat_sum[i] - prev_stats.op_stat_sum[i])/(msgr.stats.op_stat_count[i] - prev_stats.op_stat_count[i]);
uint64_t bw = (msgr.stats.op_stat_bytes[i] - prev_stats.op_stat_bytes[i]) / print_stats_interval;
if (msgr.stats.op_stat_bytes[i] != 0)
uint64_t avg = (c_cli.stats.op_stat_sum[i] - prev_stats.op_stat_sum[i])/(c_cli.stats.op_stat_count[i] - prev_stats.op_stat_count[i]);
uint64_t bw = (c_cli.stats.op_stat_bytes[i] - prev_stats.op_stat_bytes[i]) / print_stats_interval;
if (c_cli.stats.op_stat_bytes[i] != 0)
{
printf(
"[OSD %lu] avg latency for op %d (%s): %lu us, B/W: %.2f %s\n", osd_num, i, osd_op_names[i], avg,
@@ -319,19 +293,19 @@ void osd_t::print_stats()
{
printf("[OSD %lu] avg latency for op %d (%s): %lu us\n", osd_num, i, osd_op_names[i], avg);
}
prev_stats.op_stat_count[i] = msgr.stats.op_stat_count[i];
prev_stats.op_stat_sum[i] = msgr.stats.op_stat_sum[i];
prev_stats.op_stat_bytes[i] = msgr.stats.op_stat_bytes[i];
prev_stats.op_stat_count[i] = c_cli.stats.op_stat_count[i];
prev_stats.op_stat_sum[i] = c_cli.stats.op_stat_sum[i];
prev_stats.op_stat_bytes[i] = c_cli.stats.op_stat_bytes[i];
}
}
for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
{
if (msgr.stats.subop_stat_count[i] != prev_stats.subop_stat_count[i])
if (c_cli.stats.subop_stat_count[i] != prev_stats.subop_stat_count[i])
{
uint64_t avg = (msgr.stats.subop_stat_sum[i] - prev_stats.subop_stat_sum[i])/(msgr.stats.subop_stat_count[i] - prev_stats.subop_stat_count[i]);
uint64_t avg = (c_cli.stats.subop_stat_sum[i] - prev_stats.subop_stat_sum[i])/(c_cli.stats.subop_stat_count[i] - prev_stats.subop_stat_count[i]);
printf("[OSD %lu] avg latency for subop %d (%s): %ld us\n", osd_num, i, osd_op_names[i], avg);
prev_stats.subop_stat_count[i] = msgr.stats.subop_stat_count[i];
prev_stats.subop_stat_sum[i] = msgr.stats.subop_stat_sum[i];
prev_stats.subop_stat_count[i] = c_cli.stats.subop_stat_count[i];
prev_stats.subop_stat_sum[i] = c_cli.stats.subop_stat_sum[i];
}
}
for (int i = 0; i < 2; i++)
@@ -368,7 +342,7 @@ void osd_t::print_slow()
char alloc[1024];
timespec now;
clock_gettime(CLOCK_REALTIME, &now);
for (auto & kv: msgr.clients)
for (auto & kv: c_cli.clients)
{
for (auto op: kv.second->received_ops)
{

View File

@@ -66,33 +66,11 @@ struct inode_stats_t
uint64_t op_bytes[3] = { 0 };
};
struct bitmap_request_t
{
osd_num_t osd_num;
object_id oid;
uint64_t version;
void *bmp_buf;
};
inline bool operator < (const bitmap_request_t & a, const bitmap_request_t & b)
{
return a.osd_num < b.osd_num || a.osd_num == b.osd_num && a.oid < b.oid;
}
struct osd_chain_read_t
{
int chain_pos;
inode_t inode;
uint32_t offset, len;
};
struct osd_rmw_stripe_t;
class osd_t
{
// config
json11::Json::object config;
blockstore_config_t config;
int etcd_report_interval = 30;
bool readonly = false;
@@ -104,7 +82,7 @@ class osd_t
int bind_port, listen_backlog;
// FIXME: Implement client queue depth limit
int client_queue_depth = 128;
bool allow_test_ops = false;
bool allow_test_ops = true;
int print_stats_interval = 3;
int slow_log_interval = 10;
int immediate_commit = IMMEDIATE_NONE;
@@ -116,7 +94,7 @@ class osd_t
// cluster state
etcd_state_client_t st_cli;
osd_messenger_t msgr;
osd_messenger_t c_cli;
int etcd_failed_attempts = 0;
std::string etcd_lease_id;
json11::Json self_state;
@@ -148,8 +126,6 @@ class osd_t
bool stopping = false;
int inflight_ops = 0;
blockstore_t *bs;
void *zero_buffer = NULL;
uint64_t zero_buffer_size = 0;
uint32_t bs_block_size, bs_bitmap_granularity, clean_entry_bitmap_size;
ring_loop_t *ringloop;
timerfd_manager_t *tfd = NULL;
@@ -167,11 +143,11 @@ class osd_t
uint64_t recovery_stat_bytes[2][2] = { 0 };
// cluster connection
void parse_config(const json11::Json & config);
void parse_config(blockstore_config_t & config);
void init_cluster();
void on_change_osd_state_hook(osd_num_t peer_osd);
void on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num);
void on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes);
void on_change_etcd_state_hook(json11::Json::object & changes);
void on_load_config_hook(json11::Json::object & changes);
json11::Json on_load_pgs_checks_hook();
void on_load_pgs_hook(bool success);
@@ -234,31 +210,17 @@ class osd_t
void continue_primary_del(osd_op_t *cur_op);
bool check_write_queue(osd_op_t *cur_op, pg_t & pg);
void remove_object_from_state(object_id & oid, pg_osd_set_state_t *object_state, pg_t &pg);
void free_object_state(pg_t & pg, pg_osd_set_state_t **object_state);
bool remember_unstable_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t & loc_set, int base_state);
void handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op);
void handle_primary_bs_subop(osd_op_t *subop);
void add_bs_subop_stats(osd_op_t *subop);
void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval);
void submit_primary_subops(int submit_type, uint64_t op_version, const uint64_t* osd_set, osd_op_t *cur_op);
int submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t op_version,
osd_rmw_stripe_t *stripes, const uint64_t* osd_set, osd_op_t *cur_op, int subop_idx, int zero_read);
void submit_primary_subops(int submit_type, uint64_t op_version, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op);
void submit_primary_del_subops(osd_op_t *cur_op, uint64_t *cur_set, uint64_t set_size, pg_osd_set_t & loc_set);
void submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_delete, int chunks_to_delete_count);
int submit_primary_sync_subops(osd_op_t *cur_op);
void submit_primary_sync_subops(osd_op_t *cur_op);
void submit_primary_stab_subops(osd_op_t *cur_op);
uint64_t* get_object_osd_set(pg_t &pg, object_id &oid, uint64_t *def, pg_osd_set_state_t **object_state);
void continue_chained_read(osd_op_t *cur_op);
int submit_chained_read_requests(pg_t & pg, osd_op_t *cur_op);
void send_chained_read_results(pg_t & pg, osd_op_t *cur_op);
std::vector<osd_chain_read_t> collect_chained_read_requests(osd_op_t *cur_op);
int collect_bitmap_requests(osd_op_t *cur_op, pg_t & pg, std::vector<bitmap_request_t> & bitmap_requests);
int submit_bitmap_subops(osd_op_t *cur_op, pg_t & pg);
int read_bitmaps(osd_op_t *cur_op, pg_t & pg, int base_state);
inline pg_num_t map_to_pg(object_id oid, uint64_t pg_stripe_size)
{
uint64_t pg_count = pg_counts[INODE_POOL(oid.inode)];
@@ -268,7 +230,7 @@ class osd_t
}
public:
osd_t(const json11::Json & config, ring_loop_t *ringloop);
osd_t(blockstore_config_t & config, ring_loop_t *ringloop);
~osd_t();
void force_stop(int exitcode);
bool shutdown();

View File

@@ -4,7 +4,6 @@
#include "osd.h"
#include "base64.h"
#include "etcd_state_client.h"
#include "http_client.h"
#include "osd_rmw.h"
// Startup sequence:
@@ -21,7 +20,7 @@ void osd_t::init_cluster()
{
// Test version of clustering code with 1 pool, 1 PG and 2 peers
// Example: peers = 2:127.0.0.1:11204,3:127.0.0.1:11205
std::string peerstr = config["peers"].string_value();
std::string peerstr = config["peers"];
while (peerstr.size())
{
int pos = peerstr.find(',');
@@ -65,7 +64,7 @@ void osd_t::init_cluster()
st_cli.log_level = log_level;
st_cli.on_change_osd_state_hook = [this](osd_num_t peer_osd) { on_change_osd_state_hook(peer_osd); };
st_cli.on_change_pg_history_hook = [this](pool_id_t pool_id, pg_num_t pg_num) { on_change_pg_history_hook(pool_id, pg_num); };
st_cli.on_change_hook = [this](std::map<std::string, etcd_kv_t> & changes) { on_change_etcd_state_hook(changes); };
st_cli.on_change_hook = [this](json11::Json::object & changes) { on_change_etcd_state_hook(changes); };
st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); };
st_cli.load_pgs_checks_hook = [this]() { return on_load_pgs_checks_hook(); };
st_cli.on_load_pgs_hook = [this](bool success) { on_load_pgs_hook(success); };
@@ -104,7 +103,7 @@ void osd_t::parse_test_peer(std::string peer)
{ "addresses", json11::Json::array { addr } },
{ "port", port },
};
msgr.connect_peer(peer_osd, st_cli.peer_states[peer_osd]);
c_cli.connect_peer(peer_osd, st_cli.peer_states[peer_osd]);
}
json11::Json osd_t::get_osd_state()
@@ -146,16 +145,16 @@ json11::Json osd_t::get_statistics()
for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
{
op_stats[osd_op_names[i]] = json11::Json::object {
{ "count", msgr.stats.op_stat_count[i] },
{ "usec", msgr.stats.op_stat_sum[i] },
{ "bytes", msgr.stats.op_stat_bytes[i] },
{ "count", c_cli.stats.op_stat_count[i] },
{ "usec", c_cli.stats.op_stat_sum[i] },
{ "bytes", c_cli.stats.op_stat_bytes[i] },
};
}
for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
{
subop_stats[osd_op_names[i]] = json11::Json::object {
{ "count", msgr.stats.subop_stat_count[i] },
{ "usec", msgr.stats.subop_stat_sum[i] },
{ "count", c_cli.stats.subop_stat_count[i] },
{ "usec", c_cli.stats.subop_stat_sum[i] },
};
}
st["op_stats"] = op_stats;
@@ -183,38 +182,14 @@ void osd_t::report_statistics()
// Report space usage statistics as a whole
// Maybe we'll report it using deltas if we tune for a lot of inodes at some point
json11::Json::object inode_space;
json11::Json::object last_stat;
pool_id_t last_pool = 0;
for (auto kv: bs->get_inode_space_stats())
{
pool_id_t pool_id = INODE_POOL(kv.first);
uint64_t only_inode_num = (kv.first & ((1l << (64-POOL_ID_BITS)) - 1));
if (!last_pool || pool_id != last_pool)
{
if (last_pool)
inode_space[std::to_string(last_pool)] = last_stat;
last_stat = json11::Json::object();
last_pool = pool_id;
}
last_stat[std::to_string(only_inode_num)] = kv.second;
inode_space[std::to_string(kv.first)] = kv.second;
}
if (last_pool)
inode_space[std::to_string(last_pool)] = last_stat;
last_stat = json11::Json::object();
last_pool = 0;
json11::Json::object inode_ops;
for (auto kv: inode_stats)
{
pool_id_t pool_id = INODE_POOL(kv.first);
uint64_t only_inode_num = (kv.first & ((1l << (64-POOL_ID_BITS)) - 1));
if (!last_pool || pool_id != last_pool)
{
if (last_pool)
inode_ops[std::to_string(last_pool)] = last_stat;
last_stat = json11::Json::object();
last_pool = pool_id;
}
last_stat[std::to_string(only_inode_num)] = json11::Json::object {
inode_ops[std::to_string(kv.first)] = json11::Json::object {
{ "read", json11::Json::object {
{ "count", kv.second.op_count[INODE_STATS_READ] },
{ "usec", kv.second.op_sum[INODE_STATS_READ] },
@@ -232,28 +207,20 @@ void osd_t::report_statistics()
} },
};
}
if (last_pool)
inode_ops[std::to_string(last_pool)] = last_stat;
json11::Json::array txn = {
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(st_cli.etcd_prefix+"/osd/stats/"+std::to_string(osd_num)) },
{ "value", base64_encode(get_statistics().dump()) },
} },
},
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(st_cli.etcd_prefix+"/osd/space/"+std::to_string(osd_num)) },
{ "value", base64_encode(json11::Json(inode_space).dump()) },
} },
},
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(st_cli.etcd_prefix+"/osd/inodestats/"+std::to_string(osd_num)) },
{ "value", base64_encode(json11::Json(inode_ops).dump()) },
} },
},
};
json11::Json::array txn = { json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(st_cli.etcd_prefix+"/osd/stats/"+std::to_string(osd_num)) },
{ "value", base64_encode(get_statistics().dump()) },
} },
{ "request_put", json11::Json::object {
{ "key", base64_encode(st_cli.etcd_prefix+"/osd/space/"+std::to_string(osd_num)) },
{ "value", base64_encode(json11::Json(inode_space).dump()) },
} },
{ "request_put", json11::Json::object {
{ "key", base64_encode(st_cli.etcd_prefix+"/osd/inodestats/"+std::to_string(osd_num)) },
{ "value", base64_encode(json11::Json(inode_ops).dump()) },
} },
} };
for (auto & p: pgs)
{
auto & pg = p.second;
@@ -298,13 +265,13 @@ void osd_t::report_statistics()
void osd_t::on_change_osd_state_hook(osd_num_t peer_osd)
{
if (msgr.wanted_peers.find(peer_osd) != msgr.wanted_peers.end())
if (c_cli.wanted_peers.find(peer_osd) != c_cli.wanted_peers.end())
{
msgr.connect_peer(peer_osd, st_cli.peer_states[peer_osd]);
c_cli.connect_peer(peer_osd, st_cli.peer_states[peer_osd]);
}
}
void osd_t::on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes)
void osd_t::on_change_etcd_state_hook(json11::Json::object & changes)
{
// FIXME: apply config changes at runtime (maybe only some of them)
if (run_primary)
@@ -340,10 +307,21 @@ void osd_t::on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num)
void osd_t::on_load_config_hook(json11::Json::object & global_config)
{
json11::Json::object osd_config = this->config;
for (auto & kv: global_config)
if (osd_config.find(kv.first) == osd_config.end())
osd_config[kv.first] = kv.second;
blockstore_config_t osd_config = this->config;
for (auto & cfg_var: global_config)
{
if (this->config.find(cfg_var.first) == this->config.end())
{
if (cfg_var.second.is_string())
{
osd_config[cfg_var.first] = cfg_var.second.string_value();
}
else
{
osd_config[cfg_var.first] = cfg_var.second.dump();
}
}
}
parse_config(osd_config);
bind_socket();
acquire_lease();
@@ -369,7 +347,7 @@ void osd_t::acquire_lease()
etcd_lease_id = data["ID"].string_value();
create_osd_state();
});
printf("[OSD %lu] reporting to etcd at %s every %d seconds\n", this->osd_num, config["etcd_address"].string_value().c_str(), etcd_report_interval);
printf("[OSD %lu] reporting to etcd at %s every %d seconds\n", this->osd_num, config["etcd_address"].c_str(), etcd_report_interval);
tfd->set_timer(etcd_report_interval*1000, true, [this](int timer_id)
{
renew_lease();
@@ -615,7 +593,7 @@ void osd_t::apply_pg_config()
}
if (currently_taken)
{
if (pg_it->second.state & (PG_ACTIVE | PG_INCOMPLETE | PG_PEERING | PG_REPEERING))
if (pg_it->second.state & (PG_ACTIVE | PG_INCOMPLETE | PG_PEERING))
{
if (pg_it->second.target_set == pg_cfg.target_set)
{
@@ -684,9 +662,9 @@ void osd_t::apply_pg_config()
// Add peers
for (auto pg_osd: all_peers)
{
if (pg_osd != this->osd_num && msgr.osd_peer_fds.find(pg_osd) == msgr.osd_peer_fds.end())
if (pg_osd != this->osd_num && c_cli.osd_peer_fds.find(pg_osd) == c_cli.osd_peer_fds.end())
{
msgr.connect_peer(pg_osd, st_cli.peer_states[pg_osd]);
c_cli.connect_peer(pg_osd, st_cli.peer_states[pg_osd]);
}
}
start_pg_peering(pg);
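
For reference, report_statistics() above sends one etcd transaction whose success branch is a list of request_put operations, one object per key — a single JSON object cannot hold several "request_put" keys at once. A sketch of building that shape with json11 (keys shown un-encoded; the real code base64-encodes keys and values):

#include <cstdint>
#include <map>
#include <string>
#include "json11/json11.hpp"

json11::Json make_stats_txn(const std::string & prefix, uint64_t osd_num,
    const json11::Json & stats, const json11::Json & space, const json11::Json & inode_ops)
{
    json11::Json::array success;
    std::map<std::string, json11::Json> puts = {
        { prefix+"/osd/stats/"+std::to_string(osd_num), stats },
        { prefix+"/osd/space/"+std::to_string(osd_num), space },
        { prefix+"/osd/inodestats/"+std::to_string(osd_num), inode_ops },
    };
    for (auto & kv: puts)
    {
        // One request_put object per key
        success.push_back(json11::Json::object {
            { "request_put", json11::Json::object {
                { "key", kv.first },
                { "value", kv.second.dump() },
            } },
        });
    }
    return json11::Json::object { { "success", success } };
}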

View File

@@ -82,10 +82,10 @@ void osd_t::handle_flush_op(bool rollback, pool_id_t pool_id, pg_num_t pg_num, p
else
{
printf("Error while doing flush on OSD %lu: %d (%s)\n", osd_num, retval, strerror(-retval));
auto fd_it = msgr.osd_peer_fds.find(peer_osd);
if (fd_it != msgr.osd_peer_fds.end())
auto fd_it = c_cli.osd_peer_fds.find(peer_osd);
if (fd_it != c_cli.osd_peer_fds.end())
{
msgr.stop_client(fd_it->second);
c_cli.stop_client(fd_it->second);
}
return;
}
@@ -149,14 +149,10 @@ void osd_t::handle_flush_op(bool rollback, pool_id_t pool_id, pg_num_t pg_num, p
{
continue_primary_write(op);
}
if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch)
if (pg.inflight == 0 && (pg.state & PG_STOPPING))
{
finish_stop_pg(pg);
}
else if ((pg.state & PG_REPEERING) && pg.inflight == 0 && !pg.flush_batch)
{
start_pg_peering(pg);
}
}
}
@@ -188,7 +184,7 @@ void osd_t::submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t
else
{
// Peer
int peer_fd = msgr.osd_peer_fds[peer_osd];
int peer_fd = c_cli.osd_peer_fds[peer_osd];
op->op_type = OSD_OP_OUT;
op->iov.push_back(op->buf, count * sizeof(obj_ver_id));
op->peer_fd = peer_fd;
@@ -196,7 +192,7 @@ void osd_t::submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t
.sec_stab = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = msgr.next_subop_id++,
.id = c_cli.next_subop_id++,
.opcode = (uint64_t)(rollback ? OSD_OP_SEC_ROLLBACK : OSD_OP_SEC_STABILIZE),
},
.len = count * sizeof(obj_ver_id),
@@ -207,7 +203,7 @@ void osd_t::submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t
handle_flush_op(op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK, pool_id, pg_num, fb, peer_osd, op->reply.hdr.retval);
delete op;
};
msgr.outbox_push(op);
c_cli.outbox_push(op);
}
}
@@ -235,8 +231,7 @@ bool osd_t::pick_next_recovery(osd_recovery_op_t &op)
{
for (auto pg_it = pgs.begin(); pg_it != pgs.end(); pg_it++)
{
// Don't try to "recover" misplaced objects if "recovery" would make them degraded
if ((pg_it->second.state & (PG_ACTIVE | PG_DEGRADED | PG_HAS_MISPLACED)) == (PG_ACTIVE | PG_HAS_MISPLACED))
if ((pg_it->second.state & (PG_ACTIVE | PG_HAS_MISPLACED)) == (PG_ACTIVE | PG_HAS_MISPLACED))
{
for (auto obj_it = pg_it->second.misplaced_objects.begin(); obj_it != pg_it->second.misplaced_objects.end(); obj_it++)
{

View File

@@ -29,13 +29,13 @@ int main(int narg, char *args[])
perror("BUG: too small packet size");
return 1;
}
json11::Json::object config;
blockstore_config_t config;
for (int i = 1; i < narg; i++)
{
if (args[i][0] == '-' && args[i][1] == '-' && i < narg-1)
{
char *opt = args[i]+2;
config[std::string(opt)] = std::string(args[++i]);
config[opt] = args[++i];
}
}
signal(SIGINT, handle_sigint);

View File

@@ -35,7 +35,6 @@
#define MEM_ALIGNMENT 512
#endif
#define OSD_RW_MAX 64*1024*1024
#define OSD_PROTOCOL_VERSION 1
// common request and reply headers
struct __attribute__((__packed__)) osd_op_header_t
@@ -148,8 +147,6 @@ struct __attribute__((__packed__)) osd_reply_sec_read_bmp_t
struct __attribute__((__packed__)) osd_op_show_config_t
{
osd_op_header_t header;
// JSON request length
uint64_t json_len;
};
struct __attribute__((__packed__)) osd_reply_show_config_t
@@ -187,13 +184,6 @@ struct __attribute__((__packed__)) osd_op_rw_t
uint64_t offset;
// length
uint32_t len;
// flags (for future)
uint32_t flags;
// inode metadata revision
uint64_t meta_revision;
// object version for atomic "CAS" (compare-and-set) writes
// writes and deletes fail with -EINTR if object version differs from (version-1)
uint64_t version;
};
struct __attribute__((__packed__)) osd_reply_rw_t
@@ -202,8 +192,6 @@ struct __attribute__((__packed__)) osd_reply_rw_t
// for reads: bitmap length
uint32_t bitmap_len;
uint32_t pad0;
// for reads: object version
uint64_t version;
};
// sync to the primary OSD
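
All of these on-wire structures are declared __attribute__((__packed__)) so the bytes written to the socket carry no compiler-inserted padding. A minimal sketch mirroring osd_op_header_t:

#include <cstdint>

struct __attribute__((__packed__)) hdr_t
{
    uint64_t magic;   // protocol magic
    uint64_t id;      // operation id, echoed back in the reply
    uint64_t opcode;
};

static_assert(sizeof(hdr_t) == 24, "no padding allowed on the wire");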

View File

@@ -77,11 +77,10 @@ void osd_t::repeer_pgs(osd_num_t peer_osd)
// Re-peer affected PGs
for (auto & p: pgs)
{
auto & pg = p.second;
bool repeer = false;
if (pg.state & (PG_PEERING | PG_ACTIVE | PG_INCOMPLETE))
if (p.second.state & (PG_PEERING | PG_ACTIVE | PG_INCOMPLETE))
{
for (osd_num_t pg_osd: pg.all_peers)
for (osd_num_t pg_osd: p.second.all_peers)
{
if (pg_osd == peer_osd)
{
@@ -92,17 +91,8 @@ void osd_t::repeer_pgs(osd_num_t peer_osd)
if (repeer)
{
// Repeer this pg
printf("[PG %u/%u] Repeer because of OSD %lu\n", pg.pool_id, pg.pg_num, peer_osd);
if (!(pg.state & (PG_ACTIVE | PG_REPEERING)) || pg.inflight == 0 && !pg.flush_batch)
{
start_pg_peering(pg);
}
else
{
// Stop accepting new operations, wait for current ones to finish or fail
pg.state = pg.state & ~PG_ACTIVE | PG_REPEERING;
report_pg_state(pg);
}
printf("[PG %u/%u] Repeer because of OSD %lu\n", p.second.pool_id, p.second.pg_num, peer_osd);
start_pg_peering(p.second);
}
}
}
@@ -156,7 +146,7 @@ void osd_t::start_pg_peering(pg_t & pg)
if (immediate_commit != IMMEDIATE_ALL)
{
std::vector<int> to_stop;
for (auto & cp: msgr.clients)
for (auto & cp: c_cli.clients)
{
if (cp.second->dirty_pgs.find({ .pool_id = pg.pool_id, .pg_num = pg.pg_num }) != cp.second->dirty_pgs.end())
{
@@ -165,7 +155,7 @@ void osd_t::start_pg_peering(pg_t & pg)
}
for (auto peer_fd: to_stop)
{
msgr.stop_client(peer_fd);
c_cli.stop_client(peer_fd);
}
}
// Calculate current write OSD set
@@ -175,7 +165,7 @@ void osd_t::start_pg_peering(pg_t & pg)
for (int role = 0; role < pg.target_set.size(); role++)
{
pg.cur_set[role] = pg.target_set[role] == this->osd_num ||
msgr.osd_peer_fds.find(pg.target_set[role]) != msgr.osd_peer_fds.end() ? pg.target_set[role] : 0;
c_cli.osd_peer_fds.find(pg.target_set[role]) != c_cli.osd_peer_fds.end() ? pg.target_set[role] : 0;
if (pg.cur_set[role] != 0)
{
pg.pg_cursize++;
@@ -199,7 +189,7 @@ void osd_t::start_pg_peering(pg_t & pg)
{
found = false;
if (history_osd == this->osd_num ||
msgr.osd_peer_fds.find(history_osd) != msgr.osd_peer_fds.end())
c_cli.osd_peer_fds.find(history_osd) != c_cli.osd_peer_fds.end())
{
found = true;
break;
@@ -223,13 +213,13 @@ void osd_t::start_pg_peering(pg_t & pg)
std::set<osd_num_t> cur_peers;
for (auto pg_osd: pg.all_peers)
{
if (pg_osd == this->osd_num || msgr.osd_peer_fds.find(pg_osd) != msgr.osd_peer_fds.end())
if (pg_osd == this->osd_num || c_cli.osd_peer_fds.find(pg_osd) != c_cli.osd_peer_fds.end())
{
cur_peers.insert(pg_osd);
}
else if (msgr.wanted_peers.find(pg_osd) == msgr.wanted_peers.end())
else if (c_cli.wanted_peers.find(pg_osd) == c_cli.wanted_peers.end())
{
msgr.connect_peer(pg_osd, st_cli.peer_states[pg_osd]);
c_cli.connect_peer(pg_osd, st_cli.peer_states[pg_osd]);
}
}
pg.cur_peers.insert(pg.cur_peers.begin(), cur_peers.begin(), cur_peers.end());
@@ -325,7 +315,7 @@ void osd_t::submit_sync_and_list_subop(osd_num_t role_osd, pg_peering_state_t *p
else
{
// Peer
auto & cl = msgr.clients.at(msgr.osd_peer_fds[role_osd]);
auto & cl = c_cli.clients.at(c_cli.osd_peer_fds[role_osd]);
osd_op_t *op = new osd_op_t();
op->op_type = OSD_OP_OUT;
op->peer_fd = cl->peer_fd;
@@ -333,7 +323,7 @@ void osd_t::submit_sync_and_list_subop(osd_num_t role_osd, pg_peering_state_t *p
.sec_sync = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = msgr.next_subop_id++,
.id = c_cli.next_subop_id++,
.opcode = OSD_OP_SEC_SYNC,
},
},
@@ -344,17 +334,16 @@ void osd_t::submit_sync_and_list_subop(osd_num_t role_osd, pg_peering_state_t *p
{
// FIXME: Mark peer as failed and don't reconnect immediately after dropping the connection
printf("Failed to sync OSD %lu: %ld (%s), disconnecting peer\n", role_osd, op->reply.hdr.retval, strerror(-op->reply.hdr.retval));
int fail_fd = op->peer_fd;
ps->list_ops.erase(role_osd);
c_cli.stop_client(op->peer_fd);
delete op;
msgr.stop_client(fail_fd);
return;
}
delete op;
ps->list_ops.erase(role_osd);
submit_list_subop(role_osd, ps);
};
msgr.outbox_push(op);
c_cli.outbox_push(op);
ps->list_ops[role_osd] = op;
}
}
@@ -404,12 +393,12 @@ void osd_t::submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps)
// Peer
osd_op_t *op = new osd_op_t();
op->op_type = OSD_OP_OUT;
op->peer_fd = msgr.osd_peer_fds[role_osd];
op->peer_fd = c_cli.osd_peer_fds[role_osd];
op->req = (osd_any_op_t){
.sec_list = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = msgr.next_subop_id++,
.id = c_cli.next_subop_id++,
.opcode = OSD_OP_SEC_LIST,
},
.list_pg = ps->pg_num,
@@ -424,10 +413,9 @@ void osd_t::submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps)
if (op->reply.hdr.retval < 0)
{
printf("Failed to get object list from OSD %lu (retval=%ld), disconnecting peer\n", role_osd, op->reply.hdr.retval);
int fail_fd = op->peer_fd;
ps->list_ops.erase(role_osd);
c_cli.stop_client(op->peer_fd);
delete op;
msgr.stop_client(fail_fd);
return;
}
printf(
@@ -444,7 +432,7 @@ void osd_t::submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps)
ps->list_ops.erase(role_osd);
delete op;
};
msgr.outbox_push(op);
c_cli.outbox_push(op);
ps->list_ops[role_osd] = op;
}
}
@@ -496,13 +484,15 @@ bool osd_t::stop_pg(pg_t & pg)
{
return false;
}
if (!(pg.state & (PG_ACTIVE | PG_REPEERING)))
if (!(pg.state & PG_ACTIVE))
{
finish_stop_pg(pg);
return true;
}
pg.state = pg.state & ~PG_ACTIVE & ~PG_REPEERING | PG_STOPPING;
if (pg.inflight == 0 && !pg.flush_batch)
pg.state = pg.state & ~PG_ACTIVE | PG_STOPPING;
if (pg.inflight == 0 && !pg.flush_batch &&
// We must either forget all PG's unstable writes or wait for it to become clean
dirty_pgs.find({ .pool_id = pg.pool_id, .pg_num = pg.pg_num }) == dirty_pgs.end())
{
finish_stop_pg(pg);
}
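
The PG state transitions above lean on C operator precedence: & binds tighter than |, so `pg.state & ~PG_ACTIVE | PG_REPEERING` clears the ACTIVE bit and then sets REPEERING. A small check with hypothetical bit values:

#include <cassert>

const unsigned PG_ACTIVE    = 1 << 0;  // hypothetical bit assignments
const unsigned PG_REPEERING = 1 << 1;

int main()
{
    unsigned state = PG_ACTIVE;
    // Parsed as (state & ~PG_ACTIVE) | PG_REPEERING
    state = state & ~PG_ACTIVE | PG_REPEERING;
    assert(!(state & PG_ACTIVE) && (state & PG_REPEERING));
}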

View File

@@ -430,13 +430,12 @@ void pg_t::calc_object_states(int log_level)
void pg_t::print_state()
{
printf(
"[PG %u/%u] is %s%s%s%s%s%s%s%s%s%s%s%s%s%s (%lu objects)\n", pool_id, pg_num,
"[PG %u/%u] is %s%s%s%s%s%s%s%s%s%s%s%s%s (%lu objects)\n", pool_id, pg_num,
(state & PG_STARTING) ? "starting" : "",
(state & PG_OFFLINE) ? "offline" : "",
(state & PG_PEERING) ? "peering" : "",
(state & PG_INCOMPLETE) ? "incomplete" : "",
(state & PG_ACTIVE) ? "active" : "",
(state & PG_REPEERING) ? "repeering" : "",
(state & PG_STOPPING) ? "stopping" : "",
(state & PG_DEGRADED) ? " + degraded" : "",
(state & PG_HAS_INCOMPLETE) ? " + has_incomplete" : "",

File diff suppressed because it is too large

View File

@@ -31,31 +31,15 @@ struct osd_primary_op_data_t
uint64_t *prev_set = NULL;
pg_osd_set_state_t *object_state = NULL;
union
{
struct
{
// for sync. oops, requires freeing
std::vector<unstable_osd_num_t> *unstable_write_osds;
pool_pg_num_t *dirty_pgs;
int dirty_pg_count;
osd_num_t *dirty_osds;
int dirty_osd_count;
obj_ver_id *unstable_writes;
obj_ver_osd_t *copies_to_delete;
int copies_to_delete_count;
};
struct
{
// for read_bitmaps
void *snapshot_bitmaps;
inode_t *read_chain;
uint8_t *missing_flags;
int chain_size;
osd_chain_read_t *chain_reads;
int chain_read_count;
};
};
// for sync. oops, requires freeing
std::vector<unstable_osd_num_t> *unstable_write_osds = NULL;
pool_pg_num_t *dirty_pgs = NULL;
int dirty_pg_count = 0;
osd_num_t *dirty_osds = NULL;
int dirty_osd_count = 0;
obj_ver_id *unstable_writes = NULL;
obj_ver_osd_t *copies_to_delete = NULL;
int copies_to_delete_count = 0;
};
bool contains_osd(osd_num_t *osd_set, uint64_t size, osd_num_t osd_num);

View File

@@ -1,564 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include "osd_primary.h"
#include "allocator.h"
void osd_t::continue_chained_read(osd_op_t *cur_op)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
auto & pg = pgs.at({ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num });
if (op_data->st == 1)
goto resume_1;
else if (op_data->st == 2)
goto resume_2;
else if (op_data->st == 3)
goto resume_3;
else if (op_data->st == 4)
goto resume_4;
cur_op->reply.rw.bitmap_len = 0;
for (int role = 0; role < op_data->pg_data_size; role++)
{
op_data->stripes[role].read_start = op_data->stripes[role].req_start;
op_data->stripes[role].read_end = op_data->stripes[role].req_end;
}
resume_1:
resume_2:
// Read bitmaps
if (read_bitmaps(cur_op, pg, 1) != 0)
return;
// Prepare & submit reads
if (submit_chained_read_requests(pg, cur_op) != 0)
return;
if (op_data->n_subops > 0)
{
// Wait for reads
op_data->st = 3;
resume_3:
return;
}
resume_4:
if (op_data->errors > 0)
{
free(op_data->chain_reads);
op_data->chain_reads = NULL;
finish_op(cur_op, op_data->epipe > 0 ? -EPIPE : -EIO);
return;
}
send_chained_read_results(pg, cur_op);
finish_op(cur_op, cur_op->req.rw.len);
}
int osd_t::read_bitmaps(osd_op_t *cur_op, pg_t & pg, int base_state)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
if (op_data->st == base_state)
goto resume_0;
else if (op_data->st == base_state+1)
goto resume_1;
if (pg.state == PG_ACTIVE && pg.scheme == POOL_SCHEME_REPLICATED)
{
// Happy path for clean replicated PGs (all bitmaps are available locally)
for (int chain_num = 0; chain_num < op_data->chain_size; chain_num++)
{
object_id cur_oid = { .inode = op_data->read_chain[chain_num], .stripe = op_data->oid.stripe };
auto vo_it = pg.ver_override.find(cur_oid);
auto read_version = (vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX);
// Read bitmap synchronously from the local database
bs->read_bitmap(
cur_oid, read_version, op_data->snapshot_bitmaps + chain_num*clean_entry_bitmap_size,
!chain_num ? &cur_op->reply.rw.version : NULL
);
}
}
else
{
if (submit_bitmap_subops(cur_op, pg) < 0)
{
// Failure
finish_op(cur_op, -EIO);
return -1;
}
resume_0:
if (op_data->n_subops > 0)
{
// Wait for subops
op_data->st = base_state;
return 1;
}
resume_1:
if (pg.scheme != POOL_SCHEME_REPLICATED)
{
for (int chain_num = 0; chain_num < op_data->chain_size; chain_num++)
{
// Check if we need to reconstruct any bitmaps
for (int i = 0; i < pg.pg_size; i++)
{
if (op_data->missing_flags[chain_num*pg.pg_size + i])
{
osd_rmw_stripe_t local_stripes[pg.pg_size] = { 0 };
for (i = 0; i < pg.pg_size; i++)
{
local_stripes[i].missing = op_data->missing_flags[chain_num*pg.pg_size + i] && true;
local_stripes[i].bmp_buf = op_data->snapshot_bitmaps + (chain_num*pg.pg_size + i)*clean_entry_bitmap_size;
local_stripes[i].read_start = local_stripes[i].read_end = 1;
}
if (pg.scheme == POOL_SCHEME_XOR)
{
reconstruct_stripes_xor(local_stripes, pg.pg_size, clean_entry_bitmap_size);
}
else if (pg.scheme == POOL_SCHEME_JERASURE)
{
reconstruct_stripes_jerasure(local_stripes, pg.pg_size, pg.pg_data_size, clean_entry_bitmap_size);
}
break;
}
}
}
}
}
return 0;
}
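
read_bitmaps() follows the goto-based resume pattern used throughout the primary OSD code: op_data->st records how far the operation got, and re-entry jumps straight to the matching label once suboperations complete. The pattern boiled down to its skeleton:

struct op_state_t { int st = 0; };

// Returns 1 if the operation must wait and be re-entered later, 0 when done.
int resumable_step(op_state_t *op)
{
    if (op->st == 1)
        goto resume_1;
    // ... submit asynchronous suboperations ...
    op->st = 1;
    return 1; // wait; the completion handler re-enters this function
resume_1:
    // ... all suboperations have completed, finish up ...
    op->st = 0;
    return 0;
}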
int osd_t::collect_bitmap_requests(osd_op_t *cur_op, pg_t & pg, std::vector<bitmap_request_t> & bitmap_requests)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
for (int chain_num = 0; chain_num < op_data->chain_size; chain_num++)
{
object_id cur_oid = { .inode = op_data->read_chain[chain_num], .stripe = op_data->oid.stripe };
auto vo_it = pg.ver_override.find(cur_oid);
uint64_t target_version = vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX;
pg_osd_set_state_t *object_state;
uint64_t* cur_set = get_object_osd_set(pg, cur_oid, pg.cur_set.data(), &object_state);
if (pg.scheme == POOL_SCHEME_REPLICATED)
{
osd_num_t read_target = 0;
for (int i = 0; i < pg.pg_size; i++)
{
if (cur_set[i] == this->osd_num || cur_set[i] != 0 && read_target == 0)
{
// Select local or any other available OSD for reading
read_target = cur_set[i];
}
}
assert(read_target != 0);
bitmap_requests.push_back((bitmap_request_t){
.osd_num = read_target,
.oid = cur_oid,
.version = target_version,
.bmp_buf = op_data->snapshot_bitmaps + chain_num*clean_entry_bitmap_size,
});
}
else
{
osd_rmw_stripe_t local_stripes[pg.pg_size];
memcpy(local_stripes, op_data->stripes, sizeof(osd_rmw_stripe_t) * pg.pg_size);
if (extend_missing_stripes(local_stripes, cur_set, pg.pg_data_size, pg.pg_size) < 0)
{
free(op_data->snapshot_bitmaps);
return -1;
}
int need_at_least = 0;
for (int i = 0; i < pg.pg_size; i++)
{
if (local_stripes[i].read_end != 0 && cur_set[i] == 0)
{
// We need this part of the bitmap, but it's unavailable
need_at_least = pg.pg_data_size;
op_data->missing_flags[chain_num*pg.pg_size + i] = 1;
}
else
{
op_data->missing_flags[chain_num*pg.pg_size + i] = 0;
}
}
int found = 0;
for (int i = 0; i < pg.pg_size; i++)
{
if (cur_set[i] != 0 && (local_stripes[i].read_end != 0 || found < need_at_least))
{
// Read part of the bitmap
bitmap_requests.push_back((bitmap_request_t){
.osd_num = cur_set[i],
.oid = {
.inode = cur_oid.inode,
.stripe = cur_oid.stripe | i,
},
.version = target_version,
.bmp_buf = op_data->snapshot_bitmaps + (chain_num*pg.pg_size + i)*clean_entry_bitmap_size,
});
found++;
}
}
// Already checked by extend_missing_stripes, so it's fine to use assert
assert(found >= need_at_least);
}
}
std::sort(bitmap_requests.begin(), bitmap_requests.end());
return 0;
}
int osd_t::submit_bitmap_subops(osd_op_t *cur_op, pg_t & pg)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
std::vector<bitmap_request_t> *bitmap_requests = new std::vector<bitmap_request_t>();
if (collect_bitmap_requests(cur_op, pg, *bitmap_requests) < 0)
{
return -1;
}
op_data->n_subops = 0;
for (int i = 0; i < bitmap_requests->size(); i++)
{
if ((i == bitmap_requests->size()-1 || (*bitmap_requests)[i+1].osd_num != (*bitmap_requests)[i].osd_num) &&
(*bitmap_requests)[i].osd_num != this->osd_num)
{
op_data->n_subops++;
}
}
if (op_data->n_subops)
{
op_data->fact_ver = 0;
op_data->done = op_data->errors = 0;
op_data->subops = new osd_op_t[op_data->n_subops];
}
for (int i = 0, subop_idx = 0, prev = 0; i < bitmap_requests->size(); i++)
{
if (i == bitmap_requests->size()-1 || (*bitmap_requests)[i+1].osd_num != (*bitmap_requests)[i].osd_num)
{
osd_num_t subop_osd_num = (*bitmap_requests)[i].osd_num;
if (subop_osd_num == this->osd_num)
{
// Read bitmap synchronously from the local database
for (int j = prev; j <= i; j++)
{
bs->read_bitmap(
(*bitmap_requests)[j].oid, (*bitmap_requests)[j].version, (*bitmap_requests)[j].bmp_buf,
(*bitmap_requests)[j].oid.inode == cur_op->req.rw.inode ? &cur_op->reply.rw.version : NULL
);
}
}
else
{
// Send to a remote OSD
osd_op_t *subop = op_data->subops+subop_idx;
subop->op_type = OSD_OP_OUT;
subop->peer_fd = msgr.osd_peer_fds.at(subop_osd_num);
// FIXME: Use the pre-allocated buffer
subop->buf = malloc_or_die(sizeof(obj_ver_id)*(i+1-prev));
subop->req = (osd_any_op_t){
.sec_read_bmp = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = msgr.next_subop_id++,
.opcode = OSD_OP_SEC_READ_BMP,
},
.len = sizeof(obj_ver_id)*(i+1-prev),
}
};
obj_ver_id *ov = (obj_ver_id*)subop->buf;
for (int j = prev; j <= i; j++, ov++)
{
ov->oid = (*bitmap_requests)[j].oid;
ov->version = (*bitmap_requests)[j].version;
}
subop->callback = [cur_op, bitmap_requests, prev, i, this](osd_op_t *subop)
{
int requested_count = subop->req.sec_read_bmp.len / sizeof(obj_ver_id);
if (subop->reply.hdr.retval == requested_count * (8 + clean_entry_bitmap_size))
{
void *cur_buf = subop->buf + 8;
for (int j = prev; j <= i; j++)
{
memcpy((*bitmap_requests)[j].bmp_buf, cur_buf, clean_entry_bitmap_size);
if ((*bitmap_requests)[j].oid.inode == cur_op->req.rw.inode)
{
memcpy(&cur_op->reply.rw.version, cur_buf-8, 8);
}
cur_buf += 8 + clean_entry_bitmap_size;
}
}
if ((cur_op->op_data->errors + cur_op->op_data->done + 1) >= cur_op->op_data->n_subops)
{
delete bitmap_requests;
}
handle_primary_subop(subop, cur_op);
};
msgr.outbox_push(subop);
subop_idx++;
}
prev = i+1;
}
}
if (!op_data->n_subops)
{
delete bitmap_requests;
}
return 0;
}
std::vector<osd_chain_read_t> osd_t::collect_chained_read_requests(osd_op_t *cur_op)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
std::vector<osd_chain_read_t> chain_reads;
int stripe_count = (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : op_data->pg_size);
memset(op_data->stripes[0].bmp_buf, 0, stripe_count * clean_entry_bitmap_size);
uint8_t *global_bitmap = (uint8_t*)op_data->stripes[0].bmp_buf;
// We always use at most 1 read request per layer
for (int chain_pos = 0; chain_pos < op_data->chain_size; chain_pos++)
{
uint8_t *part_bitmap = ((uint8_t*)op_data->snapshot_bitmaps) + chain_pos*stripe_count*clean_entry_bitmap_size;
int start = (cur_op->req.rw.offset - op_data->oid.stripe)/bs_bitmap_granularity;
int end = start + cur_op->req.rw.len/bs_bitmap_granularity;
// Skip unneeded part in the beginning
while (start < end && (
((global_bitmap[start>>3] >> (start&7)) & 1) ||
!((part_bitmap[start>>3] >> (start&7)) & 1)))
{
start++;
}
// Skip unneeded part in the end
while (start < end && (
((global_bitmap[(end-1)>>3] >> ((end-1)&7)) & 1) ||
!((part_bitmap[(end-1)>>3] >> ((end-1)&7)) & 1)))
{
end--;
}
if (start < end)
{
// Copy (OR) bits in between
int cur = start;
for (; cur < end && (cur & 0x7); cur++)
{
global_bitmap[cur>>3] = global_bitmap[cur>>3] | (part_bitmap[cur>>3] & (1 << (cur&7)));
}
for (; cur <= end-8; cur += 8)
{
global_bitmap[cur>>3] = global_bitmap[cur>>3] | part_bitmap[cur>>3];
}
for (; cur < end; cur++)
{
global_bitmap[cur>>3] = global_bitmap[cur>>3] | (part_bitmap[cur>>3] & (1 << (cur&7)));
}
// Add request
chain_reads.push_back((osd_chain_read_t){
.chain_pos = chain_pos,
.inode = op_data->read_chain[chain_pos],
.offset = start*bs_bitmap_granularity,
.len = (end-start)*bs_bitmap_granularity,
});
}
}
return chain_reads;
}
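
collect_chained_read_requests() ORs each layer's bitmap into the global one bit by bit at the unaligned edges and a whole byte at a time in the middle. The same merge in isolation (a sketch; indices are in units of bs_bitmap_granularity):

#include <cstdint>

// OR bits [start, end) of `part` into `global`
void or_bits(uint8_t *global, const uint8_t *part, int start, int end)
{
    int cur = start;
    for (; cur < end && (cur & 7); cur++)            // leading partial byte
        global[cur>>3] |= part[cur>>3] & (1 << (cur&7));
    for (; cur <= end-8; cur += 8)                   // whole bytes
        global[cur>>3] |= part[cur>>3];
    for (; cur < end; cur++)                         // trailing partial byte
        global[cur>>3] |= part[cur>>3] & (1 << (cur&7));
}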
int osd_t::submit_chained_read_requests(pg_t & pg, osd_op_t *cur_op)
{
// Decide which parts of which objects we need to read based on bitmaps
osd_primary_op_data_t *op_data = cur_op->op_data;
auto chain_reads = collect_chained_read_requests(cur_op);
int stripe_count = (pg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size);
op_data->chain_read_count = chain_reads.size();
op_data->chain_reads = (osd_chain_read_t*)calloc_or_die(
1, sizeof(osd_chain_read_t) * chain_reads.size()
+ sizeof(osd_rmw_stripe_t) * stripe_count * op_data->chain_size
);
osd_rmw_stripe_t *chain_stripes = (osd_rmw_stripe_t*)(
((void*)op_data->chain_reads) + sizeof(osd_chain_read_t) * op_data->chain_read_count
);
// Now process each subrequest as a separate read, including reconstruction if needed
// Prepare reads
int n_subops = 0;
uint64_t read_buffer_size = 0;
for (int cri = 0; cri < chain_reads.size(); cri++)
{
op_data->chain_reads[cri] = chain_reads[cri];
object_id cur_oid = { .inode = chain_reads[cri].inode, .stripe = op_data->oid.stripe };
// FIXME: maybe introduce split_read_stripes to shorten these lines and to remove read_start=req_start
osd_rmw_stripe_t *stripes = chain_stripes + chain_reads[cri].chain_pos*stripe_count;
split_stripes(pg.pg_data_size, bs_block_size, chain_reads[cri].offset, chain_reads[cri].len, stripes);
if (op_data->scheme == POOL_SCHEME_REPLICATED && !stripes[0].req_end)
{
continue;
}
for (int role = 0; role < op_data->pg_data_size; role++)
{
stripes[role].read_start = stripes[role].req_start;
stripes[role].read_end = stripes[role].req_end;
}
uint64_t *cur_set = pg.cur_set.data();
if (pg.state != PG_ACTIVE && op_data->scheme != POOL_SCHEME_REPLICATED)
{
pg_osd_set_state_t *object_state;
cur_set = get_object_osd_set(pg, cur_oid, pg.cur_set.data(), &object_state);
if (extend_missing_stripes(stripes, cur_set, pg.pg_data_size, pg.pg_size) < 0)
{
free(op_data->chain_reads);
op_data->chain_reads = NULL;
finish_op(cur_op, -EIO);
return -1;
}
op_data->degraded = 1;
}
if (op_data->scheme == POOL_SCHEME_REPLICATED)
{
n_subops++;
read_buffer_size += stripes[0].read_end - stripes[0].read_start;
}
else
{
for (int role = 0; role < pg.pg_size; role++)
{
if (stripes[role].read_end > 0 && cur_set[role] != 0)
n_subops++;
if (stripes[role].read_end > 0)
read_buffer_size += stripes[role].read_end - stripes[role].read_start;
}
}
}
cur_op->buf = memalign_or_die(MEM_ALIGNMENT, read_buffer_size);
void *cur_buf = cur_op->buf;
for (int cri = 0; cri < chain_reads.size(); cri++)
{
osd_rmw_stripe_t *stripes = chain_stripes + chain_reads[cri].chain_pos*stripe_count;
for (int role = 0; role < stripe_count; role++)
{
if (stripes[role].read_end > 0)
{
stripes[role].read_buf = cur_buf;
stripes[role].bmp_buf = op_data->snapshot_bitmaps + (chain_reads[cri].chain_pos*stripe_count + role)*clean_entry_bitmap_size;
cur_buf += stripes[role].read_end - stripes[role].read_start;
}
}
}
// Submit all reads
op_data->fact_ver = UINT64_MAX;
op_data->done = op_data->errors = 0;
op_data->n_subops = n_subops;
if (!n_subops)
{
return 0;
}
op_data->subops = new osd_op_t[n_subops];
int cur_subops = 0;
for (int cri = 0; cri < chain_reads.size(); cri++)
{
osd_rmw_stripe_t *stripes = chain_stripes + chain_reads[cri].chain_pos*stripe_count;
if (op_data->scheme == POOL_SCHEME_REPLICATED && !stripes[0].req_end)
{
continue;
}
object_id cur_oid = { .inode = chain_reads[cri].inode, .stripe = op_data->oid.stripe };
auto vo_it = pg.ver_override.find(cur_oid);
uint64_t target_ver = vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX;
uint64_t *cur_set = pg.cur_set.data();
if (pg.state != PG_ACTIVE && op_data->scheme != POOL_SCHEME_REPLICATED)
{
pg_osd_set_state_t *object_state;
cur_set = get_object_osd_set(pg, cur_oid, pg.cur_set.data(), &object_state);
}
int zero_read = -1;
if (op_data->scheme == POOL_SCHEME_REPLICATED)
{
for (int role = 0; role < op_data->pg_size; role++)
if (cur_set[role] == this->osd_num || zero_read == -1)
zero_read = role;
}
cur_subops += submit_primary_subop_batch(SUBMIT_READ, chain_reads[cri].inode, target_ver, stripes, cur_set, cur_op, cur_subops, zero_read);
}
assert(cur_subops == n_subops);
return 0;
}
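// Compose the client reply from the per-layer reads: reconstruct degraded stripes
// first, then merge the chain layers using their allocation bitmaps, zero-filling
// the ranges that no layer has written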
void osd_t::send_chained_read_results(pg_t & pg, osd_op_t *cur_op)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
int stripe_count = (pg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size);
osd_rmw_stripe_t *chain_stripes = (osd_rmw_stripe_t*)(
((void*)op_data->chain_reads) + sizeof(osd_chain_read_t) * op_data->chain_read_count
);
// Reconstruct parts if needed
if (op_data->degraded)
{
for (int cri = 0; cri < op_data->chain_read_count; cri++)
{
// Reconstruct missing stripes
osd_rmw_stripe_t *stripes = chain_stripes + op_data->chain_reads[cri].chain_pos*stripe_count;
if (op_data->scheme == POOL_SCHEME_XOR)
{
reconstruct_stripes_xor(stripes, pg.pg_size, clean_entry_bitmap_size);
}
else if (op_data->scheme == POOL_SCHEME_JERASURE)
{
reconstruct_stripes_jerasure(stripes, pg.pg_size, pg.pg_data_size, clean_entry_bitmap_size);
}
}
}
// Send bitmap
cur_op->reply.rw.bitmap_len = op_data->pg_data_size * clean_entry_bitmap_size;
cur_op->iov.push_back(op_data->stripes[0].bmp_buf, cur_op->reply.rw.bitmap_len);
// And finally compose the result
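// The merge below walks the request in bitmap-granularity units. For each unit it
// finds the first chain position whose bitmap has the bit set; whenever the
// (position, allocated) pair changes, the accumulated run is flushed as one iovec:
// a zero_buffer slice for unallocated runs, a slice of that layer's read buffer
// otherwise. E.g. with a 4 KB granularity, a 12 KB read whose middle unit is the
// only one written in some layer produces three iovecs: zeroes, data, zeroes.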
uint64_t sent = 0;
int prev_pos = 0, pos = 0;
bool prev_set = false;
int prev = (cur_op->req.rw.offset - op_data->oid.stripe) / bs_bitmap_granularity;
int end = prev + cur_op->req.rw.len/bs_bitmap_granularity;
int cur = prev;
while (cur <= end)
{
bool has_bit = false;
if (cur < end)
{
for (pos = 0; pos < op_data->chain_size; pos++)
{
has_bit = (((uint8_t*)op_data->snapshot_bitmaps)[pos*stripe_count*clean_entry_bitmap_size + cur/8] >> (cur%8)) & 1;
if (has_bit)
break;
}
}
if (has_bit != prev_set || pos != prev_pos || cur == end)
{
if (cur > prev)
{
// Send buffer in parts to avoid copying
if (!prev_set)
{
while ((cur-prev) > zero_buffer_size/bs_bitmap_granularity)
{
cur_op->iov.push_back(zero_buffer, zero_buffer_size);
sent += zero_buffer_size;
prev += zero_buffer_size/bs_bitmap_granularity;
}
cur_op->iov.push_back(zero_buffer, (cur-prev)*bs_bitmap_granularity);
sent += (cur-prev)*bs_bitmap_granularity;
}
else
{
osd_rmw_stripe_t *stripes = chain_stripes + prev_pos*stripe_count;
while (cur > prev)
{
int role = prev*bs_bitmap_granularity/bs_block_size;
int role_start = prev*bs_bitmap_granularity - role*bs_block_size;
int role_end = cur*bs_bitmap_granularity - role*bs_block_size;
if (role_end > bs_block_size)
role_end = bs_block_size;
assert(stripes[role].read_buf);
cur_op->iov.push_back(
stripes[role].read_buf + (role_start - stripes[role].read_start),
role_end - role_start
);
sent += role_end - role_start;
prev += (role_end - role_start)/bs_bitmap_granularity;
}
}
}
prev = cur;
prev_pos = pos;
prev_set = has_bit;
}
cur++;
}
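// Every granularity unit of the request has now been emitted either as data or as zeroes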
assert(sent == cur_op->req.rw.len);
free(op_data->chain_reads);
op_data->chain_reads = NULL;
}

View File

@@ -66,16 +66,17 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
auto & pg = pgs.at({ .pool_id = INODE_POOL(cur_op->op_data->oid.inode), .pg_num = cur_op->op_data->pg_num });
pg.inflight--;
assert(pg.inflight >= 0);
if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch)
if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch &&
// We must either forget all the PG's unstable writes or wait for the PG to become clean
dirty_pgs.find({ .pool_id = pg.pool_id, .pg_num = pg.pg_num }) == dirty_pgs.end())
{
finish_stop_pg(pg);
}
else if ((pg.state & PG_REPEERING) && pg.inflight == 0 && !pg.flush_batch)
{
start_pg_peering(pg);
}
}
assert(!cur_op->op_data->subops);
assert(!cur_op->op_data->unstable_write_osds);
assert(!cur_op->op_data->unstable_writes);
assert(!cur_op->op_data->dirty_pgs);
free(cur_op->op_data);
cur_op->op_data = NULL;
}
@@ -87,14 +88,14 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
else
{
// FIXME add separate magic number for primary ops
auto cl_it = msgr.clients.find(cur_op->peer_fd);
if (cl_it != msgr.clients.end())
auto cl_it = c_cli.clients.find(cur_op->peer_fd);
if (cl_it != c_cli.clients.end())
{
cur_op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
cur_op->reply.hdr.id = cur_op->req.hdr.id;
cur_op->reply.hdr.opcode = cur_op->req.hdr.opcode;
cur_op->reply.hdr.retval = retval;
msgr.outbox_push(cur_op);
c_cli.outbox_push(cur_op);
}
else
{
@@ -103,7 +104,7 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
}
}
void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, const uint64_t* osd_set, osd_op_t *cur_op)
void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op)
{
bool wr = submit_type == SUBMIT_WRITE;
osd_primary_op_data_t *op_data = cur_op->op_data;
@@ -111,34 +112,32 @@ void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, const ui
bool rep = op_data->scheme == POOL_SCHEME_REPLICATED;
// Allocate subops
int n_subops = 0, zero_read = -1;
for (int role = 0; role < op_data->pg_size; role++)
for (int role = 0; role < pg_size; role++)
{
if (osd_set[role] == this->osd_num || osd_set[role] != 0 && zero_read == -1)
{
zero_read = role;
}
if (osd_set[role] != 0 && (wr || !rep && stripes[role].read_end != 0))
{
n_subops++;
}
}
if (!n_subops && (submit_type == SUBMIT_RMW_READ || rep))
{
n_subops = 1;
}
else
{
zero_read = -1;
}
osd_op_t *subops = new osd_op_t[n_subops];
op_data->fact_ver = 0;
op_data->done = op_data->errors = 0;
op_data->n_subops = n_subops;
op_data->subops = subops;
int sent = submit_primary_subop_batch(submit_type, op_data->oid.inode, op_version, op_data->stripes, osd_set, cur_op, 0, zero_read);
assert(sent == n_subops);
}
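// submit_primary_subop_batch() carries the actual submission logic: plain operations
// call it once via submit_primary_subops(), while chained reads fill a single subops
// array with several per-inode batches (see submit_chained_read_requests above)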
int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t op_version,
osd_rmw_stripe_t *stripes, const uint64_t* osd_set, osd_op_t *cur_op, int subop_idx, int zero_read)
{
bool wr = submit_type == SUBMIT_WRITE;
osd_primary_op_data_t *op_data = cur_op->op_data;
bool rep = op_data->scheme == POOL_SCHEME_REPLICATED;
int i = subop_idx;
for (int role = 0; role < op_data->pg_size; role++)
int i = 0;
for (int role = 0; role < pg_size; role++)
{
// We always submit zero-length writes to all replicas, even if the stripe is not modified
if (!(wr || !rep && stripes[role].read_end != 0 || zero_read == role))
@@ -149,90 +148,99 @@ int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t o
if (role_osd_num != 0)
{
int stripe_num = rep ? 0 : role;
osd_op_t *subop = op_data->subops + i;
if (role_osd_num == this->osd_num)
{
clock_gettime(CLOCK_REALTIME, &subop->tv_begin);
subop->op_type = (uint64_t)cur_op;
subop->bitmap = stripes[stripe_num].bmp_buf;
subop->bitmap_len = clean_entry_bitmap_size;
subop->bs_op = new blockstore_op_t({
clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin);
subops[i].op_type = (uint64_t)cur_op;
subops[i].bitmap = stripes[stripe_num].bmp_buf;
subops[i].bitmap_len = clean_entry_bitmap_size;
subops[i].bs_op = new blockstore_op_t({
.opcode = (uint64_t)(wr ? (rep ? BS_OP_WRITE_STABLE : BS_OP_WRITE) : BS_OP_READ),
.callback = [subop, this](blockstore_op_t *bs_subop)
.callback = [subop = &subops[i], this](blockstore_op_t *bs_subop)
{
handle_primary_bs_subop(subop);
},
.oid = {
.inode = inode,
.inode = op_data->oid.inode,
.stripe = op_data->oid.stripe | stripe_num,
},
.version = op_version,
.offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start,
.len = wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start,
.offset = submit_type == SUBMIT_READ_BITMAPS ? 0
: (wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start),
.len = submit_type == SUBMIT_READ_BITMAPS ? 0
: (wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start),
.buf = wr ? stripes[stripe_num].write_buf : stripes[stripe_num].read_buf,
.bitmap = stripes[stripe_num].bmp_buf,
});
#ifdef OSD_DEBUG
printf(
"Submit %s to local: %lx:%lx v%lu %u-%u\n", wr ? "write" : "read",
inode, op_data->oid.stripe | stripe_num, op_version,
subop->bs_op->offset, subop->bs_op->len
op_data->oid.inode, op_data->oid.stripe | stripe_num, op_version,
subops[i].bs_op->offset, subops[i].bs_op->len
);
#endif
bs->enqueue_op(subop->bs_op);
bs->enqueue_op(subops[i].bs_op);
}
else
{
subop->op_type = OSD_OP_OUT;
subop->peer_fd = msgr.osd_peer_fds.at(role_osd_num);
subop->bitmap = stripes[stripe_num].bmp_buf;
subop->bitmap_len = clean_entry_bitmap_size;
subop->req.sec_rw = {
subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = c_cli.osd_peer_fds.at(role_osd_num);
subops[i].bitmap = stripes[stripe_num].bmp_buf;
subops[i].bitmap_len = clean_entry_bitmap_size;
subops[i].req.sec_rw = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = msgr.next_subop_id++,
.id = c_cli.next_subop_id++,
.opcode = (uint64_t)(wr ? (rep ? OSD_OP_SEC_WRITE_STABLE : OSD_OP_SEC_WRITE) : OSD_OP_SEC_READ),
},
.oid = {
.inode = inode,
.inode = op_data->oid.inode,
.stripe = op_data->oid.stripe | stripe_num,
},
.version = op_version,
.offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start,
.len = wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start,
.offset = submit_type == SUBMIT_READ_BITMAPS ? 0
: (wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start),
.len = submit_type == SUBMIT_READ_BITMAPS ? 0
: (wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start),
.attr_len = wr ? clean_entry_bitmap_size : 0,
};
#ifdef OSD_DEBUG
printf(
"Submit %s to osd %lu: %lx:%lx v%lu %u-%u\n", wr ? "write" : "read", role_osd_num,
inode, op_data->oid.stripe | stripe_num, op_version,
subop->req.sec_rw.offset, subop->req.sec_rw.len
op_data->oid.inode, op_data->oid.stripe | stripe_num, op_version,
subops[i].req.sec_rw.offset, subops[i].req.sec_rw.len
);
#endif
if (wr)
{
if (stripes[stripe_num].write_end > stripes[stripe_num].write_start)
{
subop->iov.push_back(stripes[stripe_num].write_buf, stripes[stripe_num].write_end - stripes[stripe_num].write_start);
subops[i].iov.push_back(stripes[stripe_num].write_buf, stripes[stripe_num].write_end - stripes[stripe_num].write_start);
}
}
else
else if (submit_type != SUBMIT_READ_BITMAPS)
{
if (stripes[stripe_num].read_end > stripes[stripe_num].read_start)
{
subop->iov.push_back(stripes[stripe_num].read_buf, stripes[stripe_num].read_end - stripes[stripe_num].read_start);
subops[i].iov.push_back(stripes[stripe_num].read_buf, stripes[stripe_num].read_end - stripes[stripe_num].read_start);
}
}
subop->callback = [cur_op, this](osd_op_t *subop)
subops[i].callback = [cur_op, this](osd_op_t *subop)
{
int fail_fd = subop->req.hdr.opcode == OSD_OP_SEC_WRITE &&
subop->reply.hdr.retval != subop->req.sec_rw.len ? subop->peer_fd : -1;
handle_primary_subop(subop, cur_op);
if (fail_fd >= 0)
{
// write operation failed, drop the connection
c_cli.stop_client(fail_fd);
}
};
msgr.outbox_push(subop);
c_cli.outbox_push(&subops[i]);
}
i++;
}
}
return i-subop_idx;
}
static uint64_t bs_op_to_osd_op[] = {
@@ -272,7 +280,6 @@ void osd_t::handle_primary_bs_subop(osd_op_t *subop)
}
delete bs_op;
subop->bs_op = NULL;
subop->peer_fd = -1;
handle_primary_subop(subop, cur_op);
}
@@ -282,20 +289,20 @@ void osd_t::add_bs_subop_stats(osd_op_t *subop)
uint64_t opcode = bs_op_to_osd_op[subop->bs_op->opcode];
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
msgr.stats.op_stat_count[opcode]++;
if (!msgr.stats.op_stat_count[opcode])
c_cli.stats.op_stat_count[opcode]++;
if (!c_cli.stats.op_stat_count[opcode])
{
msgr.stats.op_stat_count[opcode] = 1;
msgr.stats.op_stat_sum[opcode] = 0;
msgr.stats.op_stat_bytes[opcode] = 0;
c_cli.stats.op_stat_count[opcode] = 1;
c_cli.stats.op_stat_sum[opcode] = 0;
c_cli.stats.op_stat_bytes[opcode] = 0;
}
msgr.stats.op_stat_sum[opcode] += (
c_cli.stats.op_stat_sum[opcode] += (
(tv_end.tv_sec - subop->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - subop->tv_begin.tv_nsec)/1000
);
if (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE)
{
msgr.stats.op_stat_bytes[opcode] += subop->bs_op->len;
c_cli.stats.op_stat_bytes[opcode] += subop->bs_op->len;
}
}
@@ -303,13 +310,8 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
{
uint64_t opcode = subop->req.hdr.opcode;
int retval = subop->reply.hdr.retval;
int expected;
if (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE || opcode == OSD_OP_SEC_WRITE_STABLE)
expected = subop->req.sec_rw.len;
else if (opcode == OSD_OP_SEC_READ_BMP)
expected = subop->req.sec_read_bmp.len / sizeof(obj_ver_id) * (8 + clean_entry_bitmap_size);
else
expected = 0;
int expected = opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE
|| opcode == OSD_OP_SEC_WRITE_STABLE ? subop->req.sec_rw.len : 0;
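// For OSD_OP_SEC_READ_BMP the reply carries, per obj_ver_id in the request,
// an 8-byte version followed by one clean_entry_bitmap_size allocation bitmap,
// which is what the expected-length formula above accounts for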
osd_primary_op_data_t *op_data = cur_op->op_data;
if (retval != expected)
{
@@ -319,11 +321,6 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
op_data->epipe++;
}
op_data->errors++;
if (subop->peer_fd >= 0)
{
// Drop connection on any error
msgr.stop_client(subop->peer_fd);
}
}
else
{
@@ -332,21 +329,18 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
{
uint64_t version = subop->reply.sec_rw.version;
#ifdef OSD_DEBUG
uint64_t peer_osd = msgr.clients.find(subop->peer_fd) != msgr.clients.end()
? msgr.clients[subop->peer_fd]->osd_num : osd_num;
uint64_t peer_osd = c_cli.clients.find(subop->peer_fd) != c_cli.clients.end()
? c_cli.clients[subop->peer_fd]->osd_num : osd_num;
printf("subop %lu from osd %lu: version = %lu\n", opcode, peer_osd, version);
#endif
if (op_data->fact_ver != UINT64_MAX)
if (op_data->fact_ver != 0 && op_data->fact_ver != version)
{
if (op_data->fact_ver != 0 && op_data->fact_ver != version)
{
throw std::runtime_error(
"different fact_versions returned from "+std::string(osd_op_names[opcode])+
" subops: "+std::to_string(version)+" vs "+std::to_string(op_data->fact_ver)
);
}
op_data->fact_ver = version;
throw std::runtime_error(
"different fact_versions returned from "+std::string(osd_op_names[opcode])+
" subops: "+std::to_string(version)+" vs "+std::to_string(op_data->fact_ver)
);
}
op_data->fact_ver = version;
}
}
if ((op_data->errors + op_data->done) >= op_data->n_subops)
@@ -465,26 +459,32 @@ void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_
else
{
subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = msgr.osd_peer_fds.at(chunk.osd_num);
subops[i].req = (osd_any_op_t){ .sec_del = {
subops[i].peer_fd = c_cli.osd_peer_fds.at(chunk.osd_num);
subops[i].req.sec_del = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = msgr.next_subop_id++,
.id = c_cli.next_subop_id++,
.opcode = OSD_OP_SEC_DELETE,
},
.oid = chunk.oid,
.version = chunk.version,
} };
};
subops[i].callback = [cur_op, this](osd_op_t *subop)
{
int fail_fd = subop->reply.hdr.retval != 0 ? subop->peer_fd : -1;
handle_primary_subop(subop, cur_op);
if (fail_fd >= 0)
{
// delete operation failed, drop the connection
c_cli.stop_client(fail_fd);
}
};
msgr.outbox_push(&subops[i]);
c_cli.outbox_push(&subops[i]);
}
}
}
int osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
void osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
{
osd_primary_op_data_t *op_data = cur_op->op_data;
int n_osds = op_data->dirty_osd_count;
@@ -492,7 +492,6 @@ int osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
op_data->done = op_data->errors = 0;
op_data->n_subops = n_osds;
op_data->subops = subops;
std::map<uint64_t, int>::iterator peer_it;
for (int i = 0; i < n_osds; i++)
{
osd_num_t sync_osd = op_data->dirty_osds[i];
@@ -509,35 +508,30 @@ int osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
});
bs->enqueue_op(subops[i].bs_op);
}
else if ((peer_it = msgr.osd_peer_fds.find(sync_osd)) != msgr.osd_peer_fds.end())
{
subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = peer_it->second;
subops[i].req = (osd_any_op_t){ .sec_sync = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = msgr.next_subop_id++,
.opcode = OSD_OP_SEC_SYNC,
},
} };
subops[i].callback = [cur_op, this](osd_op_t *subop)
{
handle_primary_subop(subop, cur_op);
};
msgr.outbox_push(&subops[i]);
}
else
{
op_data->done++;
subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = c_cli.osd_peer_fds.at(sync_osd);
subops[i].req.sec_sync = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = c_cli.next_subop_id++,
.opcode = OSD_OP_SEC_SYNC,
},
};
subops[i].callback = [cur_op, this](osd_op_t *subop)
{
int fail_fd = subop->reply.hdr.retval != 0 ? subop->peer_fd : -1;
handle_primary_subop(subop, cur_op);
if (fail_fd >= 0)
{
// sync operation failed, drop the connection
c_cli.stop_client(fail_fd);
}
};
c_cli.outbox_push(&subops[i]);
}
}
if (op_data->done >= op_data->n_subops)
{
delete[] op_data->subops;
op_data->subops = NULL;
return 0;
}
return 1;
}
void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
@@ -569,21 +563,27 @@ void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
else
{
subops[i].op_type = OSD_OP_OUT;
subops[i].peer_fd = msgr.osd_peer_fds.at(stab_osd.osd_num);
subops[i].req = (osd_any_op_t){ .sec_stab = {
subops[i].peer_fd = c_cli.osd_peer_fds.at(stab_osd.osd_num);
subops[i].req.sec_stab = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = msgr.next_subop_id++,
.id = c_cli.next_subop_id++,
.opcode = OSD_OP_SEC_STABILIZE,
},
.len = (uint64_t)(stab_osd.len * sizeof(obj_ver_id)),
} };
};
subops[i].iov.push_back(op_data->unstable_writes + stab_osd.start, stab_osd.len * sizeof(obj_ver_id));
subops[i].callback = [cur_op, this](osd_op_t *subop)
{
int fail_fd = subop->reply.hdr.retval != 0 ? subop->peer_fd : -1;
handle_primary_subop(subop, cur_op);
if (fail_fd >= 0)
{
// sync operation failed, drop the connection
c_cli.stop_client(fail_fd);
}
};
msgr.outbox_push(&subops[i]);
c_cli.outbox_push(&subops[i]);
}
}
}
@@ -599,7 +599,7 @@ void osd_t::pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid,
return;
}
std::vector<osd_op_t*> cancel_ops;
while (it != pg.write_queue.end() && it->first == oid)
while (it != pg.write_queue.end())
{
cancel_ops.push_back(it->second);
it++;

View File

@@ -1,265 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include "osd_primary.h"
// Save and clear unstable_writes -> SYNC all -> STABLE all
void osd_t::continue_primary_sync(osd_op_t *cur_op)
{
if (!cur_op->op_data)
{
cur_op->op_data = (osd_primary_op_data_t*)calloc_or_die(1, sizeof(osd_primary_op_data_t));
}
osd_primary_op_data_t *op_data = cur_op->op_data;
if (op_data->st == 1) goto resume_1;
else if (op_data->st == 2) goto resume_2;
else if (op_data->st == 3) goto resume_3;
else if (op_data->st == 4) goto resume_4;
else if (op_data->st == 5) goto resume_5;
else if (op_data->st == 6) goto resume_6;
else if (op_data->st == 7) goto resume_7;
else if (op_data->st == 8) goto resume_8;
assert(op_data->st == 0);
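// op_data->st stores the resume point of this coroutine-style state machine:
// every asynchronous step saves its number into st and returns, and re-entry
// jumps straight to the matching resume_N label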
if (syncs_in_progress.size() > 0)
{
// Wait for previous syncs, if any
// FIXME: We may try to execute the current one in parallel, like in Blockstore, but I'm not sure if it matters at all
syncs_in_progress.push_back(cur_op);
op_data->st = 1;
resume_1:
return;
}
else
{
syncs_in_progress.push_back(cur_op);
}
resume_2:
if (dirty_osds.size() == 0)
{
// Nothing to sync
goto finish;
}
// Save and clear unstable_writes
// In theory it is possible to do this on a per-client basis, but that seems to be an unnecessary complication
// It would be cool not to copy these here at all, but someone has to deduplicate them by object IDs anyway
if (unstable_writes.size() > 0)
{
op_data->unstable_write_osds = new std::vector<unstable_osd_num_t>();
op_data->unstable_writes = new obj_ver_id[this->unstable_writes.size()];
osd_num_t last_osd = 0;
int last_start = 0, last_end = 0;
for (auto it = this->unstable_writes.begin(); it != this->unstable_writes.end(); it++)
{
if (last_osd != it->first.osd_num)
{
if (last_osd != 0)
{
op_data->unstable_write_osds->push_back((unstable_osd_num_t){
.osd_num = last_osd,
.start = last_start,
.len = last_end - last_start,
});
}
last_osd = it->first.osd_num;
last_start = last_end;
}
op_data->unstable_writes[last_end] = (obj_ver_id){
.oid = it->first.oid,
.version = it->second,
};
last_end++;
}
if (last_osd != 0)
{
op_data->unstable_write_osds->push_back((unstable_osd_num_t){
.osd_num = last_osd,
.start = last_start,
.len = last_end - last_start,
});
}
this->unstable_writes.clear();
}
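// The grouping above assumes unstable_writes is ordered by OSD number first,
// so each OSD's entries form one contiguous (start, len) range in the flat array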
{
void *dirty_buf = malloc_or_die(
sizeof(pool_pg_num_t)*dirty_pgs.size() +
sizeof(osd_num_t)*dirty_osds.size() +
sizeof(obj_ver_osd_t)*this->copies_to_delete_after_sync_count
);
op_data->dirty_pgs = (pool_pg_num_t*)dirty_buf;
op_data->dirty_osds = (osd_num_t*)(dirty_buf + sizeof(pool_pg_num_t)*dirty_pgs.size());
op_data->dirty_pg_count = dirty_pgs.size();
op_data->dirty_osd_count = dirty_osds.size();
if (this->copies_to_delete_after_sync_count)
{
op_data->copies_to_delete_count = 0;
op_data->copies_to_delete = (obj_ver_osd_t*)(op_data->dirty_osds + op_data->dirty_osd_count);
for (auto dirty_pg_num: dirty_pgs)
{
auto & pg = pgs.at(dirty_pg_num);
assert(pg.copies_to_delete_after_sync.size() <= this->copies_to_delete_after_sync_count);
memcpy(
op_data->copies_to_delete + op_data->copies_to_delete_count,
pg.copies_to_delete_after_sync.data(),
sizeof(obj_ver_osd_t)*pg.copies_to_delete_after_sync.size()
);
op_data->copies_to_delete_count += pg.copies_to_delete_after_sync.size();
this->copies_to_delete_after_sync_count -= pg.copies_to_delete_after_sync.size();
pg.copies_to_delete_after_sync.clear();
}
assert(this->copies_to_delete_after_sync_count == 0);
}
int dpg = 0;
for (auto dirty_pg_num: dirty_pgs)
{
pgs.at(dirty_pg_num).inflight++;
op_data->dirty_pgs[dpg++] = dirty_pg_num;
}
dirty_pgs.clear();
dpg = 0;
for (auto osd_num: dirty_osds)
{
op_data->dirty_osds[dpg++] = osd_num;
}
dirty_osds.clear();
}
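// dirty_pgs, dirty_osds and copies_to_delete all point into the single dirty_buf
// allocation, which is why only dirty_pgs is freed at the end of this function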
if (immediate_commit != IMMEDIATE_ALL)
{
// SYNC
if (!submit_primary_sync_subops(cur_op))
{
goto resume_4;
}
resume_3:
op_data->st = 3;
return;
resume_4:
if (op_data->errors > 0)
{
goto resume_6;
}
}
if (op_data->unstable_writes)
{
// Stabilize version sets, if any
submit_primary_stab_subops(cur_op);
resume_5:
op_data->st = 5;
return;
}
resume_6:
if (op_data->errors > 0)
{
// Return PGs and OSDs back into their dirty sets
for (int i = 0; i < op_data->dirty_pg_count; i++)
{
dirty_pgs.insert(op_data->dirty_pgs[i]);
}
for (int i = 0; i < op_data->dirty_osd_count; i++)
{
dirty_osds.insert(op_data->dirty_osds[i]);
}
if (op_data->unstable_writes)
{
// Return objects back into the unstable write set
for (auto unstable_osd: *(op_data->unstable_write_osds))
{
for (int i = 0; i < unstable_osd.len; i++)
{
// Except those from peered PGs
auto & w = op_data->unstable_writes[unstable_osd.start + i];
pool_pg_num_t wpg = {
.pool_id = INODE_POOL(w.oid.inode),
.pg_num = map_to_pg(w.oid, st_cli.pool_config.at(INODE_POOL(w.oid.inode)).pg_stripe_size),
};
if (pgs.at(wpg).state & PG_ACTIVE)
{
uint64_t & dest = this->unstable_writes[(osd_object_id_t){
.osd_num = unstable_osd.osd_num,
.oid = w.oid,
}];
dest = dest < w.version ? w.version : dest;
dirty_pgs.insert(wpg);
}
}
}
}
if (op_data->copies_to_delete)
{
// Return 'copies to delete' back into respective PGs
for (int i = 0; i < op_data->copies_to_delete_count; i++)
{
auto & w = op_data->copies_to_delete[i];
auto & pg = pgs.at((pool_pg_num_t){
.pool_id = INODE_POOL(w.oid.inode),
.pg_num = map_to_pg(w.oid, st_cli.pool_config.at(INODE_POOL(w.oid.inode)).pg_stripe_size),
});
if (pg.state & PG_ACTIVE)
{
pg.copies_to_delete_after_sync.push_back(w);
copies_to_delete_after_sync_count++;
}
}
}
}
else if (op_data->copies_to_delete)
{
// Actually delete copies which we wanted to delete
submit_primary_del_batch(cur_op, op_data->copies_to_delete, op_data->copies_to_delete_count);
resume_7:
op_data->st = 7;
return;
resume_8:
if (op_data->errors > 0)
{
goto resume_6;
}
}
for (int i = 0; i < op_data->dirty_pg_count; i++)
{
auto & pg = pgs.at(op_data->dirty_pgs[i]);
pg.inflight--;
if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch)
{
finish_stop_pg(pg);
}
else if ((pg.state & PG_REPEERING) && pg.inflight == 0 && !pg.flush_batch)
{
start_pg_peering(pg);
}
}
// FIXME: Free those in the destructor?
free(op_data->dirty_pgs);
op_data->dirty_pgs = NULL;
op_data->dirty_osds = NULL;
if (op_data->unstable_writes)
{
delete op_data->unstable_write_osds;
delete[] op_data->unstable_writes;
op_data->unstable_writes = NULL;
op_data->unstable_write_osds = NULL;
}
if (op_data->errors > 0)
{
finish_op(cur_op, op_data->epipe > 0 ? -EPIPE : -EIO);
}
else
{
finish:
if (cur_op->peer_fd)
{
auto it = msgr.clients.find(cur_op->peer_fd);
if (it != msgr.clients.end())
it->second->dirty_pgs.clear();
}
finish_op(cur_op, 0);
}
assert(syncs_in_progress.front() == cur_op);
syncs_in_progress.pop_front();
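// Resume the next queued sync: it parked itself at resume_1 with st == 1, so
// incrementing st and jumping to resume_2 continues it from the actual sync logic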
if (syncs_in_progress.size() > 0)
{
cur_op = syncs_in_progress.front();
op_data = cur_op->op_data;
op_data->st++;
goto resume_2;
}
}

Some files were not shown because too many files have changed in this diff.