Release 0.6.13

- Fix client hangs possible on OSD restarts (bug affected versions from 0.5.11)
- Fix "Assertion `sqe != NULL' failed" io_uring-related crashes possible on some kernels (0.6.11 increased probability of this bug)
- Fix timeout=0 in NBD proxy
- Fix build under centos 7
Fix warnings
2022-02-03 01:50:30 +03:00 · 2022-02-03 01:50:30 +03:00 · 2022-02-03 01:42:19 +03:00 · 2022-02-02 01:40:22 +03:00 · 2022-02-01 22:46:13 +03:00 · 2022-02-01 22:45:12 +03:00
115 changed files with 4751 additions and 1207 deletions
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -2,6 +2,6 @@ cmake_minimum_required(VERSION 2.8)

 project(vitastor)

-set(VERSION "0.6.9")
+set(VERSION "0.6.13")

 add_subdirectory(src)
--- a/README-ru.md
+++ b/README-ru.md
@@ -51,14 +51,15 @@ Vitastor на данный момент находится в статусе п
 - Базовая поддержка OpenStack: драйвер Cinder, патчи для Nova и libvirt
 - Слияние снапшотов (vitastor-cli {snap-rm,flatten,merge})
 - Консольный интерфейс для управления образами (vitastor-cli {ls,create,modify})
+- Плагин для Proxmox

 ## Планы развития

- Поддержка удаления снапшотов (слияния слоёв)
 - Более корректные скрипты разметки дисков и автоматического запуска OSD
 - Другие инструменты администрирования
- Плагины для OpenNebula, Proxmox и других облачных систем
+- Плагины для OpenNebula и других облачных систем
 - iSCSI-прокси
+- Упрощённый NFS прокси
 - Более быстрое переключение при отказах
 - Фоновая проверка целостности без контрольных сумм (сверка реплик)
 - Контрольные суммы
@@ -537,6 +538,75 @@ for i in ./???-*.yaml; do kubectl apply -f $i; done

 После этого вы сможете создавать PersistentVolume. Пример смотрите в файле [csi/deploy/example-pvc.yaml](csi/deploy/example-pvc.yaml).

+### OpenStack
+
+Чтобы подключить Vitastor к OpenStack:
+
+- Установите пакеты vitastor-client, libvirt и QEMU из DEB или RPM репозитория Vitastor
+- Примените патч `patches/nova-21.diff` или `patches/nova-23.diff` к вашей инсталляции Nova.
+  nova-21.diff подходит для Nova 21-22, nova-23.diff подходит для Nova 23-24.
+- Скопируйте `patches/cinder-vitastor.py` в инсталляцию Cinder как `cinder/volume/drivers/vitastor.py`
+- Создайте тип томов в cinder.conf (см. ниже)
+- Обязательно заблокируйте доступ от виртуальных машин к сети Vitastor (OSD и etcd), т.к. Vitastor (пока) не поддерживает аутентификацию
+- Перезапустите Cinder и Nova
+
+Пример конфигурации Cinder:
+
+```
+[DEFAULT]
+enabled_backends = lvmdriver-1, vitastor-testcluster
+# ...
+
+[vitastor-testcluster]
+volume_driver = cinder.volume.drivers.vitastor.VitastorDriver
+volume_backend_name = vitastor-testcluster
+image_volume_cache_enabled = True
+volume_clear = none
+vitastor_etcd_address = 192.168.7.2:2379
+vitastor_etcd_prefix =
+vitastor_config_path = /etc/vitastor/vitastor.conf
+vitastor_pool_id = 1
+image_upload_use_cinder_backend = True
+```
+
+Чтобы помещать в Vitastor Glance-образы, нужно использовать
+[образы на основе томов Cinder](https://docs.openstack.org/cinder/pike/admin/blockstorage-volume-backed-image.html),
+однако, поддержка этой функции ещё не проверялась.
+
+### Proxmox
+
+Чтобы подключить Vitastor к Proxmox Virtual Environment (поддерживаются версии 6.4 и 7.1):
+
+- Добавьте соответствующий Debian-репозиторий Vitastor в sources.list на хостах Proxmox
+  (buster для 6.4, bullseye для 7.1)
+- Установите пакеты vitastor-client, pve-qemu-kvm, pve-storage-vitastor (* или см. сноску) из репозитория Vitastor
+- Определите тип хранилища в `/etc/pve/storage.cfg` (см. ниже)
+- Обязательно заблокируйте доступ от виртуальных машин к сети Vitastor (OSD и etcd), т.к. Vitastor (пока) не поддерживает аутентификацию
+- Перезапустите демон Proxmox: `systemctl restart pvedaemon`
+
+Пример `/etc/pve/storage.cfg` (единственная обязательная опция - vitastor_pool, все остальные
+перечислены внизу для понимания значений по умолчанию):
+
+```
+vitastor: vitastor
+    # Пул, в который будут помещаться образы дисков
+    vitastor_pool testpool
+    # Путь к файлу конфигурации
+    vitastor_config_path /etc/vitastor/vitastor.conf
+    # Адрес(а) etcd, нужны, только если не указаны в vitastor.conf
+    vitastor_etcd_address 192.168.7.2:2379/v3
+    # Префикс ключей метаданных в etcd
+    vitastor_etcd_prefix /vitastor
+    # Префикс имён образов
+    vitastor_prefix pve/
+    # Монтировать образы через NBD прокси, через ядро (нужно только для контейнеров)
+    vitastor_nbd 0
+```
+
+\* Примечание: вместо установки пакета pve-storage-vitastor вы можете вручную скопировать файл
+[patches/PVE_VitastorPlugin.pm](patches/PVE_VitastorPlugin.pm) на хосты Proxmox как
+`/usr/share/perl5/PVE/Storage/Custom/VitastorPlugin.pm`.
+
 ## Известные проблемы

 - Запросы удаления объектов могут в данный момент приводить к "неполным" объектам в EC-пулах,
--- a/README.md
+++ b/README.md
@@ -45,14 +45,15 @@ breaking changes in the future. However, the following is implemented:
 - Basic OpenStack support: Cinder driver, Nova and libvirt patches
 - Snapshot merge tool (vitastor-cli {snap-rm,flatten,merge})
 - Image management CLI (vitastor-cli {ls,create,modify})
+- Proxmox storage plugin

 ## Roadmap

- Snapshot deletion (layer merge) support
 - Better OSD creation and auto-start tools
 - Other administrative tools
- Plugins for OpenNebula, Proxmox and other cloud systems
+- Plugins for OpenNebula and other cloud systems
 - iSCSI proxy
+- Simplified NFS proxy
 - Faster failover
 - Scrubbing without checksums (verification of replicas)
 - Checksums
@@ -486,6 +487,73 @@ for i in ./???-*.yaml; do kubectl apply -f $i; done

 After that you'll be able to create PersistentVolumes. See example in [csi/deploy/example-pvc.yaml](csi/deploy/example-pvc.yaml).

+### OpenStack
+
+To enable Vitastor support in an OpenStack installation:
+
+- Install vitastor-client, patched QEMU and libvirt packages from Vitastor DEB or RPM repository
+- Use `patches/nova-21.diff` or `patches/nova-23.diff` to patch your Nova installation.
+  Patch 21 fits Nova 21-22, patch 23 fits Nova 23-24.
+- Install `patches/cinder-vitastor.py` as `..../cinder/volume/drivers/vitastor.py`
+- Define a volume type in cinder.conf (see below)
+- Block network access from VMs to Vitastor network (to OSDs and etcd), because Vitastor doesn't support authentication (yet)
+- Restart Cinder and Nova
+
+Cinder volume type configuration example:
+
+```
+[DEFAULT]
+enabled_backends = lvmdriver-1, vitastor-testcluster
+# ...
+
+[vitastor-testcluster]
+volume_driver = cinder.volume.drivers.vitastor.VitastorDriver
+volume_backend_name = vitastor-testcluster
+image_volume_cache_enabled = True
+volume_clear = none
+vitastor_etcd_address = 192.168.7.2:2379
+vitastor_etcd_prefix =
+vitastor_config_path = /etc/vitastor/vitastor.conf
+vitastor_pool_id = 1
+image_upload_use_cinder_backend = True
+```
+
+To put Glance images in Vitastor, use [volume-backed images](https://docs.openstack.org/cinder/pike/admin/blockstorage-volume-backed-image.html),
+although the support has not been verified yet.
+
+### Proxmox
+
+To enable Vitastor support in Proxmox Virtual Environment (6.4 and 7.1 are supported):
+
+- Add the corresponding Vitastor Debian repository into sources.list on Proxmox hosts
+  (buster for 6.4, bullseye for 7.1)
+- Install vitastor-client, pve-qemu-kvm, pve-storage-vitastor (* or see note) packages from Vitastor repository
+- Define storage in `/etc/pve/storage.cfg` (see below)
+- Block network access from VMs to Vitastor network (to OSDs and etcd), because Vitastor doesn't support authentication (yet)
+- Restart pvedaemon: `systemctl restart pvedaemon`
+
+`/etc/pve/storage.cfg` example (the only required option is vitastor_pool, all others
+are listed below with their default values):
+
+```
+vitastor: vitastor
+    # pool to put new images into
+    vitastor_pool testpool
+    # path to the configuration file
+    vitastor_config_path /etc/vitastor/vitastor.conf
+    # etcd address(es), required only if missing in the configuration file
+    vitastor_etcd_address 192.168.7.2:2379/v3
+    # prefix for keys in etcd
+    vitastor_etcd_prefix /vitastor
+    # prefix for images
+    vitastor_prefix pve/
+    # use NBD mounter (only required for containers)
+    vitastor_nbd 0
+```
+
+\* Note: you can also manually copy [patches/PVE_VitastorPlugin.pm](patches/PVE_VitastorPlugin.pm) to Proxmox hosts
+as `/usr/share/perl5/PVE/Storage/Custom/VitastorPlugin.pm` instead of installing pve-storage-vitastor.
+
 ## Known Problems

 - Object deletion requests may currently lead to 'incomplete' objects in EC pools
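
The Cinder configuration example added to the README lists the new backend in `enabled_backends` and defines a matching section. A minimal Python sketch of checking that pairing with the standard `configparser` module follows. The helper name `vitastor_backends` and the shortened sample text are assumptions made for illustration; they are not part of Vitastor or Cinder.

```python
import configparser

# Shortened copy of the README example; values are illustrative only.
SAMPLE = """\
[DEFAULT]
enabled_backends = lvmdriver-1, vitastor-testcluster

[vitastor-testcluster]
volume_driver = cinder.volume.drivers.vitastor.VitastorDriver
volume_backend_name = vitastor-testcluster
vitastor_etcd_address = 192.168.7.2:2379
vitastor_etcd_prefix =
vitastor_config_path = /etc/vitastor/vitastor.conf
vitastor_pool_id = 1
"""


def vitastor_backends(text):
    """Return enabled backends that have a section using the Vitastor driver."""
    # Interpolation is disabled so that '%' in real config values is kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    raw = parser["DEFAULT"].get("enabled_backends", fallback="")
    enabled = [name.strip() for name in raw.split(",")]
    return [
        name
        for name in enabled
        if parser.has_section(name)
        and parser[name].get("volume_driver", fallback="").endswith("VitastorDriver")
    ]


print(vitastor_backends(SAMPLE))
```

An empty result would suggest that either the backend is missing from `enabled_backends` or its section does not set `volume_driver` to the Vitastor driver.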
--- a/2
+++ b/2
--- a/csi/Makefile
+++ b/csi/Makefile
@@ -1,4 +1,4 @@
-VERSION ?= v0.6.9
+VERSION ?= v0.6.13

 all: build push

--- a/csi/deploy/004-csi-nodeplugin.yaml
+++ b/csi/deploy/004-csi-nodeplugin.yaml
@@ -49,7 +49,7 @@ spec:
            capabilities:
              add: ["SYS_ADMIN"]
            allowPrivilegeEscalation: true
-          image: vitalif/vitastor-csi:v0.6.9
+          image: vitalif/vitastor-csi:v0.6.13
          args:
            - "--node=$(NODE_ID)"
            - "--endpoint=$(CSI_ENDPOINT)"
--- a/csi/deploy/007-csi-provisioner.yaml
+++ b/csi/deploy/007-csi-provisioner.yaml
@@ -116,7 +116,7 @@ spec:
            privileged: true
            capabilities:
              add: ["SYS_ADMIN"]
-          image: vitalif/vitastor-csi:v0.6.9
+          image: vitalif/vitastor-csi:v0.6.13
          args:
            - "--node=$(NODE_ID)"
            - "--endpoint=$(CSI_ENDPOINT)"
--- a/csi/src/config.go
+++ b/csi/src/config.go
@@ -5,7 +5,7 @@ package vitastor

 const (
    vitastorCSIDriverName    = "csi.vitastor.io"
-    vitastorCSIDriverVersion = "0.6.9"
+    vitastorCSIDriverVersion = "0.6.13"
 )

 // Config struct fills the parameters of request or user input
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,4 +1,4 @@
-vitastor (0.6.9-1) unstable; urgency=medium
+vitastor (0.6.13-1) unstable; urgency=medium

  * RDMA support
  * Bugfixes
--- a/debian/control
+++ b/debian/control
@@ -47,3 +47,9 @@ Architecture: amd64
 Depends: ${shlibs:Depends}, ${misc:Depends}, vitastor-client (= ${binary:Version}), fio (= ${dep:fio})
 Description: Vitastor, a fast software-defined clustered block storage - fio drivers
 Vitastor fio drivers for benchmarking.
+
+Package: pve-storage-vitastor
+Architecture: amd64
+Depends: ${shlibs:Depends}, ${misc:Depends}, vitastor-client (= ${binary:Version})
+Description: Vitastor Proxmox Virtual Environment storage plugin
+ Vitastor storage plugin for Proxmox Virtual Environment.
--- a/debian/pve-storage-vitastor.install
+++ b/debian/pve-storage-vitastor.install
@@ -0,0 +1 @@
+patches/PVE_VitastorPlugin.pm usr/share/perl5/PVE/Storage/Custom/VitastorPlugin.pm
--- a/debian/vitastor.Dockerfile
+++ b/debian/vitastor.Dockerfile
@@ -33,8 +33,8 @@ RUN set -e -x; \
    mkdir -p /root/packages/vitastor-$REL; \
    rm -rf /root/packages/vitastor-$REL/*; \
    cd /root/packages/vitastor-$REL; \
-    cp -r /root/vitastor vitastor-0.6.9; \
-    cd vitastor-0.6.9; \
+    cp -r /root/vitastor vitastor-0.6.13; \
+    cd vitastor-0.6.13; \
    ln -s /root/fio-build/fio-*/ ./fio; \
    FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
    ls /usr/include/linux/raw.h || cp ./debian/raw.h /usr/include/linux/raw.h; \
@@ -47,8 +47,8 @@ RUN set -e -x; \
    rm -rf a b; \
    echo "dep:fio=$FIO" > debian/fio_version; \
    cd /root/packages/vitastor-$REL; \
-    tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.6.9.orig.tar.xz vitastor-0.6.9; \
-    cd vitastor-0.6.9; \
+    tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.6.13.orig.tar.xz vitastor-0.6.13; \
+    cd vitastor-0.6.13; \
    V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
    DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
    DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
--- a/docs/params/common.yml
+++ b/docs/params/common.yml
@@ -0,0 +1,35 @@
+- name: config_path
+  type: string
+  default: "/etc/vitastor/vitastor.conf"
+  info: |
+    Path to the JSON configuration file. Configuration file is optional,
+    a non-existing configuration file does not prevent Vitastor from
+    running if required parameters are specified.
+  info_ru: |
+    Путь к файлу конфигурации в формате JSON. Файл конфигурации необязателен,
+    без него Vitastor тоже будет работать, если переданы необходимые параметры.
+- name: etcd_address
+  type: string or array of strings
+  type_ru: строка или массив строк
+  info: |
+    etcd connection endpoint(s). Multiple endpoints may be delimited by "," or
+    specified in a JSON array `["10.0.115.10:2379/v3","10.0.115.11:2379/v3"]`.
+    Note that https is not supported for etcd connections yet.
+  info_ru: |
+    Адрес(а) подключения к etcd. Несколько адресов могут разделяться запятой
+    или указываться в виде JSON-массива `["10.0.115.10:2379/v3","10.0.115.11:2379/v3"]`.
+- name: etcd_prefix
+  type: string
+  default: "/vitastor"
+  info: |
+    Prefix for all keys in etcd used by Vitastor. You can change prefix and, for
+    example, use a single etcd cluster for multiple Vitastor clusters.
+  info_ru: |
+    Префикс для ключей etcd, которые использует Vitastor. Вы можете задать другой
+    префикс, например, чтобы запустить несколько кластеров Vitastor с одним
+    кластером etcd.
+- name: log_level
+  type: int
+  default: 0
+  info: Log level. Raise if you want more verbose output.
+  info_ru: Уровень логгирования. Повысьте, если хотите более подробный вывод.
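
The `etcd_address` description above accepts either a comma-delimited string or a JSON array of endpoints. A minimal Python sketch of normalizing both forms into a list is shown below. The function `parse_etcd_address` is a hypothetical helper written for this illustration; it is not Vitastor's own parser and may differ from it in edge cases.

```python
import json


def parse_etcd_address(value):
    """Normalize an etcd_address setting into a list of endpoint strings."""
    if isinstance(value, (list, tuple)):
        # Already an array, e.g. loaded from a JSON configuration file.
        items = list(value)
    else:
        text = value.strip()
        if text.startswith("["):
            # JSON array written as a string.
            items = json.loads(text)
        else:
            # Comma-delimited list of endpoints.
            items = text.split(",")
    # Drop surrounding whitespace and empty entries.
    return [str(item).strip() for item in items if str(item).strip()]


print(parse_etcd_address("10.0.115.10:2379/v3, 10.0.115.11:2379/v3"))
print(parse_etcd_address('["10.0.115.10:2379/v3","10.0.115.11:2379/v3"]'))
```

Both calls produce the same two-element list, so later code can treat the setting uniformly.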
--- a/docs/params/layout-cluster.yml
+++ b/docs/params/layout-cluster.yml
@@ -0,0 +1,200 @@
+- name: block_size
+  type: int
+  default: 131072
+  info: |
+    Size of objects (data blocks) into which all physical and virtual drives are
+    subdivided in Vitastor. Currently one of the main settings: it affects
+    memory usage, write amplification and I/O load distribution effectiveness.
+
+    Recommended default block size is 128 KB for SSD and 4 MB for HDD. In fact,
+    it's possible to use 4 MB for SSD too - it will lower memory usage, but
+    may increase average WA and reduce linear performance.
+
+    OSDs with different block sizes (for example, SSD and SSD+HDD OSDs) can
+    currently coexist in one etcd instance only within separate Vitastor
+    clusters with different etcd_prefix values.
+
+    Also, block size can't be changed after OSD initialization without losing
+    data.
+
+    You must always specify block_size in etcd in /vitastor/config/global if
+    you change it so all clients can know about it.
+
+    OSD memory usage is roughly (SIZE / BLOCK * 68 bytes) which is roughly
+    544 MB per 1 TB of used disk space with the default 128 KB block size.
+  info_ru: |
+    Размер объектов (блоков данных), на которые делятся физические и виртуальные
+    диски в Vitastor. Одна из ключевых на данный момент настроек, влияет на
+    потребление памяти, объём избыточной записи (write amplification) и
+    эффективность распределения нагрузки по OSD.
+
+    Рекомендуемые по умолчанию размеры блока - 128 килобайт для SSD и 4
+    мегабайта для HDD. В принципе, для SSD можно тоже использовать 4 мегабайта,
+    это понизит использование памяти, но ухудшит распределение нагрузки и в
+    среднем увеличит WA.
+
+    OSD с разными размерами блока (например, SSD и SSD+HDD OSD) на данный
+    момент могут сосуществовать в рамках одного etcd только в виде двух независимых
+    кластеров Vitastor с разными etcd_prefix.
+
+    Также размер блока нельзя менять после инициализации OSD без потери данных.
+
+    Если вы меняете размер блока, обязательно прописывайте его в etcd в
+    /vitastor/config/global, дабы все клиенты его знали.
+
+    Потребление памяти OSD составляет примерно (РАЗМЕР / БЛОК * 68 байт),
+    т.е. примерно 544 МБ памяти на 1 ТБ занятого места на диске при
+    стандартном 128 КБ блоке.
+- name: bitmap_granularity
+  type: int
+  default: 4096
+  info: |
+    Required virtual disk write alignment ("sector size"). Must be a multiple
+    of disk_alignment. It's called bitmap granularity because Vitastor tracks
+    an allocation bitmap for each object containing 2 bits per each
+    (bitmap_granularity) bytes.
+
+    This parameter can't be changed after OSD initialization without losing
+    data. Also it's fixed for the whole Vitastor cluster i.e. two different
+    values can't be used in a single Vitastor cluster.
+
+    Clients MUST be aware of this parameter value, so put it into etcd key
+    /vitastor/config/global if you change it for any reason.
+  info_ru: |
+    Требуемое выравнивание записи на виртуальные диски (размер их "сектора").
+    Должен быть кратен disk_alignment. Называется гранулярностью битовой карты
+    потому, что Vitastor хранит битовую карту для каждого объекта, содержащую
+    по 2 бита на каждые (bitmap_granularity) байт.
+
+    Данный параметр нельзя менять после инициализации OSD без потери данных.
+    Также он фиксирован для всего кластера Vitastor, т.е. разные значения
+    не могут сосуществовать в одном кластере.
+
+    Клиенты ДОЛЖНЫ знать правильное значение этого параметра, так что если вы
+    его меняете, обязательно прописывайте изменённое значение в etcd в ключ
+    /vitastor/config/global.
+- name: immediate_commit
+  type: string
+  default: false
+  info: |
+    Another parameter which is really important for performance.
+
+    Desktop SSDs are very fast (100000+ iops) for simple random writes
+    without cache flush. However, they are really slow (only around 1000 iops)
+    if you try to fsync() each write, that is, when you want to guarantee that
+    each change gets immediately persisted to the physical media.
+
+    Server-grade SSDs with "Advanced/Enhanced Power Loss Protection" or with
+    "Supercapacitor-based Power Loss Protection", on the other hand, are equally
+    fast with and without fsync because their cache is protected from sudden
+    power loss by a built-in supercapacitor-based "UPS".
+
+    Some software-defined storage systems always fsync each write and thus are
+    really slow when used with desktop SSDs. Vitastor, however, can also
+    efficiently utilize desktop SSDs by postponing fsync until the client calls
+    it explicitly.
+
+    This is what this parameter regulates. When it's set to "all" the whole
+    Vitastor cluster commits each change to disks immediately and clients just
+    ignore fsyncs because they know for sure that they're unneeded. This reduces
+    the amount of network roundtrips performed by clients and improves
+    performance. So it's always better to use server grade SSDs with
+    supercapacitors even with Vitastor, especially given that they cost only
+    a bit more than desktop models.
+
+    There is also a common SATA SSD (and HDD too!) firmware bug (or feature)
+    that makes server SSDs which have supercapacitors slow with fsync. To check
+    if your SSDs are affected, compare benchmark results from `fio -name=test
+    -ioengine=libaio -direct=1 -bs=4k -rw=randwrite -iodepth=1` with and without
+    `-fsync=1`. Results should be the same. If fsync=1 result is worse you can
+    try to work around this bug by "disabling" drive write-back cache by running
+    `hdparm -W 0 /dev/sdXX` or `echo write through > /sys/block/sdXX/device/scsi_disk/*/cache_type`
+    (IMPORTANT: don't confuse it with `/sys/block/sdXX/queue/write_cache` - it's
+    unsafe to change by hand). The same may apply to newer HDDs with internal
+    SSD cache or "media-cache" - for example, a lot of Seagate EXOS drives have
+    it (they have internal SSD cache even though it's not stated in datasheets).
+
+    This parameter must be set both in etcd in /vitastor/config/global and in
+    OSD command line or configuration. Setting it to "all" or "small" requires
+    enabling disable_journal_fsync and disable_meta_fsync, setting it to "all"
+    also requires enabling disable_data_fsync.
+
+    TLDR: For optimal performance, set immediate_commit to "all" if you only use
+    SSDs with supercapacitor-based power loss protection (nonvolatile
+    write-through cache) for both data and journals in the whole Vitastor
+    cluster. Set it to "small" if you only use such SSDs for journals. Leave
+    empty if your drives have write-back cache.
+  info_ru: |
+    Ещё один важный для производительности параметр.
+
+    Модели SSD для настольных компьютеров очень быстрые (100000+ операций в
+    секунду) при простой случайной записи без сбросов кэша. Однако они очень
+    медленные (всего порядка 1000 iops), если вы пытаетесь сбрасывать кэш после
+    каждой записи, то есть, если вы пытаетесь гарантировать, что каждое
+    изменение физически записывается в энергонезависимую память.
+
+    С другой стороны, серверные SSD с конденсаторами - функцией, называемой
+    "Advanced/Enhanced Power Loss Protection" или просто "Supercapacitor-based
+    Power Loss Protection" - одинаково быстрые и со сбросом кэша, и без
+    него, потому что их кэш защищён от потери питания встроенным "источником
+    бесперебойного питания" на основе суперконденсаторов и на самом деле они
+    его никогда не сбрасывают.
+
+    Некоторые программные СХД всегда сбрасывают кэши дисков при каждой записи
+    и поэтому работают очень медленно с настольными SSD. Vitastor, однако, может
+    откладывать fsync до явного его вызова со стороны клиента и таким образом
+    эффективно утилизировать настольные SSD.
+
+    Данный параметр влияет как раз на это. Когда он установлен в значение "all",
+    весь кластер Vitastor мгновенно фиксирует каждое изменение на физические
+    носители и клиенты могут просто игнорировать запросы fsync, т.к. они точно
+    знают, что fsync-и не нужны. Это уменьшает число необходимых обращений к OSD
+    по сети и улучшает производительность. Поэтому даже с Vitastor лучше всегда
+    использовать только серверные модели SSD с суперконденсаторами, особенно
+    учитывая то, что стоят они ненамного дороже настольных.
+
+    Также в прошивках SATA SSD (и даже HDD!) очень часто встречается либо баг,
+    либо просто особенность логики, из-за которой серверные SSD, имеющие
+    конденсаторы и защиту от потери питания, всё равно медленно работают с
+    fsync. Чтобы понять, подвержены ли этой проблеме ваши SSD, сравните
+    результаты тестов `fio -name=test -ioengine=libaio -direct=1 -bs=4k
+    -rw=randwrite -iodepth=1` без и с опцией `-fsync=1`. Результаты должны
+    быть одинаковые. Если результат с `fsync=1` хуже, вы можете попробовать
+    обойти проблему, "отключив" кэш записи диска командой `hdparm -W 0 /dev/sdXX`
+    либо `echo write through > /sys/block/sdXX/device/scsi_disk/*/cache_type`
+    (ВАЖНО: не перепутайте с `/sys/block/sdXX/queue/write_cache` - этот параметр
+    менять руками небезопасно). Такая же проблема может встречаться и в новых
+    HDD-дисках с внутренним SSD или "медиа" кэшем - например, она встречается во
+    многих дисках Seagate EXOS (у них есть внутренний SSD-кэш, хотя это и не
+    указано в спецификациях).
+
+    Данный параметр нужно указывать и в etcd в /vitastor/config/global, и в
+    командной строке или конфигурации OSD. Значения "all" и "small" требуют
+    включения disable_journal_fsync и disable_meta_fsync, значение "all" также
+    требует включения disable_data_fsync.
+
+    Итого, вкратце: для оптимальной производительности установите
+    immediate_commit в значение "all", если вы используете в кластере только SSD
+    с суперконденсаторами и для данных, и для журналов. Если вы используете
+    такие SSD для всех журналов, но не для данных - можете установить параметр
+    в "small". Если и какие-то из дисков журналов имеют волатильный кэш записи -
+    оставьте параметр пустым.
+- name: client_dirty_limit
+  type: int
+  default: 33554432
+  info: |
+    Without immediate_commit=all this parameter sets the limit of "dirty"
+    (not committed by fsync) data allowed by the client before forcing an
+    additional fsync and committing the data. Also note that the client always
+    holds a copy of uncommitted data in memory so this setting also affects
+    RAM usage of clients.
+
+    This parameter doesn't affect OSDs themselves.
+  info_ru: |
+    При работе без immediate_commit=all - это лимит объёма "грязных" (не
+    зафиксированных fsync-ом) данных, при достижении которого клиент будет
+    принудительно вызывать fsync и фиксировать данные. Также стоит иметь в виду,
+    что в этом случае до момента fsync клиент хранит копию незафиксированных
+    данных в памяти, то есть, настройка влияет на потребление памяти клиентами.
+
+    Параметр не влияет на сами OSD.
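
The `block_size` description above estimates OSD memory usage as roughly SIZE / BLOCK * 68 bytes. The arithmetic can be checked with a short Python sketch. The function name `estimate_osd_memory` is an assumption made for this example, and the result is only the rough estimate quoted in the parameter description, not a measurement.

```python
BYTES_PER_OBJECT = 68  # per-object figure quoted in the block_size description


def estimate_osd_memory(used_bytes, block_size=131072):
    """Rough OSD memory estimate in bytes: SIZE / BLOCK * 68."""
    return used_bytes // block_size * BYTES_PER_OBJECT


ONE_TB = 1 << 40
MB = 1 << 20

# Default 128 KB block size: 2**23 objects * 68 bytes.
print(estimate_osd_memory(ONE_TB) / MB)  # 544.0
# 4 MB block size: 2**18 objects * 68 bytes.
print(estimate_osd_memory(ONE_TB, block_size=4 * MB) / MB)  # 17.0
```

The second line illustrates why a 4 MB block size lowers memory usage: the same amount of data is split into 32 times fewer objects.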
--- a/docs/params/layout-osd.yml
+++ b/docs/params/layout-osd.yml
@@ -0,0 +1,205 @@
+- name: data_device
+  type: string
+  info: |
+    Path to the block device to use for data. It's highly recommended to use
+    stable paths for all device names: `/dev/disk/by-partuuid/xxx...` instead
+    of just `/dev/sda` or `/dev/nvme0n1` to not mess up after server restart.
+    Files can also be used instead of block devices, but this is implemented
+    only for testing purposes and not for production.
+  info_ru: |
+    Путь к диску (блочному устройству) для хранения данных. Крайне рекомендуется
+    использовать стабильные пути: `/dev/disk/by-partuuid/xxx...` вместо простых
+    `/dev/sda` или `/dev/nvme0n1`, чтобы пути не могли спутаться после
+    перезагрузки сервера. Также вместо блочных устройств можно указывать файлы,
+    но это реализовано только для тестирования, а не для боевой среды.
+- name: meta_device
+  type: string
+  info: |
+    Path to the block device to use for the metadata. Metadata must be on a fast
+    SSD or performance will suffer. If this option is skipped, `data_device` is
+    used for the metadata.
+  info_ru: |
+    Путь к диску метаданных. Метаданные должны располагаться на быстром
+    SSD-диске, иначе производительность пострадает. Если эта опция не указана,
+    для метаданных используется `data_device`.
+- name: journal_device
+  type: string
+  info: |
+    Path to the block device to use for the journal. Journal must be on a fast
+    SSD or performance will suffer. If this option is skipped, `meta_device` is
+    used for the journal, and if it's also empty, journal is put on
+    `data_device`. It's almost always fine to put metadata and journal on the
+    same device, in this case you only need to set `meta_device`.
+  info_ru: |
+    Путь к диску журнала. Журнал должен располагаться на быстром SSD-диске,
+    иначе производительность пострадает. Если эта опция не указана,
+    для журнала используется `meta_device`, если же пуста и она, журнал
+    располагается на `data_device`. Нормально располагать журнал и метаданные
+    на одном устройстве, в этом случае достаточно указать только `meta_device`.
+- name: journal_offset
+  type: int
+  default: 0
+  info: Offset on the device in bytes where the journal is stored.
+  info_ru: Смещение на устройстве в байтах, по которому располагается журнал.
+- name: journal_size
+  type: int
+  info: |
+    Journal size in bytes. Doesn't have to be large, 16-32 MB is usually fine.
+    By default, the whole journal device will be used for the journal. You must
+    set it to some value manually (or use make-osd.sh) if you colocate the
+    journal with data or metadata.
+  info_ru: |
+    Размер журнала в байтах. Большим быть не обязан, 16-32 МБ обычно достаточно.
+    По умолчанию для журнала используется всё устройство журнала. Если же вы
+    размещаете журнал на устройстве данных или метаданных, то вы должны
+    установить эту опцию в какое-то значение сами (или использовать скрипт
+    make-osd.sh).
+- name: meta_offset
+  type: int
+  default: 0
+  info: |
+    Offset on the device in bytes where the metadata area is stored.
+    Again, set it to something if you colocate metadata with journal or data.
+  info_ru: |
+    Смещение на устройстве в байтах, по которому располагаются метаданные.
+    Эту опцию нужно задать, если метаданные у вас хранятся на том же
+    устройстве, что данные или журнал.
+- name: data_offset
+  type: int
+  default: 0
+  info: |
+    Offset on the device in bytes where the data area is stored.
+    Again, set it to something if you colocate data with journal or metadata.
+  info_ru: |
+    Смещение на устройстве в байтах, по которому располагаются данные.
+    Эту опцию нужно задать, если данные у вас хранятся на том же
+    устройстве, что метаданные или журнал.
+- name: data_size
+  type: int
+  info: |
+    Data area size in bytes. By default, the whole data device up to the end
+    will be used for the data area, but you can restrict it if you want to use
+    a smaller part. Note that there is no option to set metadata area size -
+    it's derived from the data area size.
+  info_ru: |
+    Размер области данных в байтах. По умолчанию под данные будет использована
+    вся доступная область устройства данных до конца устройства, но вы можете
+    использовать эту опцию, чтобы ограничить её меньшим размером. Заметьте, что
+    опции размера области метаданных нет - она вычисляется из размера области
+    данных автоматически.
+- name: meta_block_size
+  type: int
+  default: 4096
+  info: |
+    Physical block size of the metadata device. 4096 for most current
+    HDDs and SSDs.
+  info_ru: |
+    Размер физического блока устройства метаданных. 4096 для большинства
+    современных SSD и HDD.
+- name: journal_block_size
+  type: int
+  default: 4096
+  info: |
+    Physical block size of the journal device. Must be a multiple of
+    `disk_alignment`. 4096 for most current HDDs and SSDs.
+  info_ru: |
+    Размер физического блока устройства журнала. Должен быть кратен
+    `disk_alignment`. 4096 для большинства современных SSD и HDD.
+- name: disable_data_fsync
+  type: bool
+  default: false
+  info: |
+    Do not issue fsyncs to the data device, i.e. do not flush its cache.
+    Safe ONLY if your data device has write-through cache. If you disable
+    the cache yourself using `hdparm` or `scsi_disk/cache_type` then make sure
+    that the cache disable command is run every time before starting Vitastor
+    OSD, for example, in the systemd unit. See also `immediate_commit` option
+    for the instructions to disable cache and how to benefit from it.
+  info_ru: |
+    Не отправлять fsync-и устройству данных, т.е. не сбрасывать его кэш.
+    Безопасно, ТОЛЬКО если ваше устройство данных имеет кэш со сквозной
+    записью (write-through). Если вы отключаете кэш через `hdparm` или
+    `scsi_disk/cache_type`, то удостоверьтесь, что команда отключения кэша
+    выполняется перед каждым запуском Vitastor OSD, например, в systemd unit-е.
+    Смотрите также опцию `immediate_commit` для инструкций по отключению кэша
+    и о том, как из этого извлечь выгоду.
+- name: disable_meta_fsync
+  type: bool
+  default: false
+  info: |
+    Same as disable_data_fsync, but for the metadata device. If the metadata
+    device is not set or if the data device is used for the metadata the option
+    is ignored and disable_data_fsync value is used instead of it.
+  info_ru: |
+    То же, что disable_data_fsync, но для устройства метаданных. Если устройство
+    метаданных не задано или если оно равно устройству данных, значение опции
+    игнорируется и вместо него используется значение опции disable_data_fsync.
+- name: disable_journal_fsync
+  type: bool
+  default: false
+  info: |
+    Same as disable_data_fsync, but for the journal device. If the journal
+    device is not set or if the metadata device is used for the journal the
+    option is ignored and disable_meta_fsync value is used instead of it. If
+    the same device is used for data, metadata and journal the option is also
+    ignored and disable_data_fsync value is used instead of it.
+  info_ru: |
+    То же, что disable_data_fsync, но для устройства журнала. Если устройство
+    журнала не задано или если оно равно устройству метаданных, значение опции
+    игнорируется и вместо него используется значение опции disable_meta_fsync.
+    Если одно и то же устройство используется и под данные, и под журнал, и под
+    метаданные - значение опции также игнорируется и вместо него используется
+    значение опции disable_data_fsync.
+- name: disable_device_lock
+  type: bool
+  default: false
+  info: |
+    Do not lock data, metadata and journal block devices exclusively with
+    flock(). It's not recommended, but you can use it if you want to run
+    multiple OSDs on a single device with different offsets, without using
+    partitions.
+  info_ru: |
+    Не блокировать устройства данных, метаданных и журнала от открытия их
+    другими OSD с помощью flock(). Так делать не рекомендуется, но теоретически
+    вы можете это использовать, чтобы запускать несколько OSD на одном
+    устройстве с разными смещениями и без использования разделов.
+- name: disk_alignment
+  type: int
+  default: 4096
+  info: |
+    Required physical disk write alignment. Most current SSD and HDD drives
+    use 4 KB physical sectors even if they report 512 byte logical sector
+    size, so 4 KB is a good default setting.
+
+    Note, however, that physical sector size also affects write amplification
+    (WA), because with block devices it's impossible to write anything smaller
+    than a block. So, when Vitastor has to write a single metadata entry that's
+    only about 32 bytes in size, it actually has to write the whole 4 KB sector.
+
+    Because of this it can actually be beneficial to use SSDs which work well
+    with 512 byte sectors and use 512 byte disk_alignment, journal_block_size
+    and meta_block_size. But the only SSD that may fit into this category is
+    Intel Optane (probably, not tested yet).
+
+    Clients don't need to be aware of disk_alignment, so it's not required to
+    put a modified value into etcd key /vitastor/config/global.
+  info_ru: |
+    Требуемое выравнивание записи на физические диски. Почти все современные
+    SSD и HDD диски используют 4 КБ физические секторы, даже если показывают
+    логический размер сектора 512 байт, поэтому 4 КБ - хорошее значение по
+    умолчанию.
+
+    Однако стоит понимать, что физический размер сектора тоже влияет на
+    избыточную запись (WA), потому что ничего меньше блока (сектора) на блочное
+    устройство записать невозможно. Таким образом, когда Vitastor-у нужно
+    записать на диск всего лишь одну 32-байтную запись метаданных, фактически
+    приходится перезаписывать 4 КБ сектор целиком.
+
+    Поэтому, на самом деле, может быть выгодно найти SSD, хорошо работающие с
+    меньшими, 512-байтными, блоками и использовать 512-байтные disk_alignment,
+    journal_block_size и meta_block_size. Однако единственные SSD, которые
+    теоретически могут попасть в эту категорию - это Intel Optane (но и это
+    пока не проверялось автором).
+
+    Клиентам не обязательно знать про disk_alignment, так что помещать значение
+    этого параметра в etcd в /vitastor/config/global не нужно.
--- a/docs/params/monitor.yml
+++ b/docs/params/monitor.yml
@@ -0,0 +1,65 @@
+- name: etcd_mon_ttl
+  type: sec
+  min: 10
+  default: 30
+  info: Monitor etcd lease refresh interval in seconds
+  info_ru: Интервал обновления etcd резервации (lease) монитором
+- name: etcd_mon_timeout
+  type: ms
+  default: 1000
+  info: etcd request timeout used by monitor
+  info_ru: Таймаут выполнения запросов к etcd от монитора
+- name: etcd_mon_retries
+  type: int
+  default: 5
+  info: Maximum number of attempts for one monitor etcd request
+  info_ru: Максимальное число попыток выполнения запросов к etcd монитором
+- name: mon_change_timeout
+  type: ms
+  min: 100
+  default: 1000
+  info: Optimistic retry interval for monitor etcd modification requests
+  info_ru: Время повтора при коллизиях при запросах модификации в etcd, производимых монитором
+- name: mon_stats_timeout
+  type: ms
+  min: 100
+  default: 1000
+  info: |
+    Interval for monitor to wait before updating aggregated statistics in
+    etcd after receiving OSD statistics updates
+  info_ru: |
+    Интервал, который монитор ожидает при изменении статистики по отдельным
+    OSD перед обновлением агрегированной статистики в etcd
+- name: osd_out_time
+  type: sec
+  default: 600
+  info: |
+    Time after which a failed OSD is removed from the data distribution.
+    I.e. time which the monitor waits before attempting to restore data
+    redundancy using other OSDs.
+  info_ru: |
+    Время, через которое отключенный OSD исключается из распределения данных.
+    То есть, время, которое монитор ожидает перед попыткой переместить данные
+    на другие OSD и таким образом восстановить избыточность хранения.
+- name: placement_levels
+  type: json
+  default: '`{"host":100,"osd":101}`'
+  info: |
+    Levels for the placement tree. You can define arbitrary tree levels by
+    defining them in this parameter. The configuration parameter value should
+    contain a JSON object with level names as keys and integer priorities as
+    values. Smaller priority means a higher level in the tree. For example,
+    "datacenter" should have smaller priority than "osd". "host" and "osd"
+    levels are always predefined and can't be removed. If one of them is not
+    present in the configuration, then it is defined with the default priority
+    (100 for "host", 101 for "osd").
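+
+    A minimal sketch of a custom value, assuming a hypothetical "dc" level
+    placed above the predefined ones (the name and priority are illustrative):
+
+    ```json
+    {"dc": 90, "host": 100, "osd": 101}
+    ```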
+  info_ru: |
+    Определения уровней для дерева размещения OSD. Вы можете определять
+    произвольные уровни, помещая их в данный параметр конфигурации. Значение
+    параметра должно содержать JSON-объект, ключи которого будут являться
+    названиями уровней, а значения - целочисленными приоритетами. Меньшие
+    приоритеты соответствуют верхним уровням дерева. Например, уровень
+    "датацентр" должен иметь меньший приоритет, чем "OSD". Уровни с названиями
+    "host" и "osd" являются предопределёнными и не могут быть удалены. Если
+    один из них отсутствует в конфигурации, он доопределяется с приоритетом по
+    умолчанию (100 для уровня "host", 101 для "osd").
--- a/docs/params/network.yml
+++ b/docs/params/network.yml
@@ -0,0 +1,225 @@
+- name: tcp_header_buffer_size
+  type: int
+  default: 65536
+  info: |
+    Size of the buffer used to read data using an additional copy. Vitastor
+    packet headers are 128 bytes, payload is always at least 4 KB, so it is
+    usually beneficial to try to read multiple packets at once even though
+    it requires copying the data an additional time. The rest of each packet
+    is received without an additional copy. You can try to play with this
+    parameter and see how it affects random iops and linear bandwidth if you
+    want.
+  info_ru: |
+    Размер буфера для чтения данных с дополнительным копированием. Пакеты
+    Vitastor содержат 128-байтные заголовки, за которыми следуют данные размером
+    от 4 КБ и для мелких операций ввода-вывода обычно выгодно за 1 вызов читать
+    сразу несколько пакетов, даже не смотря на то, что это требует лишний раз
+    скопировать данные. Часть каждого пакета за пределами значения данного
+    параметра читается без дополнительного копирования. Вы можете попробовать
+    поменять этот параметр и посмотреть, как он влияет на производительность
+    случайного и линейного доступа.
+- name: use_sync_send_recv
+  type: bool
+  default: false
+  info: |
+    If true, synchronous send/recv syscalls are used instead of io_uring for
+    socket communication. Useless for OSDs because they require io_uring anyway,
+    but may be required for clients with old kernel versions.
+  info_ru: |
+    Если установлено в истину, то вместо io_uring для передачи данных по сети
+    будут использоваться обычные синхронные системные вызовы send/recv. Для OSD
+    это бессмысленно, так как OSD в любом случае нуждается в io_uring, но, в
+    принципе, это может применяться для клиентов со старыми версиями ядра.
+- name: use_rdma
+  type: bool
+  default: true
+  info: |
+    Try to use RDMA for communication if it's available. Disable if you don't
+    want Vitastor to use RDMA. RDMA increases the performance, but TCP-only
+    clients can still talk to an RDMA-enabled cluster, so you don't need to
+    make sure that all clients support RDMA when enabling it.
+  info_ru: |
+    Пытаться использовать RDMA для связи при наличии доступных устройств.
+    Отключите, если вы не хотите, чтобы Vitastor использовал RDMA.
+    RDMA улучшает производительность, но клиенты, работающие только по TCP,
+    всё равно могут работать с кластером с включённым RDMA, так что при его
+    включении не нужно следить за тем, чтобы все клиенты поддерживали RDMA.
+- name: rdma_device
+  type: string
+  info: |
+    RDMA device name to use for Vitastor OSD communications (for example,
+    "rocep5s0f0"). Please note that Vitastor RDMA requires Implicit On-Demand
+    Paging (Implicit ODP) and Scatter/Gather (SG) support from the RDMA device
+    to work. For example, Mellanox ConnectX-3 and older adapters don't have
+    Implicit ODP, so they're unsupported by Vitastor. Run `ibv_devinfo -v` as
+    root to list available RDMA devices and their features.
+  info_ru: |
+    Название RDMA-устройства для связи с Vitastor OSD (например, "rocep5s0f0").
+    Имейте в виду, что поддержка RDMA в Vitastor требует функций устройства
+    Implicit On-Demand Paging (Implicit ODP) и Scatter/Gather (SG). Например,
+    адаптеры Mellanox ConnectX-3 и более старые не поддерживают Implicit ODP и
+    потому не поддерживаются в Vitastor. Запустите `ibv_devinfo -v` от имени
+    суперпользователя, чтобы посмотреть список доступных RDMA-устройств, их
+    параметры и возможности.
+- name: rdma_port_num
+  type: int
+  default: 1
+  info: |
+    RDMA device port number to use. Only for devices that have more than 1 port.
+    See `phys_port_cnt` in `ibv_devinfo -v` output to determine how many ports
+    your device has.
+  info_ru: |
+    Номер порта RDMA-устройства, который следует использовать. Имеет смысл
+    только для устройств, у которых более 1 порта. Чтобы узнать, сколько портов
+    у вашего адаптера, посмотрите `phys_port_cnt` в выводе команды
+    `ibv_devinfo -v`.
+- name: rdma_gid_index
+  type: int
+  default: 0
+  info: |
+    Global address identifier index of the RDMA device to use. Different GID
+    indexes may correspond to different protocols like RoCEv1, RoCEv2 and iWARP.
+    Search for "GID" in `ibv_devinfo -v` output to determine which GID index
+    you need.
+
+    **IMPORTANT:** If you want to use RoCEv2 (as recommended) then the correct
+    rdma_gid_index is usually 1 (IPv6) or 3 (IPv4).
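+
+    A minimal sketch of RoCEv2 over IPv4 settings, assuming the parameters are
+    kept in one JSON configuration object. The device name is the example one
+    from `rdma_device` and must be replaced with your own, and the GID index
+    should be verified against the `ibv_devinfo -v` output:
+
+    ```json
+    {
+      "use_rdma": true,
+      "rdma_device": "rocep5s0f0",
+      "rdma_port_num": 1,
+      "rdma_gid_index": 3
+    }
+    ```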
+  info_ru: |
+    Номер глобального идентификатора адреса RDMA-устройства, который следует
+    использовать. Разным gid_index могут соответствовать разные протоколы связи:
+    RoCEv1, RoCEv2, iWARP. Чтобы понять, какой нужен вам - смотрите строчки со
+    словом "GID" в выводе команды `ibv_devinfo -v`.
+
+    **ВАЖНО:** Если вы хотите использовать RoCEv2 (как мы и рекомендуем), то
+    правильный rdma_gid_index, как правило, 1 (IPv6) или 3 (IPv4).
+- name: rdma_mtu
+  type: int
+  default: 4096
+  info: |
+    RDMA Path MTU to use. Must be 1024, 2048 or 4096. There is usually no
+    reason to change it from the default 4096.
+  info_ru: |
+    Максимальная единица передачи (Path MTU) для RDMA. Должно быть равно 1024,
+    2048 или 4096. Обычно нет смысла менять значение по умолчанию, равное 4096.
+- name: rdma_max_sge
+  type: int
+  default: 128
+  info: |
+    Maximum number of scatter/gather entries to use for RDMA. OSDs negotiate
+    the actual value when establishing connection anyway, so it's usually not
+    required to change this parameter.
+  info_ru: |
+    Максимальное число записей разделения/сборки (scatter/gather) для RDMA.
+    OSD в любом случае согласовывают реальное значение при установке соединения,
+    так что менять этот параметр обычно не нужно.
+- name: rdma_max_msg
+  type: int
+  default: 1048576
+  info: Maximum size of a single RDMA send or receive operation in bytes.
+  info_ru: Максимальный размер одной RDMA-операции отправки или приёма.
+- name: rdma_max_recv
+  type: int
+  default: 8
+  info: |
+    Maximum number of parallel RDMA receive operations. Note that this number
+    of receive buffers `rdma_max_msg` in size are allocated for each client,
+    so this setting actually affects memory usage. This is because RDMA receive
+    operations are (sadly) still not zero-copy in Vitastor. It may be fixed in
+    later versions.
+  info_ru: |
+    Максимальное число параллельных RDMA-операций получения данных. Следует
+    иметь в виду, что данное число буферов размером `rdma_max_msg` выделяется
+    для каждого подключённого клиентского соединения, так что данная настройка
+    влияет на потребление памяти. Это так потому, что RDMA-приём данных в
+    Vitastor, увы, всё равно не является zero-copy, т.е. всё равно 1 раз
+    копирует данные в памяти. Данная особенность, возможно, будет исправлена в
+    более новых версиях Vitastor.
+- name: peer_connect_interval
+  type: sec
+  min: 1
+  default: 5
+  info: Interval before attempting to reconnect to an unavailable OSD.
+  info_ru: Время ожидания перед повторной попыткой соединиться с недоступным OSD.
+- name: peer_connect_timeout
+  type: sec
+  min: 1
+  default: 5
+  info: Timeout for OSD connection attempts.
+  info_ru: Максимальное время ожидания попытки соединения с OSD.
+- name: osd_idle_timeout
+  type: sec
+  min: 1
+  default: 5
+  info: |
+    OSD connection inactivity time after which clients and other OSDs send
+    keepalive requests to check state of the connection.
+  info_ru: |
+    Время неактивности соединения с OSD, после которого клиенты или другие OSD
+    посылают запрос проверки состояния соединения.
+- name: osd_ping_timeout
+  type: sec
+  min: 1
+  default: 5
+  info: |
+    Maximum time to wait for OSD keepalive responses. If an OSD doesn't respond
+    within this time, the connection to it is dropped and a reconnection attempt
+    is scheduled.
+  info_ru: |
+    Максимальное время ожидания ответа на запрос проверки состояния соединения.
+    Если OSD не отвечает за это время, соединение отключается и производится
+    повторная попытка соединения.
+- name: up_wait_retry_interval
+  type: ms
+  min: 50
+  default: 500
+  info: |
+    OSDs respond to clients with a special error code when they receive I/O
+    requests for a PG that's not synchronized and started. This parameter sets
+    the time for the clients to wait before re-attempting such I/O requests.
+  info_ru: |
+    Когда OSD получают от клиентов запросы ввода-вывода, относящиеся к не
+    поднятым на данный момент на них PG, либо к PG в процессе синхронизации,
+    они отвечают клиентам специальным кодом ошибки, означающим, что клиент
+    должен некоторое время подождать перед повторением запроса. Именно это время
+    ожидания задаёт данный параметр.
+- name: max_etcd_attempts
+  type: int
+  default: 5
+  info: |
+    Maximum number of attempts for etcd requests which can't be retried
+    indefinitely.
+  info_ru: |
+    Максимальное число попыток выполнения запросов к etcd для тех запросов,
+    которые нельзя повторять бесконечно.
+- name: etcd_quick_timeout
+  type: ms
+  default: 1000
+  info: |
+    Timeout for etcd requests which should complete quickly, like lease refresh.
+  info_ru: |
+    Максимальное время выполнения запросов к etcd, которые должны завершаться
+    быстро, таких, как обновление резервации (lease).
+- name: etcd_slow_timeout
+  type: ms
+  default: 5000
+  info: Timeout for etcd requests which are allowed to wait for some time.
+  info_ru: |
+    Максимальное время выполнения запросов к etcd, для которых не обязательно
+    гарантировать быстрое выполнение.
+- name: etcd_keepalive_timeout
+  type: sec
+  default: max(30, etcd_report_interval*2)
+  info: |
+    Timeout for etcd connection HTTP Keep-Alive. Should be higher than
+    etcd_report_interval to guarantee that keepalive actually works.
+  info_ru: |
+    Таймаут для HTTP Keep-Alive в соединениях к etcd. Должен быть больше, чем
+    etcd_report_interval, чтобы keepalive гарантированно работал.
+- name: etcd_ws_keepalive_timeout
+  type: sec
+  default: 30
+  info: |
+    etcd websocket ping interval required to keep the connection alive and
+    detect disconnections quickly.
+  info_ru: |
+    Интервал проверки живости вебсокет-подключений к etcd.
--- a/docs/params/osd.yml
+++ b/docs/params/osd.yml
@@ -0,0 +1,341 @@
+- name: etcd_report_interval
+  type: sec
+  default: 5
+  info: |
+    Interval at which OSDs report their state to etcd. Affects OSD lease time
+    and thus the failover speed. Lease time is equal to this parameter value
+    plus max_etcd_attempts * etcd_quick_timeout because it should be guaranteed
+    that every OSD always refreshes its lease in time.
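+
+    As a worked example with the documented defaults (etcd_report_interval
+    of 5 s, max_etcd_attempts of 5, etcd_quick_timeout of 1000 ms), the lease
+    time comes to 5 + 5 * 1 = 10 seconds. The same values as a JSON parameter
+    fragment (the JSON form is only an illustration):
+
+    ```json
+    {"etcd_report_interval": 5, "max_etcd_attempts": 5, "etcd_quick_timeout": 1000}
+    ```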
+  info_ru: |
+    Интервал, с которым OSD обновляет своё состояние в etcd. Значение параметра
+    влияет на время резервации (lease) OSD и поэтому на скорость переключения
+    при падении OSD. Время lease равняется значению этого параметра плюс
+    max_etcd_attempts * etcd_quick_timeout.
+- name: run_primary
+  type: bool
+  default: true
+  info: |
+    Start primary OSD logic on this OSD. As of now, can be turned off only for
+    debugging purposes. It's possible to implement an additional feature for
+    the monitor which would allow separating primary and secondary OSDs, but
+    it's unclear why anyone would need it, so it's not implemented.
+  info_ru: |
+    Запускать логику первичного OSD на данном OSD. На данный момент отключать
+    эту опцию может иметь смысл только в целях отладки. В теории, можно
+    реализовать дополнительный режим для монитора, который позволит отделять
+    первичные OSD от вторичных, но пока не понятно, зачем это может кому-то
+    понадобиться, поэтому это не реализовано.
+- name: osd_network
+  type: string or array of strings
+  type_ru: строка или массив строк
+  info: |
+    Network mask of the network (IPv4 or IPv6) to use for OSDs. Note that
+    although it's possible to specify multiple networks here, this does not
+    mean that OSDs will create multiple listening sockets - they'll only
+    pick the first matching address of an UP + RUNNING interface. Separate
+    networks for cluster and client connections are also not implemented, but
+    they are mostly useless anyway, so it's not a big deal.
+  info_ru: |
+    Маска подсети (IPv4 или IPv6) для использования для соединений с OSD.
+    Имейте в виду, что хотя сейчас и можно передать в этот параметр несколько
+    подсетей, это не означает, что OSD будут создавать несколько слушающих
+    сокетов - они лишь будут выбирать адрес первого поднятого (состояние UP +
+    RUNNING), подходящий под заданную маску. Также не реализовано разделение
+    кластерной и публичной сетей OSD. Правда, от него обычно всё равно довольно
+    мало толку, так что особенной проблемы в этом нет.
+- name: bind_address
+  type: string
+  default: "0.0.0.0"
+  info: |
+    Instead of the network mask, you can also set OSD listen address explicitly
+    using this parameter. May be useful if you want to start OSDs on interfaces
+    that are not UP + RUNNING.
+  info_ru: |
+    Этим параметром можно явным образом задать адрес, на котором будет ожидать
+    соединений OSD (вместо использования маски подсети). Может быть полезно,
+    например, чтобы запускать OSD на неподнятых интерфейсах (не UP + RUNNING).
+- name: bind_port
+  type: int
+  info: |
+    By default, OSDs pick random ports to use for incoming connections
+    automatically. With this option you can set a specific port for a specific
+    OSD by hand.
+  info_ru: |
+    По умолчанию OSD сами выбирают случайные порты для входящих подключений.
+    С помощью данной опции вы можете задать порт для отдельного OSD вручную.
+- name: autosync_interval
+  type: sec
+  default: 5
+  info: |
+    Time interval at which automatic fsyncs/flushes are issued by each OSD when
+    the immediate_commit mode is disabled. fsyncs are required because without
+    them OSDs quickly fill their journals, become unable to clear them and
+    stall. This option also limits the amount of recent uncommitted changes
+    which OSDs may lose in case of a power outage if clients don't issue
+    fsyncs at all.
+  info_ru: |
+    Временной интервал отправки автоматических fsync-ов (операций очистки кэша)
+    каждым OSD для случая, когда режим immediate_commit отключён. fsync-и нужны
+    OSD, чтобы успевать очищать журнал - без них OSD быстро заполняют журналы и
+    перестают обрабатывать операции записи. Также эта опция ограничивает объём
+    недавних незафиксированных изменений, которые OSD могут терять при
+    отключении питания, если клиенты вообще не отправляют fsync.
+- name: autosync_writes
+  type: int
+  default: 128
+  info: |
+    Same as autosync_interval, but sets the maximum number of uncommitted write
+    operations before issuing an fsync operation internally.
+  info_ru: |
+    Аналогично autosync_interval, но задаёт не временной интервал, а
+    максимальное количество незафиксированных операций записи перед
+    принудительной отправкой fsync-а.
+- name: recovery_queue_depth
+  type: int
+  default: 4
+  info: |
+    Maximum number of recovery operations per primary OSD at any given moment.
+    Currently it's the only parameter available to tune the speed of recovery
+    and rebalancing, but more are planned.
+  info_ru: |
+    Максимальное число операций восстановления на одном первичном OSD в любой
+    момент времени. На данный момент единственный параметр, который можно менять
+    для ускорения или замедления восстановления и перебалансировки данных, но
+    в планах реализация других параметров.
+- name: recovery_sync_batch
+  type: int
+  default: 16
+  info: Maximum number of recovery operations before issuing an additional fsync.
+  info_ru: Максимальное число операций восстановления перед дополнительным fsync.
+- name: readonly
+  type: bool
+  default: false
+  info: |
+    Read-only mode. If this is enabled, an OSD will never issue any writes to
+    the underlying device. This may be useful for recovery purposes.
+  info_ru: |
+    Режим "только чтение". Если включить этот режим, OSD не будет писать ничего
+    на диск. Может быть полезно в целях восстановления.
+- name: no_recovery
+  type: bool
+  default: false
+  info: |
+    Disable automatic background recovery of objects. Note that it doesn't
+    affect implicit recovery of objects happening during writes - a write is
+    always made to a full set of at least pg_minsize OSDs.
+  info_ru: |
+    Отключить автоматическое фоновое восстановление объектов. Обратите внимание,
+    что эта опция не отключает восстановление объектов, происходящее при
+    записи - запись всегда производится в полный набор из как минимум pg_minsize
+    OSD.
+- name: no_rebalance
+  type: bool
+  default: false
+  info: |
+    Disable background movement of data between different OSDs. Disabling it
+    means that PGs in the `has_misplaced` state will be left in it indefinitely.
+  info_ru: |
+    Отключить фоновое перемещение объектов между разными OSD. Отключение
+    означает, что PG, находящиеся в состоянии `has_misplaced`, будут оставлены
+    в нём на неопределённый срок.
+- name: print_stats_interval
+  type: sec
+  default: 3
+  info: |
+    Time interval at which OSDs print simple human-readable operation
+    statistics on stdout.
+  info_ru: |
+    Временной интервал, с которым OSD печатают простую человекочитаемую
+    статистику выполнения операций в стандартный вывод.
+- name: slow_log_interval
+  type: sec
+  default: 10
+  info: |
+    Time interval at which OSDs dump slow or stuck operations on stdout, if
+    there are any. It's also the time after which an operation is considered
+    "slow".
+  info_ru: |
+    Временной интервал, с которым OSD выводят в стандартный вывод список
+    медленных или зависших операций, если таковые имеются. Также время, при
+    превышении которого операция считается "медленной".
+- name: max_write_iodepth
+  type: int
+  default: 128
+  info: |
+    Parallel client write operation limit per one OSD. Operations that exceed
+    this limit are pushed to a temporary queue instead of being executed
+    immediately.
+  info_ru: |
+    Максимальное число одновременных клиентских операций записи на один OSD.
+    Операции, превышающие этот лимит, не исполняются сразу, а сохраняются во
+    временной очереди.
+- name: min_flusher_count
+  type: int
+  default: 1
+  info: |
+    A flusher is a micro-thread that moves data from the journal to the data
+    area of the device. The number of flushers is auto-tuned between a minimum
+    and a maximum. The minimum is set by this parameter.
+  info_ru: |
+    Flusher - это микро-поток (корутина), которая копирует данные из журнала в
+    основную область устройства данных. Их число настраивается динамически между
+    минимальным и максимальным значением. Этот параметр задаёт минимальное число.
+- name: max_flusher_count
+  type: int
+  default: 256
+  info: |
+    Maximum number of journal flushers (see above min_flusher_count).
+  info_ru: |
+    Максимальное число микро-потоков очистки журнала (см. выше min_flusher_count).
+- name: inmemory_metadata
+  type: bool
+  default: true
+  info: |
+    This parameter makes Vitastor always keep the metadata area of the block
+    device in memory. It's required for good performance because it avoids
+    additional read-modify-write cycles during metadata modifications. Metadata
+    area size is currently roughly 224 MB per 1 TB of data. You can turn it off
+    to reduce memory usage by this value, but it will hurt performance. This
+    restriction is likely to be removed in the future along with the upgrade
+    of the metadata storage scheme.
+  info_ru: |
+    Данный параметр заставляет Vitastor всегда держать область метаданных диска
+    в памяти. Это нужно, чтобы избегать дополнительных операций чтения с диска
+    при записи. Размер области метаданных на данный момент составляет примерно
+    224 МБ на 1 ТБ данных. При отключении потребление памяти снизится примерно
+    на эту величину, но при этом также снизится и производительность. В будущем,
+    после обновления схемы хранения метаданных, это ограничение, скорее всего,
+    будет ликвидировано.
+- name: inmemory_journal
+  type: bool
+  default: true
+  info: |
+    This parameter makes Vitastor always keep the journal area of the block
+    device in memory. Turning it off will, again, reduce memory usage, but
+    hurt performance because flusher coroutines will have to read data from
+    the disk back before copying it into the main area. The memory usage benefit
+    is typically very small because it's sufficient to have a 16-32 MB journal
+    for SSD OSDs. However, in theory it's possible that you'll want to turn it
+    off for hybrid (HDD+SSD) OSDs with large journals on fast devices.
+  info_ru: |
+    Данный параметр заставляет Vitastor всегда держать в памяти журналы OSD.
+    Отключение параметра, опять же, снижает потребление памяти, но ухудшает
+    производительность, так как для копирования данных из журнала в основную
+    область устройства OSD будут вынуждены читать их обратно с диска. Выигрыш
+    по памяти при этом обычно крайне низкий, так как для SSD OSD обычно
+    достаточно 16- или 32-мегабайтного журнала. Однако в теории отключение
+    параметра может оказаться полезным для гибридных OSD (HDD+SSD) с большими
+    журналами, расположенными на быстром по сравнению с HDD устройстве.
+- name: journal_sector_buffer_count
+  type: int
+  default: 32
+  info: |
+    Maximum number of buffers that can be used for writing journal metadata
+    blocks. The only situation when you should increase it to a larger value
+    is when you enable journal_no_same_sector_overwrites. In this case set
+    it to, for example, 1024.
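+
+    The combination described above could look like this as a JSON parameter
+    fragment (the JSON form is an assumption; 1024 is the example value from
+    the text):
+
+    ```json
+    {"journal_no_same_sector_overwrites": true, "journal_sector_buffer_count": 1024}
+    ```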
+  info_ru: |
+    Максимальное число буферов, разрешённых для использования под записываемые
+    в журнал блоки метаданных. Единственная ситуация, в которой этот параметр
+    нужно менять - это если вы включаете journal_no_same_sector_overwrites. В
+    этом случае установите данный параметр, например, в 1024.
+- name: journal_no_same_sector_overwrites
+  type: bool
+  default: false
+  info: |
+    Enable this option for SSDs like Intel D3-S4510 and D3-S4610 which REALLY
+    don't like when a program overwrites the same sector multiple times in a
+    row and slow down significantly (from 25000+ iops to ~3000 iops). When
+    this option is set, Vitastor will always move to the next sector of the
+    journal after writing it instead of possibly overwriting it the second time.
+  info_ru: |
+    Включайте данную опцию для SSD вроде Intel D3-S4510 и D3-S4610, которые
+    ОЧЕНЬ не любят, когда ПО перезаписывает один и тот же сектор несколько раз
+    подряд. Такие SSD при многократной перезаписи одного и того же сектора
+    сильно замедляются - условно, с 25000 и более iops до 3000 iops. Когда
+    данная опция установлена, Vitastor всегда переходит к следующему сектору
+    журнала после записи вместо потенциально повторной перезаписи того же
+    самого сектора.
+- name: throttle_small_writes
+  type: bool
+  default: false
+  info: |
+    Enable soft throttling of small journaled writes. Useful for hybrid OSDs
+    with fast journal/metadata devices and slow data devices. The idea is that
+    small writes complete very quickly because they're first written to the
+    journal device, but moving them to the main device is slow. So if an OSD
+    allows clients to issue a lot of small writes it will perform very well
+    for several seconds and then the journal will fill up and the performance
+    will drop to almost zero. Throttling is meant to prevent this problem by
+    artificially slowing quick writes down based on the amount of free space in
+    the journal. When throttling is used, the performance of small writes will
+    decrease smoothly instead of dropping abruptly at the moment when the
+    journal fills up.
+  info_ru: |
+    Разрешить мягкое ограничение скорости журналируемой записи. Полезно для
+    гибридных OSD с быстрыми устройствами метаданных и медленными устройствами
+    данных. Идея заключается в том, что мелкие записи в этой ситуации могут
+    завершаться очень быстро, так как они изначально записываются на быстрое
+    журнальное устройство (SSD). Но перемещать их потом на основное медленное
+    устройство долго. Поэтому если OSD быстро примет от клиентов очень много
+    мелких операций записи, он быстро заполнит свой журнал, после чего
+    производительность записи резко упадёт практически до нуля. Ограничение
+    скорости записи призвано решить эту проблему с помощью искусственного
+    замедления операций записи на основании объёма свободного места в журнале.
+    Когда эта опция включена, производительность мелких операций записи будет
+    снижаться плавно, а не резко в момент окончательного заполнения журнала.
+- name: throttle_target_iops
+  type: int
+  default: 100
+  info: |
+    Target maximum number of throttled operations per second under the condition
+    of full journal. Set it to approximate random write iops of your data devices
+    (HDDs).
+  info_ru: |
+    Расчётное максимальное число ограничиваемых операций в секунду при условии
+    отсутствия свободного места в журнале. Устанавливайте приблизительно равным
+    максимальной производительности случайной записи ваших устройств данных
+    (HDD) в операциях в секунду.
+- name: throttle_target_mbs
+  type: int
+  default: 100
+  info: |
+    Target maximum bandwidth in MB/s of throttled operations under the
+    condition of full journal. Set it to approximate linear write
+    performance of your data devices (HDDs).
+  info_ru: |
+    Расчётная максимальная скорость в МБ/с ограничиваемых операций при
+    условии отсутствия свободного места в журнале. Устанавливайте приблизительно
+    равным максимальной производительности линейной записи ваших устройств
+    данных (HDD).
+- name: throttle_target_parallelism
+  type: int
+  default: 1
+  info: |
+    Target maximum parallelism of throttled operations when the journal is
+    full. Set it to the approximate internal parallelism of your data
+    devices (1 for HDDs, 4-8 for SSDs).
+  info_ru: |
+    Расчётный максимальный параллелизм ограничиваемых операций при условии
+    отсутствия свободного места в журнале. Устанавливайте приблизительно равным
+    внутреннему параллелизму ваших устройств данных (1 для HDD, 4-8 для
+    SSD).
+- name: throttle_threshold_us
+  type: us
+  default: 50
+  info: |
+    Minimal computed delay to be applied to throttled operations. Usually
+    doesn't need to be changed.
+  info_ru: |
+    Минимальная применимая к ограничиваемым операциям задержка. Обычно не
+    требует изменений.
+- name: osd_memlock
+  type: bool
+  default: false
+  info: >
+    Lock all OSD memory with mlockall() to prevent it from being swapped out.
+    Requires a sufficient ulimit -l (max locked memory).
+  info_ru: >
+    Блокировать всю память OSD с помощью mlockall, чтобы запретить её выгрузку
+    в пространство подкачки. Требует достаточного значения ulimit -l (лимита
+    заблокированной памяти).
--- a/2
+++ b/2
--- a/mon/lp-optimizer.js
+++ b/mon/lp-optimizer.js
@@ -50,7 +50,7 @@ async function lp_solve(text)
    return { score, vars };
 }

-async function optimize_initial({ osd_tree, pg_count, pg_size = 3, pg_minsize = 2, max_combinations = 10000, parity_space = 1, round_robin = false })
+async function optimize_initial({ osd_tree, pg_count, pg_size = 3, pg_minsize = 2, max_combinations = 10000, parity_space = 1, ordered = false })
 {
    if (!pg_count || !osd_tree)
    {
@@ -92,7 +92,7 @@ async function optimize_initial({ osd_tree, pg_count, pg_size = 3, pg_minsize =
        console.log(lp);
        throw new Error('Problem is infeasible or unbounded - is it a bug?');
    }
-    const int_pgs = make_int_pgs(lp_result.vars, pg_count, round_robin);
+    const int_pgs = make_int_pgs(lp_result.vars, pg_count, ordered);
    const eff = pg_list_space_efficiency(int_pgs, all_weights, pg_minsize, parity_space);
    const res = {
        score: lp_result.score,
@@ -140,20 +140,20 @@ function make_int_pgs(weights, pg_count, round_robin)
    return int_pgs;
 }

-function calc_intersect_weights(pg_size, pg_count, prev_weights, all_pgs)
+function calc_intersect_weights(old_pg_size, pg_size, pg_count, prev_weights, all_pgs, ordered)
 {
    const move_weights = {};
-    if ((1 << pg_size) < pg_count)
+    if ((1 << old_pg_size) < pg_count)
    {
        const intersect = {};
        for (const pg_name in prev_weights)
        {
            const pg = pg_name.substr(3).split(/_/);
-            for (let omit = 1; omit < (1 << pg_size); omit++)
+            for (let omit = 1; omit < (1 << old_pg_size); omit++)
            {
                let pg_omit = [ ...pg ];
-                let intersect_count = pg_size;
-                for (let i = 0; i < pg_size; i++)
+                let intersect_count = old_pg_size;
+                for (let i = 0; i < old_pg_size; i++)
                {
                    if (omit & (1 << i))
                    {
@@ -161,6 +161,8 @@ function calc_intersect_weights(pg_size, pg_count, prev_weights, all_pgs)
                        intersect_count--;
                    }
                }
+                if (!ordered)
+                    pg_omit = pg_omit.filter(n => n).sort();
                pg_omit = pg_omit.join(':');
                intersect[pg_omit] = Math.max(intersect[pg_omit] || 0, intersect_count);
            }
@@ -174,10 +176,10 @@ function calc_intersect_weights(pg_size, pg_count, prev_weights, all_pgs)
                for (let i = 0; i < pg_size; i++)
                {
                    if (omit & (1 << i))
-                    {
                        pg_omit[i] = '';
-                    }
                }
+                if (!ordered)
+                    pg_omit = pg_omit.filter(n => n).sort();
                pg_omit = pg_omit.join(':');
                max_int = Math.max(max_int, intersect[pg_omit] || 0);
            }
@@ -186,15 +188,18 @@ function calc_intersect_weights(pg_size, pg_count, prev_weights, all_pgs)
    }
    else
    {
-        const prev_pg_hashed = Object.keys(prev_weights).map(pg_name => pg_name.substr(3).split(/_/).reduce((a, c) => { a[c] = 1; return a; }, {}));
+        const prev_pg_hashed = Object.keys(prev_weights).map(pg_name => pg_name
+            .substr(3).split(/_/).reduce((a, c, i) => { a[c] = i+1; return a; }, {}));
        for (const pg of all_pgs)
        {
            if (!prev_weights['pg_'+pg.join('_')])
            {
                let max_int = 0;
-                for (const prev_hash in prev_pg_hashed)
+                for (const prev_hash of prev_pg_hashed)
                {
-                    const intersect_count = pg.reduce((a, osd) => a + (prev_hash[osd] ? 1 : 0), 0);
+                    const intersect_count = ordered
+                        ? pg.reduce((a, osd, i) => a + (prev_hash[osd] == 1+i ? 1 : 0), 0)
+                        : pg.reduce((a, osd, i) => a + (prev_hash[osd] ? 1 : 0), 0);
                    if (max_int < intersect_count)
                    {
                        max_int = intersect_count;
@@ -243,7 +248,7 @@ function add_valid_previous(osd_tree, prev_weights, all_pgs)
 }

 // Try to minimize data movement
-async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3, pg_minsize = 2, max_combinations = 10000, parity_space = 1 })
+async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3, pg_minsize = 2, max_combinations = 10000, parity_space = 1, ordered = false })
 {
    if (!osd_tree)
    {
@@ -266,9 +271,13 @@ async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3,
            prev_pg_per_osd[osd].push([ pg_name, (i >= pg_minsize ? parity_space : 1) ]);
        }
    }
+    const old_pg_size = prev_int_pgs[0].length;
    // Get all combinations
    let all_pgs = random_combinations(osd_tree, pg_size, max_combinations, parity_space > 1);
-    add_valid_previous(osd_tree, prev_weights, all_pgs);
+    if (old_pg_size == pg_size)
+    {
+        add_valid_previous(osd_tree, prev_weights, all_pgs);
+    }
    all_pgs = Object.values(all_pgs);
    const pg_per_osd = {};
    for (const pg of all_pgs)
@@ -282,7 +291,7 @@ async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3,
        }
    }
    // Penalize PGs based on their similarity to old PGs
-    const move_weights = calc_intersect_weights(pg_size, pg_count, prev_weights, all_pgs);
+    const move_weights = calc_intersect_weights(old_pg_size, pg_size, pg_count, prev_weights, all_pgs, ordered);
    // Calculate total weight - old PG weights
    const all_pg_names = all_pgs.map(pg => 'pg_'+pg.join('_'));
    const all_pgs_hash = all_pg_names.reduce((a, c) => { a[c] = true; return a; }, {});
@@ -373,11 +382,35 @@ async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3,
        {
            differs++;
        }
-        for (let j = 0; j < pg_size; j++)
+    }
+    if (ordered)
+    {
+        for (let i = 0; i < pg_count; i++)
        {
-            if (new_pgs[i][j] != prev_int_pgs[i][j])
+            for (let j = 0; j < pg_size; j++)
            {
-                osd_differs++;
+                if (new_pgs[i][j] != prev_int_pgs[i][j])
+                {
+                    osd_differs++;
+                }
+            }
+        }
+    }
+    else
+    {
+        for (let i = 0; i < pg_count; i++)
+        {
+            const old_map = prev_int_pgs[i].reduce((a, c) => { a[c] = (a[c]|0) + 1; return a; }, {});
+            for (let j = 0; j < pg_size; j++)
+            {
+                if ((0|old_map[new_pgs[i][j]]) > 0)
+                {
+                    old_map[new_pgs[i][j]]--;
+                }
+                else
+                {
+                    osd_differs++;
+                }
            }
        }
    }
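
The `osd_differs` counting introduced at the end of the hunk above can be sketched in isolation. This is a minimal illustration, not the optimizer's code: `countOsdDiffers` and its arguments are hypothetical names, and it assumes both layouts list the same number of PGs. With `ordered` set, OSDs are compared position by position; otherwise each PG is treated as a multiset of OSD numbers, so a pure reordering counts as no movement.

```javascript
// Count how many OSD slots of the new layout differ from the old one.
// ordered = true:  position inside a PG matters (compare slot by slot).
// ordered = false: a PG is a multiset of OSDs, reordering is free.
// Hypothetical helper for illustration only.
function countOsdDiffers(prevPgs, newPgs, ordered)
{
    let differs = 0;
    for (let i = 0; i < newPgs.length; i++)
    {
        if (ordered)
        {
            for (let j = 0; j < newPgs[i].length; j++)
            {
                if (newPgs[i][j] != prevPgs[i][j])
                    differs++;
            }
        }
        else
        {
            // Remaining occurrences of each OSD in the old PG
            const oldMap = {};
            for (const osd of prevPgs[i])
                oldMap[osd] = (oldMap[osd] || 0) + 1;
            for (const osd of newPgs[i])
            {
                if (oldMap[osd] > 0)
                    oldMap[osd]--;
                else
                    differs++;
            }
        }
    }
    return differs;
}

const prev = [ [ 1, 2, 3 ], [ 2, 3, 1 ] ];
const next = [ [ 3, 1, 2 ], [ 2, 3, 4 ] ];
console.log(countOsdDiffers(prev, next, true));  // 4
console.log(countOsdDiffers(prev, next, false)); // 1
```

In the unordered case, shrinking a PG from `[1, 2, 3]` to `[1, 2]` yields zero differences, which is consistent with the `osd_differs == 0` assertion in the updated test for replicated pools.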
--- a/mon/mon.js
+++ b/mon/mon.js
@@ -83,8 +83,13 @@ const etcd_tree = {
            osd_idle_timeout: 5, // seconds. min: 1
            osd_ping_timeout: 5, // seconds. min: 1
            up_wait_retry_interval: 500, // ms. min: 50
+            max_etcd_attempts: 5,
+            etcd_quick_timeout: 1000, // ms
+            etcd_slow_timeout: 5000, // ms
+            etcd_keepalive_timeout: 30, // seconds, default is max(30, etcd_report_interval*2)
+            etcd_ws_keepalive_interval: 30, // seconds
            // osd
-            etcd_report_interval: 5,
+            etcd_report_interval: 5, // seconds
            run_primary: true,
            osd_network: null, // "192.168.7.0/24" or an array of masks
            bind_address: "0.0.0.0",
@@ -99,6 +104,7 @@ const etcd_tree = {
            no_rebalance: false,
            print_stats_interval: 3,
            slow_log_interval: 10,
+            osd_memlock: false,
            // blockstore - fixed in superblock
            block_size,
            disk_alignment,
@@ -125,6 +131,11 @@ const etcd_tree = {
            inmemory_journal,
            journal_sector_buffer_count,
            journal_no_same_sector_overwrites,
+            throttle_small_writes: false,
+            throttle_target_iops: 100,
+            throttle_target_mbs: 100,
+            throttle_target_parallelism: 1,
+            throttle_threshold_us: 50,
        }, */
        global: {},
        /* node_placement: {
@@ -341,6 +352,9 @@ class Mon
        this.etcd_start_timeout = (config.etcd_start_timeout || 5) * 1000;
        this.state = JSON.parse(JSON.stringify(this.constructor.etcd_tree));
        this.signals_set = false;
+        this.ws = null;
+        this.ws_alive = false;
+        this.ws_keepalive_timer = null;
        this.on_stop_cb = () => this.on_stop(0).catch(console.error);
    }

@@ -383,7 +397,7 @@ class Mon
        for (const pool_id in this.state.config.pools)
        {
            if (!this.state.pool.stats[pool_id] ||
-                !this.state.pool.stats[pool_id].pg_real_size)
+                !Number(this.state.pool.stats[pool_id].pg_real_size))
            {
                // Generate missing data in etcd
                this.state.config.pgs.hash = null;
@@ -461,8 +475,20 @@ class Mon

    restart_watcher(cur_addr)
    {
+        if (this.ws)
+        {
+            this.ws.close();
+            this.ws = null;
+        }
+        if (this.ws_keepalive_timer)
+        {
+            clearInterval(this.ws_keepalive_timer);
+            this.ws_keepalive_timer = null;
+        }
        if (this.selected_etcd_url == cur_addr)
+        {
            this.selected_etcd_url = null;
+        }
        this.start_watcher(this.config.etcd_mon_retries).catch(this.die);
    }

@@ -482,6 +508,7 @@ class Mon
                const timer_id = setTimeout(() =>
                {
                    this.ws.close();
+                    this.ws = null;
                    ok(false);
                }, this.config.etcd_mon_timeout);
                this.ws = new WebSocket(base+'/watch');
@@ -510,6 +537,20 @@ class Mon
            this.die('Failed to open etcd watch websocket');
        }
        const cur_addr = this.selected_etcd_url;
+        this.ws_alive = true;
+        this.ws_keepalive_timer = setInterval(() =>
+        {
+            if (this.ws_alive)
+            {
+                this.ws_alive = false;
+                this.ws.send(JSON.stringify({ progress_request: {} }));
+            }
+            else
+            {
+                console.log('etcd websocket timed out, restarting it');
+                this.restart_watcher(cur_addr);
+            }
+        }, (Number(this.config.etcd_ws_keepalive_interval) || 30)*1000);
        this.ws.on('error', () => this.restart_watcher(cur_addr));
        this.ws.send(JSON.stringify({
            create_request: {
@@ -522,6 +563,7 @@ class Mon
        }));
        this.ws.on('message', (msg) =>
        {
+            this.ws_alive = true;
            let data;
            try
            {
@@ -558,7 +600,7 @@ class Mon
                    console.log('Revision '+data.result.header.revision+' events: ');
                }
                this.etcd_watch_revision = BigInt(data.result.header.revision)+BigInt(1);
-                for (const e of data.result.events)
+                for (const e of data.result.events||[])
                {
                    this.parse_kv(e.kv);
                    const key = e.kv.key.substr(this.etcd_prefix.length);
@@ -709,10 +751,13 @@ class Mon
        for (const node_id in this.state.config.node_placement||{})
        {
            const node_cfg = this.state.config.node_placement[node_id];
-            if (!node_id || /^\d/.exec(node_id) ||
-                !node_cfg.level || !levels[node_cfg.level])
+            if (/^\d+$/.exec(node_id))
            {
-                // All nodes must have non-empty non-numeric IDs and valid levels
+                node_cfg.level = 'osd';
+            }
+            if (!node_id || !node_cfg.level || !levels[node_cfg.level])
+            {
+                // All nodes must have non-empty IDs and valid levels
                continue;
            }
            tree[node_id] = { id: node_id, level: node_cfg.level, parent: node_cfg.parent, children: [] };
@@ -745,10 +790,10 @@ class Mon
                        .reduce((a, c) => { a[c] = true; return a; }, {});
                }
                delete tree[osd_num].children;
-                if (!tree[tree[osd_num].parent])
+                if (!tree[stat.host])
                {
-                    tree[tree[osd_num].parent] = {
-                        id: tree[osd_num].parent,
+                    tree[stat.host] = {
+                        id: stat.host,
                        level: 'host',
                        parent: null,
                        children: [],
@@ -1094,7 +1139,7 @@ class Mon
                    pg_size: pool_cfg.pg_size,
                    pg_minsize: pool_cfg.pg_minsize,
                    max_combinations: pool_cfg.max_osd_combinations,
-                    round_robin: pool_cfg.scheme != 'replicated',
+                    ordered: pool_cfg.scheme != 'replicated',
                };
                let optimize_result;
                if (old_pg_count > 0)
@@ -1117,10 +1162,6 @@ class Mon
                        {
                            pg.push(0);
                        }
-                        while (pg.length > pool_cfg.pg_size)
-                        {
-                            pg.pop();
-                        }
                    }
                    if (!this.state.config.pgs.hash)
                    {
@@ -1156,8 +1197,8 @@ class Mon
                this.state.pool.stats[pool_id] = {
                    used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
                    total_raw_tb: optimize_result.space,
-                    pg_real_size: pg_effsize,
-                    raw_to_usable: pg_effsize / (pool_cfg.scheme === 'replicated'
+                    pg_real_size: pg_effsize || pool_cfg.pg_size,
+                    raw_to_usable: (pg_effsize || pool_cfg.pg_size) / (pool_cfg.scheme === 'replicated'
                        ? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0))),
                    space_efficiency: optimize_result.space/(optimize_result.total_space||1),
                };
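
The websocket keepalive added in `start_watcher` above follows a simple flag-based watchdog pattern: every received message marks the connection alive, and each timer tick either sends a probe or, if nothing arrived since the previous tick, restarts the watcher. The sketch below isolates that logic without a real socket or timer. `KeepaliveWatchdog`, `sendPing` and `restart` are hypothetical names used only for illustration.

```javascript
// Flag-based keepalive watchdog, decoupled from the websocket and the timer.
// sendPing and restart are hypothetical callbacks supplied by the caller.
class KeepaliveWatchdog
{
    constructor(sendPing, restart)
    {
        this.alive = true;
        this.sendPing = sendPing;
        this.restart = restart;
    }

    // Call on every message received from the peer
    onMessage()
    {
        this.alive = true;
    }

    // Call once per keepalive interval
    tick()
    {
        if (this.alive)
        {
            // Something arrived since the last tick: probe again
            this.alive = false;
            this.sendPing();
        }
        else
        {
            // No traffic for a whole interval: treat the connection as dead
            this.restart();
        }
    }
}

const log = [];
const wd = new KeepaliveWatchdog(() => log.push('ping'), () => log.push('restart'));
wd.tick();
wd.onMessage();
wd.tick();
wd.tick();
console.log(log.join(',')); // ping,ping,restart
```

With this scheme a dead connection is detected after at most two intervals, at the cost of one extra request per interval on a healthy connection.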
--- a/mon/test-optimize-simple.js
+++ b/mon/test-optimize-simple.js
@@ -5,21 +5,45 @@ const LPOptimizer = require('./lp-optimizer.js');

 async function run()
 {
-    const osd_tree = { a: { 1: 1 }, b: { 2: 1 }, c: { 3: 1 } };
+    const osd_tree = {
+        100: { 1: 1 },
+        200: { 2: 1 },
+        300: { 3: 1 },
+    };
+
    let res;

    console.log('16 PGs, size=3');
-    res = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 3, pg_count: 16 });
+    res = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 3, pg_count: 16, ordered: false });
    LPOptimizer.print_change_stats(res, false);
-
-    console.log('\nReduce PG size to 2');
-    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs.map(pg => pg.slice(0, 2)), osd_tree, pg_size: 2 });
+    assert(res.space == 3, 'Initial distribution');
+    console.log('\nChange size to 2');
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 2, ordered: false });
    LPOptimizer.print_change_stats(res, false);
-
+    assert(res.space >= 3*14/16 && res.osd_differs == 0, 'Redistribution');
    console.log('\nRemove OSD 3');
-    delete osd_tree['c'];
-    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 2 });
+    const no3_tree = { ...osd_tree };
+    delete no3_tree['300'];
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: no3_tree, pg_size: 2, ordered: false });
    LPOptimizer.print_change_stats(res, false);
+    assert(res.space == 2, 'Redistribution after OSD removal');
+
+    console.log('\n16 PGs, size=3, ordered');
+    res = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 3, pg_count: 16, ordered: true });
+    LPOptimizer.print_change_stats(res, false);
+    assert(res.space == 3, 'Initial distribution');
+    console.log('\nChange size to 2, ordered');
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 2, ordered: true });
+    LPOptimizer.print_change_stats(res, false);
+    assert(res.space >= 3*14/16 && res.osd_differs < 8, 'Redistribution');
+}
+
+function assert(cond, txt)
+{
+    if (!cond)
+    {
+        throw new Error((txt||'test')+' failed');
+    }
 }

 run().catch(console.error);
--- a/mon/test-optimize-undersized.js
+++ b/mon/test-optimize-undersized.js
@@ -45,30 +45,45 @@ async function run()
    console.log('Empty tree:');
    let res = await LPOptimizer.optimize_initial({ osd_tree: cur_tree, pg_size: 3, pg_count: 256 });
    LPOptimizer.print_change_stats(res, false);
+    assert(res.space == 0);
    console.log('\nAdding 1st failure domain:');
    cur_tree['dom1'] = osd_tree['dom1'];
    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
+    assert(res.space == 12 && res.total_space == 12);
    console.log('\nAdding 2nd failure domain:');
    cur_tree['dom2'] = osd_tree['dom2'];
    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
+    assert(res.space == 24 && res.total_space == 24);
    console.log('\nAdding 3rd failure domain:');
    cur_tree['dom3'] = osd_tree['dom3'];
    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
+    assert(res.space == 36 && res.total_space == 36);
    console.log('\nRemoving 3rd failure domain:');
    delete cur_tree['dom3'];
    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
+    assert(res.space == 24 && res.total_space == 24);
    console.log('\nRemoving 2nd failure domain:');
    delete cur_tree['dom2'];
    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
+    assert(res.space == 12 && res.total_space == 12);
    console.log('\nRemoving 1st failure domain:');
    delete cur_tree['dom1'];
    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
+    assert(res.space == 0);
+}
+
+function assert(cond, txt)
+{
+    if (!cond)
+    {
+        throw new Error((txt||'test')+' failed');
+    }
 }

 run().catch(console.error);
--- a/patches/PVE_VitastorPlugin.pm
+++ b/patches/PVE_VitastorPlugin.pm
@@ -0,0 +1,503 @@
+# Install as /usr/share/perl5/PVE/Storage/Custom/VitastorPlugin.pm
+
+# Proxmox Vitastor Driver
+# Copyright (c) Vitaliy Filippov, 2021+
+# License: VNPL-1.1 or GNU AGPLv3.0
+
+package PVE::Storage::Custom::VitastorPlugin;
+
+use strict;
+use warnings;
+
+use JSON;
+
+use PVE::Storage::Plugin;
+use PVE::Tools qw(run_command);
+
+use base qw(PVE::Storage::Plugin);
+
+sub api
+{
+    # Trick it :)
+    return PVE::Storage->APIVER;
+}
+
+sub run_cli
+{
+    my ($scfg, $cmd, %args) = @_;
+    my $retval;
+    my $stderr = '';
+    my $errmsg = $args{errmsg} ? $args{errmsg}.": " : "vitastor-cli error: ";
+    my $json = delete $args{json};
+    $json = 1 if !defined $json;
+    my $binary = delete $args{binary};
+    $binary = '/usr/bin/vitastor-cli' if !defined $binary;
+    if (!exists($args{errfunc}))
+    {
+        $args{errfunc} = sub
+        {
+            my $line = shift;
+            print STDERR $line;
+            *STDERR->flush();
+            $stderr .= $line;
+        };
+    }
+    if (!exists($args{outfunc}))
+    {
+        $retval = '';
+        $args{outfunc} = sub { $retval .= shift };
+        if ($json)
+        {
+            unshift @$cmd, '--json';
+        }
+    }
+    if ($scfg->{vitastor_etcd_address})
+    {
+        unshift @$cmd, '--etcd_address', $scfg->{vitastor_etcd_address};
+    }
+    if ($scfg->{vitastor_config_path})
+    {
+        unshift @$cmd, '--config_path', $scfg->{vitastor_config_path};
+    }
+    unshift @$cmd, $binary;
+    eval { run_command($cmd, %args); };
+    if (my $err = $@)
+    {
+        die "Error invoking vitastor-cli: $err";
+    }
+    if (defined $retval)
+    {
+        # untaint
+        $retval =~ /^(.*)$/s;
+        if ($json)
+        {
+            eval { $retval = JSON::decode_json($1); };
+            if ($@)
+            {
+                die "vitastor-cli returned bad JSON: $@";
+            }
+        }
+        else
+        {
+            $retval = $1;
+        }
+    }
+    return $retval;
+}
+
+# Configuration
+
+sub type
+{
+    return 'vitastor';
+}
+
+sub plugindata
+{
+    return {
+        content => [ { images => 1, rootdir => 1 }, { images => 1 } ],
+    };
+}
+
+sub properties
+{
+    return {
+        vitastor_etcd_address => {
+            description => 'IP address(es) of etcd.',
+            type => 'string',
+            format => 'pve-storage-portal-dns-list',
+        },
+        vitastor_etcd_prefix => {
+            description => 'Prefix for Vitastor etcd metadata',
+            type => 'string',
+        },
+        vitastor_config_path => {
+            description => 'Path to Vitastor configuration file',
+            type => 'string',
+        },
+        vitastor_prefix => {
+            description => 'Image name prefix',
+            type => 'string',
+        },
+        vitastor_pool => {
+            description => 'Default pool to use for images',
+            type => 'string',
+        },
+        vitastor_nbd => {
+            description => 'Use kernel NBD devices (slower)',
+            type => 'boolean',
+        },
+    };
+}
+
+sub options
+{
+    return {
+        nodes => { optional => 1 },
+        disable => { optional => 1 },
+        vitastor_etcd_address => { optional => 1},
+        vitastor_etcd_prefix => { optional => 1 },
+        vitastor_config_path => { optional => 1 },
+        vitastor_prefix => { optional => 1 },
+        vitastor_pool => {},
+        vitastor_nbd => { optional => 1 },
+    };
+}
+
+# Storage implementation
+
+sub parse_volname
+{
+    my ($class, $volname) = @_;
+    if ($volname =~ m/^((base-(\d+)-\S+)\/)?((?:(base)|(vm))-(\d+)-\S+)$/)
+    {
+        # ($vtype, $name, $vmid, $basename, $basevmid, $isBase, $format)
+        return ('images', $4, $7, $2, $3, $5, 'raw');
+    }
+    die "unable to parse vitastor volume name '$volname'\n";
+}
+
+sub _qemu_option
+{
+    my ($k, $v) = @_;
+    if (defined $v && $v ne "")
+    {
+        $v =~ s/:/\\:/gso;
+        return ":$k=$v";
+    }
+    return "";
+}
+
+sub path
+{
+    my ($class, $scfg, $volname, $storeid, $snapname) = @_;
+    my $prefix = defined $scfg->{vitastor_prefix} ? $scfg->{vitastor_prefix} : 'pve/';
+    my ($vtype, $name, $vmid) = $class->parse_volname($volname);
+    $name .= '@'.$snapname if $snapname;
+    if ($scfg->{vitastor_nbd})
+    {
+        my $mapped = run_cli($scfg, [ 'ls' ], binary => '/usr/bin/vitastor-nbd');
+        my ($kerneldev) = grep { $mapped->{$_}->{image} eq $prefix.$name } keys %$mapped;
+        die "Image not mapped via NBD" if !$kerneldev;
+        return ($kerneldev, $vmid, $vtype);
+    }
+    my $path = "vitastor";
+    $path .= _qemu_option('config_path', $scfg->{vitastor_config_path});
+    # FIXME This is the only exception: etcd_address -> etcd_host for qemu
+    $path .= _qemu_option('etcd_host', $scfg->{vitastor_etcd_address});
+    $path .= _qemu_option('etcd_prefix', $scfg->{vitastor_etcd_prefix});
+    $path .= _qemu_option('image', $prefix.$name);
+    return ($path, $vmid, $vtype);
+}
+
+sub _find_free_diskname
+{
+    my ($class, $storeid, $scfg, $vmid, $fmt, $add_fmt_suffix) = @_;
+    my $list = _process_list($scfg, $storeid, run_cli($scfg, [ 'ls' ]));
+    $list = [ map { $_->{name} } @$list ];
+    return PVE::Storage::Plugin::get_next_vm_diskname($list, $storeid, $vmid, undef, $scfg);
+}
+
+# Used only in "Create Template" and, in fact, converts a VM into a template
+# As a consequence, this is always invoked with the VM powered off
+# So we just rename vm-xxx to base-xxx and make it a readonly base layer
+sub create_base
+{
+    my ($class, $storeid, $scfg, $volname) = @_;
+    my $prefix = defined $scfg->{vitastor_prefix} ? $scfg->{vitastor_prefix} : 'pve/';
+
+    my ($vtype, $name, $vmid, $basename, $basevmid, $isBase) = $class->parse_volname($volname);
+    die "create_base not possible with base image\n" if $isBase;
+
+    my $info = _process_list($scfg, $storeid, run_cli($scfg, [ 'ls', $prefix.$name ]))->[0];
+    die "image $name does not exist\n" if !$info;
+
+    die "volname '$volname' contains wrong information about parent {$info->{parent}} $basename\n"
+        if $basename && (!$info->{parent} || $info->{parent} ne $basename);
+
+    my $newname = $name;
+    $newname =~ s/^vm-/base-/;
+
+    my $newvolname = $basename ? "$basename/$newname" : "$newname";
+    run_cli($scfg, [ 'modify', '--rename', $prefix.$newname, '--readonly', $prefix.$name ], json => 0);
+
+    return $newvolname;
+}
+
+sub clone_image
+{
+    my ($class, $scfg, $storeid, $volname, $vmid, $snapname) = @_;
+    my $prefix = defined $scfg->{vitastor_prefix} ? $scfg->{vitastor_prefix} : 'pve/';
+
+    my $snap = '';
+    $snap = '@'.$snapname if length $snapname;
+
+    my ($vtype, $basename, $basevmid, undef, undef, $isBase) = $class->parse_volname($volname);
+    die "$volname is not a base image and snapname is not provided\n" if !$isBase && !length($snapname);
+
+    my $name = $class->find_free_diskname($storeid, $scfg, $vmid);
+
+    warn "clone $volname: $basename snapname $snap to $name\n";
+
+    my $newvol = "$basename/$name";
+    $newvol = $name if length($snapname);
+
+    run_cli($scfg, [ 'create', '--parent', $prefix.$basename.$snap, $prefix.$name ], json => 0);
+
+    return $newvol;
+}
+
+sub alloc_image
+{
+    # $size is in kilobytes in this method
+    my ($class, $storeid, $scfg, $vmid, $fmt, $name, $size) = @_;
+    my $prefix = defined $scfg->{vitastor_prefix} ? $scfg->{vitastor_prefix} : 'pve/';
+    die "illegal name '$name' - should be 'vm-$vmid-*'\n" if $name && $name !~ m/^vm-$vmid-/;
+    $name = $class->find_free_diskname($storeid, $scfg, $vmid) if !$name;
+    run_cli($scfg, [ 'create', '--size', (int(($size+3)/4)*4).'k', '--pool', $scfg->{vitastor_pool}, $prefix.$name ], json => 0);
+    return $name;
+}
+
+sub free_image
+{
+    my ($class, $storeid, $scfg, $volname, $isBase) = @_;
+    my $prefix = defined $scfg->{vitastor_prefix} ? $scfg->{vitastor_prefix} : 'pve/';
+    my ($vtype, $name, $vmid, undef, undef, undef) = $class->parse_volname($volname);
+    $class->deactivate_volume($storeid, $scfg, $volname);
+    my $full_list = run_cli($scfg, [ 'ls', '-l' ]);
+    my $list = _process_list($scfg, $storeid, $full_list);
+    # Remove image and all its snapshots
+    my $rm_names = {
+        map { ($prefix.$_->{name} => 1) }
+        grep { $_->{name} eq $name || substr($_->{name}, 0, length($name)+1) eq ($name.'@') }
+        @$list
+    };
+    my $children = [ grep { $_->{parent_name} && $rm_names->{$_->{parent_name}} } @$full_list ];
+    die "Image has children: ".join(', ', map {
+        substr($_->{name}, 0, length $prefix) eq $prefix
+            ? substr($_->{name}, length $prefix)
+            : $_->{name}
+    } @$children)."\n" if @$children;
+    my $to_remove = [ grep { $rm_names->{$_->{name}} } @$full_list ];
+    for my $rmi (@$to_remove)
+    {
+        run_cli($scfg, [ 'rm-data', '--pool', $rmi->{pool_id}, '--inode', $rmi->{inode_num} ], json => 0);
+    }
+    for my $rmi (@$to_remove)
+    {
+        run_cli($scfg, [ 'rm', $rmi->{name} ], json => 0);
+    }
+    return undef;
+}
+
+sub _process_list
+{
+    my ($scfg, $storeid, $result) = @_;
+    my $prefix = defined $scfg->{vitastor_prefix} ? $scfg->{vitastor_prefix} : 'pve/';
+    my $list = [];
+    foreach my $el (@$result)
+    {
+        next if !$el->{name} || length($prefix) && substr($el->{name}, 0, length $prefix) ne $prefix;
+        my $name = substr($el->{name}, length $prefix);
+        next if $name =~ /@/;
+        my ($owner) = $name =~ /^(?:vm|base)-(\d+)-/s;
+        next if !defined $owner;
+        my $parent = !defined $el->{parent_name}
+            ? undef
+            : ($prefix eq '' || substr($el->{parent_name}, 0, length $prefix) eq $prefix
+                ? substr($el->{parent_name}, length $prefix) : '');
+        my $volid = $parent && $parent =~ /^(base-\d+-\S+)$/s
+            ? "$storeid:$1/$name" : "$storeid:$name";
+        push @$list, {
+            format => 'raw',
+            volid => $volid,
+            name => $name,
+            size => $el->{size},
+            parent => $parent,
+            vmid => $owner,
+        };
+    }
+    return $list;
+}
+
+sub list_images
+{
+    my ($class, $storeid, $scfg, $vmid, $vollist, $cache) = @_;
+    my $list = _process_list($scfg, $storeid, run_cli($scfg, [ 'ls', '-l' ]));
+    if ($vollist)
+    {
+        my $h = { map { ($_ => 1) } @$vollist };
+        $list = [ grep { $h->{$_->{volid}} } @$list ]
+    }
+    elsif (defined $vmid)
+    {
+        $list = [ grep { $_->{vmid} eq $vmid } @$list ];
+    }
+    return $list;
+}
+
+sub status
+{
+    my ($class, $storeid, $scfg, $cache) = @_;
+    my $stats = [ grep { $_->{name} eq $scfg->{vitastor_pool} } @{ run_cli($scfg, [ 'df' ]) } ]->[0];
+    my $free = $stats ? $stats->{max_available} : 0;
+    my $used = $stats ? $stats->{used_raw}/($stats->{raw_to_usable}||1) : 0;
+    my $total = $free+$used;
+    my $active = $stats ? 1 : 0;
+    return ($total, $free, $used, $active);
+}
+
+sub activate_storage
+{
+    my ($class, $storeid, $scfg, $cache) = @_;
+    return 1;
+}
+
+sub deactivate_storage
+{
+    my ($class, $storeid, $scfg, $cache) = @_;
+    return 1;
+}
+
+sub map_volume
+{
+    my ($class, $storeid, $scfg, $volname, $snapname) = @_;
+    my $prefix = defined $scfg->{vitastor_prefix} ? $scfg->{vitastor_prefix} : 'pve/';
+
+    my ($vtype, $img_name, $vmid) = $class->parse_volname($volname);
+    my $name = $img_name;
+    $name .= '@'.$snapname if $snapname;
+
+    my $mapped = run_cli($scfg, [ 'ls' ], binary => '/usr/bin/vitastor-nbd');
+    my ($kerneldev) = grep { $mapped->{$_}->{image} eq $prefix.$name } keys %$mapped;
+    return $kerneldev if $kerneldev && -b $kerneldev; # already mapped
+
+    $kerneldev = run_cli($scfg, [ 'map', '--image', $prefix.$name ], binary => '/usr/bin/vitastor-nbd', json => 0);
+    return $kerneldev;
+}
+
+sub unmap_volume
+{
+    my ($class, $storeid, $scfg, $volname, $snapname) = @_;
+    my $prefix = defined $scfg->{vitastor_prefix} ? $scfg->{vitastor_prefix} : 'pve/';
+
+    return 1 if !$scfg->{vitastor_nbd};
+
+    my ($vtype, $name, $vmid) = $class->parse_volname($volname);
+    $name .= '@'.$snapname if $snapname;
+
+    my $mapped = run_cli($scfg, [ 'ls' ], binary => '/usr/bin/vitastor-nbd');
+    my ($kerneldev) = grep { $mapped->{$_}->{image} eq $prefix.$name } keys %$mapped;
+    if ($kerneldev && -b $kerneldev)
+    {
+        run_cli($scfg, [ 'unmap', $kerneldev ], binary => '/usr/bin/vitastor-nbd', json => 0);
+    }
+
+    return 1;
+}
+
+sub activate_volume
+{
+    my ($class, $storeid, $scfg, $volname, $snapname, $cache) = @_;
+    $class->map_volume($storeid, $scfg, $volname, $snapname) if $scfg->{vitastor_nbd};
+    return 1;
+}
+
+sub deactivate_volume
+{
+    my ($class, $storeid, $scfg, $volname, $snapname, $cache) = @_;
+    $class->unmap_volume($storeid, $scfg, $volname, $snapname);
+    return 1;
+}
+
+sub volume_size_info
+{
+    my ($class, $scfg, $storeid, $volname, $timeout) = @_;
+    my $prefix = defined $scfg->{vitastor_prefix} ? $scfg->{vitastor_prefix} : 'pve/';
+    my ($vtype, $name, $vmid) = $class->parse_volname($volname);
+    my $info = _process_list($scfg, $storeid, run_cli($scfg, [ 'ls', $prefix.$name ]))->[0];
+    #return wantarray ? ($size, $format, $used, $parent, $st->ctime) : $size;
+    return $info->{size};
+}
+
+sub volume_resize
+{
+    my ($class, $scfg, $storeid, $volname, $size, $running) = @_;
+    my $prefix = defined $scfg->{vitastor_prefix} ? $scfg->{vitastor_prefix} : 'pve/';
+    my ($vtype, $name, $vmid) = $class->parse_volname($volname);
+    # $size is in bytes in this method
+    run_cli($scfg, [ 'modify', '--resize', (int(($size+4095)/4096)*4).'k', $prefix.$name ], json => 0);
+    return undef;
+}
+
+sub volume_snapshot
+{
+    my ($class, $scfg, $storeid, $volname, $snap) = @_;
+    my $prefix = defined $scfg->{vitastor_prefix} ? $scfg->{vitastor_prefix} : 'pve/';
+    my ($vtype, $name, $vmid) = $class->parse_volname($volname);
+    run_cli($scfg, [ 'create', '--snapshot', $snap, $prefix.$name ], json => 0);
+    return undef;
+}
+
+sub volume_snapshot_rollback
+{
+    my ($class, $scfg, $storeid, $volname, $snap) = @_;
+    my $prefix = defined $scfg->{vitastor_prefix} ? $scfg->{vitastor_prefix} : 'pve/';
+    my ($vtype, $name, $vmid) = $class->parse_volname($volname);
+    run_cli($scfg, [ 'rm', $prefix.$name ], json => 0);
+    run_cli($scfg, [ 'create', '--parent', $prefix.$name.'@'.$snap, $prefix.$name ], json => 0);
+    return undef;
+}
+
+sub volume_snapshot_delete
+{
+    my ($class, $scfg, $storeid, $volname, $snap, $running) = @_;
+    my $prefix = defined $scfg->{vitastor_prefix} ? $scfg->{vitastor_prefix} : 'pve/';
+    my ($vtype, $name, $vmid) = $class->parse_volname($volname);
+    run_cli($scfg, [ 'rm', $prefix.$name.'@'.$snap ], json => 0);
+    return undef;
+}
+
+sub volume_snapshot_needs_fsfreeze
+{
+    return 1;
+}
+
+sub volume_has_feature
+{
+    my ($class, $scfg, $feature, $storeid, $volname, $snapname, $running) = @_;
+    my $features = {
+        snapshot => { current => 1, snap => 1 },
+        clone => { base => 1, snap => 1 },
+        template => { current => 1 },
+        copy => { base => 1, current => 1, snap => 1 },
+        sparseinit => { base => 1, current => 1 },
+        rename => { current => 1 },
+    };
+    my ($vtype, $name, $vmid, $basename, $basevmid, $isBase) = $class->parse_volname($volname);
+    my $key = undef;
+    if ($snapname)
+    {
+        $key = 'snap';
+    }
+    else
+    {
+        $key = $isBase ? 'base' : 'current';
+    }
+    return 1 if $features->{$feature}->{$key};
+    return undef;
+}
+
+sub rename_volume
+{
+    my ($class, $scfg, $storeid, $source_volname, $target_vmid, $target_volname) = @_;
+    my $prefix = defined $scfg->{vitastor_prefix} ? $scfg->{vitastor_prefix} : 'pve/';
+    my (undef, $source_image, $source_vmid, $base_name, $base_vmid, undef, $format) =
+        $class->parse_volname($source_volname);
+    $target_volname = $class->find_free_diskname($storeid, $scfg, $target_vmid, $format) if !$target_volname;
+    run_cli($scfg, [ 'modify', '--rename', $prefix.$target_volname, $prefix.$source_image ], json => 0);
+    $base_name = $base_name ? "${base_name}/" : '';
+    return "${storeid}:${base_name}${target_volname}";
+}
+
+1;
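
Reviewer note on `volume_resize` above: the plugin receives the new size in bytes and passes it to `vitastor-cli modify --resize` as a whole number of 4 KiB blocks, written in KiB with a `k` suffix. A minimal Python sketch of the same arithmetic follows. The function name `resize_arg` is hypothetical and exists only for illustration.

```python
def resize_arg(size_bytes: int) -> str:
    """Round a byte count up to whole 4 KiB blocks and express it in KiB.

    Mirrors the Perl expression (int(($size+4095)/4096)*4).'k'.
    """
    blocks = (size_bytes + 4095) // 4096  # ceiling division by 4096
    return f"{blocks * 4}k"               # 4 KiB per block, 'k' suffix


if __name__ == "__main__":
    for size in (1, 4096, 4097, 10 * 1024**3):
        print(size, "->", resize_arg(size))
```

Rounding up rather than down means a resize request never yields an image smaller than the size Proxmox asked for.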
--- a/patches/cinder-vitastor.py
+++ b/patches/cinder-vitastor.py
@@ -50,7 +50,7 @@ from cinder.volume import configuration
 from cinder.volume import driver
 from cinder.volume import volume_utils

-VERSION = '0.6.9'
+VERSION = '0.6.13'

 LOG = logging.getLogger(__name__)

--- a/patches/nova-21.diff
+++ b/patches/nova-21.diff
@@ -0,0 +1,288 @@
+diff --git a/nova/virt/image/model.py b/nova/virt/image/model.py
+index 971f7e9c07..ec3fca72cb 100644
+--- a/nova/virt/image/model.py
+++ b/nova/virt/image/model.py
+@@ -129,3 +129,22 @@ class RBDImage(Image):
+         self.user = user
+         self.password = password
+         self.servers = servers
+
+
+class VitastorImage(Image):
+    """Class for images in a remote Vitastor cluster"""
+
+    def __init__(self, name, etcd_address = None, etcd_prefix = None, config_path = None):
+        """Create a new Vitastor image object
+
+        :param name: name of the image
+        :param etcd_address: etcd URL(s) (optional)
+        :param etcd_prefix: etcd prefix (optional)
+        :param config_path: path to the configuration (optional)
+        """
+        super(VitastorImage, self).__init__(FORMAT_RAW)
+
+        self.name = name
+        self.etcd_address = etcd_address
+        self.etcd_prefix = etcd_prefix
+        self.config_path = config_path
+diff --git a/nova/virt/images.py b/nova/virt/images.py
+index 5358f3766a..ebe3d6effb 100644
+--- a/nova/virt/images.py
+++ b/nova/virt/images.py
+@@ -41,7 +41,7 @@ IMAGE_API = glance.API()
+ 
+ def qemu_img_info(path, format=None):
+     """Return an object containing the parsed output from qemu-img info."""
+-    if not os.path.exists(path) and not path.startswith('rbd:'):
+    if not os.path.exists(path) and not path.startswith('rbd:') and not path.startswith('vitastor:'):
+         raise exception.DiskNotFound(location=path)
+ 
+     info = nova.privsep.qemu.unprivileged_qemu_img_info(path, format=format)
+@@ -50,7 +50,7 @@ def qemu_img_info(path, format=None):
+ 
+ def privileged_qemu_img_info(path, format=None, output_format='json'):
+     """Return an object containing the parsed output from qemu-img info."""
+-    if not os.path.exists(path) and not path.startswith('rbd:'):
+    if not os.path.exists(path) and not path.startswith('rbd:') and not path.startswith('vitastor:'):
+         raise exception.DiskNotFound(location=path)
+ 
+     info = nova.privsep.qemu.privileged_qemu_img_info(path, format=format)
+diff --git a/nova/virt/libvirt/config.py b/nova/virt/libvirt/config.py
+index ea525648b3..d7aa798954 100644
+--- a/nova/virt/libvirt/config.py
+++ b/nova/virt/libvirt/config.py
+@@ -1005,6 +1005,8 @@ class LibvirtConfigGuestDisk(LibvirtConfigGuestDevice):
+         self.driver_iommu = False
+         self.source_path = None
+         self.source_protocol = None
+        self.source_query = None
+        self.source_config = None
+         self.source_name = None
+         self.source_hosts = []
+         self.source_ports = []
+@@ -1133,6 +1135,10 @@ class LibvirtConfigGuestDisk(LibvirtConfigGuestDevice):
+             source = etree.Element("source", protocol=self.source_protocol)
+             if self.source_name is not None:
+                 source.set('name', self.source_name)
+            if self.source_query is not None:
+                source.set('query', self.source_query)
+            if self.source_config is not None:
+                source.append(etree.Element('config', file=self.source_config))
+             hosts_info = zip(self.source_hosts, self.source_ports)
+             for name, port in hosts_info:
+                 host = etree.Element('host', name=name)
+diff --git a/nova/virt/libvirt/driver.py b/nova/virt/libvirt/driver.py
+index fbd033690a..74dc59ce87 100644
+--- a/nova/virt/libvirt/driver.py
+++ b/nova/virt/libvirt/driver.py
+@@ -180,6 +180,7 @@ libvirt_volume_drivers = [
+     'local=nova.virt.libvirt.volume.volume.LibvirtVolumeDriver',
+     'fake=nova.virt.libvirt.volume.volume.LibvirtFakeVolumeDriver',
+     'rbd=nova.virt.libvirt.volume.net.LibvirtNetVolumeDriver',
+    'vitastor=nova.virt.libvirt.volume.vitastor.LibvirtVitastorVolumeDriver',
+     'nfs=nova.virt.libvirt.volume.nfs.LibvirtNFSVolumeDriver',
+     'smbfs=nova.virt.libvirt.volume.smbfs.LibvirtSMBFSVolumeDriver',
+     'fibre_channel='
+@@ -287,10 +288,10 @@ class LibvirtDriver(driver.ComputeDriver):
+         # This prevents the risk of one test setting a capability
+         # which bleeds over into other tests.
+ 
+-        # LVM and RBD require raw images. If we are not configured to
+        # LVM, RBD and Vitastor require raw images. If we are not configured to
+         # force convert images into raw format, then we _require_ raw
+         # images only.
+-        raw_only = ('rbd', 'lvm')
+        raw_only = ('rbd', 'lvm', 'vitastor')
+         requires_raw_image = (CONF.libvirt.images_type in raw_only and
+                               not CONF.force_raw_images)
+         requires_ploop_image = CONF.libvirt.virt_type == 'parallels'
+@@ -703,12 +704,12 @@ class LibvirtDriver(driver.ComputeDriver):
+         # Some imagebackends are only able to import raw disk images,
+         # and will fail if given any other format. See the bug
+         # https://bugs.launchpad.net/nova/+bug/1816686 for more details.
+-        if CONF.libvirt.images_type in ('rbd',):
+        if CONF.libvirt.images_type in ('rbd', 'vitastor'):
+             if not CONF.force_raw_images:
+                 msg = _("'[DEFAULT]/force_raw_images = False' is not "
+-                        "allowed with '[libvirt]/images_type = rbd'. "
+                        "allowed with '[libvirt]/images_type = rbd' or 'vitastor'. "
+                         "Please check the two configs and if you really "
+-                        "do want to use rbd as images_type, set "
+                        "do want to use rbd or vitastor as images_type, set "
+                         "force_raw_images to True.")
+                 raise exception.InvalidConfiguration(msg)
+ 
+@@ -2165,6 +2166,16 @@ class LibvirtDriver(driver.ComputeDriver):
+                     if connection_info['data'].get('auth_enabled'):
+                         username = connection_info['data']['auth_username']
+                         path = f"rbd:{volume_name}:id={username}"
+                elif connection_info['driver_volume_type'] == 'vitastor':
+                    volume_name = connection_info['data']['name']
+                    path = 'vitastor:image='+volume_name.replace(':', '\\:')
+                    for k in [ 'config_path', 'etcd_address', 'etcd_prefix' ]:
+                        if k in connection_info['data']:
+                            kk = k
+                            if kk == 'etcd_address':
+                                # FIXME use etcd_address in qemu driver
+                                kk = 'etcd_host'
+                            path += ":"+kk.replace('_', '-')+"="+connection_info['data'][k].replace(':', '\\:')
+                 else:
+                     path = 'unknown'
+                     raise exception.DiskNotFound(location='unknown')
+@@ -2440,8 +2451,8 @@ class LibvirtDriver(driver.ComputeDriver):
+ 
+         image_format = CONF.libvirt.snapshot_image_format or source_type
+ 
+-        # NOTE(bfilippov): save lvm and rbd as raw
+-        if image_format == 'lvm' or image_format == 'rbd':
+        # NOTE(bfilippov): save lvm, rbd and vitastor as raw
+        if image_format == 'lvm' or image_format == 'rbd' or image_format == 'vitastor':
+             image_format = 'raw'
+ 
+         metadata = self._create_snapshot_metadata(instance.image_meta,
+@@ -2512,7 +2523,7 @@ class LibvirtDriver(driver.ComputeDriver):
+                               expected_state=task_states.IMAGE_UPLOADING)
+ 
+             # TODO(nic): possibly abstract this out to the root_disk
+-            if source_type == 'rbd' and live_snapshot:
+            if (source_type == 'rbd' or source_type == 'vitastor') and live_snapshot:
+                 # Standard snapshot uses qemu-img convert from RBD which is
+                 # not safe to run with live_snapshot.
+                 live_snapshot = False
+@@ -3715,7 +3726,7 @@ class LibvirtDriver(driver.ComputeDriver):
+         # cleanup rescue volume
+         lvm.remove_volumes([lvmdisk for lvmdisk in self._lvm_disks(instance)
+                                 if lvmdisk.endswith('.rescue')])
+-        if CONF.libvirt.images_type == 'rbd':
+        if CONF.libvirt.images_type == 'rbd' or CONF.libvirt.images_type == 'vitastor':
+             filter_fn = lambda disk: (disk.startswith(instance.uuid) and
+                                       disk.endswith('.rescue'))
+             rbd_utils.RBDDriver().cleanup_volumes(filter_fn)
+@@ -3972,6 +3983,8 @@ class LibvirtDriver(driver.ComputeDriver):
+         # TODO(mikal): there is a bug here if images_type has
+         # changed since creation of the instance, but I am pretty
+         # sure that this bug already exists.
+        if CONF.libvirt.images_type == 'vitastor':
+            return 'vitastor'
+         return 'rbd' if CONF.libvirt.images_type == 'rbd' else 'raw'
+ 
+     @staticmethod
+@@ -4370,10 +4383,10 @@ class LibvirtDriver(driver.ComputeDriver):
+                 finally:
+                     # NOTE(mikal): if the config drive was imported into RBD,
+                     # then we no longer need the local copy
+-                    if CONF.libvirt.images_type == 'rbd':
+                    if CONF.libvirt.images_type == 'rbd' or CONF.libvirt.images_type == 'vitastor':
+                         LOG.info('Deleting local config drive %(path)s '
+-                                 'because it was imported into RBD.',
+-                                 {'path': config_disk_local_path},
+                                 'because it was imported into %(type)s.',
+                                 {'path': config_disk_local_path, 'type': CONF.libvirt.images_type},
+                                  instance=instance)
+                         os.unlink(config_disk_local_path)
+ 
+diff --git a/nova/virt/libvirt/utils.py b/nova/virt/libvirt/utils.py
+index c1dc34daf4..263965912f 100644
+--- a/nova/virt/libvirt/utils.py
+++ b/nova/virt/libvirt/utils.py
+@@ -399,6 +399,10 @@ def find_disk(guest: libvirt_guest.Guest) -> ty.Tuple[str, ty.Optional[str]]:
+             disk_path = disk.source_name
+             if disk_path:
+                 disk_path = 'rbd:' + disk_path
+        elif not disk_path and disk.source_protocol == 'vitastor':
+            disk_path = disk.source_name
+            if disk_path:
+                disk_path = 'vitastor:' + disk_path
+ 
+     if not disk_path:
+         raise RuntimeError(_("Can't retrieve root device path "
+@@ -417,6 +421,8 @@ def get_disk_type_from_path(path: str) -> ty.Optional[str]:
+         return 'lvm'
+     elif path.startswith('rbd:'):
+         return 'rbd'
+    elif path.startswith('vitastor:'):
+        return 'vitastor'
+     elif (os.path.isdir(path) and
+           os.path.exists(os.path.join(path, "DiskDescriptor.xml"))):
+         return 'ploop'
+diff --git a/nova/virt/libvirt/volume/vitastor.py b/nova/virt/libvirt/volume/vitastor.py
+new file mode 100644
+index 0000000000..0256df62c1
+--- /dev/null
+++ b/nova/virt/libvirt/volume/vitastor.py
+@@ -0,0 +1,75 @@
+# Copyright (c) 2021+, Vitaliy Filippov <vitalif@yourcmc.ru>
+#
+#    Licensed under the Apache License, Version 2.0 (the "License"); you may
+#    not use this file except in compliance with the License. You may obtain
+#    a copy of the License at
+#
+#         http://www.apache.org/licenses/LICENSE-2.0
+#
+#    Unless required by applicable law or agreed to in writing, software
+#    distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+#    WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+#    License for the specific language governing permissions and limitations
+#    under the License.
+
+from os_brick import exception as os_brick_exception
+from os_brick import initiator
+from os_brick.initiator import connector
+from oslo_log import log as logging
+
+import nova.conf
+from nova import utils
+from nova.virt.libvirt.volume import volume as libvirt_volume
+
+
+CONF = nova.conf.CONF
+LOG = logging.getLogger(__name__)
+
+
+class LibvirtVitastorVolumeDriver(libvirt_volume.LibvirtBaseVolumeDriver):
+    """Driver to attach Vitastor volumes to libvirt."""
+    def __init__(self, host):
+        super(LibvirtVitastorVolumeDriver, self).__init__(host, is_block_dev=False)
+
+    def connect_volume(self, connection_info, instance):
+        pass
+
+    def disconnect_volume(self, connection_info, instance):
+        pass
+
+    def get_config(self, connection_info, disk_info):
+        """Returns xml for libvirt."""
+        conf = super(LibvirtVitastorVolumeDriver, self).get_config(connection_info, disk_info)
+        conf.source_type = 'network'
+        conf.source_protocol = 'vitastor'
+        conf.source_name = connection_info['data'].get('name')
+        conf.source_query = connection_info['data'].get('etcd_prefix') or None
+        conf.source_config = connection_info['data'].get('config_path') or None
+        conf.source_hosts = []
+        conf.source_ports = []
+        addresses = connection_info['data'].get('etcd_address', '')
+        if addresses:
+            if not isinstance(addresses, list):
+                addresses = addresses.split(',')
+            for addr in addresses:
+                if addr.startswith('https://'):
+                    raise NotImplementedError('Vitastor block driver does not support SSL for etcd communication yet')
+                if addr.startswith('http://'):
+                    addr = addr[7:]
+                addr = addr.rstrip('/')
+                if addr.endswith('/v3'):
+                    addr = addr[0:-3]
+                p = addr.find('/')
+                if p > 0:
+                    raise NotImplementedError('libvirt does not support custom URL paths for Vitastor etcd yet. Use /etc/vitastor/vitastor.conf')
+                p = addr.find(':')
+                port = '2379'
+                if p > 0:
+                    port = addr[p+1:]
+                    addr = addr[0:p]
+                conf.source_hosts.append(addr)
+                conf.source_ports.append(port)
+        return conf
+
+    def extend_volume(self, connection_info, instance, requested_size):
+        raise NotImplementedError
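
Reviewer note on `get_config` above: the driver turns the `etcd_address` value into parallel host and port lists for the libvirt `<source>` element. The sketch below restates that normalisation as a standalone function so it can be exercised without Nova. The name `split_etcd_addresses` and the tuple return shape are assumptions for illustration, and the edge cases are simplified: any `/` left in the address is rejected, and IPv6 literals are not handled.

```python
def split_etcd_addresses(addresses, default_port="2379"):
    """Return a list of (host, port) pairs from a comma-separated string or a list."""
    if not addresses:
        return []
    if not isinstance(addresses, list):
        addresses = addresses.split(",")
    pairs = []
    for addr in addresses:
        if addr.startswith("https://"):
            raise NotImplementedError("SSL for etcd is not supported yet")
        if addr.startswith("http://"):
            addr = addr[len("http://"):]
        addr = addr.rstrip("/")
        if addr.endswith("/v3"):
            addr = addr[:-3]  # drop the default API path
        if "/" in addr:
            raise NotImplementedError("custom URL paths are not supported")
        host, sep, port = addr.partition(":")
        pairs.append((host, port if sep else default_port))
    return pairs


if __name__ == "__main__":
    print(split_etcd_addresses("http://10.0.0.1:2379/v3,10.0.0.2"))
```

Addresses without an explicit port fall back to 2379, the same default the driver uses.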
--- a/patches/pve-qemu-5.1-vitastor.patch
+++ b/patches/pve-qemu-5.1-vitastor.patch
@@ -0,0 +1,175 @@
+Index: pve-qemu-kvm-5.1.0/qapi/block-core.json
+===================================================================
+--- pve-qemu-kvm-5.1.0.orig/qapi/block-core.json
+++ pve-qemu-kvm-5.1.0/qapi/block-core.json
+@@ -3041,7 +3041,7 @@
+             'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
+             'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
+             { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
+-            'sheepdog', 'pbs',
+            'sheepdog', 'pbs', 'vitastor',
+             'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
+ 
+ ##
+@@ -3889,6 +3889,28 @@
+             '*tag': 'str' } }
+ 
+ ##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image:       Image name
+# @inode:       Inode number
+# @pool:        Pool ID
+# @size:        Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host:   etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+  'data': { '*inode': 'uint64',
+            '*pool': 'uint64',
+            '*size': 'uint64',
+            '*image': 'str',
+            '*config-path': 'str',
+            '*etcd-host': 'str',
+            '*etcd-prefix': 'str' } }
+
+##
+ # @ReplicationMode:
+ #
+ # An enumeration of replication modes.
+@@ -4234,6 +4256,7 @@
+       'replication': { 'type': 'BlockdevOptionsReplication',
+                        'if': 'defined(CONFIG_REPLICATION)' },
+       'sheepdog':   'BlockdevOptionsSheepdog',
+      'vitastor':   'BlockdevOptionsVitastor',
+       'ssh':        'BlockdevOptionsSsh',
+       'throttle':   'BlockdevOptionsThrottle',
+       'vdi':        'BlockdevOptionsGenericFormat',
+@@ -4623,6 +4646,17 @@
+             '*cluster-size' :   'size' } }
+ 
+ ##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+  'data': { 'location':         'BlockdevOptionsVitastor',
+            'size':             'size' } }
+
+##
+ # @BlockdevVmdkSubformat:
+ #
+ # Subformat options for VMDK images
+@@ -4884,6 +4918,7 @@
+       'qed':            'BlockdevCreateOptionsQed',
+       'rbd':            'BlockdevCreateOptionsRbd',
+       'sheepdog':       'BlockdevCreateOptionsSheepdog',
+      'vitastor':       'BlockdevCreateOptionsVitastor',
+       'ssh':            'BlockdevCreateOptionsSsh',
+       'vdi':            'BlockdevCreateOptionsVdi',
+       'vhdx':           'BlockdevCreateOptionsVhdx',
+Index: pve-qemu-kvm-5.1.0/configure
+===================================================================
+--- pve-qemu-kvm-5.1.0.orig/configure
+++ pve-qemu-kvm-5.1.0/configure
+@@ -446,6 +446,7 @@ trace_backends="log"
+ trace_file="trace"
+ spice=""
+ rbd=""
+vitastor=""
+ smartcard=""
+ libusb=""
+ usb_redir=""
+@@ -1383,6 +1384,10 @@ for opt do
+   ;;
+   --enable-rbd) rbd="yes"
+   ;;
+  --disable-vitastor) vitastor="no"
+  ;;
+  --enable-vitastor) vitastor="yes"
+  ;;
+   --disable-xfsctl) xfs="no"
+   ;;
+   --enable-xfsctl) xfs="yes"
+@@ -1901,6 +1906,7 @@ disabled with --disable-FEATURE, default
+   vhost-vdpa      vhost-vdpa kernel backend support
+   spice           spice
+   rbd             rados block device (rbd)
+  vitastor        vitastor block device
+   libiscsi        iscsi support
+   libnfs          nfs support
+   smartcard       smartcard support (libcacard)
+@@ -4234,6 +4240,27 @@ EOF
+ fi
+ 
+ ##########################################
+# vitastor probe
+if test "$vitastor" != "no" ; then
+  cat > $TMPC <<EOF
+#include <vitastor_c.h>
+int main(void) {
+  vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+  return 0;
+}
+EOF
+  vitastor_libs="-lvitastor_client"
+  if compile_prog "" "$vitastor_libs" ; then
+    vitastor=yes
+  else
+    if test "$vitastor" = "yes" ; then
+      feature_not_found "vitastor block device" "Install vitastor-client-dev"
+    fi
+    vitastor=no
+  fi
+fi
+
+##########################################
+ # libssh probe
+ if test "$libssh" != "no" ; then
+   if $pkg_config --exists libssh; then
+@@ -6969,6 +6996,7 @@ echo "Trace output file $trace_file-<pid
+ fi
+ echo "spice support     $spice $(echo_version $spice $spice_protocol_version/$spice_server_version)"
+ echo "rbd support       $rbd"
+echo "vitastor support  $vitastor"
+ echo "xfsctl support    $xfs"
+ echo "smartcard support $smartcard"
+ echo "libusb            $libusb"
+@@ -7644,6 +7672,10 @@ if test "$rbd" = "yes" ; then
+   echo "RBD_CFLAGS=$rbd_cflags" >> $config_host_mak
+   echo "RBD_LIBS=$rbd_libs" >> $config_host_mak
+ fi
+if test "$vitastor" = "yes" ; then
+  echo "CONFIG_VITASTOR=y" >> $config_host_mak
+  echo "VITASTOR_LIBS=$vitastor_libs" >> $config_host_mak
+fi
+ 
+ echo "CONFIG_COROUTINE_BACKEND=$coroutine" >> $config_host_mak
+ if test "$coroutine_pool" = "yes" ; then
+Index: pve-qemu-kvm-5.1.0/block/Makefile.objs
+===================================================================
+--- pve-qemu-kvm-5.1.0.orig/block/Makefile.objs
+++ pve-qemu-kvm-5.1.0/block/Makefile.objs
+@@ -32,6 +32,7 @@ block-obj-$(if $(CONFIG_LIBISCSI),y,n) +
+ block-obj-$(CONFIG_LIBNFS) += nfs.o
+ block-obj-$(CONFIG_CURL) += curl.o
+ block-obj-$(CONFIG_RBD) += rbd.o
+block-obj-$(CONFIG_VITASTOR) += vitastor.o
+ block-obj-$(CONFIG_GLUSTERFS) += gluster.o
+ block-obj-$(CONFIG_LIBSSH) += ssh.o
+ block-obj-y += backup-dump.o
+@@ -61,6 +62,8 @@ curl.o-cflags      := $(CURL_CFLAGS)
+ curl.o-libs        := $(CURL_LIBS)
+ rbd.o-cflags       := $(RBD_CFLAGS)
+ rbd.o-libs         := $(RBD_LIBS)
+vitastor.o-cflags  := $(VITASTOR_CFLAGS)
+vitastor.o-libs    := $(VITASTOR_LIBS)
+ gluster.o-cflags   := $(GLUSTERFS_CFLAGS)
+ gluster.o-libs     := $(GLUSTERFS_LIBS)
+ ssh.o-cflags       := $(LIBSSH_CFLAGS)
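
Reviewer note on the `BlockdevOptionsVitastor` struct above: every option in the schema is optional, and `vitastor` is added to the `BlockdevDriver` enum. The sketch below builds a dictionary shaped like a `blockdev-add` argument for a QEMU build that includes this patch and the matching block driver. The helper name `vitastor_blockdev`, the node name and the example values are hypothetical; only the option names come from the schema.

```python
import json

# Option names copied from the BlockdevOptionsVitastor struct in the patch.
VITASTOR_OPTIONS = {
    "inode", "pool", "size", "image",
    "config-path", "etcd-host", "etcd-prefix",
}


def vitastor_blockdev(node_name, **options):
    """Build a blockdev-style dict, accepting Python-style underscore keywords."""
    opts = {key.replace("_", "-"): value for key, value in options.items()}
    unknown = set(opts) - VITASTOR_OPTIONS
    if unknown:
        raise ValueError("unknown Vitastor options: " + ", ".join(sorted(unknown)))
    return {"driver": "vitastor", "node-name": node_name, **opts}


if __name__ == "__main__":
    spec = vitastor_blockdev("drive0", image="testimg", etcd_host="10.0.0.1:2379")
    print(json.dumps(spec, sort_keys=True))
```

Checking keys against the schema's option set catches typos such as `etcd_address` before the request reaches QEMU, which matters here because the Nova patch maps `etcd_address` to `etcd-host` for the same reason.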
--- a/patches/pve-qemu-5.2-vitastor.patch
+++ b/patches/pve-qemu-5.2-vitastor.patch
@@ -0,0 +1,181 @@
+Index: pve-qemu-kvm-5.2.0/qapi/block-core.json
+===================================================================
+--- pve-qemu-kvm-5.2.0.orig/qapi/block-core.json
+++ pve-qemu-kvm-5.2.0/qapi/block-core.json
+@@ -3076,7 +3076,7 @@
+             'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
+             'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
+             { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
+-            'sheepdog', 'pbs',
+            'sheepdog', 'pbs', 'vitastor',
+             'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
+ 
+ ##
+@@ -3924,6 +3924,28 @@
+             '*tag': 'str' } }
+ 
+ ##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image:       Image name
+# @inode:       Inode number
+# @pool:        Pool ID
+# @size:        Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host:   etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+  'data': { '*inode': 'uint64',
+            '*pool': 'uint64',
+            '*size': 'uint64',
+            '*image': 'str',
+            '*config-path': 'str',
+            '*etcd-host': 'str',
+            '*etcd-prefix': 'str' } }
+
+##
+ # @ReplicationMode:
+ #
+ # An enumeration of replication modes.
+@@ -4272,6 +4294,7 @@
+       'replication': { 'type': 'BlockdevOptionsReplication',
+                        'if': 'defined(CONFIG_REPLICATION)' },
+       'sheepdog':   'BlockdevOptionsSheepdog',
+      'vitastor':   'BlockdevOptionsVitastor',
+       'ssh':        'BlockdevOptionsSsh',
+       'throttle':   'BlockdevOptionsThrottle',
+       'vdi':        'BlockdevOptionsGenericFormat',
+@@ -4662,6 +4685,17 @@
+             '*cluster-size' :   'size' } }
+ 
+ ##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+  'data': { 'location':         'BlockdevOptionsVitastor',
+            'size':             'size' } }
+
+##
+ # @BlockdevVmdkSubformat:
+ #
+ # Subformat options for VMDK images
+@@ -4923,6 +4957,7 @@
+       'qed':            'BlockdevCreateOptionsQed',
+       'rbd':            'BlockdevCreateOptionsRbd',
+       'sheepdog':       'BlockdevCreateOptionsSheepdog',
+      'vitastor':       'BlockdevCreateOptionsVitastor',
+       'ssh':            'BlockdevCreateOptionsSsh',
+       'vdi':            'BlockdevCreateOptionsVdi',
+       'vhdx':           'BlockdevCreateOptionsVhdx',
+Index: pve-qemu-kvm-5.2.0/block/meson.build
+===================================================================
+--- pve-qemu-kvm-5.2.0.orig/block/meson.build
+++ pve-qemu-kvm-5.2.0/block/meson.build
+@@ -89,6 +89,7 @@ foreach m : [
+   ['CONFIG_LIBNFS', 'nfs', libnfs, 'nfs.c'],
+   ['CONFIG_LIBSSH', 'ssh', libssh, 'ssh.c'],
+   ['CONFIG_RBD', 'rbd', rbd, 'rbd.c'],
+  ['CONFIG_VITASTOR', 'vitastor', vitastor, 'vitastor.c'],
+ ]
+   if config_host.has_key(m[0])
+     if enable_modules
+Index: pve-qemu-kvm-5.2.0/configure
+===================================================================
+--- pve-qemu-kvm-5.2.0.orig/configure
+++ pve-qemu-kvm-5.2.0/configure
+@@ -372,6 +372,7 @@ trace_backends="log"
+ trace_file="trace"
+ spice=""
+ rbd=""
+vitastor=""
+ smartcard=""
+ u2f="auto"
+ libusb=""
+@@ -1264,6 +1265,10 @@ for opt do
+   ;;
+   --enable-rbd) rbd="yes"
+   ;;
+  --disable-vitastor) vitastor="no"
+  ;;
+  --enable-vitastor) vitastor="yes"
+  ;;
+   --disable-xfsctl) xfs="no"
+   ;;
+   --enable-xfsctl) xfs="yes"
+@@ -1807,6 +1812,7 @@ disabled with --disable-FEATURE, default
+   vhost-vdpa      vhost-vdpa kernel backend support
+   spice           spice
+   rbd             rados block device (rbd)
+  vitastor        vitastor block device
+   libiscsi        iscsi support
+   libnfs          nfs support
+   smartcard       smartcard support (libcacard)
+@@ -3700,6 +3706,27 @@ EOF
+ fi
+ 
+ ##########################################
+# vitastor probe
+if test "$vitastor" != "no" ; then
+  cat > $TMPC <<EOF
+#include <vitastor_c.h>
+int main(void) {
+  vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+  return 0;
+}
+EOF
+  vitastor_libs="-lvitastor_client"
+  if compile_prog "" "$vitastor_libs" ; then
+    vitastor=yes
+  else
+    if test "$vitastor" = "yes" ; then
+      feature_not_found "vitastor block device" "Install vitastor-client-dev"
+    fi
+    vitastor=no
+  fi
+fi
+
+##########################################
+ # libssh probe
+ if test "$libssh" != "no" ; then
+   if $pkg_config --exists libssh; then
+@@ -6437,6 +6464,10 @@ if test "$rbd" = "yes" ; then
+   echo "CONFIG_RBD=y" >> $config_host_mak
+   echo "RBD_LIBS=$rbd_libs" >> $config_host_mak
+ fi
+if test "$vitastor" = "yes" ; then
+  echo "CONFIG_VITASTOR=y" >> $config_host_mak
+  echo "VITASTOR_LIBS=$vitastor_libs" >> $config_host_mak
+fi
+ 
+ echo "CONFIG_COROUTINE_BACKEND=$coroutine" >> $config_host_mak
+ if test "$coroutine_pool" = "yes" ; then
+Index: pve-qemu-kvm-5.2.0/meson.build
+===================================================================
+--- pve-qemu-kvm-5.2.0.orig/meson.build
+++ pve-qemu-kvm-5.2.0/meson.build
+@@ -596,6 +596,10 @@ rbd = not_found
+ if 'CONFIG_RBD' in config_host
+   rbd = declare_dependency(link_args: config_host['RBD_LIBS'].split())
+ endif
+vitastor = not_found
+if 'CONFIG_VITASTOR' in config_host
+  vitastor = declare_dependency(link_args: config_host['VITASTOR_LIBS'].split())
+endif
+ glusterfs = not_found
+ if 'CONFIG_GLUSTERFS' in config_host
+   glusterfs = declare_dependency(compile_args: config_host['GLUSTERFS_CFLAGS'].split(),
+@@ -2151,6 +2155,7 @@ endif
+ # TODO: add back protocol and server version
+ summary_info += {'spice support':     config_host.has_key('CONFIG_SPICE')}
+ summary_info += {'rbd support':       config_host.has_key('CONFIG_RBD')}
+summary_info += {'vitastor support':  config_host.has_key('CONFIG_VITASTOR')}
+ summary_info += {'xfsctl support':    config_host.has_key('CONFIG_XFS')}
+ summary_info += {'smartcard support': config_host.has_key('CONFIG_SMARTCARD')}
+ summary_info += {'U2F support':       u2f.found()}
--- a/patches/pve-qemu-6.1-vitastor.patch
+++ b/patches/pve-qemu-6.1-vitastor.patch
@@ -0,0 +1,188 @@
+Index: pve-qemu-kvm-6.1.0/qapi/block-core.json
+===================================================================
+--- pve-qemu-kvm-6.1.0.orig/qapi/block-core.json
+++ pve-qemu-kvm-6.1.0/qapi/block-core.json
+@@ -3084,7 +3084,7 @@
+             'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
+             { 'name': 'replication', 'if': 'defined(CONFIG_REPLICATION)' },
+             'pbs',
+-            'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
+            'ssh', 'throttle', 'vdi', 'vhdx', 'vitastor', 'vmdk', 'vpc', 'vvfat' ] }
+ 
+ ##
+ # @BlockdevOptionsFile:
+@@ -4020,6 +4020,28 @@
+             '*server': ['InetSocketAddressBase'] } }
+ 
+ ##
+# @BlockdevOptionsVitastor:
+#
+# Driver specific block device options for vitastor
+#
+# @image:       Image name
+# @inode:       Inode number
+# @pool:        Pool ID
+# @size:        Desired image size in bytes
+# @config-path: Path to Vitastor configuration
+# @etcd-host:   etcd connection address(es)
+# @etcd-prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+  'data': { '*inode': 'uint64',
+            '*pool': 'uint64',
+            '*size': 'uint64',
+            '*image': 'str',
+            '*config-path': 'str',
+            '*etcd-host': 'str',
+            '*etcd-prefix': 'str' } }
+
+##
+ # @ReplicationMode:
+ #
+ # An enumeration of replication modes.
+@@ -4392,6 +4414,7 @@
+       'throttle':   'BlockdevOptionsThrottle',
+       'vdi':        'BlockdevOptionsGenericFormat',
+       'vhdx':       'BlockdevOptionsGenericFormat',
+      'vitastor':   'BlockdevOptionsVitastor',
+       'vmdk':       'BlockdevOptionsGenericCOWFormat',
+       'vpc':        'BlockdevOptionsGenericFormat',
+       'vvfat':      'BlockdevOptionsVVFAT'
+@@ -4782,6 +4805,17 @@
+             '*encrypt' :        'RbdEncryptionCreateOptions' } }
+ 
+ ##
+# @BlockdevCreateOptionsVitastor:
+#
+# Driver specific image creation options for Vitastor.
+#
+# @size: Size of the virtual disk in bytes
+##
+{ 'struct': 'BlockdevCreateOptionsVitastor',
+  'data': { 'location':         'BlockdevOptionsVitastor',
+            'size':             'size' } }
+
+##
+ # @BlockdevVmdkSubformat:
+ #
+ # Subformat options for VMDK images
+@@ -4977,6 +5011,7 @@
+       'ssh':            'BlockdevCreateOptionsSsh',
+       'vdi':            'BlockdevCreateOptionsVdi',
+       'vhdx':           'BlockdevCreateOptionsVhdx',
+      'vitastor':       'BlockdevCreateOptionsVitastor',
+       'vmdk':           'BlockdevCreateOptionsVmdk',
+       'vpc':            'BlockdevCreateOptionsVpc'
+   } }
+Index: pve-qemu-kvm-6.1.0/block/meson.build
+===================================================================
+--- pve-qemu-kvm-6.1.0.orig/block/meson.build
+++ pve-qemu-kvm-6.1.0/block/meson.build
+@@ -91,6 +91,7 @@ foreach m : [
+   [libnfs, 'nfs', files('nfs.c')],
+   [libssh, 'ssh', files('ssh.c')],
+   [rbd, 'rbd', files('rbd.c')],
+  [vitastor, 'vitastor', files('vitastor.c')],
+ ]
+   if m[0].found()
+     module_ss = ss.source_set()
+Index: pve-qemu-kvm-6.1.0/configure
+===================================================================
+--- pve-qemu-kvm-6.1.0.orig/configure
+++ pve-qemu-kvm-6.1.0/configure
+@@ -375,6 +375,7 @@ trace_file="trace"
+ spice="$default_feature"
+ spice_protocol="auto"
+ rbd="auto"
+vitastor="auto"
+ smartcard="auto"
+ u2f="auto"
+ libusb="auto"
+@@ -1293,6 +1294,10 @@ for opt do
+   ;;
+   --enable-rbd) rbd="enabled"
+   ;;
+  --disable-vitastor) vitastor="disabled"
+  ;;
+  --enable-vitastor) vitastor="enabled"
+  ;;
+   --disable-xfsctl) xfs="no"
+   ;;
+   --enable-xfsctl) xfs="yes"
+@@ -1921,6 +1926,7 @@ disabled with --disable-FEATURE, default
+   spice           spice
+   spice-protocol  spice-protocol
+   rbd             rados block device (rbd)
+  vitastor        vitastor block device
+   libiscsi        iscsi support
+   libnfs          nfs support
+   smartcard       smartcard support (libcacard)
+@@ -5211,7 +5217,7 @@ if test "$skip_meson" = no; then
+         -Dcapstone=$capstone -Dslirp=$slirp -Dfdt=$fdt -Dbrlapi=$brlapi \
+         -Dcurl=$curl -Dglusterfs=$glusterfs -Dbzip2=$bzip2 -Dlibiscsi=$libiscsi \
+         -Dlibnfs=$libnfs -Diconv=$iconv -Dcurses=$curses -Dlibudev=$libudev\
+-        -Drbd=$rbd -Dlzo=$lzo -Dsnappy=$snappy -Dlzfse=$lzfse -Dlibxml2=$libxml2 \
+        -Drbd=$rbd -Dvitastor=$vitastor -Dlzo=$lzo -Dsnappy=$snappy -Dlzfse=$lzfse -Dlibxml2=$libxml2 \
+         -Dlibdaxctl=$libdaxctl -Dlibpmem=$libpmem -Dlinux_io_uring=$linux_io_uring \
+         -Dgnutls=$gnutls -Dnettle=$nettle -Dgcrypt=$gcrypt -Dauth_pam=$auth_pam \
+         -Dzstd=$zstd -Dseccomp=$seccomp -Dvirtfs=$virtfs -Dcap_ng=$cap_ng \
+Index: pve-qemu-kvm-6.1.0/meson.build
+===================================================================
+--- pve-qemu-kvm-6.1.0.orig/meson.build
+++ pve-qemu-kvm-6.1.0/meson.build
+@@ -729,6 +729,26 @@ if not get_option('rbd').auto() or have_
+   endif
+ endif
+ 
+vitastor = not_found
+if not get_option('vitastor').auto() or have_block
+  libvitastor_client = cc.find_library('vitastor_client', has_headers: ['vitastor_c.h'],
+    required: get_option('vitastor'), kwargs: static_kwargs)
+  if libvitastor_client.found()
+    if cc.links('''
+      #include <vitastor_c.h>
+      int main(void) {
+        vitastor_c_create_qemu(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
+        return 0;
+      }''', dependencies: libvitastor_client)
+      vitastor = declare_dependency(dependencies: libvitastor_client)
+    elif get_option('vitastor').enabled()
+      error('could not link libvitastor_client')
+    else
+      warning('could not link libvitastor_client, disabling')
+    endif
+  endif
+endif
+
+ glusterfs = not_found
+ glusterfs_ftruncate_has_stat = false
+ glusterfs_iocb_has_stat = false
+@@ -1268,6 +1288,7 @@ config_host_data.set('CONFIG_LIBNFS', li
+ config_host_data.set('CONFIG_LINUX_IO_URING', linux_io_uring.found())
+ config_host_data.set('CONFIG_LIBPMEM', libpmem.found())
+ config_host_data.set('CONFIG_RBD', rbd.found())
+config_host_data.set('CONFIG_VITASTOR', vitastor.found())
+ config_host_data.set('CONFIG_SDL', sdl.found())
+ config_host_data.set('CONFIG_SDL_IMAGE', sdl_image.found())
+ config_host_data.set('CONFIG_SECCOMP', seccomp.found())
+@@ -3087,6 +3108,7 @@ summary_info += {'bpf support': libbpf.f
+ # TODO: add back protocol and server version
+ summary_info += {'spice support':     config_host.has_key('CONFIG_SPICE')}
+ summary_info += {'rbd support':       rbd.found()}
+summary_info += {'vitastor support':  vitastor.found()}
+ summary_info += {'xfsctl support':    config_host.has_key('CONFIG_XFS')}
+ summary_info += {'smartcard support': cacard.found()}
+ summary_info += {'U2F support':       u2f.found()}
+Index: pve-qemu-kvm-6.1.0/meson_options.txt
+===================================================================
+--- pve-qemu-kvm-6.1.0.orig/meson_options.txt
+++ pve-qemu-kvm-6.1.0/meson_options.txt
+@@ -102,6 +102,8 @@ option('lzo', type : 'feature', value :
+        description: 'lzo compression support')
+ option('rbd', type : 'feature', value : 'auto',
+        description: 'Ceph block device driver')
+option('vitastor', type : 'feature', value : 'auto',
+       description: 'Vitastor block device driver')
+ option('gtk', type : 'feature', value : 'auto',
+        description: 'GTK+ user interface')
+ option('sdl', type : 'feature', value : 'auto',
--- a/patches/qemu-make-patches.sh
+++ b/patches/qemu-make-patches.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+# QEMU patches don't include the `block/vitastor.c` file, to avoid duplicating it in the source tree.
+# Run this script to append its creation to all QEMU patches.
+
+DIR=$(dirname $0)
+for i in "$DIR"/qemu-*-vitastor.patch "$DIR"/pve-qemu-*-vitastor.patch; do
+    if ! grep -qP '^\+\+\+ .*block/vitastor\.c' $i; then
+        echo 'Index: a/block/vitastor.c' >> $i
+        echo '===================================================================' >> $i
+        echo '--- /dev/null' >> $i
+        echo '+++ a/block/vitastor.c' >> $i
+        echo '@@ -0,0 +1,'$(wc -l < "$DIR"/../src/qemu_driver.c)' @@' >> $i
+        cat "$DIR"/../src/qemu_driver.c | sed 's/^/+/' >> $i
+    fi
+done
--- a/rpm/build-tarball.sh
+++ b/rpm/build-tarball.sh
@@ -25,4 +25,4 @@ rm fio
 mv fio-copy fio
 FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
 perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
-tar --transform 's#^#vitastor-0.6.9/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.6.9$(rpm --eval '%dist').tar.gz *
+tar --transform 's#^#vitastor-0.6.13/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.6.13$(rpm --eval '%dist').tar.gz *
--- a/rpm/vitastor-el7.Dockerfile
+++ b/rpm/vitastor-el7.Dockerfile
@@ -34,7 +34,7 @@ ADD . /root/vitastor
 RUN set -e; \
    cd /root/vitastor/rpm; \
    sh build-tarball.sh; \
-    cp /root/vitastor-0.6.9.el7.tar.gz ~/rpmbuild/SOURCES; \
+    cp /root/vitastor-0.6.13.el7.tar.gz ~/rpmbuild/SOURCES; \
    cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
    cd ~/rpmbuild/SPECS/; \
    rpmbuild -ba vitastor.spec; \
--- a/rpm/vitastor-el7.spec
+++ b/rpm/vitastor-el7.spec
@@ -1,11 +1,11 @@
 Name:           vitastor
-Version:        0.6.9
+Version:        0.6.13
 Release:        1%{?dist}
 Summary:        Vitastor, a fast software-defined clustered block storage

 License:        Vitastor Network Public License 1.1
 URL:            https://vitastor.io/
-Source0:        vitastor-0.6.9.el7.tar.gz
+Source0:        vitastor-0.6.13.el7.tar.gz

 BuildRequires:  liburing-devel >= 0.6
 BuildRequires:  gperftools-devel
--- a/rpm/vitastor-el8.Dockerfile
+++ b/rpm/vitastor-el8.Dockerfile
@@ -33,7 +33,7 @@ ADD . /root/vitastor
 RUN set -e; \
    cd /root/vitastor/rpm; \
    sh build-tarball.sh; \
-    cp /root/vitastor-0.6.9.el8.tar.gz ~/rpmbuild/SOURCES; \
+    cp /root/vitastor-0.6.13.el8.tar.gz ~/rpmbuild/SOURCES; \
    cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
    cd ~/rpmbuild/SPECS/; \
    rpmbuild -ba vitastor.spec; \
--- a/rpm/vitastor-el8.spec
+++ b/rpm/vitastor-el8.spec
@@ -1,11 +1,11 @@
 Name:           vitastor
-Version:        0.6.9
+Version:        0.6.13
 Release:        1%{?dist}
 Summary:        Vitastor, a fast software-defined clustered block storage

 License:        Vitastor Network Public License 1.1
 URL:            https://vitastor.io/
-Source0:        vitastor-0.6.9.el8.tar.gz
+Source0:        vitastor-0.6.13.el8.tar.gz

 BuildRequires:  liburing-devel >= 0.6
 BuildRequires:  gperftools-devel
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -15,7 +15,7 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
 	set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
 endif()

-add_definitions(-DVERSION="0.6.9")
+add_definitions(-DVERSION="0.6.13")
 add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -I ${CMAKE_SOURCE_DIR}/src)
 if (${WITH_ASAN})
 	add_definitions(-fsanitize=address -fno-omit-frame-pointer)
@@ -88,8 +88,8 @@ if (IBVERBS_LIBRARIES)
 	set(MSGR_RDMA "msgr_rdma.cpp")
 endif (IBVERBS_LIBRARIES)
 add_library(vitastor_common STATIC
-	epoll_manager.cpp etcd_state_client.cpp
-	messenger.cpp msgr_stop.cpp msgr_op.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp
+	epoll_manager.cpp etcd_state_client.cpp messenger.cpp addr_util.cpp
+	msgr_stop.cpp msgr_op.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp
 	http_client.cpp osd_ops.cpp pg_states.cpp timerfd_manager.cpp base64.cpp ${MSGR_RDMA}
 )
 target_compile_options(vitastor_common PUBLIC -fPIC)
@@ -112,6 +112,7 @@ if (${WITH_FIO})
 	add_library(fio_vitastor_sec SHARED
 		fio_sec_osd.cpp
 		rw_blocking.cpp
+		addr_util.cpp
 	)
 	target_link_libraries(fio_vitastor_sec
 		tcmalloc_minimal
@@ -153,7 +154,7 @@ target_link_libraries(vitastor-nbd

 # vitastor-cli
 add_executable(vitastor-cli
-	cli.cpp cli_alloc_osd.cpp cli_simple_offsets.cpp
+	cli.cpp cli_alloc_osd.cpp cli_simple_offsets.cpp cli_df.cpp
 	cli_ls.cpp cli_create.cpp cli_modify.cpp cli_flatten.cpp cli_merge.cpp cli_rm.cpp cli_snap_rm.cpp
 )
 target_link_libraries(vitastor-cli
@@ -189,11 +190,11 @@ endif (${WITH_QEMU})
 ### Test stubs

 # stub_osd, stub_bench, osd_test
-add_executable(stub_osd stub_osd.cpp rw_blocking.cpp)
+add_executable(stub_osd stub_osd.cpp rw_blocking.cpp addr_util.cpp)
 target_link_libraries(stub_osd tcmalloc_minimal)
-add_executable(stub_bench stub_bench.cpp rw_blocking.cpp)
+add_executable(stub_bench stub_bench.cpp rw_blocking.cpp addr_util.cpp)
 target_link_libraries(stub_bench tcmalloc_minimal)
-add_executable(osd_test osd_test.cpp rw_blocking.cpp)
+add_executable(osd_test osd_test.cpp rw_blocking.cpp addr_util.cpp)
 target_link_libraries(osd_test tcmalloc_minimal)

 # osd_rmw_test
--- a/src/addr_util.cpp
+++ b/src/addr_util.cpp
@@ -0,0 +1,188 @@
+#include <arpa/inet.h>
+#include <net/if.h>
+#include <sys/types.h>
+#include <ifaddrs.h>
+#include <string.h>
+#include <stdio.h>
+
+#include <stdexcept>
+
+#include "addr_util.h"
+
+bool string_to_addr(std::string str, bool parse_port, int default_port, struct sockaddr *addr)
+{
+    if (parse_port)
+    {
+        int p = str.rfind(':');
+        if (p != std::string::npos && !(str.length() > 0 && str[str.length()-1] == ']')) // "[ipv6]" which contains ':'
+        {
+            char null_byte = 0;
+            int n = sscanf(str.c_str()+p+1, "%d%c", &default_port, &null_byte);
+            if (n != 1 || default_port >= 0x10000)
+                return false;
+            str = str.substr(0, p);
+        }
+    }
+    if (inet_pton(AF_INET, str.c_str(), &((struct sockaddr_in*)addr)->sin_addr) == 1)
+    {
+        addr->sa_family = AF_INET;
+        ((struct sockaddr_in*)addr)->sin_port = htons(default_port);
+        return true;
+    }
+    if (str.length() >= 2 && str[0] == '[' && str[str.length()-1] == ']')
+        str = str.substr(1, str.length()-2);
+    if (inet_pton(AF_INET6, str.c_str(), &((struct sockaddr_in6*)addr)->sin6_addr) == 1)
+    {
+        addr->sa_family = AF_INET6;
+        ((struct sockaddr_in6*)addr)->sin6_port = htons(default_port);
+        return true;
+    }
+    return false;
+}
+
+std::string addr_to_string(const sockaddr &addr)
+{
+    char peer_str[256];
+    bool ok = false;
+    int port;
+    if (addr.sa_family == AF_INET)
+    {
+        ok = !!inet_ntop(AF_INET, &((sockaddr_in*)&addr)->sin_addr, peer_str, 256);
+        port = ntohs(((sockaddr_in*)&addr)->sin_port);
+    }
+    else if (addr.sa_family == AF_INET6)
+    {
+        ok = !!inet_ntop(AF_INET6, &((sockaddr_in6*)&addr)->sin6_addr, peer_str, 256);
+        port = ntohs(((sockaddr_in6*)&addr)->sin6_port);
+    }
+    else
+        throw std::runtime_error("Unknown address family "+std::to_string(addr.sa_family));
+    if (!ok)
+        throw std::runtime_error(std::string("inet_ntop: ") + strerror(errno));
+    return std::string(peer_str)+":"+std::to_string(port);
+}
+
+static bool cidr_match(const in_addr &addr, const in_addr &net, uint8_t bits)
+{
+    if (bits == 0)
+    {
+        // C99 6.5.7 (3): u32 << 32 is undefined behaviour
+        return true;
+    }
+    return !((addr.s_addr ^ net.s_addr) & htonl(0xFFFFFFFFu << (32 - bits)));
+}
+
+static bool cidr6_match(const in6_addr &address, const in6_addr &network, uint8_t bits)
+{
+    const uint32_t *a = address.s6_addr32;
+    const uint32_t *n = network.s6_addr32;
+    int bits_whole, bits_incomplete;
+    bits_whole = bits >> 5;         // number of whole u32
+    bits_incomplete = bits & 0x1F;  // number of bits in incomplete u32
+    if (bits_whole && memcmp(a, n, bits_whole << 2))
+        return false;
+    if (bits_incomplete)
+    {
+        uint32_t mask = htonl((0xFFFFFFFFu) << (32 - bits_incomplete));
+        if ((a[bits_whole] ^ n[bits_whole]) & mask)
+            return false;
+    }
+    return true;
+}
+
+struct addr_mask_t
+{
+    sa_family_t family;
+    in_addr ipv4;
+    in6_addr ipv6;
+    uint8_t bits;
+};
+
+std::vector<std::string> getifaddr_list(std::vector<std::string> mask_cfg, bool include_v6)
+{
+    std::vector<addr_mask_t> masks;
+    for (auto mask: mask_cfg)
+    {
+        unsigned bits = 0;
+        int p = mask.find('/');
+        if (p != std::string::npos)
+        {
+            char null_byte = 0;
+            if (sscanf(mask.c_str()+p+1, "%u%c", &bits, &null_byte) != 1 || bits > 128)
+            {
+                throw std::runtime_error((include_v6 ? "Invalid IP address mask: " : "Invalid IPv4 address mask: ") + mask);
+            }
+            mask = mask.substr(0, p);
+        }
+        in_addr ipv4;
+        in6_addr ipv6;
+        if (inet_pton(AF_INET, mask.c_str(), &ipv4) == 1)
+        {
+            if (bits > 32)
+            {
+                throw std::runtime_error((include_v6 ? "Invalid IP address mask: " : "Invalid IPv4 address mask: ") + mask);
+            }
+            masks.push_back((addr_mask_t){ .family = AF_INET, .ipv4 = ipv4, .bits = (uint8_t)bits });
+        }
+        else if (include_v6 && inet_pton(AF_INET6, mask.c_str(), &ipv6) == 1)
+        {
+            masks.push_back((addr_mask_t){ .family = AF_INET6, .ipv6 = ipv6, .bits = (uint8_t)bits });
+        }
+        else
+        {
+            throw std::runtime_error((include_v6 ? "Invalid IP address mask: " : "Invalid IPv4 address mask: ") + mask);
+        }
+    }
+    std::vector<std::string> addresses;
+    ifaddrs *list, *ifa;
+    if (getifaddrs(&list) == -1)
+    {
+        throw std::runtime_error(std::string("getifaddrs: ") + strerror(errno));
+    }
+    for (ifa = list; ifa != NULL; ifa = ifa->ifa_next)
+    {
+        if (!ifa->ifa_addr)
+        {
+            continue;
+        }
+        int family = ifa->ifa_addr->sa_family;
+        if ((family == AF_INET || family == AF_INET6 && include_v6) &&
+            (ifa->ifa_flags & (IFF_UP | IFF_RUNNING | IFF_LOOPBACK)) == (IFF_UP | IFF_RUNNING))
+        {
+            void *addr_ptr;
+            if (family == AF_INET)
+            {
+                addr_ptr = &((sockaddr_in *)ifa->ifa_addr)->sin_addr;
+            }
+            else
+            {
+                addr_ptr = &((sockaddr_in6 *)ifa->ifa_addr)->sin6_addr;
+            }
+            if (masks.size() > 0)
+            {
+                int i;
+                for (i = 0; i < masks.size(); i++)
+                {
+                    if (masks[i].family == family && (family == AF_INET
+                        ? cidr_match(*(in_addr*)addr_ptr, masks[i].ipv4, masks[i].bits)
+                        : cidr6_match(*(in6_addr*)addr_ptr, masks[i].ipv6, masks[i].bits)))
+                    {
+                        break;
+                    }
+                }
+                if (i >= masks.size())
+                {
+                    continue;
+                }
+            }
+            char addr[INET6_ADDRSTRLEN];
+            if (!inet_ntop(family, addr_ptr, addr, INET6_ADDRSTRLEN))
+            {
+                throw std::runtime_error(std::string("inet_ntop: ") + strerror(errno));
+            }
+            addresses.push_back(std::string(addr));
+        }
+    }
+    freeifaddrs(list);
+    return addresses;
+}
--- a/src/addr_util.h
+++ b/src/addr_util.h
@@ -0,0 +1,9 @@
+#pragma once
+
+#include <sys/socket.h>
+#include <string>
+#include <vector>
+
+bool string_to_addr(std::string str, bool parse_port, int default_port, struct sockaddr *addr);
+std::string addr_to_string(const sockaddr &addr);
+std::vector<std::string> getifaddr_list(std::vector<std::string> mask_cfg = std::vector<std::string>(), bool include_v6 = false);
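
Note (illustration only, not part of the commit): the IPv4 mask matching that `getifaddr_list()` uses to filter interface addresses can be tried in isolation with a sketch like the one below. `ipv4_in_cidr` is a hypothetical helper name; it assumes dotted-quad input and follows the same idea as `cidr_match()` in `src/addr_util.cpp`: build the prefix mask in host order, convert it with `htonl()`, and compare the XOR of the two addresses against it.

```cpp
#include <arpa/inet.h>
#include <stdint.h>
#include <string>

// Hypothetical helper (not part of the patch): returns true if <addr>
// lies inside the network <net>/<bits>. Both strings must be dotted-quad IPv4.
static bool ipv4_in_cidr(const std::string &addr, const std::string &net, unsigned bits)
{
    in_addr a, n;
    if (bits > 32 ||
        inet_pton(AF_INET, addr.c_str(), &a) != 1 ||
        inet_pton(AF_INET, net.c_str(), &n) != 1)
    {
        // Malformed input or an impossible prefix length
        return false;
    }
    if (bits == 0)
    {
        // A /0 prefix matches every address; returning early also avoids
        // shifting a 32-bit value by 32 bits, which is undefined behaviour
        return true;
    }
    // s_addr is stored in network byte order, so the mask is built in
    // host order and converted before it is applied
    uint32_t mask = htonl(0xFFFFFFFFu << (32 - bits));
    return ((a.s_addr ^ n.s_addr) & mask) == 0;
}
```

With this sketch, `ipv4_in_cidr("192.168.1.17", "192.168.0.0", 16)` is true and `ipv4_in_cidr("192.169.1.17", "192.168.0.0", 16)` is false. The IPv6 variant in the patch applies the same mask to the last partially covered 32-bit word after comparing the whole words with `memcmp()`.
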
--- a/src/blockstore_flush.cpp
+++ b/src/blockstore_flush.cpp
@@ -185,7 +185,7 @@ void journal_flusher_t::release_trim()
 void journal_flusher_t::dump_diagnostics()
 {
    const char *unflushable_type = "";
-    obj_ver_id unflushable = { 0 };
+    obj_ver_id unflushable = {};
    // Try to find out if there is a flushable object for information
    for (object_id cur_oid: flush_queue)
    {
@@ -486,8 +486,8 @@ resume_1:
        if (bs->clean_entry_bitmap_size)
        {
            new_clean_bitmap = (bs->inmemory_meta
-                ? meta_new.buf + meta_new.pos*bs->clean_entry_size + sizeof(clean_disk_entry)
-                : bs->clean_bitmap + (clean_loc >> bs->block_order)*(2*bs->clean_entry_bitmap_size));
+                ? (uint8_t*)meta_new.buf + meta_new.pos*bs->clean_entry_size + sizeof(clean_disk_entry)
+                : (uint8_t*)bs->clean_bitmap + (clean_loc >> bs->block_order)*(2*bs->clean_entry_bitmap_size));
            if (clean_init_bitmap)
            {
                memset(new_clean_bitmap, 0, bs->clean_entry_bitmap_size);
@@ -533,7 +533,7 @@ resume_1:
                return false;
            }
            // zero out old metadata entry
-            memset(meta_old.buf + meta_old.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
+            memset((uint8_t*)meta_old.buf + meta_old.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
            await_sqe(15);
            data->iov = (struct iovec){ meta_old.buf, bs->meta_block_size };
            data->callback = simple_callback_w;
@@ -544,23 +544,25 @@ resume_1:
        }
        if (has_delete)
        {
-            clean_disk_entry *new_entry = (clean_disk_entry*)(meta_new.buf + meta_new.pos*bs->clean_entry_size);
+            clean_disk_entry *new_entry = (clean_disk_entry*)((uint8_t*)meta_new.buf + meta_new.pos*bs->clean_entry_size);
            if (new_entry->oid.inode != 0 && new_entry->oid != cur.oid)
            {
-                printf("Fatal error (metadata corruption or bug): tried to delete metadata entry %lu (%lx:%lx) while deleting %lx:%lx\n",
-                    clean_loc >> bs->block_order, new_entry->oid.inode, new_entry->oid.stripe, cur.oid.inode, cur.oid.stripe);
+                printf("Fatal error (metadata corruption or bug): tried to delete metadata entry %lu (%lx:%lx v%lu) while deleting %lx:%lx\n",
+                    clean_loc >> bs->block_order, new_entry->oid.inode, new_entry->oid.stripe,
+                    new_entry->version, cur.oid.inode, cur.oid.stripe);
                exit(1);
            }
            // zero out new metadata entry
-            memset(meta_new.buf + meta_new.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
+            memset((uint8_t*)meta_new.buf + meta_new.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
        }
        else
        {
-            clean_disk_entry *new_entry = (clean_disk_entry*)(meta_new.buf + meta_new.pos*bs->clean_entry_size);
+            clean_disk_entry *new_entry = (clean_disk_entry*)((uint8_t*)meta_new.buf + meta_new.pos*bs->clean_entry_size);
            if (new_entry->oid.inode != 0 && new_entry->oid != cur.oid)
            {
-                printf("Fatal error (metadata corruption or bug): tried to overwrite non-zero metadata entry %lu (%lx:%lx) with %lx:%lx\n",
-                    clean_loc >> bs->block_order, new_entry->oid.inode, new_entry->oid.stripe, cur.oid.inode, cur.oid.stripe);
+                printf("Fatal error (metadata corruption or bug): tried to overwrite non-zero metadata entry %lu (%lx:%lx v%lu) with %lx:%lx v%lu\n",
+                    clean_loc >> bs->block_order, new_entry->oid.inode, new_entry->oid.stripe, new_entry->version,
+                    cur.oid.inode, cur.oid.stripe, cur.version);
                exit(1);
            }
            new_entry->oid = cur.oid;
@@ -573,7 +575,7 @@ resume_1:
            if (bs->clean_entry_bitmap_size)
            {
                void *bmp_ptr = bs->clean_entry_bitmap_size > sizeof(void*) ? dirty_end->second.bitmap : &dirty_end->second.bitmap;
-                memcpy((void*)(new_entry+1) + bs->clean_entry_bitmap_size, bmp_ptr, bs->clean_entry_bitmap_size);
+                memcpy((uint8_t*)(new_entry+1) + bs->clean_entry_bitmap_size, bmp_ptr, bs->clean_entry_bitmap_size);
            }
        }
        await_sqe(6);
@@ -760,7 +762,7 @@ bool journal_flusher_co::scan_dirty(int wait_base)
                        if (bs->journal.inmemory)
                        {
                            // Take it from memory
-                            memcpy(it->buf, bs->journal.buffer + submit_offset, submit_len);
+                            memcpy(it->buf, (uint8_t*)bs->journal.buffer + submit_offset, submit_len);
                        }
                        else
                        {
@@ -824,7 +826,7 @@ bool journal_flusher_co::modify_meta_read(uint64_t meta_loc, flusher_meta_write_
    wr.pos = ((meta_loc >> bs->block_order) % (bs->meta_block_size / bs->clean_entry_size));
    if (bs->inmemory_meta)
    {
-        wr.buf = bs->metadata_buffer + wr.sector;
+        wr.buf = (uint8_t*)bs->metadata_buffer + wr.sector;
        return true;
    }
    wr.it = flusher->meta_sectors.find(wr.sector);
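
Note (illustration only, not part of the commit): the `(uint8_t*)` casts added throughout this file replace arithmetic on `void*`. ISO C++ does not define arithmetic on `void*`; GCC accepts it as an extension that treats the pointee size as 1, so casting to a byte pointer keeps the same offsets while staying within the standard. A minimal sketch, assuming a made-up entry layout (`demo_entry_t` and `demo_entry_at` are hypothetical names, not Vitastor types):

```cpp
#include <stddef.h>
#include <stdint.h>

// Illustrative layout only; the real clean_disk_entry is defined elsewhere in Vitastor
struct demo_entry_t
{
    uint64_t inode;
    uint64_t stripe;
    uint64_t version;
};

// Hypothetical helper: address of entry number <pos> inside an untyped buffer
// that holds fixed-size entries of <entry_size> bytes each
static demo_entry_t *demo_entry_at(void *buf, size_t pos, size_t entry_size)
{
    // Cast to a byte pointer first so that the offset is counted in bytes
    return (demo_entry_t*)((uint8_t*)buf + pos*entry_size);
}
```

The expression inside `demo_entry_at` has the same shape as `(clean_disk_entry*)((uint8_t*)meta_new.buf + meta_new.pos*bs->clean_entry_size)` in the hunks above.
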
--- a/src/blockstore_impl.cpp
+++ b/src/blockstore_impl.cpp
@@ -142,7 +142,6 @@ void blockstore_impl_t::loop()
                    continue;
                }
            }
-            unsigned ring_space = ringloop->space_left();
            unsigned prev_sqe_pos = ringloop->save();
            // 0 = can't submit
            // 1 = in progress
@@ -212,7 +211,6 @@ void blockstore_impl_t::loop()
                ringloop->restore(prev_sqe_pos);
                if (PRIV(op)->wait_for == WAIT_SQE)
                {
-                    PRIV(op)->wait_detail = 1 + ring_space;
                    // ring is full, stop submission
                    break;
                }
@@ -235,6 +233,12 @@ void blockstore_impl_t::loop()
        {
            throw std::runtime_error(std::string("io_uring_submit: ") + strerror(-ret));
        }
+        for (auto s: journal.submitting_sectors)
+        {
+            // Mark journal sector writes as submitted
+            journal.sector_info[s].submit_id = 0;
+        }
+        journal.submitting_sectors.clear();
        if ((initial_ring_space - ringloop->space_left()) > 0)
        {
            live = true;
@@ -276,7 +280,7 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
 {
    if (PRIV(op)->wait_for == WAIT_SQE)
    {
-        if (ringloop->space_left() < PRIV(op)->wait_detail)
+        if (ringloop->sqes_left() < PRIV(op)->wait_detail)
        {
            // stop submission if there's still no free space
 #ifdef BLOCKSTORE_DEBUG
@@ -366,7 +370,7 @@ void blockstore_impl_t::enqueue_op(blockstore_op_t *op)
                    };
                }
                unstable_writes.clear();
-                op->callback = [this, old_callback](blockstore_op_t *op)
+                op->callback = [old_callback](blockstore_op_t *op)
                {
                    obj_ver_id *vers = (obj_ver_id*)op->buf;
                    delete[] vers;
--- a/src/blockstore_impl.h
+++ b/src/blockstore_impl.h
@@ -54,6 +54,15 @@
 #define IS_BIG_WRITE(st) (((st) & 0x0F) == BS_ST_BIG_WRITE)
 #define IS_DELETE(st) (((st) & 0x0F) == BS_ST_DELETE)

+#define BS_SUBMIT_CHECK_SQES(n) \
+    if (ringloop->sqes_left() < (n))\
+    {\
+        /* Pause until there are more requests available */\
+        PRIV(op)->wait_detail = (n);\
+        PRIV(op)->wait_for = WAIT_SQE;\
+        return 0;\
+    }
+
 #define BS_SUBMIT_GET_SQE(sqe, data) \
    BS_SUBMIT_GET_ONLY_SQE(sqe); \
    struct ring_data_t *data = ((ring_data_t*)sqe->user_data)
@@ -63,6 +72,7 @@
    if (!sqe)\
    {\
        /* Pause until there are more requests available */\
+        PRIV(op)->wait_detail = 1;\
        PRIV(op)->wait_for = WAIT_SQE;\
        return 0;\
    }
@@ -72,6 +82,7 @@
    if (!sqe)\
    {\
        /* Pause until there are more requests available */\
+        PRIV(op)->wait_detail = 1;\
        PRIV(op)->wait_for = WAIT_SQE;\
        return 0;\
    }
@@ -170,7 +181,7 @@ struct blockstore_op_private_t
    std::vector<fulfill_read_t> read_vec;

    // Sync, write
-    uint64_t min_flushed_journal_sector, max_flushed_journal_sector;
+    int min_flushed_journal_sector, max_flushed_journal_sector;

    // Write
    struct iovec iov_zerofill[3];
@@ -251,6 +262,7 @@ class blockstore_impl_t
    int data_fd;
    uint64_t meta_size, meta_area, meta_len;
    uint64_t data_size, data_len;
+    uint64_t data_device_sect, meta_device_sect, journal_device_sect;

    void *metadata_buffer = NULL;

@@ -271,7 +283,7 @@ class blockstore_impl_t

    friend class blockstore_init_meta;
    friend class blockstore_init_journal;
-    friend class blockstore_journal_check_t;
+    friend struct blockstore_journal_check_t;
    friend class journal_flusher_t;
    friend class journal_flusher_co;

@@ -282,6 +294,10 @@ class blockstore_impl_t
    void open_journal();
    uint8_t* get_clean_entry_bitmap(uint64_t block_loc, int offset);

+    // Journaling
+    void prepare_journal_sector_write(int sector, blockstore_op_t *op);
+    void handle_journal_write(ring_data_t *data, uint64_t flush_id);
+
    // Asynchronous init
    int initialized;
    int metadata_buf_size;
@@ -309,21 +325,18 @@ class blockstore_impl_t

    // Sync
    int continue_sync(blockstore_op_t *op, bool queue_has_in_progress_sync);
-    void handle_sync_event(ring_data_t *data, blockstore_op_t *op);
    void ack_sync(blockstore_op_t *op);

    // Stabilize
    int dequeue_stable(blockstore_op_t *op);
    int continue_stable(blockstore_op_t *op);
    void mark_stable(const obj_ver_id & ov, bool forget_dirty = false);
-    void handle_stable_event(ring_data_t *data, blockstore_op_t *op);
    void stabilize_object(object_id oid, uint64_t max_ver);

    // Rollback
    int dequeue_rollback(blockstore_op_t *op);
    int continue_rollback(blockstore_op_t *op);
    void mark_rolled_back(const obj_ver_id & ov);
-    void handle_rollback_event(ring_data_t *data, blockstore_op_t *op);
    void erase_dirty(blockstore_dirty_db_t::iterator dirty_start, blockstore_dirty_db_t::iterator dirty_end, uint64_t clean_loc);

    // List
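
Note (illustration only, not part of the commit): `BS_SUBMIT_CHECK_SQES(n)` and the `check_wait()` change above follow a "check capacity first, remember how much was needed" pattern. The sketch below reproduces that pattern with stand-in types; `demo_ring_t`, `demo_op_t`, `demo_submit` and `demo_can_retry` are hypothetical names, and the real ring loop and operation structures differ.

```cpp
// All names below are hypothetical stand-ins, not Vitastor types
enum { DEMO_WAIT_NONE = 0, DEMO_WAIT_SQE = 1 };

struct demo_ring_t
{
    unsigned capacity;
    unsigned used;
    // Number of submission queue entries that can still be taken
    unsigned sqes_left() const { return capacity - used; }
};

struct demo_op_t
{
    int wait_for = DEMO_WAIT_NONE;
    unsigned wait_detail = 0;
};

// Returns 0 if the operation has to wait, 1 if all <n> entries were reserved
static int demo_submit(demo_ring_t &ring, demo_op_t &op, unsigned n)
{
    if (ring.sqes_left() < n)
    {
        // Remember how many entries are needed so that the retry check
        // can compare the free space against the same number
        op.wait_detail = n;
        op.wait_for = DEMO_WAIT_SQE;
        return 0;
    }
    ring.used += n;
    op.wait_for = DEMO_WAIT_NONE;
    return 1;
}

// Retry only when the ring has at least as many free entries as were requested
static bool demo_can_retry(const demo_ring_t &ring, const demo_op_t &op)
{
    return op.wait_for != DEMO_WAIT_SQE || ring.sqes_left() >= op.wait_detail;
}
```

Checking for all `n` entries before taking any of them means an operation never ends up holding only part of the entries it needs.
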
--- a/src/blockstore_init.cpp
+++ b/src/blockstore_init.cpp
@@ -148,7 +148,7 @@ resume_1:
        {
            GET_SQE();
            data->iov = {
-                metadata_buffer + (bs->inmemory_meta
+                (uint8_t*)metadata_buffer + (bs->inmemory_meta
                    ? metadata_read
                    : (prev == 1 ? bs->metadata_buf_size : 0)),
                bs->meta_len - metadata_read > bs->metadata_buf_size ? bs->metadata_buf_size : bs->meta_len - metadata_read,
@@ -169,13 +169,13 @@ resume_1:
        if (prev_done)
        {
            void *done_buf = bs->inmemory_meta
-                ? (metadata_buffer + done_pos)
-                : (metadata_buffer + (prev_done == 2 ? bs->metadata_buf_size : 0));
+                ? ((uint8_t*)metadata_buffer + done_pos)
+                : ((uint8_t*)metadata_buffer + (prev_done == 2 ? bs->metadata_buf_size : 0));
            unsigned count = bs->meta_block_size / bs->clean_entry_size;
            for (int sector = 0; sector < done_len; sector += bs->meta_block_size)
            {
                // handle <count> entries
-                handle_entries(done_buf + sector, count, bs->block_order);
+                handle_entries((uint8_t*)done_buf + sector, count, bs->block_order);
                done_cnt += count;
            }
            prev_done = 0;
@@ -215,7 +215,7 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
 {
    for (unsigned i = 0; i < count; i++)
    {
-        clean_disk_entry *entry = (clean_disk_entry*)(entries + i*bs->clean_entry_size);
+        clean_disk_entry *entry = (clean_disk_entry*)((uint8_t*)entries + i*bs->clean_entry_size);
        if (!bs->inmemory_meta && bs->clean_entry_bitmap_size)
        {
            memcpy(bs->clean_bitmap + (done_cnt+i)*2*bs->clean_entry_bitmap_size, &entry->bitmap, 2*bs->clean_entry_bitmap_size);
@@ -440,7 +440,7 @@ resume_1:
                if (!bs->journal.inmemory)
                    submitted_buf = memalign_or_die(MEM_ALIGNMENT, JOURNAL_BUFFER_SIZE);
                else
-                    submitted_buf = bs->journal.buffer + journal_pos;
+                    submitted_buf = (uint8_t*)bs->journal.buffer + journal_pos;
                data->iov = {
                    submitted_buf,
                    end - journal_pos < JOURNAL_BUFFER_SIZE ? end - journal_pos : JOURNAL_BUFFER_SIZE,
@@ -570,7 +570,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
    resume:
        while (pos < bs->journal.block_size)
        {
-            journal_entry *je = (journal_entry*)(buf + proc_pos - done_pos + pos);
+            journal_entry *je = (journal_entry*)((uint8_t*)buf + proc_pos - done_pos + pos);
            if (je->magic != JOURNAL_MAGIC || je_crc32(je) != je->crc32 ||
                je->type < JE_MIN || je->type > JE_MAX || started && je->crc32_prev != crc32_last)
            {
@@ -619,7 +619,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                if (location >= done_pos && location+je->small_write.len <= done_pos+len)
                {
                    // data is within this buffer
-                    data_crc32 = crc32c(0, buf + location - done_pos, je->small_write.len);
+                    data_crc32 = crc32c(0, (uint8_t*)buf + location - done_pos, je->small_write.len);
                }
                else
                {
@@ -634,7 +634,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                                ? location+je->small_write.len : done[i].pos+done[i].len);
                            uint64_t part_begin = (location < done[i].pos ? done[i].pos : location);
                            covered += part_end - part_begin;
-                            data_crc32 = crc32c(data_crc32, done[i].buf + part_begin - done[i].pos, part_end - part_begin);
+                            data_crc32 = crc32c(data_crc32, (uint8_t*)done[i].buf + part_begin - done[i].pos, part_end - part_begin);
                        }
                    }
                    if (covered < je->small_write.len)
@@ -650,9 +650,9 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                    // interesting thing is that we must clear the corrupt entry if we're not readonly,
                    // because we don't write next entries in the same journal block
                    printf("Journal entry data is corrupt (data crc32 %x != %x)\n", data_crc32, je->small_write.crc32_data);
-                    memset(buf + proc_pos - done_pos + pos, 0, bs->journal.block_size - pos);
+                    memset((uint8_t*)buf + proc_pos - done_pos + pos, 0, bs->journal.block_size - pos);
                    bs->journal.next_free = prev_free;
-                    init_write_buf = buf + proc_pos - done_pos;
+                    init_write_buf = (uint8_t*)buf + proc_pos - done_pos;
                    init_write_sector = proc_pos;
                    return 0;
                }
@@ -665,7 +665,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                        .version = je->small_write.version,
                    };
                    void *bmp = NULL;
-                    void *bmp_from = (void*)je + sizeof(journal_entry_small_write);
+                    void *bmp_from = (uint8_t*)je + sizeof(journal_entry_small_write);
                    if (bs->clean_entry_bitmap_size <= sizeof(void*))
                    {
                        memcpy(&bmp, bmp_from, bs->clean_entry_bitmap_size);
@@ -745,7 +745,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                        .version = je->big_write.version,
                    };
                    void *bmp = NULL;
-                    void *bmp_from = (void*)je + sizeof(journal_entry_big_write);
+                    void *bmp_from = (uint8_t*)je + sizeof(journal_entry_big_write);
                    if (bs->clean_entry_bitmap_size <= sizeof(void*))
                    {
                        memcpy(&bmp, bmp_from, bs->clean_entry_bitmap_size);
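
Note on the casts in this file: arithmetic on `void*` is a GNU extension (it treats the pointee size as 1) rather than standard C++, and GCC reports it under `-Wpointer-arith`. Casting to `uint8_t*` first keeps the same byte-sized stride while making it explicit. A minimal sketch of the pattern follows; `byte_offset`, `entry_t` and `read_entry` are hypothetical names used only for illustration and do not exist in the Vitastor source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of Vitastor): advance an untyped buffer by a byte count
static inline void *byte_offset(void *buf, size_t off)
{
    // The cast gives the addition a well-defined stride of one byte
    return (uint8_t*)buf + off;
}

// Illustrative fixed-size record, standing in for a structure like clean_disk_entry
struct entry_t
{
    uint64_t oid;
    uint64_t version;
};

// Read the i-th record from a packed buffer whose records are entry_size bytes apart
static inline entry_t read_entry(void *entries, unsigned i, size_t entry_size)
{
    entry_t e;
    // memcpy avoids alignment assumptions about the source buffer
    memcpy(&e, byte_offset(entries, i*entry_size), sizeof(e));
    return e;
}
```

The behaviour is the same as the old `void*` arithmetic under GCC; only the warning goes away.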
--- a/src/blockstore_init.h
+++ b/src/blockstore_init.h
@@ -6,7 +6,7 @@
 class blockstore_init_meta
 {
    blockstore_impl_t *bs;
-    int wait_state = 0, wait_count = 0;
+    int wait_state = 0;
    bool zero_on_init = false;
    void *metadata_buffer = NULL;
    uint64_t metadata_read = 0;
--- a/src/blockstore_journal.cpp
+++ b/src/blockstore_journal.cpp
@@ -96,7 +96,8 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
        next_pos = next_pos + data_after;
        if (next_pos > bs->journal.len)
        {
-            next_pos = bs->journal.block_size + data_after;
+            if (right_dir)
+                next_pos = bs->journal.block_size + data_after;
            right_dir = false;
        }
    }
@@ -136,13 +137,13 @@ journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type,
        journal.in_sector_pos = 0;
        journal.next_free = (journal.next_free+journal.block_size) < journal.len ? journal.next_free + journal.block_size : journal.block_size;
        memset(journal.inmemory
-            ? journal.buffer + journal.sector_info[journal.cur_sector].offset
-            : journal.sector_buf + journal.block_size*journal.cur_sector, 0, journal.block_size);
+            ? (uint8_t*)journal.buffer + journal.sector_info[journal.cur_sector].offset
+            : (uint8_t*)journal.sector_buf + journal.block_size*journal.cur_sector, 0, journal.block_size);
    }
    journal_entry *je = (struct journal_entry*)(
        (journal.inmemory
-            ? journal.buffer + journal.sector_info[journal.cur_sector].offset
-            : journal.sector_buf + journal.block_size*journal.cur_sector) + journal.in_sector_pos
+            ? (uint8_t*)journal.buffer + journal.sector_info[journal.cur_sector].offset
+            : (uint8_t*)journal.sector_buf + journal.block_size*journal.cur_sector) + journal.in_sector_pos
    );
    journal.in_sector_pos += size;
    je->magic = JOURNAL_MAGIC;
@@ -153,22 +154,73 @@ journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type,
    return je;
 }

-void prepare_journal_sector_write(journal_t & journal, int cur_sector, io_uring_sqe *sqe, std::function<void(ring_data_t*)> cb)
+void blockstore_impl_t::prepare_journal_sector_write(int cur_sector, blockstore_op_t *op)
 {
+    // Don't submit the same sector twice in the same batch
+    if (!journal.sector_info[cur_sector].submit_id)
+    {
+        io_uring_sqe *sqe = get_sqe();
+        // Caller must ensure availability of an SQE
+        assert(sqe != NULL);
+        ring_data_t *data = ((ring_data_t*)sqe->user_data);
+        journal.sector_info[cur_sector].written = true;
+        journal.sector_info[cur_sector].submit_id = ++journal.submit_id;
+        journal.submitting_sectors.push_back(cur_sector);
+        journal.sector_info[cur_sector].flush_count++;
+        data->iov = (struct iovec){
+            (journal.inmemory
+                ? (uint8_t*)journal.buffer + journal.sector_info[cur_sector].offset
+                : (uint8_t*)journal.sector_buf + journal.block_size*cur_sector),
+            journal.block_size
+        };
+        data->callback = [this, flush_id = journal.submit_id](ring_data_t *data) { handle_journal_write(data, flush_id); };
+        my_uring_prep_writev(
+            sqe, journal.fd, &data->iov, 1, journal.offset + journal.sector_info[cur_sector].offset
+        );
+    }
    journal.sector_info[cur_sector].dirty = false;
-    journal.sector_info[cur_sector].written = true;
-    journal.sector_info[cur_sector].flush_count++;
-    ring_data_t *data = ((ring_data_t*)sqe->user_data);
-    data->iov = (struct iovec){
-        (journal.inmemory
-            ? journal.buffer + journal.sector_info[cur_sector].offset
-            : journal.sector_buf + journal.block_size*cur_sector),
-        journal.block_size
-    };
-    data->callback = cb;
-    my_uring_prep_writev(
-        sqe, journal.fd, &data->iov, 1, journal.offset + journal.sector_info[cur_sector].offset
-    );
+    // But always remember that this operation has to wait until this exact journal write is finished
+    journal.flushing_ops.insert((pending_journaling_t){
+        .flush_id = journal.sector_info[cur_sector].submit_id,
+        .sector = cur_sector,
+        .op = op,
+    });
+    auto priv = PRIV(op);
+    priv->pending_ops++;
+    if (!priv->min_flushed_journal_sector)
+        priv->min_flushed_journal_sector = 1+cur_sector;
+    priv->max_flushed_journal_sector = 1+cur_sector;
+}
+
+void blockstore_impl_t::handle_journal_write(ring_data_t *data, uint64_t flush_id)
+{
+    live = true;
+    if (data->res != data->iov.iov_len)
+    {
+        // FIXME: our state becomes corrupted after a write error. maybe do something better than just die
+        throw std::runtime_error(
+            "journal write failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
+            "). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
+        );
+    }
+    auto fl_it = journal.flushing_ops.upper_bound((pending_journaling_t){ .flush_id = flush_id });
+    if (fl_it != journal.flushing_ops.end() && fl_it->flush_id == flush_id)
+    {
+        journal.sector_info[fl_it->sector].flush_count--;
+    }
+    while (fl_it != journal.flushing_ops.end() && fl_it->flush_id == flush_id)
+    {
+        auto priv = PRIV(fl_it->op);
+        priv->pending_ops--;
+        assert(priv->pending_ops >= 0);
+        if (priv->pending_ops == 0)
+        {
+            release_journal_sectors(fl_it->op);
+            priv->op_state++;
+            ringloop->wakeup();
+        }
+        journal.flushing_ops.erase(fl_it++);
+    }
 }

 journal_t::~journal_t()
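
Note on the new bookkeeping in `prepare_journal_sector_write()` and `handle_journal_write()`: each journal sector is submitted at most once per batch, and every operation that touched it is recorded under the write's `flush_id`, so one completion can release all waiters. The following is a simplified, self-contained sketch of that idea, not the actual implementation. `op_t`, `pending_t`, `tracker_t` and `end_batch()` are illustrative names. How the real code resets `submit_id` between batches is not visible in this diff, so `end_batch()` is an assumption. Unlike the diff, the sketch only increments `pending_ops` when the set entry is actually new.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Simplified stand-in for blockstore_op_t: only the fields the sketch needs
struct op_t
{
    int pending_ops = 0;
    bool done = false;
};

struct pending_t
{
    uint64_t flush_id;
    op_t *op;
};

// Order by flush_id first so that all waiters of one write are adjacent in the set
inline bool operator < (const pending_t & a, const pending_t & b)
{
    return a.flush_id < b.flush_id || (a.flush_id == b.flush_id && a.op < b.op);
}

struct tracker_t
{
    uint64_t submit_id = 0;
    std::vector<uint64_t> sector_submit_id; // 0 = not submitted in the current batch
    std::set<pending_t> flushing_ops;
    int writes_submitted = 0;

    explicit tracker_t(int sectors): sector_submit_id(sectors, 0) {}

    // Register <op> as a waiter for <sector>; issue the write only once per batch
    void prepare_write(int sector, op_t *op)
    {
        if (!sector_submit_id[sector])
        {
            sector_submit_id[sector] = ++submit_id;
            writes_submitted++; // the real code prepares an io_uring SQE here
        }
        if (flushing_ops.insert(pending_t{ sector_submit_id[sector], op }).second)
            op->pending_ops++;
    }

    // Assumption: called once the batch has been handed over for submission
    void end_batch()
    {
        for (auto & id: sector_submit_id)
            id = 0;
    }

    // Completion of the write tagged <flush_id>: release every op waiting on it
    void handle_write(uint64_t flush_id)
    {
        auto it = flushing_ops.lower_bound(pending_t{ flush_id, nullptr });
        while (it != flushing_ops.end() && it->flush_id == flush_id)
        {
            if (--it->op->pending_ops == 0)
                it->op->done = true;
            it = flushing_ops.erase(it);
        }
    }
};
```

With this layout two operations that dirty the same sector in one batch share a single write, while an operation spanning several writes stays pending until the last of them completes.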
--- a/src/blockstore_journal.h
+++ b/src/blockstore_journal.h
@@ -4,6 +4,7 @@
 #pragma once

 #include "crc32c.h"
+#include <set>

 #define MIN_JOURNAL_SIZE 4*1024*1024
 #define JOURNAL_MAGIC 0x4A33
@@ -145,8 +146,21 @@ struct journal_sector_info_t
    uint64_t flush_count;
    bool written;
    bool dirty;
+    uint64_t submit_id;
 };

+struct pending_journaling_t
+{
+    uint64_t flush_id;
+    int sector;
+    blockstore_op_t *op;
+};
+
+inline bool operator < (const pending_journaling_t & a, const pending_journaling_t & b)
+{
+    return a.flush_id < b.flush_id || a.flush_id == b.flush_id && a.op < b.op;
+}
+
 struct journal_t
 {
    int fd;
@@ -172,6 +186,9 @@ struct journal_t
    bool no_same_sector_overwrites = false;
    int cur_sector = 0;
    int in_sector_pos = 0;
+    std::vector<int> submitting_sectors;
+    std::set<pending_journaling_t> flushing_ops;
+    uint64_t submit_id = 0;

    // Used sector map
    // May use ~ 80 MB per 1 GB of used journal space in the worst case
@@ -200,5 +217,3 @@ struct blockstore_journal_check_t
 };

 journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type, uint32_t size);
-
-void prepare_journal_sector_write(journal_t & journal, int sector, io_uring_sqe *sqe, std::function<void(ring_data_t*)> cb);
--- a/src/blockstore_open.cpp
+++ b/src/blockstore_open.cpp
@@ -295,9 +295,9 @@ void blockstore_impl_t::calc_lengths()
    }
 }

-void check_size(int fd, uint64_t *size, std::string name)
+static void check_size(int fd, uint64_t *size, uint64_t *sectsize, std::string name)
 {
-    int sectsize;
+    int sect;
    struct stat st;
    if (fstat(fd, &st) < 0)
    {
@@ -306,14 +306,21 @@ void check_size(int fd, uint64_t *size, std::string name)
    if (S_ISREG(st.st_mode))
    {
        *size = st.st_size;
+        if (sectsize)
+        {
+            *sectsize = st.st_blksize;
+        }
    }
    else if (S_ISBLK(st.st_mode))
    {
-        if (ioctl(fd, BLKSSZGET, &sectsize) < 0 ||
-            ioctl(fd, BLKGETSIZE64, size) < 0 ||
-            sectsize != 512)
+        if (ioctl(fd, BLKGETSIZE64, size) < 0 ||
+            ioctl(fd, BLKSSZGET, &sect) < 0)
        {
-            throw std::runtime_error(name+" sector is not equal to 512 bytes");
+            throw std::runtime_error("failed to get "+name+" size or block size: "+strerror(errno));
+        }
+        if (sectsize)
+        {
+            *sectsize = sect;
        }
    }
    else
@@ -329,7 +336,14 @@ void blockstore_impl_t::open_data()
    {
        throw std::runtime_error("Failed to open data device");
    }
-    check_size(data_fd, &data_size, "data device");
+    check_size(data_fd, &data_size, &data_device_sect, "data device");
+    if (disk_alignment % data_device_sect)
+    {
+        throw std::runtime_error(
+            "disk_alignment ("+std::to_string(disk_alignment)+
+            ") is not a multiple of data device sector size ("+std::to_string(data_device_sect)+")"
+        );
+    }
    if (data_offset >= data_size)
    {
        throw std::runtime_error("data_offset exceeds device size = "+std::to_string(data_size));
@@ -350,7 +364,7 @@ void blockstore_impl_t::open_meta()
        {
            throw std::runtime_error("Failed to open metadata device");
        }
-        check_size(meta_fd, &meta_size, "metadata device");
+        check_size(meta_fd, &meta_size, &meta_device_sect, "metadata device");
        if (meta_offset >= meta_size)
        {
            throw std::runtime_error("meta_offset exceeds device size = "+std::to_string(meta_size));
@@ -363,12 +377,20 @@ void blockstore_impl_t::open_meta()
    else
    {
        meta_fd = data_fd;
+        meta_device_sect = data_device_sect;
        meta_size = 0;
        if (meta_offset >= data_size)
        {
            throw std::runtime_error("meta_offset exceeds device size = "+std::to_string(data_size));
        }
    }
+    if (meta_block_size % meta_device_sect)
+    {
+        throw std::runtime_error(
+            "meta_block_size ("+std::to_string(meta_block_size)+
+            ") is not a multiple of metadata device sector size ("+std::to_string(meta_device_sect)+")"
+        );
+    }
 }

 void blockstore_impl_t::open_journal()
@@ -380,7 +402,7 @@ void blockstore_impl_t::open_journal()
        {
            throw std::runtime_error("Failed to open journal device");
        }
-        check_size(journal.fd, &journal.device_size, "journal device");
+        check_size(journal.fd, &journal.device_size, &journal_device_sect, "journal device");
        if (!disable_flock && flock(journal.fd, LOCK_EX|LOCK_NB) != 0)
        {
            throw std::runtime_error(std::string("Failed to lock journal device: ") + strerror(errno));
@@ -389,6 +411,7 @@ void blockstore_impl_t::open_journal()
    else
    {
        journal.fd = meta_fd;
+        journal_device_sect = meta_device_sect;
        journal.device_size = 0;
        if (journal.offset >= data_size)
        {
@@ -406,4 +429,11 @@ void blockstore_impl_t::open_journal()
        if (!journal.sector_buf)
            throw std::bad_alloc();
    }
+    if (journal_block_size % journal_device_sect)
+    {
+        throw std::runtime_error(
+            "journal_block_size ("+std::to_string(journal_block_size)+
+            ") is not a multiple of journal device sector size ("+std::to_string(journal_device_sect)+")"
+        );
+    }
 }
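
Note on the new checks in `open_data()`, `open_meta()` and `open_journal()`: instead of requiring 512-byte sectors, the code now reads the device's sector size and rejects configured block sizes that are not a multiple of it. A minimal sketch of that validation is shown below. `is_multiple_of_sector` and `check_block_multiple` are hypothetical helpers, and treating a zero sector size as an error is an added assumption that the diff does not make.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical predicate: true when block_size is a whole number of device sectors
static bool is_multiple_of_sector(uint64_t block_size, uint64_t sector_size)
{
    // Assumption: a zero sector size is treated as invalid to avoid division by zero
    return sector_size != 0 && block_size % sector_size == 0;
}

// Hypothetical helper mirroring the error style used in the diff
static void check_block_multiple(const std::string & name, uint64_t block_size,
    const std::string & device, uint64_t sector_size)
{
    if (!is_multiple_of_sector(block_size, sector_size))
    {
        throw std::runtime_error(
            name+" ("+std::to_string(block_size)+
            ") is not a multiple of "+device+" sector size ("+std::to_string(sector_size)+")"
        );
    }
}
```

For example, a 4096-byte journal block passes on devices with 512-byte or 4096-byte sectors, while a 512-byte block on a device with 4096-byte sectors is rejected.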
--- a/src/blockstore_read.cpp
+++ b/src/blockstore_read.cpp
@@ -24,7 +24,7 @@ int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_
    }
    if (journal.inmemory && IS_JOURNAL(item_state))
    {
-        memcpy(buf, journal.buffer + offset, len);
+        memcpy(buf, (uint8_t*)journal.buffer + offset, len);
        return 1;
    }
    BS_SUBMIT_GET_SQE(sqe, data);
@@ -75,7 +75,7 @@ int blockstore_impl_t::fulfill_read(blockstore_op_t *read_op, uint64_t &fulfille
                };
                it = PRIV(read_op)->read_vec.insert(it, el);
                if (!fulfill_read_push(read_op,
-                    read_op->buf + el.offset - read_op->offset,
+                    (uint8_t*)read_op->buf + el.offset - read_op->offset,
                    item_location + el.offset - item_start,
                    el.len, item_state, item_version))
                {
@@ -102,7 +102,7 @@ uint8_t* blockstore_impl_t::get_clean_entry_bitmap(uint64_t block_loc, int offse
    {
        uint64_t sector = (meta_loc / (meta_block_size / clean_entry_size)) * meta_block_size;
        uint64_t pos = (meta_loc % (meta_block_size / clean_entry_size));
-        clean_entry_bitmap = (uint8_t*)(metadata_buffer + sector + pos*clean_entry_size + sizeof(clean_disk_entry) + offset);
+        clean_entry_bitmap = ((uint8_t*)metadata_buffer + sector + pos*clean_entry_size + sizeof(clean_disk_entry) + offset);
    }
    else
        clean_entry_bitmap = (uint8_t*)(clean_bitmap + meta_loc*2*clean_entry_bitmap_size + offset);
--- a/src/blockstore_rollback.cpp
+++ b/src/blockstore_rollback.cpp
@@ -74,24 +74,17 @@ skip_ov:
    {
        return 0;
    }
-    // There is sufficient space. Get SQEs
-    struct io_uring_sqe *sqe[space_check.sectors_to_write];
-    for (i = 0; i < space_check.sectors_to_write; i++)
-    {
-        BS_SUBMIT_GET_SQE_DECL(sqe[i]);
-    }
+    // There is sufficient space. Check SQEs
+    BS_SUBMIT_CHECK_SQES(space_check.sectors_to_write);
    // Prepare and submit journal entries
-    auto cb = [this, op](ring_data_t *data) { handle_rollback_event(data, op); };
-    int s = 0, cur_sector = -1;
+    int s = 0;
    for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
    {
        if (!journal.entry_fits(sizeof(journal_entry_rollback)) &&
            journal.sector_info[journal.cur_sector].dirty)
        {
-            if (cur_sector == -1)
-                PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
-            prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
-            cur_sector = journal.cur_sector;
+            prepare_journal_sector_write(journal.cur_sector, op);
+            s++;
        }
        journal_entry_rollback *je = (journal_entry_rollback*)
            prefill_single_journal_entry(journal, JE_ROLLBACK, sizeof(journal_entry_rollback));
@@ -100,12 +93,9 @@ skip_ov:
        je->crc32 = je_crc32((journal_entry*)je);
        journal.crc32_last = je->crc32;
    }
-    prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
+    prepare_journal_sector_write(journal.cur_sector, op);
+    s++;
    assert(s == space_check.sectors_to_write);
-    if (cur_sector == -1)
-        PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
-    PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
-    PRIV(op)->pending_ops = s;
    PRIV(op)->op_state = 1;
    return 1;
 }
@@ -114,30 +104,23 @@ int blockstore_impl_t::continue_rollback(blockstore_op_t *op)
 {
    if (PRIV(op)->op_state == 2)
        goto resume_2;
-    else if (PRIV(op)->op_state == 3)
-        goto resume_3;
-    else if (PRIV(op)->op_state == 5)
-        goto resume_5;
+    else if (PRIV(op)->op_state == 4)
+        goto resume_4;
    else
        return 1;
 resume_2:
-    // Release used journal sectors
-    release_journal_sectors(op);
-resume_3:
    if (!disable_journal_fsync)
    {
-        io_uring_sqe *sqe;
-        BS_SUBMIT_GET_SQE_DECL(sqe);
-        ring_data_t *data = ((ring_data_t*)sqe->user_data);
+        BS_SUBMIT_GET_SQE(sqe, data);
        my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
        data->iov = { 0 };
-        data->callback = [this, op](ring_data_t *data) { handle_rollback_event(data, op); };
+        data->callback = [this, op](ring_data_t *data) { handle_write_event(data, op); };
        PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
        PRIV(op)->pending_ops = 1;
-        PRIV(op)->op_state = 4;
+        PRIV(op)->op_state = 3;
        return 1;
    }
-resume_5:
+resume_4:
    obj_ver_id* v;
    int i;
    for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
@@ -196,24 +179,6 @@ void blockstore_impl_t::mark_rolled_back(const obj_ver_id & ov)
    }
 }

-void blockstore_impl_t::handle_rollback_event(ring_data_t *data, blockstore_op_t *op)
-{
-    live = true;
-    if (data->res != data->iov.iov_len)
-    {
-        throw std::runtime_error(
-            "write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
-            "). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
-        );
-    }
-    PRIV(op)->pending_ops--;
-    if (PRIV(op)->pending_ops == 0)
-    {
-        PRIV(op)->op_state++;
-        ringloop->wakeup();
-    }
-}
-
 void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start, blockstore_dirty_db_t::iterator dirty_end, uint64_t clean_loc)
 {
    if (dirty_end == dirty_start)
--- a/src/blockstore_stable.cpp
+++ b/src/blockstore_stable.cpp
@@ -97,25 +97,18 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
    {
        return 0;
    }
-    // There is sufficient space. Get SQEs
-    struct io_uring_sqe *sqe[space_check.sectors_to_write];
-    for (i = 0; i < space_check.sectors_to_write; i++)
-    {
-        BS_SUBMIT_GET_SQE_DECL(sqe[i]);
-    }
+    // There is sufficient space. Check SQEs
+    BS_SUBMIT_CHECK_SQES(space_check.sectors_to_write);
    // Prepare and submit journal entries
-    auto cb = [this, op](ring_data_t *data) { handle_stable_event(data, op); };
-    int s = 0, cur_sector = -1;
+    int s = 0;
    for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
    {
        // FIXME: Only stabilize versions that aren't stable yet
        if (!journal.entry_fits(sizeof(journal_entry_stable)) &&
            journal.sector_info[journal.cur_sector].dirty)
        {
-            if (cur_sector == -1)
-                PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
-            prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
-            cur_sector = journal.cur_sector;
+            prepare_journal_sector_write(journal.cur_sector, op);
+            s++;
        }
        journal_entry_stable *je = (journal_entry_stable*)
            prefill_single_journal_entry(journal, JE_STABLE, sizeof(journal_entry_stable));
@@ -124,12 +117,9 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
        je->crc32 = je_crc32((journal_entry*)je);
        journal.crc32_last = je->crc32;
    }
-    prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
+    prepare_journal_sector_write(journal.cur_sector, op);
+    s++;
    assert(s == space_check.sectors_to_write);
-    if (cur_sector == -1)
-        PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
-    PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
-    PRIV(op)->pending_ops = s;
    PRIV(op)->op_state = 1;
    return 1;
 }
@@ -138,30 +128,23 @@ int blockstore_impl_t::continue_stable(blockstore_op_t *op)
 {
    if (PRIV(op)->op_state == 2)
        goto resume_2;
-    else if (PRIV(op)->op_state == 3)
-        goto resume_3;
-    else if (PRIV(op)->op_state == 5)
-        goto resume_5;
+    else if (PRIV(op)->op_state == 4)
+        goto resume_4;
    else
        return 1;
 resume_2:
-    // Release used journal sectors
-    release_journal_sectors(op);
-resume_3:
    if (!disable_journal_fsync)
    {
-        io_uring_sqe *sqe;
-        BS_SUBMIT_GET_SQE_DECL(sqe);
-        ring_data_t *data = ((ring_data_t*)sqe->user_data);
+        BS_SUBMIT_GET_SQE(sqe, data);
        my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
        data->iov = { 0 };
-        data->callback = [this, op](ring_data_t *data) { handle_stable_event(data, op); };
+        data->callback = [this, op](ring_data_t *data) { handle_write_event(data, op); };
        PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
        PRIV(op)->pending_ops = 1;
-        PRIV(op)->op_state = 4;
+        PRIV(op)->op_state = 3;
        return 1;
    }
-resume_5:
+resume_4:
    // Mark dirty_db entries as stable, acknowledge op completion
    obj_ver_id* v;
    int i;
@@ -257,21 +240,3 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
        unstable_writes.erase(unstab_it);
    }
 }
-
-void blockstore_impl_t::handle_stable_event(ring_data_t *data, blockstore_op_t *op)
-{
-    live = true;
-    if (data->res != data->iov.iov_len)
-    {
-        throw std::runtime_error(
-            "write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
-            "). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
-        );
-    }
-    PRIV(op)->pending_ops--;
-    if (PRIV(op)->pending_ops == 0)
-    {
-        PRIV(op)->op_state++;
-        ringloop->wakeup();
-    }
-}
--- a/src/blockstore_sync.cpp
+++ b/src/blockstore_sync.cpp
@@ -44,10 +44,8 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
        if (journal.sector_info[journal.cur_sector].dirty)
        {
            // Write out the last journal sector if it happens to be dirty
-            BS_SUBMIT_GET_ONLY_SQE(sqe);
-            prepare_journal_sector_write(journal, journal.cur_sector, sqe, [this, op](ring_data_t *data) { handle_sync_event(data, op); });
-            PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
-            PRIV(op)->pending_ops = 1;
+            BS_SUBMIT_CHECK_SQES(1);
+            prepare_journal_sector_write(journal.cur_sector, op);
            PRIV(op)->op_state = SYNC_JOURNAL_WRITE_SENT;
            return 1;
        }
@@ -64,7 +62,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
            BS_SUBMIT_GET_SQE(sqe, data);
            my_uring_prep_fsync(sqe, data_fd, IORING_FSYNC_DATASYNC);
            data->iov = { 0 };
-            data->callback = [this, op](ring_data_t *data) { handle_sync_event(data, op); };
+            data->callback = [this, op](ring_data_t *data) { handle_write_event(data, op); };
            PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
            PRIV(op)->pending_ops = 1;
            PRIV(op)->op_state = SYNC_DATA_SYNC_SENT;
@@ -85,24 +83,18 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
        {
            return 0;
        }
-        // Get SQEs. Don't bother about merging, submit each journal sector as a separate request
-        struct io_uring_sqe *sqe[space_check.sectors_to_write];
-        for (int i = 0; i < space_check.sectors_to_write; i++)
-        {
-            BS_SUBMIT_GET_SQE_DECL(sqe[i]);
-        }
+        // Check SQEs. Don't bother about merging, submit each journal sector as a separate request
+        BS_SUBMIT_CHECK_SQES(space_check.sectors_to_write);
        // Prepare and submit journal entries
        auto it = PRIV(op)->sync_big_writes.begin();
-        int s = 0, cur_sector = -1;
+        int s = 0;
        while (it != PRIV(op)->sync_big_writes.end())
        {
            if (!journal.entry_fits(sizeof(journal_entry_big_write) + clean_entry_bitmap_size) &&
                journal.sector_info[journal.cur_sector].dirty)
            {
-                if (cur_sector == -1)
-                    PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
-                prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], [this, op](ring_data_t *data) { handle_sync_event(data, op); });
-                cur_sector = journal.cur_sector;
+                prepare_journal_sector_write(journal.cur_sector, op);
+                s++;
            }
            auto & dirty_entry = dirty_db.at(*it);
            journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
@@ -129,12 +121,9 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
            journal.crc32_last = je->crc32;
            it++;
        }
-        prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], [this, op](ring_data_t *data) { handle_sync_event(data, op); });
+        prepare_journal_sector_write(journal.cur_sector, op);
+        s++;
        assert(s == space_check.sectors_to_write);
-        if (cur_sector == -1)
-            PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
-        PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
-        PRIV(op)->pending_ops = s;
        PRIV(op)->op_state = SYNC_JOURNAL_WRITE_SENT;
        return 1;
    }
@@ -145,7 +134,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
            BS_SUBMIT_GET_SQE(sqe, data);
            my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
            data->iov = { 0 };
-            data->callback = [this, op](ring_data_t *data) { handle_sync_event(data, op); };
+            data->callback = [this, op](ring_data_t *data) { handle_write_event(data, op); };
            PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
            PRIV(op)->pending_ops = 1;
            PRIV(op)->op_state = SYNC_JOURNAL_SYNC_SENT;
@@ -164,42 +153,6 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
    return 1;
 }

-void blockstore_impl_t::handle_sync_event(ring_data_t *data, blockstore_op_t *op)
-{
-    live = true;
-    if (data->res != data->iov.iov_len)
-    {
-        throw std::runtime_error(
-            "write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
-            "). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
-        );
-    }
-    PRIV(op)->pending_ops--;
-    if (PRIV(op)->pending_ops == 0)
-    {
-        // Release used journal sectors
-        release_journal_sectors(op);
-        // Handle states
-        if (PRIV(op)->op_state == SYNC_DATA_SYNC_SENT)
-        {
-            PRIV(op)->op_state = SYNC_DATA_SYNC_DONE;
-        }
-        else if (PRIV(op)->op_state == SYNC_JOURNAL_WRITE_SENT)
-        {
-            PRIV(op)->op_state = SYNC_JOURNAL_WRITE_DONE;
-        }
-        else if (PRIV(op)->op_state == SYNC_JOURNAL_SYNC_SENT)
-        {
-            PRIV(op)->op_state = SYNC_DONE;
-        }
-        else
-        {
-            throw std::runtime_error("BUG: unexpected sync op state");
-        }
-        ringloop->wakeup();
-    }
-}
-
 void blockstore_impl_t::ack_sync(blockstore_op_t *op)
 {
    // Handle states
--- a/src/blockstore_write.cpp
+++ b/src/blockstore_write.cpp
@@ -102,7 +102,7 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
        // Issue an additional sync so that the previous big write can reach the journal
        blockstore_op_t *sync_op = new blockstore_op_t;
        sync_op->opcode = BS_OP_SYNC;
-        sync_op->callback = [this, op](blockstore_op_t *sync_op)
+        sync_op->callback = [](blockstore_op_t *sync_op)
        {
            delete sync_op;
        };
@@ -268,8 +268,8 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
            cancel_all_writes(op, dirty_it, -ENOSPC);
            return 2;
        }
-        write_iodepth++;
        BS_SUBMIT_GET_SQE(sqe, data);
+        write_iodepth++;
        dirty_it->second.location = loc << block_order;
        dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED;
 #ifdef BLOCKSTORE_DEBUG
@@ -324,29 +324,21 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        {
            return 0;
        }
-        write_iodepth++;
-        // There is sufficient space. Get SQE(s)
-        struct io_uring_sqe *sqe1 = NULL;
-        if (immediate_commit != IMMEDIATE_NONE ||
-            !journal.entry_fits(sizeof(journal_entry_small_write) + clean_entry_bitmap_size))
-        {
+        // There is sufficient space. Check SQE(s)
+        BS_SUBMIT_CHECK_SQES(
            // Write current journal sector only if it's dirty and full, or in the immediate_commit mode
-            BS_SUBMIT_GET_SQE_DECL(sqe1);
-        }
-        struct io_uring_sqe *sqe2 = NULL;
-        if (op->len > 0)
-        {
-            BS_SUBMIT_GET_SQE_DECL(sqe2);
-        }
+            (immediate_commit != IMMEDIATE_NONE ||
+                !journal.entry_fits(sizeof(journal_entry_small_write) + clean_entry_bitmap_size) ? 1 : 0) +
+            (op->len > 0 ? 1 : 0)
+        );
+        write_iodepth++;
        // Got SQEs. Prepare previous journal sector write if required
        auto cb = [this, op](ring_data_t *data) { handle_write_event(data, op); };
        if (immediate_commit == IMMEDIATE_NONE)
        {
-            if (sqe1)
+            if (!journal.entry_fits(sizeof(journal_entry_small_write) + clean_entry_bitmap_size))
            {
-                prepare_journal_sector_write(journal, journal.cur_sector, sqe1, cb);
-                PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
-                PRIV(op)->pending_ops++;
+                prepare_journal_sector_write(journal.cur_sector, op);
            }
            else
            {
@@ -380,9 +372,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        journal.crc32_last = je->crc32;
        if (immediate_commit != IMMEDIATE_NONE)
        {
-            prepare_journal_sector_write(journal, journal.cur_sector, sqe1, cb);
-            PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
-            PRIV(op)->pending_ops++;
+            prepare_journal_sector_write(journal.cur_sector, op);
        }
        if (op->len > 0)
        {
@@ -390,9 +380,9 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
            if (journal.inmemory)
            {
                // Copy data
-                memcpy(journal.buffer + journal.next_free, op->buf, op->len);
+                memcpy((uint8_t*)journal.buffer + journal.next_free, op->buf, op->len);
            }
-            ring_data_t *data2 = ((ring_data_t*)sqe2->user_data);
+            BS_SUBMIT_GET_SQE(sqe2, data2);
            data2->iov = (struct iovec){ op->buf, op->len };
            data2->callback = cb;
            my_uring_prep_writev(
@@ -441,13 +431,12 @@ int blockstore_impl_t::continue_write(blockstore_op_t *op)
 resume_2:
    // Only for the immediate_commit mode: prepare and submit big_write journal entry
    {
+        BS_SUBMIT_CHECK_SQES(1);
        auto dirty_it = dirty_db.find((obj_ver_id){
            .oid = op->oid,
            .version = op->version,
        });
        assert(dirty_it != dirty_db.end());
-        io_uring_sqe *sqe = NULL;
-        BS_SUBMIT_GET_SQE_DECL(sqe);
        journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
            journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
            sizeof(journal_entry_big_write) + clean_entry_bitmap_size
@@ -469,10 +458,7 @@ resume_2:
        memcpy((void*)(je+1), (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap), clean_entry_bitmap_size);
        je->crc32 = je_crc32((journal_entry*)je);
        journal.crc32_last = je->crc32;
-        prepare_journal_sector_write(journal, journal.cur_sector, sqe,
-            [this, op](ring_data_t *data) { handle_write_event(data, op); });
-        PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
-        PRIV(op)->pending_ops = 1;
+        prepare_journal_sector_write(journal.cur_sector, op);
        PRIV(op)->op_state = 3;
        return 1;
    }
@@ -587,6 +573,7 @@ void blockstore_impl_t::handle_write_event(ring_data_t *data, blockstore_op_t *o
        );
    }
    PRIV(op)->pending_ops--;
+    assert(PRIV(op)->pending_ops >= 0);
    if (PRIV(op)->pending_ops == 0)
    {
        release_journal_sectors(op);
@@ -604,7 +591,6 @@ void blockstore_impl_t::release_journal_sectors(blockstore_op_t *op)
        uint64_t s = PRIV(op)->min_flushed_journal_sector;
        while (1)
        {
-            journal.sector_info[s-1].flush_count--;
            if (s != (1+journal.cur_sector) && journal.sector_info[s-1].flush_count == 0)
            {
                // We know for sure that we won't write into this sector anymore
@@ -643,24 +629,24 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
    {
        return 0;
    }
-    write_iodepth++;
-    io_uring_sqe *sqe = NULL;
-    if (immediate_commit != IMMEDIATE_NONE ||
-        (journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_del) &&
-        journal.sector_info[journal.cur_sector].dirty)
+    // Write current journal sector only if it's dirty and full, or in the immediate_commit mode
+    BS_SUBMIT_CHECK_SQES(
+        (immediate_commit != IMMEDIATE_NONE ||
+            (journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_del) &&
+            journal.sector_info[journal.cur_sector].dirty) ? 1 : 0
+    );
+    if (write_iodepth >= max_write_iodepth)
    {
-        // Write current journal sector only if it's dirty and full, or in the immediate_commit mode
-        BS_SUBMIT_GET_SQE_DECL(sqe);
+        return 0;
    }
-    auto cb = [this, op](ring_data_t *data) { handle_write_event(data, op); };
+    write_iodepth++;
    // Prepare journal sector write
    if (immediate_commit == IMMEDIATE_NONE)
    {
-        if (sqe)
+        if ((journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_del) &&
+            journal.sector_info[journal.cur_sector].dirty)
        {
-            prepare_journal_sector_write(journal, journal.cur_sector, sqe, cb);
-            PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
-            PRIV(op)->pending_ops++;
+            prepare_journal_sector_write(journal.cur_sector, op);
        }
        else
        {
@@ -687,9 +673,7 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
    dirty_it->second.state = BS_ST_DELETE | BS_ST_SUBMITTED;
    if (immediate_commit != IMMEDIATE_NONE)
    {
-        prepare_journal_sector_write(journal, journal.cur_sector, sqe, cb);
-        PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
-        PRIV(op)->pending_ops++;
+        prepare_journal_sector_write(journal.cur_sector, op);
    }
    if (!PRIV(op)->pending_ops)
    {
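Review note on the hunks above: `write_iodepth++` now comes after the SQE check, and the per-SQE `BS_SUBMIT_GET_SQE_DECL` calls are replaced by one up-front `BS_SUBMIT_CHECK_SQES(n)`. The macro definitions are not part of this excerpt, so the following is only a minimal sketch of the general idea, using hypothetical `toy_ring_t` and `toy_writer_t` types: confirm that every slot the operation needs is available before mutating any counter, so that an early return leaves the state untouched.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a submission ring; NOT Vitastor's ring loop.
struct toy_ring_t
{
    size_t capacity;
    size_t used;

    explicit toy_ring_t(size_t cap): capacity(cap), used(0) {}

    size_t space_left() const { return capacity - used; }

    void take_slot()
    {
        // Callers must have checked space_left() first
        assert(used < capacity);
        used++;
    }
};

// Hypothetical writer that tracks in-flight writes like write_iodepth does
struct toy_writer_t
{
    toy_ring_t ring;
    int write_iodepth;

    explicit toy_writer_t(size_t cap): ring(cap), write_iodepth(0) {}

    // Returns false and changes nothing unless all `needed` slots are free
    bool try_submit(size_t needed)
    {
        if (ring.space_left() < needed)
            return false; // early exit: no counter has been touched yet
        write_iodepth++; // safe now: slot acquisition below cannot fail
        for (size_t i = 0; i < needed; i++)
            ring.take_slot();
        return true;
    }
};
```

With the increment placed before the check, a failed attempt would leave `write_iodepth` one too high on every retry, which is presumably the kind of drift the reordering avoids.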
--- a/src/cli.cpp
+++ b/src/cli.cpp
@@ -57,6 +57,7 @@ json11::Json::object cli_tool_t::parse_args(int narg, const char *args[])
            const char *opt = args[i]+2;
            cfg[opt] = i == narg-1 || !strcmp(opt, "json") || !strcmp(opt, "wait-list") ||
                !strcmp(opt, "long") || !strcmp(opt, "del") || !strcmp(opt, "no-color") ||
+                !strcmp(opt, "readonly") || !strcmp(opt, "readwrite") ||
                !strcmp(opt, "force") || !strcmp(opt, "reverse") ||
                !strcmp(opt, "writers-stopped") && strcmp("1", args[i+1]) != 0
                ? "1" : args[++i];
@@ -69,7 +70,7 @@ json11::Json::object cli_tool_t::parse_args(int narg, const char *args[])
    if (!cmd.size())
    {
        std::string exe(exe_name);
-        if (exe.substr(exe.size()-11) == "vitastor-rm")
+        if (exe.size() >= 11 && exe.substr(exe.size()-11) == "vitastor-rm")
        {
            cmd.push_back("rm-data");
        }
@@ -85,8 +86,11 @@ void cli_tool_t::help()
        "(c) Vitaliy Filippov, 2019+ (VNPL-1.1)\n"
        "\n"
        "USAGE:\n"
-        "%s ls [-l] [-p POOL] [--sort FIELD] [-r] [-n N] [<name> ...]\n"
-        "  List images (only specified if <name> passed).\n"
+        "%s df\n"
+        "  Show pool space statistics\n"
+        "\n"
+        "%s ls [-l] [-p POOL] [--sort FIELD] [-r] [-n N] [<glob> ...]\n"
+        "  List images (only matching <glob> patterns if passed).\n"
        "  -p|--pool POOL  Filter images by pool ID or name\n"
        "  -l|--long       Also report allocated size and I/O statistics\n"
        "  --del           Also include delete operation statistics\n"
@@ -103,7 +107,7 @@ void cli_tool_t::help()
        "%s snap-create [-p|--pool <id|name>] <image>@<snapshot>\n"
        "  Create a snapshot of image <name>. May be used live if only a single writer is active.\n"
        "\n"
-        "%s modify <name> [--rename <new-name>] [-s|--size <size>] [--readonly | --readwrite] [-f|--force]\n"
+        "%s modify <name> [--rename <new-name>] [--resize <size>] [--readonly | --readwrite] [-f|--force]\n"
        "  Rename, resize image or change its readonly status. Images with children can't be made read-write.\n"
        "  If the new size is smaller than the old size, extra data will be purged.\n"
        "  You should resize file system in the image, if present, before shrinking it.\n"
@@ -151,7 +155,8 @@ void cli_tool_t::help()
        "  --no-color          Disable colored output\n"
        "  --json              JSON output\n"
        ,
-        exe_name, exe_name, exe_name, exe_name, exe_name, exe_name, exe_name, exe_name, exe_name, exe_name, exe_name
+        exe_name, exe_name, exe_name, exe_name, exe_name, exe_name,
+        exe_name, exe_name, exe_name, exe_name, exe_name, exe_name
    );
    exit(0);
 }
@@ -172,7 +177,7 @@ void cli_tool_t::change_parent(inode_t cur, inode_t new_parent)
    new_cfg.parent_id = new_parent;
    json11::Json::object cur_cfg_json = cli->st_cli.serialize_inode_cfg(&new_cfg);
    waiting++;
-    cli->st_cli.etcd_txn(json11::Json::object {
+    cli->st_cli.etcd_txn_slow(json11::Json::object {
        { "compare", json11::Json::array {
            json11::Json::object {
                { "target", "MOD" },
@@ -189,7 +194,7 @@ void cli_tool_t::change_parent(inode_t cur, inode_t new_parent)
                } }
            },
        } },
-    }, ETCD_SLOW_TIMEOUT, [this, new_parent, cur, cur_name](std::string err, json11::Json res)
+    }, [this, new_parent, cur, cur_name](std::string err, json11::Json res)
    {
        if (err != "")
        {
@@ -224,6 +229,22 @@ void cli_tool_t::change_parent(inode_t cur, inode_t new_parent)
    });
 }

+void cli_tool_t::etcd_txn(json11::Json txn)
+{
+    waiting++;
+    cli->st_cli.etcd_txn_slow(txn, [this](std::string err, json11::Json res)
+    {
+        waiting--;
+        if (err != "")
+        {
+            fprintf(stderr, "Error reading from etcd: %s\n", err.c_str());
+            exit(1);
+        }
+        etcd_result = res;
+        ringloop->wakeup();
+    });
+}
+
 inode_config_t* cli_tool_t::get_inode_cfg(const std::string & name)
 {
    for (auto & ic: cli->st_cli.inode_config)
@@ -245,6 +266,11 @@ void cli_tool_t::run(json11::Json cfg)
        fprintf(stderr, "command is missing\n");
        exit(1);
    }
+    else if (cmd[0] == "df")
+    {
+        // Show pool space stats
+        action_cb = start_df(cfg);
+    }
    else if (cmd[0] == "ls")
    {
        // List images
@@ -295,6 +321,10 @@ void cli_tool_t::run(json11::Json cfg)
        fprintf(stderr, "unknown command: %s\n", cmd[0].string_value().c_str());
        exit(1);
    }
+    if (action_cb == NULL)
+    {
+        return;
+    }
    color = !cfg["no-color"].bool_value();
    json_output = cfg["json"].bool_value();
    iodepth = cfg["iodepth"].uint64_value();
@@ -335,6 +365,13 @@ void cli_tool_t::run(json11::Json cfg)
        if (action_cb != NULL)
            ringloop->wait();
    }
+    // Destroy the client
+    delete cli;
+    delete epmgr;
+    delete ringloop;
+    cli = NULL;
+    epmgr = NULL;
+    ringloop = NULL;
 }

 int main(int narg, const char *args[])
@@ -344,5 +381,6 @@ int main(int narg, const char *args[])
    exe_name = args[0];
    cli_tool_t *p = new cli_tool_t();
    p->run(cli_tool_t::parse_args(narg, args));
+    delete p;
    return 0;
 }
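Review note on the `exe.size() >= 11` guard in `parse_args`: `exe.size()-11` is unsigned arithmetic, so for a name shorter than 11 characters it wraps around to a huge value and `std::string::substr` throws `std::out_of_range`. A small helper makes the same check reusable. The name `has_suffix` is hypothetical and is not part of this commit.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns true if `str` ends with `suffix`.
// The length comparison comes first, so the subtraction can never wrap around.
bool has_suffix(const std::string & str, const std::string & suffix)
{
    return str.size() >= suffix.size() &&
        str.compare(str.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Usage example: a short executable name is simply "not a match"
void has_suffix_examples()
{
    assert(has_suffix("/usr/bin/vitastor-rm", "vitastor-rm"));
    assert(!has_suffix("rm", "vitastor-rm"));
}
```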
--- a/src/cli.h
+++ b/src/cli.h
@@ -34,6 +34,7 @@ public:
    cluster_client_t *cli = NULL;

    int waiting = 0;
+    json11::Json etcd_result;
    ring_consumer_t consumer;
    std::function<bool(void)> action_cb;

@@ -50,6 +51,7 @@ public:
    friend struct snap_flattener_t;
    friend struct snap_remover_t;

+    std::function<bool(void)> start_df(json11::Json);
    std::function<bool(void)> start_ls(json11::Json);
    std::function<bool(void)> start_create(json11::Json);
    std::function<bool(void)> start_modify(json11::Json);
@@ -59,7 +61,18 @@ public:
    std::function<bool(void)> start_snap_rm(json11::Json);
    std::function<bool(void)> start_alloc_osd(json11::Json cfg, uint64_t *out = NULL);
    std::function<bool(void)> simple_offsets(json11::Json cfg);
+
+    void etcd_txn(json11::Json txn);
 };

-std::string format_size(uint64_t size);
 uint64_t parse_size(std::string size_str);
+
+std::string print_table(json11::Json items, json11::Json header, bool use_esc);
+
+std::string format_size(uint64_t size);
+
+std::string format_lat(uint64_t lat);
+
+std::string format_q(double depth);
+
+bool stupid_glob(const std::string str, const std::string glob);
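Review note: `stupid_glob` is only declared here and its body is not included in this excerpt. For readers unfamiliar with glob matching, the following is an independent sketch that supports only the `*` wildcard. The function name `simple_glob_sketch` and its exact semantics are assumptions and may differ from what the commit actually implements.

```cpp
#include <cassert>
#include <string>

// Independent sketch, NOT the stupid_glob() from this commit.
// '*' matches any run of characters (including none); everything else is literal.
bool simple_glob_sketch(const std::string & str, const std::string & glob)
{
    size_t s = 0, g = 0;
    size_t star = std::string::npos, mark = 0;
    while (s < str.size())
    {
        if (g < glob.size() && glob[g] == '*')
        {
            // Remember the star and first try matching zero characters
            star = g++;
            mark = s;
        }
        else if (g < glob.size() && glob[g] == str[s])
        {
            g++;
            s++;
        }
        else if (star != std::string::npos)
        {
            // Backtrack: let the last star absorb one more character
            g = star + 1;
            s = ++mark;
        }
        else
        {
            return false;
        }
    }
    // Only trailing stars may remain in the pattern
    while (g < glob.size() && glob[g] == '*')
        g++;
    return g == glob.size();
}

// Usage example
void simple_glob_examples()
{
    assert(simple_glob_sketch("testimg", "test*"));
    assert(!simple_glob_sketch("testimg", "img"));
}
```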
--- a/src/cli_alloc_osd.cpp
+++ b/src/cli_alloc_osd.cpp
@@ -13,7 +13,6 @@ struct alloc_osd_t
 {
    cli_tool_t *parent;

-    json11::Json result;
    uint64_t new_id = 1;

    int state = 0;
@@ -29,7 +28,7 @@ struct alloc_osd_t
            goto resume_1;
        do
        {
-            etcd_txn(json11::Json::object {
+            parent->etcd_txn(json11::Json::object {
                { "compare", json11::Json::array {
                    json11::Json::object {
                        { "target", "VERSION" },
@@ -63,10 +62,10 @@ struct alloc_osd_t
            state = 1;
            if (parent->waiting > 0)
                return;
-            if (!result["succeeded"].bool_value())
+            if (!parent->etcd_result["succeeded"].bool_value())
            {
                std::vector<osd_num_t> used;
-                for (auto kv: result["responses"][0]["response_range"]["kvs"].array_items())
+                for (auto kv: parent->etcd_result["responses"][0]["response_range"]["kvs"].array_items())
                {
                    std::string key = base64_decode(kv["key"].string_value());
                    osd_num_t cur_osd;
@@ -98,25 +97,9 @@ struct alloc_osd_t
                    new_id = used[e-1]+1;
                }
            }
-        } while (!result["succeeded"].bool_value());
+        } while (!parent->etcd_result["succeeded"].bool_value());
        state = 100;
    }
-
-    void etcd_txn(json11::Json txn)
-    {
-        parent->waiting++;
-        parent->cli->st_cli.etcd_txn(txn, ETCD_SLOW_TIMEOUT, [this](std::string err, json11::Json res)
-        {
-            parent->waiting--;
-            if (err != "")
-            {
-                fprintf(stderr, "Error reading from etcd: %s\n", err.c_str());
-                exit(1);
-            }
-            this->result = res;
-            parent->ringloop->wakeup();
-        });
-    }
 };

 std::function<bool(void)> cli_tool_t::start_alloc_osd(json11::Json cfg, uint64_t *out)
--- a/src/cli_create.cpp
+++ b/src/cli_create.cpp
@@ -31,7 +31,6 @@ struct image_creator_t
    inode_t new_parent_id = 0;
    inode_t new_id = 0, old_id = 0;
    uint64_t max_id_mod_rev = 0, cfg_mod_rev = 0, idx_mod_rev = 0;
-    json11::Json result;

    int state = 0;

@@ -88,6 +87,31 @@ struct image_creator_t
            goto resume_2;
        else if (state == 3)
            goto resume_3;
+        for (auto & ic: parent->cli->st_cli.inode_config)
+        {
+            if (ic.second.name == image_name)
+            {
+                fprintf(stderr, "Image %s already exists\n", image_name.c_str());
+                exit(1);
+            }
+            if (ic.second.name == new_parent)
+            {
+                new_parent_id = ic.second.num;
+                if (!new_pool_id)
+                {
+                    new_pool_id = INODE_POOL(ic.second.num);
+                }
+                if (!size)
+                {
+                    size = ic.second.size;
+                }
+            }
+        }
+        if (new_parent != "" && !new_parent_id)
+        {
+            fprintf(stderr, "Parent image not found\n");
+            exit(1);
+        }
        if (!new_pool_id)
        {
            fprintf(stderr, "Pool name or ID is missing\n");
@@ -98,36 +122,28 @@ struct image_creator_t
            fprintf(stderr, "Image size is missing\n");
            exit(1);
        }
-        for (auto & ic: parent->cli->st_cli.inode_config)
-        {
-            if (ic.second.name == image_name)
-            {
-                fprintf(stderr, "Image %s already exists\n", image_name.c_str());
-                exit(1);
-            }
-        }
        do
        {
-            etcd_txn(json11::Json::object {
+            parent->etcd_txn(json11::Json::object {
                { "success", json11::Json::array { get_next_id() } }
            });
            state = 2;
 resume_2:
            if (parent->waiting > 0)
                return;
-            extract_next_id(result["responses"][0]);
+            extract_next_id(parent->etcd_result["responses"][0]);
            attempt_create();
            state = 3;
 resume_3:
            if (parent->waiting > 0)
                return;
-            if (!result["succeeded"].bool_value() &&
-                result["responses"][0]["response_range"]["kvs"].array_items().size() > 0)
+            if (!parent->etcd_result["succeeded"].bool_value() &&
+                parent->etcd_result["responses"][0]["response_range"]["kvs"].array_items().size() > 0)
            {
                fprintf(stderr, "Image %s already exists\n", image_name.c_str());
                exit(1);
            }
-        } while (!result["succeeded"].bool_value());
+        } while (!parent->etcd_result["succeeded"].bool_value());
        if (parent->progress)
        {
            printf("Image %s created\n", image_name.c_str());
@@ -151,6 +167,11 @@ resume_3:
                exit(1);
            }
        }
+        if (new_parent != "")
+        {
+            fprintf(stderr, "--parent can't be used with snapshots\n");
+            exit(1);
+        }
        do
        {
            // In addition to next_id, get: size, old_id, old_pool_id, new_parent, cfg_mod_rev, idx_mod_rev
@@ -174,13 +195,13 @@ resume_3:
 resume_4:
            if (parent->waiting > 0)
                return;
-            if (!result["succeeded"].bool_value() &&
-                result["responses"][0]["response_range"]["kvs"].array_items().size() > 0)
+            if (!parent->etcd_result["succeeded"].bool_value() &&
+                parent->etcd_result["responses"][0]["response_range"]["kvs"].array_items().size() > 0)
            {
                fprintf(stderr, "Snapshot %s@%s already exists\n", image_name.c_str(), new_snap.c_str());
                exit(1);
            }
-        } while (!result["succeeded"].bool_value());
+        } while (!parent->etcd_result["succeeded"].bool_value());
        if (parent->progress)
        {
            printf("Snapshot %s@%s created\n", image_name.c_str(), new_snap.c_str());
@@ -224,7 +245,7 @@ resume_4:
            goto resume_2;
        else if (state == 3)
            goto resume_3;
-        etcd_txn(json11::Json::object { { "success", json11::Json::array {
+        parent->etcd_txn(json11::Json::object { { "success", json11::Json::array {
            get_next_id(),
            json11::Json::object {
                { "request_range", json11::Json::object {
@@ -238,11 +259,11 @@ resume_4:
 resume_2:
        if (parent->waiting > 0)
            return;
-        extract_next_id(result["responses"][0]);
+        extract_next_id(parent->etcd_result["responses"][0]);
        old_id = 0;
        old_pool_id = 0;
        cfg_mod_rev = idx_mod_rev = 0;
-        if (result["responses"][1]["response_range"]["kvs"].array_items().size() == 0)
+        if (parent->etcd_result["responses"][1]["response_range"]["kvs"].array_items().size() == 0)
        {
            for (auto & ic: parent->cli->st_cli.inode_config)
            {
@@ -261,7 +282,7 @@ resume_2:
        {
            // FIXME: Parse kvs in etcd_state_client automatically
            {
-                auto kv = parent->cli->st_cli.parse_etcd_kv(result["responses"][1]["response_range"]["kvs"][0]);
+                auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][1]["response_range"]["kvs"][0]);
                old_id = INODE_NO_POOL(kv.value["id"].uint64_value());
                old_pool_id = (pool_id_t)kv.value["pool_id"].uint64_value();
                idx_mod_rev = kv.mod_revision;
@@ -271,7 +292,7 @@ resume_2:
                    exit(1);
                }
            }
-            etcd_txn(json11::Json::object {
+            parent->etcd_txn(json11::Json::object {
                { "success", json11::Json::array {
                    json11::Json::object {
                        { "request_range", json11::Json::object {
@@ -288,7 +309,7 @@ resume_3:
            if (parent->waiting > 0)
                return;
            {
-                auto kv = parent->cli->st_cli.parse_etcd_kv(result["responses"][0]["response_range"]["kvs"][0]);
+                auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
                size = kv.value["size"].uint64_value();
                new_parent_id = kv.value["parent_id"].uint64_value();
                uint64_t parent_pool_id = kv.value["parent_pool_id"].uint64_value();
@@ -417,32 +438,20 @@ resume_3:
                } },
            });
        };
-        etcd_txn(json11::Json::object {
+        parent->etcd_txn(json11::Json::object {
            { "compare", checks },
            { "success", success },
            { "failure", failure },
        });
    }
-
-    void etcd_txn(json11::Json txn)
-    {
-        parent->waiting++;
-        parent->cli->st_cli.etcd_txn(txn, ETCD_SLOW_TIMEOUT, [this](std::string err, json11::Json res)
-        {
-            parent->waiting--;
-            if (err != "")
-            {
-                fprintf(stderr, "Error reading from etcd: %s\n", err.c_str());
-                exit(1);
-            }
-            this->result = res;
-            parent->ringloop->wakeup();
-        });
-    }
 };

 uint64_t parse_size(std::string size_str)
 {
+    if (!size_str.length())
+    {
+        return 0;
+    }
    uint64_t mul = 1;
    char type_char = tolower(size_str[size_str.length()-1]);
    if (type_char == 'k' || type_char == 'm' || type_char == 'g' || type_char == 't')
--- a/src/cli_df.cpp
+++ b/src/cli_df.cpp
@@ -0,0 +1,229 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+
+#include "cli.h"
+#include "cluster_client.h"
+#include "base64.h"
+
+// List pools with space statistics
+struct pool_lister_t
+{
+    cli_tool_t *parent;
+
+    int state = 0;
+    json11::Json space_info;
+    std::map<pool_id_t, json11::Json::object> pool_stats;
+
+    bool is_done()
+    {
+        return state == 100;
+    }
+
+    void get_stats()
+    {
+        if (state == 1)
+            goto resume_1;
+        // Space statistics - pool/stats/<pool>
+        parent->etcd_txn(json11::Json::object {
+            { "success", json11::Json::array {
+                json11::Json::object {
+                    { "request_range", json11::Json::object {
+                        { "key", base64_encode(
+                            parent->cli->st_cli.etcd_prefix+"/pool/stats/"
+                        ) },
+                        { "range_end", base64_encode(
+                            parent->cli->st_cli.etcd_prefix+"/pool/stats0"
+                        ) },
+                    } },
+                },
+                json11::Json::object {
+                    { "request_range", json11::Json::object {
+                        { "key", base64_encode(
+                            parent->cli->st_cli.etcd_prefix+"/osd/stats/"
+                        ) },
+                        { "range_end", base64_encode(
+                            parent->cli->st_cli.etcd_prefix+"/osd/stats0"
+                        ) },
+                    } },
+                },
+            } },
+        });
+        state = 1;
+resume_1:
+        if (parent->waiting > 0)
+            return;
+        space_info = parent->etcd_result;
+        std::map<pool_id_t, uint64_t> osd_free;
+        for (auto & kv_item: space_info["responses"][0]["response_range"]["kvs"].array_items())
+        {
+            auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
+            // pool ID
+            pool_id_t pool_id;
+            char null_byte = 0;
+            sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/pool/stats/%u%c", &pool_id, &null_byte);
+            if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
+            {
+                fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
+                continue;
+            }
+            // pool/stats/<N>
+            pool_stats[pool_id] = kv.value.object_items();
+        }
+        for (auto & kv_item: space_info["responses"][1]["response_range"]["kvs"].array_items())
+        {
+            auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
+            // osd ID
+            osd_num_t osd_num;
+            char null_byte = 0;
+            sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/osd/stats/%lu%c", &osd_num, &null_byte);
+            if (!osd_num || osd_num >= POOL_ID_MAX || null_byte != 0)
+            {
+                fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
+                continue;
+            }
+            // osd/stats/<N>::free
+            osd_free[osd_num] = kv.value["free"].uint64_value();
+        }
+        // Calculate max_avail for each pool
+        for (auto & pp: parent->cli->st_cli.pool_config)
+        {
+            auto & pool_cfg = pp.second;
+            uint64_t pool_avail = UINT64_MAX;
+            std::map<osd_num_t, uint64_t> pg_per_osd;
+            for (auto & pgp: pool_cfg.pg_config)
+            {
+                for (auto pg_osd: pgp.second.target_set)
+                {
+                    if (pg_osd != 0)
+                    {
+                        pg_per_osd[pg_osd]++;
+                    }
+                }
+            }
+            for (auto pg_per_pair: pg_per_osd)
+            {
+                uint64_t pg_free = osd_free[pg_per_pair.first] * pool_cfg.pg_count / pg_per_pair.second;
+                if (pool_avail > pg_free)
+                {
+                    pool_avail = pg_free;
+                }
+            }
+            if (pool_avail == UINT64_MAX)
+            {
+                pool_avail = 0;
+            }
+            if (pool_cfg.scheme != POOL_SCHEME_REPLICATED)
+            {
+                uint64_t pg_real_size = pool_stats[pool_cfg.id]["pg_real_size"].uint64_value();
+                pool_avail = pg_real_size > 0 ? pool_avail * (pool_cfg.pg_size - pool_cfg.parity_chunks) / pg_real_size : 0;
+            }
+            pool_stats[pool_cfg.id] = json11::Json::object {
+                { "name", pool_cfg.name },
+                { "pg_count", pool_cfg.pg_count },
+                { "scheme", pool_cfg.scheme == POOL_SCHEME_REPLICATED ? "replicated" : "jerasure" },
+                { "scheme_name", pool_cfg.scheme == POOL_SCHEME_REPLICATED
+                    ? std::to_string(pool_cfg.pg_size)+"/"+std::to_string(pool_cfg.pg_minsize)
+                    : "EC "+std::to_string(pool_cfg.pg_size-pool_cfg.parity_chunks)+"+"+std::to_string(pool_cfg.parity_chunks) },
+                { "used_raw", (uint64_t)(pool_stats[pool_cfg.id]["used_raw_tb"].number_value() * (1l<<40)) },
+                { "total_raw", (uint64_t)(pool_stats[pool_cfg.id]["total_raw_tb"].number_value() * (1l<<40)) },
+                { "max_available", pool_avail },
+                { "raw_to_usable", pool_stats[pool_cfg.id]["raw_to_usable"].number_value() },
+                { "space_efficiency", pool_stats[pool_cfg.id]["space_efficiency"].number_value() },
+                { "pg_real_size", pool_stats[pool_cfg.id]["pg_real_size"].uint64_value() },
+                { "failure_domain", pool_cfg.failure_domain },
+            };
+        }
+    }
+
+    json11::Json::array to_list()
+    {
+        json11::Json::array list;
+        for (auto & kv: pool_stats)
+        {
+            list.push_back(kv.second);
+        }
+        return list;
+    }
+
+    void loop()
+    {
+        get_stats();
+        if (parent->waiting > 0)
+            return;
+        if (parent->json_output)
+        {
+            // JSON output
+            printf("%s\n", json11::Json(to_list()).dump().c_str());
+            state = 100;
+            return;
+        }
+        // Table output: name, scheme_name, pg_count, total, used, max_avail, used%, efficiency
+        json11::Json::array cols;
+        cols.push_back(json11::Json::object{
+            { "key", "name" },
+            { "title", "NAME" },
+        });
+        cols.push_back(json11::Json::object{
+            { "key", "scheme_name" },
+            { "title", "SCHEME" },
+        });
+        cols.push_back(json11::Json::object{
+            { "key", "pg_count" },
+            { "title", "PGS" },
+        });
+        cols.push_back(json11::Json::object{
+            { "key", "total_fmt" },
+            { "title", "TOTAL" },
+        });
+        cols.push_back(json11::Json::object{
+            { "key", "used_fmt" },
+            { "title", "USED" },
+        });
+        cols.push_back(json11::Json::object{
+            { "key", "max_avail_fmt" },
+            { "title", "AVAILABLE" },
+        });
+        cols.push_back(json11::Json::object{
+            { "key", "used_pct" },
+            { "title", "USED%" },
+        });
+        cols.push_back(json11::Json::object{
+            { "key", "eff_fmt" },
+            { "title", "EFFICIENCY" },
+        });
+        json11::Json::array list;
+        for (auto & kv: pool_stats)
+        {
+            double raw_to = kv.second["raw_to_usable"].number_value();
+            if (raw_to < 0.000001 && raw_to > -0.000001)
+                raw_to = 1;
+            kv.second["total_fmt"] = format_size(kv.second["total_raw"].uint64_value() / raw_to);
+            kv.second["used_fmt"] = format_size(kv.second["used_raw"].uint64_value() / raw_to);
+            kv.second["max_avail_fmt"] = format_size(kv.second["max_available"].uint64_value());
+            kv.second["used_pct"] = format_q(kv.second["total_raw"].uint64_value()
+                ? (100 - 100*kv.second["max_available"].uint64_value() *
+                    kv.second["raw_to_usable"].number_value() / kv.second["total_raw"].uint64_value())
+                : 100)+"%";
+            kv.second["eff_fmt"] = format_q(kv.second["space_efficiency"].number_value()*100)+"%";
+        }
+        printf("%s", print_table(to_list(), cols, parent->color).c_str());
+        state = 100;
+    }
+};
+
+std::function<bool(void)> cli_tool_t::start_df(json11::Json cfg)
+{
+    json11::Json::array cmd = cfg["command"].array_items();
+    auto lister = new pool_lister_t();
+    lister->parent = this;
+    return [lister]()
+    {
+        lister->loop();
+        if (lister->is_done())
+        {
+            delete lister;
+            return true;
+        }
+        return false;
+    };
+}
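Review note on the `max_available` calculation in `pool_lister_t::get_stats()`: each OSD's free space is scaled by `pg_count / <PGs placed on that OSD>`, and the minimum over all OSDs is taken, so the OSD that would fill up first bounds the pool. The sketch below isolates that step with simplified, hypothetical types (a plain vector of target sets instead of `pool_config` / `pg_config`) and assumes the vector holds every PG of the pool. It leaves out the extra scaling that the commit applies to non-replicated pools.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical simplified type; the real code takes it from Vitastor headers
typedef uint64_t osd_num_t;

// pgs: for every PG, the OSD numbers in its target set (0 = missing OSD)
// osd_free: free bytes per OSD (an OSD without an entry counts as 0 free)
uint64_t estimate_pool_avail(
    const std::vector<std::vector<osd_num_t>> & pgs,
    const std::map<osd_num_t, uint64_t> & osd_free)
{
    // Count how many PGs are placed on each OSD
    std::map<osd_num_t, uint64_t> pg_per_osd;
    for (auto & target_set: pgs)
        for (auto osd: target_set)
            if (osd != 0)
                pg_per_osd[osd]++;
    // The most constrained OSD limits the whole pool
    uint64_t pool_avail = UINT64_MAX;
    for (auto & pp: pg_per_osd)
    {
        auto it = osd_free.find(pp.first);
        uint64_t free_bytes = it != osd_free.end() ? it->second : 0;
        uint64_t pg_free = free_bytes * pgs.size() / pp.second;
        if (pool_avail > pg_free)
            pool_avail = pg_free;
    }
    // No OSDs at all means nothing is available
    return pool_avail == UINT64_MAX ? 0 : pool_avail;
}
```

For example, with 4 PGs where OSD 3 holds 2 of them and has 100 bytes free, that OSD alone caps the estimate at 100 * 4 / 2 = 200 bytes, regardless of how much space the other OSDs have.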
--- a/src/cli_flatten.cpp
+++ b/src/cli_flatten.cpp
@@ -3,6 +3,7 @@

 #include "cli.h"
 #include "cluster_client.h"
+#include <sys/stat.h>

 // Flatten a layer: merge all parents into a layer and break the connection completely
 struct snap_flattener_t
--- a/src/cli_ls.cpp
+++ b/src/cli_ls.cpp
@@ -6,16 +6,6 @@
 #include "cluster_client.h"
 #include "base64.h"

-#define MIN(a, b) ((a) < (b) ? (b) : (a))
-
-std::string print_table(json11::Json items, json11::Json header, bool use_esc);
-
-std::string format_size(uint64_t size);
-
-std::string format_lat(uint64_t lat);
-
-std::string format_q(double depth);
-
 // List existing images
 //
 // Again, you can just look into etcd, but this console tool incapsulates it
@@ -94,8 +84,7 @@ struct image_lister_t
        // Space statistics
        // inode/stats/<pool>/<inode>::raw_used divided by pool/stats/<pool>::pg_real_size
        // multiplied by 1 or number of data drives
-        parent->waiting++;
-        parent->cli->st_cli.etcd_txn(json11::Json::object {
+        parent->etcd_txn(json11::Json::object {
            { "success", json11::Json::array {
                json11::Json::object {
                    { "request_range", json11::Json::object {
@@ -122,21 +111,12 @@ struct image_lister_t
                    } },
                },
            } },
-        }, ETCD_SLOW_TIMEOUT, [this](std::string err, json11::Json res)
-        {
-            parent->waiting--;
-            if (err != "")
-            {
-                fprintf(stderr, "Error reading from etcd: %s\n", err.c_str());
-                exit(1);
-            }
-            space_info = res;
-            parent->ringloop->wakeup();
        });
        state = 1;
 resume_1:
        if (parent->waiting > 0)
            return;
+        space_info = parent->etcd_result;
        std::map<pool_id_t, uint64_t> pool_pg_real_size;
        for (auto & kv_item: space_info["responses"][0]["response_range"]["kvs"].array_items())
        {
@@ -213,10 +193,21 @@ resume_1:
        json11::Json::array list;
        for (auto & kv: stats)
        {
-            if (!only_names.size() || only_names.find(kv.second["name"].string_value()) != only_names.end())
+            if (!only_names.size())
            {
                list.push_back(kv.second);
            }
+            else
+            {
+                for (auto glob: only_names)
+                {
+                    if (stupid_glob(kv.second["name"].string_value(), glob))
+                    {
+                        list.push_back(kv.second);
+                        break;
+                    }
+                }
+            }
        }
        if (sort_field == "name" || sort_field == "pool_name")
        {
@@ -355,6 +346,9 @@ resume_1:
                kv.second["read_bw"] = format_size(kv.second["read_bps"].uint64_value())+"/s";
                kv.second["write_bw"] = format_size(kv.second["write_bps"].uint64_value())+"/s";
                kv.second["delete_bw"] = format_size(kv.second["delete_bps"].uint64_value())+"/s";
+                kv.second["read_iops"] = format_q(kv.second["read_iops"].number_value());
+                kv.second["write_iops"] = format_q(kv.second["write_iops"].number_value());
+                kv.second["delete_iops"] = format_q(kv.second["delete_iops"].number_value());
                kv.second["read_lat_f"] = format_lat(kv.second["read_lat"].uint64_value());
                kv.second["write_lat_f"] = format_lat(kv.second["write_lat"].uint64_value());
                kv.second["delete_lat_f"] = format_lat(kv.second["delete_lat"].uint64_value());
@@ -493,6 +487,62 @@ std::string format_q(double depth)
    return std::string(buf);
 }

+struct glob_stack_t
+{
+    int glob_pos;
+    int str_pos;
+};
+
+// Yes I know I could do it by translating the pattern to std::regex O:-)
+bool stupid_glob(const std::string str, const std::string glob)
+{
+    std::vector<glob_stack_t> wildcards;
+    int pos = 0, gp = 0;
+    bool m;
+back:
+    while (true)
+    {
+        if (gp >= glob.length())
+        {
+            if (pos >= str.length())
+                return true;
+            m = false;
+        }
+        else if (glob[gp] == '*')
+        {
+            wildcards.push_back((glob_stack_t){ .glob_pos = ++gp, .str_pos = pos });
+            continue;
+        }
+        else if (glob[gp] == '?')
+            m = pos < str.size();
+        else
+        {
+            if (glob[gp] == '\\' && gp < glob.length()-1)
+                gp++;
+            m = pos < str.size() && str[pos] == glob[gp];
+        }
+        if (!m)
+        {
+            while (wildcards.size() > 0)
+            {
+                // Backtrack
+                pos = (++wildcards[wildcards.size()-1].str_pos);
+                if (pos > str.size())
+                    wildcards.pop_back();
+                else
+                {
+                    gp = wildcards[wildcards.size()-1].glob_pos;
+                    goto back;
+                }
+            }
+            return false;
+        }
+        pos++;
+        gp++;
+    }
+    return true;
+}
+
 std::function<bool(void)> cli_tool_t::start_ls(json11::Json cfg)
 {
    json11::Json::array cmd = cfg["command"].array_items();
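Note on the `stupid_glob` hunk above: the matcher keeps a stack of `*` positions and, on a mismatch, lets the most recent `*` consume one more character before retrying. The block below is a minimal standalone sketch of that idea in portable C++, not the code from the commit. The names `glob_match` and `wildcard_frame` are hypothetical, and it assumes the same semantics as the diff (`*` matches any run of characters, `?` matches exactly one, `\` escapes the next pattern character).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical names: this is an illustration, not the function from the commit.
struct wildcard_frame
{
    std::size_t glob_pos; // pattern position right after the '*'
    std::size_t str_pos;  // string position where matching resumes after the '*'
};

bool glob_match(const std::string & str, const std::string & glob)
{
    std::vector<wildcard_frame> stack;
    std::size_t pos = 0, gp = 0;
    while (true)
    {
        bool matched;
        if (gp >= glob.size())
        {
            // Pattern exhausted: success only if the string is exhausted too
            if (pos >= str.size())
                return true;
            matched = false;
        }
        else if (glob[gp] == '*')
        {
            // Remember the wildcard; at first it consumes zero characters
            stack.push_back({ gp + 1, pos });
            gp++;
            continue;
        }
        else if (glob[gp] == '?')
            matched = pos < str.size();
        else
        {
            // A backslash makes the next pattern character literal
            if (glob[gp] == '\\' && gp + 1 < glob.size())
                gp++;
            matched = pos < str.size() && str[pos] == glob[gp];
        }
        if (matched)
        {
            pos++;
            gp++;
            continue;
        }
        // Backtrack: let the most recent '*' consume one more character
        bool resumed = false;
        while (!stack.empty())
        {
            wildcard_frame & top = stack.back();
            top.str_pos++;
            if (top.str_pos > str.size())
                stack.pop_back();
            else
            {
                pos = top.str_pos;
                gp = top.glob_pos;
                resumed = true;
                break;
            }
        }
        if (!resumed)
            return false;
    }
}
```

With this sketch, `glob_match("test-image", "test*")` succeeds and `glob_match("abc", "a?")` fails, which is the behaviour the new `only_names` loop in `image_lister_t` appears to rely on.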
--- a/src/cli_merge.cpp
+++ b/src/cli_merge.cpp
@@ -412,7 +412,7 @@ struct snap_merger_t
        uint64_t bitmap_size = target_block_size / gran;
        while (rwo->end < bitmap_size)
        {
-            auto bit = ((*(uint8_t*)(rwo->op.bitmap_buf + (rwo->end >> 3))) & (1 << (rwo->end & 0x7)));
+            auto bit = ((*((uint8_t*)rwo->op.bitmap_buf + (rwo->end >> 3))) & (1 << (rwo->end & 0x7)));
            if (!bit)
            {
                if (rwo->end > rwo->start)
@@ -459,7 +459,7 @@ struct snap_merger_t
        subop->len = end-start;
        subop->version = version;
        subop->flags = OSD_OP_IGNORE_READONLY;
-        subop->iov.push_back(rwo->buf+start, end-start);
+        subop->iov.push_back((uint8_t*)rwo->buf+start, end-start);
        subop->callback = [this, rwo](cluster_op_t *subop)
        {
            rwo->todo--;
@@ -495,7 +495,7 @@ struct snap_merger_t
        subop->offset = offset;
        subop->len = 0;
        subop->flags = OSD_OP_IGNORE_READONLY;
-        subop->callback = [this](cluster_op_t *subop)
+        subop->callback = [](cluster_op_t *subop)
        {
            if (subop->retval != 0)
            {
@@ -519,10 +519,10 @@ struct snap_merger_t
                deleted_unsynced++;
                if (deleted_unsynced >= fsync_interval)
                {
-                    uint64_t from = last_fsync_offset, to = last_written_offset;
+                    uint64_t to = last_written_offset;
                    cluster_op_t *subop = new cluster_op_t;
                    subop->opcode = OSD_OP_SYNC;
-                    subop->callback = [this, from, to](cluster_op_t *subop)
+                    subop->callback = [this, to](cluster_op_t *subop)
                    {
                        delete subop;
                        // We can now delete source data between <from> and <to>
--- a/src/cli_modify.cpp
+++ b/src/cli_modify.cpp
@@ -63,6 +63,15 @@ struct image_changer_t
                break;
            }
        }
+        if ((!set_readwrite || !cfg.readonly) &&
+            (!set_readonly || cfg.readonly) &&
+            (!new_size || cfg.size == new_size) &&
+            (new_name == "" || new_name == image_name))
+        {
+            printf("No change\n");
+            state = 100;
+            return;
+        }
        if (new_size != 0)
        {
            if (cfg.size >= new_size)
@@ -161,29 +170,19 @@ resume_1:
                } }
            });
        }
-        parent->waiting++;
-        parent->cli->st_cli.etcd_txn(json11::Json::object {
+        parent->etcd_txn(json11::Json::object {
            { "compare", checks },
            { "success", success },
-        }, ETCD_SLOW_TIMEOUT, [this](std::string err, json11::Json res)
-        {
-            if (err != "")
-            {
-                fprintf(stderr, "Error changing %s: %s\n", image_name.c_str(), err.c_str());
-                exit(1);
-            }
-            if (!res["succeeded"].bool_value())
-            {
-                fprintf(stderr, "Image %s was modified by someone else, please repeat your request\n", image_name.c_str());
-                exit(1);
-            }
-            parent->waiting--;
-            parent->ringloop->wakeup();
        });
        state = 2;
 resume_2:
        if (parent->waiting > 0)
            return;
+        if (!parent->etcd_result["succeeded"].bool_value())
+        {
+            fprintf(stderr, "Image %s was modified by someone else, please repeat your request\n", image_name.c_str());
+            exit(1);
+        }
        printf("Image %s modified\n", image_name.c_str());
        state = 100;
    }
@@ -201,11 +200,7 @@ std::function<bool(void)> cli_tool_t::start_modify(json11::Json cfg)
        exit(1);
    }
    changer->new_name = cfg["rename"].string_value();
-    if (changer->new_name == changer->image_name)
-    {
-        changer->new_name = "";
-    }
-    changer->new_size = cfg["size"].uint64_value();
+    changer->new_size = parse_size(cfg["resize"].string_value());
    if (changer->new_size != 0 && (changer->new_size % 4096))
    {
        fprintf(stderr, "Image size should be a multiple of 4096\n");
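Note on the "No change" check added to `image_changer_t` above: the request is treated as a no-op when every requested attribute already has the requested value. The predicate can be sketched in isolation as follows. The structs `image_state` and `modify_request` and the function `is_noop` are hypothetical stand-ins for the fields used in the diff, introduced only for illustration.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-ins for the current image config and the requested change
struct image_state
{
    std::string name;
    uint64_t size;
    bool readonly;
};

struct modify_request
{
    std::string new_name; // empty means "do not rename"
    uint64_t new_size;    // 0 means "do not resize"
    bool set_readonly;
    bool set_readwrite;
};

// True when applying the request would leave the image unchanged
bool is_noop(const image_state & cur, const modify_request & req)
{
    return (!req.set_readwrite || !cur.readonly) &&
        (!req.set_readonly || cur.readonly) &&
        (!req.new_size || cur.size == req.new_size) &&
        (req.new_name == "" || req.new_name == cur.name);
}
```

Because the rename comparison now lives in this check, the earlier special case in `start_modify` that cleared `new_name` when it matched `image_name` is no longer needed, which is consistent with its removal in the last hunk.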
--- a/src/cli_simple_offsets.cpp
+++ b/src/cli_simple_offsets.cpp
@@ -8,6 +8,7 @@
 #include "cli.h"
 #include "cluster_client.h"
 #include "base64.h"
+#include <sys/stat.h>

 // Calculate offsets for a block device and print OSD command line parameters
 std::function<bool(void)> cli_tool_t::simple_offsets(json11::Json cfg)
--- a/src/cli_snap_rm.cpp
+++ b/src/cli_snap_rm.cpp
@@ -256,9 +256,9 @@ resume_9:
            });
        }
        parent->waiting++;
-        parent->cli->st_cli.etcd_txn(json11::Json::object {
+        parent->cli->st_cli.etcd_txn_slow(json11::Json::object {
            { "success", reads },
-        }, ETCD_SLOW_TIMEOUT, [this](std::string err, json11::Json data)
+        }, [this](std::string err, json11::Json data)
        {
            parent->waiting--;
            if (err != "")
@@ -414,10 +414,10 @@ resume_9:
            }
        }
        parent->waiting++;
-        parent->cli->st_cli.etcd_txn(json11::Json::object {
+        parent->cli->st_cli.etcd_txn_slow(json11::Json::object {
            { "compare", cmp },
            { "success", txn },
-        }, ETCD_SLOW_TIMEOUT, [this, target_name, child_name](std::string err, json11::Json res)
+        }, [this, target_name, child_name](std::string err, json11::Json res)
        {
            parent->waiting--;
            if (err != "")
@@ -454,7 +454,7 @@ resume_9:
            "/"+std::to_string(INODE_NO_POOL(cur))
        );
        parent->waiting++;
-        parent->cli->st_cli.etcd_txn(json11::Json::object {
+        parent->cli->st_cli.etcd_txn_slow(json11::Json::object {
            { "compare", json11::Json::array {
                json11::Json::object {
                    { "target", "MOD" },
@@ -475,7 +475,7 @@ resume_9:
                    } },
                },
            } },
-        }, ETCD_SLOW_TIMEOUT, [this, cur_name](std::string err, json11::Json res)
+        }, [this, cur_name](std::string err, json11::Json res)
        {
            parent->waiting--;
            if (err != "")
--- a/src/cluster_client.cpp
+++ b/src/cluster_client.cpp
@@ -534,8 +534,8 @@ void cluster_client_t::copy_write(cluster_op_t *op, std::map<object_id, cluster_
            unsigned iov_len = (op->iov.buf[iov_idx].iov_len - iov_pos);
            if (iov_len <= cur_len)
            {
-                memcpy(dirty_it->second.buf + pos - dirty_it->first.stripe,
-                    op->iov.buf[iov_idx].iov_base + iov_pos, iov_len);
+                memcpy((uint8_t*)dirty_it->second.buf + pos - dirty_it->first.stripe,
+                    (uint8_t*)op->iov.buf[iov_idx].iov_base + iov_pos, iov_len);
                pos += iov_len;
                len -= iov_len;
                cur_len -= iov_len;
@@ -544,8 +544,8 @@ void cluster_client_t::copy_write(cluster_op_t *op, std::map<object_id, cluster_
            }
            else
            {
-                memcpy(dirty_it->second.buf + pos - dirty_it->first.stripe,
-                    op->iov.buf[iov_idx].iov_base + iov_pos, cur_len);
+                memcpy((uint8_t*)dirty_it->second.buf + pos - dirty_it->first.stripe,
+                    (uint8_t*)op->iov.buf[iov_idx].iov_base + iov_pos, cur_len);
                pos += cur_len;
                len -= cur_len;
                iov_pos += cur_len;
@@ -762,7 +762,7 @@ static void add_iov(int size, bool skip, cluster_op_t *op, int &iov_idx, size_t
        {
            if (!skip)
            {
-                iov.push_back(op->iov.buf[iov_idx].iov_base + iov_pos, cur_left);
+                iov.push_back((uint8_t*)op->iov.buf[iov_idx].iov_base + iov_pos, cur_left);
            }
            left -= cur_left;
            iov_pos = 0;
@@ -772,7 +772,7 @@ static void add_iov(int size, bool skip, cluster_op_t *op, int &iov_idx, size_t
        {
            if (!skip)
            {
-                iov.push_back(op->iov.buf[iov_idx].iov_base + iov_pos, left);
+                iov.push_back((uint8_t*)op->iov.buf[iov_idx].iov_base + iov_pos, left);
            }
            iov_pos += left;
            left = 0;
@@ -817,7 +817,7 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
                // First allocation
                memset(op->bitmap_buf, 0, object_bitmap_size);
            }
-            op->part_bitmaps = op->bitmap_buf + object_bitmap_size;
+            op->part_bitmaps = (uint8_t*)op->bitmap_buf + object_bitmap_size;
            op->bitmap_buf_size = bitmap_mem;
        }
    }
@@ -839,7 +839,7 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
            while (cur < end)
            {
                unsigned bmp_loc = (cur - op->offset)/bs_bitmap_granularity;
-                bool skip = (((*(uint8_t*)(op->bitmap_buf + bmp_loc/8)) >> (bmp_loc%8)) & 0x1);
+                bool skip = (((*((uint8_t*)op->bitmap_buf + bmp_loc/8)) >> (bmp_loc%8)) & 0x1);
                if (skip_prev != skip)
                {
                    if (cur > prev)
@@ -944,7 +944,7 @@ bool cluster_client_t::try_send(cluster_op_t *op, int i)
                    .meta_revision = meta_rev,
                    .version = op->opcode == OSD_OP_WRITE || op->opcode == OSD_OP_DELETE ? op->version : 0,
                } },
-                .bitmap = (op->opcode == OSD_OP_READ || op->opcode == OSD_OP_READ_BITMAP ? op->part_bitmaps + pg_bitmap_size*i : NULL),
+                .bitmap = (op->opcode == OSD_OP_READ || op->opcode == OSD_OP_READ_BITMAP ? (uint8_t*)op->part_bitmaps + pg_bitmap_size*i : NULL),
                .bitmap_len = (unsigned)(op->opcode == OSD_OP_READ || op->opcode == OSD_OP_READ_BITMAP ? pg_bitmap_size : 0),
                .callback = [this, part](osd_op_t *op_part)
                {
@@ -1155,7 +1155,7 @@ void cluster_client_t::copy_part_bitmap(cluster_op_t *op, cluster_op_part_t *par
    if (!(object_offset & 0x7) && !(part_offset & 0x7) && (part_len >= 8))
    {
        // Copy bytes
-        mem_or(op->bitmap_buf + object_offset/8, part->op.bitmap + part_offset/8, part_len/8);
+        mem_or((uint8_t*)op->bitmap_buf + object_offset/8, (uint8_t*)part->op.bitmap + part_offset/8, part_len/8);
        object_offset += (part_len & ~0x7);
        part_offset += (part_len & ~0x7);
        part_len = (part_len & 0x7);
@@ -1163,8 +1163,8 @@ void cluster_client_t::copy_part_bitmap(cluster_op_t *op, cluster_op_part_t *par
    while (part_len > 0)
    {
        // Copy bits
-        (*(uint8_t*)(op->bitmap_buf + (object_offset >> 3))) |= (
-            (((*(uint8_t*)(part->op.bitmap + (part_offset >> 3))) >> (part_offset & 0x7)) & 0x1) << (object_offset & 0x7)
+        (*((uint8_t*)op->bitmap_buf + (object_offset >> 3))) |= (
+            (((*((uint8_t*)part->op.bitmap + (part_offset >> 3))) >> (part_offset & 0x7)) & 0x1) << (object_offset & 0x7)
        );
        part_offset++;
        object_offset++;
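Note on the `(uint8_t*)` casts in this file: arithmetic on `void*` is a GNU extension rather than standard C++, so an expression like `buf + n` on a `void *buf` draws a warning or is rejected by stricter compilers. Casting first and adding afterwards keeps the same byte offset while staying portable. The helpers below are a minimal sketch of the bitmap access pattern. `bitmap_test` and `bitmap_set` are hypothetical names, not functions from the Vitastor sources.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: read bit number `bit` from a bitmap stored behind a void*.
// The cast happens before the byte offset is applied, so no void* arithmetic occurs.
bool bitmap_test(const void *bitmap_buf, std::size_t bit)
{
    const uint8_t *bytes = (const uint8_t*)bitmap_buf;
    return (bytes[bit >> 3] >> (bit & 0x7)) & 0x1;
}

// Hypothetical helper: set bit number `bit` in the same layout
void bitmap_set(void *bitmap_buf, std::size_t bit)
{
    uint8_t *bytes = (uint8_t*)bitmap_buf;
    bytes[bit >> 3] |= (uint8_t)(1u << (bit & 0x7));
}
```

The layout assumed here (bit `n` lives in byte `n >> 3` at position `n & 0x7`) matches the expressions in `slice_rw` and `copy_part_bitmap` above.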
--- a/src/dump_journal.cpp
+++ b/src/dump_journal.cpp
@@ -75,7 +75,7 @@ int main(int argc, char *argv[])
            uint64_t s;
            for (s = 0; s < self.journal_block; s += 8)
            {
-                if (*((uint64_t*)(data+s)) != 0)
+                if (*((uint64_t*)((uint8_t*)data+s)) != 0)
                    break;
            }
            if (s == self.journal_block)
@@ -139,7 +139,7 @@ int journal_dump_t::dump_block(void *buf)
    bool wrapped = false;
    while (pos < journal_block)
    {
-        journal_entry *je = (journal_entry*)(buf + pos);
+        journal_entry *je = (journal_entry*)((uint8_t*)buf + pos);
        if (je->magic != JOURNAL_MAGIC || je->type < JE_MIN || je->type > JE_MAX ||
            !all && started && je->crc32_prev != crc32_last)
        {
--- a/src/epoll_manager.cpp
+++ b/src/epoll_manager.cpp
@@ -13,6 +13,7 @@
 epoll_manager_t::epoll_manager_t(ring_loop_t *ringloop)
 {
    this->ringloop = ringloop;
+    this->pending = false;

    epoll_fd = epoll_create(1);
    if (epoll_fd < 0)
@@ -22,11 +23,19 @@ epoll_manager_t::epoll_manager_t(ring_loop_t *ringloop)

    tfd = new timerfd_manager_t([this](int fd, bool wr, std::function<void(int, int)> handler) { set_fd_handler(fd, wr, handler); });

+    consumer.loop = [this]()
+    {
+        if (pending)
+            handle_epoll_events();
+    };
+    ringloop->register_consumer(&consumer);
+
    handle_epoll_events();
 }

 epoll_manager_t::~epoll_manager_t()
 {
+    ringloop->unregister_consumer(&consumer);
    if (tfd)
    {
        delete tfd;
@@ -64,8 +73,13 @@ void epoll_manager_t::handle_epoll_events()
    io_uring_sqe *sqe = ringloop->get_sqe();
    if (!sqe)
    {
-        throw std::runtime_error("can't get SQE, will fall out of sync with EPOLLET");
+        // Don't handle epoll events until we manage to post the next event handler
+        // otherwise we'll fall out of sync with EPOLLET
+        pending = true;
+        ringloop->wakeup();
+        return;
    }
+    pending = false;
    ring_data_t *data = ((ring_data_t*)sqe->user_data);
    my_uring_prep_poll_add(sqe, epoll_fd, POLLIN);
    data->callback = [this](ring_data_t *data)
--- a/src/epoll_manager.h
+++ b/src/epoll_manager.h
@@ -11,6 +11,8 @@
 class epoll_manager_t
 {
    int epoll_fd;
+    bool pending;
+    ring_consumer_t consumer;
    ring_loop_t *ringloop;
    std::map<int, std::function<void(int, int)>> epoll_handlers;
 public:
--- a/src/etcd_state_client.cpp
+++ b/src/etcd_state_client.cpp
@@ -5,6 +5,7 @@
 #include "pg_states.h"
 #include "etcd_state_client.h"
 #ifndef __MOCK__
+#include "addr_util.h"
 #include "http_client.h"
 #include "base64.h"
 #endif
@@ -25,9 +26,14 @@ etcd_state_client_t::~etcd_state_client_t()
 #ifndef __MOCK__
    if (etcd_watch_ws)
    {
-        etcd_watch_ws->close();
+        http_close(etcd_watch_ws);
        etcd_watch_ws = NULL;
    }
+    if (keepalive_client)
+    {
+        http_close(keepalive_client);
+        keepalive_client = NULL;
+    }
 #endif
 }

@@ -48,12 +54,18 @@ etcd_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json)
    return kv;
 }

-void etcd_state_client_t::etcd_txn(json11::Json txn, int timeout, std::function<void(std::string, json11::Json)> callback)
+void etcd_state_client_t::etcd_txn(json11::Json txn, int timeout, int retries, int interval, std::function<void(std::string, json11::Json)> callback)
 {
-    etcd_call("/kv/txn", txn, timeout, callback);
+    etcd_call("/kv/txn", txn, timeout, retries, interval, callback);
 }

-void etcd_state_client_t::etcd_call(std::string api, json11::Json payload, int timeout, std::function<void(std::string, json11::Json)> callback)
+void etcd_state_client_t::etcd_txn_slow(json11::Json txn, std::function<void(std::string, json11::Json)> callback)
+{
+    etcd_call("/kv/txn", txn, etcd_slow_timeout, max_etcd_attempts, 0, callback);
+}
+
+void etcd_state_client_t::etcd_call(std::string api, json11::Json payload, int timeout,
+    int retries, int interval, std::function<void(std::string, json11::Json)> callback)
 {
    if (!etcd_addresses.size() && !etcd_local.size())
    {
@@ -74,14 +86,49 @@ void etcd_state_client_t::etcd_call(std::string api, json11::Json payload, int t
        "Host: "+etcd_address+"\r\n"
        "Content-Type: application/json\r\n"
        "Content-Length: "+std::to_string(req.size())+"\r\n"
-        "Connection: close\r\n"
+        "Connection: keep-alive\r\n"
+        "Keep-Alive: timeout="+std::to_string(etcd_keepalive_timeout)+"\r\n"
        "\r\n"+req;
-    http_request_json(tfd, etcd_address, req, timeout, [this, cur_addr = selected_etcd_address, callback](std::string err, json11::Json data)
+    auto cb = [this, api, payload, timeout, retries, interval, callback,
+        cur_addr = selected_etcd_address](const http_response_t *response)
    {
-        if (err != "" && cur_addr == selected_etcd_address)
-            selected_etcd_address = "";
-        callback(err, data);
-    });
+        std::string err;
+        json11::Json data;
+        response->parse_json_response(err, data);
+        if (err != "")
+        {
+            if (cur_addr == selected_etcd_address)
+                selected_etcd_address = "";
+            if (retries > 0)
+            {
+                if (this->log_level > 0)
+                {
+                    printf(
+                        "Warning: etcd request failed: %s, retrying %d more times\n",
+                        err.c_str(), retries
+                    );
+                }
+                if (interval > 0)
+                {
+                    tfd->set_timer(interval, false, [this, api, payload, timeout, retries, interval, callback](int)
+                    {
+                        etcd_call(api, payload, timeout, retries-1, interval, callback);
+                    });
+                }
+                else
+                    etcd_call(api, payload, timeout, retries-1, interval, callback);
+            }
+            else
+                callback(err, data);
+        }
+        else
+            callback(err, data);
+    };
+    if (!keepalive_client)
+    {
+        keepalive_client = http_init(tfd);
+    }
+    http_request(keepalive_client, etcd_address, req, { .timeout = timeout, .keepalive = true }, cb);
 }

 void etcd_state_client_t::add_etcd_url(std::string addr)
@@ -155,6 +202,33 @@ void etcd_state_client_t::parse_config(const json11::Json & config)
        this->etcd_prefix = "/"+this->etcd_prefix;
    }
    this->log_level = config["log_level"].int64_value();
+    this->etcd_keepalive_timeout = config["etcd_keepalive_timeout"].uint64_value();
+    if (this->etcd_keepalive_timeout <= 0)
+    {
+        this->etcd_keepalive_timeout = config["etcd_report_interval"].uint64_value() * 2;
+        if (this->etcd_keepalive_timeout < 30)
+            this->etcd_keepalive_timeout = 30;
+    }
+    this->etcd_ws_keepalive_interval = config["etcd_ws_keepalive_interval"].uint64_value();
+    if (this->etcd_ws_keepalive_interval <= 0)
+    {
+        this->etcd_ws_keepalive_interval = 30;
+    }
+    this->max_etcd_attempts = config["max_etcd_attempts"].uint64_value();
+    if (this->max_etcd_attempts <= 0)
+    {
+        this->max_etcd_attempts = 5;
+    }
+    this->etcd_slow_timeout = config["etcd_slow_timeout"].uint64_value();
+    if (this->etcd_slow_timeout <= 0)
+    {
+        this->etcd_slow_timeout = 5000;
+    }
+    this->etcd_quick_timeout = config["etcd_quick_timeout"].uint64_value();
+    if (this->etcd_quick_timeout <= 0)
+    {
+        this->etcd_quick_timeout = 1000;
+    }
 }

 void etcd_state_client_t::pick_next_etcd()
@@ -169,9 +243,16 @@ void etcd_state_client_t::pick_next_etcd()
        std::vector<int> ns;
        for (int i = 0; i < etcd_addresses.size(); i++)
            ns.push_back(i);
+        if (!rand_initialized)
+        {
+            timespec tv;
+            clock_gettime(CLOCK_REALTIME, &tv);
+            srand48(tv.tv_sec*1000000000 + tv.tv_nsec);
+            rand_initialized = true;
+        }
        while (ns.size())
        {
-            int i = rand() % ns.size();
+            int i = lrand48() % ns.size();
            addresses_to_try.push_back(etcd_addresses[ns[i]]);
            ns.erase(ns.begin()+i, ns.begin()+i+1);
        }
@@ -200,10 +281,12 @@ void etcd_state_client_t::start_etcd_watcher()
    ws_alive = 1;
    if (etcd_watch_ws)
    {
-        etcd_watch_ws->close();
+        http_close(etcd_watch_ws);
        etcd_watch_ws = NULL;
    }
-    etcd_watch_ws = open_websocket(tfd, etcd_address, etcd_api_path+"/watch", ETCD_SLOW_TIMEOUT,
+    if (this->log_level > 1)
+        printf("Trying to connect to etcd websocket at %s\n", etcd_address.c_str());
+    etcd_watch_ws = open_websocket(tfd, etcd_address, etcd_api_path+"/watch", etcd_slow_timeout,
        [this, cur_addr = selected_etcd_address](const http_response_t *msg)
    {
        if (msg->body.length())
@@ -219,6 +302,8 @@ void etcd_state_client_t::start_etcd_watcher()
            {
                if (data["result"]["created"].bool_value())
                {
+                    if (etcd_watches_initialised == 3 && this->log_level > 0)
+                        fprintf(stderr, "Successfully subscribed to etcd at %s\n", selected_etcd_address.c_str());
                    etcd_watches_initialised++;
                }
                if (data["result"]["canceled"].bool_value())
@@ -232,8 +317,11 @@ void etcd_state_client_t::start_etcd_watcher()
                        {
                            fprintf(stderr, "Revisions before %lu were compacted by etcd, reloading state\n",
                                data["result"]["compact_revision"].uint64_value());
-                            etcd_watch_ws->close();
-                            etcd_watch_ws = NULL;
+                            if (etcd_watch_ws)
+                            {
+                                http_close(etcd_watch_ws);
+                                etcd_watch_ws = NULL;
+                            }
                            etcd_watch_revision = 0;
                            on_reload_hook();
                        }
@@ -284,13 +372,20 @@ void etcd_state_client_t::start_etcd_watcher()
        {
            if (cur_addr == selected_etcd_address)
            {
+                fprintf(stderr, "Disconnected from etcd %s\n", selected_etcd_address.c_str());
                selected_etcd_address = "";
            }
-            etcd_watch_ws = NULL;
+            else
+                fprintf(stderr, "Disconnected from etcd\n");
+            if (etcd_watch_ws)
+            {
+                http_close(etcd_watch_ws);
+                etcd_watch_ws = NULL;
+            }
            if (etcd_watches_initialised == 0)
            {
-                // Connection not established, retry in <ETCD_QUICK_TIMEOUT>
-                tfd->set_timer(ETCD_QUICK_TIMEOUT, false, [this](int)
+                // Connection not established, retry in <etcd_quick_timeout>
+                tfd->set_timer(etcd_quick_timeout, false, [this](int)
                {
                    start_etcd_watcher();
                });
@@ -302,7 +397,7 @@ void etcd_state_client_t::start_etcd_watcher()
            }
        }
    });
-    etcd_watch_ws->post_message(WS_TEXT, json11::Json(json11::Json::object {
+    http_post_message(etcd_watch_ws, WS_TEXT, json11::Json(json11::Json::object {
        { "create_request", json11::Json::object {
            { "key", base64_encode(etcd_prefix+"/config/") },
            { "range_end", base64_encode(etcd_prefix+"/config0") },
@@ -311,7 +406,7 @@ void etcd_state_client_t::start_etcd_watcher()
            { "progress_notify", true },
        } }
    }).dump());
-    etcd_watch_ws->post_message(WS_TEXT, json11::Json(json11::Json::object {
+    http_post_message(etcd_watch_ws, WS_TEXT, json11::Json(json11::Json::object {
        { "create_request", json11::Json::object {
            { "key", base64_encode(etcd_prefix+"/osd/state/") },
            { "range_end", base64_encode(etcd_prefix+"/osd/state0") },
@@ -320,7 +415,7 @@ void etcd_state_client_t::start_etcd_watcher()
            { "progress_notify", true },
        } }
    }).dump());
-    etcd_watch_ws->post_message(WS_TEXT, json11::Json(json11::Json::object {
+    http_post_message(etcd_watch_ws, WS_TEXT, json11::Json(json11::Json::object {
        { "create_request", json11::Json::object {
            { "key", base64_encode(etcd_prefix+"/pg/state/") },
            { "range_end", base64_encode(etcd_prefix+"/pg/state0") },
@@ -329,7 +424,7 @@ void etcd_state_client_t::start_etcd_watcher()
            { "progress_notify", true },
        } }
    }).dump());
-    etcd_watch_ws->post_message(WS_TEXT, json11::Json(json11::Json::object {
+    http_post_message(etcd_watch_ws, WS_TEXT, json11::Json(json11::Json::object {
        { "create_request", json11::Json::object {
            { "key", base64_encode(etcd_prefix+"/pg/history/") },
            { "range_end", base64_encode(etcd_prefix+"/pg/history0") },
@@ -340,7 +435,7 @@ void etcd_state_client_t::start_etcd_watcher()
    }).dump());
    if (ws_keepalive_timer < 0)
    {
-        ws_keepalive_timer = tfd->set_timer(ETCD_KEEPALIVE_TIMEOUT, true, [this](int)
+        ws_keepalive_timer = tfd->set_timer(etcd_ws_keepalive_interval*1000, true, [this](int)
        {
            if (!etcd_watch_ws)
            {
@@ -348,14 +443,21 @@ void etcd_state_client_t::start_etcd_watcher()
            }
            else if (!ws_alive)
            {
-                etcd_watch_ws->close();
-                etcd_watch_ws = NULL;
+                if (this->log_level > 0)
+                {
+                    fprintf(stderr, "Websocket ping failed, disconnecting from etcd %s\n", selected_etcd_address.c_str());
+                }
+                if (etcd_watch_ws)
+                {
+                    http_close(etcd_watch_ws);
+                    etcd_watch_ws = NULL;
+                }
                start_etcd_watcher();
            }
            else
            {
                ws_alive = 0;
-                etcd_watch_ws->post_message(WS_TEXT, json11::Json(json11::Json::object {
+                http_post_message(etcd_watch_ws, WS_TEXT, json11::Json(json11::Json::object {
                    { "progress_request", json11::Json::object { } }
                }).dump());
            }
@@ -367,12 +469,12 @@ void etcd_state_client_t::load_global_config()
 {
    etcd_call("/kv/range", json11::Json::object {
        { "key", base64_encode(etcd_prefix+"/config/global") }
-    }, ETCD_SLOW_TIMEOUT, [this](std::string err, json11::Json data)
+    }, etcd_slow_timeout, max_etcd_attempts, 0, [this](std::string err, json11::Json data)
    {
        if (err != "")
        {
            fprintf(stderr, "Error reading OSD configuration from etcd: %s\n", err.c_str());
-            tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int timer_id)
+            tfd->set_timer(etcd_slow_timeout, false, [this](int timer_id)
            {
                load_global_config();
            });
@@ -440,12 +542,13 @@ void etcd_state_client_t::load_pgs()
    {
        req["compare"] = checks;
    }
-    etcd_txn(req, ETCD_SLOW_TIMEOUT, [this](std::string err, json11::Json data)
+    etcd_txn_slow(req, [this](std::string err, json11::Json data)
    {
        if (err != "")
        {
+            // Retry indefinitely
            fprintf(stderr, "Error loading PGs from etcd: %s\n", err.c_str());
-            tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int timer_id)
+            tfd->set_timer(etcd_slow_timeout, false, [this](int timer_id)
            {
                load_pgs();
            });
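Note on the retry counter in `etcd_call` above: as the hunk reads, a failed request is reissued with `retries-1` until the counter reaches zero, so a call started with `retries = N` can issue up to N+1 requests before the error reaches the callback. The block below is a synchronous sketch of that counting only. `call_with_retries` is a hypothetical helper, and the real code is asynchronous and may wait `interval` milliseconds between attempts.

```cpp
#include <functional>
#include <string>

// Hypothetical synchronous analogue of the retry counter in etcd_call.
// `request` returns "" on success or an error message on failure.
// Returns the last error ("" on success); `calls` receives the number of requests made.
std::string call_with_retries(const std::function<std::string()> & request, int retries, int & calls)
{
    calls = 0;
    while (true)
    {
        std::string err = request();
        calls++;
        // Stop on success, or when no retries are left
        if (err == "" || retries <= 0)
            return err;
        retries--;
    }
}
```

Under this reading, `etcd_txn_slow`, which passes `max_etcd_attempts` (default 5) as the retry count, would give up after six failed requests, and callers such as `load_pgs` then schedule their own retry on top of that.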
--- a/src/etcd_state_client.h
+++ b/src/etcd_state_client.h
@@ -12,11 +12,6 @@
 #define ETCD_PG_HISTORY_WATCH_ID 3
 #define ETCD_OSD_STATE_WATCH_ID 4

-#define MAX_ETCD_ATTEMPTS 5
-#define ETCD_SLOW_TIMEOUT 5000
-#define ETCD_QUICK_TIMEOUT 1000
-#define ETCD_KEEPALIVE_TIMEOUT 30000
-
 #define DEFAULT_BLOCK_SIZE 128*1024

 struct etcd_kv_t
@@ -71,7 +66,7 @@ struct inode_watch_t
    inode_config_t cfg;
 };

-struct websocket_t;
+struct http_co_t;

 struct etcd_state_client_t
 {
@@ -82,13 +77,20 @@ protected:
    std::string selected_etcd_address;
    std::vector<std::string> addresses_to_try;
    std::vector<inode_watch_t*> watches;
-    websocket_t *etcd_watch_ws = NULL;
+    http_co_t *etcd_watch_ws = NULL, *keepalive_client = NULL;
    int ws_keepalive_timer = -1;
    int ws_alive = 0;
+    bool rand_initialized = false;
    uint64_t bs_block_size = DEFAULT_BLOCK_SIZE;
    void add_etcd_url(std::string);
    void pick_next_etcd();
 public:
+    int etcd_keepalive_timeout = 30;
+    int etcd_ws_keepalive_interval = 30;
+    int max_etcd_attempts = 5;
+    int etcd_quick_timeout = 1000;
+    int etcd_slow_timeout = 5000;
+
    std::string etcd_prefix;
    int log_level = 0;
    timerfd_manager_t *tfd = NULL;
@@ -110,8 +112,9 @@ public:

    json11::Json::object serialize_inode_cfg(inode_config_t *cfg);
    etcd_kv_t parse_etcd_kv(const json11::Json & kv_json);
-    void etcd_call(std::string api, json11::Json payload, int timeout, std::function<void(std::string, json11::Json)> callback);
-    void etcd_txn(json11::Json txn, int timeout, std::function<void(std::string, json11::Json)> callback);
+    void etcd_call(std::string api, json11::Json payload, int timeout, int retries, int interval, std::function<void(std::string, json11::Json)> callback);
+    void etcd_txn(json11::Json txn, int timeout, int retries, int interval, std::function<void(std::string, json11::Json)> callback);
+    void etcd_txn_slow(json11::Json txn, std::function<void(std::string, json11::Json)> callback);
    void start_etcd_watcher();
    void load_global_config();
    void load_pgs();
--- a/src/fio_cluster.cpp
+++ b/src/fio_cluster.cpp
@@ -247,6 +247,12 @@ static int sec_setup(struct thread_data *td)
            vitastor_c_uring_wait_events(bsd->cli);
        }
        td->files[0]->real_file_size = vitastor_c_inode_get_size(bsd->watch);
+        if (!vitastor_c_inode_get_num(bsd->watch) ||
+            !td->files[0]->real_file_size)
+        {
+            td_verror(td, EINVAL, "image does not exist");
+            return 1;
+        }
    }

    bsd->trace = o->trace ? true : false;
@@ -345,7 +351,7 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
        }
        else
        {
-            printf("+++ %s 0x%lx 0x%llx+%llx\n",
+            printf("+++ %s 0x%lx 0x%llx+%lx\n",
                io->ddir == DDIR_READ ? "READ" : "WRITE",
                (uint64_t)io, io->offset, io->xfer_buflen);
        }
--- a/src/fio_engine.cpp
+++ b/src/fio_engine.cpp
@@ -26,9 +26,8 @@

 #include "blockstore.h"
 #include "epoll_manager.h"
-#include "fio_headers.h"
-
 #include "json11/json11.hpp"
+#include "fio_headers.h"

 struct bs_data
 {
@@ -150,7 +149,6 @@ static int bs_init(struct thread_data *td)
 static enum fio_q_status bs_queue(struct thread_data *td, struct io_u *io)
 {
    bs_data *bsd = (bs_data*)td->io_ops_data;
-    int n = bsd->op_n;
    if (io->ddir == DDIR_SYNC && bsd->last_sync)
    {
        return FIO_Q_COMPLETED;
@@ -178,7 +176,7 @@ static enum fio_q_status bs_queue(struct thread_data *td, struct io_u *io)
        op->version = UINT64_MAX; // last unstable
        op->offset = io->offset % bsd->bs->get_block_size();
        op->len = io->xfer_buflen;
-        op->callback = [io, n](blockstore_op_t *op)
+        op->callback = [io](blockstore_op_t *op)
        {
            io->error = op->retval < 0 ? -op->retval : 0;
            bs_data *bsd = (bs_data*)io->engine_data;
@@ -200,7 +198,7 @@ static enum fio_q_status bs_queue(struct thread_data *td, struct io_u *io)
        op->version = 0; // assign automatically
        op->offset = io->offset % bsd->bs->get_block_size();
        op->len = io->xfer_buflen;
-        op->callback = [io, n](blockstore_op_t *op)
+        op->callback = [io](blockstore_op_t *op)
        {
            io->error = op->retval < 0 ? -op->retval : 0;
            bs_data *bsd = (bs_data*)io->engine_data;
@@ -215,7 +213,7 @@ static enum fio_q_status bs_queue(struct thread_data *td, struct io_u *io)
        break;
    case DDIR_SYNC:
        op->opcode = BS_OP_SYNC_STAB_ALL;
-        op->callback = [io, n](blockstore_op_t *op)
+        op->callback = [io](blockstore_op_t *op)
        {
            bs_data *bsd = (bs_data*)io->engine_data;
            io->error = op->retval < 0 ? -op->retval : 0;
@@ -230,6 +228,7 @@ static enum fio_q_status bs_queue(struct thread_data *td, struct io_u *io)
        break;
    default:
        io->error = EINVAL;
+        delete op;
        return FIO_Q_COMPLETED;
    }

--- a/src/fio_headers.h
+++ b/src/fio_headers.h
@@ -1,4 +1,3 @@
-extern "C" {
 // Kill atomics in fio headers
 #define _STDATOMIC_H
 #include "fio/arch/arch.h"
@@ -11,6 +10,7 @@ extern "C" {
 #define CONFIG_HAVE_GETTID
 #define CONFIG_SYNC_FILE_RANGE
 #define CONFIG_PWRITEV2
+extern "C" {
 #include "fio/fio.h"
 #include "fio/optgroup.h"
 }
--- a/src/fio_sec_osd.cpp
+++ b/src/fio_sec_osd.cpp
@@ -28,16 +28,23 @@
 #include <vector>
 #include <unordered_map>

+#include "addr_util.h"
 #include "rw_blocking.h"
 #include "osd_ops.h"
 #include "fio_headers.h"

+struct op_buf_t
+{
+    osd_any_op_t buf;
+    io_u* fio_op;
+};
+
 struct sec_data
 {
    int connect_fd;
    /* block_size = 1 << block_order (128KB by default) */
    uint64_t block_order = 17, block_size = 1 << 17;
-    std::unordered_map<uint64_t, io_u*> queue;
+    std::unordered_map<uint64_t, op_buf_t*> queue;
    bool last_sync = false;
    /* The list of completed io_u structs. */
    std::vector<io_u*> completed;
@@ -52,6 +59,7 @@ struct sec_options
    int single_primary = 0;
    int trace = 0;
    int block_order = 17;
+    int zerocopy_send = 0;
 };

 static struct fio_option options[] = {
@@ -102,6 +110,16 @@ static struct fio_option options[] = {
        .category = FIO_OPT_C_ENGINE,
        .group  = FIO_OPT_G_FILENAME,
    },
+    {
+        .name   = "zerocopy_send",
+        .lname  = "Use zero-copy send",
+        .type   = FIO_OPT_BOOL,
+        .off1   = offsetof(struct sec_options, zerocopy_send),
+        .help   = "Use zero-copy send (MSG_ZEROCOPY)",
+        .def    = "0",
+        .category = FIO_OPT_C_ENGINE,
+        .group  = FIO_OPT_G_FILENAME,
+    },
    {
        .name = NULL,
    },
@@ -152,17 +170,14 @@ static int sec_init(struct thread_data *td)
    bsd->block_order = o->block_order == 0 ? 17 : o->block_order;
    bsd->block_size = 1 << o->block_order;

-    struct sockaddr_in addr;
-    int r;
-    if ((r = inet_pton(AF_INET, o->host ? o->host : "127.0.0.1", &addr.sin_addr)) != 1)
+    sockaddr addr;
+    if (!string_to_addr(std::string(o->host ? o->host : "127.0.0.1"), false, o->port > 0 ? o->port : 11203, &addr))
    {
-        fprintf(stderr, "server address: %s%s\n", o->host ? o->host : "127.0.0.1", r == 0 ? " is not valid" : ": no ipv4 support");
+        fprintf(stderr, "server address: %s is not valid\n", o->host ? o->host : "127.0.0.1");
        return 1;
    }
-    addr.sin_family = AF_INET;
-    addr.sin_port = htons(o->port ? o->port : 11203);

-    bsd->connect_fd = socket(AF_INET, SOCK_STREAM, 0);
+    bsd->connect_fd = socket(addr.sa_family, SOCK_STREAM, 0);
    if (bsd->connect_fd < 0)
    {
        perror("socket");
@@ -175,6 +190,19 @@ static int sec_init(struct thread_data *td)
    }
    int one = 1;
    setsockopt(bsd->connect_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
+    if (o->zerocopy_send)
+    {
+#ifndef SO_ZEROCOPY
+        fprintf(stderr, "zerocopy send not supported on your system (socket.h misses SO_ZEROCOPY)\n");
+        return 1;
+#else
+        if (setsockopt(bsd->connect_fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
+        {
+            perror("setsockopt zerocopy");
+            return 1;
+        }
+#endif
+    }

    // FIXME: read config (block size) from OSD

@@ -195,7 +223,9 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
    }

    io->engine_data = bsd;
-    osd_any_op_t op = { 0 };
+    op_buf_t *op_buf = new op_buf_t;
+    op_buf->fio_op = io;
+    osd_any_op_t &op = op_buf->buf;

    op.hdr.magic = SECONDARY_OSD_OP_MAGIC;
    op.hdr.id = n;
@@ -259,6 +289,7 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
        break;
    default:
        io->error = EINVAL;
+        delete op_buf;
        return FIO_Q_COMPLETED;
    }

@@ -271,19 +302,24 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
    io->error = 0;
    bsd->inflight++;
    bsd->op_n++;
-    bsd->queue[n] = io;
+    bsd->queue[n] = op_buf;

    iovec iov[2] = { { .iov_base = op.buf, .iov_len = OSD_PACKET_SIZE } };
    int iovcnt = 1, wtotal = OSD_PACKET_SIZE;
    if (io->ddir == DDIR_WRITE)
    {
-        iov[1] = { .iov_base = io->xfer_buf, .iov_len = io->xfer_buflen };
+        iov[iovcnt++] = { .iov_base = io->xfer_buf, .iov_len = io->xfer_buflen };
        wtotal += io->xfer_buflen;
-        iovcnt++;
    }
-    if (writev_blocking(bsd->connect_fd, iov, iovcnt) != wtotal)
+    if (sendv_blocking(bsd->connect_fd, iov, iovcnt,
+#ifdef SO_ZEROCOPY
+        opt->zerocopy_send ? MSG_ZEROCOPY : 0
+#else
+        0
+#endif
+    ) != wtotal)
    {
-        perror("writev");
+        perror("sendmsg");
        exit(1);
    }

@@ -312,22 +348,39 @@ static int sec_getevents(struct thread_data *td, unsigned int min, unsigned int
            fprintf(stderr, "bad reply: op id %lx missing in local queue\n", reply.hdr.id);
            exit(1);
        }
-        io_u* io = it->second;
+        io_u* io = it->second->fio_op;
+        delete it->second;
        bsd->queue.erase(it);
        if (io->ddir == DDIR_READ)
        {
            if (reply.hdr.retval != io->xfer_buflen)
            {
-                fprintf(stderr, "Short read: retval = %ld instead of %llu\n", reply.hdr.retval, io->xfer_buflen);
+                fprintf(stderr, "Short read: retval = %ld instead of %lu\n", reply.hdr.retval, io->xfer_buflen);
                exit(1);
            }
-            read_blocking(bsd->connect_fd, io->xfer_buf, io->xfer_buflen);
+            // Support bitmap
+            uint64_t bitmap = 0;
+            int iovcnt = 0;
+            iovec iov[2];
+            if (reply.sec_rw.attr_len > 0)
+            {
+                if (reply.sec_rw.attr_len <= 8)
+                    iov[iovcnt++] = { .iov_base = &bitmap, .iov_len = reply.sec_rw.attr_len };
+                else
+                    iov[iovcnt++] = { .iov_base = (void*)(bitmap = (uint64_t)malloc(reply.sec_rw.attr_len)), .iov_len = reply.sec_rw.attr_len };
+            }
+            iov[iovcnt++] = { .iov_base = io->xfer_buf, .iov_len = io->xfer_buflen };
+            readv_blocking(bsd->connect_fd, iov, iovcnt);
+            if (reply.sec_rw.attr_len > 8)
+            {
+                free((void*)bitmap);
+            }
        }
        else if (io->ddir == DDIR_WRITE)
        {
            if (reply.hdr.retval != io->xfer_buflen)
            {
-                fprintf(stderr, "Short write: retval = %ld instead of %llu\n", reply.hdr.retval, io->xfer_buflen);
+                fprintf(stderr, "Short write: retval = %ld instead of %lu\n", reply.hdr.retval, io->xfer_buflen);
                exit(1);
            }
        }
--- a/src/http_client.cpp
+++ b/src/http_client.cpp
@@ -4,9 +4,7 @@
 #include <netinet/tcp.h>
 #include <sys/epoll.h>

-#include <net/if.h>
 #include <arpa/inet.h>
-#include <ifaddrs.h>

 #include <ctype.h>
 #include <unistd.h>
@@ -15,21 +13,22 @@

 #include <stdexcept>

+#include "addr_util.h"
 #include "json11/json11.hpp"
 #include "http_client.h"
 #include "timerfd_manager.h"

 #define READ_BUFFER_SIZE 9000

-static int extract_port(std::string & host);
 static std::string trim(const std::string & in);
 static std::string ws_format_frame(int type, uint64_t size);
 static bool ws_parse_frame(std::string & buf, int & type, std::string & res);
+static void parse_http_headers(std::string & res, http_response_t *parsed);

-// FIXME: Use keepalive
 struct http_co_t
 {
    timerfd_manager_t *tfd;
+    std::function<void(const http_response_t*)> response_callback;

    int request_timeout = 0;
    std::string host;
@@ -37,11 +36,12 @@ struct http_co_t
    std::string ws_outbox;
    std::string response;
    bool want_streaming;
+    bool keepalive;

-    http_response_t parsed;
-    uint64_t target_response_size = 0;
+    std::vector<std::function<void()>> keepalive_queue;

    int state = 0;
+    std::string connected_host;
    int peer_fd = -1;
    int timeout_id = -1;
    int epoll_events = 0;
@@ -49,10 +49,8 @@ struct http_co_t
    std::vector<char> rbuf;
    iovec read_iov, send_iov;
    msghdr read_msg = { 0 }, send_msg = { 0 };
-
-    std::function<void(const http_response_t*)> callback;
-
-    websocket_t ws;
+    http_response_t parsed;
+    uint64_t target_response_size = 0;

    int onstack = 0;
    bool ended = false;
@@ -61,66 +59,40 @@ struct http_co_t
    inline void stackin() { onstack++; }
    inline void stackout() { onstack--; if (!onstack && ended) end(); }
    inline void end() { ended = true; if (!onstack) { delete this; } }
+    void run_cb_and_clear();
    void start_connection();
+    void close_connection();
    void handle_events();
    void handle_connect_result();
    void submit_read();
    void submit_send();
    bool handle_read();
    void post_message(int type, const std::string & msg);
+    void send_request(const std::string & host, const std::string & request,
+        const http_options_t & options, std::function<void(const http_response_t *response)> response_callback);
 };

+#define HTTP_CO_CLOSED 0
 #define HTTP_CO_CONNECTING 1
 #define HTTP_CO_SENDING_REQUEST 2
 #define HTTP_CO_REQUEST_SENT 3
 #define HTTP_CO_HEADERS_RECEIVED 4
 #define HTTP_CO_WEBSOCKET 5
 #define HTTP_CO_CHUNKED 6
+#define HTTP_CO_KEEPALIVE 7

 #define DEFAULT_TIMEOUT 5000

-void http_request(timerfd_manager_t *tfd, const std::string & host, const std::string & request,
-    const http_options_t & options, std::function<void(const http_response_t *response)> callback)
+http_co_t *http_init(timerfd_manager_t *tfd)
 {
    http_co_t *handler = new http_co_t();
-    handler->request_timeout = options.timeout < 0 ? 0 : (options.timeout == 0 ? DEFAULT_TIMEOUT : options.timeout);
-    handler->want_streaming = options.want_streaming;
    handler->tfd = tfd;
-    handler->host = host;
-    handler->request = request;
-    handler->callback = callback;
-    handler->ws.co = handler;
-    handler->start_connection();
+    handler->state = HTTP_CO_CLOSED;
+    return handler;
 }

-void http_request_json(timerfd_manager_t *tfd, const std::string & host, const std::string & request,
-    int timeout, std::function<void(std::string, json11::Json r)> callback)
-{
-    http_request(tfd, host, request, { .timeout = timeout }, [callback](const http_response_t* res)
-    {
-        if (res->error_code != 0)
-        {
-            callback("Error code: "+std::to_string(res->error_code)+" ("+std::string(strerror(res->error_code))+")", json11::Json());
-            return;
-        }
-        if (res->status_code != 200)
-        {
-            callback("HTTP "+std::to_string(res->status_code)+" "+res->status_line+" body: "+trim(res->body), json11::Json());
-            return;
-        }
-        std::string json_err;
-        json11::Json data = json11::Json::parse(res->body, json_err);
-        if (json_err != "")
-        {
-            callback("Bad JSON: "+json_err+" (response: "+trim(res->body)+")", json11::Json());
-            return;
-        }
-        callback(std::string(), data);
-    });
-}
-
-websocket_t* open_websocket(timerfd_manager_t *tfd, const std::string & host, const std::string & path,
-    int timeout, std::function<void(const http_response_t *msg)> callback)
+http_co_t* open_websocket(timerfd_manager_t *tfd, const std::string & host, const std::string & path,
+    int timeout, std::function<void(const http_response_t *msg)> response_callback)
 {
    std::string request = "GET "+path+" HTTP/1.1\r\n"
        "Host: "+host+"\r\n"
@@ -130,28 +102,154 @@ websocket_t* open_websocket(timerfd_manager_t *tfd, const std::string & host, co
        "Sec-WebSocket-Version: 13\r\n"
        "\r\n";
    http_co_t *handler = new http_co_t();
+    handler->tfd = tfd;
+    handler->state = HTTP_CO_CLOSED;
+    handler->host = host;
    handler->request_timeout = timeout < 0 ? -1 : (timeout == 0 ? DEFAULT_TIMEOUT : timeout);
    handler->want_streaming = false;
-    handler->tfd = tfd;
-    handler->host = host;
+    handler->keepalive = false;
    handler->request = request;
-    handler->callback = callback;
-    handler->ws.co = handler;
+    handler->response_callback = response_callback;
    handler->start_connection();
-    return &handler->ws;
+    return handler;
 }

-void websocket_t::post_message(int type, const std::string & msg)
+void http_request(http_co_t *handler, const std::string & host, const std::string & request,
+    const http_options_t & options, std::function<void(const http_response_t *response)> response_callback)
 {
-    co->post_message(type, msg);
+    handler->send_request(host, request, options, response_callback);
 }

-void websocket_t::close()
+void http_co_t::run_cb_and_clear()
 {
-    co->end();
+    parsed.eof = true;
+    std::function<void(const http_response_t*)> cb;
+    cb.swap(response_callback);
+    // Call the callback only after clearing it, otherwise we may hit reentrancy problems
+    if (cb != NULL)
+        cb(&parsed);
+}
+
+void http_co_t::send_request(const std::string & host, const std::string & request,
+    const http_options_t & options, std::function<void(const http_response_t *response)> response_callback)
+{
+    stackin();
+    if (state == HTTP_CO_WEBSOCKET)
+    {
+        stackout();
+        throw std::runtime_error("Attempt to send HTTP request into a websocket or chunked stream");
+    }
+    else if (state != HTTP_CO_KEEPALIVE && state != HTTP_CO_CLOSED)
+    {
+        keepalive_queue.push_back([this, host, request, options, response_callback]()
+        {
+            this->send_request(host, request, options, response_callback);
+        });
+        stackout();
+        return;
+    }
+    if (state == HTTP_CO_KEEPALIVE && connected_host != host)
+    {
+        close_connection();
+    }
+    this->request_timeout = options.timeout < 0 ? 0 : (options.timeout == 0 ? DEFAULT_TIMEOUT : options.timeout);
+    this->want_streaming = options.want_streaming;
+    this->keepalive = options.keepalive;
+    this->host = host;
+    this->request = request;
+    this->response = "";
+    this->sent = 0;
+    this->response_callback = response_callback;
+    this->parsed = {};
+    if (request_timeout > 0)
+    {
+        timeout_id = tfd->set_timer(request_timeout, false, [this](int timer_id)
+        {
+            stackin();
+            close_connection();
+            parsed = { .error = "HTTP request timed out" };
+            run_cb_and_clear();
+            stackout();
+        });
+    }
+    if (state == HTTP_CO_KEEPALIVE)
+    {
+        state = HTTP_CO_SENDING_REQUEST;
+        submit_send();
+    }
+    else
+    {
+        start_connection();
+    }
+    stackout();
+}
+
+void http_post_message(http_co_t *handler, int type, const std::string & msg)
+{
+    handler->post_message(type, msg);
+}
+
+void http_co_t::post_message(int type, const std::string & msg)
+{
+    stackin();
+    if (state == HTTP_CO_WEBSOCKET)
+    {
+        request += ws_format_frame(type, msg.size());
+        request += msg;
+        submit_send();
+    }
+    else if (state == HTTP_CO_KEEPALIVE || state == HTTP_CO_CHUNKED)
+    {
+        throw std::runtime_error("Attempt to send websocket message on a regular HTTP connection");
+    }
+    else
+    {
+        ws_outbox += ws_format_frame(type, msg.size());
+        ws_outbox += msg;
+    }
+    stackout();
+}
+
+void http_close(http_co_t *handler)
+{
+    handler->end();
+}
+
+void http_response_t::parse_json_response(std::string & error, json11::Json & r) const
+{
+    if (this->error != "")
+    {
+        error = this->error;
+        r = json11::Json();
+    }
+    else if (status_code != 200)
+    {
+        error = "HTTP "+std::to_string(status_code)+" "+status_line+" body: "+trim(body);
+        r = json11::Json();
+    }
+    else
+    {
+        std::string json_err;
+        json11::Json data = json11::Json::parse(body, json_err);
+        if (json_err != "")
+        {
+            error = "Bad JSON: "+json_err+" (response: "+trim(body)+")";
+            r = json11::Json();
+        }
+        else
+        {
+            error = "";
+            r = data;
+        }
+    }
 }

 http_co_t::~http_co_t()
+{
+    close_connection();
+}
+
+void http_co_t::close_connection()
 {
    if (timeout_id >= 0)
    {
@@ -164,67 +262,41 @@ http_co_t::~http_co_t()
        close(peer_fd);
        peer_fd = -1;
    }
-    if (parsed.headers["transfer-encoding"] == "chunked")
-    {
-        int prev = 0, pos = 0;
-        while ((pos = response.find("\r\n", prev)) >= prev)
-        {
-            uint64_t len = strtoull(response.c_str()+prev, NULL, 16);
-            parsed.body += response.substr(pos+2, len);
-            prev = pos+2+len+2;
-        }
-    }
-    else
-    {
-        std::swap(parsed.body, response);
-    }
-    parsed.eof = true;
-    callback(&parsed);
+    state = HTTP_CO_CLOSED;
+    connected_host = "";
+    response = "";
+    epoll_events = 0;
 }

 void http_co_t::start_connection()
 {
    stackin();
-    int port = extract_port(host);
-    struct sockaddr_in addr;
-    int r;
-    if ((r = inet_pton(AF_INET, host.c_str(), &addr.sin_addr)) != 1)
+    struct sockaddr addr;
+    if (!string_to_addr(host.c_str(), 1, 80, &addr))
    {
-        parsed.error_code = ENXIO;
+        parsed = { .error = "Invalid address: "+host };
+        run_cb_and_clear();
        stackout();
-        end();
        return;
    }
-    addr.sin_family = AF_INET;
-    addr.sin_port = htons(port ? port : 80);
-    peer_fd = socket(AF_INET, SOCK_STREAM, 0);
+    peer_fd = socket(addr.sa_family, SOCK_STREAM, 0);
    if (peer_fd < 0)
    {
-        parsed.error_code = errno;
+        parsed = { .error = std::string("socket: ")+strerror(errno) };
+        run_cb_and_clear();
        stackout();
-        end();
        return;
    }
    fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK);
-    if (request_timeout > 0)
-    {
-        timeout_id = tfd->set_timer(request_timeout, false, [this](int timer_id)
-        {
-            if (response.length() == 0)
-            {
-                parsed.error_code = ETIME;
-            }
-            end();
-        });
-    }
    epoll_events = 0;
    // Finally call connect
-    r = ::connect(peer_fd, (sockaddr*)&addr, sizeof(addr));
+    int r = ::connect(peer_fd, (sockaddr*)&addr, sizeof(addr));
    if (r < 0 && errno != EINPROGRESS)
    {
-        parsed.error_code = errno;
+        parsed = { .error = std::string("connect: ")+strerror(errno) };
+        close_connection();
+        run_cb_and_clear();
        stackout();
-        end();
        return;
    }
    tfd->set_fd_handler(peer_fd, true, [this](int peer_fd, int epoll_events)
@@ -232,6 +304,7 @@ void http_co_t::start_connection()
        this->epoll_events |= epoll_events;
        handle_events();
    });
+    connected_host = host;
    state = HTTP_CO_CONNECTING;
    stackout();
 }
@@ -254,7 +327,8 @@ void http_co_t::handle_events()
            }
            else if (epoll_events & (EPOLLRDHUP|EPOLLERR))
            {
-                end();
+                close_connection();
+                run_cb_and_clear();
                break;
            }
        }
@@ -273,9 +347,10 @@ void http_co_t::handle_connect_result()
    }
    if (result != 0)
    {
-        parsed.error_code = result;
+        close_connection();
+        parsed = { .error = std::string("connect: ")+strerror(result) };
+        run_cb_and_clear();
        stackout();
-        end();
        return;
    }
    int one = 1;
@@ -290,6 +365,51 @@ void http_co_t::handle_connect_result()
    stackout();
 }

+void http_co_t::submit_send()
+{
+    stackin();
+    int res;
+again:
+    if (sent < request.size())
+    {
+        send_iov = (iovec){ .iov_base = (void*)(request.c_str()+sent), .iov_len = request.size()-sent };
+        send_msg.msg_iov = &send_iov;
+        send_msg.msg_iovlen = 1;
+        res = sendmsg(peer_fd, &send_msg, MSG_NOSIGNAL);
+        if (res < 0)
+        {
+            res = -errno;
+        }
+        if (res == -EAGAIN || res == -EINTR)
+        {
+            res = 0;
+        }
+        else if (res < 0)
+        {
+            close_connection();
+            parsed = { .error = std::string("sendmsg: ")+strerror(-res) };
+            run_cb_and_clear();
+            stackout();
+            return;
+        }
+        sent += res;
+        if (state == HTTP_CO_SENDING_REQUEST)
+        {
+            if (sent >= request.size())
+                state = HTTP_CO_REQUEST_SENT;
+            else
+                goto again;
+        }
+        else if (state == HTTP_CO_WEBSOCKET)
+        {
+            request = request.substr(sent);
+            sent = 0;
+            goto again;
+        }
+    }
+    stackout();
+}
+
 void http_co_t::submit_read()
 {
    stackin();
@@ -306,16 +426,18 @@ void http_co_t::submit_read()
    {
        res = -errno;
    }
-    if (res == -EAGAIN)
+    if (res == -EAGAIN || res == -EINTR)
    {
        epoll_events = epoll_events & ~EPOLLIN;
    }
    else if (res <= 0)
    {
        // < 0 means error, 0 means EOF
-        if (!res)
-            epoll_events = epoll_events & ~EPOLLIN;
-        end();
+        epoll_events = epoll_events & ~EPOLLIN;
+        close_connection();
+        if (res < 0)
+            parsed = { .error = std::string("recvmsg: ")+strerror(-res) };
+        run_cb_and_clear();
    }
    else
    {
@@ -325,51 +447,6 @@ void http_co_t::submit_read()
    stackout();
 }

-void http_co_t::submit_send()
-{
-    stackin();
-    int res;
-again:
-    if (sent < request.size())
-    {
-        send_iov = (iovec){ .iov_base = (void*)(request.c_str()+sent), .iov_len = request.size()-sent };
-        send_msg.msg_iov = &send_iov;
-        send_msg.msg_iovlen = 1;
-        res = sendmsg(peer_fd, &send_msg, MSG_NOSIGNAL);
-        if (res < 0)
-        {
-            res = -errno;
-        }
-        if (res == -EAGAIN)
-        {
-            res = 0;
-        }
-        else if (res < 0)
-        {
-            stackout();
-            end();
-            return;
-        }
-        sent += res;
-        if (state == HTTP_CO_SENDING_REQUEST)
-        {
-            if (sent >= request.size())
-            {
-                state = HTTP_CO_REQUEST_SENT;
-            }
-            else
-                goto again;
-        }
-        else if (state == HTTP_CO_WEBSOCKET)
-        {
-            request = request.substr(sent);
-            sent = 0;
-            goto again;
-        }
-    }
-    stackout();
-}
-
 bool http_co_t::handle_read()
 {
    stackin();
@@ -380,6 +457,7 @@ bool http_co_t::handle_read()
        {
            if (timeout_id >= 0)
            {
+                // Timeout is cleared when headers are received
                tfd->clear_timer(timeout_id);
                timeout_id = -1;
            }
@@ -407,20 +485,26 @@ bool http_co_t::handle_read()
                if (!target_response_size)
                {
                    // Sorry, unsupported response
+                    close_connection();
+                    parsed = { .error = "Response has neither Connection: close, nor Transfer-Encoding: chunked nor Content-Length headers" };
+                    run_cb_and_clear();
                    stackout();
-                    end();
                    return false;
                }
            }
+            else
+            {
+                keepalive = false;
+            }
        }
    }
    if (state == HTTP_CO_HEADERS_RECEIVED && target_response_size > 0 && response.size() >= target_response_size)
    {
-        stackout();
-        end();
-        return false;
+        std::swap(parsed.body, response);
+        response_callback(&parsed);
+        parsed.eof = true;
    }
-    if (state == HTTP_CO_CHUNKED && response.size() > 0)
+    else if (state == HTTP_CO_CHUNKED && response.size() > 0)
    {
        int prev = 0, pos = 0;
        while ((pos = response.find("\r\n", prev)) >= prev)
@@ -443,55 +527,49 @@ bool http_co_t::handle_read()
        {
            response = response.substr(prev);
        }
-        if (parsed.eof)
+        if (want_streaming)
        {
-            stackout();
-            end();
-            return false;
-        }
-        if (want_streaming && parsed.body.size() > 0)
-        {
-            if (!ended)
-            {
-                // Don't deliver additional events after close()
-                callback(&parsed);
-            }
+            // Streaming response
+            response_callback(&parsed);
            parsed.body = "";
        }
+        if (parsed.eof && !want_streaming)
+        {
+            // Normal response
+            response_callback(&parsed);
+        }
    }
-    if (state == HTTP_CO_WEBSOCKET && response.size() > 0)
+    else if (state == HTTP_CO_WEBSOCKET && response.size() > 0)
    {
        while (ws_parse_frame(response, parsed.ws_msg_type, parsed.body))
        {
-            if (!ended)
-            {
-                // Don't deliver additional events after close()
-                callback(&parsed);
-            }
+            response_callback(&parsed);
            parsed.body = "";
        }
    }
+    if (parsed.eof)
+    {
+        response_callback = NULL;
+        parsed = {};
+        if (!keepalive)
+        {
+            close_connection();
+        }
+        else
+        {
+            state = HTTP_CO_KEEPALIVE;
+            if (keepalive_queue.size() > 0)
+            {
+                auto next = keepalive_queue[0];
+                keepalive_queue.erase(keepalive_queue.begin(), keepalive_queue.begin()+1);
+                next();
+            }
+        }
+    }
    stackout();
    return true;
 }

-void http_co_t::post_message(int type, const std::string & msg)
-{
-    stackin();
-    if (state == HTTP_CO_WEBSOCKET)
-    {
-        request += ws_format_frame(type, msg.size());
-        request += msg;
-        submit_send();
-    }
-    else
-    {
-        ws_outbox += ws_format_frame(type, msg.size());
-        ws_outbox += msg;
-    }
-    stackout();
-}
-
 uint64_t stoull_full(const std::string & str, int base)
 {
    if (isspace(str[0]))
@@ -507,7 +585,7 @@ uint64_t stoull_full(const std::string & str, int base)
    return r;
 }

-void parse_http_headers(std::string & res, http_response_t *parsed)
+static void parse_http_headers(std::string & res, http_response_t *parsed)
 {
    int pos = res.find("\r\n");
    pos = pos < 0 ? res.length() : pos+2;
@@ -556,13 +634,13 @@ static std::string ws_format_frame(int type, uint64_t size)
        res[p++] = size | /*mask*/0x80;
    else if (size < 65536)
    {
-        res[p++] = 126 | /*mask*/0x80;
+        res[p++] = (char)(126 | /*mask*/0x80);
        res[p++] = (size >> 8) & 0xFF;
        res[p++] = (size >> 0) & 0xFF;
    }
    else
    {
-        res[p++] = 127 | /*mask*/0x80;
+        res[p++] = (char)(127 | /*mask*/0x80);
        res[p++] = (size >> 56) & 0xFF;
        res[p++] = (size >> 48) & 0xFF;
        res[p++] = (size >> 40) & 0xFF;
@@ -629,152 +707,6 @@ static bool ws_parse_frame(std::string & buf, int & type, std::string & res)
    return true;
 }

-static bool cidr_match(const in_addr &addr, const in_addr &net, uint8_t bits)
-{
-    if (bits == 0)
-    {
-        // C99 6.5.7 (3): u32 << 32 is undefined behaviour
-        return true;
-    }
-    return !((addr.s_addr ^ net.s_addr) & htonl(0xFFFFFFFFu << (32 - bits)));
-}
-
-static bool cidr6_match(const in6_addr &address, const in6_addr &network, uint8_t bits)
-{
-    const uint32_t *a = address.s6_addr32;
-    const uint32_t *n = network.s6_addr32;
-    int bits_whole, bits_incomplete;
-    bits_whole = bits >> 5;         // number of whole u32
-    bits_incomplete = bits & 0x1F;  // number of bits in incomplete u32
-    if (bits_whole && memcmp(a, n, bits_whole << 2))
-        return false;
-    if (bits_incomplete)
-    {
-        uint32_t mask = htonl((0xFFFFFFFFu) << (32 - bits_incomplete));
-        if ((a[bits_whole] ^ n[bits_whole]) & mask)
-            return false;
-    }
-    return true;
-}
-
-struct addr_mask_t
-{
-    sa_family_t family;
-    in_addr ipv4;
-    in6_addr ipv6;
-    uint8_t bits;
-};
-
-std::vector<std::string> getifaddr_list(json11::Json mask_cfg, bool include_v6)
-{
-    std::vector<addr_mask_t> masks;
-    if (mask_cfg.is_string())
-    {
-        mask_cfg = json11::Json::array{ mask_cfg };
-    }
-    for (auto mask_json: mask_cfg.array_items())
-    {
-        std::string mask = mask_json.string_value();
-        unsigned bits = 0;
-        int p = mask.find('/');
-        if (p != std::string::npos)
-        {
-            char null_byte = 0;
-            if (sscanf(mask.c_str()+p+1, "%u%c", &bits, &null_byte) != 1 || bits > 128)
-            {
-                throw std::runtime_error((include_v6 ? "Invalid IPv4 address mask: " : "Invalid IP address mask: ") + mask);
-            }
-            mask = mask.substr(0, p);
-        }
-        in_addr ipv4;
-        in6_addr ipv6;
-        if (inet_pton(AF_INET, mask.c_str(), &ipv4) == 1)
-        {
-            if (bits > 32)
-            {
-                throw std::runtime_error((include_v6 ? "Invalid IPv4 address mask: " : "Invalid IP address mask: ") + mask);
-            }
-            masks.push_back((addr_mask_t){ .family = AF_INET, .ipv4 = ipv4, .bits = (uint8_t)bits });
-        }
-        else if (include_v6 && inet_pton(AF_INET6, mask.c_str(), &ipv6) == 1)
-        {
-            masks.push_back((addr_mask_t){ .family = AF_INET6, .ipv6 = ipv6, .bits = (uint8_t)bits });
-        }
-        else
-        {
-            throw std::runtime_error((include_v6 ? "Invalid IPv4 address mask: " : "Invalid IP address mask: ") + mask);
-        }
-    }
-    std::vector<std::string> addresses;
-    ifaddrs *list, *ifa;
-    if (getifaddrs(&list) == -1)
-    {
-        throw std::runtime_error(std::string("getifaddrs: ") + strerror(errno));
-    }
-    for (ifa = list; ifa != NULL; ifa = ifa->ifa_next)
-    {
-        if (!ifa->ifa_addr)
-        {
-            continue;
-        }
-        int family = ifa->ifa_addr->sa_family;
-        if ((family == AF_INET || family == AF_INET6 && include_v6) &&
-            (ifa->ifa_flags & (IFF_UP | IFF_RUNNING | IFF_LOOPBACK)) == (IFF_UP | IFF_RUNNING))
-        {
-            void *addr_ptr;
-            if (family == AF_INET)
-            {
-                addr_ptr = &((sockaddr_in *)ifa->ifa_addr)->sin_addr;
-            }
-            else
-            {
-                addr_ptr = &((sockaddr_in6 *)ifa->ifa_addr)->sin6_addr;
-            }
-            if (masks.size() > 0)
-            {
-                int i;
-                for (i = 0; i < masks.size(); i++)
-                {
-                    if (masks[i].family == family && (family == AF_INET
-                        ? cidr_match(*(in_addr*)addr_ptr, masks[i].ipv4, masks[i].bits)
-                        : cidr6_match(*(in6_addr*)addr_ptr, masks[i].ipv6, masks[i].bits)))
-                    {
-                        break;
-                    }
-                }
-                if (i >= masks.size())
-                {
-                    continue;
-                }
-            }
-            char addr[INET6_ADDRSTRLEN];
-            if (!inet_ntop(family, addr_ptr, addr, INET6_ADDRSTRLEN))
-            {
-                throw std::runtime_error(std::string("inet_ntop: ") + strerror(errno));
-            }
-            addresses.push_back(std::string(addr));
-        }
-    }
-    freeifaddrs(list);
-    return addresses;
-}
-
-static int extract_port(std::string & host)
-{
-    int port = 0;
-    int pos = 0;
-    if ((pos = host.find(':')) >= 0)
-    {
-        port = strtoull(host.c_str() + pos + 1, NULL, 10);
-        if (port >= 0x10000)
-        {
-            port = 0;
-        }
-        host = host.substr(0, pos);
-    }
-    return port;
-}
-
 std::string strtolower(const std::string & in)
 {
    std::string s = in;
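
Note on the `ws_format_frame` hunk above: `126 | 0x80` and `127 | 0x80` are `int` values above 127, so the added `(char)` casts presumably silence a narrowing warning when storing them into a `std::string`. The length encoding itself follows RFC 6455. Below is a minimal sketch of that header layout; `ws_frame_header` and its `mask_key` parameter are hypothetical names, not the function from `http_client.cpp`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not the function from http_client.cpp): builds the header
// of a masked client-to-server WebSocket frame as laid out in RFC 6455.
std::string ws_frame_header(int opcode, uint64_t size, uint32_t mask_key)
{
    assert(opcode >= 0 && opcode <= 0xF);
    std::string res;
    // First byte: FIN bit plus the 4-bit opcode
    res.push_back((char)(0x80 | opcode));
    if (size < 126)
    {
        // Short payload: the length fits into the 7 bits next to the mask bit
        res.push_back((char)(size | /*mask*/0x80));
    }
    else if (size < 65536)
    {
        // 126 announces a 16-bit big-endian length
        res.push_back((char)(126 | /*mask*/0x80));
        for (int shift = 8; shift >= 0; shift -= 8)
            res.push_back((char)((size >> shift) & 0xFF));
    }
    else
    {
        // 127 announces a 64-bit big-endian length
        res.push_back((char)(127 | /*mask*/0x80));
        for (int shift = 56; shift >= 0; shift -= 8)
            res.push_back((char)((size >> shift) & 0xFF));
    }
    // 4-byte masking key, emitted most significant byte first in this sketch
    for (int shift = 24; shift >= 0; shift -= 8)
        res.push_back((char)((mask_key >> shift) & 0xFF));
    return res;
}
```

A 300-byte text frame, for example, gets an 8-byte header: two fixed bytes, a 2-byte length (`0x01 0x2C`), and the 4-byte mask.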
--- a/src/http_client.h
+++ b/src/http_client.h
@@ -21,41 +21,34 @@ struct http_options_t
 {
    int timeout;
    bool want_streaming;
+    bool keepalive;
 };

 struct http_response_t
 {
+    std::string error;
+
    bool eof = false;
-    int error_code = 0;
    int status_code = 0;
    std::string status_line;
    std::map<std::string, std::string> headers;
    int ws_msg_type = -1;
    std::string body;
+
+    void parse_json_response(std::string & error, json11::Json & r) const;
 };

+// Opened websocket or keepalive HTTP connection
 struct http_co_t;

-struct websocket_t
-{
-    http_co_t *co;
-    void post_message(int type, const std::string & msg);
-    void close();
-};
-
-void parse_http_headers(std::string & res, http_response_t *parsed);
-
-std::vector<std::string> getifaddr_list(json11::Json mask_cfg = json11::Json(), bool include_v6 = false);
+http_co_t* http_init(timerfd_manager_t *tfd);
+http_co_t* open_websocket(timerfd_manager_t *tfd, const std::string & host, const std::string & path,
+    int timeout, std::function<void(const http_response_t *msg)> on_message);
+void http_request(http_co_t *handler, const std::string & host, const std::string & request,
+    const http_options_t & options, std::function<void(const http_response_t *response)> response_callback);
+void http_post_message(http_co_t *handler, int type, const std::string & msg);
+void http_close(http_co_t *co);

+// Utils
 uint64_t stoull_full(const std::string & str, int base = 10);
-
 std::string strtolower(const std::string & in);
-
-void http_request(timerfd_manager_t *tfd, const std::string & host, const std::string & request,
-    const http_options_t & options, std::function<void(const http_response_t *response)> callback);
-
-void http_request_json(timerfd_manager_t *tfd, const std::string & host, const std::string & request,
-    int timeout, std::function<void(std::string, json11::Json r)> callback);
-
-websocket_t* open_websocket(timerfd_manager_t *tfd, const std::string & host, const std::string & path,
-    int timeout, std::function<void(const http_response_t *msg)> callback);
--- a/src/messenger.cpp
+++ b/src/messenger.cpp
@@ -4,10 +4,12 @@
 #include <unistd.h>
 #include <fcntl.h>
 #include <sys/socket.h>
+#include <sys/stat.h>
 #include <sys/epoll.h>
 #include <netinet/tcp.h>
 #include <stdexcept>

+#include "addr_util.h"
 #include "messenger.h"

 void osd_messenger_t::init()
@@ -220,23 +222,20 @@ void osd_messenger_t::try_connect_peer(uint64_t peer_osd)
 void osd_messenger_t::try_connect_peer_addr(osd_num_t peer_osd, const char *peer_host, int peer_port)
 {
    assert(peer_osd != this->osd_num);
-    struct sockaddr_in addr;
-    int r;
-    if ((r = inet_pton(AF_INET, peer_host, &addr.sin_addr)) != 1)
+    struct sockaddr addr;
+    if (!string_to_addr(peer_host, 0, peer_port, &addr))
    {
        on_connect_peer(peer_osd, -EINVAL);
        return;
    }
-    addr.sin_family = AF_INET;
-    addr.sin_port = htons(peer_port ? peer_port : 11203);
-    int peer_fd = socket(AF_INET, SOCK_STREAM, 0);
+    int peer_fd = socket(addr.sa_family, SOCK_STREAM, 0);
    if (peer_fd < 0)
    {
        on_connect_peer(peer_osd, -errno);
        return;
    }
    fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK);
-    r = connect(peer_fd, (sockaddr*)&addr, sizeof(addr));
+    int r = connect(peer_fd, (sockaddr*)&addr, sizeof(addr));
    if (r < 0 && errno != EINPROGRESS)
    {
        close(peer_fd);
@@ -485,21 +484,20 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
 void osd_messenger_t::accept_connections(int listen_fd)
 {
    // Accept new connections
-    sockaddr_in addr;
+    sockaddr addr;
    socklen_t peer_addr_size = sizeof(addr);
    int peer_fd;
-    while ((peer_fd = accept(listen_fd, (sockaddr*)&addr, &peer_addr_size)) >= 0)
+    while ((peer_fd = accept(listen_fd, &addr, &peer_addr_size)) >= 0)
    {
        assert(peer_fd != 0);
-        char peer_str[256];
-        fprintf(stderr, "[OSD %lu] new client %d: connection from %s port %d\n", this->osd_num, peer_fd,
-            inet_ntop(AF_INET, &addr.sin_addr, peer_str, 256), ntohs(addr.sin_port));
+        fprintf(stderr, "[OSD %lu] new client %d: connection from %s\n", this->osd_num, peer_fd,
+            addr_to_string(addr).c_str());
        fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK);
        int one = 1;
        setsockopt(peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
        clients[peer_fd] = new osd_client_t();
        clients[peer_fd]->peer_addr = addr;
-        clients[peer_fd]->peer_port = ntohs(addr.sin_port);
+        clients[peer_fd]->peer_port = ntohs(((sockaddr_in*)&addr)->sin_port);
        clients[peer_fd]->peer_fd = peer_fd;
        clients[peer_fd]->peer_state = PEER_CONNECTED;
        clients[peer_fd]->in_buf = malloc_or_die(receive_buffer_size);
@@ -547,7 +545,7 @@ json11::Json osd_messenger_t::read_config(const json11::Json & config)
    int done = 0;
    while (done < st.st_size)
    {
-        int r = read(fd, (void*)buf.data()+done, st.st_size-done);
+        int r = read(fd, (uint8_t*)buf.data()+done, st.st_size-done);
        if (r < 0)
        {
            fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno));
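
Note on the `messenger.cpp` hunks above: the hard-coded `AF_INET` / `sockaddr_in` code is replaced by calls into `addr_util.h`, whose implementation is not part of this excerpt. As an illustration only, a family-agnostic parser could look like the sketch below. `parse_ip` is a hypothetical name, and the choice of `sockaddr_storage` is an assumption of this sketch: on Linux a plain `struct sockaddr` is smaller than `sockaddr_in6`, while `sockaddr_storage` is sized to hold any address family.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper, NOT string_to_addr() from addr_util.h (which is not shown here).
// Parses a numeric IPv4 or IPv6 address into a family-agnostic sockaddr_storage.
bool parse_ip(const std::string & host, int port, sockaddr_storage *out)
{
    assert(port >= 0 && port < 0x10000);
    memset(out, 0, sizeof(*out));
    in_addr a4;
    in6_addr a6;
    if (inet_pton(AF_INET, host.c_str(), &a4) == 1)
    {
        sockaddr_in *v4 = (sockaddr_in*)out;
        v4->sin_family = AF_INET;
        v4->sin_addr = a4;
        v4->sin_port = htons(port);
        return true;
    }
    if (inet_pton(AF_INET6, host.c_str(), &a6) == 1)
    {
        sockaddr_in6 *v6 = (sockaddr_in6*)out;
        v6->sin6_family = AF_INET6;
        v6->sin6_addr = a6;
        v6->sin6_port = htons(port);
        return true;
    }
    // Neither a valid IPv4 nor a valid IPv6 literal
    return false;
}
```

The caller can then pass `out->ss_family` to `socket()` and the size of the family-specific structure to `connect()` or `bind()`.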
--- a/src/messenger.h
+++ b/src/messenger.h
@@ -49,7 +49,7 @@ struct osd_client_t
 {
    int refs = 0;

-    sockaddr_in peer_addr;
+    sockaddr peer_addr;
    int peer_port;
    int peer_fd;
    int peer_state;
--- a/src/msgr_op.h
+++ b/src/msgr_op.h
@@ -141,7 +141,7 @@ struct osd_op_buf_list_t
            else
            {
                iov.iov_len -= result;
-                iov.iov_base += result;
+                iov.iov_base = (uint8_t*)iov.iov_base + result;
                break;
            }
        }
--- a/src/msgr_rdma.cpp
+++ b/src/msgr_rdma.cpp
@@ -58,11 +58,19 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(const char *ib_devname, uint8_t
    msgr_rdma_context_t *ctx = new msgr_rdma_context_t();
    ctx->mtu = mtu;

-    srand48(time(NULL));
+    timespec tv;
+    clock_gettime(CLOCK_REALTIME, &tv);
+    srand48(tv.tv_sec*1000000000 + tv.tv_nsec);
    dev_list = ibv_get_device_list(NULL);
    if (!dev_list)
    {
-        fprintf(stderr, "Failed to get RDMA device list: %s\n", strerror(errno));
+        if (errno == -ENOSYS || errno == ENOSYS)
+        {
+            if (log_level > 0)
+                fprintf(stderr, "No RDMA devices found (RDMA device list returned ENOSYS)\n");
+        }
+        else
+            fprintf(stderr, "Failed to get RDMA device list: %s\n", strerror(errno));
        goto cleanup;
    }
    if (!ib_devname)
@@ -383,7 +391,7 @@ bool osd_messenger_t::try_send_rdma(osd_client_t *cl)
        uint32_t len = (uint32_t)(op_size+iov.iov_len-rc->send_buf_pos < rc->max_msg
            ? iov.iov_len-rc->send_buf_pos : rc->max_msg-op_size);
        sge[op_sge++] = {
-            .addr = (uintptr_t)(iov.iov_base+rc->send_buf_pos),
+            .addr = (uintptr_t)((uint8_t*)iov.iov_base+rc->send_buf_pos),
            .length = len,
            .lkey = rc->ctx->mr->lkey,
        };
@@ -513,7 +521,7 @@ void osd_messenger_t::handle_rdma_events()
                    }
                    if (cl->rdma_conn->send_buf_pos > 0)
                    {
-                        cl->send_list[0].iov_base += cl->rdma_conn->send_buf_pos;
+                        cl->send_list[0].iov_base = (uint8_t*)cl->send_list[0].iov_base + cl->rdma_conn->send_buf_pos;
                        cl->send_list[0].iov_len -= cl->rdma_conn->send_buf_pos;
                        cl->rdma_conn->send_buf_pos = 0;
                    }
--- a/src/msgr_receive.cpp
+++ b/src/msgr_receive.cpp
@@ -67,7 +67,7 @@ bool osd_messenger_t::handle_read(int result, osd_client_t *cl)
        }
        return false;
    }
-    if (result <= 0 && result != -EAGAIN)
+    if (result <= 0 && result != -EAGAIN && result != -EINTR)
    {
        // this is a client socket, so don't panic on error. just disconnect it
        if (result != 0)
@@ -77,7 +77,7 @@ bool osd_messenger_t::handle_read(int result, osd_client_t *cl)
        stop_client(cl->peer_fd);
        return false;
    }
-    if (result == -EAGAIN || result < cl->read_iov.iov_len)
+    if (result == -EAGAIN || result == -EINTR || result < cl->read_iov.iov_len)
    {
        cl->read_ready--;
        if (cl->read_ready > 0)
@@ -142,13 +142,13 @@ bool osd_messenger_t::handle_read_buffer(osd_client_t *cl, void *curbuf, int rem
                memcpy(cur->iov_base, curbuf, remain);
                cl->read_remaining -= remain;
                cur->iov_len -= remain;
-                cur->iov_base += remain;
+                cur->iov_base = (uint8_t*)cur->iov_base + remain;
                remain = 0;
            }
            else
            {
                memcpy(cur->iov_base, curbuf, cur->iov_len);
-                curbuf += cur->iov_len;
+                curbuf = (uint8_t*)curbuf + cur->iov_len;
                cl->read_remaining -= cur->iov_len;
                remain -= cur->iov_len;
                cur->iov_len = 0;
@@ -390,7 +390,7 @@ void osd_messenger_t::handle_reply_ready(osd_op_t *op)
        (tv_end.tv_sec - op->tv_begin.tv_sec)*1000000 +
        (tv_end.tv_nsec - op->tv_begin.tv_nsec)/1000
    );
-    set_immediate.push_back([this, op]()
+    set_immediate.push_back([op]()
    {
        // Copy lambda to be unaffected by `delete op`
        std::function<void(osd_op_t*)>(op->callback)(op);
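
Note on the `handle_read` hunks above: `-EINTR` is now treated like `-EAGAIN`, i.e. as a reason to retry rather than to disconnect the client. The first branch of that decision can be sketched as a small classifier. The enum and the function name are hypothetical, and the sketch leaves out the partial-read bookkeeping that `handle_read` also performs.

```cpp
#include <cassert>
#include <cerrno>

enum read_action_t { READ_DISCONNECT, READ_RETRY, READ_DATA };

// Hypothetical classifier: `result` is a byte count or a negated errno value
read_action_t classify_read(int result)
{
    if (result > 0)
    {
        // Some bytes arrived
        return READ_DATA;
    }
    if (result == -EAGAIN || result == -EINTR)
    {
        // Transient condition: nothing to read yet, or the call was interrupted
        return READ_RETRY;
    }
    // 0 means the peer closed the connection, anything else is a real error
    return READ_DISCONNECT;
}
```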
--- a/src/msgr_send.cpp
+++ b/src/msgr_send.cpp
@@ -224,7 +224,7 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
        }
        return;
    }
-    if (result < 0 && result != -EAGAIN)
+    if (result < 0 && result != -EAGAIN && result != -EINTR)
    {
        // this is a client socket, so don't panic. just disconnect it
        fprintf(stderr, "Client %d socket write error: %d (%s). Disconnecting client\n", cl->peer_fd, -result, strerror(-result));
@@ -250,7 +250,7 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
            else
            {
                iov.iov_len -= result;
-                iov.iov_base += result;
+                iov.iov_base = (uint8_t*)iov.iov_base + result;
                break;
            }
        }
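
Note on the repeated `iov_base` changes in this commit: `iov_base` is a `void*`, and arithmetic on `void*` is a GNU extension rather than standard C++, which is presumably why the `+=` forms are replaced with an explicit cast to `uint8_t*`. A minimal sketch of the same pattern, with `iov_advance` as a hypothetical helper name:

```cpp
#include <sys/uio.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: drops the first `done` bytes from an iovec after a partial send
void iov_advance(iovec & iov, size_t done)
{
    assert(done <= iov.iov_len);
    // Go through a byte pointer instead of doing arithmetic on void*
    iov.iov_base = (uint8_t*)iov.iov_base + done;
    iov.iov_len -= done;
}
```

After `iov_advance(iov, 5)` on a 16-byte buffer, `iov_base` points 5 bytes further in and `iov_len` is 11.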
--- a/src/msgr_stop.cpp
+++ b/src/msgr_stop.cpp
@@ -111,6 +111,10 @@ void osd_messenger_t::stop_client(int peer_fd, bool force, bool force_delete)
        {
            delete cl->read_op;
        }
+        else
+        {
+            cancel_op(cl->read_op);
+        }
        cl->read_op = NULL;
    }
    if (cl->osd_num)
--- a/src/nbd_proxy.cpp
+++ b/src/nbd_proxy.cpp
@@ -30,6 +30,9 @@ protected:
    std::string image_name;
    uint64_t inode = 0;
    uint64_t device_size = 0;
+    int nbd_timeout = 30;
+    int nbd_max_devices = 64;
+    int nbd_max_part = 3;
    inode_watch_t *watch = NULL;

    ring_loop_t *ringloop = NULL;
@@ -52,6 +55,15 @@ protected:
    iovec read_iov = { 0 };

 public:
+    ~nbd_proxy()
+    {
+        if (recv_buf)
+        {
+            free(recv_buf);
+            recv_buf = NULL;
+        }
+    }
+
    static json11::Json::object parse_args(int narg, const char *args[])
    {
        json11::Json::object cfg;
@@ -117,9 +129,18 @@ public:
            "Vitastor NBD proxy\n"
            "(c) Vitaliy Filippov, 2020-2021 (VNPL-1.1)\n\n"
            "USAGE:\n"
-            "  %s map [--etcd_address <etcd_address>] (--image <image> | --pool <pool> --inode <inode> --size <size in bytes>)\n"
+            "  %s map [OPTIONS] (--image <image> | --pool <pool> --inode <inode> --size <size in bytes>)\n"
            "  %s unmap /dev/nbd0\n"
-            "  %s ls [--json]\n",
+            "  %s ls [--json]\n"
+            "OPTIONS:\n"
+            "  All usual Vitastor config options like --etcd_address <etcd_address> plus NBD-specific:\n"
+            "  --nbd_timeout 30\n"
+            "    timeout in seconds after which the kernel will stop the device\n"
+            "    you can set it to 0, but beware that you won't be able to stop the device at all\n"
+            "    if vitastor-nbd process dies\n"
+            "  --nbd_max_devices 64 --nbd_max_part 3\n"
+            "    options for the \"nbd\" kernel module when modprobing it (nbds_max and max_part).\n"
+            "    note that maximum allowed (nbds_max)*(1+max_part) is 256.\n",
            exe_name, exe_name, exe_name
        );
        exit(0);
@@ -174,6 +195,18 @@ public:
                exit(1);
            }
        }
+        if (cfg["nbd_max_devices"].is_number() || cfg["nbd_max_devices"].is_string())
+        {
+            nbd_max_devices = cfg["nbd_max_devices"].uint64_value();
+        }
+        if (cfg["nbd_max_part"].is_number() || cfg["nbd_max_part"].is_string())
+        {
+            nbd_max_part = cfg["nbd_max_part"].uint64_value();
+        }
+        if (cfg["nbd_timeout"].is_number() || cfg["nbd_timeout"].is_string())
+        {
+            nbd_timeout = cfg["nbd_timeout"].uint64_value();
+        }
        // Create client
        ringloop = new ring_loop_t(512);
        epmgr = new epoll_manager_t(ringloop);
@@ -190,6 +223,12 @@ public:
            }
            watch = cli->st_cli.watch_inode(image_name);
            device_size = watch->cfg.size;
+            if (!watch->cfg.num || !device_size)
+            {
+                // Image does not exist
+                fprintf(stderr, "Image %s does not exist\n", image_name.c_str());
+                exit(1);
+            }
        }
        // Initialize NBD
        int sockfd[2];
@@ -204,7 +243,7 @@ public:
        bool bg = cfg["foreground"].is_null();
        if (!cfg["dev_num"].is_null())
        {
-            if (run_nbd(sockfd, cfg["dev_num"].int64_value(), device_size, NBD_FLAG_SEND_FLUSH, 30, bg) < 0)
+            if (run_nbd(sockfd, cfg["dev_num"].int64_value(), device_size, NBD_FLAG_SEND_FLUSH, nbd_timeout, bg) < 0)
            {
                perror("run_nbd");
                exit(1);
@@ -278,7 +317,7 @@ public:
        stop = false;
        cluster_op_t *close_sync = new cluster_op_t;
        close_sync->opcode = OSD_OP_SYNC;
-        close_sync->callback = [this, &stop](cluster_op_t *op)
+        close_sync->callback = [&stop](cluster_op_t *op)
        {
            stop = true;
            delete op;
@@ -292,6 +331,9 @@ public:
        delete cli;
        delete epmgr;
        delete ringloop;
+        cli = NULL;
+        epmgr = NULL;
+        ringloop = NULL;
    }

    void load_module()
@@ -301,7 +343,10 @@ public:
            return;
        }
        int r;
-        if ((r = system("modprobe nbd")) != 0)
+        // The kernel's built-in default is 16 devices with up to 16 partitions per device, which is too few devices.
+        // 64 devices isn't very high either; the possible maximum is nbds_max=256 with max_part=0, which
+        // reserves no block device minor numbers for partitions
+        if ((r = system(("modprobe nbd nbds_max="+std::to_string(nbd_max_devices)+" max_part="+std::to_string(nbd_max_part)).c_str())) != 0)
        {
            if (r < 0)
                perror("Failed to load NBD kernel module");
@@ -318,7 +363,8 @@ public:
        setsid();
        if (fork())
            exit(0);
-        chdir("/");
+        if (chdir("/") != 0)
+            fprintf(stderr, "Warning: Failed to chdir into /\n");
        close(0);
        close(1);
        close(2);
@@ -465,7 +511,7 @@ protected:
            goto end_unmap;
        }
        ioctl(nbd, NBD_SET_FLAGS, flags);
-        if (timeout >= 0)
+        if (timeout > 0)
        {
            r = ioctl(nbd, NBD_SET_TIMEOUT, (unsigned long)timeout);
            if (r < 0)
@@ -480,7 +526,11 @@ protected:
        {
            goto end_unmap;
        }
-        write(qd_fd, "32768", 5);
+        r = write(qd_fd, "32768", 5);
+        if (r != 5)
+        {
+            fprintf(stderr, "Warning: Failed to configure max_sectors_kb\n");
+        }
        close(qd_fd);
        if (!fork())
        {
@@ -553,7 +603,7 @@ protected:
            }
            else
            {
-                send_list[to_eat].iov_base += result;
+                send_list[to_eat].iov_base = (uint8_t*)send_list[to_eat].iov_base + result;
                send_list[to_eat].iov_len -= result;
                break;
            }
@@ -627,8 +677,8 @@ protected:
                memcpy(cur_buf, b, inc);
                cur_left -= inc;
                result -= inc;
-                cur_buf += inc;
-                b += inc;
+                cur_buf = (uint8_t*)cur_buf + inc;
+                b = (uint8_t*)b + inc;
            }
            else
            {
@@ -667,7 +717,7 @@ protected:
                op->offset = be64toh(cur_req.from);
                op->len = be32toh(cur_req.len);
                buf = malloc_or_die(sizeof(nbd_reply) + op->len);
-                op->iov.push_back(buf + sizeof(nbd_reply), op->len);
+                op->iov.push_back((uint8_t*)buf + sizeof(nbd_reply), op->len);
            }
            else if (req_type == NBD_CMD_FLUSH)
            {
@@ -695,7 +745,7 @@ protected:
            if (req_type == NBD_CMD_WRITE)
            {
                cur_op = op;
-                cur_buf = buf + sizeof(nbd_reply);
+                cur_buf = (uint8_t*)buf + sizeof(nbd_reply);
                cur_left = op->len;
                read_state = CL_READ_DATA;
            }
@@ -734,5 +784,6 @@ int main(int narg, const char *args[])
    exe_name = args[0];
    nbd_proxy *p = new nbd_proxy();
    p->exec(nbd_proxy::parse_args(narg, args));
+    delete p;
    return 0;
 }
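
Note on the new `--nbd_max_devices` and `--nbd_max_part` options: the help text above states that `(nbds_max)*(1+max_part)` may be at most 256. The sketch below checks that budget and builds the same `modprobe` command line as `load_module()`. Both function names are hypothetical, and the limit is taken from the help text, not verified against the kernel.

```cpp
#include <cassert>
#include <string>

// Hypothetical check of the limit quoted in the vitastor-nbd help text:
// (nbds_max)*(1+max_part) must not exceed 256
bool nbd_params_fit(int nbds_max, int max_part)
{
    return nbds_max > 0 && max_part >= 0 &&
        (long long)nbds_max * ((long long)max_part + 1) <= 256;
}

// Builds the same command string that load_module() passes to system()
std::string nbd_modprobe_cmd(int nbds_max, int max_part)
{
    assert(nbd_params_fit(nbds_max, max_part));
    return "modprobe nbd nbds_max="+std::to_string(nbds_max)+" max_part="+std::to_string(max_part);
}
```

The defaults from this commit, 64 devices with 3 partitions each, use exactly 64*(1+3) = 256, and so does 256 devices with no partitions.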
--- a/src/osd.cpp
+++ b/src/osd.cpp
@@ -3,10 +3,12 @@

 #include <sys/socket.h>
 #include <sys/poll.h>
+#include <sys/mman.h>
 #include <netinet/in.h>
 #include <netinet/tcp.h>
 #include <arpa/inet.h>

+#include "addr_util.h"
 #include "blockstore_impl.h"
 #include "osd_primary.h"
 #include "osd.h"
@@ -52,6 +54,20 @@ osd_t::osd_t(const json11::Json & config, ring_loop_t *ringloop)
            autosync_writes = max_autosync;
    }

+    if (this->config["osd_memlock"] == "true" || this->config["osd_memlock"] == "1" || this->config["osd_memlock"] == "yes")
+    {
+        // Lock all OSD memory if requested
+        if (mlockall(MCL_CURRENT|MCL_FUTURE
+#ifdef MCL_ONFAULT
+            | MCL_ONFAULT
+#endif
+            ) != 0)
+        {
+            fprintf(stderr, "osd_memlock is set to true, but mlockall() failed: %s\n", strerror(errno));
+            exit(-1);
+        }
+    }
+
    this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id)
    {
        print_stats();
@@ -156,14 +172,6 @@ void osd_t::parse_config(const json11::Json & config)

 void osd_t::bind_socket()
 {
-    listen_fd = socket(AF_INET, SOCK_STREAM, 0);
-    if (listen_fd < 0)
-    {
-        throw std::runtime_error(std::string("socket: ") + strerror(errno));
-    }
-    int enable = 1;
-    setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(enable));
-
    if (config["osd_network"].is_string() ||
        config["osd_network"].is_array())
    {
@@ -173,28 +181,40 @@ void osd_t::bind_socket()
        else
            for (auto v: config["osd_network"].array_items())
                mask.push_back(v.string_value());
-        auto matched_addrs = getifaddr_list(mask, false);
+        auto matched_addrs = getifaddr_list(mask);
        if (matched_addrs.size() > 1)
        {
            fprintf(stderr, "More than 1 address matches requested network(s): %s\n", json11::Json(matched_addrs).dump().c_str());
            force_stop(1);
        }
+        if (!matched_addrs.size())
+        {
+            std::string nets;
+            for (auto v: mask)
+                nets += (nets == "" ? v : ","+v);
+            fprintf(stderr, "Addresses matching osd_network(s) %s not found\n", nets.c_str());
+            force_stop(1);
+        }
        bind_address = matched_addrs[0];
    }

    // FIXME Support multiple listening sockets

-    sockaddr_in addr;
-    int r;
-    if ((r = inet_pton(AF_INET, bind_address.c_str(), &addr.sin_addr)) != 1)
+    sockaddr addr;
+    if (!string_to_addr(bind_address, 0, bind_port, &addr))
    {
-        close(listen_fd);
-        throw std::runtime_error("bind address "+bind_address+(r == 0 ? " is not valid" : ": no ipv4 support"));
+        throw std::runtime_error("bind address "+bind_address+" is not valid");
    }
-    addr.sin_family = AF_INET;

-    addr.sin_port = htons(bind_port);
-    if (bind(listen_fd, (sockaddr*)&addr, sizeof(addr)) < 0)
+    listen_fd = socket(addr.sa_family, SOCK_STREAM, 0);
+    if (listen_fd < 0)
+    {
+        throw std::runtime_error(std::string("socket: ") + strerror(errno));
+    }
+    int enable = 1;
+    setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(enable));
+
+    if (bind(listen_fd, &addr, sizeof(addr)) < 0)
    {
        close(listen_fd);
        throw std::runtime_error(std::string("bind: ") + strerror(errno));
@@ -207,7 +227,7 @@ void osd_t::bind_socket()
            close(listen_fd);
            throw std::runtime_error(std::string("getsockname: ") + strerror(errno));
        }
-        listening_port = ntohs(addr.sin_port);
+        listening_port = ntohs(((sockaddr_in*)&addr)->sin_port);
    }
    else
    {
@@ -326,8 +346,8 @@ void osd_t::exec_op(osd_op_t *cur_op)

 void osd_t::reset_stats()
 {
-    msgr.stats = { 0 };
-    prev_stats = { 0 };
+    msgr.stats = {};
+    prev_stats = {};
    memset(recovery_stat_count, 0, sizeof(recovery_stat_count));
    memset(recovery_stat_bytes, 0, sizeof(recovery_stat_bytes));
 }
@@ -442,7 +462,7 @@ void osd_t::print_slow()
                {
                    for (uint64_t i = 0; i < op->req.sec_stab.len; i += sizeof(obj_ver_id))
                    {
-                        obj_ver_id *ov = (obj_ver_id*)(op->buf + i);
+                        obj_ver_id *ov = (obj_ver_id*)((uint8_t*)op->buf + i);
                        bufprintf(i == 0 ? " %lx:%lx v%lu" : ", %lx:%lx v%lu", ov->oid.inode, ov->oid.stripe, ov->version);
                    }
                }
--- a/src/osd.h
+++ b/src/osd.h
@@ -102,7 +102,7 @@ class osd_t
    bool no_rebalance = false;
    bool no_recovery = false;
    std::string bind_address;
-    int bind_port, listen_backlog;
+    int bind_port, listen_backlog = 128;
    // FIXME: Implement client queue depth limit
    int client_queue_depth = 128;
    bool allow_test_ops = false;
@@ -166,8 +166,8 @@ class osd_t
    osd_op_stats_t prev_stats;
    std::map<uint64_t, inode_stats_t> inode_stats;
    const char* recovery_stat_names[2] = { "degraded", "misplaced" };
-    uint64_t recovery_stat_count[2][2] = { 0 };
-    uint64_t recovery_stat_bytes[2][2] = { 0 };
+    uint64_t recovery_stat_count[2][2] = {};
+    uint64_t recovery_stat_bytes[2][2] = {};

    // cluster connection
    void parse_config(const json11::Json & config);
--- a/src/osd_cluster.cpp
+++ b/src/osd_cluster.cpp
@@ -6,6 +6,7 @@
 #include "etcd_state_client.h"
 #include "http_client.h"
 #include "osd_rmw.h"
+#include "addr_util.h"

 // Startup sequence:
 //   Start etcd watcher -> Load global OSD configuration -> Bind socket -> Acquire lease -> Report&lock OSD state
@@ -276,14 +277,14 @@ void osd_t::report_statistics()
            } }
        });
    }
-    st_cli.etcd_txn(json11::Json::object { { "success", txn } }, ETCD_SLOW_TIMEOUT, [this](std::string err, json11::Json res)
+    st_cli.etcd_txn_slow(json11::Json::object { { "success", txn } }, [this](std::string err, json11::Json res)
    {
        etcd_reporting_stats = false;
        if (err != "")
        {
            printf("[OSD %lu] Error reporting state to etcd: %s\n", this->osd_num, err.c_str());
            // Retry indefinitely
-            tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int timer_id)
+            tfd->set_timer(st_cli.etcd_slow_timeout, false, [this](int timer_id)
            {
                report_statistics();
            });
@@ -354,13 +355,13 @@ void osd_t::acquire_lease()
 {
    // Maximum lease TTL is (report interval) + retries * (timeout + repeat interval)
    st_cli.etcd_call("/lease/grant", json11::Json::object {
-        { "TTL", etcd_report_interval+(MAX_ETCD_ATTEMPTS*(2*ETCD_QUICK_TIMEOUT)+999)/1000 }
-    }, ETCD_QUICK_TIMEOUT, [this](std::string err, json11::Json data)
+        { "TTL", etcd_report_interval+(st_cli.max_etcd_attempts*(2*st_cli.etcd_quick_timeout)+999)/1000 }
+    }, st_cli.etcd_quick_timeout, 0, 0, [this](std::string err, json11::Json data)
    {
        if (err != "" || data["ID"].string_value() == "")
        {
-            printf("Error acquiring a lease from etcd: %s\n", err.c_str());
-            tfd->set_timer(ETCD_QUICK_TIMEOUT, false, [this](int timer_id)
+            printf("Error acquiring a lease from etcd: %s, retrying\n", err.c_str());
+            tfd->set_timer(st_cli.etcd_quick_timeout, false, [this](int timer_id)
            {
                acquire_lease();
            });
@@ -407,19 +408,19 @@ void osd_t::create_osd_state()
                } }
            },
        } },
-    }, ETCD_QUICK_TIMEOUT, [this](std::string err, json11::Json data)
+    }, st_cli.etcd_quick_timeout, 0, 0, [this](std::string err, json11::Json data)
    {
        if (err != "")
        {
            etcd_failed_attempts++;
            printf("Error creating OSD state key: %s\n", err.c_str());
-            if (etcd_failed_attempts > MAX_ETCD_ATTEMPTS)
+            if (etcd_failed_attempts > st_cli.max_etcd_attempts)
            {
                // Die
                throw std::runtime_error("Cluster connection failed");
            }
            // Retry
-            tfd->set_timer(ETCD_QUICK_TIMEOUT, false, [this](int timer_id)
+            tfd->set_timer(st_cli.etcd_quick_timeout, false, [this](int timer_id)
            {
                create_osd_state();
            });
@@ -451,7 +452,7 @@ void osd_t::renew_lease()
 {
    st_cli.etcd_call("/lease/keepalive", json11::Json::object {
        { "ID", etcd_lease_id }
-    }, ETCD_QUICK_TIMEOUT, [this](std::string err, json11::Json data)
+    }, st_cli.etcd_quick_timeout, 0, 0, [this](std::string err, json11::Json data)
    {
        if (err == "" && data["result"]["TTL"].string_value() == "")
        {
@@ -462,13 +463,13 @@ void osd_t::renew_lease()
        {
            etcd_failed_attempts++;
            printf("Error renewing etcd lease: %s\n", err.c_str());
-            if (etcd_failed_attempts > MAX_ETCD_ATTEMPTS)
+            if (etcd_failed_attempts > st_cli.max_etcd_attempts)
            {
                // Die
                throw std::runtime_error("Cluster connection failed");
            }
            // Retry
-            tfd->set_timer(ETCD_QUICK_TIMEOUT, false, [this](int timer_id)
+            tfd->set_timer(st_cli.etcd_quick_timeout, false, [this](int timer_id)
            {
                renew_lease();
            });
@@ -487,7 +488,7 @@ void osd_t::force_stop(int exitcode)
    {
        st_cli.etcd_call("/kv/lease/revoke", json11::Json::object {
            { "ID", etcd_lease_id }
-        }, ETCD_QUICK_TIMEOUT, [this, exitcode](std::string err, json11::Json data)
+        }, st_cli.etcd_quick_timeout, st_cli.max_etcd_attempts, 0, [this, exitcode](std::string err, json11::Json data)
        {
            if (err != "")
            {
@@ -825,7 +826,7 @@ void osd_t::report_pg_states()
    etcd_reporting_pg_state = true;
    st_cli.etcd_txn(json11::Json::object {
        { "compare", checks }, { "success", success }, { "failure", failure }
-    }, ETCD_QUICK_TIMEOUT, [this, reporting_pgs](std::string err, json11::Json data)
+    }, st_cli.etcd_quick_timeout, 0, 0, [this, reporting_pgs](std::string err, json11::Json data)
    {
        etcd_reporting_pg_state = false;
        if (!data["succeeded"].bool_value())
@@ -857,10 +858,13 @@ void osd_t::report_pg_states()
                        if (null_byte == 0)
                        {
                            auto pg_it = pgs.find({ .pool_id = pool_id, .pg_num = pg_num });
-                            if (pg_it != pgs.end() && pg_it->second.state != PG_OFFLINE && pg_it->second.state != PG_STARTING)
+                            if (pg_it != pgs.end() && pg_it->second.state != PG_OFFLINE && pg_it->second.state != PG_STARTING &&
+                                kv.value["primary"].uint64_value() != 0 &&
+                                kv.value["primary"].uint64_value() != this->osd_num)
                            {
-                                // Live PG state update failed
-                                printf("Failed to report state of pool %u PG %u which is live. Race condition detected, exiting\n", pool_id, pg_num);
+                                // PG is somehow captured by another OSD
+                                printf("BUG: OSD %lu captured our PG %u/%u. Race condition detected, exiting\n",
+                                    kv.value["primary"].uint64_value(), pool_id, pg_num);
                                force_stop(1);
                                return;
                            }
--- a/src/osd_peering.cpp
+++ b/src/osd_peering.cpp
@@ -27,9 +27,9 @@ void osd_t::handle_peers()
                    misplaced_objects += p.second.misplaced_objects.size();
                    // FIXME: degraded objects may currently include misplaced, too! Report them separately?
                    degraded_objects += p.second.degraded_objects.size();
-                    if ((p.second.state & (PG_ACTIVE | PG_HAS_UNCLEAN)) == (PG_ACTIVE | PG_HAS_UNCLEAN))
+                    if (p.second.state & PG_HAS_UNCLEAN)
                        peering_state = peering_state | OSD_FLUSHING_PGS;
-                    else if (p.second.state & PG_ACTIVE)
+                    else if (p.second.state & PG_HAS_DEGRADED)
                        peering_state = peering_state | OSD_RECOVERING;
                }
                else
@@ -176,6 +176,17 @@ void osd_t::start_pg_peering(pg_t & pg)
            msgr.stop_client(peer_fd);
        }
    }
+    // Try to connect with current peers if they're up, but we don't have connections to them
+    // Otherwise we may erroneously decide that the pg is incomplete :-)
+    for (auto pg_osd: pg.all_peers)
+    {
+        if (pg_osd != this->osd_num &&
+            msgr.osd_peer_fds.find(pg_osd) == msgr.osd_peer_fds.end() &&
+            msgr.wanted_peers.find(pg_osd) == msgr.wanted_peers.end())
+        {
+            msgr.connect_peer(pg_osd, st_cli.peer_states[pg_osd]);
+        }
+    }
    // Calculate current write OSD set
    pg.pg_cursize = 0;
    pg.cur_set.resize(pg.target_set.size());
@@ -194,6 +205,20 @@ void osd_t::start_pg_peering(pg_t & pg)
            });
        }
    }
+    if (pg.pg_cursize < pg.pg_minsize)
+    {
+        pg.state = PG_INCOMPLETE;
+        report_pg_state(pg);
+        return;
+    }
+    std::set<osd_num_t> cur_peers;
+    for (auto pg_osd: pg.all_peers)
+    {
+        if (pg_osd == this->osd_num || msgr.osd_peer_fds.find(pg_osd) != msgr.osd_peer_fds.end())
+        {
+            cur_peers.insert(pg_osd);
+        }
+    }
    if (pg.target_history.size())
    {
        // Refuse to start PG if no peers are available from any of the historical OSD sets
@@ -222,24 +247,6 @@ void osd_t::start_pg_peering(pg_t & pg)
            }
        }
    }
-    if (pg.pg_cursize < pg.pg_minsize)
-    {
-        pg.state = PG_INCOMPLETE;
-        report_pg_state(pg);
-        return;
-    }
-    std::set<osd_num_t> cur_peers;
-    for (auto pg_osd: pg.all_peers)
-    {
-        if (pg_osd == this->osd_num || msgr.osd_peer_fds.find(pg_osd) != msgr.osd_peer_fds.end())
-        {
-            cur_peers.insert(pg_osd);
-        }
-        else if (msgr.wanted_peers.find(pg_osd) == msgr.wanted_peers.end())
-        {
-            msgr.connect_peer(pg_osd, st_cli.peer_states[pg_osd]);
-        }
-    }
    pg.cur_peers.insert(pg.cur_peers.begin(), cur_peers.begin(), cur_peers.end());
    if (pg.peering_state)
    {
--- a/src/osd_primary.cpp
+++ b/src/osd_primary.cpp
@@ -96,11 +96,11 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
            (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 0 : pg_it->second.pg_size)
        )
    );
-    void *data_buf = ((void*)op_data) + sizeof(osd_primary_op_data_t);
+    void *data_buf = (uint8_t*)op_data + sizeof(osd_primary_op_data_t);
    op_data->pg_num = pg_num;
    op_data->oid = oid;
    op_data->stripes = (osd_rmw_stripe_t*)data_buf;
-    data_buf += sizeof(osd_rmw_stripe_t) * stripe_count;
+    data_buf = (uint8_t*)data_buf + sizeof(osd_rmw_stripe_t) * stripe_count;
    op_data->scheme = pool_cfg.scheme;
    op_data->pg_data_size = pg_data_size;
    op_data->pg_size = pg_it->second.pg_size;
@@ -110,17 +110,17 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
    for (int i = 0; i < stripe_count; i++)
    {
        op_data->stripes[i].bmp_buf = data_buf;
-        data_buf += clean_entry_bitmap_size;
+        data_buf = (uint8_t*)data_buf + clean_entry_bitmap_size;
    }
    op_data->chain_size = chain_size;
    if (chain_size > 0)
    {
        op_data->read_chain = (inode_t*)data_buf;
-        data_buf += sizeof(inode_t) * chain_size;
+        data_buf = (uint8_t*)data_buf + sizeof(inode_t) * chain_size;
        op_data->snapshot_bitmaps = data_buf;
-        data_buf += chain_size * stripe_count * clean_entry_bitmap_size;
+        data_buf = (uint8_t*)data_buf + chain_size * stripe_count * clean_entry_bitmap_size;
        op_data->missing_flags = (uint8_t*)data_buf;
-        data_buf += chain_size * (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 0 : pg_it->second.pg_size);
+        data_buf = (uint8_t*)data_buf + chain_size * (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 0 : pg_it->second.pg_size);
        // Copy chain
        int chain_num = 0;
        op_data->read_chain[chain_num++] = cur_op->req.rw.inode;
@@ -248,7 +248,7 @@ resume_2:
            {
                // Send buffer in parts to avoid copying
                cur_op->iov.push_back(
-                    stripes[role].read_buf + (stripes[role].req_start - stripes[role].read_start),
+                    (uint8_t*)stripes[role].read_buf + (stripes[role].req_start - stripes[role].read_start),
                    stripes[role].req_end - stripes[role].req_start
                );
            }
--- a/src/osd_primary_chain.cpp
+++ b/src/osd_primary_chain.cpp
@@ -66,7 +66,7 @@ int osd_t::read_bitmaps(osd_op_t *cur_op, pg_t & pg, int base_state)
            auto read_version = (vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX);
            // Read bitmap synchronously from the local database
            bs->read_bitmap(
-                cur_oid, read_version, op_data->snapshot_bitmaps + chain_num*clean_entry_bitmap_size,
+                cur_oid, read_version, (uint8_t*)op_data->snapshot_bitmaps + chain_num*clean_entry_bitmap_size,
                !chain_num ? &cur_op->reply.rw.version : NULL
            );
        }
@@ -96,12 +96,15 @@ resume_1:
                {
                    if (op_data->missing_flags[chain_num*pg.pg_size + i])
                    {
-                        osd_rmw_stripe_t local_stripes[pg.pg_size] = { 0 };
+                        osd_rmw_stripe_t local_stripes[pg.pg_size];
                        for (i = 0; i < pg.pg_size; i++)
                        {
-                            local_stripes[i].missing = op_data->missing_flags[chain_num*pg.pg_size + i] && true;
-                            local_stripes[i].bmp_buf = op_data->snapshot_bitmaps + (chain_num*pg.pg_size + i)*clean_entry_bitmap_size;
-                            local_stripes[i].read_start = local_stripes[i].read_end = 1;
+                            local_stripes[i] = (osd_rmw_stripe_t){
+                                .bmp_buf = (uint8_t*)op_data->snapshot_bitmaps + (chain_num*pg.pg_size + i)*clean_entry_bitmap_size,
+                                .read_start = 1,
+                                .read_end = 1,
+                                .missing = op_data->missing_flags[chain_num*pg.pg_size + i] && true,
+                            };
                        }
                        if (pg.scheme == POOL_SCHEME_XOR)
                        {
@@ -146,7 +149,7 @@ int osd_t::collect_bitmap_requests(osd_op_t *cur_op, pg_t & pg, std::vector<bitm
                .osd_num = read_target,
                .oid = cur_oid,
                .version = target_version,
-                .bmp_buf = op_data->snapshot_bitmaps + chain_num*clean_entry_bitmap_size,
+                .bmp_buf = (uint8_t*)op_data->snapshot_bitmaps + chain_num*clean_entry_bitmap_size,
            });
        }
        else
@@ -185,7 +188,7 @@ int osd_t::collect_bitmap_requests(osd_op_t *cur_op, pg_t & pg, std::vector<bitm
                            .stripe = cur_oid.stripe | i,
                        },
                        .version = target_version,
-                        .bmp_buf = op_data->snapshot_bitmaps + (chain_num*pg.pg_size + i)*clean_entry_bitmap_size,
+                        .bmp_buf = (uint8_t*)op_data->snapshot_bitmaps + (chain_num*pg.pg_size + i)*clean_entry_bitmap_size,
                    });
                    found++;
                }
@@ -204,6 +207,7 @@ int osd_t::submit_bitmap_subops(osd_op_t *cur_op, pg_t & pg)
    std::vector<bitmap_request_t> *bitmap_requests = new std::vector<bitmap_request_t>();
    if (collect_bitmap_requests(cur_op, pg, *bitmap_requests) < 0)
    {
+        delete bitmap_requests;
        return -1;
    }
    op_data->n_subops = 0;
@@ -266,15 +270,15 @@ int osd_t::submit_bitmap_subops(osd_op_t *cur_op, pg_t & pg)
                    int requested_count = subop->req.sec_read_bmp.len / sizeof(obj_ver_id);
                    if (subop->reply.hdr.retval == requested_count * (8 + clean_entry_bitmap_size))
                    {
-                        void *cur_buf = subop->buf + 8;
+                        void *cur_buf = (uint8_t*)subop->buf + 8;
                        for (int j = prev; j <= i; j++)
                        {
                            memcpy((*bitmap_requests)[j].bmp_buf, cur_buf, clean_entry_bitmap_size);
                            if ((*bitmap_requests)[j].oid.inode == cur_op->req.rw.inode)
                            {
-                                memcpy(&cur_op->reply.rw.version, cur_buf-8, 8);
+                                memcpy(&cur_op->reply.rw.version, (uint8_t*)cur_buf-8, 8);
                            }
-                            cur_buf += 8 + clean_entry_bitmap_size;
+                            cur_buf = (uint8_t*)cur_buf + 8 + clean_entry_bitmap_size;
                        }
                    }
                    if ((cur_op->op_data->errors + cur_op->op_data->done + 1) >= cur_op->op_data->n_subops)
@@ -363,7 +367,7 @@ int osd_t::submit_chained_read_requests(pg_t & pg, osd_op_t *cur_op)
        + sizeof(osd_rmw_stripe_t) * stripe_count * op_data->chain_size
    );
    osd_rmw_stripe_t *chain_stripes = (osd_rmw_stripe_t*)(
-        ((void*)op_data->chain_reads) + sizeof(osd_chain_read_t) * op_data->chain_read_count
+        (uint8_t*)op_data->chain_reads + sizeof(osd_chain_read_t) * op_data->chain_read_count
    );
    // Now process each subrequest as a separate read, including reconstruction if needed
    // Prepare reads
@@ -425,8 +429,8 @@ int osd_t::submit_chained_read_requests(pg_t & pg, osd_op_t *cur_op)
            if (stripes[role].read_end > 0)
            {
                stripes[role].read_buf = cur_buf;
-                stripes[role].bmp_buf = op_data->snapshot_bitmaps + (chain_reads[cri].chain_pos*stripe_count + role)*clean_entry_bitmap_size;
-                cur_buf += stripes[role].read_end - stripes[role].read_start;
+                stripes[role].bmp_buf = (uint8_t*)op_data->snapshot_bitmaps + (chain_reads[cri].chain_pos*stripe_count + role)*clean_entry_bitmap_size;
+                cur_buf = (uint8_t*)cur_buf + stripes[role].read_end - stripes[role].read_start;
            }
        }
    }
@@ -474,7 +478,7 @@ void osd_t::send_chained_read_results(pg_t & pg, osd_op_t *cur_op)
    osd_primary_op_data_t *op_data = cur_op->op_data;
    int stripe_count = (pg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size);
    osd_rmw_stripe_t *chain_stripes = (osd_rmw_stripe_t*)(
-        ((void*)op_data->chain_reads) + sizeof(osd_chain_read_t) * op_data->chain_read_count
+        (uint8_t*)op_data->chain_reads + sizeof(osd_chain_read_t) * op_data->chain_read_count
    );
    // Reconstruct parts if needed
    if (op_data->degraded)
@@ -544,7 +548,7 @@ void osd_t::send_chained_read_results(pg_t & pg, osd_op_t *cur_op)
                            role_end = bs_block_size;
                        assert(stripes[role].read_buf);
                        cur_op->iov.push_back(
-                            stripes[role].read_buf + (role_start - stripes[role].read_start),
+                            (uint8_t*)stripes[role].read_buf + (role_start - stripes[role].read_start),
                            role_end - role_start
                        );
                        sent += role_end - role_start;
--- a/src/osd_primary_sync.cpp
+++ b/src/osd_primary_sync.cpp
@@ -86,7 +86,7 @@ resume_2:
            sizeof(obj_ver_osd_t)*this->copies_to_delete_after_sync_count
        );
        op_data->dirty_pgs = (pool_pg_num_t*)dirty_buf;
-        op_data->dirty_osds = (osd_num_t*)(dirty_buf + sizeof(pool_pg_num_t)*dirty_pgs.size());
+        op_data->dirty_osds = (osd_num_t*)((uint8_t*)dirty_buf + sizeof(pool_pg_num_t)*dirty_pgs.size());
        op_data->dirty_pg_count = dirty_pgs.size();
        op_data->dirty_osd_count = dirty_osds.size();
        if (this->copies_to_delete_after_sync_count)
--- a/src/osd_primary_write.cpp
+++ b/src/osd_primary_write.cpp
@@ -113,7 +113,7 @@ resume_3:
            op_data->stripes[0].write_end != bs_block_size))
        {
            memcpy(
-                op_data->stripes[0].read_buf + op_data->stripes[0].req_start,
+                (uint8_t*)op_data->stripes[0].read_buf + op_data->stripes[0].req_start,
                op_data->stripes[0].write_buf,
                op_data->stripes[0].req_end - op_data->stripes[0].req_start
            );
--- a/src/osd_rmw.cpp
+++ b/src/osd_rmw.cpp
@@ -103,8 +103,8 @@ void reconstruct_stripes_xor(osd_rmw_stripe_t *stripes, int pg_size, uint32_t bi
                        assert(stripes[role].read_start >= stripes[prev].read_start &&
                            stripes[role].read_start >= stripes[other].read_start);
                        memxor(
-                            stripes[prev].read_buf + (stripes[role].read_start - stripes[prev].read_start),
-                            stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start),
+                            (uint8_t*)stripes[prev].read_buf + (stripes[role].read_start - stripes[prev].read_start),
+                            (uint8_t*)stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start),
                            stripes[role].read_buf, stripes[role].read_end - stripes[role].read_start
                        );
                        memxor(stripes[prev].bmp_buf, stripes[other].bmp_buf, stripes[role].bmp_buf, bitmap_size);
@@ -115,7 +115,7 @@ void reconstruct_stripes_xor(osd_rmw_stripe_t *stripes, int pg_size, uint32_t bi
                        assert(stripes[role].read_start >= stripes[other].read_start);
                        memxor(
                            stripes[role].read_buf,
-                            stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start),
+                            (uint8_t*)stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start),
                            stripes[role].read_buf, stripes[role].read_end - stripes[role].read_start
                        );
                        memxor(stripes[role].bmp_buf, stripes[other].bmp_buf, stripes[role].bmp_buf, bitmap_size);
@@ -202,10 +202,9 @@ reed_sol_matrix_t* get_jerasure_matrix(int pg_size, int pg_minsize)
 int* get_jerasure_decoding_matrix(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize)
 {
    int edd = 0;
-    int erased[pg_size] = { 0 };
+    int erased[pg_size];
    for (int i = 0; i < pg_size; i++)
-        if (stripes[i].read_end == 0 || stripes[i].missing)
-            erased[i] = 1;
+        erased[i] = (stripes[i].read_end == 0 || stripes[i].missing ? 1 : 0);
    for (int i = 0; i < pg_minsize; i++)
        if (stripes[i].read_end != 0 && stripes[i].missing)
            edd++;
@@ -241,7 +240,9 @@ void reconstruct_stripes_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg
        return;
    }
    int *decoding_matrix = dm_ids + pg_minsize;
-    char *data_ptrs[pg_size] = { 0 };
+    char *data_ptrs[pg_size];
+    for (int role = 0; role < pg_size; role++)
+        data_ptrs[role] = NULL;
    for (int role = 0; role < pg_minsize; role++)
    {
        if (stripes[role].read_end != 0 && stripes[role].missing)
@@ -254,7 +255,7 @@ void reconstruct_stripes_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg
                    {
                        assert(stripes[other].read_start <= stripes[role].read_start);
                        assert(stripes[other].read_end >= stripes[role].read_end);
-                        data_ptrs[other] = (char*)(stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start));
+                        data_ptrs[other] = (char*)stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start);
                    }
                }
                data_ptrs[role] = (char*)stripes[role].read_buf;
@@ -330,7 +331,7 @@ void* alloc_read_buffer(osd_rmw_stripe_t *stripes, int read_pg_size, uint64_t ad
    {
        if (stripes[role].read_end != 0)
        {
-            stripes[role].read_buf = buf + buf_pos;
+            stripes[role].read_buf = (uint8_t*)buf + buf_pos;
            buf_pos += stripes[role].read_end - stripes[role].read_start;
        }
    }
@@ -446,12 +447,12 @@ void* calc_rmw(void *request_buf, osd_rmw_stripe_t *stripes, uint64_t *read_osd_
    {
        if (stripes[role].req_end != 0)
        {
-            stripes[role].write_buf = request_buf + in_pos;
+            stripes[role].write_buf = (uint8_t*)request_buf + in_pos;
            in_pos += stripes[role].req_end - stripes[role].req_start;
        }
        else if (role >= pg_minsize && write_osd_set[role] != 0 && end != 0)
        {
-            stripes[role].write_buf = rmw_buf + buf_pos;
+            stripes[role].write_buf = (uint8_t*)rmw_buf + buf_pos;
            buf_pos += end - start;
        }
    }
@@ -476,13 +477,13 @@ static void get_old_new_buffers(osd_rmw_stripe_t & stripe, uint32_t wr_start, ui
    if (ne && (!oe || ns <= os))
    {
        // NEW or NEW->OLD
-        bufs[nbufs++] = { .buf = stripe.write_buf + ns - stripe.req_start, .len = ne-ns };
+        bufs[nbufs++] = { .buf = (uint8_t*)stripe.write_buf + ns - stripe.req_start, .len = ne-ns };
        if (os < ne)
            os = ne;
        if (oe > os)
        {
            // NEW->OLD
-            bufs[nbufs++] = { .buf = stripe.read_buf + os - stripe.read_start, .len = oe-os };
+            bufs[nbufs++] = { .buf = (uint8_t*)stripe.read_buf + os - stripe.read_start, .len = oe-os };
        }
    }
    else if (oe)
@@ -491,18 +492,18 @@ static void get_old_new_buffers(osd_rmw_stripe_t & stripe, uint32_t wr_start, ui
        if (ne)
        {
            // OLD->NEW or OLD->NEW->OLD
-            bufs[nbufs++] = { .buf = stripe.read_buf + os - stripe.read_start, .len = ns-os };
-            bufs[nbufs++] = { .buf = stripe.write_buf + ns - stripe.req_start, .len = ne-ns };
+            bufs[nbufs++] = { .buf = (uint8_t*)stripe.read_buf + os - stripe.read_start, .len = ns-os };
+            bufs[nbufs++] = { .buf = (uint8_t*)stripe.write_buf + ns - stripe.req_start, .len = ne-ns };
            if (oe > ne)
            {
                // OLD->NEW->OLD
-                bufs[nbufs++] = { .buf = stripe.read_buf + ne - stripe.read_start, .len = oe-ne };
+                bufs[nbufs++] = { .buf = (uint8_t*)stripe.read_buf + ne - stripe.read_start, .len = oe-ne };
            }
        }
        else
        {
            // OLD
-            bufs[nbufs++] = { .buf = stripe.read_buf + os - stripe.read_start, .len = oe-os };
+            bufs[nbufs++] = { .buf = (uint8_t*)stripe.read_buf + os - stripe.read_start, .len = oe-os };
        }
    }
 }
@@ -517,7 +518,7 @@ static void xor_multiple_buffers(buf_len_t *xor1, int n1, buf_len_t *xor2, int n
    {
        // We know for sure that ranges overlap
        uint32_t end = std::min(end1, end2);
-        memxor(xor1[i1].buf + pos-start1, xor2[i2].buf + pos-start2, dest+pos, end-pos);
+        memxor((uint8_t*)xor1[i1].buf + pos-start1, (uint8_t*)xor2[i2].buf + pos-start2, (uint8_t*)dest+pos, end-pos);
        pos = end;
        if (pos >= end1)
        {
@@ -586,7 +587,7 @@ static void calc_rmw_parity_copy_mod(osd_rmw_stripe_t *stripes, int pg_size, int
            {
                // Copy modified chunk into the read buffer to write it back
                memcpy(
-                    stripes[role].read_buf + stripes[role].req_start,
+                    (uint8_t*)stripes[role].read_buf + stripes[role].req_start,
                    stripes[role].write_buf,
                    stripes[role].req_end - stripes[role].req_start
                );
@@ -609,7 +610,7 @@ static void calc_rmw_parity_copy_parity(osd_rmw_stripe_t *stripes, int pg_size,
            {
                // Copy new parity into the read buffer to write it back
                memcpy(
-                    stripes[role].read_buf + start,
+                    (uint8_t*)stripes[role].read_buf + start,
                    stripes[role].write_buf,
                    end - start
                );
@@ -698,9 +699,15 @@ void calc_rmw_parity_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_min
        {
            // Calculate new coding chunks
            buf_len_t bufs[pg_size][3];
-            int nbuf[pg_size] = { 0 }, curbuf[pg_size] = { 0 };
+            int nbuf[pg_size], curbuf[pg_size];
            uint32_t positions[pg_size];
-            void *data_ptrs[pg_size] = { 0 };
+            void *data_ptrs[pg_size];
+            for (int i = 0; i < pg_size; i++)
+            {
+                data_ptrs[i] = NULL;
+                nbuf[i] = 0;
+                curbuf[i] = 0;
+            }
            for (int i = 0; i < pg_minsize; i++)
            {
                get_old_new_buffers(stripes[i], start, end, bufs[i], nbuf[i]);
@@ -719,7 +726,7 @@ void calc_rmw_parity_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_min
                {
                    assert(curbuf[i] < nbuf[i]);
                    assert(bufs[i][curbuf[i]].buf);
-                    data_ptrs[i] = bufs[i][curbuf[i]].buf + pos-positions[i];
+                    data_ptrs[i] = (uint8_t*)bufs[i][curbuf[i]].buf + pos-positions[i];
                    uint32_t this_end = bufs[i][curbuf[i]].len + positions[i];
                    if (next_end > this_end)
                        next_end = this_end;
--- a/src/osd_rmw_test.cpp
+++ b/src/osd_rmw_test.cpp
@@ -90,7 +90,7 @@ void dump_stripes(osd_rmw_stripe_t *stripes, int pg_size)
 void test1()
 {
    osd_num_t osd_set[3] = { 1, 0, 3 };
-    osd_rmw_stripe_t stripes[3] = { 0 };
+    osd_rmw_stripe_t stripes[3] = {};
    // Test 1.1
    split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
    assert(stripes[0].req_start == 128*1024-4096 && stripes[0].req_end == 128*1024);
@@ -129,7 +129,7 @@ void test4()
    const uint32_t bmp = 4;
    unsigned bitmaps[3] = { 0 };
    osd_num_t osd_set[3] = { 1, 0, 3 };
-    osd_rmw_stripe_t stripes[3] = { 0 };
+    osd_rmw_stripe_t stripes[3] = {};
    // Test 4.1
    split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
    for (int i = 0; i < 3; i++)
@@ -142,11 +142,11 @@ void test4()
    assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
-    assert(stripes[0].read_buf == rmw_buf+128*1024);
-    assert(stripes[1].read_buf == rmw_buf+128*1024*2);
-    assert(stripes[2].read_buf == rmw_buf+128*1024*3-4096);
+    assert(stripes[0].read_buf == (uint8_t*)rmw_buf+128*1024);
+    assert(stripes[1].read_buf == (uint8_t*)rmw_buf+128*1024*2);
+    assert(stripes[2].read_buf == (uint8_t*)rmw_buf+128*1024*3-4096);
    assert(stripes[0].write_buf == write_buf);
-    assert(stripes[1].write_buf == write_buf+4096);
+    assert(stripes[1].write_buf == (uint8_t*)write_buf+4096);
    assert(stripes[2].write_buf == rmw_buf);
    // Test 4.2
    set_pattern(write_buf, 8192, PATTERN0);
@@ -183,7 +183,7 @@ void test4()
 void test5()
 {
    osd_num_t osd_set[3] = { 1, 0, 3 };
-    osd_rmw_stripe_t stripes[3] = { 0 };
+    osd_rmw_stripe_t stripes[3] = {};
    // Test 5.1
    split_stripes(2, 128*1024, 0, 64*1024*3, stripes);
    assert(stripes[0].req_start == 0 && stripes[0].req_end == 128*1024);
@@ -198,11 +198,11 @@ void test5()
    assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 64*1024);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
-    assert(stripes[0].read_buf == rmw_buf+128*1024);
-    assert(stripes[1].read_buf == rmw_buf+64*3*1024);
-    assert(stripes[2].read_buf == rmw_buf+64*4*1024);
+    assert(stripes[0].read_buf == (uint8_t*)rmw_buf+128*1024);
+    assert(stripes[1].read_buf == (uint8_t*)rmw_buf+64*3*1024);
+    assert(stripes[2].read_buf == (uint8_t*)rmw_buf+64*4*1024);
    assert(stripes[0].write_buf == write_buf);
-    assert(stripes[1].write_buf == write_buf+128*1024);
+    assert(stripes[1].write_buf == (uint8_t*)write_buf+128*1024);
    assert(stripes[2].write_buf == rmw_buf);
    free(rmw_buf);
    free(write_buf);
@@ -224,7 +224,7 @@ void test5()
 void test6()
 {
    osd_num_t osd_set[3] = { 1, 2, 3 };
-    osd_rmw_stripe_t stripes[3] = { 0 };
+    osd_rmw_stripe_t stripes[3] = {};
    // Test 6.1
    split_stripes(2, 128*1024, 0, 64*1024*3, stripes);
    void *write_buf = malloc(64*1024*3);
@@ -236,10 +236,10 @@ void test6()
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 64*1024);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
    assert(stripes[0].read_buf == 0);
-    assert(stripes[1].read_buf == rmw_buf+128*1024);
+    assert(stripes[1].read_buf == (uint8_t*)rmw_buf+128*1024);
    assert(stripes[2].read_buf == 0);
    assert(stripes[0].write_buf == write_buf);
-    assert(stripes[1].write_buf == write_buf+128*1024);
+    assert(stripes[1].write_buf == (uint8_t*)write_buf+128*1024);
    assert(stripes[2].write_buf == rmw_buf);
    free(rmw_buf);
    free(write_buf);
@@ -267,7 +267,7 @@ void test7()
 {
    osd_num_t osd_set[3] = { 1, 0, 3 };
    osd_num_t write_osd_set[3] = { 1, 2, 3 };
-    osd_rmw_stripe_t stripes[3] = { 0 };
+    osd_rmw_stripe_t stripes[3] = {};
    // Test 7.1
    split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
    void *write_buf = malloc(8192);
@@ -278,11 +278,11 @@ void test7()
    assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
-    assert(stripes[0].read_buf == rmw_buf+128*1024);
-    assert(stripes[1].read_buf == rmw_buf+128*1024*2);
-    assert(stripes[2].read_buf == rmw_buf+128*1024*3);
+    assert(stripes[0].read_buf == (uint8_t*)rmw_buf+128*1024);
+    assert(stripes[1].read_buf == (uint8_t*)rmw_buf+128*1024*2);
+    assert(stripes[2].read_buf == (uint8_t*)rmw_buf+128*1024*3);
    assert(stripes[0].write_buf == write_buf);
-    assert(stripes[1].write_buf == write_buf+4096);
+    assert(stripes[1].write_buf == (uint8_t*)write_buf+4096);
    assert(stripes[2].write_buf == rmw_buf);
    // Test 7.2
    set_pattern(write_buf, 8192, PATTERN0);
@@ -320,7 +320,7 @@ void test8()
 {
    osd_num_t osd_set[3] = { 0, 2, 3 };
    osd_num_t write_osd_set[3] = { 1, 2, 3 };
-    osd_rmw_stripe_t stripes[3] = { 0 };
+    osd_rmw_stripe_t stripes[3] = {};
    // Test 8.1
    split_stripes(2, 128*1024, 0, 128*1024+4096, stripes);
    void *write_buf = malloc(128*1024+4096);
@@ -332,10 +332,10 @@ void test8()
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
    assert(stripes[0].read_buf == NULL);
-    assert(stripes[1].read_buf == rmw_buf+128*1024);
+    assert(stripes[1].read_buf == (uint8_t*)rmw_buf+128*1024);
    assert(stripes[2].read_buf == NULL);
    assert(stripes[0].write_buf == write_buf);
-    assert(stripes[1].write_buf == write_buf+128*1024);
+    assert(stripes[1].write_buf == (uint8_t*)write_buf+128*1024);
    assert(stripes[2].write_buf == rmw_buf);
    // Test 8.2
    set_pattern(write_buf, 128*1024+4096, PATTERN0);
@@ -345,7 +345,7 @@ void test8()
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);     // recheck again
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024); // recheck again
    assert(stripes[0].write_buf == write_buf);                               // recheck again
-    assert(stripes[1].write_buf == write_buf+128*1024);                      // recheck again
+    assert(stripes[1].write_buf == (uint8_t*)write_buf+128*1024);                      // recheck again
    assert(stripes[2].write_buf == rmw_buf);                                 // recheck again
    check_pattern(stripes[2].write_buf, 4096, 0); // new parity
    check_pattern(stripes[2].write_buf+4096, 128*1024-4096, PATTERN0^PATTERN1); // new parity
@@ -375,7 +375,7 @@ void test9()
 {
    osd_num_t osd_set[3] = { 0, 2, 3 };
    osd_num_t write_osd_set[3] = { 1, 2, 3 };
-    osd_rmw_stripe_t stripes[3] = { 0 };
+    osd_rmw_stripe_t stripes[3] = {};
    // Test 9.0
    split_stripes(2, 128*1024, 64*1024, 0, stripes);
    assert(stripes[0].req_start == 0 && stripes[0].req_end == 0);
@@ -391,8 +391,8 @@ void test9()
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 0);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 0);
    assert(stripes[0].read_buf == rmw_buf);
-    assert(stripes[1].read_buf == rmw_buf+128*1024);
-    assert(stripes[2].read_buf == rmw_buf+128*1024*2);
+    assert(stripes[1].read_buf == (uint8_t*)rmw_buf+128*1024);
+    assert(stripes[2].read_buf == (uint8_t*)rmw_buf+128*1024*2);
    assert(stripes[0].write_buf == NULL);
    assert(stripes[1].write_buf == NULL);
    assert(stripes[2].write_buf == NULL);
@@ -430,7 +430,7 @@ void test10()
 {
    osd_num_t osd_set[3] = { 1, 0, 0 };
    osd_num_t write_osd_set[3] = { 1, 2, 3 };
-    osd_rmw_stripe_t stripes[3] = { 0 };
+    osd_rmw_stripe_t stripes[3] = {};
    // Test 10.0
    split_stripes(2, 128*1024, 0, 256*1024, stripes);
    assert(stripes[0].req_start == 0 && stripes[0].req_end == 128*1024);
@@ -450,7 +450,7 @@ void test10()
    assert(stripes[1].read_buf == NULL);
    assert(stripes[2].read_buf == NULL);
    assert(stripes[0].write_buf == write_buf);
-    assert(stripes[1].write_buf == write_buf+128*1024);
+    assert(stripes[1].write_buf == (uint8_t*)write_buf+128*1024);
    assert(stripes[2].write_buf == rmw_buf);
    // Test 10.2
    set_pattern(stripes[0].write_buf, 128*1024, PATTERN1);
@@ -460,7 +460,7 @@ void test10()
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 128*1024);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
    assert(stripes[0].write_buf == write_buf);
-    assert(stripes[1].write_buf == write_buf+128*1024);
+    assert(stripes[1].write_buf == (uint8_t*)write_buf+128*1024);
    assert(stripes[2].write_buf == rmw_buf);
    check_pattern(stripes[2].write_buf, 128*1024, PATTERN1^PATTERN2);
    free(rmw_buf);
@@ -486,7 +486,7 @@ void test11()
 {
    osd_num_t osd_set[3] = { 1, 0, 0 };
    osd_num_t write_osd_set[3] = { 1, 2, 3 };
-    osd_rmw_stripe_t stripes[3] = { 0 };
+    osd_rmw_stripe_t stripes[3] = {};
    // Test 11.0
    split_stripes(2, 128*1024, 128*1024, 256*1024, stripes);
    assert(stripes[0].req_start == 0 && stripes[0].req_end == 0);
@@ -502,7 +502,7 @@ void test11()
    assert(stripes[0].write_start == 0 && stripes[0].write_end == 0);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 128*1024);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
-    assert(stripes[0].read_buf == rmw_buf+128*1024);
+    assert(stripes[0].read_buf == (uint8_t*)rmw_buf+128*1024);
    assert(stripes[1].read_buf == NULL);
    assert(stripes[2].read_buf == NULL);
    assert(stripes[0].write_buf == NULL);
@@ -542,7 +542,7 @@ void test12()
 {
    osd_num_t osd_set[3] = { 1, 2, 0 };
    osd_num_t write_osd_set[3] = { 1, 2, 3 };
-    osd_rmw_stripe_t stripes[3] = { 0 };
+    osd_rmw_stripe_t stripes[3] = {};
    // Test 12.0
    split_stripes(2, 128*1024, 0, 0, stripes);
    assert(stripes[0].req_start == 0 && stripes[0].req_end == 0);
@@ -557,8 +557,8 @@ void test12()
    assert(stripes[0].write_start == 0 && stripes[0].write_end == 0);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 0);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
-    assert(stripes[0].read_buf == rmw_buf+128*1024);
-    assert(stripes[1].read_buf == rmw_buf+2*128*1024);
+    assert(stripes[0].read_buf == (uint8_t*)rmw_buf+128*1024);
+    assert(stripes[1].read_buf == (uint8_t*)rmw_buf+2*128*1024);
    assert(stripes[2].read_buf == NULL);
    assert(stripes[0].write_buf == NULL);
    assert(stripes[1].write_buf == NULL);
@@ -597,7 +597,7 @@ void test13()
    use_jerasure(4, 2, true);
    osd_num_t osd_set[4] = { 1, 2, 0, 0 };
    osd_num_t write_osd_set[4] = { 1, 2, 3, 4 };
-    osd_rmw_stripe_t stripes[4] = { 0 };
+    osd_rmw_stripe_t stripes[4] = {};
    // Test 13.0
    void *write_buf = malloc_or_die(8192);
    split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
@@ -616,14 +616,14 @@ void test13()
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
    assert(stripes[3].write_start == 0 && stripes[3].write_end == 128*1024);
-    assert(stripes[0].read_buf == rmw_buf+2*128*1024);
-    assert(stripes[1].read_buf == rmw_buf+3*128*1024-4096);
+    assert(stripes[0].read_buf == (uint8_t*)rmw_buf+2*128*1024);
+    assert(stripes[1].read_buf == (uint8_t*)rmw_buf+3*128*1024-4096);
    assert(stripes[2].read_buf == NULL);
    assert(stripes[3].read_buf == NULL);
    assert(stripes[0].write_buf == write_buf);
-    assert(stripes[1].write_buf == write_buf+4096);
+    assert(stripes[1].write_buf == (uint8_t*)write_buf+4096);
    assert(stripes[2].write_buf == rmw_buf);
-    assert(stripes[3].write_buf == rmw_buf+128*1024);
+    assert(stripes[3].write_buf == (uint8_t*)rmw_buf+128*1024);
    // Test 13.2 - encode
    set_pattern(write_buf, 8192, PATTERN3);
    set_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1);
@@ -634,9 +634,9 @@ void test13()
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
    assert(stripes[3].write_start == 0 && stripes[3].write_end == 128*1024);
    assert(stripes[0].write_buf == write_buf);
-    assert(stripes[1].write_buf == write_buf+4096);
+    assert(stripes[1].write_buf == (uint8_t*)write_buf+4096);
    assert(stripes[2].write_buf == rmw_buf);
-    assert(stripes[3].write_buf == rmw_buf+128*1024);
+    assert(stripes[3].write_buf == (uint8_t*)rmw_buf+128*1024);
    // Test 13.3 - full decode and verify
    osd_num_t read_osd_set[4] = { 0, 0, 3, 4 };
    memset(stripes, 0, sizeof(stripes));
@@ -658,11 +658,11 @@ void test13()
    void *read_buf = alloc_read_buffer(stripes, 4, 0);
    assert(read_buf);
    assert(stripes[0].read_buf == read_buf);
-    assert(stripes[1].read_buf == read_buf+128*1024);
-    assert(stripes[2].read_buf == read_buf+2*128*1024);
-    assert(stripes[3].read_buf == read_buf+3*128*1024);
-    memcpy(read_buf+2*128*1024, rmw_buf, 128*1024);
-    memcpy(read_buf+3*128*1024, rmw_buf+128*1024, 128*1024);
+    assert(stripes[1].read_buf == (uint8_t*)read_buf+128*1024);
+    assert(stripes[2].read_buf == (uint8_t*)read_buf+2*128*1024);
+    assert(stripes[3].read_buf == (uint8_t*)read_buf+3*128*1024);
+    memcpy((uint8_t*)read_buf+2*128*1024, rmw_buf, 128*1024);
+    memcpy((uint8_t*)read_buf+3*128*1024, (uint8_t*)rmw_buf+128*1024, 128*1024);
    reconstruct_stripes_jerasure(stripes, 4, 2, 0);
    check_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1);
    check_pattern(stripes[0].read_buf+128*1024-4096, 4096, PATTERN3);
@@ -690,10 +690,10 @@ void test13()
    assert(read_buf);
    assert(stripes[0].read_buf == read_buf);
    assert(stripes[1].read_buf == NULL);
-    assert(stripes[2].read_buf == read_buf+128*1024);
-    assert(stripes[3].read_buf == read_buf+2*128*1024);
-    memcpy(read_buf+128*1024, rmw_buf, 128*1024);
-    memcpy(read_buf+2*128*1024, rmw_buf+128*1024, 128*1024);
+    assert(stripes[2].read_buf == (uint8_t*)read_buf+128*1024);
+    assert(stripes[3].read_buf == (uint8_t*)read_buf+2*128*1024);
+    memcpy((uint8_t*)read_buf+128*1024, rmw_buf, 128*1024);
+    memcpy((uint8_t*)read_buf+2*128*1024, (uint8_t*)rmw_buf+128*1024, 128*1024);
    reconstruct_stripes_jerasure(stripes, 4, 2, 0);
    check_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1);
    check_pattern(stripes[0].read_buf+128*1024-4096, 4096, PATTERN3);
@@ -725,7 +725,7 @@ void test14()
    use_jerasure(3, 2, true);
    osd_num_t osd_set[3] = { 1, 2, 0 };
    osd_num_t write_osd_set[3] = { 1, 2, 3 };
-    osd_rmw_stripe_t stripes[3] = { 0 };
+    osd_rmw_stripe_t stripes[3] = {};
    unsigned bitmaps[3] = { 0 };
    // Test 13.0
    void *write_buf = malloc_or_die(8192);
@@ -744,11 +744,11 @@ void test14()
    assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
-    assert(stripes[0].read_buf == rmw_buf+128*1024);
-    assert(stripes[1].read_buf == rmw_buf+2*128*1024-4096);
+    assert(stripes[0].read_buf == (uint8_t*)rmw_buf+128*1024);
+    assert(stripes[1].read_buf == (uint8_t*)rmw_buf+2*128*1024-4096);
    assert(stripes[2].read_buf == NULL);
    assert(stripes[0].write_buf == write_buf);
-    assert(stripes[1].write_buf == write_buf+4096);
+    assert(stripes[1].write_buf == (uint8_t*)write_buf+4096);
    assert(stripes[2].write_buf == rmw_buf);
    // Test 13.2 - encode
    set_pattern(write_buf, 8192, PATTERN3);
@@ -765,7 +765,7 @@ void test14()
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
    assert(stripes[0].write_buf == write_buf);
-    assert(stripes[1].write_buf == write_buf+4096);
+    assert(stripes[1].write_buf == (uint8_t*)write_buf+4096);
    assert(stripes[2].write_buf == rmw_buf);
    // Test 13.3 - decode and verify
    osd_num_t read_osd_set[4] = { 0, 2, 3 };
@@ -788,8 +788,8 @@ void test14()
        stripes[i].bmp_buf = bitmaps+i;
    assert(read_buf);
    assert(stripes[0].read_buf == read_buf);
-    assert(stripes[1].read_buf == read_buf+128*1024);
-    assert(stripes[2].read_buf == read_buf+2*128*1024);
+    assert(stripes[1].read_buf == (uint8_t*)read_buf+128*1024);
+    assert(stripes[2].read_buf == (uint8_t*)read_buf+2*128*1024);
    set_pattern(stripes[1].read_buf, 4096, PATTERN3);
    set_pattern(stripes[1].read_buf+4096, 128*1024-4096, PATTERN2);
    memcpy(stripes[2].read_buf, rmw_buf, 128*1024);
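Note on the casts above: arithmetic on `void*` is a GNU extension, so clang rejects expressions such as `rmw_buf+128*1024`. A minimal sketch of the portable form follows. `byte_offset` is a hypothetical helper used only for illustration; it is not part of Vitastor, which writes the cast inline.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the Vitastor sources): advance an untyped
// buffer pointer by a byte count without relying on GNU void* arithmetic.
static inline void *byte_offset(void *buf, std::size_t bytes)
{
    assert(buf != nullptr);
    // uint8_t has size 1, so adding `bytes` moves the pointer by exactly `bytes` bytes
    return static_cast<uint8_t*>(buf) + bytes;
}
```

With such a helper, a check like `stripes[1].read_buf == (uint8_t*)rmw_buf+128*1024` could equally be written as `stripes[1].read_buf == byte_offset(rmw_buf, 128*1024)`.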
--- a/src/osd_secondary.cpp
+++ b/src/osd_secondary.cpp
@@ -54,8 +54,8 @@ void osd_t::exec_secondary(osd_op_t *cur_op)
            void *cur_buf = reply_buf;
            for (int i = 0; i < n; i++)
            {
-                bs->read_bitmap(ov[i].oid, ov[i].version, cur_buf + sizeof(uint64_t), (uint64_t*)cur_buf);
-                cur_buf += (8 + clean_entry_bitmap_size);
+                bs->read_bitmap(ov[i].oid, ov[i].version, (uint8_t*)cur_buf + sizeof(uint64_t), (uint64_t*)cur_buf);
+                cur_buf = (uint8_t*)cur_buf + (8 + clean_entry_bitmap_size);
            }
            free(cur_op->buf);
            cur_op->buf = reply_buf;
@@ -159,7 +159,7 @@ void osd_t::exec_show_config(osd_op_t *cur_op)
        { "readonly", readonly },
        { "immediate_commit", (immediate_commit == IMMEDIATE_ALL ? "all" :
            (immediate_commit == IMMEDIATE_SMALL ? "small" : "none")) },
-        { "lease_timeout", etcd_report_interval+(MAX_ETCD_ATTEMPTS*(2*ETCD_QUICK_TIMEOUT)+999)/1000 },
+        { "lease_timeout", etcd_report_interval+(st_cli.max_etcd_attempts*(2*st_cli.etcd_quick_timeout)+999)/1000 },
    };
 #ifdef WITH_RDMA
    if (msgr.is_rdma_enabled())
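Note on the `lease_timeout` expression above: the `(x+999)/1000` term is integer division rounded up. The sketch below isolates that arithmetic. It assumes the report interval is in seconds and the quick timeout is in milliseconds, which the diff itself does not state; `lease_timeout_sec` and its parameter names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical free function; in the diff the values come from osd_t and st_cli members.
// Assumption: report interval in seconds, quick timeout in milliseconds.
static uint64_t lease_timeout_sec(uint64_t report_interval_sec, uint64_t max_attempts, uint64_t quick_timeout_ms)
{
    // (x + 999) / 1000 rounds a millisecond total up to whole seconds
    return report_interval_sec + (max_attempts * (2 * quick_timeout_ms) + 999) / 1000;
}
```

Under these assumptions, an interval of 5 with 5 attempts and a 1000 ms quick timeout gives 5 + 10 = 15.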
--- a/src/osd_test.cpp
+++ b/src/osd_test.cpp
@@ -16,6 +16,7 @@

 #include <stdexcept>

+#include "addr_util.h"
 #include "osd_ops.h"
 #include "rw_blocking.h"
 #include "test_pattern.h"
@@ -133,17 +134,14 @@ int main(int narg, char *args[])

 int connect_osd(const char *osd_address, int osd_port)
 {
-    struct sockaddr_in addr;
-    int r;
-    if ((r = inet_pton(AF_INET, osd_address, &addr.sin_addr)) != 1)
+    struct sockaddr addr;
+    if (!string_to_addr(osd_address, 0, osd_port, &addr))
    {
-        fprintf(stderr, "server address: %s%s\n", osd_address, r == 0 ? " is not valid" : ": no ipv4 support");
+        fprintf(stderr, "server address: %s is not valid\n", osd_address);
        return -1;
    }
-    addr.sin_family = AF_INET;
-    addr.sin_port = htons(osd_port);

-    int connect_fd = socket(AF_INET, SOCK_STREAM, 0);
+    int connect_fd = socket(addr.sa_family, SOCK_STREAM, 0);
    if (connect_fd < 0)
    {
        perror("socket");
--- a/src/qemu_driver.c
+++ b/src/qemu_driver.c
@@ -308,7 +308,7 @@ static void vitastor_close(BlockDriverState *bs)
 static int vitastor_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
 {
    bsz->phys = 4096;
-    bsz->log = 4096;
+    bsz->log = 512;
    return 0;
 }
 #endif
--- a/src/ringloop.cpp
+++ b/src/ringloop.cpp
@@ -112,3 +112,17 @@ void ring_loop_t::restore(unsigned sqe_tail)
    }
    ring.sq.sqe_tail = sqe_tail;
 }
+
+int ring_loop_t::sqes_left()
+{
+    struct io_uring_sq *sq = &ring.sq;
+    unsigned int head = io_uring_smp_load_acquire(sq->khead);
+    unsigned int next = sq->sqe_tail + 1;
+    int left = *sq->kring_entries - (next - head);
+    if (left > free_ring_data_ptr)
+    {
+        // return min(sqes left, ring_datas left)
+        return free_ring_data_ptr;
+    }
+    return left;
+}
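Note on `sqes_left()` above: the calculation depends on unsigned subtraction staying correct after the head and tail counters wrap around. The following stand-alone model reproduces the same arithmetic with plain integers so it can be checked without liburing. `slots_left` and its parameters are hypothetical names; the real method reads the values from `ring.sq` and `free_ring_data_ptr`.

```cpp
#include <cstdint>

// Hypothetical model of the calculation in ring_loop_t::sqes_left().
// khead / sqe_tail are free-running 32-bit counters, ring_entries is the SQ size,
// ring_datas_left stands in for free_ring_data_ptr.
static int slots_left(uint32_t khead, uint32_t sqe_tail, uint32_t ring_entries, int ring_datas_left)
{
    uint32_t next = sqe_tail + 1;
    // (next - khead) is computed modulo 2^32, so it remains the number of
    // occupied slots (plus one) even after the counters wrap around
    int left = (int)(ring_entries - (next - khead));
    // return min(sqes left, ring_datas left), as in the diff
    return left > ring_datas_left ? ring_datas_left : left;
}
```

As in the diff, the `+ 1` makes the result one lower than `ring_entries - (sqe_tail - khead)`.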
--- a/src/ringloop.h
+++ b/src/ringloop.h
@@ -17,15 +17,12 @@

 static inline void my_uring_prep_rw(int op, struct io_uring_sqe *sqe, int fd, const void *addr, unsigned len, off_t offset)
 {
-    sqe->opcode = op;
-    sqe->flags = 0;
-    sqe->ioprio = 0;
-    sqe->fd = fd;
-    sqe->off = offset;
-    sqe->addr = (unsigned long) addr;
-    sqe->len = len;
-    sqe->rw_flags = 0;
-    sqe->__pad2[0] = sqe->__pad2[1] = sqe->__pad2[2] = 0;
+    // Prepare a read/write operation without clearing user_data
+    // Very recently, 22 Dec 2021, liburing finally got this change too (8ecd3fd959634df81d66af8b3a69c16202a014e8)
+    // But all versions prior to it (sadly) clear user_data
+    __u64 user_data = sqe->user_data;
+    io_uring_prep_rw(op, sqe, fd, addr, len, offset);
+    sqe->user_data = user_data;
 }

 static inline void my_uring_prep_readv(struct io_uring_sqe *sqe, int fd, const struct iovec *iovecs, unsigned nr_vecs, off_t offset)
@@ -172,6 +169,7 @@ public:
        struct io_uring_cqe *cqe;
        return io_uring_wait_cqe(&ring, &cqe);
    }
+    int sqes_left();
    inline unsigned space_left()
    {
        return free_ring_data_ptr;
--- a/src/rw_blocking.cpp
+++ b/src/rw_blocking.cpp
@@ -3,7 +3,10 @@

 #include <errno.h>
 #include <stdlib.h>
+#include <stdint.h>
 #include <stdio.h>
+#include <sys/types.h>
+#include <sys/socket.h>

 #include "rw_blocking.h"

@@ -20,7 +23,7 @@ int read_blocking(int fd, void *read_buf, size_t remaining)
                // EOF
                return done;
            }
-            else if (errno != EAGAIN && errno != EPIPE)
+            else if (errno != EINTR && errno != EAGAIN && errno != EPIPE)
            {
                perror("read");
                exit(1);
@@ -28,7 +31,7 @@ int read_blocking(int fd, void *read_buf, size_t remaining)
            continue;
        }
        done += r;
-        read_buf += r;
+        read_buf = (uint8_t*)read_buf + r;
    }
    return done;
 }
@@ -41,7 +44,7 @@ int write_blocking(int fd, void *write_buf, size_t remaining)
        size_t r = write(fd, write_buf, remaining-done);
        if (r < 0)
        {
-            if (errno != EAGAIN && errno != EPIPE)
+            if (errno != EINTR && errno != EAGAIN && errno != EPIPE)
            {
                perror("write");
                exit(1);
@@ -49,7 +52,7 @@ int write_blocking(int fd, void *write_buf, size_t remaining)
            continue;
        }
        done += r;
-        write_buf += r;
+        write_buf = (uint8_t*)write_buf + r;
    }
    return done;
 }
@@ -60,30 +63,31 @@ int readv_blocking(int fd, iovec *iov, int iovcnt)
    int done = 0;
    while (v < iovcnt)
    {
-        ssize_t r = readv(fd, iov, iovcnt);
+        ssize_t r = readv(fd, iov+v, iovcnt-v);
        if (r < 0)
        {
-            if (errno != EAGAIN && errno != EPIPE)
+            if (errno != EINTR && errno != EAGAIN && errno != EPIPE)
            {
                perror("writev");
                exit(1);
            }
            continue;
        }
+        done += r;
        while (v < iovcnt)
        {
            if (iov[v].iov_len > r)
            {
                iov[v].iov_len -= r;
-                iov[v].iov_base += r;
+                iov[v].iov_base = (uint8_t*)iov[v].iov_base + r;
                break;
            }
            else
            {
+                r -= iov[v].iov_len;
                v++;
            }
        }
-        done += r;
    }
    return done;
 }
@@ -94,30 +98,69 @@ int writev_blocking(int fd, iovec *iov, int iovcnt)
    int done = 0;
    while (v < iovcnt)
    {
-        ssize_t r = writev(fd, iov, iovcnt);
+        ssize_t r = writev(fd, iov+v, iovcnt-v);
        if (r < 0)
        {
-            if (errno != EAGAIN && errno != EPIPE)
+            if (errno != EINTR && errno != EAGAIN && errno != EPIPE)
            {
                perror("writev");
                exit(1);
            }
            continue;
        }
+        done += r;
        while (v < iovcnt)
        {
            if (iov[v].iov_len > r)
            {
                iov[v].iov_len -= r;
-                iov[v].iov_base += r;
+                iov[v].iov_base = (uint8_t*)iov[v].iov_base + r;
                break;
            }
            else
            {
+                r -= iov[v].iov_len;
+                v++;
+            }
+        }
+    }
+    return done;
+}
+
+int sendv_blocking(int fd, iovec *iov, int iovcnt, int flags)
+{
+    struct msghdr msg = { 0 };
+    int v = 0;
+    int done = 0;
+    while (v < iovcnt)
+    {
+        msg.msg_iov = iov+v;
+        msg.msg_iovlen = iovcnt-v;
+        ssize_t r = sendmsg(fd, &msg, flags);
+        if (r < 0)
+        {
+            if (errno != EINTR && errno != EAGAIN && errno != EPIPE)
+            {
+                perror("sendmsg");
+                exit(1);
+            }
+            continue;
+        }
+        done += r;
+        while (v < iovcnt)
+        {
+            if (iov[v].iov_len > r)
+            {
+                iov[v].iov_len -= r;
+                iov[v].iov_base = (uint8_t*)iov[v].iov_base + r;
+                break;
+            }
+            else
+            {
+                r -= iov[v].iov_len;
                v++;
            }
        }
-        done += r;
    }
    return done;
 }
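Note on the vectored I/O fixes above: after a short `readv`/`writev`/`sendmsg`, the transferred byte count has to be subtracted from each fully consumed iovec before moving to the next one, and the next call has to start at `iov+v`. The sketch below extracts that bookkeeping into a function that can be exercised without a file descriptor. `advance_iov` is a hypothetical name; the diff keeps the loop inline in each function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sys/uio.h>

// Hypothetical helper: account for `r` transferred bytes in iov[v..iovcnt).
// Returns the index of the first iovec that still has data left
// (or iovcnt when everything has been transferred).
static int advance_iov(iovec *iov, int iovcnt, int v, size_t r)
{
    assert(v >= 0 && v <= iovcnt);
    while (v < iovcnt)
    {
        if (iov[v].iov_len > r)
        {
            // Partially transferred entry: shrink it and stop here
            iov[v].iov_len -= r;
            iov[v].iov_base = (uint8_t*)iov[v].iov_base + r;
            break;
        }
        // Fully transferred entry: consume its length and move on
        r -= iov[v].iov_len;
        v++;
    }
    return v;
}
```

Without the `r -= iov[v].iov_len` step, the full byte count would be compared against every following entry, which is one of the things the diff corrects.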
--- a/src/rw_blocking.h
+++ b/src/rw_blocking.h
@@ -10,3 +10,4 @@ int read_blocking(int fd, void *read_buf, size_t remaining);
 int write_blocking(int fd, void *write_buf, size_t remaining);
 int readv_blocking(int fd, iovec *iov, int iovcnt);
 int writev_blocking(int fd, iovec *iov, int iovcnt);
+int sendv_blocking(int fd, iovec *iov, int iovcnt, int flags);
--- a/src/stub_bench.cpp
+++ b/src/stub_bench.cpp
@@ -21,6 +21,7 @@

 #include <stdexcept>

+#include "addr_util.h"
 #include "rw_blocking.h"
 #include "osd_ops.h"

@@ -66,16 +67,14 @@ int main(int narg, char *args[])

 int connect_stub(const char *server_address, int server_port)
 {
-    struct sockaddr_in addr;
-    int r;
-    if ((r = inet_pton(AF_INET, server_address, &addr.sin_addr)) != 1)
+    struct sockaddr addr;
+    if (!string_to_addr(server_address, 0, server_port, &addr))
    {
-        fprintf(stderr, "server address: %s%s\n", server_address, r == 0 ? " is not valid" : ": no ipv4 support");
+        fprintf(stderr, "server address: %s is not valid\n", server_address);
        return -1;
    }
-    addr.sin_family = AF_INET;
-    addr.sin_port = htons(server_port);
-    int connect_fd = socket(AF_INET, SOCK_STREAM, 0);
+
+    int connect_fd = socket(addr.sa_family, SOCK_STREAM, 0);
    if (connect_fd < 0)
    {
        perror("socket");
Author	SHA1	Message	Date
Vitaliy Filippov	36f352f06f	Release 0.6.13 - Fix client hangs possible on OSD restarts (bug affected versions from 0.5.11) - Fix "Assertion `sqe != NULL' failed" io_uring-related crashes possible on some kernels (0.6.11 increased probability of this bug) - Fix timeout=0 in NBD proxy - Fix build under centos 7	2022-02-03 01:50:30 +03:00
Vitaliy Filippov	318cc463c2	Fix warnings	2022-02-03 01:50:30 +03:00
Vitaliy Filippov	145e5cfb86	MCL_ONFAULT is not available under centos 7	2022-02-03 01:42:19 +03:00
Vitaliy Filippov	73ae578981	Add osd_memlock option	2022-02-02 01:40:22 +03:00
Vitaliy Filippov	20ee4ed758	Update some parameter docs	2022-02-01 22:46:13 +03:00
Vitaliy Filippov	63de79d1b2	Change > to \| to preserve newlines	2022-02-01 22:45:12 +03:00
Vitaliy Filippov	f712967079	And one more sqe starvation fix	2022-02-01 02:50:16 +03:00
Vitaliy Filippov	df0cd85352	Fix another part of the "async sqe clear" bug (followup to `d9857a5340`)	2022-02-01 01:14:56 +03:00
Vitaliy Filippov	ebaf4d7a72	Fix compatibility with fio 3.28+	2022-01-31 23:39:14 +03:00
Vitaliy Filippov	d4bc10542c	Fix compatibility with liburing >= 2.1 where it only has __pad2[2]	2022-01-31 22:49:40 +03:00
Vitaliy Filippov	140309620a	Free recv_buf in nbd_proxy	2022-01-31 20:37:58 +03:00
Vitaliy Filippov	0a610ee943	Destroy the client after completing CLI command	2022-01-31 18:27:04 +03:00
Vitaliy Filippov	f3ce166064	Do not print nan% in df when a pool has no available OSDs	2022-01-31 18:23:57 +03:00
Vitaliy Filippov	717d303370	Handle get_sqe failures, don't die with "will fall out of sync" in epoll_manager. The problem is that in recent kernels io_uring may return completions BEFORE clearing the submission queue. For example, if its capacity is 512 and there were 512 requests, one of which completed, then the queue "should have" 1 free slot when that completion is processed. But sometimes it doesn't, because io_uring doesn't always clear the submission queue before sending the CQE :-/	2022-01-31 02:52:20 +03:00
Vitaliy Filippov	d9857a5340	Check for SQEs, not for completions. Should finally fix Assertion `sqe != NULL' failed introduced after journaling refactor in 0.6.11...	2022-01-31 02:19:10 +03:00
Vitaliy Filippov	eb5d9153e8	Fix build under centos 7	2022-01-30 20:29:44 +03:00
Vitaliy Filippov	ae6d1ed1d5	Remove completed items	2022-01-30 20:20:06 +03:00
Vitaliy Filippov	d123e58ea3	Fix yaml syntax - remove ` in default	2022-01-29 02:08:48 +03:00
Vitaliy Filippov	d9869d8116	Add parameter documentation	2022-01-28 02:45:54 +03:00
Vitaliy Filippov	4047ca606f	Add missing cancel_op(currently being read op) when stopping a client. Fixes client hangs possible after stopping & restarting an osd. Hangs happened when a connection was closed in the middle of reading a READ operation reply from the network. In this case the operation being read was in read_op and the client didn't free it when closing the connection. Test case for msgr_read.cpp: (1) partially read reply for a READ operation; (2) stop_client(); (3) check that the READ operation returns EPIPE. The bug was actually introduced in 0.5.11.	2022-01-28 01:53:52 +03:00
Vitaliy Filippov	218e294e9c	> 0, of course	2022-01-24 13:36:09 +03:00
Vitaliy Filippov	c1929cabe0	Release 0.6.12 etcd connection stability, clang & elbrus support - Fix build under CLang and Elbrus LCC compilers, making Vitastor compatible with Elbrus CPUs :) - Completely fix the bug where OSDs didn't connect to peers and incorrectly marked PGs as incomplete - Limit I/O depth for deletes the same way as for small writes. Makes OSD crashes with "Assertion failed: sqe != NULL" during image deletion go away - Fix a very old, but rare, journaling bug (credits to https://github.com/mirrorll) - Fix flushing of unclean journaled objects leading to OSDs sometimes hanging after failover in EC setups (bug was introduced in 0.6.7) - Fix several problems that could prevent smooth operation of a Vitastor cluster under the condition of partial etcd failure: - OSDs could randomly fail due to too strict error handling - New clients and OSDs could be unable to start because of the lack of retries - CLI could fail some commands because of the lack of retries - Monitor could stop receiving state updates because of the lack of websocket pings - Fix monitor being unable to rebalance PGs after a downscale of pool pg_size (3->2) - Exit with failure when trying to nbd map or benchmark a non-existing image - Use HTTP keep-alive for etcd connections - Allow to configure etcd request timeouts and retries - Allow to configure NBD timeout, max devices and partitions, and set default to up to 64 devices with up to 3 partitions each	2022-01-24 01:15:25 +03:00
Vitaliy Filippov	cc6b24e03a	Allow to configure NBD timeout, max devices and partitions. Also set default NBD devices/partitions to 64/3; the Linux default is 16/16, which is way too low	2022-01-24 01:15:19 +03:00
Vitaliy Filippov	0757ba630a	Do not happily NBD "map" non-existing images, do not try to benchmark them too	2022-01-23 23:03:42 +03:00
Vitaliy Filippov	2a0b881685	Respect max_write_iodepth for deletes	2022-01-23 22:05:23 +03:00
Vitaliy Filippov	9a15b843ff	Do not set pg_real_size to 0	2022-01-23 20:15:04 +03:00
Vitaliy Filippov	8dc1ffb13b	Try to connect with PG peers before deciding it's incomplete :) I already attempted to fix it in 0.6.11, but it happened so that the fix was only partial :)	2022-01-23 19:19:26 +03:00
Vitaliy Filippov	ba63af49b4	Add etcd retries everywhere (they were missing in some places)	2022-01-23 17:21:48 +03:00
Vitaliy Filippov	31b9c683ee	Fix flushing of unclean objects. This was preventing OSD failover when there were some unclean objects. Bug was introduced in `aa436027c8`	2022-01-23 00:45:11 +03:00
Vitaliy Filippov	3abcac058f	Check for double response_callback call more	2022-01-23 00:26:20 +03:00
Vitaliy Filippov	e01c4db702	Add paranoid if()s to prevent accidental double free of etcd_watch_ws	2022-01-23 00:16:09 +03:00
Vitaliy Filippov	a5cf06acd0	Remove etcd timeout and keepalive interval hardcode	2022-01-23 00:00:00 +03:00
Vitaliy Filippov	9c3653b1e1	Handle EINTR	2022-01-22 23:59:37 +03:00
Vitaliy Filippov	23e578b6a2	Fix common.sh	2022-01-21 01:51:25 +03:00
Vitaliy Filippov	7920414bee	Fix build under older gcc (debian buster)	2022-01-20 10:34:52 +03:00
Vitaliy Filippov	098e369a3b	Fix rand initialization, add etcd connection/disconnection logging	2022-01-20 00:45:49 +03:00
Vitaliy Filippov	a43ef525a2	Remove two last end()s from http_client (should have been removed in the keepalive patch)	2022-01-20 00:44:18 +03:00
Vitaliy Filippov	8a6b07d8f7	Add a 2/5 etcd failure test	2022-01-20 00:43:22 +03:00
Vitaliy Filippov	2c930d55fb	Merge pull request #41 from promobit-bitblaze/1-small-fix #1 fix deps	2022-01-18 11:19:08 +03:00
Mikhail Koshel	d798e0821e	#1 fix deps	2022-01-18 13:30:53 +06:00
Vitaliy Filippov	e591a3e9f7	Include sys/stat.h in messenger.cpp. No idea why, but it builds without it on x86 and does not build on e2k	2022-01-17 13:43:29 +03:00
Vitaliy Filippov	77cc18420a	Fix leaks detected by clang scan-build (only 1 of 4 may be important though)	2022-01-16 00:11:59 +03:00
Vitaliy Filippov	7bdd92ca4f	Fix build under clang and some warnings. Build problems fixed: - void* pointer arithmetic, which is a GNU extension (works as byte*) - "variable size object may not be initialized", which is OK under GCC - nullptr_t related error in json11 (it lacks 'operator <' in clang). Warnings fixed: - empty nested struct initializer { 0 } replaced by {} - removed several unused lambda captures	2022-01-16 00:02:54 +03:00
Vitaliy Filippov	8f64fc61e7	Ignore empty events in mon	2022-01-08 11:41:00 +03:00
Vitaliy Filippov	4a9f001d9e	Make mon also ping etcd websockets regularly	2022-01-05 17:28:51 +03:00
Vitaliy Filippov	8c908316d9	Add a test with an OSD being added	2022-01-05 17:06:24 +03:00
Vitaliy Filippov	515a2e6e33	Only die when detecting a real race condition, not just a CAS failure	2022-01-05 17:05:25 +03:00
Vitaliy Filippov	68b6763ebe	Add asserts for lp-optimizer tests, pass `ordered` from the monitor	2022-01-03 20:37:07 +03:00
Vitaliy Filippov	9c6168bf17	Remove fill_parsed_response	2022-01-03 20:08:26 +03:00
Vitaliy Filippov	08e467270a	Fix pg_size changing from 3 to 2	2022-01-03 17:56:54 +03:00
Vitaliy Filippov	5473d5b4a2	Rework HTTP client to use keepalive, move getifaddr_list to addr_util	2022-01-03 14:52:01 +03:00
Vitaliy Filippov	c3304bce27	Merge pull request #38 from mirrorll/master journal check_available error	2021-12-31 12:45:16 +03:00
Vitaliy Filippov	ec2852c598	Add minsize_1 test	2021-12-28 10:54:36 +03:00
Vitaliy Filippov	b9f5c2a823	Support zero-copy send in fio_sec_osd to allow testing it. Preliminary results: - CPU usage drops significantly. For example, in T1Q8 128K write test against stub_uring_osd with 10G network and Athlon X4 860k CPU it drops from 100% to 30% - Latency becomes slightly worse. In T1Q1 4K write test in the same environment latency increases from 56 to 63 us. - Small write throughput also becomes slightly worse. In T1Q128 4K write test against stub iops decreases from 138k to ~110k (unstable, fluctuates 100k..120k). Note that this is without io_uring, of course.	2021-12-27 02:12:44 +03:00
Vitaliy Filippov	e9d2f79aa7	Support reading bitmaps in fio_sec_osd	2021-12-27 02:12:44 +03:00
Vitaliy Filippov	0785bdf8b3	Release 0.6.11 - Slightly reduce journaling write amplification (requires no_same_sector_overwrites=false) - Fix listen_backlog (it was 0) because it could more than halve OSD socket send speed - Support IPv6 OSD addresses - Do not try to initialize client in simple-offsets - Fix OSDs sometimes marking PGs incomplete instead of trying to connect with peers - Allow to configure OSD placement in node_placement - Allow to run with 4k sector size block devices. Natural, but it was forbidden	2021-12-26 21:11:24 +03:00
Vitaliy Filippov	b57e44748b	Send 4 byte bitmap in stub_uring_osd	2021-12-25 11:38:13 +03:00
Vitaliy Filippov	1bbe62f29c	Fix uninitialized listen_backlog which was leading to REALLY SLOW send speeds!!!	2021-12-25 11:38:13 +03:00
lihai	3061c30132	journal check_available error	2021-12-21 09:39:58 +08:00
Vitaliy Filippov	20a4406acc	Support IPv6 OSD addresses	2021-12-19 10:42:17 +03:00
Vitaliy Filippov	f93491bc6c	Implement journal write batching and slightly refactor journal writes. Slightly reduces WA. For example, in 4K T1Q128 replicated randwrite tests WA is reduced from ~3.6 to ~3.1, in T1Q64 from ~3.8 to ~3.4. Only effective without no_same_sector_overwrites.	2021-12-16 00:27:17 +03:00
Vitaliy Filippov	999bed8514	Fix opening regular files as blockstore	2021-12-15 02:08:58 +03:00
Vitaliy Filippov	3f33095fd7	Do not try to initialize client in simple-offsets	2021-12-15 02:07:27 +03:00
Vitaliy Filippov	dd74c5ce1b	Fix OSDs marking PGs incomplete instead of trying to connect with peers	2021-12-14 01:57:51 +03:00
Vitaliy Filippov	c6d104ecd6	Print object version on fatal overwrite	2021-12-14 01:57:04 +03:00
Vitaliy Filippov	e544aef7d0	Fix test rw_blocking	2021-12-12 23:24:50 +03:00
Vitaliy Filippov	616c18c786	Fix stub_uring_osd	2021-12-12 23:06:11 +03:00
Vitaliy Filippov	fa687d3878	Allow to configure OSD placement in node_placement	2021-12-12 01:25:45 +03:00
Vitaliy Filippov	2c7556e536	Allow to run with 4k sector size. Natural, but it was forbidden	2021-12-11 22:03:16 +00:00
Vitaliy Filippov	2020608a39	Release 0.6.10 - Implement a storage plugin for Proxmox. Now you can use Vitastor with Proxmox! - Implement `vitastor-cli df` (pool space usage statistics) command - Add glob pattern support for `vitastor-cli ls` - Fix several bugs in other CLI commands (resize, create --parent, modify --readonly) - Use 512 byte logical block size in QEMU driver by default (and thus don't require to set it in QEMU options)	2021-12-10 21:40:12 +03:00
Vitaliy Filippov	139b98d80f	Exclude block/vitastor.c from patches and add script to easily re-add it	2021-12-10 21:38:36 +03:00
Vitaliy Filippov	f54ff6ad5d	Do not crash in simple-offsets when some options are empty, too	2021-12-10 12:27:25 +03:00
Vitaliy Filippov	b376ef2ed9	Do not crash on empty matched_addrs	2021-12-10 11:40:59 +03:00
Vitaliy Filippov	5a234588b9	Do not die when invoked via `vita` symlink	2021-12-10 02:45:16 +03:00
Vitaliy Filippov	b82c30328f	Use vitastor-cli df to show pool stats in Proxmox	2021-12-10 02:42:31 +03:00
Vitaliy Filippov	0ee5e0a7fe	Implement vitastor-cli df command	2021-12-10 02:37:02 +03:00
Vitaliy Filippov	0a1640d169	Some important fixes for our new Proxmox driver	2021-12-10 01:17:06 +03:00
Vitaliy Filippov	3482bb0860	Fix readonly/readwrite option parsing	2021-12-10 00:52:59 +03:00
Vitaliy Filippov	526995f486	Do not skip empty iops in listings	2021-12-10 00:52:59 +03:00
Vitaliy Filippov	073b505928	Package Proxmox plugin as pve-storage-vitastor	2021-12-10 00:22:45 +03:00
Vitaliy Filippov	a8b21a22d0	Add patch for pve-qemu 6.1	2021-12-09 02:57:43 +03:00
Vitaliy Filippov	0b1ffba62b	Add Proxmox storage driver	2021-12-09 02:26:54 +03:00
Vitaliy Filippov	8dfbd7943c	Use logical block size = 512 bytes by default	2021-12-08 23:43:40 +03:00
Vitaliy Filippov	39e7f98e54	Allow to change etcd IP in tests	2021-12-08 23:00:48 +03:00
Vitaliy Filippov	3a83a32cb7	Aaand now fix create --parent :D	2021-12-08 23:00:34 +03:00
Vitaliy Filippov	20d5ed799a	Add glob pattern matching for ls	2021-12-08 23:00:34 +03:00
Vitaliy Filippov	b262938bca	Fix naggy "Failed to get RDMA device list: Unknown error -38"	2021-12-08 02:02:30 +03:00
Vitaliy Filippov	7e54242251	Add patches for Proxmox QEMU 5.1 and 5.2	2021-12-05 17:45:01 +03:00
Vitaliy Filippov	c3c2e68cc1	Now fix resize command :D	2021-12-05 01:38:08 +03:00
@@ -0,0 +1 @@
+patches/PVE_VitastorPlugin.pm usr/share/perl5/PVE/Storage/Custom/VitastorPlugin.pm