Compare commits

225 Commits

Author SHA1 Message Date
fb2f7a0d3c Release 0.6.6
- New command-line tool: vitastor-cli
- Implement layer (snapshot/clone) merge and delete
- Remove 'bool' from the C header
- Fix a very rare flusher stall
- More diagnostics now printed for slow ops in the log
2021-10-19 02:26:37 +03:00
38d85da19a Fix build for older gcc 2021-10-19 02:26:37 +03:00
dc3caee284 Add Dockerfile 2021-10-19 02:26:37 +03:00
89dcda1fed Remove "bool" from the C header 2021-10-18 01:49:07 +03:00
1526e2055e Do not crash with RDMA when receiving garbage, free RDMA buffers when connection is closed 2021-10-15 23:56:22 +03:00
74cb3911db Rebase children of the "inverse" child when it is removed, change /index/image/%s keys during metadata ops 2021-09-26 13:41:48 +03:00
d5efbbb6b9 Rename commands and add CLI help 2021-09-26 13:14:36 +03:00
4319091bd3 Implement "inverse merge" optimisation 2021-09-26 12:59:04 +03:00
6d307d5391 Ignore "readonly" flag when merging snapshots 2021-09-26 11:32:42 +03:00
065dfef683 Rename vitastor-cmd to vitastor-cli 2021-09-26 00:52:05 +03:00
4d6b85fe67 Split one big cmd.cpp into multiple files 2021-09-26 00:48:08 +03:00
2dd2f29f46 Move get_inode_cfg to cli_tool_t 2021-09-25 23:36:45 +03:00
fc3a1e076a Fix minor bugs in snapshot removal, check it in tests 2021-09-25 19:30:29 +03:00
3a3e168c42 Implement high-level snapshot flatten and remove commands 2021-09-25 01:36:44 +03:00
95c55da0ad Implement merge with CAS 2021-08-01 20:06:05 +03:00
5cf1157f16 Return real version on CAS failure 2021-08-01 20:05:19 +03:00
acf637950c Implement layer merge
A new command merges multiple snapshot/clone layers into one of them,
so merged layers can be deleted after this procedure
2021-07-31 00:23:30 +03:00
a02b02eb04 Use new listing methods in rm_inode 2021-07-20 00:19:34 +03:00
7d3d696110 Implement object listing with controllable parallelism in cluster_client 2021-07-20 00:19:34 +03:00
Vitaliy Filippov
712576ca75 Merge pull request #13 from lnsyyj/wip-vitastor-debug
fix BLOCKSTORE_DEBUG, error: ‘dirty_it’ was not declared in this scope
2021-07-18 01:25:05 +03:00
28bd94d2c2 Make diagnostics slightly better 2021-07-18 01:24:38 +03:00
148ff04aa8 Do not lose flusher queue entries when an "older object rescan" happens in parallel with flushing of an older version of another object 2021-07-18 01:20:54 +03:00
JiangYu
e86df4a2a2 fix BLOCKSTORE_DEBUG, error: ‘dirty_it’ was not declared in this scope
Signed-off-by: JiangYu <lnsyyj@hotmail.com>
2021-07-18 00:46:05 +08:00
e74af9745e Print journal flusher diagnostics on slow ops 2021-07-17 16:13:41 +03:00
0e0509e3da Dump op states in slow operation log 2021-07-16 01:58:50 +03:00
cb282d25e0 Release 0.6.5
- Basic support for OpenStack: Cinder driver, patches for Nova and libvirt
- Add missing "image" and "config_path" QEMU options
- Calculate aggregate per-pool statistics in monitor
- Implement writes with Check-And-Set semantics
- Add a C wrapper library with public header
2021-07-10 11:01:21 +03:00
8b2a4c9539 Fix centos builds (yum-builddep stopped working in el7, cmake in el8..) 2021-07-10 11:01:21 +03:00
b66a079892 State basic OpenStack support 2021-07-10 01:11:20 +03:00
e90bbe6385 Implement OpenStack Cinder driver for Vitastor
It can't delete snapshots yet because layer merge isn't implemented in Vitastor yet;
volumes can only be deleted together with all of their snapshots.
This will be fixed in the near future.
2021-07-10 01:06:29 +03:00
4be761254c Move patches to patches/ 2021-07-09 21:51:19 +03:00
7a45c5f86c buster-backports has broken mesa 2021-07-09 12:29:39 +03:00
bff413584d Fix qemuBlockStorageSourceGetVitastorProps 2021-07-09 02:09:47 +03:00
bb31050ab5 Add missing image, config_path options to QEMU QAPI 2021-07-09 02:09:47 +03:00
b52dd6843a Rename qemu_rbd_unescape and qemu_rbd_next_tok to *_vitastor_* 2021-07-03 23:14:44 +03:00
b66160a7ad Aggregate per-pool statistics in mon 2021-07-03 23:14:44 +03:00
30bb602681 Add _VITASTOR to missing switches in libvirt 7.0 patch 2021-06-28 22:00:23 +03:00
eb0a3adafc Patch libvirt schema, add an example to test libvirt 2021-06-28 01:20:55 +03:00
24301b116c Add libvirt 5.0 patch 2021-06-27 18:43:29 +03:00
1d00c17d68 Add libvirt 7.5 patch 2021-06-27 10:58:12 +03:00
24f19c4b80 Add libvirt 7.0 patch 2021-06-27 00:58:56 +03:00
dfdf5c1f9c Fix comments in mon.js 2021-06-20 00:23:56 +03:00
aad7792d3f Check for loops in parent inode chains 2021-06-20 00:23:03 +03:00
6ca8afffe5 Add CAS version parameter to the C wrapper 2021-06-19 01:00:52 +03:00
511a89948b Rework qemu_proxy into a C wrapper library with public header 2021-06-19 00:39:11 +03:00
3de553ecd7 Add a test for CAS write operation 2021-06-15 00:12:35 +03:00
9c45d43e74 Extract common 3 OSD code from several test scripts 2021-06-15 00:12:35 +03:00
891250d355 Implement CAS writes
From now on, reads will return the server-side object version numbers
and writes and deletes will have an additional "version" parameter
which, if set to a non-zero value, will be atomically compared with
the current version of the object plus 1 and the modification will
fail if it doesn't match.

This feature opens the road to correct online flattening of snapshot
layers and other interesting things.
2021-06-15 00:12:35 +03:00
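As a toy illustration of the semantics described in this commit message (this is not the Vitastor client API — the store, cas_read and cas_write below are made-up names): every object carries a server-side version, reads return it, and a non-zero version passed with a write must equal the current version plus 1 for the write to succeed.

```cpp
// Toy in-memory model of the CAS write semantics described above.
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <utility>

struct object { std::string data; uint64_t version = 0; };
std::map<uint64_t, object> store; // inode number -> object

// A read returns the data together with the server-side version
std::pair<std::string, uint64_t> cas_read(uint64_t inode)
{
    object & o = store[inode];
    return { o.data, o.version };
}

// A write with a non-zero expected version succeeds only if it equals current version + 1
bool cas_write(uint64_t inode, const std::string & data, uint64_t expected_version)
{
    object & o = store[inode];
    if (expected_version != 0 && expected_version != o.version + 1)
        return false; // someone else modified the object in between
    o.data = data;
    o.version++;
    return true;
}

int main()
{
    cas_write(1, "v1", 0);                           // version 0 = unconditional write
    auto [data, ver] = cas_read(1);                  // ver == 1 now
    bool ok1 = cas_write(1, data + "+a", ver + 1);   // succeeds, object moves to version 2
    bool ok2 = cas_write(1, data + "+b", ver + 1);   // fails: the object is already at version 2
    printf("first CAS write: %d, second CAS write: %d\n", (int)ok1, (int)ok2);
}
```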
f9fe72d40a Release 0.6.4
- Implement a basic Kubernetes CSI driver
- Minor fixes for vitastor-nbd
- Fix build without RDMA broken in 0.6.3
2021-05-16 01:38:01 +03:00
10ee4f7c1d Add notes about CSI to README 2021-05-16 01:38:01 +03:00
fd8244699b Implement basic CSI driver
Currently it can create and remove volumes, but resizing and snapshots are not supported yet
2021-05-16 01:15:43 +03:00
eaac1fc5d1 Log to stderr in etcd_state_client, too 2021-05-16 01:09:25 +03:00
57be1923d3 Daemonize NBD_DO_IT process, correctly cleanup unmounted NBD clients 2021-05-16 01:09:25 +03:00
c467acc388 Fix /v3 appendage to etcd URLs without /v3 2021-05-15 19:22:24 +03:00
bf591ba3ee Fix nbd module load check 2021-05-15 19:22:24 +03:00
699a0fbbc7 Log to stderr instead of stdout in client 2021-05-15 19:22:24 +03:00
6b2dd50f27 Fix build without RDMA 2021-05-08 18:20:43 +03:00
caf2f3c56f Release 0.6.3
- RDMA support
- Client performance optimisations (4k randread ~120k -> ~180k on 1 core)
- JSON configuration file (/etc/vitastor/vitastor.conf) support
- Bug fixes
2021-05-02 17:47:43 +03:00
9174f188b1 Build packages with libibverbs
For CentOS 7 it also requires newer rdma-core as CentOS 7's native version doesn't have
implicit ODP support. The updated version is already uploaded into the vitastor repo.
2021-05-02 17:47:16 +03:00
d3978c6d0e Do not print RDMA connection messages when log_level=0
By the way, it's 1 by default in the OSD, so these messages will still be there in OSD logs
2021-05-01 00:26:09 +03:00
4a7365660d Do not wait for down OSDs during sync
Fixes a hang introduced in 0.5.11 in the non-immediate_commit mode
2021-05-01 00:26:07 +03:00
818ae5d61d Some config parsing fixes 2021-05-01 00:20:01 +03:00
6810e93c3f Add RDMA options to mon.js list 2021-04-30 01:23:22 +03:00
f6f35f4127 Pass options correctly to not override /etc/vitastor/vitastor.conf 2021-04-30 01:17:44 +03:00
72aa2fd819 Make OSD and client read common configuration from /etc/vitastor/vitastor.conf 2021-04-30 01:11:27 +03:00
5010b0dd75 Use json11 instead of blockstore_config_t 2021-04-30 00:52:46 +03:00
483c5ab380 Negotiate max_msg instead of max_sge, make buffer settings more conservative :-) 2021-04-29 11:10:35 +03:00
6a6fd6544d Add RDMA options to the QEMU driver 2021-04-29 11:02:49 +03:00
971aa4ae4f Implement RDMA receive with memory copying (send remains zero-copy)
This is the simplest and, as usual, the best implementation :)

100% zero-copy implementation is also possible (see rdma-zerocopy branch),
but it requires creating a lot of queues (~128 per client) to use the QPN as a 'tag',
because there are no receive tags, and the server may simply run out of queues.
The hardware limit is 262144 on Mellanox ConnectX-4, which amounts to only 2048
'connections' per host. And even with that number of queues it's still less optimal
than the non-zerocopy implementation.

In fact, newest hardware like Mellanox ConnectX-5 does have Tag Matching
support, but it's still unsuitable for us because it doesn't support scatter/gather
(tm_caps.max_sge=1).
2021-04-29 02:34:45 +03:00
9e6cbc6ebc Negotiate max_sge between RDMA client & server 2021-04-29 02:15:20 +03:00
ce777319c3 WIP RDMA support
Basic naive implementation works, but it's highly non-optimal as
RNR retransmissions occur all the time. RDMA expects the receiver
to always have room for incoming WRs...
2021-04-29 02:03:54 +03:00
f8ff39b0ab Rework continue_ops() to remove a CPU hot spot
This rework increases fio -rw=randread -iodepth=128 result from ~120k to ~180k iops :)
2021-04-29 01:50:13 +03:00
d749159585 Linked list experiment
Rework client operation queue from a vector to a linked list.
This is required to rework continue_ops() as its current implementation
consumes ~25% of client process CPU.
2021-04-29 01:47:33 +03:00
9703773a63 Fix has_flushes setting 2021-04-28 23:40:44 +03:00
5d8d486f7c Add SOVERSION 2021-04-20 01:01:32 +03:00
2b546cdd55 Link vitastor_blk with vitastor_common for timerfd_manager_t
Not really required to operate, but fixes a verify-elf error
2021-04-20 00:51:53 +03:00
bd7b177707 Report sensitive configuration values instead of the configuration source 2021-04-17 23:11:16 +03:00
33f9d03d22 Update documentation regarding image names and vitastor-nbd 2021-04-17 17:40:12 +03:00
82e6aff17b Support mapping NBD by the image name 2021-04-17 17:39:55 +03:00
57e2c503f7 Rename osd_t::c_cli to msgr 2021-04-17 16:32:09 +03:00
715bc8d53d Release 0.6.2
- Fix a possible crash during SYNC when journal fsyncs are enabled
- Fix a memory leak in the chained read implementation
2021-04-15 23:40:06 +03:00
0af077701c Fix a possible crash during SYNC when journal fsyncs are enabled 2021-04-15 02:01:50 +03:00
cac976ce25 Fix a memory leak in the chained read implementation 2021-04-15 01:42:18 +03:00
acf0646542 Build common sources once 2021-04-15 01:13:34 +03:00
ede1c1d667 Release 0.6.1
A bugfix for the new "chained read from snapshot" feature
2021-04-14 22:32:23 +03:00
38bd51c97f Remove aio_context assertion, it seems it is unneeded 2021-04-14 22:32:15 +03:00
8c9f32cd45 Add run_vm test bash scripts 2021-04-13 16:21:21 +03:00
966fb763ca Oooops, fix chained reads 2021-04-13 16:19:21 +03:00
0b41ffc08d Release 0.6.0
Warning: upgrading from 0.5.x is currently not supported!
Please create an issue if you really need upgrade capability.

New features:
- Snapshots and Copy-on-Write clones
- Inode (image) names
- Inode I/O and space statistics
- Write throttling for smoothing random write workloads in SSD+HDD configurations
2021-04-11 00:49:18 +03:00
64eeb79051 Prevent 0.6.x OSDs from talking to 0.5.x
The new protocol is almost compatible - it has bitmaps, but also it has
a "bitmap_length" field. It's not hard to make 0.5-0.6 OSDs and clients
compatible, but for now I just assume nobody needs it.

If I'm wrong and anybody requests to upgrade their production 0.5.x system
to 0.6.x I'll fix it.
2021-04-10 22:26:17 +03:00
2a02f3c4c7 Add metadata superblock and check it on start
Refuse to start if the superblock is missing or has a bad version;
zero out the metadata area when initializing the superblock.
2021-04-10 22:26:17 +03:00
f684d9101a Refuse to start with old journal version 2021-04-10 17:44:12 +03:00
c72fddd714 Notes about master/0.5.x 2021-04-10 17:44:12 +03:00
a1f2f19489 Do not increment inode statistics if the object already exists 2021-04-10 17:44:12 +03:00
82c1a7ec67 Fix statistics reporting, split inode number into pool & inode 2021-04-10 17:44:12 +03:00
2ab423d4ef Implement journaled write throttling for the SSD+HDD case 2021-04-10 17:44:12 +03:00
4694811eab Add microsecond accuracy to set_timer 2021-04-10 17:44:12 +03:00
6b988de17d Remove timerfd_interval 2021-04-10 17:44:12 +03:00
37efdc2a83 Fix bitmap_set for replicated pools 2021-04-10 17:44:12 +03:00
591cad09c9 Fix bitmaps for objects larger than 128K 2021-04-10 17:44:12 +03:00
b907ad50aa Oops, forgot to add external bitmaps to blockstore in some places 2021-04-10 17:44:12 +03:00
7308d6a6c0 Note about etcd 3.4.15 2021-04-10 17:44:12 +03:00
5f5b6ef150 Enable chained reads in the client 2021-04-10 17:44:12 +03:00
38a3df4a0e Implement chained (optimized) read in the primary OSD code 2021-04-10 17:44:12 +03:00
6950b8e3a0 Watch inode metadata revisions 2021-04-10 17:44:12 +03:00
0cea3576fb Add "read bitmaps" operation to secondary OSD protocol 2021-04-10 17:44:12 +03:00
f01eea07d3 Add simplified interface to read blockstore bitmaps synchronously 2021-04-10 17:44:12 +03:00
2c2f08aca2 Shorten some structure names 2021-04-10 17:44:12 +03:00
d6524670e1 Introduce data distribution locality 2021-04-10 17:44:12 +03:00
879ecfa74d Fix wording 2021-04-10 17:44:12 +03:00
aea2d19d35 Change Telegram chat link 2021-04-10 17:44:12 +03:00
04f86dc00b Fix Russian README for CMake build 2021-04-10 17:44:12 +03:00
7aeb2cbac7 Capture all by value in qemu_proxy 2021-04-10 17:44:12 +03:00
519f081006 Add LICENSE 2021-04-10 17:44:12 +03:00
e50f703e1d Add Russian version of the README 2021-04-10 17:44:12 +03:00
2612d3198a Introduce image names and metadata storage in etcd
Each inode has: image name, parent inode number & pool, size and readonly flag

Snapshots are created by switching image name to a different inode number
while using the older inode as parent.
2021-04-10 17:44:12 +03:00
ab39ce2bbb Use clean_entry_bitmap_size instead of entry_attr_size back because of changed bitmap handling 2021-04-10 17:44:12 +03:00
d0c2e31312 Add a test for snapshots, fix bugs. Now the test passes 2021-04-10 17:44:12 +03:00
9038d42327 Fix several snapshot I/O bugs 2021-04-10 17:44:12 +03:00
691f066055 Actual snapshot support (untested) 2021-04-10 17:44:12 +03:00
ffe1cd4c79 Report inode I/O statistics, aggregate it in the monitor 2021-04-10 17:44:12 +03:00
4ae1b84c67 Report inode space usage statistics to etcd, aggregate it in the monitor 2021-04-10 17:44:12 +03:00
c35963967f Add inode space usage statistics tracking to blockstore 2021-04-10 17:44:12 +03:00
0aa2dd2890 Send bitmaps with primary-reads, actually read bitmaps for READ ops 2021-04-10 17:44:12 +03:00
6bf88883ac Allocate bitmaps along with stripes to avoid memory fragmentation 2021-04-10 17:44:12 +03:00
004f265393 Remove cryptic bitmap inlining from bs_op_t and osd_op_t, use bitmap in primary OSD code 2021-04-10 17:44:12 +03:00
860ac24762 Add "external" bitmap support to the secondary OSD protocol 2021-04-10 17:44:12 +03:00
6107a4d07b Add "external" bitmap support to blockstore 2021-04-10 17:44:12 +03:00
95c29b9dc3 Add "external" bitmap support to osd_rmw 2021-04-10 17:44:12 +03:00
d99407dcec Check QEMU block-vitastor.so during the test 2021-04-10 17:44:12 +03:00
6909807068 Allow to start the OSD just to flush the journal completely 2021-04-10 17:44:12 +03:00
ec90fe6ec1 Release 0.5.13
Another followup to 0.5.11
2021-04-09 12:10:16 +03:00
18c72f4835 Correct reenterability fix (now verified with a test)
It's rather funny but 0.5.12 has to be re-published again
2021-04-09 12:10:16 +03:00
59fbcef734 Release 0.5.12
Fix qemu driver broken in 0.5.11 :)
2021-04-08 15:47:18 +03:00
40b7c21fb1 Followup to 307c1731c1 - fix mark_stable 2021-04-08 15:47:18 +03:00
efb3678606 Fix qemu-img broken in 0.5.11
Caused by the lack of reenterability of the main cluster_client function
2021-04-08 14:59:20 +03:00
462650134e Release 0.5.11
Another bunch of fixes, including important ones. Now OSDs are stable in SSD+HDD
configurations and everything is mostly ready for the merge of master branch.

Features:

- Add min_flusher_count configuration (good for HDDs)
- Shuffle PGs for better data device utilisation
- Make OSDs benefit from the immediate_commit=small setting if it's applicable

Bug fixes:

- Rework client code to fix write ordering during operation replay
- Rework error handling code so OSDs don't crash in reaction to a crash of their peer OSDs
- Fix several block layer problems related to the journal, some of which
  were leading to double allocations of the same block during journal replay
- Fix monitors crashing during the removal of OSD keys from etcd
- Fix data fsyncs being incorrectly disabled when only disable_journal_fsync was set
- Always zero out unused part of request/reply headers
- Fix some theoretically possible read/write ordering issues
- Don't try to "recover" misplaced objects if it would make them degraded
- Fix heartbeats sometimes preventing OSDs from establishing connections
2021-04-08 01:18:46 +03:00
8d87e32175 Fix msgr_op.h includes 2021-04-08 01:18:46 +03:00
b0b2e7df3c Fix use-after-free in keepalive_timer and rework stop_client()
The bug reproduced if fio was temporarily stopped with SIGSTOP
during a write test and then resumed after 10 seconds. In this case
"pings" failed for all clients and the fio process crashed with a
use-after-free in keepalive_timer. It happened because stop_client
was called while a live iterator to the client map was still held.
2021-04-07 11:06:31 +03:00
97efb9e299 Do not crash on PG re-peering events when operations are in progress 2021-04-07 11:06:31 +03:00
f6d705383a Fix client connection recovery bugs, add dirty_ops limit 2021-04-07 11:06:31 +03:00
68567c0e1f Fix messenger possibly trying to connect to the same OSD twice 2021-04-07 01:30:38 +03:00
04b00003e9 Log ping failures 2021-04-07 01:30:38 +03:00
307c1731c1 Forget all dirty_entries before stable big_write or delete during initialisation
This fixes a 'double_alloc' assertion in the following case:
- big_write object #1 v1 to block #100
- big_write object #1 v2 to block #101
- big_write object #2 v1 to block #100
2021-04-07 01:30:38 +03:00
75a6a556b5 Shuffle PGs for better data device utilisation 2021-04-07 01:30:38 +03:00
a48e2bbf18 Fix write replay ordering when immediate_commit != all
Previous implementation didn't respect write ordering and could lead
to corrupted data when restarting writes after an OSD outage

Also rework cluster_client queueing logic and add tests for it to verify the correct behaviour
2021-04-03 14:51:52 +03:00
688821665a Remove stoull_full() from etcd_state_client.cpp 2021-04-03 14:36:04 +03:00
3e162d95a0 Remove http_client.h include from etcd_state_client.h 2021-04-03 14:36:04 +03:00
829381b335 Extract some definitions to msgr_op.{cpp,h} 2021-04-03 14:36:04 +03:00
54f2353f24 Use bitmap granularity for alignment checks 2021-04-03 14:36:04 +03:00
e47f6fba60 Remove cluster_client_t::stop() 2021-04-03 14:35:42 +03:00
883bf84a16 Fix build 2021-04-03 01:47:15 +03:00
52097c4856 Stop flushing when less than min_flusher_count operations are available (unless a trim is forced) 2021-04-03 00:53:28 +03:00
e1355cbc74 Report failed operation name in cluster_client 2021-04-03 00:53:28 +03:00
8f8b90be7a Add min_flusher_count configuration 2021-04-03 00:53:28 +03:00
ad9f619370 Skip double allocs when reading journal 2021-04-03 00:53:28 +03:00
f4769ba7c7 Collapse create+delete journal entry pairs if they're already flushed
Old journal replay mechanism could lead to a double allocation of the same
block and a "Fatal error: tried to overwrite non-zero metadata entry"
2021-04-03 00:53:28 +03:00
843b7052d2 Add an assertion when clearing deleted metadata entries, add debug details when freeing blocks 2021-04-03 00:53:28 +03:00
df99e232ee Deduplicate osd_sets in pg history + raise request size limit for etcd 2021-04-03 00:53:28 +03:00
3a40fa4127 Fix monitor errors in case of OSD removal 2021-03-27 01:15:18 +03:00
4095bcc558 Do not ignore object deletion journal entries when they are preceded by a big write 2021-03-25 11:00:10 +03:00
564d64e271 Add some details for debug prints 2021-03-25 11:00:10 +03:00
cf54741c95 Followup to 05db1308aa
Don't do anything with the object state after errors because
it's freed by PG re-peer in this case
2021-03-25 11:00:10 +03:00
18a5fafa2a Fix rollback 2021-03-25 02:41:58 +03:00
06f4978085 Fix fsync check in blockstore_flush (data fsyncs were disabled instead of journal fsyncs) 2021-03-25 02:41:58 +03:00
7ebf1588c5 Check for immediate_commit==small in the OSD code 2021-03-25 02:41:58 +03:00
b0ad1e1e6d Remember writes as "unsynced" only after completing them
Previously BS_OP_SYNC could take unfinished writes and add them into the journal before
they were actually completed. This was leading to crashes with the message
"BUG: Unexpected dirty_entry 2000000000001:9f2a0000 v3 unstable state during flush: 338"
2021-03-25 02:41:58 +03:00
0949f08407 Extract osd_primary write and sync code into separate files 2021-03-24 14:20:56 +03:00
04a1f18fa5 Assign .req as a whole to always zero out the remaining part
Also clear .reply before processing the operation
2021-03-24 14:20:56 +03:00
cf9a641d66 Skip disconnected OSDs during sync 2021-03-24 14:20:56 +03:00
05db1308aa Fix two potential read/write ordering problems (even though not yet seen in tests)
- Write operations could be 'stabilized' and previous versions could be
  purged from OSDs before the removal of version_override and following
  reads could potentially hit different version in EC pools
- Object was marked clean after completing the delete during recovery, so
  reads could in theory hit a deleted version and return nothing
2021-03-24 14:20:56 +03:00
98b54ca948 Don't try to "recover" misplaced objects if it would make them degraded 2021-03-21 01:37:23 +03:00
23225c5e62 Do not run ping on clients that are not yet connected 2021-03-21 01:37:23 +03:00
7e6e1a5a82 Release 0.5.10
The version seems to be stable after this bunch of fixes :)

- Fix delete & write operation ordering during rebalance to not lose objects in the immediate_commit=off mode
- Fix a possible crash caused by very high iodepths
- Re-distribute PG primaries over OSDs that come up after a short downtime
- Allow to specify etcd URLs for OSDs with http://, do not die with a strange error if -etcd option is missing for fio
- Fix a journal flushing deadlock which sometimes occurred in the immediate_commit=off mode
- Fix a bug where OSDs could hang if the data device filled up
- Fix an allocator bug where it was unable to allocate up to last (n%64) data device blocks
- Fix monitor crash that occurred on removal of some etcd keys
- Fix a bug where PGs could remain incomplete due to incorrect PG history with just zeroes in osd_sets
2021-03-16 12:48:26 +03:00
435045751d Delete objects only after a SYNC during rebalance in the non-immediate_commit mode
Previously OSDs could commit deletes before writes during recovery or rebalance
in the "lazy fsync" (immediate_commit=off) mode which could result in lost objects
2021-03-16 12:48:26 +03:00
c5fb1d5987 Do not duplicate blockstore operations when io_uring fills up
This bug was leading to OSDs dying with "Assertion `fulfilled == read_op->len' failed"
when testing fio -rw=randread -numjobs=8 -iodepth=128
2021-03-16 12:48:26 +03:00
9f59381bea Re-distribute PG primaries over OSDs that come up after a short downtime 2021-03-16 12:48:26 +03:00
9ac7e75178 Allow to specify etcd URLs for OSDs with http://, do not die with a strange error if -etcd option is missing for fio 2021-03-16 12:48:26 +03:00
88671cf745 Fix a bug causing all flushers to wait for an fsync without actually trying to do it
This happened because flusher_count became dynamic and fsync_batch() was comparing the number
of flushers currently ready to do an fsync with the maximum number of flushers. Also the number
wasn't rechecked on every loop which was also incorrect.

Now the interrupted_rebalance test passes even without IMMEDIATE_COMMIT=1.
2021-03-13 17:27:29 +03:00
fe1749c427 Fix the multiple_interrupted_rebalance test 2021-03-13 17:19:45 +03:00
ceb9c28de7 Set default log_level before passing config to etcd_state_client 2021-03-13 17:19:45 +03:00
299d7d7c95 Use common macro for get_sqe 2021-03-13 17:19:45 +03:00
d1526b415f Correctly resume writes when OSD is full to return an error 2021-03-13 17:19:45 +03:00
f49fd53d55 Fix a bug where allocator was unable to allocate up to last (n%64) blocks, add tests for it 2021-03-13 02:19:02 +03:00
dd76eda5e5 Test multiple interrupted rebalancings
Currently only passes with immediate_commit=all configuration
(env variable IMMEDIATE_COMMIT=1 for the bash script)
2021-03-12 12:55:44 +03:00
87dbd8fa57 Use empty hash as the default value for some etcd keys in the monitor 2021-03-12 12:40:15 +03:00
b44f49aab2 Ignore zero OSDs in history osd_sets 2021-03-12 12:40:15 +03:00
036555638e Release 0.5.9
- Fix two monitor bugs which led to objects being "logically lost" (physically
  present on some secondary OSDs while the primary doesn't know about it) after multiple
  interrupted rebalancings
- Implement "no_recovery" and "no_rebalance" flags
2021-03-11 00:39:10 +03:00
af5155fcd9 Implement "no_recovery" and "no_rebalance" flags 2021-03-11 00:36:31 +03:00
0d2efbecc9 Preserve previous PG history when changing PG distribution
Fixes incorrect PG history in case when a new rebalance is started
before the finish of the previous one which could make primary OSDs unable
to locate some objects on some secondaries.
2021-03-11 00:16:10 +03:00
e62e8b6bae Use real pg configuration instead of the "last clean" one for generating PG history
Basically fixes the bug introduced in 0.5.7 where a rebalance interrupted
by the monitor could result in forgetting objects moved to the new place
2021-03-10 02:01:44 +03:00
c4ba24c305 Do not print ping op latency 2021-03-10 02:01:44 +03:00
19e47a0279 Release 0.5.8
- Add heartbeats (fixes failover in case of network issues or offline nodes)
- Fix a bug where a PG could incorrectly become listed as 'incomplete' if historical osd_sets
  included a set with the PG's primary OSD as the only alive one
- Use osd_out_time = 10 minutes by default instead of 30 minutes
- Make monitors stick to a single selected etcd URL on start and not try to select random ones
  on every request - this was leading to etcd interaction errors when some etcds were unavailable
2021-03-09 02:38:17 +03:00
bd178ac20f Fix history osd_set check - local OSD is always available! 2021-03-09 02:18:18 +03:00
7006875a24 Make monitor stick to one etcd until the restart 2021-03-09 02:15:38 +03:00
ad577c4aac Add PING operation and timeouts to detect OSD failures when a host goes down 2021-03-09 02:15:38 +03:00
836635c518 Use osd_out_time = 10 minutes by default 2021-03-09 02:15:38 +03:00
88a03f4e98 Release 0.5.7
- Fix multiple bugs leading to OSDs sometimes being unable to correctly activate PGs
  when a lot of PG peering events occurred in a small amount of time
- Fix a bug where OSDs could list incomplete object versions during peering. The bug
  manifested with "local rollback operation failed" messages in OSD logs
- Fix a bug where misplaced chunks for degraded and incomplete objects were not removed
  from extra OSDs during recovery
- Fix incorrect PG history configuration resulting in OSDs being unable to find some
  of the objects after a PG count change
- Simplify block layer write ordering logic
- Avoid extra data move when a lot of OSDs are first stopped for a long time and then restarted
- Fix incorrect degraded & misplaced object statistics after a completed rebalance
- Fix incorrect usage of pg_minsize instead of the minimal possible object chunk count in EC pools
2021-03-08 23:37:02 +03:00
2a5036669d Fix PG count change procedure
In previous versions PG histories were calculated incorrectly during
PG count change which led to objects being lost on OSDs not in PG's osd set.
2021-03-08 23:15:58 +03:00
2e0c853180 Make test_change_pg_count check if any objects are lost during the test 2021-03-08 23:15:07 +03:00
e91ff2a9ec Only forget offline PGs if their state is not changed during reporting 2021-03-08 17:04:10 +03:00
086667f568 Do not check PG state key ownership if it doesn't exist yet
This fixes the bug where OSDs were sometimes trying to report updated PG states
infinitely without luck when PGs transitioned from 'starting' to 'peering' too fast
2021-03-08 17:04:10 +03:00
73ce20e246 Add a test for the "reappear after move" case 2021-03-08 17:04:10 +03:00
1be94da437 Check & remove extra chunks for degraded / incomplete objects, too 2021-03-08 17:04:10 +03:00
80e12358a2 Use pg_data_size instead of pg_minsize for object state calculation 2021-03-08 17:04:10 +03:00
36c935ace6 Use std::vector for the blockstore submission queue 2021-03-08 17:04:10 +03:00
0d8b5e2ef9 Remove unused enqueue_op_first() 2021-03-08 17:04:10 +03:00
98f1e2c277 Rework write/sync ordering
Make syncs wait for all previous writes because it's the only way
to make sure that OSDs do not receive incomplete writes in LIST results
during peering when some writes are still in progress.

Also simplify blockstore submission queue logic.
2021-03-08 17:04:10 +03:00
21e7686037 Fix possible "assertion failed: pg.inflight >= 0" error during PG stop 2021-03-08 17:04:10 +03:00
ab21a1908b Check for the dirty PG flag when trying to continue to stop it after sync 2021-03-08 17:04:10 +03:00
30d1ccd43e Fix an infinite loop when discarding list operations during stop_pg() 2021-03-08 17:04:10 +03:00
8bdd6d8d78 Reset PG state when stopping them 2021-03-08 17:04:10 +03:00
09b3e4e789 Fix OSDs being unable to stop PGs that are 'peering', not 'active'
This was sometimes leading to incorrect misplaced and degraded object count statistics
2021-03-08 17:04:10 +03:00
07912fd670 Use history/last_clean_pgs to avoid extra data move when observing a series of changes in the cluster 2021-03-08 17:04:10 +03:00
bc742ccf8c Fix a small memory leak in etcd_state_client 2021-03-08 17:04:10 +03:00
314b20437b Do not break subsequent small writes badly when a big write is canceled 2021-03-08 17:04:10 +03:00
29bac892ad Add .gitignore 2021-03-08 17:04:10 +03:00
cf7547faf3 Fix *.sh build scripts 2021-03-02 02:17:11 +03:00
ab90ed747f Release 0.5.6
- Fix operation statistics
- Fix a rebalance hang introduced in 0.5.5
- Test PG count changes with actual data moving
- Fix a possible 'unexpected pg state: 0' error during PG count change
2021-03-01 16:26:04 +03:00
29d8ac8b1b Do not report statistics for the empty operation 2021-03-01 16:20:57 +03:00
97795ea1b1 Use pg_minsize=2 in the pg_count change test
Also don't check for has_degraded because it's not a bug that objects
are _temporarily_ listed as degraded during PG peering as it's not
required for the new primary to connect to _all_ older peers to start
peering. The test may be improved in the future by temporarily disabling
degraded recovery during it and returning the has_degraded check back.
2021-03-01 16:18:08 +03:00
24e7075f08 Fix monitor's statistics aggregation 2021-02-28 19:51:16 +03:00
6155b23a7e Replace pgs[id] with pgs.at(id) to prevent accidental auto-vivification 2021-02-28 19:36:59 +03:00
7d49706c07 Improve the pg_count change test: add more OSDs and actually move data between them 2021-02-28 19:36:59 +03:00
46e79f3306 Wait for PGs to become clean before stopping them 2021-02-28 19:36:59 +03:00
41fd14e024 Fix deletes not increasing write_iodepth 2021-02-28 19:36:59 +03:00
161 changed files with 16522 additions and 3760 deletions

.gitignore (new file, +18 lines)

@@ -0,0 +1,18 @@
*.o
*.so
package-lock.json
fio
qemu
osd
stub_osd
stub_uring_osd
stub_bench
osd_test
osd_peering_pg_test
dump_journal
nbd_proxy
rm_inode
test_allocator
test_blockstore
test_shit
osd_rmw_test

CMakeLists.txt

@@ -2,4 +2,6 @@ cmake_minimum_required(VERSION 2.8)
project(vitastor)
set(VERSION "0.6.6")
add_subdirectory(src)

LICENSE (new file, +27 lines)

@@ -0,0 +1,27 @@
Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+

All server-side code (OSD, Monitor and so on) is licensed under the terms of
Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
GNU GPLv3.0 with the additional "Network Interaction" clause which requires
opensourcing all programs directly or indirectly interacting with Vitastor
through a computer network and expressly designed to be used in conjunction
with it ("Proxy Programs"). Proxy Programs may be made public not only under
the terms of the same license, but also under the terms of any GPL-Compatible
Free Software License, as listed by the Free Software Foundation.
This is a stricter copyleft license than the Affero GPL.

Please note that VNPL doesn't require you to open the code of proprietary
software running inside a VM if it's not specially designed to be used with
Vitastor.

Basically, you can't use the software in a proprietary environment to provide
its functionality to users without opensourcing all intermediary components
standing between the user and Vitastor or purchasing a commercial license
from the author 😀.

Client libraries (cluster_client and so on) are dual-licensed under the same
VNPL 1.1 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
software like QEMU and fio.

You can find the full text of VNPL-1.1 in the file [VNPL-1.1.txt](VNPL-1.1.txt).
GPL 2.0 is also included in this repository as [GPL-2.0.txt](GPL-2.0.txt).

README-ru.md (new file, +580 lines)

@@ -0,0 +1,580 @@
## Vitastor

[Read English version](README.md)

## The Idea

I just want to make a high-quality block SDS!

Vitastor is a distributed block SDS, a direct analogue of Ceph RBD and the internal storage
systems of popular cloud providers. However, unlike them, Vitastor is fast and simple at the
same time. It's only small for now :-).

Architectural similarity to Ceph means strict consistency built into the write algorithms,
replication through a primary OSD, symmetric clustering without a single point of failure,
and automatic distribution of data over any number of disks of any size with configurable
redundancy schemes - replication or arbitrary erasure codes.
## Features

Vitastor is currently in pre-release status: advanced features are still missing, and
breaking changes are likely in future versions.

However, the following is already implemented:

- The basic part - reliable clustered block storage without a single point of failure
- Performance ;-D
- Multiple redundancy schemes: replication, XOR n+1 (1 parity disk), Reed-Solomon erasure codes
  based on the jerasure library with any number of data and parity disks in a group
- Configuration via simple human-readable JSON structures in etcd
- Automatic data distribution over OSDs, with support for:
  - Mathematical optimization for better uniformity of the distribution and minimization of data movement
  - Multiple pools with different redundancy schemes
  - A placement tree, OSD selection by tags / device classes (SSD only, HDD only) and by subtree
  - Configurable failure domains (disk/server/rack and so on)
- Recovery of degraded blocks
- Rebalancing, i.e. data movement between OSDs (disks)
- "Lazy" fsync support (fsync is not issued for every operation)
- I/O statistics collection in etcd
- A userspace client library for I/O
- A QEMU disk driver (built out of the QEMU source tree)
- A disk driver for the fio performance testing tool (also built out of the fio source tree)
- An NBD proxy for kernel mounting of images ("userspace block device")
- A tool for removing images/inodes (vitastor-cli rm)
- Debian and CentOS packages
- Per-inode I/O and space usage statistics
- Inode naming via metadata stored in etcd
- Snapshots and copy-on-write clones
- Smoothing of random write performance in SSD+HDD configurations
- RDMA/RoCEv2 support via libibverbs
- A CSI plugin for Kubernetes
- Basic OpenStack support: a Cinder driver, patches for Nova and libvirt
- Snapshot merge (vitastor-cli {snap-rm,flatten,merge})
## Roadmap

- Snapshot deletion support (layer merge)
- Better scripts for disk formatting and automatic OSD startup
- Other administrative tools
- Plugins for OpenNebula, Proxmox and other cloud systems
- iSCSI proxy
- Faster failover
- Background integrity checking without checksums (replica comparison)
- Checksums
- SSD caching (tiered storage) support
- NVDIMM support
- Web GUI
- Compression, possibly
- Possibly, data caching via the system page cache
## Architecture

Just like in Ceph, in Vitastor:

- There are pools, PGs, OSDs, monitors, failure domains, and a placement tree (an analogue of the CRUSH tree).
- Images are split into fixed-size blocks (objects), and these objects are distributed over OSDs.
- OSDs have a journal and metadata, and they can also be placed on separate fast disks.
- All write operations are transactional too. Vitastor, though, has a deferred/lazy fsync
  (commit) mode, in which fsync is not issued for every write operation, making it more
  suitable for "bad" (desktop) SSDs. All writes are still atomic in any case.
- The client library also tries to wait for recovery after any cluster failure, so you can
  also reboot the whole cluster at once and clients will only hang for a while instead of
  disconnecting.

Some basic terms for those unfamiliar with Ceph:

- OSD (Object Storage Daemon) - a process that stores data on one disk and serves
  read/write requests from clients.
- Pool - a container for data with the same redundancy scheme and OSD placement rules.
- PG (Placement Group) - a group of objects stored on the same set of replicas (OSDs).
  Several PGs may be stored on the same set of replicas, but objects of one PG are
  normally not stored on different OSD sets.
- Monitor - a daemon that keeps cluster state.
- Failure Domain - a group of OSDs that you allow to fail all together.
  In other words, it's a group of OSDs in which the storage system does not place different
  copies of the same data block. For example, if the failure domain is a server, then two
  disks of the same server will never hold 2 or more copies of the same data block, so even
  if all disks of that server fail, it's equivalent to losing only 1 copy of any data block.
- Placement Tree (CRUSH Tree) - a hierarchical grouping of OSDs into nodes which can then be
  used as failure domains. That is, a disk (OSD) belongs to a server, a server belongs to a
  rack, a rack belongs to a row, a row to a datacenter, and so on.
How Vitastor differs from Ceph:

- Vitastor primarily focuses on SSDs. Vitastor should probably also work decently with a
  combination of SSDs and HDDs via bcache, and native SSD+HDD optimizations may be added in
  the future. However, pure HDD storage without any SSDs is not a priority, so optimizations
  for that case may never happen at all.
- Vitastor's OSD is single-threaded and will always remain so, because this is the most
  optimal way to operate. If 1 core per 1 disk is not enough for you, just split the disk
  into partitions and run several OSDs on it. Most likely, though, 1 core will be enough -
  Vitastor is not as CPU-hungry as Ceph.
- The journal and metadata are always kept in memory, so extra time is never spent on reading
  metadata from disk. Metadata size depends linearly on the disk size and the data block size,
  which is set in the cluster configuration and is 128 KB by default. With a 128 KB block,
  metadata takes about 512 MB of memory per 1 TB of disk space (and that's still less than
  Ceph needs). The journal doesn't need to be big at all; for example, the performance tests
  in this document were done with a journal of only 16 MB. A big journal is probably even
  harmful, because "dirty" writes (writes not yet flushed from the journal) also take memory
  and may slow things down a little.
- There is no internal copy-on-write in Vitastor. I believe that a CoW storage layer is much
  more complex, which makes it harder to achieve consistently good results. Maybe one fine day
  I'll come up with a nice algorithm for a CoW storage layer, but until then there will be no
  internal CoW in Vitastor. None of this applies to "external" CoW (snapshots and clones).
- Vitastor's base layer is simple block storage with fixed-size blocks, not a complex object
  storage system with extended features like Ceph's (RADOS).
- Vitastor has a "lazy fsync" mode in which the OSD batches write requests before flushing
  them to disk, which makes it possible to get better performance with cheap desktop SSDs
  without capacitors ("Advanced Power Loss Protection" / "Capacitor-Based Power Loss
  Protection"). Nevertheless, this mode is still slower than using normal server SSDs with
  instant fsync, because it results in additional data transfers over the network, so it's
  still recommended to use good server disks, especially since they cost almost the same as
  desktop ones.
- PGs are ephemeral. This means that they are not stored on disks and exist only in the
  memory of running OSDs.
- Recovery processes operate on individual objects, not whole PGs.
- There are no PGLOGs.
- "Monitors" don't store data. Cluster configuration and state are stored in etcd as simple
  human-readable JSON structures. Vitastor monitors only watch the cluster state and manage
  data movement. In this sense the Vitastor monitor is not a critical component of the system
  and is more similar to Ceph's manager (MGR). The Vitastor monitor is written in node.js.
- PG distribution is not based on consistent hashing. Instead, all PG mappings are stored
  directly in etcd (because there is no problem at all with keeping a few hundred thousand
  records in memory instead of recomputing hashes every time). Redistribution of PGs over
  OSDs is done via mathematical optimization, specifically by reducing the problem to LP
  (linear programming) and solving it with the lp_solve tool. This approach usually makes
  the space distribution almost perfectly even - uniformity is usually 96-99%, unlike Ceph,
  where bare CRUSH without a balancer usually gives 80-90%. It also minimizes the amount of
  data movement and the randomness of connections between OSDs, and lets you change the
  distribution manually without fear of breaking the rebalancing logic. A potential drawback
  of this approach is the assumption that it may break in a very large cluster, but it
  definitely works fine up to at least several hundred OSDs. And, of course, consistent
  hashes are easy to implement if ever needed.
- There is no separate layer similar to "CRUSH rules". You configure redundancy schemes,
  failure domains and OSD selection rules directly in the pool configuration.
## Understanding Storage Performance

In short: for fast storage, latency matters more than peak iops.

The best possible latency is achieved when testing with 1 thread and queue depth 1, which
roughly corresponds to a minimally loaded cluster. In this case IOPS = 1/latency. Latency
does not scale with the number of servers, disks, or server processes/threads... It only
depends on how fast a single server process (and the client) handles a single operation.
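For example (purely illustrative numbers): if a single operation takes 0.2 ms end to end, one client thread at queue depth 1 cannot exceed 1/0.0002 = 5000 iops, no matter how many OSDs the cluster has.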
Why do latencies matter? Because some applications *cannot* use a queue depth greater
than 1, as their workload is not parallelizable. An important example is all DBMSes with
consistency support (ACID): they all ensure it through journaling, and journals are written
sequentially, with fsync() after every operation.

fsync, by the way, is another very important thing that is almost always forgotten in
benchmarks. The point is that all modern disks have write caches/buffers and do not
guarantee that data is actually physically written to the medium until you issue an fsync(),
which the operating system translates into a cache flush command.

Cheap desktop and laptop SSDs are very fast without fsync - NVMe disks, for example, can
handle on the order of 80000 write operations per second at queue depth 1 without fsync.
However, with fsync, when they are actually forced to write every data block to flash
memory, they squeeze out only 1000-2000 write operations per second (a number that is
practically the same across all SSD models).

Server SSDs often have supercapacitors that act as a built-in uninterruptible power supply,
giving the disks time to flush their DRAM cache to persistent flash memory on power loss.
Thanks to this, such disks *ignore fsync* with a clear conscience, because they know for
sure that data from the cache will make it to persistent memory.

All the best-known software-defined storage systems, for example Ceph and the internal
storage used by cloud providers such as Amazon, Google and Yandex, are slow in terms of
latency. At best they give latencies starting from 0.3ms for reads and 0.6ms for 4 KB
writes, even on the best possible hardware.

And that's in the SSD era, when you can go to the market and buy an SSD with 0.1ms read
latency and 0.04ms write latency for $100 or even less.

When I need to quickly test the performance of a disk subsystem, I use the following 6
commands, with small variations:

- Linear write:
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX`
- Linear read:
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=read -runtime=60 -filename=/dev/sdX`
- Single-threaded write (T1Q1):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -fsync=1 -rw=randwrite -runtime=60 -filename=/dev/sdX`
- Single-threaded read (T1Q1):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randread -runtime=60 -filename=/dev/sdX`
- Parallel write (numjobs is used when 1 CPU core can't saturate the disk):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randwrite -runtime=60 -filename=/dev/sdX`
- Parallel read (numjobs - same as above):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randread -runtime=60 -filename=/dev/sdX`
## Theoretical Maximum Performance of Vitastor

With replication:

- Single-threaded (T1Q1) read latency: 1 network RTT + 1 disk read.
- Single-threaded write+fsync latency:
  - With immediate commit: 2 RTTs + 1 write.
  - With deferred ("lazy") commit: 4 RTTs + 1 write + 1 fsync.
- Parallel read: sum of the iops of all disks, or the network throughput if it's reached first.
- Parallel write: sum of the iops of all disks / number of replicas / WA, or the network throughput if it's reached first.

With erasure codes (EC):

- Single-threaded (T1Q1) read latency: 1.5 RTTs + 1 read.
- Single-threaded write+fsync latency:
  - With immediate commit: 3.5 RTTs + 1 read + 2 writes.
  - With deferred ("lazy") commit: 5.5 RTTs + 1 read + 2 writes + 2 fsyncs.
  - 0.5 here actually means (k-1)/k, where k is the number of data disks: the extra
    network hop is not needed when the read is served locally.
- Parallel read: sum of the iops of all disks, or the network throughput if it's reached first.
- Parallel write: sum of the iops of all disks / total number of data+parity disks / WA, or the network throughput if it's reached first.

Note: disk iops here should be taken in a mixed read/write mode, in a proportion similar to the formulas above.
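A quick worked example of the replication formulas, using assumed (not measured) numbers - a 0.05 ms network RTT, a 0.1 ms disk read and a 0.04 ms disk write on capacitor-backed SSDs (immediate commit):

- T1Q1 read ≈ 0.05 + 0.1 = 0.15 ms, i.e. about 6700 iops;
- T1Q1 write+fsync ≈ 2 × 0.05 + 0.04 = 0.14 ms, i.e. about 7100 iops.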
WA (write amplification) for 4 KB blocks in Vitastor is usually 3-5:

1. A metadata write to the journal
2. A data block write to the journal
3. A metadata write to the metadata DB
4. One more metadata write to the journal when EC is used
5. A data block write to the data device

If you find an SSD that handles 512-byte data blocks well (Optane?), then writes 1, 3 and 4
can be reduced to 512 bytes (1/8 of the data size), giving a WA of only 2.375.

In addition, WA goes down with deferred/lazy commit under parallel load, because journal
blocks are written to disk only when they fill up or when an fsync is explicitly requested.
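To make the 2.375 figure concrete, here is the arithmetic for a single 4 KB write in an EC pool, where all five steps above apply: with 4 KB writes everywhere, 5 × 4 KB = 20 KB is written per 4 KB of data, i.e. WA = 5; if steps 1, 3 and 4 shrink to 512 bytes each, the total becomes 0.5 + 4 + 0.5 + 0.5 + 4 = 9.5 KB per 4 KB of data, i.e. WA = 9.5 / 4 = 2.375.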
## Example Comparison with Ceph

Hardware - 4 servers, each with:

- 6x SATA SSD Intel D3-4510 3.84 TB
- 2x Xeon Gold 6242 (16 cores @ 2.8 GHz)
- 384 GB RAM
- 1x 25 GbE network card (Mellanox ConnectX-4 LX), connected to a Juniper QFX5200 switch

CPU power saving was disabled. 2 OSDs were deployed per SSD for both Vitastor and Ceph.

All results below are for 4 KB random block workloads (unless explicitly stated otherwise).

Raw disk performance:
- T1Q1 write ~27000 iops (~0.037ms latency)
- T1Q1 read ~9800 iops (~0.101ms latency)
- T1Q32 write ~60000 iops
- T1Q32 read ~81700 iops

Ceph 15.2.4 (Bluestore):
- T1Q1 write ~1000 iops (~1ms latency)
- T1Q1 read ~1750 iops (~0.57ms latency)
- T8Q64 write ~100000 iops, OSD processes consuming about 40 CPU cores on each server
- T8Q64 read ~480000 iops, OSD processes consuming about 40 CPU cores on each server

The 8-thread tests were run on 8 400GB RBD images from all hosts (2 fio processes were
started from each host). This is needed because in Ceph several RBD clients writing to 1
image slow down very badly. RocksDB and Bluestore settings in Ceph were not changed; the
only change was disabling cephx_sign_messages.

Actually, these test results are not that bad for Ceph (they could be worse). In fact,
these servers are well balanced for Ceph - 6 SATA SSDs just about saturate a 25-gigabit
network, and without the 2 powerful CPUs Ceph would not have had enough cores to produce a
decent result, which is exactly what the consumption of 40 cores in the parallel test shows.

Vitastor:
- T1Q1 write: 7087 iops (0.14ms latency)
- T1Q1 read: 6838 iops (0.145ms latency)
- T2Q64 write: 162000 iops, CPU consumption - 3 cores per server
- T8Q64 read: 895000 iops, CPU consumption - 4 cores per server
- Linear write (4M T1Q32): 2800 MB/s
- Linear read (4M T1Q32): 1500 MB/s

The 8-thread read test was run on 1 big image (3.2 TB) from all hosts (again, 2 fio
processes per host). In Vitastor there is no difference between 1 image and 8. Naturally,
about 1/4 of the read requests in such a configuration, just like in the Ceph tests above,
were served from the local machine. If the test is run so that all operations always go to
primary OSDs over the network, the test hits the network harder and the result is about
689000 iops.

Vitastor settings: `--disable_data_fsync true --immediate_commit all --flusher_count 8
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096
--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024
--journal_size 16777216`.

### EC/XOR 2+1

Vitastor:
- T1Q1 write: 2808 iops (~0.355ms latency)
- T1Q1 read: 6190 iops (~0.16ms latency)
- T2Q64 write: 85500 iops, CPU consumption - 3.4 cores per server
- T8Q64 read: 812000 iops, CPU consumption - 4.7 cores per server
- Linear write (4M T1Q32): 3200 MB/s
- Linear read (4M T1Q32): 1800 MB/s

Ceph:
- T1Q1 write: 730 iops (~1.37ms latency)
- T1Q1 read: 1500 iops with a cold metadata cache (~0.66ms latency), 2300 iops after 2 minutes of warmup (~0.435ms latency)
- T4Q128 write (4 RBD images): 45300 iops, CPU consumption - 30 cores per server
- T8Q64 read (4 RBD images): 278600 iops, CPU consumption - 40 cores per server
- Linear write (4M T1Q32): 1950 MB/s to an empty image, 2500 MB/s to a filled image
- Linear read (4M T1Q32): 2400 MB/s
### NBD

NBD stands for "Network Block Device", but in fact it also works simply as an analogue of
FUSE for block devices, i.e. it is a "userspace block device".

NBD is currently the only way to mount Vitastor with the Linux kernel.

NBD slightly reduces performance because it causes additional data copies between the kernel
and user space. Nevertheless, it is fairly optimal, and random access performance is barely
affected at all.

Vitastor with a single-threaded NBD proxy on the same hardware:
- T1Q1 write: 6000 iops (0.166ms latency)
- T1Q1 read: 5518 iops (0.18ms latency)
- T1Q128 write: 94400 iops
- T1Q128 read: 103000 iops
- Linear write (4M T1Q128): 1266 MB/s (compared to 2800 MB/s via fio)
- Linear read (4M T1Q128): 975 MB/s (compared to 1500 MB/s via fio)
## Installation

### Debian

- Add the Vitastor repository key:
  `wget -q -O - https://vitastor.io/debian/pubkey | sudo apt-key add -`
- Add the Vitastor repository to /etc/apt/sources.list:
  - Debian 11 (Bullseye/Sid): `deb https://vitastor.io/debian bullseye main`
  - Debian 10 (Buster): `deb https://vitastor.io/debian buster main`
- For Debian 10 (Buster) also enable the backports repository:
  `deb http://deb.debian.org/debian buster-backports main`
- Install the packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu`

### CentOS

- Add the Vitastor repository:
  - CentOS 7: `yum install https://vitastor.io/rpms/centos/7/vitastor-release-1.0-1.el7.noarch.rpm`
  - CentOS 8: `dnf install https://vitastor.io/rpms/centos/8/vitastor-release-1.0-1.el8.noarch.rpm`
- Enable EPEL: `yum/dnf install epel-release`
- Enable the additional CentOS repositories:
  - CentOS 7: `yum install centos-release-scl`
  - CentOS 8: `dnf install centos-release-advanced-virtualization`
- Enable elrepo-kernel:
  - CentOS 7: `yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm`
  - CentOS 8: `dnf install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm`
- Install the packages: `yum/dnf install vitastor lpsolve etcd kernel-ml qemu-kvm`
### Building from Source

- Install kernel 5.4 or newer for io_uring support. 5.8 or even newer is preferable, because
  5.4 has at least 1 known bug leading to a hang with io_uring and an HP SmartArray controller.
- Install liburing 0.4 or newer and its headers.
- Install lp_solve.
- Install etcd, version at least 3.4.15. Earlier versions won't work because of various bugs,
  for example [#12402](https://github.com/etcd-io/etcd/pull/12402). You can also take version
  3.4.13 with that specific fix from the release-3.4 branch of https://github.com/vitalif/etcd/.
- Install node.js 10 or newer.
- Install gcc and g++ 8.x or newer.
- Clone this repository with submodules: `git clone https://yourcmc.ru/git/vitalif/vitastor/`.
- It is desirable to rebuild QEMU with the patch that makes running via LD_PRELOAD optional.
  See `patches/qemu-*.*-vitastor.patch` - pick the version closest to your QEMU version.
- Install QEMU 3.0 or newer, get the source of the installed package, start rebuilding it,
  stop the build after a while and copy the following headers:
  - `<qemu>/include` &rarr; `<vitastor>/qemu/include`
  - Debian:
    * Take qemu from the main repository
    * `<qemu>/b/qemu/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
    * `<qemu>/b/qemu/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
  - CentOS 8:
    * Take qemu from the Advanced-Virtualization repository. To enable it, run
      `yum install centos-release-advanced-virtualization.noarch` and then `yum install qemu`
    * `<qemu>/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
    * For QEMU 3.0+: `<qemu>/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
    * For QEMU 2.0+: `<qemu>/qapi-types.h` &rarr; `<vitastor>/qemu/b/qemu/qapi-types.h`
  - `config-host.h` and `qapi` are needed because they contain generated headers
- Install fio 3.7 or newer, get its source package and symlink it into `<vitastor>/fio`.
- Build and install Vitastor with `mkdir build && cd build && cmake .. && make -j8 && make install`.
  Pay attention to the cmake variable `QEMU_PLUGINDIR` - on RHEL it must be set to `qemu-kvm`.
## Running
Attention: for now the procedure is rather non-trivial, the configuration and disk offsets
have to be set almost by hand. This will be fixed in the near future.
- SATA SSD or NVMe drives with capacitors (server-grade SSDs) are recommended. Desktop SSDs
can also be used by enabling the lazy fsync mode, but single-threaded write performance
will suffer in that case.
- A fast network, at least 10 Gbit/s.
- For the best performance, disable CPU power saving: `cpupower idle-set -D 0 && cpupower frequency-set -g performance`.
- Set the values you need at the top of `/usr/lib/vitastor/mon/make-units.sh` and `/usr/lib/vitastor/mon/make-osd.sh`.
- Create systemd units for etcd and the monitors: `/usr/lib/vitastor/mon/make-units.sh`
- Create units for the OSDs: `/usr/lib/vitastor/mon/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
- You can change OSD parameters in the systemd units. The meaning of some of them (a hypothetical unit excerpt is sketched after this list):
- `disable_data_fsync 1` - disables fsync, used with SSDs with capacitors.
- `immediate_commit all` - used with SSDs with capacitors.
- `disable_device_lock 1` - disables locking of the device file; only needed if you run
several OSDs on one block device.
- `flusher_count 256` - a "flusher" is a micro-thread that removes old data from the journal.
Don't worry about this setting, 256 is now sufficient in almost all cases.
- `disk_alignment`, `journal_block_size`, `meta_block_size` should be set to the internal
block size of the SSD. This is almost always 4096.
- `journal_no_same_sector_overwrites true` prevents repeated overwrites of the same journal
sector during writes. Most (99%) SSDs don't need this option. However, it turned out that
the drives used on one of the test setups - Intel D3-S4510 - really dislike such overwrites,
and the option was added for them. When this mode is enabled you also have to raise
`journal_sector_buffer_count`, because otherwise Vitastor will run out of journal write buffers.
- Start all etcd instances: `systemctl start etcd`
- Create the global configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
(if all your drives are server-grade ones with capacitors).
- Create pools: `etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.
For jerasure EC pools the configuration should look like this: `2:{"name":"ecpool","scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}`.
- Start all OSDs: `systemctl start vitastor.target`
- Your cluster should now be ready - one of the monitors should already have configured the PGs, and the OSDs should have started them.
- You can check the PG states directly in etcd: `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. All PGs should become 'active'.
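Purely as an illustration of where these parameters go (the real command line is generated by make-osd.sh; the binary path, addresses, offsets and device path below are hypothetical), an OSD unit's ExecStart could look roughly like this:
```
# Hypothetical excerpt of a generated Vitastor OSD systemd unit
ExecStart=/usr/bin/vitastor-osd \
    --etcd_address 10.115.0.10:2379/v3 \
    --osd_num 1 \
    --disable_data_fsync 1 \
    --immediate_commit all \
    --flusher_count 256 \
    --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \
    --data_device /dev/disk/by-partuuid/XXX
```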
### Name an image
```
etcdctl --endpoints=<etcd> put /vitastor/config/inode/<pool>/<inode> '{"name":"<name>","size":<size>[,"parent_id":<parent_inode_number>][,"readonly":true]}'
```
For example:
```
etcdctl --endpoints=http://10.115.0.10:2379/v3 put /vitastor/config/inode/1/1 '{"name":"testimg","size":2147483648}'
```
If you set parent_id, the image becomes a CoW clone: all new writes go to the new inode, while reads
check it first and then walk up the chain of parent layers. To avoid accidentally overwriting data
in the parent layer, you can switch it to read-only mode by adding the `"readonly":true` flag to its
metadata entry. The parent image then effectively becomes a snapshot.
So to create a snapshot you simply rename the previous inode (for example, from testimg to testimg@0),
make it readonly and create a new layer with the original image name (testimg) that references the
just-renamed one as its parent.
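A minimal sketch of that procedure with etcdctl, assuming the existing image is pool 1, inode 1 and the new top layer becomes inode 2 (the inode numbers are illustrative):
```
# Rename the existing layer to testimg@0 and make it read-only
etcdctl --endpoints=http://10.115.0.10:2379/v3 put /vitastor/config/inode/1/1 \
  '{"name":"testimg@0","size":2147483648,"readonly":true}'
# Create a new top layer with the original name, referencing the old one as its parent
etcdctl --endpoints=http://10.115.0.10:2379/v3 put /vitastor/config/inode/1/2 \
  '{"name":"testimg","size":2147483648,"parent_id":1}'
```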
### Run fio benchmarks
An example command to run a benchmark:
```
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -image=testimg
```
If you don't want to address the image by name, you can specify the pool number, inode number and size instead of `-image=testimg`:
`-pool=1 -inode=1 -size=400G`.
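For example, the same benchmark addressed by pool and inode number instead of the image name:
```
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -pool=1 -inode=1 -size=400G
```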
### Upload a VM disk image to/from Vitastor
Use qemu-img with a `vitastor:etcd_host=<HOST>:image=<IMAGE>` string as the disk file name. For example:
```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg'
```
Note that if you use unmodified QEMU you have to set the environment variable
`LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so`.
If you don't want to address the image by name, you can specify the pool number, inode number and size instead of `:image=<IMAGE>`:
`:pool=<POOL>:inode=<INODE>:size=<SIZE>`.
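For example, the same conversion addressed by pool and inode number:
```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648'
```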
### Start a VM
To run QEMU, use the `-drive file=vitastor:etcd_host=<HOST>:image=<IMAGE>` option (the same as for qemu-img)
and a 4 KB physical block size.
For example:
```
qemu-system-x86_64 -enable-kvm -m 1024
-drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg',format=raw,if=none,id=drive-virtio-disk0,cache=none
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512
-vnc 0.0.0.0:0
```
Addressing by numbers (`:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>`) works the same way as in qemu-img.
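For example, only the `-drive` option from the command above changes when addressing the image by number:
```
-drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648',format=raw,if=none,id=drive-virtio-disk0,cache=none
```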
### Remove an image
Use the vitastor-cli rm tool. For example:
```
vitastor-cli rm --etcd_address 10.115.0.10:2379/v3 --pool 1 --inode 1 --parallel_osds 16 --iodepth 32
```
### NBD
To create a local block device, use NBD. For example:
```
vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg
```
The command prints a device name like /dev/nbd0, which you can then format
and use as a normal block device.
As with the other commands, you can use the `--pool <POOL> --inode <INODE> --size <SIZE>`
options instead of `--image testimg` to address the image by inode number.
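A short usage sketch of the mapped device (assuming the map command printed /dev/nbd0; mkfs/mount are just the usual tools, and `vitastor-nbd unmap` is the same call the CSI plugin below uses to detach devices):
```
# Format and mount the mapped device
mkfs.ext4 /dev/nbd0
mount /dev/nbd0 /mnt
# ... use it as a normal filesystem ...
umount /mnt
# Detach the NBD device when done
vitastor-nbd unmap /dev/nbd0
```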
### Kubernetes
Vitastor has a CSI plugin for Kubernetes which supports RWO volumes.
To deploy it, take the manifests from the [csi/deploy/](csi/deploy/) directory, put your
Vitastor connection configuration into [csi/deploy/001-csi-config-map.yaml](001-csi-config-map.yaml),
configure the StorageClass in [csi/deploy/009-storage-class.yaml](009-storage-class.yaml)
and apply all `NNN-*.yaml` manifests to your Kubernetes installation:
```
for i in ./???-*.yaml; do kubectl apply -f $i; done
```
After that you'll be able to create PersistentVolumes. See the example in [csi/deploy/example-pvc.yaml](csi/deploy/example-pvc.yaml).
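For instance, creating the bundled example claim and checking that it gets bound (the PVC name comes from example-pvc.yaml):
```
kubectl apply -f csi/deploy/example-pvc.yaml
kubectl get pvc test-vitastor-pvc
```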
## Known Problems
- Object deletion requests may currently lead to 'incomplete' objects in EC pools
if OSDs or servers fail during the deletion, because correct handling of deletion
requests in a cluster has to be "three-phase", and that isn't implemented yet.
If you run into this situation, just repeat the deletion request.
## Implementation Principles
- I like architecturally simple solutions. Vitastor is designed exactly like that and I intend
to keep following this principle.
- If you came here looking for perfect C++ code, you're probably in the wrong place. "Generally
accepted" C++ coding practices don't concern me much, because they often lead, again, to
excessive complexity and to code that looks beautiful... but runs slowly.
- For the same reason the code sometimes contains reinvented wheels, such as its own simplified
HTTP client for talking to etcd. But these wheels are small and compact and don't require
a dozen external libraries.
- node.js for the monitor is not an accidental choice. It is very fast, has a built-in event
loop, a pleasant neutral C-like language and a mature ecosystem.
## Author and License
Author: Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
Join the Vitastor Telegram chat: https://t.me/vitastor
License: VNPL 1.1 for the server-side code and dual VNPL 1.1 + GPL 2.0+ for the client code.
VNPL is a "network copyleft" license - Vitastor's own free copyleft license,
the Vitastor Network Public License 1.1, based on GNU GPL 3.0 with an additional
"Network Interaction" clause requiring that all programs specifically designed
to be used together with Vitastor and interacting with it over the network
be distributed under VNPL or any other free license.
The idea of VNPL is to extend copyleft not only to modules explicitly linked
with Vitastor code, but also to modules packaged as microservices
that interact with it over the network.
So if you want to build a service on top of Vitastor that contains closed-source
components interacting with Vitastor, you need a commercial license from the author 😀.
No restrictions are imposed on Windows or on any other software not developed
*specifically* for use together with Vitastor.
The client libraries are distributed under a dual license: VNPL 1.1
and also GNU GPL 2.0 or any later version. This is done for compatibility
with software such as QEMU and fio.
You can find the full text of VNPL 1.1 in [VNPL-1.1.txt](VNPL-1.1.txt)
and GPL 2.0 in [GPL-2.0.txt](GPL-2.0.txt).

README.md

@@ -1,5 +1,7 @@
## Vitastor
[Читать на русском](README-ru.md)
## The Idea
Make Software-Defined Block Storage Great Again.
@@ -32,23 +34,29 @@ breaking changes in the future. However, the following is implemented:
- QEMU driver (built out-of-tree)
- Loadable fio engine for benchmarks (also built out-of-tree)
- NBD proxy for kernel mounts
- Inode removal tool (vitastor-rm)
- Inode removal tool (vitastor-cli rm)
- Packaging for Debian and CentOS
- Per-inode I/O and space usage statistics
- Inode metadata storage in etcd
- Snapshots and copy-on-write image clones
- Write throttling to smooth random write workloads in SSD+HDD configurations
- RDMA/RoCEv2 support via libibverbs
- CSI plugin for Kubernetes
- Basic OpenStack support: Cinder driver, Nova and libvirt patches
- Snapshot merge tool (vitastor-cli {snap-rm,flatten,merge})
## Roadmap
- OSD creation tool (OSDs currently have to be created by hand)
- Snapshot deletion (layer merge) support
- Better OSD creation and auto-start tools
- Other administrative tools
- Per-inode I/O and space usage statistics
- Proxmox and OpenNebula plugins
- Plugins for OpenNebula, Proxmox and other cloud systems
- iSCSI proxy
- Inode metadata storage in etcd
- Snapshots and copy-on-write image clones
- Operation timeouts and better failure detection
- Faster failover
- Scrubbing without checksums (verification of replicas)
- Checksums
- SSD+HDD optimizations, possibly including tiered storage and soft journal flushes
- RDMA and NVDIMM support
- Tiered storage
- NVDIMM support
- Web GUI
- Compression (possibly)
- Read caching using system page cache (possibly)
@@ -291,7 +299,7 @@ Vitastor with single-thread NBD on the same hardware:
- Debian 10 (Buster): `deb https://vitastor.io/debian buster main`
- For Debian 10 (Buster) also enable backports repository:
`deb http://deb.debian.org/debian buster-backports main`
- Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64`
- Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu`
### CentOS
@@ -313,10 +321,9 @@ Vitastor with single-thread NBD on the same hardware:
there is at least one known io_uring hang with 5.4 and an HP SmartArray controller.
- Install liburing 0.4 or newer and its headers.
- Install lp_solve.
- Install etcd. Attention: you need a fixed version from here: https://github.com/vitalif/etcd/,
branch release-3.4, because there is a bug in upstream etcd which makes Vitastor OSDs fail to
move PGs out of "starting" state if you have at least around ~500 PGs or so. The custom build
will be unnecessary when etcd merges the fix: https://github.com/etcd-io/etcd/pull/12402.
- Install etcd, at least version 3.4.15. Earlier versions won't work because of various bugs,
for example [#12402](https://github.com/etcd-io/etcd/pull/12402). You can also take 3.4.13
with this specific fix from here: https://github.com/vitalif/etcd/, branch release-3.4.
- Install node.js 10 or newer.
- Install gcc and g++ 8.x or newer.
- Clone https://yourcmc.ru/git/vitalif/vitastor/ with submodules.
@@ -334,7 +341,7 @@ Vitastor with single-thread NBD on the same hardware:
* For QEMU 2.0+: `<qemu>/qapi-types.h` &rarr; `<vitastor>/qemu/b/qemu/qapi-types.h`
- `config-host.h` and `qapi` are required because they contain generated headers
- You can also rebuild QEMU with a patch that makes LD_PRELOAD unnecessary to load vitastor driver.
See `qemu-*.*-vitastor.patch`.
See `patches/qemu-*.*-vitastor.patch`.
- Install fio 3.7 or later, get its source and symlink it into `<vitastor>/fio`.
- Build & install Vitastor with `mkdir build && cd build && cmake .. && make -j8 && make install`.
Pay attention to the `QEMU_PLUGINDIR` cmake option - it must be set to `qemu-kvm` on RHEL.
@@ -374,34 +381,113 @@ and calculate disk offsets almost by hand. This will be fixed in near future.
For jerasure pools the configuration should look like the following: `2:{"name":"ecpool","scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}`.
- At this point, one of the monitors will configure PGs and OSDs will start them.
- You can check PG states with `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. All PGs should become 'active'.
- Run tests with (for example): `fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -pool=1 -inode=1 -size=400G`.
- Upload VM disk image with qemu-img (for example):
```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648'
```
Note that the command must be run with `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so qemu-img ...`
if you use unmodified QEMU.
- Run QEMU with (for example):
```
qemu-system-x86_64 -enable-kvm -m 1024
-drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648',format=raw,if=none,id=drive-virtio-disk0,cache=none
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512
-vnc 0.0.0.0:0
```
- Remove inode with (for example):
```
vitastor-rm --etcd_address 10.115.0.10:2379/v3 --pool 1 --inode 1 --parallel_osds 16 --iodepth 32
```
### Name an image
```
etcdctl --endpoints=<etcd> put /vitastor/config/inode/<pool>/<inode> '{"name":"<name>","size":<size>[,"parent_id":<parent_inode_number>][,"readonly":true]}'
```
For example:
```
etcdctl --endpoints=http://10.115.0.10:2379/v3 put /vitastor/config/inode/1/1 '{"name":"testimg","size":2147483648}'
```
If you specify parent_id, the image becomes a CoW clone, i.e. all writes go to the new inode while reads
check it first and then the parent layers. For safety, you can then make the parent read-only by updating
its entry with `"readonly":true` and basically treat it as a snapshot.
So to create a snapshot you basically rename the previous upper layer (for example, from testimg to testimg@0),
make it readonly and create a new top layer with the original name (testimg) that references the previous one as its parent.
### Run fio benchmarks
fio command example:
```
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -image=testimg
```
If you don't want to access your image by name, you can specify pool number, inode number and size
(`-pool=1 -inode=1 -size=400G`) instead of the image name (`-image=testimg`).
### Upload VM image
Use qemu-img and `vitastor:etcd_host=<HOST>:image=<IMAGE>` disk filename. For example:
```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg'
```
Note that the command must be run with `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so qemu-img ...`
if you use unmodified QEMU.
You can also specify `:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>`
if you don't want to use inode metadata.
### Start a VM
Run QEMU with `-drive file=vitastor:etcd_host=<HOST>:image=<IMAGE>` and use 4 KB physical block size.
For example:
```
qemu-system-x86_64 -enable-kvm -m 1024
-drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg',format=raw,if=none,id=drive-virtio-disk0,cache=none
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512
-vnc 0.0.0.0:0
```
You can also specify `:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>`,
just like in qemu-img.
### Remove inode
Use vitastor-rm. For example:
```
vitastor-rm --etcd_address 10.115.0.10:2379/v3 --pool 1 --inode 1 --parallel_osds 16 --iodepth 32
```
### NBD
To create a local block device for a Vitastor image, use NBD. For example:
```
vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg
```
It will output the device name, like /dev/nbd0, which you can then format and mount as a normal block device.
Again, you can use `--pool <POOL> --inode <INODE> --size <SIZE>` instead of `--image <IMAGE>` if you want.
### Kubernetes
Vitastor has a CSI plugin for Kubernetes which supports RWO volumes.
To deploy it, take manifests from [csi/deploy/](csi/deploy/) directory, put your
Vitastor configuration in [csi/deploy/001-csi-config-map.yaml](001-csi-config-map.yaml),
configure storage class in [csi/deploy/009-storage-class.yaml](009-storage-class.yaml)
and apply all `NNN-*.yaml` manifests to your Kubernetes installation:
```
for i in ./???-*.yaml; do kubectl apply -f $i; done
```
After that you'll be able to create PersistentVolumes. See example in [csi/deploy/example-pvc.yaml](csi/deploy/example-pvc.yaml).
## Known Problems
- Object deletion requests may currently lead to 'incomplete' objects if your OSDs crash during
deletion because proper handling of object cleanup in a cluster should be "three-phase"
and it's currently not implemented. Just to repeat the removal again in this case.
- Object deletion requests may currently lead to 'incomplete' objects in EC pools
if your OSDs crash during deletion because proper handling of object cleanup
in a cluster should be "three-phase" and it's currently not implemented.
Just repeat the removal request again in this case.
## Implementation Principles
- I like simple and stupid solutions, so expect Vitastor to stay simple.
- I like architecturally simple solutions. Vitastor is and will always be designed
exactly like that.
- I also like reinventing the wheel to some extent, like writing my own HTTP client
for etcd interaction instead of using prebuilt libraries, because in this case
I'm confident about what my code does and what it doesn't do.
@@ -416,7 +502,7 @@ and calculate disk offsets almost by hand. This will be fixed in near future.
Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
You can also find me in the Russian Telegram Ceph chat: https://t.me/ceph_ru
Join Vitastor Telegram Chat: https://t.me/vitastor
All server-side code (OSD, Monitor and so on) is licensed under the terms of
Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on

csi/.dockerignore

@@ -0,0 +1,3 @@
vitastor-csi
go.sum
Dockerfile

csi/Dockerfile

@@ -0,0 +1,32 @@
# Compile stage
FROM golang:buster AS build
ADD go.mod /app/
RUN cd /app; CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go mod download -x
ADD . /app
RUN perl -i -e '$/ = undef; while(<>) { s/\n\s*(\{\s*\n)/$1\n/g; s/\}(\s*\n\s*)else\b/$1} else/g; print; }' `find /app -name '*.go'`
RUN cd /app; CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go build -o vitastor-csi
# Final stage
FROM debian:buster
LABEL maintainers="Vitaliy Filippov <vitalif@yourcmc.ru>"
LABEL description="Vitastor CSI Driver"
ENV NODE_ID=""
ENV CSI_ENDPOINT=""
RUN apt-get update && \
apt-get install -y wget && \
wget -q -O /etc/apt/trusted.gpg.d/vitastor.gpg https://vitastor.io/debian/pubkey.gpg && \
(echo deb http://vitastor.io/debian buster main > /etc/apt/sources.list.d/vitastor.list) && \
(echo deb http://deb.debian.org/debian buster-backports main > /etc/apt/sources.list.d/backports.list) && \
(echo "APT::Install-Recommends false;" > /etc/apt/apt.conf) && \
apt-get update && \
apt-get install -y e2fsprogs xfsprogs vitastor kmod && \
apt-get clean && \
(echo options nbd nbds_max=128 > /etc/modprobe.d/nbd.conf)
COPY --from=build /app/vitastor-csi /bin/
ENTRYPOINT ["/bin/vitastor-csi"]

csi/Makefile

@@ -0,0 +1,9 @@
VERSION ?= v0.6.6
all: build push
build:
@docker build --rm -t vitalif/vitastor-csi:$(VERSION) .
push:
@docker push vitalif/vitastor-csi:$(VERSION)


@@ -0,0 +1,5 @@
---
apiVersion: v1
kind: Namespace
metadata:
name: vitastor-system


@@ -0,0 +1,9 @@
---
apiVersion: v1
kind: ConfigMap
data:
vitastor.conf: |-
{"etcd_address":"http://192.168.7.2:2379","etcd_prefix":"/vitastor"}
metadata:
namespace: vitastor-system
name: vitastor-config


@@ -0,0 +1,37 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get"]
# allow to read Vault Token and connection options from the Tenants namespace
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin
subjects:
- kind: ServiceAccount
name: vitastor-csi-nodeplugin
namespace: vitastor-system
roleRef:
kind: ClusterRole
name: vitastor-csi-nodeplugin
apiGroup: rbac.authorization.k8s.io


@@ -0,0 +1,72 @@
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin-psp
spec:
allowPrivilegeEscalation: true
allowedCapabilities:
- 'SYS_ADMIN'
fsGroup:
rule: RunAsAny
privileged: true
hostNetwork: true
hostPID: true
runAsUser:
rule: RunAsAny
seLinux:
rule: RunAsAny
supplementalGroups:
rule: RunAsAny
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
- 'hostPath'
allowedHostPaths:
- pathPrefix: '/dev'
readOnly: false
- pathPrefix: '/run/mount'
readOnly: false
- pathPrefix: '/sys'
readOnly: false
- pathPrefix: '/lib/modules'
readOnly: true
- pathPrefix: '/var/lib/kubelet/pods'
readOnly: false
- pathPrefix: '/var/lib/kubelet/plugins/csi.vitastor.io'
readOnly: false
- pathPrefix: '/var/lib/kubelet/plugins_registry'
readOnly: false
- pathPrefix: '/var/lib/kubelet/plugins'
readOnly: false
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin-psp
rules:
- apiGroups: ['policy']
resources: ['podsecuritypolicies']
verbs: ['use']
resourceNames: ['vitastor-csi-nodeplugin-psp']
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-nodeplugin-psp
subjects:
- kind: ServiceAccount
name: vitastor-csi-nodeplugin
namespace: vitastor-system
roleRef:
kind: Role
name: vitastor-csi-nodeplugin-psp
apiGroup: rbac.authorization.k8s.io


@@ -0,0 +1,140 @@
---
kind: DaemonSet
apiVersion: apps/v1
metadata:
namespace: vitastor-system
name: csi-vitastor
spec:
selector:
matchLabels:
app: csi-vitastor
template:
metadata:
namespace: vitastor-system
labels:
app: csi-vitastor
spec:
serviceAccountName: vitastor-csi-nodeplugin
hostNetwork: true
hostPID: true
priorityClassName: system-node-critical
# to use e.g. Rook orchestrated cluster, and mons' FQDN is
# resolved through k8s service, set dns policy to cluster first
dnsPolicy: ClusterFirstWithHostNet
containers:
- name: driver-registrar
# This is necessary only for systems with SELinux, where
# non-privileged sidecar containers cannot access unix domain socket
# created by privileged CSI driver container.
securityContext:
privileged: true
image: k8s.gcr.io/sig-storage/csi-node-driver-registrar:v2.2.0
args:
- "--v=5"
- "--csi-address=/csi/csi.sock"
- "--kubelet-registration-path=/var/lib/kubelet/plugins/csi.vitastor.io/csi.sock"
env:
- name: KUBE_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: registration-dir
mountPath: /registration
- name: csi-vitastor
securityContext:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
allowPrivilegeEscalation: true
image: vitalif/vitastor-csi:v0.6.6
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"
env:
- name: NODE_ID
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: CSI_ENDPOINT
value: unix:///csi/csi.sock
imagePullPolicy: "IfNotPresent"
ports:
- containerPort: 9898
name: healthz
protocol: TCP
livenessProbe:
failureThreshold: 5
httpGet:
path: /healthz
port: healthz
initialDelaySeconds: 10
timeoutSeconds: 3
periodSeconds: 2
volumeMounts:
- name: socket-dir
mountPath: /csi
- mountPath: /dev
name: host-dev
- mountPath: /sys
name: host-sys
- mountPath: /run/mount
name: host-mount
- mountPath: /lib/modules
name: lib-modules
readOnly: true
- name: vitastor-config
mountPath: /etc/vitastor
- name: plugin-dir
mountPath: /var/lib/kubelet/plugins
mountPropagation: "Bidirectional"
- name: mountpoint-dir
mountPath: /var/lib/kubelet/pods
mountPropagation: "Bidirectional"
- name: liveness-probe
securityContext:
privileged: true
image: quay.io/k8scsi/livenessprobe:v1.1.0
args:
- "--csi-address=$(CSI_ENDPOINT)"
- "--health-port=9898"
env:
- name: CSI_ENDPOINT
value: unix://csi/csi.sock
volumeMounts:
- mountPath: /csi
name: socket-dir
volumes:
- name: socket-dir
hostPath:
path: /var/lib/kubelet/plugins/csi.vitastor.io
type: DirectoryOrCreate
- name: plugin-dir
hostPath:
path: /var/lib/kubelet/plugins
type: Directory
- name: mountpoint-dir
hostPath:
path: /var/lib/kubelet/pods
type: DirectoryOrCreate
- name: registration-dir
hostPath:
path: /var/lib/kubelet/plugins_registry/
type: Directory
- name: host-dev
hostPath:
path: /dev
- name: host-sys
hostPath:
path: /sys
- name: host-mount
hostPath:
path: /run/mount
- name: lib-modules
hostPath:
path: /lib/modules
- name: vitastor-config
configMap:
name: vitastor-config


@@ -0,0 +1,102 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-external-provisioner-runner
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["events"]
verbs: ["list", "watch", "create", "update", "patch"]
- apiGroups: [""]
resources: ["persistentvolumes"]
verbs: ["get", "list", "watch", "create", "update", "delete", "patch"]
- apiGroups: [""]
resources: ["persistentvolumeclaims"]
verbs: ["get", "list", "watch", "update"]
- apiGroups: [""]
resources: ["persistentvolumeclaims/status"]
verbs: ["update", "patch"]
- apiGroups: ["storage.k8s.io"]
resources: ["storageclasses"]
verbs: ["get", "list", "watch"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshots"]
verbs: ["get", "list"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshotcontents"]
verbs: ["create", "get", "list", "watch", "update", "delete"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshotclasses"]
verbs: ["get", "list", "watch"]
- apiGroups: ["storage.k8s.io"]
resources: ["volumeattachments"]
verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: ["storage.k8s.io"]
resources: ["volumeattachments/status"]
verbs: ["patch"]
- apiGroups: ["storage.k8s.io"]
resources: ["csinodes"]
verbs: ["get", "list", "watch"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshotcontents/status"]
verbs: ["update"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner-role
subjects:
- kind: ServiceAccount
name: vitastor-csi-provisioner
namespace: vitastor-system
roleRef:
kind: ClusterRole
name: vitastor-external-provisioner-runner
apiGroup: rbac.authorization.k8s.io
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-external-provisioner-cfg
rules:
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "list", "watch", "create", "update", "delete"]
- apiGroups: ["coordination.k8s.io"]
resources: ["leases"]
verbs: ["get", "watch", "list", "delete", "update", "create"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: vitastor-csi-provisioner-role-cfg
namespace: vitastor-system
subjects:
- kind: ServiceAccount
name: vitastor-csi-provisioner
namespace: vitastor-system
roleRef:
kind: Role
name: vitastor-external-provisioner-cfg
apiGroup: rbac.authorization.k8s.io


@@ -0,0 +1,60 @@
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner-psp
spec:
allowPrivilegeEscalation: true
allowedCapabilities:
- 'SYS_ADMIN'
fsGroup:
rule: RunAsAny
privileged: true
runAsUser:
rule: RunAsAny
seLinux:
rule: RunAsAny
supplementalGroups:
rule: RunAsAny
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
- 'hostPath'
allowedHostPaths:
- pathPrefix: '/dev'
readOnly: false
- pathPrefix: '/sys'
readOnly: false
- pathPrefix: '/lib/modules'
readOnly: true
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
namespace: vitastor-system
name: vitastor-csi-provisioner-psp
rules:
- apiGroups: ['policy']
resources: ['podsecuritypolicies']
verbs: ['use']
resourceNames: ['vitastor-csi-provisioner-psp']
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: vitastor-csi-provisioner-psp
namespace: vitastor-system
subjects:
- kind: ServiceAccount
name: vitastor-csi-provisioner
namespace: vitastor-system
roleRef:
kind: Role
name: vitastor-csi-provisioner-psp
apiGroup: rbac.authorization.k8s.io


@@ -0,0 +1,159 @@
---
kind: Service
apiVersion: v1
metadata:
namespace: vitastor-system
name: csi-vitastor-provisioner
labels:
app: csi-metrics
spec:
selector:
app: csi-vitastor-provisioner
ports:
- name: http-metrics
port: 8080
protocol: TCP
targetPort: 8680
---
kind: Deployment
apiVersion: apps/v1
metadata:
namespace: vitastor-system
name: csi-vitastor-provisioner
spec:
replicas: 3
selector:
matchLabels:
app: csi-vitastor-provisioner
template:
metadata:
namespace: vitastor-system
labels:
app: csi-vitastor-provisioner
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- csi-vitastor-provisioner
topologyKey: "kubernetes.io/hostname"
serviceAccountName: vitastor-csi-provisioner
priorityClassName: system-cluster-critical
containers:
- name: csi-provisioner
image: k8s.gcr.io/sig-storage/csi-provisioner:v2.2.0
args:
- "--csi-address=$(ADDRESS)"
- "--v=5"
- "--timeout=150s"
- "--retry-interval-start=500ms"
- "--leader-election=true"
# set it to true to use topology based provisioning
- "--feature-gates=Topology=false"
# if fstype is not specified in storageclass, ext4 is default
- "--default-fstype=ext4"
- "--extra-create-metadata=true"
env:
- name: ADDRESS
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-snapshotter
image: k8s.gcr.io/sig-storage/csi-snapshotter:v4.0.0
args:
- "--csi-address=$(ADDRESS)"
- "--v=5"
- "--timeout=150s"
- "--leader-election=true"
env:
- name: ADDRESS
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
securityContext:
privileged: true
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-attacher
image: k8s.gcr.io/sig-storage/csi-attacher:v3.1.0
args:
- "--v=5"
- "--csi-address=$(ADDRESS)"
- "--leader-election=true"
- "--retry-interval-start=500ms"
env:
- name: ADDRESS
value: /csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-resizer
image: k8s.gcr.io/sig-storage/csi-resizer:v1.1.0
args:
- "--csi-address=$(ADDRESS)"
- "--v=5"
- "--timeout=150s"
- "--leader-election"
- "--retry-interval-start=500ms"
- "--handle-volume-inuse-error=false"
env:
- name: ADDRESS
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: csi-vitastor
securityContext:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
image: vitalif/vitastor-csi:v0.6.6
args:
- "--node=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"
env:
- name: NODE_ID
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: CSI_ENDPOINT
value: unix:///csi/csi-provisioner.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /csi
- mountPath: /dev
name: host-dev
- mountPath: /sys
name: host-sys
- mountPath: /lib/modules
name: lib-modules
readOnly: true
- name: vitastor-config
mountPath: /etc/vitastor
volumes:
- name: host-dev
hostPath:
path: /dev
- name: host-sys
hostPath:
path: /sys
- name: lib-modules
hostPath:
path: /lib/modules
- name: socket-dir
emptyDir: {
medium: "Memory"
}
- name: vitastor-config
configMap:
name: vitastor-config


@@ -0,0 +1,11 @@
---
# if Kubernetes version is less than 1.18 change
# apiVersion to storage.k8s.io/v1beta1
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
namespace: vitastor-system
name: csi.vitastor.io
spec:
attachRequired: true
podInfoOnMount: false


@@ -0,0 +1,19 @@
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
namespace: vitastor-system
name: vitastor
annotations:
storageclass.kubernetes.io/is-default-class: "true"
provisioner: csi.vitastor.io
volumeBindingMode: Immediate
parameters:
etcdVolumePrefix: ""
poolId: "1"
# you can choose other configuration file if you have it in the config map
#configPath: "/etc/vitastor/vitastor.conf"
# you can also specify etcdUrl here, maybe to connect to another Vitastor cluster
# multiple etcdUrls may be specified, delimited by comma
#etcdUrl: "http://192.168.7.2:2379"
#etcdPrefix: "/vitastor"


@@ -0,0 +1,12 @@
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-vitastor-pvc
spec:
storageClassName: vitastor
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi

csi/go.mod

@@ -0,0 +1,35 @@
module vitastor.io/csi
go 1.15
require (
github.com/container-storage-interface/spec v1.4.0
github.com/coreos/bbolt v0.0.0-00010101000000-000000000000 // indirect
github.com/coreos/etcd v3.3.25+incompatible // indirect
github.com/coreos/go-semver v0.3.0 // indirect
github.com/coreos/go-systemd v0.0.0-20191104093116-d3cd4ed1dbcf // indirect
github.com/coreos/pkg v0.0.0-20180928190104-399ea9e2e55f // indirect
github.com/dustin/go-humanize v1.0.0 // indirect
github.com/golang/glog v0.0.0-20160126235308-23def4e6c14b
github.com/gorilla/websocket v1.4.2 // indirect
github.com/grpc-ecosystem/go-grpc-middleware v1.3.0 // indirect
github.com/grpc-ecosystem/go-grpc-prometheus v1.2.0 // indirect
github.com/grpc-ecosystem/grpc-gateway v1.16.0 // indirect
github.com/jonboulle/clockwork v0.2.2 // indirect
github.com/kubernetes-csi/csi-lib-utils v0.9.1
github.com/soheilhy/cmux v0.1.5 // indirect
github.com/tmc/grpc-websocket-proxy v0.0.0-20201229170055-e5319fda7802 // indirect
github.com/xiang90/probing v0.0.0-20190116061207-43a291ad63a2 // indirect
go.etcd.io/bbolt v0.0.0-00010101000000-000000000000 // indirect
go.etcd.io/etcd v3.3.25+incompatible
golang.org/x/net v0.0.0-20201202161906-c7110b5ffcbb
google.golang.org/grpc v1.33.1
k8s.io/klog v1.0.0
k8s.io/utils v0.0.0-20210305010621-2afb4311ab10
)
replace github.com/coreos/bbolt => go.etcd.io/bbolt v1.3.5
replace go.etcd.io/bbolt => github.com/coreos/bbolt v1.3.5
replace google.golang.org/grpc => google.golang.org/grpc v1.25.1

csi/src/config.go

@@ -0,0 +1,22 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
const (
vitastorCSIDriverName = "csi.vitastor.io"
vitastorCSIDriverVersion = "0.6.6"
)
// Config struct fills the parameters of request or user input
type Config struct
{
Endpoint string
NodeID string
}
// NewConfig returns config struct to initialize new driver
func NewConfig() *Config
{
return &Config{}
}

csi/src/controllerserver.go

@@ -0,0 +1,530 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"context"
"encoding/json"
"strings"
"bytes"
"strconv"
"time"
"fmt"
"os"
"os/exec"
"io/ioutil"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
"k8s.io/klog"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
"go.etcd.io/etcd/clientv3"
"github.com/container-storage-interface/spec/lib/go/csi"
)
const (
KB int64 = 1024
MB int64 = 1024 * KB
GB int64 = 1024 * MB
TB int64 = 1024 * GB
ETCD_TIMEOUT time.Duration = 15*time.Second
)
type InodeIndex struct
{
Id uint64 `json:"id"`
PoolId uint64 `json:"pool_id"`
}
type InodeConfig struct
{
Name string `json:"name"`
Size uint64 `json:"size,omitempty"`
ParentPool uint64 `json:"parent_pool,omitempty"`
ParentId uint64 `json:"parent_id,omitempty"`
Readonly bool `json:"readonly,omitempty"`
}
type ControllerServer struct
{
*Driver
}
// NewControllerServer create new instance controller
func NewControllerServer(driver *Driver) *ControllerServer
{
return &ControllerServer{
Driver: driver,
}
}
func GetConnectionParams(params map[string]string) (map[string]string, []string, string)
{
ctxVars := make(map[string]string)
configPath := params["configPath"]
if (configPath == "")
{
configPath = "/etc/vitastor/vitastor.conf"
}
else
{
ctxVars["configPath"] = configPath
}
config := make(map[string]interface{})
if configFD, err := os.Open(configPath); err == nil
{
defer configFD.Close()
data, _ := ioutil.ReadAll(configFD)
json.Unmarshal(data, &config)
}
// Try to load prefix & etcd URL from the config
var etcdUrl []string
if (params["etcdUrl"] != "")
{
ctxVars["etcdUrl"] = params["etcdUrl"]
etcdUrl = strings.Split(params["etcdUrl"], ",")
}
if (len(etcdUrl) == 0)
{
switch config["etcd_address"].(type)
{
case string:
etcdUrl = strings.Split(config["etcd_address"].(string), ",")
case []string:
etcdUrl = config["etcd_address"].([]string)
}
}
etcdPrefix := params["etcdPrefix"]
if (etcdPrefix == "")
{
etcdPrefix, _ = config["etcd_prefix"].(string)
if (etcdPrefix == "")
{
etcdPrefix = "/vitastor"
}
}
else
{
ctxVars["etcdPrefix"] = etcdPrefix
}
return ctxVars, etcdUrl, etcdPrefix
}
// Create the volume
func (cs *ControllerServer) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error)
{
klog.Infof("received controller create volume request %+v", protosanitizer.StripSecrets(req))
if (req == nil)
{
return nil, status.Errorf(codes.InvalidArgument, "request cannot be empty")
}
if (req.GetName() == "")
{
return nil, status.Error(codes.InvalidArgument, "name is a required field")
}
volumeCapabilities := req.GetVolumeCapabilities()
if (volumeCapabilities == nil)
{
return nil, status.Error(codes.InvalidArgument, "volume capabilities is a required field")
}
etcdVolumePrefix := req.Parameters["etcdVolumePrefix"]
poolId, _ := strconv.ParseUint(req.Parameters["poolId"], 10, 64)
if (poolId == 0)
{
return nil, status.Error(codes.InvalidArgument, "poolId is missing in storage class configuration")
}
volName := etcdVolumePrefix + req.GetName()
volSize := 1 * GB
if capRange := req.GetCapacityRange(); capRange != nil
{
volSize = ((capRange.GetRequiredBytes() + MB - 1) / MB) * MB
}
// FIXME: The following should PROBABLY be implemented externally in a management tool
ctxVars, etcdUrl, etcdPrefix := GetConnectionParams(req.Parameters)
if (len(etcdUrl) == 0)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
}
// Connect to etcd
cli, err := clientv3.New(clientv3.Config{
DialTimeout: ETCD_TIMEOUT,
Endpoints: etcdUrl,
})
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to connect to etcd at "+strings.Join(etcdUrl, ",")+": "+err.Error())
}
defer cli.Close()
var imageId uint64 = 0
for
{
// Check if the image exists
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, etcdPrefix+"/index/image/"+volName)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) > 0)
{
kv := resp.Kvs[0]
var v InodeIndex
err := json.Unmarshal(kv.Value, &v)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid /index/image/"+volName+" key in etcd: "+err.Error())
}
poolId = v.PoolId
imageId = v.Id
inodeCfgKey := fmt.Sprintf("/config/inode/%d/%d", poolId, imageId)
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, etcdPrefix+inodeCfgKey)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) == 0)
{
return nil, status.Error(codes.Internal, "missing "+inodeCfgKey+" key in etcd")
}
var inodeCfg InodeConfig
err = json.Unmarshal(resp.Kvs[0].Value, &inodeCfg)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid "+inodeCfgKey+" key in etcd: "+err.Error())
}
if (inodeCfg.Size < uint64(volSize))
{
return nil, status.Error(codes.Internal, "image "+volName+" is already created, but size is less than expected")
}
}
else
{
// Find a free ID
// Create image metadata in a transaction verifying that the image doesn't exist yet AND ID is still free
maxIdKey := fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId)
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, maxIdKey)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
var modRev int64
var nextId uint64
if (len(resp.Kvs) > 0)
{
var err error
nextId, err = strconv.ParseUint(string(resp.Kvs[0].Value), 10, 64)
if (err != nil)
{
return nil, status.Error(codes.Internal, maxIdKey+" contains invalid ID")
}
modRev = resp.Kvs[0].ModRevision
nextId++
}
else
{
nextId = 1
}
inodeIdxJson, _ := json.Marshal(InodeIndex{
Id: nextId,
PoolId: poolId,
})
inodeCfgJson, _ := json.Marshal(InodeConfig{
Name: volName,
Size: uint64(volSize),
})
ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
txnResp, err := cli.Txn(ctx).If(
clientv3.Compare(clientv3.ModRevision(fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId)), "=", modRev),
clientv3.Compare(clientv3.CreateRevision(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName)), "=", 0),
clientv3.Compare(clientv3.CreateRevision(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, poolId, nextId)), "=", 0),
).Then(
clientv3.OpPut(fmt.Sprintf("%s/index/maxid/%d", etcdPrefix, poolId), fmt.Sprintf("%d", nextId)),
clientv3.OpPut(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName), string(inodeIdxJson)),
clientv3.OpPut(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, poolId, nextId), string(inodeCfgJson)),
).Commit()
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to commit transaction in etcd: "+err.Error())
}
if (txnResp.Succeeded)
{
imageId = nextId
break
}
// Start over if the transaction fails
}
}
ctxVars["name"] = volName
volumeIdJson, _ := json.Marshal(ctxVars)
return &csi.CreateVolumeResponse{
Volume: &csi.Volume{
// Ugly, but VolumeContext isn't passed to DeleteVolume :-(
VolumeId: string(volumeIdJson),
CapacityBytes: volSize,
},
}, nil
}
// DeleteVolume deletes the given volume
func (cs *ControllerServer) DeleteVolume(ctx context.Context, req *csi.DeleteVolumeRequest) (*csi.DeleteVolumeResponse, error)
{
klog.Infof("received controller delete volume request %+v", protosanitizer.StripSecrets(req))
if (req == nil)
{
return nil, status.Error(codes.InvalidArgument, "request cannot be empty")
}
ctxVars := make(map[string]string)
err := json.Unmarshal([]byte(req.VolumeId), &ctxVars)
if (err != nil)
{
return nil, status.Error(codes.Internal, "volume ID not in JSON format")
}
volName := ctxVars["name"]
_, etcdUrl, etcdPrefix := GetConnectionParams(ctxVars)
if (len(etcdUrl) == 0)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
}
cli, err := clientv3.New(clientv3.Config{
DialTimeout: ETCD_TIMEOUT,
Endpoints: etcdUrl,
})
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to connect to etcd at "+strings.Join(etcdUrl, ",")+": "+err.Error())
}
defer cli.Close()
// Find inode by name
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err := cli.Get(ctx, etcdPrefix+"/index/image/"+volName)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) == 0)
{
return nil, status.Error(codes.NotFound, "volume "+volName+" does not exist")
}
var idx InodeIndex
err = json.Unmarshal(resp.Kvs[0].Value, &idx)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid /index/image/"+volName+" key in etcd: "+err.Error())
}
// Get inode config
inodeCfgKey := fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, idx.PoolId, idx.Id)
ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
resp, err = cli.Get(ctx, inodeCfgKey)
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to read key from etcd: "+err.Error())
}
if (len(resp.Kvs) == 0)
{
return nil, status.Error(codes.NotFound, "volume "+volName+" does not exist")
}
var inodeCfg InodeConfig
err = json.Unmarshal(resp.Kvs[0].Value, &inodeCfg)
if (err != nil)
{
return nil, status.Error(codes.Internal, "invalid "+inodeCfgKey+" key in etcd: "+err.Error())
}
// Delete inode data by invoking vitastor-cli
args := []string{
"rm", "--etcd_address", strings.Join(etcdUrl, ","),
"--pool", fmt.Sprintf("%d", idx.PoolId),
"--inode", fmt.Sprintf("%d", idx.Id),
}
if (ctxVars["configPath"] != "")
{
args = append(args, "--config_path", ctxVars["configPath"])
}
c := exec.Command("/usr/bin/vitastor-cli", args...)
var stderr bytes.Buffer
c.Stdout = nil
c.Stderr = &stderr
err = c.Run()
stderrStr := string(stderr.Bytes())
if (err != nil)
{
klog.Errorf("vitastor-cli rm failed: %s, status %s\n", stderrStr, err)
return nil, status.Error(codes.Internal, stderrStr+" (status "+err.Error()+")")
}
// Delete inode config in etcd
ctx, cancel = context.WithTimeout(context.Background(), ETCD_TIMEOUT)
txnResp, err := cli.Txn(ctx).Then(
clientv3.OpDelete(fmt.Sprintf("%s/index/image/%s", etcdPrefix, volName)),
clientv3.OpDelete(fmt.Sprintf("%s/config/inode/%d/%d", etcdPrefix, idx.PoolId, idx.Id)),
).Commit()
cancel()
if (err != nil)
{
return nil, status.Error(codes.Internal, "failed to delete keys in etcd: "+err.Error())
}
if (!txnResp.Succeeded)
{
return nil, status.Error(codes.Internal, "failed to delete keys in etcd: transaction failed")
}
return &csi.DeleteVolumeResponse{}, nil
}
// ControllerPublishVolume return Unimplemented error
func (cs *ControllerServer) ControllerPublishVolume(ctx context.Context, req *csi.ControllerPublishVolumeRequest) (*csi.ControllerPublishVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerUnpublishVolume return Unimplemented error
func (cs *ControllerServer) ControllerUnpublishVolume(ctx context.Context, req *csi.ControllerUnpublishVolumeRequest) (*csi.ControllerUnpublishVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ValidateVolumeCapabilities checks whether the volume capabilities requested are supported.
func (cs *ControllerServer) ValidateVolumeCapabilities(ctx context.Context, req *csi.ValidateVolumeCapabilitiesRequest) (*csi.ValidateVolumeCapabilitiesResponse, error)
{
klog.Infof("received controller validate volume capability request %+v", protosanitizer.StripSecrets(req))
if (req == nil)
{
return nil, status.Errorf(codes.InvalidArgument, "request is nil")
}
volumeID := req.GetVolumeId()
if (volumeID == "")
{
return nil, status.Error(codes.InvalidArgument, "volumeId is nil")
}
volumeCapabilities := req.GetVolumeCapabilities()
if (volumeCapabilities == nil)
{
return nil, status.Error(codes.InvalidArgument, "volumeCapabilities is nil")
}
var volumeCapabilityAccessModes []*csi.VolumeCapability_AccessMode
for _, mode := range []csi.VolumeCapability_AccessMode_Mode{
csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER,
} {
volumeCapabilityAccessModes = append(volumeCapabilityAccessModes, &csi.VolumeCapability_AccessMode{Mode: mode})
}
capabilitySupport := false
for _, capability := range volumeCapabilities
{
for _, volumeCapabilityAccessMode := range volumeCapabilityAccessModes
{
if (volumeCapabilityAccessMode.Mode == capability.AccessMode.Mode)
{
capabilitySupport = true
}
}
}
if (!capabilitySupport)
{
return nil, status.Errorf(codes.NotFound, "%v not supported", req.GetVolumeCapabilities())
}
return &csi.ValidateVolumeCapabilitiesResponse{
Confirmed: &csi.ValidateVolumeCapabilitiesResponse_Confirmed{
VolumeCapabilities: req.VolumeCapabilities,
},
}, nil
}
// ListVolumes returns a list of volumes
func (cs *ControllerServer) ListVolumes(ctx context.Context, req *csi.ListVolumesRequest) (*csi.ListVolumesResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// GetCapacity returns the capacity of the storage pool
func (cs *ControllerServer) GetCapacity(ctx context.Context, req *csi.GetCapacityRequest) (*csi.GetCapacityResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerGetCapabilities returns the capabilities of the controller service.
func (cs *ControllerServer) ControllerGetCapabilities(ctx context.Context, req *csi.ControllerGetCapabilitiesRequest) (*csi.ControllerGetCapabilitiesResponse, error)
{
functionControllerServerCapabilities := func(cap csi.ControllerServiceCapability_RPC_Type) *csi.ControllerServiceCapability
{
return &csi.ControllerServiceCapability{
Type: &csi.ControllerServiceCapability_Rpc{
Rpc: &csi.ControllerServiceCapability_RPC{
Type: cap,
},
},
}
}
var controllerServerCapabilities []*csi.ControllerServiceCapability
for _, capability := range []csi.ControllerServiceCapability_RPC_Type{
csi.ControllerServiceCapability_RPC_CREATE_DELETE_VOLUME,
csi.ControllerServiceCapability_RPC_LIST_VOLUMES,
csi.ControllerServiceCapability_RPC_EXPAND_VOLUME,
csi.ControllerServiceCapability_RPC_CREATE_DELETE_SNAPSHOT,
} {
controllerServerCapabilities = append(controllerServerCapabilities, functionControllerServerCapabilities(capability))
}
return &csi.ControllerGetCapabilitiesResponse{
Capabilities: controllerServerCapabilities,
}, nil
}
// CreateSnapshot create snapshot of an existing PV
func (cs *ControllerServer) CreateSnapshot(ctx context.Context, req *csi.CreateSnapshotRequest) (*csi.CreateSnapshotResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// DeleteSnapshot delete provided snapshot of a PV
func (cs *ControllerServer) DeleteSnapshot(ctx context.Context, req *csi.DeleteSnapshotRequest) (*csi.DeleteSnapshotResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ListSnapshots list the snapshots of a PV
func (cs *ControllerServer) ListSnapshots(ctx context.Context, req *csi.ListSnapshotsRequest) (*csi.ListSnapshotsResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerExpandVolume resizes a volume
func (cs *ControllerServer) ControllerExpandVolume(ctx context.Context, req *csi.ControllerExpandVolumeRequest) (*csi.ControllerExpandVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// ControllerGetVolume get volume info
func (cs *ControllerServer) ControllerGetVolume(ctx context.Context, req *csi.ControllerGetVolumeRequest) (*csi.ControllerGetVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}

csi/src/grpc.go

@@ -0,0 +1,137 @@
/*
Copyright 2017 The Kubernetes Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package vitastor
import (
"fmt"
"net"
"os"
"strings"
"sync"
"github.com/golang/glog"
"golang.org/x/net/context"
"google.golang.org/grpc"
"github.com/container-storage-interface/spec/lib/go/csi"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
)
// Defines Non blocking GRPC server interfaces
type NonBlockingGRPCServer interface {
// Start services at the endpoint
Start(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer)
// Waits for the service to stop
Wait()
// Stops the service gracefully
Stop()
// Stops the service forcefully
ForceStop()
}
func NewNonBlockingGRPCServer() NonBlockingGRPCServer {
return &nonBlockingGRPCServer{}
}
// NonBlocking server
type nonBlockingGRPCServer struct {
wg sync.WaitGroup
server *grpc.Server
}
func (s *nonBlockingGRPCServer) Start(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer) {
s.wg.Add(1)
go s.serve(endpoint, ids, cs, ns)
return
}
func (s *nonBlockingGRPCServer) Wait() {
s.wg.Wait()
}
func (s *nonBlockingGRPCServer) Stop() {
s.server.GracefulStop()
}
func (s *nonBlockingGRPCServer) ForceStop() {
s.server.Stop()
}
func (s *nonBlockingGRPCServer) serve(endpoint string, ids csi.IdentityServer, cs csi.ControllerServer, ns csi.NodeServer) {
proto, addr, err := ParseEndpoint(endpoint)
if err != nil {
glog.Fatal(err.Error())
}
if proto == "unix" {
addr = "/" + addr
if err := os.Remove(addr); err != nil && !os.IsNotExist(err) {
glog.Fatalf("Failed to remove %s, error: %s", addr, err.Error())
}
}
listener, err := net.Listen(proto, addr)
if err != nil {
glog.Fatalf("Failed to listen: %v", err)
}
opts := []grpc.ServerOption{
grpc.UnaryInterceptor(logGRPC),
}
server := grpc.NewServer(opts...)
s.server = server
if ids != nil {
csi.RegisterIdentityServer(server, ids)
}
if cs != nil {
csi.RegisterControllerServer(server, cs)
}
if ns != nil {
csi.RegisterNodeServer(server, ns)
}
glog.Infof("Listening for connections on address: %#v", listener.Addr())
server.Serve(listener)
}
func ParseEndpoint(ep string) (string, string, error) {
if strings.HasPrefix(strings.ToLower(ep), "unix://") || strings.HasPrefix(strings.ToLower(ep), "tcp://") {
s := strings.SplitN(ep, "://", 2)
if s[1] != "" {
return s[0], s[1], nil
}
}
return "", "", fmt.Errorf("Invalid endpoint: %v", ep)
}
func logGRPC(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
glog.V(3).Infof("GRPC call: %s", info.FullMethod)
glog.V(5).Infof("GRPC request: %s", protosanitizer.StripSecrets(req))
resp, err := handler(ctx, req)
if err != nil {
glog.Errorf("GRPC error: %v", err)
} else {
glog.V(5).Infof("GRPC response: %s", protosanitizer.StripSecrets(resp))
}
return resp, err
}

csi/src/identityserver.go

@@ -0,0 +1,60 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"context"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
"k8s.io/klog"
"github.com/container-storage-interface/spec/lib/go/csi"
)
// IdentityServer struct of Vitastor CSI driver with supported methods of CSI identity server spec.
type IdentityServer struct
{
*Driver
}
// NewIdentityServer create new instance identity
func NewIdentityServer(driver *Driver) *IdentityServer
{
return &IdentityServer{
Driver: driver,
}
}
// GetPluginInfo returns metadata of the plugin
func (is *IdentityServer) GetPluginInfo(ctx context.Context, req *csi.GetPluginInfoRequest) (*csi.GetPluginInfoResponse, error)
{
klog.Infof("received identity plugin info request %+v", protosanitizer.StripSecrets(req))
return &csi.GetPluginInfoResponse{
Name: vitastorCSIDriverName,
VendorVersion: vitastorCSIDriverVersion,
}, nil
}
// GetPluginCapabilities returns available capabilities of the plugin
func (is *IdentityServer) GetPluginCapabilities(ctx context.Context, req *csi.GetPluginCapabilitiesRequest) (*csi.GetPluginCapabilitiesResponse, error)
{
klog.Infof("received identity plugin capabilities request %+v", protosanitizer.StripSecrets(req))
return &csi.GetPluginCapabilitiesResponse{
Capabilities: []*csi.PluginCapability{
{
Type: &csi.PluginCapability_Service_{
Service: &csi.PluginCapability_Service{
Type: csi.PluginCapability_Service_CONTROLLER_SERVICE,
},
},
},
},
}, nil
}
// Probe returns the health and readiness of the plugin
func (is *IdentityServer) Probe(ctx context.Context, req *csi.ProbeRequest) (*csi.ProbeResponse, error)
{
return &csi.ProbeResponse{}, nil
}

csi/src/nodeserver.go

@@ -0,0 +1,279 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"context"
"os"
"os/exec"
"encoding/json"
"strings"
"bytes"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
"k8s.io/utils/mount"
utilexec "k8s.io/utils/exec"
"github.com/container-storage-interface/spec/lib/go/csi"
"github.com/kubernetes-csi/csi-lib-utils/protosanitizer"
"k8s.io/klog"
)
// NodeServer struct of Vitastor CSI driver with supported methods of CSI node server spec.
type NodeServer struct
{
*Driver
mounter mount.Interface
}
// NewNodeServer creates a new instance of the node server
func NewNodeServer(driver *Driver) *NodeServer
{
return &NodeServer{
Driver: driver,
mounter: mount.New(""),
}
}
// NodeStageVolume mounts the volume to a staging path on the node.
func (ns *NodeServer) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error)
{
return &csi.NodeStageVolumeResponse{}, nil
}
// NodeUnstageVolume unstages the volume from the staging path
func (ns *NodeServer) NodeUnstageVolume(ctx context.Context, req *csi.NodeUnstageVolumeRequest) (*csi.NodeUnstageVolumeResponse, error)
{
return &csi.NodeUnstageVolumeResponse{}, nil
}
func Contains(list []string, s string) bool
{
for i := 0; i < len(list); i++
{
if (list[i] == s)
{
return true
}
}
return false
}
// NodePublishVolume mounts the volume mounted to the staging path to the target path
func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error)
{
klog.Infof("received node publish volume request %+v", protosanitizer.StripSecrets(req))
targetPath := req.GetTargetPath()
// Check that it's not already mounted
free, error := mount.IsNotMountPoint(ns.mounter, targetPath)
if (error != nil)
{
if (os.IsNotExist(error))
{
error := os.MkdirAll(targetPath, 0777)
if (error != nil)
{
return nil, status.Error(codes.Internal, error.Error())
}
free = true
}
else
{
return nil, status.Error(codes.Internal, error.Error())
}
}
if (!free)
{
return &csi.NodePublishVolumeResponse{}, nil
}
ctxVars := make(map[string]string)
err := json.Unmarshal([]byte(req.VolumeId), &ctxVars)
if (err != nil)
{
return nil, status.Error(codes.Internal, "volume ID not in JSON format")
}
volName := ctxVars["name"]
_, etcdUrl, etcdPrefix := GetConnectionParams(ctxVars)
if (len(etcdUrl) == 0)
{
return nil, status.Error(codes.InvalidArgument, "no etcdUrl in storage class configuration and no etcd_address in vitastor.conf")
}
// Map NBD device
// FIXME: Check if already mapped
args := []string{
"map", "--etcd_address", strings.Join(etcdUrl, ","),
"--etcd_prefix", etcdPrefix,
"--image", volName,
};
if (ctxVars["configPath"] != "")
{
args = append(args, "--config_path", ctxVars["configPath"])
}
if (req.GetReadonly())
{
args = append(args, "--readonly", "1")
}
c := exec.Command("/usr/bin/vitastor-nbd", args...)
var stdout, stderr bytes.Buffer
c.Stdout, c.Stderr = &stdout, &stderr
err = c.Run()
stdoutStr, stderrStr := string(stdout.Bytes()), string(stderr.Bytes())
if (err != nil)
{
klog.Errorf("vitastor-nbd map failed: %s, status %s\n", stdoutStr+stderrStr, err)
return nil, status.Error(codes.Internal, stdoutStr+stderrStr+" (status "+err.Error()+")")
}
devicePath := strings.TrimSpace(stdoutStr)
// Check existing format
diskMounter := &mount.SafeFormatAndMount{Interface: ns.mounter, Exec: utilexec.New()}
existingFormat, err := diskMounter.GetDiskFormat(devicePath)
if (err != nil)
{
klog.Errorf("failed to get disk format for path %s, error: %v", err)
// unmap NBD device
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
return nil, err
}
// Format the device (ext4 or xfs)
fsType := req.GetVolumeCapability().GetMount().GetFsType()
isBlock := req.GetVolumeCapability().GetBlock() != nil
opt := req.GetVolumeCapability().GetMount().GetMountFlags()
opt = append(opt, "_netdev")
if ((req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY ||
req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY) &&
!Contains(opt, "ro"))
{
opt = append(opt, "ro")
}
if (fsType == "xfs")
{
opt = append(opt, "nouuid")
}
readOnly := Contains(opt, "ro")
if (existingFormat == "" && !readOnly)
{
args := []string{}
switch fsType
{
case "ext4":
args = []string{"-m0", "-Enodiscard,lazy_itable_init=1,lazy_journal_init=1", devicePath}
case "xfs":
args = []string{"-K", devicePath}
}
if (len(args) > 0)
{
cmdOut, cmdErr := diskMounter.Exec.Command("mkfs."+fsType, args...).CombinedOutput()
if (cmdErr != nil)
{
klog.Errorf("failed to run mkfs error: %v, output: %v", cmdErr, string(cmdOut))
// unmap NBD device
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
return nil, status.Error(codes.Internal, cmdErr.Error())
}
}
}
if (isBlock)
{
opt = append(opt, "bind")
err = diskMounter.Mount(devicePath, targetPath, fsType, opt)
}
else
{
err = diskMounter.FormatAndMount(devicePath, targetPath, fsType, opt)
}
if (err != nil)
{
klog.Errorf(
"failed to mount device path (%s) to path (%s) for volume (%s) error: %s",
devicePath, targetPath, volName, err,
)
// unmap NBD device
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
return nil, status.Error(codes.Internal, err.Error())
}
return &csi.NodePublishVolumeResponse{}, nil
}
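// Summary of the publish happy path above (a sketch, not part of the commit):
//   1. mount.IsNotMountPoint(targetPath)  -> return success early if the path is already mounted
//   2. vitastor-nbd map --image <name>    -> prints the /dev/nbdN device to use
//   3. GetDiskFormat(devicePath)          -> mkfs.ext4 / mkfs.xfs only if the device is empty and not read-only
//   4. FormatAndMount(devicePath, targetPath), or a bind mount of the raw device for block volumes
// Any failure after step 2 unmaps the NBD device again with "vitastor-nbd unmap".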
// NodeUnpublishVolume unmounts the volume from the target path
func (ns *NodeServer) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpublishVolumeRequest) (*csi.NodeUnpublishVolumeResponse, error)
{
klog.Infof("received node unpublish volume request %+v", protosanitizer.StripSecrets(req))
targetPath := req.GetTargetPath()
devicePath, refCount, err := mount.GetDeviceNameFromMount(ns.mounter, targetPath)
if (err != nil)
{
if (os.IsNotExist(err))
{
return nil, status.Error(codes.NotFound, "Target path not found")
}
return nil, status.Error(codes.Internal, err.Error())
}
if (devicePath == "")
{
return nil, status.Error(codes.NotFound, "Volume not mounted")
}
// unmount
err = mount.CleanupMountPoint(targetPath, ns.mounter, false)
if (err != nil)
{
return nil, status.Error(codes.Internal, err.Error())
}
// unmap NBD device
if (refCount == 1)
{
unmapOut, unmapErr := exec.Command("/usr/bin/vitastor-nbd", "unmap", devicePath).CombinedOutput()
if (unmapErr != nil)
{
klog.Errorf("failed to unmap NBD device %s: %s, error: %v", devicePath, unmapOut, unmapErr)
}
}
return &csi.NodeUnpublishVolumeResponse{}, nil
}
// NodeGetVolumeStats returns volume capacity statistics available for the volume
func (ns *NodeServer) NodeGetVolumeStats(ctx context.Context, req *csi.NodeGetVolumeStatsRequest) (*csi.NodeGetVolumeStatsResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// NodeExpandVolume expands the filesystem on the node
func (ns *NodeServer) NodeExpandVolume(ctx context.Context, req *csi.NodeExpandVolumeRequest) (*csi.NodeExpandVolumeResponse, error)
{
return nil, status.Error(codes.Unimplemented, "")
}
// NodeGetCapabilities returns the supported capabilities of the node server
func (ns *NodeServer) NodeGetCapabilities(ctx context.Context, req *csi.NodeGetCapabilitiesRequest) (*csi.NodeGetCapabilitiesResponse, error)
{
return &csi.NodeGetCapabilitiesResponse{}, nil
}
// NodeGetInfo returns NodeGetInfoResponse for CO.
func (ns *NodeServer) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error)
{
klog.Infof("received node get info request %+v", protosanitizer.StripSecrets(req))
return &csi.NodeGetInfoResponse{
NodeId: ns.NodeID,
}, nil
}

csi/src/server.go Normal file

@@ -0,0 +1,36 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package vitastor
import (
"k8s.io/klog"
)
type Driver struct
{
*Config
}
// NewDriver creates a new driver instance
func NewDriver(config *Config) (*Driver, error)
{
if (config == nil)
{
klog.Errorf("Vitastor CSI driver initialization failed")
return nil, nil
}
driver := &Driver{
Config: config,
}
klog.Infof("Vitastor CSI driver initialized")
return driver, nil
}
// Start server
func (driver *Driver) Run()
{
server := NewNonBlockingGRPCServer()
server.Start(driver.Endpoint, NewIdentityServer(driver), NewControllerServer(driver), NewNodeServer(driver))
server.Wait()
}

csi/vitastor-csi.go Normal file

@@ -0,0 +1,39 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
package main
import (
"flag"
"fmt"
"os"
"k8s.io/klog"
"vitastor.io/csi/src"
)
func main()
{
var config = vitastor.NewConfig()
flag.StringVar(&config.Endpoint, "endpoint", "", "CSI endpoint")
flag.StringVar(&config.NodeID, "node", "", "Node ID")
flag.Parse()
if (config.Endpoint == "")
{
config.Endpoint = os.Getenv("CSI_ENDPOINT")
}
if (config.NodeID == "")
{
config.NodeID = os.Getenv("NODE_ID")
}
if (config.Endpoint == "" || config.NodeID == "")
{
fmt.Fprintf(os.Stderr, "Please set -endpoint and -node / CSI_ENDPOINT & NODE_ID env vars\n")
os.Exit(1)
}
drv, err := vitastor.NewDriver(config)
if (err != nil)
{
klog.Fatalln(err)
}
drv.Run()
}
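// Hypothetical invocation matching the flags above (socket path and node name are illustrative):
//   vitastor-csi -endpoint unix:///csi/csi.sock -node node1
// or, equivalently, via the environment:
//   CSI_ENDPOINT=unix:///csi/csi.sock NODE_ID=node1 vitastor-csi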


@@ -1,6 +1,6 @@
#!/bin/bash
sed 's/$REL/bullseye/' < vitastor.Dockerfile > ../Dockerfile
sed 's/$REL/bullseye/g' < vitastor.Dockerfile > ../Dockerfile
cd ..
mkdir -p packages
sudo podman build -v `pwd`/packages:/root/packages -f Dockerfile .


@@ -1,6 +1,6 @@
#!/bin/bash
sed 's/$REL/buster/' < vitastor.Dockerfile > ../Dockerfile
sed 's/$REL/buster/g' < vitastor.Dockerfile > ../Dockerfile
cd ..
mkdir -p packages
sudo podman build -v `pwd`/packages:/root/packages -f Dockerfile .

debian/changelog vendored

@@ -1,8 +1,18 @@
vitastor (0.5.5-1) unstable; urgency=medium
vitastor (0.6.6-1) unstable; urgency=medium
* RDMA support
* Bugfixes
-- Vitaliy Filippov <vitalif@yourcmc.ru> Tue, 02 Feb 2021 23:01:24 +0300
-- Vitaliy Filippov <vitalif@yourcmc.ru> Sat, 01 May 2021 18:46:10 +0300
vitastor (0.6.0-1) unstable; urgency=medium
* Snapshots and Copy-on-Write clones
* Image metadata in etcd (name, size)
* Image I/O and space statistics in etcd
* Write throttling for smoothing random write workloads in SSD+HDD configurations
-- Vitaliy Filippov <vitalif@yourcmc.ru> Sun, 11 Apr 2021 00:49:18 +0300
vitastor (0.5.1-1) unstable; urgency=medium

debian/control vendored

@@ -2,7 +2,7 @@ Source: vitastor
Section: admin
Priority: optional
Maintainer: Vitaliy Filippov <vitalif@yourcmc.ru>
Build-Depends: debhelper, liburing-dev (>= 0.6), g++ (>= 8), libstdc++6 (>= 8), linux-libc-dev, libgoogle-perftools-dev, libjerasure-dev, libgf-complete-dev
Build-Depends: debhelper, liburing-dev (>= 0.6), g++ (>= 8), libstdc++6 (>= 8), linux-libc-dev, libgoogle-perftools-dev, libjerasure-dev, libgf-complete-dev, libibverbs-dev
Standards-Version: 4.5.0
Homepage: https://vitastor.io/
Rules-Requires-Root: no


@@ -11,6 +11,10 @@ RUN if [ "$REL" = "buster" ]; then \
echo 'Package: *' >> /etc/apt/preferences; \
echo 'Pin: release a=buster-backports' >> /etc/apt/preferences; \
echo 'Pin-Priority: 500' >> /etc/apt/preferences; \
echo >> /etc/apt/preferences; \
echo 'Package: libglvnd* libgles* libglx* libgl1 libegl* libopengl* mesa*' >> /etc/apt/preferences; \
echo 'Pin: release a=buster-backports' >> /etc/apt/preferences; \
echo 'Pin-Priority: 50' >> /etc/apt/preferences; \
fi; \
grep '^deb ' /etc/apt/sources.list | perl -pe 's/^deb/deb-src/' >> /etc/apt/sources.list; \
echo 'APT::Install-Recommends false;' >> /etc/apt/apt.conf; \
@@ -20,20 +24,22 @@ RUN apt-get update
RUN apt-get -y install qemu fio liburing1 liburing-dev libgoogle-perftools-dev devscripts
RUN apt-get -y build-dep qemu
RUN apt-get -y build-dep fio
# To build a custom version
#RUN cp /root/packages/qemu-orig/* /root
RUN apt-get --download-only source qemu
RUN apt-get --download-only source fio
ADD qemu-5.0-vitastor.patch qemu-5.1-vitastor.patch /root/vitastor/
ADD patches/qemu-5.0-vitastor.patch patches/qemu-5.1-vitastor.patch /root/vitastor/patches/
RUN set -e; \
mkdir -p /root/packages/qemu-$REL; \
rm -rf /root/packages/qemu-$REL/*; \
cd /root/packages/qemu-$REL; \
dpkg-source -x /root/qemu*.dsc; \
if [ -d /root/packages/qemu-$REL/qemu-5.0 ]; then \
cp /root/vitastor/qemu-5.0-vitastor.patch /root/packages/qemu-$REL/qemu-5.0/debian/patches; \
cp /root/vitastor/patches/qemu-5.0-vitastor.patch /root/packages/qemu-$REL/qemu-5.0/debian/patches; \
echo qemu-5.0-vitastor.patch >> /root/packages/qemu-$REL/qemu-5.0/debian/patches/series; \
else \
cp /root/vitastor/qemu-5.1-vitastor.patch /root/packages/qemu-$REL/qemu-*/debian/patches; \
cp /root/vitastor/patches/qemu-5.1-vitastor.patch /root/packages/qemu-$REL/qemu-*/debian/patches; \
P=`ls -d /root/packages/qemu-$REL/qemu-*/debian/patches`; \
echo qemu-5.1-vitastor.patch >> $P/series; \
fi; \


@@ -22,7 +22,7 @@ RUN apt-get -y build-dep qemu
RUN apt-get -y build-dep fio
RUN apt-get --download-only source qemu
RUN apt-get --download-only source fio
RUN apt-get -y install libjerasure-dev cmake
RUN apt-get update && apt-get -y install libjerasure-dev cmake libibverbs-dev
ADD . /root/vitastor
RUN set -e -x; \
@@ -40,10 +40,10 @@ RUN set -e -x; \
mkdir -p /root/packages/vitastor-$REL; \
rm -rf /root/packages/vitastor-$REL/*; \
cd /root/packages/vitastor-$REL; \
cp -r /root/vitastor vitastor-0.5.5; \
ln -s /root/packages/qemu-$REL/qemu-*/ vitastor-0.5.5/qemu; \
ln -s /root/fio-build/fio-*/ vitastor-0.5.5/fio; \
cd vitastor-0.5.5; \
cp -r /root/vitastor vitastor-0.6.6; \
ln -s /root/packages/qemu-$REL/qemu-*/ vitastor-0.6.6/qemu; \
ln -s /root/fio-build/fio-*/ vitastor-0.6.6/fio; \
cd vitastor-0.6.6; \
FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
QEMU=$(head -n1 qemu/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
sh copy-qemu-includes.sh; \
@@ -59,8 +59,8 @@ RUN set -e -x; \
echo "dep:fio=$FIO" > debian/substvars; \
echo "dep:qemu=$QEMU" >> debian/substvars; \
cd /root/packages/vitastor-$REL; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.5.5.orig.tar.xz vitastor-0.5.5; \
cd vitastor-0.5.5; \
tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.6.6.orig.tar.xz vitastor-0.6.6; \
cd vitastor-0.6.6; \
V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \

docker/Dockerfile Normal file

@@ -0,0 +1,9 @@
# Build Docker image with Vitastor packages
FROM debian:bullseye
ADD vitastor.list /etc/apt/sources.list.d
ADD vitastor.gpg /etc/apt/trusted.gpg.d
ADD vitastor.pref /etc/apt/preferences.d
ADD apt.conf /etc/apt/
RUN apt-get update && apt-get -y install vitastor qemu-system-x86 qemu-system-common && apt-get clean

docker/apt.conf Normal file

@@ -0,0 +1 @@
APT::Install-Recommends false;

docker/vitastor.gpg Normal file (binary, not shown)

docker/vitastor.list Normal file

@@ -0,0 +1 @@
deb http://vitastor.io/debian bullseye main

docker/vitastor.pref Normal file

@@ -0,0 +1,3 @@
Package: *
Pin: origin "vitastor.io"
Pin-Priority: 1000


@@ -1,51 +0,0 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include <iostream>
#include <functional>
#include <array>
#include <cstdlib> // for malloc() and free()
using namespace std;
// replace operator new and delete to log allocations
void* operator new(std::size_t n)
{
cout << "Allocating " << n << " bytes" << endl;
return malloc(n);
}
void operator delete(void* p) throw()
{
free(p);
}
class test
{
public:
std::string s;
void a(std::function<void()> & f, const char *str)
{
auto l = [this, str]() { cout << str << " ? " << s << " from this\n"; };
cout << "Assigning lambda3 of size " << sizeof(l) << endl;
f = l;
}
};
int main()
{
std::array<char, 16> arr1;
auto lambda1 = [arr1](){};
cout << "Assigning lambda1 of size " << sizeof(lambda1) << endl;
std::function<void()> f1 = lambda1;
std::array<char, 17> arr2;
auto lambda2 = [arr2](){};
cout << "Assigning lambda2 of size " << sizeof(lambda2) << endl;
std::function<void()> f2 = lambda2;
test t;
std::function<void()> f3;
t.s = "str";
t.a(f3, "huyambda");
f3();
}


@@ -5,18 +5,55 @@ module.exports = {
scale_pg_count,
};
function add_pg_history(new_pg_history, new_pg, prev_pgs, prev_pg_history, old_pg)
{
if (!new_pg_history[new_pg])
{
new_pg_history[new_pg] = {
osd_sets: {},
all_peers: {},
epoch: 0,
};
}
const nh = new_pg_history[new_pg], oh = prev_pg_history[old_pg];
nh.osd_sets[prev_pgs[old_pg].join(' ')] = prev_pgs[old_pg];
if (oh && oh.osd_sets && oh.osd_sets.length)
{
for (const pg of oh.osd_sets)
{
nh.osd_sets[pg.join(' ')] = pg;
}
}
if (oh && oh.all_peers && oh.all_peers.length)
{
for (const osd_num of oh.all_peers)
{
nh.all_peers[osd_num] = Number(osd_num);
}
}
if (oh && oh.epoch)
{
nh.epoch = nh.epoch < oh.epoch ? oh.epoch : nh.epoch;
}
}
function finish_pg_history(merged_history)
{
merged_history.osd_sets = Object.values(merged_history.osd_sets);
merged_history.all_peers = Object.values(merged_history.all_peers);
}
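// Illustrative example (not part of the commit): scaling 2 PGs up to 4 in the multiple-count branch of
// scale_pg_count() below, new PG index i inherits history from old PG (i % 2), i.e. new PGs 0..3 take
// their history from old PGs 0, 1, 0, 1.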
function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
{
const old_pg_count = prev_pgs.length;
// Add all possibly intersecting PGs to the history of new PGs
if (!(new_pg_count % old_pg_count))
{
// New PG count is a multiple of the old PG count
const mul = (new_pg_count / old_pg_count);
// New PG count is a multiple of old PG count
for (let i = 0; i < new_pg_count; i++)
{
const old_i = Math.floor(new_pg_count / mul);
new_pg_history[i] = prev_pg_history[old_i] ? JSON.parse(JSON.stringify(prev_pg_history[old_i])) : undefined;
add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i % old_pg_count);
finish_pg_history(new_pg_history[i]);
}
}
else if (!(old_pg_count % new_pg_count))
@@ -25,68 +62,26 @@ function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
const mul = (old_pg_count / new_pg_count);
for (let i = 0; i < new_pg_count; i++)
{
new_pg_history[i] = {
osd_sets: [],
all_peers: [],
epoch: 0,
};
for (let j = 0; j < mul; j++)
{
new_pg_history[i].osd_sets.push(prev_pgs[i*mul]);
const hist = prev_pg_history[1+i*mul+j];
if (hist && hist.osd_sets && hist.osd_sets.length)
{
Array.prototype.push.apply(new_pg_history[i].osd_sets, hist.osd_sets);
}
if (hist && hist.all_peers && hist.all_peers.length)
{
Array.prototype.push.apply(new_pg_history[i].all_peers, hist.all_peers);
}
if (hist && hist.epoch)
{
new_pg_history[i].epoch = new_pg_history[i].epoch < hist.epoch ? hist.epoch : new_pg_history[i].epoch;
}
add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i+j*new_pg_count);
}
finish_pg_history(new_pg_history[i]);
}
}
else
{
// Any PG may intersect with any PG after non-multiple PG count change
// So, merge ALL PGs history
let all_sets = {};
let all_peers = {};
let max_epoch = 0;
for (const pg of prev_pgs)
let merged_history = {};
for (let i = 0; i < old_pg_count; i++)
{
all_sets[pg.join(' ')] = pg;
add_pg_history(merged_history, 1, prev_pgs, prev_pg_history, i);
}
for (const pg in prev_pg_history)
{
const hist = prev_pg_history[pg];
if (hist && hist.osd_sets)
{
for (const pg of hist.osd_sets)
{
all_sets[pg.join(' ')] = pg;
}
}
if (hist && hist.all_peers)
{
for (const osd_num of hist.all_peers)
{
all_peers[osd_num] = Number(osd_num);
}
}
if (hist && hist.epoch)
{
max_epoch = max_epoch < hist.epoch ? hist.epoch : max_epoch;
}
}
all_sets = Object.values(all_sets);
all_peers = Object.values(all_peers);
finish_pg_history(merged_history[1]);
for (let i = 0; i < new_pg_count; i++)
{
new_pg_history[i] = { osd_sets: all_sets, all_peers, epoch: max_epoch };
new_pg_history[i] = { ...merged_history[1] };
}
}
// Mark history keys for removed PGs as removed
@@ -94,19 +89,16 @@ function scale_pg_count(prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
{
new_pg_history[i] = null;
}
// Just for the lp_solve optimizer - pick a "previous" PG for each "new" one
if (old_pg_count < new_pg_count)
{
for (let i = new_pg_count-1; i >= 0; i--)
for (let i = old_pg_count; i < new_pg_count; i++)
{
prev_pgs[i] = prev_pgs[Math.floor(i/new_pg_count*old_pg_count)];
prev_pgs[i] = prev_pgs[i % old_pg_count];
}
}
else if (old_pg_count > new_pg_count)
{
for (let i = 0; i < new_pg_count; i++)
{
prev_pgs[i] = prev_pgs[Math.round(i/new_pg_count*old_pg_count)];
}
prev_pgs.splice(new_pg_count, old_pg_count-new_pg_count);
}
}


@@ -104,6 +104,17 @@ async function optimize_initial({ osd_tree, pg_count, pg_size = 3, pg_minsize =
return res;
}
function shuffle(array)
{
for (let i = array.length - 1, j, x; i > 0; i--)
{
j = Math.floor(Math.random() * (i + 1));
x = array[i];
array[i] = array[j];
array[j] = x;
}
}
function make_int_pgs(weights, pg_count)
{
const total_weight = Object.values(weights).reduce((a, c) => Number(a) + Number(c), 0);
@@ -120,6 +131,7 @@ function make_int_pgs(weights, pg_count)
weight_left -= weights[pg_name];
pg_left -= n;
}
shuffle(int_pgs);
return int_pgs;
}
@@ -232,6 +244,7 @@ async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3,
{
return null;
}
// FIXME: use parity_chunks with parity_space instead of pg_minsize
const pg_effsize = Math.min(pg_minsize, Object.keys(osd_tree).length)
+ Math.max(0, Math.min(pg_size, Object.keys(osd_tree).length) - pg_minsize) * parity_space;
const pg_count = prev_int_pgs.length;


@@ -53,7 +53,6 @@ ExecStart=/usr/bin/vitastor-osd \\
--osd_num $OSD_NUM \\
--disable_data_fsync 1 \\
--immediate_commit all \\
--flusher_count 256 \\
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
--journal_no_same_sector_overwrites true \\
--journal_sector_buffer_count 1024 \\


@@ -32,7 +32,8 @@ ExecStart=/usr/local/bin/etcd -name etcd$ETCD_NUM --data-dir /var/lib/etcd$ETCD_
--advertise-client-urls http://$IP:2379 --listen-client-urls http://$IP:2379 \\
--initial-advertise-peer-urls http://$IP:2380 --listen-peer-urls http://$IP:2380 \\
--initial-cluster-token vitastor-etcd-1 --initial-cluster $ETCD_HOSTS \\
--initial-cluster-state new --max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision
--initial-cluster-state new --max-txn-ops=100000 --max-request-bytes=104857600 \\
--auto-compaction-retention=10 --auto-compaction-mode=revision
WorkingDirectory=/var/lib/etcd$ETCD_NUM.etcd
ExecStartPre=+chown -R etcd /var/lib/etcd$ETCD_NUM.etcd
User=etcd

mon/merge.js Normal file

@@ -0,0 +1,23 @@
const fsp = require('fs').promises;
async function merge(file1, file2, out)
{
if (!out)
{
console.error('USAGE: nodejs merge.js layer1 layer2 output');
process.exit();
}
const layer1 = await fsp.readFile(file1);
const layer2 = await fsp.readFile(file2);
const zero = Buffer.alloc(4096);
for (let i = 0; i < layer2.length; i += 4096)
{
if (zero.compare(layer2, i, i+4096) != 0)
{
layer2.copy(layer1, i, i, i+4096);
}
}
await fsp.writeFile(out, layer1);
}
merge(process.argv[2], process.argv[3], process.argv[4]);
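// Usage, per the message above: "nodejs merge.js layer1 layer2 output". Every non-zero 4 KiB block of
// layer2 overwrites the corresponding block of layer1, e.g. layer1 = [A,B,C], layer2 = [0,X,0] gives
// output = [A,X,C] (illustrative blocks, not part of the commit).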


@@ -10,34 +10,65 @@ const stableStringify = require('./stable-stringify.js');
const PGUtil = require('./PGUtil.js');
// FIXME document all etcd keys and config variables in the form of JSON schema or similar
const etcd_nonempty_keys = {
'config/global': 1,
'config/node_placement': 1,
'config/pools': 1,
'config/pgs': 1,
'history/last_clean_pgs': 1,
'stats': 1,
};
const etcd_allow = new RegExp('^'+[
'config/global',
'config/node_placement',
'config/pools',
'config/osd/[1-9]\\d*',
'config/pgs',
'config/inode/[1-9]\\d*/[1-9]\\d*',
'osd/state/[1-9]\\d*',
'osd/stats/[1-9]\\d*',
'osd/inodestats/[1-9]\\d*',
'osd/space/[1-9]\\d*',
'mon/master',
'pg/state/[1-9]\\d*/[1-9]\\d*',
'pg/stats/[1-9]\\d*/[1-9]\\d*',
'pg/history/[1-9]\\d*/[1-9]\\d*',
'history/last_clean_pgs',
'inode/stats/[1-9]\\d*/[1-9]\\d*',
'stats',
'index/image/.*',
'index/maxid/[1-9]\\d*',
].join('$|^')+'$');
const etcd_tree = {
config: {
/* global: {
// WARNING: NOT ALL OF THESE ARE ACTUALLY CONFIGURABLE HERE
// THIS IS JUST A POOR MAN'S CONFIG DOCUMENTATION
// etcd connection
config_path: "/etc/vitastor/vitastor.conf",
etcd_address: "10.0.115.10:2379/v3",
etcd_prefix: "/vitastor",
// mon
etcd_mon_ttl: 30, // min: 10
etcd_mon_timeout: 1000, // ms. min: 0
etcd_mon_retries: 5, // min: 0
mon_change_timeout: 1000, // ms. min: 100
mon_stats_timeout: 1000, // ms. min: 100
osd_out_time: 1800, // seconds. min: 0
osd_out_time: 600, // seconds. min: 0
placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
// client and osd
tcp_header_buffer_size: 65536,
use_sync_send_recv: false,
use_rdma: true,
rdma_device: null, // for example, "rocep5s0f0"
rdma_port_num: 1,
rdma_gid_index: 0,
rdma_mtu: 4096,
rdma_max_sge: 128,
rdma_max_send: 32,
rdma_max_recv: 8,
rdma_max_msg: 1048576,
log_level: 0,
block_size: 131072,
disk_alignment: 4096,
@@ -46,6 +77,8 @@ const etcd_tree = {
client_dirty_limit: 33554432,
peer_connect_interval: 5, // seconds. min: 1
peer_connect_timeout: 5, // seconds. min: 1
osd_idle_timeout: 5, // seconds. min: 1
osd_ping_timeout: 5, // seconds. min: 1
up_wait_retry_interval: 500, // ms. min: 50
// osd
etcd_report_interval: 30, // min: 10
@@ -55,8 +88,12 @@ const etcd_tree = {
autosync_interval: 5,
client_queue_depth: 128, // unused
recovery_queue_depth: 4,
recovery_sync_batch: 16,
readonly: false,
no_recovery: false,
no_rebalance: false,
print_stats_interval: 3,
slow_log_interval: 10,
// blockstore - fixed in superblock
block_size,
disk_alignment,
@@ -76,7 +113,9 @@ const etcd_tree = {
disable_meta_fsync,
disable_device_lock,
// blockstore - configurable
flusher_count,
max_write_iodepth,
min_flusher_count: 1,
max_flusher_count: 256,
inmemory_metadata,
inmemory_journal,
journal_sector_buffer_count,
@@ -124,6 +163,18 @@ const etcd_tree = {
}
}, */
pgs: {},
/* inode: {
<pool_id>: {
<inode_t>: {
name: string,
size?: uint64_t, // bytes
parent_pool?: <pool_id>,
parent_id?: <inode_t>,
readonly?: boolean,
}
}
}, */
inode: {},
},
osd: {
state: {
@@ -155,6 +206,18 @@ const etcd_tree = {
},
}, */
},
inodestats: {
/* <inode_t>: {
read: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
write: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
delete: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
}, */
},
space: {
/* <osd_num_t>: {
<inode_t>: uint64_t, // bytes
}, */
},
},
mon: {
master: {
@@ -166,7 +229,7 @@ const etcd_tree = {
/* <pool_id>: {
<pg_id>: {
primary: osd_num_t,
state: ("starting"|"peering"|"incomplete"|"active"|"stopping"|"offline"|
state: ("starting"|"peering"|"incomplete"|"active"|"repeering"|"stopping"|"offline"|
"degraded"|"has_incomplete"|"has_degraded"|"has_misplaced"|"has_unclean"|
"has_invalid"|"left_on_dead")[],
}
@@ -194,6 +257,28 @@ const etcd_tree = {
}, */
},
},
inode: {
stats: {
/* <pool_id>: {
<inode_t>: {
raw_used: uint64_t, // raw used bytes on OSDs
read: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
write: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
delete: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
},
}, */
},
},
pool: {
stats: {
/* <pool_id>: {
used_raw_tb: float, // used raw space in the pool
total_raw_tb: float, // maximum amount of space in the pool
raw_to_usable: float, // raw to usable ratio
space_efficiency: float, // 0..1
} */
},
},
stats: {
/* op_stats: {
<string>: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
@@ -213,6 +298,20 @@ const etcd_tree = {
incomplete: uint64_t,
}, */
},
history: {
last_clean_pgs: {},
},
index: {
image: {
/* <name>: {
id: uint64_t,
pool_id: uint64_t,
}, */
},
maxid: {
/* <pool_id>: uint64_t, */
},
},
};
// FIXME Split into several files
@@ -287,11 +386,16 @@ class Mon
{
this.config.mon_stats_timeout = 100;
}
this.config.mon_stats_interval = Number(this.config.mon_stats_interval) || 5000;
if (this.config.mon_stats_interval < 100)
{
this.config.mon_stats_interval = 100;
}
// After this number of seconds, a dead OSD will be removed from PG distribution
this.config.osd_out_time = Number(this.config.osd_out_time) || 0;
if (!this.config.osd_out_time)
{
this.config.osd_out_time = 30*60; // 30 minutes by default
this.config.osd_out_time = 600; // 10 minutes by default
}
}
@@ -313,8 +417,14 @@ class Mon
ok(false);
}, this.config.etcd_mon_timeout);
this.ws = new WebSocket(base+'/watch');
const fail = () =>
{
ok(false);
};
this.ws.on('error', fail);
this.ws.on('open', () =>
{
this.ws.removeListener('error', fail);
if (timer_id)
clearTimeout(timer_id);
ok(true);
@@ -359,7 +469,7 @@ class Mon
}
else
{
let stats_changed = false, changed = false;
let stats_changed = false, changed = false, pg_states_changed = false;
if (this.verbose)
{
console.log('Revision '+data.result.header.revision+' events: ');
@@ -369,19 +479,27 @@ class Mon
{
this.parse_kv(e.kv);
const key = e.kv.key.substr(this.etcd_prefix.length);
if (key.substr(0, 11) == '/osd/stats/' || key.substr(0, 10) == '/pg/stats/')
if (key.substr(0, 11) == '/osd/stats/' || key.substr(0, 10) == '/pg/stats/' || key.substr(0, 16) == '/osd/inodestats/')
{
stats_changed = true;
}
else if (key != '/stats')
else if (key.substr(0, 10) == '/pg/state/')
{
pg_states_changed = true;
}
else if (key != '/stats' && key.substr(0, 13) != '/inode/stats/')
{
changed = true;
}
if (this.verbose)
{
console.log(e);
console.log(JSON.stringify(e));
}
}
if (pg_states_changed)
{
this.save_last_clean().catch(console.error);
}
if (stats_changed)
{
this.schedule_update_stats();
@@ -394,10 +512,46 @@ class Mon
});
}
async save_last_clean()
{
// last_clean_pgs is used to avoid extra data move when observing a series of changes in the cluster
for (const pool_id in this.state.config.pools)
{
const pool_cfg = this.state.config.pools[pool_id];
if (!this.validate_pool_cfg(pool_id, pool_cfg, false))
{
continue;
}
for (let pg_num = 1; pg_num <= pool_cfg.pg_count; pg_num++)
{
if (!this.state.pg.state[pool_id] ||
!this.state.pg.state[pool_id][pg_num] ||
!(this.state.pg.state[pool_id][pg_num].state instanceof Array))
{
// Unclean
return;
}
let st = this.state.pg.state[pool_id][pg_num].state.join(',');
if (st != 'active' && st != 'active,left_on_dead' && st != 'left_on_dead,active')
{
// Unclean
return;
}
}
}
this.state.history.last_clean_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
await this.etcd_call('/kv/txn', {
success: [ { requestPut: {
key: b64(this.etcd_prefix+'/history/last_clean_pgs'),
value: b64(JSON.stringify(this.state.history.last_clean_pgs))
} } ],
}, this.etcd_start_timeout, 0);
}
async get_lease()
{
const max_ttl = this.config.etcd_mon_ttl + this.config.etcd_mon_timeout/1000*this.config.etcd_mon_retries;
const res = await this.etcd_call('/lease/grant', { TTL: max_ttl }, this.config.etcd_mon_timeout, this.config.etcd_mon_retries);
const res = await this.etcd_call('/lease/grant', { TTL: max_ttl }, this.config.etcd_mon_timeout, -1);
this.etcd_lease_id = res.ID;
setInterval(async () =>
{
@@ -472,7 +626,7 @@ class Mon
for (const osd_num of this.all_osds().sort((a, b) => a - b))
{
const stat = this.state.osd.stats[osd_num];
if (stat.size && (this.state.osd.state[osd_num] || Number(stat.time) >= down_time))
if (stat && stat.size && (this.state.osd.state[osd_num] || Number(stat.time) >= down_time))
{
// Numeric IDs are reserved for OSDs
const osd_cfg = this.state.config.osd[osd_num];
@@ -573,34 +727,61 @@ class Mon
return !has_online;
}
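// Note (not part of the commit): reset_rng()/rng() below form a fixed-seed xorshift-style PRNG, so
// pick_primary() selects primary OSDs deterministically for a given PG list, replacing the
// Math.random()-based selection that is removed further down in this diff.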
reset_rng()
{
this.seed = 0x5f020e43;
}
rng()
{
this.seed ^= this.seed << 13;
this.seed ^= this.seed >> 17;
this.seed ^= this.seed << 5;
return this.seed + 2147483648;
}
pick_primary(pool_id, osd_set, up_osds)
{
let alive_set;
if (this.state.config.pools[pool_id].scheme === 'replicated')
alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
else
{
// Prefer data OSDs for EC because they can actually read something without an additional network hop
const pg_data_size = (this.state.config.pools[pool_id].pg_size||0) -
(this.state.config.pools[pool_id].parity_chunks||0);
alive_set = osd_set.slice(0, pg_data_size).filter(osd_num => osd_num && up_osds[osd_num]);
if (!alive_set.length)
alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
}
if (!alive_set.length)
return 0;
return alive_set[this.rng() % alive_set.length];
}
save_new_pgs_txn(request, pool_id, up_osds, prev_pgs, new_pgs, pg_history)
{
const replicated = new_pgs.length && this.state.config.pools[pool_id].scheme === 'replicated';
const pg_minsize = new_pgs.length && this.state.config.pools[pool_id].pg_minsize;
const pg_items = {};
this.reset_rng();
new_pgs.map((osd_set, i) =>
{
osd_set = osd_set.map(osd_num => osd_num === LPOptimizer.NO_OSD ? 0 : osd_num);
let alive_set;
if (replicated)
alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
else
{
// Prefer data OSDs for EC because they can actually read something without an additional network hop
alive_set = osd_set.slice(0, pg_minsize).filter(osd_num => osd_num && up_osds[osd_num]);
if (!alive_set.length)
alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
}
pg_items[i+1] = {
osd_set,
primary: alive_set.length ? alive_set[Math.floor(Math.random()*alive_set.length)] : 0,
primary: this.pick_primary(pool_id, osd_set, up_osds),
};
if (prev_pgs[i] && prev_pgs[i].join(' ') != osd_set.join(' '))
if (prev_pgs[i] && prev_pgs[i].join(' ') != osd_set.join(' ') &&
prev_pgs[i].filter(osd_num => osd_num).length > 0)
{
pg_history[i] = pg_history[i] || {};
pg_history[i].osd_sets = pg_history[i].osd_sets || [];
pg_history[i].osd_sets.push(prev_pgs[i]);
}
if (pg_history[i] && pg_history[i].osd_sets)
{
pg_history[i].osd_sets = Object.values(pg_history[i].osd_sets
.reduce((a, c) => { a[c.join(' ')] = c; return a; }, {}));
}
});
for (let i = 0; i < new_pgs.length || i < prev_pgs.length; i++)
{
@@ -751,7 +932,7 @@ class Mon
{
// Take configuration and state, check it against the stored configuration hash
// Recalculate PGs and save them to etcd if the configuration is changed
// FIXME: Also do not change anything if the distribution is good enough and no PGs are degraded
// FIXME: Do not change anything if the distribution is good and random enough and no PGs are degraded
const { up_osds, levels, osd_tree } = this.get_osd_tree();
const tree_cfg = {
osd_tree,
@@ -791,13 +972,33 @@ class Mon
pool_tree = pool_tree ? pool_tree.children : [];
pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
const prev_pgs = [];
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{})||{})
// These are for the purpose of building history.osd_sets
const real_prev_pgs = [];
let pg_history = [];
for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
{
prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
real_prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
if (this.state.pg.history[pool_id] &&
this.state.pg.history[pool_id][pg])
{
pg_history[pg-1] = this.state.pg.history[pool_id][pg];
}
}
const pg_history = [];
const old_pg_count = prev_pgs.length;
// And these are for the purpose of minimizing data movement
let prev_pgs = [];
for (const pg in ((this.state.history.last_clean_pgs.items||{})[pool_id]||{}))
{
prev_pgs[pg-1] = this.state.history.last_clean_pgs.items[pool_id][pg].osd_set;
}
prev_pgs = JSON.parse(JSON.stringify(prev_pgs.length ? prev_pgs : real_prev_pgs));
const old_pg_count = real_prev_pgs.length;
const optimize_cfg = {
osd_tree: pool_tree,
pg_count: pool_cfg.pg_count,
pg_size: pool_cfg.pg_size,
pg_minsize: pool_cfg.pg_minsize,
max_combinations: pool_cfg.max_osd_combinations,
};
let optimize_result;
if (old_pg_count > 0)
{
@@ -809,7 +1010,9 @@ class Mon
this.schedule_recheck();
return;
}
PGUtil.scale_pg_count(prev_pgs, this.state.pg.history[pool_id]||{}, pg_history, pool_cfg.pg_count);
const new_pg_history = [];
PGUtil.scale_pg_count(prev_pgs, pg_history, new_pg_history, pool_cfg.pg_count);
pg_history = new_pg_history;
}
for (const pg of prev_pgs)
{
@@ -822,23 +1025,22 @@ class Mon
pg.pop();
}
}
optimize_result = await LPOptimizer.optimize_change({
prev_pgs,
osd_tree: pool_tree,
pg_size: pool_cfg.pg_size,
pg_minsize: pool_cfg.pg_minsize,
max_combinations: pool_cfg.max_osd_combinations,
});
if (!this.state.config.pgs.hash)
{
// Re-shuffle PGs
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
}
else
{
optimize_result = await LPOptimizer.optimize_change({
prev_pgs,
...optimize_cfg,
});
}
}
else
{
optimize_result = await LPOptimizer.optimize_initial({
osd_tree: pool_tree,
pg_count: pool_cfg.pg_count,
pg_size: pool_cfg.pg_size,
pg_minsize: pool_cfg.pg_minsize,
max_combinations: pool_cfg.max_osd_combinations,
});
optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
}
if (old_pg_count != optimize_result.int_pgs.length)
{
@@ -846,16 +1048,32 @@ class Mon
`PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+
` changed from: ${old_pg_count} to ${optimize_result.int_pgs.length}`
);
// Drop stats
etcd_request.success.push({ requestDeleteRange: {
key: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'/'),
range_end: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'0'),
} });
}
LPOptimizer.print_change_stats(optimize_result);
this.save_new_pgs_txn(etcd_request, pool_id, up_osds, prev_pgs, optimize_result.int_pgs, pg_history);
const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
this.state.pool.stats[pool_id] = {
used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
total_raw_tb: optimize_result.space,
raw_to_usable: pg_effsize / (pool_cfg.pg_size - (pool_cfg.parity_chunks||0)),
space_efficiency: optimize_result.space/(optimize_result.total_space||1),
};
etcd_request.success.push({ requestPut: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
} });
this.save_new_pgs_txn(etcd_request, pool_id, up_osds, real_prev_pgs, optimize_result.int_pgs, pg_history);
}
this.state.config.pgs.hash = tree_hash;
await this.save_pg_config(etcd_request);
}
else
{
// Nothing changed, but we still want to check for down OSDs
// Nothing changed, but we still want to recheck the distribution of primaries
let changed = false;
for (const pool_id in this.state.config.pools)
{
@@ -865,22 +1083,13 @@ class Mon
continue;
}
const replicated = pool_cfg.scheme === 'replicated';
for (const pg_num in ((this.state.config.pgs.items||{})[pool_id]||{})||{})
this.reset_rng();
for (let pg_num = 1; pg_num <= pool_cfg.pg_count; pg_num++)
{
const pg_cfg = this.state.config.pgs.items[pool_id][pg_num];
if (!Number(pg_cfg.primary) || !up_osds[pg_cfg.primary])
if (pg_cfg)
{
let alive_set;
if (replicated)
alive_set = pg_cfg.osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
else
{
// Prefer data OSDs for EC because they can actually read something without an additional network hop
alive_set = pg_cfg.osd_set.slice(0, pool_cfg.pg_minsize).filter(osd_num => osd_num && up_osds[osd_num]);
if (!alive_set.length)
alive_set = pg_cfg.osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
}
const new_primary = alive_set.length ? alive_set[Math.floor(Math.random()*alive_set.length)] : 0;
const new_primary = this.pick_primary(pool_id, pg_cfg.osd_set, up_osds);
if (pg_cfg.primary != new_primary)
{
console.log(
@@ -963,125 +1172,192 @@ class Mon
}, this.config.mon_change_timeout || 1000);
}
sum_stats()
sum_op_stats()
{
let overflow = false;
this.prev_stats = this.prev_stats || { op_stats: {}, subop_stats: {}, recovery_stats: {} };
const op_stats = {}, subop_stats = {}, recovery_stats = {};
for (const osd in this.state.osd.stats)
{
const st = this.state.osd.stats[osd];
const st = this.state.osd.stats[osd]||{};
for (const op in st.op_stats||{})
{
op_stats[op] = op_stats[op] || { count: 0n, usec: 0n, bytes: 0n };
op_stats[op].count += BigInt(st.op_stats.count||0);
op_stats[op].usec += BigInt(st.op_stats.usec||0);
op_stats[op].bytes += BigInt(st.op_stats.bytes||0);
op_stats[op].count += BigInt(st.op_stats[op].count||0);
op_stats[op].usec += BigInt(st.op_stats[op].usec||0);
op_stats[op].bytes += BigInt(st.op_stats[op].bytes||0);
}
for (const op in st.subop_stats||{})
{
subop_stats[op] = subop_stats[op] || { count: 0n, usec: 0n };
subop_stats[op].count += BigInt(st.subop_stats.count||0);
subop_stats[op].usec += BigInt(st.subop_stats.usec||0);
subop_stats[op].count += BigInt(st.subop_stats[op].count||0);
subop_stats[op].usec += BigInt(st.subop_stats[op].usec||0);
}
for (const op in st.recovery_stats||{})
{
recovery_stats[op] = recovery_stats[op] || { count: 0n, bytes: 0n };
recovery_stats[op].count += BigInt(st.recovery_stats.count||0);
recovery_stats[op].bytes += BigInt(st.recovery_stats.bytes||0);
}
}
for (const op in op_stats)
{
if (op_stats[op].count >= 0x10000000000000000n)
{
if (!this.prev_stats.op_stats[op])
{
overflow = true;
}
else
{
op_stats[op].count -= this.prev_stats.op_stats[op].count;
op_stats[op].usec -= this.prev_stats.op_stats[op].usec;
op_stats[op].bytes -= this.prev_stats.op_stats[op].bytes;
}
}
}
for (const op in subop_stats)
{
if (subop_stats[op].count >= 0x10000000000000000n)
{
if (!this.prev_stats.subop_stats[op])
{
overflow = true;
}
else
{
subop_stats[op].count -= this.prev_stats.subop_stats[op].count;
subop_stats[op].usec -= this.prev_stats.subop_stats[op].usec;
}
}
}
for (const op in recovery_stats)
{
if (recovery_stats[op].count >= 0x10000000000000000n)
{
if (!this.prev_stats.recovery_stats[op])
{
overflow = true;
}
else
{
recovery_stats[op].count -= this.prev_stats.recovery_stats[op].count;
recovery_stats[op].bytes -= this.prev_stats.recovery_stats[op].bytes;
}
recovery_stats[op].count += BigInt(st.recovery_stats[op].count||0);
recovery_stats[op].bytes += BigInt(st.recovery_stats[op].bytes||0);
}
}
return { op_stats, subop_stats, recovery_stats };
}
sum_object_counts()
{
const object_counts = { object: 0n, clean: 0n, misplaced: 0n, degraded: 0n, incomplete: 0n };
for (const pool_id in this.state.pg.stats)
{
for (const pg_num in this.state.pg.stats[pool_id])
{
const st = this.state.pg.stats[pool_id][pg_num];
for (const k in object_counts)
if (st)
{
if (st[k+'_count'])
for (const k in object_counts)
{
object_counts[k] += BigInt(st[k+'_count']);
if (st[k+'_count'])
{
object_counts[k] += BigInt(st[k+'_count']);
}
}
}
}
}
return (this.prev_stats = { overflow, op_stats, subop_stats, recovery_stats, object_counts });
return object_counts;
}
sum_inode_stats()
{
const inode_stats = {};
const inode_stub = () => ({
raw_used: 0n,
read: { count: 0n, usec: 0n, bytes: 0n },
write: { count: 0n, usec: 0n, bytes: 0n },
delete: { count: 0n, usec: 0n, bytes: 0n },
});
for (const pool_id in this.state.config.pools)
{
this.state.pool.stats[pool_id] = this.state.pool.stats[pool_id] || {};
this.state.pool.stats[pool_id].used_raw_tb = 0n;
}
for (const osd_num in this.state.osd.space)
{
for (const pool_id in this.state.osd.space[osd_num])
{
this.state.pool.stats[pool_id] = this.state.pool.stats[pool_id] || { used_raw_tb: 0n };
inode_stats[pool_id] = inode_stats[pool_id] || {};
for (const inode_num in this.state.osd.space[osd_num][pool_id])
{
const u = BigInt(this.state.osd.space[osd_num][pool_id][inode_num]||0);
inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
inode_stats[pool_id][inode_num].raw_used += u;
this.state.pool.stats[pool_id].used_raw_tb += u;
}
}
}
for (const pool_id in this.state.config.pools)
{
const used = this.state.pool.stats[pool_id].used_raw_tb;
this.state.pool.stats[pool_id].used_raw_tb = Number(used)/1024/1024/1024/1024;
}
for (const osd_num in this.state.osd.inodestats)
{
const ist = this.state.osd.inodestats[osd_num];
for (const pool_id in ist)
{
inode_stats[pool_id] = inode_stats[pool_id] || {};
for (const inode_num in ist[pool_id])
{
inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
for (const op of [ 'read', 'write', 'delete' ])
{
inode_stats[pool_id][inode_num][op].count += BigInt(ist[pool_id][inode_num][op].count||0);
inode_stats[pool_id][inode_num][op].usec += BigInt(ist[pool_id][inode_num][op].usec||0);
inode_stats[pool_id][inode_num][op].bytes += BigInt(ist[pool_id][inode_num][op].bytes||0);
}
}
}
}
return inode_stats;
}
fix_stat_overflows(obj, scratch)
{
for (const k in obj)
{
if (typeof obj[k] == 'bigint')
{
if (obj[k] >= 0x10000000000000000n)
{
if (scratch[k])
{
for (const k2 in scratch)
{
obj[k2] -= scratch[k2];
scratch[k2] = 0n;
}
}
else
{
for (const k2 in obj)
{
scratch[k2] = obj[k2];
}
}
}
}
else if (typeof obj[k] == 'object')
{
this.fix_stat_overflows(obj[k], scratch[k] = (scratch[k] || {}));
}
}
}
serialize_bigints(obj)
{
for (const k in obj)
{
if (typeof obj[k] == 'bigint')
{
obj[k] = ''+obj[k];
}
else if (typeof obj[k] == 'object')
{
this.serialize_bigints(obj[k]);
}
}
}
async update_total_stats()
{
const stats = this.sum_stats();
if (!stats.overflow)
const txn = [];
const stats = this.sum_op_stats();
const object_counts = this.sum_object_counts();
const inode_stats = this.sum_inode_stats();
this.fix_stat_overflows(stats, (this.prev_stats = this.prev_stats || {}));
this.fix_stat_overflows(inode_stats, (this.prev_inode_stats = this.prev_inode_stats || {}));
stats.object_counts = object_counts;
this.serialize_bigints(stats);
this.serialize_bigints(inode_stats);
txn.push({ requestPut: { key: b64(this.etcd_prefix+'/stats'), value: b64(JSON.stringify(stats)) } });
for (const pool_id in inode_stats)
{
// Convert to strings, serialize and save
const ser = {};
for (const st of [ 'op_stats', 'subop_stats', 'recovery_stats' ])
for (const inode_num in inode_stats[pool_id])
{
ser[st] = {};
for (const op in stats[st])
{
ser[st][op] = {};
for (const k in stats[st][op])
{
ser[st][op][k] = ''+stats[st][op][k];
}
}
txn.push({ requestPut: {
key: b64(this.etcd_prefix+'/inode/stats/'+pool_id+'/'+inode_num),
value: b64(JSON.stringify(inode_stats[pool_id][inode_num])),
} });
}
ser.object_counts = {};
for (const k in stats.object_counts)
{
ser.object_counts[k] = ''+stats.object_counts[k];
}
await this.etcd_call('/kv/txn', {
success: [ { requestPut: { key: b64(this.etcd_prefix+'/stats'), value: b64(JSON.stringify(ser)) } } ],
}, this.config.etcd_mon_timeout, 0);
}
for (const pool_id in this.state.pool.stats)
{
txn.push({ requestPut: {
key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
} });
}
if (txn.length)
{
await this.etcd_call('/kv/txn', { success: txn }, this.config.etcd_mon_timeout, 0);
}
}
@@ -1092,11 +1368,17 @@ class Mon
clearTimeout(this.stats_timer);
this.stats_timer = null;
}
let sleep = (this.stats_update_next||0) - Date.now();
if (sleep < this.config.mon_stats_timeout)
{
sleep = this.config.mon_stats_timeout;
}
this.stats_timer = setTimeout(() =>
{
this.stats_timer = null;
this.stats_update_next = Date.now() + this.config.mon_stats_interval;
this.update_total_stats().catch(console.error);
}, this.config.mon_stats_timeout || 1000);
}, sleep);
}
parse_kv(kv)
@@ -1122,16 +1404,20 @@ class Mon
console.log('Bad value in etcd: '+kv.key+' = '+kv.value);
return;
}
key = key.split('/');
let key_parts = key.split('/');
let cur = this.state;
for (let i = 0; i < key.length-1; i++)
for (let i = 0; i < key_parts.length-1; i++)
{
cur = (cur[key[i]] = cur[key[i]] || {});
cur = (cur[key_parts[i]] = cur[key_parts[i]] || {});
}
cur[key[key.length-1]] = kv.value;
if (key.join('/') === 'config/global')
if (etcd_nonempty_keys[key])
{
// Do not clear these to null
kv.value = kv.value || {};
}
cur[key_parts[key_parts.length-1]] = kv.value;
if (key === 'config/global')
{
this.state.config.global = this.state.config.global || {};
this.config = this.state.config.global;
this.check_config();
for (const osd_num in this.state.osd.stats)
@@ -1142,7 +1428,7 @@ class Mon
);
}
}
else if (key.join('/') === 'config/pools')
else if (key === 'config/pools')
{
for (const pool_id in this.state.config.pools)
{
@@ -1151,7 +1437,7 @@ class Mon
this.validate_pool_cfg(pool_id, pool_cfg, true);
}
}
else if (key[0] === 'osd' && key[1] === 'stats')
else if (key_parts[0] === 'osd' && key_parts[1] === 'stats')
{
// Recheck PGs <osd_out_time> later
this.schedule_next_recheck_at(
@@ -1183,6 +1469,11 @@ class Mon
console.error('etcd returned error: '+res.json.error);
break;
}
if (this.etcd_urls.length > 1)
{
// Stick to the same etcd for the rest of calls
this.etcd_urls = [ base ];
}
return res.json;
}
retry++;


@@ -51,7 +51,7 @@ async function run()
const meta_offset = options.journal_offset + Math.ceil(options.journal_size/options.device_block_size)*options.device_block_size;
const entries_per_block = Math.floor(options.device_block_size / (24 + 2*options.object_size/options.bitmap_granularity/8));
const object_count = Math.floor((device_size-meta_offset)/options.object_size);
const meta_size = Math.ceil(object_count / entries_per_block) * options.device_block_size;
const meta_size = Math.ceil(1 + object_count / entries_per_block) * options.device_block_size;
const data_offset = meta_offset + meta_size;
const meta_size_fmt = (meta_size > 1024*1024*1024 ? Math.round(meta_size/1024/1024/1024*100)/100+" GB"
: Math.round(meta_size/1024/1024*100)/100+" MB");
@@ -65,6 +65,9 @@ async function run()
);
}
process.stdout.write(
(options.device_block_size != 4096 ?
` --meta_block_size ${options.device_block_size}\n`+
` --journal_block_size ${options.device_block_size}\n` : '')+
` --data_device ${options.device}\n`+
` --journal_offset ${options.journal_offset}\n`+
` --meta_offset ${meta_offset}\n`+

patches/cinder-vitastor.py Normal file

@@ -0,0 +1,948 @@
# Vitastor Driver for OpenStack Cinder
#
# --------------------------------------------
# Install as cinder/volume/drivers/vitastor.py
# --------------------------------------------
#
# Copyright 2020 Vitaliy Filippov
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
"""Cinder Vitastor Driver"""
import binascii
import base64
import errno
import json
import math
import os
import tempfile
from castellan import key_manager
from oslo_config import cfg
from oslo_log import log as logging
from oslo_service import loopingcall
from oslo_concurrency import processutils
from oslo_utils import encodeutils
from oslo_utils import excutils
from oslo_utils import fileutils
from oslo_utils import units
import six
from six.moves.urllib import request
from cinder import exception
from cinder.i18n import _
from cinder.image import image_utils
from cinder import interface
from cinder import objects
from cinder.objects import fields
from cinder import utils
from cinder.volume import configuration
from cinder.volume import driver
from cinder.volume import volume_utils
VERSION = '0.6.6'
LOG = logging.getLogger(__name__)
VITASTOR_OPTS = [
cfg.StrOpt(
'vitastor_config_path',
default='/etc/vitastor/vitastor.conf',
help='Vitastor configuration file path'
),
cfg.StrOpt(
'vitastor_etcd_address',
default='',
help='Vitastor etcd address(es)'),
cfg.StrOpt(
'vitastor_etcd_prefix',
default='/vitastor',
help='Vitastor etcd prefix'
),
cfg.StrOpt(
'vitastor_pool_id',
default='',
help='Vitastor pool ID to use for volumes'
),
# FIXME exclusive_cinder_pool ?
]
CONF = cfg.CONF
CONF.register_opts(VITASTOR_OPTS, group = configuration.SHARED_CONF_GROUP)
class VitastorDriverException(exception.VolumeDriverException):
message = _("Vitastor Cinder driver failure: %(reason)s")
@interface.volumedriver
class VitastorDriver(driver.CloneableImageVD,
driver.ManageableVD, driver.ManageableSnapshotsVD,
driver.BaseVD):
"""Implements Vitastor volume commands."""
cfg = {}
_etcd_urls = []
def __init__(self, active_backend_id = None, *args, **kwargs):
super(VitastorDriver, self).__init__(*args, **kwargs)
self.configuration.append_config_values(VITASTOR_OPTS)
@classmethod
def get_driver_options(cls):
additional_opts = cls._get_oslo_driver_opts(
'reserved_percentage',
'max_over_subscription_ratio',
'volume_dd_blocksize'
)
return VITASTOR_OPTS + additional_opts
def do_setup(self, context):
"""Performs initialization steps that could raise exceptions."""
super(VitastorDriver, self).do_setup(context)
# Make sure configuration is in UTF-8
for attr in [ 'config_path', 'etcd_address', 'etcd_prefix', 'pool_id' ]:
val = self.configuration.safe_get('vitastor_'+attr)
if val is not None:
self.cfg[attr] = utils.convert_str(val)
self.cfg = self._load_config(self.cfg)
def _load_config(self, cfg):
# Try to load configuration file
try:
f = open(cfg['config_path'] or '/etc/vitastor/vitastor.conf')
conf = json.loads(f.read())
f.close()
for k in conf:
cfg[k] = cfg.get(k, conf[k])
except:
pass
if isinstance(cfg['etcd_address'], str):
cfg['etcd_address'] = cfg['etcd_address'].split(',')
# Sanitize etcd URLs
for i, etcd_url in enumerate(cfg['etcd_address']):
ssl = False
if etcd_url.lower().startswith('http://'):
etcd_url = etcd_url[7:]
elif etcd_url.lower().startswith('https://'):
etcd_url = etcd_url[8:]
ssl = True
if etcd_url.find('/') < 0:
etcd_url += '/v3'
if ssl:
etcd_url = 'https://'+etcd_url
else:
etcd_url = 'http://'+etcd_url
cfg['etcd_address'][i] = etcd_url
return cfg
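# Illustrative example (not part of the patch): the sanitizer above turns "192.168.1.10:2379" into
# "http://192.168.1.10:2379/v3" and leaves "https://etcd.example:2379/v3" unchanged.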
def check_for_setup_error(self):
"""Returns an error if prerequisites aren't met."""
def _encode_etcd_key(self, key):
if not isinstance(key, bytes):
key = str(key).encode('utf-8')
return base64.b64encode(self.cfg['etcd_prefix'].encode('utf-8')+b'/'+key).decode('utf-8')
def _encode_etcd_value(self, value):
if not isinstance(value, bytes):
value = str(value).encode('utf-8')
return base64.b64encode(value).decode('utf-8')
def _encode_etcd_requests(self, obj):
for v in obj:
for rt in v:
if 'key' in v[rt]:
v[rt]['key'] = self._encode_etcd_key(v[rt]['key'])
if 'range_end' in v[rt]:
v[rt]['range_end'] = self._encode_etcd_key(v[rt]['range_end'])
if 'value' in v[rt]:
v[rt]['value'] = self._encode_etcd_value(v[rt]['value'])
def _etcd_txn(self, params):
if 'compare' in params:
for v in params['compare']:
if 'key' in v:
v['key'] = self._encode_etcd_key(v['key'])
if 'failure' in params:
self._encode_etcd_requests(params['failure'])
if 'success' in params:
self._encode_etcd_requests(params['success'])
body = json.dumps(params).encode('utf-8')
headers = {
'Content-Type': 'application/json'
}
err = None
for etcd_url in self.cfg['etcd_address']:
try:
resp = request.urlopen(request.Request(etcd_url+'/kv/txn', body, headers), timeout = 5)
data = json.loads(resp.read())
if 'responses' not in data:
data['responses'] = []
for i, resp in enumerate(data['responses']):
if 'response_range' in resp:
if 'kvs' not in resp['response_range']:
resp['response_range']['kvs'] = []
for kv in resp['response_range']['kvs']:
kv['key'] = base64.b64decode(kv['key'].encode('utf-8')).decode('utf-8')
if kv['key'].startswith(self.cfg['etcd_prefix']+'/'):
kv['key'] = kv['key'][len(self.cfg['etcd_prefix'])+1 : ]
kv['value'] = json.loads(base64.b64decode(kv['value'].encode('utf-8')))
if len(resp.keys()) != 1:
LOG.exception('unknown responses['+str(i)+'] format: '+json.dumps(resp))
else:
resp = data['responses'][i] = resp[list(resp.keys())[0]]
return data
except Exception as e:
LOG.exception('error calling etcd transaction: '+body.decode('utf-8')+'\nerror: '+str(e))
err = e
raise err
def _etcd_foreach(self, prefix, add_fn):
total = 0
batch = 1000
begin = prefix+'/'
while True:
resp = self._etcd_txn({ 'success': [
{ 'request_range': {
'key': begin,
'range_end': prefix+'0',
'limit': batch+1,
} },
] })
i = 0
while i < batch and i < len(resp['responses'][0]['kvs']):
kv = resp['responses'][0]['kvs'][i]
add_fn(kv)
i += 1
if len(resp['responses'][0]['kvs']) <= batch:
break
begin = resp['responses'][0]['kvs'][batch]['key']
return total
def _update_volume_stats(self):
location_info = json.dumps({
'config': self.configuration.vitastor_config_path,
'etcd_address': self.configuration.vitastor_etcd_address,
'etcd_prefix': self.configuration.vitastor_etcd_prefix,
'pool_id': self.configuration.vitastor_pool_id,
})
stats = {
'vendor_name': 'Vitastor',
'driver_version': self.VERSION,
'storage_protocol': 'vitastor',
'total_capacity_gb': 'unknown',
'free_capacity_gb': 'unknown',
# FIXME check if safe_get is required
'reserved_percentage': self.configuration.safe_get('reserved_percentage'),
'multiattach': True,
'thin_provisioning_support': True,
'max_over_subscription_ratio': self.configuration.safe_get('max_over_subscription_ratio'),
'location_info': location_info,
'backend_state': 'down',
'volume_backend_name': self.configuration.safe_get('volume_backend_name') or 'vitastor',
'replication_enabled': False,
}
try:
pool_stats = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'pool/stats/'+str(self.cfg['pool_id']) } }
] })
total_provisioned = 0
def add_total(kv):
nonlocal total_provisioned
if kv['key'].find('@') >= 0:
total_provisioned += kv['value']['size']
self._etcd_foreach('config/inode/'+str(self.cfg['pool_id']), lambda kv: add_total(kv))
stats['provisioned_capacity_gb'] = round(total_provisioned/1024.0/1024.0/1024.0, 2)
pool_stats = pool_stats['responses'][0]['kvs']
if len(pool_stats):
pool_stats = pool_stats[0]['value']
stats['free_capacity_gb'] = round(1024.0*(pool_stats['total_raw_tb']-pool_stats['used_raw_tb'])/pool_stats['raw_to_usable'], 2)
stats['total_capacity_gb'] = round(1024.0*pool_stats['total_raw_tb'], 2)
stats['backend_state'] = 'up'
except Exception as e:
# just log and return unknown capacities
LOG.exception('error getting vitastor pool stats: '+str(e))
self._stats = stats
def _next_id(self, resp):
if len(resp['kvs']) == 0:
return (1, 0)
else:
return (1 + resp['kvs'][0]['value'], resp['kvs'][0]['mod_revision'])
def create_volume(self, volume):
"""Creates a logical volume."""
size = int(volume.size) * units.Gi
# FIXME: Check if convert_str is really required
vol_name = utils.convert_str(volume.name)
if vol_name.find('@') >= 0 or vol_name.find('/') >= 0:
raise exception.VolumeBackendAPIException(data = '@ and / are forbidden in volume and snapshot names')
LOG.debug("creating volume '%s'", vol_name)
self._create_image(vol_name, { 'size': size })
if volume.encryption_key_id:
self._create_encrypted_volume(volume, volume.obj_context)
volume_update = {}
return volume_update
def _create_encrypted_volume(self, volume, context):
"""Create a new LUKS encrypted image directly in Vitastor."""
vol_name = utils.convert_str(volume.name)
f, opts = self._encrypt_opts(volume, context)
# FIXME: Check if it works at all :-)
self._execute(
'qemu-img', 'create', '-f', 'luks', *opts,
'vitastor:image='+vol_name.replace(':', '\\:')+self._qemu_args(),
'%sM' % (volume.size * 1024)
)
f.close()
def _encrypt_opts(self, volume, context):
encryption = volume_utils.check_encryption_provider(self.db, volume, context)
# Fetch the key associated with the volume and decode the passphrase
keymgr = key_manager.API(CONF)
key = keymgr.get(context, encryption['encryption_key_id'])
passphrase = binascii.hexlify(key.get_encoded()).decode('utf-8')
# Decode the dm-crypt style cipher spec into something qemu-img can use
cipher_spec = image_utils.decode_cipher(encryption['cipher'], encryption['key_size'])
tmp_dir = volume_utils.image_conversion_dir()
f = tempfile.NamedTemporaryFile(prefix = 'luks_', dir = tmp_dir)
f.write(passphrase.encode('utf-8'))  # NamedTemporaryFile is opened in binary mode
f.flush()
return (f, [
'--object', 'secret,id=luks_sec,format=raw,file=%(passfile)s' % {'passfile': f.name},
'-o', 'key-secret=luks_sec,cipher-alg=%(cipher_alg)s,cipher-mode=%(cipher_mode)s,ivgen-alg=%(ivgen_alg)s' % cipher_spec,
])
def create_snapshot(self, snapshot):
"""Creates a volume snapshot."""
vol_name = utils.convert_str(snapshot.volume_name)
snap_name = utils.convert_str(snapshot.name)
if snap_name.find('@') >= 0 or snap_name.find('/') >= 0:
raise exception.VolumeBackendAPIException(data = '@ and / are forbidden in volume and snapshot names')
self._create_snapshot(vol_name, vol_name+'@'+snap_name)
def snapshot_revert_use_temp_snapshot(self):
"""Disable the use of a temporary snapshot on revert."""
return False
def revert_to_snapshot(self, context, volume, snapshot):
"""Revert a volume to a given snapshot."""
# FIXME Delete the image, then recreate it from the snapshot
def delete_snapshot(self, snapshot):
"""Deletes a snapshot."""
vol_name = utils.convert_str(snapshot.volume_name)
snap_name = utils.convert_str(snapshot.name)
# Find the snapshot
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name+'@'+snap_name } },
] })
if len(resp['responses'][0]['kvs']) == 0:
raise exception.SnapshotNotFound(snapshot_id = snap_name)
inode_id = int(resp['responses'][0]['kvs'][0]['value']['id'])
pool_id = int(resp['responses'][0]['kvs'][0]['value']['pool_id'])
parents = {}
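# a layer is identified globally as (pool_id << 48) | inode_id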
parents[(pool_id << 48) | (inode_id & 0xffffffffffff)] = True
# Check if there are child volumes
children = self._child_count(parents)
if children > 0:
raise exception.SnapshotIsBusy(snapshot_name = snap_name)
# FIXME: We can't delete snapshots because we can't merge layers yet
raise exception.VolumeBackendAPIException(data = 'Snapshot delete (layer merge) is not implemented yet')
def _child_count(self, parents):
children = 0
def add_child(kv):
nonlocal children
children += self._check_parent(kv, parents)
self._etcd_foreach('config/inode', lambda kv: add_child(kv))
return children
def _check_parent(self, kv, parents):
if 'parent_id' not in kv['value']:
return 0
parent_id = kv['value']['parent_id']
_, _, pool_id, inode_id = kv['key'].split('/')
parent_pool_id = pool_id
if 'parent_pool_id' in kv['value'] and kv['value']['parent_pool_id']:
parent_pool_id = kv['value']['parent_pool_id']
inode = (int(pool_id) << 48) | (int(inode_id) & 0xffffffffffff)
parent = (int(parent_pool_id) << 48) | (int(parent_id) & 0xffffffffffff)
if parent in parents and inode not in parents:
return 1
return 0
def create_cloned_volume(self, volume, src_vref):
"""Create a cloned volume from another volume."""
size = int(volume.size) * units.Gi
src_name = utils.convert_str(src_vref.name)
dest_name = utils.convert_str(volume.name)
if dest_name.find('@') >= 0 or dest_name.find('/') >= 0:
raise exception.VolumeBackendAPIException(data = '@ and / are forbidden in volume and snapshot names')
# FIXME Do full copy if requested (cfg.disable_clone)
if src_vref.admin_metadata.get('readonly') == 'True':
# source volume is a volume-image cache entry or other readonly volume
# clone without intermediate snapshot
src = self._get_image(src_name)
LOG.debug("creating image '%s' from '%s'", dest_name, src_name)
new_cfg = self._create_image(dest_name, {
'size': size,
'parent_id': src['idx']['id'],
'parent_pool_id': src['idx']['pool_id'],
})
return {}
clone_snap = "%s@%s.clone_snap" % (src_name, dest_name)
make_img = True
if (volume.display_name and
volume.display_name.startswith('image-') and
src_vref.project_id != volume.project_id):
# OpenStack creates image-volume cache entries as clones of regular
# VM volumes; in that case create the cache entry as a snapshot of
# the source instead of a separate clone image
clone_snap = dest_name
make_img = False
LOG.debug("creating layer '%s' under '%s'", clone_snap, src_name)
new_cfg = self._create_snapshot(src_name, clone_snap, True)
if make_img:
# Then create a clone from it
new_cfg = self._create_image(dest_name, {
'size': size,
'parent_id': new_cfg['parent_id'],
'parent_pool_id': new_cfg['parent_pool_id'],
})
return {}
def create_volume_from_snapshot(self, volume, snapshot):
"""Creates a cloned volume from an existing snapshot."""
vol_name = utils.convert_str(volume.name)
snap_name = utils.convert_str(snapshot.name)
snap = self._get_image(utils.convert_str(snapshot.volume_name)+'@'+snap_name)
if not snap:
raise exception.SnapshotNotFound(snapshot_id = snap_name)
snap_inode_id = int(snap['idx']['id'])
snap_pool_id = int(snap['idx']['pool_id'])
size = snap['cfg']['size']
if int(volume.size):
size = int(volume.size) * units.Gi
new_cfg = self._create_image(vol_name, {
'size': size,
'parent_id': snap['idx']['id'],
'parent_pool_id': snap['idx']['pool_id'],
})
return {}
def _vitastor_args(self):
args = []
for k in [ 'config_path', 'etcd_address', 'etcd_prefix' ]:
v = self.configuration.safe_get('vitastor_'+k)
if v:
args.extend(['--'+k, v])
return args
def _qemu_args(self):
args = ''
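# builds a ':key=value' suffix for 'vitastor:image=<name>' URIs, escaping ':' in values as '\:'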
for k in [ 'config_path', 'etcd_address', 'etcd_prefix' ]:
v = self.configuration.safe_get('vitastor_'+k)
kk = k
if kk == 'etcd_address':
# FIXME use etcd_address in qemu driver
kk = 'etcd_host'
if v:
args += ':'+kk+'='+v.replace(':', '\\:')
return args
def delete_volume(self, volume):
"""Deletes a logical volume."""
vol_name = utils.convert_str(volume.name)
# Find the volume and all its snapshots
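# etcd prefix scan: range_end is the prefix with its last byte incremented (exclusive upper bound)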
range_end = b'index/image/' + vol_name.encode('utf-8')
range_end = range_end[0 : len(range_end)-1] + six.int2byte(range_end[len(range_end)-1] + 1)
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name, 'range_end': range_end } },
] })
if len(resp['responses'][0]['kvs']) == 0:
# already deleted
LOG.info("volume %s no longer exists in backend", vol_name)
return
layers = resp['responses'][0]['kvs']
layer_ids = {}
for kv in layers:
inode_id = int(kv['value']['id'])
pool_id = int(kv['value']['pool_id'])
inode_pool_id = (pool_id << 48) | (inode_id & 0xffffffffffff)
layer_ids[inode_pool_id] = True
# Check if the volume has clones and raise 'busy' if so
children = self._child_count(layer_ids)
if children > 0:
raise exception.VolumeIsBusy(volume_name = vol_name)
# Clear data
for kv in layers:
args = [
'vitastor-cli', 'rm', '--pool', str(kv['value']['pool_id']),
'--inode', str(kv['value']['id']), '--progress', '0',
*(self._vitastor_args())
]
try:
self._execute(*args)
except processutils.ProcessExecutionError as exc:
LOG.error("Failed to remove layer "+kv['key']+": "+exc)
raise exception.VolumeBackendAPIException(data = exc.stderr)
# Delete all layers from etcd
requests = []
for kv in layers:
requests.append({ 'request_delete_range': { 'key': kv['key'] } })
requests.append({ 'request_delete_range': { 'key': 'config/inode/'+str(kv['value']['pool_id'])+'/'+str(kv['value']['id']) } })
self._etcd_txn({ 'success': requests })
def retype(self, context, volume, new_type, diff, host):
"""Change extra type specifications for a volume."""
# FIXME Maybe (in the future) support multiple pools as different types
return True, {}
def ensure_export(self, context, volume):
"""Synchronously recreates an export for a logical volume."""
pass
def create_export(self, context, volume, connector):
"""Exports the volume."""
pass
def remove_export(self, context, volume):
"""Removes an export for a logical volume."""
pass
def _create_image(self, vol_name, cfg):
pool_s = str(self.cfg['pool_id'])
image_id = 0
while image_id == 0:
# check if the image already exists and find a free ID
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name } },
{ 'request_range': { 'key': 'index/maxid/'+pool_s } },
] })
if len(resp['responses'][0]['kvs']) > 0:
# already exists
raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' already exists')
image_id, id_mod = self._next_id(resp['responses'][1])
# try to create the image
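# CAS: 'version': 0 asserts that a key does not exist yet, the mod_revision compare asserts index/maxid was not bumped concurrently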
resp = self._etcd_txn({ 'compare': [
{ 'target': 'MOD', 'mod_revision': id_mod, 'key': 'index/maxid/'+pool_s },
{ 'target': 'VERSION', 'version': 0, 'key': 'index/image/'+vol_name },
{ 'target': 'VERSION', 'version': 0, 'key': 'config/inode/'+pool_s+'/'+str(image_id) },
], 'success': [
{ 'request_put': { 'key': 'index/maxid/'+pool_s, 'value': image_id } },
{ 'request_put': { 'key': 'index/image/'+vol_name, 'value': json.dumps({
'id': image_id, 'pool_id': self.cfg['pool_id']
}) } },
{ 'request_put': { 'key': 'config/inode/'+pool_s+'/'+str(image_id), 'value': json.dumps({
**cfg, 'name': vol_name,
}) } },
] })
if not resp.get('succeeded'):
# repeat
image_id = 0
def _create_snapshot(self, vol_name, snap_vol_name, allow_existing = False):
while True:
# check if the image already exists and snapshot doesn't
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name } },
{ 'request_range': { 'key': 'index/image/'+snap_vol_name } },
] })
if len(resp['responses'][0]['kvs']) == 0:
raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' does not exist')
if len(resp['responses'][1]['kvs']) > 0:
if allow_existing:
snap_idx = resp['responses'][1]['kvs'][0]['value']
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'config/inode/'+str(snap_idx['pool_id'])+'/'+str(snap_idx['id']) } },
] })
if len(resp['responses'][0]['kvs']) == 0:
raise exception.VolumeBackendAPIException(data =
'Volume '+snap_vol_name+' is already indexed, but does not exist'
)
return resp['responses'][0]['kvs'][0]['value']
raise exception.VolumeBackendAPIException(
data = 'Volume '+snap_vol_name+' already exists'
)
vol_idx = resp['responses'][0]['kvs'][0]['value']
vol_idx_mod = resp['responses'][0]['kvs'][0]['mod_revision']
# get image inode config and find a new ID
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']) } },
{ 'request_range': { 'key': 'index/maxid/'+str(self.cfg['pool_id']) } },
] })
if len(resp['responses'][0]['kvs']) == 0:
raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' does not exist')
vol_cfg = resp['responses'][0]['kvs'][0]['value']
vol_mod = resp['responses'][0]['kvs'][0]['mod_revision']
new_id, id_mod = self._next_id(resp['responses'][1])
# try to redirect image to the new inode
new_cfg = {
**vol_cfg, 'name': vol_name, 'parent_id': vol_idx['id'], 'parent_pool_id': vol_idx['pool_id']
}
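# snapshotting repoints the image name to a fresh inode whose parent is the old inode; the old inode keeps the data, takes the snapshot name and becomes read-only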
resp = self._etcd_txn({ 'compare': [
{ 'target': 'MOD', 'mod_revision': vol_idx_mod, 'key': 'index/image/'+vol_name },
{ 'target': 'MOD', 'mod_revision': vol_mod, 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']) },
{ 'target': 'MOD', 'mod_revision': id_mod, 'key': 'index/maxid/'+str(self.cfg['pool_id']) },
{ 'target': 'VERSION', 'version': 0, 'key': 'index/image/'+snap_vol_name },
{ 'target': 'VERSION', 'version': 0, 'key': 'config/inode/'+str(self.cfg['pool_id'])+'/'+str(new_id) },
], 'success': [
{ 'request_put': { 'key': 'index/maxid/'+str(self.cfg['pool_id']), 'value': new_id } },
{ 'request_put': { 'key': 'index/image/'+vol_name, 'value': json.dumps({
'id': new_id, 'pool_id': self.cfg['pool_id']
}) } },
{ 'request_put': { 'key': 'config/inode/'+str(self.cfg['pool_id'])+'/'+str(new_id), 'value': json.dumps(new_cfg) } },
{ 'request_put': { 'key': 'index/image/'+snap_vol_name, 'value': json.dumps({
'id': vol_idx['id'], 'pool_id': vol_idx['pool_id']
}) } },
{ 'request_put': { 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']), 'value': json.dumps({
**vol_cfg, 'name': snap_vol_name, 'readonly': True
}) } }
] })
if resp.get('succeeded'):
return new_cfg
def initialize_connection(self, volume, connector):
data = {
'driver_volume_type': 'vitastor',
'data': {
'config_path': self.configuration.vitastor_config_path,
'etcd_address': self.configuration.vitastor_etcd_address,
'etcd_prefix': self.configuration.vitastor_etcd_prefix,
'name': volume.name,
'logical_block_size': 512,
'physical_block_size': 4096,
}
}
LOG.debug('connection data: %s', data)
return data
def terminate_connection(self, volume, connector, **kwargs):
pass
def clone_image(self, context, volume, image_location, image_meta, image_service):
if image_location:
# Note: image_location[0] is glance image direct_url.
# image_location[1] contains the list of all locations (including
# direct_url) or None if show_multiple_locations is False in
# glance configuration.
if image_location[1]:
url_locations = [location['url'] for location in image_location[1]]
else:
url_locations = [image_location[0]]
# iterate all locations to look for a cloneable one.
for url_location in url_locations:
if url_location and url_location.startswith('cinder://'):
# The idea is to use cinder://<volume-id> Glance volumes as base images
base_vol = self.db.volume_get(context, url_location[len('cinder://') : ])
if not base_vol or base_vol.volume_type_id != volume.volume_type_id:
continue
size = int(volume.size) * units.Gi
dest_name = utils.convert_str(volume.name)
# Find or create the base snapshot
snap_cfg = self._create_snapshot(base_vol.name, base_vol.name+'@.clone_snap', True)
# Then create a clone from it
new_cfg = self._create_image(dest_name, {
'size': size,
'parent_id': snap_cfg['parent_id'],
'parent_pool_id': snap_cfg['parent_pool_id'],
})
return ({}, True)
return ({}, False)
def copy_image_to_encrypted_volume(self, context, volume, image_service, image_id):
self.copy_image_to_volume(context, volume, image_service, image_id, encrypted = True)
def copy_image_to_volume(self, context, volume, image_service, image_id, encrypted = False):
tmp_dir = volume_utils.image_conversion_dir()
with tempfile.NamedTemporaryFile(dir = tmp_dir) as tmp:
image_utils.fetch_to_raw(
context, image_service, image_id, tmp.name,
self.configuration.volume_dd_blocksize, size = volume.size
)
out_format = [ '-O', 'raw' ]
if encrypted:
key_file, opts = self._encrypt_opts(volume, context)
out_format = [ '-O', 'luks', *opts ]
dest_name = utils.convert_str(volume.name)
self._try_execute(
'qemu-img', 'convert', '-f', 'raw', tmp.name, *out_format,
'vitastor:image='+dest_name.replace(':', '\\:')+self._qemu_args()
)
if encrypted:
key_file.close()
def copy_volume_to_image(self, context, volume, image_service, image_meta):
tmp_dir = volume_utils.image_conversion_dir()
tmp_file = os.path.join(tmp_dir, volume.name + '-' + image_meta['id'])
with fileutils.remove_path_on_error(tmp_file):
vol_name = utils.convert_str(volume.name)
self._try_execute(
'qemu-img', 'convert', '-f', 'raw',
'vitastor:image='+vol_name.replace(':', '\\:')+self._qemu_args(),
'-O', 'raw', tmp_file
)
# FIXME: Copy directly if the destination image is also in Vitastor
volume_utils.upload_volume(context, image_service, image_meta, tmp_file, volume)
os.unlink(tmp_file)
def _get_image(self, vol_name):
# find the image
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'index/image/'+vol_name } },
] })
if len(resp['responses'][0]['kvs']) == 0:
return None
vol_idx = resp['responses'][0]['kvs'][0]['value']
vol_idx_mod = resp['responses'][0]['kvs'][0]['mod_revision']
# get image inode config
resp = self._etcd_txn({ 'success': [
{ 'request_range': { 'key': 'config/inode/'+str(vol_idx['pool_id'])+'/'+str(vol_idx['id']) } },
] })
if len(resp['responses'][0]['kvs']) == 0:
return None
vol_cfg = resp['responses'][0]['kvs'][0]['value']
vol_cfg_mod = resp['responses'][0]['kvs'][0]['mod_revision']
return {
'cfg': vol_cfg,
'cfg_mod': vol_cfg_mod,
'idx': vol_idx,
'idx_mod': vol_idx_mod,
}
def extend_volume(self, volume, new_size):
"""Extend an existing volume."""
vol_name = utils.convert_str(volume.name)
while True:
vol = self._get_image(vol_name)
if not vol:
raise exception.VolumeBackendAPIException(data = 'Volume '+vol_name+' does not exist')
# change size
size = int(new_size) * units.Gi
if size == vol['cfg']['size']:
break
resp = self._etcd_txn({ 'compare': [ {
'target': 'MOD',
'mod_revision': vol['cfg_mod'],
'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']),
} ], 'success': [
{ 'request_put': {
'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']),
'value': json.dumps({ **vol['cfg'], 'size': size }),
} },
] })
if resp.get('succeeded'):
break
LOG.debug(
"Extend volume from %(old_size)s GB to %(new_size)s GB.",
{'old_size': volume.size, 'new_size': new_size}
)
def _add_manageable_volume(self, kv, manageable_volumes, cinder_ids):
cfg = kv['value']
if cfg['name'].find('@') >= 0:
# snapshot
return
image_id = volume_utils.extract_id_from_volume_name(cfg['name'])
image_info = {
'reference': {'source-name': cfg['name']},
'size': int(math.ceil(float(cfg['size']) / units.Gi)),
'cinder_id': None,
'extra_info': None,
}
if image_id in cinder_ids:
image_info['cinder_id'] = image_id
image_info['safe_to_manage'] = False
image_info['reason_not_safe'] = 'already managed'
else:
image_info['safe_to_manage'] = True
image_info['reason_not_safe'] = None
manageable_volumes.append(image_info)
def get_manageable_volumes(self, cinder_volumes, marker, limit, offset, sort_keys, sort_dirs):
manageable_volumes = []
cinder_ids = [resource['id'] for resource in cinder_volumes]
# List all volumes
# FIXME: It's possible to use pagination in our case, but.. do we want it?
self._etcd_foreach('config/inode/'+str(self.cfg['pool_id']),
lambda kv: self._add_manageable_volume(kv, manageable_volumes, cinder_ids))
return volume_utils.paginate_entries_list(
manageable_volumes, marker, limit, offset, sort_keys, sort_dirs)
def _get_existing_name(self, existing_ref):
if not isinstance(existing_ref, dict):
existing_ref = {"source-name": existing_ref}
if 'source-name' not in existing_ref:
reason = _('Reference must contain source-name element.')
raise exception.ManageExistingInvalidReference(existing_ref=existing_ref, reason=reason)
src_name = utils.convert_str(existing_ref['source-name'])
if not src_name:
reason = _('Reference must contain source-name element.')
raise exception.ManageExistingInvalidReference(existing_ref=existing_ref, reason=reason)
return src_name
def manage_existing_get_size(self, volume, existing_ref):
"""Return size of an existing image for manage_existing.
:param volume: volume ref info to be set
:param existing_ref: {'source-name': <image name>}
"""
src_name = self._get_existing_name(existing_ref)
vol = self._get_image(src_name)
if not vol:
raise exception.VolumeBackendAPIException(data = 'Volume '+src_name+' does not exist')
return int(math.ceil(float(vol['cfg']['size']) / units.Gi))
def manage_existing(self, volume, existing_ref):
"""Manages an existing image.
Renames the image name to match the expected name for the volume.
:param volume: volume ref info to be set
:param existing_ref: {'source-name': <image name>}
"""
from_name = self._get_existing_name(existing_ref)
to_name = utils.convert_str(volume.name)
self._rename(from_name, to_name)
def _rename(self, from_name, to_name):
while True:
vol = self._get_image(from_name)
if not vol:
raise exception.VolumeBackendAPIException(data = 'Volume '+from_name+' does not exist')
to = self._get_image(to_name)
if to:
raise exception.VolumeBackendAPIException(data = 'Volume '+to_name+' already exists')
resp = self._etcd_txn({ 'compare': [
{ 'target': 'MOD', 'mod_revision': vol['idx_mod'], 'key': 'index/image/'+vol['cfg']['name'] },
{ 'target': 'MOD', 'mod_revision': vol['cfg_mod'], 'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']) },
{ 'target': 'VERSION', 'version': 0, 'key': 'index/image/'+to_name },
], 'success': [
{ 'request_delete_range': { 'key': 'index/image/'+vol['cfg']['name'] } },
{ 'request_put': { 'key': 'index/image/'+to_name, 'value': json.dumps(vol['idx']) } },
{ 'request_put': { 'key': 'config/inode/'+str(vol['idx']['pool_id'])+'/'+str(vol['idx']['id']),
'value': json.dumps({ **vol['cfg'], 'name': to_name }) } },
] })
if resp.get('succeeded'):
break
def unmanage(self, volume):
pass
def _add_manageable_snapshot(self, kv, manageable_snapshots, cinder_ids):
cfg = kv['value']
dog = cfg['name'].find('@')
if dog < 0:
# not a snapshot (plain volume), skip it
return
image_name = cfg['name'][0 : dog]
snap_name = cfg['name'][dog+1 : ]
snapshot_id = volume_utils.extract_id_from_snapshot_name(snap_name)
snapshot_info = {
'reference': {'source-name': snap_name},
'size': int(math.ceil(float(cfg['size']) / units.Gi)),
'cinder_id': None,
'extra_info': None,
'safe_to_manage': False,
'reason_not_safe': None,
'source_reference': {'source-name': image_name}
}
if snapshot_id in cinder_ids:
# Exclude snapshots already managed.
snapshot_info['reason_not_safe'] = ('already managed')
snapshot_info['cinder_id'] = snapshot_id
elif snap_name.endswith('.clone_snap'):
# Exclude clone snapshot.
snapshot_info['reason_not_safe'] = ('used for clone snap')
else:
snapshot_info['safe_to_manage'] = True
manageable_snapshots.append(snapshot_info)
def get_manageable_snapshots(self, cinder_snapshots, marker, limit, offset, sort_keys, sort_dirs):
"""List manageable snapshots in Vitastor."""
manageable_snapshots = []
cinder_snapshot_ids = [resource['id'] for resource in cinder_snapshots]
# List all volumes
# FIXME: It's possible to use pagination in our case, but.. do we want it?
self._etcd_foreach('config/inode/'+str(self.cfg['pool_id']),
lambda kv: self._add_manageable_snapshot(kv, manageable_snapshots, cinder_snapshot_ids))
return volume_utils.paginate_entries_list(
manageable_snapshots, marker, limit, offset, sort_keys, sort_dirs)
def manage_existing_snapshot_get_size(self, snapshot, existing_ref):
"""Return size of an existing image for manage_existing.
:param snapshot: snapshot ref info to be set
:param existing_ref: {'source-name': <name of snapshot>}
"""
vol_name = utils.convert_str(snapshot.volume_name)
snap_name = self._get_existing_name(existing_ref)
vol = self._get_image(vol_name+'@'+snap_name)
if not vol:
raise exception.ManageExistingInvalidReference(
existing_ref=existing_ref, reason='Specified snapshot does not exist.'
)
return int(math.ceil(float(vol['cfg']['size']) / units.Gi))
def manage_existing_snapshot(self, snapshot, existing_ref):
"""Manages an existing snapshot.
Renames the snapshot name to match the expected name for the snapshot.
Error checking done by manage_existing_get_size is not repeated.
:param snapshot: snapshot ref info to be set
:param existing_ref: {'source-name': <name of snapshot>}
"""
vol_name = utils.convert_str(snapshot.volume_name)
snap_name = self._get_existing_name(existing_ref)
from_name = vol_name+'@'+snap_name
to_name = vol_name+'@'+utils.convert_str(snapshot.name)
self._rename(from_name, to_name)
def unmanage_snapshot(self, snapshot):
"""Removes the specified snapshot from Cinder management."""
pass
def _dumps(self, obj):
return json.dumps(obj, separators=(',', ':'), sort_keys=True)
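
For reference, a minimal cinder.conf backend section wiring up the options this driver reads could look like the sketch below. The volume_driver class path is an assumption (the class and option definitions are outside the code shown above); the vitastor_* option names and the comma-separated etcd address format come straight from the driver.

[vitastor]
volume_backend_name = vitastor
volume_driver = cinder.volume.drivers.vitastor.VitastorDriver
vitastor_etcd_address = 10.0.0.1:2379,10.0.0.2:2379
vitastor_etcd_prefix = /vitastor
vitastor_pool_id = 1
# alternatively point the driver at a local Vitastor config file:
# vitastor_config_path = /etc/vitastor/vitastor.conf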

View File

@@ -0,0 +1,23 @@
# Devstack configuration for bridged networking
[[local|localrc]]
ADMIN_PASSWORD=secret
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD
HOST_IP=10.0.2.15
Q_USE_SECGROUP=True
FLOATING_RANGE="10.0.2.0/24"
IPV4_ADDRS_SAFE_TO_USE="10.0.5.0/24"
Q_FLOATING_ALLOCATION_POOL=start=10.0.2.50,end=10.0.2.100
PUBLIC_NETWORK_GATEWAY=10.0.2.2
PUBLIC_INTERFACE=ens3
Q_USE_PROVIDERNET_FOR_PUBLIC=True
Q_AGENT=linuxbridge
Q_ML2_PLUGIN_MECHANISM_DRIVERS=linuxbridge
LB_PHYSICAL_INTERFACE=ens3
PUBLIC_PHYSICAL_NETWORK=default
LB_INTERFACE_MAPPINGS=default:ens3
Q_SERVICE_PLUGIN_CLASSES=
Q_ML2_PLUGIN_TYPE_DRIVERS=flat
Q_ML2_PLUGIN_EXT_DRIVERS=

View File

@@ -0,0 +1,609 @@
commit bd283191b3e7a4c6d1c100d3d96e348a1ebffe55
Author: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Sun Jun 27 12:52:40 2021 +0300
Add Vitastor support
diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index aa50eac..082b4f8 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -1728,6 +1728,35 @@
</element>
</define>
+ <define name="diskSourceNetworkProtocolVitastor">
+ <element name="source">
+ <interleave>
+ <attribute name="protocol">
+ <value>vitastor</value>
+ </attribute>
+ <ref name="diskSourceCommon"/>
+ <optional>
+ <attribute name="name"/>
+ </optional>
+ <optional>
+ <attribute name="query"/>
+ </optional>
+ <zeroOrMore>
+ <ref name="diskSourceNetworkHost"/>
+ </zeroOrMore>
+ <optional>
+ <element name="config">
+ <attribute name="file">
+ <ref name="absFilePath"/>
+ </attribute>
+ <empty/>
+ </element>
+ </optional>
+ <empty/>
+ </interleave>
+ </element>
+ </define>
+
<define name="diskSourceNetworkProtocolISCSI">
<element name="source">
<attribute name="protocol">
@@ -1851,6 +1880,7 @@
<ref name="diskSourceNetworkProtocolHTTP"/>
<ref name="diskSourceNetworkProtocolSimple"/>
<ref name="diskSourceNetworkProtocolVxHS"/>
+ <ref name="diskSourceNetworkProtocolVitastor"/>
</choice>
</define>
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
index 4bf2b5f..dbc011b 100644
--- a/include/libvirt/libvirt-storage.h
+++ b/include/libvirt/libvirt-storage.h
@@ -240,6 +240,7 @@ typedef enum {
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER = 1 << 16,
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS = 1 << 17,
VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE = 1 << 18,
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR = 1 << 20,
} virConnectListAllStoragePoolsFlags;
int virConnectListAllStoragePools(virConnectPtr conn,
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 222bb8c..685d255 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -8653,6 +8653,10 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
goto cleanup;
}
+ if (src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR) {
+ src->relPath = virXMLPropString(node, "query");
+ }
+
if ((haveTLS = virXMLPropString(node, "tls")) &&
(src->haveTLS = virTristateBoolTypeFromString(haveTLS)) <= 0) {
virReportError(VIR_ERR_XML_ERROR,
@@ -23849,6 +23853,10 @@ virDomainDiskSourceFormatNetwork(virBufferPtr attrBuf,
virBufferEscapeString(attrBuf, " name='%s'", path ? path : src->path);
+ if (src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR && src->relPath != NULL) {
+ virBufferEscapeString(attrBuf, " query='%s'", src->relPath);
+ }
+
VIR_FREE(path);
if (src->haveTLS != VIR_TRISTATE_BOOL_ABSENT &&
@@ -30930,6 +30938,7 @@ virDomainDiskTranslateSourcePool(virDomainDiskDefPtr def)
case VIR_STORAGE_POOL_MPATH:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
index 55db7a9..7cbe937 100644
--- a/src/conf/storage_conf.c
+++ b/src/conf/storage_conf.c
@@ -58,7 +58,7 @@ VIR_ENUM_IMPL(virStoragePool,
"logical", "disk", "iscsi",
"iscsi-direct", "scsi", "mpath",
"rbd", "sheepdog", "gluster",
- "zfs", "vstorage")
+ "zfs", "vstorage", "vitastor")
VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
VIR_STORAGE_POOL_FS_LAST,
@@ -232,6 +232,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
.formatToString = virStorageFileFormatTypeToString,
}
},
+ {.poolType = VIR_STORAGE_POOL_VITASTOR,
+ .poolOptions = {
+ .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
+ VIR_STORAGE_POOL_SOURCE_NETWORK |
+ VIR_STORAGE_POOL_SOURCE_NAME),
+ },
+ .volOptions = {
+ .defaultFormat = VIR_STORAGE_FILE_RAW,
+ .formatFromString = virStorageVolumeFormatFromString,
+ .formatToString = virStorageFileFormatTypeToString,
+ }
+ },
{.poolType = VIR_STORAGE_POOL_SHEEPDOG,
.poolOptions = {
.flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -434,6 +446,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
_("element 'name' is mandatory for RBD pool"));
goto cleanup;
}
+ if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
+ virReportError(VIR_ERR_XML_ERROR, "%s",
+ _("element 'name' is mandatory for Vitastor pool"));
+ return -1;
+ }
if (options->formatFromString) {
char *format = virXPathString("string(./format/@type)", ctxt);
@@ -1009,6 +1026,7 @@ virStoragePoolDefFormatBuf(virBufferPtr buf,
/* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
* files, so they don't have a target */
if (def->type != VIR_STORAGE_POOL_RBD &&
+ def->type != VIR_STORAGE_POOL_VITASTOR &&
def->type != VIR_STORAGE_POOL_SHEEPDOG &&
def->type != VIR_STORAGE_POOL_GLUSTER &&
def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
index dc0aa2a..ed4983d 100644
--- a/src/conf/storage_conf.h
+++ b/src/conf/storage_conf.h
@@ -91,6 +91,7 @@ typedef enum {
VIR_STORAGE_POOL_GLUSTER, /* Gluster device */
VIR_STORAGE_POOL_ZFS, /* ZFS */
VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
+ VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
VIR_STORAGE_POOL_LAST,
} virStoragePoolType;
@@ -422,6 +423,7 @@ VIR_ENUM_DECL(virStoragePartedFs)
VIR_CONNECT_LIST_STORAGE_POOLS_SCSI | \
VIR_CONNECT_LIST_STORAGE_POOLS_MPATH | \
VIR_CONNECT_LIST_STORAGE_POOLS_RBD | \
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER | \
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS | \
diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
index 6ea6a97..3ba45b9 100644
--- a/src/conf/virstorageobj.c
+++ b/src/conf/virstorageobj.c
@@ -1478,6 +1478,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
return 1;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_LAST:
break;
@@ -1971,6 +1972,8 @@ virStoragePoolObjMatch(virStoragePoolObjPtr obj,
(obj->def->type == VIR_STORAGE_POOL_MPATH)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
(obj->def->type == VIR_STORAGE_POOL_RBD)) ||
+ (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
+ (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
(obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
index 2ea3e94..d5d2273 100644
--- a/src/libvirt-storage.c
+++ b/src/libvirt-storage.c
@@ -92,6 +92,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
* VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
* VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
* VIR_CONNECT_LIST_STORAGE_POOLS_RBD
+ * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
* VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
*
* Returns the number of storage pools found or -1 and sets @pools to
diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
index 73e988a..ab7bb81 100644
--- a/src/libxl/libxl_conf.c
+++ b/src/libxl/libxl_conf.c
@@ -905,6 +905,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSourcePtr src,
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
index cbf0aa4..096700d 100644
--- a/src/qemu/qemu_block.c
+++ b/src/qemu/qemu_block.c
@@ -959,6 +959,42 @@ qemuBlockStorageSourceGetRBDProps(virStorageSourcePtr src)
}
+static virJSONValuePtr
+qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
+{
+ virJSONValuePtr ret = NULL;
+ virStorageNetHostDefPtr host;
+ size_t i;
+ virBuffer buf = VIR_BUFFER_INITIALIZER;
+ char *etcd = NULL;
+
+ for (i = 0; i < src->nhosts; i++) {
+ host = src->hosts + i;
+ if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
+ goto cleanup;
+ }
+ virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
+ }
+ if (src->nhosts > 0) {
+ etcd = virBufferContentAndReset(&buf);
+ }
+
+ if (virJSONValueObjectCreate(&ret,
+ "s:driver", "vitastor",
+ "S:etcd_host", etcd,
+ "S:etcd_prefix", src->relPath,
+ "S:config_path", src->configFile,
+ "s:image", src->path,
+ NULL) < 0)
+ goto cleanup;
+
+cleanup:
+ VIR_FREE(etcd);
+ virBufferFreeAndReset(&buf);
+ return ret;
+}
+
+
static virJSONValuePtr
qemuBlockStorageSourceGetSheepdogProps(virStorageSourcePtr src)
{
@@ -1174,6 +1210,11 @@ qemuBlockStorageSourceGetBackendProps(virStorageSourcePtr src,
return NULL;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
+ return NULL;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
return NULL;
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 822d5f8..e375cef 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -975,6 +975,43 @@ qemuBuildNetworkDriveStr(virStorageSourcePtr src,
ret = virBufferContentAndReset(&buf);
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (strchr(src->path, ':')) {
+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+ _("':' not allowed in Vitastor source volume name '%s'"),
+ src->path);
+ return NULL;
+ }
+
+ virBufferStrcat(&buf, "vitastor:image=", src->path, NULL);
+
+ if (src->nhosts > 0) {
+ virBufferAddLit(&buf, ":etcd_host=");
+ for (i = 0; i < src->nhosts; i++) {
+ if (i)
+ virBufferAddLit(&buf, ",");
+
+ /* assume host containing : is ipv6 */
+ if (strchr(src->hosts[i].name, ':'))
+ virBufferEscape(&buf, '\\', ":", "[%s]",
+ src->hosts[i].name);
+ else
+ virBufferAsprintf(&buf, "%s", src->hosts[i].name);
+
+ if (src->hosts[i].port)
+ virBufferAsprintf(&buf, "\\:%u", src->hosts[i].port);
+ }
+ }
+
+ if (src->configFile)
+ virBufferEscape(&buf, '\\', ":", ":config_path=%s", src->configFile);
+
+ if (src->relPath)
+ virBufferEscape(&buf, '\\', ":", ":etcd_prefix=%s", src->relPath);
+
+ ret = virBufferContentAndReset(&buf);
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_VXHS:
virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
_("VxHS protocol does not support URI syntax"));
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index ec6b340..f399efa 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10881,6 +10881,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSourcePtr src,
break;
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 1d96170..2d24396 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -14687,6 +14687,7 @@ qemuDomainSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDefPtr snapdi
case VIR_STORAGE_NET_PROTOCOL_TFTP:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
virReportError(VIR_ERR_INTERNAL_ERROR,
_("external inactive snapshots are not supported on "
@@ -14764,6 +14765,7 @@ qemuDomainSnapshotPrepareDiskExternalActive(virDomainSnapshotDiskDefPtr snapdisk
case VIR_STORAGE_NET_PROTOCOL_TFTP:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
virReportError(VIR_ERR_INTERNAL_ERROR,
_("external active snapshots are not supported on "
@@ -14887,6 +14889,7 @@ qemuDomainSnapshotPrepareDiskInternal(virDomainDiskDefPtr disk,
case VIR_STORAGE_NET_PROTOCOL_TFTP:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
virReportError(VIR_ERR_INTERNAL_ERROR,
_("internal inactive snapshots are not supported on "
diff --git a/src/qemu/qemu_parse_command.c b/src/qemu/qemu_parse_command.c
index c4650f0..551da41 100644
--- a/src/qemu/qemu_parse_command.c
+++ b/src/qemu/qemu_parse_command.c
@@ -2184,6 +2184,7 @@ qemuParseCommandLine(virFileCachePtr capsCache,
case VIR_STORAGE_NET_PROTOCOL_TFTP:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_LAST:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_NONE:
/* ignored for now */
break;
diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
index 4a13e90..33301c7 100644
--- a/src/storage/storage_driver.c
+++ b/src/storage/storage_driver.c
@@ -1568,6 +1568,7 @@ storageVolLookupByPathCallback(virStoragePoolObjPtr obj,
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_ZFS:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_LAST:
ignore_value(VIR_STRDUP(stable_path, data->path));
break;
diff --git a/src/util/virstoragefile.c b/src/util/virstoragefile.c
index bd4b027..b323cd6 100644
--- a/src/util/virstoragefile.c
+++ b/src/util/virstoragefile.c
@@ -84,7 +84,8 @@ VIR_ENUM_IMPL(virStorageNetProtocol, VIR_STORAGE_NET_PROTOCOL_LAST,
"ftps",
"tftp",
"ssh",
- "vxhs")
+ "vxhs",
+ "vitastor")
VIR_ENUM_IMPL(virStorageNetHostTransport, VIR_STORAGE_NET_HOST_TRANS_LAST,
"tcp",
@@ -2839,6 +2840,83 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
}
+static int
+virStorageSourceParseVitastorColonString(const char *colonstr,
+ virStorageSourcePtr src)
+{
+ char *p, *e, *next;
+ char *options = NULL;
+
+ /* optionally skip the "vitastor:" prefix if provided */
+ if (STRPREFIX(colonstr, "vitastor:"))
+ colonstr += strlen("vitastor:");
+
+ if (VIR_STRDUP(options, colonstr) < 0)
+ return -1;
+
+ p = options;
+ while (*p) {
+ /* find : delimiter or end of string */
+ for (e = p; *e && *e != ':'; ++e) {
+ if (*e == '\\') {
+ e++;
+ if (*e == '\0')
+ break;
+ }
+ }
+ if (*e == '\0') {
+ next = e; /* last kv pair */
+ } else {
+ next = e + 1;
+ *e = '\0';
+ }
+
+ if (STRPREFIX(p, "image=")) {
+ if (VIR_STRDUP(src->path, p + strlen("image=")) < 0)
+ return -1;
+ } else if (STRPREFIX(p, "etcd_prefix=")) {
+ if (VIR_STRDUP(src->relPath, p + strlen("etcd_prefix=")) < 0)
+ return -1;
+ } else if (STRPREFIX(p, "config_file=")) {
+ if (VIR_STRDUP(src->configFile, p + strlen("config_file=")) < 0)
+ return -1;
+ } else if (STRPREFIX(p, "etcd_host=")) {
+ char *h, *sep;
+
+ h = p + strlen("etcd_host=");
+ while (h < e) {
+ for (sep = h; sep < e; ++sep) {
+ if (*sep == '\\' && (sep[1] == ',' ||
+ sep[1] == ';' ||
+ sep[1] == ' ')) {
+ *sep = '\0';
+ sep += 2;
+ break;
+ }
+ }
+
+ if (virStorageSourceRBDAddHost(src, h) < 0)
+ goto error;
+
+ h = sep;
+ }
+ }
+
+ p = next;
+ }
+
+ if (!src->path) {
+ goto error;
+ }
+
+ return 0;
+
+error:
+ VIR_FREE(options);
+ return -1;
+}
+
+
static int
virStorageSourceParseNBDColonString(const char *nbdstr,
virStorageSourcePtr src)
@@ -2942,6 +3020,11 @@ virStorageSourceParseBackingColon(virStorageSourcePtr src,
goto cleanup;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (virStorageSourceParseVitastorColonString(path, src) < 0)
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -3441,6 +3524,56 @@ virStorageSourceParseBackingJSONRBD(virStorageSourcePtr src,
return ret;
}
+static int
+virStorageSourceParseBackingJSONVitastor(virStorageSourcePtr src,
+ virJSONValuePtr json,
+ int opaque ATTRIBUTE_UNUSED)
+{
+ const char *filename;
+ const char *image = virJSONValueObjectGetString(json, "image");
+ const char *conf = virJSONValueObjectGetString(json, "config_path");
+ const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd_prefix");
+ virJSONValuePtr servers = virJSONValueObjectGetArray(json, "server");
+ size_t nservers;
+ size_t i;
+
+ src->type = VIR_STORAGE_TYPE_NETWORK;
+ src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
+
+ /* legacy syntax passed via 'filename' option */
+ if ((filename = virJSONValueObjectGetString(json, "filename")))
+ return virStorageSourceParseVitastorColonString(filename, src);
+
+ if (!image) {
+ virReportError(VIR_ERR_INVALID_ARG, "%s",
+ _("missing image name in Vitastor backing volume "
+ "JSON specification"));
+ return -1;
+ }
+
+ if (VIR_STRDUP(src->path, image) < 0 ||
+ VIR_STRDUP(src->configFile, conf) < 0 ||
+ VIR_STRDUP(src->relPath, etcd_prefix) < 0)
+ return -1;
+
+ if (servers) {
+ nservers = virJSONValueArraySize(servers);
+
+ if (VIR_ALLOC_N(src->hosts, nservers) < 0)
+ return -1;
+
+ src->nhosts = nservers;
+
+ for (i = 0; i < nservers; i++) {
+ if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
+ virJSONValueArrayGet(servers, i)) < 0)
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
static int
virStorageSourceParseBackingJSONRaw(virStorageSourcePtr src,
virJSONValuePtr json,
@@ -3507,6 +3640,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
{"sheepdog", virStorageSourceParseBackingJSONSheepdog, 0},
{"ssh", virStorageSourceParseBackingJSONSSH, 0},
{"rbd", virStorageSourceParseBackingJSONRBD, 0},
+ {"vitastor", virStorageSourceParseBackingJSONVitastor, 0},
{"raw", virStorageSourceParseBackingJSONRaw, 0},
{"vxhs", virStorageSourceParseBackingJSONVxHS, 0},
};
@@ -4276,6 +4410,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
return 24007;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_RBD:
/* we don't provide a default for RBD */
return 0;
diff --git a/src/util/virstoragefile.h b/src/util/virstoragefile.h
index 1d6161a..8d83bf3 100644
--- a/src/util/virstoragefile.h
+++ b/src/util/virstoragefile.h
@@ -134,6 +134,7 @@ typedef enum {
VIR_STORAGE_NET_PROTOCOL_TFTP,
VIR_STORAGE_NET_PROTOCOL_SSH,
VIR_STORAGE_NET_PROTOCOL_VXHS,
+ VIR_STORAGE_NET_PROTOCOL_VITASTOR,
VIR_STORAGE_NET_PROTOCOL_LAST
} virStorageNetProtocol;
diff --git a/src/xenconfig/xen_xl.c b/src/xenconfig/xen_xl.c
index accfc3a..a18f9c3 100644
--- a/src/xenconfig/xen_xl.c
+++ b/src/xenconfig/xen_xl.c
@@ -1535,6 +1535,7 @@ xenFormatXLDiskSrcNet(virStorageSourcePtr src)
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
index 70ca39b..9caef51 100644
--- a/tools/virsh-pool.c
+++ b/tools/virsh-pool.c
@@ -1219,6 +1219,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd ATTRIBUTE_UNUSED)
case VIR_STORAGE_POOL_VSTORAGE:
flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
+ flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
+ break;
case VIR_STORAGE_POOL_LAST:
break;
}

View File

@@ -0,0 +1,657 @@
commit 41cdfe8317d98f70aadedfdbb381effed2641bdd
Author: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Fri Jul 9 01:31:57 2021 +0300
Add Vitastor support
diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index 7dc419b..875433b 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -1827,6 +1827,35 @@
</element>
</define>
+ <define name="diskSourceNetworkProtocolVitastor">
+ <element name="source">
+ <interleave>
+ <attribute name="protocol">
+ <value>vitastor</value>
+ </attribute>
+ <ref name="diskSourceCommon"/>
+ <optional>
+ <attribute name="name"/>
+ </optional>
+ <optional>
+ <attribute name="query"/>
+ </optional>
+ <zeroOrMore>
+ <ref name="diskSourceNetworkHost"/>
+ </zeroOrMore>
+ <optional>
+ <element name="config">
+ <attribute name="file">
+ <ref name="absFilePath"/>
+ </attribute>
+ <empty/>
+ </element>
+ </optional>
+ <empty/>
+ </interleave>
+ </element>
+ </define>
+
<define name="diskSourceNetworkProtocolISCSI">
<element name="source">
<attribute name="protocol">
@@ -2083,6 +2112,7 @@
<ref name="diskSourceNetworkProtocolSimple"/>
<ref name="diskSourceNetworkProtocolVxHS"/>
<ref name="diskSourceNetworkProtocolNFS"/>
+ <ref name="diskSourceNetworkProtocolVitastor"/>
</choice>
</define>
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
index 089e1e0..d7e7ef4 100644
--- a/include/libvirt/libvirt-storage.h
+++ b/include/libvirt/libvirt-storage.h
@@ -245,6 +245,7 @@ typedef enum {
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS = 1 << 17,
VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE = 1 << 18,
VIR_CONNECT_LIST_STORAGE_POOLS_ISCSI_DIRECT = 1 << 19,
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR = 1 << 20,
} virConnectListAllStoragePoolsFlags;
int virConnectListAllStoragePools(virConnectPtr conn,
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 01b7187..c6e9702 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -8261,7 +8261,8 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
src->configFile = virXPathString("string(./config/@file)", ctxt);
if (src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTP ||
- src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS)
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS ||
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR)
src->query = virXMLPropString(node, "query");
if (virDomainStorageNetworkParseHosts(node, ctxt, &src->hosts, &src->nhosts) < 0)
@@ -31392,6 +31393,7 @@ virDomainStorageSourceTranslateSourcePool(virStorageSourcePtr src,
case VIR_STORAGE_POOL_MPATH:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
index 0c50529..fe97574 100644
--- a/src/conf/storage_conf.c
+++ b/src/conf/storage_conf.c
@@ -60,7 +60,7 @@ VIR_ENUM_IMPL(virStoragePool,
"logical", "disk", "iscsi",
"iscsi-direct", "scsi", "mpath",
"rbd", "sheepdog", "gluster",
- "zfs", "vstorage",
+ "zfs", "vstorage", "vitastor",
);
VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
@@ -249,6 +249,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
.formatToString = virStorageFileFormatTypeToString,
}
},
+ {.poolType = VIR_STORAGE_POOL_VITASTOR,
+ .poolOptions = {
+ .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
+ VIR_STORAGE_POOL_SOURCE_NETWORK |
+ VIR_STORAGE_POOL_SOURCE_NAME),
+ },
+ .volOptions = {
+ .defaultFormat = VIR_STORAGE_FILE_RAW,
+ .formatFromString = virStorageVolumeFormatFromString,
+ .formatToString = virStorageFileFormatTypeToString,
+ }
+ },
{.poolType = VIR_STORAGE_POOL_SHEEPDOG,
.poolOptions = {
.flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -551,6 +563,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
_("element 'name' is mandatory for RBD pool"));
goto cleanup;
}
+ if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
+ virReportError(VIR_ERR_XML_ERROR, "%s",
+ _("element 'name' is mandatory for Vitastor pool"));
+ return -1;
+ }
if (options->formatFromString) {
g_autofree char *format = NULL;
@@ -1217,6 +1234,7 @@ virStoragePoolDefFormatBuf(virBufferPtr buf,
/* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
* files, so they don't have a target */
if (def->type != VIR_STORAGE_POOL_RBD &&
+ def->type != VIR_STORAGE_POOL_VITASTOR &&
def->type != VIR_STORAGE_POOL_SHEEPDOG &&
def->type != VIR_STORAGE_POOL_GLUSTER &&
def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
index ffd406e..8868a05 100644
--- a/src/conf/storage_conf.h
+++ b/src/conf/storage_conf.h
@@ -110,6 +110,7 @@ typedef enum {
VIR_STORAGE_POOL_GLUSTER, /* Gluster device */
VIR_STORAGE_POOL_ZFS, /* ZFS */
VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
+ VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
VIR_STORAGE_POOL_LAST,
} virStoragePoolType;
@@ -474,6 +475,7 @@ VIR_ENUM_DECL(virStoragePartedFs);
VIR_CONNECT_LIST_STORAGE_POOLS_SCSI | \
VIR_CONNECT_LIST_STORAGE_POOLS_MPATH | \
VIR_CONNECT_LIST_STORAGE_POOLS_RBD | \
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER | \
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS | \
diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
index 9fe8b3f..bf595b0 100644
--- a/src/conf/virstorageobj.c
+++ b/src/conf/virstorageobj.c
@@ -1491,6 +1491,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
return 1;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_LAST:
break;
@@ -1990,6 +1991,8 @@ virStoragePoolObjMatch(virStoragePoolObjPtr obj,
(obj->def->type == VIR_STORAGE_POOL_MPATH)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
(obj->def->type == VIR_STORAGE_POOL_RBD)) ||
+ (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
+ (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
(obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
index 2a7cdca..f756be1 100644
--- a/src/libvirt-storage.c
+++ b/src/libvirt-storage.c
@@ -92,6 +92,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
* VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
* VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
* VIR_CONNECT_LIST_STORAGE_POOLS_RBD
+ * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
* VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
* VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER
* VIR_CONNECT_LIST_STORAGE_POOLS_ZFS
diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
index 6a8ae27..a735bc6 100644
--- a/src/libxl/libxl_conf.c
+++ b/src/libxl/libxl_conf.c
@@ -942,6 +942,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSourcePtr src,
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/libxl/xen_xl.c b/src/libxl/xen_xl.c
index 17b93d0..c5a0084 100644
--- a/src/libxl/xen_xl.c
+++ b/src/libxl/xen_xl.c
@@ -1601,6 +1601,7 @@ xenFormatXLDiskSrcNet(virStorageSourcePtr src)
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
index f9c6da2..922dde5 100644
--- a/src/qemu/qemu_block.c
+++ b/src/qemu/qemu_block.c
@@ -938,6 +938,38 @@ qemuBlockStorageSourceGetRBDProps(virStorageSourcePtr src,
}
+static virJSONValuePtr
+qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
+{
+ virJSONValuePtr ret = NULL;
+ virStorageNetHostDefPtr host;
+ size_t i;
+ g_auto(virBuffer) buf = VIR_BUFFER_INITIALIZER;
+ g_autofree char *etcd = NULL;
+
+ for (i = 0; i < src->nhosts; i++) {
+ host = src->hosts + i;
+ if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
+ return NULL;
+ }
+ virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
+ }
+ if (src->nhosts > 0) {
+ etcd = virBufferContentAndReset(&buf);
+ }
+
+ if (virJSONValueObjectCreate(&ret,
+ "S:etcd_host", etcd,
+ "S:etcd_prefix", src->query,
+ "S:config_path", src->configFile,
+ "s:image", src->path,
+ NULL) < 0)
+ return NULL;
+
+ return ret;
+}
+
+
static virJSONValuePtr
qemuBlockStorageSourceGetSheepdogProps(virStorageSourcePtr src)
{
@@ -1224,6 +1256,12 @@ qemuBlockStorageSourceGetBackendProps(virStorageSourcePtr src,
return NULL;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
+ return NULL;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
@@ -2183,6 +2221,7 @@ qemuBlockGetBackingStoreString(virStorageSourcePtr src,
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
case VIR_STORAGE_NET_PROTOCOL_SSH:
@@ -2560,6 +2599,12 @@ qemuBlockStorageSourceCreateGetStorageProps(virStorageSourcePtr src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(location = qemuBlockStorageSourceGetVitastorProps(src)))
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(location = qemuBlockStorageSourceGetSheepdogProps(src)))
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 6f970a3..10b39ca 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -1034,6 +1034,43 @@ qemuBuildNetworkDriveStr(virStorageSourcePtr src,
ret = virBufferContentAndReset(&buf);
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (strchr(src->path, ':')) {
+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+ _("':' not allowed in Vitastor source volume name '%s'"),
+ src->path);
+ return NULL;
+ }
+
+ virBufferStrcat(&buf, "vitastor:image=", src->path, NULL);
+
+ if (src->nhosts > 0) {
+ virBufferAddLit(&buf, ":etcd_host=");
+ for (i = 0; i < src->nhosts; i++) {
+ if (i)
+ virBufferAddLit(&buf, ",");
+
+ /* assume host containing : is ipv6 */
+ if (strchr(src->hosts[i].name, ':'))
+ virBufferEscape(&buf, '\\', ":", "[%s]",
+ src->hosts[i].name);
+ else
+ virBufferAsprintf(&buf, "%s", src->hosts[i].name);
+
+ if (src->hosts[i].port)
+ virBufferAsprintf(&buf, "\\:%u", src->hosts[i].port);
+ }
+ }
+
+ if (src->configFile)
+ virBufferEscape(&buf, '\\', ":", ":config_path=%s", src->configFile);
+
+ if (src->query)
+ virBufferEscape(&buf, '\\', ":", ":etcd_prefix=%s", src->query);
+
+ ret = virBufferContentAndReset(&buf);
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_VXHS:
virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
_("VxHS protocol does not support URI syntax"));
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 0765dc7..4cff344 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -4610,7 +4610,8 @@ qemuDomainValidateStorageSource(virStorageSourcePtr src,
if (src->query &&
(actualType != VIR_STORAGE_TYPE_NETWORK ||
(src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTPS &&
- src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP))) {
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR))) {
virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
_("query is supported only with HTTP(S) protocols"));
return -1;
@@ -9704,6 +9705,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSourcePtr src,
break;
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c
index ee333c3..674aa58 100644
--- a/src/qemu/qemu_snapshot.c
+++ b/src/qemu/qemu_snapshot.c
@@ -403,6 +403,7 @@ qemuSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDefPtr snapdisk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
@@ -493,6 +494,7 @@ qemuSnapshotPrepareDiskExternalActive(virDomainObjPtr vm,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
case VIR_STORAGE_NET_PROTOCOL_HTTP:
@@ -623,6 +625,7 @@ qemuSnapshotPrepareDiskInternal(virDomainDiskDefPtr disk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
index 16bc53a..1e5d820 100644
--- a/src/storage/storage_driver.c
+++ b/src/storage/storage_driver.c
@@ -1645,6 +1645,7 @@ storageVolLookupByPathCallback(virStoragePoolObjPtr obj,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/test/test_driver.c b/src/test/test_driver.c
index 29c4c86..a27ad94 100644
--- a/src/test/test_driver.c
+++ b/src/test/test_driver.c
@@ -7096,6 +7096,7 @@ testStorageVolumeTypeForPool(int pooltype)
case VIR_STORAGE_POOL_ISCSI_DIRECT:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
return VIR_STORAGE_VOL_NETWORK;
case VIR_STORAGE_POOL_LOGICAL:
case VIR_STORAGE_POOL_DISK:
diff --git a/src/util/virstoragefile.c b/src/util/virstoragefile.c
index 0d3c2af..36e3afc 100644
--- a/src/util/virstoragefile.c
+++ b/src/util/virstoragefile.c
@@ -91,6 +91,7 @@ VIR_ENUM_IMPL(virStorageNetProtocol,
"ssh",
"vxhs",
"nfs",
+ "vitastor",
);
VIR_ENUM_IMPL(virStorageNetHostTransport,
@@ -2880,6 +2881,75 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
}
+static int
+virStorageSourceParseVitastorColonString(const char *colonstr,
+ virStorageSourcePtr src)
+{
+ char *p, *e, *next;
+ g_autofree char *options = NULL;
+
+ /* optionally skip the "vitastor:" prefix if provided */
+ if (STRPREFIX(colonstr, "vitastor:"))
+ colonstr += strlen("vitastor:");
+
+ options = g_strdup(colonstr);
+
+ p = options;
+ while (*p) {
+ /* find : delimiter or end of string */
+ for (e = p; *e && *e != ':'; ++e) {
+ if (*e == '\\') {
+ e++;
+ if (*e == '\0')
+ break;
+ }
+ }
+ if (*e == '\0') {
+ next = e; /* last kv pair */
+ } else {
+ next = e + 1;
+ *e = '\0';
+ }
+
+ if (STRPREFIX(p, "image=")) {
+ src->path = g_strdup(p + strlen("image="));
+ } else if (STRPREFIX(p, "etcd_prefix=")) {
+ src->query = g_strdup(p + strlen("etcd_prefix="));
+ } else if (STRPREFIX(p, "config_file=")) {
+ src->configFile = g_strdup(p + strlen("config_file="));
+ } else if (STRPREFIX(p, "etcd_host=")) {
+ char *h, *sep;
+
+ h = p + strlen("etcd_host=");
+ while (h < e) {
+ for (sep = h; sep < e; ++sep) {
+ if (*sep == '\\' && (sep[1] == ',' ||
+ sep[1] == ';' ||
+ sep[1] == ' ')) {
+ *sep = '\0';
+ sep += 2;
+ break;
+ }
+ }
+
+ if (virStorageSourceRBDAddHost(src, h) < 0)
+ return -1;
+
+ h = sep;
+ }
+ }
+
+ p = next;
+ }
+
+ if (!src->path) {
+ return -1;
+ }
+
+ return 0;
+}
+
+
static int
virStorageSourceParseNBDColonString(const char *nbdstr,
virStorageSourcePtr src)
@@ -2992,6 +3062,11 @@ virStorageSourceParseBackingColon(virStorageSourcePtr src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (virStorageSourceParseVitastorColonString(path, src) < 0)
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -3581,6 +3656,54 @@ virStorageSourceParseBackingJSONRBD(virStorageSourcePtr src,
return 0;
}
+static int
+virStorageSourceParseBackingJSONVitastor(virStorageSourcePtr src,
+ virJSONValuePtr json,
+ const char *jsonstr G_GNUC_UNUSED,
+ int opaque G_GNUC_UNUSED)
+{
+ const char *filename;
+ const char *image = virJSONValueObjectGetString(json, "image");
+ const char *conf = virJSONValueObjectGetString(json, "config_path");
+ const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd_prefix");
+ virJSONValuePtr servers = virJSONValueObjectGetArray(json, "server");
+ size_t nservers;
+ size_t i;
+
+ src->type = VIR_STORAGE_TYPE_NETWORK;
+ src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
+
+ /* legacy syntax passed via 'filename' option */
+ if ((filename = virJSONValueObjectGetString(json, "filename")))
+ return virStorageSourceParseVitastorColonString(filename, src);
+
+ if (!image) {
+ virReportError(VIR_ERR_INVALID_ARG, "%s",
+ _("missing image name in Vitastor backing volume "
+ "JSON specification"));
+ return -1;
+ }
+
+ src->path = g_strdup(image);
+ src->configFile = g_strdup(conf);
+ src->query = g_strdup(etcd_prefix);
+
+ if (servers) {
+ nservers = virJSONValueArraySize(servers);
+
+ src->hosts = g_new0(virStorageNetHostDef, nservers);
+ src->nhosts = nservers;
+
+ for (i = 0; i < nservers; i++) {
+ if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
+ virJSONValueArrayGet(servers, i)) < 0)
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
static int
virStorageSourceParseBackingJSONRaw(virStorageSourcePtr src,
virJSONValuePtr json,
@@ -3759,6 +3882,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
{"sheepdog", false, virStorageSourceParseBackingJSONSheepdog, 0},
{"ssh", false, virStorageSourceParseBackingJSONSSH, 0},
{"rbd", false, virStorageSourceParseBackingJSONRBD, 0},
+ {"vitastor", false, virStorageSourceParseBackingJSONVitastor, 0},
{"raw", true, virStorageSourceParseBackingJSONRaw, 0},
{"nfs", false, virStorageSourceParseBackingJSONNFS, 0},
{"vxhs", false, virStorageSourceParseBackingJSONVxHS, 0},
@@ -4503,6 +4627,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
return 24007;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_RBD:
/* we don't provide a default for RBD */
return 0;
diff --git a/src/util/virstoragefile.h b/src/util/virstoragefile.h
index 5689c39..3eb4e3c 100644
--- a/src/util/virstoragefile.h
+++ b/src/util/virstoragefile.h
@@ -136,6 +136,7 @@ typedef enum {
VIR_STORAGE_NET_PROTOCOL_SSH,
VIR_STORAGE_NET_PROTOCOL_VXHS,
VIR_STORAGE_NET_PROTOCOL_NFS,
+ VIR_STORAGE_NET_PROTOCOL_VITASTOR,
VIR_STORAGE_NET_PROTOCOL_LAST
} virStorageNetProtocol;
diff --git a/tests/storagepoolcapsschemadata/poolcaps-fs.xml b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
index eee75af..8bd0a57 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-fs.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='no'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolcapsschemadata/poolcaps-full.xml b/tests/storagepoolcapsschemadata/poolcaps-full.xml
index 805950a..852df0d 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-full.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-full.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='yes'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolxml2argvtest.c b/tests/storagepoolxml2argvtest.c
index 967d1f2..1e8ff7a 100644
--- a/tests/storagepoolxml2argvtest.c
+++ b/tests/storagepoolxml2argvtest.c
@@ -68,6 +68,7 @@ testCompareXMLToArgvFiles(bool shouldFail,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_VSTORAGE:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_LAST:
default:
VIR_TEST_DEBUG("pool type '%s' has no xml2argv test", defTypeStr);
diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
index 7835fa6..8841fcf 100644
--- a/tools/virsh-pool.c
+++ b/tools/virsh-pool.c
@@ -1237,6 +1237,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd G_GNUC_UNUSED)
case VIR_STORAGE_POOL_VSTORAGE:
flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
+ flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
+ break;
case VIR_STORAGE_POOL_LAST:
break;
}
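For reference, the file-layer properties assembled by qemuBlockStorageSourceGetVitastorProps() above end up as a JSON object of roughly the following shape once the caller adds the "vitastor" driver name; all values below are illustrative, not taken from the patch:

    {
      "driver": "vitastor",
      "image": "testimg",
      "etcd_host": "192.168.7.2:2379",
      "etcd_prefix": "/vitastor",
      "config_path": "/etc/vitastor/vitastor.conf"
    }

etcd_host is the comma-separated host:port list built from the configured hosts; etcd_host, etcd_prefix and config_path are simply omitted from the object when the corresponding source attributes are absent.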


@@ -0,0 +1,661 @@
commit c6e1958a1b4974828e8e5852beb252ce6594e670
Author: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Mon Jun 28 01:20:19 2021 +0300
Add Vitastor support
diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index 5ea14b6..a9df168 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -1859,6 +1859,35 @@
</element>
</define>
+ <define name="diskSourceNetworkProtocolVitastor">
+ <element name="source">
+ <interleave>
+ <attribute name="protocol">
+ <value>vitastor</value>
+ </attribute>
+ <ref name="diskSourceCommon"/>
+ <optional>
+ <attribute name="name"/>
+ </optional>
+ <optional>
+ <attribute name="query"/>
+ </optional>
+ <zeroOrMore>
+ <ref name="diskSourceNetworkHost"/>
+ </zeroOrMore>
+ <optional>
+ <element name="config">
+ <attribute name="file">
+ <ref name="absFilePath"/>
+ </attribute>
+ <empty/>
+ </element>
+ </optional>
+ <empty/>
+ </interleave>
+ </element>
+ </define>
+
<define name="diskSourceNetworkProtocolISCSI">
<element name="source">
<attribute name="protocol">
@@ -2115,6 +2144,7 @@
<ref name="diskSourceNetworkProtocolSimple"/>
<ref name="diskSourceNetworkProtocolVxHS"/>
<ref name="diskSourceNetworkProtocolNFS"/>
+ <ref name="diskSourceNetworkProtocolVitastor"/>
</choice>
</define>
diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
index 089e1e0..d7e7ef4 100644
--- a/include/libvirt/libvirt-storage.h
+++ b/include/libvirt/libvirt-storage.h
@@ -245,6 +245,7 @@ typedef enum {
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS = 1 << 17,
VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE = 1 << 18,
VIR_CONNECT_LIST_STORAGE_POOLS_ISCSI_DIRECT = 1 << 19,
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR = 1 << 20,
} virConnectListAllStoragePoolsFlags;
int virConnectListAllStoragePools(virConnectPtr conn,
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index d78f846..f7222e3 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -8251,7 +8251,8 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
src->configFile = virXPathString("string(./config/@file)", ctxt);
if (src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTP ||
- src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS)
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS ||
+ src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR)
src->query = virXMLPropString(node, "query");
if (virDomainStorageNetworkParseHosts(node, ctxt, &src->hosts, &src->nhosts) < 0)
@@ -30775,6 +30776,7 @@ virDomainStorageSourceTranslateSourcePool(virStorageSource *src,
case VIR_STORAGE_POOL_MPATH:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
index 2aa9a3d..166ca1f 100644
--- a/src/conf/storage_conf.c
+++ b/src/conf/storage_conf.c
@@ -60,7 +60,7 @@ VIR_ENUM_IMPL(virStoragePool,
"logical", "disk", "iscsi",
"iscsi-direct", "scsi", "mpath",
"rbd", "sheepdog", "gluster",
- "zfs", "vstorage",
+ "zfs", "vstorage", "vitastor",
);
VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
@@ -246,6 +246,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
.formatToString = virStorageFileFormatTypeToString,
}
},
+ {.poolType = VIR_STORAGE_POOL_VITASTOR,
+ .poolOptions = {
+ .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
+ VIR_STORAGE_POOL_SOURCE_NETWORK |
+ VIR_STORAGE_POOL_SOURCE_NAME),
+ },
+ .volOptions = {
+ .defaultFormat = VIR_STORAGE_FILE_RAW,
+ .formatFromString = virStorageVolumeFormatFromString,
+ .formatToString = virStorageFileFormatTypeToString,
+ }
+ },
{.poolType = VIR_STORAGE_POOL_SHEEPDOG,
.poolOptions = {
.flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -546,6 +558,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
_("element 'name' is mandatory for RBD pool"));
return -1;
}
+ if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
+ virReportError(VIR_ERR_XML_ERROR, "%s",
+ _("element 'name' is mandatory for Vitastor pool"));
+ return -1;
+ }
if (options->formatFromString) {
g_autofree char *format = NULL;
@@ -1182,6 +1199,7 @@ virStoragePoolDefFormatBuf(virBuffer *buf,
/* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
* files, so they don't have a target */
if (def->type != VIR_STORAGE_POOL_RBD &&
+ def->type != VIR_STORAGE_POOL_VITASTOR &&
def->type != VIR_STORAGE_POOL_SHEEPDOG &&
def->type != VIR_STORAGE_POOL_GLUSTER &&
def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
index 76efaac..928149a 100644
--- a/src/conf/storage_conf.h
+++ b/src/conf/storage_conf.h
@@ -106,6 +106,7 @@ typedef enum {
VIR_STORAGE_POOL_GLUSTER, /* Gluster device */
VIR_STORAGE_POOL_ZFS, /* ZFS */
VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
+ VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
VIR_STORAGE_POOL_LAST,
} virStoragePoolType;
@@ -465,6 +466,7 @@ VIR_ENUM_DECL(virStoragePartedFs);
VIR_CONNECT_LIST_STORAGE_POOLS_SCSI | \
VIR_CONNECT_LIST_STORAGE_POOLS_MPATH | \
VIR_CONNECT_LIST_STORAGE_POOLS_RBD | \
+ VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER | \
VIR_CONNECT_LIST_STORAGE_POOLS_ZFS | \
diff --git a/src/conf/storage_source_conf.c b/src/conf/storage_source_conf.c
index 5ca06fa..05ded49 100644
--- a/src/conf/storage_source_conf.c
+++ b/src/conf/storage_source_conf.c
@@ -85,6 +85,7 @@ VIR_ENUM_IMPL(virStorageNetProtocol,
"ssh",
"vxhs",
"nfs",
+ "vitastor",
);
@@ -1262,6 +1263,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
return 24007;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_RBD:
/* we don't provide a default for RBD */
return 0;
diff --git a/src/conf/storage_source_conf.h b/src/conf/storage_source_conf.h
index 389c7b5..dbf02e3 100644
--- a/src/conf/storage_source_conf.h
+++ b/src/conf/storage_source_conf.h
@@ -127,6 +127,7 @@ typedef enum {
VIR_STORAGE_NET_PROTOCOL_SSH,
VIR_STORAGE_NET_PROTOCOL_VXHS,
VIR_STORAGE_NET_PROTOCOL_NFS,
+ VIR_STORAGE_NET_PROTOCOL_VITASTOR,
VIR_STORAGE_NET_PROTOCOL_LAST
} virStorageNetProtocol;
diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
index 24957d6..4520a73 100644
--- a/src/conf/virstorageobj.c
+++ b/src/conf/virstorageobj.c
@@ -1487,6 +1487,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
return 1;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_RBD:
case VIR_STORAGE_POOL_LAST:
break;
@@ -1986,6 +1987,8 @@ virStoragePoolObjMatch(virStoragePoolObj *obj,
(obj->def->type == VIR_STORAGE_POOL_MPATH)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
(obj->def->type == VIR_STORAGE_POOL_RBD)) ||
+ (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
+ (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
(obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
(MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
index 2a7cdca..f756be1 100644
--- a/src/libvirt-storage.c
+++ b/src/libvirt-storage.c
@@ -92,6 +92,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
* VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
* VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
* VIR_CONNECT_LIST_STORAGE_POOLS_RBD
+ * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
* VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
* VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER
* VIR_CONNECT_LIST_STORAGE_POOLS_ZFS
diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
index 56cb9ab..dfb31b9 100644
--- a/src/libxl/libxl_conf.c
+++ b/src/libxl/libxl_conf.c
@@ -972,6 +972,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSource *src,
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/libxl/xen_xl.c b/src/libxl/xen_xl.c
index c0905b0..c172378 100644
--- a/src/libxl/xen_xl.c
+++ b/src/libxl/xen_xl.c
@@ -1540,6 +1540,7 @@ xenFormatXLDiskSrcNet(virStorageSource *src)
case VIR_STORAGE_NET_PROTOCOL_SSH:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
virReportError(VIR_ERR_NO_SUPPORT,
diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
index 6627d04..c33f428 100644
--- a/src/qemu/qemu_block.c
+++ b/src/qemu/qemu_block.c
@@ -928,6 +928,38 @@ qemuBlockStorageSourceGetRBDProps(virStorageSource *src,
}
+static virJSONValue *
+qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
+{
+ virJSONValuePtr ret = NULL;
+ virStorageNetHostDefPtr host;
+ size_t i;
+ g_auto(virBuffer) buf = VIR_BUFFER_INITIALIZER;
+ g_autofree char *etcd = NULL;
+
+ for (i = 0; i < src->nhosts; i++) {
+ host = src->hosts + i;
+ if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
+ return NULL;
+ }
+ virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
+ }
+ if (src->nhosts > 0) {
+ etcd = virBufferContentAndReset(&buf);
+ }
+
+ if (virJSONValueObjectCreate(&ret,
+ "S:etcd_host", etcd,
+ "S:etcd_prefix", src->query,
+ "S:config_path", src->configFile,
+ "s:image", src->path,
+ NULL) < 0)
+ return NULL;
+
+ return ret;
+}
+
+
static virJSONValue *
qemuBlockStorageSourceGetSheepdogProps(virStorageSource *src)
{
@@ -1218,6 +1250,12 @@ qemuBlockStorageSourceGetBackendProps(virStorageSource *src,
return NULL;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
+ return NULL;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
@@ -2231,6 +2269,7 @@ qemuBlockGetBackingStoreString(virStorageSource *src,
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_VXHS:
case VIR_STORAGE_NET_PROTOCOL_NFS:
case VIR_STORAGE_NET_PROTOCOL_SSH:
@@ -2608,6 +2647,12 @@ qemuBlockStorageSourceCreateGetStorageProps(virStorageSource *src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ driver = "vitastor";
+ if (!(location = qemuBlockStorageSourceGetVitastorProps(src)))
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
driver = "sheepdog";
if (!(location = qemuBlockStorageSourceGetSheepdogProps(src)))
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index ea51369..8258632 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -1074,6 +1074,43 @@ qemuBuildNetworkDriveStr(virStorageSource *src,
ret = virBufferContentAndReset(&buf);
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (strchr(src->path, ':')) {
+ virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+ _("':' not allowed in Vitastor source volume name '%s'"),
+ src->path);
+ return NULL;
+ }
+
+ virBufferStrcat(&buf, "vitastor:image=", src->path, NULL);
+
+ if (src->nhosts > 0) {
+ virBufferAddLit(&buf, ":etcd_host=");
+ for (i = 0; i < src->nhosts; i++) {
+ if (i)
+ virBufferAddLit(&buf, ",");
+
+ /* assume host containing : is ipv6 */
+ if (strchr(src->hosts[i].name, ':'))
+ virBufferEscape(&buf, '\\', ":", "[%s]",
+ src->hosts[i].name);
+ else
+ virBufferAsprintf(&buf, "%s", src->hosts[i].name);
+
+ if (src->hosts[i].port)
+ virBufferAsprintf(&buf, "\\:%u", src->hosts[i].port);
+ }
+ }
+
+ if (src->configFile)
+ virBufferEscape(&buf, '\\', ":", ":config_path=%s", src->configFile);
+
+ if (src->query)
+ virBufferEscape(&buf, '\\', ":", ":etcd_prefix=%s", src->query);
+
+ ret = virBufferContentAndReset(&buf);
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_VXHS:
virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
_("VxHS protocol does not support URI syntax"));
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index fc60e15..5ab410d 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -4829,7 +4829,8 @@ qemuDomainValidateStorageSource(virStorageSource *src,
if (src->query &&
(actualType != VIR_STORAGE_TYPE_NETWORK ||
(src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTPS &&
- src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP))) {
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP &&
+ src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR))) {
virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
_("query is supported only with HTTP(S) protocols"));
return -1;
@@ -10027,6 +10028,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSource *src,
break;
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c
index 4e74ddd..14e5f2e 100644
--- a/src/qemu/qemu_snapshot.c
+++ b/src/qemu/qemu_snapshot.c
@@ -402,6 +402,7 @@ qemuSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDef *snapdisk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
@@ -494,6 +495,7 @@ qemuSnapshotPrepareDiskExternalActive(virDomainObj *vm,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
case VIR_STORAGE_NET_PROTOCOL_HTTP:
@@ -647,6 +649,7 @@ qemuSnapshotPrepareDiskInternal(virDomainDiskDef *disk,
case VIR_STORAGE_NET_PROTOCOL_NONE:
case VIR_STORAGE_NET_PROTOCOL_NBD:
case VIR_STORAGE_NET_PROTOCOL_RBD:
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
case VIR_STORAGE_NET_PROTOCOL_ISCSI:
diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
index c2ff4b8..70d0689 100644
--- a/src/storage/storage_driver.c
+++ b/src/storage/storage_driver.c
@@ -1644,6 +1644,7 @@ storageVolLookupByPathCallback(virStoragePoolObj *obj,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_SHEEPDOG:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_LAST:
diff --git a/src/storage_file/storage_source_backingstore.c b/src/storage_file/storage_source_backingstore.c
index e48ae72..d7a9b72 100644
--- a/src/storage_file/storage_source_backingstore.c
+++ b/src/storage_file/storage_source_backingstore.c
@@ -284,6 +284,75 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
}
+static int
+virStorageSourceParseVitastorColonString(const char *colonstr,
+ virStorageSource *src)
+{
+ char *p, *e, *next;
+ g_autofree char *options = NULL;
+
+ /* optionally skip the "vitastor:" prefix if provided */
+ if (STRPREFIX(colonstr, "vitastor:"))
+ colonstr += strlen("vitastor:");
+
+ options = g_strdup(colonstr);
+
+ p = options;
+ while (*p) {
+ /* find : delimiter or end of string */
+ for (e = p; *e && *e != ':'; ++e) {
+ if (*e == '\\') {
+ e++;
+ if (*e == '\0')
+ break;
+ }
+ }
+ if (*e == '\0') {
+ next = e; /* last kv pair */
+ } else {
+ next = e + 1;
+ *e = '\0';
+ }
+
+ if (STRPREFIX(p, "image=")) {
+ src->path = g_strdup(p + strlen("image="));
+ } else if (STRPREFIX(p, "etcd_prefix=")) {
+ src->query = g_strdup(p + strlen("etcd_prefix="));
+ } else if (STRPREFIX(p, "config_file=")) {
+ src->configFile = g_strdup(p + strlen("config_file="));
+ } else if (STRPREFIX(p, "etcd_host=")) {
+ char *h, *sep;
+
+ h = p + strlen("etcd_host=");
+ while (h < e) {
+ for (sep = h; sep < e; ++sep) {
+ if (*sep == '\\' && (sep[1] == ',' ||
+ sep[1] == ';' ||
+ sep[1] == ' ')) {
+ *sep = '\0';
+ sep += 2;
+ break;
+ }
+ }
+
+ if (virStorageSourceRBDAddHost(src, h) < 0)
+ return -1;
+
+ h = sep;
+ }
+ }
+
+ p = next;
+ }
+
+ if (!src->path) {
+ return -1;
+ }
+
+ return 0;
+}
+
+
static int
virStorageSourceParseNBDColonString(const char *nbdstr,
virStorageSource *src)
@@ -396,6 +465,11 @@ virStorageSourceParseBackingColon(virStorageSource *src,
return -1;
break;
+ case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
+ if (virStorageSourceParseVitastorColonString(path, src) < 0)
+ return -1;
+ break;
+
case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
case VIR_STORAGE_NET_PROTOCOL_LAST:
case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -984,6 +1058,54 @@ virStorageSourceParseBackingJSONRBD(virStorageSource *src,
return 0;
}
+static int
+virStorageSourceParseBackingJSONVitastor(virStorageSource *src,
+ virJSONValue *json,
+ const char *jsonstr G_GNUC_UNUSED,
+ int opaque G_GNUC_UNUSED)
+{
+ const char *filename;
+ const char *image = virJSONValueObjectGetString(json, "image");
+ const char *conf = virJSONValueObjectGetString(json, "config_path");
+ const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd_prefix");
+ virJSONValue *servers = virJSONValueObjectGetArray(json, "server");
+ size_t nservers;
+ size_t i;
+
+ src->type = VIR_STORAGE_TYPE_NETWORK;
+ src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
+
+ /* legacy syntax passed via 'filename' option */
+ if ((filename = virJSONValueObjectGetString(json, "filename")))
+ return virStorageSourceParseVitastorColonString(filename, src);
+
+ if (!image) {
+ virReportError(VIR_ERR_INVALID_ARG, "%s",
+ _("missing image name in Vitastor backing volume "
+ "JSON specification"));
+ return -1;
+ }
+
+ src->path = g_strdup(image);
+ src->configFile = g_strdup(conf);
+ src->query = g_strdup(etcd_prefix);
+
+ if (servers) {
+ nservers = virJSONValueArraySize(servers);
+
+ src->hosts = g_new0(virStorageNetHostDef, nservers);
+ src->nhosts = nservers;
+
+ for (i = 0; i < nservers; i++) {
+ if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
+ virJSONValueArrayGet(servers, i)) < 0)
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
static int
virStorageSourceParseBackingJSONRaw(virStorageSource *src,
virJSONValue *json,
@@ -1162,6 +1284,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
{"sheepdog", false, virStorageSourceParseBackingJSONSheepdog, 0},
{"ssh", false, virStorageSourceParseBackingJSONSSH, 0},
{"rbd", false, virStorageSourceParseBackingJSONRBD, 0},
+ {"vitastor", false, virStorageSourceParseBackingJSONVitastor, 0},
{"raw", true, virStorageSourceParseBackingJSONRaw, 0},
{"nfs", false, virStorageSourceParseBackingJSONNFS, 0},
{"vxhs", false, virStorageSourceParseBackingJSONVxHS, 0},
diff --git a/src/test/test_driver.c b/src/test/test_driver.c
index ef0ddab..2173dc3 100644
--- a/src/test/test_driver.c
+++ b/src/test/test_driver.c
@@ -7131,6 +7131,7 @@ testStorageVolumeTypeForPool(int pooltype)
case VIR_STORAGE_POOL_ISCSI_DIRECT:
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_RBD:
+ case VIR_STORAGE_POOL_VITASTOR:
return VIR_STORAGE_VOL_NETWORK;
case VIR_STORAGE_POOL_LOGICAL:
case VIR_STORAGE_POOL_DISK:
diff --git a/tests/storagepoolcapsschemadata/poolcaps-fs.xml b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
index eee75af..8bd0a57 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-fs.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='no'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolcapsschemadata/poolcaps-full.xml b/tests/storagepoolcapsschemadata/poolcaps-full.xml
index 805950a..852df0d 100644
--- a/tests/storagepoolcapsschemadata/poolcaps-full.xml
+++ b/tests/storagepoolcapsschemadata/poolcaps-full.xml
@@ -204,4 +204,11 @@
</enum>
</volOptions>
</pool>
+ <pool type='vitastor' supported='yes'>
+ <volOptions>
+ <defaultFormat type='raw'/>
+ <enum name='targetFormatType'>
+ </enum>
+ </volOptions>
+ </pool>
</storagepoolCapabilities>
diff --git a/tests/storagepoolxml2argvtest.c b/tests/storagepoolxml2argvtest.c
index 449b745..7f95cc8 100644
--- a/tests/storagepoolxml2argvtest.c
+++ b/tests/storagepoolxml2argvtest.c
@@ -68,6 +68,7 @@ testCompareXMLToArgvFiles(bool shouldFail,
case VIR_STORAGE_POOL_GLUSTER:
case VIR_STORAGE_POOL_ZFS:
case VIR_STORAGE_POOL_VSTORAGE:
+ case VIR_STORAGE_POOL_VITASTOR:
case VIR_STORAGE_POOL_LAST:
default:
VIR_TEST_DEBUG("pool type '%s' has no xml2argv test", defTypeStr);
diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
index 18f3839..c8e1436 100644
--- a/tools/virsh-pool.c
+++ b/tools/virsh-pool.c
@@ -1231,6 +1231,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd G_GNUC_UNUSED)
case VIR_STORAGE_POOL_VSTORAGE:
flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
break;
+ case VIR_STORAGE_POOL_VITASTOR:
+ flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
+ break;
case VIR_STORAGE_POOL_LAST:
break;
}


@@ -0,0 +1,32 @@
<!-- Example libvirt VM configuration with Vitastor disk -->
<domain type='kvm'>
<name>debian9</name>
<uuid>96f277fb-fd9c-49da-bf21-a5cfd54eb162</uuid>
<memory unit="KiB">524288</memory>
<currentMemory>524288</currentMemory>
<vcpu>1</vcpu>
<os>
<type arch='x86_64'>hvm</type>
<boot dev='hd' />
</os>
<devices>
<emulator>/usr/bin/qemu-system-x86_64</emulator>
<disk type='network' device='disk'>
<target dev='vda' bus='virtio' />
<driver name='qemu' type='raw' />
<!-- name is Vitastor image name -->
<!-- config (optional) is the path to Vitastor's configuration file -->
<!-- query (optional) is Vitastor's etcd_prefix -->
<source protocol='vitastor' name='debian9' query='/vitastor' config='/etc/vitastor/vitastor.conf'>
<!-- hosts = etcd addresses -->
<host name='192.168.7.2' port='2379' />
</source>
<!-- required because Vitastor only supports 4k physical sectors -->
<blockio physical_block_size="4096" logical_block_size="512" />
</disk>
<interface type='network'>
<source network='default' />
</interface>
<graphics type='vnc' port='-1' />
</devices>
</domain>
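With the values from this example, the qemuBuildNetworkDriveStr() code from the libvirt patches above would emit approximately the following legacy drive string (shown only to illustrate how the XML attributes map onto Vitastor options; the exact string depends on the libvirt version in use):

    vitastor:image=debian9:etcd_host=192.168.7.2\:2379:config_path=/etc/vitastor/vitastor.conf:etcd_prefix=/vitastor

The host port is escaped as \: because ':' separates options in this syntax; an image name containing ':' is rejected outright.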

patches/nova-20.diff

@@ -0,0 +1,287 @@
diff --git a/nova/virt/image/model.py b/nova/virt/image/model.py
index 971f7e9c07..70ed70d5e2 100644
--- a/nova/virt/image/model.py
+++ b/nova/virt/image/model.py
@@ -129,3 +129,22 @@ class RBDImage(Image):
self.user = user
self.password = password
self.servers = servers
+
+
+class VitastorImage(Image):
+ """Class for images in a remote Vitastor cluster"""
+
+ def __init__(self, name, etcd_address = None, etcd_prefix = None, config_path = None):
+ """Create a new Vitastor image object
+
+ :param name: name of the image
+ :param etcd_address: etcd URL(s) (optional)
+ :param etcd_prefix: etcd prefix (optional)
+ :param config_path: path to the configuration (optional)
+ """
+ super(VitastorImage, self).__init__(FORMAT_RAW)
+
+ self.name = name
+ self.etcd_address = etcd_address
+ self.etcd_prefix = etcd_prefix
+ self.config_path = config_path
diff --git a/nova/virt/images.py b/nova/virt/images.py
index 5358f3766a..ebe3d6effb 100644
--- a/nova/virt/images.py
+++ b/nova/virt/images.py
@@ -41,7 +41,7 @@ IMAGE_API = glance.API()
def qemu_img_info(path, format=None):
"""Return an object containing the parsed output from qemu-img info."""
- if not os.path.exists(path) and not path.startswith('rbd:'):
+ if not os.path.exists(path) and not path.startswith('rbd:') and not path.startswith('vitastor:'):
raise exception.DiskNotFound(location=path)
info = nova.privsep.qemu.unprivileged_qemu_img_info(path, format=format)
@@ -50,7 +50,7 @@ def qemu_img_info(path, format=None):
def privileged_qemu_img_info(path, format=None, output_format='json'):
"""Return an object containing the parsed output from qemu-img info."""
- if not os.path.exists(path) and not path.startswith('rbd:'):
+ if not os.path.exists(path) and not path.startswith('rbd:') and not path.startswith('vitastor:'):
raise exception.DiskNotFound(location=path)
info = nova.privsep.qemu.privileged_qemu_img_info(path, format=format)
diff --git a/nova/virt/libvirt/config.py b/nova/virt/libvirt/config.py
index f9475776b3..51573fe41d 100644
--- a/nova/virt/libvirt/config.py
+++ b/nova/virt/libvirt/config.py
@@ -1060,6 +1060,8 @@ class LibvirtConfigGuestDisk(LibvirtConfigGuestDevice):
self.driver_iommu = False
self.source_path = None
self.source_protocol = None
+ self.source_query = None
+ self.source_config = None
self.source_name = None
self.source_hosts = []
self.source_ports = []
@@ -1186,7 +1188,8 @@ class LibvirtConfigGuestDisk(LibvirtConfigGuestDevice):
elif self.source_type == "mount":
dev.append(etree.Element("source", dir=self.source_path))
elif self.source_type == "network" and self.source_protocol:
- source = etree.Element("source", protocol=self.source_protocol)
+ source = etree.Element("source", protocol=self.source_protocol,
+ query=self.source_query, config=self.source_config)
if self.source_name is not None:
source.set('name', self.source_name)
hosts_info = zip(self.source_hosts, self.source_ports)
diff --git a/nova/virt/libvirt/driver.py b/nova/virt/libvirt/driver.py
index 391231c527..34dc60dcdd 100644
--- a/nova/virt/libvirt/driver.py
+++ b/nova/virt/libvirt/driver.py
@@ -179,6 +179,7 @@ VOLUME_DRIVERS = {
'local': 'nova.virt.libvirt.volume.volume.LibvirtVolumeDriver',
'fake': 'nova.virt.libvirt.volume.volume.LibvirtFakeVolumeDriver',
'rbd': 'nova.virt.libvirt.volume.net.LibvirtNetVolumeDriver',
+ 'vitastor': 'nova.virt.libvirt.volume.vitastor.LibvirtVitastorVolumeDriver',
'nfs': 'nova.virt.libvirt.volume.nfs.LibvirtNFSVolumeDriver',
'smbfs': 'nova.virt.libvirt.volume.smbfs.LibvirtSMBFSVolumeDriver',
'fibre_channel': 'nova.virt.libvirt.volume.fibrechannel.LibvirtFibreChannelVolumeDriver', # noqa:E501
@@ -385,10 +386,10 @@ class LibvirtDriver(driver.ComputeDriver):
# This prevents the risk of one test setting a capability
# which bleeds over into other tests.
- # LVM and RBD require raw images. If we are not configured to
+ # LVM, RBD, Vitastor require raw images. If we are not configured to
# force convert images into raw format, then we _require_ raw
# images only.
- raw_only = ('rbd', 'lvm')
+ raw_only = ('rbd', 'lvm', 'vitastor')
requires_raw_image = (CONF.libvirt.images_type in raw_only and
not CONF.force_raw_images)
requires_ploop_image = CONF.libvirt.virt_type == 'parallels'
@@ -775,12 +776,12 @@ class LibvirtDriver(driver.ComputeDriver):
# Some imagebackends are only able to import raw disk images,
# and will fail if given any other format. See the bug
# https://bugs.launchpad.net/nova/+bug/1816686 for more details.
- if CONF.libvirt.images_type in ('rbd',):
+ if CONF.libvirt.images_type in ('rbd', 'vitastor'):
if not CONF.force_raw_images:
msg = _("'[DEFAULT]/force_raw_images = False' is not "
- "allowed with '[libvirt]/images_type = rbd'. "
+ "allowed with '[libvirt]/images_type = rbd' or 'vitastor'. "
"Please check the two configs and if you really "
- "do want to use rbd as images_type, set "
+ "do want to use rbd or vitastor as images_type, set "
"force_raw_images to True.")
raise exception.InvalidConfiguration(msg)
@@ -2603,6 +2604,16 @@ class LibvirtDriver(driver.ComputeDriver):
if connection_info['data'].get('auth_enabled'):
username = connection_info['data']['auth_username']
path = f"rbd:{volume_name}:id={username}"
+ elif connection_info['driver_volume_type'] == 'vitastor':
+ volume_name = connection_info['data']['name']
+ path = 'vitastor:image='+volume_name.replace(':', '\\:')
+ for k in [ 'config_path', 'etcd_address', 'etcd_prefix' ]:
+ if k in connection_info['data']:
+ kk = k
+ if kk == 'etcd_address':
+ # FIXME use etcd_address in qemu driver
+ kk = 'etcd_host'
+ path += ":"+kk+"="+connection_info['data'][k].replace(':', '\\:')
else:
path = 'unknown'
raise exception.DiskNotFound(location='unknown')
@@ -2827,8 +2838,8 @@ class LibvirtDriver(driver.ComputeDriver):
image_format = CONF.libvirt.snapshot_image_format or source_type
- # NOTE(bfilippov): save lvm and rbd as raw
- if image_format == 'lvm' or image_format == 'rbd':
+ # NOTE(bfilippov): save lvm, rbd and vitastor as raw
+ if image_format == 'lvm' or image_format == 'rbd' or image_format == 'vitastor':
image_format = 'raw'
metadata = self._create_snapshot_metadata(instance.image_meta,
@@ -2899,7 +2910,7 @@ class LibvirtDriver(driver.ComputeDriver):
expected_state=task_states.IMAGE_UPLOADING)
# TODO(nic): possibly abstract this out to the root_disk
- if source_type == 'rbd' and live_snapshot:
+ if (source_type == 'rbd' or source_type == 'vitastor') and live_snapshot:
# Standard snapshot uses qemu-img convert from RBD which is
# not safe to run with live_snapshot.
live_snapshot = False
@@ -4099,7 +4110,7 @@ class LibvirtDriver(driver.ComputeDriver):
# cleanup rescue volume
lvm.remove_volumes([lvmdisk for lvmdisk in self._lvm_disks(instance)
if lvmdisk.endswith('.rescue')])
- if CONF.libvirt.images_type == 'rbd':
+ if CONF.libvirt.images_type == 'rbd' or CONF.libvirt.images_type == 'vitastor':
filter_fn = lambda disk: (disk.startswith(instance.uuid) and
disk.endswith('.rescue'))
rbd_utils.RBDDriver().cleanup_volumes(filter_fn)
@@ -4356,6 +4367,8 @@ class LibvirtDriver(driver.ComputeDriver):
# TODO(mikal): there is a bug here if images_type has
# changed since creation of the instance, but I am pretty
# sure that this bug already exists.
+ if CONF.libvirt.images_type == 'vitastor':
+ return 'vitastor'
return 'rbd' if CONF.libvirt.images_type == 'rbd' else 'raw'
@staticmethod
@@ -4764,10 +4777,10 @@ class LibvirtDriver(driver.ComputeDriver):
finally:
# NOTE(mikal): if the config drive was imported into RBD,
# then we no longer need the local copy
- if CONF.libvirt.images_type == 'rbd':
+ if CONF.libvirt.images_type == 'rbd' or CONF.libvirt.images_type == 'vitastor':
LOG.info('Deleting local config drive %(path)s '
- 'because it was imported into RBD.',
- {'path': config_disk_local_path},
+ 'because it was imported into %(type)s.',
+ {'path': config_disk_local_path, 'type': CONF.libvirt.images_type},
instance=instance)
os.unlink(config_disk_local_path)
diff --git a/nova/virt/libvirt/utils.py b/nova/virt/libvirt/utils.py
index da2a6e8b8a..52c02e72f1 100644
--- a/nova/virt/libvirt/utils.py
+++ b/nova/virt/libvirt/utils.py
@@ -340,6 +340,10 @@ def find_disk(guest: libvirt_guest.Guest) -> ty.Tuple[str, ty.Optional[str]]:
disk_path = disk.source_name
if disk_path:
disk_path = 'rbd:' + disk_path
+ elif not disk_path and disk.source_protocol == 'vitastor':
+ disk_path = disk.source_name
+ if disk_path:
+ disk_path = 'vitastor:' + disk_path
if not disk_path:
raise RuntimeError(_("Can't retrieve root device path "
@@ -354,6 +358,8 @@ def get_disk_type_from_path(path: str) -> ty.Optional[str]:
return 'lvm'
elif path.startswith('rbd:'):
return 'rbd'
+ elif path.startswith('vitastor:'):
+ return 'vitastor'
elif (os.path.isdir(path) and
os.path.exists(os.path.join(path, "DiskDescriptor.xml"))):
return 'ploop'
diff --git a/nova/virt/libvirt/volume/vitastor.py b/nova/virt/libvirt/volume/vitastor.py
new file mode 100644
index 0000000000..0256df62c1
--- /dev/null
+++ b/nova/virt/libvirt/volume/vitastor.py
@@ -0,0 +1,75 @@
+# Copyright (c) 2021+, Vitaliy Filippov <vitalif@yourcmc.ru>
+#
+# Licensed under the Apache License, Version 2.0 (the "License"); you may
+# not use this file except in compliance with the License. You may obtain
+# a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+# License for the specific language governing permissions and limitations
+# under the License.
+
+from os_brick import exception as os_brick_exception
+from os_brick import initiator
+from os_brick.initiator import connector
+from oslo_log import log as logging
+
+import nova.conf
+from nova import utils
+from nova.virt.libvirt.volume import volume as libvirt_volume
+
+
+CONF = nova.conf.CONF
+LOG = logging.getLogger(__name__)
+
+
+class LibvirtVitastorVolumeDriver(libvirt_volume.LibvirtBaseVolumeDriver):
+ """Driver to attach Vitastor volumes to libvirt."""
+ def __init__(self, host):
+ super(LibvirtVitastorVolumeDriver, self).__init__(host, is_block_dev=False)
+
+ def connect_volume(self, connection_info, instance):
+ pass
+
+ def disconnect_volume(self, connection_info, instance):
+ pass
+
+ def get_config(self, connection_info, disk_info):
+ """Returns xml for libvirt."""
+ conf = super(LibvirtVitastorVolumeDriver, self).get_config(connection_info, disk_info)
+ conf.source_type = 'network'
+ conf.source_protocol = 'vitastor'
+ conf.source_name = connection_info['data'].get('name')
+ conf.source_query = connection_info['data'].get('etcd_prefix') or None
+ conf.source_config = connection_info['data'].get('config_path') or None
+ conf.source_hosts = []
+ conf.source_ports = []
+ addresses = connection_info['data'].get('etcd_address', '')
+ if addresses:
+ if not isinstance(addresses, list):
+ addresses = addresses.split(',')
+ for addr in addresses:
+ if addr.startswith('https://'):
+ raise NotImplementedError('Vitastor block driver does not support SSL for etcd communication yet')
+ if addr.startswith('http://'):
+ addr = addr[7:]
+ addr = addr.rstrip('/')
+ if addr.endswith('/v3'):
+ addr = addr[0:-3]
+ p = addr.find('/')
+ if p > 0:
+ raise NotImplementedError('libvirt does not support custom URL paths for Vitastor etcd yet. Use /etc/vitastor/vitastor.conf')
+ p = addr.find(':')
+ port = '2379'
+ if p > 0:
+ port = addr[p+1:]
+ addr = addr[0:p]
+ conf.source_hosts.append(addr)
+ conf.source_ports.append(port)
+ return conf
+
+ def extend_volume(self, connection_info, instance, requested_size):
+ raise NotImplementedError
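A minimal sketch of the connection_info['data'] dictionary this driver consumes, with made-up values (the real dictionary comes from the Cinder Vitastor driver and may carry additional keys):

    {
        'name': 'volume-0b2c3d4e-5f60-7182-93a4-b5c6d7e8f901',
        'etcd_address': 'http://192.168.7.2:2379/v3',
        'etcd_prefix': '/vitastor',
        'config_path': '/etc/vitastor/vitastor.conf',
    }

get_config() above strips the http:// prefix and the /v3 suffix from etcd_address and splits host and port, so this example would yield source host 192.168.7.2 with port 2379 in the generated libvirt XML.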


@@ -11,7 +11,7 @@ Index: qemu-3.1+dfsg/qapi/block-core.json
'host_cdrom', 'host_device', 'http', 'https', 'iscsi', 'luks',
'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels', 'qcow',
'qcow2', 'qed', 'quorum', 'raw', 'rbd', 'replication', 'sheepdog',
@@ -3367,6 +3367,24 @@
@@ -3367,6 +3367,28 @@
'*tag': 'str' } }
##
@@ -19,17 +19,21 @@ Index: qemu-3.1+dfsg/qapi/block-core.json
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @etcd_host: etcd connection address
+# @config_path: Path to Vitastor configuration
+# @etcd_host: etcd connection address(es)
+# @etcd_prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { 'inode': 'uint64',
+ 'pool': 'uint64',
+ 'size': 'uint64',
+ 'etcd_host': 'str',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config_path': 'str',
+ '*etcd_host': 'str',
+ '*etcd_prefix': 'str' } }
+
+##
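Assuming the patched QEMU registers the driver under the name declared in block-core.json above, a block device using these options could be opened with something along these lines (a sketch, not a command taken from the patch; option spelling may differ between QEMU versions):

    qemu-system-x86_64 ... \
        -blockdev driver=vitastor,node-name=disk0,image=testimg,etcd_host=192.168.7.2:2379,etcd_prefix=/vitastor \
        -device virtio-blk-pci,drive=disk0

Because every field in BlockdevOptionsVitastor becomes optional after this change, an image can also be addressed by pool, inode and size instead of by name.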


@@ -11,7 +11,7 @@ Index: qemu/qapi/block-core.json
'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
##
@@ -3725,6 +3725,24 @@
@@ -3725,6 +3725,28 @@
'*tag': 'str' } }
##
@@ -19,17 +19,21 @@ Index: qemu/qapi/block-core.json
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @etcd_host: etcd connection address
+# @config_path: Path to Vitastor configuration
+# @etcd_host: etcd connection address(es)
+# @etcd_prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { 'inode': 'uint64',
+ 'pool': 'uint64',
+ 'size': 'uint64',
+ 'etcd_host': 'str',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config_path': 'str',
+ '*etcd_host': 'str',
+ '*etcd_prefix': 'str' } }
+
+##


@@ -11,7 +11,7 @@ Index: qemu/qapi/block-core.json
'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', 'vxhs' ] }
##
@@ -3635,6 +3635,24 @@
@@ -3635,6 +3635,28 @@
'*tag': 'str' } }
##
@@ -19,17 +19,21 @@ Index: qemu/qapi/block-core.json
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @etcd_host: etcd connection address
+# @config_path: Path to Vitastor configuration
+# @etcd_host: etcd connection address(es)
+# @etcd_prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { 'inode': 'uint64',
+ 'pool': 'uint64',
+ 'size': 'uint64',
+ 'etcd_host': 'str',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config_path': 'str',
+ '*etcd_host': 'str',
+ '*etcd_prefix': 'str' } }
+
+##


@@ -11,7 +11,7 @@ Index: qemu-5.1+dfsg/qapi/block-core.json
'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
##
@@ -3644,6 +3644,24 @@
@@ -3644,6 +3644,28 @@
'*tag': 'str' } }
##
@@ -19,17 +19,21 @@ Index: qemu-5.1+dfsg/qapi/block-core.json
+#
+# Driver specific block device options for vitastor
+#
+# @image: Image name
+# @inode: Inode number
+# @pool: Pool ID
+# @size: Desired image size in bytes
+# @etcd_host: etcd connection address
+# @config_path: Path to Vitastor configuration
+# @etcd_host: etcd connection address(es)
+# @etcd_prefix: etcd key/value prefix
+##
+{ 'struct': 'BlockdevOptionsVitastor',
+ 'data': { 'inode': 'uint64',
+ 'pool': 'uint64',
+ 'size': 'uint64',
+ 'etcd_host': 'str',
+ 'data': { '*inode': 'uint64',
+ '*pool': 'uint64',
+ '*size': 'uint64',
+ '*image': 'str',
+ '*config_path': 'str',
+ '*etcd_host': 'str',
+ '*etcd_prefix': 'str' } }
+
+##


@@ -48,4 +48,4 @@ FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Ve
QEMU=`rpm -qi qemu qemu-kvm | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
perl -i -pe 's/(Requires:\s*qemu(?:-kvm)?)([^\n]+)?/$1 = '$QEMU'/' $VITASTOR/rpm/vitastor-el$EL.spec
tar --transform 's#^#vitastor-0.5.5/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.5.5$(rpm --eval '%dist').tar.gz *
tar --transform 's#^#vitastor-0.6.6/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.6.6$(rpm --eval '%dist').tar.gz *


@@ -11,7 +11,7 @@ RUN rm -rf /var/lib/dnf/*; dnf download --disablerepo='*' --enablerepo='centos-a
RUN rpm --nomd5 -i qemu*.src.rpm
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=PowerTools --spec qemu-kvm.spec
ADD qemu-*-vitastor.patch /root/vitastor/
ADD patches/qemu-*-vitastor.patch /root/vitastor/patches/
RUN set -e; \
mkdir -p /root/packages/qemu-el8; \
@@ -25,7 +25,7 @@ RUN set -e; \
echo "Patch$((PN+1)): qemu-4.2-vitastor.patch" >> qemu-kvm.spec; \
tail -n +2 xx01 >> qemu-kvm.spec; \
perl -i -pe 's/(^Release:\s*\d+)/$1.vitastor/' qemu-kvm.spec; \
cp /root/vitastor/qemu-4.2-vitastor.patch ~/rpmbuild/SOURCES; \
cp /root/vitastor/patches/qemu-4.2-vitastor.patch ~/rpmbuild/SOURCES; \
rpmbuild --nocheck -ba qemu-kvm.spec; \
cp ~/rpmbuild/RPMS/*/*qemu* /root/packages/qemu-el8/; \
cp ~/rpmbuild/SRPMS/*qemu* /root/packages/qemu-el8/


@@ -15,8 +15,9 @@ RUN yumdownloader --disablerepo=centos-sclo-rh --source fio
RUN rpm --nomd5 -i qemu*.src.rpm
RUN rpm --nomd5 -i fio*.src.rpm
RUN rm -f /etc/yum.repos.d/CentOS-Media.repo
RUN cd ~/rpmbuild/SPECS && yum-builddep -y --enablerepo='*' --disablerepo=centos-sclo-rh --disablerepo=centos-sclo-rh-source --disablerepo=centos-sclo-sclo-testing qemu-kvm.spec
RUN cd ~/rpmbuild/SPECS && yum-builddep -y --enablerepo='*' --disablerepo=centos-sclo-rh --disablerepo=centos-sclo-rh-source --disablerepo=centos-sclo-sclo-testing fio.spec
RUN cd ~/rpmbuild/SPECS && yum-builddep -y qemu-kvm.spec
RUN cd ~/rpmbuild/SPECS && yum-builddep -y fio.spec
RUN yum -y install rdma-core-devel
ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root
@@ -37,7 +38,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.5.5.el7.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-0.6.6.el7.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \


@@ -1,11 +1,11 @@
Name: vitastor
Version: 0.5.5
Version: 0.6.6
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.5.5.el7.tar.gz
Source0: vitastor-0.6.6.el7.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
@@ -14,6 +14,7 @@ BuildRequires: rh-nodejs12
BuildRequires: rh-nodejs12-npm
BuildRequires: jerasure-devel
BuildRequires: gf-complete-devel
BuildRequires: libibverbs-devel
BuildRequires: cmake
Requires: fio = 3.7-1.el7
Requires: qemu-kvm = 2.0.0-1.el7.6
@@ -56,13 +57,15 @@ cp -r mon %buildroot/usr/lib/vitastor/mon
%_bindir/vitastor-dump-journal
%_bindir/vitastor-nbd
%_bindir/vitastor-osd
%_bindir/vitastor-cli
%_bindir/vitastor-rm
%_libdir/qemu-kvm/block-vitastor.so
%_libdir/libfio_vitastor.so
%_libdir/libfio_vitastor_blk.so
%_libdir/libfio_vitastor_sec.so
%_libdir/libvitastor_blk.so
%_libdir/libvitastor_client.so
%_libdir/libvitastor_blk.so*
%_libdir/libvitastor_client.so*
%_includedir/vitastor_c.h
/usr/lib/vitastor


@@ -15,6 +15,7 @@ RUN rpm --nomd5 -i qemu*.src.rpm
RUN rpm --nomd5 -i fio*.src.rpm
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=powertools --spec qemu-kvm.spec
RUN cd ~/rpmbuild/SPECS && dnf builddep -y --enablerepo=powertools --spec fio.spec && dnf install -y cmake
RUN yum -y install libibverbs-devel libarchive
ADD https://vitastor.io/rpms/liburing-el7/liburing-0.7-2.el7.src.rpm /root
@@ -35,7 +36,7 @@ ADD . /root/vitastor
RUN set -e; \
cd /root/vitastor/rpm; \
sh build-tarball.sh; \
cp /root/vitastor-0.5.5.el8.tar.gz ~/rpmbuild/SOURCES; \
cp /root/vitastor-0.6.6.el8.tar.gz ~/rpmbuild/SOURCES; \
cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
cd ~/rpmbuild/SPECS/; \
rpmbuild -ba vitastor.spec; \


@@ -1,11 +1,11 @@
Name: vitastor
Version: 0.5.5
Version: 0.6.6
Release: 1%{?dist}
Summary: Vitastor, a fast software-defined clustered block storage
License: Vitastor Network Public License 1.1
URL: https://vitastor.io/
Source0: vitastor-0.5.5.el8.tar.gz
Source0: vitastor-0.6.6.el8.tar.gz
BuildRequires: liburing-devel >= 0.6
BuildRequires: gperftools-devel
@@ -13,6 +13,7 @@ BuildRequires: gcc-toolset-9-gcc-c++
BuildRequires: nodejs >= 10
BuildRequires: jerasure-devel
BuildRequires: gf-complete-devel
BuildRequires: libibverbs-devel
BuildRequires: cmake
Requires: fio = 3.7-3.el8
Requires: qemu-kvm = 4.2.0-29.el8.6
@@ -53,13 +54,15 @@ cp -r mon %buildroot/usr/lib/vitastor
%_bindir/vitastor-dump-journal
%_bindir/vitastor-nbd
%_bindir/vitastor-osd
%_bindir/vitastor-cli
%_bindir/vitastor-rm
%_libdir/qemu-kvm/block-vitastor.so
%_libdir/libfio_vitastor.so
%_libdir/libfio_vitastor_blk.so
%_libdir/libfio_vitastor_sec.so
%_libdir/libvitastor_blk.so
%_libdir/libvitastor_client.so
%_libdir/libvitastor_blk.so*
%_libdir/libvitastor_client.so*
%_includedir/vitastor_c.h
/usr/lib/vitastor


@@ -4,6 +4,8 @@ project(vitastor)
include(GNUInstallDirs)
set(WITH_QEMU true CACHE BOOL "Build QEMU driver")
set(WITH_FIO true CACHE BOOL "Build FIO driver")
set(QEMU_PLUGINDIR qemu CACHE STRING "QEMU plugin directory suffix (qemu-kvm on RHEL)")
set(WITH_ASAN false CACHE BOOL "Build with AddressSanitizer")
if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
@@ -13,8 +15,8 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
endif()
add_definitions(-DVERSION="0.6-dev")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith)
add_definitions(-DVERSION="0.6.6")
add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -I ${CMAKE_SOURCE_DIR}/src)
if (${WITH_ASAN})
add_definitions(-fsanitize=address -fno-omit-frame-pointer)
add_link_options(-fsanitize=address -fno-omit-frame-pointer)
@@ -34,14 +36,26 @@ string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_C_FLAGS_RELEASE "${CMAKE_C_F
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_C_FLAGS_MINSIZEREL "${CMAKE_C_FLAGS_MINSIZEREL}")
string(REGEX REPLACE "([\\/\\-]D) *NDEBUG" "" CMAKE_C_FLAGS_RELWITHDEBINFO "${CMAKE_C_FLAGS_RELWITHDEBINFO}")
macro(install_symlink filepath sympath)
install(CODE "execute_process(COMMAND ${CMAKE_COMMAND} -E create_symlink ${filepath} ${sympath})")
install(CODE "message(\"-- Created symlink: ${sympath} -> ${filepath}\")")
endmacro(install_symlink)
find_package(PkgConfig)
pkg_check_modules(LIBURING REQUIRED liburing)
pkg_check_modules(GLIB REQUIRED glib-2.0)
if (${WITH_QEMU})
pkg_check_modules(GLIB REQUIRED glib-2.0)
endif (${WITH_QEMU})
pkg_check_modules(IBVERBS libibverbs)
if (IBVERBS_LIBRARIES)
add_definitions(-DWITH_RDMA)
endif (IBVERBS_LIBRARIES)
include_directories(
../
/usr/include/jerasure
${LIBURING_INCLUDE_DIRS}
${IBVERBS_INCLUDE_DIRS}
)
# libvitastor_blk.so
@@ -52,55 +66,82 @@ add_library(vitastor_blk SHARED
target_link_libraries(vitastor_blk
${LIBURING_LIBRARIES}
tcmalloc_minimal
# for timerfd_manager
vitastor_common
)
set_target_properties(vitastor_blk PROPERTIES VERSION ${VERSION} SOVERSION 0)
# libfio_vitastor_blk.so
add_library(fio_vitastor_blk SHARED
fio_engine.cpp
../json11/json11.cpp
)
target_link_libraries(fio_vitastor_blk
vitastor_blk
if (${WITH_FIO})
# libfio_vitastor_blk.so
add_library(fio_vitastor_blk SHARED
fio_engine.cpp
../json11/json11.cpp
)
target_link_libraries(fio_vitastor_blk
vitastor_blk
)
endif (${WITH_FIO})
# libvitastor_common.a
set(MSGR_RDMA "")
if (IBVERBS_LIBRARIES)
set(MSGR_RDMA "msgr_rdma.cpp")
endif (IBVERBS_LIBRARIES)
add_library(vitastor_common STATIC
epoll_manager.cpp etcd_state_client.cpp
messenger.cpp msgr_stop.cpp msgr_op.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp
http_client.cpp osd_ops.cpp pg_states.cpp timerfd_manager.cpp base64.cpp ${MSGR_RDMA}
)
target_compile_options(vitastor_common PUBLIC -fPIC)
# vitastor-osd
add_executable(vitastor-osd
osd_main.cpp osd.cpp osd_secondary.cpp msgr_receive.cpp msgr_send.cpp osd_peering.cpp osd_flush.cpp osd_peering_pg.cpp
osd_primary.cpp osd_primary_subops.cpp etcd_state_client.cpp messenger.cpp osd_cluster.cpp http_client.cpp osd_ops.cpp pg_states.cpp
osd_rmw.cpp base64.cpp timerfd_manager.cpp epoll_manager.cpp ../json11/json11.cpp
osd_main.cpp osd.cpp osd_secondary.cpp osd_peering.cpp osd_flush.cpp osd_peering_pg.cpp
osd_primary.cpp osd_primary_chain.cpp osd_primary_sync.cpp osd_primary_write.cpp osd_primary_subops.cpp
osd_cluster.cpp osd_rmw.cpp
)
target_link_libraries(vitastor-osd
vitastor_common
vitastor_blk
Jerasure
${IBVERBS_LIBRARIES}
)
# libfio_vitastor_sec.so
add_library(fio_vitastor_sec SHARED
fio_sec_osd.cpp
rw_blocking.cpp
)
target_link_libraries(fio_vitastor_sec
tcmalloc_minimal
)
if (${WITH_FIO})
# libfio_vitastor_sec.so
add_library(fio_vitastor_sec SHARED
fio_sec_osd.cpp
rw_blocking.cpp
)
target_link_libraries(fio_vitastor_sec
tcmalloc_minimal
)
endif (${WITH_FIO})
# libvitastor_client.so
add_library(vitastor_client SHARED
cluster_client.cpp epoll_manager.cpp etcd_state_client.cpp
messenger.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp
http_client.cpp osd_ops.cpp pg_states.cpp timerfd_manager.cpp base64.cpp
cluster_client.cpp
cluster_client_list.cpp
vitastor_c.cpp
)
set_target_properties(vitastor_client PROPERTIES PUBLIC_HEADER "vitastor_c.h")
target_link_libraries(vitastor_client
vitastor_common
tcmalloc_minimal
${LIBURING_LIBRARIES}
${IBVERBS_LIBRARIES}
)
set_target_properties(vitastor_client PROPERTIES VERSION ${VERSION} SOVERSION 0)
# libfio_vitastor.so
add_library(fio_vitastor SHARED
fio_cluster.cpp
)
target_link_libraries(fio_vitastor
vitastor_client
)
if (${WITH_FIO})
# libfio_vitastor.so
add_library(fio_vitastor SHARED
fio_cluster.cpp
)
target_link_libraries(fio_vitastor
vitastor_client
)
endif (${WITH_FIO})
# vitastor-nbd
add_executable(vitastor-nbd
@@ -110,11 +151,11 @@ target_link_libraries(vitastor-nbd
vitastor_client
)
# vitastor-rm
add_executable(vitastor-rm
rm_inode.cpp
# vitastor-cli
add_executable(vitastor-cli
cli.cpp cli_flatten.cpp cli_merge.cpp cli_rm.cpp cli_snap_rm.cpp
)
target_link_libraries(vitastor-rm
target_link_libraries(vitastor-cli
vitastor_client
)
@@ -123,27 +164,24 @@ add_executable(vitastor-dump-journal
dump_journal.cpp crc32c.c
)
# qemu_driver.so
add_library(qemu_proxy STATIC qemu_proxy.cpp)
target_compile_options(qemu_proxy PUBLIC -fPIC)
target_include_directories(qemu_proxy PUBLIC
../qemu/b/qemu
../qemu/include
${GLIB_INCLUDE_DIRS}
)
target_link_libraries(qemu_proxy
vitastor_client
)
add_library(qemu_vitastor SHARED
qemu_driver.c
)
target_link_libraries(qemu_vitastor
qemu_proxy
)
set_target_properties(qemu_vitastor PROPERTIES
PREFIX ""
OUTPUT_NAME "block-vitastor"
)
if (${WITH_QEMU})
# qemu_driver.so
add_library(qemu_vitastor SHARED
qemu_driver.c
)
target_include_directories(qemu_vitastor PUBLIC
../qemu/b/qemu
../qemu/include
${GLIB_INCLUDE_DIRS}
)
target_link_libraries(qemu_vitastor
vitastor_client
)
set_target_properties(qemu_vitastor PROPERTIES
PREFIX ""
OUTPUT_NAME "block-vitastor"
)
endif (${WITH_QEMU})
### Test stubs
@@ -161,10 +199,12 @@ target_link_libraries(osd_rmw_test Jerasure tcmalloc_minimal)
# stub_uring_osd
add_executable(stub_uring_osd
stub_uring_osd.cpp epoll_manager.cpp messenger.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp timerfd_manager.cpp ../json11/json11.cpp
stub_uring_osd.cpp
)
target_link_libraries(stub_uring_osd
vitastor_common
${LIBURING_LIBRARIES}
${IBVERBS_LIBRARIES}
tcmalloc_minimal
)
@@ -175,14 +215,41 @@ target_link_libraries(osd_peering_pg_test tcmalloc_minimal)
# test_allocator
add_executable(test_allocator test_allocator.cpp allocator.cpp)
# test_cas
add_executable(test_cas
test_cas.cpp
)
target_link_libraries(test_cas
vitastor_client
)
# test_cluster_client
add_executable(test_cluster_client
test_cluster_client.cpp
pg_states.cpp osd_ops.cpp cluster_client.cpp cluster_client_list.cpp msgr_op.cpp mock/messenger.cpp msgr_stop.cpp
etcd_state_client.cpp timerfd_manager.cpp ../json11/json11.cpp
)
target_compile_definitions(test_cluster_client PUBLIC -D__MOCK__)
target_include_directories(test_cluster_client PUBLIC ${CMAKE_SOURCE_DIR}/src/mock)
## test_blockstore, test_shit
#add_executable(test_blockstore test_blockstore.cpp timerfd_interval.cpp)
#add_executable(test_blockstore test_blockstore.cpp)
#target_link_libraries(test_blockstore blockstore)
#add_executable(test_shit test_shit.cpp osd_peering_pg.cpp)
#target_link_libraries(test_shit ${LIBURING_LIBRARIES} m)
### Install
install(TARGETS vitastor-osd vitastor-dump-journal vitastor-nbd vitastor-rm RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
install(TARGETS fio_vitastor fio_vitastor_blk fio_vitastor_sec vitastor_blk vitastor_client LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR})
install(TARGETS qemu_vitastor LIBRARY DESTINATION /usr/${CMAKE_INSTALL_LIBDIR}/${QEMU_PLUGINDIR})
install(TARGETS vitastor-osd vitastor-dump-journal vitastor-nbd vitastor-cli RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
install_symlink(${CMAKE_INSTALL_BINDIR}/vitastor-rm vitastor-cli)
install(
TARGETS vitastor_blk vitastor_client
LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
PUBLIC_HEADER DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
)
if (${WITH_FIO})
install(TARGETS fio_vitastor fio_vitastor_blk fio_vitastor_sec LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR})
endif (${WITH_FIO})
if (${WITH_QEMU})
install(TARGETS qemu_vitastor LIBRARY DESTINATION /usr/${CMAKE_INSTALL_LIBDIR}/${QEMU_PLUGINDIR})
endif (${WITH_QEMU})


@@ -13,19 +13,19 @@ allocator::allocator(uint64_t blocks)
{
throw std::invalid_argument("blocks");
}
uint64_t p2 = 1, total = 1;
uint64_t p2 = 1;
total = 0;
while (p2 * 64 < blocks)
{
p2 = p2 * 64;
total += p2;
p2 = p2 * 64;
}
total -= p2;
total += (blocks+63) / 64;
mask = new uint64_t[2 + total];
mask = new uint64_t[total];
size = free = blocks;
last_one_mask = (blocks % 64) == 0
? UINT64_MAX
: ~(UINT64_MAX << (64 - blocks % 64));
: ((1l << (blocks % 64)) - 1);
for (uint64_t i = 0; i < total; i++)
{
mask[i] = 0;
@@ -37,6 +37,21 @@ allocator::~allocator()
delete[] mask;
}
bool allocator::get(uint64_t addr)
{
if (addr >= size)
{
return false;
}
uint64_t p2 = 1, offset = 0;
while (p2 * 64 < size)
{
offset += p2;
p2 = p2 * 64;
}
return ((mask[offset + addr/64] >> (addr % 64)) & 1);
}
void allocator::set(uint64_t addr, bool value)
{
if (addr >= size)
@@ -99,6 +114,10 @@ uint64_t allocator::find_free()
uint64_t p2 = 1, offset = 0, addr = 0, f, i;
while (p2 < size)
{
if (offset+addr >= total)
{
return UINT64_MAX;
}
uint64_t m = mask[offset + addr];
for (i = 0, f = 1; i < 64; i++, f <<= 1)
{
@@ -113,11 +132,6 @@ uint64_t allocator::find_free()
return UINT64_MAX;
}
addr = (addr * 64) | i;
if (addr >= size)
{
// No space
return UINT64_MAX;
}
offset += p2;
p2 = p2 * 64;
}
@@ -128,3 +142,35 @@ uint64_t allocator::get_free_count()
{
return free;
}
void bitmap_set(void *bitmap, uint64_t start, uint64_t len, uint64_t bitmap_granularity)
{
if (start == 0)
{
if (len == 32*bitmap_granularity)
{
*((uint32_t*)bitmap) = UINT32_MAX;
return;
}
else if (len == 64*bitmap_granularity)
{
*((uint64_t*)bitmap) = UINT64_MAX;
return;
}
}
unsigned bit_start = start / bitmap_granularity;
unsigned bit_end = ((start + len) + bitmap_granularity - 1) / bitmap_granularity;
while (bit_start < bit_end)
{
if (!(bit_start & 7) && bit_end >= bit_start+8)
{
((uint8_t*)bitmap)[bit_start / 8] = UINT8_MAX;
bit_start += 8;
}
else
{
((uint8_t*)bitmap)[bit_start / 8] |= 1 << (bit_start % 8);
bit_start++;
}
}
}
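A minimal usage sketch (not part of the diff) of the standalone bitmap_set() above, which now takes bitmap_granularity explicitly instead of reading it from the blockstore; the "allocator.h" include path is assumed:
#include <stdio.h>
#include <stdint.h>
#include "allocator.h"   // assumed to declare bitmap_set()

int main()
{
    uint8_t bmp[16] = { 0 };             // object bitmap, 1 bit per 4 KiB granule
    // A write at offset 8192 of length 16384 covers granules 2..5,
    // so bits 2..5 of the first byte get set and bmp[0] becomes 0x3c.
    bitmap_set(bmp, 8192, 16384, 4096);
    printf("%02x\n", bmp[0]);
    return 0;
}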


@@ -8,6 +8,7 @@
// Hierarchical bitmap allocator
class allocator
{
uint64_t total;
uint64_t size;
uint64_t free;
uint64_t last_one_mask;
@@ -15,7 +16,10 @@ class allocator
public:
allocator(uint64_t blocks);
~allocator();
bool get(uint64_t addr);
void set(uint64_t addr, bool value);
uint64_t find_free();
uint64_t get_free_count();
};
void bitmap_set(void *bitmap, uint64_t start, uint64_t len, uint64_t bitmap_granularity);
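A small usage sketch of the allocator interface declared above (illustrative only; the header name is assumed):
#include <stdio.h>
#include <stdint.h>
#include "allocator.h"

int main()
{
    allocator a(1000);                   // hierarchical bitmap for 1000 data blocks
    uint64_t blk = a.find_free();        // UINT64_MAX means "no free block"
    if (blk != UINT64_MAX)
        a.set(blk, true);                // mark the block as allocated
    printf("block=%lu used=%d free=%lu\n", blk, (int)a.get(blk), a.get_free_count());
    return 0;
}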


@@ -3,9 +3,9 @@
#include "blockstore_impl.h"
blockstore_t::blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop)
blockstore_t::blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd)
{
impl = new blockstore_impl_t(config, ringloop);
impl = new blockstore_impl_t(config, ringloop, tfd);
}
blockstore_t::~blockstore_t()
@@ -35,17 +35,22 @@ bool blockstore_t::is_safe_to_stop()
void blockstore_t::enqueue_op(blockstore_op_t *op)
{
impl->enqueue_op(op, false);
impl->enqueue_op(op);
}
void blockstore_t::enqueue_op_first(blockstore_op_t *op)
int blockstore_t::read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version)
{
impl->enqueue_op(op, true);
return impl->read_bitmap(oid, target_version, bitmap, result_version);
}
std::unordered_map<object_id, uint64_t> & blockstore_t::get_unstable_writes()
std::map<uint64_t, uint64_t> & blockstore_t::get_inode_space_stats()
{
return impl->unstable_writes;
return impl->inode_space_stats;
}
void blockstore_t::dump_diagnostics()
{
return impl->dump_diagnostics();
}
uint32_t blockstore_t::get_block_size()
@@ -63,7 +68,7 @@ uint64_t blockstore_t::get_free_block_count()
return impl->get_free_block_count();
}
uint32_t blockstore_t::get_disk_alignment()
uint32_t blockstore_t::get_bitmap_granularity()
{
return impl->get_disk_alignment();
return impl->get_bitmap_granularity();
}


@@ -16,6 +16,7 @@
#include "object_id.h"
#include "ringloop.h"
#include "timerfd_manager.h"
// Memory alignment for direct I/O (usually 512 bytes)
// All other alignments must be a multiple of this one
@@ -27,6 +28,7 @@
#define DEFAULT_ORDER 17
#define MIN_BLOCK_SIZE 4*1024
#define MAX_BLOCK_SIZE 128*1024*1024
#define DEFAULT_BITMAP_GRANULARITY 4096
#define BS_OP_MIN 1
#define BS_OP_READ 1
@@ -64,6 +66,8 @@ Input:
- offset, len = offset and length within object. length may be zero, in that case
read operation only returns the version / write operation only bumps the version
- buf = pre-allocated buffer for data (read) / with data (write). may be NULL if len == 0.
- bitmap = pointer to the new 'external' object bitmap data. The part of it that corresponds to the
write request is copied bitwise into the metadata area and stored there.
Output:
- retval = number of bytes actually read/written or negative error number (-EINVAL or -ENOSPC)
@@ -141,6 +145,7 @@ struct blockstore_op_t
uint32_t offset;
uint32_t len;
void *buf;
void *bitmap;
int retval;
uint8_t private_data[BS_OP_PRIVATE_DATA_SIZE];
@@ -154,7 +159,7 @@ class blockstore_t
{
blockstore_impl_t *impl;
public:
blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop);
blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd);
~blockstore_t();
// Event loop
@@ -175,17 +180,19 @@ public:
// Submission
void enqueue_op(blockstore_op_t *op);
// Insert operation into the beginning of the queue
// Intended for the OSD syncer "thread" to be able to stabilize something when the journal is full
void enqueue_op_first(blockstore_op_t *op);
// Simplified synchronous operation: get object bitmap & current version
int read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version = NULL);
// Unstable writes are added here (map of object_id -> version)
std::unordered_map<object_id, uint64_t> & get_unstable_writes();
// Get per-inode space usage statistics
std::map<uint64_t, uint64_t> & get_inode_space_stats();
// Print diagnostics to stdout
void dump_diagnostics();
// FIXME rename to object_size
uint32_t get_block_size();
uint64_t get_block_count();
uint64_t get_free_block_count();
uint32_t get_disk_alignment();
uint32_t get_bitmap_granularity();
};
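A hedged sketch of how a caller might use the new synchronous read_bitmap() and the per-inode statistics; the return convention and the inode number are assumptions, not taken from the diff:
#include <stdint.h>
#include <vector>
#include "blockstore.h"

void example(blockstore_t *bs)
{
    object_id oid = { .inode = 2, .stripe = 0 };   // made-up object
    uint64_t version = 0;
    std::vector<uint8_t> bmp(bs->get_block_size() / bs->get_bitmap_granularity() / 8);
    // Assumed to return 0 on success / a negative error code on failure and to
    // fill "version" with the version whose bitmap was returned.
    int r = bs->read_bitmap(oid, UINT64_MAX, bmp.data(), &version);
    (void)r;
    // Per-inode space usage (in bytes) is now exposed directly from the blockstore:
    auto & stats = bs->get_inode_space_stats();
    (void)stats;
}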


@@ -3,12 +3,13 @@
#include "blockstore_impl.h"
journal_flusher_t::journal_flusher_t(int flusher_count, blockstore_impl_t *bs)
journal_flusher_t::journal_flusher_t(blockstore_impl_t *bs)
{
this->bs = bs;
this->flusher_count = flusher_count;
this->cur_flusher_count = 1;
this->target_flusher_count = 1;
this->max_flusher_count = bs->max_flusher_count;
this->min_flusher_count = bs->min_flusher_count;
this->cur_flusher_count = bs->min_flusher_count;
this->target_flusher_count = bs->min_flusher_count;
dequeuing = false;
trimming = false;
active_flushers = 0;
@@ -16,11 +17,11 @@ journal_flusher_t::journal_flusher_t(int flusher_count, blockstore_impl_t *bs)
// FIXME: allow to configure flusher_start_threshold and journal_trim_interval
flusher_start_threshold = bs->journal_block_size / sizeof(journal_entry_stable);
journal_trim_interval = 512;
journal_trim_counter = 0;
trim_wanted = 0;
journal_trim_counter = bs->journal.flush_journal ? 1 : 0;
trim_wanted = bs->journal.flush_journal ? 1 : 0;
journal_superblock = bs->journal.inmemory ? bs->journal.buffer : memalign_or_die(MEM_ALIGNMENT, bs->journal_block_size);
co = new journal_flusher_co[flusher_count];
for (int i = 0; i < flusher_count; i++)
co = new journal_flusher_co[max_flusher_count];
for (int i = 0; i < max_flusher_count; i++)
{
co[i].bs = bs;
co[i].flusher = this;
@@ -71,10 +72,10 @@ bool journal_flusher_t::is_active()
void journal_flusher_t::loop()
{
target_flusher_count = bs->write_iodepth*2;
if (target_flusher_count <= 0)
target_flusher_count = 1;
else if (target_flusher_count > flusher_count)
target_flusher_count = flusher_count;
if (target_flusher_count < min_flusher_count)
target_flusher_count = min_flusher_count;
else if (target_flusher_count > max_flusher_count)
target_flusher_count = max_flusher_count;
if (target_flusher_count > cur_flusher_count)
cur_flusher_count = target_flusher_count;
else if (target_flusher_count < cur_flusher_count)
@@ -181,6 +182,75 @@ void journal_flusher_t::release_trim()
trim_wanted--;
}
void journal_flusher_t::dump_diagnostics()
{
const char *unflushable_type = "";
obj_ver_id unflushable = { 0 };
// Try to find out if there is a flushable object for information
for (object_id cur_oid: flush_queue)
{
obj_ver_id cur = { .oid = cur_oid, .version = flush_versions[cur_oid] };
auto dirty_end = bs->dirty_db.find(cur);
if (dirty_end == bs->dirty_db.end())
{
// Already flushed
continue;
}
auto repeat_it = sync_to_repeat.find(cur.oid);
if (repeat_it != sync_to_repeat.end())
{
// Someone is already flushing it
unflushable_type = "locked,";
unflushable = cur;
break;
}
if (dirty_end->second.journal_sector >= bs->journal.dirty_start &&
(bs->journal.dirty_start >= bs->journal.used_start ||
dirty_end->second.journal_sector < bs->journal.used_start))
{
// Object is more recent than possible to flush
bool found = try_find_older(dirty_end, cur);
if (!found)
{
unflushable_type = "dirty,";
unflushable = cur;
break;
}
}
unflushable_type = "ok,";
unflushable = cur;
break;
}
printf(
"Flusher: queued=%ld first=%s%lx:%lx trim_wanted=%d dequeuing=%d trimming=%d cur=%d target=%d active=%d syncing=%d\n",
flush_queue.size(), unflushable_type, unflushable.oid.inode, unflushable.oid.stripe,
trim_wanted, dequeuing, trimming, cur_flusher_count, target_flusher_count,
active_flushers, syncing_flushers
);
}
bool journal_flusher_t::try_find_older(std::map<obj_ver_id, dirty_entry>::iterator & dirty_end, obj_ver_id & cur)
{
bool found = false;
while (dirty_end != bs->dirty_db.begin())
{
dirty_end--;
if (dirty_end->first.oid != cur.oid)
{
break;
}
if (!(dirty_end->second.journal_sector >= bs->journal.dirty_start &&
(bs->journal.dirty_start >= bs->journal.used_start ||
dirty_end->second.journal_sector < bs->journal.used_start)))
{
found = true;
cur.version = dirty_end->first.version;
break;
}
}
return found;
}
#define await_sqe(label) \
resume_##label:\
sqe = bs->get_sqe();\
@@ -237,7 +307,8 @@ bool journal_flusher_co::loop()
else if (wait_state == 21)
goto resume_21;
resume_0:
if (!flusher->flush_queue.size() || !flusher->dequeuing)
if (flusher->flush_queue.size() < flusher->min_flusher_count && !flusher->trim_wanted ||
!flusher->flush_queue.size() || !flusher->dequeuing)
{
stop_flusher:
if (flusher->trim_wanted > 0 && flusher->journal_trim_counter > 0)
@@ -284,30 +355,15 @@ stop_flusher:
// And it may even block writes if we don't flush the older version
// (if it's in the beginning of the journal)...
// So first try to find an older version of the same object to flush.
bool found = false;
while (dirty_end != bs->dirty_db.begin())
{
dirty_end--;
if (dirty_end->first.oid != cur.oid)
{
break;
}
if (!(dirty_end->second.journal_sector >= bs->journal.dirty_start &&
(bs->journal.dirty_start >= bs->journal.used_start ||
dirty_end->second.journal_sector < bs->journal.used_start)))
{
found = true;
cur.version = dirty_end->first.version;
break;
}
}
bool found = flusher->try_find_older(dirty_end, cur);
if (!found)
{
// Try other objects
flusher->sync_to_repeat.erase(cur.oid);
int search_left = flusher->flush_queue.size() - 1;
#ifdef BLOCKSTORE_DEBUG
printf("Flusher overran writers (dirty_start=%08lx) - searching for older flushes (%d left)\n", bs->journal.dirty_start, search_left);
printf("Flusher overran writers (%lx:%lx v%lu, dirty_start=%08lx) - searching for older flushes (%d left)\n",
cur.oid.inode, cur.oid.stripe, cur.version, bs->journal.dirty_start, search_left);
#endif
while (search_left > 0)
{
@@ -330,7 +386,12 @@ stop_flusher:
else
{
repeat_it = flusher->sync_to_repeat.find(cur.oid);
if (repeat_it == flusher->sync_to_repeat.end())
if (repeat_it != flusher->sync_to_repeat.end())
{
if (repeat_it->second < cur.version)
repeat_it->second = cur.version;
}
else
{
flusher->sync_to_repeat[cur.oid] = 0;
break;
@@ -426,18 +487,18 @@ resume_1:
{
new_clean_bitmap = (bs->inmemory_meta
? meta_new.buf + meta_new.pos*bs->clean_entry_size + sizeof(clean_disk_entry)
: bs->clean_bitmap + (clean_loc >> bs->block_order)*bs->clean_entry_bitmap_size);
: bs->clean_bitmap + (clean_loc >> bs->block_order)*(2*bs->clean_entry_bitmap_size));
if (clean_init_bitmap)
{
memset(new_clean_bitmap, 0, bs->clean_entry_bitmap_size);
bitmap_set(new_clean_bitmap, clean_bitmap_offset, clean_bitmap_len);
bitmap_set(new_clean_bitmap, clean_bitmap_offset, clean_bitmap_len, bs->bitmap_granularity);
}
}
for (it = v.begin(); it != v.end(); it++)
{
if (new_clean_bitmap)
{
bitmap_set(new_clean_bitmap, it->offset, it->len);
bitmap_set(new_clean_bitmap, it->offset, it->len, bs->bitmap_granularity);
}
await_sqe(4);
data->iov = (struct iovec){ it->buf, (size_t)it->len };
@@ -471,6 +532,7 @@ resume_1:
wait_state = 5;
return false;
}
// zero out old metadata entry
memset(meta_old.buf + meta_old.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
await_sqe(15);
data->iov = (struct iovec){ meta_old.buf, bs->meta_block_size };
@@ -482,6 +544,14 @@ resume_1:
}
if (has_delete)
{
clean_disk_entry *new_entry = (clean_disk_entry*)(meta_new.buf + meta_new.pos*bs->clean_entry_size);
if (new_entry->oid.inode != 0 && new_entry->oid != cur.oid)
{
printf("Fatal error (metadata corruption or bug): tried to delete metadata entry %lu (%lx:%lx) while deleting %lx:%lx\n",
clean_loc >> bs->block_order, new_entry->oid.inode, new_entry->oid.stripe, cur.oid.inode, cur.oid.stripe);
exit(1);
}
// zero out new metadata entry
memset(meta_new.buf + meta_new.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
}
else
@@ -499,6 +569,12 @@ resume_1:
{
memcpy(&new_entry->bitmap, new_clean_bitmap, bs->clean_entry_bitmap_size);
}
// copy latest external bitmap/attributes
if (bs->clean_entry_bitmap_size)
{
void *bmp_ptr = bs->clean_entry_bitmap_size > sizeof(void*) ? dirty_end->second.bitmap : &dirty_end->second.bitmap;
memcpy((void*)(new_entry+1) + bs->clean_entry_bitmap_size, bmp_ptr, bs->clean_entry_bitmap_size);
}
}
await_sqe(6);
data->iov = (struct iovec){ meta_new.buf, bs->meta_block_size };
@@ -585,6 +661,7 @@ resume_1:
.size = sizeof(journal_entry_start),
.reserved = 0,
.journal_start = new_trim_pos,
.version = JOURNAL_VERSION,
};
((journal_entry_start*)flusher->journal_superblock)->crc32 = je_crc32((journal_entry*)flusher->journal_superblock);
data->iov = (struct iovec){ flusher->journal_superblock, bs->journal_block_size };
@@ -616,6 +693,12 @@ resume_1:
#endif
flusher->trimming = false;
}
if (bs->journal.flush_journal && !flusher->flush_queue.size())
{
assert(bs->journal.used_start == bs->journal.next_free);
printf("Journal flushed\n");
exit(0);
}
}
// All done
flusher->active_flushers--;
@@ -646,7 +729,7 @@ bool journal_flusher_co::scan_dirty(int wait_base)
{
char err[1024];
snprintf(
err, 1024, "BUG: Unexpected dirty_entry %lx:%lx v%lu unstable state during flush: %d",
err, 1024, "BUG: Unexpected dirty_entry %lx:%lx v%lu unstable state during flush: 0x%x",
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version, dirty_it->second.state
);
throw std::runtime_error(err);
@@ -775,7 +858,10 @@ void journal_flusher_co::update_clean_db()
if (old_clean_loc != UINT64_MAX && old_clean_loc != clean_loc)
{
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu (new location is %lu)\n", old_clean_loc >> bs->block_order, clean_loc >> bs->block_order);
printf("Free block %lu from %lx:%lx v%lu (new location is %lu)\n",
old_clean_loc >> bs->block_order,
cur.oid.inode, cur.oid.stripe, cur.version,
clean_loc >> bs->block_order);
#endif
bs->data_alloc->set(old_clean_loc >> bs->block_order, false);
}
@@ -783,6 +869,11 @@ void journal_flusher_co::update_clean_db()
{
auto clean_it = bs->clean_db.find(cur.oid);
bs->clean_db.erase(clean_it);
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu from %lx:%lx v%lu (delete)\n",
clean_loc >> bs->block_order,
cur.oid.inode, cur.oid.stripe, cur.version);
#endif
bs->data_alloc->set(clean_loc >> bs->block_order, false);
clean_loc = UINT64_MAX;
}
@@ -804,7 +895,7 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
goto resume_1;
else if (wait_state == wait_base+2)
goto resume_2;
if (!(fsync_meta ? bs->disable_meta_fsync : bs->disable_journal_fsync))
if (!(fsync_meta ? bs->disable_meta_fsync : bs->disable_data_fsync))
{
cur_sync = flusher->syncs.end();
while (cur_sync != flusher->syncs.begin())
@@ -823,31 +914,34 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
sync_found:
cur_sync->ready_count++;
flusher->syncing_flushers++;
if (flusher->syncing_flushers >= flusher->flusher_count || !flusher->flush_queue.size())
resume_1:
if (!cur_sync->state)
{
// Sync batch is ready. Do it.
await_sqe(0);
data->iov = { 0 };
data->callback = simple_callback_w;
my_uring_prep_fsync(sqe, fsync_meta ? bs->meta_fd : bs->data_fd, IORING_FSYNC_DATASYNC);
cur_sync->state = 1;
wait_count++;
resume_1:
if (wait_count > 0)
if (flusher->syncing_flushers >= flusher->cur_flusher_count || !flusher->flush_queue.size())
{
// Sync batch is ready. Do it.
await_sqe(0);
data->iov = { 0 };
data->callback = simple_callback_w;
my_uring_prep_fsync(sqe, fsync_meta ? bs->meta_fd : bs->data_fd, IORING_FSYNC_DATASYNC);
cur_sync->state = 1;
wait_count++;
resume_2:
if (wait_count > 0)
{
wait_state = 2;
return false;
}
// Sync completed. All previous coroutines waiting for it must be resumed
cur_sync->state = 2;
bs->ringloop->wakeup();
}
else
{
// Wait until someone else sends and completes a sync.
wait_state = 1;
return false;
}
// Sync completed. All previous coroutines waiting for it must be resumed
cur_sync->state = 2;
bs->ringloop->wakeup();
}
// Wait until someone else sends and completes a sync.
resume_2:
if (!cur_sync->state)
{
wait_state = 2;
return false;
}
flusher->syncing_flushers--;
cur_sync->ready_count--;
@@ -858,35 +952,3 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
}
return true;
}
void journal_flusher_co::bitmap_set(void *bitmap, uint64_t start, uint64_t len)
{
if (start == 0)
{
if (len == 32*bs->bitmap_granularity)
{
*((uint32_t*)bitmap) = UINT32_MAX;
return;
}
else if (len == 64*bs->bitmap_granularity)
{
*((uint64_t*)bitmap) = UINT64_MAX;
return;
}
}
unsigned bit_start = start / bs->bitmap_granularity;
unsigned bit_end = ((start + len) + bs->bitmap_granularity - 1) / bs->bitmap_granularity;
while (bit_start < bit_end)
{
if (!(bit_start & 7) && bit_end >= bit_start+8)
{
((uint8_t*)bitmap)[bit_start / 8] = UINT8_MAX;
bit_start += 8;
}
else
{
((uint8_t*)bitmap)[bit_start / 8] |= 1 << (bit_start % 8);
bit_start++;
}
}
}


@@ -69,7 +69,6 @@ class journal_flusher_co
bool modify_meta_read(uint64_t meta_loc, flusher_meta_write_t &wr, int wait_base);
void update_clean_db();
bool fsync_batch(bool fsync_meta, int wait_base);
void bitmap_set(void *bitmap, uint64_t start, uint64_t len);
public:
journal_flusher_co();
bool loop();
@@ -80,7 +79,7 @@ class journal_flusher_t
{
int trim_wanted = 0;
bool dequeuing;
int flusher_count, cur_flusher_count, target_flusher_count;
int min_flusher_count, max_flusher_count, cur_flusher_count, target_flusher_count;
int flusher_start_threshold;
journal_flusher_co *co;
blockstore_impl_t *bs;
@@ -98,8 +97,11 @@ class journal_flusher_t
std::map<uint64_t, meta_sector_t> meta_sectors;
std::deque<object_id> flush_queue;
std::map<object_id, uint64_t> flush_versions;
bool try_find_older(std::map<obj_ver_id, dirty_entry>::iterator & dirty_end, obj_ver_id & cur);
public:
journal_flusher_t(int flusher_count, blockstore_impl_t *bs);
journal_flusher_t(blockstore_impl_t *bs);
~journal_flusher_t();
void loop();
bool is_active();
@@ -109,4 +111,5 @@ public:
void enqueue_flush(obj_ver_id oid);
void unshift_flush(obj_ver_id oid, bool force);
void remove_flush(object_id oid);
void dump_diagnostics();
};
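For reference, a tiny standalone sketch of the scaling rule seen in journal_flusher_t::loop() above: the active flusher count follows the write queue depth, clamped between the minimum and maximum coroutine counts (the defaults of 1 and 256 are assumed from parse_config()):
#include <algorithm>
#include <cstdio>

int main()
{
    int min_flusher_count = 1, max_flusher_count = 256;   // assumed defaults
    int write_iodepth = 4;                                // current in-flight writes
    int target = std::clamp(write_iodepth * 2, min_flusher_count, max_flusher_count);
    printf("target_flusher_count=%d\n", target);          // prints 8
    return 0;
}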


@@ -3,16 +3,17 @@
#include "blockstore_impl.h"
blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop)
blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd)
{
assert(sizeof(blockstore_op_private_t) <= BS_OP_PRIVATE_DATA_SIZE);
this->tfd = tfd;
this->ringloop = ringloop;
ring_consumer.loop = [this]() { loop(); };
ringloop->register_consumer(&ring_consumer);
initialized = 0;
zero_object = (uint8_t*)memalign_or_die(MEM_ALIGNMENT, block_size);
data_fd = meta_fd = journal.fd = -1;
parse_config(config);
zero_object = (uint8_t*)memalign_or_die(MEM_ALIGNMENT, block_size);
try
{
open_data();
@@ -31,7 +32,7 @@ blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *
close(journal.fd);
throw;
}
flusher = new journal_flusher_t(flusher_count, this);
flusher = new journal_flusher_t(this);
}
blockstore_impl_t::~blockstore_impl_t()
@@ -92,35 +93,36 @@ void blockstore_impl_t::loop()
{
delete journal_init_reader;
journal_init_reader = NULL;
initialized = 10;
if (journal.flush_journal)
initialized = 3;
else
initialized = 10;
ringloop->wakeup();
}
}
if (initialized == 3)
{
if (readonly)
{
printf("Can't flush the journal in readonly mode\n");
exit(1);
}
flusher->loop();
ringloop->submit();
}
}
else
{
// try to submit ops
unsigned initial_ring_space = ringloop->space_left();
// FIXME: rework this "sync polling"
auto cur_sync = in_progress_syncs.begin();
while (cur_sync != in_progress_syncs.end())
// has_writes == 0 - no writes before the current queue item
// has_writes == 1 - some writes in progress
// has_writes == 2 - tried to submit some writes, but failed
int has_writes = 0, op_idx = 0, new_idx = 0;
for (; op_idx < submit_queue.size(); op_idx++, new_idx++)
{
if (continue_sync(*cur_sync) != 2)
{
// List is unmodified
cur_sync++;
}
else
{
cur_sync = in_progress_syncs.begin();
}
}
auto cur = submit_queue.begin();
int has_writes = 0;
while (cur != submit_queue.end())
{
auto op_ptr = cur;
auto op = *(cur++);
auto op = submit_queue[op_idx];
submit_queue[new_idx] = op;
// FIXME: This needs some simplification
// Writes should not block reads if the ring is not full and reads don't depend on them
// In all other cases we should stop submission
@@ -142,10 +144,13 @@ void blockstore_impl_t::loop()
}
unsigned ring_space = ringloop->space_left();
unsigned prev_sqe_pos = ringloop->save();
bool dequeue_op = false;
// 0 = can't submit
// 1 = in progress
// 2 = can be removed from queue
int wr_st = 0;
if (op->opcode == BS_OP_READ)
{
dequeue_op = dequeue_read(op);
wr_st = dequeue_read(op);
}
else if (op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE)
{
@@ -154,8 +159,8 @@ void blockstore_impl_t::loop()
// Some writes already could not be submitted
continue;
}
dequeue_op = dequeue_write(op);
has_writes = dequeue_op ? 1 : 2;
wr_st = dequeue_write(op);
has_writes = wr_st > 0 ? 1 : 2;
}
else if (op->opcode == BS_OP_DELETE)
{
@@ -164,8 +169,8 @@ void blockstore_impl_t::loop()
// Some writes already could not be submitted
continue;
}
dequeue_op = dequeue_del(op);
has_writes = dequeue_op ? 1 : 2;
wr_st = dequeue_del(op);
has_writes = wr_st > 0 ? 1 : 2;
}
else if (op->opcode == BS_OP_SYNC)
{
@@ -178,29 +183,31 @@ void blockstore_impl_t::loop()
// Can't submit SYNC before previous writes
continue;
}
dequeue_op = dequeue_sync(op);
wr_st = continue_sync(op, false);
if (wr_st != 2)
{
has_writes = wr_st > 0 ? 1 : 2;
}
}
else if (op->opcode == BS_OP_STABLE)
{
dequeue_op = dequeue_stable(op);
wr_st = dequeue_stable(op);
}
else if (op->opcode == BS_OP_ROLLBACK)
{
dequeue_op = dequeue_rollback(op);
wr_st = dequeue_rollback(op);
}
else if (op->opcode == BS_OP_LIST)
{
// LIST doesn't need to be blocked by previous modifications,
// it only needs to include all in-progress writes as they're guaranteed
// to be readable and stabilizable/rollbackable by subsequent operations
// LIST doesn't need to be blocked by previous modifications
process_list(op);
dequeue_op = true;
wr_st = 2;
}
if (dequeue_op)
if (wr_st == 2)
{
submit_queue.erase(op_ptr);
new_idx--;
}
else
if (wr_st == 0)
{
ringloop->restore(prev_sqe_pos);
if (PRIV(op)->wait_for == WAIT_SQE)
@@ -211,6 +218,14 @@ void blockstore_impl_t::loop()
}
}
}
if (op_idx != new_idx)
{
while (op_idx < submit_queue.size())
{
submit_queue[new_idx++] = submit_queue[op_idx++];
}
submit_queue.resize(new_idx);
}
if (!readonly)
{
flusher->loop();
@@ -233,7 +248,7 @@ bool blockstore_impl_t::is_safe_to_stop()
{
// It's safe to stop blockstore when there are no in-flight operations,
// no in-progress syncs and flusher isn't doing anything
if (submit_queue.size() > 0 || in_progress_syncs.size() > 0 || !readonly && flusher->is_active())
if (submit_queue.size() > 0 || !readonly && flusher->is_active())
{
return false;
}
@@ -300,7 +315,7 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
}
else if (PRIV(op)->wait_for == WAIT_FREE)
{
if (!data_alloc->get_free_count() && !flusher->is_active())
if (!data_alloc->get_free_count() && flusher->is_active())
{
#ifdef BLOCKSTORE_DEBUG
printf("Still waiting for free space on the data device\n");
@@ -315,7 +330,7 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
}
}
void blockstore_impl_t::enqueue_op(blockstore_op_t *op, bool first)
void blockstore_impl_t::enqueue_op(blockstore_op_t *op)
{
if (op->opcode < BS_OP_MIN || op->opcode > BS_OP_MAX ||
((op->opcode == BS_OP_READ || op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE) && (
@@ -323,8 +338,7 @@ void blockstore_impl_t::enqueue_op(blockstore_op_t *op, bool first)
op->len > block_size-op->offset ||
(op->len % disk_alignment)
)) ||
readonly && op->opcode != BS_OP_READ && op->opcode != BS_OP_LIST ||
first && (op->opcode == BS_OP_WRITE || op->opcode == BS_OP_WRITE_STABLE))
readonly && op->opcode != BS_OP_READ && op->opcode != BS_OP_LIST)
{
// Basic verification not passed
op->retval = -EINVAL;
@@ -374,25 +388,12 @@ void blockstore_impl_t::enqueue_op(blockstore_op_t *op, bool first)
std::function<void (blockstore_op_t*)>(op->callback)(op);
return;
}
if (op->opcode == BS_OP_SYNC && immediate_commit == IMMEDIATE_ALL)
{
op->retval = 0;
std::function<void (blockstore_op_t*)>(op->callback)(op);
return;
}
// Call constructor without allocating memory. We'll call destructor before returning op back
new ((void*)op->private_data) blockstore_op_private_t;
PRIV(op)->wait_for = 0;
PRIV(op)->op_state = 0;
PRIV(op)->pending_ops = 0;
if (!first)
{
submit_queue.push_back(op);
}
else
{
submit_queue.push_front(op);
}
submit_queue.push_back(op);
ringloop->wakeup();
}
@@ -456,7 +457,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
}
for (; clean_it != clean_end; clean_it++)
{
if (!pg_count || ((clean_it->first.inode + clean_it->first.stripe / pg_stripe_size) % pg_count) == list_pg)
if (!pg_count || ((clean_it->first.stripe / pg_stripe_size) % pg_count) == list_pg) // like map_to_pg()
{
if (stable_count >= stable_alloc)
{
@@ -501,7 +502,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
}
for (; dirty_it != dirty_end; dirty_it++)
{
if (!pg_count || ((dirty_it->first.oid.inode + dirty_it->first.oid.stripe / pg_stripe_size) % pg_count) == list_pg)
if (!pg_count || ((dirty_it->first.oid.stripe / pg_stripe_size) % pg_count) == list_pg) // like map_to_pg()
{
if (IS_DELETE(dirty_it->second.state))
{
@@ -594,3 +595,9 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
op->buf = stable;
FINISH_OP(op);
}
void blockstore_impl_t::dump_diagnostics()
{
journal.dump_diagnostics();
flusher->dump_diagnostics();
}


@@ -9,6 +9,7 @@
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>
#include <linux/fs.h>
@@ -77,7 +78,25 @@
#include "blockstore_journal.h"
// 24 bytes + block bitmap per "clean" entry on disk with fixed metadata tables
// "VITAstor"
#define BLOCKSTORE_META_MAGIC 0x726F747341544956l
#define BLOCKSTORE_META_VERSION 1
// metadata header (superblock)
// FIXME: After adding the OSD superblock, add a key to metadata
// and journal headers to check if they belong to the same OSD
struct __attribute__((__packed__)) blockstore_meta_header_t
{
uint64_t zero;
uint64_t magic;
uint64_t version;
uint32_t meta_block_size;
uint32_t data_block_size;
uint32_t bitmap_granularity;
};
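A quick sanity check for the layout above (illustrative, not part of the commit; meant to be compiled next to the struct definition): the packed superblock header is 36 bytes, so it comfortably fits into the single metadata block now reserved for it.
#include <cstdint>
static_assert(sizeof(blockstore_meta_header_t) == 3*sizeof(uint64_t) + 3*sizeof(uint32_t),
    "packed superblock header is 36 bytes");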
// 32 bytes = 24 bytes + block bitmap (4 bytes by default) + external attributes (also bitmap, 4 bytes by default)
// per "clean" entry on disk with fixed metadata tables
// FIXME: maybe add crc32's to metadata
struct __attribute__((__packed__)) clean_disk_entry
{
@@ -93,7 +112,7 @@ struct __attribute__((__packed__)) clean_entry
uint64_t location;
};
// 56 = 24 + 32 bytes per dirty entry in memory (obj_ver_id => dirty_entry)
// 64 = 24 + 40 bytes per dirty entry in memory (obj_ver_id => dirty_entry)
struct __attribute__((__packed__)) dirty_entry
{
uint32_t state;
@@ -102,6 +121,7 @@ struct __attribute__((__packed__)) dirty_entry
uint32_t offset; // data offset within object (stripe)
uint32_t len; // data length
uint64_t journal_sector; // journal sector used for this entry
void* bitmap; // either external bitmap itself when it fits, or a pointer to it when it doesn't
};
// - Sync must be submitted after previous writes/deletes (not before!)
@@ -156,12 +176,11 @@ struct blockstore_op_private_t
struct iovec iov_zerofill[3];
// Warning: must not have a default value here because it's written to before calling constructor in blockstore_write.cpp O_o
uint64_t real_version;
timespec tv_begin;
// Sync
std::vector<obj_ver_id> sync_big_writes, sync_small_writes;
int sync_small_checked, sync_big_checked;
std::list<blockstore_op_t*>::iterator in_progress_ptr;
int prev_sync_count;
};
// https://github.com/algorithm-ninja/cpp-btree
@@ -199,10 +218,18 @@ class blockstore_impl_t
// Suitable only for server SSDs with capacitors, requires disabled data and journal fsyncs
int immediate_commit = IMMEDIATE_NONE;
bool inmemory_meta = false;
// Maximum flusher count
unsigned flusher_count;
// Maximum and minimum flusher count
unsigned max_flusher_count, min_flusher_count;
// Maximum queue depth
unsigned max_write_iodepth = 128;
// Enable small (journaled) write throttling, useful for the SSD+HDD case
bool throttle_small_writes = false;
// Target data device iops, bandwidth and parallelism for throttling (100/100/1 is the default for HDD)
int throttle_target_iops = 100;
int throttle_target_mbs = 100;
int throttle_target_parallelism = 1;
// Minimum difference in microseconds between target and real execution times to throttle the response
int throttle_threshold_us = 50;
/******* END OF OPTIONS *******/
struct ring_consumer_t ring_consumer;
@@ -210,9 +237,9 @@ class blockstore_impl_t
blockstore_clean_db_t clean_db;
uint8_t *clean_bitmap = NULL;
blockstore_dirty_db_t dirty_db;
std::list<blockstore_op_t*> submit_queue; // FIXME: funny thing is that vector is better here
std::vector<blockstore_op_t*> submit_queue;
std::vector<obj_ver_id> unsynced_big_writes, unsynced_small_writes;
std::list<blockstore_op_t*> in_progress_syncs; // ...and probably here, too
int unsynced_big_write_count = 0;
allocator *data_alloc = NULL;
uint8_t *zero_object;
@@ -233,6 +260,7 @@ class blockstore_impl_t
bool live = false, queue_stall = false;
ring_loop_t *ringloop;
timerfd_manager_t *tfd;
bool stop_sync_submitted;
@@ -252,6 +280,7 @@ class blockstore_impl_t
void open_data();
void open_meta();
void open_journal();
uint8_t* get_clean_entry_bitmap(uint64_t block_loc, int offset);
// Asynchronous init
int initialized;
@@ -271,6 +300,7 @@ class blockstore_impl_t
// Write
bool enqueue_write(blockstore_op_t *op);
void cancel_all_writes(blockstore_op_t *op, blockstore_dirty_db_t::iterator dirty_it, int retval);
int dequeue_write(blockstore_op_t *op);
int dequeue_del(blockstore_op_t *op);
int continue_write(blockstore_op_t *op);
@@ -278,16 +308,14 @@ class blockstore_impl_t
void handle_write_event(ring_data_t *data, blockstore_op_t *op);
// Sync
int dequeue_sync(blockstore_op_t *op);
int continue_sync(blockstore_op_t *op, bool queue_has_in_progress_sync);
void handle_sync_event(ring_data_t *data, blockstore_op_t *op);
int continue_sync(blockstore_op_t *op);
void ack_one_sync(blockstore_op_t *op);
int ack_sync(blockstore_op_t *op);
void ack_sync(blockstore_op_t *op);
// Stabilize
int dequeue_stable(blockstore_op_t *op);
int continue_stable(blockstore_op_t *op);
void mark_stable(const obj_ver_id & ov);
void mark_stable(const obj_ver_id & ov, bool forget_dirty = false);
void handle_stable_event(ring_data_t *data, blockstore_op_t *op);
void stabilize_object(object_id oid, uint64_t max_ver);
@@ -303,7 +331,7 @@ class blockstore_impl_t
public:
blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop);
blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd);
~blockstore_impl_t();
// Event loop
@@ -322,13 +350,22 @@ public:
bool is_stalled();
// Submission
void enqueue_op(blockstore_op_t *op, bool first = false);
void enqueue_op(blockstore_op_t *op);
// Simplified synchronous operation: get object bitmap & current version
int read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version = NULL);
// Unstable writes are added here (map of object_id -> version)
std::unordered_map<object_id, uint64_t> unstable_writes;
// Space usage statistics
std::map<uint64_t, uint64_t> inode_space_stats;
// Print diagnostics to stdout
void dump_diagnostics();
inline uint32_t get_block_size() { return block_size; }
inline uint64_t get_block_count() { return block_count; }
inline uint64_t get_free_block_count() { return data_alloc->get_free_count(); }
inline uint32_t get_disk_alignment() { return disk_alignment; }
inline uint32_t get_bitmap_granularity() { return disk_alignment; }
};


@@ -3,6 +3,20 @@
#include "blockstore_impl.h"
#define GET_SQE() \
sqe = bs->get_sqe();\
if (!sqe)\
throw std::runtime_error("io_uring is full during initialization");\
data = ((ring_data_t*)sqe->user_data)
static bool iszero(uint64_t *buf, int len)
{
for (int i = 0; i < len; i++)
if (buf[i] != 0)
return false;
return true;
}
blockstore_init_meta::blockstore_init_meta(blockstore_impl_t *bs)
{
this->bs = bs;
@@ -10,7 +24,7 @@ blockstore_init_meta::blockstore_init_meta(blockstore_impl_t *bs)
void blockstore_init_meta::handle_event(ring_data_t *data)
{
if (data->res <= 0)
if (data->res < 0)
{
throw std::runtime_error(
std::string("read metadata failed at offset ") + std::to_string(metadata_read) +
@@ -28,6 +42,12 @@ int blockstore_init_meta::loop()
{
if (wait_state == 1)
goto resume_1;
else if (wait_state == 2)
goto resume_2;
else if (wait_state == 3)
goto resume_3;
else if (wait_state == 4)
goto resume_4;
printf("Reading blockstore metadata\n");
if (bs->inmemory_meta)
metadata_buffer = bs->metadata_buffer;
@@ -35,22 +55,98 @@ int blockstore_init_meta::loop()
metadata_buffer = memalign(MEM_ALIGNMENT, 2*bs->metadata_buf_size);
if (!metadata_buffer)
throw std::runtime_error("Failed to allocate metadata read buffer");
// Read superblock
GET_SQE();
data->iov = { metadata_buffer, bs->meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data); };
my_uring_prep_readv(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset);
bs->ringloop->submit();
submitted = 1;
resume_1:
if (submitted)
{
wait_state = 1;
return 1;
}
if (iszero((uint64_t*)metadata_buffer, bs->meta_block_size / sizeof(uint64_t)))
{
{
blockstore_meta_header_t *hdr = (blockstore_meta_header_t *)metadata_buffer;
hdr->zero = 0;
hdr->magic = BLOCKSTORE_META_MAGIC;
hdr->version = BLOCKSTORE_META_VERSION;
hdr->meta_block_size = bs->meta_block_size;
hdr->data_block_size = bs->block_size;
hdr->bitmap_granularity = bs->bitmap_granularity;
}
if (bs->readonly)
{
printf("Skipping metadata initialization because blockstore is readonly\n");
}
else
{
printf("Initializing metadata area\n");
GET_SQE();
data->iov = (struct iovec){ metadata_buffer, bs->meta_block_size };
data->callback = [this](ring_data_t *data) { handle_event(data); };
my_uring_prep_writev(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset);
bs->ringloop->submit();
submitted = 1;
resume_3:
if (submitted > 0)
{
wait_state = 3;
return 1;
}
zero_on_init = true;
}
}
else
{
blockstore_meta_header_t *hdr = (blockstore_meta_header_t *)metadata_buffer;
if (hdr->zero != 0 ||
hdr->magic != BLOCKSTORE_META_MAGIC ||
hdr->version != BLOCKSTORE_META_VERSION)
{
printf(
"Metadata is corrupt or old version.\n"
" If this is a new OSD please zero out the metadata area before starting it.\n"
" If you need to upgrade from 0.5.x please request it via the issue tracker.\n"
);
exit(1);
}
if (hdr->meta_block_size != bs->meta_block_size ||
hdr->data_block_size != bs->block_size ||
hdr->bitmap_granularity != bs->bitmap_granularity)
{
printf(
"Configuration stored in metadata superblock"
" (meta_block_size=%u, data_block_size=%u, bitmap_granularity=%u)"
" differs from OSD configuration (%lu/%u/%lu).\n",
hdr->meta_block_size, hdr->data_block_size, hdr->bitmap_granularity,
bs->meta_block_size, bs->block_size, bs->bitmap_granularity
);
exit(1);
}
}
// Skip superblock
bs->meta_offset += bs->meta_block_size;
prev_done = 0;
done_len = 0;
done_pos = 0;
metadata_read = 0;
// Read the rest of the metadata
while (1)
{
resume_1:
resume_2:
if (submitted)
{
wait_state = 1;
wait_state = 2;
return 1;
}
if (metadata_read < bs->meta_len)
{
sqe = bs->get_sqe();
if (!sqe)
{
throw std::runtime_error("io_uring is full while trying to read metadata");
}
data = ((ring_data_t*)sqe->user_data);
GET_SQE();
data->iov = {
metadata_buffer + (bs->inmemory_meta
? metadata_read
@@ -58,7 +154,14 @@ int blockstore_init_meta::loop()
bs->meta_len - metadata_read > bs->metadata_buf_size ? bs->metadata_buf_size : bs->meta_len - metadata_read,
};
data->callback = [this](ring_data_t *data) { handle_event(data); };
my_uring_prep_readv(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset + metadata_read);
if (!zero_on_init)
my_uring_prep_readv(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset + metadata_read);
else
{
// Fill metadata with zeroes
memset(data->iov.iov_base, 0, data->iov.iov_len);
my_uring_prep_writev(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset + metadata_read);
}
bs->ringloop->submit();
submitted = (prev == 1 ? 2 : 1);
prev = submitted;
@@ -90,6 +193,21 @@ int blockstore_init_meta::loop()
free(metadata_buffer);
metadata_buffer = NULL;
}
if (zero_on_init && !bs->disable_meta_fsync)
{
GET_SQE();
my_uring_prep_fsync(sqe, bs->meta_fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
data->callback = [this](ring_data_t *data) { handle_event(data); };
submitted = 1;
bs->ringloop->submit();
resume_4:
if (submitted > 0)
{
wait_state = 4;
return 1;
}
}
return 0;
}
@@ -100,7 +218,7 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
clean_disk_entry *entry = (clean_disk_entry*)(entries + i*bs->clean_entry_size);
if (!bs->inmemory_meta && bs->clean_entry_bitmap_size)
{
memcpy(bs->clean_bitmap + (done_cnt+i)*bs->clean_entry_bitmap_size, &entry->bitmap, bs->clean_entry_bitmap_size);
memcpy(bs->clean_bitmap + (done_cnt+i)*2*bs->clean_entry_bitmap_size, &entry->bitmap, 2*bs->clean_entry_bitmap_size);
}
if (entry->oid.inode > 0)
{
@@ -111,10 +229,17 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
{
// free the previous block
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu (new location is %lu)\n", clean_it->second.location >> block_order, done_cnt+i);
printf("Free block %lu from %lx:%lx v%lu (new location is %lu)\n",
clean_it->second.location >> block_order,
clean_it->first.inode, clean_it->first.stripe, clean_it->second.version,
done_cnt+i);
#endif
bs->data_alloc->set(clean_it->second.location >> block_order, false);
}
else
{
bs->inode_space_stats[entry->oid.inode] += bs->block_size;
}
entries_loaded++;
#ifdef BLOCKSTORE_DEBUG
printf("Allocate block (clean entry) %lu: %lx:%lx v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
@@ -149,14 +274,6 @@ blockstore_init_journal::blockstore_init_journal(blockstore_impl_t *bs)
};
}
bool iszero(uint64_t *buf, int len)
{
for (int i = 0; i < len; i++)
if (buf[i] != 0)
return false;
return true;
}
void blockstore_init_journal::handle_event(ring_data_t *data1)
{
if (data1->res <= 0)
@@ -181,12 +298,6 @@ void blockstore_init_journal::handle_event(ring_data_t *data1)
submitted_buf = NULL;
}
#define GET_SQE() \
sqe = bs->get_sqe();\
if (!sqe)\
throw std::runtime_error("io_uring is full while trying to read journal");\
data = ((ring_data_t*)sqe->user_data)
int blockstore_init_journal::loop()
{
if (wait_state == 1)
@@ -224,7 +335,7 @@ resume_1:
wait_state = 1;
return 1;
}
if (iszero((uint64_t*)submitted_buf, 3))
if (iszero((uint64_t*)submitted_buf, bs->journal.block_size / sizeof(uint64_t)))
{
// Journal is empty
// FIXME handle this wrapping to journal_block_size better (maybe)
@@ -239,6 +350,7 @@ resume_1:
.size = sizeof(journal_entry_start),
.reserved = 0,
.journal_start = bs->journal.block_size,
.version = JOURNAL_VERSION,
};
((journal_entry_start*)submitted_buf)->crc32 = je_crc32((journal_entry*)submitted_buf);
if (bs->readonly)
@@ -289,11 +401,21 @@ resume_1:
je_start = (journal_entry_start*)submitted_buf;
if (je_start->magic != JOURNAL_MAGIC ||
je_start->type != JE_START ||
je_start->size != sizeof(journal_entry_start) ||
je_crc32((journal_entry*)je_start) != je_start->crc32)
je_crc32((journal_entry*)je_start) != je_start->crc32 ||
je_start->size != sizeof(journal_entry_start) && je_start->size != JE_START_LEGACY_SIZE)
{
// Entry is corrupt
throw std::runtime_error("first entry of the journal is corrupt");
fprintf(stderr, "First entry of the journal is corrupt\n");
exit(1);
}
if (je_start->size == JE_START_LEGACY_SIZE || je_start->version != JOURNAL_VERSION)
{
fprintf(
stderr, "The code only supports journal version %d, but it is %lu on disk."
" Please use the previous version to flush the journal before upgrading OSD\n",
JOURNAL_VERSION, je_start->size == JE_START_LEGACY_SIZE ? 0 : je_start->version
);
exit(1);
}
next_free = journal_pos = bs->journal.used_start = je_start->journal_start;
if (!bs->journal.inmemory)
@@ -399,6 +521,18 @@ resume_1:
}
}
}
for (auto ov: double_allocs)
{
auto dirty_it = bs->dirty_db.find(ov);
if (dirty_it != bs->dirty_db.end() &&
IS_BIG_WRITE(dirty_it->second.state) &&
dirty_it->second.location == UINT64_MAX)
{
printf("Fatal error (bug): %lx:%lx v%lu big_write journal_entry was allocated over another object\n",
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
exit(1);
}
}
bs->flusher->mark_trim_possible();
bs->journal.dirty_start = bs->journal.next_free;
printf(
@@ -530,6 +664,21 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.oid = je->small_write.oid,
.version = je->small_write.version,
};
void *bmp = NULL;
void *bmp_from = (void*)je + sizeof(journal_entry_small_write);
if (bs->clean_entry_bitmap_size <= sizeof(void*))
{
memcpy(&bmp, bmp_from, bs->clean_entry_bitmap_size);
}
else
{
// FIXME Using large blockstore objects will result in a lot of small
// allocations for entry bitmaps. This can only be fixed by using
// a patched map with dynamic entry size, but not the btree_map,
// because it doesn't keep iterators valid all the time.
bmp = malloc_or_die(bs->clean_entry_bitmap_size);
memcpy(bmp, bmp_from, bs->clean_entry_bitmap_size);
}
bs->dirty_db.emplace(ov, (dirty_entry){
.state = (BS_ST_SMALL_WRITE | BS_ST_SYNCED),
.flags = 0,
@@ -537,6 +686,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.offset = je->small_write.offset,
.len = je->small_write.len,
.journal_sector = proc_pos,
.bitmap = bmp,
});
bs->journal.used_sectors[proc_pos]++;
#ifdef BLOCKSTORE_DEBUG
@@ -549,7 +699,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
unstab = unstab < ov.version ? ov.version : unstab;
if (je->type == JE_SMALL_WRITE_INSTANT)
{
bs->mark_stable(ov);
bs->mark_stable(ov, true);
}
}
}
@@ -579,32 +729,10 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
// its data and metadata are already flushed.
// We don't know if newer versions are flushed, but
// the previous delete definitely is.
// So we flush previous dirty entries, but retain the clean one.
// So we forget previous dirty entries, but retain the clean one.
// This feature is required for writes happening shortly
// after deletes.
auto dirty_end = dirty_it;
dirty_end++;
while (1)
{
if (dirty_it == bs->dirty_db.begin())
{
break;
}
dirty_it--;
if (dirty_it->first.oid != je->big_write.oid)
{
dirty_it++;
break;
}
}
auto clean_it = bs->clean_db.find(je->big_write.oid);
bs->erase_dirty(
dirty_it, dirty_end,
clean_it != bs->clean_db.end() ? clean_it->second.location : UINT64_MAX
);
// Remove it from the flusher's queue, too
// Otherwise it may end up referring to a small unstable write after reading the rest of the journal
bs->flusher->remove_flush(je->big_write.oid);
erase_dirty_object(dirty_it);
}
}
auto clean_it = bs->clean_db.find(je->big_write.oid);
@@ -616,18 +744,49 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.oid = je->big_write.oid,
.version = je->big_write.version,
};
bs->dirty_db.emplace(ov, (dirty_entry){
void *bmp = NULL;
void *bmp_from = (void*)je + sizeof(journal_entry_big_write);
if (bs->clean_entry_bitmap_size <= sizeof(void*))
{
memcpy(&bmp, bmp_from, bs->clean_entry_bitmap_size);
}
else
{
// FIXME Using large blockstore objects will result in a lot of small
// allocations for entry bitmaps. This can only be fixed by using
// a patched map with dynamic entry size, but not the btree_map,
// because it doesn't keep iterators valid all the time.
bmp = malloc_or_die(bs->clean_entry_bitmap_size);
memcpy(bmp, bmp_from, bs->clean_entry_bitmap_size);
}
auto dirty_it = bs->dirty_db.emplace(ov, (dirty_entry){
.state = (BS_ST_BIG_WRITE | BS_ST_SYNCED),
.flags = 0,
.location = je->big_write.location,
.offset = je->big_write.offset,
.len = je->big_write.len,
.journal_sector = proc_pos,
});
.bitmap = bmp,
}).first;
if (bs->data_alloc->get(je->big_write.location >> bs->block_order))
{
// This is probably a big_write that's already flushed and freed, but it may
// also indicate a bug. So we remember such entries and recheck them afterwards.
// If it's not a bug they won't be present after reading the whole journal.
dirty_it->second.location = UINT64_MAX;
double_allocs.push_back(ov);
}
else
{
#ifdef BLOCKSTORE_DEBUG
printf("Allocate block %lu\n", je->big_write.location >> bs->block_order);
printf(
"Allocate block (journal) %lu: %lx:%lx v%lu\n",
je->big_write.location >> bs->block_order,
ov.oid.inode, ov.oid.stripe, ov.version
);
#endif
bs->data_alloc->set(je->big_write.location >> bs->block_order, true);
bs->data_alloc->set(je->big_write.location >> bs->block_order, true);
}
bs->journal.used_sectors[proc_pos]++;
#ifdef BLOCKSTORE_DEBUG
printf(
@@ -639,7 +798,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
unstab = unstab < ov.version ? ov.version : unstab;
if (je->type == JE_BIG_WRITE_INSTANT)
{
bs->mark_stable(ov);
bs->mark_stable(ov, true);
}
}
}
@@ -653,7 +812,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
.oid = je->stable.oid,
.version = je->stable.version,
};
bs->mark_stable(ov);
bs->mark_stable(ov, true);
}
else if (je->type == JE_ROLLBACK)
{
@@ -672,9 +831,26 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
#ifdef BLOCKSTORE_DEBUG
printf("je_delete oid=%lx:%lx ver=%lu\n", je->del.oid.inode, je->del.oid.stripe, je->del.version);
#endif
bool dirty_exists = false;
auto dirty_it = bs->dirty_db.upper_bound((obj_ver_id){
.oid = je->del.oid,
.version = UINT64_MAX,
});
if (dirty_it != bs->dirty_db.begin())
{
dirty_it--;
dirty_exists = dirty_it->first.oid == je->del.oid;
}
auto clean_it = bs->clean_db.find(je->del.oid);
if (clean_it == bs->clean_db.end() ||
clean_it->second.version < je->del.version)
bool clean_exists = (clean_it != bs->clean_db.end() &&
clean_it->second.version < je->del.version);
if (!clean_exists && dirty_exists)
{
// Clean entry doesn't exist. This means that the delete is already flushed.
// So we must not flush this object anymore.
erase_dirty_object(dirty_it);
}
else if (clean_exists || dirty_exists)
{
// oid, version
obj_ver_id ov = {
@@ -692,8 +868,9 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
bs->journal.used_sectors[proc_pos]++;
// Deletions are treated as immediately stable, because
// "2-phase commit" (write->stabilize) isn't sufficient for them anyway
bs->mark_stable(ov);
bs->mark_stable(ov, true);
}
// Ignore delete if neither preceding dirty entries nor the clean one are present
}
started = true;
pos += je->size;
@@ -704,3 +881,35 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
bs->journal.next_free = next_free;
return 1;
}
void blockstore_init_journal::erase_dirty_object(blockstore_dirty_db_t::iterator dirty_it)
{
auto oid = dirty_it->first.oid;
bool exists = !IS_DELETE(dirty_it->second.state);
auto dirty_end = dirty_it;
dirty_end++;
while (1)
{
if (dirty_it == bs->dirty_db.begin())
{
break;
}
dirty_it--;
if (dirty_it->first.oid != oid)
{
dirty_it++;
break;
}
}
auto clean_it = bs->clean_db.find(oid);
uint64_t clean_loc = clean_it != bs->clean_db.end()
? clean_it->second.location : UINT64_MAX;
if (exists && clean_loc == UINT64_MAX)
{
bs->inode_space_stats[oid.inode] -= bs->block_size;
}
bs->erase_dirty(dirty_it, dirty_end, clean_loc);
// Remove it from the flusher's queue, too
// Otherwise it may end up referring to a small unstable write after reading the rest of the journal
bs->flusher->remove_flush(oid);
}


@@ -7,6 +7,7 @@ class blockstore_init_meta
{
blockstore_impl_t *bs;
int wait_state = 0, wait_count = 0;
bool zero_on_init = false;
void *metadata_buffer = NULL;
uint64_t metadata_read = 0;
int prev = 0, prev_done = 0, done_len = 0, submitted = 0;
@@ -36,6 +37,7 @@ class blockstore_init_journal
bool started = false;
uint64_t next_free;
std::vector<bs_init_journal_done> done;
std::vector<obj_ver_id> double_allocs;
uint64_t journal_pos = 0;
uint64_t continue_pos = 0;
void *init_write_buf = NULL;
@@ -48,6 +50,7 @@ class blockstore_init_journal
std::function<void(ring_data_t*)> simple_callback;
int handle_journal_part(void *buf, uint64_t done_pos, uint64_t len);
void handle_event(ring_data_t *data);
void erase_dirty_object(blockstore_dirty_db_t::iterator dirty_it);
public:
blockstore_init_journal(blockstore_impl_t* bs);
int loop();


@@ -218,3 +218,19 @@ uint64_t journal_t::get_trim_pos()
// Can't trim journal
return used_start;
}
void journal_t::dump_diagnostics()
{
auto journal_used_it = used_sectors.lower_bound(used_start);
if (journal_used_it == used_sectors.end())
{
// Journal is cleared to its end, restart from the beginning
journal_used_it = used_sectors.begin();
}
printf(
"Journal: used_start=%08lx next_free=%08lx dirty_start=%08lx trim_to=%08lx trim_to_refs=%ld\n",
used_start, next_free, dirty_start,
journal_used_it == used_sectors.end() ? 0 : journal_used_it->first,
journal_used_it == used_sectors.end() ? 0 : journal_used_it->second
);
}


@@ -7,6 +7,7 @@
#define MIN_JOURNAL_SIZE 4*1024*1024
#define JOURNAL_MAGIC 0x4A33
#define JOURNAL_VERSION 1
#define JOURNAL_BUFFER_SIZE 4*1024*1024
// We reserve some extra space for future stabilize requests during writes
@@ -37,7 +38,9 @@ struct __attribute__((__packed__)) journal_entry_start
uint32_t size;
uint32_t reserved;
uint64_t journal_start;
uint64_t version;
};
#define JE_START_LEGACY_SIZE 24
struct __attribute__((__packed__)) journal_entry_small_write
{
@@ -54,6 +57,9 @@ struct __attribute__((__packed__)) journal_entry_small_write
// data_offset is its offset within journal
uint64_t data_offset;
uint32_t crc32_data;
// small_write and big_write entries are followed by the "external" bitmap
// its size is dynamic and included in journal entry's <size> field
uint8_t bitmap[];
};
struct __attribute__((__packed__)) journal_entry_big_write
@@ -68,6 +74,9 @@ struct __attribute__((__packed__)) journal_entry_big_write
uint32_t offset;
uint32_t len;
uint64_t location;
// small_write and big_write entries are followed by the "external" bitmap
// its size is dynamic and included in journal entry's <size> field
uint8_t bitmap[];
};
struct __attribute__((__packed__)) journal_entry_stable
@@ -143,6 +152,7 @@ struct journal_t
int fd;
uint64_t device_size;
bool inmemory = false;
bool flush_journal = false;
void *buffer = NULL;
uint64_t block_size;
@@ -170,6 +180,7 @@ struct journal_t
~journal_t();
bool trim();
uint64_t get_trim_pos();
void dump_diagnostics();
inline bool entry_fits(int size)
{
return !(block_size - in_sector_pos < size ||
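A hedged helper sketch showing how code reading the journal can reach the external bitmap that now trails small_write/big_write entries; the include path is assumed, and the fixed-size part of the entry is skipped the same way blockstore_init does with sizeof():
#include <stdint.h>
#include "blockstore_journal.h"   // assumed include path

static inline uint8_t* big_write_external_bitmap(journal_entry_big_write *je)
{
    // The flexible "bitmap[]" member starts right after the fixed fields;
    // its length (clean_entry_bitmap_size) is included in the entry's size field.
    return je->bitmap;
}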


@@ -42,6 +42,11 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{
disable_flock = true;
}
if (config["flush_journal"] == "true" || config["flush_journal"] == "1" || config["flush_journal"] == "yes")
{
// Only flush journal and exit
journal.flush_journal = true;
}
if (config["immediate_commit"] == "all")
{
immediate_commit = IMMEDIATE_ALL;
@@ -69,8 +74,16 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
journal_block_size = strtoull(config["journal_block_size"].c_str(), NULL, 10);
meta_block_size = strtoull(config["meta_block_size"].c_str(), NULL, 10);
bitmap_granularity = strtoull(config["bitmap_granularity"].c_str(), NULL, 10);
flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10);
max_flusher_count = strtoull(config["max_flusher_count"].c_str(), NULL, 10);
if (!max_flusher_count)
max_flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10);
min_flusher_count = strtoull(config["min_flusher_count"].c_str(), NULL, 10);
max_write_iodepth = strtoull(config["max_write_iodepth"].c_str(), NULL, 10);
throttle_small_writes = config["throttle_small_writes"] == "true" || config["throttle_small_writes"] == "1" || config["throttle_small_writes"] == "yes";
throttle_target_iops = strtoull(config["throttle_target_iops"].c_str(), NULL, 10);
throttle_target_mbs = strtoull(config["throttle_target_mbs"].c_str(), NULL, 10);
throttle_target_parallelism = strtoull(config["throttle_target_parallelism"].c_str(), NULL, 10);
throttle_threshold_us = strtoull(config["throttle_threshold_us"].c_str(), NULL, 10);
// Validate
if (!block_size)
{
@@ -80,9 +93,13 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{
throw std::runtime_error("Bad block size");
}
if (!flusher_count)
if (!max_flusher_count)
{
flusher_count = 32;
max_flusher_count = 256;
}
if (!min_flusher_count || journal.flush_journal)
{
min_flusher_count = 1;
}
if (!max_write_iodepth)
{
@@ -94,7 +111,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
}
else if (disk_alignment % MEM_ALIGNMENT)
{
throw std::runtime_error("disk_alingment must be a multiple of "+std::to_string(MEM_ALIGNMENT));
throw std::runtime_error("disk_alignment must be a multiple of "+std::to_string(MEM_ALIGNMENT));
}
if (!journal_block_size)
{
@@ -118,7 +135,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
}
if (!bitmap_granularity)
{
bitmap_granularity = 4096;
bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
}
else if (bitmap_granularity % disk_alignment)
{
@@ -168,9 +185,25 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
{
throw std::runtime_error("immediate_commit=all requires disable_journal_fsync and disable_data_fsync");
}
if (!throttle_target_iops)
{
throttle_target_iops = 100;
}
if (!throttle_target_mbs)
{
throttle_target_mbs = 100;
}
if (!throttle_target_parallelism)
{
throttle_target_parallelism = 1;
}
if (!throttle_threshold_us)
{
throttle_threshold_us = 50;
}
// init some fields
clean_entry_bitmap_size = block_size / bitmap_granularity / 8;
clean_entry_size = sizeof(clean_disk_entry) + clean_entry_bitmap_size;
clean_entry_size = sizeof(clean_disk_entry) + 2*clean_entry_bitmap_size;
journal.block_size = journal_block_size;
journal.next_free = journal_block_size;
journal.used_start = journal_block_size;
@@ -224,7 +257,7 @@ void blockstore_impl_t::calc_lengths()
}
// required metadata size
block_count = data_len / block_size;
meta_len = ((block_count - 1 + meta_block_size / clean_entry_size) / (meta_block_size / clean_entry_size)) * meta_block_size;
meta_len = (1 + (block_count - 1 + meta_block_size / clean_entry_size) / (meta_block_size / clean_entry_size)) * meta_block_size;
if (meta_area < meta_len)
{
throw std::runtime_error("Metadata area is too small, need at least "+std::to_string(meta_len)+" bytes");
@@ -237,7 +270,7 @@ void blockstore_impl_t::calc_lengths()
}
else if (clean_entry_bitmap_size)
{
clean_bitmap = (uint8_t*)malloc(block_count * clean_entry_bitmap_size);
clean_bitmap = (uint8_t*)malloc(block_count * 2*clean_entry_bitmap_size);
if (!clean_bitmap)
throw std::runtime_error("Failed to allocate memory for the metadata sparse write bitmap");
}
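All of the options handled above are plain string values in the blockstore config; a minimal sketch of a configuration that just spells out the fallbacks parse_config() assigns when a value is absent (the values below only restate what is visible in the code):

    blockstore_config_t config;
    config["flush_journal"] = "false";          // "true"/"1"/"yes" = only flush the journal and exit
    config["min_flusher_count"] = "1";          // falls back to 1 (also forced to 1 in flush_journal mode)
    config["max_flusher_count"] = "256";        // falls back to 256; legacy "flusher_count" is still accepted
    config["throttle_small_writes"] = "true";   // enables throttling of journaled (small) writes
    config["throttle_target_iops"] = "100";     // 100 IOPS fallback
    config["throttle_target_mbs"] = "100";      // 100 MB/s fallback
    config["throttle_target_parallelism"] = "1";
    config["throttle_threshold_us"] = "50";     // minimum delay before throttling kicks in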

View File

@@ -94,6 +94,21 @@ endwhile:
return 1;
}
uint8_t* blockstore_impl_t::get_clean_entry_bitmap(uint64_t block_loc, int offset)
{
uint8_t *clean_entry_bitmap;
uint64_t meta_loc = block_loc >> block_order;
if (inmemory_meta)
{
uint64_t sector = (meta_loc / (meta_block_size / clean_entry_size)) * meta_block_size;
uint64_t pos = (meta_loc % (meta_block_size / clean_entry_size));
clean_entry_bitmap = (uint8_t*)(metadata_buffer + sector + pos*clean_entry_size + sizeof(clean_disk_entry) + offset);
}
else
clean_entry_bitmap = (uint8_t*)(clean_bitmap + meta_loc*2*clean_entry_bitmap_size + offset);
return clean_entry_bitmap;
}
int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
{
auto clean_it = clean_db.find(read_op->oid);
@@ -112,7 +127,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
read_op->version = 0;
read_op->retval = read_op->len;
FINISH_OP(read_op);
return 1;
return 2;
}
uint64_t fulfilled = 0;
PRIV(read_op)->pending_ops = 0;
@@ -134,6 +149,11 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
if (!result_version)
{
result_version = dirty_it->first.version;
if (read_op->bitmap)
{
void *bmp_ptr = (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap);
memcpy(read_op->bitmap, bmp_ptr, clean_entry_bitmap_size);
}
}
if (!fulfill_read(read_op, fulfilled, dirty.offset, dirty.offset + dirty.len,
dirty.state, dirty_it->first.version, dirty.location + (IS_JOURNAL(dirty.state) ? 0 : dirty.offset)))
@@ -155,6 +175,11 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
if (!result_version)
{
result_version = clean_it->second.version;
if (read_op->bitmap)
{
void *bmp_ptr = get_clean_entry_bitmap(clean_it->second.location, clean_entry_bitmap_size);
memcpy(read_op->bitmap, bmp_ptr, clean_entry_bitmap_size);
}
}
if (fulfilled < read_op->len)
{
@@ -169,18 +194,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
}
else
{
uint64_t meta_loc = clean_it->second.location >> block_order;
uint8_t *clean_entry_bitmap;
if (inmemory_meta)
{
uint64_t sector = (meta_loc / (meta_block_size / clean_entry_size)) * meta_block_size;
uint64_t pos = (meta_loc % (meta_block_size / clean_entry_size));
clean_entry_bitmap = (uint8_t*)(metadata_buffer + sector + pos*clean_entry_size + sizeof(clean_disk_entry));
}
else
{
clean_entry_bitmap = (uint8_t*)(clean_bitmap + meta_loc*clean_entry_bitmap_size);
}
uint8_t *clean_entry_bitmap = get_clean_entry_bitmap(clean_it->second.location, 0);
uint64_t bmp_start = 0, bmp_end = 0, bmp_size = block_size/bitmap_granularity;
while (bmp_start < bmp_size)
{
@@ -191,8 +205,8 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
if (bmp_end > bmp_start)
{
// fill with zeroes
fulfill_read(read_op, fulfilled, bmp_start * bitmap_granularity,
bmp_end * bitmap_granularity, (BS_ST_DELETE | BS_ST_STABLE), 0, 0);
assert(fulfill_read(read_op, fulfilled, bmp_start * bitmap_granularity,
bmp_end * bitmap_granularity, (BS_ST_DELETE | BS_ST_STABLE), 0, 0));
}
bmp_start = bmp_end;
while (clean_entry_bitmap[bmp_end >> 3] & (1 << (bmp_end & 0x7)) && bmp_end < bmp_size)
@@ -218,7 +232,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
else if (fulfilled < read_op->len)
{
// fill remaining parts with zeroes
fulfill_read(read_op, fulfilled, 0, block_size, (BS_ST_DELETE | BS_ST_STABLE), 0, 0);
assert(fulfill_read(read_op, fulfilled, 0, block_size, (BS_ST_DELETE | BS_ST_STABLE), 0, 0));
}
assert(fulfilled == read_op->len);
read_op->version = result_version;
@@ -232,10 +246,10 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
}
read_op->retval = read_op->len;
FINISH_OP(read_op);
return 1;
return 2;
}
read_op->retval = 0;
return 1;
return 2;
}
void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op)
@@ -254,3 +268,50 @@ void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op
FINISH_OP(op);
}
}
int blockstore_impl_t::read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version)
{
auto dirty_it = dirty_db.upper_bound((obj_ver_id){
.oid = oid,
.version = UINT64_MAX,
});
if (dirty_it != dirty_db.begin())
dirty_it--;
if (dirty_it != dirty_db.end())
{
while (dirty_it->first.oid == oid)
{
if (target_version >= dirty_it->first.version)
{
if (result_version)
*result_version = dirty_it->first.version;
if (bitmap)
{
void *bmp_ptr = (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap);
memcpy(bitmap, bmp_ptr, clean_entry_bitmap_size);
}
return 0;
}
if (dirty_it == dirty_db.begin())
break;
dirty_it--;
}
}
auto clean_it = clean_db.find(oid);
if (clean_it != clean_db.end())
{
if (result_version)
*result_version = clean_it->second.version;
if (bitmap)
{
void *bmp_ptr = get_clean_entry_bitmap(clean_it->second.location, clean_entry_bitmap_size);
memcpy(bitmap, bmp_ptr, clean_entry_bitmap_size);
}
return 0;
}
if (result_version)
*result_version = 0;
if (bitmap)
memset(bitmap, 0, clean_entry_bitmap_size);
return -ENOENT;
}
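Note that read_bitmap() asks get_clean_entry_bitmap() for the data at offset clean_entry_bitmap_size, i.e. clean entries now apparently hold two bitmaps: the internal allocation bitmap at offset 0 and the "external" one right after it. A hypothetical caller sketch (the bs pointer and the way the method is exposed are assumptions, not shown here):

    uint8_t *bmp = (uint8_t*)malloc(clean_entry_bitmap_size);
    uint64_t version = 0;
    int r = bs->read_bitmap(oid, UINT64_MAX, bmp, &version); // UINT64_MAX = newest version
    if (r == -ENOENT)
    {
        // object does not exist: *bmp is zero-filled and version == 0
    }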

View File

@@ -50,7 +50,7 @@ skip_ov:
{
op->retval = -EBUSY;
FINISH_OP(op);
return 1;
return 2;
}
if (dirty_it == dirty_db.begin())
{
@@ -66,7 +66,7 @@ skip_ov:
// Already rolled back
op->retval = 0;
FINISH_OP(op);
return 1;
return 2;
}
// Check journal space
blockstore_journal_check_t space_check(this);
@@ -126,11 +126,8 @@ resume_2:
resume_3:
if (!disable_journal_fsync)
{
io_uring_sqe *sqe = get_sqe();
if (!sqe)
{
return 0;
}
io_uring_sqe *sqe;
BS_SUBMIT_GET_SQE_DECL(sqe);
ring_data_t *data = ((ring_data_t*)sqe->user_data);
my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
@@ -151,7 +148,7 @@ resume_5:
// Acknowledge op
op->retval = 0;
FINISH_OP(op);
return 1;
return 2;
}
void blockstore_impl_t::mark_rolled_back(const obj_ver_id & ov)
@@ -166,10 +163,7 @@ void blockstore_impl_t::mark_rolled_back(const obj_ver_id & ov)
auto rm_start = it;
auto rm_end = it;
it--;
while (it->first.oid == ov.oid &&
it->first.version > ov.version &&
!IS_IN_FLIGHT(it->second.state) &&
!IS_STABLE(it->second.state))
while (1)
{
if (it->first.oid != ov.oid)
break;
@@ -179,7 +173,7 @@ void blockstore_impl_t::mark_rolled_back(const obj_ver_id & ov)
max_unstable = it->first.version;
break;
}
else if (IS_STABLE(it->second.state))
else if (IS_IN_FLIGHT(it->second.state) || IS_STABLE(it->second.state))
break;
// Remove entry
rm_start = it;
@@ -190,14 +184,14 @@ void blockstore_impl_t::mark_rolled_back(const obj_ver_id & ov)
if (rm_start != rm_end)
{
erase_dirty(rm_start, rm_end, UINT64_MAX);
}
auto unstab_it = unstable_writes.find(ov.oid);
if (unstab_it != unstable_writes.end())
{
if (max_unstable == 0)
unstable_writes.erase(unstab_it);
else
unstab_it->second = max_unstable;
auto unstab_it = unstable_writes.find(ov.oid);
if (unstab_it != unstable_writes.end())
{
if (max_unstable == 0)
unstable_writes.erase(unstab_it);
else
unstab_it->second = max_unstable;
}
}
}
}
@@ -216,10 +210,7 @@ void blockstore_impl_t::handle_rollback_event(ring_data_t *data, blockstore_op_t
if (PRIV(op)->pending_ops == 0)
{
PRIV(op)->op_state++;
if (!continue_rollback(op))
{
submit_queue.push_front(op);
}
ringloop->wakeup();
}
}
@@ -257,10 +248,12 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
}
while (1)
{
if (IS_BIG_WRITE(dirty_it->second.state) && dirty_it->second.location != clean_loc)
if (IS_BIG_WRITE(dirty_it->second.state) && dirty_it->second.location != clean_loc &&
dirty_it->second.location != UINT64_MAX)
{
#ifdef BLOCKSTORE_DEBUG
printf("Free block %lu\n", dirty_it->second.location >> block_order);
printf("Free block %lu from %lx:%lx v%lu\n", dirty_it->second.location >> block_order,
dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
#endif
data_alloc->set(dirty_it->second.location >> block_order, false);
}
@@ -275,6 +268,11 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
{
journal.used_sectors.erase(dirty_it->second.journal_sector);
}
if (clean_entry_bitmap_size > sizeof(void*))
{
free(dirty_it->second.bitmap);
dirty_it->second.bitmap = NULL;
}
if (dirty_it == dirty_start)
{
break;

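The switch from "return 1" to "return 2" here and in the neighbouring read/stabilize/sync/write paths suggests a three-valued result for the dequeue/continue functions; a rough summary inferred from the patch, not stated in it:

    // 0 - could not make progress (no SQE / journal space yet), keep the op queued
    // 1 - op submitted, completion will arrive asynchronously
    // 2 - op already finished (FINISH_OP was called), drop it from the queue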
View File

@@ -60,7 +60,7 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
// No such object version
op->retval = -ENOENT;
FINISH_OP(op);
return 1;
return 2;
}
else
{
@@ -77,7 +77,7 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
// Object not synced yet. Caller must sync it first
op->retval = -EBUSY;
FINISH_OP(op);
return 1;
return 2;
}
else if (!IS_STABLE(dirty_it->second.state))
{
@@ -89,7 +89,7 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
// Already stable
op->retval = 0;
FINISH_OP(op);
return 1;
return 2;
}
// Check journal space
blockstore_journal_check_t space_check(this);
@@ -150,11 +150,8 @@ resume_2:
resume_3:
if (!disable_journal_fsync)
{
io_uring_sqe *sqe = get_sqe();
if (!sqe)
{
return 0;
}
io_uring_sqe *sqe;
BS_SUBMIT_GET_SQE_DECL(sqe);
ring_data_t *data = ((ring_data_t*)sqe->user_data);
my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
@@ -171,30 +168,77 @@ resume_5:
for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
{
// Mark all dirty_db entries up to op->version as stable
#ifdef BLOCKSTORE_DEBUG
printf("Stabilize %lx:%lx v%lu\n", v->oid.inode, v->oid.stripe, v->version);
#endif
mark_stable(*v);
}
// Acknowledge op
op->retval = 0;
FINISH_OP(op);
return 1;
return 2;
}
void blockstore_impl_t::mark_stable(const obj_ver_id & v)
void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
{
auto dirty_it = dirty_db.find(v);
if (dirty_it != dirty_db.end())
{
while (1)
{
bool was_stable = IS_STABLE(dirty_it->second.state);
if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_SYNCED)
{
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_STABLE;
// Allocations and deletions are counted when they're stabilized
if (IS_BIG_WRITE(dirty_it->second.state))
{
int exists = -1;
if (dirty_it != dirty_db.begin())
{
auto prev_it = dirty_it;
prev_it--;
if (prev_it->first.oid == v.oid)
{
exists = IS_DELETE(prev_it->second.state) ? 0 : 1;
}
}
if (exists == -1)
{
auto clean_it = clean_db.find(v.oid);
exists = clean_it != clean_db.end() ? 1 : 0;
}
if (!exists)
{
inode_space_stats[dirty_it->first.oid.inode] += block_size;
}
}
else if (IS_DELETE(dirty_it->second.state))
{
inode_space_stats[dirty_it->first.oid.inode] -= block_size;
}
}
else if (IS_STABLE(dirty_it->second.state))
if (forget_dirty && (IS_BIG_WRITE(dirty_it->second.state) ||
IS_DELETE(dirty_it->second.state)))
{
// Big write overrides all previous dirty entries
auto erase_end = dirty_it;
while (dirty_it != dirty_db.begin())
{
dirty_it--;
if (dirty_it->first.oid != v.oid)
{
dirty_it++;
break;
}
}
auto clean_it = clean_db.find(v.oid);
uint64_t clean_loc = clean_it != clean_db.end()
? clean_it->second.location : UINT64_MAX;
erase_dirty(dirty_it, erase_end, clean_loc);
break;
}
if (dirty_it == dirty_db.begin())
if (was_stable || dirty_it == dirty_db.begin())
{
break;
}
@@ -228,9 +272,6 @@ void blockstore_impl_t::handle_stable_event(ring_data_t *data, blockstore_op_t *
if (PRIV(op)->pending_ops == 0)
{
PRIV(op)->op_state++;
if (!continue_stable(op))
{
submit_queue.push_front(op);
}
ringloop->wakeup();
}
}
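The inode_space_stats bookkeeping added above changes only at stabilization time: the first stabilized big write to an object with no earlier version (no prior dirty entry and no clean entry) adds block_size to the inode's counter, a stabilized delete subtracts it, and stabilizing further overwrites of an existing object leaves it unchanged. For example, with a 128 KiB block size, creating and stabilizing a new object adds exactly 131072 bytes once, however many later overwrites of it are stabilized.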

View File

@@ -12,11 +12,19 @@
#define SYNC_JOURNAL_SYNC_SENT 7
#define SYNC_DONE 8
int blockstore_impl_t::dequeue_sync(blockstore_op_t *op)
int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_progress_sync)
{
if (immediate_commit == IMMEDIATE_ALL)
{
// We can return immediately because sync is only dequeued after all previous writes
op->retval = 0;
FINISH_OP(op);
return 2;
}
if (PRIV(op)->op_state == 0)
{
stop_sync_submitted = false;
unsynced_big_write_count -= unsynced_big_writes.size();
PRIV(op)->sync_big_writes.swap(unsynced_big_writes);
PRIV(op)->sync_small_writes.swap(unsynced_small_writes);
PRIV(op)->sync_small_checked = 0;
@@ -29,34 +37,15 @@ int blockstore_impl_t::dequeue_sync(blockstore_op_t *op)
PRIV(op)->op_state = SYNC_HAS_SMALL;
else
PRIV(op)->op_state = SYNC_DONE;
// Always add sync to in_progress_syncs because we clear unsynced_big_writes and unsynced_small_writes
PRIV(op)->prev_sync_count = in_progress_syncs.size();
PRIV(op)->in_progress_ptr = in_progress_syncs.insert(in_progress_syncs.end(), op);
}
continue_sync(op);
// Always dequeue because we always add syncs to in_progress_syncs
return 1;
}
int blockstore_impl_t::continue_sync(blockstore_op_t *op)
{
auto cb = [this, op](ring_data_t *data) { handle_sync_event(data, op); };
if (PRIV(op)->op_state == SYNC_HAS_SMALL)
{
// No big writes, just fsync the journal
for (; PRIV(op)->sync_small_checked < PRIV(op)->sync_small_writes.size(); PRIV(op)->sync_small_checked++)
{
if (IS_IN_FLIGHT(dirty_db[PRIV(op)->sync_small_writes[PRIV(op)->sync_small_checked]].state))
{
// Wait for small inflight writes to complete
return 0;
}
}
if (journal.sector_info[journal.cur_sector].dirty)
{
// Write out the last journal sector if it happens to be dirty
BS_SUBMIT_GET_ONLY_SQE(sqe);
prepare_journal_sector_write(journal, journal.cur_sector, sqe, cb);
prepare_journal_sector_write(journal, journal.cur_sector, sqe, [this, op](ring_data_t *data) { handle_sync_event(data, op); });
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = SYNC_JOURNAL_WRITE_SENT;
@@ -69,21 +58,13 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
}
if (PRIV(op)->op_state == SYNC_HAS_BIG)
{
for (; PRIV(op)->sync_big_checked < PRIV(op)->sync_big_writes.size(); PRIV(op)->sync_big_checked++)
{
if (IS_IN_FLIGHT(dirty_db[PRIV(op)->sync_big_writes[PRIV(op)->sync_big_checked]].state))
{
// Wait for big inflight writes to complete
return 0;
}
}
// 1st step: fsync data
if (!disable_data_fsync)
{
BS_SUBMIT_GET_SQE(sqe, data);
my_uring_prep_fsync(sqe, data_fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
data->callback = cb;
data->callback = [this, op](ring_data_t *data) { handle_sync_event(data, op); };
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = SYNC_DATA_SYNC_SENT;
@@ -96,18 +77,11 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
}
if (PRIV(op)->op_state == SYNC_DATA_SYNC_DONE)
{
for (; PRIV(op)->sync_small_checked < PRIV(op)->sync_small_writes.size(); PRIV(op)->sync_small_checked++)
{
if (IS_IN_FLIGHT(dirty_db[PRIV(op)->sync_small_writes[PRIV(op)->sync_small_checked]].state))
{
// Wait for small inflight writes to complete
return 0;
}
}
// 2nd step: Data device is synced, prepare & write journal entries
// Check space in the journal and journal memory buffers
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(), sizeof(journal_entry_big_write), JOURNAL_STABILIZE_RESERVATION))
if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(),
sizeof(journal_entry_big_write) + clean_entry_bitmap_size, JOURNAL_STABILIZE_RESERVATION))
{
return 0;
}
@@ -122,37 +96,40 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
int s = 0, cur_sector = -1;
while (it != PRIV(op)->sync_big_writes.end())
{
if (!journal.entry_fits(sizeof(journal_entry_big_write)) &&
if (!journal.entry_fits(sizeof(journal_entry_big_write) + clean_entry_bitmap_size) &&
journal.sector_info[journal.cur_sector].dirty)
{
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], [this, op](ring_data_t *data) { handle_sync_event(data, op); });
cur_sector = journal.cur_sector;
}
auto & dirty_entry = dirty_db.at(*it);
journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, (dirty_db[*it].state & BS_ST_INSTANT) ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write)
journal, (dirty_entry.state & BS_ST_INSTANT) ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write) + clean_entry_bitmap_size
);
dirty_db[*it].journal_sector = journal.sector_info[journal.cur_sector].offset;
dirty_entry.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
dirty_db[*it].journal_sector, it->oid.inode, it->oid.stripe, it->version,
dirty_entry.journal_sector, it->oid.inode, it->oid.stripe, it->version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
#endif
je->oid = it->oid;
je->version = it->version;
je->offset = dirty_db[*it].offset;
je->len = dirty_db[*it].len;
je->location = dirty_db[*it].location;
je->offset = dirty_entry.offset;
je->len = dirty_entry.len;
je->location = dirty_entry.location;
memcpy((void*)(je+1), (clean_entry_bitmap_size > sizeof(void*)
? dirty_entry.bitmap : &dirty_entry.bitmap), clean_entry_bitmap_size);
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
it++;
}
prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], [this, op](ring_data_t *data) { handle_sync_event(data, op); });
assert(s == space_check.sectors_to_write);
if (cur_sector == -1)
PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
@@ -168,7 +145,8 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
BS_SUBMIT_GET_SQE(sqe, data);
my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
data->iov = { 0 };
data->callback = cb;
data->callback = [this, op](ring_data_t *data) { handle_sync_event(data, op); };
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = SYNC_JOURNAL_SYNC_SENT;
return 1;
@@ -178,9 +156,10 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
PRIV(op)->op_state = SYNC_DONE;
}
}
if (PRIV(op)->op_state == SYNC_DONE)
if (PRIV(op)->op_state == SYNC_DONE && !queue_has_in_progress_sync)
{
return ack_sync(op);
ack_sync(op);
return 2;
}
return 1;
}
@@ -212,42 +191,16 @@ void blockstore_impl_t::handle_sync_event(ring_data_t *data, blockstore_op_t *op
else if (PRIV(op)->op_state == SYNC_JOURNAL_SYNC_SENT)
{
PRIV(op)->op_state = SYNC_DONE;
ack_sync(op);
}
else
{
throw std::runtime_error("BUG: unexpected sync op state");
}
ringloop->wakeup();
}
}
int blockstore_impl_t::ack_sync(blockstore_op_t *op)
{
if (PRIV(op)->op_state == SYNC_DONE && PRIV(op)->prev_sync_count == 0)
{
// Remove dependency of subsequent syncs
auto it = PRIV(op)->in_progress_ptr;
int done_syncs = 1;
++it;
// Acknowledge sync
ack_one_sync(op);
while (it != in_progress_syncs.end())
{
auto & next_sync = *it++;
PRIV(next_sync)->prev_sync_count -= done_syncs;
if (PRIV(next_sync)->prev_sync_count == 0 && PRIV(next_sync)->op_state == SYNC_DONE)
{
done_syncs++;
// Acknowledge next_sync
ack_one_sync(next_sync);
}
}
return 2;
}
return 0;
}
void blockstore_impl_t::ack_one_sync(blockstore_op_t *op)
void blockstore_impl_t::ack_sync(blockstore_op_t *op)
{
// Handle states
for (auto it = PRIV(op)->sync_big_writes.begin(); it != PRIV(op)->sync_big_writes.end(); it++)
@@ -295,7 +248,6 @@ void blockstore_impl_t::ack_one_sync(blockstore_op_t *op)
}
}
}
in_progress_syncs.erase(PRIV(op)->in_progress_ptr);
op->retval = 0;
FINISH_OP(op);
}
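Since big_write journal entries now embed the external bitmap, the space check above reserves sizeof(journal_entry_big_write) + clean_entry_bitmap_size per unsynced big write. The overhead is small: for example, with a 128 KiB block and the default 4 KiB bitmap_granularity, clean_entry_bitmap_size = 131072/4096/8 = 4 bytes per entry.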

View File

@@ -8,7 +8,12 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
// Check or assign version number
bool found = false, deleted = false, is_del = (op->opcode == BS_OP_DELETE);
bool wait_big = false, wait_del = false;
void *bmp = NULL;
uint64_t version = 1;
if (!is_del && clean_entry_bitmap_size > sizeof(void*))
{
bmp = calloc_or_die(1, clean_entry_bitmap_size);
}
if (dirty_db.size() > 0)
{
auto dirty_it = dirty_db.upper_bound((obj_ver_id){
@@ -25,6 +30,13 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
wait_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE
? !IS_SYNCED(dirty_it->second.state)
: ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG);
if (!is_del && !deleted)
{
if (clean_entry_bitmap_size > sizeof(void*))
memcpy(bmp, dirty_it->second.bitmap, clean_entry_bitmap_size);
else
bmp = dirty_it->second.bitmap;
}
}
}
if (!found)
@@ -33,6 +45,11 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
if (clean_it != clean_db.end())
{
version = clean_it->second.version + 1;
if (!is_del)
{
void *bmp_ptr = get_clean_entry_bitmap(clean_it->second.location, clean_entry_bitmap_size);
memcpy((clean_entry_bitmap_size > sizeof(void*) ? bmp : &bmp), bmp_ptr, clean_entry_bitmap_size);
}
}
else
{
@@ -72,6 +89,10 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
{
// Invalid version requested
op->retval = -EEXIST;
if (!is_del && clean_entry_bitmap_size > sizeof(void*))
{
free(bmp);
}
return false;
}
}
@@ -101,6 +122,8 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
else
{
state = (op->len == block_size || deleted ? BS_ST_BIG_WRITE : BS_ST_SMALL_WRITE);
if (state == BS_ST_SMALL_WRITE && throttle_small_writes)
clock_gettime(CLOCK_REALTIME, &PRIV(op)->tv_begin);
if (wait_del)
state |= BS_ST_WAIT_DEL;
else if (state == BS_ST_SMALL_WRITE && wait_big)
@@ -109,6 +132,28 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
state |= BS_ST_IN_FLIGHT;
if (op->opcode == BS_OP_WRITE_STABLE)
state |= BS_ST_INSTANT;
if (op->bitmap)
{
// Only overwrite the part of the object bitmap that corresponds to the write's offset/len
uint8_t *bmp_ptr = (uint8_t*)(clean_entry_bitmap_size > sizeof(void*) ? bmp : &bmp);
uint32_t bit = op->offset/bitmap_granularity;
uint32_t bits_left = op->len/bitmap_granularity;
while (!(bit % 8) && bits_left > 8)
{
// Copy bytes
bmp_ptr[bit/8] = ((uint8_t*)op->bitmap)[bit/8];
bit += 8;
bits_left -= 8;
}
while (bits_left > 0)
{
// Copy bits
bmp_ptr[bit/8] = (bmp_ptr[bit/8] & ~(1 << (bit%8)))
| (((uint8_t*)op->bitmap)[bit/8] & (1 << bit%8));
bit++;
bits_left--;
}
}
}
dirty_db.emplace((obj_ver_id){
.oid = op->oid,
@@ -120,10 +165,36 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
.offset = is_del ? 0 : op->offset,
.len = is_del ? 0 : op->len,
.journal_sector = 0,
.bitmap = bmp,
});
return true;
}
void blockstore_impl_t::cancel_all_writes(blockstore_op_t *op, blockstore_dirty_db_t::iterator dirty_it, int retval)
{
while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
{
if (clean_entry_bitmap_size > sizeof(void*))
free(dirty_it->second.bitmap);
dirty_db.erase(dirty_it++);
}
bool found = false;
for (auto other_op: submit_queue)
{
if (!found && other_op == op)
found = true;
else if (found && other_op->oid == op->oid &&
(other_op->opcode == BS_OP_WRITE || other_op->opcode == BS_OP_WRITE_STABLE))
{
// Mark operations to cancel them
PRIV(other_op)->real_version = UINT64_MAX;
other_op->retval = retval;
}
}
op->retval = retval;
FINISH_OP(op);
}
// First step of the write algorithm: dequeue operation and submit initial write(s)
int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
{
@@ -143,6 +214,12 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
}
if (PRIV(op)->real_version != 0)
{
if (PRIV(op)->real_version == UINT64_MAX)
{
// This is the flag value used to cancel operations
FINISH_OP(op);
return 2;
}
// Restore original low version number for unblocked operations
#ifdef BLOCKSTORE_DEBUG
printf("Restoring %lx:%lx version: v%lu -> v%lu\n", op->oid.inode, op->oid.stripe, op->version, PRIV(op)->real_version);
@@ -152,11 +229,9 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
if (prev_it->first.oid == op->oid && prev_it->first.version >= PRIV(op)->real_version)
{
// Original version is still invalid
// FIXME Oops. Successive small writes will currently break in an unexpected way. Fix it
dirty_db.erase(dirty_it);
op->retval = -EEXIST;
FINISH_OP(op);
return 1;
// All subsequent writes to the same object must be canceled too
cancel_all_writes(op, dirty_it, -EEXIST);
return 2;
}
op->version = PRIV(op)->real_version;
PRIV(op)->real_version = 0;
@@ -174,7 +249,8 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
if ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
{
blockstore_journal_check_t space_check(this);
if (!space_check.check_available(op, unsynced_big_writes.size() + 1, sizeof(journal_entry_big_write), JOURNAL_STABILIZE_RESERVATION))
if (!space_check.check_available(op, unsynced_big_write_count + 1,
sizeof(journal_entry_big_write) + clean_entry_bitmap_size, JOURNAL_STABILIZE_RESERVATION))
{
return 0;
}
@@ -189,18 +265,18 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
PRIV(op)->wait_for = WAIT_FREE;
return 0;
}
// FIXME Oops. Successive small writes will currently break in an unexpected way. Fix it
dirty_db.erase(dirty_it);
op->retval = -ENOSPC;
FINISH_OP(op);
return 1;
cancel_all_writes(op, dirty_it, -ENOSPC);
return 2;
}
write_iodepth++;
BS_SUBMIT_GET_SQE(sqe, data);
dirty_it->second.location = loc << block_order;
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED;
#ifdef BLOCKSTORE_DEBUG
printf("Allocate block %lu\n", loc);
printf(
"Allocate block %lu for %lx:%lx v%lu\n",
loc, op->oid.inode, op->oid.stripe, op->version
);
#endif
data_alloc->set(loc, true);
uint64_t stripe_offset = (op->offset % bitmap_granularity);
@@ -226,11 +302,8 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
if (immediate_commit != IMMEDIATE_ALL)
{
// Remember big write as unsynced
unsynced_big_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
// Increase the counter, but don't save into unsynced_writes yet (can't sync until the write is finished)
unsynced_big_write_count++;
PRIV(op)->op_state = 3;
}
else
@@ -243,8 +316,11 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
// Small (journaled) write
// First check if the journal has sufficient space
blockstore_journal_check_t space_check(this);
if (unsynced_big_writes.size() && !space_check.check_available(op, unsynced_big_writes.size(), sizeof(journal_entry_big_write), 0)
|| !space_check.check_available(op, 1, sizeof(journal_entry_small_write), op->len + JOURNAL_STABILIZE_RESERVATION))
if (unsynced_big_write_count &&
!space_check.check_available(op, unsynced_big_write_count,
sizeof(journal_entry_big_write) + clean_entry_bitmap_size, 0)
|| !space_check.check_available(op, 1,
sizeof(journal_entry_small_write) + clean_entry_bitmap_size, op->len + JOURNAL_STABILIZE_RESERVATION))
{
return 0;
}
@@ -252,8 +328,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
// There is sufficient space. Get SQE(s)
struct io_uring_sqe *sqe1 = NULL;
if (immediate_commit != IMMEDIATE_NONE ||
(journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_small_write) &&
journal.sector_info[journal.cur_sector].dirty)
!journal.entry_fits(sizeof(journal_entry_small_write) + clean_entry_bitmap_size))
{
// Write current journal sector only if it's dirty and full, or in the immediate_commit mode
BS_SUBMIT_GET_SQE_DECL(sqe1);
@@ -281,7 +356,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
// Then pre-fill journal entry
journal_entry_small_write *je = (journal_entry_small_write*)prefill_single_journal_entry(
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_SMALL_WRITE_INSTANT : JE_SMALL_WRITE,
sizeof(journal_entry_small_write)
sizeof(journal_entry_small_write) + clean_entry_bitmap_size
);
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
@@ -300,6 +375,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
je->len = op->len;
je->data_offset = journal.next_free;
je->crc32_data = crc32c(0, op->buf, op->len);
memcpy((void*)(je+1), (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap), clean_entry_bitmap_size);
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
if (immediate_commit != IMMEDIATE_NONE)
@@ -335,18 +411,10 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
{
journal.next_free = journal_block_size;
}
if (immediate_commit == IMMEDIATE_NONE)
{
// Remember small write as unsynced
unsynced_small_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
}
if (!PRIV(op)->pending_ops)
{
PRIV(op)->op_state = 4;
continue_write(op);
return continue_write(op);
}
else
{
@@ -358,89 +426,153 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
int blockstore_impl_t::continue_write(blockstore_op_t *op)
{
io_uring_sqe *sqe = NULL;
journal_entry_big_write *je;
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
assert(dirty_it != dirty_db.end());
if (PRIV(op)->op_state == 2)
int op_state = PRIV(op)->op_state;
if (op_state == 2)
goto resume_2;
else if (PRIV(op)->op_state == 4)
else if (op_state == 4)
goto resume_4;
else if (op_state == 6)
goto resume_6;
else
{
// In progress
return 1;
}
resume_2:
// Only for the immediate_commit mode: prepare and submit big_write journal entry
sqe = get_sqe();
if (!sqe)
{
return 0;
}
je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write)
);
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
assert(dirty_it != dirty_db.end());
io_uring_sqe *sqe = NULL;
BS_SUBMIT_GET_SQE_DECL(sqe);
journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
sizeof(journal_entry_big_write) + clean_entry_bitmap_size
);
dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
#ifdef BLOCKSTORE_DEBUG
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
journal.sector_info[journal.cur_sector].offset, op->oid.inode, op->oid.stripe, op->version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
printf(
"journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
journal.sector_info[journal.cur_sector].offset, op->oid.inode, op->oid.stripe, op->version,
journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
);
#endif
je->oid = op->oid;
je->version = op->version;
je->offset = op->offset;
je->len = op->len;
je->location = dirty_it->second.location;
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
prepare_journal_sector_write(journal, journal.cur_sector, sqe,
[this, op](ring_data_t *data) { handle_write_event(data, op); });
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = 3;
return 1;
je->oid = op->oid;
je->version = op->version;
je->offset = op->offset;
je->len = op->len;
je->location = dirty_it->second.location;
memcpy((void*)(je+1), (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap), clean_entry_bitmap_size);
je->crc32 = je_crc32((journal_entry*)je);
journal.crc32_last = je->crc32;
prepare_journal_sector_write(journal, journal.cur_sector, sqe,
[this, op](ring_data_t *data) { handle_write_event(data, op); });
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops = 1;
PRIV(op)->op_state = 3;
return 1;
}
resume_4:
// Switch object state
{
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
assert(dirty_it != dirty_db.end());
#ifdef BLOCKSTORE_DEBUG
printf("Ack write %lx:%lx v%lu = state %x\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
printf("Ack write %lx:%lx v%lu = state 0x%x\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
#endif
bool imm = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE
? (immediate_commit == IMMEDIATE_ALL)
: (immediate_commit != IMMEDIATE_NONE);
if (imm)
{
auto & unstab = unstable_writes[op->oid];
unstab = unstab < op->version ? op->version : unstab;
}
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK)
| (imm ? BS_ST_SYNCED : BS_ST_WRITTEN);
if (imm && ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_DELETE || (dirty_it->second.state & BS_ST_INSTANT)))
{
// Deletions are treated as immediately stable
mark_stable(dirty_it->first);
}
if (immediate_commit == IMMEDIATE_ALL)
{
dirty_it++;
while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
bool is_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE;
bool imm = is_big ? (immediate_commit == IMMEDIATE_ALL) : (immediate_commit != IMMEDIATE_NONE);
if (imm)
{
if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG)
auto & unstab = unstable_writes[op->oid];
unstab = unstab < op->version ? op->version : unstab;
}
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK)
| (imm ? BS_ST_SYNCED : BS_ST_WRITTEN);
if (imm && ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_DELETE || (dirty_it->second.state & BS_ST_INSTANT)))
{
// Deletions and 'instant' operations are treated as immediately stable
mark_stable(dirty_it->first);
}
if (!imm)
{
if (is_big)
{
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_IN_FLIGHT;
// Remember big write as unsynced
unsynced_big_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
}
else
{
// Remember small write as unsynced
unsynced_small_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
}
}
if (imm && (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
{
// Unblock small writes
dirty_it++;
while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
{
if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG)
{
dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_IN_FLIGHT;
}
dirty_it++;
}
}
// Apply throttling to not fill the journal too fast for the SSD+HDD case
if (!is_big && throttle_small_writes)
{
// Apply throttling
timespec tv_end;
clock_gettime(CLOCK_REALTIME, &tv_end);
uint64_t exec_us =
(tv_end.tv_sec - PRIV(op)->tv_begin.tv_sec)*1000000 +
(tv_end.tv_nsec - PRIV(op)->tv_begin.tv_nsec)/1000;
// Compare with target execution time
// 100% free -> target time = 0
// 0% free -> target time = iodepth/parallelism * (iops + size/bw) / write per second
uint64_t used_start = journal.get_trim_pos();
uint64_t journal_free_space = journal.next_free < used_start
? (used_start - journal.next_free)
: (journal.len - journal.next_free + used_start - journal.block_size);
uint64_t ref_us =
(write_iodepth <= throttle_target_parallelism ? 100 : 100*write_iodepth/throttle_target_parallelism)
* (1000000/throttle_target_iops + op->len*1000000/throttle_target_mbs/1024/1024)
/ 100;
ref_us -= ref_us * journal_free_space / journal.len;
if (ref_us > exec_us + throttle_threshold_us)
{
// Pause reply
tfd->set_timer_us(ref_us-exec_us, false, [this, op](int timer_id)
{
PRIV(op)->op_state++;
ringloop->wakeup();
});
PRIV(op)->op_state = 5;
return 1;
}
}
}
resume_6:
// Acknowledge write
op->retval = op->len;
write_iodepth--;
FINISH_OP(op);
return 1;
return 2;
}
void blockstore_impl_t::handle_write_event(ring_data_t *data, blockstore_op_t *op)
@@ -459,10 +591,7 @@ void blockstore_impl_t::handle_write_event(ring_data_t *data, blockstore_op_t *o
{
release_journal_sectors(op);
PRIV(op)->op_state++;
if (!continue_write(op))
{
submit_queue.push_front(op);
}
ringloop->wakeup();
}
}
@@ -500,6 +629,10 @@ void blockstore_impl_t::release_journal_sectors(blockstore_op_t *op)
int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
{
if (PRIV(op)->op_state)
{
return continue_write(op);
}
auto dirty_it = dirty_db.find((obj_ver_id){
.oid = op->oid,
.version = op->version,
@@ -510,6 +643,7 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
{
return 0;
}
write_iodepth++;
io_uring_sqe *sqe = NULL;
if (immediate_commit != IMMEDIATE_NONE ||
(journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_del) &&
@@ -557,18 +691,10 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
PRIV(op)->pending_ops++;
}
else
{
// Remember delete as unsynced
unsynced_small_writes.push_back((obj_ver_id){
.oid = op->oid,
.version = op->version,
});
}
if (!PRIV(op)->pending_ops)
{
PRIV(op)->op_state = 4;
continue_write(op);
return continue_write(op);
}
else
{

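Two of the additions above may be easier to follow with numbers. Bitmap merge in enqueue_write(): with bitmap_granularity = 4096, a write at offset 8192 with length 16384 starts at bit 2 with 4 bits to copy, so the byte-wise fast path is skipped and only bits 2..5 of the object bitmap are replaced. Small-write throttling in continue_write(): with the fallback settings from parse_config() (100 IOPS, 100 MB/s, parallelism 1, 50 us threshold) and a 4 KiB journaled write at write_iodepth = 1, the target time is about 1000000/100 + 4096*1000000/(100*1024*1024) ≈ 10039 us; it is then reduced in proportion to the free journal space, so replies are delayed noticeably only while the journal is close to full.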
251
src/cli.cpp Normal file
View File

@@ -0,0 +1,251 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
/**
* CLI tool
* Currently can (a) remove inodes and (b) merge snapshot/clone layers
*/
#include <vector>
#include <algorithm>
#include "cli.h"
#include "epoll_manager.h"
#include "cluster_client.h"
#include "pg_states.h"
#include "base64.h"
static const char *exe_name = NULL;
json11::Json::object cli_tool_t::parse_args(int narg, const char *args[])
{
json11::Json::object cfg;
json11::Json::array cmd;
cfg["progress"] = "1";
for (int i = 1; i < narg; i++)
{
if (!strcmp(args[i], "-h") || !strcmp(args[i], "--help"))
{
help();
}
else if (args[i][0] == '-' && args[i][1] == '-')
{
const char *opt = args[i]+2;
cfg[opt] = !strcmp(opt, "json") || !strcmp(opt, "wait-list") || i == narg-1 ? "1" : args[++i];
}
else
{
cmd.push_back(std::string(args[i]));
}
}
if (!cmd.size())
{
std::string exe(exe_name);
if (exe.substr(exe.size()-11) == "vitastor-rm")
{
cmd.push_back("rm-data");
}
}
cfg["command"] = cmd;
return cfg;
}
void cli_tool_t::help()
{
printf(
"Vitastor command-line tool\n"
"(c) Vitaliy Filippov, 2019+ (VNPL-1.1)\n\n"
"USAGE:\n"
"%s rm-data [OPTIONS] --pool <pool> --inode <inode> [--wait-list]\n"
" Remove inode data without changing metadata.\n"
" --wait-list means first retrieve the object listing and then remove the objects.\n"
" --wait-list requires more memory, but allows showing correct removal stats.\n"
"\n"
"%s merge-data [OPTIONS] <from> <to> [--target <target>]\n"
" Merge layer data without changing metadata. Merge <from>..<to> to <target>.\n"
" <to> must be a child of <from> and <target> may be one of the layers between\n"
" <from> and <to>, including <from> and <to>.\n"
"\n"
"%s flatten [OPTIONS] <layer>\n"
" Flatten a layer, i.e. merge data and detach it from parents\n"
"\n"
"%s rm [OPTIONS] <from> [<to>] [--writers-stopped 1]\n"
" Remove <from> or all layers between <from> and <to> (<to> must be a child of <from>),\n"
" rebasing all their children accordingly. One of the deleted parents may be renamed to one\n"
" of the children \"to be rebased\", but only if that child itself is readonly or if\n"
" --writers-stopped 1 is specified\n"
"\n"
"OPTIONS (global):\n"
" --etcd_address <etcd_address>\n"
" --iodepth N Send N operations in parallel to each OSD when possible (default 32)\n"
" --parallel_osds M Work with M osds in parallel when possible (default 4)\n"
" --progress 1|0 Report progress (default 1)\n"
" --cas 1|0 Use online CAS writes when possible (default auto)\n"
,
exe_name, exe_name, exe_name, exe_name
);
exit(0);
}
void cli_tool_t::change_parent(inode_t cur, inode_t new_parent)
{
auto cur_cfg_it = cli->st_cli.inode_config.find(cur);
if (cur_cfg_it == cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode 0x%lx disappeared\n", cur);
exit(1);
}
inode_config_t new_cfg = cur_cfg_it->second;
std::string cur_name = new_cfg.name;
std::string cur_cfg_key = base64_encode(cli->st_cli.etcd_prefix+
"/config/inode/"+std::to_string(INODE_POOL(cur))+
"/"+std::to_string(INODE_NO_POOL(cur)));
new_cfg.parent_id = new_parent;
json11::Json::object cur_cfg_json = cli->st_cli.serialize_inode_cfg(&new_cfg);
waiting++;
cli->st_cli.etcd_txn(json11::Json::object {
{ "compare", json11::Json::array {
json11::Json::object {
{ "target", "MOD" },
{ "key", cur_cfg_key },
{ "result", "LESS" },
{ "mod_revision", new_cfg.mod_revision+1 },
},
} },
{ "success", json11::Json::array {
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", cur_cfg_key },
{ "value", base64_encode(json11::Json(cur_cfg_json).dump()) },
} }
},
} },
}, ETCD_SLOW_TIMEOUT, [this, new_parent, cur, cur_name](std::string err, json11::Json res)
{
if (err != "")
{
fprintf(stderr, "Error changing parent of %s: %s\n", cur_name.c_str(), err.c_str());
exit(1);
}
if (!res["succeeded"].bool_value())
{
fprintf(stderr, "Inode %s was modified during snapshot deletion\n", cur_name.c_str());
exit(1);
}
if (new_parent)
{
auto new_parent_it = cli->st_cli.inode_config.find(new_parent);
std::string new_parent_name = new_parent_it != cli->st_cli.inode_config.end()
? new_parent_it->second.name : "<unknown>";
printf(
"Parent of layer %s (inode %lu in pool %u) changed to %s (inode %lu in pool %u)\n",
cur_name.c_str(), INODE_NO_POOL(cur), INODE_POOL(cur),
new_parent_name.c_str(), INODE_NO_POOL(new_parent), INODE_POOL(new_parent)
);
}
else
{
printf(
"Parent of layer %s (inode %lu in pool %u) detached\n",
cur_name.c_str(), INODE_NO_POOL(cur), INODE_POOL(cur)
);
}
waiting--;
ringloop->wakeup();
});
}
inode_config_t* cli_tool_t::get_inode_cfg(const std::string & name)
{
for (auto & ic: cli->st_cli.inode_config)
{
if (ic.second.name == name)
{
return &ic.second;
}
}
fprintf(stderr, "Layer %s not found\n", name.c_str());
exit(1);
}
void cli_tool_t::run(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
if (!cmd.size())
{
fprintf(stderr, "command is missing\n");
exit(1);
}
else if (cmd[0] == "rm-data")
{
// Delete inode data
action_cb = start_rm(cfg);
}
else if (cmd[0] == "merge-data")
{
// Merge layer data without affecting metadata
action_cb = start_merge(cfg);
}
else if (cmd[0] == "flatten")
{
// Flatten a layer: merge parent data into it and detach it from its parents
action_cb = start_flatten(cfg);
}
else if (cmd[0] == "rm")
{
// Remove multiple snapshots and rebase their children
action_cb = start_snap_rm(cfg);
}
else
{
fprintf(stderr, "unknown command: %s\n", cmd[0].string_value().c_str());
exit(1);
}
iodepth = cfg["iodepth"].uint64_value();
if (!iodepth)
iodepth = 32;
parallel_osds = cfg["parallel_osds"].uint64_value();
if (!parallel_osds)
parallel_osds = 4;
log_level = cfg["log_level"].int64_value();
progress = cfg["progress"].uint64_value() ? true : false;
list_first = cfg["wait-list"].uint64_value() ? true : false;
// Create client
ringloop = new ring_loop_t(512);
epmgr = new epoll_manager_t(ringloop);
cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
cli->on_ready([this]()
{
// Initialize job
consumer.loop = [this]()
{
if (action_cb != NULL)
{
bool done = action_cb();
if (done)
{
action_cb = NULL;
}
}
ringloop->submit();
};
ringloop->register_consumer(&consumer);
consumer.loop();
});
// Loop until it completes
while (action_cb != NULL)
{
ringloop->loop();
ringloop->wait();
}
}
int main(int narg, const char *args[])
{
setvbuf(stdout, NULL, _IONBF, 0);
setvbuf(stderr, NULL, _IONBF, 0);
exe_name = args[0];
cli_tool_t *p = new cli_tool_t();
p->run(cli_tool_t::parse_args(narg, args));
return 0;
}
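Putting parse_args() and run() together: the first positional argument selects the action, the remaining "--key value" pairs become the JSON config, and a binary named vitastor-rm falls back to rm-data. Hypothetical invocations, with the etcd address, pool/inode numbers and layer names invented purely for illustration:

    vitastor-cli rm-data --etcd_address 10.0.0.1:2379 --pool 1 --inode 5 --wait-list
    vitastor-cli merge-data base_snapshot child_image --target child_image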

56
src/cli.h Normal file
View File

@@ -0,0 +1,56 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
// Common CLI tool header
#pragma once
#include "json11/json11.hpp"
#include "object_id.h"
#include "ringloop.h"
#include <functional>
struct rm_inode_t;
struct snap_merger_t;
struct snap_flattener_t;
struct snap_remover_t;
class epoll_manager_t;
class cluster_client_t;
struct inode_config_t;
class cli_tool_t
{
public:
uint64_t iodepth = 0, parallel_osds = 0;
bool progress = true;
bool list_first = false;
int log_level = 0;
int mode = 0;
ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL;
cluster_client_t *cli = NULL;
int waiting = 0;
ring_consumer_t consumer;
std::function<bool(void)> action_cb;
void run(json11::Json cfg);
void change_parent(inode_t cur, inode_t new_parent);
inode_config_t* get_inode_cfg(const std::string & name);
static json11::Json::object parse_args(int narg, const char *args[]);
static void help();
friend struct rm_inode_t;
friend struct snap_merger_t;
friend struct snap_flattener_t;
friend struct snap_remover_t;
std::function<bool(void)> start_rm(json11::Json);
std::function<bool(void)> start_merge(json11::Json);
std::function<bool(void)> start_flatten(json11::Json);
std::function<bool(void)> start_snap_rm(json11::Json);
};

124
src/cli_flatten.cpp Normal file
View File

@@ -0,0 +1,124 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include "cli.h"
#include "cluster_client.h"
// Flatten a layer: merge all parents into a layer and break the connection completely
struct snap_flattener_t
{
cli_tool_t *parent;
// target to flatten
std::string target_name;
// writers are stopped, we can safely change writable layers
bool writers_stopped = false;
// use CAS writes (0 = never, 1 = auto, 2 = always)
int use_cas = 1;
// interval between fsyncs
int fsync_interval = 128;
std::string top_parent_name;
inode_t target_id = 0;
int state = 0;
std::function<bool(void)> merger_cb;
void get_merge_parents()
{
// Get all parents of target
inode_config_t *target_cfg = parent->get_inode_cfg(target_name);
target_id = target_cfg->num;
std::vector<inode_t> chain_list;
inode_config_t *cur = target_cfg;
chain_list.push_back(cur->num);
while (cur->parent_id != 0 && cur->parent_id != target_cfg->num)
{
auto it = parent->cli->st_cli.inode_config.find(cur->parent_id);
if (it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Parent inode of layer %s (id %ld) not found\n", cur->name.c_str(), cur->parent_id);
exit(1);
}
cur = &it->second;
chain_list.push_back(cur->num);
}
if (cur->parent_id != 0)
{
fprintf(stderr, "Layer %s has a loop in parents\n", target_name.c_str());
exit(1);
}
top_parent_name = cur->name;
}
bool is_done()
{
return state == 5;
}
void loop()
{
if (state == 1)
goto resume_1;
else if (state == 2)
goto resume_2;
else if (state == 3)
goto resume_3;
// Get parent layers
get_merge_parents();
// Start merger
merger_cb = parent->start_merge(json11::Json::object {
{ "command", json11::Json::array{ "merge-data", top_parent_name, target_name } },
{ "target", target_name },
{ "delete-source", false },
{ "cas", use_cas },
{ "fsync-interval", fsync_interval },
});
// Wait for it
resume_1:
while (!merger_cb())
{
state = 1;
return;
}
merger_cb = NULL;
// Change parent
parent->change_parent(target_id, 0);
// Wait for it to complete
state = 2;
resume_2:
if (parent->waiting > 0)
return;
state = 3;
resume_3:
// Done
return;
}
};
std::function<bool(void)> cli_tool_t::start_flatten(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto flattener = new snap_flattener_t();
flattener->parent = this;
flattener->target_name = cmd.size() > 1 ? cmd[1].string_value() : "";
if (flattener->target_name == "")
{
fprintf(stderr, "Layer to flatten argument is missing\n");
exit(1);
}
flattener->fsync_interval = cfg["fsync-interval"].uint64_value();
if (!flattener->fsync_interval)
flattener->fsync_interval = 128;
if (!cfg["cas"].is_null())
flattener->use_cas = cfg["cas"].uint64_value() ? 2 : 0;
return [flattener]()
{
flattener->loop();
if (flattener->is_done())
{
delete flattener;
return true;
}
return false;
};
}

583
src/cli_merge.cpp Normal file
View File

@@ -0,0 +1,583 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include "cli.h"
#include "cluster_client.h"
#include "cpp-btree/safe_btree_set.h"
struct snap_rw_op_t
{
uint64_t offset = 0;
void *buf = NULL;
cluster_op_t op;
int todo = 0;
uint32_t start = 0, end = 0;
};
// Layer merge is the base for multiple operations:
// 1) Delete snapshot "up" = merge child layer into the parent layer, remove the child
// and rename the parent to the child
// 2) Delete snapshot "down" = merge parent layer into the child layer and remove the parent
// 3) Flatten image = merge parent layers into the child layer and break the connection
struct snap_merger_t
{
cli_tool_t *parent;
// -- CONFIGURATION --
// merge from..to into target (target may be one of from..to)
std::string from_name, to_name, target_name;
// inode=>rank (bigger rank means child layers)
std::map<inode_t,int> sources;
// delete merged source inode data during merge
bool delete_source = false;
// use CAS writes (0 = never, 1 = auto, 2 = always)
int use_cas = 1;
// don't necessarily delete source data, but perform checks as if we were to do it
bool check_delete_source = false;
// interval between fsyncs
int fsync_interval = 128;
// -- STATE --
inode_t target;
int target_rank;
bool inside_continue = false;
int state = 0;
int lists_todo = 0;
uint64_t target_block_size = 0;
btree::safe_btree_set<uint64_t> merge_offsets;
btree::safe_btree_set<uint64_t>::iterator oit;
std::map<inode_t, std::vector<uint64_t>> layer_lists;
std::map<inode_t, uint64_t> layer_block_size;
std::map<inode_t, uint64_t> layer_list_pos;
int in_flight = 0;
uint64_t last_fsync_offset = 0;
uint64_t last_written_offset = 0;
int deleted_unsynced = 0;
uint64_t processed = 0, to_process = 0;
void start_merge()
{
check_delete_source = delete_source || check_delete_source;
inode_config_t *from_cfg = parent->get_inode_cfg(from_name);
inode_config_t *to_cfg = parent->get_inode_cfg(to_name);
inode_config_t *target_cfg = target_name == "" ? from_cfg : parent->get_inode_cfg(target_name);
if (to_cfg->num == from_cfg->num)
{
fprintf(stderr, "Only one layer specified, nothing to merge\n");
exit(1);
}
// Check that to_cfg is actually a child of from_cfg and target_cfg is somewhere between them
std::vector<inode_t> chain_list;
inode_config_t *cur = to_cfg;
chain_list.push_back(cur->num);
layer_block_size[cur->num] = get_block_size(cur->num);
while (cur->parent_id != from_cfg->num &&
cur->parent_id != to_cfg->num &&
cur->parent_id != 0)
{
auto it = parent->cli->st_cli.inode_config.find(cur->parent_id);
if (it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Parent inode of layer %s (id %ld) not found\n", cur->name.c_str(), cur->parent_id);
exit(1);
}
cur = &it->second;
chain_list.push_back(cur->num);
layer_block_size[cur->num] = get_block_size(cur->num);
}
if (cur->parent_id != from_cfg->num)
{
fprintf(stderr, "Layer %s is not a child of %s\n", to_name.c_str(), from_name.c_str());
exit(1);
}
chain_list.push_back(from_cfg->num);
layer_block_size[from_cfg->num] = get_block_size(from_cfg->num);
int i = chain_list.size()-1;
for (inode_t item: chain_list)
{
sources[item] = i--;
}
if (sources.find(target_cfg->num) == sources.end())
{
fprintf(stderr, "Layer %s is not between %s and %s\n", target_name.c_str(), to_name.c_str(), from_name.c_str());
exit(1);
}
target = target_cfg->num;
target_rank = sources.at(target);
int to_rank = sources.at(to_cfg->num);
bool to_has_children = false;
// Check that there are no other inodes dependent on altered layers
//
// 1) everything between <target> and <to> except <to> is not allowed
// to have children other than <to> if <to> is a child of <target>:
//
// <target> - <layer 3> - <to>
// \- <layer 4> <--------X--------- NOT ALLOWED
//
// 2) everything between <from> and <target>, except <target>, is not allowed
// to have children other than <target> if sources are to be deleted after merging:
//
// <from> - <layer 1> - <target> - <to>
// \- <layer 2> <---------X-------- NOT ALLOWED
for (auto & ic: parent->cli->st_cli.inode_config)
{
auto it = sources.find(ic.second.num);
if (it == sources.end() && ic.second.parent_id != 0)
{
it = sources.find(ic.second.parent_id);
if (it != sources.end())
{
int parent_rank = it->second;
if (parent_rank < to_rank && (parent_rank >= target_rank || check_delete_source))
{
fprintf(
stderr, "Layers at or above %s, but below %s are not allowed"
" to have other children, but %s is a child of %s\n",
(check_delete_source ? from_name.c_str() : target_name.c_str()),
to_name.c_str(), ic.second.name.c_str(),
parent->cli->st_cli.inode_config.at(ic.second.parent_id).name.c_str()
);
exit(1);
}
if (parent_rank >= to_rank)
{
to_has_children = true;
}
}
}
}
if ((target_rank < to_rank || to_has_children) && use_cas == 1)
{
// <to> has children itself, no need for CAS
use_cas = 0;
}
sources.erase(target);
printf(
"Merging %ld layer(s) into target %s%s (inode %lu in pool %u)\n",
sources.size(), target_cfg->name.c_str(),
use_cas ? " online (with CAS)" : "", INODE_NO_POOL(target), INODE_POOL(target)
);
target_block_size = get_block_size(target);
}
uint64_t get_block_size(inode_t inode)
{
auto & pool_cfg = parent->cli->st_cli.pool_config.at(INODE_POOL(inode));
uint64_t pg_data_size = (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks);
return parent->cli->get_bs_block_size() * pg_data_size;
}
void continue_merge_reent()
{
if (!inside_continue)
{
inside_continue = true;
continue_merge();
inside_continue = false;
}
}
bool is_done()
{
return state == 6;
}
void continue_merge()
{
if (state == 1)
goto resume_1;
else if (state == 2)
goto resume_2;
else if (state == 3)
goto resume_3;
else if (state == 4)
goto resume_4;
else if (state == 5)
goto resume_5;
else if (state == 6)
goto resume_6;
// Get parents and so on
start_merge();
// First list lower layers
list_layers(true);
state = 1;
resume_1:
while (lists_todo > 0)
{
// Wait for lists
return;
}
if (merge_offsets.size() > 0)
{
state = 2;
oit = merge_offsets.begin();
processed = 0;
to_process = merge_offsets.size();
resume_2:
// Then remove blocks already filled in target by issuing zero-length reads and checking bitmaps
while (in_flight < parent->iodepth*parent->parallel_osds && oit != merge_offsets.end())
{
in_flight++;
check_if_full(*oit);
oit++;
processed++;
if (parent->progress && !(processed % 128))
{
printf("\rFiltering target blocks: %lu/%lu", processed, to_process);
}
}
if (in_flight > 0 || oit != merge_offsets.end())
{
// Wait until reads finish
return;
}
if (parent->progress)
{
printf("\r%lu full blocks of target filtered out\n", to_process-merge_offsets.size());
}
}
state = 3;
resume_3:
// Then list upper layers
list_layers(false);
state = 4;
resume_4:
while (lists_todo > 0)
{
// Wait for lists
return;
}
state = 5;
processed = 0;
to_process = merge_offsets.size();
oit = merge_offsets.begin();
resume_5:
// Now read, overwrite and optionally delete offsets one by one
while (in_flight < parent->iodepth*parent->parallel_osds && oit != merge_offsets.end())
{
in_flight++;
read_and_write(*oit);
oit++;
processed++;
if (parent->progress && !(processed % 128))
{
printf("\rOverwriting blocks: %lu/%lu", processed, to_process);
}
}
if (in_flight > 0 || oit != merge_offsets.end())
{
// Wait until overwrites finish
return;
}
if (parent->progress)
{
printf("\rOverwriting blocks: %lu/%lu\n", to_process, to_process);
}
// Done
printf("Done, layers from %s to %s merged into %s\n", from_name.c_str(), to_name.c_str(), target_name.c_str());
state = 6;
resume_6:
return;
}
void list_layers(bool lower)
{
for (auto & sp: sources)
{
inode_t src = sp.first;
if (lower ? (sp.second < target_rank) : (sp.second > target_rank))
{
lists_todo++;
inode_list_t* lst = parent->cli->list_inode_start(src, [this, src](
inode_list_t *lst, std::set<object_id>&& objects, pg_num_t pg_num, osd_num_t primary_osd, int status)
{
uint64_t layer_block = layer_block_size.at(src);
for (object_id obj: objects)
{
merge_offsets.insert(obj.stripe - obj.stripe % target_block_size);
for (int i = target_block_size; i < layer_block; i += target_block_size)
{
merge_offsets.insert(obj.stripe - obj.stripe % target_block_size + i);
}
}
if (delete_source)
{
// Also store individual lists
auto & layer_list = layer_lists[src];
int pos = layer_list.size();
layer_list.resize(pos + objects.size());
for (object_id obj: objects)
{
layer_list[pos++] = obj.stripe;
}
}
if (status & INODE_LIST_DONE)
{
auto & name = parent->cli->st_cli.inode_config.at(src).name;
printf("Got listing of layer %s (inode %lu in pool %u)\n", name.c_str(), INODE_NO_POOL(src), INODE_POOL(src));
if (delete_source)
{
// Sort the inode listing
std::sort(layer_lists[src].begin(), layer_lists[src].end());
}
lists_todo--;
continue_merge_reent();
}
else
{
parent->cli->list_inode_next(lst, 1);
}
});
parent->cli->list_inode_next(lst, parent->parallel_osds);
}
}
}
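When a listed layer has a larger block size than the target (for example an EC layer merged into a replicated target), each listed object is expanded into several target-sized offsets. A standalone sketch of that expansion with hypothetical 384 KiB source blocks and 128 KiB target blocks:
#include <cstdint>
#include <cstdio>
#include <set>
int main()
{
    uint64_t layer_block = 384*1024, target_block_size = 128*1024;
    uint64_t stripe = 5*384*1024;   // offset of one listed source object
    std::set<uint64_t> merge_offsets;
    merge_offsets.insert(stripe - stripe % target_block_size);
    for (uint64_t i = target_block_size; i < layer_block; i += target_block_size)
        merge_offsets.insert(stripe - stripe % target_block_size + i);
    // one 384 KiB source object expands into three 128 KiB target offsets
    for (uint64_t o: merge_offsets)
        printf("target offset %lu KiB\n", o/1024);
    return 0;
}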
// Check if <offset> is fully written in <target> and remove it from merge_offsets if so
void check_if_full(uint64_t offset)
{
cluster_op_t *op = new cluster_op_t;
op->opcode = OSD_OP_READ_BITMAP;
op->inode = target;
op->offset = offset;
op->len = 0;
op->callback = [this](cluster_op_t *op)
{
if (op->retval < 0)
{
fprintf(stderr, "error reading target bitmap at offset %lx: %s\n", op->offset, strerror(-op->retval));
}
else
{
uint64_t bitmap_bytes = target_block_size/parent->cli->get_bs_bitmap_granularity()/8;
int i;
for (i = 0; i < bitmap_bytes; i++)
{
if (((uint8_t*)op->bitmap_buf)[i] != 0xff)
{
break;
}
}
if (i == bitmap_bytes)
{
// full
merge_offsets.erase(op->offset);
}
}
delete op;
in_flight--;
continue_merge_reent();
};
parent->cli->execute(op);
}
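The bitmap returned by OSD_OP_READ_BITMAP carries one bit per bitmap_granularity bytes, so a 128 KiB block with 4 KiB granularity needs 32 bits (4 bytes); the offset is dropped from merge_offsets only if every byte is 0xff. A standalone sketch with those assumed defaults:
#include <cstdint>
#include <cstdio>
#include <cstring>
int main()
{
    uint64_t target_block_size = 128*1024, bitmap_granularity = 4096;   // assumed defaults
    uint64_t bitmap_bytes = target_block_size/bitmap_granularity/8;     // 4 bytes = 32 bits
    uint8_t bitmap[4];
    memset(bitmap, 0xff, sizeof(bitmap));   // every 4 KiB granule is already written
    uint64_t i;
    for (i = 0; i < bitmap_bytes; i++)
        if (bitmap[i] != 0xff)
            break;
    printf(i == bitmap_bytes ? "block is full, skip it\n" : "block has holes, must be copied\n");
    return 0;
}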
// Read <offset> from <to>, write it to <target> and optionally delete it
// from all layers except <target> after fsync'ing
void read_and_write(uint64_t offset)
{
snap_rw_op_t *rwo = new snap_rw_op_t;
// Initialize the counter to 1 so the op isn't freed if write_subop() completes
// immediately (even though it shouldn't really do that)
rwo->todo = 1;
rwo->buf = malloc(target_block_size);
rwo->offset = offset;
rwo_read(rwo);
}
void rwo_read(snap_rw_op_t *rwo)
{
cluster_op_t *op = &rwo->op;
op->opcode = OSD_OP_READ;
op->inode = target;
op->offset = rwo->offset;
op->len = target_block_size;
op->iov.push_back(rwo->buf, target_block_size);
op->callback = [this, rwo](cluster_op_t *op)
{
if (op->retval != op->len)
{
fprintf(stderr, "error reading target at offset %lx: %s\n", op->offset, strerror(-op->retval));
exit(1);
}
next_write(rwo);
};
parent->cli->execute(op);
}
void next_write(snap_rw_op_t *rwo)
{
// Write each non-empty range using an individual operation
// FIXME: Allow using a single write with "holes" (OSDs don't allow it yet)
uint32_t gran = parent->cli->get_bs_bitmap_granularity();
uint64_t bitmap_size = target_block_size / gran;
while (rwo->end < bitmap_size)
{
auto bit = ((*(uint8_t*)(rwo->op.bitmap_buf + (rwo->end >> 3))) & (1 << (rwo->end & 0x7)));
if (!bit)
{
if (rwo->end > rwo->start)
{
// write start->end
rwo->todo++;
write_subop(rwo, rwo->start*gran, rwo->end*gran, use_cas ? 1+rwo->op.version : 0);
rwo->start = rwo->end;
if (use_cas)
{
// Submit one by one if using CAS writes
return;
}
}
rwo->start = rwo->end = rwo->end+1;
}
else
{
rwo->end++;
}
}
if (rwo->end > rwo->start)
{
// write start->end
rwo->todo++;
write_subop(rwo, rwo->start*gran, rwo->end*gran, use_cas ? 1+rwo->op.version : 0);
rwo->start = rwo->end;
if (use_cas)
{
return;
}
}
rwo->todo--;
// Just in case, if everything is done
autofree_op(rwo);
}
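next_write() scans the read bitmap and turns each run of set bits into one write. A standalone sketch of that scan over a hypothetical 16-granule bitmap (granules 2-5 and 9-10 written), not code from this diff:
#include <cstdint>
#include <cstdio>
int main()
{
    uint8_t bitmap[2] = { 0x3c, 0x06 };   // bits 2..5 and 9..10 set
    uint32_t gran = 4096, bits = 16, start = 0, end = 0;
    while (end < bits)
    {
        bool bit = bitmap[end >> 3] & (1 << (end & 0x7));
        if (!bit)
        {
            if (end > start)
                printf("write bytes [%u, %u)\n", start*gran, end*gran);
            start = end = end+1;
        }
        else
            end++;
    }
    if (end > start)
        printf("write bytes [%u, %u)\n", start*gran, end*gran);
    // prints: write bytes [8192, 24576) and write bytes [36864, 45056)
    return 0;
}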
void write_subop(snap_rw_op_t *rwo, uint32_t start, uint32_t end, uint64_t version)
{
cluster_op_t *subop = new cluster_op_t;
subop->opcode = OSD_OP_WRITE;
subop->inode = target;
subop->offset = rwo->offset+start;
subop->len = end-start;
subop->version = version;
subop->flags = OSD_OP_IGNORE_READONLY;
subop->iov.push_back(rwo->buf+start, end-start);
subop->callback = [this, rwo](cluster_op_t *subop)
{
rwo->todo--;
if (subop->retval != subop->len)
{
if (use_cas && subop->retval == -EINTR)
{
// CAS failure - reread and repeat optimistically
rwo->start = subop->offset - rwo->offset;
rwo_read(rwo);
delete subop;
return;
}
fprintf(stderr, "error writing target at offset %lx: %s\n", subop->offset, strerror(-subop->retval));
exit(1);
}
// Increment CAS version
rwo->op.version++;
if (use_cas)
next_write(rwo);
else
autofree_op(rwo);
delete subop;
};
parent->cli->execute(subop);
}
void delete_offset(inode_t inode_num, uint64_t offset)
{
cluster_op_t *subop = new cluster_op_t;
subop->opcode = OSD_OP_DELETE;
subop->inode = inode_num;
subop->offset = offset;
subop->len = 0;
subop->flags = OSD_OP_IGNORE_READONLY;
subop->callback = [this](cluster_op_t *subop)
{
if (subop->retval != 0)
{
fprintf(stderr, "error deleting from layer 0x%lx at offset %lx: %s", subop->inode, subop->offset, strerror(-subop->retval));
}
delete subop;
};
parent->cli->execute(subop);
}
void autofree_op(snap_rw_op_t *rwo)
{
if (!rwo->todo)
{
if (last_written_offset < rwo->op.offset+target_block_size)
{
last_written_offset = rwo->op.offset+target_block_size;
}
if (delete_source)
{
deleted_unsynced++;
if (deleted_unsynced >= fsync_interval)
{
uint64_t from = last_fsync_offset, to = last_written_offset;
cluster_op_t *subop = new cluster_op_t;
subop->opcode = OSD_OP_SYNC;
subop->callback = [this, from, to](cluster_op_t *subop)
{
delete subop;
// We can now delete source data between <from> and <to>
// But to do this we have to keep all object lists in memory :-(
for (auto & lp: layer_list_pos)
{
auto & layer_list = layer_lists.at(lp.first);
uint64_t layer_block = layer_block_size.at(lp.first);
int cur_pos = lp.second;
while (cur_pos < layer_list.size() && layer_list[cur_pos]+layer_block < to)
{
delete_offset(lp.first, layer_list[cur_pos]);
cur_pos++;
}
lp.second = cur_pos;
}
};
parent->cli->execute(subop);
}
}
free(rwo->buf);
delete rwo;
in_flight--;
continue_merge_reent();
}
}
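The fsync-then-delete window above can be pictured with a sorted per-layer offset list: after a sync that covers everything up to <to>, only source blocks ending strictly below <to> are deleted. A standalone sketch with made-up offsets and a 128 KiB layer block size:
#include <cstdint>
#include <cstdio>
#include <vector>
int main()
{
    std::vector<uint64_t> layer_list = { 0, 131072, 262144, 524288 };   // sorted source offsets
    uint64_t layer_block = 131072;
    uint64_t to = 400000;   // target data below this offset is already fsynced
    size_t cur_pos = 0;
    while (cur_pos < layer_list.size() && layer_list[cur_pos]+layer_block < to)
    {
        printf("safe to delete source block at offset %lu\n", layer_list[cur_pos]);
        cur_pos++;
    }
    // offsets 0, 131072 and 262144 are deleted; 524288 waits for the next sync
    return 0;
}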
};
std::function<bool(void)> cli_tool_t::start_merge(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto merger = new snap_merger_t();
merger->parent = this;
merger->from_name = cmd.size() > 1 ? cmd[1].string_value() : "";
merger->to_name = cmd.size() > 2 ? cmd[2].string_value() : "";
merger->target_name = cfg["target"].string_value();
if (merger->from_name == "" || merger->to_name == "")
{
fprintf(stderr, "Beginning or end of the merge sequence is missing\n");
exit(1);
}
merger->delete_source = cfg["delete-source"].string_value() != "";
merger->fsync_interval = cfg["fsync-interval"].uint64_value();
if (!merger->fsync_interval)
merger->fsync_interval = 128;
if (!cfg["cas"].is_null())
merger->use_cas = cfg["cas"].uint64_value() ? 2 : 0;
return [merger]()
{
merger->continue_merge_reent();
if (merger->is_done())
{
delete merger;
return true;
}
return false;
};
}
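start_merge() returns a callback that is polled until it reports completion. A hypothetical driver loop illustrating that contract (the real tool polls it from its ring loop between io_uring/etcd events, this sketch just busy-polls a toy callback):
#include <cstdio>
#include <functional>
static void drive(std::function<bool(void)> cb)
{
    // poll the operation after every event loop iteration until it says "done"
    while (!cb())
    {
        // in the real tool: process asynchronous completions here, then poll again
    }
    printf("operation finished\n");
}
int main()
{
    int steps = 3;
    drive([&]() { return --steps == 0; });   // toy callback that finishes after three polls
    return 0;
}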

195
src/cli_rm.cpp Normal file

@@ -0,0 +1,195 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include "cli.h"
#include "cluster_client.h"
#define RM_LISTING 1
#define RM_REMOVING 2
#define RM_END 3
struct rm_pg_t
{
pg_num_t pg_num;
osd_num_t rm_osd_num;
std::set<object_id> objects;
std::set<object_id>::iterator obj_pos;
uint64_t obj_count = 0, obj_done = 0, obj_prev_done = 0;
int state = 0;
int in_flight = 0;
};
struct rm_inode_t
{
uint64_t inode = 0;
pool_id_t pool_id = 0;
cli_tool_t *parent = NULL;
inode_list_t *lister = NULL;
std::vector<rm_pg_t*> lists;
uint64_t total_count = 0, total_done = 0, total_prev_pct = 0;
uint64_t pgs_to_list = 0;
bool lists_done = false;
int state = 0;
void start_delete()
{
lister = parent->cli->list_inode_start(inode, [this](inode_list_t *lst,
std::set<object_id>&& objects, pg_num_t pg_num, osd_num_t primary_osd, int status)
{
rm_pg_t *rm = new rm_pg_t((rm_pg_t){
.pg_num = pg_num,
.rm_osd_num = primary_osd,
.objects = objects,
.obj_count = objects.size(),
.obj_done = 0,
.obj_prev_done = 0,
});
rm->obj_pos = rm->objects.begin();
lists.push_back(rm);
if (parent->list_first)
{
parent->cli->list_inode_next(lister, 1);
}
if (status & INODE_LIST_DONE)
{
lists_done = true;
}
pgs_to_list--;
continue_delete();
});
if (!lister)
{
fprintf(stderr, "Failed to list inode %lu from pool %u objects\n", INODE_NO_POOL(inode), INODE_POOL(inode));
exit(1);
}
pgs_to_list = parent->cli->list_pg_count(lister);
parent->cli->list_inode_next(lister, parent->parallel_osds);
}
void send_ops(rm_pg_t *cur_list)
{
if (parent->cli->msgr.osd_peer_fds.find(cur_list->rm_osd_num) ==
parent->cli->msgr.osd_peer_fds.end())
{
// Initiate connection
parent->cli->msgr.connect_peer(cur_list->rm_osd_num, parent->cli->st_cli.peer_states[cur_list->rm_osd_num]);
return;
}
while (cur_list->in_flight < parent->iodepth && cur_list->obj_pos != cur_list->objects.end())
{
osd_op_t *op = new osd_op_t();
op->op_type = OSD_OP_OUT;
op->peer_fd = parent->cli->msgr.osd_peer_fds[cur_list->rm_osd_num];
op->req = (osd_any_op_t){
.rw = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = parent->cli->next_op_id(),
.opcode = OSD_OP_DELETE,
},
.inode = cur_list->obj_pos->inode,
.offset = cur_list->obj_pos->stripe,
.len = 0,
},
};
op->callback = [this, cur_list](osd_op_t *op)
{
cur_list->in_flight--;
if (op->reply.hdr.retval < 0)
{
fprintf(stderr, "Failed to remove object %lx:%lx from PG %u (OSD %lu) (retval=%ld)\n",
op->req.rw.inode, op->req.rw.offset,
cur_list->pg_num, cur_list->rm_osd_num, op->reply.hdr.retval);
}
delete op;
cur_list->obj_done++;
total_done++;
continue_delete();
};
cur_list->obj_pos++;
cur_list->in_flight++;
parent->cli->msgr.outbox_push(op);
}
}
void continue_delete()
{
if (parent->list_first && !lists_done)
{
return;
}
for (int i = 0; i < lists.size(); i++)
{
if (!lists[i]->in_flight && lists[i]->obj_pos == lists[i]->objects.end())
{
delete lists[i];
lists.erase(lists.begin()+i, lists.begin()+i+1);
i--;
if (!lists_done)
{
parent->cli->list_inode_next(lister, 1);
}
}
else
{
send_ops(lists[i]);
}
}
if (parent->progress && total_count > 0 && total_done*1000/total_count != total_prev_pct)
{
printf("\rRemoved %lu/%lu objects, %lu more PGs to list...", total_done, total_count, pgs_to_list);
total_prev_pct = total_done*1000/total_count;
}
if (lists_done && !lists.size())
{
printf("Done, inode %lu in pool %u data removed\n", INODE_NO_POOL(inode), pool_id);
state = 2;
}
}
bool loop()
{
if (state == 0)
{
start_delete();
state = 1;
}
else if (state == 1)
{
continue_delete();
}
else if (state == 2)
{
return true;
}
return false;
}
};
std::function<bool(void)> cli_tool_t::start_rm(json11::Json cfg)
{
auto remover = new rm_inode_t();
remover->parent = this;
remover->inode = cfg["inode"].uint64_value();
remover->pool_id = cfg["pool"].uint64_value();
if (remover->pool_id)
{
remover->inode = (remover->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (((uint64_t)remover->pool_id) << (64-POOL_ID_BITS));
}
remover->pool_id = INODE_POOL(remover->inode);
if (!remover->pool_id)
{
fprintf(stderr, "pool is missing\n");
exit(1);
}
return [remover]()
{
if (remover->loop())
{
delete remover;
return true;
}
return false;
};
}
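start_rm() accepts either a full inode number or a (pool, inode) pair and packs the pool id into the top POOL_ID_BITS bits. A standalone sketch of that packing, assuming POOL_ID_BITS is 16 (the actual value is not shown in this diff):
#include <cstdint>
#include <cstdio>
#define POOL_ID_BITS 16   // assumed value, not taken from this diff
int main()
{
    uint64_t pool_id = 2, inode = 5;
    uint64_t global = (inode & ((1ul << (64-POOL_ID_BITS)) - 1)) | (pool_id << (64-POOL_ID_BITS));
    printf("global inode: 0x%lx\n", global);   // 0x2000000000005
    printf("pool %lu, inode %lu\n",
        global >> (64-POOL_ID_BITS),                    // what INODE_POOL() extracts
        global & ((1ul << (64-POOL_ID_BITS)) - 1));     // what INODE_NO_POOL() extracts
    return 0;
}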

565
src/cli_snap_rm.cpp Normal file

@@ -0,0 +1,565 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 (see README.md for details)
#include "cli.h"
#include "cluster_client.h"
#include "base64.h"
// Remove layer(s): similar to merge, but alters metadata and processes multiple merge targets
//
// Exactly one child of the requested layers may be merged using the "inverted" workflow,
// where we merge it "down" into one of the "to-be-removed" layers and then rename the
// "to-be-removed" layer to the child. It may be done either if all writers are stopped
// before trying to delete layers (which is signaled by --writers-stopped) or if that child
// is a read-only layer (snapshot) itself.
//
// This "inverted" workflow trades copying data of one of the deleted layers for copying
// data of one child of the chain which is also a child of the "traded" layer. So we
// choose the (parent,child) pair which has the largest difference between "parent" and
// "child" inode sizes.
//
// All other children of the chain are processed by iterating through them, merging removed
// parents into them and rebasing them to the last layer which isn't a member of the removed
// chain.
//
// Example:
//
// <parent> - <from> - <layer 2> - <to> - <child 1>
// \ \ \- <child 2>
// \ \- <child 3>
// \-<child 4>
//
// 1) Find optimal pair for the "reverse" scenario
// Imagine that it's (<layer 2>, <child 1>) in this example
// 2) Process all children except <child 1>:
// - Merge <from>..<to> to <child 2>
// - Set <child 2> parent to <parent>
// - Repeat for others
// 3) Process <child 1>:
// - Merge <from>..<child 1> to <layer 2>
// - Set <layer 2> parent to <parent>
// - Rename <layer 2> to <child 1>
// 4) Delete other layers of the chain (<from>, <to>)
struct snap_remover_t
{
cli_tool_t *parent;
// remove from..to
std::string from_name, to_name;
// writers are stopped, we can safely change writable layers
bool writers_stopped = false;
// use CAS writes (0 = never, 1 = auto, 2 = always)
int use_cas = 1;
// interval between fsyncs
int fsync_interval = 128;
std::map<inode_t,int> sources;
std::map<inode_t,uint64_t> inode_used;
std::vector<inode_t> merge_children;
std::vector<inode_t> chain_list;
std::map<inode_t,int> inverse_candidates;
inode_t inverse_parent = 0, inverse_child = 0;
inode_t new_parent = 0;
int state = 0;
int current_child = 0;
std::function<bool(void)> cb;
bool is_done()
{
return state == 9;
}
void loop()
{
if (state == 1)
goto resume_1;
else if (state == 2)
goto resume_2;
else if (state == 3)
goto resume_3;
else if (state == 4)
goto resume_4;
else if (state == 5)
goto resume_5;
else if (state == 6)
goto resume_6;
else if (state == 7)
goto resume_7;
else if (state == 8)
goto resume_8;
else if (state == 9)
goto resume_9;
// Get children to merge
get_merge_children();
// Try to select an inode for the "inverse" optimized scenario
// Read statistics from etcd to do it
read_stats();
state = 1;
resume_1:
if (parent->waiting > 0)
return;
choose_inverse_candidate();
// Merge children one by one, except our "inverse" child
for (current_child = 0; current_child < merge_children.size(); current_child++)
{
if (merge_children[current_child] == inverse_child)
continue;
start_merge_child(merge_children[current_child], merge_children[current_child]);
resume_2:
while (!cb())
{
state = 2;
return;
}
cb = NULL;
parent->change_parent(merge_children[current_child], new_parent);
state = 3;
resume_3:
if (parent->waiting > 0)
return;
}
// Merge our "inverse" child into our "inverse" parent
if (inverse_child != 0)
{
start_merge_child(inverse_child, inverse_parent);
resume_4:
while (!cb())
{
state = 4;
return;
}
cb = NULL;
// Delete "inverse" child data
start_delete_source(inverse_child);
resume_5:
while (!cb())
{
state = 5;
return;
}
cb = NULL;
// Delete "inverse" child metadata, rename parent over it,
// and also change the parent links of the previous "inverse" child's children
rename_inverse_parent();
state = 6;
resume_6:
if (parent->waiting > 0)
return;
}
// Delete parents, except the "inverse" one
for (current_child = 0; current_child < chain_list.size(); current_child++)
{
if (chain_list[current_child] == inverse_parent)
continue;
start_delete_source(chain_list[current_child]);
resume_7:
while (!cb())
{
state = 7;
return;
}
cb = NULL;
delete_inode_config(chain_list[current_child]);
state = 8;
resume_8:
if (parent->waiting > 0)
return;
}
state = 9;
resume_9:
// Done
return;
}
void get_merge_children()
{
// Get all children of from..to
inode_config_t *from_cfg = parent->get_inode_cfg(from_name);
inode_config_t *to_cfg = parent->get_inode_cfg(to_name);
// Check that to_cfg is actually a child of from_cfg
// FIXME de-copypaste the following piece of code with snap_merger_t
inode_config_t *cur = to_cfg;
chain_list.push_back(cur->num);
while (cur->num != from_cfg->num && cur->parent_id != 0)
{
auto it = parent->cli->st_cli.inode_config.find(cur->parent_id);
if (it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Parent inode of layer %s (id %ld) not found\n", cur->name.c_str(), cur->parent_id);
exit(1);
}
cur = &it->second;
chain_list.push_back(cur->num);
}
if (cur->num != from_cfg->num)
{
fprintf(stderr, "Layer %s is not a child of %s\n", to_name.c_str(), from_name.c_str());
exit(1);
}
new_parent = from_cfg->parent_id;
// Calculate ranks
int i = chain_list.size()-1;
for (inode_t item: chain_list)
{
sources[item] = i--;
}
for (auto & ic: parent->cli->st_cli.inode_config)
{
if (!ic.second.parent_id)
{
continue;
}
auto it = sources.find(ic.second.parent_id);
if (it != sources.end() && sources.find(ic.second.num) == sources.end())
{
merge_children.push_back(ic.second.num);
if (ic.second.readonly || writers_stopped)
{
inverse_candidates[ic.second.num] = it->second;
}
}
}
}
void read_stats()
{
if (inverse_candidates.size() == 0)
{
return;
}
json11::Json::array reads;
for (auto cp: inverse_candidates)
{
inode_t inode = cp.first;
reads.push_back(json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(
parent->cli->st_cli.etcd_prefix+
"/inode/stats/"+std::to_string(INODE_POOL(inode))+
"/"+std::to_string(INODE_NO_POOL(inode))
) },
} }
});
}
for (auto cp: sources)
{
inode_t inode = cp.first;
reads.push_back(json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(
parent->cli->st_cli.etcd_prefix+
"/inode/stats/"+std::to_string(INODE_POOL(inode))+
"/"+std::to_string(INODE_NO_POOL(inode))
) },
} }
});
}
parent->waiting++;
parent->cli->st_cli.etcd_txn(json11::Json::object {
{ "success", reads },
}, ETCD_SLOW_TIMEOUT, [this](std::string err, json11::Json data)
{
parent->waiting--;
if (err != "")
{
fprintf(stderr, "Error reading layer statistics from etcd: %s\n", err.c_str());
exit(1);
}
for (auto inode_result: data["responses"].array_items())
{
auto kv = parent->cli->st_cli.parse_etcd_kv(inode_result["kvs"][0]);
pool_id_t pool_id = 0;
inode_t inode = 0;
char null_byte = 0;
sscanf(kv.key.c_str() + parent->cli->st_cli.etcd_prefix.length()+13, "%u/%lu%c", &pool_id, &inode, &null_byte);
if (!inode || null_byte != 0)
{
fprintf(stderr, "Bad key returned from etcd: %s\n", kv.key.c_str());
exit(1);
}
auto pool_cfg_it = parent->cli->st_cli.pool_config.find(pool_id);
if (pool_cfg_it == parent->cli->st_cli.pool_config.end())
{
fprintf(stderr, "Pool %u does not exist\n", pool_id);
exit(1);
}
inode = INODE_WITH_POOL(pool_id, inode);
auto & pool_cfg = pool_cfg_it->second;
uint64_t used_bytes = kv.value["raw_used"].uint64_value() / pool_cfg.pg_size;
if (pool_cfg.scheme != POOL_SCHEME_REPLICATED)
{
used_bytes *= (pool_cfg.pg_size - pool_cfg.parity_chunks);
}
inode_used[inode] = used_bytes;
}
parent->ringloop->wakeup();
});
}
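The raw_used statistic from etcd counts bytes across all replicas/chunks, so it is normalised to logical bytes before candidates are compared. A worked sketch with a hypothetical EC 3+2 pool and made-up numbers:
#include <cstdint>
#include <cstdio>
int main()
{
    uint64_t raw_used = 15ul*1024*1024*1024;     // hypothetical raw bytes across all OSDs
    uint64_t pg_size = 5, parity_chunks = 2;     // EC 3+2 pool
    bool replicated = false;
    uint64_t used_bytes = raw_used / pg_size;    // per-chunk share
    if (!replicated)
        used_bytes *= (pg_size - parity_chunks); // keep only the data chunks
    // 15 GiB raw -> 9 GiB logical; these values feed choose_inverse_candidate()
    printf("logical bytes: %lu\n", used_bytes);
    return 0;
}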
void choose_inverse_candidate()
{
uint64_t max_diff = 0;
for (auto cp: inverse_candidates)
{
inode_t child = cp.first;
uint64_t child_used = inode_used[child];
int rank = cp.second;
for (int i = chain_list.size()-rank; i < chain_list.size(); i++)
{
inode_t parent = chain_list[i];
uint64_t parent_used = inode_used[parent];
if (parent_used > child_used && (!max_diff || max_diff < (parent_used-child_used)))
{
max_diff = (parent_used-child_used);
inverse_parent = parent;
inverse_child = child;
}
}
}
}
void rename_inverse_parent()
{
auto child_it = parent->cli->st_cli.inode_config.find(inverse_child);
if (child_it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode %ld disappeared\n", inverse_child);
exit(1);
}
auto target_it = parent->cli->st_cli.inode_config.find(inverse_parent);
if (target_it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode %ld disappeared\n", inverse_parent);
exit(1);
}
inode_config_t *child_cfg = &child_it->second;
inode_config_t *target_cfg = &target_it->second;
std::string child_name = child_cfg->name;
std::string target_name = target_cfg->name;
std::string child_cfg_key = base64_encode(
parent->cli->st_cli.etcd_prefix+
"/config/inode/"+std::to_string(INODE_POOL(inverse_child))+
"/"+std::to_string(INODE_NO_POOL(inverse_child))
);
std::string target_cfg_key = base64_encode(
parent->cli->st_cli.etcd_prefix+
"/config/inode/"+std::to_string(INODE_POOL(inverse_parent))+
"/"+std::to_string(INODE_NO_POOL(inverse_parent))
);
// Fill new configuration
inode_config_t new_cfg = *child_cfg;
new_cfg.num = target_cfg->num;
new_cfg.parent_id = new_parent;
json11::Json::array cmp = json11::Json::array {
json11::Json::object {
{ "target", "MOD" },
{ "key", child_cfg_key },
{ "result", "LESS" },
{ "mod_revision", child_cfg->mod_revision+1 },
},
json11::Json::object {
{ "target", "MOD" },
{ "key", target_cfg_key },
{ "result", "LESS" },
{ "mod_revision", target_cfg->mod_revision+1 },
},
};
json11::Json::array txn = json11::Json::array {
json11::Json::object {
{ "request_delete_range", json11::Json::object {
{ "key", child_cfg_key },
} },
},
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", target_cfg_key },
{ "value", base64_encode(json11::Json(parent->cli->st_cli.serialize_inode_cfg(&new_cfg)).dump()) },
} },
},
json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/index/image/"+child_cfg->name) },
{ "value", base64_encode(json11::Json({
{ "id", INODE_NO_POOL(inverse_parent) },
{ "pool_id", (uint64_t)INODE_POOL(inverse_parent) },
}).dump()) },
} },
},
};
// Reparent children of inverse_child
for (auto & cp: parent->cli->st_cli.inode_config)
{
if (cp.second.parent_id == child_cfg->num)
{
auto cp_cfg = cp.second;
cp_cfg.parent_id = inverse_parent;
auto cp_key = base64_encode(
parent->cli->st_cli.etcd_prefix+
"/config/inode/"+std::to_string(INODE_POOL(cp.second.num))+
"/"+std::to_string(INODE_NO_POOL(cp.second.num))
);
cmp.push_back(json11::Json::object {
{ "target", "MOD" },
{ "key", cp_key },
{ "result", "LESS" },
{ "mod_revision", cp.second.mod_revision+1 },
});
txn.push_back(json11::Json::object {
{ "request_put", json11::Json::object {
{ "key", cp_key },
{ "value", base64_encode(json11::Json(parent->cli->st_cli.serialize_inode_cfg(&cp_cfg)).dump()) },
} },
});
}
}
parent->waiting++;
parent->cli->st_cli.etcd_txn(json11::Json::object {
{ "compare", cmp },
{ "success", txn },
}, ETCD_SLOW_TIMEOUT, [this, target_name, child_name](std::string err, json11::Json res)
{
parent->waiting--;
if (err != "")
{
fprintf(stderr, "Error renaming %s to %s: %s\n", target_name.c_str(), child_name.c_str(), err.c_str());
exit(1);
}
if (!res["succeeded"].bool_value())
{
fprintf(
stderr, "Parent (%s), child (%s), or one of its children"
" configuration was modified during rename\n", target_name.c_str(), child_name.c_str()
);
exit(1);
}
printf("Layer %s renamed to %s\n", target_name.c_str(), child_name.c_str());
parent->ringloop->wakeup();
});
}
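rename_inverse_parent() and delete_inode_config() below share the same optimistic-locking shape: every touched key is guarded by a compare on mod_revision, and the transaction body runs only if none of them changed since they were read. A minimal sketch of that shape with placeholder keys and values (not the exact keys from this diff):
#include "json11/json11.hpp"
#include <cstdio>
int main()
{
    int mod_revision = 41;   // revision at which the key was last read
    json11::Json txn = json11::Json::object {
        { "compare", json11::Json::array {
            json11::Json::object {
                { "target", "MOD" },
                { "key", "<base64-encoded config key>" },
                { "result", "LESS" },
                // succeeds only if the key was not modified after we read it
                { "mod_revision", mod_revision+1 },
            },
        } },
        { "success", json11::Json::array {
            json11::Json::object {
                { "request_put", json11::Json::object {
                    { "key", "<base64-encoded config key>" },
                    { "value", "<base64-encoded new inode config JSON>" },
                } },
            },
        } },
    };
    printf("%s\n", txn.dump().c_str());
    return 0;
}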
void delete_inode_config(inode_t cur)
{
auto cur_cfg_it = parent->cli->st_cli.inode_config.find(cur);
if (cur_cfg_it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode 0x%lx disappeared\n", cur);
exit(1);
}
inode_config_t *cur_cfg = &cur_cfg_it->second;
std::string cur_name = cur_cfg->name;
std::string cur_cfg_key = base64_encode(
parent->cli->st_cli.etcd_prefix+
"/config/inode/"+std::to_string(INODE_POOL(cur))+
"/"+std::to_string(INODE_NO_POOL(cur))
);
parent->waiting++;
parent->cli->st_cli.etcd_txn(json11::Json::object {
{ "compare", json11::Json::array {
json11::Json::object {
{ "target", "MOD" },
{ "key", cur_cfg_key },
{ "result", "LESS" },
{ "mod_revision", cur_cfg->mod_revision+1 },
},
} },
{ "success", json11::Json::array {
json11::Json::object {
{ "request_delete_range", json11::Json::object {
{ "key", cur_cfg_key },
} },
{ "request_delete_range", json11::Json::object {
{ "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/index/image/"+cur_name) },
} },
},
} },
}, ETCD_SLOW_TIMEOUT, [this, cur_name](std::string err, json11::Json res)
{
parent->waiting--;
if (err != "")
{
fprintf(stderr, "Error deleting %s: %s\n", cur_name.c_str(), err.c_str());
exit(1);
}
if (!res["succeeded"].bool_value())
{
fprintf(stderr, "Layer %s configuration was modified during deletion\n", cur_name.c_str());
exit(1);
}
printf("Layer %s deleted\n", cur_name.c_str());
parent->ringloop->wakeup();
});
}
void start_merge_child(inode_t child_inode, inode_t target_inode)
{
auto child_it = parent->cli->st_cli.inode_config.find(child_inode);
if (child_it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode %ld disappeared\n", child_inode);
exit(1);
}
auto target_it = parent->cli->st_cli.inode_config.find(target_inode);
if (target_it == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode %ld disappeared\n", target_inode);
exit(1);
}
cb = parent->start_merge(json11::Json::object {
{ "command", json11::Json::array{ "merge-data", from_name, child_it->second.name } },
{ "target", target_it->second.name },
{ "delete-source", false },
{ "cas", use_cas },
{ "fsync-interval", fsync_interval },
});
}
void start_delete_source(inode_t inode)
{
auto source = parent->cli->st_cli.inode_config.find(inode);
if (source == parent->cli->st_cli.inode_config.end())
{
fprintf(stderr, "Inode %ld disappeared\n", inode);
exit(1);
}
cb = parent->start_rm(json11::Json::object {
{ "inode", inode },
{ "pool", (uint64_t)INODE_POOL(inode) },
{ "fsync-interval", fsync_interval },
});
}
};
std::function<bool(void)> cli_tool_t::start_snap_rm(json11::Json cfg)
{
json11::Json::array cmd = cfg["command"].array_items();
auto snap_remover = new snap_remover_t();
snap_remover->parent = this;
snap_remover->from_name = cmd.size() > 1 ? cmd[1].string_value() : "";
snap_remover->to_name = cmd.size() > 2 ? cmd[2].string_value() : "";
if (snap_remover->from_name == "")
{
fprintf(stderr, "Layer to remove argument is missing\n");
exit(1);
}
if (snap_remover->to_name == "")
{
snap_remover->to_name = snap_remover->from_name;
}
snap_remover->fsync_interval = cfg["fsync-interval"].uint64_value();
if (!snap_remover->fsync_interval)
snap_remover->fsync_interval = 128;
if (!cfg["cas"].is_null())
snap_remover->use_cas = cfg["cas"].uint64_value() ? 2 : 0;
if (!cfg["writers_stopped"].is_null())
snap_remover->writers_stopped = true;
return [snap_remover]()
{
snap_remover->loop();
if (snap_remover->is_done())
{
delete snap_remover;
return true;
}
return false;
};
}

File diff suppressed because it is too large


@@ -8,9 +8,13 @@
#define MIN_BLOCK_SIZE 4*1024
#define MAX_BLOCK_SIZE 128*1024*1024
#define DEFAULT_DISK_ALIGNMENT 4096
#define DEFAULT_BITMAP_GRANULARITY 4096
#define DEFAULT_CLIENT_DIRTY_LIMIT 32*1024*1024
#define DEFAULT_CLIENT_MAX_DIRTY_BYTES 32*1024*1024
#define DEFAULT_CLIENT_MAX_DIRTY_OPS 1024
#define INODE_LIST_DONE 1
#define INODE_LIST_HAS_UNSTABLE 2
#define OSD_OP_READ_BITMAP OSD_OP_SEC_READ_BMP
#define OSD_OP_IGNORE_READONLY 0x08
struct cluster_op_t;
@@ -22,85 +26,126 @@ struct cluster_op_part_t
pg_num_t pg_num;
osd_num_t osd_num;
osd_op_buf_list_t iov;
bool sent;
bool done;
unsigned flags;
osd_op_t op;
};
struct cluster_op_t
{
uint64_t opcode; // OSD_OP_READ, OSD_OP_WRITE, OSD_OP_SYNC
uint64_t opcode; // OSD_OP_READ, OSD_OP_WRITE, OSD_OP_SYNC, OSD_OP_DELETE, OSD_OP_READ_BITMAP
uint64_t inode;
uint64_t offset;
uint64_t len;
// for reads and writes within a single object (stripe),
// reads can return current version and writes can use "CAS" semantics
uint64_t version = 0;
// now only OSD_OP_IGNORE_READONLY is supported
uint64_t flags = 0;
int retval;
osd_op_buf_list_t iov;
// READ and READ_BITMAP return the bitmap here
void *bitmap_buf = NULL;
std::function<void(cluster_op_t*)> callback;
~cluster_op_t();
protected:
int state = 0;
uint64_t cur_inode; // for snapshot reads
void *buf = NULL;
cluster_op_t *orig_op = NULL;
bool is_internal = false;
bool needs_reslice = false;
bool up_wait = false;
int sent_count = 0, done_count = 0;
int inflight_count = 0, done_count = 0;
std::vector<cluster_op_part_t> parts;
void *part_bitmaps = NULL;
unsigned bitmap_buf_size = 0;
cluster_op_t *prev = NULL, *next = NULL;
int prev_wait = 0;
friend class cluster_client_t;
};
struct cluster_buffer_t
{
void *buf;
uint64_t len;
int state;
};
struct inode_list_t;
struct inode_list_osd_t;
// FIXME: Split into public and private interfaces
class cluster_client_t
{
timerfd_manager_t *tfd;
ring_loop_t *ringloop;
uint64_t bs_block_size = 0;
uint64_t bs_disk_alignment = 0;
uint64_t bs_bitmap_granularity = 0;
uint32_t bs_bitmap_granularity = 0, bs_bitmap_size = 0;
std::map<pool_id_t, uint64_t> pg_counts;
bool immediate_commit = false;
// WARNING: initially true so execute() doesn't create fake sync
bool immediate_commit = true;
// FIXME: Implement inmemory_commit mode. Note that it requires to return overlapping reads from memory.
uint64_t client_dirty_limit = 0;
uint64_t client_max_dirty_bytes = 0;
uint64_t client_max_dirty_ops = 0;
int log_level;
int up_wait_retry_interval = 500; // ms
uint64_t op_id = 1;
ring_consumer_t consumer;
// operations currently in progress
std::set<cluster_op_t*> cur_ops;
int retry_timeout_id = 0;
// unsynced operations are copied in memory to allow replay when cluster isn't in the immediate_commit mode
// unsynced_writes are replayed in any order (because only the SYNC operation guarantees ordering)
std::vector<cluster_op_t*> unsynced_writes;
std::vector<cluster_op_t*> syncing_writes;
cluster_op_t* cur_sync = NULL;
std::vector<cluster_op_t*> next_writes;
uint64_t op_id = 1;
std::vector<cluster_op_t*> offline_ops;
uint64_t queued_bytes = 0;
cluster_op_t *op_queue_head = NULL, *op_queue_tail = NULL;
std::map<object_id, cluster_buffer_t> dirty_buffers;
std::set<osd_num_t> dirty_osds;
uint64_t dirty_bytes = 0, dirty_ops = 0;
void *scrap_buffer = NULL;
unsigned scrap_buffer_size = 0;
bool pgs_loaded = false;
ring_consumer_t consumer;
std::vector<std::function<void(void)>> on_ready_hooks;
std::vector<inode_list_t*> lists;
int continuing_ops = 0;
public:
etcd_state_client_t st_cli;
osd_messenger_t msgr;
json11::Json config;
cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config);
~cluster_client_t();
void execute(cluster_op_t *op);
bool is_ready();
void on_ready(std::function<void(void)> fn);
void stop();
static void copy_write(cluster_op_t *op, std::map<object_id, cluster_buffer_t> & dirty_buffers);
void continue_ops(bool up_retry = false);
inode_list_t *list_inode_start(inode_t inode,
std::function<void(inode_list_t* lst, std::set<object_id>&& objects, pg_num_t pg_num, osd_num_t primary_osd, int status)> callback);
int list_pg_count(inode_list_t *lst);
void list_inode_next(inode_list_t *lst, int next_pgs);
inline uint32_t get_bs_bitmap_granularity() { return bs_bitmap_granularity; }
inline uint64_t get_bs_block_size() { return bs_block_size; }
uint64_t next_op_id();
protected:
void continue_ops(bool up_retry = false);
bool affects_osd(uint64_t inode, uint64_t offset, uint64_t len, osd_num_t osd);
void flush_buffer(const object_id & oid, cluster_buffer_t *wr);
void on_load_config_hook(json11::Json::object & config);
void on_load_pgs_hook(bool success);
void on_change_hook(json11::Json::object & changes);
void on_change_hook(std::map<std::string, etcd_kv_t> & changes);
void on_change_osd_state_hook(uint64_t peer_osd);
void continue_rw(cluster_op_t *op);
int continue_rw(cluster_op_t *op);
void slice_rw(cluster_op_t *op);
bool try_send(cluster_op_t *op, cluster_op_part_t *part);
void execute_sync(cluster_op_t *op);
void continue_sync();
void finish_sync();
bool try_send(cluster_op_t *op, int i);
int continue_sync(cluster_op_t *op);
void send_sync(cluster_op_t *op, cluster_op_part_t *part);
void handle_op_part(cluster_op_part_t *part);
void copy_part_bitmap(cluster_op_t *op, cluster_op_part_t *part);
void erase_op(cluster_op_t *op);
void calc_wait(cluster_op_t *op);
void inc_wait(uint64_t opcode, uint64_t flags, cluster_op_t *next, int inc);
void continue_lists();
void continue_listing(inode_list_t *lst);
void send_list(inode_list_osd_t *cur_list);
};

285
src/cluster_client_list.cpp Normal file

@@ -0,0 +1,285 @@
// Copyright (c) Vitaliy Filippov, 2019+
// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
#include <algorithm>
#include "pg_states.h"
#include "cluster_client.h"
struct inode_list_t;
struct inode_list_pg_t;
struct inode_list_osd_t
{
inode_list_pg_t *pg = NULL;
osd_num_t osd_num = 0;
bool sent = false;
};
struct inode_list_pg_t
{
inode_list_t *lst = NULL;
int pos = 0;
pg_num_t pg_num;
osd_num_t cur_primary;
bool has_unstable = false;
int sent = 0;
int done = 0;
std::vector<inode_list_osd_t> list_osds;
std::set<object_id> objects;
};
struct inode_list_t
{
cluster_client_t *cli = NULL;
pool_id_t pool_id = 0;
inode_t inode = 0;
int done_pgs = 0;
int want = 0;
std::vector<inode_list_pg_t*> pgs;
std::function<void(inode_list_t* lst, std::set<object_id>&& objects, pg_num_t pg_num, osd_num_t primary_osd, int status)> callback;
};
inode_list_t* cluster_client_t::list_inode_start(inode_t inode,
std::function<void(inode_list_t* lst, std::set<object_id>&& objects, pg_num_t pg_num, osd_num_t primary_osd, int status)> callback)
{
int skipped_pgs = 0;
pool_id_t pool_id = INODE_POOL(inode);
if (!pool_id || st_cli.pool_config.find(pool_id) == st_cli.pool_config.end())
{
if (log_level > 0)
{
fprintf(stderr, "Pool %u does not exist\n", pool_id);
}
return NULL;
}
inode_list_t *lst = new inode_list_t();
lst->cli = this;
lst->pool_id = pool_id;
lst->inode = inode;
lst->callback = callback;
auto pool_cfg = st_cli.pool_config[pool_id];
for (auto & pg_item: pool_cfg.pg_config)
{
auto & pg = pg_item.second;
if (pg.pause || !pg.cur_primary || !(pg.cur_state & PG_ACTIVE))
{
skipped_pgs++;
if (log_level > 0)
{
fprintf(stderr, "PG %u is inactive, skipping\n", pg_item.first);
}
continue;
}
inode_list_pg_t *r = new inode_list_pg_t();
r->lst = lst;
r->pg_num = pg_item.first;
r->cur_primary = pg.cur_primary;
if (pg.cur_state != PG_ACTIVE)
{
// Not clean
std::set<osd_num_t> all_peers;
for (osd_num_t pg_osd: pg.target_set)
{
if (pg_osd != 0)
{
all_peers.insert(pg_osd);
}
}
for (osd_num_t pg_osd: pg.all_peers)
{
if (pg_osd != 0)
{
all_peers.insert(pg_osd);
}
}
for (auto & hist_item: pg.target_history)
{
for (auto pg_osd: hist_item)
{
if (pg_osd != 0)
{
all_peers.insert(pg_osd);
}
}
}
for (osd_num_t peer_osd: all_peers)
{
r->list_osds.push_back((inode_list_osd_t){
.pg = r,
.osd_num = peer_osd,
.sent = false,
});
}
}
else
{
// Clean
r->list_osds.push_back((inode_list_osd_t){
.pg = r,
.osd_num = pg.cur_primary,
.sent = false,
});
}
lst->pgs.push_back(r);
}
std::sort(lst->pgs.begin(), lst->pgs.end(), [](inode_list_pg_t *a, inode_list_pg_t *b)
{
return a->cur_primary < b->cur_primary ? true : false;
});
for (int i = 0; i < lst->pgs.size(); i++)
{
lst->pgs[i]->pos = i;
}
lists.push_back(lst);
return lst;
}
int cluster_client_t::list_pg_count(inode_list_t *lst)
{
return lst->pgs.size();
}
void cluster_client_t::list_inode_next(inode_list_t *lst, int next_pgs)
{
if (next_pgs >= 0)
{
lst->want += next_pgs;
}
continue_listing(lst);
}
void cluster_client_t::continue_listing(inode_list_t *lst)
{
if (lst->done_pgs >= lst->pgs.size())
{
// All done
for (int i = 0; i < lists.size(); i++)
{
if (lists[i] == lst)
{
lists.erase(lists.begin()+i, lists.begin()+i+1);
break;
}
}
delete lst;
return;
}
if (lst->want <= 0)
{
return;
}
for (int i = 0; i < lst->pgs.size(); i++)
{
if (lst->pgs[i] && lst->pgs[i]->sent < lst->pgs[i]->list_osds.size())
{
for (int j = 0; j < lst->pgs[i]->list_osds.size(); j++)
{
send_list(&lst->pgs[i]->list_osds[j]);
if (lst->want <= 0)
{
break;
}
}
}
}
}
void cluster_client_t::send_list(inode_list_osd_t *cur_list)
{
if (cur_list->sent)
{
return;
}
if (msgr.osd_peer_fds.find(cur_list->osd_num) == msgr.osd_peer_fds.end())
{
// Initiate connection
msgr.connect_peer(cur_list->osd_num, st_cli.peer_states[cur_list->osd_num]);
return;
}
auto & pool_cfg = st_cli.pool_config[cur_list->pg->lst->pool_id];
osd_op_t *op = new osd_op_t();
op->op_type = OSD_OP_OUT;
op->peer_fd = msgr.osd_peer_fds[cur_list->osd_num];
op->req = (osd_any_op_t){
.sec_list = {
.header = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = op_id++,
.opcode = OSD_OP_SEC_LIST,
},
.list_pg = cur_list->pg->pg_num,
.pg_count = (pg_num_t)pool_cfg.real_pg_count,
.pg_stripe_size = pool_cfg.pg_stripe_size,
.min_inode = cur_list->pg->lst->inode,
.max_inode = cur_list->pg->lst->inode,
},
};
op->callback = [this, cur_list](osd_op_t *op)
{
if (op->reply.hdr.retval < 0)
{
fprintf(stderr, "Failed to get PG %u/%u object list from OSD %lu (retval=%ld), skipping\n",
cur_list->pg->lst->pool_id, cur_list->pg->pg_num, cur_list->osd_num, op->reply.hdr.retval);
}
else
{
if (op->reply.sec_list.stable_count < op->reply.hdr.retval)
{
// Unstable objects, if present, mean that someone is still writing to the inode. Warn the user about it.
cur_list->pg->has_unstable = true;
fprintf(
stderr, "[PG %u/%u] Inode still has %lu unstable object versions out of total %lu - is it still open?\n",
cur_list->pg->lst->pool_id, cur_list->pg->pg_num, op->reply.hdr.retval - op->reply.sec_list.stable_count,
op->reply.hdr.retval
);
}
if (log_level > 0)
{
fprintf(
stderr, "[PG %u/%u] Got inode object list from OSD %lu: %ld object versions\n",
cur_list->pg->lst->pool_id, cur_list->pg->pg_num, cur_list->osd_num, op->reply.hdr.retval
);
}
for (uint64_t i = 0; i < op->reply.hdr.retval; i++)
{
object_id oid = ((obj_ver_id*)op->buf)[i].oid;
oid.stripe = oid.stripe & ~STRIPE_MASK;
cur_list->pg->objects.insert(oid);
}
}
delete op;
auto lst = cur_list->pg->lst;
auto pg = cur_list->pg;
pg->done++;
if (pg->done >= pg->list_osds.size())
{
int status = 0;
lst->done_pgs++;
if (lst->done_pgs >= lst->pgs.size())
{
status |= INODE_LIST_DONE;
}
if (pg->has_unstable)
{
status |= INODE_LIST_HAS_UNSTABLE;
}
lst->callback(lst, std::move(pg->objects), pg->pg_num, pg->cur_primary, status);
lst->pgs[pg->pos] = NULL;
delete pg;
}
continue_listing(lst);
};
msgr.outbox_push(op);
cur_list->sent = true;
cur_list->pg->sent++;
cur_list->pg->lst->want--;
}
void cluster_client_t::continue_lists()
{
for (auto lst: lists)
{
continue_listing(lst);
}
}


@@ -4,20 +4,42 @@
#include "osd_ops.h"
#include "pg_states.h"
#include "etcd_state_client.h"
#ifndef __MOCK__
#include "http_client.h"
#include "base64.h"
#endif
json_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json)
etcd_state_client_t::~etcd_state_client_t()
{
json_kv_t kv;
for (auto watch: watches)
{
delete watch;
}
watches.clear();
etcd_watches_initialised = -1;
#ifndef __MOCK__
if (etcd_watch_ws)
{
etcd_watch_ws->close();
etcd_watch_ws = NULL;
}
#endif
}
#ifndef __MOCK__
etcd_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json)
{
etcd_kv_t kv;
kv.key = base64_decode(kv_json["key"].string_value());
std::string json_err, json_text = base64_decode(kv_json["value"].string_value());
kv.value = json_text == "" ? json11::Json() : json11::Json::parse(json_text, json_err);
if (json_err != "")
{
printf("Bad JSON in etcd key %s: %s (value: %s)\n", kv.key.c_str(), json_err.c_str(), json_text.c_str());
fprintf(stderr, "Bad JSON in etcd key %s: %s (value: %s)\n", kv.key.c_str(), json_err.c_str(), json_text.c_str());
kv.key = "";
}
else
kv.mod_revision = kv_json["mod_revision"].uint64_value();
return kv;
}
@@ -28,6 +50,11 @@ void etcd_state_client_t::etcd_txn(json11::Json txn, int timeout, std::function<
void etcd_state_client_t::etcd_call(std::string api, json11::Json payload, int timeout, std::function<void(std::string, json11::Json)> callback)
{
if (!etcd_addresses.size())
{
fprintf(stderr, "etcd_address is missing in Vitastor configuration\n");
exit(1);
}
std::string etcd_address = etcd_addresses[rand() % etcd_addresses.size()];
std::string etcd_api_path;
int pos = etcd_address.find('/');
@@ -46,7 +73,24 @@ void etcd_state_client_t::etcd_call(std::string api, json11::Json payload, int t
http_request_json(tfd, etcd_address, req, timeout, callback);
}
void etcd_state_client_t::parse_config(json11::Json & config)
void etcd_state_client_t::add_etcd_url(std::string addr)
{
if (addr.length() > 0)
{
if (strtolower(addr.substr(0, 7)) == "http://")
addr = addr.substr(7);
else if (strtolower(addr.substr(0, 8)) == "https://")
{
fprintf(stderr, "HTTPS is unsupported for etcd. Either use plain HTTP or setup a local proxy for etcd interaction\n");
exit(1);
}
if (addr.find('/') == std::string::npos)
addr += "/v3";
this->etcd_addresses.push_back(addr);
}
}
void etcd_state_client_t::parse_config(const json11::Json & config)
{
this->etcd_addresses.clear();
if (config["etcd_address"].is_string())
@@ -55,13 +99,7 @@ void etcd_state_client_t::parse_config(json11::Json & config)
while (1)
{
int pos = ea.find(',');
std::string addr = pos >= 0 ? ea.substr(0, pos) : ea;
if (addr.length() > 0)
{
if (addr.find('/') < 0)
addr += "/v3";
this->etcd_addresses.push_back(addr);
}
add_etcd_url(pos >= 0 ? ea.substr(0, pos) : ea);
if (pos >= 0)
ea = ea.substr(pos+1);
else
@@ -72,13 +110,7 @@ void etcd_state_client_t::parse_config(json11::Json & config)
{
for (auto & ea: config["etcd_address"].array_items())
{
std::string addr = ea.string_value();
if (addr != "")
{
if (addr.find('/') < 0)
addr += "/v3";
this->etcd_addresses.push_back(addr);
}
add_etcd_url(ea.string_value());
}
}
this->etcd_prefix = config["etcd_prefix"].string_value();
@@ -95,6 +127,11 @@ void etcd_state_client_t::parse_config(json11::Json & config)
void etcd_state_client_t::start_etcd_watcher()
{
if (!etcd_addresses.size())
{
fprintf(stderr, "etcd_address is missing in Vitastor configuration\n");
exit(1);
}
std::string etcd_address = etcd_addresses[rand() % etcd_addresses.size()];
std::string etcd_api_path;
int pos = etcd_address.find('/');
@@ -112,7 +149,7 @@ void etcd_state_client_t::start_etcd_watcher()
json11::Json data = json11::Json::parse(msg->body, json_err);
if (json_err != "")
{
printf("Bad JSON in etcd event: %s, ignoring event\n", json_err.c_str());
fprintf(stderr, "Bad JSON in etcd event: %s, ignoring event\n", json_err.c_str());
}
else
{
@@ -125,22 +162,22 @@ void etcd_state_client_t::start_etcd_watcher()
etcd_watch_revision = data["result"]["header"]["revision"].uint64_value();
}
// First gather all changes into a hash to remove multiple overwrites
json11::Json::object changes;
std::map<std::string, etcd_kv_t> changes;
for (auto & ev: data["result"]["events"].array_items())
{
auto kv = parse_etcd_kv(ev["kv"]);
if (kv.key != "")
{
changes[kv.key] = kv.value;
changes[kv.key] = kv;
}
}
for (auto & kv: changes)
{
if (this->log_level > 3)
{
printf("Incoming event: %s -> %s\n", kv.first.c_str(), kv.second.dump().c_str());
fprintf(stderr, "Incoming event: %s -> %s\n", kv.first.c_str(), kv.second.value.dump().c_str());
}
parse_state(kv.first, kv.second);
parse_state(kv.second);
}
// React to changes
if (on_change_hook != NULL)
@@ -160,7 +197,7 @@ void etcd_state_client_t::start_etcd_watcher()
start_etcd_watcher();
});
}
else
else if (etcd_watches_initialised > 0)
{
// Connection was live, retry immediately
start_etcd_watcher();
@@ -213,7 +250,7 @@ void etcd_state_client_t::load_global_config()
{
if (err != "")
{
printf("Error reading OSD configuration from etcd: %s\n", err.c_str());
fprintf(stderr, "Error reading OSD configuration from etcd: %s\n", err.c_str());
tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int timer_id)
{
load_global_config();
@@ -251,6 +288,12 @@ void etcd_state_client_t::load_pgs()
{ "key", base64_encode(etcd_prefix+"/config/pgs") },
} }
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(etcd_prefix+"/config/inode/") },
{ "range_end", base64_encode(etcd_prefix+"/config/inode0") },
} }
},
json11::Json::object {
{ "request_range", json11::Json::object {
{ "key", base64_encode(etcd_prefix+"/pg/history/") },
@@ -280,7 +323,7 @@ void etcd_state_client_t::load_pgs()
{
if (err != "")
{
printf("Error loading PGs from etcd: %s\n", err.c_str());
fprintf(stderr, "Error loading PGs from etcd: %s\n", err.c_str());
tfd->set_timer(ETCD_SLOW_TIMEOUT, false, [this](int timer_id)
{
load_pgs();
@@ -301,16 +344,33 @@ void etcd_state_client_t::load_pgs()
for (auto & kv_json: res["response_range"]["kvs"].array_items())
{
auto kv = parse_etcd_kv(kv_json);
parse_state(kv.key, kv.value);
parse_state(kv);
}
}
on_load_pgs_hook(true);
start_etcd_watcher();
});
}
void etcd_state_client_t::parse_state(const std::string & key, const json11::Json & value)
#else
void etcd_state_client_t::parse_config(const json11::Json & config)
{
}
void etcd_state_client_t::load_global_config()
{
json11::Json::object global_config;
on_load_config_hook(global_config);
}
void etcd_state_client_t::load_pgs()
{
}
#endif
void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
{
const std::string & key = kv.key;
const json11::Json & value = kv.value;
if (key == etcd_prefix+"/config/pools")
{
for (auto & pool_item: this->pool_config)
@@ -321,10 +381,12 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
{
pool_config_t pc;
// ID
pool_id_t pool_id = stoull_full(pool_item.first);
if (!pool_id || pool_id >= POOL_ID_MAX)
pool_id_t pool_id;
char null_byte = 0;
sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
{
printf("Pool ID %s is invalid (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
fprintf(stderr, "Pool ID %s is invalid (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
continue;
}
pc.id = pool_id;
@@ -332,7 +394,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
pc.name = pool_item.second["name"].string_value();
if (pc.name == "")
{
printf("Pool %u has empty name, skipping pool\n", pool_id);
fprintf(stderr, "Pool %u has empty name, skipping pool\n", pool_id);
continue;
}
// Failure Domain
@@ -346,7 +408,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
pc.scheme = POOL_SCHEME_JERASURE;
else
{
printf("Pool %u has invalid coding scheme (one of \"xor\", \"replicated\" or \"jerasure\" required), skipping pool\n", pool_id);
fprintf(stderr, "Pool %u has invalid coding scheme (one of \"xor\", \"replicated\" or \"jerasure\" required), skipping pool\n", pool_id);
continue;
}
// PG Size
@@ -356,7 +418,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
(pc.scheme == POOL_SCHEME_XOR || pc.scheme == POOL_SCHEME_JERASURE) ||
pool_item.second["pg_size"].uint64_value() > 256)
{
printf("Pool %u has invalid pg_size, skipping pool\n", pool_id);
fprintf(stderr, "Pool %u has invalid pg_size, skipping pool\n", pool_id);
continue;
}
// Parity Chunks
@@ -365,7 +427,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
{
if (pc.parity_chunks > 1)
{
printf("Pool %u has invalid parity_chunks (must be 1), skipping pool\n", pool_id);
fprintf(stderr, "Pool %u has invalid parity_chunks (must be 1), skipping pool\n", pool_id);
continue;
}
pc.parity_chunks = 1;
@@ -373,7 +435,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
if (pc.scheme == POOL_SCHEME_JERASURE &&
(pc.parity_chunks < 1 || pc.parity_chunks > pc.pg_size-2))
{
printf("Pool %u has invalid parity_chunks (must be between 1 and pg_size-2), skipping pool\n", pool_id);
fprintf(stderr, "Pool %u has invalid parity_chunks (must be between 1 and pg_size-2), skipping pool\n", pool_id);
continue;
}
// PG MinSize
@@ -382,14 +444,14 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
(pc.scheme == POOL_SCHEME_XOR || pc.scheme == POOL_SCHEME_JERASURE) &&
pc.pg_minsize < (pc.pg_size-pc.parity_chunks))
{
printf("Pool %u has invalid pg_minsize, skipping pool\n", pool_id);
fprintf(stderr, "Pool %u has invalid pg_minsize, skipping pool\n", pool_id);
continue;
}
// PG Count
pc.pg_count = pool_item.second["pg_count"].uint64_value();
if (pc.pg_count < 1)
{
printf("Pool %u has invalid pg_count, skipping pool\n", pool_id);
fprintf(stderr, "Pool %u has invalid pg_count, skipping pool\n", pool_id);
continue;
}
// Max OSD Combinations
@@ -398,7 +460,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
pc.max_osd_combinations = 10000;
if (pc.max_osd_combinations > 0 && pc.max_osd_combinations < 100)
{
printf("Pool %u has invalid max_osd_combinations (must be at least 100), skipping pool\n", pool_id);
fprintf(stderr, "Pool %u has invalid max_osd_combinations (must be at least 100), skipping pool\n", pool_id);
continue;
}
// PG Stripe Size
@@ -416,7 +478,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
{
if (pg_item.second.target_set.size() != parsed_cfg.pg_size)
{
printf("Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n",
fprintf(stderr, "Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n",
pool_id, pg_item.first, pg_item.second.target_set.size(), parsed_cfg.pg_size);
pg_item.second.pause = true;
}
@@ -434,18 +496,21 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
}
for (auto & pool_item: value["items"].object_items())
{
pool_id_t pool_id = stoull_full(pool_item.first);
if (!pool_id || pool_id >= POOL_ID_MAX)
pool_id_t pool_id;
char null_byte = 0;
sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
{
printf("Pool ID %s is invalid in PG configuration (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
fprintf(stderr, "Pool ID %s is invalid in PG configuration (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
continue;
}
for (auto & pg_item: pool_item.second.object_items())
{
pg_num_t pg_num = stoull_full(pg_item.first);
if (!pg_num)
pg_num_t pg_num = 0;
sscanf(pg_item.first.c_str(), "%u%c", &pg_num, &null_byte);
if (!pg_num || null_byte != 0)
{
printf("Bad key in pool %u PG configuration: %s (must be a number), skipped\n", pool_id, pg_item.first.c_str());
fprintf(stderr, "Bad key in pool %u PG configuration: %s (must be a number), skipped\n", pool_id, pg_item.first.c_str());
continue;
}
auto & parsed_cfg = this->pool_config[pool_id].pg_config[pg_num];
@@ -459,7 +524,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
}
if (parsed_cfg.target_set.size() != pool_config[pool_id].pg_size)
{
printf("Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n",
fprintf(stderr, "Pool %u PG %u configuration is invalid: osd_set size %lu != pool pg_size %lu\n",
pool_id, pg_num, parsed_cfg.target_set.size(), pool_config[pool_id].pg_size);
parsed_cfg.pause = true;
}
@@ -472,8 +537,8 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
{
if (pg_it->second.exists && pg_it->first != ++n)
{
printf(
"Invalid pool %u PG configuration: PG numbers don't cover whole 1..%lu range\n",
fprintf(
stderr, "Invalid pool %u PG configuration: PG numbers don't cover whole 1..%lu range\n",
pool_item.second.id, pool_item.second.pg_config.size()
);
for (pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); pg_it++)
@@ -496,7 +561,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
sscanf(key.c_str() + etcd_prefix.length()+12, "%u/%u%c", &pool_id, &pg_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !pg_num || null_byte != 0)
{
printf("Bad etcd key %s, ignoring\n", key.c_str());
fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
}
else
{
@@ -535,7 +600,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
sscanf(key.c_str() + etcd_prefix.length()+10, "%u/%u%c", &pool_id, &pg_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !pg_num || null_byte != 0)
{
printf("Bad etcd key %s, ignoring\n", key.c_str());
fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
}
else if (value.is_null())
{
@@ -559,7 +624,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
}
if (i >= pg_state_bit_count)
{
printf("Unexpected pool %u PG %u state keyword in etcd: %s\n", pool_id, pg_num, e.dump().c_str());
fprintf(stderr, "Unexpected pool %u PG %u state keyword in etcd: %s\n", pool_id, pg_num, e.dump().c_str());
return;
}
}
@@ -568,7 +633,7 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
(state & PG_PEERING) && state != PG_PEERING ||
(state & PG_INCOMPLETE) && state != PG_INCOMPLETE)
{
printf("Unexpected pool %u PG %u state in etcd: primary=%lu, state=%s\n", pool_id, pg_num, cur_primary, value["state"].dump().c_str());
fprintf(stderr, "Unexpected pool %u PG %u state in etcd: primary=%lu, state=%s\n", pool_id, pg_num, cur_primary, value["state"].dump().c_str());
return;
}
this->pool_config[pool_id].pg_config[pg_num].cur_primary = cur_primary;
@@ -597,4 +662,125 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
}
}
}
else if (key.substr(0, etcd_prefix.length()+14) == etcd_prefix+"/config/inode/")
{
// <etcd_prefix>/config/inode/%d/%d
uint64_t pool_id = 0;
uint64_t inode_num = 0;
char null_byte = 0;
sscanf(key.c_str() + etcd_prefix.length()+14, "%lu/%lu%c", &pool_id, &inode_num, &null_byte);
if (!pool_id || pool_id >= POOL_ID_MAX || !inode_num || (inode_num >> (64-POOL_ID_BITS)) || null_byte != 0)
{
fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
}
else
{
inode_num |= (pool_id << (64-POOL_ID_BITS));
auto it = this->inode_config.find(inode_num);
if (it != this->inode_config.end() && it->second.name != "")
{
auto n_it = this->inode_by_name.find(it->second.name);
if (n_it->second == inode_num)
{
this->inode_by_name.erase(n_it);
for (auto w: watches)
{
if (w->name == it->second.name)
{
w->cfg = { 0 };
}
}
}
}
if (!value.is_object())
{
this->inode_config.erase(inode_num);
}
else
{
inode_t parent_inode_num = value["parent_id"].uint64_value();
if (parent_inode_num && !(parent_inode_num >> (64-POOL_ID_BITS)))
{
uint64_t parent_pool_id = value["parent_pool"].uint64_value();
if (!parent_pool_id)
parent_inode_num |= pool_id << (64-POOL_ID_BITS);
else if (parent_pool_id >= POOL_ID_MAX)
{
fprintf(
stderr, "Inode %lu/%lu parent_pool value is invalid, ignoring parent setting\n",
inode_num >> (64-POOL_ID_BITS), inode_num & ((1l << (64-POOL_ID_BITS)) - 1)
);
parent_inode_num = 0;
}
else
parent_inode_num |= parent_pool_id << (64-POOL_ID_BITS);
}
inode_config_t cfg = (inode_config_t){
.num = inode_num,
.name = value["name"].string_value(),
.size = value["size"].uint64_value(),
.parent_id = parent_inode_num,
.readonly = value["readonly"].bool_value(),
.mod_revision = kv.mod_revision,
};
this->inode_config[inode_num] = cfg;
if (cfg.name != "")
{
this->inode_by_name[cfg.name] = inode_num;
for (auto w: watches)
{
if (w->name == value["name"].string_value())
{
w->cfg = cfg;
}
}
}
}
}
}
}
inode_watch_t* etcd_state_client_t::watch_inode(std::string name)
{
inode_watch_t *watch = new inode_watch_t;
watch->name = name;
watches.push_back(watch);
auto it = inode_by_name.find(name);
if (it != inode_by_name.end())
{
watch->cfg = inode_config[it->second];
}
return watch;
}
void etcd_state_client_t::close_watch(inode_watch_t* watch)
{
for (int i = 0; i < watches.size(); i++)
{
if (watches[i] == watch)
{
watches.erase(watches.begin()+i, watches.begin()+i+1);
break;
}
}
delete watch;
}
json11::Json::object etcd_state_client_t::serialize_inode_cfg(inode_config_t *cfg)
{
json11::Json::object new_cfg = json11::Json::object {
{ "name", cfg->name },
{ "size", cfg->size },
};
if (cfg->parent_id)
{
if (INODE_POOL(cfg->num) != INODE_POOL(cfg->parent_id))
new_cfg["parent_pool"] = (uint64_t)INODE_POOL(cfg->parent_id);
new_cfg["parent_id"] = (uint64_t)INODE_NO_POOL(cfg->parent_id);
}
if (cfg->readonly)
{
new_cfg["readonly"] = true;
}
return new_cfg;
}
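// For reference, a rough sketch of the inode metadata this code reads and writes:
// parse_state() above consumes keys of the form
// <etcd_prefix>/config/inode/<pool_id>/<inode_number>, and serialize_inode_cfg()
// produces values shaped like
//   { "name": "testimg", "size": 10737418240, "parent_id": 1, "parent_pool": 2, "readonly": true }
// where "parent_pool" appears only when the parent lives in a different pool,
// "readonly" only when set, and the name/size/id values here are purely illustrative.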


@@ -3,8 +3,8 @@
#pragma once
#include "json11/json11.hpp"
#include "osd_id.h"
#include "http_client.h"
#include "timerfd_manager.h"
#define ETCD_CONFIG_WATCH_ID 1
@@ -18,10 +18,11 @@
#define DEFAULT_BLOCK_SIZE 128*1024
struct json_kv_t
struct etcd_kv_t
{
std::string key;
json11::Json value;
uint64_t mod_revision;
};
struct pg_config_t
@@ -52,8 +53,33 @@ struct pool_config_t
std::map<pg_num_t, pg_config_t> pg_config;
};
struct inode_config_t
{
uint64_t num;
std::string name;
uint64_t size;
inode_t parent_id;
bool readonly;
// Change revision of the metadata in etcd
uint64_t mod_revision;
};
struct inode_watch_t
{
std::string name;
inode_config_t cfg;
};
struct websocket_t;
struct etcd_state_client_t
{
protected:
std::vector<inode_watch_t*> watches;
websocket_t *etcd_watch_ws = NULL;
uint64_t bs_block_size = DEFAULT_BLOCK_SIZE;
void add_etcd_url(std::string);
public:
std::vector<std::string> etcd_addresses;
std::string etcd_prefix;
int log_level = 0;
@@ -61,24 +87,28 @@ struct etcd_state_client_t
int etcd_watches_initialised = 0;
uint64_t etcd_watch_revision = 0;
websocket_t *etcd_watch_ws = NULL;
uint64_t bs_block_size = 0;
std::map<pool_id_t, pool_config_t> pool_config;
std::map<osd_num_t, json11::Json> peer_states;
std::map<inode_t, inode_config_t> inode_config;
std::map<std::string, inode_t> inode_by_name;
std::function<void(json11::Json::object &)> on_change_hook;
std::function<void(std::map<std::string, etcd_kv_t> &)> on_change_hook;
std::function<void(json11::Json::object &)> on_load_config_hook;
std::function<json11::Json()> load_pgs_checks_hook;
std::function<void(bool)> on_load_pgs_hook;
std::function<void(pool_id_t, pg_num_t)> on_change_pg_history_hook;
std::function<void(osd_num_t)> on_change_osd_state_hook;
json_kv_t parse_etcd_kv(const json11::Json & kv_json);
json11::Json::object serialize_inode_cfg(inode_config_t *cfg);
etcd_kv_t parse_etcd_kv(const json11::Json & kv_json);
void etcd_call(std::string api, json11::Json payload, int timeout, std::function<void(std::string, json11::Json)> callback);
void etcd_txn(json11::Json txn, int timeout, std::function<void(std::string, json11::Json)> callback);
void start_etcd_watcher();
void load_global_config();
void load_pgs();
void parse_state(const std::string & key, const json11::Json & value);
void parse_config(json11::Json & config);
void parse_state(const etcd_kv_t & kv);
void parse_config(const json11::Json & config);
inode_watch_t* watch_inode(std::string name);
void close_watch(inode_watch_t* watch);
~etcd_state_client_t();
};
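// A minimal usage sketch for the inode watch API declared above ("st_cli" and the
// image name are illustrative assumptions; client setup/teardown is not shown):
//   inode_watch_t *w = st_cli.watch_inode("testimg");
//   uint64_t size = w->cfg.size; // kept up to date by parse_state() on etcd changes
//   st_cli.close_watch(w);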


@@ -6,17 +6,17 @@
// Random write:
//
// fio -thread -ioengine=./libfio_cluster.so -name=test -bs=4k -direct=1 -fsync=16 -iodepth=16 -rw=randwrite \
// -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -pool=1 -inode=1 -size=1000M
// -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] (-image=testimg | -pool=1 -inode=1 -size=1000M)
//
// Linear write:
//
// fio -thread -ioengine=./libfio_cluster.so -name=test -bs=128k -direct=1 -fsync=32 -iodepth=32 -rw=write \
// -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -pool=1 -inode=1 -size=1000M
// -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -image=testimg
//
// Random read (run with -iodepth=32 or -iodepth=1):
//
// fio -thread -ioengine=./libfio_cluster.so -name=test -bs=4k -direct=1 -iodepth=32 -rw=randread \
// -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -pool=1 -inode=1 -size=1000M
// -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -image=testimg
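//
// With -image=..., the test size is taken from the image metadata in etcd
// (see sec_setup() below), so -size does not have to be specified.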
#include <sys/types.h>
#include <sys/socket.h>
@@ -24,36 +24,49 @@
#include <netinet/tcp.h>
#include <vector>
#include <unordered_map>
#include "epoll_manager.h"
#include "cluster_client.h"
#include "vitastor_c.h"
#include "fio_headers.h"
struct sec_data
{
ring_loop_t *ringloop = NULL;
epoll_manager_t *epmgr = NULL;
cluster_client_t *cli = NULL;
vitastor_c *cli = NULL;
void *watch = NULL;
bool last_sync = false;
/* The list of completed io_u structs. */
std::vector<io_u*> completed;
uint64_t op_n = 0, inflight = 0;
uint64_t inflight = 0;
bool trace = false;
};
struct sec_options
{
int __pad;
char *config_path = NULL;
char *etcd_host = NULL;
char *etcd_prefix = NULL;
char *image = NULL;
uint64_t pool = 0;
uint64_t inode = 0;
int cluster_log = 0;
int trace = 0;
int use_rdma = 0;
char *rdma_device = NULL;
int rdma_port_num = 0;
int rdma_gid_index = 0;
int rdma_mtu = 0;
};
static struct fio_option options[] = {
{
.name = "conf",
.lname = "Vitastor config path",
.type = FIO_OPT_STR_STORE,
.off1 = offsetof(struct sec_options, config_path),
.help = "Vitastor config path",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "etcd",
.lname = "etcd address",
@@ -64,7 +77,7 @@ static struct fio_option options[] = {
.group = FIO_OPT_G_FILENAME,
},
{
.name = "etcd",
.name = "etcd_prefix",
.lname = "etcd key prefix",
.type = FIO_OPT_STR_STORE,
.off1 = offsetof(struct sec_options, etcd_prefix),
@@ -72,6 +85,15 @@ static struct fio_option options[] = {
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "image",
.lname = "Vitastor image name",
.type = FIO_OPT_STR_STORE,
.off1 = offsetof(struct sec_options, image),
.help = "Vitastor image name to run tests on",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "pool",
.lname = "pool number for the inode",
@@ -86,7 +108,7 @@ static struct fio_option options[] = {
.lname = "inode to run tests on",
.type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, inode),
.help = "inode to run tests on (1 by default)",
.help = "inode number to run tests on",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
@@ -110,13 +132,69 @@ static struct fio_option options[] = {
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "use_rdma",
.lname = "Use RDMA",
.type = FIO_OPT_BOOL,
.off1 = offsetof(struct sec_options, use_rdma),
.help = "Use RDMA",
.def = "-1",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "rdma_device",
.lname = "RDMA device name",
.type = FIO_OPT_STR_STORE,
.off1 = offsetof(struct sec_options, rdma_device),
.help = "RDMA device name",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "rdma_port_num",
.lname = "RDMA port number",
.type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, rdma_port_num),
.help = "RDMA port number",
.def = "0",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "rdma_gid_index",
.lname = "RDMA gid index",
.type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, rdma_gid_index),
.help = "RDMA gid index",
.def = "0",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = "rdma_mtu",
.lname = "RDMA path MTU",
.type = FIO_OPT_INT,
.off1 = offsetof(struct sec_options, rdma_mtu),
.help = "RDMA path MTU",
.def = "0",
.category = FIO_OPT_C_ENGINE,
.group = FIO_OPT_G_FILENAME,
},
{
.name = NULL,
},
};
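// An example invocation combining the image and RDMA options above (the device
// name and other values are illustrative, not defaults):
//
// fio -thread -ioengine=./libfio_cluster.so -name=test -bs=4k -direct=1 -iodepth=16 -rw=randwrite \
//    -etcd=127.0.0.1:2379 -image=testimg -use_rdma=1 -rdma_device=rocep0s0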
static void watch_callback(void *opaque, long watch)
{
struct sec_data *bsd = (struct sec_data*)opaque;
bsd->watch = (void*)watch;
}
static int sec_setup(struct thread_data *td)
{
sec_options *o = (sec_options*)td->eo;
sec_data *bsd;
bsd = new sec_data;
@@ -134,6 +212,45 @@ static int sec_setup(struct thread_data *td)
td->o.open_files++;
}
if (!o->image)
{
if (!(o->inode & ((1l << (64-POOL_ID_BITS)) - 1)))
{
td_verror(td, EINVAL, "inode number is missing");
return 1;
}
if (o->pool)
{
o->inode = (o->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (o->pool << (64-POOL_ID_BITS));
}
if (!(o->inode >> (64-POOL_ID_BITS)))
{
td_verror(td, EINVAL, "pool is missing");
return 1;
}
}
else
{
o->inode = 0;
}
bsd->cli = vitastor_c_create_uring(o->config_path, o->etcd_host, o->etcd_prefix,
o->use_rdma, o->rdma_device, o->rdma_port_num, o->rdma_gid_index, o->rdma_mtu, o->cluster_log);
if (o->image)
{
bsd->watch = NULL;
vitastor_c_watch_inode(bsd->cli, o->image, watch_callback, bsd);
while (true)
{
vitastor_c_uring_handle_events(bsd->cli);
if (bsd->watch)
break;
vitastor_c_uring_wait_events(bsd->cli);
}
td->files[0]->real_file_size = vitastor_c_inode_get_size(bsd->watch);
}
bsd->trace = o->trace ? true : false;
return 0;
}
@@ -142,9 +259,11 @@ static void sec_cleanup(struct thread_data *td)
sec_data *bsd = (sec_data*)td->io_ops_data;
if (bsd)
{
delete bsd->cli;
delete bsd->epmgr;
delete bsd->ringloop;
if (bsd->watch)
{
vitastor_c_close_watch(bsd->cli, bsd->watch);
}
vitastor_c_destroy(bsd->cli);
delete bsd;
}
}
@@ -152,37 +271,34 @@ static void sec_cleanup(struct thread_data *td)
/* Connect to the server from each thread. */
static int sec_init(struct thread_data *td)
{
sec_options *o = (sec_options*)td->eo;
sec_data *bsd = (sec_data*)td->io_ops_data;
json11::Json cfg = json11::Json::object {
{ "etcd_address", std::string(o->etcd_host) },
{ "etcd_prefix", std::string(o->etcd_prefix ? o->etcd_prefix : "/vitastor") },
{ "log_level", o->cluster_log },
};
if (o->pool)
o->inode = (o->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (o->pool << (64-POOL_ID_BITS));
if (!(o->inode >> (64-POOL_ID_BITS)))
{
td_verror(td, EINVAL, "pool is missing");
return 1;
}
bsd->ringloop = new ring_loop_t(512);
bsd->epmgr = new epoll_manager_t(bsd->ringloop);
bsd->cli = new cluster_client_t(bsd->ringloop, bsd->epmgr->tfd, cfg);
bsd->trace = o->trace ? true : false;
return 0;
}
static void io_callback(void *opaque, long retval)
{
struct io_u *io = (struct io_u*)opaque;
io->error = retval < 0 ? -retval : 0;
sec_data *bsd = (sec_data*)io->engine_data;
bsd->inflight--;
bsd->completed.push_back(io);
if (bsd->trace)
{
printf("--- %s 0x%lx retval=%ld\n", io->ddir == DDIR_READ ? "READ" :
(io->ddir == DDIR_WRITE ? "WRITE" : "SYNC"), (uint64_t)io, retval);
}
}
static void read_callback(void *opaque, long retval, uint64_t version)
{
io_callback(opaque, retval);
}
/* Begin read or write request. */
static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
{
sec_options *opt = (sec_options*)td->eo;
sec_data *bsd = (sec_data*)td->io_ops_data;
int n = bsd->op_n;
struct iovec iov;
fio_ro_check(td, io);
if (io->ddir == DDIR_SYNC && bsd->last_sync)
@@ -191,28 +307,29 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
}
io->engine_data = bsd;
cluster_op_t *op = new cluster_op_t;
io->error = 0;
bsd->inflight++;
uint64_t inode = opt->image ? vitastor_c_inode_get_num(bsd->watch) : opt->inode;
switch (io->ddir)
{
case DDIR_READ:
op->opcode = OSD_OP_READ;
op->inode = opt->inode;
op->offset = io->offset;
op->len = io->xfer_buflen;
op->iov.push_back(io->xfer_buf, io->xfer_buflen);
iov = { .iov_base = io->xfer_buf, .iov_len = io->xfer_buflen };
vitastor_c_read(bsd->cli, inode, io->offset, io->xfer_buflen, &iov, 1, read_callback, io);
bsd->last_sync = false;
break;
case DDIR_WRITE:
op->opcode = OSD_OP_WRITE;
op->inode = opt->inode;
op->offset = io->offset;
op->len = io->xfer_buflen;
op->iov.push_back(io->xfer_buf, io->xfer_buflen);
if (opt->image && vitastor_c_inode_get_readonly(bsd->watch))
{
io->error = EROFS;
return FIO_Q_COMPLETED;
}
iov = { .iov_base = io->xfer_buf, .iov_len = io->xfer_buflen };
vitastor_c_write(bsd->cli, inode, io->offset, io->xfer_buflen, 0, &iov, 1, io_callback, io);
bsd->last_sync = false;
break;
case DDIR_SYNC:
op->opcode = OSD_OP_SYNC;
vitastor_c_sync(bsd->cli, io_callback, io);
bsd->last_sync = true;
break;
default:
@@ -220,39 +337,20 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
return FIO_Q_COMPLETED;
}
op->callback = [io, n](cluster_op_t *op)
{
io->error = op->retval < 0 ? -op->retval : 0;
sec_data *bsd = (sec_data*)io->engine_data;
bsd->inflight--;
bsd->completed.push_back(io);
if (bsd->trace)
{
printf("--- %s n=%d retval=%d\n", io->ddir == DDIR_READ ? "READ" :
(io->ddir == DDIR_WRITE ? "WRITE" : "SYNC"), n, op->retval);
}
delete op;
};
if (opt->trace)
{
if (io->ddir == DDIR_SYNC)
{
printf("+++ SYNC # %d\n", n);
printf("+++ SYNC 0x%lx\n", (uint64_t)io);
}
else
{
printf("+++ %s # %d 0x%llx+%llx\n",
printf("+++ %s 0x%lx 0x%llx+%llx\n",
io->ddir == DDIR_READ ? "READ" : "WRITE",
n, io->offset, io->xfer_buflen);
(uint64_t)io, io->offset, io->xfer_buflen);
}
}
io->error = 0;
bsd->inflight++;
bsd->op_n++;
bsd->cli->execute(op);
if (io->error != 0)
return FIO_Q_COMPLETED;
return FIO_Q_QUEUED;
@@ -263,10 +361,10 @@ static int sec_getevents(struct thread_data *td, unsigned int min, unsigned int
sec_data *bsd = (sec_data*)td->io_ops_data;
while (true)
{
bsd->ringloop->loop();
vitastor_c_uring_handle_events(bsd->cli);
if (bsd->completed.size() >= min)
break;
bsd->ringloop->wait();
vitastor_c_uring_wait_events(bsd->cli);
}
return bsd->completed.size();
}


@@ -25,6 +25,7 @@
// -bs_config='{"data_device":"./test_data.bin"}' -size=1000M
#include "blockstore.h"
#include "epoll_manager.h"
#include "fio_headers.h"
#include "json11/json11.hpp"
@@ -32,6 +33,7 @@
struct bs_data
{
blockstore_t *bs;
epoll_manager_t *epmgr;
ring_loop_t *ringloop;
/* The list of completed io_u structs. */
std::vector<io_u*> completed;
@@ -104,6 +106,7 @@ static void bs_cleanup(struct thread_data *td)
}
safe:
delete bsd->bs;
delete bsd->epmgr;
delete bsd->ringloop;
delete bsd;
}
@@ -129,7 +132,8 @@ static int bs_init(struct thread_data *td)
}
}
bsd->ringloop = new ring_loop_t(512);
bsd->bs = new blockstore_t(config, bsd->ringloop);
bsd->epmgr = new epoll_manager_t(bsd->ringloop);
bsd->bs = new blockstore_t(config, bsd->ringloop, bsd->epmgr->tfd);
while (1)
{
bsd->ringloop->loop();


@@ -22,7 +22,6 @@
#define READ_BUFFER_SIZE 9000
static int extract_port(std::string & host);
static std::string strtolower(const std::string & in);
static std::string trim(const std::string & in);
static std::string ws_format_frame(int type, uint64_t size);
static bool ws_parse_frame(std::string & buf, int & type, std::string & res);
@@ -673,7 +672,7 @@ static int extract_port(std::string & host)
return port;
}
static std::string strtolower(const std::string & in)
std::string strtolower(const std::string & in)
{
std::string s = in;
for (int i = 0; i < s.length(); i++)


@@ -49,6 +49,8 @@ std::vector<std::string> getifaddr_list(bool include_v6 = false);
uint64_t stoull_full(const std::string & str, int base = 10);
std::string strtolower(const std::string & in);
void http_request(timerfd_manager_t *tfd, const std::string & host, const std::string & request,
const http_options_t & options, std::function<void(const http_response_t *response)> callback);


@@ -10,28 +10,168 @@
#include "messenger.h"
osd_op_t::~osd_op_t()
void osd_messenger_t::init()
{
assert(!bs_op);
assert(!op_data);
if (rmw_buf)
#ifdef WITH_RDMA
if (use_rdma)
{
free(rmw_buf);
rdma_context = msgr_rdma_context_t::create(
rdma_device != "" ? rdma_device.c_str() : NULL,
rdma_port_num, rdma_gid_index, rdma_mtu
);
if (!rdma_context)
{
fprintf(stderr, "[OSD %lu] Couldn't initialize RDMA, proceeding with TCP only\n", osd_num);
}
else
{
rdma_max_sge = rdma_max_sge < rdma_context->attrx.orig_attr.max_sge
? rdma_max_sge : rdma_context->attrx.orig_attr.max_sge;
fprintf(stderr, "[OSD %lu] RDMA initialized successfully\n", osd_num);
fcntl(rdma_context->channel->fd, F_SETFL, fcntl(rdma_context->channel->fd, F_GETFL, 0) | O_NONBLOCK);
tfd->set_fd_handler(rdma_context->channel->fd, false, [this](int notify_fd, int epoll_events)
{
handle_rdma_events();
});
handle_rdma_events();
}
}
if (buf)
#endif
keepalive_timer_id = tfd->set_timer(1000, true, [this](int)
{
// Note: reusing osd_op_t WILL currently lead to memory leaks
// So we don't reuse it, but free it every time
free(buf);
}
std::vector<int> to_stop;
std::vector<osd_op_t*> to_ping;
for (auto cl_it = clients.begin(); cl_it != clients.end(); cl_it++)
{
auto cl = cl_it->second;
if (!cl->osd_num || cl->peer_state != PEER_CONNECTED && cl->peer_state != PEER_RDMA)
{
// Do not run keepalive on regular clients
continue;
}
if (cl->ping_time_remaining > 0)
{
cl->ping_time_remaining--;
if (!cl->ping_time_remaining)
{
// Ping timed out, stop the client
fprintf(stderr, "Ping timed out for OSD %lu (client %d), disconnecting peer\n", cl->osd_num, cl->peer_fd);
to_stop.push_back(cl->peer_fd);
}
}
else if (cl->idle_time_remaining > 0)
{
cl->idle_time_remaining--;
if (!cl->idle_time_remaining)
{
// Connection has been idle for osd_idle_timeout, send a ping
osd_op_t *op = new osd_op_t();
op->op_type = OSD_OP_OUT;
op->peer_fd = cl->peer_fd;
op->req = (osd_any_op_t){
.hdr = {
.magic = SECONDARY_OSD_OP_MAGIC,
.id = this->next_subop_id++,
.opcode = OSD_OP_PING,
},
};
op->callback = [this, cl](osd_op_t *op)
{
int fail_fd = (op->reply.hdr.retval != 0 ? op->peer_fd : -1);
cl->ping_time_remaining = 0;
delete op;
if (fail_fd >= 0)
{
fprintf(stderr, "Ping failed for OSD %lu (client %d), disconnecting peer\n", cl->osd_num, cl->peer_fd);
stop_client(fail_fd, true);
}
};
to_ping.push_back(op);
cl->ping_time_remaining = osd_ping_timeout;
cl->idle_time_remaining = osd_idle_timeout;
}
}
else
{
cl->idle_time_remaining = osd_idle_timeout;
}
}
// Don't stop clients while a 'clients' iterator is still active
for (int peer_fd: to_stop)
{
stop_client(peer_fd, true);
}
for (auto op: to_ping)
{
outbox_push(op);
}
});
}
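// Note: the keepalive timer above fires every 1000 ms, so the osd_idle_timeout
// and osd_ping_timeout values handled in parse_config() below are effectively
// counted in seconds.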
osd_messenger_t::~osd_messenger_t()
{
if (keepalive_timer_id >= 0)
{
tfd->clear_timer(keepalive_timer_id);
keepalive_timer_id = -1;
}
while (clients.size() > 0)
{
stop_client(clients.begin()->first, true);
stop_client(clients.begin()->first, true, true);
}
#ifdef WITH_RDMA
if (rdma_context)
{
delete rdma_context;
}
#endif
}
void osd_messenger_t::parse_config(const json11::Json & config)
{
#ifdef WITH_RDMA
if (!config["use_rdma"].is_null())
{
// RDMA is on by default in RDMA-enabled builds
this->use_rdma = config["use_rdma"].bool_value() || config["use_rdma"].uint64_value() != 0;
}
this->rdma_device = config["rdma_device"].string_value();
this->rdma_port_num = (uint8_t)config["rdma_port_num"].uint64_value();
if (!this->rdma_port_num)
this->rdma_port_num = 1;
this->rdma_gid_index = (uint8_t)config["rdma_gid_index"].uint64_value();
this->rdma_mtu = (uint32_t)config["rdma_mtu"].uint64_value();
this->rdma_max_sge = config["rdma_max_sge"].uint64_value();
if (!this->rdma_max_sge)
this->rdma_max_sge = 128;
this->rdma_max_send = config["rdma_max_send"].uint64_value();
if (!this->rdma_max_send)
this->rdma_max_send = 32;
this->rdma_max_recv = config["rdma_max_recv"].uint64_value();
if (!this->rdma_max_recv)
this->rdma_max_recv = 8;
this->rdma_max_msg = config["rdma_max_msg"].uint64_value();
if (!this->rdma_max_msg || this->rdma_max_msg > 128*1024*1024)
this->rdma_max_msg = 1024*1024;
#endif
this->receive_buffer_size = (uint32_t)config["tcp_header_buffer_size"].uint64_value();
if (!this->receive_buffer_size || this->receive_buffer_size > 1024*1024*1024)
this->receive_buffer_size = 65536;
this->use_sync_send_recv = config["use_sync_send_recv"].bool_value() ||
config["use_sync_send_recv"].uint64_value();
this->peer_connect_interval = config["peer_connect_interval"].uint64_value();
if (!this->peer_connect_interval)
this->peer_connect_interval = 5;
this->peer_connect_timeout = config["peer_connect_timeout"].uint64_value();
if (!this->peer_connect_timeout)
this->peer_connect_timeout = 5;
this->osd_idle_timeout = config["osd_idle_timeout"].uint64_value();
if (!this->osd_idle_timeout)
this->osd_idle_timeout = 5;
this->osd_ping_timeout = config["osd_ping_timeout"].uint64_value();
if (!this->osd_ping_timeout)
this->osd_ping_timeout = 5;
this->log_level = config["log_level"].uint64_value();
}
void osd_messenger_t::connect_peer(uint64_t peer_osd, json11::Json peer_state)
@@ -49,17 +189,14 @@ void osd_messenger_t::connect_peer(uint64_t peer_osd, json11::Json peer_state)
wanted_peers[peer_osd].port = (int)peer_state["port"].int64_value();
}
wanted_peers[peer_osd].address_changed = true;
if (!wanted_peers[peer_osd].connecting &&
(time(NULL) - wanted_peers[peer_osd].last_connect_attempt) >= peer_connect_interval)
{
try_connect_peer(peer_osd);
}
try_connect_peer(peer_osd);
}
void osd_messenger_t::try_connect_peer(uint64_t peer_osd)
{
auto wp_it = wanted_peers.find(peer_osd);
if (wp_it == wanted_peers.end())
if (wp_it == wanted_peers.end() || wp_it->second.connecting ||
(time(NULL) - wp_it->second.last_connect_attempt) < peer_connect_interval)
{
return;
}
@@ -105,31 +242,29 @@ void osd_messenger_t::try_connect_peer_addr(osd_num_t peer_osd, const char *peer
on_connect_peer(peer_osd, -errno);
return;
}
int timeout_id = -1;
if (peer_connect_timeout > 0)
{
timeout_id = tfd->set_timer(1000*peer_connect_timeout, false, [this, peer_fd](int timer_id)
{
osd_num_t peer_osd = clients.at(peer_fd)->osd_num;
stop_client(peer_fd, true);
on_connect_peer(peer_osd, -EIO);
return;
});
}
clients[peer_fd] = new osd_client_t((osd_client_t){
.peer_addr = addr,
.peer_port = peer_port,
.peer_fd = peer_fd,
.peer_state = PEER_CONNECTING,
.connect_timeout_id = timeout_id,
.osd_num = peer_osd,
.in_buf = malloc_or_die(receive_buffer_size),
});
clients[peer_fd] = new osd_client_t();
clients[peer_fd]->peer_addr = addr;
clients[peer_fd]->peer_port = peer_port;
clients[peer_fd]->peer_fd = peer_fd;
clients[peer_fd]->peer_state = PEER_CONNECTING;
clients[peer_fd]->connect_timeout_id = -1;
clients[peer_fd]->osd_num = peer_osd;
clients[peer_fd]->in_buf = malloc_or_die(receive_buffer_size);
tfd->set_fd_handler(peer_fd, true, [this](int peer_fd, int epoll_events)
{
// Either OUT (connected) or HUP
handle_connect_epoll(peer_fd);
});
if (peer_connect_timeout > 0)
{
clients[peer_fd]->connect_timeout_id = tfd->set_timer(1000*peer_connect_timeout, false, [this, peer_fd](int timer_id)
{
osd_num_t peer_osd = clients.at(peer_fd)->osd_num;
stop_client(peer_fd, true);
on_connect_peer(peer_osd, -EPIPE);
return;
});
}
}
void osd_messenger_t::handle_connect_epoll(int peer_fd)
@@ -170,7 +305,7 @@ void osd_messenger_t::handle_peer_epoll(int peer_fd, int epoll_events)
if (epoll_events & EPOLLRDHUP)
{
// Stop client
printf("[OSD %lu] client %d disconnected\n", this->osd_num, peer_fd);
fprintf(stderr, "[OSD %lu] client %d disconnected\n", this->osd_num, peer_fd);
stop_client(peer_fd, true);
}
else if (epoll_events & EPOLLIN)
@@ -195,7 +330,7 @@ void osd_messenger_t::on_connect_peer(osd_num_t peer_osd, int peer_fd)
wp.connecting = false;
if (peer_fd < 0)
{
printf("Failed to connect to peer OSD %lu address %s port %d: %s\n", peer_osd, wp.cur_addr.c_str(), wp.cur_port, strerror(-peer_fd));
fprintf(stderr, "Failed to connect to peer OSD %lu address %s port %d: %s\n", peer_osd, wp.cur_addr.c_str(), wp.cur_port, strerror(-peer_fd));
if (wp.address_changed)
{
wp.address_changed = false;
@@ -222,7 +357,7 @@ void osd_messenger_t::on_connect_peer(osd_num_t peer_osd, int peer_fd)
}
if (log_level > 0)
{
printf("[OSD %lu] Connected with peer OSD %lu (client %d)\n", osd_num, peer_osd, peer_fd);
fprintf(stderr, "[OSD %lu] Connected with peer OSD %lu (client %d)\n", osd_num, peer_osd, peer_fd);
}
wanted_peers.erase(peer_osd);
repeer_pgs(peer_osd);
@@ -242,6 +377,24 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
},
},
};
#ifdef WITH_RDMA
if (rdma_context)
{
cl->rdma_conn = msgr_rdma_connection_t::create(rdma_context, rdma_max_send, rdma_max_recv, rdma_max_sge, rdma_max_msg);
if (cl->rdma_conn)
{
json11::Json payload = json11::Json::object {
{ "connect_rdma", cl->rdma_conn->addr.to_string() },
{ "rdma_max_msg", cl->rdma_conn->max_msg },
};
std::string payload_str = payload.dump();
op->req.show_conf.json_len = payload_str.size();
op->buf = malloc_or_die(payload_str.size());
op->iov.push_back(op->buf, payload_str.size());
memcpy(op->buf, payload_str.c_str(), payload_str.size());
}
}
#endif
op->callback = [this, cl](osd_op_t *op)
{
std::string json_err;
@@ -250,7 +403,7 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
if (op->reply.hdr.retval < 0)
{
err = true;
printf("Failed to get config from OSD %lu (retval=%ld), disconnecting peer\n", cl->osd_num, op->reply.hdr.retval);
fprintf(stderr, "Failed to get config from OSD %lu (retval=%ld), disconnecting peer\n", cl->osd_num, op->reply.hdr.retval);
}
else
{
@@ -258,22 +411,69 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
if (json_err != "")
{
err = true;
printf("Failed to get config from OSD %lu: bad JSON: %s, disconnecting peer\n", cl->osd_num, json_err.c_str());
fprintf(stderr, "Failed to get config from OSD %lu: bad JSON: %s, disconnecting peer\n", cl->osd_num, json_err.c_str());
}
else if (config["osd_num"].uint64_value() != cl->osd_num)
{
err = true;
printf("Connected to OSD %lu instead of OSD %lu, peer state is outdated, disconnecting peer\n", config["osd_num"].uint64_value(), cl->osd_num);
fprintf(stderr, "Connected to OSD %lu instead of OSD %lu, peer state is outdated, disconnecting peer\n", config["osd_num"].uint64_value(), cl->osd_num);
}
else if (config["protocol_version"].uint64_value() != OSD_PROTOCOL_VERSION)
{
err = true;
fprintf(
stderr, "OSD %lu protocol version is %lu, but only version %u is supported.\n"
" If you need to upgrade from 0.5.x please request it via the issue tracker.\n",
cl->osd_num, config["protocol_version"].uint64_value(), OSD_PROTOCOL_VERSION
);
}
}
if (err)
{
osd_num_t osd_num = cl->osd_num;
osd_num_t peer_osd = cl->osd_num;
stop_client(op->peer_fd);
on_connect_peer(osd_num, -1);
on_connect_peer(peer_osd, -1);
delete op;
return;
}
#ifdef WITH_RDMA
if (config["rdma_address"].is_string())
{
msgr_rdma_address_t addr;
if (!msgr_rdma_address_t::from_string(config["rdma_address"].string_value().c_str(), &addr) ||
cl->rdma_conn->connect(&addr) != 0)
{
fprintf(
stderr, "Failed to connect to OSD %lu (address %s) using RDMA\n",
cl->osd_num, config["rdma_address"].string_value().c_str()
);
delete cl->rdma_conn;
cl->rdma_conn = NULL;
// FIXME: Keep TCP connection in this case
osd_num_t peer_osd = cl->osd_num;
stop_client(cl->peer_fd);
on_connect_peer(peer_osd, -1);
delete op;
return;
}
else
{
uint64_t server_max_msg = config["rdma_max_msg"].uint64_value();
if (cl->rdma_conn->max_msg > server_max_msg)
{
cl->rdma_conn->max_msg = server_max_msg;
}
if (log_level > 0)
{
fprintf(stderr, "Connected to OSD %lu using RDMA\n", cl->osd_num);
}
cl->peer_state = PEER_RDMA;
tfd->set_fd_handler(cl->peer_fd, false, NULL);
// Add the initial receive request
try_recv_rdma(cl);
}
}
#endif
osd_peer_fds[cl->osd_num] = cl->peer_fd;
on_connect_peer(cl->osd_num, cl->peer_fd);
delete op;
@@ -281,123 +481,6 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
outbox_push(op);
}
void osd_messenger_t::cancel_osd_ops(osd_client_t *cl)
{
for (auto p: cl->sent_ops)
{
cancel_op(p.second);
}
cl->sent_ops.clear();
cl->outbox.clear();
}
void osd_messenger_t::cancel_op(osd_op_t *op)
{
if (op->op_type == OSD_OP_OUT)
{
op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
op->reply.hdr.id = op->req.hdr.id;
op->reply.hdr.opcode = op->req.hdr.opcode;
op->reply.hdr.retval = -EPIPE;
// Copy lambda to be unaffected by `delete op`
std::function<void(osd_op_t*)>(op->callback)(op);
}
else
{
// This function is only called in stop_client(), so it's fine to destroy the operation
delete op;
}
}
void osd_messenger_t::stop_client(int peer_fd, bool force)
{
assert(peer_fd != 0);
auto it = clients.find(peer_fd);
if (it == clients.end())
{
return;
}
uint64_t repeer_osd = 0;
osd_client_t *cl = it->second;
if (cl->peer_state == PEER_CONNECTED)
{
if (cl->osd_num)
{
// Reload configuration from etcd when the connection is dropped
if (log_level > 0)
printf("[OSD %lu] Stopping client %d (OSD peer %lu)\n", osd_num, peer_fd, cl->osd_num);
repeer_osd = cl->osd_num;
}
else
{
if (log_level > 0)
printf("[OSD %lu] Stopping client %d (regular client)\n", osd_num, peer_fd);
}
}
else if (!force)
{
return;
}
cl->peer_state = PEER_STOPPED;
clients.erase(it);
tfd->set_fd_handler(peer_fd, false, NULL);
if (cl->connect_timeout_id >= 0)
{
tfd->clear_timer(cl->connect_timeout_id);
cl->connect_timeout_id = -1;
}
if (cl->osd_num)
{
osd_peer_fds.erase(cl->osd_num);
}
if (cl->read_op)
{
if (cl->read_op->callback)
{
cancel_op(cl->read_op);
}
else
{
delete cl->read_op;
}
cl->read_op = NULL;
}
for (auto rit = read_ready_clients.begin(); rit != read_ready_clients.end(); rit++)
{
if (*rit == peer_fd)
{
read_ready_clients.erase(rit);
break;
}
}
for (auto wit = write_ready_clients.begin(); wit != write_ready_clients.end(); wit++)
{
if (*wit == peer_fd)
{
write_ready_clients.erase(wit);
break;
}
}
free(cl->in_buf);
cl->in_buf = NULL;
close(peer_fd);
if (repeer_osd)
{
// First repeer PGs as canceling OSD ops may push new operations
// and we need correct PG states when we do that
repeer_pgs(repeer_osd);
}
if (cl->osd_num)
{
// Cancel outbound operations
cancel_osd_ops(cl);
}
if (cl->refs <= 0)
{
delete cl;
}
}
void osd_messenger_t::accept_connections(int listen_fd)
{
// Accept new connections
@@ -408,18 +491,17 @@ void osd_messenger_t::accept_connections(int listen_fd)
{
assert(peer_fd != 0);
char peer_str[256];
printf("[OSD %lu] new client %d: connection from %s port %d\n", this->osd_num, peer_fd,
fprintf(stderr, "[OSD %lu] new client %d: connection from %s port %d\n", this->osd_num, peer_fd,
inet_ntop(AF_INET, &addr.sin_addr, peer_str, 256), ntohs(addr.sin_port));
fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK);
int one = 1;
setsockopt(peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
clients[peer_fd] = new osd_client_t((osd_client_t){
.peer_addr = addr,
.peer_port = ntohs(addr.sin_port),
.peer_fd = peer_fd,
.peer_state = PEER_CONNECTED,
.in_buf = malloc_or_die(receive_buffer_size),
});
clients[peer_fd] = new osd_client_t();
clients[peer_fd]->peer_addr = addr;
clients[peer_fd]->peer_port = ntohs(addr.sin_port);
clients[peer_fd]->peer_fd = peer_fd;
clients[peer_fd]->peer_state = PEER_CONNECTED;
clients[peer_fd]->in_buf = malloc_or_die(receive_buffer_size);
// Add FD to epoll
tfd->set_fd_handler(peer_fd, false, [this](int peer_fd, int epoll_events)
{
@@ -433,3 +515,59 @@ void osd_messenger_t::accept_connections(int listen_fd)
throw std::runtime_error(std::string("accept: ") + strerror(errno));
}
}
#ifdef WITH_RDMA
bool osd_messenger_t::is_rdma_enabled()
{
return rdma_context != NULL;
}
#endif
json11::Json osd_messenger_t::read_config(const json11::Json & config)
{
const char *config_path = config["config_path"].string_value() != ""
? config["config_path"].string_value().c_str() : VITASTOR_CONFIG_PATH;
int fd = open(config_path, O_RDONLY);
if (fd < 0)
{
if (errno != ENOENT)
fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno));
return config;
}
struct stat st;
if (fstat(fd, &st) != 0)
{
fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno));
close(fd);
return config;
}
std::string buf;
buf.resize(st.st_size);
int done = 0;
while (done < st.st_size)
{
int r = read(fd, (void*)buf.data()+done, st.st_size-done);
if (r < 0)
{
fprintf(stderr, "Error reading %s: %s\n", config_path, strerror(errno));
close(fd);
return config;
}
done += r;
}
close(fd);
std::string json_err;
json11::Json::object file_config = json11::Json::parse(buf, json_err).object_items();
if (json_err != "")
{
fprintf(stderr, "Invalid JSON in %s: %s\n", config_path, json_err.c_str());
return config;
}
file_config.erase("config_path");
file_config.erase("osd_num");
for (auto kv: config.object_items())
{
file_config[kv.first] = kv.second;
}
return file_config;
}
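// A minimal sketch of a config file that read_config() above would pick up from
// config_path (by default /etc/vitastor/vitastor.conf), limited to keys handled
// by parse_config(); the values are illustrative:
//   { "use_rdma": true, "rdma_device": "rocep0s0", "osd_ping_timeout": 10, "log_level": 1 }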


@@ -14,179 +14,35 @@
#include "malloc_or_die.h"
#include "json11/json11.hpp"
#include "osd_ops.h"
#include "msgr_op.h"
#include "timerfd_manager.h"
#include "ringloop.h"
#include <ringloop.h>
#define OSD_OP_IN 0
#define OSD_OP_OUT 1
#ifdef WITH_RDMA
#include "msgr_rdma.h"
#endif
#define CL_READ_HDR 1
#define CL_READ_DATA 2
#define CL_READ_REPLY_DATA 3
#define CL_WRITE_READY 1
#define CL_WRITE_REPLY 2
#define OSD_OP_INLINE_BUF_COUNT 16
#define PEER_CONNECTING 1
#define PEER_CONNECTED 2
#define PEER_STOPPED 3
#define PEER_RDMA_CONNECTING 3
#define PEER_RDMA 4
#define PEER_STOPPED 5
#define DEFAULT_PEER_CONNECT_INTERVAL 5
#define DEFAULT_PEER_CONNECT_TIMEOUT 5
#define DEFAULT_BITMAP_GRANULARITY 4096
#define VITASTOR_CONFIG_PATH "/etc/vitastor/vitastor.conf"
// Kind of a vector with small-list-optimisation
struct osd_op_buf_list_t
#define MSGR_SENDP_HDR 1
#define MSGR_SENDP_FREE 2
struct msgr_sendp_t
{
int count = 0, alloc = OSD_OP_INLINE_BUF_COUNT, done = 0;
iovec *buf = NULL;
iovec inline_buf[OSD_OP_INLINE_BUF_COUNT];
inline osd_op_buf_list_t()
{
buf = inline_buf;
}
inline osd_op_buf_list_t(const osd_op_buf_list_t & other)
{
buf = inline_buf;
append(other);
}
inline osd_op_buf_list_t & operator = (const osd_op_buf_list_t & other)
{
reset();
append(other);
return *this;
}
inline ~osd_op_buf_list_t()
{
if (buf && buf != inline_buf)
{
free(buf);
}
}
inline void reset()
{
count = 0;
done = 0;
}
inline iovec* get_iovec()
{
return buf + done;
}
inline int get_size()
{
return count - done;
}
inline void append(const osd_op_buf_list_t & other)
{
if (count+other.count > alloc)
{
if (buf == inline_buf)
{
int old = alloc;
alloc = (((count+other.count+15)/16)*16);
buf = (iovec*)malloc(sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
memcpy(buf, inline_buf, sizeof(iovec) * old);
}
else
{
alloc = (((count+other.count+15)/16)*16);
buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
}
}
for (int i = 0; i < other.count; i++)
{
buf[count++] = other.buf[i];
}
}
inline void push_back(void *nbuf, size_t len)
{
if (count >= alloc)
{
if (buf == inline_buf)
{
int old = alloc;
alloc = ((alloc/16)*16 + 1);
buf = (iovec*)malloc(sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
memcpy(buf, inline_buf, sizeof(iovec)*old);
}
else
{
alloc = alloc < 16 ? 16 : (alloc+16);
buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
if (!buf)
{
printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
exit(1);
}
}
}
buf[count++] = { .iov_base = nbuf, .iov_len = len };
}
inline void eat(int result)
{
while (result > 0 && done < count)
{
iovec & iov = buf[done];
if (iov.iov_len <= result)
{
result -= iov.iov_len;
done++;
}
else
{
iov.iov_len -= result;
iov.iov_base += result;
break;
}
}
}
};
struct blockstore_op_t;
struct osd_primary_op_data_t;
struct osd_op_t
{
timespec tv_begin;
uint64_t op_type = OSD_OP_IN;
int peer_fd;
osd_any_op_t req;
osd_any_reply_t reply;
blockstore_op_t *bs_op = NULL;
void *buf = NULL;
void *rmw_buf = NULL;
osd_primary_op_data_t* op_data = NULL;
std::function<void(osd_op_t*)> callback;
osd_op_buf_list_t iov;
~osd_op_t();
osd_op_t *op;
int flags;
};
struct osd_client_t
@@ -198,10 +54,16 @@ struct osd_client_t
int peer_fd;
int peer_state;
int connect_timeout_id = -1;
int ping_time_remaining = 0;
int idle_time_remaining = 0;
osd_num_t osd_num = 0;
void *in_buf = NULL;
#ifdef WITH_RDMA
msgr_rdma_connection_t *rdma_conn = NULL;
#endif
// Read state
int read_ready = 0;
osd_op_t *read_op = NULL;
@@ -224,7 +86,13 @@ struct osd_client_t
msghdr write_msg = { 0 };
int write_state = 0;
std::vector<iovec> send_list, next_send_list;
std::vector<osd_op_t*> outbox, next_outbox;
std::vector<msgr_sendp_t> outbox, next_outbox;
~osd_client_t()
{
free(in_buf);
in_buf = NULL;
}
};
struct osd_wanted_peer_t
@@ -249,45 +117,65 @@ struct osd_op_stats_t
struct osd_messenger_t
{
timerfd_manager_t *tfd;
ring_loop_t *ringloop;
protected:
int keepalive_timer_id = -1;
// osd_num_t is only for logging and asserts
osd_num_t osd_num;
// FIXME: make receive_buffer_size configurable
int receive_buffer_size = 64*1024;
int peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
int peer_connect_timeout = DEFAULT_PEER_CONNECT_TIMEOUT;
uint32_t receive_buffer_size = 0;
int peer_connect_interval = 0;
int peer_connect_timeout = 0;
int osd_idle_timeout = 0;
int osd_ping_timeout = 0;
int log_level = 0;
bool use_sync_send_recv = false;
std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
std::map<uint64_t, int> osd_peer_fds;
uint64_t next_subop_id = 1;
#ifdef WITH_RDMA
bool use_rdma = true;
std::string rdma_device;
uint64_t rdma_port_num = 1, rdma_gid_index = 0, rdma_mtu = 0;
msgr_rdma_context_t *rdma_context = NULL;
uint64_t rdma_max_sge = 0, rdma_max_send = 0, rdma_max_recv = 8;
uint64_t rdma_max_msg = 0;
#endif
std::map<int, osd_client_t*> clients;
std::vector<int> read_ready_clients;
std::vector<int> write_ready_clients;
std::vector<std::function<void()>> set_immediate;
public:
timerfd_manager_t *tfd;
ring_loop_t *ringloop;
// osd_num_t is only for logging and asserts
osd_num_t osd_num;
uint64_t next_subop_id = 1;
std::map<int, osd_client_t*> clients;
std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
std::map<uint64_t, int> osd_peer_fds;
// op statistics
osd_op_stats_t stats;
public:
void init();
void parse_config(const json11::Json & config);
void connect_peer(uint64_t osd_num, json11::Json peer_state);
void stop_client(int peer_fd, bool force = false);
void stop_client(int peer_fd, bool force = false, bool force_delete = false);
void outbox_push(osd_op_t *cur_op);
std::function<void(osd_op_t*)> exec_op;
std::function<void(osd_num_t)> repeer_pgs;
void handle_peer_epoll(int peer_fd, int epoll_events);
void read_requests();
void send_replies();
void accept_connections(int listen_fd);
~osd_messenger_t();
static json11::Json read_config(const json11::Json & config);
#ifdef WITH_RDMA
bool is_rdma_enabled();
bool connect_rdma(int peer_fd, std::string rdma_address, uint64_t client_max_msg);
#endif
protected:
void try_connect_peer(uint64_t osd_num);
void try_connect_peer_addr(osd_num_t peer_osd, const char *peer_host, int peer_port);
void handle_peer_epoll(int peer_fd, int epoll_events);
void handle_connect_epoll(int peer_fd);
void on_connect_peer(osd_num_t peer_osd, int peer_fd);
void check_peer_config(osd_client_t *cl);
@@ -299,8 +187,15 @@ protected:
void handle_send(int result, osd_client_t *cl);
bool handle_read(int result, osd_client_t *cl);
bool handle_read_buffer(osd_client_t *cl, void *curbuf, int remain);
bool handle_finished_read(osd_client_t *cl);
void handle_op_hdr(osd_client_t *cl);
bool handle_reply_hdr(osd_client_t *cl);
void handle_reply_ready(osd_op_t *op);
#ifdef WITH_RDMA
bool try_send_rdma(osd_client_t *cl);
bool try_recv_rdma(osd_client_t *cl);
void handle_rdma_events();
#endif
};

src/mock/build.sh Normal file

@@ -0,0 +1 @@
g++ -D__MOCK__ -fsanitize=address -g -Wno-pointer-arith pg_states.cpp osd_ops.cpp test_cluster_client.cpp cluster_client.cpp msgr_op.cpp msgr_stop.cpp mock/messenger.cpp etcd_state_client.cpp timerfd_manager.cpp ../json11/json11.cpp -I mock -I . -I ..; ./a.out

Some files were not shown because too many files have changed in this diff.