Release 0.6.2

- Fix a possible crash during SYNC when journal fsyncs are enabled
- Fix a memory leak in the chained read implementation
2021-04-15 23:40:06 +03:00 · 2021-04-15 02:01:50 +03:00 · 2021-04-15 01:42:18 +03:00 · 2021-04-15 01:13:34 +03:00 · 2021-04-14 22:32:23 +03:00 · 2021-04-14 22:32:15 +03:00
88 changed files with 5849 additions and 2181 deletions
--- a/27
+++ b/27
@ -0,0 +1,27 @@
+Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
+
+All server-side code (OSD, Monitor and so on) is licensed under the terms of
+Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
+GNU GPLv3.0 with the additional "Network Interaction" clause which requires
+opensourcing all programs directly or indirectly interacting with Vitastor
+through a computer network and expressly designed to be used in conjunction
+with it ("Proxy Programs"). Proxy Programs may be made public not only under
+the terms of the same license, but also under the terms of any GPL-Compatible
+Free Software License, as listed by the Free Software Foundation.
+This is a stricter copyleft license than the Affero GPL.
+
+Please note that VNPL doesn't require you to open the code of proprietary
+software running inside a VM if it's not specially designed to be used with
+Vitastor.
+
+Basically, you can't use the software in a proprietary environment to provide
+its functionality to users without opensourcing all intermediary components
+standing between the user and Vitastor or purchasing a commercial license
+from the author 😀.
+
+Client libraries (cluster_client and so on) are dual-licensed under the same
+VNPL 1.1 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
+software like QEMU and fio.
+
+You can find the full text of VNPL-1.1 in the file [VNPL-1.1.txt](VNPL-1.1.txt).
+GPL 2.0 is also included in this repository as [GPL-2.0.txt](GPL-2.0.txt).
--- a/README-ru.md
+++ b/README-ru.md
@ -0,0 +1,495 @@
+## Vitastor
+
+[Read English version](README.md)
+
+## The Idea
+
+I just want to make a good block SDS!
+
+Vitastor is a distributed block SDS, a direct counterpart of Ceph RBD and of the internal storage
+systems of popular cloud providers. Unlike them, however, Vitastor is fast and simple at the same time.
+It is just still small for now :-).
+
+Architectural similarity to Ceph means strong consistency built into the write algorithms,
+replication through the primary OSD, symmetric clustering with no single point of failure,
+and automatic data distribution over any number of drives of any size with configurable redundancy
+schemes: replication or arbitrary erasure codes.
+
+## Features
+
+Vitastor is currently in pre-release status: advanced features are still missing,
+and breaking changes are likely in future versions.
+
+However, the following is already implemented:
+
+0.5.x (stable):
+- Basic part: reliable clustered block storage with no single point of failure
+- Performance ;-D
+- Multiple fault tolerance schemes: replication, XOR n+1 (1 parity drive), Reed-Solomon erasure
+  codes based on the jerasure library with any number of data and parity drives in a group
+- Configuration via simple human-readable JSON structures in etcd
+- Automatic data distribution over OSDs, with support for:
+  - Mathematical optimization for more even distribution and minimized data movement
+  - Multiple pools with different redundancy schemes
+  - Placement tree, OSD selection by tags / device classes (SSD only, HDD only) and by subtree
+  - Configurable failure domains (disk/server/rack and so on)
+- Recovery of degraded blocks
+- Rebalancing, that is, moving data between OSDs (drives)
+- Lazy fsync support (fsync not on every operation)
+- I/O statistics collection in etcd
+- Userspace client library for I/O
+- QEMU disk driver (built outside the QEMU source tree)
+- Disk driver for the fio benchmarking tool (also built outside the fio source tree)
+- NBD proxy for mounting images via the kernel ("userspace block device")
+- Image/inode removal tool (vitastor-rm)
+- Packages for Debian and CentOS
+
+0.6.x (master branch):
+- Per-inode I/O and space usage statistics
+- Inode naming through storing inode metadata in etcd
+- Snapshots and copy-on-write clones
+- Smoothing of random write performance in SSD+HDD configurations
+
+## Roadmap
+
+- More correct disk partitioning and OSD auto-start scripts
+- Other administrative tools
+- Plugins for OpenStack, Kubernetes, OpenNebula, Proxmox and other cloud systems
+- iSCSI proxy
+- Faster failover
+- Background integrity checks without checksums (comparison of replicas)
+- Checksums
+- SSD caching support (tiered storage)
+- RDMA and NVDIMM support
+- Web interface
+- Compression (possibly)
+- Data caching through the system page cache (possibly)
+
+## Architecture
+
+Just like in Ceph, in Vitastor:
+
+- There are pools, PGs, OSDs, monitors, failure domains and a placement tree (a counterpart of the CRUSH tree).
+- Images are split into fixed-size blocks (objects), and these objects are distributed over OSDs.
+- OSDs have a journal and metadata, which can also be placed on separate fast drives.
+- All write operations are also transactional. Vitastor does have a deferred/lazy fsync (commit)
+  mode in which fsync is not called on every write operation, which makes it more suitable
+  for "bad" (desktop) SSDs. However, all write operations are atomic in any case.
+- The client library also tries to wait for recovery after any cluster failure, so you can
+  likewise reboot even the whole cluster at once, and clients will only hang for a while
+  but will not disconnect.
+
+Some basic terms for those not familiar with Ceph:
+
+- OSD (Object Storage Daemon): a process that stores data on one drive and handles
+  read/write requests from clients.
+- Pool: a container for data that shares the same redundancy scheme and OSD placement rules.
+- PG (Placement Group): a group of objects stored on the same set of replicas (OSDs).
+  Several PGs may be stored on the same set of replicas, but objects of one PG
+  are normally not stored on different sets of OSDs.
+- Monitor: a daemon that stores the cluster state.
+- Failure Domain: a group of OSDs that you allow to fail all together.
+  In other words, it is a group of OSDs in which the storage system never places different copies
+  of the same data block. For example, if the failure domain is a server, then two drives of one
+  server will never hold 2 or more copies of the same data block, which means that even
+  if all drives in that server fail, it is equivalent to losing only 1 copy
+  of any data block.
+- Placement Tree (CRUSH Tree): a hierarchical grouping of OSDs
+  into nodes that can then be used as failure domains. That is, a drive (OSD) belongs to
+  a server, the server belongs to a rack, the rack belongs to a row, the row to a datacenter and so on.
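
The failure-domain rule from the term list above can be checked mechanically. The following is a minimal Python sketch, not Vitastor code: the OSD numbers, the host names and the `osd_to_domain` mapping are hypothetical, and the function only reports failure domains that hold more than one copy of a PG.

```python
from collections import Counter


def shared_failure_domains(pg_osds, osd_to_domain):
    """Return the failure domains that hold more than one copy of a PG.

    pg_osds       -- OSD numbers that store the copies of one PG
    osd_to_domain -- hypothetical mapping from OSD number to its failure domain
    """
    counts = Counter(osd_to_domain[osd] for osd in pg_osds)
    return sorted(domain for domain, n in counts.items() if n > 1)


# Hypothetical placement tree: 2 hosts with 2 OSDs each, failure domain = host.
osd_to_domain = {1: "host1", 2: "host1", 3: "host2", 4: "host2"}

print(shared_failure_domains([1, 3], osd_to_domain))  # copies on different hosts
print(shared_failure_domains([1, 2], osd_to_domain))  # both copies on host1
```

An empty result means that losing any single failure domain costs at most one copy of each block in that PG.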
+
+How Vitastor differs from Ceph:
+
+- Vitastor is primarily focused on SSDs. Vitastor should probably also work reasonably well
+  with a combination of SSD and HDD through bcache, and native SSD+HDD optimizations
+  may be added in the future. However, storage based purely on hard drives, with no SSDs at all,
+  is not a priority, so optimizations for that case may never happen.
+- The Vitastor OSD is single-threaded and will always remain so, because this is the most efficient way to work.
+  If 1 core per drive is not enough for you, just split the drive into partitions and run several OSDs on it.
+  Most likely, though, 1 core will be enough: Vitastor is not as CPU-hungry as Ceph.
+- The journal and metadata are always kept in memory, so no extra time is ever spent
+  reading metadata from disk. Metadata size grows linearly with the drive size and shrinks as the data
+  block size grows; the block size is set in the cluster configuration and is 128 KB by default. With 128 KB
+  blocks, metadata takes about 512 MB of memory per 1 TB of disk space (which is still less than Ceph needs).
+  The journal does not need to be large at all; for example, the benchmarks in this document were run
+  with a journal of only 16 MB. A large journal is probably even harmful, because "dirty" entries (entries
+  not yet flushed from the journal) also take up memory and may slow things down a little.
+- Vitastor has no internal copy-on-write. I believe that implementing a CoW storage is much harder,
+  so it is harder to achieve consistently good results. Perhaps one fine day
+  I will come up with a beautiful algorithm for CoW storage, but until then there will be no internal CoW in Vitastor.
+  None of this applies to "external" CoW (snapshots and clones).
+- The base layer of Vitastor is a simple block storage with fixed-size blocks, not a complex
+  object storage with rich features as in Ceph (RADOS).
+- Vitastor has a "lazy fsync" mode in which the OSD batches write requests before flushing them
+  to disk, which gives better performance with cheap desktop SSDs without capacitors
+  ("Advanced Power Loss Protection" / "Capacitor-Based Power Loss Protection").
+  Still, this mode is slower than using proper server SSDs with immediate
+  fsync, because it leads to additional network round-trips, so it is recommended
+  to use good server drives anyway, especially since they cost almost the same as desktop ones.
+- PGs are ephemeral. This means that they are not stored on drives and exist only in the memory of running OSDs.
+- Recovery processes operate on individual objects, not on whole PGs.
+- There are no PGLOGs.
+- "Monitors" do not store data. Cluster configuration and state are stored in etcd in simple human-readable
+  JSON structures. Vitastor monitors only watch the cluster state and control data movement.
+  In this sense the Vitastor monitor is not a critical component of the system and is more like the Ceph
+  manager (MGR). The Vitastor monitor is written in node.js.
+- PG distribution is not based on consistent hashes. Instead, all PG mappings are stored directly in etcd
+  (since there is no problem keeping several hundred or thousand records in memory instead of computing hashes every time).
+  Redistribution of PGs over OSDs is done through mathematical optimization,
+  namely by reducing the problem to an LP (linear programming problem) and solving it with the
+  lp_solve utility. This approach usually makes the space distribution almost perfectly even: uniformity
+  is usually 96-99%, as opposed to Ceph, where bare CRUSH without the balancer usually yields 80-90%.
+  It also makes it possible to minimize the amount of data movement and the randomness of links between OSDs,
+  and to change the distribution manually without fear of breaking the rebalancing logic. This approach also has
+  a potential drawback: it is suspected that it may break down in a very large cluster. However, up to
+  several hundred OSDs it definitely works fine. And, if needed, consistent hashes are easy
+  to implement as well.
+- There is no separate layer like the "CRUSH rules" layer. You configure redundancy schemes,
+  failure domains and OSD selection rules directly in the pool configuration.
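
The metadata memory figure quoted in the list above (about 512 MB per 1 TB with 128 KB blocks) can be turned into a rough estimator. This is only a Python sketch under a stated assumption: the 64 bytes per block is what the quoted figure works out to, not a documented Vitastor constant, and the function name is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
TIB = 1024 * 1024 * MIB

# Assumption: ~64 bytes of in-memory metadata per data block. This is only what
# "512 MB per 1 TB with 128 KB blocks" works out to, not a documented constant.
BYTES_PER_BLOCK = 64


def metadata_memory_bytes(disk_bytes, block_bytes=128 * KIB, bytes_per_block=BYTES_PER_BLOCK):
    """Estimate OSD metadata memory: one fixed-size entry per data block."""
    blocks = disk_bytes // block_bytes
    return blocks * bytes_per_block


for block_kib in (64, 128, 256):
    mem = metadata_memory_bytes(1 * TIB, block_kib * KIB)
    print(f"1 TiB drive, {block_kib} KiB blocks: {mem / MIB:.0f} MiB of metadata")
```

Under this assumption, halving the block size doubles the estimate, and doubling the drive size doubles it as well.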
+
+## Understanding Storage Performance
+
+In short: for a fast storage system, latency matters more than peak iops.
+
+The best possible latency is achieved when testing with 1 thread and a queue depth of 1,
+which roughly corresponds to a minimally loaded cluster. In this case
+IOPS = 1/latency. Latency does not scale with the number of servers, drives,
+or server processes/threads... It depends only on how fast a single
+server process (and the client) handles a single operation.
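
The relation IOPS = 1/latency at queue depth 1 is easy to sanity-check against the numbers quoted in this document. The Python sketch below is illustrative only; the function names are hypothetical, and latencies are taken in milliseconds.

```python
def qd1_iops(latency_ms):
    """IOPS achievable with 1 thread at queue depth 1 for a given latency."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def qd1_latency_ms(iops):
    """Average latency implied by a measured T1Q1 IOPS figure."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return 1000.0 / iops


# Latencies quoted in this document: a cheap SSD (0.04 ms write, 0.1 ms read)
# and the best case of typical software-defined storage (0.6 ms write).
for label, latency in [("SSD write", 0.04), ("SSD read", 0.1), ("SDS write", 0.6)]:
    print(f"{label}: {latency} ms -> {qd1_iops(latency):.0f} iops")
```

The same conversion works in the other direction for benchmark results that report only iops.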
+
+Why is latency important? Because some applications *cannot* use a queue
+depth greater than 1, since their task cannot be parallelized. An important example is every DBMS
+with consistency (ACID) support, because all of them provide it through
+journaling, and journals are written sequentially with fsync() after every operation.
+
+fsync, by the way, is another very important thing that is almost always forgotten in benchmarks.
+The point is that all modern drives have write caches/buffers and do not guarantee that
+data is physically written to the media until you issue fsync(),
+which the operating system translates into a cache flush command.
+
+Cheap SSDs for desktops and laptops are very fast without fsync: NVMe drives, for example,
+can handle about 80000 write operations per second at queue depth 1 without fsync.
+With fsync, however, when they actually have to write every data block to flash memory,
+they deliver only 1000-2000 write operations per second (a number that is practically constant
+across all SSD models).
+
+Server SSDs often have supercapacitors that act as a built-in uninterruptible
+power supply and give the drive time to flush its DRAM cache to persistent
+flash memory on power loss. Thanks to this, such drives can safely
+*ignore fsync*, because they know for certain that the cached data will reach persistent
+memory.
+
+All of the best-known software-defined storage systems, for example Ceph and the internal systems used
+by cloud providers such as Amazon, Google and Yandex, are slow in terms of latency.
+At best they deliver latencies from 0.3 ms for reads and 0.6 ms for writes with 4 KB blocks,
+even with the best possible hardware.
+
+And this is in the SSD era, when you can go to the market and buy an SSD with
+0.1 ms read latency and 0.04 ms write latency for $100 or even less.
+
+When I need to quickly test the performance of a disk subsystem, I
+use the following 6 commands, with small variations:
+
+- Linear write:
+  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX`
+- Linear read:
+  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=read -runtime=60 -filename=/dev/sdX`
+- Single-threaded write (T1Q1):
+  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -fsync=1 -rw=randwrite -runtime=60 -filename=/dev/sdX`
+- Single-threaded read (T1Q1):
+  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randread -runtime=60 -filename=/dev/sdX`
+- Parallel write (numjobs is used when 1 CPU core cannot saturate the drive):
+  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randwrite -runtime=60 -filename=/dev/sdX`
+- Parallel read (numjobs as above):
+  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randread -runtime=60 -filename=/dev/sdX`
+
+## Theoretical Maximum Performance of Vitastor
+
+With replication:
+- Single-threaded (T1Q1) read latency: 1 network RTT + 1 disk read.
+- Single-threaded write+fsync latency:
+  - With immediate commit: 2 RTT + 1 write.
+  - With deferred ("lazy") commit: 4 RTT + 1 write + 1 fsync.
+- Parallel read: sum of the IOPS of all drives, or network bandwidth if the network becomes the bottleneck first.
+- Parallel write: sum of the IOPS of all drives / number of replicas / WA, or network bandwidth if the network becomes the bottleneck first.
+
+With erasure codes (EC):
+- Single-threaded (T1Q1) read latency: 1.5 RTT + 1 read.
+- Single-threaded write+fsync latency:
+  - With immediate commit: 3.5 RTT + 1 read + 2 writes.
+  - With deferred ("lazy") commit: 5.5 RTT + 1 read + 2 writes + 2 fsyncs.
+- The 0.5 actually stands for (k-1)/k, where k is the number of data drives,
+  which reflects the fact that the extra network round-trip is not needed when the read
+  operation is served locally.
+- Parallel read: sum of the IOPS of all drives, or network bandwidth if the network becomes the bottleneck first.
+- Parallel write: sum of the IOPS of all drives / total number of data and parity drives / WA, or network bandwidth if the network becomes the bottleneck first.
+  Note: drive IOPS in this case should be measured in a mixed read/write mode, in the same proportion as in the formulas above.
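
The formulas above can be written down directly. The Python sketch below is an illustration under stated assumptions: all function and parameter names are hypothetical, the 0.5 RTT term is replaced by (k-1)/k in every EC formula as the note above suggests, and the network limit on parallel throughput is ignored.

```python
def replicated_latency(rtt, read, write, fsync=0.0, lazy=False):
    """T1Q1 (read, write+fsync) latency for a replicated pool."""
    read_latency = rtt + read
    if lazy:
        write_latency = 4 * rtt + write + fsync
    else:
        write_latency = 2 * rtt + write
    return read_latency, write_latency


def ec_latency(rtt, read, write, k, fsync=0.0, lazy=False):
    """T1Q1 (read, write+fsync) latency for an EC pool with k data drives."""
    remote = (k - 1) / k  # the "0.5" in the formulas, exact for k = 2
    read_latency = (1 + remote) * rtt + read
    if lazy:
        write_latency = (5 + remote) * rtt + read + 2 * write + 2 * fsync
    else:
        write_latency = (3 + remote) * rtt + read + 2 * write
    return read_latency, write_latency


def parallel_write_iops(total_drive_iops, divisor, wa):
    """Upper bound on parallel write IOPS, ignoring the network limit.

    divisor -- number of replicas, or data + parity drives for EC
    """
    return total_drive_iops / divisor / wa


# Hypothetical inputs in milliseconds: rtt is an assumed value; read and write
# reuse the 0.1 ms / 0.04 ms SSD latencies quoted earlier in this document.
rtt, read, write = 0.05, 0.1, 0.04
print("replicated:", replicated_latency(rtt, read, write))
print("EC 2+1:    ", ec_latency(rtt, read, write, k=2))
```

Real results also include CPU time in the client and the OSD, so measured latencies should be expected to sit above these lower bounds.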
+
+WA (write amplification) for 4 KB blocks in Vitastor is usually 3-5:
+1. Metadata write to the journal
+2. Data block write to the journal
+3. Metadata write to the DB
+4. Another metadata write to the journal when EC is used
+5. Data block write to the data device
+
+If you manage to find an SSD that works well with 512-byte data blocks (Optane?),
+then 1, 3 and 4 can be reduced to 512 bytes (1/8 of the data size), giving a WA of just 2.375.
+
+In addition, WA is lower with deferred/lazy commit under parallel
+load, because journal blocks are written to disk only when they fill up or when fsync
+is explicitly requested.
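
The 2.375 figure follows from adding up the five writes listed above. A minimal Python sketch of that arithmetic, assuming each write is either a full data block or a metadata write of the given size (the function name is hypothetical, and the batching effect of lazy commit is not modelled):

```python
def write_amplification(meta_write, data_block=4096, ec=False):
    """Bytes written to the drives per byte of user data for one small write."""
    writes = [
        meta_write,   # 1. metadata write to the journal
        data_block,   # 2. data block write to the journal
        meta_write,   # 3. metadata write to the DB
        data_block,   # 5. data block write to the data device
    ]
    if ec:
        writes.append(meta_write)  # 4. extra journal metadata write with EC
    return sum(writes) / data_block


print(write_amplification(4096, ec=True))  # every write is a full 4 KB block
print(write_amplification(512, ec=True))   # metadata writes shrunk to 512 bytes
```

With 512-byte metadata writes and EC this gives 2 + 3/8 = 2.375, matching the value quoted above.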
+
+## Example Comparison with Ceph
+
+Hardware: 4 servers, each with:
+- 6x SATA SSD Intel D3-S4510 3.84 TB
+- 2x Xeon Gold 6242 (16 cores @ 2.8 GHz)
+- 384 GB RAM
+- 1x 25 GbE network card (Mellanox ConnectX-4 LX), connected to a Juniper QFX5200 switch
+
+CPU power saving was disabled. Both Vitastor and Ceph were deployed with 2 OSDs per SSD.
+
+All results below are for random 4 KB block workloads (unless explicitly stated otherwise).
+
+Raw drive performance:
+- T1Q1 write ~27000 iops (~0.037ms latency)
+- T1Q1 read ~9800 iops (~0.101ms latency)
+- T1Q32 write ~60000 iops
+- T1Q32 read ~81700 iops
+
+Ceph 15.2.4 (Bluestore):
+- T1Q1 write ~1000 iops (~1ms latency)
+- T1Q1 read ~1750 iops (~0.57ms latency)
+- T8Q64 write ~100000 iops, OSD processes consume about 40 CPU cores on each server
+- T8Q64 read ~480000 iops, OSD processes consume about 40 CPU cores on each server
+
+The 8-thread tests were run on 8 400GB RBD images from all hosts (2 fio processes per host).
+This is needed because in Ceph several RBD clients writing to 1 image slow down very significantly.
+
+RocksDB and Bluestore settings in Ceph were left unchanged; the only change was disabling cephx_sign_messages.
+
+In fact, these results are not that bad for Ceph (it could have been worse).
+Strictly speaking, these servers are well balanced for Ceph: 6 SATA SSDs just about
+saturate the 25-gigabit network, and without 2 powerful CPUs Ceph would not have had enough cores
+to deliver a decent result. This is exactly what the consumption of 40 cores during
+the parallel test shows.
+
+Vitastor:
+- T1Q1 write: 7087 iops (0.14ms latency)
+- T1Q1 read: 6838 iops (0.145ms latency)
+- T2Q64 write: 162000 iops, CPU usage is 3 cores on each server
+- T8Q64 read: 895000 iops, CPU usage is 4 cores on each server
+- Linear write (4M T1Q32): 2800 MB/s
+- Linear read (4M T1Q32): 1500 MB/s
+
+The 8-thread read test was run on 1 large image (3.2 TB) from all hosts (again, 2 fio processes per host).
+In Vitastor there is no difference between 1 image and 8. Naturally, about 1/4 of read requests
+in this configuration, as in the Ceph tests above, were served from the local machine. When the
+test was run so that all operations always reached primary OSDs over the network, it was more
+network-bound and the result was about 689000 iops.
+
+Vitastor settings: `--disable_data_fsync true --immediate_commit all --flusher_count 8
+  --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096
+  --journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024
+  --journal_size 16777216`.
+
+### EC/XOR 2+1
+
+Vitastor:
+- T1Q1 write: 2808 iops (~0.355ms latency)
+- T1Q1 read: 6190 iops (~0.16ms latency)
+- T2Q64 write: 85500 iops, CPU usage is 3.4 cores on each server
+- T8Q64 read: 812000 iops, CPU usage is 4.7 cores on each server
+- Linear write (4M T1Q32): 3200 MB/s
+- Linear read (4M T1Q32): 1800 MB/s
+
+Ceph:
+- T1Q1 write: 730 iops (~1.37ms latency)
+- T1Q1 read: 1500 iops with a cold metadata cache (~0.66ms latency), 2300 iops after 2 minutes of warm-up (~0.435ms latency)
+- T4Q128 write (4 RBD images): 45300 iops, CPU usage is 30 cores on each server
+- T8Q64 read (4 RBD images): 278600 iops, CPU usage is 40 cores on each server
+- Linear write (4M T1Q32): 1950 MB/s into an empty image, 2500 MB/s into a filled image
+- Linear read (4M T1Q32): 2400 MB/s
+
+### NBD
+
+NBD is currently the only way to mount Vitastor via the Linux kernel, but it
+involves additional data copying, so performance degrades slightly;
+this mostly affects linear I/O, while random I/O is barely affected.
+
+NBD stands for "Network Block Device", but in fact it also
+works simply as a counterpart of FUSE for block devices, that is, it is
+a "block device in userspace".
+
+Vitastor with a single-threaded NBD proxy on the same test bench:
+- T1Q1 write: 6000 iops (0.166ms latency)
+- T1Q1 read: 5518 iops (0.18ms latency)
+- T1Q128 write: 94400 iops
+- T1Q128 read: 103000 iops
+- Linear write (4M T1Q128): 1266 MB/s (compared to 2800 MB/s via fio)
+- Linear read (4M T1Q128): 975 MB/s (compared to 1500 MB/s via fio)
+
+## Installation
+
+### Debian
+
+- Add the Vitastor repository key:
+  `wget -q -O - https://vitastor.io/debian/pubkey | sudo apt-key add -`
+- Add the Vitastor repository to /etc/apt/sources.list:
+  - Debian 11 (Bullseye/Sid): `deb https://vitastor.io/debian bullseye main`
+  - Debian 10 (Buster): `deb https://vitastor.io/debian buster main`
+- For Debian 10 (Buster), also enable the backports repository:
+  `deb http://deb.debian.org/debian buster-backports main`
+- Install the packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu`
+
+### CentOS
+
+- Add the Vitastor repository to the system:
+  - CentOS 7: `yum install https://vitastor.io/rpms/centos/7/vitastor-release-1.0-1.el7.noarch.rpm`
+  - CentOS 8: `dnf install https://vitastor.io/rpms/centos/8/vitastor-release-1.0-1.el8.noarch.rpm`
+- Enable EPEL: `yum/dnf install epel-release`
+- Enable additional CentOS repositories:
+  - CentOS 7: `yum install centos-release-scl`
+  - CentOS 8: `dnf install centos-release-advanced-virtualization`
+- Enable elrepo-kernel:
+  - CentOS 7: `yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm`
+  - CentOS 8: `dnf install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm`
+- Install the packages: `yum/dnf install vitastor lpsolve etcd kernel-ml qemu-kvm`
+
+### Building from Source
+
+- Install kernel 5.4 or newer for io_uring support. 5.8 or even newer is preferable,
+  because 5.4 has at least 1 known bug that causes hangs with io_uring and an HP SmartArray controller.
+- Install liburing 0.4 or newer and its headers.
+- Install lp_solve.
+- Install etcd, version 3.4.15 or newer. Earlier versions will not work because of various bugs,
+  for example [#12402](https://github.com/etcd-io/etcd/pull/12402). Alternatively, you can take version 3.4.13 with
+  this specific fix from the release-3.4 branch of https://github.com/vitalif/etcd/.
+- Install node.js 10 or newer.
+- Install gcc and g++ 8.x or newer.
+- Clone this repository with submodules: `git clone https://yourcmc.ru/git/vitalif/vitastor/`.
+- It is advisable to rebuild QEMU with the patch that makes running via LD_PRELOAD unnecessary.
+  See `qemu-*.*-vitastor.patch` and pick the version closest to your QEMU version.
+- Install QEMU 3.0 or newer, get the source code of the installed package, start rebuilding it,
+  stop the build after a while and copy the following headers:
+   - `<qemu>/include` &rarr; `<vitastor>/qemu/include`
+   - Debian:
+      * Use qemu from the main repository
+      * `<qemu>/b/qemu/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
+      * `<qemu>/b/qemu/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
+   - CentOS 8:
+      * Use qemu from the Advanced-Virtualization repository. To enable it, run
+        `yum install centos-release-advanced-virtualization.noarch` and then `yum install qemu`
+      * `<qemu>/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
+      * For QEMU 3.0+: `<qemu>/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
+      * For QEMU 2.0+: `<qemu>/qapi-types.h` &rarr; `<vitastor>/qemu/b/qemu/qapi-types.h`
+   - `config-host.h` and `qapi` are required because they contain auto-generated headers
+- Install fio 3.7 or newer, get the package sources and symlink them to `<vitastor>/fio`.
+- Build and install Vitastor with `mkdir build && cd build && cmake .. && make -j8 && make install`.
+  Pay attention to the `QEMU_PLUGINDIR` cmake variable: under RHEL it must be set to `qemu-kvm`.
+
+## Running
+
+Please note: the startup procedure is still fairly non-trivial; configuration and on-disk
+offsets have to be specified almost by hand. This will be fixed in the near future.
+
+- SATA SSDs or NVMe drives with capacitors (server-grade SSDs) are preferred. Desktop SSDs
+  can also be used with lazy fsync mode enabled, but single-threaded write performance
+  will suffer in that case.
+- A fast network, at least 10 Gbit/s
+- For best performance, disable CPU power saving: `cpupower idle-set -D 0 && cpupower frequency-set -g performance`.
+- Set the values you need at the top of `/usr/lib/vitastor/mon/make-units.sh` and `/usr/lib/vitastor/mon/make-osd.sh`.
+- Create systemd units for etcd and the monitors: `/usr/lib/vitastor/mon/make-units.sh`
+- Create units for the OSDs: `/usr/lib/vitastor/mon/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
+- You can change OSD parameters in the systemd units. The meaning of some parameters:
+  - `disable_data_fsync 1`: disables fsync, used with SSDs with capacitors.
+  - `immediate_commit all`: used with SSDs with capacitors.
+  - `disable_device_lock 1`: disables locking of the device file, needed only if you run
+    several OSDs on one block device.
+  - `flusher_count 256`: a "flusher" is a micro-thread that removes old data from the journal.
+    Do not worry about this setting; 256 is now enough practically always.
+  - `disk_alignment`, `journal_block_size`, `meta_block_size` should be set to the
+    internal block size of the SSD. This is almost always 4096.
+  - `journal_no_same_sector_overwrites true` prevents the same journal sector from being overwritten
+    many times in a row during writes. Most (99%) SSDs do not need this option. However, it turned out that
+    the drives used on one of the test benches, Intel D3-S4510, strongly dislike such
+    overwrites, and this option was added for them. When this mode is enabled, you also need to raise
+    `journal_sector_buffer_count`, because otherwise Vitastor will run out of buffers for journal writes.
+- Start all etcd instances: `systemctl start etcd`
+- Create the global configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
+  (if all your drives are server-grade with capacitors).
+- Create pools: `etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.
+  For jerasure EC pools the configuration should look like this: `2:{"name":"ecpool","scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}`.
+- Start all OSDs: `systemctl start vitastor.target`
+- Your cluster should now be ready: one of the monitors should already have configured the PGs, and the OSDs should have started them.
+- You can check PG states directly in etcd: `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. All PGs should be 'active'.
+- Example command to run tests: `fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -pool=1 -inode=1 -size=400G`.
+- Example command to upload a VM image into Vitastor via qemu-img:
+  ```
+  qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648'
+  ```
+  If you use unmodified QEMU, this command needs the `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so` environment variable.
+- Example command to run QEMU:
+  ```
+  qemu-system-x86_64 -enable-kvm -m 1024 \
+    -drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648',format=raw,if=none,id=drive-virtio-disk0,cache=none \
+    -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512 \
+    -vnc 0.0.0.0:0
+  ```
+- Example command to remove an image (inode) from Vitastor:
+  ```
+  vitastor-rm --etcd_address 10.115.0.10:2379/v3 --pool 1 --inode 1 --parallel_osds 16 --iodepth 32
+  ```
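
The pool configuration from the pool creation step above is a single JSON object keyed by pool ID. The Python sketch below shows one way to assemble it without hand-editing quotes. The helper names are hypothetical, the keys are copied from the examples above, and the `parity_chunks` range check is an assumption rather than a documented rule.

```python
import json


def replicated_pool(name, pg_size, pg_minsize, pg_count, failure_domain="host"):
    """Build a replicated pool entry with the keys used in the example above."""
    return {
        "name": name,
        "scheme": "replicated",
        "pg_size": pg_size,
        "pg_minsize": pg_minsize,
        "pg_count": pg_count,
        "failure_domain": failure_domain,
    }


def jerasure_pool(name, pg_size, parity_chunks, pg_minsize, pg_count, failure_domain="host"):
    """Build a jerasure EC pool entry; pg_size counts data plus parity chunks."""
    # Assumption: at least one data chunk and one parity chunk are required.
    if not 0 < parity_chunks < pg_size:
        raise ValueError("parity_chunks must be between 1 and pg_size - 1")
    return {
        "name": name,
        "scheme": "jerasure",
        "pg_size": pg_size,
        "parity_chunks": parity_chunks,
        "pg_minsize": pg_minsize,
        "pg_count": pg_count,
        "failure_domain": failure_domain,
    }


# Pool IDs are JSON object keys, so they are written as strings.
pools = {
    "1": replicated_pool("testpool", pg_size=2, pg_minsize=1, pg_count=256),
    "2": jerasure_pool("ecpool", pg_size=4, parity_chunks=2, pg_minsize=2, pg_count=256),
}
pools_json = json.dumps(pools, separators=(",", ":"))
print(pools_json)
```

The printed string is the value that would be passed, in single quotes, to `etcdctl --endpoints=... put /vitastor/config/pools`.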
+
+## Known Problems
+
+- Object deletion requests may currently lead to "incomplete" objects in EC pools
+  if OSDs or servers fail during the deletion, because proper handling of
+  deletion requests in a cluster has to be "three-phase", and this is not implemented yet. If you
+  run into this situation, just repeat the deletion request.
+
+## Implementation Principles
+
+- I like architecturally simple solutions. Vitastor is designed exactly this way, and I intend
+  to keep following this principle.
+- If you came here looking for perfect C++ code, you are probably in the wrong place. I do not
+  care much about "generally accepted" C++ practices, because they, again, often lead to
+  unnecessary complexity, and the resulting code is beautiful... but slow.
+- For the same reason, you may sometimes find reinvented wheels in the code, such as a custom simplified
+  HTTP client for talking to etcd. But these wheels are small and compact and do not require
+  a dozen external libraries.
+- node.js for the monitor is not a random choice. It is very fast, has a built-in event
+  loop, a pleasant neutral C-like programming language and a well-developed infrastructure.
+
+## Author and License
+
+Author: Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
+
+Join the Vitastor Telegram chat: https://t.me/vitastor
+
+License: VNPL 1.1 for server-side code and dual VNPL 1.1 + GPL 2.0+ for client code.
+
+VNPL is a "network copyleft" license: Vitastor's own free copyleft license,
+Vitastor Network Public License 1.1, based on GNU GPL 3.0 with an additional
+"Network Interaction" clause that requires all programs
+specifically designed to be used with Vitastor and interacting
+with it over a network to be distributed under VNPL or under any other free license.
+
+The idea of VNPL is to extend copyleft not only to modules explicitly
+linked with Vitastor code, but also to modules implemented as microservices
+that interact with it over a network.
+
+Thus, if you want to build a service based on Vitastor that contains
+closed-source components interacting with Vitastor, you need a commercial
+license from the author 😀.
+
+No restrictions are imposed on Windows or any other software not *specifically* designed
+for use with Vitastor.
+
+Client libraries are dual-licensed under VNPL 1.1
+and also under GNU GPL 2.0 or later. This is done for
+compatibility with software such as QEMU and fio.
+
+You can find the full text of VNPL 1.1 in the file [VNPL-1.1.txt](VNPL-1.1.txt),
+and GPL 2.0 in the file [GPL-2.0.txt](GPL-2.0.txt).
--- a/README.md
+++ b/README.md
@ -1,5 +1,7 @@
 ## Vitastor

+[Читать на русском](README-ru.md)
+
 ## The Idea

 Make Software-Defined Block Storage Great Again.
@ -14,6 +16,7 @@ with configurable redundancy (replication or erasure codes/XOR).
 Vitastor is currently a pre-release, a lot of features are missing and you can still expect
 breaking changes in the future. However, the following is implemented:

+0.5.x (stable):
 - Basic part: highly-available block storage with symmetric clustering and no SPOF
 - Performance ;-D
 - Multiple redundancy schemes: Replication, XOR n+1, Reed-Solomon erasure codes
@ -35,19 +38,22 @@ breaking changes in the future. However, the following is implemented:
 - Inode removal tool (vitastor-rm)
 - Packaging for Debian and CentOS

-## Roadmap
-
- OSD creation tool (OSDs currently have to be created by hand)
- Other administrative tools
+0.6.x (master):
 - Per-inode I/O and space usage statistics
- Proxmox and OpenNebula plugins
- iSCSI proxy
 - Inode metadata storage in etcd
 - Snapshots and copy-on-write image clones
- Operation timeouts and better failure detection
+- Write throttling to smooth random write workloads in SSD+HDD configurations
+
+## Roadmap
+
+- Better OSD creation and auto-start tools
+- Other administrative tools
+- Plugins for OpenStack, Kubernetes, OpenNebula, Proxmox and other cloud systems
+- iSCSI proxy
+- Faster failover
 - Scrubbing without checksums (verification of replicas)
 - Checksums
- SSD+HDD optimizations, possibly including tiered storage and soft journal flushes
+- Tiered storage
 - RDMA and NVDIMM support
 - Web GUI
 - Compression (possibly)
@ -291,7 +297,7 @@ Vitastor with single-thread NBD on the same hardware:
  - Debian 10 (Buster): `deb https://vitastor.io/debian buster main`
 - For Debian 10 (Buster) also enable backports repository:
  `deb http://deb.debian.org/debian buster-backports main`
- Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64`
+- Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu`

 ### CentOS

@ -313,10 +319,9 @@ Vitastor with single-thread NBD on the same hardware:
  there is at least one known io_uring hang with 5.4 and an HP SmartArray controller.
 - Install liburing 0.4 or newer and its headers.
 - Install lp_solve.
- Install etcd. Attention: you need a fixed version from here: https://github.com/vitalif/etcd/,
-  branch release-3.4, because there is a bug in upstream etcd which makes Vitastor OSDs fail to
-  move PGs out of "starting" state if you have at least around ~500 PGs or so. The custom build
-  will be unnecessary when etcd merges the fix: https://github.com/etcd-io/etcd/pull/12402.
+- Install etcd, at least version 3.4.15. Earlier versions won't work because of various bugs,
+  for example [#12402](https://github.com/etcd-io/etcd/pull/12402). You can also take 3.4.13
+  with this specific fix from here: https://github.com/vitalif/etcd/, branch release-3.4.
 - Install node.js 10 or newer.
 - Install gcc and g++ 8.x or newer.
 - Clone https://yourcmc.ru/git/vitalif/vitastor/ with submodules.
@ -395,13 +400,15 @@ and calculate disk offsets almost by hand. This will be fixed in near future.

 ## Known Problems

- Object deletion requests may currently lead to 'incomplete' objects if your OSDs crash during
-  deletion because proper handling of object cleanup in a cluster should be "three-phase"
-  and it's currently not implemented. Just to repeat the removal again in this case.
+- Object deletion requests may currently lead to 'incomplete' objects in EC pools
+  if your OSDs crash during deletion because proper handling of object cleanup
+  in a cluster should be "three-phase" and it's currently not implemented.
+  Just repeat the removal request in this case.

 ## Implementation Principles

- I like simple and stupid solutions, so expect Vitastor to stay simple.
+- I like architecturally simple solutions. Vitastor is and will always be designed
+  exactly like that.
 - I also like reinventing the wheel to some extent, like writing my own HTTP client
  for etcd interaction instead of using prebuilt libraries, because in this case
  I'm confident about what my code does and what it doesn't do.
@ -416,7 +423,7 @@ and calculate disk offsets almost by hand. This will be fixed in near future.

 Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+

-You can also find me in the Russian Telegram Ceph chat: https://t.me/ceph_ru
+Join Vitastor Telegram Chat: https://t.me/vitastor

 All server-side code (OSD, Monitor and so on) is licensed under the terms of
 Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
--- a/debian/changelog
+++ b/debian/changelog
@ -1,4 +1,4 @@
-vitastor (0.5.10-1) unstable; urgency=medium
+vitastor (0.6.2-1) unstable; urgency=medium

  * Bugfixes

--- a/debian/vitastor.Dockerfile
+++ b/debian/vitastor.Dockerfile
@ -40,10 +40,10 @@ RUN set -e -x; \
    mkdir -p /root/packages/vitastor-$REL; \
    rm -rf /root/packages/vitastor-$REL/*; \
    cd /root/packages/vitastor-$REL; \
-    cp -r /root/vitastor vitastor-0.5.10; \
-    ln -s /root/packages/qemu-$REL/qemu-*/ vitastor-0.5.10/qemu; \
-    ln -s /root/fio-build/fio-*/ vitastor-0.5.10/fio; \
-    cd vitastor-0.5.10; \
+    cp -r /root/vitastor vitastor-0.6.2; \
+    ln -s /root/packages/qemu-$REL/qemu-*/ vitastor-0.6.2/qemu; \
+    ln -s /root/fio-build/fio-*/ vitastor-0.6.2/fio; \
+    cd vitastor-0.6.2; \
    FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
    QEMU=$(head -n1 qemu/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
    sh copy-qemu-includes.sh; \
@ -59,8 +59,8 @@ RUN set -e -x; \
    echo "dep:fio=$FIO" > debian/substvars; \
    echo "dep:qemu=$QEMU" >> debian/substvars; \
    cd /root/packages/vitastor-$REL; \
-    tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.5.10.orig.tar.xz vitastor-0.5.10; \
-    cd vitastor-0.5.10; \
+    tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.6.2.orig.tar.xz vitastor-0.6.2; \
+    cd vitastor-0.6.2; \
    V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
    DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
    DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
--- a/mon/lp-optimizer.js
+++ b/mon/lp-optimizer.js
@ -104,6 +104,17 @@ async function optimize_initial({ osd_tree, pg_count, pg_size = 3, pg_minsize =
    return res;
 }

+function shuffle(array)
+{
+    for (let i = array.length - 1, j, x; i > 0; i--)
+    {
+        j = Math.floor(Math.random() * (i + 1));
+        x = array[i];
+        array[i] = array[j];
+        array[j] = x;
+    }
+}
+
 function make_int_pgs(weights, pg_count)
 {
    const total_weight = Object.values(weights).reduce((a, c) => Number(a) + Number(c), 0);
@ -120,6 +131,7 @@ function make_int_pgs(weights, pg_count)
        weight_left -= weights[pg_name];
        pg_left -= n;
    }
+    shuffle(int_pgs);
    return int_pgs;
 }
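
Review note on the shuffle() hunk above: the added function is a Fisher-Yates shuffle applied to int_pgs before it is returned. A minimal standalone sketch of the same technique follows. The `random` parameter and the sample PG names are hypothetical and not part of the commit; they only make the demo easy to run on its own.

```javascript
// Fisher-Yates shuffle, same loop shape as shuffle() in lp-optimizer.js.
// `random` is a hypothetical parameter (defaults to Math.random) added for this sketch.
function shuffleInPlace(array, random = Math.random)
{
    for (let i = array.length - 1; i > 0; i--)
    {
        // Pick j uniformly from 0..i and swap elements i and j
        const j = Math.floor(random() * (i + 1));
        [ array[i], array[j] ] = [ array[j], array[i] ];
    }
    return array;
}

// Hypothetical PG list: the shuffle changes the order but never the contents
const pgs = [ 'pg_1_2_3', 'pg_1_2_3', 'pg_2_3_4', 'pg_1_3_4' ];
const shuffled = shuffleInPlace(pgs.slice());
console.log(shuffled);
```

Because every position is swapped with a uniformly chosen index at or below it, identical OSD sets no longer end up grouped together in the returned array.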

--- a/mon/make-osd.sh
+++ b/mon/make-osd.sh
@ -53,7 +53,6 @@ ExecStart=/usr/bin/vitastor-osd \\
    --osd_num $OSD_NUM \\
    --disable_data_fsync 1 \\
    --immediate_commit all \\
-    --flusher_count 256 \\
    --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
    --journal_no_same_sector_overwrites true \\
    --journal_sector_buffer_count 1024 \\
--- a/mon/make-units.sh
+++ b/mon/make-units.sh
@ -32,7 +32,8 @@ ExecStart=/usr/local/bin/etcd -name etcd$ETCD_NUM --data-dir /var/lib/etcd$ETCD_
    --advertise-client-urls http://$IP:2379 --listen-client-urls http://$IP:2379 \\
    --initial-advertise-peer-urls http://$IP:2380 --listen-peer-urls http://$IP:2380 \\
    --initial-cluster-token vitastor-etcd-1 --initial-cluster $ETCD_HOSTS \\
-    --initial-cluster-state new --max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision
+    --initial-cluster-state new --max-txn-ops=100000 --max-request-bytes=104857600 \\
+    --auto-compaction-retention=10 --auto-compaction-mode=revision
 WorkingDirectory=/var/lib/etcd$ETCD_NUM.etcd
 ExecStartPre=+chown -R etcd /var/lib/etcd$ETCD_NUM.etcd
 User=etcd
--- a/mon/merge.js
+++ b/mon/merge.js
@ -0,0 +1,23 @@
+const fsp = require('fs').promises;
+
+async function merge(file1, file2, out)
+{
+    if (!out)
+    {
+        console.error('USAGE: nodejs merge.js layer1 layer2 output');
+        process.exit(1);
+    }
+    const layer1 = await fsp.readFile(file1);
+    const layer2 = await fsp.readFile(file2);
+    const zero = Buffer.alloc(4096);
+    for (let i = 0; i < layer2.length; i += 4096)
+    {
+        if (zero.compare(layer2, i, i+4096) != 0)
+        {
+            layer2.copy(layer1, i, i, i+4096);
+        }
+    }
+    await fsp.writeFile(out, layer1);
+}
+
+merge(process.argv[2], process.argv[3], process.argv[4]);
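
Review note on merge.js above: the script overlays layer2 on layer1 block by block and treats an all-zero 4096-byte block in layer2 as "not written". The sketch below shows the same idea on in-memory buffers. The function name `mergeLayers`, the configurable `blockSize` and the sample data are assumptions made for illustration; it also assumes both layers have the same length.

```javascript
// Overlay non-zero blocks of layer2 onto a copy of layer1.
// mergeLayers and blockSize are hypothetical names for this sketch only.
function mergeLayers(layer1, layer2, blockSize)
{
    const zero = Buffer.alloc(blockSize);
    const out = Buffer.from(layer1); // copy, so the input buffer stays untouched
    for (let i = 0; i < layer2.length; i += blockSize)
    {
        const end = Math.min(i + blockSize, layer2.length);
        // A block counts as "present" in layer2 if it is not entirely zero
        if (zero.compare(layer2, i, end, 0, end - i) != 0)
        {
            layer2.copy(out, i, i, end);
        }
    }
    return out;
}

// Tiny demo with 4-byte blocks instead of 4096
const base = Buffer.from('AAAABBBBCCCC');
const top = Buffer.concat([ Buffer.alloc(4), Buffer.from('XXXX'), Buffer.alloc(4) ]);
const merged = mergeLayers(base, top, 4);
console.log(merged.toString()); // prints "AAAAXXXXCCCC"
```

A consequence of this scheme is that a block deliberately overwritten with zeroes in layer2 cannot be told apart from an unwritten one.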
--- a/mon/mon.js
+++ b/mon/mon.js
@ -24,13 +24,17 @@ const etcd_allow = new RegExp('^'+[
    'config/pools',
    'config/osd/[1-9]\\d*',
    'config/pgs',
+    'config/inode/[1-9]\\d*/[1-9]\\d*',
    'osd/state/[1-9]\\d*',
    'osd/stats/[1-9]\\d*',
+    'osd/inodestats/[1-9]\\d*',
+    'osd/space/[1-9]\\d*',
    'mon/master',
    'pg/state/[1-9]\\d*/[1-9]\\d*',
    'pg/stats/[1-9]\\d*/[1-9]\\d*',
    'pg/history/[1-9]\\d*/[1-9]\\d*',
    'history/last_clean_pgs',
+    'inode/stats/[1-9]\\d*/[1-9]\\d*',
    'stats',
 ].join('$|^')+'$');

@ -92,7 +96,8 @@ const etcd_tree = {
            disable_device_lock,
            // blockstore - configurable
            max_write_iodepth,
-            flusher_count,
+            min_flusher_count: 1,
+            max_flusher_count: 256,
            inmemory_metadata,
            inmemory_journal,
            journal_sector_buffer_count,
@ -140,6 +145,18 @@ const etcd_tree = {
            }
        }, */
        pgs: {},
+        /* inode: {
+            <pool_id>: {
+                <inode_t>: {
+                    name: string,
+                    size?: uint64_t, // bytes
+                    parent_pool?: <pool_id>,
+                    parent_id?: <inode_t>,
+                    readonly?: boolean,
+                }
+            }
+        }, */
+        inode: {},
    },
    osd: {
        state: {
@ -171,6 +188,18 @@ const etcd_tree = {
                },
            }, */
        },
+        inodestats: {
+            /* <inode_t>: {
+                read: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
+                write: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
+                delete: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
+            }, */
+        },
+        space: {
+            /* <osd_num_t>: {
+                <inode_t>: uint64_t, // bytes
+            }, */
+        },
    },
    mon: {
        master: {
@ -182,7 +211,7 @@ const etcd_tree = {
            /* <pool_id>: {
                <pg_id>: {
                    primary: osd_num_t,
-                    state: ("starting"|"peering"|"incomplete"|"active"|"stopping"|"offline"|
+                    state: ("starting"|"peering"|"incomplete"|"active"|"repeering"|"stopping"|"offline"|
                        "degraded"|"has_incomplete"|"has_degraded"|"has_misplaced"|"has_unclean"|
                        "has_invalid"|"left_on_dead")[],
                }
@ -210,6 +239,16 @@ const etcd_tree = {
            }, */
        },
    },
+    inode: {
+        stats: {
+            /* <inode_t>: {
+                raw_used: uint64_t, // raw used bytes on OSDs
+                read: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
+                write: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
+                delete: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
+            }, */
+        },
+    },
    stats: {
        /* op_stats: {
            <string>: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
@ -394,7 +433,7 @@ class Mon
                {
                    this.parse_kv(e.kv);
                    const key = e.kv.key.substr(this.etcd_prefix.length);
-                    if (key.substr(0, 11) == '/osd/stats/' || key.substr(0, 10) == '/pg/stats/')
+                    if (key.substr(0, 11) == '/osd/stats/' || key.substr(0, 10) == '/pg/stats/' || key.substr(0, 16) == '/osd/inodestats/')
                    {
                        stats_changed = true;
                    }
@ -402,7 +441,7 @@ class Mon
                    {
                        pg_states_changed = true;
                    }
-                    else if (key != '/stats')
+                    else if (key != '/stats' && key.substr(0, 13) != '/inode/stats/')
                    {
                        changed = true;
                    }
@ -541,7 +580,7 @@ class Mon
        for (const osd_num of this.all_osds().sort((a, b) => a - b))
        {
            const stat = this.state.osd.stats[osd_num];
-            if (stat.size && (this.state.osd.state[osd_num] || Number(stat.time) >= down_time))
+            if (stat && stat.size && (this.state.osd.state[osd_num] || Number(stat.time) >= down_time))
            {
                // Numeric IDs are reserved for OSDs
                const osd_cfg = this.state.config.osd[osd_num];
@ -692,6 +731,11 @@ class Mon
                pg_history[i].osd_sets = pg_history[i].osd_sets || [];
                pg_history[i].osd_sets.push(prev_pgs[i]);
            }
+            if (pg_history[i] && pg_history[i].osd_sets)
+            {
+                pg_history[i].osd_sets = Object.values(pg_history[i].osd_sets
+                    .reduce((a, c) => { a[c.join(' ')] = c; return a; }, {}));
+            }
        });
        for (let i = 0; i < new_pgs.length || i < prev_pgs.length; i++)
        {
@ -842,7 +886,7 @@ class Mon
    {
        // Take configuration and state, check it against the stored configuration hash
        // Recalculate PGs and save them to etcd if the configuration is changed
-        // FIXME: Also do not change anything if the distribution is good enough and no PGs are degraded
+        // FIXME: Do not change anything if the distribution is good and random enough and no PGs are degraded
        const { up_osds, levels, osd_tree } = this.get_osd_tree();
        const tree_cfg = {
            osd_tree,
@ -901,7 +945,14 @@ class Mon
                    prev_pgs[pg-1] = this.state.history.last_clean_pgs.items[pool_id][pg].osd_set;
                }
                prev_pgs = JSON.parse(JSON.stringify(prev_pgs.length ? prev_pgs : real_prev_pgs));
-                const old_pg_count = prev_pgs.length;
+                const old_pg_count = real_prev_pgs.length;
+                const optimize_cfg = {
+                    osd_tree: pool_tree,
+                    pg_count: pool_cfg.pg_count,
+                    pg_size: pool_cfg.pg_size,
+                    pg_minsize: pool_cfg.pg_minsize,
+                    max_combinations: pool_cfg.max_osd_combinations,
+                };
                let optimize_result;
                if (old_pg_count > 0)
                {
@ -928,24 +979,23 @@ class Mon
                            pg.pop();
                        }
                    }
-                    optimize_result = await LPOptimizer.optimize_change({
-                        prev_pgs,
-                        osd_tree: pool_tree,
-                        pg_size: pool_cfg.pg_size,
-                        pg_minsize: pool_cfg.pg_minsize,
-                        max_combinations: pool_cfg.max_osd_combinations,
-                    });
+                    if (!this.state.config.pgs.hash)
+                    {
+                        // Re-shuffle PGs
+                        optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
                    }
                    else
                    {
-                    optimize_result = await LPOptimizer.optimize_initial({
-                        osd_tree: pool_tree,
-                        pg_count: pool_cfg.pg_count,
-                        pg_size: pool_cfg.pg_size,
-                        pg_minsize: pool_cfg.pg_minsize,
-                        max_combinations: pool_cfg.max_osd_combinations,
+                        optimize_result = await LPOptimizer.optimize_change({
+                            prev_pgs,
+                            ...optimize_cfg,
                        });
                    }
+                }
+                else
+                {
+                    optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
+                }
                if (old_pg_count != optimize_result.int_pgs.length)
                {
                    console.log(
@ -1067,12 +1117,10 @@ class Mon

    sum_stats()
    {
-        let overflow = false;
-        this.prev_stats = this.prev_stats || { op_stats: {}, subop_stats: {}, recovery_stats: {} };
        const op_stats = {}, subop_stats = {}, recovery_stats = {};
        for (const osd in this.state.osd.stats)
        {
-            const st = this.state.osd.stats[osd];
+            const st = this.state.osd.stats[osd]||{};
            for (const op in st.op_stats||{})
            {
                op_stats[op] = op_stats[op] || { count: 0n, usec: 0n, bytes: 0n };
@ -1093,52 +1141,11 @@ class Mon
                recovery_stats[op].bytes += BigInt(st.recovery_stats[op].bytes||0);
            }
        }
-        for (const op in op_stats)
-        {
-            if (op_stats[op].count >= 0x10000000000000000n)
-            {
-                if (!this.prev_stats.op_stats[op])
-                {
-                    overflow = true;
+        return { op_stats, subop_stats, recovery_stats };
    }
-                else
+
+    sum_object_counts()
    {
-                    op_stats[op].count -= this.prev_stats.op_stats[op].count;
-                    op_stats[op].usec -= this.prev_stats.op_stats[op].usec;
-                    op_stats[op].bytes -= this.prev_stats.op_stats[op].bytes;
-                }
-            }
-        }
-        for (const op in subop_stats)
-        {
-            if (subop_stats[op].count >= 0x10000000000000000n)
-            {
-                if (!this.prev_stats.subop_stats[op])
-                {
-                    overflow = true;
-                }
-                else
-                {
-                    subop_stats[op].count -= this.prev_stats.subop_stats[op].count;
-                    subop_stats[op].usec -= this.prev_stats.subop_stats[op].usec;
-                }
-            }
-        }
-        for (const op in recovery_stats)
-        {
-            if (recovery_stats[op].count >= 0x10000000000000000n)
-            {
-                if (!this.prev_stats.recovery_stats[op])
-                {
-                    overflow = true;
-                }
-                else
-                {
-                    recovery_stats[op].count -= this.prev_stats.recovery_stats[op].count;
-                    recovery_stats[op].bytes -= this.prev_stats.recovery_stats[op].bytes;
-                }
-            }
-        }
        const object_counts = { object: 0n, clean: 0n, misplaced: 0n, degraded: 0n, incomplete: 0n };
        for (const pool_id in this.state.pg.stats)
        {
@ -1157,36 +1164,123 @@ class Mon
                }
            }
        }
-        return (this.prev_stats = { overflow, op_stats, subop_stats, recovery_stats, object_counts });
+        return object_counts;
+    }
+
+    sum_inode_stats()
+    {
+        const inode_stats = {};
+        const inode_stub = () => ({
+            raw_used: 0n,
+            read: { count: 0n, usec: 0n, bytes: 0n },
+            write: { count: 0n, usec: 0n, bytes: 0n },
+            delete: { count: 0n, usec: 0n, bytes: 0n },
+        });
+        for (const osd_num in this.state.osd.space)
+        {
+            for (const pool_id in this.state.osd.space[osd_num])
+            {
+                inode_stats[pool_id] = inode_stats[pool_id] || {};
+                for (const inode_num in this.state.osd.space[osd_num][pool_id])
+                {
+                    inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
+                    inode_stats[pool_id][inode_num].raw_used += BigInt(this.state.osd.space[osd_num][pool_id][inode_num]||0);
+                }
+            }
+        }
+        for (const osd_num in this.state.osd.inodestats)
+        {
+            const ist = this.state.osd.inodestats[osd_num];
+            for (const pool_id in ist)
+            {
+                inode_stats[pool_id] = inode_stats[pool_id] || {};
+                for (const inode_num in ist[pool_id])
+                {
+                    inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
+                    for (const op of [ 'read', 'write', 'delete' ])
+                    {
+                        inode_stats[pool_id][inode_num][op].count += BigInt(ist[pool_id][inode_num][op].count||0);
+                        inode_stats[pool_id][inode_num][op].usec += BigInt(ist[pool_id][inode_num][op].usec||0);
+                        inode_stats[pool_id][inode_num][op].bytes += BigInt(ist[pool_id][inode_num][op].bytes||0);
+                    }
+                }
+            }
+        }
+        return inode_stats;
+    }
+
+    fix_stat_overflows(obj, scratch)
+    {
+        for (const k in obj)
+        {
+            if (typeof obj[k] == 'bigint')
+            {
+                if (obj[k] >= 0x10000000000000000n)
+                {
+                    if (scratch[k])
+                    {
+                        for (const k2 in scratch)
+                        {
+                            obj[k2] -= scratch[k2];
+                            scratch[k2] = 0n;
+                        }
+                    }
+                    else
+                    {
+                        for (const k2 in obj)
+                        {
+                            scratch[k2] = obj[k2];
+                        }
+                    }
+                }
+            }
+            else if (typeof obj[k] == 'object')
+            {
+                this.fix_stat_overflows(obj[k], scratch[k] = (scratch[k] || {}));
+            }
+        }
+    }
+
+    serialize_bigints(obj)
+    {
+        for (const k in obj)
+        {
+            if (typeof obj[k] == 'bigint')
+            {
+                obj[k] = ''+obj[k];
+            }
+            else if (typeof obj[k] == 'object')
+            {
+                this.serialize_bigints(obj[k]);
+            }
+        }
    }

    async update_total_stats()
    {
+        const txn = [];
        const stats = this.sum_stats();
-        if (!stats.overflow)
+        const object_counts = this.sum_object_counts();
+        const inode_stats = this.sum_inode_stats();
+        this.fix_stat_overflows(stats, (this.prev_stats = this.prev_stats || {}));
+        this.fix_stat_overflows(inode_stats, (this.prev_inode_stats = this.prev_inode_stats || {}));
+        stats.object_counts = object_counts;
+        this.serialize_bigints(stats);
+        this.serialize_bigints(inode_stats);
+        txn.push({ requestPut: { key: b64(this.etcd_prefix+'/stats'), value: b64(JSON.stringify(stats)) } });
+        for (const pool_id in inode_stats)
        {
-            // Convert to strings, serialize and save
-            const ser = {};
-            for (const st of [ 'op_stats', 'subop_stats', 'recovery_stats' ])
+            for (const inode_num in inode_stats[pool_id])
            {
-                ser[st] = {};
-                for (const op in stats[st])
-                {
-                    ser[st][op] = {};
-                    for (const k in stats[st][op])
-                    {
-                        ser[st][op][k] = ''+stats[st][op][k];
+                txn.push({ requestPut: {
+                    key: b64(this.etcd_prefix+'/inode/stats/'+pool_id+'/'+inode_num),
+                    value: b64(JSON.stringify(inode_stats[pool_id][inode_num])),
+                } });
            }
        }
-            }
-            ser.object_counts = {};
-            for (const k in stats.object_counts)
+        if (txn.length)
        {
-                ser.object_counts[k] = ''+stats.object_counts[k];
-            }
-            await this.etcd_call('/kv/txn', {
-                success: [ { requestPut: { key: b64(this.etcd_prefix+'/stats'), value: b64(JSON.stringify(ser)) } } ],
-            }, this.config.etcd_mon_timeout, 0);
+            await this.etcd_call('/kv/txn', { success: txn }, this.config.etcd_mon_timeout, 0);
        }
    }
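
Review note on serialize_bigints() above: the counters are summed as BigInt values, and JSON.stringify() throws a TypeError on BigInt, so they have to be converted to decimal strings before the etcd put. A minimal sketch of that step, using a hypothetical sample object:

```javascript
// Recursively replace BigInt values with decimal strings, as serialize_bigints() does.
function serializeBigints(obj)
{
    for (const k in obj)
    {
        if (typeof obj[k] == 'bigint')
        {
            obj[k] = ''+obj[k];
        }
        else if (typeof obj[k] == 'object' && obj[k] !== null)
        {
            serializeBigints(obj[k]);
        }
    }
    return obj;
}

// Hypothetical stats object with a counter just past the uint64_t range
const stats = { op_stats: { read: { count: 18446744073709551616n, usec: 5n, bytes: 0n } } };

let threw = false;
try
{
    JSON.stringify(stats);
}
catch (e)
{
    threw = e instanceof TypeError; // BigInt is not JSON-serializable
}

const json = JSON.stringify(serializeBigints(stats));
console.log(threw, json);
```

Strings also keep full precision for readers that parse the JSON with 64-bit floating point numbers.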

--- a/mon/simple-offsets.js
+++ b/mon/simple-offsets.js
@ -51,7 +51,7 @@ async function run()
    const meta_offset = options.journal_offset + Math.ceil(options.journal_size/options.device_block_size)*options.device_block_size;
    const entries_per_block = Math.floor(options.device_block_size / (24 + 2*options.object_size/options.bitmap_granularity/8));
    const object_count = Math.floor((device_size-meta_offset)/options.object_size);
-    const meta_size = Math.ceil(object_count / entries_per_block) * options.device_block_size;
+    const meta_size = Math.ceil(1 + object_count / entries_per_block) * options.device_block_size;
    const data_offset = meta_offset + meta_size;
    const meta_size_fmt = (meta_size > 1024*1024*1024 ? Math.round(meta_size/1024/1024/1024*100)/100+" GB"
        : Math.round(meta_size/1024/1024*100)/100+" MB");
@ -65,6 +65,9 @@ async function run()
            );
        }
        process.stdout.write(
+            (options.device_block_size != 4096 ?
+                `    --meta_block_size ${options.device_block_size}\n`+
+                `    --journal_block_size ${options.device_block_size}\n` : '')+
            `    --data_device ${options.device}\n`+
            `    --journal_offset ${options.journal_offset}\n`+
            `    --meta_offset ${meta_offset}\n`+
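
Review note on the meta_size change above: the new formula adds one extra block to the metadata area. The hunk does not say why; a reserved block at the start of the metadata area is one plausible reading, and that is an assumption. The arithmetic can be checked with the sketch below, where every input value (1 GiB device, 16 MiB journal, 128 KiB objects, 4 KiB granularity) is hypothetical.

```javascript
// Hypothetical inputs, chosen only to make the arithmetic easy to follow
const options = {
    device_block_size: 4096,
    object_size: 128*1024,
    bitmap_granularity: 4096,
    journal_offset: 0,
    journal_size: 16*1024*1024,
};
const device_size = 1024*1024*1024;

// Same formulas as in simple-offsets.js
const meta_offset = options.journal_offset
    + Math.ceil(options.journal_size/options.device_block_size)*options.device_block_size;
const entries_per_block = Math.floor(options.device_block_size
    / (24 + 2*options.object_size/options.bitmap_granularity/8));
const object_count = Math.floor((device_size-meta_offset)/options.object_size);

// Before and after this commit
const old_meta_size = Math.ceil(object_count / entries_per_block) * options.device_block_size;
const new_meta_size = Math.ceil(1 + object_count / entries_per_block) * options.device_block_size;

console.log({ entries_per_block, object_count, old_meta_size, new_meta_size });
```

With these inputs each 4096-byte block holds 128 entries, the device fits 8064 objects, and the metadata area grows from 63 blocks (258048 bytes) to 64 blocks (262144 bytes).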
--- a/rpm/build-tarball.sh
+++ b/rpm/build-tarball.sh
@ -48,4 +48,4 @@ FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Ve
 QEMU=`rpm -qi qemu qemu-kvm | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
 perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
 perl -i -pe 's/(Requires:\s*qemu(?:-kvm)?)([^\n]+)?/$1 = '$QEMU'/' $VITASTOR/rpm/vitastor-el$EL.spec
-tar --transform 's#^#vitastor-0.5.10/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.5.10$(rpm --eval '%dist').tar.gz *
+tar --transform 's#^#vitastor-0.6.2/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.6.2$(rpm --eval '%dist').tar.gz *
--- a/rpm/vitastor-el7.Dockerfile
+++ b/rpm/vitastor-el7.Dockerfile
@ -37,7 +37,7 @@ ADD . /root/vitastor
 RUN set -e; \
    cd /root/vitastor/rpm; \
    sh build-tarball.sh; \
-    cp /root/vitastor-0.5.10.el7.tar.gz ~/rpmbuild/SOURCES; \
+    cp /root/vitastor-0.6.2.el7.tar.gz ~/rpmbuild/SOURCES; \
    cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
    cd ~/rpmbuild/SPECS/; \
    rpmbuild -ba vitastor.spec; \
--- a/rpm/vitastor-el7.spec
+++ b/rpm/vitastor-el7.spec
@ -1,11 +1,11 @@
 Name:           vitastor
-Version:        0.5.10
+Version:        0.6.2
 Release:        1%{?dist}
 Summary:        Vitastor, a fast software-defined clustered block storage

 License:        Vitastor Network Public License 1.1
 URL:            https://vitastor.io/
-Source0:        vitastor-0.5.10.el7.tar.gz
+Source0:        vitastor-0.6.2.el7.tar.gz

 BuildRequires:  liburing-devel >= 0.6
 BuildRequires:  gperftools-devel
--- a/rpm/vitastor-el8.Dockerfile
+++ b/rpm/vitastor-el8.Dockerfile
@ -35,7 +35,7 @@ ADD . /root/vitastor
 RUN set -e; \
    cd /root/vitastor/rpm; \
    sh build-tarball.sh; \
-    cp /root/vitastor-0.5.10.el8.tar.gz ~/rpmbuild/SOURCES; \
+    cp /root/vitastor-0.6.2.el8.tar.gz ~/rpmbuild/SOURCES; \
    cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
    cd ~/rpmbuild/SPECS/; \
    rpmbuild -ba vitastor.spec; \
--- a/rpm/vitastor-el8.spec
+++ b/rpm/vitastor-el8.spec
@ -1,11 +1,11 @@
 Name:           vitastor
-Version:        0.5.10
+Version:        0.6.2
 Release:        1%{?dist}
 Summary:        Vitastor, a fast software-defined clustered block storage

 License:        Vitastor Network Public License 1.1
 URL:            https://vitastor.io/
-Source0:        vitastor-0.5.10.el8.tar.gz
+Source0:        vitastor-0.6.2.el8.tar.gz

 BuildRequires:  liburing-devel >= 0.6
 BuildRequires:  gperftools-devel
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@ -13,8 +13,8 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
 	set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
 endif()

-add_definitions(-DVERSION="0.6-dev")
-add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith)
+add_definitions(-DVERSION="0.6.2")
+add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -I ${CMAKE_SOURCE_DIR}/src)
 if (${WITH_ASAN})
 	add_definitions(-fsanitize=address -fno-omit-frame-pointer)
 	add_link_options(-fsanitize=address -fno-omit-frame-pointer)
@ -63,13 +63,22 @@ target_link_libraries(fio_vitastor_blk
 	vitastor_blk
 )

+# libvitastor_common.a
+add_library(vitastor_common STATIC
+	epoll_manager.cpp etcd_state_client.cpp
+	messenger.cpp msgr_stop.cpp msgr_op.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp
+	http_client.cpp osd_ops.cpp pg_states.cpp timerfd_manager.cpp base64.cpp
+)
+target_compile_options(vitastor_common PUBLIC -fPIC)
+
 # vitastor-osd
 add_executable(vitastor-osd
-	osd_main.cpp osd.cpp osd_secondary.cpp msgr_receive.cpp msgr_send.cpp osd_peering.cpp osd_flush.cpp osd_peering_pg.cpp
-	osd_primary.cpp osd_primary_subops.cpp etcd_state_client.cpp messenger.cpp osd_cluster.cpp http_client.cpp osd_ops.cpp pg_states.cpp
-	osd_rmw.cpp base64.cpp timerfd_manager.cpp epoll_manager.cpp ../json11/json11.cpp
+	osd_main.cpp osd.cpp osd_secondary.cpp osd_peering.cpp osd_flush.cpp osd_peering_pg.cpp
+	osd_primary.cpp osd_primary_chain.cpp osd_primary_sync.cpp osd_primary_write.cpp osd_primary_subops.cpp
+	osd_cluster.cpp osd_rmw.cpp
 )
 target_link_libraries(vitastor-osd
+	vitastor_common
 	vitastor_blk
 	Jerasure
 )
@ -85,11 +94,10 @@ target_link_libraries(fio_vitastor_sec

 # libvitastor_client.so
 add_library(vitastor_client SHARED
-	cluster_client.cpp epoll_manager.cpp etcd_state_client.cpp
-	messenger.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp
-	http_client.cpp osd_ops.cpp pg_states.cpp timerfd_manager.cpp base64.cpp
+	cluster_client.cpp
 )
 target_link_libraries(vitastor_client
+	vitastor_common
 	tcmalloc_minimal
 	${LIBURING_LIBRARIES}
 )
@ -161,9 +169,10 @@ target_link_libraries(osd_rmw_test Jerasure tcmalloc_minimal)

 # stub_uring_osd
 add_executable(stub_uring_osd
-	stub_uring_osd.cpp epoll_manager.cpp messenger.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp timerfd_manager.cpp ../json11/json11.cpp
+	stub_uring_osd.cpp
 )
 target_link_libraries(stub_uring_osd
+	vitastor_common
 	${LIBURING_LIBRARIES}
 	tcmalloc_minimal
 )
@ -175,8 +184,17 @@ target_link_libraries(osd_peering_pg_test tcmalloc_minimal)
 # test_allocator
 add_executable(test_allocator test_allocator.cpp allocator.cpp)

+# test_cluster_client
+add_executable(test_cluster_client
+	test_cluster_client.cpp
+	pg_states.cpp osd_ops.cpp cluster_client.cpp msgr_op.cpp mock/messenger.cpp msgr_stop.cpp
+	etcd_state_client.cpp timerfd_manager.cpp ../json11/json11.cpp
+)
+target_compile_definitions(test_cluster_client PUBLIC -D__MOCK__)
+target_include_directories(test_cluster_client PUBLIC ${CMAKE_SOURCE_DIR}/src/mock)
+
 ## test_blockstore, test_shit
-#add_executable(test_blockstore test_blockstore.cpp timerfd_interval.cpp)
+#add_executable(test_blockstore test_blockstore.cpp)
 #target_link_libraries(test_blockstore blockstore)
 #add_executable(test_shit test_shit.cpp osd_peering_pg.cpp)
 #target_link_libraries(test_shit ${LIBURING_LIBRARIES} m)
--- a/src/allocator.cpp
+++ b/src/allocator.cpp
@ -37,6 +37,21 @@ allocator::~allocator()
    delete[] mask;
 }

+bool allocator::get(uint64_t addr)
+{
+    if (addr >= size)
+    {
+        return false;
+    }
+    uint64_t p2 = 1, offset = 0;
+    while (p2 * 64 < size)
+    {
+        offset += p2;
+        p2 = p2 * 64;
+    }
+    return ((mask[offset + addr/64] >> (addr % 64)) & 1);
+}
+
 void allocator::set(uint64_t addr, bool value)
 {
    if (addr >= size)
@ -127,3 +142,35 @@ uint64_t allocator::get_free_count()
 {
    return free;
 }
+
+void bitmap_set(void *bitmap, uint64_t start, uint64_t len, uint64_t bitmap_granularity)
+{
+    if (start == 0)
+    {
+        if (len == 32*bitmap_granularity)
+        {
+            *((uint32_t*)bitmap) = UINT32_MAX;
+            return;
+        }
+        else if (len == 64*bitmap_granularity)
+        {
+            *((uint64_t*)bitmap) = UINT64_MAX;
+            return;
+        }
+    }
+    unsigned bit_start = start / bitmap_granularity;
+    unsigned bit_end = ((start + len) + bitmap_granularity - 1) / bitmap_granularity;
+    while (bit_start < bit_end)
+    {
+        if (!(bit_start & 7) && bit_end >= bit_start+8)
+        {
+            ((uint8_t*)bitmap)[bit_start / 8] = UINT8_MAX;
+            bit_start += 8;
+        }
+        else
+        {
+            ((uint8_t*)bitmap)[bit_start / 8] |= 1 << (bit_start % 8);
+            bit_start++;
+        }
+    }
+}
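Reviewer note: a minimal standalone sketch of how the `bitmap_set()` helper moved into `allocator.cpp` marks byte ranges, assuming the 4096-byte granularity that `DEFAULT_BITMAP_GRANULARITY` suggests. The two fast-path stores are rewritten with `memset` here to sidestep aliasing concerns in a standalone copy, and `mark_range` is a hypothetical helper that is not part of the commit.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Equivalent to the bitmap_set() added in this hunk; the uint32_t/uint64_t
// fast-path stores are replaced by memset (an assumption made for this sketch).
void bitmap_set(void *bitmap, uint64_t start, uint64_t len, uint64_t bitmap_granularity)
{
    if (start == 0)
    {
        if (len == 32*bitmap_granularity)
        {
            memset(bitmap, 0xFF, 4); // whole 32-bit bitmap is covered
            return;
        }
        else if (len == 64*bitmap_granularity)
        {
            memset(bitmap, 0xFF, 8); // whole 64-bit bitmap is covered
            return;
        }
    }
    // First bit rounds down, last bit rounds up to the granularity
    unsigned bit_start = start / bitmap_granularity;
    unsigned bit_end = ((start + len) + bitmap_granularity - 1) / bitmap_granularity;
    while (bit_start < bit_end)
    {
        if (!(bit_start & 7) && bit_end >= bit_start+8)
        {
            // Byte-aligned run of at least 8 bits: set a whole byte at once
            ((uint8_t*)bitmap)[bit_start / 8] = UINT8_MAX;
            bit_start += 8;
        }
        else
        {
            ((uint8_t*)bitmap)[bit_start / 8] |= 1 << (bit_start % 8);
            bit_start++;
        }
    }
}

// Hypothetical helper: zero a 64-bit bitmap, mark one range, and return the
// bitmap as an integer with bit 0 corresponding to the first granule.
uint64_t mark_range(uint64_t start, uint64_t len, uint64_t granularity = 4096)
{
    uint8_t bmp[8] = {0};
    bitmap_set(bmp, start, len, granularity);
    uint64_t out = 0;
    for (int i = 0; i < 8; i++)
        out |= (uint64_t)bmp[i] << (8*i);
    return out;
}
```

For example, `mark_range(8192, 3*4096)` sets bits 2 to 4, and an unaligned range such as `mark_range(100, 100)` still marks the whole first granule because the end is rounded up.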
--- a/src/allocator.h
+++ b/src/allocator.h
@ -16,7 +16,10 @@ class allocator
 public:
    allocator(uint64_t blocks);
    ~allocator();
+    bool get(uint64_t addr);
    void set(uint64_t addr, bool value);
    uint64_t find_free();
    uint64_t get_free_count();
 };
+
+void bitmap_set(void *bitmap, uint64_t start, uint64_t len, uint64_t bitmap_granularity);
--- a/src/blockstore.cpp
+++ b/src/blockstore.cpp
@ -3,9 +3,9 @@

 #include "blockstore_impl.h"

-blockstore_t::blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop)
+blockstore_t::blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd)
 {
-    impl = new blockstore_impl_t(config, ringloop);
+    impl = new blockstore_impl_t(config, ringloop, tfd);
 }

 blockstore_t::~blockstore_t()
@ -38,11 +38,21 @@ void blockstore_t::enqueue_op(blockstore_op_t *op)
    impl->enqueue_op(op);
 }

+int blockstore_t::read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version)
+{
+    return impl->read_bitmap(oid, target_version, bitmap, result_version);
+}
+
 std::unordered_map<object_id, uint64_t> & blockstore_t::get_unstable_writes()
 {
    return impl->unstable_writes;
 }

+std::map<uint64_t, uint64_t> & blockstore_t::get_inode_space_stats()
+{
+    return impl->inode_space_stats;
+}
+
 uint32_t blockstore_t::get_block_size()
 {
    return impl->get_block_size();
@ -58,7 +68,7 @@ uint64_t blockstore_t::get_free_block_count()
    return impl->get_free_block_count();
 }

-uint32_t blockstore_t::get_disk_alignment()
+uint32_t blockstore_t::get_bitmap_granularity()
 {
-    return impl->get_disk_alignment();
+    return impl->get_bitmap_granularity();
 }
--- a/src/blockstore.h
+++ b/src/blockstore.h
@ -16,6 +16,7 @@

 #include "object_id.h"
 #include "ringloop.h"
+#include "timerfd_manager.h"

 // Memory alignment for direct I/O (usually 512 bytes)
 // All other alignments must be a multiple of this one
@ -27,6 +28,7 @@
 #define DEFAULT_ORDER 17
 #define MIN_BLOCK_SIZE 4*1024
 #define MAX_BLOCK_SIZE 128*1024*1024
+#define DEFAULT_BITMAP_GRANULARITY 4096

 #define BS_OP_MIN 1
 #define BS_OP_READ 1
@ -64,6 +66,8 @@ Input:
 - offset, len = offset and length within object. length may be zero, in that case
  read operation only returns the version / write operation only bumps the version
 - buf = pre-allocated buffer for data (read) / with data (write). may be NULL if len == 0.
+- bitmap = pointer to the new 'external' object bitmap data. The part of it that corresponds to the
+  write request is copied bitwise into the metadata area and stored there.

 Output:
 - retval = number of bytes actually read/written or negative error number (-EINVAL or -ENOSPC)
@ -141,6 +145,7 @@ struct blockstore_op_t
    uint32_t offset;
    uint32_t len;
    void *buf;
+    void *bitmap;
    int retval;

    uint8_t private_data[BS_OP_PRIVATE_DATA_SIZE];
@ -154,7 +159,7 @@ class blockstore_t
 {
    blockstore_impl_t *impl;
 public:
-    blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop);
+    blockstore_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd);
    ~blockstore_t();

    // Event loop
@ -175,13 +180,19 @@ public:
    // Submission
    void enqueue_op(blockstore_op_t *op);

+    // Simplified synchronous operation: get object bitmap & current version
+    int read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version = NULL);
+
    // Unstable writes are added here (map of object_id -> version)
    std::unordered_map<object_id, uint64_t> & get_unstable_writes();

+    // Get per-inode space usage statistics
+    std::map<uint64_t, uint64_t> & get_inode_space_stats();
+
    // FIXME rename to object_size
    uint32_t get_block_size();
    uint64_t get_block_count();
    uint64_t get_free_block_count();

-    uint32_t get_disk_alignment();
+    uint32_t get_bitmap_granularity();
 };
--- a/src/blockstore_flush.cpp
+++ b/src/blockstore_flush.cpp
@ -3,12 +3,13 @@

 #include "blockstore_impl.h"

-journal_flusher_t::journal_flusher_t(int flusher_count, blockstore_impl_t *bs)
+journal_flusher_t::journal_flusher_t(blockstore_impl_t *bs)
 {
    this->bs = bs;
-    this->flusher_count = flusher_count;
-    this->cur_flusher_count = 1;
-    this->target_flusher_count = 1;
+    this->max_flusher_count = bs->max_flusher_count;
+    this->min_flusher_count = bs->min_flusher_count;
+    this->cur_flusher_count = bs->min_flusher_count;
+    this->target_flusher_count = bs->min_flusher_count;
    dequeuing = false;
    trimming = false;
    active_flushers = 0;
@ -16,11 +17,11 @@ journal_flusher_t::journal_flusher_t(int flusher_count, blockstore_impl_t *bs)
    // FIXME: allow to configure flusher_start_threshold and journal_trim_interval
    flusher_start_threshold = bs->journal_block_size / sizeof(journal_entry_stable);
    journal_trim_interval = 512;
-    journal_trim_counter = 0;
-    trim_wanted = 0;
+    journal_trim_counter = bs->journal.flush_journal ? 1 : 0;
+    trim_wanted = bs->journal.flush_journal ? 1 : 0;
    journal_superblock = bs->journal.inmemory ? bs->journal.buffer : memalign_or_die(MEM_ALIGNMENT, bs->journal_block_size);
-    co = new journal_flusher_co[flusher_count];
-    for (int i = 0; i < flusher_count; i++)
+    co = new journal_flusher_co[max_flusher_count];
+    for (int i = 0; i < max_flusher_count; i++)
    {
        co[i].bs = bs;
        co[i].flusher = this;
@ -71,10 +72,10 @@ bool journal_flusher_t::is_active()
 void journal_flusher_t::loop()
 {
    target_flusher_count = bs->write_iodepth*2;
-    if (target_flusher_count <= 0)
-        target_flusher_count = 1;
-    else if (target_flusher_count > flusher_count)
-        target_flusher_count = flusher_count;
+    if (target_flusher_count < min_flusher_count)
+        target_flusher_count = min_flusher_count;
+    else if (target_flusher_count > max_flusher_count)
+        target_flusher_count = max_flusher_count;
    if (target_flusher_count > cur_flusher_count)
        cur_flusher_count = target_flusher_count;
    else if (target_flusher_count < cur_flusher_count)
@ -237,7 +238,8 @@ bool journal_flusher_co::loop()
    else if (wait_state == 21)
        goto resume_21;
 resume_0:
-    if (!flusher->flush_queue.size() || !flusher->dequeuing)
+    if (flusher->flush_queue.size() < flusher->min_flusher_count && !flusher->trim_wanted ||
+        !flusher->flush_queue.size() || !flusher->dequeuing)
    {
 stop_flusher:
        if (flusher->trim_wanted > 0 && flusher->journal_trim_counter > 0)
@ -426,18 +428,18 @@ resume_1:
        {
            new_clean_bitmap = (bs->inmemory_meta
                ? meta_new.buf + meta_new.pos*bs->clean_entry_size + sizeof(clean_disk_entry)
-                : bs->clean_bitmap + (clean_loc >> bs->block_order)*bs->clean_entry_bitmap_size);
+                : bs->clean_bitmap + (clean_loc >> bs->block_order)*(2*bs->clean_entry_bitmap_size));
            if (clean_init_bitmap)
            {
                memset(new_clean_bitmap, 0, bs->clean_entry_bitmap_size);
-                bitmap_set(new_clean_bitmap, clean_bitmap_offset, clean_bitmap_len);
+                bitmap_set(new_clean_bitmap, clean_bitmap_offset, clean_bitmap_len, bs->bitmap_granularity);
            }
        }
        for (it = v.begin(); it != v.end(); it++)
        {
            if (new_clean_bitmap)
            {
-                bitmap_set(new_clean_bitmap, it->offset, it->len);
+                bitmap_set(new_clean_bitmap, it->offset, it->len, bs->bitmap_granularity);
            }
            await_sqe(4);
            data->iov = (struct iovec){ it->buf, (size_t)it->len };
@ -471,6 +473,7 @@ resume_1:
                wait_state = 5;
                return false;
            }
+            // zero out old metadata entry
            memset(meta_old.buf + meta_old.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
            await_sqe(15);
            data->iov = (struct iovec){ meta_old.buf, bs->meta_block_size };
@ -482,6 +485,14 @@ resume_1:
        }
        if (has_delete)
        {
+            clean_disk_entry *new_entry = (clean_disk_entry*)(meta_new.buf + meta_new.pos*bs->clean_entry_size);
+            if (new_entry->oid.inode != 0 && new_entry->oid != cur.oid)
+            {
+                printf("Fatal error (metadata corruption or bug): tried to delete metadata entry %lu (%lx:%lx) while deleting %lx:%lx\n",
+                    clean_loc >> bs->block_order, new_entry->oid.inode, new_entry->oid.stripe, cur.oid.inode, cur.oid.stripe);
+                exit(1);
+            }
+            // zero out new metadata entry
            memset(meta_new.buf + meta_new.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
        }
        else
@ -499,6 +510,12 @@ resume_1:
            {
                memcpy(&new_entry->bitmap, new_clean_bitmap, bs->clean_entry_bitmap_size);
            }
+            // copy latest external bitmap/attributes
+            if (bs->clean_entry_bitmap_size)
+            {
+                void *bmp_ptr = bs->clean_entry_bitmap_size > sizeof(void*) ? dirty_end->second.bitmap : &dirty_end->second.bitmap;
+                memcpy((void*)(new_entry+1) + bs->clean_entry_bitmap_size, bmp_ptr, bs->clean_entry_bitmap_size);
+            }
        }
        await_sqe(6);
        data->iov = (struct iovec){ meta_new.buf, bs->meta_block_size };
@ -585,6 +602,7 @@ resume_1:
                    .size = sizeof(journal_entry_start),
                    .reserved = 0,
                    .journal_start = new_trim_pos,
+                    .version = JOURNAL_VERSION,
                };
                ((journal_entry_start*)flusher->journal_superblock)->crc32 = je_crc32((journal_entry*)flusher->journal_superblock);
                data->iov = (struct iovec){ flusher->journal_superblock, bs->journal_block_size };
@ -616,6 +634,12 @@ resume_1:
 #endif
                flusher->trimming = false;
            }
+            if (bs->journal.flush_journal && !flusher->flush_queue.size())
+            {
+                assert(bs->journal.used_start == bs->journal.next_free);
+                printf("Journal flushed\n");
+                exit(0);
+            }
        }
        // All done
        flusher->active_flushers--;
@ -646,7 +670,7 @@ bool journal_flusher_co::scan_dirty(int wait_base)
        {
            char err[1024];
            snprintf(
-                err, 1024, "BUG: Unexpected dirty_entry %lx:%lx v%lu unstable state during flush: %d",
+                err, 1024, "BUG: Unexpected dirty_entry %lx:%lx v%lu unstable state during flush: 0x%x",
                dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version, dirty_it->second.state
            );
            throw std::runtime_error(err);
@ -775,7 +799,10 @@ void journal_flusher_co::update_clean_db()
    if (old_clean_loc != UINT64_MAX && old_clean_loc != clean_loc)
    {
 #ifdef BLOCKSTORE_DEBUG
-        printf("Free block %lu (new location is %lu)\n", old_clean_loc >> bs->block_order, clean_loc >> bs->block_order);
+        printf("Free block %lu from %lx:%lx v%lu (new location is %lu)\n",
+            old_clean_loc >> bs->block_order,
+            cur.oid.inode, cur.oid.stripe, cur.version,
+            clean_loc >> bs->block_order);
 #endif
        bs->data_alloc->set(old_clean_loc >> bs->block_order, false);
    }
@ -783,6 +810,11 @@ void journal_flusher_co::update_clean_db()
    {
        auto clean_it = bs->clean_db.find(cur.oid);
        bs->clean_db.erase(clean_it);
+#ifdef BLOCKSTORE_DEBUG
+        printf("Free block %lu from %lx:%lx v%lu (delete)\n",
+            clean_loc >> bs->block_order,
+            cur.oid.inode, cur.oid.stripe, cur.version);
+#endif
        bs->data_alloc->set(clean_loc >> bs->block_order, false);
        clean_loc = UINT64_MAX;
    }
@ -804,7 +836,7 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
        goto resume_1;
    else if (wait_state == wait_base+2)
        goto resume_2;
-    if (!(fsync_meta ? bs->disable_meta_fsync : bs->disable_journal_fsync))
+    if (!(fsync_meta ? bs->disable_meta_fsync : bs->disable_data_fsync))
    {
        cur_sync = flusher->syncs.end();
        while (cur_sync != flusher->syncs.begin())
@ -861,35 +893,3 @@ bool journal_flusher_co::fsync_batch(bool fsync_meta, int wait_base)
    }
    return true;
 }
-
-void journal_flusher_co::bitmap_set(void *bitmap, uint64_t start, uint64_t len)
-{
-    if (start == 0)
-    {
-        if (len == 32*bs->bitmap_granularity)
-        {
-            *((uint32_t*)bitmap) = UINT32_MAX;
-            return;
-        }
-        else if (len == 64*bs->bitmap_granularity)
-        {
-            *((uint64_t*)bitmap) = UINT64_MAX;
-            return;
-        }
-    }
-    unsigned bit_start = start / bs->bitmap_granularity;
-    unsigned bit_end = ((start + len) + bs->bitmap_granularity - 1) / bs->bitmap_granularity;
-    while (bit_start < bit_end)
-    {
-        if (!(bit_start & 7) && bit_end >= bit_start+8)
-        {
-            ((uint8_t*)bitmap)[bit_start / 8] = UINT8_MAX;
-            bit_start += 8;
-        }
-        else
-        {
-            ((uint8_t*)bitmap)[bit_start / 8] |= 1 << (bit_start % 8);
-            bit_start++;
-        }
-    }
-}
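Reviewer note: the `journal_flusher_t::loop()` hunk in this file replaces the fixed lower bound of 1 with `min_flusher_count`. A minimal sketch of that clamping as a free function follows; `clamp_flusher_count` is a hypothetical name, and the sample bounds in the usage note are arbitrary, not defaults taken from the commit.

```cpp
#include <cassert>

// Hypothetical free function mirroring the clamping done in journal_flusher_t::loop():
// the target is twice the current write iodepth, bounded by [min, max].
int clamp_flusher_count(int write_iodepth, int min_flusher_count, int max_flusher_count)
{
    int target = write_iodepth*2;
    if (target < min_flusher_count)
        target = min_flusher_count;
    else if (target > max_flusher_count)
        target = max_flusher_count;
    return target;
}
```

With bounds of 1 and 256, an idle queue (`write_iodepth == 0`) yields 1 flusher, an iodepth of 10 yields 20, and an iodepth of 200 is capped at 256.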
--- a/src/blockstore_flush.h
+++ b/src/blockstore_flush.h
@ -69,7 +69,6 @@ class journal_flusher_co
    bool modify_meta_read(uint64_t meta_loc, flusher_meta_write_t &wr, int wait_base);
    void update_clean_db();
    bool fsync_batch(bool fsync_meta, int wait_base);
-    void bitmap_set(void *bitmap, uint64_t start, uint64_t len);
 public:
    journal_flusher_co();
    bool loop();
@ -80,7 +79,7 @@ class journal_flusher_t
 {
    int trim_wanted = 0;
    bool dequeuing;
-    int flusher_count, cur_flusher_count, target_flusher_count;
+    int min_flusher_count, max_flusher_count, cur_flusher_count, target_flusher_count;
    int flusher_start_threshold;
    journal_flusher_co *co;
    blockstore_impl_t *bs;
@ -99,7 +98,7 @@ class journal_flusher_t
    std::deque<object_id> flush_queue;
    std::map<object_id, uint64_t> flush_versions;
 public:
-    journal_flusher_t(int flusher_count, blockstore_impl_t *bs);
+    journal_flusher_t(blockstore_impl_t *bs);
    ~journal_flusher_t();
    void loop();
    bool is_active();
--- a/src/blockstore_impl.cpp
+++ b/src/blockstore_impl.cpp
@ -3,16 +3,17 @@

 #include "blockstore_impl.h"

-blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop)
+blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd)
 {
    assert(sizeof(blockstore_op_private_t) <= BS_OP_PRIVATE_DATA_SIZE);
+    this->tfd = tfd;
    this->ringloop = ringloop;
    ring_consumer.loop = [this]() { loop(); };
    ringloop->register_consumer(&ring_consumer);
    initialized = 0;
-    zero_object = (uint8_t*)memalign_or_die(MEM_ALIGNMENT, block_size);
    data_fd = meta_fd = journal.fd = -1;
    parse_config(config);
+    zero_object = (uint8_t*)memalign_or_die(MEM_ALIGNMENT, block_size);
    try
    {
        open_data();
@ -31,7 +32,7 @@ blockstore_impl_t::blockstore_impl_t(blockstore_config_t & config, ring_loop_t *
            close(journal.fd);
        throw;
    }
-    flusher = new journal_flusher_t(flusher_count, this);
+    flusher = new journal_flusher_t(this);
 }

 blockstore_impl_t::~blockstore_impl_t()
@ -92,10 +93,23 @@ void blockstore_impl_t::loop()
            {
                delete journal_init_reader;
                journal_init_reader = NULL;
+                if (journal.flush_journal)
+                    initialized = 3;
+                else
                    initialized = 10;
                ringloop->wakeup();
            }
        }
+        if (initialized == 3)
+        {
+            if (readonly)
+            {
+                printf("Can't flush the journal in readonly mode\n");
+                exit(1);
+            }
+            flusher->loop();
+            ringloop->submit();
+        }
    }
    else
    {
@ -443,7 +457,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
        }
        for (; clean_it != clean_end; clean_it++)
        {
-            if (!pg_count || ((clean_it->first.inode + clean_it->first.stripe / pg_stripe_size) % pg_count) == list_pg)
+            if (!pg_count || ((clean_it->first.stripe / pg_stripe_size) % pg_count) == list_pg) // like map_to_pg()
            {
                if (stable_count >= stable_alloc)
                {
@ -488,7 +502,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
        }
        for (; dirty_it != dirty_end; dirty_it++)
        {
-            if (!pg_count || ((dirty_it->first.oid.inode + dirty_it->first.oid.stripe / pg_stripe_size) % pg_count) == list_pg)
+            if (!pg_count || ((dirty_it->first.oid.stripe / pg_stripe_size) % pg_count) == list_pg) // like map_to_pg()
            {
                if (IS_DELETE(dirty_it->second.state))
                {
--- a/src/blockstore_impl.h
+++ b/src/blockstore_impl.h
@ -9,6 +9,7 @@
 #include <sys/ioctl.h>
 #include <sys/stat.h>
 #include <fcntl.h>
+#include <time.h>
 #include <unistd.h>
 #include <linux/fs.h>

@ -77,7 +78,25 @@

 #include "blockstore_journal.h"

-// 24 bytes + block bitmap per "clean" entry on disk with fixed metadata tables
+// "VITAstor"
+#define BLOCKSTORE_META_MAGIC 0x726F747341544956l
+#define BLOCKSTORE_META_VERSION 1
+
+// metadata header (superblock)
+// FIXME: After adding the OSD superblock, add a key to metadata
+// and journal headers to check if they belong to the same OSD
+struct __attribute__((__packed__)) blockstore_meta_header_t
+{
+    uint64_t zero;
+    uint64_t magic;
+    uint64_t version;
+    uint32_t meta_block_size;
+    uint32_t data_block_size;
+    uint32_t bitmap_granularity;
+};
+
+// 32 bytes = 24 bytes + block bitmap (4 bytes by default) + external attributes (also bitmap, 4 bytes by default)
+// per "clean" entry on disk with fixed metadata tables
 // FIXME: maybe add crc32's to metadata
 struct __attribute__((__packed__)) clean_disk_entry
 {
@ -93,7 +112,7 @@ struct __attribute__((__packed__)) clean_entry
    uint64_t location;
 };

-// 56 = 24 + 32 bytes per dirty entry in memory (obj_ver_id => dirty_entry)
+// 64 = 24 + 40 bytes per dirty entry in memory (obj_ver_id => dirty_entry)
 struct __attribute__((__packed__)) dirty_entry
 {
    uint32_t state;
@ -102,6 +121,7 @@ struct __attribute__((__packed__)) dirty_entry
    uint32_t offset;   // data offset within object (stripe)
    uint32_t len;      // data length
    uint64_t journal_sector; // journal sector used for this entry
+    void* bitmap;   // either external bitmap itself when it fits, or a pointer to it when it doesn't
 };

 // - Sync must be submitted after previous writes/deletes (not before!)
@ -156,6 +176,7 @@ struct blockstore_op_private_t
    struct iovec iov_zerofill[3];
    // Warning: must not have a default value here because it's written to before calling constructor in blockstore_write.cpp O_o
    uint64_t real_version;
+    timespec tv_begin;

    // Sync
    std::vector<obj_ver_id> sync_big_writes, sync_small_writes;
@ -197,10 +218,18 @@ class blockstore_impl_t
    // Suitable only for server SSDs with capacitors, requires disabled data and journal fsyncs
    int immediate_commit = IMMEDIATE_NONE;
    bool inmemory_meta = false;
-    // Maximum flusher count
-    unsigned flusher_count;
+    // Maximum and minimum flusher count
+    unsigned max_flusher_count, min_flusher_count;
    // Maximum queue depth
    unsigned max_write_iodepth = 128;
+    // Enable small (journaled) write throttling, useful for the SSD+HDD case
+    bool throttle_small_writes = false;
+    // Target data device iops, bandwidth and parallelism for throttling (100/100/1 is the default for HDD)
+    int throttle_target_iops = 100;
+    int throttle_target_mbs = 100;
+    int throttle_target_parallelism = 1;
+    // Minimum difference in microseconds between target and real execution times to throttle the response
+    int throttle_threshold_us = 50;
    /******* END OF OPTIONS *******/

    struct ring_consumer_t ring_consumer;
@ -210,6 +239,7 @@ class blockstore_impl_t
    blockstore_dirty_db_t dirty_db;
    std::vector<blockstore_op_t*> submit_queue;
    std::vector<obj_ver_id> unsynced_big_writes, unsynced_small_writes;
+    int unsynced_big_write_count = 0;
    allocator *data_alloc = NULL;
    uint8_t *zero_object;

@ -230,6 +260,7 @@ class blockstore_impl_t

    bool live = false, queue_stall = false;
    ring_loop_t *ringloop;
+    timerfd_manager_t *tfd;

    bool stop_sync_submitted;

@ -249,6 +280,7 @@ class blockstore_impl_t
    void open_data();
    void open_meta();
    void open_journal();
+    uint8_t* get_clean_entry_bitmap(uint64_t block_loc, int offset);

    // Asynchronous init
    int initialized;
@ -283,7 +315,7 @@ class blockstore_impl_t
    // Stabilize
    int dequeue_stable(blockstore_op_t *op);
    int continue_stable(blockstore_op_t *op);
-    void mark_stable(const obj_ver_id & ov);
+    void mark_stable(const obj_ver_id & ov, bool forget_dirty = false);
    void handle_stable_event(ring_data_t *data, blockstore_op_t *op);
    void stabilize_object(object_id oid, uint64_t max_ver);

@ -299,7 +331,7 @@ class blockstore_impl_t

 public:

-    blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop);
+    blockstore_impl_t(blockstore_config_t & config, ring_loop_t *ringloop, timerfd_manager_t *tfd);
    ~blockstore_impl_t();

    // Event loop
@ -320,11 +352,17 @@ public:
    // Submission
    void enqueue_op(blockstore_op_t *op);

+    // Simplified synchronous operation: get object bitmap & current version
+    int read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version = NULL);
+
    // Unstable writes are added here (map of object_id -> version)
    std::unordered_map<object_id, uint64_t> unstable_writes;

+    // Space usage statistics
+    std::map<uint64_t, uint64_t> inode_space_stats;
+
    inline uint32_t get_block_size() { return block_size; }
    inline uint64_t get_block_count() { return block_count; }
    inline uint64_t get_free_block_count() { return data_alloc->get_free_count(); }
-    inline uint32_t get_disk_alignment() { return disk_alignment; }
+    inline uint32_t get_bitmap_granularity() { return disk_alignment; }
 };
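Reviewer note: a small sketch that checks two properties of the new metadata superblock declared in this header, assuming GCC or Clang for the `__packed__` attribute. `meta_header_sketch_t` copies the field layout of `blockstore_meta_header_t`, and `magic_to_string` is a hypothetical helper. The magic constant uses a `ULL` suffix here instead of the `l` suffix in the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Value taken from the diff; only the integer suffix differs
#define BLOCKSTORE_META_MAGIC 0x726F747341544956ULL

// Same field layout as blockstore_meta_header_t in the hunk above
struct __attribute__((__packed__)) meta_header_sketch_t
{
    uint64_t zero;
    uint64_t magic;
    uint64_t version;
    uint32_t meta_block_size;
    uint32_t data_block_size;
    uint32_t bitmap_granularity;
};

// Hypothetical helper: spell out the magic least-significant byte first,
// which is the order the bytes land in on a little-endian host.
std::string magic_to_string(uint64_t magic)
{
    std::string s;
    for (int i = 0; i < 8; i++)
        s += (char)((magic >> (8*i)) & 0xFF);
    return s;
}
```

The packed header occupies 36 bytes (three 64-bit fields plus three 32-bit fields), and the magic decodes to the string named in the comment above the `#define`.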
--- a/src/blockstore_init.cpp
+++ b/src/blockstore_init.cpp
@ -3,6 +3,20 @@

 #include "blockstore_impl.h"

+#define GET_SQE() \
+    sqe = bs->get_sqe();\
+    if (!sqe)\
+        throw std::runtime_error("io_uring is full during initialization");\
+    data = ((ring_data_t*)sqe->user_data)
+
+static bool iszero(uint64_t *buf, int len)
+{
+    for (int i = 0; i < len; i++)
+        if (buf[i] != 0)
+            return false;
+    return true;
+}
+
 blockstore_init_meta::blockstore_init_meta(blockstore_impl_t *bs)
 {
    this->bs = bs;
@ -10,7 +24,7 @@ blockstore_init_meta::blockstore_init_meta(blockstore_impl_t *bs)

 void blockstore_init_meta::handle_event(ring_data_t *data)
 {
-    if (data->res <= 0)
+    if (data->res < 0)
    {
        throw std::runtime_error(
            std::string("read metadata failed at offset ") + std::to_string(metadata_read) +
@ -28,6 +42,12 @@ int blockstore_init_meta::loop()
 {
    if (wait_state == 1)
        goto resume_1;
+    else if (wait_state == 2)
+        goto resume_2;
+    else if (wait_state == 3)
+        goto resume_3;
+    else if (wait_state == 4)
+        goto resume_4;
    printf("Reading blockstore metadata\n");
    if (bs->inmemory_meta)
        metadata_buffer = bs->metadata_buffer;
@ -35,22 +55,98 @@ int blockstore_init_meta::loop()
        metadata_buffer = memalign(MEM_ALIGNMENT, 2*bs->metadata_buf_size);
    if (!metadata_buffer)
        throw std::runtime_error("Failed to allocate metadata read buffer");
-    while (1)
-    {
-    resume_1:
+    // Read superblock
+    GET_SQE();
+    data->iov = { metadata_buffer, bs->meta_block_size };
+    data->callback = [this](ring_data_t *data) { handle_event(data); };
+    my_uring_prep_readv(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset);
+    bs->ringloop->submit();
+    submitted = 1;
+resume_1:
    if (submitted)
    {
        wait_state = 1;
        return 1;
    }
+    if (iszero((uint64_t*)metadata_buffer, bs->meta_block_size / sizeof(uint64_t)))
+    {
+        {
+            blockstore_meta_header_t *hdr = (blockstore_meta_header_t *)metadata_buffer;
+            hdr->zero = 0;
+            hdr->magic = BLOCKSTORE_META_MAGIC;
+            hdr->version = BLOCKSTORE_META_VERSION;
+            hdr->meta_block_size = bs->meta_block_size;
+            hdr->data_block_size = bs->block_size;
+            hdr->bitmap_granularity = bs->bitmap_granularity;
+        }
+        if (bs->readonly)
+        {
+            printf("Skipping metadata initialization because blockstore is readonly\n");
+        }
+        else
+        {
+            printf("Initializing metadata area\n");
+            GET_SQE();
+            data->iov = (struct iovec){ metadata_buffer, bs->meta_block_size };
+            data->callback = [this](ring_data_t *data) { handle_event(data); };
+            my_uring_prep_writev(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset);
+            bs->ringloop->submit();
+            submitted = 1;
+        resume_3:
+            if (submitted > 0)
+            {
+                wait_state = 3;
+                return 1;
+            }
+            zero_on_init = true;
+        }
+    }
+    else
+    {
+        blockstore_meta_header_t *hdr = (blockstore_meta_header_t *)metadata_buffer;
+        if (hdr->zero != 0 ||
+            hdr->magic != BLOCKSTORE_META_MAGIC ||
+            hdr->version != BLOCKSTORE_META_VERSION)
+        {
+            printf(
+                "Metadata is corrupt or old version.\n"
+                " If this is a new OSD please zero out the metadata area before starting it.\n"
+                " If you need to upgrade from 0.5.x please request it via the issue tracker.\n"
+            );
+            exit(1);
+        }
+        if (hdr->meta_block_size != bs->meta_block_size ||
+            hdr->data_block_size != bs->block_size ||
+            hdr->bitmap_granularity != bs->bitmap_granularity)
+        {
+            printf(
+                "Configuration stored in metadata superblock"
+                " (meta_block_size=%u, data_block_size=%u, bitmap_granularity=%u)"
+                " differs from OSD configuration (%lu/%u/%lu).\n",
+                hdr->meta_block_size, hdr->data_block_size, hdr->bitmap_granularity,
+                bs->meta_block_size, bs->block_size, bs->bitmap_granularity
+            );
+            exit(1);
+        }
+    }
+    // Skip superblock
+    bs->meta_offset += bs->meta_block_size;
+    prev_done = 0;
+    done_len = 0;
+    done_pos = 0;
+    metadata_read = 0;
+    // Read the rest of the metadata
+    while (1)
+    {
+    resume_2:
+        if (submitted)
+        {
+            wait_state = 2;
+            return 1;
+        }
        if (metadata_read < bs->meta_len)
        {
-            sqe = bs->get_sqe();
-            if (!sqe)
-            {
-                throw std::runtime_error("io_uring is full while trying to read metadata");
-            }
-            data = ((ring_data_t*)sqe->user_data);
+            GET_SQE();
            data->iov = {
                metadata_buffer + (bs->inmemory_meta
                    ? metadata_read
@ -58,7 +154,14 @@ int blockstore_init_meta::loop()
                bs->meta_len - metadata_read > bs->metadata_buf_size ? bs->metadata_buf_size : bs->meta_len - metadata_read,
            };
            data->callback = [this](ring_data_t *data) { handle_event(data); };
+            if (!zero_on_init)
                my_uring_prep_readv(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset + metadata_read);
+            else
+            {
+                // Fill metadata with zeroes
+                memset(data->iov.iov_base, 0, data->iov.iov_len);
+                my_uring_prep_writev(sqe, bs->meta_fd, &data->iov, 1, bs->meta_offset + metadata_read);
+            }
            bs->ringloop->submit();
            submitted = (prev == 1 ? 2 : 1);
            prev = submitted;
@ -90,6 +193,21 @@ int blockstore_init_meta::loop()
        free(metadata_buffer);
        metadata_buffer = NULL;
    }
+    if (zero_on_init && !bs->disable_meta_fsync)
+    {
+        GET_SQE();
+        my_uring_prep_fsync(sqe, bs->meta_fd, IORING_FSYNC_DATASYNC);
+        data->iov = { 0 };
+        data->callback = [this](ring_data_t *data) { handle_event(data); };
+        submitted = 1;
+        bs->ringloop->submit();
+    resume_4:
+        if (submitted > 0)
+        {
+            wait_state = 4;
+            return 1;
+        }
+    }
    return 0;
 }

@ -100,7 +218,7 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
        clean_disk_entry *entry = (clean_disk_entry*)(entries + i*bs->clean_entry_size);
        if (!bs->inmemory_meta && bs->clean_entry_bitmap_size)
        {
-            memcpy(bs->clean_bitmap + (done_cnt+i)*bs->clean_entry_bitmap_size, &entry->bitmap, bs->clean_entry_bitmap_size);
+            memcpy(bs->clean_bitmap + (done_cnt+i)*2*bs->clean_entry_bitmap_size, &entry->bitmap, 2*bs->clean_entry_bitmap_size);
        }
        if (entry->oid.inode > 0)
        {
@ -111,10 +229,17 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
                {
                    // free the previous block
 #ifdef BLOCKSTORE_DEBUG
-                    printf("Free block %lu (new location is %lu)\n", clean_it->second.location >> block_order, done_cnt+i);
+                    printf("Free block %lu from %lx:%lx v%lu (new location is %lu)\n",
+                        clean_it->second.location >> block_order,
+                        clean_it->first.inode, clean_it->first.stripe, clean_it->second.version,
+                        done_cnt+i);
 #endif
                    bs->data_alloc->set(clean_it->second.location >> block_order, false);
                }
+                else
+                {
+                    bs->inode_space_stats[entry->oid.inode] += bs->block_size;
+                }
                entries_loaded++;
 #ifdef BLOCKSTORE_DEBUG
                printf("Allocate block (clean entry) %lu: %lx:%lx v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
@ -149,14 +274,6 @@ blockstore_init_journal::blockstore_init_journal(blockstore_impl_t *bs)
    };
 }

-bool iszero(uint64_t *buf, int len)
-{
-    for (int i = 0; i < len; i++)
-        if (buf[i] != 0)
-            return false;
-    return true;
-}
-
 void blockstore_init_journal::handle_event(ring_data_t *data1)
 {
    if (data1->res <= 0)
@ -181,12 +298,6 @@ void blockstore_init_journal::handle_event(ring_data_t *data1)
    submitted_buf = NULL;
 }

-#define GET_SQE() \
-    sqe = bs->get_sqe();\
-    if (!sqe)\
-        throw std::runtime_error("io_uring is full while trying to read journal");\
-    data = ((ring_data_t*)sqe->user_data)
-
 int blockstore_init_journal::loop()
 {
    if (wait_state == 1)
@ -224,7 +335,7 @@ resume_1:
        wait_state = 1;
        return 1;
    }
-    if (iszero((uint64_t*)submitted_buf, 3))
+    if (iszero((uint64_t*)submitted_buf, bs->journal.block_size / sizeof(uint64_t)))
    {
        // Journal is empty
        // FIXME handle this wrapping to journal_block_size better (maybe)
@ -239,6 +350,7 @@ resume_1:
            .size = sizeof(journal_entry_start),
            .reserved = 0,
            .journal_start = bs->journal.block_size,
+            .version = JOURNAL_VERSION,
        };
        ((journal_entry_start*)submitted_buf)->crc32 = je_crc32((journal_entry*)submitted_buf);
        if (bs->readonly)
@ -289,11 +401,21 @@ resume_1:
        je_start = (journal_entry_start*)submitted_buf;
        if (je_start->magic != JOURNAL_MAGIC ||
            je_start->type != JE_START ||
-            je_start->size != sizeof(journal_entry_start) ||
-            je_crc32((journal_entry*)je_start) != je_start->crc32)
+            je_crc32((journal_entry*)je_start) != je_start->crc32 ||
+            (je_start->size != sizeof(journal_entry_start) && je_start->size != JE_START_LEGACY_SIZE))
        {
            // Entry is corrupt
-            throw std::runtime_error("first entry of the journal is corrupt");
+            fprintf(stderr, "First entry of the journal is corrupt\n");
+            exit(1);
+        }
+        if (je_start->size == JE_START_LEGACY_SIZE || je_start->version != JOURNAL_VERSION)
+        {
+            fprintf(
+                stderr, "The code only supports journal version %d, but it is %lu on disk."
+                    " Please use the previous version to flush the journal before upgrading OSD\n",
+                JOURNAL_VERSION, je_start->size == JE_START_LEGACY_SIZE ? 0 : je_start->version
+            );
+            exit(1);
        }
        next_free = journal_pos = bs->journal.used_start = je_start->journal_start;
        if (!bs->journal.inmemory)
@ -399,6 +521,18 @@ resume_1:
            }
        }
    }
+    for (auto ov: double_allocs)
+    {
+        auto dirty_it = bs->dirty_db.find(ov);
+        if (dirty_it != bs->dirty_db.end() &&
+            IS_BIG_WRITE(dirty_it->second.state) &&
+            dirty_it->second.location == UINT64_MAX)
+        {
+            printf("Fatal error (bug): %lx:%lx v%lu big_write journal_entry was allocated over another object\n",
+                dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
+            exit(1);
+        }
+    }
    bs->flusher->mark_trim_possible();
    bs->journal.dirty_start = bs->journal.next_free;
    printf(
@ -530,6 +664,21 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                        .oid = je->small_write.oid,
                        .version = je->small_write.version,
                    };
+                    void *bmp = NULL;
+                    void *bmp_from = (void*)je + sizeof(journal_entry_small_write);
+                    if (bs->clean_entry_bitmap_size <= sizeof(void*))
+                    {
+                        memcpy(&bmp, bmp_from, bs->clean_entry_bitmap_size);
+                    }
+                    else
+                    {
+                        // FIXME Using large blockstore objects will result in a lot of small
+                        // allocations for entry bitmaps. This can only be fixed by using
+                        // a patched map with dynamic entry size, but not the btree_map,
+                        // because it doesn't keep iterators valid all the time.
+                        bmp = malloc_or_die(bs->clean_entry_bitmap_size);
+                        memcpy(bmp, bmp_from, bs->clean_entry_bitmap_size);
+                    }
                    bs->dirty_db.emplace(ov, (dirty_entry){
                        .state = (BS_ST_SMALL_WRITE | BS_ST_SYNCED),
                        .flags = 0,
@ -537,6 +686,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                        .offset = je->small_write.offset,
                        .len = je->small_write.len,
                        .journal_sector = proc_pos,
+                        .bitmap = bmp,
                    });
                    bs->journal.used_sectors[proc_pos]++;
 #ifdef BLOCKSTORE_DEBUG
@ -549,7 +699,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                    unstab = unstab < ov.version ? ov.version : unstab;
                    if (je->type == JE_SMALL_WRITE_INSTANT)
                    {
-                        bs->mark_stable(ov);
+                        bs->mark_stable(ov, true);
                    }
                }
            }
@ -579,32 +729,10 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                        // its data and metadata are already flushed.
                        // We don't know if newer versions are flushed, but
                        // the previous delete definitely is.
-                        // So we flush previous dirty entries, but retain the clean one.
+                        // So we forget previous dirty entries, but retain the clean one.
                        // This feature is required for writes happening shortly
                        // after deletes.
-                        auto dirty_end = dirty_it;
-                        dirty_end++;
-                        while (1)
-                        {
-                            if (dirty_it == bs->dirty_db.begin())
-                            {
-                                break;
-                            }
-                            dirty_it--;
-                            if (dirty_it->first.oid != je->big_write.oid)
-                            {
-                                dirty_it++;
-                                break;
-                            }
-                        }
-                        auto clean_it = bs->clean_db.find(je->big_write.oid);
-                        bs->erase_dirty(
-                            dirty_it, dirty_end,
-                            clean_it != bs->clean_db.end() ? clean_it->second.location : UINT64_MAX
-                        );
-                        // Remove it from the flusher's queue, too
-                        // Otherwise it may end up referring to a small unstable write after reading the rest of the journal
-                        bs->flusher->remove_flush(je->big_write.oid);
+                        erase_dirty_object(dirty_it);
                    }
                }
                auto clean_it = bs->clean_db.find(je->big_write.oid);
@ -616,18 +744,49 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                        .oid = je->big_write.oid,
                        .version = je->big_write.version,
                    };
-                    bs->dirty_db.emplace(ov, (dirty_entry){
+                    void *bmp = NULL;
+                    void *bmp_from = (void*)je + sizeof(journal_entry_big_write);
+                    if (bs->clean_entry_bitmap_size <= sizeof(void*))
+                    {
+                        memcpy(&bmp, bmp_from, bs->clean_entry_bitmap_size);
+                    }
+                    else
+                    {
+                        // FIXME Using large blockstore objects will result in a lot of small
+                        // allocations for entry bitmaps. This can only be fixed by using
+                        // a patched map with dynamic entry size, but not the btree_map,
+                        // because it doesn't keep iterators valid all the time.
+                        bmp = malloc_or_die(bs->clean_entry_bitmap_size);
+                        memcpy(bmp, bmp_from, bs->clean_entry_bitmap_size);
+                    }
+                    auto dirty_it = bs->dirty_db.emplace(ov, (dirty_entry){
                        .state = (BS_ST_BIG_WRITE | BS_ST_SYNCED),
                        .flags = 0,
                        .location = je->big_write.location,
                        .offset = je->big_write.offset,
                        .len = je->big_write.len,
                        .journal_sector = proc_pos,
-                    });
+                        .bitmap = bmp,
+                    }).first;
+                    if (bs->data_alloc->get(je->big_write.location >> bs->block_order))
+                    {
+                        // This is probably a big_write that's already flushed and freed, but it may
+                        // also indicate a bug. So we remember such entries and recheck them afterwards.
+                        // If it's not a bug they won't be present after reading the whole journal.
+                        dirty_it->second.location = UINT64_MAX;
+                        double_allocs.push_back(ov);
+                    }
+                    else
+                    {
 #ifdef BLOCKSTORE_DEBUG
-                    printf("Allocate block %lu\n", je->big_write.location >> bs->block_order);
+                        printf(
+                            "Allocate block (journal) %lu: %lx:%lx v%lu\n",
+                            je->big_write.location >> bs->block_order,
+                            ov.oid.inode, ov.oid.stripe, ov.version
+                        );
 #endif
                        bs->data_alloc->set(je->big_write.location >> bs->block_order, true);
+                    }
                    bs->journal.used_sectors[proc_pos]++;
 #ifdef BLOCKSTORE_DEBUG
                    printf(
@ -639,7 +798,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                    unstab = unstab < ov.version ? ov.version : unstab;
                    if (je->type == JE_BIG_WRITE_INSTANT)
                    {
-                        bs->mark_stable(ov);
+                        bs->mark_stable(ov, true);
                    }
                }
            }
@ -653,7 +812,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                    .oid = je->stable.oid,
                    .version = je->stable.version,
                };
-                bs->mark_stable(ov);
+                bs->mark_stable(ov, true);
            }
            else if (je->type == JE_ROLLBACK)
            {
@ -672,9 +831,26 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
 #ifdef BLOCKSTORE_DEBUG
                printf("je_delete oid=%lx:%lx ver=%lu\n", je->del.oid.inode, je->del.oid.stripe, je->del.version);
 #endif
+                bool dirty_exists = false;
+                auto dirty_it = bs->dirty_db.upper_bound((obj_ver_id){
+                    .oid = je->del.oid,
+                    .version = UINT64_MAX,
+                });
+                if (dirty_it != bs->dirty_db.begin())
+                {
+                    dirty_it--;
+                    dirty_exists = dirty_it->first.oid == je->del.oid;
+                }
                auto clean_it = bs->clean_db.find(je->del.oid);
-                if (clean_it == bs->clean_db.end() ||
-                    clean_it->second.version < je->del.version)
+                bool clean_exists = (clean_it != bs->clean_db.end() &&
+                    clean_it->second.version < je->del.version);
+                if (!clean_exists && dirty_exists)
+                {
+                    // No clean entry older than this delete exists, so the delete is already flushed
+                    // and we must not flush this object anymore.
+                    erase_dirty_object(dirty_it);
+                }
+                else if (clean_exists || dirty_exists)
                {
                    // oid, version
                    obj_ver_id ov = {
@ -692,8 +868,9 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                    bs->journal.used_sectors[proc_pos]++;
                    // Deletions are treated as immediately stable, because
                    // "2-phase commit" (write->stabilize) isn't sufficient for them anyway
-                    bs->mark_stable(ov);
+                    bs->mark_stable(ov, true);
                }
+                // Ignore delete if neither preceding dirty entries nor the clean one are present
            }
            started = true;
            pos += je->size;
@ -704,3 +881,35 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
    bs->journal.next_free = next_free;
    return 1;
 }
+
+void blockstore_init_journal::erase_dirty_object(blockstore_dirty_db_t::iterator dirty_it)
+{
+    auto oid = dirty_it->first.oid;
+    bool exists = !IS_DELETE(dirty_it->second.state);
+    auto dirty_end = dirty_it;
+    dirty_end++;
+    while (1)
+    {
+        if (dirty_it == bs->dirty_db.begin())
+        {
+            break;
+        }
+        dirty_it--;
+        if (dirty_it->first.oid != oid)
+        {
+            dirty_it++;
+            break;
+        }
+    }
+    auto clean_it = bs->clean_db.find(oid);
+    uint64_t clean_loc = clean_it != bs->clean_db.end()
+        ? clean_it->second.location : UINT64_MAX;
+    if (exists && clean_loc == UINT64_MAX)
+    {
+        bs->inode_space_stats[oid.inode] -= bs->block_size;
+    }
+    bs->erase_dirty(dirty_it, dirty_end, clean_loc);
+    // Remove it from the flusher's queue, too
+    // Otherwise it may end up referring to a small unstable write after reading the rest of the journal
+    bs->flusher->remove_flush(oid);
+}
--- a/src/blockstore_init.h
+++ b/src/blockstore_init.h
@ -7,6 +7,7 @@ class blockstore_init_meta
 {
    blockstore_impl_t *bs;
    int wait_state = 0, wait_count = 0;
+    bool zero_on_init = false;
    void *metadata_buffer = NULL;
    uint64_t metadata_read = 0;
    int prev = 0, prev_done = 0, done_len = 0, submitted = 0;
@ -36,6 +37,7 @@ class blockstore_init_journal
    bool started = false;
    uint64_t next_free;
    std::vector<bs_init_journal_done> done;
+    std::vector<obj_ver_id> double_allocs;
    uint64_t journal_pos = 0;
    uint64_t continue_pos = 0;
    void *init_write_buf = NULL;
@ -48,6 +50,7 @@ class blockstore_init_journal
    std::function<void(ring_data_t*)> simple_callback;
    int handle_journal_part(void *buf, uint64_t done_pos, uint64_t len);
    void handle_event(ring_data_t *data);
+    void erase_dirty_object(blockstore_dirty_db_t::iterator dirty_it);
 public:
    blockstore_init_journal(blockstore_impl_t* bs);
    int loop();
--- a/src/blockstore_journal.h
+++ b/src/blockstore_journal.h
@ -7,6 +7,7 @@

 #define MIN_JOURNAL_SIZE 4*1024*1024
 #define JOURNAL_MAGIC 0x4A33
+#define JOURNAL_VERSION 1
 #define JOURNAL_BUFFER_SIZE 4*1024*1024

 // We reserve some extra space for future stabilize requests during writes
@ -37,7 +38,9 @@ struct __attribute__((__packed__)) journal_entry_start
    uint32_t size;
    uint32_t reserved;
    uint64_t journal_start;
+    uint64_t version;
 };
+#define JE_START_LEGACY_SIZE 24

 struct __attribute__((__packed__)) journal_entry_small_write
 {
@ -54,6 +57,9 @@ struct __attribute__((__packed__)) journal_entry_small_write
    // data_offset is its offset within journal
    uint64_t data_offset;
    uint32_t crc32_data;
+    // small_write and big_write entries are followed by the "external" bitmap.
+    // Its size is dynamic and is included in the journal entry's <size> field.
+    uint8_t bitmap[];
 };

 struct __attribute__((__packed__)) journal_entry_big_write
@ -68,6 +74,9 @@ struct __attribute__((__packed__)) journal_entry_big_write
    uint32_t offset;
    uint32_t len;
    uint64_t location;
+    // small_write and big_write entries are followed by the "external" bitmap.
+    // Its size is dynamic and is included in the journal entry's <size> field.
+    uint8_t bitmap[];
 };

 struct __attribute__((__packed__)) journal_entry_stable
@ -143,6 +152,7 @@ struct journal_t
    int fd;
    uint64_t device_size;
    bool inmemory = false;
+    bool flush_journal = false;
    void *buffer = NULL;

    uint64_t block_size;
--- a/src/blockstore_open.cpp
+++ b/src/blockstore_open.cpp
@ -42,6 +42,11 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
    {
        disable_flock = true;
    }
+    if (config["flush_journal"] == "true" || config["flush_journal"] == "1" || config["flush_journal"] == "yes")
+    {
+        // Only flush journal and exit
+        journal.flush_journal = true;
+    }
    if (config["immediate_commit"] == "all")
    {
        immediate_commit = IMMEDIATE_ALL;
@ -69,8 +74,16 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
    journal_block_size = strtoull(config["journal_block_size"].c_str(), NULL, 10);
    meta_block_size = strtoull(config["meta_block_size"].c_str(), NULL, 10);
    bitmap_granularity = strtoull(config["bitmap_granularity"].c_str(), NULL, 10);
-    flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10);
+    max_flusher_count = strtoull(config["max_flusher_count"].c_str(), NULL, 10);
+    if (!max_flusher_count)
+        max_flusher_count = strtoull(config["flusher_count"].c_str(), NULL, 10);
+    min_flusher_count = strtoull(config["min_flusher_count"].c_str(), NULL, 10);
    max_write_iodepth = strtoull(config["max_write_iodepth"].c_str(), NULL, 10);
+    throttle_small_writes = config["throttle_small_writes"] == "true" || config["throttle_small_writes"] == "1" || config["throttle_small_writes"] == "yes";
+    throttle_target_iops = strtoull(config["throttle_target_iops"].c_str(), NULL, 10);
+    throttle_target_mbs = strtoull(config["throttle_target_mbs"].c_str(), NULL, 10);
+    throttle_target_parallelism = strtoull(config["throttle_target_parallelism"].c_str(), NULL, 10);
+    throttle_threshold_us = strtoull(config["throttle_threshold_us"].c_str(), NULL, 10);
    // Validate
    if (!block_size)
    {
@ -80,9 +93,13 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
    {
        throw std::runtime_error("Bad block size");
    }
-    if (!flusher_count)
+    if (!max_flusher_count)
    {
-        flusher_count = 32;
+        max_flusher_count = 256;
+    }
+    if (!min_flusher_count || journal.flush_journal)
+    {
+        min_flusher_count = 1;
    }
    if (!max_write_iodepth)
    {
@ -94,7 +111,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
    }
    else if (disk_alignment % MEM_ALIGNMENT)
    {
-        throw std::runtime_error("disk_alingment must be a multiple of "+std::to_string(MEM_ALIGNMENT));
+        throw std::runtime_error("disk_alignment must be a multiple of "+std::to_string(MEM_ALIGNMENT));
    }
    if (!journal_block_size)
    {
@ -118,7 +135,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
    }
    if (!bitmap_granularity)
    {
-        bitmap_granularity = 4096;
+        bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
    }
    else if (bitmap_granularity % disk_alignment)
    {
@ -168,9 +185,25 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config)
    {
        throw std::runtime_error("immediate_commit=all requires disable_journal_fsync and disable_data_fsync");
    }
+    if (!throttle_target_iops)
+    {
+        throttle_target_iops = 100;
+    }
+    if (!throttle_target_mbs)
+    {
+        throttle_target_mbs = 100;
+    }
+    if (!throttle_target_parallelism)
+    {
+        throttle_target_parallelism = 1;
+    }
+    if (!throttle_threshold_us)
+    {
+        throttle_threshold_us = 50;
+    }
    // init some fields
    clean_entry_bitmap_size = block_size / bitmap_granularity / 8;
-    clean_entry_size = sizeof(clean_disk_entry) + clean_entry_bitmap_size;
+    clean_entry_size = sizeof(clean_disk_entry) + 2*clean_entry_bitmap_size;
    journal.block_size = journal_block_size;
    journal.next_free = journal_block_size;
    journal.used_start = journal_block_size;
@ -224,7 +257,7 @@ void blockstore_impl_t::calc_lengths()
    }
    // required metadata size
    block_count = data_len / block_size;
-    meta_len = ((block_count - 1 + meta_block_size / clean_entry_size) / (meta_block_size / clean_entry_size)) * meta_block_size;
+    meta_len = (1 + (block_count - 1 + meta_block_size / clean_entry_size) / (meta_block_size / clean_entry_size)) * meta_block_size;
    if (meta_area < meta_len)
    {
        throw std::runtime_error("Metadata area is too small, need at least "+std::to_string(meta_len)+" bytes");
@ -237,7 +270,7 @@ void blockstore_impl_t::calc_lengths()
    }
    else if (clean_entry_bitmap_size)
    {
-        clean_bitmap = (uint8_t*)malloc(block_count * clean_entry_bitmap_size);
+        clean_bitmap = (uint8_t*)malloc(block_count * 2*clean_entry_bitmap_size);
        if (!clean_bitmap)
            throw std::runtime_error("Failed to allocate memory for the metadata sparse write bitmap");
    }
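
Review note: the sizing arithmetic changed in parse_config() and calc_lengths() can be checked in isolation. The sketch below only mirrors the formulas from the hunks above: each clean entry now carries two bitmaps, and meta_len gains one extra meta block. `meta_layout_t`, `calc_meta_layout` and the `entry_header_size` parameter are hypothetical names for illustration; the real value of sizeof(clean_disk_entry) is not shown in this part of the diff, so it is passed in as an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the patch: reproduces the arithmetic
// of parse_config() and calc_lengths() after this change.
struct meta_layout_t
{
    uint64_t clean_entry_bitmap_size; // bytes per bitmap
    uint64_t clean_entry_size;        // entry header + 2 bitmaps
    uint64_t entries_per_block;       // clean entries per meta block
    uint64_t meta_len;                // required metadata area, bytes
};

meta_layout_t calc_meta_layout(uint64_t data_len, uint64_t block_size,
    uint64_t bitmap_granularity, uint64_t meta_block_size,
    uint64_t entry_header_size /* assumed stand-in for sizeof(clean_disk_entry) */)
{
    assert(block_size > 0 && bitmap_granularity > 0 && meta_block_size > 0);
    meta_layout_t l;
    // One bit per bitmap_granularity bytes of the object
    l.clean_entry_bitmap_size = block_size / bitmap_granularity / 8;
    // The patch stores two bitmaps per clean entry
    l.clean_entry_size = entry_header_size + 2*l.clean_entry_bitmap_size;
    l.entries_per_block = meta_block_size / l.clean_entry_size;
    assert(l.entries_per_block > 0);
    uint64_t block_count = data_len / block_size;
    // Round up to whole meta blocks, plus the one extra block added by the patch
    l.meta_len = (1 + (block_count - 1 + l.entries_per_block) / l.entries_per_block) * meta_block_size;
    return l;
}
```

For example, with 128 KiB objects, 4 KiB bitmap granularity, 4 KiB meta blocks and an assumed 24-byte entry header, each bitmap takes 4 bytes, an entry takes 32 bytes, and 1 GiB of data needs 65 meta blocks.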
--- a/src/blockstore_read.cpp
+++ b/src/blockstore_read.cpp
@ -94,6 +94,21 @@ endwhile:
    return 1;
 }

+uint8_t* blockstore_impl_t::get_clean_entry_bitmap(uint64_t block_loc, int offset)
+{
+    uint8_t *clean_entry_bitmap;
+    uint64_t meta_loc = block_loc >> block_order;
+    if (inmemory_meta)
+    {
+        uint64_t sector = (meta_loc / (meta_block_size / clean_entry_size)) * meta_block_size;
+        uint64_t pos = (meta_loc % (meta_block_size / clean_entry_size));
+        clean_entry_bitmap = (uint8_t*)(metadata_buffer + sector + pos*clean_entry_size + sizeof(clean_disk_entry) + offset);
+    }
+    else
+        clean_entry_bitmap = (uint8_t*)(clean_bitmap + meta_loc*2*clean_entry_bitmap_size + offset);
+    return clean_entry_bitmap;
+}
+
 int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
 {
    auto clean_it = clean_db.find(read_op->oid);
@ -134,6 +149,11 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
                if (!result_version)
                {
                    result_version = dirty_it->first.version;
+                    if (read_op->bitmap)
+                    {
+                        void *bmp_ptr = (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap);
+                        memcpy(read_op->bitmap, bmp_ptr, clean_entry_bitmap_size);
+                    }
                }
                if (!fulfill_read(read_op, fulfilled, dirty.offset, dirty.offset + dirty.len,
                    dirty.state, dirty_it->first.version, dirty.location + (IS_JOURNAL(dirty.state) ? 0 : dirty.offset)))
@ -155,6 +175,11 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
        if (!result_version)
        {
            result_version = clean_it->second.version;
+            if (read_op->bitmap)
+            {
+                void *bmp_ptr = get_clean_entry_bitmap(clean_it->second.location, clean_entry_bitmap_size);
+                memcpy(read_op->bitmap, bmp_ptr, clean_entry_bitmap_size);
+            }
        }
        if (fulfilled < read_op->len)
        {
@ -169,18 +194,7 @@ int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
            }
            else
            {
-                uint64_t meta_loc = clean_it->second.location >> block_order;
-                uint8_t *clean_entry_bitmap;
-                if (inmemory_meta)
-                {
-                    uint64_t sector = (meta_loc / (meta_block_size / clean_entry_size)) * meta_block_size;
-                    uint64_t pos = (meta_loc % (meta_block_size / clean_entry_size));
-                    clean_entry_bitmap = (uint8_t*)(metadata_buffer + sector + pos*clean_entry_size + sizeof(clean_disk_entry));
-                }
-                else
-                {
-                    clean_entry_bitmap = (uint8_t*)(clean_bitmap + meta_loc*clean_entry_bitmap_size);
-                }
+                uint8_t *clean_entry_bitmap = get_clean_entry_bitmap(clean_it->second.location, 0);
                uint64_t bmp_start = 0, bmp_end = 0, bmp_size = block_size/bitmap_granularity;
                while (bmp_start < bmp_size)
                {
@ -254,3 +268,50 @@ void blockstore_impl_t::handle_read_event(ring_data_t *data, blockstore_op_t *op
        FINISH_OP(op);
    }
 }
+
+int blockstore_impl_t::read_bitmap(object_id oid, uint64_t target_version, void *bitmap, uint64_t *result_version)
+{
+    auto dirty_it = dirty_db.upper_bound((obj_ver_id){
+        .oid = oid,
+        .version = UINT64_MAX,
+    });
+    if (dirty_it != dirty_db.begin())
+        dirty_it--;
+    if (dirty_it != dirty_db.end())
+    {
+        while (dirty_it->first.oid == oid)
+        {
+            if (target_version >= dirty_it->first.version)
+            {
+                if (result_version)
+                    *result_version = dirty_it->first.version;
+                if (bitmap)
+                {
+                    void *bmp_ptr = (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap);
+                    memcpy(bitmap, bmp_ptr, clean_entry_bitmap_size);
+                }
+                return 0;
+            }
+            if (dirty_it == dirty_db.begin())
+                break;
+            dirty_it--;
+        }
+    }
+    auto clean_it = clean_db.find(oid);
+    if (clean_it != clean_db.end())
+    {
+        if (result_version)
+            *result_version = clean_it->second.version;
+        if (bitmap)
+        {
+            void *bmp_ptr = get_clean_entry_bitmap(clean_it->second.location, clean_entry_bitmap_size);
+            memcpy(bitmap, bmp_ptr, clean_entry_bitmap_size);
+        }
+        return 0;
+    }
+    if (result_version)
+        *result_version = 0;
+    if (bitmap)
+        memset(bitmap, 0, clean_entry_bitmap_size);
+    return -ENOENT;
+}
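
Review note: the patch keeps the per-entry bitmap in the `bitmap` pointer field itself when clean_entry_bitmap_size fits into sizeof(void*), and allocates it on the heap otherwise. The same size check then has to be repeated when reading the bitmap (dequeue_read(), read_bitmap()) and when freeing it (erase_dirty()). A minimal standalone sketch of that convention follows; `bitmap_slot_t`, `store_bitmap`, `bitmap_ptr` and `free_bitmap` are hypothetical names, and plain malloc plus abort stands in for the malloc_or_die() used in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for the `bitmap` field of dirty_entry
struct bitmap_slot_t
{
    void *bitmap = nullptr;
};

// Store a bitmap either inline (inside the pointer bytes) or on the heap
void store_bitmap(bitmap_slot_t & slot, const void *src, std::size_t bitmap_size)
{
    if (bitmap_size <= sizeof(void*))
    {
        // Small bitmap: the pointer's own storage holds the bytes
        slot.bitmap = nullptr;
        std::memcpy(&slot.bitmap, src, bitmap_size);
    }
    else
    {
        // Large bitmap: separate allocation (the patch uses malloc_or_die())
        slot.bitmap = std::malloc(bitmap_size);
        if (!slot.bitmap)
            std::abort();
        std::memcpy(slot.bitmap, src, bitmap_size);
    }
}

// Return the address of the bitmap bytes, mirroring the ternary used in the patch
const void *bitmap_ptr(const bitmap_slot_t & slot, std::size_t bitmap_size)
{
    return bitmap_size > sizeof(void*) ? slot.bitmap : (const void*)&slot.bitmap;
}

// Release the bitmap; only heap-allocated bitmaps are freed
void free_bitmap(bitmap_slot_t & slot, std::size_t bitmap_size)
{
    if (bitmap_size > sizeof(void*))
    {
        std::free(slot.bitmap);
    }
    // Unlike erase_dirty(), this sketch also clears inline bitmaps
    slot.bitmap = nullptr;
}
```

Every reader has to apply the same `bitmap_size > sizeof(void*)` test, because calling free() on an inline bitmap would treat bitmap bytes as a pointer.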
--- a/src/blockstore_rollback.cpp
+++ b/src/blockstore_rollback.cpp
@ -163,10 +163,7 @@ void blockstore_impl_t::mark_rolled_back(const obj_ver_id & ov)
        auto rm_start = it;
        auto rm_end = it;
        it--;
-        while (it->first.oid == ov.oid &&
-            it->first.version > ov.version &&
-            !IS_IN_FLIGHT(it->second.state) &&
-            !IS_STABLE(it->second.state))
+        while (1)
        {
            if (it->first.oid != ov.oid)
                break;
@ -176,7 +173,7 @@ void blockstore_impl_t::mark_rolled_back(const obj_ver_id & ov)
                    max_unstable = it->first.version;
                break;
            }
-            else if (IS_STABLE(it->second.state))
+            else if (IS_IN_FLIGHT(it->second.state) || IS_STABLE(it->second.state))
                break;
            // Remove entry
            rm_start = it;
@ -187,7 +184,6 @@ void blockstore_impl_t::mark_rolled_back(const obj_ver_id & ov)
        if (rm_start != rm_end)
        {
            erase_dirty(rm_start, rm_end, UINT64_MAX);
-        }
            auto unstab_it = unstable_writes.find(ov.oid);
            if (unstab_it != unstable_writes.end())
            {
@ -197,6 +193,7 @@ void blockstore_impl_t::mark_rolled_back(const obj_ver_id & ov)
                    unstab_it->second = max_unstable;
            }
        }
+    }
 }

 void blockstore_impl_t::handle_rollback_event(ring_data_t *data, blockstore_op_t *op)
@ -251,10 +248,12 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
    }
    while (1)
    {
-        if (IS_BIG_WRITE(dirty_it->second.state) && dirty_it->second.location != clean_loc)
+        if (IS_BIG_WRITE(dirty_it->second.state) && dirty_it->second.location != clean_loc &&
+            dirty_it->second.location != UINT64_MAX)
        {
 #ifdef BLOCKSTORE_DEBUG
-            printf("Free block %lu\n", dirty_it->second.location >> block_order);
+            printf("Free block %lu from %lx:%lx v%lu\n", dirty_it->second.location >> block_order,
+                dirty_it->first.oid.inode, dirty_it->first.oid.stripe, dirty_it->first.version);
 #endif
            data_alloc->set(dirty_it->second.location >> block_order, false);
        }
@ -269,6 +268,11 @@ void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start,
        {
            journal.used_sectors.erase(dirty_it->second.journal_sector);
        }
+        if (clean_entry_bitmap_size > sizeof(void*))
+        {
+            free(dirty_it->second.bitmap);
+            dirty_it->second.bitmap = NULL;
+        }
        if (dirty_it == dirty_start)
        {
            break;
--- a/src/blockstore_stable.cpp
+++ b/src/blockstore_stable.cpp
@ -168,6 +168,9 @@ resume_5:
    for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
    {
        // Mark all dirty_db entries up to op->version as stable
+#ifdef BLOCKSTORE_DEBUG
+        printf("Stabilize %lx:%lx v%lu\n", v->oid.inode, v->oid.stripe, v->version);
+#endif
        mark_stable(*v);
    }
    // Acknowledge op
@ -176,22 +179,66 @@ resume_5:
    return 2;
 }

-void blockstore_impl_t::mark_stable(const obj_ver_id & v)
+void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
 {
    auto dirty_it = dirty_db.find(v);
    if (dirty_it != dirty_db.end())
    {
        while (1)
        {
+            bool was_stable = IS_STABLE(dirty_it->second.state);
            if ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_SYNCED)
            {
                dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_STABLE;
-            }
-            else if (IS_STABLE(dirty_it->second.state))
+                // Allocations and deletions are counted when they're stabilized
+                if (IS_BIG_WRITE(dirty_it->second.state))
                {
+                    int exists = -1;
+                    if (dirty_it != dirty_db.begin())
+                    {
+                        auto prev_it = dirty_it;
+                        prev_it--;
+                        if (prev_it->first.oid == v.oid)
+                        {
+                            exists = IS_DELETE(prev_it->second.state) ? 0 : 1;
+                        }
+                    }
+                    if (exists == -1)
+                    {
+                        auto clean_it = clean_db.find(v.oid);
+                        exists = clean_it != clean_db.end() ? 1 : 0;
+                    }
+                    if (!exists)
+                    {
+                        inode_space_stats[dirty_it->first.oid.inode] += block_size;
+                    }
+                }
+                else if (IS_DELETE(dirty_it->second.state))
+                {
+                    inode_space_stats[dirty_it->first.oid.inode] -= block_size;
+                }
+            }
+            if (forget_dirty && (IS_BIG_WRITE(dirty_it->second.state) ||
+                IS_DELETE(dirty_it->second.state)))
+            {
+                // A big write or a delete overrides all previous dirty entries
+                auto erase_end = dirty_it;
+                while (dirty_it != dirty_db.begin())
+                {
+                    dirty_it--;
+                    if (dirty_it->first.oid != v.oid)
+                    {
+                        dirty_it++;
                        break;
                    }
-            if (dirty_it == dirty_db.begin())
+                }
+                auto clean_it = clean_db.find(v.oid);
+                uint64_t clean_loc = clean_it != clean_db.end()
+                    ? clean_it->second.location : UINT64_MAX;
+                erase_dirty(dirty_it, erase_end, clean_loc);
+                break;
+            }
+            if (was_stable || dirty_it == dirty_db.begin())
            {
                break;
            }
--- a/src/blockstore_sync.cpp
+++ b/src/blockstore_sync.cpp
@ -24,6 +24,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
    if (PRIV(op)->op_state == 0)
    {
        stop_sync_submitted = false;
+        unsynced_big_write_count -= unsynced_big_writes.size();
        PRIV(op)->sync_big_writes.swap(unsynced_big_writes);
        PRIV(op)->sync_small_writes.swap(unsynced_small_writes);
        PRIV(op)->sync_small_checked = 0;
@ -79,7 +80,8 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
        // 2nd step: Data device is synced, prepare & write journal entries
        // Check space in the journal and journal memory buffers
        blockstore_journal_check_t space_check(this);
-        if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(), sizeof(journal_entry_big_write), JOURNAL_STABILIZE_RESERVATION))
+        if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(),
+            sizeof(journal_entry_big_write) + clean_entry_bitmap_size, JOURNAL_STABILIZE_RESERVATION))
        {
            return 0;
        }
@ -94,7 +96,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
        int s = 0, cur_sector = -1;
        while (it != PRIV(op)->sync_big_writes.end())
        {
-            if (!journal.entry_fits(sizeof(journal_entry_big_write)) &&
+            if (!journal.entry_fits(sizeof(journal_entry_big_write) + clean_entry_bitmap_size) &&
                journal.sector_info[journal.cur_sector].dirty)
            {
                if (cur_sector == -1)
@ -102,24 +104,27 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
                prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], [this, op](ring_data_t *data) { handle_sync_event(data, op); });
                cur_sector = journal.cur_sector;
            }
+            auto & dirty_entry = dirty_db.at(*it);
            journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
-                journal, (dirty_db[*it].state & BS_ST_INSTANT) ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
-                sizeof(journal_entry_big_write)
+                journal, (dirty_entry.state & BS_ST_INSTANT) ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
+                sizeof(journal_entry_big_write) + clean_entry_bitmap_size
            );
-            dirty_db[*it].journal_sector = journal.sector_info[journal.cur_sector].offset;
+            dirty_entry.journal_sector = journal.sector_info[journal.cur_sector].offset;
            journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
 #ifdef BLOCKSTORE_DEBUG
            printf(
                "journal offset %08lx is used by %lx:%lx v%lu (%lu refs)\n",
-                dirty_db[*it].journal_sector, it->oid.inode, it->oid.stripe, it->version,
+                dirty_entry.journal_sector, it->oid.inode, it->oid.stripe, it->version,
                journal.used_sectors[journal.sector_info[journal.cur_sector].offset]
            );
 #endif
            je->oid = it->oid;
            je->version = it->version;
-            je->offset = dirty_db[*it].offset;
-            je->len = dirty_db[*it].len;
-            je->location = dirty_db[*it].location;
+            je->offset = dirty_entry.offset;
+            je->len = dirty_entry.len;
+            je->location = dirty_entry.location;
+            memcpy((void*)(je+1), (clean_entry_bitmap_size > sizeof(void*)
+                ? dirty_entry.bitmap : &dirty_entry.bitmap), clean_entry_bitmap_size);
            je->crc32 = je_crc32((journal_entry*)je);
            journal.crc32_last = je->crc32;
            it++;
@ -141,6 +146,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
            my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
            data->iov = { 0 };
            data->callback = [this, op](ring_data_t *data) { handle_sync_event(data, op); };
+            PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
            PRIV(op)->pending_ops = 1;
            PRIV(op)->op_state = SYNC_JOURNAL_SYNC_SENT;
            return 1;
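
The expression `clean_entry_bitmap_size > sizeof(void*) ? dirty_entry.bitmap : &dirty_entry.bitmap`, which recurs throughout this diff, suggests that the `bitmap` pointer field doubles as inline storage whenever the bitmap fits into a pointer. Below is a minimal sketch of that idea. The names `entry_t`, `bitmap_ptr`, `bitmap_init`, `bitmap_free` and `bitmap_roundtrip` are hypothetical and are not the actual Vitastor structures or helpers.

```cpp
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-in for a dirty entry; only the bitmap field matters here.
struct entry_t
{
    void *bitmap = nullptr;
};

// Returns the address where the bitmap bytes live:
// the heap buffer when the bitmap is larger than a pointer,
// otherwise the bytes of the pointer field itself.
inline void *bitmap_ptr(entry_t & e, size_t bitmap_size)
{
    return bitmap_size > sizeof(void*) ? e.bitmap : (void*)&e.bitmap;
}

// Allocates storage only when the bitmap does not fit into the pointer field.
inline void bitmap_init(entry_t & e, size_t bitmap_size)
{
    e.bitmap = nullptr;
    if (bitmap_size > sizeof(void*))
    {
        e.bitmap = calloc(1, bitmap_size);
        if (!e.bitmap)
            abort();
    }
}

// Frees the heap buffer if there is one; inline bitmaps need no cleanup.
inline void bitmap_free(entry_t & e, size_t bitmap_size)
{
    if (bitmap_size > sizeof(void*))
        free(e.bitmap);
    e.bitmap = nullptr;
}

// Round trip: fill the bitmap with a pattern and read the first byte back.
inline uint8_t bitmap_roundtrip(size_t bitmap_size, uint8_t pattern)
{
    entry_t e;
    bitmap_init(e, bitmap_size);
    memset(bitmap_ptr(e, bitmap_size), pattern, bitmap_size);
    uint8_t first = *(uint8_t*)bitmap_ptr(e, bitmap_size);
    bitmap_free(e, bitmap_size);
    return first;
}
```

Wrapping the ternary in one accessor keeps the size check in a single place, so call sites such as the `memcpy` into the journal entry cannot pick the wrong branch by accident.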
--- a/src/blockstore_write.cpp
+++ b/src/blockstore_write.cpp
@ -8,7 +8,12 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
    // Check or assign version number
    bool found = false, deleted = false, is_del = (op->opcode == BS_OP_DELETE);
    bool wait_big = false, wait_del = false;
+    void *bmp = NULL;
    uint64_t version = 1;
+    if (!is_del && clean_entry_bitmap_size > sizeof(void*))
+    {
+        bmp = calloc_or_die(1, clean_entry_bitmap_size);
+    }
    if (dirty_db.size() > 0)
    {
        auto dirty_it = dirty_db.upper_bound((obj_ver_id){
@ -25,6 +30,13 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
            wait_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE
                ? !IS_SYNCED(dirty_it->second.state)
                : ((dirty_it->second.state & BS_ST_WORKFLOW_MASK) == BS_ST_WAIT_BIG);
+            if (!is_del && !deleted)
+            {
+                if (clean_entry_bitmap_size > sizeof(void*))
+                    memcpy(bmp, dirty_it->second.bitmap, clean_entry_bitmap_size);
+                else
+                    bmp = dirty_it->second.bitmap;
+            }
        }
    }
    if (!found)
@ -33,6 +45,11 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
        if (clean_it != clean_db.end())
        {
            version = clean_it->second.version + 1;
+            if (!is_del)
+            {
+                void *bmp_ptr = get_clean_entry_bitmap(clean_it->second.location, clean_entry_bitmap_size);
+                memcpy((clean_entry_bitmap_size > sizeof(void*) ? bmp : &bmp), bmp_ptr, clean_entry_bitmap_size);
+            }
        }
        else
        {
@ -72,6 +89,10 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
        {
            // Invalid version requested
            op->retval = -EEXIST;
+            if (!is_del && clean_entry_bitmap_size > sizeof(void*))
+            {
+                free(bmp);
+            }
            return false;
        }
    }
@ -101,6 +122,8 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
    else
    {
        state = (op->len == block_size || deleted ? BS_ST_BIG_WRITE : BS_ST_SMALL_WRITE);
+        if (state == BS_ST_SMALL_WRITE && throttle_small_writes)
+            clock_gettime(CLOCK_REALTIME, &PRIV(op)->tv_begin);
        if (wait_del)
            state |= BS_ST_WAIT_DEL;
        else if (state == BS_ST_SMALL_WRITE && wait_big)
@ -109,6 +132,28 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
            state |= BS_ST_IN_FLIGHT;
        if (op->opcode == BS_OP_WRITE_STABLE)
            state |= BS_ST_INSTANT;
+        if (op->bitmap)
+        {
+            // Only allow overwriting the part of the object bitmap that corresponds to the write's offset/len
+            uint8_t *bmp_ptr = (uint8_t*)(clean_entry_bitmap_size > sizeof(void*) ? bmp : &bmp);
+            uint32_t bit = op->offset/bitmap_granularity;
+            uint32_t bits_left = op->len/bitmap_granularity;
+            while (!(bit % 8) && bits_left > 8)
+            {
+                // Copy bytes
+                bmp_ptr[bit/8] = ((uint8_t*)op->bitmap)[bit/8];
+                bit += 8;
+                bits_left -= 8;
+            }
+            while (bits_left > 0)
+            {
+                // Copy bits
+                bmp_ptr[bit/8] = (bmp_ptr[bit/8] & ~(1 << (bit%8)))
+                    | (((uint8_t*)op->bitmap)[bit/8] & (1 << bit%8));
+                bit++;
+                bits_left--;
+            }
+        }
    }
    dirty_db.emplace((obj_ver_id){
        .oid = op->oid,
@ -120,6 +165,7 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
        .offset = is_del ? 0 : op->offset,
        .len = is_del ? 0 : op->len,
        .journal_sector = 0,
+        .bitmap = bmp,
    });
    return true;
 }
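
The `op->bitmap` handling in `enqueue_write` copies only the bits that cover the written range and leaves the rest of the object bitmap untouched. Below is a standalone sketch of the same merge. `merge_bitmap_range` is a hypothetical helper name, and the sketch takes bit positions directly instead of deriving them from `offset/bitmap_granularity`.

```cpp
#include <assert.h>
#include <stdint.h>

// Copies bits [bit, bit+count) from src into dst and leaves all other bits of dst intact.
// Both bitmaps use the same layout: bit N lives in byte N/8 at position N%8.
inline void merge_bitmap_range(uint8_t *dst, const uint8_t *src, uint32_t bit, uint32_t count)
{
    // Bit-by-bit until the position is byte-aligned
    while (count > 0 && (bit % 8))
    {
        uint8_t mask = (uint8_t)(1 << (bit % 8));
        dst[bit/8] = (uint8_t)((dst[bit/8] & ~mask) | (src[bit/8] & mask));
        bit++;
        count--;
    }
    // Whole bytes while at least 8 bits remain
    while (count >= 8)
    {
        dst[bit/8] = src[bit/8];
        bit += 8;
        count -= 8;
    }
    // Remaining tail bits
    while (count > 0)
    {
        uint8_t mask = (uint8_t)(1 << (bit % 8));
        dst[bit/8] = (uint8_t)((dst[bit/8] & ~mask) | (src[bit/8] & mask));
        bit++;
        count--;
    }
}
```

Unlike the loop in the diff, this variant aligns to a byte boundary first, so an unaligned range with a long aligned middle still takes the byte-copy path. The resulting bitmap is the same either way.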
@ -128,6 +174,8 @@ void blockstore_impl_t::cancel_all_writes(blockstore_op_t *op, blockstore_dirty_
 {
    while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
    {
+        if (clean_entry_bitmap_size > sizeof(void*))
+            free(dirty_it->second.bitmap);
        dirty_db.erase(dirty_it++);
    }
    bool found = false;
@ -201,7 +249,8 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
    if ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
    {
        blockstore_journal_check_t space_check(this);
-        if (!space_check.check_available(op, unsynced_big_writes.size() + 1, sizeof(journal_entry_big_write), JOURNAL_STABILIZE_RESERVATION))
+        if (!space_check.check_available(op, unsynced_big_write_count + 1,
+            sizeof(journal_entry_big_write) + clean_entry_bitmap_size, JOURNAL_STABILIZE_RESERVATION))
        {
            return 0;
        }
@ -224,7 +273,10 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        dirty_it->second.location = loc << block_order;
        dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED;
 #ifdef BLOCKSTORE_DEBUG
-        printf("Allocate block %lu\n", loc);
+        printf(
+            "Allocate block %lu for %lx:%lx v%lu\n",
+            loc, op->oid.inode, op->oid.stripe, op->version
+        );
 #endif
        data_alloc->set(loc, true);
        uint64_t stripe_offset = (op->offset % bitmap_granularity);
@ -250,11 +302,8 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
        if (immediate_commit != IMMEDIATE_ALL)
        {
-            // Remember big write as unsynced
-            unsynced_big_writes.push_back((obj_ver_id){
-                .oid = op->oid,
-                .version = op->version,
-            });
+            // Increase the counter, but don't add the write to unsynced_big_writes yet (it can't be synced until it's finished)
+            unsynced_big_write_count++;
            PRIV(op)->op_state = 3;
        }
        else
@ -267,8 +316,11 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        // Small (journaled) write
        // First check if the journal has sufficient space
        blockstore_journal_check_t space_check(this);
-        if (unsynced_big_writes.size() && !space_check.check_available(op, unsynced_big_writes.size(), sizeof(journal_entry_big_write), 0)
-            || !space_check.check_available(op, 1, sizeof(journal_entry_small_write), op->len + JOURNAL_STABILIZE_RESERVATION))
+        if (unsynced_big_write_count &&
+            !space_check.check_available(op, unsynced_big_write_count,
+                sizeof(journal_entry_big_write) + clean_entry_bitmap_size, 0)
+            || !space_check.check_available(op, 1,
+                sizeof(journal_entry_small_write) + clean_entry_bitmap_size, op->len + JOURNAL_STABILIZE_RESERVATION))
        {
            return 0;
        }
@ -276,8 +328,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        // There is sufficient space. Get SQE(s)
        struct io_uring_sqe *sqe1 = NULL;
        if (immediate_commit != IMMEDIATE_NONE ||
-            (journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_small_write) &&
-            journal.sector_info[journal.cur_sector].dirty)
+            !journal.entry_fits(sizeof(journal_entry_small_write) + clean_entry_bitmap_size))
        {
            // Write current journal sector only if it's dirty and full, or in the immediate_commit mode
            BS_SUBMIT_GET_SQE_DECL(sqe1);
@ -305,7 +356,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        // Then pre-fill journal entry
        journal_entry_small_write *je = (journal_entry_small_write*)prefill_single_journal_entry(
            journal, op->opcode == BS_OP_WRITE_STABLE ? JE_SMALL_WRITE_INSTANT : JE_SMALL_WRITE,
-            sizeof(journal_entry_small_write)
+            sizeof(journal_entry_small_write) + clean_entry_bitmap_size
        );
        dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
        journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
@ -324,6 +375,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        je->len = op->len;
        je->data_offset = journal.next_free;
        je->crc32_data = crc32c(0, op->buf, op->len);
+        memcpy((void*)(je+1), (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap), clean_entry_bitmap_size);
        je->crc32 = je_crc32((journal_entry*)je);
        journal.crc32_last = je->crc32;
        if (immediate_commit != IMMEDIATE_NONE)
@ -359,14 +411,6 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        {
            journal.next_free = journal_block_size;
        }
-        if (immediate_commit == IMMEDIATE_NONE)
-        {
-            // Remember small write as unsynced
-            unsynced_small_writes.push_back((obj_ver_id){
-                .oid = op->oid,
-                .version = op->version,
-            });
-        }
        if (!PRIV(op)->pending_ops)
        {
            PRIV(op)->op_state = 4;
@ -382,29 +426,31 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)

 int blockstore_impl_t::continue_write(blockstore_op_t *op)
 {
-    io_uring_sqe *sqe = NULL;
-    journal_entry_big_write *je;
    int op_state = PRIV(op)->op_state;
-    if (op_state != 2 && op_state != 4)
+    if (op_state == 2)
+        goto resume_2;
+    else if (op_state == 4)
+        goto resume_4;
+    else if (op_state == 6)
+        goto resume_6;
+    else
    {
        // In progress
        return 1;
    }
+resume_2:
+    // Only for the immediate_commit mode: prepare and submit big_write journal entry
+    {
        auto dirty_it = dirty_db.find((obj_ver_id){
            .oid = op->oid,
            .version = op->version,
        });
        assert(dirty_it != dirty_db.end());
-    if (op_state == 2)
-        goto resume_2;
-    else if (op_state == 4)
-        goto resume_4;
-resume_2:
-    // Only for the immediate_commit mode: prepare and submit big_write journal entry
+        io_uring_sqe *sqe = NULL;
        BS_SUBMIT_GET_SQE_DECL(sqe);
-    je = (journal_entry_big_write*)prefill_single_journal_entry(
+        journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
            journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
-        sizeof(journal_entry_big_write)
+            sizeof(journal_entry_big_write) + clean_entry_bitmap_size
        );
        dirty_it->second.journal_sector = journal.sector_info[journal.cur_sector].offset;
        journal.used_sectors[journal.sector_info[journal.cur_sector].offset]++;
@ -420,6 +466,7 @@ resume_2:
        je->offset = op->offset;
        je->len = op->len;
        je->location = dirty_it->second.location;
+        memcpy((void*)(je+1), (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap), clean_entry_bitmap_size);
        je->crc32 = je_crc32((journal_entry*)je);
        journal.crc32_last = je->crc32;
        prepare_journal_sector_write(journal, journal.cur_sector, sqe,
@ -428,14 +475,20 @@ resume_2:
        PRIV(op)->pending_ops = 1;
        PRIV(op)->op_state = 3;
        return 1;
+    }
 resume_4:
    // Switch object state
 #ifdef BLOCKSTORE_DEBUG
-    printf("Ack write %lx:%lx v%lu = state %x\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
+    printf("Ack write %lx:%lx v%lu = state 0x%x\n", op->oid.inode, op->oid.stripe, op->version, dirty_it->second.state);
 #endif
-    bool imm = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE
-        ? (immediate_commit == IMMEDIATE_ALL)
-        : (immediate_commit != IMMEDIATE_NONE);
+    {
+        auto dirty_it = dirty_db.find((obj_ver_id){
+            .oid = op->oid,
+            .version = op->version,
+        });
+        assert(dirty_it != dirty_db.end());
+        bool is_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE;
+        bool imm = is_big ? (immediate_commit == IMMEDIATE_ALL) : (immediate_commit != IMMEDIATE_NONE);
        if (imm)
        {
            auto & unstab = unstable_writes[op->oid];
@ -445,11 +498,31 @@ resume_4:
            | (imm ? BS_ST_SYNCED : BS_ST_WRITTEN);
        if (imm && ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_DELETE || (dirty_it->second.state & BS_ST_INSTANT)))
        {
-        // Deletions are treated as immediately stable
+            // Deletions and 'instant' operations are treated as immediately stable
            mark_stable(dirty_it->first);
        }
-    if (immediate_commit == IMMEDIATE_ALL)
+        if (!imm)
        {
+            if (is_big)
+            {
+                // Remember big write as unsynced
+                unsynced_big_writes.push_back((obj_ver_id){
+                    .oid = op->oid,
+                    .version = op->version,
+                });
+            }
+            else
+            {
+                // Remember small write as unsynced
+                unsynced_small_writes.push_back((obj_ver_id){
+                    .oid = op->oid,
+                    .version = op->version,
+                });
+            }
+        }
+        if (imm && (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE)
+        {
+            // Unblock small writes
            dirty_it++;
            while (dirty_it != dirty_db.end() && dirty_it->first.oid == op->oid)
            {
@ -460,6 +533,41 @@ resume_4:
                dirty_it++;
            }
        }
+        // Throttle small writes so that the journal isn't filled too fast in the SSD+HDD case
+        if (!is_big && throttle_small_writes)
+        {
+            // Measure how long the write has taken so far
+            timespec tv_end;
+            clock_gettime(CLOCK_REALTIME, &tv_end);
+            uint64_t exec_us =
+                (tv_end.tv_sec - PRIV(op)->tv_begin.tv_sec)*1000000 +
+                (tv_end.tv_nsec - PRIV(op)->tv_begin.tv_nsec)/1000;
+            // Compare with target execution time
+            // 100% free -> target time = 0
+            // 0% free -> target time = max(1, iodepth/parallelism) * (1s/target_iops + size/target_bandwidth)
+            uint64_t used_start = journal.get_trim_pos();
+            uint64_t journal_free_space = journal.next_free < used_start
+                ? (used_start - journal.next_free)
+                : (journal.len - journal.next_free + used_start - journal.block_size);
+            uint64_t ref_us =
+                (write_iodepth <= throttle_target_parallelism ? 100 : 100*write_iodepth/throttle_target_parallelism)
+                * (1000000/throttle_target_iops + op->len*1000000/throttle_target_mbs/1024/1024)
+                / 100;
+            ref_us -= ref_us * journal_free_space / journal.len;
+            if (ref_us > exec_us + throttle_threshold_us)
+            {
+                // Pause reply
+                tfd->set_timer_us(ref_us-exec_us, false, [this, op](int timer_id)
+                {
+                    PRIV(op)->op_state++;
+                    ringloop->wakeup();
+                });
+                PRIV(op)->op_state = 5;
+                return 1;
+            }
+        }
+    }
+resume_6:
    // Acknowledge write
    op->retval = op->len;
    write_iodepth--;
@ -583,14 +691,6 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
        PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
        PRIV(op)->pending_ops++;
    }
-    else
-    {
-        // Remember delete as unsynced
-        unsynced_small_writes.push_back((obj_ver_id){
-            .oid = op->oid,
-            .version = op->version,
-        });
-    }
    if (!PRIV(op)->pending_ops)
    {
        PRIV(op)->op_state = 4;
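
The small-write throttling added in `continue_write` boils down to one integer formula: a reference execution time that grows with queue depth and write size, shrinks linearly with free journal space, and is then compared with the time actually spent. The sketch below isolates that arithmetic. `throttle_params_t` and `throttle_delay_us` are hypothetical names introduced for illustration, and the free journal space is passed in rather than computed from the journal positions.

```cpp
#include <assert.h>
#include <stdint.h>

// Hypothetical bundle of the tunables referenced in the diff.
struct throttle_params_t
{
    uint64_t target_iops;        // throttle_target_iops
    uint64_t target_mbs;         // throttle_target_mbs, in MiB/s
    uint64_t target_parallelism; // throttle_target_parallelism
    uint64_t threshold_us;       // throttle_threshold_us
};

// Returns how long (in microseconds) the reply should be delayed, or 0 for no delay.
inline uint64_t throttle_delay_us(const throttle_params_t & p, uint64_t write_iodepth,
    uint64_t len, uint64_t journal_free, uint64_t journal_len, uint64_t exec_us)
{
    // Reference time for a completely full journal, scaled up (in percent)
    // when more writes are in flight than the target parallelism
    uint64_t ref_us =
        (write_iodepth <= p.target_parallelism ? 100 : 100*write_iodepth/p.target_parallelism)
        * (1000000/p.target_iops + len*1000000/p.target_mbs/1024/1024)
        / 100;
    // Scale down linearly with free journal space: a fully free journal gives 0
    ref_us -= ref_us * journal_free / journal_len;
    // Delay only when the gap exceeds the threshold
    return ref_us > exec_us + p.threshold_us ? ref_us - exec_us : 0;
}
```

For example, with 100 target IOPS, 100 MiB/s target bandwidth and a 128 KiB write, the full-journal reference time is 10000 + 1250 = 11250 microseconds. With half of the journal free it drops to 5625 microseconds.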
--- a/src/cluster_client.cpp
+++ b/src/cluster_client.cpp
--- a/src/cluster_client.h
+++ b/src/cluster_client.h
@ -8,9 +8,8 @@

 #define MIN_BLOCK_SIZE 4*1024
 #define MAX_BLOCK_SIZE 128*1024*1024
-#define DEFAULT_DISK_ALIGNMENT 4096
-#define DEFAULT_BITMAP_GRANULARITY 4096
-#define DEFAULT_CLIENT_DIRTY_LIMIT 32*1024*1024
+#define DEFAULT_CLIENT_MAX_DIRTY_BYTES 32*1024*1024
+#define DEFAULT_CLIENT_MAX_DIRTY_OPS 1024

 struct cluster_op_t;

@ -22,8 +21,7 @@ struct cluster_op_part_t
    pg_num_t pg_num;
    osd_num_t osd_num;
    osd_op_buf_list_t iov;
-    bool sent;
-    bool done;
+    unsigned flags;
    osd_op_t op;
 };

@ -36,48 +34,61 @@ struct cluster_op_t
    int retval;
    osd_op_buf_list_t iov;
    std::function<void(cluster_op_t*)> callback;
+    ~cluster_op_t();
 protected:
+    int flags = 0;
+    int state = 0;
+    uint64_t cur_inode; // for snapshot reads
    void *buf = NULL;
    cluster_op_t *orig_op = NULL;
-    bool is_internal = false;
    bool needs_reslice = false;
    bool up_wait = false;
-    int sent_count = 0, done_count = 0;
+    int inflight_count = 0, done_count = 0;
    std::vector<cluster_op_part_t> parts;
+    void *bitmap_buf = NULL, *part_bitmaps = NULL;
+    unsigned bitmap_buf_size = 0;
    friend class cluster_client_t;
 };

+struct cluster_buffer_t
+{
+    void *buf;
+    uint64_t len;
+    int state;
+};
+
+// FIXME: Split into public and private interfaces
 class cluster_client_t
 {
    timerfd_manager_t *tfd;
    ring_loop_t *ringloop;

    uint64_t bs_block_size = 0;
-    uint64_t bs_disk_alignment = 0;
-    uint64_t bs_bitmap_granularity = 0;
+    uint32_t bs_bitmap_granularity = 0, bs_bitmap_size = 0;
    std::map<pool_id_t, uint64_t> pg_counts;
    bool immediate_commit = false;
    // FIXME: Implement inmemory_commit mode. Note that it requires to return overlapping reads from memory.
-    uint64_t client_dirty_limit = 0;
+    uint64_t client_max_dirty_bytes = 0;
+    uint64_t client_max_dirty_ops = 0;
    int log_level;
    int up_wait_retry_interval = 500; // ms

-    uint64_t op_id = 1;
-    ring_consumer_t consumer;
-    // operations currently in progress
-    std::set<cluster_op_t*> cur_ops;
    int retry_timeout_id = 0;
-    // unsynced operations are copied in memory to allow replay when cluster isn't in the immediate_commit mode
-    // unsynced_writes are replayed in any order (because only the SYNC operation guarantees ordering)
-    std::vector<cluster_op_t*> unsynced_writes;
-    std::vector<cluster_op_t*> syncing_writes;
-    cluster_op_t* cur_sync = NULL;
-    std::vector<cluster_op_t*> next_writes;
+    uint64_t op_id = 1;
    std::vector<cluster_op_t*> offline_ops;
-    uint64_t queued_bytes = 0;
+    std::vector<cluster_op_t*> op_queue;
+    std::map<object_id, cluster_buffer_t> dirty_buffers;
+    std::set<osd_num_t> dirty_osds;
+    uint64_t dirty_bytes = 0, dirty_ops = 0;
+
+    void *scrap_buffer = NULL;
+    unsigned scrap_buffer_size = 0;

    bool pgs_loaded = false;
+    ring_consumer_t consumer;
    std::vector<std::function<void(void)>> on_ready_hooks;
+    int continuing_ops = 0;
+    int op_queue_pos = 0;

 public:
    etcd_state_client_t st_cli;
@ -87,21 +98,23 @@ public:
    cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config);
    ~cluster_client_t();
    void execute(cluster_op_t *op);
+    bool is_ready();
    void on_ready(std::function<void(void)> fn);
-    void stop();

-protected:
+    static void copy_write(cluster_op_t *op, std::map<object_id, cluster_buffer_t> & dirty_buffers);
    void continue_ops(bool up_retry = false);
+protected:
+    bool affects_osd(uint64_t inode, uint64_t offset, uint64_t len, osd_num_t osd);
+    void flush_buffer(const object_id & oid, cluster_buffer_t *wr);
    void on_load_config_hook(json11::Json::object & config);
    void on_load_pgs_hook(bool success);
-    void on_change_hook(json11::Json::object & changes);
+    void on_change_hook(std::map<std::string, etcd_kv_t> & changes);
    void on_change_osd_state_hook(uint64_t peer_osd);
-    void continue_rw(cluster_op_t *op);
+    int continue_rw(cluster_op_t *op);
    void slice_rw(cluster_op_t *op);
-    bool try_send(cluster_op_t *op, cluster_op_part_t *part);
-    void execute_sync(cluster_op_t *op);
-    void continue_sync();
-    void finish_sync();
+    bool try_send(cluster_op_t *op, int i);
+    int continue_sync(cluster_op_t *op);
    void send_sync(cluster_op_t *op, cluster_op_part_t *part);
    void handle_op_part(cluster_op_part_t *part);
+    void copy_part_bitmap(cluster_op_t *op, cluster_op_part_t *part);
 };
--- a/src/etcd_state_client.cpp
+++ b/src/etcd_state_client.cpp
@ -4,22 +4,32 @@
 #include "osd_ops.h"
 #include "pg_states.h"
 #include "etcd_state_client.h"
+#ifndef __MOCK__
 #include "http_client.h"
 #include "base64.h"
+#endif

 etcd_state_client_t::~etcd_state_client_t()
 {
+    for (auto watch: watches)
+    {
+        delete watch;
+    }
+    watches.clear();
    etcd_watches_initialised = -1;
+#ifndef __MOCK__
    if (etcd_watch_ws)
    {
        etcd_watch_ws->close();
        etcd_watch_ws = NULL;
    }
+#endif
 }

-json_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json)
+#ifndef __MOCK__
+etcd_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json)
 {
-    json_kv_t kv;
+    etcd_kv_t kv;
    kv.key = base64_decode(kv_json["key"].string_value());
    std::string json_err, json_text = base64_decode(kv_json["value"].string_value());
    kv.value = json_text == "" ? json11::Json() : json11::Json::parse(json_text, json_err);
@ -28,6 +38,8 @@ json_kv_t etcd_state_client_t::parse_etcd_kv(const json11::Json & kv_json)
        printf("Bad JSON in etcd key %s: %s (value: %s)\n", kv.key.c_str(), json_err.c_str(), json_text.c_str());
        kv.key = "";
    }
+    else
+        kv.mod_revision = kv_json["mod_revision"].uint64_value();
    return kv;
 }

@ -140,22 +152,22 @@ void etcd_state_client_t::start_etcd_watcher()
                    etcd_watch_revision = data["result"]["header"]["revision"].uint64_value();
                }
                // First gather all changes into a hash to remove multiple overwrites
-                json11::Json::object changes;
+                std::map<std::string, etcd_kv_t> changes;
                for (auto & ev: data["result"]["events"].array_items())
                {
                    auto kv = parse_etcd_kv(ev["kv"]);
                    if (kv.key != "")
                    {
-                        changes[kv.key] = kv.value;
+                        changes[kv.key] = kv;
                    }
                }
                for (auto & kv: changes)
                {
                    if (this->log_level > 3)
                    {
-                        printf("Incoming event: %s -> %s\n", kv.first.c_str(), kv.second.dump().c_str());
+                        printf("Incoming event: %s -> %s\n", kv.first.c_str(), kv.second.value.dump().c_str());
                    }
-                    parse_state(kv.first, kv.second);
+                    parse_state(kv.second);
                }
                // React to changes
                if (on_change_hook != NULL)
@ -266,6 +278,12 @@ void etcd_state_client_t::load_pgs()
                { "key", base64_encode(etcd_prefix+"/config/pgs") },
            } }
        },
+        json11::Json::object {
+            { "request_range", json11::Json::object {
+                { "key", base64_encode(etcd_prefix+"/config/inode/") },
+                { "range_end", base64_encode(etcd_prefix+"/config/inode0") },
+            } }
+        },
        json11::Json::object {
            { "request_range", json11::Json::object {
                { "key", base64_encode(etcd_prefix+"/pg/history/") },
@ -316,16 +334,33 @@ void etcd_state_client_t::load_pgs()
            for (auto & kv_json: res["response_range"]["kvs"].array_items())
            {
                auto kv = parse_etcd_kv(kv_json);
-                parse_state(kv.key, kv.value);
+                parse_state(kv);
            }
        }
        on_load_pgs_hook(true);
        start_etcd_watcher();
    });
 }
-
-void etcd_state_client_t::parse_state(const std::string & key, const json11::Json & value)
+#else
+void etcd_state_client_t::parse_config(json11::Json & config)
 {
+}
+
+void etcd_state_client_t::load_global_config()
+{
+    json11::Json::object global_config;
+    on_load_config_hook(global_config);
+}
+
+void etcd_state_client_t::load_pgs()
+{
+}
+#endif
+
+void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
+{
+    const std::string & key = kv.key;
+    const json11::Json & value = kv.value;
    if (key == etcd_prefix+"/config/pools")
    {
        for (auto & pool_item: this->pool_config)
@ -336,8 +371,10 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
        {
            pool_config_t pc;
            // ID
-            pool_id_t pool_id = stoull_full(pool_item.first);
-            if (!pool_id || pool_id >= POOL_ID_MAX)
+            pool_id_t pool_id = 0;
+            char null_byte = 0;
+            sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
+            if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
            {
                printf("Pool ID %s is invalid (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
                continue;
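
The pool and PG ID parsing in `parse_state` now uses `sscanf` with a `"%u%c"` format: the trailing `%c` matches only if something follows the number, so a non-zero `null_byte` signals garbage after the digits. Below is a sketch of the same check as a reusable function. `parse_uint_strict` is a hypothetical name, and unlike the diff it also checks the `sscanf` return value, so the output is never read when no number was parsed.

```cpp
#include <assert.h>
#include <stdio.h>

// Parses a decimal unsigned integer that must span the whole string.
// Returns true and stores the value on success, false otherwise.
inline bool parse_uint_strict(const char *str, unsigned & out)
{
    unsigned value = 0;
    char trailing = 0;
    // sscanf returns 1 when only the number matched, 2 when an extra
    // character followed it, and 0 or EOF when no number was found
    int matched = sscanf(str, "%u%c", &value, &trailing);
    if (matched != 1)
        return false;
    out = value;
    return true;
}
```

Note that `%u` skips leading whitespace and accepts a sign, as `strtoul` does, so callers that need to reject those forms would have to inspect the first character themselves.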
@ -449,16 +486,19 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
        }
        for (auto & pool_item: value["items"].object_items())
        {
-            pool_id_t pool_id = stoull_full(pool_item.first);
-            if (!pool_id || pool_id >= POOL_ID_MAX)
+            pool_id_t pool_id = 0;
+            char null_byte = 0;
+            sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
+            if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
            {
                printf("Pool ID %s is invalid in PG configuration (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
                continue;
            }
            for (auto & pg_item: pool_item.second.object_items())
            {
-                pg_num_t pg_num = stoull_full(pg_item.first);
-                if (!pg_num)
+                pg_num_t pg_num = 0; null_byte = 0;
+                sscanf(pg_item.first.c_str(), "%u%c", &pg_num, &null_byte);
+                if (!pg_num || null_byte != 0)
                {
                    printf("Bad key in pool %u PG configuration: %s (must be a number), skipped\n", pool_id, pg_item.first.c_str());
                    continue;
@ -612,4 +652,106 @@ void etcd_state_client_t::parse_state(const std::string & key, const json11::Jso
            }
        }
    }
+    else if (key.substr(0, etcd_prefix.length()+14) == etcd_prefix+"/config/inode/")
+    {
+        // Key format: <etcd_prefix>/config/inode/<pool_id>/<inode_num>
+        uint64_t pool_id = 0;
+        uint64_t inode_num = 0;
+        char null_byte = 0;
+        sscanf(key.c_str() + etcd_prefix.length()+14, "%lu/%lu%c", &pool_id, &inode_num, &null_byte);
+        if (!pool_id || pool_id >= POOL_ID_MAX || !inode_num || (inode_num >> (64-POOL_ID_BITS)) || null_byte != 0)
+        {
+            printf("Bad etcd key %s, ignoring\n", key.c_str());
+        }
+        else
+        {
+            inode_num |= (pool_id << (64-POOL_ID_BITS));
+            auto it = this->inode_config.find(inode_num);
+            if (it != this->inode_config.end() && it->second.name != "")
+            {
+                auto n_it = this->inode_by_name.find(it->second.name);
+                if (n_it != this->inode_by_name.end() && n_it->second == inode_num)
+                {
+                    this->inode_by_name.erase(n_it);
+                    for (auto w: watches)
+                    {
+                        if (w->name == it->second.name)
+                        {
+                            w->cfg = { 0 };
+                        }
+                    }
+                }
+            }
+            if (!value.is_object())
+            {
+                this->inode_config.erase(inode_num);
+            }
+            else
+            {
+                inode_t parent_inode_num = value["parent_id"].uint64_value();
+                if (parent_inode_num && !(parent_inode_num >> (64-POOL_ID_BITS)))
+                {
+                    uint64_t parent_pool_id = value["parent_pool"].uint64_value();
+                    if (!parent_pool_id)
+                        parent_inode_num |= pool_id << (64-POOL_ID_BITS);
+                    else if (parent_pool_id >= POOL_ID_MAX)
+                    {
+                        printf(
+                            "Inode %lu/%lu parent_pool value is invalid, ignoring parent setting\n",
+                            inode_num >> (64-POOL_ID_BITS), inode_num & ((1l << (64-POOL_ID_BITS)) - 1)
+                        );
+                        parent_inode_num = 0;
+                    }
+                    else
+                        parent_inode_num |= parent_pool_id << (64-POOL_ID_BITS);
+                }
+                inode_config_t cfg = (inode_config_t){
+                    .num = inode_num,
+                    .name = value["name"].string_value(),
+                    .size = value["size"].uint64_value(),
+                    .parent_id = parent_inode_num,
+                    .readonly = value["readonly"].bool_value(),
+                    .mod_revision = kv.mod_revision,
+                };
+                this->inode_config[inode_num] = cfg;
+                if (cfg.name != "")
+                {
+                    this->inode_by_name[cfg.name] = inode_num;
+                    for (auto w: watches)
+                    {
+                        if (w->name == value["name"].string_value())
+                        {
+                            w->cfg = cfg;
+                        }
+                    }
+                }
+            }
+        }
+    }
+}
+
+inode_watch_t* etcd_state_client_t::watch_inode(std::string name)
+{
+    inode_watch_t *watch = new inode_watch_t;
+    watch->name = name;
+    watches.push_back(watch);
+    auto it = inode_by_name.find(name);
+    if (it != inode_by_name.end())
+    {
+        watch->cfg = inode_config[it->second];
+    }
+    return watch;
+}
+
+void etcd_state_client_t::close_watch(inode_watch_t* watch)
+{
+    for (int i = 0; i < watches.size(); i++)
+    {
+        if (watches[i] == watch)
+        {
+            watches.erase(watches.begin()+i, watches.begin()+i+1);
+            break;
+        }
+    }
+    delete watch;
 }
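Note on the key parsing above: the inode branch of parse_state() rejects keys with trailing garbage by asking sscanf for one extra character after the two numbers. The following is a minimal standalone sketch of the same idea, not the code from this commit. It checks the number of converted fields returned by sscanf instead of a sentinel byte, and the function name parse_pool_inode_sketch is hypothetical.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: parses "<pool>/<inode>" and rejects trailing characters.
// Returns true only if exactly two numbers were read, nothing follows them,
// and both numbers are non-zero.
bool parse_pool_inode_sketch(const char *suffix, uint64_t *pool_id, uint64_t *inode_num)
{
    char trailing = 0;
    // The extra %c converts only if something follows the second number,
    // so a clean key yields exactly 2 converted fields.
    int matched = sscanf(suffix, "%" SCNu64 "/%" SCNu64 "%c", pool_id, inode_num, &trailing);
    return matched == 2 && *pool_id != 0 && *inode_num != 0;
}
```

Relying on the return value avoids depending on the previous contents of the sentinel variable, which matters when the same variable is reused across loop iterations. The range checks against POOL_ID_MAX and POOL_ID_BITS from the diff are left out here because their values are not visible in this section.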
--- a/src/etcd_state_client.h
+++ b/src/etcd_state_client.h
@ -3,8 +3,8 @@

 #pragma once

+#include "json11/json11.hpp"
 #include "osd_id.h"
-#include "http_client.h"
 #include "timerfd_manager.h"

 #define ETCD_CONFIG_WATCH_ID 1
@ -18,10 +18,11 @@

 #define DEFAULT_BLOCK_SIZE 128*1024

-struct json_kv_t
+struct etcd_kv_t
 {
    std::string key;
    json11::Json value;
+    uint64_t mod_revision;
 };

 struct pg_config_t
@ -52,9 +53,31 @@ struct pool_config_t
    std::map<pg_num_t, pg_config_t> pg_config;
 };

+struct inode_config_t
+{
+    uint64_t num;
+    std::string name;
+    uint64_t size;
+    inode_t parent_id;
+    bool readonly;
+    // Change revision of the metadata in etcd
+    uint64_t mod_revision;
+};
+
+struct inode_watch_t
+{
+    std::string name;
+    inode_config_t cfg;
+};
+
+struct websocket_t;
+
 struct etcd_state_client_t
 {
 protected:
+    std::vector<inode_watch_t*> watches;
+    websocket_t *etcd_watch_ws = NULL;
+    uint64_t bs_block_size = DEFAULT_BLOCK_SIZE;
    void add_etcd_url(std::string);
 public:
    std::vector<std::string> etcd_addresses;
@ -64,25 +87,27 @@ public:

    int etcd_watches_initialised = 0;
    uint64_t etcd_watch_revision = 0;
-    websocket_t *etcd_watch_ws = NULL;
-    uint64_t bs_block_size = 0;
    std::map<pool_id_t, pool_config_t> pool_config;
    std::map<osd_num_t, json11::Json> peer_states;
+    std::map<inode_t, inode_config_t> inode_config;
+    std::map<std::string, inode_t> inode_by_name;

-    std::function<void(json11::Json::object &)> on_change_hook;
+    std::function<void(std::map<std::string, etcd_kv_t> &)> on_change_hook;
    std::function<void(json11::Json::object &)> on_load_config_hook;
    std::function<json11::Json()> load_pgs_checks_hook;
    std::function<void(bool)> on_load_pgs_hook;
    std::function<void(pool_id_t, pg_num_t)> on_change_pg_history_hook;
    std::function<void(osd_num_t)> on_change_osd_state_hook;

-    json_kv_t parse_etcd_kv(const json11::Json & kv_json);
+    etcd_kv_t parse_etcd_kv(const json11::Json & kv_json);
    void etcd_call(std::string api, json11::Json payload, int timeout, std::function<void(std::string, json11::Json)> callback);
    void etcd_txn(json11::Json txn, int timeout, std::function<void(std::string, json11::Json)> callback);
    void start_etcd_watcher();
    void load_global_config();
    void load_pgs();
-    void parse_state(const std::string & key, const json11::Json & value);
+    void parse_state(const etcd_kv_t & kv);
    void parse_config(json11::Json & config);
+    inode_watch_t* watch_inode(std::string name);
+    void close_watch(inode_watch_t* watch);
    ~etcd_state_client_t();
 };
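Note on the watch API declared above: watch_inode() hands out a heap-allocated inode_watch_t whose cfg is refreshed whenever the named inode changes, and close_watch() unregisters and frees it. A simplified sketch of that lifecycle follows. All type and member names ending in _sketch_t are hypothetical stand-ins, there is no etcd involved, and update() only imitates what parse_state() does for a named inode.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical, reduced versions of inode_config_t and inode_watch_t
struct inode_cfg_sketch_t
{
    uint64_t num = 0;
    std::string name;
    uint64_t size = 0;
};

struct inode_watch_sketch_t
{
    std::string name;
    inode_cfg_sketch_t cfg;
};

struct watch_registry_sketch_t
{
    std::map<std::string, inode_cfg_sketch_t> by_name;
    std::vector<inode_watch_sketch_t*> watches;

    // Register a watch; cfg stays zeroed if the name is not known yet
    inode_watch_sketch_t* watch(const std::string & name)
    {
        inode_watch_sketch_t *w = new inode_watch_sketch_t;
        w->name = name;
        auto it = by_name.find(name);
        if (it != by_name.end())
            w->cfg = it->second;
        watches.push_back(w);
        return w;
    }

    // Store a new configuration and refresh every watch on that name
    void update(const inode_cfg_sketch_t & cfg)
    {
        by_name[cfg.name] = cfg;
        for (auto w: watches)
            if (w->name == cfg.name)
                w->cfg = cfg;
    }

    // Unregister and free a watch
    void close(inode_watch_sketch_t *w)
    {
        watches.erase(std::remove(watches.begin(), watches.end(), w), watches.end());
        delete w;
    }
};
```

A caller that registers a watch before the image exists sees cfg.num == 0 until the first update arrives, so it is reasonable to check for that before using the watch.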
--- a/src/fio_cluster.cpp
+++ b/src/fio_cluster.cpp
@ -6,17 +6,17 @@
 // Random write:
 //
 // fio -thread -ioengine=./libfio_cluster.so -name=test -bs=4k -direct=1 -fsync=16 -iodepth=16 -rw=randwrite \
-//     -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -pool=1 -inode=1 -size=1000M
+//     -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] (-image=testimg | -pool=1 -inode=1 -size=1000M)
 //
 // Linear write:
 //
 // fio -thread -ioengine=./libfio_cluster.so -name=test -bs=128k -direct=1 -fsync=32 -iodepth=32 -rw=write \
-//     -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -pool=1 -inode=1 -size=1000M
+//     -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -image=testimg
 //
 // Random read (run with -iodepth=32 or -iodepth=1):
 //
 // fio -thread -ioengine=./libfio_cluster.so -name=test -bs=4k -direct=1 -iodepth=32 -rw=randread \
-//     -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -pool=1 -inode=1 -size=1000M
+//     -etcd=127.0.0.1:2379 [-etcd_prefix=/vitastor] -image=testimg

 #include <sys/types.h>
 #include <sys/socket.h>
@ -35,6 +35,7 @@ struct sec_data
    ring_loop_t *ringloop = NULL;
    epoll_manager_t *epmgr = NULL;
    cluster_client_t *cli = NULL;
+    inode_watch_t *watch = NULL;
    bool last_sync = false;
    /* The list of completed io_u structs. */
    std::vector<io_u*> completed;
@ -47,6 +48,7 @@ struct sec_options
    int __pad;
    char *etcd_host = NULL;
    char *etcd_prefix = NULL;
+    char *image = NULL;
    uint64_t pool = 0;
    uint64_t inode = 0;
    int cluster_log = 0;
@ -64,7 +66,7 @@ static struct fio_option options[] = {
        .group  = FIO_OPT_G_FILENAME,
    },
    {
-        .name   = "etcd",
+        .name   = "etcd_prefix",
        .lname  = "etcd key prefix",
        .type   = FIO_OPT_STR_STORE,
        .off1   = offsetof(struct sec_options, etcd_prefix),
@ -72,6 +74,15 @@ static struct fio_option options[] = {
        .category = FIO_OPT_C_ENGINE,
        .group  = FIO_OPT_G_FILENAME,
    },
+    {
+        .name   = "image",
+        .lname  = "Vitastor image name",
+        .type   = FIO_OPT_STR_STORE,
+        .off1   = offsetof(struct sec_options, image),
+        .help   = "Vitastor image name to run tests on",
+        .category = FIO_OPT_C_ENGINE,
+        .group  = FIO_OPT_G_FILENAME,
+    },
    {
        .name   = "pool",
        .lname  = "pool number for the inode",
@ -86,7 +97,7 @@ static struct fio_option options[] = {
        .lname  = "inode to run tests on",
        .type   = FIO_OPT_INT,
        .off1   = offsetof(struct sec_options, inode),
-        .help   = "inode to run tests on (1 by default)",
+        .help   = "inode number to run tests on",
        .category = FIO_OPT_C_ENGINE,
        .group  = FIO_OPT_G_FILENAME,
    },
@ -141,6 +152,51 @@ static int sec_setup(struct thread_data *td)
        td->o.open_files++;
    }

+    json11::Json cfg = json11::Json::object {
+        { "etcd_address", std::string(o->etcd_host) },
+        { "etcd_prefix", std::string(o->etcd_prefix ? o->etcd_prefix : "/vitastor") },
+        { "log_level", o->cluster_log },
+    };
+
+    if (!o->image)
+    {
+        if (!(o->inode & ((1l << (64-POOL_ID_BITS)) - 1)))
+        {
+            td_verror(td, EINVAL, "inode number is missing");
+            return 1;
+        }
+        if (o->pool)
+        {
+            o->inode = (o->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (o->pool << (64-POOL_ID_BITS));
+        }
+        if (!(o->inode >> (64-POOL_ID_BITS)))
+        {
+            td_verror(td, EINVAL, "pool is missing");
+            return 1;
+        }
+    }
+    else
+    {
+        o->inode = 0;
+    }
+    bsd->ringloop = new ring_loop_t(512);
+    bsd->epmgr = new epoll_manager_t(bsd->ringloop);
+    bsd->cli = new cluster_client_t(bsd->ringloop, bsd->epmgr->tfd, cfg);
+    if (o->image)
+    {
+        while (!bsd->cli->is_ready())
+        {
+            bsd->ringloop->loop();
+            if (bsd->cli->is_ready())
+                break;
+            bsd->ringloop->wait();
+        }
+        bsd->watch = bsd->cli->st_cli.watch_inode(std::string(o->image));
+        td->files[0]->real_file_size = bsd->watch->cfg.size;
+    }
+
+    bsd->trace = o->trace ? true : false;
+
    return 0;
 }

@ -149,6 +205,10 @@ static void sec_cleanup(struct thread_data *td)
    sec_data *bsd = (sec_data*)td->io_ops_data;
    if (bsd)
    {
+        if (bsd->watch)
+        {
+            bsd->cli->st_cli.close_watch(bsd->watch);
+        }
        delete bsd->cli;
        delete bsd->epmgr;
        delete bsd->ringloop;
@ -159,28 +219,6 @@ static void sec_cleanup(struct thread_data *td)
 /* Connect to the server from each thread. */
 static int sec_init(struct thread_data *td)
 {
-    sec_options *o = (sec_options*)td->eo;
-    sec_data *bsd = (sec_data*)td->io_ops_data;
-
-    json11::Json cfg = json11::Json::object {
-        { "etcd_address", std::string(o->etcd_host) },
-        { "etcd_prefix", std::string(o->etcd_prefix ? o->etcd_prefix : "/vitastor") },
-        { "log_level", o->cluster_log },
-    };
-
-    if (o->pool)
-        o->inode = (o->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (o->pool << (64-POOL_ID_BITS));
-    if (!(o->inode >> (64-POOL_ID_BITS)))
-    {
-        td_verror(td, EINVAL, "pool is missing");
-        return 1;
-    }
-    bsd->ringloop = new ring_loop_t(512);
-    bsd->epmgr = new epoll_manager_t(bsd->ringloop);
-    bsd->cli = new cluster_client_t(bsd->ringloop, bsd->epmgr->tfd, cfg);
-
-    bsd->trace = o->trace ? true : false;
-
    return 0;
 }

@ -200,19 +238,23 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
    io->engine_data = bsd;
    cluster_op_t *op = new cluster_op_t;

+    op->inode = opt->image ? bsd->watch->cfg.num : opt->inode;
    switch (io->ddir)
    {
    case DDIR_READ:
        op->opcode = OSD_OP_READ;
-        op->inode = opt->inode;
        op->offset = io->offset;
        op->len = io->xfer_buflen;
        op->iov.push_back(io->xfer_buf, io->xfer_buflen);
        bsd->last_sync = false;
        break;
    case DDIR_WRITE:
+        if (opt->image && bsd->watch->cfg.readonly)
+        {
+            io->error = EROFS;
+            return FIO_Q_COMPLETED;
+        }
        op->opcode = OSD_OP_WRITE;
-        op->inode = opt->inode;
        op->offset = io->offset;
        op->len = io->xfer_buflen;
        op->iov.push_back(io->xfer_buf, io->xfer_buflen);
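Note on the -pool/-inode handling in sec_setup() above: when no image name is given, the pool number is folded into the high bits of the inode number, and the job is rejected if either part is missing. Below is a standalone sketch of that validation. The value 16 for POOL_ID_BITS is an assumption made only for illustration (the real constant is defined outside this section), and combine_pool_inode_sketch with its return codes is a hypothetical name and convention.

```cpp
#include <cstdint>

// Assumption for illustration only: 16 high bits hold the pool ID
static const unsigned POOL_ID_BITS = 16;
static const uint64_t INODE_NUM_MASK = (((uint64_t)1 << (64-POOL_ID_BITS)) - 1);

// Hypothetical helper mirroring the checks in sec_setup().
// Returns 0 and stores the full inode number in *result on success,
// -1 if the inode number part is missing, -2 if the pool part is missing.
int combine_pool_inode_sketch(uint64_t pool, uint64_t inode, uint64_t *result)
{
    if (!(inode & INODE_NUM_MASK))
        return -1;
    if (pool)
    {
        // An explicit pool option overrides any pool bits already in the inode
        inode = (inode & INODE_NUM_MASK) | (pool << (64-POOL_ID_BITS));
    }
    if (!(inode >> (64-POOL_ID_BITS)))
        return -2;
    *result = inode;
    return 0;
}
```

With these assumptions, pool 1 and inode 1 combine to 0x0001000000000001, while inode 1 without a pool is rejected.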
--- a/src/fio_engine.cpp
+++ b/src/fio_engine.cpp
@ -25,6 +25,7 @@
 //     -bs_config='{"data_device":"./test_data.bin"}' -size=1000M

 #include "blockstore.h"
+#include "epoll_manager.h"
 #include "fio_headers.h"

 #include "json11/json11.hpp"
@ -32,6 +33,7 @@
 struct bs_data
 {
    blockstore_t *bs;
+    epoll_manager_t *epmgr;
    ring_loop_t *ringloop;
    /* The list of completed io_u structs. */
    std::vector<io_u*> completed;
@ -104,6 +106,7 @@ static void bs_cleanup(struct thread_data *td)
        }
    safe:
        delete bsd->bs;
+        delete bsd->epmgr;
        delete bsd->ringloop;
        delete bsd;
    }
@ -129,7 +132,8 @@ static int bs_init(struct thread_data *td)
        }
    }
    bsd->ringloop = new ring_loop_t(512);
-    bsd->bs = new blockstore_t(config, bsd->ringloop);
+    bsd->epmgr = new epoll_manager_t(bsd->ringloop);
+    bsd->bs = new blockstore_t(config, bsd->ringloop, bsd->epmgr->tfd);
    while (1)
    {
        bsd->ringloop->loop();
--- a/src/messenger.cpp
+++ b/src/messenger.cpp
@ -10,30 +10,16 @@

 #include "messenger.h"

-osd_op_t::~osd_op_t()
-{
-    assert(!bs_op);
-    assert(!op_data);
-    if (rmw_buf)
-    {
-        free(rmw_buf);
-    }
-    if (buf)
-    {
-        // Note: reusing osd_op_t WILL currently lead to memory leaks
-        // So we don't reuse it, but free it every time
-        free(buf);
-    }
-}
-
 void osd_messenger_t::init()
 {
    keepalive_timer_id = tfd->set_timer(1000, true, [this](int)
    {
-        for (auto cl_it = clients.begin(); cl_it != clients.end();)
+        std::vector<int> to_stop;
+        std::vector<osd_op_t*> to_ping;
+        for (auto cl_it = clients.begin(); cl_it != clients.end(); cl_it++)
        {
-            auto cl = (cl_it++)->second;
-            if (!cl->osd_num)
+            auto cl = cl_it->second;
+            if (!cl->osd_num || cl->peer_state != PEER_CONNECTED)
            {
                // Do not run keepalive on regular clients
                continue;
@ -44,7 +30,8 @@ void osd_messenger_t::init()
                if (!cl->ping_time_remaining)
                {
                    // Ping timed out, stop the client
-                    stop_client(cl->peer_fd, true);
+                    printf("Ping timed out for OSD %lu (client %d), disconnecting peer\n", cl->osd_num, cl->peer_fd);
+                    to_stop.push_back(cl->peer_fd);
                }
            }
            else if (cl->idle_time_remaining > 0)
@ -70,10 +57,11 @@ void osd_messenger_t::init()
                        delete op;
                        if (fail_fd >= 0)
                        {
+                            printf("Ping failed for OSD %lu (client %d), disconnecting peer\n", cl->osd_num, cl->peer_fd);
                            stop_client(fail_fd, true);
                        }
                    };
-                    outbox_push(op);
+                    to_ping.push_back(op);
                    cl->ping_time_remaining = osd_ping_timeout;
                    cl->idle_time_remaining = osd_idle_timeout;
                }
@ -83,6 +71,15 @@ void osd_messenger_t::init()
                cl->idle_time_remaining = osd_idle_timeout;
            }
        }
+        // Don't stop clients while a 'clients' iterator is still active
+        for (int peer_fd: to_stop)
+        {
+            stop_client(peer_fd, true);
+        }
+        for (auto op: to_ping)
+        {
+            outbox_push(op);
+        }
    });
 }

@ -141,17 +138,14 @@ void osd_messenger_t::connect_peer(uint64_t peer_osd, json11::Json peer_state)
        wanted_peers[peer_osd].port = (int)peer_state["port"].int64_value();
    }
    wanted_peers[peer_osd].address_changed = true;
-    if (!wanted_peers[peer_osd].connecting &&
-        (time(NULL) - wanted_peers[peer_osd].last_connect_attempt) >= peer_connect_interval)
-    {
    try_connect_peer(peer_osd);
-    }
 }

 void osd_messenger_t::try_connect_peer(uint64_t peer_osd)
 {
    auto wp_it = wanted_peers.find(peer_osd);
-    if (wp_it == wanted_peers.end())
+    if (wp_it == wanted_peers.end() || wp_it->second.connecting ||
+        (time(NULL) - wp_it->second.last_connect_attempt) < peer_connect_interval)
    {
        return;
    }
@ -197,10 +191,22 @@ void osd_messenger_t::try_connect_peer_addr(osd_num_t peer_osd, const char *peer
        on_connect_peer(peer_osd, -errno);
        return;
    }
-    int timeout_id = -1;
+    clients[peer_fd] = new osd_client_t();
+    clients[peer_fd]->peer_addr = addr;
+    clients[peer_fd]->peer_port = peer_port;
+    clients[peer_fd]->peer_fd = peer_fd;
+    clients[peer_fd]->peer_state = PEER_CONNECTING;
+    clients[peer_fd]->connect_timeout_id = -1;
+    clients[peer_fd]->osd_num = peer_osd;
+    clients[peer_fd]->in_buf = malloc_or_die(receive_buffer_size);
+    tfd->set_fd_handler(peer_fd, true, [this](int peer_fd, int epoll_events)
+    {
+        // Either OUT (connected) or HUP
+        handle_connect_epoll(peer_fd);
+    });
    if (peer_connect_timeout > 0)
    {
-        timeout_id = tfd->set_timer(1000*peer_connect_timeout, false, [this, peer_fd](int timer_id)
+        clients[peer_fd]->connect_timeout_id = tfd->set_timer(1000*peer_connect_timeout, false, [this, peer_fd](int timer_id)
        {
            osd_num_t peer_osd = clients.at(peer_fd)->osd_num;
            stop_client(peer_fd, true);
@ -208,20 +214,6 @@ void osd_messenger_t::try_connect_peer_addr(osd_num_t peer_osd, const char *peer
            return;
        });
    }
-    clients[peer_fd] = new osd_client_t((osd_client_t){
-        .peer_addr = addr,
-        .peer_port = peer_port,
-        .peer_fd = peer_fd,
-        .peer_state = PEER_CONNECTING,
-        .connect_timeout_id = timeout_id,
-        .osd_num = peer_osd,
-        .in_buf = malloc_or_die(receive_buffer_size),
-    });
-    tfd->set_fd_handler(peer_fd, true, [this](int peer_fd, int epoll_events)
-    {
-        // Either OUT (connected) or HUP
-        handle_connect_epoll(peer_fd);
-    });
 }

 void osd_messenger_t::handle_connect_epoll(int peer_fd)
@ -357,6 +349,15 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
                err = true;
                printf("Connected to OSD %lu instead of OSD %lu, peer state is outdated, disconnecting peer\n", config["osd_num"].uint64_value(), cl->osd_num);
            }
+            else if (config["protocol_version"].uint64_value() != OSD_PROTOCOL_VERSION)
+            {
+                err = true;
+                printf(
+                    "OSD %lu protocol version is %lu, but only version %u is supported.\n"
+                    " If you need to upgrade from 0.5.x please request it via the issue tracker.\n",
+                    cl->osd_num, config["protocol_version"].uint64_value(), OSD_PROTOCOL_VERSION
+                );
+            }
        }
        if (err)
        {
@ -373,123 +374,6 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
    outbox_push(op);
 }

-void osd_messenger_t::cancel_osd_ops(osd_client_t *cl)
-{
-    for (auto p: cl->sent_ops)
-    {
-        cancel_op(p.second);
-    }
-    cl->sent_ops.clear();
-    cl->outbox.clear();
-}
-
-void osd_messenger_t::cancel_op(osd_op_t *op)
-{
-    if (op->op_type == OSD_OP_OUT)
-    {
-        op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
-        op->reply.hdr.id = op->req.hdr.id;
-        op->reply.hdr.opcode = op->req.hdr.opcode;
-        op->reply.hdr.retval = -EPIPE;
-        // Copy lambda to be unaffected by `delete op`
-        std::function<void(osd_op_t*)>(op->callback)(op);
-    }
-    else
-    {
-        // This function is only called in stop_client(), so it's fine to destroy the operation
-        delete op;
-    }
-}
-
-void osd_messenger_t::stop_client(int peer_fd, bool force)
-{
-    assert(peer_fd != 0);
-    auto it = clients.find(peer_fd);
-    if (it == clients.end())
-    {
-        return;
-    }
-    uint64_t repeer_osd = 0;
-    osd_client_t *cl = it->second;
-    if (cl->peer_state == PEER_CONNECTED)
-    {
-        if (cl->osd_num)
-        {
-            // Reload configuration from etcd when the connection is dropped
-            if (log_level > 0)
-                printf("[OSD %lu] Stopping client %d (OSD peer %lu)\n", osd_num, peer_fd, cl->osd_num);
-            repeer_osd = cl->osd_num;
-        }
-        else
-        {
-            if (log_level > 0)
-                printf("[OSD %lu] Stopping client %d (regular client)\n", osd_num, peer_fd);
-        }
-    }
-    else if (!force)
-    {
-        return;
-    }
-    cl->peer_state = PEER_STOPPED;
-    clients.erase(it);
-    tfd->set_fd_handler(peer_fd, false, NULL);
-    if (cl->connect_timeout_id >= 0)
-    {
-        tfd->clear_timer(cl->connect_timeout_id);
-        cl->connect_timeout_id = -1;
-    }
-    if (cl->osd_num)
-    {
-        osd_peer_fds.erase(cl->osd_num);
-    }
-    if (cl->read_op)
-    {
-        if (cl->read_op->callback)
-        {
-            cancel_op(cl->read_op);
-        }
-        else
-        {
-            delete cl->read_op;
-        }
-        cl->read_op = NULL;
-    }
-    for (auto rit = read_ready_clients.begin(); rit != read_ready_clients.end(); rit++)
-    {
-        if (*rit == peer_fd)
-        {
-            read_ready_clients.erase(rit);
-            break;
-        }
-    }
-    for (auto wit = write_ready_clients.begin(); wit != write_ready_clients.end(); wit++)
-    {
-        if (*wit == peer_fd)
-        {
-            write_ready_clients.erase(wit);
-            break;
-        }
-    }
-    free(cl->in_buf);
-    cl->in_buf = NULL;
-    close(peer_fd);
-    if (repeer_osd)
-    {
-        // First repeer PGs as canceling OSD ops may push new operations
-        // and we need correct PG states when we do that
-        repeer_pgs(repeer_osd);
-    }
-    if (cl->osd_num)
-    {
-        // Cancel outbound operations
-        cancel_osd_ops(cl);
-    }
-    if (cl->refs <= 0)
-    {
-        delete cl;
-    }
-}
-
 void osd_messenger_t::accept_connections(int listen_fd)
 {
    // Accept new connections
@ -505,13 +389,12 @@ void osd_messenger_t::accept_connections(int listen_fd)
        fcntl(peer_fd, F_SETFL, fcntl(peer_fd, F_GETFL, 0) | O_NONBLOCK);
        int one = 1;
        setsockopt(peer_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
-        clients[peer_fd] = new osd_client_t((osd_client_t){
-            .peer_addr = addr,
-            .peer_port = ntohs(addr.sin_port),
-            .peer_fd = peer_fd,
-            .peer_state = PEER_CONNECTED,
-            .in_buf = malloc_or_die(receive_buffer_size),
-        });
+        clients[peer_fd] = new osd_client_t();
+        clients[peer_fd]->peer_addr = addr;
+        clients[peer_fd]->peer_port = ntohs(addr.sin_port);
+        clients[peer_fd]->peer_fd = peer_fd;
+        clients[peer_fd]->peer_state = PEER_CONNECTED;
+        clients[peer_fd]->in_buf = malloc_or_die(receive_buffer_size);
        // Add FD to epoll
        tfd->set_fd_handler(peer_fd, false, [this](int peer_fd, int epoll_events)
        {
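Note on the keepalive timer change above: the callback no longer calls stop_client() or outbox_push() while walking the clients map. It collects the affected descriptors and operations first and acts on them after the loop. A minimal sketch of that collect-then-mutate pattern follows. peer_sketch_t and expire_peers_sketch are hypothetical names, and a plain erase stands in for stop_client().

```cpp
#include <map>
#include <vector>

// Hypothetical, reduced peer state: only the remaining ping time
struct peer_sketch_t
{
    int ping_time_remaining;
};

// Decrements the timer of every peer that is waiting for a ping reply and
// removes the peers whose timer reached zero. Returns the removed descriptors.
std::vector<int> expire_peers_sketch(std::map<int, peer_sketch_t> & peers)
{
    std::vector<int> to_stop;
    for (auto & p: peers)
    {
        if (p.second.ping_time_remaining > 0 && --p.second.ping_time_remaining == 0)
        {
            // Only remember the descriptor, do not touch the map here
            to_stop.push_back(p.first);
        }
    }
    // Mutate the map only after the iteration has finished
    for (int fd: to_stop)
    {
        peers.erase(fd);
    }
    return to_stop;
}
```

Erasing the element an iterator currently points to invalidates that iterator, and a teardown routine that runs callbacks may remove other entries as well, so deferring the removal keeps the loop safe regardless of what the teardown does.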
--- a/src/messenger.h
+++ b/src/messenger.h
@ -14,19 +14,15 @@

 #include "malloc_or_die.h"
 #include "json11/json11.hpp"
-#include "osd_ops.h"
+#include "msgr_op.h"
 #include "timerfd_manager.h"
-#include "ringloop.h"
-
-#define OSD_OP_IN 0
-#define OSD_OP_OUT 1
+#include <ringloop.h>

 #define CL_READ_HDR 1
 #define CL_READ_DATA 2
 #define CL_READ_REPLY_DATA 3
 #define CL_WRITE_READY 1
 #define CL_WRITE_REPLY 2
-#define OSD_OP_INLINE_BUF_COUNT 16

 #define PEER_CONNECTING 1
 #define PEER_CONNECTED 2
@ -35,160 +31,7 @@
 #define DEFAULT_PEER_CONNECT_INTERVAL 5
 #define DEFAULT_PEER_CONNECT_TIMEOUT 5
 #define DEFAULT_OSD_PING_TIMEOUT 5
-
-// Kind of a vector with small-list-optimisation
-struct osd_op_buf_list_t
-{
-    int count = 0, alloc = OSD_OP_INLINE_BUF_COUNT, done = 0;
-    iovec *buf = NULL;
-    iovec inline_buf[OSD_OP_INLINE_BUF_COUNT];
-
-    inline osd_op_buf_list_t()
-    {
-        buf = inline_buf;
-    }
-
-    inline osd_op_buf_list_t(const osd_op_buf_list_t & other)
-    {
-        buf = inline_buf;
-        append(other);
-    }
-
-    inline osd_op_buf_list_t & operator = (const osd_op_buf_list_t & other)
-    {
-        reset();
-        append(other);
-        return *this;
-    }
-
-    inline ~osd_op_buf_list_t()
-    {
-        if (buf && buf != inline_buf)
-        {
-            free(buf);
-        }
-    }
-
-    inline void reset()
-    {
-        count = 0;
-        done = 0;
-    }
-
-    inline iovec* get_iovec()
-    {
-        return buf + done;
-    }
-
-    inline int get_size()
-    {
-        return count - done;
-    }
-
-    inline void append(const osd_op_buf_list_t & other)
-    {
-        if (count+other.count > alloc)
-        {
-            if (buf == inline_buf)
-            {
-                int old = alloc;
-                alloc = (((count+other.count+15)/16)*16);
-                buf = (iovec*)malloc(sizeof(iovec) * alloc);
-                if (!buf)
-                {
-                    printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
-                    exit(1);
-                }
-                memcpy(buf, inline_buf, sizeof(iovec) * old);
-            }
-            else
-            {
-                alloc = (((count+other.count+15)/16)*16);
-                buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
-                if (!buf)
-                {
-                    printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
-                    exit(1);
-                }
-            }
-        }
-        for (int i = 0; i < other.count; i++)
-        {
-            buf[count++] = other.buf[i];
-        }
-    }
-
-    inline void push_back(void *nbuf, size_t len)
-    {
-        if (count >= alloc)
-        {
-            if (buf == inline_buf)
-            {
-                int old = alloc;
-                alloc = ((alloc/16)*16 + 1);
-                buf = (iovec*)malloc(sizeof(iovec) * alloc);
-                if (!buf)
-                {
-                    printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
-                    exit(1);
-                }
-                memcpy(buf, inline_buf, sizeof(iovec)*old);
-            }
-            else
-            {
-                alloc = alloc < 16 ? 16 : (alloc+16);
-                buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
-                if (!buf)
-                {
-                    printf("Failed to allocate %lu bytes\n", sizeof(iovec) * alloc);
-                    exit(1);
-                }
-            }
-        }
-        buf[count++] = { .iov_base = nbuf, .iov_len = len };
-    }
-
-    inline void eat(int result)
-    {
-        while (result > 0 && done < count)
-        {
-            iovec & iov = buf[done];
-            if (iov.iov_len <= result)
-            {
-                result -= iov.iov_len;
-                done++;
-            }
-            else
-            {
-                iov.iov_len -= result;
-                iov.iov_base += result;
-                break;
-            }
-        }
-    }
-};
-
-struct blockstore_op_t;
-
-struct osd_primary_op_data_t;
-
-struct osd_op_t
-{
-    timespec tv_begin;
-    uint64_t op_type = OSD_OP_IN;
-    int peer_fd;
-    osd_any_op_t req;
-    osd_any_reply_t reply;
-    blockstore_op_t *bs_op = NULL;
-    void *buf = NULL;
-    void *rmw_buf = NULL;
-    osd_primary_op_data_t* op_data = NULL;
-    std::function<void(osd_op_t*)> callback;
-
-    osd_op_buf_list_t iov;
-
-    ~osd_op_t();
-};
+#define DEFAULT_BITMAP_GRANULARITY 4096

 struct osd_client_t
 {
@ -228,6 +71,12 @@ struct osd_client_t
    int write_state = 0;
    std::vector<iovec> send_list, next_send_list;
    std::vector<osd_op_t*> outbox, next_outbox;
+
+    ~osd_client_t()
+    {
+        free(in_buf);
+        in_buf = NULL;
+    }
 };

 struct osd_wanted_peer_t
@ -252,12 +101,9 @@ struct osd_op_stats_t

 struct osd_messenger_t
 {
-    timerfd_manager_t *tfd;
-    ring_loop_t *ringloop;
+protected:
    int keepalive_timer_id = -1;

-    // osd_num_t is only for logging and asserts
-    osd_num_t osd_num;
    // FIXME: make receive_buffer_size configurable
    int receive_buffer_size = 64*1024;
    int peer_connect_interval = DEFAULT_PEER_CONNECT_INTERVAL;
@ -267,19 +113,22 @@ struct osd_messenger_t
    int log_level = 0;
    bool use_sync_send_recv = false;

-    std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
-    std::map<uint64_t, int> osd_peer_fds;
-    uint64_t next_subop_id = 1;
-
-    std::map<int, osd_client_t*> clients;
    std::vector<int> read_ready_clients;
    std::vector<int> write_ready_clients;
    std::vector<std::function<void()>> set_immediate;

+public:
+    timerfd_manager_t *tfd;
+    ring_loop_t *ringloop;
+    // osd_num_t is only for logging and asserts
+    osd_num_t osd_num;
+    uint64_t next_subop_id = 1;
+    std::map<int, osd_client_t*> clients;
+    std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
+    std::map<uint64_t, int> osd_peer_fds;
    // op statistics
    osd_op_stats_t stats;

-public:
    void init();
    void parse_config(const json11::Json & config);
    void connect_peer(uint64_t osd_num, json11::Json peer_state);
@ -287,7 +136,6 @@ public:
    void outbox_push(osd_op_t *cur_op);
    std::function<void(osd_op_t*)> exec_op;
    std::function<void(osd_num_t)> repeer_pgs;
-    void handle_peer_epoll(int peer_fd, int epoll_events);
    void read_requests();
    void send_replies();
    void accept_connections(int listen_fd);
@ -296,6 +144,7 @@ public:
 protected:
    void try_connect_peer(uint64_t osd_num);
    void try_connect_peer_addr(osd_num_t peer_osd, const char *peer_host, int peer_port);
+    void handle_peer_epoll(int peer_fd, int epoll_events);
    void handle_connect_epoll(int peer_fd);
    void on_connect_peer(osd_num_t peer_osd, int peer_fd);
    void check_peer_config(osd_client_t *cl);
--- a/src/mock/build.sh
+++ b/src/mock/build.sh
@ -0,0 +1 @@
+g++ -D__MOCK__ -fsanitize=address -g -Wno-pointer-arith pg_states.cpp osd_ops.cpp test_cluster_client.cpp cluster_client.cpp msgr_op.cpp msgr_stop.cpp mock/messenger.cpp etcd_state_client.cpp timerfd_manager.cpp ../json11/json11.cpp -I mock -I . -I ..; ./a.out
--- a/src/mock/messenger.cpp
+++ b/src/mock/messenger.cpp
@ -0,0 +1,44 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
+
+#include <unistd.h>
+#include <stdexcept>
+#include <assert.h>
+
+#include "messenger.h"
+
+void osd_messenger_t::init()
+{
+}
+
+osd_messenger_t::~osd_messenger_t()
+{
+    while (clients.size() > 0)
+    {
+        stop_client(clients.begin()->first, true);
+    }
+}
+
+void osd_messenger_t::outbox_push(osd_op_t *cur_op)
+{
+    clients[cur_op->peer_fd]->sent_ops[cur_op->req.hdr.id] = cur_op;
+}
+
+void osd_messenger_t::parse_config(const json11::Json & config)
+{
+}
+
+void osd_messenger_t::connect_peer(uint64_t peer_osd, json11::Json peer_state)
+{
+    wanted_peers[peer_osd] = (osd_wanted_peer_t){
+        .port = 1,
+    };
+}
+
+void osd_messenger_t::read_requests()
+{
+}
+
+void osd_messenger_t::send_replies()
+{
+}
--- a/src/mock/ringloop.h
+++ b/src/mock/ringloop.h
@ -0,0 +1,25 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
+
+#pragma once
+
+#include <functional>
+
+struct ring_consumer_t
+{
+    std::function<void(void)> loop;
+};
+
+class ring_loop_t
+{
+public:
+    void register_consumer(ring_consumer_t *consumer)
+    {
+    }
+    void unregister_consumer(ring_consumer_t *consumer)
+    {
+    }
+    void submit()
+    {
+    }
+};
--- a/src/msgr_op.cpp
+++ b/src/msgr_op.cpp
@ -0,0 +1,22 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
+
+#include <assert.h>
+
+#include "msgr_op.h"
+
+osd_op_t::~osd_op_t()
+{
+    assert(!bs_op);
+    assert(!op_data);
+    if (rmw_buf)
+    {
+        free(rmw_buf);
+    }
+    if (buf)
+    {
+        // Note: reusing osd_op_t WILL currently lead to memory leaks
+        // So we don't reuse it, but free it every time
+        free(buf);
+    }
+}
--- a/src/msgr_op.h
+++ b/src/msgr_op.h
@ -0,0 +1,175 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
+
+#pragma once
+
+#include <sys/uio.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+
+#include "osd_ops.h"
+
+#define OSD_OP_IN 0
+#define OSD_OP_OUT 1
+
+#define OSD_OP_INLINE_BUF_COUNT 16
+
+// Kind of a vector with small-list-optimisation
+struct osd_op_buf_list_t
+{
+    int count = 0, alloc = OSD_OP_INLINE_BUF_COUNT, done = 0;
+    iovec *buf = NULL;
+    iovec inline_buf[OSD_OP_INLINE_BUF_COUNT];
+
+    inline osd_op_buf_list_t()
+    {
+        buf = inline_buf;
+    }
+
+    inline osd_op_buf_list_t(const osd_op_buf_list_t & other)
+    {
+        buf = inline_buf;
+        append(other);
+    }
+
+    inline osd_op_buf_list_t & operator = (const osd_op_buf_list_t & other)
+    {
+        reset();
+        append(other);
+        return *this;
+    }
+
+    inline ~osd_op_buf_list_t()
+    {
+        if (buf && buf != inline_buf)
+        {
+            free(buf);
+        }
+    }
+
+    inline void reset()
+    {
+        count = 0;
+        done = 0;
+    }
+
+    inline iovec* get_iovec()
+    {
+        return buf + done;
+    }
+
+    inline int get_size()
+    {
+        return count - done;
+    }
+
+    inline void append(const osd_op_buf_list_t & other)
+    {
+        if (count+other.count > alloc)
+        {
+            if (buf == inline_buf)
+            {
+                int old = alloc;
+                alloc = (((count+other.count+15)/16)*16);
+                buf = (iovec*)malloc(sizeof(iovec) * alloc);
+                if (!buf)
+                {
+                    printf("Failed to allocate %zu bytes\n", sizeof(iovec) * alloc);
+                    exit(1);
+                }
+                memcpy(buf, inline_buf, sizeof(iovec) * old);
+            }
+            else
+            {
+                alloc = (((count+other.count+15)/16)*16);
+                buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
+                if (!buf)
+                {
+                    printf("Failed to allocate %zu bytes\n", sizeof(iovec) * alloc);
+                    exit(1);
+                }
+            }
+        }
+        for (int i = 0; i < other.count; i++)
+        {
+            buf[count++] = other.buf[i];
+        }
+    }
+
+    inline void push_back(void *nbuf, size_t len)
+    {
+        if (count >= alloc)
+        {
+            if (buf == inline_buf)
+            {
+                int old = alloc;
+                alloc = ((alloc/16) + 1)*16;
+                buf = (iovec*)malloc(sizeof(iovec) * alloc);
+                if (!buf)
+                {
+                    printf("Failed to allocate %zu bytes\n", sizeof(iovec) * alloc);
+                    exit(1);
+                }
+                memcpy(buf, inline_buf, sizeof(iovec)*old);
+            }
+            else
+            {
+                alloc = alloc < 16 ? 16 : (alloc+16);
+                buf = (iovec*)realloc(buf, sizeof(iovec) * alloc);
+                if (!buf)
+                {
+                    printf("Failed to allocate %zu bytes\n", sizeof(iovec) * alloc);
+                    exit(1);
+                }
+            }
+        }
+        buf[count++] = { .iov_base = nbuf, .iov_len = len };
+    }
+
+    inline void eat(int result)
+    {
+        while (result > 0 && done < count)
+        {
+            iovec & iov = buf[done];
+            if (iov.iov_len <= result)
+            {
+                result -= iov.iov_len;
+                done++;
+            }
+            else
+            {
+                iov.iov_len -= result;
+                iov.iov_base += result;
+                break;
+            }
+        }
+    }
+};
+
+struct blockstore_op_t;
+
+struct osd_primary_op_data_t;
+
+struct osd_op_t
+{
+    timespec tv_begin = { 0 }, tv_end = { 0 };
+    uint64_t op_type = OSD_OP_IN;
+    int peer_fd;
+    osd_any_op_t req;
+    osd_any_reply_t reply;
+    blockstore_op_t *bs_op = NULL;
+    void *buf = NULL;
+    // bitmap, bitmap_len, bmp_data are only meaningful for reads
+    void *bitmap = NULL;
+    unsigned bitmap_len = 0;
+    unsigned bmp_data = 0;
+    void *rmw_buf = NULL;
+    osd_primary_op_data_t* op_data = NULL;
+    std::function<void(osd_op_t*)> callback;
+
+    osd_op_buf_list_t iov;
+
+    ~osd_op_t();
+};
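Review note: the `eat()` method in `osd_op_buf_list_t` is what lets a partial `readv()`/`writev()` resume where it stopped. Below is a minimal standalone sketch of that idea. `iov_cursor_t` is a hypothetical stand-in built on `std::vector`, not the type from this diff, and it uses a `char*` cast instead of arithmetic on `void*`.

```cpp
#include <sys/uio.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for osd_op_buf_list_t (illustration only):
// `done` counts fully consumed buffers, the first unfinished one is trimmed.
struct iov_cursor_t
{
    std::vector<iovec> bufs;
    size_t done = 0;

    void push_back(void *base, size_t len)
    {
        iovec v;
        v.iov_base = base;
        v.iov_len = len;
        bufs.push_back(v);
    }

    // Consume `result` bytes, e.g. the return value of readv()/writev()
    void eat(size_t result)
    {
        while (result > 0 && done < bufs.size())
        {
            iovec & iov = bufs[done];
            if (iov.iov_len <= result)
            {
                // Whole buffer transferred: skip it
                result -= iov.iov_len;
                done++;
            }
            else
            {
                // Partially transferred: shrink it from the front
                iov.iov_len -= result;
                iov.iov_base = static_cast<char*>(iov.iov_base) + result;
                break;
            }
        }
    }

    size_t remaining_count() const
    {
        return bufs.size() - done;
    }
};
```

With a 4-byte and an 8-byte buffer queued, `eat(6)` skips the first buffer and leaves the second one pointing 2 bytes in with 6 bytes left.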
--- a/src/msgr_receive.cpp
+++ b/src/msgr_receive.cpp
@ -202,24 +202,45 @@ void osd_messenger_t::handle_op_hdr(osd_client_t *cl)
    osd_op_t *cur_op = cl->read_op;
    if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ)
    {
-        if (cur_op->req.sec_rw.len > 0)
-            cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_rw.len);
        cl->read_remaining = 0;
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
        cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
    {
+        if (cur_op->req.sec_rw.attr_len > 0)
+        {
+            if (cur_op->req.sec_rw.attr_len > sizeof(unsigned))
+                cur_op->bitmap = cur_op->rmw_buf = malloc_or_die(cur_op->req.sec_rw.attr_len);
+            else
+                cur_op->bitmap = &cur_op->bmp_data;
+            cl->recv_list.push_back(cur_op->bitmap, cur_op->req.sec_rw.attr_len);
+        }
        if (cur_op->req.sec_rw.len > 0)
+        {
            cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_rw.len);
-        cl->read_remaining = cur_op->req.sec_rw.len;
+            cl->recv_list.push_back(cur_op->buf, cur_op->req.sec_rw.len);
+        }
+        cl->read_remaining = cur_op->req.sec_rw.len + cur_op->req.sec_rw.attr_len;
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_SEC_STABILIZE ||
        cur_op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK)
    {
        if (cur_op->req.sec_stab.len > 0)
+        {
            cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_stab.len);
+            cl->recv_list.push_back(cur_op->buf, cur_op->req.sec_stab.len);
+        }
        cl->read_remaining = cur_op->req.sec_stab.len;
    }
+    else if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
+    {
+        if (cur_op->req.sec_read_bmp.len > 0)
+        {
+            cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_read_bmp.len);
+            cl->recv_list.push_back(cur_op->buf, cur_op->req.sec_read_bmp.len);
+        }
+        cl->read_remaining = cur_op->req.sec_read_bmp.len;
+    }
    else if (cur_op->req.hdr.opcode == OSD_OP_READ)
    {
        cl->read_remaining = 0;
@ -227,13 +248,15 @@ void osd_messenger_t::handle_op_hdr(osd_client_t *cl)
    else if (cur_op->req.hdr.opcode == OSD_OP_WRITE)
    {
        if (cur_op->req.rw.len > 0)
+        {
            cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.rw.len);
+            cl->recv_list.push_back(cur_op->buf, cur_op->req.rw.len);
+        }
        cl->read_remaining = cur_op->req.rw.len;
    }
    if (cl->read_remaining > 0)
    {
        // Read data
-        cl->recv_list.push_back(cur_op->buf, cl->read_remaining);
        cl->read_state = CL_READ_DATA;
    }
    else
@ -259,24 +282,38 @@ bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
    osd_op_t *op = req_it->second;
    memcpy(op->reply.buf, cl->read_op->req.buf, OSD_PACKET_SIZE);
    cl->sent_ops.erase(req_it);
-    if ((op->reply.hdr.opcode == OSD_OP_SEC_READ || op->reply.hdr.opcode == OSD_OP_READ) &&
-        op->reply.hdr.retval > 0)
+    if (op->reply.hdr.opcode == OSD_OP_SEC_READ || op->reply.hdr.opcode == OSD_OP_READ)
    {
        // Read data. In this case we assume that the buffer is preallocated by the caller (!)
-        assert(op->iov.count > 0);
-        if (op->reply.hdr.retval != (op->reply.hdr.opcode == OSD_OP_SEC_READ ? op->req.sec_rw.len : op->req.rw.len))
+        unsigned bmp_len = (op->reply.hdr.opcode == OSD_OP_SEC_READ ? op->reply.sec_rw.attr_len : op->reply.rw.bitmap_len);
+        unsigned expected_size = (op->reply.hdr.opcode == OSD_OP_SEC_READ ? op->req.sec_rw.len : op->req.rw.len);
+        if (op->reply.hdr.retval >= 0 && (op->reply.hdr.retval != expected_size || bmp_len > op->bitmap_len))
        {
            // Check reply length to not overflow the buffer
-            printf("Client %d read reply of different length\n", cl->peer_fd);
+            printf("Client %d read reply of different length: expected %u+%u, got %ld+%u\n",
+                cl->peer_fd, expected_size, op->bitmap_len, op->reply.hdr.retval, bmp_len);
            cl->sent_ops[op->req.hdr.id] = op;
            stop_client(cl->peer_fd);
            return false;
        }
+        if (op->reply.hdr.retval >= 0 && bmp_len > 0)
+        {
+            assert(op->bitmap);
+            cl->recv_list.push_back(op->bitmap, bmp_len);
+        }
+        if (op->reply.hdr.retval > 0)
+        {
+            assert(op->iov.count > 0);
            cl->recv_list.append(op->iov);
+        }
+        cl->read_remaining = op->reply.hdr.retval + bmp_len;
+        if (cl->read_remaining == 0)
+        {
+            goto reuse;
+        }
        delete cl->read_op;
        cl->read_op = op;
        cl->read_state = CL_READ_REPLY_DATA;
-        cl->read_remaining = op->reply.hdr.retval;
    }
    else if (op->reply.hdr.opcode == OSD_OP_SEC_LIST && op->reply.hdr.retval > 0)
    {
@ -288,6 +325,17 @@ bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
        op->buf = memalign_or_die(MEM_ALIGNMENT, cl->read_remaining);
        cl->recv_list.push_back(op->buf, cl->read_remaining);
    }
+    else if (op->reply.hdr.opcode == OSD_OP_SEC_READ_BMP && op->reply.hdr.retval > 0)
+    {
+        assert(!op->iov.count);
+        delete cl->read_op;
+        cl->read_op = op;
+        cl->read_state = CL_READ_REPLY_DATA;
+        cl->read_remaining = op->reply.hdr.retval;
+        free(op->buf);
+        op->buf = memalign_or_die(MEM_ALIGNMENT, cl->read_remaining);
+        cl->recv_list.push_back(op->buf, cl->read_remaining);
+    }
    else if (op->reply.hdr.opcode == OSD_OP_SHOW_CONFIG && op->reply.hdr.retval > 0)
    {
        assert(!op->iov.count);
@ -300,6 +348,7 @@ bool osd_messenger_t::handle_reply_hdr(osd_client_t *cl)
    }
    else
    {
+reuse:
        // It's fine to reuse cl->read_op for the next reply
        handle_reply_ready(op);
        cl->recv_list.push_back(cl->read_op->req.buf, OSD_PACKET_SIZE);
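Review note: for `OSD_OP_SEC_WRITE` the receive path now expects the bitmap first and the payload second, and keeps small bitmaps inline in `bmp_data`. The following is a rough sketch of that decision only. `mini_op_t` and `plan_write_receive()` are hypothetical names, and plain `malloc` stands in for `malloc_or_die`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical miniature of the bitmap-related fields of osd_op_t
struct mini_op_t
{
    void *bitmap = nullptr;
    unsigned bmp_data = 0;     // inline storage for small bitmaps
    void *rmw_buf = nullptr;   // owns the heap allocation, if any
    ~mini_op_t() { free(rmw_buf); }
};

// Point op.bitmap at inline storage when attr_len fits, else allocate.
// Returns the number of bytes still to be read: payload plus bitmap.
uint64_t plan_write_receive(mini_op_t & op, uint32_t len, uint32_t attr_len)
{
    if (attr_len > 0)
    {
        if (attr_len > sizeof(unsigned))
        {
            op.bitmap = op.rmw_buf = malloc(attr_len);
            if (!op.bitmap)
                abort();
        }
        else
        {
            op.bitmap = &op.bmp_data;
        }
    }
    return (uint64_t)len + attr_len;
}
```

Storing the heap pointer in `rmw_buf` as well means the destructor frees it without needing to know whether the bitmap was inline.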
--- a/src/msgr_send.cpp
+++ b/src/msgr_send.cpp
@ -47,6 +47,27 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
        cl->sent_ops[cur_op->req.hdr.id] = cur_op;
    }
    to_outbox.push_back(NULL);
+    // Bitmap
+    if (cur_op->op_type == OSD_OP_IN &&
+        cur_op->req.hdr.opcode == OSD_OP_SEC_READ &&
+        cur_op->reply.sec_rw.attr_len > 0)
+    {
+        to_send_list.push_back((iovec){
+            .iov_base = cur_op->bitmap,
+            .iov_len = cur_op->reply.sec_rw.attr_len,
+        });
+        to_outbox.push_back(NULL);
+    }
+    else if (cur_op->op_type == OSD_OP_OUT &&
+        (cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE || cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) &&
+        cur_op->req.sec_rw.attr_len > 0)
+    {
+        to_send_list.push_back((iovec){
+            .iov_base = cur_op->bitmap,
+            .iov_len = cur_op->req.sec_rw.attr_len,
+        });
+        to_outbox.push_back(NULL);
+    }
    // Operation data
    if ((cur_op->op_type == OSD_OP_IN
        ? (cur_op->req.hdr.opcode == OSD_OP_READ ||
@ -66,6 +87,14 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
            to_outbox.push_back(NULL);
        }
    }
+    if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
+    {
+        if (cur_op->op_type == OSD_OP_IN && cur_op->reply.hdr.retval > 0)
+            to_send_list.push_back((iovec){ .iov_base = cur_op->buf, .iov_len = (size_t)cur_op->reply.hdr.retval });
+        else if (cur_op->op_type == OSD_OP_OUT && cur_op->req.sec_read_bmp.len > 0)
+            to_send_list.push_back((iovec){ .iov_base = cur_op->buf, .iov_len = (size_t)cur_op->req.sec_read_bmp.len });
+        to_outbox.push_back(NULL);
+    }
    if (cur_op->op_type == OSD_OP_IN)
    {
        // To free it later
@ -97,8 +126,10 @@ void osd_messenger_t::measure_exec(osd_op_t *cur_op)
    {
        return;
    }
-    timespec tv_end;
-    clock_gettime(CLOCK_REALTIME, &tv_end);
+    if (!cur_op->tv_end.tv_sec)
+    {
+        clock_gettime(CLOCK_REALTIME, &cur_op->tv_end);
+    }
    stats.op_stat_count[cur_op->req.hdr.opcode]++;
    if (!stats.op_stat_count[cur_op->req.hdr.opcode])
    {
@ -107,8 +138,8 @@ void osd_messenger_t::measure_exec(osd_op_t *cur_op)
        stats.op_stat_bytes[cur_op->req.hdr.opcode] = 0;
    }
    stats.op_stat_sum[cur_op->req.hdr.opcode] += (
-        (tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
-        (tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
+        (cur_op->tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
+        (cur_op->tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
    );
    if (cur_op->req.hdr.opcode == OSD_OP_READ ||
        cur_op->req.hdr.opcode == OSD_OP_WRITE)
@ -180,7 +211,7 @@ void osd_messenger_t::handle_send(int result, osd_client_t *cl)
    cl->refs--;
    if (cl->peer_state == PEER_STOPPED)
    {
-        if (!cl->refs)
+        if (cl->refs <= 0)
        {
            delete cl;
        }
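Review note: `measure_exec()` accumulates latency in microseconds from two `timespec` values. The same arithmetic is shown below as a standalone helper. `elapsed_usec()` is a hypothetical name used here for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>

// Microseconds between two timestamps. A negative nanosecond difference
// is compensated by the seconds term, so no explicit borrow is needed.
int64_t elapsed_usec(const timespec & begin, const timespec & end)
{
    return (int64_t)(end.tv_sec - begin.tv_sec)*1000000 +
        (end.tv_nsec - begin.tv_nsec)/1000;
}
```

For example, from 1.9 s to 2.1 s the seconds term contributes 1000000 and the nanoseconds term contributes -800000, giving 200000 microseconds.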
--- a/src/msgr_stop.cpp
+++ b/src/msgr_stop.cpp
@ -0,0 +1,137 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
+
+#include <unistd.h>
+#include <assert.h>
+
+#include "messenger.h"
+
+void osd_messenger_t::cancel_osd_ops(osd_client_t *cl)
+{
+    std::vector<osd_op_t*> cancel_ops;
+    cancel_ops.resize(cl->sent_ops.size());
+    int i = 0;
+    for (auto p: cl->sent_ops)
+    {
+        cancel_ops[i++] = p.second;
+    }
+    cl->sent_ops.clear();
+    cl->outbox.clear();
+    for (auto op: cancel_ops)
+    {
+        cancel_op(op);
+    }
+}
+
+void osd_messenger_t::cancel_op(osd_op_t *op)
+{
+    if (op->op_type == OSD_OP_OUT)
+    {
+        op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
+        op->reply.hdr.id = op->req.hdr.id;
+        op->reply.hdr.opcode = op->req.hdr.opcode;
+        op->reply.hdr.retval = -EPIPE;
+        // Copy lambda to be unaffected by `delete op`
+        std::function<void(osd_op_t*)>(op->callback)(op);
+    }
+    else
+    {
+        // This function is only called in stop_client(), so it's fine to destroy the operation
+        delete op;
+    }
+}
+
+void osd_messenger_t::stop_client(int peer_fd, bool force)
+{
+    assert(peer_fd != 0);
+    auto it = clients.find(peer_fd);
+    if (it == clients.end())
+    {
+        return;
+    }
+    osd_client_t *cl = it->second;
+    if ((cl->peer_state == PEER_CONNECTING && !force) || cl->peer_state == PEER_STOPPED)
+    {
+        return;
+    }
+    if (log_level > 0)
+    {
+        if (cl->osd_num)
+        {
+            printf("[OSD %lu] Stopping client %d (OSD peer %lu)\n", osd_num, peer_fd, cl->osd_num);
+        }
+        else
+        {
+            printf("[OSD %lu] Stopping client %d (regular client)\n", osd_num, peer_fd);
+        }
+    }
+    // First set state to STOPPED so another stop_client() call doesn't try to free it again
+    cl->refs++;
+    cl->peer_state = PEER_STOPPED;
+    if (cl->osd_num)
+    {
+        // ...and forget OSD peer
+        osd_peer_fds.erase(cl->osd_num);
+    }
+#ifndef __MOCK__
+    // Then remove FD from the eventloop so we don't accidentally read something
+    tfd->set_fd_handler(peer_fd, false, NULL);
+    if (cl->connect_timeout_id >= 0)
+    {
+        tfd->clear_timer(cl->connect_timeout_id);
+        cl->connect_timeout_id = -1;
+    }
+    for (auto rit = read_ready_clients.begin(); rit != read_ready_clients.end(); rit++)
+    {
+        if (*rit == peer_fd)
+        {
+            read_ready_clients.erase(rit);
+            break;
+        }
+    }
+    for (auto wit = write_ready_clients.begin(); wit != write_ready_clients.end(); wit++)
+    {
+        if (*wit == peer_fd)
+        {
+            write_ready_clients.erase(wit);
+            break;
+        }
+    }
+#endif
+    if (cl->osd_num)
+    {
+        // Then repeer PGs because cancel_op() callbacks can try to perform
+        // some actions and we need correct PG states to not do something silly
+        repeer_pgs(cl->osd_num);
+    }
+    // Then cancel all operations
+    if (cl->read_op)
+    {
+        if (!cl->read_op->callback)
+        {
+            delete cl->read_op;
+        }
+        cl->read_op = NULL;
+    }
+    if (cl->osd_num)
+    {
+        // Cancel outbound operations
+        cancel_osd_ops(cl);
+    }
+#ifndef __MOCK__
+    // And close the FD only when everything is done
+    // ...because peer_fd number can get reused after close()
+    close(peer_fd);
+#endif
+    // Find the item again because it can be invalidated at this point
+    it = clients.find(peer_fd);
+    if (it != clients.end())
+    {
+        clients.erase(it);
+    }
+    cl->refs--;
+    if (cl->refs <= 0)
+    {
+        delete cl;
+    }
+}
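Review note: `cancel_op()` invokes a copy of `op->callback` because the callback may `delete op`, which would otherwise destroy the `std::function` while it is still running. Here is a minimal sketch of that pattern. `cb_op_t` and `complete_with_error()` are hypothetical names.

```cpp
#include <cassert>
#include <functional>

// Hypothetical miniature of an operation that carries its own callback
struct cb_op_t
{
    int retval = 0;
    std::function<void(cb_op_t*)> callback;
};

// Invoke through a temporary copy so the callable (and its captures)
// stays alive even if the callback deletes `op`
void complete_with_error(cb_op_t *op, int err)
{
    op->retval = err;
    std::function<void(cb_op_t*)>(op->callback)(op);
}
```

The copy costs one `std::function` copy per cancelled operation, which seems acceptable on an error path.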
--- a/src/object_id.h
+++ b/src/object_id.h
@ -6,12 +6,14 @@
 #include <stdint.h>
 #include <functional>

+typedef uint64_t inode_t;
+
 // 16 bytes per object/stripe id
 // stripe = (start of the parity stripe + peer role)
 // i.e. for example (256KB + one of 0,1,2)
 struct __attribute__((__packed__)) object_id
 {
-    uint64_t inode;
+    inode_t inode;
    uint64_t stripe;
 };

--- a/src/osd.cpp
+++ b/src/osd.cpp
@ -8,22 +8,34 @@
 #include <arpa/inet.h>

 #include "osd.h"
+#include "http_client.h"

-osd_t::osd_t(blockstore_config_t & config, blockstore_t *bs, ring_loop_t *ringloop)
+osd_t::osd_t(blockstore_config_t & config, ring_loop_t *ringloop)
 {
+    bs_block_size = strtoull(config["block_size"].c_str(), NULL, 10);
+    bs_bitmap_granularity = strtoull(config["bitmap_granularity"].c_str(), NULL, 10);
+    if (!bs_block_size)
+        bs_block_size = DEFAULT_BLOCK_SIZE;
+    if (!bs_bitmap_granularity)
+        bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
+    clean_entry_bitmap_size = bs_block_size / bs_bitmap_granularity / 8;
+
+    zero_buffer_size = 1<<20;
+    zero_buffer = malloc_or_die(zero_buffer_size);
+    memset(zero_buffer, 0, zero_buffer_size);
+
    this->config = config;
-    this->bs = bs;
    this->ringloop = ringloop;

-    this->bs_block_size = bs->get_block_size();
-    // FIXME: use bitmap granularity instead
-    this->bs_disk_alignment = bs->get_disk_alignment();
+    epmgr = new epoll_manager_t(ringloop);
+    // FIXME: Use timerfd_interval based directly on io_uring
+    this->tfd = epmgr->tfd;
+
+    // FIXME: Create Blockstore from on-disk superblock config and check it against the OSD cluster config
+    this->bs = new blockstore_t(config, ringloop, tfd);

    parse_config(config);

-    epmgr = new epoll_manager_t(ringloop);
-    this->tfd = epmgr->tfd;
-
    this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id)
    {
        print_stats();
@ -49,7 +61,9 @@ osd_t::~osd_t()
 {
    ringloop->unregister_consumer(&consumer);
    delete epmgr;
+    delete bs;
    close(listen_fd);
+    free(zero_buffer);
 }

 void osd_t::parse_config(blockstore_config_t & config)
@ -171,7 +185,7 @@ bool osd_t::shutdown()
    {
        return false;
    }
-    return bs->is_safe_to_stop();
+    return !bs || bs->is_safe_to_stop();
 }

 void osd_t::loop()
@ -191,6 +205,8 @@ void osd_t::exec_op(osd_op_t *cur_op)
        delete cur_op;
        return;
    }
+    // Clear the reply buffer
+    memset(cur_op->reply.buf, 0, OSD_PACKET_SIZE);
    inflight_ops++;
    if (cur_op->req.hdr.magic != SECONDARY_OSD_OP_MAGIC ||
        cur_op->req.hdr.opcode < OSD_OP_MIN || cur_op->req.hdr.opcode > OSD_OP_MAX ||
@ -198,14 +214,14 @@ void osd_t::exec_op(osd_op_t *cur_op)
            cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
            cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) &&
            (cur_op->req.sec_rw.len > OSD_RW_MAX ||
-            cur_op->req.sec_rw.len % bs_disk_alignment ||
-            cur_op->req.sec_rw.offset % bs_disk_alignment)) ||
+            cur_op->req.sec_rw.len % bs_bitmap_granularity ||
+            cur_op->req.sec_rw.offset % bs_bitmap_granularity)) ||
        ((cur_op->req.hdr.opcode == OSD_OP_READ ||
            cur_op->req.hdr.opcode == OSD_OP_WRITE ||
            cur_op->req.hdr.opcode == OSD_OP_DELETE) &&
            (cur_op->req.rw.len > OSD_RW_MAX ||
-            cur_op->req.rw.len % bs_disk_alignment ||
-            cur_op->req.rw.offset % bs_disk_alignment)))
+            cur_op->req.rw.len % bs_bitmap_granularity ||
+            cur_op->req.rw.offset % bs_bitmap_granularity)))
    {
        // Bad command
        finish_op(cur_op, -EINVAL);
@ -221,6 +237,7 @@ void osd_t::exec_op(osd_op_t *cur_op)
        cur_op->req.hdr.opcode != OSD_OP_SEC_READ &&
        cur_op->req.hdr.opcode != OSD_OP_SEC_LIST &&
        cur_op->req.hdr.opcode != OSD_OP_READ &&
+        cur_op->req.hdr.opcode != OSD_OP_SEC_READ_BMP &&
        cur_op->req.hdr.opcode != OSD_OP_SHOW_CONFIG)
    {
        // Readonly mode
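Review note: the constructor derives `clean_entry_bitmap_size` as one bit per `bitmap_granularity` bytes of a block, and `exec_op()` now checks offsets and lengths against that granularity. A small sketch of both computations follows. The helper names are hypothetical, and the 128 KiB / 4 KiB figures in the example are illustrative values only, not taken from this diff.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for a bitmap with one bit per `granularity` bytes of a block
uint32_t bitmap_bytes(uint32_t block_size, uint32_t granularity)
{
    assert(granularity > 0);
    return block_size / granularity / 8;
}

// Mirrors the request validation: both offset and length must be
// multiples of the bitmap granularity
bool is_aligned(uint64_t offset, uint64_t len, uint32_t granularity)
{
    assert(granularity > 0);
    return offset % granularity == 0 && len % granularity == 0;
}
```

With a 131072-byte block and 4096-byte granularity the bitmap takes 4 bytes, which fits in a single `unsigned`.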
--- a/src/osd.h
+++ b/src/osd.h
@ -55,6 +55,39 @@ struct osd_recovery_op_t
    osd_op_t *osd_op = NULL;
 };

+// Posted as /osd/inodestats/$osd, then accumulated by the monitor
+#define INODE_STATS_READ 0
+#define INODE_STATS_WRITE 1
+#define INODE_STATS_DELETE 2
+struct inode_stats_t
+{
+    uint64_t op_sum[3] = { 0 };
+    uint64_t op_count[3] = { 0 };
+    uint64_t op_bytes[3] = { 0 };
+};
+
+struct bitmap_request_t
+{
+    osd_num_t osd_num;
+    object_id oid;
+    uint64_t version;
+    void *bmp_buf;
+};
+
+inline bool operator < (const bitmap_request_t & a, const bitmap_request_t & b)
+{
+    return a.osd_num < b.osd_num || (a.osd_num == b.osd_num && a.oid < b.oid);
+}
+
+struct osd_chain_read_t
+{
+    int chain_pos;
+    inode_t inode;
+    uint32_t offset, len;
+};
+
+struct osd_rmw_stripe_t;
+
 class osd_t
 {
    // config
@ -115,7 +148,9 @@ class osd_t
    bool stopping = false;
    int inflight_ops = 0;
    blockstore_t *bs;
-    uint32_t bs_block_size, bs_disk_alignment;
+    void *zero_buffer = NULL;
+    uint64_t zero_buffer_size = 0;
+    uint32_t bs_block_size, bs_bitmap_granularity, clean_entry_bitmap_size;
    ring_loop_t *ringloop;
    timerfd_manager_t *tfd = NULL;
    epoll_manager_t *epmgr = NULL;
@ -126,6 +161,7 @@ class osd_t

    // op statistics
    osd_op_stats_t prev_stats;
+    std::map<uint64_t, inode_stats_t> inode_stats;
    const char* recovery_stat_names[2] = { "degraded", "misplaced" };
    uint64_t recovery_stat_count[2][2] = { 0 };
    uint64_t recovery_stat_bytes[2][2] = { 0 };
@ -135,7 +171,7 @@ class osd_t
    void init_cluster();
    void on_change_osd_state_hook(osd_num_t peer_osd);
    void on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num);
-    void on_change_etcd_state_hook(json11::Json::object & changes);
+    void on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes);
    void on_load_config_hook(json11::Json::object & changes);
    json11::Json on_load_pgs_checks_hook();
    void on_load_pgs_hook(bool success);
@ -198,27 +234,41 @@ class osd_t
    void continue_primary_del(osd_op_t *cur_op);
    bool check_write_queue(osd_op_t *cur_op, pg_t & pg);
    void remove_object_from_state(object_id & oid, pg_osd_set_state_t *object_state, pg_t &pg);
+    void free_object_state(pg_t & pg, pg_osd_set_state_t **object_state);
    bool remember_unstable_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t & loc_set, int base_state);
    void handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op);
    void handle_primary_bs_subop(osd_op_t *subop);
    void add_bs_subop_stats(osd_op_t *subop);
    void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval);
-    void submit_primary_subops(int submit_type, uint64_t op_version, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op);
+
+    void submit_primary_subops(int submit_type, uint64_t op_version, const uint64_t* osd_set, osd_op_t *cur_op);
+    int submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t op_version,
+        osd_rmw_stripe_t *stripes, const uint64_t* osd_set, osd_op_t *cur_op, int subop_idx, int zero_read);
    void submit_primary_del_subops(osd_op_t *cur_op, uint64_t *cur_set, uint64_t set_size, pg_osd_set_t & loc_set);
    void submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_delete, int chunks_to_delete_count);
-    void submit_primary_sync_subops(osd_op_t *cur_op);
+    int submit_primary_sync_subops(osd_op_t *cur_op);
    void submit_primary_stab_subops(osd_op_t *cur_op);

+    uint64_t* get_object_osd_set(pg_t &pg, object_id &oid, uint64_t *def, pg_osd_set_state_t **object_state);
+
+    void continue_chained_read(osd_op_t *cur_op);
+    int submit_chained_read_requests(pg_t & pg, osd_op_t *cur_op);
+    void send_chained_read_results(pg_t & pg, osd_op_t *cur_op);
+    std::vector<osd_chain_read_t> collect_chained_read_requests(osd_op_t *cur_op);
+    int collect_bitmap_requests(osd_op_t *cur_op, pg_t & pg, std::vector<bitmap_request_t> & bitmap_requests);
+    int submit_bitmap_subops(osd_op_t *cur_op, pg_t & pg);
+    int read_bitmaps(osd_op_t *cur_op, pg_t & pg, int base_state);
+
    inline pg_num_t map_to_pg(object_id oid, uint64_t pg_stripe_size)
    {
        uint64_t pg_count = pg_counts[INODE_POOL(oid.inode)];
        if (!pg_count)
            pg_count = 1;
-        return (oid.inode + oid.stripe / pg_stripe_size) % pg_count + 1;
+        return (oid.stripe / pg_stripe_size) % pg_count + 1;
    }

 public:
-    osd_t(blockstore_config_t & config, blockstore_t *bs, ring_loop_t *ringloop);
+    osd_t(blockstore_config_t & config, ring_loop_t *ringloop);
    ~osd_t();
    void force_stop(int exitcode);
    bool shutdown();
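Review note: `map_to_pg()` no longer adds the inode number, so placement depends only on the stripe index within the pool. The new formula is shown below as a free function. The signature is a simplification that takes `pg_count` directly instead of looking it up by pool.

```cpp
#include <cassert>
#include <cstdint>

// PG numbers are 1-based; a pg_count of 0 is treated as 1
uint32_t map_to_pg(uint64_t stripe, uint64_t pg_stripe_size, uint64_t pg_count)
{
    assert(pg_stripe_size > 0);
    if (!pg_count)
        pg_count = 1;
    return (uint32_t)((stripe / pg_stripe_size) % pg_count + 1);
}
```

With a 4 MiB PG stripe and 16 PGs, stripe index 17 wraps around to PG 2.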
--- a/src/osd_cluster.cpp
+++ b/src/osd_cluster.cpp
@ -4,6 +4,7 @@
 #include "osd.h"
 #include "base64.h"
 #include "etcd_state_client.h"
+#include "http_client.h"
 #include "osd_rmw.h"

 // Startup sequence:
@ -64,7 +65,7 @@ void osd_t::init_cluster()
        st_cli.log_level = log_level;
        st_cli.on_change_osd_state_hook = [this](osd_num_t peer_osd) { on_change_osd_state_hook(peer_osd); };
        st_cli.on_change_pg_history_hook = [this](pool_id_t pool_id, pg_num_t pg_num) { on_change_pg_history_hook(pool_id, pg_num); };
-        st_cli.on_change_hook = [this](json11::Json::object & changes) { on_change_etcd_state_hook(changes); };
+        st_cli.on_change_hook = [this](std::map<std::string, etcd_kv_t> & changes) { on_change_etcd_state_hook(changes); };
        st_cli.on_load_config_hook = [this](json11::Json::object & cfg) { on_load_config_hook(cfg); };
        st_cli.load_pgs_checks_hook = [this]() { return on_load_pgs_checks_hook(); };
        st_cli.on_load_pgs_hook = [this](bool success) { on_load_pgs_hook(success); };
@ -179,12 +180,80 @@ void osd_t::report_statistics()
        return;
    }
    etcd_reporting_stats = true;
-    json11::Json::array txn = { json11::Json::object {
+    // Report space usage statistics as a whole
+    // Maybe we'll report it using deltas if we tune for a lot of inodes at some point
+    json11::Json::object inode_space;
+    json11::Json::object last_stat;
+    pool_id_t last_pool = 0;
+    for (auto kv: bs->get_inode_space_stats())
+    {
+        pool_id_t pool_id = INODE_POOL(kv.first);
+        uint64_t only_inode_num = (kv.first & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1));
+        if (!last_pool || pool_id != last_pool)
+        {
+            if (last_pool)
+                inode_space[std::to_string(last_pool)] = last_stat;
+            last_stat = json11::Json::object();
+            last_pool = pool_id;
+        }
+        last_stat[std::to_string(only_inode_num)] = kv.second;
+    }
+    if (last_pool)
+        inode_space[std::to_string(last_pool)] = last_stat;
+    last_stat = json11::Json::object();
+    last_pool = 0;
+    json11::Json::object inode_ops;
+    for (auto kv: inode_stats)
+    {
+        pool_id_t pool_id = INODE_POOL(kv.first);
+        uint64_t only_inode_num = (kv.first & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1));
+        if (!last_pool || pool_id != last_pool)
+        {
+            if (last_pool)
+                inode_ops[std::to_string(last_pool)] = last_stat;
+            last_stat = json11::Json::object();
+            last_pool = pool_id;
+        }
+        last_stat[std::to_string(only_inode_num)] = json11::Json::object {
+            { "read", json11::Json::object {
+                { "count", kv.second.op_count[INODE_STATS_READ] },
+                { "usec", kv.second.op_sum[INODE_STATS_READ] },
+                { "bytes", kv.second.op_bytes[INODE_STATS_READ] },
+            } },
+            { "write", json11::Json::object {
+                { "count", kv.second.op_count[INODE_STATS_WRITE] },
+                { "usec", kv.second.op_sum[INODE_STATS_WRITE] },
+                { "bytes", kv.second.op_bytes[INODE_STATS_WRITE] },
+            } },
+            { "delete", json11::Json::object {
+                { "count", kv.second.op_count[INODE_STATS_DELETE] },
+                { "usec", kv.second.op_sum[INODE_STATS_DELETE] },
+                { "bytes", kv.second.op_bytes[INODE_STATS_DELETE] },
+            } },
+        };
+    }
+    if (last_pool)
+        inode_ops[std::to_string(last_pool)] = last_stat;
+    json11::Json::array txn = {
+        json11::Json::object {
            { "request_put", json11::Json::object {
                { "key", base64_encode(st_cli.etcd_prefix+"/osd/stats/"+std::to_string(osd_num)) },
                { "value", base64_encode(get_statistics().dump()) },
-        } }
-    } };
+            } },
+        },
+        json11::Json::object {
+            { "request_put", json11::Json::object {
+                { "key", base64_encode(st_cli.etcd_prefix+"/osd/space/"+std::to_string(osd_num)) },
+                { "value", base64_encode(json11::Json(inode_space).dump()) },
+            } },
+        },
+        json11::Json::object {
+            { "request_put", json11::Json::object {
+                { "key", base64_encode(st_cli.etcd_prefix+"/osd/inodestats/"+std::to_string(osd_num)) },
+                { "value", base64_encode(json11::Json(inode_ops).dump()) },
+            } },
+        },
+    };
    for (auto & p: pgs)
    {
        auto & pg = p.second;
@ -235,7 +304,7 @@ void osd_t::on_change_osd_state_hook(osd_num_t peer_osd)
    }
 }

-void osd_t::on_change_etcd_state_hook(json11::Json::object & changes)
+void osd_t::on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes)
 {
    // FIXME apply config changes in runtime (maybe, some)
    if (run_primary)
@ -557,7 +626,7 @@ void osd_t::apply_pg_config()
                }
                if (currently_taken)
                {
-                    if (pg_it->second.state & (PG_ACTIVE | PG_INCOMPLETE | PG_PEERING))
+                    if (pg_it->second.state & (PG_ACTIVE | PG_INCOMPLETE | PG_PEERING | PG_REPEERING))
                    {
                        if (pg_it->second.target_set == pg_cfg.target_set)
                        {
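Review note: `report_statistics()` splits each 64-bit inode number into a pool ID and an in-pool inode number, then nests the stats by pool. Below is a simplified sketch of that grouping with plain maps instead of `json11`. It assumes the pool ID occupies the top 16 bits (`POOL_ID_BITS` is not defined in this diff), and unlike the loop in the diff it does not skip pool 0.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Assumption: the pool ID occupies the top 16 bits of the inode number
constexpr unsigned POOL_ID_BITS_ASSUMED = 16;

inline uint64_t pool_of(uint64_t inode)
{
    return inode >> (64 - POOL_ID_BITS_ASSUMED);
}

inline uint64_t inode_in_pool(uint64_t inode)
{
    return inode & ((UINT64_C(1) << (64 - POOL_ID_BITS_ASSUMED)) - 1);
}

typedef std::map<std::string, std::map<std::string, uint64_t>> pool_stats_t;

// Group per-inode values as { pool_id: { inode_num: value } }
pool_stats_t group_by_pool(const std::map<uint64_t, uint64_t> & space)
{
    pool_stats_t out;
    for (auto & kv: space)
    {
        out[std::to_string(pool_of(kv.first))][std::to_string(inode_in_pool(kv.first))] = kv.second;
    }
    return out;
}
```

Because `std::map` iterates keys in ascending order, inodes of the same pool are adjacent, which is what allows the diff to build each pool's object in a single pass.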
--- a/src/osd_flush.cpp
+++ b/src/osd_flush.cpp
@ -149,10 +149,14 @@ void osd_t::handle_flush_op(bool rollback, pool_id_t pool_id, pg_num_t pg_num, p
        {
            continue_primary_write(op);
        }
-        if (pg.inflight == 0 && (pg.state & PG_STOPPING))
+        if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch)
        {
            finish_stop_pg(pg);
        }
+        else if ((pg.state & PG_REPEERING) && pg.inflight == 0 && !pg.flush_batch)
+        {
+            start_pg_peering(pg);
+        }
    }
 }

@ -231,7 +235,8 @@ bool osd_t::pick_next_recovery(osd_recovery_op_t &op)
    {
        for (auto pg_it = pgs.begin(); pg_it != pgs.end(); pg_it++)
        {
-            if ((pg_it->second.state & (PG_ACTIVE | PG_HAS_MISPLACED)) == (PG_ACTIVE | PG_HAS_MISPLACED))
+            // Don't try to "recover" misplaced objects if "recovery" would make them degraded
+            if ((pg_it->second.state & (PG_ACTIVE | PG_DEGRADED | PG_HAS_MISPLACED)) == (PG_ACTIVE | PG_HAS_MISPLACED))
            {
                for (auto obj_it = pg_it->second.misplaced_objects.begin(); obj_it != pg_it->second.misplaced_objects.end(); obj_it++)
                {
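Review note: the new condition in `pick_next_recovery` uses a mask-and-compare idiom. The mask selects three bits, and the comparison requires two of them to be set and the third (`PG_DEGRADED`) to be clear. A minimal sketch of that idiom follows; the bit values are hypothetical and chosen for illustration only, since the real `PG_*` constants are defined outside this diff and may differ.

```cpp
#include <cstdint>

// Hypothetical bit values for illustration; the real PG_* constants may differ.
constexpr uint32_t PG_ACTIVE        = 1u << 0;
constexpr uint32_t PG_DEGRADED      = 1u << 1;
constexpr uint32_t PG_HAS_MISPLACED = 1u << 2;

// True only when the PG is active, has misplaced objects and is NOT degraded.
// Including PG_DEGRADED in the mask but not in the expected value is what
// rejects degraded PGs.
bool may_recover_misplaced(uint32_t state)
{
    const uint32_t mask = PG_ACTIVE | PG_DEGRADED | PG_HAS_MISPLACED;
    return (state & mask) == (PG_ACTIVE | PG_HAS_MISPLACED);
}
```

With these values, `may_recover_misplaced(PG_ACTIVE | PG_HAS_MISPLACED)` is true, while adding `PG_DEGRADED` to the state makes it false.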
--- a/src/osd_main.cpp
+++ b/src/osd_main.cpp
@ -41,16 +41,13 @@ int main(int narg, char *args[])
    signal(SIGINT, handle_sigint);
    signal(SIGTERM, handle_sigint);
    ring_loop_t *ringloop = new ring_loop_t(512);
-    // FIXME: Create Blockstore from on-disk superblock config and check it against the OSD cluster config
-    blockstore_t *bs = new blockstore_t(config, ringloop);
-    osd = new osd_t(config, bs, ringloop);
+    osd = new osd_t(config, ringloop);
    while (1)
    {
        ringloop->loop();
        ringloop->wait();
    }
    delete osd;
-    delete bs;
    delete ringloop;
    return 0;
 }
--- a/src/osd_ops.cpp
+++ b/src/osd_ops.cpp
@ -20,4 +20,5 @@ const char* osd_op_names[] = {
    "primary_sync",
    "primary_delete",
    "ping",
+    "sec_read_bmp",
 };
--- a/src/osd_ops.h
+++ b/src/osd_ops.h
@ -28,12 +28,14 @@
 #define OSD_OP_SYNC                 13
 #define OSD_OP_DELETE               14
 #define OSD_OP_PING                 15
-#define OSD_OP_MAX                  15
+#define OSD_OP_SEC_READ_BMP         16
+#define OSD_OP_MAX                  16
 // Alignment & limit for read/write operations
 #ifndef MEM_ALIGNMENT
 #define MEM_ALIGNMENT               512
 #endif
 #define OSD_RW_MAX                  64*1024*1024
+#define OSD_PROTOCOL_VERSION        1

 // common request and reply headers
 struct __attribute__((__packed__)) osd_op_header_t
@ -59,7 +61,7 @@ struct __attribute__((__packed__)) osd_reply_header_t
 };

 // read or write to the secondary OSD
-struct __attribute__((__packed__)) osd_op_secondary_rw_t
+struct __attribute__((__packed__)) osd_op_sec_rw_t
 {
    osd_op_header_t header;
    // object
@ -71,17 +73,23 @@ struct __attribute__((__packed__)) osd_op_secondary_rw_t
    uint32_t offset;
    // length
    uint32_t len;
+    // bitmap/attribute length - bitmap comes after header, but before data
+    uint32_t attr_len;
+    uint32_t pad0;
 };

-struct __attribute__((__packed__)) osd_reply_secondary_rw_t
+struct __attribute__((__packed__)) osd_reply_sec_rw_t
 {
    osd_reply_header_t header;
    // for reads and writes: assigned or read version number
    uint64_t version;
+    // for reads: bitmap/attribute length (just to double-check)
+    uint32_t attr_len;
+    uint32_t pad0;
 };

 // delete object on the secondary OSD
-struct __attribute__((__packed__)) osd_op_secondary_del_t
+struct __attribute__((__packed__)) osd_op_sec_del_t
 {
    osd_op_header_t header;
    // object
@ -90,37 +98,51 @@ struct __attribute__((__packed__)) osd_op_secondary_del_t
    uint64_t version;
 };

-struct __attribute__((__packed__)) osd_reply_secondary_del_t
+struct __attribute__((__packed__)) osd_reply_sec_del_t
 {
    osd_reply_header_t header;
    uint64_t version;
 };

 // sync to the secondary OSD
-struct __attribute__((__packed__)) osd_op_secondary_sync_t
+struct __attribute__((__packed__)) osd_op_sec_sync_t
 {
    osd_op_header_t header;
 };

-struct __attribute__((__packed__)) osd_reply_secondary_sync_t
+struct __attribute__((__packed__)) osd_reply_sec_sync_t
 {
    osd_reply_header_t header;
 };

 // stabilize or rollback objects on the secondary OSD
-struct __attribute__((__packed__)) osd_op_secondary_stabilize_t
+struct __attribute__((__packed__)) osd_op_sec_stab_t
 {
    osd_op_header_t header;
    // obj_ver_id array length in bytes
    uint64_t len;
 };
-typedef osd_op_secondary_stabilize_t osd_op_secondary_rollback_t;
+typedef osd_op_sec_stab_t osd_op_sec_rollback_t;

-struct __attribute__((__packed__)) osd_reply_secondary_stabilize_t
+struct __attribute__((__packed__)) osd_reply_sec_stab_t
 {
    osd_reply_header_t header;
 };
-typedef osd_reply_secondary_stabilize_t osd_reply_secondary_rollback_t;
+typedef osd_reply_sec_stab_t osd_reply_sec_rollback_t;
+
+// bulk read bitmaps from a secondary OSD
+struct __attribute__((__packed__)) osd_op_sec_read_bmp_t
+{
+    osd_op_header_t header;
+    // obj_ver_id array length in bytes
+    uint64_t len;
+};
+
+struct __attribute__((__packed__)) osd_reply_sec_read_bmp_t
+{
+    // retval is payload length in bytes. payload is {version,bitmap}[]
+    osd_reply_header_t header;
+};

 // show configuration
 struct __attribute__((__packed__)) osd_op_show_config_t
@ -134,7 +156,7 @@ struct __attribute__((__packed__)) osd_reply_show_config_t
 };

 // list objects on replica
-struct __attribute__((__packed__)) osd_op_secondary_list_t
+struct __attribute__((__packed__)) osd_op_sec_list_t
 {
    osd_op_header_t header;
    // placement group total number and total count
@ -145,7 +167,7 @@ struct __attribute__((__packed__)) osd_op_secondary_list_t
    uint64_t min_inode, max_inode;
 };

-struct __attribute__((__packed__)) osd_reply_secondary_list_t
+struct __attribute__((__packed__)) osd_reply_sec_list_t
 {
    osd_reply_header_t header;
    // stable object version count. header.retval = total object version count
@ -154,7 +176,6 @@ struct __attribute__((__packed__)) osd_reply_secondary_list_t
 };

 // read or write to the primary OSD (must be within individual stripe)
-// FIXME: allow to return used block bitmap (required for snapshots)
 struct __attribute__((__packed__)) osd_op_rw_t
 {
    osd_op_header_t header;
@ -164,11 +185,18 @@ struct __attribute__((__packed__)) osd_op_rw_t
    uint64_t offset;
    // length
    uint32_t len;
+    // flags (reserved for future use)
+    uint32_t flags;
+    // inode metadata revision
+    uint64_t meta_revision;
 };

 struct __attribute__((__packed__)) osd_reply_rw_t
 {
    osd_reply_header_t header;
+    // for reads: bitmap length
+    uint32_t bitmap_len;
+    uint32_t pad0;
 };

 // sync to the primary OSD
@ -186,11 +214,12 @@ struct __attribute__((__packed__)) osd_reply_sync_t
 union osd_any_op_t
 {
    osd_op_header_t hdr;
-    osd_op_secondary_rw_t sec_rw;
-    osd_op_secondary_del_t sec_del;
-    osd_op_secondary_sync_t sec_sync;
-    osd_op_secondary_stabilize_t sec_stab;
-    osd_op_secondary_list_t sec_list;
+    osd_op_sec_rw_t sec_rw;
+    osd_op_sec_del_t sec_del;
+    osd_op_sec_sync_t sec_sync;
+    osd_op_sec_stab_t sec_stab;
+    osd_op_sec_read_bmp_t sec_read_bmp;
+    osd_op_sec_list_t sec_list;
    osd_op_show_config_t show_conf;
    osd_op_rw_t rw;
    osd_op_sync_t sync;
@ -200,11 +229,12 @@ union osd_any_op_t
 union osd_any_reply_t
 {
    osd_reply_header_t hdr;
-    osd_reply_secondary_rw_t sec_rw;
-    osd_reply_secondary_del_t sec_del;
-    osd_reply_secondary_sync_t sec_sync;
-    osd_reply_secondary_stabilize_t sec_stab;
-    osd_reply_secondary_list_t sec_list;
+    osd_reply_sec_rw_t sec_rw;
+    osd_reply_sec_del_t sec_del;
+    osd_reply_sec_sync_t sec_sync;
+    osd_reply_sec_stab_t sec_stab;
+    osd_reply_sec_read_bmp_t sec_read_bmp;
+    osd_reply_sec_list_t sec_list;
    osd_reply_show_config_t show_conf;
    osd_reply_rw_t rw;
    osd_reply_sync_t sync;
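Review note: the secondary read/write structs gain an `attr_len` field plus an explicit `pad0`, and the diff comment says the bitmap comes after the header but before the data. The sketch below illustrates that layout with a simplified, hypothetical struct (`demo_sec_rw_t` is not the real `osd_op_sec_rw_t`, and its field set is an assumption). It also assumes the payload starts right after the struct, which the real protocol may not do. The `pad0` field presumably keeps the packed struct size a multiple of 8; that intent is a guess, not something the diff states.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified, hypothetical stand-in for a secondary read/write request.
struct __attribute__((__packed__)) demo_sec_rw_t
{
    uint64_t id;
    uint32_t offset;
    uint32_t len;       // data length in bytes
    uint32_t attr_len;  // bitmap/attribute length in bytes
    uint32_t pad0;      // explicit padding, as in the diff
};

static_assert(sizeof(demo_sec_rw_t) == 24, "packed layout has no hidden padding");
static_assert(sizeof(demo_sec_rw_t) % 8 == 0, "pad0 keeps the size a multiple of 8");

// Assumed payload layout for this sketch: [header][bitmap][data]
inline size_t bitmap_offset()
{
    return sizeof(demo_sec_rw_t);
}

inline size_t data_offset(const demo_sec_rw_t &op)
{
    return sizeof(demo_sec_rw_t) + op.attr_len;
}

inline size_t total_size(const demo_sec_rw_t &op)
{
    return data_offset(op) + op.len;
}
```

A receiver following this layout would read `attr_len` bytes of bitmap first and only then the `len` bytes of data.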
--- a/src/osd_peering.cpp
+++ b/src/osd_peering.cpp
@ -77,10 +77,11 @@ void osd_t::repeer_pgs(osd_num_t peer_osd)
    // Re-peer affected PGs
    for (auto & p: pgs)
    {
+        auto & pg = p.second;
        bool repeer = false;
-        if (p.second.state & (PG_PEERING | PG_ACTIVE | PG_INCOMPLETE))
+        if (pg.state & (PG_PEERING | PG_ACTIVE | PG_INCOMPLETE))
        {
-            for (osd_num_t pg_osd: p.second.all_peers)
+            for (osd_num_t pg_osd: pg.all_peers)
            {
                if (pg_osd == peer_osd)
                {
@ -91,8 +92,17 @@ void osd_t::repeer_pgs(osd_num_t peer_osd)
            if (repeer)
            {
                // Repeer this pg
-                printf("[PG %u/%u] Repeer because of OSD %lu\n", p.second.pool_id, p.second.pg_num, peer_osd);
-                start_pg_peering(p.second);
+                printf("[PG %u/%u] Repeer because of OSD %lu\n", pg.pool_id, pg.pg_num, peer_osd);
+                if (!(pg.state & (PG_ACTIVE | PG_REPEERING)) || pg.inflight == 0 && !pg.flush_batch)
+                {
+                    start_pg_peering(pg);
+                }
+                else
+                {
+                    // Stop accepting new operations, wait for current ones to finish or fail
+                    pg.state = pg.state & ~PG_ACTIVE | PG_REPEERING;
+                    report_pg_state(pg);
+                }
            }
        }
    }
@ -334,9 +344,10 @@ void osd_t::submit_sync_and_list_subop(osd_num_t role_osd, pg_peering_state_t *p
            {
                // FIXME: Mark peer as failed and don't reconnect immediately after dropping the connection
                printf("Failed to sync OSD %lu: %ld (%s), disconnecting peer\n", role_osd, op->reply.hdr.retval, strerror(-op->reply.hdr.retval));
+                int fail_fd = op->peer_fd;
                ps->list_ops.erase(role_osd);
-                c_cli.stop_client(op->peer_fd);
                delete op;
+                c_cli.stop_client(fail_fd);
                return;
            }
            delete op;
@ -413,9 +424,10 @@ void osd_t::submit_list_subop(osd_num_t role_osd, pg_peering_state_t *ps)
            if (op->reply.hdr.retval < 0)
            {
                printf("Failed to get object list from OSD %lu (retval=%ld), disconnecting peer\n", role_osd, op->reply.hdr.retval);
+                int fail_fd = op->peer_fd;
                ps->list_ops.erase(role_osd);
-                c_cli.stop_client(op->peer_fd);
                delete op;
+                c_cli.stop_client(fail_fd);
                return;
            }
            printf(
@ -484,15 +496,13 @@ bool osd_t::stop_pg(pg_t & pg)
    {
        return false;
    }
-    if (!(pg.state & PG_ACTIVE))
+    if (!(pg.state & (PG_ACTIVE | PG_REPEERING)))
    {
        finish_stop_pg(pg);
        return true;
    }
-    pg.state = pg.state & ~PG_ACTIVE | PG_STOPPING;
-    if (pg.inflight == 0 && !pg.flush_batch &&
-        // We must either forget all PG's unstable writes or wait for it to become clean
-        dirty_pgs.find({ .pool_id = pg.pool_id, .pg_num = pg.pg_num }) == dirty_pgs.end())
+    pg.state = pg.state & ~PG_ACTIVE & ~PG_REPEERING | PG_STOPPING;
+    if (pg.inflight == 0 && !pg.flush_batch)
    {
        finish_stop_pg(pg);
    }
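Review note: with `PG_REPEERING`, a busy PG no longer restarts peering immediately. It drops `PG_ACTIVE`, waits for in-flight operations to drain, and peering starts from the completion path (the matching check is in `handle_flush_op`). The sketch below models only that transition. The flag values, `demo_pg_t` and both function names are hypothetical, and setting the state to `PG_PEERING` stands in for the real `start_pg_peering()`.

```cpp
#include <cstdint>

// Hypothetical bit values for illustration; the real PG_* constants may differ.
constexpr uint32_t PG_ACTIVE    = 1u << 0;
constexpr uint32_t PG_REPEERING = 1u << 1;
constexpr uint32_t PG_PEERING   = 1u << 2;

struct demo_pg_t
{
    uint32_t state;
    int inflight;
    bool flush_batch;
};

// Returns true if peering starts right away, false if it was deferred.
bool request_repeer(demo_pg_t &pg)
{
    const bool idle = pg.inflight == 0 && !pg.flush_batch;
    if (!(pg.state & (PG_ACTIVE | PG_REPEERING)) || idle)
    {
        pg.state = PG_PEERING; // stand-in for start_pg_peering(pg)
        return true;
    }
    // Stop accepting new operations, wait for current ones to finish or fail
    pg.state = (pg.state & ~PG_ACTIVE) | PG_REPEERING;
    return false;
}

// Called when one in-flight operation completes.
// Returns true if this completion triggered the deferred peering.
bool on_op_done(demo_pg_t &pg)
{
    pg.inflight--;
    if ((pg.state & PG_REPEERING) && pg.inflight == 0 && !pg.flush_batch)
    {
        pg.state = PG_PEERING; // stand-in for start_pg_peering(pg)
        return true;
    }
    return false;
}
```

The explicit parentheses in `(pg.state & ~PG_ACTIVE) | PG_REPEERING` are equivalent to the unparenthesized form in the diff, because `&` binds tighter than `|`.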
--- a/src/osd_peering_pg.cpp
+++ b/src/osd_peering_pg.cpp
@ -430,12 +430,13 @@ void pg_t::calc_object_states(int log_level)
 void pg_t::print_state()
 {
    printf(
-        "[PG %u/%u] is %s%s%s%s%s%s%s%s%s%s%s%s%s (%lu objects)\n", pool_id, pg_num,
+        "[PG %u/%u] is %s%s%s%s%s%s%s%s%s%s%s%s%s%s (%lu objects)\n", pool_id, pg_num,
        (state & PG_STARTING) ? "starting" : "",
        (state & PG_OFFLINE) ? "offline" : "",
        (state & PG_PEERING) ? "peering" : "",
        (state & PG_INCOMPLETE) ? "incomplete" : "",
        (state & PG_ACTIVE) ? "active" : "",
+        (state & PG_REPEERING) ? "repeering" : "",
        (state & PG_STOPPING) ? "stopping" : "",
        (state & PG_DEGRADED) ? " + degraded" : "",
        (state & PG_HAS_INCOMPLETE) ? " + has_incomplete" : "",
--- a/src/osd_primary.cpp
+++ b/src/osd_primary.cpp
@ -2,6 +2,7 @@
 // License: VNPL-1.1 (see README.md for details)

 #include "osd_primary.h"
+#include "allocator.h"

 // read: read directly or read paired stripe(s), reconstruct, return
 // write: read paired stripe(s), reconstruct, modify, calculate parity, write
@ -18,7 +19,7 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
    // Our EC scheme stores data in fixed chunks equal to (K*block size)
    // K = (pg_size-parity_chunks) in case of EC/XOR, or 1 for replicated pools
    pool_id_t pool_id = INODE_POOL(cur_op->req.rw.inode);
-    // FIXME: We have to access pool config here, so make sure that it doesn't change while its PGs are active...
+    // Note: We read pool config here, so we must NOT change it when PGs are active
    auto pool_cfg_it = st_cli.pool_config.find(pool_id);
    if (pool_cfg_it == st_cli.pool_config.end())
    {
@ -27,6 +28,7 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
        return false;
    }
    auto & pool_cfg = pool_cfg_it->second;
+    // FIXME: op_data->pg_data_size can probably be removed (there's pg.pg_data_size)
    uint64_t pg_data_size = (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks);
    uint64_t pg_block_size = bs_block_size * pg_data_size;
    object_id oid = {
@ -34,7 +36,7 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
        // oid.stripe = starting offset of the parity stripe
        .stripe = (cur_op->req.rw.offset/pg_block_size)*pg_block_size,
    };
-    pg_num_t pg_num = (cur_op->req.rw.inode + oid.stripe/pool_cfg.pg_stripe_size) % pg_counts[pool_id] + 1;
+    pg_num_t pg_num = (oid.stripe/pool_cfg.pg_stripe_size) % pg_counts[pool_id] + 1; // like map_to_pg()
    auto pg_it = pgs.find({ .pool_id = pool_id, .pg_num = pg_num });
    if (pg_it == pgs.end() || !(pg_it->second.state & PG_ACTIVE))
    {
@ -44,27 +46,94 @@ bool osd_t::prepare_primary_rw(osd_op_t *cur_op)
        return false;
    }
    if ((cur_op->req.rw.offset + cur_op->req.rw.len) > (oid.stripe + pg_block_size) ||
-        (cur_op->req.rw.offset % bs_disk_alignment) != 0 ||
-        (cur_op->req.rw.len % bs_disk_alignment) != 0)
+        (cur_op->req.rw.offset % bs_bitmap_granularity) != 0 ||
+        (cur_op->req.rw.len % bs_bitmap_granularity) != 0)
    {
        finish_op(cur_op, -EINVAL);
        return false;
    }
+    int stripe_count = (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg_it->second.pg_size);
+    int chain_size = 0;
+    if (cur_op->req.hdr.opcode == OSD_OP_READ && cur_op->req.rw.meta_revision > 0)
+    {
+        // Chained read
+        auto inode_it = st_cli.inode_config.find(cur_op->req.rw.inode);
+        if (inode_it != st_cli.inode_config.end() && inode_it->second.mod_revision != cur_op->req.rw.meta_revision)
+        {
+            // Client view of the metadata differs from OSD's view
+            // Operation can't be completed correctly, client should retry later
+            finish_op(cur_op, -EPIPE);
+            return false;
+        }
+        // Find parents from the same pool. Optimized reads only work within pools
+        while (inode_it != st_cli.inode_config.end() && inode_it->second.parent_id &&
+            INODE_POOL(inode_it->second.parent_id) == pg_it->second.pool_id)
+        {
+            chain_size++;
+            inode_it = st_cli.inode_config.find(inode_it->second.parent_id);
+        }
+        if (chain_size)
+        {
+            // Add the original inode
+            chain_size++;
+        }
+    }
    osd_primary_op_data_t *op_data = (osd_primary_op_data_t*)calloc_or_die(
-        1, sizeof(osd_primary_op_data_t) + sizeof(osd_rmw_stripe_t) * (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg_it->second.pg_size)
+        // Allocate:
+        // - op_data
+        1, sizeof(osd_primary_op_data_t) +
+        // - stripes
+        // - resulting bitmap buffers
+        stripe_count * (clean_entry_bitmap_size + sizeof(osd_rmw_stripe_t)) +
+        chain_size * (
+            // - copy of the chain
+            sizeof(inode_t) +
+            // - bitmap buffers for chained read
+            stripe_count * clean_entry_bitmap_size +
+            // - 'missing' flags for chained reads
+            (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 0 : pg_it->second.pg_size)
+        )
    );
+    void *data_buf = ((void*)op_data) + sizeof(osd_primary_op_data_t);
    op_data->pg_num = pg_num;
    op_data->oid = oid;
-    op_data->stripes = ((osd_rmw_stripe_t*)(op_data+1));
+    op_data->stripes = (osd_rmw_stripe_t*)data_buf;
+    data_buf += sizeof(osd_rmw_stripe_t) * stripe_count;
    op_data->scheme = pool_cfg.scheme;
    op_data->pg_data_size = pg_data_size;
+    op_data->pg_size = pg_it->second.pg_size;
    cur_op->op_data = op_data;
    split_stripes(pg_data_size, bs_block_size, (uint32_t)(cur_op->req.rw.offset - oid.stripe), cur_op->req.rw.len, op_data->stripes);
+    // Allocate bitmaps along with stripes to avoid extra allocations and fragmentation
+    for (int i = 0; i < stripe_count; i++)
+    {
+        op_data->stripes[i].bmp_buf = data_buf;
+        data_buf += clean_entry_bitmap_size;
+    }
+    op_data->chain_size = chain_size;
+    if (chain_size > 0)
+    {
+        op_data->read_chain = (inode_t*)data_buf;
+        data_buf += sizeof(inode_t) * chain_size;
+        op_data->snapshot_bitmaps = data_buf;
+        data_buf += chain_size * stripe_count * clean_entry_bitmap_size;
+        op_data->missing_flags = (uint8_t*)data_buf;
+        data_buf += chain_size * (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 0 : pg_it->second.pg_size);
+        // Copy chain
+        int chain_num = 0;
+        op_data->read_chain[chain_num++] = cur_op->req.rw.inode;
+        auto inode_it = st_cli.inode_config.find(cur_op->req.rw.inode);
+        while (inode_it != st_cli.inode_config.end() && inode_it->second.parent_id && INODE_POOL(inode_it->second.parent_id) == pg_it->second.pool_id)
+        {
+            op_data->read_chain[chain_num++] = inode_it->second.parent_id;
+            inode_it = st_cli.inode_config.find(inode_it->second.parent_id);
+        }
+    }
    pg_it->second.inflight++;
    return true;
 }

-static uint64_t* get_object_osd_set(pg_t &pg, object_id &oid, uint64_t *def, pg_osd_set_state_t **object_state)
+uint64_t* osd_t::get_object_osd_set(pg_t &pg, object_id &oid, uint64_t *def, pg_osd_set_state_t **object_state)
 {
    if (!(pg.state & (PG_HAS_INCOMPLETE | PG_HAS_DEGRADED | PG_HAS_MISPLACED)))
    {
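Review note: `prepare_primary_rw` now carves one `calloc` block into the op data, the stripe array, per-stripe bitmap buffers and the chained-read buffers. The sketch below shows the same single-allocation technique with a typed byte cursor instead of `void*` arithmetic (which is a GNU extension). All names here (`demo_op_data_t`, `demo_stripe_t`, `alloc_op_data`) are hypothetical stand-ins, the 'missing' flags buffer is omitted, and the chain array is placed before the byte buffers to keep it 8-byte aligned, which differs from the order used in the diff.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

struct demo_stripe_t            // hypothetical stand-in for osd_rmw_stripe_t
{
    uint8_t *bmp_buf;
};

struct demo_op_data_t           // hypothetical stand-in for osd_primary_op_data_t
{
    demo_stripe_t *stripes;
    uint64_t *read_chain;
    uint8_t *snapshot_bitmaps;
    int chain_size;
};

// Guard the alignment assumption that lets read_chain follow the stripe array.
static_assert(sizeof(demo_op_data_t) % alignof(uint64_t) == 0, "header keeps alignment");
static_assert(sizeof(demo_stripe_t) % alignof(uint64_t) == 0, "stripes keep alignment");

// One zeroed allocation: [op data][stripes][chain][stripe bitmaps][chain bitmaps]
demo_op_data_t *alloc_op_data(int stripe_count, int chain_size, size_t bitmap_size)
{
    size_t total = sizeof(demo_op_data_t)
        + stripe_count * sizeof(demo_stripe_t)
        + chain_size * sizeof(uint64_t)
        + stripe_count * bitmap_size
        + (size_t)chain_size * stripe_count * bitmap_size;
    uint8_t *cur = (uint8_t*)calloc(1, total);
    if (!cur)
        return nullptr;
    demo_op_data_t *op = (demo_op_data_t*)cur;
    cur += sizeof(demo_op_data_t);
    op->stripes = (demo_stripe_t*)cur;
    cur += sizeof(demo_stripe_t) * stripe_count;
    op->chain_size = chain_size;
    op->read_chain = chain_size ? (uint64_t*)cur : nullptr;
    cur += sizeof(uint64_t) * chain_size;
    for (int i = 0; i < stripe_count; i++)
    {
        op->stripes[i].bmp_buf = cur;
        cur += bitmap_size;
    }
    op->snapshot_bitmaps = chain_size ? cur : nullptr;
    return op;
}
```

Because everything lives in one block, a single `free(op)` releases the op data together with all of its buffers.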
@ -100,8 +169,16 @@ void osd_t::continue_primary_read(osd_op_t *cur_op)
        return;
    }
    osd_primary_op_data_t *op_data = cur_op->op_data;
-    if (op_data->st == 1)      goto resume_1;
-    else if (op_data->st == 2) goto resume_2;
+    if (op_data->chain_size)
+    {
+        continue_chained_read(cur_op);
+        return;
+    }
+    if (op_data->st == 1)
+        goto resume_1;
+    else if (op_data->st == 2)
+        goto resume_2;
+    cur_op->reply.rw.bitmap_len = 0;
    {
        auto & pg = pgs.at({ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num });
        for (int role = 0; role < op_data->pg_data_size; role++)
@ -116,8 +193,7 @@ void osd_t::continue_primary_read(osd_op_t *cur_op)
        {
            // Fast happy-path
            cur_op->buf = alloc_read_buffer(op_data->stripes, op_data->pg_data_size, 0);
-            submit_primary_subops(SUBMIT_READ, op_data->target_ver,
-                (op_data->scheme == POOL_SCHEME_REPLICATED ? pg.pg_size : op_data->pg_data_size), pg.cur_set.data(), cur_op);
+            submit_primary_subops(SUBMIT_READ, op_data->target_ver, pg.cur_set.data(), cur_op);
            op_data->st = 1;
        }
        else
@ -134,7 +210,7 @@ void osd_t::continue_primary_read(osd_op_t *cur_op)
            op_data->scheme = pg.scheme;
            op_data->degraded = 1;
            cur_op->buf = alloc_read_buffer(op_data->stripes, pg.pg_size, 0);
-            submit_primary_subops(SUBMIT_READ, op_data->target_ver, pg.pg_size, cur_set, cur_op);
+            submit_primary_subops(SUBMIT_READ, op_data->target_ver, cur_set, cur_op);
            op_data->st = 1;
        }
    }
@ -146,18 +222,20 @@ resume_2:
        finish_op(cur_op, op_data->epipe > 0 ? -EPIPE : -EIO);
        return;
    }
+    cur_op->reply.rw.bitmap_len = op_data->pg_data_size * clean_entry_bitmap_size;
    if (op_data->degraded)
    {
        // Reconstruct missing stripes
        osd_rmw_stripe_t *stripes = op_data->stripes;
        if (op_data->scheme == POOL_SCHEME_XOR)
        {
-            reconstruct_stripes_xor(stripes, op_data->pg_size);
+            reconstruct_stripes_xor(stripes, op_data->pg_size, clean_entry_bitmap_size);
        }
        else if (op_data->scheme == POOL_SCHEME_JERASURE)
        {
-            reconstruct_stripes_jerasure(stripes, op_data->pg_size, op_data->pg_data_size);
+            reconstruct_stripes_jerasure(stripes, op_data->pg_size, op_data->pg_data_size, clean_entry_bitmap_size);
        }
+        cur_op->iov.push_back(op_data->stripes[0].bmp_buf, cur_op->reply.rw.bitmap_len);
        for (int role = 0; role < op_data->pg_size; role++)
        {
            if (stripes[role].req_end != 0)
@ -172,614 +250,12 @@ resume_2:
    }
    else
    {
+        cur_op->iov.push_back(op_data->stripes[0].bmp_buf, cur_op->reply.rw.bitmap_len);
        cur_op->iov.push_back(cur_op->buf, cur_op->req.rw.len);
    }
    finish_op(cur_op, cur_op->req.rw.len);
 }

-bool osd_t::check_write_queue(osd_op_t *cur_op, pg_t & pg)
-{
-    osd_primary_op_data_t *op_data = cur_op->op_data;
-    // Check if actions are pending for this object
-    auto act_it = pg.flush_actions.lower_bound((obj_piece_id_t){
-        .oid = op_data->oid,
-        .osd_num = 0,
-    });
-    if (act_it != pg.flush_actions.end() &&
-        act_it->first.oid.inode == op_data->oid.inode &&
-        (act_it->first.oid.stripe & ~STRIPE_MASK) == op_data->oid.stripe)
-    {
-        pg.write_queue.emplace(op_data->oid, cur_op);
-        return false;
-    }
-    // Check if there are other write requests to the same object
-    auto vo_it = pg.write_queue.find(op_data->oid);
-    if (vo_it != pg.write_queue.end())
-    {
-        op_data->st = 1;
-        pg.write_queue.emplace(op_data->oid, cur_op);
-        return false;
-    }
-    pg.write_queue.emplace(op_data->oid, cur_op);
-    return true;
-}
-
-void osd_t::continue_primary_write(osd_op_t *cur_op)
-{
-    if (!cur_op->op_data && !prepare_primary_rw(cur_op))
-    {
-        return;
-    }
-    osd_primary_op_data_t *op_data = cur_op->op_data;
-    auto & pg = pgs.at({ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num });
-    if (op_data->st == 1)      goto resume_1;
-    else if (op_data->st == 2) goto resume_2;
-    else if (op_data->st == 3) goto resume_3;
-    else if (op_data->st == 4) goto resume_4;
-    else if (op_data->st == 5) goto resume_5;
-    else if (op_data->st == 6) goto resume_6;
-    else if (op_data->st == 7) goto resume_7;
-    else if (op_data->st == 8) goto resume_8;
-    else if (op_data->st == 9) goto resume_9;
-    else if (op_data->st == 10) goto resume_10;
-    assert(op_data->st == 0);
-    if (!check_write_queue(cur_op, pg))
-    {
-        return;
-    }
-resume_1:
-    // Determine blocks to read and write
-    // Missing chunks are allowed to be overwritten even in incomplete objects
-    // FIXME: Allow to do small writes to the old (degraded/misplaced) OSD set for lower performance impact
-    op_data->prev_set = get_object_osd_set(pg, op_data->oid, pg.cur_set.data(), &op_data->object_state);
-    if (op_data->scheme == POOL_SCHEME_REPLICATED)
-    {
-        // Simplified algorithm
-        op_data->stripes[0].write_start = op_data->stripes[0].req_start;
-        op_data->stripes[0].write_end = op_data->stripes[0].req_end;
-        op_data->stripes[0].write_buf = cur_op->buf;
-        if (pg.cur_set.data() != op_data->prev_set && (op_data->stripes[0].write_start != 0 ||
-            op_data->stripes[0].write_end != bs_block_size))
-        {
-            // Object is degraded/misplaced and will be moved to <write_osd_set>
-            op_data->stripes[0].read_start = 0;
-            op_data->stripes[0].read_end = bs_block_size;
-            cur_op->rmw_buf = op_data->stripes[0].read_buf = memalign_or_die(MEM_ALIGNMENT, bs_block_size);
-        }
-    }
-    else
-    {
-        cur_op->rmw_buf = calc_rmw(cur_op->buf, op_data->stripes, op_data->prev_set,
-            pg.pg_size, op_data->pg_data_size, pg.pg_cursize, pg.cur_set.data(), bs_block_size);
-        if (!cur_op->rmw_buf)
-        {
-            // Refuse partial overwrite of an incomplete object
-            cur_op->reply.hdr.retval = -EINVAL;
-            goto continue_others;
-        }
-    }
-    // Read required blocks
-    submit_primary_subops(SUBMIT_RMW_READ, UINT64_MAX, pg.pg_size, op_data->prev_set, cur_op);
-resume_2:
-    op_data->st = 2;
-    return;
-resume_3:
-    if (op_data->errors > 0)
-    {
-        pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
-        return;
-    }
-    // Save version override for parallel reads
-    pg.ver_override[op_data->oid] = op_data->fact_ver;
-    if (op_data->scheme == POOL_SCHEME_REPLICATED)
-    {
-        // Only (possibly) copy new data from the request into the recovery buffer
-        if (pg.cur_set.data() != op_data->prev_set && (op_data->stripes[0].write_start != 0 ||
-            op_data->stripes[0].write_end != bs_block_size))
-        {
-            memcpy(
-                op_data->stripes[0].read_buf + op_data->stripes[0].req_start,
-                op_data->stripes[0].write_buf,
-                op_data->stripes[0].req_end - op_data->stripes[0].req_start
-            );
-            op_data->stripes[0].write_buf = op_data->stripes[0].read_buf;
-            op_data->stripes[0].write_start = 0;
-            op_data->stripes[0].write_end = bs_block_size;
-        }
-    }
-    else
-    {
-        // Recover missing stripes, calculate parity
-        if (pg.scheme == POOL_SCHEME_XOR)
-        {
-            calc_rmw_parity_xor(op_data->stripes, pg.pg_size, op_data->prev_set, pg.cur_set.data(), bs_block_size);
-        }
-        else if (pg.scheme == POOL_SCHEME_JERASURE)
-        {
-            calc_rmw_parity_jerasure(op_data->stripes, pg.pg_size, op_data->pg_data_size, op_data->prev_set, pg.cur_set.data(), bs_block_size);
-        }
-    }
-    // Send writes
-    if ((op_data->fact_ver >> (64-PG_EPOCH_BITS)) < pg.epoch)
-    {
-        op_data->target_ver = ((uint64_t)pg.epoch << (64-PG_EPOCH_BITS)) | 1;
-    }
-    else
-    {
-        if ((op_data->fact_ver & (1ul<<(64-PG_EPOCH_BITS) - 1)) == (1ul<<(64-PG_EPOCH_BITS) - 1))
-        {
-            assert(pg.epoch != ((1ul << PG_EPOCH_BITS)-1));
-            pg.epoch++;
-        }
-        op_data->target_ver = op_data->fact_ver + 1;
-    }
-    if (pg.epoch > pg.reported_epoch)
-    {
-        // Report newer epoch before writing
-        // FIXME: We may report only one PG state here...
-        this->pg_state_dirty.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
-        pg.history_changed = true;
-        report_pg_states();
-resume_10:
-        if (pg.epoch > pg.reported_epoch)
-        {
-            op_data->st = 10;
-            return;
-        }
-    }
-    submit_primary_subops(SUBMIT_WRITE, op_data->target_ver, pg.pg_size, pg.cur_set.data(), cur_op);
-resume_4:
-    op_data->st = 4;
-    return;
-resume_5:
-    if (op_data->errors > 0)
-    {
-        pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
-        return;
-    }
-resume_6:
-resume_7:
-    if (!remember_unstable_write(cur_op, pg, pg.cur_loc_set, 6))
-    {
-        // FIXME: Check for immediate_commit == IMMEDIATE_SMALL
-        return;
-    }
-    if (op_data->fact_ver == 1)
-    {
-        // Object is created
-        pg.clean_count++;
-        pg.total_count++;
-    }
-    if (op_data->object_state)
-    {
-        {
-            int recovery_type = op_data->object_state->state & (OBJ_DEGRADED|OBJ_INCOMPLETE) ? 0 : 1;
-            recovery_stat_count[0][recovery_type]++;
-            if (!recovery_stat_count[0][recovery_type])
-            {
-                recovery_stat_count[0][recovery_type]++;
-                recovery_stat_bytes[0][recovery_type] = 0;
-            }
-            for (int role = 0; role < (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size); role++)
-            {
-                recovery_stat_bytes[0][recovery_type] += op_data->stripes[role].write_end - op_data->stripes[role].write_start;
-            }
-        }
-        // Any kind of a non-clean object can have extra chunks, because we don't record objects
-        // as degraded & misplaced or incomplete & misplaced at the same time. So try to remove extra chunks
-        if (immediate_commit != IMMEDIATE_ALL)
-        {
-            // We can't remove extra chunks yet if fsyncs are explicit, because
-            // new copies may not be committed to stable storage yet
-            // We can only remove extra chunks after a successful SYNC for this PG
-            for (auto & chunk: op_data->object_state->osd_set)
-            {
-                // Check is the same as in submit_primary_del_subops()
-                if (op_data->scheme == POOL_SCHEME_REPLICATED
-                    ? !contains_osd(pg.cur_set.data(), pg.pg_size, chunk.osd_num)
-                    : (chunk.osd_num != pg.cur_set[chunk.role]))
-                {
-                    pg.copies_to_delete_after_sync.push_back((obj_ver_osd_t){
-                        .osd_num = chunk.osd_num,
-                        .oid = {
-                            .inode = op_data->oid.inode,
-                            .stripe = op_data->oid.stripe | (op_data->scheme == POOL_SCHEME_REPLICATED ? 0 : chunk.role),
-                        },
-                        .version = op_data->fact_ver,
-                    });
-                    copies_to_delete_after_sync_count++;
-                }
-            }
-        }
-        else
-        {
-            submit_primary_del_subops(cur_op, pg.cur_set.data(), pg.pg_size, op_data->object_state->osd_set);
-            if (op_data->n_subops > 0)
-            {
-resume_8:
-                op_data->st = 8;
-                return;
-resume_9:
-                if (op_data->errors > 0)
-                {
-                    pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
-                    return;
-                }
-            }
-        }
-        // Clear object state
-        remove_object_from_state(op_data->oid, op_data->object_state, pg);
-        pg.clean_count++;
-    }
-    cur_op->reply.hdr.retval = cur_op->req.rw.len;
-continue_others:
-    // Remove version override
-    pg.ver_override.erase(op_data->oid);
-    object_id oid = op_data->oid;
-    // Remove the operation from queue before calling finish_op so it doesn't see the completed operation in queue
-    auto next_it = pg.write_queue.find(oid);
-    if (next_it != pg.write_queue.end() && next_it->second == cur_op)
-    {
-        pg.write_queue.erase(next_it++);
-    }
-    // finish_op would invalidate next_it if it cleared pg.write_queue, but it doesn't do that :)
-    finish_op(cur_op, cur_op->reply.hdr.retval);
-    // Continue other write operations to the same object
-    if (next_it != pg.write_queue.end() && next_it->first == oid)
-    {
-        osd_op_t *next_op = next_it->second;
-        continue_primary_write(next_op);
-    }
-}
-
-bool osd_t::remember_unstable_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t & loc_set, int base_state)
-{
-    osd_primary_op_data_t *op_data = cur_op->op_data;
-    if (op_data->st == base_state)
-    {
-        goto resume_6;
-    }
-    else if (op_data->st == base_state+1)
-    {
-        goto resume_7;
-    }
-    // FIXME: Check for immediate_commit == IMMEDIATE_SMALL
-    if (immediate_commit == IMMEDIATE_ALL)
-    {
-        if (op_data->scheme != POOL_SCHEME_REPLICATED)
-        {
-            // Send STABILIZE ops immediately
-            op_data->unstable_write_osds = new std::vector<unstable_osd_num_t>();
-            op_data->unstable_writes = new obj_ver_id[loc_set.size()];
-            {
-                int last_start = 0;
-                for (auto & chunk: loc_set)
-                {
-                    op_data->unstable_writes[last_start] = (obj_ver_id){
-                        .oid = {
-                            .inode = op_data->oid.inode,
-                            .stripe = op_data->oid.stripe | chunk.role,
-                        },
-                        .version = op_data->fact_ver,
-                    };
-                    op_data->unstable_write_osds->push_back((unstable_osd_num_t){
-                        .osd_num = chunk.osd_num,
-                        .start = last_start,
-                        .len = 1,
-                    });
-                    last_start++;
-                }
-            }
-            submit_primary_stab_subops(cur_op);
-resume_6:
-            op_data->st = 6;
-            return false;
-resume_7:
-            // FIXME: Free those in the destructor?
-            delete op_data->unstable_write_osds;
-            delete[] op_data->unstable_writes;
-            op_data->unstable_writes = NULL;
-            op_data->unstable_write_osds = NULL;
-            if (op_data->errors > 0)
-            {
-                pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
-                return false;
-            }
-        }
-    }
-    else
-    {
-        if (op_data->scheme != POOL_SCHEME_REPLICATED)
-        {
-            // Remember version as unstable for EC/XOR
-            for (auto & chunk: loc_set)
-            {
-                this->dirty_osds.insert(chunk.osd_num);
-                this->unstable_writes[(osd_object_id_t){
-                    .osd_num = chunk.osd_num,
-                    .oid = {
-                        .inode = op_data->oid.inode,
-                        .stripe = op_data->oid.stripe | chunk.role,
-                    },
-                }] = op_data->fact_ver;
-            }
-        }
-        else
-        {
-            // Only remember to sync OSDs for replicated pools
-            for (auto & chunk: loc_set)
-            {
-                this->dirty_osds.insert(chunk.osd_num);
-            }
-        }
-        // Remember PG as dirty to drop the connection when PG goes offline
-        // (this is required because of the "lazy sync")
-        auto cl_it = c_cli.clients.find(cur_op->peer_fd);
-        if (cl_it != c_cli.clients.end())
-        {
-            cl_it->second->dirty_pgs.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
-        }
-        dirty_pgs.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
-    }
-    return true;
-}
-
-// Save and clear unstable_writes -> SYNC all -> STABLE all
-void osd_t::continue_primary_sync(osd_op_t *cur_op)
-{
-    if (!cur_op->op_data)
-    {
-        cur_op->op_data = (osd_primary_op_data_t*)calloc_or_die(1, sizeof(osd_primary_op_data_t));
-    }
-    osd_primary_op_data_t *op_data = cur_op->op_data;
-    if (op_data->st == 1)      goto resume_1;
-    else if (op_data->st == 2) goto resume_2;
-    else if (op_data->st == 3) goto resume_3;
-    else if (op_data->st == 4) goto resume_4;
-    else if (op_data->st == 5) goto resume_5;
-    else if (op_data->st == 6) goto resume_6;
-    else if (op_data->st == 7) goto resume_7;
-    else if (op_data->st == 8) goto resume_8;
-    assert(op_data->st == 0);
-    if (syncs_in_progress.size() > 0)
-    {
-        // Wait for previous syncs, if any
-        // FIXME: We may try to execute the current one in parallel, like in Blockstore, but I'm not sure if it matters at all
-        syncs_in_progress.push_back(cur_op);
-        op_data->st = 1;
-resume_1:
-        return;
-    }
-    else
-    {
-        syncs_in_progress.push_back(cur_op);
-    }
-resume_2:
-    if (dirty_osds.size() == 0)
-    {
-        // Nothing to sync
-        goto finish;
-    }
-    // Save and clear unstable_writes
-    // In theory it is possible to do it on a per-client basis, but this seems to be an unnecessary complication
-    // It would be cool not to copy these here at all, but someone has to deduplicate them by object IDs anyway
-    if (unstable_writes.size() > 0)
-    {
-        op_data->unstable_write_osds = new std::vector<unstable_osd_num_t>();
-        op_data->unstable_writes = new obj_ver_id[this->unstable_writes.size()];
-        osd_num_t last_osd = 0;
-        int last_start = 0, last_end = 0;
-        for (auto it = this->unstable_writes.begin(); it != this->unstable_writes.end(); it++)
-        {
-            if (last_osd != it->first.osd_num)
-            {
-                if (last_osd != 0)
-                {
-                    op_data->unstable_write_osds->push_back((unstable_osd_num_t){
-                        .osd_num = last_osd,
-                        .start = last_start,
-                        .len = last_end - last_start,
-                    });
-                }
-                last_osd = it->first.osd_num;
-                last_start = last_end;
-            }
-            op_data->unstable_writes[last_end] = (obj_ver_id){
-                .oid = it->first.oid,
-                .version = it->second,
-            };
-            last_end++;
-        }
-        if (last_osd != 0)
-        {
-            op_data->unstable_write_osds->push_back((unstable_osd_num_t){
-                .osd_num = last_osd,
-                .start = last_start,
-                .len = last_end - last_start,
-            });
-        }
-        this->unstable_writes.clear();
-    }
-    {
-        void *dirty_buf = malloc_or_die(
-            sizeof(pool_pg_num_t)*dirty_pgs.size() +
-            sizeof(osd_num_t)*dirty_osds.size() +
-            sizeof(obj_ver_osd_t)*this->copies_to_delete_after_sync_count
-        );
-        op_data->dirty_pgs = (pool_pg_num_t*)dirty_buf;
-        op_data->dirty_osds = (osd_num_t*)(dirty_buf + sizeof(pool_pg_num_t)*dirty_pgs.size());
-        op_data->dirty_pg_count = dirty_pgs.size();
-        op_data->dirty_osd_count = dirty_osds.size();
-        if (this->copies_to_delete_after_sync_count)
-        {
-            op_data->copies_to_delete_count = 0;
-            op_data->copies_to_delete = (obj_ver_osd_t*)(op_data->dirty_osds + op_data->dirty_osd_count);
-            for (auto dirty_pg_num: dirty_pgs)
-            {
-                auto & pg = pgs.at(dirty_pg_num);
-                assert(pg.copies_to_delete_after_sync.size() <= this->copies_to_delete_after_sync_count);
-                memcpy(
-                    op_data->copies_to_delete + op_data->copies_to_delete_count,
-                    pg.copies_to_delete_after_sync.data(),
-                    sizeof(obj_ver_osd_t)*pg.copies_to_delete_after_sync.size()
-                );
-                op_data->copies_to_delete_count += pg.copies_to_delete_after_sync.size();
-                this->copies_to_delete_after_sync_count -= pg.copies_to_delete_after_sync.size();
-                pg.copies_to_delete_after_sync.clear();
-            }
-            assert(this->copies_to_delete_after_sync_count == 0);
-        }
-        int dpg = 0;
-        for (auto dirty_pg_num: dirty_pgs)
-        {
-            pgs.at(dirty_pg_num).inflight++;
-            op_data->dirty_pgs[dpg++] = dirty_pg_num;
-        }
-        dirty_pgs.clear();
-        dpg = 0;
-        for (auto osd_num: dirty_osds)
-        {
-            op_data->dirty_osds[dpg++] = osd_num;
-        }
-        dirty_osds.clear();
-    }
-    if (immediate_commit != IMMEDIATE_ALL)
-    {
-        // SYNC
-        submit_primary_sync_subops(cur_op);
-resume_3:
-        op_data->st = 3;
-        return;
-resume_4:
-        if (op_data->errors > 0)
-        {
-            goto resume_6;
-        }
-    }
-    if (op_data->unstable_writes)
-    {
-        // Stabilize version sets, if any
-        submit_primary_stab_subops(cur_op);
-resume_5:
-        op_data->st = 5;
-        return;
-    }
-resume_6:
-    if (op_data->errors > 0)
-    {
-        // Return PGs and OSDs back into their dirty sets
-        for (int i = 0; i < op_data->dirty_pg_count; i++)
-        {
-            dirty_pgs.insert(op_data->dirty_pgs[i]);
-        }
-        for (int i = 0; i < op_data->dirty_osd_count; i++)
-        {
-            dirty_osds.insert(op_data->dirty_osds[i]);
-        }
-        if (op_data->unstable_writes)
-        {
-            // Return objects back into the unstable write set
-            for (auto unstable_osd: *(op_data->unstable_write_osds))
-            {
-                for (int i = 0; i < unstable_osd.len; i++)
-                {
-                    // Except those from peered PGs
-                    auto & w = op_data->unstable_writes[i];
-                    pool_pg_num_t wpg = {
-                        .pool_id = INODE_POOL(w.oid.inode),
-                        .pg_num = map_to_pg(w.oid, st_cli.pool_config.at(INODE_POOL(w.oid.inode)).pg_stripe_size),
-                    };
-                    if (pgs.at(wpg).state & PG_ACTIVE)
-                    {
-                        uint64_t & dest = this->unstable_writes[(osd_object_id_t){
-                            .osd_num = unstable_osd.osd_num,
-                            .oid = w.oid,
-                        }];
-                        dest = dest < w.version ? w.version : dest;
-                        dirty_pgs.insert(wpg);
-                    }
-                }
-            }
-        }
-        if (op_data->copies_to_delete)
-        {
-            // Return 'copies to delete' back into respective PGs
-            for (int i = 0; i < op_data->copies_to_delete_count; i++)
-            {
-                auto & w = op_data->copies_to_delete[i];
-                auto & pg = pgs.at((pool_pg_num_t){
-                    .pool_id = INODE_POOL(w.oid.inode),
-                    .pg_num = map_to_pg(w.oid, st_cli.pool_config.at(INODE_POOL(w.oid.inode)).pg_stripe_size),
-                });
-                if (pg.state & PG_ACTIVE)
-                {
-                    pg.copies_to_delete_after_sync.push_back(w);
-                    copies_to_delete_after_sync_count++;
-                }
-            }
-        }
-    }
-    else if (op_data->copies_to_delete)
-    {
-        // Actually delete copies which we wanted to delete
-        submit_primary_del_batch(cur_op, op_data->copies_to_delete, op_data->copies_to_delete_count);
-resume_7:
-        op_data->st = 7;
-        return;
-resume_8:
-        if (op_data->errors > 0)
-        {
-            goto resume_6;
-        }
-    }
-    for (int i = 0; i < op_data->dirty_pg_count; i++)
-    {
-        auto & pg = pgs.at(op_data->dirty_pgs[i]);
-        pg.inflight--;
-        if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch &&
-            // We must either forget all PG's unstable writes or wait for it to become clean
-            dirty_pgs.find({ .pool_id = pg.pool_id, .pg_num = pg.pg_num }) == dirty_pgs.end())
-        {
-            finish_stop_pg(pg);
-        }
-    }
-    // FIXME: Free those in the destructor?
-    free(op_data->dirty_pgs);
-    op_data->dirty_pgs = NULL;
-    op_data->dirty_osds = NULL;
-    if (op_data->unstable_writes)
-    {
-        delete op_data->unstable_write_osds;
-        delete[] op_data->unstable_writes;
-        op_data->unstable_writes = NULL;
-        op_data->unstable_write_osds = NULL;
-    }
-    if (op_data->errors > 0)
-    {
-        finish_op(cur_op, op_data->epipe > 0 ? -EPIPE : -EIO);
-    }
-    else
-    {
-finish:
-        if (cur_op->peer_fd)
-        {
-            auto it = c_cli.clients.find(cur_op->peer_fd);
-            if (it != c_cli.clients.end())
-                it->second->dirty_pgs.clear();
-        }
-        finish_op(cur_op, 0);
-    }
-    assert(syncs_in_progress.front() == cur_op);
-    syncs_in_progress.pop_front();
-    if (syncs_in_progress.size() > 0)
-    {
-        cur_op = syncs_in_progress.front();
-        op_data = cur_op->op_data;
-        op_data->st++;
-        goto resume_2;
-    }
-}
-
 // Decrement pg_osd_set_state_t's object_count and change PG state accordingly
 void osd_t::remove_object_from_state(object_id & oid, pg_osd_set_state_t *object_state, pg_t & pg)
 {
@@ -818,10 +294,14 @@ void osd_t::remove_object_from_state(object_id & oid, pg_osd_set_state_t *object
    {
        throw std::runtime_error("BUG: Invalid object state: "+std::to_string(object_state->state));
    }
-    object_state->object_count--;
-    if (!object_state->object_count)
+}
+
+void osd_t::free_object_state(pg_t & pg, pg_osd_set_state_t **object_state)
+{
+    if (*object_state && !(--(*object_state)->object_count))
    {
-        pg.state_dict.erase(object_state->osd_set);
+        pg.state_dict.erase((*object_state)->osd_set);
+        *object_state = NULL;
    }
 }

@@ -853,7 +333,7 @@ resume_1:
    // Determine which OSDs contain this object and delete it
    op_data->prev_set = get_object_osd_set(pg, op_data->oid, pg.cur_set.data(), &op_data->object_state);
    // Submit 1 read to determine the actual version number
-    submit_primary_subops(SUBMIT_RMW_READ, UINT64_MAX, pg.pg_size, op_data->prev_set, cur_op);
+    submit_primary_subops(SUBMIT_RMW_READ, UINT64_MAX, op_data->prev_set, cur_op);
 resume_2:
    op_data->st = 2;
    return;
@@ -887,22 +367,21 @@ resume_5:
    else
    {
        remove_object_from_state(op_data->oid, op_data->object_state, pg);
+        free_object_state(pg, &op_data->object_state);
    }
    pg.total_count--;
-    object_id oid = op_data->oid;
+    osd_op_t *next_op = NULL;
+    auto next_it = pg.write_queue.find(op_data->oid);
+    if (next_it != pg.write_queue.end() && next_it->second == cur_op)
+    {
+        pg.write_queue.erase(next_it++);
+        if (next_it != pg.write_queue.end() && next_it->first == op_data->oid)
+            next_op = next_it->second;
+    }
    finish_op(cur_op, cur_op->req.rw.len);
-    // Continue other write operations to the same object
-    auto next_it = pg.write_queue.find(oid);
-    auto this_it = next_it;
-    if (this_it != pg.write_queue.end() && this_it->second == cur_op)
+    if (next_op)
    {
-        next_it++;
-        pg.write_queue.erase(this_it);
-        if (next_it != pg.write_queue.end() &&
-            next_it->first == oid)
-        {
-            osd_op_t *next_op = next_it->second;
+        // Continue next write to the same object
        continue_primary_write(next_op);
    }
-    }
 }
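The reordering in the hunk above removes the finished operation from the write queue and looks up the next operation for the same object before `finish_op` runs. A minimal, self-contained sketch of that pattern follows. It assumes the queue is a `std::multimap` keyed by object ID; `ObjectId`, `Op` and `dequeue_and_peek_next` are hypothetical stand-ins, not names from the Vitastor source. The sketch uses `lower_bound` to get the first entry for a key, whereas the diff uses `find`.

```cpp
#include <map>
#include <string>

// Hypothetical stand-ins for object_id and osd_op_t
using ObjectId = int;
struct Op
{
    std::string name;
};

// Removes `cur` from the head of its object's queue and returns the next
// queued operation for the same object, or nullptr if there is none.
// Nothing is erased when `cur` is not at the head of the queue for `oid`.
Op *dequeue_and_peek_next(std::multimap<ObjectId, Op*> & queue, ObjectId oid, Op *cur)
{
    Op *next = nullptr;
    auto it = queue.lower_bound(oid);
    if (it != queue.end() && it->first == oid && it->second == cur)
    {
        // erase() invalidates only the erased iterator, so advance it first
        queue.erase(it++);
        if (it != queue.end() && it->first == oid)
        {
            next = it->second;
        }
    }
    return next;
}
```

With this shape the caller can complete the current operation and only then resume `next`, without touching the queue or the completed operation again.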
--- a/src/osd_primary.h
+++ b/src/osd_primary.h
@@ -31,15 +31,31 @@ struct osd_primary_op_data_t
    uint64_t *prev_set = NULL;
    pg_osd_set_state_t *object_state = NULL;

+    union
+    {
+        struct
+        {
            // for sync. oops, requires freeing
-    std::vector<unstable_osd_num_t> *unstable_write_osds = NULL;
-    pool_pg_num_t *dirty_pgs = NULL;
-    int dirty_pg_count = 0;
-    osd_num_t *dirty_osds = NULL;
-    int dirty_osd_count = 0;
-    obj_ver_id *unstable_writes = NULL;
-    obj_ver_osd_t *copies_to_delete = NULL;
-    int copies_to_delete_count = 0;
+            std::vector<unstable_osd_num_t> *unstable_write_osds;
+            pool_pg_num_t *dirty_pgs;
+            int dirty_pg_count;
+            osd_num_t *dirty_osds;
+            int dirty_osd_count;
+            obj_ver_id *unstable_writes;
+            obj_ver_osd_t *copies_to_delete;
+            int copies_to_delete_count;
+        };
+        struct
+        {
+            // for read_bitmaps
+            void *snapshot_bitmaps;
+            inode_t *read_chain;
+            uint8_t *missing_flags;
+            int chain_size;
+            osd_chain_read_t *chain_reads;
+            int chain_read_count;
+        };
+    };
 };

 bool contains_osd(osd_num_t *osd_set, uint64_t size, osd_num_t osd_num);
--- a/src/osd_primary_chain.cpp
+++ b/src/osd_primary_chain.cpp
@@ -0,0 +1,554 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+
+#include "osd_primary.h"
+#include "allocator.h"
+
+void osd_t::continue_chained_read(osd_op_t *cur_op)
+{
+    osd_primary_op_data_t *op_data = cur_op->op_data;
+    auto & pg = pgs.at({ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num });
+    if (op_data->st == 1)
+        goto resume_1;
+    else if (op_data->st == 2)
+        goto resume_2;
+    else if (op_data->st == 3)
+        goto resume_3;
+    else if (op_data->st == 4)
+        goto resume_4;
+    cur_op->reply.rw.bitmap_len = 0;
+    for (int role = 0; role < op_data->pg_data_size; role++)
+    {
+        op_data->stripes[role].read_start = op_data->stripes[role].req_start;
+        op_data->stripes[role].read_end = op_data->stripes[role].req_end;
+    }
+resume_1:
+resume_2:
+    // Read bitmaps
+    if (read_bitmaps(cur_op, pg, 1) != 0)
+        return;
+    // Prepare & submit reads
+    if (submit_chained_read_requests(pg, cur_op) != 0)
+        return;
+    if (op_data->n_subops > 0)
+    {
+        // Wait for reads
+        op_data->st = 3;
+resume_3:
+        return;
+    }
+resume_4:
+    if (op_data->errors > 0)
+    {
+        free(op_data->chain_reads);
+        op_data->chain_reads = NULL;
+        finish_op(cur_op, op_data->epipe > 0 ? -EPIPE : -EIO);
+        return;
+    }
+    send_chained_read_results(pg, cur_op);
+    finish_op(cur_op, cur_op->req.rw.len);
+}
+
+int osd_t::read_bitmaps(osd_op_t *cur_op, pg_t & pg, int base_state)
+{
+    osd_primary_op_data_t *op_data = cur_op->op_data;
+    if (op_data->st == base_state)
+        goto resume_0;
+    else if (op_data->st == base_state+1)
+        goto resume_1;
+    if (pg.state == PG_ACTIVE && pg.scheme == POOL_SCHEME_REPLICATED)
+    {
+        // Happy path for clean replicated PGs (all bitmaps are available locally)
+        for (int chain_num = 0; chain_num < op_data->chain_size; chain_num++)
+        {
+            object_id cur_oid = { .inode = op_data->read_chain[chain_num], .stripe = op_data->oid.stripe };
+            auto vo_it = pg.ver_override.find(cur_oid);
+            auto read_version = (vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX);
+            // Read bitmap synchronously from the local database
+            bs->read_bitmap(cur_oid, read_version, op_data->snapshot_bitmaps + chain_num*clean_entry_bitmap_size, NULL);
+        }
+    }
+    else
+    {
+        if (submit_bitmap_subops(cur_op, pg) < 0)
+        {
+            // Failure
+            finish_op(cur_op, -EIO);
+            return -1;
+        }
+resume_0:
+        if (op_data->n_subops > 0)
+        {
+            // Wait for subops
+            op_data->st = base_state;
+            return 1;
+        }
+resume_1:
+        if (pg.scheme != POOL_SCHEME_REPLICATED)
+        {
+            for (int chain_num = 0; chain_num < op_data->chain_size; chain_num++)
+            {
+                // Check if we need to reconstruct any bitmaps
+                for (int i = 0; i < pg.pg_size; i++)
+                {
+                    if (op_data->missing_flags[chain_num*pg.pg_size + i])
+                    {
+                        osd_rmw_stripe_t local_stripes[pg.pg_size] = { 0 };
+                        for (int j = 0; j < pg.pg_size; j++)
+                        {
+                            local_stripes[j].missing = op_data->missing_flags[chain_num*pg.pg_size + j] && true;
+                            local_stripes[j].bmp_buf = op_data->snapshot_bitmaps + (chain_num*pg.pg_size + j)*clean_entry_bitmap_size;
+                            local_stripes[j].read_start = local_stripes[j].read_end = 1;
+                        }
+                        if (pg.scheme == POOL_SCHEME_XOR)
+                        {
+                            reconstruct_stripes_xor(local_stripes, pg.pg_size, clean_entry_bitmap_size);
+                        }
+                        else if (pg.scheme == POOL_SCHEME_JERASURE)
+                        {
+                            reconstruct_stripes_jerasure(local_stripes, pg.pg_size, pg.pg_data_size, clean_entry_bitmap_size);
+                        }
+                        break;
+                    }
+                }
+            }
+        }
+    }
+    return 0;
+}
+
+int osd_t::collect_bitmap_requests(osd_op_t *cur_op, pg_t & pg, std::vector<bitmap_request_t> & bitmap_requests)
+{
+    osd_primary_op_data_t *op_data = cur_op->op_data;
+    for (int chain_num = 0; chain_num < op_data->chain_size; chain_num++)
+    {
+        object_id cur_oid = { .inode = op_data->read_chain[chain_num], .stripe = op_data->oid.stripe };
+        auto vo_it = pg.ver_override.find(cur_oid);
+        uint64_t target_version = vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX;
+        pg_osd_set_state_t *object_state;
+        uint64_t* cur_set = get_object_osd_set(pg, cur_oid, pg.cur_set.data(), &object_state);
+        if (pg.scheme == POOL_SCHEME_REPLICATED)
+        {
+            osd_num_t read_target = 0;
+            for (int i = 0; i < pg.pg_size; i++)
+            {
+                if (cur_set[i] == this->osd_num || (cur_set[i] != 0 && read_target == 0))
+                {
+                    // Select local or any other available OSD for reading
+                    read_target = cur_set[i];
+                }
+            }
+            assert(read_target != 0);
+            bitmap_requests.push_back((bitmap_request_t){
+                .osd_num = read_target,
+                .oid = cur_oid,
+                .version = target_version,
+                .bmp_buf = op_data->snapshot_bitmaps + chain_num*clean_entry_bitmap_size,
+            });
+        }
+        else
+        {
+            osd_rmw_stripe_t local_stripes[pg.pg_size];
+            memcpy(local_stripes, op_data->stripes, sizeof(osd_rmw_stripe_t) * pg.pg_size);
+            if (extend_missing_stripes(local_stripes, cur_set, pg.pg_data_size, pg.pg_size) < 0)
+            {
+                free(op_data->snapshot_bitmaps);
+                return -1;
+            }
+            int need_at_least = 0;
+            for (int i = 0; i < pg.pg_size; i++)
+            {
+                if (local_stripes[i].read_end != 0 && cur_set[i] == 0)
+                {
+                    // We need this part of the bitmap, but it's unavailable
+                    need_at_least = pg.pg_data_size;
+                    op_data->missing_flags[chain_num*pg.pg_size + i] = 1;
+                }
+                else
+                {
+                    op_data->missing_flags[chain_num*pg.pg_size + i] = 0;
+                }
+            }
+            int found = 0;
+            for (int i = 0; i < pg.pg_size; i++)
+            {
+                if (cur_set[i] != 0 && (local_stripes[i].read_end != 0 || found < need_at_least))
+                {
+                    // Read part of the bitmap
+                    bitmap_requests.push_back((bitmap_request_t){
+                        .osd_num = cur_set[i],
+                        .oid = {
+                            .inode = cur_oid.inode,
+                            .stripe = cur_oid.stripe | i,
+                        },
+                        .version = target_version,
+                        .bmp_buf = op_data->snapshot_bitmaps + (chain_num*pg.pg_size + i)*clean_entry_bitmap_size,
+                    });
+                    found++;
+                }
+            }
+            // Already checked by extend_missing_stripes, so it's fine to use assert
+            assert(found >= need_at_least);
+        }
+    }
+    std::sort(bitmap_requests.begin(), bitmap_requests.end());
+    return 0;
+}
+
+int osd_t::submit_bitmap_subops(osd_op_t *cur_op, pg_t & pg)
+{
+    osd_primary_op_data_t *op_data = cur_op->op_data;
+    std::vector<bitmap_request_t> *bitmap_requests = new std::vector<bitmap_request_t>();
+    if (collect_bitmap_requests(cur_op, pg, *bitmap_requests) < 0)
+    {
+        // Don't leak the request list on failure
+        delete bitmap_requests;
+        return -1;
+    }
+    op_data->n_subops = 0;
+    for (int i = 0; i < bitmap_requests->size(); i++)
+    {
+        if ((i == bitmap_requests->size()-1 || (*bitmap_requests)[i+1].osd_num != (*bitmap_requests)[i].osd_num) &&
+            (*bitmap_requests)[i].osd_num != this->osd_num)
+        {
+            op_data->n_subops++;
+        }
+    }
+    if (op_data->n_subops)
+    {
+        op_data->fact_ver = 0;
+        op_data->done = op_data->errors = 0;
+        op_data->subops = new osd_op_t[op_data->n_subops];
+    }
+    for (int i = 0, subop_idx = 0, prev = 0; i < bitmap_requests->size(); i++)
+    {
+        if (i == bitmap_requests->size()-1 || (*bitmap_requests)[i+1].osd_num != (*bitmap_requests)[i].osd_num)
+        {
+            osd_num_t subop_osd_num = (*bitmap_requests)[i].osd_num;
+            if (subop_osd_num == this->osd_num)
+            {
+                // Read bitmap synchronously from the local database
+                for (int j = prev; j <= i; j++)
+                {
+                    bs->read_bitmap((*bitmap_requests)[j].oid, (*bitmap_requests)[j].version, (*bitmap_requests)[j].bmp_buf, NULL);
+                }
+            }
+            else
+            {
+                // Send to a remote OSD
+                osd_op_t *subop = op_data->subops+subop_idx;
+                subop->op_type = OSD_OP_OUT;
+                subop->peer_fd = c_cli.osd_peer_fds.at(subop_osd_num);
+                // FIXME: Use the pre-allocated buffer
+                subop->buf = malloc_or_die(sizeof(obj_ver_id)*(i+1-prev));
+                subop->req = (osd_any_op_t){
+                    .sec_read_bmp = {
+                        .header = {
+                            .magic = SECONDARY_OSD_OP_MAGIC,
+                            .id = c_cli.next_subop_id++,
+                            .opcode = OSD_OP_SEC_READ_BMP,
+                        },
+                        .len = sizeof(obj_ver_id)*(i+1-prev),
+                    }
+                };
+                obj_ver_id *ov = (obj_ver_id*)subop->buf;
+                for (int j = prev; j <= i; j++, ov++)
+                {
+                    ov->oid = (*bitmap_requests)[j].oid;
+                    ov->version = (*bitmap_requests)[j].version;
+                }
+                subop->callback = [cur_op, bitmap_requests, prev, i, this](osd_op_t *subop)
+                {
+                    int requested_count = subop->req.sec_read_bmp.len / sizeof(obj_ver_id);
+                    if (subop->reply.hdr.retval == requested_count * (8 + clean_entry_bitmap_size))
+                    {
+                        void *cur_buf = subop->buf + 8;
+                        for (int j = prev; j <= i; j++)
+                        {
+                            memcpy((*bitmap_requests)[j].bmp_buf, cur_buf, clean_entry_bitmap_size);
+                            cur_buf += 8 + clean_entry_bitmap_size;
+                        }
+                    }
+                    if ((cur_op->op_data->errors + cur_op->op_data->done + 1) >= cur_op->op_data->n_subops)
+                    {
+                        delete bitmap_requests;
+                    }
+                    handle_primary_subop(subop, cur_op);
+                };
+                c_cli.outbox_push(subop);
+                subop_idx++;
+            }
+            prev = i+1;
+        }
+    }
+    if (!op_data->n_subops)
+    {
+        delete bitmap_requests;
+    }
+    return 0;
+}
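`submit_bitmap_subops` relies on the request list being sorted by OSD, so each run of equal `osd_num` values becomes one batch: a local read loop or a single remote subop. The batching step alone can be sketched as below. `Request`, `Batch` and `group_by_osd` are hypothetical simplifications for illustration, and the ordering inside one OSD's batch is an assumption (the real `bitmap_request_t` comparison operator is not shown in this diff).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified request: only the target OSD and an object number
struct Request
{
    uint64_t osd_num;
    uint64_t object;
    bool operator < (const Request & other) const
    {
        return osd_num < other.osd_num || (osd_num == other.osd_num && object < other.object);
    }
};

// Half-open index range [begin, end) of requests addressed to one OSD
struct Batch
{
    uint64_t osd_num;
    size_t begin, end;
};

std::vector<Batch> group_by_osd(std::vector<Request> & requests)
{
    std::sort(requests.begin(), requests.end());
    std::vector<Batch> batches;
    size_t prev = 0;
    for (size_t i = 0; i < requests.size(); i++)
    {
        // A batch ends at the last request or where the next OSD differs
        if (i == requests.size()-1 || requests[i+1].osd_num != requests[i].osd_num)
        {
            batches.push_back(Batch{ requests[i].osd_num, prev, i+1 });
            prev = i+1;
        }
    }
    return batches;
}
```

Each returned `Batch` corresponds to the `prev..i` index range that the function above turns into one subop.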
+
+std::vector<osd_chain_read_t> osd_t::collect_chained_read_requests(osd_op_t *cur_op)
+{
+    osd_primary_op_data_t *op_data = cur_op->op_data;
+    std::vector<osd_chain_read_t> chain_reads;
+    int stripe_count = (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : op_data->pg_size);
+    memset(op_data->stripes[0].bmp_buf, 0, stripe_count * clean_entry_bitmap_size);
+    uint8_t *global_bitmap = (uint8_t*)op_data->stripes[0].bmp_buf;
+    // We always use at most 1 read request per layer
+    for (int chain_pos = 0; chain_pos < op_data->chain_size; chain_pos++)
+    {
+        uint8_t *part_bitmap = ((uint8_t*)op_data->snapshot_bitmaps) + chain_pos*stripe_count*clean_entry_bitmap_size;
+        int start = (cur_op->req.rw.offset - op_data->oid.stripe)/bs_bitmap_granularity;
+        int end = start + cur_op->req.rw.len/bs_bitmap_granularity;
+        // Skip unneeded part in the beginning
+        while (start < end && (
+            ((global_bitmap[start>>3] >> (start&7)) & 1) ||
+            !((part_bitmap[start>>3] >> (start&7)) & 1)))
+        {
+            start++;
+        }
+        // Skip unneeded part in the end
+        while (start < end && (
+            ((global_bitmap[(end-1)>>3] >> ((end-1)&7)) & 1) ||
+            !((part_bitmap[(end-1)>>3] >> ((end-1)&7)) & 1)))
+        {
+            end--;
+        }
+        if (start < end)
+        {
+            // Copy (OR) bits in between
+            int cur = start;
+            for (; cur < end && (cur & 0x7); cur++)
+            {
+                global_bitmap[cur>>3] = global_bitmap[cur>>3] | (part_bitmap[cur>>3] & (1 << (cur&7)));
+            }
+            for (; cur <= end-8; cur += 8)
+            {
+                global_bitmap[cur>>3] = global_bitmap[cur>>3] | part_bitmap[cur>>3];
+            }
+            for (; cur < end; cur++)
+            {
+                global_bitmap[cur>>3] = global_bitmap[cur>>3] | (part_bitmap[cur>>3] & (1 << (cur&7)));
+            }
+            // Add request
+            chain_reads.push_back((osd_chain_read_t){
+                .chain_pos = chain_pos,
+                .inode = op_data->read_chain[chain_pos],
+                .offset = start*bs_bitmap_granularity,
+                .len = (end-start)*bs_bitmap_granularity,
+            });
+        }
+    }
+    return chain_reads;
+}
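The planning loop in `collect_chained_read_requests` can be illustrated in isolation. The sketch below assumes that layer 0 is the topmost layer of the chain, that every bitmap has the same size and that each bit stands for one granule. `LayerRead`, `get_bit` and `plan_layer_reads` are hypothetical names, and the bit-by-bit OR replaces the byte-wise fast path used in the diff.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical result type: one contiguous granule range to read from one layer
struct LayerRead
{
    int layer;
    int start, end; // half-open range [start, end)
};

static bool get_bit(const std::vector<uint8_t> & bmp, int pos)
{
    return (bmp[pos >> 3] >> (pos & 7)) & 1;
}

// layers[0] is assumed to be the topmost layer; a set bit means that
// the granule is present in that layer.
std::vector<LayerRead> plan_layer_reads(const std::vector<std::vector<uint8_t>> & layers, int start, int end)
{
    std::vector<LayerRead> reads;
    if (layers.empty())
    {
        return reads;
    }
    // Granules already provided by upper layers
    std::vector<uint8_t> covered(layers[0].size(), 0);
    for (int layer = 0; layer < (int)layers.size(); layer++)
    {
        const std::vector<uint8_t> & bmp = layers[layer];
        int s = start, e = end;
        // Trim leading and trailing granules that are covered or absent here
        while (s < e && (get_bit(covered, s) || !get_bit(bmp, s)))
            s++;
        while (s < e && (get_bit(covered, e-1) || !get_bit(bmp, e-1)))
            e--;
        if (s < e)
        {
            // Mark everything this layer provides inside the range as covered
            for (int cur = s; cur < e; cur++)
            {
                if (get_bit(bmp, cur))
                    covered[cur >> 3] = covered[cur >> 3] | (uint8_t)(1 << (cur & 7));
            }
            reads.push_back(LayerRead{ layer, s, e });
        }
    }
    return reads;
}
```

Because each layer yields at most one contiguous read, a lower layer's range may span granules that an upper layer already provides, so whoever merges the results has to prefer the upper layer's data for those granules.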
+
+int osd_t::submit_chained_read_requests(pg_t & pg, osd_op_t *cur_op)
+{
+    // Decide which parts of which objects we need to read based on bitmaps
+    osd_primary_op_data_t *op_data = cur_op->op_data;
+    auto chain_reads = collect_chained_read_requests(cur_op);
+    int stripe_count = (pg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size);
+    op_data->chain_read_count = chain_reads.size();
+    op_data->chain_reads = (osd_chain_read_t*)calloc_or_die(
+        1, sizeof(osd_chain_read_t) * chain_reads.size()
+        + sizeof(osd_rmw_stripe_t) * stripe_count * op_data->chain_size
+    );
+    osd_rmw_stripe_t *chain_stripes = (osd_rmw_stripe_t*)(
+        ((void*)op_data->chain_reads) + sizeof(osd_chain_read_t) * op_data->chain_read_count
+    );
+    // Now process each subrequest as a separate read, including reconstruction if needed
+    // Prepare reads
+    int n_subops = 0;
+    uint64_t read_buffer_size = 0;
+    for (int cri = 0; cri < chain_reads.size(); cri++)
+    {
+        op_data->chain_reads[cri] = chain_reads[cri];
+        object_id cur_oid = { .inode = chain_reads[cri].inode, .stripe = op_data->oid.stripe };
+        // FIXME: maybe introduce split_read_stripes to shorten these lines and to remove read_start=req_start
+        osd_rmw_stripe_t *stripes = chain_stripes + chain_reads[cri].chain_pos*stripe_count;
+        split_stripes(pg.pg_data_size, bs_block_size, chain_reads[cri].offset, chain_reads[cri].len, stripes);
+        if (op_data->scheme == POOL_SCHEME_REPLICATED && !stripes[0].req_end)
+        {
+            continue;
+        }
+        for (int role = 0; role < op_data->pg_data_size; role++)
+        {
+            stripes[role].read_start = stripes[role].req_start;
+            stripes[role].read_end = stripes[role].req_end;
+        }
+        uint64_t *cur_set = pg.cur_set.data();
+        if (pg.state != PG_ACTIVE && op_data->scheme != POOL_SCHEME_REPLICATED)
+        {
+            pg_osd_set_state_t *object_state;
+            cur_set = get_object_osd_set(pg, cur_oid, pg.cur_set.data(), &object_state);
+            if (extend_missing_stripes(stripes, cur_set, pg.pg_data_size, pg.pg_size) < 0)
+            {
+                free(op_data->chain_reads);
+                op_data->chain_reads = NULL;
+                finish_op(cur_op, -EIO);
+                return -1;
+            }
+            op_data->degraded = 1;
+        }
+        if (op_data->scheme == POOL_SCHEME_REPLICATED)
+        {
+            n_subops++;
+            read_buffer_size += stripes[0].read_end - stripes[0].read_start;
+        }
+        else
+        {
+            for (int role = 0; role < pg.pg_size; role++)
+            {
+                if (stripes[role].read_end > 0 && cur_set[role] != 0)
+                    n_subops++;
+                if (stripes[role].read_end > 0)
+                    read_buffer_size += stripes[role].read_end - stripes[role].read_start;
+            }
+        }
+    }
+    cur_op->buf = memalign_or_die(MEM_ALIGNMENT, read_buffer_size);
+    void *cur_buf = cur_op->buf;
+    for (int cri = 0; cri < chain_reads.size(); cri++)
+    {
+        osd_rmw_stripe_t *stripes = chain_stripes + chain_reads[cri].chain_pos*stripe_count;
+        for (int role = 0; role < stripe_count; role++)
+        {
+            if (stripes[role].read_end > 0)
+            {
+                stripes[role].read_buf = cur_buf;
+                stripes[role].bmp_buf = op_data->snapshot_bitmaps + (chain_reads[cri].chain_pos*stripe_count + role)*clean_entry_bitmap_size;
+                cur_buf += stripes[role].read_end - stripes[role].read_start;
+            }
+        }
+    }
+    // Submit all reads
+    op_data->fact_ver = UINT64_MAX;
+    op_data->done = op_data->errors = 0;
+    op_data->n_subops = n_subops;
+    if (!n_subops)
+    {
+        return 0;
+    }
+    op_data->subops = new osd_op_t[n_subops];
+    int cur_subops = 0;
+    for (int cri = 0; cri < chain_reads.size(); cri++)
+    {
+        osd_rmw_stripe_t *stripes = chain_stripes + chain_reads[cri].chain_pos*stripe_count;
+        if (op_data->scheme == POOL_SCHEME_REPLICATED && !stripes[0].req_end)
+        {
+            continue;
+        }
+        object_id cur_oid = { .inode = chain_reads[cri].inode, .stripe = op_data->oid.stripe };
+        auto vo_it = pg.ver_override.find(cur_oid);
+        uint64_t target_ver = vo_it != pg.ver_override.end() ? vo_it->second : UINT64_MAX;
+        uint64_t *cur_set = pg.cur_set.data();
+        if (pg.state != PG_ACTIVE && op_data->scheme != POOL_SCHEME_REPLICATED)
+        {
+            pg_osd_set_state_t *object_state;
+            cur_set = get_object_osd_set(pg, cur_oid, pg.cur_set.data(), &object_state);
+        }
+        int zero_read = -1;
+        if (op_data->scheme == POOL_SCHEME_REPLICATED)
+        {
+            for (int role = 0; role < op_data->pg_size; role++)
+                if (cur_set[role] == this->osd_num || zero_read == -1)
+                    zero_read = role;
+        }
+        cur_subops += submit_primary_subop_batch(SUBMIT_READ, chain_reads[cri].inode, target_ver, stripes, cur_set, cur_op, cur_subops, zero_read);
+    }
+    assert(cur_subops == n_subops);
+    return 0;
+}
+
+void osd_t::send_chained_read_results(pg_t & pg, osd_op_t *cur_op)
+{
+    osd_primary_op_data_t *op_data = cur_op->op_data;
+    int stripe_count = (pg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size);
+    osd_rmw_stripe_t *chain_stripes = (osd_rmw_stripe_t*)(
+        ((void*)op_data->chain_reads) + sizeof(osd_chain_read_t) * op_data->chain_read_count
+    );
+    // Reconstruct parts if needed
+    if (op_data->degraded)
+    {
+        int stripe_count = (pg.scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size);
+        for (int cri = 0; cri < op_data->chain_read_count; cri++)
+        {
+            // Reconstruct missing stripes
+            osd_rmw_stripe_t *stripes = chain_stripes + op_data->chain_reads[cri].chain_pos*stripe_count;
+            if (op_data->scheme == POOL_SCHEME_XOR)
+            {
+                reconstruct_stripes_xor(stripes, pg.pg_size, clean_entry_bitmap_size);
+            }
+            else if (op_data->scheme == POOL_SCHEME_JERASURE)
+            {
+                reconstruct_stripes_jerasure(stripes, pg.pg_size, pg.pg_data_size, clean_entry_bitmap_size);
+            }
+        }
+    }
+    // Send bitmap
+    cur_op->reply.rw.bitmap_len = op_data->pg_data_size * clean_entry_bitmap_size;
+    cur_op->iov.push_back(op_data->stripes[0].bmp_buf, cur_op->reply.rw.bitmap_len);
+    // And finally compose the result
+    uint64_t sent = 0;
+    int prev_pos = 0, pos = 0;
+    bool prev_set = false;
+    int prev = (cur_op->req.rw.offset - op_data->oid.stripe) / bs_bitmap_granularity;
+    int end = prev + cur_op->req.rw.len/bs_bitmap_granularity;
+    int cur = prev;
+    while (cur <= end)
+    {
+        bool has_bit = false;
+        if (cur < end)
+        {
+            for (pos = 0; pos < op_data->chain_size; pos++)
+            {
+                has_bit = (((uint8_t*)op_data->snapshot_bitmaps)[pos*stripe_count*clean_entry_bitmap_size + cur/8] >> (cur%8)) & 1;
+                if (has_bit)
+                    break;
+            }
+        }
+        if (has_bit != prev_set || pos != prev_pos || cur == end)
+        {
+            if (cur > prev)
+            {
+                // Send buffer in parts to avoid copying
+                if (!prev_set)
+                {
+                    while ((cur-prev) > zero_buffer_size/bs_bitmap_granularity)
+                    {
+                        cur_op->iov.push_back(zero_buffer, zero_buffer_size);
+                        sent += zero_buffer_size;
+                        prev += zero_buffer_size/bs_bitmap_granularity;
+                    }
+                    cur_op->iov.push_back(zero_buffer, (cur-prev)*bs_bitmap_granularity);
+                    sent += (cur-prev)*bs_bitmap_granularity;
+                }
+                else
+                {
+                    osd_rmw_stripe_t *stripes = chain_stripes + prev_pos*stripe_count;
+                    while (cur > prev)
+                    {
+                        int role = prev*bs_bitmap_granularity/bs_block_size;
+                        int role_start = prev*bs_bitmap_granularity - role*bs_block_size;
+                        int role_end = cur*bs_bitmap_granularity - role*bs_block_size;
+                        if (role_end > bs_block_size)
+                            role_end = bs_block_size;
+                        assert(stripes[role].read_buf);
+                        cur_op->iov.push_back(
+                            stripes[role].read_buf + (role_start - stripes[role].read_start),
+                            role_end - role_start
+                        );
+                        sent += role_end - role_start;
+                        prev += (role_end - role_start)/bs_bitmap_granularity;
+                    }
+                }
+            }
+            prev = cur;
+            prev_pos = pos;
+            prev_set = has_bit;
+        }
+        cur++;
+    }
+    assert(sent == cur_op->req.rw.len);
+    free(op_data->chain_reads);
+    op_data->chain_reads = NULL;
+}
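
The composition loop in send_chained_read_results() above walks the requested range one bitmap granule at a time, picks the first chain layer whose bitmap has the bit set, and merges neighbouring granules served by the same layer into a single iov entry. The standalone sketch below shows only that coalescing idea. The names `run_t` and `coalesce_runs` are hypothetical and do not exist in the patch, and the sketch assumes one bitmap per layer with position 0 consulted first, rather than the per-stripe bitmap layout used by the OSD.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type: one contiguous run of granules in the reply
struct run_t
{
    int layer;      // chain position that supplies the data, or -1 for zeroes
    int start, end; // granule range [start, end)
};

// layers[pos] is the bitmap of chain position pos, one bit per granule,
// least significant bit first. Position 0 is consulted first (assumption).
std::vector<run_t> coalesce_runs(const std::vector<std::vector<uint8_t>> & layers, int start, int end)
{
    assert(start >= 0 && start <= end);
    std::vector<run_t> runs;
    for (int cur = start; cur < end; cur++)
    {
        // Find the first layer that has this granule
        int layer = -1;
        for (size_t pos = 0; pos < layers.size(); pos++)
        {
            if ((layers[pos][cur/8] >> (cur%8)) & 1)
            {
                layer = (int)pos;
                break;
            }
        }
        // Extend the previous run if it comes from the same source
        if (!runs.empty() && runs.back().layer == layer)
            runs.back().end = cur+1;
        else
            runs.push_back({ layer, cur, cur+1 });
    }
    return runs;
}
```

Each resulting run corresponds to one push into the reply iov: a slice of the layer's read buffer when `layer >= 0`, or a slice of the shared zero buffer otherwise. The real function additionally splits data runs at block boundaries and zero runs at zero_buffer_size.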
--- a/src/osd_primary_subops.cpp
+++ b/src/osd_primary_subops.cpp
@ -36,6 +36,29 @@ void osd_t::autosync()
 void osd_t::finish_op(osd_op_t *cur_op, int retval)
 {
    inflight_ops--;
+    if (cur_op->req.hdr.opcode == OSD_OP_READ ||
+        cur_op->req.hdr.opcode == OSD_OP_WRITE ||
+        cur_op->req.hdr.opcode == OSD_OP_DELETE)
+    {
+        // Track inode statistics
+        if (!cur_op->tv_end.tv_sec)
+        {
+            clock_gettime(CLOCK_REALTIME, &cur_op->tv_end);
+        }
+        uint64_t usec = (
+            (cur_op->tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
+            (cur_op->tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
+        );
+        int inode_st_op = cur_op->req.hdr.opcode == OSD_OP_DELETE
+            ? INODE_STATS_DELETE
+            : (cur_op->req.hdr.opcode == OSD_OP_READ ? INODE_STATS_READ : INODE_STATS_WRITE);
+        inode_stats[cur_op->req.rw.inode].op_count[inode_st_op]++;
+        inode_stats[cur_op->req.rw.inode].op_sum[inode_st_op] += usec;
+        if (cur_op->req.hdr.opcode == OSD_OP_DELETE)
+            inode_stats[cur_op->req.rw.inode].op_bytes[inode_st_op] += cur_op->op_data->pg_data_size * bs_block_size;
+        else
+            inode_stats[cur_op->req.rw.inode].op_bytes[inode_st_op] += cur_op->req.rw.len;
+    }
    if (cur_op->op_data)
    {
        if (cur_op->op_data->pg_num > 0)
@ -43,17 +66,16 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
            auto & pg = pgs.at({ .pool_id = INODE_POOL(cur_op->op_data->oid.inode), .pg_num = cur_op->op_data->pg_num });
            pg.inflight--;
            assert(pg.inflight >= 0);
-            if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch &&
-                // We must either forget all PG's unstable writes or wait for it to become clean
-                dirty_pgs.find({ .pool_id = pg.pool_id, .pg_num = pg.pg_num }) == dirty_pgs.end())
+            if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch)
            {
                finish_stop_pg(pg);
            }
+            else if ((pg.state & PG_REPEERING) && pg.inflight == 0 && !pg.flush_batch)
+            {
+                start_pg_peering(pg);
+            }
        }
        assert(!cur_op->op_data->subops);
-        assert(!cur_op->op_data->unstable_write_osds);
-        assert(!cur_op->op_data->unstable_writes);
-        assert(!cur_op->op_data->dirty_pgs);
        free(cur_op->op_data);
        cur_op->op_data = NULL;
    }
@ -64,7 +86,7 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
    }
    else
    {
-        // FIXME add separate magic number
+        // FIXME add separate magic number for primary ops
        auto cl_it = c_cli.clients.find(cur_op->peer_fd);
        if (cl_it != c_cli.clients.end())
        {
@ -81,7 +103,7 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
    }
 }

-void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, int pg_size, const uint64_t* osd_set, osd_op_t *cur_op)
+void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, const uint64_t* osd_set, osd_op_t *cur_op)
 {
    bool wr = submit_type == SUBMIT_WRITE;
    osd_primary_op_data_t *op_data = cur_op->op_data;
@ -89,32 +111,34 @@ void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, int pg_s
    bool rep = op_data->scheme == POOL_SCHEME_REPLICATED;
    // Allocate subops
    int n_subops = 0, zero_read = -1;
-    for (int role = 0; role < pg_size; role++)
+    for (int role = 0; role < op_data->pg_size; role++)
    {
        if (osd_set[role] == this->osd_num || osd_set[role] != 0 && zero_read == -1)
-        {
            zero_read = role;
-        }
        if (osd_set[role] != 0 && (wr || !rep && stripes[role].read_end != 0))
-        {
            n_subops++;
    }
-    }
    if (!n_subops && (submit_type == SUBMIT_RMW_READ || rep))
-    {
        n_subops = 1;
-    }
    else
-    {
        zero_read = -1;
-    }
    osd_op_t *subops = new osd_op_t[n_subops];
    op_data->fact_ver = 0;
    op_data->done = op_data->errors = 0;
    op_data->n_subops = n_subops;
    op_data->subops = subops;
-    int i = 0;
-    for (int role = 0; role < pg_size; role++)
+    int sent = submit_primary_subop_batch(submit_type, op_data->oid.inode, op_version, op_data->stripes, osd_set, cur_op, 0, zero_read);
+    assert(sent == n_subops);
+}
+
+int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t op_version,
+    osd_rmw_stripe_t *stripes, const uint64_t* osd_set, osd_op_t *cur_op, int subop_idx, int zero_read)
+{
+    bool wr = submit_type == SUBMIT_WRITE;
+    osd_primary_op_data_t *op_data = cur_op->op_data;
+    bool rep = op_data->scheme == POOL_SCHEME_REPLICATED;
+    int i = subop_idx;
+    for (int role = 0; role < op_data->pg_size; role++)
    {
        // We always submit zero-length writes to all replicas, even if the stripe is not modified
        if (!(wr || !rep && stripes[role].read_end != 0 || zero_read == role))
@ -125,89 +149,90 @@ void osd_t::submit_primary_subops(int submit_type, uint64_t op_version, int pg_s
        if (role_osd_num != 0)
        {
            int stripe_num = rep ? 0 : role;
+            osd_op_t *subop = op_data->subops + i;
            if (role_osd_num == this->osd_num)
            {
-                clock_gettime(CLOCK_REALTIME, &subops[i].tv_begin);
-                subops[i].op_type = (uint64_t)cur_op;
-                subops[i].bs_op = new blockstore_op_t({
+                clock_gettime(CLOCK_REALTIME, &subop->tv_begin);
+                subop->op_type = (uint64_t)cur_op;
+                subop->bitmap = stripes[stripe_num].bmp_buf;
+                subop->bitmap_len = clean_entry_bitmap_size;
+                subop->bs_op = new blockstore_op_t({
                    .opcode = (uint64_t)(wr ? (rep ? BS_OP_WRITE_STABLE : BS_OP_WRITE) : BS_OP_READ),
-                    .callback = [subop = &subops[i], this](blockstore_op_t *bs_subop)
+                    .callback = [subop, this](blockstore_op_t *bs_subop)
                    {
                        handle_primary_bs_subop(subop);
                    },
                    .oid = {
-                        .inode = op_data->oid.inode,
+                        .inode = inode,
                        .stripe = op_data->oid.stripe | stripe_num,
                    },
                    .version = op_version,
                    .offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start,
                    .len = wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start,
                    .buf = wr ? stripes[stripe_num].write_buf : stripes[stripe_num].read_buf,
+                    .bitmap = stripes[stripe_num].bmp_buf,
                });
 #ifdef OSD_DEBUG
                printf(
                    "Submit %s to local: %lx:%lx v%lu %u-%u\n", wr ? "write" : "read",
-                    op_data->oid.inode, op_data->oid.stripe | stripe_num, op_version,
-                    subops[i].bs_op->offset, subops[i].bs_op->len
+                    inode, op_data->oid.stripe | stripe_num, op_version,
+                    subop->bs_op->offset, subop->bs_op->len
                );
 #endif
-                bs->enqueue_op(subops[i].bs_op);
+                bs->enqueue_op(subop->bs_op);
            }
            else
            {
-                subops[i].op_type = OSD_OP_OUT;
-                subops[i].peer_fd = c_cli.osd_peer_fds.at(role_osd_num);
-                subops[i].req.sec_rw = {
+                subop->op_type = OSD_OP_OUT;
+                subop->peer_fd = c_cli.osd_peer_fds.at(role_osd_num);
+                subop->bitmap = stripes[stripe_num].bmp_buf;
+                subop->bitmap_len = clean_entry_bitmap_size;
+                subop->req.sec_rw = {
                    .header = {
                        .magic = SECONDARY_OSD_OP_MAGIC,
                        .id = c_cli.next_subop_id++,
                        .opcode = (uint64_t)(wr ? (rep ? OSD_OP_SEC_WRITE_STABLE : OSD_OP_SEC_WRITE) : OSD_OP_SEC_READ),
                    },
                    .oid = {
-                        .inode = op_data->oid.inode,
+                        .inode = inode,
                        .stripe = op_data->oid.stripe | stripe_num,
                    },
                    .version = op_version,
                    .offset = wr ? stripes[stripe_num].write_start : stripes[stripe_num].read_start,
                    .len = wr ? stripes[stripe_num].write_end - stripes[stripe_num].write_start : stripes[stripe_num].read_end - stripes[stripe_num].read_start,
+                    .attr_len = wr ? clean_entry_bitmap_size : 0,
                };
 #ifdef OSD_DEBUG
                printf(
                    "Submit %s to osd %lu: %lx:%lx v%lu %u-%u\n", wr ? "write" : "read", role_osd_num,
-                    op_data->oid.inode, op_data->oid.stripe | stripe_num, op_version,
-                    subops[i].req.sec_rw.offset, subops[i].req.sec_rw.len
+                    inode, op_data->oid.stripe | stripe_num, op_version,
+                    subop->req.sec_rw.offset, subop->req.sec_rw.len
                );
 #endif
                if (wr)
                {
                    if (stripes[stripe_num].write_end > stripes[stripe_num].write_start)
                    {
-                        subops[i].iov.push_back(stripes[stripe_num].write_buf, stripes[stripe_num].write_end - stripes[stripe_num].write_start);
+                        subop->iov.push_back(stripes[stripe_num].write_buf, stripes[stripe_num].write_end - stripes[stripe_num].write_start);
                    }
                }
                else
                {
                    if (stripes[stripe_num].read_end > stripes[stripe_num].read_start)
                    {
-                        subops[i].iov.push_back(stripes[stripe_num].read_buf, stripes[stripe_num].read_end - stripes[stripe_num].read_start);
+                        subop->iov.push_back(stripes[stripe_num].read_buf, stripes[stripe_num].read_end - stripes[stripe_num].read_start);
                    }
                }
-                subops[i].callback = [cur_op, this](osd_op_t *subop)
+                subop->callback = [cur_op, this](osd_op_t *subop)
                {
-                    int fail_fd = subop->req.hdr.opcode == OSD_OP_SEC_WRITE &&
-                        subop->reply.hdr.retval != subop->req.sec_rw.len ? subop->peer_fd : -1;
                    handle_primary_subop(subop, cur_op);
-                    if (fail_fd >= 0)
-                    {
-                        // write operation failed, drop the connection
-                        c_cli.stop_client(fail_fd);
-                    }
                };
-                c_cli.outbox_push(&subops[i]);
+                c_cli.outbox_push(subop);
            }
            i++;
        }
    }
+    return i-subop_idx;
 }

 static uint64_t bs_op_to_osd_op[] = {
@ -247,6 +272,7 @@ void osd_t::handle_primary_bs_subop(osd_op_t *subop)
    }
    delete bs_op;
    subop->bs_op = NULL;
+    subop->peer_fd = -1;
    handle_primary_subop(subop, cur_op);
 }

@ -277,8 +303,13 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
 {
    uint64_t opcode = subop->req.hdr.opcode;
    int retval = subop->reply.hdr.retval;
-    int expected = opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE
-        || opcode == OSD_OP_SEC_WRITE_STABLE ? subop->req.sec_rw.len : 0;
+    int expected;
+    if (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE || opcode == OSD_OP_SEC_WRITE_STABLE)
+        expected = subop->req.sec_rw.len;
+    else if (opcode == OSD_OP_SEC_READ_BMP)
+        expected = subop->req.sec_read_bmp.len / sizeof(obj_ver_id) * (8 + clean_entry_bitmap_size);
+    else
+        expected = 0;
    osd_primary_op_data_t *op_data = cur_op->op_data;
    if (retval != expected)
    {
@ -288,6 +319,11 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
            op_data->epipe++;
        }
        op_data->errors++;
+        if (subop->peer_fd >= 0)
+        {
+            // Drop connection on any error
+            c_cli.stop_client(subop->peer_fd);
+        }
    }
    else
    {
@ -300,6 +336,8 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
                ? c_cli.clients[subop->peer_fd]->osd_num : osd_num;
            printf("subop %lu from osd %lu: version = %lu\n", opcode, peer_osd, version);
 #endif
+            if (op_data->fact_ver != UINT64_MAX)
+            {
                if (op_data->fact_ver != 0 && op_data->fact_ver != version)
                {
                    throw std::runtime_error(
@ -310,6 +348,7 @@ void osd_t::handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op)
                op_data->fact_ver = version;
            }
        }
+    }
    if ((op_data->errors + op_data->done) >= op_data->n_subops)
    {
        delete[] op_data->subops;
@ -427,7 +466,7 @@ void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_
        {
            subops[i].op_type = OSD_OP_OUT;
            subops[i].peer_fd = c_cli.osd_peer_fds.at(chunk.osd_num);
-            subops[i].req.sec_del = {
+            subops[i].req = (osd_any_op_t){ .sec_del = {
                .header = {
                    .magic = SECONDARY_OSD_OP_MAGIC,
                    .id = c_cli.next_subop_id++,
@ -435,23 +474,17 @@ void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_
                },
                .oid = chunk.oid,
                .version = chunk.version,
-            };
+            } };
            subops[i].callback = [cur_op, this](osd_op_t *subop)
            {
-                int fail_fd = subop->reply.hdr.retval != 0 ? subop->peer_fd : -1;
                handle_primary_subop(subop, cur_op);
-                if (fail_fd >= 0)
-                {
-                    // delete operation failed, drop the connection
-                    c_cli.stop_client(fail_fd);
-                }
            };
            c_cli.outbox_push(&subops[i]);
        }
    }
 }

-void osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
+int osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
 {
    osd_primary_op_data_t *op_data = cur_op->op_data;
    int n_osds = op_data->dirty_osd_count;
@ -459,6 +492,7 @@ void osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
    op_data->done = op_data->errors = 0;
    op_data->n_subops = n_osds;
    op_data->subops = subops;
+    std::map<uint64_t, int>::iterator peer_it;
    for (int i = 0; i < n_osds; i++)
    {
        osd_num_t sync_osd = op_data->dirty_osds[i];
@ -475,30 +509,35 @@ void osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
            });
            bs->enqueue_op(subops[i].bs_op);
        }
-        else
+        else if ((peer_it = c_cli.osd_peer_fds.find(sync_osd)) != c_cli.osd_peer_fds.end())
        {
            subops[i].op_type = OSD_OP_OUT;
-            subops[i].peer_fd = c_cli.osd_peer_fds.at(sync_osd);
-            subops[i].req.sec_sync = {
+            subops[i].peer_fd = peer_it->second;
+            subops[i].req = (osd_any_op_t){ .sec_sync = {
                .header = {
                    .magic = SECONDARY_OSD_OP_MAGIC,
                    .id = c_cli.next_subop_id++,
                    .opcode = OSD_OP_SEC_SYNC,
                },
-            };
+            } };
            subops[i].callback = [cur_op, this](osd_op_t *subop)
            {
-                int fail_fd = subop->reply.hdr.retval != 0 ? subop->peer_fd : -1;
                handle_primary_subop(subop, cur_op);
-                if (fail_fd >= 0)
-                {
-                    // sync operation failed, drop the connection
-                    c_cli.stop_client(fail_fd);
-                }
            };
            c_cli.outbox_push(&subops[i]);
        }
+        else
+        {
+            op_data->done++;
        }
+    }
+    if (op_data->done >= op_data->n_subops)
+    {
+        delete[] op_data->subops;
+        op_data->subops = NULL;
+        return 0;
+    }
+    return 1;
 }

 void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
@ -531,24 +570,18 @@ void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
        {
            subops[i].op_type = OSD_OP_OUT;
            subops[i].peer_fd = c_cli.osd_peer_fds.at(stab_osd.osd_num);
-            subops[i].req.sec_stab = {
+            subops[i].req = (osd_any_op_t){ .sec_stab = {
                .header = {
                    .magic = SECONDARY_OSD_OP_MAGIC,
                    .id = c_cli.next_subop_id++,
                    .opcode = OSD_OP_SEC_STABILIZE,
                },
                .len = (uint64_t)(stab_osd.len * sizeof(obj_ver_id)),
-            };
+            } };
            subops[i].iov.push_back(op_data->unstable_writes + stab_osd.start, stab_osd.len * sizeof(obj_ver_id));
            subops[i].callback = [cur_op, this](osd_op_t *subop)
            {
-                int fail_fd = subop->reply.hdr.retval != 0 ? subop->peer_fd : -1;
                handle_primary_subop(subop, cur_op);
-                if (fail_fd >= 0)
-                {
-                    // sync operation failed, drop the connection
-                    c_cli.stop_client(fail_fd);
-                }
            };
            c_cli.outbox_push(&subops[i]);
        }
@ -566,7 +599,7 @@ void osd_t::pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid,
        return;
    }
    std::vector<osd_op_t*> cancel_ops;
-    while (it != pg.write_queue.end())
+    while (it != pg.write_queue.end() && it->first == oid)
    {
        cancel_ops.push_back(it->second);
        it++;
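
The inode statistics added to finish_op() in this file derive the operation latency from two timespec values, mixing a seconds difference with a nanoseconds difference that can be negative. A minimal sketch of the same arithmetic as a standalone helper follows. The name `elapsed_usec` is hypothetical and is not part of the patch, and the signed return type is a choice made for the sketch (the patch stores the result in a uint64_t).

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

// Hypothetical helper: microseconds elapsed between two timestamps.
// The nanosecond difference may be negative when the end timestamp has
// crossed a second boundary; the seconds term compensates for it.
int64_t elapsed_usec(const timespec & begin, const timespec & end)
{
    return (int64_t)(end.tv_sec - begin.tv_sec)*1000000
        + (int64_t)(end.tv_nsec - begin.tv_nsec)/1000;
}
```

With begin = 1 s 900000000 ns and end = 2 s 100000000 ns the seconds term gives 1000000 and the nanoseconds term gives -800000, so the helper returns 200000. The patch takes both timestamps with CLOCK_REALTIME, so a wall-clock adjustment between them would distort the measured value.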
--- a/src/osd_primary_sync.cpp
+++ b/src/osd_primary_sync.cpp
@ -0,0 +1,265 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+
+#include "osd_primary.h"
+
+// Save and clear unstable_writes -> SYNC all -> STABLE all
+void osd_t::continue_primary_sync(osd_op_t *cur_op)
+{
+    if (!cur_op->op_data)
+    {
+        cur_op->op_data = (osd_primary_op_data_t*)calloc_or_die(1, sizeof(osd_primary_op_data_t));
+    }
+    osd_primary_op_data_t *op_data = cur_op->op_data;
+    if (op_data->st == 1)      goto resume_1;
+    else if (op_data->st == 2) goto resume_2;
+    else if (op_data->st == 3) goto resume_3;
+    else if (op_data->st == 4) goto resume_4;
+    else if (op_data->st == 5) goto resume_5;
+    else if (op_data->st == 6) goto resume_6;
+    else if (op_data->st == 7) goto resume_7;
+    else if (op_data->st == 8) goto resume_8;
+    assert(op_data->st == 0);
+    if (syncs_in_progress.size() > 0)
+    {
+        // Wait for previous syncs, if any
+        // FIXME: We may try to execute the current one in parallel, like in Blockstore, but I'm not sure if it matters at all
+        syncs_in_progress.push_back(cur_op);
+        op_data->st = 1;
+resume_1:
+        return;
+    }
+    else
+    {
+        syncs_in_progress.push_back(cur_op);
+    }
+resume_2:
+    if (dirty_osds.size() == 0)
+    {
+        // Nothing to sync
+        goto finish;
+    }
+    // Save and clear unstable_writes
+    // In theory it is possible to do it on a per-client basis, but this seems to be an unnecessary complication
+    // It would be cool not to copy these here at all, but someone has to deduplicate them by object IDs anyway
+    if (unstable_writes.size() > 0)
+    {
+        op_data->unstable_write_osds = new std::vector<unstable_osd_num_t>();
+        op_data->unstable_writes = new obj_ver_id[this->unstable_writes.size()];
+        osd_num_t last_osd = 0;
+        int last_start = 0, last_end = 0;
+        for (auto it = this->unstable_writes.begin(); it != this->unstable_writes.end(); it++)
+        {
+            if (last_osd != it->first.osd_num)
+            {
+                if (last_osd != 0)
+                {
+                    op_data->unstable_write_osds->push_back((unstable_osd_num_t){
+                        .osd_num = last_osd,
+                        .start = last_start,
+                        .len = last_end - last_start,
+                    });
+                }
+                last_osd = it->first.osd_num;
+                last_start = last_end;
+            }
+            op_data->unstable_writes[last_end] = (obj_ver_id){
+                .oid = it->first.oid,
+                .version = it->second,
+            };
+            last_end++;
+        }
+        if (last_osd != 0)
+        {
+            op_data->unstable_write_osds->push_back((unstable_osd_num_t){
+                .osd_num = last_osd,
+                .start = last_start,
+                .len = last_end - last_start,
+            });
+        }
+        this->unstable_writes.clear();
+    }
+    {
+        void *dirty_buf = malloc_or_die(
+            sizeof(pool_pg_num_t)*dirty_pgs.size() +
+            sizeof(osd_num_t)*dirty_osds.size() +
+            sizeof(obj_ver_osd_t)*this->copies_to_delete_after_sync_count
+        );
+        op_data->dirty_pgs = (pool_pg_num_t*)dirty_buf;
+        op_data->dirty_osds = (osd_num_t*)(dirty_buf + sizeof(pool_pg_num_t)*dirty_pgs.size());
+        op_data->dirty_pg_count = dirty_pgs.size();
+        op_data->dirty_osd_count = dirty_osds.size();
+        if (this->copies_to_delete_after_sync_count)
+        {
+            op_data->copies_to_delete_count = 0;
+            op_data->copies_to_delete = (obj_ver_osd_t*)(op_data->dirty_osds + op_data->dirty_osd_count);
+            for (auto dirty_pg_num: dirty_pgs)
+            {
+                auto & pg = pgs.at(dirty_pg_num);
+                assert(pg.copies_to_delete_after_sync.size() <= this->copies_to_delete_after_sync_count);
+                memcpy(
+                    op_data->copies_to_delete + op_data->copies_to_delete_count,
+                    pg.copies_to_delete_after_sync.data(),
+                    sizeof(obj_ver_osd_t)*pg.copies_to_delete_after_sync.size()
+                );
+                op_data->copies_to_delete_count += pg.copies_to_delete_after_sync.size();
+                this->copies_to_delete_after_sync_count -= pg.copies_to_delete_after_sync.size();
+                pg.copies_to_delete_after_sync.clear();
+            }
+            assert(this->copies_to_delete_after_sync_count == 0);
+        }
+        int dpg = 0;
+        for (auto dirty_pg_num: dirty_pgs)
+        {
+            pgs.at(dirty_pg_num).inflight++;
+            op_data->dirty_pgs[dpg++] = dirty_pg_num;
+        }
+        dirty_pgs.clear();
+        dpg = 0;
+        for (auto osd_num: dirty_osds)
+        {
+            op_data->dirty_osds[dpg++] = osd_num;
+        }
+        dirty_osds.clear();
+    }
+    if (immediate_commit != IMMEDIATE_ALL)
+    {
+        // SYNC
+        if (!submit_primary_sync_subops(cur_op))
+        {
+            goto resume_4;
+        }
+resume_3:
+        op_data->st = 3;
+        return;
+resume_4:
+        if (op_data->errors > 0)
+        {
+            goto resume_6;
+        }
+    }
+    if (op_data->unstable_writes)
+    {
+        // Stabilize version sets, if any
+        submit_primary_stab_subops(cur_op);
+resume_5:
+        op_data->st = 5;
+        return;
+    }
+resume_6:
+    if (op_data->errors > 0)
+    {
+        // Return PGs and OSDs back into their dirty sets
+        for (int i = 0; i < op_data->dirty_pg_count; i++)
+        {
+            dirty_pgs.insert(op_data->dirty_pgs[i]);
+        }
+        for (int i = 0; i < op_data->dirty_osd_count; i++)
+        {
+            dirty_osds.insert(op_data->dirty_osds[i]);
+        }
+        if (op_data->unstable_writes)
+        {
+            // Return objects back into the unstable write set
+            for (auto unstable_osd: *(op_data->unstable_write_osds))
+            {
+                for (int i = 0; i < unstable_osd.len; i++)
+                {
+                    // Except those from PGs which are not active anymore (for example, repeered)
+                    auto & w = op_data->unstable_writes[unstable_osd.start + i];
+                    pool_pg_num_t wpg = {
+                        .pool_id = INODE_POOL(w.oid.inode),
+                        .pg_num = map_to_pg(w.oid, st_cli.pool_config.at(INODE_POOL(w.oid.inode)).pg_stripe_size),
+                    };
+                    if (pgs.at(wpg).state & PG_ACTIVE)
+                    {
+                        uint64_t & dest = this->unstable_writes[(osd_object_id_t){
+                            .osd_num = unstable_osd.osd_num,
+                            .oid = w.oid,
+                        }];
+                        dest = dest < w.version ? w.version : dest;
+                        dirty_pgs.insert(wpg);
+                    }
+                }
+            }
+        }
+        if (op_data->copies_to_delete)
+        {
+            // Return 'copies to delete' back into respective PGs
+            for (int i = 0; i < op_data->copies_to_delete_count; i++)
+            {
+                auto & w = op_data->copies_to_delete[i];
+                auto & pg = pgs.at((pool_pg_num_t){
+                    .pool_id = INODE_POOL(w.oid.inode),
+                    .pg_num = map_to_pg(w.oid, st_cli.pool_config.at(INODE_POOL(w.oid.inode)).pg_stripe_size),
+                });
+                if (pg.state & PG_ACTIVE)
+                {
+                    pg.copies_to_delete_after_sync.push_back(w);
+                    copies_to_delete_after_sync_count++;
+                }
+            }
+        }
+    }
+    else if (op_data->copies_to_delete)
+    {
+        // Actually delete copies which we wanted to delete
+        submit_primary_del_batch(cur_op, op_data->copies_to_delete, op_data->copies_to_delete_count);
+resume_7:
+        op_data->st = 7;
+        return;
+resume_8:
+        if (op_data->errors > 0)
+        {
+            goto resume_6;
+        }
+    }
+    for (int i = 0; i < op_data->dirty_pg_count; i++)
+    {
+        auto & pg = pgs.at(op_data->dirty_pgs[i]);
+        pg.inflight--;
+        if ((pg.state & PG_STOPPING) && pg.inflight == 0 && !pg.flush_batch)
+        {
+            finish_stop_pg(pg);
+        }
+        else if ((pg.state & PG_REPEERING) && pg.inflight == 0 && !pg.flush_batch)
+        {
+            start_pg_peering(pg);
+        }
+    }
+    // FIXME: Free those in the destructor?
+    free(op_data->dirty_pgs);
+    op_data->dirty_pgs = NULL;
+    op_data->dirty_osds = NULL;
+    if (op_data->unstable_writes)
+    {
+        delete op_data->unstable_write_osds;
+        delete[] op_data->unstable_writes;
+        op_data->unstable_writes = NULL;
+        op_data->unstable_write_osds = NULL;
+    }
+    if (op_data->errors > 0)
+    {
+        finish_op(cur_op, op_data->epipe > 0 ? -EPIPE : -EIO);
+    }
+    else
+    {
+finish:
+        if (cur_op->peer_fd)
+        {
+            auto it = c_cli.clients.find(cur_op->peer_fd);
+            if (it != c_cli.clients.end())
+                it->second->dirty_pgs.clear();
+        }
+        finish_op(cur_op, 0);
+    }
+    assert(syncs_in_progress.front() == cur_op);
+    syncs_in_progress.pop_front();
+    if (syncs_in_progress.size() > 0)
+    {
+        cur_op = syncs_in_progress.front();
+        op_data = cur_op->op_data;
+        op_data->st++;
+        goto resume_2;
+    }
+}
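The handler that ends above, like continue_primary_write() in the next file, is written as a resumable routine: op_data->st stores the point at which the operation stopped, and the numbered resume_N labels let a later call continue from there. The following is a minimal sketch of that pattern, not the OSD code itself. It uses a switch instead of goto, and mini_op_t, continue_mini_op and the error value are hypothetical names chosen only for illustration; the completion callback that re-enters the function is assumed.

```cpp
#include <cassert>

// Hypothetical miniature of a resumable operation. None of these names come
// from the diff; they only mirror the role of `op_data->st` and `errors`.
struct mini_op_t
{
    int st = 0;        // resume point: 0 = not started yet
    int errors = 0;    // would be set by a failed sub-operation
    bool done = false; // true once the operation has been finished
    int retval = 0;    // result reported when done
};

// Each call runs until the next point where the operation has to wait,
// records that point in `st` and returns. The caller is assumed to call
// the function again once the awaited sub-operations have completed.
void continue_mini_op(mini_op_t & op)
{
    switch (op.st)
    {
    case 0:
        // Step 1: submit the first batch of sub-operations, then wait
        op.st = 1;
        return;
    case 1:
        if (op.errors > 0)
        {
            // Stop early and report the failure
            op.done = true;
            op.retval = -5; // stands in for -EIO
            return;
        }
        // Step 2: submit the second batch, then wait again
        op.st = 2;
        return;
    case 2:
        // Step 3: everything has completed, finish the operation
        op.done = true;
        op.retval = op.errors > 0 ? -5 : 0;
        return;
    default:
        assert(false);
    }
}
```

Calling continue_mini_op() three times on a fresh mini_op_t walks through all steps; setting errors before the second call makes it finish early with the error value. Keeping all state in the operation object is what allows the function to return to the event loop between steps.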
--- a/src/osd_primary_write.cpp
+++ b/src/osd_primary_write.cpp
@ -0,0 +1,381 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+
+#include "osd_primary.h"
+#include "allocator.h"
+
+bool osd_t::check_write_queue(osd_op_t *cur_op, pg_t & pg)
+{
+    osd_primary_op_data_t *op_data = cur_op->op_data;
+    // Check if actions are pending for this object
+    auto act_it = pg.flush_actions.lower_bound((obj_piece_id_t){
+        .oid = op_data->oid,
+        .osd_num = 0,
+    });
+    if (act_it != pg.flush_actions.end() &&
+        act_it->first.oid.inode == op_data->oid.inode &&
+        (act_it->first.oid.stripe & ~STRIPE_MASK) == op_data->oid.stripe)
+    {
+        pg.write_queue.emplace(op_data->oid, cur_op);
+        return false;
+    }
+    // Check if there are other write requests to the same object
+    auto vo_it = pg.write_queue.find(op_data->oid);
+    if (vo_it != pg.write_queue.end())
+    {
+        op_data->st = 1;
+        pg.write_queue.emplace(op_data->oid, cur_op);
+        return false;
+    }
+    pg.write_queue.emplace(op_data->oid, cur_op);
+    return true;
+}
+
+void osd_t::continue_primary_write(osd_op_t *cur_op)
+{
+    if (!cur_op->op_data && !prepare_primary_rw(cur_op))
+    {
+        return;
+    }
+    osd_primary_op_data_t *op_data = cur_op->op_data;
+    auto & pg = pgs.at({ .pool_id = INODE_POOL(op_data->oid.inode), .pg_num = op_data->pg_num });
+    if (op_data->st == 1)      goto resume_1;
+    else if (op_data->st == 2) goto resume_2;
+    else if (op_data->st == 3) goto resume_3;
+    else if (op_data->st == 4) goto resume_4;
+    else if (op_data->st == 5) goto resume_5;
+    else if (op_data->st == 6) goto resume_6;
+    else if (op_data->st == 7) goto resume_7;
+    else if (op_data->st == 8) goto resume_8;
+    else if (op_data->st == 9) goto resume_9;
+    else if (op_data->st == 10) goto resume_10;
+    assert(op_data->st == 0);
+    if (!check_write_queue(cur_op, pg))
+    {
+        return;
+    }
+resume_1:
+    // Determine blocks to read and write
+    // Missing chunks are allowed to be overwritten even in incomplete objects
+    // FIXME: Allow small writes to go to the old (degraded/misplaced) OSD set to reduce the performance impact
+    op_data->prev_set = get_object_osd_set(pg, op_data->oid, pg.cur_set.data(), &op_data->object_state);
+    if (op_data->scheme == POOL_SCHEME_REPLICATED)
+    {
+        // Simplified algorithm
+        op_data->stripes[0].write_start = op_data->stripes[0].req_start;
+        op_data->stripes[0].write_end = op_data->stripes[0].req_end;
+        op_data->stripes[0].write_buf = cur_op->buf;
+        if (pg.cur_set.data() != op_data->prev_set && (op_data->stripes[0].write_start != 0 ||
+            op_data->stripes[0].write_end != bs_block_size))
+        {
+            // Object is degraded/misplaced and will be moved to <write_osd_set>
+            op_data->stripes[0].read_start = 0;
+            op_data->stripes[0].read_end = bs_block_size;
+            cur_op->rmw_buf = op_data->stripes[0].read_buf = memalign_or_die(MEM_ALIGNMENT, bs_block_size);
+        }
+    }
+    else
+    {
+        cur_op->rmw_buf = calc_rmw(cur_op->buf, op_data->stripes, op_data->prev_set,
+            pg.pg_size, op_data->pg_data_size, pg.pg_cursize, pg.cur_set.data(), bs_block_size, clean_entry_bitmap_size);
+        if (!cur_op->rmw_buf)
+        {
+            // Refuse partial overwrite of an incomplete object
+            cur_op->reply.hdr.retval = -EINVAL;
+            goto continue_others;
+        }
+    }
+    // Read required blocks
+    submit_primary_subops(SUBMIT_RMW_READ, UINT64_MAX, op_data->prev_set, cur_op);
+resume_2:
+    op_data->st = 2;
+    return;
+resume_3:
+    if (op_data->errors > 0)
+    {
+        pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
+        return;
+    }
+    if (op_data->scheme == POOL_SCHEME_REPLICATED)
+    {
+        // Set bitmap bits
+        bitmap_set(op_data->stripes[0].bmp_buf, op_data->stripes[0].write_start,
+            op_data->stripes[0].write_end-op_data->stripes[0].write_start, bs_bitmap_granularity);
+        // Possibly copy new data from the request into the recovery buffer
+        if (pg.cur_set.data() != op_data->prev_set && (op_data->stripes[0].write_start != 0 ||
+            op_data->stripes[0].write_end != bs_block_size))
+        {
+            memcpy(
+                op_data->stripes[0].read_buf + op_data->stripes[0].req_start,
+                op_data->stripes[0].write_buf,
+                op_data->stripes[0].req_end - op_data->stripes[0].req_start
+            );
+            op_data->stripes[0].write_buf = op_data->stripes[0].read_buf;
+            op_data->stripes[0].write_start = 0;
+            op_data->stripes[0].write_end = bs_block_size;
+        }
+    }
+    else
+    {
+        // For EC/XOR pools, save version override to make it impossible
+        // for parallel reads to read different versions of data and parity
+        pg.ver_override[op_data->oid] = op_data->fact_ver;
+        // Recover missing stripes, calculate parity
+        if (pg.scheme == POOL_SCHEME_XOR)
+        {
+            calc_rmw_parity_xor(op_data->stripes, pg.pg_size, op_data->prev_set, pg.cur_set.data(), bs_block_size, clean_entry_bitmap_size);
+        }
+        else if (pg.scheme == POOL_SCHEME_JERASURE)
+        {
+            calc_rmw_parity_jerasure(op_data->stripes, pg.pg_size, op_data->pg_data_size, op_data->prev_set, pg.cur_set.data(), bs_block_size, clean_entry_bitmap_size);
+        }
+    }
+    // Send writes
+    if ((op_data->fact_ver >> (64-PG_EPOCH_BITS)) < pg.epoch)
+    {
+        op_data->target_ver = ((uint64_t)pg.epoch << (64-PG_EPOCH_BITS)) | 1;
+    }
+    else
+    {
+        if ((op_data->fact_ver & ((1ul<<(64-PG_EPOCH_BITS)) - 1)) == ((1ul<<(64-PG_EPOCH_BITS)) - 1))
+        {
+            assert(pg.epoch != ((1ul << PG_EPOCH_BITS)-1));
+            pg.epoch++;
+        }
+        op_data->target_ver = op_data->fact_ver + 1;
+    }
+    if (pg.epoch > pg.reported_epoch)
+    {
+        // Report newer epoch before writing
+        // FIXME: We may report only one PG state here...
+        this->pg_state_dirty.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
+        pg.history_changed = true;
+        report_pg_states();
+resume_10:
+        if (pg.epoch > pg.reported_epoch)
+        {
+            op_data->st = 10;
+            return;
+        }
+    }
+    submit_primary_subops(SUBMIT_WRITE, op_data->target_ver, pg.cur_set.data(), cur_op);
+resume_4:
+    op_data->st = 4;
+    return;
+resume_5:
+    if (op_data->scheme != POOL_SCHEME_REPLICATED)
+    {
+        // Remove version override just after the write, but before stabilizing
+        pg.ver_override.erase(op_data->oid);
+    }
+    if (op_data->errors > 0)
+    {
+        pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
+        return;
+    }
+    if (op_data->object_state)
+    {
+        // We must forget the unclean state of the object before deleting it,
+        // so that subsequent reads don't accidentally read a deleted version.
+        // This has to happen at the same time as the removal of the version override.
+        remove_object_from_state(op_data->oid, op_data->object_state, pg);
+        pg.clean_count++;
+    }
+resume_6:
+resume_7:
+    if (!remember_unstable_write(cur_op, pg, pg.cur_loc_set, 6))
+    {
+        return;
+    }
+    if (op_data->fact_ver == 1)
+    {
+        // Object is created
+        pg.clean_count++;
+        pg.total_count++;
+    }
+    if (op_data->object_state)
+    {
+        {
+            int recovery_type = op_data->object_state->state & (OBJ_DEGRADED|OBJ_INCOMPLETE) ? 0 : 1;
+            recovery_stat_count[0][recovery_type]++;
+            if (!recovery_stat_count[0][recovery_type])
+            {
+                recovery_stat_count[0][recovery_type]++;
+                recovery_stat_bytes[0][recovery_type] = 0;
+            }
+            for (int role = 0; role < (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size); role++)
+            {
+                recovery_stat_bytes[0][recovery_type] += op_data->stripes[role].write_end - op_data->stripes[role].write_start;
+            }
+        }
+        // Any kind of a non-clean object can have extra chunks, because we don't record objects
+        // as degraded & misplaced or incomplete & misplaced at the same time. So try to remove extra chunks
+        if (immediate_commit != IMMEDIATE_ALL)
+        {
+            // We can't remove extra chunks yet if fsyncs are explicit, because
+            // new copies may not be committed to stable storage yet
+            // We can only remove extra chunks after a successful SYNC for this PG
+            for (auto & chunk: op_data->object_state->osd_set)
+            {
+                // Same check as in submit_primary_del_subops()
+                if (op_data->scheme == POOL_SCHEME_REPLICATED
+                    ? !contains_osd(pg.cur_set.data(), pg.pg_size, chunk.osd_num)
+                    : (chunk.osd_num != pg.cur_set[chunk.role]))
+                {
+                    pg.copies_to_delete_after_sync.push_back((obj_ver_osd_t){
+                        .osd_num = chunk.osd_num,
+                        .oid = {
+                            .inode = op_data->oid.inode,
+                            .stripe = op_data->oid.stripe | (op_data->scheme == POOL_SCHEME_REPLICATED ? 0 : chunk.role),
+                        },
+                        .version = op_data->fact_ver,
+                    });
+                    copies_to_delete_after_sync_count++;
+                }
+            }
+            free_object_state(pg, &op_data->object_state);
+        }
+        else
+        {
+            submit_primary_del_subops(cur_op, pg.cur_set.data(), pg.pg_size, op_data->object_state->osd_set);
+            free_object_state(pg, &op_data->object_state);
+            if (op_data->n_subops > 0)
+            {
+resume_8:
+                op_data->st = 8;
+                return;
+resume_9:
+                if (op_data->errors > 0)
+                {
+                    pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
+                    return;
+                }
+            }
+        }
+    }
+    cur_op->reply.hdr.retval = cur_op->req.rw.len;
+continue_others:
+    osd_op_t *next_op = NULL;
+    auto next_it = pg.write_queue.find(op_data->oid);
+    // Remove the operation from queue before calling finish_op so it doesn't see the completed operation in queue
+    if (next_it != pg.write_queue.end() && next_it->second == cur_op)
+    {
+        pg.write_queue.erase(next_it++);
+        if (next_it != pg.write_queue.end() && next_it->first == op_data->oid)
+            next_op = next_it->second;
+    }
+    // finish_op would invalidate next_it if it cleared pg.write_queue, but it doesn't do that :)
+    finish_op(cur_op, cur_op->reply.hdr.retval);
+    if (next_op)
+    {
+        // Continue next write to the same object
+        continue_primary_write(next_op);
+    }
+}
+
+bool osd_t::remember_unstable_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t & loc_set, int base_state)
+{
+    osd_primary_op_data_t *op_data = cur_op->op_data;
+    if (op_data->st == base_state)
+    {
+        goto resume_6;
+    }
+    else if (op_data->st == base_state+1)
+    {
+        goto resume_7;
+    }
+    if (immediate_commit == IMMEDIATE_ALL)
+    {
+immediate:
+        if (op_data->scheme != POOL_SCHEME_REPLICATED)
+        {
+            // Send STABILIZE ops immediately
+            op_data->unstable_write_osds = new std::vector<unstable_osd_num_t>();
+            op_data->unstable_writes = new obj_ver_id[loc_set.size()];
+            {
+                int last_start = 0;
+                for (auto & chunk: loc_set)
+                {
+                    op_data->unstable_writes[last_start] = (obj_ver_id){
+                        .oid = {
+                            .inode = op_data->oid.inode,
+                            .stripe = op_data->oid.stripe | chunk.role,
+                        },
+                        .version = op_data->fact_ver,
+                    };
+                    op_data->unstable_write_osds->push_back((unstable_osd_num_t){
+                        .osd_num = chunk.osd_num,
+                        .start = last_start,
+                        .len = 1,
+                    });
+                    last_start++;
+                }
+            }
+            submit_primary_stab_subops(cur_op);
+resume_6:
+            op_data->st = 6;
+            return false;
+resume_7:
+            // FIXME: Free those in the destructor?
+            delete op_data->unstable_write_osds;
+            delete[] op_data->unstable_writes;
+            op_data->unstable_writes = NULL;
+            op_data->unstable_write_osds = NULL;
+            if (op_data->errors > 0)
+            {
+                pg_cancel_write_queue(pg, cur_op, op_data->oid, op_data->epipe > 0 ? -EPIPE : -EIO);
+                return false;
+            }
+        }
+    }
+    else if (immediate_commit == IMMEDIATE_SMALL)
+    {
+        int stripe_count = (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : op_data->pg_size);
+        for (int role = 0; role < stripe_count; role++)
+        {
+            if (op_data->stripes[role].write_start == 0 &&
+                op_data->stripes[role].write_end == bs_block_size)
+            {
+                // Big write. Treat write as unsynced
+                goto lazy;
+            }
+        }
+        goto immediate;
+    }
+    else
+    {
+lazy:
+        if (op_data->scheme != POOL_SCHEME_REPLICATED)
+        {
+            // Remember version as unstable for EC/XOR
+            for (auto & chunk: loc_set)
+            {
+                this->dirty_osds.insert(chunk.osd_num);
+                this->unstable_writes[(osd_object_id_t){
+                    .osd_num = chunk.osd_num,
+                    .oid = {
+                        .inode = op_data->oid.inode,
+                        .stripe = op_data->oid.stripe | chunk.role,
+                    },
+                }] = op_data->fact_ver;
+            }
+        }
+        else
+        {
+            // Only remember to sync OSDs for replicated pools
+            for (auto & chunk: loc_set)
+            {
+                this->dirty_osds.insert(chunk.osd_num);
+            }
+        }
+        // Remember PG as dirty to drop the connection when PG goes offline
+        // (this is required because of the "lazy sync")
+        auto cl_it = c_cli.clients.find(cur_op->peer_fd);
+        if (cl_it != c_cli.clients.end())
+        {
+            cl_it->second->dirty_pgs.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
+        }
+        dirty_pgs.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
+    }
+    return true;
+}
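The "Send writes" step of continue_primary_write() above derives target_ver from fact_ver and pg.epoch: the high bits of a version carry the PG epoch and the low bits carry a per-object counter. The sketch below restates that rule as a standalone function. The width of 48 epoch bits, the constant names and next_version() itself are assumptions made for illustration; only the branching logic follows the diff. Note that in C++ binary minus binds tighter than <<, so the counter mask needs its own parentheses.

```cpp
#include <cassert>
#include <cstdint>

// Assumption for illustration only: the top 48 bits hold the PG epoch
constexpr unsigned EPOCH_BITS = 48;
constexpr unsigned COUNTER_BITS = 64 - EPOCH_BITS;
// The shift is parenthesized on purpose: `1 << N - 1` would mean `1 << (N - 1)`
constexpr uint64_t COUNTER_MASK = (UINT64_C(1) << COUNTER_BITS) - 1;

// Hypothetical helper: returns the version for the next write and bumps
// pg_epoch when the per-object counter is about to overflow.
uint64_t next_version(uint64_t fact_ver, uint64_t & pg_epoch)
{
    if ((fact_ver >> COUNTER_BITS) < pg_epoch)
    {
        // Last write happened in an older epoch: restart the counter at 1
        return (pg_epoch << COUNTER_BITS) | 1;
    }
    if ((fact_ver & COUNTER_MASK) == COUNTER_MASK)
    {
        // All counter bits are set, so +1 carries into the epoch bits.
        // The PG epoch is advanced first so that it stays in step.
        pg_epoch++;
    }
    return fact_ver + 1;
}
```

With pg_epoch equal to 2, a version from epoch 1 maps to (2 << 16) | 1, a version from epoch 2 is simply incremented, and a version whose counter bits are all ones advances pg_epoch to 3 and yields 3 << 16.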
--- a/src/osd_rmw.cpp
+++ b/src/osd_rmw.cpp
@ -7,6 +7,7 @@
 #include <jerasure/reed_sol.h>
 #include <jerasure.h>
 #include <map>
+#include "allocator.h"
 #include "xor.h"
 #include "osd_rmw.h"
 #include "malloc_or_die.h"
@ -81,7 +82,7 @@ void split_stripes(uint64_t pg_minsize, uint32_t bs_block_size, uint32_t start,
    }
 }

-void reconstruct_stripes_xor(osd_rmw_stripe_t *stripes, int pg_size)
+void reconstruct_stripes_xor(osd_rmw_stripe_t *stripes, int pg_size, uint32_t bitmap_size)
 {
    for (int role = 0; role < pg_size; role++)
    {
@ -106,6 +107,7 @@ void reconstruct_stripes_xor(osd_rmw_stripe_t *stripes, int pg_size)
                            stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start),
                            stripes[role].read_buf, stripes[role].read_end - stripes[role].read_start
                        );
+                        memxor(stripes[prev].bmp_buf, stripes[other].bmp_buf, stripes[role].bmp_buf, bitmap_size);
                        prev = -1;
                    }
                    else
@ -116,6 +118,7 @@ void reconstruct_stripes_xor(osd_rmw_stripe_t *stripes, int pg_size)
                            stripes[other].read_buf + (stripes[role].read_start - stripes[other].read_start),
                            stripes[role].read_buf, stripes[role].read_end - stripes[role].read_start
                        );
+                        memxor(stripes[role].bmp_buf, stripes[other].bmp_buf, stripes[role].bmp_buf, bitmap_size);
                    }
                }
            }
@ -212,7 +215,7 @@ int* get_jerasure_decoding_matrix(osd_rmw_stripe_t *stripes, int pg_size, int pg
    auto dec_it = matrix->decodings.find((reed_sol_erased_t){ .data = erased, .size = pg_size });
    if (dec_it == matrix->decodings.end())
    {
-        int *dm_ids = (int*)malloc(sizeof(int)*(pg_minsize + pg_minsize*pg_minsize + pg_size));
+        int *dm_ids = (int*)malloc_or_die(sizeof(int)*(pg_minsize + pg_minsize*pg_minsize + pg_size));
        int *decoding_matrix = dm_ids + pg_minsize;
        if (!dm_ids)
            throw std::bad_alloc();
@ -230,7 +233,7 @@ int* get_jerasure_decoding_matrix(osd_rmw_stripe_t *stripes, int pg_size, int pg
    return dec_it->second;
 }

-void reconstruct_stripes_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize)
+void reconstruct_stripes_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, uint32_t bitmap_size)
 {
    int *dm_ids = get_jerasure_decoding_matrix(stripes, pg_size, pg_minsize);
    if (!dm_ids)
@ -242,6 +245,8 @@ void reconstruct_stripes_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg
    for (int role = 0; role < pg_minsize; role++)
    {
        if (stripes[role].read_end != 0 && stripes[role].missing)
+        {
+            if (stripes[role].read_end > stripes[role].read_start)
            {
                for (int other = 0; other < pg_size; other++)
                {
@ -258,6 +263,19 @@ void reconstruct_stripes_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg
                    data_ptrs, data_ptrs+pg_minsize, stripes[role].read_end - stripes[role].read_start
                );
            }
+            for (int other = 0; other < pg_size; other++)
+            {
+                if (stripes[other].read_end != 0 && !stripes[other].missing)
+                {
+                    data_ptrs[other] = (char*)(stripes[other].bmp_buf);
+                }
+            }
+            data_ptrs[role] = (char*)stripes[role].bmp_buf;
+            jerasure_matrix_dotprod(
+                pg_minsize, OSD_JERASURE_W, decoding_matrix+(role*pg_minsize), dm_ids, role,
+                data_ptrs, data_ptrs+pg_minsize, bitmap_size
+            );
+        }
    }
 }

@ -320,7 +338,8 @@ void* alloc_read_buffer(osd_rmw_stripe_t *stripes, int read_pg_size, uint64_t ad
 }

 void* calc_rmw(void *request_buf, osd_rmw_stripe_t *stripes, uint64_t *read_osd_set,
-    uint64_t pg_size, uint64_t pg_minsize, uint64_t pg_cursize, uint64_t *write_osd_set, uint64_t chunk_size)
+    uint64_t pg_size, uint64_t pg_minsize, uint64_t pg_cursize, uint64_t *write_osd_set,
+    uint64_t chunk_size, uint32_t bitmap_size)
 {
    // Generic parity modification (read-modify-write) algorithm
    // Read -> Reconstruct missing chunks -> Calc parity chunks -> Write
@ -521,11 +540,12 @@ static void xor_multiple_buffers(buf_len_t *xor1, int n1, buf_len_t *xor2, int n
 }

 static void calc_rmw_parity_copy_mod(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize,
-    uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size, uint32_t &start, uint32_t &end)
+    uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size, uint32_t bitmap_granularity,
+    uint32_t &start, uint32_t &end)
 {
    if (write_osd_set[pg_minsize] != 0 || write_osd_set != read_osd_set)
    {
-        // Required for the next two if()s
+        // start & end are required for calc_rmw_parity
        for (int role = 0; role < pg_minsize; role++)
        {
            if (stripes[role].req_end != 0)
@ -543,6 +563,20 @@ static void calc_rmw_parity_copy_mod(osd_rmw_stripe_t *stripes, int pg_size, int
            }
        }
    }
+    // Set bitmap bits accordingly
+    if (bitmap_granularity > 0)
+    {
+        for (int role = 0; role < pg_minsize; role++)
+        {
+            if (stripes[role].req_end != 0)
+            {
+                bitmap_set(
+                    stripes[role].bmp_buf, stripes[role].req_start,
+                    stripes[role].req_end-stripes[role].req_start, bitmap_granularity
+                );
+            }
+        }
+    }
    if (write_osd_set != read_osd_set)
    {
        for (int role = 0; role < pg_minsize; role++)
@ -603,12 +637,14 @@ static void calc_rmw_parity_copy_parity(osd_rmw_stripe_t *stripes, int pg_size,
 #endif
 }

-void calc_rmw_parity_xor(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size)
+void calc_rmw_parity_xor(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_osd_set, uint64_t *write_osd_set,
+    uint32_t chunk_size, uint32_t bitmap_size)
 {
+    uint32_t bitmap_granularity = bitmap_size > 0 ? chunk_size / bitmap_size / 8 : 0;
    int pg_minsize = pg_size-1;
-    reconstruct_stripes_xor(stripes, pg_size);
+    reconstruct_stripes_xor(stripes, pg_size, bitmap_size);
    uint32_t start = 0, end = 0;
-    calc_rmw_parity_copy_mod(stripes, pg_size, pg_minsize, read_osd_set, write_osd_set, chunk_size, start, end);
+    calc_rmw_parity_copy_mod(stripes, pg_size, pg_minsize, read_osd_set, write_osd_set, chunk_size, bitmap_granularity, start, end);
    if (write_osd_set[pg_minsize] != 0 && end != 0)
    {
        // Calculate new parity (XOR k+1)
@ -626,9 +662,11 @@ void calc_rmw_parity_xor(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_
                if (prev == -1)
                {
                    xor1[n1++] = { .buf = stripes[parity].write_buf, .len = end-start };
+                    memxor(stripes[parity].bmp_buf, stripes[other].bmp_buf, stripes[parity].bmp_buf, bitmap_size);
                }
                else
                {
+                    memxor(stripes[prev].bmp_buf, stripes[other].bmp_buf, stripes[parity].bmp_buf, bitmap_size);
                    get_old_new_buffers(stripes[prev], start, end, xor1, n1);
                    prev = -1;
                }
@ -641,12 +679,13 @@ void calc_rmw_parity_xor(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_
 }

 void calc_rmw_parity_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize,
-    uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size)
+    uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size, uint32_t bitmap_size)
 {
+    uint32_t bitmap_granularity = bitmap_size > 0 ? chunk_size / bitmap_size / 8 : 0;
    reed_sol_matrix_t *matrix = get_jerasure_matrix(pg_size, pg_minsize);
-    reconstruct_stripes_jerasure(stripes, pg_size, pg_minsize);
+    reconstruct_stripes_jerasure(stripes, pg_size, pg_minsize, bitmap_size);
    uint32_t start = 0, end = 0;
-    calc_rmw_parity_copy_mod(stripes, pg_size, pg_minsize, read_osd_set, write_osd_set, chunk_size, start, end);
+    calc_rmw_parity_copy_mod(stripes, pg_size, pg_minsize, read_osd_set, write_osd_set, chunk_size, bitmap_granularity, start, end);
    if (end != 0)
    {
        int i;
@ -701,6 +740,14 @@ void calc_rmw_parity_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_min
                );
                pos = next_end;
            }
+            for (int i = 0; i < pg_size; i++)
+            {
+                data_ptrs[i] = stripes[i].bmp_buf;
+            }
+            jerasure_matrix_encode(
+                pg_minsize, pg_size-pg_minsize, OSD_JERASURE_W, matrix->data,
+                (char**)data_ptrs, (char**)data_ptrs+pg_minsize, bitmap_size
+            );
        }
    }
    calc_rmw_parity_copy_parity(stripes, pg_size, pg_minsize, read_osd_set, write_osd_set, chunk_size, start, end);
--- a/src/osd_rmw.h
+++ b/src/osd_rmw.h
@ -20,6 +20,7 @@ struct buf_len_t
 struct osd_rmw_stripe_t
 {
    void *read_buf, *write_buf;
+    void *bmp_buf;
    uint32_t req_start, req_end;
    uint32_t read_start, read_end;
    uint32_t write_start, write_end;
@ -30,20 +31,22 @@ struct osd_rmw_stripe_t

 void split_stripes(uint64_t pg_minsize, uint32_t bs_block_size, uint32_t start, uint32_t len, osd_rmw_stripe_t *stripes);

-void reconstruct_stripes_xor(osd_rmw_stripe_t *stripes, int pg_size);
+void reconstruct_stripes_xor(osd_rmw_stripe_t *stripes, int pg_size, uint32_t bitmap_size);

 int extend_missing_stripes(osd_rmw_stripe_t *stripes, osd_num_t *osd_set, int pg_minsize, int pg_size);

 void* alloc_read_buffer(osd_rmw_stripe_t *stripes, int read_pg_size, uint64_t add_size);

 void* calc_rmw(void *request_buf, osd_rmw_stripe_t *stripes, uint64_t *read_osd_set,
-    uint64_t pg_size, uint64_t pg_minsize, uint64_t pg_cursize, uint64_t *write_osd_set, uint64_t chunk_size);
+    uint64_t pg_size, uint64_t pg_minsize, uint64_t pg_cursize, uint64_t *write_osd_set,
+    uint64_t chunk_size, uint32_t bitmap_size);

-void calc_rmw_parity_xor(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size);
+void calc_rmw_parity_xor(osd_rmw_stripe_t *stripes, int pg_size, uint64_t *read_osd_set, uint64_t *write_osd_set,
+    uint32_t chunk_size, uint32_t bitmap_size);

 void use_jerasure(int pg_size, int pg_minsize, bool use);

-void reconstruct_stripes_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize);
+void reconstruct_stripes_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize, uint32_t bitmap_size);

 void calc_rmw_parity_jerasure(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize,
-    uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size);
+    uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size, uint32_t bitmap_size);
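The new bitmap_size parameter and bmp_buf field track which parts of a chunk have been written. In the XOR scheme the granularity is derived as chunk_size / bitmap_size / 8, and the parity bitmap is the XOR of the data bitmaps. The sketch below illustrates that arithmetic only. bitmap_set_sketch() and bitmap_xor() are hypothetical stand-ins, not the bitmap_set() and memxor() used by the patch, and the bit layout (bit n stored in byte n/8, least significant bit first) is an assumption that matches the values expected by test4 on a little-endian machine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for bitmap_set(): marks every granule touched by
// the byte range [start, start+len). Bit n is stored in byte n/8.
void bitmap_set_sketch(std::vector<uint8_t> & bmp, uint32_t start, uint32_t len, uint32_t granularity)
{
    if (!len || !granularity)
        return;
    uint32_t first = start / granularity;
    uint32_t last = (start + len - 1) / granularity;
    for (uint32_t bit = first; bit <= last && bit / 8 < bmp.size(); bit++)
        bmp[bit / 8] |= (uint8_t)(1u << (bit % 8));
}

// Hypothetical stand-in for the bitmap part of parity calculation:
// byte-wise XOR of two bitmaps (a missing byte in b counts as zero).
std::vector<uint8_t> bitmap_xor(const std::vector<uint8_t> & a, const std::vector<uint8_t> & b)
{
    std::vector<uint8_t> r(a.size());
    for (size_t i = 0; i < a.size(); i++)
        r[i] = a[i] ^ (i < b.size() ? b[i] : 0);
    return r;
}
```

For a 128 KiB chunk and a 4-byte bitmap the granularity is 131072 / 4 / 8 = 4096 bytes. Writing the last 4 KiB of the first data chunk sets bit 31, writing the first 4 KiB of the second data chunk sets bit 0, and the parity bitmap carries both bits.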
--- a/src/osd_rmw_test.cpp
+++ b/src/osd_rmw_test.cpp
@ -126,12 +126,16 @@ void test1()

 void test4()
 {
+    const uint32_t bmp = 4;
+    unsigned bitmaps[3] = { 0 };
    osd_num_t osd_set[3] = { 1, 0, 3 };
    osd_rmw_stripe_t stripes[3] = { 0 };
    // Test 4.1
    split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
+    for (int i = 0; i < 3; i++)
+        stripes[i].bmp_buf = bitmaps+i;
    void* write_buf = malloc(8192);
-    void* rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, osd_set, 128*1024);
+    void* rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, osd_set, 128*1024, bmp);
    assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
    assert(stripes[1].read_start == 4096 && stripes[1].read_end == 128*1024);
    assert(stripes[2].read_start == 4096 && stripes[2].read_end == 128*1024);
@ -149,7 +153,13 @@ void test4()
    set_pattern(stripes[0].read_buf, 128*1024, PATTERN1); // old data
    set_pattern(stripes[1].read_buf, 128*1024-4096, UINT64_MAX); // didn't read it, it's missing
    set_pattern(stripes[2].read_buf, 128*1024-4096, 0); // old parity = 0
-    calc_rmw_parity_xor(stripes, 3, osd_set, osd_set, 128*1024);
+    memset(stripes[0].bmp_buf, 0, bmp);
+    memset(stripes[1].bmp_buf, 0, bmp);
+    memset(stripes[2].bmp_buf, 0, bmp);
+    calc_rmw_parity_xor(stripes, 3, osd_set, osd_set, 128*1024, bmp);
+    assert(*(uint32_t*)stripes[0].bmp_buf == 0x80000000);
+    assert(*(uint32_t*)stripes[1].bmp_buf == 0x00000001);
+    assert(*(uint32_t*)stripes[2].bmp_buf == 0x80000001); // XOR
    check_pattern(stripes[2].write_buf, 4096, PATTERN0^PATTERN1); // new parity
    check_pattern(stripes[2].write_buf+4096, 128*1024-4096*2, 0); // new parity
    check_pattern(stripes[2].write_buf+128*1024-4096, 4096, PATTERN0^PATTERN1); // new parity
@ -181,7 +191,7 @@ void test5()
    assert(stripes[2].req_end == 0);
    // Test 5.2
    void *write_buf = malloc(64*1024*3);
-    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, osd_set, 128*1024);
+    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, osd_set, 128*1024, 0);
    assert(stripes[0].read_start == 64*1024 && stripes[0].read_end == 128*1024);
    assert(stripes[1].read_start == 64*1024 && stripes[1].read_end == 128*1024);
    assert(stripes[2].read_start == 64*1024 && stripes[2].read_end == 128*1024);
@ -218,7 +228,7 @@ void test6()
    // Test 6.1
    split_stripes(2, 128*1024, 0, 64*1024*3, stripes);
    void *write_buf = malloc(64*1024*3);
-    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, osd_set, 128*1024);
+    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, osd_set, 128*1024, 0);
    assert(stripes[0].read_end == 0);
    assert(stripes[1].read_start == 64*1024 && stripes[1].read_end == 128*1024);
    assert(stripes[2].read_end == 0);
@ -261,7 +271,7 @@ void test7()
    // Test 7.1
    split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
    void *write_buf = malloc(8192);
-    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, write_osd_set, 128*1024);
+    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, write_osd_set, 128*1024, 0);
    assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
    assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024);
    assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024);
@ -279,7 +289,7 @@ void test7()
    set_pattern(stripes[0].read_buf, 128*1024, PATTERN1); // old data
    set_pattern(stripes[1].read_buf, 128*1024, UINT64_MAX); // didn't read it, it's missing
    set_pattern(stripes[2].read_buf, 128*1024, 0); // old parity = 0
-    calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024);
+    calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024, 0);
    assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 128*1024);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
@ -314,7 +324,7 @@ void test8()
    // Test 8.1
    split_stripes(2, 128*1024, 0, 128*1024+4096, stripes);
    void *write_buf = malloc(128*1024+4096);
-    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, write_osd_set, 128*1024);
+    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 2, write_osd_set, 128*1024, 0);
    assert(stripes[0].read_start == 0 && stripes[0].read_end == 0);
    assert(stripes[1].read_start == 4096 && stripes[1].read_end == 128*1024);
    assert(stripes[2].read_start == 0 && stripes[2].read_end == 0);
@ -330,7 +340,7 @@ void test8()
    // Test 8.2
    set_pattern(write_buf, 128*1024+4096, PATTERN0);
    set_pattern(stripes[1].read_buf, 128*1024-4096, PATTERN1);
-    calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024);
+    calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024, 0);
    assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024); // recheck again
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);     // recheck again
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024); // recheck again
@ -373,7 +383,7 @@ void test9()
    assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
    // Test 9.1
    void *write_buf = NULL;
-    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024);
+    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024, 0);
    assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
    assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024);
    assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024);
@ -389,7 +399,7 @@ void test9()
    // Test 9.2
    set_pattern(stripes[1].read_buf, 128*1024, 0);
    set_pattern(stripes[2].read_buf, 128*1024, PATTERN1);
-    calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024);
+    calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024, 0);
    assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 0);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 0);
@ -428,7 +438,7 @@ void test10()
    assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
    // Test 10.1
    void *write_buf = malloc(256*1024);
-    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024);
+    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024, 0);
    assert(rmw_buf);
    assert(stripes[0].read_start == 0 && stripes[0].read_end == 0);
    assert(stripes[1].read_start == 0 && stripes[1].read_end == 0);
@ -445,7 +455,7 @@ void test10()
    // Test 10.2
    set_pattern(stripes[0].write_buf, 128*1024, PATTERN1);
    set_pattern(stripes[1].write_buf, 128*1024, PATTERN2);
-    calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024);
+    calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024, 0);
    assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 128*1024);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
@ -484,7 +494,7 @@ void test11()
    assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
    // Test 11.1
    void *write_buf = malloc(256*1024);
-    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024);
+    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024, 0);
    assert(rmw_buf);
    assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
    assert(stripes[1].read_start == 0 && stripes[1].read_end == 0);
@ -501,7 +511,7 @@ void test11()
    // Test 11.2
    set_pattern(stripes[0].read_buf, 128*1024, PATTERN1);
    set_pattern(stripes[1].write_buf, 128*1024, PATTERN2);
-    calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024);
+    calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024, 0);
    assert(stripes[0].write_start == 0 && stripes[0].write_end == 0);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 128*1024);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
@ -539,7 +549,7 @@ void test12()
    assert(stripes[1].req_start == 0 && stripes[1].req_end == 0);
    assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
    // Test 12.1
-    void *rmw_buf = calc_rmw(NULL, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024);
+    void *rmw_buf = calc_rmw(NULL, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024, 0);
    assert(rmw_buf);
    assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
    assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024);
@ -556,7 +566,7 @@ void test12()
    // Test 12.2
    set_pattern(stripes[0].read_buf, 128*1024, PATTERN1);
    set_pattern(stripes[1].read_buf, 128*1024, PATTERN2);
-    calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024);
+    calc_rmw_parity_xor(stripes, 3, osd_set, write_osd_set, 128*1024, 0);
    assert(stripes[0].write_start == 0 && stripes[0].write_end == 0);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 0);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
@ -596,7 +606,7 @@ void test13()
    assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
    assert(stripes[3].req_start == 0 && stripes[3].req_end == 0);
    // Test 13.1
-    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 4, 2, 4, write_osd_set, 128*1024);
+    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 4, 2, 4, write_osd_set, 128*1024, 0);
    assert(rmw_buf);
    assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024-4096);
    assert(stripes[1].read_start == 4096 && stripes[1].read_end == 128*1024);
@ -618,7 +628,7 @@ void test13()
    set_pattern(write_buf, 8192, PATTERN3);
    set_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1);
    set_pattern(stripes[1].read_buf, 128*1024-4096, PATTERN2);
-    calc_rmw_parity_jerasure(stripes, 4, 2, osd_set, write_osd_set, 128*1024);
+    calc_rmw_parity_jerasure(stripes, 4, 2, osd_set, write_osd_set, 128*1024, 0);
    assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
@ -653,7 +663,7 @@ void test13()
    assert(stripes[3].read_buf == read_buf+3*128*1024);
    memcpy(read_buf+2*128*1024, rmw_buf, 128*1024);
    memcpy(read_buf+3*128*1024, rmw_buf+128*1024, 128*1024);
-    reconstruct_stripes_jerasure(stripes, 4, 2);
+    reconstruct_stripes_jerasure(stripes, 4, 2, 0);
    check_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1);
    check_pattern(stripes[0].read_buf+128*1024-4096, 4096, PATTERN3);
    check_pattern(stripes[1].read_buf, 4096, PATTERN3);
@ -684,7 +694,7 @@ void test13()
    assert(stripes[3].read_buf == read_buf+2*128*1024);
    memcpy(read_buf+128*1024, rmw_buf, 128*1024);
    memcpy(read_buf+2*128*1024, rmw_buf+128*1024, 128*1024);
-    reconstruct_stripes_jerasure(stripes, 4, 2);
+    reconstruct_stripes_jerasure(stripes, 4, 2, 0);
    check_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1);
    check_pattern(stripes[0].read_buf+128*1024-4096, 4096, PATTERN3);
    free(read_buf);
@ -711,10 +721,12 @@ void test13()

 void test14()
 {
+    const int bmp = 4;
    use_jerasure(3, 2, true);
    osd_num_t osd_set[3] = { 1, 2, 0 };
    osd_num_t write_osd_set[3] = { 1, 2, 3 };
    osd_rmw_stripe_t stripes[3] = { 0 };
+    unsigned bitmaps[3] = { 0 };
    // Test 13.0
    void *write_buf = malloc_or_die(8192);
    split_stripes(2, 128*1024, 128*1024-4096, 8192, stripes);
@ -722,7 +734,9 @@ void test14()
    assert(stripes[1].req_start == 0 && stripes[1].req_end == 4096);
    assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
    // Test 13.1
-    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024);
+    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 3, 2, 3, write_osd_set, 128*1024, bmp);
+    for (int i = 0; i < 3; i++)
+        stripes[i].bmp_buf = bitmaps+i;
    assert(rmw_buf);
    assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024-4096);
    assert(stripes[1].read_start == 4096 && stripes[1].read_end == 128*1024);
@ -740,7 +754,13 @@ void test14()
    set_pattern(write_buf, 8192, PATTERN3);
    set_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1);
    set_pattern(stripes[1].read_buf, 128*1024-4096, PATTERN2);
-    calc_rmw_parity_jerasure(stripes, 3, 2, osd_set, write_osd_set, 128*1024);
+    memset(stripes[0].bmp_buf, 0, bmp);
+    memset(stripes[1].bmp_buf, 0, bmp);
+    memset(stripes[2].bmp_buf, 0, bmp);
+    calc_rmw_parity_jerasure(stripes, 3, 2, osd_set, write_osd_set, 128*1024, bmp);
+    assert(*(uint32_t*)stripes[0].bmp_buf == 0x80000000);
+    assert(*(uint32_t*)stripes[1].bmp_buf == 0x00000001);
+    assert(*(uint32_t*)stripes[2].bmp_buf == 0x80000001); // jerasure 2+1 is still just XOR
    assert(stripes[0].write_start == 128*1024-4096 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 4096);
    assert(stripes[2].write_start == 0 && stripes[2].write_end == 128*1024);
@ -764,6 +784,8 @@ void test14()
    assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024);
    assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024);
    void *read_buf = alloc_read_buffer(stripes, 3, 0);
+    for (int i = 0; i < 3; i++)
+        stripes[i].bmp_buf = bitmaps+i;
    assert(read_buf);
    assert(stripes[0].read_buf == read_buf);
    assert(stripes[1].read_buf == read_buf+128*1024);
@ -771,7 +793,7 @@ void test14()
    set_pattern(stripes[1].read_buf, 4096, PATTERN3);
    set_pattern(stripes[1].read_buf+4096, 128*1024-4096, PATTERN2);
    memcpy(stripes[2].read_buf, rmw_buf, 128*1024);
-    reconstruct_stripes_jerasure(stripes, 3, 2);
+    reconstruct_stripes_jerasure(stripes, 3, 2, bmp);
    check_pattern(stripes[0].read_buf, 128*1024-4096, PATTERN1);
    check_pattern(stripes[0].read_buf+128*1024-4096, 4096, PATTERN3);
    free(read_buf);
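
Note on the bitmap assertions in test14: the expected values are consistent with one bitmap bit per 4096-byte block of the 128 KiB chunk (4 bytes = 32 bits), with the parity bitmap being the XOR of the data bitmaps. The sketch below is only an illustration of that reading. `range_to_bitmap` is a hypothetical helper that does not exist in the patch, and the 4096-byte granularity and bit order are assumptions inferred from the test values on a little-endian host.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the patch): returns a 32-bit mask with one
// bit set per `granularity`-byte block touched by the byte range [start, end).
// Assumes a 128 KiB chunk described by a 4-byte bitmap, i.e. 4096 bytes per bit.
static uint32_t range_to_bitmap(uint32_t start, uint32_t end, uint32_t granularity = 4096)
{
    assert(granularity > 0 && start <= end && end <= 32 * granularity);
    uint32_t mask = 0;
    // Walk the blocks from the one containing `start` up to the last one
    // that begins before `end`.
    for (uint32_t bit = start / granularity; bit * granularity < end; bit++)
        mask |= (uint32_t)1 << bit;
    return mask;
}
```

With these assumptions, the write to the last 4 KiB of stripe 0 maps to the top bit, the write to the first 4 KiB of stripe 1 maps to the lowest bit, and XOR-ing the two gives the parity value asserted for stripe 2.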
--- a/src/osd_secondary.cpp
+++ b/src/osd_secondary.cpp
@ -17,9 +17,13 @@ void osd_t::secondary_op_callback(osd_op_t *op)
    {
        op->reply.sec_del.version = op->bs_op->version;
    }
-    if (op->req.hdr.opcode == OSD_OP_SEC_READ &&
-        op->bs_op->retval > 0)
+    if (op->req.hdr.opcode == OSD_OP_SEC_READ)
    {
+        if (op->bs_op->retval >= 0)
+            op->reply.sec_rw.attr_len = clean_entry_bitmap_size;
+        else
+            op->reply.sec_rw.attr_len = 0;
+        if (op->bs_op->retval > 0)
            op->iov.push_back(op->buf, op->bs_op->retval);
    }
    else if (op->req.hdr.opcode == OSD_OP_SEC_LIST)
@ -40,6 +44,25 @@ void osd_t::secondary_op_callback(osd_op_t *op)

 void osd_t::exec_secondary(osd_op_t *cur_op)
 {
+    if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
+    {
+        int n = cur_op->req.sec_read_bmp.len / sizeof(obj_ver_id);
+        if (n > 0)
+        {
+            obj_ver_id *ov = (obj_ver_id*)cur_op->buf;
+            void *reply_buf = malloc_or_die(n * (8 + clean_entry_bitmap_size));
+            void *cur_buf = reply_buf;
+            for (int i = 0; i < n; i++)
+            {
+                bs->read_bitmap(ov[i].oid, ov[i].version, cur_buf + sizeof(uint64_t), (uint64_t*)cur_buf);
+                cur_buf += (8 + clean_entry_bitmap_size);
+            }
+            free(cur_op->buf);
+            cur_op->buf = reply_buf;
+        }
+        finish_op(cur_op, n * (8 + clean_entry_bitmap_size));
+        return;
+    }
    cur_op->bs_op = new blockstore_op_t();
    cur_op->bs_op->callback = [this, cur_op](blockstore_op_t* bs_op) { secondary_op_callback(cur_op); };
    cur_op->bs_op->opcode = (cur_op->req.hdr.opcode == OSD_OP_SEC_READ ? BS_OP_READ
@ -55,11 +78,22 @@ void osd_t::exec_secondary(osd_op_t *cur_op)
        cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
        cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
    {
+        if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ)
+        {
+            // Allocate memory for the read operation
+            if (clean_entry_bitmap_size > sizeof(unsigned))
+                cur_op->bitmap = cur_op->rmw_buf = malloc_or_die(clean_entry_bitmap_size);
+            else
+                cur_op->bitmap = &cur_op->bmp_data;
+            if (cur_op->req.sec_rw.len > 0)
+                cur_op->buf = memalign_or_die(MEM_ALIGNMENT, cur_op->req.sec_rw.len);
+        }
        cur_op->bs_op->oid = cur_op->req.sec_rw.oid;
        cur_op->bs_op->version = cur_op->req.sec_rw.version;
        cur_op->bs_op->offset = cur_op->req.sec_rw.offset;
        cur_op->bs_op->len = cur_op->req.sec_rw.len;
        cur_op->bs_op->buf = cur_op->buf;
+        cur_op->bs_op->bitmap = cur_op->bitmap;
 #ifdef OSD_STUB
        cur_op->bs_op->retval = cur_op->bs_op->len;
 #endif
@ -111,7 +145,9 @@ void osd_t::exec_secondary(osd_op_t *cur_op)
 void osd_t::exec_show_config(osd_op_t *cur_op)
 {
    // FIXME: Send the real config, not its source
-    std::string cfg_str = json11::Json(config).dump();
+    auto cfg_copy = config;
+    cfg_copy["protocol_version"] = std::to_string(OSD_PROTOCOL_VERSION);
+    std::string cfg_str = json11::Json(cfg_copy).dump();
    cur_op->buf = malloc_or_die(cfg_str.size()+1);
    memcpy(cur_op->buf, cfg_str.c_str(), cfg_str.size()+1);
    cur_op->iov.push_back(cur_op->buf, cfg_str.size()+1);
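
Note on the OSD_OP_SEC_READ_BMP reply built in exec_secondary(): the buffer holds `n` fixed-size records, each an 8-byte value written through the `uint64_t*` argument of `read_bitmap` (presumably the object version) followed by `clean_entry_bitmap_size` bitmap bytes. A receiver could split it roughly as sketched below. `bitmap_reply_entry` and `parse_read_bmp_reply` are hypothetical names used only for illustration, and the meaning of the 8-byte field is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical decoded record; the patch itself only defines the wire layout.
struct bitmap_reply_entry
{
    uint64_t version;            // assumed meaning of the leading 8 bytes
    std::vector<uint8_t> bitmap; // bitmap_size bytes following it
};

// Splits a reply of fixed-size records (8 bytes + bitmap_size bytes each),
// mirroring the loop that fills reply_buf in exec_secondary().
static std::vector<bitmap_reply_entry> parse_read_bmp_reply(const void *buf, size_t len, size_t bitmap_size)
{
    const size_t rec = sizeof(uint64_t) + bitmap_size;
    assert(len % rec == 0);
    std::vector<bitmap_reply_entry> out;
    const uint8_t *p = (const uint8_t*)buf;
    for (size_t off = 0; off < len; off += rec)
    {
        bitmap_reply_entry e;
        // memcpy avoids unaligned 64-bit loads when bitmap_size is not a multiple of 8
        memcpy(&e.version, p + off, sizeof(uint64_t));
        e.bitmap.assign(p + off + sizeof(uint64_t), p + off + rec);
        out.push_back(e);
    }
    return out;
}
```

Working on `uint8_t*` rather than `void*` also sidesteps pointer arithmetic on `void*`, which is a GNU extension rather than standard C++.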
--- a/src/pg_states.cpp
+++ b/src/pg_states.cpp
@ -3,13 +3,14 @@

 #include "pg_states.h"

-const int pg_state_bit_count = 14;
+const int pg_state_bit_count = 15;

-const int pg_state_bits[14] = {
+const int pg_state_bits[15] = {
    PG_STARTING,
    PG_PEERING,
    PG_INCOMPLETE,
    PG_ACTIVE,
+    PG_REPEERING,
    PG_STOPPING,
    PG_OFFLINE,
    PG_DEGRADED,
@ -21,11 +22,12 @@ const int pg_state_bits[14] = {
    PG_LEFT_ON_DEAD,
 };

-const char *pg_state_names[14] = {
+const char *pg_state_names[15] = {
    "starting",
    "peering",
    "incomplete",
    "active",
+    "repeering",
    "stopping",
    "offline",
    "degraded",
--- a/src/pg_states.h
+++ b/src/pg_states.h
@ -10,16 +10,17 @@
 #define PG_PEERING (1<<1)
 #define PG_INCOMPLETE (1<<2)
 #define PG_ACTIVE (1<<3)
-#define PG_STOPPING (1<<4)
-#define PG_OFFLINE (1<<5)
+#define PG_REPEERING (1<<4)
+#define PG_STOPPING (1<<5)
+#define PG_OFFLINE (1<<6)
 // Plus any of these:
-#define PG_DEGRADED (1<<6)
-#define PG_HAS_INCOMPLETE (1<<7)
-#define PG_HAS_DEGRADED (1<<8)
-#define PG_HAS_MISPLACED (1<<9)
-#define PG_HAS_UNCLEAN (1<<10)
-#define PG_HAS_INVALID (1<<11)
-#define PG_LEFT_ON_DEAD (1<<12)
+#define PG_DEGRADED (1<<7)
+#define PG_HAS_INCOMPLETE (1<<8)
+#define PG_HAS_DEGRADED (1<<9)
+#define PG_HAS_MISPLACED (1<<10)
+#define PG_HAS_UNCLEAN (1<<11)
+#define PG_HAS_INVALID (1<<12)
+#define PG_LEFT_ON_DEAD (1<<13)

 // Lower bits that represent object role (EC 0/1/2... or always 0 with replication)
 // 12 bits is a safe default that doesn't depend on pg_stripe_size or pg_block_size
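
Note on the PG state renumbering: inserting PG_REPEERING at bit 4 shifts the numeric value of every later flag, so a stored numeric mask would decode differently before and after this change, while the names stay stable. The sketch below shows how a mask can be turned into names using parallel tables like `pg_state_bits` and `pg_state_names`. It carries its own shortened copies of the tables so that it compiles on its own. `pg_state_to_string` is a hypothetical helper, and `PG_STARTING == (1<<0)` is an assumption, since that define is outside the lines shown in this diff.

```cpp
#include <cassert>
#include <string>

// Shortened local copies of the tables from the patch (the real ones have
// 15 entries). The value of the first entry, 1<<0, is an assumption.
static const int demo_state_bits[] = { 1<<0, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1<<7 };
static const char *demo_state_names[] = {
    "starting", "peering", "incomplete", "active",
    "repeering", "stopping", "offline", "degraded",
};

// Hypothetical helper: joins the names of all set bits with '+'.
static std::string pg_state_to_string(int state)
{
    std::string res;
    for (int i = 0; i < 8; i++)
    {
        if (state & demo_state_bits[i])
        {
            if (!res.empty())
                res += "+";
            res += demo_state_names[i];
        }
    }
    return res;
}
```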
--- a/src/qemu_driver.c
+++ b/src/qemu_driver.c
@ -39,12 +39,14 @@ void DSO_STAMP_FUN(void)
 typedef struct VitastorClient
 {
    void *proxy;
+    void *watch;
    char *etcd_host;
    char *etcd_prefix;
+    char *image;
    uint64_t inode;
    uint64_t pool;
    uint64_t size;
-    int readonly;
+    long readonly;
    QemuMutex mutex;
 } VitastorClient;

@ -53,10 +55,14 @@ typedef struct VitastorRPC
    BlockDriverState *bs;
    Coroutine *co;
    QEMUIOVector *iov;
-    int ret;
+    long ret;
    int complete;
 } VitastorRPC;

+static void vitastor_co_init_task(BlockDriverState *bs, VitastorRPC *task);
+static void vitastor_co_generic_bh_cb(long retval, void *opaque);
+static void vitastor_close(BlockDriverState *bs);
+
 static char *qemu_rbd_next_tok(char *src, char delim, char **p)
 {
    char *end;
@ -132,22 +138,25 @@ static void vitastor_parse_filename(const char *filename, QDict *options, Error
            qdict_put_str(options, name, value);
        }
    }
+    if (!qdict_get_try_str(options, "image"))
+    {
        if (!qdict_get_try_int(options, "inode", 0))
        {
-        error_setg(errp, "inode is missing");
+            error_setg(errp, "one of image (name) and inode (number) must be specified");
            goto out;
        }
        if (!(qdict_get_try_int(options, "inode", 0) >> (64-POOL_ID_BITS)) &&
            !qdict_get_try_int(options, "pool", 0))
        {
-        error_setg(errp, "pool number is missing");
+            error_setg(errp, "pool number must be specified or included in the inode number");
            goto out;
        }
        if (!qdict_get_try_int(options, "size", 0))
        {
-        error_setg(errp, "size is missing");
+            error_setg(errp, "size must be specified when inode number is used instead of image name");
            goto out;
        }
+    }
    if (!qdict_get_str(options, "etcd_host"))
    {
        error_setg(errp, "etcd_host is missing");
@ -159,27 +168,85 @@ out:
    return;
 }

+static void coroutine_fn vitastor_co_get_metadata(VitastorRPC *task)
+{
+    BlockDriverState *bs = task->bs;
+    VitastorClient *client = bs->opaque;
+    task->co = qemu_coroutine_self();
+
+    qemu_mutex_lock(&client->mutex);
+    vitastor_proxy_watch_metadata(client->proxy, client->image, vitastor_co_generic_bh_cb, task);
+    qemu_mutex_unlock(&client->mutex);
+
+    while (!task->complete)
+    {
+        qemu_coroutine_yield();
+    }
+}
+
 static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, Error **errp)
 {
    VitastorClient *client = bs->opaque;
    int64_t ret = 0;
+    qemu_mutex_init(&client->mutex);
    client->etcd_host = g_strdup(qdict_get_try_str(options, "etcd_host"));
    client->etcd_prefix = g_strdup(qdict_get_try_str(options, "etcd_prefix"));
+    client->proxy = vitastor_proxy_create(bdrv_get_aio_context(bs), client->etcd_host, client->etcd_prefix);
+    client->image = g_strdup(qdict_get_try_str(options, "image"));
+    client->readonly = (flags & BDRV_O_RDWR) ? 1 : 0;
+    if (client->image)
+    {
+        // Get image metadata (size and readonly flag)
+        VitastorRPC task;
+        task.complete = 0;
+        task.bs = bs;
+        if (qemu_in_coroutine())
+        {
+            vitastor_co_get_metadata(&task);
+        }
+        else
+        {
+            qemu_coroutine_enter(qemu_coroutine_create((void(*)(void*))vitastor_co_get_metadata, &task));
+        }
+        BDRV_POLL_WHILE(bs, !task.complete);
+        client->watch = (void*)task.ret;
+        client->readonly = client->readonly || vitastor_proxy_get_readonly(client->watch);
+        client->size = vitastor_proxy_get_size(client->watch);
+        if (!vitastor_proxy_get_inode_num(client->watch))
+        {
+            error_setg(errp, "image does not exist");
+            vitastor_close(bs);
+            return -1;
+        }
+        if (!client->size)
+        {
+            client->size = qdict_get_int(options, "size");
+        }
+    }
+    else
+    {
+        client->watch = NULL;
        client->inode = qdict_get_int(options, "inode");
        client->pool = qdict_get_int(options, "pool");
        if (client->pool)
+        {
            client->inode = (client->inode & ((1l << (64-POOL_ID_BITS)) - 1)) | (client->pool << (64-POOL_ID_BITS));
+        }
        client->size = qdict_get_int(options, "size");
-    client->readonly = (flags & BDRV_O_RDWR) ? 1 : 0;
-    client->proxy = vitastor_proxy_create(bdrv_get_aio_context(bs), client->etcd_host, client->etcd_prefix);
-    //client->aio_context = bdrv_get_aio_context(bs);
+    }
+    if (!client->size)
+    {
+        error_setg(errp, "image size not specified");
+        vitastor_close(bs);
+        return -1;
+    }
    bs->total_sectors = client->size / BDRV_SECTOR_SIZE;
+    //client->aio_context = bdrv_get_aio_context(bs);
    qdict_del(options, "etcd_host");
    qdict_del(options, "etcd_prefix");
+    qdict_del(options, "image");
    qdict_del(options, "inode");
    qdict_del(options, "pool");
    qdict_del(options, "size");
-    qemu_mutex_init(&client->mutex);
    return ret;
 }

@ -191,6 +258,8 @@ static void vitastor_close(BlockDriverState *bs)
    g_free(client->etcd_host);
    if (client->etcd_prefix)
        g_free(client->etcd_prefix);
+    if (client->image)
+        g_free(client->image);
 }

 #if QEMU_VERSION_MAJOR >= 3
@ -296,7 +365,7 @@ static void vitastor_co_init_task(BlockDriverState *bs, VitastorRPC *task)
    };
 }

-static void vitastor_co_generic_bh_cb(int retval, void *opaque)
+static void vitastor_co_generic_bh_cb(long retval, void *opaque)
 {
    VitastorRPC *task = opaque;
    task->ret = retval;
@ -319,8 +388,9 @@ static int coroutine_fn vitastor_co_preadv(BlockDriverState *bs, uint64_t offset
    vitastor_co_init_task(bs, &task);
    task.iov = iov;

+    uint64_t inode = client->watch ? vitastor_proxy_get_inode_num(client->watch) : client->inode;
    qemu_mutex_lock(&client->mutex);
-    vitastor_proxy_rw(0, client->proxy, client->inode, offset, bytes, iov->iov, iov->niov, vitastor_co_generic_bh_cb, &task);
+    vitastor_proxy_rw(0, client->proxy, inode, offset, bytes, iov->iov, iov->niov, vitastor_co_generic_bh_cb, &task);
    qemu_mutex_unlock(&client->mutex);

    while (!task.complete)
@ -338,8 +408,9 @@ static int coroutine_fn vitastor_co_pwritev(BlockDriverState *bs, uint64_t offse
    vitastor_co_init_task(bs, &task);
    task.iov = iov;

+    uint64_t inode = client->watch ? vitastor_proxy_get_inode_num(client->watch) : client->inode;
    qemu_mutex_lock(&client->mutex);
-    vitastor_proxy_rw(1, client->proxy, client->inode, offset, bytes, iov->iov, iov->niov, vitastor_co_generic_bh_cb, &task);
+    vitastor_proxy_rw(1, client->proxy, inode, offset, bytes, iov->iov, iov->niov, vitastor_co_generic_bh_cb, &task);
    qemu_mutex_unlock(&client->mutex);

    while (!task.complete)
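
Note on the inode/pool arithmetic in vitastor_file_open(): when `pool` is passed separately, it is placed in the top POOL_ID_BITS bits of the 64-bit inode number and the lower bits keep the pool-local inode. A minimal standalone sketch follows. `make_inode` and `inode_pool` are hypothetical helpers, and POOL_ID_BITS = 16 is an assumption that matches the inode 0x1000000000001 (pool 1, inode 1) used in test_cluster_client.cpp in this change.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: POOL_ID_BITS is 16 (the define itself is not shown in this diff).
static const unsigned DEMO_POOL_ID_BITS = 16;

// Combines a pool ID and a pool-local inode number the same way
// vitastor_file_open() does when "pool" is given as a separate option.
static uint64_t make_inode(uint64_t pool, uint64_t inode)
{
    const unsigned shift = 64 - DEMO_POOL_ID_BITS;
    return (inode & ((UINT64_C(1) << shift) - 1)) | (pool << shift);
}

// Extracts the pool ID, mirroring the `inode >> (64-POOL_ID_BITS)` check
// in vitastor_parse_filename().
static uint64_t inode_pool(uint64_t inode)
{
    return inode >> (64 - DEMO_POOL_ID_BITS);
}
```

Using `UINT64_C(1)` instead of `1l` keeps the shift well-defined on platforms where `long` is 32 bits wide.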
--- a/src/qemu_proxy.cpp
+++ b/src/qemu_proxy.cpp
@ -47,7 +47,6 @@ public:

    ~QemuProxy()
    {
-        cli->stop();
        delete cli;
        delete tfd;
    }
@ -127,4 +126,38 @@ void vitastor_proxy_sync(void *client, VitastorIOHandler cb, void *opaque)
    p->cli->execute(op);
 }

+void vitastor_proxy_watch_metadata(void *client, char *image, VitastorIOHandler cb, void *opaque)
+{
+    QemuProxy *p = (QemuProxy*)client;
+    p->cli->on_ready([=]()
+    {
+        auto watch = p->cli->st_cli.watch_inode(std::string(image));
+        cb((long)watch, opaque);
+    });
+}
+
+void vitastor_proxy_close_watch(void *client, void *watch)
+{
+    QemuProxy *p = (QemuProxy*)client;
+    p->cli->st_cli.close_watch((inode_watch_t*)watch);
+}
+
+uint64_t vitastor_proxy_get_size(void *watch_ptr)
+{
+    inode_watch_t *watch = (inode_watch_t*)watch_ptr;
+    return watch->cfg.size;
+}
+
+uint64_t vitastor_proxy_get_inode_num(void *watch_ptr)
+{
+    inode_watch_t *watch = (inode_watch_t*)watch_ptr;
+    return watch->cfg.num;
+}
+
+int vitastor_proxy_get_readonly(void *watch_ptr)
+{
+    inode_watch_t *watch = (inode_watch_t*)watch_ptr;
+    return watch->cfg.readonly;
+}
+
 }
--- a/src/qemu_proxy.h
+++ b/src/qemu_proxy.h
@ -15,12 +15,17 @@ extern "C" {
 #endif

 // Our exports
-typedef void VitastorIOHandler(int retval, void *opaque);
+typedef void VitastorIOHandler(long retval, void *opaque);
 void* vitastor_proxy_create(AioContext *ctx, const char *etcd_host, const char *etcd_prefix);
 void vitastor_proxy_destroy(void *client);
 void vitastor_proxy_rw(int write, void *client, uint64_t inode, uint64_t offset, uint64_t len,
    struct iovec *iov, int iovcnt, VitastorIOHandler cb, void *opaque);
 void vitastor_proxy_sync(void *client, VitastorIOHandler cb, void *opaque);
+void vitastor_proxy_watch_metadata(void *client, char *image, VitastorIOHandler cb, void *opaque);
+void vitastor_proxy_close_watch(void *client, void *watch);
+uint64_t vitastor_proxy_get_size(void *watch);
+uint64_t vitastor_proxy_get_inode_num(void *watch);
+int vitastor_proxy_get_readonly(void *watch);

 #ifdef __cplusplus
 }
--- a/src/test_allocator.cpp
+++ b/src/test_allocator.cpp
@ -20,7 +20,15 @@ void alloc_all(int size)
        {
            printf("incorrect block allocated: expected %d, got %lu\n", i, x);
        }
+        if (a->get(x))
+        {
+            printf("not free before set at %d\n", i);
+        }
        a->set(x, true);
+        if (!a->get(x))
+        {
+            printf("free after set at %d\n", i);
+        }
    }
    uint64_t x = a->find_free();
    if (x != UINT64_MAX)
--- a/src/test_blockstore.cpp
+++ b/src/test_blockstore.cpp
@ -2,8 +2,8 @@
 // License: VNPL-1.1 (see README.md for details)

 #include <malloc.h>
-#include "timerfd_interval.h"
 #include "blockstore.h"
+#include "epoll_manager.h"

 int main(int narg, char *args[])
 {
@ -12,11 +12,8 @@ int main(int narg, char *args[])
    config["journal_device"] = "./test_journal.bin";
    config["data_device"] = "./test_data.bin";
    ring_loop_t *ringloop = new ring_loop_t(512);
-    blockstore_t *bs = new blockstore_t(config, ringloop);
-    timerfd_interval tick_tfd(ringloop, 1, []()
-    {
-        printf("tick 1s\n");
-    });
+    epoll_manager_t *epmgr = new epoll_manager_t(ringloop);
+    blockstore_t *bs = new blockstore_t(config, ringloop, epmgr->tfd);

    blockstore_op_t op;
    int main_state = 0;
@ -125,6 +122,7 @@ int main(int narg, char *args[])
        ringloop->wait();
    }
    delete bs;
+    delete epmgr;
    delete ringloop;
    return 0;
 }
--- a/src/test_cluster_client.cpp
+++ b/src/test_cluster_client.cpp
@ -0,0 +1,407 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <assert.h>
+#include "cluster_client.h"
+
+void configure_single_pg_pool(cluster_client_t *cli)
+{
+    cli->st_cli.on_load_pgs_hook(true);
+    cli->st_cli.parse_state((etcd_kv_t){
+        .key = "/config/pools",
+        .value = json11::Json::object {
+            { "1", json11::Json::object {
+                { "name", "hddpool" },
+                { "scheme", "replicated" },
+                { "pg_size", 2 },
+                { "pg_minsize", 1 },
+                { "pg_count", 1 },
+                { "failure_domain", "osd" },
+            } }
+        },
+    });
+    cli->st_cli.parse_state((etcd_kv_t){
+        .key = "/config/pgs",
+        .value = json11::Json::object {
+            { "items", json11::Json::object {
+                { "1", json11::Json::object {
+                    { "1", json11::Json::object {
+                        { "osd_set", json11::Json::array { 1, 2 } },
+                        { "primary", 1 },
+                    } }
+                } }
+            } }
+        },
+    });
+    cli->st_cli.parse_state((etcd_kv_t){
+        .key = "/pg/state/1/1",
+        .value = json11::Json::object {
+            { "peers", json11::Json::array { 1, 2 } },
+            { "primary", 1 },
+            { "state", json11::Json::array { "active" } },
+        },
+    });
+    std::map<std::string, etcd_kv_t> changes;
+    cli->st_cli.on_change_hook(changes);
+}
+
+int *test_write(cluster_client_t *cli, uint64_t offset, uint64_t len, uint8_t c, std::function<void()> cb = NULL)
+{
+    printf("Post write %lx+%lx\n", offset, len);
+    int *r = new int;
+    *r = -1;
+    cluster_op_t *op = new cluster_op_t();
+    op->opcode = OSD_OP_WRITE;
+    op->inode = 0x1000000000001;
+    op->offset = offset;
+    op->len = len;
+    op->iov.push_back(malloc_or_die(len), len);
+    memset(op->iov.buf[0].iov_base, c, len);
+    op->callback = [r, cb](cluster_op_t *op)
+    {
+        if (*r == -1)
+            printf("Error: Not allowed to complete yet\n");
+        assert(*r != -1);
+        *r = op->retval == op->len ? 1 : 0;
+        free(op->iov.buf[0].iov_base);
+        printf("Done write %lx+%lx r=%d\n", op->offset, op->len, op->retval);
+        delete op;
+        if (cb != NULL)
+            cb();
+    };
+    cli->execute(op);
+    return r;
+}
+
+int *test_sync(cluster_client_t *cli)
+{
+    printf("Post sync\n");
+    int *r = new int;
+    *r = -1;
+    cluster_op_t *op = new cluster_op_t();
+    op->opcode = OSD_OP_SYNC;
+    op->callback = [r](cluster_op_t *op)
+    {
+        if (*r == -1)
+            printf("Error: Not allowed to complete yet\n");
+        assert(*r != -1);
+        *r = op->retval == 0 ? 1 : 0;
+        printf("Done sync r=%d\n", op->retval);
+        delete op;
+    };
+    cli->execute(op);
+    return r;
+}
+
+void can_complete(int *r)
+{
+    // Allow the operation to proceed so the test verifies
+    // that it doesn't complete earlier than expected
+    *r = -2;
+}
+
+void check_completed(int *r)
+{
+    assert(*r == 1);
+    delete r;
+}
+
+void pretend_connected(cluster_client_t *cli, osd_num_t osd_num)
+{
+    printf("OSD %lu connected\n", osd_num);
+    int peer_fd = cli->msgr.clients.size() ? std::prev(cli->msgr.clients.end())->first+1 : 10;
+    cli->msgr.osd_peer_fds[osd_num] = peer_fd;
+    cli->msgr.clients[peer_fd] = new osd_client_t();
+    cli->msgr.clients[peer_fd]->osd_num = osd_num;
+    cli->msgr.clients[peer_fd]->peer_state = PEER_CONNECTED;
+    cli->msgr.wanted_peers.erase(osd_num);
+    cli->msgr.repeer_pgs(osd_num);
+}
+
+void pretend_disconnected(cluster_client_t *cli, osd_num_t osd_num)
+{
+    printf("OSD %lu disconnected\n", osd_num);
+    cli->msgr.stop_client(cli->msgr.osd_peer_fds.at(osd_num));
+}
+
+void check_disconnected(cluster_client_t *cli, osd_num_t osd_num)
+{
+    if (cli->msgr.osd_peer_fds.find(osd_num) != cli->msgr.osd_peer_fds.end())
+    {
+        printf("OSD %lu not disconnected as it ought to be\n", osd_num);
+        assert(0);
+    }
+}
+
+void check_op_count(cluster_client_t *cli, osd_num_t osd_num, int ops)
+{
+    int peer_fd = cli->msgr.osd_peer_fds.at(osd_num);
+    int real_ops = cli->msgr.clients[peer_fd]->sent_ops.size();
+    if (real_ops != ops)
+    {
+        printf("error: %d ops expected, but %d queued\n", ops, real_ops);
+        assert(0);
+    }
+}
+
+osd_op_t *find_op(cluster_client_t *cli, osd_num_t osd_num, uint64_t opcode, uint64_t offset, uint64_t len)
+{
+    int peer_fd = cli->msgr.osd_peer_fds.at(osd_num);
+    auto op_it = cli->msgr.clients[peer_fd]->sent_ops.begin();
+    while (op_it != cli->msgr.clients[peer_fd]->sent_ops.end())
+    {
+        auto op = op_it->second;
+        if (op->req.hdr.opcode == opcode && (opcode == OSD_OP_SYNC ||
+            op->req.rw.inode == 0x1000000000001 && op->req.rw.offset == offset && op->req.rw.len == len))
+        {
+            return op;
+        }
+        op_it++;
+    }
+    return NULL;
+}
+
+void pretend_op_completed(cluster_client_t *cli, osd_op_t *op, int64_t retval)
+{
+    assert(op);
+    printf("Pretend completed %s %lx+%x\n", op->req.hdr.opcode == OSD_OP_SYNC
+        ? "sync" : (op->req.hdr.opcode == OSD_OP_WRITE ? "write" : "read"), op->req.rw.offset, op->req.rw.len);
+    uint64_t op_id = op->req.hdr.id;
+    int peer_fd = op->peer_fd;
+    cli->msgr.clients[peer_fd]->sent_ops.erase(op_id);
+    op->reply.hdr.magic = SECONDARY_OSD_REPLY_MAGIC;
+    op->reply.hdr.id = op->req.hdr.id;
+    op->reply.hdr.opcode = op->req.hdr.opcode;
+    op->reply.hdr.retval = retval < 0 ? retval : (op->req.hdr.opcode == OSD_OP_SYNC ? 0 : op->req.rw.len);
+    // Copy lambda to be unaffected by `delete op`
+    std::function<void(osd_op_t*)>(op->callback)(op);
+}
+
+void test1()
+{
+    json11::Json config;
+    timerfd_manager_t *tfd = new timerfd_manager_t([](int fd, bool wr, std::function<void(int, int)> callback){});
+    cluster_client_t *cli = new cluster_client_t(NULL, tfd, config);
+
+    int *r1 = test_write(cli, 0, 4096, 0x55);
+    configure_single_pg_pool(cli);
+    pretend_connected(cli, 1);
+    cli->continue_ops(true);
+    can_complete(r1);
+    check_op_count(cli, 1, 1);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 4096), 0);
+    check_completed(r1);
+    pretend_disconnected(cli, 1);
+    int *r2 = test_sync(cli);
+    pretend_connected(cli, 1);
+    check_op_count(cli, 1, 0);
+    cli->continue_ops(true);
+    check_op_count(cli, 1, 1);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 4096), 0);
+    check_op_count(cli, 1, 1);
+    can_complete(r2);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_SYNC, 0, 0), 0);
+    check_completed(r2);
+    // Check that the client doesn't repeat operations once more
+    pretend_disconnected(cli, 1);
+    pretend_connected(cli, 1);
+    check_op_count(cli, 1, 0);
+
+    // Case:
+    // Write(1) -> Complete Write(1) -> Overwrite(2) -> Complete Write(2)
+    // -> Overwrite(3) -> Drop OSD connection -> Reestablish OSD connection
+    // -> Complete All Posted Writes -> Sync -> Complete Sync
+    // The resulting state of the block must be (3) over (2) over (1).
+    // I.e. the part overwritten by (3) must remain as in (3) and so on.
+
+    // More interesting case:
+    // Same, but both Write(2) and Write(3) must consist of two parts:
+    // one from OSD 2, which drops its connection, and the other from OSD 1, which doesn't.
+    // The idea is that if the whole Write(2) is repeated when OSD 2 drops its connection,
+    // it may also overwrite a part on OSD 1 that shouldn't be overwritten.
+
+    // Another interesting case:
+    // A new operation added during replay (would also break with the previous implementation)
+
+    r1 = test_write(cli, 0, 0x10000, 0x56);
+    can_complete(r1);
+    check_op_count(cli, 1, 1);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 0x10000), 0);
+    check_completed(r1);
+
+    r1 = test_write(cli, 0xE000, 0x4000, 0x57);
+    can_complete(r1);
+    check_op_count(cli, 1, 1);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0xE000, 0x4000), 0);
+    check_completed(r1);
+
+    r1 = test_write(cli, 0x10000, 0x4000, 0x58);
+
+    pretend_disconnected(cli, 1);
+    pretend_connected(cli, 1);
+    cli->continue_ops(true);
+
+    // Check replay
+    {
+        uint64_t replay_start = UINT64_MAX;
+        uint64_t replay_end = 0;
+        std::vector<osd_op_t*> replay_ops;
+        auto osd_cl = cli->msgr.clients.at(cli->msgr.osd_peer_fds.at(1));
+        for (auto & op_p: osd_cl->sent_ops)
+        {
+            auto op = op_p.second;
+            assert(op->req.hdr.opcode == OSD_OP_WRITE);
+            uint64_t offset = op->req.rw.offset;
+            if (op->req.rw.offset < replay_start)
+                replay_start = op->req.rw.offset;
+            if (op->req.rw.offset+op->req.rw.len > replay_end)
+                replay_end = op->req.rw.offset+op->req.rw.len;
+            for (int buf_idx = 0; buf_idx < op->iov.count; buf_idx++)
+            {
+                for (int i = 0; i < op->iov.buf[buf_idx].iov_len; i++, offset++)
+                {
+                    uint8_t c = offset < 0xE000 ? 0x56 : (offset < 0x10000 ? 0x57 : 0x58);
+                    if (((uint8_t*)op->iov.buf[buf_idx].iov_base)[i] != c)
+                    {
+                        printf("Write replay: mismatch at %lu\n", offset-op->req.rw.offset);
+                        goto fail;
+                    }
+                }
+            }
+        fail:
+            assert(offset == op->req.rw.offset+op->req.rw.len);
+            replay_ops.push_back(op);
+        }
+        if (replay_start != 0 || replay_end != 0x14000)
+        {
+            printf("Write replay: range mismatch: %lx-%lx\n", replay_start, replay_end);
+            assert(0);
+        }
+        for (auto op: replay_ops)
+        {
+            pretend_op_completed(cli, op, 0);
+        }
+    }
+    // Check that the following write finally proceeds
+    check_op_count(cli, 1, 1);
+    can_complete(r1);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0x10000, 0x4000), 0);
+    check_completed(r1);
+    check_op_count(cli, 1, 0);
+
+    // Check sync
+    r2 = test_sync(cli);
+    can_complete(r2);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_SYNC, 0, 0), 0);
+    check_completed(r2);
+
+    // Check disconnect during write
+    r1 = test_write(cli, 0, 4096, 0x59);
+    check_op_count(cli, 1, 1);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 0x1000), -EPIPE);
+    check_disconnected(cli, 1);
+    pretend_connected(cli, 1);
+    check_op_count(cli, 1, 0);
+    cli->continue_ops(true);
+    check_op_count(cli, 1, 1);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 0x1000), 0);
+    check_op_count(cli, 1, 1);
+    can_complete(r1);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 0x1000), 0);
+    check_completed(r1);
+
+    // Check disconnect inside operation callback (reentrancy)
+    // Probably doesn't happen too often, but it is possible in theory
+    r1 = test_write(cli, 0, 0x1000, 0x60, [cli]()
+    {
+        pretend_disconnected(cli, 1);
+    });
+    r2 = test_write(cli, 0x1000, 0x1000, 0x61);
+    check_op_count(cli, 1, 2);
+    can_complete(r1);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 0x1000), 0);
+    check_completed(r1);
+    check_disconnected(cli, 1);
+    pretend_connected(cli, 1);
+    cli->continue_ops(true);
+    check_op_count(cli, 1, 2);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 0x1000), 0);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0x1000, 0x1000), 0);
+    check_op_count(cli, 1, 1);
+    can_complete(r2);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0x1000, 0x1000), 0);
+    check_completed(r2);
+
+    // Free client
+    delete cli;
+    delete tfd;
+    printf("[ok] write replay test\n");
+}
+
+void test2()
+{
+    std::map<object_id, cluster_buffer_t> unsynced_writes;
+    cluster_op_t *op = new cluster_op_t();
+    op->opcode = OSD_OP_WRITE;
+    op->inode = 1;
+    op->offset = 0;
+    op->len = 4096;
+    op->iov.push_back(malloc_or_die(4096*1024), 4096);
+    // 0-4k = 0x55
+    memset(op->iov.buf[0].iov_base, 0x55, op->iov.buf[0].iov_len);
+    cluster_client_t::copy_write(op, unsynced_writes);
+    // 8k-12k = 0x66
+    op->offset = 8192;
+    memset(op->iov.buf[0].iov_base, 0x66, op->iov.buf[0].iov_len);
+    cluster_client_t::copy_write(op, unsynced_writes);
+    // 4k-1M+4k = 0x77
+    op->len = op->iov.buf[0].iov_len = 1048576;
+    op->offset = 4096;
+    memset(op->iov.buf[0].iov_base, 0x77, op->iov.buf[0].iov_len);
+    cluster_client_t::copy_write(op, unsynced_writes);
+    // check it
+    assert(unsynced_writes.size() == 4);
+    auto uit = unsynced_writes.begin();
+    int i;
+    assert(uit->first.inode == 1);
+    assert(uit->first.stripe == 0);
+    assert(uit->second.len == 4096);
+    for (i = 0; i < uit->second.len && ((uint8_t*)uit->second.buf)[i] == 0x55; i++) {}
+    assert(i == uit->second.len);
+    uit++;
+    assert(uit->first.inode == 1);
+    assert(uit->first.stripe == 4096);
+    assert(uit->second.len == 4096);
+    for (i = 0; i < uit->second.len && ((uint8_t*)uit->second.buf)[i] == 0x77; i++) {}
+    assert(i == uit->second.len);
+    uit++;
+    assert(uit->first.inode == 1);
+    assert(uit->first.stripe == 8192);
+    assert(uit->second.len == 4096);
+    for (i = 0; i < uit->second.len && ((uint8_t*)uit->second.buf)[i] == 0x77; i++) {}
+    assert(i == uit->second.len);
+    uit++;
+    assert(uit->first.inode == 1);
+    assert(uit->first.stripe == 12*1024);
+    assert(uit->second.len == 1016*1024);
+    for (i = 0; i < uit->second.len && ((uint8_t*)uit->second.buf)[i] == 0x77; i++) {}
+    assert(i == uit->second.len);
+    uit++;
+    // free memory
+    free(op->iov.buf[0].iov_base);
+    delete op;
+    for (auto p: unsynced_writes)
+    {
+        free(p.second.buf);
+    }
+    printf("[ok] copy_write test\n");
+}
+
+int main(int narg, char *args[])
+{
+    test1();
+    test2();
+    return 0;
+}
--- a/src/timerfd_interval.cpp
+++ b/src/timerfd_interval.cpp
@ -1,64 +0,0 @@
-// Copyright (c) Vitaliy Filippov, 2019+
-// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
-
-#include <sys/timerfd.h>
-#include <sys/poll.h>
-#include <unistd.h>
-#include "timerfd_interval.h"
-
-timerfd_interval::timerfd_interval(ring_loop_t *ringloop, int seconds, std::function<void(void)> cb)
-{
-    wait_state = 0;
-    timerfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);
-    if (timerfd < 0)
-    {
-        throw std::runtime_error(std::string("timerfd_create: ") + strerror(errno));
-    }
-    struct itimerspec exp = {
-        .it_interval = { seconds, 0 },
-        .it_value = { seconds, 0 },
-    };
-    if (timerfd_settime(timerfd, 0, &exp, NULL))
-    {
-        throw std::runtime_error(std::string("timerfd_settime: ") + strerror(errno));
-    }
-    consumer.loop = [this]() { loop(); };
-    ringloop->register_consumer(&consumer);
-    this->ringloop = ringloop;
-    this->callback = cb;
-}
-
-timerfd_interval::~timerfd_interval()
-{
-    ringloop->unregister_consumer(&consumer);
-    close(timerfd);
-}
-
-void timerfd_interval::loop()
-{
-    if (wait_state == 1)
-    {
-        return;
-    }
-    struct io_uring_sqe *sqe = ringloop->get_sqe();
-    if (!sqe)
-    {
-        wait_state = 0;
-        return;
-    }
-    struct ring_data_t *data = ((ring_data_t*)sqe->user_data);
-    my_uring_prep_poll_add(sqe, timerfd, POLLIN);
-    data->callback = [&](ring_data_t *data)
-    {
-        if (data->res < 0)
-        {
-            throw std::runtime_error(std::string("waiting for timer failed: ") + strerror(-data->res));
-        }
-        uint64_t n;
-        read(timerfd, &n, 8);
-        wait_state = 0;
-        callback();
-    };
-    wait_state = 1;
-    ringloop->submit();
-}
--- a/src/timerfd_interval.h
+++ b/src/timerfd_interval.h
@ -1,19 +0,0 @@
-// Copyright (c) Vitaliy Filippov, 2019+
-// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
-
-#pragma once
-
-#include "ringloop.h"
-
-class timerfd_interval
-{
-    int wait_state;
-    int timerfd;
-    ring_loop_t *ringloop;
-    ring_consumer_t consumer;
-    std::function<void(void)> callback;
-public:
-    timerfd_interval(ring_loop_t *ringloop, int seconds, std::function<void(void)> cb);
-    ~timerfd_interval();
-    void loop();
-};
--- a/src/timerfd_manager.cpp
+++ b/src/timerfd_manager.cpp
@ -34,8 +34,8 @@ timerfd_manager_t::~timerfd_manager_t()

 void timerfd_manager_t::inc_timer(timerfd_timer_t & t)
 {
-    t.next.tv_sec += t.millis/1000;
-    t.next.tv_nsec += (t.millis%1000)*1000000;
+    t.next.tv_sec += t.micros/1000000;
+    t.next.tv_nsec += (t.micros%1000000)*1000;
    if (t.next.tv_nsec > 1000000000)
    {
        t.next.tv_sec++;
@ -44,13 +44,18 @@ void timerfd_manager_t::inc_timer(timerfd_timer_t & t)
 }

 int timerfd_manager_t::set_timer(uint64_t millis, bool repeat, std::function<void(int)> callback)
+{
+    return set_timer_us(millis*1000, repeat, callback);
+}
+
+int timerfd_manager_t::set_timer_us(uint64_t micros, bool repeat, std::function<void(int)> callback)
 {
    int timer_id = id++;
    timespec start;
    clock_gettime(CLOCK_MONOTONIC, &start);
    timers.push_back({
        .id = timer_id,
-        .millis = millis,
+        .micros = micros,
        .start = start,
        .next = start,
        .repeat = repeat,
@ -121,7 +126,7 @@ again:
            exp.it_value.tv_sec--;
            exp.it_value.tv_nsec += 1000000000;
        }
-        if (exp.it_value.tv_sec < 0 || !exp.it_value.tv_sec && !exp.it_value.tv_nsec)
+        if (exp.it_value.tv_sec < 0 || exp.it_value.tv_sec == 0 && exp.it_value.tv_nsec <= 0)
        {
            // It already happened
            trigger_nearest();
@ -159,6 +164,6 @@ void timerfd_manager_t::trigger_nearest()
    {
        timers.erase(timers.begin()+nearest, timers.begin()+nearest+1);
    }
-    cb(nearest_id);
    nearest = -1;
+    cb(nearest_id);
 }
--- a/src/timerfd_manager.h
+++ b/src/timerfd_manager.h
@ -10,7 +10,7 @@
 struct timerfd_timer_t
 {
    int id;
-    uint64_t millis;
+    uint64_t micros;
    timespec start, next;
    bool repeat;
    std::function<void(int)> callback;
@ -34,5 +34,6 @@ public:
    timerfd_manager_t(std::function<void(int, bool, std::function<void(int, int)>)> set_fd_handler);
    ~timerfd_manager_t();
    int set_timer(uint64_t millis, bool repeat, std::function<void(int)> callback);
+    int set_timer_us(uint64_t micros, bool repeat, std::function<void(int)> callback);
    void clear_timer(int timer_id);
 };
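The `inc_timer` hunk in `src/timerfd_manager.cpp` switches the timer period from milliseconds to microseconds and folds it into a `timespec`. A minimal standalone sketch of that arithmetic follows. The helper name `advance_timespec_us` is hypothetical and is not part of the diff, and the sketch normalizes with `>=` (an assumption made here so that `tv_nsec` always stays below one second), whereas the context line in the hunk compares with `>`.

```cpp
#include <cstdint>
#include <ctime>

// Hypothetical helper (not part of the diff): advance `next` by `micros` microseconds.
static void advance_timespec_us(timespec & next, uint64_t micros)
{
    // Whole seconds go into tv_sec, the remainder is converted to nanoseconds
    next.tv_sec += (time_t)(micros / 1000000);
    next.tv_nsec += (long)((micros % 1000000) * 1000);
    // Assumption of this sketch: carry as soon as tv_nsec reaches one full second
    if (next.tv_nsec >= 1000000000)
    {
        next.tv_sec++;
        next.tv_nsec -= 1000000000;
    }
}
```

A millisecond interval maps onto the same helper by multiplying by 1000 first, which mirrors how `set_timer` forwards to `set_timer_us` in the hunk.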
--- a/tests/common.sh
+++ b/tests/common.sh
@ -23,8 +23,10 @@ trap 'kill -9 $(jobs -p)' EXIT
 ETCD=${ETCD:-etcd}
 ETCD_PORT=${ETCD_PORT:-12379}

-rm -rf ./testdata
-mkdir -p ./testdata
+if [ "$KEEP_DATA" = "" ]; then
+    rm -rf ./testdata
+    mkdir -p ./testdata
+fi

 $ETCD -name etcd_test --data-dir ./testdata/etcd \
    --advertise-client-urls http://127.0.0.1:$ETCD_PORT --listen-client-urls http://127.0.0.1:$ETCD_PORT \
--- a/tests/test_change_pg_count.sh
+++ b/tests/test_change_pg_count.sh
@ -2,6 +2,14 @@

 . `dirname $0`/common.sh

+if [ "$EC" != "" ]; then
+    POOLCFG='"scheme":"xor","pg_size":3,"pg_minsize":2,"parity_chunks":1'
+    NOBJ=512
+else
+    POOLCFG='"scheme":"replicated","pg_size":2,"pg_minsize":2'
+    NOBJ=1024
+fi
+
 dd if=/dev/zero of=./testdata/test_osd1.bin bs=1024 count=1 seek=$((1024*1024-1))
 dd if=/dev/zero of=./testdata/test_osd2.bin bs=1024 count=1 seek=$((1024*1024-1))
 dd if=/dev/zero of=./testdata/test_osd3.bin bs=1024 count=1 seek=$((1024*1024-1))
@ -28,7 +36,7 @@ cd ..
 node mon/mon-main.js --etcd_url http://$ETCD_URL --etcd_prefix "/vitastor" --verbose 1 &>./testdata/mon.log &
 MON_PID=$!

-$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":2,"pg_count":16,"failure_domain":"osd"}}'
+$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool",'$POOLCFG',"pg_count":16,"failure_domain":"osd"}}'

 sleep 2

@ -52,7 +60,7 @@ try_change()
        echo --- Change PG count to $n --- >>testdata/osd$i.log
    done

-    $ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":2,"pg_count":'$n',"failure_domain":"osd"}}'
+    $ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool",'$POOLCFG',"pg_count":'$n',"failure_domain":"osd"}}'

    for i in {1..10}; do
        ($ETCDCTL get /vitastor/config/pgs --print-value-only | jq -s -e '(.[0].items["1"] | map((.osd_set | select(. > 0)) | length == 2) | length) == '$n) && \
@ -82,8 +90,8 @@ try_change()

    # Check that no objects are lost !
    nobj=`$ETCDCTL get --prefix '/vitastor/pg/stats' --print-value-only | jq -s '[ .[].object_count ] | reduce .[] as $num (0; .+$num)'`
-    if [ "$nobj" -ne 1024 ]; then
-        format_error "Data lost after changing PG count to $n: 1024 objects expected, but got $nobj"
+    if [ "$nobj" -ne $NOBJ ]; then
+        format_error "Data lost after changing PG count to $n: $NOBJ objects expected, but got $nobj"
    fi
 }

--- a/tests/test_snapshot.sh
+++ b/tests/test_snapshot.sh
@ -0,0 +1,75 @@
+#!/bin/bash -ex
+
+. `dirname $0`/common.sh
+
+dd if=/dev/zero of=./testdata/test_osd1.bin bs=1024 count=1 seek=$((1024*1024-1))
+dd if=/dev/zero of=./testdata/test_osd2.bin bs=1024 count=1 seek=$((1024*1024-1))
+dd if=/dev/zero of=./testdata/test_osd3.bin bs=1024 count=1 seek=$((1024*1024-1))
+
+build/src/vitastor-osd --osd_num 1 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd1.bin 2>/dev/null) &>./testdata/osd1.log &
+OSD1_PID=$!
+build/src/vitastor-osd --osd_num 2 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd2.bin 2>/dev/null) &>./testdata/osd2.log &
+OSD2_PID=$!
+build/src/vitastor-osd --osd_num 3 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd3.bin 2>/dev/null) &>./testdata/osd3.log &
+OSD3_PID=$!
+
+cd mon
+npm install
+cd ..
+node mon/mon-main.js --etcd_url http://$ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
+MON_PID=$!
+
+$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"xor","pg_size":3,"pg_minsize":2,"parity_chunks":1,"pg_count":1,"failure_domain":"osd"}}'
+
+sleep 2
+
+if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only | jq -s -e '(. | length) != 0 and (.[0].items["1"]["1"].osd_set | sort) == ["1","2","3"]'); then
+    format_error "FAILED: 1 PG NOT CONFIGURED"
+fi
+
+if ! ($ETCDCTL get /vitastor/pg/state/1/1 --print-value-only | jq -s -e '(. | length) != 0 and .[0].state == ["active"]'); then
+    format_error "FAILED: 1 PG NOT UP"
+fi
+
+if ! cmp build/src/block-vitastor.so /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so; then
+    sudo rm -f /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
+    sudo ln -s "$(realpath .)/build/src/block-vitastor.so" /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
+fi
+
+# Test basic write and snapshot
+
+$ETCDCTL put /vitastor/config/inode/1/2 '{"name":"testimg","size":'$((32*1024*1024))'}'
+
+LD_PRELOAD=libasan.so.5 \
+    fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -fsync=1 -rw=write \
+        -etcd=$ETCD_URL -pool=1 -inode=2 -size=32M -cluster_log_level=10
+
+$ETCDCTL put /vitastor/config/inode/1/2 '{"name":"testimg@0","size":'$((32*1024*1024))'}'
+$ETCDCTL put /vitastor/config/inode/1/3 '{"parent_id":2,"name":"testimg","size":'$((32*1024*1024))'}'
+
+LD_PRELOAD=libasan.so.5 \
+    fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -iodepth=1 -fsync=32 -buffer_pattern=0xdeadface \
+        -rw=randwrite -etcd=$ETCD_URL -image=testimg -number_ios=1024
+
+LD_PRELOAD=libasan.so.5 \
+    fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -rw=read -etcd=$ETCD_URL -pool=1 -inode=3 -size=32M
+
+qemu-img convert -S 4096 -p \
+    -f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:pool=1:inode=3:size=$((32*1024*1024))" \
+    -O raw ./testdata/merged.bin
+
+qemu-img convert -S 4096 -p \
+    -f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:image=testimg@0" \
+    -O raw ./testdata/layer0.bin
+
+$ETCDCTL put /vitastor/config/inode/1/3 '{"name":"testimg","size":'$((32*1024*1024))'}'
+
+qemu-img convert -S 4096 -p \
+    -f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:image=testimg" \
+    -O raw ./testdata/layer1.bin
+
+node mon/merge.js ./testdata/layer0.bin ./testdata/layer1.bin ./testdata/check.bin
+
+cmp ./testdata/merged.bin ./testdata/check.bin
+
+format_green OK
--- a/tests/test_vm_cont.sh
+++ b/tests/test_vm_cont.sh
@ -0,0 +1,36 @@
+#!/bin/bash -ex
+
+export KEEP_DATA=1
+. `dirname $0`/common.sh
+
+etcdctl --endpoints=http://127.0.0.1:12379/v3 del --prefix /vitastor/mon/master
+etcdctl --endpoints=http://127.0.0.1:12379/v3 del --prefix /vitastor/pg/state
+etcdctl --endpoints=http://127.0.0.1:12379/v3 del --prefix /vitastor/osd/state
+
+build/src/vitastor-osd --osd_num 1 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd1.bin 2>/dev/null) &>./testdata/osd1.log &
+OSD1_PID=$!
+build/src/vitastor-osd --osd_num 2 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd2.bin 2>/dev/null) &>./testdata/osd2.log &
+OSD2_PID=$!
+build/src/vitastor-osd --osd_num 3 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd3.bin 2>/dev/null) &>./testdata/osd3.log &
+OSD3_PID=$!
+
+node mon/mon-main.js --etcd_url http://$ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
+MON_PID=$!
+
+sleep 3
+
+if ! ($ETCDCTL get /vitastor/pg/state/1/1 --print-value-only | jq -s -e '(. | length) != 0 and .[0].state == ["active"]'); then
+    format_error "FAILED: 1 PG NOT UP"
+fi
+
+if ! cmp build/src/block-vitastor.so /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so; then
+    sudo rm -f /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
+    sudo ln -s "$(realpath .)/build/src/block-vitastor.so" /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
+fi
+
+qemu-system-x86_64 -enable-kvm -m 1024 \
+    -drive 'file=vitastor:etcd_host=127.0.0.1\:'$ETCD_PORT'/v3:image=debian9',format=raw,if=none,id=drive-virtio-disk0,cache=none \
+    -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512 \
+    -vnc 0.0.0.0:0
+
+format_green OK
--- a/tests/test_vm_start.sh
+++ b/tests/test_vm_start.sh
@ -0,0 +1,53 @@
+#!/bin/bash -ex
+
+. `dirname $0`/common.sh
+
+dd if=/dev/zero of=./testdata/test_osd1.bin bs=2048 count=1 seek=$((1024*1024-1))
+dd if=/dev/zero of=./testdata/test_osd2.bin bs=2048 count=1 seek=$((1024*1024-1))
+dd if=/dev/zero of=./testdata/test_osd3.bin bs=2048 count=1 seek=$((1024*1024-1))
+
+build/src/vitastor-osd --osd_num 1 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd1.bin 2>/dev/null) &>./testdata/osd1.log &
+OSD1_PID=$!
+build/src/vitastor-osd --osd_num 2 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd2.bin 2>/dev/null) &>./testdata/osd2.log &
+OSD2_PID=$!
+build/src/vitastor-osd --osd_num 3 --bind_address 127.0.0.1 --etcd_address $ETCD_URL $(node mon/simple-offsets.js --format options --device ./testdata/test_osd3.bin 2>/dev/null) &>./testdata/osd3.log &
+OSD3_PID=$!
+
+cd mon
+npm install
+cd ..
+node mon/mon-main.js --etcd_url http://$ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
+MON_PID=$!
+
+$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"xor","pg_size":3,"pg_minsize":2,"parity_chunks":1,"pg_count":1,"failure_domain":"osd"}}'
+
+sleep 2
+
+if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only | jq -s -e '(. | length) != 0 and (.[0].items["1"]["1"].osd_set | sort) == ["1","2","3"]'); then
+    format_error "FAILED: 1 PG NOT CONFIGURED"
+fi
+
+if ! ($ETCDCTL get /vitastor/pg/state/1/1 --print-value-only | jq -s -e '(. | length) != 0 and .[0].state == ["active"]'); then
+    format_error "FAILED: 1 PG NOT UP"
+fi
+
+if ! cmp build/src/block-vitastor.so /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so; then
+    sudo rm -f /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
+    sudo ln -s "$(realpath .)/build/src/block-vitastor.so" /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
+fi
+
+$ETCDCTL put /vitastor/config/inode/1/1 '{"name":"debian9","size":'$((2048*1024*1024))'}'
+
+qemu-img convert -S 4096 -p \
+    -f raw ~/debian9-kvm.raw \
+    -O raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:image=debian9"
+
+$ETCDCTL put /vitastor/config/inode/1/1 '{"name":"debian9@0","size":'$((2048*1024*1024))'}'
+$ETCDCTL put /vitastor/config/inode/1/2 '{"parent_id":1,"name":"debian9","size":'$((2048*1024*1024))'}'
+
+qemu-system-x86_64 -enable-kvm -m 1024 \
+    -drive 'file=vitastor:etcd_host=127.0.0.1\:'$ETCD_PORT'/v3:image=debian9',format=raw,if=none,id=drive-virtio-disk0,cache=none \
+    -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512 \
+    -vnc 0.0.0.0:0
+
+format_green OK
--- a/tests/test_write.sh
+++ b/tests/test_write.sh
@ -34,7 +34,30 @@ fi
 #LD_PRELOAD=libasan.so.5 \
 #    fio -thread -name=test -ioengine=build/src/libfio_vitastor_sec.so -bs=4k -fsync=128 `$ETCDCTL get /vitastor/osd/state/1 --print-value-only | jq -r '"-host="+.addresses[0]+" -port="+(.port|tostring)'` -rw=write -size=32M

+if ! cmp build/src/block-vitastor.so /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so; then
+    sudo rm -f /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
+    sudo ln -s "$(realpath .)/build/src/block-vitastor.so" /usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so
+fi
+
+# A lot of parallel syncs were crashing the primary OSD at some point
+
 LD_PRELOAD=libasan.so.5 \
-    fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -fsync=1 -rw=write -etcd=$ETCD_URL -pool=1 -inode=1 -size=1G -cluster_log_level=10
+    fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -numjobs=64 -iodepth=1 -fsync=1 \
+        -rw=randwrite -etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -number_ios=100
+
+LD_PRELOAD=libasan.so.5 \
+    fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4M -direct=1 -iodepth=1 -fsync=1 -rw=write -etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -cluster_log_level=10
+
+LD_PRELOAD=libasan.so.5 \
+    fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -iodepth=1 -fsync=32 -buffer_pattern=0xdeadface \
+        -rw=randwrite -etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -number_ios=1024
+
+qemu-img convert -S 4096 -p \
+    -f raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:pool=1:inode=1:size=$((128*1024*1024))" \
+    -O raw ./testdata/read.bin
+
+qemu-img convert -S 4096 -p \
+    -f raw ./testdata/read.bin \
+    -O raw "vitastor:etcd_host=127.0.0.1\:$ETCD_PORT/v3:pool=1:inode=1:size=$((128*1024*1024))"

 format_green OK
Author	SHA1	Message	Date
Vitaliy Filippov	715bc8d53d	Release 0.6.2 - Fix a possible crash during SYNC when journal fsyncs are enabled - Fix a memory leak in the chained read implementation	2021-04-15 23:40:06 +03:00
Vitaliy Filippov	0af077701c	Fix a possible crash during SYNC when journal fsyncs are enabled	2021-04-15 02:01:50 +03:00
Vitaliy Filippov	cac976ce25	Fix a memory leak in the chained read implementation	2021-04-15 01:42:18 +03:00
Vitaliy Filippov	acf0646542	Build common sources once	2021-04-15 01:13:34 +03:00
Vitaliy Filippov	ede1c1d667	Release 0.6.1 A bugfix for the new "chained read from snapshot" feature	2021-04-14 22:32:23 +03:00
Vitaliy Filippov	38bd51c97f	Remove aio_context assertion, it seems it is unneeded	2021-04-14 22:32:15 +03:00
Vitaliy Filippov	8c9f32cd45	Add run_vm test bash scripts	2021-04-13 16:21:21 +03:00
Vitaliy Filippov	966fb763ca	Oooops, fix chained reads	2021-04-13 16:19:21 +03:00
Vitaliy Filippov	0b41ffc08d	Release 0.6.0 Warning: upgrading from 0.5.x is currently not supported! Please create an issue if you really need upgrade capability. New features: - Snapshots and Copy-on-Write clones - Inode (image) names - Inode I/O and space statistics - Write throttling for smoothing random write workloads in SSD+HDD configurations	2021-04-11 00:49:18 +03:00
Vitaliy Filippov	64eeb79051	Prevent 0.6.x OSDs from talking to 0.5.x The new protocol is almost compatible - it has bitmaps, but also it has a "bitmap_length" field. It's not hard to make 0.5-0.6 OSDs and clients compatible, but for now I just assume nobody needs it. If I'm wrong and anybody requests to upgrade their production 0.5.x system to 0.6.x I'll fix it.	2021-04-10 22:26:17 +03:00
Vitaliy Filippov	2a02f3c4c7	Add metadata superblock and check it on start Refuse to start if the superblock is missing or bad version; zero out the metadata area when initializing superblock.	2021-04-10 22:26:17 +03:00
Vitaliy Filippov	f684d9101a	Refuse to start with old journal version	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	c72fddd714	Notes about master/0.5.x	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	a1f2f19489	Do not increment inode statistics if the object already exists	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	82c1a7ec67	Fix statistics reporting, split inode number into pool & inode	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	2ab423d4ef	Implement journaled write throttling for the SSD+HDD case	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	4694811eab	Add microsecond accuracy to set_timer	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	6b988de17d	Remove timerfd_interval	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	37efdc2a83	Fix bitmap_set for replicated pools	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	591cad09c9	Fix bitmaps for objects larger than 128K	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	b907ad50aa	Oops, forgot to add external bitmaps to blockstore in some places	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	7308d6a6c0	Note about etcd 3.4.15	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	5f5b6ef150	Enable chained reads in the client	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	38a3df4a0e	Implement chained (optimized) read in the primary OSD code	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	6950b8e3a0	Watch inode metadata revisions	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	0cea3576fb	Add "read bitmaps" operation to secondary OSD protocol	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	f01eea07d3	Add simplified interface to read blockstore bitmaps synchronously	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	2c2f08aca2	Shorten some structure names	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	d6524670e1	Introduce data distribution locality	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	879ecfa74d	Fix wording	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	aea2d19d35	Change Telegram chat link	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	04f86dc00b	Fix Russian README for CMake build	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	7aeb2cbac7	Capture all by value in qemu_proxy	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	519f081006	Add LICENSE	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	e50f703e1d	Add Russian version of the README	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	2612d3198a	Introduce image names and metadata storage in etcd Each inode has: image name, parent inode number & pool, size and readonly flag Snapshots are created by switching image name to a different inode number while using the older inode as parent.	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	ab39ce2bbb	Use clean_entry_bitmap_size instead of entry_attr_size back because of changed bitmap handling	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	d0c2e31312	Add a test for snapshots, fix bugs. Now the test passes	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	9038d42327	Fix several snapshot I/O bugs	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	691f066055	Actual snapshot support (untested)	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	ffe1cd4c79	Report inode I/O statistics, aggregate it in the monitor	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	4ae1b84c67	Report inode space usage statistics to etcd, aggregate it in the monitor	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	c35963967f	Add inode space usage statistics tracking to blockstore	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	0aa2dd2890	Send bitmaps with primary-reads, actually read bitmaps for READ ops	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	6bf88883ac	Allocate bitmaps along with stripes to avoid memory fragmentation	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	004f265393	Remove cryptic bitmap inlining from bs_op_t and osd_op_t, use bitmap in primary OSD code	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	860ac24762	Add "external" bitmap support to the secondary OSD protocol	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	6107a4d07b	Add "external" bitmap support to blockstore	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	95c29b9dc3	Add "external" bitmap support to osd_rmw	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	d99407dcec	Check QEMU block-vitastor.so during the test	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	6909807068	Allow to start the OSD just to flush the journal completely	2021-04-10 17:44:12 +03:00
Vitaliy Filippov	ec90fe6ec1	Release 0.5.13 Another followup to 0.5.11	2021-04-09 12:10:16 +03:00
Vitaliy Filippov	18c72f4835	Correct reenterability fix (now verified with a test) It's rather funny but 0.5.12 has to be re-published again	2021-04-09 12:10:16 +03:00
Vitaliy Filippov	59fbcef734	Release 0.5.12 Fix qemu driver broken in 0.5.11 :)	2021-04-08 15:47:18 +03:00
Vitaliy Filippov	40b7c21fb1	Followup to `307c1731c1` - fix mark_stable	2021-04-08 15:47:18 +03:00
Vitaliy Filippov	efb3678606	Fix qemu-img broken in 0.5.11 Caused by the lack of reenterability of the main cluster_client function	2021-04-08 14:59:20 +03:00
Vitaliy Filippov	462650134e	Release 0.5.11 Another bunch of fixes, including important ones. Now OSDs are stable in SSD+HDD configurations and everything is mostly ready for the merge of the master branch. Features: - Add min_flusher_count configuration (good for HDDs) - Shuffle PGs for better data device utilisation - Make OSDs benefit from the immediate_commit=small setting if it's applicable Bug fixes: - Rework client code to fix write ordering during operation replay - Rework error handling code so OSDs don't crash in reaction to a crash of their peer OSDs - Fix several block layer problems related to the journal, some of which were leading to double allocations of the same block during journal replay - Fix monitors crashing during the removal of OSD keys from etcd - Fix data fsyncs being incorrectly disabled when only disable_journal_fsync was set - Always zero out the unused part of request/reply headers - Fix some theoretically possible read/write ordering issues - Don't try to "recover" misplaced objects if it would make them degraded - Fix heartbeats sometimes preventing OSDs from establishing connections	2021-04-08 01:18:46 +03:00
Vitaliy Filippov	8d87e32175	Fix msgr_op.h includes	2021-04-08 01:18:46 +03:00
Vitaliy Filippov	b0b2e7df3c	Fix use-after-free in keepalive_timer and rework stop_client() The bug reproduced if fio was temporarily stopped with SIGSTOP during a write test and then resumed after 10 seconds. In this case "pings" failed for all clients and the fio process crashed with 'use-after-free' in keepalive_timer. This happened because keepalive_timer called stop_client while holding a live iterator to the map.	2021-04-07 11:06:31 +03:00
Vitaliy Filippov	97efb9e299	Do not crash on PG re-peering events when operations are in progress	2021-04-07 11:06:31 +03:00
Vitaliy Filippov	f6d705383a	Fix client connection recovery bugs, add dirty_ops limit	2021-04-07 11:06:31 +03:00
Vitaliy Filippov	68567c0e1f	Fix messenger possibly trying to connect to the same OSD twice	2021-04-07 01:30:38 +03:00
Vitaliy Filippov	04b00003e9	Log ping failures	2021-04-07 01:30:38 +03:00
Vitaliy Filippov	307c1731c1	Forget all dirty_entries before stable big_write or delete during initialisation This fixes a 'double_alloc' assertion in the following case: - big_write object #1 v1 to block #100 - big_write object #1 v2 to block #101 - big_write object #2 v1 to block #100	2021-04-07 01:30:38 +03:00
Vitaliy Filippov	75a6a556b5	Shuffle PGs for better data device utilisation	2021-04-07 01:30:38 +03:00
Vitaliy Filippov	a48e2bbf18	Fix write replay ordering when immediate_commit != all Previous implementation didn't respect write ordering and could lead to corrupted data when restarting writes after an OSD outage Also rework cluster_client queueing logic and add tests for it to verify the correct behaviour	2021-04-03 14:51:52 +03:00
Vitaliy Filippov	688821665a	Remove stoull_full() from etcd_state_client.cpp	2021-04-03 14:36:04 +03:00
Vitaliy Filippov	3e162d95a0	Remove http_client.h include from etcd_state_client.h	2021-04-03 14:36:04 +03:00
Vitaliy Filippov	829381b335	Extract some definitions to msgr_op.{cpp,h}	2021-04-03 14:36:04 +03:00
Vitaliy Filippov	54f2353f24	Use bitmap granularity for alignment checks	2021-04-03 14:36:04 +03:00
Vitaliy Filippov	e47f6fba60	Remove cluster_client_t::stop()	2021-04-03 14:35:42 +03:00
Vitaliy Filippov	883bf84a16	Fix build	2021-04-03 01:47:15 +03:00
Vitaliy Filippov	52097c4856	Stop flushing when less than min_flusher_count operations are available (unless a trim is forced)	2021-04-03 00:53:28 +03:00
Vitaliy Filippov	e1355cbc74	Report failed operation name in cluster_client	2021-04-03 00:53:28 +03:00
Vitaliy Filippov	8f8b90be7a	Add min_flusher_count configuration	2021-04-03 00:53:28 +03:00
Vitaliy Filippov	ad9f619370	Skip double allocs when reading journal	2021-04-03 00:53:28 +03:00
Vitaliy Filippov	f4769ba7c7	Collapse create+delete journal entry pairs if they're already flushed Old journal replay mechanism could lead to a double allocation of the same block and a "Fatal error: tried to overwrite non-zero metadata entry"	2021-04-03 00:53:28 +03:00
Vitaliy Filippov	843b7052d2	Add an assertion when clearing deleted metadata entries, add debug details when freeing blocks	2021-04-03 00:53:28 +03:00
Vitaliy Filippov	df99e232ee	Deduplicate osd_sets in pg history + raise request size limit for etcd	2021-04-03 00:53:28 +03:00
Vitaliy Filippov	3a40fa4127	Fix monitor errors in case of OSD removal	2021-03-27 01:15:18 +03:00
Vitaliy Filippov	4095bcc558	Do not ignore object deletion journal entries when they are preceded by a big write	2021-03-25 11:00:10 +03:00
Vitaliy Filippov	564d64e271	Add some details for debug prints	2021-03-25 11:00:10 +03:00
Vitaliy Filippov	cf54741c95	Followup to `05db1308aa` Don't do anything with the object state after errors because it's freed by PG re-peer in this case	2021-03-25 11:00:10 +03:00
Vitaliy Filippov	18a5fafa2a	Fix rollback	2021-03-25 02:41:58 +03:00
Vitaliy Filippov	06f4978085	Fix fsync check in blockstore_flush (data fsyncs were disabled instead of journal fsyncs)	2021-03-25 02:41:58 +03:00
Vitaliy Filippov	7ebf1588c5	Check for immediate_commit==small in the OSD code	2021-03-25 02:41:58 +03:00
Vitaliy Filippov	b0ad1e1e6d	Remember writes as "unsynced" only after completing them Previously BS_OP_SYNC could take unfinished writes and add them into the journal before they were actually completed. This was leading to crashes with the message "BUG: Unexpected dirty_entry 2000000000001:9f2a0000 v3 unstable state during flush: 338"	2021-03-25 02:41:58 +03:00
Vitaliy Filippov	0949f08407	Extract osd_primary write and sync code into separate files	2021-03-24 14:20:56 +03:00
Vitaliy Filippov	04a1f18fa5	Assign .req as a whole to always zero out the remaining part Also clear .reply before processing the operation	2021-03-24 14:20:56 +03:00
Vitaliy Filippov	cf9a641d66	Skip disconnected OSDs during sync	2021-03-24 14:20:56 +03:00
Vitaliy Filippov	05db1308aa	Fix two potential read/write ordering problems (even though not yet seen in tests) - Write operations could be 'stabilized' and previous versions could be purged from OSDs before the removal of version_override, so subsequent reads could potentially hit a different version in EC pools - An object was marked clean after completing the delete during recovery, so reads could in theory hit a deleted version and return nothing	2021-03-24 14:20:56 +03:00
Vitaliy Filippov	98b54ca948	Don't try to "recover" misplaced objects if it would make them degraded	2021-03-21 01:37:23 +03:00
Vitaliy Filippov	23225c5e62	Do not run ping on clients that are not yet connected	2021-03-21 01:37:23 +03:00
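
The 'double_alloc' case listed in commit `307c1731c1` can be illustrated with a toy replay model. This is only a sketch under stated assumptions, not Vitastor's blockstore code: `big_write_t`, `replay_journal` and `example_journal` are hypothetical names, and the model assumes that replay marks a block as used for every big write it sees. With `forget_older` disabled, the third entry hits block #100 while it is still held by the obsolete v1 of object #1.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Toy journal entry (hypothetical): a "big write" of one object version to one data block.
struct big_write_t
{
    uint64_t object;
    uint64_t version;
    uint64_t block;
};

// Replays big writes in journal order and returns how many double allocations occur.
// With forget_older == true, replaying a newer write of an object first releases
// the blocks held by its previously replayed versions.
int replay_journal(const std::vector<big_write_t> & journal, bool forget_older)
{
    std::set<uint64_t> allocated;                        // blocks currently marked as used
    std::multimap<uint64_t, uint64_t> blocks_by_object;  // object -> blocks of replayed versions
    int double_allocs = 0;
    for (const auto & e: journal)
    {
        if (forget_older)
        {
            auto range = blocks_by_object.equal_range(e.object);
            for (auto it = range.first; it != range.second; it++)
                allocated.erase(it->second);
            blocks_by_object.erase(range.first, range.second);
        }
        // insert() reports false in .second if the block is already marked as used
        if (!allocated.insert(e.block).second)
            double_allocs++;
        blocks_by_object.insert({ e.object, e.block });
    }
    return double_allocs;
}

// The sequence from the commit message, in journal order.
std::vector<big_write_t> example_journal()
{
    return {
        { 1, 1, 100 },  // big_write object #1 v1 to block #100
        { 1, 2, 101 },  // big_write object #1 v2 to block #101
        { 2, 1, 100 },  // big_write object #2 v1 to block #100
    };
}
```

Calling `replay_journal(example_journal(), false)` counts one double allocation, while passing `true` counts none, which is the effect the commit message describes.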
				`@ -0,0 +1 @@`
				`g++ -D__MOCK__ -fsanitize=address -g -Wno-pointer-arith pg_states.cpp osd_ops.cpp test_cluster_client.cpp cluster_client.cpp msgr_op.cpp msgr_stop.cpp mock/messenger.cpp etcd_state_client.cpp timerfd_manager.cpp ../json11/json11.cpp -I mock -I . -I ..; ./a.out`
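
The test command above builds with `-fsanitize=address`, which is the kind of tooling that reports bugs such as the use-after-free fixed in commit `b0b2e7df3c`. The sketch below shows one common way to avoid erasing from a `std::map` while an iterator into it is still live. It is an illustration only: `client_t`, `registry_t` and `stop_failed_clients` are hypothetical names, and the actual rework of `stop_client()` may be done differently.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-in for a connected client; not Vitastor's real structure.
struct client_t
{
    int fd;
    bool ping_failed;
};

// Hypothetical client registry keyed by file descriptor.
struct registry_t
{
    std::map<int, client_t> clients;

    // Erases the client, which invalidates any iterator pointing at its map entry.
    void stop_client(int fd)
    {
        assert(clients.count(fd) == 1);
        clients.erase(fd);
    }

    // Safe variant: collect the keys first, then stop clients once no iterator is live.
    // Returns the number of clients that were stopped.
    int stop_failed_clients()
    {
        std::vector<int> to_stop;
        for (const auto & kv: clients)
        {
            if (kv.second.ping_failed)
                to_stop.push_back(kv.first);
        }
        for (int fd: to_stop)
            stop_client(fd);
        return (int)to_stop.size();
    }
};
```

Calling `stop_client()` directly inside the first loop would erase the element the loop is currently positioned on, so advancing the iterator afterwards would touch freed memory.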