Add Hugo-based (https://gohugo.io) documentation

Remove getrandom() usage
Fix some warnings
2022-05-11 11:28:32 +03:00 · 2022-05-11 11:25:20 +03:00 · 2022-05-10 12:42:58 +03:00 · 2022-05-10 12:26:47 +03:00 · 2022-05-10 10:43:17 +03:00 · 2022-05-09 22:37:23 +03:00
200 changed files with 18657 additions and 2589 deletions
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -2,6 +2,6 @@ cmake_minimum_required(VERSION 2.8)

 project(vitastor)

-set(VERSION "0.6.10")
+set(VERSION "0.6.17")

 add_subdirectory(src)
--- a/README-ru.md
+++ b/README-ru.md
@@ -52,10 +52,10 @@ Vitastor на данный момент находится в статусе п
 - Слияние снапшотов (vitastor-cli {snap-rm,flatten,merge})
 - Консольный интерфейс для управления образами (vitastor-cli {ls,create,modify})
 - Плагин для Proxmox
+- Упрощённая NFS-прокси для эмуляции файлового доступа к образам (подходит для VMWare)

 ## Планы развития

- Поддержка удаления снапшотов (слияния слоёв)
 - Более корректные скрипты разметки дисков и автоматического запуска OSD
 - Другие инструменты администрирования
 - Плагины для OpenNebula и других облачных систем
@@ -407,6 +407,7 @@ Vitastor с однопоточной NBD прокси на том же стен
 - На хостах мониторов:
  - Пропишите нужные вам значения в файле `/usr/lib/vitastor/mon/make-units.sh`
  - Создайте юниты systemd для etcd и мониторов: `/usr/lib/vitastor/mon/make-units.sh`
+- Запустите etcd и мониторы: `systemctl start etcd vitastor-mon`
 - Пропишите etcd_address и osd_network в `/etc/vitastor/vitastor.conf`. Например:
  ```
  {
@@ -414,7 +415,14 @@ Vitastor с однопоточной NBD прокси на том же стен
    "osd_network": "10.200.1.0/24"
  }
  ```
- Создайте юниты systemd для OSD: `/usr/lib/vitastor/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
+- Инициализируйте OSD:
+  - SSD: `/usr/lib/vitastor/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
+  - Гибридные, HDD+SSD: `/usr/lib/vitastor/mon/make-osd-hybrid.js /dev/sda /dev/sdb ...` - передайте
+    все ваши SSD и HDD скрипту в командной строке подряд, скрипт автоматически выделит разделы под
+    журналы на SSD и данные на HDD. Скрипт пропускает HDD, на которых уже есть разделы
+    или вообще какие-то данные, поэтому если диски непустые, сначала очистите их с помощью
+    `wipefs -a`. SSD с таблицей разделов не пропускаются, но так как скрипт создаёт новые разделы
+    для журналов, на SSD должно быть доступно свободное нераспределённое место.
 - Вы можете менять параметры OSD в юнитах systemd или в `vitastor.conf`. Смысл некоторых параметров:
  - `disable_data_fsync 1` - отключает fsync, используется с SSD с конденсаторами.
  - `immediate_commit all` - используется с SSD с конденсаторами.
@@ -430,7 +438,6 @@ Vitastor с однопоточной NBD прокси на том же стен
    диски, используемые на одном из тестовых стендов - Intel D3-S4510 - очень сильно не любят такую
    перезапись, и для них была добавлена эта опция. Когда данный режим включён, также нужно поднимать
    значение `journal_sector_buffer_count`, так как иначе Vitastor не хватит буферов для записи в журнал.
- Запустите все etcd: `systemctl start etcd`
 - Создайте глобальную конфигурацию в etcd: `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
  (если все ваши диски - серверные с конденсаторами).
 - Создайте пулы: `etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.
@@ -523,9 +530,48 @@ vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg
 Для обращения по номеру инода, аналогично другим командам, можно использовать опции
 `--pool <POOL> --inode <INODE> --size <SIZE>` вместо `--image testimg`.

+### NFS
+
+В Vitastor реализована упрощённая NFS 3.0 прокси для эмуляции файлового доступа к образам.
+Это не полноценная файловая система, т.к. метаданные всех файлов (образов) сохраняются
+в etcd и всё время хранятся в оперативной памяти - то есть, положить туда много файлов
+не получится.
+
+Однако в качестве способа доступа к образам виртуальных машин NFS прокси прекрасно подходит
+и позволяет подключить Vitastor, например, к VMWare.
+
+При этом, если вы используете режим immediate_commit=all (для SSD с конденсаторами или HDD
+с отключённым кэшем), то NFS-сервер не имеет состояния и вы можете свободно поднять
+его в нескольких экземплярах и использовать поверх них сетевой балансировщик нагрузки или
+схему с отказоустойчивостью.
+
+Использование vitastor-nfs:
+
+```
+vitastor-nfs [--etcd_address ADDR] [ДРУГИЕ ОПЦИИ]
+
+--subdir <DIR>    экспортировать "поддиректорию" - образы с префиксом имени <DIR>/ (по умолчанию пусто - экспортировать все образы)
+--portmap 0       отключить сервис portmap/rpcbind на порту 111 (по умолчанию включён и требует root привилегий)
+--bind <IP>       принимать соединения по адресу <IP> (по умолчанию 0.0.0.0 - на всех)
+--nfspath <PATH>  установить путь NFS-экспорта в <PATH> (по умолчанию /)
+--port <PORT>     использовать порт <PORT> для NFS-сервисов (по умолчанию 2049)
+--pool <POOL>     использовать пул <POOL> для новых образов (обязательно, если пул в кластере не один)
+--foreground 1    не уходить в фон после запуска
+```
+
+Пример монтирования Vitastor через NFS:
+
+```
+vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
+```
+
+```
+mount localhost:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+```
+
 ### Kubernetes

-У Vitastor есть CSI-плагин для Kubernetes, поддерживающий RWO-тома.
+У Vitastor есть CSI-плагин для Kubernetes, поддерживающий RWO, а также блочные RWX, тома.

 Для установки возьмите манифесты из директории [csi/deploy/](csi/deploy/), поместите
 вашу конфигурацию подключения к Vitastor в [csi/deploy/001-csi-config-map.yaml](001-csi-config-map.yaml),
--- a/README.md
+++ b/README.md
@@ -46,13 +46,13 @@ breaking changes in the future. However, the following is implemented:
 - Snapshot merge tool (vitastor-cli {snap-rm,flatten,merge})
 - Image management CLI (vitastor-cli {ls,create,modify})
 - Proxmox storage plugin
+- Simplified NFS proxy for file-based image access emulation (suitable for VMWare)

 ## Roadmap

- Snapshot deletion (layer merge) support
 - Better OSD creation and auto-start tools
 - Other administrative tools
- Plugins for OpenNebula, Proxmox and other cloud systems
+- Plugins for OpenNebula and other cloud systems
 - iSCSI proxy
 - Faster failover
 - Scrubbing without checksums (verification of replicas)
@@ -360,6 +360,7 @@ and calculate disk offsets almost by hand. This will be fixed in near future.
 - On the monitor hosts:
  - Edit variables at the top of `/usr/lib/vitastor/mon/make-units.sh` to desired values.
  - Create systemd units for the monitor and etcd: `/usr/lib/vitastor/mon/make-units.sh`
+- Start etcd and monitors: `systemctl start etcd vitastor-mon`
 - Put etcd_address and osd_network into `/etc/vitastor/vitastor.conf`. Example:
  ```
  {
@@ -367,7 +368,13 @@ and calculate disk offsets almost by hand. This will be fixed in near future.
    "osd_network": "10.200.1.0/24"
  }
  ```
- Create systemd units for your OSDs: `/usr/lib/vitastor/mon/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
+- Initialize OSDs:
+  - Simplest, SSD-only: `/usr/lib/vitastor/mon/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
+  - Hybrid, HDD+SSD: `/usr/lib/vitastor/mon/make-osd-hybrid.js /dev/sda /dev/sdb ...` - pass all your
+    devices (HDD and SSD) to this script - it will partition disks and initialize journals on its own.
+    This script skips HDDs which are already partitioned so if you want to use non-empty disks for
+    Vitastor you should first wipe them with `wipefs -a`. SSDs with GPT partition table are not skipped,
+    but some free unpartitioned space must be available because the script creates new partitions for journals.
 - You can change OSD configuration in units or in `vitastor.conf`. Notable configuration variables:
  - `disable_data_fsync 1` - only safe with server-grade drives with capacitors.
  - `immediate_commit all` - use this if all your drives are server-grade.
@@ -472,9 +479,49 @@ It will output the device name, like /dev/nbd0 which you can then format and mou

 Again, you can use `--pool <POOL> --inode <INODE> --size <SIZE>` instead of `--image <IMAGE>` if you want.

+### NFS
+
+Vitastor has a simplified NFS 3.0 proxy for file-based image access emulation. It's not
+suitable as a full-featured file system, at least because all file/image metadata is stored
+in etcd and kept in memory all the time - thus you can't put a lot of files in it.
+
+However, the NFS proxy is perfectly fine as a way to provide VM image access and makes it
+possible to plug Vitastor into, for example, VMWare. It's important to note that for VMWare it's a much
+better access method than iSCSI, because with iSCSI we'd have to put all VM images into one
+Vitastor image exported as a LUN to VMWare and formatted with VMFS. VMWare doesn't use VMFS
+over NFS.
+
+NFS proxy is stateless if you use immediate_commit=all mode (for SSD with capacitors or
+HDDs with disabled cache), so you can run multiple NFS proxies and use a network load
+balancer or any failover method you want to in that case.
+
+vitastor-nfs usage:
+
+```
+vitastor-nfs [--etcd_address ADDR] [OTHER OPTIONS]
+
+--subdir <DIR>    export images prefixed <DIR>/ (default empty - export all images)
+--portmap 0       do not listen on port 111 (portmap/rpcbind, requires root)
+--bind <IP>       bind service to <IP> address (default 0.0.0.0)
+--nfspath <PATH>  set NFS export path to <PATH> (default is /)
+--port <PORT>     use port <PORT> for NFS services (default is 2049)
+--pool <POOL>     use <POOL> as default pool for new files (images)
+--foreground 1    stay in foreground, do not daemonize
+```
+
+Example start and mount commands:
+
+```
+vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
+```
+
+```
+mount localhost:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+```
+
 ### Kubernetes

-Vitastor has a CSI plugin for Kubernetes which supports RWO volumes.
+Vitastor has a CSI plugin for Kubernetes which supports RWO (and block RWX) volumes.

 To deploy it, take manifests from [csi/deploy/](csi/deploy/) directory, put your
 Vitastor configuration in [csi/deploy/001-csi-config-map.yaml](001-csi-config-map.yaml),
--- a/2
+++ b/2
--- a/csi/Makefile
+++ b/csi/Makefile
@@ -1,4 +1,4 @@
-VERSION ?= v0.6.10
+VERSION ?= v0.6.17

 all: build push

--- a/csi/deploy/004-csi-nodeplugin.yaml
+++ b/csi/deploy/004-csi-nodeplugin.yaml
@@ -49,7 +49,7 @@ spec:
            capabilities:
              add: ["SYS_ADMIN"]
            allowPrivilegeEscalation: true
-          image: vitalif/vitastor-csi:v0.6.10
+          image: vitalif/vitastor-csi:v0.6.17
          args:
            - "--node=$(NODE_ID)"
            - "--endpoint=$(CSI_ENDPOINT)"
--- a/csi/deploy/007-csi-provisioner.yaml
+++ b/csi/deploy/007-csi-provisioner.yaml
@@ -116,7 +116,7 @@ spec:
            privileged: true
            capabilities:
              add: ["SYS_ADMIN"]
-          image: vitalif/vitastor-csi:v0.6.10
+          image: vitalif/vitastor-csi:v0.6.17
          args:
            - "--node=$(NODE_ID)"
            - "--endpoint=$(CSI_ENDPOINT)"
--- a/csi/deploy/example-pvc-block.yaml
+++ b/csi/deploy/example-pvc-block.yaml
@@ -0,0 +1,13 @@
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: test-vitastor-pvc-block
+spec:
+  storageClassName: vitastor
+  volumeMode: Block
+  accessModes:
+    - ReadWriteMany
+  resources:
+    requests:
+      storage: 10Gi
--- a/csi/deploy/example-test-pod-block.yaml
+++ b/csi/deploy/example-test-pod-block.yaml
@@ -0,0 +1,17 @@
+apiVersion: v1
+kind: Pod
+metadata:
+  name: vitastor-test-block-pvc
+  namespace: default
+spec:
+  containers:
+  - name: vitastor-test-block-pvc
+    image: nginx
+    volumeDevices:
+      - name: data
+        devicePath: /dev/xvda
+  volumes:
+  - name: data
+    persistentVolumeClaim:
+      claimName: test-vitastor-pvc-block
+      readOnly: false
--- a/csi/deploy/example-test-pod.yaml
+++ b/csi/deploy/example-test-pod.yaml
@@ -0,0 +1,17 @@
+apiVersion: v1
+kind: Pod
+metadata:
+  name: vitastor-test-nginx
+  namespace: default
+spec:
+  containers:
+   - name: vitastor-test-nginx
+     image: nginx
+     volumeMounts:
+       - mountPath: /usr/share/nginx/html/s3
+         name: data
+  volumes:
+   - name: data
+     persistentVolumeClaim:
+       claimName: test-vitastor-pvc
+       readOnly: false
--- a/csi/src/config.go
+++ b/csi/src/config.go
@@ -5,7 +5,7 @@ package vitastor

 const (
    vitastorCSIDriverName    = "csi.vitastor.io"
-    vitastorCSIDriverVersion = "0.6.10"
+    vitastorCSIDriverVersion = "0.6.17"
 )

 // Config struct fills the parameters of request or user input
--- a/csi/src/nodeserver.go
+++ b/csi/src/nodeserver.go
@@ -67,29 +67,44 @@ func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis
    klog.Infof("received node publish volume request %+v", protosanitizer.StripSecrets(req))

    targetPath := req.GetTargetPath()
+    isBlock := req.GetVolumeCapability().GetBlock() != nil

    // Check that it's not already mounted
-    free, error := mount.IsNotMountPoint(ns.mounter, targetPath)
+    _, error := mount.IsNotMountPoint(ns.mounter, targetPath)
    if (error != nil)
    {
        if (os.IsNotExist(error))
        {
-            error := os.MkdirAll(targetPath, 0777)
-            if (error != nil)
+            if (isBlock)
            {
-                return nil, status.Error(codes.Internal, error.Error())
+                pathFile, err := os.OpenFile(targetPath, os.O_CREATE|os.O_RDWR, 0o600)
+                if (err != nil)
+                {
+                    klog.Errorf("failed to create block device mount target %s with error: %v", targetPath, err)
+                    return nil, status.Error(codes.Internal, err.Error())
+                }
+                err = pathFile.Close()
+                if (err != nil)
+                {
+                    klog.Errorf("failed to close %s with error: %v", targetPath, err)
+                    return nil, status.Error(codes.Internal, err.Error())
+                }
+            }
+            else
+            {
+                err := os.MkdirAll(targetPath, 0777)
+                if (err != nil)
+                {
+                    klog.Errorf("failed to create fs mount target %s with error: %v", targetPath, err)
+                    return nil, status.Error(codes.Internal, err.Error())
+                }
            }
-            free = true
        }
        else
        {
            return nil, status.Error(codes.Internal, error.Error())
        }
    }
-    if (!free)
-    {
-        return &csi.NodePublishVolumeResponse{}, nil
-    }

    ctxVars := make(map[string]string)
    err := json.Unmarshal([]byte(req.VolumeId), &ctxVars)
@@ -149,7 +164,6 @@ func (ns *NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis

    // Format the device (ext4 or xfs)
    fsType := req.GetVolumeCapability().GetMount().GetFsType()
-    isBlock := req.GetVolumeCapability().GetBlock() != nil
    opt := req.GetVolumeCapability().GetMount().GetMountFlags()
    opt = append(opt, "_netdev")
    if ((req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY ||
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,4 +1,4 @@
-vitastor (0.6.10-1) unstable; urgency=medium
+vitastor (0.6.17-1) unstable; urgency=medium

  * RDMA support
  * Bugfixes
--- a/debian/vitastor-client.install
+++ b/debian/vitastor-client.install
@@ -2,5 +2,6 @@ usr/bin/vita
 usr/bin/vitastor-cli
 usr/bin/vitastor-rm
 usr/bin/vitastor-nbd
+usr/bin/vitastor-nfs
 usr/lib/*/libvitastor*.so*
 mon/make-osd.sh /usr/lib/vitastor
--- a/debian/vitastor.Dockerfile
+++ b/debian/vitastor.Dockerfile
@@ -33,8 +33,8 @@ RUN set -e -x; \
    mkdir -p /root/packages/vitastor-$REL; \
    rm -rf /root/packages/vitastor-$REL/*; \
    cd /root/packages/vitastor-$REL; \
-    cp -r /root/vitastor vitastor-0.6.10; \
-    cd vitastor-0.6.10; \
+    cp -r /root/vitastor vitastor-0.6.17; \
+    cd vitastor-0.6.17; \
    ln -s /root/fio-build/fio-*/ ./fio; \
    FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
    ls /usr/include/linux/raw.h || cp ./debian/raw.h /usr/include/linux/raw.h; \
@@ -47,8 +47,8 @@ RUN set -e -x; \
    rm -rf a b; \
    echo "dep:fio=$FIO" > debian/fio_version; \
    cd /root/packages/vitastor-$REL; \
-    tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.6.10.orig.tar.xz vitastor-0.6.10; \
-    cd vitastor-0.6.10; \
+    tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_0.6.17.orig.tar.xz vitastor-0.6.17; \
+    cd vitastor-0.6.17; \
    V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
    DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
    DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
--- a/docs/gen-docs.js
+++ b/docs/gen-docs.js
@@ -0,0 +1,55 @@
+#!/usr/bin/nodejs
+
+const fs = require('fs');
+const yaml = require('yaml');
+
+const L = {
+    en: {},
+    ru: {
+        Type: 'Тип',
+        Default: 'Значение по умолчанию',
+        Minimum: 'Минимальное значение',
+    },
+};
+const types = {
+    en: {
+        string: 'string',
+        bool: 'boolean',
+        int: 'integer',
+        sec: 'seconds',
+        ms: 'milliseconds',
+        us: 'microseconds',
+    },
+    ru: {
+        string: 'строка',
+        bool: 'булево (да/нет)',
+        int: 'целое число',
+        sec: 'секунды',
+        ms: 'миллисекунды',
+        us: 'микросекунды',
+    },
+};
+const params_files = fs.readdirSync(__dirname+'/params')
+    .filter(f => f.substr(-4) == '.yml')
+    .map(f => f.substr(0, f.length-4));
+
+for (const file of params_files)
+{
+    const cfg = yaml.parse(fs.readFileSync(__dirname+'/params/'+file+'.yml', { encoding: 'utf-8' }));
+    for (const lang in types)
+    {
+        let out = '\n\n{{< toc >}}';
+        for (const c of cfg)
+        {
+            out += `\n\n## ${c.name}\n\n`;
+            out += `- ${L[lang]['Type'] || 'Type'}: ${c["type_"+lang] || types[lang][c.type] || c.type}\n`;
+            if (c.default !== undefined)
+                out += `- ${L[lang]['Default'] || 'Default'}: ${c.default}\n`;
+            if (c.min !== undefined)
+                out += `- ${L[lang]['Minimum'] || 'Minimum'}: ${c.min}\n`;
+            out += `\n`+(c["info_"+lang] || c["info"]).replace(/\s+$/, '');
+        }
+        const head = fs.readFileSync(__dirname+'/params/head/'+file+'.'+lang+'.md', { encoding: 'utf-8' });
+        fs.writeFileSync(__dirname+'/hugo/content/config/'+file+'.'+lang+'.md', head.replace(/\s+$/, '')+out+"\n");
+    }
+}
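To make the generator's output format concrete, the following sketch reproduces the per-parameter formatting from the loop above for a single entry. The parameter `example_timeout` and its descriptions are hypothetical, and the lookup tables are trimmed to what the example needs; the actual script reads its entries from the `params/*.yml` files.

```javascript
// Sketch only: mirrors the per-parameter formatting of docs/gen-docs.js.
// The tables below are trimmed copies of the ones in the script.
const labels = {
    en: {},
    ru: { Type: 'Тип', Default: 'Значение по умолчанию', Minimum: 'Минимальное значение' },
};
const typeNames = {
    en: { sec: 'seconds' },
    ru: { sec: 'секунды' },
};

function renderParam(c, lang)
{
    let out = `\n\n## ${c.name}\n\n`;
    out += `- ${labels[lang]['Type'] || 'Type'}: ${c['type_'+lang] || typeNames[lang][c.type] || c.type}\n`;
    if (c.default !== undefined)
        out += `- ${labels[lang]['Default'] || 'Default'}: ${c.default}\n`;
    if (c.min !== undefined)
        out += `- ${labels[lang]['Minimum'] || 'Minimum'}: ${c.min}\n`;
    // Prefer the language-specific description, then strip trailing whitespace
    out += `\n`+(c['info_'+lang] || c['info']).replace(/\s+$/, '');
    return out;
}

// Hypothetical entry, shaped like one item of a params/*.yml file
const exampleParam = {
    name: 'example_timeout',
    type: 'sec',
    default: 5,
    min: 1,
    info: 'Hypothetical parameter used only for this illustration.\n',
    info_ru: 'Гипотетический параметр, используемый только для примера.\n',
};

console.log(renderParam(exampleParam, 'en'));
console.log(renderParam(exampleParam, 'ru'));
```

Note that an entry with neither `info` nor `info_<lang>` would make the `.replace()` call throw, so every parameter is assumed to carry at least a language-neutral `info` field.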
--- a/docs/hugo/archetypes/default.md
+++ b/docs/hugo/archetypes/default.md
@@ -0,0 +1,6 @@
+---
+title: "{{ replace .Name "-" " " | title }}"
+date: {{ .Date }}
+draft: true
+---
+
--- a/docs/hugo/config.yaml
+++ b/docs/hugo/config.yaml
@@ -0,0 +1,35 @@
+baseURL: http://localhost
+title: Vitastor
+theme: hugo-geekdoc
+#languageCode: en-us
+
+pluralizeListTitles: false
+
+# Geekdoc required configuration
+pygmentsUseClasses: true
+pygmentsCodeFences: true
+disablePathToLower: true
+
+# Required if you want to render robots.txt template
+enableRobotsTXT: true
+
+defaultContentLanguage: en
+languages:
+  en:
+    weight: 1
+    languageName: English
+  ru:
+    weight: 1
+    languageName: Русский
+
+markup:
+  goldmark:
+    renderer:
+      # Needed for mermaid shortcode
+      unsafe: true
+  tableOfContents:
+    startLevel: 1
+    endLevel: 9
+
+taxonomies:
+  tag: tags
--- a/docs/hugo/content/_index.md
+++ b/docs/hugo/content/_index.md
@@ -0,0 +1,6 @@
+## The Idea
+
+Vitastor is a small, simple and fast clustered block storage (storage for VM drives),
+architecturally similar to Ceph which means strong consistency, primary-replication,
+symmetric clustering and automatic data distribution over any number of drives
+of any size with configurable redundancy (replication or erasure codes/XOR).
--- a/docs/hugo/content/config/_index.en.md
+++ b/docs/hugo/content/config/_index.en.md
@@ -0,0 +1,61 @@
+---
+title: Parameter Reference
+weight: 1
+---
+
+Vitastor configuration consists of:
+- Configuration parameters (key-value), described here
+- [Pool configuration]({{< ref "config/pool" >}})
+- OSD placement tree configuration
+- Inode configuration i.e. image metadata like name, size and parent reference
+
+Configuration parameters can be set in 3 places:
+- Configuration file (`/etc/vitastor/vitastor.conf` or other path)
+- etcd key `/vitastor/config/global`. Most variables can be set there, but etcd
+  connection parameters should obviously be set in the configuration file.
+- Command line of Vitastor components: OSD, mon, fio and QEMU options,
+  OpenStack/Proxmox/etc configuration. The latter doesn't allow setting all
+  variables directly, but it lets you override the configuration file path
+  and set everything you need inside that file.
+
+In the future, additional configuration methods may be added:
+- OSD superblock which will, by design, contain parameters related to the disk
+  layout and to one specific OSD.
+- OSD-specific keys in etcd like `/vitastor/config/osd/<number>`.
+
+## Common Parameters
+
+These are the most common parameters which apply to all components of Vitastor.
+
+[See the list]({{< ref "common" >}})
+
+## Cluster-Wide Disk Layout Parameters
+
+These parameters apply to clients and OSDs and can't be changed after OSD
+initialization.
+
+[See the list]({{< ref "layout-cluster" >}})
+
+## OSD Disk Layout Parameters
+
+These parameters apply to OSDs and can't be changed after OSD initialization.
+
+[See the list]({{< ref "layout-osd" >}})
+
+## Network Protocol Parameters
+
+These parameters apply to clients and OSDs and can be changed with a restart.
+
+[See the list]({{< ref "network" >}})
+
+## Runtime OSD Parameters
+
+These parameters apply to OSDs and can be changed with an OSD restart.
+
+[See the list]({{< ref "osd" >}})
+
+## Monitor Parameters
+
+These parameters only apply to Monitors.
+
+[See the list]({{< ref "monitor" >}})
--- a/docs/hugo/content/config/_index.ru.md
+++ b/docs/hugo/content/config/_index.ru.md
@@ -0,0 +1,63 @@
+---
+title: Перечень настроек
+weight: 1
+---
+
+Конфигурация Vitastor состоит из:
+- Параметров (ключ-значение), описанных на данной странице
+- Настроек пулов
+- Настроек дерева OSD
+- Настроек инодов, т.е. метаданных образов, таких, как имя, размер и ссылки на
+  родительский образ
+
+Параметры конфигурации могут задаваться в 3 местах:
+- Файле конфигурации (`/etc/vitastor/vitastor.conf` или по другому пути)
+- Ключе в etcd `/vitastor/config/global`. Большая часть параметров может
+  задаваться там, кроме, естественно, самих параметров соединения с etcd,
+  которые должны задаваться в файле конфигурации
+- В командной строке компонентов Vitastor: OSD, монитора, опциях fio и QEMU,
+  настроек OpenStack, Proxmox и т.п. Последние, как правило, не включают полный
+  набор параметров напрямую, но разрешают определить путь к файлу конфигурации
+  и задать любые параметры в нём.
+
+В будущем также могут быть добавлены другие способы конфигурации:
+- Суперблок OSD, в котором будут храниться параметры OSD, связанные с дисковым
+  форматом и с этим конкретным OSD.
+- OSD-специфичные ключи в etcd типа `/vitastor/config/osd/<номер>`.
+
+## Общие параметры
+
+Это наиболее общие параметры, используемые всеми компонентами Vitastor.
+
+[Посмотреть список]({{< ref "common" >}})
+
+## Дисковые параметры уровня кластера
+
+Эти параметры используются клиентами и OSD и не могут быть изменены после
+инициализации OSD.
+
+[Посмотреть список]({{< ref "layout-cluster" >}})
+
+## Дисковые параметры OSD
+
+Эти параметры используются OSD и не могут быть изменены после инициализации OSD.
+
+[Посмотреть список]({{< ref "layout-osd" >}})
+
+## Параметры сетевого протокола
+
+Эти параметры используются клиентами и OSD и могут быть изменены с перезапуском.
+
+[Посмотреть список]({{< ref "network" >}})
+
+## Изменяемые параметры OSD
+
+Эти параметры используются OSD и могут быть изменены с перезапуском.
+
+[Посмотреть список]({{< ref "osd" >}})
+
+## Параметры мониторов
+
+Данные параметры используются только мониторами Vitastor.
+
+[Посмотреть список]({{< ref "monitor" >}})
--- a/docs/hugo/content/config/pool.en.md
+++ b/docs/hugo/content/config/pool.en.md
@@ -0,0 +1,178 @@
+---
+title: Pool configuration
+weight: 100
+---
+
+Pool configuration is set in etcd key `/vitastor/config/pools` in the following
+JSON format:
+
+```
+{
+  "<Numeric ID>": {
+    "name": "<name>",
+    ...other parameters...
+  }
+}
+```
+
+{{< toc >}}
+
+# Parameters
+
+## name
+
+- Type: string
+- Required
+
+Pool name.
+
+## scheme
+
+- Type: string
+- Required
+- One of: "replicated", "xor" or "jerasure"
+
+Redundancy scheme used for data in this pool.
+
+## pg_size
+
+- Type: integer
+- Required
+
+Total number of disks for PGs of this pool - i.e., number of replicas for
+replicated pools and number of data plus parity disks for EC/XOR pools.
+
+## parity_chunks
+
+- Type: integer
+
+Number of parity chunks for EC/XOR pools. For such pools, data will be lost
+if you lose more than parity_chunks disks at once, so this parameter can be
+equally described as FTT (number of failures to tolerate).
+
+Required for EC/XOR pools, ignored for replicated pools.
+
+## pg_minsize
+
+- Type: integer
+- Required
+
+Minimum number of live disks required for PGs of this pool to remain active.
+That is, if it becomes impossible to place PG data on at least (pg_minsize)
+OSDs, the PG is deactivated for both reads and writes. This guarantees that a
+fresh write always goes to at least (pg_minsize) OSDs (disks).
+
+FIXME: pg_minsize behaviour may be changed in the future to only make PGs
+read-only instead of deactivating them.
+
+## pg_count
+
+- Type: integer
+- Required
+
+Number of PGs for this pool. The value should be big enough for the monitor /
+LP solver to be able to optimize data placement.
+
+"Enough" is usually around 64-128 PGs per OSD, i.e. you set pg_count for pool
+to (total OSD count * 100 / pg_size). You can round it to the closest power of 2,
+because it makes it easier to reduce or increase PG count later by dividing or
+multiplying it by 2.
+
+In Vitastor, PGs are ephemeral, so you can change pool PG count anytime just
+by overwriting pool configuration in etcd. Amount of the data affected by
+rebalance will be smaller if the new PG count is a multiple of the old PG count
+or vice versa.
+
+## failure_domain
+
+- Type: string
+- Default: host
+
+Failure domain specification. Must be "host" or "osd" or refer to one of the
+placement tree levels, defined in [placement_levels]({{< ref "config/monitor#placement_levels" >}}).
+
+Two replicas, or two parts in case of EC/XOR, of the same block of data are
+never put on OSDs in the same failure domain (for example, on the same host).
+So the failure domain specifies the unit whose failure you are protecting
+yourself against.
+
+## max_osd_combinations
+
+- Type: integer
+- Default: 10000
+
+Vitastor's data placement algorithm is based on an LP solver, and the OSD
+combinations fed to it are generated randomly. This parameter specifies the maximum
+number of combinations to generate when optimising PG placement.
+
+This parameter usually doesn't need to be changed.
+
+## pg_stripe_size
+
+- Type: integer
+- Default: 0
+
+Specifies the stripe size for this pool according to which images are split into
+different PGs. Stripe size can't be smaller than [block_size]({{< ref "config/layout-cluster#block_size" >}})
+multiplied by (pg_size - parity_chunks) for EC/XOR pools or by 1 for replicated pools;
+this minimum is also used as the default.
+
+This means the first `pg_stripe_size = (block_size * (pg_size-parity_chunks))` bytes
+of an image go to one PG, the next `pg_stripe_size` bytes go to another PG and so on.
+
+It usually doesn't need to be changed separately from the block size.
+
+## root_node
+
+- Type: string
+
+Specifies the root node of the OSD tree to restrict this pool's OSDs to.
+The referenced root node must exist in /vitastor/config/node_placement.
+
+## osd_tags
+
+- Type: string or array of strings
+
+Specifies OSD tags to restrict this pool to. If multiple tags are specified,
+only OSDs having all of these tags will be used for this pool.
+
+## primary_affinity_tags
+
+- Type: string or array of strings
+
+Specifies OSD tags that are preferred when choosing primary OSDs for this pool.
+Note that for EC/XOR pools Vitastor always prefers to put primary OSD on one
+of the OSDs containing a data chunk for a PG.
+
+# Examples
+
+## Replicated pool
+
+```
+{
+  "1": {
+    "name":"testpool",
+    "scheme":"replicated",
+    "pg_size":2,
+    "pg_minsize":1,
+    "pg_count":256,
+    "failure_domain":"host"
+  }
+}
+```
+
+## Erasure-coded pool
+
+```
+{
+  "2": {
+    "name":"ecpool",
+    "scheme":"jerasure",
+    "pg_size":3,
+    "parity_chunks":1,
+    "pg_minsize":2,
+    "pg_count":256,
+    "failure_domain":"host"
+  }
+}
+```
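The sizing rules of thumb in `pool.en.md` (about 100 PGs per OSD, rounded to a power of 2, and the minimum stripe size) can be worked through numerically. The sketch below is an illustration only: the function names are hypothetical, and the 131072-byte `block_size` is an assumed example value, not something stated in this diff.

```javascript
// Sketch only: applies the pg_count and pg_stripe_size rules described in pool.en.md.

// Round to the closest power of 2 (ties go to the larger value)
function closestPowerOfTwo(n)
{
    let lower = 1;
    while (lower * 2 <= n)
        lower *= 2;
    const upper = lower * 2;
    return (n - lower) < (upper - n) ? lower : upper;
}

// pg_count ~= total OSD count * 100 / pg_size, rounded to a power of 2
function suggestPgCount(osdCount, pgSize, pgsPerOsd = 100)
{
    return closestPowerOfTwo(osdCount * pgsPerOsd / pgSize);
}

// Minimum (and default) stripe size: block_size * (pg_size - parity_chunks)
// for EC/XOR pools, block_size * 1 for replicated pools
function minStripeSize(blockSize, pool)
{
    const dataChunks = pool.scheme === 'replicated' ? 1 : pool.pg_size - pool.parity_chunks;
    return blockSize * dataChunks;
}

// Index of the stripe that contains a given byte offset of an image;
// consecutive stripes go to different PGs
function stripeIndex(offset, stripeSize)
{
    return Math.floor(offset / stripeSize);
}

const blockSize = 131072; // assumed example value
const ecPool = { scheme: 'jerasure', pg_size: 3, parity_chunks: 1 };

console.log(suggestPgCount(12, 2));                                  // 600 rounds to 512
console.log(minStripeSize(blockSize, ecPool));                       // 262144
console.log(stripeIndex(1048576, minStripeSize(blockSize, ecPool))); // 4
```

With 12 OSDs and `pg_size` 2, the result of 512 PGs gives roughly 85 PG replicas per OSD, which falls inside the 64-128 range the text recommends.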
--- a/docs/hugo/content/installation/packages.md
+++ b/docs/hugo/content/installation/packages.md
@@ -0,0 +1,41 @@
+---
+title: Packages
+weight: 2
+---
+
+## Debian
+
+- Trust Vitastor package signing key:
+  `wget -q -O - https://vitastor.io/debian/pubkey | sudo apt-key add -`
+- Add Vitastor package repository to your /etc/apt/sources.list:
+  - Debian 11 (Bullseye/Sid): `deb https://vitastor.io/debian bullseye main`
+  - Debian 10 (Buster): `deb https://vitastor.io/debian buster main`
+- For Debian 10 (Buster) also enable backports repository:
+  `deb http://deb.debian.org/debian buster-backports main`
+- Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu`
+
+## CentOS
+
+- Add Vitastor package repository:
+  - CentOS 7: `yum install https://vitastor.io/rpms/centos/7/vitastor-release-1.0-1.el7.noarch.rpm`
+  - CentOS 8: `dnf install https://vitastor.io/rpms/centos/8/vitastor-release-1.0-1.el8.noarch.rpm`
+- Enable EPEL: `yum/dnf install epel-release`
+- Enable additional CentOS repositories:
+  - CentOS 7: `yum install centos-release-scl`
+  - CentOS 8: `dnf install centos-release-advanced-virtualization`
+- Enable elrepo-kernel:
+  - CentOS 7: `yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm`
+  - CentOS 8: `dnf install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm`
+- Install packages: `yum/dnf install vitastor lpsolve etcd kernel-ml qemu-kvm`
+
+## Installation requirements
+
+- Linux kernel 5.4 or newer, for io_uring support. 5.8 or later is highly
+  recommended because io_uring is a relatively new technology and there is
+  at least one bug which reproduces with io_uring and HP SmartArray
+  controllers in 5.4
+- liburing 0.4 or newer
+- lp_solve
+- etcd 3.4.15 or newer. Earlier versions won't work because of various bugs,
+  for example [#12402](https://github.com/etcd-io/etcd/pull/12402).
+- node.js 10 or newer
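The kernel requirement above (5.4 as the minimum, 5.8 or later recommended) can be checked from Node.js, which the monitor already depends on. This is a minimal sketch; the function names and the three status labels are made up for illustration and are not part of Vitastor.

```javascript
// Sketch only: classifies a kernel release against the versions named in packages.md.
const os = require('os');

// Parse the leading "major.minor" of a release string such as "5.10.0-8-amd64"
function parseMajorMinor(release)
{
    const m = /^(\d+)\.(\d+)/.exec(release);
    if (!m)
        throw new Error(`cannot parse version: ${release}`);
    return [ Number(m[1]), Number(m[2]) ];
}

function atLeast(release, required)
{
    const [ major, minor ] = parseMajorMinor(release);
    const [ reqMajor, reqMinor ] = parseMajorMinor(required);
    return major > reqMajor || (major === reqMajor && minor >= reqMinor);
}

function kernelStatus(release)
{
    if (!atLeast(release, '5.4'))
        return 'too old';       // no io_uring support
    if (!atLeast(release, '5.8'))
        return 'supported';     // works, but 5.8 or later is recommended
    return 'recommended';
}

// os.release() returns the kernel release on Linux
if (os.platform() === 'linux')
    console.log(`kernel ${os.release()}: ${kernelStatus(os.release())}`);
```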
--- a/docs/hugo/content/installation/quickstart.md
+++ b/docs/hugo/content/installation/quickstart.md
@@ -0,0 +1,72 @@
+---
+title: Quick Start
+weight: 1
+---
+
+Prepare:
+
+- Get some SATA or NVMe SSDs with capacitors (server-grade drives). You can use desktop SSDs
+  with lazy fsync, but prepare for inferior single-thread latency. Read more about capacitors
+  [here]({{< ref "config/layout-cluster#immediate_commit" >}}).
+- Get a fast network (at least 10 Gbit/s). Something like Mellanox ConnectX-4 with RoCEv2 is ideal.
+- Disable CPU powersaving: `cpupower idle-set -D 0 && cpupower frequency-set -g performance`.
+- [Install Vitastor packages]({{< ref "installation/packages" >}}).
+
+## Configure monitors
+
+On the monitor hosts:
+- Edit variables at the top of `/usr/lib/vitastor/mon/make-units.sh` to desired values.
+- Create systemd units for the monitor and etcd: `/usr/lib/vitastor/mon/make-units.sh`
+- Start etcd and monitors: `systemctl start etcd vitastor-mon`
+
+## Configure OSDs
+
+- Put etcd_address and osd_network into `/etc/vitastor/vitastor.conf`. Example:
+  ```
+  {
+    "etcd_address": ["10.200.1.10:2379","10.200.1.11:2379","10.200.1.12:2379"],
+    "osd_network": "10.200.1.0/24"
+  }
+  ```
+- Initialize OSDs:
+  - Simplest, SSD-only: `/usr/lib/vitastor/mon/make-osd.sh /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
+  - Hybrid, HDD+SSD: `/usr/lib/vitastor/mon/make-osd-hybrid.js /dev/sda /dev/sdb ...` &mdash; pass all your
+    devices (HDD and SSD) to this script &mdash; it will partition disks and initialize journals on its own.
+    This script skips HDDs which are already partitioned so if you want to use non-empty disks for
+    Vitastor you should first wipe them with `wipefs -a`. SSDs with GPT partition table are not skipped,
+    but some free unpartitioned space must be available because the script creates new partitions for journals.
+- You can change OSD configuration in units or in `vitastor.conf`.
+  Check [Configuration Reference]({{< ref "config" >}}) for parameter descriptions.
+- Run `systemctl start vitastor.target` on every host.
+- If all your drives have capacitors, create global configuration in etcd: \
+  `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
+
+## Create a pool
+
+Create pool configuration in etcd:
+
+```
+etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool",
+  "scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'
+```
+
+For jerasure pools the configuration should look like the following:
+
+```
+etcdctl --endpoints=... put /vitastor/config/pools '{"2":{"name":"ecpool",
+  "scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}}'
+```
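+
+Note that `etcdctl put` replaces the whole value of a key, so running the two commands above one
+after another would leave only the second pool in `/vitastor/config/pools`. If you want both pools
+at once, a sketch of a combined command (simply merging the two example values, with the same
+example pool names and parameters) could look like this:
+
+```
+etcdctl --endpoints=... put /vitastor/config/pools '{
+  "1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,
+    "pg_count":256,"failure_domain":"host"},
+  "2":{"name":"ecpool","scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,
+    "pg_count":256,"failure_domain":"host"}}'
+```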
+
+After you do this, one of the monitors will configure PGs and OSDs will start them.
+
+You can check PG states with `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. All PGs should become 'active'.
+
+## Create an image
+
+Use vitastor-cli ([read CLI documentation here]({{< ref "usage/cli" >}})):
+
+```
+vitastor-cli create -s 10G testimg
+```
+
+After that, you can run benchmarks or start QEMU manually with this image.
--- a/docs/hugo/content/installation/source.md
+++ b/docs/hugo/content/installation/source.md
@@ -0,0 +1,54 @@
+---
+title: Building from Source
+weight: 3
+---
+
+## Requirements
+
+- gcc and g++ 8 or newer, clang 10 or newer, or other compiler with C++11 plus
+  designated initializers support from C++20
+- CMake
+- liburing, jerasure headers
+
+## Basic instructions
+
+Download source, for example using git: `git clone --recurse-submodules https://yourcmc.ru/git/vitalif/vitastor/`
+
+Get `fio` source and symlink it into `<vitastor>/fio`. If you don't want to build the fio engine,
+you can disable it by passing `-DWITH_FIO=no` to cmake.
+
+Build and install Vitastor:
+
+```
+cd vitastor
+mkdir build
+cd build
+cmake .. && make -j8 install
+```
+
+## QEMU Driver
+
+It's recommended to build the QEMU driver (qemu_driver.c) in-tree, as a part of
+QEMU build process. To do that:
+- Install vitastor client library headers (from source or from vitastor-client-dev package)
+- Take a corresponding patch from `patches/qemu-*-vitastor.patch` and apply it to QEMU source
+- Copy `src/qemu_driver.c` to QEMU source directory as `block/block-vitastor.c`
+- Build QEMU as usual
+
+But it is also possible to build it out-of-tree. To do that:
+- Get QEMU source, begin to build it, stop the build and copy headers:
+   - `<qemu>/include` &rarr; `<vitastor>/qemu/include`
+   - Debian:
+      * Use qemu packages from the main repository
+      * `<qemu>/b/qemu/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
+      * `<qemu>/b/qemu/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
+   - CentOS 8:
+      * Use qemu packages from the Advanced-Virtualization repository. To enable it, run
+        `yum install centos-release-advanced-virtualization.noarch` and then `yum install qemu`
+      * `<qemu>/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
+      * For QEMU 3.0+: `<qemu>/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
+      * For QEMU 2.0+: `<qemu>/qapi-types.h` &rarr; `<vitastor>/qemu/b/qemu/qapi-types.h`
+   - `config-host.h` and `qapi` are required because they contain generated headers
+- Configure Vitastor with `WITH_QEMU=yes` and, if you're on RHEL, also with `QEMU_PLUGINDIR=qemu-kvm`:
+  `cmake .. -DWITH_QEMU=yes`.
+- After that, Vitastor will build `block-vitastor.so` during its build process.
--- a/docs/hugo/content/introduction/_index.md
+++ b/docs/hugo/content/introduction/_index.md
@@ -0,0 +1,4 @@
+---
+title: Introduction
+weight: -1
+---
--- a/docs/hugo/content/introduction/architecture.md
+++ b/docs/hugo/content/introduction/architecture.md
@@ -0,0 +1,73 @@
+---
+title: Architecture
+weight: 3
+---
+
+For people familiar with Ceph, Vitastor is quite similar:
+
+- Vitastor also has Pools, PGs, OSDs, Monitors, Failure Domains, Placement Tree:
+  - OSD (Object Storage Daemon) is a process that stores data and serves read/write requests.
+  - PG (Placement Group) is a container for data that (normally) shares the same replicas.
+  - Pool is a container for data that has the same redundancy scheme and placement rules.
+  - Monitor is a separate daemon that watches cluster state and controls data distribution.
+  - Failure Domain is a group of OSDs that you allow to fail. It's "host" by default.
+  - Placement Tree groups OSDs in a hierarchy to later split them into Failure Domains.
+- Vitastor also distributes the data of every image across the whole cluster.
+- Vitastor is also transactional (every write to the cluster is atomic).
+- OSDs also have journal and metadata and they can also be put on separate drives.
+- Just like in Ceph, the client library attempts to recover from any cluster failure so
+  you can basically reboot the whole cluster and only pause, but not crash, your clients
+  (please report a bug if the client crashes in that case).
+
+However, there are also differences:
+
+- Vitastor's main focus is on SSDs. Hybrid SSD+HDD setups are also possible.
+- Vitastor OSD is (and will always be) single-threaded. If you want to dedicate more than 1 core
+  per drive you should run multiple OSDs each on a different partition of the drive.
+  Vitastor isn't CPU-hungry though (as opposed to Ceph), so 1 core is sufficient in a lot of cases.
+- Metadata and journal are always kept in memory. Metadata size depends linearly on drive capacity
+  and data store block size which is 128 KB by default. With 128 KB blocks metadata should occupy
+  around 512 MB per 1 TB (which is still less than Ceph wants). Journal doesn't have to be big,
+  the example comparison with Ceph was conducted with only a 16 MB journal. A big journal is probably even
+  harmful as dirty write metadata also takes some memory.
+- Vitastor storage layer doesn't have internal copy-on-write or redirect-write. I know that maybe
+  it's possible to create a good copy-on-write storage, but it's much harder and makes performance
+  less deterministic, so CoW isn't used in Vitastor.
+- The basic layer of Vitastor is block storage with fixed-size blocks, not object storage with
+  rich semantics like in Ceph (RADOS).
+- There's a "lazy fsync" mode which batches writes before flushing them to the disk.
+  This makes it possible to use Vitastor with desktop SSDs, but still lowers performance due to additional
+  network roundtrips, so use server SSDs with capacitor-based power loss protection
+  ("Advanced Power Loss Protection") for best performance.
+- PGs are ephemeral. This means that they aren't stored on data disks and only exist in memory
+  while OSDs are running.
+- Recovery process is per-object (per-block), not per-PG. Also there are no PGLOGs.
+- Monitors don't store data. Cluster configuration and state are stored in etcd in simple human-readable
+  JSON structures. Monitors only watch cluster state and handle data movement.
+  Thus Vitastor's Monitor isn't a critical component of the system and is more similar to Ceph's Manager.
+  Vitastor's Monitor is implemented in node.js.
+- PG distribution isn't based on consistent hashes. All PG mappings are stored in etcd.
+  Rebalancing PGs between OSDs is done by mathematical optimization - data distribution problem
+  is reduced to a linear programming problem and solved by lp_solve. This allows for almost
+  perfect (96-99% uniformity compared to Ceph's 80-90%) data distribution in most cases, ability
+  to map PGs by hand without breaking rebalancing logic, reduced OSD peer-to-peer communication
+  (on average, OSDs have fewer peers) and less data movement. It also probably has a drawback -
+  this method may fail in very large clusters, but up to several hundreds of OSDs it's perfectly fine.
+  It's also easy to add consistent hashes in the future if something proves their necessity.
+- There's no separate CRUSH layer. You select pool redundancy scheme, placement root, failure domain
+  and so on directly in pool configuration.
+- Images are global, i.e. you can't create multiple images with the same name in different pools.
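+
+As a rough worked check of the metadata memory estimate above (an inference from the two quoted
+numbers, assuming binary units; the real per-block entry size may differ):
+
+$$
+\frac{1\ \text{TB}}{128\ \text{KB}} = \frac{2^{40}}{2^{17}} = 2^{23} \approx 8.4 \text{ million blocks},
+\qquad
+\frac{512\ \text{MB}}{2^{23}\ \text{blocks}} = \frac{2^{29}}{2^{23}} = 64\ \text{bytes per block}
+$$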
+
+## Implementation Principles
+
+- I like architecturally simple solutions. Vitastor is and will always be designed
+  exactly like that.
+- I also like reinventing the wheel to some extent, like writing my own HTTP client
+  for etcd interaction instead of using prebuilt libraries, because in this case
+  I'm confident about what my code does and what it doesn't do.
+- I don't care about C++ "best practices" like RAII or proper inheritance or usage of
+  smart pointers or whatever and I don't intend to change my mind, so if you're here
+  looking for ideal reference C++ code, this probably isn't the right place.
+- I like node.js better than any other dynamically-typed language interpreter
+  because it's faster than any other interpreter in the world, has neutral C-like
+  syntax and built-in event loop. That's why Monitor is implemented in node.js.
--- a/docs/hugo/content/introduction/author.md
+++ b/docs/hugo/content/introduction/author.md
@@ -0,0 +1,34 @@
+---
+title: Author and License
+weight: 3
+---
+
+Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
+
+Join Vitastor Telegram Chat: https://t.me/vitastor
+
+All server-side code (OSD, Monitor and so on) is licensed under the terms of
+Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
+GNU GPLv3.0 with the additional "Network Interaction" clause which requires
+opensourcing all programs directly or indirectly interacting with Vitastor
+through a computer network and expressly designed to be used in conjunction
+with it ("Proxy Programs"). Proxy Programs may be made public not only under
+the terms of the same license, but also under the terms of any GPL-Compatible
+Free Software License, as listed by the Free Software Foundation.
+This is a stricter copyleft license than the Affero GPL.
+
+Please note that VNPL doesn't require you to open the code of proprietary
+software running inside a VM if it's not specially designed to be used with
+Vitastor.
+
+Basically, you can't use the software in a proprietary environment to provide
+its functionality to users without opensourcing all intermediary components
+standing between the user and Vitastor or purchasing a commercial license
+from the author 😀.
+
+Client libraries (cluster_client and so on) are dual-licensed under the same
+VNPL 1.1 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
+software like QEMU and fio.
+
+You can find the full text of VNPL-1.1 in the file [VNPL-1.1.txt](VNPL-1.1.txt).
+GPL 2.0 is also included in this repository as [GPL-2.0.txt](GPL-2.0.txt).
--- a/docs/hugo/content/introduction/features.md
+++ b/docs/hugo/content/introduction/features.md
@@ -0,0 +1,60 @@
+---
+title: Features
+weight: 1
+---
+
+Vitastor is currently a pre-release and it still lacks some important features.
+However, the following is implemented:
+
+- Basic part: highly-available block storage with symmetric clustering and no SPOF
+- Performance ;-D
+- Multiple redundancy schemes: Replication, XOR n+1, Reed-Solomon erasure codes
+  based on jerasure library with any number of data and parity drives in a group
+- Configuration via simple JSON data structures in etcd (parameters, pools and images)
+- Automatic data distribution over OSDs, with support for:
+  - Mathematical optimization for better uniformity and less data movement
+  - Multiple pools
+  - Placement tree, OSD selection by tags (device classes) and placement root
+  - Configurable failure domains
+- Recovery of degraded blocks
+- Rebalancing (data movement between OSDs)
+- Lazy fsync support
+- Per-OSD and per-image I/O and space usage statistics in etcd
+- Snapshots and copy-on-write image clones
+- Write throttling to smooth random write workloads in SSD+HDD configurations
+- RDMA/RoCEv2 support via libibverbs
+
+CLI (vitastor-cli):
+- Pool listing and space stats (df)
+- Image listing, space and I/O stats (ls)
+- Image and snapshot creation (create, modify)
+- Image removal and snapshot merge (rm, flatten, merge, rm-data)
+
+Plugins and packaging:
+- Debian and CentOS packages
+- Generic user-space client library
+- Native QEMU driver
+- Loadable fio engine for benchmarks
+- NBD proxy for kernel mounts
+- CSI plugin for Kubernetes
+- OpenStack support: Cinder driver, Nova and libvirt patches
+- Proxmox storage plugin and packages
+
+## Roadmap
+
+The following features are planned for the future:
+
+- Better OSD creation and auto-start tools
+- Other administrative tools
+- Web GUI
+- OpenNebula plugin
+- iSCSI proxy
+- Simplified NFS proxy
+- Multi-threaded client
+- Faster failover
+- Scrubbing without checksums (verification of replicas)
+- Checksums
+- Tiered storage (SSD caching)
+- NVDIMM support
+- Compression (possibly)
+- Read caching using system page cache (possibly)
--- a/docs/hugo/content/performance/comparison1.md
+++ b/docs/hugo/content/performance/comparison1.md
@@ -0,0 +1,93 @@
+---
+title: Example Comparison with Ceph
+weight: 4
+---
+
+Hardware configuration: 4 nodes, each with:
+- 6x SATA SSD Intel D3-S4510 3.84 TB
+- 2x Xeon Gold 6242 (16 cores @ 2.8 GHz)
+- 384 GB RAM
+- 1x 25 GbE network interface (Mellanox ConnectX-4 LX), connected to a Juniper QFX5200 switch
+
+CPU powersaving was disabled. Both Vitastor and Ceph were configured with 2 OSDs per 1 SSD.
+
+All of the results below apply to 4 KB blocks and random access (unless indicated otherwise).
+
+Ceph T8Q64 tests were conducted over 8 400GB RBD images from all hosts (every host was running 2 instances of fio).
+This is because Ceph has performance penalties related to running multiple clients over a single RBD image.
+
+cephx_sign_messages was set to false during tests, RocksDB and Bluestore settings were left at defaults.
+
+Vitastor T8Q64 read test was conducted over 1 larger inode (3.2T) from all hosts (every host was running 2 instances of fio).
+Vitastor has no performance penalties related to running multiple clients over a single inode.
+When conducted from one node with all primary OSDs moved to other nodes, the result was slightly lower (689000 iops)
+because all operations resulted in network roundtrips between the client and the primary OSD.
+When fio was colocated with OSDs (like in the Ceph benchmarks), 1/4 of the read workload actually
+used the loopback network.
+
+Vitastor was configured with: `--disable_data_fsync true --immediate_commit all --flusher_count 8
+  --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096
+  --journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024
+  --journal_size 16777216`.
+
+## Raw drive performance
+
+- T1Q1 write ~27000 iops (~0.037ms latency)
+- T1Q1 read ~9800 iops (~0.101ms latency)
+- T1Q32 write ~60000 iops
+- T1Q32 read ~81700 iops
+
+## 2 replicas
+
+### Ceph 15.2.4 (Bluestore)
+
+- T1Q1 write ~1000 iops (~1ms latency)
+- T1Q1 read ~1750 iops (~0.57ms latency)
+- T8Q64 write ~100000 iops, total CPU usage by OSDs about 40 virtual cores on each node
+- T8Q64 read ~480000 iops, total CPU usage by OSDs about 40 virtual cores on each node
+
+In fact, not that bad for Ceph. These servers are an example of well-balanced Ceph nodes.
+However, CPU usage and I/O latency were through the roof, as usual.
+
+### Vitastor 0.4.0 (native)
+
+- T1Q1 write: 7087 iops (0.14ms latency)
+- T1Q1 read: 6838 iops (0.145ms latency)
+- T2Q64 write: 162000 iops, total CPU usage by OSDs about 3 virtual cores on each node
+- T8Q64 read: 895000 iops, total CPU usage by OSDs about 4 virtual cores on each node
+- Linear write (4M T1Q32): 2800 MB/s
+- Linear read (4M T1Q32): 1500 MB/s
+
+### Vitastor 0.4.0 (NBD)
+
+NBD is currently required to mount Vitastor via kernel, but it imposes additional overhead
+due to additional copying between the kernel and userspace. This mostly hurts linear
+bandwidth, not iops.
+
+Vitastor with single-threaded NBD on the same hardware:
+- T1Q1 write: 6000 iops (0.166ms latency)
+- T1Q1 read: 5518 iops (0.18ms latency)
+- T1Q128 write: 94400 iops
+- T1Q128 read: 103000 iops
+- Linear write (4M T1Q128): 1266 MB/s (compared to 2800 MB/s via fio)
+- Linear read (4M T1Q128): 975 MB/s (compared to 1500 MB/s via fio)
+
+## EC/XOR 2+1
+
+### Ceph 15.2.4
+
+- T1Q1 write: 730 iops (~1.37ms latency)
+- T1Q1 read: 1500 iops with cold cache (~0.66ms latency), 2300 iops after 2 minute metadata cache warmup (~0.435ms latency)
+- T4Q128 write (4 RBD images): 45300 iops, total CPU usage by OSDs about 30 virtual cores on each node
+- T8Q64 read (4 RBD images): 278600 iops, total CPU usage by OSDs about 40 virtual cores on each node
+- Linear write (4M T1Q32): 1950 MB/s before preallocation, 2500 MB/s after preallocation
+- Linear read (4M T1Q32): 2400 MB/s
+
+### Vitastor 0.4.0
+
+- T1Q1 write: 2808 iops (~0.355ms latency)
+- T1Q1 read: 6190 iops (~0.16ms latency)
+- T2Q64 write: 85500 iops, total CPU usage by OSDs about 3.4 virtual cores on each node
+- T8Q64 read: 812000 iops, total CPU usage by OSDs about 4.7 virtual cores on each node
+- Linear write (4M T1Q32): 3200 MB/s
+- Linear read (4M T1Q32): 1800 MB/s
--- a/docs/hugo/content/performance/theoretical.md
+++ b/docs/hugo/content/performance/theoretical.md
@@ -0,0 +1,46 @@
+---
+title: Vitastor's Theoretical Maximum Performance
+weight: 3
+---
+
+Replicated setups:
+- Single-threaded (T1Q1) read latency: 1 network roundtrip + 1 disk read.
+- Single-threaded write+fsync latency:
+  - With immediate commit: 2 network roundtrips + 1 disk write.
+  - With lazy commit: 4 network roundtrips + 1 disk write + 1 disk flush.
+- Saturated parallel read iops: min(network bandwidth, sum(disk read iops)).
+- Saturated parallel write iops: min(network bandwidth, sum(disk write iops / number of replicas / write amplification)).
+
+EC/XOR setups:
+- Single-threaded (T1Q1) read latency: 1.5 network roundtrips + 1 disk read.
+- Single-threaded write+fsync latency:
+  - With immediate commit: 3.5 network roundtrips + 1 disk read + 2 disk writes.
+  - With lazy commit: 5.5 network roundtrips + 1 disk read + 2 disk writes + 2 disk fsyncs.
+  - 0.5 is actually (k-1)/k which means that an additional roundtrip doesn't happen when
+    the read sub-operation can be served locally.
+- Saturated parallel read iops: min(network bandwidth, sum(disk read iops)).
+- Saturated parallel write iops: min(network bandwidth, sum(disk write iops * number of data drives / (number of data + parity drives) / write amplification)).
+  In fact, you should put disk write iops under the condition of ~10% reads / ~90% writes in this formula.
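+
+As an illustration of the EC/XOR write formula above, take a purely hypothetical cluster (all numbers
+are assumptions, not measurements) with 24 drives, 20000 write iops per drive under the mixed
+read/write load, a 2+1 scheme and write amplification of 4, with the network not being the bottleneck:
+
+$$
+\text{write iops} \approx \frac{24 \cdot 20000 \cdot \frac{2}{2+1}}{4} = \frac{320000}{4} = 80000
+$$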
+
+Write amplification for 4 KB blocks is usually 3-5 in Vitastor:
+1. Journal block write
+2. Journal data write
+3. Metadata block write
+4. Another journal block write for EC/XOR setups
+5. Data block write
+
+If you manage to get an SSD which handles 512 byte blocks well (Optane?) you may
+lower 1, 3 and 4 to 512 bytes (1/8 of data size) and get WA as low as 2.375.
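+
+The 2.375 figure can be reproduced by counting each write relative to the 4 KB data size,
+assuming an EC/XOR setup (all 5 writes from the list above) with writes 1, 3 and 4 shrunk to 1/8:
+
+$$
+\text{WA} = \underbrace{\tfrac{1}{8}}_{1} + \underbrace{1}_{2} + \underbrace{\tfrac{1}{8}}_{3} + \underbrace{\tfrac{1}{8}}_{4} + \underbrace{1}_{5} = 2 + \tfrac{3}{8} = 2.375
+$$
+
+Under the same assumption, a replicated setup skips write 4 and would come to 2.25.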
+
+Lazy fsync also reduces WA for parallel workloads because journal blocks are only
+written when they fill up or fsync is requested.
+
+## In Practice
+
+In practice, using tests from [Understanding Performance]({{< ref "performance/understanding" >}})
+and good server-grade SSD/NVMe drives, you should aim for:
+- At least 5000 T1Q1 replicated read and write iops (maximum 0.2ms latency)
+- At least ~80k parallel read iops or ~30k write iops per 1 core (1 OSD)
+- Disk-speed or wire-speed linear reads and writes, whichever is the bottleneck in your case
+
+If your results are lower, that may mean you have bad drives, bad network or some kind of misconfiguration.
--- a/docs/hugo/content/performance/tuning.md
+++ b/docs/hugo/content/performance/tuning.md
@@ -0,0 +1,6 @@
+---
+title: Tuning
+weight: 2
+---
+
+- Disable CPU powersaving
--- a/docs/hugo/content/performance/understanding.md
+++ b/docs/hugo/content/performance/understanding.md
@@ -0,0 +1,52 @@
+---
+title: Understanding Storage Performance
+weight: 1
+---
+
+The most important thing for fast storage is latency, not parallel iops.
+
+The best possible latency is achieved with one thread and queue depth of 1 which basically means
+"client load as low as possible". In this case IOPS = 1/latency, and this number doesn't
+scale with number of servers, drives, server processes or threads and so on.
+Single-threaded IOPS and latency numbers only depend on *how fast a single daemon is*.
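+
+For example (simple arithmetic, not a measurement), a latency of 0.2 ms caps a single-threaded client at
+5000 iops no matter how large the cluster is, and reaching 25000 iops at T1Q1 requires 0.04 ms per operation:
+
+$$
+\text{IOPS} = \frac{1}{\text{latency}}: \qquad
+\frac{1}{0.0002\ \text{s}} = 5000\ \text{iops}, \qquad \frac{1}{0.00004\ \text{s}} = 25000\ \text{iops}
+$$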
+
+Why is it important? It's important because some of the applications *can't* use
+queue depth greater than 1 because their task isn't parallelizable. A notable example
+is any ACID DBMS because all of them write their WALs sequentially with fsync()s.
+
+fsync, by the way, is another important thing often missing in benchmarks. The point is
+that drives have cache buffers and don't guarantee that your data is actually persisted
+until you call fsync() which is translated to a FLUSH CACHE command by the OS.
+
+Desktop SSDs are very fast without fsync - NVMes, for example, can process ~80000 write
+operations per second with queue depth of 1 without fsync - but they're really slow with
+fsync because they have to actually write data to flash chips when you call fsync. Typical
+number is around 1000-2000 iops with fsync.
+
+Server SSDs often have supercapacitors that act as a built-in UPS and allow the drive
+to flush its DRAM cache to the persistent flash storage when a power loss occurs.
+This makes them perform equally well with and without fsync. This feature is called
+"Advanced Power Loss Protection" by Intel; other vendors either call it similarly
+or directly as "Full Capacitor-Based Power Loss Protection".
+
+All software-defined storages that I currently know are slow in terms of latency.
+Notable examples are Ceph and internal SDSes used by cloud providers like Amazon, Google,
+Yandex and so on. They're all slow and can only reach ~0.3ms read and ~0.6ms 4 KB write latency
+with best-in-slot hardware.
+
+And that's in the SSD era when you can buy an SSD that has ~0.04ms latency for $100.
+
+I use the following 6 commands with small variations to benchmark any storage:
+
+- Linear write:
+  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX`
+- Linear read:
+  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=read -runtime=60 -filename=/dev/sdX`
+- Random write latency (T1Q1, this hurts storages the most):
+  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -fsync=1 -rw=randwrite -runtime=60 -filename=/dev/sdX`
+- Random read latency (T1Q1):
+  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randread -runtime=60 -filename=/dev/sdX`
+- Parallel write iops (use numjobs if a single CPU core is insufficient to saturate the load):
+  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randwrite -runtime=60 -filename=/dev/sdX`
+- Parallel read iops (use numjobs if a single CPU core is insufficient to saturate the load):
+  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randread -runtime=60 -filename=/dev/sdX`
--- a/docs/hugo/content/usage/cli.md
+++ b/docs/hugo/content/usage/cli.md
@@ -0,0 +1,183 @@
+---
+title: Vitastor CLI
+weight: 1
+---
+
+vitastor-cli is a command-line tool for administrative tasks like image management.
+
+It supports the following commands:
+
+{{< toc >}}
+
+Global options:
+
+```
+--etcd_address ADDR  Etcd connection address
+--iodepth N          Send N operations in parallel to each OSD when possible (default 32)
+--parallel_osds M    Work with M osds in parallel when possible (default 4)
+--progress 1|0       Report progress (default 1)
+--cas 1|0            Use online CAS writes when possible (default auto)
+--no-color           Disable colored output
+--json               JSON output
+```
+
+## status
+
+`vitastor-cli status`
+
+Show cluster status.
+
+Example output:
+
+```
+  cluster:
+    etcd: 1 / 1 up, 1.8 M database size
+    mon:  1 up, master stump
+    osd:  8 / 12 up
+
+  data:
+    raw:   498.5 G used, 301.2 G / 799.7 G available, 399.8 G down
+    state: 156.6 G clean, 97.6 G misplaced
+    pools: 2 / 3 active
+    pgs:   30 active
+           34 active+has_misplaced
+           32 offline
+
+  io:
+    client:    0 B/s rd, 0 op/s rd, 0 B/s wr, 0 op/s wr
+    rebalance: 989.8 M/s, 7.9 K op/s
+```
+
+## df
+
+`vitastor-cli df`
+
+Show pool space statistics.
+
+Example output:
+
+```
+NAME      SCHEME  PGS  TOTAL    USED    AVAILABLE  USED%   EFFICIENCY
+testpool  2/1     32   100 G    34.2 G  60.7 G     39.23%  100%
+size1     1/1     32   199.9 G  10 G    121.5 G    39.23%  100%
+kaveri    2/1     32   0 B      10 G    0 B        100%    0%
+```
+
+In the example above, "kaveri" pool has "zero" efficiency because all its OSDs are down.
+
+## ls
+
+`vitastor-cli ls [-l] [-p POOL] [--sort FIELD] [-r] [-n N] [<glob> ...]`
+
+List images (only matching `<glob>` pattern(s) if passed).
+
+Options:
+
+```
+-p|--pool POOL  Filter images by pool ID or name
+-l|--long       Also report allocated size and I/O statistics
+--del           Also include delete operation statistics
+--sort FIELD    Sort by specified field (name, size, used_size, <read|write|delete>_<iops|bps|lat|queue>)
+-r|--reverse    Sort in descending order
+-n|--count N    Only list first N items
+```
+
+Example output:
+
+```
+NAME                 POOL      SIZE  USED    READ   IOPS  QUEUE  LAT   WRITE  IOPS  QUEUE  LAT   FLAGS  PARENT
+debian9              testpool  20 G  12.3 G  0 B/s  0     0      0 us  0 B/s  0     0      0 us     RO
+pve/vm-100-disk-0    testpool  20 G  0 B     0 B/s  0     0      0 us  0 B/s  0     0      0 us      -  debian9
+pve/base-101-disk-0  testpool  20 G  0 B     0 B/s  0     0      0 us  0 B/s  0     0      0 us     RO  debian9
+pve/vm-102-disk-0    testpool  32 G  36.4 M  0 B/s  0     0      0 us  0 B/s  0     0      0 us      -  pve/base-101-disk-0
+debian9-test         testpool  20 G  36.6 M  0 B/s  0     0      0 us  0 B/s  0     0      0 us      -  debian9
+bench                testpool  10 G  10 G    0 B/s  0     0      0 us  0 B/s  0     0      0 us      -
+bench-kaveri         kaveri    10 G  10 G    0 B/s  0     0      0 us  0 B/s  0     0      0 us      -
+```
+
+## create
+
+`vitastor-cli create -s|--size <size> [-p|--pool <id|name>] [--parent <parent_name>[@<snapshot>]] <name>`
+
+Create an image. You may use K/M/G/T suffixes for `<size>`. If `--parent` is specified,
+a copy-on-write image clone is created. Parent must be a snapshot (readonly image).
+Pool must be specified if there is more than one pool.
+
+```
+vitastor-cli create --snapshot <snapshot> [-p|--pool <id|name>] <image>
+vitastor-cli snap-create [-p|--pool <id|name>] <image>@<snapshot>
+```
+
+Create a snapshot of image `<image>` (either form can be used). May be used live if only a single writer is active.
+
+## modify
+
+`vitastor-cli modify <name> [--rename <new-name>] [--resize <size>] [--readonly | --readwrite] [-f|--force]`
+
+Rename, resize image or change its readonly status. Images with children can't be made read-write.
+If the new size is smaller than the old size, extra data will be purged.
+You should resize the file system in the image, if present, before shrinking it.
+
+```
+-f|--force  Proceed with shrinking or setting readwrite flag even if the image has children.
+```
+
+## rm
+
+`vitastor-cli rm <from> [<to>] [--writers-stopped]`
+
+Remove `<from>` or all layers between `<from>` and `<to>` (`<to>` must be a child of `<from>`),
+rebasing all their children accordingly. --writers-stopped allows merging to be a bit
+more effective in case of a single 'slim' read-write child and 'fat' removed parent:
+the child is merged into parent and parent is renamed to child in that case.
+In other cases parent layers are always merged into children.
+
+## flatten
+
+`vitastor-cli flatten <layer>`
+
+Flatten a layer, i.e. merge data and detach it from parents.
+
+## rm-data
+
+`vitastor-cli rm-data --pool <pool> --inode <inode> [--wait-list] [--min-offset <offset>]`
+
+Remove inode data without changing metadata.
+
+```
+--wait-list   Retrieve full objects listings before starting to remove objects.
+              Requires more memory, but allows to show correct removal progress.
+--min-offset  Purge only data starting with specified offset.
+```
+
+## merge-data
+
+`vitastor-cli merge-data <from> <to> [--target <target>]`
+
+Merge layer data without changing metadata. Merge `<from>`..`<to>` to `<target>`.
+`<to>` must be a child of `<from>` and `<target>` may be one of the layers between
+`<from>` and `<to>`, including `<from>` and `<to>`.
+
+## alloc-osd
+
+`vitastor-cli alloc-osd`
+
+Allocate a new OSD number and reserve it by creating empty `/osd/stats/<n>` key.
+
+## simple-offsets
+
+`vitastor-cli simple-offsets <device>`
+
+Calculate offsets for simple&stupid (no superblock) OSD deployment.
+
+Options:
+
+```
+--object_size 128k       Set blockstore block size
+--bitmap_granularity 4k  Set bitmap granularity
+--journal_size 16M       Set journal size
+--device_block_size 4k   Set device block size
+--journal_offset 0       Set journal offset
+--device_size 0          Set device size
+--format text            Result format: json, options, env, or text
+```
--- a/docs/hugo/content/usage/nbd.md
+++ b/docs/hugo/content/usage/nbd.md
@@ -0,0 +1,20 @@
+---
+title: NBD
+weight: 6
+---
+
+To create a local block device for a Vitastor image, use NBD. For example:
+
+```
+vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --image testimg
+```
+
+It will output the device name, like `/dev/nbd0`, which you can then format and mount as a normal block device.
+
+You can also use `--pool <POOL> --inode <INODE> --size <SIZE>` instead of `--image <IMAGE>` if you want.
+
+To unmap the device run:
+
+```
+vitastor-nbd unmap /dev/nbd0
+```
--- a/docs/hugo/content/usage/qemu.md
+++ b/docs/hugo/content/usage/qemu.md
@@ -0,0 +1,39 @@
+---
+title: QEMU and qemu-img
+weight: 2
+---
+
+You need a patched QEMU version to use the Vitastor driver.
+
+To start a VM using plain QEMU command-line with Vitastor disk, use the following commands:
+
+Old syntax (-drive):
+
+```
+qemu-system-x86_64 -enable-kvm -m 1024 \
+    -drive 'file=vitastor:etcd_host=192.168.7.2\:2379/v3:image=debian9',format=raw,if=none,id=drive-virtio-disk0,cache=none \
+    -device 'virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off' \
+    -vnc 0.0.0.0:0
+```
+
+New syntax (-blockdev):
+
+```
+qemu-system-x86_64 -enable-kvm -m 1024 \
+    -blockdev '{"node-name":"drive-virtio-disk0","driver":"vitastor","image":"debian9",
+        "cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
+    -device 'virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,'\
+'id=virtio-disk0,bootindex=1,write-cache=off' \
+    -vnc 0.0.0.0:0
+```
+
+For qemu-img, use `vitastor:etcd_host=<HOST>:image=<IMAGE>` as the filename. For example:
+
+```
+qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=debian10'
+```
+
+You can also specify `:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>`
+if you don't want to use inode metadata.
--- a/docs/hugo/i18n/ru.yaml
+++ b/docs/hugo/i18n/ru.yaml
@@ -0,0 +1,37 @@
+---
+nav_navigation: Навигация
+nav_tags: Теги
+nav_more: Подробнее
+nav_top: К началу
+
+form_placeholder_search: Поиск
+
+error_page_title: Открыта несуществующая страница
+error_message_title: Потерялись?
+error_message_code: Ошибка 404
+error_message_text: >
+  Похоже, страница, которую вы открыли, не существует. Попробуйте найти
+  нужную информацию с <a class="gdoc-error__link" href="{{ . }}">главной страницы</a>.
+
+button_toggle_dark: Переключить тёмный/светлый/авто режим
+button_nav_open: Показать навигацию
+button_nav_close: Скрыть навигацию
+button_menu_open: Открыть меню
+button_menu_close: Закрыть меню
+button_homepage: На главную
+
+title_anchor_prefix: "Ссылка на:"
+
+posts_read_more: Читать подробнее
+posts_read_time:
+  one: "Одна минута на чтение"
+  other: "{{ . }} минут(ы) на чтение"
+posts_update_prefix: Обновлено
+
+footer_build_with: >
+  Сделано на <a href="https://gohugo.io/" class="gdoc-footer__link">Hugo</a> с
+  <svg class="icon gdoc_heart"><use xlink:href="#gdoc_heart"></use></svg>
+footer_legal_notice: Правовая информация
+footer_privacy_policy: Приватность
+
+language_switch_no_tranlation_prefix: "Страница не переведена:"
--- a/docs/hugo/layouts/partials/site-footer.html
+++ b/docs/hugo/layouts/partials/site-footer.html
@@ -0,0 +1,34 @@
+<footer class="gdoc-footer">
+    <div class="container flex">
+        <div class="flex flex-wrap" style="flex: 1">
+            <span class="gdoc-footer__item gdoc-footer__item--row">
+                &copy; Vitaliy Filippov, 2021+
+            </span>
+        </div>
+        <div class="flex flex-wrap">
+            {{ with .Site.Params.GeekdocLegalNotice }}
+            <span class="gdoc-footer__item gdoc-footer__item--row">
+                <a href="{{ . | relURL }}" class="gdoc-footer__link">{{ i18n "footer_legal_notice" }}</a>
+            </span>
+            {{ end }}
+            {{ with .Site.Params.GeekdocPrivacyPolicy }}
+            <span class="gdoc-footer__item gdoc-footer__item--row">
+                <a href="{{ . | relURL }}" class="gdoc-footer__link">{{ i18n "footer_privacy_policy" }}</a>
+            </span>
+            {{ end }}
+        </div>
+        {{ if (default true .Site.Params.GeekdocBackToTop) }}
+        <div class="flex flex-25 justify-end">
+            <span class="gdoc-footer__item gdoc-footer__item--row" style="margin-right: 50px">
+                {{ i18n "footer_build_with" | safeHTML }}
+            </span>
+            <span class="gdoc-footer__item">
+                <a class="gdoc-footer__link fake-link" href="#" aria-label="{{ i18n "nav_top" }}">
+                    <svg class="icon gdoc_keyboard_arrow_up"><use xlink:href="#gdoc_keyboard_arrow_up"></use></svg>
+                    <span class="hidden-mobile">{{ i18n "nav_top" }}</span>
+                </a>
+            </span>
+        </div>
+        {{ end }}
+    </div>
+</footer>
--- a/docs/hugo/static/brand.svg
+++ b/docs/hugo/static/brand.svg
@@ -0,0 +1,215 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<svg
+   xmlns:osb="http://www.openswatchbook.org/uri/2009/osb"
+   xmlns:dc="http://purl.org/dc/elements/1.1/"
+   xmlns:cc="http://creativecommons.org/ns#"
+   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+   xmlns:svg="http://www.w3.org/2000/svg"
+   xmlns="http://www.w3.org/2000/svg"
+   xmlns:xlink="http://www.w3.org/1999/xlink"
+   xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
+   xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
+   sodipodi:docname="logo_only2.svg"
+   inkscape:version="1.0.2 (e86c870879, 2021-01-15)"
+   id="svg1340"
+   version="1.1"
+   viewBox="0 0 100 86.80192"
+   height="86.801918mm"
+   width="100mm"
+   inkscape:export-filename="/var/home/vitali/SVN/vitastor/presentation/logos/logo_only.png"
+   inkscape:export-xdpi="92.889999"
+   inkscape:export-ydpi="92.889999">
+  <defs
+     id="defs1334">
+    <linearGradient
+       osb:paint="gradient"
+       id="linearGradient866">
+      <stop
+         id="stop862"
+         offset="0"
+         style="stop-color:#c0c0c0;stop-opacity:1" />
+      <stop
+         id="stop864"
+         offset="1"
+         style="stop-color:#000000;stop-opacity:0" />
+    </linearGradient>
+    <linearGradient
+       id="linearGradient846"
+       osb:paint="gradient">
+      <stop
+         style="stop-color:#ffd42a;stop-opacity:1"
+         offset="0"
+         id="stop842" />
+      <stop
+         style="stop-color:#ffa200;stop-opacity:1"
+         offset="1"
+         id="stop844" />
+    </linearGradient>
+    <radialGradient
+       r="50"
+       fy="159.11139"
+       fx="202.36813"
+       cy="159.11139"
+       cx="202.36813"
+       gradientTransform="matrix(1.2462942,-1.2279529,0.77712408,0.78873143,-190.96813,230.1331)"
+       gradientUnits="userSpaceOnUse"
+       id="radialGradient1530"
+       xlink:href="#linearGradient1352"
+       inkscape:collect="always" />
+    <linearGradient
+       inkscape:collect="always"
+       id="linearGradient1352">
+      <stop
+         style="stop-color:#00c9e6;stop-opacity:1"
+         offset="0"
+         id="stop1348" />
+      <stop
+         style="stop-color:#5240d3;stop-opacity:1"
+         offset="1"
+         id="stop1350" />
+    </linearGradient>
+    <linearGradient
+       y2="62.555599"
+       x2="51.484566"
+       y1="62.555599"
+       x1="38.105473"
+       gradientTransform="rotate(-16.930773,271.11609,-412.42594)"
+       gradientUnits="userSpaceOnUse"
+       id="linearGradient1508"
+       xlink:href="#linearGradient1323"
+       inkscape:collect="always" />
+    <linearGradient
+       inkscape:collect="always"
+       id="linearGradient1323">
+      <stop
+         style="stop-color:#000000;stop-opacity:0.47178105"
+         offset="0"
+         id="stop1319" />
+      <stop
+         style="stop-color:#eeaaff;stop-opacity:0;"
+         offset="1"
+         id="stop1321" />
+    </linearGradient>
+    <radialGradient
+       r="21.541935"
+       fy="24.614815"
+       fx="45.312912"
+       cy="24.614815"
+       cx="45.312912"
+       gradientTransform="matrix(1.0933447,0.13113705,-0.12664108,1.0558599,-1.082187,93.974708)"
+       gradientUnits="userSpaceOnUse"
+       id="radialGradient1504"
+       xlink:href="#linearGradient846"
+       inkscape:collect="always" />
+    <filter
+       style="color-interpolation-filters:sRGB"
+       inkscape:label="Drop Shadow"
+       id="filter1497"
+       width="2"
+       height="2"
+       x="-0.5"
+       y="-0.5">
+      <feFlood
+         flood-opacity="0.498039"
+         flood-color="rgb(0,0,0)"
+         result="flood"
+         id="feFlood1487" />
+      <feComposite
+         in="flood"
+         in2="SourceGraphic"
+         operator="in"
+         result="composite1"
+         id="feComposite1489" />
+      <feGaussianBlur
+         in="composite1"
+         stdDeviation="6"
+         result="blur"
+         id="feGaussianBlur1491" />
+      <feOffset
+         dx="0"
+         dy="6"
+         result="offset"
+         id="feOffset1493" />
+      <feComposite
+         in="offset"
+         in2="offset"
+         operator="atop"
+         result="composite2"
+         id="feComposite1495" />
+    </filter>
+    <radialGradient
+       r="21.541935"
+       fy="24.614815"
+       fx="45.312912"
+       cy="24.614815"
+       cx="45.312912"
+       gradientTransform="matrix(1.0933447,0.13113705,-0.12664108,1.0558599,-1.082187,93.974708)"
+       gradientUnits="userSpaceOnUse"
+       id="radialGradient1506"
+       xlink:href="#linearGradient846"
+       inkscape:collect="always" />
+  </defs>
+  <sodipodi:namedview
+     inkscape:window-maximized="1"
+     inkscape:window-y="0"
+     inkscape:window-x="0"
+     inkscape:window-height="992"
+     inkscape:window-width="1920"
+     fit-margin-bottom="0"
+     fit-margin-right="0"
+     fit-margin-left="0"
+     fit-margin-top="-30"
+     showgrid="false"
+     inkscape:document-rotation="0"
+     inkscape:current-layer="layer1"
+     inkscape:document-units="mm"
+     inkscape:cy="47.914558"
+     inkscape:cx="-103.69646"
+     inkscape:zoom="0.7"
+     inkscape:pageshadow="2"
+     inkscape:pageopacity="1"
+     borderopacity="1.0"
+     bordercolor="#666666"
+     pagecolor="#000000"
+     id="base" />
+  <metadata
+     id="metadata1337">
+    <rdf:RDF>
+      <cc:Work
+         rdf:about="">
+        <dc:format>image/svg+xml</dc:format>
+        <dc:type
+           rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
+        <dc:title></dc:title>
+      </cc:Work>
+    </rdf:RDF>
+  </metadata>
+  <g
+     transform="translate(-133.26969,-52.101187)"
+     id="layer1"
+     inkscape:groupmode="layer"
+     inkscape:label="Слой 1">
+    <path
+       style="fill:url(#radialGradient1530);fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1"
+       d="m 133.26969,59.089473 50,75.000087 50,-75.000087 z"
+       id="path1528"
+       sodipodi:nodetypes="cccc" />
+    <path
+       d="m 194.29572,89.403603 -8.41706,2.562119 -2.50682,7.49308 7.17785,23.579008 9.60097,-14.40173 z"
+       style="fill:url(#linearGradient1508);fill-opacity:1;stroke-width:0;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:0.501961"
+       id="path1459" />
+    <g
+       transform="translate(135.70225,-49.385894)"
+       id="g1465">
+      <path
+         id="path1461"
+         style="fill:url(#radialGradient1504);fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;filter:url(#filter1497)"
+         d="m 28.817436,101.36529 c 3.112699,10.74423 6.225077,21.48892 9.333984,32.23438 2.519532,0 5.039063,0 7.558594,0 -0.985406,8.09729 -2.085815,16.18202 -2.951172,24.29297 -0.06053,0.88723 1.098131,1.61652 1.76,0.9155 1.007514,-1.05482 1.676008,-2.3829 2.528566,-3.56053 7.51538,-11.37722 14.987447,-22.78299 22.482919,-34.17333 -3.239584,0 -6.479167,0 -9.71875,0 2.887267,-6.79562 5.775365,-13.59088 8.662109,-20.38672 -13.284505,0 -26.56901,0 -39.853516,0 0.06576,0.22591 0.131511,0.45182 0.197266,0.67773 z" />
+      <path
+         sodipodi:nodetypes="cccccccc"
+         id="path1463"
+         d="m 30.735882,102.2764 h 35.342242 l -8.662729,20.3854 h 9.173783 l -22.106472,33.62346 3.027029,-24.27377 H 39.34604 Z"
+         style="fill:url(#radialGradient1506);fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" />
+    </g>
+  </g>
+</svg>
--- a/docs/hugo/static/custom.css
+++ b/docs/hugo/static/custom.css
@@ -0,0 +1,138 @@
+/* Global customization */
+
+:root {
+  --code-max-height: 60rem;
+}
+
+/* Light mode theming */
+:root,
+:root[color-mode="light"] {
+  --header-background: #404050;
+  --header-font-color: #ffffff;
+
+  --body-background: #ffffff;
+  --body-font-color: #343a40;
+
+  --button-background: #62cb97;
+  --button-border-color: #4ec58a;
+
+  --link-color: #c54e8a;
+  --link-color-visited: #c54e8a;
+
+  --code-background: #f5f6f8;
+  --code-accent-color: #e3e7eb;
+  --code-accent-color-lite: #eff1f3;
+
+  --accent-color: #e9ecef;
+  --accent-color-lite: #f8f9fa;
+
+  --control-icons: #b2bac1;
+
+  --footer-background: #606070;
+  --footer-font-color: #ffffff;
+  --footer-link-color: #ffcc5c;
+  --footer-link-color-visited: #ffcc5c;
+}
+@media (prefers-color-scheme: light) {
+  :root {
+    --header-background: #404050;
+    --header-font-color: #ffffff;
+
+    --body-background: #ffffff;
+    --body-font-color: #343a40;
+
+    --button-background: #62cb97;
+    --button-border-color: #4ec58a;
+
+    --link-color: #c54e8a;
+    --link-color-visited: #c54e8a;
+
+    --code-background: #f5f6f8;
+    --code-accent-color: #e3e7eb;
+    --code-accent-color-lite: #eff1f3;
+
+    --accent-color: #e9ecef;
+    --accent-color-lite: #f8f9fa;
+
+    --control-icons: #b2bac1;
+
+    --footer-background: #606070;
+    --footer-font-color: #ffffff;
+    --footer-link-color: #ffcc5c;
+    --footer-link-color-visited: #ffcc5c;
+  }
+}
+
+/* Dark mode theming */
+:root[color-mode="dark"] {
+  --header-background: #202830;
+  --header-font-color: #ffffff;
+
+  --body-background: #343a44;
+  --body-font-color: #ced3d8;
+
+  --button-background: #62cb97;
+  --button-border-color: #4ec58a;
+
+  --link-color: #7ac29e;
+  --link-color-visited: #7ac29e;
+
+  --code-background: #2f353a;
+  --code-accent-color: #262b2f;
+  --code-accent-color-lite: #2b3035;
+
+  --accent-color: #2b3035;
+  --accent-color-lite: #2f353a;
+
+  --control-icons: #b2bac1;
+
+  --footer-background: #2f333e;
+  --footer-font-color: #cccccc;
+  --footer-link-color: #7ac29e;
+  --footer-link-color-visited: #7ac29e;
+}
+@media (prefers-color-scheme: dark) {
+  :root {
+    --header-background: #404070;
+    --header-font-color: #ffffff;
+
+    --body-background: #343a40;
+    --body-font-color: #ced3d8;
+
+    --button-background: #62cb97;
+    --button-border-color: #4ec58a;
+
+    --link-color: #7ac29e;
+    --link-color-visited: #7ac29e;
+
+    --code-background: #2f353a;
+    --code-accent-color: #262b2f;
+    --code-accent-color-lite: #2b3035;
+
+    --accent-color: #2b3035;
+    --accent-color-lite: #2f353a;
+
+    --control-icons: #b2bac1;
+
+    --footer-background: #2f333e;
+    --footer-font-color: #cccccc;
+    --footer-link-color: #7ac29e;
+    --footer-link-color-visited: #7ac29e;
+  }
+}
+
+.gdoc-brand__img {
+  width: 48px;
+  height: auto;
+  margin-top: -4px;
+  margin-bottom: -4px;
+}
+
+.gdoc-menu-header > span {
+  display: flex;
+  flex-direction: row-reverse;
+}
+
+span.gdoc-language {
+  margin-right: 20px;
+}
--- a/docs/hugo/static/favicon/favicon-16x16.png
+++ b/docs/hugo/static/favicon/favicon-16x16.png
--- a/docs/hugo/static/favicon/favicon-32x32.png
+++ b/docs/hugo/static/favicon/favicon-32x32.png
--- a/docs/hugo/static/favicon/favicon.svg
+++ b/docs/hugo/static/favicon/favicon.svg
@@ -0,0 +1,196 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<svg
+   xmlns:osb="http://www.openswatchbook.org/uri/2009/osb"
+   xmlns:dc="http://purl.org/dc/elements/1.1/"
+   xmlns:cc="http://creativecommons.org/ns#"
+   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+   xmlns:svg="http://www.w3.org/2000/svg"
+   xmlns="http://www.w3.org/2000/svg"
+   xmlns:xlink="http://www.w3.org/1999/xlink"
+   xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
+   xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
+   sodipodi:docname="favicon.svg"
+   inkscape:version="1.0.2 (e86c870879, 2021-01-15)"
+   id="svg1340"
+   version="1.1"
+   viewBox="0 0 100 100"
+   height="100mm"
+   width="100mm"
+   inkscape:export-filename="/var/home/vitali/SVN/vitastor/docs/static/favicon/favicon-64x64.png"
+   inkscape:export-xdpi="16.26"
+   inkscape:export-ydpi="16.26">
+  <defs
+     id="defs1334">
+    <linearGradient
+       osb:paint="gradient"
+       id="linearGradient866">
+      <stop
+         id="stop862"
+         offset="0"
+         style="stop-color:#c0c0c0;stop-opacity:1" />
+      <stop
+         id="stop864"
+         offset="1"
+         style="stop-color:#000000;stop-opacity:0" />
+    </linearGradient>
+    <linearGradient
+       id="linearGradient846"
+       osb:paint="gradient">
+      <stop
+         style="stop-color:#ffd42a;stop-opacity:1"
+         offset="0"
+         id="stop842" />
+      <stop
+         style="stop-color:#ffa200;stop-opacity:1"
+         offset="1"
+         id="stop844" />
+    </linearGradient>
+    <radialGradient
+       r="50"
+       fy="159.11139"
+       fx="202.36813"
+       cy="159.11139"
+       cx="202.36813"
+       gradientTransform="matrix(1.2462942,-1.2279529,0.77712408,0.78873143,-190.96813,230.1331)"
+       gradientUnits="userSpaceOnUse"
+       id="radialGradient1530"
+       xlink:href="#linearGradient1352"
+       inkscape:collect="always" />
+    <linearGradient
+       inkscape:collect="always"
+       id="linearGradient1352">
+      <stop
+         style="stop-color:#00c9e6;stop-opacity:1"
+         offset="0"
+         id="stop1348" />
+      <stop
+         style="stop-color:#5240d3;stop-opacity:1"
+         offset="1"
+         id="stop1350" />
+    </linearGradient>
+    <linearGradient
+       y2="62.555599"
+       x2="51.484566"
+       y1="62.555599"
+       x1="38.105473"
+       gradientTransform="rotate(-16.930773,271.11609,-412.42594)"
+       gradientUnits="userSpaceOnUse"
+       id="linearGradient1508"
+       xlink:href="#linearGradient1323"
+       inkscape:collect="always" />
+    <linearGradient
+       inkscape:collect="always"
+       id="linearGradient1323">
+      <stop
+         style="stop-color:#000000;stop-opacity:0.47178105"
+         offset="0"
+         id="stop1319" />
+      <stop
+         style="stop-color:#eeaaff;stop-opacity:0;"
+         offset="1"
+         id="stop1321" />
+    </linearGradient>
+    <filter
+       style="color-interpolation-filters:sRGB"
+       inkscape:label="Drop Shadow"
+       id="filter1497"
+       width="2"
+       height="2"
+       x="-0.5"
+       y="-0.5">
+      <feFlood
+         flood-opacity="0.498039"
+         flood-color="rgb(0,0,0)"
+         result="flood"
+         id="feFlood1487" />
+      <feComposite
+         in="flood"
+         in2="SourceGraphic"
+         operator="in"
+         result="composite1"
+         id="feComposite1489" />
+      <feGaussianBlur
+         in="composite1"
+         stdDeviation="6"
+         result="blur"
+         id="feGaussianBlur1491" />
+      <feOffset
+         dx="0"
+         dy="6"
+         result="offset"
+         id="feOffset1493" />
+      <feComposite
+         in="offset"
+         in2="offset"
+         operator="atop"
+         result="composite2"
+         id="feComposite1495" />
+    </filter>
+    <radialGradient
+       r="21.541935"
+       fy="24.614815"
+       fx="45.312912"
+       cy="24.614815"
+       cx="45.312912"
+       gradientTransform="matrix(1.6678615,0.20004527,-0.19318681,1.6106796,108.48083,22.966962)"
+       gradientUnits="userSpaceOnUse"
+       id="radialGradient1506"
+       xlink:href="#linearGradient846"
+       inkscape:collect="always" />
+  </defs>
+  <sodipodi:namedview
+     inkscape:window-maximized="1"
+     inkscape:window-y="0"
+     inkscape:window-x="0"
+     inkscape:window-height="992"
+     inkscape:window-width="1920"
+     fit-margin-bottom="0"
+     fit-margin-right="0"
+     fit-margin-left="0"
+     fit-margin-top="0"
+     showgrid="false"
+     inkscape:document-rotation="0"
+     inkscape:current-layer="layer1"
+     inkscape:document-units="mm"
+     inkscape:cy="83.752268"
+     inkscape:cx="-103.69645"
+     inkscape:zoom="0.7"
+     inkscape:pageshadow="2"
+     inkscape:pageopacity="0"
+     borderopacity="1.0"
+     bordercolor="#666666"
+     pagecolor="#000000"
+     id="base" />
+  <metadata
+     id="metadata1337">
+    <rdf:RDF>
+      <cc:Work
+         rdf:about="">
+        <dc:format>image/svg+xml</dc:format>
+        <dc:type
+           rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
+        <dc:title></dc:title>
+      </cc:Work>
+    </rdf:RDF>
+  </metadata>
+  <g
+     transform="translate(-133.26969,-35.630924)"
+     id="layer1"
+     inkscape:groupmode="layer"
+     inkscape:label="Слой 1">
+    <path
+       style="fill:url(#radialGradient1530);fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1"
+       d="m 133.26969,59.089473 50,75.000087 50,-75.000087 z"
+       id="path1528"
+       sodipodi:nodetypes="cccc" />
+    <path
+       d="m 194.29572,89.403603 -8.41706,2.562119 -2.50682,7.49308 7.17785,23.579008 9.60097,-14.40173 z"
+       style="fill:url(#linearGradient1508);fill-opacity:1;stroke-width:0;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:0.501961"
+       id="path1459" />
+    <path
+       sodipodi:nodetypes="cccccccc"
+       id="path1463"
+       d="m 157.01826,35.630924 h 53.91343 l -13.21471,31.09726 h 13.99432 l -33.7227,51.291496 4.61762,-37.02885 h -12.45344 z"
+       style="fill:url(#radialGradient1506);fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" />
+  </g>
+</svg>
--- a/docs/params/common.yml
+++ b/docs/params/common.yml
@@ -0,0 +1,35 @@
+- name: config_path
+  type: string
+  default: "/etc/vitastor/vitastor.conf"
+  info: |
+    Path to the JSON configuration file. The configuration file is optional:
+    a missing configuration file does not prevent Vitastor from running if
+    the required parameters are specified.
+  info_ru: |
+    Путь к файлу конфигурации в формате JSON. Файл конфигурации необязателен,
+    без него Vitastor тоже будет работать, если переданы необходимые параметры.
+- name: etcd_address
+  type: string or array of strings
+  type_ru: строка или массив строк
+  info: |
+    etcd connection endpoint(s). Multiple endpoints may be delimited by "," or
+    specified in a JSON array `["10.0.115.10:2379/v3","10.0.115.11:2379/v3"]`.
+    Note that https is not supported for etcd connections yet.
+  info_ru: |
+    Адрес(а) подключения к etcd. Несколько адресов могут разделяться запятой
+    или указываться в виде JSON-массива `["10.0.115.10:2379/v3","10.0.115.11:2379/v3"]`.
+- name: etcd_prefix
+  type: string
+  default: "/vitastor"
+  info: |
+    Prefix for all keys in etcd used by Vitastor. You can change the prefix to,
+    for example, share a single etcd cluster between multiple Vitastor clusters.
+  info_ru: |
+    Префикс для ключей etcd, которые использует Vitastor. Вы можете задать другой
+    префикс, например, чтобы запустить несколько кластеров Vitastor с одним
+    кластером etcd.
+- name: log_level
+  type: int
+  default: 0
+  info: Log level. Raise if you want more verbose output.
+  info_ru: Уровень логирования. Повысьте, если хотите более подробный вывод.
--- a/docs/params/head/common.en.md
+++ b/docs/params/head/common.en.md
@@ -0,0 +1,6 @@
+---
+title: Common Parameters
+weight: 1
+---
+
+These are the most common parameters which apply to all components of Vitastor.
--- a/docs/params/head/common.ru.md
+++ b/docs/params/head/common.ru.md
@@ -0,0 +1,6 @@
+---
+title: Общие параметры
+weight: 1
+---
+
+Это наиболее общие параметры, используемые всеми компонентами Vitastor.
--- a/docs/params/head/layout-cluster.en.md
+++ b/docs/params/head/layout-cluster.en.md
@@ -0,0 +1,7 @@
+---
+title: Cluster-Wide Disk Layout Parameters
+weight: 2
+---
+
+These parameters apply to clients and OSDs, are fixed at the moment of OSD drive
+initialization and can't be changed after it without losing data.
--- a/docs/params/head/layout-cluster.ru.md
+++ b/docs/params/head/layout-cluster.ru.md
@@ -0,0 +1,7 @@
+---
+title: Дисковые параметры уровня кластера
+weight: 2
+---
+
+Данные параметры используются клиентами и OSD, задаются в момент инициализации
+диска OSD и не могут быть изменены после этого без потери данных.
--- a/docs/params/head/layout-osd.en.md
+++ b/docs/params/head/layout-osd.en.md
@@ -0,0 +1,7 @@
+---
+title: OSD Disk Layout Parameters
+weight: 3
+---
+
+These parameters apply to OSDs, are fixed at the moment of OSD drive
+initialization and can't be changed after it without losing data.
--- a/docs/params/head/layout-osd.ru.md
+++ b/docs/params/head/layout-osd.ru.md
@@ -0,0 +1,8 @@
+---
+title: Дисковые параметры OSD
+weight: 3
+---
+
+Данные параметры используются только OSD и, также как и общекластерные
+дисковые параметры, задаются в момент инициализации дисков OSD и не могут быть
+изменены после этого без потери данных.
--- a/docs/params/head/monitor.en.md
+++ b/docs/params/head/monitor.en.md
@@ -0,0 +1,6 @@
+---
+title: Monitor Parameters
+weight: 6
+---
+
+These parameters only apply to Monitors.
--- a/docs/params/head/monitor.ru.md
+++ b/docs/params/head/monitor.ru.md
@@ -0,0 +1,6 @@
+---
+title: Параметры мониторов
+weight: 6
+---
+
+Данные параметры используются только мониторами Vitastor.
--- a/docs/params/head/network.en.md
+++ b/docs/params/head/network.en.md
@@ -0,0 +1,7 @@
+---
+title: Network Protocol Parameters
+weight: 4
+---
+
+These parameters apply to clients and OSDs and affect network connection logic
+between clients, OSDs and etcd.
--- a/docs/params/head/network.ru.md
+++ b/docs/params/head/network.ru.md
@@ -0,0 +1,7 @@
+---
+title: Параметры сетевого протокола
+weight: 4
+---
+
+Данные параметры используются клиентами и OSD и влияют на логику сетевого
+взаимодействия между клиентами, OSD, а также etcd.
--- a/docs/params/head/osd.en.md
+++ b/docs/params/head/osd.en.md
@@ -0,0 +1,7 @@
+---
+title: Runtime OSD Parameters
+weight: 5
+---
+
+These parameters only apply to OSDs, are not fixed at the moment of OSD drive
+initialization and can be changed with an OSD restart.
--- a/docs/params/head/osd.ru.md
+++ b/docs/params/head/osd.ru.md
@@ -0,0 +1,8 @@
+---
+title: Изменяемые параметры OSD
+weight: 5
+---
+
+Данные параметры используются только OSD, но, в отличие от дисковых параметров,
+не фиксируются в момент инициализации дисков OSD и могут быть изменены в любой
+момент с перезапуском OSD.
--- a/docs/params/layout-cluster.yml
+++ b/docs/params/layout-cluster.yml
@@ -0,0 +1,200 @@
+- name: block_size
+  type: int
+  default: 131072
+  info: |
+    Size of objects (data blocks) into which all physical and virtual drives are
+    subdivided in Vitastor. Currently one of the main settings in Vitastor: it
+    affects memory usage, write amplification (WA) and I/O load distribution
+    effectiveness.
+
+    The recommended block size is 128 KB for SSD and 4 MB for HDD. It's also
+    possible to use 4 MB for SSD - it will lower memory usage, but may increase
+    average WA and reduce linear performance.
+
+    OSDs with different block sizes (for example, SSD and SSD+HDD OSDs) can
+    currently coexist in one etcd instance only within separate Vitastor
+    clusters with different etcd_prefix values.
+
+    Also, the block size can't be changed after OSD initialization without
+    losing data.
+
+    If you change block_size, you must always specify it in etcd in
+    /vitastor/config/global so that all clients know about it.
+
+    OSD memory usage is roughly (SIZE / BLOCK * 68 bytes) which is roughly
+    544 MB per 1 TB of used disk space with the default 128 KB block size.
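+
+    As a rough sketch of that arithmetic for 1 TB (taken here as 2^40 bytes) of
+    used space and the default block size, using shell integer arithmetic:
+
+    ```
+    # (SIZE / BLOCK) objects * 68 bytes each, converted to MB (2^20 bytes)
+    echo $(( (1 << 40) / 131072 * 68 / 1048576 ))   # prints 544
+    ```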
+  info_ru: |
+    Размер объектов (блоков данных), на которые делятся физические и виртуальные
+    диски в Vitastor. Одна из ключевых на данный момент настроек, влияет на
+    потребление памяти, объём избыточной записи (write amplification) и
+    эффективность распределения нагрузки по OSD.
+
+    Рекомендуемые по умолчанию размеры блока - 128 килобайт для SSD и 4
+    мегабайта для HDD. В принципе, для SSD можно тоже использовать 4 мегабайта,
+    это понизит использование памяти, но ухудшит распределение нагрузки и в
+    среднем увеличит WA.
+
+    OSD с разными размерами блока (например, SSD и SSD+HDD OSD) на данный
+    момент могут сосуществовать в рамках одного etcd только в виде двух независимых
+    кластеров Vitastor с разными etcd_prefix.
+
+    Также размер блока нельзя менять после инициализации OSD без потери данных.
+
+    Если вы меняете размер блока, обязательно прописывайте его в etcd в
+    /vitastor/config/global, дабы все клиенты его знали.
+
+    Потребление памяти OSD составляет примерно (РАЗМЕР / БЛОК * 68 байт),
+    т.е. примерно 544 МБ памяти на 1 ТБ занятого места на диске при
+    стандартном 128 КБ блоке.
+- name: bitmap_granularity
+  type: int
+  default: 4096
+  info: |
+    Required virtual disk write alignment ("sector size"). Must be a multiple
+    of disk_alignment. It's called bitmap granularity because Vitastor tracks
+    an allocation bitmap for each object containing 2 bits for every
+    (bitmap_granularity) bytes.
+
+    This parameter can't be changed after OSD initialization without losing
+    data. It's also fixed for the whole Vitastor cluster, i.e. two different
+    values can't be used in a single Vitastor cluster.
+
+    Clients MUST be aware of this parameter value, so put it into etcd key
+    /vitastor/config/global if you change it for any reason.
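+
+    As a rough sketch, with the default 128 KB block_size and 4 KB
+    bitmap_granularity, and assuming the bitmap holds nothing beyond the
+    2 bits per granule described above, its size per object in bytes is:
+
+    ```
+    # (block_size / bitmap_granularity) granules * 2 bits each / 8 bits per byte
+    echo $(( 131072 / 4096 * 2 / 8 ))   # prints 8
+    ```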
+  info_ru: |
+    Требуемое выравнивание записи на виртуальные диски (размер их "сектора").
+    Должен быть кратен disk_alignment. Называется гранулярностью битовой карты
+    потому, что Vitastor хранит битовую карту для каждого объекта, содержащую
+    по 2 бита на каждые (bitmap_granularity) байт.
+
+    Данный параметр нельзя менять после инициализации OSD без потери данных.
+    Также он фиксирован для всего кластера Vitastor, т.е. разные значения
+    не могут сосуществовать в одном кластере.
+
+    Клиенты ДОЛЖНЫ знать правильное значение этого параметра, так что если вы
+    его меняете, обязательно прописывайте изменённое значение в etcd в ключ
+    /vitastor/config/global.
+- name: immediate_commit
+  type: string
+  default: false
+  info: |
+    Another parameter which is really important for performance.
+
+    Desktop SSDs are very fast (100000+ iops) for simple random writes
+    without cache flush. However, they are really slow (only around 1000 iops)
+    if you try to fsync() each write, that is, when you want to guarantee that
+    each change gets immediately persisted to the physical media.
+
+    Server-grade SSDs with "Advanced/Enhanced Power Loss Protection" or with
+    "Supercapacitor-based Power Loss Protection", on the other hand, are equally
+    fast with and without fsync because their cache is protected from sudden
+    power loss by a built-in supercapacitor-based "UPS".
+
+    Some software-defined storage systems always fsync each write and thus are
+    really slow when used with desktop SSDs. Vitastor, however, can also
+    efficiently utilize desktop SSDs by postponing fsync until the client calls
+    it explicitly.
+
+    This is what this parameter regulates. When it's set to "all" the whole
+    Vitastor cluster commits each change to disks immediately and clients just
+    ignore fsyncs because they know for sure that they're unneeded. This reduces
+    the amount of network roundtrips performed by clients and improves
+    performance. So it's always better to use server-grade SSDs with
+    supercapacitors even with Vitastor, especially given that they cost only
+    a bit more than desktop models.
+
+    There is also a common SATA SSD (and HDD too!) firmware bug (or feature)
+    that makes server SSDs which have supercapacitors slow with fsync. To check
+    if your SSDs are affected, compare benchmark results from `fio -name=test
+    -ioengine=libaio -direct=1 -bs=4k -rw=randwrite -iodepth=1` with and without
+    `-fsync=1`. Results should be the same. If the `-fsync=1` result is worse,
+    you can try to work around this bug by "disabling" the drive write-back cache:
+    `hdparm -W 0 /dev/sdXX` or `echo write through > /sys/block/sdXX/device/scsi_disk/*/cache_type`
+    (IMPORTANT: don't confuse it with `/sys/block/sdXX/queue/write_cache`, which
+    is unsafe to change by hand). The same may apply to newer HDDs with internal
+    SSD cache or "media-cache" - for example, a lot of Seagate EXOS drives have
+    it (they have internal SSD cache even though it's not stated in datasheets).
+
+    This parameter must be set both in etcd in /vitastor/config/global and in
+    OSD command line or configuration. Setting it to "all" or "small" requires
+    enabling disable_journal_fsync and disable_meta_fsync, setting it to "all"
+    also requires enabling disable_data_fsync.
+
+    TLDR: For optimal performance, set immediate_commit to "all" if you only use
+    SSDs with supercapacitor-based power loss protection (nonvolatile
+    write-through cache) for both data and journals in the whole Vitastor
+    cluster. Set it to "small" if you only use such SSDs for journals. Leave
+    empty if your drives have write-back cache.
+  info_ru: |
+    Ещё один важный для производительности параметр.
+
+    Модели SSD для настольных компьютеров очень быстрые (100000+ операций в
+    секунду) при простой случайной записи без сбросов кэша. Однако они очень
+    медленные (всего порядка 1000 iops), если вы пытаетесь сбрасывать кэш после
+    каждой записи, то есть, если вы пытаетесь гарантировать, что каждое
+    изменение физически записывается в энергонезависимую память.
+
+    С другой стороны, серверные SSD с конденсаторами - функцией, называемой
+    "Advanced/Enhanced Power Loss Protection" или просто "Supercapacitor-based
+    Power Loss Protection" - одинаково быстрые и со сбросом кэша, и без
+    него, потому что их кэш защищён от потери питания встроенным "источником
+    бесперебойного питания" на основе суперконденсаторов и на самом деле они
+    его никогда не сбрасывают.
+
+    Некоторые программные СХД всегда сбрасывают кэши дисков при каждой записи
+    и поэтому работают очень медленно с настольными SSD. Vitastor, однако, может
+    откладывать fsync до явного его вызова со стороны клиента и таким образом
+    эффективно утилизировать настольные SSD.
+
+    Данный параметр влияет как раз на это. Когда он установлен в значение "all",
+    весь кластер Vitastor мгновенно фиксирует каждое изменение на физические
+    носители и клиенты могут просто игнорировать запросы fsync, т.к. они точно
+    знают, что fsync-и не нужны. Это уменьшает число необходимых обращений к OSD
+    по сети и улучшает производительность. Поэтому даже с Vitastor лучше всегда
+    использовать только серверные модели SSD с суперконденсаторами, особенно
+    учитывая то, что стоят они ненамного дороже настольных.
+
+    Также в прошивках SATA SSD (и даже HDD!) очень часто встречается либо баг,
+    либо просто особенность логики, из-за которой серверные SSD, имеющие
+    конденсаторы и защиту от потери питания, всё равно медленно работают с
+    fsync. Чтобы понять, подвержены ли этой проблеме ваши SSD, сравните
+    результаты тестов `fio -name=test -ioengine=libaio -direct=1 -bs=4k
+    -rw=randwrite -iodepth=1` без и с опцией `-fsync=1`. Результаты должны
+    быть одинаковые. Если результат с `fsync=1` хуже, вы можете попробовать
+    обойти проблему, "отключив" кэш записи диска командой `hdparm -W 0 /dev/sdXX`
+    либо `echo write through > /sys/block/sdXX/device/scsi_disk/*/cache_type`
+    (ВАЖНО: не перепутайте с `/sys/block/sdXX/queue/write_cache` - этот параметр
+    менять руками небезопасно). Такая же проблема может встречаться и в новых
+    HDD-дисках с внутренним SSD или "медиа" кэшем - например, она встречается во
+    многих дисках Seagate EXOS (у них есть внутренний SSD-кэш, хотя это и не
+    указано в спецификациях).
+
+    Данный параметр нужно указывать и в etcd в /vitastor/config/global, и в
+    командной строке или конфигурации OSD. Значения "all" и "small" требуют
+    включения disable_journal_fsync и disable_meta_fsync, значение "all" также
+    требует включения disable_data_fsync.
+
+    Итого, вкратце: для оптимальной производительности установите
+    immediate_commit в значение "all", если вы используете в кластере только SSD
+    с суперконденсаторами и для данных, и для журналов. Если вы используете
+    такие SSD для всех журналов, но не для данных - можете установить параметр
+    в "small". Если и какие-то из дисков журналов имеют волатильный кэш записи -
+    оставьте параметр пустым.
+- name: client_dirty_limit
+  type: int
+  default: 33554432
+  info: |
+    Without immediate_commit=all this parameter sets the limit of "dirty"
+    (not committed by fsync) data allowed by the client before forcing an
+    additional fsync and committing the data. Also note that the client always
+    holds a copy of uncommitted data in memory so this setting also affects
+    RAM usage of clients.
+
+    This parameter doesn't affect OSDs themselves.
+  info_ru: |
+    При работе без immediate_commit=all - это лимит объёма "грязных" (не
+    зафиксированных fsync-ом) данных, при достижении которого клиент будет
+    принудительно вызывать fsync и фиксировать данные. Также стоит иметь в виду,
+    что в этом случае до момента fsync клиент хранит копию незафиксированных
+    данных в памяти, то есть, настройка влияет на потребление памяти клиентами.
+
+    Параметр не влияет на сами OSD.
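The dependency between `immediate_commit` and the `disable_*_fsync` options described above can be sketched as a small consistency check. This is only an illustration under stated assumptions: the helper name `missing_fsync_flags` and the plain-dict configuration are hypothetical and are not part of Vitastor.

```python
def missing_fsync_flags(config):
    """Return the disable_*_fsync options that the chosen immediate_commit
    mode requires but that are not enabled in `config` (a plain dict).

    Hypothetical helper for illustration only, not a Vitastor API.
    """
    mode = config.get("immediate_commit")
    required = []
    # "all" and "small" both require journal and metadata fsyncs to be disabled
    if mode in ("all", "small"):
        required += ["disable_journal_fsync", "disable_meta_fsync"]
    # "all" additionally requires data fsyncs to be disabled
    if mode == "all":
        required.append("disable_data_fsync")
    return [name for name in required if not config.get(name)]
```

With `immediate_commit` set to "small", for instance, the check only looks at the journal and metadata options, which mirrors the rule that data fsyncs may stay enabled in that mode.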
--- a/docs/params/layout-osd.yml
+++ b/docs/params/layout-osd.yml
@@ -0,0 +1,205 @@
+- name: data_device
+  type: string
+  info: |
+    Path to the block device to use for data. It's highly recommended to use
+    stable paths for all device names: `/dev/disk/by-partuuid/xxx...` instead
+    of just `/dev/sda` or `/dev/nvme0n1`, so paths don't change after a restart.
+    Files can also be used instead of block devices, but this is implemented
+    only for testing purposes and not for production.
+  info_ru: |
+    Путь к диску (блочному устройству) для хранения данных. Крайне рекомендуется
+    использовать стабильные пути: `/dev/disk/by-partuuid/xxx...` вместо простых
+    `/dev/sda` или `/dev/nvme0n1`, чтобы пути не могли спутаться после
+    перезагрузки сервера. Также вместо блочных устройств можно указывать файлы,
+    но это реализовано только для тестирования, а не для боевой среды.
+- name: meta_device
+  type: string
+  info: |
+    Path to the block device to use for the metadata. Metadata must be on a fast
+    SSD or performance will suffer. If this option is skipped, `data_device` is
+    used for the metadata.
+  info_ru: |
+    Путь к диску метаданных. Метаданные должны располагаться на быстром
+    SSD-диске, иначе производительность пострадает. Если эта опция не указана,
+    для метаданных используется `data_device`.
+- name: journal_device
+  type: string
+  info: |
+    Path to the block device to use for the journal. Journal must be on a fast
+    SSD or performance will suffer. If this option is skipped, `meta_device` is
+    used for the journal, and if it's also empty, journal is put on
+    `data_device`. It's almost always fine to put metadata and journal on the
+    same device, in this case you only need to set `meta_device`.
+  info_ru: |
+    Путь к диску журнала. Журнал должен располагаться на быстром SSD-диске,
+    иначе производительность пострадает. Если эта опция не указана,
+    для журнала используется `meta_device`, если же пуста и она, журнал
+    располагается на `data_device`. Нормально располагать журнал и метаданные
+    на одном устройстве, в этом случае достаточно указать только `meta_device`.
+- name: journal_offset
+  type: int
+  default: 0
+  info: Offset on the device in bytes where the journal is stored.
+  info_ru: Смещение на устройстве в байтах, по которому располагается журнал.
+- name: journal_size
+  type: int
+  info: |
+    Journal size in bytes. Doesn't have to be large, 16-32 MB is usually fine.
+    By default, the whole journal device will be used for the journal. You must
+    set it to some value manually (or use make-osd.sh) if you colocate the
+    journal with data or metadata.
+  info_ru: |
+    Размер журнала в байтах. Большим быть не обязан, 16-32 МБ обычно достаточно.
+    По умолчанию для журнала используется всё устройство журнала. Если же вы
+    размещаете журнал на устройстве данных или метаданных, то вы должны
+    установить эту опцию в какое-то значение сами (или использовать скрипт
+    make-osd.sh).
+- name: meta_offset
+  type: int
+  default: 0
+  info: |
+    Offset on the device in bytes where the metadata area is stored.
+    Set it explicitly if you colocate metadata with journal or data.
+  info_ru: |
+    Смещение на устройстве в байтах, по которому располагаются метаданные.
+    Эту опцию нужно задать, если метаданные у вас хранятся на том же
+    устройстве, что данные или журнал.
+- name: data_offset
+  type: int
+  default: 0
+  info: |
+    Offset on the device in bytes where the data area is stored.
+    Set it explicitly if you colocate data with journal or metadata.
+  info_ru: |
+    Смещение на устройстве в байтах, по которому располагаются данные.
+    Эту опцию нужно задать, если данные у вас хранятся на том же
+    устройстве, что метаданные или журнал.
+- name: data_size
+  type: int
+  info: |
+    Data area size in bytes. By default, the whole data device up to the end
+    will be used for the data area, but you can restrict it if you want to use
+    a smaller part. Note that there is no option to set metadata area size -
+    it's derived from the data area size.
+  info_ru: |
+    Размер области данных в байтах. По умолчанию под данные будет использована
+    вся доступная область устройства данных до конца устройства, но вы можете
+    использовать эту опцию, чтобы ограничить её меньшим размером. Заметьте, что
+    опции размера области метаданных нет - она вычисляется из размера области
+    данных автоматически.
+- name: meta_block_size
+  type: int
+  default: 4096
+  info: |
+    Physical block size of the metadata device. 4096 for most current
+    HDDs and SSDs.
+  info_ru: |
+    Размер физического блока устройства метаданных. 4096 для большинства
+    современных SSD и HDD.
+- name: journal_block_size
+  type: int
+  default: 4096
+  info: |
+    Physical block size of the journal device. Must be a multiple of
+    `disk_alignment`. 4096 for most current HDDs and SSDs.
+  info_ru: |
+    Размер физического блока устройства журнала. Должен быть кратен
+    `disk_alignment`. 4096 для большинства современных SSD и HDD.
+- name: disable_data_fsync
+  type: bool
+  default: false
+  info: |
+    Do not issue fsyncs to the data device, i.e. do not flush its cache.
+    Safe ONLY if your data device has write-through cache. If you disable
+    the cache yourself using `hdparm` or `scsi_disk/cache_type` then make sure
+    that the cache disable command is run every time before starting Vitastor
+    OSD, for example, in the systemd unit. See also `immediate_commit` option
+    for the instructions to disable cache and how to benefit from it.
+  info_ru: |
+    Не отправлять fsync-и устройству данных, т.е. не сбрасывать его кэш.
+    Безопасно, ТОЛЬКО если ваше устройство данных имеет кэш со сквозной
+    записью (write-through). Если вы отключаете кэш через `hdparm` или
+    `scsi_disk/cache_type`, то удостоверьтесь, что команда отключения кэша
+    выполняется перед каждым запуском Vitastor OSD, например, в systemd unit-е.
+    Смотрите также опцию `immediate_commit` для инструкций по отключению кэша
+    и о том, как из этого извлечь выгоду.
+- name: disable_meta_fsync
+  type: bool
+  default: false
+  info: |
+    Same as disable_data_fsync, but for the metadata device. If the metadata
+    device is not set or if the data device is used for the metadata the option
+    is ignored and disable_data_fsync value is used instead of it.
+  info_ru: |
+    То же, что disable_data_fsync, но для устройства метаданных. Если устройство
+    метаданных не задано или если оно равно устройству данных, значение опции
+    игнорируется и вместо него используется значение опции disable_data_fsync.
+- name: disable_journal_fsync
+  type: bool
+  default: false
+  info: |
+    Same as disable_data_fsync, but for the journal device. If the journal
+    device is not set or if the metadata device is used for the journal the
+    option is ignored and disable_meta_fsync value is used instead of it. If
+    the same device is used for data, metadata and journal the option is also
+    ignored and disable_data_fsync value is used instead of it.
+  info_ru: |
+    То же, что disable_data_fsync, но для устройства журнала. Если устройство
+    журнала не задано или если оно равно устройству метаданных, значение опции
+    игнорируется и вместо него используется значение опции disable_meta_fsync.
+    Если одно и то же устройство используется и под данные, и под журнал, и под
+    метаданные - значение опции также игнорируется и вместо него используется
+    значение опции disable_data_fsync.
+- name: disable_device_lock
+  type: bool
+  default: false
+  info: |
+    Do not lock data, metadata and journal block devices exclusively with
+    flock(). It's not recommended, but you can use this option if you want to
+    run multiple OSDs on a single device with different offsets, without using
+    partitions.
+  info_ru: |
+    Не блокировать устройства данных, метаданных и журнала от открытия их
+    другими OSD с помощью flock(). Так делать не рекомендуется, но теоретически
+    вы можете это использовать, чтобы запускать несколько OSD на одном
+    устройстве с разными смещениями и без использования разделов.
+- name: disk_alignment
+  type: int
+  default: 4096
+  info: |
+    Required physical disk write alignment. Most current SSD and HDD drives
+    use 4 KB physical sectors even if they report 512 byte logical sector
+    size, so 4 KB is a good default setting.
+
+    Note, however, that physical sector size also affects WA, because with block
+    devices it's impossible to write anything smaller than a block. So, when
+    Vitastor has to write a single metadata entry that's only about 32 bytes in
+    size, it actually has to write the whole 4 KB sector.
+
+    Because of this it can actually be beneficial to use SSDs which work well
+    with 512 byte sectors and use 512 byte disk_alignment, journal_block_size
+    and meta_block_size. But the only SSD that may fit into this category is
+    Intel Optane (probably, not tested yet).
+
+    Clients don't need to be aware of disk_alignment, so it's not required to
+    put a modified value into etcd key /vitastor/config/global.
+  info_ru: |
+    Требуемое выравнивание записи на физические диски. Почти все современные
+    SSD и HDD диски используют 4 КБ физические секторы, даже если показывают
+    логический размер сектора 512 байт, поэтому 4 КБ - хорошее значение по
+    умолчанию.
+
+    Однако стоит понимать, что физический размер сектора тоже влияет на
+    избыточную запись (WA), потому что ничего меньше блока (сектора) на блочное
+    устройство записать невозможно. Таким образом, когда Vitastor-у нужно
+    записать на диск всего лишь одну 32-байтную запись метаданных, фактически
+    приходится перезаписывать 4 КБ сектор целиком.
+
+    Поэтому, на самом деле, может быть выгодно найти SSD, хорошо работающие с
+    меньшими, 512-байтными, блоками и использовать 512-байтные disk_alignment,
+    journal_block_size и meta_block_size. Однако единственные SSD, которые
+    теоретически могут попасть в эту категорию - это Intel Optane (но и это
+    пока не проверялось автором).
+
+    Клиентам не обязательно знать про disk_alignment, так что помещать значение
+    этого параметра в etcd в /vitastor/config/global не нужно.
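The write amplification argument in the `disk_alignment` description can be made concrete with a little arithmetic. The sketch below assumes the roughly 32-byte metadata entry mentioned in the text and simply rounds each write up to whole aligned blocks; the function name is hypothetical, and real on-disk behaviour depends on how entries are batched.

```python
def metadata_write_amplification(entry_size, disk_alignment):
    """Ratio of bytes physically written to bytes logically changed when a
    single entry of `entry_size` bytes is written with `disk_alignment`.

    Illustrative arithmetic only, not taken from the Vitastor source.
    """
    # Round up to a whole number of aligned blocks (ceiling division)
    blocks = -(-entry_size // disk_alignment)
    return blocks * disk_alignment / entry_size


# One 32-byte entry: 4 KB alignment vs. 512-byte alignment
print(metadata_write_amplification(32, 4096))  # 128.0
print(metadata_write_amplification(32, 512))   # 16.0
```

Under these assumptions a 512-byte alignment reduces the per-entry amplification eightfold, which is the motivation the text gives for drives that handle 512-byte sectors well.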
--- a/docs/params/monitor.yml
+++ b/docs/params/monitor.yml
@@ -0,0 +1,65 @@
+- name: etcd_mon_ttl
+  type: sec
+  min: 10
+  default: 30
+  info: Monitor etcd lease refresh interval in seconds
+  info_ru: Интервал обновления etcd резервации (lease) монитором
+- name: etcd_mon_timeout
+  type: ms
+  default: 1000
+  info: etcd request timeout used by monitor
+  info_ru: Таймаут выполнения запросов к etcd от монитора
+- name: etcd_mon_retries
+  type: int
+  default: 5
+  info: Maximum number of attempts for one monitor etcd request
+  info_ru: Максимальное число попыток выполнения запросов к etcd монитором
+- name: mon_change_timeout
+  type: ms
+  min: 100
+  default: 1000
+  info: Optimistic retry interval for monitor etcd modification requests
+  info_ru: Время повтора при коллизиях при запросах модификации в etcd, производимых монитором
+- name: mon_stats_timeout
+  type: ms
+  min: 100
+  default: 1000
+  info: |
+    Interval for monitor to wait before updating aggregated statistics in
+    etcd after receiving OSD statistics updates
+  info_ru: |
+    Интервал, который монитор ожидает при изменении статистики по отдельным
+    OSD перед обновлением агрегированной статистики в etcd
+- name: osd_out_time
+  type: sec
+  default: 600
+  info: |
+    Time after which a failed OSD is removed from the data distribution.
+    I.e. time which the monitor waits before attempting to restore data
+    redundancy using other OSDs.
+  info_ru: |
+    Время, через которое отключенный OSD исключается из распределения данных.
+    То есть, время, которое монитор ожидает перед попыткой переместить данные
+    на другие OSD и таким образом восстановить избыточность хранения.
+- name: placement_levels
+  type: json
+  default: '`{"host":100,"osd":101}`'
+  info: |
+    Levels for the placement tree. You can define arbitrary tree levels by
+    defining them in this parameter. The configuration parameter value should
+    contain a JSON object with level names as keys and integer priorities as
+    values. A smaller priority means a higher level in the tree. For example,
+    "datacenter" should have smaller priority than "osd". "host" and "osd"
+    levels are always predefined and can't be removed. If one of them is not
+    present in the configuration, then it is defined with the default priority
+    (100 for "host", 101 for "osd").
+  info_ru: |
+    Определения уровней для дерева размещения OSD. Вы можете определять
+    произвольные уровни, помещая их в данный параметр конфигурации. Значение
+    параметра должно содержать JSON-объект, ключи которого будут являться
+    названиями уровней, а значения - целочисленными приоритетами. Меньшие
+    приоритеты соответствуют верхним уровням дерева. Например, уровень
+    "датацентр" должен иметь меньший приоритет, чем "OSD". Уровни с названиями
+    "host" и "osd" являются предопределёнными и не могут быть удалены. Если
+    один из них отсутствует в конфигурации, он доопределяется с приоритетом по
+    умолчанию (100 для уровня "host", 101 для "osd").
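The `placement_levels` rules (predefined "host" and "osd" levels, smaller priority meaning a higher tree level) can be sketched as follows. This is a hedged illustration in Python, not the monitor's actual code; the function name `effective_placement_levels` and the sample level names "datacenter" and "rack" are assumptions.

```python
import json

# Defaults stated in the parameter description
DEFAULT_LEVELS = {"host": 100, "osd": 101}


def effective_placement_levels(raw_json):
    """Merge a user-supplied placement_levels JSON object with the predefined
    levels and return level names ordered from the top of the tree down.

    Hypothetical helper for illustration only.
    """
    levels = dict(DEFAULT_LEVELS)
    # User-defined levels are added; "host"/"osd" keep defaults if absent
    levels.update(json.loads(raw_json) if raw_json else {})
    # Smaller priority = higher level in the tree
    return sorted(levels, key=lambda name: levels[name])


print(effective_placement_levels('{"datacenter": 1, "rack": 50}'))
# ['datacenter', 'rack', 'host', 'osd']
```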
--- a/docs/params/network.yml
+++ b/docs/params/network.yml
@@ -0,0 +1,225 @@
+- name: tcp_header_buffer_size
+  type: int
+  default: 65536
+  info: |
+    Size of the buffer used to read data using an additional copy. Vitastor
+    packet headers are 128 bytes, payload is always at least 4 KB, so it is
+    usually beneficial to try to read multiple packets at once even though
+    it requires copying the data an additional time. The rest of each packet
+    is received without an additional copy. You can try to play with this
+    parameter and see how it affects random iops and linear bandwidth if you
+    want.
+  info_ru: |
+    Размер буфера для чтения данных с дополнительным копированием. Пакеты
+    Vitastor содержат 128-байтные заголовки, за которыми следуют данные размером
+    от 4 КБ и для мелких операций ввода-вывода обычно выгодно за 1 вызов читать
+    сразу несколько пакетов, даже несмотря на то, что это требует лишний раз
+    скопировать данные. Часть каждого пакета за пределами значения данного
+    параметра читается без дополнительного копирования. Вы можете попробовать
+    поменять этот параметр и посмотреть, как он влияет на производительность
+    случайного и линейного доступа.
+- name: use_sync_send_recv
+  type: bool
+  default: false
+  info: |
+    If true, synchronous send/recv syscalls are used instead of io_uring for
+    socket communication. Useless for OSDs because they require io_uring anyway,
+    but may be required for clients with old kernel versions.
+  info_ru: |
+    Если установлено в истину, то вместо io_uring для передачи данных по сети
+    будут использоваться обычные синхронные системные вызовы send/recv. Для OSD
+    это бессмысленно, так как OSD в любом случае нуждается в io_uring, но, в
+    принципе, это может применяться для клиентов со старыми версиями ядра.
+- name: use_rdma
+  type: bool
+  default: true
+  info: |
+    Try to use RDMA for communication if it's available. Disable if you don't
+    want Vitastor to use RDMA. RDMA increases the performance, but TCP-only
+    clients can still talk to an RDMA-enabled cluster, so you don't need to
+    make sure that all clients support RDMA when enabling it.
+  info_ru: |
+    Пытаться использовать RDMA для связи при наличии доступных устройств.
+    Отключите, если вы не хотите, чтобы Vitastor использовал RDMA.
+    RDMA улучшает производительность, но клиенты, работающие только по TCP,
+    всё равно могут подключаться к кластеру с включённым RDMA, поэтому при его
+    включении не обязательно добиваться поддержки RDMA всеми клиентами.
+- name: rdma_device
+  type: string
+  info: |
+    RDMA device name to use for Vitastor OSD communications (for example,
+    "rocep5s0f0"). Please note that Vitastor RDMA requires Implicit On-Demand
+    Paging (Implicit ODP) and Scatter/Gather (SG) support from the RDMA device
+    to work. For example, Mellanox ConnectX-3 and older adapters don't have
+    Implicit ODP, so they're unsupported by Vitastor. Run `ibv_devinfo -v` as
+    root to list available RDMA devices and their features.
+  info_ru: |
+    Название RDMA-устройства для связи с Vitastor OSD (например, "rocep5s0f0").
+    Имейте в виду, что поддержка RDMA в Vitastor требует функций устройства
+    Implicit On-Demand Paging (Implicit ODP) и Scatter/Gather (SG). Например,
+    адаптеры Mellanox ConnectX-3 и более старые не поддерживают Implicit ODP и
+    потому не поддерживаются в Vitastor. Запустите `ibv_devinfo -v` от имени
+    суперпользователя, чтобы посмотреть список доступных RDMA-устройств, их
+    параметры и возможности.
+- name: rdma_port_num
+  type: int
+  default: 1
+  info: |
+    RDMA device port number to use. Only for devices that have more than 1 port.
+    See `phys_port_cnt` in `ibv_devinfo -v` output to determine how many ports
+    your device has.
+  info_ru: |
+    Номер порта RDMA-устройства, который следует использовать. Имеет смысл
+    только для устройств, у которых более 1 порта. Чтобы узнать, сколько портов
+    у вашего адаптера, посмотрите `phys_port_cnt` в выводе команды
+    `ibv_devinfo -v`.
+- name: rdma_gid_index
+  type: int
+  default: 0
+  info: |
+    Global address identifier index of the RDMA device to use. Different GID
+    indexes may correspond to different protocols like RoCEv1, RoCEv2 and iWARP.
+    Search for "GID" in `ibv_devinfo -v` output to determine which GID index
+    you need.
+
+    **IMPORTANT:** If you want to use RoCEv2 (as recommended) then the correct
+    rdma_gid_index is usually 1 (IPv6) or 3 (IPv4).
+  info_ru: |
+    Номер глобального идентификатора адреса RDMA-устройства, который следует
+    использовать. Разным gid_index могут соответствовать разные протоколы связи:
+    RoCEv1, RoCEv2, iWARP. Чтобы понять, какой нужен вам - смотрите строчки со
+    словом "GID" в выводе команды `ibv_devinfo -v`.
+
+    **ВАЖНО:** Если вы хотите использовать RoCEv2 (как мы и рекомендуем), то
+    правильный rdma_gid_index, как правило, 1 (IPv6) или 3 (IPv4).
+- name: rdma_mtu
+  type: int
+  default: 4096
+  info: |
+    RDMA Path MTU to use. Must be 1024, 2048 or 4096. There is usually no
+    reason to change it from the default 4096.
+  info_ru: |
+    Максимальная единица передачи (Path MTU) для RDMA. Должно быть равно 1024,
+    2048 или 4096. Обычно нет смысла менять значение по умолчанию, равное 4096.
+- name: rdma_max_sge
+  type: int
+  default: 128
+  info: |
+    Maximum number of scatter/gather entries to use for RDMA. OSDs negotiate
+    the actual value when establishing connection anyway, so it's usually not
+    required to change this parameter.
+  info_ru: |
+    Максимальное число записей разделения/сборки (scatter/gather) для RDMA.
+    OSD в любом случае согласовывают реальное значение при установке соединения,
+    так что менять этот параметр обычно не нужно.
+- name: rdma_max_msg
+  type: int
+  default: 1048576
+  info: Maximum size of a single RDMA send or receive operation in bytes.
+  info_ru: Максимальный размер одной RDMA-операции отправки или приёма.
+- name: rdma_max_recv
+  type: int
+  default: 8
+  info: |
+    Maximum number of parallel RDMA receive operations. Note that this number
+    of receive buffers `rdma_max_msg` in size are allocated for each client,
+    so this setting actually affects memory usage. This is because RDMA receive
+    operations are (sadly) still not zero-copy in Vitastor. It may be fixed in
+    later versions.
+  info_ru: |
+    Максимальное число параллельных RDMA-операций получения данных. Следует
+    иметь в виду, что данное число буферов размером `rdma_max_msg` выделяется
+    для каждого подключённого клиентского соединения, так что данная настройка
+    влияет на потребление памяти. Это так потому, что RDMA-приём данных в
+    Vitastor, увы, всё равно не является zero-copy, т.е. всё равно 1 раз
+    копирует данные в памяти. Данная особенность, возможно, будет исправлена в
+    более новых версиях Vitastor.
+- name: peer_connect_interval
+  type: sec
+  min: 1
+  default: 5
+  info: Interval before attempting to reconnect to an unavailable OSD.
+  info_ru: Время ожидания перед повторной попыткой соединиться с недоступным OSD.
+- name: peer_connect_timeout
+  type: sec
+  min: 1
+  default: 5
+  info: Timeout for OSD connection attempts.
+  info_ru: Максимальное время ожидания попытки соединения с OSD.
+- name: osd_idle_timeout
+  type: sec
+  min: 1
+  default: 5
+  info: |
+    OSD connection inactivity time after which clients and other OSDs send
+    keepalive requests to check state of the connection.
+  info_ru: |
+    Время неактивности соединения с OSD, после которого клиенты или другие OSD
+    посылают запрос проверки состояния соединения.
+- name: osd_ping_timeout
+  type: sec
+  min: 1
+  default: 5
+  info: |
+    Maximum time to wait for OSD keepalive responses. If an OSD doesn't respond
+    within this time, the connection to it is dropped and a reconnection attempt
+    is scheduled.
+  info_ru: |
+    Максимальное время ожидания ответа на запрос проверки состояния соединения.
+    Если OSD не отвечает за это время, соединение отключается и производится
+    повторная попытка соединения.
+- name: up_wait_retry_interval
+  type: ms
+  min: 50
+  default: 500
+  info: |
+    OSDs respond to clients with a special error code when they receive I/O
+    requests for a PG that's not synchronized and started. This parameter sets
+    the time for the clients to wait before re-attempting such I/O requests.
+  info_ru: |
+    Когда OSD получают от клиентов запросы ввода-вывода, относящиеся к не
+    поднятым на данный момент на них PG, либо к PG в процессе синхронизации,
+    они отвечают клиентам специальным кодом ошибки, означающим, что клиент
+    должен некоторое время подождать перед повторением запроса. Именно это время
+    ожидания задаёт данный параметр.
+- name: max_etcd_attempts
+  type: int
+  default: 5
+  info: |
+    Maximum number of attempts for etcd requests which can't be retried
+    indefinitely.
+  info_ru: |
+    Максимальное число попыток выполнения запросов к etcd для тех запросов,
+    которые нельзя повторять бесконечно.
+- name: etcd_quick_timeout
+  type: ms
+  default: 1000
+  info: |
+    Timeout for etcd requests which should complete quickly, like lease refresh.
+  info_ru: |
+    Максимальное время выполнения запросов к etcd, которые должны завершаться
+    быстро, таких, как обновление резервации (lease).
+- name: etcd_slow_timeout
+  type: ms
+  default: 5000
+  info: Timeout for etcd requests which are allowed to wait for some time.
+  info_ru: |
+    Максимальное время выполнения запросов к etcd, для которых не обязательно
+    гарантировать быстрое выполнение.
+- name: etcd_keepalive_timeout
+  type: sec
+  default: max(30, etcd_report_interval*2)
+  info: |
+    Timeout for etcd connection HTTP Keep-Alive. Should be higher than
+    etcd_report_interval to guarantee that keepalive actually works.
+  info_ru: |
+    Таймаут для HTTP Keep-Alive в соединениях к etcd. Должен быть больше, чем
+    etcd_report_interval, чтобы keepalive гарантированно работал.
+- name: etcd_ws_keepalive_timeout
+  type: sec
+  default: 30
+  info: |
+    etcd websocket ping interval required to keep the connection alive and
+    detect disconnections quickly.
+  info_ru: |
+    Интервал проверки живости вебсокет-подключений к etcd.
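Two of the network parameters above carry simple arithmetic that may be easier to see in code: the memory reserved for RDMA receive buffers (`rdma_max_recv` buffers of `rdma_max_msg` bytes per client) and the documented default of `etcd_keepalive_timeout`. The sketch below uses the defaults listed in these files; the function names are hypothetical, and the real memory footprint may include overhead not modelled here.

```python
def rdma_recv_buffer_bytes(clients, rdma_max_recv=8, rdma_max_msg=1048576):
    """Approximate memory allocated for RDMA receive buffers, assuming
    rdma_max_recv buffers of rdma_max_msg bytes for each connected client."""
    return clients * rdma_max_recv * rdma_max_msg


def default_etcd_keepalive_timeout(etcd_report_interval=5):
    """Documented default: max(30, etcd_report_interval*2), in seconds."""
    return max(30, etcd_report_interval * 2)


# 100 clients with default settings: 800 MiB of receive buffers
print(rdma_recv_buffer_bytes(100) // (1024 * 1024))  # 800
print(default_etcd_keepalive_timeout())               # 30
print(default_etcd_keepalive_timeout(20))             # 40
```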
--- a/docs/params/osd.yml
+++ b/docs/params/osd.yml
@@ -0,0 +1,345 @@
+- name: etcd_report_interval
+  type: sec
+  default: 5
+  info: |
+    Interval at which OSDs report their state to etcd. Affects OSD lease time
+    and thus the failover speed. Lease time is equal to this parameter value
+    plus max_etcd_attempts * etcd_quick_timeout because it should be guaranteed
+    that every OSD always refreshes its lease in time.
+  info_ru: |
+    Интервал, с которым OSD обновляет своё состояние в etcd. Значение параметра
+    влияет на время резервации (lease) OSD и поэтому на скорость переключения
+    при падении OSD. Время lease равняется значению этого параметра плюс
+    max_etcd_attempts * etcd_quick_timeout.
+- name: run_primary
+  type: bool
+  default: true
+  info: |
+    Start primary OSD logic on this OSD. As of now, can be turned off only for
+    debugging purposes. It's possible to implement an additional monitor
+    feature that would allow separating primary and secondary OSDs, but it's
+    unclear why anyone would need it, so it's not implemented.
+  info_ru: |
+    Запускать логику первичного OSD на данном OSD. На данный момент отключать
+    эту опцию может иметь смысл только в целях отладки. В теории, можно
+    реализовать дополнительный режим для монитора, который позволит отделять
+    первичные OSD от вторичных, но пока не понятно, зачем это может кому-то
+    понадобиться, поэтому это не реализовано.
+- name: osd_network
+  type: string or array of strings
+  type_ru: строка или массив строк
+  info: |
+    Network mask of the network (IPv4 or IPv6) to use for OSDs. Note that
+    although it's possible to specify multiple networks here, this does not
+    mean that OSDs will create multiple listening sockets - they'll only
+    pick the first matching address of an UP + RUNNING interface. Separate
+    networks for cluster and client connections are also not implemented, but
+    they are mostly useless anyway, so it's not a big deal.
+  info_ru: |
+    Маска подсети (IPv4 или IPv6) для использования для соединений с OSD.
+    Имейте в виду, что хотя сейчас и можно передать в этот параметр несколько
+    подсетей, это не означает, что OSD будут создавать несколько слушающих
+    сокетов - они лишь будут выбирать адрес первого поднятого (состояние UP +
+    RUNNING), подходящий под заданную маску. Также не реализовано разделение
+    кластерной и публичной сетей OSD. Правда, от него обычно всё равно довольно
+    мало толку, так что особенной проблемы в этом нет.
+- name: bind_address
+  type: string
+  default: "0.0.0.0"
+  info: |
+    Instead of the network mask, you can also set OSD listen address explicitly
+    using this parameter. May be useful if you want to start OSDs on interfaces
+    that are not UP + RUNNING.
+  info_ru: |
+    Этим параметром можно явным образом задать адрес, на котором будет ожидать
+    соединений OSD (вместо использования маски подсети). Может быть полезно,
+    например, чтобы запускать OSD на неподнятых интерфейсах (не UP + RUNNING).
+- name: bind_port
+  type: int
+  info: |
+    By default, OSDs pick random ports to use for incoming connections
+    automatically. With this option you can set a specific port for a specific
+    OSD by hand.
+  info_ru: |
+    По умолчанию OSD сами выбирают случайные порты для входящих подключений.
+    С помощью данной опции вы можете задать порт для отдельного OSD вручную.
+- name: autosync_interval
+  type: sec
+  default: 5
+  info: |
+    Time interval at which automatic fsyncs/flushes are issued by each OSD when
+    the immediate_commit mode is disabled. fsyncs are required because without
+    them OSDs quickly fill their journals, become unable to clear them and
+    stall. This option also limits the amount of recent uncommitted changes
+    that OSDs may lose in a power outage if clients don't issue fsyncs
+    at all.
+  info_ru: |
+    Временной интервал отправки автоматических fsync-ов (операций очистки кэша)
+    каждым OSD для случая, когда режим immediate_commit отключён. fsync-и нужны
+    OSD, чтобы успевать очищать журнал - без них OSD быстро заполняют журналы и
+    перестают обрабатывать операции записи. Также эта опция ограничивает объём
+    недавних незафиксированных изменений, которые OSD могут терять при
+    отключении питания, если клиенты вообще не отправляют fsync.
+- name: autosync_writes
+  type: int
+  default: 128
+  info: |
+    Same as autosync_interval, but sets the maximum number of uncommitted write
+    operations before issuing an fsync operation internally.
+  info_ru: |
+    Аналогично autosync_interval, но задаёт не временной интервал, а
+    максимальное количество незафиксированных операций записи перед
+    принудительной отправкой fsync-а.
+- name: recovery_queue_depth
+  type: int
+  default: 4
+  info: |
+    Maximum number of recovery operations per primary OSD at any given moment.
+    Currently it's the only parameter available to tune the speed of recovery
+    and rebalancing, but more are planned.
+  info_ru: |
+    Максимальное число операций восстановления на одном первичном OSD в любой
+    момент времени. На данный момент единственный параметр, который можно менять
+    для ускорения или замедления восстановления и перебалансировки данных, но
+    в планах реализация других параметров.
+- name: recovery_sync_batch
+  type: int
+  default: 16
+  info: Maximum number of recovery operations before issuing an additional fsync.
+  info_ru: Максимальное число операций восстановления перед дополнительным fsync.
+- name: readonly
+  type: bool
+  default: false
+  info: |
+    Read-only mode. If this is enabled, an OSD will never issue any writes to
+    the underlying device. This may be useful for recovery purposes.
+  info_ru: |
+    Режим "только чтение". Если включить этот режим, OSD не будет писать ничего
+    на диск. Может быть полезно в целях восстановления.
+- name: no_recovery
+  type: bool
+  default: false
+  info: |
+    Disable automatic background recovery of objects. Note that it doesn't
+    affect implicit recovery of objects happening during writes - a write is
+    always made to a full set of at least pg_minsize OSDs.
+  info_ru: |
+    Отключить автоматическое фоновое восстановление объектов. Обратите внимание,
+    что эта опция не отключает восстановление объектов, происходящее при
+    записи - запись всегда производится в полный набор из как минимум pg_minsize
+    OSD.
+- name: no_rebalance
+  type: bool
+  default: false
+  info: |
+    Disable background movement of data between different OSDs. Disabling it
+    means that PGs in the `has_misplaced` state will be left in it indefinitely.
+  info_ru: |
+    Отключить фоновое перемещение объектов между разными OSD. Отключение
+    означает, что PG, находящиеся в состоянии `has_misplaced`, будут оставлены
+    в нём на неопределённый срок.
+- name: print_stats_interval
+  type: sec
+  default: 3
+  info: |
+    Time interval at which OSDs print simple human-readable operation
+    statistics on stdout.
+  info_ru: |
+    Временной интервал, с которым OSD печатают простую человекочитаемую
+    статистику выполнения операций в стандартный вывод.
+- name: slow_log_interval
+  type: sec
+  default: 10
+  info: |
+    Time interval at which OSDs dump slow or stuck operations on stdout, if
+    there are any. It's also the time after which an operation is considered
+    "slow".
+  info_ru: |
+    Временной интервал, с которым OSD выводят в стандартный вывод список
+    медленных или зависших операций, если таковые имеются. Также время, при
+    превышении которого операция считается "медленной".
+- name: max_write_iodepth
+  type: int
+  default: 128
+  info: |
+    Parallel client write operation limit per one OSD. Operations that exceed
+    this limit are pushed to a temporary queue instead of being executed
+    immediately.
+  info_ru: |
+    Максимальное число одновременных клиентских операций записи на один OSD.
+    Операции, превышающие этот лимит, не исполняются сразу, а сохраняются во
+    временной очереди.
+- name: min_flusher_count
+  type: int
+  default: 1
+  info: |
+    A flusher is a micro-thread that moves data from the journal to the data
+    area of the device. Their number is auto-tuned between a minimum and a
+    maximum. This parameter sets the minimum.
+  info_ru: |
+    Flusher - это микро-поток (корутина), которая копирует данные из журнала в
+    основную область устройства данных. Их число настраивается динамически между
+    минимальным и максимальным значением. Этот параметр задаёт минимальное число.
+- name: max_flusher_count
+  type: int
+  default: 256
+  info: |
+    Maximum number of journal flushers (see min_flusher_count above).
+  info_ru: |
+    Максимальное число микро-потоков очистки журнала (см. выше min_flusher_count).
+- name: inmemory_metadata
+  type: bool
+  default: true
+  info: |
+    This parameter makes Vitastor always keep the metadata area of the block
+    device in memory. It's required for good performance because it avoids
+    additional read-modify-write cycles during metadata modifications. Metadata
+    area size is currently roughly 224 MB per 1 TB of data. You can turn it off
+    to reduce memory usage by this value, but it will hurt performance. This
+    restriction is likely to be removed in the future along with the upgrade
+    of the metadata storage scheme.
+  info_ru: |
+    Данный параметр заставляет Vitastor всегда держать область метаданных диска
+    в памяти. Это нужно, чтобы избегать дополнительных операций чтения с диска
+    при записи. Размер области метаданных на данный момент составляет примерно
+    224 МБ на 1 ТБ данных. При отключении потребление памяти снизится примерно
+    на эту величину, но при этом также снизится и производительность. В будущем,
+    после обновления схемы хранения метаданных, это ограничение, скорее всего,
+    будет ликвидировано.
+- name: inmemory_journal
+  type: bool
+  default: true
+  info: |
+    This parameter makes Vitastor always keep the journal area of the block
+    device in memory. Turning it off will, again, reduce memory usage, but
+    hurt performance because flusher coroutines will have to read data back
+    from the disk before copying it into the main area. The memory usage benefit
+    is typically very small because a 16-32 MB journal is sufficient for SSD
+    OSDs. However, in theory you may want to turn it off for hybrid (HDD+SSD)
+    OSDs with large journals on fast devices.
+  info_ru: |
+    Данный параметр заставляет Vitastor всегда держать в памяти журналы OSD.
+    Отключение параметра, опять же, снижает потребление памяти, но ухудшает
+    производительность, так как для копирования данных из журнала в основную
+    область устройства OSD будут вынуждены читать их обратно с диска. Выигрыш
+    по памяти при этом обычно крайне низкий, так как для SSD OSD обычно
+    достаточно 16- или 32-мегабайтного журнала. Однако в теории отключение
+    параметра может оказаться полезным для гибридных OSD (HDD+SSD) с большими
+    журналами, расположенными на быстром по сравнению с HDD устройстве.
+- name: journal_sector_buffer_count
+  type: int
+  default: 32
+  info: |
+    Maximum number of buffers that can be used for writing journal metadata
+    blocks. The only situation when you should increase it to a larger value
+    is when you enable journal_no_same_sector_overwrites. In this case set
+    it to, for example, 1024.
+  info_ru: |
+    Максимальное число буферов, разрешённых для использования под записываемые
+    в журнал блоки метаданных. Единственная ситуация, в которой этот параметр
+    нужно менять - это если вы включаете journal_no_same_sector_overwrites. В
+    этом случае установите данный параметр, например, в 1024.
+- name: journal_no_same_sector_overwrites
+  type: bool
+  default: false
+  info: |
+    Enable this option for SSDs like Intel D3-S4510 and D3-S4610 which REALLY
+    don't like when a program overwrites the same sector multiple times in a
+    row and slow down significantly (from 25000+ iops to ~3000 iops). When
+    this option is set, Vitastor will always move to the next sector of the
+    journal after writing it instead of possibly overwriting it the second time.
+
+    Most (99%) other SSDs don't need this option.
+  info_ru: |
+    Включайте данную опцию для SSD вроде Intel D3-S4510 и D3-S4610, которые
+    ОЧЕНЬ не любят, когда ПО перезаписывает один и тот же сектор несколько раз
+    подряд. Такие SSD при многократной перезаписи одного и того же сектора
+    сильно замедляются - условно, с 25000 и более iops до 3000 iops. Когда
+    данная опция установлена, Vitastor всегда переходит к следующему сектору
+    журнала после записи вместо потенциально повторной перезаписи того же
+    самого сектора.
+
+    Почти все другие SSD (99% моделей) не требуют данной опции.
+- name: throttle_small_writes
+  type: bool
+  default: false
+  info: |
+    Enable soft throttling of small journaled writes. Useful for hybrid OSDs
+    with fast journal/metadata devices and slow data devices. The idea is that
+    small writes complete very quickly because they're first written to the
+    journal device, but moving them to the main device is slow. So if an OSD
+    allows clients to issue a lot of small writes it will perform very well
+    for several seconds and then the journal will fill up and the performance
+    will drop to almost zero. Throttling is meant to prevent this problem by
+    artificially slowing quick writes down based on the amount of free space in
+    the journal. When throttling is used, the performance of small writes will
+    decrease smoothly instead of dropping abruptly at the moment when the
+    journal fills up.
+  info_ru: |
+    Разрешить мягкое ограничение скорости журналируемой записи. Полезно для
+    гибридных OSD с быстрыми устройствами метаданных и медленными устройствами
+    данных. Идея заключается в том, что мелкие записи в этой ситуации могут
+    завершаться очень быстро, так как они изначально записываются на быстрое
+    журнальное устройство (SSD). Но перемещать их потом на основное медленное
+    устройство долго. Поэтому если OSD быстро примет от клиентов очень много
+    мелких операций записи, он быстро заполнит свой журнал, после чего
+    производительность записи резко упадёт практически до нуля. Ограничение
+    скорости записи призвано решить эту проблему с помощью искусственного
+    замедления операций записи на основании объёма свободного места в журнале.
+    Когда эта опция включена, производительность мелких операций записи будет
+    снижаться плавно, а не резко в момент окончательного заполнения журнала.
+- name: throttle_target_iops
+  type: int
+  default: 100
+  info: |
+    Target maximum number of throttled operations per second when the journal
+    is full. Set it to the approximate random write iops of your data devices
+    (HDDs).
+  info_ru: |
+    Расчётное максимальное число ограничиваемых операций в секунду при условии
+    отсутствия свободного места в журнале. Устанавливайте приблизительно равным
+    максимальной производительности случайной записи ваших устройств данных
+    (HDD) в операциях в секунду.
+- name: throttle_target_mbs
+  type: int
+  default: 100
+  info: |
+    Target maximum bandwidth of throttled operations in MB/s when the journal
+    is full. Set it to the approximate linear write performance of your data
+    devices (HDDs).
+  info_ru: |
+    Расчётный максимальный размер в МБ/с ограничиваемых операций в секунду при
+    условии отсутствия свободного места в журнале. Устанавливайте приблизительно
+    равным максимальной производительности линейной записи ваших устройств
+    данных (HDD).
+- name: throttle_target_parallelism
+  type: int
+  default: 1
+  info: |
+    Target maximum parallelism of throttled operations when the journal is
+    full. Set it to the approximate internal parallelism of your data
+    devices (1 for HDDs, 4-8 for SSDs).
+  info_ru: |
+    Расчётный максимальный параллелизм ограничиваемых операций в секунду при
+    условии отсутствия свободного места в журнале. Устанавливайте приблизительно
+    равным внутреннему параллелизму ваших устройств данных (1 для HDD, 4-8
+    для SSD).
+- name: throttle_threshold_us
+  type: us
+  default: 50
+  info: |
+    Minimal computed delay to be applied to throttled operations. Usually
+    doesn't need to be changed.
+  info_ru: |
+    Минимальная применимая к ограничиваемым операциям задержка. Обычно не
+    требует изменений.
+- name: osd_memlock
+  type: bool
+  default: false
+  info: >
+    Lock all OSD memory with mlockall() to prevent it from being swapped out.
+    Requires a sufficient ulimit -l (max locked memory).
+  info_ru: >
+    Блокировать всю память OSD с помощью mlockall, чтобы запретить её выгрузку
+    в пространство подкачки. Требует достаточного значения ulimit -l (лимита
+    заблокированной памяти).
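The `inmemory_metadata` description above quotes a rough metadata size per terabyte. As a hedged illustration of where a figure of that kind comes from, the sketch below repeats the raw arithmetic that `create_hybrid_osd()` in `mon/make-osd-hybrid.js` (added in this same change) performs before it doubles and rounds the result. `estimate_meta_size` is a hypothetical helper name, not part of Vitastor, and the 128 KiB object size in the second call is an assumption chosen only for illustration.

```javascript
// Sketch only: mirrors the metadata-size arithmetic from create_hybrid_osd().
// The function name is hypothetical; the defaults are copied from the
// `options` object of make-osd-hybrid.js.
function estimate_meta_size(data_size, { object_size = 1024*1024, bitmap_granularity = 4096, device_block_size = 4096 } = {})
{
    // Per-object metadata entry size, as computed by the script
    const meta_entry_size = 24 + 2*object_size/bitmap_granularity/8;
    // Entries never span metadata blocks
    const entries_per_block = Math.floor(device_block_size / meta_entry_size);
    const object_count = Math.floor(data_size / object_size);
    // One extra block is added, as in the script
    return Math.ceil(1 + object_count / entries_per_block) * device_block_size;
}

const TiB = 1024*1024*1024*1024;
// Script defaults (1 MiB objects)
console.log(estimate_meta_size(TiB)); // 93376512
// Assumed 128 KiB objects, for comparison
console.log(estimate_meta_size(TiB, { object_size: 128*1024 })); // 268439552
```

The script then doubles this value, rounds it up to a whole number of megabytes and applies `min_meta_size`, so the partition it actually allocates is larger than the raw estimate.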
--- a/2
+++ b/2
--- a/mon/lp-optimizer.js
+++ b/mon/lp-optimizer.js
@@ -50,7 +50,7 @@ async function lp_solve(text)
    return { score, vars };
 }

-async function optimize_initial({ osd_tree, pg_count, pg_size = 3, pg_minsize = 2, max_combinations = 10000, parity_space = 1, round_robin = false })
+async function optimize_initial({ osd_tree, pg_count, pg_size = 3, pg_minsize = 2, max_combinations = 10000, parity_space = 1, ordered = false })
 {
    if (!pg_count || !osd_tree)
    {
@@ -92,7 +92,7 @@ async function optimize_initial({ osd_tree, pg_count, pg_size = 3, pg_minsize =
        console.log(lp);
        throw new Error('Problem is infeasible or unbounded - is it a bug?');
    }
-    const int_pgs = make_int_pgs(lp_result.vars, pg_count, round_robin);
+    const int_pgs = make_int_pgs(lp_result.vars, pg_count, ordered);
    const eff = pg_list_space_efficiency(int_pgs, all_weights, pg_minsize, parity_space);
    const res = {
        score: lp_result.score,
@@ -140,20 +140,20 @@ function make_int_pgs(weights, pg_count, round_robin)
    return int_pgs;
 }

-function calc_intersect_weights(pg_size, pg_count, prev_weights, all_pgs)
+function calc_intersect_weights(old_pg_size, pg_size, pg_count, prev_weights, all_pgs, ordered)
 {
    const move_weights = {};
-    if ((1 << pg_size) < pg_count)
+    if ((1 << old_pg_size) < pg_count)
    {
        const intersect = {};
        for (const pg_name in prev_weights)
        {
            const pg = pg_name.substr(3).split(/_/);
-            for (let omit = 1; omit < (1 << pg_size); omit++)
+            for (let omit = 1; omit < (1 << old_pg_size); omit++)
            {
                let pg_omit = [ ...pg ];
-                let intersect_count = pg_size;
-                for (let i = 0; i < pg_size; i++)
+                let intersect_count = old_pg_size;
+                for (let i = 0; i < old_pg_size; i++)
                {
                    if (omit & (1 << i))
                    {
@@ -161,6 +161,8 @@ function calc_intersect_weights(pg_size, pg_count, prev_weights, all_pgs)
                        intersect_count--;
                    }
                }
+                if (!ordered)
+                    pg_omit = pg_omit.filter(n => n).sort();
                pg_omit = pg_omit.join(':');
                intersect[pg_omit] = Math.max(intersect[pg_omit] || 0, intersect_count);
            }
@@ -174,10 +176,10 @@ function calc_intersect_weights(pg_size, pg_count, prev_weights, all_pgs)
                for (let i = 0; i < pg_size; i++)
                {
                    if (omit & (1 << i))
-                    {
                        pg_omit[i] = '';
-                    }
                }
+                if (!ordered)
+                    pg_omit = pg_omit.filter(n => n).sort();
                pg_omit = pg_omit.join(':');
                max_int = Math.max(max_int, intersect[pg_omit] || 0);
            }
@@ -186,15 +188,18 @@ function calc_intersect_weights(pg_size, pg_count, prev_weights, all_pgs)
    }
    else
    {
-        const prev_pg_hashed = Object.keys(prev_weights).map(pg_name => pg_name.substr(3).split(/_/).reduce((a, c) => { a[c] = 1; return a; }, {}));
+        const prev_pg_hashed = Object.keys(prev_weights).map(pg_name => pg_name
+            .substr(3).split(/_/).reduce((a, c, i) => { a[c] = i+1; return a; }, {}));
        for (const pg of all_pgs)
        {
            if (!prev_weights['pg_'+pg.join('_')])
            {
                let max_int = 0;
-                for (const prev_hash in prev_pg_hashed)
+                for (const prev_hash of prev_pg_hashed)
                {
-                    const intersect_count = pg.reduce((a, osd) => a + (prev_hash[osd] ? 1 : 0), 0);
+                    const intersect_count = ordered
+                        ? pg.reduce((a, osd, i) => a + (prev_hash[osd] == 1+i ? 1 : 0), 0)
+                        : pg.reduce((a, osd, i) => a + (prev_hash[osd] ? 1 : 0), 0);
                    if (max_int < intersect_count)
                    {
                        max_int = intersect_count;
@@ -243,7 +248,7 @@ function add_valid_previous(osd_tree, prev_weights, all_pgs)
 }

 // Try to minimize data movement
-async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3, pg_minsize = 2, max_combinations = 10000, parity_space = 1 })
+async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3, pg_minsize = 2, max_combinations = 10000, parity_space = 1, ordered = false })
 {
    if (!osd_tree)
    {
@@ -266,9 +271,13 @@ async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3,
            prev_pg_per_osd[osd].push([ pg_name, (i >= pg_minsize ? parity_space : 1) ]);
        }
    }
+    const old_pg_size = prev_int_pgs[0].length;
    // Get all combinations
    let all_pgs = random_combinations(osd_tree, pg_size, max_combinations, parity_space > 1);
-    add_valid_previous(osd_tree, prev_weights, all_pgs);
+    if (old_pg_size == pg_size)
+    {
+        add_valid_previous(osd_tree, prev_weights, all_pgs);
+    }
    all_pgs = Object.values(all_pgs);
    const pg_per_osd = {};
    for (const pg of all_pgs)
@@ -282,7 +291,7 @@ async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3,
        }
    }
    // Penalize PGs based on their similarity to old PGs
-    const move_weights = calc_intersect_weights(pg_size, pg_count, prev_weights, all_pgs);
+    const move_weights = calc_intersect_weights(old_pg_size, pg_size, pg_count, prev_weights, all_pgs, ordered);
    // Calculate total weight - old PG weights
    const all_pg_names = all_pgs.map(pg => 'pg_'+pg.join('_'));
    const all_pgs_hash = all_pg_names.reduce((a, c) => { a[c] = true; return a; }, {});
@@ -373,11 +382,35 @@ async function optimize_change({ prev_pgs: prev_int_pgs, osd_tree, pg_size = 3,
        {
            differs++;
        }
-        for (let j = 0; j < pg_size; j++)
+    }
+    if (ordered)
+    {
+        for (let i = 0; i < pg_count; i++)
        {
-            if (new_pgs[i][j] != prev_int_pgs[i][j])
+            for (let j = 0; j < pg_size; j++)
            {
-                osd_differs++;
+                if (new_pgs[i][j] != prev_int_pgs[i][j])
+                {
+                    osd_differs++;
+                }
+            }
+        }
+    }
+    else
+    {
+        for (let i = 0; i < pg_count; i++)
+        {
+            const old_map = prev_int_pgs[i].reduce((a, c) => { a[c] = (a[c]|0) + 1; return a; }, {});
+            for (let j = 0; j < pg_size; j++)
+            {
+                if ((0|old_map[new_pgs[i][j]]) > 0)
+                {
+                    old_map[new_pgs[i][j]]--;
+                }
+                else
+                {
+                    osd_differs++;
+                }
            }
        }
    }
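The `osd_differs` accounting in the hunk above can be read in isolation as the following sketch. `count_osd_differs` is a hypothetical wrapper name introduced here for illustration; the two branches follow the ordered (position-by-position) and unordered (multiset) comparisons from the diff.

```javascript
// Sketch: count OSDs that changed between the old and new PG sets.
// count_osd_differs is a hypothetical name; the logic follows the hunk above.
function count_osd_differs(prev_pgs, new_pgs, ordered)
{
    let osd_differs = 0;
    for (let i = 0; i < new_pgs.length; i++)
    {
        if (ordered)
        {
            // Position matters: compare slot by slot
            for (let j = 0; j < new_pgs[i].length; j++)
                if (new_pgs[i][j] != prev_pgs[i][j])
                    osd_differs++;
        }
        else
        {
            // Position does not matter: treat the old PG as a multiset of OSDs
            const old_map = prev_pgs[i].reduce((a, c) => { a[c] = (a[c]|0) + 1; return a; }, {});
            for (const osd of new_pgs[i])
            {
                if ((old_map[osd]|0) > 0)
                    old_map[osd]--;
                else
                    osd_differs++;
            }
        }
    }
    return osd_differs;
}

const prev = [ [ 1, 2, 3 ], [ 4, 5, 6 ] ];
const next = [ [ 3, 1, 2 ], [ 4, 5, 7 ] ];
console.log(count_osd_differs(prev, next, true));  // 4
console.log(count_osd_differs(prev, next, false)); // 1
```

In the example, the first PG is only a permutation of its old OSD set, so it contributes three differences in ordered mode and none in unordered mode.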
--- a/mon/make-osd-hybrid.js
+++ b/mon/make-osd-hybrid.js
@@ -0,0 +1,414 @@
+#!/usr/bin/nodejs
+// systemd unit generator for hybrid (HDD+SSD) vitastor OSDs
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1
+
+// USAGE: nodejs make-osd-hybrid.js [--disable_ssd_cache 0] [--disable_hdd_cache 0] /dev/sda /dev/sdb /dev/sdc /dev/sdd ...
+// I.e. - just pass all HDDs and SSDs mixed, the script will decide where
+// to put journals on its own
+
+const fs = require('fs');
+const fsp = fs.promises;
+const child_process = require('child_process');
+
+const options = {
+    debug: 1,
+    journal_size: 1024*1024*1024,
+    min_meta_size: 1024*1024*1024,
+    object_size: 1024*1024,
+    bitmap_granularity: 4096,
+    device_block_size: 4096,
+    disable_ssd_cache: 1,
+    disable_hdd_cache: 1,
+};
+
+run().catch(e => { console.error(e); process.exit(1); });
+
+async function run()
+{
+    const device_list = parse_options();
+    await system_or_die("mkdir -p /var/log/vitastor; chown vitastor /var/log/vitastor");
+    // Collect devices
+    const all_devices = await collect_devices(device_list);
+    const ssds = all_devices.filter(d => d.ssd);
+    const hdds = all_devices.filter(d => !d.ssd);
+    // Collect existing OSD units
+    const osd_units = await collect_osd_units();
+    // Count assigned HDD journals and unallocated space for each SSD
+    await check_journal_count(ssds, osd_units);
+    // Create new OSDs
+    await create_new_hybrid_osds(hdds, ssds, osd_units);
+    process.exit(0);
+}
+
+function parse_options()
+{
+    const devices = [];
+    const opt = {};
+    for (let i = 2; i < process.argv.length; i++)
+    {
+        const arg = process.argv[i];
+        if (arg == '--help' || arg == '-h')
+        {
+            opt.help = true;
+            break;
+        }
+        else if (arg.substr(0, 2) == '--')
+            opt[arg.substr(2)] = process.argv[++i];
+        else
+            devices.push(arg);
+    }
+    if (opt.help || !devices.length)
+    {
+        console.log(
+            'Prepare hybrid (HDD+SSD) Vitastor OSDs\n'+
+            '(c) Vitaliy Filippov, 2019+, license: VNPL-1.1\n\n'+
+            'USAGE: nodejs make-osd-hybrid.js [OPTIONS] /dev/sda /dev/sdb /dev/sdc ...\n'+
+            'Just pass all your SSDs and HDDs in any order, the script will distribute OSDs for you.\n\n'+
+            'OPTIONS (with defaults):\n'+
+            Object.keys(options).map(k => `  --${k} ${options[k]}`).join('\n')
+        );
+        process.exit(0);
+    }
+    for (const k in opt)
+        options[k] = typeof options[k] == 'number' ? Number(opt[k]) : opt[k]; // "0" from argv must be falsy
+    return devices;
+}
+
+// Collect devices
+async function collect_devices(devices_to_check)
+{
+    const devices = [];
+    for (const dev of devices_to_check)
+    {
+        if (dev.substr(0, 5) != '/dev/')
+        {
+            console.log(`${dev} does not start with /dev/, skipping`);
+            continue;
+        }
+        if (!await file_exists('/sys/block/'+dev.substr(5)))
+        {
+            console.log(`${dev} is a partition, skipping`);
+            continue;
+        }
+        // Check if the device is an SSD
+        const rot = '/sys/block/'+dev.substr(5)+'/queue/rotational';
+        if (!await file_exists(rot))
+        {
+            console.log(`${dev} does not have ${rot} to check whether it's an SSD, skipping`);
+            continue;
+        }
+        const ssd = !parseInt(await fsp.readFile(rot, { encoding: 'utf-8' }));
+        // Check if the device has partition table
+        let [ has_partition_table, parts ] = await system(`sfdisk --dump ${dev} --json`);
+        if (has_partition_table != 0)
+        {
+            // Check if the device has any data
+            const [ has_data, out ] = await system(`blkid ${dev}`);
+            if (has_data == 0)
+            {
+                console.log(`${dev} contains data, skipping:\n  ${out.trim().replace(/\n/g, '\n  ')}`);
+                continue;
+            }
+        }
+        parts = parts ? JSON.parse(parts).partitiontable : null;
+        if (parts && parts.label != 'gpt')
+        {
+            console.log(`${dev} contains "${parts.label}" partition table, only GPT is supported, skipping`);
+            continue;
+        }
+        devices.push({
+            path: dev,
+            ssd,
+            parts,
+        });
+    }
+    return devices;
+}
+
+// Collect existing OSD units
+async function collect_osd_units()
+{
+    const units = [];
+    for (const unit of (await system("ls /etc/systemd/system/vitastor-osd*.service"))[1].trim().split('\n'))
+    {
+        if (!unit)
+        {
+            continue;
+        }
+        let cmd = /^ExecStart\s*=\s*(([^\n]*\\\n)*[^\n]*)/m.exec(await fsp.readFile(unit, { encoding: 'utf-8' }));
+        if (!cmd)
+        {
+            console.log('ExecStart= not found in '+unit+', skipping');
+            continue;
+        }
+        let kv = {}, key;
+        cmd = cmd[1].replace(/^(bash\s+-c\s+')?\S+/, '') // also strips the executable path
+            .replace(/>>\s*\S+\s+2>\s*&1\s*'$/, '')
+            .replace(/\s*\\\n\s*/g, ' ')
+            .replace(/([^\s']+)|'([^']+)'/g, (m, m1, m2) =>
+            {
+                m1 = m1||m2;
+                if (key == null)
+                {
+                    if (m1.substr(0, 2) != '--')
+                    {
+                        console.log('Strange command line in '+unit+', stopping');
+                        process.exit(1);
+                    }
+                    key = m1.substr(2);
+                }
+                else
+                {
+                    kv[key] = m1;
+                    key = null;
+                }
+            });
+        units.push(kv);
+    }
+    return units;
+}
+
+// Count assigned HDD journals and unallocated space for each SSD
+async function check_journal_count(ssds, osd_units)
+{
+    const units_by_journal = osd_units.reduce((a, c) =>
+    {
+        if (c.journal_device)
+            a[c.journal_device] = c;
+        return a;
+    }, {});
+    for (const dev of ssds)
+    {
+        dev.journals = 0;
+        if (dev.parts)
+        {
+            for (const part of dev.parts.partitions)
+            {
+                if (part.uuid && units_by_journal['/dev/disk/by-partuuid/'+part.uuid.toLowerCase()])
+                {
+                    dev.journals++;
+                }
+            }
+            dev.free = free_from_parttable(dev.parts);
+        }
+        else
+        {
+            dev.free = parseInt(await system_or_die("blockdev --getsize64 "+dev.path));
+        }
+    }
+}
+
+async function create_new_hybrid_osds(hdds, ssds, osd_units)
+{
+    const units_by_disk = osd_units.reduce((a, c) => { a[c.data_device] = c; return a; }, {});
+    for (const dev of hdds)
+    {
+        if (!dev.parts)
+        {
+            // HDD is not partitioned yet, create a single partition
+            // + is the "default value" for sfdisk
+            await system_or_die('sfdisk '+dev.path, 'label: gpt\n\n+ +\n');
+            dev.parts = JSON.parse(await system_or_die('sfdisk --dump '+dev.path+' --json')).partitiontable;
+        }
+        if (dev.parts.partitions.length != 1)
+        {
+            console.log(dev.path+' has more than 1 partition, skipping');
+        }
+        else if ((dev.parts.partitions[0].start + dev.parts.partitions[0].size) != (1 + dev.parts.lastlba))
+        {
+            console.log(dev.path+'1 is not a whole-disk partition, skipping');
+        }
+        else if (!dev.parts.partitions[0].uuid)
+        {
+            console.log(dev.parts.partitions[0].node+' does not have UUID. Please repartition '+dev.path+' with GPT');
+        }
+        else if (!units_by_disk['/dev/disk/by-partuuid/'+dev.parts.partitions[0].uuid.toLowerCase()])
+        {
+            await create_hybrid_osd(dev, ssds);
+        }
+    }
+}
+
+async function create_hybrid_osd(dev, ssds)
+{
+    // Create a new OSD
+    // Calculate metadata size
+    const data_device = '/dev/disk/by-partuuid/'+dev.parts.partitions[0].uuid.toLowerCase();
+    const data_size = dev.parts.partitions[0].size * dev.parts.sectorsize;
+    const meta_entry_size = 24 + 2*options.object_size/options.bitmap_granularity/8;
+    const entries_per_block = Math.floor(options.device_block_size / meta_entry_size);
+    const object_count = Math.floor(data_size / options.object_size);
+    let meta_size = Math.ceil(1 + object_count / entries_per_block) * options.device_block_size;
+    // Leave some extra space for future metadata formats and round metadata area size to multiples of 1 MB
+    meta_size = 2*meta_size;
+    meta_size = Math.ceil(meta_size/1024/1024) * 1024*1024;
+    if (meta_size < options.min_meta_size)
+        meta_size = options.min_meta_size;
+    let journal_size = Math.ceil(options.journal_size/1024/1024) * 1024*1024;
+    // Pick an SSD for journal, balancing the number of journals across SSDs
+    let selected_ssd;
+    for (const ssd of ssds)
+        if (ssd.free >= (meta_size+journal_size) && (!selected_ssd || selected_ssd.journals > ssd.journals))
+            selected_ssd = ssd;
+    if (!selected_ssd)
+    {
+        console.error('Could not find free space for SSD journal and metadata for '+dev.path);
+        process.exit(1);
+    }
+    // Allocate an OSD number
+    const osd_num = (await system_or_die("vitastor-cli alloc-osd")).trim();
+    if (!osd_num)
+    {
+        console.error('Failed to run vitastor-cli alloc-osd');
+        process.exit(1);
+    }
+    console.log('Creating OSD '+osd_num+' on '+dev.path+' (HDD) with journal and metadata on '+selected_ssd.path+' (SSD)');
+    // Add two partitions: journal and metadata
+    const new_parts = await add_partitions(selected_ssd, [ journal_size, meta_size ]);
+    selected_ssd.journals++;
+    const journal_device = '/dev/disk/by-partuuid/'+new_parts[0].uuid.toLowerCase();
+    const meta_device = '/dev/disk/by-partuuid/'+new_parts[1].uuid.toLowerCase();
+    // Wait until the device symlinks appear
+    while (!await file_exists(journal_device))
+    {
+        await new Promise(ok => setTimeout(ok, 100));
+    }
+    while (!await file_exists(meta_device))
+    {
+        await new Promise(ok => setTimeout(ok, 100));
+    }
+    // Zero out metadata and journal
+    await system_or_die("dd if=/dev/zero of="+journal_device+" bs=1M count="+(journal_size/1024/1024)+" oflag=direct");
+    await system_or_die("dd if=/dev/zero of="+meta_device+" bs=1M count="+(meta_size/1024/1024)+" oflag=direct");
+    // Create unit file for the OSD
+    const has_scsi_cache_type = options.disable_ssd_cache &&
+        (await system("ls /sys/block/"+selected_ssd.path.substr(5)+"/device/scsi_disk/*/cache_type"))[0] == 0;
+    const write_through = options.disable_ssd_cache && (
+        has_scsi_cache_type || selected_ssd.path.substr(5, 4) == 'nvme'
+        && (await system_or_die("cat /sys/block/"+selected_ssd.path.substr(5)+"/queue/write_cache")).trim() == "write through");
+    await fsp.writeFile('/etc/systemd/system/vitastor-osd'+osd_num+'.service',
+`[Unit]
+Description=Vitastor object storage daemon osd.${osd_num}
+After=network-online.target local-fs.target time-sync.target
+Wants=network-online.target local-fs.target time-sync.target
+PartOf=vitastor.target
+
+[Service]
+LimitNOFILE=1048576
+LimitNPROC=1048576
+LimitMEMLOCK=infinity
+ExecStart=bash -c '/usr/bin/vitastor-osd \\
+    --osd_num ${osd_num} ${write_through
+        ? "--disable_meta_fsync 1 --disable_journal_fsync 1 --immediate_commit "+(options.disable_hdd_cache ? "all" : "small")
+        : ""} \\
+    --throttle_small_writes 1 \\
+    --disk_alignment ${options.device_block_size} \\
+    --journal_block_size ${options.device_block_size} \\
+    --meta_block_size ${options.device_block_size} \\
+    --journal_no_same_sector_overwrites true \\
+    --journal_sector_buffer_count 1024 \\
+    --block_size ${options.object_size} \\
+    --data_device ${data_device} \\
+    --journal_device ${journal_device} \\
+    --meta_device ${meta_device} >>/var/log/vitastor/osd${osd_num}.log 2>&1'
+WorkingDirectory=/
+ExecStartPre=+chown vitastor:vitastor ${data_device}
+ExecStartPre=+chown vitastor:vitastor ${journal_device}
+ExecStartPre=+chown vitastor:vitastor ${meta_device}${
+    has_scsi_cache_type
+    ? "\nExecStartPre=+bash -c 'D=$$$(readlink "+journal_device+"); echo write through > $$$(dirname /sys/block/*/$$\${D##*/})/device/scsi_disk/*/cache_type'"
+    : ""}${
+    options.disable_hdd_cache
+    ? "\nExecStartPre=+bash -c 'D=$$$(readlink "+data_device+"); echo write through > $$$(dirname /sys/block/*/$$\${D##*/})/device/scsi_disk/*/cache_type'"
+    : ""}
+User=vitastor
+PrivateTmp=false
+TasksMax=infinity
+Restart=always
+StartLimitInterval=0
+RestartSec=10
+
+[Install]
+WantedBy=vitastor.target
+`);
+    await system_or_die("systemctl enable vitastor-osd"+osd_num);
+}
+
+async function add_partitions(dev, sizes)
+{
+    let script = 'label: gpt\n\n';
+    if (dev.parts)
+    {
+        // Old partitions
+        for (const part of dev.parts.partitions)
+        {
+            script += part.node+': '+Object.keys(part).map(k => k == 'node' ? '' : k+'='+part[k]).filter(k => k).join(', ')+'\n';
+        }
+    }
+    // New partitions
+    for (const size of sizes)
+    {
+        script += '+ '+Math.ceil(size/1024)+'KiB\n';
+    }
+    await system_or_die('sfdisk '+dev.path, script);
+    // Get new partition table and find the new partition
+    const newpt = JSON.parse(await system_or_die('sfdisk --dump '+dev.path+' --json')).partitiontable;
+    const old_nodes = dev.parts ? dev.parts.partitions.reduce((a, c) => { a[c.uuid] = true; return a; }, {}) : {};
+    const new_nodes = newpt.partitions.filter(part => !old_nodes[part.uuid]);
+    if (new_nodes.length != sizes.length)
+    {
+        console.error('Failed to partition '+dev.path+': new partitions not found in table');
+        process.exit(1);
+    }
+    dev.parts = newpt;
+    dev.free = free_from_parttable(newpt);
+    return new_nodes;
+}
+
+function free_from_parttable(pt)
+{
+    let free = pt.lastlba + 1 - pt.firstlba;
+    for (const part of pt.partitions)
+    {
+        free -= part.size;
+    }
+    free *= pt.sectorsize;
+    return free;
+}
+
+async function system_or_die(cmd, input = '')
+{
+    let [ exitcode, stdout, stderr ] = await system(cmd, input);
+    if (exitcode != 0)
+    {
+        console.error(cmd+' failed: '+stderr);
+        process.exit(1);
+    }
+    return stdout;
+}
+
+async function system(cmd, input = '')
+{
+    if (options.debug)
+    {
+        process.stderr.write('+ '+cmd+(input ? " <<EOF\n"+input.replace(/\s*$/, '\n')+"EOF" : '')+'\n');
+    }
+    const cp = child_process.spawn(cmd, { shell: true });
+    let stdout = '', stderr = '', finish_cb;
+    cp.stdout.on('data', buf => stdout += buf.toString());
+    cp.stderr.on('data', buf => stderr += buf.toString());
+    cp.on('close', () => finish_cb && finish_cb());
+    cp.stdin.write(input);
+    cp.stdin.end();
+    if (cp.exitCode == null)
+    {
+        await new Promise(ok => finish_cb = ok);
+    }
+    return [ cp.exitCode, stdout, stderr ];
+}
+
+async function file_exists(filename)
+{
+    return new Promise((ok, no) => fs.access(filename, fs.constants.R_OK, err => ok(!err)));
+}
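Note on free_from_parttable() in the file above: the free-space figure is the usable LBA range minus the sectors taken by existing partitions, converted to bytes. A minimal standalone sketch of that arithmetic follows. The sample table is hypothetical and only mimics the fields the function reads from `sfdisk --dump --json` output (firstlba, lastlba, sectorsize, partitions[].size); the camelCase names are illustrative and are not part of the patch.

```javascript
// Sketch of the free-space calculation, assuming a partition table object
// with the same fields that free_from_parttable() reads.
function freeFromPartTable(pt)
{
    // Usable sectors are firstlba..lastlba inclusive
    let freeSectors = pt.lastlba + 1 - pt.firstlba;
    for (const part of pt.partitions)
    {
        // Partition sizes are expressed in sectors
        freeSectors -= part.size;
    }
    return freeSectors * pt.sectorsize;
}

// Hypothetical table: 1048576 usable sectors of 512 bytes, half of them used
const samplePt = {
    firstlba: 2048,
    lastlba: 1050623,
    sectorsize: 512,
    partitions: [ { node: '/dev/sdx1', start: 2048, size: 524288 } ],
};
const freeBytes = freeFromPartTable(samplePt);
console.log(freeBytes);
```

With this sample, half of the usable range remains free. The calculation ignores alignment gaps between partitions, so it may overestimate what a partitioning tool can actually allocate.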
--- a/mon/make-osd.sh
+++ b/mon/make-osd.sh
@@ -25,6 +25,10 @@ OPT=$(vitastor-cli simple-offsets --format options $DEV | tr '\n' ' ')
 META=$(vitastor-cli simple-offsets --format json $DEV | jq .data_offset)
 dd if=/dev/zero of=$DEV bs=1048576 count=$(((META+1048575)/1048576)) oflag=direct

+mkdir -p /var/log/vitastor
+id vitastor &>/dev/null || useradd vitastor
+chown vitastor /var/log/vitastor
+
 cat >/etc/systemd/system/vitastor-osd$OSD_NUM.service <<EOF
 [Unit]
 Description=Vitastor object storage daemon osd.$OSD_NUM
@@ -36,14 +40,14 @@ PartOf=vitastor.target
 LimitNOFILE=1048576
 LimitNPROC=1048576
 LimitMEMLOCK=infinity
-ExecStart=/usr/bin/vitastor-osd \\
+ExecStart=bash -c '/usr/bin/vitastor-osd \\
    --osd_num $OSD_NUM \\
    --disable_data_fsync 1 \\
    --immediate_commit all \\
    --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \\
    --journal_no_same_sector_overwrites true \\
    --journal_sector_buffer_count 1024 \\
-    $OPT
+    $OPT >>/var/log/vitastor/osd$OSD_NUM.log 2>&1'
 WorkingDirectory=/
 ExecStartPre=+chown vitastor:vitastor $DEV
 User=vitastor
--- a/mon/mon.js
+++ b/mon/mon.js
@@ -31,6 +31,7 @@ const etcd_allow = new RegExp('^'+[
    'osd/inodestats/[1-9]\\d*',
    'osd/space/[1-9]\\d*',
    'mon/master',
+    'mon/member/[a-f0-9]+',
    'pg/state/[1-9]\\d*/[1-9]\\d*',
    'pg/stats/[1-9]\\d*/[1-9]\\d*',
    'pg/history/[1-9]\\d*/[1-9]\\d*',
@@ -83,8 +84,13 @@ const etcd_tree = {
            osd_idle_timeout: 5, // seconds. min: 1
            osd_ping_timeout: 5, // seconds. min: 1
            up_wait_retry_interval: 500, // ms. min: 50
+            max_etcd_attempts: 5,
+            etcd_quick_timeout: 1000, // ms
+            etcd_slow_timeout: 5000, // ms
+            etcd_keepalive_timeout: 30, // seconds, default is max(30, etcd_report_interval*2)
+            etcd_ws_keepalive_interval: 30, // seconds
            // osd
-            etcd_report_interval: 5,
+            etcd_report_interval: 5, // seconds
            run_primary: true,
            osd_network: null, // "192.168.7.0/24" or an array of masks
            bind_address: "0.0.0.0",
@@ -99,6 +105,7 @@ const etcd_tree = {
            no_rebalance: false,
            print_stats_interval: 3,
            slow_log_interval: 10,
+            osd_memlock: false,
            // blockstore - fixed in superblock
            block_size,
            disk_alignment,
@@ -125,6 +132,11 @@ const etcd_tree = {
            inmemory_journal,
            journal_sector_buffer_count,
            journal_no_same_sector_overwrites,
+            throttle_small_writes: false,
+            throttle_target_iops: 100,
+            throttle_target_mbs: 100,
+            throttle_target_parallelism: 1,
+            throttle_threshold_us: 50,
        }, */
        global: {},
        /* node_placement: {
@@ -148,6 +160,8 @@ const etcd_tree = {
                root_node?: 'rack1',
                // restrict pool to OSDs having all of these tags
                osd_tags?: 'nvme' | [ 'nvme', ... ],
+                // prefer to put primary on OSD with these tags
+                primary_affinity_tags?: 'nvme' | [ 'nvme', ... ],
            },
            ...
        }, */
@@ -212,21 +226,28 @@ const etcd_tree = {
            }, */
        },
        inodestats: {
-            /* <inode_t>: {
-                read: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
-                write: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
-                delete: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
+            /* <pool_id>: {
+                <inode_t>: {
+                    read: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
+                    write: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
+                    delete: { count: uint64_t, usec: uint64_t, bytes: uint64_t },
+                },
            }, */
        },
        space: {
            /* <osd_num_t>: {
-                <inode_t>: uint64_t, // bytes
+                <pool_id>: {
+                    <inode_t>: uint64_t, // bytes
+                },
            }, */
        },
    },
    mon: {
        master: {
-            /* ip: [ string ], */
+            /* ip: [ string ], id: uint64_t */
+        },
+        standby: {
+            /* <uint64_t>: { ip: [ string ] }, */
        },
    },
    pg: {
@@ -257,7 +278,7 @@ const etcd_tree = {
                <pg_id>: {
                    osd_sets: osd_num_t[][],
                    all_peers: osd_num_t[],
-                    epoch: uint32_t,
+                    epoch: uint64_t,
                },
            }, */
        },
@@ -341,6 +362,9 @@ class Mon
        this.etcd_start_timeout = (config.etcd_start_timeout || 5) * 1000;
        this.state = JSON.parse(JSON.stringify(this.constructor.etcd_tree));
        this.signals_set = false;
+        this.ws = null;
+        this.ws_alive = false;
+        this.ws_keepalive_timer = null;
        this.on_stop_cb = () => this.on_stop(0).catch(console.error);
    }

@@ -383,7 +407,7 @@ class Mon
        for (const pool_id in this.state.config.pools)
        {
            if (!this.state.pool.stats[pool_id] ||
-                !this.state.pool.stats[pool_id].pg_real_size)
+                !Number(this.state.pool.stats[pool_id].pg_real_size))
            {
                // Generate missing data in etcd
                this.state.config.pgs.hash = null;
@@ -461,8 +485,20 @@ class Mon

    restart_watcher(cur_addr)
    {
+        if (this.ws)
+        {
+            this.ws.close();
+            this.ws = null;
+        }
+        if (this.ws_keepalive_timer)
+        {
+            clearInterval(this.ws_keepalive_timer);
+            this.ws_keepalive_timer = null;
+        }
        if (this.selected_etcd_url == cur_addr)
+        {
            this.selected_etcd_url = null;
+        }
        this.start_watcher(this.config.etcd_mon_retries).catch(this.die);
    }

@@ -482,6 +518,7 @@ class Mon
                const timer_id = setTimeout(() =>
                {
                    this.ws.close();
+                    this.ws = null;
                    ok(false);
                }, this.config.etcd_mon_timeout);
                this.ws = new WebSocket(base+'/watch');
@@ -510,6 +547,20 @@ class Mon
            this.die('Failed to open etcd watch websocket');
        }
        const cur_addr = this.selected_etcd_url;
+        this.ws_alive = true;
+        this.ws_keepalive_timer = setInterval(() =>
+        {
+            if (this.ws_alive)
+            {
+                this.ws_alive = false;
+                this.ws.send(JSON.stringify({ progress_request: {} }));
+            }
+            else
+            {
+                console.log('etcd websocket timed out, restarting it');
+                this.restart_watcher(cur_addr);
+            }
+        }, (Number(this.config.etcd_ws_keepalive_interval) || 30)*1000);
        this.ws.on('error', () => this.restart_watcher(cur_addr));
        this.ws.send(JSON.stringify({
            create_request: {
@@ -522,6 +573,7 @@ class Mon
        }));
        this.ws.on('message', (msg) =>
        {
+            this.ws_alive = true;
            let data;
            try
            {
@@ -558,7 +610,7 @@ class Mon
                    console.log('Revision '+data.result.header.revision+' events: ');
                }
                this.etcd_watch_revision = BigInt(data.result.header.revision)+BigInt(1);
-                for (const e of data.result.events)
+                for (const e of data.result.events||[])
                {
                    this.parse_kv(e.kv);
                    const key = e.kv.key.substr(this.etcd_prefix.length);
@@ -631,11 +683,25 @@ class Mon
        }, this.etcd_start_timeout, 0);
    }

+    get_mon_state()
+    {
+        return { ip: this.local_ips(), hostname: os.hostname() };
+    }
+
    async get_lease()
    {
        const max_ttl = this.config.etcd_mon_ttl + this.config.etcd_mon_timeout/1000*this.config.etcd_mon_retries;
-        const res = await this.etcd_call('/lease/grant', { TTL: max_ttl }, this.config.etcd_mon_timeout, -1);
+        // Get lease
+        let res = await this.etcd_call('/lease/grant', { TTL: max_ttl }, this.config.etcd_mon_timeout, -1);
        this.etcd_lease_id = res.ID;
+        // Register in /mon/member for informational purposes only
+        const state = this.get_mon_state();
+        res = await this.etcd_call('/kv/put', {
+            key: b64(this.etcd_prefix+'/mon/member/'+this.etcd_lease_id),
+            value: b64(JSON.stringify(state)),
+            lease: ''+this.etcd_lease_id
+        }, this.etcd_start_timeout, 0);
+        // Set refresh timer
        this.lease_timer = setInterval(async () =>
        {
            const res = await this.etcd_call('/lease/keepalive', { ID: this.etcd_lease_id }, this.config.etcd_mon_timeout, this.config.etcd_mon_retries);
@@ -661,7 +727,7 @@ class Mon

    async become_master()
    {
-        const state = { ip: this.local_ips() };
+        const state = { ...this.get_mon_state(), id: ''+this.etcd_lease_id };
        while (1)
        {
            const res = await this.etcd_call('/kv/txn', {
@@ -839,27 +905,39 @@ class Mon
        return this.seed + 2147483648;
    }

-    pick_primary(pool_id, osd_set, up_osds)
+    pick_primary(pool_id, osd_set, up_osds, aff_osds)
    {
        let alive_set;
        if (this.state.config.pools[pool_id].scheme === 'replicated')
-            alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
+        {
+            // Prefer "affinity" OSDs
+            alive_set = osd_set.filter(osd_num => osd_num && aff_osds[osd_num]);
+            if (!alive_set.length)
+                alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
+        }
        else
        {
            // Prefer data OSDs for EC because they can actually read something without an additional network hop
            const pg_data_size = (this.state.config.pools[pool_id].pg_size||0) -
                (this.state.config.pools[pool_id].parity_chunks||0);
-            alive_set = osd_set.slice(0, pg_data_size).filter(osd_num => osd_num && up_osds[osd_num]);
+            alive_set = osd_set.slice(0, pg_data_size).filter(osd_num => osd_num && aff_osds[osd_num]);
            if (!alive_set.length)
-                alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
+                alive_set = osd_set.filter(osd_num => osd_num && aff_osds[osd_num]);
+            if (!alive_set.length)
+            {
+                alive_set = osd_set.slice(0, pg_data_size).filter(osd_num => osd_num && up_osds[osd_num]);
+                if (!alive_set.length)
+                    alive_set = osd_set.filter(osd_num => osd_num && up_osds[osd_num]);
+            }
        }
        if (!alive_set.length)
            return 0;
        return alive_set[this.rng() % alive_set.length];
    }

-    save_new_pgs_txn(request, pool_id, up_osds, prev_pgs, new_pgs, pg_history)
+    save_new_pgs_txn(request, pool_id, up_osds, osd_tree, prev_pgs, new_pgs, pg_history)
    {
+        const aff_osds = this.get_affinity_osds(this.state.config.pools[pool_id], up_osds, osd_tree);
        const pg_items = {};
        this.reset_rng();
        new_pgs.map((osd_set, i) =>
@@ -867,7 +945,7 @@ class Mon
            osd_set = osd_set.map(osd_num => osd_num === LPOptimizer.NO_OSD ? 0 : osd_num);
            pg_items[i+1] = {
                osd_set,
-                primary: this.pick_primary(pool_id, osd_set, up_osds),
+                primary: this.pick_primary(pool_id, osd_set, up_osds, aff_osds),
            };
            if (prev_pgs[i] && prev_pgs[i].join(' ') != osd_set.join(' ') &&
                prev_pgs[i].filter(osd_num => osd_num).length > 0)
@@ -998,6 +1076,13 @@ class Mon
                console.log('Pool '+pool_id+' has invalid osd_tags (must be a string or array of strings)');
            return false;
        }
+        if (pool_cfg.primary_affinity_tags && typeof(pool_cfg.primary_affinity_tags) != 'string' &&
+            (!(pool_cfg.primary_affinity_tags instanceof Array) || pool_cfg.primary_affinity_tags.filter(t => typeof t != 'string').length > 0))
+        {
+            if (warn)
+                console.log('Pool '+pool_id+' has invalid primary_affinity_tags (must be a string or array of strings)');
+            return false;
+        }
        return true;
    }

@@ -1027,6 +1112,17 @@ class Mon
        }
    }

+    get_affinity_osds(pool_cfg, up_osds, osd_tree)
+    {
+        let aff_osds = up_osds;
+        if (pool_cfg.primary_affinity_tags)
+        {
+            aff_osds = { ...up_osds };
+            this.filter_osds_by_tags(osd_tree, { x: aff_osds }, pool_cfg.primary_affinity_tags);
+        }
+        return aff_osds;
+    }
+
    async recheck_pgs()
    {
        // Take configuration and state, check it against the stored configuration hash
@@ -1057,7 +1153,7 @@ class Mon
                    {
                        prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
                    }
-                    this.save_new_pgs_txn(etcd_request, pool_id, up_osds, prev_pgs, [], []);
+                    this.save_new_pgs_txn(etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
                }
            }
            for (const pool_id in this.state.config.pools)
@@ -1097,7 +1193,7 @@ class Mon
                    pg_size: pool_cfg.pg_size,
                    pg_minsize: pool_cfg.pg_minsize,
                    max_combinations: pool_cfg.max_osd_combinations,
-                    round_robin: pool_cfg.scheme != 'replicated',
+                    ordered: pool_cfg.scheme != 'replicated',
                };
                let optimize_result;
                if (old_pg_count > 0)
@@ -1120,10 +1216,6 @@ class Mon
                        {
                            pg.push(0);
                        }
-                        while (pg.length > pool_cfg.pg_size)
-                        {
-                            pg.pop();
-                        }
                    }
                    if (!this.state.config.pgs.hash)
                    {
@@ -1159,8 +1251,8 @@ class Mon
                this.state.pool.stats[pool_id] = {
                    used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
                    total_raw_tb: optimize_result.space,
-                    pg_real_size: pg_effsize,
-                    raw_to_usable: pg_effsize / (pool_cfg.scheme === 'replicated'
+                    pg_real_size: pg_effsize || pool_cfg.pg_size,
+                    raw_to_usable: (pg_effsize || pool_cfg.pg_size) / (pool_cfg.scheme === 'replicated'
                        ? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0))),
                    space_efficiency: optimize_result.space/(optimize_result.total_space||1),
                };
@@ -1168,7 +1260,7 @@ class Mon
                    key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
                    value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
                } });
-                this.save_new_pgs_txn(etcd_request, pool_id, up_osds, real_prev_pgs, optimize_result.int_pgs, pg_history);
+                this.save_new_pgs_txn(etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, optimize_result.int_pgs, pg_history);
            }
            this.state.config.pgs.hash = tree_hash;
            await this.save_pg_config(etcd_request);
@@ -1185,13 +1277,14 @@ class Mon
                    continue;
                }
                const replicated = pool_cfg.scheme === 'replicated';
+                const aff_osds = this.get_affinity_osds(pool_cfg, up_osds, osd_tree);
                this.reset_rng();
                for (let pg_num = 1; pg_num <= pool_cfg.pg_count; pg_num++)
                {
                    const pg_cfg = this.state.config.pgs.items[pool_id][pg_num];
                    if (pg_cfg)
                    {
-                        const new_primary = this.pick_primary(pool_id, pg_cfg.osd_set, up_osds);
+                        const new_primary = this.pick_primary(pool_id, pg_cfg.osd_set, up_osds, aff_osds);
                        if (pg_cfg.primary != new_primary)
                        {
                            console.log(
@@ -1307,21 +1400,30 @@ class Mon
        const tm = prev_stats ? BigInt(timestamp - prev_stats.timestamp) : 0;
        for (const op in op_stats)
        {
-            op_stats[op].bps = prev_stats ? (op_stats[op].bytes - prev_stats.op_stats[op].bytes) * 1000n / tm : 0;
-            op_stats[op].iops = prev_stats ? (op_stats[op].count - prev_stats.op_stats[op].count) * 1000n / tm : 0;
-            op_stats[op].lat = prev_stats ? (op_stats[op].usec - prev_stats.op_stats[op].usec)
-                / ((op_stats[op].count - prev_stats.op_stats[op].count) || 1n) : 0;
+            if (prev_stats && prev_stats.op_stats && prev_stats.op_stats[op])
+            {
+                op_stats[op].bps = (op_stats[op].bytes - prev_stats.op_stats[op].bytes) * 1000n / tm;
+                op_stats[op].iops = (op_stats[op].count - prev_stats.op_stats[op].count) * 1000n / tm;
+                op_stats[op].lat = (op_stats[op].usec - prev_stats.op_stats[op].usec)
+                    / ((op_stats[op].count - prev_stats.op_stats[op].count) || 1n);
+            }
        }
        for (const op in subop_stats)
        {
-            subop_stats[op].iops = prev_stats ? (subop_stats[op].count - prev_stats.subop_stats[op].count) * 1000n / tm : 0;
-            subop_stats[op].lat = prev_stats ? (subop_stats[op].usec - prev_stats.subop_stats[op].usec)
-                / ((subop_stats[op].count - prev_stats.subop_stats[op].count) || 1n) : 0;
+            if (prev_stats && prev_stats.subop_stats && prev_stats.subop_stats[op])
+            {
+                subop_stats[op].iops = (subop_stats[op].count - prev_stats.subop_stats[op].count) * 1000n / tm;
+                subop_stats[op].lat = (subop_stats[op].usec - prev_stats.subop_stats[op].usec)
+                    / ((subop_stats[op].count - prev_stats.subop_stats[op].count) || 1n);
+            }
        }
        for (const op in recovery_stats)
        {
-            recovery_stats[op].bps = prev_stats ? (recovery_stats[op].bytes - prev_stats.recovery_stats[op].bytes) * 1000n / tm : 0;
-            recovery_stats[op].iops = prev_stats ? (recovery_stats[op].count - prev_stats.recovery_stats[op].count) * 1000n / tm : 0;
+            if (prev_stats && prev_stats.recovery_stats && prev_stats.recovery_stats[op])
+            {
+                recovery_stats[op].bps = (recovery_stats[op].bytes - prev_stats.recovery_stats[op].bytes) * 1000n / tm;
+                recovery_stats[op].iops = (recovery_stats[op].count - prev_stats.recovery_stats[op].count) * 1000n / tm;
+            }
        }
        return { op_stats, subop_stats, recovery_stats };
    }
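Note on pick_primary() in mon/mon.js above: with primary_affinity_tags, candidates are tried in tiers, and the first non-empty tier wins. The sketch below restates that order as a standalone function under stated assumptions: the name pickPrimaryCandidates and the dataSize parameter are hypothetical, and the random choice among candidates is left out so the result is deterministic.

```javascript
// Returns the candidate OSDs for the primary role, in the fallback order
// used by the patched pick_primary(). dataSize == null means "replicated".
function pickPrimaryCandidates(osdSet, upOsds, affOsds, dataSize)
{
    const alive = (set, allowed) => set.filter(osdNum => osdNum && allowed[osdNum]);
    const tiers = dataSize == null
        // Replicated: affinity OSDs first, then any live OSD
        ? [ [ osdSet, affOsds ], [ osdSet, upOsds ] ]
        // EC: data chunks with affinity, any chunk with affinity,
        // then live data chunks, then any live chunk
        : [
            [ osdSet.slice(0, dataSize), affOsds ],
            [ osdSet, affOsds ],
            [ osdSet.slice(0, dataSize), upOsds ],
            [ osdSet, upOsds ],
        ];
    for (const [ set, allowed ] of tiers)
    {
        const candidates = alive(set, allowed);
        if (candidates.length)
            return candidates;
    }
    return [];
}

const up = { 1: true, 2: true, 3: true };
console.log(pickPrimaryCandidates([ 1, 2, 3 ], up, { 2: true }, null));
console.log(pickPrimaryCandidates([ 1, 2, 3 ], up, { 3: true }, 2));
```

In the second call the only affinity OSD holds a parity chunk, so it is still preferred over the live data OSDs, which mirrors the tier order in the patch.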
--- a/mon/simple-offsets.js
+++ b/mon/simple-offsets.js
@@ -49,7 +49,8 @@ async function run()
    }
    options.journal_offset = Math.ceil(options.journal_offset/options.device_block_size)*options.device_block_size;
    const meta_offset = options.journal_offset + Math.ceil(options.journal_size/options.device_block_size)*options.device_block_size;
-    const entries_per_block = Math.floor(options.device_block_size / (24 + 2*options.object_size/options.bitmap_granularity/8));
+    const meta_entry_size = 24 + 2*options.object_size/options.bitmap_granularity/8;
+    const entries_per_block = Math.floor(options.device_block_size / meta_entry_size);
    const object_count = Math.floor((device_size-meta_offset)/options.object_size);
    const meta_size = Math.ceil(1 + object_count / entries_per_block) * options.device_block_size;
    const data_offset = meta_offset + meta_size;
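Worked example for the layout arithmetic in mon/simple-offsets.js above. The option values below (4 KiB device blocks, 128 KiB objects, 4 KiB bitmap granularity, 16 MiB journal at offset 4096, 1 GiB device) are assumptions chosen for illustration, not values taken from the patch, and computeLayout is a hypothetical wrapper around the same formulas.

```javascript
// Recomputes the journal/metadata/data layout with the formulas from the hunk
function computeLayout(options, deviceSize)
{
    const bs = options.device_block_size;
    const journalOffset = Math.ceil(options.journal_offset/bs)*bs;
    const metaOffset = journalOffset + Math.ceil(options.journal_size/bs)*bs;
    // 24 bytes of fixed entry data plus two bitmaps of object_size/bitmap_granularity bits
    const metaEntrySize = 24 + 2*options.object_size/options.bitmap_granularity/8;
    const entriesPerBlock = Math.floor(bs / metaEntrySize);
    const objectCount = Math.floor((deviceSize-metaOffset)/options.object_size);
    const metaSize = Math.ceil(1 + objectCount / entriesPerBlock) * bs;
    return { metaOffset, metaEntrySize, entriesPerBlock, objectCount, metaSize, dataOffset: metaOffset + metaSize };
}

const layout = computeLayout({
    device_block_size: 4096,
    object_size: 131072,
    bitmap_granularity: 4096,
    journal_offset: 4096,
    journal_size: 16777216,
}, 1073741824);
console.log(layout);
```

With these inputs each metadata entry takes 32 bytes, so one 4 KiB block holds 128 entries and the 8063 objects need 64 metadata blocks (one of them comes from the `1 +` term in the formula).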
--- a/mon/test-optimize-simple.js
+++ b/mon/test-optimize-simple.js
@@ -5,21 +5,45 @@ const LPOptimizer = require('./lp-optimizer.js');

 async function run()
 {
-    const osd_tree = { a: { 1: 1 }, b: { 2: 1 }, c: { 3: 1 } };
+    const osd_tree = {
+        100: { 1: 1 },
+        200: { 2: 1 },
+        300: { 3: 1 },
+    };
+
    let res;

    console.log('16 PGs, size=3');
-    res = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 3, pg_count: 16 });
+    res = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 3, pg_count: 16, ordered: false });
    LPOptimizer.print_change_stats(res, false);
-
-    console.log('\nReduce PG size to 2');
-    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs.map(pg => pg.slice(0, 2)), osd_tree, pg_size: 2 });
+    assert(res.space == 3, 'Initial distribution');
+    console.log('\nChange size to 2');
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 2, ordered: false });
    LPOptimizer.print_change_stats(res, false);
-
+    assert(res.space >= 3*14/16 && res.osd_differs == 0, 'Redistribution');
    console.log('\nRemove OSD 3');
-    delete osd_tree['c'];
-    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 2 });
+    const no3_tree = { ...osd_tree };
+    delete no3_tree['300'];
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: no3_tree, pg_size: 2, ordered: false });
    LPOptimizer.print_change_stats(res, false);
+    assert(res.space == 2, 'Redistribution after OSD removal');
+
+    console.log('\n16 PGs, size=3, ordered');
+    res = await LPOptimizer.optimize_initial({ osd_tree, pg_size: 3, pg_count: 16, ordered: true });
+    LPOptimizer.print_change_stats(res, false);
+    assert(res.space == 3, 'Initial distribution');
+    console.log('\nChange size to 2, ordered');
+    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree, pg_size: 2, ordered: true });
+    LPOptimizer.print_change_stats(res, false);
+    assert(res.space >= 3*14/16 && res.osd_differs < 8, 'Redistribution');
+}
+
+function assert(cond, txt)
+{
+    if (!cond)
+    {
+        throw new Error((txt||'test')+' failed');
+    }
 }

 run().catch(console.error);
--- a/mon/test-optimize-undersized.js
+++ b/mon/test-optimize-undersized.js
@@ -45,30 +45,45 @@ async function run()
    console.log('Empty tree:');
    let res = await LPOptimizer.optimize_initial({ osd_tree: cur_tree, pg_size: 3, pg_count: 256 });
    LPOptimizer.print_change_stats(res, false);
+    assert(res.space == 0);
    console.log('\nAdding 1st failure domain:');
    cur_tree['dom1'] = osd_tree['dom1'];
    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
+    assert(res.space == 12 && res.total_space == 12);
    console.log('\nAdding 2nd failure domain:');
    cur_tree['dom2'] = osd_tree['dom2'];
    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
+    assert(res.space == 24 && res.total_space == 24);
    console.log('\nAdding 3rd failure domain:');
    cur_tree['dom3'] = osd_tree['dom3'];
    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
+    assert(res.space == 36 && res.total_space == 36);
    console.log('\nRemoving 3rd failure domain:');
    delete cur_tree['dom3'];
    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
+    assert(res.space == 24 && res.total_space == 24);
    console.log('\nRemoving 2nd failure domain:');
    delete cur_tree['dom2'];
    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
+    assert(res.space == 12 && res.total_space == 12);
    console.log('\nRemoving 1st failure domain:');
    delete cur_tree['dom1'];
    res = await LPOptimizer.optimize_change({ prev_pgs: res.int_pgs, osd_tree: cur_tree, pg_size: 3 });
    LPOptimizer.print_change_stats(res, false);
+    assert(res.space == 0);
+}
+
+function assert(cond, txt)
+{
+    if (!cond)
+    {
+        throw new Error((txt||'test')+' failed');
+    }
 }

 run().catch(console.error);
--- a/patches/cinder-vitastor.py
+++ b/patches/cinder-vitastor.py
@@ -50,7 +50,7 @@ from cinder.volume import configuration
 from cinder.volume import driver
 from cinder.volume import volume_utils

-VERSION = '0.6.10'
+VERSION = '0.6.17'

 LOG = logging.getLogger(__name__)

@@ -355,7 +355,25 @@ class VitastorDriver(driver.CloneableImageVD,
    def revert_to_snapshot(self, context, volume, snapshot):
        """Revert a volume to a given snapshot."""

-        # FIXME Delete the image, then recreate it from the snapshot
+        vol_name = utils.convert_str(snapshot.volume_name)
+        snap_name = utils.convert_str(snapshot.name)
+
+        # Delete the image and recreate it from the snapshot
+        args = [ 'vitastor-cli', 'rm', vol_name, *(self._vitastor_args()) ]
+        try:
+            self._execute(*args)
+        except processutils.ProcessExecutionError as exc:
+            LOG.error("Failed to delete image "+vol_name+": "+str(exc))
+            raise exception.VolumeBackendAPIException(data = exc.stderr)
+        args = [
+            'vitastor-cli', 'create', '--parent', vol_name+'@'+snap_name,
+            vol_name, *(self._vitastor_args())
+        ]
+        try:
+            self._execute(*args)
+        except processutils.ProcessExecutionError as exc:
+            LOG.error("Failed to recreate image "+vol_name+" from "+vol_name+"@"+snap_name+": "+str(exc))
+            raise exception.VolumeBackendAPIException(data = exc.stderr)

    def delete_snapshot(self, snapshot):
        """Deletes a snapshot."""
@@ -363,24 +381,15 @@ class VitastorDriver(driver.CloneableImageVD,
        vol_name = utils.convert_str(snapshot.volume_name)
        snap_name = utils.convert_str(snapshot.name)

-        # Find the snapshot
-        resp = self._etcd_txn({ 'success': [
-            { 'request_range': { 'key': 'index/image/'+vol_name+'@'+snap_name } },
-        ] })
-        if len(resp['responses'][0]['kvs']) == 0:
-            raise exception.SnapshotNotFound(snapshot_id = snap_name)
-        inode_id = int(resp['responses'][0]['kvs'][0]['value']['id'])
-        pool_id = int(resp['responses'][0]['kvs'][0]['value']['pool_id'])
-        parents = {}
-        parents[(pool_id << 48) | (inode_id & 0xffffffffffff)] = True
-
-        # Check if there are child volumes
-        children = self._child_count(parents)
-        if children > 0:
-            raise exception.SnapshotIsBusy(snapshot_name = snap_name)
-
-        # FIXME: We can't delete snapshots because we can't merge layers yet
-        raise exception.VolumeBackendAPIException(data = 'Snapshot delete (layer merge) is not implemented yet')
+        args = [
+            'vitastor-cli', 'rm', vol_name+'@'+snap_name,
+            *(self._vitastor_args())
+        ]
+        try:
+            self._execute(*args)
+        except processutils.ProcessExecutionError as exc:
+            LOG.error("Failed to remove snapshot "+vol_name+'@'+snap_name+": "+str(exc))
+            raise exception.VolumeBackendAPIException(data = exc.stderr)

    def _child_count(self, parents):
        children = 0
--- a/rpm/build-tarball.sh
+++ b/rpm/build-tarball.sh
@@ -25,4 +25,4 @@ rm fio
 mv fio-copy fio
 FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
 perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
-tar --transform 's#^#vitastor-0.6.10/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.6.10$(rpm --eval '%dist').tar.gz *
+tar --transform 's#^#vitastor-0.6.17/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-0.6.17$(rpm --eval '%dist').tar.gz *
--- a/rpm/vitastor-el7.Dockerfile
+++ b/rpm/vitastor-el7.Dockerfile
@@ -34,7 +34,7 @@ ADD . /root/vitastor
 RUN set -e; \
    cd /root/vitastor/rpm; \
    sh build-tarball.sh; \
-    cp /root/vitastor-0.6.10.el7.tar.gz ~/rpmbuild/SOURCES; \
+    cp /root/vitastor-0.6.17.el7.tar.gz ~/rpmbuild/SOURCES; \
    cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
    cd ~/rpmbuild/SPECS/; \
    rpmbuild -ba vitastor.spec; \
--- a/rpm/vitastor-el7.spec
+++ b/rpm/vitastor-el7.spec
@@ -1,11 +1,11 @@
 Name:           vitastor
-Version:        0.6.10
+Version:        0.6.17
 Release:        1%{?dist}
 Summary:        Vitastor, a fast software-defined clustered block storage

 License:        Vitastor Network Public License 1.1
 URL:            https://vitastor.io/
-Source0:        vitastor-0.6.10.el7.tar.gz
+Source0:        vitastor-0.6.17.el7.tar.gz

 BuildRequires:  liburing-devel >= 0.6
 BuildRequires:  gperftools-devel
@@ -119,6 +119,7 @@ cp -r mon %buildroot/usr/lib/vitastor

 %files -n vitastor-client
 %_bindir/vitastor-nbd
+%_bindir/vitastor-nfs
 %_bindir/vitastor-cli
 %_bindir/vitastor-rm
 %_bindir/vita
--- a/rpm/vitastor-el8.Dockerfile
+++ b/rpm/vitastor-el8.Dockerfile
@@ -33,7 +33,7 @@ ADD . /root/vitastor
 RUN set -e; \
    cd /root/vitastor/rpm; \
    sh build-tarball.sh; \
-    cp /root/vitastor-0.6.10.el8.tar.gz ~/rpmbuild/SOURCES; \
+    cp /root/vitastor-0.6.17.el8.tar.gz ~/rpmbuild/SOURCES; \
    cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
    cd ~/rpmbuild/SPECS/; \
    rpmbuild -ba vitastor.spec; \
--- a/rpm/vitastor-el8.spec
+++ b/rpm/vitastor-el8.spec
@@ -1,11 +1,11 @@
 Name:           vitastor
-Version:        0.6.10
+Version:        0.6.17
 Release:        1%{?dist}
 Summary:        Vitastor, a fast software-defined clustered block storage

 License:        Vitastor Network Public License 1.1
 URL:            https://vitastor.io/
-Source0:        vitastor-0.6.10.el8.tar.gz
+Source0:        vitastor-0.6.17.el8.tar.gz

 BuildRequires:  liburing-devel >= 0.6
 BuildRequires:  gperftools-devel
@@ -116,6 +116,7 @@ cp -r mon %buildroot/usr/lib/vitastor

 %files -n vitastor-client
 %_bindir/vitastor-nbd
+%_bindir/vitastor-nfs
 %_bindir/vitastor-cli
 %_bindir/vitastor-rm
 %_bindir/vita
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -15,7 +15,7 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
 	set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
 endif()

-add_definitions(-DVERSION="0.6.10")
+add_definitions(-DVERSION="0.6.17")
 add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -I ${CMAKE_SOURCE_DIR}/src)
 if (${WITH_ASAN})
 	add_definitions(-fsanitize=address -fno-omit-frame-pointer)
@@ -88,8 +88,8 @@ if (IBVERBS_LIBRARIES)
 	set(MSGR_RDMA "msgr_rdma.cpp")
 endif (IBVERBS_LIBRARIES)
 add_library(vitastor_common STATIC
-	epoll_manager.cpp etcd_state_client.cpp
-	messenger.cpp msgr_stop.cpp msgr_op.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp
+	epoll_manager.cpp etcd_state_client.cpp messenger.cpp addr_util.cpp
+	msgr_stop.cpp msgr_op.cpp msgr_send.cpp msgr_receive.cpp ringloop.cpp ../json11/json11.cpp
 	http_client.cpp osd_ops.cpp pg_states.cpp timerfd_manager.cpp base64.cpp ${MSGR_RDMA}
 )
 target_compile_options(vitastor_common PUBLIC -fPIC)
@@ -112,6 +112,7 @@ if (${WITH_FIO})
 	add_library(fio_vitastor_sec SHARED
 		fio_sec_osd.cpp
 		rw_blocking.cpp
+		addr_util.cpp
 	)
 	target_link_libraries(fio_vitastor_sec
 		tcmalloc_minimal
@@ -123,6 +124,18 @@ add_library(vitastor_client SHARED
 	cluster_client.cpp
 	cluster_client_list.cpp
 	vitastor_c.cpp
+	cli_common.cpp
+	cli_alloc_osd.cpp
+	cli_simple_offsets.cpp
+	cli_status.cpp
+	cli_df.cpp
+	cli_ls.cpp
+	cli_create.cpp
+	cli_modify.cpp
+	cli_flatten.cpp
+	cli_merge.cpp
+	cli_rm_data.cpp
+	cli_rm.cpp
 )
 set_target_properties(vitastor_client PROPERTIES PUBLIC_HEADER "vitastor_c.h")
 target_link_libraries(vitastor_client
@@ -151,10 +164,24 @@ target_link_libraries(vitastor-nbd
 	vitastor_client
 )

+# vitastor-nfs
+add_executable(vitastor-nfs
+	nfs_proxy.cpp
+	nfs_conn.cpp
+	nfs_portmap.cpp
+	sha256.c
+	nfs/xdr_impl.cpp
+	nfs/rpc_xdr.cpp
+	nfs/portmap_xdr.cpp
+	nfs/nfs_xdr.cpp
+)
+target_link_libraries(vitastor-nfs
+	vitastor_client
+)
+
 # vitastor-cli
 add_executable(vitastor-cli
-	cli.cpp cli_alloc_osd.cpp cli_simple_offsets.cpp cli_df.cpp
-	cli_ls.cpp cli_create.cpp cli_modify.cpp cli_flatten.cpp cli_merge.cpp cli_rm.cpp cli_snap_rm.cpp
+	cli.cpp
 )
 target_link_libraries(vitastor-cli
 	vitastor_client
@@ -189,11 +216,11 @@ endif (${WITH_QEMU})
 ### Test stubs

 # stub_osd, stub_bench, osd_test
-add_executable(stub_osd stub_osd.cpp rw_blocking.cpp)
+add_executable(stub_osd stub_osd.cpp rw_blocking.cpp addr_util.cpp)
 target_link_libraries(stub_osd tcmalloc_minimal)
-add_executable(stub_bench stub_bench.cpp rw_blocking.cpp)
+add_executable(stub_bench stub_bench.cpp rw_blocking.cpp addr_util.cpp)
 target_link_libraries(stub_bench tcmalloc_minimal)
-add_executable(osd_test osd_test.cpp rw_blocking.cpp)
+add_executable(osd_test osd_test.cpp rw_blocking.cpp addr_util.cpp)
 target_link_libraries(osd_test tcmalloc_minimal)

 # osd_rmw_test
@@ -243,7 +270,7 @@ target_include_directories(test_cluster_client PUBLIC ${CMAKE_SOURCE_DIR}/src/mo

 ### Install

-install(TARGETS vitastor-osd vitastor-dump-journal vitastor-nbd vitastor-cli RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
+install(TARGETS vitastor-osd vitastor-dump-journal vitastor-nbd vitastor-nfs vitastor-cli RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
 install_symlink(vitastor-cli ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_BINDIR}/vitastor-rm)
 install_symlink(vitastor-cli ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_BINDIR}/vita)
 install(
--- a/src/addr_util.cpp
+++ b/src/addr_util.cpp
@@ -0,0 +1,239 @@
+#include <errno.h>
+#include <sys/socket.h>
+#include <unistd.h>
+#include <arpa/inet.h>
+#include <net/if.h>
+#include <sys/types.h>
+#include <ifaddrs.h>
+#include <string.h>
+#include <stdio.h>
+
+#include <stdexcept>
+
+#include "addr_util.h"
+
+bool string_to_addr(std::string str, bool parse_port, int default_port, struct sockaddr_storage *addr)
+{
+    if (parse_port)
+    {
+        int p = str.rfind(':');
+        if (p != std::string::npos && !(str.length() > 0 && str[str.length()-1] == ']')) // "[ipv6]" which contains ':'
+        {
+            char null_byte = 0;
+            int n = sscanf(str.c_str()+p+1, "%d%c", &default_port, &null_byte);
+            if (n != 1 || default_port < 0 || default_port >= 0x10000)
+                return false;
+            str = str.substr(0, p);
+        }
+    }
+    if (inet_pton(AF_INET, str.c_str(), &((struct sockaddr_in*)addr)->sin_addr) == 1)
+    {
+        addr->ss_family = AF_INET;
+        ((struct sockaddr_in*)addr)->sin_port = htons(default_port);
+        return true;
+    }
+    if (str.length() >= 2 && str[0] == '[' && str[str.length()-1] == ']')
+        str = str.substr(1, str.length()-2);
+    if (inet_pton(AF_INET6, str.c_str(), &((struct sockaddr_in6*)addr)->sin6_addr) == 1)
+    {
+        addr->ss_family = AF_INET6;
+        ((struct sockaddr_in6*)addr)->sin6_port = htons(default_port);
+        return true;
+    }
+    return false;
+}
+
+std::string addr_to_string(const sockaddr_storage &addr)
+{
+    char peer_str[256];
+    bool ok = false;
+    int port;
+    if (addr.ss_family == AF_INET)
+    {
+        ok = !!inet_ntop(AF_INET, &((sockaddr_in*)&addr)->sin_addr, peer_str, 256);
+        port = ntohs(((sockaddr_in*)&addr)->sin_port);
+    }
+    else if (addr.ss_family == AF_INET6)
+    {
+        ok = !!inet_ntop(AF_INET6, &((sockaddr_in6*)&addr)->sin6_addr, peer_str, 256);
+        port = ntohs(((sockaddr_in6*)&addr)->sin6_port);
+    }
+    else
+        throw std::runtime_error("Unknown address family "+std::to_string(addr.ss_family));
+    if (!ok)
+        throw std::runtime_error(std::string("inet_ntop: ") + strerror(errno));
+    return std::string(peer_str)+":"+std::to_string(port);
+}
+
+static bool cidr_match(const in_addr &addr, const in_addr &net, uint8_t bits)
+{
+    if (bits == 0)
+    {
+        // C99 6.5.7 (3): u32 << 32 is undefined behaviour
+        return true;
+    }
+    return !((addr.s_addr ^ net.s_addr) & htonl(0xFFFFFFFFu << (32 - bits)));
+}
+
+static bool cidr6_match(const in6_addr &address, const in6_addr &network, uint8_t bits)
+{
+    const uint32_t *a = address.s6_addr32;
+    const uint32_t *n = network.s6_addr32;
+    int bits_whole, bits_incomplete;
+    bits_whole = bits >> 5;         // number of whole u32
+    bits_incomplete = bits & 0x1F;  // number of bits in incomplete u32
+    if (bits_whole && memcmp(a, n, bits_whole << 2))
+        return false;
+    if (bits_incomplete)
+    {
+        uint32_t mask = htonl((0xFFFFFFFFu) << (32 - bits_incomplete));
+        if ((a[bits_whole] ^ n[bits_whole]) & mask)
+            return false;
+    }
+    return true;
+}
+
+struct addr_mask_t
+{
+    sa_family_t family;
+    in_addr ipv4;
+    in6_addr ipv6;
+    uint8_t bits;
+};
+
+std::vector<std::string> getifaddr_list(std::vector<std::string> mask_cfg, bool include_v6)
+{
+    std::vector<addr_mask_t> masks;
+    for (auto mask: mask_cfg)
+    {
+        unsigned bits = 0;
+        int p = mask.find('/');
+        if (p != std::string::npos)
+        {
+            char null_byte = 0;
+            if (sscanf(mask.c_str()+p+1, "%u%c", &bits, &null_byte) != 1 || bits > 128)
+            {
+                throw std::runtime_error((include_v6 ? "Invalid IP address mask: " : "Invalid IPv4 address mask: ") + mask);
+            }
+            mask = mask.substr(0, p);
+        }
+        in_addr ipv4;
+        in6_addr ipv6;
+        if (inet_pton(AF_INET, mask.c_str(), &ipv4) == 1)
+        {
+            if (bits > 32)
+            {
+                throw std::runtime_error((include_v6 ? "Invalid IP address mask: " : "Invalid IPv4 address mask: ") + mask);
+            }
+            masks.push_back((addr_mask_t){ .family = AF_INET, .ipv4 = ipv4, .bits = (uint8_t)bits });
+        }
+        else if (include_v6 && inet_pton(AF_INET6, mask.c_str(), &ipv6) == 1)
+        {
+            masks.push_back((addr_mask_t){ .family = AF_INET6, .ipv6 = ipv6, .bits = (uint8_t)bits });
+        }
+        else
+        {
+            throw std::runtime_error((include_v6 ? "Invalid IP address mask: " : "Invalid IPv4 address mask: ") + mask);
+        }
+    }
+    std::vector<std::string> addresses;
+    ifaddrs *list, *ifa;
+    if (getifaddrs(&list) == -1)
+    {
+        throw std::runtime_error(std::string("getifaddrs: ") + strerror(errno));
+    }
+    for (ifa = list; ifa != NULL; ifa = ifa->ifa_next)
+    {
+        if (!ifa->ifa_addr)
+        {
+            continue;
+        }
+        int family = ifa->ifa_addr->sa_family;
+        if ((family == AF_INET || family == AF_INET6 && include_v6) &&
+            (ifa->ifa_flags & (IFF_UP | IFF_RUNNING | IFF_LOOPBACK)) == (IFF_UP | IFF_RUNNING))
+        {
+            void *addr_ptr;
+            if (family == AF_INET)
+            {
+                addr_ptr = &((sockaddr_in *)ifa->ifa_addr)->sin_addr;
+            }
+            else
+            {
+                addr_ptr = &((sockaddr_in6 *)ifa->ifa_addr)->sin6_addr;
+            }
+            if (masks.size() > 0)
+            {
+                int i;
+                for (i = 0; i < masks.size(); i++)
+                {
+                    if (masks[i].family == family && (family == AF_INET
+                        ? cidr_match(*(in_addr*)addr_ptr, masks[i].ipv4, masks[i].bits)
+                        : cidr6_match(*(in6_addr*)addr_ptr, masks[i].ipv6, masks[i].bits)))
+                    {
+                        break;
+                    }
+                }
+                if (i >= masks.size())
+                {
+                    continue;
+                }
+            }
+            char addr[INET6_ADDRSTRLEN];
+            if (!inet_ntop(family, addr_ptr, addr, INET6_ADDRSTRLEN))
+            {
+                throw std::runtime_error(std::string("inet_ntop: ") + strerror(errno));
+            }
+            addresses.push_back(std::string(addr));
+        }
+    }
+    freeifaddrs(list);
+    return addresses;
+}
+
+int create_and_bind_socket(std::string bind_address, int bind_port, int listen_backlog, int *listening_port)
+{
+    sockaddr_storage addr = {};
+    if (!string_to_addr(bind_address, 0, bind_port, &addr))
+    {
+        throw std::runtime_error("bind address "+bind_address+" is not valid");
+    }
+
+    int listen_fd = socket(addr.ss_family, SOCK_STREAM, 0);
+    if (listen_fd < 0)
+    {
+        throw std::runtime_error(std::string("socket: ") + strerror(errno));
+    }
+    int enable = 1;
+    setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(enable));
+
+    if (bind(listen_fd, (sockaddr*)&addr, sizeof(addr)) < 0)
+    {
+        close(listen_fd);
+        throw std::runtime_error(std::string("bind: ") + strerror(errno));
+    }
+    if (listening_port)
+    {
+        if (bind_port == 0)
+        {
+            socklen_t len = sizeof(addr);
+            if (getsockname(listen_fd, (sockaddr *)&addr, &len) == -1)
+            {
+                close(listen_fd);
+                throw std::runtime_error(std::string("getsockname: ") + strerror(errno));
+            }
+            *listening_port = ntohs(((sockaddr_in*)&addr)->sin_port);
+        }
+        else
+        {
+            *listening_port = bind_port;
+        }
+    }
+
+    if (listen(listen_fd, listen_backlog ? listen_backlog : 128) < 0)
+    {
+        close(listen_fd);
+        throw std::runtime_error(std::string("listen: ") + strerror(errno));
+    }
+
+    return listen_fd;
+}
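Review note on `string_to_addr()`: the port-splitting step has to tell apart `host:port`, a bare bracketed `[ipv6]`, and `[ipv6]:port`. The following is a minimal standalone sketch of that decision only, under the assumption that a trailing `]` means "no port suffix". `split_host_port` is a hypothetical helper written for illustration and is not part of `addr_util.cpp`.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper (illustration only, not part of addr_util.cpp).
// Splits "host", "host:port", "[v6]" or "[v6]:port" into host and port.
// `port` keeps its previous value when the string carries no port suffix.
static bool split_host_port(const std::string & str, std::string & host, int & port)
{
    host = str;
    size_t p = str.rfind(':');
    // A string ending with ']' is a bracketed IPv6 address without a port
    if (p != std::string::npos && str.back() != ']')
    {
        const std::string digits = str.substr(p+1);
        if (digits.empty() || digits.size() > 5 ||
            digits.find_first_not_of("0123456789") != std::string::npos)
        {
            return false;
        }
        int parsed = std::atoi(digits.c_str());
        if (parsed >= 0x10000)
        {
            return false;
        }
        port = parsed;
        host = str.substr(0, p);
    }
    // Strip the brackets so the result can be passed to inet_pton(AF_INET6, ...)
    if (host.size() >= 2 && host.front() == '[' && host.back() == ']')
    {
        host = host.substr(1, host.size()-2);
    }
    return true;
}
```

An unbracketed IPv6 address such as `::1` is still ambiguous in this sketch (its last group would be read as a port), so callers that accept ports would have to require the bracketed form.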
--- a/src/addr_util.h
+++ b/src/addr_util.h
@@ -0,0 +1,10 @@
+#pragma once
+
+#include <sys/socket.h>
+#include <string>
+#include <vector>
+
+bool string_to_addr(std::string str, bool parse_port, int default_port, struct sockaddr_storage *addr);
+std::string addr_to_string(const sockaddr_storage &addr);
+std::vector<std::string> getifaddr_list(std::vector<std::string> mask_cfg = std::vector<std::string>(), bool include_v6 = false);
+int create_and_bind_socket(std::string bind_address, int bind_port, int listen_backlog, int *listening_port);
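Review note on the `osd_network`-style filtering in `getifaddr_list()`: the IPv4 case boils down to comparing two addresses under a prefix mask in network byte order. Below is a small standalone sketch of that check. `ipv4_in_cidr` is a hypothetical wrapper used only for illustration; the static `cidr_match()` in the patch works on already parsed `in_addr` values.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (illustration only): true if `ip` lies inside `net`/`bits`
static bool ipv4_in_cidr(const std::string & ip, const std::string & net, unsigned bits)
{
    in_addr a, n;
    if (bits > 32 ||
        inet_pton(AF_INET, ip.c_str(), &a) != 1 ||
        inet_pton(AF_INET, net.c_str(), &n) != 1)
    {
        return false;
    }
    if (bits == 0)
    {
        // Shifting a 32-bit value by 32 is undefined, and /0 matches everything anyway
        return true;
    }
    // s_addr is stored in network byte order, so the mask goes through htonl()
    uint32_t mask = htonl(0xFFFFFFFFu << (32 - bits));
    return ((a.s_addr ^ n.s_addr) & mask) == 0;
}
```

For example, `ipv4_in_cidr("10.200.1.17", "10.200.1.0", 24)` holds, while the same call with `10.200.2.17` does not.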
--- a/src/allocator.cpp
+++ b/src/allocator.cpp
@@ -25,7 +25,7 @@ allocator::allocator(uint64_t blocks)
    size = free = blocks;
    last_one_mask = (blocks % 64) == 0
        ? UINT64_MAX
-        : ((1l << (blocks % 64)) - 1);
+        : (((uint64_t)1 << (blocks % 64)) - 1);
    for (uint64_t i = 0; i < total; i++)
    {
        mask[i] = 0;
@@ -79,7 +79,7 @@ void allocator::set(uint64_t addr, bool value)
            }
            if (value)
            {
-                mask[last] = mask[last] | (1l << bit);
+                mask[last] = mask[last] | ((uint64_t)1 << bit);
                if (mask[last] != (!is_last || cur_addr/64 < size/64
                    ? UINT64_MAX : last_one_mask))
                {
@@ -88,7 +88,7 @@ void allocator::set(uint64_t addr, bool value)
            }
            else
            {
-                mask[last] = mask[last] & ~(1l << bit);
+                mask[last] = mask[last] & ~((uint64_t)1 << bit);
            }
            is_last = false;
            if (p2 > 1)
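Review note on the `1l` to `(uint64_t)1` change: `long` is only guaranteed to be at least 32 bits wide, and shifting a signed value into or past its sign bit is not portable, so building 64-bit bitmap masks from `1l` is fragile. A minimal sketch of the mask expression with an unsigned 64-bit operand follows. `low_bits_mask` is a hypothetical name chosen for illustration; in the patch the expression is inlined in the `allocator` constructor.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (illustration only): mask of valid bits in the last
// 64-bit word of a bitmap that tracks `blocks` blocks
static uint64_t low_bits_mask(uint64_t blocks)
{
    uint64_t rem = blocks % 64;
    // rem is in 1..63 here, so the shift of a 64-bit unsigned value is well defined
    return rem == 0 ? UINT64_MAX : (((uint64_t)1 << rem) - 1);
}
```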
--- a/src/blockstore.h
+++ b/src/blockstore.h
@@ -21,7 +21,7 @@
 // Memory alignment for direct I/O (usually 512 bytes)
 // All other alignments must be a multiple of this one
 #ifndef MEM_ALIGNMENT
-#define MEM_ALIGNMENT 512
+#define MEM_ALIGNMENT 4096
 #endif

 // Default block size is 128 KB, current allowed range is 4K - 128M
--- a/src/blockstore_flush.cpp
+++ b/src/blockstore_flush.cpp
@@ -185,7 +185,7 @@ void journal_flusher_t::release_trim()
 void journal_flusher_t::dump_diagnostics()
 {
    const char *unflushable_type = "";
-    obj_ver_id unflushable = { 0 };
+    obj_ver_id unflushable = {};
    // Try to find out if there is a flushable object for information
    for (object_id cur_oid: flush_queue)
    {
@@ -415,8 +415,11 @@ stop_flusher:
        flusher->active_flushers++;
 resume_1:
        // Find it in clean_db
-        clean_it = bs->clean_db.find(cur.oid);
-        old_clean_loc = (clean_it != bs->clean_db.end() ? clean_it->second.location : UINT64_MAX);
+        {
+            auto & clean_db = bs->clean_db_shard(cur.oid);
+            auto clean_it = clean_db.find(cur.oid);
+            old_clean_loc = (clean_it != clean_db.end() ? clean_it->second.location : UINT64_MAX);
+        }
        // Scan dirty versions of the object
        if (!scan_dirty(1))
        {
@@ -486,8 +489,8 @@ resume_1:
        if (bs->clean_entry_bitmap_size)
        {
            new_clean_bitmap = (bs->inmemory_meta
-                ? meta_new.buf + meta_new.pos*bs->clean_entry_size + sizeof(clean_disk_entry)
-                : bs->clean_bitmap + (clean_loc >> bs->block_order)*(2*bs->clean_entry_bitmap_size));
+                ? (uint8_t*)meta_new.buf + meta_new.pos*bs->clean_entry_size + sizeof(clean_disk_entry)
+                : (uint8_t*)bs->clean_bitmap + (clean_loc >> bs->block_order)*(2*bs->clean_entry_bitmap_size));
            if (clean_init_bitmap)
            {
                memset(new_clean_bitmap, 0, bs->clean_entry_bitmap_size);
@@ -533,7 +536,7 @@ resume_1:
                return false;
            }
            // zero out old metadata entry
-            memset(meta_old.buf + meta_old.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
+            memset((uint8_t*)meta_old.buf + meta_old.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
            await_sqe(15);
            data->iov = (struct iovec){ meta_old.buf, bs->meta_block_size };
            data->callback = simple_callback_w;
@@ -544,7 +547,7 @@ resume_1:
        }
        if (has_delete)
        {
-            clean_disk_entry *new_entry = (clean_disk_entry*)(meta_new.buf + meta_new.pos*bs->clean_entry_size);
+            clean_disk_entry *new_entry = (clean_disk_entry*)((uint8_t*)meta_new.buf + meta_new.pos*bs->clean_entry_size);
            if (new_entry->oid.inode != 0 && new_entry->oid != cur.oid)
            {
                printf("Fatal error (metadata corruption or bug): tried to delete metadata entry %lu (%lx:%lx v%lu) while deleting %lx:%lx\n",
@@ -553,11 +556,11 @@ resume_1:
                exit(1);
            }
            // zero out new metadata entry
-            memset(meta_new.buf + meta_new.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
+            memset((uint8_t*)meta_new.buf + meta_new.pos*bs->clean_entry_size, 0, bs->clean_entry_size);
        }
        else
        {
-            clean_disk_entry *new_entry = (clean_disk_entry*)(meta_new.buf + meta_new.pos*bs->clean_entry_size);
+            clean_disk_entry *new_entry = (clean_disk_entry*)((uint8_t*)meta_new.buf + meta_new.pos*bs->clean_entry_size);
            if (new_entry->oid.inode != 0 && new_entry->oid != cur.oid)
            {
                printf("Fatal error (metadata corruption or bug): tried to overwrite non-zero metadata entry %lu (%lx:%lx v%lu) with %lx:%lx v%lu\n",
@@ -575,7 +578,7 @@ resume_1:
            if (bs->clean_entry_bitmap_size)
            {
                void *bmp_ptr = bs->clean_entry_bitmap_size > sizeof(void*) ? dirty_end->second.bitmap : &dirty_end->second.bitmap;
-                memcpy((void*)(new_entry+1) + bs->clean_entry_bitmap_size, bmp_ptr, bs->clean_entry_bitmap_size);
+                memcpy((uint8_t*)(new_entry+1) + bs->clean_entry_bitmap_size, bmp_ptr, bs->clean_entry_bitmap_size);
            }
        }
        await_sqe(6);
@@ -762,7 +765,7 @@ bool journal_flusher_co::scan_dirty(int wait_base)
                        if (bs->journal.inmemory)
                        {
                            // Take it from memory
-                            memcpy(it->buf, bs->journal.buffer + submit_offset, submit_len);
+                            memcpy(it->buf, (uint8_t*)bs->journal.buffer + submit_offset, submit_len);
                        }
                        else
                        {
@@ -826,7 +829,7 @@ bool journal_flusher_co::modify_meta_read(uint64_t meta_loc, flusher_meta_write_
    wr.pos = ((meta_loc >> bs->block_order) % (bs->meta_block_size / bs->clean_entry_size));
    if (bs->inmemory_meta)
    {
-        wr.buf = bs->metadata_buffer + wr.sector;
+        wr.buf = (uint8_t*)bs->metadata_buffer + wr.sector;
        return true;
    }
    wr.it = flusher->meta_sectors.find(wr.sector);
@@ -870,10 +873,11 @@ void journal_flusher_co::update_clean_db()
 #endif
        bs->data_alloc->set(old_clean_loc >> bs->block_order, false);
    }
+    auto & clean_db = bs->clean_db_shard(cur.oid);
    if (has_delete)
    {
-        auto clean_it = bs->clean_db.find(cur.oid);
-        bs->clean_db.erase(clean_it);
+        auto clean_it = clean_db.find(cur.oid);
+        clean_db.erase(clean_it);
 #ifdef BLOCKSTORE_DEBUG
        printf("Free block %lu from %lx:%lx v%lu (delete)\n",
            clean_loc >> bs->block_order,
@@ -884,7 +888,7 @@ void journal_flusher_co::update_clean_db()
    }
    else
    {
-        bs->clean_db[cur.oid] = {
+        clean_db[cur.oid] = {
            .version = cur.version,
            .location = clean_loc,
        };
--- a/src/blockstore_flush.h
+++ b/src/blockstore_flush.h
@@ -49,7 +49,6 @@ class journal_flusher_co
    std::function<void(ring_data_t*)> simple_callback_r, simple_callback_w;

    bool skip_copy, has_delete, has_writes;
-    blockstore_clean_db_t::iterator clean_it;
    std::vector<copy_buffer_t> v;
    std::vector<copy_buffer_t>::iterator it;
    int copy_count;
--- a/src/blockstore_impl.cpp
+++ b/src/blockstore_impl.cpp
@@ -118,7 +118,7 @@ void blockstore_impl_t::loop()
        // has_writes == 0 - no writes before the current queue item
        // has_writes == 1 - some writes in progress
        // has_writes == 2 - tried to submit some writes, but failed
-        int has_writes = 0, op_idx = 0, new_idx = 0;
+        int has_writes = 0, op_idx = 0, new_idx = 0, done_lists = 0;
        for (; op_idx < submit_queue.size(); op_idx++, new_idx++)
        {
            auto op = submit_queue[op_idx];
@@ -142,7 +142,6 @@ void blockstore_impl_t::loop()
                    continue;
                }
            }
-            unsigned ring_space = ringloop->space_left();
            unsigned prev_sqe_pos = ringloop->save();
            // 0 = can't submit
            // 1 = in progress
@@ -199,9 +198,14 @@ void blockstore_impl_t::loop()
            }
            else if (op->opcode == BS_OP_LIST)
            {
-                // LIST doesn't need to be blocked by previous modifications
-                process_list(op);
-                wr_st = 2;
+                // LIST doesn't have to be blocked by previous modifications
+                // But don't do a lot of LISTs at once, because they're blocking and potentially slow
+                if (single_tick_list_limit <= 0 || done_lists < single_tick_list_limit)
+                {
+                    process_list(op);
+                    done_lists++;
+                    wr_st = 2;
+                }
            }
            if (wr_st == 2)
            {
@@ -212,7 +216,6 @@ void blockstore_impl_t::loop()
                ringloop->restore(prev_sqe_pos);
                if (PRIV(op)->wait_for == WAIT_SQE)
                {
-                    PRIV(op)->wait_detail = 1 + ring_space;
                    // ring is full, stop submission
                    break;
                }
@@ -235,6 +238,12 @@ void blockstore_impl_t::loop()
        {
            throw std::runtime_error(std::string("io_uring_submit: ") + strerror(-ret));
        }
+        for (auto s: journal.submitting_sectors)
+        {
+            // Mark journal sector writes as submitted
+            journal.sector_info[s].submit_id = 0;
+        }
+        journal.submitting_sectors.clear();
        if ((initial_ring_space - ringloop->space_left()) > 0)
        {
            live = true;
@@ -276,7 +285,7 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
 {
    if (PRIV(op)->wait_for == WAIT_SQE)
    {
-        if (ringloop->space_left() < PRIV(op)->wait_detail)
+        if (ringloop->sqes_left() < PRIV(op)->wait_detail)
        {
            // stop submission if there's still no free space
 #ifdef BLOCKSTORE_DEBUG
@@ -366,7 +375,7 @@ void blockstore_impl_t::enqueue_op(blockstore_op_t *op)
                    };
                }
                unstable_writes.clear();
-                op->callback = [this, old_callback](blockstore_op_t *op)
+                op->callback = [old_callback](blockstore_op_t *op)
                {
                    obj_ver_id *vers = (obj_ver_id*)op->buf;
                    delete[] vers;
@@ -419,22 +428,104 @@ static bool replace_stable(object_id oid, uint64_t version, int search_start, in
    return false;
 }

+blockstore_clean_db_t& blockstore_impl_t::clean_db_shard(object_id oid)
+{
+    uint64_t pg_num = 0;
+    uint64_t pool_id = (oid.inode >> (64-POOL_ID_BITS));
+    auto sh_it = clean_db_settings.find(pool_id);
+    if (sh_it != clean_db_settings.end())
+    {
+        // like map_to_pg()
+        pg_num = (oid.stripe / sh_it->second.pg_stripe_size) % sh_it->second.pg_count + 1;
+    }
+    return clean_db_shards[(pool_id << (64-POOL_ID_BITS)) | pg_num];
+}
+
+void blockstore_impl_t::reshard_clean_db(pool_id_t pool, uint32_t pg_count, uint32_t pg_stripe_size)
+{
+    uint64_t pool_id = (uint64_t)pool;
+    std::map<pool_pg_id_t, blockstore_clean_db_t> new_shards;
+    auto sh_it = clean_db_shards.lower_bound((pool_id << (64-POOL_ID_BITS)));
+    while (sh_it != clean_db_shards.end() &&
+        (sh_it->first >> (64-POOL_ID_BITS)) == pool_id)
+    {
+        for (auto & pair: sh_it->second)
+        {
+            // like map_to_pg()
+            uint64_t pg_num = (pair.first.stripe / pg_stripe_size) % pg_count + 1;
+            uint64_t shard_id = (pool_id << (64-POOL_ID_BITS)) | pg_num;
+            new_shards[shard_id][pair.first] = pair.second;
+        }
+        clean_db_shards.erase(sh_it++);
+    }
+    for (sh_it = new_shards.begin(); sh_it != new_shards.end(); sh_it++)
+    {
+        auto & to = clean_db_shards[sh_it->first];
+        to.swap(sh_it->second);
+    }
+    clean_db_settings[pool_id] = (pool_shard_settings_t){
+        .pg_count = pg_count,
+        .pg_stripe_size = pg_stripe_size,
+    };
+}
+
 void blockstore_impl_t::process_list(blockstore_op_t *op)
 {
-    uint32_t list_pg = op->offset;
+    uint32_t list_pg = op->offset+1;
    uint32_t pg_count = op->len;
    uint64_t pg_stripe_size = op->oid.stripe;
    uint64_t min_inode = op->oid.inode;
    uint64_t max_inode = op->version;
    // Check PG
-    if (pg_count != 0 && (pg_stripe_size < MIN_BLOCK_SIZE || list_pg >= pg_count))
+    if (pg_count != 0 && (pg_stripe_size < MIN_BLOCK_SIZE || list_pg > pg_count))
    {
        op->retval = -EINVAL;
        FINISH_OP(op);
        return;
    }
-    // Copy clean_db entries (sorted)
-    int stable_count = 0, stable_alloc = clean_db.size() / (pg_count ? pg_count : 1);
+    // Check if the DB needs resharding
+    // (we don't know about PGs from the beginning, we only create "shards" here)
+    uint64_t first_shard = 0, last_shard = UINT64_MAX;
+    if (min_inode != 0 &&
+        // Check if min_inode and max_inode belong to the same pool, i.e. this is a pool listing
+        (min_inode >> (64-POOL_ID_BITS)) == (max_inode >> (64-POOL_ID_BITS)))
+    {
+        pool_id_t pool_id = (min_inode >> (64-POOL_ID_BITS));
+        if (pg_count > 1)
+        {
+            // Per-pg listing
+            auto sh_it = clean_db_settings.find(pool_id);
+            if (sh_it == clean_db_settings.end() ||
+                sh_it->second.pg_count != pg_count ||
+                sh_it->second.pg_stripe_size != pg_stripe_size)
+            {
+                reshard_clean_db(pool_id, pg_count, pg_stripe_size);
+            }
+            first_shard = last_shard = ((uint64_t)pool_id << (64-POOL_ID_BITS)) | list_pg;
+        }
+        else
+        {
+            // Per-pool listing
+            first_shard = ((uint64_t)pool_id << (64-POOL_ID_BITS));
+            last_shard = ((uint64_t)(pool_id+1) << (64-POOL_ID_BITS)) - 1;
+        }
+    }
+    // Copy clean_db entries
+    int stable_count = 0, stable_alloc = 0;
+    if (min_inode != max_inode)
+    {
+        for (auto shard_it = clean_db_shards.lower_bound(first_shard);
+            shard_it != clean_db_shards.end() && shard_it->first <= last_shard;
+            shard_it++)
+        {
+            auto & clean_db = shard_it->second;
+            stable_alloc += clean_db.size();
+        }
+    }
+    else
+    {
+        stable_alloc = 32768;
+    }
    obj_ver_id *stable = (obj_ver_id*)malloc(sizeof(obj_ver_id) * stable_alloc);
    if (!stable)
    {
@@ -442,7 +533,11 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
        FINISH_OP(op);
        return;
    }
+    for (auto shard_it = clean_db_shards.lower_bound(first_shard);
+        shard_it != clean_db_shards.end() && shard_it->first <= last_shard;
+        shard_it++)
    {
+        auto & clean_db = shard_it->second;
        auto clean_it = clean_db.begin(), clean_end = clean_db.end();
        if ((min_inode != 0 || max_inode != 0) && min_inode <= max_inode)
        {
@@ -457,26 +552,28 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
        }
        for (; clean_it != clean_end; clean_it++)
        {
-            if (!pg_count || ((clean_it->first.stripe / pg_stripe_size) % pg_count) == list_pg) // like map_to_pg()
+            if (stable_count >= stable_alloc)
            {
-                if (stable_count >= stable_alloc)
+                stable_alloc *= 2;
+                stable = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
+                if (!stable)
                {
-                    stable_alloc += 32768;
-                    stable = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
-                    if (!stable)
-                    {
-                        op->retval = -ENOMEM;
-                        FINISH_OP(op);
-                        return;
-                    }
+                    op->retval = -ENOMEM;
+                    FINISH_OP(op);
+                    return;
                }
-                stable[stable_count++] = {
-                    .oid = clean_it->first,
-                    .version = clean_it->second.version,
-                };
            }
+            stable[stable_count++] = {
+                .oid = clean_it->first,
+                .version = clean_it->second.version,
+            };
        }
    }
+    if (first_shard != last_shard)
+    {
+        // If that's not a per-PG listing, sort clean entries
+        std::sort(stable, stable+stable_count);
+    }
    int clean_stable_count = stable_count;
    // Copy dirty_db entries (sorted, too)
    int unstable_count = 0, unstable_alloc = 0;
@@ -502,7 +599,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
        }
        for (; dirty_it != dirty_end; dirty_it++)
        {
-            if (!pg_count || ((dirty_it->first.oid.stripe / pg_stripe_size) % pg_count) == list_pg) // like map_to_pg()
+            if (!pg_count || ((dirty_it->first.oid.stripe / pg_stripe_size) % pg_count + 1) == list_pg) // like map_to_pg()
            {
                if (IS_DELETE(dirty_it->second.state))
                {
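Review note on the `clean_db` sharding: the shard map is keyed by a 64-bit value with the pool ID in the top `POOL_ID_BITS` bits and a 1-based PG number in the low bits, with PG number 0 used for objects of pools that have not been resharded yet. The sketch below reproduces just that arithmetic. The helper names and the `POOL_ID_BITS_SKETCH` constant are assumptions made for illustration; the patch computes the same expressions inline.

```cpp
#include <cassert>
#include <cstdint>

// Assumed to mirror POOL_ID_BITS (16) from blockstore_impl.h
static const int POOL_ID_BITS_SKETCH = 16;

// Hypothetical helper: 1-based PG number of a stripe, as in the "like map_to_pg()" lines
static uint64_t stripe_to_pg(uint64_t stripe, uint64_t pg_stripe_size, uint64_t pg_count)
{
    return (stripe / pg_stripe_size) % pg_count + 1;
}

// Hypothetical helper: pool ID in the top bits, PG number in the low bits
static uint64_t shard_key(uint64_t pool_id, uint64_t pg_num)
{
    return (pool_id << (64 - POOL_ID_BITS_SKETCH)) | pg_num;
}
```

Because the pool ID occupies the most significant bits, all shards of one pool form a contiguous key range in an ordered map, which is what lets a per-pool listing walk from `shard_key(pool, 0)` up to `shard_key(pool+1, 0) - 1`.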
--- a/src/blockstore_impl.h
+++ b/src/blockstore_impl.h
@@ -54,6 +54,15 @@
 #define IS_BIG_WRITE(st) (((st) & 0x0F) == BS_ST_BIG_WRITE)
 #define IS_DELETE(st) (((st) & 0x0F) == BS_ST_DELETE)

+#define BS_SUBMIT_CHECK_SQES(n) \
+    if (ringloop->sqes_left() < (n))\
+    {\
+        /* Pause until there are more requests available */\
+        PRIV(op)->wait_detail = (n);\
+        PRIV(op)->wait_for = WAIT_SQE;\
+        return 0;\
+    }
+
 #define BS_SUBMIT_GET_SQE(sqe, data) \
    BS_SUBMIT_GET_ONLY_SQE(sqe); \
    struct ring_data_t *data = ((ring_data_t*)sqe->user_data)
@@ -63,6 +72,7 @@
    if (!sqe)\
    {\
        /* Pause until there are more requests available */\
+        PRIV(op)->wait_detail = 1;\
        PRIV(op)->wait_for = WAIT_SQE;\
        return 0;\
    }
@@ -72,6 +82,7 @@
    if (!sqe)\
    {\
        /* Pause until there are more requests available */\
+        PRIV(op)->wait_detail = 1;\
        PRIV(op)->wait_for = WAIT_SQE;\
        return 0;\
    }
@@ -170,7 +181,7 @@ struct blockstore_op_private_t
    std::vector<fulfill_read_t> read_vec;

    // Sync, write
-    uint64_t min_flushed_journal_sector, max_flushed_journal_sector;
+    int min_flushed_journal_sector, max_flushed_journal_sector;

    // Write
    struct iovec iov_zerofill[3];
@@ -193,6 +204,17 @@ typedef std::map<obj_ver_id, dirty_entry> blockstore_dirty_db_t;

 #include "blockstore_flush.h"

+typedef uint32_t pool_id_t;
+typedef uint64_t pool_pg_id_t;
+
+#define POOL_ID_BITS 16
+
+struct pool_shard_settings_t
+{
+    uint32_t pg_count;
+    uint32_t pg_stripe_size;
+};
+
 class blockstore_impl_t
 {
    /******* OPTIONS *******/
@@ -230,11 +252,14 @@ class blockstore_impl_t
    int throttle_target_parallelism = 1;
    // Minimum difference in microseconds between target and real execution times to throttle the response
    int throttle_threshold_us = 50;
+    // Maximum number of LIST operations to be processed within a single event loop tick
+    int single_tick_list_limit = 1;
    /******* END OF OPTIONS *******/

    struct ring_consumer_t ring_consumer;

-    blockstore_clean_db_t clean_db;
+    std::map<pool_id_t, pool_shard_settings_t> clean_db_settings;
+    std::map<pool_pg_id_t, blockstore_clean_db_t> clean_db_shards;
    uint8_t *clean_bitmap = NULL;
    blockstore_dirty_db_t dirty_db;
    std::vector<blockstore_op_t*> submit_queue;
@@ -272,7 +297,7 @@ class blockstore_impl_t

    friend class blockstore_init_meta;
    friend class blockstore_init_journal;
-    friend class blockstore_journal_check_t;
+    friend struct blockstore_journal_check_t;
    friend class journal_flusher_t;
    friend class journal_flusher_co;

@@ -283,6 +308,13 @@ class blockstore_impl_t
    void open_journal();
    uint8_t* get_clean_entry_bitmap(uint64_t block_loc, int offset);

+    blockstore_clean_db_t& clean_db_shard(object_id oid);
+    void reshard_clean_db(pool_id_t pool_id, uint32_t pg_count, uint32_t pg_stripe_size);
+
+    // Journaling
+    void prepare_journal_sector_write(int sector, blockstore_op_t *op);
+    void handle_journal_write(ring_data_t *data, uint64_t flush_id);
+
    // Asynchronous init
    int initialized;
    int metadata_buf_size;
@@ -310,21 +342,18 @@ class blockstore_impl_t

    // Sync
    int continue_sync(blockstore_op_t *op, bool queue_has_in_progress_sync);
-    void handle_sync_event(ring_data_t *data, blockstore_op_t *op);
    void ack_sync(blockstore_op_t *op);

    // Stabilize
    int dequeue_stable(blockstore_op_t *op);
    int continue_stable(blockstore_op_t *op);
    void mark_stable(const obj_ver_id & ov, bool forget_dirty = false);
-    void handle_stable_event(ring_data_t *data, blockstore_op_t *op);
    void stabilize_object(object_id oid, uint64_t max_ver);

    // Rollback
    int dequeue_rollback(blockstore_op_t *op);
    int continue_rollback(blockstore_op_t *op);
    void mark_rolled_back(const obj_ver_id & ov);
-    void handle_rollback_event(ring_data_t *data, blockstore_op_t *op);
    void erase_dirty(blockstore_dirty_db_t::iterator dirty_start, blockstore_dirty_db_t::iterator dirty_end, uint64_t clean_loc);

    // List
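Review note: the header above replaces the single `clean_db` with `clean_db_shards` keyed by `pool_pg_id_t`, but the body of `clean_db_shard()` is not part of this excerpt. The following is a minimal sketch of how such a key could be derived. It assumes that the pool ID occupies the top `POOL_ID_BITS` bits of the inode number and that the PG number uses the same 1-based formula as the listing filter in `process_list()`. The helper name `shard_key` and the key layout are hypothetical and not taken from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Types copied from the diff above; everything else is an assumption
typedef uint32_t pool_id_t;
typedef uint64_t pool_pg_id_t;

#define POOL_ID_BITS 16

struct pool_shard_settings_t
{
    uint32_t pg_count;
    uint32_t pg_stripe_size;
};

// Hypothetical helper: compute the shard key for an object.
// ASSUMPTION: the pool ID is stored in the top POOL_ID_BITS bits of the inode number.
inline pool_pg_id_t shard_key(const std::map<pool_id_t, pool_shard_settings_t> & settings,
    uint64_t inode, uint64_t stripe)
{
    uint64_t pool_id = inode >> (64 - POOL_ID_BITS);
    // PG number 0 means "no shard settings known for this pool": one shard per pool
    uint64_t pg_num = 0;
    auto it = settings.find((pool_id_t)pool_id);
    if (it != settings.end() && it->second.pg_count && it->second.pg_stripe_size)
    {
        // Same 1-based formula as the dirty_db filter in process_list()
        pg_num = (stripe / it->second.pg_stripe_size) % it->second.pg_count + 1;
    }
    // Pool ID in the high bits, PG number in the low bits
    return (pool_id << (64 - POOL_ID_BITS)) | pg_num;
}
```

With a layout like this, all shards of one pool are adjacent in the `std::map`, so a per-PG listing can take one shard and a whole-pool listing can walk a contiguous key range.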
--- a/src/blockstore_init.cpp
+++ b/src/blockstore_init.cpp
@@ -131,6 +131,7 @@ resume_1:
    }
    // Skip superblock
    bs->meta_offset += bs->meta_block_size;
+    bs->meta_len -= bs->meta_block_size;
    prev_done = 0;
    done_len = 0;
    done_pos = 0;
@@ -148,7 +149,7 @@ resume_1:
        {
            GET_SQE();
            data->iov = {
-                metadata_buffer + (bs->inmemory_meta
+                (uint8_t*)metadata_buffer + (bs->inmemory_meta
                    ? metadata_read
                    : (prev == 1 ? bs->metadata_buf_size : 0)),
                bs->meta_len - metadata_read > bs->metadata_buf_size ? bs->metadata_buf_size : bs->meta_len - metadata_read,
@@ -169,13 +170,13 @@ resume_1:
        if (prev_done)
        {
            void *done_buf = bs->inmemory_meta
-                ? (metadata_buffer + done_pos)
-                : (metadata_buffer + (prev_done == 2 ? bs->metadata_buf_size : 0));
+                ? ((uint8_t*)metadata_buffer + done_pos)
+                : ((uint8_t*)metadata_buffer + (prev_done == 2 ? bs->metadata_buf_size : 0));
            unsigned count = bs->meta_block_size / bs->clean_entry_size;
            for (int sector = 0; sector < done_len; sector += bs->meta_block_size)
            {
                // handle <count> entries
-                handle_entries(done_buf + sector, count, bs->block_order);
+                handle_entries((uint8_t*)done_buf + sector, count, bs->block_order);
                done_cnt += count;
            }
            prev_done = 0;
@@ -215,17 +216,18 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
 {
    for (unsigned i = 0; i < count; i++)
    {
-        clean_disk_entry *entry = (clean_disk_entry*)(entries + i*bs->clean_entry_size);
+        clean_disk_entry *entry = (clean_disk_entry*)((uint8_t*)entries + i*bs->clean_entry_size);
        if (!bs->inmemory_meta && bs->clean_entry_bitmap_size)
        {
            memcpy(bs->clean_bitmap + (done_cnt+i)*2*bs->clean_entry_bitmap_size, &entry->bitmap, 2*bs->clean_entry_bitmap_size);
        }
        if (entry->oid.inode > 0)
        {
-            auto clean_it = bs->clean_db.find(entry->oid);
-            if (clean_it == bs->clean_db.end() || clean_it->second.version < entry->version)
+            auto & clean_db = bs->clean_db_shard(entry->oid);
+            auto clean_it = clean_db.find(entry->oid);
+            if (clean_it == clean_db.end() || clean_it->second.version < entry->version)
            {
-                if (clean_it != bs->clean_db.end())
+                if (clean_it != clean_db.end())
                {
                    // free the previous block
 #ifdef BLOCKSTORE_DEBUG
@@ -245,7 +247,7 @@ void blockstore_init_meta::handle_entries(void* entries, unsigned count, int blo
                printf("Allocate block (clean entry) %lu: %lx:%lx v%lu\n", done_cnt+i, entry->oid.inode, entry->oid.stripe, entry->version);
 #endif
                bs->data_alloc->set(done_cnt+i, true);
-                bs->clean_db[entry->oid] = (struct clean_entry){
+                clean_db[entry->oid] = (struct clean_entry){
                    .version = entry->version,
                    .location = (done_cnt+i) << block_order,
                };
@@ -440,7 +442,7 @@ resume_1:
                if (!bs->journal.inmemory)
                    submitted_buf = memalign_or_die(MEM_ALIGNMENT, JOURNAL_BUFFER_SIZE);
                else
-                    submitted_buf = bs->journal.buffer + journal_pos;
+                    submitted_buf = (uint8_t*)bs->journal.buffer + journal_pos;
                data->iov = {
                    submitted_buf,
                    end - journal_pos < JOURNAL_BUFFER_SIZE ? end - journal_pos : JOURNAL_BUFFER_SIZE,
@@ -570,7 +572,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
    resume:
        while (pos < bs->journal.block_size)
        {
-            journal_entry *je = (journal_entry*)(buf + proc_pos - done_pos + pos);
+            journal_entry *je = (journal_entry*)((uint8_t*)buf + proc_pos - done_pos + pos);
            if (je->magic != JOURNAL_MAGIC || je_crc32(je) != je->crc32 ||
                je->type < JE_MIN || je->type > JE_MAX || started && je->crc32_prev != crc32_last)
            {
@@ -619,7 +621,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                if (location >= done_pos && location+je->small_write.len <= done_pos+len)
                {
                    // data is within this buffer
-                    data_crc32 = crc32c(0, buf + location - done_pos, je->small_write.len);
+                    data_crc32 = crc32c(0, (uint8_t*)buf + location - done_pos, je->small_write.len);
                }
                else
                {
@@ -634,7 +636,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                                ? location+je->small_write.len : done[i].pos+done[i].len);
                            uint64_t part_begin = (location < done[i].pos ? done[i].pos : location);
                            covered += part_end - part_begin;
-                            data_crc32 = crc32c(data_crc32, done[i].buf + part_begin - done[i].pos, part_end - part_begin);
+                            data_crc32 = crc32c(data_crc32, (uint8_t*)done[i].buf + part_begin - done[i].pos, part_end - part_begin);
                        }
                    }
                    if (covered < je->small_write.len)
@@ -650,14 +652,15 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                    // interesting thing is that we must clear the corrupt entry if we're not readonly,
                    // because we don't write next entries in the same journal block
                    printf("Journal entry data is corrupt (data crc32 %x != %x)\n", data_crc32, je->small_write.crc32_data);
-                    memset(buf + proc_pos - done_pos + pos, 0, bs->journal.block_size - pos);
+                    memset((uint8_t*)buf + proc_pos - done_pos + pos, 0, bs->journal.block_size - pos);
                    bs->journal.next_free = prev_free;
-                    init_write_buf = buf + proc_pos - done_pos;
+                    init_write_buf = (uint8_t*)buf + proc_pos - done_pos;
                    init_write_sector = proc_pos;
                    return 0;
                }
-                auto clean_it = bs->clean_db.find(je->small_write.oid);
-                if (clean_it == bs->clean_db.end() ||
+                auto & clean_db = bs->clean_db_shard(je->small_write.oid);
+                auto clean_it = clean_db.find(je->small_write.oid);
+                if (clean_it == clean_db.end() ||
                    clean_it->second.version < je->small_write.version)
                {
                    obj_ver_id ov = {
@@ -665,7 +668,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                        .version = je->small_write.version,
                    };
                    void *bmp = NULL;
-                    void *bmp_from = (void*)je + sizeof(journal_entry_small_write);
+                    void *bmp_from = (uint8_t*)je + sizeof(journal_entry_small_write);
                    if (bs->clean_entry_bitmap_size <= sizeof(void*))
                    {
                        memcpy(&bmp, bmp_from, bs->clean_entry_bitmap_size);
@@ -735,8 +738,9 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                        erase_dirty_object(dirty_it);
                    }
                }
-                auto clean_it = bs->clean_db.find(je->big_write.oid);
-                if (clean_it == bs->clean_db.end() ||
+                auto & clean_db = bs->clean_db_shard(je->big_write.oid);
+                auto clean_it = clean_db.find(je->big_write.oid);
+                if (clean_it == clean_db.end() ||
                    clean_it->second.version < je->big_write.version)
                {
                    // oid, version, block
@@ -745,7 +749,7 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                        .version = je->big_write.version,
                    };
                    void *bmp = NULL;
-                    void *bmp_from = (void*)je + sizeof(journal_entry_big_write);
+                    void *bmp_from = (uint8_t*)je + sizeof(journal_entry_big_write);
                    if (bs->clean_entry_bitmap_size <= sizeof(void*))
                    {
                        memcpy(&bmp, bmp_from, bs->clean_entry_bitmap_size);
@@ -841,8 +845,9 @@ int blockstore_init_journal::handle_journal_part(void *buf, uint64_t done_pos, u
                    dirty_it--;
                    dirty_exists = dirty_it->first.oid == je->del.oid;
                }
-                auto clean_it = bs->clean_db.find(je->del.oid);
-                bool clean_exists = (clean_it != bs->clean_db.end() &&
+                auto & clean_db = bs->clean_db_shard(je->del.oid);
+                auto clean_it = clean_db.find(je->del.oid);
+                bool clean_exists = (clean_it != clean_db.end() &&
                    clean_it->second.version < je->del.version);
                if (!clean_exists && dirty_exists)
                {
@@ -901,8 +906,9 @@ void blockstore_init_journal::erase_dirty_object(blockstore_dirty_db_t::iterator
            break;
        }
    }
-    auto clean_it = bs->clean_db.find(oid);
-    uint64_t clean_loc = clean_it != bs->clean_db.end()
+    auto & clean_db = bs->clean_db_shard(oid);
+    auto clean_it = clean_db.find(oid);
+    uint64_t clean_loc = clean_it != clean_db.end()
        ? clean_it->second.location : UINT64_MAX;
    if (exists && clean_loc == UINT64_MAX)
    {
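Review note: most hunks in this file replace arithmetic on `void*` with arithmetic on `uint8_t*`. Pointer arithmetic on `void*` is a GNU extension rather than standard C++, so the cast is what makes the offset computation portable and warning-free. A minimal sketch of the same idea follows; the helpers `byte_offset` and `zero_tail` are hypothetical and only illustrate the pattern, they are not part of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: advance an untyped buffer by a byte offset.
// The cast to uint8_t* makes the element size explicit (1 byte).
inline void* byte_offset(void *buf, size_t offset)
{
    return (uint8_t*)buf + offset;
}

// Hypothetical example: zero the tail of a block starting at `pos`,
// similar to how the journal init code clears a corrupt entry with memset()
inline void zero_tail(void *block, size_t block_size, size_t pos)
{
    if (pos < block_size)
    {
        memset(byte_offset(block, pos), 0, block_size - pos);
    }
}
```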
--- a/src/blockstore_init.h
+++ b/src/blockstore_init.h
@@ -6,7 +6,7 @@
 class blockstore_init_meta
 {
    blockstore_impl_t *bs;
-    int wait_state = 0, wait_count = 0;
+    int wait_state = 0;
    bool zero_on_init = false;
    void *metadata_buffer = NULL;
    uint64_t metadata_read = 0;
--- a/src/blockstore_journal.cpp
+++ b/src/blockstore_journal.cpp
@@ -96,7 +96,8 @@ int blockstore_journal_check_t::check_available(blockstore_op_t *op, int entries
        next_pos = next_pos + data_after;
        if (next_pos > bs->journal.len)
        {
-            next_pos = bs->journal.block_size + data_after;
+            if (right_dir)
+                next_pos = bs->journal.block_size + data_after;
            right_dir = false;
        }
    }
@@ -136,13 +137,13 @@ journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type,
        journal.in_sector_pos = 0;
        journal.next_free = (journal.next_free+journal.block_size) < journal.len ? journal.next_free + journal.block_size : journal.block_size;
        memset(journal.inmemory
-            ? journal.buffer + journal.sector_info[journal.cur_sector].offset
-            : journal.sector_buf + journal.block_size*journal.cur_sector, 0, journal.block_size);
+            ? (uint8_t*)journal.buffer + journal.sector_info[journal.cur_sector].offset
+            : (uint8_t*)journal.sector_buf + journal.block_size*journal.cur_sector, 0, journal.block_size);
    }
    journal_entry *je = (struct journal_entry*)(
        (journal.inmemory
-            ? journal.buffer + journal.sector_info[journal.cur_sector].offset
-            : journal.sector_buf + journal.block_size*journal.cur_sector) + journal.in_sector_pos
+            ? (uint8_t*)journal.buffer + journal.sector_info[journal.cur_sector].offset
+            : (uint8_t*)journal.sector_buf + journal.block_size*journal.cur_sector) + journal.in_sector_pos
    );
    journal.in_sector_pos += size;
    je->magic = JOURNAL_MAGIC;
@@ -153,22 +154,73 @@ journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type,
    return je;
 }

-void prepare_journal_sector_write(journal_t & journal, int cur_sector, io_uring_sqe *sqe, std::function<void(ring_data_t*)> cb)
+void blockstore_impl_t::prepare_journal_sector_write(int cur_sector, blockstore_op_t *op)
 {
+    // Don't submit the same sector twice in the same batch
+    if (!journal.sector_info[cur_sector].submit_id)
+    {
+        io_uring_sqe *sqe = get_sqe();
+        // Caller must ensure availability of an SQE
+        assert(sqe != NULL);
+        ring_data_t *data = ((ring_data_t*)sqe->user_data);
+        journal.sector_info[cur_sector].written = true;
+        journal.sector_info[cur_sector].submit_id = ++journal.submit_id;
+        journal.submitting_sectors.push_back(cur_sector);
+        journal.sector_info[cur_sector].flush_count++;
+        data->iov = (struct iovec){
+            (journal.inmemory
+                ? (uint8_t*)journal.buffer + journal.sector_info[cur_sector].offset
+                : (uint8_t*)journal.sector_buf + journal.block_size*cur_sector),
+            journal.block_size
+        };
+        data->callback = [this, flush_id = journal.submit_id](ring_data_t *data) { handle_journal_write(data, flush_id); };
+        my_uring_prep_writev(
+            sqe, journal.fd, &data->iov, 1, journal.offset + journal.sector_info[cur_sector].offset
+        );
+    }
    journal.sector_info[cur_sector].dirty = false;
-    journal.sector_info[cur_sector].written = true;
-    journal.sector_info[cur_sector].flush_count++;
-    ring_data_t *data = ((ring_data_t*)sqe->user_data);
-    data->iov = (struct iovec){
-        (journal.inmemory
-            ? journal.buffer + journal.sector_info[cur_sector].offset
-            : journal.sector_buf + journal.block_size*cur_sector),
-        journal.block_size
-    };
-    data->callback = cb;
-    my_uring_prep_writev(
-        sqe, journal.fd, &data->iov, 1, journal.offset + journal.sector_info[cur_sector].offset
-    );
+    // But always remember that this operation has to wait until this exact journal write is finished
+    journal.flushing_ops.insert((pending_journaling_t){
+        .flush_id = journal.sector_info[cur_sector].submit_id,
+        .sector = cur_sector,
+        .op = op,
+    });
+    auto priv = PRIV(op);
+    priv->pending_ops++;
+    if (!priv->min_flushed_journal_sector)
+        priv->min_flushed_journal_sector = 1+cur_sector;
+    priv->max_flushed_journal_sector = 1+cur_sector;
+}
+
+void blockstore_impl_t::handle_journal_write(ring_data_t *data, uint64_t flush_id)
+{
+    live = true;
+    if (data->res != data->iov.iov_len)
+    {
+        // FIXME: Our in-memory state becomes corrupted after a write error. Maybe do something better than just die
+        throw std::runtime_error(
+            "journal write failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
+            "). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
+        );
+    }
+    auto fl_it = journal.flushing_ops.upper_bound((pending_journaling_t){ .flush_id = flush_id });
+    if (fl_it != journal.flushing_ops.end() && fl_it->flush_id == flush_id)
+    {
+        journal.sector_info[fl_it->sector].flush_count--;
+    }
+    while (fl_it != journal.flushing_ops.end() && fl_it->flush_id == flush_id)
+    {
+        auto priv = PRIV(fl_it->op);
+        priv->pending_ops--;
+        assert(priv->pending_ops >= 0);
+        if (priv->pending_ops == 0)
+        {
+            release_journal_sectors(fl_it->op);
+            priv->op_state++;
+            ringloop->wakeup();
+        }
+        journal.flushing_ops.erase(fl_it++);
+    }
 }

 journal_t::~journal_t()
--- a/src/blockstore_journal.h
+++ b/src/blockstore_journal.h
@@ -4,6 +4,7 @@
 #pragma once

 #include "crc32c.h"
+#include <set>

 #define MIN_JOURNAL_SIZE 4*1024*1024
 #define JOURNAL_MAGIC 0x4A33
@@ -145,8 +146,21 @@ struct journal_sector_info_t
    uint64_t flush_count;
    bool written;
    bool dirty;
+    uint64_t submit_id;
 };

+struct pending_journaling_t
+{
+    uint64_t flush_id;
+    int sector;
+    blockstore_op_t *op;
+};
+
+inline bool operator < (const pending_journaling_t & a, const pending_journaling_t & b)
+{
+    return a.flush_id < b.flush_id || (a.flush_id == b.flush_id && a.op < b.op);
+}
+
 struct journal_t
 {
    int fd;
@@ -172,6 +186,9 @@ struct journal_t
    bool no_same_sector_overwrites = false;
    int cur_sector = 0;
    int in_sector_pos = 0;
+    std::vector<int> submitting_sectors;
+    std::set<pending_journaling_t> flushing_ops;
+    uint64_t submit_id = 0;

    // Used sector map
    // May use ~ 80 MB per 1 GB of used journal space in the worst case
@@ -200,5 +217,3 @@ struct blockstore_journal_check_t
 };

 journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type, uint32_t size);
-
-void prepare_journal_sector_write(journal_t & journal, int sector, io_uring_sqe *sqe, std::function<void(ring_data_t*)> cb);
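Review note: `handle_journal_write()` finds all operations waiting for one journal write by probing the `(flush_id, op)`-ordered set with a key whose `op` is null. A minimal, simplified sketch of that lookup follows. `op_t`, `pending_t` and `complete_flush` are stand-in names for illustration, not the patch's types. The sketch uses `std::less` for the pointer comparison instead of the built-in `<`, and it assumes that a null pointer orders before every real pointer, which holds on mainstream platforms but is not guaranteed by the C++ standard.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <set>
#include <vector>

struct op_t { int id; }; // stand-in for blockstore_op_t

struct pending_t // stand-in for pending_journaling_t
{
    uint64_t flush_id;
    int sector;
    op_t *op;
};

// Order by flush_id first, then by op pointer, so entries of one flush are adjacent
inline bool operator < (const pending_t & a, const pending_t & b)
{
    return a.flush_id < b.flush_id ||
        (a.flush_id == b.flush_id && std::less<op_t*>()(a.op, b.op));
}

// Remove and return all ops that were waiting for the write with this flush_id
inline std::vector<op_t*> complete_flush(std::set<pending_t> & pending, uint64_t flush_id)
{
    std::vector<op_t*> done;
    // ASSUMPTION: the probe { flush_id, 0, nullptr } sorts before every real entry
    // with the same flush_id, so upper_bound() lands on the first of them
    auto it = pending.upper_bound(pending_t{ flush_id, 0, nullptr });
    while (it != pending.end() && it->flush_id == flush_id)
    {
        done.push_back(it->op);
        it = pending.erase(it); // erase() returns the iterator to the next element
    }
    return done;
}
```

Because one sector write can serve several operations, keeping the waiters in an ordered set lets a single completion wake all of them without scanning unrelated entries.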
--- a/src/blockstore_open.cpp
+++ b/src/blockstore_open.cpp
@@ -306,6 +306,10 @@ static void check_size(int fd, uint64_t *size, uint64_t *sectsize, std::string n
    if (S_ISREG(st.st_mode))
    {
        *size = st.st_size;
+        if (sectsize)
+        {
+            *sectsize = st.st_blksize;
+        }
    }
    else if (S_ISBLK(st.st_mode))
    {
--- a/src/blockstore_read.cpp
+++ b/src/blockstore_read.cpp
@@ -24,7 +24,7 @@ int blockstore_impl_t::fulfill_read_push(blockstore_op_t *op, void *buf, uint64_
    }
    if (journal.inmemory && IS_JOURNAL(item_state))
    {
-        memcpy(buf, journal.buffer + offset, len);
+        memcpy(buf, (uint8_t*)journal.buffer + offset, len);
        return 1;
    }
    BS_SUBMIT_GET_SQE(sqe, data);
@@ -75,7 +75,7 @@ int blockstore_impl_t::fulfill_read(blockstore_op_t *read_op, uint64_t &fulfille
                };
                it = PRIV(read_op)->read_vec.insert(it, el);
                if (!fulfill_read_push(read_op,
-                    read_op->buf + el.offset - read_op->offset,
+                    (uint8_t*)read_op->buf + el.offset - read_op->offset,
                    item_location + el.offset - item_start,
                    el.len, item_state, item_version))
                {
@@ -102,7 +102,7 @@ uint8_t* blockstore_impl_t::get_clean_entry_bitmap(uint64_t block_loc, int offse
    {
        uint64_t sector = (meta_loc / (meta_block_size / clean_entry_size)) * meta_block_size;
        uint64_t pos = (meta_loc % (meta_block_size / clean_entry_size));
-        clean_entry_bitmap = (uint8_t*)(metadata_buffer + sector + pos*clean_entry_size + sizeof(clean_disk_entry) + offset);
+        clean_entry_bitmap = ((uint8_t*)metadata_buffer + sector + pos*clean_entry_size + sizeof(clean_disk_entry) + offset);
    }
    else
        clean_entry_bitmap = (uint8_t*)(clean_bitmap + meta_loc*2*clean_entry_bitmap_size + offset);
@@ -111,6 +111,7 @@ uint8_t* blockstore_impl_t::get_clean_entry_bitmap(uint64_t block_loc, int offse

 int blockstore_impl_t::dequeue_read(blockstore_op_t *read_op)
 {
+    auto & clean_db = clean_db_shard(read_op->oid);
    auto clean_it = clean_db.find(read_op->oid);
    auto dirty_it = dirty_db.upper_bound((obj_ver_id){
        .oid = read_op->oid,
@@ -297,6 +298,7 @@ int blockstore_impl_t::read_bitmap(object_id oid, uint64_t target_version, void
            dirty_it--;
        }
    }
+    auto & clean_db = clean_db_shard(oid);
    auto clean_it = clean_db.find(oid);
    if (clean_it != clean_db.end())
    {
--- a/src/blockstore_rollback.cpp
+++ b/src/blockstore_rollback.cpp
@@ -74,24 +74,17 @@ skip_ov:
    {
        return 0;
    }
-    // There is sufficient space. Get SQEs
-    struct io_uring_sqe *sqe[space_check.sectors_to_write];
-    for (i = 0; i < space_check.sectors_to_write; i++)
-    {
-        BS_SUBMIT_GET_SQE_DECL(sqe[i]);
-    }
+    // There is sufficient space. Check SQEs
+    BS_SUBMIT_CHECK_SQES(space_check.sectors_to_write);
    // Prepare and submit journal entries
-    auto cb = [this, op](ring_data_t *data) { handle_rollback_event(data, op); };
-    int s = 0, cur_sector = -1;
+    int s = 0;
    for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
    {
        if (!journal.entry_fits(sizeof(journal_entry_rollback)) &&
            journal.sector_info[journal.cur_sector].dirty)
        {
-            if (cur_sector == -1)
-                PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
-            prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
-            cur_sector = journal.cur_sector;
+            prepare_journal_sector_write(journal.cur_sector, op);
+            s++;
        }
        journal_entry_rollback *je = (journal_entry_rollback*)
            prefill_single_journal_entry(journal, JE_ROLLBACK, sizeof(journal_entry_rollback));
@@ -100,12 +93,9 @@ skip_ov:
        je->crc32 = je_crc32((journal_entry*)je);
        journal.crc32_last = je->crc32;
    }
-    prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
+    prepare_journal_sector_write(journal.cur_sector, op);
+    s++;
    assert(s == space_check.sectors_to_write);
-    if (cur_sector == -1)
-        PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
-    PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
-    PRIV(op)->pending_ops = s;
    PRIV(op)->op_state = 1;
    return 1;
 }
@@ -114,30 +104,23 @@ int blockstore_impl_t::continue_rollback(blockstore_op_t *op)
 {
    if (PRIV(op)->op_state == 2)
        goto resume_2;
-    else if (PRIV(op)->op_state == 3)
-        goto resume_3;
-    else if (PRIV(op)->op_state == 5)
-        goto resume_5;
+    else if (PRIV(op)->op_state == 4)
+        goto resume_4;
    else
        return 1;
 resume_2:
-    // Release used journal sectors
-    release_journal_sectors(op);
-resume_3:
    if (!disable_journal_fsync)
    {
-        io_uring_sqe *sqe;
-        BS_SUBMIT_GET_SQE_DECL(sqe);
-        ring_data_t *data = ((ring_data_t*)sqe->user_data);
+        BS_SUBMIT_GET_SQE(sqe, data);
        my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
        data->iov = { 0 };
-        data->callback = [this, op](ring_data_t *data) { handle_rollback_event(data, op); };
+        data->callback = [this, op](ring_data_t *data) { handle_write_event(data, op); };
        PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
        PRIV(op)->pending_ops = 1;
-        PRIV(op)->op_state = 4;
+        PRIV(op)->op_state = 3;
        return 1;
    }
-resume_5:
+resume_4:
    obj_ver_id* v;
    int i;
    for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
@@ -196,24 +179,6 @@ void blockstore_impl_t::mark_rolled_back(const obj_ver_id & ov)
    }
 }

-void blockstore_impl_t::handle_rollback_event(ring_data_t *data, blockstore_op_t *op)
-{
-    live = true;
-    if (data->res != data->iov.iov_len)
-    {
-        throw std::runtime_error(
-            "write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
-            "). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
-        );
-    }
-    PRIV(op)->pending_ops--;
-    if (PRIV(op)->pending_ops == 0)
-    {
-        PRIV(op)->op_state++;
-        ringloop->wakeup();
-    }
-}
-
 void blockstore_impl_t::erase_dirty(blockstore_dirty_db_t::iterator dirty_start, blockstore_dirty_db_t::iterator dirty_end, uint64_t clean_loc)
 {
    if (dirty_end == dirty_start)
--- a/src/blockstore_stable.cpp
+++ b/src/blockstore_stable.cpp
@@ -54,6 +54,7 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
        auto dirty_it = dirty_db.find(*v);
        if (dirty_it == dirty_db.end())
        {
+            auto & clean_db = clean_db_shard(v->oid);
            auto clean_it = clean_db.find(v->oid);
            if (clean_it == clean_db.end() || clean_it->second.version < v->version)
            {
@@ -97,25 +98,18 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
    {
        return 0;
    }
-    // There is sufficient space. Get SQEs
-    struct io_uring_sqe *sqe[space_check.sectors_to_write];
-    for (i = 0; i < space_check.sectors_to_write; i++)
-    {
-        BS_SUBMIT_GET_SQE_DECL(sqe[i]);
-    }
+    // There is sufficient space. Check SQEs
+    BS_SUBMIT_CHECK_SQES(space_check.sectors_to_write);
    // Prepare and submit journal entries
-    auto cb = [this, op](ring_data_t *data) { handle_stable_event(data, op); };
-    int s = 0, cur_sector = -1;
+    int s = 0;
    for (i = 0, v = (obj_ver_id*)op->buf; i < op->len; i++, v++)
    {
        // FIXME: Only stabilize versions that aren't stable yet
        if (!journal.entry_fits(sizeof(journal_entry_stable)) &&
            journal.sector_info[journal.cur_sector].dirty)
        {
-            if (cur_sector == -1)
-                PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
-            prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
-            cur_sector = journal.cur_sector;
+            prepare_journal_sector_write(journal.cur_sector, op);
+            s++;
        }
        journal_entry_stable *je = (journal_entry_stable*)
            prefill_single_journal_entry(journal, JE_STABLE, sizeof(journal_entry_stable));
@@ -124,12 +118,9 @@ int blockstore_impl_t::dequeue_stable(blockstore_op_t *op)
        je->crc32 = je_crc32((journal_entry*)je);
        journal.crc32_last = je->crc32;
    }
-    prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], cb);
+    prepare_journal_sector_write(journal.cur_sector, op);
+    s++;
    assert(s == space_check.sectors_to_write);
-    if (cur_sector == -1)
-        PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
-    PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
-    PRIV(op)->pending_ops = s;
    PRIV(op)->op_state = 1;
    return 1;
 }
@@ -138,30 +129,23 @@ int blockstore_impl_t::continue_stable(blockstore_op_t *op)
 {
    if (PRIV(op)->op_state == 2)
        goto resume_2;
-    else if (PRIV(op)->op_state == 3)
-        goto resume_3;
-    else if (PRIV(op)->op_state == 5)
-        goto resume_5;
+    else if (PRIV(op)->op_state == 4)
+        goto resume_4;
    else
        return 1;
 resume_2:
-    // Release used journal sectors
-    release_journal_sectors(op);
-resume_3:
    if (!disable_journal_fsync)
    {
-        io_uring_sqe *sqe;
-        BS_SUBMIT_GET_SQE_DECL(sqe);
-        ring_data_t *data = ((ring_data_t*)sqe->user_data);
+        BS_SUBMIT_GET_SQE(sqe, data);
        my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
        data->iov = { 0 };
-        data->callback = [this, op](ring_data_t *data) { handle_stable_event(data, op); };
+        data->callback = [this, op](ring_data_t *data) { handle_write_event(data, op); };
        PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
        PRIV(op)->pending_ops = 1;
-        PRIV(op)->op_state = 4;
+        PRIV(op)->op_state = 3;
        return 1;
    }
-resume_5:
+resume_4:
    // Mark dirty_db entries as stable, acknowledge op completion
    obj_ver_id* v;
    int i;
@@ -205,6 +189,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
                    }
                    if (exists == -1)
                    {
+                        auto & clean_db = clean_db_shard(v.oid);
                        auto clean_it = clean_db.find(v.oid);
                        exists = clean_it != clean_db.end() ? 1 : 0;
                    }
@@ -232,6 +217,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
                        break;
                    }
                }
+                auto & clean_db = clean_db_shard(v.oid);
                auto clean_it = clean_db.find(v.oid);
                uint64_t clean_loc = clean_it != clean_db.end()
                    ? clean_it->second.location : UINT64_MAX;
@@ -257,21 +243,3 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
        unstable_writes.erase(unstab_it);
    }
 }
-
-void blockstore_impl_t::handle_stable_event(ring_data_t *data, blockstore_op_t *op)
-{
-    live = true;
-    if (data->res != data->iov.iov_len)
-    {
-        throw std::runtime_error(
-            "write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
-            "). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
-        );
-    }
-    PRIV(op)->pending_ops--;
-    if (PRIV(op)->pending_ops == 0)
-    {
-        PRIV(op)->op_state++;
-        ringloop->wakeup();
-    }
-}
--- a/src/blockstore_sync.cpp
+++ b/src/blockstore_sync.cpp
@@ -44,10 +44,8 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
        if (journal.sector_info[journal.cur_sector].dirty)
        {
            // Write out the last journal sector if it happens to be dirty
-            BS_SUBMIT_GET_ONLY_SQE(sqe);
-            prepare_journal_sector_write(journal, journal.cur_sector, sqe, [this, op](ring_data_t *data) { handle_sync_event(data, op); });
-            PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
-            PRIV(op)->pending_ops = 1;
+            BS_SUBMIT_CHECK_SQES(1);
+            prepare_journal_sector_write(journal.cur_sector, op);
            PRIV(op)->op_state = SYNC_JOURNAL_WRITE_SENT;
            return 1;
        }
@@ -64,7 +62,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
            BS_SUBMIT_GET_SQE(sqe, data);
            my_uring_prep_fsync(sqe, data_fd, IORING_FSYNC_DATASYNC);
            data->iov = { 0 };
-            data->callback = [this, op](ring_data_t *data) { handle_sync_event(data, op); };
+            data->callback = [this, op](ring_data_t *data) { handle_write_event(data, op); };
            PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
            PRIV(op)->pending_ops = 1;
            PRIV(op)->op_state = SYNC_DATA_SYNC_SENT;
@@ -85,24 +83,18 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
        {
            return 0;
        }
-        // Get SQEs. Don't bother about merging, submit each journal sector as a separate request
-        struct io_uring_sqe *sqe[space_check.sectors_to_write];
-        for (int i = 0; i < space_check.sectors_to_write; i++)
-        {
-            BS_SUBMIT_GET_SQE_DECL(sqe[i]);
-        }
+        // Check SQEs. Don't bother merging; submit each journal sector as a separate request
+        BS_SUBMIT_CHECK_SQES(space_check.sectors_to_write);
        // Prepare and submit journal entries
        auto it = PRIV(op)->sync_big_writes.begin();
-        int s = 0, cur_sector = -1;
+        int s = 0;
        while (it != PRIV(op)->sync_big_writes.end())
        {
            if (!journal.entry_fits(sizeof(journal_entry_big_write) + clean_entry_bitmap_size) &&
                journal.sector_info[journal.cur_sector].dirty)
            {
-                if (cur_sector == -1)
-                    PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
-                prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], [this, op](ring_data_t *data) { handle_sync_event(data, op); });
-                cur_sector = journal.cur_sector;
+                prepare_journal_sector_write(journal.cur_sector, op);
+                s++;
            }
            auto & dirty_entry = dirty_db.at(*it);
            journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
@@ -129,12 +121,9 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
            journal.crc32_last = je->crc32;
            it++;
        }
-        prepare_journal_sector_write(journal, journal.cur_sector, sqe[s++], [this, op](ring_data_t *data) { handle_sync_event(data, op); });
+        prepare_journal_sector_write(journal.cur_sector, op);
+        s++;
        assert(s == space_check.sectors_to_write);
-        if (cur_sector == -1)
-            PRIV(op)->min_flushed_journal_sector = 1 + journal.cur_sector;
-        PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
-        PRIV(op)->pending_ops = s;
        PRIV(op)->op_state = SYNC_JOURNAL_WRITE_SENT;
        return 1;
    }
@@ -145,7 +134,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
            BS_SUBMIT_GET_SQE(sqe, data);
            my_uring_prep_fsync(sqe, journal.fd, IORING_FSYNC_DATASYNC);
            data->iov = { 0 };
-            data->callback = [this, op](ring_data_t *data) { handle_sync_event(data, op); };
+            data->callback = [this, op](ring_data_t *data) { handle_write_event(data, op); };
            PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
            PRIV(op)->pending_ops = 1;
            PRIV(op)->op_state = SYNC_JOURNAL_SYNC_SENT;
@@ -164,42 +153,6 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op, bool queue_has_in_prog
    return 1;
 }

-void blockstore_impl_t::handle_sync_event(ring_data_t *data, blockstore_op_t *op)
-{
-    live = true;
-    if (data->res != data->iov.iov_len)
-    {
-        throw std::runtime_error(
-            "write operation failed ("+std::to_string(data->res)+" != "+std::to_string(data->iov.iov_len)+
-            "). in-memory state is corrupted. AAAAAAAaaaaaaaaa!!!111"
-        );
-    }
-    PRIV(op)->pending_ops--;
-    if (PRIV(op)->pending_ops == 0)
-    {
-        // Release used journal sectors
-        release_journal_sectors(op);
-        // Handle states
-        if (PRIV(op)->op_state == SYNC_DATA_SYNC_SENT)
-        {
-            PRIV(op)->op_state = SYNC_DATA_SYNC_DONE;
-        }
-        else if (PRIV(op)->op_state == SYNC_JOURNAL_WRITE_SENT)
-        {
-            PRIV(op)->op_state = SYNC_JOURNAL_WRITE_DONE;
-        }
-        else if (PRIV(op)->op_state == SYNC_JOURNAL_SYNC_SENT)
-        {
-            PRIV(op)->op_state = SYNC_DONE;
-        }
-        else
-        {
-            throw std::runtime_error("BUG: unexpected sync op state");
-        }
-        ringloop->wakeup();
-    }
-}
-
 void blockstore_impl_t::ack_sync(blockstore_op_t *op)
 {
    // Handle states
--- a/src/blockstore_write.cpp
+++ b/src/blockstore_write.cpp
@@ -41,6 +41,7 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
    }
    if (!found)
    {
+        auto & clean_db = clean_db_shard(op->oid);
        auto clean_it = clean_db.find(op->oid);
        if (clean_it != clean_db.end())
        {
@@ -102,7 +103,7 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
        // Issue an additional sync so that the previous big write can reach the journal
        blockstore_op_t *sync_op = new blockstore_op_t;
        sync_op->opcode = BS_OP_SYNC;
-        sync_op->callback = [this, op](blockstore_op_t *sync_op)
+        sync_op->callback = [](blockstore_op_t *sync_op)
        {
            delete sync_op;
        };
@@ -268,18 +269,8 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
            cancel_all_writes(op, dirty_it, -ENOSPC);
            return 2;
        }
-#ifndef NDEBUG
-        // Double-check to not overwrite anything if easily possible
-        if (inmemory_meta)
-        {
-            uint64_t sector = (loc / (meta_block_size / clean_entry_size)) * meta_block_size;
-            uint64_t pos = (loc % (meta_block_size / clean_entry_size));
-            clean_disk_entry *meta_entry = (clean_disk_entry*)(metadata_buffer + sector + pos*clean_entry_size);
-            assert(!meta_entry->oid.inode);
-        }
-#endif
-        write_iodepth++;
        BS_SUBMIT_GET_SQE(sqe, data);
+        write_iodepth++;
        dirty_it->second.location = loc << block_order;
        dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK) | BS_ST_SUBMITTED;
 #ifdef BLOCKSTORE_DEBUG
@@ -334,29 +325,21 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        {
            return 0;
        }
-        write_iodepth++;
-        // There is sufficient space. Get SQE(s)
-        struct io_uring_sqe *sqe1 = NULL;
-        if (immediate_commit != IMMEDIATE_NONE ||
-            !journal.entry_fits(sizeof(journal_entry_small_write) + clean_entry_bitmap_size))
-        {
+        // There is sufficient space. Check SQE(s)
+        BS_SUBMIT_CHECK_SQES(
            // Write current journal sector only if it's dirty and full, or in the immediate_commit mode
-            BS_SUBMIT_GET_SQE_DECL(sqe1);
-        }
-        struct io_uring_sqe *sqe2 = NULL;
-        if (op->len > 0)
-        {
-            BS_SUBMIT_GET_SQE_DECL(sqe2);
-        }
+            (immediate_commit != IMMEDIATE_NONE ||
+                !journal.entry_fits(sizeof(journal_entry_small_write) + clean_entry_bitmap_size) ? 1 : 0) +
+            (op->len > 0 ? 1 : 0)
+        );
+        write_iodepth++;
        // Got SQEs. Prepare previous journal sector write if required
        auto cb = [this, op](ring_data_t *data) { handle_write_event(data, op); };
        if (immediate_commit == IMMEDIATE_NONE)
        {
-            if (sqe1)
+            if (!journal.entry_fits(sizeof(journal_entry_small_write) + clean_entry_bitmap_size))
            {
-                prepare_journal_sector_write(journal, journal.cur_sector, sqe1, cb);
-                PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
-                PRIV(op)->pending_ops++;
+                prepare_journal_sector_write(journal.cur_sector, op);
            }
            else
            {
@@ -390,9 +373,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        journal.crc32_last = je->crc32;
        if (immediate_commit != IMMEDIATE_NONE)
        {
-            prepare_journal_sector_write(journal, journal.cur_sector, sqe1, cb);
-            PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
-            PRIV(op)->pending_ops++;
+            prepare_journal_sector_write(journal.cur_sector, op);
        }
        if (op->len > 0)
        {
@@ -400,9 +381,9 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
            if (journal.inmemory)
            {
                // Copy data
-                memcpy(journal.buffer + journal.next_free, op->buf, op->len);
+                memcpy((uint8_t*)journal.buffer + journal.next_free, op->buf, op->len);
            }
-            ring_data_t *data2 = ((ring_data_t*)sqe2->user_data);
+            BS_SUBMIT_GET_SQE(sqe2, data2);
            data2->iov = (struct iovec){ op->buf, op->len };
            data2->callback = cb;
            my_uring_prep_writev(
@@ -451,13 +432,12 @@ int blockstore_impl_t::continue_write(blockstore_op_t *op)
 resume_2:
    // Only for the immediate_commit mode: prepare and submit big_write journal entry
    {
+        BS_SUBMIT_CHECK_SQES(1);
        auto dirty_it = dirty_db.find((obj_ver_id){
            .oid = op->oid,
            .version = op->version,
        });
        assert(dirty_it != dirty_db.end());
-        io_uring_sqe *sqe = NULL;
-        BS_SUBMIT_GET_SQE_DECL(sqe);
        journal_entry_big_write *je = (journal_entry_big_write*)prefill_single_journal_entry(
            journal, op->opcode == BS_OP_WRITE_STABLE ? JE_BIG_WRITE_INSTANT : JE_BIG_WRITE,
            sizeof(journal_entry_big_write) + clean_entry_bitmap_size
@@ -479,10 +459,7 @@ resume_2:
        memcpy((void*)(je+1), (clean_entry_bitmap_size > sizeof(void*) ? dirty_it->second.bitmap : &dirty_it->second.bitmap), clean_entry_bitmap_size);
        je->crc32 = je_crc32((journal_entry*)je);
        journal.crc32_last = je->crc32;
-        prepare_journal_sector_write(journal, journal.cur_sector, sqe,
-            [this, op](ring_data_t *data) { handle_write_event(data, op); });
-        PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
-        PRIV(op)->pending_ops = 1;
+        prepare_journal_sector_write(journal.cur_sector, op);
        PRIV(op)->op_state = 3;
        return 1;
    }
@@ -567,12 +544,13 @@ resume_4:
            if (ref_us > exec_us + throttle_threshold_us)
            {
                // Pause reply
+                PRIV(op)->op_state = 5;
+                // Set the state first: the timer callback can in theory fire right here and increment op_state
                tfd->set_timer_us(ref_us-exec_us, false, [this, op](int timer_id)
                {
                    PRIV(op)->op_state++;
                    ringloop->wakeup();
                });
-                PRIV(op)->op_state = 5;
                return 1;
            }
        }
@@ -597,6 +575,7 @@ void blockstore_impl_t::handle_write_event(ring_data_t *data, blockstore_op_t *o
        );
    }
    PRIV(op)->pending_ops--;
+    assert(PRIV(op)->pending_ops >= 0);
    if (PRIV(op)->pending_ops == 0)
    {
        release_journal_sectors(op);
@@ -614,7 +593,6 @@ void blockstore_impl_t::release_journal_sectors(blockstore_op_t *op)
        uint64_t s = PRIV(op)->min_flushed_journal_sector;
        while (1)
        {
-            journal.sector_info[s-1].flush_count--;
            if (s != (1+journal.cur_sector) && journal.sector_info[s-1].flush_count == 0)
            {
                // We know for sure that we won't write into this sector anymore
@@ -653,24 +631,24 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
    {
        return 0;
    }
-    write_iodepth++;
-    io_uring_sqe *sqe = NULL;
-    if (immediate_commit != IMMEDIATE_NONE ||
-        (journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_del) &&
-        journal.sector_info[journal.cur_sector].dirty)
+    // Write current journal sector only if it's dirty and full, or in the immediate_commit mode
+    BS_SUBMIT_CHECK_SQES(
+        (immediate_commit != IMMEDIATE_NONE ||
+            (journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_del) &&
+            journal.sector_info[journal.cur_sector].dirty) ? 1 : 0
+    );
+    if (write_iodepth >= max_write_iodepth)
    {
-        // Write current journal sector only if it's dirty and full, or in the immediate_commit mode
-        BS_SUBMIT_GET_SQE_DECL(sqe);
+        return 0;
    }
-    auto cb = [this, op](ring_data_t *data) { handle_write_event(data, op); };
+    write_iodepth++;
    // Prepare journal sector write
    if (immediate_commit == IMMEDIATE_NONE)
    {
-        if (sqe)
+        if ((journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_del) &&
+            journal.sector_info[journal.cur_sector].dirty)
        {
-            prepare_journal_sector_write(journal, journal.cur_sector, sqe, cb);
-            PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
-            PRIV(op)->pending_ops++;
+            prepare_journal_sector_write(journal.cur_sector, op);
        }
        else
        {
@@ -697,9 +675,7 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
    dirty_it->second.state = BS_ST_DELETE | BS_ST_SUBMITTED;
    if (immediate_commit != IMMEDIATE_NONE)
    {
-        prepare_journal_sector_write(journal, journal.cur_sector, sqe, cb);
-        PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 1 + journal.cur_sector;
-        PRIV(op)->pending_ops++;
+        prepare_journal_sector_write(journal.cur_sector, op);
    }
    if (!PRIV(op)->pending_ops)
    {
--- a/src/cli.cpp
+++ b/src/cli.cpp
@@ -2,8 +2,7 @@
 // License: VNPL-1.1 (see README.md for details)

 /**
- * CLI tool
- * Currently can (a) remove inodes and (b) merge snapshot/clone layers
+ * CLI tool and also a library for administrative tasks
 */

 #include <vector>
@@ -17,7 +16,9 @@

 static const char *exe_name = NULL;

-json11::Json::object cli_tool_t::parse_args(int narg, const char *args[])
+static void help();
+
+static json11::Json::object parse_args(int narg, const char *args[])
 {
    json11::Json::object cfg;
    json11::Json::array cmd;
@@ -79,13 +80,16 @@ json11::Json::object cli_tool_t::parse_args(int narg, const char *args[])
    return cfg;
 }

-void cli_tool_t::help()
+static void help()
 {
    printf(
        "Vitastor command-line tool\n"
        "(c) Vitaliy Filippov, 2019+ (VNPL-1.1)\n"
        "\n"
        "USAGE:\n"
+        "%s status\n"
+        "  Show cluster status\n"
+        "\n"
        "%s df\n"
        "  Show pool space statistics\n"
        "\n"
@@ -155,196 +159,177 @@ void cli_tool_t::help()
        "  --no-color          Disable colored output\n"
        "  --json              JSON output\n"
        ,
-        exe_name, exe_name, exe_name, exe_name, exe_name, exe_name,
+        exe_name, exe_name, exe_name, exe_name, exe_name, exe_name, exe_name,
        exe_name, exe_name, exe_name, exe_name, exe_name, exe_name
    );
    exit(0);
 }

-void cli_tool_t::change_parent(inode_t cur, inode_t new_parent)
-{
-    auto cur_cfg_it = cli->st_cli.inode_config.find(cur);
-    if (cur_cfg_it == cli->st_cli.inode_config.end())
-    {
-        fprintf(stderr, "Inode 0x%lx disappeared\n", cur);
-        exit(1);
-    }
-    inode_config_t new_cfg = cur_cfg_it->second;
-    std::string cur_name = new_cfg.name;
-    std::string cur_cfg_key = base64_encode(cli->st_cli.etcd_prefix+
-        "/config/inode/"+std::to_string(INODE_POOL(cur))+
-        "/"+std::to_string(INODE_NO_POOL(cur)));
-    new_cfg.parent_id = new_parent;
-    json11::Json::object cur_cfg_json = cli->st_cli.serialize_inode_cfg(&new_cfg);
-    waiting++;
-    cli->st_cli.etcd_txn(json11::Json::object {
-        { "compare", json11::Json::array {
-            json11::Json::object {
-                { "target", "MOD" },
-                { "key", cur_cfg_key },
-                { "result", "LESS" },
-                { "mod_revision", new_cfg.mod_revision+1 },
-            },
-        } },
-        { "success", json11::Json::array {
-            json11::Json::object {
-                { "request_put", json11::Json::object {
-                    { "key", cur_cfg_key },
-                    { "value", base64_encode(json11::Json(cur_cfg_json).dump()) },
-                } }
-            },
-        } },
-    }, ETCD_SLOW_TIMEOUT, [this, new_parent, cur, cur_name](std::string err, json11::Json res)
-    {
-        if (err != "")
-        {
-            fprintf(stderr, "Error changing parent of %s: %s\n", cur_name.c_str(), err.c_str());
-            exit(1);
-        }
-        if (!res["succeeded"].bool_value())
-        {
-            fprintf(stderr, "Inode %s was modified during snapshot deletion\n", cur_name.c_str());
-            exit(1);
-        }
-        if (new_parent)
-        {
-            auto new_parent_it = cli->st_cli.inode_config.find(new_parent);
-            std::string new_parent_name = new_parent_it != cli->st_cli.inode_config.end()
-                ? new_parent_it->second.name : "<unknown>";
-            printf(
-                "Parent of layer %s (inode %lu in pool %u) changed to %s (inode %lu in pool %u)\n",
-                cur_name.c_str(), INODE_NO_POOL(cur), INODE_POOL(cur),
-                new_parent_name.c_str(), INODE_NO_POOL(new_parent), INODE_POOL(new_parent)
-            );
-        }
-        else
-        {
-            printf(
-                "Parent of layer %s (inode %lu in pool %u) detached\n",
-                cur_name.c_str(), INODE_NO_POOL(cur), INODE_POOL(cur)
-            );
-        }
-        waiting--;
-        ringloop->wakeup();
-    });
-}
-
-inode_config_t* cli_tool_t::get_inode_cfg(const std::string & name)
-{
-    for (auto & ic: cli->st_cli.inode_config)
-    {
-        if (ic.second.name == name)
-        {
-            return &ic.second;
-        }
-    }
-    fprintf(stderr, "Layer %s not found\n", name.c_str());
-    exit(1);
-}
-
-void cli_tool_t::run(json11::Json cfg)
+static int run(cli_tool_t *p, json11::Json::object cfg)
 {
+    cli_result_t result;
+    p->parse_config(cfg);
    json11::Json::array cmd = cfg["command"].array_items();
+    cfg.erase("command");
+    std::function<bool(cli_result_t &)> action_cb;
    if (!cmd.size())
    {
-        fprintf(stderr, "command is missing\n");
-        exit(1);
+        result = { .err = EINVAL, .text = "command is missing" };
+    }
+    else if (cmd[0] == "status")
+    {
+        // Show cluster status
+        action_cb = p->start_status(cfg);
    }
    else if (cmd[0] == "df")
    {
        // Show pool space stats
-        action_cb = start_df(cfg);
+        action_cb = p->start_df(cfg);
    }
    else if (cmd[0] == "ls")
    {
        // List images
-        action_cb = start_ls(cfg);
+        if (cmd.size() > 1)
+        {
+            cmd.erase(cmd.begin(), cmd.begin()+1);
+            cfg["names"] = cmd;
+        }
+        action_cb = p->start_ls(cfg);
    }
-    else if (cmd[0] == "create" || cmd[0] == "snap-create")
+    else if (cmd[0] == "snap-create")
+    {
+        // Create snapshot
+        std::string name = cmd.size() > 1 ? cmd[1].string_value() : "";
+        size_t pos = name.find('@');
+        if (pos == std::string::npos || pos == name.length()-1)
+        {
+            result = (cli_result_t){ .err = EINVAL, .text = "Please specify new snapshot name after @" };
+        }
+        else
+        {
+            cfg["image"] = name.substr(0, pos);
+            cfg["snapshot"] = name.substr(pos + 1);
+            action_cb = p->start_create(cfg);
+        }
+    }
+    else if (cmd[0] == "create")
    {
        // Create image/snapshot
-        action_cb = start_create(cfg);
+        if (cmd.size() > 1)
+        {
+            cfg["image"] = cmd[1];
+        }
+        action_cb = p->start_create(cfg);
    }
    else if (cmd[0] == "modify")
    {
        // Modify image
-        action_cb = start_modify(cfg);
+        if (cmd.size() > 1)
+        {
+            cfg["image"] = cmd[1];
+        }
+        action_cb = p->start_modify(cfg);
    }
    else if (cmd[0] == "rm-data")
    {
        // Delete inode data
-        action_cb = start_rm(cfg);
+        action_cb = p->start_rm_data(cfg);
    }
    else if (cmd[0] == "merge-data")
    {
        // Merge layer data without affecting metadata
-        action_cb = start_merge(cfg);
+        if (cmd.size() > 1)
+        {
+            cfg["from"] = cmd[1];
+            if (cmd.size() > 2)
+                cfg["to"] = cmd[2];
+        }
+        action_cb = p->start_merge(cfg);
    }
    else if (cmd[0] == "flatten")
    {
        // Merge layer data without affecting metadata
-        action_cb = start_flatten(cfg);
+        if (cmd.size() > 1)
+        {
+            cfg["image"] = cmd[1];
+        }
+        action_cb = p->start_flatten(cfg);
    }
    else if (cmd[0] == "rm")
    {
        // Remove multiple snapshots and rebase their children
-        action_cb = start_snap_rm(cfg);
+        if (cmd.size() > 1)
+        {
+            cfg["from"] = cmd[1];
+            if (cmd.size() > 2)
+                cfg["to"] = cmd[2];
+        }
+        action_cb = p->start_rm(cfg);
    }
    else if (cmd[0] == "alloc-osd")
    {
        // Allocate a new OSD number
-        action_cb = start_alloc_osd(cfg);
+        action_cb = p->start_alloc_osd(cfg);
    }
    else if (cmd[0] == "simple-offsets")
    {
        // Calculate offsets for simple & stupid OSD deployment without superblock
-        action_cb = simple_offsets(cfg);
+        if (cmd.size() > 1)
+        {
+            cfg["device"] = cmd[1];
+        }
+        action_cb = p->simple_offsets(cfg);
    }
    else
    {
-        fprintf(stderr, "unknown command: %s\n", cmd[0].string_value().c_str());
-        exit(1);
+        result = { .err = EINVAL, .text = "unknown command: "+cmd[0].string_value() };
    }
-    color = !cfg["no-color"].bool_value();
-    json_output = cfg["json"].bool_value();
-    iodepth = cfg["iodepth"].uint64_value();
-    if (!iodepth)
-        iodepth = 32;
-    parallel_osds = cfg["parallel_osds"].uint64_value();
-    if (!parallel_osds)
-        parallel_osds = 4;
-    log_level = cfg["log_level"].int64_value();
-    progress = cfg["progress"].uint64_value() ? true : false;
-    list_first = cfg["wait-list"].uint64_value() ? true : false;
-    // Create client
-    ringloop = new ring_loop_t(512);
-    epmgr = new epoll_manager_t(ringloop);
-    cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
-    cli->on_ready([this]()
+    if (action_cb != NULL)
    {
-        // Initialize job
-        consumer.loop = [this]()
+        // Create client
+        json11::Json cfg_j = cfg;
+        p->ringloop = new ring_loop_t(512);
+        p->epmgr = new epoll_manager_t(p->ringloop);
+        p->cli = new cluster_client_t(p->ringloop, p->epmgr->tfd, cfg_j);
+        // Use the shorter (quick) etcd timeout by default to keep the CLI responsive
+        p->cli->st_cli.etcd_slow_timeout = p->cli->st_cli.etcd_quick_timeout;
+        p->loop_and_wait(action_cb, [&](const cli_result_t & r)
        {
+            result = r;
+            action_cb = NULL;
+        });
+        // Loop until it completes
+        while (action_cb != NULL)
+        {
+            p->ringloop->loop();
            if (action_cb != NULL)
-            {
-                bool done = action_cb();
-                if (done)
-                {
-                    action_cb = NULL;
-                }
-            }
-            ringloop->submit();
-        };
-        ringloop->register_consumer(&consumer);
-        consumer.loop();
-    });
-    // Loop until it completes
-    while (action_cb != NULL)
-    {
-        ringloop->loop();
-        if (action_cb != NULL)
-            ringloop->wait();
+                p->ringloop->wait();
+        }
+        // Destroy the client
+        delete p->cli;
+        delete p->epmgr;
+        delete p->ringloop;
+        p->cli = NULL;
+        p->epmgr = NULL;
+        p->ringloop = NULL;
    }
+    // Print result
+    if (p->json_output && !result.data.is_null())
+    {
+        printf("%s\n", result.data.dump().c_str());
+    }
+    else if (p->json_output && result.err)
+    {
+        printf("%s\n", json11::Json(json11::Json::object {
+            { "error_code", result.err },
+            { "error_text", result.text },
+        }).dump().c_str());
+    }
+    else if (result.text != "")
+    {
+        fprintf(result.err ? stderr : stdout, result.text[result.text.size()-1] == '\n' ? "%s" : "%s\n", result.text.c_str());
+    }
+    return result.err;
 }

 int main(int narg, const char *args[])
@@ -353,6 +338,7 @@ int main(int narg, const char *args[])
    setvbuf(stderr, NULL, _IONBF, 0);
    exe_name = args[0];
    cli_tool_t *p = new cli_tool_t();
-    p->run(cli_tool_t::parse_args(narg, args));
-    return 0;
+    int r = run(p, parse_args(narg, args));
+    delete p;
+    return r;
 }
--- a/src/cli.h
+++ b/src/cli.h
@@ -19,11 +19,18 @@ class epoll_manager_t;
 class cluster_client_t;
 struct inode_config_t;

+struct cli_result_t
+{
+    int err;
+    std::string text;
+    json11::Json data;
+};
+
 class cli_tool_t
 {
 public:
-    uint64_t iodepth = 0, parallel_osds = 0;
-    bool progress = true;
+    uint64_t iodepth = 4, parallel_osds = 32;
+    bool progress = false;
    bool list_first = false;
    bool json_output = false;
    int log_level = 0;
@@ -34,39 +41,42 @@ public:
    cluster_client_t *cli = NULL;

    int waiting = 0;
-    ring_consumer_t consumer;
-    std::function<bool(void)> action_cb;
+    cli_result_t etcd_err;
+    json11::Json etcd_result;

-    void run(json11::Json cfg);
+    void parse_config(json11::Json cfg);

-    void change_parent(inode_t cur, inode_t new_parent);
+    void change_parent(inode_t cur, inode_t new_parent, cli_result_t *result);
    inode_config_t* get_inode_cfg(const std::string & name);

-    static json11::Json::object parse_args(int narg, const char *args[]);
-    static void help();
-
    friend struct rm_inode_t;
    friend struct snap_merger_t;
    friend struct snap_flattener_t;
    friend struct snap_remover_t;

-    std::function<bool(void)> start_df(json11::Json);
-    std::function<bool(void)> start_ls(json11::Json);
-    std::function<bool(void)> start_create(json11::Json);
-    std::function<bool(void)> start_modify(json11::Json);
-    std::function<bool(void)> start_rm(json11::Json);
-    std::function<bool(void)> start_merge(json11::Json);
-    std::function<bool(void)> start_flatten(json11::Json);
-    std::function<bool(void)> start_snap_rm(json11::Json);
-    std::function<bool(void)> start_alloc_osd(json11::Json cfg, uint64_t *out = NULL);
-    std::function<bool(void)> simple_offsets(json11::Json cfg);
+    std::function<bool(cli_result_t &)> start_status(json11::Json);
+    std::function<bool(cli_result_t &)> start_df(json11::Json);
+    std::function<bool(cli_result_t &)> start_ls(json11::Json);
+    std::function<bool(cli_result_t &)> start_create(json11::Json);
+    std::function<bool(cli_result_t &)> start_modify(json11::Json);
+    std::function<bool(cli_result_t &)> start_rm_data(json11::Json);
+    std::function<bool(cli_result_t &)> start_merge(json11::Json);
+    std::function<bool(cli_result_t &)> start_flatten(json11::Json);
+    std::function<bool(cli_result_t &)> start_rm(json11::Json);
+    std::function<bool(cli_result_t &)> start_alloc_osd(json11::Json cfg);
+    std::function<bool(cli_result_t &)> simple_offsets(json11::Json cfg);
+
+    // Should be called like loop_and_wait(start_status(cfg), <completion callback>)
+    void loop_and_wait(std::function<bool(cli_result_t &)> loop_cb, std::function<void(const cli_result_t &)> complete_cb);
+
+    void etcd_txn(json11::Json txn);
 };

 uint64_t parse_size(std::string size_str);

 std::string print_table(json11::Json items, json11::Json header, bool use_esc);

-std::string format_size(uint64_t size);
+std::string format_size(uint64_t size, bool nobytes = false);

 std::string format_lat(uint64_t lat);

--- a/src/cli_alloc_osd.cpp
+++ b/src/cli_alloc_osd.cpp
@@ -13,10 +13,10 @@ struct alloc_osd_t
 {
    cli_tool_t *parent;

-    json11::Json result;
    uint64_t new_id = 1;

    int state = 0;
+    cli_result_t result;

    bool is_done()
    {
@@ -29,7 +29,7 @@ struct alloc_osd_t
            goto resume_1;
        do
        {
-            etcd_txn(json11::Json::object {
+            parent->etcd_txn(json11::Json::object {
                { "compare", json11::Json::array {
                    json11::Json::object {
                        { "target", "VERSION" },
@@ -63,10 +63,16 @@ struct alloc_osd_t
            state = 1;
            if (parent->waiting > 0)
                return;
-            if (!result["succeeded"].bool_value())
+            if (parent->etcd_err.err)
+            {
+                result = parent->etcd_err;
+                state = 100;
+                return;
+            }
+            if (!parent->etcd_result["succeeded"].bool_value())
            {
                std::vector<osd_num_t> used;
-                for (auto kv: result["responses"][0]["response_range"]["kvs"].array_items())
+                for (auto kv: parent->etcd_result["responses"][0]["response_range"]["kvs"].array_items())
                {
                    std::string key = base64_decode(kv["key"].string_value());
                    osd_num_t cur_osd;
@@ -98,41 +104,25 @@ struct alloc_osd_t
                    new_id = used[e-1]+1;
                }
            }
-        } while (!result["succeeded"].bool_value());
+        } while (!parent->etcd_result["succeeded"].bool_value());
        state = 100;
-    }
-
-    void etcd_txn(json11::Json txn)
-    {
-        parent->waiting++;
-        parent->cli->st_cli.etcd_txn(txn, ETCD_SLOW_TIMEOUT, [this](std::string err, json11::Json res)
-        {
-            parent->waiting--;
-            if (err != "")
-            {
-                fprintf(stderr, "Error reading from etcd: %s\n", err.c_str());
-                exit(1);
-            }
-            this->result = res;
-            parent->ringloop->wakeup();
-        });
+        result = (cli_result_t){
+            .text = std::to_string(new_id),
+            .data = json11::Json(new_id),
+        };
    }
 };

-std::function<bool(void)> cli_tool_t::start_alloc_osd(json11::Json cfg, uint64_t *out)
+std::function<bool(cli_result_t &)> cli_tool_t::start_alloc_osd(json11::Json cfg)
 {
-    json11::Json::array cmd = cfg["command"].array_items();
    auto alloc_osd = new alloc_osd_t();
    alloc_osd->parent = this;
-    return [alloc_osd, out]()
+    return [alloc_osd](cli_result_t & result)
    {
        alloc_osd->loop();
        if (alloc_osd->is_done())
        {
-            if (out)
-                *out = alloc_osd->new_id;
-            else if (alloc_osd->new_id)
-                printf("%lu\n", alloc_osd->new_id);
+            result = alloc_osd->result;
            delete alloc_osd;
            return true;
        }
--- a/src/cli_common.cpp
+++ b/src/cli_common.cpp
@@ -0,0 +1,149 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+
+#include "base64.h"
+#include "cluster_client.h"
+#include "cli.h"
+
+void cli_tool_t::change_parent(inode_t cur, inode_t new_parent, cli_result_t *result)
+{
+    auto cur_cfg_it = cli->st_cli.inode_config.find(cur);
+    if (cur_cfg_it == cli->st_cli.inode_config.end())
+    {
+        char buf[128];
+        snprintf(buf, 128, "Inode 0x%lx disappeared", cur);
+        *result = (cli_result_t){ .err = EIO, .text = buf };
+        return;
+    }
+    inode_config_t new_cfg = cur_cfg_it->second;
+    std::string cur_name = new_cfg.name;
+    std::string cur_cfg_key = base64_encode(cli->st_cli.etcd_prefix+
+        "/config/inode/"+std::to_string(INODE_POOL(cur))+
+        "/"+std::to_string(INODE_NO_POOL(cur)));
+    new_cfg.parent_id = new_parent;
+    json11::Json::object cur_cfg_json = cli->st_cli.serialize_inode_cfg(&new_cfg);
+    waiting++;
+    cli->st_cli.etcd_txn_slow(json11::Json::object {
+        { "compare", json11::Json::array {
+            json11::Json::object {
+                { "target", "MOD" },
+                { "key", cur_cfg_key },
+                { "result", "LESS" },
+                { "mod_revision", new_cfg.mod_revision+1 },
+            },
+        } },
+        { "success", json11::Json::array {
+            json11::Json::object {
+                { "request_put", json11::Json::object {
+                    { "key", cur_cfg_key },
+                    { "value", base64_encode(json11::Json(cur_cfg_json).dump()) },
+                } }
+            },
+        } },
+    }, [this, result, new_parent, cur, cur_name](std::string err, json11::Json res)
+    {
+        if (err != "")
+        {
+            *result = (cli_result_t){ .err = EIO, .text = "Error changing parent of "+cur_name+": "+err };
+        }
+        else if (!res["succeeded"].bool_value())
+        {
+            *result = (cli_result_t){ .err = EAGAIN, .text = "Image "+cur_name+" was modified during change" };
+        }
+        else if (new_parent)
+        {
+            auto new_parent_it = cli->st_cli.inode_config.find(new_parent);
+            std::string new_parent_name = new_parent_it != cli->st_cli.inode_config.end()
+                ? new_parent_it->second.name : "<unknown>";
+            *result = (cli_result_t){
+                .text = "Parent of layer "+cur_name+" (inode "+std::to_string(INODE_NO_POOL(cur))+
+                    " in pool "+std::to_string(INODE_POOL(cur))+") changed to "+new_parent_name+
+                    " (inode "+std::to_string(INODE_NO_POOL(new_parent))+" in pool "+std::to_string(INODE_POOL(new_parent))+")",
+            };
+        }
+        else
+        {
+            *result = (cli_result_t){
+                .text = "Parent of layer "+cur_name+" (inode "+std::to_string(INODE_NO_POOL(cur))+
+                    " in pool "+std::to_string(INODE_POOL(cur))+") detached",
+            };
+        }
+        waiting--;
+        ringloop->wakeup();
+    });
+}
+
+void cli_tool_t::etcd_txn(json11::Json txn)
+{
+    waiting++;
+    cli->st_cli.etcd_txn_slow(txn, [this](std::string err, json11::Json res)
+    {
+        waiting--;
+        if (err != "")
+            etcd_err = (cli_result_t){ .err = EIO, .text = "Error communicating with etcd: "+err };
+        else
+            etcd_err = (cli_result_t){ .err = 0 };
+        etcd_result = res;
+        ringloop->wakeup();
+    });
+}
+
+inode_config_t* cli_tool_t::get_inode_cfg(const std::string & name)
+{
+    for (auto & ic: cli->st_cli.inode_config)
+    {
+        if (ic.second.name == name)
+        {
+            return &ic.second;
+        }
+    }
+    return NULL;
+}
+
+void cli_tool_t::parse_config(json11::Json cfg)
+{
+    color = !cfg["no-color"].bool_value();
+    json_output = cfg["json"].bool_value();
+    iodepth = cfg["iodepth"].uint64_value();
+    if (!iodepth)
+        iodepth = 32;
+    parallel_osds = cfg["parallel_osds"].uint64_value();
+    if (!parallel_osds)
+        parallel_osds = 4;
+    log_level = cfg["log_level"].int64_value();
+    progress = cfg["progress"].uint64_value() ? true : false;
+    list_first = cfg["wait-list"].uint64_value() ? true : false;
+}
+
+struct cli_result_looper_t
+{
+    ring_consumer_t consumer;
+    cli_result_t result;
+    std::function<bool(cli_result_t &)> loop_cb;
+    std::function<void(const cli_result_t &)> complete_cb;
+};
+
+void cli_tool_t::loop_and_wait(std::function<bool(cli_result_t &)> loop_cb, std::function<void(const cli_result_t &)> complete_cb)
+{
+    auto *looper = new cli_result_looper_t();
+    looper->loop_cb = loop_cb;
+    looper->complete_cb = complete_cb;
+    looper->consumer.loop = [this, looper]()
+    {
+        bool done = looper->loop_cb(looper->result);
+        if (done)
+        {
+            ringloop->unregister_consumer(&looper->consumer);
+            looper->loop_cb = NULL;
+            looper->complete_cb(looper->result);
+            delete looper;
+            return;
+        }
+        ringloop->submit();
+    };
+    cli->on_ready([this, looper]()
+    {
+        ringloop->register_consumer(&looper->consumer);
+        ringloop->wakeup();
+    });
+}
--- a/src/cli_create.cpp
+++ b/src/cli_create.cpp
@@ -25,15 +25,18 @@ struct image_creator_t
    pool_id_t new_pool_id = 0;
    std::string new_pool_name;
    std::string image_name, new_snap, new_parent;
+    json11::Json new_meta;
    uint64_t size;
+    bool force_size = false;

    pool_id_t old_pool_id = 0;
    inode_t new_parent_id = 0;
    inode_t new_id = 0, old_id = 0;
    uint64_t max_id_mod_rev = 0, cfg_mod_rev = 0, idx_mod_rev = 0;
-    json11::Json result;
+    inode_config_t new_cfg;

    int state = 0;
+    cli_result_t result;

    bool is_done()
    {
@@ -44,13 +47,27 @@ struct image_creator_t
    {
        if (state >= 1)
            goto resume_1;
+        if (image_name == "")
+        {
+            // FIXME: EINVAL -> specific codes for every error
+            result = (cli_result_t){ .err = EINVAL, .text = "Image name is missing" };
+            state = 100;
+            return;
+        }
+        if (image_name.find('@') != std::string::npos)
+        {
+            result = (cli_result_t){ .err = EINVAL, .text = "Image name can't contain @ character" };
+            state = 100;
+            return;
+        }
        if (new_pool_id)
        {
            auto & pools = parent->cli->st_cli.pool_config;
            if (pools.find(new_pool_id) == pools.end())
            {
-                fprintf(stderr, "Pool %u does not exist\n", new_pool_id);
-                exit(1);
+                result = (cli_result_t){ .err = ENOENT, .text = "Pool "+std::to_string(new_pool_id)+" does not exist" };
+                state = 100;
+                return;
            }
        }
        else if (new_pool_name != "")
@@ -65,8 +82,9 @@ struct image_creator_t
            }
            if (!new_pool_id)
            {
-                fprintf(stderr, "Pool %s does not exist\n", new_pool_name.c_str());
-                exit(1);
+                result = (cli_result_t){ .err = ENOENT, .text = "Pool "+new_pool_name+" does not exist" };
+                state = 100;
+                return;
            }
        }
        else if (parent->cli->st_cli.pool_config.size() == 1)
@@ -92,8 +110,9 @@ struct image_creator_t
        {
            if (ic.second.name == image_name)
            {
-                fprintf(stderr, "Image %s already exists\n", image_name.c_str());
-                exit(1);
+                result = (cli_result_t){ .err = EEXIST, .text = "Image "+image_name+" already exists" };
+                state = 100;
+                return;
            }
            if (ic.second.name == new_parent)
            {
@@ -110,45 +129,61 @@ struct image_creator_t
        }
        if (new_parent != "" && !new_parent_id)
        {
-            fprintf(stderr, "Parent image not found\n");
-            exit(1);
+            result = (cli_result_t){ .err = ENOENT, .text = "Parent image "+new_parent+" not found" };
+            state = 100;
+            return;
        }
        if (!new_pool_id)
        {
-            fprintf(stderr, "Pool name or ID is missing\n");
-            exit(1);
+            result = (cli_result_t){ .err = EINVAL, .text = "Pool name or ID is missing" };
+            state = 100;
+            return;
        }
-        if (!size)
+        if (!size && !force_size)
        {
-            fprintf(stderr, "Image size is missing\n");
-            exit(1);
+            result = (cli_result_t){ .err = EINVAL, .text = "Image size is missing" };
+            state = 100;
+            return;
        }
        do
        {
-            etcd_txn(json11::Json::object {
+            parent->etcd_txn(json11::Json::object {
                { "success", json11::Json::array { get_next_id() } }
            });
            state = 2;
 resume_2:
            if (parent->waiting > 0)
                return;
-            extract_next_id(result["responses"][0]);
+            if (parent->etcd_err.err)
+            {
+                result = parent->etcd_err;
+                state = 100;
+                return;
+            }
+            extract_next_id(parent->etcd_result["responses"][0]);
            attempt_create();
            state = 3;
 resume_3:
            if (parent->waiting > 0)
                return;
-            if (!result["succeeded"].bool_value() &&
-                result["responses"][0]["response_range"]["kvs"].array_items().size() > 0)
+            if (parent->etcd_err.err)
            {
-                fprintf(stderr, "Image %s already exists\n", image_name.c_str());
-                exit(1);
+                result = parent->etcd_err;
+                state = 100;
+                return;
            }
-        } while (!result["succeeded"].bool_value());
-        if (parent->progress)
-        {
-            printf("Image %s created\n", image_name.c_str());
-        }
+            if (!parent->etcd_result["succeeded"].bool_value() &&
+                parent->etcd_result["responses"][0]["response_range"]["kvs"].array_items().size() > 0)
+            {
+                result = (cli_result_t){ .err = EEXIST, .text = "Image "+image_name+" already exists" };
+                state = 100;
+                return;
+            }
+        } while (!parent->etcd_result["succeeded"].bool_value());
+        // Save into inode_config for library users to be able to take it from there immediately
+        new_cfg.mod_revision = parent->etcd_result["responses"][0]["response_put"]["header"]["revision"].uint64_value();
+        parent->cli->st_cli.insert_inode_config(new_cfg);
+        result = (cli_result_t){ .err = 0, .text = "Image "+image_name+" created" };
        state = 100;
    }

@@ -164,14 +199,16 @@ resume_3:
        {
            if (ic.second.name == image_name+"@"+new_snap)
            {
-                fprintf(stderr, "Snapshot %s@%s already exists\n", image_name.c_str(), new_snap.c_str());
-                exit(1);
+                result = (cli_result_t){ .err = EEXIST, .text = "Snapshot "+image_name+"@"+new_snap+" already exists" };
+                state = 100;
+                return;
            }
        }
        if (new_parent != "")
        {
-            fprintf(stderr, "--parent can't be used with snapshots\n");
-            exit(1);
+            result = (cli_result_t){ .err = EINVAL, .text = "Parent can't be specified for snapshots" };
+            state = 100;
+            return;
        }
        do
        {
@@ -183,8 +220,9 @@ resume_3:
                return;
            if (!old_id)
            {
-                fprintf(stderr, "Image %s does not exist\n", image_name.c_str());
-                exit(1);
+                result = (cli_result_t){ .err = ENOENT, .text = "Image "+image_name+" does not exist" };
+                state = 100;
+                return;
            }
            if (!new_pool_id)
            {
@@ -196,17 +234,24 @@ resume_3:
 resume_4:
            if (parent->waiting > 0)
                return;
-            if (!result["succeeded"].bool_value() &&
-                result["responses"][0]["response_range"]["kvs"].array_items().size() > 0)
+            if (parent->etcd_err.err)
            {
-                fprintf(stderr, "Snapshot %s@%s already exists\n", image_name.c_str(), new_snap.c_str());
-                exit(1);
+                result = parent->etcd_err;
+                state = 100;
+                return;
            }
-        } while (!result["succeeded"].bool_value());
-        if (parent->progress)
-        {
-            printf("Snapshot %s@%s created\n", image_name.c_str(), new_snap.c_str());
-        }
+            if (!parent->etcd_result["succeeded"].bool_value() &&
+                parent->etcd_result["responses"][0]["response_range"]["kvs"].array_items().size() > 0)
+            {
+                result = (cli_result_t){ .err = EEXIST, .text = "Snapshot "+image_name+"@"+new_snap+" already exists" };
+                state = 100;
+                return;
+            }
+        } while (!parent->etcd_result["succeeded"].bool_value());
+        // Save into inode_config for library users to be able to take it from there immediately
+        new_cfg.mod_revision = parent->etcd_result["responses"][0]["response_put"]["header"]["revision"].uint64_value();
+        parent->cli->st_cli.insert_inode_config(new_cfg);
+        result = (cli_result_t){ .err = 0, .text = "Snapshot "+image_name+"@"+new_snap+" created" };
        state = 100;
    }

@@ -246,7 +291,7 @@ resume_4:
            goto resume_2;
        else if (state == 3)
            goto resume_3;
-        etcd_txn(json11::Json::object { { "success", json11::Json::array {
+        parent->etcd_txn(json11::Json::object { { "success", json11::Json::array {
            get_next_id(),
            json11::Json::object {
                { "request_range", json11::Json::object {
@@ -260,11 +305,17 @@ resume_4:
 resume_2:
        if (parent->waiting > 0)
            return;
-        extract_next_id(result["responses"][0]);
+        if (parent->etcd_err.err)
+        {
+            result = parent->etcd_err;
+            state = 100;
+            return;
+        }
+        extract_next_id(parent->etcd_result["responses"][0]);
        old_id = 0;
        old_pool_id = 0;
        cfg_mod_rev = idx_mod_rev = 0;
-        if (result["responses"][1]["response_range"]["kvs"].array_items().size() == 0)
+        if (parent->etcd_result["responses"][1]["response_range"]["kvs"].array_items().size() == 0)
        {
            for (auto & ic: parent->cli->st_cli.inode_config)
            {
@@ -283,17 +334,18 @@ resume_2:
        {
            // FIXME: Parse kvs in etcd_state_client automatically
            {
-                auto kv = parent->cli->st_cli.parse_etcd_kv(result["responses"][1]["response_range"]["kvs"][0]);
+                auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][1]["response_range"]["kvs"][0]);
                old_id = INODE_NO_POOL(kv.value["id"].uint64_value());
                old_pool_id = (pool_id_t)kv.value["pool_id"].uint64_value();
                idx_mod_rev = kv.mod_revision;
                if (!old_id || !old_pool_id || old_pool_id >= POOL_ID_MAX)
                {
-                    fprintf(stderr, "Invalid pool or inode ID in etcd key %s\n", kv.key.c_str());
-                    exit(1);
+                    result = (cli_result_t){ .err = ENOENT, .text = "Invalid pool or inode ID in etcd key "+kv.key };
+                    state = 100;
+                    return;
                }
            }
-            etcd_txn(json11::Json::object {
+            parent->etcd_txn(json11::Json::object {
                { "success", json11::Json::array {
                    json11::Json::object {
                        { "request_range", json11::Json::object {
@@ -309,8 +361,14 @@ resume_2:
 resume_3:
            if (parent->waiting > 0)
                return;
+            if (parent->etcd_err.err)
            {
-                auto kv = parent->cli->st_cli.parse_etcd_kv(result["responses"][0]["response_range"]["kvs"][0]);
+                result = parent->etcd_err;
+                state = 100;
+                return;
+            }
+            {
+                auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
                size = kv.value["size"].uint64_value();
                new_parent_id = kv.value["parent_id"].uint64_value();
                uint64_t parent_pool_id = kv.value["parent_pool_id"].uint64_value();
@@ -325,12 +383,13 @@ resume_3:

    void attempt_create()
    {
-        inode_config_t new_cfg = {
+        new_cfg = {
            .num = INODE_WITH_POOL(new_pool_id, new_id),
            .name = image_name,
            .size = size,
            .parent_id = (new_snap != "" ? INODE_WITH_POOL(old_pool_id, old_id) : new_parent_id),
            .readonly = false,
+            .meta = new_meta,
        };
        json11::Json::array checks = json11::Json::array {
            json11::Json::object {
@@ -439,28 +498,12 @@ resume_3:
                } },
            });
        };
-        etcd_txn(json11::Json::object {
+        parent->etcd_txn(json11::Json::object {
            { "compare", checks },
            { "success", success },
            { "failure", failure },
        });
    }
-
-    void etcd_txn(json11::Json txn)
-    {
-        parent->waiting++;
-        parent->cli->st_cli.etcd_txn(txn, ETCD_SLOW_TIMEOUT, [this](std::string err, json11::Json res)
-        {
-            parent->waiting--;
-            if (err != "")
-            {
-                fprintf(stderr, "Error reading from etcd: %s\n", err.c_str());
-                exit(1);
-            }
-            this->result = res;
-            parent->ringloop->wakeup();
-        });
-    }
 };

 uint64_t parse_size(std::string size_str)
@@ -474,77 +517,76 @@ uint64_t parse_size(std::string size_str)
    if (type_char == 'k' || type_char == 'm' || type_char == 'g' || type_char == 't')
    {
        if (type_char == 'k')
-            mul = 1l<<10;
+            mul = (uint64_t)1<<10;
        else if (type_char == 'm')
-            mul = 1l<<20;
+            mul = (uint64_t)1<<20;
        else if (type_char == 'g')
-            mul = 1l<<30;
+            mul = (uint64_t)1<<30;
        else /*if (type_char == 't')*/
-            mul = 1l<<40;
+            mul = (uint64_t)1<<40;
        size_str = size_str.substr(0, size_str.length()-1);
    }
    uint64_t size = json11::Json(size_str).uint64_value() * mul;
    if (size == 0 && size_str != "0" && (size_str != "" || mul != 1))
    {
-        fprintf(stderr, "Invalid syntax for size: %s\n", size_str.c_str());
-        exit(1);
+        return UINT64_MAX;
    }
    return size;
 }

-std::function<bool(void)> cli_tool_t::start_create(json11::Json cfg)
+std::function<bool(cli_result_t &)> cli_tool_t::start_create(json11::Json cfg)
 {
-    json11::Json::array cmd = cfg["command"].array_items();
    auto image_creator = new image_creator_t();
    image_creator->parent = this;
-    image_creator->image_name = cmd.size() > 1 ? cmd[1].string_value() : "";
+    image_creator->image_name = cfg["image"].string_value();
    image_creator->new_pool_id = cfg["pool"].uint64_value();
    image_creator->new_pool_name = cfg["pool"].string_value();
+    image_creator->force_size = cfg["force_size"].bool_value();
+    if (cfg["image_meta"].is_object())
+    {
+        image_creator->new_meta = cfg["image_meta"];
+    }
    if (cfg["snapshot"].string_value() != "")
    {
        image_creator->new_snap = cfg["snapshot"].string_value();
    }
-    else if (cmd[0] == "snap-create")
-    {
-        int p = image_creator->image_name.find('@');
-        if (p == std::string::npos || p == image_creator->image_name.length()-1)
-        {
-            fprintf(stderr, "Please specify new snapshot name after @\n");
-            exit(1);
-        }
-        image_creator->new_snap = image_creator->image_name.substr(p + 1);
-        image_creator->image_name = image_creator->image_name.substr(0, p);
-    }
    image_creator->new_parent = cfg["parent"].string_value();
    if (cfg["size"].string_value() != "")
    {
        image_creator->size = parse_size(cfg["size"].string_value());
-        if (image_creator->size % 4096)
+        if (image_creator->size == UINT64_MAX)
        {
-            fprintf(stderr, "Size should be a multiple of 4096\n");
-            exit(1);
+            return [size = cfg["size"].string_value()](cli_result_t & result)
+            {
+                result = (cli_result_t){ .err = EINVAL, .text = "Invalid syntax for size: "+size };
+                return true;
+            };
+        }
+        if ((image_creator->size % 4096) && !cfg["force_size"].bool_value())
+        {
+            delete image_creator;
+            return [](cli_result_t & result)
+            {
+                result = (cli_result_t){ .err = EINVAL, .text = "Size should be a multiple of 4096" };
+                return true;
+            };
        }
        if (image_creator->new_snap != "")
        {
-            fprintf(stderr, "--size can't be specified for snapshots\n");
-            exit(1);
+            delete image_creator;
+            return [](cli_result_t & result)
+            {
+                result = (cli_result_t){ .err = EINVAL, .text = "Size can't be specified for snapshots" };
+                return true;
+            };
        }
    }
-    if (image_creator->image_name == "")
-    {
-        fprintf(stderr, "Image name is missing\n");
-        exit(1);
-    }
-    if (image_creator->image_name.find('@') != std::string::npos)
-    {
-        fprintf(stderr, "Image name can't contain @ character\n");
-        exit(1);
-    }
-    return [image_creator]()
+    return [image_creator](cli_result_t & result)
    {
        image_creator->loop();
        if (image_creator->is_done())
        {
+            result = image_creator->result;
            delete image_creator;
            return true;
        }
--- a/src/cli_df.cpp
+++ b/src/cli_df.cpp
@@ -12,6 +12,7 @@ struct pool_lister_t

    int state = 0;
    json11::Json space_info;
+    cli_result_t result;
    std::map<pool_id_t, json11::Json::object> pool_stats;

    bool is_done()
@@ -24,8 +25,7 @@ struct pool_lister_t
        if (state == 1)
            goto resume_1;
        // Space statistics - pool/stats/<pool>
-        parent->waiting++;
-        parent->cli->st_cli.etcd_txn(json11::Json::object {
+        parent->etcd_txn(json11::Json::object {
            { "success", json11::Json::array {
                json11::Json::object {
                    { "request_range", json11::Json::object {
@@ -48,21 +48,18 @@ struct pool_lister_t
                    } },
                },
            } },
-        }, ETCD_SLOW_TIMEOUT, [this](std::string err, json11::Json res)
-        {
-            parent->waiting--;
-            if (err != "")
-            {
-                fprintf(stderr, "Error reading from etcd: %s\n", err.c_str());
-                exit(1);
-            }
-            space_info = res;
-            parent->ringloop->wakeup();
        });
        state = 1;
 resume_1:
        if (parent->waiting > 0)
            return;
+        if (parent->etcd_err.err)
+        {
+            result = parent->etcd_err;
+            state = 100;
+            return;
+        }
+        space_info = parent->etcd_result;
        std::map<pool_id_t, uint64_t> osd_free;
        for (auto & kv_item: space_info["responses"][0]["response_range"]["kvs"].array_items())
        {
@@ -118,9 +115,14 @@ resume_1:
                    pool_avail = pg_free;
                }
            }
+            if (pool_avail == UINT64_MAX)
+            {
+                pool_avail = 0;
+            }
            if (pool_cfg.scheme != POOL_SCHEME_REPLICATED)
            {
-                pool_avail = pool_avail * (pool_cfg.pg_size - pool_cfg.parity_chunks) / pool_stats[pool_cfg.id]["pg_real_size"].uint64_value();
+                uint64_t pg_real_size = pool_stats[pool_cfg.id]["pg_real_size"].uint64_value();
+                pool_avail = pg_real_size > 0 ? pool_avail * (pool_cfg.pg_size - pool_cfg.parity_chunks) / pg_real_size : 0;
            }
            pool_stats[pool_cfg.id] = json11::Json::object {
                { "name", pool_cfg.name },
@@ -129,8 +131,8 @@ resume_1:
                { "scheme_name", pool_cfg.scheme == POOL_SCHEME_REPLICATED
                    ? std::to_string(pool_cfg.pg_size)+"/"+std::to_string(pool_cfg.pg_minsize)
                    : "EC "+std::to_string(pool_cfg.pg_size-pool_cfg.parity_chunks)+"+"+std::to_string(pool_cfg.parity_chunks) },
-                { "used_raw", (uint64_t)(pool_stats[pool_cfg.id]["used_raw_tb"].number_value() * (1l<<40)) },
-                { "total_raw", (uint64_t)(pool_stats[pool_cfg.id]["total_raw_tb"].number_value() * (1l<<40)) },
+                { "used_raw", (uint64_t)(pool_stats[pool_cfg.id]["used_raw_tb"].number_value() * ((uint64_t)1<<40)) },
+                { "total_raw", (uint64_t)(pool_stats[pool_cfg.id]["total_raw_tb"].number_value() * ((uint64_t)1<<40)) },
                { "max_available", pool_avail },
                { "raw_to_usable", pool_stats[pool_cfg.id]["raw_to_usable"].number_value() },
                { "space_efficiency", pool_stats[pool_cfg.id]["space_efficiency"].number_value() },
@@ -155,10 +157,12 @@ resume_1:
        get_stats();
        if (parent->waiting > 0)
            return;
+        if (state == 100)
+            return;
        if (parent->json_output)
        {
            // JSON output
-            printf("%s\n", json11::Json(to_list()).dump().c_str());
+            result.data = to_list();
            state = 100;
            return;
        }
@@ -199,28 +203,34 @@ resume_1:
        json11::Json::array list;
        for (auto & kv: pool_stats)
        {
-            kv.second["total_fmt"] = format_size(kv.second["total_raw"].uint64_value() / kv.second["raw_to_usable"].number_value());
-            kv.second["used_fmt"] = format_size(kv.second["used_raw"].uint64_value() / kv.second["raw_to_usable"].number_value());
+            double raw_to = kv.second["raw_to_usable"].number_value();
+            if (raw_to < 0.000001 && raw_to > -0.000001)
+                raw_to = 1;
+            kv.second["total_fmt"] = format_size(kv.second["total_raw"].uint64_value() / raw_to);
+            kv.second["used_fmt"] = format_size(kv.second["used_raw"].uint64_value() / raw_to);
            kv.second["max_avail_fmt"] = format_size(kv.second["max_available"].uint64_value());
-            kv.second["used_pct"] = format_q(100 - 100*kv.second["max_available"].uint64_value() *
-                kv.second["raw_to_usable"].number_value() / kv.second["total_raw"].uint64_value())+"%";
+            kv.second["used_pct"] = format_q(kv.second["total_raw"].uint64_value()
+                ? (100 - 100*kv.second["max_available"].uint64_value() *
+                    kv.second["raw_to_usable"].number_value() / kv.second["total_raw"].uint64_value())
+                : 100)+"%";
            kv.second["eff_fmt"] = format_q(kv.second["space_efficiency"].number_value()*100)+"%";
        }
-        printf("%s", print_table(to_list(), cols, parent->color).c_str());
+        result.data = to_list();
+        result.text = print_table(result.data, cols, parent->color);
        state = 100;
    }
 };

-std::function<bool(void)> cli_tool_t::start_df(json11::Json cfg)
+std::function<bool(cli_result_t &)> cli_tool_t::start_df(json11::Json cfg)
 {
-    json11::Json::array cmd = cfg["command"].array_items();
    auto lister = new pool_lister_t();
    lister->parent = this;
-    return [lister]()
+    return [lister](cli_result_t & result)
    {
        lister->loop();
        if (lister->is_done())
        {
+            result = lister->result;
            delete lister;
            return true;
        }
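The hunks above all follow one pattern: each command stores its outcome in a result struct and returns a `std::function<bool(cli_result_t &)>` that the caller polls, instead of printing and calling `exit(1)`. A minimal self-contained sketch of that pattern follows. All names in it (`result_t`, `counter_cmd_t`, `start_counter`, `run_to_completion`) are hypothetical stand-ins, not Vitastor code, and a plain `while` loop replaces the ring loop; it also uses `std::shared_ptr` rather than the manual `new`/`delete` seen in the diff.

```cpp
#include <cerrno>
#include <functional>
#include <memory>
#include <string>

// Hypothetical stand-in for cli_result_t: error code plus human-readable text
struct result_t
{
    int err;
    std::string text;
};

// Hypothetical command implemented as a resumable state machine.
// state == 100 means "finished", mirroring the convention in the diff.
struct counter_cmd_t
{
    int state = 0;
    int steps_left = 0;
    result_t result = { 0, "" };

    bool is_done() { return state == 100; }

    void loop()
    {
        if (steps_left < 0)
        {
            // Report the error to the caller instead of printing and exiting
            result.err = EINVAL;
            result.text = "Step count can't be negative";
            state = 100;
            return;
        }
        if (steps_left > 0)
        {
            // Simulates "still waiting for an asynchronous reply"
            steps_left--;
            state = 1;
            return;
        }
        result.err = 0;
        result.text = "finished";
        state = 100;
    }
};

// Returns a callback that must be polled until it returns true;
// the final result is copied into the caller-provided variable.
std::function<bool(result_t &)> start_counter(int steps)
{
    auto cmd = std::make_shared<counter_cmd_t>();
    cmd->steps_left = steps;
    return [cmd](result_t & result)
    {
        cmd->loop();
        if (cmd->is_done())
        {
            result = cmd->result;
            return true;
        }
        return false;
    };
}

// Hypothetical driver: a busy loop standing in for the event loop
result_t run_to_completion(std::function<bool(result_t &)> loop_cb)
{
    result_t result = { 0, "" };
    while (!loop_cb(result))
    {
    }
    return result;
}
```

A library user can then inspect `err` and `text` and decide how to react, which is not possible when the command terminates the process itself.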
Author	SHA1	Message	Date
Vitaliy Filippov	7dba1148e7	Add Hugo-based (https://gohugo.io ) documentation	2022-05-11 11:28:32 +03:00
Vitaliy Filippov	6b69db73ac	Remove getrandom() usage	2022-05-11 11:25:20 +03:00
Vitaliy Filippov	d48a824846	Fix some warnings	2022-05-10 12:42:58 +03:00
Vitaliy Filippov	40985282ff	Fix build under GCC 8	2022-05-10 12:26:47 +03:00
Vitaliy Filippov	acf403e886	Add install target for NFS proxy	2022-05-10 10:43:17 +03:00
Vitaliy Filippov	cf03b9c84d	Implement "primary affinity tags"	2022-05-09 22:37:23 +03:00
Vitaliy Filippov	7c2379d458	Simplified NFS proxy based on own NFS/XDR implementation	2022-05-07 01:01:20 +03:00
Vitaliy Filippov	a2189100dd	Make CLI functions usable in library form. Return results and errors in a variable instead of just printing them, separate vitastor-cli main() from cli_tool_t, move positional argument parsing to CLI main from command implementations.	2022-05-06 02:18:32 +03:00
Vitaliy Filippov	bb84379db6	Release 0.6.17 - Fix incorrect reading of extra metadata block leading to extra unknown objects in stats - Fix CSI driver volumeMode: Block support - Add block PVC and pod examples - Fix build under 32 bit architectures - Fix slow connection ramp-up caused by up_wait_retry_interval	2022-05-06 02:18:01 +03:00
Vitaliy Filippov	714dda8151	Fix slow connection ramp-up caused by up_wait_retry_interval pausing operations on first connection attempt	2022-05-06 02:12:08 +03:00
Vitaliy Filippov	834554c523	LD_PRELOAD=libasan.so.5 fio in tests fails when vitastor is built with ASan	2022-05-05 02:11:34 +03:00
Vitaliy Filippov	e718116f54	Fix incorrect reading of extra metadata block	2022-04-21 02:52:21 +03:00
Vitaliy Filippov	98e3528a14	Add block PVC and pod examples	2022-04-17 15:43:37 +03:00
Vitaliy Filippov	8e88f77101	Fix CSI driver volumeMode: Block support	2022-04-17 15:39:11 +03:00
Vitaliy Filippov	caa2cc2e6c	Fix 32bit build error	2022-04-16 01:48:24 +03:00
Vitaliy Filippov	842ba8b831	Use (uint64_t)1 instead of 1l / 1ul	2022-04-16 01:48:14 +03:00
Vitaliy Filippov	1493823f9e	Note about starting monitors	2022-04-12 15:00:28 +03:00
Vitaliy Filippov	c857272f44	Comment: epoch is uint64_t	2022-04-10 12:21:37 +03:00
Vitaliy Filippov	340a4b4f27	Release 0.6.16 - Implement `vitastor-cli status` (print cluster status) command - Add a new `make-osd-hybrid.js` script to quickly prepare a lot of hybrid (HDD+SSD) OSDs - Implement snapshot deletion for Cinder driver (only works in a healthy cluster) - Fix a huge :) bug causing reads to return all zeroes during rebalance. Add a test to prevent it in the future - Disconnect NBD proxy correctly without leaving a zombie [vitastor-nbd] process in D state - Fix a rare write hang appearing with small write throttling enabled	2022-04-09 01:16:52 +03:00
Vitaliy Filippov	5118980315	Add a script to run all tests	2022-04-09 01:14:00 +03:00
Vitaliy Filippov	d71cc174e3	Implement CLI status command	2022-04-09 00:25:51 +03:00
Vitaliy Filippov	0eb929f1ba	Fix change_pg_count test (statistic reporting may take some time)	2022-04-08 11:58:53 +03:00
Vitaliy Filippov	83146fa3e2	Fix the same HUGE bug for regular reads during rebalance	2022-04-08 11:50:09 +03:00
Vitaliy Filippov	15dcaf7903	Add the same "rebalance" test with regular reads	2022-04-08 11:48:31 +03:00
Vitaliy Filippov	cd18ef7323	Disconnect NBD proxy correctly without leaving a zombie [vitastor-nbd] process in D state	2022-04-07 16:03:35 +03:00
Vitaliy Filippov	39531ef1a6	Fix incorrect chained reads during rebalance (the bug detected by test_rebalance_verify.sh)	2022-04-07 15:56:58 +03:00
Vitaliy Filippov	d334914948	Fix the test so it actually fails indicating a bug :-)	2022-04-07 15:56:26 +03:00
Vitaliy Filippov	c373425562	Fix nbd log	2022-04-07 15:55:38 +03:00
Vitaliy Filippov	3615e57879	Register standby monitors in etcd in /mon/member	2022-04-04 00:48:52 +03:00
Vitaliy Filippov	0edc6fe5a6	Add notes about the new script	2022-04-03 13:04:34 +03:00
Vitaliy Filippov	9c30df83e3	Fix a HUGE :) bug in NBD proxy The bug could result in corrupted data on large writes	2022-04-03 10:42:06 +03:00
Vitaliy Filippov	a420c77107	Add rebalance-verify test	2022-04-03 10:42:06 +03:00
Vitaliy Filippov	4100d829c7	Allow to override log file for daemonized NBD proxy	2022-04-03 02:41:04 +03:00
Vitaliy Filippov	79ebda933e	Fix a write hang with throttling due to timer reenterability / triggerability	2022-03-28 01:42:06 +03:00
Vitaliy Filippov	65d08e067e	Add a script for preparing hybrid (HDD+SSD) OSDs	2022-03-28 01:11:26 +03:00
Vitaliy Filippov	d289753df4	Implement snapshot deletion for Cinder driver	2022-03-08 11:41:29 +03:00
Vitaliy Filippov	85298ddae2	Release 0.6.15 - Make peering much faster in medium to large clusters - Fix a reenterability issue which could rarely lead to peering process hangs	2022-03-06 19:16:34 +03:00
Vitaliy Filippov	e23296a327	Rename cli_rm -> cli_rm_data, cli_snap_rm -> cli_rm	2022-02-24 14:34:14 +03:00
Vitaliy Filippov	839ec9e6e0	Shard clean_db by PGs to speedup listings	2022-02-20 00:21:24 +03:00
Vitaliy Filippov	7cbfdff41a	Replace some throws with force_stop	2022-02-20 00:21:19 +03:00
Vitaliy Filippov	951272f27f	Try to process PG one after another	2022-02-19 19:25:55 +03:00
Vitaliy Filippov	a3fb1d4c98	Fix reenterability around set_timer	2022-02-19 18:28:12 +03:00
Vitaliy Filippov	88402e6eb6	Move next_request to run_cb_and_clear	2022-02-19 16:59:03 +03:00
Vitaliy Filippov	390239c51b	Don't terminate HTTP requests with timeouts if response is already available in the socket	2022-02-19 13:37:12 +03:00
Vitaliy Filippov	b7b2adfa32	Fix http client not continuing requests in case of failure to connect	2022-02-19 13:36:26 +03:00
Vitaliy Filippov	36c276358b	Attempt to fix "head-of-line blocking" by LIST operations	2022-02-18 01:31:45 +03:00
Vitaliy Filippov	117d6f0612	Release 0.6.14 - Fix IPv6 address parsing - Fix "cannot read bytes of undefined" in the monitor on a fresh DB - Fix possible hangs of read requests on OSD restarts without immediate_commit=all mode - Fix OSDs skipping misplaced recovery in some cases - Fix OSDs possibly dying with "map::at" errors when other OSDs are stopped - Fix division by zero in ls if all pool OSDs are down	2022-02-17 14:43:44 +03:00
Vitaliy Filippov	7d79c58095	Use the larger sockaddr_storage structure	2022-02-12 11:22:56 +03:00
Vitaliy Filippov	46d2bc100f	Add some tolerance to stat calculation so it does not fail on a fresh DB	2022-02-11 16:37:16 +03:00
Vitaliy Filippov	732e2804e9	Fix operation dependency counter underflow for reads without immediate_commit=all mode	2022-02-11 10:54:11 +03:00
Vitaliy Filippov	abaec2008c	Fix OSDs missing misplaced recovery	2022-02-11 01:00:24 +03:00
Vitaliy Filippov	8129d238a4	Different fio versions have different types for xfer_buflen, but Vitastor anyway does not support 128-bit offsets	2022-02-10 01:21:04 +03:00
Vitaliy Filippov	61ebed144a	Fix OSDs possibly dying with "map::at" errors when other OSDs are stopped	2022-02-09 10:35:29 +03:00
Vitaliy Filippov	9d3ba113aa	Extract bind socket code into a utility function	2022-02-06 00:39:52 +03:00
Vitaliy Filippov	9788045dc9	Fix division by zero in ls if all pool OSDs are down	2022-02-05 17:03:37 +03:00
Vitaliy Filippov	d6b0d29af6	4k MEM_ALIGNMENT	2022-02-05 17:03:37 +03:00
Vitaliy Filippov	36f352f06f	Release 0.6.13 - Fix client hangs possible on OSD restarts (bug affected versions from 0.5.11) - Fix "Assertion `sqe != NULL' failed" io_uring-related crashes possible on some kernels (0.6.11 increased probability of this bug) - Fix timeout=0 in NBD proxy - Fix build under centos 7	2022-02-03 01:50:30 +03:00
Vitaliy Filippov	318cc463c2	Fix warnings	2022-02-03 01:50:30 +03:00
Vitaliy Filippov	145e5cfb86	MCL_ONFAULT is not available under centos 7	2022-02-03 01:42:19 +03:00
Vitaliy Filippov	73ae578981	Add osd_memlock option	2022-02-02 01:40:22 +03:00
Vitaliy Filippov	20ee4ed758	Update some parameter docs	2022-02-01 22:46:13 +03:00
Vitaliy Filippov	63de79d1b2	Change > to | to preserve newlines	2022-02-01 22:45:12 +03:00
Vitaliy Filippov	f712967079	And one more sqe starvation fix	2022-02-01 02:50:16 +03:00
Vitaliy Filippov	df0cd85352	Fix another part of the "async sqe clear" bug (followup to `d9857a5340`)	2022-02-01 01:14:56 +03:00
Vitaliy Filippov	ebaf4d7a72	Fix compatibility with fio 3.28+	2022-01-31 23:39:14 +03:00
Vitaliy Filippov	d4bc10542c	Fix compatibility with liburing >= 2.1 where it only has __pad2[2]	2022-01-31 22:49:40 +03:00
Vitaliy Filippov	140309620a	Free recv_buf in nbd_proxy	2022-01-31 20:37:58 +03:00
Vitaliy Filippov	0a610ee943	Destroy the client after completing CLI command	2022-01-31 18:27:04 +03:00
Vitaliy Filippov	f3ce166064	Do not print nan% in df when a pool has no available OSDs	2022-01-31 18:23:57 +03:00
Vitaliy Filippov	717d303370	Handle get_sqe failures, don't die with "will fall out of sync" in epoll_manager The problem is that in recent kernels io_uring may return completions BEFORE clearing the submission queue. For example, if its capacity is 512 and there were 512 requests, one of which completed, then the queue "should have" 1 free slot when the request completion is processed. But sometimes it doesn't, because io_uring doesn't always clear the submission queue before sending the CQE :-/	2022-01-31 02:52:20 +03:00
Vitaliy Filippov	d9857a5340	Check for SQEs, not for completions Should finally fix Assertion `sqe != NULL' failed introduced after journaling refactor in 0.6.11...	2022-01-31 02:19:10 +03:00
Vitaliy Filippov	eb5d9153e8	Fix build under centos 7	2022-01-30 20:29:44 +03:00
Vitaliy Filippov	ae6d1ed1d5	Remove completed items	2022-01-30 20:20:06 +03:00
Vitaliy Filippov	d123e58ea3	Fix yaml syntax - remove ` in default	2022-01-29 02:08:48 +03:00
Vitaliy Filippov	d9869d8116	Add parameter documentation	2022-01-28 02:45:54 +03:00
Vitaliy Filippov	4047ca606f	Add missing cancel_op(currently being read op) when stopping a client Fixes client hangs possible after stopping & restarting an osd. Hangs happened when a connection was closed in the middle of reading a READ operation reply from the network. In this case the operation being read was in read_op and the client didn't free it when closing the connection. Test case for msgr_read.cpp: - Partially read reply for a READ operation - stop_client() - Check that the READ operation returns EPIPE The bug was actually introduced in 0.5.11.	2022-01-28 01:53:52 +03:00
Vitaliy Filippov	218e294e9c	> 0, of course	2022-01-24 13:36:09 +03:00
Vitaliy Filippov	c1929cabe0	Release 0.6.12 etcd connection stability, clang & elbrus support - Fix build under Clang and Elbrus LCC compilers, making Vitastor compatible with Elbrus CPUs :) - Completely fix the bug where OSDs didn't connect to peers and incorrectly marked PGs as incomplete - Limit I/O depth for deletes the same way as for small writes. Makes OSD crashes with "Assertion failed: sqe != NULL" during image deletion go away - Fix a very old, but rare, journaling bug (credits to https://github.com/mirrorll) - Fix flushing of unclean journaled objects leading to OSDs sometimes hanging after failover in EC setups (bug was introduced in 0.6.7) - Fix several problems that could prevent smooth operation of a Vitastor cluster under the condition of partial etcd failure: - OSDs could randomly fail due to too strict error handling - New clients and OSDs could be unable to start because of the lack of retries - CLI could fail some commands because of the lack of retries - Monitor could stop receiving state updates because of the lack of websocket pings - Fix monitor being unable to rebalance PGs after a downscale of pool pg_size (3->2) - Exit with failure when trying to nbd map or benchmark a non-existing image - Use HTTP keep-alive for etcd connections - Allow to configure etcd request timeouts and retries - Allow to configure NBD timeout, max devices and partitions, and set default to up to 64 devices with up to 3 partitions each	2022-01-24 01:15:25 +03:00
Vitaliy Filippov	cc6b24e03a	Allow to configure NBD timeout, max devices and partitions Also set default NBD devices/partitions to 64/3, Linux default is 16/16 which is way too low	2022-01-24 01:15:19 +03:00
Vitaliy Filippov	0757ba630a	Do not happily NBD "map" non-existing images, do not try to benchmark them too	2022-01-23 23:03:42 +03:00
Vitaliy Filippov	2a0b881685	Respect max_write_iodepth for deletes	2022-01-23 22:05:23 +03:00
Vitaliy Filippov	9a15b843ff	Do not set pg_real_size to 0	2022-01-23 20:15:04 +03:00
Vitaliy Filippov	8dc1ffb13b	Try to connect with PG peers before deciding it's incomplete :) I already attempted to fix it in 0.6.11, but the fix turned out to be only partial :)	2022-01-23 19:19:26 +03:00
Vitaliy Filippov	ba63af49b4	Add etcd retries everywhere (they were missing in some places)	2022-01-23 17:21:48 +03:00
Vitaliy Filippov	31b9c683ee	Fix flushing of unclean objects This was preventing OSD failover when there were some unclean objects. Bug was introduced in `aa436027c8`	2022-01-23 00:45:11 +03:00
Vitaliy Filippov	3abcac058f	Check for double response_callback call more	2022-01-23 00:26:20 +03:00
Vitaliy Filippov	e01c4db702	Add paranoid if()s to prevent accidental double free of etcd_watch_ws	2022-01-23 00:16:09 +03:00
Vitaliy Filippov	a5cf06acd0	Remove etcd timeout and keepalive interval hardcode	2022-01-23 00:00:00 +03:00
Vitaliy Filippov	9c3653b1e1	Handle EINTR	2022-01-22 23:59:37 +03:00
Vitaliy Filippov	23e578b6a2	Fix common.sh	2022-01-21 01:51:25 +03:00
Vitaliy Filippov	7920414bee	Fix build under older gcc (debian buster)	2022-01-20 10:34:52 +03:00
Vitaliy Filippov	098e369a3b	Fix rand initialization, add etcd connection/disconnection logging	2022-01-20 00:45:49 +03:00
Vitaliy Filippov	a43ef525a2	Remove two last end()s from http_client (should have been removed in the keepalive patch)	2022-01-20 00:44:18 +03:00
Vitaliy Filippov	8a6b07d8f7	Add a 2/5 etcd failure test	2022-01-20 00:43:22 +03:00
Vitaliy Filippov	2c930d55fb	Merge pull request #41 from promobit-bitblaze/1-small-fix #1 fix deps	2022-01-18 11:19:08 +03:00
Mikhail Koshel	d798e0821e	#1 fix deps	2022-01-18 13:30:53 +06:00
Vitaliy Filippov	e591a3e9f7	Include sys/stat.h in messenger.cpp No idea why, but it builds without it on x86 and does not build on e2k	2022-01-17 13:43:29 +03:00
Vitaliy Filippov	77cc18420a	Fix leaks detected by clang scan-build (only 1 of 4 may be important though)	2022-01-16 00:11:59 +03:00
Vitaliy Filippov	7bdd92ca4f	Fix build under clang and some warnings Build problems fixed: - void* pointer arithmetic which is a GNU extension (works as byte*) - "variable size object may not be initialized" which is OK under GCC - nullptr_t related error in json11 (it lacks 'operator <' in clang) Warnings fixed: - empty nested struct initializer { 0 } replaced by {} - removed several unused lambda captures	2022-01-16 00:02:54 +03:00
Vitaliy Filippov	8f64fc61e7	Ignore empty events in mon	2022-01-08 11:41:00 +03:00
Vitaliy Filippov	4a9f001d9e	Make mon also ping etcd websockets regularly	2022-01-05 17:28:51 +03:00
Vitaliy Filippov	8c908316d9	Add a test with an OSD being added	2022-01-05 17:06:24 +03:00
Vitaliy Filippov	515a2e6e33	Only die when detecting a real race condition, not just a CAS failure	2022-01-05 17:05:25 +03:00
Vitaliy Filippov	68b6763ebe	Add asserts for lp-optimizer tests, pass `ordered` from the monitor	2022-01-03 20:37:07 +03:00
Vitaliy Filippov	9c6168bf17	Remove fill_parsed_response	2022-01-03 20:08:26 +03:00
Vitaliy Filippov	08e467270a	Fix pg_size changing from 3 to 2	2022-01-03 17:56:54 +03:00
Vitaliy Filippov	5473d5b4a2	Rework HTTP client to use keepalive, move getifaddr_list to addr_util	2022-01-03 14:52:01 +03:00
Vitaliy Filippov	c3304bce27	Merge pull request #38 from mirrorll/master journal check_available error	2021-12-31 12:45:16 +03:00
Vitaliy Filippov	ec2852c598	Add minsize_1 test	2021-12-28 10:54:36 +03:00
Vitaliy Filippov	b9f5c2a823	Support zero-copy send in fio_sec_osd to allow testing it Preliminary results: - CPU usage drops significantly. For example, in T1Q8 128K write test against stub_uring_osd with 10G network and Athlon X4 860k CPU it drops from 100% to 30% - Latency becomes slightly worse. In T1Q1 4K write test in the same environment latency increases from 56 to 63 us. - Small write throughput also becomes slightly worse. In T1Q128 4K write test against stub iops decreases from 138k to ~110k (unstable, fluctuates 100k..120k). Note that this is without io_uring, of course.	2021-12-27 02:12:44 +03:00
Vitaliy Filippov	e9d2f79aa7	Support reading bitmaps in fio_sec_osd	2021-12-27 02:12:44 +03:00
Vitaliy Filippov	0785bdf8b3	Release 0.6.11 - Slightly reduce journaling write amplification (requires no_same_sector_overwrites=false) - Fix listen_backlog (it was 0) because it could more than halve OSD socket send speed - Support IPv6 OSD addresses - Do not try to initialize client in simple-offsets - Fix OSDs sometimes marking PGs incomplete instead of trying to connect with peers - Allow to configure OSD placement in node_placement - Allow to run with 4k sector size block devices. Natural, but it was forbidden	2021-12-26 21:11:24 +03:00
Vitaliy Filippov	b57e44748b	Send 4 byte bitmap in stub_uring_osd	2021-12-25 11:38:13 +03:00
Vitaliy Filippov	1bbe62f29c	Fix uninitialized listen_backlog which was leading to REALLY SLOW send speeds!!!	2021-12-25 11:38:13 +03:00
lihai	3061c30132	journal check_available error	2021-12-21 09:39:58 +08:00
Vitaliy Filippov	20a4406acc	Support IPv6 OSD addresses	2021-12-19 10:42:17 +03:00
Vitaliy Filippov	f93491bc6c	Implement journal write batching and slightly refactor journal writes Slightly reduces WA. For example, in 4K T1Q128 replicated randwrite tests WA is reduced from ~3.6 to ~3.1, in T1Q64 from ~3.8 to ~3.4. Only effective without no_same_sector_overwrites.	2021-12-16 00:27:17 +03:00
Vitaliy Filippov	999bed8514	Fix opening regular files as blockstore	2021-12-15 02:08:58 +03:00
Vitaliy Filippov	3f33095fd7	Do not try to initialize client in simple-offsets	2021-12-15 02:07:27 +03:00