Release 1.5.0

After half a year of hard work, VitastorFS is finally here! :-)

New features:
- VitastorFS, a full-featured clustered (read-write-many) file system. Documentation: [VitastorFS](docs/usage/nfs.en.md)
- Embedded key-value database implementation based on the Parallel Optimistic B-Tree algorithm and used for the metadata of VitastorFS
- Pool management commands in vitastor-cli (create-pool, ls-pools, rm-pool, modify-pool). Thanks to MIND Software (https://mindsw.io) for their contribution! [Documentation](docs/usage/cli.en.md#create-pool)

Bug fixes:
- Fix a very rare "infinite loop" in the client library
- Fix a rare OSD hang during start when zeroing out bad metadata entries left from the previous run
- Fix build errors for gcc 8
2024-03-16 15:35:10 +03:00
95 changed files with 11512 additions and 937 deletions
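
The pool documentation updated in this commit suggests starting with a pg_count of 10 * (OSD count) / pg_size, rounded to the closest power of 2. A minimal sketch of that arithmetic follows; `suggest_pg_count` is a hypothetical helper written for illustration and is not part of vitastor-cli.

```shell
# Hypothetical helper (NOT part of vitastor-cli): suggest a starting pg_count
# from the rule of thumb in the pool docs:
#   10 * <OSD count> / pg_size, rounded to the closest power of 2.
suggest_pg_count() {
    osd_count=$1
    pg_size=$2
    target=$(( 10 * osd_count / pg_size ))
    p=1
    # Find the largest power of 2 that does not exceed the target
    while [ $(( p * 2 )) -le "$target" ]; do
        p=$(( p * 2 ))
    done
    # Move up to the next power of 2 if it is strictly closer to the target
    if [ $(( target - p )) -gt $(( p * 2 - target )) ]; then
        p=$(( p * 2 ))
    fi
    echo "$p"
}
```

The result could then be passed to the new command, for example `vitastor-cli create-pool testpool --pg_size 3 --pg_count "$(suggest_pg_count 12 3)"`. Ties are rounded down here, which is an arbitrary choice; the documentation only says "closest power of 2".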
--- a/.gitea/workflows/buildenv.Dockerfile
+++ b/.gitea/workflows/buildenv.Dockerfile
@@ -22,7 +22,7 @@ RUN apt-get update
 RUN apt-get -y install etcd qemu-system-x86 qemu-block-extra qemu-utils fio libasan5 \
    liburing1 liburing-dev libgoogle-perftools-dev devscripts libjerasure-dev cmake libibverbs-dev libisal-dev
 RUN apt-get -y build-dep fio qemu=`dpkg -s qemu-system-x86|grep ^Version:|awk '{print $2}'`
-RUN apt-get -y install jq lp-solve sudo
+RUN apt-get -y install jq lp-solve sudo nfs-common
 RUN apt-get --download-only source fio qemu=`dpkg -s qemu-system-x86|grep ^Version:|awk '{print $2}'`

 RUN set -ex; \
--- a/.gitea/workflows/test.yml
+++ b/.gitea/workflows/test.yml
@@ -856,3 +856,21 @@ jobs:
          echo ""
        done

+  test_nfs:
+    runs-on: ubuntu-latest
+    needs: build
+    container: ${{env.TEST_IMAGE}}:${{github.sha}}
+    steps:
+    - name: Run test
+      id: test
+      timeout-minutes: 3
+      run: /root/vitastor/tests/test_nfs.sh
+    - name: Print logs
+      if: always() && steps.test.outcome == 'failure'
+      run: |
+        for i in /root/vitastor/testdata/*.log /root/vitastor/testdata/*.txt; do
+          echo "-------- $i --------"
+          cat $i
+          echo ""
+        done
+
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -2,6 +2,6 @@ cmake_minimum_required(VERSION 2.8.12)

 project(vitastor)

-set(VERSION "1.4.8")
+set(VERSION "1.5.0")

 add_subdirectory(src)
--- a/README-ru.md
+++ b/README-ru.md
@@ -6,8 +6,8 @@

 Вернём былую скорость кластерному блочному хранилищу!

-Vitastor - распределённая блочная SDS (программная СХД), прямой аналог Ceph RBD и
-внутренних СХД популярных облачных провайдеров. Однако, в отличие от них, Vitastor
+Vitastor - распределённая блочная и файловая SDS (программная СХД), прямой аналог Ceph RBD и CephFS,
+а также внутренних СХД популярных облачных провайдеров. Однако, в отличие от них, Vitastor
 быстрый и при этом простой. Только пока маленький :-).

 Vitastor архитектурно похож на Ceph, что означает атомарность и строгую консистентность,
@@ -63,7 +63,7 @@ Vitastor поддерживает QEMU-драйвер, протоколы NBD и
  - [fio](docs/usage/fio.ru.md) для тестов производительности
  - [NBD](docs/usage/nbd.ru.md) для монтирования ядром
  - [QEMU и qemu-img](docs/usage/qemu.ru.md)
-  - [NFS](docs/usage/nfs.ru.md)-прокси для VMWare и подобных
+  - [NFS](docs/usage/nfs.ru.md) кластерная файловая система и псевдо-ФС прокси
 - Производительность
  - [Понимание сути производительности](docs/performance/understanding.ru.md)
  - [Теоретический максимум](docs/performance/theoretical.ru.md)
--- a/README.md
+++ b/README.md
@@ -6,9 +6,9 @@

 Make Clustered Block Storage Fast Again.

-Vitastor is a distributed block SDS, direct replacement of Ceph RBD and internal SDS's
-of public clouds. However, in contrast to them, Vitastor is fast and simple at the same time.
-The only thing is it's slightly young :-).
+Vitastor is a distributed block and file SDS, direct replacement of Ceph RBD and CephFS,
+and also internal SDS's of public clouds. However, in contrast to them, Vitastor is fast
+and simple at the same time. The only thing is it's slightly young :-).

 Vitastor is architecturally similar to Ceph which means strong consistency,
 primary-replication, symmetric clustering and automatic data distribution over any
@@ -63,7 +63,7 @@ Read more details below in the documentation.
  - [fio](docs/usage/fio.en.md) for benchmarks
  - [NBD](docs/usage/nbd.en.md) for kernel mounts
  - [QEMU and qemu-img](docs/usage/qemu.en.md)
-  - [NFS](docs/usage/nfs.en.md) emulator for VMWare and similar
+  - [NFS](docs/usage/nfs.en.md) clustered file system and pseudo-FS proxy
 - Performance
  - [Understanding storage performance](docs/performance/understanding.en.md)
  - [Theoretical performance](docs/performance/theoretical.en.md)
--- a/csi/Makefile
+++ b/csi/Makefile
@@ -1,4 +1,4 @@
-VERSION ?= v1.4.8
+VERSION ?= v1.5.0

 all: build push

--- a/csi/deploy/004-csi-nodeplugin.yaml
+++ b/csi/deploy/004-csi-nodeplugin.yaml
@@ -49,7 +49,7 @@ spec:
            capabilities:
              add: ["SYS_ADMIN"]
            allowPrivilegeEscalation: true
-          image: vitalif/vitastor-csi:v1.4.8
+          image: vitalif/vitastor-csi:v1.5.0
          args:
            - "--node=$(NODE_ID)"
            - "--endpoint=$(CSI_ENDPOINT)"
--- a/csi/deploy/007-csi-provisioner.yaml
+++ b/csi/deploy/007-csi-provisioner.yaml
@@ -121,7 +121,7 @@ spec:
            privileged: true
            capabilities:
              add: ["SYS_ADMIN"]
-          image: vitalif/vitastor-csi:v1.4.8
+          image: vitalif/vitastor-csi:v1.5.0
          args:
            - "--node=$(NODE_ID)"
            - "--endpoint=$(CSI_ENDPOINT)"
--- a/csi/src/config.go
+++ b/csi/src/config.go
@@ -5,7 +5,7 @@ package vitastor

 const (
    vitastorCSIDriverName    = "csi.vitastor.io"
-    vitastorCSIDriverVersion = "1.4.8"
+    vitastorCSIDriverVersion = "1.5.0"
 )

 // Config struct fills the parameters of request or user input
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,4 +1,4 @@
-vitastor (1.4.8-1) unstable; urgency=medium
+vitastor (1.5.0-1) unstable; urgency=medium

  * Bugfixes

--- a/debian/vitastor-client.install
+++ b/debian/vitastor-client.install
@@ -3,4 +3,6 @@ usr/bin/vitastor-cli
 usr/bin/vitastor-rm
 usr/bin/vitastor-nbd
 usr/bin/vitastor-nfs
+usr/bin/vitastor-kv
+usr/bin/vitastor-kv-stress
 usr/lib/*/libvitastor*.so*
--- a/debian/vitastor.Dockerfile
+++ b/debian/vitastor.Dockerfile
@@ -37,8 +37,8 @@ RUN set -e -x; \
    mkdir -p /root/packages/vitastor-$REL; \
    rm -rf /root/packages/vitastor-$REL/*; \
    cd /root/packages/vitastor-$REL; \
-    cp -r /root/vitastor vitastor-1.4.8; \
-    cd vitastor-1.4.8; \
+    cp -r /root/vitastor vitastor-1.5.0; \
+    cd vitastor-1.5.0; \
    ln -s /root/fio-build/fio-*/ ./fio; \
    FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
    ls /usr/include/linux/raw.h || cp ./debian/raw.h /usr/include/linux/raw.h; \
@@ -51,8 +51,8 @@ RUN set -e -x; \
    rm -rf a b; \
    echo "dep:fio=$FIO" > debian/fio_version; \
    cd /root/packages/vitastor-$REL; \
-    tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_1.4.8.orig.tar.xz vitastor-1.4.8; \
-    cd vitastor-1.4.8; \
+    tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_1.5.0.orig.tar.xz vitastor-1.5.0; \
+    cd vitastor-1.5.0; \
    V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
    DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
    DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
--- a/docs/config/pool.en.md
+++ b/docs/config/pool.en.md
@@ -41,6 +41,7 @@ Parameters:
 - [osd_tags](#osd_tags)
 - [primary_affinity_tags](#primary_affinity_tags)
 - [scrub_interval](#scrub_interval)
+- [used_for_fs](#used_for_fs)

 Examples:

@@ -154,8 +155,25 @@ That is, if it becomes impossible to place PG data on at least (pg_minsize)
 OSDs, PG is deactivated for both read and write. So you know that a fresh
 write always goes to at least (pg_minsize) OSDs (disks).

-That is, pg_size minus pg_minsize sets the number of disk failures to tolerate
-without temporary downtime (for [osd_out_time](monitor.en.md#osd_out_time)).
+For example, the difference between pg_minsize 2 and 1 in a 3-way replicated
+pool (pg_size=3) is:
+- If 2 hosts go down with pg_minsize=2, the pool becomes inactive and remains
+  inactive for [osd_out_time](monitor.en.md#osd_out_time) (10 minutes). After
+  this timeout, the monitor selects replacement hosts/OSDs and the pool comes
+  up and starts to heal. Therefore, if you don't have replacement OSDs, i.e.
+  if you only have 3 hosts with OSDs and 2 of them are down, the pool remains
+  inactive until you add or return at least 1 host (or change failure_domain
+  to "osd").
+- If 2 hosts go down with pg_minsize=1, the pool only experiences a short
+  I/O pause until the monitor notices that OSDs are down (5-10 seconds with
+  the default [etcd_report_interval](osd.en.md#etcd_report_interval)). After
+  this pause, I/O resumes, but new data is temporarily written in only 1 copy.
+  Then, after osd_out_time, the monitor also selects replacement OSDs and the
+  pool starts to heal.
+
+So, pg_minsize regulates the number of failures that a pool can tolerate
+without temporary downtime for [osd_out_time](monitor.en.md#osd_out_time),
+but at a cost of slightly reduced storage reliability.

 FIXME: pg_minsize behaviour may be changed in the future to only make PGs
 read-only instead of deactivating them.
@@ -168,8 +186,8 @@ read-only instead of deactivating them.
 Number of PGs for this pool. The value should be big enough for the monitor /
 LP solver to be able to optimize data placement.

-"Enough" is usually around 64-128 PGs per OSD, i.e. you set pg_count for pool
-to (total OSD count * 100 / pg_size). You can round it to the closest power of 2,
+"Enough" is usually around 10-100 PGs per OSD, i.e. you set pg_count for pool
+to (total OSD count * 10 / pg_size). You can round it to the closest power of 2,
 because it makes it easier to reduce or increase PG count later by dividing or
 multiplying it by 2.

@@ -282,6 +300,25 @@ of the OSDs containing a data chunk for a PG.
 Automatic scrubbing interval for this pool. Overrides
 [global scrub_interval setting](osd.en.md#scrub_interval).

+## used_for_fs
+
+- Type: string
+
+If non-empty, the pool is marked as used for VitastorFS with metadata stored
+in block image (regular Vitastor volume) named as the value of this pool parameter.
+
+When a pool is marked as used for VitastorFS, regular block volume creation in it
+is disabled (vitastor-cli refuses to create images without --force) to protect
+the user from block volume and FS file ID collisions and data loss.
+
+[vitastor-nfs](../usage/nfs.ru.md), in its turn, refuses to use pools not marked
+for the corresponding FS when starting. This also implies that you can use one
+pool only for one VitastorFS.
+
+The second thing that is disabled for VitastorFS pools is reporting per-inode space
+usage statistics in etcd because a FS pool may store a very large number of files
+and statistics for them all would take a lot of space in etcd.
+
 # Examples

 ## Replicated pool
--- a/docs/config/pool.ru.md
+++ b/docs/config/pool.ru.md
@@ -40,6 +40,7 @@
 - [osd_tags](#osd_tags)
 - [primary_affinity_tags](#primary_affinity_tags)
 - [scrub_interval](#scrub_interval)
+- [used_for_fs](#used_for_fs)

 Примеры:

@@ -157,9 +158,25 @@
 OSD, PG деактивируется на чтение и запись. Иными словами, всегда известно,
 что новые блоки данных всегда записываются как минимум на pg_minsize дисков.

-По сути, разница pg_size и pg_minsize задаёт число отказов дисков, которые пул
-может пережить без временной (на [osd_out_time](monitor.ru.md#osd_out_time))
-остановки обслуживания.
+Для примера, разница между pg_minsize 2 и 1 в реплицированном пуле с 3 копиями
+данных (pg_size=3), проявляется следующим образом:
+- Если 2 сервера отключаются при pg_minsize=2, пул становится неактивным и
+  остаётся неактивным в течение [osd_out_time](monitor.en.md#osd_out_time)
+  (10 минут), после чего монитор назначает другие OSD/серверы на замену, пул
+  поднимается и начинает восстанавливать недостающие копии данных. Соответственно,
+  если OSD на замену нет - то есть, если у вас всего 3 сервера с OSD и 2 из них
+  недоступны - пул так и остаётся недоступным до тех пор, пока вы не вернёте
+  или не добавите хотя бы 1 сервер (или не переключите failure_domain на "osd").
+- Если 2 сервера отключаются при pg_minsize=1, ввод-вывод лишь приостанавливается
+  на короткое время, до тех пор, пока монитор не поймёт, что OSD отключены
+  (что занимает 5-10 секунд при стандартном [etcd_report_interval](osd.en.md#etcd_report_interval)).
+  После этого ввод-вывод восстанавливается, но новые данные временно пишутся
+  всего в 1 копии. Когда же проходит osd_out_time, монитор точно так же назначает
+  другие OSD на замену выбывшим и пул начинает восстанавливать копии данных.
+
+То есть, pg_minsize регулирует число отказов, которые пул может пережить без
+временной остановки обслуживания на [osd_out_time](monitor.ru.md#osd_out_time),
+но ценой немного пониженных гарантий надёжности.

 FIXME: Поведение pg_minsize может быть изменено в будущем с полной деактивации
 PG на перевод их в режим только для чтения.
@@ -172,8 +189,8 @@ PG на перевод их в режим только для чтения.
 Число PG для данного пула. Число должно быть достаточно большим, чтобы монитор
 мог равномерно распределить по ним данные.

-Обычно это означает примерно 64-128 PG на 1 OSD, т.е. pg_count можно устанавливать
-равным (общему числу OSD * 100 / pg_size). Значение можно округлить до ближайшей
+Обычно это означает примерно 10-100 PG на 1 OSD, т.е. pg_count можно устанавливать
+равным (общему числу OSD * 10 / pg_size). Значение можно округлить до ближайшей
 степени 2, чтобы потом было легче уменьшать или увеличивать число PG, умножая
 или деля его на 2.

@@ -290,6 +307,27 @@ OSD с "all".
 Интервал скраба, то есть, автоматической фоновой проверки данных для данного пула.
 Переопределяет [глобальную настройку scrub_interval](osd.ru.md#scrub_interval).

+## used_for_fs
+
+- Type: string
+
+Если непусто, пул помечается как используемый для файловой системы VitastorFS с
+метаданными, хранимыми в блочном образе Vitastor с именем, равным значению
+этого параметра.
+
+Когда пул помечается как используемый для VitastorFS, создание обычных блочных
+образов в нём отключается (vitastor-cli отказывается создавать образы без --force),
+чтобы защитить пользователя от коллизий ID файлов и блочных образов и, таким
+образом, от потери данных.
+
+[vitastor-nfs](../usage/nfs.ru.md), в свою очередь, при запуске отказывается
+использовать для ФС пулы, не выделенные для неё. Это также означает, что один
+пул может использоваться только для одной VitastorFS.
+
+Также для ФС-пулов отключается передача статистики в etcd по отдельным инодам,
+так как ФС-пул может содержать очень много файлов и статистика по ним всем
+заняла бы очень много места в etcd.
+
 # Примеры

 ## Реплицированный пул
--- a/docs/intro/features.en.md
+++ b/docs/intro/features.en.md
@@ -33,6 +33,7 @@
 - [Checksums](../config/layout-osd.en.md#data_csum_type)
 - [Client write-back cache](../config/client.en.md#client_enable_writeback)
 - [Intelligent recovery auto-tuning](../config/osd.en.md#recovery_tune_interval)
+- [Clustered file system](../usage/nfs.en.md#vitastorfs)

 ## Plugins and tools

@@ -46,13 +47,12 @@
 - [CSI plugin for Kubernetes](../installation/kubernetes.en.md)
 - [OpenStack support: Cinder driver, Nova and libvirt patches](../installation/openstack.en.md)
 - [Proxmox storage plugin and packages](../installation/proxmox.en.md)
-- [Simplified NFS proxy for file-based image access emulation (suitable for VMWare)](../usage/nfs.en.md)
+- [Simplified NFS proxy for file-based image access emulation (suitable for VMWare)](../usage/nfs.en.md#pseudo-fs)

 ## Roadmap

 The following features are planned for the future:

-- File system
 - Control plane optimisation
 - Other administrative tools
 - Web GUI
--- a/docs/intro/features.ru.md
+++ b/docs/intro/features.ru.md
@@ -35,6 +35,7 @@
 - [Контрольные суммы](../config/layout-osd.ru.md#data_csum_type)
 - [Буферизация записи на стороне клиента](../config/client.ru.md#client_enable_writeback)
 - [Интеллектуальная автоподстройка скорости восстановления](../config/osd.ru.md#recovery_tune_interval)
+- [Кластерная файловая система](../usage/nfs.ru.md#vitastorfs)

 ## Драйверы и инструменты

@@ -48,11 +49,10 @@
 - [CSI-плагин для Kubernetes](../installation/kubernetes.ru.md)
 - [Базовая поддержка OpenStack: драйвер Cinder, патчи для Nova и libvirt](../installation/openstack.ru.md)
 - [Плагин для Proxmox](../installation/proxmox.ru.md)
-- [Упрощённая NFS-прокси для эмуляции файлового доступа к образам (подходит для VMWare)](../usage/nfs.ru.md)
+- [Упрощённая NFS-прокси для эмуляции файлового доступа к образам (подходит для VMWare)](../usage/nfs.ru.md#псевдо-фс)

 ## Планы развития

-- Файловая система
 - Оптимизация слоя управления
 - Другие инструменты администрирования
 - Web-интерфейс
--- a/docs/intro/quickstart.en.md
+++ b/docs/intro/quickstart.en.md
@@ -14,6 +14,7 @@
 - [Check cluster status](#check-cluster-status)
 - [Create an image](#create-an-image)
 - [Install plugins](#install-plugins)
+- [Create VitastorFS](#create-vitastorfs)

 ## Preparation

@@ -75,18 +76,16 @@ On the monitor hosts:

 ## Create a pool

-Create pool configuration in etcd:
+Create a pool using vitastor-cli:

 ```
-etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool",
-  "scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'
+vitastor-cli create-pool testpool --pg_size 2 --pg_count 256
 ```

 For EC pools the configuration should look like the following:

 ```
-etcdctl --endpoints=... put /vitastor/config/pools '{"2":{"name":"ecpool",
-  "scheme":"ec","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}}'
+vitastor-cli create-pool testpool --ec 2+2 --pg_count 256
 ```

 After you do this, one of the monitors will configure PGs and OSDs will start them.
@@ -116,3 +115,9 @@ After that, you can [run benchmarks](../usage/fio.en.md) or [start QEMU manually
 - [Proxmox](../installation/proxmox.en.md)
 - [OpenStack](../installation/openstack.en.md)
 - [Kubernetes CSI](../installation/kubernetes.en.md)
+
+## Create VitastorFS
+
+If you want to use clustered file system in addition to VM or container images:
+
+- [Follow the instructions here](../usage/nfs.en.md#vitastorfs)
--- a/docs/intro/quickstart.ru.md
+++ b/docs/intro/quickstart.ru.md
@@ -14,6 +14,7 @@
 - [Проверьте состояние кластера](#проверьте-состояние-кластера)
 - [Создайте образ](#создайте-образ)
 - [Установите плагины](#установите-плагины)
+- [Создайте VitastorFS](#создайте-vitastorfs)

 ## Подготовка

@@ -77,18 +78,16 @@

 ## Создайте пул

-Создайте конфигурацию пула с помощью etcdctl:
+Создайте пул с помощью vitastor-cli:

 ```
-etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool",
-  "scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'
+vitastor-cli create-pool testpool --pg_size 2 --pg_count 256
 ```

 Для пулов с кодами коррекции ошибок конфигурация должна выглядеть примерно так:

 ```
-etcdctl --endpoints=... put /vitastor/config/pools '{"2":{"name":"ecpool",
-  "scheme":"ec","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}}'
+vitastor-cli create-pool testpool --ec 2+2 --pg_count 256
 ```

 После этого один из мониторов должен сконфигурировать PG, а OSD должны запустить их.
@@ -118,3 +117,10 @@ vitastor-cli create -s 10G testimg
 - [Proxmox](../installation/proxmox.ru.md)
 - [OpenStack](../installation/openstack.ru.md)
 - [Kubernetes CSI](../installation/kubernetes.ru.md)
+
+## Создайте VitastorFS
+
+Если вы хотите использовать не только блочные образы виртуальных машин или контейнеров,
+а также кластерную файловую систему, то:
+
+- [Следуйте инструкциям](../usage/nfs.en.md#vitastorfs)
--- a/docs/usage/cli.en.md
+++ b/docs/usage/cli.en.md
@@ -24,6 +24,10 @@ It supports the following commands:
 - [fix](#fix)
 - [alloc-osd](#alloc-osd)
 - [rm-osd](#rm-osd)
+- [create-pool](#create-pool)
+- [modify-pool](#modify-pool)
+- [ls-pools](#ls-pools)
+- [rm-pool](#rm-pool)

 Global options:

@@ -137,8 +141,8 @@ Rename, resize image or change its readonly status. Images with children can't b
 If the new size is smaller than the old size, extra data will be purged.
 You should resize file system in the image, if present, before shrinking it.

-| `-f|--force` | Proceed with shrinking or setting readwrite flag even if the image has children. |
-| `--down-ok`  | Proceed with shrinking even if some data will be left on unavailable OSDs.       |
+* `-f|--force` - Proceed with shrinking or setting readwrite flag even if the image has children.
+* `--down-ok` - Proceed with shrinking even if some data will be left on unavailable OSDs.

 ## rm

@@ -152,7 +156,7 @@ In other cases parent layers are always merged into children.

 Other options:

-| `--down-ok` | Continue deletion/merging even if some data will be left on unavailable OSDs. |
+* `--down-ok` - Continue deletion/merging even if some data will be left on unavailable OSDs.

 ## flatten

@@ -241,3 +245,91 @@ Refuses to remove OSDs with data without `--force` and `--allow-data-loss`.

 With `--dry-run` only checks if deletion is possible without data loss and
 redundancy degradation.
+
+## create-pool
+
+`vitastor-cli create-pool|pool-create <name> (-s <pg_size>|--ec <N>+<K>) -n <pg_count> [OPTIONS]`
+
+Create a pool. Required parameters:
+
+| <!-- -->                 | <!-- -->                                                                              |
+|--------------------------|---------------------------------------------------------------------------------------|
+| `-s R` or `--pg_size R`  | Number of replicas for replicated pools                                               |
+| `--ec N+K`               | Number of data (N) and parity (K) chunks for erasure-coded pools                      |
+| `-n N` or `--pg_count N` | PG count for the new pool (start with 10*<OSD count>/pg_size rounded to a power of 2) |
+
+Optional parameters:
+
+| <!-- -->                       | <!-- -->                                                                   |
+|--------------------------------|----------------------------------------------------------------------------|
+| `--pg_minsize <number>`        | R or N+K minus number of failures to tolerate without downtime ([details](../config/pool.en.md#pg_minsize)) |
+| `--failure_domain host`        | Failure domain: host, osd or a level from placement_levels. Default: host  |
+| `--root_node <node>`           | Put pool only on child OSDs of this placement tree node                    |
+| `--osd_tags <tag>[,<tag>]...`  | Put pool only on OSDs tagged with all specified tags                       |
+| `--block_size 128k`            | Put pool only on OSDs with this data block size                            |
+| `--bitmap_granularity 4k`      | Put pool only on OSDs with this logical sector size                        |
+| `--immediate_commit none`      | Put pool only on OSDs with this or larger immediate_commit (none < small < all) |
+| `--primary_affinity_tags tags` | Prefer to put primary copies on OSDs with all specified tags               |
+| `--scrub_interval <time>`      | Enable regular scrubbing for this pool. Format: number + unit s/m/h/d/M/y  |
+| `--used_for_fs <name>`         | Mark pool as used for VitastorFS with metadata in image <name>             |
+| `--pg_stripe_size <number>`    | Increase object grouping stripe                                            |
+| `--max_osd_combinations 10000` | Maximum number of random combinations for LP solver input                  |
+| `--wait`                       | Wait for the new pool to come online                                       |
+| `-f` or `--force`              | Do not check that cluster has enough OSDs to create the pool               |
+
+See also [Pool configuration](../config/pool.en.md) for detailed parameter descriptions.
+
+Examples:
+
+`vitastor-cli create-pool test_x4 -s 4 -n 32`
+
+`vitastor-cli create-pool test_ec42 --ec 4+2 -n 32`
+
+## modify-pool
+
+`vitastor-cli modify-pool|pool-modify <id|name> [--name <new_name>] [PARAMETERS...]`
+
+Modify an existing pool. Modifiable parameters:
+
+```
+[-s|--pg_size <number>] [--pg_minsize <number>] [-n|--pg_count <count>]
+[--failure_domain <level>] [--root_node <node>] [--osd_tags <tags>] [--no_inode_stats 0|1]
+[--max_osd_combinations <number>] [--primary_affinity_tags <tags>] [--scrub_interval <time>]
+```
+
+Non-modifiable parameters (changing them WILL lead to data loss):
+
+```
+[--block_size <size>] [--bitmap_granularity <size>]
+[--immediate_commit <all|small|none>] [--pg_stripe_size <size>]
+```
+
+These, however, can still be modified with -f|--force.
+
+See [create-pool](#create-pool) for parameter descriptions.
+
+Examples:
+
+`vitastor-cli modify-pool pool_A --name pool_B`
+
+`vitastor-cli modify-pool 2 --pg_size 4 -n 128`
+
+## rm-pool
+
+`vitastor-cli rm-pool|pool-rm [--force] <id|name>`
+
+Remove a pool. Refuses to remove pools with images without `--force`.
+
+## ls-pools
+
+`vitastor-cli ls-pools|pool-ls|ls-pool|pools [-l] [--detail] [--sort FIELD] [-r] [-n N] [--stats] [<glob> ...]`
+
+List pools (only matching <glob> patterns if passed).
+
+| <!-- -->             | <!-- -->                                              |
+|----------------------|-------------------------------------------------------|
+| `-l` or `--long`     | Also report I/O statistics                            |
+| `--detail`           | Use list format (not table), show all details         |
+| `--sort FIELD`       | Sort by specified field (see fields in --json output) |
+| `-r` or `--reverse`  | Sort in descending order                              |
+| `-n` or `--count N`  | Only list first N items                               |
--- a/docs/usage/cli.ru.md
+++ b/docs/usage/cli.ru.md
@@ -23,6 +23,10 @@ vitastor-cli - интерфейс командной строки для адм
 - [merge-data](#merge-data)
 - [alloc-osd](#alloc-osd)
 - [rm-osd](#rm-osd)
+- [create-pool](#create-pool)
+- [modify-pool](#modify-pool)
+- [ls-pools](#ls-pools)
+- [rm-pool](#rm-pool)

 Глобальные опции:

@@ -85,8 +89,8 @@ kaveri    2/1     32   0 B      10 G    0 B        100%    0%

 `vitastor-cli ls [-l] [-p POOL] [--sort FIELD] [-r] [-n N] [<glob> ...]`

-Показать список образов, если переданы шаблоны `<glob>`, то только с именами,
-соответствующими этим шаблонам (стандартные ФС-шаблоны с * и ?).
+Показать список образов, если передан(ы) шаблон(ы) `<glob>`, то только с именами,
+соответствующими одному из шаблонов (стандартные ФС-шаблоны с * и ?).

 Опции:

@@ -140,8 +144,8 @@ vitastor-cli snap-create [-p|--pool <id|name>] <image>@<snapshot>
 Если новый размер меньше старого, "лишние" данные будут удалены, поэтому перед уменьшением
 образа сначала уменьшите файловую систему в нём.

-| -f|--force | Разрешить уменьшение или перевод в чтение-запись образа, у которого есть клоны.        |
-| --down-ok  | Разрешить уменьшение, даже если часть данных останется неудалённой на недоступных OSD. |
+* `-f|--force` - Разрешить уменьшение или перевод в чтение-запись образа, у которого есть клоны.
+* `--down-ok` - Разрешить уменьшение, даже если часть данных останется неудалённой на недоступных OSD.

 ## rm

@@ -159,7 +163,7 @@ vitastor-cli snap-create [-p|--pool <id|name>] <image>@<snapshot>

 Другие опции:

-| `--down-ok` | Продолжать удаление/слияние, даже если часть данных останется неудалённой на недоступных OSD. |
+* `--down-ok` - Продолжать удаление/слияние, даже если часть данных останется неудалённой на недоступных OSD.

 ## flatten

@@ -258,3 +262,91 @@ vitastor-cli snap-create [-p|--pool <id|name>] <image>@<snapshot>

 С опцией `--dry-run` только проверяет, возможно ли удаление без потери данных и деградации
 избыточности.
+
+## create-pool
+
+`vitastor-cli create-pool|pool-create <name> (-s <pg_size>|--ec <N>+<K>) -n <pg_count> [OPTIONS]`
+
+Создать пул. Обязательные параметры:
+
+| <!-- -->                  | <!-- -->                                                                                    |
+|---------------------------|---------------------------------------------------------------------------------------------|
+| `-s R` или `--pg_size R`  | Число копий данных для реплицированных пулов                                                |
+| `--ec N+K`                | Число частей данных (N) и чётности (K) для пулов с кодами коррекции ошибок                  |
+| `-n N` или `--pg_count N` | Число PG для нового пула (начните с 10*<число OSD>/pg_size, округлённого до степени двойки) |
+
+Необязательные параметры:
+
+| <!-- -->                       | <!-- -->                                                                   |
+|--------------------------------|----------------------------------------------------------------------------|
+| `--pg_minsize <number>`        | (R или N+K) минус число разрешённых отказов без остановки пула ([подробнее](../config/pool.ru.md#pg_minsize)) |
+| `--failure_domain host`        | Домен отказа: host, osd или другой из placement_levels. По умолчанию: host |
+| `--root_node <node>`           | Использовать для пула только дочерние OSD этого узла дерева размещения     |
+| `--osd_tags <tag>[,<tag>]...`  | ...только OSD со всеми заданными тегами                                    |
+| `--block_size 128k`            | ...только OSD с данным размером блока                                      |
+| `--bitmap_granularity 4k`      | ...только OSD с данным размером логического сектора                        |
+| `--immediate_commit none`      | ...только OSD с этим или большим immediate_commit (none < small < all)     |
+| `--primary_affinity_tags tags` | Предпочитать OSD со всеми данными тегами для роли первичных                |
+| `--scrub_interval <time>`      | Включить скрабы с заданным интервалом времени (число + единица s/m/h/d/M/y) |
+| `--pg_stripe_size <number>`    | Увеличить блок группировки объектов по PG                                  |
+| `--max_osd_combinations 10000` | Максимальное число случайных комбинаций OSD для ЛП-солвера                 |
+| `--wait`                       | Подождать, пока новый пул будет активирован                                |
+| `-f` или `--force`             | Не проверять, что в кластере достаточно доменов отказа для создания пула   |
+
+Подробно о параметрах см. [Конфигурация пулов](../config/pool.ru.md).
+
+Примеры:
+
+`vitastor-cli create-pool test_x4 -s 4 -n 32`
+
+`vitastor-cli create-pool test_ec42 --ec 4+2 -n 32`
+
+## modify-pool
+
+`vitastor-cli modify-pool|pool-modify <id|name> [--name <new_name>] [PARAMETERS...]`
+
+Изменить настройки существующего пула. Изменяемые параметры:
+
+```
+[-s|--pg_size <number>] [--pg_minsize <number>] [-n|--pg_count <count>]
+[--failure_domain <level>] [--root_node <node>] [--osd_tags <tags>]
+[--max_osd_combinations <number>] [--primary_affinity_tags <tags>] [--scrub_interval <time>]
+```
+
+Неизменяемые параметры (их изменение ПРИВЕДЁТ к потере данных):
+
+```
+[--block_size <size>] [--bitmap_granularity <size>]
+[--immediate_commit <all|small|none>] [--pg_stripe_size <size>]
+```
+
+Эти параметры можно изменить, только если явно передать опцию -f или --force.
+
+Описания параметров смотрите в [create-pool](#create-pool).
+
+Примеры:
+
+`vitastor-cli modify-pool pool_A --name pool_B`
+
+`vitastor-cli modify-pool 2 --pg_size 4 -n 128`
+
+## rm-pool
+
+`vitastor-cli rm-pool|pool-rm [--force] <id|name>`
+
+Удалить пул. Отказывается удалять пул, в котором ещё есть образы, без `--force`.
+
+## ls-pools
+
+`vitastor-cli ls-pools|pool-ls|ls-pool|pools [-l] [--detail] [--sort FIELD] [-r] [-n N] [--stats] [<glob> ...]`
+
+List pools. If `<glob>` pattern(s) are given, only pools whose names match
+one of the patterns are listed (standard file system globs with * and ?).
+
+| <!-- -->              | <!-- -->                                                   |
+|-----------------------|------------------------------------------------------------|
+| `-l` or `--long`      | Also show I/O statistics                                   |
+| `--detail`            | Most detailed output, as a list (instead of a table)       |
+| `--sort FIELD`        | Sort by the given field (see fields in --json output)      |
+| `-r` or `--reverse`   | Sort in reverse order                                      |
+| `-n` or `--count N`   | Show only the first N entries                              |
--- a/docs/usage/nfs.en.md
+++ b/docs/usage/nfs.en.md
@@ -4,42 +4,150 @@

 [Читать на русском](nfs.ru.md)

-# NFS
+# VitastorFS and pseudo-FS

-Vitastor has a simplified NFS 3.0 proxy for file-based image access emulation. It's not
-suitable as a full-featured file system, at least because all file/image metadata is stored
-in etcd and kept in memory all the time - thus you can't put a lot of files in it.
+Vitastor has two file system implementations. Both can be used via `vitastor-nfs`.

-However, NFS proxy is totally fine as a method to provide VM image access and allows to
-plug Vitastor into, for example, VMWare. It's important to note that for VMWare it's a much
-better access method than iSCSI, because with iSCSI we'd have to put all VM images into one
-Vitastor image exported as a LUN to VMWare and formatted with VMFS. VMWare doesn't use VMFS
-over NFS.
+Commands:
+- [mount](#mount)
+- [start](#start)

-NFS proxy is stateless if you use immediate_commit=all mode (for SSD with capacitors or
-HDDs with disabled cache), so you can run multiple NFS proxies and use a network load
-balancer or any failover method you want to in that case.
+## Pseudo-FS

-vitastor-nfs usage:
+The simplified pseudo-FS proxy emulates file-based access to images. It's not
+suitable as a full-featured file system: it lacks a lot of FS features and stores
+all file/image metadata in memory and in etcd. So it's fine for hundreds or thousands
+of large files/images, but not for millions.
+
+The pseudo-FS proxy is intended for environments where other block volume access methods
+can't be used or impose additional restrictions, for example, VMWare. NFS is better
+for VMWare than, for example, iSCSI: with iSCSI, VMWare puts all VM images
+into one large shared block image in its own VMFS file system, whereas with NFS, VMWare
+doesn't use VMFS and puts each VM disk in a regular file which corresponds to one
+Vitastor block image, just as originally intended.
+
+To use Vitastor pseudo-FS locally, run `vitastor-nfs mount --block /mnt/vita`.
+
+Also you can start the network server:

 ```
-vitastor-nfs [STANDARD OPTIONS] [OTHER OPTIONS]
-
--subdir <DIR>    export images prefixed <DIR>/ (default empty - export all images)
--portmap 0       do not listen on port 111 (portmap/rpcbind, requires root)
--bind <IP>       bind service to <IP> address (default 0.0.0.0)
--nfspath <PATH>  set NFS export path to <PATH> (default is /)
--port <PORT>     use port <PORT> for NFS services (default is 2049)
--pool <POOL>     use <POOL> as default pool for new files (images)
--foreground 1    stay in foreground, do not daemonize
+vitastor-nfs start --block --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
 ```

-Example start and mount commands (etcd_address is optional):
+To mount the FS exported by this server, run:

 ```
-vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
+mount server:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
 ```

-```
-mount localhost:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
-```
+## VitastorFS
+
+VitastorFS is a full-featured clustered (Read-Write-Many) file system. It supports most POSIX
+features like hierarchical organization, symbolic links, hard links, quick renames and so on.
+
+VitastorFS metadata is stored in a Parallel Optimistic B-Tree key-value database,
+implemented over a regular Vitastor block volume. Directory entries and inodes
+are stored in a simple human-readable JSON format in the B-Tree. The `vitastor-kv` tool
+can be used to inspect the database.
+
+To use VitastorFS:
+
+1. Create a pool or choose an existing empty pool for FS data
+2. Create an image for FS metadata, preferably in a faster (SSD or replica-HDD) pool,
+   but you can create it in the data pool too if you want (image size doesn't matter):
+   `vitastor-cli create -s 10G -p fastpool testfs`
+3. Mark data pool as an FS pool: `vitastor-cli modify-pool --used-for-fs testfs data-pool`
+4. Either mount the FS: `vitastor-nfs mount --fs testfs --pool data-pool /mnt/vita`
+5. Or start the NFS server: `vitastor-nfs start --fs testfs --pool data-pool`
+
+### Supported POSIX features
+
+- Read-after-write semantics (read returns new data immediately after write)
+- Linear and random read and write
+- Writing outside current file size
+- Hierarchical structure, immediate rename of files and directories
+- File size change support (truncate)
+- Permissions (chmod/chown)
+- Flushing data to stable storage (if required) (fsync)
+- Symbolic links
+- Hard links
+- Special files (devices, sockets, named pipes)
+- File modification and attribute change time tracking (mtime and ctime)
+- Modification time (mtime) and last access time (atime) change support (utimes)
+- Correct handling of directory listing during file creation/deletion
+
+### Limitations
+
+POSIX features currently not implemented in VitastorFS:
+- File locking is not supported
+- Actually used space is not counted, so `du` always reports apparent file sizes
+  instead of actually allocated space
+- Access times (`atime`) are not tracked (like `-o noatime`)
+- Modification time (`mtime`) is updated lazily every second (like `-o lazytime`)
+
+Other notable missing features which should be addressed in the future:
+- Defragmentation of "shared" inodes. Files smaller than the pool object size (block_size
+  multiplied by the data part count if the pool is EC) are internally stored in large block
+  volumes sequentially, one after another, and leave garbage after being deleted or resized.
+  A defragmenter will be implemented to collect this garbage.
+- Inode ID reuse. Currently inode IDs always grow and the limit is 2^48 inodes, so
+  in theory you may hit it if you create and delete a very large number of files.
+- Compaction of the key-value B-Tree. The current implementation never merges or deletes
+  B-Tree blocks, so the B-Tree may become bloated over time. Currently you can
+  use the `vitastor-kv dumpjson` and `loadjson` commands to recreate the index in such
+  situations.
+- Filesystem check tool. VitastorFS doesn't have a journal because it would impose a
+  severe performance hit; optimistic CAS-based transactions are used instead.
+  So, again, in theory an abnormal shutdown of the FS server may leave some garbage
+  in the DB. The FS is implemented in such a way that this garbage doesn't affect its
+  function, but having a tool to clean it up still seems like the right thing to do.
+
+## Horizontal scaling
+
+The Linux NFS 3.0 client doesn't support built-in scaling or failover, i.e. you can't
+specify multiple server addresses when mounting the FS.
+
+However, you can use any regular TCP load balancing over multiple NFS servers.
+It's absolutely safe with `immediate_commit=all` and `client_enable_writeback=false`
+settings, because Vitastor NFS proxy doesn't keep uncommitted data in memory
+with these settings. But it may even work without `immediate_commit=all` because
+the Linux NFS client repeats all uncommitted writes if it loses the connection.
+
+## Commands
+
+### mount
+
+`vitastor-nfs (--fs <NAME> | --block) [-o <OPT>] mount <MOUNTPOINT>`
+
+Start a local filesystem server and mount the file system to \<MOUNTPOINT>.
+
+Use regular `umount <MOUNTPOINT>` to unmount the FS.
+
+The server will be automatically stopped when the FS is unmounted.
+
+- `-o|--options <OPT>` - Pass additional NFS mount options (ex.: -o async).
+
+### start
+
+`vitastor-nfs (--fs <NAME> | --block) start`
+
+Start network NFS server. Options:
+
+| <!-- -->        | <!-- -->                                                   |
+|-----------------|------------------------------------------------------------|
+| `--bind <IP>`   | bind service to \<IP> address (default 0.0.0.0)            |
+| `--port <PORT>` | use port \<PORT> for NFS services (default is 2049)        |
+| `--portmap 0`   | do not listen on port 111 (portmap/rpcbind, requires root) |
+
+## Common options
+
+| <!-- -->           | <!-- -->                                                 |
+|--------------------|----------------------------------------------------------|
+| `--fs <NAME>`      | use VitastorFS with metadata in image \<NAME>            |
+| `--block`          | use pseudo-FS presenting images as files                 |
+| `--pool <POOL>`    | use \<POOL> as default pool for new files                |
+| `--subdir <DIR>`   | export \<DIR> instead of root directory (pseudo-FS only) |
+| `--nfspath <PATH>` | set NFS export path to \<PATH> (default is /)            |
+| `--pidfile <FILE>` | write process ID to the specified file                   |
+| `--logfile <FILE>` | log to the specified file                                |
+| `--foreground 1`   | stay in foreground, do not daemonize                     |
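The two numeric limits in the "Limitations" section of `nfs.en.md` above can be illustrated with a little arithmetic. This is only a sketch based on the wording of the docs: the function names are hypothetical, `ec_data_parts == 0` is an ad-hoc convention for "replicated pool", and the block size and EC scheme in the usage note are example values, not defaults taken from this commit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: pool object size as described in the docs above,
// i.e. block_size for replicated pools and block_size * data parts for EC pools.
uint64_t pool_object_size(uint64_t block_size, uint64_t ec_data_parts)
{
    assert(block_size > 0);
    // ec_data_parts == 0 is used here to mean "replicated pool" (an assumption of this sketch)
    return ec_data_parts > 0 ? block_size * ec_data_parts : block_size;
}

// Files smaller than the pool object size are the ones packed into "shared" inodes
bool is_packed_into_shared_inode(uint64_t file_size, uint64_t block_size, uint64_t ec_data_parts)
{
    return file_size < pool_object_size(block_size, ec_data_parts);
}

// Upper bound on the number of inode IDs mentioned in the docs: 2^48
constexpr uint64_t MAX_INODE_COUNT = (uint64_t)1 << 48;
```

For example, with a 131072-byte block size and a 4+2 EC pool the object size is 524288 bytes, so a 300000-byte file would count as "small", while in a replicated pool with the same block size it would not.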
--- a/docs/usage/nfs.ru.md
+++ b/docs/usage/nfs.ru.md
@@ -4,41 +4,156 @@

 [Read in English](nfs.en.md)

-# NFS
+# VitastorFS and pseudo-FS

-Vitastor implements a simplified NFS 3.0 proxy for emulating file-based access to images.
-It's not a full-featured file system, because the metadata of all files (images) is stored
-in etcd and kept in memory all the time, so you can't put a lot of files
-in it.
+Vitastor has two file system implementations. Both are used via `vitastor-nfs`.

-However, the NFS proxy works very well as a way to access virtual machine images
-and allows you to plug Vitastor into, for example, VMWare.
+Commands:
+- [mount](#mount)
+- [start](#start)

-Also, if you use immediate_commit=all mode (for SSDs with capacitors or HDDs
-with disabled cache), the NFS server is stateless, so you can freely run
-multiple instances of it and put a network load balancer or a
-failover scheme on top of them.
+## Pseudo-FS

-vitastor-nfs usage:
+The simplified pseudo-FS implementation is used to emulate file-based access to Vitastor
+block images. It's not a full-featured file system: it lacks many POSIX FS features,
+and the metadata of all files (images) is stored in etcd and kept in memory all the
+time, so the pseudo-FS is suitable for hundreds or thousands of files, but not for millions.
+
+The pseudo-FS is intended for accessing virtual machine images in environments where other
+access methods are impossible or inconvenient, for example, VMWare. For VMWare it's a better
+option than iSCSI, because with iSCSI, VMWare places all virtual machines in one
+large block image inside its own VMFS file system, while with NFS, VMFS isn't used and each
+VM disk is represented as a single file, i.e. it corresponds to one Vitastor block
+image, just as originally intended.
+
+To mount the Vitastor pseudo-FS, run `vitastor-nfs mount --block /mnt/vita`.
+
+Or start the network version of the server:

 ```
-vitastor-nfs [STANDARD OPTIONS] [OTHER OPTIONS]
-
--subdir <DIR>    export a "subdirectory": images with the name prefix <DIR>/ (default empty - export all images)
--portmap 0       disable the portmap/rpcbind service on port 111 (enabled by default, requires root privileges)
--bind <IP>       accept connections on the <IP> address (default 0.0.0.0 - all addresses)
--nfspath <PATH>  set the NFS export path to <PATH> (default /)
--port <PORT>     use port <PORT> for NFS services (default 2049)
--pool <POOL>     use pool <POOL> for new images (required if the cluster has more than one pool)
--foreground 1    stay in foreground after start, do not daemonize
+vitastor-nfs start --block --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
 ```

-Example of mounting Vitastor via NFS (etcd_address is optional):
+An FS started with these options can be mounted with the following command:

 ```
-vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
+mount server:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
 ```

-```
-mount localhost:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
-```
+## VitastorFS
+
+VitastorFS is a full-featured clustered (Read-Write-Many) file system. It supports
+most POSIX features: hierarchical organization, symbolic links, hard
+links, quick renames and so on.
+
+VitastorFS metadata is stored in a custom key-value database implementation
+based on a Parallel Optimistic B-Tree over a regular Vitastor block image.
+Both directory entries and inodes are stored, as usual in Vitastor, in a simple human-readable
+JSON format :-). The `vitastor-kv` tool can be used to inspect the database contents.
+
+To use VitastorFS:
+
+1. Create a pool for FS data or choose an existing empty pool
+2. Create a block image for FS metadata, preferably in a faster pool (on SSDs
+   or at least on HDDs, but without EC), though you can also use the same pool as for data
+   (image size doesn't matter):
+   `vitastor-cli create -s 10G -p fastpool testfs`
+3. Mark the data pool as an FS pool: `vitastor-cli modify-pool --used-for-fs testfs data-pool`
+4. Either mount the FS: `vitastor-nfs mount --fs testfs --pool data-pool /mnt/vita`
+5. Or start the network NFS server: `vitastor-nfs start --fs testfs --pool data-pool`
+
+### Supported POSIX features
+
+- Reading the current version of data immediately after a write
+- Sequential and random reads and writes
+- Writing beyond the current file size
+- Hierarchical organization, instant rename of files and directories
+- File size change (truncate)
+- File permissions (chmod/chown)
+- Flushing data to disks (when required) (fsync)
+- Symbolic links
+- Hard links
+- Special files (devices, sockets, pipes)
+- Tracking of modification (mtime) and attribute change (ctime) times
+- Manual change of modification (mtime) and last access (atime) times
+- Correct handling of file list changes during directory listing
+
+### Limitations
+
+POSIX features currently missing in VitastorFS:
+- File locking is not supported
+- Space actually used by files is not counted and is not returned by
+  stat(2) calls, so `du` always reports the sum of file sizes instead of actually used space
+- Access times (`atime`) are not tracked (as if the FS were mounted with `-o noatime`)
+- Modification times (`mtime`) are tracked asynchronously (as if the FS were mounted with `-o lazytime`)
+
+Other missing features which should be added in the future:
+- Defragmentation of "shared inodes". At the FS implementation level, files smaller than the pool
+  object size (block_size multiplied by the data part count if the pool is EC)
+  are packed one after another into large "shared" inodes/volumes. When such files are deleted
+  or grown, they are moved and leave "garbage" behind, which is where a
+  defragmenter is needed.
+- Inode number reuse. In the current implementation inode numbers always
+  grow, so in theory you can hit the limit if you create
+  and delete more than 2^48 files.
+- Space reclamation in the metadata B-Tree. The current implementation never merges or
+  deletes B-Tree blocks, so in theory the tree can grow and become suboptimal.
+  If you run into this situation now, you can resolve it with the
+  `vitastor-kv dumpjson` and `loadjson` commands (i.e. by dumping and reloading all FS metadata).
+- A file system metadata check tool. VitastorFS has no journal, because
+  a journal would slow the implementation down a lot. Optimistic CAS-based
+  (compare-and-write) transactions are used instead, and in theory an abnormal
+  shutdown of the FS server may also leave inconsistent "garbage"
+  entries in the DB. The FS is designed so that they don't affect its operation, but it's still
+  worth being able to clean them up.
+
+## Horizontal scaling
+
+The Linux NFS 3.0 client doesn't support built-in scaling or failover.
+That is, you can't specify multiple server addresses when mounting the FS.
+
+However, you can use any standard network load balancers
+or failover schemes. This is definitely safe with the `immediate_commit=all` and
+`client_enable_writeback=false` settings, because with them the Vitastor NFS server doesn't keep
+any data in memory that isn't yet committed to disks; and it's quite likely safe
+even without `immediate_commit=all`, because the Linux kernel NFS client repeats all
+uncommitted requests when it loses the connection.
+
+## Commands
+
+### mount
+
+`vitastor-nfs (--fs <NAME> | --block) mount [-o <OPT>] <MOUNTPOINT>`
+
+Start a local server and mount the FS to the \<MOUNTPOINT> directory.
+
+To unmount the FS, use the regular `umount <MOUNTPOINT>` command.
+
+The server is stopped automatically when the FS is unmounted.
+
+- `-o|--options <OPT>` - Pass additional NFS mount options (example: -o async).
+
+### start
+
+`vitastor-nfs (--fs <NAME> | --block) start`
+
+Start a network NFS server. Options:
+
+| <!-- -->        | <!-- -->                                                              |
+|-----------------|-----------------------------------------------------------------------|
+| `--bind <IP>`   | accept connections on the \<IP> address (default 0.0.0.0 - all)       |
+| `--port <PORT>` | use port \<PORT> for NFS services (default 2049)                      |
+| `--portmap 0`   | disable the portmap/rpcbind service on port 111 (enabled by default, requires root privileges) |
+
+## Common options
+
+| <!-- -->           | <!-- -->                                                |
+|--------------------|---------------------------------------------------------|
+| `--fs <NAME>`      | use VitastorFS with metadata in image \<NAME>           |
+| `--block`          | use pseudo-FS to access block images                    |
+| `--pool <POOL>`    | use pool \<POOL> for new files (required if the cluster has more than one pool) |
+| `--subdir <DIR>`   | export the \<DIR> subdirectory instead of the root (pseudo-FS only) |
+| `--nfspath <PATH>` | set the NFS export path to \<PATH> (default /)          |
+| `--pidfile <FILE>` | write the process ID to the specified file              |
+| `--logfile <FILE>` | write logs to the specified file                        |
+| `--foreground 1`   | stay in foreground, do not daemonize                    |
--- a/mon/mon.js
+++ b/mon/mon.js
@@ -37,7 +37,7 @@ const etcd_allow = new RegExp('^'+[
    'pg/history/[1-9]\\d*/[1-9]\\d*',
    'pool/stats/[1-9]\\d*',
    'history/last_clean_pgs',
-    'inode/stats/[1-9]\\d*/[1-9]\\d*',
+    'inode/stats/[1-9]\\d*/\\d+',
    'pool/stats/[1-9]\\d*',
    'stats',
    'index/image/.*',
@@ -1737,8 +1737,11 @@ class Mon
                for (const inode_num in this.state.osd.space[osd_num][pool_id])
                {
                    const u = BigInt(this.state.osd.space[osd_num][pool_id][inode_num]||0);
-                    inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
-                    inode_stats[pool_id][inode_num].raw_used += u;
+                    if (inode_num)
+                    {
+                        inode_stats[pool_id][inode_num] = inode_stats[pool_id][inode_num] || inode_stub();
+                        inode_stats[pool_id][inode_num].raw_used += u;
+                    }
                    this.state.pool.stats[pool_id].used_raw_tb += u;
                }
            }
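The `etcd_allow` change in `mon.js` above relaxes the inode part of `inode/stats/<pool>/<inode>` keys from `[1-9]\d*` to `\d+`, so inode number 0 becomes an acceptable key. The difference can be sketched with `std::regex`, which uses ECMAScript syntax by default. The helper names are hypothetical, and only the two sub-patterns from the diff are checked in isolation here, not the monitor's full combined expression.

```cpp
#include <regex>
#include <string>

// Pattern from the removed line: both pool ID and inode number must be >= 1
bool matches_old_inode_stats_key(const std::string & key)
{
    static const std::regex re("inode/stats/[1-9]\\d*/[1-9]\\d*");
    // regex_match requires the whole string to match the pattern
    return std::regex_match(key, re);
}

// Pattern from the added line: the inode part may be any run of digits, including "0"
bool matches_new_inode_stats_key(const std::string & key)
{
    static const std::regex re("inode/stats/[1-9]\\d*/\\d+");
    return std::regex_match(key, re);
}
```

With these helpers, `inode/stats/1/0` is rejected by the old pattern and accepted by the new one, while a pool ID of 0 is still rejected by both.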
--- a/mon/package.json
+++ b/mon/package.json
@@ -1,6 +1,6 @@
 {
  "name": "vitastor-mon",
-  "version": "1.4.8",
+  "version": "1.5.0",
  "description": "Vitastor SDS monitor service",
  "main": "mon-main.js",
  "scripts": {
--- a/patches/cinder-vitastor.py
+++ b/patches/cinder-vitastor.py
@@ -50,7 +50,7 @@ from cinder.volume import configuration
 from cinder.volume import driver
 from cinder.volume import volume_utils

-VERSION = '1.4.8'
+VERSION = '1.5.0'

 LOG = logging.getLogger(__name__)

--- a/rpm/build-tarball.sh
+++ b/rpm/build-tarball.sh
@@ -24,4 +24,4 @@ rm fio
 mv fio-copy fio
 FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
 perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
-tar --transform 's#^#vitastor-1.4.8/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-1.4.8$(rpm --eval '%dist').tar.gz *
+tar --transform 's#^#vitastor-1.5.0/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-1.5.0$(rpm --eval '%dist').tar.gz *
--- a/rpm/vitastor-el7.Dockerfile
+++ b/rpm/vitastor-el7.Dockerfile
@@ -36,7 +36,7 @@ ADD . /root/vitastor
 RUN set -e; \
    cd /root/vitastor/rpm; \
    sh build-tarball.sh; \
-    cp /root/vitastor-1.4.8.el7.tar.gz ~/rpmbuild/SOURCES; \
+    cp /root/vitastor-1.5.0.el7.tar.gz ~/rpmbuild/SOURCES; \
    cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
    cd ~/rpmbuild/SPECS/; \
    rpmbuild -ba vitastor.spec; \
--- a/rpm/vitastor-el7.spec
+++ b/rpm/vitastor-el7.spec
@@ -1,11 +1,11 @@
 Name:           vitastor
-Version:        1.4.8
+Version:        1.5.0
 Release:        1%{?dist}
 Summary:        Vitastor, a fast software-defined clustered block storage

 License:        Vitastor Network Public License 1.1
 URL:            https://vitastor.io/
-Source0:        vitastor-1.4.8.el7.tar.gz
+Source0:        vitastor-1.5.0.el7.tar.gz

 BuildRequires:  liburing-devel >= 0.6
 BuildRequires:  gperftools-devel
@@ -149,9 +149,12 @@ mkdir -p /etc/vitastor
 %_bindir/vitastor-nfs
 %_bindir/vitastor-cli
 %_bindir/vitastor-rm
+%_bindir/vitastor-kv
+%_bindir/vitastor-kv-stress
 %_bindir/vita
 %_libdir/libvitastor_blk.so*
 %_libdir/libvitastor_client.so*
+%_libdir/libvitastor_kv.so*


 %files -n vitastor-client-devel
--- a/rpm/vitastor-el8.Dockerfile
+++ b/rpm/vitastor-el8.Dockerfile
@@ -35,7 +35,7 @@ ADD . /root/vitastor
 RUN set -e; \
    cd /root/vitastor/rpm; \
    sh build-tarball.sh; \
-    cp /root/vitastor-1.4.8.el8.tar.gz ~/rpmbuild/SOURCES; \
+    cp /root/vitastor-1.5.0.el8.tar.gz ~/rpmbuild/SOURCES; \
    cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
    cd ~/rpmbuild/SPECS/; \
    rpmbuild -ba vitastor.spec; \
--- a/rpm/vitastor-el8.spec
+++ b/rpm/vitastor-el8.spec
@@ -1,11 +1,11 @@
 Name:           vitastor
-Version:        1.4.8
+Version:        1.5.0
 Release:        1%{?dist}
 Summary:        Vitastor, a fast software-defined clustered block storage

 License:        Vitastor Network Public License 1.1
 URL:            https://vitastor.io/
-Source0:        vitastor-1.4.8.el8.tar.gz
+Source0:        vitastor-1.5.0.el8.tar.gz

 BuildRequires:  liburing-devel >= 0.6
 BuildRequires:  gperftools-devel
@@ -146,9 +146,12 @@ mkdir -p /etc/vitastor
 %_bindir/vitastor-nfs
 %_bindir/vitastor-cli
 %_bindir/vitastor-rm
+%_bindir/vitastor-kv
+%_bindir/vitastor-kv-stress
 %_bindir/vita
 %_libdir/libvitastor_blk.so*
 %_libdir/libvitastor_client.so*
+%_libdir/libvitastor_kv.so*


 %files -n vitastor-client-devel
--- a/rpm/vitastor-el9.Dockerfile
+++ b/rpm/vitastor-el9.Dockerfile
@@ -18,7 +18,7 @@ ADD . /root/vitastor
 RUN set -e; \
    cd /root/vitastor/rpm; \
    sh build-tarball.sh; \
-    cp /root/vitastor-1.4.8.el9.tar.gz ~/rpmbuild/SOURCES; \
+    cp /root/vitastor-1.5.0.el9.tar.gz ~/rpmbuild/SOURCES; \
    cp vitastor-el9.spec ~/rpmbuild/SPECS/vitastor.spec; \
    cd ~/rpmbuild/SPECS/; \
    rpmbuild -ba vitastor.spec; \
--- a/rpm/vitastor-el9.spec
+++ b/rpm/vitastor-el9.spec
@@ -1,11 +1,11 @@
 Name:           vitastor
-Version:        1.4.8
+Version:        1.5.0
 Release:        1%{?dist}
 Summary:        Vitastor, a fast software-defined clustered block storage

 License:        Vitastor Network Public License 1.1
 URL:            https://vitastor.io/
-Source0:        vitastor-1.4.8.el9.tar.gz
+Source0:        vitastor-1.5.0.el9.tar.gz

 BuildRequires:  liburing-devel >= 0.6
 BuildRequires:  gperftools-devel
@@ -139,9 +139,12 @@ mkdir -p /etc/vitastor
 %_bindir/vitastor-nfs
 %_bindir/vitastor-cli
 %_bindir/vitastor-rm
+%_bindir/vitastor-kv
+%_bindir/vitastor-kv-stress
 %_bindir/vita
 %_libdir/libvitastor_blk.so*
 %_libdir/libvitastor_client.so*
+%_libdir/libvitastor_kv.so*


 %files -n vitastor-client-devel
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -16,7 +16,7 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
 	set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
 endif()

-add_definitions(-DVERSION="1.4.8")
+add_definitions(-DVERSION="1.5.0")
 add_definitions(-D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -fno-omit-frame-pointer -I ${CMAKE_SOURCE_DIR}/src)
 add_link_options(-fno-omit-frame-pointer)
 if (${WITH_ASAN})
@@ -145,7 +145,6 @@ add_library(vitastor_client SHARED
 	cli_status.cpp
 	cli_describe.cpp
 	cli_fix.cpp
-	cli_df.cpp
 	cli_ls.cpp
 	cli_create.cpp
 	cli_modify.cpp
@@ -154,6 +153,11 @@ add_library(vitastor_client SHARED
 	cli_rm_data.cpp
 	cli_rm.cpp
 	cli_rm_osd.cpp
+	cli_pool_cfg.cpp
+	cli_pool_create.cpp
+	cli_pool_ls.cpp
+	cli_pool_modify.cpp
+	cli_pool_rm.cpp
 )
 set_target_properties(vitastor_client PROPERTIES PUBLIC_HEADER "vitastor_c.h")
 target_link_libraries(vitastor_client
@@ -181,10 +185,48 @@ target_link_libraries(vitastor-nbd
 	vitastor_client
 )

+# libvitastor_kv.so
+add_library(vitastor_kv SHARED
+	kv_db.cpp
+	kv_db.h
+)
+target_link_libraries(vitastor_kv
+	vitastor_client
+)
+set_target_properties(vitastor_kv PROPERTIES VERSION ${VERSION} SOVERSION 0)
+
+# vitastor-kv
+add_executable(vitastor-kv
+	kv_cli.cpp
+)
+target_link_libraries(vitastor-kv
+	vitastor_kv
+)
+
+add_executable(vitastor-kv-stress
+	kv_stress.cpp
+)
+target_link_libraries(vitastor-kv-stress
+	vitastor_kv
+)
+
 # vitastor-nfs
 add_executable(vitastor-nfs
 	nfs_proxy.cpp
-	nfs_conn.cpp
+	nfs_block.cpp
+	nfs_kv.cpp
+	nfs_kv_create.cpp
+	nfs_kv_getattr.cpp
+	nfs_kv_link.cpp
+	nfs_kv_lookup.cpp
+	nfs_kv_read.cpp
+	nfs_kv_readdir.cpp
+	nfs_kv_remove.cpp
+	nfs_kv_rename.cpp
+	nfs_kv_setattr.cpp
+	nfs_kv_write.cpp
+	nfs_fsstat.cpp
+	nfs_mount.cpp
 	nfs_portmap.cpp
 	sha256.c
 	nfs/xdr_impl.cpp
@@ -194,6 +236,7 @@ add_executable(vitastor-nfs
 )
 target_link_libraries(vitastor-nfs
 	vitastor_client
+	vitastor_kv
 )

 # vitastor-cli
@@ -318,12 +361,12 @@ add_test(NAME test_cluster_client COMMAND test_cluster_client)

 ### Install

-install(TARGETS vitastor-osd vitastor-disk vitastor-nbd vitastor-nfs vitastor-cli RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
+install(TARGETS vitastor-osd vitastor-disk vitastor-nbd vitastor-nfs vitastor-cli vitastor-kv vitastor-kv-stress RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
 install_symlink(vitastor-disk ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_BINDIR}/vitastor-dump-journal)
 install_symlink(vitastor-cli ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_BINDIR}/vitastor-rm)
 install_symlink(vitastor-cli ${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_BINDIR}/vita)
 install(
-	TARGETS vitastor_blk vitastor_client
+	TARGETS vitastor_blk vitastor_client vitastor_kv
 	LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
 	PUBLIC_HEADER DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
 )
--- a/src/blockstore.cpp
+++ b/src/blockstore.cpp
@@ -82,3 +82,8 @@ uint32_t blockstore_t::get_bitmap_granularity()
 {
    return impl->get_bitmap_granularity();
 }
+
+void blockstore_t::set_no_inode_stats(const std::vector<uint64_t> & pool_ids)
+{
+    impl->set_no_inode_stats(pool_ids);
+}
--- a/src/blockstore.h
+++ b/src/blockstore.h
@@ -216,6 +216,9 @@ public:
    // Get per-inode space usage statistics
    std::map<uint64_t, uint64_t> & get_inode_space_stats();

+    // Set per-pool no_inode_stats
+    void set_no_inode_stats(const std::vector<uint64_t> & pool_ids);
+
    // Print diagnostics to stdout
    void dump_diagnostics();

--- a/src/blockstore_impl.cpp
+++ b/src/blockstore_impl.cpp
@@ -733,3 +733,86 @@ void blockstore_impl_t::disk_error_abort(const char *op, int retval, int expecte
    fprintf(stderr, "Disk %s failed: result is %d, expected %d. Can't continue, sorry :-(\n", op, retval, expected);
    exit(1);
 }
+
+void blockstore_impl_t::set_no_inode_stats(const std::vector<uint64_t> & pool_ids)
+{
+    for (auto & np: no_inode_stats)
+    {
+        np.second = 2;
+    }
+    for (auto pool_id: pool_ids)
+    {
+        if (!no_inode_stats[pool_id])
+            recalc_inode_space_stats(pool_id, false);
+        no_inode_stats[pool_id] = 1;
+    }
+    for (auto np_it = no_inode_stats.begin(); np_it != no_inode_stats.end(); )
+    {
+        if (np_it->second == 2)
+        {
+            recalc_inode_space_stats(np_it->first, true);
+            no_inode_stats.erase(np_it++);
+        }
+        else
+            np_it++;
+    }
+}
+
+void blockstore_impl_t::recalc_inode_space_stats(uint64_t pool_id, bool per_inode)
+{
+    auto sp_begin = inode_space_stats.lower_bound((pool_id << (64-POOL_ID_BITS)));
+    auto sp_end = inode_space_stats.lower_bound(((pool_id+1) << (64-POOL_ID_BITS)));
+    inode_space_stats.erase(sp_begin, sp_end);
+    auto sh_it = clean_db_shards.lower_bound((pool_id << (64-POOL_ID_BITS)));
+    while (sh_it != clean_db_shards.end() &&
+        (sh_it->first >> (64-POOL_ID_BITS)) == pool_id)
+    {
+        for (auto & pair: sh_it->second)
+        {
+            uint64_t space_id = per_inode ? pair.first.inode : (pool_id << (64-POOL_ID_BITS));
+            inode_space_stats[space_id] += dsk.data_block_size;
+        }
+        sh_it++;
+    }
+    object_id last_oid = {};
+    bool last_exists = false;
+    auto dirty_it = dirty_db.lower_bound((obj_ver_id){ .oid = { .inode = (pool_id << (64-POOL_ID_BITS)) } });
+    while (dirty_it != dirty_db.end() && (dirty_it->first.oid.inode >> (64-POOL_ID_BITS)) == pool_id)
+    {
+        if (IS_STABLE(dirty_it->second.state) && (IS_BIG_WRITE(dirty_it->second.state) || IS_DELETE(dirty_it->second.state)))
+        {
+            bool exists = false;
+            if (last_oid == dirty_it->first.oid)
+            {
+                exists = last_exists;
+            }
+            else
+            {
+                auto & clean_db = clean_db_shard(dirty_it->first.oid);
+                auto clean_it = clean_db.find(dirty_it->first.oid);
+                exists = clean_it != clean_db.end();
+            }
+            uint64_t space_id = per_inode ? dirty_it->first.oid.inode : (pool_id << (64-POOL_ID_BITS));
+            if (IS_BIG_WRITE(dirty_it->second.state))
+            {
+                if (!exists)
+                    inode_space_stats[space_id] += dsk.data_block_size;
+                last_exists = true;
+            }
+            else
+            {
+                if (exists)
+                {
+                    auto & sp = inode_space_stats[space_id];
+                    if (sp > dsk.data_block_size)
+                        sp -= dsk.data_block_size;
+                    else
+                        inode_space_stats.erase(space_id);
+                }
+                last_exists = false;
+            }
+            last_oid = dirty_it->first.oid;
+        }
+        dirty_it++;
+    }
+}
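Note on the hunk above: the recalculation first counts one data block per clean object, then replays the stable big writes and deletes from the dirty list in order, remembering whether the previous entry for the same object left it existing. A minimal standalone sketch of that two-pass idea follows. The container types, the `obj_key_t` and `dirty_entry_t` names, and the values of `POOL_ID_BITS` and `DATA_BLOCK_SIZE` are assumptions made for illustration, not the real blockstore structures, and the sketch processes whatever it is given instead of filtering by one pool.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Illustrative constants, assumed here and not taken from the Vitastor headers
static const int POOL_ID_BITS = 16;
static const uint64_t DATA_BLOCK_SIZE = 128*1024;

// Hypothetical stand-in for object_id: (inode, stripe)
typedef std::pair<uint64_t, uint64_t> obj_key_t;

// Hypothetical stand-in for a stable dirty entry
struct dirty_entry_t
{
    obj_key_t oid;
    bool is_delete; // false = stable big write, true = stable delete
};

std::map<uint64_t, uint64_t> recalc_space_stats(const std::set<obj_key_t> & clean,
    const std::vector<dirty_entry_t> & dirty, bool per_inode)
{
    std::map<uint64_t, uint64_t> stats;
    auto space_id = [per_inode](uint64_t inode) -> uint64_t
    {
        // Per-pool accounting keeps only the pool ID bits of the inode number
        return per_inode ? inode : ((inode >> (64-POOL_ID_BITS)) << (64-POOL_ID_BITS));
    };
    // Pass 1: every clean object occupies one data block
    for (auto & oid: clean)
        stats[space_id(oid.first)] += DATA_BLOCK_SIZE;
    // Pass 2: replay stable dirty entries, assumed sorted by (object, version)
    bool have_last = false, last_exists = false;
    obj_key_t last_oid;
    for (auto & e: dirty)
    {
        // For repeated entries of one object, use the state left by the previous entry
        bool exists = (have_last && last_oid == e.oid) ? last_exists : clean.count(e.oid) > 0;
        uint64_t id = space_id(e.oid.first);
        if (!e.is_delete)
        {
            if (!exists)
                stats[id] += DATA_BLOCK_SIZE;
            last_exists = true;
        }
        else
        {
            if (exists)
            {
                auto & sp = stats[id];
                if (sp > DATA_BLOCK_SIZE)
                    sp -= DATA_BLOCK_SIZE;
                else
                    stats.erase(id);
            }
            last_exists = false;
        }
        last_oid = e.oid;
        have_last = true;
    }
    return stats;
}
```

With `per_inode` set to false all objects of a pool are summed under a single key whose low bits are zero, so the statistics map holds one entry per pool instead of one per inode.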
--- a/src/blockstore_impl.h
+++ b/src/blockstore_impl.h
@@ -272,6 +272,7 @@ class blockstore_impl_t

    std::map<pool_id_t, pool_shard_settings_t> clean_db_settings;
    std::map<pool_pg_id_t, blockstore_clean_db_t> clean_db_shards;
+    std::map<uint64_t, int> no_inode_stats;
    uint8_t *clean_bitmaps = NULL;
    blockstore_dirty_db_t dirty_db;
    std::vector<blockstore_op_t*> submit_queue;
@@ -318,6 +319,7 @@ class blockstore_impl_t

    blockstore_clean_db_t& clean_db_shard(object_id oid);
    void reshard_clean_db(pool_id_t pool_id, uint32_t pg_count, uint32_t pg_stripe_size);
+    void recalc_inode_space_stats(uint64_t pool_id, bool per_inode);

    // Journaling
    void prepare_journal_sector_write(int sector, blockstore_op_t *op);
@@ -428,6 +430,9 @@ public:
    // Space usage statistics
    std::map<uint64_t, uint64_t> inode_space_stats;

+    // Set per-pool no_inode_stats
+    void set_no_inode_stats(const std::vector<uint64_t> & pool_ids);
+
    // Print diagnostics to stdout
    void dump_diagnostics();

--- a/src/blockstore_init.cpp
+++ b/src/blockstore_init.cpp
@@ -32,7 +32,7 @@ void blockstore_init_meta::handle_event(ring_data_t *data, int buf_num)
    if (data->res < 0)
    {
        throw std::runtime_error(
-            std::string("read metadata failed at offset ") + std::to_string(bufs[buf_num].offset) +
+            std::string("read metadata failed at offset ") + std::to_string(buf_num >= 0 ? bufs[buf_num].offset : last_read_offset) +
            std::string(": ") + strerror(-data->res)
        );
    }
@@ -63,6 +63,7 @@ int blockstore_init_meta::loop()
        throw std::runtime_error("Failed to allocate metadata read buffer");
    // Read superblock
    GET_SQE();
+    last_read_offset = 0;
    data->iov = { metadata_buffer, (size_t)bs->dsk.meta_block_size };
    data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
    my_uring_prep_readv(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset);
@@ -100,6 +101,7 @@ resume_1:
        {
            printf("Initializing metadata area\n");
            GET_SQE();
+            last_read_offset = 0;
            data->iov = (struct iovec){ metadata_buffer, (size_t)bs->dsk.meta_block_size };
            data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
            my_uring_prep_writev(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset);
@@ -236,6 +238,7 @@ resume_2:
                data->iov = { bufs[i].buf, (size_t)bufs[i].size };
                data->callback = [this, i](ring_data_t *data) { handle_event(data, i); };
                my_uring_prep_writev(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + bufs[i].offset);
+                bs->ringloop->submit();
                bufs[i].state = INIT_META_WRITING;
                submitted++;
            }
@@ -259,9 +262,11 @@ resume_2:
            next_offset = entries_to_zero[i]/entries_per_block;
            for (j = i; j < entries_to_zero.size() && entries_to_zero[j]/entries_per_block == next_offset; j++) {}
            GET_SQE();
+            last_read_offset = (1+next_offset)*bs->dsk.meta_block_size;
            data->iov = { metadata_buffer, (size_t)bs->dsk.meta_block_size };
            data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
            my_uring_prep_readv(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + (1+next_offset)*bs->dsk.meta_block_size);
+            bs->ringloop->submit();
            submitted++;
 resume_5:
            if (submitted > 0)
@@ -278,6 +283,7 @@ resume_5:
            data->iov = { metadata_buffer, (size_t)bs->dsk.meta_block_size };
            data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
            my_uring_prep_writev(sqe, bs->dsk.meta_fd, &data->iov, 1, bs->dsk.meta_offset + (1+next_offset)*bs->dsk.meta_block_size);
+            bs->ringloop->submit();
            submitted++;
 resume_6:
            if (submitted > 0)
@@ -299,6 +305,7 @@ resume_6:
    {
        GET_SQE();
        my_uring_prep_fsync(sqe, bs->dsk.meta_fd, IORING_FSYNC_DATASYNC);
+        last_read_offset = 0;
        data->iov = { 0 };
        data->callback = [this](ring_data_t *data) { handle_event(data, -1); };
        submitted++;
--- a/src/blockstore_init.h
+++ b/src/blockstore_init.h
@@ -23,6 +23,7 @@ class blockstore_init_meta
    struct ring_data_t *data;
    uint64_t md_offset = 0;
    uint64_t next_offset = 0;
+    uint64_t last_read_offset = 0;
    uint64_t entries_loaded = 0;
    unsigned entries_per_block = 0;
    int i = 0, j = 0;
--- a/src/blockstore_stable.cpp
+++ b/src/blockstore_stable.cpp
@@ -487,18 +487,24 @@ void blockstore_impl_t::mark_stable(obj_ver_id v, bool forget_dirty)
                    }
                    if (!exists)
                    {
-                        inode_space_stats[dirty_it->first.oid.inode] += dsk.data_block_size;
+                        uint64_t space_id = dirty_it->first.oid.inode;
+                        if (no_inode_stats[dirty_it->first.oid.inode >> (64-POOL_ID_BITS)])
+                            space_id = space_id & ~(((uint64_t)1 << (64-POOL_ID_BITS)) - 1);
+                        inode_space_stats[space_id] += dsk.data_block_size;
                        used_blocks++;
                    }
                    big_to_flush++;
                }
                else if (IS_DELETE(dirty_it->second.state))
                {
-                    auto & sp = inode_space_stats[dirty_it->first.oid.inode];
+                    uint64_t space_id = dirty_it->first.oid.inode;
+                    if (no_inode_stats[dirty_it->first.oid.inode >> (64-POOL_ID_BITS)])
+                        space_id = space_id & ~(((uint64_t)1 << (64-POOL_ID_BITS)) - 1);
+                    auto & sp = inode_space_stats[space_id];
                    if (sp > dsk.data_block_size)
                        sp -= dsk.data_block_size;
                    else
-                        inode_space_stats.erase(dirty_it->first.oid.inode);
+                        inode_space_stats.erase(space_id);
                    used_blocks--;
                    big_to_flush++;
                }
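Note on the hunk above: when a pool is flagged in `no_inode_stats`, the statistics key is the inode number with everything below the pool ID bits cleared, which is the same value as `pool_id << (64-POOL_ID_BITS)`. The sketch below isolates that key selection. The helper name `stats_key` and the value of `POOL_ID_BITS` are assumptions for illustration. It also uses `find()` rather than `operator[]`, because `std::map::operator[]` inserts a zero entry for every pool it is asked about; that is a choice made for this sketch, not a claim about the original code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Assumed value for illustration
static const int POOL_ID_BITS = 16;

// Hypothetical helper: choose the key under which an object's space is accounted
uint64_t stats_key(const std::map<uint64_t, int> & no_inode_stats, uint64_t inode)
{
    // Mask covering the per-pool inode number, i.e. all bits below the pool ID
    const uint64_t inode_mask = (((uint64_t)1 << (64-POOL_ID_BITS)) - 1);
    auto it = no_inode_stats.find(inode >> (64-POOL_ID_BITS));
    if (it != no_inode_stats.end() && it->second)
        return inode & ~inode_mask; // per-pool key: only the pool ID bits remain
    return inode; // per-inode key
}
```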
--- a/src/cli.cpp
+++ b/src/cli.cpp
@@ -116,6 +116,55 @@ static const char* help_text =
    "  With --dry-run only checks if deletion is possible without data loss and\n"
    "  redundancy degradation.\n"
    "\n"
+    "vitastor-cli create-pool|pool-create <name> (-s <pg_size>|--ec <N>+<K>) -n <pg_count> [OPTIONS]\n"
+    "  Create a pool. Required parameters:\n"
+    "    -s|--pg_size R   Number of replicas for replicated pools\n"
+    "    --ec N+K         Number of data (N) and parity (K) chunks for erasure-coded pools\n"
+    "    -n|--pg_count N  PG count for the new pool (start with 10*<OSD count>/pg_size rounded to a power of 2)\n"
+    "  Optional parameters:\n"
+    "    --pg_minsize <number>         R or N+K minus number of failures to tolerate without downtime\n"
+    "    --failure_domain host         Failure domain: host, osd or a level from placement_levels. Default: host\n"
+    "    --root_node <node>            Put pool only on child OSDs of this placement tree node\n"
+    "    --osd_tags <tag>[,<tag>]...   Put pool only on OSDs tagged with all specified tags\n"
+    "    --block_size 128k             Put pool only on OSDs with this data block size\n"
+    "    --bitmap_granularity 4k       Put pool only on OSDs with this logical sector size\n"
+    "    --immediate_commit none       Put pool only on OSDs with this or larger immediate_commit (none < small < all)\n"
+    "    --primary_affinity_tags tags  Prefer to put primary copies on OSDs with all specified tags\n"
+    "    --scrub_interval <time>       Enable regular scrubbing for this pool. Format: number + unit s/m/h/d/M/y\n"
+    "    --used_for_fs <name>          Mark pool as used for VitastorFS with metadata in image <name>\n"
+    "    --pg_stripe_size <number>     Increase object grouping stripe\n"
+    "    --max_osd_combinations 10000  Maximum number of random combinations for LP solver input\n"
+    "    --wait                        Wait for the new pool to come online\n"
+    "    -f|--force                    Do not check that cluster has enough OSDs to create the pool\n"
+    "  Examples:\n"
+    "    vitastor-cli create-pool test_x4 -s 4 -n 32\n"
+    "    vitastor-cli create-pool test_ec42 --ec 4+2 -n 32\n"
+    "\n"
+    "vitastor-cli modify-pool|pool-modify <id|name> [--name <new_name>] [PARAMETERS...]\n"
+    "  Modify an existing pool. Modifiable parameters:\n"
+    "    [-s|--pg_size <number>] [--pg_minsize <number>] [-n|--pg_count <count>]\n"
+    "    [--failure_domain <level>] [--root_node <node>] [--osd_tags <tags>] [--used_for_fs <name>]\n"
+    "    [--max_osd_combinations <number>] [--primary_affinity_tags <tags>] [--scrub_interval <time>]\n"
+    "  Non-modifiable parameters (changing them WILL lead to data loss):\n"
+    "    [--block_size <size>] [--bitmap_granularity <size>]\n"
+    "    [--immediate_commit <all|small|none>] [--pg_stripe_size <size>]\n"
+    "  These, however, can still be modified with -f|--force.\n"
+    "  See create-pool for parameter descriptions.\n"
+    "  Examples:\n"
+    "    vitastor-cli modify-pool pool_A --name pool_B\n"
+    "    vitastor-cli modify-pool 2 --pg_size 4 -n 128\n"
+    "\n"
+    "vitastor-cli rm-pool|pool-rm [--force] <id|name>\n"
+    "  Remove a pool. Refuses to remove pools with images without --force.\n"
+    "\n"
+    "vitastor-cli ls-pools|pool-ls|ls-pool|pools [-l] [--detail] [--sort FIELD] [-r] [-n N] [--stats] [<glob> ...]\n"
+    "  List pools (only matching <glob> patterns if passed).\n"
+    "  -l|--long       Also report I/O statistics\n"
+    "  --detail        Use list format (not table), show all details\n"
+    "  --sort FIELD    Sort by specified field (see fields in --json output)\n"
+    "  -r|--reverse    Sort in descending order\n"
+    "  -n|--count N    Only list first N items\n"
+    "\n"
    "Use vitastor-cli --help <command> for command details or vitastor-cli --help --all for all details.\n"
    "\n"
    "GLOBAL OPTIONS:\n"
@@ -136,6 +185,7 @@ static json11::Json::object parse_args(int narg, const char *args[])
    cfg["progress"] = "1";
    for (int i = 1; i < narg; i++)
    {
+        bool argHasValue = (!(i == narg-1) && (args[i+1][0] != '-'));
        if (args[i][0] == '-' && args[i][1] == 'h' && args[i][2] == 0)
        {
            cfg["help"] = "1";
@@ -146,15 +196,15 @@ static json11::Json::object parse_args(int narg, const char *args[])
        }
        else if (args[i][0] == '-' && args[i][1] == 'n' && args[i][2] == 0)
        {
-            cfg["count"] = args[++i];
+            cfg["count"] = argHasValue ? args[++i] : "";
        }
        else if (args[i][0] == '-' && args[i][1] == 'p' && args[i][2] == 0)
        {
-            cfg["pool"] = args[++i];
+            cfg["pool"] = argHasValue ? args[++i] : "";
        }
        else if (args[i][0] == '-' && args[i][1] == 's' && args[i][2] == 0)
        {
-            cfg["size"] = args[++i];
+            cfg["size"] = argHasValue ? args[++i] : "";
        }
        else if (args[i][0] == '-' && args[i][1] == 'r' && args[i][2] == 0)
        {
@@ -167,9 +217,9 @@ static json11::Json::object parse_args(int narg, const char *args[])
        else if (args[i][0] == '-' && args[i][1] == '-')
        {
            const char *opt = args[i]+2;
-            cfg[opt] = i == narg-1 || !strcmp(opt, "json") ||
+            if (!strcmp(opt, "json") || !strcmp(opt, "wait") ||
                !strcmp(opt, "wait-list") || !strcmp(opt, "wait_list") ||
-                !strcmp(opt, "long") || !strcmp(opt, "del") ||
+                !strcmp(opt, "long") || !strcmp(opt, "detail") || !strcmp(opt, "del") ||
                !strcmp(opt, "no-color") || !strcmp(opt, "no_color") ||
                !strcmp(opt, "readonly") || !strcmp(opt, "readwrite") ||
                !strcmp(opt, "force") || !strcmp(opt, "reverse") ||
@@ -177,8 +227,14 @@ static json11::Json::object parse_args(int narg, const char *args[])
                !strcmp(opt, "down-ok") || !strcmp(opt, "down_ok") ||
                !strcmp(opt, "dry-run") || !strcmp(opt, "dry_run") ||
                !strcmp(opt, "help") || !strcmp(opt, "all") ||
-                (!strcmp(opt, "writers-stopped") || !strcmp(opt, "writers_stopped")) && strcmp("1", args[i+1]) != 0
-                ? "1" : args[++i];
+                !strcmp(opt, "writers-stopped") || !strcmp(opt, "writers_stopped"))
+            {
+                cfg[opt] = "1";
+            }
+            else
+            {
+                cfg[opt] = argHasValue ? args[++i] : "";
+            }
        }
        else
        {
@@ -221,7 +277,7 @@ static int run(cli_tool_t *p, json11::Json::object cfg)
    else if (cmd[0] == "df")
    {
        // Show pool space stats
-        action_cb = p->start_df(cfg);
+        action_cb = p->start_pool_ls(cfg);
    }
    else if (cmd[0] == "ls")
    {
@@ -328,6 +384,44 @@ static int run(cli_tool_t *p, json11::Json::object cfg)
        // Allocate a new OSD number
        action_cb = p->start_alloc_osd(cfg);
    }
+    else if (cmd[0] == "create-pool" || cmd[0] == "pool-create")
+    {
+        // Create a new pool
+        if (cmd.size() > 1 && cfg["name"].is_null())
+        {
+            cfg["name"] = cmd[1];
+        }
+        action_cb = p->start_pool_create(cfg);
+    }
+    else if (cmd[0] == "modify-pool" || cmd[0] == "pool-modify")
+    {
+        // Modify existing pool
+        if (cmd.size() > 1)
+        {
+            cfg["old_name"] = cmd[1];
+        }
+        action_cb = p->start_pool_modify(cfg);
+    }
+    else if (cmd[0] == "rm-pool" || cmd[0] == "pool-rm")
+    {
+        // Remove existing pool
+        if (cmd.size() > 1)
+        {
+            cfg["pool"] = cmd[1];
+        }
+        action_cb = p->start_pool_rm(cfg);
+    }
+    else if (cmd[0] == "ls-pool" || cmd[0] == "pool-ls" || cmd[0] == "ls-pools" || cmd[0] == "pools")
+    {
+        // Show pool list
+        cfg["show_recovery"] = 1;
+        if (cmd.size() > 1)
+        {
+            cmd.erase(cmd.begin(), cmd.begin()+1);
+            cfg["names"] = cmd;
+        }
+        action_cb = p->start_pool_ls(cfg);
+    }
    else
    {
        result = { .err = EINVAL, .text = "unknown command: "+cmd[0].string_value() };
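Note on the `parse_args` changes above: long options are now split into a fixed list of boolean flags, which always get the value "1", and value options, which consume the next argument only if it exists and does not start with `-`. A minimal sketch of that rule is given below. The function name `parse_long_opts`, the use of `std::vector<std::string>` and the shortened flag list are assumptions for illustration; short options and positional arguments are skipped.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper mirroring the long-option rule from the diff
std::map<std::string, std::string> parse_long_opts(const std::vector<std::string> & args)
{
    // Shortened, illustrative subset of the boolean flag list
    static const std::set<std::string> flags = { "json", "wait", "long", "detail", "force" };
    std::map<std::string, std::string> cfg;
    for (size_t i = 0; i < args.size(); i++)
    {
        if (args[i].size() < 3 || args[i].compare(0, 2, "--") != 0)
            continue; // positional arguments and short options are out of scope here
        // A value is present only if there is a next argument that is not an option
        bool has_value = i+1 < args.size() && args[i+1][0] != '-';
        std::string opt = args[i].substr(2);
        if (flags.count(opt))
            cfg[opt] = "1";
        else
            cfg[opt] = has_value ? args[++i] : "";
    }
    return cfg;
}
```

One consequence of this rule is that a value beginning with `-`, such as a negative number, is treated as missing and the option is stored as an empty string.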
--- a/src/cli.h
+++ b/src/cli.h
@@ -46,6 +46,7 @@ public:
    json11::Json etcd_result;

    void parse_config(json11::Json::object & cfg);
+    json11::Json parse_tags(std::string tags);

    void change_parent(inode_t cur, inode_t new_parent, cli_result_t *result);
    inode_config_t* get_inode_cfg(const std::string & name);
@@ -58,7 +59,6 @@ public:
    std::function<bool(cli_result_t &)> start_status(json11::Json);
    std::function<bool(cli_result_t &)> start_describe(json11::Json);
    std::function<bool(cli_result_t &)> start_fix(json11::Json);
-    std::function<bool(cli_result_t &)> start_df(json11::Json);
    std::function<bool(cli_result_t &)> start_ls(json11::Json);
    std::function<bool(cli_result_t &)> start_create(json11::Json);
    std::function<bool(cli_result_t &)> start_modify(json11::Json);
@@ -68,6 +68,10 @@ public:
    std::function<bool(cli_result_t &)> start_rm(json11::Json);
    std::function<bool(cli_result_t &)> start_rm_osd(json11::Json cfg);
    std::function<bool(cli_result_t &)> start_alloc_osd(json11::Json cfg);
+    std::function<bool(cli_result_t &)> start_pool_create(json11::Json);
+    std::function<bool(cli_result_t &)> start_pool_modify(json11::Json);
+    std::function<bool(cli_result_t &)> start_pool_rm(json11::Json);
+    std::function<bool(cli_result_t &)> start_pool_ls(json11::Json);

    // Should be called like loop_and_wait(start_status(), <completion callback>)
    void loop_and_wait(std::function<bool(cli_result_t &)> loop_cb, std::function<void(const cli_result_t &)> complete_cb);
@@ -77,8 +81,13 @@ public:

 std::string print_table(json11::Json items, json11::Json header, bool use_esc);

+size_t print_detail_title_len(json11::Json item, std::vector<std::pair<std::string, std::string>> names, size_t prev_len);
+std::string print_detail(json11::Json item, std::vector<std::pair<std::string, std::string>> names, size_t title_len, bool use_esc);
+
 std::string format_lat(uint64_t lat);

 std::string format_q(double depth);

 bool stupid_glob(const std::string str, const std::string glob);
+
+std::string implode(const std::string & sep, json11::Json array);
--- a/src/cli_common.cpp
+++ b/src/cli_common.cpp
@@ -153,6 +153,7 @@ void cli_tool_t::loop_and_wait(std::function<bool(cli_result_t &)> loop_cb, std:
            ringloop->unregister_consumer(&looper->consumer);
            looper->loop_cb = NULL;
            looper->complete_cb(looper->result);
+            ringloop->submit();
            delete looper;
            return;
        }
--- a/src/cli_create.cpp
+++ b/src/cli_create.cpp
@@ -27,6 +27,7 @@ struct image_creator_t
    std::string image_name, new_snap, new_parent;
    json11::Json new_meta;
    uint64_t size;
+    bool force = false;
    bool force_size = false;

    pool_id_t old_pool_id = 0;
@@ -45,6 +46,7 @@ struct image_creator_t

    void loop()
    {
+        auto & pools = parent->cli->st_cli.pool_config;
        if (state >= 1)
            goto resume_1;
        if (image_name == "")
@@ -62,7 +64,6 @@ struct image_creator_t
        }
        if (new_pool_id)
        {
-            auto & pools = parent->cli->st_cli.pool_config;
            if (pools.find(new_pool_id) == pools.end())
            {
                result = (cli_result_t){ .err = ENOENT, .text = "Pool "+std::to_string(new_pool_id)+" does not exist" };
@@ -72,7 +73,7 @@ struct image_creator_t
        }
        else if (new_pool_name != "")
        {
-            for (auto & ic: parent->cli->st_cli.pool_config)
+            for (auto & ic: pools)
            {
                if (ic.second.name == new_pool_name)
                {
@@ -87,10 +88,20 @@ struct image_creator_t
                return;
            }
        }
-        else if (parent->cli->st_cli.pool_config.size() == 1)
+        else if (pools.size() == 1)
        {
-            auto it = parent->cli->st_cli.pool_config.begin();
-            new_pool_id = it->first;
+            new_pool_id = pools.begin()->first;
+        }
+        if (new_pool_id && !pools.at(new_pool_id).used_for_fs.empty() && !force)
+        {
+            result = (cli_result_t){
+                .err = EINVAL,
+                .text = "Pool "+pools.at(new_pool_id).name+
+                    " is used for VitastorFS "+pools.at(new_pool_id).used_for_fs+
+                    ". Use --force if you really know what you are doing",
+            };
+            state = 100;
+            return;
        }
        state = 1;
    resume_1:
@@ -183,7 +194,16 @@ resume_3:
        // Save into inode_config for library users to be able to take it from there immediately
        new_cfg.mod_revision = parent->etcd_result["responses"][0]["response_put"]["header"]["revision"].uint64_value();
        parent->cli->st_cli.insert_inode_config(new_cfg);
-        result = (cli_result_t){ .err = 0, .text = "Image "+image_name+" created" };
+        result = (cli_result_t){
+            .err = 0,
+            .text = "Image "+image_name+" created",
+            .data = json11::Json::object {
+                { "name", image_name },
+                { "pool", new_pool_name },
+                { "parent", new_parent },
+                { "size", size },
+            }
+        };
        state = 100;
    }

@@ -251,7 +271,16 @@ resume_4:
        // Save into inode_config for library users to be able to take it from there immediately
        new_cfg.mod_revision = parent->etcd_result["responses"][0]["response_put"]["header"]["revision"].uint64_value();
        parent->cli->st_cli.insert_inode_config(new_cfg);
-        result = (cli_result_t){ .err = 0, .text = "Snapshot "+image_name+"@"+new_snap+" created" };
+        result = (cli_result_t){
+            .err = 0,
+            .text = "Snapshot "+image_name+"@"+new_snap+" created",
+            .data = json11::Json::object {
+                { "name", image_name+"@"+new_snap },
+                { "pool", (uint64_t)new_pool_id },
+                { "parent", new_parent },
+                { "size", size },
+            }
+        };
        state = 100;
    }

@@ -514,6 +543,7 @@ std::function<bool(cli_result_t &)> cli_tool_t::start_create(json11::Json cfg)
    image_creator->image_name = cfg["image"].string_value();
    image_creator->new_pool_id = cfg["pool"].uint64_value();
    image_creator->new_pool_name = cfg["pool"].string_value();
+    image_creator->force = cfg["force"].bool_value();
    image_creator->force_size = cfg["force_size"].bool_value();
    if (cfg["image_meta"].is_object())
    {
--- a/src/cli_df.cpp
+++ b/src/cli_df.cpp
@@ -1,243 +0,0 @@
-// Copyright (c) Vitaliy Filippov, 2019+
-// License: VNPL-1.1 (see README.md for details)
-
-#include "cli.h"
-#include "cluster_client.h"
-#include "str_util.h"
-
-// List pools with space statistics
-struct pool_lister_t
-{
-    cli_tool_t *parent;
-
-    int state = 0;
-    json11::Json space_info;
-    cli_result_t result;
-    std::map<pool_id_t, json11::Json::object> pool_stats;
-
-    bool is_done()
-    {
-        return state == 100;
-    }
-
-    void get_stats()
-    {
-        if (state == 1)
-            goto resume_1;
-        // Space statistics - pool/stats/<pool>
-        parent->etcd_txn(json11::Json::object {
-            { "success", json11::Json::array {
-                json11::Json::object {
-                    { "request_range", json11::Json::object {
-                        { "key", base64_encode(
-                            parent->cli->st_cli.etcd_prefix+"/pool/stats/"
-                        ) },
-                        { "range_end", base64_encode(
-                            parent->cli->st_cli.etcd_prefix+"/pool/stats0"
-                        ) },
-                    } },
-                },
-                json11::Json::object {
-                    { "request_range", json11::Json::object {
-                        { "key", base64_encode(
-                            parent->cli->st_cli.etcd_prefix+"/osd/stats/"
-                        ) },
-                        { "range_end", base64_encode(
-                            parent->cli->st_cli.etcd_prefix+"/osd/stats0"
-                        ) },
-                    } },
-                },
-            } },
-        });
-        state = 1;
-resume_1:
-        if (parent->waiting > 0)
-            return;
-        if (parent->etcd_err.err)
-        {
-            result = parent->etcd_err;
-            state = 100;
-            return;
-        }
-        space_info = parent->etcd_result;
-        std::map<pool_id_t, uint64_t> osd_free;
-        for (auto & kv_item: space_info["responses"][0]["response_range"]["kvs"].array_items())
-        {
-            auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
-            // pool ID
-            pool_id_t pool_id;
-            char null_byte = 0;
-            int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/pool/stats/%u%c", &pool_id, &null_byte);
-            if (scanned != 1 || !pool_id || pool_id >= POOL_ID_MAX)
-            {
-                fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
-                continue;
-            }
-            // pool/stats/<N>
-            pool_stats[pool_id] = kv.value.object_items();
-        }
-        for (auto & kv_item: space_info["responses"][1]["response_range"]["kvs"].array_items())
-        {
-            auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
-            // osd ID
-            osd_num_t osd_num;
-            char null_byte = 0;
-            int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/osd/stats/%ju%c", &osd_num, &null_byte);
-            if (scanned != 1 || !osd_num || osd_num >= POOL_ID_MAX)
-            {
-                fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
-                continue;
-            }
-            // osd/stats/<N>::free
-            osd_free[osd_num] = kv.value["free"].uint64_value();
-        }
-        // Calculate max_avail for each pool
-        for (auto & pp: parent->cli->st_cli.pool_config)
-        {
-            auto & pool_cfg = pp.second;
-            uint64_t pool_avail = UINT64_MAX;
-            std::map<osd_num_t, uint64_t> pg_per_osd;
-            for (auto & pgp: pool_cfg.pg_config)
-            {
-                for (auto pg_osd: pgp.second.target_set)
-                {
-                    if (pg_osd != 0)
-                    {
-                        pg_per_osd[pg_osd]++;
-                    }
-                }
-            }
-            for (auto pg_per_pair: pg_per_osd)
-            {
-                uint64_t pg_free = osd_free[pg_per_pair.first] * pool_cfg.real_pg_count / pg_per_pair.second;
-                if (pool_avail > pg_free)
-                {
-                    pool_avail = pg_free;
-                }
-            }
-            if (pool_avail == UINT64_MAX)
-            {
-                pool_avail = 0;
-            }
-            if (pool_cfg.scheme != POOL_SCHEME_REPLICATED)
-            {
-                pool_avail *= (pool_cfg.pg_size - pool_cfg.parity_chunks);
-            }
-            pool_stats[pool_cfg.id] = json11::Json::object {
-                { "id", (uint64_t)pool_cfg.id },
-                { "name", pool_cfg.name },
-                { "pg_count", pool_cfg.pg_count },
-                { "real_pg_count", pool_cfg.real_pg_count },
-                { "scheme", pool_cfg.scheme == POOL_SCHEME_REPLICATED ? "replicated" : "ec" },
-                { "scheme_name", pool_cfg.scheme == POOL_SCHEME_REPLICATED
-                    ? std::to_string(pool_cfg.pg_size)+"/"+std::to_string(pool_cfg.pg_minsize)
-                    : "EC "+std::to_string(pool_cfg.pg_size-pool_cfg.parity_chunks)+"+"+std::to_string(pool_cfg.parity_chunks) },
-                { "used_raw", (uint64_t)(pool_stats[pool_cfg.id]["used_raw_tb"].number_value() * ((uint64_t)1<<40)) },
-                { "total_raw", (uint64_t)(pool_stats[pool_cfg.id]["total_raw_tb"].number_value() * ((uint64_t)1<<40)) },
-                { "max_available", pool_avail },
-                { "raw_to_usable", pool_stats[pool_cfg.id]["raw_to_usable"].number_value() },
-                { "space_efficiency", pool_stats[pool_cfg.id]["space_efficiency"].number_value() },
-                { "pg_real_size", pool_stats[pool_cfg.id]["pg_real_size"].uint64_value() },
-                { "failure_domain", pool_cfg.failure_domain },
-            };
-        }
-    }
-
-    json11::Json::array to_list()
-    {
-        json11::Json::array list;
-        for (auto & kv: pool_stats)
-        {
-            list.push_back(kv.second);
-        }
-        return list;
-    }
-
-    void loop()
-    {
-        get_stats();
-        if (parent->waiting > 0)
-            return;
-        if (state == 100)
-            return;
-        if (parent->json_output)
-        {
-            // JSON output
-            result.data = to_list();
-            state = 100;
-            return;
-        }
-        // Table output: name, scheme_name, pg_count, total, used, max_avail, used%, efficiency
-        json11::Json::array cols;
-        cols.push_back(json11::Json::object{
-            { "key", "name" },
-            { "title", "NAME" },
-        });
-        cols.push_back(json11::Json::object{
-            { "key", "scheme_name" },
-            { "title", "SCHEME" },
-        });
-        cols.push_back(json11::Json::object{
-            { "key", "pg_count_fmt" },
-            { "title", "PGS" },
-        });
-        cols.push_back(json11::Json::object{
-            { "key", "total_fmt" },
-            { "title", "TOTAL" },
-        });
-        cols.push_back(json11::Json::object{
-            { "key", "used_fmt" },
-            { "title", "USED" },
-        });
-        cols.push_back(json11::Json::object{
-            { "key", "max_avail_fmt" },
-            { "title", "AVAILABLE" },
-        });
-        cols.push_back(json11::Json::object{
-            { "key", "used_pct" },
-            { "title", "USED%" },
-        });
-        cols.push_back(json11::Json::object{
-            { "key", "eff_fmt" },
-            { "title", "EFFICIENCY" },
-        });
-        json11::Json::array list;
-        for (auto & kv: pool_stats)
-        {
-            double raw_to = kv.second["raw_to_usable"].number_value();
-            if (raw_to < 0.000001 && raw_to > -0.000001)
-                raw_to = 1;
-            kv.second["pg_count_fmt"] = kv.second["real_pg_count"] == kv.second["pg_count"]
-                ? kv.second["real_pg_count"].as_string()
-                : kv.second["real_pg_count"].as_string()+"->"+kv.second["pg_count"].as_string();
-            kv.second["total_fmt"] = format_size(kv.second["total_raw"].uint64_value() / raw_to);
-            kv.second["used_fmt"] = format_size(kv.second["used_raw"].uint64_value() / raw_to);
-            kv.second["max_avail_fmt"] = format_size(kv.second["max_available"].uint64_value());
-            kv.second["used_pct"] = format_q(kv.second["total_raw"].uint64_value()
-                ? (100 - 100*kv.second["max_available"].uint64_value() *
-                    kv.second["raw_to_usable"].number_value() / kv.second["total_raw"].uint64_value())
-                : 100)+"%";
-            kv.second["eff_fmt"] = format_q(kv.second["space_efficiency"].number_value()*100)+"%";
-        }
-        result.data = to_list();
-        result.text = print_table(result.data, cols, parent->color);
-        state = 100;
-    }
-};
-
-std::function<bool(cli_result_t &)> cli_tool_t::start_df(json11::Json cfg)
-{
-    auto lister = new pool_lister_t();
-    lister->parent = this;
-    return [lister](cli_result_t & result)
-    {
-        lister->loop();
-        if (lister->is_done())
-        {
-            result = lister->result;
-            delete lister;
-            return true;
-        }
-        return false;
-    };
-}
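Note on the removed file above: its `max_available` estimate counted how many PGs of the pool each OSD carries, took the minimum of `free * real_pg_count / pgs_on_osd` over those OSDs, and multiplied the result by the number of data chunks for erasure-coded pools. The sketch below restates that arithmetic in isolation. The function name `estimate_pool_avail` and its flat parameter list are assumptions for illustration, and where the calculation lives after this commit is not shown in this part of the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical standalone version of the estimate from the removed cli_df.cpp
uint64_t estimate_pool_avail(const std::map<uint64_t, uint64_t> & osd_free,
    const std::vector<std::vector<uint64_t>> & pg_target_sets,
    uint64_t real_pg_count, bool replicated, uint64_t pg_size, uint64_t parity_chunks)
{
    assert(replicated || pg_size > parity_chunks);
    // Count how many PGs of this pool are placed on each OSD (0 means an empty slot)
    std::map<uint64_t, uint64_t> pg_per_osd;
    for (auto & target_set: pg_target_sets)
        for (auto osd_num: target_set)
            if (osd_num != 0)
                pg_per_osd[osd_num]++;
    // The fullest OSD relative to its PG share limits the whole pool
    uint64_t pool_avail = UINT64_MAX;
    for (auto & pp: pg_per_osd)
    {
        auto free_it = osd_free.find(pp.first);
        uint64_t free_bytes = free_it != osd_free.end() ? free_it->second : 0;
        uint64_t pg_free = free_bytes * real_pg_count / pp.second;
        if (pool_avail > pg_free)
            pool_avail = pg_free;
    }
    if (pool_avail == UINT64_MAX)
        pool_avail = 0;
    // Erasure-coded pools store user data on (pg_size - parity_chunks) chunks
    if (!replicated)
        pool_avail *= (pg_size - parity_chunks);
    return pool_avail;
}
```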
--- a/src/cli_merge.cpp
+++ b/src/cli_merge.cpp
@@ -350,7 +350,11 @@ struct snap_merger_t
                : "Overwriting blocks: %ju/%ju\n", to_process, to_process);
        }
        // Done
-        result = (cli_result_t){ .text = "Done, layers from "+from_name+" to "+to_name+" merged into "+target_name };
+        result = (cli_result_t){ .text = "Done, layers from "+from_name+" to "+to_name+" merged into "+target_name, .data = json11::Json::object {
+            { "from", from_name },
+            { "to", to_name },
+            { "into", target_name },
+        }};
        state = 100;
    resume_100:
        return;
--- a/src/cli_modify.cpp
+++ b/src/cli_modify.cpp
@@ -85,7 +85,10 @@ struct image_changer_t
            (!new_size && !force_size || cfg.size == new_size || cfg.size >= new_size && inc_size) &&
            (new_name == "" || new_name == image_name))
        {
-            result = (cli_result_t){ .text = "No change" };
+            result = (cli_result_t){ .err = 0, .text = "No change", .data = json11::Json::object {
+                { "error_code", 0 },
+                { "error_text", "No change" },
+            }};
            state = 100;
            return;
        }
@@ -222,7 +225,16 @@ resume_2:
            parent->cli->st_cli.inode_by_name.erase(image_name);
        }
        parent->cli->st_cli.insert_inode_config(cfg);
-        result = (cli_result_t){ .err = 0, .text = "Image "+image_name+" modified" };
+        result = (cli_result_t){
+            .err = 0,
+            .text = "Image "+image_name+" modified",
+            .data = json11::Json::object {
+                { "name", image_name },
+                { "inode", INODE_NO_POOL(inode_num) },
+                { "pool", (uint64_t)INODE_POOL(inode_num) },
+                { "size", new_size },
+            }
+        };
        state = 100;
    }
 };
--- a/src/cli_pool_cfg.cpp
+++ b/src/cli_pool_cfg.cpp
@@ -0,0 +1,270 @@
+// Copyright (c) Vitaliy Filippov, 2024
+// License: VNPL-1.1 (see README.md for details)
+
+#include "cli_pool_cfg.h"
+#include "etcd_state_client.h"
+#include "str_util.h"
+
+std::string validate_pool_config(json11::Json::object & new_cfg, json11::Json old_cfg,
+    uint64_t global_block_size, uint64_t global_bitmap_granularity, bool force)
+{
+    // short option names
+    if (new_cfg.find("count") != new_cfg.end())
+    {
+        new_cfg["pg_count"] = new_cfg["count"];
+        new_cfg.erase("count");
+    }
+    if (new_cfg.find("size") != new_cfg.end())
+    {
+        new_cfg["pg_size"] = new_cfg["size"];
+        new_cfg.erase("size");
+    }
+
+    // --ec shortcut
+    if (new_cfg.find("ec") != new_cfg.end())
+    {
+        if (new_cfg.find("scheme") != new_cfg.end() ||
+            new_cfg.find("pg_size") != new_cfg.end() ||
+            new_cfg.find("parity_chunks") != new_cfg.end())
+        {
+            return "--ec can't be used with --pg_size, --parity_chunks or --scheme";
+        }
+        // pg_size = N+K
+        // parity_chunks = K
+        uint64_t data_chunks = 0, parity_chunks = 0;
+        char null_byte = 0;
+        int ret = sscanf(new_cfg["ec"].string_value().c_str(), "%ju+%ju%c", &data_chunks, &parity_chunks, &null_byte);
+        if (ret != 2 || !data_chunks || !parity_chunks)
+        {
+            return "--ec should be <N>+<K> format (<N>, <K> - numbers)";
+        }
+        new_cfg.erase("ec");
+        new_cfg["scheme"] = "ec";
+        new_cfg["pg_size"] = data_chunks+parity_chunks;
+        new_cfg["parity_chunks"] = parity_chunks;
+    }
+
+    if (old_cfg.is_null() && new_cfg["scheme"].string_value() == "")
+    {
+        // Default scheme
+        new_cfg["scheme"] = "replicated";
+    }
+    if (new_cfg.find("pg_minsize") == new_cfg.end() && (old_cfg.is_null() || new_cfg.find("pg_size") != new_cfg.end()))
+    {
+        // Default pg_minsize
+        if (new_cfg["scheme"] == "replicated")
+        {
+            // pg_minsize = (pg_size > 2) ? 2 : 1
+            new_cfg["pg_minsize"] = new_cfg["pg_size"].uint64_value() > 2 ? 2 : 1;
+        }
+        else // ec or xor
+        {
+            // pg_minsize = (K > 1) ? N + 1 : N
+            new_cfg["pg_minsize"] = new_cfg["pg_size"].uint64_value() - new_cfg["parity_chunks"].uint64_value() +
+                (new_cfg["parity_chunks"].uint64_value() > 1 ? 1 : 0);
+        }
+    }
+
+    // Check integer values and unknown keys
+    for (auto kv_it = new_cfg.begin(); kv_it != new_cfg.end(); )
+    {
+        auto & key = kv_it->first;
+        auto & value = kv_it->second;
+        if (key == "pg_size" || key == "parity_chunks" || key == "pg_minsize" ||
+            key == "pg_count" || key == "max_osd_combinations" || key == "block_size" ||
+            key == "bitmap_granularity" || key == "pg_stripe_size")
+        {
+            if (value.is_number() && value.uint64_value() != value.number_value() ||
+                value.is_string() && !value.uint64_value() && value.string_value() != "0")
+            {
+                return key+" must be a non-negative integer";
+            }
+            value = value.uint64_value();
+        }
+        else if (key == "name" || key == "scheme" || key == "immediate_commit" ||
+            key == "failure_domain" || key == "root_node" || key == "scrub_interval" || key == "used_for_fs")
+        {
+            // OK
+        }
+        else if (key == "osd_tags" || key == "primary_affinity_tags")
+        {
+            if (value.is_string())
+            {
+                value = explode(",", value.string_value(), true);
+            }
+        }
+        else
+        {
+            // Unknown parameter
+            new_cfg.erase(kv_it++);
+            continue;
+        }
+        kv_it++;
+    }
+
+    // Merge with the old config
+    if (!old_cfg.is_null())
+    {
+        for (auto & kv: old_cfg.object_items())
+        {
+            if (new_cfg.find(kv.first) == new_cfg.end())
+            {
+                new_cfg[kv.first] = kv.second;
+            }
+        }
+    }
+
+    // Check after merging
+    if (new_cfg["scheme"] != "ec")
+    {
+        new_cfg.erase("parity_chunks");
+    }
+    if (new_cfg.find("used_for_fs") != new_cfg.end() && new_cfg["used_for_fs"].string_value() == "")
+    {
+        new_cfg.erase("used_for_fs");
+    }
+
+    // Prevent autovivification of object keys. From here on we don't modify the config, we just check it
+    json11::Json cfg = new_cfg;
+
+    // Validate changes
+    if (!old_cfg.is_null() && !force)
+    {
+        if (old_cfg["scheme"] != cfg["scheme"])
+        {
+            return "Changing scheme for an existing pool will lead to data loss. Use --force to proceed";
+        }
+        if (etcd_state_client_t::parse_scheme(old_cfg["scheme"].string_value()) == POOL_SCHEME_EC)
+        {
+            uint64_t old_data_chunks = old_cfg["pg_size"].uint64_value() - old_cfg["parity_chunks"].uint64_value();
+            uint64_t new_data_chunks = cfg["pg_size"].uint64_value() - cfg["parity_chunks"].uint64_value();
+            if (old_data_chunks != new_data_chunks)
+            {
+                return "Changing EC data chunk count for an existing pool will lead to data loss. Use --force to proceed";
+            }
+        }
+        if (old_cfg["block_size"] != cfg["block_size"] ||
+            old_cfg["bitmap_granularity"] != cfg["bitmap_granularity"] ||
+            old_cfg["immediate_commit"] != cfg["immediate_commit"])
+        {
+            return "Changing block_size, bitmap_granularity or immediate_commit"
+                " for an existing pool will lead to incomplete PGs. Use --force to proceed";
+        }
+        if (old_cfg["pg_stripe_size"] != cfg["pg_stripe_size"])
+        {
+            return "Changing pg_stripe_size for an existing pool will lead to data loss. Use --force to proceed";
+        }
+    }
+
+    // Validate values
+    if (cfg["name"].string_value() == "")
+    {
+        return "Non-empty pool name is required";
+    }
+
+    // scheme
+    auto scheme = etcd_state_client_t::parse_scheme(cfg["scheme"].string_value());
+    if (!scheme)
+    {
+        return "Scheme must be one of \"replicated\", \"ec\" or \"xor\"";
+    }
+
+    // pg_size
+    auto pg_size = cfg["pg_size"].uint64_value();
+    if (!pg_size)
+    {
+        return "Non-zero PG size is required";
+    }
+    if (scheme != POOL_SCHEME_REPLICATED && pg_size < 3)
+    {
+        return "PG size can't be smaller than 3 for EC/XOR pools";
+    }
+    if (pg_size > 256)
+    {
+        return "PG size can't be greater than 256";
+    }
+
+    // parity_chunks
+    uint64_t parity_chunks = 1;
+    if (scheme == POOL_SCHEME_EC)
+    {
+        parity_chunks = cfg["parity_chunks"].uint64_value();
+        if (!parity_chunks)
+        {
+            return "Non-zero parity_chunks is required";
+        }
+        if (parity_chunks > pg_size-2)
+        {
+            return "parity_chunks can't be greater than "+std::to_string(pg_size-2)+" (PG size - 2)";
+        }
+    }
+
+    // pg_minsize
+    auto pg_minsize = cfg["pg_minsize"].uint64_value();
+    if (!pg_minsize)
+    {
+        return "Non-zero pg_minsize is required";
+    }
+    else if (pg_minsize > pg_size)
+    {
+        return "pg_minsize can't be greater than "+std::to_string(pg_size)+" (PG size)";
+    }
+    else if (scheme != POOL_SCHEME_REPLICATED && pg_minsize < pg_size-parity_chunks)
+    {
+        return "pg_minsize can't be smaller than "+std::to_string(pg_size-parity_chunks)+
+            " (pg_size - parity_chunks) for XOR/EC pool";
+    }
+
+    // pg_count
+    if (!cfg["pg_count"].uint64_value())
+    {
+        return "Non-zero pg_count is required";
+    }
+
+    // max_osd_combinations
+    if (!cfg["max_osd_combinations"].is_null() && cfg["max_osd_combinations"].uint64_value() < 100)
+    {
+        return "max_osd_combinations must be at least 100, but it is "+cfg["max_osd_combinations"].as_string();
+    }
+
+    // block_size
+    auto block_size = cfg["block_size"].uint64_value();
+    if (!cfg["block_size"].is_null() && ((block_size & (block_size-1)) ||
+        block_size < MIN_DATA_BLOCK_SIZE || block_size > MAX_DATA_BLOCK_SIZE))
+    {
+        return "block_size must be a power of two between "+std::to_string(MIN_DATA_BLOCK_SIZE)+
+            " and "+std::to_string(MAX_DATA_BLOCK_SIZE)+", but it is "+std::to_string(block_size);
+    }
+    block_size = (block_size ? block_size : global_block_size);
+
+    // bitmap_granularity
+    auto bitmap_granularity = cfg["bitmap_granularity"].uint64_value();
+    if (!cfg["bitmap_granularity"].is_null() && (!bitmap_granularity || (bitmap_granularity % 512)))
+    {
+        return "bitmap_granularity must be a multiple of 512, but it is "+std::to_string(bitmap_granularity);
+    }
+    bitmap_granularity = (bitmap_granularity ? bitmap_granularity : global_bitmap_granularity);
+    if (block_size % bitmap_granularity)
+    {
+        return "bitmap_granularity must divide data block size ("+std::to_string(block_size)+"), but it is "+std::to_string(bitmap_granularity);
+    }
+
+    // immediate_commit
+    if (!cfg["immediate_commit"].is_null() && !etcd_state_client_t::parse_immediate_commit(cfg["immediate_commit"].string_value()))
+    {
+        return "immediate_commit must be one of \"all\", \"small\", or \"none\", but it is "+cfg["immediate_commit"].as_string();
+    }
+
+    // scrub_interval
+    if (!cfg["scrub_interval"].is_null())
+    {
+        bool ok;
+        parse_time(cfg["scrub_interval"].string_value(), &ok);
+        if (!ok)
+        {
+            return "scrub_interval must be a time interval (number + unit s/m/h/d/M/y), but it is "+cfg["scrub_interval"].as_string();
+        }
+    }
+
+    return "";
+}
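The `--ec` shortcut and the default `pg_minsize` rule above can be tried in isolation. The following is a minimal sketch, not the code from this commit: `pool_layout_t`, `parse_ec_shortcut` and `default_pg_minsize` are hypothetical names, and a plain struct stands in for the json11 pool config. It only mirrors the `<N>+<K>` parsing and the two default formulas from the comments in `validate_pool_config`.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical plain struct standing in for the JSON pool config
struct pool_layout_t
{
    std::string scheme;
    uint64_t pg_size = 0;
    uint64_t parity_chunks = 0;
};

// Parse "<N>+<K>" into scheme/pg_size/parity_chunks.
// Returns false if the string has another format or N or K is zero.
bool parse_ec_shortcut(const std::string & ec, pool_layout_t & out)
{
    uint64_t data_chunks = 0, parity_chunks = 0;
    char trailing = 0;
    // The extra %c catches trailing garbage: a clean "N+K" converts exactly 2 items
    int ret = std::sscanf(ec.c_str(), "%" SCNu64 "+%" SCNu64 "%c", &data_chunks, &parity_chunks, &trailing);
    if (ret != 2 || !data_chunks || !parity_chunks)
        return false;
    out.scheme = "ec";
    out.pg_size = data_chunks + parity_chunks; // pg_size = N+K
    out.parity_chunks = parity_chunks;         // parity_chunks = K
    return true;
}

// Default pg_minsize as described in the comments of validate_pool_config
uint64_t default_pg_minsize(const pool_layout_t & p)
{
    if (p.scheme == "replicated")
        return p.pg_size > 2 ? 2 : 1;
    // EC: N data chunks, plus one more if there is more than one parity chunk
    return p.pg_size - p.parity_chunks + (p.parity_chunks > 1 ? 1 : 0);
}
```

Under these assumptions, `4+2` yields `pg_size` 6 with a default `pg_minsize` of 5, while `2+1` yields `pg_size` 3 with a default `pg_minsize` of 2.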
--- a/src/cli_pool_cfg.h
+++ b/src/cli_pool_cfg.h
@@ -0,0 +1,10 @@
+// Copyright (c) Vitaliy Filippov, 2024
+// License: VNPL-1.1 (see README.md for details)
+
+#pragma once
+
+#include "json11/json11.hpp"
+#include <stdint.h>
+
+std::string validate_pool_config(json11::Json::object & new_cfg, json11::Json old_cfg,
+    uint64_t global_block_size, uint64_t global_bitmap_granularity, bool force);
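The two `global_*` arguments in this declaration are the fallbacks used when a pool does not set its own `block_size` or `bitmap_granularity`. The sketch below illustrates that fallback together with the power-of-two and divisibility checks. It is an illustration under assumptions: `check_block_params` and the `SKETCH_*` limits are hypothetical (the real `MIN_DATA_BLOCK_SIZE` and `MAX_DATA_BLOCK_SIZE` are defined elsewhere in the source tree), and a value of 0 is used here to mean "not set", whereas the real function distinguishes a missing key from an explicit zero.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed limits, for illustration only
static const uint64_t SKETCH_MIN_BLOCK_SIZE = 4096;
static const uint64_t SKETCH_MAX_BLOCK_SIZE = 128*1024*1024;

// Returns an empty string if the parameters are acceptable, else an error text.
// block_size == 0 or bitmap_granularity == 0 means "not set in the pool config".
std::string check_block_params(uint64_t block_size, uint64_t bitmap_granularity,
    uint64_t global_block_size, uint64_t global_bitmap_granularity)
{
    // x & (x-1) is zero only for powers of two (and for zero)
    if (block_size && ((block_size & (block_size-1)) ||
        block_size < SKETCH_MIN_BLOCK_SIZE || block_size > SKETCH_MAX_BLOCK_SIZE))
    {
        return "block_size must be a power of two within the allowed range";
    }
    if (bitmap_granularity && (bitmap_granularity % 512))
    {
        return "bitmap_granularity must be a multiple of 512";
    }
    // Fall back to the cluster-wide defaults for unset values
    uint64_t eff_block_size = block_size ? block_size : global_block_size;
    uint64_t eff_granularity = bitmap_granularity ? bitmap_granularity : global_bitmap_granularity;
    if (eff_block_size % eff_granularity)
    {
        return "bitmap_granularity must divide data block size";
    }
    return "";
}
```

Note that the last check runs on the effective values, so a pool-level `block_size` can conflict with a global `bitmap_granularity` even if the pool does not set a granularity of its own.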
--- a/src/cli_pool_create.cpp
+++ b/src/cli_pool_create.cpp
@@ -0,0 +1,622 @@
+// Copyright (c) MIND Software LLC, 2023 (info@mindsw.io)
+// I accept Vitastor CLA: see CLA-en.md for details
+// Copyright (c) Vitaliy Filippov, 2024
+// License: VNPL-1.1 (see README.md for details)
+
+#include <ctype.h>
+#include "cli.h"
+#include "cli_pool_cfg.h"
+#include "cluster_client.h"
+#include "epoll_manager.h"
+#include "pg_states.h"
+#include "str_util.h"
+
+struct pool_creator_t
+{
+    cli_tool_t *parent;
+    json11::Json::object cfg;
+
+    bool force = false;
+    bool wait = false;
+
+    int state = 0;
+    cli_result_t result;
+
+    struct {
+        uint32_t retries = 5;
+        uint32_t interval = 0;
+        bool passed = false;
+    } create_check;
+
+    uint64_t new_id = 1;
+    uint64_t new_pools_mod_rev;
+    json11::Json state_node_tree;
+    json11::Json new_pools;
+
+    bool is_done() { return state == 100; }
+
+    void loop()
+    {
+        if (state == 1)
+            goto resume_1;
+        else if (state == 2)
+            goto resume_2;
+        else if (state == 3)
+            goto resume_3;
+        else if (state == 4)
+            goto resume_4;
+        else if (state == 5)
+            goto resume_5;
+        else if (state == 6)
+            goto resume_6;
+        else if (state == 7)
+            goto resume_7;
+        else if (state == 8)
+            goto resume_8;
+
+        // Validate pool parameters
+        result.text = validate_pool_config(cfg, json11::Json(), parent->cli->st_cli.global_block_size,
+            parent->cli->st_cli.global_bitmap_granularity, force);
+        if (result.text != "")
+        {
+            result.err = EINVAL;
+            state = 100;
+            return;
+        }
+        state = 1;
+resume_1:
+        // If not forced, check that we have enough osds for pg_size
+        if (!force)
+        {
+            // Get node_placement configuration from etcd
+            parent->etcd_txn(json11::Json::object {
+                { "success", json11::Json::array {
+                    json11::Json::object {
+                        { "request_range", json11::Json::object {
+                            { "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/node_placement") },
+                        } }
+                    },
+                } },
+            });
+
+            state = 2;
+resume_2:
+            if (parent->waiting > 0)
+                return;
+            if (parent->etcd_err.err)
+            {
+                result = parent->etcd_err;
+                state = 100;
+                return;
+            }
+
+            // Get state_node_tree based on node_placement and osd peer states
+            {
+                auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
+                state_node_tree = get_state_node_tree(kv.value.object_items());
+            }
+
+            // Skip tag checks, if pool has none
+            if (cfg["osd_tags"].array_items().size())
+            {
+                // Get osd configs (for tags) of osds in state_node_tree
+                {
+                    json11::Json::array osd_configs;
+                    for (auto osd_num: state_node_tree["osds"].array_items())
+                    {
+                        osd_configs.push_back(json11::Json::object {
+                            { "request_range", json11::Json::object {
+                                { "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/osd/"+osd_num.as_string()) },
+                            } }
+                        });
+                    }
+                    parent->etcd_txn(json11::Json::object { { "success", osd_configs, }, });
+                }
+
+                state = 3;
+resume_3:
+                if (parent->waiting > 0)
+                    return;
+                if (parent->etcd_err.err)
+                {
+                    result = parent->etcd_err;
+                    state = 100;
+                    return;
+                }
+
+                // Filter out osds from state_node_tree based on pool/osd tags
+                {
+                    std::vector<json11::Json> osd_configs;
+                    for (auto & ocr: parent->etcd_result["responses"].array_items())
+                    {
+                        auto kv = parent->cli->st_cli.parse_etcd_kv(ocr["response_range"]["kvs"][0]);
+                        osd_configs.push_back(kv.value);
+                    }
+                    state_node_tree = filter_state_node_tree_by_tags(state_node_tree, osd_configs);
+                }
+            }
+
+            // Get stats (for block_size, bitmap_granularity, ...) of osds in state_node_tree
+            {
+                json11::Json::array osd_stats;
+
+                for (auto osd_num: state_node_tree["osds"].array_items())
+                {
+                    osd_stats.push_back(json11::Json::object {
+                        { "request_range", json11::Json::object {
+                            { "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/osd/stats/"+osd_num.as_string()) },
+                        } }
+                    });
+                }
+
+                parent->etcd_txn(json11::Json::object { { "success", osd_stats, }, });
+            }
+
+            state = 4;
+resume_4:
+            if (parent->waiting > 0)
+                return;
+            if (parent->etcd_err.err)
+            {
+                result = parent->etcd_err;
+                state = 100;
+                return;
+            }
+
+            // Filter osds from state_node_tree based on pool parameters and osd stats
+            {
+                std::vector<json11::Json> osd_stats;
+                for (auto & ocr: parent->etcd_result["responses"].array_items())
+                {
+                    auto kv = parent->cli->st_cli.parse_etcd_kv(ocr["response_range"]["kvs"][0]);
+                    osd_stats.push_back(kv.value);
+                }
+                state_node_tree = filter_state_node_tree_by_stats(state_node_tree, osd_stats);
+            }
+
+            // Check that pg_size <= max_pg_size
+            {
+                auto failure_domain = cfg["failure_domain"].string_value() == ""
+                    ? "host" : cfg["failure_domain"].string_value();
+                uint64_t max_pg_size = get_max_pg_size(state_node_tree["nodes"].object_items(),
+                    failure_domain, cfg["root_node"].string_value());
+
+                if (cfg["pg_size"].uint64_value() > max_pg_size)
+                {
+                    result = (cli_result_t){
+                        .err = EINVAL,
+                        .text =
+                            "There are "+std::to_string(max_pg_size)+" \""+failure_domain+"\" failure domains with OSDs matching tags and"
+                            " block_size/bitmap_granularity/immediate_commit parameters, but you want to create a"
+                            " pool with "+cfg["pg_size"].as_string()+" OSDs from different failure domains in a PG."
+                            " Change parameters or add --force if you want to create a degraded pool and add OSDs later."
+                    };
+                    state = 100;
+                    return;
+                }
+            }
+        }
+        // Create pool
+        state = 5;
+resume_5:
+        // Get pools from etcd
+        parent->etcd_txn(json11::Json::object {
+            { "success", json11::Json::array {
+                json11::Json::object {
+                    { "request_range", json11::Json::object {
+                        { "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
+                    } }
+                },
+            } },
+        });
+        state = 6;
+resume_6:
+        if (parent->waiting > 0)
+            return;
+        if (parent->etcd_err.err)
+        {
+            result = parent->etcd_err;
+            state = 100;
+            return;
+        }
+        {
+            // Add new pool
+            auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
+            new_pools = create_pool(kv);
+            if (new_pools.is_string())
+            {
+                result = (cli_result_t){ .err = EEXIST, .text = new_pools.string_value() };
+                state = 100;
+                return;
+            }
+            new_pools_mod_rev = kv.mod_revision;
+        }
+        // Update pools in etcd
+        parent->etcd_txn(json11::Json::object {
+            { "compare", json11::Json::array {
+                json11::Json::object {
+                    { "target", "MOD" },
+                    { "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
+                    { "result", "LESS" },
+                    { "mod_revision", new_pools_mod_rev+1 },
+                }
+            } },
+            { "success", json11::Json::array {
+                json11::Json::object {
+                    { "request_put", json11::Json::object {
+                        { "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
+                        { "value", base64_encode(new_pools.dump()) },
+                    } },
+                },
+            } },
+        });
+        state = 7;
+resume_7:
+        if (parent->waiting > 0)
+            return;
+        if (parent->etcd_err.err)
+        {
+            result = parent->etcd_err;
+            state = 100;
+            return;
+        }
+
+        // Perform final create-check
+        create_check.interval = parent->cli->config["mon_change_timeout"].uint64_value();
+        if (!create_check.interval)
+            create_check.interval = 1000;
+
+        state = 8;
+resume_8:
+        if (parent->waiting > 0)
+            return;
+
+        // Unless forced, check that pool was created and is active
+        if (!wait)
+        {
+            create_check.passed = true;
+        }
+        else if (create_check.retries)
+        {
+            create_check.retries--;
+            parent->waiting++;
+            parent->epmgr->tfd->set_timer(create_check.interval, false, [this](int timer_id)
+            {
+                if (parent->cli->st_cli.pool_config.find(new_id) != parent->cli->st_cli.pool_config.end())
+                {
+                    auto & pool_cfg = parent->cli->st_cli.pool_config[new_id];
+                    create_check.passed = pool_cfg.real_pg_count > 0;
+                    for (auto pg_it = pool_cfg.pg_config.begin(); pg_it != pool_cfg.pg_config.end(); pg_it++)
+                    {
+                        if (!(pg_it->second.cur_state & PG_ACTIVE))
+                        {
+                            create_check.passed = false;
+                            break;
+                        }
+                    }
+                    if (create_check.passed)
+                        create_check.retries = 0;
+                }
+                parent->waiting--;
+                parent->ringloop->wakeup();
+            });
+            return;
+        }
+
+        if (!create_check.passed)
+        {
+            result = (cli_result_t) {
+                .err = EAGAIN,
+                .text = "Pool "+cfg["name"].string_value()+" was created, but failed to become active."
+                    " This may indicate that cluster state has changed while the pool was being created."
+                    " Please check the current state and adjust the pool configuration if necessary.",
+            };
+        }
+        else
+        {
+            result = (cli_result_t){
+                .err = 0,
+                .text = "Pool "+cfg["name"].string_value()+" created",
+                .data = new_pools[std::to_string(new_id)],
+            };
+        }
+        state = 100;
+    }
+
+    // Returns a JSON object of form {"nodes": {...}, "osds": [...]} that
+    // contains: all nodes (osds, hosts, ...) based on node_placement config
+    // and current peer state, and a list of active peer osds.
+    json11::Json get_state_node_tree(json11::Json::object node_placement)
+    {
+        // Erase non-peer osd nodes from node_placement
+        for (auto np_it = node_placement.begin(); np_it != node_placement.end();)
+        {
+            // Numeric nodes are osds
+            osd_num_t osd_num = stoull_full(np_it->first);
+
+            // If node is osd and it is not in peer states, erase it
+            if (osd_num > 0 &&
+                parent->cli->st_cli.peer_states.find(osd_num) == parent->cli->st_cli.peer_states.end())
+            {
+                node_placement.erase(np_it++);
+            }
+            else
+                np_it++;
+        }
+
+        // List of peer osds
+        std::vector<std::string> peer_osds;
+
+        // Record peer osds and add missing osds/hosts to np
+        for (auto & ps: parent->cli->st_cli.peer_states)
+        {
+            std::string osd_num = std::to_string(ps.first);
+
+            // Record peer osd
+            peer_osds.push_back(osd_num);
+
+            // Add osd, if necessary
+            if (node_placement.find(osd_num) == node_placement.end())
+            {
+                std::string osd_host = ps.second["host"].as_string();
+
+                // Add host, if necessary
+                if (node_placement.find(osd_host) == node_placement.end())
+                {
+                    node_placement[osd_host] = json11::Json::object {
+                        { "level", "host" }
+                    };
+                }
+
+                node_placement[osd_num] = json11::Json::object {
+                    { "parent", osd_host }
+                };
+            }
+        }
+
+        return json11::Json::object { { "osds", peer_osds }, { "nodes", node_placement } };
+    }
+
+    // Returns new state_node_tree based on given state_node_tree with osds
+    // filtered out by tags in given osd_configs and current pool config.
+    // Requires: state_node_tree["osds"] must match osd_configs 1-1
+    json11::Json filter_state_node_tree_by_tags(const json11::Json & state_node_tree, std::vector<json11::Json> & osd_configs)
+    {
+        auto & osds = state_node_tree["osds"].array_items();
+
+        // Accepted state_node_tree nodes
+        auto accepted_nodes = state_node_tree["nodes"].object_items();
+
+        // List of accepted osds
+        std::vector<std::string> accepted_osds;
+
+        for (size_t i = 0; i < osd_configs.size(); i++)
+        {
+            auto & oc = osd_configs[i].object_items();
+
+            // Get osd number
+            auto osd_num = osds[i].as_string();
+
+            // We need tags in config to check against pool tags
+            if (oc.find("tags") == oc.end())
+            {
+                // Exclude osd from state_node_tree nodes
+                accepted_nodes.erase(osd_num);
+                continue;
+            }
+            else
+            {
+                // If all pool tags are in osd tags, accept osd
+                if (all_in_tags(osd_configs[i]["tags"], cfg["osd_tags"]))
+                {
+                    accepted_osds.push_back(osd_num);
+                }
+                // Otherwise, exclude osd
+                else
+                {
+                    // Exclude osd from state_node_tree nodes
+                    accepted_nodes.erase(osd_num);
+                }
+            }
+        }
+
+        return json11::Json::object { { "osds", accepted_osds }, { "nodes", accepted_nodes } };
+    }
+
+    // Returns new state_node_tree based on given state_node_tree with osds
+    // filtered out by stats parameters (block_size, bitmap_granularity) in
+    // given osd_stats and current pool config.
+    // Requires: state_node_tree["osds"] must match osd_stats 1-1
+    json11::Json filter_state_node_tree_by_stats(const json11::Json & state_node_tree, std::vector<json11::Json> & osd_stats)
+    {
+        auto & osds = state_node_tree["osds"].array_items();
+
+        // Accepted state_node_tree nodes
+        auto accepted_nodes = state_node_tree["nodes"].object_items();
+
+        // List of accepted osds
+        std::vector<std::string> accepted_osds;
+
+        uint64_t p_block_size = cfg["block_size"].uint64_value()
+            ? cfg["block_size"].uint64_value()
+            : parent->cli->st_cli.global_block_size;
+        uint64_t p_bitmap_granularity = cfg["bitmap_granularity"].uint64_value()
+            ? cfg["bitmap_granularity"].uint64_value()
+            : parent->cli->st_cli.global_bitmap_granularity;
+        uint32_t p_immediate_commit = cfg["immediate_commit"].is_string()
+            ? etcd_state_client_t::parse_immediate_commit(cfg["immediate_commit"].string_value())
+            : parent->cli->st_cli.global_immediate_commit;
+
+        for (size_t i = 0; i < osd_stats.size(); i++)
+        {
+            auto & os = osd_stats[i];
+            // Get osd number
+            auto osd_num = osds[i].as_string();
+            if (!os["data_block_size"].is_null() && os["data_block_size"] != p_block_size ||
+                !os["bitmap_granularity"].is_null() && os["bitmap_granularity"] != p_bitmap_granularity ||
+                !os["immediate_commit"].is_null() &&
+                etcd_state_client_t::parse_immediate_commit(os["immediate_commit"].string_value()) < p_immediate_commit)
+            {
+                accepted_nodes.erase(osd_num);
+            }
+            else
+            {
+                accepted_osds.push_back(osd_num);
+            }
+        }
+
+        return json11::Json::object { { "osds", accepted_osds }, { "nodes", accepted_nodes } };
+    }
+
+    // Returns maximum pg_size possible for given node_tree and failure_domain, starting at parent_node
+    uint64_t get_max_pg_size(json11::Json::object node_tree, const std::string & level, const std::string & parent_node)
+    {
+        uint64_t max_pg_sz = 0;
+
+        std::vector<std::string> nodes;
+
+        // Check if parent node is an osd (numeric)
+        if (parent_node != "" && stoull_full(parent_node))
+        {
+            // Add it to node list if osd is in node tree
+            if (node_tree.find(parent_node) != node_tree.end())
+                nodes.push_back(parent_node);
+        }
+        // If parent node given, ...
+        else if (parent_node != "")
+        {
+            // ... look for children nodes of this parent
+            for (auto & sn: node_tree)
+            {
+                auto & props = sn.second.object_items();
+
+                auto parent_prop = props.find("parent");
+                if (parent_prop != props.end() && (parent_prop->second.as_string() == parent_node))
+                {
+                    nodes.push_back(sn.first);
+
+                    // If we're not looking for all osds, we only need a single
+                    // child osd node
+                    if (level != "osd" && stoull_full(sn.first))
+                        break;
+                }
+            }
+        }
+        // No parent node given, and we're not looking for all osds
+        else if (level != "osd")
+        {
+            // ... look for all level nodes
+            for (auto & sn: node_tree)
+            {
+                auto & props = sn.second.object_items();
+
+                auto level_prop = props.find("level");
+                if (level_prop != props.end() && (level_prop->second.as_string() == level))
+                {
+                    nodes.push_back(sn.first);
+                }
+            }
+        }
+        // Otherwise, ...
+        else
+        {
+            // ... we're looking for osd nodes only
+            for (auto & sn: node_tree)
+            {
+                if (stoull_full(sn.first))
+                {
+                    nodes.push_back(sn.first);
+                }
+            }
+        }
+
+        // Process gathered nodes
+        for (auto & node: nodes)
+        {
+            // An osd node always contributes exactly 1 to the max pg_size
+            if (stoull_full(node))
+            {
+                max_pg_sz += 1;
+            }
+            // Otherwise, ...
+            else
+            {
+                // ... exclude parent node from tree, and ...
+                node_tree.erase(parent_node);
+
+                // ... descend onto the resulting tree
+                max_pg_sz += get_max_pg_size(node_tree, level, node);
+            }
+        }
+
+        return max_pg_sz;
+    }
+
+    json11::Json create_pool(const etcd_kv_t & kv)
+    {
+        for (auto & p: kv.value.object_items())
+        {
+            // ID
+            uint64_t pool_id = stoull_full(p.first);
+            new_id = std::max(pool_id+1, new_id);
+            // Name
+            if (p.second["name"].string_value() == cfg["name"].string_value())
+            {
+                return "Pool with name \""+cfg["name"].string_value()+"\" already exists (ID "+std::to_string(pool_id)+")";
+            }
+        }
+        auto res = kv.value.object_items();
+        res[std::to_string(new_id)] = cfg;
+        return res;
+    }
+
+    // Checks whether tags2 tags are all in tags1 tags
+    bool all_in_tags(json11::Json tags1, json11::Json tags2)
+    {
+        if (!tags2.is_array())
+        {
+            tags2 = json11::Json::array{ tags2.string_value() };
+        }
+        if (!tags1.is_array())
+        {
+            tags1 = json11::Json::array{ tags1.string_value() };
+        }
+        for (auto & tag2: tags2.array_items())
+        {
+            bool found = false;
+            for (auto & tag1: tags1.array_items())
+            {
+                if (tag1 == tag2)
+                {
+                    found = true;
+                    break;
+                }
+            }
+            if (!found)
+            {
+                return false;
+            }
+        }
+        return true;
+    }
+};
+
+std::function<bool(cli_result_t &)> cli_tool_t::start_pool_create(json11::Json cfg)
+{
+    auto pool_creator = new pool_creator_t();
+    pool_creator->parent = this;
+    pool_creator->cfg = cfg.object_items();
+    pool_creator->force = cfg["force"].bool_value();
+    pool_creator->wait = cfg["wait"].bool_value();
+    return [pool_creator](cli_result_t & result)
+    {
+        pool_creator->loop();
+        if (pool_creator->is_done())
+        {
+            result = pool_creator->result;
+            delete pool_creator;
+            return true;
+        }
+        return false;
+    };
+}
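The `start_pool_create` wrapper above follows a resumable "state machine" pattern: `loop()` starts an asynchronous request, records a state number, and returns; later calls jump back to the matching `resume_N` label until `is_done()` reports completion. The following is a minimal standalone sketch of that pattern, not code from this commit. `fake_parent_t`, `demo_op_t` and `start_demo` are hypothetical names, and the etcd layer is replaced by a simple in-flight counter.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for the asynchronous etcd layer:
// `waiting` counts requests in flight, `reply` holds the last response.
struct fake_parent_t
{
    int waiting = 0;
    std::string reply;
    void start_request() { waiting++; }
    void complete_request(const std::string & r) { reply = r; waiting--; }
};

// Resumable operation written in the same style as the structs in this diff
struct demo_op_t
{
    fake_parent_t *parent = nullptr;
    int state = 0;
    std::string result;

    bool is_done() { return state == 100; }

    void loop()
    {
        assert(parent != nullptr);
        if (state == 1)
            goto resume_1;
        if (state == 100)
            return;
        // First entry: start the request and remember where to resume
        parent->start_request();
        state = 1;
resume_1:
        if (parent->waiting > 0)
            return; // still in flight, loop() will be called again later
        result = "got " + parent->reply;
        state = 100;
    }
};

// Wraps the operation into a callback that returns true once it has finished
std::function<bool(std::string &)> start_demo(fake_parent_t *parent)
{
    auto op = new demo_op_t();
    op->parent = parent;
    return [op](std::string & result)
    {
        op->loop();
        if (op->is_done())
        {
            result = op->result;
            delete op;
            return true;
        }
        return false;
    };
}
```

The caller keeps invoking the returned callback whenever a pending request completes. Because the operation object deletes itself on completion, the callback must not be invoked again after it has returned true.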
--- a/src/cli_pool_ls.cpp
+++ b/src/cli_pool_ls.cpp
@@ -0,0 +1,723 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+
+#include <algorithm>
+#include "cli.h"
+#include "cluster_client.h"
+#include "str_util.h"
+#include "pg_states.h"
+
+// List pools with space statistics
+// - df - minimal list with % used space
+// - pool-ls - same but with PG state and recovery %
+// - pool-ls -l - same but also include I/O statistics
+// - pool-ls --detail - use list format, include PG states, I/O stats and all pool parameters
+struct pool_lister_t
+{
+    cli_tool_t *parent;
+    std::string sort_field;
+    std::set<std::string> only_names;
+    bool reverse = false;
+    int max_count = 0;
+    bool show_recovery = false;
+    bool show_stats = false;
+    bool detailed = false;
+
+    int state = 0;
+    cli_result_t result;
+    std::map<pool_id_t, json11::Json::object> pool_stats;
+    struct io_stats_t
+    {
+        uint64_t count = 0;
+        uint64_t read_iops = 0;
+        uint64_t read_bps = 0;
+        uint64_t read_lat = 0;
+        uint64_t write_iops = 0;
+        uint64_t write_bps = 0;
+        uint64_t write_lat = 0;
+        uint64_t delete_iops = 0;
+        uint64_t delete_bps = 0;
+        uint64_t delete_lat = 0;
+    };
+    struct object_counts_t
+    {
+        uint64_t object_count = 0;
+        uint64_t misplaced_count = 0;
+        uint64_t degraded_count = 0;
+        uint64_t incomplete_count = 0;
+    };
+
+    bool is_done()
+    {
+        return state == 100;
+    }
+
+    void get_pool_stats(int base_state)
+    {
+        if (state == base_state+1)
+            goto resume_1;
+        // Space statistics - pool/stats/<pool>
+        parent->etcd_txn(json11::Json::object {
+            { "success", json11::Json::array {
+                json11::Json::object {
+                    { "request_range", json11::Json::object {
+                        { "key", base64_encode(
+                            parent->cli->st_cli.etcd_prefix+"/pool/stats/"
+                        ) },
+                        { "range_end", base64_encode(
+                            parent->cli->st_cli.etcd_prefix+"/pool/stats0"
+                        ) },
+                    } },
+                },
+                json11::Json::object {
+                    { "request_range", json11::Json::object {
+                        { "key", base64_encode(
+                            parent->cli->st_cli.etcd_prefix+"/osd/stats/"
+                        ) },
+                        { "range_end", base64_encode(
+                            parent->cli->st_cli.etcd_prefix+"/osd/stats0"
+                        ) },
+                    } },
+                },
+                json11::Json::object {
+                    { "request_range", json11::Json::object {
+                        { "key", base64_encode(
+                            parent->cli->st_cli.etcd_prefix+"/config/pools"
+                        ) },
+                    } },
+                },
+            } },
+        });
+        state = base_state+1;
+resume_1:
+        if (parent->waiting > 0)
+            return;
+        if (parent->etcd_err.err)
+        {
+            result = parent->etcd_err;
+            state = 100;
+            return;
+        }
+        auto space_info = parent->etcd_result;
+        auto config_pools = space_info["responses"][2]["response_range"]["kvs"][0];
+        if (!config_pools.is_null())
+        {
+            config_pools = parent->cli->st_cli.parse_etcd_kv(config_pools).value;
+        }
+        for (auto & kv_item: space_info["responses"][0]["response_range"]["kvs"].array_items())
+        {
+            auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
+            // pool ID
+            pool_id_t pool_id;
+            char null_byte = 0;
+            int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/pool/stats/%u%c", &pool_id, &null_byte);
+            if (scanned != 1 || !pool_id || pool_id >= POOL_ID_MAX)
+            {
+                fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
+                continue;
+            }
+            // pool/stats/<N>
+            pool_stats[pool_id] = kv.value.object_items();
+        }
+        std::map<osd_num_t, uint64_t> osd_free;
+        for (auto & kv_item: space_info["responses"][1]["response_range"]["kvs"].array_items())
+        {
+            auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
+            // osd ID
+            osd_num_t osd_num;
+            char null_byte = 0;
+            int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/osd/stats/%ju%c", &osd_num, &null_byte);
+            if (scanned != 1 || !osd_num || osd_num >= POOL_ID_MAX)
+            {
+                fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
+                continue;
+            }
+            // osd/stats/<N>::free
+            osd_free[osd_num] = kv.value["free"].uint64_value();
+        }
+        // Calculate max_avail for each pool
+        for (auto & pp: parent->cli->st_cli.pool_config)
+        {
+            auto & pool_cfg = pp.second;
+            uint64_t pool_avail = UINT64_MAX;
+            std::map<osd_num_t, uint64_t> pg_per_osd;
+            bool active = pool_cfg.real_pg_count > 0;
+            uint64_t pg_states = 0;
+            for (auto & pgp: pool_cfg.pg_config)
+            {
+                if (!(pgp.second.cur_state & PG_ACTIVE))
+                {
+                    active = false;
+                }
+                pg_states |= pgp.second.cur_state;
+                for (auto pg_osd: pgp.second.target_set)
+                {
+                    if (pg_osd != 0)
+                    {
+                        pg_per_osd[pg_osd]++;
+                    }
+                }
+            }
+            for (auto pg_per_pair: pg_per_osd)
+            {
+                uint64_t pg_free = osd_free[pg_per_pair.first] * pool_cfg.real_pg_count / pg_per_pair.second;
+                if (pool_avail > pg_free)
+                {
+                    pool_avail = pg_free;
+                }
+            }
+            if (pool_avail == UINT64_MAX)
+            {
+                pool_avail = 0;
+            }
+            if (pool_cfg.scheme != POOL_SCHEME_REPLICATED)
+            {
+                pool_avail *= (pool_cfg.pg_size - pool_cfg.parity_chunks);
+            }
+            // Status priority: inactive > incomplete > has_incomplete > degraded > has_degraded > has_misplaced > active
+            std::string status;
+            if (!active)
+                status = "inactive";
+            else if (pg_states & PG_INCOMPLETE)
+                status = "incomplete";
+            else if (pg_states & PG_HAS_INCOMPLETE)
+                status = "has_incomplete";
+            else if (pg_states & PG_DEGRADED)
+                status = "degraded";
+            else if (pg_states & PG_HAS_DEGRADED)
+                status = "has_degraded";
+            else if (pg_states & PG_HAS_MISPLACED)
+                status = "has_misplaced";
+            else
+                status = "active";
+            pool_stats[pool_cfg.id] = json11::Json::object {
+                { "id", (uint64_t)pool_cfg.id },
+                { "name", pool_cfg.name },
+                { "status", status },
+                { "pg_count", pool_cfg.pg_count },
+                { "real_pg_count", pool_cfg.real_pg_count },
+                { "scheme_name", pool_cfg.scheme == POOL_SCHEME_REPLICATED
+                    ? std::to_string(pool_cfg.pg_size)+"/"+std::to_string(pool_cfg.pg_minsize)
+                    : "EC "+std::to_string(pool_cfg.pg_size-pool_cfg.parity_chunks)+"+"+std::to_string(pool_cfg.parity_chunks) },
+                { "used_raw", (uint64_t)(pool_stats[pool_cfg.id]["used_raw_tb"].number_value() * ((uint64_t)1<<40)) },
+                { "total_raw", (uint64_t)(pool_stats[pool_cfg.id]["total_raw_tb"].number_value() * ((uint64_t)1<<40)) },
+                { "max_available", pool_avail },
+                { "raw_to_usable", pool_stats[pool_cfg.id]["raw_to_usable"].number_value() },
+                { "space_efficiency", pool_stats[pool_cfg.id]["space_efficiency"].number_value() },
+                { "pg_real_size", pool_stats[pool_cfg.id]["pg_real_size"].uint64_value() },
+                { "osd_count", pg_per_osd.size() },
+            };
+        }
+        // Include full pool config
+        for (auto & pp: config_pools.object_items())
+        {
+            if (!pp.second.is_object())
+            {
+                continue;
+            }
+            auto pool_id = stoull_full(pp.first);
+            auto & st = pool_stats[pool_id];
+            for (auto & kv: pp.second.object_items())
+            {
+                if (st.find(kv.first) == st.end())
+                    st[kv.first] = kv.second;
+            }
+        }
+    }
+
+    void get_pg_stats(int base_state)
+    {
+        if (state == base_state+1)
+            goto resume_1;
+        // PG statistics - pg/stats/<pool>/<pg>
+        parent->etcd_txn(json11::Json::object {
+            { "success", json11::Json::array {
+                json11::Json::object {
+                    { "request_range", json11::Json::object {
+                        { "key", base64_encode(
+                            parent->cli->st_cli.etcd_prefix+"/pg/stats/"
+                        ) },
+                        { "range_end", base64_encode(
+                            parent->cli->st_cli.etcd_prefix+"/pg/stats0"
+                        ) },
+                    } },
+                },
+            } },
+        });
+        state = base_state+1;
+resume_1:
+        if (parent->waiting > 0)
+            return;
+        if (parent->etcd_err.err)
+        {
+            result = parent->etcd_err;
+            state = 100;
+            return;
+        }
+        auto pg_stats = parent->etcd_result["responses"][0]["response_range"]["kvs"];
+        // Calculate recovery percent
+        std::map<pool_id_t, object_counts_t> counts;
+        for (auto & kv_item: pg_stats.array_items())
+        {
+            auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
+            // pool ID & pg number
+            pool_id_t pool_id;
+            pg_num_t pg_num = 0;
+            char null_byte = 0;
+            int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(),
+                "/pg/stats/%u/%u%c", &pool_id, &pg_num, &null_byte);
+            if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX)
+            {
+                fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
+                continue;
+            }
+            auto & cnt = counts[pool_id];
+            cnt.object_count += kv.value["object_count"].uint64_value();
+            cnt.misplaced_count += kv.value["misplaced_count"].uint64_value();
+            cnt.degraded_count += kv.value["degraded_count"].uint64_value();
+            cnt.incomplete_count += kv.value["incomplete_count"].uint64_value();
+        }
+        for (auto & pp: pool_stats)
+        {
+            auto & cnt = counts[pp.first];
+            auto & st = pp.second;
+            st["object_count"] = cnt.object_count;
+            st["misplaced_count"] = cnt.misplaced_count;
+            st["degraded_count"] = cnt.degraded_count;
+            st["incomplete_count"] = cnt.incomplete_count;
+        }
+    }
+
+    void get_inode_stats(int base_state)
+    {
+        if (state == base_state+1)
+            goto resume_1;
+        // Per-inode I/O statistics - inode/stats/<pool>/<inode>
+        parent->etcd_txn(json11::Json::object {
+            { "success", json11::Json::array {
+                json11::Json::object {
+                    { "request_range", json11::Json::object {
+                        { "key", base64_encode(
+                            parent->cli->st_cli.etcd_prefix+"/inode/stats/"
+                        ) },
+                        { "range_end", base64_encode(
+                            parent->cli->st_cli.etcd_prefix+"/inode/stats0"
+                        ) },
+                    } },
+                },
+            } },
+        });
+        state = base_state+1;
+resume_1:
+        if (parent->waiting > 0)
+            return;
+        if (parent->etcd_err.err)
+        {
+            result = parent->etcd_err;
+            state = 100;
+            return;
+        }
+        auto inode_stats = parent->etcd_result["responses"][0]["response_range"]["kvs"];
+        // Performance statistics
+        std::map<pool_id_t, io_stats_t> pool_io;
+        for (auto & kv_item: inode_stats.array_items())
+        {
+            auto kv = parent->cli->st_cli.parse_etcd_kv(kv_item);
+            // pool ID & inode number
+            pool_id_t pool_id;
+            inode_t only_inode_num;
+            char null_byte = 0;
+            int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(),
+                "/inode/stats/%u/%ju%c", &pool_id, &only_inode_num, &null_byte);
+            if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX || INODE_POOL(only_inode_num) != 0)
+            {
+                fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
+                continue;
+            }
+            auto & io = pool_io[pool_id];
+            io.read_iops += kv.value["read"]["iops"].uint64_value();
+            io.read_bps += kv.value["read"]["bps"].uint64_value();
+            io.read_lat += kv.value["read"]["lat"].uint64_value();
+            io.write_iops += kv.value["write"]["iops"].uint64_value();
+            io.write_bps += kv.value["write"]["bps"].uint64_value();
+            io.write_lat += kv.value["write"]["lat"].uint64_value();
+            io.delete_iops += kv.value["delete"]["iops"].uint64_value();
+            io.delete_bps += kv.value["delete"]["bps"].uint64_value();
+            io.delete_lat += kv.value["delete"]["lat"].uint64_value();
+            io.count++;
+        }
+        for (auto & pp: pool_stats)
+        {
+            auto & io = pool_io[pp.first];
+            if (io.count > 0)
+            {
+                io.read_lat /= io.count;
+                io.write_lat /= io.count;
+                io.delete_lat /= io.count;
+            }
+            auto & st = pp.second;
+            st["read_iops"] = io.read_iops;
+            st["read_bps"] = io.read_bps;
+            st["read_lat"] = io.read_lat;
+            st["write_iops"] = io.write_iops;
+            st["write_bps"] = io.write_bps;
+            st["write_lat"] = io.write_lat;
+            st["delete_iops"] = io.delete_iops;
+            st["delete_bps"] = io.delete_bps;
+            st["delete_lat"] = io.delete_lat;
+        }
+    }
+
+    json11::Json::array to_list()
+    {
+        json11::Json::array list;
+        for (auto & kv: pool_stats)
+        {
+            if (!only_names.size())
+            {
+                list.push_back(kv.second);
+            }
+            else
+            {
+                for (auto glob: only_names)
+                {
+                    if (stupid_glob(kv.second["name"].string_value(), glob))
+                    {
+                        list.push_back(kv.second);
+                        break;
+                    }
+                }
+            }
+        }
+        if (sort_field == "name" || sort_field == "scheme" ||
+            sort_field == "scheme_name" || sort_field == "status")
+        {
+            std::sort(list.begin(), list.end(), [this](json11::Json a, json11::Json b)
+            {
+                auto av = a[sort_field].as_string();
+                auto bv = b[sort_field].as_string();
+                return reverse ? av > bv : av < bv;
+            });
+        }
+        else
+        {
+            std::sort(list.begin(), list.end(), [this](json11::Json a, json11::Json b)
+            {
+                auto av = a[sort_field].number_value();
+                auto bv = b[sort_field].number_value();
+                return reverse ? av > bv : av < bv;
+            });
+        }
+        if (max_count > 0 && list.size() > (size_t)max_count)
+        {
+            list.resize(max_count);
+        }
+        return list;
+    }
+
+    void loop()
+    {
+        if (state == 1)
+            goto resume_1;
+        if (state == 2)
+            goto resume_2;
+        if (state == 3)
+            goto resume_3;
+        if (state == 100)
+            return;
+        show_stats = show_stats || detailed;
+        show_recovery = show_recovery || detailed;
+resume_1:
+        get_pool_stats(0);
+        if (parent->waiting > 0)
+            return;
+        if (show_stats)
+        {
+resume_2:
+            get_inode_stats(1);
+            if (parent->waiting > 0)
+                return;
+        }
+        if (show_recovery)
+        {
+resume_3:
+            get_pg_stats(2);
+            if (parent->waiting > 0)
+                return;
+        }
+        if (parent->json_output)
+        {
+            // JSON output
+            result.data = to_list();
+            state = 100;
+            return;
+        }
+        json11::Json::array list;
+        for (auto & kv: pool_stats)
+        {
+            auto & st = kv.second;
+            double raw_to = st["raw_to_usable"].number_value();
+            if (raw_to < 0.000001 && raw_to > -0.000001)
+                raw_to = 1;
+            st["pg_count_fmt"] = st["real_pg_count"] == st["pg_count"]
+                ? st["real_pg_count"].as_string()
+                : st["real_pg_count"].as_string()+"->"+st["pg_count"].as_string();
+            st["total_fmt"] = format_size(st["total_raw"].uint64_value() / raw_to);
+            st["used_fmt"] = format_size(st["used_raw"].uint64_value() / raw_to);
+            st["max_avail_fmt"] = format_size(st["max_available"].uint64_value());
+            st["used_pct"] = format_q(st["total_raw"].uint64_value()
+                ? (100 - 100*st["max_available"].uint64_value() *
+                    st["raw_to_usable"].number_value() / st["total_raw"].uint64_value())
+                : 100)+"%";
+            st["eff_fmt"] = format_q(st["space_efficiency"].number_value()*100)+"%";
+            if (show_stats)
+            {
+                st["read_bw"] = format_size(st["read_bps"].uint64_value())+"/s";
+                st["write_bw"] = format_size(st["write_bps"].uint64_value())+"/s";
+                st["delete_bw"] = format_size(st["delete_bps"].uint64_value())+"/s";
+                st["read_iops"] = format_q(st["read_iops"].number_value());
+                st["write_iops"] = format_q(st["write_iops"].number_value());
+                st["delete_iops"] = format_q(st["delete_iops"].number_value());
+                st["read_lat_f"] = format_lat(st["read_lat"].uint64_value());
+                st["write_lat_f"] = format_lat(st["write_lat"].uint64_value());
+                st["delete_lat_f"] = format_lat(st["delete_lat"].uint64_value());
+            }
+            if (show_recovery)
+            {
+                auto object_count = st["object_count"].uint64_value();
+                auto recovery_pct = 100.0 * (object_count - (st["misplaced_count"].uint64_value() +
+                    st["degraded_count"].uint64_value() + st["incomplete_count"].uint64_value())) /
+                    (object_count ? object_count : 1);
+                st["recovery_fmt"] = format_q(recovery_pct)+"%";
+            }
+        }
+        if (detailed)
+        {
+            for (auto & kv: pool_stats)
+            {
+                auto & st = kv.second;
+                auto total = st["object_count"].uint64_value();
+                auto obj_size = st["block_size"].uint64_value();
+                if (!obj_size)
+                    obj_size = parent->cli->st_cli.global_block_size;
+                if (st["scheme"] == "ec")
+                    obj_size *= st["pg_size"].uint64_value() - st["parity_chunks"].uint64_value();
+                else if (st["scheme"] == "xor")
+                    obj_size *= st["pg_size"].uint64_value() - 1;
+                auto n = st["misplaced_count"].uint64_value();
+                if (n > 0)
+                    st["misplaced_fmt"] = format_size(n * obj_size) + " / " + format_q(100.0 * n / total);
+                n = st["degraded_count"].uint64_value();
+                if (n > 0)
+                    st["degraded_fmt"] = format_size(n * obj_size) + " / " + format_q(100.0 * n / total);
+                n = st["incomplete_count"].uint64_value();
+                if (n > 0)
+                    st["incomplete_fmt"] = format_size(n * obj_size) + " / " + format_q(100.0 * n / total);
+                st["read_fmt"] = st["read_bw"].string_value()+", "+st["read_iops"].string_value()+" op/s, "+
+                    st["read_lat_f"].string_value()+" lat";
+                st["write_fmt"] = st["write_bw"].string_value()+", "+st["write_iops"].string_value()+" op/s, "+
+                    st["write_lat_f"].string_value()+" lat";
+                st["delete_fmt"] = st["delete_bw"].string_value()+", "+st["delete_iops"].string_value()+" op/s, "+
+                    st["delete_lat_f"].string_value()+" lat";
+                if (st["scheme"] == "replicated")
+                    st["scheme_name"] = "x"+st["pg_size"].as_string();
+                if (st["failure_domain"].string_value() == "")
+                    st["failure_domain"] = "host";
+                st["osd_tags_fmt"] = implode(", ", st["osd_tags"]);
+                st["primary_affinity_tags_fmt"] = implode(", ", st["primary_affinity_tags"]);
+                if (st["block_size"].uint64_value())
+                    st["block_size_fmt"] = format_size(st["block_size"].uint64_value());
+                if (st["bitmap_granularity"].uint64_value())
+                    st["bitmap_granularity_fmt"] = format_size(st["bitmap_granularity"].uint64_value());
+            }
+            // All pool parameters are only displayed in the "detailed" mode
+            // because there are too many of them to fit in a table
+            auto cols = std::vector<std::pair<std::string, std::string>>{
+                { "name", "Name" },
+                { "id", "ID" },
+                { "scheme_name", "Scheme" },
+                { "used_for_fs", "Used for VitastorFS" },
+                { "status", "Status" },
+                { "pg_count_fmt", "PGs" },
+                { "pg_minsize", "PG minsize" },
+                { "failure_domain", "Failure domain" },
+                { "root_node", "Root node" },
+                { "osd_tags_fmt", "OSD tags" },
+                { "primary_affinity_tags_fmt", "Primary affinity" },
+                { "block_size_fmt", "Block size" },
+                { "bitmap_granularity_fmt", "Bitmap granularity" },
+                { "immediate_commit", "Immediate commit" },
+                { "scrub_interval", "Scrub interval" },
+                { "inode_stats_fmt", "Per-inode stats" },
+                { "pg_stripe_size", "PG stripe size" },
+                { "max_osd_combinations", "Max OSD combinations" },
+                { "total_fmt", "Total" },
+                { "used_fmt", "Used" },
+                { "max_avail_fmt", "Available" },
+                { "used_pct", "Used%" },
+                { "eff_fmt", "Efficiency" },
+                { "osd_count", "OSD count" },
+                { "misplaced_fmt", "Misplaced" },
+                { "degraded_fmt", "Degraded" },
+                { "incomplete_fmt", "Incomplete" },
+                { "read_fmt", "Read" },
+                { "write_fmt", "Write" },
+                { "delete_fmt", "Delete" },
+            };
+            auto list = to_list();
+            size_t title_len = 0;
+            for (auto & item: list)
+            {
+                title_len = print_detail_title_len(item, cols, title_len);
+            }
+            for (auto & item: list)
+            {
+                if (result.text != "")
+                    result.text += "\n";
+                result.text += print_detail(item, cols, title_len, parent->color);
+            }
+            state = 100;
+            return;
+        }
+        // Table output: name, scheme_name, status, pg_count, total, used, max_avail, used%, efficiency
+        json11::Json::array cols;
+        cols.push_back(json11::Json::object{
+            { "key", "name" },
+            { "title", "NAME" },
+        });
+        cols.push_back(json11::Json::object{
+            { "key", "scheme_name" },
+            { "title", "SCHEME" },
+        });
+        cols.push_back(json11::Json::object{
+            { "key", "status" },
+            { "title", "STATUS" },
+        });
+        cols.push_back(json11::Json::object{
+            { "key", "pg_count_fmt" },
+            { "title", "PGS" },
+        });
+        cols.push_back(json11::Json::object{
+            { "key", "total_fmt" },
+            { "title", "TOTAL" },
+        });
+        cols.push_back(json11::Json::object{
+            { "key", "used_fmt" },
+            { "title", "USED" },
+        });
+        cols.push_back(json11::Json::object{
+            { "key", "max_avail_fmt" },
+            { "title", "AVAILABLE" },
+        });
+        cols.push_back(json11::Json::object{
+            { "key", "used_pct" },
+            { "title", "USED%" },
+        });
+        cols.push_back(json11::Json::object{
+            { "key", "eff_fmt" },
+            { "title", "EFFICIENCY" },
+        });
+        if (show_recovery)
+        {
+            cols.push_back(json11::Json::object{ { "key", "recovery_fmt" }, { "title", "RECOVERY" } });
+        }
+        if (show_stats)
+        {
+            cols.push_back(json11::Json::object{ { "key", "read_bw" }, { "title", "READ" } });
+            cols.push_back(json11::Json::object{ { "key", "read_iops" }, { "title", "IOPS" } });
+            cols.push_back(json11::Json::object{ { "key", "read_lat_f" }, { "title", "LAT" } });
+            cols.push_back(json11::Json::object{ { "key", "write_bw" }, { "title", "WRITE" } });
+            cols.push_back(json11::Json::object{ { "key", "write_iops" }, { "title", "IOPS" } });
+            cols.push_back(json11::Json::object{ { "key", "write_lat_f" }, { "title", "LAT" } });
+            cols.push_back(json11::Json::object{ { "key", "delete_bw" }, { "title", "DELETE" } });
+            cols.push_back(json11::Json::object{ { "key", "delete_iops" }, { "title", "IOPS" } });
+            cols.push_back(json11::Json::object{ { "key", "delete_lat_f" }, { "title", "LAT" } });
+        }
+        result.data = to_list();
+        result.text = print_table(result.data, cols, parent->color);
+        state = 100;
+    }
+};
+
+size_t print_detail_title_len(json11::Json item, std::vector<std::pair<std::string, std::string>> names, size_t prev_len)
+{
+    size_t title_len = prev_len;
+    for (auto & kv: names)
+    {
+        if (!item[kv.first].is_null() && (!item[kv.first].is_string() || item[kv.first].string_value() != ""))
+        {
+            size_t len = utf8_length(kv.second);
+            title_len = title_len < len ? len : title_len;
+        }
+    }
+    return title_len;
+}
+
+std::string print_detail(json11::Json item, std::vector<std::pair<std::string, std::string>> names, size_t title_len, bool use_esc)
+{
+    std::string str;
+    for (auto & kv: names)
+    {
+        if (!item[kv.first].is_null() && (!item[kv.first].is_string() || item[kv.first].string_value() != ""))
+        {
+            str += kv.second;
+            str += ": ";
+            size_t len = utf8_length(kv.second);
+            for (size_t j = 0; j < title_len-len; j++)
+                str += ' ';
+            if (use_esc)
+                str += "\033[1m";
+            str += item[kv.first].as_string();
+            if (use_esc)
+                str += "\033[0m";
+            str += "\n";
+        }
+    }
+    return str;
+}
+
+std::function<bool(cli_result_t &)> cli_tool_t::start_pool_ls(json11::Json cfg)
+{
+    auto lister = new pool_lister_t();
+    lister->parent = this;
+    lister->show_recovery = cfg["show_recovery"].bool_value();
+    lister->show_stats = cfg["long"].bool_value();
+    lister->detailed = cfg["detail"].bool_value();
+    lister->sort_field = cfg["sort"].string_value();
+    if ((lister->sort_field == "osd_tags") ||
+        (lister->sort_field == "primary_affinity_tags" ))
+        lister->sort_field = lister->sort_field + "_fmt";
+    lister->reverse = cfg["reverse"].bool_value();
+    lister->max_count = cfg["count"].uint64_value();
+    for (auto & item: cfg["names"].array_items())
+    {
+        lister->only_names.insert(item.string_value());
+    }
+    return [lister](cli_result_t & result)
+    {
+        lister->loop();
+        if (lister->is_done())
+        {
+            result = lister->result;
+            delete lister;
+            return true;
+        }
+        return false;
+    };
+}
+
+std::string implode(const std::string & sep, json11::Json array)
+{
+    if (array.is_number() || array.is_bool() || array.is_string())
+    {
+        return array.as_string();
+    }
+    std::string res;
+    bool first = true;
+    for (auto & item: array.array_items())
+    {
+        res += (first ? item.as_string() : sep+item.as_string());
+        first = false;
+    }
+    return res;
+}
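The "max_avail" calculation in `get_pool_stats` above can be read as follows: for every OSD used by the pool, extrapolate its free space to the whole pool in proportion to the share of PGs it holds, take the minimum over all OSDs, and for non-replicated schemes multiply by the number of data chunks. Below is a simplified standalone sketch of that arithmetic under stated assumptions. `estimate_pool_avail` and its parameters are hypothetical, plain integers replace `osd_num_t` and the pool config structures, and the number of listed PGs stands in for `real_pg_count`.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// pg_target_sets: target OSD numbers of each PG (0 means "no OSD")
// osd_free: free bytes per OSD number
// data_chunks: 1 for replicated pools, pg_size - parity_chunks otherwise
uint64_t estimate_pool_avail(
    const std::vector<std::vector<uint64_t>> & pg_target_sets,
    const std::map<uint64_t, uint64_t> & osd_free,
    uint64_t data_chunks)
{
    // Count how many PGs are placed on each OSD
    std::map<uint64_t, uint64_t> pg_per_osd;
    for (auto & target_set: pg_target_sets)
    {
        for (auto osd_num: target_set)
        {
            if (osd_num != 0)
                pg_per_osd[osd_num]++;
        }
    }
    // The fullest OSD, relative to its PG share, limits the whole pool
    uint64_t pool_avail = UINT64_MAX;
    for (auto & pp: pg_per_osd)
    {
        auto it = osd_free.find(pp.first);
        uint64_t free_bytes = it != osd_free.end() ? it->second : 0;
        uint64_t pg_free = free_bytes * pg_target_sets.size() / pp.second;
        if (pool_avail > pg_free)
            pool_avail = pg_free;
    }
    if (pool_avail == UINT64_MAX)
        pool_avail = 0; // the pool has no OSDs at all
    return pool_avail * data_chunks;
}
```

For example, with two PGs placed on OSD sets {1, 2} and {1, 3} and 100 bytes free on every OSD, OSD 1 holds both PGs and therefore limits the estimate to 100 bytes for a replicated pool.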
--- a/src/cli_pool_modify.cpp
+++ b/src/cli_pool_modify.cpp
@@ -0,0 +1,203 @@
+// Copyright (c) MIND Software LLC, 2023 (info@mindsw.io)
+// I accept Vitastor CLA: see CLA-en.md for details
+// Copyright (c) Vitaliy Filippov, 2024
+// License: VNPL-1.1 (see README.md for details)
+
+#include <ctype.h>
+#include "cli.h"
+#include "cli_pool_cfg.h"
+#include "cluster_client.h"
+#include "str_util.h"
+
+struct pool_changer_t
+{
+    cli_tool_t *parent;
+
+    // Required parameters (id/name)
+    pool_id_t pool_id = 0;
+    std::string pool_name;
+    json11::Json::object cfg;
+    json11::Json::object new_cfg;
+    bool force = false;
+
+    json11::Json old_cfg;
+
+    int state = 0;
+    cli_result_t result;
+
+    // Updated pools
+    json11::Json new_pools;
+
+    // Expected pools mod revision
+    uint64_t pools_mod_rev;
+
+    bool is_done() { return state == 100; }
+
+    void loop()
+    {
+        if (state == 1)
+            goto resume_1;
+        else if (state == 2)
+            goto resume_2;
+        pool_id = stoull_full(cfg["old_name"].string_value());
+        if (!pool_id)
+        {
+            pool_name = cfg["old_name"].string_value();
+            if (pool_name == "")
+            {
+                result = (cli_result_t){ .err = ENOENT, .text = "Pool ID or name is required to modify it" };
+                state = 100;
+                return;
+            }
+        }
+resume_0:
+        // Get pools from etcd
+        parent->etcd_txn(json11::Json::object {
+            { "success", json11::Json::array {
+                json11::Json::object {
+                    { "request_range", json11::Json::object {
+                        { "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
+                    } }
+                },
+            } },
+        });
+        state = 1;
+resume_1:
+        if (parent->waiting > 0)
+            return;
+        if (parent->etcd_err.err)
+        {
+            result = parent->etcd_err;
+            state = 100;
+            return;
+        }
+        {
+            // Parse received pools from etcd
+            auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
+
+            // Get pool by name or ID
+            old_cfg = json11::Json();
+            if (pool_name != "")
+            {
+                for (auto & pce: kv.value.object_items())
+                {
+                    if (pce.second["name"] == pool_name)
+                    {
+                        pool_id = stoull_full(pce.first);
+                        old_cfg = pce.second;
+                        break;
+                    }
+                }
+            }
+            else
+            {
+                pool_name = std::to_string(pool_id);
+                old_cfg = kv.value[pool_name];
+            }
+            if (!old_cfg.is_object())
+            {
+                result = (cli_result_t){ .err = ENOENT, .text = "Pool "+pool_name+" does not exist" };
+                state = 100;
+                return;
+            }
+
+            // Validate the requested configuration against the existing one
+            new_cfg = cfg;
+            result.text = validate_pool_config(new_cfg, old_cfg, parent->cli->st_cli.global_block_size,
+                parent->cli->st_cli.global_bitmap_granularity, force);
+            if (result.text != "")
+            {
+                result.err = EINVAL;
+                state = 100;
+                return;
+            }
+
+            if (new_cfg.find("used_for_fs") != new_cfg.end() && !force)
+            {
+                // Check that pool doesn't have images
+                auto img_it = parent->cli->st_cli.inode_config.lower_bound(INODE_WITH_POOL(pool_id, 0));
+                if (img_it != parent->cli->st_cli.inode_config.end() && INODE_POOL(img_it->first) == pool_id &&
+                    img_it->second.name == new_cfg["used_for_fs"].string_value())
+                {
+                    // Only allow metadata image to exist in the FS pool
+                    img_it++;
+                }
+                if (img_it != parent->cli->st_cli.inode_config.end() && INODE_POOL(img_it->first) == pool_id)
+                {
+                    result = (cli_result_t){ .err = ENOTEMPTY, .text = "Pool "+pool_name+" has block images, delete them before using it for VitastorFS" };
+                    state = 100;
+                    return;
+                }
+            }
+
+            // Update pool
+            auto pls = kv.value.object_items();
+            pls[std::to_string(pool_id)] = new_cfg;
+            new_pools = pls;
+
+            // Expected pools mod revision
+            pools_mod_rev = kv.mod_revision;
+        }
+        // Update pools in etcd
+        parent->etcd_txn(json11::Json::object {
+            { "compare", json11::Json::array {
+                json11::Json::object {
+                    { "target", "MOD" },
+                    { "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
+                    { "result", "LESS" },
+                    { "mod_revision", pools_mod_rev+1 },
+                }
+            } },
+            { "success", json11::Json::array {
+                json11::Json::object {
+                    { "request_put", json11::Json::object {
+                        { "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
+                        { "value", base64_encode(new_pools.dump()) },
+                    } },
+                },
+            } },
+        });
+        state = 2;
+resume_2:
+        if (parent->waiting > 0)
+            return;
+        if (parent->etcd_err.err)
+        {
+            result = parent->etcd_err;
+            state = 100;
+            return;
+        }
+        if (!parent->etcd_result["succeeded"].bool_value())
+        {
+            // CAS failure - retry
+            fprintf(stderr, "Warning: pool configuration was modified in the meantime by someone else\n");
+            goto resume_0;
+        }
+        // Successfully updated pool
+        result = (cli_result_t){
+            .err = 0,
+            .text = "Pool "+pool_name+" updated",
+            .data = new_pools,
+        };
+        state = 100;
+    }
+};
+
+std::function<bool(cli_result_t &)> cli_tool_t::start_pool_modify(json11::Json cfg)
+{
+    auto pool_changer = new pool_changer_t();
+    pool_changer->parent = this;
+    pool_changer->cfg = cfg.object_items();
+    pool_changer->force = cfg["force"].bool_value();
+    return [pool_changer](cli_result_t & result)
+    {
+        pool_changer->loop();
+        if (pool_changer->is_done())
+        {
+            result = pool_changer->result;
+            delete pool_changer;
+            return true;
+        }
+        return false;
+    };
+}
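
Review note: the update in `pool_changer_t::loop()` is an optimistic read-modify-write. It reads `/config/pools`, remembers `mod_revision`, and writes back only if the key's modification revision is still below `pools_mod_rev+1`; if that check fails, it re-reads and retries. The sketch below illustrates that pattern against a hypothetical in-memory key (`fake_kv_t`), not the etcd API. As a simplification, the revision here is a per-key counter, whereas etcd revisions are store-wide.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical in-memory stand-in for one etcd key with a modification revision.
struct fake_kv_t
{
    std::string value;
    uint64_t mod_revision = 0;

    // Writes only if the key was not modified after expected_rev,
    // i.e. the same condition as the "MOD LESS expected_rev+1" compare.
    bool put_if_unmodified(uint64_t expected_rev, const std::string & new_value)
    {
        if (mod_revision >= expected_rev+1)
            return false;
        value = new_value;
        mod_revision++;
        return true;
    }
};

// Read-modify-write with retry. interfere_times simulates concurrent writers
// that change the key between our read and our write.
int append_with_cas(fake_kv_t & kv, const std::string & suffix, int interfere_times)
{
    int attempts = 0;
    while (true)
    {
        attempts++;
        // 1. Read the current value and remember its revision
        std::string cur = kv.value;
        uint64_t rev = kv.mod_revision;
        // Simulated concurrent modification by "someone else"
        if (interfere_times > 0)
        {
            interfere_times--;
            kv.put_if_unmodified(kv.mod_revision, kv.value+"x");
        }
        // 2. Try to write; on CAS failure loop around and re-read
        if (kv.put_if_unmodified(rev, cur+suffix))
            return attempts;
    }
}
```

The key design point is that a failed compare never overwrites the concurrent change: each retry starts again from a fresh read.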
--- a/src/cli_pool_rm.cpp
+++ b/src/cli_pool_rm.cpp
@@ -0,0 +1,226 @@
+// Copyright (c) MIND Software LLC, 2023 (info@mindsw.io)
+// I accept Vitastor CLA: see CLA-en.md for details
+// Copyright (c) Vitaliy Filippov, 2024
+// License: VNPL-1.1 (see README.md for details)
+
+#include <ctype.h>
+#include "cli.h"
+#include "cluster_client.h"
+#include "str_util.h"
+
+struct pool_remover_t
+{
+    cli_tool_t *parent;
+
+    // Required parameters (id/name)
+
+    pool_id_t pool_id = 0;
+    std::string pool_name;
+
+    // Force removal
+    bool force = false;
+
+    int state = 0;
+    cli_result_t result;
+
+    // Is pool valid?
+    bool pool_valid = false;
+
+    // Updated pools
+    json11::Json new_pools;
+
+    // Expected pools mod revision
+    uint64_t pools_mod_rev;
+
+    bool is_done() { return state == 100; }
+
+    void loop()
+    {
+        if (state == 1)
+            goto resume_1;
+        else if (state == 2)
+            goto resume_2;
+        else if (state == 3)
+            goto resume_3;
+
+        // Pool name (or id) required
+        if (!pool_id && pool_name == "")
+        {
+            result = (cli_result_t){ .err = EINVAL, .text = "Pool name or id must be given" };
+            state = 100;
+            return;
+        }
+
+        // Validate pool name/id
+
+        // Get pool id by name (if name given)
+        if (pool_name != "")
+        {
+            for (auto & ic: parent->cli->st_cli.pool_config)
+            {
+                if (ic.second.name == pool_name)
+                {
+                    pool_id = ic.first;
+                    pool_valid = true;
+                    break;
+                }
+            }
+        }
+        // Otherwise, check if given pool id is valid
+        else
+        {
+            // Set pool name from id (for easier logging)
+            pool_name = "id " + std::to_string(pool_id);
+
+            // Look-up pool id in pool_config
+            if (parent->cli->st_cli.pool_config.find(pool_id) != parent->cli->st_cli.pool_config.end())
+            {
+                pool_valid = true;
+            }
+        }
+
+        // Need a valid pool to proceed
+        if (!pool_valid)
+        {
+            result = (cli_result_t){ .err = ENOENT, .text = "Pool "+pool_name+" does not exist" };
+            state = 100;
+            return;
+        }
+
+        // Unless forced, check if pool has associated Images/Snapshots
+        if (!force)
+        {
+            std::string images;
+
+            for (auto & ic: parent->cli->st_cli.inode_config)
+            {
+                if (pool_id && INODE_POOL(ic.second.num) != pool_id)
+                {
+                    continue;
+                }
+                images += ((images != "") ? ", " : "") + ic.second.name;
+            }
+
+            if (images != "")
+            {
+                result = (cli_result_t){
+                    .err = ENOTEMPTY,
+                    .text =
+                        "Pool "+pool_name+" cannot be removed as it still has the following "
+                        "images/snapshots associated with it: "+images
+                };
+                state = 100;
+                return;
+            }
+        }
+
+        // Proceed to deleting the pool
+        state = 1;
+        do
+        {
+resume_1:
+            // Get pools from etcd
+            parent->etcd_txn(json11::Json::object {
+                { "success", json11::Json::array {
+                    json11::Json::object {
+                        { "request_range", json11::Json::object {
+                            { "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
+                        } }
+                    },
+                } },
+            });
+            state = 2;
+resume_2:
+            if (parent->waiting > 0)
+                return;
+            if (parent->etcd_err.err)
+            {
+                result = parent->etcd_err;
+                state = 100;
+                return;
+            }
+            {
+                // Parse received pools from etcd
+                auto kv = parent->cli->st_cli.parse_etcd_kv(parent->etcd_result["responses"][0]["response_range"]["kvs"][0]);
+
+                // Remove pool
+                auto p = kv.value.object_items();
+                if (p.erase(std::to_string(pool_id)) != 1)
+                {
+                    result = (cli_result_t){
+                        .err = ENOENT,
+                        .text = "Failed to erase pool "+pool_name+" from: "+kv.value.dump()
+                    };
+                    state = 100;
+                    return;
+                }
+
+                // Record updated pools
+                new_pools = p;
+
+                // Expected pools mod revision
+                pools_mod_rev = kv.mod_revision;
+            }
+            // Update pools in etcd
+            parent->etcd_txn(json11::Json::object {
+                { "compare", json11::Json::array {
+                    json11::Json::object {
+                        { "target", "MOD" },
+                        { "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
+                        { "result", "LESS" },
+                        { "mod_revision", pools_mod_rev+1 },
+                    }
+                } },
+                { "success", json11::Json::array {
+                    json11::Json::object {
+                        { "request_put", json11::Json::object {
+                            { "key", base64_encode(parent->cli->st_cli.etcd_prefix+"/config/pools") },
+                            { "value", base64_encode(new_pools.dump()) },
+                        } },
+                    },
+                } },
+            });
+            state = 3;
+resume_3:
+            if (parent->waiting > 0)
+                return;
+            if (parent->etcd_err.err)
+            {
+                result = parent->etcd_err;
+                state = 100;
+                return;
+            }
+        } while (!parent->etcd_result["succeeded"].bool_value());
+
+        // Successfully deleted pool
+        result = (cli_result_t){
+            .err = 0,
+            .text = "Pool "+pool_name+" deleted",
+            .data = new_pools
+        };
+        state = 100;
+    }
+};
+
+std::function<bool(cli_result_t &)> cli_tool_t::start_pool_rm(json11::Json cfg)
+{
+    auto pool_remover = new pool_remover_t();
+    pool_remover->parent = this;
+
+    pool_remover->pool_id = cfg["pool"].uint64_value();
+    pool_remover->pool_name = pool_remover->pool_id ? "" : cfg["pool"].as_string();
+
+    pool_remover->force = !cfg["force"].is_null();
+
+    return [pool_remover](cli_result_t & result)
+    {
+        pool_remover->loop();
+        if (pool_remover->is_done())
+        {
+            result = pool_remover->result;
+            delete pool_remover;
+            return true;
+        }
+        return false;
+    };
+}
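
Review note: both `pool_changer_t` and `pool_remover_t` are written as resumable state machines. `loop()` returns while an asynchronous request is outstanding, and the next call jumps back to the saved `resume_N` label. The sketch below reduces that control flow to its skeleton, assuming a hypothetical `waiting` counter in place of `parent->waiting` and a `steps_done` counter in place of real work.

```cpp
#include <cassert>

// Hypothetical skeleton of the resumable loop() pattern used above.
struct resumable_task_t
{
    int state = 0;
    int waiting = 0;     // stand-in for the count of in-flight async requests
    int steps_done = 0;  // stand-in for actual work done after each request

    bool is_done() { return state == 100; }

    void loop()
    {
        if (state == 1)
            goto resume_1;
        else if (state == 2)
            goto resume_2;
        // Start the first "request" and remember where to resume
        waiting++;
        state = 1;
resume_1:
        if (waiting > 0)
            return;
        steps_done++;
        // Start the second "request"
        waiting++;
        state = 2;
resume_2:
        if (waiting > 0)
            return;
        steps_done++;
        state = 100;
    }
};
```

The caller keeps invoking `loop()` until `is_done()` returns true, which is what the lambdas returned by `start_pool_modify()` and `start_pool_rm()` do.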
--- a/src/cli_rm.cpp
+++ b/src/cli_rm.cpp
@@ -247,6 +247,7 @@ resume_8:
        }
        state = 100;
        result = (cli_result_t){
+            .err = 0,
            .text = "",
            .data = my_result(result.data),
        };
--- a/src/cluster_client.cpp
+++ b/src/cluster_client.cpp
@@ -6,7 +6,7 @@
 #include "cluster_client_impl.h"
 #include "http_client.h" // json_is_true

-cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config)
+cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json config)
 {
    wb = new writeback_cache_t();

@@ -238,7 +238,8 @@ void cluster_client_t::erase_op(cluster_op_t *op)
        // which may continue following SYNCs, but these SYNCs
        // should know about the changed buffer state
        // This is ugly but this is the way we do it
-        std::function<void(cluster_op_t*)>(op->callback)(op);
+        auto cb = std::move(op->callback);
+        cb(op);
    }
    if (!(flags & OP_IMMEDIATE_COMMIT) || enable_writeback)
    {
@@ -248,7 +249,8 @@ void cluster_client_t::erase_op(cluster_op_t *op)
    {
        // Call callback at the end to avoid inconsistencies in prev_wait
        // if the callback adds more operations itself
-        std::function<void(cluster_op_t*)>(op->callback)(op);
+        auto cb = std::move(op->callback);
+        cb(op);
    }
    if (flags & OP_FLUSH_BUFFER)
    {
@@ -548,7 +550,8 @@ void cluster_client_t::execute(cluster_op_t *op)
        op->opcode != OSD_OP_READ_BITMAP && op->opcode != OSD_OP_READ_CHAIN_BITMAP && op->opcode != OSD_OP_WRITE)
    {
        op->retval = -EINVAL;
-        std::function<void(cluster_op_t*)>(op->callback)(op);
+        auto cb = std::move(op->callback);
+        cb(op);
        return;
    }
    if (!pgs_loaded)
@@ -570,7 +573,7 @@ void cluster_client_t::execute_internal(cluster_op_t *op)
        return;
    }
    if (op->opcode == OSD_OP_WRITE && enable_writeback && !(op->flags & OP_FLUSH_BUFFER) &&
-        !op->version /* FIXME no CAS writeback */)
+        !op->version /* no CAS writeback */)
    {
        if (wb->writebacks_active >= client_max_writeback_iodepth)
        {
@@ -586,12 +589,13 @@ void cluster_client_t::execute_internal(cluster_op_t *op)
            wb->start_writebacks(this, 1);
        }
        op->retval = op->len;
-        std::function<void(cluster_op_t*)>(op->callback)(op);
+        auto cb = std::move(op->callback);
+        cb(op);
        return;
    }
    if (op->opcode == OSD_OP_WRITE && !(op->flags & OP_IMMEDIATE_COMMIT))
    {
-        if (!(op->flags & OP_FLUSH_BUFFER))
+        if (!(op->flags & OP_FLUSH_BUFFER) && !op->version /* no CAS write-repeat */)
        {
            wb->copy_write(op, CACHE_WRITTEN);
        }
@@ -655,7 +659,8 @@ bool cluster_client_t::check_rw(cluster_op_t *op)
    if (!pool_id)
    {
        op->retval = -EINVAL;
-        std::function<void(cluster_op_t*)>(op->callback)(op);
+        auto cb = std::move(op->callback);
+        cb(op);
        return false;
    }
    auto pool_it = st_cli.pool_config.find(pool_id);
@@ -663,15 +668,17 @@ bool cluster_client_t::check_rw(cluster_op_t *op)
    {
        // Pools are loaded, but this one is unknown
        op->retval = -EINVAL;
-        std::function<void(cluster_op_t*)>(op->callback)(op);
+        auto cb = std::move(op->callback);
+        cb(op);
        return false;
    }
    // Check alignment
-    if (!op->len && (op->opcode == OSD_OP_READ || op->opcode == OSD_OP_READ_BITMAP || op->opcode == OSD_OP_READ_CHAIN_BITMAP || op->opcode == OSD_OP_WRITE) ||
+    if (!op->len && (op->opcode == OSD_OP_READ_BITMAP || op->opcode == OSD_OP_READ_CHAIN_BITMAP || op->opcode == OSD_OP_WRITE) ||
        op->offset % pool_it->second.bitmap_granularity || op->len % pool_it->second.bitmap_granularity)
    {
        op->retval = -EINVAL;
-        std::function<void(cluster_op_t*)>(op->callback)(op);
+        auto cb = std::move(op->callback);
+        cb(op);
        return false;
    }
    if (pool_it->second.immediate_commit == IMMEDIATE_ALL)
@@ -684,7 +691,8 @@ bool cluster_client_t::check_rw(cluster_op_t *op)
        if (ino_it != st_cli.inode_config.end() && ino_it->second.readonly)
        {
            op->retval = -EROFS;
-            std::function<void(cluster_op_t*)>(op->callback)(op);
+            auto cb = std::move(op->callback);
+            cb(op);
            return false;
        }
    }
@@ -1166,7 +1174,6 @@ static inline void mem_or(void *res, const void *r2, unsigned int len)
 void cluster_client_t::handle_op_part(cluster_op_part_t *part)
 {
    cluster_op_t *op = part->parent;
-    op->inflight_count--;
    int expected = part->op.req.hdr.opcode == OSD_OP_SYNC ? 0 : part->op.req.rw.len;
    if (part->op.reply.hdr.retval != expected)
    {
@@ -1189,7 +1196,7 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
                );
            }
        }
-        else
+        else if (log_level > 0)
        {
            fprintf(
                stderr, "%s operation failed on OSD %ju: retval=%jd (expected %d)\n",
@@ -1205,6 +1212,11 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
            op->retry_after = op->retval == -EIO ? client_eio_retry_interval : client_retry_interval;
        }
        reset_retry_timer(op->retry_after);
+        if (stop_fd >= 0)
+        {
+            msgr.stop_client(stop_fd);
+        }
+        op->inflight_count--;
        if (op->inflight_count == 0)
        {
            if (op->opcode == OSD_OP_SYNC)
@@ -1212,14 +1224,11 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
            else
                continue_rw(op);
        }
-        if (stop_fd >= 0)
-        {
-            msgr.stop_client(stop_fd);
-        }
    }
    else
    {
        // OK
+        op->inflight_count--;
        if ((op->opcode == OSD_OP_WRITE || op->opcode == OSD_OP_DELETE) && !(op->flags & OP_IMMEDIATE_COMMIT))
            dirty_osds.insert(part->osd_num);
        part->flags |= PART_DONE;
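
Review note: several hunks in this file replace a copy-and-call of `op->callback` with `auto cb = std::move(op->callback); cb(op);`. One plausible reading is that this keeps the callable alive in a local variable even if the callback frees the operation, while avoiding a copy of the `std::function`. The sketch below shows that pattern with a hypothetical `op_t` type. It is an illustration, not the `cluster_op_t` definition.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical minimal operation type with a completion callback.
struct op_t
{
    int retval = 0;
    std::function<void(op_t*)> callback;
};

// Move the callback into a local before invoking it, so the callable
// outlives the call even when the callback deletes the operation.
void complete(op_t *op)
{
    auto cb = std::move(op->callback);
    cb(op);
}
```

After the move, `op->callback` is left in a valid but unspecified state, so code following this pattern should not invoke it again without reassigning it.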
--- a/src/cluster_client.h
+++ b/src/cluster_client.h
@@ -123,7 +123,7 @@ public:
    json11::Json::object cli_config, file_config, etcd_global_config;
    json11::Json::object config;

-    cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config);
+    cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json config);
    ~cluster_client_t();
    void execute(cluster_op_t *op);
    void execute_raw(osd_num_t osd_num, osd_op_t *op);
--- a/src/disk_simple_offsets.cpp
+++ b/src/disk_simple_offsets.cpp
@@ -16,11 +16,6 @@
 void disk_tool_simple_offsets(json11::Json cfg, bool json_output)
 {
    std::string device = cfg["device"].string_value();
-    if (device == "")
-    {
-        fprintf(stderr, "Device path is missing\n");
-        exit(1);
-    }
    uint64_t data_block_size = parse_size(cfg["object_size"].string_value());
    uint64_t bitmap_granularity = parse_size(cfg["bitmap_granularity"].string_value());
    uint64_t journal_size = parse_size(cfg["journal_size"].string_value());
@@ -57,6 +52,11 @@ void disk_tool_simple_offsets(json11::Json cfg, bool json_output)
    uint64_t orig_device_size = device_size;
    if (!device_size)
    {
+        if (device == "")
+        {
+            fprintf(stderr, "Device path is missing\n");
+            exit(1);
+        }
        struct stat st;
        if (stat(device.c_str(), &st) < 0)
        {
--- a/src/disk_tool.h
+++ b/src/disk_tool.h
@@ -132,9 +132,6 @@ void disk_tool_simple_offsets(json11::Json cfg, bool json_output);

 uint64_t sscanf_json(const char *fmt, const json11::Json & str);
 void fromhexstr(const std::string & from, int bytes, uint8_t *to);
-std::string realpath_str(std::string path, bool nofail = true);
-std::string read_all_fd(int fd);
-std::string read_file(std::string file, bool allow_enoent = false);
 int disable_cache(std::string dev);
 std::string get_parent_device(std::string dev);
 bool json_is_true(const json11::Json & val);
--- a/src/disk_tool_utils.cpp
+++ b/src/disk_tool_utils.cpp
@@ -42,36 +42,6 @@ void fromhexstr(const std::string & from, int bytes, uint8_t *to)
    }
 }

-std::string realpath_str(std::string path, bool nofail)
-{
-    char *p = realpath((char*)path.c_str(), NULL);
-    if (!p)
-    {
-        fprintf(stderr, "Failed to resolve %s: %s\n", path.c_str(), strerror(errno));
-        return nofail ? path : "";
-    }
-    std::string rp(p);
-    free(p);
-    return rp;
-}
-
-std::string read_file(std::string file, bool allow_enoent)
-{
-    std::string res;
-    int fd = open(file.c_str(), O_RDONLY);
-    if (fd < 0 || (res = read_all_fd(fd)) == "")
-    {
-        int err = errno;
-        if (fd >= 0)
-            close(fd);
-        if (!allow_enoent || err != ENOENT)
-            fprintf(stderr, "Can't read %s: %s\n", file.c_str(), strerror(err));
-        return "";
-    }
-    close(fd);
-    return res;
-}
-
 // returns 1 = check error, 0 = write through, -1 = write back
 // (similar to 1 = warning, -1 = error, 0 = success in disable_cache)
 static int check_queue_cache(std::string dev, std::string parent_dev)
--- a/src/epoll_manager.cpp
+++ b/src/epoll_manager.cpp
@@ -101,7 +101,7 @@ void epoll_manager_t::handle_uring_event()
    my_uring_prep_poll_add(sqe, epoll_fd, POLLIN);
    data->callback = [this](ring_data_t *data)
    {
-        if (data->res < 0)
+        if (data->res < 0 && data->res != -ECANCELED)
        {
            throw std::runtime_error(std::string("epoll failed: ") + strerror(-data->res));
        }
--- a/src/etcd_state_client.cpp
+++ b/src/etcd_state_client.cpp
@@ -573,8 +573,7 @@ void etcd_state_client_t::load_global_config()
        {
            global_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
        }
-        global_immediate_commit = global_config["immediate_commit"].string_value() == "all"
-            ? IMMEDIATE_ALL : (global_config["immediate_commit"].string_value() == "small" ? IMMEDIATE_SMALL : IMMEDIATE_NONE);
+        global_immediate_commit = parse_immediate_commit(global_config["immediate_commit"].string_value());
        on_load_config_hook(global_config);
    });
 }
@@ -782,13 +781,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
            // Failure Domain
            pc.failure_domain = pool_item.second["failure_domain"].string_value();
            // Coding Scheme
-            if (pool_item.second["scheme"] == "replicated")
-                pc.scheme = POOL_SCHEME_REPLICATED;
-            else if (pool_item.second["scheme"] == "xor")
-                pc.scheme = POOL_SCHEME_XOR;
-            else if (pool_item.second["scheme"] == "ec" || pool_item.second["scheme"] == "jerasure")
-                pc.scheme = POOL_SCHEME_EC;
-            else
+            pc.scheme = parse_scheme(pool_item.second["scheme"].string_value());
+            if (!pc.scheme)
            {
                fprintf(stderr, "Pool %u has invalid coding scheme (one of \"xor\", \"replicated\", \"ec\" or \"jerasure\" required), skipping pool\n", pool_id);
                continue;
@@ -869,11 +863,11 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
            pc.scrub_interval = parse_time(pool_item.second["scrub_interval"].string_value());
            if (!pc.scrub_interval)
                pc.scrub_interval = 0;
+            // Mark pool as VitastorFS pool (disable per-inode stats and block volume creation)
+            pc.used_for_fs = pool_item.second["used_for_fs"].as_string();
            // Immediate Commit Mode
            pc.immediate_commit = pool_item.second["immediate_commit"].is_string()
-                ? (pool_item.second["immediate_commit"].string_value() == "all"
-                    ? IMMEDIATE_ALL : (pool_item.second["immediate_commit"].string_value() == "small"
-                        ? IMMEDIATE_SMALL : IMMEDIATE_NONE))
+                ? parse_immediate_commit(pool_item.second["immediate_commit"].string_value())
                : global_immediate_commit;
            // PG Stripe Size
            pc.pg_stripe_size = pool_item.second["pg_stripe_size"].uint64_value();
@@ -1167,6 +1161,23 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
    }
 }

+uint32_t etcd_state_client_t::parse_immediate_commit(const std::string & immediate_commit_str)
+{
+    return immediate_commit_str == "all" ? IMMEDIATE_ALL :
+        (immediate_commit_str == "small" ? IMMEDIATE_SMALL : IMMEDIATE_NONE);
+}
+
+uint32_t etcd_state_client_t::parse_scheme(const std::string & scheme)
+{
+    if (scheme == "replicated")
+        return POOL_SCHEME_REPLICATED;
+    else if (scheme == "xor")
+        return POOL_SCHEME_XOR;
+    else if (scheme == "ec" || scheme == "jerasure")
+        return POOL_SCHEME_EC;
+    return 0;
+}
+
 void etcd_state_client_t::insert_inode_config(const inode_config_t & cfg)
 {
    this->inode_config[cfg.num] = cfg;
--- a/src/etcd_state_client.h
+++ b/src/etcd_state_client.h
@@ -60,6 +60,7 @@ struct pool_config_t
    uint64_t pg_stripe_size;
    std::map<pg_num_t, pg_config_t> pg_config;
    uint64_t scrub_interval;
+    std::string used_for_fs;
 };

 struct inode_config_t
@@ -151,4 +152,7 @@ public:
    void close_watch(inode_watch_t* watch);
    int address_count();
    ~etcd_state_client_t();
+
+    static uint32_t parse_immediate_commit(const std::string & immediate_commit_str);
+    static uint32_t parse_scheme(const std::string & scheme_str);
 };
--- a/src/kv_cli.cpp
+++ b/src/kv_cli.cpp
@@ -0,0 +1,673 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// Vitastor shared key/value database test CLI
+
+#define _XOPEN_SOURCE
+#include <limits.h>
+
+#include <netinet/tcp.h>
+#include <sys/epoll.h>
+#include <unistd.h>
+#include <fcntl.h>
+//#include <signal.h>
+
+#include "epoll_manager.h"
+#include "str_util.h"
+#include "kv_db.h"
+
+const char *exe_name = NULL;
+
+class kv_cli_t
+{
+public:
+    json11::Json::object cfg;
+    std::vector<std::string> cli_cmd;
+
+    kv_dbw_t *db = NULL;
+    ring_loop_t *ringloop = NULL;
+    epoll_manager_t *epmgr = NULL;
+    cluster_client_t *cli = NULL;
+    int load_parallelism = 16;
+    bool opened = false;
+    bool interactive = false, is_file = false;
+    int in_progress = 0;
+    char *cur_cmd = NULL;
+    int cur_cmd_size = 0, cur_cmd_alloc = 0;
+    bool finished = false, eof = false;
+
+    std::function<void(int)> load_cb;
+    bool loading_json = false, in_loadjson = false;
+    int load_state = 0;
+    std::string load_key;
+
+    ~kv_cli_t();
+
+    void parse_args(int narg, const char *args[]);
+    void run();
+    void read_cmd();
+    void next_cmd();
+    std::vector<std::string> parse_cmd(const std::string & cmdstr);
+    void handle_cmd(const std::vector<std::string> & cmd, std::function<void(int)> cb);
+    void loadjson();
+};
+
+kv_cli_t::~kv_cli_t()
+{
+    if (cur_cmd)
+    {
+        free(cur_cmd);
+        cur_cmd = NULL;
+    }
+    cur_cmd_alloc = 0;
+    if (db)
+        delete db;
+    if (cli)
+    {
+        cli->flush();
+        delete cli;
+    }
+    if (epmgr)
+        delete epmgr;
+    if (ringloop)
+        delete ringloop;
+}
+
+void kv_cli_t::parse_args(int narg, const char *args[])
+{
+    bool db = false;
+    for (int i = 1; i < narg; i++)
+    {
+        if (!strcmp(args[i], "-h") || !strcmp(args[i], "--help"))
+        {
+            printf(
+                "Vitastor Key/Value CLI\n"
+                "(c) Vitaliy Filippov, 2023+ (VNPL-1.1)\n"
+                "\n"
+                "USAGE: %s [OPTIONS] [<IMAGE> [<COMMAND>]]\n"
+                "\n"
+                "COMMANDS:\n"
+                "  get <key>\n"
+                "  set <key> <value>\n"
+                "  del <key>\n"
+                "  list [<start> [<end>]]\n"
+                "  dump [<start> [<end>]]\n"
+                "  dumpjson [<start> [<end>]]\n"
+                "  loadjson\n"
+                "\n"
+                "<IMAGE> should be the name of Vitastor image with the DB.\n"
+                "Without <COMMAND>, you get an interactive DB shell.\n"
+                "\n"
+                "OPTIONS:\n"
+                "  --kv_block_size 4k\n"
+                "    Key-value B-Tree block size\n"
+                "  --kv_memory_limit 128M\n"
+                "    Maximum memory to use for vitastor-kv index cache\n"
+                "  --kv_allocate_blocks 4\n"
+                "    Number of PG blocks used for new tree block allocation in parallel\n"
+                "  --kv_evict_max_misses 10\n"
+                "    Eviction algorithm parameter: retry eviction from another random spot\n"
+                "    if this number of keys is used currently or was used recently\n"
+                "  --kv_evict_attempts_per_level 3\n"
+                "    Retry eviction at most this number of times per tree level, starting\n"
+                "    with bottom-most levels\n"
+                "  --kv_evict_unused_age 1000\n"
+                "    Evict only keys unused during this number of last operations\n"
+                "  --kv_log_level 1\n"
+                "    Log level. 0 = errors, 1 = warnings, 10 = trace operations\n"
+                ,
+                exe_name
+            );
+            exit(0);
+        }
+        else if (args[i][0] == '-' && args[i][1] == '-')
+        {
+            const char *opt = args[i]+2;
+            cfg[opt] = !strcmp(opt, "json") || i == narg-1 ? "1" : args[++i];
+        }
+        else if (!db)
+        {
+            cfg["db"] = args[i];
+            db = true;
+        }
+        else
+        {
+            cli_cmd.push_back(args[i]);
+        }
+    }
+}
+
+void kv_cli_t::run()
+{
+    // Create client
+    ringloop = new ring_loop_t(512);
+    epmgr = new epoll_manager_t(ringloop);
+    cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
+    db = new kv_dbw_t(cli);
+    // Load image metadata
+    while (!cli->is_ready())
+    {
+        ringloop->loop();
+        if (cli->is_ready())
+            break;
+        ringloop->wait();
+    }
+    // Open if DB is set in options
+    if (cfg.find("db") != cfg.end())
+    {
+        bool done = false;
+        handle_cmd({ "open", cfg.at("db").string_value() }, [&done](int res) { if (res != 0) exit(1); done = true; });
+        while (!done)
+        {
+            ringloop->loop();
+            if (done)
+                break;
+            ringloop->wait();
+        }
+    }
+    // Run single command from CLI
+    if (cli_cmd.size())
+    {
+        bool done = false;
+        handle_cmd(cli_cmd, [&done](int res) { if (res != 0) exit(1); done = true; });
+        while (!done)
+        {
+            ringloop->loop();
+            if (done)
+                break;
+            ringloop->wait();
+        }
+    }
+    else
+    {
+        // Run interactive shell
+        fcntl(0, F_SETFL, fcntl(0, F_GETFL, 0) | O_NONBLOCK);
+        try
+        {
+            epmgr->tfd->set_fd_handler(0, false, [this](int fd, int events)
+            {
+                if (events & EPOLLIN)
+                {
+                    read_cmd();
+                }
+                if (events & EPOLLRDHUP)
+                {
+                    epmgr->tfd->set_fd_handler(0, false, NULL);
+                    finished = true;
+                }
+            });
+            interactive = isatty(0);
+            if (interactive)
+                printf("> ");
+        }
+        catch (std::exception & e)
+        {
+            // Can't add to epoll, STDIN is probably a file
+            is_file = true;
+            read_cmd();
+        }
+        while (!finished)
+        {
+            ringloop->loop();
+            if (!finished)
+                ringloop->wait();
+        }
+    }
+    // Destroy the client
+    delete db;
+    db = NULL;
+    cli->flush();
+    delete cli;
+    delete epmgr;
+    delete ringloop;
+    cli = NULL;
+    epmgr = NULL;
+    ringloop = NULL;
+}
+
+void kv_cli_t::read_cmd()
+{
+    if (!cur_cmd_alloc)
+    {
+        cur_cmd_alloc = 65536;
+        cur_cmd = (char*)malloc_or_die(cur_cmd_alloc);
+    }
+    while (cur_cmd_size < cur_cmd_alloc)
+    {
+        int r = read(0, cur_cmd+cur_cmd_size, cur_cmd_alloc-cur_cmd_size);
+        if (r < 0 && errno != EAGAIN)
+            fprintf(stderr, "Error reading from stdin: %s\n", strerror(errno));
+        if (r > 0)
+            cur_cmd_size += r;
+        if (r == 0)
+            eof = true;
+        if (r <= 0)
+            break;
+    }
+    next_cmd();
+}
+
+void kv_cli_t::next_cmd()
+{
+    if (loading_json)
+    {
+        loadjson();
+        return;
+    }
+    if (in_progress > 0)
+    {
+        return;
+    }
+    int pos = 0;
+    for (; pos < cur_cmd_size; pos++)
+    {
+        if (cur_cmd[pos] == '\n' || cur_cmd[pos] == '\r')
+        {
+            auto cmd = trim(std::string(cur_cmd, pos));
+            pos++;
+            memmove(cur_cmd, cur_cmd+pos, cur_cmd_size-pos);
+            cur_cmd_size -= pos;
+            in_progress++;
+            handle_cmd(parse_cmd(cmd), [this](int res)
+            {
+                in_progress--;
+                if (interactive)
+                    printf("> ");
+                next_cmd();
+                if (!in_progress)
+                    read_cmd();
+            });
+            break;
+        }
+    }
+    if (eof && !in_progress)
+    {
+        finished = true;
+    }
+}
+
+struct kv_cli_list_t
+{
+    kv_dbw_t *db = NULL;
+    void *handle = NULL;
+    int format = 0;
+    int n = 0;
+    std::function<void(int)> cb;
+};
+
+std::vector<std::string> kv_cli_t::parse_cmd(const std::string & str)
+{
+    std::vector<std::string> res;
+    size_t pos = 0;
+    auto cmd = scan_escaped(str, pos);
+    if (cmd.empty())
+        return res;
+    res.push_back(cmd);
+    int max_args = (cmd == "set" || cmd == "config" ||
+        cmd == "list" || cmd == "dump" || cmd == "dumpjson" ? 3 :
+        (cmd == "open" || cmd == "get" || cmd == "del" ? 2 : 1));
+    while (pos < str.size() && res.size() < max_args)
+    {
+        if (res.size() == max_args-1)
+        {
+            // Allow unquoted last argument
+            pos = str.find_first_not_of(" \t\r\n", pos);
+            if (pos == std::string::npos)
+                break;
+            if (str[pos] != '"' && str[pos] != '\'')
+            {
+                res.push_back(trim(str.substr(pos)));
+                break;
+            }
+        }
+        auto arg = scan_escaped(str, pos);
+        if (arg.size())
+            res.push_back(arg);
+    }
+    return res;
+}
+
+void kv_cli_t::loadjson()
+{
+    // Simple streaming JSON parser for a flat {"key": "value", ...} object with string values
+    if (in_progress >= load_parallelism || in_loadjson)
+    {
+        return;
+    }
+    in_loadjson = true;
+    if (load_state == 5)
+    {
+st_5:
+        if (!in_progress)
+        {
+            loading_json = false;
+            auto cb = std::move(load_cb);
+            cb(0);
+        }
+        in_loadjson = false;
+        return;
+    }
+    do
+    {
+        read_cmd();
+        size_t pos = 0;
+        while (true)
+        {
+            while (pos < cur_cmd_size && is_white(cur_cmd[pos]))
+            {
+                pos++;
+            }
+            if (pos >= cur_cmd_size)
+            {
+                break;
+            }
+            if (load_state == 0 || load_state == 2)
+            {
+                char expected = "{ :"[load_state];
+                if (cur_cmd[pos] != expected)
+                {
+                    fprintf(stderr, "Unexpected %c, expected %c\n", cur_cmd[pos], expected);
+                    exit(1);
+                }
+                pos++;
+                load_state++;
+            }
+            else if (load_state == 1 || load_state == 3)
+            {
+                if (cur_cmd[pos] != '"')
+                {
+                    fprintf(stderr, "Unexpected %c, expected \"\n", cur_cmd[pos]);
+                    exit(1);
+                }
+                size_t prev = pos;
+                auto str = scan_escaped(cur_cmd, cur_cmd_size, pos, false);
+                if (pos == prev)
+                {
+                    break;
+                }
+                load_state++;
+                if (load_state == 2)
+                {
+                    load_key = str;
+                }
+                else
+                {
+                    in_progress++;
+                    handle_cmd({ "set", load_key, str }, [this](int res)
+                    {
+                        in_progress--;
+                        next_cmd();
+                    });
+                    if (in_progress >= load_parallelism)
+                    {
+                        break;
+                    }
+                }
+            }
+            else if (load_state == 4)
+            {
+                if (cur_cmd[pos] == ',')
+                {
+                    pos++;
+                    load_state = 1;
+                }
+                else if (cur_cmd[pos] == '}')
+                {
+                    pos++;
+                    load_state = 5;
+                    goto st_5;
+                }
+                else
+                {
+                    fprintf(stderr, "Unexpected %c, expected , or }\n", cur_cmd[pos]);
+                    exit(1);
+                }
+            }
+        }
+        if (pos < cur_cmd_size)
+        {
+            memmove(cur_cmd, cur_cmd+pos, cur_cmd_size-pos);
+        }
+        cur_cmd_size -= pos;
+    } while (loading_json && is_file);
+    in_loadjson = false;
+}
+
+void kv_cli_t::handle_cmd(const std::vector<std::string> & cmd, std::function<void(int)> cb)
+{
+    if (!cmd.size())
+    {
+        cb(-EINVAL);
+        return;
+    }
+    auto & opname = cmd[0];
+    if (!opened && opname != "open" && opname != "config" && opname != "quit" && opname != "q")
+    {
+        fprintf(stderr, "Error: database not opened\n");
+        cb(-EINVAL);
+        return;
+    }
+    if (opname == "open")
+    {
+        auto name = cmd.size() > 1 ? cmd[1] : "";
+        uint64_t pool_id = 0;
+        inode_t inode_id = 0;
+        int scanned = sscanf(name.c_str(), "%lu %lu", &pool_id, &inode_id);
+        if (scanned < 2 || !pool_id || !inode_id)
+        {
+            inode_id = 0;
+            name = trim(name);
+            for (auto & ic: cli->st_cli.inode_config)
+            {
+                if (ic.second.name == name)
+                {
+                    inode_id = ic.first;
+                    break;
+                }
+            }
+            if (!inode_id)
+            {
+                fprintf(stderr, "Usage: open <image> OR open <pool_id> <inode_id>\n");
+                cb(-EINVAL);
+                return;
+            }
+        }
+        else
+            inode_id = INODE_WITH_POOL(pool_id, inode_id);
+        db->open(inode_id, cfg, [=](int res)
+        {
+            if (res < 0)
+            {
+                fprintf(stderr, "Error opening index: %s (code %d)\n", strerror(-res), res);
+            }
+            else
+            {
+                opened = true;
+                fprintf(interactive ? stdout : stderr, "Index opened. Current size: %lu bytes\n", db->get_size());
+            }
+            cb(res);
+        });
+    }
+    else if (opname == "config")
+    {
+        if (cmd.size() < 3)
+        {
+            fprintf(stderr, "Usage: config <property> <value>\n");
+            cb(-EINVAL);
+            return;
+        }
+        auto & key = cmd[1];
+        auto & value = cmd[2];
+        if (key != "kv_memory_limit" &&
+            key != "kv_allocate_blocks" &&
+            key != "kv_evict_max_misses" &&
+            key != "kv_evict_attempts_per_level" &&
+            key != "kv_evict_unused_age" &&
+            key != "kv_log_level" &&
+            key != "kv_block_size")
+        {
+            fprintf(
+                stderr, "Allowed properties: kv_block_size, kv_memory_limit, kv_allocate_blocks,"
+                " kv_evict_max_misses, kv_evict_attempts_per_level, kv_evict_unused_age, kv_log_level\n"
+            );
+            cb(-EINVAL);
+        }
+        else if (key == "kv_block_size")
+        {
+            if (opened)
+            {
+                fprintf(stderr, "kv_block_size can't be set after opening DB\n");
+                cb(-EINVAL);
+            }
+            else
+            {
+                cfg[key] = value;
+                cb(0);
+            }
+        }
+        else
+        {
+            cfg[key] = value;
+            db->set_config(cfg);
+            cb(0);
+        }
+    }
+    else if (opname == "get" || opname == "set" || opname == "del")
+    {
+        if (opname == "get" || opname == "del")
+        {
+            if (cmd.size() < 2)
+            {
+                fprintf(stderr, "Usage: %s <key>\n", opname.c_str());
+                cb(-EINVAL);
+                return;
+            }
+            auto & key = cmd[1];
+            if (opname == "get")
+            {
+                db->get(key, [this, cb](int res, const std::string & value)
+                {
+                    if (res < 0)
+                        fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
+                    else
+                    {
+                        if (write(1, value.c_str(), value.size()) < 0 || write(1, "\n", 1) < 0)
+                            exit(1);
+                    }
+                    cb(res);
+                });
+            }
+            else
+            {
+                db->del(key, [this, cb](int res)
+                {
+                    if (res < 0)
+                        fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
+                    else
+                        fprintf(interactive ? stdout : stderr, "OK\n");
+                    cb(res);
+                });
+            }
+        }
+        else
+        {
+            if (cmd.size() < 3)
+            {
+                fprintf(stderr, "Usage: set <key> <value>\n");
+                cb(-EINVAL);
+                return;
+            }
+            auto & key = cmd[1];
+            auto & value = cmd[2];
+            db->set(key, value, [this, cb, l = loading_json](int res)
+            {
+                if (res < 0)
+                    fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
+                else if (!l)
+                    fprintf(interactive ? stdout : stderr, "OK\n");
+                cb(res);
+            });
+        }
+    }
+    else if (opname == "list" || opname == "dump" || opname == "dumpjson")
+    {
+        kv_cli_list_t *lst = new kv_cli_list_t;
+        std::string start = cmd.size() >= 2 ? cmd[1] : "";
+        std::string end = cmd.size() >= 3 ? cmd[2] : "";
+        lst->handle = db->list_start(start);
+        lst->db = db;
+        lst->format = opname == "dump" ? 1 : (opname == "dumpjson" ? 2 : 0);
+        lst->cb = std::move(cb);
+        db->list_next(lst->handle, [lst](int res, const std::string & key, const std::string & value)
+        {
+            if (res < 0)
+            {
+                if (res != -ENOENT)
+                {
+                    fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
+                }
+                if (lst->format == 2)
+                    printf("\n}\n");
+                lst->db->list_close(lst->handle);
+                lst->cb(res == -ENOENT ? 0 : res);
+                delete lst;
+            }
+            else
+            {
+                if (lst->format == 2)
+                    printf(lst->n ? ",\n  %s: %s" : "{\n  %s: %s", addslashes(key).c_str(), addslashes(value).c_str());
+                else if (lst->format == 1)
+                    printf("set %s %s\n", auto_addslashes(key).c_str(), value.c_str());
+                else
+                    printf("%s = %s\n", key.c_str(), value.c_str());
+                lst->n++;
+                lst->db->list_next(lst->handle, NULL);
+            }
+        });
+    }
+    else if (opname == "loadjson")
+    {
+        loading_json = true;
+        load_state = 0;
+        load_cb = cb;
+        loadjson();
+    }
+    else if (opname == "close")
+    {
+        db->close([=]()
+        {
+            fprintf(interactive ? stdout : stderr, "Index closed\n");
+            opened = false;
+            cb(0);
+        });
+    }
+    else if (opname == "quit" || opname == "q")
+    {
+        ::close(0);
+        finished = true;
+    }
+    else
+    {
+        fprintf(
+            stderr, "Unknown operation: %s. Supported operations:\n"
+            "open <image>\nopen <pool_id> <inode_id>\n"
+            "config <property> <value>\n"
+            "get <key>\nset <key> <value>\ndel <key>\n"
+            "list [<start> [<end>]]\ndump [<start> [<end>]]\ndumpjson [<start> [<end>]]\nloadjson\n"
+            "close\nquit\n", opname.c_str()
+        );
+        cb(-EINVAL);
+    }
+}
+
+int main(int narg, const char *args[])
+{
+    setvbuf(stdout, NULL, _IONBF, 0);
+    setvbuf(stderr, NULL, _IONBF, 0);
+    exe_name = args[0];
+    kv_cli_t *p = new kv_cli_t();
+    p->parse_args(narg, args);
+    p->run();
+    delete p;
+    return 0;
+}
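The argument splitting in `kv_cli_t::parse_cmd` above (a per-command argument limit, with the last argument allowed to run unquoted to the end of the line) can be illustrated with a minimal standalone sketch. `split_cmd` is a hypothetical helper, not part of Vitastor. It assumes plain whitespace splitting and ignores the quoting and escaping that `scan_escaped` handles in the real code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper for illustration only (not part of Vitastor).
// Splits a line into at most max_args whitespace-separated tokens;
// the last allowed token takes the rest of the line, so it may contain spaces.
static std::vector<std::string> split_cmd(const std::string & line, size_t max_args)
{
    assert(max_args > 0);
    static const char *ws = " \t\r\n";
    std::vector<std::string> res;
    size_t pos = 0;
    while (res.size() < max_args)
    {
        // Skip whitespace before the next token
        pos = line.find_first_not_of(ws, pos);
        if (pos == std::string::npos)
            break;
        if (res.size() == max_args-1)
        {
            // Last allowed argument: take everything up to the last non-blank character
            size_t last = line.find_last_not_of(ws);
            res.push_back(line.substr(pos, last-pos+1));
            break;
        }
        // Ordinary argument: ends at the next whitespace character
        size_t end = line.find_first_of(ws, pos);
        if (end == std::string::npos)
            end = line.size();
        res.push_back(line.substr(pos, end-pos));
        pos = end;
    }
    return res;
}
```

With a limit of 3, as `parse_cmd` uses for `set`, the call `split_cmd("set key hello world", 3)` yields the tokens `set`, `key` and `hello world`, so a value containing spaces does not need quoting.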
--- a/src/kv_db.cpp
+++ b/src/kv_db.cpp
--- a/src/kv_db.h
+++ b/src/kv_db.h
@@ -0,0 +1,36 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// Vitastor shared key/value database
+// Parallel optimistic B-Tree O:-)
+
+#pragma once
+
+#include "cluster_client.h"
+
+struct kv_db_t;
+
+struct kv_dbw_t
+{
+    kv_dbw_t(cluster_client_t *cli);
+    ~kv_dbw_t();
+
+    void open(inode_t inode_id, json11::Json cfg, std::function<void(int)> cb);
+    void set_config(json11::Json cfg);
+    void close(std::function<void()> cb);
+
+    uint64_t get_size();
+
+    void get(const std::string & key, std::function<void(int res, const std::string & value)> cb,
+        bool allow_old_cached = false);
+    void set(const std::string & key, const std::string & value, std::function<void(int res)> cb,
+        std::function<bool(int res, const std::string & value)> cas_compare = NULL);
+    void del(const std::string & key, std::function<void(int res)> cb,
+        std::function<bool(int res, const std::string & value)> cas_compare = NULL);
+
+    void* list_start(const std::string & start);
+    void list_next(void *handle, std::function<void(int res, const std::string & key, const std::string & value)> cb);
+    void list_close(void *handle);
+
+    kv_db_t *db;
+};
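The optional `cas_compare` argument of `set()` and `del()` declared above suggests a compare-and-set style of update. The sketch below shows how such a callback might be used to create a key only when it is absent. `mock_kv_t` and `create_exclusive` are hypothetical names: the mock is a synchronous in-memory stand-in, not `kv_dbw_t`. It assumes that `cas_compare` receives the result and value of a lookup of the current key, and that returning `false` aborts the write with `-EAGAIN`. This diff does not show the real behavior, so check it against `kv_db.cpp`.

```cpp
#include <cerrno>
#include <functional>
#include <map>
#include <string>

// Hypothetical in-memory stand-in for illustration only; it is NOT kv_dbw_t
// and completes every operation synchronously
struct mock_kv_t
{
    std::map<std::string, std::string> data;

    void set(const std::string & key, const std::string & value, std::function<void(int res)> cb,
        std::function<bool(int res, const std::string & value)> cas_compare = nullptr)
    {
        auto it = data.find(key);
        if (cas_compare)
        {
            int cur_res = it == data.end() ? -ENOENT : 0;
            std::string cur_value = it == data.end() ? std::string() : it->second;
            // Assumption: a false result from cas_compare aborts the write with -EAGAIN
            if (!cas_compare(cur_res, cur_value))
            {
                cb(-EAGAIN);
                return;
            }
        }
        data[key] = value;
        cb(0);
    }
};

// Create a key only if it does not exist yet; returns the result code of set()
static int create_exclusive(mock_kv_t & db, const std::string & key, const std::string & value)
{
    int result = 1;
    db.set(key, value, [&result](int res) { result = res; },
        [](int res, const std::string &) { return res == -ENOENT; });
    return result;
}
```

In this mock, a second `create_exclusive()` call for the same key leaves the stored value untouched and reports `-EAGAIN`. With the real asynchronous API the result would only be available inside the completion callback.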
--- a/src/kv_stress.cpp
+++ b/src/kv_stress.cpp
@@ -0,0 +1,701 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// Vitastor shared key/value database stress tester / benchmark
+
+#define _XOPEN_SOURCE
+#include <limits.h>
+
+#include <netinet/tcp.h>
+#include <sys/epoll.h>
+#include <unistd.h>
+#include <fcntl.h>
+//#include <signal.h>
+
+#include "epoll_manager.h"
+#include "str_util.h"
+#include "kv_db.h"
+
+const char *exe_name = NULL;
+
+struct kv_test_listing_t
+{
+    uint64_t count = 0, done = 0;
+    void *handle = NULL;
+    std::string next_after;
+    std::set<std::string> inflights;
+    timespec tv_begin;
+    bool error = false;
+};
+
+struct kv_test_lat_t
+{
+    const char *name = NULL;
+    uint64_t usec = 0, count = 0;
+};
+
+struct kv_test_stat_t
+{
+    kv_test_lat_t get, add, update, del, list;
+    uint64_t list_keys = 0;
+};
+
+class kv_test_t
+{
+public:
+    // Config
+    json11::Json::object kv_cfg;
+    std::string key_prefix, key_suffix;
+    uint64_t inode_id = 0;
+    uint64_t op_count = 1000000;
+    uint64_t runtime_sec = 0;
+    uint64_t parallelism = 4;
+    uint64_t reopen_prob = 1;
+    uint64_t get_prob = 30000;
+    uint64_t add_prob = 20000;
+    uint64_t update_prob = 20000;
+    uint64_t del_prob = 5000;
+    uint64_t list_prob = 300;
+    uint64_t min_key_len = 10;
+    uint64_t max_key_len = 70;
+    uint64_t min_value_len = 50;
+    uint64_t max_value_len = 300;
+    uint64_t min_list_count = 10;
+    uint64_t max_list_count = 1000;
+    uint64_t print_stats_interval = 1;
+    bool json_output = false;
+    uint64_t log_level = 1;
+    bool trace = false;
+    bool stop_on_error = false;
+    // FIXME: Multiple clients
+    kv_test_stat_t stat, prev_stat;
+    timespec prev_stat_time, start_stat_time;
+
+    // State
+    kv_dbw_t *db = NULL;
+    ring_loop_t *ringloop = NULL;
+    epoll_manager_t *epmgr = NULL;
+    cluster_client_t *cli = NULL;
+    ring_consumer_t consumer;
+    bool finished = false;
+    uint64_t total_prob = 0;
+    uint64_t ops_sent = 0, ops_done = 0;
+    int stat_timer_id = -1;
+    int in_progress = 0;
+    bool reopening = false;
+    std::set<kv_test_listing_t*> listings;
+    std::set<std::string> changing_keys;
+    std::map<std::string, std::string> values;
+
+    ~kv_test_t();
+
+    static json11::Json::object parse_args(int narg, const char *args[]);
+    void parse_config(json11::Json cfg);
+    void run(json11::Json cfg);
+    void loop();
+    void print_stats(kv_test_stat_t & prev_stat, timespec & prev_stat_time);
+    void print_total_stats();
+    void start_change(const std::string & key);
+    void stop_change(const std::string & key);
+    void add_stat(kv_test_lat_t & stat, timespec tv_begin);
+};
+
+kv_test_t::~kv_test_t()
+{
+    if (db)
+        delete db;
+    if (cli)
+    {
+        cli->flush();
+        delete cli;
+    }
+    if (epmgr)
+        delete epmgr;
+    if (ringloop)
+        delete ringloop;
+}
+
+json11::Json::object kv_test_t::parse_args(int narg, const char *args[])
+{
+    json11::Json::object cfg;
+    for (int i = 1; i < narg; i++)
+    {
+        if (!strcmp(args[i], "-h") || !strcmp(args[i], "--help"))
+        {
+            printf(
+                "Vitastor Key/Value DB stress tester / benchmark\n"
+                "(c) Vitaliy Filippov, 2023+ (VNPL-1.1)\n"
+                "\n"
+                "USAGE: %s --pool_id POOL_ID --inode_id INODE_ID [OPTIONS]\n"
+                "  --op_count 1000000\n"
+                "    Total operations to run during test. 0 means unlimited\n"
+                "  --key_prefix \"\"\n"
+                "    Prefix for all keys read or written (to avoid collisions)\n"
+                "  --key_suffix \"\"\n"
+                "    Suffix for all keys read or written (to avoid collisions, but scan all DB)\n"
+                "  --runtime 0\n"
+                "    Run for this number of seconds. 0 means unlimited\n"
+                "  --parallelism 4\n"
+                "    Run this number of operations in parallel\n"
+                "  --get_prob 30000\n"
+                "    Fraction of key retrieve operations\n"
+                "  --add_prob 20000\n"
+                "    Fraction of key addition operations\n"
+                "  --update_prob 20000\n"
+                "    Fraction of key update operations\n"
+                "  --del_prob 5000\n"
+                "    Fraction of key delete operations\n"
+                "  --list_prob 300\n"
+                "    Fraction of listing operations\n"
+                "  --reopen_prob 1\n"
+                "    Fraction of database reopens\n"
+                "  --min_key_len 10\n"
+                "    Minimum key size in bytes\n"
+                "  --max_key_len 70\n"
+                "    Maximum key size in bytes\n"
+                "  --min_value_len 50\n"
+                "    Minimum value size in bytes\n"
+                "  --max_value_len 300\n"
+                "    Maximum value size in bytes\n"
+                "  --min_list_count 10\n"
+                "    Minimum number of keys read in listing (0 = all keys)\n"
+                "  --max_list_count 1000\n"
+                "    Maximum number of keys read in listing\n"
+                "  --print_stats 1\n"
+                "    Print operation statistics at this interval in seconds (0 = disable)\n"
+                "  --json\n"
+                "    JSON output\n"
+                "  --stop_on_error 0\n"
+                "    Stop on first execution error, mismatch, lost key or extra key during listing\n"
+                "  --kv_block_size 4k\n"
+                "    Key-value B-Tree block size\n"
+                "  --kv_memory_limit 128M\n"
+                "    Maximum memory to use for vitastor-kv index cache\n"
+                "  --kv_allocate_blocks 4\n"
+                "    Number of PG blocks used for new tree block allocation in parallel\n"
+                "  --kv_evict_max_misses 10\n"
+                "    Eviction algorithm parameter: retry eviction from another random spot\n"
+                "    if this number of keys is used currently or was used recently\n"
+                "  --kv_evict_attempts_per_level 3\n"
+                "    Retry eviction at most this number of times per tree level, starting\n"
+                "    with bottom-most levels\n"
+                "  --kv_evict_unused_age 1000\n"
+                "    Evict only keys that were not used during this number of most recent operations\n"
+                "  --kv_log_level 1\n"
+                "    Log level. 0 = errors, 1 = warnings, 10 = trace operations\n",
+                exe_name
+            );
+            exit(0);
+        }
+        else if (args[i][0] == '-' && args[i][1] == '-')
+        {
+            const char *opt = args[i]+2;
+            cfg[opt] = !strcmp(opt, "json") || i == narg-1 ? "1" : args[++i];
+        }
+    }
+    return cfg;
+}
+
+void kv_test_t::parse_config(json11::Json cfg)
+{
+    inode_id = INODE_WITH_POOL(cfg["pool_id"].uint64_value(), cfg["inode_id"].uint64_value());
+    if (cfg["op_count"].uint64_value() > 0)
+        op_count = cfg["op_count"].uint64_value();
+    key_prefix = cfg["key_prefix"].string_value();
+    key_suffix = cfg["key_suffix"].string_value();
+    if (cfg["runtime"].uint64_value() > 0)
+        runtime_sec = cfg["runtime"].uint64_value();
+    if (cfg["parallelism"].uint64_value() > 0)
+        parallelism = cfg["parallelism"].uint64_value();
+    if (!cfg["reopen_prob"].is_null())
+        reopen_prob = cfg["reopen_prob"].uint64_value();
+    if (!cfg["get_prob"].is_null())
+        get_prob = cfg["get_prob"].uint64_value();
+    if (!cfg["add_prob"].is_null())
+        add_prob = cfg["add_prob"].uint64_value();
+    if (!cfg["update_prob"].is_null())
+        update_prob = cfg["update_prob"].uint64_value();
+    if (!cfg["del_prob"].is_null())
+        del_prob = cfg["del_prob"].uint64_value();
+    if (!cfg["list_prob"].is_null())
+        list_prob = cfg["list_prob"].uint64_value();
+    if (!cfg["min_key_len"].is_null())
+        min_key_len = cfg["min_key_len"].uint64_value();
+    if (cfg["max_key_len"].uint64_value() > 0)
+        max_key_len = cfg["max_key_len"].uint64_value();
+    if (!cfg["min_value_len"].is_null())
+        min_value_len = cfg["min_value_len"].uint64_value();
+    if (cfg["max_value_len"].uint64_value() > 0)
+        max_value_len = cfg["max_value_len"].uint64_value();
+    if (!cfg["min_list_count"].is_null())
+        min_list_count = cfg["min_list_count"].uint64_value();
+    if (!cfg["max_list_count"].is_null())
+        max_list_count = cfg["max_list_count"].uint64_value();
+    if (!cfg["print_stats"].is_null())
+        print_stats_interval = cfg["print_stats"].uint64_value();
+    if (!cfg["json"].is_null())
+        json_output = true;
+    if (!cfg["stop_on_error"].is_null())
+        stop_on_error = cfg["stop_on_error"].bool_value();
+    if (!cfg["kv_block_size"].is_null())
+        kv_cfg["kv_block_size"] = cfg["kv_block_size"];
+    if (!cfg["kv_memory_limit"].is_null())
+        kv_cfg["kv_memory_limit"] = cfg["kv_memory_limit"];
+    if (!cfg["kv_allocate_blocks"].is_null())
+        kv_cfg["kv_allocate_blocks"] = cfg["kv_allocate_blocks"];
+    if (!cfg["kv_evict_max_misses"].is_null())
+        kv_cfg["kv_evict_max_misses"] = cfg["kv_evict_max_misses"];
+    if (!cfg["kv_evict_attempts_per_level"].is_null())
+        kv_cfg["kv_evict_attempts_per_level"] = cfg["kv_evict_attempts_per_level"];
+    if (!cfg["kv_evict_unused_age"].is_null())
+        kv_cfg["kv_evict_unused_age"] = cfg["kv_evict_unused_age"];
+    if (!cfg["kv_log_level"].is_null())
+    {
+        log_level = cfg["kv_log_level"].uint64_value();
+        trace = log_level >= 10;
+        kv_cfg["kv_log_level"] = cfg["kv_log_level"];
+    }
+    total_prob = reopen_prob+get_prob+add_prob+update_prob+del_prob+list_prob;
+    stat.get.name = "get";
+    stat.add.name = "add";
+    stat.update.name = "update";
+    stat.del.name = "del";
+    stat.list.name = "list";
+}
+
+void kv_test_t::run(json11::Json cfg)
+{
+    srand48(time(NULL));
+    parse_config(cfg);
+    // Create client
+    ringloop = new ring_loop_t(512);
+    epmgr = new epoll_manager_t(ringloop);
+    cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
+    db = new kv_dbw_t(cli);
+    // Load image metadata
+    while (!cli->is_ready())
+    {
+        ringloop->loop();
+        if (cli->is_ready())
+            break;
+        ringloop->wait();
+    }
+    // Run
+    reopening = true;
+    db->open(inode_id, kv_cfg, [this](int res)
+    {
+        reopening = false;
+        if (res < 0)
+        {
+            fprintf(stderr, "ERROR: Open index: %d (%s)\n", res, strerror(-res));
+            exit(1);
+        }
+        if (trace)
+            printf("Index opened\n");
+        ringloop->wakeup();
+    });
+    consumer.loop = [this]() { loop(); };
+    ringloop->register_consumer(&consumer);
+    if (print_stats_interval)
+        stat_timer_id = epmgr->tfd->set_timer(print_stats_interval*1000, true, [this](int) { print_stats(prev_stat, prev_stat_time); });
+    clock_gettime(CLOCK_REALTIME, &start_stat_time);
+    prev_stat_time = start_stat_time;
+    while (!finished)
+    {
+        ringloop->loop();
+        if (!finished)
+            ringloop->wait();
+    }
+    if (stat_timer_id >= 0)
+        epmgr->tfd->clear_timer(stat_timer_id);
+    ringloop->unregister_consumer(&consumer);
+    // Print total stats
+    print_total_stats();
+    // Destroy the client
+    delete db;
+    db = NULL;
+    cli->flush();
+    delete cli;
+    delete epmgr;
+    delete ringloop;
+    cli = NULL;
+    epmgr = NULL;
+    ringloop = NULL;
+}
+
+static const char *base64_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@+/";
+
+std::string random_str(int len)
+{
+    std::string str;
+    str.resize(len);
+    for (int i = 0; i < len; i++)
+    {
+        str[i] = base64_chars[lrand48() % 64];
+    }
+    return str;
+}
+
+void kv_test_t::loop()
+{
+    if (reopening)
+    {
+        return;
+    }
+    if (ops_done >= op_count)
+    {
+        finished = true;
+    }
+    while (!finished && ops_sent < op_count && in_progress < parallelism)
+    {
+        uint64_t dice = (lrand48() % total_prob);
+        if (dice < reopen_prob)
+        {
+            reopening = true;
+            db->close([this]()
+            {
+                if (trace)
+                    printf("Index closed\n");
+                db->open(inode_id, kv_cfg, [this](int res)
+                {
+                    reopening = false;
+                    if (res < 0)
+                    {
+                        fprintf(stderr, "ERROR: Reopen index: %d (%s)\n", res, strerror(-res));
+                        finished = true;
+                        return;
+                    }
+                    if (trace)
+                        printf("Index reopened\n");
+                    ringloop->wakeup();
+                });
+            });
+            return;
+        }
+        else if (dice < reopen_prob+get_prob)
+        {
+            // get existing
+            auto key = random_str(max_key_len);
+            auto k_it = values.lower_bound(key);
+            if (k_it == values.end())
+                continue;
+            key = k_it->first;
+            if (changing_keys.find(key) != changing_keys.end())
+                continue;
+            in_progress++;
+            ops_sent++;
+            if (trace)
+                printf("get %s\n", key.c_str());
+            timespec tv_begin;
+            clock_gettime(CLOCK_REALTIME, &tv_begin);
+            db->get(key, [this, key, tv_begin](int res, const std::string & value)
+            {
+                add_stat(stat.get, tv_begin);
+                ops_done++;
+                in_progress--;
+                auto it = values.find(key);
+                if (res != (it == values.end() ? -ENOENT : 0))
+                {
+                    fprintf(stderr, "ERROR: get %s: %d (%s)\n", key.c_str(), res, strerror(-res));
+                    if (stop_on_error)
+                        exit(1);
+                }
+                else if (it != values.end() && value != it->second)
+                {
+                    fprintf(stderr, "ERROR: get %s: mismatch: %s vs %s\n", key.c_str(), value.c_str(), it->second.c_str());
+                    if (stop_on_error)
+                        exit(1);
+                }
+                ringloop->wakeup();
+            });
+        }
+        else if (dice < reopen_prob+get_prob+add_prob+update_prob)
+        {
+            bool is_add = false;
+            std::string key;
+            if (dice < reopen_prob+get_prob+add_prob)
+            {
+                // add
+                is_add = true;
+                uint64_t key_len = min_key_len + (max_key_len > min_key_len ? lrand48() % (max_key_len-min_key_len) : 0);
+                key = key_prefix + random_str(key_len) + key_suffix;
+            }
+            else
+            {
+                // update
+                key = random_str(max_key_len);
+                auto k_it = values.lower_bound(key);
+                if (k_it == values.end())
+                    continue;
+                key = k_it->first;
+            }
+            if (changing_keys.find(key) != changing_keys.end())
+                continue;
+            uint64_t value_len = min_value_len + (max_value_len > min_value_len ? lrand48() % (max_value_len-min_value_len) : 0);
+            auto value = random_str(value_len);
+            start_change(key);
+            ops_sent++;
+            in_progress++;
+            if (trace)
+                printf("set %s = %s\n", key.c_str(), value.c_str());
+            timespec tv_begin;
+            clock_gettime(CLOCK_REALTIME, &tv_begin);
+            db->set(key, value, [this, key, value, tv_begin, is_add](int res)
+            {
+                add_stat(is_add ? stat.add : stat.update, tv_begin);
+                stop_change(key);
+                ops_done++;
+                in_progress--;
+                if (res != 0)
+                {
+                    fprintf(stderr, "ERROR: set %s = %s: %d (%s)\n", key.c_str(), value.c_str(), res, strerror(-res));
+                    if (stop_on_error)
+                        exit(1);
+                }
+                else
+                {
+                    values[key] = value;
+                }
+                ringloop->wakeup();
+            }, NULL);
+        }
+        else if (dice < reopen_prob+get_prob+add_prob+update_prob+del_prob)
+        {
+            // delete
+            auto key = random_str(max_key_len);
+            auto k_it = values.lower_bound(key);
+            if (k_it == values.end())
+                continue;
+            key = k_it->first;
+            if (changing_keys.find(key) != changing_keys.end())
+                continue;
+            start_change(key);
+            ops_sent++;
+            in_progress++;
+            if (trace)
+                printf("del %s\n", key.c_str());
+            timespec tv_begin;
+            clock_gettime(CLOCK_REALTIME, &tv_begin);
+            db->del(key, [this, key, tv_begin](int res)
+            {
+                add_stat(stat.del, tv_begin);
+                stop_change(key);
+                ops_done++;
+                in_progress--;
+                if (res != 0)
+                {
+                    fprintf(stderr, "ERROR: del %s: %d (%s)\n", key.c_str(), res, strerror(-res));
+                    if (stop_on_error)
+                        exit(1);
+                }
+                else
+                {
+                    values.erase(key);
+                }
+                ringloop->wakeup();
+            }, NULL);
+        }
+        else if (dice < reopen_prob+get_prob+add_prob+update_prob+del_prob+list_prob)
+        {
+            // list
+            ops_sent++;
+            in_progress++;
+            auto key = random_str(max_key_len);
+            auto lst = new kv_test_listing_t;
+            auto k_it = values.lower_bound(key);
+            lst->count = min_list_count + (max_list_count > min_list_count ? lrand48() % (max_list_count-min_list_count) : 0);
+            lst->handle = db->list_start(k_it == values.begin() ? key_prefix : key);
+            lst->next_after = k_it == values.begin() ? key_prefix : key;
+            lst->inflights = changing_keys;
+            listings.insert(lst);
+            if (trace)
+                printf("list from %s\n", key.c_str());
+            clock_gettime(CLOCK_REALTIME, &lst->tv_begin);
+            db->list_next(lst->handle, [this, lst](int res, const std::string & key, const std::string & value)
+            {
+                if (log_level >= 11)
+                    printf("list: %s = %s\n", key.c_str(), value.c_str());
+                if (res >= 0 && key_prefix.size() && (key.size() < key_prefix.size() ||
+                    key.substr(0, key_prefix.size()) != key_prefix))
+                {
+                    // stop at this key
+                    res = -ENOENT;
+                }
+                if (res < 0 || (lst->count > 0 && lst->done >= lst->count))
+                {
+                    add_stat(stat.list, lst->tv_begin);
+                    if (res == 0)
+                    {
+                        // ok (done >= count)
+                    }
+                    else if (res != -ENOENT)
+                    {
+                        fprintf(stderr, "ERROR: list: %d (%s)\n", res, strerror(-res));
+                        lst->error = true;
+                    }
+                    else
+                    {
+                        auto k_it = lst->next_after == "" ? values.begin() : values.upper_bound(lst->next_after);
+                        while (k_it != values.end())
+                        {
+                            while (k_it != values.end() && lst->inflights.find(k_it->first) != lst->inflights.end())
+                                k_it++;
+                            if (k_it != values.end())
+                            {
+                                fprintf(stderr, "ERROR: list: missing key %s\n", (k_it++)->first.c_str());
+                                lst->error = true;
+                            }
+                        }
+                    }
+                    if (lst->error && stop_on_error)
+                        exit(1);
+                    ops_done++;
+                    in_progress--;
+                    db->list_close(lst->handle);
+                    listings.erase(lst);
+                    delete lst;
+                    ringloop->wakeup();
+                }
+                else
+                {
+                    stat.list_keys++;
+                    // Do not check modified keys in listing
+                    // Listing may return their old or new state
+                    if ((!key_suffix.size() || (key.size() >= key_suffix.size() &&
+                        key.substr(key.size()-key_suffix.size()) == key_suffix)) &&
+                        lst->inflights.find(key) == lst->inflights.end())
+                    {
+                        lst->done++;
+                        auto k_it = lst->next_after == "" ? values.begin() : values.upper_bound(lst->next_after);
+                        while (true)
+                        {
+                            while (k_it != values.end() && lst->inflights.find(k_it->first) != lst->inflights.end())
+                            {
+                                k_it++;
+                            }
+                            if (k_it == values.end() || k_it->first > key)
+                            {
+                                fprintf(stderr, "ERROR: list: extra key %s\n", key.c_str());
+                                lst->error = true;
+                                break;
+                            }
+                            else if (k_it->first < key)
+                            {
+                                fprintf(stderr, "ERROR: list: missing key %s\n", k_it->first.c_str());
+                                lst->error = true;
+                                lst->next_after = k_it->first;
+                                k_it++;
+                            }
+                            else
+                            {
+                                if (k_it->second != value)
+                                {
+                                    fprintf(stderr, "ERROR: list: mismatch: %s = %s but should be %s\n",
+                                        key.c_str(), value.c_str(), k_it->second.c_str());
+                                    lst->error = true;
+                                }
+                                lst->next_after = k_it->first;
+                                break;
+                            }
+                        }
+                    }
+                    db->list_next(lst->handle, NULL);
+                }
+            });
+        }
+    }
+}
+
+void kv_test_t::add_stat(kv_test_lat_t & stat, timespec tv_begin)
+{
+    timespec tv_end;
+    clock_gettime(CLOCK_REALTIME, &tv_end);
+    int64_t usec = (tv_end.tv_sec - tv_begin.tv_sec)*1000000 +
+        (tv_end.tv_nsec - tv_begin.tv_nsec)/1000;
+    if (usec > 0)
+        stat.usec += usec;
+    stat.count++;
+}
+
+void kv_test_t::print_stats(kv_test_stat_t & prev_stat, timespec & prev_stat_time)
+{
+    timespec cur_stat_time;
+    clock_gettime(CLOCK_REALTIME, &cur_stat_time);
+    int64_t usec = (cur_stat_time.tv_sec - prev_stat_time.tv_sec)*1000000 +
+        (cur_stat_time.tv_nsec - prev_stat_time.tv_nsec)/1000;
+    if (usec > 0)
+    {
+        kv_test_lat_t *lats[] = { &stat.get, &stat.add, &stat.update, &stat.del, &stat.list };
+        kv_test_lat_t *prev[] = { &prev_stat.get, &prev_stat.add, &prev_stat.update, &prev_stat.del, &prev_stat.list };
+        if (!json_output)
+        {
+            char buf[128] = { 0 };
+            for (size_t i = 0; i < sizeof(lats)/sizeof(lats[0]); i++)
+            {
+                snprintf(buf, sizeof(buf)-1, "%.1f %s/s (%lu us)", (lats[i]->count-prev[i]->count)*1000000.0/usec,
+                    lats[i]->name, (lats[i]->usec-prev[i]->usec)/(lats[i]->count-prev[i]->count > 0 ? lats[i]->count-prev[i]->count : 1));
+                int k;
+                for (k = strlen(buf); k < strlen(lats[i]->name)+21; k++)
+                    buf[k] = ' ';
+                buf[k] = 0;
+                printf("%s", buf);
+            }
+            printf("\n");
+        }
+        else
+        {
+            int64_t runtime = (cur_stat_time.tv_sec - start_stat_time.tv_sec)*1000000 +
+                (cur_stat_time.tv_nsec - start_stat_time.tv_nsec)/1000;
+            printf("{\"runtime\":%.1f", (double)runtime/1000000.0);
+            for (size_t i = 0; i < sizeof(lats)/sizeof(lats[0]); i++)
+            {
+                if (lats[i]->count > prev[i]->count)
+                {
+                    printf(
+                        ",\"%s\":{\"avg\":{\"iops\":%.1f,\"usec\":%lu},\"total\":{\"count\":%lu,\"usec\":%lu}}",
+                        lats[i]->name, (lats[i]->count-prev[i]->count)*1000000.0/usec,
+                        (lats[i]->usec-prev[i]->usec)/(lats[i]->count-prev[i]->count),
+                        lats[i]->count, lats[i]->usec
+                    );
+                }
+            }
+            printf("}\n");
+        }
+    }
+    prev_stat = stat;
+    prev_stat_time = cur_stat_time;
+}
+
+void kv_test_t::print_total_stats()
+{
+    if (!json_output)
+        printf("Total:\n");
+    kv_test_stat_t start_stats;
+    timespec start_stat_time = this->start_stat_time;
+    print_stats(start_stats, start_stat_time);
+}
+
+void kv_test_t::start_change(const std::string & key)
+{
+    changing_keys.insert(key);
+    for (auto lst: listings)
+    {
+        lst->inflights.insert(key);
+    }
+}
+
+void kv_test_t::stop_change(const std::string & key)
+{
+    changing_keys.erase(key);
+}
+
+int main(int narg, const char *args[])
+{
+    setvbuf(stdout, NULL, _IONBF, 0);
+    setvbuf(stderr, NULL, _IONBF, 0);
+    exe_name = args[0];
+    kv_test_t *p = new kv_test_t();
+    p->run(kv_test_t::parse_args(narg, args));
+    delete p;
+    return 0;
+}
--- a/src/nbd_proxy.cpp
+++ b/src/nbd_proxy.cpp
@@ -146,7 +146,7 @@ public:
            "    Note that nbd_timeout, nbd_max_devices and nbd_max_part options may also be specified\n"
            "    in /etc/vitastor/vitastor.conf or in other configuration file specified with --config_file.\n"
            "  --logfile /path/to/log/file.txt\n"
-            "    Wite log messages to the specified file instead of dropping them (in background mode)\n"
+            "    Write log messages to the specified file instead of dropping them (in background mode)\n"
            "    or printing them to the standard output (in foreground mode).\n"
            "  --dev_num N\n"
            "    Use the specified device /dev/nbdN instead of automatic selection.\n"
@@ -298,7 +298,7 @@ public:
                }
            }
        }
-        if (cfg["logfile"].is_string())
+        if (cfg["logfile"].string_value() != "")
        {
            logfile = cfg["logfile"].string_value();
        }
--- a/src/nfs_block.cpp
+++ b/src/nfs_block.cpp
@@ -1,23 +1,18 @@
 // Copyright (c) Vitaliy Filippov, 2019+
 // License: VNPL-1.1 (see README.md for details)
 //
-// NFS connection handler for NFS proxy
+// NFS proxy over Vitastor block images

 #include <sys/time.h>

 #include "str_util.h"

 #include "nfs_proxy.h"
-
+#include "nfs_common.h"
+#include "nfs_block.h"
 #include "nfs/nfs.h"
-
 #include "cli.h"

-#define TRUE 1
-#define FALSE 0
-
-#define MAX_REQUEST_SIZE 128*1024*1024
-
 static unsigned len_pad4(unsigned len)
 {
    return len + (len&3 ? 4-(len&3) : 0);
@@ -28,10 +23,10 @@ static std::string get_inode_name(nfs_client_t *self, diropargs3 & what)
    // Get name
    std::string dirhash = what.dir;
    std::string dir;
-    if (dirhash != "roothandle")
+    if (dirhash != NFS_ROOT_HANDLE)
    {
-        auto dir_it = self->parent->dir_by_hash.find(dirhash);
-        if (dir_it != self->parent->dir_by_hash.end())
+        auto dir_it = self->parent->blockfs->dir_by_hash.find(dirhash);
+        if (dir_it != self->parent->blockfs->dir_by_hash.end())
            dir = dir_it->second;
        else
            return "";
@@ -39,27 +34,12 @@ static std::string get_inode_name(nfs_client_t *self, diropargs3 & what)
    std::string name = what.name;
    return (dir.size()
        ? dir+"/"+name
-        : self->parent->name_prefix+name);
-}
-
-static nfsstat3 vitastor_nfs_map_err(int err)
-{
-    return (err == EINVAL ? NFS3ERR_INVAL
-        : (err == ENOENT ? NFS3ERR_NOENT
-        : (err == ENOSPC ? NFS3ERR_NOSPC
-        : (err == EEXIST ? NFS3ERR_EXIST
-        : (err == EIO ? NFS3ERR_IO : (err ? NFS3ERR_IO : NFS3_OK))))));
-}
-
-static int nfs3_null_proc(void *opaque, rpc_op_t *rop)
-{
-    rpc_queue_reply(rop);
-    return 0;
+        : self->parent->blockfs->name_prefix+name);
 }

 static fattr3 get_dir_attributes(nfs_client_t *self, std::string dir)
 {
-    auto & dinf = self->parent->dir_info.at(dir);
+    auto & dinf = self->parent->blockfs->dir_info.at(dir);
    return (fattr3){
        .type = NF3DIR,
        .mode = 0755,
@@ -108,7 +88,7 @@ static fattr3 get_file_attributes(nfs_client_t *self, inode_t inode_num)
    };
 }

-static int nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
+static int block_nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
 {
    nfs_client_t *self = (nfs_client_t*)opaque;
    GETATTR3args *args = (GETATTR3args*)rop->request;
@@ -116,12 +96,12 @@ static int nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
    bool is_dir = false;
    std::string dirhash = args->object;
    std::string dir;
-    if (args->object == "roothandle")
+    if (args->object == NFS_ROOT_HANDLE)
        is_dir = true;
    else
    {
-        auto dir_it = self->parent->dir_by_hash.find(dirhash);
-        if (dir_it != self->parent->dir_by_hash.end())
+        auto dir_it = self->parent->blockfs->dir_by_hash.find(dirhash);
+        if (dir_it != self->parent->blockfs->dir_by_hash.end())
        {
            is_dir = true;
            dir = dir_it->second;
@@ -140,8 +120,8 @@ static int nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
    else
    {
        uint64_t inode_num = 0;
-        auto inode_num_it = self->parent->inode_by_hash.find(dirhash);
-        if (inode_num_it != self->parent->inode_by_hash.end())
+        auto inode_num_it = self->parent->blockfs->inode_by_hash.find(dirhash);
+        if (inode_num_it != self->parent->blockfs->inode_by_hash.end())
            inode_num = inode_num_it->second;
        auto inode_it = self->parent->cli->st_cli.inode_config.find(inode_num);
        if (inode_num && inode_it != self->parent->cli->st_cli.inode_config.end())
@@ -179,16 +159,16 @@ static int nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
    return 0;
 }

-static int nfs3_setattr_proc(void *opaque, rpc_op_t *rop)
+static int block_nfs3_setattr_proc(void *opaque, rpc_op_t *rop)
 {
    nfs_client_t *self = (nfs_client_t*)opaque;
    SETATTR3args *args = (SETATTR3args*)rop->request;
    SETATTR3res *reply = (SETATTR3res*)rop->reply;
    std::string handle = args->object;
-    auto ino_it = self->parent->inode_by_hash.find(handle);
-    if (ino_it == self->parent->inode_by_hash.end())
+    auto ino_it = self->parent->blockfs->inode_by_hash.find(handle);
+    if (ino_it == self->parent->blockfs->inode_by_hash.end())
    {
-        if (handle == "roothandle" || self->parent->dir_by_hash.find(handle) != self->parent->dir_by_hash.end())
+        if (handle == NFS_ROOT_HANDLE || self->parent->blockfs->dir_by_hash.find(handle) != self->parent->blockfs->dir_by_hash.end())
        {
            if (args->new_attributes.size.set_it)
            {
@@ -228,7 +208,7 @@ static int nfs3_setattr_proc(void *opaque, rpc_op_t *rop)
    return 0;
 }

-static int nfs3_lookup_proc(void *opaque, rpc_op_t *rop)
+static int block_nfs3_lookup_proc(void *opaque, rpc_op_t *rop)
 {
    nfs_client_t *self = (nfs_client_t*)opaque;
    LOOKUP3args *args = (LOOKUP3args*)rop->request;
@@ -255,8 +235,8 @@ static int nfs3_lookup_proc(void *opaque, rpc_op_t *rop)
                return 0;
            }
        }
-        auto dir_it = self->parent->dir_info.find(full_name);
-        if (dir_it != self->parent->dir_info.end())
+        auto dir_it = self->parent->blockfs->dir_info.find(full_name);
+        if (dir_it != self->parent->blockfs->dir_info.end())
        {
            *reply = (LOOKUP3res){
                .status = NFS3_OK,
@@ -277,7 +257,7 @@ static int nfs3_lookup_proc(void *opaque, rpc_op_t *rop)
    return 0;
 }

-static int nfs3_access_proc(void *opaque, rpc_op_t *rop)
+static int block_nfs3_access_proc(void *opaque, rpc_op_t *rop)
 {
    //nfs_client_t *self = (nfs_client_t*)opaque;
    ACCESS3args *args = (ACCESS3args*)rop->request;
@@ -292,7 +272,7 @@ static int nfs3_access_proc(void *opaque, rpc_op_t *rop)
    return 0;
 }

-static int nfs3_readlink_proc(void *opaque, rpc_op_t *rop)
+static int block_nfs3_readlink_proc(void *opaque, rpc_op_t *rop)
 {
    //nfs_client_t *self = (nfs_client_t*)opaque;
    //READLINK3args *args = (READLINK3args*)rop->request;
@@ -303,14 +283,14 @@ static int nfs3_readlink_proc(void *opaque, rpc_op_t *rop)
    return 0;
 }

-static int nfs3_read_proc(void *opaque, rpc_op_t *rop)
+static int block_nfs3_read_proc(void *opaque, rpc_op_t *rop)
 {
    nfs_client_t *self = (nfs_client_t*)opaque;
    READ3args *args = (READ3args*)rop->request;
    READ3res *reply = (READ3res*)rop->reply;
    std::string handle = args->file;
-    auto ino_it = self->parent->inode_by_hash.find(handle);
-    if (ino_it == self->parent->inode_by_hash.end())
+    auto ino_it = self->parent->blockfs->inode_by_hash.find(handle);
+    if (ino_it == self->parent->blockfs->inode_by_hash.end())
    {
        *reply = (READ3res){ .status = NFS3ERR_NOENT };
        rpc_queue_reply(rop);
@@ -367,14 +347,14 @@ static int nfs3_read_proc(void *opaque, rpc_op_t *rop)

 static void nfs_resize_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode, uint64_t new_size, uint64_t offset, uint64_t count, void *buf);

-static int nfs3_write_proc(void *opaque, rpc_op_t *rop)
+static int block_nfs3_write_proc(void *opaque, rpc_op_t *rop)
 {
    nfs_client_t *self = (nfs_client_t*)opaque;
    WRITE3args *args = (WRITE3args*)rop->request;
    WRITE3res *reply = (WRITE3res*)rop->reply;
    std::string handle = args->file;
-    auto ino_it = self->parent->inode_by_hash.find(handle);
-    if (ino_it == self->parent->inode_by_hash.end())
+    auto ino_it = self->parent->blockfs->inode_by_hash.find(handle);
+    if (ino_it == self->parent->blockfs->inode_by_hash.end())
    {
        *reply = (WRITE3res){ .status = NFS3ERR_NOENT };
        rpc_queue_reply(rop);
@@ -480,8 +460,8 @@ static void complete_extend_write(nfs_client_t *self, rpc_op_t *rop, inode_t ino

 static void complete_extend_inode(nfs_client_t *self, uint64_t inode, uint64_t new_size, int err)
 {
-    auto ext_it = self->extend_writes.lower_bound((extend_size_t){ .inode = inode, .new_size = 0 });
-    while (ext_it != self->extend_writes.end() &&
+    auto ext_it = self->parent->blockfs->extend_writes.lower_bound((extend_size_t){ .inode = inode, .new_size = 0 });
+    while (ext_it != self->parent->blockfs->extend_writes.end() &&
        ext_it->first.inode == inode &&
        ext_it->first.new_size <= new_size)
    {
@@ -490,7 +470,7 @@ static void complete_extend_inode(nfs_client_t *self, uint64_t inode, uint64_t n
        {
            complete_extend_write(self, ext_it->second.rop, inode, ext_it->second.write_res < 0
                ? ext_it->second.write_res : ext_it->second.resize_res);
-            self->extend_writes.erase(ext_it++);
+            self->parent->blockfs->extend_writes.erase(ext_it++);
        }
        else
            ext_it++;
@@ -500,7 +480,7 @@ static void complete_extend_inode(nfs_client_t *self, uint64_t inode, uint64_t n
 static void extend_inode(nfs_client_t *self, uint64_t inode, uint64_t new_size)
 {
    // Send an extend request
-    auto & ext = self->extends[inode];
+    auto & ext = self->parent->blockfs->extends[inode];
    ext.cur_extend = new_size;
    auto inode_it = self->parent->cli->st_cli.inode_config.find(inode);
    if (inode_it != self->parent->cli->st_cli.inode_config.end() &&
@@ -514,10 +494,10 @@ static void extend_inode(nfs_client_t *self, uint64_t inode, uint64_t new_size)
            { "force_size", true },
        }), [=](const cli_result_t & r)
        {
-            auto & ext = self->extends[inode];
+            auto & ext = self->parent->blockfs->extends[inode];
            if (r.err)
            {
-                fprintf(stderr, "Error extending inode %ju to %ju bytes: %s\n", inode, new_size, r.text.c_str());
+                fprintf(stderr, "Error extending inode %lu to %lu bytes: %s\n", inode, new_size, r.text.c_str());
            }
            if (r.err == EAGAIN || ext.next_extend > ext.cur_extend)
            {
@@ -548,7 +528,7 @@ static void nfs_do_write(nfs_client_t *self, std::multimap<extend_size_t, extend
    {
        auto inode = op->inode;
        int write_res = op->retval < 0 ? op->retval : (op->retval != op->len ? -ERANGE : 0);
-        if (ewr_it == self->extend_writes.end())
+        if (ewr_it == self->parent->blockfs->extend_writes.end())
        {
            complete_extend_write(self, rop, inode, write_res);
        }
@@ -558,7 +538,7 @@ static void nfs_do_write(nfs_client_t *self, std::multimap<extend_size_t, extend
            if (ewr_it->second.resize_res <= 0)
            {
                complete_extend_write(self, rop, inode, write_res < 0 ? write_res : ewr_it->second.resize_res);
-                self->extend_writes.erase(ewr_it);
+                self->parent->blockfs->extend_writes.erase(ewr_it);
            }
        }
    };
@@ -572,7 +552,7 @@ static void nfs_resize_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode,
    if (inode_it != self->parent->cli->st_cli.inode_config.end() &&
        inode_it->second.size < new_size)
    {
-        auto ewr_it = self->extend_writes.emplace((extend_size_t){
+        auto ewr_it = self->parent->blockfs->extend_writes.emplace((extend_size_t){
            .inode = inode,
            .new_size = new_size,
        }, (extend_write_t){
@@ -580,7 +560,7 @@ static void nfs_resize_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode,
            .resize_res = 1,
            .write_res = 1,
        });
-        auto & ext = self->extends[inode];
+        auto & ext = self->parent->blockfs->extends[inode];
        if (ext.cur_extend > 0)
        {
            // Already resizing, just wait
@@ -595,11 +575,11 @@ static void nfs_resize_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode,
    }
    else
    {
-        nfs_do_write(self, self->extend_writes.end(), rop, inode, offset, count, buf);
+        nfs_do_write(self, self->parent->blockfs->extend_writes.end(), rop, inode, offset, count, buf);
    }
 }

-static int nfs3_create_proc(void *opaque, rpc_op_t *rop)
+static int block_nfs3_create_proc(void *opaque, rpc_op_t *rop)
 {
    nfs_client_t *self = (nfs_client_t*)opaque;
    CREATE3args *args = (CREATE3args*)rop->request;
@@ -650,7 +630,7 @@ static int nfs3_create_proc(void *opaque, rpc_op_t *rop)
    return 1;
 }

-static int nfs3_mkdir_proc(void *opaque, rpc_op_t *rop)
+static int block_nfs3_mkdir_proc(void *opaque, rpc_op_t *rop)
 {
    nfs_client_t *self = (nfs_client_t*)opaque;
    MKDIR3args *args = (MKDIR3args*)rop->request;
@@ -669,19 +649,19 @@ static int nfs3_mkdir_proc(void *opaque, rpc_op_t *rop)
        rpc_queue_reply(rop);
        return 0;
    }
-    auto dir_id_it = self->parent->dir_info.find(full_name);
-    if (dir_id_it != self->parent->dir_info.end())
+    auto dir_id_it = self->parent->blockfs->dir_info.find(full_name);
+    if (dir_id_it != self->parent->blockfs->dir_info.end())
    {
        *reply = (MKDIR3res){ .status = NFS3ERR_EXIST };
        rpc_queue_reply(rop);
        return 0;
    }
    // FIXME: Persist empty directories in some etcd keys, like /vitastor/dir/...
-    self->parent->dir_info[full_name] = (nfs_dir_t){
-        .id = self->parent->next_dir_id++,
+    self->parent->blockfs->dir_info[full_name] = (nfs_dir_t){
+        .id = self->parent->blockfs->next_dir_id++,
        .mod_rev = 0,
    };
-    self->parent->dir_by_hash["S"+base64_encode(sha256(full_name))] = full_name;
+    self->parent->blockfs->dir_by_hash["S"+base64_encode(sha256(full_name))] = full_name;
    *reply = (MKDIR3res){
        .status = NFS3_OK,
        .resok = (MKDIR3resok){
@@ -700,7 +680,7 @@ static int nfs3_mkdir_proc(void *opaque, rpc_op_t *rop)
    return 0;
 }

-static int nfs3_symlink_proc(void *opaque, rpc_op_t *rop)
+static int block_nfs3_symlink_proc(void *opaque, rpc_op_t *rop)
 {
 //    nfs_client_t *self = (nfs_client_t*)opaque;
 //    SYMLINK3args *args = (SYMLINK3args*)rop->request;
@@ -711,7 +691,7 @@ static int nfs3_symlink_proc(void *opaque, rpc_op_t *rop)
    return 0;
 }

-static int nfs3_mknod_proc(void *opaque, rpc_op_t *rop)
+static int block_nfs3_mknod_proc(void *opaque, rpc_op_t *rop)
 {
 //    nfs_client_t *self = (nfs_client_t*)opaque;
 //    MKNOD3args *args = (MKNOD3args*)rop->request;
@@ -722,7 +702,7 @@ static int nfs3_mknod_proc(void *opaque, rpc_op_t *rop)
    return 0;
 }

-static int nfs3_remove_proc(void *opaque, rpc_op_t *rop)
+static int block_nfs3_remove_proc(void *opaque, rpc_op_t *rop)
 {
    nfs_client_t *self = (nfs_client_t*)opaque;
    REMOVE3res *reply = (REMOVE3res*)rop->reply;
@@ -752,7 +732,7 @@ static int nfs3_remove_proc(void *opaque, rpc_op_t *rop)
    return 1;
 }

-static int nfs3_rmdir_proc(void *opaque, rpc_op_t *rop)
+static int block_nfs3_rmdir_proc(void *opaque, rpc_op_t *rop)
 {
    nfs_client_t *self = (nfs_client_t*)opaque;
    RMDIR3args *args = (RMDIR3args*)rop->request;
@@ -764,8 +744,8 @@ static int nfs3_rmdir_proc(void *opaque, rpc_op_t *rop)
        rpc_queue_reply(rop);
        return 0;
    }
-    auto dir_it = self->parent->dir_info.find(full_name);
-    if (dir_it == self->parent->dir_info.end())
+    auto dir_it = self->parent->blockfs->dir_info.find(full_name);
+    if (dir_it == self->parent->blockfs->dir_info.end())
    {
        *reply = (RMDIR3res){ .status = NFS3ERR_NOENT };
        rpc_queue_reply(rop);
@@ -781,8 +761,8 @@ static int nfs3_rmdir_proc(void *opaque, rpc_op_t *rop)
            return 0;
        }
    }
-    self->parent->dir_by_hash.erase("S"+base64_encode(sha256(full_name)));
-    self->parent->dir_info.erase(dir_it);
+    self->parent->blockfs->dir_by_hash.erase("S"+base64_encode(sha256(full_name)));
+    self->parent->blockfs->dir_info.erase(dir_it);
    *reply = (RMDIR3res){ .status = NFS3_OK };
    rpc_queue_reply(rop);
    return 0;
@@ -811,12 +791,12 @@ static int continue_dir_rename(nfs_dir_rename_state *rename_st)
    if (!rename_st->items.size())
    {
        // old dir
-        auto old_info = self->parent->dir_info.at(rename_st->old_name);
-        self->parent->dir_info.erase(rename_st->old_name);
-        self->parent->dir_by_hash.erase("S"+base64_encode(sha256(rename_st->old_name)));
+        auto old_info = self->parent->blockfs->dir_info.at(rename_st->old_name);
+        self->parent->blockfs->dir_info.erase(rename_st->old_name);
+        self->parent->blockfs->dir_by_hash.erase("S"+base64_encode(sha256(rename_st->old_name)));
        // new dir
-        self->parent->dir_info[rename_st->new_name] = old_info;
-        self->parent->dir_by_hash["S"+base64_encode(sha256(rename_st->new_name))] = rename_st->new_name;
+        self->parent->blockfs->dir_info[rename_st->new_name] = old_info;
+        self->parent->blockfs->dir_by_hash["S"+base64_encode(sha256(rename_st->new_name))] = rename_st->new_name;
        RENAME3res *reply = (RENAME3res*)rename_st->rop->reply;
        *reply = (RENAME3res){
            .status = NFS3_OK,
@@ -853,7 +833,7 @@ static int continue_dir_rename(nfs_dir_rename_state *rename_st)

 static void nfs_do_rename(nfs_client_t *self, rpc_op_t *rop, std::string old_name, std::string new_name);

-static int nfs3_rename_proc(void *opaque, rpc_op_t *rop)
+static int block_nfs3_rename_proc(void *opaque, rpc_op_t *rop)
 {
    nfs_client_t *self = (nfs_client_t*)opaque;
    RENAME3args *args = (RENAME3args*)rop->request;
@@ -866,8 +846,8 @@ static int nfs3_rename_proc(void *opaque, rpc_op_t *rop)
        rpc_queue_reply(rop);
        return 0;
    }
-    bool old_is_dir = self->parent->dir_info.find(old_name) != self->parent->dir_info.end();
-    bool new_is_dir = self->parent->dir_info.find(new_name) != self->parent->dir_info.end();
+    bool old_is_dir = self->parent->blockfs->dir_info.find(old_name) != self->parent->blockfs->dir_info.end();
+    bool new_is_dir = self->parent->blockfs->dir_info.find(new_name) != self->parent->blockfs->dir_info.end();
    bool old_is_file = false, new_is_file = false;
    for (auto & ic: self->parent->cli->st_cli.inode_config)
    {
@@ -948,7 +928,7 @@ static void nfs_do_rename(nfs_client_t *self, rpc_op_t *rop, std::string old_nam
    });
 }

-static int nfs3_link_proc(void *opaque, rpc_op_t *rop)
+static int block_nfs3_link_proc(void *opaque, rpc_op_t *rop)
 {
    //nfs_client_t *self = (nfs_client_t*)opaque;
    //LINK3args *args = (LINK3args*)rop->request;
@@ -962,7 +942,7 @@ static int nfs3_link_proc(void *opaque, rpc_op_t *rop)
 static void fill_dir_entry(nfs_client_t *self, rpc_op_t *rop,
    std::map<std::string, nfs_dir_t>::iterator dir_id_it, struct entryplus3 *entry, bool is_plus)
 {
-    if (dir_id_it == self->parent->dir_info.end())
+    if (dir_id_it == self->parent->blockfs->dir_info.end())
    {
        return;
    }
@@ -980,7 +960,7 @@ static void fill_dir_entry(nfs_client_t *self, rpc_op_t *rop,
    }
 }

-static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
+static void block_nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
 {
    nfs_client_t *self = (nfs_client_t*)opaque;
    READDIRPLUS3args plus_args;
@@ -999,13 +979,13 @@ static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
    }
    std::string dirhash = args->dir;
    std::string dir;
-    if (dirhash != "roothandle")
+    if (dirhash != NFS_ROOT_HANDLE)
    {
-        auto dir_it = self->parent->dir_by_hash.find(dirhash);
-        if (dir_it != self->parent->dir_by_hash.end())
+        auto dir_it = self->parent->blockfs->dir_by_hash.find(dirhash);
+        if (dir_it != self->parent->blockfs->dir_by_hash.end())
            dir = dir_it->second;
    }
-    std::string prefix = dir.size() ? dir+"/" : self->parent->name_prefix;
+    std::string prefix = dir.size() ? dir+"/" : self->parent->blockfs->name_prefix;
    std::map<std::string, struct entryplus3> entries;
    for (auto & ic: self->parent->cli->st_cli.inode_config)
    {
@@ -1043,12 +1023,12 @@ static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
        }
        else
        {
-            // skip directories, they will be added from dir_info
+            // skip directories, they will be added from blockfs->dir_info
        }
    }
-    // Add directories from dir_info
-    for (auto dir_id_it = self->parent->dir_info.lower_bound(prefix);
-        dir_id_it != self->parent->dir_info.end(); dir_id_it++)
+    // Add directories from blockfs->dir_info
+    for (auto dir_id_it = self->parent->blockfs->dir_info.lower_bound(prefix);
+        dir_id_it != self->parent->blockfs->dir_info.end(); dir_id_it++)
    {
        if (prefix != "" && dir_id_it->first.substr(0, prefix.size()) != prefix)
            break;
@@ -1061,12 +1041,12 @@ static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
    }
    // Add . and ..
    {
-        auto dir_id_it = self->parent->dir_info.find(dir);
+        auto dir_id_it = self->parent->blockfs->dir_info.find(dir);
        fill_dir_entry(self, rop, dir_id_it, &entries["."], is_plus);
        auto sl = dir.rfind("/");
        if (sl != std::string::npos)
        {
-            auto dir_id_it = self->parent->dir_info.find(dir.substr(0, sl));
+            auto dir_id_it = self->parent->blockfs->dir_info.find(dir.substr(0, sl));
            fill_dir_entry(self, rop, dir_id_it, &entries[".."], is_plus);
        }
    }
@@ -1147,7 +1127,7 @@ static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
    {
        READDIRPLUS3res *reply = (READDIRPLUS3res*)rop->reply;
        *reply = { .status = NFS3_OK };
-        *(uint64_t*)(reply->resok.cookieverf) = self->parent->dir_info.at(dir).mod_rev;
+        *(uint64_t*)(reply->resok.cookieverf) = self->parent->blockfs->dir_info.at(dir).mod_rev;
        reply->resok.reply.entries = entries.size() ? &entries.begin()->second : NULL;
        reply->resok.reply.eof = eof;
    }
@@ -1155,250 +1135,135 @@ static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
    {
        READDIR3res *reply = (READDIR3res*)rop->reply;
        *reply = { .status = NFS3_OK };
-        *(uint64_t*)(reply->resok.cookieverf) = self->parent->dir_info.at(dir).mod_rev;
+        *(uint64_t*)(reply->resok.cookieverf) = self->parent->blockfs->dir_info.at(dir).mod_rev;
        reply->resok.reply.entries = entries.size() ? (entry3*)&entries.begin()->second : NULL;
        reply->resok.reply.eof = eof;
    }
    rpc_queue_reply(rop);
 }

-static int nfs3_readdir_proc(void *opaque, rpc_op_t *rop)
+static int block_nfs3_readdir_proc(void *opaque, rpc_op_t *rop)
 {
-    nfs3_readdir_common(opaque, rop, false);
+    block_nfs3_readdir_common(opaque, rop, false);
    return 0;
 }

-static int nfs3_readdirplus_proc(void *opaque, rpc_op_t *rop)
+static int block_nfs3_readdirplus_proc(void *opaque, rpc_op_t *rop)
 {
-    nfs3_readdir_common(opaque, rop, true);
+    block_nfs3_readdir_common(opaque, rop, true);
    return 0;
 }

-// Get file system statistics
-static int nfs3_fsstat_proc(void *opaque, rpc_op_t *rop)
+void block_fs_state_t::init(nfs_proxy_t *proxy, json11::Json cfg)
 {
-    nfs_client_t *self = (nfs_client_t*)opaque;
-    //FSSTAT3args *args = (FSSTAT3args*)rop->request;
-    FSSTAT3res *reply = (FSSTAT3res*)rop->reply;
-    uint64_t tbytes = 0, fbytes = 0;
-    auto pst_it = self->parent->pool_stats.find(self->parent->default_pool_id);
-    if (pst_it != self->parent->pool_stats.end())
+    name_prefix = cfg["subdir"].string_value();
    {
-        auto ttb = pst_it->second["total_raw_tb"].number_value();
-        auto ftb = (pst_it->second["total_raw_tb"].number_value() - pst_it->second["used_raw_tb"].number_value());
-        tbytes = ttb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)2<<40);
-        fbytes = ftb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)2<<40);
+        int e = name_prefix.size();
+        while (e > 0 && name_prefix[e-1] == '/')
+            e--;
+        int s = 0;
+        while (s < e && name_prefix[s] == '/')
+            s++;
+        name_prefix = name_prefix.substr(s, e-s);
+        if (name_prefix.size())
+            name_prefix += "/";
    }
-    *reply = (FSSTAT3res){
-        .status = NFS3_OK,
-        .resok = (FSSTAT3resok){
-            .obj_attributes = {
-                .attributes_follow = 1,
-                .attributes = get_dir_attributes(self, ""),
-            },
-            .tbytes = tbytes, // total bytes
-            .fbytes = fbytes, // free bytes
-            .abytes = fbytes, // available bytes
-            .tfiles = (size3)(1 << 31), // maximum total files
-            .ffiles = (size3)(1 << 31), // free files
-            .afiles = (size3)(1 << 31), // available files
-            .invarsec = 0,
-        },
+    // We need inode name hashes for NFS handles to remain stateless and <= 64 bytes long
+    dir_info[""] = (nfs_dir_t){
+        .id = 1,
+        .mod_rev = 0,
    };
-    rpc_queue_reply(rop);
-    return 0;
-}
-
-static int nfs3_fsinfo_proc(void *opaque, rpc_op_t *rop)
-{
-    nfs_client_t *self = (nfs_client_t*)opaque;
-    FSINFO3args *args = (FSINFO3args*)rop->request;
-    FSINFO3res *reply = (FSINFO3res*)rop->reply;
-    if (args->fsroot != "roothandle")
+    clock_gettime(CLOCK_REALTIME, &dir_info[""].mtime);
+    assert(proxy->cli->st_cli.on_inode_change_hook == NULL);
+    proxy->cli->st_cli.on_inode_change_hook = [this, proxy](inode_t changed_inode, bool removed)
    {
-        // Example error
-        *reply = (FSINFO3res){ .status = NFS3ERR_INVAL };
-    }
-    else
-    {
-        // Fill info
-        *reply = (FSINFO3res){
-            .status = NFS3_OK,
-            .resok = (FSINFO3resok){
-                .obj_attributes = {
-                    .attributes_follow = 1,
-                    .attributes = get_dir_attributes(self, ""),
-                },
-                .rtmax = 128*1024*1024,
-                .rtpref = 128*1024*1024,
-                .rtmult = 4096,
-                .wtmax = 128*1024*1024,
-                .wtpref = 128*1024*1024,
-                .wtmult = 4096,
-                .dtpref = 128,
-                .maxfilesize = 0x7fffffffffffffff,
-                .time_delta = {
-                    .seconds = 1,
-                    .nseconds = 0,
-                },
-                .properties = FSF3_SYMLINK | FSF3_HOMOGENEOUS,
-            },
-        };
-    }
-    rpc_queue_reply(rop);
-    return 0;
-}
-
-static int nfs3_pathconf_proc(void *opaque, rpc_op_t *rop)
-{
-    //nfs_client_t *self = (nfs_client_t*)opaque;
-    PATHCONF3args *args = (PATHCONF3args*)rop->request;
-    PATHCONF3res *reply = (PATHCONF3res*)rop->reply;
-    if (args->object != "roothandle")
-    {
-        // Example error
-        *reply = (PATHCONF3res){ .status = NFS3ERR_INVAL };
-    }
-    else
-    {
-        // Fill info
-        bool_t x = FALSE;
-        *reply = (PATHCONF3res){
-            .status = NFS3_OK,
-            .resok = (PATHCONF3resok){
-                .obj_attributes = {
-                    // Without at least one reference to a non-constant value (local variable or something else),
-                    // with gcc 8 we get "internal compiler error: side-effects element in no-side-effects CONSTRUCTOR" here
-                    // FIXME: get rid of this after raising compiler requirement
-                    .attributes_follow = x,
-                },
-                .linkmax = 0,
-                .name_max = 255,
-                .no_trunc = TRUE,
-                .chown_restricted = FALSE,
-                .case_insensitive = FALSE,
-                .case_preserving = TRUE,
-            },
-        };
-    }
-    rpc_queue_reply(rop);
-    return 0;
-}
-
-static int nfs3_commit_proc(void *opaque, rpc_op_t *rop)
-{
-    nfs_client_t *self = (nfs_client_t*)opaque;
-    //COMMIT3args *args = (COMMIT3args*)rop->request;
-    cluster_op_t *op = new cluster_op_t;
-    // fsync. we don't know how to fsync a single inode, so just fsync everything
-    op->opcode = OSD_OP_SYNC;
-    op->callback = [self, rop](cluster_op_t *op)
-    {
-        COMMIT3res *reply = (COMMIT3res*)rop->reply;
-        *reply = (COMMIT3res){ .status = vitastor_nfs_map_err(op->retval) };
-        *(uint64_t*)reply->resok.verf = self->parent->server_id;
-        rpc_queue_reply(rop);
+        auto inode_cfg_it = proxy->cli->st_cli.inode_config.find(changed_inode);
+        if (inode_cfg_it == proxy->cli->st_cli.inode_config.end())
+        {
+            return;
+        }
+        auto & inode_cfg = inode_cfg_it->second;
+        std::string full_name = inode_cfg.name;
+        if (proxy->blockfs->name_prefix != "" && full_name.substr(0, proxy->blockfs->name_prefix.size()) != proxy->blockfs->name_prefix)
+        {
+            return;
+        }
+        // Calculate directory modification time and revision (used as "cookie verifier")
+        timespec now;
+        clock_gettime(CLOCK_REALTIME, &now);
+        dir_info[""].mod_rev = dir_info[""].mod_rev < inode_cfg.mod_revision ? inode_cfg.mod_revision : dir_info[""].mod_rev;
+        dir_info[""].mtime = now;
+        int pos = full_name.find('/', proxy->blockfs->name_prefix.size());
+        while (pos >= 0)
+        {
+            std::string dir = full_name.substr(0, pos);
+            auto & dinf = dir_info[dir];
+            if (!dinf.id)
+                dinf.id = next_dir_id++;
+            dinf.mod_rev = dinf.mod_rev < inode_cfg.mod_revision ? inode_cfg.mod_revision : dinf.mod_rev;
+            dinf.mtime = now;
+            dir_by_hash["S"+base64_encode(sha256(dir))] = dir;
+            pos = full_name.find('/', pos+1);
+        }
+        // Alter inode_by_hash
+        if (removed)
+        {
+            auto ino_it = hash_by_inode.find(changed_inode);
+            if (ino_it != hash_by_inode.end())
+            {
+                inode_by_hash.erase(ino_it->second);
+                hash_by_inode.erase(ino_it);
+            }
+        }
+        else
+        {
+            std::string hash = "S"+base64_encode(sha256(full_name));
+            auto hbi_it = hash_by_inode.find(changed_inode);
+            if (hbi_it != hash_by_inode.end() && hbi_it->second != hash)
+            {
+                // inode had a different name, remove old hash=>inode pointer
+                inode_by_hash.erase(hbi_it->second);
+            }
+            inode_by_hash[hash] = changed_inode;
+            hash_by_inode[changed_inode] = hash;
+        }
    };
-    self->parent->cli->execute(op);
-    return 1;
 }

-static int mount3_mnt_proc(void *opaque, rpc_op_t *rop)
-{
-    //nfs_client_t *self = (nfs_client_t*)opaque;
-    //nfs_dirpath *args = (nfs_dirpath*)rop->request;
-    nfs_mountres3 *reply = (nfs_mountres3*)rop->reply;
-    u_int flavor = RPC_AUTH_NONE;
-    reply->fhs_status = MNT3_OK;
-    reply->mountinfo.fhandle = xdr_copy_string(rop->xdrs, "roothandle");
-    reply->mountinfo.auth_flavors.auth_flavors_len = 1;
-    reply->mountinfo.auth_flavors.auth_flavors_val = (u_int*)xdr_copy_string(rop->xdrs, (char*)&flavor, sizeof(u_int)).data;
-    rpc_queue_reply(rop);
-    return 0;
-}
-
-static int mount3_dump_proc(void *opaque, rpc_op_t *rop)
-{
-    nfs_client_t *self = (nfs_client_t*)opaque;
-    nfs_mountlist *reply = (nfs_mountlist*)rop->reply;
-    *reply = (struct nfs_mountbody*)malloc_or_die(sizeof(struct nfs_mountbody));
-    xdr_add_malloc(rop->xdrs, *reply);
-    (*reply)->ml_hostname = xdr_copy_string(rop->xdrs, "127.0.0.1");
-    (*reply)->ml_directory = xdr_copy_string(rop->xdrs, self->parent->export_root);
-    (*reply)->ml_next = NULL;
-    rpc_queue_reply(rop);
-    return 0;
-}
-
-static int mount3_umnt_proc(void *opaque, rpc_op_t *rop)
-{
-    //nfs_client_t *self = (nfs_client_t*)opaque;
-    //nfs_dirpath *arg = (nfs_dirpath*)rop->request;
-    // do nothing
-    rpc_queue_reply(rop);
-    return 0;
-}
-
-static int mount3_umntall_proc(void *opaque, rpc_op_t *rop)
-{
-    // do nothing
-    rpc_queue_reply(rop);
-    return 0;
-}
-
-static int mount3_export_proc(void *opaque, rpc_op_t *rop)
-{
-    nfs_client_t *self = (nfs_client_t*)opaque;
-    nfs_exports *reply = (nfs_exports*)rop->reply;
-    *reply = (struct nfs_exportnode*)calloc_or_die(1, sizeof(struct nfs_exportnode) + sizeof(struct nfs_groupnode));
-    xdr_add_malloc(rop->xdrs, *reply);
-    (*reply)->ex_dir = xdr_copy_string(rop->xdrs, self->parent->export_root);
-    (*reply)->ex_groups = (struct nfs_groupnode*)(reply+1);
-    (*reply)->ex_groups->gr_name = xdr_copy_string(rop->xdrs, "127.0.0.1");
-    (*reply)->ex_groups->gr_next = NULL;
-    (*reply)->ex_next = NULL;
-    rpc_queue_reply(rop);
-    return 0;
-}
-
-nfs_client_t::nfs_client_t()
+void nfs_block_procs(nfs_client_t *self)
 {
    struct rpc_service_proc_t pt[] = {
-        {NFS_PROGRAM, NFS_V3, NFS3_NULL,        nfs3_null_proc,        NULL,                            0,                        NULL,                           0,                       this},
-        {NFS_PROGRAM, NFS_V3, NFS3_GETATTR,     nfs3_getattr_proc,     (xdrproc_t)xdr_GETATTR3args,     sizeof(GETATTR3args),     (xdrproc_t)xdr_GETATTR3res,     sizeof(GETATTR3res),     this},
-        {NFS_PROGRAM, NFS_V3, NFS3_SETATTR,     nfs3_setattr_proc,     (xdrproc_t)xdr_SETATTR3args,     sizeof(SETATTR3args),     (xdrproc_t)xdr_SETATTR3res,     sizeof(SETATTR3res),     this},
-        {NFS_PROGRAM, NFS_V3, NFS3_LOOKUP,      nfs3_lookup_proc,      (xdrproc_t)xdr_LOOKUP3args,      sizeof(LOOKUP3args),      (xdrproc_t)xdr_LOOKUP3res,      sizeof(LOOKUP3res),      this},
-        {NFS_PROGRAM, NFS_V3, NFS3_ACCESS,      nfs3_access_proc,      (xdrproc_t)xdr_ACCESS3args,      sizeof(ACCESS3args),      (xdrproc_t)xdr_ACCESS3res,      sizeof(ACCESS3res),      this},
-        {NFS_PROGRAM, NFS_V3, NFS3_READLINK,    nfs3_readlink_proc,    (xdrproc_t)xdr_READLINK3args,    sizeof(READLINK3args),    (xdrproc_t)xdr_READLINK3res,    sizeof(READLINK3res),    this},
-        {NFS_PROGRAM, NFS_V3, NFS3_READ,        nfs3_read_proc,        (xdrproc_t)xdr_READ3args,        sizeof(READ3args),        (xdrproc_t)xdr_READ3res,        sizeof(READ3res),        this},
-        {NFS_PROGRAM, NFS_V3, NFS3_WRITE,       nfs3_write_proc,       (xdrproc_t)xdr_WRITE3args,       sizeof(WRITE3args),       (xdrproc_t)xdr_WRITE3res,       sizeof(WRITE3res),       this},
-        {NFS_PROGRAM, NFS_V3, NFS3_CREATE,      nfs3_create_proc,      (xdrproc_t)xdr_CREATE3args,      sizeof(CREATE3args),      (xdrproc_t)xdr_CREATE3res,      sizeof(CREATE3res),      this},
-        {NFS_PROGRAM, NFS_V3, NFS3_MKDIR,       nfs3_mkdir_proc,       (xdrproc_t)xdr_MKDIR3args,       sizeof(MKDIR3args),       (xdrproc_t)xdr_MKDIR3res,       sizeof(MKDIR3res),       this},
-        {NFS_PROGRAM, NFS_V3, NFS3_SYMLINK,     nfs3_symlink_proc,     (xdrproc_t)xdr_SYMLINK3args,     sizeof(SYMLINK3args),     (xdrproc_t)xdr_SYMLINK3res,     sizeof(SYMLINK3res),     this},
-        {NFS_PROGRAM, NFS_V3, NFS3_MKNOD,       nfs3_mknod_proc,       (xdrproc_t)xdr_MKNOD3args,       sizeof(MKNOD3args),       (xdrproc_t)xdr_MKNOD3res,       sizeof(MKNOD3res),       this},
-        {NFS_PROGRAM, NFS_V3, NFS3_REMOVE,      nfs3_remove_proc,      (xdrproc_t)xdr_REMOVE3args,      sizeof(REMOVE3args),      (xdrproc_t)xdr_REMOVE3res,      sizeof(REMOVE3res),      this},
-        {NFS_PROGRAM, NFS_V3, NFS3_RMDIR,       nfs3_rmdir_proc,       (xdrproc_t)xdr_RMDIR3args,       sizeof(RMDIR3args),       (xdrproc_t)xdr_RMDIR3res,       sizeof(RMDIR3res),       this},
-        {NFS_PROGRAM, NFS_V3, NFS3_RENAME,      nfs3_rename_proc,      (xdrproc_t)xdr_RENAME3args,      sizeof(RENAME3args),      (xdrproc_t)xdr_RENAME3res,      sizeof(RENAME3res),      this},
-        {NFS_PROGRAM, NFS_V3, NFS3_LINK,        nfs3_link_proc,        (xdrproc_t)xdr_LINK3args,        sizeof(LINK3args),        (xdrproc_t)xdr_LINK3res,        sizeof(LINK3res),        this},
-        {NFS_PROGRAM, NFS_V3, NFS3_READDIR,     nfs3_readdir_proc,     (xdrproc_t)xdr_READDIR3args,     sizeof(READDIR3args),     (xdrproc_t)xdr_READDIR3res,     sizeof(READDIR3res),     this},
-        {NFS_PROGRAM, NFS_V3, NFS3_READDIRPLUS, nfs3_readdirplus_proc, (xdrproc_t)xdr_READDIRPLUS3args, sizeof(READDIRPLUS3args), (xdrproc_t)xdr_READDIRPLUS3res, sizeof(READDIRPLUS3res), this},
-        {NFS_PROGRAM, NFS_V3, NFS3_FSSTAT,      nfs3_fsstat_proc,      (xdrproc_t)xdr_FSSTAT3args,      sizeof(FSSTAT3args),      (xdrproc_t)xdr_FSSTAT3res,      sizeof(FSSTAT3res),      this},
-        {NFS_PROGRAM, NFS_V3, NFS3_FSINFO,      nfs3_fsinfo_proc,      (xdrproc_t)xdr_FSINFO3args,      sizeof(FSINFO3args),      (xdrproc_t)xdr_FSINFO3res,      sizeof(FSINFO3res),      this},
-        {NFS_PROGRAM, NFS_V3, NFS3_PATHCONF,    nfs3_pathconf_proc,    (xdrproc_t)xdr_PATHCONF3args,    sizeof(PATHCONF3args),    (xdrproc_t)xdr_PATHCONF3res,    sizeof(PATHCONF3res),    this},
-        {NFS_PROGRAM, NFS_V3, NFS3_COMMIT,      nfs3_commit_proc,      (xdrproc_t)xdr_COMMIT3args,      sizeof(COMMIT3args),      (xdrproc_t)xdr_COMMIT3res,      sizeof(COMMIT3res),      this},
-        {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_NULL,    nfs3_null_proc,      NULL,                            0,                        NULL,                         0,                         this},
-        {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_MNT,     mount3_mnt_proc,     (xdrproc_t)xdr_nfs_dirpath,      sizeof(nfs_dirpath),      (xdrproc_t)xdr_nfs_mountres3, sizeof(nfs_mountres3),     this},
-        {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_DUMP,    mount3_dump_proc,    NULL,                            0,                        (xdrproc_t)xdr_nfs_mountlist, sizeof(nfs_mountlist),     this},
-        {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_UMNT,    mount3_umnt_proc,    (xdrproc_t)xdr_nfs_dirpath,      sizeof(nfs_dirpath),      NULL,                         0,                         this},
-        {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_UMNTALL, mount3_umntall_proc, NULL,                            0,                        NULL,                         0,                         this},
-        {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_EXPORT,  mount3_export_proc,  NULL,                            0,                        (xdrproc_t)xdr_nfs_exports,   sizeof(nfs_exports),       this},
+        {NFS_PROGRAM, NFS_V3, NFS3_NULL,          nfs3_null_proc,              NULL,                            0,                        NULL,                           0,                       self},
+        {NFS_PROGRAM, NFS_V3, NFS3_GETATTR,       block_nfs3_getattr_proc,     (xdrproc_t)xdr_GETATTR3args,     sizeof(GETATTR3args),     (xdrproc_t)xdr_GETATTR3res,     sizeof(GETATTR3res),     self},
+        {NFS_PROGRAM, NFS_V3, NFS3_SETATTR,       block_nfs3_setattr_proc,     (xdrproc_t)xdr_SETATTR3args,     sizeof(SETATTR3args),     (xdrproc_t)xdr_SETATTR3res,     sizeof(SETATTR3res),     self},
+        {NFS_PROGRAM, NFS_V3, NFS3_LOOKUP,        block_nfs3_lookup_proc,      (xdrproc_t)xdr_LOOKUP3args,      sizeof(LOOKUP3args),      (xdrproc_t)xdr_LOOKUP3res,      sizeof(LOOKUP3res),      self},
+        {NFS_PROGRAM, NFS_V3, NFS3_ACCESS,        block_nfs3_access_proc,      (xdrproc_t)xdr_ACCESS3args,      sizeof(ACCESS3args),      (xdrproc_t)xdr_ACCESS3res,      sizeof(ACCESS3res),      self},
+        {NFS_PROGRAM, NFS_V3, NFS3_READLINK,      block_nfs3_readlink_proc,    (xdrproc_t)xdr_READLINK3args,    sizeof(READLINK3args),    (xdrproc_t)xdr_READLINK3res,    sizeof(READLINK3res),    self},
+        {NFS_PROGRAM, NFS_V3, NFS3_READ,          block_nfs3_read_proc,        (xdrproc_t)xdr_READ3args,        sizeof(READ3args),        (xdrproc_t)xdr_READ3res,        sizeof(READ3res),        self},
+        {NFS_PROGRAM, NFS_V3, NFS3_WRITE,         block_nfs3_write_proc,       (xdrproc_t)xdr_WRITE3args,       sizeof(WRITE3args),       (xdrproc_t)xdr_WRITE3res,       sizeof(WRITE3res),       self},
+        {NFS_PROGRAM, NFS_V3, NFS3_CREATE,        block_nfs3_create_proc,      (xdrproc_t)xdr_CREATE3args,      sizeof(CREATE3args),      (xdrproc_t)xdr_CREATE3res,      sizeof(CREATE3res),      self},
+        {NFS_PROGRAM, NFS_V3, NFS3_MKDIR,         block_nfs3_mkdir_proc,       (xdrproc_t)xdr_MKDIR3args,       sizeof(MKDIR3args),       (xdrproc_t)xdr_MKDIR3res,       sizeof(MKDIR3res),       self},
+        {NFS_PROGRAM, NFS_V3, NFS3_SYMLINK,       block_nfs3_symlink_proc,     (xdrproc_t)xdr_SYMLINK3args,     sizeof(SYMLINK3args),     (xdrproc_t)xdr_SYMLINK3res,     sizeof(SYMLINK3res),     self},
+        {NFS_PROGRAM, NFS_V3, NFS3_MKNOD,         block_nfs3_mknod_proc,       (xdrproc_t)xdr_MKNOD3args,       sizeof(MKNOD3args),       (xdrproc_t)xdr_MKNOD3res,       sizeof(MKNOD3res),       self},
+        {NFS_PROGRAM, NFS_V3, NFS3_REMOVE,        block_nfs3_remove_proc,      (xdrproc_t)xdr_REMOVE3args,      sizeof(REMOVE3args),      (xdrproc_t)xdr_REMOVE3res,      sizeof(REMOVE3res),      self},
+        {NFS_PROGRAM, NFS_V3, NFS3_RMDIR,         block_nfs3_rmdir_proc,       (xdrproc_t)xdr_RMDIR3args,       sizeof(RMDIR3args),       (xdrproc_t)xdr_RMDIR3res,       sizeof(RMDIR3res),       self},
+        {NFS_PROGRAM, NFS_V3, NFS3_RENAME,        block_nfs3_rename_proc,      (xdrproc_t)xdr_RENAME3args,      sizeof(RENAME3args),      (xdrproc_t)xdr_RENAME3res,      sizeof(RENAME3res),      self},
+        {NFS_PROGRAM, NFS_V3, NFS3_LINK,          block_nfs3_link_proc,        (xdrproc_t)xdr_LINK3args,        sizeof(LINK3args),        (xdrproc_t)xdr_LINK3res,        sizeof(LINK3res),        self},
+        {NFS_PROGRAM, NFS_V3, NFS3_READDIR,       block_nfs3_readdir_proc,     (xdrproc_t)xdr_READDIR3args,     sizeof(READDIR3args),     (xdrproc_t)xdr_READDIR3res,     sizeof(READDIR3res),     self},
+        {NFS_PROGRAM, NFS_V3, NFS3_READDIRPLUS,   block_nfs3_readdirplus_proc, (xdrproc_t)xdr_READDIRPLUS3args, sizeof(READDIRPLUS3args), (xdrproc_t)xdr_READDIRPLUS3res, sizeof(READDIRPLUS3res), self},
+        {NFS_PROGRAM, NFS_V3, NFS3_FSSTAT,        nfs3_fsstat_proc,            (xdrproc_t)xdr_FSSTAT3args,      sizeof(FSSTAT3args),      (xdrproc_t)xdr_FSSTAT3res,      sizeof(FSSTAT3res),      self},
+        {NFS_PROGRAM, NFS_V3, NFS3_FSINFO,        nfs3_fsinfo_proc,            (xdrproc_t)xdr_FSINFO3args,      sizeof(FSINFO3args),      (xdrproc_t)xdr_FSINFO3res,      sizeof(FSINFO3res),      self},
+        {NFS_PROGRAM, NFS_V3, NFS3_PATHCONF,      nfs3_pathconf_proc,          (xdrproc_t)xdr_PATHCONF3args,    sizeof(PATHCONF3args),    (xdrproc_t)xdr_PATHCONF3res,    sizeof(PATHCONF3res),    self},
+        {NFS_PROGRAM, NFS_V3, NFS3_COMMIT,        nfs3_commit_proc,            (xdrproc_t)xdr_COMMIT3args,      sizeof(COMMIT3args),      (xdrproc_t)xdr_COMMIT3res,      sizeof(COMMIT3res),      self},
+        {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_NULL,    nfs3_null_proc,              NULL,                            0,                        NULL,                         0,                         self},
+        {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_MNT,     mount3_mnt_proc,             (xdrproc_t)xdr_nfs_dirpath,      sizeof(nfs_dirpath),      (xdrproc_t)xdr_nfs_mountres3, sizeof(nfs_mountres3),     self},
+        {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_DUMP,    mount3_dump_proc,            NULL,                            0,                        (xdrproc_t)xdr_nfs_mountlist, sizeof(nfs_mountlist),     self},
+        {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_UMNT,    mount3_umnt_proc,            (xdrproc_t)xdr_nfs_dirpath,      sizeof(nfs_dirpath),      NULL,                         0,                         self},
+        {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_UMNTALL, mount3_umntall_proc,         NULL,                            0,                        NULL,                         0,                         self},
+        {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_EXPORT,  mount3_export_proc,          NULL,                            0,                        (xdrproc_t)xdr_nfs_exports,   sizeof(nfs_exports),       self},
    };
    for (int i = 0; i < sizeof(pt)/sizeof(pt[0]); i++)
    {
-        proc_table.insert(pt[i]);
+        self->proc_table.insert(pt[i]);
    }
 }
-
-nfs_client_t::~nfs_client_t()
-{
-}
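Note on `block_fs_state_t::init()` above: it trims the configured `subdir` to a prefix with no leading slash and exactly one trailing slash, and the inode change hook then registers every parent directory of a changed image name. A minimal standalone sketch of those two steps follows. The helper names `normalize_prefix` and `parent_dirs` are hypothetical and not part of the commit, and the sketch uses `size_t` with `std::string::npos` where the commit uses `int` and `pos >= 0`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not in the commit): strip leading and trailing
// slashes, then append a single '/' if anything is left.
std::string normalize_prefix(std::string p)
{
    size_t e = p.size();
    while (e > 0 && p[e-1] == '/')
        e--;
    size_t s = 0;
    while (s < e && p[s] == '/')
        s++;
    p = p.substr(s, e-s);
    if (p.size())
        p += "/";
    return p;
}

// Hypothetical helper (not in the commit): list every parent directory of
// full_name that lies below the prefix, shortest first.
std::vector<std::string> parent_dirs(const std::string & full_name, const std::string & prefix)
{
    std::vector<std::string> dirs;
    size_t pos = full_name.find('/', prefix.size());
    while (pos != std::string::npos)
    {
        dirs.push_back(full_name.substr(0, pos));
        pos = full_name.find('/', pos+1);
    }
    return dirs;
}
```

With a prefix of `nfs/`, an image named `nfs/a/b/file` yields the directories `nfs/a` and `nfs/a/b`; the mount root itself is tracked separately under the empty key in `dir_info`.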
--- a/src/nfs_block.h
+++ b/src/nfs_block.h
@@ -0,0 +1,59 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// NFS proxy over Vitastor block images - header
+
+#pragma once
+
+struct nfs_dir_t
+{
+    uint64_t id;
+    uint64_t mod_rev;
+    timespec mtime;
+};
+
+struct extend_size_t
+{
+    inode_t inode;
+    uint64_t new_size;
+};
+
+inline bool operator < (const extend_size_t &a, const extend_size_t &b)
+{
+    return a.inode < b.inode || (a.inode == b.inode && a.new_size < b.new_size);
+}
+
+struct extend_write_t
+{
+    rpc_op_t *rop;
+    int resize_res, write_res; // 1 = started, 0 = completed OK, -errno = completed with error
+};
+
+struct extend_inode_t
+{
+    uint64_t cur_extend = 0, next_extend = 0;
+};
+
+struct block_fs_state_t
+{
+    std::string name_prefix;
+
+    // filehandle = "S"+base64(sha256(full name with prefix)), or "roothandle" for the mount root
+    uint64_t next_dir_id = 2;
+    // filehandle => dir with name_prefix
+    std::map<std::string, std::string> dir_by_hash;
+    // dir with name_prefix => dir info
+    std::map<std::string, nfs_dir_t> dir_info;
+    // filehandle => inode ID
+    std::map<std::string, inode_t> inode_by_hash;
+    // inode ID => filehandle
+    std::map<inode_t, std::string> hash_by_inode;
+
+    // inode extend requests in progress
+    std::map<inode_t, extend_inode_t> extends;
+    std::multimap<extend_size_t, extend_write_t> extend_writes;
+
+    void init(nfs_proxy_t *proxy, json11::Json cfg);
+};
+
+nfsstat3 vitastor_nfs_map_err(int err);
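Note on `extend_size_t` above: its `operator <` orders keys by inode first and by `new_size` second, so all entries for one inode are contiguous in `extend_writes` and sorted by target size. The sketch below shows the kind of range query this ordering allows. It is an illustration only: the `inode_t` typedef, the `int` payload and the helper `count_upto` are assumptions, and how the proxy actually walks the multimap is not shown in this part of the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <tuple>

typedef uint64_t inode_t; // assumption: stands in for Vitastor's inode_t

struct extend_size_t
{
    inode_t inode;
    uint64_t new_size;
};

// Same ordering as in the header, expressed through std::tie
inline bool operator < (const extend_size_t &a, const extend_size_t &b)
{
    return std::tie(a.inode, a.new_size) < std::tie(b.inode, b.new_size);
}

// Hypothetical helper: count entries of one inode with new_size <= size
size_t count_upto(const std::multimap<extend_size_t, int> & m, inode_t inode, uint64_t size)
{
    auto first = m.lower_bound(extend_size_t{ inode, 0 });
    auto last = m.upper_bound(extend_size_t{ inode, size });
    return (size_t)std::distance(first, last);
}
```

Because the keys are sorted lexicographically, both bounds are found in logarithmic time and entries of other inodes never fall inside the range.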
--- a/src/nfs_common.h
+++ b/src/nfs_common.h
@@ -0,0 +1,22 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// NFS proxy - common functions
+
+#pragma once
+
+#include "nfs/nfs.h"
+
+void nfs_block_procs(nfs_client_t *self);
+void nfs_kv_procs(nfs_client_t *self);
+int nfs3_fsstat_proc(void *opaque, rpc_op_t *rop);
+int nfs3_fsinfo_proc(void *opaque, rpc_op_t *rop);
+int nfs3_pathconf_proc(void *opaque, rpc_op_t *rop);
+int nfs3_access_proc(void *opaque, rpc_op_t *rop);
+int nfs3_null_proc(void *opaque, rpc_op_t *rop);
+int nfs3_commit_proc(void *opaque, rpc_op_t *rop);
+int mount3_mnt_proc(void *opaque, rpc_op_t *rop);
+int mount3_dump_proc(void *opaque, rpc_op_t *rop);
+int mount3_umnt_proc(void *opaque, rpc_op_t *rop);
+int mount3_umntall_proc(void *opaque, rpc_op_t *rop);
+int mount3_export_proc(void *opaque, rpc_op_t *rop);
--- a/src/nfs_fsstat.cpp
+++ b/src/nfs_fsstat.cpp
@@ -0,0 +1,129 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// NFS proxy - common FSSTAT, FSINFO, PATHCONF
+
+#include <sys/time.h>
+
+#include "nfs_proxy.h"
+#include "nfs_kv.h"
+
+// Get file system statistics
+int nfs3_fsstat_proc(void *opaque, rpc_op_t *rop)
+{
+    nfs_client_t *self = (nfs_client_t*)opaque;
+    //FSSTAT3args *args = (FSSTAT3args*)rop->request;
+    if (self->parent->trace)
+        fprintf(stderr, "[%d] FSSTAT\n", self->nfs_fd);
+    FSSTAT3res *reply = (FSSTAT3res*)rop->reply;
+    uint64_t tbytes = 0, fbytes = 0;
+    auto pst_it = self->parent->pool_stats.find(self->parent->default_pool_id);
+    if (pst_it != self->parent->pool_stats.end())
+    {
+        auto ttb = pst_it->second["total_raw_tb"].number_value();
+        auto ftb = (pst_it->second["total_raw_tb"].number_value() - pst_it->second["used_raw_tb"].number_value());
+        tbytes = ttb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)1<<40);
+        fbytes = ftb / pst_it->second["raw_to_usable"].number_value() * ((uint64_t)1<<40);
+    }
+    *reply = (FSSTAT3res){
+        .status = NFS3_OK,
+        .resok = (FSSTAT3resok){
+            .obj_attributes = {
+                .attributes_follow = 0,
+                //.attributes = get_root_attributes(self),
+            },
+            .tbytes = tbytes, // total bytes
+            .fbytes = fbytes, // free bytes
+            .abytes = fbytes, // available bytes
+            .tfiles = (size3)1 << (63-POOL_ID_BITS), // maximum total files
+            .ffiles = (size3)1 << (63-POOL_ID_BITS), // free files
+            .afiles = (size3)1 << (63-POOL_ID_BITS), // available files
+            .invarsec = 0,
+        },
+    };
+    rpc_queue_reply(rop);
+    return 0;
+}
+
+int nfs3_fsinfo_proc(void *opaque, rpc_op_t *rop)
+{
+    nfs_client_t *self = (nfs_client_t*)opaque;
+    FSINFO3args *args = (FSINFO3args*)rop->request;
+    FSINFO3res *reply = (FSINFO3res*)rop->reply;
+    if (self->parent->trace)
+        fprintf(stderr, "[%d] FSINFO %s\n", self->nfs_fd, std::string(args->fsroot).c_str());
+    if (args->fsroot != NFS_ROOT_HANDLE)
+    {
+        *reply = (FSINFO3res){ .status = NFS3ERR_INVAL };
+    }
+    else
+    {
+        // Fill info
+        bool_t x = FALSE;
+        *reply = (FSINFO3res){
+            .status = NFS3_OK,
+            .resok = (FSINFO3resok){
+                .obj_attributes = {
+                    // Without at least one reference to a non-constant value (local variable or something else),
+                    // with gcc 8 we get "internal compiler error: side-effects element in no-side-effects CONSTRUCTOR" here
+                    // FIXME: get rid of this after raising compiler requirement
+                    .attributes_follow = x,
+                    //.attributes = get_root_attributes(self),
+                },
+                .rtmax = 128*1024*1024,
+                .rtpref = 128*1024*1024,
+                .rtmult = 4096,
+                .wtmax = 128*1024*1024,
+                .wtpref = 128*1024*1024,
+                .wtmult = 4096,
+                .dtpref = 128,
+                .maxfilesize = 0x7fffffffffffffff,
+                .time_delta = {
+                    .seconds = 1,
+                    .nseconds = 0,
+                },
+                .properties = FSF3_SYMLINK | FSF3_HOMOGENEOUS,
+            },
+        };
+    }
+    rpc_queue_reply(rop);
+    return 0;
+}
+
+int nfs3_pathconf_proc(void *opaque, rpc_op_t *rop)
+{
+    nfs_client_t *self = (nfs_client_t*)opaque;
+    PATHCONF3args *args = (PATHCONF3args*)rop->request;
+    PATHCONF3res *reply = (PATHCONF3res*)rop->reply;
+    if (self->parent->trace)
+        fprintf(stderr, "[%d] PATHCONF %s\n", self->nfs_fd, std::string(args->object).c_str());
+    if (args->object != NFS_ROOT_HANDLE)
+    {
+        *reply = (PATHCONF3res){ .status = NFS3ERR_INVAL };
+    }
+    else
+    {
+        // Fill info
+        bool_t x = FALSE;
+        *reply = (PATHCONF3res){
+            .status = NFS3_OK,
+            .resok = (PATHCONF3resok){
+                .obj_attributes = {
+                    // Without at least one reference to a non-constant value (local variable or something else),
+                    // with gcc 8 we get "internal compiler error: side-effects element in no-side-effects CONSTRUCTOR" here
+                    // FIXME: get rid of this after raising compiler requirement
+                    .attributes_follow = x,
+                    //.attributes = get_root_attributes(self),
+                },
+                .linkmax = 0,
+                .name_max = 255,
+                .no_trunc = TRUE,
+                .chown_restricted = FALSE,
+                .case_insensitive = FALSE,
+                .case_preserving = TRUE,
+            },
+        };
+    }
+    rpc_queue_reply(rop);
+    return 0;
+}
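Note on `nfs3_fsstat_proc` above: it derives the usable capacity from three pool statistics, dividing the raw figures by `raw_to_usable` and scaling terabytes to bytes. A minimal sketch of that arithmetic follows, assuming that `total_raw_tb` and `used_raw_tb` are expressed in units of 2^40 bytes. The struct `usable_space_t`, the function `usable_space` and the guards against a zero ratio and a negative free value are illustrative additions, not part of the commit.

```cpp
#include <cassert>
#include <cstdint>

struct usable_space_t
{
    uint64_t total_bytes;
    uint64_t free_bytes;
};

// Hypothetical helper: convert raw pool statistics to usable byte counts.
// Assumes 1 TB here means 2^40 bytes.
usable_space_t usable_space(double total_raw_tb, double used_raw_tb, double raw_to_usable)
{
    const double tib = (double)((uint64_t)1 << 40);
    if (raw_to_usable <= 0)
        return { 0, 0 }; // no meaningful ratio, report nothing
    double free_raw_tb = total_raw_tb - used_raw_tb;
    if (free_raw_tb < 0)
        free_raw_tb = 0;
    return {
        (uint64_t)(total_raw_tb / raw_to_usable * tib),
        (uint64_t)(free_raw_tb / raw_to_usable * tib),
    };
}
```

For example, a pool with 3 TB raw, 1.5 TB used and a raw-to-usable ratio of 3 (such as three replicas) reports 1 TiB total and 0.5 TiB free.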
--- a/src/nfs_kv.cpp
+++ b/src/nfs_kv.cpp
@@ -0,0 +1,332 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// NFS proxy over VitastorKV database - common functions
+
+#include <sys/time.h>
+
+#include "str_util.h"
+#include "nfs_proxy.h"
+#include "nfs_common.h"
+#include "nfs_kv.h"
+
+nfstime3 nfstime_from_str(const std::string & s)
+{
+    nfstime3 t = {};
+    auto p = s.find(".");
+    if (p != std::string::npos)
+    {
+        t.seconds = stoull_full(s.substr(0, p), 10);
+        t.nseconds = stoull_full(s.substr(p+1), 10);
+        p = s.size()-p-1;
+        for (; p < 9; p++)
+            t.nseconds *= 10;
+        for (; p > 9; p--)
+            t.nseconds /= 10;
+    }
+    else
+        t.seconds = stoull_full(s, 10);
+    return t;
+}
+
+static std::string timespec_to_str(timespec t)
+{
+    char buf[64];
+    snprintf(buf, sizeof(buf), "%ju.%09ju", (uintmax_t)t.tv_sec, (uintmax_t)t.tv_nsec);
+    int l = strlen(buf);
+    while (l > 0 && buf[l-1] == '0')
+        l--;
+    if (l > 0 && buf[l-1] == '.')
+        l--;
+    buf[l] = 0;
+    return buf;
+}
+
+std::string nfstime_to_str(nfstime3 t)
+{
+    return timespec_to_str((timespec){ .tv_sec = t.seconds, .tv_nsec = t.nseconds });
+}
+
+std::string nfstime_now_str()
+{
+    timespec t;
+    clock_gettime(CLOCK_REALTIME, &t);
+    return timespec_to_str(t);
+}
+
+int kv_map_type(const std::string & type)
+{
+    return (type == "" || type == "file" ? NF3REG :
+        (type == "dir" ? NF3DIR :
+        (type == "blk" ? NF3BLK :
+        (type == "chr" ? NF3CHR :
+        (type == "link" ? NF3LNK :
+        (type == "sock" ? NF3SOCK :
+        (type == "fifo" ? NF3FIFO : -1)))))));
+}
+
+fattr3 get_kv_attributes(nfs_client_t *self, uint64_t ino, json11::Json attrs)
+{
+    auto type = kv_map_type(attrs["type"].string_value());
+    auto mode = attrs["mode"].uint64_value();
+    auto nlink = attrs["nlink"].uint64_value();
+    nfstime3 mtime = nfstime_from_str(attrs["mtime"].string_value());
+    nfstime3 atime = attrs["atime"].is_null() ? mtime : nfstime_from_str(attrs["atime"].string_value());
+    nfstime3 ctime = attrs["ctime"].is_null() ? mtime : nfstime_from_str(attrs["ctime"].string_value());
+    // In theory we could store the binary structure itself, but JSON is simpler :-)
+    return (fattr3){
+        .type = (type <= 0 ? NF3REG : (ftype3)type),
+        .mode = (attrs["mode"].is_null() ? (type == NF3DIR ? 0755 : 0644) : (uint32_t)mode),
+        .nlink = (nlink == 0 ? 1 : (uint32_t)nlink),
+        .uid = (uint32_t)attrs["uid"].uint64_value(),
+        .gid = (uint32_t)attrs["gid"].uint64_value(),
+        .size = (type == NF3DIR ? 4096 : attrs["size"].uint64_value()),
+        // FIXME Counting actual used file size would require reworking statistics
+        .used = (type == NF3DIR ? 4096 : attrs["size"].uint64_value()),
+        .rdev = (type == NF3BLK || type == NF3CHR
+            ? (specdata3){ (uint32_t)attrs["major"].uint64_value(), (uint32_t)attrs["minor"].uint64_value() }
+            : (specdata3){}),
+        .fsid = self->parent->fsid,
+        .fileid = ino,
+        .atime = atime,
+        .mtime = mtime,
+        .ctime = ctime,
+    };
+}
+
+std::string kv_direntry_key(uint64_t dir_ino, const std::string & filename)
+{
+    // encode as: d <length> <hex dir_ino> / <filename>
+    char key[24] = { 0 };
+    snprintf(key, sizeof(key), "d-%jx/", dir_ino);
+    int n = strnlen(key, sizeof(key)-1) - 3;
+    if (n < 10)
+        key[1] = '0'+n;
+    else
+        key[1] = 'A'+(n-10);
+    return (char*)key + filename;
+}
+
+std::string kv_direntry_filename(const std::string & key)
+{
+    // decode as: d <length> <hex dir_ino> / <filename>
+    auto pos = key.find("/");
+    if (pos != std::string::npos)
+        return key.substr(pos+1);
+    return key;
+}
+
+std::string kv_inode_key(uint64_t ino)
+{
+    char key[32] = { 0 };
+    snprintf(key, sizeof(key), "i%x", INODE_POOL(ino));
+    int n = strnlen(key, sizeof(key)-1);
+    snprintf(key+n+1, sizeof(key)-n-1, "%jx", INODE_NO_POOL(ino));
+    int m = strnlen(key+n+1, sizeof(key)-n-2);
+    key[n] = 'G'+m;
+    return std::string(key);
+}
+
+std::string kv_fh(uint64_t ino)
+{
+    char key[32] = { 0 };
+    snprintf(key, sizeof(key), "S%jx", ino);
+    return key;
+}
+
+uint64_t kv_fh_inode(const std::string & fh)
+{
+    if (fh == NFS_ROOT_HANDLE)
+    {
+        return 1;
+    }
+    else if (fh[0] == 'S')
+    {
+        uint64_t ino = 0;
+        int r = sscanf(fh.c_str()+1, "%jx", &ino);
+        if (r == 1)
+            return ino;
+    }
+    return 0;
+}
+
+bool kv_fh_valid(const std::string & fh)
+{
+    return fh == NFS_ROOT_HANDLE || fh[0] == 'S';
+}
+
+void nfs_kv_procs(nfs_client_t *self)
+{
+    struct rpc_service_proc_t pt[] = {
+        {NFS_PROGRAM, NFS_V3, NFS3_NULL,        nfs3_null_proc,           NULL,                            0,                        NULL,                           0,                       self},
+        {NFS_PROGRAM, NFS_V3, NFS3_GETATTR,     kv_nfs3_getattr_proc,     (xdrproc_t)xdr_GETATTR3args,     sizeof(GETATTR3args),     (xdrproc_t)xdr_GETATTR3res,     sizeof(GETATTR3res),     self},
+        {NFS_PROGRAM, NFS_V3, NFS3_SETATTR,     kv_nfs3_setattr_proc,     (xdrproc_t)xdr_SETATTR3args,     sizeof(SETATTR3args),     (xdrproc_t)xdr_SETATTR3res,     sizeof(SETATTR3res),     self},
+        {NFS_PROGRAM, NFS_V3, NFS3_LOOKUP,      kv_nfs3_lookup_proc,      (xdrproc_t)xdr_LOOKUP3args,      sizeof(LOOKUP3args),      (xdrproc_t)xdr_LOOKUP3res,      sizeof(LOOKUP3res),      self},
+        {NFS_PROGRAM, NFS_V3, NFS3_ACCESS,      nfs3_access_proc,         (xdrproc_t)xdr_ACCESS3args,      sizeof(ACCESS3args),      (xdrproc_t)xdr_ACCESS3res,      sizeof(ACCESS3res),      self},
+        {NFS_PROGRAM, NFS_V3, NFS3_READLINK,    kv_nfs3_readlink_proc,    (xdrproc_t)xdr_READLINK3args,    sizeof(READLINK3args),    (xdrproc_t)xdr_READLINK3res,    sizeof(READLINK3res),    self},
+        {NFS_PROGRAM, NFS_V3, NFS3_READ,        kv_nfs3_read_proc,        (xdrproc_t)xdr_READ3args,        sizeof(READ3args),        (xdrproc_t)xdr_READ3res,        sizeof(READ3res),        self},
+        {NFS_PROGRAM, NFS_V3, NFS3_WRITE,       kv_nfs3_write_proc,       (xdrproc_t)xdr_WRITE3args,       sizeof(WRITE3args),       (xdrproc_t)xdr_WRITE3res,       sizeof(WRITE3res),       self},
+        {NFS_PROGRAM, NFS_V3, NFS3_CREATE,      kv_nfs3_create_proc,      (xdrproc_t)xdr_CREATE3args,      sizeof(CREATE3args),      (xdrproc_t)xdr_CREATE3res,      sizeof(CREATE3res),      self},
+        {NFS_PROGRAM, NFS_V3, NFS3_MKDIR,       kv_nfs3_mkdir_proc,       (xdrproc_t)xdr_MKDIR3args,       sizeof(MKDIR3args),       (xdrproc_t)xdr_MKDIR3res,       sizeof(MKDIR3res),       self},
+        {NFS_PROGRAM, NFS_V3, NFS3_SYMLINK,     kv_nfs3_symlink_proc,     (xdrproc_t)xdr_SYMLINK3args,     sizeof(SYMLINK3args),     (xdrproc_t)xdr_SYMLINK3res,     sizeof(SYMLINK3res),     self},
+        {NFS_PROGRAM, NFS_V3, NFS3_MKNOD,       kv_nfs3_mknod_proc,       (xdrproc_t)xdr_MKNOD3args,       sizeof(MKNOD3args),       (xdrproc_t)xdr_MKNOD3res,       sizeof(MKNOD3res),       self},
+        {NFS_PROGRAM, NFS_V3, NFS3_REMOVE,      kv_nfs3_remove_proc,      (xdrproc_t)xdr_REMOVE3args,      sizeof(REMOVE3args),      (xdrproc_t)xdr_REMOVE3res,      sizeof(REMOVE3res),      self},
+        {NFS_PROGRAM, NFS_V3, NFS3_RMDIR,       kv_nfs3_rmdir_proc,       (xdrproc_t)xdr_RMDIR3args,       sizeof(RMDIR3args),       (xdrproc_t)xdr_RMDIR3res,       sizeof(RMDIR3res),       self},
+        {NFS_PROGRAM, NFS_V3, NFS3_RENAME,      kv_nfs3_rename_proc,      (xdrproc_t)xdr_RENAME3args,      sizeof(RENAME3args),      (xdrproc_t)xdr_RENAME3res,      sizeof(RENAME3res),      self},
+        {NFS_PROGRAM, NFS_V3, NFS3_LINK,        kv_nfs3_link_proc,        (xdrproc_t)xdr_LINK3args,        sizeof(LINK3args),        (xdrproc_t)xdr_LINK3res,        sizeof(LINK3res),        self},
+        {NFS_PROGRAM, NFS_V3, NFS3_READDIR,     kv_nfs3_readdir_proc,     (xdrproc_t)xdr_READDIR3args,     sizeof(READDIR3args),     (xdrproc_t)xdr_READDIR3res,     sizeof(READDIR3res),     self},
+        {NFS_PROGRAM, NFS_V3, NFS3_READDIRPLUS, kv_nfs3_readdirplus_proc, (xdrproc_t)xdr_READDIRPLUS3args, sizeof(READDIRPLUS3args), (xdrproc_t)xdr_READDIRPLUS3res, sizeof(READDIRPLUS3res), self},
+        {NFS_PROGRAM, NFS_V3, NFS3_FSSTAT,      nfs3_fsstat_proc,         (xdrproc_t)xdr_FSSTAT3args,      sizeof(FSSTAT3args),      (xdrproc_t)xdr_FSSTAT3res,      sizeof(FSSTAT3res),      self},
+        {NFS_PROGRAM, NFS_V3, NFS3_FSINFO,      nfs3_fsinfo_proc,         (xdrproc_t)xdr_FSINFO3args,      sizeof(FSINFO3args),      (xdrproc_t)xdr_FSINFO3res,      sizeof(FSINFO3res),      self},
+        {NFS_PROGRAM, NFS_V3, NFS3_PATHCONF,    nfs3_pathconf_proc,       (xdrproc_t)xdr_PATHCONF3args,    sizeof(PATHCONF3args),    (xdrproc_t)xdr_PATHCONF3res,    sizeof(PATHCONF3res),    self},
+        {NFS_PROGRAM, NFS_V3, NFS3_COMMIT,      nfs3_commit_proc,         (xdrproc_t)xdr_COMMIT3args,      sizeof(COMMIT3args),      (xdrproc_t)xdr_COMMIT3res,      sizeof(COMMIT3res),      self},
+        {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_NULL,    nfs3_null_proc,         NULL,                            0,                        NULL,                         0,                         self},
+        {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_MNT,     mount3_mnt_proc,        (xdrproc_t)xdr_nfs_dirpath,      sizeof(nfs_dirpath),      (xdrproc_t)xdr_nfs_mountres3, sizeof(nfs_mountres3),     self},
+        {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_DUMP,    mount3_dump_proc,       NULL,                            0,                        (xdrproc_t)xdr_nfs_mountlist, sizeof(nfs_mountlist),     self},
+        {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_UMNT,    mount3_umnt_proc,       (xdrproc_t)xdr_nfs_dirpath,      sizeof(nfs_dirpath),      NULL,                         0,                         self},
+        {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_UMNTALL, mount3_umntall_proc,    NULL,                            0,                        NULL,                         0,                         self},
+        {MOUNT_PROGRAM, MOUNT_V3, MOUNT3_EXPORT,  mount3_export_proc,     NULL,                            0,                        (xdrproc_t)xdr_nfs_exports,   sizeof(nfs_exports),       self},
+    };
+    for (size_t i = 0; i < sizeof(pt)/sizeof(pt[0]); i++)
+    {
+        self->proc_table.insert(pt[i]);
+    }
+}
+
+void kv_fs_state_t::init(nfs_proxy_t *proxy, json11::Json cfg)
+{
+    this->proxy = proxy;
+    auto & pool_cfg = proxy->cli->st_cli.pool_config.at(proxy->default_pool_id);
+    fs_kv_inode = cfg["fs"].uint64_value();
+    if (fs_kv_inode)
+    {
+        if (!INODE_POOL(fs_kv_inode))
+        {
+            fprintf(stderr, "FS metadata inode number must include pool\n");
+            exit(1);
+        }
+    }
+    else
+    {
+        for (auto & ic: proxy->cli->st_cli.inode_config)
+        {
+            if (ic.second.name == cfg["fs"].string_value())
+            {
+                fs_kv_inode = ic.first;
+                break;
+            }
+        }
+        if (!fs_kv_inode)
+        {
+            fprintf(stderr, "FS metadata image \"%s\" does not exist\n", cfg["fs"].string_value().c_str());
+            exit(1);
+        }
+    }
+    if (proxy->cli->st_cli.inode_config.find(fs_kv_inode) != proxy->cli->st_cli.inode_config.end())
+    {
+        auto & name = proxy->cli->st_cli.inode_config.at(fs_kv_inode).name;
+        if (pool_cfg.used_for_fs != name)
+        {
+            fprintf(stderr, "Please mark pool as used for this file system with `vitastor-cli modify-pool --used-for-fs %s %s`\n",
+                name.c_str(), cfg["fs"].string_value().c_str());
+            exit(1);
+        }
+    }
+    auto img_it = proxy->cli->st_cli.inode_config.lower_bound(INODE_WITH_POOL(proxy->default_pool_id+1, 0));
+    if (img_it != proxy->cli->st_cli.inode_config.begin())
+    {
+        img_it--;
+        if (INODE_POOL(img_it->first) == proxy->default_pool_id)
+        {
+            idgen[proxy->default_pool_id].min_id = INODE_NO_POOL(img_it->first) + 1;
+        }
+    }
+    readdir_getattr_parallel = cfg["readdir_getattr_parallel"].uint64_value();
+    if (!readdir_getattr_parallel)
+        readdir_getattr_parallel = 8;
+    id_alloc_batch_size = cfg["id_alloc_batch_size"].uint64_value();
+    if (!id_alloc_batch_size)
+        id_alloc_batch_size = 200;
+    touch_interval = cfg["touch_interval"].uint64_value();
+    if (touch_interval < 100) // ms
+        touch_interval = 100;
+    pool_block_size = pool_cfg.pg_stripe_size;
+    pool_alignment = pool_cfg.bitmap_granularity;
+    // Open DB and wait
+    int open_res = 0;
+    bool open_done = false;
+    proxy->db = new kv_dbw_t(proxy->cli);
+    proxy->db->open(fs_kv_inode, cfg, [&](int res)
+    {
+        open_done = true;
+        open_res = res;
+    });
+    while (!open_done)
+    {
+        proxy->ringloop->loop();
+        if (open_done)
+            break;
+        proxy->ringloop->wait();
+    }
+    if (open_res < 0)
+    {
+        fprintf(stderr, "Failed to open key/value filesystem metadata index: %s (code %d)\n",
+            strerror(-open_res), open_res);
+        exit(1);
+    }
+    fs_inode_count = ((uint64_t)1 << (64-POOL_ID_BITS)) - 1;
+    shared_inode_threshold = pool_block_size;
+    if (!cfg["shared_inode_threshold"].is_null())
+    {
+        shared_inode_threshold = cfg["shared_inode_threshold"].uint64_value();
+    }
+    zero_block.resize(pool_block_size < 1048576 ? 1048576 : pool_block_size);
+    scrap_block.resize(pool_block_size < 1048576 ? 1048576 : pool_block_size);
+    touch_timer_id = proxy->epmgr->tfd->set_timer(touch_interval, true, [this](int){ touch_inodes(); });
+}
+
+kv_fs_state_t::~kv_fs_state_t()
+{
+    if (proxy && touch_timer_id >= 0)
+    {
+        proxy->epmgr->tfd->clear_timer(touch_timer_id);
+        touch_timer_id = -1;
+    }
+}
+
+static void touch_inode(nfs_proxy_t *proxy, inode_t ino, bool allow_cache)
+{
+    kv_read_inode(proxy, ino, [proxy, ino](int res, const std::string & value, json11::Json attrs)
+    {
+        if (!res)
+        {
+            auto ientry = attrs.object_items();
+            ientry["mtime"] = ientry["ctime"] = nfstime_now_str();
+            ientry.erase("verf");
+            // FIXME: Use "update" query
+            bool *found = new bool;
+            *found = true;
+            proxy->db->set(kv_inode_key(ino), json11::Json(ientry).dump(), [proxy, ino, found](int res)
+            {
+                if (!*found)
+                    res = -ENOENT;
+                delete found;
+                if (res == -EAGAIN)
+                    touch_inode(proxy, ino, false);
+            }, [value, found](int res, const std::string & old_value)
+            {
+                *found = res == 0;
+                return res == 0 && old_value == value;
+            });
+        }
+    }, allow_cache);
+}
+
+void kv_fs_state_t::touch_inodes()
+{
+    std::set<inode_t> q = std::move(touch_queue);
+    for (auto ino: q)
+    {
+        touch_inode(proxy, ino, true);
+    }
+}
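Reviewer's sketch, not part of the patch: kv_direntry_key() above prefixes the hex inode number with a one-character digit count, which appears intended to make keys compare in numeric inode order under plain byte-wise comparison. The standalone illustration below assumes the layout is exactly `d <length> <hex dir_ino> / <filename>` as the comment in the patch states. `sketch_direntry_key` and `to_hex` are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper (not in the patch): hex digits of v without padding
static std::string to_hex(uint64_t v)
{
    char buf[17]; // 16 hex digits + NUL
    snprintf(buf, sizeof(buf), "%" PRIx64, v);
    return buf;
}

// Sketch of the "d <length> <hex dir_ino> / <filename>" layout.
// The length marker is a single character: '1'..'9' for 1-9 hex digits
// and 'A'..'G' for 10-16 digits, so it grows with the digit count in ASCII order.
std::string sketch_direntry_key(uint64_t dir_ino, const std::string & filename)
{
    std::string hex = to_hex(dir_ino);
    size_t n = hex.size();
    assert(n >= 1 && n <= 16);
    char len = n < 10 ? (char)('0' + n) : (char)('A' + (n - 10));
    return std::string("d") + len + hex + "/" + filename;
}
```

With this layout the key for directory inode 0x2 sorts before the key for 0x10, whereas the unprefixed hex strings "2/" and "10/" would sort the other way round.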
--- a/src/nfs_kv.h
+++ b/src/nfs_kv.h
@@ -0,0 +1,134 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// NFS proxy over VitastorKV database - header
+
+#pragma once
+
+#include "nfs/nfs.h"
+
+#define KV_ROOT_INODE 1
+#define SHARED_FILE_MAGIC_V1 0x711A5158A6EDF17E
+
+struct nfs_kv_write_state;
+
+struct list_cookie_t
+{
+    uint64_t dir_ino, cookieverf, cookie;
+};
+
+inline bool operator < (const list_cookie_t & a, const list_cookie_t & b)
+{
+    return a.dir_ino < b.dir_ino || (a.dir_ino == b.dir_ino &&
+        (a.cookieverf < b.cookieverf || (a.cookieverf == b.cookieverf && a.cookie < b.cookie)));
+}
+
+struct list_cookie_val_t
+{
+    std::string key;
+};
+
+struct shared_alloc_queue_t
+{
+    nfs_kv_write_state *st;
+    int state;
+};
+
+struct kv_inode_extend_t
+{
+    int refcnt = 0;
+    uint64_t cur_extend = 0, next_extend = 0, done_extend = 0;
+    std::vector<std::function<void()>> waiters;
+};
+
+struct kv_idgen_t
+{
+    uint64_t next_id = 1, allocated_id = 0;
+    uint64_t min_id = 1;
+    std::vector<uint64_t> unallocated_ids;
+};
+
+struct kv_fs_state_t
+{
+    nfs_proxy_t *proxy = NULL;
+    int touch_timer_id = -1;
+
+    uint64_t fs_kv_inode = 0;
+    uint64_t fs_inode_count = 0;
+    int readdir_getattr_parallel = 8, id_alloc_batch_size = 200;
+    uint64_t pool_block_size = 0;
+    uint64_t pool_alignment = 0;
+    uint64_t shared_inode_threshold = 0;
+    uint64_t touch_interval = 1000;
+
+    std::map<list_cookie_t, list_cookie_val_t> list_cookies;
+    std::map<pool_id_t, kv_idgen_t> idgen;
+    std::vector<shared_alloc_queue_t> allocating_shared;
+    uint64_t cur_shared_inode = 0, cur_shared_offset = 0;
+    std::map<inode_t, kv_inode_extend_t> extends;
+    std::set<inode_t> touch_queue;
+
+    std::vector<uint8_t> zero_block;
+    std::vector<uint8_t> scrap_block;
+
+    void init(nfs_proxy_t *proxy, json11::Json cfg);
+    void touch_inodes();
+    ~kv_fs_state_t();
+};
+
+struct shared_file_header_t
+{
+    uint64_t magic = 0;
+    uint64_t inode = 0;
+    uint64_t alloc = 0;
+};
+
+struct nfs_rmw_t
+{
+    nfs_proxy_t *parent = NULL;
+    uint64_t ino = 0;
+    uint64_t offset = 0;
+    uint8_t *buf = NULL;
+    uint64_t size = 0;
+    uint8_t *part_buf = NULL;
+    uint64_t version = 0;
+    nfs_rmw_t *other = NULL;
+    std::function<void(nfs_rmw_t *)> cb;
+    int res = 0;
+};
+
+nfsstat3 vitastor_nfs_map_err(int err);
+nfstime3 nfstime_from_str(const std::string & s);
+std::string nfstime_to_str(nfstime3 t);
+std::string nfstime_now_str();
+int kv_map_type(const std::string & type);
+fattr3 get_kv_attributes(nfs_client_t *self, uint64_t ino, json11::Json attrs);
+std::string kv_direntry_key(uint64_t dir_ino, const std::string & filename);
+std::string kv_direntry_filename(const std::string & key);
+std::string kv_inode_key(uint64_t ino);
+std::string kv_fh(uint64_t ino);
+uint64_t kv_fh_inode(const std::string & fh);
+bool kv_fh_valid(const std::string & fh);
+void allocate_new_id(nfs_client_t *self, pool_id_t pool_id, std::function<void(int res, uint64_t new_id)> cb);
+void kv_read_inode(nfs_proxy_t *proxy, uint64_t ino,
+    std::function<void(int res, const std::string & value, json11::Json ientry)> cb,
+    bool allow_cache = false);
+uint64_t align_shared_size(nfs_client_t *self, uint64_t size);
+void nfs_do_rmw(nfs_rmw_t *rmw);
+
+int kv_nfs3_getattr_proc(void *opaque, rpc_op_t *rop);
+int kv_nfs3_setattr_proc(void *opaque, rpc_op_t *rop);
+int kv_nfs3_lookup_proc(void *opaque, rpc_op_t *rop);
+int kv_nfs3_readlink_proc(void *opaque, rpc_op_t *rop);
+int kv_nfs3_read_proc(void *opaque, rpc_op_t *rop);
+int kv_nfs3_write_proc(void *opaque, rpc_op_t *rop);
+int kv_nfs3_create_proc(void *opaque, rpc_op_t *rop);
+int kv_nfs3_mkdir_proc(void *opaque, rpc_op_t *rop);
+int kv_nfs3_symlink_proc(void *opaque, rpc_op_t *rop);
+int kv_nfs3_mknod_proc(void *opaque, rpc_op_t *rop);
+int kv_nfs3_remove_proc(void *opaque, rpc_op_t *rop);
+int kv_nfs3_rmdir_proc(void *opaque, rpc_op_t *rop);
+int kv_nfs3_rename_proc(void *opaque, rpc_op_t *rop);
+int kv_nfs3_link_proc(void *opaque, rpc_op_t *rop);
+int kv_nfs3_readdir_proc(void *opaque, rpc_op_t *rop);
+int kv_nfs3_readdirplus_proc(void *opaque, rpc_op_t *rop);
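Reviewer's sketch, not part of the patch: the `operator <` for list_cookie_t in this header is a lexicographic comparison over (dir_ino, cookieverf, cookie). One possible way to express the same ordering is `std::tie`, shown below on a stand-in struct. `cookie_sketch_t`, `remember_cookie` and `find_resume_key` are hypothetical names, and the map value is simplified to a plain string.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>

// Stand-in for list_cookie_t (hypothetical name)
struct cookie_sketch_t
{
    uint64_t dir_ino, cookieverf, cookie;
};

// Lexicographic order over (dir_ino, cookieverf, cookie)
inline bool operator < (const cookie_sketch_t & a, const cookie_sketch_t & b)
{
    return std::tie(a.dir_ino, a.cookieverf, a.cookie) <
        std::tie(b.dir_ino, b.cookieverf, b.cookie);
}

// Remember the DB key at which a listing identified by cookie c should resume
void remember_cookie(std::map<cookie_sketch_t, std::string> & cookies,
    cookie_sketch_t c, const std::string & key)
{
    assert(!key.empty()); // an empty key means "not found" in this sketch
    cookies[c] = key;
}

// Returns the remembered key, or an empty string if the cookie is unknown
std::string find_resume_key(const std::map<cookie_sketch_t, std::string> & cookies,
    cookie_sketch_t c)
{
    auto it = cookies.find(c);
    return it == cookies.end() ? std::string() : it->second;
}
```

Because the directory inode is the most significant field, all cookies of one directory are adjacent in the map.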
--- a/src/nfs_kv_create.cpp
+++ b/src/nfs_kv_create.cpp
@@ -0,0 +1,365 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// NFS proxy over VitastorKV database - CREATE, MKDIR, SYMLINK, MKNOD
+
+#include <sys/time.h>
+
+#include "str_util.h"
+#include "nfs_proxy.h"
+#include "nfs_kv.h"
+
+void allocate_new_id(nfs_client_t *self, pool_id_t pool_id, std::function<void(int res, uint64_t new_id)> cb)
+{
+    auto & idgen = self->parent->kvfs->idgen[pool_id];
+    if (idgen.unallocated_ids.size())
+    {
+        auto new_id = idgen.unallocated_ids.back();
+        idgen.unallocated_ids.pop_back();
+        cb(0, INODE_WITH_POOL(pool_id, new_id));
+        return;
+    }
+    else if (idgen.next_id <= idgen.allocated_id)
+    {
+        idgen.next_id++;
+        cb(0, INODE_WITH_POOL(pool_id, idgen.next_id-1));
+        return;
+    }
+    // FIXME: Maybe allow FS and block volumes to cohabitate in the same pool, but with different ID ranges
+    else if (idgen.next_id >= ((uint64_t)1 << (64-POOL_ID_BITS)))
+    {
+        cb(-ENOSPC, 0);
+        return;
+    }
+    self->parent->db->get((pool_id ? "id"+std::to_string(pool_id) : "id"), [=](int res, const std::string & prev_str)
+    {
+        auto & idgen = self->parent->kvfs->idgen[pool_id];
+        if (res < 0 && res != -ENOENT)
+        {
+            cb(res, 0);
+            return;
+        }
+        uint64_t prev_val = stoull_full(prev_str);
+        if (prev_val >= ((uint64_t)1 << (64-POOL_ID_BITS)))
+        {
+            cb(-ENOSPC, 0);
+            return;
+        }
+        if (prev_val < idgen.min_id)
+        {
+            prev_val = idgen.min_id;
+        }
+        uint64_t new_val = prev_val + self->parent->kvfs->id_alloc_batch_size;
+        if (new_val >= self->parent->kvfs->fs_inode_count)
+        {
+            new_val = self->parent->kvfs->fs_inode_count;
+        }
+        self->parent->db->set((pool_id ? "id"+std::to_string(pool_id) : "id"), std::to_string(new_val), [=](int res)
+        {
+            if (res == -EAGAIN)
+            {
+                // CAS failure - retry
+                allocate_new_id(self, pool_id, cb);
+            }
+            else if (res < 0)
+            {
+                cb(res, 0);
+            }
+            else
+            {
+                auto & idgen = self->parent->kvfs->idgen[pool_id];
+                idgen.next_id = prev_val+2;
+                idgen.allocated_id = new_val;
+                cb(0, INODE_WITH_POOL(pool_id, prev_val+1));
+            }
+        }, [prev_str](int res, const std::string & value)
+        {
+            // FIXME: Allow to modify value from CAS callback? ("update" query)
+            return res < 0 || value == prev_str; // prev_val may have been raised to min_id
+        });
+    });
+}
+
+struct kv_create_state
+{
+    nfs_client_t *self = NULL;
+    rpc_op_t *rop = NULL;
+    bool exclusive = false;
+    uint64_t verf = 0;
+    uint64_t dir_ino = 0;
+    std::string filename;
+    // state
+    int res = 0;
+    pool_id_t pool_id = 0;
+    uint64_t new_id = 0;
+    json11::Json::object attrobj;
+    json11::Json attrs;
+    std::string direntry_text;
+    uint64_t dup_ino = 0;
+    std::function<void(int res)> cb;
+};
+
+static void kv_continue_create(kv_create_state *st, int state)
+{
+    if (state == 0)      {}
+    else if (state == 1) goto resume_1;
+    else if (state == 2) goto resume_2;
+    else if (state == 3) goto resume_3;
+    else if (state == 4) goto resume_4;
+    else if (state == 5) goto resume_5;
+    if (st->self->parent->trace)
+        fprintf(stderr, "[%d] CREATE %ju/%s ATTRS %s\n", st->self->nfs_fd, st->dir_ino, st->filename.c_str(), json11::Json(st->attrobj).dump().c_str());
+    if (st->filename == "" || st->filename.find("/") != std::string::npos)
+    {
+        auto cb = std::move(st->cb);
+        cb(-EINVAL);
+        return;
+    }
+    st->attrobj["ctime"] = nfstime_now_str();
+    if (st->attrobj.find("mtime") == st->attrobj.end())
+        st->attrobj["mtime"] = st->attrobj["ctime"];
+    st->attrs = std::move(st->attrobj);
+resume_1:
+    // Generate inode ID
+    // Directories and special files don't need pool
+    st->pool_id = kv_map_type(st->attrs["type"].string_value()) == NF3REG
+        ? st->self->parent->default_pool_id
+        : 0;
+    allocate_new_id(st->self, st->pool_id, [st](int res, uint64_t new_id)
+    {
+        st->res = res;
+        st->new_id = new_id;
+        kv_continue_create(st, 2);
+    });
+    return;
+resume_2:
+    if (st->res < 0)
+    {
+        auto cb = std::move(st->cb);
+        cb(st->res);
+        return;
+    }
+    st->self->parent->db->set(kv_inode_key(st->new_id), st->attrs.dump(), [st](int res)
+    {
+        st->res = res;
+        kv_continue_create(st, 3);
+    }, [st](int res, const std::string & value)
+    {
+        return res == -ENOENT;
+    });
+    return;
+resume_3:
+    if (st->res == -EAGAIN)
+    {
+        // Inode ID generator failure - retry
+        goto resume_1;
+    }
+    if (st->res < 0)
+    {
+        auto cb = std::move(st->cb);
+        cb(st->res);
+        return;
+    }
+    {
+        auto direntry = json11::Json::object{ { "ino", st->new_id } };
+        if (st->attrs["type"].string_value() == "dir")
+        {
+            direntry["type"] = "dir";
+        }
+        st->direntry_text = json11::Json(direntry).dump();
+    }
+    // Set direntry
+    st->dup_ino = 0;
+    st->self->parent->db->set(kv_direntry_key(st->dir_ino, st->filename), st->direntry_text, [st](int res)
+    {
+        st->res = res;
+        kv_continue_create(st, 4);
+    }, [st](int res, const std::string & value)
+    {
+        // CAS compare - check that the key doesn't exist
+        if (res == 0)
+        {
+            std::string err;
+            auto direntry = json11::Json::parse(value, err);
+            if (err != "")
+            {
+                fprintf(stderr, "Invalid JSON in direntry %s = %s: %s, overwriting\n",
+                    kv_direntry_key(st->dir_ino, st->filename).c_str(), value.c_str(), err.c_str());
+                return true;
+            }
+            if (st->exclusive && direntry["verf"].uint64_value() == st->verf)
+            {
+                st->dup_ino = direntry["ino"].uint64_value();
+                return false;
+            }
+            return false;
+        }
+        return true;
+    });
+    return;
+resume_4:
+    if (st->res == -EAGAIN) // Direntry already exists
+    {
+        st->self->parent->db->del(kv_inode_key(st->new_id), [st](int res)
+        {
+            st->res = res;
+            kv_continue_create(st, 5);
+        });
+        return; // wait for the delete callback before touching st again
+resume_5:
+        if (st->res < 0)
+        {
+            fprintf(stderr, "Failed to delete duplicate inode %ju left from create: %s (code %d)\n", st->new_id, strerror(-st->res), st->res);
+        }
+        else
+        {
+            auto & idgen = st->self->parent->kvfs->idgen[INODE_POOL(st->new_id)];
+            idgen.unallocated_ids.push_back(INODE_NO_POOL(st->new_id));
+        }
+        if (st->dup_ino)
+        {
+            // Successfully created by the previous "exclusive" request
+            st->new_id = st->dup_ino;
+        }
+        st->res = st->dup_ino ? 0 : -EEXIST;
+    }
+    if (!st->res)
+    {
+        st->self->parent->kvfs->touch_queue.insert(st->dir_ino);
+    }
+    auto cb = std::move(st->cb);
+    cb(st->res);
+}
+
+static void kv_create_setattr(json11::Json::object & attrobj, sattr3 & sattr)
+{
+    if (sattr.mode.set_it)
+        attrobj["mode"] = (uint64_t)sattr.mode.mode;
+    if (sattr.uid.set_it)
+        attrobj["uid"] = (uint64_t)sattr.uid.uid;
+    if (sattr.gid.set_it)
+        attrobj["gid"] = (uint64_t)sattr.gid.gid;
+    if (sattr.atime.set_it)
+        attrobj["atime"] = nfstime_to_str(sattr.atime.atime);
+    if (sattr.mtime.set_it)
+        attrobj["mtime"] = nfstime_to_str(sattr.mtime.mtime);
+}
+
+template<class T, class Tok> static void kv_create_reply(kv_create_state *st, int res)
+{
+    T *reply = (T*)st->rop->reply;
+    if (res < 0)
+    {
+        *reply = (T){ .status = vitastor_nfs_map_err(-res) };
+    }
+    else
+    {
+        *reply = (T){
+            .status = NFS3_OK,
+            .resok = (Tok){
+                .obj = {
+                    .handle_follows = 1,
+                    .handle = xdr_copy_string(st->rop->xdrs, kv_fh(st->new_id)),
+                },
+                .obj_attributes = {
+                    .attributes_follow = 1,
+                    .attributes = get_kv_attributes(st->self, st->new_id, st->attrs),
+                },
+            },
+        };
+    }
+    rpc_queue_reply(st->rop);
+    delete st;
+}
+
+int kv_nfs3_create_proc(void *opaque, rpc_op_t *rop)
+{
+    kv_create_state *st = new kv_create_state;
+    st->self = (nfs_client_t*)opaque;
+    st->rop = rop;
+    auto args = (CREATE3args*)rop->request;
+    st->exclusive = args->how.mode == NFS_EXCLUSIVE;
+    st->verf = st->exclusive ? *(uint64_t*)&args->how.verf : 0;
+    st->dir_ino = kv_fh_inode(args->where.dir);
+    st->filename = args->where.name;
+    if (args->how.mode == NFS_EXCLUSIVE)
+    {
+        st->attrobj["verf"] = *(uint64_t*)&args->how.verf;
+    }
+    else if (args->how.mode == NFS_UNCHECKED)
+    {
+        kv_create_setattr(st->attrobj, args->how.obj_attributes);
+        if (args->how.obj_attributes.size.set_it)
+        {
+            st->attrobj["size"] = (uint64_t)args->how.obj_attributes.size.size;
+            st->attrobj["empty"] = true;
+        }
+    }
+    st->cb = [st](int res) { kv_create_reply<CREATE3res, CREATE3resok>(st, res); };
+    kv_continue_create(st, 0);
+    return 1;
+}
+
+int kv_nfs3_mkdir_proc(void *opaque, rpc_op_t *rop)
+{
+    kv_create_state *st = new kv_create_state;
+    st->self = (nfs_client_t*)opaque;
+    st->rop = rop;
+    auto args = (MKDIR3args*)rop->request;
+    st->dir_ino = kv_fh_inode(args->where.dir);
+    st->filename = args->where.name;
+    st->attrobj["type"] = "dir";
+    st->attrobj["parent_ino"] = st->dir_ino;
+    kv_create_setattr(st->attrobj, args->attributes);
+    st->cb = [st](int res) { kv_create_reply<MKDIR3res, MKDIR3resok>(st, res); };
+    kv_continue_create(st, 0);
+    return 1;
+}
+
+int kv_nfs3_symlink_proc(void *opaque, rpc_op_t *rop)
+{
+    kv_create_state *st = new kv_create_state;
+    st->self = (nfs_client_t*)opaque;
+    st->rop = rop;
+    auto args = (SYMLINK3args*)rop->request;
+    st->dir_ino = kv_fh_inode(args->where.dir);
+    st->filename = args->where.name;
+    st->attrobj["type"] = "link";
+    st->attrobj["symlink"] = (std::string)args->symlink.symlink_data;
+    kv_create_setattr(st->attrobj, args->symlink.symlink_attributes);
+    st->cb = [st](int res) { kv_create_reply<SYMLINK3res, SYMLINK3resok>(st, res); };
+    kv_continue_create(st, 0);
+    return 1;
+}
+
+int kv_nfs3_mknod_proc(void *opaque, rpc_op_t *rop)
+{
+    kv_create_state *st = new kv_create_state;
+    st->self = (nfs_client_t*)opaque;
+    st->rop = rop;
+    auto args = (MKNOD3args*)rop->request;
+    st->dir_ino = kv_fh_inode(args->where.dir);
+    st->filename = args->where.name;
+    if (args->what.type == NF3CHR || args->what.type == NF3BLK)
+    {
+        st->attrobj["type"] = (args->what.type == NF3CHR ? "chr" : "blk");
+        st->attrobj["major"] = (uint64_t)args->what.chr_device.spec.specdata1;
+        st->attrobj["minor"] = (uint64_t)args->what.chr_device.spec.specdata2;
+        kv_create_setattr(st->attrobj, args->what.chr_device.dev_attributes);
+    }
+    else if (args->what.type == NF3SOCK || args->what.type == NF3FIFO)
+    {
+        st->attrobj["type"] = (args->what.type == NF3SOCK ? "sock" : "fifo");
+        kv_create_setattr(st->attrobj, args->what.sock_attributes);
+    }
+    else
+    {
+        *(MKNOD3res*)rop->reply = (MKNOD3res){ .status = NFS3ERR_INVAL };
+        rpc_queue_reply(rop);
+        delete st;
+        return 0;
+    }
+    st->cb = [st](int res) { kv_create_reply<MKNOD3res, MKNOD3resok>(st, res); };
+    kv_continue_create(st, 0);
+    return 1;
+}
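Reviewer's sketch, not part of the patch: allocate_new_id() in this file hands out inode numbers from a locally reserved batch and reserves the next batch by advancing a shared counter with a compare-and-set write. The synchronous, single-threaded model below illustrates that idea only. `mock_kv_t`, `id_batch_t` and `alloc_id` are hypothetical, the real database API is asynchronous, and the pool ID, `min_id` and overflow handling are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// In-memory stand-in for the key-value database (hypothetical)
struct mock_kv_t
{
    std::map<std::string, std::string> data;

    // Write new_val only if the stored value still equals expected
    // (a missing key compares equal to an empty string). Returns false on conflict.
    bool compare_and_set(const std::string & key, const std::string & expected,
        const std::string & new_val)
    {
        auto it = data.find(key);
        std::string cur = it == data.end() ? std::string() : it->second;
        if (cur != expected)
            return false;
        data[key] = new_val;
        return true;
    }
};

// Per-process allocator state: IDs in [next_id, allocated_id] are reserved locally
struct id_batch_t
{
    uint64_t next_id = 1, allocated_id = 0;
};

// Returns the next free ID and reserves a new batch when the local one is used up
uint64_t alloc_id(mock_kv_t & db, id_batch_t & gen, uint64_t batch_size)
{
    assert(batch_size > 0);
    while (gen.next_id > gen.allocated_id)
    {
        auto it = db.data.find("id");
        std::string prev_str = it == db.data.end() ? std::string() : it->second;
        uint64_t prev_val = prev_str.empty() ? 0 : std::stoull(prev_str);
        uint64_t new_val = prev_val + batch_size;
        if (db.compare_and_set("id", prev_str, std::to_string(new_val)))
        {
            gen.next_id = prev_val + 1;
            gen.allocated_id = new_val;
        }
        // On conflict another allocator advanced the counter: loop and re-read it
    }
    return gen.next_id++;
}
```

Two allocators sharing one `mock_kv_t` receive disjoint ID ranges. The shared counter is written once per batch rather than once per created file.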
--- a/src/nfs_kv_getattr.cpp
+++ b/src/nfs_kv_getattr.cpp
@@ -0,0 +1,78 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// NFS proxy over VitastorKV database - GETATTR
+
+#include <sys/time.h>
+
+#include "nfs_proxy.h"
+#include "nfs_kv.h"
+
+// Attributes are always stored in the inode
+void kv_read_inode(nfs_proxy_t *proxy, uint64_t ino,
+    std::function<void(int res, const std::string & value, json11::Json ientry)> cb,
+    bool allow_cache)
+{
+    auto key = kv_inode_key(ino);
+    proxy->db->get(key, [=](int res, const std::string & value)
+    {
+        if (ino == KV_ROOT_INODE && res == -ENOENT)
+        {
+            // Allow root inode to not exist
+            cb(0, "", json11::Json(json11::Json::object{ { "type", "dir" } }));
+            return;
+        }
+        if (res < 0)
+        {
+            if (res != -ENOENT)
+                fprintf(stderr, "Error reading inode %s: %s (code %d)\n", kv_inode_key(ino).c_str(), strerror(-res), res);
+            cb(res, "", json11::Json());
+            return;
+        }
+        std::string err;
+        auto attrs = json11::Json::parse(value, err);
+        if (err != "")
+        {
+            fprintf(stderr, "Invalid JSON in inode %s = %s: %s\n", kv_inode_key(ino).c_str(), value.c_str(), err.c_str());
+            res = -EIO;
+        }
+        cb(res, value, attrs);
+    }, allow_cache);
+}
+
+int kv_nfs3_getattr_proc(void *opaque, rpc_op_t *rop)
+{
+    nfs_client_t *self = (nfs_client_t*)opaque;
+    GETATTR3args *args = (GETATTR3args*)rop->request;
+    GETATTR3res *reply = (GETATTR3res*)rop->reply;
+    std::string fh = args->object;
+    auto ino = kv_fh_inode(fh);
+    if (self->parent->trace)
+        fprintf(stderr, "[%d] GETATTR %ju\n", self->nfs_fd, ino);
+    if (!kv_fh_valid(fh))
+    {
+        *reply = (GETATTR3res){ .status = NFS3ERR_INVAL };
+        rpc_queue_reply(rop);
+        return 0;
+    }
+    kv_read_inode(self->parent, ino, [=](int res, const std::string & value, json11::Json attrs)
+    {
+        if (self->parent->trace)
+            fprintf(stderr, "[%d] GETATTR %ju -> %s\n", self->nfs_fd, ino, value.c_str());
+        if (res < 0)
+        {
+            *reply = (GETATTR3res){ .status = vitastor_nfs_map_err(-res) };
+        }
+        else
+        {
+            *reply = (GETATTR3res){
+                .status = NFS3_OK,
+                .resok = (GETATTR3resok){
+                    .obj_attributes = get_kv_attributes(self, ino, attrs),
+                },
+            };
+        }
+        rpc_queue_reply(rop);
+    });
+    return 1;
+}
--- a/src/nfs_kv_link.cpp
+++ b/src/nfs_kv_link.cpp
@@ -0,0 +1,193 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// NFS proxy over VitastorKV database - LINK
+
+#include <sys/time.h>
+
+#include "nfs_proxy.h"
+#include "nfs_kv.h"
+
+struct nfs_kv_link_state
+{
+    nfs_client_t *self = NULL;
+    rpc_op_t *rop = NULL;
+    uint64_t ino = 0;
+    uint64_t dir_ino = 0;
+    std::string filename;
+    std::string ientry_text;
+    json11::Json ientry;
+    bool retrying = false;
+    int wait = 0;
+    int res = 0, res2 = 0;
+    std::function<void(int)> cb;
+};
+
+static void nfs_kv_continue_link(nfs_kv_link_state *st, int state)
+{
+    // 1) Read the source inode
+    // 2) If it's a directory - fail with -EISDIR
+    // 3) Create the new direntry with the same inode reference
+    // 4) Update the inode entry with refcount++
+    // 5) Retry update if CAS failed but the inode exists
+    // 6) Otherwise fail and remove the new direntry
+    // Note: a crash between steps 3 and 4 may leave a direntry that is not counted in the inode's refcount,
+    // but the opposite order could leave an inode with a refcount that is too high
+    if (state == 0)      {}
+    else if (state == 1) goto resume_1;
+    else if (state == 2) goto resume_2;
+    else if (state == 3) goto resume_3;
+    else if (state == 4) goto resume_4;
+    else
+    {
+        fprintf(stderr, "BUG: invalid state in nfs_kv_continue_link()\n");
+        abort();
+    }
+resume_0:
+    // Check that the source inode exists and is not a directory
+    st->wait = st->retrying ? 1 : 2;
+    st->res2 = 0;
+    kv_read_inode(st->self->parent, st->ino, [st](int res, const std::string & value, json11::Json attrs)
+    {
+        st->res = res == 0 ? (attrs["type"].string_value() == "dir" ? -EISDIR : 0) : res;
+        st->ientry_text = value;
+        st->ientry = attrs;
+        if (!--st->wait)
+            nfs_kv_continue_link(st, 1);
+    });
+    if (!st->retrying)
+    {
+        // Check that the target directory exists and is a directory
+        kv_read_inode(st->self->parent, st->dir_ino, [st](int res, const std::string & value, json11::Json attrs)
+        {
+            st->res2 = res == 0 ? (attrs["type"].string_value() == "dir" ? 0 : -ENOTDIR) : res;
+            if (!--st->wait)
+                nfs_kv_continue_link(st, 1);
+        });
+    }
+    return;
+resume_1:
+    if (st->res < 0 || st->res2 < 0)
+    {
+        auto cb = std::move(st->cb);
+        cb(st->res < 0 ? st->res : st->res2);
+        return;
+    }
+    // Write the new direntry
+    if (!st->retrying)
+    {
+        st->self->parent->db->set(kv_direntry_key(st->dir_ino, st->filename),
+            json11::Json(json11::Json::object{ { "ino", st->ino } }).dump(), [st](int res)
+        {
+            st->res = res;
+            nfs_kv_continue_link(st, 2);
+        }, [st](int res, const std::string & old_value)
+        {
+            return res == -ENOENT;
+        });
+        return;
+resume_2:
+        if (st->res < 0)
+        {
+            auto cb = std::move(st->cb);
+            cb(st->res);
+            return;
+        }
+    }
+    // Increase inode refcount
+    {
+        auto new_ientry = st->ientry.object_items();
+        auto nlink = new_ientry["nlink"].uint64_value();
+        new_ientry["nlink"] = nlink ? nlink+1 : 2;
+        new_ientry["ctime"] = nfstime_now_str();
+        st->ientry = new_ientry;
+    }
+    st->self->parent->db->set(kv_inode_key(st->ino), st->ientry.dump(), [st](int res)
+    {
+        st->res = res;
+        nfs_kv_continue_link(st, 3);
+    }, [st](int res, const std::string & old_value)
+    {
+        st->res2 = res;
+        return res == 0 && old_value == st->ientry_text;
+    });
+    return;
+resume_3:
+    if (st->res2 == -ENOENT)
+    {
+        st->res = -ENOENT;
+    }
+    if (st->res == -EAGAIN)
+    {
+        // Re-read inode and retry
+        st->retrying = true;
+        goto resume_0;
+    }
+    if (st->res < 0)
+    {
+        // Maybe inode was deleted in the meantime, delete our direntry
+        st->self->parent->db->del(kv_direntry_key(st->dir_ino, st->filename), [st](int res)
+        {
+            st->res2 = res;
+            nfs_kv_continue_link(st, 4);
+        });
+        return;
+resume_4:
+        if (st->res2 < 0)
+        {
+            fprintf(stderr, "Warning: failed to delete new linked direntry %ju/%s: %s (code %d)\n",
+                st->dir_ino, st->filename.c_str(), strerror(-st->res2), st->res2);
+        }
+    }
+    if (!st->res)
+    {
+        st->self->parent->kvfs->touch_queue.insert(st->dir_ino);
+    }
+    auto cb = std::move(st->cb);
+    cb(st->res);
+}
+
+int kv_nfs3_link_proc(void *opaque, rpc_op_t *rop)
+{
+    auto st = new nfs_kv_link_state;
+    st->self = (nfs_client_t*)opaque;
+    st->rop = rop;
+    LINK3args *args = (LINK3args*)rop->request;
+    st->ino = kv_fh_inode(args->file);
+    st->dir_ino = kv_fh_inode(args->link.dir);
+    st->filename = args->link.name;
+    if (st->self->parent->trace)
+        fprintf(stderr, "[%d] LINK %ju -> %ju/%s\n", st->self->nfs_fd, st->ino, st->dir_ino, st->filename.c_str());
+    if (!st->ino || !st->dir_ino || st->filename == "")
+    {
+        LINK3res *reply = (LINK3res*)rop->reply;
+        *reply = (LINK3res){ .status = NFS3ERR_INVAL };
+        rpc_queue_reply(rop);
+        delete st;
+        return 0;
+    }
+    st->cb = [st](int res)
+    {
+        LINK3res *reply = (LINK3res*)st->rop->reply;
+        if (res < 0)
+        {
+            *reply = (LINK3res){ .status = vitastor_nfs_map_err(res) };
+        }
+        else
+        {
+            *reply = (LINK3res){
+                .status = NFS3_OK,
+                .resok = (LINK3resok){
+                    .file_attributes = (post_op_attr){
+                        .attributes_follow = 1,
+                        .attributes = get_kv_attributes(st->self, st->ino, st->ientry),
+                    },
+                },
+            };
+        }
+        rpc_queue_reply(st->rop);
+        delete st;
+    };
+    nfs_kv_continue_link(st, 0);
+    return 1;
+}
--- a/src/nfs_kv_lookup.cpp
+++ b/src/nfs_kv_lookup.cpp
@@ -0,0 +1,104 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// NFS proxy over VitastorKV database - LOOKUP, READLINK
+
+#include <sys/time.h>
+
+#include "nfs_proxy.h"
+#include "nfs_kv.h"
+
+int kv_nfs3_lookup_proc(void *opaque, rpc_op_t *rop)
+{
+    nfs_client_t *self = (nfs_client_t*)opaque;
+    LOOKUP3args *args = (LOOKUP3args*)rop->request;
+    LOOKUP3res *reply = (LOOKUP3res*)rop->reply;
+    inode_t dir_ino = kv_fh_inode(args->what.dir);
+    std::string filename = args->what.name;
+    if (self->parent->trace)
+        fprintf(stderr, "[%d] LOOKUP %ju/%s\n", self->nfs_fd, dir_ino, filename.c_str());
+    if (!dir_ino || filename == "")
+    {
+        *reply = (LOOKUP3res){ .status = NFS3ERR_INVAL };
+        rpc_queue_reply(rop);
+        return 0;
+    }
+    self->parent->db->get(kv_direntry_key(dir_ino, filename), [=](int res, const std::string & value)
+    {
+        if (res < 0)
+        {
+            *reply = (LOOKUP3res){ .status = vitastor_nfs_map_err(-res) };
+            rpc_queue_reply(rop);
+            return;
+        }
+        std::string err;
+        auto direntry = json11::Json::parse(value, err);
+        if (err != "")
+        {
+            fprintf(stderr, "Invalid JSON in direntry %s = %s: %s\n", kv_direntry_key(dir_ino, filename).c_str(), value.c_str(), err.c_str());
+            *reply = (LOOKUP3res){ .status = NFS3ERR_IO };
+            rpc_queue_reply(rop);
+            return;
+        }
+        uint64_t ino = direntry["ino"].uint64_value();
+        kv_read_inode(self->parent, ino, [=](int res, const std::string & value, json11::Json ientry)
+        {
+            if (res < 0)
+            {
+                *reply = (LOOKUP3res){ .status = vitastor_nfs_map_err(res == -ENOENT ? -EIO : res) };
+                rpc_queue_reply(rop);
+                return;
+            }
+            *reply = (LOOKUP3res){
+                .status = NFS3_OK,
+                .resok = (LOOKUP3resok){
+                    .object = xdr_copy_string(rop->xdrs, kv_fh(ino)),
+                    .obj_attributes = {
+                        .attributes_follow = 1,
+                        .attributes = get_kv_attributes(self, ino, ientry),
+                    },
+                },
+            };
+            rpc_queue_reply(rop);
+        });
+    });
+    return 1;
+}
+
+int kv_nfs3_readlink_proc(void *opaque, rpc_op_t *rop)
+{
+    nfs_client_t *self = (nfs_client_t*)opaque;
+    READLINK3args *args = (READLINK3args*)rop->request;
+    if (self->parent->trace)
+        fprintf(stderr, "[%d] READLINK %ju\n", self->nfs_fd, kv_fh_inode(args->symlink));
+    READLINK3res *reply = (READLINK3res*)rop->reply;
+    if (!kv_fh_valid(args->symlink) || args->symlink == NFS_ROOT_HANDLE)
+    {
+        // Invalid filehandle or trying to read symlink from root entry
+        *reply = (READLINK3res){ .status = NFS3ERR_INVAL };
+        rpc_queue_reply(rop);
+        return 0;
+    }
+    kv_read_inode(self->parent, kv_fh_inode(args->symlink), [=](int res, const std::string & value, json11::Json attrs)
+    {
+        if (res < 0)
+        {
+            *reply = (READLINK3res){ .status = vitastor_nfs_map_err(-res) };
+        }
+        else if (attrs["type"] != "link")
+        {
+            *reply = (READLINK3res){ .status = NFS3ERR_INVAL };
+        }
+        else
+        {
+            *reply = (READLINK3res){
+                .status = NFS3_OK,
+                .resok = (READLINK3resok){
+                    .data = xdr_copy_string(rop->xdrs, attrs["symlink"].string_value()),
+                },
+            };
+        }
+        rpc_queue_reply(rop);
+    });
+    return 1;
+}
--- a/src/nfs_kv_read.cpp
+++ b/src/nfs_kv_read.cpp
@@ -0,0 +1,198 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// NFS proxy over VitastorKV database - READ
+
+#include <sys/time.h>
+
+#include "nfs_proxy.h"
+#include "nfs_kv.h"
+
+struct nfs_kv_read_state
+{
+    nfs_client_t *self = NULL;
+    rpc_op_t *rop = NULL;
+    bool allow_cache = true;
+    inode_t ino = 0;
+    uint64_t offset = 0, size = 0;
+    std::function<void(int)> cb;
+    // state
+    int res = 0;
+    int eof = 0;
+    json11::Json ientry;
+    uint64_t aligned_size = 0, aligned_offset = 0;
+    uint8_t *aligned_buf = NULL;
+    cluster_op_t *op = NULL;
+    uint8_t *buf = NULL;
+};
+
+#define align_down(size) ((size) & ~(st->self->parent->kvfs->pool_alignment-1))
+#define align_up(size) (((size) + st->self->parent->kvfs->pool_alignment-1) & ~(st->self->parent->kvfs->pool_alignment-1))
+
+static void nfs_kv_continue_read(nfs_kv_read_state *st, int state)
+{
+    if (state == 0)      {}
+    else if (state == 1) goto resume_1;
+    else if (state == 2) goto resume_2;
+    else if (state == 3) goto resume_3;
+    else
+    {
+        fprintf(stderr, "BUG: invalid state in nfs_kv_continue_read()\n");
+        abort();
+    }
+resume_0:
+    if (st->offset + sizeof(shared_file_header_t) < st->self->parent->kvfs->shared_inode_threshold)
+    {
+        kv_read_inode(st->self->parent, st->ino, [st](int res, const std::string & value, json11::Json attrs)
+        {
+            st->res = res;
+            st->ientry = attrs;
+            nfs_kv_continue_read(st, 1);
+        }, st->allow_cache);
+        return;
+resume_1:
+        if (st->res < 0 || kv_map_type(st->ientry["type"].string_value()) != NF3REG)
+        {
+            auto cb = std::move(st->cb);
+            cb(st->res < 0 ? st->res : -EINVAL);
+            return;
+        }
+        if (st->ientry["shared_ino"].uint64_value() != 0)
+        {
+            if (st->offset >= st->ientry["size"].uint64_value())
+            {
+                st->size = 0;
+                st->eof = 1;
+                auto cb = std::move(st->cb);
+                cb(0);
+                return;
+            }
+            st->op = new cluster_op_t;
+            {
+                st->op->opcode = OSD_OP_READ;
+                st->op->inode = st->ientry["shared_ino"].uint64_value();
+                // Always read including header to react if the file was possibly moved away
+                auto read_offset = st->ientry["shared_offset"].uint64_value();
+                st->op->offset = align_down(read_offset);
+                if (st->op->offset < read_offset)
+                {
+                    st->op->iov.push_back(st->self->parent->kvfs->scrap_block.data(),
+                        read_offset-st->op->offset);
+                }
+                auto read_size = st->offset+st->size;
+                if (read_size > st->ientry["size"].uint64_value())
+                {
+                    st->eof = 1;
+                    st->size = st->ientry["size"].uint64_value()-st->offset;
+                    read_size = st->ientry["size"].uint64_value();
+                }
+                read_size += sizeof(shared_file_header_t);
+                assert(!st->aligned_buf);
+                st->aligned_buf = (uint8_t*)malloc_or_die(read_size);
+                st->buf = st->aligned_buf + sizeof(shared_file_header_t) + st->offset;
+                st->op->iov.push_back(st->aligned_buf, read_size);
+                st->op->len = align_up(read_offset+read_size) - st->op->offset;
+                if (read_offset+read_size < st->op->offset+st->op->len)
+                {
+                    st->op->iov.push_back(st->self->parent->kvfs->scrap_block.data(),
+                        st->op->offset+st->op->len - (read_offset+read_size));
+                }
+            }
+            st->op->callback = [st](cluster_op_t *op)
+            {
+                st->res = op->retval == op->len ? 0 : op->retval;
+                delete op;
+                nfs_kv_continue_read(st, 2);
+            };
+            st->self->parent->cli->execute(st->op);
+            return;
+resume_2:
+            if (st->res < 0)
+            {
+                free(st->aligned_buf);
+                st->aligned_buf = NULL;
+                auto cb = std::move(st->cb);
+                cb(st->res);
+                return;
+            }
+            auto hdr = ((shared_file_header_t*)st->aligned_buf);
+            if (hdr->magic != SHARED_FILE_MAGIC_V1 || hdr->inode != st->ino)
+            {
+                // Got unrelated data - retry from the beginning
+                free(st->aligned_buf);
+                st->aligned_buf = NULL;
+                st->allow_cache = false;
+                goto resume_0;
+            }
+            auto cb = std::move(st->cb);
+            cb(0);
+            return;
+        }
+    }
+    st->aligned_offset = align_down(st->offset);
+    st->aligned_size = align_up(st->offset+st->size) - st->aligned_offset;
+    assert(!st->aligned_buf);
+    st->aligned_buf = (uint8_t*)malloc_or_die(st->aligned_size);
+    st->buf = st->aligned_buf + st->offset - st->aligned_offset;
+    st->op = new cluster_op_t;
+    st->op->opcode = OSD_OP_READ;
+    st->op->inode = st->ino;
+    st->op->offset = st->aligned_offset;
+    st->op->len = st->aligned_size;
+    st->op->iov.push_back(st->aligned_buf, st->aligned_size);
+    st->op->callback = [st](cluster_op_t *op)
+    {
+        st->res = op->retval;
+        delete op;
+        nfs_kv_continue_read(st, 3);
+    };
+    st->self->parent->cli->execute(st->op);
+    return;
+resume_3:
+    if (st->res < 0)
+    {
+        free(st->aligned_buf);
+        st->aligned_buf = NULL;
+    }
+    auto cb = std::move(st->cb);
+    cb(st->res < 0 ? st->res : 0);
+    return;
+}
+
+int kv_nfs3_read_proc(void *opaque, rpc_op_t *rop)
+{
+    READ3args *args = (READ3args*)rop->request;
+    READ3res *reply = (READ3res*)rop->reply;
+    auto ino = kv_fh_inode(args->file);
+    if (args->count > MAX_REQUEST_SIZE || !ino)
+    {
+        *reply = (READ3res){ .status = NFS3ERR_INVAL };
+        rpc_queue_reply(rop);
+        return 0;
+    }
+    auto st = new nfs_kv_read_state;
+    st->self = (nfs_client_t*)opaque;
+    st->rop = rop;
+    st->ino = ino;
+    st->offset = args->offset;
+    st->size = args->count;
+    st->cb = [st](int res)
+    {
+        READ3res *reply = (READ3res*)st->rop->reply;
+        *reply = (READ3res){ .status = vitastor_nfs_map_err(res) };
+        if (res == 0)
+        {
+            xdr_add_malloc(st->rop->xdrs, st->aligned_buf);
+            reply->resok.data.data = (char*)st->buf;
+            reply->resok.data.size = st->size;
+            reply->resok.count = st->size;
+            reply->resok.eof = st->eof;
+        }
+        rpc_queue_reply(st->rop);
+        delete st;
+    };
+    if (st->self->parent->trace)
+        fprintf(stderr, "[%d] READ %ju %ju+%ju\n", st->self->nfs_fd, st->ino, st->offset, st->size);
+    nfs_kv_continue_read(st, 0);
+    return 1;
+}
--- a/src/nfs_kv_readdir.cpp
+++ b/src/nfs_kv_readdir.cpp
@@ -0,0 +1,375 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// NFS proxy over VitastorKV database - READDIR, READDIRPLUS
+
+#include <sys/time.h>
+
+#include "nfs_proxy.h"
+#include "nfs_kv.h"
+
+static unsigned len_pad4(unsigned len)
+{
+    return len + (len&3 ? 4-(len&3) : 0);
+}
+
+struct nfs_kv_readdir_state
+{
+    nfs_client_t *self = NULL;
+    rpc_op_t *rop = NULL;
+    // Request:
+    bool is_plus = false;
+    uint64_t cookie = 0;
+    uint64_t cookieverf = 0;
+    uint64_t dir_ino = 0;
+    uint64_t maxcount = 0;
+    std::function<void(int)> cb;
+    // State:
+    int res = 0;
+    std::string prefix, start;
+    void *list_handle;
+    uint64_t parent_ino = 0;
+    std::string ientry_text, parent_ientry_text;
+    json11::Json ientry, parent_ientry;
+    std::string cur_key, cur_value;
+    int reply_size = 0;
+    int to_skip = 0;
+    uint64_t offset = 0;
+    int getattr_running = 0, getattr_cur = 0;
+    // Result:
+    bool eof = false;
+    //uint64_t cookieverf = 0; // same field
+    std::vector<entryplus3> entries;
+};
+
+static void nfs_kv_continue_readdir(nfs_kv_readdir_state *st, int state);
+
+static void kv_getattr_next(nfs_kv_readdir_state *st)
+{
+    while (st->is_plus && st->getattr_cur < st->entries.size() && st->getattr_running < st->self->parent->kvfs->readdir_getattr_parallel)
+    {
+        auto idx = st->getattr_cur++;
+        st->getattr_running++;
+        kv_read_inode(st->self->parent, st->entries[idx].fileid, [st, idx](int res, const std::string & value, json11::Json ientry)
+        {
+            if (res == 0)
+            {
+                st->entries[idx].name_attributes = (post_op_attr){
+                    // FIXME: maybe do not read parent attributes and leave them to a GETATTR?
+                    .attributes_follow = 1,
+                    .attributes = get_kv_attributes(st->self, st->entries[idx].fileid, ientry),
+                };
+            }
+            st->getattr_running--;
+            kv_getattr_next(st);
+            if (st->getattr_running == 0 && !st->list_handle)
+            {
+                nfs_kv_continue_readdir(st, 4);
+            }
+        });
+    }
+}
+
+static void nfs_kv_continue_readdir(nfs_kv_readdir_state *st, int state)
+{
+    if (state == 0)      {}
+    else if (state == 1) goto resume_1;
+    else if (state == 2) goto resume_2;
+    else if (state == 3) goto resume_3;
+    else if (state == 4) goto resume_4;
+    else
+    {
+        fprintf(stderr, "BUG: invalid state in nfs_kv_continue_readdir()\n");
+        abort();
+    }
+    // Limit results based on maximum reply size
+    // Sadly we have to calculate reply size by hand
+    // reply without entries is 4+4+(dir_attributes ? sizeof(fattr3) : 0)+8+4 bytes
+    st->reply_size = 20;
+    if (st->reply_size > st->maxcount)
+    {
+        // Error, too small max reply size
+        auto cb = std::move(st->cb);
+        cb(-NFS3ERR_TOOSMALL);
+        return;
+    }
+    // Add . and ..
+    if (st->cookie <= 1)
+    {
+        kv_read_inode(st->self->parent, st->dir_ino, [st](int res, const std::string & value, json11::Json ientry)
+        {
+            st->res = res;
+            st->ientry_text = value;
+            st->ientry = ientry;
+            nfs_kv_continue_readdir(st, 1);
+        });
+        return;
+resume_1:
+        if (st->res < 0)
+        {
+            auto cb = std::move(st->cb);
+            cb(st->res);
+            return;
+        }
+        if (st->cookie == 0)
+        {
+            auto fh = kv_fh(st->dir_ino);
+            auto entry_size = 20 + 4/*len_pad4(".")*/ + (st->is_plus ? 8 + 88 + len_pad4(fh.size()) : 0);
+            if (st->reply_size + entry_size > st->maxcount)
+            {
+                auto cb = std::move(st->cb);
+                cb(-NFS3ERR_TOOSMALL);
+                return;
+            }
+            entryplus3 dot = {};
+            dot.name = xdr_copy_string(st->rop->xdrs, ".");
+            dot.fileid = st->dir_ino;
+            dot.name_attributes = (post_op_attr){
+                .attributes_follow = 1,
+                .attributes = get_kv_attributes(st->self, st->dir_ino, st->ientry),
+            };
+            dot.name_handle = (post_op_fh3){
+                .handle_follows = 1,
+                .handle = xdr_copy_string(st->rop->xdrs, fh),
+            };
+            st->entries.push_back(dot);
+            st->reply_size += entry_size;
+        }
+        st->parent_ino = st->ientry["parent_ino"].uint64_value();
+        if (st->parent_ino)
+        {
+            kv_read_inode(st->self->parent, st->ientry["parent_ino"].uint64_value(), [st](int res, const std::string & value, json11::Json ientry)
+            {
+                st->res = res;
+                st->parent_ientry_text = value;
+                st->parent_ientry = ientry;
+                nfs_kv_continue_readdir(st, 2);
+            });
+            return;
+resume_2:
+            if (st->res < 0)
+            {
+                auto cb = std::move(st->cb);
+                cb(st->res);
+                return;
+            }
+        }
+        auto fh = kv_fh(st->parent_ino);
+        auto entry_size = 20 + 4/*len_pad4("..")*/ + (st->is_plus ? 8 + 88 + len_pad4(fh.size()) : 0);
+        if (st->reply_size + entry_size > st->maxcount)
+        {
+            st->eof = false;
+            auto cb = std::move(st->cb);
+            cb(0);
+            return;
+        }
+        entryplus3 dotdot = {};
+        dotdot.name = xdr_copy_string(st->rop->xdrs, "..");
+        dotdot.fileid = st->parent_ino ? st->parent_ino : st->dir_ino;
+        dotdot.name_attributes = (post_op_attr){
+            // FIXME: maybe do not read parent attributes and leave them to a GETATTR?
+            .attributes_follow = 1,
+            .attributes = get_kv_attributes(st->self,
+                st->parent_ino ? st->parent_ino : st->dir_ino,
+                st->parent_ino ? st->parent_ientry : st->ientry),
+        };
+        dotdot.name_handle = (post_op_fh3){
+            .handle_follows = 1,
+            .handle = xdr_copy_string(st->rop->xdrs, fh),
+        };
+        st->entries.push_back(dotdot);
+        st->reply_size += entry_size;
+    }
+    st->prefix = kv_direntry_key(st->dir_ino, "");
+    st->eof = true;
+    st->start = st->prefix;
+    if (st->cookie > 1)
+    {
+        auto lc_it = st->self->parent->kvfs->list_cookies.find((list_cookie_t){ st->dir_ino, st->cookieverf, st->cookie });
+        if (lc_it != st->self->parent->kvfs->list_cookies.end())
+        {
+            st->start = st->prefix+lc_it->second.key;
+            st->to_skip = 1;
+            st->offset = st->cookie;
+        }
+        else
+        {
+            st->to_skip = st->cookie-2;
+            st->offset = 2;
+            st->cookieverf = ((uint64_t)lrand48() | ((uint64_t)lrand48() << 31) | ((uint64_t)lrand48() << 62));
+        }
+    }
+    else
+    {
+        st->to_skip = 0;
+        st->offset = 2;
+        st->cookieverf = ((uint64_t)lrand48() | ((uint64_t)lrand48() << 31) | ((uint64_t)lrand48() << 62));
+    }
+    {
+        auto lc_it = st->self->parent->kvfs->list_cookies.lower_bound((list_cookie_t){ st->dir_ino, st->cookieverf, 0 });
+        if (lc_it != st->self->parent->kvfs->list_cookies.end() &&
+            lc_it->first.dir_ino == st->dir_ino &&
+            lc_it->first.cookieverf == st->cookieverf &&
+            lc_it->first.cookie < st->cookie)
+        {
+            auto lc_start = lc_it;
+            while (lc_it != st->self->parent->kvfs->list_cookies.end() && lc_it->first.cookieverf == st->cookieverf)
+            {
+                lc_it++;
+            }
+            st->self->parent->kvfs->list_cookies.erase(lc_start, lc_it);
+        }
+    }
+    st->getattr_cur = st->entries.size();
+    st->list_handle = st->self->parent->db->list_start(st->start);
+    st->self->parent->db->list_next(st->list_handle, [=](int res, const std::string & key, const std::string & value)
+    {
+        st->res = res;
+        st->cur_key = key;
+        st->cur_value = value;
+        nfs_kv_continue_readdir(st, 3);
+    });
+    return;
+    while (st->list_handle)
+    {
+        st->self->parent->db->list_next(st->list_handle, NULL);
+        return;
+resume_3:
+        if (st->res == -ENOENT || st->cur_key.size() < st->prefix.size() || st->cur_key.substr(0, st->prefix.size()) != st->prefix)
+        {
+            st->self->parent->db->list_close(st->list_handle);
+            st->list_handle = NULL;
+            break;
+        }
+        if (st->to_skip > 0)
+        {
+            st->to_skip--;
+            continue;
+        }
+        std::string err;
+        auto direntry = json11::Json::parse(st->cur_value, err);
+        if (err != "")
+        {
+            fprintf(stderr, "readdir: direntry %s contains invalid JSON: %s, skipping\n",
+                st->cur_key.c_str(), st->cur_value.c_str());
+            continue;
+        }
+        auto ino = direntry["ino"].uint64_value();
+        auto name = kv_direntry_filename(st->cur_key);
+        if (st->self->parent->trace)
+        {
+            fprintf(stderr, "[%d] READDIR %ju %ju %s\n",
+                st->self->nfs_fd, st->dir_ino, st->offset, name.c_str());
+        }
+        auto fh = kv_fh(ino);
+        // 1 entry3 is (8+4+(filename_len+3)/4*4+8) bytes
+        // 1 entryplus3 is (8+4+(filename_len+3)/4*4+8
+        //   + 4+(name_attributes ? (sizeof(fattr3) = 84) : 0)
+        //   + 4+(name_handle ? 4+(handle_len+3)/4*4 : 0)) bytes
+        auto entry_size = 20 + len_pad4(name.size()) + (st->is_plus ? 8 + 88 + len_pad4(fh.size()) : 0);
+        if (st->reply_size + entry_size > st->maxcount)
+        {
+            st->eof = false;
+            st->self->parent->db->list_close(st->list_handle);
+            st->list_handle = NULL;
+            break;
+        }
+        st->reply_size += entry_size;
+        auto idx = st->entries.size();
+        st->entries.push_back((entryplus3){});
+        auto entry = &st->entries[idx];
+        entry->name = xdr_copy_string(st->rop->xdrs, name);
+        entry->fileid = ino;
+        entry->cookie = st->offset++;
+        st->self->parent->kvfs->list_cookies[(list_cookie_t){ st->dir_ino, st->cookieverf, entry->cookie }] = { .key = name };
+        if (st->is_plus)
+        {
+            entry->name_handle = (post_op_fh3){
+                .handle_follows = 1,
+                .handle = xdr_copy_string(st->rop->xdrs, fh),
+            };
+            kv_getattr_next(st);
+        }
+    }
+resume_4:
+    while (st->getattr_running > 0)
+    {
+        return;
+    }
+    void *prev = NULL;
+    for (int i = 0; i < st->entries.size(); i++)
+    {
+        entryplus3 *entry = &st->entries[i];
+        if (prev)
+        {
+            if (st->is_plus)
+                ((entryplus3*)prev)->nextentry = entry;
+            else
+                ((entry3*)prev)->nextentry = (entry3*)entry;
+        }
+        prev = entry;
+    }
+    // Send reply
+    auto cb = std::move(st->cb);
+    cb(0);
+}
+
+static void nfs3_readdir_common(void *opaque, rpc_op_t *rop, bool is_plus)
+{
+    auto st = new nfs_kv_readdir_state;
+    st->self = (nfs_client_t*)opaque;
+    st->rop = rop;
+    st->is_plus = is_plus;
+    if (st->is_plus)
+    {
+        READDIRPLUS3args *args = (READDIRPLUS3args*)rop->request;
+        st->dir_ino = kv_fh_inode(args->dir);
+        st->cookie = args->cookie;
+        st->cookieverf = *((uint64_t*)args->cookieverf);
+        st->maxcount = args->maxcount;
+    }
+    else
+    {
+        READDIR3args *args = ((READDIR3args*)rop->request);
+        st->dir_ino = kv_fh_inode(args->dir);
+        st->cookie = args->cookie;
+        st->cookieverf = *((uint64_t*)args->cookieverf);
+        st->maxcount = args->count;
+    }
+    if (st->self->parent->trace)
+        fprintf(stderr, "[%d] READDIR %ju VERF %jx OFFSET %ju LIMIT %ju\n", st->self->nfs_fd, st->dir_ino, st->cookieverf, st->cookie, st->maxcount);
+    st->cb = [st](int res)
+    {
+        if (st->is_plus)
+        {
+            READDIRPLUS3res *reply = (READDIRPLUS3res*)st->rop->reply;
+            *reply = (READDIRPLUS3res){ .status = vitastor_nfs_map_err(res) };
+            *(uint64_t*)(reply->resok.cookieverf) = st->cookieverf;
+            reply->resok.reply.entries = st->entries.size() ? &st->entries[0] : NULL;
+            reply->resok.reply.eof = st->eof;
+        }
+        else
+        {
+            READDIR3res *reply = (READDIR3res*)st->rop->reply;
+            *reply = (READDIR3res){ .status = vitastor_nfs_map_err(res) };
+            *(uint64_t*)(reply->resok.cookieverf) = st->cookieverf;
+            reply->resok.reply.entries = st->entries.size() ? (entry3*)&st->entries[0] : NULL;
+            reply->resok.reply.eof = st->eof;
+        }
+        rpc_queue_reply(st->rop);
+        delete st;
+    };
+    nfs_kv_continue_readdir(st, 0);
+}
+
+int kv_nfs3_readdir_proc(void *opaque, rpc_op_t *rop)
+{
+    nfs3_readdir_common(opaque, rop, false);
+    return 0;
+}
+
+int kv_nfs3_readdirplus_proc(void *opaque, rpc_op_t *rop)
+{
+    nfs3_readdir_common(opaque, rop, true);
+    return 0;
+}
--- a/src/nfs_kv_remove.cpp
+++ b/src/nfs_kv_remove.cpp
@@ -0,0 +1,321 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// NFS proxy over VitastorKV database - REMOVE, RMDIR
+
+#include <sys/time.h>
+
+#include "nfs_proxy.h"
+#include "nfs_kv.h"
+#include "cli.h"
+
+struct kv_del_state
+{
+    nfs_client_t *self = NULL;
+    rpc_op_t *rop = NULL;
+    uint64_t dir_ino = 0;
+    std::string filename;
+    uint64_t ino = 0;
+    void *list_handle = NULL;
+    std::string prefix, list_key, direntry_text, ientry_text;
+    json11::Json direntry, ientry;
+    int type = 0;
+    bool is_rmdir = false;
+    bool rm_data = false;
+    bool allow_cache = true;
+    int res = 0, res2 = 0;
+    std::function<void(int)> cb;
+};
+
+static void nfs_kv_continue_delete(kv_del_state *st, int state)
+{
+    // Overall algorithm:
+    // 1) Get inode attributes and check that it's not a directory (REMOVE)
+    // 2) Get inode attributes and check that it is a directory (RMDIR)
+    // 3) Delete direntry with CAS
+    // 4) Check that the directory didn't contain files (RMDIR) and restore it if it did
+    // 5) Reduce inode refcount by 1 or delete inode
+    // 6) If regular file and inode is deleted: delete data
+    if (state == 0)      {}
+    else if (state == 1) goto resume_1;
+    else if (state == 2) goto resume_2;
+    else if (state == 3) goto resume_3;
+    else if (state == 4) goto resume_4;
+    else if (state == 5) goto resume_5;
+    else if (state == 6) goto resume_6;
+    else if (state == 7) goto resume_7;
+    else
+    {
+        fprintf(stderr, "BUG: invalid state in nfs_kv_continue_delete()\n");
+        abort();
+    }
+resume_0:
+    st->self->parent->db->get(kv_direntry_key(st->dir_ino, st->filename), [st](int res, const std::string & value)
+    {
+        st->res = res;
+        st->direntry_text = value;
+        nfs_kv_continue_delete(st, 1);
+    }, st->allow_cache);
+    return;
+resume_1:
+    if (st->res < 0)
+    {
+        auto cb = std::move(st->cb);
+        cb(st->res);
+        return;
+    }
+    {
+        std::string err;
+        st->direntry = json11::Json::parse(st->direntry_text, err);
+        if (err != "")
+        {
+            fprintf(stderr, "Invalid JSON in direntry %s = %s: %s, deleting\n",
+                kv_direntry_key(st->dir_ino, st->filename).c_str(), st->direntry_text.c_str(), err.c_str());
+            // Just delete direntry and skip inode
+        }
+        else
+        {
+            st->ino = st->direntry["ino"].uint64_value();
+        }
+    }
+    // Get inode
+    st->self->parent->db->get(kv_inode_key(st->ino), [st](int res, const std::string & value)
+    {
+        st->res = res;
+        st->ientry_text = value;
+        nfs_kv_continue_delete(st, 2);
+    }, st->allow_cache);
+    return;
+resume_2:
+    if (st->res < 0)
+    {
+        fprintf(stderr, "error reading inode %s: %s (code %d)\n",
+            kv_inode_key(st->ino).c_str(), strerror(-st->res), st->res);
+        auto cb = std::move(st->cb);
+        cb(st->res);
+        return;
+    }
+    {
+        std::string err;
+        st->ientry = json11::Json::parse(st->ientry_text, err);
+        if (err != "")
+        {
+            fprintf(stderr, "Invalid JSON in inode %s = %s: %s, treating as a regular file\n",
+                kv_inode_key(st->ino).c_str(), st->ientry_text.c_str(), err.c_str());
+        }
+    }
+    // (1-2) Check type
+    st->type = kv_map_type(st->ientry["type"].string_value());
+    if (st->type == -1 || st->is_rmdir != (st->type == NF3DIR))
+    {
+        auto cb = std::move(st->cb);
+        cb(st->is_rmdir ? -ENOTDIR : -EISDIR);
+        return;
+    }
+    // (3) Delete direntry with CAS
+    st->self->parent->db->del(kv_direntry_key(st->dir_ino, st->filename), [st](int res)
+    {
+        st->res = res;
+        nfs_kv_continue_delete(st, 3);
+    }, [st](int res, const std::string & value)
+    {
+        return value == st->direntry_text;
+    });
+    return;
+resume_3:
+    if (st->res == -EAGAIN)
+    {
+        // CAS failure, restart from the beginning
+        st->allow_cache = false;
+        goto resume_0;
+    }
+    else if (st->res < 0 && st->res != -ENOENT)
+    {
+        fprintf(stderr, "failed to remove direntry %s: %s (code %d)\n",
+            kv_direntry_key(st->dir_ino, st->filename).c_str(), strerror(-st->res), st->res);
+        auto cb = std::move(st->cb);
+        cb(st->res);
+        return;
+    }
+    if (!st->ino)
+    {
+        // direntry contained invalid JSON and was deleted, finish
+        auto cb = std::move(st->cb);
+        cb(0);
+        return;
+    }
+    if (st->is_rmdir)
+    {
+        // (4) Check if directory actually is not empty
+        st->list_handle = st->self->parent->db->list_start(kv_direntry_key(st->ino, ""));
+        st->self->parent->db->list_next(st->list_handle, [st](int res, const std::string & key, const std::string & value)
+        {
+            st->res = res;
+            st->list_key = key;
+            st->self->parent->db->list_close(st->list_handle);
+            nfs_kv_continue_delete(st, 4);
+        });
+        return;
+resume_4:
+        st->prefix = kv_direntry_key(st->ino, "");
+        if (st->res == -ENOENT || st->list_key.size() < st->prefix.size() || st->list_key.substr(0, st->prefix.size()) != st->prefix)
+        {
+            // OK, directory is empty
+        }
+        else
+        {
+            // Not OK, restore direntry
+            st->self->parent->db->set(kv_direntry_key(st->dir_ino, st->filename), st->direntry_text, [st](int res)
+            {
+                st->res2 = res;
+                nfs_kv_continue_delete(st, 5);
+            }, [st](int res, const std::string & value)
+            {
+                return res == -ENOENT;
+            });
+            return;
+resume_5:
+            if (st->res2 < 0)
+            {
+                fprintf(stderr, "failed to restore direntry %s (%s): %s (code %d)",
+                    kv_direntry_key(st->dir_ino, st->filename).c_str(), st->direntry_text.c_str(), strerror(-st->res2), st->res2);
+                fprintf(stderr, " - inode %ju may be left as garbage\n", st->ino);
+            }
+            if (st->res < 0)
+            {
+                fprintf(stderr, "failed to list entries from %s: %s (code %d)\n",
+                    kv_direntry_key(st->ino, "").c_str(), strerror(-st->res), st->res);
+            }
+            auto cb = std::move(st->cb);
+            cb(st->res < 0 ? st->res : -ENOTEMPTY);
+            return;
+        }
+    }
+    // (5) Reduce inode refcount by 1 or delete inode
+    if (st->ientry["nlink"].uint64_value() > 1)
+    {
+        auto copy = st->ientry.object_items();
+        copy["nlink"] = st->ientry["nlink"].uint64_value()-1;
+        copy["ctime"] = nfstime_now_str();
+        st->self->parent->db->set(kv_inode_key(st->ino), json11::Json(copy).dump(), [st](int res)
+        {
+            st->res = res;
+            nfs_kv_continue_delete(st, 6);
+        }, [st](int res, const std::string & old_value)
+        {
+            return old_value == st->ientry_text;
+        });
+    }
+    else
+    {
+        st->self->parent->kvfs->touch_queue.erase(st->ino);
+        st->self->parent->db->del(kv_inode_key(st->ino), [st](int res)
+        {
+            st->res = res;
+            nfs_kv_continue_delete(st, 6);
+        }, [st](int res, const std::string & old_value)
+        {
+            return old_value == st->ientry_text;
+        });
+    }
+    return;
+resume_6:
+    if (st->res < 0)
+    {
+        // Assume EAGAIN is OK, maybe someone created a hard link in the meantime
+        auto cb = std::move(st->cb);
+        cb(st->res == -EAGAIN ? 0 : st->res);
+        return;
+    }
+    // (6) If regular file and inode is deleted: delete data
+    if ((!st->type || st->type == NF3REG) && st->ientry["nlink"].uint64_value() <= 1 &&
+        !st->ientry["shared_ino"].uint64_value())
+    {
+        // Remove data
+        st->self->parent->cmd->loop_and_wait(st->self->parent->cmd->start_rm_data(json11::Json::object {
+            { "inode", INODE_NO_POOL(st->ino) },
+            { "pool", (uint64_t)INODE_POOL(st->ino) },
+        }), [st](const cli_result_t & r)
+        {
+            if (r.err)
+            {
+                fprintf(stderr, "Failed to remove inode %jx data: %s (code %d)\n",
+                    st->ino, r.text.c_str(), r.err);
+            }
+            st->res = r.err;
+            nfs_kv_continue_delete(st, 7);
+        });
+        return;
+resume_7:
+        auto cb = std::move(st->cb);
+        cb(st->res);
+        return;
+    }
+    if (!st->res)
+    {
+        st->self->parent->kvfs->touch_queue.insert(st->dir_ino);
+    }
+    auto cb = std::move(st->cb);
+    cb(0);
+}
+
+int kv_nfs3_remove_proc(void *opaque, rpc_op_t *rop)
+{
+    kv_del_state *st = new kv_del_state;
+    st->self = (nfs_client_t*)opaque;
+    st->rop = rop;
+    REMOVE3res *reply = (REMOVE3res*)rop->reply;
+    REMOVE3args *args = (REMOVE3args*)rop->request;
+    st->dir_ino = kv_fh_inode(args->object.dir);
+    st->filename = args->object.name;
+    if (st->self->parent->trace)
+        fprintf(stderr, "[%d] REMOVE %ju/%s\n", st->self->nfs_fd, st->dir_ino, st->filename.c_str());
+    if (!st->dir_ino)
+    {
+        *reply = (REMOVE3res){ .status = NFS3ERR_INVAL };
+        rpc_queue_reply(rop);
+        delete st;
+        return 0;
+    }
+    st->cb = [st](int res)
+    {
+        *((REMOVE3res*)st->rop->reply) = (REMOVE3res){
+            .status = vitastor_nfs_map_err(res),
+        };
+        rpc_queue_reply(st->rop);
+        delete st;
+    };
+    nfs_kv_continue_delete(st, 0);
+    return 1;
+}
+
+int kv_nfs3_rmdir_proc(void *opaque, rpc_op_t *rop)
+{
+    kv_del_state *st = new kv_del_state;
+    st->self = (nfs_client_t*)opaque;
+    st->rop = rop;
+    RMDIR3args *args = (RMDIR3args*)rop->request;
+    RMDIR3res *reply = (RMDIR3res*)rop->reply;
+    st->dir_ino = kv_fh_inode(args->object.dir);
+    st->filename = args->object.name;
+    st->is_rmdir = true;
+    if (st->self->parent->trace)
+        fprintf(stderr, "[%d] RMDIR %ju/%s\n", st->self->nfs_fd, st->dir_ino, st->filename.c_str());
+    if (!st->dir_ino)
+    {
+        *reply = (RMDIR3res){ .status = NFS3ERR_INVAL };
+        rpc_queue_reply(rop);
+        delete st;
+        return 0;
+    }
+    st->cb = [st](int res)
+    {
+        *((RMDIR3res*)st->rop->reply) = (RMDIR3res){
+            .status = vitastor_nfs_map_err(res),
+        };
+        rpc_queue_reply(st->rop);
+        delete st;
+    };
+    nfs_kv_continue_delete(st, 0);
+    return 1;
+}
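
Note on the REMOVE/RMDIR flow above: the direntry is deleted with a compare-and-set check, and the whole sequence restarts from the first read when the check fails with -EAGAIN. The following is a minimal standalone sketch of that read, check, delete, retry loop. It assumes a toy in-memory store: `toy_kv_t`, `toy_remove` and the `mutate` hook are hypothetical names for illustration only and are not part of the VitastorKV API.

```cpp
#include <cerrno>
#include <functional>
#include <map>
#include <string>

// Hypothetical in-memory stand-in for the KV database (not the VitastorKV API)
struct toy_kv_t
{
    std::map<std::string, std::string> data;

    // Delete the key only if cas(res, current_value) approves.
    // Returns 0 on success, -ENOENT if the key is absent, -EAGAIN if the check refuses.
    int del(const std::string & key, std::function<bool(int, const std::string &)> cas)
    {
        auto it = data.find(key);
        int res = it == data.end() ? -ENOENT : 0;
        if (!cas(res, it == data.end() ? std::string() : it->second))
            return -EAGAIN;
        if (res < 0)
            return res;
        data.erase(it);
        return 0;
    }
};

// Read-check-delete loop: re-read and retry whenever the value changed in between.
// `mutate` simulates a concurrent writer and runs once, between the read and the delete.
int toy_remove(toy_kv_t & db, const std::string & key, std::function<void()> mutate = nullptr)
{
    while (true)
    {
        auto it = db.data.find(key);
        if (it == db.data.end())
            return -ENOENT;
        std::string seen = it->second;
        if (mutate)
        {
            mutate();
            mutate = nullptr;
        }
        int res = db.del(key, [&](int r, const std::string & value)
        {
            return r == 0 && value == seen;
        });
        if (res != -EAGAIN)
            return res;
    }
}
```

In this sketch a concurrent change between the read and the delete costs one extra iteration; the real handler additionally disables the read cache (`allow_cache = false`) before retrying.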
--- a/src/nfs_kv_rename.cpp
+++ b/src/nfs_kv_rename.cpp
@@ -0,0 +1,401 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// NFS proxy over VitastorKV database - RENAME
+
+#include <sys/time.h>
+
+#include "nfs_proxy.h"
+#include "nfs_kv.h"
+#include "cli.h"
+
+struct nfs_kv_rename_state
+{
+    nfs_client_t *self = NULL;
+    rpc_op_t *rop = NULL;
+    // params:
+    uint64_t old_dir_ino = 0, new_dir_ino = 0;
+    std::string old_name, new_name;
+    // state:
+    bool allow_cache = true;
+    std::string old_direntry_text, old_ientry_text, new_direntry_text, new_ientry_text;
+    json11::Json old_direntry, old_ientry, new_direntry, new_ientry;
+    std::string new_dir_prefix;
+    void *list_handle = NULL;
+    bool new_exists = false;
+    bool rm_dest_data = false;
+    int res = 0, res2 = 0;
+    std::function<void(int)> cb;
+};
+
+static void nfs_kv_continue_rename(nfs_kv_rename_state *st, int state)
+{
+    // Algorithm (non-atomic of course):
+    // 1) Read source direntry
+    // 2) Read destination direntry
+    // 3) If destination exists:
+    // 3.1) Check file/folder compatibility (EISDIR/ENOTDIR)
+    // 3.2) Check if destination is empty if it's a folder
+    // 4) If not:
+    // 4.1) Check that the destination directory is actually a directory
+    // 5) Overwrite destination direntry, restart from beginning if CAS failure
+    // 6) Delete source direntry, restart from beginning if CAS failure
+    // 7) If the destination existed and was not a directory (its direntry was overwritten):
+    // 7.1) Read inode
+    // 7.2) Delete inode if its link count <= 1
+    // 7.3) Delete inode data if its link count <= 1 and it's a regular non-shared file
+    // 7.4) Reduce link count by 1 if it's > 1
+    // 8) If the moved direntry is a directory:
+    // 8.1) Change parent_ino reference in its inode
+    if (state == 0)      {}
+    else if (state == 1) goto resume_1;
+    else if (state == 2) goto resume_2;
+    else if (state == 3) goto resume_3;
+    else if (state == 4) goto resume_4;
+    else if (state == 5) goto resume_5;
+    else if (state == 6) goto resume_6;
+    else if (state == 7) goto resume_7;
+    else if (state == 8) goto resume_8;
+    else if (state == 9) goto resume_9;
+    else if (state == 10) goto resume_10;
+    else if (state == 11) goto resume_11;
+    else if (state == 12) goto resume_12;
+    else
+    {
+        fprintf(stderr, "BUG: invalid state in nfs_kv_continue_rename()\n");
+        abort();
+    }
+resume_0:
+    // Read the old direntry
+    st->self->parent->db->get(kv_direntry_key(st->old_dir_ino, st->old_name), [=](int res, const std::string & value)
+    {
+        st->res = res;
+        st->old_direntry_text = value;
+        nfs_kv_continue_rename(st, 1);
+    }, st->allow_cache);
+    return;
+resume_1:
+    if (st->res < 0)
+    {
+        auto cb = std::move(st->cb);
+        cb(st->res);
+        return;
+    }
+    {
+        std::string err;
+        st->old_direntry = json11::Json::parse(st->old_direntry_text, err);
+        if (err != "")
+        {
+            fprintf(stderr, "Invalid JSON in direntry %s = %s: %s\n",
+                kv_direntry_key(st->old_dir_ino, st->old_name).c_str(),
+                st->old_direntry_text.c_str(), err.c_str());
+            auto cb = std::move(st->cb);
+            cb(-EIO);
+            return;
+        }
+    }
+    // Read the new direntry
+    st->self->parent->db->get(kv_direntry_key(st->new_dir_ino, st->new_name), [=](int res, const std::string & value)
+    {
+        st->res = res;
+        st->new_direntry_text = value;
+        nfs_kv_continue_rename(st, 2);
+    }, st->allow_cache);
+    return;
+resume_2:
+    if (st->res < 0 && st->res != -ENOENT)
+    {
+        auto cb = std::move(st->cb);
+        cb(st->res);
+        return;
+    }
+    if (st->res == 0)
+    {
+        std::string err;
+        st->new_direntry = json11::Json::parse(st->new_direntry_text, err);
+        if (err != "")
+        {
+            fprintf(stderr, "Invalid JSON in direntry %s = %s: %s\n",
+                kv_direntry_key(st->new_dir_ino, st->new_name).c_str(),
+                st->new_direntry_text.c_str(), err.c_str());
+            auto cb = std::move(st->cb);
+            cb(-EIO);
+            return;
+        }
+    }
+    st->new_exists = st->res == 0;
+    if (st->new_exists)
+    {
+        // Check file/folder compatibility (EISDIR/ENOTDIR)
+        if ((st->old_direntry["type"] == "dir") != (st->new_direntry["type"] == "dir"))
+        {
+            auto cb = std::move(st->cb);
+            cb((st->new_direntry["type"] == "dir") ? -EISDIR : -ENOTDIR);
+            return;
+        }
+        if (st->new_direntry["type"] == "dir")
+        {
+            // Check that the destination directory is empty
+            st->new_dir_prefix = kv_direntry_key(st->new_direntry["ino"].uint64_value(), "");
+            st->list_handle = st->self->parent->db->list_start(st->new_dir_prefix);
+            st->self->parent->db->list_next(st->list_handle, [st](int res, const std::string & key, const std::string & value)
+            {
+                st->res = res;
+                nfs_kv_continue_rename(st, 3);
+            });
+            return;
+resume_3:
+            st->self->parent->db->list_close(st->list_handle);
+            if (st->res != -ENOENT)
+            {
+                auto cb = std::move(st->cb);
+                cb(-ENOTEMPTY);
+                return;
+            }
+        }
+    }
+    else
+    {
+        // Check that the new directory is actually a directory
+        kv_read_inode(st->self->parent, st->new_dir_ino, [st](int res, const std::string & value, json11::Json attrs)
+        {
+            st->res = res == 0 ? (attrs["type"].string_value() == "dir" ? 0 : -ENOTDIR) : res;
+            nfs_kv_continue_rename(st, 4);
+        });
+        return;
+resume_4:
+        if (st->res < 0)
+        {
+            auto cb = std::move(st->cb);
+            cb(st->res);
+            return;
+        }
+    }
+    // Write the new direntry
+    st->self->parent->db->set(kv_direntry_key(st->new_dir_ino, st->new_name), st->old_direntry_text, [st](int res)
+    {
+        st->res = res;
+        nfs_kv_continue_rename(st, 5);
+    }, [st](int res, const std::string & old_value)
+    {
+        return st->new_exists ? (old_value == st->new_direntry_text) : (res == -ENOENT);
+    });
+    return;
+resume_5:
+    if (st->res == -EAGAIN)
+    {
+        // CAS failure
+        st->allow_cache = false;
+        goto resume_0;
+    }
+    if (st->res < 0)
+    {
+        auto cb = std::move(st->cb);
+        cb(st->res);
+        return;
+    }
+    // Delete the old direntry
+    st->self->parent->db->del(kv_direntry_key(st->old_dir_ino, st->old_name), [st](int res)
+    {
+        st->res = res;
+        nfs_kv_continue_rename(st, 6);
+    }, [=](int res, const std::string & old_value)
+    {
+        return res == 0 && old_value == st->old_direntry_text;
+    });
+    return;
+resume_6:
+    if (st->res == -EAGAIN)
+    {
+        // CAS failure
+        st->allow_cache = false;
+        goto resume_0;
+    }
+    if (st->res < 0)
+    {
+        auto cb = std::move(st->cb);
+        cb(st->res);
+        return;
+    }
+    st->allow_cache = true;
+resume_7again:
+    if (st->new_exists && st->new_direntry["type"].string_value() != "dir")
+    {
+        // (Maybe) delete old destination file data
+        kv_read_inode(st->self->parent, st->new_direntry["ino"].uint64_value(), [st](int res, const std::string & value, json11::Json attrs)
+        {
+            st->res = res;
+            st->new_ientry_text = value;
+            st->new_ientry = attrs;
+            nfs_kv_continue_rename(st, 7);
+        }, st->allow_cache);
+        return;
+resume_7:
+        if (st->res == 0)
+        {
+            // (7.2, 7.4) Reduce inode refcount by 1 or delete inode
+            if (st->new_ientry["nlink"].uint64_value() > 1)
+            {
+                auto copy = st->new_ientry.object_items();
+                copy["nlink"] = st->new_ientry["nlink"].uint64_value()-1;
+                copy["ctime"] = nfstime_now_str();
+                copy.erase("verf");
+                st->self->parent->db->set(kv_inode_key(st->new_direntry["ino"].uint64_value()), json11::Json(copy).dump(), [st](int res)
+                {
+                    st->res = res;
+                    nfs_kv_continue_rename(st, 8);
+                }, [st](int res, const std::string & old_value)
+                {
+                    return old_value == st->new_ientry_text;
+                });
+            }
+            else
+            {
+                st->rm_dest_data = kv_map_type(st->new_ientry["type"].string_value()) == NF3REG
+                    && !st->new_ientry["shared_ino"].uint64_value();
+                st->self->parent->db->del(kv_inode_key(st->new_direntry["ino"].uint64_value()), [st](int res)
+                {
+                    st->res = res;
+                    nfs_kv_continue_rename(st, 8);
+                }, [st](int res, const std::string & old_value)
+                {
+                    return old_value == st->new_ientry_text;
+                });
+            }
+            return;
+resume_8:
+            if (st->res == -EAGAIN)
+            {
+                // CAS failure - re-read inode
+                st->allow_cache = false;
+                goto resume_7again;
+            }
+            if (st->res < 0)
+            {
+                auto cb = std::move(st->cb);
+                cb(st->res);
+                return;
+            }
+            // Delete inode data if required
+            if (st->rm_dest_data)
+            {
+                st->self->parent->cmd->loop_and_wait(st->self->parent->cmd->start_rm_data(json11::Json::object {
+                    { "inode", INODE_NO_POOL(st->new_direntry["ino"].uint64_value()) },
+                    { "pool", (uint64_t)INODE_POOL(st->new_direntry["ino"].uint64_value()) },
+                }), [st](const cli_result_t & r)
+                {
+                    if (r.err)
+                    {
+                        fprintf(stderr, "Failed to remove inode %jx data: %s (code %d)\n",
+                            st->new_direntry["ino"].uint64_value(), r.text.c_str(), r.err);
+                    }
+                    st->res = r.err;
+                    nfs_kv_continue_rename(st, 9);
+                });
+                return;
+resume_9:
+                if (st->res < 0)
+                {
+                    auto cb = std::move(st->cb);
+                    cb(st->res);
+                    return;
+                }
+            }
+        }
+    }
+    if (st->old_direntry["type"].string_value() == "dir" && st->new_dir_ino != st->old_dir_ino)
+    {
+        // Change parent_ino in old ientry
+        st->allow_cache = true;
+resume_10:
+        kv_read_inode(st->self->parent, st->old_direntry["ino"].uint64_value(), [st](int res, const std::string & value, json11::Json ientry)
+        {
+            st->res = res;
+            st->old_ientry_text = value;
+            st->old_ientry = ientry;
+            nfs_kv_continue_rename(st, 11);
+        }, st->allow_cache);
+        return;
+resume_11:
+        if (st->res < 0)
+        {
+            auto cb = std::move(st->cb);
+            cb(st->res);
+            return;
+        }
+        {
+            auto ientry_new = st->old_ientry.object_items();
+            ientry_new["parent_ino"] = st->new_dir_ino;
+            ientry_new["ctime"] = nfstime_now_str();
+            ientry_new.erase("verf");
+            st->self->parent->db->set(kv_inode_key(st->old_direntry["ino"].uint64_value()), json11::Json(ientry_new).dump(), [st](int res)
+            {
+                st->res = res;
+                nfs_kv_continue_rename(st, 12);
+            }, [st](int res, const std::string & old_value)
+            {
+                return old_value == st->old_ientry_text;
+            });
+        }
+        return;
+resume_12:
+        if (st->res == -EAGAIN)
+        {
+            // CAS failure - try again
+            st->allow_cache = false;
+            goto resume_10;
+        }
+        if (st->res < 0)
+        {
+            auto cb = std::move(st->cb);
+            cb(st->res);
+            return;
+        }
+    }
+    if (!st->res)
+    {
+        st->self->parent->kvfs->touch_queue.insert(st->old_dir_ino);
+        st->self->parent->kvfs->touch_queue.insert(st->new_dir_ino);
+    }
+    auto cb = std::move(st->cb);
+    cb(st->res);
+}
+
+int kv_nfs3_rename_proc(void *opaque, rpc_op_t *rop)
+{
+    auto st = new nfs_kv_rename_state;
+    st->self = (nfs_client_t*)opaque;
+    st->rop = rop;
+    RENAME3args *args = (RENAME3args*)rop->request;
+    st->old_dir_ino = kv_fh_inode(args->from.dir);
+    st->new_dir_ino = kv_fh_inode(args->to.dir);
+    st->old_name = args->from.name;
+    st->new_name = args->to.name;
+    if (st->self->parent->trace)
+        fprintf(stderr, "[%d] RENAME %ju/%s -> %ju/%s\n", st->self->nfs_fd, st->old_dir_ino, st->old_name.c_str(), st->new_dir_ino, st->new_name.c_str());
+    if (!st->old_dir_ino || !st->new_dir_ino || st->old_name == "" || st->new_name == "")
+    {
+        RENAME3res *reply = (RENAME3res*)rop->reply;
+        *reply = (RENAME3res){ .status = NFS3ERR_INVAL };
+        rpc_queue_reply(rop);
+        delete st;
+        return 0;
+    }
+    if (st->old_dir_ino == st->new_dir_ino && st->old_name == st->new_name)
+    {
+        RENAME3res *reply = (RENAME3res*)rop->reply;
+        *reply = (RENAME3res){ .status = NFS3_OK };
+        rpc_queue_reply(st->rop);
+        delete st;
+        return 0;
+    }
+    st->cb = [st](int res)
+    {
+        RENAME3res *reply = (RENAME3res*)st->rop->reply;
+        *reply = (RENAME3res){ .status = vitastor_nfs_map_err(res) };
+        rpc_queue_reply(st->rop);
+        delete st;
+    };
+    nfs_kv_continue_rename(st, 0);
+    return 1;
+}
--- a/src/nfs_kv_setattr.cpp
+++ b/src/nfs_kv_setattr.cpp
@@ -0,0 +1,204 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// NFS proxy over VitastorKV database - SETATTR
+
+#include <sys/time.h>
+
+#include "nfs_proxy.h"
+#include "nfs_kv.h"
+#include "cli.h"
+
+struct nfs_kv_setattr_state
+{
+    nfs_client_t *self = NULL;
+    rpc_op_t *rop = NULL;
+    uint64_t ino = 0;
+    uint64_t old_size = 0, new_size = 0;
+    std::string expected_ctime;
+    json11::Json::object set_attrs;
+    int res = 0, cas_res = 0;
+    std::string ientry_text;
+    json11::Json ientry;
+    json11::Json::object new_attrs;
+    std::function<void(int)> cb;
+};
+
+static void nfs_kv_continue_setattr(nfs_kv_setattr_state *st, int state)
+{
+    // FIXME: NFS client does a lot of setattr calls, so maybe process them asynchronously
+    if (state == 0)      {}
+    else if (state == 1) goto resume_1;
+    else if (state == 2) goto resume_2;
+    else if (state == 3) goto resume_3;
+    else
+    {
+        fprintf(stderr, "BUG: invalid state in nfs_kv_continue_setattr()\n");
+        abort();
+    }
+    st->self->parent->kvfs->touch_queue.erase(st->ino);
+resume_0:
+    kv_read_inode(st->self->parent, st->ino, [st](int res, const std::string & value, json11::Json attrs)
+    {
+        st->res = res;
+        st->ientry_text = value;
+        st->ientry = attrs;
+        nfs_kv_continue_setattr(st, 1);
+    });
+    return;
+resume_1:
+    if (st->res < 0)
+    {
+        auto cb = std::move(st->cb);
+        cb(st->res);
+        return;
+    }
+    if (st->ientry["type"].string_value() != "file" &&
+        st->ientry["type"].string_value() != "" &&
+        !st->set_attrs["size"].is_null())
+    {
+        auto cb = std::move(st->cb);
+        cb(-EINVAL);
+        return;
+    }
+    if (st->expected_ctime != "")
+    {
+        auto actual_ctime = (st->ientry["ctime"].is_null() ? st->ientry["mtime"] : st->ientry["ctime"]);
+        if (actual_ctime != st->expected_ctime)
+        {
+            auto cb = std::move(st->cb);
+            cb(NFS3ERR_NOT_SYNC);
+            return;
+        }
+    }
+    // Now we can update it
+    st->new_attrs = st->ientry.object_items();
+    st->old_size = st->ientry["size"].uint64_value();
+    for (auto & kv: st->set_attrs)
+    {
+        if (kv.first == "size")
+        {
+            st->new_size = kv.second.uint64_value();
+        }
+        st->new_attrs[kv.first] = kv.second;
+    }
+    st->new_attrs.erase("verf");
+    st->new_attrs["ctime"] = nfstime_now_str();
+    st->self->parent->db->set(kv_inode_key(st->ino), json11::Json(st->new_attrs).dump(), [st](int res)
+    {
+        st->res = res;
+        nfs_kv_continue_setattr(st, 2);
+    }, [st](int res, const std::string & cas_value)
+    {
+        st->cas_res = res;
+        return (res == 0 || (res == -ENOENT && st->ino == KV_ROOT_INODE)) && cas_value == st->ientry_text;
+    });
+    return;
+resume_2:
+    if (st->cas_res == -ENOENT)
+    {
+        st->res = -ENOENT;
+    }
+    if (st->res == -EAGAIN)
+    {
+        // Retry
+        goto resume_0;
+    }
+    if (st->res < 0)
+    {
+        fprintf(stderr, "Failed to update inode %ju: %s (code %d)\n", st->ino, strerror(-st->res), st->res);
+        auto cb = std::move(st->cb);
+        cb(st->res);
+        return;
+    }
+    if (!st->set_attrs["size"].is_null() &&
+        st->ientry["size"].uint64_value() > st->set_attrs["size"].uint64_value() &&
+        !st->ientry["shared_ino"].uint64_value())
+    {
+        // Delete extra data when downsizing
+        st->self->parent->cmd->loop_and_wait(st->self->parent->cmd->start_rm_data(json11::Json::object {
+            { "inode", INODE_NO_POOL(st->ino) },
+            { "pool", (uint64_t)INODE_POOL(st->ino) },
+            { "min_offset", st->set_attrs["size"].uint64_value() },
+        }), [st](const cli_result_t & r)
+        {
+            if (r.err)
+            {
+                fprintf(stderr, "Failed to truncate inode %ju: %s (code %d)\n",
+                    st->ino, r.text.c_str(), r.err);
+            }
+            st->res = r.err;
+            nfs_kv_continue_setattr(st, 3);
+        });
+        return;
+    }
+resume_3:
+    auto cb = std::move(st->cb);
+    cb(st->res);
+}
+
+int kv_nfs3_setattr_proc(void *opaque, rpc_op_t *rop)
+{
+    nfs_kv_setattr_state *st = new nfs_kv_setattr_state;
+    st->self = (nfs_client_t*)opaque;
+    st->rop = rop;
+    auto args = (SETATTR3args*)rop->request;
+    auto reply = (SETATTR3res*)rop->reply;
+    std::string fh = args->object;
+    if (!kv_fh_valid(fh))
+    {
+        *reply = (SETATTR3res){ .status = NFS3ERR_INVAL };
+        rpc_queue_reply(rop);
+        delete st;
+        return 0;
+    }
+    st->ino = kv_fh_inode(fh);
+    if (args->guard.check)
+        st->expected_ctime = nfstime_to_str(args->guard.obj_ctime);
+    if (args->new_attributes.size.set_it)
+        st->set_attrs["size"] = args->new_attributes.size.size;
+    if (args->new_attributes.mode.set_it)
+        st->set_attrs["mode"] = (uint64_t)args->new_attributes.mode.mode;
+    if (args->new_attributes.uid.set_it)
+        st->set_attrs["uid"] = (uint64_t)args->new_attributes.uid.uid;
+    if (args->new_attributes.gid.set_it)
+        st->set_attrs["gid"] = (uint64_t)args->new_attributes.gid.gid;
+    if (args->new_attributes.atime.set_it == SET_TO_SERVER_TIME)
+        st->set_attrs["atime"] = nfstime_now_str();
+    else if (args->new_attributes.atime.set_it == SET_TO_CLIENT_TIME)
+        st->set_attrs["atime"] = nfstime_to_str(args->new_attributes.atime.atime);
+    if (args->new_attributes.mtime.set_it == SET_TO_SERVER_TIME)
+        st->set_attrs["mtime"] = nfstime_now_str();
+    else if (args->new_attributes.mtime.set_it == SET_TO_CLIENT_TIME)
+        st->set_attrs["mtime"] = nfstime_to_str(args->new_attributes.mtime.mtime);
+    if (st->self->parent->trace)
+        fprintf(stderr, "[%d] SETATTR %ju ATTRS %s\n", st->self->nfs_fd, st->ino, json11::Json(st->set_attrs).dump().c_str());
+    st->cb = [st](int res)
+    {
+        auto reply = (SETATTR3res*)st->rop->reply;
+        if (res != 0)
+        {
+            *reply = (SETATTR3res){
+                .status = (res == NFS3ERR_NOT_SYNC ? NFS3ERR_NOT_SYNC : vitastor_nfs_map_err(res)),
+            };
+        }
+        else
+        {
+            *reply = (SETATTR3res){
+                .status = NFS3_OK,
+                .resok = (SETATTR3resok){
+                    .obj_wcc = (wcc_data){
+                        .after = (post_op_attr){
+                            .attributes_follow = 1,
+                            .attributes = get_kv_attributes(st->self, st->ino, st->new_attrs),
+                        },
+                    },
+                },
+            };
+        }
+        rpc_queue_reply(st->rop);
+        delete st;
+    };
+    nfs_kv_continue_setattr(st, 0);
+    return 1;
+}
--- a/src/nfs_kv_write.cpp
+++ b/src/nfs_kv_write.cpp
--- a/src/nfs_mount.cpp
+++ b/src/nfs_mount.cpp
@@ -0,0 +1,126 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// NFS proxy - common NULL, ACCESS, COMMIT, DUMP, EXPORT, MNT, UMNT, UMNTALL
+
+#include <sys/time.h>
+
+#include "nfs_proxy.h"
+#include "nfs/nfs.h"
+
+nfsstat3 vitastor_nfs_map_err(int err)
+{
+    if (err < 0)
+    {
+        err = -err;
+    }
+    return (err == EINVAL ? NFS3ERR_INVAL
+        : (err == ENOENT ? NFS3ERR_NOENT
+        : (err == ENOSPC ? NFS3ERR_NOSPC
+        : (err == EEXIST ? NFS3ERR_EXIST
+        : (err == EISDIR ? NFS3ERR_ISDIR
+        : (err == ENOTDIR ? NFS3ERR_NOTDIR
+        : (err == ENOTEMPTY ? NFS3ERR_NOTEMPTY
+        : (err == EIO ? NFS3ERR_IO : (err ? NFS3ERR_IO : NFS3_OK)))))))));
+}
+
+int nfs3_null_proc(void *opaque, rpc_op_t *rop)
+{
+    rpc_queue_reply(rop);
+    return 0;
+}
+
+int nfs3_access_proc(void *opaque, rpc_op_t *rop)
+{
+    //nfs_client_t *self = (nfs_client_t*)opaque;
+    ACCESS3args *args = (ACCESS3args*)rop->request;
+    ACCESS3res *reply = (ACCESS3res*)rop->reply;
+    *reply = (ACCESS3res){
+        .status = NFS3_OK,
+        .resok = (ACCESS3resok){
+            .access = args->access,
+        },
+    };
+    rpc_queue_reply(rop);
+    return 0;
+}
+
+int nfs3_commit_proc(void *opaque, rpc_op_t *rop)
+{
+    nfs_client_t *self = (nfs_client_t*)opaque;
+    //COMMIT3args *args = (COMMIT3args*)rop->request;
+    cluster_op_t *op = new cluster_op_t;
+    // fsync. we don't know how to fsync a single inode, so just fsync everything
+    op->opcode = OSD_OP_SYNC;
+    op->callback = [self, rop](cluster_op_t *op)
+    {
+        COMMIT3res *reply = (COMMIT3res*)rop->reply;
+        *reply = (COMMIT3res){ .status = vitastor_nfs_map_err(op->retval) };
+        *(uint64_t*)reply->resok.verf = self->parent->server_id;
+        rpc_queue_reply(rop);
+    };
+    self->parent->cli->execute(op);
+    return 1;
+}
+
+int mount3_mnt_proc(void *opaque, rpc_op_t *rop)
+{
+    nfs_client_t *self = (nfs_client_t*)opaque;
+    //nfs_dirpath *args = (nfs_dirpath*)rop->request;
+    if (self->parent->trace)
+        fprintf(stderr, "[%d] MNT\n", self->nfs_fd);
+    nfs_mountres3 *reply = (nfs_mountres3*)rop->reply;
+    u_int flavor = RPC_AUTH_NONE;
+    reply->fhs_status = MNT3_OK;
+    reply->mountinfo.fhandle = xdr_copy_string(rop->xdrs, NFS_ROOT_HANDLE);
+    reply->mountinfo.auth_flavors.auth_flavors_len = 1;
+    reply->mountinfo.auth_flavors.auth_flavors_val = (u_int*)xdr_copy_string(rop->xdrs, (char*)&flavor, sizeof(u_int)).data;
+    rpc_queue_reply(rop);
+    return 0;
+}
+
+int mount3_dump_proc(void *opaque, rpc_op_t *rop)
+{
+    nfs_client_t *self = (nfs_client_t*)opaque;
+    if (self->parent->trace)
+        fprintf(stderr, "[%d] DUMP\n", self->nfs_fd);
+    nfs_mountlist *reply = (nfs_mountlist*)rop->reply;
+    *reply = (struct nfs_mountbody*)malloc_or_die(sizeof(struct nfs_mountbody));
+    xdr_add_malloc(rop->xdrs, *reply);
+    (*reply)->ml_hostname = xdr_copy_string(rop->xdrs, "127.0.0.1");
+    (*reply)->ml_directory = xdr_copy_string(rop->xdrs, self->parent->export_root);
+    (*reply)->ml_next = NULL;
+    rpc_queue_reply(rop);
+    return 0;
+}
+
+int mount3_umnt_proc(void *opaque, rpc_op_t *rop)
+{
+    //nfs_client_t *self = (nfs_client_t*)opaque;
+    //nfs_dirpath *arg = (nfs_dirpath*)rop->request;
+    // do nothing
+    rpc_queue_reply(rop);
+    return 0;
+}
+
+int mount3_umntall_proc(void *opaque, rpc_op_t *rop)
+{
+    // do nothing
+    rpc_queue_reply(rop);
+    return 0;
+}
+
+int mount3_export_proc(void *opaque, rpc_op_t *rop)
+{
+    nfs_client_t *self = (nfs_client_t*)opaque;
+    nfs_exports *reply = (nfs_exports*)rop->reply;
+    *reply = (struct nfs_exportnode*)calloc_or_die(1, sizeof(struct nfs_exportnode) + sizeof(struct nfs_groupnode));
+    xdr_add_malloc(rop->xdrs, *reply);
+    (*reply)->ex_dir = xdr_copy_string(rop->xdrs, self->parent->export_root);
+    (*reply)->ex_groups = (struct nfs_groupnode*)(*reply+1);
+    (*reply)->ex_groups->gr_name = xdr_copy_string(rop->xdrs, "127.0.0.1");
+    (*reply)->ex_groups->gr_next = NULL;
+    (*reply)->ex_next = NULL;
+    rpc_queue_reply(rop);
+    return 0;
+}
--- a/src/nfs_proxy.cpp
+++ b/src/nfs_proxy.cpp
@@ -10,9 +10,10 @@

 #include <netinet/tcp.h>
 #include <sys/epoll.h>
+#include <sys/wait.h>
 #include <unistd.h>
 #include <fcntl.h>
-//#include <signal.h>
+#include <signal.h>

 #include "nfs/nfs.h"
 #include "nfs/rpc.h"
@@ -21,6 +22,9 @@
 #include "addr_util.h"
 #include "str_util.h"
 #include "nfs_proxy.h"
+#include "nfs_kv.h"
+#include "nfs_block.h"
+#include "nfs_common.h"
 #include "http_client.h"
 #include "cli.h"

@@ -31,6 +35,12 @@ const char *exe_name = NULL;

 nfs_proxy_t::~nfs_proxy_t()
 {
+    if (kvfs)
+        delete kvfs;
+    if (blockfs)
+        delete blockfs;
+    if (db)
+        delete db;
    if (cmd)
        delete cmd;
    if (cli)
@@ -44,43 +54,90 @@ nfs_proxy_t::~nfs_proxy_t()
        delete ringloop;
 }

+static const char* help_text =
+    "Vitastor NFS 3.0 proxy " VERSION "\n"
+    "(c) Vitaliy Filippov, 2021+ (VNPL-1.1)\n"
+    "\n"
+    "vitastor-nfs (--fs <NAME> | --block) [-o <OPT>] mount <MOUNTPOINT>\n"
+    "  Start local filesystem server and mount file system to <MOUNTPOINT>.\n"
+    "  Use regular `umount <MOUNTPOINT>` to unmount the FS.\n"
+    "  The server will be automatically stopped when the FS is unmounted.\n"
+    "  -o|--options <OPT>  Pass additional NFS mount options (ex.: -o async).\n"
+    "\n"
+    "vitastor-nfs (--fs <NAME> | --block) start\n"
+    "  Start network NFS server. Options:\n"
+    "  --bind <IP>       bind service to <IP> address (default 0.0.0.0)\n"
+    "  --port <PORT>     use port <PORT> for NFS services (default is 2049)\n"
+    "  --portmap 0       do not listen on port 111 (portmap/rpcbind, requires root)\n"
+    "\n"
+    "OPTIONS:\n"
+    "  --fs <NAME>       use VitastorFS with metadata in image <NAME>\n"
+    "  --block           use pseudo-FS presenting images as files\n"
+    "  --pool <POOL>     use <POOL> as default pool for new files\n"
+    "  --subdir <DIR>    export <DIR> instead of root directory (pseudo-FS only)\n"
+    "  --nfspath <PATH>  set NFS export path to <PATH> (default is /)\n"
+    "  --pidfile <FILE>  write process ID to the specified file\n"
+    "  --logfile <FILE>  log to the specified file\n"
+    "  --foreground 1    stay in foreground, do not daemonize\n"
+    "\n"
+    "NFS proxy is stateless if you use immediate_commit=all in your cluster and if\n"
+    "you do not use client_enable_writeback=true, so you can freely use multiple\n"
+    "NFS proxies with L3 load balancing in this case.\n"
+    "\n"
+    "Example start and mount commands for a custom NFS port:\n"
+    "  vitastor-nfs start --block --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool\n"
+    "  mount localhost:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp\n"
+    "Or just:\n"
+    "  vitastor-nfs mount --block --pool testpool /mnt/\n"
+;
+
 json11::Json::object nfs_proxy_t::parse_args(int narg, const char *args[])
 {
    json11::Json::object cfg;
+    std::vector<std::string> cmd;
    for (int i = 1; i < narg; i++)
    {
        if (!strcmp(args[i], "-h") || !strcmp(args[i], "--help"))
        {
-            printf(
-                "Vitastor NFS 3.0 proxy\n"
-                "(c) Vitaliy Filippov, 2021-2022 (VNPL-1.1)\n"
-                "\n"
-                "USAGE:\n"
-                "  %s [STANDARD OPTIONS] [OTHER OPTIONS]\n"
-                "  --subdir <DIR>    export images prefixed <DIR>/ (default empty - export all images)\n"
-                "  --portmap 0       do not listen on port 111 (portmap/rpcbind, requires root)\n"
-                "  --bind <IP>       bind service to <IP> address (default 0.0.0.0)\n"
-                "  --nfspath <PATH>  set NFS export path to <PATH> (default is /)\n"
-                "  --port <PORT>     use port <PORT> for NFS services (default is 2049)\n"
-                "  --pool <POOL>     use <POOL> as default pool for new files (images)\n"
-                "  --foreground 1    stay in foreground, do not daemonize\n"
-                "\n"
-                "NFS proxy is stateless if you use immediate_commit=all in your cluster and if\n"
-                "you do not use client_enable_writeback=true, so you can freely use multiple\n"
-                "NFS proxies with L3 load balancing in this case.\n"
-                "\n"
-                "Example start and mount commands for a custom NFS port:\n"
-                "  %s --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool\n"
-                "  mount localhost:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp\n",
-                exe_name, exe_name
-            );
+            printf("%s", help_text);
            exit(0);
        }
+        else if (!strcmp(args[i], "-o") || !strcmp(args[i], "--options"))
+        {
+            if (i >= narg-1)
+            {
+                printf("%s", help_text);
+                exit(1);
+            }
+            const std::string & old = cfg["options"].string_value();
+            cfg["options"] = old != "" ? old+","+args[++i] : args[++i];
+        }
        else if (args[i][0] == '-' && args[i][1] == '-')
        {
            const char *opt = args[i]+2;
-            cfg[opt] = !strcmp(opt, "json") || i == narg-1 ? "1" : args[++i];
+            cfg[opt] = !strcmp(opt, "json") || !strcmp(opt, "block") || i == narg-1 ? "1" : args[++i];
        }
+        else
+        {
+            cmd.push_back(args[i]);
+        }
+    }
+    if (cfg.find("block") == cfg.end() && cfg.find("fs") == cfg.end())
+    {
+        fprintf(stderr, "Specify one of --block or --fs NAME. Use vitastor-nfs --help for details\n");
+        exit(1);
+    }
+    if (cmd.size() >= 2 && cmd[0] == "mount")
+    {
+        cfg["mount"] = cmd[1];
+    }
+    else if (cmd.size() >= 1 && cmd[0] == "start")
+    {
+    }
+    else
+    {
+        printf("%s", help_text);
+        exit(1);
    }
    return cfg;
 }
@@ -92,6 +149,10 @@ void nfs_proxy_t::run(json11::Json cfg)
    srand48(tv.tv_sec*1000000000 + tv.tv_nsec);
    server_id = (uint64_t)lrand48() | ((uint64_t)lrand48() << 31) | ((uint64_t)lrand48() << 62);
    // Parse options
+    if (cfg["logfile"].string_value() != "")
+        logfile = cfg["logfile"].string_value();
+    pidfile = cfg["pidfile"].string_value();
+    trace = cfg["log_level"].uint64_value() > 5 || cfg["trace"].uint64_value() > 0;
    bind_address = cfg["bind"].string_value();
    if (bind_address == "")
        bind_address = "0.0.0.0";
@@ -103,18 +164,6 @@ void nfs_proxy_t::run(json11::Json cfg)
    export_root = cfg["nfspath"].string_value();
    if (!export_root.size())
        export_root = "/";
-    name_prefix = cfg["subdir"].string_value();
-    {
-        int e = name_prefix.size();
-        while (e > 0 && name_prefix[e-1] == '/')
-            e--;
-        int s = 0;
-        while (s < e && name_prefix[s] == '/')
-            s++;
-        name_prefix = name_prefix.substr(s, e-s);
-        if (name_prefix.size())
-            name_prefix += "/";
-    }
    if (cfg["client_writeback_allowed"].is_null())
    {
        // NFS is always aware of fsync, so we allow write-back cache
@@ -123,6 +172,16 @@ void nfs_proxy_t::run(json11::Json cfg)
        obj["client_writeback_allowed"] = true;
        cfg = obj;
    }
+    mountpoint = cfg["mount"].string_value();
+    if (mountpoint != "")
+    {
+        bind_address = "127.0.0.1";
+        nfs_port = 0;
+        portmap_enabled = false;
+        exit_on_umount = true;
+    }
+    mountopts = cfg["options"].string_value();
+    fsname = cfg["fs"].string_value();
    // Create client
    ringloop = new ring_loop_t(RINGLOOP_DEFAULT_SIZE);
    epmgr = new epoll_manager_t(ringloop);
@@ -131,67 +190,7 @@ void nfs_proxy_t::run(json11::Json cfg)
    cmd->ringloop = ringloop;
    cmd->epmgr = epmgr;
    cmd->cli = cli;
-    // We need inode name hashes for NFS handles to remain stateless and <= 64 bytes long
-    dir_info[""] = (nfs_dir_t){
-        .id = 1,
-        .mod_rev = 0,
-    };
-    clock_gettime(CLOCK_REALTIME, &dir_info[""].mtime);
    watch_stats();
-    assert(cli->st_cli.on_inode_change_hook == NULL);
-    cli->st_cli.on_inode_change_hook = [this](inode_t changed_inode, bool removed)
-    {
-        auto inode_cfg_it = cli->st_cli.inode_config.find(changed_inode);
-        if (inode_cfg_it == cli->st_cli.inode_config.end())
-        {
-            return;
-        }
-        auto & inode_cfg = inode_cfg_it->second;
-        std::string full_name = inode_cfg.name;
-        if (name_prefix != "" && full_name.substr(0, name_prefix.size()) != name_prefix)
-        {
-            return;
-        }
-        // Calculate directory modification time and revision (used as "cookie verifier")
-        timespec now;
-        clock_gettime(CLOCK_REALTIME, &now);
-        dir_info[""].mod_rev = dir_info[""].mod_rev < inode_cfg.mod_revision ? inode_cfg.mod_revision : dir_info[""].mod_rev;
-        dir_info[""].mtime = now;
-        int pos = full_name.find('/', name_prefix.size());
-        while (pos >= 0)
-        {
-            std::string dir = full_name.substr(0, pos);
-            auto & dinf = dir_info[dir];
-            if (!dinf.id)
-                dinf.id = next_dir_id++;
-            dinf.mod_rev = dinf.mod_rev < inode_cfg.mod_revision ? inode_cfg.mod_revision : dinf.mod_rev;
-            dinf.mtime = now;
-            dir_by_hash["S"+base64_encode(sha256(dir))] = dir;
-            pos = full_name.find('/', pos+1);
-        }
-        // Alter inode_by_hash
-        if (removed)
-        {
-            auto ino_it = hash_by_inode.find(changed_inode);
-            if (ino_it != hash_by_inode.end())
-            {
-                inode_by_hash.erase(ino_it->second);
-                hash_by_inode.erase(ino_it);
-            }
-        }
-        else
-        {
-            std::string hash = "S"+base64_encode(sha256(full_name));
-            auto hbi_it = hash_by_inode.find(changed_inode);
-            if (hbi_it != hash_by_inode.end() && hbi_it->second != hash)
-            {
-                // inode had a different name, remove old hash=>inode pointer
-                inode_by_hash.erase(hbi_it->second);
-            }
-            inode_by_hash[hash] = changed_inode;
-            hash_by_inode[changed_inode] = hash;
-        }
-    };
    // Load image metadata
    while (!cli->is_ready())
    {
@@ -202,6 +201,17 @@ void nfs_proxy_t::run(json11::Json cfg)
    }
    // Check default pool
    check_default_pool();
+    // Check if we're using VitastorFS
+    if (fsname == "")
+    {
+        blockfs = new block_fs_state_t();
+        blockfs->init(this, cfg);
+    }
+    else
+    {
+        kvfs = new kv_fs_state_t();
+        kvfs->init(this, cfg);
+    }
    // Self-register portmap and NFS
    pmap.reg_ports.insert((portmap_id_t){
        .prog = PMAP_PROGRAM,
@@ -232,7 +242,7 @@ void nfs_proxy_t::run(json11::Json cfg)
        .addr = "0.0.0.0.0."+std::to_string(nfs_port),
    });
    // Create NFS socket and add it to epoll
-    int nfs_socket = create_and_bind_socket(bind_address, nfs_port, 128, NULL);
+    int nfs_socket = create_and_bind_socket(bind_address, nfs_port, 128, &listening_port);
    fcntl(nfs_socket, F_SETFL, fcntl(nfs_socket, F_GETFL, 0) | O_NONBLOCK);
    epmgr->tfd->set_fd_handler(nfs_socket, false, [this](int nfs_socket, int epoll_events)
    {
@@ -264,17 +274,40 @@ void nfs_proxy_t::run(json11::Json cfg)
            }
        });
    }
+    if (mountpoint != "")
+    {
+        mount_fs();
+    }
    if (cfg["foreground"].is_null())
    {
        daemonize();
    }
-    while (true)
+    if (pidfile != "")
+    {
+        write_pid();
+    }
+    while (!finished)
    {
        ringloop->loop();
        ringloop->wait();
    }
    // Destroy the client
    cli->flush();
+    if (kvfs)
+    {
+        delete kvfs;
+        kvfs = NULL;
+    }
+    if (blockfs)
+    {
+        delete blockfs;
+        blockfs = NULL;
+    }
+    if (db)
+    {
+        delete db;
+        db = NULL;
+    }
    delete cli;
    delete epmgr;
    delete ringloop;
@@ -351,7 +384,7 @@ void nfs_proxy_t::parse_stats(etcd_kv_t & kv)
        inode_t inode_num = 0;
        char null_byte = 0;
        int scanned = sscanf(key.c_str() + cli->st_cli.etcd_prefix.length()+13, "%u/%ju%c", &pool_id, &inode_num, &null_byte);
-        if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX || !inode_num)
+        if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX)
        {
            fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
        }
@@ -382,8 +415,9 @@ void nfs_proxy_t::check_default_pool()
    {
        if (cli->st_cli.pool_config.size() == 1)
        {
-            default_pool = cli->st_cli.pool_config.begin()->second.name;
-            default_pool_id = cli->st_cli.pool_config.begin()->first;
+            auto pool_it = cli->st_cli.pool_config.begin();
+            default_pool_id = pool_it->first;
+            default_pool = pool_it->second.name;
        }
        else
        {
@@ -416,11 +450,17 @@ void nfs_proxy_t::do_accept(int listen_fd)
    int nfs_fd = 0;
    while ((nfs_fd = accept(listen_fd, (struct sockaddr *)&addr, &addr_size)) >= 0)
    {
-        fprintf(stderr, "New client %d: connection from %s\n", nfs_fd, addr_to_string(addr).c_str());
+        if (trace)
+            fprintf(stderr, "New client %d: connection from %s\n", nfs_fd, addr_to_string(addr).c_str());
+        active_connections++;
        fcntl(nfs_fd, F_SETFL, fcntl(nfs_fd, F_GETFL, 0) | O_NONBLOCK);
        int one = 1;
        setsockopt(nfs_fd, SOL_TCP, TCP_NODELAY, &one, sizeof(one));
        auto cli = new nfs_client_t();
+        if (kvfs)
+            nfs_kv_procs(cli);
+        else
+            nfs_block_procs(cli);
        cli->parent = this;
        cli->nfs_fd = nfs_fd;
        for (auto & fn: pmap.proc_table)
@@ -432,8 +472,12 @@ void nfs_proxy_t::do_accept(int listen_fd)
            // Handle incoming event
            if (epoll_events & EPOLLRDHUP)
            {
-                fprintf(stderr, "Client %d disconnected\n", nfs_fd);
+                auto parent = cli->parent;
+                if (parent->trace)
+                    fprintf(stderr, "Client %d disconnected\n", nfs_fd);
                cli->stop();
+                parent->active_connections--;
+                parent->check_exit();
                return;
            }
            cli->epoll_events |= epoll_events;
@@ -544,7 +588,7 @@ void nfs_client_t::handle_read(int result)
    read_msg.msg_iovlen = 0;
    if (deref())
        return;
-    if (result <= 0 && result != -EAGAIN && result != -EINTR)
+    if (result <= 0 && result != -EAGAIN && result != -EINTR && result != -ECANCELED)
    {
        printf("Failed read from client %d: %d (%s)\n", nfs_fd, result, strerror(-result));
        stop();
@@ -639,8 +683,8 @@ void nfs_client_t::handle_read(int result)
                return;
            }
        }
-        submit_read(0);
    }
+    submit_read(0);
 }

 void nfs_client_t::submit_send()
@@ -968,8 +1012,164 @@ void nfs_proxy_t::daemonize()
    close(1);
    close(2);
    open("/dev/null", O_RDONLY);
-    open("/dev/null", O_WRONLY);
-    open("/dev/null", O_WRONLY);
+    open(logfile.c_str(), O_WRONLY|O_APPEND|O_CREAT, 0666);
+    open(logfile.c_str(), O_WRONLY|O_APPEND|O_CREAT, 0666);
+}
+
+void nfs_proxy_t::write_pid()
+{
+    int fd = open(pidfile.c_str(), O_WRONLY|O_CREAT|O_TRUNC, 0666);
+    if (fd < 0)
+    {
+        fprintf(stderr, "Failed to create pid file %s: %s (code %d)\n", pidfile.c_str(), strerror(errno), errno);
+        return;
+    }
+    auto pid = std::to_string(getpid());
+    if (write(fd, pid.c_str(), pid.size()) < 0)
+    {
+        fprintf(stderr, "Failed to write pid to %s: %s (code %d)\n", pidfile.c_str(), strerror(errno), errno);
+    }
+    close(fd);
+}
+
+static pid_t wanted_pid = 0;
+static bool child_finished = false;
+static int child_status = -1;
+
+void single_child_handler(int signal)
+{
+    child_finished = true;
+    waitpid(wanted_pid, &child_status, WNOHANG);
+}
+
+void nfs_proxy_t::mount_fs()
+{
+    check_already_mounted();
+    signal(SIGCHLD, single_child_handler);
+    auto pid = fork();
+    if (pid < 0)
+    {
+        fprintf(stderr, "Failed to fork: %s (code %d)\n", strerror(errno), errno);
+        exit(1);
+    }
+    if (pid > 0)
+    {
+        // Parent - loop and wait until child finishes
+        wanted_pid = pid;
+        exit_on_umount = false;
+        while (!child_finished)
+        {
+            ringloop->loop();
+            ringloop->wait();
+        }
+        if (!WIFEXITED(child_status) || WEXITSTATUS(child_status) != 0)
+        {
+            // Mounting failed
+            exit(1);
+        }
+        if (fsname != "")
+            fprintf(stderr, "Successfully mounted VitastorFS %s at %s\n", fsname.c_str(), mountpoint.c_str());
+        else
+            fprintf(stderr, "Successfully mounted Vitastor pseudo-FS at %s\n", mountpoint.c_str());
+        finished = false;
+        exit_on_umount = true;
+    }
+    else
+    {
+        // Child
+        std::string src = ("localhost:"+export_root);
+        std::string opts = ("port="+std::to_string(listening_port)+",mountport="+std::to_string(listening_port)+",nfsvers=3,nolock,tcp");
+        bool hard = false, async = false;
+        for (auto & opt: explode(",", mountopts, true))
+        {
+            if (opt == "hard")
+                hard = true;
+            else if (opt == "async")
+                async = true;
+            else if (opt.substr(0, 4) != "port" && opt.substr(0, 9) != "mountport" &&
+                opt.substr(0, 7) != "nfsvers" && opt.substr(0, 5) != "proto" &&
+                opt != "udp" && opt != "tcp" && opt != "rdma")
+            {
+                opts += ","+opt;
+            }
+        }
+        if (!hard)
+            opts += ",soft";
+        if (!async)
+            opts += ",sync";
+        const char *args[] = { "mount", src.c_str(), mountpoint.c_str(), "-o", opts.c_str(), NULL };
+        execvp("mount", (char* const*)args);
+        fprintf(stderr, "Failed to run mount %s %s -o %s: %s (code %d)\n",
+            src.c_str(), mountpoint.c_str(), opts.c_str(), strerror(errno), errno);
+        exit(1);
+    }
+}
+
+void nfs_proxy_t::check_already_mounted()
+{
+    std::string realpoint = realpath_str(mountpoint, false);
+    if (realpoint == "")
+    {
+        return;
+    }
+    std::string mountstr = read_file("/proc/mounts");
+    if (mountstr == "")
+    {
+        return;
+    }
+    auto mounts = explode("\n", mountstr, true);
+    for (auto & str: mounts)
+    {
+        auto mnt = explode(" ", str, true);
+        if (mnt.size() >= 2 && mnt[1] == realpoint)
+        {
+            fprintf(stderr, "%s is already mounted\n", mountpoint.c_str());
+            exit(1);
+        }
+    }
+}
+
+void nfs_proxy_t::check_exit()
+{
+    if (active_connections || !exit_on_umount)
+    {
+        return;
+    }
+    fprintf(stderr, "All active NFS connections are closed, checking /proc/mounts\n");
+    std::string mountstr = read_file("/proc/mounts");
+    if (mountstr == "")
+    {
+        return;
+    }
+    auto port_opt = "port="+std::to_string(listening_port);
+    auto mountport_opt = "mountport="+std::to_string(listening_port);
+    auto mounts = explode("\n", mountstr, true);
+    for (auto & str: mounts)
+    {
+        auto opts = explode(" ", str, true);
+        if (opts.size() >= 4 && opts[2].size() >= 3 && opts[2].substr(0, 3) == "nfs")
+        {
+            opts = explode(",", opts[3], true);
+            bool port_found = false;
+            bool addr_found = false;
+            for (auto & opt: opts)
+            {
+                if (opt == port_opt || opt == mountport_opt)
+                    port_found = true;
+                if (opt == "addr=127.0.0.1" || opt == "mountaddr=127.0.0.1")
+                    addr_found = true;
+            }
+            if (port_found && addr_found)
+            {
+                // Mount is still present, keep the server running
+                fprintf(stderr, "NFS mount to 127.0.0.1:%d still active, leaving server active\n", listening_port);
+                return;
+            }
+        }
+    }
+    fprintf(stderr, "NFS mount to 127.0.0.1:%d not found, exiting\n", listening_port);
+    // Mount not found, stop the server
+    finished = true;
 }

 int main(int narg, const char *args[])
--- a/src/nfs_proxy.h
+++ b/src/nfs_proxy.h
@@ -4,51 +4,54 @@
 #include "epoll_manager.h"
 #include "nfs_portmap.h"
 #include "nfs/xdr_impl.h"
+#include "kv_db.h"

+#define NFS_ROOT_HANDLE "R"
 #define RPC_INIT_BUF_SIZE 32768
+#define MAX_REQUEST_SIZE (128*1024*1024)
+#define TRUE 1
+#define FALSE 0

 class cli_tool_t;

-struct nfs_dir_t
-{
-    uint64_t id;
-    uint64_t mod_rev;
-    timespec mtime;
-};
+struct kv_fs_state_t;
+struct block_fs_state_t;

 class nfs_proxy_t
 {
 public:
    std::string bind_address;
-    std::string name_prefix;
    uint64_t fsid = 1;
    uint64_t server_id = 0;
+    // FIXME: Maybe allow creating files in different pools?
    std::string default_pool;
    std::string export_root;
    bool portmap_enabled;
    unsigned nfs_port;
+    int trace = 0;
+    std::string logfile = "/dev/null";
+    std::string pidfile;
+    bool exit_on_umount = false;
+    std::string mountpoint;
+    std::string mountopts;
+    std::string fsname;

-    pool_id_t default_pool_id;
+    int active_connections = 0;
+    bool finished = false;
+    int listening_port = 0;
+    pool_id_t default_pool_id = 0;

    portmap_service_t pmap;
    ring_loop_t *ringloop = NULL;
    epoll_manager_t *epmgr = NULL;
    cluster_client_t *cli = NULL;
    cli_tool_t *cmd = NULL;
+    kv_dbw_t *db = NULL;
+    kv_fs_state_t *kvfs = NULL;
+    block_fs_state_t *blockfs = NULL;

    std::vector<XDR*> xdr_pool;

-    // filehandle = "S"+base64(sha256(full name with prefix)) or "roothandle" for mount root)
-
-    uint64_t next_dir_id = 2;
-    // filehandle => dir with name_prefix
-    std::map<std::string, std::string> dir_by_hash;
-    // dir with name_prefix => dir info
-    std::map<std::string, nfs_dir_t> dir_info;
-    // filehandle => inode ID
-    std::map<std::string, inode_t> inode_by_hash;
-    // inode ID => filehandle
-    std::map<inode_t, std::string> hash_by_inode;
    // inode ID => statistics
    std::map<inode_t, json11::Json> inode_stats;
    // pool ID => statistics
@@ -63,6 +66,10 @@ public:
    void check_default_pool();
    void do_accept(int listen_fd);
    void daemonize();
+    void write_pid();
+    void mount_fs();
+    void check_already_mounted();
+    void check_exit();
 };

 struct rpc_cur_buffer_t
@@ -86,28 +93,6 @@ struct rpc_free_buffer_t
    unsigned size;
 };

-struct extend_size_t
-{
-    inode_t inode;
-    uint64_t new_size;
-};
-
-inline bool operator < (const extend_size_t &a, const extend_size_t &b)
-{
-    return a.inode < b.inode || a.inode == b.inode && a.new_size < b.new_size;
-}
-
-struct extend_write_t
-{
-    rpc_op_t *rop;
-    int resize_res, write_res; // 1 = started, 0 = completed OK, -errno = completed with error
-};
-
-struct extend_inode_t
-{
-    uint64_t cur_extend = 0, next_extend = 0;
-};
-
 class nfs_client_t
 {
 public:
@@ -122,8 +107,6 @@ public:
    rpc_cur_buffer_t cur_buffer = { 0 };
    std::map<uint8_t*, rpc_used_buffer_t> used_buffers;
    std::vector<rpc_free_buffer_t> free_buffers;
-    std::map<inode_t, extend_inode_t> extends;
-    std::multimap<extend_size_t, extend_write_t> extend_writes;

    iovec read_iov;
    msghdr read_msg = { 0 };
@@ -133,9 +116,6 @@ public:
    std::vector<iovec> send_list, next_send_list;
    std::vector<rpc_op_t*> outbox, next_outbox;

-    nfs_client_t();
-    ~nfs_client_t();
-
    void select_read_buffer(unsigned wanted_size);
    void submit_read(unsigned wanted_size);
    void handle_read(int result);
--- a/src/osd.h
+++ b/src/osd.h
@@ -239,6 +239,7 @@ class osd_t
    void report_statistics();
    void report_pg_state(pg_t & pg);
    void report_pg_states();
+    void apply_no_inode_stats();
    void apply_pg_count();
    void apply_pg_config();

--- a/src/osd_cluster.cpp
+++ b/src/osd_cluster.cpp
@@ -388,9 +388,18 @@ void osd_t::on_change_etcd_state_hook(std::map<std::string, etcd_kv_t> & changes
        etcd_global_config = changes[st_cli.etcd_prefix+"/config/global"].value.object_items();
        parse_config(false);
    }
+    bool pools = changes.find(st_cli.etcd_prefix+"/config/pools") != changes.end();
+    if (pools)
+    {
+        apply_no_inode_stats();
+    }
    if (run_primary)
    {
-        apply_pg_count();
+        bool pgs = changes.find(st_cli.etcd_prefix+"/config/pgs") != changes.end();
+        if (pools || pgs)
+        {
+            apply_pg_count();
+        }
        apply_pg_config();
    }
 }
@@ -414,6 +423,8 @@ void osd_t::on_reload_config_hook(json11::Json::object & global_config)
 // Acquire lease
 void osd_t::acquire_lease()
 {
+    // Apply no_inode_stats before the first statistics report
+    apply_no_inode_stats();
    // Maximum lease TTL is (report interval) + retries * (timeout + repeat interval)
    st_cli.etcd_call("/lease/grant", json11::Json::object {
        { "TTL", etcd_report_interval+(st_cli.max_etcd_attempts*(2*st_cli.etcd_quick_timeout)+999)/1000 }
@@ -602,11 +613,32 @@ void osd_t::on_load_pgs_hook(bool success)
    else
    {
        peering_state &= ~OSD_LOADING_PGS;
-        apply_pg_count();
-        apply_pg_config();
+        apply_no_inode_stats();
+        if (run_primary)
+        {
+            apply_pg_count();
+            apply_pg_config();
+        }
    }
 }

+void osd_t::apply_no_inode_stats()
+{
+    if (!bs)
+    {
+        return;
+    }
+    std::vector<uint64_t> no_inode_stats;
+    for (auto & pool_item: st_cli.pool_config)
+    {
+        if (!pool_item.second.used_for_fs.empty())
+        {
+            no_inode_stats.push_back(pool_item.first);
+        }
+    }
+    bs->set_no_inode_stats(no_inode_stats);
+}
+
 void osd_t::apply_pg_count()
 {
    for (auto & pool_item: st_cli.pool_config)
--- a/src/osd_id.h
+++ b/src/osd_id.h
@@ -3,13 +3,15 @@

 #pragma once

+#include "object_id.h"
+
 #define POOL_SCHEME_REPLICATED 1
 #define POOL_SCHEME_XOR 2
 #define POOL_SCHEME_EC 3
 #define POOL_ID_MAX 0x10000
 #define POOL_ID_BITS 16
 #define INODE_POOL(inode) (pool_id_t)((inode) >> (64 - POOL_ID_BITS))
-#define INODE_NO_POOL(inode) (inode_t)(inode & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1))
+#define INODE_NO_POOL(inode) (inode_t)((inode) & (((uint64_t)1 << (64-POOL_ID_BITS)) - 1))
 #define INODE_WITH_POOL(pool_id, inode) (((inode_t)(pool_id) << (64-POOL_ID_BITS)) | INODE_NO_POOL(inode))

 // Pool ID is 16 bits long
--- a/src/str_util.cpp
+++ b/src/str_util.cpp
@@ -1,9 +1,10 @@
 // Copyright (c) Vitaliy Filippov, 2019+
-// License: VNPL-1.1 (see README.md for details)
+// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)

 #include <assert.h>
 #include <string.h>
 #include <unistd.h>
+#include <fcntl.h>
 #include "str_util.h"

 std::string base64_encode(const std::string &in)
@@ -214,7 +215,10 @@ void print_help(const char *help_text, std::string exe_name, std::string cmd, bo
            else if (*next_line && isspace(*next_line))
                started = true;
            else if (cmd_start && matched)
+            {
                filtered_text += std::string(cmd_start, next_line-cmd_start);
+                matched = started = false;
+            }
        }
        while (filtered_text.size() > 1 &&
            filtered_text[filtered_text.size()-1] == '\n' &&
@@ -301,6 +305,23 @@ std::string read_all_fd(int fd)
    return res;
 }

+std::string read_file(std::string file, bool allow_enoent)
+{
+    std::string res;
+    int fd = open(file.c_str(), O_RDONLY);
+    if (fd < 0 || (res = read_all_fd(fd)) == "")
+    {
+        int err = errno;
+        if (fd >= 0)
+            close(fd);
+        if (!allow_enoent || err != ENOENT)
+            fprintf(stderr, "Failed to read %s: %s (code %d)\n", file.c_str(), strerror(err), err);
+        return "";
+    }
+    close(fd);
+    return res;
+}
+
 std::string str_repeat(const std::string & str, int times)
 {
    std::string r;
@@ -324,3 +345,121 @@ size_t utf8_length(const char *s)
        len += (*s & 0xC0) != 0x80;
    return len;
 }
+
+std::vector<std::string> explode(const std::string & sep, const std::string & value, bool trim)
+{
+    std::vector<std::string> res;
+    size_t prev = 0;
+    while (prev < value.size())
+    {
+        while (trim && prev < value.size() && isspace(value[prev]))
+            prev++;
+        size_t pos = value.find(sep, prev);
+        if (pos == std::string::npos)
+            pos = value.size();
+        size_t next = pos+sep.size();
+        while (trim && pos > prev && isspace(value[pos-1]))
+            pos--;
+        if (!trim || pos > prev)
+            res.push_back(value.substr(prev, pos-prev));
+        prev = next;
+    }
+    return res;
+}
+
+std::string scan_escaped(const std::string & cmd, size_t & pos, bool allow_unquoted)
+{
+    return scan_escaped(cmd.data(), cmd.size(), pos, allow_unquoted);
+}
+
+// extract possibly single- or double-quoted part of string with escape characters
+std::string scan_escaped(const char *cmd, size_t size, size_t & pos, bool allow_unquoted)
+{
+    auto orig = pos;
+    while (pos < size && is_white(cmd[pos]))
+        pos++;
+    if (pos >= size)
+    {
+        pos = orig;
+        return "";
+    }
+    if (cmd[pos] != '"' && cmd[pos] != '\'')
+    {
+        if (!allow_unquoted)
+        {
+            pos = orig;
+            return "";
+        }
+        auto pos2 = pos;
+        while (pos2 < size && !is_white(cmd[pos2]))
+            pos2++;
+        auto key = std::string(cmd+pos, pos2-pos);
+        pos = pos2;
+        return key;
+    }
+    char quot = cmd[pos];
+    pos++;
+    std::string key;
+    while (true)
+    {
+        auto pos2 = pos;
+        while (pos2 < size && cmd[pos2] != '\\' && cmd[pos2] != quot)
+            pos2++;
+        if (pos2 >= size || (pos2 == size-1 && cmd[pos2] == '\\'))
+        {
+            // Unfinished string literal
+            pos = orig;
+            return "";
+        }
+        if (pos2 > pos)
+            key += std::string(cmd+pos, pos2-pos);
+        pos = pos2;
+        if (cmd[pos] == quot)
+        {
+            pos++;
+            break;
+        }
+        else /* if (cmd[pos] == '\\') */
+        {
+            key += cmd[++pos];
+            pos++;
+        }
+    }
+    return key;
+}
+
+std::string auto_addslashes(const std::string & str, const char *toescape)
+{
+    auto pos = str.find_first_of(toescape);
+    if (pos == std::string::npos)
+        return str;
+    return addslashes(str, toescape);
+}
+
+std::string addslashes(const std::string & str, const char *toescape)
+{
+    std::string res = "\"";
+    size_t pos = 0;
+    while (pos < str.size())
+    {
+        auto pos2 = str.find_first_of(toescape, pos);
+        if (pos2 == std::string::npos)
+            return res + str.substr(pos) + "\"";
+        res += str.substr(pos, pos2-pos)+"\\"+str[pos2];
+        pos = pos2+1;
+    }
+    return res+"\"";
+}
+
+std::string realpath_str(std::string path, bool nofail)
+{
+    char *p = realpath((char*)path.c_str(), NULL);
+    if (!p)
+    {
+        fprintf(stderr, "Failed to resolve %s: %s\n", path.c_str(), strerror(errno));
+        return nofail ? path : "";
+    }
+    std::string rp(p);
+    free(p);
+    return rp;
+}
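The splitting rules of `explode()` in the hunk above are easiest to see on concrete inputs. The following is a minimal sketch, not the project's code: `explode_sketch` and `explode_sketch_examples` are hypothetical names for a local copy of the same loop, and the `unsigned char` cast before `std::isspace` is an addition that is not in the hunk.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Local copy of the splitting loop shown above (hypothetical name).
// Assumption: the cast to unsigned char is added here to keep std::isspace
// well-defined for non-ASCII bytes; the original calls isspace() directly.
static std::vector<std::string> explode_sketch(const std::string & sep, const std::string & value, bool trim)
{
    std::vector<std::string> res;
    size_t prev = 0;
    while (prev < value.size())
    {
        // Skip leading whitespace of the current field
        while (trim && prev < value.size() && std::isspace((unsigned char)value[prev]))
            prev++;
        size_t pos = value.find(sep, prev);
        if (pos == std::string::npos)
            pos = value.size();
        size_t next = pos+sep.size();
        // Strip trailing whitespace of the current field
        while (trim && pos > prev && std::isspace((unsigned char)value[pos-1]))
            pos--;
        if (!trim || pos > prev)
            res.push_back(value.substr(prev, pos-prev));
        prev = next;
    }
    return res;
}

// Example expectations; call this from a test or from main()
static void explode_sketch_examples()
{
    using strings = std::vector<std::string>;
    // trim=true strips whitespace around fields and drops fields that end up empty
    assert((explode_sketch(",", " a , b ,, c ", true) == strings{"a", "b", "c"}));
    // trim=false keeps inner empty fields...
    assert((explode_sketch(",", "a,,b", false) == strings{"a", "", "b"}));
    // ...but a trailing separator does not yield a trailing empty field
    assert((explode_sketch(",", "a,", false) == strings{"a"}));
}
```

The last case follows from the `while (prev < value.size())` loop condition: once the final separator is consumed, `prev` equals the string length and the loop exits before pushing an empty field.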
--- a/src/str_util.h
+++ b/src/str_util.h
@@ -1,9 +1,12 @@
 // Copyright (c) Vitaliy Filippov, 2019+
-// License: VNPL-1.1 (see README.md for details)
+// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)

 #pragma once
 #include <stdint.h>
 #include <string>
+#include <vector>
+
+#define is_white(a) ((a) == ' ' || (a) == '\t' || (a) == '\r' || (a) == '\n')

 std::string base64_encode(const std::string &in);
 std::string base64_decode(const std::string &in);
@@ -17,6 +20,13 @@ std::string format_size(uint64_t size, bool nobytes = false);
 void print_help(const char *help_text, std::string exe_name, std::string cmd, bool all);
 uint64_t parse_time(std::string time_str, bool *ok = NULL);
 std::string read_all_fd(int fd);
+std::string read_file(std::string file, bool allow_enoent = false);
 std::string str_repeat(const std::string & str, int times);
 size_t utf8_length(const std::string & s);
 size_t utf8_length(const char *s);
+std::vector<std::string> explode(const std::string & sep, const std::string & value, bool trim);
+std::string scan_escaped(const char *cmd, size_t size, size_t & pos, bool allow_unquoted = true);
+std::string scan_escaped(const std::string & cmd, size_t & pos, bool allow_unquoted = true);
+std::string auto_addslashes(const std::string & str, const char *toescape = "\\\"");
+std::string addslashes(const std::string & str, const char *toescape = "\\\"");
+std::string realpath_str(std::string path, bool nofail = true);
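The declarations above pair `addslashes()` with `scan_escaped()`: one quotes and escapes a string, the other reads a possibly quoted token back. The sketch below illustrates that round trip under stated assumptions. All `*_sketch` names and `escaping_examples` are hypothetical local copies that follow the logic of the str_util.cpp hunk; they are not the library's exported functions.

```cpp
#include <cassert>
#include <string>

// Same whitespace set as the is_white() macro in str_util.h
static bool is_white_sketch(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

// Always wraps str in double quotes and backslash-escapes every character listed in toescape
static std::string addslashes_sketch(const std::string & str, const char *toescape = "\\\"")
{
    std::string res = "\"";
    size_t pos = 0;
    while (pos < str.size())
    {
        size_t pos2 = str.find_first_of(toescape, pos);
        if (pos2 == std::string::npos)
            return res + str.substr(pos) + "\"";
        res += str.substr(pos, pos2-pos) + "\\" + str[pos2];
        pos = pos2+1;
    }
    return res + "\"";
}

// Reads one token starting at pos: either a bare word or a single-/double-quoted string.
// On failure (end of input or unfinished literal) returns "" and restores pos.
static std::string scan_escaped_sketch(const std::string & cmd, size_t & pos, bool allow_unquoted = true)
{
    const size_t size = cmd.size(), orig = pos;
    while (pos < size && is_white_sketch(cmd[pos]))
        pos++;
    if (pos >= size || (cmd[pos] != '"' && cmd[pos] != '\'' && !allow_unquoted))
    {
        pos = orig;
        return "";
    }
    if (cmd[pos] != '"' && cmd[pos] != '\'')
    {
        // Bare word: everything up to the next whitespace
        size_t pos2 = pos;
        while (pos2 < size && !is_white_sketch(cmd[pos2]))
            pos2++;
        std::string key = cmd.substr(pos, pos2-pos);
        pos = pos2;
        return key;
    }
    const char quot = cmd[pos++];
    std::string key;
    while (true)
    {
        size_t pos2 = pos;
        while (pos2 < size && cmd[pos2] != '\\' && cmd[pos2] != quot)
            pos2++;
        if (pos2 >= size || (pos2 == size-1 && cmd[pos2] == '\\'))
        {
            pos = orig; // unfinished string literal
            return "";
        }
        key += cmd.substr(pos, pos2-pos);
        pos = pos2;
        if (cmd[pos] == quot)
        {
            pos++;
            break;
        }
        key += cmd[++pos]; // backslash: take the next character literally
        pos++;
    }
    return key;
}

// Example expectations; call this from a test or from main()
static void escaping_examples()
{
    // The command line is: set "a \"b\" c" plain
    const std::string cmd = "set \"a \\\"b\\\" c\" plain";
    size_t pos = 0;
    assert(scan_escaped_sketch(cmd, pos) == "set");
    assert(scan_escaped_sketch(cmd, pos) == "a \"b\" c");
    assert(scan_escaped_sketch(cmd, pos) == "plain");
    // Quoting and re-scanning returns the original string
    const std::string quoted = addslashes_sketch("a \"b\" c");
    assert(quoted == "\"a \\\"b\\\" c\"");
    pos = 0;
    assert(scan_escaped_sketch(quoted, pos) == "a \"b\" c" && pos == quoted.size());
}
```

Because a failed scan returns an empty string, a caller cannot tell an empty quoted token (`""`) from an error by the return value alone; checking whether `pos` advanced is one way to distinguish the two.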
--- a/src/vitastor.pc.in
+++ b/src/vitastor.pc.in
@@ -6,7 +6,7 @@ includedir=${prefix}/@CMAKE_INSTALL_INCLUDEDIR@

 Name: Vitastor
 Description: Vitastor client library
-Version: 1.4.8
+Version: 1.5.0
 Libs: -L${libdir} -lvitastor_client
 Cflags: -I${includedir}

--- a/tests/run_tests.sh
+++ b/tests/run_tests.sh
@@ -68,3 +68,5 @@ SCHEME=xor ./test_scrub.sh
 PG_SIZE=3 ./test_scrub.sh
 PG_SIZE=6 PG_MINSIZE=4 OSD_COUNT=6 SCHEME=ec ./test_scrub.sh
 SCHEME=ec ./test_scrub.sh
+
+./test_nfs.sh
--- a/tests/test_failure_domain.sh
+++ b/tests/test_failure_domain.sh
@@ -13,7 +13,14 @@ $ETCDCTL put /vitastor/osd/stats/5 '{"host":"host3","size":1073741824,"time":"'$
 $ETCDCTL put /vitastor/osd/stats/6 '{"host":"host3","size":1073741824,"time":"'$TIME'"}'
 $ETCDCTL put /vitastor/osd/stats/7 '{"host":"host4","size":1073741824,"time":"'$TIME'"}'
 $ETCDCTL put /vitastor/osd/stats/8 '{"host":"host4","size":1073741824,"time":"'$TIME'"}'
-$ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":4,"failure_domain":"rack"}}'
+build/src/vitastor-cli --etcd_address $ETCD_URL create-pool testpool --ec 3+2 -n 32 --failure_domain rack --force
+$ETCDCTL get --print-value-only /vitastor/config/pools | jq -s -e '. == [{"1": {"failure_domain": "rack", "name": "testpool", "parity_chunks": 2, "pg_count": 32, "pg_minsize": 4, "pg_size": 5, "scheme": "ec"}}]'
+build/src/vitastor-cli --etcd_address $ETCD_URL modify-pool testpool --ec 3+3 --failure_domain host
+$ETCDCTL get --print-value-only /vitastor/config/pools | jq -s -e '. == [{"1": {"failure_domain": "host", "name": "testpool", "parity_chunks": 3, "pg_count": 32, "pg_minsize": 4, "pg_size": 6, "scheme": "ec"}}]'
+build/src/vitastor-cli --etcd_address $ETCD_URL rm-pool testpool
+$ETCDCTL get --print-value-only /vitastor/config/pools | jq -s -e '. == [{}]'
+build/src/vitastor-cli --etcd_address $ETCD_URL create-pool testpool -s 2 -n 4 --failure_domain rack --force
+$ETCDCTL get --print-value-only /vitastor/config/pools | jq -s -e '. == [{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":4,"failure_domain":"rack"}}]'

 node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" >>./testdata/mon.log 2>&1 &
 MON_PID=$!
--- a/tests/test_nfs.sh
+++ b/tests/test_nfs.sh
@@ -0,0 +1,168 @@
+#!/bin/bash -ex
+
+PG_COUNT=16
+. `dirname $0`/run_3osds.sh
+
+build/src/vitastor-cli --etcd_address $ETCD_URL create -s 10G fsmeta
+build/src/vitastor-cli --etcd_address $ETCD_URL modify-pool --used-for-fs fsmeta testpool
+build/src/vitastor-nfs start --fs fsmeta --etcd_address $ETCD_URL --portmap 0 --port 2050 --foreground 1 --trace 1 >>./testdata/nfs.log 2>&1 &
+NFS_PID=$!
+
+mkdir -p testdata/nfs
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+MNT=$(pwd)/testdata/nfs
+trap "sudo umount -f $MNT"' || true; kill -9 $(jobs -p)' EXIT
+
+# write small file
+ls -l ./testdata/nfs
+dd if=/dev/urandom of=./testdata/f1 bs=100k count=1
+cp testdata/f1 ./testdata/nfs/
+sudo umount ./testdata/nfs/
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+ls -l ./testdata/nfs | grep f1
+diff ./testdata/f1 ./testdata/nfs/f1
+format_green "100K file ok"
+
+# overwrite it inplace
+dd if=/dev/urandom of=./testdata/f1_90k bs=90k count=1
+cp testdata/f1_90k ./testdata/nfs/f1
+sudo umount ./testdata/nfs/
+format_green "inplace overwrite 90K ok"
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+ls -l ./testdata/nfs | grep f1
+# create another copy
+dd if=./testdata/f1_90k of=./testdata/nfs/f1_nfs bs=1M
+diff ./testdata/f1_90k ./testdata/nfs/f1_nfs
+sudo umount ./testdata/nfs/
+format_green "another copy 90K ok"
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+ls -l ./testdata/nfs | grep f1
+cp ./testdata/nfs/f1 ./testdata/f1_nfs
+diff ./testdata/f1_90k ./testdata/nfs/f1
+format_green "90K data ok"
+
+# test partial shared overwrite
+dd if=/dev/urandom of=./testdata/f1_90k bs=9317 count=1 seek=5 conv=notrunc
+dd if=./testdata/f1_90k of=./testdata/nfs/f1 bs=9317 count=1 skip=5 seek=5 conv=notrunc
+sudo umount ./testdata/nfs/
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+diff ./testdata/f1_90k ./testdata/nfs/f1
+format_green "partial inplace shared overwrite ok"
+
+# move it to a larger shared space
+dd if=/dev/urandom of=./testdata/f1_110k bs=110k count=1
+cp testdata/f1_110k ./testdata/nfs/f1
+sudo umount ./testdata/nfs/
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+ls -l ./testdata/nfs | grep f1
+diff ./testdata/f1_110k ./testdata/nfs/f1
+format_green "move shared 90K -> 110K ok"
+
+# extend it to large file + rm
+dd if=/dev/urandom of=./testdata/f1_2M bs=2M count=1
+cp ./testdata/f1_2M ./testdata/nfs/f1
+sudo umount ./testdata/nfs/
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+ls -l ./testdata/nfs | grep f1
+cp ./testdata/nfs/f1 ./testdata/f1_nfs
+diff ./testdata/f1_2M ./testdata/nfs/f1
+rm ./testdata/nfs/f1
+format_green "extend to 2M + rm ok"
+
+# mkdir
+mkdir -p ./testdata/nfs/dir1/dir2
+echo abcdef > ./testdata/nfs/dir1/dir2/hnpfls
+# rename dir
+mv ./testdata/nfs/dir1 ./testdata/nfs/dir3
+sudo umount ./testdata/nfs/
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+ls -l ./testdata/nfs | grep dir3
+ls -l ./testdata/nfs/dir3 | grep dir2
+ls -l ./testdata/nfs/dir3/dir2 | grep hnpfls
+echo abcdef > ./testdata/hnpfls
+diff ./testdata/hnpfls ./testdata/nfs/dir3/dir2/hnpfls
+format_green "rename dir with file ok"
+
+# touch
+touch -t 202401011404 ./testdata/nfs/dir3/dir2/hnpfls
+sudo chown 65534:65534 ./testdata/nfs/dir3/dir2/hnpfls
+sudo chmod 755 ./testdata/nfs/dir3/dir2/hnpfls
+sudo umount ./testdata/nfs/
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+T=`stat -c '%a %u %g %y' ./testdata/nfs/dir3/dir2/hnpfls | perl -pe 's/(:\d+)(.*)/$1/'`
+[[ "$T" = "755 65534 65534 2024-01-01 14:04" ]]
+format_green "set attrs ok"
+
+# move dir
+mv ./testdata/nfs/dir3/dir2 ./testdata/nfs/
+sudo umount ./testdata/nfs/
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+ls -l ./testdata/nfs | grep dir3
+ls -l ./testdata/nfs | grep dir2
+format_green "move dir ok"
+
+# symlink, readlink
+ln -s dir2 ./testdata/nfs/sym2
+[[ "`stat -c '%A' ./testdata/nfs/sym2`" = "lrwxrwxrwx" ]]
+sudo umount ./testdata/nfs/
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+[[ "`stat -c '%A' ./testdata/nfs/sym2`" = "lrwxrwxrwx" ]]
+[[ "`readlink ./testdata/nfs/sym2`" = "dir2" ]]
+format_green "symlink, readlink ok"
+
+# mknod: chr, blk, sock, fifo + remove
+sudo mknod ./testdata/nfs/nod_chr c 1 5
+sudo mknod ./testdata/nfs/nod_blk b 2 6
+mkfifo ./testdata/nfs/nod_fifo
+perl -e 'use Socket; socket($sock, PF_UNIX, SOCK_STREAM, undef) || die $!; bind($sock, sockaddr_un("./testdata/nfs/nod_sock")) || die $!;'
+chmod 777 ./testdata/nfs/nod_*
+sudo umount ./testdata/nfs/
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+[[ "`ls testdata/nfs|wc -l`" -ge 4 ]]
+[[ "`stat -c '%A' ./testdata/nfs/nod_blk`" = "brwxrwxrwx" ]]
+[[ "`stat -c '%A' ./testdata/nfs/nod_chr`" = "crwxrwxrwx" ]]
+[[ "`stat -c '%A' ./testdata/nfs/nod_fifo`" = "prwxrwxrwx" ]]
+[[ "`stat -c '%A' ./testdata/nfs/nod_sock`" = "srwxrwxrwx" ]]
+sudo rm ./testdata/nfs/nod_*
+format_green "mknod + rm ok"
+
+# hardlink
+echo ABCDEF > ./testdata/nfs/linked1
+i=`stat -c '%i' ./testdata/nfs/linked1`
+ln ./testdata/nfs/linked1 ./testdata/nfs/linked2
+[[ "`stat -c '%i' ./testdata/nfs/linked2`" -eq $i ]]
+echo BABABA > ./testdata/nfs/linked2
+diff ./testdata/nfs/linked2 ./testdata/nfs/linked1
+sudo umount ./testdata/nfs/
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+diff ./testdata/nfs/linked2 ./testdata/nfs/linked1
+[[ "`cat ./testdata/nfs/linked2`" = "BABABA" ]]
+rm ./testdata/nfs/linked2
+sudo umount ./testdata/nfs/
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+[[ "`cat ./testdata/nfs/linked1`" = "BABABA" ]]
+format_green "hardlink ok"
+
+# rm small
+ls -l ./testdata/nfs
+dd if=/dev/urandom of=./testdata/nfs/smallfile bs=100k count=1
+sudo umount ./testdata/nfs/
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+rm ./testdata/nfs/smallfile
+if ls ./testdata/nfs | grep smallfile; then false; fi
+sudo umount ./testdata/nfs/
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+if ls ./testdata/nfs | grep smallfile; then false; fi
+format_green "rm small ok"
+
+# rename over existing
+echo ZXCVBN > ./testdata/nfs/over1
+mv ./testdata/nfs/over1 ./testdata/nfs/linked2
+sudo umount ./testdata/nfs/
+sudo mount localhost:/ ./testdata/nfs -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
+if ls ./testdata/nfs | grep over1; then false; fi
+[[ "`cat ./testdata/nfs/linked2`" = "ZXCVBN" ]]
+[[ "`cat ./testdata/nfs/linked1`" = "BABABA" ]]
+format_green "rename over existing file ok"
+
+format_green OK
Author	SHA1	Message	Date
Vitaliy Filippov	9ad6822353	Release 1.5.0 After half a year of hard work, VitastorFS is finally here! :-) New features: - VitastorFS, a full-featured clustered (read-write-many) file system. Documentation: [VitastorFS](docs/usage/nfs.en.md) - Embedded key-value database implementation based on the Parallel Optimistic B-Tree algorithm and used for the metadata of VitastorFS - Pool management commands in vitastor-cli (create-pool, list-pools, rm-pool, modify-pool). Thanks to MIND Software (https://mindsw.io) for their contribution! [Documentation](docs/usage/cli.en.md#create-pool) Bug fixes: - Fix a very rare "infinite loop" in the client library - Fix a rare OSD hang during start when zeroing out bad metadata entries left from the previous run	2024-03-16 15:35:10 +03:00
Vitaliy Filippov	2043b4e374	Fix build errors for gcc 8	2024-03-16 15:35:10 +03:00
Vitaliy Filippov	de840e6fe3	Reduce kv-cli loadjson load parallelism to 16	2024-03-16 15:35:10 +03:00
Vitaliy Filippov	b5e04bf809	Fix build warning	2024-03-16 15:35:10 +03:00
Vitaliy Filippov	8807a1623b	Fix markdown tables	2024-03-16 15:35:10 +03:00
Vitaliy Filippov	f12855c31b	Add vitastor-kv to packages	2024-03-16 15:35:10 +03:00
Vitaliy Filippov	e75dcc9a71	Add documentation for VitastorFS	2024-03-16 15:16:43 +03:00
Vitaliy Filippov	88516ab4bd	Remove extra log	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	6221126b4f	Allow to print simple-offsets just given the device size	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	6783d4a13c	Implement fool protection for FS pools	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	dcbe1afac3	Store pool ID in inode metadata	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	0bde28c24a	Make nfs_do_rmw a library function	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	bb8ca6184e	Support setattr guard	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	87310ef7bb	Support ctime	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	4f4b2dab80	Log NFS liveness checks	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	f70da82317	Add loadjson command to vitastor-kv	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	e42148f347	Allow to specify KV commands on command line	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	c289584469	Add JSON dump format	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	018e89f867	Erase verf key left from creation from ientries on every modification	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	603dc68f11	Implement async mtime change	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	7b12342933	Allow to specify additional NFS mount options	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	44bf0f16ee	Fix malloc/free in nfs_kv_read/write	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	8840c84572	Fix "bad key in etcd" in mon for FS pools	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	5b747c12ec	Check if already mounted before mounting	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	05f5f46162	Fix zero used space, update mtime when moving/changing inode	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	b5604191c8	Ignore ECANCELED in nfs-proxy (happens in io_uring on fork)	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	e871de27de	Support unaligned shared_offsets, align shared file data instead of header	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	f600ce98e2	Implement auto-unmount local NFS server mode for vitastor-nfs	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	57605a5c13	Return error on failed shrink	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	29bd4561bb	Implement rename over an existing file/directory	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	7142460ec8	Support --logfile in nfs-proxy	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	d03f19ebe5	Fix shared file overlap, add FIXMEs	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	88f9d18be3	Create inode, then direntry, not direntry, then inode; retry ID collisions	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	6213fbd8c6	Fix NFS shared/aligned write FIXMEs	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	3aee37eadd	Allow to disable per-inode stats for VitastorFS pools	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	ecfc753e93	Add basic NFS tests, fix bugs	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	a574f9ad71	Return block NFS implementation back as an option too	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	7c235c9103	Move KV FS header into a separate file	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	e5bb986164	Implement packing small files into shared inodes	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	181795d748	Split new NFS proxy implementation into multiple files	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	8cdc38805b	WIP VitastorFS with metadata storage in VitastorKV	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	0cd455d17f	First just recheck version without actually re-reading block in vitastor-kv	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	32ba653ba6	Fix vitastor-kv hang on reopen & unfinished closed listing	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	231d4b15fc	Add loadable dump format to vitastor-kv (dump)	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	9dc4d5fd7b	Fix freeing r/w buffers on errors in kv_db	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	e58538fa47	Fix eviction when random_pos selects the end	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	11ac9e7024	Implement min/max list_count to make listings during performance test reasonable	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	511bc3df1c	Fix and improve parallel allocation - Do not try to allocate more DB blocks in an inode block until it's "confirmed" and "locked" by the first write - Do not recheck for new zero DB blocks on first write into an inode block - a CAS failure means someone else is already writing into it - Throw new allocation blocks away regardless of whether the known_version is 0 on a CAS failure	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	a64f0d1f73	Implement key_prefix for K/V stress test	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	ec5f7c6b87	More fixes - do not overwrite a block with older version if known version is newer (read may start before update and end after update) - invalidated block versions can't be remembered and trusted - right boundary for split blocks is right_half when diving down, not key_lt - restart update also when block is "invalidated", not just on version mismatch - copy callback in listings to avoid closure destruction bugs too	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	3ebed9a749	Add logging and one more assert	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	eab67a6e8f	Make get_block() wait for updating when unrelated block is found along the path	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	20993d9b7a	Fix a race condition where changed blocks were parsed over existing cached blocks, resulting in a mix of data	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	5cf9b343c0	Simplify code by removing an unneeded "optimisation"	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	79ae0aadcd	Add kv_log_level, print warnings on level 1, trace ops on level 10	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	605afc3583	Fix duplicate keys in listings on parallel updates -- do not rewind key "iterator position"	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	c0681d8242	Implement key suffix to avoid collisions of multiple test workers	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	763e77b4f4	Do not complain on empty first block	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	19426aa4c5	Add JSON output for stress-tester	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	08f586bcec	Print total stats	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	f1cd87473a	Do not send more than op_count operations (fix segfault on finish)	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	1bd8d2da56	Add some more resiliency to serialize()	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	a7396d2baf	Invalidate blocks being updated too	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	e98a38810d	Change new block allocation method: make each writer choose multiple empty PG blocks and place blocks in them	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	28c4324c36	Remove blocks from cache on unsuccessful updates	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	31ec3fa8f5	Allow to track multiple updates per block (it should never happen though)	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	e4fa26f60a	Do not call stop_updating after failed write_new_block and after clear_block (both delete the item)	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	59ae27f9e5	Track versions of parent blocks and recheck if changed during update	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	2c6a301d9b	Fix resume_split condition (key_lt can also be "")	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	01558349f8	Experiment: transform offsets for better sharding	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	36f4717d0d	More post-stress-test fixes - Prevent _split types of new blocks - Stop updating new blocks only after the whole update, otherwise pointers may become invalid - Use recheck_none for updates initially - Use UINT64_MAX as initial block version when postponing ops, otherwise the check fails when the block is initially empty. This for example leads to writing both leaf items & block pointers (which is incorrect) into the root block when starting stress-test with --parallelism 32 - Fix -EINTR comparison	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	babaf2a0ce	Print operation statistics	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	5773f1a375	K/V fixes after stress-test :-) - track block versions correctly - per inode block (128kb) instead of tree block (4kb) - prevent multiple parallel CAS writes of the same inode block - add logging for EILSEQ which means invalid data in the tree - fix get_block updated flag which was true for blocks already in cache and was leading to infinite loops on "unrelated block" errors - apply changes to blocks in cache only after successful writes (using "virtual changes") - do not replace cached block with an older version from disk - recheck "unrelated blocks" (read/update collisions) until data stops changing - track tree path correctly - do not treat split block as parent of its right half - correctly move blocks when finding new empty place on disk - restart updates from the beginning when one of blocks is changed by a parallel update - fix delete using SET opcode and setting key to the empty value instead - prevent changing the same key more than 1 time in parallel - fix listing verification - resume continue_updates in update_find (required because it uses continue_update itself) - add allow_old_cached parameter to get()	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	57222a9f79	Implement K/V DB stress tester	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	61ef000c6e	Evict blocks based on memory limit & block usage	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	7d5e1cc393	Track blocks per level	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	5e7f27a02d	Track block level	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	fd1d8a8520	Experimental B-Tree Vitastor embedded K/V database implementation!	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	c364e14c40	Stop then retry, not retry then stop	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	3ebbfa0428	Fix another rare OSD hang on zeroing out entries on start	2024-03-16 13:24:36 +03:00
Vitaliy Filippov	aa79d1db1c	Fix incorrect "changing scheme" message in modify-pool	2024-03-06 00:41:35 +03:00
Vitaliy Filippov	a1fecb7eff	Move callback away when calling it in cluster_client	2024-03-06 00:41:35 +03:00
Vitaliy Filippov	ff74b19423	Fix rare OSD hang on zeroing out bad entries on start	2024-03-06 00:41:35 +03:00
Vitaliy Filippov	4cf6dceed7	Merge branch 'rel-1.4'	2024-02-29 09:59:01 +03:00
Vitaliy Filippov	4eab26f968	Add documentation and a very basic test for pool management commands	2024-02-28 13:08:04 +03:00
Vitaliy Filippov	86243b7101	Rework & fix pool-create / pool-modify / pool-ls	2024-02-28 13:08:04 +03:00
idelson	dc92851322	vitastor-cli: add commands to control pools: pool-create, pool-ls, pool-modify, pool-rm PR #59 - https://github.com/vitalif/vitastor/pull/58/commits By MIND Software LLC By submitting this pull request, I accept Vitastor CLA	2024-02-28 13:08:04 +03:00