Fix eviction when random_pos selects the end

Implement min/max list_count to make listings during performance test reasonable
Fix and improve parallel allocation
2023-12-01 01:43:03 +03:00 · 2023-12-01 01:17:04 +03:00 · 2023-12-01 01:17:04 +03:00 · 2023-12-01 01:17:04 +03:00 · 2023-12-01 01:17:04 +03:00 · 2023-12-01 01:17:04 +03:00
37 changed files with 3614 additions and 112 deletions
--- a/docs/config/network.en.md
+++ b/docs/config/network.en.md
@@ -20,6 +20,7 @@ between clients, OSDs and etcd.
 - [rdma_max_msg](#rdma_max_msg)
 - [rdma_max_recv](#rdma_max_recv)
 - [rdma_max_send](#rdma_max_send)
+- [rdma_odp](#rdma_odp)
 - [peer_connect_interval](#peer_connect_interval)
 - [peer_connect_timeout](#peer_connect_timeout)
 - [osd_idle_timeout](#osd_idle_timeout)
@@ -68,11 +69,14 @@ but they are not connected to the cluster.
 - Type: string

 RDMA device name to use for Vitastor OSD communications (for example,
-"rocep5s0f0"). Please note that Vitastor RDMA requires Implicit On-Demand
-Paging (Implicit ODP) and Scatter/Gather (SG) support from the RDMA device
-to work. For example, Mellanox ConnectX-3 and older adapters don't have
-Implicit ODP, so they're unsupported by Vitastor. Run `ibv_devinfo -v` as
-root to list available RDMA devices and their features.
+"rocep5s0f0"). Now Vitastor supports all adapters, even ones without
+ODP support, like Mellanox ConnectX-3 and non-Mellanox cards.
+
+Versions up to Vitastor 1.2.0 required ODP which is only present in
+Mellanox ConnectX >= 4. See also [rdma_odp](#rdma_odp).
+
+Run `ibv_devinfo -v` as root to list available RDMA devices and their
+features.

 Remember that you also have to configure your network switches if you use
 RoCE/RoCEv2, otherwise you may experience unstable performance. Refer to
@@ -147,6 +151,28 @@ less than `rdma_max_recv` so the receiving side doesn't run out of buffers.
 Doesn't affect memory usage - additional memory isn't allocated for send
 operations.

+## rdma_odp
+
+- Type: boolean
+- Default: false
+
+Use RDMA with On-Demand Paging. ODP is currently only available on Mellanox
+ConnectX-4 and newer adapters. ODP allows to not register memory explicitly
+for RDMA adapter to be able to use it. This, in turn, allows to skip memory
+copying during sending. One would think this should improve performance, but
+**in reality** RDMA performance with ODP is **drastically** worse. Example
+3-node cluster with 8 NVMe in each node and 2*25 GBit/s ConnectX-6 RDMA network
+without ODP pushes 3950000 read iops, but only 239000 iops with ODP...
+
+This happens because Mellanox ODP implementation seems to be based on
+message retransmissions when the adapter doesn't know about the buffer yet -
+it likely uses standard "RNR retransmissions" (RNR = receiver not ready)
+which is generally slow in RDMA/RoCE networks. Here's a presentation about
+it from ISPASS-2021 conference: https://tkygtr6.github.io/pub/ISPASS21_slides.pdf
+
+ODP support is retained in the code just in case a good ODP implementation
+appears one day.
+
 ## peer_connect_interval

 - Type: seconds
--- a/docs/config/network.ru.md
+++ b/docs/config/network.ru.md
@@ -20,6 +20,7 @@
 - [rdma_max_msg](#rdma_max_msg)
 - [rdma_max_recv](#rdma_max_recv)
 - [rdma_max_send](#rdma_max_send)
+- [rdma_odp](#rdma_odp)
 - [peer_connect_interval](#peer_connect_interval)
 - [peer_connect_timeout](#peer_connect_timeout)
 - [osd_idle_timeout](#osd_idle_timeout)
@@ -71,12 +72,15 @@ RDMA может быть нужно только если у клиентов е
 - Тип: строка

 Название RDMA-устройства для связи с Vitastor OSD (например, "rocep5s0f0").
-Имейте в виду, что поддержка RDMA в Vitastor требует функций устройства
-Implicit On-Demand Paging (Implicit ODP) и Scatter/Gather (SG). Например,
-адаптеры Mellanox ConnectX-3 и более старые не поддерживают Implicit ODP и
-потому не поддерживаются в Vitastor. Запустите `ibv_devinfo -v` от имени
-суперпользователя, чтобы посмотреть список доступных RDMA-устройств, их
-параметры и возможности.
+Сейчас Vitastor поддерживает все модели адаптеров, включая те, у которых
+нет поддержки ODP, то есть вы можете использовать RDMA с ConnectX-3 и
+картами производства не Mellanox.
+
+Версии Vitastor до 1.2.0 включительно требовали ODP, который есть только
+на Mellanox ConnectX 4 и более новых. См. также [rdma_odp](#rdma_odp).
+
+Запустите `ibv_devinfo -v` от имени суперпользователя, чтобы посмотреть
+список доступных RDMA-устройств, их параметры и возможности.

 Обратите внимание, что если вы используете RoCE/RoCEv2, вам также необходимо
 правильно настроить для него коммутаторы, иначе вы можете столкнуться с
@@ -155,6 +159,29 @@ OSD в любом случае согласовывают реальное зн
 Не влияет на потребление памяти - дополнительная память на операции отправки
 не выделяется.

+## rdma_odp
+
+- Тип: булево (да/нет)
+- Значение по умолчанию: false
+
+Использовать RDMA с On-Demand Paging. ODP - функция, доступная пока что
+исключительно на адаптерах Mellanox ConnectX-4 и более новых. ODP позволяет
+не регистрировать память для её использования RDMA-картой. Благодаря этому
+можно не копировать данные при отправке их в сеть и, казалось бы, это должно
+улучшать производительность - но **по факту** получается так, что
+производительность только ухудшается, причём сильно. Пример - на 3-узловом
+кластере с 8 NVMe в каждом узле и сетью 2*25 Гбит/с на чтение с RDMA без ODP
+удаётся снять 3950000 iops, а с ODP - всего 239000 iops...
+
+Это происходит из-за того, что реализация ODP у Mellanox неоптимальная и
+основана на повторной передаче сообщений, когда карте не известен буфер -
+вероятно, на стандартных "RNR retransmission" (RNR = receiver not ready).
+А данные повторные передачи в RDMA/RoCE - всегда очень медленная штука.
+Презентация на эту тему с конференции ISPASS-2021: https://tkygtr6.github.io/pub/ISPASS21_slides.pdf
+
+Возможность использования ODP сохранена в коде на случай, если вдруг в один
+прекрасный день появится хорошая реализация ODP.
+
 ## peer_connect_interval

 - Тип: секунды
--- a/docs/config/src/network.yml
+++ b/docs/config/src/network.yml
@@ -48,11 +48,14 @@
  type: string
  info: |
    RDMA device name to use for Vitastor OSD communications (for example,
-    "rocep5s0f0"). Please note that Vitastor RDMA requires Implicit On-Demand
-    Paging (Implicit ODP) and Scatter/Gather (SG) support from the RDMA device
-    to work. For example, Mellanox ConnectX-3 and older adapters don't have
-    Implicit ODP, so they're unsupported by Vitastor. Run `ibv_devinfo -v` as
-    root to list available RDMA devices and their features.
+    "rocep5s0f0"). Now Vitastor supports all adapters, even ones without
+    ODP support, like Mellanox ConnectX-3 and non-Mellanox cards.
+
+    Versions up to Vitastor 1.2.0 required ODP which is only present in
+    Mellanox ConnectX >= 4. See also [rdma_odp](#rdma_odp).
+
+    Run `ibv_devinfo -v` as root to list available RDMA devices and their
+    features.

    Remember that you also have to configure your network switches if you use
    RoCE/RoCEv2, otherwise you may experience unstable performance. Refer to
@@ -61,12 +64,15 @@
    PFC (Priority Flow Control) and ECN (Explicit Congestion Notification).
  info_ru: |
    Название RDMA-устройства для связи с Vitastor OSD (например, "rocep5s0f0").
-    Имейте в виду, что поддержка RDMA в Vitastor требует функций устройства
-    Implicit On-Demand Paging (Implicit ODP) и Scatter/Gather (SG). Например,
-    адаптеры Mellanox ConnectX-3 и более старые не поддерживают Implicit ODP и
-    потому не поддерживаются в Vitastor. Запустите `ibv_devinfo -v` от имени
-    суперпользователя, чтобы посмотреть список доступных RDMA-устройств, их
-    параметры и возможности.
+    Сейчас Vitastor поддерживает все модели адаптеров, включая те, у которых
+    нет поддержки ODP, то есть вы можете использовать RDMA с ConnectX-3 и
+    картами производства не Mellanox.
+
+    Версии Vitastor до 1.2.0 включительно требовали ODP, который есть только
+    на Mellanox ConnectX 4 и более новых. См. также [rdma_odp](#rdma_odp).
+
+    Запустите `ibv_devinfo -v` от имени суперпользователя, чтобы посмотреть
+    список доступных RDMA-устройств, их параметры и возможности.

    Обратите внимание, что если вы используете RoCE/RoCEv2, вам также необходимо
    правильно настроить для него коммутаторы, иначе вы можете столкнуться с
@@ -160,6 +166,45 @@
    у принимающей стороны в процессе работы не заканчивались буферы на приём.
    Не влияет на потребление памяти - дополнительная память на операции отправки
    не выделяется.
+- name: rdma_odp
+  type: bool
+  default: false
+  online: false
+  info: |
+    Use RDMA with On-Demand Paging. ODP is currently only available on Mellanox
+    ConnectX-4 and newer adapters. ODP allows to not register memory explicitly
+    for RDMA adapter to be able to use it. This, in turn, allows to skip memory
+    copying during sending. One would think this should improve performance, but
+    **in reality** RDMA performance with ODP is **drastically** worse. Example
+    3-node cluster with 8 NVMe in each node and 2*25 GBit/s ConnectX-6 RDMA network
+    without ODP pushes 3950000 read iops, but only 239000 iops with ODP...
+
+    This happens because Mellanox ODP implementation seems to be based on
+    message retransmissions when the adapter doesn't know about the buffer yet -
+    it likely uses standard "RNR retransmissions" (RNR = receiver not ready)
+    which is generally slow in RDMA/RoCE networks. Here's a presentation about
+    it from ISPASS-2021 conference: https://tkygtr6.github.io/pub/ISPASS21_slides.pdf
+
+    ODP support is retained in the code just in case a good ODP implementation
+    appears one day.
+  info_ru: |
+    Использовать RDMA с On-Demand Paging. ODP - функция, доступная пока что
+    исключительно на адаптерах Mellanox ConnectX-4 и более новых. ODP позволяет
+    не регистрировать память для её использования RDMA-картой. Благодаря этому
+    можно не копировать данные при отправке их в сеть и, казалось бы, это должно
+    улучшать производительность - но **по факту** получается так, что
+    производительность только ухудшается, причём сильно. Пример - на 3-узловом
+    кластере с 8 NVMe в каждом узле и сетью 2*25 Гбит/с на чтение с RDMA без ODP
+    удаётся снять 3950000 iops, а с ODP - всего 239000 iops...
+
+    Это происходит из-за того, что реализация ODP у Mellanox неоптимальная и
+    основана на повторной передаче сообщений, когда карте не известен буфер -
+    вероятно, на стандартных "RNR retransmission" (RNR = receiver not ready).
+    А данные повторные передачи в RDMA/RoCE - всегда очень медленная штука.
+    Презентация на эту тему с конференции ISPASS-2021: https://tkygtr6.github.io/pub/ISPASS21_slides.pdf
+
+    Возможность использования ODP сохранена в коде на случай, если вдруг в один
+    прекрасный день появится хорошая реализация ODP.
 - name: peer_connect_interval
  type: sec
  min: 1
--- a/docs/usage/qemu.en.md
+++ b/docs/usage/qemu.en.md
@@ -127,19 +127,46 @@ Linux kernel, starting with version 5.15, supports a new interface for attaching
 to the host - VDUSE (vDPA Device in Userspace). QEMU, starting with 7.2, has support for
 exporting QEMU block devices over this protocol using qemu-storage-daemon.

-VDUSE has the same problem as other FUSE-like interfaces in Linux: if a userspace process hangs,
-for example, if it loses connectivity with Vitastor cluster - active processes doing I/O may
-hang in the D state (uninterruptible sleep) and you won't be able to kill them even with kill -9.
-In this case reboot will be the only way to remove VDUSE devices from system.
+VDUSE is currently the best interface to attach Vitastor disks as kernel devices because:
+- It avoids data copies and thus achieves much better performance than [NBD](nbd.en.md)
+- It doesn't have NBD timeout problem - the device doesn't die if an operation executes for too long
+- It doesn't have hung device problem - if the userspace process dies it can be restarted (!)
+  and block device will continue operation
+- It doesn't seem to have the device number limit

-On the other hand, VDUSE is faster than [NBD](nbd.en.md), so you may prefer to use it if
-performance is important for you. Approximate performance numbers:
-direct fio benchmark - 115000 iops, NBD - 60000 iops, VDUSE - 90000 iops.
+Example performance comparison:
+
+|                      | direct fio  | NBD         | VDUSE       |
+|----------------------|-------------|-------------|-------------|
+| linear write         | 3.85 GB/s   | 1.12 GB/s   | 3.85 GB/s   |
+| 4k random write Q128 | 240000 iops | 120000 iops | 178000 iops |
+| 4k random write Q1   | 9500 iops   | 7620 iops   | 7640 iops   |
+| linear read          | 4.3 GB/s    | 1.8 GB/s    | 2.85 GB/s   |
+| 4k random read Q128  | 287000 iops | 140000 iops | 189000 iops |
+| 4k random read Q1    | 9600 iops   | 7640 iops   | 7780 iops   |

 To try VDUSE you need at least Linux 5.15, built with VDUSE support
-(CONFIG_VIRTIO_VDPA=m and CONFIG_VDPA_USER=m). Debian Linux kernels have these options
-disabled by now, so if you want to try it on Debian, use a kernel from Ubuntu
-[kernel-ppa/mainline](https://kernel.ubuntu.com/~kernel-ppa/mainline/) or Proxmox.
+(CONFIG_VIRTIO_VDPA=m, CONFIG_VDPA_USER=m, CONFIG_VIRTIO_VDPA=m).
+
+Debian Linux kernels have these options disabled by now, so if you want to try it on Debian,
+use a kernel from Ubuntu [kernel-ppa/mainline](https://kernel.ubuntu.com/~kernel-ppa/mainline/), Proxmox,
+or build modules for Debian kernel manually:
+
+```
+mkdir build
+cd build
+apt-get install linux-headers-`uname -r`
+apt-get build-dep linux-image-`uname -r`-unsigned
+apt-get source linux-image-`uname -r`-unsigned
+cd linux*/drivers/vdpa
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules modules_install
+cat Module.symvers >> /lib/modules/`uname -r`/build/Module.symvers
+cd ../virtio
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules modules_install
+depmod -a
+```
+
+You also need `vdpa` tool from the `iproute2` package.

 Commands to attach Vitastor image as a VDUSE device:

@@ -152,7 +179,7 @@ qemu-storage-daemon --daemonize --blockdev '{"node-name":"test1","driver":"vitas
 vdpa dev add name test1 mgmtdev vduse
 ```

-After running these commands /dev/vda device will appear in the system and you'll be able to
+After running these commands, `/dev/vda` device will appear in the system and you'll be able to
 use it as a normal disk.

 To remove the device:
--- a/docs/usage/qemu.ru.md
+++ b/docs/usage/qemu.ru.md
@@ -129,19 +129,47 @@ qemu-system-x86_64 -enable-kvm -m 2048 -M accel=kvm,memory-backend=mem \
 к системе - VDUSE (vDPA Device in Userspace), а в QEMU, начиная с версии 7.2, есть поддержка
 экспорта блочных устройств QEMU по этому протоколу через qemu-storage-daemon.

-VDUSE страдает общей проблемой FUSE-подобных интерфейсов в Linux: если пользовательский процесс
-подвиснет, например, если будет потеряна связь с кластером Vitastor - читающие/пишущие в кластер
-процессы могут "залипнуть" в состоянии D (непрерываемый сон) и их будет невозможно убить даже
-через kill -9. В этом случае удалить из системы устройство можно только перезагрузившись.
+VDUSE - на данный момент лучший интерфейс для подключения дисков Vitastor в виде блочных
+устройств на уровне ядра, ибо:
+- VDUSE не копирует данные и поэтому достигает значительно лучшей производительности, чем [NBD](nbd.ru.md)
+- Также оно не имеет проблемы NBD-таймаута - устройство не умирает, если операция выполняется слишком долго
+- Также оно не имеет проблемы подвисающих устройств - если процесс-обработчик умирает, его можно
+  перезапустить (!) и блочное устройство продолжит работать
+- По-видимому, у него нет предела числа подключаемых в систему устройств

-С другой стороны, VDUSE быстрее по сравнению с [NBD](nbd.ru.md), поэтому его может
-быть предпочтительно использовать там, где производительность важнее. Порядок показателей:
-прямое тестирование через fio - 115000 iops, NBD - 60000 iops, VDUSE - 90000 iops.
+Пример сравнения производительности:

-Чтобы использовать VDUSE, вам нужно ядро Linux версии хотя бы 5.15, собранное с поддержкой
-VDUSE (CONFIG_VIRTIO_VDPA=m и CONFIG_VDPA_USER=m). В ядрах в Debian Linux поддержка пока
-отключена - если хотите попробовать эту функцию на Debian, поставьте ядро из Ubuntu
-[kernel-ppa/mainline](https://kernel.ubuntu.com/~kernel-ppa/mainline/) или из Proxmox.
+|                          | Прямой fio  | NBD         | VDUSE       |
+|--------------------------|-------------|-------------|-------------|
+| линейная запись          | 3.85 GB/s   | 1.12 GB/s   | 3.85 GB/s   |
+| 4k случайная запись Q128 | 240000 iops | 120000 iops | 178000 iops |
+| 4k случайная запись Q1   | 9500 iops   | 7620 iops   | 7640 iops   |
+| линейное чтение          | 4.3 GB/s    | 1.8 GB/s    | 2.85 GB/s   |
+| 4k случайное чтение Q128 | 287000 iops | 140000 iops | 189000 iops |
+| 4k случайное чтение Q1   | 9600 iops   | 7640 iops   | 7780 iops   |
+
+Чтобы попробовать VDUSE, вам нужно ядро Linux как минимум версии 5.15, собранное с поддержкой
+VDUSE (CONFIG_VIRTIO_VDPA=m, CONFIG_VDPA_USER=m, CONFIG_VIRTIO_VDPA=m).
+
+В ядрах в Debian Linux поддержка пока отключена по умолчанию, так что чтобы попробовать VDUSE
+на Debian, поставьте ядро из Ubuntu [kernel-ppa/mainline](https://kernel.ubuntu.com/~kernel-ppa/mainline/),
+из Proxmox или соберите модули для ядра Debian вручную:
+
+```
+mkdir build
+cd build
+apt-get install linux-headers-`uname -r`
+apt-get build-dep linux-image-`uname -r`-unsigned
+apt-get source linux-image-`uname -r`-unsigned
+cd linux*/drivers/vdpa
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules modules_install
+cat Module.symvers >> /lib/modules/`uname -r`/build/Module.symvers
+cd ../virtio
+make -C /lib/modules/`uname -r`/build M=$PWD CONFIG_VDPA=m CONFIG_VDPA_USER=m CONFIG_VIRTIO_VDPA=m -j8 modules modules_install
+depmod -a
+```
+
+Также вам понадобится консольная утилита `vdpa` из пакета `iproute2`.

 Команды для подключения виртуального диска через VDUSE:

@@ -154,7 +182,7 @@ qemu-storage-daemon --daemonize --blockdev '{"node-name":"test1","driver":"vitas
 vdpa dev add name test1 mgmtdev vduse
 ```

-После этого в системе появится устройство /dev/vda, которое можно будет использовать как
+После этого в системе появится устройство `/dev/vda`, которое можно будет использовать как
 обычный диск.

 Для удаления устройства из системы:
--- a/mon/mon.js
+++ b/mon/mon.js
@@ -1498,7 +1498,7 @@ class Mon
    {
        const zero_stats = { op: { bps: 0n, iops: 0n, lat: 0n }, subop: { iops: 0n, lat: 0n }, recovery: { bps: 0n, iops: 0n } };
        const diff = { op_stats: {}, subop_stats: {}, recovery_stats: {}, inode_stats: {} };
-        if (!st || !st.time || !prev || prev.time >= st.time)
+        if (!st || !st.time || !prev || !prev.time || prev.time >= st.time)
        {
            return prev_diff || diff;
        }
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -17,9 +17,10 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
 endif()

 add_definitions(-DVERSION="1.2.0")
-add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -I ${CMAKE_SOURCE_DIR}/src)
+add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -fno-omit-frame-pointer -I ${CMAKE_SOURCE_DIR}/src)
+add_link_options(-fno-omit-frame-pointer)
 if (${WITH_ASAN})
-	add_definitions(-fsanitize=address -fno-omit-frame-pointer)
+	add_definitions(-fsanitize=address)
 	add_link_options(-fsanitize=address -fno-omit-frame-pointer)
 endif (${WITH_ASAN})

@@ -180,6 +181,25 @@ target_link_libraries(vitastor-nbd
 	vitastor_client
 )

+# vitastor-kv
+add_executable(vitastor-kv
+	kv_cli.cpp
+	kv_db.cpp
+	kv_db.h
+)
+target_link_libraries(vitastor-kv
+	vitastor_client
+)
+
+add_executable(vitastor-kv-stress
+	kv_stress.cpp
+	kv_db.cpp
+	kv_db.h
+)
+target_link_libraries(vitastor-kv-stress
+	vitastor_client
+)
+
 # vitastor-nfs
 add_executable(vitastor-nfs
 	nfs_proxy.cpp
--- a/src/blockstore_impl.h
+++ b/src/blockstore_impl.h
@@ -274,7 +274,7 @@ class blockstore_impl_t
    blockstore_dirty_db_t dirty_db;
    std::vector<blockstore_op_t*> submit_queue;
    std::vector<obj_ver_id> unsynced_big_writes, unsynced_small_writes;
-    int unsynced_big_write_count = 0;
+    int unsynced_big_write_count = 0, unstable_unsynced = 0;
    int unsynced_queued_ops = 0;
    allocator *data_alloc = NULL;
    uint8_t *zero_object;
--- a/src/blockstore_journal.cpp
+++ b/src/blockstore_journal.cpp
@@ -145,6 +145,7 @@ journal_entry* prefill_single_journal_entry(journal_t & journal, uint16_t type,
        journal.sector_info[journal.cur_sector].offset = journal.next_free;
        journal.in_sector_pos = 0;
        journal.next_free = (journal.next_free+journal.block_size) < journal.len ? journal.next_free + journal.block_size : journal.block_size;
+        assert(journal.next_free != journal.used_start);
        memset(journal.inmemory
            ? (uint8_t*)journal.buffer + journal.sector_info[journal.cur_sector].offset
            : (uint8_t*)journal.sector_buf + journal.block_size*journal.cur_sector, 0, journal.block_size);
--- a/src/blockstore_journal.h
+++ b/src/blockstore_journal.h
@@ -13,12 +13,6 @@
 #define JOURNAL_BUFFER_SIZE 4*1024*1024
 #define JOURNAL_ENTRY_HEADER_SIZE 16

-// We reserve some extra space for future stabilize requests during writes
-// FIXME: This value should be dynamic i.e. Blockstore ideally shouldn't allow
-// writing more than can be stabilized afterwards
-#define JOURNAL_STABILIZE_RESERVATION 65536
-#define JOURNAL_INSTANT_RESERVATION 131072
-
 // Journal entries
 // Journal entries are linked to each other by their crc32 value
 // The journal is almost a blockchain, because object versions constantly increase
--- a/src/blockstore_sync.cpp
+++ b/src/blockstore_sync.cpp
@@ -86,14 +86,15 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
                auto & dirty_entry = dirty_db.at(sbw);
                uint64_t dyn_size = dsk.dirty_dyn_size(dirty_entry.offset, dirty_entry.len);
                if (!space_check.check_available(op, 1, sizeof(journal_entry_big_write) + dyn_size,
-                    left == 0 ? JOURNAL_STABILIZE_RESERVATION : 0))
+                    (unstable_writes.size()+unstable_unsynced)*journal.block_size))
                {
                    return 0;
                }
            }
        }
        else if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(),
-            sizeof(journal_entry_big_write) + dsk.clean_entry_bitmap_size, JOURNAL_STABILIZE_RESERVATION))
+            sizeof(journal_entry_big_write) + dsk.clean_entry_bitmap_size,
+            (unstable_writes.size()+unstable_unsynced)*journal.block_size))
        {
            return 0;
        }
@@ -184,6 +185,11 @@ void blockstore_impl_t::ack_sync(blockstore_op_t *op)
        {
            mark_stable(dirty_it->first);
        }
+        else
+        {
+            unstable_unsynced--;
+            assert(unstable_unsynced >= 0);
+        }
        dirty_it++;
        while (dirty_it != dirty_db.end() && dirty_it->first.oid == it->oid)
        {
@@ -214,6 +220,11 @@ void blockstore_impl_t::ack_sync(blockstore_op_t *op)
            {
                mark_stable(*it);
            }
+            else
+            {
+                unstable_unsynced--;
+                assert(unstable_unsynced >= 0);
+            }
        }
    }
    op->retval = 0;
--- a/src/blockstore_write.cpp
+++ b/src/blockstore_write.cpp
@@ -320,7 +320,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        blockstore_journal_check_t space_check(this);
        if (!space_check.check_available(op, unsynced_big_write_count + 1,
            sizeof(journal_entry_big_write) + dsk.clean_dyn_size,
-            (dirty_it->second.state & BS_ST_INSTANT) ? JOURNAL_INSTANT_RESERVATION : JOURNAL_STABILIZE_RESERVATION))
+            (unstable_writes.size()+unstable_unsynced)*journal.block_size))
        {
            return 0;
        }
@@ -386,6 +386,10 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
            sqe, dsk.data_fd, PRIV(op)->iov_zerofill, vcnt, dsk.data_offset + (loc << dsk.block_order) + op->offset - stripe_offset
        );
        PRIV(op)->pending_ops = 1;
+        if (immediate_commit != IMMEDIATE_ALL && !(dirty_it->second.state & BS_ST_INSTANT))
+        {
+            unstable_unsynced++;
+        }
        if (immediate_commit != IMMEDIATE_ALL)
        {
            // Increase the counter, but don't save into unsynced_writes yet (can't sync until the write is finished)
@@ -408,7 +412,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
                sizeof(journal_entry_big_write) + dsk.clean_dyn_size, 0)
            || !space_check.check_available(op, 1,
                sizeof(journal_entry_small_write) + dyn_size,
-                op->len + ((dirty_it->second.state & BS_ST_INSTANT) ? JOURNAL_INSTANT_RESERVATION : JOURNAL_STABILIZE_RESERVATION)))
+                (unstable_writes.size()+unstable_unsynced)*journal.block_size))
        {
            return 0;
        }
@@ -499,6 +503,11 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        if (journal.next_free >= journal.len)
        {
            journal.next_free = dsk.journal_block_size;
+            assert(journal.next_free != journal.used_start);
+        }
+        if (immediate_commit == IMMEDIATE_NONE && !(dirty_it->second.state & BS_ST_INSTANT))
+        {
+            unstable_unsynced++;
        }
        if (!PRIV(op)->pending_ops)
        {
@@ -538,7 +547,7 @@ resume_2:
        uint64_t dyn_size = dsk.dirty_dyn_size(op->offset, op->len);
        blockstore_journal_check_t space_check(this);
        if (!space_check.check_available(op, 1, sizeof(journal_entry_big_write) + dyn_size,
-            ((dirty_it->second.state & BS_ST_INSTANT) ? JOURNAL_INSTANT_RESERVATION : JOURNAL_STABILIZE_RESERVATION)))
+            (unstable_writes.size()+unstable_unsynced)*journal.block_size))
        {
            return 0;
        }
@@ -582,14 +591,20 @@ resume_4:
 #endif
        bool is_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE;
        bool imm = is_big ? (immediate_commit == IMMEDIATE_ALL) : (immediate_commit != IMMEDIATE_NONE);
+        bool is_instant = ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_DELETE || (dirty_it->second.state & BS_ST_INSTANT));
        if (imm)
        {
            auto & unstab = unstable_writes[op->oid];
            unstab = unstab < op->version ? op->version : unstab;
        }
+        else if (!is_instant)
+        {
+            unstable_unsynced--;
+            assert(unstable_unsynced >= 0);
+        }
        dirty_it->second.state = (dirty_it->second.state & ~BS_ST_WORKFLOW_MASK)
            | (imm ? BS_ST_SYNCED : BS_ST_WRITTEN);
-        if (imm && ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_DELETE || (dirty_it->second.state & BS_ST_INSTANT)))
+        if (imm && is_instant)
        {
            // Deletions and 'instant' operations are treated as immediately stable
            mark_stable(dirty_it->first);
@@ -735,7 +750,7 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
    });
    assert(dirty_it != dirty_db.end());
    blockstore_journal_check_t space_check(this);
-    if (!space_check.check_available(op, 1, sizeof(journal_entry_del), JOURNAL_INSTANT_RESERVATION))
+    if (!space_check.check_available(op, 1, sizeof(journal_entry_del), (unstable_writes.size()+unstable_unsynced)*journal.block_size))
    {
        return 0;
    }
--- a/src/cli.cpp
+++ b/src/cli.cpp
@@ -331,7 +331,7 @@ static int run(cli_tool_t *p, json11::Json::object cfg)
    {
        // Create client
        json11::Json cfg_j = cfg;
-        p->ringloop = new ring_loop_t(512);
+        p->ringloop = new ring_loop_t(RINGLOOP_DEFAULT_SIZE);
        p->epmgr = new epoll_manager_t(p->ringloop);
        p->cli = new cluster_client_t(p->ringloop, p->epmgr->tfd, cfg_j);
        // Smaller timeout by default for more interactiveness
--- a/src/cluster_client.cpp
+++ b/src/cluster_client.cpp
@@ -6,7 +6,7 @@
 #include "cluster_client_impl.h"
 #include "http_client.h" // json_is_true

-cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config)
+cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json config)
 {
    wb = new writeback_cache_t();

@@ -532,7 +532,7 @@ void cluster_client_t::execute_internal(cluster_op_t *op)
        return;
    }
    if (op->opcode == OSD_OP_WRITE && enable_writeback && !(op->flags & OP_FLUSH_BUFFER) &&
-        !op->version /* FIXME no CAS writeback */)
+        !op->version /* no CAS writeback */)
    {
        if (wb->writebacks_active >= client_max_writeback_iodepth)
        {
@@ -553,7 +553,7 @@ void cluster_client_t::execute_internal(cluster_op_t *op)
    }
    if (op->opcode == OSD_OP_WRITE && !(op->flags & OP_IMMEDIATE_COMMIT))
    {
-        if (!(op->flags & OP_FLUSH_BUFFER))
+        if (!(op->flags & OP_FLUSH_BUFFER) && !op->version /* no CAS write-repeat */)
        {
            wb->copy_write(op, CACHE_WRITTEN);
        }
@@ -1152,7 +1152,7 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
                osd_op_names[part->op.req.hdr.opcode], part->osd_num, part->op.reply.hdr.retval, expected
            );
        }
-        else
+        else if (log_level > 0)
        {
            fprintf(
                stderr, "%s operation failed on OSD %lu: retval=%ld (expected %d)\n",
--- a/src/cluster_client.h
+++ b/src/cluster_client.h
@@ -121,7 +121,7 @@ public:
    json11::Json::object cli_config, file_config, etcd_global_config;
    json11::Json::object config;

-    cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config);
+    cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json config);
    ~cluster_client_t();
    void execute(cluster_op_t *op);
    void execute_raw(osd_num_t osd_num, osd_op_t *op);
--- a/src/disk_tool.cpp
+++ b/src/disk_tool.cpp
@@ -229,7 +229,7 @@ int main(int argc, char *argv[])
        {
            self.options["allow_data_loss"] = "1";
        }
-        else if (argv[i][0] == '-' && argv[i][1] == '-')
+        else if (argv[i][0] == '-' && argv[i][1] == '-' && i < argc-1)
        {
            char *key = argv[i]+2;
            self.options[key] = argv[++i];
--- a/src/disk_tool_journal.cpp
+++ b/src/disk_tool_journal.cpp
@@ -320,7 +320,7 @@ void disk_tool_t::dump_journal_entry(int num, journal_entry *je, bool json)
        if (journal_calc_data_pos != sw.data_offset)
        {
            printf(json ? ",\"bad_loc\":true,\"calc_loc\":\"0x%lx\""
-                : " (mismatched, calculated = %lu)", journal_pos);
+                : " (mismatched, calculated = %08lx)", journal_pos);
        }
        uint32_t data_csum_size = (!je_start.csum_block_size
            ? 0
--- a/src/disk_tool_resize.cpp
+++ b/src/disk_tool_resize.cpp
@@ -245,7 +245,7 @@ int disk_tool_t::resize_copy_data()
    {
        iodepth = 32;
    }
-    ringloop = new ring_loop_t(iodepth < 512 ? 512 : iodepth);
+    ringloop = new ring_loop_t(iodepth < RINGLOOP_DEFAULT_SIZE ? RINGLOOP_DEFAULT_SIZE : iodepth);
    dsk.data_fd = open(dsk.data_device.c_str(), O_DIRECT|O_RDWR);
    if (dsk.data_fd < 0)
    {
--- a/src/fio_engine.cpp
+++ b/src/fio_engine.cpp
@@ -130,7 +130,7 @@ static int bs_init(struct thread_data *td)
                config[p.first] = p.second.dump();
        }
    }
-    bsd->ringloop = new ring_loop_t(512);
+    bsd->ringloop = new ring_loop_t(RINGLOOP_DEFAULT_SIZE);
    bsd->epmgr = new epoll_manager_t(bsd->ringloop);
    bsd->bs = new blockstore_t(config, bsd->ringloop, bsd->epmgr->tfd);
    while (1)
--- a/src/kv_cli.cpp
+++ b/src/kv_cli.cpp
@@ -0,0 +1,401 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// Vitastor shared key/value database test CLI
+
+#define _XOPEN_SOURCE
+#include <limits.h>
+
+#include <netinet/tcp.h>
+#include <sys/epoll.h>
+#include <unistd.h>
+#include <fcntl.h>
+//#include <signal.h>
+
+#include "epoll_manager.h"
+#include "str_util.h"
+#include "kv_db.h"
+
+const char *exe_name = NULL;
+
+class kv_cli_t
+{
+public:
+    kv_dbw_t *db = NULL;
+    ring_loop_t *ringloop = NULL;
+    epoll_manager_t *epmgr = NULL;
+    cluster_client_t *cli = NULL;
+    bool interactive = false;
+    int in_progress = 0;
+    char *cur_cmd = NULL;
+    int cur_cmd_size = 0, cur_cmd_alloc = 0;
+    bool finished = false, eof = false;
+    json11::Json::object cfg;
+
+    ~kv_cli_t();
+
+    static json11::Json::object parse_args(int narg, const char *args[]);
+    void run(const json11::Json::object & cfg);
+    void read_cmd();
+    void next_cmd();
+    void handle_cmd(const std::string & cmd, std::function<void()> cb);
+};
+
+kv_cli_t::~kv_cli_t()
+{
+    if (cur_cmd)
+    {
+        free(cur_cmd);
+        cur_cmd = NULL;
+    }
+    cur_cmd_alloc = 0;
+    if (db)
+        delete db;
+    if (cli)
+    {
+        cli->flush();
+        delete cli;
+    }
+    if (epmgr)
+        delete epmgr;
+    if (ringloop)
+        delete ringloop;
+}
+
+json11::Json::object kv_cli_t::parse_args(int narg, const char *args[])
+{
+    json11::Json::object cfg;
+    for (int i = 1; i < narg; i++)
+    {
+        if (!strcmp(args[i], "-h") || !strcmp(args[i], "--help"))
+        {
+            printf(
+                "Vitastor Key/Value CLI\n"
+                "(c) Vitaliy Filippov, 2023+ (VNPL-1.1)\n"
+                "\n"
+                "USAGE: %s [--etcd_address ADDR] [OTHER OPTIONS]\n",
+                exe_name
+            );
+            exit(0);
+        }
+        else if (args[i][0] == '-' && args[i][1] == '-')
+        {
+            const char *opt = args[i]+2;
+            cfg[opt] = !strcmp(opt, "json") || i == narg-1 ? "1" : args[++i];
+        }
+    }
+    return cfg;
+}
+
+void kv_cli_t::run(const json11::Json::object & cfg)
+{
+    // Create client
+    ringloop = new ring_loop_t(512);
+    epmgr = new epoll_manager_t(ringloop);
+    cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
+    db = new kv_dbw_t(cli);
+    // Load image metadata
+    while (!cli->is_ready())
+    {
+        ringloop->loop();
+        if (cli->is_ready())
+            break;
+        ringloop->wait();
+    }
+    // Run
+    fcntl(0, F_SETFL, fcntl(0, F_GETFL, 0) | O_NONBLOCK);
+    try
+    {
+        epmgr->tfd->set_fd_handler(0, false, [this](int fd, int events)
+        {
+            if (events & EPOLLIN)
+            {
+                read_cmd();
+            }
+            if (events & EPOLLRDHUP)
+            {
+                epmgr->tfd->set_fd_handler(0, false, NULL);
+                finished = true;
+            }
+        });
+        interactive = true;
+        printf("> ");
+    }
+    catch (std::exception & e)
+    {
+        // Can't add to epoll, STDIN is probably a file
+        read_cmd();
+    }
+    while (!finished)
+    {
+        ringloop->loop();
+        if (!finished)
+            ringloop->wait();
+    }
+    // Destroy the client
+    delete db;
+    db = NULL;
+    cli->flush();
+    delete cli;
+    delete epmgr;
+    delete ringloop;
+    cli = NULL;
+    epmgr = NULL;
+    ringloop = NULL;
+}
+
+void kv_cli_t::read_cmd()
+{
+    if (!cur_cmd_alloc)
+    {
+        cur_cmd_alloc = 65536;
+        cur_cmd = (char*)malloc_or_die(cur_cmd_alloc);
+    }
+    while (cur_cmd_size < cur_cmd_alloc)
+    {
+        int r = read(0, cur_cmd+cur_cmd_size, cur_cmd_alloc-cur_cmd_size);
+        if (r < 0 && errno != EAGAIN)
+            fprintf(stderr, "Error reading from stdin: %s\n", strerror(errno));
+        if (r > 0)
+            cur_cmd_size += r;
+        if (r == 0)
+            eof = true;
+        if (r <= 0)
+            break;
+    }
+    next_cmd();
+}
+
+void kv_cli_t::next_cmd()
+{
+    if (in_progress > 0)
+    {
+        return;
+    }
+    int pos = 0;
+    for (; pos < cur_cmd_size; pos++)
+    {
+        if (cur_cmd[pos] == '\n' || cur_cmd[pos] == '\r')
+        {
+            auto cmd = trim(std::string(cur_cmd, pos));
+            pos++;
+            memmove(cur_cmd, cur_cmd+pos, cur_cmd_size-pos);
+            cur_cmd_size -= pos;
+            in_progress++;
+            handle_cmd(cmd, [this]()
+            {
+                in_progress--;
+                if (interactive)
+                    printf("> ");
+                next_cmd();
+                if (!in_progress)
+                    read_cmd();
+            });
+            break;
+        }
+    }
+    if (eof && !in_progress)
+    {
+        finished = true;
+    }
+}
+
+void kv_cli_t::handle_cmd(const std::string & cmd, std::function<void()> cb)
+{
+    if (cmd == "")
+    {
+        cb();
+        return;
+    }
+    auto pos = cmd.find_first_of(" \t");
+    if (pos != std::string::npos)
+    {
+        while (pos < cmd.size()-1 && (cmd[pos+1] == ' ' || cmd[pos+1] == '\t'))
+            pos++;
+    }
+    auto opname = strtolower(pos == std::string::npos ? cmd : cmd.substr(0, pos));
+    if (opname == "open")
+    {
+        uint64_t pool_id = 0;
+        inode_t inode_id = 0;
+        uint32_t kv_block_size = 0;
+        int scanned = sscanf(cmd.c_str() + pos+1, "%lu %lu %u", &pool_id, &inode_id, &kv_block_size);
+        if (scanned == 2)
+        {
+            kv_block_size = 4096;
+        }
+        if (scanned < 2 || !pool_id || !inode_id || !kv_block_size || (kv_block_size & (kv_block_size-1)) != 0)
+        {
+            fprintf(stderr, "Usage: open <pool_id> <inode_id> [block_size]. Block size must be a power of 2. Default is 4096.\n");
+            cb();
+            return;
+        }
+        cfg["kv_block_size"] = (uint64_t)kv_block_size;
+        db->open(INODE_WITH_POOL(pool_id, inode_id), cfg, [=](int res)
+        {
+            if (res < 0)
+                fprintf(stderr, "Error opening index: %s (code %d)\n", strerror(-res), res);
+            else
+                printf("Index opened. Current size: %lu bytes\n", db->get_size());
+            cb();
+        });
+    }
+    else if (opname == "config")
+    {
+        auto pos2 = cmd.find_first_of(" \t", pos+1);
+        if (pos2 == std::string::npos)
+        {
+            fprintf(stderr, "Usage: config <property> <value>\n");
+            cb();
+            return;
+        }
+        auto key = trim(cmd.substr(pos+1, pos2-pos-1));
+        auto value = parse_size(trim(cmd.substr(pos2+1)));
+        if (key != "kv_memory_limit" &&
+            key != "kv_allocate_blocks" &&
+            key != "kv_evict_max_misses" &&
+            key != "kv_evict_attempts_per_level" &&
+            key != "kv_evict_unused_age" &&
+            key != "kv_log_level")
+        {
+            fprintf(
+                stderr, "Allowed properties: kv_memory_limit, kv_allocate_blocks,"
+                " kv_evict_max_misses, kv_evict_attempts_per_level, kv_evict_unused_age, kv_log_level\n"
+            );
+        }
+        else
+        {
+            cfg[key] = value;
+            db->set_config(cfg);
+        }
+        cb();
+    }
+    else if (opname == "get" || opname == "set" || opname == "del")
+    {
+        if (opname == "get" || opname == "del")
+        {
+            if (pos == std::string::npos)
+            {
+                fprintf(stderr, "Usage: %s <key>\n", opname.c_str());
+                cb();
+                return;
+            }
+            auto key = trim(cmd.substr(pos+1));
+            if (opname == "get")
+            {
+                db->get(key, [this, cb](int res, const std::string & value)
+                {
+                    if (res < 0)
+                        fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
+                    else
+                    {
+                        write(1, value.c_str(), value.size());
+                        write(1, "\n", 1);
+                    }
+                    cb();
+                });
+            }
+            else
+            {
+                db->del(key, [this, cb](int res)
+                {
+                    if (res < 0)
+                        fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
+                    else
+                        printf("OK\n");
+                    cb();
+                });
+            }
+        }
+        else
+        {
+            auto pos2 = cmd.find_first_of(" \t", pos+1);
+            if (pos2 == std::string::npos)
+            {
+                fprintf(stderr, "Usage: set <key> <value>\n");
+                cb();
+                return;
+            }
+            auto key = trim(cmd.substr(pos+1, pos2-pos-1));
+            auto value = trim(cmd.substr(pos2+1));
+            db->set(key, value, [this, cb](int res)
+            {
+                if (res < 0)
+                    fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
+                else
+                    printf("OK\n");
+                cb();
+            });
+        }
+    }
+    else if (opname == "list")
+    {
+        std::string start, end;
+        if (pos != std::string::npos)
+        {
+            auto pos2 = cmd.find_first_of(" \t", pos+1);
+            if (pos2 != std::string::npos)
+            {
+                start = trim(cmd.substr(pos+1, pos2-pos-1));
+                end = trim(cmd.substr(pos2+1));
+            }
+            else
+            {
+                start = trim(cmd.substr(pos+1));
+            }
+        }
+        void *handle = db->list_start(start);
+        db->list_next(handle, [=](int res, const std::string & key, const std::string & value)
+        {
+            if (res < 0)
+            {
+                if (res != -ENOENT)
+                {
+                    fprintf(stderr, "Error: %s (code %d)\n", strerror(-res), res);
+                }
+                db->list_close(handle);
+                cb();
+            }
+            else
+            {
+                printf("%s = %s\n", key.c_str(), value.c_str());
+                db->list_next(handle, NULL);
+            }
+        });
+    }
+    else if (opname == "close")
+    {
+        db->close([=]()
+        {
+            printf("Index closed\n");
+            cb();
+        });
+    }
+    else if (opname == "quit" || opname == "q")
+    {
+        ::close(0);
+        finished = true;
+    }
+    else
+    {
+        fprintf(
+            stderr, "Unknown operation: %s. Supported operations:\n"
+            "open <pool_id> <inode_id> [block_size]\n"
+            "config <property> <value>\n"
+            "get <key>\nset <key> <value>\ndel <key>\nlist [<start> [end]]\n"
+            "close\nquit\n", opname.c_str()
+        );
+        cb();
+    }
+}
+
+int main(int narg, const char *args[])
+{
+    setvbuf(stdout, NULL, _IONBF, 0);
+    setvbuf(stderr, NULL, _IONBF, 0);
+    exe_name = args[0];
+    kv_cli_t *p = new kv_cli_t();
+    p->run(kv_cli_t::parse_args(narg, args));
+    delete p;
+    return 0;
+}
--- a/src/kv_db.cpp
+++ b/src/kv_db.cpp
--- a/src/kv_db.h
+++ b/src/kv_db.h
@@ -0,0 +1,36 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// Vitastor shared key/value database
+// Parallel optimistic B-Tree O:-)
+
+#pragma once
+
+#include "cluster_client.h"
+
+struct kv_db_t;
+
+struct kv_dbw_t
+{
+    kv_dbw_t(cluster_client_t *cli);
+    ~kv_dbw_t();
+
+    void open(inode_t inode_id, json11::Json cfg, std::function<void(int)> cb);
+    void set_config(json11::Json cfg);
+    void close(std::function<void()> cb);
+
+    uint64_t get_size();
+
+    void get(const std::string & key, std::function<void(int res, const std::string & value)> cb,
+        bool allow_old_cached = false);
+    void set(const std::string & key, const std::string & value, std::function<void(int res)> cb,
+        std::function<bool(int res, const std::string & value)> cas_compare = NULL);
+    void del(const std::string & key, std::function<void(int res)> cb,
+        std::function<bool(int res, const std::string & value)> cas_compare = NULL);
+
+    void* list_start(const std::string & start);
+    void list_next(void *handle, std::function<void(int res, const std::string & key, const std::string & value)> cb);
+    void list_close(void *handle);
+
+    kv_db_t *db;
+};
--- a/src/kv_stress.cpp
+++ b/src/kv_stress.cpp
@@ -0,0 +1,697 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 (see README.md for details)
+//
+// Vitastor shared key/value database stress tester / benchmark
+
+#define _XOPEN_SOURCE
+#include <limits.h>
+
+#include <netinet/tcp.h>
+#include <sys/epoll.h>
+#include <unistd.h>
+#include <fcntl.h>
+//#include <signal.h>
+
+#include "epoll_manager.h"
+#include "str_util.h"
+#include "kv_db.h"
+
+const char *exe_name = NULL;
+
+struct kv_test_listing_t
+{
+    uint64_t count = 0, done = 0;
+    void *handle = NULL;
+    std::string next_after;
+    std::set<std::string> inflights;
+    timespec tv_begin;
+    bool error = false;
+};
+
+struct kv_test_lat_t
+{
+    const char *name = NULL;
+    uint64_t usec = 0, count = 0;
+};
+
+struct kv_test_stat_t
+{
+    kv_test_lat_t get, add, update, del, list;
+    uint64_t list_keys = 0;
+};
+
+class kv_test_t
+{
+public:
+    // Config
+    json11::Json::object kv_cfg;
+    std::string key_prefix, key_suffix;
+    uint64_t inode_id = 0;
+    uint64_t op_count = 1000000;
+    uint64_t runtime_sec = 0;
+    uint64_t parallelism = 4;
+    uint64_t reopen_prob = 1;
+    uint64_t get_prob = 30000;
+    uint64_t add_prob = 20000;
+    uint64_t update_prob = 20000;
+    uint64_t del_prob = 5000;
+    uint64_t list_prob = 300;
+    uint64_t min_key_len = 10;
+    uint64_t max_key_len = 70;
+    uint64_t min_value_len = 50;
+    uint64_t max_value_len = 300;
+    uint64_t min_list_count = 10;
+    uint64_t max_list_count = 1000;
+    uint64_t print_stats_interval = 1;
+    bool json_output = false;
+    uint64_t log_level = 1;
+    bool trace = false;
+    bool stop_on_error = false;
+    // FIXME: Multiple clients
+    kv_test_stat_t stat, prev_stat;
+    timespec prev_stat_time, start_stat_time;
+
+    // State
+    kv_dbw_t *db = NULL;
+    ring_loop_t *ringloop = NULL;
+    epoll_manager_t *epmgr = NULL;
+    cluster_client_t *cli = NULL;
+    ring_consumer_t consumer;
+    bool finished = false;
+    uint64_t total_prob = 0;
+    uint64_t ops_sent = 0, ops_done = 0;
+    int stat_timer_id = -1;
+    int in_progress = 0;
+    bool reopening = false;
+    std::set<kv_test_listing_t*> listings;
+    std::set<std::string> changing_keys;
+    std::map<std::string, std::string> values;
+
+    ~kv_test_t();
+
+    static json11::Json::object parse_args(int narg, const char *args[]);
+    void parse_config(json11::Json cfg);
+    void run(json11::Json cfg);
+    void loop();
+    void print_stats(kv_test_stat_t & prev_stat, timespec & prev_stat_time);
+    void print_total_stats();
+    void start_change(const std::string & key);
+    void stop_change(const std::string & key);
+    void add_stat(kv_test_lat_t & stat, timespec tv_begin);
+};
+
+kv_test_t::~kv_test_t()
+{
+    if (db)
+        delete db;
+    if (cli)
+    {
+        cli->flush();
+        delete cli;
+    }
+    if (epmgr)
+        delete epmgr;
+    if (ringloop)
+        delete ringloop;
+}
+
+json11::Json::object kv_test_t::parse_args(int narg, const char *args[])
+{
+    json11::Json::object cfg;
+    for (int i = 1; i < narg; i++)
+    {
+        if (!strcmp(args[i], "-h") || !strcmp(args[i], "--help"))
+        {
+            printf(
+                "Vitastor Key/Value DB stress tester / benchmark\n"
+                "(c) Vitaliy Filippov, 2023+ (VNPL-1.1)\n"
+                "\n"
+                "USAGE: %s --pool_id POOL_ID --inode_id INODE_ID [OPTIONS]\n"
+                "  --op_count 1000000\n"
+                "    Total operations to run during test. 0 means unlimited\n"
+                "  --key_prefix \"\"\n"
+                "    Prefix for all keys read or written (to avoid collisions)\n"
+                "  --key_suffix \"\"\n"
+                "    Suffix for all keys read or written (to avoid collisions, but scan all DB)\n"
+                "  --runtime 0\n"
+                "    Run for this number of seconds. 0 means unlimited\n"
+                "  --parallelism 4\n"
+                "    Run this number of operations in parallel\n"
+                "  --get_prob 30000\n"
+                "    Fraction of key retrieve operations\n"
+                "  --add_prob 20000\n"
+                "    Fraction of key addition operations\n"
+                "  --update_prob 20000\n"
+                "    Fraction of key update operations\n"
+                "  --del_prob 30000\n"
+                "    Fraction of key delete operations\n"
+                "  --list_prob 300\n"
+                "    Fraction of listing operations\n"
+                "  --min_key_len 10\n"
+                "    Minimum key size in bytes\n"
+                "  --max_key_len 70\n"
+                "    Maximum key size in bytes\n"
+                "  --min_value_len 50\n"
+                "    Minimum value size in bytes\n"
+                "  --max_value_len 300\n"
+                "    Maximum value size in bytes\n"
+                "  --min_list_count 10\n"
+                "    Minimum number of keys read in listing (0 = all keys)\n"
+                "  --max_list_count 1000\n"
+                "    Maximum number of keys read in listing\n"
+                "  --print_stats 1\n"
+                "    Print operation statistics every this number of seconds\n"
+                "  --json\n"
+                "    JSON output\n"
+                "  --stop_on_error 0\n"
+                "    Stop on first execution error, mismatch, lost key or extra key during listing\n"
+                "  --kv_memory_limit 128M\n"
+                "    Maximum memory to use for vitastor-kv index cache\n"
+                "  --kv_allocate_blocks 4\n"
+                "    Number of PG blocks used for new tree block allocation in parallel\n"
+                "  --kv_evict_max_misses 10\n"
+                "    Eviction algorithm parameter: retry eviction from another random spot\n"
+                "    if this number of keys is used currently or was used recently\n"
+                "  --kv_evict_attempts_per_level 3\n"
+                "    Retry eviction at most this number of times per tree level, starting\n"
+                "    with bottom-most levels\n"
+                "  --kv_evict_unused_age 1000\n"
+                "    Evict only keys unused during this number of last operations\n"
+                "  --kv_log_level 1\n"
+                "    Log level. 0 = errors, 1 = warnings, 10 = trace operations\n",
+                exe_name
+            );
+            exit(0);
+        }
+        else if (args[i][0] == '-' && args[i][1] == '-')
+        {
+            const char *opt = args[i]+2;
+            cfg[opt] = !strcmp(opt, "json") || i == narg-1 ? "1" : args[++i];
+        }
+    }
+    return cfg;
+}
+
+void kv_test_t::parse_config(json11::Json cfg)
+{
+    inode_id = INODE_WITH_POOL(cfg["pool_id"].uint64_value(), cfg["inode_id"].uint64_value());
+    if (cfg["op_count"].uint64_value() > 0)
+        op_count = cfg["op_count"].uint64_value();
+    key_prefix = cfg["key_prefix"].string_value();
+    key_suffix = cfg["key_suffix"].string_value();
+    if (cfg["runtime"].uint64_value() > 0)
+        runtime_sec = cfg["runtime"].uint64_value();
+    if (cfg["parallelism"].uint64_value() > 0)
+        parallelism = cfg["parallelism"].uint64_value();
+    if (!cfg["reopen_prob"].is_null())
+        reopen_prob = cfg["reopen_prob"].uint64_value();
+    if (!cfg["get_prob"].is_null())
+        get_prob = cfg["get_prob"].uint64_value();
+    if (!cfg["add_prob"].is_null())
+        add_prob = cfg["add_prob"].uint64_value();
+    if (!cfg["update_prob"].is_null())
+        update_prob = cfg["update_prob"].uint64_value();
+    if (!cfg["del_prob"].is_null())
+        del_prob = cfg["del_prob"].uint64_value();
+    if (!cfg["list_prob"].is_null())
+        list_prob = cfg["list_prob"].uint64_value();
+    if (!cfg["min_key_len"].is_null())
+        min_key_len = cfg["min_key_len"].uint64_value();
+    if (cfg["max_key_len"].uint64_value() > 0)
+        max_key_len = cfg["max_key_len"].uint64_value();
+    if (!cfg["min_value_len"].is_null())
+        min_value_len = cfg["min_value_len"].uint64_value();
+    if (cfg["max_value_len"].uint64_value() > 0)
+        max_value_len = cfg["max_value_len"].uint64_value();
+    if (!cfg["min_list_count"].is_null())
+        min_list_count = cfg["min_list_count"].uint64_value();
+    if (!cfg["max_list_count"].is_null())
+        max_list_count = cfg["max_list_count"].uint64_value();
+    if (!cfg["print_stats"].is_null())
+        print_stats_interval = cfg["print_stats"].uint64_value();
+    if (!cfg["json"].is_null())
+        json_output = true;
+    if (!cfg["stop_on_error"].is_null())
+        stop_on_error = cfg["stop_on_error"].bool_value();
+    if (!cfg["kv_memory_limit"].is_null())
+        kv_cfg["kv_memory_limit"] = cfg["kv_memory_limit"];
+    if (!cfg["kv_allocate_blocks"].is_null())
+        kv_cfg["kv_allocate_blocks"] = cfg["kv_allocate_blocks"];
+    if (!cfg["kv_evict_max_misses"].is_null())
+        kv_cfg["kv_evict_max_misses"] = cfg["kv_evict_max_misses"];
+    if (!cfg["kv_evict_attempts_per_level"].is_null())
+        kv_cfg["kv_evict_attempts_per_level"] = cfg["kv_evict_attempts_per_level"];
+    if (!cfg["kv_evict_unused_age"].is_null())
+        kv_cfg["kv_evict_unused_age"] = cfg["kv_evict_unused_age"];
+    if (!cfg["kv_log_level"].is_null())
+    {
+        log_level = cfg["kv_log_level"].uint64_value();
+        trace = log_level >= 10;
+        kv_cfg["kv_log_level"] = cfg["kv_log_level"];
+    }
+    total_prob = reopen_prob+get_prob+add_prob+update_prob+del_prob+list_prob;
+    stat.get.name = "get";
+    stat.add.name = "add";
+    stat.update.name = "update";
+    stat.del.name = "del";
+    stat.list.name = "list";
+}
+
+void kv_test_t::run(json11::Json cfg)
+{
+    srand48(time(NULL));
+    parse_config(cfg);
+    // Create client
+    ringloop = new ring_loop_t(512);
+    epmgr = new epoll_manager_t(ringloop);
+    cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
+    db = new kv_dbw_t(cli);
+    // Load image metadata
+    while (!cli->is_ready())
+    {
+        ringloop->loop();
+        if (cli->is_ready())
+            break;
+        ringloop->wait();
+    }
+    // Run
+    reopening = true;
+    db->open(inode_id, kv_cfg, [this](int res)
+    {
+        reopening = false;
+        if (res < 0)
+        {
+            fprintf(stderr, "ERROR: Open index: %d (%s)\n", res, strerror(-res));
+            exit(1);
+        }
+        if (trace)
+            printf("Index opened\n");
+        ringloop->wakeup();
+    });
+    consumer.loop = [this]() { loop(); };
+    ringloop->register_consumer(&consumer);
+    if (print_stats_interval)
+        stat_timer_id = epmgr->tfd->set_timer(print_stats_interval*1000, true, [this](int) { print_stats(prev_stat, prev_stat_time); });
+    clock_gettime(CLOCK_REALTIME, &start_stat_time);
+    prev_stat_time = start_stat_time;
+    while (!finished)
+    {
+        ringloop->loop();
+        if (!finished)
+            ringloop->wait();
+    }
+    if (stat_timer_id >= 0)
+        epmgr->tfd->clear_timer(stat_timer_id);
+    ringloop->unregister_consumer(&consumer);
+    // Print total stats
+    print_total_stats();
+    // Destroy the client
+    delete db;
+    db = NULL;
+    cli->flush();
+    delete cli;
+    delete epmgr;
+    delete ringloop;
+    cli = NULL;
+    epmgr = NULL;
+    ringloop = NULL;
+}
+
+static const char *base64_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@+/";
+
+std::string random_str(int len)
+{
+    std::string str;
+    str.resize(len);
+    for (int i = 0; i < len; i++)
+    {
+        str[i] = base64_chars[lrand48() % 64];
+    }
+    return str;
+}
+
+void kv_test_t::loop()
+{
+    if (reopening)
+    {
+        return;
+    }
+    if (ops_done >= op_count)
+    {
+        finished = true;
+    }
+    while (!finished && ops_sent < op_count && in_progress < parallelism)
+    {
+        uint64_t dice = (lrand48() % total_prob);
+        if (dice < reopen_prob)
+        {
+            reopening = true;
+            db->close([this]()
+            {
+                if (trace)
+                    printf("Index closed\n");
+                db->open(inode_id, kv_cfg, [this](int res)
+                {
+                    reopening = false;
+                    if (res < 0)
+                    {
+                        fprintf(stderr, "ERROR: Reopen index: %d (%s)\n", res, strerror(-res));
+                        finished = true;
+                        return;
+                    }
+                    if (trace)
+                        printf("Index reopened\n");
+                    ringloop->wakeup();
+                });
+            });
+            return;
+        }
+        else if (dice < reopen_prob+get_prob)
+        {
+            // get existing
+            auto key = random_str(max_key_len);
+            auto k_it = values.lower_bound(key);
+            if (k_it == values.end())
+                continue;
+            key = k_it->first;
+            if (changing_keys.find(key) != changing_keys.end())
+                continue;
+            in_progress++;
+            ops_sent++;
+            if (trace)
+                printf("get %s\n", key.c_str());
+            timespec tv_begin;
+            clock_gettime(CLOCK_REALTIME, &tv_begin);
+            db->get(key, [this, key, tv_begin](int res, const std::string & value)
+            {
+                add_stat(stat.get, tv_begin);
+                ops_done++;
+                in_progress--;
+                auto it = values.find(key);
+                if (res != (it == values.end() ? -ENOENT : 0))
+                {
+                    fprintf(stderr, "ERROR: get %s: %d (%s)\n", key.c_str(), res, strerror(-res));
+                    if (stop_on_error)
+                        exit(1);
+                }
+                else if (it != values.end() && value != it->second)
+                {
+                    fprintf(stderr, "ERROR: get %s: mismatch: %s vs %s\n", key.c_str(), value.c_str(), it->second.c_str());
+                    if (stop_on_error)
+                        exit(1);
+                }
+                ringloop->wakeup();
+            });
+        }
+        else if (dice < reopen_prob+get_prob+add_prob+update_prob)
+        {
+            bool is_add = false;
+            std::string key;
+            if (dice < reopen_prob+get_prob+add_prob)
+            {
+                // add
+                is_add = true;
+                uint64_t key_len = min_key_len + (max_key_len > min_key_len ? lrand48() % (max_key_len-min_key_len) : 0);
+                key = key_prefix + random_str(key_len) + key_suffix;
+            }
+            else
+            {
+                // update
+                key = random_str(max_key_len);
+                auto k_it = values.lower_bound(key);
+                if (k_it == values.end())
+                    continue;
+                key = k_it->first;
+            }
+            if (changing_keys.find(key) != changing_keys.end())
+                continue;
+            uint64_t value_len = min_value_len + (max_value_len > min_value_len ? lrand48() % (max_value_len-min_value_len) : 0);
+            auto value = random_str(value_len);
+            start_change(key);
+            ops_sent++;
+            in_progress++;
+            if (trace)
+                printf("set %s = %s\n", key.c_str(), value.c_str());
+            timespec tv_begin;
+            clock_gettime(CLOCK_REALTIME, &tv_begin);
+            db->set(key, value, [this, key, value, tv_begin, is_add](int res)
+            {
+                add_stat(is_add ? stat.add : stat.update, tv_begin);
+                stop_change(key);
+                ops_done++;
+                in_progress--;
+                if (res != 0)
+                {
+                    fprintf(stderr, "ERROR: set %s = %s: %d (%s)\n", key.c_str(), value.c_str(), res, strerror(-res));
+                    if (stop_on_error)
+                        exit(1);
+                }
+                else
+                {
+                    values[key] = value;
+                }
+                ringloop->wakeup();
+            }, NULL);
+        }
+        else if (dice < reopen_prob+get_prob+add_prob+update_prob+del_prob)
+        {
+            // delete
+            auto key = random_str(max_key_len);
+            auto k_it = values.lower_bound(key);
+            if (k_it == values.end())
+                continue;
+            key = k_it->first;
+            if (changing_keys.find(key) != changing_keys.end())
+                continue;
+            start_change(key);
+            ops_sent++;
+            in_progress++;
+            if (trace)
+                printf("del %s\n", key.c_str());
+            timespec tv_begin;
+            clock_gettime(CLOCK_REALTIME, &tv_begin);
+            db->del(key, [this, key, tv_begin](int res)
+            {
+                add_stat(stat.del, tv_begin);
+                stop_change(key);
+                ops_done++;
+                in_progress--;
+                if (res != 0)
+                {
+                    fprintf(stderr, "ERROR: del %s: %d (%s)\n", key.c_str(), res, strerror(-res));
+                    if (stop_on_error)
+                        exit(1);
+                }
+                else
+                {
+                    values.erase(key);
+                }
+                ringloop->wakeup();
+            }, NULL);
+        }
+        else if (dice < reopen_prob+get_prob+add_prob+update_prob+del_prob+list_prob)
+        {
+            // list
+            ops_sent++;
+            in_progress++;
+            auto key = random_str(max_key_len);
+            auto lst = new kv_test_listing_t;
+            auto k_it = values.lower_bound(key);
+            lst->count = min_list_count + (max_list_count > min_list_count ? lrand48() % (max_list_count-min_list_count) : 0);
+            lst->handle = db->list_start(k_it == values.begin() ? key_prefix : key);
+            lst->next_after = k_it == values.begin() ? key_prefix : key;
+            lst->inflights = changing_keys;
+            listings.insert(lst);
+            if (trace)
+                printf("list from %s\n", key.c_str());
+            clock_gettime(CLOCK_REALTIME, &lst->tv_begin);
+            db->list_next(lst->handle, [this, lst](int res, const std::string & key, const std::string & value)
+            {
+                if (log_level >= 11)
+                    printf("list: %s = %s\n", key.c_str(), value.c_str());
+                if (res >= 0 && key_prefix.size() && (key.size() < key_prefix.size() ||
+                    key.substr(0, key_prefix.size()) != key_prefix))
+                {
+                    // stop at this key
+                    res = -ENOENT;
+                }
+                if (res < 0 || (lst->count > 0 && lst->done >= lst->count))
+                {
+                    add_stat(stat.list, lst->tv_begin);
+                    if (res == 0)
+                    {
+                        // ok (done >= count)
+                    }
+                    else if (res != -ENOENT)
+                    {
+                        fprintf(stderr, "ERROR: list: %d (%s)\n", res, strerror(-res));
+                        lst->error = true;
+                    }
+                    else
+                    {
+                        auto k_it = lst->next_after == "" ? values.begin() : values.upper_bound(lst->next_after);
+                        while (k_it != values.end())
+                        {
+                            while (k_it != values.end() && lst->inflights.find(k_it->first) != lst->inflights.end())
+                                k_it++;
+                            if (k_it != values.end())
+                            {
+                                fprintf(stderr, "ERROR: list: missing key %s\n", (k_it++)->first.c_str());
+                                lst->error = true;
+                            }
+                        }
+                    }
+                    if (lst->error && stop_on_error)
+                        exit(1);
+                    ops_done++;
+                    in_progress--;
+                    db->list_close(lst->handle);
+                    delete lst;
+                    listings.erase(lst);
+                    ringloop->wakeup();
+                }
+                else
+                {
+                    stat.list_keys++;
+                    // Do not check modified keys in listing
+                    // Listing may return their old or new state
+                    if ((!key_suffix.size() || key.size() >= key_suffix.size() &&
+                        key.substr(key.size()-key_suffix.size()) == key_suffix) &&
+                        lst->inflights.find(key) == lst->inflights.end())
+                    {
+                        lst->done++;
+                        auto k_it = lst->next_after == "" ? values.begin() : values.upper_bound(lst->next_after);
+                        while (true)
+                        {
+                            while (k_it != values.end() && lst->inflights.find(k_it->first) != lst->inflights.end())
+                            {
+                                k_it++;
+                            }
+                            if (k_it == values.end() || k_it->first > key)
+                            {
+                                fprintf(stderr, "ERROR: list: extra key %s\n", key.c_str());
+                                lst->error = true;
+                                break;
+                            }
+                            else if (k_it->first < key)
+                            {
+                                fprintf(stderr, "ERROR: list: missing key %s\n", k_it->first.c_str());
+                                lst->error = true;
+                                lst->next_after = k_it->first;
+                                k_it++;
+                            }
+                            else
+                            {
+                                if (k_it->second != value)
+                                {
+                                    fprintf(stderr, "ERROR: list: mismatch: %s = %s but should be %s\n",
+                                        key.c_str(), value.c_str(), k_it->second.c_str());
+                                    lst->error = true;
+                                }
+                                lst->next_after = k_it->first;
+                                break;
+                            }
+                        }
+                    }
+                    db->list_next(lst->handle, NULL);
+                }
+            });
+        }
+    }
+}
+
+void kv_test_t::add_stat(kv_test_lat_t & stat, timespec tv_begin)
+{
+    timespec tv_end;
+    clock_gettime(CLOCK_REALTIME, &tv_end);
+    int64_t usec = (tv_end.tv_sec - tv_begin.tv_sec)*1000000 +
+        (tv_end.tv_nsec - tv_begin.tv_nsec)/1000;
+    if (usec > 0)
+    {
+        stat.usec += usec;
+        stat.count++;
+    }
+}
+
+void kv_test_t::print_stats(kv_test_stat_t & prev_stat, timespec & prev_stat_time)
+{
+    timespec cur_stat_time;
+    clock_gettime(CLOCK_REALTIME, &cur_stat_time);
+    int64_t usec = (cur_stat_time.tv_sec - prev_stat_time.tv_sec)*1000000 +
+        (cur_stat_time.tv_nsec - prev_stat_time.tv_nsec)/1000;
+    if (usec > 0)
+    {
+        kv_test_lat_t *lats[] = { &stat.get, &stat.add, &stat.update, &stat.del, &stat.list };
+        kv_test_lat_t *prev[] = { &prev_stat.get, &prev_stat.add, &prev_stat.update, &prev_stat.del, &prev_stat.list };
+        if (!json_output)
+        {
+            char buf[128] = { 0 };
+            for (int i = 0; i < sizeof(lats)/sizeof(lats[0]); i++)
+            {
+                snprintf(buf, sizeof(buf)-1, "%.1f %s/s (%lu us)", (lats[i]->count-prev[i]->count)*1000000.0/usec,
+                    lats[i]->name, (lats[i]->usec-prev[i]->usec)/(lats[i]->count-prev[i]->count > 0 ? lats[i]->count-prev[i]->count : 1));
+                int k;
+                for (k = strlen(buf); k < strlen(lats[i]->name)+21; k++)
+                    buf[k] = ' ';
+                buf[k] = 0;
+                printf("%s", buf);
+            }
+            printf("\n");
+        }
+        else
+        {
+            int64_t runtime = (cur_stat_time.tv_sec - start_stat_time.tv_sec)*1000000 +
+                (cur_stat_time.tv_nsec - start_stat_time.tv_nsec)/1000;
+            printf("{\"runtime\":%.1f", (double)runtime/1000000.0);
+            for (int i = 0; i < sizeof(lats)/sizeof(lats[0]); i++)
+            {
+                if (lats[i]->count > prev[i]->count)
+                {
+                    printf(
+                        ",\"%s\":{\"avg\":{\"iops\":%.1f,\"usec\":%lu},\"total\":{\"count\":%lu,\"usec\":%lu}}",
+                        lats[i]->name, (lats[i]->count-prev[i]->count)*1000000.0/usec,
+                        (lats[i]->usec-prev[i]->usec)/(lats[i]->count-prev[i]->count),
+                        lats[i]->count, lats[i]->usec
+                    );
+                }
+            }
+            printf("}\n");
+        }
+    }
+    prev_stat = stat;
+    prev_stat_time = cur_stat_time;
+}
+
+void kv_test_t::print_total_stats()
+{
+    if (!json_output)
+        printf("Total:\n");
+    kv_test_stat_t start_stats;
+    timespec start_stat_time = this->start_stat_time;
+    print_stats(start_stats, start_stat_time);
+}
+
+void kv_test_t::start_change(const std::string & key)
+{
+    changing_keys.insert(key);
+    for (auto lst: listings)
+    {
+        lst->inflights.insert(key);
+    }
+}
+
+void kv_test_t::stop_change(const std::string & key)
+{
+    changing_keys.erase(key);
+}
+
+int main(int narg, const char *args[])
+{
+    setvbuf(stdout, NULL, _IONBF, 0);
+    setvbuf(stderr, NULL, _IONBF, 0);
+    exe_name = args[0];
+    kv_test_t *p = new kv_test_t();
+    p->run(kv_test_t::parse_args(narg, args));
+    delete p;
+    return 0;
+}
--- a/src/messenger.cpp
+++ b/src/messenger.cpp
@@ -22,7 +22,7 @@ void osd_messenger_t::init()
    {
        rdma_context = msgr_rdma_context_t::create(
            rdma_device != "" ? rdma_device.c_str() : NULL,
-            rdma_port_num, rdma_gid_index, rdma_mtu, log_level
+            rdma_port_num, rdma_gid_index, rdma_mtu, rdma_odp, log_level
        );
        if (!rdma_context)
        {
@@ -167,6 +167,7 @@ void osd_messenger_t::parse_config(const json11::Json & config)
    this->rdma_max_msg = config["rdma_max_msg"].uint64_value();
    if (!this->rdma_max_msg || this->rdma_max_msg > 128*1024*1024)
        this->rdma_max_msg = 129*1024;
+    this->rdma_odp = config["rdma_odp"].bool_value();
 #endif
    this->receive_buffer_size = (uint32_t)config["tcp_header_buffer_size"].uint64_value();
    if (!this->receive_buffer_size || this->receive_buffer_size > 1024*1024*1024)
--- a/src/messenger.h
+++ b/src/messenger.h
@@ -131,6 +131,7 @@ protected:
    msgr_rdma_context_t *rdma_context = NULL;
    uint64_t rdma_max_sge = 0, rdma_max_send = 0, rdma_max_recv = 0;
    uint64_t rdma_max_msg = 0;
+    bool rdma_odp = false;
 #endif

    std::vector<int> read_ready_clients;
@@ -197,7 +198,9 @@ protected:
    void handle_reply_ready(osd_op_t *op);

 #ifdef WITH_RDMA
-    bool try_send_rdma(osd_client_t *cl);
+    void try_send_rdma(osd_client_t *cl);
+    void try_send_rdma_odp(osd_client_t *cl);
+    void try_send_rdma_nodp(osd_client_t *cl);
    bool try_recv_rdma(osd_client_t *cl);
    void handle_rdma_events();
 #endif
--- a/src/msgr_rdma.cpp
+++ b/src/msgr_rdma.cpp
@@ -47,11 +47,29 @@ msgr_rdma_connection_t::~msgr_rdma_connection_t()
    if (qp)
        ibv_destroy_qp(qp);
    if (recv_buffers.size())
+    {
        for (auto b: recv_buffers)
-            free(b);
+        {
+            if (b.mr)
+                ibv_dereg_mr(b.mr);
+            free(b.buf);
+        }
+        recv_buffers.clear();
+    }
+    if (send_out.mr)
+    {
+        ibv_dereg_mr(send_out.mr);
+        send_out.mr = NULL;
+    }
+    if (send_out.buf)
+    {
+        free(send_out.buf);
+        send_out.buf = NULL;
+    }
+    send_out_size = 0;
 }

-msgr_rdma_context_t *msgr_rdma_context_t::create(const char *ib_devname, uint8_t ib_port, uint8_t gid_index, uint32_t mtu, int log_level)
+msgr_rdma_context_t *msgr_rdma_context_t::create(const char *ib_devname, uint8_t ib_port, uint8_t gid_index, uint32_t mtu, bool odp, int log_level)
 {
    int res;
    ibv_device **dev_list = NULL;
@@ -136,21 +154,27 @@ msgr_rdma_context_t *msgr_rdma_context_t::create(const char *ib_devname, uint8_t
            fprintf(stderr, "Couldn't query RDMA device for its features\n");
            goto cleanup;
        }
-        if (!(ctx->attrx.odp_caps.general_caps & IBV_ODP_SUPPORT) ||
+        ctx->odp = odp;
+        if (ctx->odp &&
+            (!(ctx->attrx.odp_caps.general_caps & IBV_ODP_SUPPORT) ||
            !(ctx->attrx.odp_caps.general_caps & IBV_ODP_SUPPORT_IMPLICIT) ||
            !(ctx->attrx.odp_caps.per_transport_caps.rc_odp_caps & IBV_ODP_SUPPORT_SEND) ||
-            !(ctx->attrx.odp_caps.per_transport_caps.rc_odp_caps & IBV_ODP_SUPPORT_RECV))
+            !(ctx->attrx.odp_caps.per_transport_caps.rc_odp_caps & IBV_ODP_SUPPORT_RECV)))
        {
-            fprintf(stderr, "The RDMA device isn't implicit ODP (On-Demand Paging) capable or does not support RC send and receive with ODP\n");
-            goto cleanup;
+            ctx->odp = false;
+            if (log_level > 0)
+                fprintf(stderr, "The RDMA device isn't implicit ODP (On-Demand Paging) capable, disabling it\n");
        }
    }

-    ctx->mr = ibv_reg_mr(ctx->pd, NULL, SIZE_MAX, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_ON_DEMAND);
-    if (!ctx->mr)
+    if (ctx->odp)
    {
-        fprintf(stderr, "Couldn't register RDMA memory region\n");
-        goto cleanup;
+        ctx->mr = ibv_reg_mr(ctx->pd, NULL, SIZE_MAX, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_ON_DEMAND);
+        if (!ctx->mr)
+        {
+            fprintf(stderr, "Couldn't register RDMA memory region\n");
+            goto cleanup;
+        }
    }

    ctx->channel = ibv_create_comp_channel(ctx->context);
@@ -365,12 +389,34 @@ static void try_send_rdma_wr(osd_client_t *cl, ibv_sge *sge, int op_sge)
    cl->rdma_conn->cur_send++;
 }

-bool osd_messenger_t::try_send_rdma(osd_client_t *cl)
+static int try_send_rdma_copy(osd_client_t *cl, uint8_t *dst, int dst_len)
+{
+    auto rc = cl->rdma_conn;
+    int total_dst_len = dst_len;
+    while (dst_len > 0 && rc->send_pos < cl->send_list.size())
+    {
+        iovec & iov = cl->send_list[rc->send_pos];
+        uint32_t len = (uint32_t)(iov.iov_len-rc->send_buf_pos < dst_len
+            ? iov.iov_len-rc->send_buf_pos : dst_len);
+        memcpy(dst, iov.iov_base+rc->send_buf_pos, len);
+        dst += len;
+        dst_len -= len;
+        rc->send_buf_pos += len;
+        if (rc->send_buf_pos >= iov.iov_len)
+        {
+            rc->send_pos++;
+            rc->send_buf_pos = 0;
+        }
+    }
+    return total_dst_len-dst_len;
+}
+
+void osd_messenger_t::try_send_rdma_odp(osd_client_t *cl)
 {
    auto rc = cl->rdma_conn;
    if (!cl->send_list.size() || rc->cur_send >= rc->max_send)
    {
-        return true;
+        return;
    }
    uint64_t op_size = 0, op_sge = 0;
    ibv_sge sge[rc->max_sge];
@@ -408,15 +454,70 @@ bool osd_messenger_t::try_send_rdma(osd_client_t *cl)
        rc->send_sizes.push_back(op_size);
        try_send_rdma_wr(cl, sge, op_sge);
    }
-    return true;
 }

-static void try_recv_rdma_wr(osd_client_t *cl, void *buf)
+void osd_messenger_t::try_send_rdma_nodp(osd_client_t *cl)
+{
+    auto rc = cl->rdma_conn;
+    if (!rc->send_out_size)
+    {
+        // Allocate send ring buffer, if not yet
+        rc->send_out_size = rc->max_msg*rdma_max_send;
+        rc->send_out.buf = malloc_or_die(rc->send_out_size);
+        if (!rdma_context->odp)
+        {
+            rc->send_out.mr = ibv_reg_mr(rdma_context->pd, rc->send_out.buf, rc->send_out_size, 0);
+            if (!rc->send_out.mr)
+            {
+                fprintf(stderr, "Failed to register RDMA memory region: %s\n", strerror(errno));
+                exit(1);
+            }
+        }
+    }
+    // Copy data into the buffer and send it
+    uint8_t *dst = NULL;
+    int dst_len = 0;
+    int copied = 1;
+    while (!rc->send_out_full && copied > 0 && rc->cur_send < rc->max_send)
+    {
+        dst = (uint8_t*)rc->send_out.buf + rc->send_out_pos;
+        dst_len = (rc->send_out_pos < rc->send_out_size ? rc->send_out_size-rc->send_out_pos : rc->send_done_pos-rc->send_out_pos);
+        if (dst_len > rc->max_msg)
+            dst_len = rc->max_msg;
+        copied = try_send_rdma_copy(cl, dst, dst_len);
+        if (copied > 0)
+        {
+            rc->send_out_pos += copied;
+            if (rc->send_out_pos == rc->send_out_size)
+                rc->send_out_pos = 0;
+            assert(rc->send_out_pos < rc->send_out_size);
+            if (rc->send_out_pos >= rc->send_done_pos)
+                rc->send_out_full = true;
+            ibv_sge sge = {
+                .addr = (uintptr_t)dst,
+                .length = (uint32_t)copied,
+                .lkey = rdma_context->odp ? rdma_context->mr->lkey : rc->send_out.mr->lkey,
+            };
+            try_send_rdma_wr(cl, &sge, 1);
+            rc->send_sizes.push_back(copied);
+        }
+    }
+}
+
+void osd_messenger_t::try_send_rdma(osd_client_t *cl)
+{
+    if (rdma_context->odp)
+        try_send_rdma_odp(cl);
+    else
+        try_send_rdma_nodp(cl);
+}
+
+static void try_recv_rdma_wr(osd_client_t *cl, msgr_rdma_buf_t b)
 {
    ibv_sge sge = {
-        .addr = (uintptr_t)buf,
+        .addr = (uintptr_t)b.buf,
        .length = (uint32_t)cl->rdma_conn->max_msg,
-        .lkey = cl->rdma_conn->ctx->mr->lkey,
+        .lkey = cl->rdma_conn->ctx->odp ? cl->rdma_conn->ctx->mr->lkey : b.mr->lkey,
    };
    ibv_recv_wr *bad_wr = NULL;
    ibv_recv_wr wr = {
@@ -438,9 +539,19 @@ bool osd_messenger_t::try_recv_rdma(osd_client_t *cl)
    auto rc = cl->rdma_conn;
    while (rc->cur_recv < rc->max_recv)
    {
-        void *buf = malloc_or_die(rc->max_msg);
-        rc->recv_buffers.push_back(buf);
-        try_recv_rdma_wr(cl, buf);
+        msgr_rdma_buf_t b;
+        b.buf = malloc_or_die(rc->max_msg);
+        if (!rdma_context->odp)
+        {
+            b.mr = ibv_reg_mr(rdma_context->pd, b.buf, rc->max_msg, IBV_ACCESS_LOCAL_WRITE);
+            if (!b.mr)
+            {
+                fprintf(stderr, "Failed to register RDMA memory region: %s\n", strerror(errno));
+                exit(1);
+            }
+        }
+        rc->recv_buffers.push_back(b);
+        try_recv_rdma_wr(cl, b);
    }
    return true;
 }
@@ -492,7 +603,7 @@ void osd_messenger_t::handle_rdma_events()
            if (!is_send)
            {
                rc->cur_recv--;
-                if (!handle_read_buffer(cl, rc->recv_buffers[rc->next_recv_buf], wc[i].byte_len))
+                if (!handle_read_buffer(cl, rc->recv_buffers[rc->next_recv_buf].buf, wc[i].byte_len))
                {
                    // handle_read_buffer may stop the client
                    continue;
@@ -505,6 +616,14 @@ void osd_messenger_t::handle_rdma_events()
                rc->cur_send--;
                uint64_t sent_size = rc->send_sizes.at(0);
                rc->send_sizes.erase(rc->send_sizes.begin(), rc->send_sizes.begin()+1);
+                if (!rdma_context->odp)
+                {
+                    rc->send_done_pos += sent_size;
+                    rc->send_out_full = false;
+                    if (rc->send_done_pos == rc->send_out_size)
+                        rc->send_done_pos = 0;
+                    assert(rc->send_done_pos < rc->send_out_size);
+                }
                int send_pos = 0, send_buf_pos = 0;
                while (sent_size > 0)
                {
--- a/src/msgr_rdma.h
+++ b/src/msgr_rdma.h
@@ -23,6 +23,7 @@ struct msgr_rdma_context_t
    ibv_device *dev = NULL;
    ibv_device_attr_ex attrx;
    ibv_pd *pd = NULL;
+    bool odp = false;
    ibv_mr *mr = NULL;
    ibv_comp_channel *channel = NULL;
    ibv_cq *cq = NULL;
@@ -35,10 +36,16 @@ struct msgr_rdma_context_t
    int max_cqe = 0;
    int used_max_cqe = 0;

-    static msgr_rdma_context_t *create(const char *ib_devname, uint8_t ib_port, uint8_t gid_index, uint32_t mtu, int log_level);
+    static msgr_rdma_context_t *create(const char *ib_devname, uint8_t ib_port, uint8_t gid_index, uint32_t mtu, bool odp, int log_level);
    ~msgr_rdma_context_t();
 };

+struct msgr_rdma_buf_t
+{
+    void *buf = NULL;
+    ibv_mr *mr = NULL;
+};
+
 struct msgr_rdma_connection_t
 {
    msgr_rdma_context_t *ctx = NULL;
@@ -50,8 +57,11 @@ struct msgr_rdma_connection_t

    int send_pos = 0, send_buf_pos = 0;
    int next_recv_buf = 0;
-    std::vector<void*> recv_buffers;
+    std::vector<msgr_rdma_buf_t> recv_buffers;
    std::vector<uint64_t> send_sizes;
+    msgr_rdma_buf_t send_out;
+    int send_out_pos = 0, send_done_pos = 0, send_out_size = 0;
+    bool send_out_full = false;

    ~msgr_rdma_connection_t();
    static msgr_rdma_connection_t *create(msgr_rdma_context_t *ctx, uint32_t max_send, uint32_t max_recv, uint32_t max_sge, uint32_t max_msg);
--- a/src/nbd_proxy.cpp
+++ b/src/nbd_proxy.cpp
@@ -225,7 +225,7 @@ public:
            cfg = obj;
        }
        // Create client
-        ringloop = new ring_loop_t(512);
+        ringloop = new ring_loop_t(RINGLOOP_DEFAULT_SIZE);
        epmgr = new epoll_manager_t(ringloop);
        cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
        if (!inode)
--- a/src/nfs_proxy.cpp
+++ b/src/nfs_proxy.cpp
@@ -124,7 +124,7 @@ void nfs_proxy_t::run(json11::Json cfg)
        cfg = obj;
    }
    // Create client
-    ringloop = new ring_loop_t(512);
+    ringloop = new ring_loop_t(RINGLOOP_DEFAULT_SIZE);
    epmgr = new epoll_manager_t(ringloop);
    cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
    cmd = new cli_tool_t();
--- a/src/osd.cpp
+++ b/src/osd.cpp
@@ -541,11 +541,15 @@ void osd_t::print_slow()
                }
                else if (op->req.hdr.opcode == OSD_OP_SEC_STABILIZE || op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK)
                {
-                    for (uint64_t i = 0; i < op->req.sec_stab.len; i += sizeof(obj_ver_id))
+                    for (uint64_t i = 0; i < op->req.sec_stab.len && i < sizeof(obj_ver_id)*12; i += sizeof(obj_ver_id))
                    {
                        obj_ver_id *ov = (obj_ver_id*)((uint8_t*)op->buf + i);
                        bufprintf(i == 0 ? " %lx:%lx v%lu" : ", %lx:%lx v%lu", ov->oid.inode, ov->oid.stripe, ov->version);
                    }
+                    if (op->req.sec_stab.len > sizeof(obj_ver_id)*12)
+                    {
+                        bufprintf(", ... (%lu items)", op->req.sec_stab.len/sizeof(obj_ver_id));
+                    }
                }
                else if (op->req.hdr.opcode == OSD_OP_SEC_LIST)
                {
--- a/src/osd_main.cpp
+++ b/src/osd_main.cpp
@@ -58,7 +58,7 @@ int main(int narg, char *args[])
    }
    signal(SIGINT, handle_sigint);
    signal(SIGTERM, handle_sigint);
-    ring_loop_t *ringloop = new ring_loop_t(512);
+    ring_loop_t *ringloop = new ring_loop_t(RINGLOOP_DEFAULT_SIZE);
    osd = new osd_t(config, ringloop);
    while (1)
    {
--- a/src/ringloop.cpp
+++ b/src/ringloop.cpp
@@ -17,7 +17,7 @@ ring_loop_t::ring_loop_t(int qd)
    {
        throw std::runtime_error(std::string("io_uring_queue_init: ") + strerror(-ret));
    }
-    free_ring_data_ptr = *ring.cq.kring_entries;
+    free_ring_data_ptr = *ring.sq.kring_entries;
    ring_datas = (struct ring_data_t*)calloc(free_ring_data_ptr, sizeof(ring_data_t));
    free_ring_data = (int*)malloc(sizeof(int) * free_ring_data_ptr);
    if (!ring_datas || !free_ring_data)
--- a/src/ringloop.h
+++ b/src/ringloop.h
@@ -15,6 +15,8 @@
 #include <functional>
 #include <vector>

+#define RINGLOOP_DEFAULT_SIZE 1024
+
 static inline void my_uring_prep_rw(int op, struct io_uring_sqe *sqe, int fd, const void *addr, unsigned len, off_t offset)
 {
    // Prepare a read/write operation without clearing user_data
@@ -139,11 +141,9 @@ public:
        if (free_ring_data_ptr == 0)
            return NULL;
        struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
-        if (sqe)
-        {
-            *sqe = { 0 };
-            io_uring_sqe_set_data(sqe, ring_datas + free_ring_data[--free_ring_data_ptr]);
-        }
+        assert(sqe);
+        *sqe = { 0 };
+        io_uring_sqe_set_data(sqe, ring_datas + free_ring_data[--free_ring_data_ptr]);
        return sqe;
    }
    inline void set_immediate(const std::function<void()> cb)
--- a/src/stub_uring_osd.cpp
+++ b/src/stub_uring_osd.cpp
@@ -30,7 +30,7 @@ void stub_exec_op(osd_messenger_t *msgr, osd_op_t *op);
 int main(int narg, char *args[])
 {
    ring_consumer_t looper;
-    ring_loop_t *ringloop = new ring_loop_t(512);
+    ring_loop_t *ringloop = new ring_loop_t(RINGLOOP_DEFAULT_SIZE);
    epoll_manager_t *epmgr = new epoll_manager_t(ringloop);
    osd_messenger_t *msgr = new osd_messenger_t();
    msgr->osd_num = 1351;
--- a/src/test_blockstore.cpp
+++ b/src/test_blockstore.cpp
@@ -11,7 +11,7 @@ int main(int narg, char *args[])
    config["meta_device"] = "./test_meta.bin";
    config["journal_device"] = "./test_journal.bin";
    config["data_device"] = "./test_data.bin";
-    ring_loop_t *ringloop = new ring_loop_t(512);
+    ring_loop_t *ringloop = new ring_loop_t(RINGLOOP_DEFAULT_SIZE);
    epoll_manager_t *epmgr = new epoll_manager_t(ringloop);
    blockstore_t *bs = new blockstore_t(config, ringloop, epmgr->tfd);

--- a/src/test_cas.cpp
+++ b/src/test_cas.cpp
@@ -68,7 +68,7 @@ int main(int narg, char *args[])
        | cfg["inode_id"].uint64_value();
    uint64_t base_ver = 0;
    // Create client
-    auto ringloop = new ring_loop_t(512);
+    auto ringloop = new ring_loop_t(RINGLOOP_DEFAULT_SIZE);
    auto epmgr = new epoll_manager_t(ringloop);
    auto cli = new cluster_client_t(ringloop, epmgr->tfd, cfg);
    cli->on_ready([&]()
--- a/src/vitastor_c.cpp
+++ b/src/vitastor_c.cpp
@@ -114,7 +114,7 @@ vitastor_c *vitastor_c_create_qemu_uring(QEMUSetFDHandler *aio_set_fd_handler, v
    ring_loop_t *ringloop = NULL;
    try
    {
-        ringloop = new ring_loop_t(512);
+        ringloop = new ring_loop_t(RINGLOOP_DEFAULT_SIZE);
    }
    catch (std::exception & e)
    {
@@ -136,7 +136,7 @@ vitastor_c *vitastor_c_create_uring(const char *config_path, const char *etcd_ho
    ring_loop_t *ringloop = NULL;
    try
    {
-        ringloop = new ring_loop_t(512);
+        ringloop = new ring_loop_t(RINGLOOP_DEFAULT_SIZE);
    }
    catch (std::exception & e)
    {
@@ -167,7 +167,7 @@ vitastor_c *vitastor_c_create_uring_json(const char **options, int options_len)
    ring_loop_t *ringloop = NULL;
    try
    {
-        ringloop = new ring_loop_t(512);
+        ringloop = new ring_loop_t(RINGLOOP_DEFAULT_SIZE);
    }
    catch (std::exception & e)
    {
Author	SHA1	Message	Date
Vitaliy Filippov	f285cfc483	Fix eviction when random_pos selects the end	2023-12-01 01:43:03 +03:00
Vitaliy Filippov	12b50b421d	Implement min/max list_count to make listings during performance test reasonable	2023-12-01 01:17:04 +03:00
Vitaliy Filippov	9f6d09428d	Fix and improve parallel allocation - Do not try to allocate more DB blocks in an inode block until it's "confirmed" and "locked" by the first write - Do not recheck for new zero DB blocks on first write into an inode block - a CAS failure means someone else is already writing into it - Throw new allocation blocks away regardless of whether the known_version is 0 on a CAS failure	2023-12-01 01:17:04 +03:00
Vitaliy Filippov	580025cfc9	Implement key_prefix for K/V stress test	2023-12-01 01:17:04 +03:00
Vitaliy Filippov	13e2d3ce7c	More fixes - do not overwrite a block with older version if known version is newer (read may start before update and end after update) - invalidated block versions can't be remembered and trusted - right boundary for split blocks is right_half when diving down, not key_lt - restart update also when block is "invalidated", not just on version mismatch - copy callback in listings to avoid closure destruction bugs too	2023-12-01 01:17:04 +03:00
Vitaliy Filippov	c5b00f897a	Add logging and one more assert	2023-12-01 01:17:04 +03:00
Vitaliy Filippov	e847e26912	Make get_block() wait for updating when unrelated block is found along the path	2023-12-01 01:17:04 +03:00
Vitaliy Filippov	3393463466	Fix a race condition where changed blocks were parsed over existing cached blocks and getting a mix of data	2023-12-01 01:17:04 +03:00
Vitaliy Filippov	bd96a6194a	Simplify code by removing an unneeded "optimisation"	2023-12-01 01:17:04 +03:00
Vitaliy Filippov	601fe10c28	Add kv_log_level, print warnings on level 1, trace ops on level 10	2023-12-01 01:17:04 +03:00
Vitaliy Filippov	63dbc9ca85	Fix duplicate keys in listings on parallel updates -- do not rewind key "iterator position"	2023-12-01 01:17:04 +03:00
Vitaliy Filippov	aa0c363c39	Implement key suffix to avoid collisions of multiple test workers	2023-12-01 01:17:04 +03:00
Vitaliy Filippov	ce52c5589e	Do not complain on empty first block	2023-12-01 01:17:04 +03:00
Vitaliy Filippov	aee20ab1ee	Add JSON output for stress-tester	2023-12-01 01:17:04 +03:00
Vitaliy Filippov	bb81992fac	Print total stats	2023-12-01 01:17:04 +03:00
Vitaliy Filippov	a28f401aff	Do not send more than op_count operations (fix segfault on finish)	2023-12-01 01:17:04 +03:00
Vitaliy Filippov	4ac7e096fd	Add some more resiliency to serialize()	2023-12-01 01:17:04 +03:00
Vitaliy Filippov	b6171a4599	Invalidate blocks being updated too	2023-12-01 01:17:03 +03:00
Vitaliy Filippov	28045f230c	Change new block allocation method: make each writer choose multiple empty PG blocks and place blocks in them	2023-12-01 01:17:03 +03:00
Vitaliy Filippov	10e867880f	Remove blocks from cache on unsuccessful updates	2023-12-01 01:17:03 +03:00
Vitaliy Filippov	012462171a	Allow to track multiple updates per block (it should never happen though)	2023-12-01 01:17:03 +03:00
Vitaliy Filippov	904793cdab	Do not call stop_updating after failed write_new_block and after clear_block (both delete the item)	2023-12-01 01:17:03 +03:00
Vitaliy Filippov	45c01db2de	Track versions of parent blocks and recheck if changed during update	2023-12-01 01:17:03 +03:00
Vitaliy Filippov	8c9206cecd	Fix resume_split condition (key_lt can also be "")	2023-12-01 01:17:03 +03:00
Vitaliy Filippov	e8c46ededa	Experiment: transform offsets for better sharding	2023-12-01 01:17:03 +03:00
Vitaliy Filippov	e9b321a0e0	More post-stress-test fixes - Prevent _split types of new blocks - Stop updating new blocks only after the whole update, otherwise pointers may become invalid - Use recheck_none for updates initially - Use UINT64_MAX as initial block version when postponing ops, otherwise the check fails when the block is initially empty. This for example leads to writing both leaf items & block pointers (which is incorrect) into the root block when starting stress-test with --parallelism 32 - Fix -EINTR comparison	2023-12-01 01:17:03 +03:00
Vitaliy Filippov	09a77991ae	Print operation statistics	2023-12-01 01:17:03 +03:00
Vitaliy Filippov	29d8c9b6f3	K/V fixes after stress-test :-) - track block versions correctly - per inode block (128kb) instead of tree block (4kb) - prevent multiple parallel CAS writes of the same inode block - add logging for EILSEQ which means invalid data in the tree - fix get_block updated flag which was true for blocks already in cache and was leading to infinite loops on "unrelated block" errors - apply changes to blocks in cache only after successful writes (using "virtual changes") - do not replace cached block with an older version from disk - recheck "unrelated blocks" (read/update collisions) until data stops changing - track tree path correctly - do not treat split block as parent of its right half - correctly move blocks when finding new empty place on disk - restart updates from the beginning when one of blocks is changed by a parallel update - fix delete using SET opcode and setting key to the empty value instead - prevent changing the same key more than 1 time in parallel - fix listing verification - resume continue_updates in update_find (required because it uses continue_update itself) - add allow_old_cached parameter to get()	2023-12-01 01:17:03 +03:00
Vitaliy Filippov	20321aaaef	Implement K/V DB stress tester	2023-12-01 01:17:03 +03:00
Vitaliy Filippov	987b005356	Evict blocks based on memory limit & block usage	2023-12-01 01:17:03 +03:00
Vitaliy Filippov	41754b748b	Track blocks per level	2023-12-01 01:17:03 +03:00
Vitaliy Filippov	31913256f3	Track block level	2023-12-01 01:17:03 +03:00
Vitaliy Filippov	0ee36baed7	Experimental B-Tree Vitastor embedded K/V database implementation!	2023-12-01 01:17:03 +03:00
Vitaliy Filippov	19e2d9d6fa	Fix crash on unknown long argument to vitastor-disk	2023-12-01 00:55:51 +03:00
Vitaliy Filippov	bfc7e61909	Add more notes + performance comparison about VDUSE	2023-11-25 02:25:56 +03:00
Vitaliy Filippov	7da4868b37	Fix monitor statistics aggregation in case of empty /osd/stats keys	2023-11-24 01:05:21 +03:00
Vitaliy Filippov	b5c020ce0b	Use io_uring SQ size for ringloop capacity - otherwise get_sqe could return NULL when space_left() was > 0 under load Raise default io_uring size to 1024 for the same effective capacity as previously	2023-11-20 03:04:06 +03:00
Vitaliy Filippov	6b33ae973d	%d -> %lu	2023-11-20 03:02:26 +03:00
Vitaliy Filippov	cf36445359	Reserve journal space for stabilize requests dynamically to prevent stalls	2023-11-20 03:01:57 +03:00
Vitaliy Filippov	3fd873d263	Add -fno-omit-frame-pointer by default	2023-11-20 02:59:54 +03:00
Vitaliy Filippov	a00e8ae9ed	Fix mismatch journal pos format in vitastor-disk	2023-11-19 15:19:54 +03:00
Vitaliy Filippov	75674545dc	Limit the number of printed object versions in slow op dump (otherwise it may overflow the fixed buffer)	2023-11-13 01:10:28 +03:00
Vitaliy Filippov	225eb2fe3d	Support RDMA without ODP by stupidly copying memory. Disable ODP by default ODP is slower than regular RDMA even with memory copy overhead Example numbers: - 3950000 random read iops without ODP vs 240000 iops with ODP - 1447000 random write iops without ODP vs 101000 iops with ODP Reference: https://tkygtr6.github.io/pub/ISPASS21_slides.pdf	2023-11-12 15:03:47 +03:00