Release 1.1.0

New features:
- Implement [client writeback cache](docs/config/client.en.md#client_enable_writeback)
- Add the third I/O mode: [O_DIRECT|O_SYNC](docs/config/osd.en.md#data_io) (good for Optane)
- Reduce load on etcd by splitting OSD lease and statistics reporting intervals: [etcd_stats_interval](docs/config/osd.en.md#etcd_stats_interval) (default 30 sec)
- Make MON automatically filter OSDs by layout (block_size/immediate_commit/bitmap_granularity) to prevent "refusing to start PGs of this pool" errors on misconfiguration
- Support running fio benchmarks on systems without io_uring
- Make QEMU driver compatible with QEMU 8.1
- Document usage of [vhost-user-blk](docs/usage/qemu.en.md#vhost-user-blk)

Bug fixes:
- Fix resizing disks in QEMU driver (for example, in Proxmox)
- Fix "unexpected result" in Proxmox driver by making CLI flush output on exit
- Remove unneeded block_size mismatch warnings on pools without matching PGs
- Fix possible segfault in vitastor-cli ls -l (usually with deleted pools)
- Fix QEMU driver compatibility with systems without io_uring
- Fix monitor eating 100% CPU when etcd is down (caused by infinite retries)
- Fix potential incorrect write processing with snapshots (not caught in tests but could probably lead to client hangs)
- Fix buffer insertion in cluster_client (not caught in tests but could probably lead to incorrect writes in rare cases)
- Fix rare OSD crash during sync operation processing
- Fix a reenterability issue in cluster_client not reproducible in QEMU/fio, but reproducible with the currently developed K/V database implementation
- Fix deletion of the first modified object - OSDs could crash if you modified the same object a lot of times, then deleted it, and then modified it again
- Fix the fio_sec_osd test tool

Bump qemu version to vitastor4
2023-10-28 00:33:06 +03:00 · 2023-10-27 14:09:26 +03:00 · 2023-10-27 01:26:26 +03:00
98 changed files with 2186 additions and 749 deletions
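
The release notes above mention splitting the OSD lease and statistics reporting intervals to reduce load on etcd. As a rough illustration only (not a measurement), the sketch below estimates the effect under two assumptions: that statistics cost one key write per OSD plus one per PG on every interval, which is how the new `etcd_stats_interval` documentation describes them, and that before this release statistics were written at `etcd_report_interval` (default 5 seconds). The cluster size and the function name are hypothetical.

```python
def stats_key_writes_per_second(num_osds: int, num_pgs: int, interval_sec: float) -> float:
    """Approximate etcd key writes per second caused by statistics reporting.

    Assumption: one key per OSD and one key per PG is rewritten every interval.
    """
    return (num_osds + num_pgs) / interval_sec


# Hypothetical cluster, chosen only for illustration
NUM_OSDS = 100
NUM_PGS = 1024

# Assumed old behaviour: statistics shared etcd_report_interval (default 5 s)
before = stats_key_writes_per_second(NUM_OSDS, NUM_PGS, 5)
# New behaviour: statistics use etcd_stats_interval (default 30 s)
after = stats_key_writes_per_second(NUM_OSDS, NUM_PGS, 30)

print(f"before: {before:.1f} writes/s, after: {after:.1f} writes/s")
print(f"reduction factor: {before / after:.1f}x")
```

Under these assumptions the reduction factor is simply the ratio of the two intervals, regardless of cluster size; lease refreshes still happen at `etcd_report_interval` and are not counted here.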
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -2,6 +2,6 @@ cmake_minimum_required(VERSION 2.8.12)

 project(vitastor)

-set(VERSION "1.0.0")
+set(VERSION "1.1.0")

 add_subdirectory(src)
--- a/csi/Makefile
+++ b/csi/Makefile
@@ -1,4 +1,4 @@
-VERSION ?= v1.0.0
+VERSION ?= v1.1.0

 all: build push

--- a/csi/deploy/004-csi-nodeplugin.yaml
+++ b/csi/deploy/004-csi-nodeplugin.yaml
@@ -49,7 +49,7 @@ spec:
            capabilities:
              add: ["SYS_ADMIN"]
            allowPrivilegeEscalation: true
-          image: vitalif/vitastor-csi:v1.0.0
+          image: vitalif/vitastor-csi:v1.1.0
          args:
            - "--node=$(NODE_ID)"
            - "--endpoint=$(CSI_ENDPOINT)"
--- a/csi/deploy/007-csi-provisioner.yaml
+++ b/csi/deploy/007-csi-provisioner.yaml
@@ -116,7 +116,7 @@ spec:
            privileged: true
            capabilities:
              add: ["SYS_ADMIN"]
-          image: vitalif/vitastor-csi:v1.0.0
+          image: vitalif/vitastor-csi:v1.1.0
          args:
            - "--node=$(NODE_ID)"
            - "--endpoint=$(CSI_ENDPOINT)"
--- a/csi/go.mod
+++ b/csi/go.mod
@@ -9,7 +9,7 @@ require (
 	golang.org/x/net v0.0.0-20201202161906-c7110b5ffcbb
 	golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1 // indirect
 	google.golang.org/grpc v1.33.1
-	k8s.io/klog v1.0.0
+	k8s.io/klog v1.1.0
 	k8s.io/utils v0.0.0-20210305010621-2afb4311ab10
 )

--- a/csi/go.sum
+++ b/csi/go.sum
@@ -6,9 +6,9 @@ cloud.google.com/go v0.45.1/go.mod h1:RpBamKRgapWJb87xiFSdk4g1CME7QZg3uwTez+TSTj
 cloud.google.com/go v0.46.3/go.mod h1:a6bKKbmY7er1mI7TEI4lsAkts/mkhTSZK8w33B4RAg0=
 cloud.google.com/go v0.51.0/go.mod h1:hWtGJ6gnXH+KgDv+V0zFGDvpi07n3z8ZNj3T1RW0Gcw=
 cloud.google.com/go/bigquery v1.0.1/go.mod h1:i/xbL2UlR5RvWAURpBYZTtm/cXjCha9lbfbpx4poX+o=
-cloud.google.com/go/datastore v1.0.0/go.mod h1:LXYbyblFSglQ5pkeyhO+Qmw7ukd3C+pD7TKLgZqpHYE=
+cloud.google.com/go/datastore v1.1.0/go.mod h1:LXYbyblFSglQ5pkeyhO+Qmw7ukd3C+pD7TKLgZqpHYE=
 cloud.google.com/go/pubsub v1.0.1/go.mod h1:R0Gpsv3s54REJCy4fxDixWD93lHJMoZTyQ2kNxGRt3I=
-cloud.google.com/go/storage v1.0.0/go.mod h1:IhtSnM/ZTZV8YYJWCY8RULGVqBDmpoyjwiyrjsg+URw=
+cloud.google.com/go/storage v1.1.0/go.mod h1:IhtSnM/ZTZV8YYJWCY8RULGVqBDmpoyjwiyrjsg+URw=
 dmitri.shuralyov.com/gpu/mtl v0.0.0-20190408044501-666a987793e9/go.mod h1:H6x//7gZCb22OMCxBHrMx7a5I7Hp++hsVxbQ4BYO7hU=
 github.com/Azure/go-ansiterm v0.0.0-20170929234023-d6e3b3328b78/go.mod h1:LmzpDX56iTiv29bbRTIsUNlaFfuhWRQBWjQdVyAevI8=
 github.com/Azure/go-autorest/autorest v0.9.0/go.mod h1:xyHB1BMZT0cuDHU7I0+g046+BFDTQ8rEZB0s4Yfa6bI=
@@ -25,14 +25,14 @@ github.com/Azure/go-autorest/tracing v0.5.0/go.mod h1:r/s2XiOKccPW3HrqB+W0TQzfbt
 github.com/BurntSushi/toml v0.3.1/go.mod h1:xHWCNGjB5oqiDr8zfno3MHue2Ht5sIBksp03qcyfWMU=
 github.com/BurntSushi/xgb v0.0.0-20160522181843-27f122750802/go.mod h1:IVnqGOEym/WlBOVXweHU+Q+/VP0lqqI8lqeDx9IjBqo=
 github.com/NYTimes/gziphandler v0.0.0-20170623195520-56545f4a5d46/go.mod h1:3wb06e3pkSAbeQ52E9H9iFoQsEEwGN64994WTCIhntQ=
-github.com/PuerkitoBio/purell v1.0.0/go.mod h1:c11w/QuzBsJSee3cPx9rAFu61PvFxuPbtSwDGJws/X0=
+github.com/PuerkitoBio/purell v1.1.0/go.mod h1:c11w/QuzBsJSee3cPx9rAFu61PvFxuPbtSwDGJws/X0=
 github.com/PuerkitoBio/urlesc v0.0.0-20160726150825-5bd2802263f2/go.mod h1:uGdkoq3SwY9Y+13GIhn11/XLaGBb4BfwItxLd5jeuXE=
 github.com/alecthomas/template v0.0.0-20160405071501-a0175ee3bccc/go.mod h1:LOuyumcjzFXgccqObfd/Ljyb9UuFJ6TxHnclSeseNhc=
 github.com/alecthomas/template v0.0.0-20190718012654-fb15b899a751/go.mod h1:LOuyumcjzFXgccqObfd/Ljyb9UuFJ6TxHnclSeseNhc=
 github.com/alecthomas/units v0.0.0-20151022065526-2efee857e7cf/go.mod h1:ybxpYRFXyAe+OPACYpWeL0wqObRcbAqCMya13uyzqw0=
 github.com/alecthomas/units v0.0.0-20190717042225-c3de453c63f4/go.mod h1:ybxpYRFXyAe+OPACYpWeL0wqObRcbAqCMya13uyzqw0=
 github.com/beorn7/perks v0.0.0-20180321164747-3a771d992973/go.mod h1:Dwedo/Wpr24TaqPxmxbtue+5NUziq4I4S80YR8gNf3Q=
-github.com/beorn7/perks v1.0.0/go.mod h1:KWe93zE9D1o94FZ5RNwFwVgaQK1VOXiVxmqh+CedLV8=
+github.com/beorn7/perks v1.1.0/go.mod h1:KWe93zE9D1o94FZ5RNwFwVgaQK1VOXiVxmqh+CedLV8=
 github.com/beorn7/perks v1.0.1/go.mod h1:G2ZrVWU2WbWT9wwq4/hrbKbnv/1ERSJQ0ibhJ6rlkpw=
 github.com/blang/semver v3.5.0+incompatible/go.mod h1:kRBLl5iJ+tD4TcOOxsy/0fnwebNt5EWlYSAyrTnjyyk=
 github.com/census-instrumentation/opencensus-proto v0.2.1/go.mod h1:f6KPmirojxKA12rnyqOA5BBL4O983OfeGPqjHWSTneU=
@@ -92,13 +92,13 @@ github.com/golang/protobuf v1.4.1/go.mod h1:U8fpvMrcmy5pZrNK1lt4xCsGvpyWQ/VVv6QD
 github.com/golang/protobuf v1.4.2 h1:+Z5KGCizgyZCbGh1KZqA0fcLLkwbsjIzS4aV2v7wJX0=
 github.com/golang/protobuf v1.4.2/go.mod h1:oDoupMAO8OvCJWAcko0GGGIgR6R6ocIYbsSw735rRwI=
 github.com/google/btree v0.0.0-20180813153112-4030bb1f1f0c/go.mod h1:lNA+9X1NB3Zf8V7Ke586lFgjr2dZNuvo3lPJSGZ5JPQ=
-github.com/google/btree v1.0.0/go.mod h1:lNA+9X1NB3Zf8V7Ke586lFgjr2dZNuvo3lPJSGZ5JPQ=
+github.com/google/btree v1.1.0/go.mod h1:lNA+9X1NB3Zf8V7Ke586lFgjr2dZNuvo3lPJSGZ5JPQ=
 github.com/google/go-cmp v0.2.0/go.mod h1:oXzfMopK8JAjlY9xF4vHSVASa0yLyX7SntLO5aqRK0M=
 github.com/google/go-cmp v0.3.0/go.mod h1:8QqcDgzrUqlUb/G2PQTWiueGozuR1884gddMywk6iLU=
 github.com/google/go-cmp v0.3.1/go.mod h1:8QqcDgzrUqlUb/G2PQTWiueGozuR1884gddMywk6iLU=
 github.com/google/go-cmp v0.4.0 h1:xsAVV57WRhGj6kEIi8ReJzQlHHqcBYCElAvkovg3B/4=
 github.com/google/go-cmp v0.4.0/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE=
-github.com/google/gofuzz v1.0.0/go.mod h1:dBl0BpW6vV/+mYPU4Po3pmUjxk6FQPldtuIdl/M65Eg=
+github.com/google/gofuzz v1.1.0/go.mod h1:dBl0BpW6vV/+mYPU4Po3pmUjxk6FQPldtuIdl/M65Eg=
 github.com/google/gofuzz v1.1.0/go.mod h1:dBl0BpW6vV/+mYPU4Po3pmUjxk6FQPldtuIdl/M65Eg=
 github.com/google/martian v2.1.0+incompatible/go.mod h1:9I4somxYTbIHy5NJKHRl3wXiIaQGbYVAs8BPL6v8lEs=
 github.com/google/pprof v0.0.0-20181206194817-3ea8567a2e57/go.mod h1:zfwlbNMJ+OItoe0UupaVj+oy1omPYYDuagoSzA8v9mc=
@@ -112,7 +112,7 @@ github.com/googleapis/gnostic v0.4.1/go.mod h1:LRhVm6pbyptWbWbuZ38d1eyptfvIytN3i
 github.com/gregjones/httpcache v0.0.0-20180305231024-9cad4c3443a7/go.mod h1:FecbI9+v66THATjSRHfNgh1IVFe/9kFxbXtjV0ctIMA=
 github.com/hashicorp/golang-lru v0.5.0/go.mod h1:/m3WP610KZHVQ1SGc6re/UDhFvYD7pJ4Ao+sR/qLZy8=
 github.com/hashicorp/golang-lru v0.5.1/go.mod h1:/m3WP610KZHVQ1SGc6re/UDhFvYD7pJ4Ao+sR/qLZy8=
-github.com/hpcloud/tail v1.0.0/go.mod h1:ab1qPbhIpdTxEkNHXyeSf5vhxWSCs/tWer42PpOxQnU=
+github.com/hpcloud/tail v1.1.0/go.mod h1:ab1qPbhIpdTxEkNHXyeSf5vhxWSCs/tWer42PpOxQnU=
 github.com/ianlancetaylor/demangle v0.0.0-20181102032728-5e5cf60278f6/go.mod h1:aSSvb/t6k1mPoxDqO4vJh6VOCGPwU4O0C2/Eqndh1Sc=
 github.com/imdario/mergo v0.3.5/go.mod h1:2EnlNZ0deacrJVfApfmtdGgDfMuh/nq6Ok1EcJh5FfA=
 github.com/json-iterator/go v1.1.6/go.mod h1:+SdeFBvtyEkXs7REEP0seUULqWtbJapLOCVDaaPEHmU=
@@ -121,7 +121,7 @@ github.com/jstemmer/go-junit-report v0.0.0-20190106144839-af01ea7f8024/go.mod h1
 github.com/jstemmer/go-junit-report v0.9.1/go.mod h1:Brl9GWCQeLvo8nXZwPNNblvFj/XSXhF0NWZEnDohbsk=
 github.com/julienschmidt/httprouter v1.2.0/go.mod h1:SYymIcj16QtmaHHD7aYtjjsJG7VTCxuUUipMqKk8s4w=
 github.com/kisielk/errcheck v1.2.0/go.mod h1:/BMXB+zMLi60iA8Vv6Ksmxu/1UDYcXs4uQLJ+jE2L00=
-github.com/kisielk/gotool v1.0.0/go.mod h1:XhKaO+MFFWcvkIS/tQcRk01m1F5IRFswLeQ+oQHNcck=
+github.com/kisielk/gotool v1.1.0/go.mod h1:XhKaO+MFFWcvkIS/tQcRk01m1F5IRFswLeQ+oQHNcck=
 github.com/konsorten/go-windows-terminal-sequences v1.0.1/go.mod h1:T0+1ngSBFLxvqU3pZ+m/2kptfBszLMUkC4ZK/EgS/cQ=
 github.com/konsorten/go-windows-terminal-sequences v1.0.3/go.mod h1:T0+1ngSBFLxvqU3pZ+m/2kptfBszLMUkC4ZK/EgS/cQ=
 github.com/kr/logfmt v0.0.0-20140226030751-b84e30acd515/go.mod h1:+0opPa2QZZtGFBFZlji/RkVcI2GknAs/DXo4wKdlNEc=
@@ -153,10 +153,10 @@ github.com/peterbourgon/diskv v2.0.1+incompatible/go.mod h1:uqqh8zWWbv1HBMNONnaR
 github.com/pkg/errors v0.8.0/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
 github.com/pkg/errors v0.8.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
 github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
-github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
-github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
+github.com/pmezard/go-difflib v1.1.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
+github.com/pmezard/go-difflib v1.1.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
 github.com/prometheus/client_golang v0.9.1/go.mod h1:7SWBe2y4D6OKWSNQJUaRYU/AaXPKyh/dDVn+NZz0KFw=
-github.com/prometheus/client_golang v1.0.0/go.mod h1:db9x61etRT2tGnBNRi70OPL5FsnadC4Ky3P0J6CfImo=
+github.com/prometheus/client_golang v1.1.0/go.mod h1:db9x61etRT2tGnBNRi70OPL5FsnadC4Ky3P0J6CfImo=
 github.com/prometheus/client_golang v1.7.1/go.mod h1:PY5Wy2awLA44sXw4AOSfFBetzPP4j5+D6mVACh+pe2M=
 github.com/prometheus/client_model v0.0.0-20180712105110-5c3871d89910/go.mod h1:MbSGuTsp3dbXC40dX6PRTWyKYBIrTGTE9sqQNg2J8bo=
 github.com/prometheus/client_model v0.0.0-20190129233127-fd36f4220a90/go.mod h1:xMI15A0UPsDsEKsMN9yxemIoYk6Tm2C1GtYGdfGttqA=
@@ -326,13 +326,13 @@ google.golang.org/protobuf v1.24.0 h1:UhZDfRO8JRQru4/+LlLE0BRKGF8L+PICnvYZmx/fEG
 google.golang.org/protobuf v1.24.0/go.mod h1:r/3tXBNzIEhYS9I1OUVjXDlt8tc493IdKGjtUeSXeh4=
 gopkg.in/alecthomas/kingpin.v2 v2.2.6/go.mod h1:FMv+mEhP44yOT+4EoQTLFTRgOQ1FBLkstjWtayDeSgw=
 gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
-gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
-gopkg.in/check.v1 v1.0.0-20190902080502-41f04d3bba15 h1:YR8cESwS4TdDjEe65xsg0ogRM/Nc3DYOhEAlW+xobZo=
-gopkg.in/check.v1 v1.0.0-20190902080502-41f04d3bba15/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
+gopkg.in/check.v1 v1.1.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
+gopkg.in/check.v1 v1.1.0-20190902080502-41f04d3bba15 h1:YR8cESwS4TdDjEe65xsg0ogRM/Nc3DYOhEAlW+xobZo=
+gopkg.in/check.v1 v1.1.0-20190902080502-41f04d3bba15/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
 gopkg.in/errgo.v2 v2.1.0/go.mod h1:hNsd1EY+bozCKY1Ytp96fpM3vjJbqLJn88ws8XvfDNI=
 gopkg.in/fsnotify.v1 v1.4.7/go.mod h1:Tz8NjZHkW78fSQdbUxIjBTcgA1z1m8ZHf0WmKUhAMys=
 gopkg.in/inf.v0 v0.9.1/go.mod h1:cWUDdTG/fYaXco+Dcufb5Vnc6Gp2YChqWtbxRZE0mXw=
-gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7/go.mod h1:dt/ZhP58zS4L8KSrWDmTeBkI65Dw0HsyUHuEVlX15mw=
+gopkg.in/tomb.v1 v1.1.0-20141024135613-dd632973f1e7/go.mod h1:dt/ZhP58zS4L8KSrWDmTeBkI65Dw0HsyUHuEVlX15mw=
 gopkg.in/yaml.v2 v2.2.1/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
 gopkg.in/yaml.v2 v2.2.2/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
 gopkg.in/yaml.v2 v2.2.4/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
@@ -351,8 +351,8 @@ k8s.io/apimachinery v0.19.0/go.mod h1:DnPGDnARWFvYa3pMHgSxtbZb7gpzzAZ1pTfaUNDVlm
 k8s.io/client-go v0.19.0/go.mod h1:H9E/VT95blcFQnlyShFgnFT9ZnJOAceiUHM3MlRC+mU=
 k8s.io/component-base v0.19.0/go.mod h1:dKsY8BxkA+9dZIAh2aWJLL/UdASFDNtGYTCItL4LM7Y=
 k8s.io/gengo v0.0.0-20200413195148-3a45101e95ac/go.mod h1:ezvh/TsK7cY6rbqRK0oQQ8IAqLxYwwyPxAX1Pzy0ii0=
-k8s.io/klog v1.0.0 h1:Pt+yjF5aB1xDSVbau4VsWe+dQNzA0qv1LlXdC2dF6Q8=
-k8s.io/klog v1.0.0/go.mod h1:4Bi6QPql/J/LkTDqv7R/cd3hPo4k2DG6Ptcz060Ez5I=
+k8s.io/klog v1.1.0 h1:Pt+yjF5aB1xDSVbau4VsWe+dQNzA0qv1LlXdC2dF6Q8=
+k8s.io/klog v1.1.0/go.mod h1:4Bi6QPql/J/LkTDqv7R/cd3hPo4k2DG6Ptcz060Ez5I=
 k8s.io/klog/v2 v2.0.0/go.mod h1:PBfzABfn139FHAV07az/IF9Wp1bkk3vpT2XSJ76fSDE=
 k8s.io/klog/v2 v2.2.0 h1:XRvcwJozkgZ1UQJmfMGpvRthQHOvihEhYtDfAaxMz/A=
 k8s.io/klog/v2 v2.2.0/go.mod h1:Od+F08eJP+W3HUb4pSrPpgp9DGU4GzlpG/TmITuYh/Y=
--- a/csi/src/config.go
+++ b/csi/src/config.go
@@ -5,7 +5,7 @@ package vitastor

 const (
    vitastorCSIDriverName    = "csi.vitastor.io"
-    vitastorCSIDriverVersion = "1.0.0"
+    vitastorCSIDriverVersion = "1.1.0"
 )

 // Config struct fills the parameters of request or user input
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,10 +1,10 @@
-vitastor (1.0.0-1) unstable; urgency=medium
+vitastor (1.1.0-1) unstable; urgency=medium

  * Bugfixes

 -- Vitaliy Filippov <vitalif@yourcmc.ru>  Fri, 03 Jun 2022 02:09:44 +0300

-vitastor (1.0.0-1) unstable; urgency=medium
+vitastor (1.1.0-1) unstable; urgency=medium

  * Implement NFS proxy
  * Add documentation
--- a/debian/patched-qemu.Dockerfile
+++ b/debian/patched-qemu.Dockerfile
@@ -54,7 +54,8 @@ RUN set -e; \
    quilt add block/vitastor.c; \
    cp /root/vitastor/src/qemu_driver.c block/vitastor.c; \
    quilt refresh; \
-    V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)(~bpo[\d\+]*)?\).*$/$1/')+vitastor3; \
+    V=$(head -n1 debian/changelog | perl -pe 's/5\.2\+dfsg-9/5.2+dfsg-11/; s/^.*\((.*?)(~bpo[\d\+]*)?\).*$/$1/')+vitastor4; \
+    if [ "$REL" = bullseye ]; then V=${V}bullseye; fi; \
    DEBEMAIL="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v $V 'Plug Vitastor block driver'; \
    DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
    rm -rf /root/packages/qemu-$REL/qemu-*/
--- a/debian/vitastor.Dockerfile
+++ b/debian/vitastor.Dockerfile
@@ -35,8 +35,8 @@ RUN set -e -x; \
    mkdir -p /root/packages/vitastor-$REL; \
    rm -rf /root/packages/vitastor-$REL/*; \
    cd /root/packages/vitastor-$REL; \
-    cp -r /root/vitastor vitastor-1.0.0; \
-    cd vitastor-1.0.0; \
+    cp -r /root/vitastor vitastor-1.1.0; \
+    cd vitastor-1.1.0; \
    ln -s /root/fio-build/fio-*/ ./fio; \
    FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
    ls /usr/include/linux/raw.h || cp ./debian/raw.h /usr/include/linux/raw.h; \
@@ -49,8 +49,8 @@ RUN set -e -x; \
    rm -rf a b; \
    echo "dep:fio=$FIO" > debian/fio_version; \
    cd /root/packages/vitastor-$REL; \
-    tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_1.0.0.orig.tar.xz vitastor-1.0.0; \
-    cd vitastor-1.0.0; \
+    tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_1.1.0.orig.tar.xz vitastor-1.1.0; \
+    cd vitastor-1.1.0; \
    V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
    DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
    DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
--- a/docs/config.en.md
+++ b/docs/config.en.md
@@ -33,6 +33,7 @@ In the future, additional configuration methods may be added:

 - [Common](config/common.en.md)
 - [Network](config/network.en.md)
+- [Client](config/client.en.md)
 - [Global Disk Layout](config/layout-cluster.en.md)
 - [OSD Disk Layout](config/layout-osd.en.md)
 - [OSD Runtime Parameters](config/osd.en.md)
--- a/docs/config.ru.md
+++ b/docs/config.ru.md
@@ -36,6 +36,7 @@

 - [Общие](config/common.ru.md)
 - [Сеть](config/network.ru.md)
+- [Клиентский код](config/client.ru.md)
 - [Глобальные дисковые параметры](config/layout-cluster.ru.md)
 - [Дисковые параметры OSD](config/layout-osd.ru.md)
 - [Прочие параметры OSD](config/osd.ru.md)
--- a/docs/config/client.en.md
+++ b/docs/config/client.en.md
@@ -0,0 +1,103 @@
+[Documentation](../../README.md#documentation) → [Configuration](../config.en.md) → Client Parameters
+
+-----
+
+[Читать на русском](client.ru.md)
+
+# Client Parameters
+
+These parameters apply only to clients and affect their interaction with
+the cluster.
+
+- [client_max_dirty_bytes](#client_max_dirty_bytes)
+- [client_max_dirty_ops](#client_max_dirty_ops)
+- [client_enable_writeback](#client_enable_writeback)
+- [client_max_buffered_bytes](#client_max_buffered_bytes)
+- [client_max_buffered_ops](#client_max_buffered_ops)
+- [client_max_writeback_iodepth](#client_max_writeback_iodepth)
+
+## client_max_dirty_bytes
+
+- Type: integer
+- Default: 33554432
+- Can be changed online: yes
+
+Without [immediate_commit](layout-cluster.en.md#immediate_commit)=all this parameter sets the limit of "dirty"
+(not committed by fsync) data allowed by the client before forcing an
+additional fsync and committing the data. Also note that the client always
+holds a copy of uncommitted data in memory so this setting also affects
+RAM usage of clients.
+
+## client_max_dirty_ops
+
+- Type: integer
+- Default: 1024
+- Can be changed online: yes
+
+Same as client_max_dirty_bytes, but instead of total size, limits the number
+of uncommitted write operations.
+
+## client_enable_writeback
+
+- Type: boolean
+- Default: false
+- Can be changed online: yes
+
+This parameter enables client-side write buffering. This means that write
+requests are accumulated in memory for a short time before being sent to
+a Vitastor cluster, which allows sending them in parallel and increases
+performance of some applications. Writes are buffered until the client forces
+a flush with fsync() or until the amount of buffered writes exceeds the
+limit.
+
+Write buffering significantly increases performance of some applications,
+for example, CrystalDiskMark under Windows (LOL :-D), but also of any other
+application that writes in one of two non-optimal ways: either it does
+a lot of small (4 KB or so) sequential writes, or it does a lot of small
+random writes, but without any parallelism or asynchrony, and also
+without calling fsync().
+
+With write buffering enabled, you can expect around 22000 T1Q1 random write
+iops in QEMU more or less regardless of the quality of your SSDs, and this
+number is in fact bound by QEMU itself rather than Vitastor (check it
+yourself by adding a "driver=null-co" disk in QEMU). Without write
+buffering, the current record is 9900 iops, but the number is usually
+even lower with non-ideal hardware, for example, it may be 5000 iops.
+
+Even when this parameter is enabled, write buffering isn't enabled until
+the client explicitly allows it, because enabling it without the client
+being aware of the fact that its writes may be buffered may lead to data
+loss. Because of this, older versions of clients don't support write
+buffering at all; newer versions of the QEMU driver allow write buffering
+only if it's enabled in disk settings with `-blockdev cache.direct=false`,
+and newer versions of fio only allow write buffering if you don't specify
+`-direct=1`. NBD and NFS drivers allow write buffering by default.
+
+You can overcome this restriction too with the `client_writeback_allowed`
+parameter, but you shouldn't do that unless you **really** know what you
+are doing.
+
+## client_max_buffered_bytes
+
+- Type: integer
+- Default: 33554432
+- Can be changed online: yes
+
+Maximum total size of buffered writes which triggers write-back when reached.
+
+## client_max_buffered_ops
+
+- Type: integer
+- Default: 1024
+- Can be changed online: yes
+
+Maximum number of buffered writes which triggers write-back when reached.
+Multiple consecutive modified data regions are counted as 1 write here.
+
+## client_max_writeback_iodepth
+
+- Type: integer
+- Default: 256
+- Can be changed online: yes
+
+Maximum number of parallel writes when flushing buffered data to the server.
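
The buffering rules documented in client.en.md above can be modelled with a small toy class. This is a minimal sketch under stated assumptions, not Vitastor's implementation: the class and method names are hypothetical, write-back is assumed to start as soon as either limit is reached, and a flush is assumed to send everything at once (the real client also limits parallelism with `client_max_writeback_iodepth`, which is not modelled). Only the two default values are taken from the documentation.

```python
from dataclasses import dataclass, field


@dataclass
class WritebackBuffer:
    """Toy model of client-side write buffering; not Vitastor's actual code."""

    max_buffered_bytes: int = 33554432  # documented default of client_max_buffered_bytes
    max_buffered_ops: int = 1024        # documented default of client_max_buffered_ops
    regions: list = field(default_factory=list)  # buffered (start, end) byte ranges
    flushed: list = field(default_factory=list)  # ranges already sent to the cluster

    def buffered_bytes(self) -> int:
        return sum(end - start for start, end in self.regions)

    def write(self, offset: int, length: int) -> None:
        """Buffer a write, merging it with overlapping or adjacent regions."""
        start, end = offset, offset + length
        kept = []
        for s, e in self.regions:
            if e < start or s > end:
                kept.append((s, e))  # disjoint region, keep as is
            else:
                start, end = min(start, s), max(end, e)  # merge into the new region
        kept.append((start, end))
        self.regions = sorted(kept)
        # Assumption: write-back starts when either limit is reached
        if (self.buffered_bytes() >= self.max_buffered_bytes
                or len(self.regions) >= self.max_buffered_ops):
            self.flush()

    def flush(self) -> None:
        """Model of fsync() or limit-triggered write-back: send everything."""
        self.flushed.extend(self.regions)
        self.regions = []
```

Because consecutive regions are merged, a run of small sequential writes counts as a single buffered write against `max_buffered_ops`, which matches the note in the `client_max_buffered_ops` description.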
--- a/docs/config/client.ru.md
+++ b/docs/config/client.ru.md
@@ -0,0 +1,103 @@
+[Документация](../../README-ru.md#документация) → [Конфигурация](../config.ru.md) → Параметры клиентского кода
+
+-----
+
+[Read in English](client.en.md)
+
+# Параметры клиентского кода
+
+Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD) и
+затрагивают логику их работы с кластером.
+
+- [client_max_dirty_bytes](#client_max_dirty_bytes)
+- [client_max_dirty_ops](#client_max_dirty_ops)
+- [client_enable_writeback](#client_enable_writeback)
+- [client_max_buffered_bytes](#client_max_buffered_bytes)
+- [client_max_buffered_ops](#client_max_buffered_ops)
+- [client_max_writeback_iodepth](#client_max_writeback_iodepth)
+
+## client_max_dirty_bytes
+
+- Тип: целое число
+- Значение по умолчанию: 33554432
+- Можно менять на лету: да
+
+При работе без [immediate_commit](layout-cluster.ru.md#immediate_commit)=all - это лимит объёма "грязных" (не
+зафиксированных fsync-ом) данных, при достижении которого клиент будет
+принудительно вызывать fsync и фиксировать данные. Также стоит иметь в виду,
+что в этом случае до момента fsync клиент хранит копию незафиксированных
+данных в памяти, то есть, настройка влияет на потребление памяти клиентами.
+
+## client_max_dirty_ops
+
+- Тип: целое число
+- Значение по умолчанию: 1024
+- Можно менять на лету: да
+
+Аналогично client_max_dirty_bytes, но ограничивает количество
+незафиксированных операций записи вместо их общего объёма.
+
+## client_enable_writeback
+
+- Тип: булево (да/нет)
+- Значение по умолчанию: false
+- Можно менять на лету: да
+
+Данный параметр разрешает включать буферизацию записи в памяти. Буферизация
+означает, что операции записи отправляются на кластер Vitastor не сразу, а
+могут небольшое время накапливаться в памяти и сбрасываться сразу пакетами,
+до тех пор, пока либо не будет превышен лимит неотправленных записей, либо
+пока клиент не вызовет fsync.
+
+Буферизация значительно повышает производительность некоторых приложений,
+например, CrystalDiskMark в Windows (ха-ха :-D), но также и любых других,
+которые пишут на диск неоптимально: либо последовательно, но мелкими блоками
+(например, по 4 кб), либо случайно, но без параллелизма и без fsync - то
+есть, например, отправляя 128 операций записи в разные места диска, но не
+все сразу с помощью асинхронного I/O, а по одной.
+
+В QEMU с буферизацией записи можно ожидать показателя примерно 22000
+операций случайной записи в секунду в 1 поток и с глубиной очереди 1 (T1Q1)
+без fsync, почти вне зависимости от того, насколько хороши ваши диски - эта
+цифра упирается в сам QEMU. Без буферизации рекорд пока что - 9900 операций
+в секунду, но на железе похуже может быть и поменьше, например, 5000 операций
+в секунду.
+
+При этом, даже если данный параметр включён, буферизация не включается, если
+явно не разрешена клиентом, т.к. если клиент не знает, что запросы записи
+буферизуются, это может приводить к потере данных. Поэтому в старых версиях
+клиентских драйверов буферизация записи не включается вообще, в новых
+версиях QEMU-драйвера включается только если разрешена опцией диска
+`-blockdev cache.direct=false`, а в fio - только если нет опция `-direct=1`.
+В NBD и NFS драйверах буферизация записи разрешена по умолчанию.
+
+Можно обойти и это ограничение с помощью параметра `client_writeback_allowed`,
+но делать так не надо, если только вы не уверены в том, что делаете, на все
+100%. :-)
+
+## client_max_buffered_bytes
+
+- Тип: целое число
+- Значение по умолчанию: 33554432
+- Можно менять на лету: да
+
+Максимальный общий размер буферизованных записей, при достижении которого
+начинается процесс сброса данных на сервер.
+
+## client_max_buffered_ops
+
+- Тип: целое число
+- Значение по умолчанию: 1024
+- Можно менять на лету: да
+
+Максимальное количество буферизованных записей, при достижении которого
+начинается процесс сброса данных на сервер. При этом несколько
+последовательных изменённых областей здесь считаются 1 записью.
+
+## client_max_writeback_iodepth
+
+- Тип: целое число
+- Значение по умолчанию: 256
+- Можно менять на лету: да
+
+Максимальное число параллельных операций записи при сбросе буферов на сервер.
--- a/docs/config/layout-cluster.en.md
+++ b/docs/config/layout-cluster.en.md
@@ -96,8 +96,9 @@ SSD cache or "media-cache" - for example, a lot of Seagate EXOS drives have
 it (they have internal SSD cache even though it's not stated in datasheets).

 Setting this parameter to "all" or "small" in OSD parameters requires enabling
-disable_journal_fsync and disable_meta_fsync, setting it to "all" also requires
-enabling disable_data_fsync.
+[disable_journal_fsync](layout-osd.en.md#disable_journal_fsync) and
+[disable_meta_fsync](layout-osd.en.md#disable_meta_fsync); setting it to
+"all" also requires enabling [disable_data_fsync](layout-osd.en.md#disable_data_fsync).

 TLDR: For optimal performance, set immediate_commit to "all" if you only use
 SSDs with supercapacitor-based power loss protection (nonvolatile
--- a/docs/config/layout-cluster.ru.md
+++ b/docs/config/layout-cluster.ru.md
@@ -103,8 +103,9 @@ HDD-дисках с внутренним SSD или "медиа" кэшем - н
 указано в спецификациях).

 Указание "all" или "small" в настройках / командной строке OSD требует
-включения disable_journal_fsync и disable_meta_fsync, значение "all" также
-требует включения disable_data_fsync.
+включения [disable_journal_fsync](layout-osd.ru.md#disable_journal_fsync) и
+[disable_meta_fsync](layout-osd.ru.md#disable_meta_fsync), значение "all"
+также требует включения [disable_data_fsync](layout-osd.ru.md#disable_data_fsync).

 Итого, вкратце: для оптимальной производительности установите
 immediate_commit в значение "all", если вы используете в кластере только SSD
--- a/docs/config/layout-osd.en.md
+++ b/docs/config/layout-osd.en.md
@@ -213,6 +213,6 @@ Thus, recommended setups are:
 3. Hybrid HDD+SSD: csum_block_size=4k + inmemory_metadata=false
 4. HDD-only, faster random read: csum_block_size=32k
 5. HDD-only, faster random write: csum_block_size=4k +
-   inmemory_metadata=false + cached_io_meta=true
+   inmemory_metadata=false + meta_io=cached

-See also [cached_io_meta](osd.en.md#cached_io_meta).
+See also [meta_io](osd.en.md#meta_io).
--- a/docs/config/layout-osd.ru.md
+++ b/docs/config/layout-osd.ru.md
@@ -226,6 +226,6 @@ csum_block_size данных.
 3. Гибридные HDD+SSD: csum_block_size=4k + inmemory_metadata=false
 4. Только HDD, быстрее случайное чтение: csum_block_size=32k
 5. Только HDD, быстрее случайная запись: csum_block_size=4k +
-   inmemory_metadata=false + cached_io_meta=true
+   inmemory_metadata=false + meta_io=cached

-Смотрите также [cached_io_meta](osd.ru.md#cached_io_meta).
+Смотрите также [meta_io](osd.ru.md#meta_io).
--- a/docs/config/network.en.md
+++ b/docs/config/network.en.md
@@ -30,7 +30,6 @@ between clients, OSDs and etcd.
 - [etcd_slow_timeout](#etcd_slow_timeout)
 - [etcd_keepalive_timeout](#etcd_keepalive_timeout)
 - [etcd_ws_keepalive_timeout](#etcd_ws_keepalive_timeout)
-- [client_dirty_limit](#client_dirty_limit)

 ## tcp_header_buffer_size

@@ -240,17 +239,3 @@ etcd_report_interval to guarantee that keepalive actually works.

 etcd websocket ping interval required to keep the connection alive and
 detect disconnections quickly.
-
-## client_dirty_limit
-
-- Type: integer
-- Default: 33554432
-- Can be changed online: yes
-
-Without immediate_commit=all this parameter sets the limit of "dirty"
-(not committed by fsync) data allowed by the client before forcing an
-additional fsync and committing the data. Also note that the client always
-holds a copy of uncommitted data in memory so this setting also affects
-RAM usage of clients.
-
-This parameter doesn't affect OSDs themselves.
--- a/docs/config/network.ru.md
+++ b/docs/config/network.ru.md
@@ -30,7 +30,6 @@
 - [etcd_slow_timeout](#etcd_slow_timeout)
 - [etcd_keepalive_timeout](#etcd_keepalive_timeout)
 - [etcd_ws_keepalive_timeout](#etcd_ws_keepalive_timeout)
-- [client_dirty_limit](#client_dirty_limit)

 ## tcp_header_buffer_size

@@ -251,17 +250,3 @@ etcd_report_interval, чтобы keepalive гарантированно рабо
 - Можно менять на лету: да

 Интервал проверки живости вебсокет-подключений к etcd.
-
-## client_dirty_limit
-
-- Тип: целое число
-- Значение по умолчанию: 33554432
-- Можно менять на лету: да
-
-При работе без immediate_commit=all - это лимит объёма "грязных" (не
-зафиксированных fsync-ом) данных, при достижении которого клиент будет
-принудительно вызывать fsync и фиксировать данные. Также стоит иметь в виду,
-что в этом случае до момента fsync клиент хранит копию незафиксированных
-данных в памяти, то есть, настройка влияет на потребление памяти клиентами.
-
-Параметр не влияет на сами OSD.
--- a/docs/config/osd.en.md
+++ b/docs/config/osd.en.md
@@ -11,6 +11,7 @@ initialization and can be changed - either with an OSD restart or, for some of
 them, even without restarting by updating configuration in etcd.

 - [etcd_report_interval](#etcd_report_interval)
+- [etcd_stats_interval](#etcd_stats_interval)
 - [run_primary](#run_primary)
 - [osd_network](#osd_network)
 - [bind_address](#bind_address)
@@ -31,9 +32,9 @@ them, even without restarting by updating configuration in etcd.
 - [max_flusher_count](#max_flusher_count)
 - [inmemory_metadata](#inmemory_metadata)
 - [inmemory_journal](#inmemory_journal)
-- [cached_io_data](#cached_io_data)
-- [cached_io_meta](#cached_io_meta)
-- [cached_io_journal](#cached_io_journal)
+- [data_io](#data_io)
+- [meta_io](#meta_io)
+- [journal_io](#journal_io)
 - [journal_sector_buffer_count](#journal_sector_buffer_count)
 - [journal_no_same_sector_overwrites](#journal_no_same_sector_overwrites)
 - [throttle_small_writes](#throttle_small_writes)
@@ -56,11 +57,21 @@ them, even without restarting by updating configuration in etcd.
 - Type: seconds
 - Default: 5

-Interval at which OSDs report their state to etcd. Affects OSD lease time
+Interval at which OSDs report their liveness to etcd. Affects OSD lease time
 and thus the failover speed. Lease time is equal to this parameter value
 plus max_etcd_attempts * etcd_quick_timeout because it should be guaranteed
 that every OSD always refreshes its lease in time.

+## etcd_stats_interval
+
+- Type: seconds
+- Default: 30
+
+Interval at which OSDs report their statistics to etcd. Strongly affects the
+load imposed on etcd, because statistics include a key for every OSD and
+for every PG. At the same time, shorter statistics intervals make `vitastor-cli`
+statistics more responsive.
+
 ## run_primary

 - Type: boolean
@@ -258,47 +269,59 @@ is typically very small because it's sufficient to have 16-32 MB journal
 for SSD OSDs. However, in theory it's possible that you'll want to turn it
 off for hybrid (HDD+SSD) OSDs with large journals on quick devices.

-## cached_io_data
+## data_io

-- Type: boolean
-- Default: false
+- Type: string
+- Default: direct

-Read and write *data* through Linux page cache, i.e. use a file descriptor
-opened with O_SYNC, but without O_DIRECT for I/O. May improve read
-performance for hot data and slower disks - HDDs and maybe SATA SSDs.
-Not recommended for desktop SSDs without capacitors because O_SYNC flushes
-disk cache on every write.
+I/O mode for *data*. One of "direct", "cached" or "directsync". Corresponds
+to O_DIRECT, O_SYNC and O_DIRECT|O_SYNC, respectively.

-## cached_io_meta
+Choose "cached" to use Linux page cache. This may improve read performance
+for hot data and slower disks - HDDs and maybe SATA SSDs - but will slightly
+decrease write performance for fast disks because page cache is an overhead
+itself.

- Type: boolean
- Default: false
+Choose "directsync" to use [immediate_commit](layout-cluster.en.md#immediate_commit)
+(which requires disable_data_fsync) with drives whose write-back cache
+can't be turned off, for example, Intel Optane. Also note that *some*
+desktop SSDs (for example, HP EX950) may ignore O_SYNC, making
+disable_data_fsync unsafe even with "directsync".

-Read and write *metadata* through Linux page cache. May improve read
-performance only if your drives are relatively slow (HDD, SATA SSD), and
-only if checksums are enabled and [inmemory_metadata](#inmemory_metadata)
-is disabled, because in this case metadata blocks are read from disk
-on every read request to verify checksums and caching them may reduce this
-extra read load.
+## meta_io

-Absolutely pointless to enable with enabled inmemory_metadata because all
-metadata is kept in memory anyway, and likely pointless without checksums,
-because in that case, metadata blocks are read from disk only during journal
+- Type: string
+- Default: direct
+
+I/O mode for *metadata*. One of "direct", "cached" or "directsync".
+
+"cached" may improve read performance, but only under the following conditions:
+1. your drives are relatively slow (HDD, SATA SSD), and
+2. checksums are enabled, and
+3. [inmemory_metadata](#inmemory_metadata) is disabled.
+Under all these conditions, metadata blocks are read from disk on every
+read request to verify checksums and caching them may reduce this extra
+read load. Without (3) metadata is never read from the disk after starting,
+and without (2) metadata blocks are read from disk only during journal
 flushing.

-If the same device is used for data and metadata, enabling [cached_io_data](#cached_io_data)
-also enables this parameter, given that it isn't turned off explicitly.
+"directsync" has the same meaning as in [data_io](#data_io).

-## cached_io_journal
+If the same device is used for data and metadata, meta_io by default is set
+to the same value as [data_io](#data_io).

- Type: boolean
- Default: false
+## journal_io

-Read and write *journal* through Linux page cache. May improve read
-performance if [inmemory_journal](#inmemory_journal) is turned off.
+- Type: string
+- Default: direct

-If the same device is used for metadata and journal, enabling [cached_io_meta](#cached_io_meta)
-also enables this parameter, given that it isn't turned off explicitly.
+I/O mode for *journal*. One of "direct", "cached" or "directsync".
+
+Here, "cached" may only improve read performance for recent writes and
+only if [inmemory_journal](#inmemory_journal) is turned off.
+
+If the same device is used for metadata and journal, journal_io by default
+is set to the same value as [meta_io](#meta_io).

 ## journal_sector_buffer_count

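The default cascade described above (meta_io follows data_io, journal_io follows meta_io when devices are shared) can be sketched roughly as follows. This is only an illustration: `resolve_io_modes` and its two boolean arguments are hypothetical names, not part of Vitastor, and the flag table simply restates the "O_DIRECT, O_SYNC and O_DIRECT|O_SYNC" mapping from the text.

```python
import os

# getattr() keeps the sketch importable on platforms lacking these flags.
O_DIRECT = getattr(os, "O_DIRECT", 0)
O_SYNC = getattr(os, "O_SYNC", 0)

# Mode name -> open() flags, as stated in the data_io description.
IO_MODE_FLAGS = {
    "direct": O_DIRECT,
    "cached": O_SYNC,
    "directsync": O_DIRECT | O_SYNC,
}


def resolve_io_modes(cfg, same_data_meta_dev, same_meta_journal_dev):
    """Hypothetical helper: fill in meta_io/journal_io defaults."""
    data_io = cfg.get("data_io", "direct")
    # meta_io inherits data_io only when both share one device.
    meta_default = data_io if same_data_meta_dev else "direct"
    meta_io = cfg.get("meta_io", meta_default)
    # journal_io inherits meta_io only when both share one device.
    journal_default = meta_io if same_meta_journal_dev else "direct"
    journal_io = cfg.get("journal_io", journal_default)
    modes = {"data_io": data_io, "meta_io": meta_io, "journal_io": journal_io}
    for name, mode in modes.items():
        if mode not in IO_MODE_FLAGS:
            raise ValueError(f"{name}: unknown I/O mode {mode!r}")
    return modes
```

An explicitly set meta_io or journal_io always wins over the inherited default in this sketch.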
--- a/docs/config/osd.ru.md
+++ b/docs/config/osd.ru.md
@@ -12,6 +12,7 @@
 изменения конфигурации в etcd.

 - [etcd_report_interval](#etcd_report_interval)
+- [etcd_stats_interval](#etcd_stats_interval)
 - [run_primary](#run_primary)
 - [osd_network](#osd_network)
 - [bind_address](#bind_address)
@@ -32,9 +33,9 @@
 - [max_flusher_count](#max_flusher_count)
 - [inmemory_metadata](#inmemory_metadata)
 - [inmemory_journal](#inmemory_journal)
- [cached_io_data](#cached_io_data)
- [cached_io_meta](#cached_io_meta)
- [cached_io_journal](#cached_io_journal)
+- [data_io](#data_io)
+- [meta_io](#meta_io)
+- [journal_io](#journal_io)
 - [journal_sector_buffer_count](#journal_sector_buffer_count)
 - [journal_no_same_sector_overwrites](#journal_no_same_sector_overwrites)
 - [throttle_small_writes](#throttle_small_writes)
@@ -57,11 +58,21 @@
 - Тип: секунды
 - Значение по умолчанию: 5

-Интервал, с которым OSD обновляет своё состояние в etcd. Значение параметра
-влияет на время резервации (lease) OSD и поэтому на скорость переключения
+Интервал, с которым OSD сообщает о том, что жив, в etcd. Значение параметра
+влияет на время резервации (lease) OSD и поэтому - на скорость переключения
 при падении OSD. Время lease равняется значению этого параметра плюс
 max_etcd_attempts * etcd_quick_timeout.

+## etcd_stats_interval
+
+- Тип: секунды
+- Значение по умолчанию: 30
+
+Интервал, с которым OSD обновляет свою статистику в etcd. Сильно влияет на
+создаваемую нагрузку на etcd, потому что статистика содержит по ключу на
+каждый OSD и на каждую PG. В то же время низкий интервал делает
+статистику, печатаемую `vitastor-cli`, отзывчивей.
+
 ## run_primary

 - Тип: булево (да/нет)
@@ -266,51 +277,62 @@ Flusher - это микро-поток (корутина), которая коп
 параметра может оказаться полезным для гибридных OSD (HDD+SSD) с большими
 журналами, расположенными на быстром по сравнению с HDD устройстве.

-## cached_io_data
+## data_io

- Тип: булево (да/нет)
- Значение по умолчанию: false
+- Тип: строка
+- Значение по умолчанию: direct

-Читать и записывать *данные* через системный кэш Linux (page cache), то есть,
-использовать для данных файловый дескриптор, открытый без флага O_DIRECT, но
-с флагом O_SYNC. Может улучшить скорость чтения для относительно медленных
-дисков - HDD и, возможно, SATA SSD. Не рекомендуется для потребительских
-SSD без конденсаторов, так как O_SYNC сбрасывает кэш диска при каждой записи.
+Режим ввода-вывода для *данных*. Одно из значений "direct", "cached" или
+"directsync", означающих O_DIRECT, O_SYNC и O_DIRECT|O_SYNC, соответственно.

-## cached_io_meta
+Выберите "cached", чтобы использовать системный кэш Linux (page cache) при
+чтении и записи. Это может улучшить скорость чтения горячих данных с
+относительно медленных дисков - HDD и, возможно, SATA SSD - но немного
+снижает производительность записи для быстрых дисков, так как кэш сам по
+себе тоже добавляет накладные расходы.

- Тип: булево (да/нет)
- Значение по умолчанию: false
+Выберите "directsync", если хотите задействовать
+[immediate_commit](layout-cluster.ru.md#immediate_commit) (требующий
+включения disable_data_fsync) на дисках с неотключаемым кэшем. Пример таких
+дисков - Intel Optane. При этом также стоит иметь в виду, что *некоторые*
+настольные SSD (например, HP EX950) игнорируют флаг O_SYNC, делая отключение
+fsync небезопасным даже с режимом "directsync".

-Читать и записывать *метаданные* через системный кэш Linux. Может улучшить
-скорость чтения, если у вас медленные диски, и только если контрольные суммы
-включены, а параметр [inmemory_metadata](#inmemory_metadata) отключён, так
-как в этом случае блоки метаданных читаются с диска при каждом запросе чтения
+## meta_io
+
+- Тип: строка
+- Значение по умолчанию: direct
+
+Режим ввода-вывода для *метаданных*. Одно из значений "direct", "cached" или
+"directsync".
+
+"cached" может улучшить скорость чтения, если:
+1. у вас медленные диски (HDD, SATA SSD)
+2. контрольные суммы включены
+3. параметр [inmemory_metadata](#inmemory_metadata) отключён.
+При этих условиях блоки метаданных читаются с диска при каждом запросе чтения
 для проверки контрольных сумм и их кэширование может снизить дополнительную
-нагрузку на диск.
+нагрузку на диск. Без (3) метаданные никогда не читаются с диска после
+запуска OSD, а без (2) блоки метаданных читаются только при сбросе журнала.

-Абсолютно бессмысленно включать данный параметр, если параметр
-inmemory_metadata включён (по умолчанию это так), и также вероятно
-бессмысленно включать его, если не включены контрольные суммы, так как в
-этом случае блоки метаданных читаются с диска только во время сброса
-журнала.
+Если одно и то же устройство используется для данных и метаданных, режим
+ввода-вывода метаданных по умолчанию устанавливается равным [data_io](#data_io).

-Если одно и то же устройство используется для данных и метаданных, включение
-[cached_io_data](#cached_io_data) также включает данный параметр, при
-условии, что он не отключён явным образом.
+## journal_io

-## cached_io_journal
+- Тип: строка
+- Значение по умолчанию: direct

- Тип: булево (да/нет)
- Значение по умолчанию: false
+Режим ввода-вывода для *журнала*. Одно из значений "direct", "cached" или
+"directsync".

-Читать и записывать *журнал* через системный кэш Linux. Может улучшить
-скорость чтения, если параметр [inmemory_journal](#inmemory_journal)
+Здесь "cached" может улучшить скорость чтения только недавно записанных
+данных и только если параметр [inmemory_journal](#inmemory_journal)
 отключён.

 Если одно и то же устройство используется для метаданных и журнала,
-включение [cached_io_meta](#cached_io_meta) также включает данный
-параметр, при условии, что он не отключён явным образом.
+режим ввода-вывода журнала по умолчанию устанавливается равным
+[meta_io](#meta_io).

 ## journal_sector_buffer_count

--- a/docs/config/pool.en.md
+++ b/docs/config/pool.en.md
@@ -205,9 +205,8 @@ This parameter usually doesn't require to be changed.
 - Default: 131072

 Block size for this pool. The value from /vitastor/config/global is used when
-unspecified. If your cluster has OSDs with different block sizes then pool must
-be restricted by [osd_tags](#osd_tags) to only include OSDs with matching block
-size.
+unspecified. Only OSDs with matching block_size are used for each pool. If you
+want to further restrict OSDs for the pool, use [osd_tags](#osd_tags).

 Read more about this parameter in [Cluster-Wide Disk Layout Parameters](layout-cluster.en.md#block_size).

@@ -216,10 +215,9 @@ Read more about this parameter in [Cluster-Wide Disk Layout Parameters](layout-c
 - Type: integer
 - Default: 4096

-"Sector" size of virtual disks in this pool. The value from
-/vitastor/config/global is used when unspecified. Similar to block_size, the
-pool must be restricted by [osd_tags](#osd_tags) to only include OSDs with
-matching bitmap_granularity.
+"Sector" size of virtual disks in this pool. The value from /vitastor/config/global
+is used when unspecified. Similarly to block_size, only OSDs with matching
+bitmap_granularity are used for each pool.

 Read more about this parameter in [Cluster-Wide Disk Layout Parameters](layout-cluster.en.md#bitmap_granularity).

@@ -229,10 +227,11 @@ Read more about this parameter in [Cluster-Wide Disk Layout Parameters](layout-c
 - Default: none

 Immediate commit setting for this pool. The value from /vitastor/config/global
-is used when unspecified. Similar to block_size, the pool must be restricted by
-[osd_tags](#osd_tags) to only include OSDs with compatible immediate_commit.
-Compatible means that a pool with non-immediate commit will work with OSDs with
-immediate commit enabled, but not vice versa.
+is used when unspecified. Similarly to block_size, only OSDs with compatible
+immediate_commit are used for each pool. "Compatible" means that a pool with
+non-immediate commit will use OSDs with immediate commit enabled, but not vice
+versa. I.e., pools with "none" use all OSDs, pools with "small" only use OSDs
+with "all" or "small", and pools with "all" only use OSDs with "all".

 Read more about this parameter in [Cluster-Wide Disk Layout Parameters](layout-cluster.en.md#immediate_commit).

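The layout filtering rules above can be sketched as a small predicate. This is a minimal illustration under assumptions: the dictionary shapes and the names `osd_matches_pool` and `usable_osds` are hypothetical and do not reflect the monitor's actual code; only the matching rules come from the text.

```python
# Which OSD immediate_commit values each pool setting accepts,
# as listed in the immediate_commit description.
COMPATIBLE_COMMIT = {
    "none": {"none", "small", "all"},
    "small": {"small", "all"},
    "all": {"all"},
}


def osd_matches_pool(pool, osd):
    """True if the OSD layout is usable by the pool."""
    return (
        osd["block_size"] == pool["block_size"]
        and osd["bitmap_granularity"] == pool["bitmap_granularity"]
        and osd["immediate_commit"] in COMPATIBLE_COMMIT[pool["immediate_commit"]]
    )


def usable_osds(pool, osds):
    """Return the sorted OSD numbers whose layout matches the pool."""
    return [num for num, osd in sorted(osds.items()) if osd_matches_pool(pool, osd)]


# Example data (made up for illustration).
pool = {"block_size": 131072, "bitmap_granularity": 4096, "immediate_commit": "small"}
osds = {
    1: {"block_size": 131072, "bitmap_granularity": 4096, "immediate_commit": "all"},
    2: {"block_size": 131072, "bitmap_granularity": 4096, "immediate_commit": "none"},
    3: {"block_size": 65536, "bitmap_granularity": 4096, "immediate_commit": "all"},
}
```

With this example data only OSD 1 qualifies: OSD 2 has an incompatible immediate_commit and OSD 3 has a different block_size.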
--- a/docs/config/pool.ru.md
+++ b/docs/config/pool.ru.md
@@ -208,8 +208,9 @@ PG в Vitastor эферемерны, то есть вы можете менят

 Размер блока для данного пула. Если не задан, используется значение из
 /vitastor/config/global. Если в вашем кластере есть OSD с разными размерами
-блока, пул должен быть ограничен только OSD, блок которых равен блоку пула,
-с помощью [osd_tags](#osd_tags).
+блока, пул будет использовать только OSD с размером блока, равным размеру блока
+пула. Если вы хотите сильнее ограничить набор используемых для пула OSD -
+используйте [osd_tags](#osd_tags).

 О самом параметре читайте в разделе [Дисковые параметры уровня кластера](layout-cluster.ru.md#block_size).

@@ -219,9 +220,8 @@ PG в Vitastor эферемерны, то есть вы можете менят
 - По умолчанию: 4096

 Размер "сектора" виртуальных дисков в данном пуле. Если не задан, используется
-значение из /vitastor/config/global. Аналогично block_size, пул должен быть
-ограничен OSD со значением bitmap_granularity, равным значению пула, с помощью
-[osd_tags](#osd_tags).
+значение из /vitastor/config/global. Аналогично block_size, каждый пул будет
+использовать только OSD с совпадающей с пулом настройкой bitmap_granularity.

 О самом параметре читайте в разделе [Дисковые параметры уровня кластера](layout-cluster.ru.md#bitmap_granularity).

@@ -231,11 +231,13 @@ PG в Vitastor эферемерны, то есть вы можете менят
 - По умолчанию: none

 Настройка мгновенного коммита для данного пула. Если не задана, используется
-значение из /vitastor/config/global. Аналогично block_size, пул должен быть
-ограничен OSD со значением bitmap_granularity, совместимым со значением пула, с
-помощью [osd_tags](#osd_tags). Совместимость означает, что пул с отключенным
-мгновенным коммитом может работать на OSD с включённым мгновенным коммитом, но
-не наоборот.
+значение из /vitastor/config/global. Аналогично block_size, каждый пул будет
+использовать только OSD с *совместимыми* настройками immediate_commit.
+"Совместимыми" означает, что пул с отключенным мгновенным коммитом будет
+использовать OSD с включённым мгновенным коммитом, но не наоборот. То есть,
+пул со значением "none" будет использовать все OSD, пул со "small" будет
+использовать OSD с "all" или "small", а пул с "all" будет использовать только
+OSD с "all".

 О самом параметре читайте в разделе [Дисковые параметры уровня кластера](layout-cluster.ru.md#immediate_commit).

--- a/docs/config/src/client.en.md
+++ b/docs/config/src/client.en.md
@@ -0,0 +1,4 @@
+# Client Parameters
+
+These parameters apply only to clients and affect their interaction with
+the cluster.
--- a/docs/config/src/client.ru.md
+++ b/docs/config/src/client.ru.md
@@ -0,0 +1,4 @@
+# Параметры клиентского кода
+
+Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD) и
+затрагивают логику их работы с кластером.
--- a/docs/config/src/client.yml
+++ b/docs/config/src/client.yml
@@ -0,0 +1,124 @@
+- name: client_max_dirty_bytes
+  type: int
+  default: 33554432
+  online: true
+  info: |
+    Without [immediate_commit](layout-cluster.en.md#immediate_commit)=all this parameter sets the limit of "dirty"
+    (not committed by fsync) data allowed by the client before forcing an
+    additional fsync and committing the data. Also note that the client always
+    holds a copy of uncommitted data in memory so this setting also affects
+    RAM usage of clients.
+  info_ru: |
+    При работе без [immediate_commit](layout-cluster.ru.md#immediate_commit)=all - это лимит объёма "грязных" (не
+    зафиксированных fsync-ом) данных, при достижении которого клиент будет
+    принудительно вызывать fsync и фиксировать данные. Также стоит иметь в виду,
+    что в этом случае до момента fsync клиент хранит копию незафиксированных
+    данных в памяти, то есть, настройка влияет на потребление памяти клиентами.
+- name: client_max_dirty_ops
+  type: int
+  default: 1024
+  online: true
+  info: |
+    Same as client_max_dirty_bytes, but instead of total size, limits the number
+    of uncommitted write operations.
+  info_ru: |
+    Аналогично client_max_dirty_bytes, но ограничивает количество
+    незафиксированных операций записи вместо их общего объёма.
+- name: client_enable_writeback
+  type: bool
+  default: false
+  online: true
+  info: |
+    This parameter enables client-side write buffering. This means that write
+    requests are accumulated in memory for a short time before being sent to
+    a Vitastor cluster, which allows sending them in parallel and increases
+    performance of some applications. Writes are buffered until the client forces
+    a flush with fsync() or until the amount of buffered writes exceeds the
+    limit.
+
+    Write buffering significantly increases performance of some applications,
+    for example, CrystalDiskMark under Windows (LOL :-D), but also any other
+    applications if they do writes in one of two non-optimal ways: either if
+    they do a lot of small (4 KB or so) sequential writes, or if they do a lot
+    of small random writes, but without any parallelism or asynchrony, and also
+    without calling fsync().
+
+    With write buffering enabled, you can expect around 22000 T1Q1 random write
+    iops in QEMU more or less regardless of the quality of your SSDs, and this
+    number is in fact bound by QEMU itself rather than Vitastor (check it
+    yourself by adding a "driver=null-co" disk in QEMU). Without write
+    buffering, the current record is 9900 iops, but the number is usually
+    even lower with non-ideal hardware, for example, it may be 5000 iops.
+
+    Even when this parameter is enabled, write buffering isn't enabled until
+    the client explicitly allows it, because enabling it without the client
+    being aware of the fact that its writes may be buffered may lead to data
+    loss. Because of this, older versions of clients don't support write
+    buffering at all, newer versions of the QEMU driver allow write buffering
+    only if it's enabled in disk settings with `-blockdev cache.direct=false`,
+    and newer versions of FIO only allow write buffering if you don't specify
+    `-direct=1`. NBD and NFS drivers allow write buffering by default.
+
+    You can also override this restriction with the `client_writeback_allowed`
+    parameter, but you shouldn't do that unless you **really** know what you
+    are doing.
+  info_ru: |
+    Данный параметр разрешает включать буферизацию записи в памяти. Буферизация
+    означает, что операции записи отправляются на кластер Vitastor не сразу, а
+    могут небольшое время накапливаться в памяти и сбрасываться сразу пакетами,
+    до тех пор, пока либо не будет превышен лимит неотправленных записей, либо
+    пока клиент не вызовет fsync.
+
+    Буферизация значительно повышает производительность некоторых приложений,
+    например, CrystalDiskMark в Windows (ха-ха :-D), но также и любых других,
+    которые пишут на диск неоптимально: либо последовательно, но мелкими блоками
+    (например, по 4 кб), либо случайно, но без параллелизма и без fsync - то
+    есть, например, отправляя 128 операций записи в разные места диска, но не
+    все сразу с помощью асинхронного I/O, а по одной.
+
+    В QEMU с буферизацией записи можно ожидать показателя примерно 22000
+    операций случайной записи в секунду в 1 поток и с глубиной очереди 1 (T1Q1)
+    без fsync, почти вне зависимости от того, насколько хороши ваши диски - эта
+    цифра упирается в сам QEMU. Без буферизации рекорд пока что - 9900 операций
+    в секунду, но на железе похуже может быть и поменьше, например, 5000 операций
+    в секунду.
+
+    При этом, даже если данный параметр включён, буферизация не включается, если
+    явно не разрешена клиентом, т.к. если клиент не знает, что запросы записи
+    буферизуются, это может приводить к потере данных. Поэтому в старых версиях
+    клиентских драйверов буферизация записи не включается вообще, в новых
+    версиях QEMU-драйвера включается только если разрешена опцией диска
+    `-blockdev cache.direct=false`, а в fio - только если не указана опция `-direct=1`.
+    В NBD и NFS драйверах буферизация записи разрешена по умолчанию.
+
+    Можно обойти и это ограничение с помощью параметра `client_writeback_allowed`,
+    но делать так не надо, если только вы не уверены в том, что делаете, на все
+    100%. :-)
+- name: client_max_buffered_bytes
+  type: int
+  default: 33554432
+  online: true
+  info: |
+    Maximum total size of buffered writes which triggers write-back when reached.
+  info_ru: |
+    Максимальный общий размер буферизованных записей, при достижении которого
+    начинается процесс сброса данных на сервер.
+- name: client_max_buffered_ops
+  type: int
+  default: 1024
+  online: true
+  info: |
+    Maximum number of buffered writes which triggers write-back when reached.
+    Multiple consecutive modified data regions are counted as 1 write here.
+  info_ru: |
+    Максимальное количество буферизованных записей, при достижении которого
+    начинается процесс сброса данных на сервер. При этом несколько
+    последовательных изменённых областей здесь считаются 1 записью.
+- name: client_max_writeback_iodepth
+  type: int
+  default: 256
+  online: true
+  info: |
+    Maximum number of parallel writes when flushing buffered data to the server.
+  info_ru: |
+    Максимальное число параллельных операций записи при сбросе буферов на сервер.
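The interplay of client_max_buffered_bytes and client_max_buffered_ops can be illustrated with a rough sketch. It assumes that "consecutive modified data regions" means regions that touch or overlap, and that "when reached" means greater than or equal; the function names are hypothetical and the real client logic may differ.

```python
def merge_regions(regions):
    """Merge touching or overlapping (offset, length) regions."""
    merged = []
    for offset, length in sorted(regions):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            # Extend the previous region instead of counting a new write.
            last_off, last_len = merged[-1]
            end = max(last_off + last_len, offset + length)
            merged[-1] = (last_off, end - last_off)
        else:
            merged.append((offset, length))
    return merged


def should_flush(regions, max_bytes=33554432, max_ops=1024):
    """Assumed trigger: either limit reached starts write-back."""
    merged = merge_regions(regions)
    total = sum(length for _, length in merged)
    return total >= max_bytes or len(merged) >= max_ops
```

Under these assumptions, two adjacent 4 KB writes count as one buffered operation, while two writes with a gap between them count as two.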
--- a/docs/config/src/included.en.md
+++ b/docs/config/src/included.en.md
@@ -28,6 +28,8 @@

 {{../../config/network.en.md|indent=2}}

+{{../../config/client.en.md|indent=2}}
+
 {{../../config/layout-cluster.en.md|indent=2}}

 {{../../config/layout-osd.en.md|indent=2}}
--- a/docs/config/src/included.ru.md
+++ b/docs/config/src/included.ru.md
@@ -28,6 +28,8 @@

 {{../../config/network.ru.md|indent=2}}

+{{../../config/client.ru.md|indent=2}}
+
 {{../../config/layout-cluster.ru.md|indent=2}}

 {{../../config/layout-osd.ru.md|indent=2}}
--- a/docs/config/src/layout-cluster.yml
+++ b/docs/config/src/layout-cluster.yml
@@ -87,8 +87,9 @@
    it (they have internal SSD cache even though it's not stated in datasheets).

    Setting this parameter to "all" or "small" in OSD parameters requires enabling
-    disable_journal_fsync and disable_meta_fsync, setting it to "all" also requires
-    enabling disable_data_fsync.
+    [disable_journal_fsync](layout-osd.en.md#disable_journal_fsync) and
+    [disable_meta_fsync](layout-osd.en.md#disable_meta_fsync), setting it to
+    "all" also requires enabling [disable_data_fsync](layout-osd.en.md#disable_data_fsync).

    TLDR: For optimal performance, set immediate_commit to "all" if you only use
    SSDs with supercapacitor-based power loss protection (nonvolatile
@@ -140,8 +141,9 @@
    указано в спецификациях).

    Указание "all" или "small" в настройках / командной строке OSD требует
-    включения disable_journal_fsync и disable_meta_fsync, значение "all" также
-    требует включения disable_data_fsync.
+    включения [disable_journal_fsync](layout-osd.ru.md#disable_journal_fsync) и
+    [disable_meta_fsync](layout-osd.ru.md#disable_meta_fsync), значение "all"
+    также требует включения [disable_data_fsync](layout-osd.ru.md#disable_data_fsync).

    Итого, вкратце: для оптимальной производительности установите
    immediate_commit в значение "all", если вы используете в кластере только SSD
--- a/docs/config/src/layout-osd.yml
+++ b/docs/config/src/layout-osd.yml
@@ -244,9 +244,9 @@
    3. Hybrid HDD+SSD: csum_block_size=4k + inmemory_metadata=false
    4. HDD-only, faster random read: csum_block_size=32k
    5. HDD-only, faster random write: csum_block_size=4k +
-       inmemory_metadata=false + cached_io_meta=true
+       inmemory_metadata=false + meta_io=cached

-    See also [cached_io_meta](osd.en.md#cached_io_meta).
+    See also [meta_io](osd.en.md#meta_io).
  info_ru: |
    Размер блока расчёта контрольных сумм.

@@ -271,6 +271,6 @@
    3. Гибридные HDD+SSD: csum_block_size=4k + inmemory_metadata=false
    4. Только HDD, быстрее случайное чтение: csum_block_size=32k
    5. Только HDD, быстрее случайная запись: csum_block_size=4k +
-       inmemory_metadata=false + cached_io_meta=true
+       inmemory_metadata=false + meta_io=cached

-    Смотрите также [cached_io_meta](osd.ru.md#cached_io_meta).
+    Смотрите также [meta_io](osd.ru.md#meta_io).
--- a/docs/config/src/network.yml
+++ b/docs/config/src/network.yml
@@ -259,23 +259,3 @@
    detect disconnections quickly.
  info_ru: |
    Интервал проверки живости вебсокет-подключений к etcd.
- name: client_dirty_limit
-  type: int
-  default: 33554432
-  online: true
-  info: |
-    Without immediate_commit=all this parameter sets the limit of "dirty"
-    (not committed by fsync) data allowed by the client before forcing an
-    additional fsync and committing the data. Also note that the client always
-    holds a copy of uncommitted data in memory so this setting also affects
-    RAM usage of clients.
-
-    This parameter doesn't affect OSDs themselves.
-  info_ru: |
-    При работе без immediate_commit=all - это лимит объёма "грязных" (не
-    зафиксированных fsync-ом) данных, при достижении которого клиент будет
-    принудительно вызывать fsync и фиксировать данные. Также стоит иметь в виду,
-    что в этом случае до момента fsync клиент хранит копию незафиксированных
-    данных в памяти, то есть, настройка влияет на потребление памяти клиентами.
-
-    Параметр не влияет на сами OSD.
--- a/docs/config/src/osd.yml
+++ b/docs/config/src/osd.yml
@@ -2,15 +2,28 @@
  type: sec
  default: 5
  info: |
-    Interval at which OSDs report their state to etcd. Affects OSD lease time
+    Interval at which OSDs report their liveness to etcd. Affects OSD lease time
    and thus the failover speed. Lease time is equal to this parameter value
    plus max_etcd_attempts * etcd_quick_timeout because it should be guaranteed
    that every OSD always refreshes its lease in time.
  info_ru: |
-    Интервал, с которым OSD обновляет своё состояние в etcd. Значение параметра
-    влияет на время резервации (lease) OSD и поэтому на скорость переключения
+    Интервал, с которым OSD сообщает о том, что жив, в etcd. Значение параметра
+    влияет на время резервации (lease) OSD и поэтому - на скорость переключения
    при падении OSD. Время lease равняется значению этого параметра плюс
    max_etcd_attempts * etcd_quick_timeout.
+- name: etcd_stats_interval
+  type: sec
+  default: 30
+  info: |
+    Interval at which OSDs report their statistics to etcd. Strongly affects the
+    load imposed on etcd, because statistics include a key for every OSD and
+    for every PG. At the same time, shorter intervals make `vitastor-cli`
+    statistics more responsive.
+  info_ru: |
+    Интервал, с которым OSD обновляет свою статистику в etcd. Сильно влияет на
+    создаваемую нагрузку на etcd, потому что статистика содержит по ключу на
+    каждый OSD и на каждую PG. В то же время низкий интервал делает
+    статистику, печатаемую `vitastor-cli`, отзывчивей.
 - name: run_primary
  type: bool
  default: true
@@ -260,73 +273,96 @@
    достаточно 16- или 32-мегабайтного журнала. Однако в теории отключение
    параметра может оказаться полезным для гибридных OSD (HDD+SSD) с большими
    журналами, расположенными на быстром по сравнению с HDD устройстве.
- name: cached_io_data
-  type: bool
-  default: false
+- name: data_io
+  type: string
+  default: direct
  info: |
-    Read and write *data* through Linux page cache, i.e. use a file descriptor
-    opened with O_SYNC, but without O_DIRECT for I/O. May improve read
-    performance for hot data and slower disks - HDDs and maybe SATA SSDs.
-    Not recommended for desktop SSDs without capacitors because O_SYNC flushes
-    disk cache on every write.
-  info_ru: |
-    Читать и записывать *данные* через системный кэш Linux (page cache), то есть,
-    использовать для данных файловый дескриптор, открытый без флага O_DIRECT, но
-    с флагом O_SYNC. Может улучшить скорость чтения для относительно медленных
-    дисков - HDD и, возможно, SATA SSD. Не рекомендуется для потребительских
-    SSD без конденсаторов, так как O_SYNC сбрасывает кэш диска при каждой записи.
- name: cached_io_meta
-  type: bool
-  default: false
-  info: |
-    Read and write *metadata* through Linux page cache. May improve read
-    performance only if your drives are relatively slow (HDD, SATA SSD), and
-    only if checksums are enabled and [inmemory_metadata](#inmemory_metadata)
-    is disabled, because in this case metadata blocks are read from disk
-    on every read request to verify checksums and caching them may reduce this
-    extra read load.
+    I/O mode for *data*. One of "direct", "cached" or "directsync". Corresponds
+    to O_DIRECT, O_SYNC and O_DIRECT|O_SYNC, respectively.

-    Absolutely pointless to enable with enabled inmemory_metadata because all
-    metadata is kept in memory anyway, and likely pointless without checksums,
-    because in that case, metadata blocks are read from disk only during journal
+    Choose "cached" to use Linux page cache. This may improve read performance
+    for hot data and slower disks - HDDs and maybe SATA SSDs - but will slightly
+    decrease write performance for fast disks because page cache is an overhead
+    itself.
+
+    Choose "directsync" to use [immediate_commit](layout-cluster.en.md#immediate_commit)
+    (which requires disable_data_fsync) with drives whose write-back cache
+    can't be turned off, for example, Intel Optane. Also note that *some*
+    desktop SSDs (for example, HP EX950) may ignore O_SYNC, making
+    disable_data_fsync unsafe even with "directsync".
+  info_ru: |
+    Режим ввода-вывода для *данных*. Одно из значений "direct", "cached" или
+    "directsync", означающих O_DIRECT, O_SYNC и O_DIRECT|O_SYNC, соответственно.
+
+    Выберите "cached", чтобы использовать системный кэш Linux (page cache) при
+    чтении и записи. Это может улучшить скорость чтения горячих данных с
+    относительно медленных дисков - HDD и, возможно, SATA SSD - но немного
+    снижает производительность записи для быстрых дисков, так как кэш сам по
+    себе тоже добавляет накладные расходы.
+
+    Выберите "directsync", если хотите задействовать
+    [immediate_commit](layout-cluster.ru.md#immediate_commit) (требующий
+    включения disable_data_fsync) на дисках с неотключаемым кэшем. Пример таких
+    дисков - Intel Optane. При этом также стоит иметь в виду, что *некоторые*
+    настольные SSD (например, HP EX950) игнорируют флаг O_SYNC, делая отключение
+    fsync небезопасным даже с режимом "directsync".
+- name: meta_io
+  type: string
+  default: direct
+  info: |
+    I/O mode for *metadata*. One of "direct", "cached" or "directsync".
+
+    "cached" may improve read performance, but only under the following conditions:
+    1. your drives are relatively slow (HDD, SATA SSD), and
+    2. checksums are enabled, and
+    3. [inmemory_metadata](#inmemory_metadata) is disabled.
+    Under all these conditions, metadata blocks are read from disk on every
+    read request to verify checksums and caching them may reduce this extra
+    read load. Without (3) metadata is never read from the disk after starting,
+    and without (2) metadata blocks are read from disk only during journal
    flushing.

-    If the same device is used for data and metadata, enabling [cached_io_data](#cached_io_data)
-    also enables this parameter, given that it isn't turned off explicitly.
+    "directsync" has the same meaning as in [data_io](#data_io).
+
+    If the same device is used for data and metadata, meta_io by default is set
+    to the same value as [data_io](#data_io).
  info_ru: |
-    Читать и записывать *метаданные* через системный кэш Linux. Может улучшить
-    скорость чтения, если у вас медленные диски, и только если контрольные суммы
-    включены, а параметр [inmemory_metadata](#inmemory_metadata) отключён, так
-    как в этом случае блоки метаданных читаются с диска при каждом запросе чтения
+    Режим ввода-вывода для *метаданных*. Одно из значений "direct", "cached" или
+    "directsync".
+
+    "cached" может улучшить скорость чтения, если:
+    1. у вас медленные диски (HDD, SATA SSD)
+    2. контрольные суммы включены
+    3. параметр [inmemory_metadata](#inmemory_metadata) отключён.
+    При этих условиях блоки метаданных читаются с диска при каждом запросе чтения
    для проверки контрольных сумм и их кэширование может снизить дополнительную
-    нагрузку на диск.
+    нагрузку на диск. Без (3) метаданные никогда не читаются с диска после
+    запуска OSD, а без (2) блоки метаданных читаются только при сбросе журнала.

-    Абсолютно бессмысленно включать данный параметр, если параметр
-    inmemory_metadata включён (по умолчанию это так), и также вероятно
-    бессмысленно включать его, если не включены контрольные суммы, так как в
-    этом случае блоки метаданных читаются с диска только во время сброса
-    журнала.
-
-    Если одно и то же устройство используется для данных и метаданных, включение
-    [cached_io_data](#cached_io_data) также включает данный параметр, при
-    условии, что он не отключён явным образом.
- name: cached_io_journal
-  type: bool
-  default: false
+    Если одно и то же устройство используется для данных и метаданных, режим
+    ввода-вывода метаданных по умолчанию устанавливается равным [data_io](#data_io).
+- name: journal_io
+  type: string
+  default: direct
  info: |
-    Read and write *journal* through Linux page cache. May improve read
-    performance if [inmemory_journal](#inmemory_journal) is turned off.
+    I/O mode for *journal*. One of "direct", "cached" or "directsync".

-    If the same device is used for metadata and journal, enabling [cached_io_meta](#cached_io_meta)
-    also enables this parameter, given that it isn't turned off explicitly.
+    Here, "cached" may only improve read performance for recent writes and
+    only if [inmemory_journal](#inmemory_journal) is turned off.
+
+    If the same device is used for metadata and journal, journal_io by default
+    is set to the same value as [meta_io](#meta_io).
  info_ru: |
-    Читать и записывать *журнал* через системный кэш Linux. Может улучшить
-    скорость чтения, если параметр [inmemory_journal](#inmemory_journal)
+    Режим ввода-вывода для *журнала*. Одно из значений "direct", "cached" или
+    "directsync".
+
+    Здесь "cached" может улучшить скорость чтения только недавно записанных
+    данных и только если параметр [inmemory_journal](#inmemory_journal)
    отключён.

    Если одно и то же устройство используется для метаданных и журнала,
-    включение [cached_io_meta](#cached_io_meta) также включает данный
-    параметр, при условии, что он не отключён явным образом.
+    режим ввода-вывода журнала по умолчанию устанавливается равным
+    [meta_io](#meta_io).
 - name: journal_sector_buffer_count
  type: int
  default: 32
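Note on the defaults documented above: when at least one of data_io, meta_io or journal_io is set, an unset mode is inherited down the chain (data, then metadata, then journal) only if the device is shared, and falls back to "direct" otherwise. The sketch below restates that rule in isolation. `io_modes_t` and `resolve_io_modes` are hypothetical names used only for illustration; the real logic lives in `blockstore_disk_t::parse_config` later in this diff, which also keeps a legacy path for the old cached_io_* options that is not modelled here.

```cpp
#include <map>
#include <string>

// Hypothetical holder for the three resolved modes (not a Vitastor type)
struct io_modes_t
{
    std::string data_io, meta_io, journal_io;
};

// Resolve I/O modes for the case where at least one of the new options is set.
// An empty device name is treated as "same device as the previous level".
io_modes_t resolve_io_modes(const std::map<std::string, std::string> & config,
    const std::string & data_device, const std::string & meta_device,
    const std::string & journal_device)
{
    // Return the configured value, or the given default if the key is absent
    auto get = [&](const char *key, const std::string & def)
    {
        auto it = config.find(key);
        return it != config.end() ? it->second : def;
    };
    bool meta_shared = meta_device == data_device || meta_device.empty();
    bool journal_shared = journal_device == meta_device || journal_device.empty();
    io_modes_t m;
    m.data_io = get("data_io", "direct");
    // Metadata inherits the data mode only when it shares the data device
    m.meta_io = get("meta_io", meta_shared ? m.data_io : "direct");
    // Journal inherits the metadata mode only when it shares the metadata device
    m.journal_io = get("journal_io", journal_shared ? m.meta_io : "direct");
    return m;
}
```

For example, with only `data_io=directsync` set, metadata on the data device and the journal on a separate device, metadata would use "directsync" while the journal would stay at "direct".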
--- a/docs/intro/features.en.md
+++ b/docs/intro/features.en.md
@@ -31,6 +31,7 @@
 - [RDMA/RoCEv2 support via libibverbs](../config/network.en.md#rdma_device)
 - [Scrubbing](../config/osd.en.md#auto_scrub) (verification of copies)
 - [Checksums](../config/layout-osd.en.md#data_csum_type)
+- [Client write-back cache](../config/client.en.md#client_enable_writeback)

 ## Plugins and tools

--- a/docs/intro/features.ru.md
+++ b/docs/intro/features.ru.md
@@ -33,6 +33,7 @@
 - [Поддержка RDMA/RoCEv2 через libibverbs](../config/network.ru.md#rdma_device)
 - [Фоновая проверка целостности](../config/osd.ru.md#auto_scrub) (сверка копий)
 - [Контрольные суммы](../config/layout-osd.ru.md#data_csum_type)
+- [Буферизация записи на стороне клиента](../config/client.ru.md#client_enable_writeback)

 ## Драйверы и инструменты

--- a/docs/usage/qemu.en.md
+++ b/docs/usage/qemu.en.md
@@ -34,6 +34,20 @@ qemu-system-x86_64 -enable-kvm -m 1024 \
    -vnc 0.0.0.0:0
 ```

+With a separate I/O thread:
+
+```
+qemu-system-x86_64 -enable-kvm -m 1024 \
+    -object iothread,id=vitastor1 \
+    -blockdev '{"node-name":"drive-virtio-disk0","driver":"vitastor","image":"debian9",
+        "cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
+    -device 'virtio-blk-pci,iothread=vitastor1,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,
+        id=virtio-disk0,bootindex=1,write-cache=off' \
+    -vnc 0.0.0.0:0
+```
+
+You can also specify the inode ID, pool and size manually instead of the `:image=<IMAGE>` option: `:pool=<POOL>:inode=<INODE>:size=<SIZE>`.
+
 ## qemu-img

 For qemu-img, you should use `vitastor:etcd_host=<HOST>:image=<IMAGE>` as filename.
@@ -84,6 +98,29 @@ This can be used for backups. Just note that exporting an image that is currentl
 is of course unsafe and doesn't produce a consistent result, so only export snapshots if you do this
 on a live VM.

+## vhost-user-blk
+
+QEMU, starting with 6.0, includes support for attaching disks via a separate
+userspace worker process, called `vhost-user-blk`. It usually has slightly (20-30 us)
+lower latency.
+
+Example commands to use it with Vitastor:
+
+```
+qemu-storage-daemon \
+    --daemonize \
+    --blockdev '{"node-name":"drive-virtio-disk1","driver":"vitastor","image":"testosd1","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
+    --export type=vhost-user-blk,id=vitastor1,node-name=drive-virtio-disk1,addr.type=unix,addr.path=/run/vitastor1-user-blk.sock,writable=on,num-queues=1
+
+qemu-system-x86_64 -enable-kvm -m 2048 -M accel=kvm,memory-backend=mem \
+    -object memory-backend-memfd,id=mem,size=2G,share=on \
+    -chardev socket,id=vitastor1,reconnect=1,path=/run/vitastor1-user-blk.sock \
+    -device vhost-user-blk-pci,chardev=vitastor1,num-queues=1,config-wce=off \
+    -vnc 0.0.0.0:0
+```
+
+The memfd memory-backend is crucial: vhost-user-blk does not work without it.
+
 ## VDUSE

 Linux kernel, starting with version 5.15, supports a new interface for attaching virtual disks
--- a/docs/usage/qemu.ru.md
+++ b/docs/usage/qemu.ru.md
@@ -36,6 +36,18 @@ qemu-system-x86_64 -enable-kvm -m 1024 \
    -vnc 0.0.0.0:0
 ```

+С отдельным потоком ввода-вывода:
+
+```
+qemu-system-x86_64 -enable-kvm -m 1024 \
+    -object iothread,id=vitastor1 \
+    -blockdev '{"node-name":"drive-virtio-disk0","driver":"vitastor","image":"debian9",
+        "cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
+    -device 'virtio-blk-pci,iothread=vitastor1,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,
+        id=virtio-disk0,bootindex=1,write-cache=off' \
+    -vnc 0.0.0.0:0
+```
+
 Вместо `:image=<IMAGE>` также можно указывать номер инода, пул и размер: `:pool=<POOL>:inode=<INODE>:size=<SIZE>`.

 ## qemu-img
@@ -88,6 +100,29 @@ qemu-img rebase -u -b '' testimg.qcow2
 в то же время идёт запись, небезопасно - результат чтения не будет целостным. Так что если вы работаете
 с активными виртуальными машинами, экспортируйте только их снимки, но не сам образ.

+## vhost-user-blk
+
+QEMU, начиная с 6.0, позволяет подключать диски через отдельный рабочий процесс.
+Этот метод подключения называется `vhost-user-blk` и обычно имеет чуть меньшую
+задержку (ниже на 20-30 микросекунд, чем при обычном методе).
+
+Пример команд для использования vhost-user-blk с Vitastor:
+
+```
+qemu-storage-daemon \
+    --daemonize \
+    --blockdev '{"node-name":"drive-virtio-disk1","driver":"vitastor","image":"testosd1","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
+    --export type=vhost-user-blk,id=vitastor1,node-name=drive-virtio-disk1,addr.type=unix,addr.path=/run/vitastor1-user-blk.sock,writable=on,num-queues=1
+
+qemu-system-x86_64 -enable-kvm -m 2048 -M accel=kvm,memory-backend=mem \
+    -object memory-backend-memfd,id=mem,size=2G,share=on \
+    -chardev socket,id=vitastor1,reconnect=1,path=/run/vitastor1-user-blk.sock \
+    -device vhost-user-blk-pci,chardev=vitastor1,num-queues=1,config-wce=off \
+    -vnc 0.0.0.0:0
+```
+
+Здесь критична опция memory-backend-memfd, vhost-user-blk без неё не работает.
+
 ## VDUSE

 В Linux, начиная с версии ядра 5.15, доступен новый интерфейс для подключения виртуальных дисков
--- a/mon/mon.js
+++ b/mon/mon.js
@@ -78,9 +78,15 @@ const etcd_tree = {
            disk_alignment: 4096,
            bitmap_granularity: 4096,
            immediate_commit: false, // 'all' or 'small'
+            // client - configurable online
+            client_max_dirty_bytes: 33554432,
+            client_max_dirty_ops: 1024,
+            client_enable_writeback: false,
+            client_max_buffered_bytes: 33554432,
+            client_max_buffered_ops: 1024,
+            client_max_writeback_iodepth: 256,
            // client and osd - configurable online
            log_level: 0,
-            client_dirty_limit: 33554432,
            peer_connect_interval: 5, // seconds. min: 1
            peer_connect_timeout: 5, // seconds. min: 1
            osd_idle_timeout: 5, // seconds. min: 1
@@ -93,6 +99,7 @@ const etcd_tree = {
            etcd_ws_keepalive_interval: 30, // seconds
            // osd
            etcd_report_interval: 5, // seconds
+            etcd_stats_interval: 30, // seconds
            run_primary: true,
            osd_network: null, // "192.168.7.0/24" or an array of masks
            bind_address: "0.0.0.0",
@@ -539,10 +546,18 @@ class Mon
        {
            retries = 1;
        }
+        const tried = {};
        while (retries < 0 || retry < retries)
        {
            const cur_addr = this.pick_next_etcd();
            const base = 'ws'+cur_addr.substr(4);
+            let now = Date.now();
+            if (tried[base] && now-tried[base] < timeout)
+            {
+                await new Promise(ok => setTimeout(ok, timeout-(now-tried[base])));
+                now = Date.now();
+            }
+            tried[base] = now;
            const ok = await new Promise((ok, no) =>
            {
                const timer_id = setTimeout(() =>
@@ -1148,6 +1163,33 @@ class Mon
        }
    }

+    filter_osds_by_block_layout(flat_tree, block_size, bitmap_granularity, immediate_commit)
+    {
+        for (const host in flat_tree)
+        {
+            let found = 0;
+            for (const osd in flat_tree[host])
+            {
+                const osd_stat = this.state.osd.stats[osd];
+                if (osd_stat && (osd_stat.bs_block_size && osd_stat.bs_block_size != block_size ||
+                    osd_stat.bitmap_granularity && osd_stat.bitmap_granularity != bitmap_granularity ||
+                    osd_stat.immediate_commit == 'small' && immediate_commit == 'all' ||
+                    osd_stat.immediate_commit == 'none' && immediate_commit != 'none'))
+                {
+                    delete flat_tree[host][osd];
+                }
+                else
+                {
+                    found++;
+                }
+            }
+            if (!found)
+            {
+                delete flat_tree[host];
+            }
+        }
+    }
+
    get_affinity_osds(pool_cfg, up_osds, osd_tree)
    {
        let aff_osds = up_osds;
@@ -1208,6 +1250,12 @@ class Mon
                pool_tree = pool_tree ? pool_tree.children : [];
                pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
                this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
+                this.filter_osds_by_block_layout(
+                    pool_tree,
+                    pool_cfg.block_size || this.config.block_size || 131072,
+                    pool_cfg.bitmap_granularity || this.config.bitmap_granularity || 4096,
+                    pool_cfg.immediate_commit || this.config.immediate_commit || 'none'
+                );
                // These are for the purpose of building history.osd_sets
                const real_prev_pgs = [];
                let pg_history = [];
@@ -1788,10 +1836,18 @@ class Mon
        {
            retries = 1;
        }
+        const tried = {};
        while (retries < 0 || retry < retries)
        {
            retry++;
            const base = this.pick_next_etcd();
+            let now = Date.now();
+            if (tried[base] && now-tried[base] < timeout)
+            {
+                await new Promise(ok => setTimeout(ok, timeout-(now-tried[base])));
+                now = Date.now();
+            }
+            tried[base] = now;
            const res = await POST(base+path, body, timeout);
            if (res.error)
            {
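Note on `filter_osds_by_block_layout` in the mon.js changes above: the per-OSD condition can be read as a compatibility predicate. The sketch below restates it in C++ for an OSD that has reported statistics (the monitor keeps OSDs with no statistics at all). `osd_layout_t` and `osd_matches_pool` are hypothetical names, and a zero or empty field stands for a value the OSD did not report.

```cpp
#include <cstdint>
#include <string>

// Hypothetical subset of the per-OSD statistics read by the monitor;
// zero or empty means "not reported"
struct osd_layout_t
{
    uint64_t bs_block_size = 0;
    uint64_t bitmap_granularity = 0;
    std::string immediate_commit;
};

// Returns true if the OSD may be used for a pool with the given layout
bool osd_matches_pool(const osd_layout_t & osd, uint64_t block_size,
    uint64_t bitmap_granularity, const std::string & immediate_commit)
{
    // Reported block size must match the pool
    if (osd.bs_block_size && osd.bs_block_size != block_size)
        return false;
    // Reported bitmap granularity must match the pool
    if (osd.bitmap_granularity && osd.bitmap_granularity != bitmap_granularity)
        return false;
    // An OSD with immediate_commit=small can't serve a pool that requires "all"
    if (osd.immediate_commit == "small" && immediate_commit == "all")
        return false;
    // An OSD with immediate_commit=none can only serve pools that use "none"
    if (osd.immediate_commit == "none" && immediate_commit != "none")
        return false;
    return true;
}
```

In the monitor, OSDs failing this check are deleted from the flattened tree, and a host is removed as well once none of its OSDs remain.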
--- a/mon/package.json
+++ b/mon/package.json
@@ -1,6 +1,6 @@
 {
  "name": "vitastor-mon",
-  "version": "1.0.0",
+  "version": "1.1.0",
  "description": "Vitastor SDS monitor service",
  "main": "mon-main.js",
  "scripts": {
--- a/patches/cinder-vitastor.py
+++ b/patches/cinder-vitastor.py
@@ -50,7 +50,7 @@ from cinder.volume import configuration
 from cinder.volume import driver
 from cinder.volume import volume_utils

-VERSION = '1.0.0'
+VERSION = '1.1.0'

 LOG = logging.getLogger(__name__)

--- a/rpm/build-tarball.sh
+++ b/rpm/build-tarball.sh
@@ -24,4 +24,4 @@ rm fio
 mv fio-copy fio
 FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
 perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
-tar --transform 's#^#vitastor-1.0.0/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-1.0.0$(rpm --eval '%dist').tar.gz *
+tar --transform 's#^#vitastor-1.1.0/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-1.1.0$(rpm --eval '%dist').tar.gz *
--- a/rpm/vitastor-el7.Dockerfile
+++ b/rpm/vitastor-el7.Dockerfile
@@ -35,7 +35,7 @@ ADD . /root/vitastor
 RUN set -e; \
    cd /root/vitastor/rpm; \
    sh build-tarball.sh; \
-    cp /root/vitastor-1.0.0.el7.tar.gz ~/rpmbuild/SOURCES; \
+    cp /root/vitastor-1.1.0.el7.tar.gz ~/rpmbuild/SOURCES; \
    cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
    cd ~/rpmbuild/SPECS/; \
    rpmbuild -ba vitastor.spec; \
--- a/rpm/vitastor-el7.spec
+++ b/rpm/vitastor-el7.spec
@@ -1,11 +1,11 @@
 Name:           vitastor
-Version:        1.0.0
+Version:        1.1.0
 Release:        1%{?dist}
 Summary:        Vitastor, a fast software-defined clustered block storage

 License:        Vitastor Network Public License 1.1
 URL:            https://vitastor.io/
-Source0:        vitastor-1.0.0.el7.tar.gz
+Source0:        vitastor-1.1.0.el7.tar.gz

 BuildRequires:  liburing-devel >= 0.6
 BuildRequires:  gperftools-devel
--- a/rpm/vitastor-el8.Dockerfile
+++ b/rpm/vitastor-el8.Dockerfile
@@ -35,7 +35,7 @@ ADD . /root/vitastor
 RUN set -e; \
    cd /root/vitastor/rpm; \
    sh build-tarball.sh; \
-    cp /root/vitastor-1.0.0.el8.tar.gz ~/rpmbuild/SOURCES; \
+    cp /root/vitastor-1.1.0.el8.tar.gz ~/rpmbuild/SOURCES; \
    cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
    cd ~/rpmbuild/SPECS/; \
    rpmbuild -ba vitastor.spec; \
--- a/rpm/vitastor-el8.spec
+++ b/rpm/vitastor-el8.spec
@@ -1,11 +1,11 @@
 Name:           vitastor
-Version:        1.0.0
+Version:        1.1.0
 Release:        1%{?dist}
 Summary:        Vitastor, a fast software-defined clustered block storage

 License:        Vitastor Network Public License 1.1
 URL:            https://vitastor.io/
-Source0:        vitastor-1.0.0.el8.tar.gz
+Source0:        vitastor-1.1.0.el8.tar.gz

 BuildRequires:  liburing-devel >= 0.6
 BuildRequires:  gperftools-devel
--- a/rpm/vitastor-el9.Dockerfile
+++ b/rpm/vitastor-el9.Dockerfile
@@ -18,7 +18,7 @@ ADD . /root/vitastor
 RUN set -e; \
    cd /root/vitastor/rpm; \
    sh build-tarball.sh; \
-    cp /root/vitastor-1.0.0.el9.tar.gz ~/rpmbuild/SOURCES; \
+    cp /root/vitastor-1.1.0.el9.tar.gz ~/rpmbuild/SOURCES; \
    cp vitastor-el9.spec ~/rpmbuild/SPECS/vitastor.spec; \
    cd ~/rpmbuild/SPECS/; \
    rpmbuild -ba vitastor.spec; \
--- a/rpm/vitastor-el9.spec
+++ b/rpm/vitastor-el9.spec
@@ -1,11 +1,11 @@
 Name:           vitastor
-Version:        1.0.0
+Version:        1.1.0
 Release:        1%{?dist}
 Summary:        Vitastor, a fast software-defined clustered block storage

 License:        Vitastor Network Public License 1.1
 URL:            https://vitastor.io/
-Source0:        vitastor-1.0.0.el9.tar.gz
+Source0:        vitastor-1.1.0.el9.tar.gz

 BuildRequires:  liburing-devel >= 0.6
 BuildRequires:  gperftools-devel
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -16,7 +16,7 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
 	set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
 endif()

-add_definitions(-DVERSION="1.0.0")
+add_definitions(-DVERSION="1.1.0")
 add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -I ${CMAKE_SOURCE_DIR}/src)
 if (${WITH_ASAN})
 	add_definitions(-fsanitize=address -fno-omit-frame-pointer)
@@ -137,6 +137,7 @@ endif (${WITH_FIO})
 add_library(vitastor_client SHARED
 	cluster_client.cpp
 	cluster_client_list.cpp
+	cluster_client_wb.cpp
 	vitastor_c.cpp
 	cli_common.cpp
 	cli_alloc_osd.cpp
@@ -300,7 +301,7 @@ target_link_libraries(test_crc32
 add_executable(test_cluster_client
 	EXCLUDE_FROM_ALL
 	test_cluster_client.cpp
-	pg_states.cpp osd_ops.cpp cluster_client.cpp cluster_client_list.cpp msgr_op.cpp mock/messenger.cpp msgr_stop.cpp
+	pg_states.cpp osd_ops.cpp cluster_client.cpp cluster_client_list.cpp cluster_client_wb.cpp msgr_op.cpp mock/messenger.cpp msgr_stop.cpp
 	etcd_state_client.cpp timerfd_manager.cpp str_util.cpp ../json11/json11.cpp
 )
 target_compile_definitions(test_cluster_client PUBLIC -D__MOCK__)
--- a/src/addr_util.cpp
+++ b/src/addr_util.cpp
@@ -19,8 +19,8 @@ bool string_to_addr(std::string str, bool parse_port, int default_port, struct s
        if (p != std::string::npos && !(str.length() > 0 && str[p-1] == ']')) // "[ipv6]" which contains ':'
        {
            char null_byte = 0;
-            int n = sscanf(str.c_str()+p+1, "%d%c", &default_port, &null_byte);
-            if (n != 1 || default_port >= 0x10000)
+            int scanned = sscanf(str.c_str()+p+1, "%d%c", &default_port, &null_byte);
+            if (scanned != 1 || default_port >= 0x10000)
                return false;
            str = str.substr(0, p);
        }
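Note on the `sscanf` changes in this commit (here and in the CLI files below): the `"%d%c"` format relies on the return value to detect trailing garbage. `sscanf` returns 1 only when the number converts and nothing follows it, 2 when an extra character was read, and 0 or EOF when no number was found. A minimal standalone sketch of the same idiom follows; `parse_port` is a hypothetical helper, and the check for negative values is an addition that is not part of the diff.

```cpp
#include <cstdio>
#include <string>

// Returns the port number, or -1 if the string is not exactly one
// number in the range 0..65535
int parse_port(const std::string & str)
{
    int port = 0;
    char extra = 0;
    // scanned == 1: number parsed and no trailing characters
    // scanned == 2: a trailing character was consumed by %c
    // scanned <= 0: no number at all (0) or empty input (EOF)
    int scanned = sscanf(str.c_str(), "%d%c", &port, &extra);
    if (scanned != 1 || port < 0 || port >= 0x10000)
        return -1;
    return port;
}
```

Checking the return value instead of the output variables also avoids reading a variable that `sscanf` never assigned, which is what the old `cli_alloc_osd.cpp` and `cli_df.cpp` checks did.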
--- a/src/blockstore_disk.cpp
+++ b/src/blockstore_disk.cpp
@@ -45,13 +45,31 @@ void blockstore_disk_t::parse_config(std::map<std::string, std::string> & config
    meta_block_size = parse_size(config["meta_block_size"]);
    bitmap_granularity = parse_size(config["bitmap_granularity"]);
    meta_format = stoull_full(config["meta_format"]);
-    cached_io_data = config["cached_io_data"] == "true" || config["cached_io_data"] == "yes" || config["cached_io_data"] == "1";
-    cached_io_meta = cached_io_data && (meta_device == data_device || meta_device == "") &&
-        config.find("cached_io_meta") == config.end() ||
-        config["cached_io_meta"] == "true" || config["cached_io_meta"] == "yes" || config["cached_io_meta"] == "1";
-    cached_io_journal = cached_io_meta && (journal_device == meta_device || journal_device == "") &&
-        config.find("cached_io_journal") == config.end() ||
-        config["cached_io_journal"] == "true" || config["cached_io_journal"] == "yes" || config["cached_io_journal"] == "1";
+    if (config.find("data_io") == config.end() &&
+        config.find("meta_io") == config.end() &&
+        config.find("journal_io") == config.end())
+    {
+        bool cached_io_data = config["cached_io_data"] == "true" || config["cached_io_data"] == "yes" || config["cached_io_data"] == "1";
+        bool cached_io_meta = cached_io_data && (meta_device == data_device || meta_device == "") &&
+            config.find("cached_io_meta") == config.end() ||
+            config["cached_io_meta"] == "true" || config["cached_io_meta"] == "yes" || config["cached_io_meta"] == "1";
+        bool cached_io_journal = cached_io_meta && (journal_device == meta_device || journal_device == "") &&
+            config.find("cached_io_journal") == config.end() ||
+            config["cached_io_journal"] == "true" || config["cached_io_journal"] == "yes" || config["cached_io_journal"] == "1";
+        data_io = cached_io_data ? "cached" : "direct";
+        meta_io = cached_io_meta ? "cached" : "direct";
+        journal_io = cached_io_journal ? "cached" : "direct";
+    }
+    else
+    {
+        data_io = config.find("data_io") != config.end() ? config["data_io"] : "direct";
+        meta_io = config.find("meta_io") != config.end()
+            ? config["meta_io"]
+            : (meta_device == data_device || meta_device == "" ? data_io : "direct");
+        journal_io = config.find("journal_io") != config.end()
+            ? config["journal_io"]
+            : (journal_device == meta_device || journal_device == "" ? meta_io : "direct");
+    }
    if (config["data_csum_type"] == "crc32c")
    {
        data_csum_type = BLOCKSTORE_CSUM_CRC32C;
@@ -272,9 +290,19 @@ static void check_size(int fd, uint64_t *size, uint64_t *sectsize, std::string n
    }
 }

+static int bs_openmode(const std::string & mode)
+{
+    if (mode == "directsync")
+        return O_DIRECT|O_SYNC;
+    else if (mode == "cached")
+        return O_SYNC;
+    else
+        return O_DIRECT;
+}
+
 void blockstore_disk_t::open_data()
 {
-    data_fd = open(data_device.c_str(), (cached_io_data ? O_SYNC : O_DIRECT) | O_RDWR);
+    data_fd = open(data_device.c_str(), bs_openmode(data_io) | O_RDWR);
    if (data_fd == -1)
    {
        throw std::runtime_error("Failed to open data device "+data_device+": "+std::string(strerror(errno)));
@@ -299,9 +327,9 @@ void blockstore_disk_t::open_data()

 void blockstore_disk_t::open_meta()
 {
-    if (meta_device != data_device || cached_io_meta != cached_io_data)
+    if (meta_device != data_device || meta_io != data_io)
    {
-        meta_fd = open(meta_device.c_str(), (cached_io_meta ? O_SYNC : O_DIRECT) | O_RDWR);
+        meta_fd = open(meta_device.c_str(), bs_openmode(meta_io) | O_RDWR);
        if (meta_fd == -1)
        {
            throw std::runtime_error("Failed to open metadata device "+meta_device+": "+std::string(strerror(errno)));
@@ -337,9 +365,9 @@ void blockstore_disk_t::open_meta()

 void blockstore_disk_t::open_journal()
 {
-    if (journal_device != meta_device || cached_io_journal != cached_io_meta)
+    if (journal_device != meta_device || journal_io != meta_io)
    {
-        journal_fd = open(journal_device.c_str(), (cached_io_journal ? O_SYNC : O_DIRECT) | O_RDWR);
+        journal_fd = open(journal_device.c_str(), bs_openmode(journal_io) | O_RDWR);
        if (journal_fd == -1)
        {
            throw std::runtime_error("Failed to open journal device "+journal_device+": "+std::string(strerror(errno)));
--- a/src/blockstore_disk.h
+++ b/src/blockstore_disk.h
@@ -31,8 +31,9 @@ struct blockstore_disk_t
    uint32_t csum_block_size = 4096;
    // By default, Blockstore locks all opened devices exclusively. This option can be used to disable locking
    bool disable_flock = false;
-    // Use Linux page cache for reads and writes, i.e. open FDs with O_SYNC instead of O_DIRECT
-    bool cached_io_data = false, cached_io_meta = false, cached_io_journal = false;
+    // I/O modes for data, metadata and journal: direct or "" = O_DIRECT, cached = O_SYNC, directsync = O_DIRECT|O_SYNC
+    // O_SYNC without O_DIRECT = use Linux page cache for reads and writes
+    std::string data_io, meta_io, journal_io;

    int meta_fd = -1, data_fd = -1, journal_fd = -1;
    uint64_t meta_offset, meta_device_sect, meta_device_size, meta_len, meta_format = 0;
--- a/src/blockstore_impl.cpp
+++ b/src/blockstore_impl.cpp
@@ -384,6 +384,10 @@ void blockstore_impl_t::enqueue_op(blockstore_op_t *op)
        ringloop->set_immediate([op]() { std::function<void (blockstore_op_t*)>(op->callback)(op); });
        return;
    }
+    if (op->opcode == BS_OP_SYNC)
+    {
+        unsynced_queued_ops = 0;
+    }
    init_op(op);
    submit_queue.push_back(op);
    ringloop->wakeup();
@@ -393,6 +397,7 @@ void blockstore_impl_t::init_op(blockstore_op_t *op)
 {
    // Call constructor without allocating memory. We'll call destructor before returning op back
    new ((void*)op->private_data) blockstore_op_private_t;
+    PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
    PRIV(op)->wait_for = 0;
    PRIV(op)->op_state = 0;
    PRIV(op)->pending_ops = 0;
--- a/src/blockstore_impl.h
+++ b/src/blockstore_impl.h
@@ -210,7 +210,7 @@ struct blockstore_op_private_t
    std::vector<copy_buffer_t> read_vec;

    // Sync, write
-    int min_flushed_journal_sector, max_flushed_journal_sector;
+    uint64_t min_flushed_journal_sector, max_flushed_journal_sector;

    // Write
    struct iovec iov_zerofill[3];
@@ -220,7 +220,6 @@ struct blockstore_op_private_t

    // Sync
    std::vector<obj_ver_id> sync_big_writes, sync_small_writes;
-    int sync_small_checked, sync_big_checked;
 };

 typedef uint32_t pool_id_t;
@@ -263,6 +262,8 @@ class blockstore_impl_t
    int throttle_target_parallelism = 1;
    // Minimum difference in microseconds between target and real execution times to throttle the response
    int throttle_threshold_us = 50;
+    // Maximum writes between automatically added fsync operations
+    uint64_t autosync_writes = 128;
    /******* END OF OPTIONS *******/

    struct ring_consumer_t ring_consumer;
@@ -274,6 +275,7 @@ class blockstore_impl_t
    std::vector<blockstore_op_t*> submit_queue;
    std::vector<obj_ver_id> unsynced_big_writes, unsynced_small_writes;
    int unsynced_big_write_count = 0;
+    int unsynced_queued_ops = 0;
    allocator *data_alloc = NULL;
    uint8_t *zero_object;

--- a/src/blockstore_journal.cpp
+++ b/src/blockstore_journal.cpp
@@ -198,6 +198,7 @@ void blockstore_impl_t::prepare_journal_sector_write(int cur_sector, blockstore_
    priv->pending_ops++;
    if (!priv->min_flushed_journal_sector)
        priv->min_flushed_journal_sector = 1+cur_sector;
+    assert(priv->min_flushed_journal_sector <= journal.sector_count);
    priv->max_flushed_journal_sector = 1+cur_sector;
 }

--- a/src/blockstore_open.cpp
+++ b/src/blockstore_open.cpp
@@ -19,6 +19,10 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config, bool init)
    throttle_target_mbs = strtoull(config["throttle_target_mbs"].c_str(), NULL, 10);
    throttle_target_parallelism = strtoull(config["throttle_target_parallelism"].c_str(), NULL, 10);
    throttle_threshold_us = strtoull(config["throttle_threshold_us"].c_str(), NULL, 10);
+    if (config.find("autosync_writes") != config.end())
+    {
+        autosync_writes = strtoull(config["autosync_writes"].c_str(), NULL, 10);
+    }
    if (!max_flusher_count)
    {
        max_flusher_count = 256;
--- a/src/blockstore_sync.cpp
+++ b/src/blockstore_sync.cpp
@@ -27,8 +27,6 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
        unsynced_big_write_count -= unsynced_big_writes.size();
        PRIV(op)->sync_big_writes.swap(unsynced_big_writes);
        PRIV(op)->sync_small_writes.swap(unsynced_small_writes);
-        PRIV(op)->sync_small_checked = 0;
-        PRIV(op)->sync_big_checked = 0;
        unsynced_big_writes.clear();
        unsynced_small_writes.clear();
        if (PRIV(op)->sync_big_writes.size() > 0)
--- a/src/blockstore_write.cpp
+++ b/src/blockstore_write.cpp
@@ -127,8 +127,9 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
            return false;
        }
    }
-    if (wait_big && !is_del && !deleted && op->len < dsk.data_block_size &&
-        immediate_commit != IMMEDIATE_ALL)
+    bool imm = (op->len < dsk.data_block_size ? (immediate_commit != IMMEDIATE_NONE) : (immediate_commit == IMMEDIATE_ALL));
+    if (wait_big && !is_del && !deleted && op->len < dsk.data_block_size && !imm ||
+        !imm && unsynced_queued_ops >= autosync_writes)
    {
        // Issue an additional sync so that the previous big write can reach the journal
        blockstore_op_t *sync_op = new blockstore_op_t;
@@ -139,6 +140,8 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
        };
        enqueue_op(sync_op);
    }
+    else if (!imm)
+        unsynced_queued_ops++;
 #ifdef BLOCKSTORE_DEBUG
    if (is_del)
        printf("Delete %lx:%lx v%lu\n", op->oid.inode, op->oid.stripe, op->version);
@@ -286,13 +289,18 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        printf("Restoring %lx:%lx version: v%lu -> v%lu\n", op->oid.inode, op->oid.stripe, op->version, PRIV(op)->real_version);
 #endif
        auto prev_it = dirty_it;
-        prev_it--;
-        if (prev_it->first.oid == op->oid && prev_it->first.version >= PRIV(op)->real_version)
+        if (prev_it != dirty_db.begin())
        {
-            // Original version is still invalid
-            // All subsequent writes to the same object must be canceled too
-            cancel_all_writes(op, dirty_it, -EEXIST);
-            return 2;
+            prev_it--;
+            if (prev_it->first.oid == op->oid && prev_it->first.version >= PRIV(op)->real_version)
+            {
+                // Original version is still invalid
+                // All subsequent writes to the same object must be canceled too
+                printf("Tried to write %lx:%lx v%lu after delete (old version v%lu), but already have v%lu\n",
+                    op->oid.inode, op->oid.stripe, PRIV(op)->real_version, op->version, prev_it->first.version);
+                cancel_all_writes(op, dirty_it, -EEXIST);
+                return 2;
+            }
        }
        op->version = PRIV(op)->real_version;
        PRIV(op)->real_version = 0;
@@ -378,7 +386,6 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
            sqe, dsk.data_fd, PRIV(op)->iov_zerofill, vcnt, dsk.data_offset + (loc << dsk.block_order) + op->offset - stripe_offset
        );
        PRIV(op)->pending_ops = 1;
-        PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
        if (immediate_commit != IMMEDIATE_ALL)
        {
            // Increase the counter, but don't save into unsynced_writes yet (can't sync until the write is finished)
@@ -415,16 +422,10 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        write_iodepth++;
        // Got SQEs. Prepare previous journal sector write if required
        auto cb = [this, op](ring_data_t *data) { handle_write_event(data, op); };
-        if (immediate_commit == IMMEDIATE_NONE)
+        if (immediate_commit == IMMEDIATE_NONE &&
+            !journal.entry_fits(sizeof(journal_entry_small_write) + dyn_size))
        {
-            if (!journal.entry_fits(sizeof(journal_entry_small_write) + dyn_size))
-            {
-                prepare_journal_sector_write(journal.cur_sector, op);
-            }
-            else
-            {
-                PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
-            }
+            prepare_journal_sector_write(journal.cur_sector, op);
        }
        // Then pre-fill journal entry
        journal_entry_small_write *je = (journal_entry_small_write*)prefill_single_journal_entry(
@@ -750,17 +751,11 @@ int blockstore_impl_t::dequeue_del(blockstore_op_t *op)
    }
    write_iodepth++;
    // Prepare journal sector write
-    if (immediate_commit == IMMEDIATE_NONE)
+    if (immediate_commit == IMMEDIATE_NONE &&
+        (dsk.journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_del) &&
+        journal.sector_info[journal.cur_sector].dirty)
    {
-        if ((dsk.journal_block_size - journal.in_sector_pos) < sizeof(journal_entry_del) &&
-            journal.sector_info[journal.cur_sector].dirty)
-        {
-            prepare_journal_sector_write(journal.cur_sector, op);
-        }
-        else
-        {
-            PRIV(op)->min_flushed_journal_sector = PRIV(op)->max_flushed_journal_sector = 0;
-        }
+        prepare_journal_sector_write(journal.cur_sector, op);
    }
    // Pre-fill journal entry
    journal_entry_del *je = (journal_entry_del*)prefill_single_journal_entry(
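Note on the `autosync_writes` logic added to `enqueue_write` above: writes that are not committed immediately are counted, and once the counter reaches the limit the next such write enqueues a sync, which resets the counter in `enqueue_op`. The reduced model below shows only that counter behaviour. `autosync_model_t` is a hypothetical structure, not the real blockstore class, and it ignores the separate `wait_big` condition.

```cpp
#include <cstdint>

// Hypothetical reduced model of the autosync counter
struct autosync_model_t
{
    uint64_t autosync_writes = 128;   // same default as in blockstore_impl.h
    uint64_t unsynced_queued_ops = 0;
    uint64_t syncs_issued = 0;

    // Called for every queued write that is not committed immediately
    void on_write()
    {
        if (unsynced_queued_ops >= autosync_writes)
        {
            // Limit reached: a sync is enqueued and the counter is reset;
            // the write that triggered it is not counted
            syncs_issued++;
            unsynced_queued_ops = 0;
        }
        else
        {
            unsynced_queued_ops++;
        }
    }
};
```

Under this model, with the default limit of 128, a sync is added on every 129th non-immediate write.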
--- a/src/cli.cpp
+++ b/src/cli.cpp
@@ -349,6 +349,7 @@ static int run(cli_tool_t *p, json11::Json::object cfg)
                p->ringloop->wait();
        }
        // Destroy the client
+        p->cli->flush();
        delete p->cli;
        delete p->epmgr;
        delete p->ringloop;
@@ -357,6 +358,8 @@ static int run(cli_tool_t *p, json11::Json::object cfg)
        p->ringloop = NULL;
    }
    // Print result
+    fflush(stderr);
+    fflush(stdout);
    if (p->json_output && !result.data.is_null())
    {
        printf("%s\n", result.data.dump().c_str());
--- a/src/cli_alloc_osd.cpp
+++ b/src/cli_alloc_osd.cpp
@@ -77,8 +77,8 @@ struct alloc_osd_t
                    std::string key = base64_decode(kv["key"].string_value());
                    osd_num_t cur_osd;
                    char null_byte = 0;
-                    sscanf(key.c_str() + parent->cli->st_cli.etcd_prefix.length(), "/osd/stats/%lu%c", &cur_osd, &null_byte);
-                    if (!cur_osd || null_byte != 0)
+                    int scanned = sscanf(key.c_str() + parent->cli->st_cli.etcd_prefix.length(), "/osd/stats/%lu%c", &cur_osd, &null_byte);
+                    if (scanned != 1 || !cur_osd)
                    {
                        fprintf(stderr, "Invalid key in etcd: %s\n", key.c_str());
                        continue;
--- a/src/cli_df.cpp
+++ b/src/cli_df.cpp
@@ -67,8 +67,8 @@ resume_1:
            // pool ID
            pool_id_t pool_id;
            char null_byte = 0;
-            sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/pool/stats/%u%c", &pool_id, &null_byte);
-            if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
+            int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/pool/stats/%u%c", &pool_id, &null_byte);
+            if (scanned != 1 || !pool_id || pool_id >= POOL_ID_MAX)
            {
                fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
                continue;
@@ -82,8 +82,8 @@ resume_1:
            // osd ID
            osd_num_t osd_num;
            char null_byte = 0;
-            sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/osd/stats/%lu%c", &osd_num, &null_byte);
-            if (!osd_num || osd_num >= POOL_ID_MAX || null_byte != 0)
+            int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/osd/stats/%lu%c", &osd_num, &null_byte);
+            if (scanned != 1 || !osd_num || osd_num >= POOL_ID_MAX)
            {
                fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
                continue;
--- a/src/cli_ls.cpp
+++ b/src/cli_ls.cpp
@@ -133,8 +133,8 @@ resume_1:
            // pool ID
            pool_id_t pool_id;
            char null_byte = 0;
-            sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/pool/stats/%u%c", &pool_id, &null_byte);
-            if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
+            int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(), "/pool/stats/%u%c", &pool_id, &null_byte);
+            if (scanned != 1 || !pool_id || pool_id >= POOL_ID_MAX)
            {
                fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
                continue;
@@ -149,9 +149,9 @@ resume_1:
            pool_id_t pool_id;
            inode_t only_inode_num;
            char null_byte = 0;
-            sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(),
+            int scanned = sscanf(kv.key.substr(parent->cli->st_cli.etcd_prefix.length()).c_str(),
                "/inode/stats/%u/%lu%c", &pool_id, &only_inode_num, &null_byte);
-            if (!pool_id || pool_id >= POOL_ID_MAX || INODE_POOL(only_inode_num) != 0 || null_byte != 0)
+            if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX || INODE_POOL(only_inode_num) != 0)
            {
                fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
                continue;
@@ -174,7 +174,7 @@ resume_1:
                    { "size", 0 },
                    { "readonly", false },
                    { "pool_id", (uint64_t)INODE_POOL(inode_num) },
-                    { "pool_name", pool_it == parent->cli->st_cli.pool_config.end()
+                    { "pool_name", pool_it != parent->cli->st_cli.pool_config.end()
                        ? (pool_it->second.name == "" ? "<Unnamed>" : pool_it->second.name) : "?" },
                    { "inode_num", INODE_NO_POOL(inode_num) },
                    { "inode_id", inode_num },
--- a/src/cli_merge.cpp
+++ b/src/cli_merge.cpp
@@ -53,6 +53,7 @@ struct snap_merger_t
    std::map<inode_t, std::vector<uint64_t>> layer_lists;
    std::map<inode_t, uint64_t> layer_block_size;
    std::map<inode_t, uint64_t> layer_list_pos;
+    std::vector<snap_rw_op_t*> continue_rwo, continue_rwo2;
    int in_flight = 0;
    uint64_t last_fsync_offset = 0;
    uint64_t last_written_offset = 0;
@@ -304,6 +305,12 @@ struct snap_merger_t
        oit = merge_offsets.begin();
    resume_5:
        // Now read, overwrite and optionally delete offsets one by one
+        continue_rwo2.swap(continue_rwo);
+        for (auto rwo: continue_rwo2)
+        {
+            next_write(rwo);
+        }
+        continue_rwo2.clear();
        while (in_flight < parent->iodepth*parent->parallel_osds &&
            oit != merge_offsets.end() && !rwo_error.size())
        {
@@ -464,7 +471,8 @@ struct snap_merger_t
                rwo->error_offset = op->offset;
                rwo->error_read = true;
            }
-            next_write(rwo);
+            continue_rwo.push_back(rwo);
+            parent->ringloop->wakeup();
        };
        parent->cli->execute(op);
    }
@@ -544,11 +552,9 @@ struct snap_merger_t
            }
            // Increment CAS version
            rwo->op.version = subop->version;
-            if (use_cas)
-                next_write(rwo);
-            else
-                autofree_op(rwo);
            delete subop;
+            continue_rwo.push_back(rwo);
+            parent->ringloop->wakeup();
        };
        parent->cli->execute(subop);
    }
--- a/src/cli_rm.cpp
+++ b/src/cli_rm.cpp
@@ -384,8 +384,8 @@ resume_100:
                pool_id_t pool_id = 0;
                inode_t inode = 0;
                char null_byte = 0;
-                sscanf(kv.key.c_str() + parent->cli->st_cli.etcd_prefix.length()+13, "%u/%lu%c", &pool_id, &inode, &null_byte);
-                if (!inode || null_byte != 0)
+                int scanned = sscanf(kv.key.c_str() + parent->cli->st_cli.etcd_prefix.length()+13, "%u/%lu%c", &pool_id, &inode, &null_byte);
+                if (scanned != 2 || !inode)
                {
                    result = (cli_result_t){ .err = EIO, .text = "Bad key returned from etcd: "+kv.key };
                    state = 100;
--- a/src/cli_status.cpp
+++ b/src/cli_status.cpp
@@ -132,8 +132,8 @@ resume_2:
            auto kv = parent->cli->st_cli.parse_etcd_kv(osd_stats[i]);
            osd_num_t stat_osd_num = 0;
            char null_byte = 0;
-            sscanf(kv.key.c_str() + parent->cli->st_cli.etcd_prefix.size(), "/osd/stats/%lu%c", &stat_osd_num, &null_byte);
-            if (!stat_osd_num || null_byte != 0)
+            int scanned = sscanf(kv.key.c_str() + parent->cli->st_cli.etcd_prefix.size(), "/osd/stats/%lu%c", &stat_osd_num, &null_byte);
+            if (scanned != 1 || !stat_osd_num)
            {
                fprintf(stderr, "Invalid key in etcd: %s\n", kv.key.c_str());
                continue;
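
Review note on the `sscanf` changes in the CLI files above: the old checks relied on `null_byte` staying zero, while the new ones compare the return value of `sscanf` against the expected number of conversions. The sketch below illustrates that pattern in isolation. `parse_osd_key` is a hypothetical helper that is not part of this commit, and it uses `SCNu64` instead of `%lu` only to keep the example portable.

```cpp
#include <cstdio>
#include <cinttypes>
#include <cstdint>

// Hypothetical helper (not in the commit): parse "/osd/stats/<number>"
// and reject keys with a zero number or any trailing characters.
static bool parse_osd_key(const char *key, uint64_t *osd_num)
{
    char extra = 0;
    // sscanf returns the number of successful conversions:
    //   1  -> only the number matched (the key ends right after it)
    //   2  -> a trailing character matched too, so the key has garbage at the end
    //   0 or EOF -> the prefix or the number did not match at all
    int scanned = sscanf(key, "/osd/stats/%" SCNu64 "%c", osd_num, &extra);
    return scanned == 1 && *osd_num != 0;
}
```

Checking the conversion count also covers the case where nothing was parsed and the output variable was never written, which matters wherever the variable is declared without an initializer.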
--- a/src/cluster_client.cpp
+++ b/src/cluster_client.cpp
@@ -3,21 +3,13 @@

 #include <stdexcept>
 #include <assert.h>
-#include "cluster_client.h"
-
-#define SCRAP_BUFFER_SIZE 4*1024*1024
-#define PART_SENT 1
-#define PART_DONE 2
-#define PART_ERROR 4
-#define PART_RETRY 8
-#define CACHE_DIRTY 1
-#define CACHE_FLUSHING 2
-#define CACHE_REPEATING 3
-#define OP_FLUSH_BUFFER 0x02
-#define OP_IMMEDIATE_COMMIT 0x04
+#include "cluster_client_impl.h"
+#include "http_client.h" // json_is_true

 cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config)
 {
+    wb = new writeback_cache_t();
+
    cli_config = config.object_items();
    file_config = osd_messenger_t::read_config(config);
    config = osd_messenger_t::merge_configs(cli_config, file_config, etcd_global_config, {});
@@ -37,20 +29,14 @@ cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd
            continue_lists();
            continue_raw_ops(peer_osd);
        }
-        else if (dirty_buffers.size())
+        else
        {
            // peer_osd just dropped connection
            // determine WHICH dirty_buffers are now obsolete and repeat them
-            for (auto & wr: dirty_buffers)
+            if (wb->repeat_ops_for(this, peer_osd) > 0)
            {
-                if (affects_osd(wr.first.inode, wr.first.stripe, wr.second.len, peer_osd) &&
-                    wr.second.state != CACHE_REPEATING)
-                {
-                    // FIXME: Flush in larger parts
-                    flush_buffer(wr.first, &wr.second);
-                }
+                continue_ops();
            }
-            continue_ops();
        }
    };
    msgr.exec_op = [this](osd_op_t *op)
@@ -78,16 +64,14 @@ cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd

 cluster_client_t::~cluster_client_t()
 {
-    for (auto bp: dirty_buffers)
-    {
-        free(bp.second.buf);
-    }
-    dirty_buffers.clear();
+    msgr.repeer_pgs = [this](osd_num_t){};
    if (ringloop)
    {
        ringloop->unregister_consumer(&consumer);
    }
    free(scrap_buffer);
+    delete wb;
+    wb = NULL;
 }

 cluster_op_t::~cluster_op_t()
@@ -136,6 +120,19 @@ void cluster_client_t::init_msgr()
    }
 }

+void cluster_client_t::unshift_op(cluster_op_t *op)
+{
+    op->next = op_queue_head;
+    if (op_queue_head)
+    {
+        op_queue_head->prev = op;
+        op_queue_head = op;
+    }
+    else
+        op_queue_tail = op_queue_head = op;
+    inc_wait(op->opcode, op->flags, op->next, 1);
+}
+
 void cluster_client_t::calc_wait(cluster_op_t *op)
 {
    op->prev_wait = 0;
@@ -156,7 +153,7 @@ void cluster_client_t::calc_wait(cluster_op_t *op)
    {
        for (auto prev = op->prev; prev; prev = prev->prev)
        {
-            if (prev->opcode == OSD_OP_SYNC || prev->opcode == OSD_OP_WRITE && !(prev->flags & OP_IMMEDIATE_COMMIT))
+            if (prev->opcode == OSD_OP_SYNC || prev->opcode == OSD_OP_WRITE && (!(prev->flags & OP_IMMEDIATE_COMMIT) || enable_writeback))
            {
                op->prev_wait++;
            }
@@ -166,68 +163,58 @@ void cluster_client_t::calc_wait(cluster_op_t *op)
    }
    else /* if (op->opcode == OSD_OP_READ || op->opcode == OSD_OP_READ_BITMAP || op->opcode == OSD_OP_READ_CHAIN_BITMAP) */
    {
-        for (auto prev = op_queue_head; prev && prev != op; prev = prev->next)
-        {
-            if (prev->opcode == OSD_OP_WRITE && (prev->flags & OP_FLUSH_BUFFER))
-            {
-                op->prev_wait++;
-            }
-            else if (prev->opcode == OSD_OP_WRITE || prev->opcode == OSD_OP_READ ||
-                prev->opcode == OSD_OP_READ_BITMAP || prev->opcode == OSD_OP_READ_CHAIN_BITMAP)
-            {
-                // Flushes are always in the beginning (we're scanning from the beginning of the queue)
-                break;
-            }
-        }
-        if (!op->prev_wait)
-            continue_rw(op);
+        continue_rw(op);
    }
 }

 void cluster_client_t::inc_wait(uint64_t opcode, uint64_t flags, cluster_op_t *next, int inc)
 {
-    if (opcode == OSD_OP_WRITE)
+    if (opcode != OSD_OP_WRITE && opcode != OSD_OP_SYNC)
    {
-        while (next)
-        {
-            auto n2 = next->next;
-            if (next->opcode == OSD_OP_SYNC && !(flags & OP_IMMEDIATE_COMMIT) ||
-                next->opcode == OSD_OP_WRITE && (flags & OP_FLUSH_BUFFER) && !(next->flags & OP_FLUSH_BUFFER) ||
-                (next->opcode == OSD_OP_READ || next->opcode == OSD_OP_READ_BITMAP ||
-                    next->opcode == OSD_OP_READ_CHAIN_BITMAP) && (flags & OP_FLUSH_BUFFER))
-            {
-                next->prev_wait += inc;
-                assert(next->prev_wait >= 0);
-                if (!next->prev_wait)
-                {
-                    if (next->opcode == OSD_OP_SYNC)
-                        continue_sync(next);
-                    else
-                        continue_rw(next);
-                }
-            }
-            next = n2;
-        }
+        return;
    }
-    else if (opcode == OSD_OP_SYNC)
+    cluster_op_t *bh_ops_local[32], **bh_ops = bh_ops_local;
+    int bh_op_count = 0, bh_op_max = 32;
+    while (next)
    {
-        while (next)
+        auto n2 = next->next;
+        if (opcode == OSD_OP_WRITE
+            ? (next->opcode == OSD_OP_SYNC && (!(flags & OP_IMMEDIATE_COMMIT) || enable_writeback) ||
+                next->opcode == OSD_OP_WRITE && (flags & OP_FLUSH_BUFFER) && !(next->flags & OP_FLUSH_BUFFER))
+            : (next->opcode == OSD_OP_SYNC || next->opcode == OSD_OP_WRITE))
        {
-            auto n2 = next->next;
-            if (next->opcode == OSD_OP_SYNC || next->opcode == OSD_OP_WRITE)
+            next->prev_wait += inc;
+            assert(next->prev_wait >= 0);
+            if (!next->prev_wait)
            {
-                next->prev_wait += inc;
-                assert(next->prev_wait >= 0);
-                if (!next->prev_wait)
+                // Kind of std::vector with local "small vector optimisation"
+                if (bh_op_count >= bh_op_max)
                {
-                    if (next->opcode == OSD_OP_SYNC)
-                        continue_sync(next);
-                    else
-                        continue_rw(next);
+                    bh_op_max *= 2;
+                    cluster_op_t **n = (cluster_op_t**)malloc_or_die(sizeof(cluster_op_t*) * bh_op_max);
+                    memcpy(n, bh_ops, sizeof(cluster_op_t*) * bh_op_count);
+                    if (bh_ops != bh_ops_local)
+                    {
+                        free(bh_ops);
+                    }
+                    bh_ops = n;
                }
+                bh_ops[bh_op_count++] = next;
            }
-            next = n2;
        }
+        next = n2;
+    }
+    for (int i = 0; i < bh_op_count; i++)
+    {
+        cluster_op_t *next = bh_ops[i];
+        if (next->opcode == OSD_OP_SYNC)
+            continue_sync(next);
+        else
+            continue_rw(next);
+    }
+    if (bh_ops != bh_ops_local)
+    {
+        free(bh_ops);
    }
 }

@@ -245,13 +232,37 @@ void cluster_client_t::erase_op(cluster_op_t *op)
        op_queue_tail = op->prev;
    op->next = op->prev = NULL;
    if (flags & OP_FLUSH_BUFFER)
+    {
+        // Completed flushes change writeback buffer states,
+        // so their callback must run before inc_wait():
+        // inc_wait() may continue the following SYNCs, and these SYNCs
+        // must already see the changed buffer state.
+        // This is ugly, but this is the way we do it
        std::function<void(cluster_op_t*)>(op->callback)(op);
-    if (!(flags & OP_IMMEDIATE_COMMIT))
+    }
+    if (!(flags & OP_IMMEDIATE_COMMIT) || enable_writeback)
+    {
        inc_wait(opcode, flags, next, -1);
-    // Call callback at the end to avoid inconsistencies in prev_wait
-    // if the callback adds more operations itself
+    }
    if (!(flags & OP_FLUSH_BUFFER))
+    {
+        // Call callback at the end to avoid inconsistencies in prev_wait
+        // if the callback adds more operations itself
        std::function<void(cluster_op_t*)>(op->callback)(op);
+    }
+    if (flags & OP_FLUSH_BUFFER)
+    {
+        int i = 0;
+        while (i < wb->writeback_overflow.size() && wb->writebacks_active < client_max_writeback_iodepth)
+        {
+            execute_internal(wb->writeback_overflow[i]);
+            i++;
+        }
+        if (i > 0)
+        {
+            wb->writeback_overflow.erase(wb->writeback_overflow.begin(), wb->writeback_overflow.begin()+i);
+        }
+    }
 }

 void cluster_client_t::continue_ops(bool up_retry)
@@ -295,6 +306,7 @@ void cluster_client_t::on_load_config_hook(json11::Json::object & etcd_global_co
 {
    this->etcd_global_config = etcd_global_config;
    config = osd_messenger_t::merge_configs(cli_config, file_config, etcd_global_config, {});
+    // client_max_dirty_bytes/client_dirty_limit
    if (config.find("client_max_dirty_bytes") != config.end())
    {
        client_max_dirty_bytes = config["client_max_dirty_bytes"].uint64_value();
@@ -310,11 +322,34 @@ void cluster_client_t::on_load_config_hook(json11::Json::object & etcd_global_co
    {
        client_max_dirty_bytes = DEFAULT_CLIENT_MAX_DIRTY_BYTES;
    }
+    // client_max_dirty_ops
    client_max_dirty_ops = config["client_max_dirty_ops"].uint64_value();
    if (!client_max_dirty_ops)
    {
        client_max_dirty_ops = DEFAULT_CLIENT_MAX_DIRTY_OPS;
    }
+    // client_enable_writeback
+    enable_writeback = json_is_true(config["client_enable_writeback"]) &&
+        json_is_true(config["client_writeback_allowed"]);
+    // client_max_buffered_bytes
+    client_max_buffered_bytes = config["client_max_buffered_bytes"].uint64_value();
+    if (!client_max_buffered_bytes)
+    {
+        client_max_buffered_bytes = DEFAULT_CLIENT_MAX_BUFFERED_BYTES;
+    }
+    // client_max_buffered_ops
+    client_max_buffered_ops = config["client_max_buffered_ops"].uint64_value();
+    if (!client_max_buffered_ops)
+    {
+        client_max_buffered_ops = DEFAULT_CLIENT_MAX_BUFFERED_OPS;
+    }
+    // client_max_writeback_iodepth
+    client_max_writeback_iodepth = config["client_max_writeback_iodepth"].uint64_value();
+    if (!client_max_writeback_iodepth)
+    {
+        client_max_writeback_iodepth = DEFAULT_CLIENT_MAX_WRITEBACK_IODEPTH;
+    }
+    // up_wait_retry_interval
    up_wait_retry_interval = config["up_wait_retry_interval"].uint64_value();
    if (!up_wait_retry_interval)
    {
@@ -374,6 +409,8 @@ void cluster_client_t::on_change_hook(std::map<std::string, etcd_kv_t> & changes

 bool cluster_client_t::get_immediate_commit(uint64_t inode)
 {
+    if (enable_writeback)
+        return false;
    pool_id_t pool_id = INODE_POOL(inode);
    if (!pool_id)
        return true;
@@ -408,6 +445,41 @@ void cluster_client_t::on_ready(std::function<void(void)> fn)
    }
 }

+bool cluster_client_t::flush()
+{
+    if (!ringloop)
+    {
+        if (wb->writeback_queue.size())
+        {
+            wb->start_writebacks(this, 0);
+            cluster_op_t *sync = new cluster_op_t;
+            sync->opcode = OSD_OP_SYNC;
+            sync->callback = [this](cluster_op_t *sync)
+            {
+                delete sync;
+            };
+            execute(sync);
+        }
+        return op_queue_head == NULL;
+    }
+    bool sync_done = false;
+    cluster_op_t *sync = new cluster_op_t;
+    sync->opcode = OSD_OP_SYNC;
+    sync->callback = [this, &sync_done](cluster_op_t *sync)
+    {
+        delete sync;
+        sync_done = true;
+    };
+    execute(sync);
+    while (!sync_done)
+    {
+        ringloop->loop();
+        if (!sync_done)
+            ringloop->wait();
+    }
+    return true;
+}
+
 /**
 * How writes are synced when immediate_commit is false
 *
@@ -428,6 +500,9 @@ void cluster_client_t::on_ready(std::function<void(void)> fn)
 * 3) if yes, send all SYNCs. otherwise, leave current SYNC as is.
 * 4) if any of them fail due to disconnected peers, repeat SYNC after repeating all writes
 * 5) if any of them fail due to other errors, fail the SYNC operation
+ *
+ * If writeback caching is turned on and the writeback limit is not exhausted,
+ * the data is just copied and the write is confirmed to the client.
 */
 void cluster_client_t::execute(cluster_op_t *op)
 {
@@ -443,67 +518,73 @@ void cluster_client_t::execute(cluster_op_t *op)
        offline_ops.push_back(op);
        return;
    }
+    op->flags = op->flags & OSD_OP_IGNORE_READONLY; // the only allowed flag
+    execute_internal(op);
+}
+
+void cluster_client_t::execute_internal(cluster_op_t *op)
+{
    op->cur_inode = op->inode;
    op->retval = 0;
-    op->flags = op->flags & OSD_OP_IGNORE_READONLY; // single allowed flag
-    if (op->opcode != OSD_OP_SYNC)
+    // check alignment, readonly flag and so on
+    if (!check_rw(op))
    {
-        pool_id_t pool_id = INODE_POOL(op->cur_inode);
-        if (!pool_id)
+        return;
+    }
+    if (op->opcode == OSD_OP_WRITE && enable_writeback && !(op->flags & OP_FLUSH_BUFFER) &&
+        !op->version /* FIXME no CAS writeback */)
+    {
+        if (wb->writebacks_active >= client_max_writeback_iodepth)
        {
-            op->retval = -EINVAL;
-            std::function<void(cluster_op_t*)>(op->callback)(op);
+            // Writeback queue is full, postpone the operation
+            wb->writeback_overflow.push_back(op);
            return;
        }
-        auto pool_it = st_cli.pool_config.find(pool_id);
-        if (pool_it == st_cli.pool_config.end() || pool_it->second.real_pg_count == 0)
+        // Just copy and acknowledge the operation
+        wb->copy_write(op, CACHE_DIRTY);
+        while (wb->writeback_bytes + op->len > client_max_buffered_bytes || wb->writeback_queue_size > client_max_buffered_ops)
        {
-            // Pools are loaded, but this one is unknown
-            op->retval = -EINVAL;
-            std::function<void(cluster_op_t*)>(op->callback)(op);
-            return;
-        }
-        // Check alignment
-        if (!op->len && (op->opcode == OSD_OP_READ || op->opcode == OSD_OP_READ_BITMAP || op->opcode == OSD_OP_READ_CHAIN_BITMAP || op->opcode == OSD_OP_WRITE) ||
-            op->offset % pool_it->second.bitmap_granularity || op->len % pool_it->second.bitmap_granularity)
-        {
-            op->retval = -EINVAL;
-            std::function<void(cluster_op_t*)>(op->callback)(op);
-            return;
-        }
-        if (pool_it->second.immediate_commit == IMMEDIATE_ALL)
-        {
-            op->flags |= OP_IMMEDIATE_COMMIT;
+            // Initiate some writeback (asynchronously)
+            wb->start_writebacks(this, 1);
        }
+        op->retval = op->len;
+        std::function<void(cluster_op_t*)>(op->callback)(op);
+        return;
    }
    if (op->opcode == OSD_OP_WRITE && !(op->flags & OP_IMMEDIATE_COMMIT))
    {
+        if (!(op->flags & OP_FLUSH_BUFFER))
+        {
+            wb->copy_write(op, CACHE_WRITTEN);
+        }
        if (dirty_bytes >= client_max_dirty_bytes || dirty_ops >= client_max_dirty_ops)
        {
            // Push an extra SYNC operation to flush previous writes
            cluster_op_t *sync_op = new cluster_op_t;
            sync_op->opcode = OSD_OP_SYNC;
+            sync_op->flags = OP_FLUSH_BUFFER;
            sync_op->callback = [](cluster_op_t* sync_op)
            {
                delete sync_op;
            };
-            sync_op->prev = op_queue_tail;
-            if (op_queue_tail)
-            {
-                op_queue_tail->next = sync_op;
-                op_queue_tail = sync_op;
-            }
-            else
-                op_queue_tail = op_queue_head = sync_op;
-            dirty_bytes = 0;
-            dirty_ops = 0;
-            calc_wait(sync_op);
+            execute_internal(sync_op);
        }
        dirty_bytes += op->len;
        dirty_ops++;
    }
    else if (op->opcode == OSD_OP_SYNC)
    {
+        // Flush the whole write-back queue first
+        if (!(op->flags & OP_FLUSH_BUFFER) && wb->writeback_overflow.size() > 0)
+        {
+            // Earlier operations are still postponed in writeback_overflow, so queue this SYNC behind them
+            wb->writeback_overflow.push_back(op);
+            return;
+        }
+        if (wb->writeback_queue.size())
+        {
+            wb->start_writebacks(this, 0);
+        }
        dirty_bytes = 0;
        dirty_ops = 0;
    }
@@ -515,7 +596,7 @@ void cluster_client_t::execute(cluster_op_t *op)
    }
    else
        op_queue_tail = op_queue_head = op;
-    if (!(op->flags & OP_IMMEDIATE_COMMIT))
+    if (!(op->flags & OP_IMMEDIATE_COMMIT) || enable_writeback)
        calc_wait(op);
    else
    {
@@ -526,6 +607,52 @@ void cluster_client_t::execute(cluster_op_t *op)
    }
 }

+bool cluster_client_t::check_rw(cluster_op_t *op)
+{
+    if (op->opcode == OSD_OP_SYNC)
+    {
+        return true;
+    }
+    pool_id_t pool_id = INODE_POOL(op->cur_inode);
+    if (!pool_id)
+    {
+        op->retval = -EINVAL;
+        std::function<void(cluster_op_t*)>(op->callback)(op);
+        return false;
+    }
+    auto pool_it = st_cli.pool_config.find(pool_id);
+    if (pool_it == st_cli.pool_config.end() || pool_it->second.real_pg_count == 0)
+    {
+        // Pools are loaded, but this one is unknown
+        op->retval = -EINVAL;
+        std::function<void(cluster_op_t*)>(op->callback)(op);
+        return false;
+    }
+    // Check alignment
+    if (!op->len && (op->opcode == OSD_OP_READ || op->opcode == OSD_OP_READ_BITMAP || op->opcode == OSD_OP_READ_CHAIN_BITMAP || op->opcode == OSD_OP_WRITE) ||
+        op->offset % pool_it->second.bitmap_granularity || op->len % pool_it->second.bitmap_granularity)
+    {
+        op->retval = -EINVAL;
+        std::function<void(cluster_op_t*)>(op->callback)(op);
+        return false;
+    }
+    if (pool_it->second.immediate_commit == IMMEDIATE_ALL)
+    {
+        op->flags |= OP_IMMEDIATE_COMMIT;
+    }
+    if ((op->opcode == OSD_OP_WRITE || op->opcode == OSD_OP_DELETE) && !(op->flags & OSD_OP_IGNORE_READONLY))
+    {
+        auto ino_it = st_cli.inode_config.find(op->inode);
+        if (ino_it != st_cli.inode_config.end() && ino_it->second.readonly)
+        {
+            op->retval = -EROFS;
+            std::function<void(cluster_op_t*)>(op->callback)(op);
+            return false;
+        }
+    }
+    return true;
+}
+
 void cluster_client_t::execute_raw(osd_num_t osd_num, osd_op_t *op)
 {
    auto fd_it = msgr.osd_peer_fds.find(osd_num);
@@ -543,114 +670,6 @@ void cluster_client_t::execute_raw(osd_num_t osd_num, osd_op_t *op)
    }
 }

-void cluster_client_t::copy_write(cluster_op_t *op, std::map<object_id, cluster_buffer_t> & dirty_buffers)
-{
-    // Save operation for replay when one of PGs goes out of sync
-    // (primary OSD drops our connection in this case)
-    auto dirty_it = dirty_buffers.lower_bound((object_id){
-        .inode = op->inode,
-        .stripe = op->offset,
-    });
-    while (dirty_it != dirty_buffers.begin())
-    {
-        dirty_it--;
-        if (dirty_it->first.inode != op->inode ||
-            (dirty_it->first.stripe + dirty_it->second.len) <= op->offset)
-        {
-            dirty_it++;
-            break;
-        }
-    }
-    uint64_t pos = op->offset, len = op->len, iov_idx = 0, iov_pos = 0;
-    while (len > 0)
-    {
-        uint64_t new_len = 0;
-        if (dirty_it == dirty_buffers.end())
-        {
-            new_len = len;
-        }
-        else if (dirty_it->first.inode != op->inode || dirty_it->first.stripe > pos)
-        {
-            new_len = dirty_it->first.stripe - pos;
-            if (new_len > len)
-            {
-                new_len = len;
-            }
-        }
-        if (new_len > 0)
-        {
-            dirty_it = dirty_buffers.emplace_hint(dirty_it, (object_id){
-                .inode = op->inode,
-                .stripe = pos,
-            }, (cluster_buffer_t){
-                .buf = malloc_or_die(new_len),
-                .len = new_len,
-            });
-        }
-        // FIXME: Split big buffers into smaller ones on overwrites. But this will require refcounting
-        dirty_it->second.state = CACHE_DIRTY;
-        uint64_t cur_len = (dirty_it->first.stripe + dirty_it->second.len - pos);
-        if (cur_len > len)
-        {
-            cur_len = len;
-        }
-        while (cur_len > 0 && iov_idx < op->iov.count)
-        {
-            unsigned iov_len = (op->iov.buf[iov_idx].iov_len - iov_pos);
-            if (iov_len <= cur_len)
-            {
-                memcpy((uint8_t*)dirty_it->second.buf + pos - dirty_it->first.stripe,
-                    (uint8_t*)op->iov.buf[iov_idx].iov_base + iov_pos, iov_len);
-                pos += iov_len;
-                len -= iov_len;
-                cur_len -= iov_len;
-                iov_pos = 0;
-                iov_idx++;
-            }
-            else
-            {
-                memcpy((uint8_t*)dirty_it->second.buf + pos - dirty_it->first.stripe,
-                    (uint8_t*)op->iov.buf[iov_idx].iov_base + iov_pos, cur_len);
-                pos += cur_len;
-                len -= cur_len;
-                iov_pos += cur_len;
-                cur_len = 0;
-            }
-        }
-        dirty_it++;
-    }
-}
-
-void cluster_client_t::flush_buffer(const object_id & oid, cluster_buffer_t *wr)
-{
-    wr->state = CACHE_REPEATING;
-    cluster_op_t *op = new cluster_op_t;
-    op->flags = OSD_OP_IGNORE_READONLY|OP_FLUSH_BUFFER;
-    op->opcode = OSD_OP_WRITE;
-    op->cur_inode = op->inode = oid.inode;
-    op->offset = oid.stripe;
-    op->len = wr->len;
-    op->iov.push_back(wr->buf, wr->len);
-    op->callback = [wr](cluster_op_t* op)
-    {
-        if (wr->state == CACHE_REPEATING)
-        {
-            wr->state = CACHE_DIRTY;
-        }
-        delete op;
-    };
-    op->next = op_queue_head;
-    if (op_queue_head)
-    {
-        op_queue_head->prev = op;
-        op_queue_head = op;
-    }
-    else
-        op_queue_tail = op_queue_head = op;
-    inc_wait(op->opcode, op->flags, op->next, 1);
-    continue_rw(op);
-}
-
 int cluster_client_t::continue_rw(cluster_op_t *op)
 {
    if (op->state == 0)
@@ -659,27 +678,7 @@ int cluster_client_t::continue_rw(cluster_op_t *op)
        goto resume_1;
    else if (op->state == 2)
        goto resume_2;
-    else if (op->state == 3)
-        goto resume_3;
 resume_0:
-    if (op->opcode == OSD_OP_WRITE || op->opcode == OSD_OP_DELETE)
-    {
-        if (!(op->flags & OSD_OP_IGNORE_READONLY))
-        {
-            auto ino_it = st_cli.inode_config.find(op->inode);
-            if (ino_it != st_cli.inode_config.end() && ino_it->second.readonly)
-            {
-                op->retval = -EINVAL;
-                erase_op(op);
-                return 1;
-            }
-        }
-        if (op->opcode == OSD_OP_WRITE && !(op->flags & OP_IMMEDIATE_COMMIT) && !(op->flags & OP_FLUSH_BUFFER))
-        {
-            copy_write(op, dirty_buffers);
-        }
-    }
-resume_1:
    // Slice the operation into parts
    slice_rw(op);
    op->needs_reslice = false;
@@ -690,9 +689,9 @@ resume_1:
        erase_op(op);
        return 1;
    }
-resume_2:
+resume_1:
    // Send unsent parts, if they're not subject to change
-    op->state = 3;
+    op->state = 2;
    if (op->needs_reslice)
    {
        for (int i = 0; i < op->parts.size(); i++)
@@ -702,7 +701,7 @@ resume_2:
                op->retval = -EPIPE;
            }
        }
-        goto resume_3;
+        goto resume_2;
    }
    for (int i = 0; i < op->parts.size(); i++)
    {
@@ -723,18 +722,18 @@ resume_2:
                        });
                    }
                }
-                op->state = 2;
+                op->state = 1;
            }
        }
    }
-    if (op->state == 2)
+    if (op->state == 1)
    {
        return 0;
    }
-resume_3:
+resume_2:
    if (op->inflight_count > 0)
    {
-        op->state = 3;
+        op->state = 2;
        return 0;
    }
    if (op->done_count >= op->parts.size())
@@ -762,7 +761,7 @@ resume_3:
                op->cur_inode = ino_it->second.parent_id;
                op->parts.clear();
                op->done_count = 0;
-                goto resume_1;
+                goto resume_0;
            }
        }
        op->retval = op->len;
@@ -774,7 +773,8 @@ resume_3:
        erase_op(op);
        return 1;
    }
-    else if (op->retval != 0 && op->retval != -EPIPE && op->retval != -EIO && op->retval != -ENOSPC)
+    else if (op->retval != 0 && !(op->flags & OP_FLUSH_BUFFER) &&
+        op->retval != -EPIPE && op->retval != -EIO && op->retval != -ENOSPC)
    {
        // Fatal error (neither -EPIPE, -EIO nor -ENOSPC)
        // FIXME: Add a parameter to allow to not wait for EIOs (incomplete or corrupted objects) to heal
@@ -789,7 +789,7 @@ resume_3:
        {
            op->parts.clear();
            op->done_count = 0;
-            goto resume_1;
+            goto resume_0;
        }
        else
        {
@@ -800,7 +800,7 @@ resume_3:
                    op->parts[i].flags = PART_RETRY;
                }
            }
-            goto resume_2;
+            goto resume_1;
        }
    }
    return 0;
@@ -874,6 +874,11 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
    int iov_idx = 0;
    size_t iov_pos = 0;
    int i = 0;
+    // We also have to return reads from CACHE_REPEATING buffers - they are not
+    // guaranteed to be present on target OSDs at the moment of repeating.
+    // And we're also free to return data from other cached buffers just
+    // because it's faster.
+    bool dirty_copied = wb->read_from_cache(op, pool_cfg.bitmap_granularity);
    for (uint64_t stripe = first_stripe; stripe <= last_stripe; stripe += pg_block_size)
    {
        pg_num_t pg_num = (stripe/pool_cfg.pg_stripe_size) % pool_cfg.real_pg_count + 1; // like map_to_pg()
@@ -881,7 +886,8 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
        uint64_t end = (op->offset + op->len) > (stripe + pg_block_size)
            ? (stripe + pg_block_size) : (op->offset + op->len);
        op->parts[i].iov.reset();
-        if (op->cur_inode != op->inode)
+        op->parts[i].flags = 0;
+        if (op->cur_inode != op->inode || (op->opcode == OSD_OP_READ && dirty_copied))
        {
            // Read remaining parts from upper layers
            uint64_t prev = begin, cur = begin;
@@ -918,7 +924,10 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
            else
                add_iov(cur-prev, skip_prev, op, iov_idx, iov_pos, op->parts[i].iov, scrap_buffer, scrap_buffer_size);
            if (end == begin)
+            {
                op->done_count++;
+                op->parts[i].flags = PART_DONE;
+            }
        }
        else if (op->opcode != OSD_OP_READ_BITMAP && op->opcode != OSD_OP_READ_CHAIN_BITMAP && op->opcode != OSD_OP_DELETE)
        {
@@ -930,7 +939,6 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
            op->opcode == OSD_OP_DELETE ? 0 : (uint32_t)(end - begin);
        op->parts[i].pg_num = pg_num;
        op->parts[i].osd_num = 0;
-        op->parts[i].flags = 0;
        i++;
    }
 }
@@ -1042,13 +1050,7 @@ int cluster_client_t::continue_sync(cluster_op_t *op)
            do_it++;
    }
    // Post sync to affected OSDs
-    for (auto & prev_op: dirty_buffers)
-    {
-        if (prev_op.second.state == CACHE_DIRTY)
-        {
-            prev_op.second.state = CACHE_FLUSHING;
-        }
-    }
+    wb->fsync_start();
    op->parts.resize(dirty_osds.size());
    op->retval = 0;
    {
@@ -1073,13 +1075,7 @@ resume_1:
    }
    if (op->retval != 0)
    {
-        for (auto uw_it = dirty_buffers.begin(); uw_it != dirty_buffers.end(); uw_it++)
-        {
-            if (uw_it->second.state == CACHE_FLUSHING)
-            {
-                uw_it->second.state = CACHE_DIRTY;
-            }
-        }
+        wb->fsync_error();
        if (op->retval == -EPIPE || op->retval == -EIO || op->retval == -ENOSPC)
        {
            // Retry later
@@ -1093,16 +1089,7 @@ resume_1:
    }
    else
    {
-        for (auto uw_it = dirty_buffers.begin(); uw_it != dirty_buffers.end(); )
-        {
-            if (uw_it->second.state == CACHE_FLUSHING)
-            {
-                free(uw_it->second.buf);
-                dirty_buffers.erase(uw_it++);
-            }
-            else
-                uw_it++;
-        }
+        wb->fsync_ok();
    }
    erase_op(op);
    return 1;
--- a/src/cluster_client.h
+++ b/src/cluster_client.h
@@ -8,6 +8,9 @@

 #define DEFAULT_CLIENT_MAX_DIRTY_BYTES 32*1024*1024
 #define DEFAULT_CLIENT_MAX_DIRTY_OPS 1024
+#define DEFAULT_CLIENT_MAX_BUFFERED_BYTES 32*1024*1024
+#define DEFAULT_CLIENT_MAX_BUFFERED_OPS 1024
+#define DEFAULT_CLIENT_MAX_WRITEBACK_IODEPTH 256
 #define INODE_LIST_DONE 1
 #define INODE_LIST_HAS_UNSTABLE 2
 #define OSD_OP_READ_BITMAP OSD_OP_SEC_READ_BMP
@@ -64,17 +67,12 @@ protected:
    cluster_op_t *prev = NULL, *next = NULL;
    int prev_wait = 0;
    friend class cluster_client_t;
-};
-
-struct cluster_buffer_t
-{
-    void *buf;
-    uint64_t len;
-    int state;
+    friend class writeback_cache_t;
 };

 struct inode_list_t;
 struct inode_list_osd_t;
+class writeback_cache_t;

 // FIXME: Split into public and private interfaces
 class cluster_client_t
@@ -83,16 +81,23 @@ class cluster_client_t
    ring_loop_t *ringloop;

    std::map<pool_id_t, uint64_t> pg_counts;
-    // FIXME: Implement inmemory_commit mode. Note that it requires to return overlapping reads from memory.
+    // client_max_dirty_* is actually "max unsynced", for the case when immediate_commit is off
    uint64_t client_max_dirty_bytes = 0;
    uint64_t client_max_dirty_ops = 0;
+    // writeback improves (1) small consecutive writes and (2) Q1 writes without fsync
+    bool enable_writeback = false;
+    // client_max_buffered_* is the real "dirty limit" - maximum amount of writes buffered in memory
+    uint64_t client_max_buffered_bytes = 0;
+    uint64_t client_max_buffered_ops = 0;
+    uint64_t client_max_writeback_iodepth = 0;
+
    int log_level;
    int up_wait_retry_interval = 500; // ms

    int retry_timeout_id = 0;
    std::vector<cluster_op_t*> offline_ops;
    cluster_op_t *op_queue_head = NULL, *op_queue_tail = NULL;
-    std::map<object_id, cluster_buffer_t> dirty_buffers;
+    writeback_cache_t *wb = NULL;
    std::set<osd_num_t> dirty_osds;
    uint64_t dirty_bytes = 0, dirty_ops = 0;

@@ -122,10 +127,10 @@ public:
    void execute_raw(osd_num_t osd_num, osd_op_t *op);
    bool is_ready();
    void on_ready(std::function<void(void)> fn);
+    bool flush();

    bool get_immediate_commit(uint64_t inode);

-    static void copy_write(cluster_op_t *op, std::map<object_id, cluster_buffer_t> & dirty_buffers);
    void continue_ops(bool up_retry = false);
    inode_list_t *list_inode_start(inode_t inode,
        std::function<void(inode_list_t* lst, std::set<object_id>&& objects, pg_num_t pg_num, osd_num_t primary_osd, int status)> callback);
@@ -138,12 +143,14 @@ public:

 protected:
    bool affects_osd(uint64_t inode, uint64_t offset, uint64_t len, osd_num_t osd);
-    void flush_buffer(const object_id & oid, cluster_buffer_t *wr);
    void on_load_config_hook(json11::Json::object & config);
    void on_load_pgs_hook(bool success);
    void on_change_hook(std::map<std::string, etcd_kv_t> & changes);
    void on_change_osd_state_hook(uint64_t peer_osd);
+    void execute_internal(cluster_op_t *op);
+    void unshift_op(cluster_op_t *op);
    int continue_rw(cluster_op_t *op);
+    bool check_rw(cluster_op_t *op);
    void slice_rw(cluster_op_t *op);
    bool try_send(cluster_op_t *op, int i);
    int continue_sync(cluster_op_t *op);
@@ -157,4 +164,6 @@ protected:
    void continue_listing(inode_list_t *lst);
    void send_list(inode_list_osd_t *cur_list);
    void continue_raw_ops(osd_num_t peer_osd);
+
+    friend class writeback_cache_t;
 };
--- a/src/cluster_client_impl.h
+++ b/src/cluster_client_impl.h
@@ -0,0 +1,57 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
+
+#pragma once
+
+#include "cluster_client.h"
+
+#define SCRAP_BUFFER_SIZE 4*1024*1024
+#define PART_SENT 1
+#define PART_DONE 2
+#define PART_ERROR 4
+#define PART_RETRY 8
+#define CACHE_DIRTY 1
+#define CACHE_WRITTEN 2
+#define CACHE_FLUSHING 3
+#define CACHE_REPEATING 4
+#define OP_FLUSH_BUFFER 0x02
+#define OP_IMMEDIATE_COMMIT 0x04
+
+struct cluster_buffer_t
+{
+    uint8_t *buf;
+    uint64_t len;
+    int state;
+    uint64_t flush_id;
+    uint64_t *refcnt;
+};
+
+typedef std::map<object_id, cluster_buffer_t>::iterator dirty_buf_it_t;
+
+class writeback_cache_t
+{
+public:
+    uint64_t writeback_bytes = 0;
+    int writeback_queue_size = 0;
+    int writebacks_active = 0;
+    uint64_t last_flush_id = 0;
+
+    std::map<object_id, cluster_buffer_t> dirty_buffers;
+    std::vector<cluster_op_t*> writeback_overflow;
+    std::vector<object_id> writeback_queue;
+    std::multimap<uint64_t, uint64_t*> flushed_buffers; // flush_id => refcnt
+
+    ~writeback_cache_t();
+    dirty_buf_it_t find_dirty(uint64_t inode, uint64_t offset);
+    bool is_left_merged(dirty_buf_it_t dirty_it);
+    bool is_right_merged(dirty_buf_it_t dirty_it);
+    bool is_merged(const dirty_buf_it_t & dirty_it);
+    void copy_write(cluster_op_t *op, int state);
+    int repeat_ops_for(cluster_client_t *cli, osd_num_t peer_osd);
+    void start_writebacks(cluster_client_t *cli, int count);
+    bool read_from_cache(cluster_op_t *op, uint32_t bitmap_granularity);
+    void flush_buffers(cluster_client_t *cli, dirty_buf_it_t from_it, dirty_buf_it_t to_it);
+    void fsync_start();
+    void fsync_error();
+    void fsync_ok();
+};
--- a/src/cluster_client_wb.cpp
+++ b/src/cluster_client_wb.cpp
@@ -0,0 +1,498 @@
+// Copyright (c) Vitaliy Filippov, 2019+
+// License: VNPL-1.1 or GNU GPL-2.0+ (see README.md for details)
+
+#include <cassert>
+
+#include "cluster_client_impl.h"
+
+writeback_cache_t::~writeback_cache_t()
+{
+    for (auto & bp: dirty_buffers)
+    {
+        if (!--(*bp.second.refcnt))
+        {
+            free(bp.second.refcnt); // refcnt is allocated with the buffer
+        }
+    }
+    dirty_buffers.clear();
+}
+
+dirty_buf_it_t writeback_cache_t::find_dirty(uint64_t inode, uint64_t offset)
+{
+    auto dirty_it = dirty_buffers.lower_bound((object_id){
+        .inode = inode,
+        .stripe = offset,
+    });
+    while (dirty_it != dirty_buffers.begin())
+    {
+        dirty_it--;
+        if (dirty_it->first.inode != inode ||
+            (dirty_it->first.stripe + dirty_it->second.len) <= offset)
+        {
+            dirty_it++;
+            break;
+        }
+    }
+    return dirty_it;
+}
+
+bool writeback_cache_t::is_left_merged(dirty_buf_it_t dirty_it)
+{
+    if (dirty_it != dirty_buffers.begin())
+    {
+        auto prev_it = dirty_it;
+        prev_it--;
+        if (prev_it->first.inode == dirty_it->first.inode &&
+            prev_it->first.stripe+prev_it->second.len == dirty_it->first.stripe &&
+            prev_it->second.state == CACHE_DIRTY)
+        {
+            return true;
+        }
+    }
+    return false;
+}
+
+bool writeback_cache_t::is_right_merged(dirty_buf_it_t dirty_it)
+{
+    auto next_it = dirty_it;
+    next_it++;
+    if (next_it != dirty_buffers.end() &&
+        next_it->first.inode == dirty_it->first.inode &&
+        next_it->first.stripe == dirty_it->first.stripe+dirty_it->second.len &&
+        next_it->second.state == CACHE_DIRTY)
+    {
+        return true;
+    }
+    return false;
+}
+
+bool writeback_cache_t::is_merged(const dirty_buf_it_t & dirty_it)
+{
+    return is_left_merged(dirty_it) || is_right_merged(dirty_it);
+}
+
+void writeback_cache_t::copy_write(cluster_op_t *op, int state)
+{
+    // Save operation for replay when one of PGs goes out of sync
+    // (primary OSD drops our connection in this case)
+    // ...or just save it for writeback if write buffering is enabled
+    if (op->len == 0)
+    {
+        return;
+    }
+    auto dirty_it = find_dirty(op->inode, op->offset);
+    auto new_end = op->offset + op->len;
+    while (dirty_it != dirty_buffers.end() &&
+        dirty_it->first.inode == op->inode &&
+        dirty_it->first.stripe < op->offset+op->len)
+    {
+        assert(dirty_it->first.stripe + dirty_it->second.len > op->offset);
+        // Remove overlapping part(s) of buffers
+        auto old_end = dirty_it->first.stripe + dirty_it->second.len;
+        if (dirty_it->first.stripe < op->offset)
+        {
+            if (old_end > new_end)
+            {
+                // Split into end and start
+                dirty_it->second.len = op->offset - dirty_it->first.stripe;
+                dirty_it = dirty_buffers.emplace_hint(dirty_it, (object_id){
+                    .inode = op->inode,
+                    .stripe = new_end,
+                }, (cluster_buffer_t){
+                    .buf = dirty_it->second.buf + new_end - dirty_it->first.stripe,
+                    .len = old_end - new_end,
+                    .state = dirty_it->second.state,
+                    .flush_id = dirty_it->second.flush_id,
+                    .refcnt = dirty_it->second.refcnt,
+                });
+                (*dirty_it->second.refcnt)++;
+                if (dirty_it->second.state == CACHE_DIRTY)
+                {
+                    writeback_bytes -= op->len;
+                    writeback_queue_size++;
+                }
+                break;
+            }
+            else
+            {
+                // Only leave the beginning
+                if (dirty_it->second.state == CACHE_DIRTY)
+                {
+                    writeback_bytes -= old_end - op->offset;
+                    if (is_left_merged(dirty_it) && !is_right_merged(dirty_it))
+                    {
+                        writeback_queue_size++;
+                    }
+                }
+                dirty_it->second.len = op->offset - dirty_it->first.stripe;
+                dirty_it++;
+            }
+        }
+        else if (old_end > new_end)
+        {
+            // Only leave the end
+            if (dirty_it->second.state == CACHE_DIRTY)
+            {
+                writeback_bytes -= new_end - dirty_it->first.stripe;
+                if (!is_left_merged(dirty_it) && is_right_merged(dirty_it))
+                {
+                    writeback_queue_size++;
+                }
+            }
+            auto new_dirty_it = dirty_buffers.emplace_hint(dirty_it, (object_id){
+                .inode = op->inode,
+                .stripe = new_end,
+            }, (cluster_buffer_t){
+                .buf = dirty_it->second.buf + new_end - dirty_it->first.stripe,
+                .len = old_end - new_end,
+                .state = dirty_it->second.state,
+                .flush_id = dirty_it->second.flush_id,
+                .refcnt = dirty_it->second.refcnt,
+            });
+            dirty_buffers.erase(dirty_it);
+            dirty_it = new_dirty_it;
+            break;
+        }
+        else
+        {
+            // Remove the whole buffer
+            if (dirty_it->second.state == CACHE_DIRTY && !is_merged(dirty_it))
+            {
+                writeback_bytes -= dirty_it->second.len;
+                assert(writeback_queue_size > 0);
+                writeback_queue_size--;
+            }
+            if (!--(*dirty_it->second.refcnt))
+            {
+                free(dirty_it->second.refcnt);
+            }
+            dirty_buffers.erase(dirty_it++);
+        }
+    }
+    // Overlapping buffers are removed, just insert the new one
+    uint64_t *refcnt = (uint64_t*)malloc_or_die(sizeof(uint64_t) + op->len);
+    uint8_t *buf = (uint8_t*)refcnt + sizeof(uint64_t);
+    *refcnt = 1;
+    dirty_it = dirty_buffers.emplace_hint(dirty_it, (object_id){
+        .inode = op->inode,
+        .stripe = op->offset,
+    }, (cluster_buffer_t){
+        .buf = buf,
+        .len = op->len,
+        .state = state,
+        .refcnt = refcnt,
+    });
+    if (state == CACHE_DIRTY)
+    {
+        writeback_bytes += op->len;
+        // Track consecutive write-back operations
+        if (!is_merged(dirty_it))
+        {
+            // <writeback_queue> is OK to contain more than actual number of consecutive
+            // requests as long as it doesn't miss anything. But <writeback_queue_size>
+            // is always calculated correctly.
+            writeback_queue_size++;
+            writeback_queue.push_back((object_id){
+                .inode = op->inode,
+                .stripe = op->offset,
+            });
+        }
+    }
+    uint64_t pos = 0, iov_idx = 0;
+    while (pos < op->len && iov_idx < op->iov.count)
+    {
+        auto & iov = op->iov.buf[iov_idx];
+        memcpy(buf + pos, iov.iov_base, iov.iov_len);
+        pos += iov.iov_len;
+        iov_idx++;
+    }
+}
+
+int writeback_cache_t::repeat_ops_for(cluster_client_t *cli, osd_num_t peer_osd)
+{
+    int repeated = 0;
+    if (dirty_buffers.size())
+    {
+        // peer_osd just dropped connection
+        // determine WHICH dirty_buffers are now obsolete and repeat them
+        for (auto wr_it = dirty_buffers.begin(), flush_it = wr_it, last_it = wr_it; ; )
+        {
+            bool end = wr_it == dirty_buffers.end();
+            bool flush_this = !end && wr_it->second.state != CACHE_REPEATING &&
+                cli->affects_osd(wr_it->first.inode, wr_it->first.stripe, wr_it->second.len, peer_osd);
+            if (flush_it != wr_it && (end || !flush_this ||
+                wr_it->first.inode != flush_it->first.inode ||
+                wr_it->first.stripe != last_it->first.stripe+last_it->second.len))
+            {
+                repeated++;
+                flush_buffers(cli, flush_it, wr_it);
+                flush_it = wr_it;
+            }
+            if (end)
+                break;
+            last_it = wr_it;
+            wr_it++;
+            if (!flush_this)
+                flush_it = wr_it;
+        }
+    }
+    return repeated;
+}
+
+void writeback_cache_t::flush_buffers(cluster_client_t *cli, dirty_buf_it_t from_it, dirty_buf_it_t to_it)
+{
+    auto prev_it = to_it;
+    prev_it--;
+    bool is_writeback = from_it->second.state == CACHE_DIRTY;
+    cluster_op_t *op = new cluster_op_t;
+    op->flags = OSD_OP_IGNORE_READONLY|OP_FLUSH_BUFFER;
+    op->opcode = OSD_OP_WRITE;
+    op->cur_inode = op->inode = from_it->first.inode;
+    op->offset = from_it->first.stripe;
+    op->len = prev_it->first.stripe + prev_it->second.len - from_it->first.stripe;
+    uint32_t calc_len = 0;
+    uint64_t flush_id = ++last_flush_id;
+    for (auto it = from_it; it != to_it; it++)
+    {
+        it->second.state = CACHE_REPEATING;
+        it->second.flush_id = flush_id;
+        (*it->second.refcnt)++;
+        flushed_buffers.emplace(flush_id, it->second.refcnt);
+        op->iov.push_back(it->second.buf, it->second.len);
+        calc_len += it->second.len;
+    }
+    assert(calc_len == op->len);
+    writebacks_active++;
+    op->callback = [this, cli, flush_id](cluster_op_t* op)
+    {
+        // Buffer flushes should always be retried, regardless of the error,
+        // so they should never result in an error here
+        assert(op->retval == op->len);
+        for (auto fl_it = flushed_buffers.find(flush_id);
+            fl_it != flushed_buffers.end() && fl_it->first == flush_id; )
+        {
+            if (!--(*fl_it->second)) // refcnt
+            {
+                free(fl_it->second);
+            }
+            flushed_buffers.erase(fl_it++);
+        }
+        for (auto dirty_it = find_dirty(op->inode, op->offset);
+            dirty_it != dirty_buffers.end() && dirty_it->first.inode == op->inode &&
+            dirty_it->first.stripe < op->offset+op->len; dirty_it++)
+        {
+            if (dirty_it->second.flush_id == flush_id && dirty_it->second.state == CACHE_REPEATING)
+            {
+                dirty_it->second.flush_id = 0;
+                dirty_it->second.state = CACHE_WRITTEN;
+            }
+        }
+        delete op;
+        writebacks_active--;
+        // We can't call execute_internal because it affects an invalid copy of the list here
+        // (erase_op remembers `next` after writeback callback)
+    };
+    if (is_writeback)
+    {
+        cli->execute_internal(op);
+    }
+    else
+    {
+        // Insert repeated flushes into the beginning
+        cli->unshift_op(op);
+        cli->continue_rw(op);
+    }
+}
+
+void writeback_cache_t::start_writebacks(cluster_client_t *cli, int count)
+{
+    if (!writeback_queue.size())
+    {
+        return;
+    }
+    std::vector<object_id> queue_copy;
+    queue_copy.swap(writeback_queue);
+    int started = 0, i = 0;
+    for (i = 0; i < queue_copy.size() && (!count || started < count); i++)
+    {
+        object_id & req = queue_copy[i];
+        auto dirty_it = find_dirty(req.inode, req.stripe);
+        if (dirty_it == dirty_buffers.end() ||
+            dirty_it->first.inode != req.inode ||
+            dirty_it->second.state != CACHE_DIRTY)
+        {
+            continue;
+        }
+        auto from_it = dirty_it;
+        uint64_t off = dirty_it->first.stripe;
+        while (from_it != dirty_buffers.begin())
+        {
+            from_it--;
+            if (from_it->second.state != CACHE_DIRTY ||
+                from_it->first.inode != req.inode ||
+                from_it->first.stripe+from_it->second.len != off)
+            {
+                from_it++;
+                break;
+            }
+            off = from_it->first.stripe;
+        }
+        off = dirty_it->first.stripe + dirty_it->second.len;
+        auto to_it = dirty_it;
+        to_it++;
+        while (to_it != dirty_buffers.end())
+        {
+            if (to_it->second.state != CACHE_DIRTY ||
+                to_it->first.inode != req.inode ||
+                to_it->first.stripe != off)
+            {
+                break;
+            }
+            off = to_it->first.stripe + to_it->second.len;
+            to_it++;
+        }
+        started++;
+        assert(writeback_queue_size > 0);
+        writeback_queue_size--;
+        writeback_bytes -= off - from_it->first.stripe;
+        flush_buffers(cli, from_it, to_it);
+    }
+    queue_copy.erase(queue_copy.begin(), queue_copy.begin()+i);
+    if (writeback_queue.size())
+    {
+        queue_copy.insert(queue_copy.end(), writeback_queue.begin(), writeback_queue.end());
+    }
+    queue_copy.swap(writeback_queue);
+}
+
+static void copy_to_op(cluster_op_t *op, uint64_t offset, uint8_t *buf, uint64_t len, uint32_t bitmap_granularity)
+{
+    if (op->opcode == OSD_OP_READ)
+    {
+        // Not OSD_OP_READ_BITMAP or OSD_OP_READ_CHAIN_BITMAP
+        int iov_idx = 0;
+        uint64_t cur_offset = op->offset;
+        while (iov_idx < op->iov.count && cur_offset+op->iov.buf[iov_idx].iov_len <= offset)
+        {
+            cur_offset += op->iov.buf[iov_idx].iov_len;
+            iov_idx++;
+        }
+        while (iov_idx < op->iov.count && cur_offset < offset+len)
+        {
+            auto & v = op->iov.buf[iov_idx];
+            auto begin = (cur_offset < offset ? offset : cur_offset);
+            auto end = (cur_offset+v.iov_len > offset+len ? offset+len : cur_offset+v.iov_len);
+            memcpy(
+                (uint8_t*)v.iov_base + begin - cur_offset,
+                buf + (cur_offset <= offset ? 0 : cur_offset-offset),
+                end - begin
+            );
+            cur_offset += v.iov_len;
+            iov_idx++;
+        }
+    }
+    // Set bitmap bits
+    int start_bit = (offset-op->offset)/bitmap_granularity;
+    int end_bit = (offset-op->offset+len)/bitmap_granularity;
+    for (int bit = start_bit; bit < end_bit;)
+    {
+        if (!(bit%8) && bit <= end_bit-8)
+        {
+            ((uint8_t*)op->bitmap_buf)[bit/8] = 0xFF;
+            bit += 8;
+        }
+        else
+        {
+            ((uint8_t*)op->bitmap_buf)[bit/8] |= (1 << (bit%8));
+            bit++;
+        }
+    }
+}
+
+bool writeback_cache_t::read_from_cache(cluster_op_t *op, uint32_t bitmap_granularity)
+{
+    bool dirty_copied = false;
+    if (dirty_buffers.size() && (op->opcode == OSD_OP_READ ||
+        op->opcode == OSD_OP_READ_BITMAP || op->opcode == OSD_OP_READ_CHAIN_BITMAP))
+    {
+        // We also have to return reads from CACHE_REPEATING buffers - they are not
+        // guaranteed to be present on target OSDs at the moment of repeating.
+        // And we're also free to return data from other cached buffers just
+        // because it's faster.
+        auto dirty_it = find_dirty(op->cur_inode, op->offset);
+        while (dirty_it != dirty_buffers.end() && dirty_it->first.inode == op->cur_inode &&
+            dirty_it->first.stripe < op->offset+op->len)
+        {
+            uint64_t begin = dirty_it->first.stripe, end = dirty_it->first.stripe + dirty_it->second.len;
+            if (begin < op->offset)
+                begin = op->offset;
+            if (end > op->offset+op->len)
+                end = op->offset+op->len;
+            bool skip_prev = true;
+            uint64_t cur = begin, prev = begin;
+            while (cur < end)
+            {
+                unsigned bmp_loc = (cur - op->offset)/bitmap_granularity;
+                bool skip = (((*((uint8_t*)op->bitmap_buf + bmp_loc/8)) >> (bmp_loc%8)) & 0x1);
+                if (skip_prev != skip)
+                {
+                    if (cur > prev && !skip)
+                    {
+                        // Copy data
+                        dirty_copied = true;
+                        copy_to_op(op, prev, dirty_it->second.buf + prev - dirty_it->first.stripe, cur-prev, bitmap_granularity);
+                    }
+                    skip_prev = skip;
+                    prev = cur;
+                }
+                cur += bitmap_granularity;
+            }
+            assert(cur > prev);
+            if (!skip_prev)
+            {
+                // Copy data
+                dirty_copied = true;
+                copy_to_op(op, prev, dirty_it->second.buf + prev - dirty_it->first.stripe, cur-prev, bitmap_granularity);
+            }
+            dirty_it++;
+        }
+    }
+    return dirty_copied;
+}
+
+void writeback_cache_t::fsync_start()
+{
+    for (auto & prev_op: dirty_buffers)
+    {
+        if (prev_op.second.state == CACHE_WRITTEN)
+        {
+            prev_op.second.state = CACHE_FLUSHING;
+        }
+    }
+}
+
+void writeback_cache_t::fsync_error()
+{
+    for (auto & prev_op: dirty_buffers)
+    {
+        if (prev_op.second.state == CACHE_FLUSHING)
+        {
+            prev_op.second.state = CACHE_WRITTEN;
+        }
+    }
+}
+
+void writeback_cache_t::fsync_ok()
+{
+    for (auto uw_it = dirty_buffers.begin(); uw_it != dirty_buffers.end(); )
+    {
+        if (uw_it->second.state == CACHE_FLUSHING)
+        {
+            if (!--(*uw_it->second.refcnt))
+                free(uw_it->second.refcnt);
+            dirty_buffers.erase(uw_it++);
+        }
+        else
+            uw_it++;
+    }
+}
--- a/src/disk_tool.cpp
+++ b/src/disk_tool.cpp
@@ -74,7 +74,7 @@ static const char *help_text =
    "  If it doesn't succeed it issues a warning in the system log.\n"
    "  \n"
    "  You can also pass other OSD options here as arguments and they'll be persisted\n"
-    "  in the superblock: cached_io_data, cached_io_meta, cached_io_journal,\n"
+    "  in the superblock: data_io, meta_io, journal_io,\n"
    "  inmemory_metadata, inmemory_journal, max_write_iodepth,\n"
    "  min_flusher_count, max_flusher_count, journal_sector_buffer_count,\n"
    "  journal_no_same_sector_overwrites, throttle_small_writes, throttle_target_iops,\n"
--- a/src/disk_tool_prepare.cpp
+++ b/src/disk_tool_prepare.cpp
@@ -8,9 +8,9 @@
 int disk_tool_t::prepare_one(std::map<std::string, std::string> options, int is_hdd)
 {
    static const char *allow_additional_params[] = {
-        "cached_io_data",
-        "cached_io_meta",
-        "cached_io_journal",
+        "data_io",
+        "meta_io",
+        "journal_io",
        "max_write_iodepth",
        "max_write_iodepth",
        "min_flusher_count",
@@ -119,7 +119,7 @@ int disk_tool_t::prepare_one(std::map<std::string, std::string> options, int is_
    try
    {
        dsk.parse_config(options);
-        dsk.cached_io_data = dsk.cached_io_meta = dsk.cached_io_journal = false;
+        dsk.data_io = dsk.meta_io = dsk.journal_io = "direct";
        dsk.open_data();
        dsk.open_meta();
        dsk.open_journal();
@@ -483,7 +483,7 @@ int disk_tool_t::get_meta_partition(std::vector<vitastor_dev_info_t> & ssds, std
    {
        blockstore_disk_t dsk;
        dsk.parse_config(options);
-        dsk.cached_io_data = dsk.cached_io_meta = dsk.cached_io_journal = false;
+        dsk.data_io = dsk.meta_io = dsk.journal_io = "direct";
        dsk.open_data();
        dsk.open_meta();
        dsk.open_journal();
--- a/src/disk_tool_resize.cpp
+++ b/src/disk_tool_resize.cpp
@@ -91,7 +91,7 @@ int disk_tool_t::resize_parse_params()
    try
    {
        dsk.parse_config(options);
-        dsk.cached_io_data = dsk.cached_io_meta = dsk.cached_io_journal = false;
+        dsk.data_io = dsk.meta_io = dsk.journal_io = "direct";
        dsk.open_data();
        dsk.open_meta();
        dsk.open_journal();
--- a/src/epoll_manager.cpp
+++ b/src/epoll_manager.cpp
@@ -23,19 +23,24 @@ epoll_manager_t::epoll_manager_t(ring_loop_t *ringloop)

    tfd = new timerfd_manager_t([this](int fd, bool wr, std::function<void(int, int)> handler) { set_fd_handler(fd, wr, handler); });

-    consumer.loop = [this]()
+    if (ringloop)
    {
-        if (pending)
-            handle_epoll_events();
-    };
-    ringloop->register_consumer(&consumer);
-
-    handle_epoll_events();
+        consumer.loop = [this]()
+        {
+            if (pending)
+                handle_uring_event();
+        };
+        ringloop->register_consumer(&consumer);
+        handle_uring_event();
+    }
 }

 epoll_manager_t::~epoll_manager_t()
 {
-    ringloop->unregister_consumer(&consumer);
+    if (ringloop)
+    {
+        ringloop->unregister_consumer(&consumer);
+    }
    if (tfd)
    {
        delete tfd;
@@ -44,6 +49,11 @@ epoll_manager_t::~epoll_manager_t()
    close(epoll_fd);
 }

+int epoll_manager_t::get_fd()
+{
+    return epoll_fd;
+}
+
 void epoll_manager_t::set_fd_handler(int fd, bool wr, std::function<void(int, int)> handler)
 {
    if (handler != NULL)
@@ -75,7 +85,7 @@ void epoll_manager_t::set_fd_handler(int fd, bool wr, std::function<void(int, in
    }
 }

-void epoll_manager_t::handle_epoll_events()
+void epoll_manager_t::handle_uring_event()
 {
    io_uring_sqe *sqe = ringloop->get_sqe();
    if (!sqe)
@@ -95,14 +105,20 @@ void epoll_manager_t::handle_epoll_events()
        {
            throw std::runtime_error(std::string("epoll failed: ") + strerror(-data->res));
        }
-        handle_epoll_events();
+        handle_uring_event();
    };
    ringloop->submit();
+    handle_events(0);
+}
+
+void epoll_manager_t::handle_events(int timeout)
+{
    int nfds;
    epoll_event events[MAX_EPOLL_EVENTS];
    do
    {
-        nfds = epoll_wait(epoll_fd, events, MAX_EPOLL_EVENTS, 0);
+        nfds = epoll_wait(epoll_fd, events, MAX_EPOLL_EVENTS, timeout);
+        timeout = 0;
        for (int i = 0; i < nfds; i++)
        {
            auto cb_it = epoll_handlers.find(events[i].data.fd);
--- a/src/epoll_manager.h
+++ b/src/epoll_manager.h
@@ -15,11 +15,14 @@ class epoll_manager_t
    ring_consumer_t consumer;
    ring_loop_t *ringloop;
    std::map<int, std::function<void(int, int)>> epoll_handlers;
+
+    void handle_uring_event();
 public:
    epoll_manager_t(ring_loop_t *ringloop);
    ~epoll_manager_t();
+    int get_fd();
    void set_fd_handler(int fd, bool wr, std::function<void(int, int)> handler);
-    void handle_epoll_events();
+    void handle_events(int timeout);

    timerfd_manager_t *tfd;
 };
--- a/src/etcd_state_client.cpp
+++ b/src/etcd_state_client.cpp
@@ -684,8 +684,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
            // ID
            pool_id_t pool_id;
            char null_byte = 0;
-            sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
-            if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
+            int scanned = sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
+            if (scanned != 1 || !pool_id || pool_id >= POOL_ID_MAX)
            {
                fprintf(stderr, "Pool ID %s is invalid (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
                continue;
@@ -829,8 +829,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
        {
            pool_id_t pool_id;
            char null_byte = 0;
-            sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
-            if (!pool_id || pool_id >= POOL_ID_MAX || null_byte != 0)
+            int scanned = sscanf(pool_item.first.c_str(), "%u%c", &pool_id, &null_byte);
+            if (scanned != 1 || !pool_id || pool_id >= POOL_ID_MAX)
            {
                fprintf(stderr, "Pool ID %s is invalid in PG configuration (must be a number less than 0x%x), skipping pool\n", pool_item.first.c_str(), POOL_ID_MAX);
                continue;
@@ -838,8 +838,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
            for (auto & pg_item: pool_item.second.object_items())
            {
                pg_num_t pg_num = 0;
-                sscanf(pg_item.first.c_str(), "%u%c", &pg_num, &null_byte);
-                if (!pg_num || null_byte != 0)
+                int scanned = sscanf(pg_item.first.c_str(), "%u%c", &pg_num, &null_byte);
+                if (scanned != 1 || !pg_num)
                {
                    fprintf(stderr, "Bad key in pool %u PG configuration: %s (must be a number), skipped\n", pool_id, pg_item.first.c_str());
                    continue;
@@ -889,8 +889,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
        pool_id_t pool_id = 0;
        pg_num_t pg_num = 0;
        char null_byte = 0;
-        sscanf(key.c_str() + etcd_prefix.length()+12, "%u/%u%c", &pool_id, &pg_num, &null_byte);
-        if (!pool_id || pool_id >= POOL_ID_MAX || !pg_num || null_byte != 0)
+        int scanned = sscanf(key.c_str() + etcd_prefix.length()+12, "%u/%u%c", &pool_id, &pg_num, &null_byte);
+        if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX || !pg_num)
        {
            fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
        }
@@ -944,8 +944,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
        pool_id_t pool_id = 0;
        pg_num_t pg_num = 0;
        char null_byte = 0;
-        sscanf(key.c_str() + etcd_prefix.length()+10, "%u/%u%c", &pool_id, &pg_num, &null_byte);
-        if (!pool_id || pool_id >= POOL_ID_MAX || !pg_num || null_byte != 0)
+        int scanned = sscanf(key.c_str() + etcd_prefix.length()+10, "%u/%u%c", &pool_id, &pg_num, &null_byte);
+        if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX || !pg_num)
        {
            fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
        }
@@ -1015,8 +1015,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
        uint64_t pool_id = 0;
        uint64_t inode_num = 0;
        char null_byte = 0;
-        sscanf(key.c_str() + etcd_prefix.length()+14, "%lu/%lu%c", &pool_id, &inode_num, &null_byte);
-        if (!pool_id || pool_id >= POOL_ID_MAX || !inode_num || (inode_num >> (64-POOL_ID_BITS)) || null_byte != 0)
+        int scanned = sscanf(key.c_str() + etcd_prefix.length()+14, "%lu/%lu%c", &pool_id, &inode_num, &null_byte);
+        if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX || !inode_num || (inode_num >> (64-POOL_ID_BITS)))
        {
            fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
        }
--- a/src/fio_cluster.cpp
+++ b/src/fio_cluster.cpp
@@ -24,6 +24,7 @@
 #include <netinet/tcp.h>

 #include <vector>
+#include <string>

 #include "vitastor_c.h"
 #include "fio_headers.h"
@@ -31,6 +32,7 @@
 struct sec_data
 {
    vitastor_c *cli = NULL;
+    bool epoll_based = false;
    void *watch = NULL;
    bool last_sync = false;
    /* The list of completed io_u structs. */
@@ -57,6 +59,7 @@ struct sec_options
    int rdma_port_num = 0;
    int rdma_gid_index = 0;
    int rdma_mtu = 0;
+    int no_io_uring = 0;
 };

 static struct fio_option options[] = {
@@ -192,6 +195,16 @@ static struct fio_option options[] = {
        .category = FIO_OPT_C_ENGINE,
        .group  = FIO_OPT_G_FILENAME,
    },
+    {
+        .name   = "no_io_uring",
+        .lname  = "Disable io_uring",
+        .type   = FIO_OPT_BOOL,
+        .off1   = offsetof(struct sec_options, no_io_uring),
+        .help   = "Use epoll and plain sendmsg/recvmsg instead of io_uring (slower)",
+        .def    = "0",
+        .category = FIO_OPT_C_ENGINE,
+        .group  = FIO_OPT_G_FILENAME,
+    },
    {
        .name = NULL,
    },
@@ -203,6 +216,15 @@ static void watch_callback(void *opaque, long watch)
    bsd->watch = (void*)watch;
 }

+static void opt_push(std::vector<char *> & options, const char *opt, const char *value)
+{
+    if (value)
+    {
+        options.push_back(strdup(opt));
+        options.push_back(strdup(value));
+    }
+}
+
 static int sec_setup(struct thread_data *td)
 {
    sec_options *o = (sec_options*)td->eo;
@@ -254,18 +276,59 @@ static int sec_setup(struct thread_data *td)
    {
        o->inode = 0;
    }
-    bsd->cli = vitastor_c_create_uring(o->config_path, o->etcd_host, o->etcd_prefix,
-        o->use_rdma, o->rdma_device, o->rdma_port_num, o->rdma_gid_index, o->rdma_mtu, o->cluster_log);
+    std::vector<char *> options;
+    opt_push(options, "config_path", o->config_path);
+    opt_push(options, "etcd_address", o->etcd_host);
+    opt_push(options, "etcd_prefix", o->etcd_prefix);
+    if (o->use_rdma != -1)
+        opt_push(options, "use_rdma", std::to_string(o->use_rdma).c_str());
+    opt_push(options, "rdma_device", o->rdma_device);
+    if (o->rdma_port_num)
+        opt_push(options, "rdma_port_num", std::to_string(o->rdma_port_num).c_str());
+    if (o->rdma_gid_index)
+        opt_push(options, "rdma_gid_index", std::to_string(o->rdma_gid_index).c_str());
+    if (o->rdma_mtu)
+        opt_push(options, "rdma_mtu", std::to_string(o->rdma_mtu).c_str());
+    if (o->cluster_log)
+        opt_push(options, "log_level", std::to_string(o->cluster_log).c_str());
+    // allow writeback caching unless fio runs with direct=1 (O_DIRECT requested)
+    opt_push(options, "client_writeback_allowed", td->o.odirect ? "0" : "1");
+    bsd->cli = o->no_io_uring ? NULL : vitastor_c_create_uring_json((const char**)options.data(), options.size());
+    bsd->epoll_based = false;
+    if (!bsd->cli)
+    {
+        if (o->no_io_uring)
+            fprintf(stderr, "vitastor: io_uring disabled - I/O will be slower\n");
+        else
+            fprintf(stderr, "vitastor: failed to create io_uring: %s - I/O will be slower\n", strerror(errno));
+        bsd->cli = vitastor_c_create_epoll_json((const char**)options.data(), options.size());
+        bsd->epoll_based = true;
+    }
+    for (auto opt: options)
+        free(opt);
+    options.clear();
    if (o->image)
    {
        bsd->watch = NULL;
        vitastor_c_watch_inode(bsd->cli, o->image, watch_callback, bsd);
-        while (true)
+        if (!bsd->epoll_based)
        {
-            vitastor_c_uring_handle_events(bsd->cli);
-            if (bsd->watch)
-                break;
-            vitastor_c_uring_wait_events(bsd->cli);
+            while (true)
+            {
+                vitastor_c_uring_handle_events(bsd->cli);
+                if (bsd->watch)
+                    break;
+                vitastor_c_uring_wait_events(bsd->cli);
+            }
+        }
+        else
+        {
+            while (true)
+            {
+                if (bsd->watch)
+                    break;
+                vitastor_c_epoll_handle_events(bsd->cli, 1000);
+            }
        }
        td->files[0]->real_file_size = vitastor_c_inode_get_size(bsd->watch);
        if (!vitastor_c_inode_get_num(bsd->watch) ||
@@ -408,12 +471,24 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
 static int sec_getevents(struct thread_data *td, unsigned int min, unsigned int max, const struct timespec *t)
 {
    sec_data *bsd = (sec_data*)td->io_ops_data;
-    while (true)
+    if (!bsd->epoll_based)
    {
-        vitastor_c_uring_handle_events(bsd->cli);
-        if (bsd->completed.size() >= min)
-            break;
-        vitastor_c_uring_wait_events(bsd->cli);
+        while (true)
+        {
+            vitastor_c_uring_handle_events(bsd->cli);
+            if (bsd->completed.size() >= min)
+                break;
+            vitastor_c_uring_wait_events(bsd->cli);
+        }
+    }
+    else
+    {
+        while (true)
+        {
+            if (bsd->completed.size() >= min)
+                break;
+            vitastor_c_epoll_handle_events(bsd->cli, 1000);
+        }
    }
    return bsd->completed.size();
 }
--- a/src/fio_sec_osd.cpp
+++ b/src/fio_sec_osd.cpp
@@ -242,6 +242,7 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
            op.sec_rw.version = UINT64_MAX; // last unstable
            op.sec_rw.offset = io->offset % bsd->block_size;
            op.sec_rw.len = io->xfer_buflen;
+            op.sec_rw.attr_len = 0;
        }
        else
        {
@@ -263,6 +264,7 @@ static enum fio_q_status sec_queue(struct thread_data *td, struct io_u *io)
            op.sec_rw.version = 0; // assign automatically
            op.sec_rw.offset = io->offset % bsd->block_size;
            op.sec_rw.len = io->xfer_buflen;
+            op.sec_rw.attr_len = 0;
        }
        else
        {
--- a/src/messenger.cpp
+++ b/src/messenger.cpp
@@ -11,6 +11,9 @@

 #include "addr_util.h"
 #include "messenger.h"
+#ifdef WITH_RDMA
+#include "msgr_rdma.h"
+#endif

 void osd_messenger_t::init()
 {
--- a/src/messenger.h
+++ b/src/messenger.h
@@ -18,10 +18,6 @@
 #include "timerfd_manager.h"
 #include <ringloop.h>

-#ifdef WITH_RDMA
-#include "msgr_rdma.h"
-#endif
-
 #define CL_READ_HDR 1
 #define CL_READ_DATA 2
 #define CL_READ_REPLY_DATA 3
@@ -44,6 +40,11 @@ struct msgr_sendp_t
    int flags;
 };

+#ifdef WITH_RDMA
+struct msgr_rdma_connection_t;
+struct msgr_rdma_context_t;
+#endif
+
 struct osd_client_t
 {
    int refs = 0;
--- a/src/mock/messenger.cpp
+++ b/src/mock/messenger.cpp
@@ -55,3 +55,10 @@ json11::Json::object osd_messenger_t::merge_configs(const json11::Json::object &
 {
    return cli_config;
 }
+
+bool json_is_true(const json11::Json & val)
+{
+    if (val.is_string())
+        return val == "true" || val == "yes" || val == "1";
+    return val.bool_value();
+}
--- a/src/mock/ringloop.h
+++ b/src/mock/ringloop.h
@@ -22,4 +22,10 @@ public:
    void submit()
    {
    }
+    void wait()
+    {
+    }
+    void loop()
+    {
+    }
 };
--- a/src/msgr_rdma.cpp
+++ b/src/msgr_rdma.cpp
@@ -19,12 +19,12 @@ std::string msgr_rdma_address_t::to_string()
 bool msgr_rdma_address_t::from_string(const char *str, msgr_rdma_address_t *dest)
 {
    uint64_t* gid = (uint64_t*)&dest->gid;
-    int n = sscanf(
+    int scanned = sscanf(
        str, "%hx:%x:%x:%16lx%16lx", &dest->lid, &dest->qpn, &dest->psn, gid, gid+1
    );
    gid[0] = be64toh(gid[0]);
    gid[1] = be64toh(gid[1]);
-    return n == 5;
+    return scanned == 5;
 }

 msgr_rdma_context_t::~msgr_rdma_context_t()
--- a/src/msgr_stop.cpp
+++ b/src/msgr_stop.cpp
@@ -5,6 +5,9 @@
 #include <assert.h>

 #include "messenger.h"
+#ifdef WITH_RDMA
+#include "msgr_rdma.h"
+#endif

 void osd_messenger_t::cancel_osd_ops(osd_client_t *cl)
 {
--- a/src/nbd_proxy.cpp
+++ b/src/nbd_proxy.cpp
@@ -216,6 +216,14 @@ public:
        {
            nbd_timeout = cfg["nbd_timeout"].uint64_value();
        }
+        if (cfg["client_writeback_allowed"].is_null())
+        {
+            // NBD is always aware of fsync, so the write-back cache is allowed
+            // by default; it is only used if client_enable_writeback is also set
+            auto obj = cfg.object_items();
+            obj["client_writeback_allowed"] = true;
+            cfg = obj;
+        }
        // Create client
        ringloop = new ring_loop_t(512);
        epmgr = new epoll_manager_t(ringloop);
@@ -341,6 +349,7 @@ public:
            ringloop->loop();
            ringloop->wait();
        }
+        cli->flush();
        delete cli;
        delete epmgr;
        delete ringloop;
--- a/src/nfs_proxy.cpp
+++ b/src/nfs_proxy.cpp
@@ -34,7 +34,10 @@ nfs_proxy_t::~nfs_proxy_t()
    if (cmd)
        delete cmd;
    if (cli)
+    {
+        cli->flush();
        delete cli;
+    }
    if (epmgr)
        delete epmgr;
    if (ringloop)
@@ -62,8 +65,9 @@ json11::Json::object nfs_proxy_t::parse_args(int narg, const char *args[])
                "  --pool <POOL>     use <POOL> as default pool for new files (images)\n"
                "  --foreground 1    stay in foreground, do not daemonize\n"
                "\n"
-                "NFS proxy is stateless if you use immediate_commit=all in your cluster, so\n"
-                "you can freely use multiple NFS proxies with L3 load balancing in this case.\n"
+                "NFS proxy is stateless if you use immediate_commit=all in your cluster and if\n"
+                "you do not use client_enable_writeback=true, so you can freely use multiple\n"
+                "NFS proxies with L3 load balancing in this case.\n"
                "\n"
                "Example start and mount commands for a custom NFS port:\n"
                "  %s --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool\n"
@@ -111,6 +115,14 @@ void nfs_proxy_t::run(json11::Json cfg)
        if (name_prefix.size())
            name_prefix += "/";
    }
+    if (cfg["client_writeback_allowed"].is_null())
+    {
+        // NFS is always aware of fsync, so the write-back cache is allowed
+        // by default; it is only used if client_enable_writeback is also set
+        auto obj = cfg.object_items();
+        obj["client_writeback_allowed"] = true;
+        cfg = obj;
+    }
    // Create client
    ringloop = new ring_loop_t(512);
    epmgr = new epoll_manager_t(ringloop);
@@ -261,16 +273,8 @@ void nfs_proxy_t::run(json11::Json cfg)
        ringloop->loop();
        ringloop->wait();
    }
-    /*// Sync at the end
-    cluster_op_t *close_sync = new cluster_op_t;
-    close_sync->opcode = OSD_OP_SYNC;
-    close_sync->callback = [&stop](cluster_op_t *op)
-    {
-        stop = true;
-        delete op;
-    };
-    cli->execute(close_sync);*/
    // Destroy the client
+    cli->flush();
    delete cli;
    delete epmgr;
    delete ringloop;
@@ -346,8 +350,8 @@ void nfs_proxy_t::parse_stats(etcd_kv_t & kv)
        pool_id_t pool_id = 0;
        inode_t inode_num = 0;
        char null_byte = 0;
-        sscanf(key.c_str() + cli->st_cli.etcd_prefix.length()+13, "%u/%lu%c", &pool_id, &inode_num, &null_byte);
-        if (!pool_id || pool_id >= POOL_ID_MAX || !inode_num || null_byte != 0)
+        int scanned = sscanf(key.c_str() + cli->st_cli.etcd_prefix.length()+13, "%u/%lu%c", &pool_id, &inode_num, &null_byte);
+        if (scanned != 2 || !pool_id || pool_id >= POOL_ID_MAX || !inode_num)
        {
            fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
        }
@@ -360,8 +364,8 @@ void nfs_proxy_t::parse_stats(etcd_kv_t & kv)
    {
        pool_id_t pool_id = 0;
        char null_byte = 0;
-        sscanf(key.c_str() + cli->st_cli.etcd_prefix.length()+12, "%u%c", &pool_id, &null_byte);
-        if (!pool_id || pool_id >= POOL_ID_MAX)
+        int scanned = sscanf(key.c_str() + cli->st_cli.etcd_prefix.length()+12, "%u%c", &pool_id, &null_byte);
+        if (scanned != 1 || !pool_id || pool_id >= POOL_ID_MAX)
        {
            fprintf(stderr, "Bad etcd key %s, ignoring\n", key.c_str());
        }
--- a/src/osd.cpp
+++ b/src/osd.cpp
@@ -160,6 +160,9 @@ void osd_t::parse_config(bool init)
        etcd_report_interval = config["etcd_report_interval"].uint64_value();
        if (etcd_report_interval <= 0)
            etcd_report_interval = 5;
+        etcd_stats_interval = config["etcd_stats_interval"].uint64_value();
+        if (etcd_stats_interval <= 0)
+            etcd_stats_interval = 30;
        readonly = json_is_true(config["readonly"]);
        run_primary = !json_is_false(config["run_primary"]);
        allow_test_ops = json_is_true(config["allow_test_ops"]);
--- a/src/osd.h
+++ b/src/osd.h
@@ -93,6 +93,7 @@ class osd_t

    json11::Json::object cli_config, file_config, etcd_global_config, etcd_osd_config, config;
    int etcd_report_interval = 5;
+    int etcd_stats_interval = 30;

    bool readonly = false;
    osd_num_t osd_num = 1; // OSD numbers start with 1
--- a/src/osd_cluster.cpp
+++ b/src/osd_cluster.cpp
@@ -186,10 +186,12 @@ json11::Json osd_t::get_statistics()
    if (bs)
    {
        st["blockstore_ready"] = bs->is_started();
-        st["data_block_size"] = (uint64_t)bs->get_block_size();
        st["size"] = bs->get_block_count() * bs->get_block_size();
        st["free"] = bs->get_free_block_count() * bs->get_block_size();
    }
+    st["data_block_size"] = (uint64_t)bs_block_size;
+    st["bitmap_granularity"] = (uint64_t)bs_bitmap_granularity;
+    st["immediate_commit"] = immediate_commit == IMMEDIATE_ALL ? "all" : (immediate_commit == IMMEDIATE_SMALL ? "small" : "none");
    st["host"] = self_state["host"];
    json11::Json::object op_stats, subop_stats;
    for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
@@ -427,14 +429,18 @@ void osd_t::acquire_lease()
        create_osd_state();
    });
    printf(
-        "[OSD %lu] reporting to etcd at %s every %d seconds\n", this->osd_num,
+        "[OSD %lu] reporting to etcd at %s every %d seconds (statistics every %d seconds)\n", this->osd_num,
        (config["etcd_address"].is_string() ? config["etcd_address"].string_value() : config["etcd_address"].dump()).c_str(),
-        etcd_report_interval
+        etcd_report_interval, etcd_stats_interval
    );
    tfd->set_timer(etcd_report_interval*1000, true, [this](int timer_id)
    {
        renew_lease(false);
    });
+    tfd->set_timer(etcd_stats_interval*1000, true, [this](int timer_id)
+    {
+        report_statistics();
+    });
 }

 // Report "up" state once, then keep it alive using the lease
@@ -539,7 +545,6 @@ void osd_t::renew_lease(bool reload)
        else
        {
            etcd_failed_attempts = 0;
-            report_statistics();
            // Reload PGs
            if (reload && run_primary)
            {
@@ -649,7 +654,7 @@ void osd_t::apply_pg_config()
            auto pg_it = this->pgs.find({ .pool_id = pool_id, .pg_num = pg_num });
            bool currently_taken = pg_it != this->pgs.end() && pg_it->second.state != PG_OFFLINE;
            // Check pool block size and bitmap granularity
-            if (this->bs_block_size != pool_item.second.data_block_size ||
-                this->bs_bitmap_granularity != pool_item.second.bitmap_granularity)
+            if (take && (this->bs_block_size != pool_item.second.data_block_size ||
+                this->bs_bitmap_granularity != pool_item.second.bitmap_granularity))
            {
                if (!warned_block_size)
@@ -967,8 +972,8 @@ void osd_t::report_pg_states()
                        pool_id_t pool_id = 0;
                        pg_num_t pg_num = 0;
                        char null_byte = 0;
-                        sscanf(kv.key.c_str() + st_cli.etcd_prefix.length()+10, "%u/%u%c", &pool_id, &pg_num, &null_byte);
-                        if (null_byte == 0)
+                        int scanned = sscanf(kv.key.c_str() + st_cli.etcd_prefix.length()+10, "%u/%u%c", &pool_id, &pg_num, &null_byte);
+                        if (scanned == 2)
                        {
                            auto pg_it = pgs.find({ .pool_id = pool_id, .pg_num = pg_num });
                            if (pg_it != pgs.end() && pg_it->second.state != PG_OFFLINE && pg_it->second.state != PG_STARTING &&
--- a/src/osd_secondary.cpp
+++ b/src/osd_secondary.cpp
@@ -2,6 +2,9 @@
 // License: VNPL-1.1 (see README.md for details)

 #include "osd.h"
+#ifdef WITH_RDMA
+#include "msgr_rdma.h"
+#endif

 #include "json11/json11.hpp"

--- a/src/qemu_driver.c
+++ b/src/qemu_driver.c
@@ -197,7 +197,11 @@ static void vitastor_parse_filename(const char *filename, QDict *options, Error
            !strcmp(name, "rdma-mtu"))
        {
            unsigned long long num_val;
+#if QEMU_VERSION_MAJOR < 8 || QEMU_VERSION_MAJOR == 8 && QEMU_VERSION_MINOR < 1
            if (parse_uint_full(value, &num_val, 0))
+#else
+            if (parse_uint_full(value, 0, &num_val))
+#endif
            {
                error_setg(errp, "Illegal %s: %s", name, value);
                goto out;
@@ -320,7 +324,7 @@ static void vitastor_aio_fd_write(void *fddv)
 static void universal_aio_set_fd_handler(AioContext *ctx, int fd, IOHandler *fd_read, IOHandler *fd_write, void *opaque)
 {
    aio_set_fd_handler(ctx, fd,
-#if QEMU_VERSION_MAJOR == 2 && QEMU_VERSION_MINOR >= 5 || QEMU_VERSION_MAJOR >= 3
+#if QEMU_VERSION_MAJOR == 2 && QEMU_VERSION_MINOR >= 5 || QEMU_VERSION_MAJOR >= 3 && (QEMU_VERSION_MAJOR < 8 || QEMU_VERSION_MAJOR == 8 && QEMU_VERSION_MINOR < 1)
        0 /*is_external*/,
 #endif
        fd_read,
@@ -384,6 +388,45 @@ static void vitastor_aio_set_fd_handler(void *vcli, int fd, int unused1, IOHandl
    );
 }

+#if defined VITASTOR_C_API_VERSION && VITASTOR_C_API_VERSION >= 2
+typedef struct str_array
+{
+    const char **items;
+    int len, alloc;
+} str_array;
+
+static void strarray_push(str_array *a, const char *str)
+{
+    if (a->len >= a->alloc)
+    {
+        a->alloc = !a->alloc ? 4 : 2*a->alloc;
+        a->items = (const char**)realloc(a->items, a->alloc*sizeof(char*));
+        if (!a->items)
+        {
+            fprintf(stderr, "bad alloc\n");
+            abort();
+        }
+    }
+    a->items[a->len++] = str;
+}
+
+static void strarray_push_kv(str_array *a, const char *key, const char *value)
+{
+    if (key && value)
+    {
+        strarray_push(a, key);
+        strarray_push(a, value);
+    }
+}
+
+static void strarray_free(str_array *a)
+{
+    free(a->items);
+    a->items = NULL;
+    a->len = a->alloc = 0;
+}
+#endif
+
 static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, Error **errp)
 {
    VitastorRPC task;
@@ -403,22 +446,19 @@ static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, E
    client->rdma_mtu = qdict_get_try_int(options, "rdma-mtu", 0);
    client->ctx = bdrv_get_aio_context(bs);
 #if defined VITASTOR_C_API_VERSION && VITASTOR_C_API_VERSION >= 2
-    client->proxy = vitastor_c_create_qemu_uring(
-        vitastor_aio_set_fd_handler, client, client->config_path, client->etcd_host, client->etcd_prefix,
-        client->use_rdma, client->rdma_device, client->rdma_port_num, client->rdma_gid_index, client->rdma_mtu, 0
-    );
-    if (!client->proxy)
-    {
-        fprintf(stderr, "vitastor: failed to create io_uring: %s - I/O will be slower\n", strerror(errno));
-        client->uring_eventfd = -1;
-#endif
-        client->proxy = vitastor_c_create_qemu(
-            vitastor_aio_set_fd_handler, client, client->config_path, client->etcd_host, client->etcd_prefix,
-            client->use_rdma, client->rdma_device, client->rdma_port_num, client->rdma_gid_index, client->rdma_mtu, 0
-        );
-#if defined VITASTOR_C_API_VERSION && VITASTOR_C_API_VERSION >= 2
-    }
-    else
+    str_array opt = {};
+    strarray_push_kv(&opt, "config_path", qdict_get_try_str(options, "config-path"));
+    strarray_push_kv(&opt, "etcd_address", qdict_get_try_str(options, "etcd-host"));
+    strarray_push_kv(&opt, "etcd_prefix", qdict_get_try_str(options, "etcd-prefix"));
+    strarray_push_kv(&opt, "use_rdma", qdict_get_try_str(options, "use-rdma"));
+    strarray_push_kv(&opt, "rdma_device", qdict_get_try_str(options, "rdma-device"));
+    strarray_push_kv(&opt, "rdma_port_num", qdict_get_try_str(options, "rdma-port-num"));
+    strarray_push_kv(&opt, "rdma_gid_index", qdict_get_try_str(options, "rdma-gid-index"));
+    strarray_push_kv(&opt, "rdma_mtu", qdict_get_try_str(options, "rdma-mtu"));
+    strarray_push_kv(&opt, "client_writeback_allowed", (flags & BDRV_O_NOCACHE) ? "0" : "1");
+    client->proxy = vitastor_c_create_uring_json(opt.items, opt.len);
+    strarray_free(&opt);
+    if (client->proxy)
    {
        client->uring_eventfd = vitastor_c_uring_register_eventfd(client->proxy);
        if (client->uring_eventfd < 0)
@@ -430,6 +470,19 @@ static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, E
        }
        universal_aio_set_fd_handler(client->ctx, client->uring_eventfd, vitastor_uring_handler, NULL, client);
    }
+    else
+    {
+        // Writeback cache is unusable without io_uring because the client can't correctly flush on exit
+        fprintf(stderr, "vitastor: failed to create io_uring: %s - I/O will be slower%s\n",
+            strerror(errno), (flags & BDRV_O_NOCACHE ? "" : " and writeback cache will be disabled"));
+#endif
+        client->uring_eventfd = -1;
+        client->proxy = vitastor_c_create_qemu(
+            vitastor_aio_set_fd_handler, client, client->config_path, client->etcd_host, client->etcd_prefix,
+            client->use_rdma, client->rdma_device, client->rdma_port_num, client->rdma_gid_index, client->rdma_mtu, 0
+        );
+#if defined VITASTOR_C_API_VERSION && VITASTOR_C_API_VERSION >= 2
+    }
 #endif
    image = client->image = g_strdup(qdict_get_try_str(options, "image"));
    client->readonly = (flags & BDRV_O_RDWR) ? 1 : 0;
@@ -489,6 +542,10 @@ static int vitastor_file_open(BlockDriverState *bs, QDict *options, int flags, E
        return -1;
    }
    bs->total_sectors = client->size / BDRV_SECTOR_SIZE;
+#if QEMU_VERSION_MAJOR > 5 || QEMU_VERSION_MAJOR == 5 && QEMU_VERSION_MINOR >= 1
+    /* When extending regular files, we get zeros from the OS */
+    bs->supported_truncate_flags = BDRV_REQ_ZERO_WRITE;
+#endif
    //client->aio_context = bdrv_get_aio_context(bs);
    qdict_del(options, "use-rdma");
    qdict_del(options, "rdma-mtu");
@@ -585,7 +642,11 @@ static int coroutine_fn vitastor_co_truncate(BlockDriverState *bs, int64_t offse
    }

    // TODO: Resize inode to <offset> bytes
-    client->size = offset / BDRV_SECTOR_SIZE;
+#if QEMU_VERSION_MAJOR >= 4
+    client->size = exact || client->size < offset ? offset : client->size;
+#else
+    client->size = offset;
+#endif

    return 0;
 }
--- a/src/test_cluster_client.cpp
+++ b/src/test_cluster_client.cpp
@@ -4,7 +4,7 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <assert.h>
-#include "cluster_client.h"
+#include "cluster_client_impl.h"

 void configure_single_pg_pool(cluster_client_t *cli)
 {
@@ -47,11 +47,11 @@ void configure_single_pg_pool(cluster_client_t *cli)
    cli->st_cli.on_change_hook(changes);
 }

-int *test_write(cluster_client_t *cli, uint64_t offset, uint64_t len, uint8_t c, std::function<void()> cb = NULL)
+int *test_write(cluster_client_t *cli, uint64_t offset, uint64_t len, uint8_t c, std::function<void()> cb = NULL, bool instant = false)
 {
    printf("Post write %lx+%lx\n", offset, len);
    int *r = new int;
-    *r = -1;
+    *r = instant ? -2 : -1;
    cluster_op_t *op = new cluster_op_t();
    op->opcode = OSD_OP_WRITE;
    op->inode = 0x1000000000001;
@@ -72,6 +72,13 @@ int *test_write(cluster_client_t *cli, uint64_t offset, uint64_t len, uint8_t c,
            cb();
    };
    cli->execute(op);
+    if (instant)
+    {
+        long res = *r;
+        assert(*r >= 0);
+        delete r;
+        return (int*)res;
+    }
    return r;
 }

@@ -160,6 +167,13 @@ osd_op_t *find_op(cluster_client_t *cli, osd_num_t osd_num, uint64_t opcode, uin
        }
        op_it++;
    }
+    op_it = cli->msgr.clients[peer_fd]->sent_ops.begin();
+    while (op_it != cli->msgr.clients[peer_fd]->sent_ops.end())
+    {
+        printf("Found opcode %lu offset %lx size %x\n", op_it->second->req.hdr.opcode, op_it->second->req.rw.offset, op_it->second->req.rw.len);
+        op_it++;
+    }
+    printf("Not found opcode %lu offset %lx size %lx\n", opcode, offset, len);
    return NULL;
 }

@@ -192,11 +206,16 @@ void test1()
    check_op_count(cli, 1, 1);
    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 4096), 0);
    check_completed(r1);
+    r1 = test_write(cli, 4096, 4096, 0x56);
+    can_complete(r1);
+    check_op_count(cli, 1, 1);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 4096, 4096), 0);
+    check_completed(r1);
    pretend_disconnected(cli, 1);
    int *r2 = test_sync(cli);
    pretend_connected(cli, 1);
    check_op_count(cli, 1, 1);
-    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 4096), 0);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 8192), 0);
    check_op_count(cli, 1, 1);
    can_complete(r2);
    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_SYNC, 0, 0), 0);
@@ -321,9 +340,8 @@ void test1()
    check_disconnected(cli, 1);
    pretend_connected(cli, 1);
    cli->continue_ops(true);
-    check_op_count(cli, 1, 2);
-    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 0x1000), 0);
-    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0x1000, 0x1000), 0);
+    check_op_count(cli, 1, 1);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 0x2000), 0);
    check_op_count(cli, 1, 1);
    can_complete(r2);
    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0x1000, 0x1000), 0);
@@ -337,7 +355,7 @@ void test1()

 void test2()
 {
-    std::map<object_id, cluster_buffer_t> unsynced_writes;
+    writeback_cache_t *wb = new writeback_cache_t();
    cluster_op_t *op = new cluster_op_t();
    op->opcode = OSD_OP_WRITE;
    op->inode = 1;
@@ -346,19 +364,19 @@ void test2()
    op->iov.push_back(malloc_or_die(4096*1024), 4096);
    // 0-4k = 0x55
    memset(op->iov.buf[0].iov_base, 0x55, op->iov.buf[0].iov_len);
-    cluster_client_t::copy_write(op, unsynced_writes);
+    wb->copy_write(op, CACHE_WRITTEN);
    // 8k-12k = 0x66
    op->offset = 8192;
    memset(op->iov.buf[0].iov_base, 0x66, op->iov.buf[0].iov_len);
-    cluster_client_t::copy_write(op, unsynced_writes);
+    wb->copy_write(op, CACHE_WRITTEN);
    // 4k-1M+4k = 0x77
    op->len = op->iov.buf[0].iov_len = 1048576;
    op->offset = 4096;
    memset(op->iov.buf[0].iov_base, 0x77, op->iov.buf[0].iov_len);
-    cluster_client_t::copy_write(op, unsynced_writes);
+    wb->copy_write(op, CACHE_WRITTEN);
    // check it
-    assert(unsynced_writes.size() == 4);
-    auto uit = unsynced_writes.begin();
+    assert(wb->dirty_buffers.size() == 2);
+    auto uit = wb->dirty_buffers.begin();
    int i;
    assert(uit->first.inode == 1);
    assert(uit->first.stripe == 0);
@@ -368,35 +386,106 @@ void test2()
    uit++;
    assert(uit->first.inode == 1);
    assert(uit->first.stripe == 4096);
-    assert(uit->second.len == 4096);
-    for (i = 0; i < uit->second.len && ((uint8_t*)uit->second.buf)[i] == 0x77; i++) {}
-    assert(i == uit->second.len);
-    uit++;
-    assert(uit->first.inode == 1);
-    assert(uit->first.stripe == 8192);
-    assert(uit->second.len == 4096);
-    for (i = 0; i < uit->second.len && ((uint8_t*)uit->second.buf)[i] == 0x77; i++) {}
-    assert(i == uit->second.len);
-    uit++;
-    assert(uit->first.inode == 1);
-    assert(uit->first.stripe == 12*1024);
-    assert(uit->second.len == 1016*1024);
+    assert(uit->second.len == 1048576);
    for (i = 0; i < uit->second.len && ((uint8_t*)uit->second.buf)[i] == 0x77; i++) {}
    assert(i == uit->second.len);
    uit++;
    // free memory
    free(op->iov.buf[0].iov_base);
    delete op;
-    for (auto p: unsynced_writes)
-    {
-        free(p.second.buf);
-    }
+    delete wb;
    printf("[ok] copy_write test\n");
 }

+void test_writeback()
+{
+    json11::Json config = json11::Json::object {
+        { "client_enable_writeback", true },
+        { "client_writeback_allowed", true },
+        { "client_max_buffered_bytes", 1024*1024 },
+        { "client_max_buffered_ops", 2 },
+        { "client_max_writeback_iodepth", 2 },
+        { "client_max_dirty_bytes", 1024*1024 },
+        { "client_max_dirty_ops", 2 },
+    };
+    timerfd_manager_t *tfd = new timerfd_manager_t([](int fd, bool wr, std::function<void(int, int)> callback){});
+    cluster_client_t *cli = new cluster_client_t(NULL, tfd, config);
+
+    configure_single_pg_pool(cli);
+    pretend_connected(cli, 1);
+
+    // Check that 3 consecutive writes are merged by writeback
+    assert((long)test_write(cli, 0, 4096, 0x55, NULL, true) == 1);
+    check_op_count(cli, 1, 0);
+    assert((long)test_write(cli, 4096, 4096, 0x55, NULL, true) == 1);
+    check_op_count(cli, 1, 0);
+    assert((long)test_write(cli, 8192, 4096, 0x55, NULL, true) == 1);
+    check_op_count(cli, 1, 0);
+
+    assert((long)test_write(cli, 1024*1024, 4096, 0x66, NULL, true) == 1);
+    check_op_count(cli, 1, 0);
+
+    // 3rd and 4th writes should trigger 1 writeback each
+    assert((long)test_write(cli, 2*1024*1024, 4096, 0x66, NULL, true) == 1);
+    check_op_count(cli, 1, 1);
+    assert((long)test_write(cli, 3*1024*1024, 4096, 0x66, NULL, true) == 1);
+    check_op_count(cli, 1, 2);
+
+    // 5th write should be postponed until at least 1 writeback is completed
+    int *r1 = test_write(cli, 4*1024*1024, 4096, 0x67, NULL);
+    check_op_count(cli, 1, 2);
+    can_complete(r1);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 3*4096), 0);
+    check_completed(r1);
+    // autosync because max_dirty_ops=2, flush waits for sync
+    check_op_count(cli, 1, 1);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 1024*1024, 4096), 0);
+    check_op_count(cli, 1, 1);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_SYNC, 0, 0), 0);
+    check_op_count(cli, 1, 1);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 2*1024*1024, 4096), 0);
+    check_op_count(cli, 1, 0);
+
+    int *r2 = test_sync(cli);
+    check_op_count(cli, 1, 1);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 3*1024*1024, 4096), 0);
+    check_op_count(cli, 1, 1);
+    // autosync because max_dirty_ops=2, flush waits for sync
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_SYNC, 0, 0), 0);
+    check_op_count(cli, 1, 1);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 4*1024*1024, 4096), 0);
+    check_op_count(cli, 1, 1);
+    can_complete(r2);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_SYNC, 0, 0), 0);
+    check_completed(r2);
+
+    // Check cutting of the beginning and end
+    assert((long)test_write(cli, 0, 32768, 0x55, NULL, true) == 1);
+    check_op_count(cli, 1, 0);
+    assert((long)test_write(cli, 32768, 32768, 0x56, NULL, true) == 1);
+    check_op_count(cli, 1, 0);
+    assert((long)test_write(cli, 16384, 32768, 0x57, NULL, true) == 1);
+    check_op_count(cli, 1, 0);
+    assert((long)test_write(cli, 16384+4096, 32768-4096, 0x58, NULL, true) == 1);
+    check_op_count(cli, 1, 0);
+    r2 = test_sync(cli);
+    check_op_count(cli, 1, 1);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_WRITE, 0, 65536), 0);
+    check_op_count(cli, 1, 1);
+    can_complete(r2);
+    pretend_op_completed(cli, find_op(cli, 1, OSD_OP_SYNC, 0, 0), 0);
+    check_completed(r2);
+
+    // Free client
+    delete cli;
+    delete tfd;
+    printf("[ok] writeback test\n");
+}
+
 int main(int narg, char *args[])
 {
    test1();
    test2();
+    test_writeback();
    return 0;
 }
--- a/src/timerfd_manager.cpp
+++ b/src/timerfd_manager.cpp
@@ -129,6 +129,7 @@ again:
        if (exp.it_value.tv_sec < 0 || exp.it_value.tv_sec == 0 && exp.it_value.tv_nsec <= 0)
        {
            // It already happened
+            // FIXME: Postpone to setImmediate/BH to avoid reenterability problems
            trigger_nearest();
            goto again;
        }
--- a/src/vitastor.pc.in
+++ b/src/vitastor.pc.in
@@ -6,7 +6,7 @@ includedir=${prefix}/@CMAKE_INSTALL_INCLUDEDIR@

 Name: Vitastor
 Description: Vitastor client library
-Version: 1.0.0
+Version: 1.1.0
 Libs: -L${libdir} -lvitastor_client
 Cflags: -I${includedir}

--- a/src/vitastor_c.cpp
+++ b/src/vitastor_c.cpp
@@ -164,6 +164,15 @@ int vitastor_c_uring_register_eventfd(vitastor_c *client)

 vitastor_c *vitastor_c_create_uring_json(const char **options, int options_len)
 {
+    ring_loop_t *ringloop = NULL;
+    try
+    {
+        ringloop = new ring_loop_t(512);
+    }
+    catch (std::exception & e)
+    {
+        return NULL;
+    }
    json11::Json::object cfg;
    for (int i = 0; i < options_len-1; i += 2)
    {
@@ -171,18 +180,32 @@ vitastor_c *vitastor_c_create_uring_json(const char **options, int options_len)
    }
    json11::Json cfg_json(cfg);
    vitastor_c *self = new vitastor_c;
-    self->ringloop = new ring_loop_t(512);
+    self->ringloop = ringloop;
    self->epmgr = new epoll_manager_t(self->ringloop);
    self->cli = new cluster_client_t(self->ringloop, self->epmgr->tfd, cfg_json);
    return self;
 }

+vitastor_c *vitastor_c_create_epoll_json(const char **options, int options_len)
+{
+    json11::Json::object cfg;
+    for (int i = 0; i < options_len-1; i += 2)
+    {
+        cfg[options[i]] = std::string(options[i+1]);
+    }
+    json11::Json cfg_json(cfg);
+    vitastor_c *self = new vitastor_c;
+    self->epmgr = new epoll_manager_t(NULL);
+    self->cli = new cluster_client_t(NULL, self->epmgr->tfd, cfg_json);
+    return self;
+}
+
 void vitastor_c_destroy(vitastor_c *client)
 {
    delete client->cli;
    if (client->epmgr)
        delete client->epmgr;
-    else
+    else if (client->tfd)
        delete client->tfd;
    if (client->ringloop)
        delete client->ringloop;
@@ -220,6 +243,16 @@ int vitastor_c_uring_has_work(vitastor_c *client)
    return client->ringloop->has_work();
 }

+int vitastor_c_epoll_get_fd(vitastor_c *client)
+{
+    return !client->ringloop && client->epmgr ? client->epmgr->get_fd() : -1;
+}
+
+void vitastor_c_epoll_handle_events(vitastor_c *client, int timeout)
+{
+    return client->epmgr->handle_events(timeout);
+}
+
 void vitastor_c_read(vitastor_c *client, uint64_t inode, uint64_t offset, uint64_t len,
    struct iovec *iov, int iovcnt, VitastorReadHandler cb, void *opaque)
 {
--- a/src/vitastor_c.h
+++ b/src/vitastor_c.h
@@ -7,7 +7,7 @@
 #define VITASTOR_QEMU_PROXY_H

 // C API wrapper version
-#define VITASTOR_C_API_VERSION 2
+#define VITASTOR_C_API_VERSION 3

 #ifndef POOL_ID_BITS
 #define POOL_ID_BITS 16
@@ -40,6 +40,7 @@ vitastor_c *vitastor_c_create_qemu_uring(QEMUSetFDHandler *aio_set_fd_handler, v
 vitastor_c *vitastor_c_create_uring(const char *config_path, const char *etcd_host, const char *etcd_prefix,
    int use_rdma, const char *rdma_device, int rdma_port_num, int rdma_gid_index, int rdma_mtu, int log_level);
 vitastor_c *vitastor_c_create_uring_json(const char **options, int options_len);
+vitastor_c *vitastor_c_create_epoll_json(const char **options, int options_len);
 void vitastor_c_destroy(vitastor_c *client);
 int vitastor_c_is_ready(vitastor_c *client);
 int vitastor_c_uring_register_eventfd(vitastor_c *client);
@@ -47,6 +48,8 @@ void vitastor_c_uring_wait_ready(vitastor_c *client);
 void vitastor_c_uring_handle_events(vitastor_c *client);
 void vitastor_c_uring_wait_events(vitastor_c *client);
 int vitastor_c_uring_has_work(vitastor_c *client);
+int vitastor_c_epoll_get_fd(vitastor_c *client);
+void vitastor_c_epoll_handle_events(vitastor_c *client, int timeout);
 void vitastor_c_read(vitastor_c *client, uint64_t inode, uint64_t offset, uint64_t len,
    struct iovec *iov, int iovcnt, VitastorReadHandler cb, void *opaque);
 void vitastor_c_write(vitastor_c *client, uint64_t inode, uint64_t offset, uint64_t len, uint64_t check_version,
--- a/tests/common.sh
+++ b/tests/common.sh
@@ -27,6 +27,7 @@ ETCD_COUNT=${ETCD_COUNT:-1}

 if [ "$KEEP_DATA" = "" ]; then
    rm -rf ./testdata
+    rm -rf /run/user/$(id -u)/testdata_etcd*
    mkdir -p ./testdata
 fi

@@ -41,7 +42,9 @@ ETCDCTL="${ETCD}ctl --endpoints=$ETCD_URL --dial-timeout=5s --command-timeout=10
 start_etcd()
 {
    local i=$1
-    ionice -c2 -n0 $ETCD -name etcd$i --data-dir ./testdata/etcd$i \
+    local t=/run/user/$(id -u)
+    findmnt $t >/dev/null || (sudo mkdir -p $t && sudo mount -t tmpfs tmpfs $t)
+    ionice -c2 -n0 $ETCD -name etcd$i --data-dir /run/user/$(id -u)/testdata_etcd$i \
        --advertise-client-urls http://$ETCD_IP:$((ETCD_PORT+2*i-2)) --listen-client-urls http://$ETCD_IP:$((ETCD_PORT+2*i-2)) \
        --initial-advertise-peer-urls http://$ETCD_IP:$((ETCD_PORT+2*i-1)) --listen-peer-urls http://$ETCD_IP:$((ETCD_PORT+2*i-1)) \
        --initial-cluster-token vitastor-tests-etcd --initial-cluster-state new \
--- a/tests/run_3osds.sh
+++ b/tests/run_3osds.sh
@@ -18,11 +18,11 @@ else
 fi

 if [ "$IMMEDIATE_COMMIT" != "" ]; then
-    NO_SAME="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all --log_level 10"
-    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"immediate_commit":"all"}'
+    NO_SAME="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all --log_level 10 --etcd_stats_interval 5"
+    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"immediate_commit":"all","client_enable_writeback":true}'
 else
-    NO_SAME="--journal_sector_buffer_count 1024 --log_level 10"
-    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1}'
+    NO_SAME="--journal_sector_buffer_count 1024 --log_level 10 --etcd_stats_interval 5"
+    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"client_enable_writeback":true}'
 fi

 start_osd_on()
--- a/tests/test_heal.sh
+++ b/tests/test_heal.sh
@@ -7,7 +7,7 @@ if [[ "$SCHEME" = "ec" ]]; then
    PG_DATA_SIZE=${PG_DATA_SIZE:-2}
    PG_MINSIZE=${PG_MINSIZE:-3}
 fi
-OSD_COUNT=7
+OSD_COUNT=${OSD_COUNT:-7}
 PG_COUNT=32
 . `dirname $0`/run_3osds.sh
 check_qemu
@@ -29,7 +29,7 @@ kill_osds()
    kill -9 $OSD1_PID
    $ETCDCTL del /vitastor/osd/state/1

-    for i in 2 3 4 5 6 7; do
+    for i in $(seq 2 $OSD_COUNT); do
        sleep 15
        echo Killing OSD $i and starting OSD $((i-1))
        p=OSD${i}_PID
@@ -40,8 +40,8 @@ kill_osds()
    done

    sleep 5
-    echo Starting OSD 7
-    start_osd 7
+    echo Starting OSD $OSD_COUNT
+    start_osd $OSD_COUNT

    sleep 5
 }
--- a/tests/test_move_reappear.sh
+++ b/tests/test_move_reappear.sh
@@ -7,7 +7,7 @@ OSD_COUNT=5
 OSD_ARGS="$OSD_ARGS"
 for i in $(seq 1 $OSD_COUNT); do
    dd if=/dev/zero of=./testdata/test_osd$i.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
-    build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 &
+    build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 --etcd_stats_interval 5 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 &
    eval OSD${i}_PID=$!
 done
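
Note on the C API changes above: `vitastor_c_create_uring_json()` and the new `vitastor_c_create_epoll_json()` both read `options` as a flat array of alternating keys and values, and store every value as a string. Below is a minimal standalone sketch of that calling convention. `parse_flat_options` and `example_usage` are hypothetical names used only for illustration, and `std::map` stands in for `json11::Json::object`; this is not code from the library.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper, NOT part of vitastor_c.h. It mirrors the loop from the
// diff: options[] holds key, value, key, value, ... as C strings.
std::map<std::string, std::string> parse_flat_options(const char **options, int options_len)
{
    std::map<std::string, std::string> cfg;
    // The options_len-1 bound means a trailing key without a value is skipped
    for (int i = 0; i < options_len-1; i += 2)
    {
        // Values stay strings, so "true" is the string "true", not a boolean
        cfg[options[i]] = std::string(options[i+1]);
    }
    return cfg;
}

// Example call with two complete pairs and one dangling key
void example_usage()
{
    const char *opts[] = {
        "etcd_address", "127.0.0.1:2379",
        "client_enable_writeback", "true",
        "dangling_key",
    };
    auto cfg = parse_flat_options(opts, 5);
    assert(cfg.size() == 2);
    assert(cfg.at("client_enable_writeback") == "true");
    assert(cfg.count("dangling_key") == 0);
}
```

With the real library the same array shape would be passed directly to `vitastor_c_create_epoll_json(opts, 4)`, presumably with `options_len` counting array elements rather than pairs, as the loop bound suggests.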
Author	SHA1	Message	Date
Vitaliy Filippov	8222e3c77d	Release 1.1.0 New features: - Implement [client writeback cache](docs/config/client.en.md#client_enable_writeback) - Add the third I/O mode: [O_DIRECT|O_SYNC](docs/config/osd.en.md#data_io) (good for Optane) - Reduce load on etcd by splitting OSD lease and statistics reporting intervals: [etcd_stats_interval](docs/config/osd.en.md#etcd_stats_interval) (default 30 sec) - Make MON automatically filter OSDs by layout (block_size/immediate_commit/bitmap_granularity) to prevent "refusing to start PGs of this pool" errors on misconfiguration - Support running fio benchmarks on systems without io_uring - Make QEMU driver compatible with QEMU 8.1 - Document usage of [vhost-user-blk](docs/usage/qemu.en.md#vhost-user-blk) Bug fixes: - Fix resizing disks in QEMU driver (for example, in Proxmox) - Fix "unexpected result" in Proxmox driver by making CLI flush output on exit - Remove unneeded block_size mismatch warnings on pools without matching PGs - Fix possible segfault in vitastor-cli ls -l (usually with deleted pools) - Fix QEMU driver compatibility with systems without io_uring - Fix monitor eating 100% CPU when etcd is down (caused by infinite retries) - Fix potential incorrect write processing with snapshots (not caught in tests but could probably lead to client hangs) - Fix buffer insertion in cluster_client (not caught in tests but could probably lead to incorrect writes in rare cases) - Fix rare OSD crash during sync operation processing - Fix a reenterability issue in cluster_client not reproducible in QEMU/fio, but reproducible with the currently developed K/V database implementation - Fix deletion of the first modified object - OSDs could crash if you modified the same object a lot of times, then deleted it, and then modified it again - Fix the fio_sec_osd test tool	2023-10-28 00:33:06 +03:00
Vitaliy Filippov	29cbe70e74	Bump qemu version to vitastor4	2023-10-28 00:33:06 +03:00
Vitaliy Filippov	a883e79507	Make docs to add etcd_stats_interval	2023-10-27 14:09:26 +03:00
Vitaliy Filippov	be7e76f849	Split etcd_stats_interval out of etcd_report_interval	2023-10-27 01:26:26 +03:00
Vitaliy Filippov	6fd2cf5df6	Add documentation for the write-back cache	2023-10-27 01:26:26 +03:00
Vitaliy Filippov	294a754c9e	Allow write-back by default in NBD & NFS	2023-10-27 01:26:26 +03:00
Vitaliy Filippov	8bfea6e7de	Support vitastor_c_create_epoll() in fio driver	2023-10-26 22:57:36 +03:00
Vitaliy Filippov	bac9e34836	Allow to create vitastor_c with plain epoll without uring :-)	2023-10-26 22:57:36 +03:00
Vitaliy Filippov	8aa4d492c1	Allow to use epoll_manager without ringloop	2023-10-26 22:57:36 +03:00
Vitaliy Filippov	9336ee5476	Correctly free manual "small vector" in cluster_client %-)	2023-10-26 22:57:36 +03:00
Vitaliy Filippov	ad30b11519	Add the missing ringloop creation check to vitastor_c_create_uring_json()	2023-10-26 18:07:23 +03:00
Vitaliy Filippov	a061246997	Do not attempt to initialize QEMU driver via vitastor_c_create_qemu_uring(). It doesn't add any compatibility because vitastor_c_uring_register_eventfd() is added in the same VITASTOR_C_API_VERSION 2.	2023-10-26 17:46:19 +03:00
Vitaliy Filippov	5066e35a49	Fix write-over-delete failing for the very first entry in dirty_db	2023-10-21 17:00:14 +03:00
Vitaliy Filippov	93dc31f3fc	Fix possible segfault in vitastor-cli ls -l	2023-10-18 11:11:41 +03:00
Vitaliy Filippov	f245b56176	Fix another possible reenterability issue in cluster_client. Non-reproducible in QEMU/FIO, only caught during K/V DB debugging	2023-10-08 11:02:53 +03:00
Vitaliy Filippov	befca06f18	Support any OSD count in test_heal	2023-10-08 11:02:53 +03:00
Vitaliy Filippov	fbf0263625	Add qemu-storage-daemon to documentation	2023-09-16 18:40:52 +03:00
Vitaliy Filippov	3bcf276d4d	Run tests with writeback	2023-09-16 17:52:17 +03:00
Vitaliy Filippov	38db53f5ee	Implement client writeback cache - Disabled by default, enable with client_enable_writeback=true - Even then only enabled in FIO when -direct is disabled and in QEMU when block device cache is enabled in settings - Can also be enabled in other clients like vitastor-cli using parameter client_writeback_allowed=true, but not recommended	2023-09-16 17:52:17 +03:00
Vitaliy Filippov	cd543a90bc	Prevent stack overflows in cli_merge with CAS and writeback cache	2023-09-16 17:52:17 +03:00
Vitaliy Filippov	f600cc07b0	Autosync in blockstore every autosync_writes, too	2023-09-16 17:52:17 +03:00
Vitaliy Filippov	6a8e530e6b	Add FIXME to timerfd_manager	2023-09-16 17:52:17 +03:00
Vitaliy Filippov	5cadb170b9	Fix possible OSD crash during sync due to missing min_flushed_journal_sector reset	2023-09-16 17:52:17 +03:00
Vitaliy Filippov	e72d4ed1d4	Remove unused bs_sync fields	2023-09-16 17:52:17 +03:00
Vitaliy Filippov	ff479a102d	Make MON filter OSDs by block layout to prevent "refusing to start PGs of this pool" errors on misconfiguration	2023-09-16 17:52:17 +03:00
Vitaliy Filippov	27d0d5b06a	Reads do not have to wait for buffer flushes anymore	2023-09-16 17:52:17 +03:00
Vitaliy Filippov	33950c1ec8	Fix fio_sec_osd attr_len	2023-09-16 17:49:10 +03:00
Vitaliy Filippov	eea7ef1f19	Remove debug osd_trace from test_write	2023-09-12 01:35:36 +03:00
Vitaliy Filippov	cc0fdc6253	Remove erroneous block_size mismatch warnings on pools without matching PGs	2023-09-08 23:19:04 +03:00
Vitaliy Filippov	79ecd59b10	Flush STDOUT and STDERR before exiting from cli to fix Proxmox "Unexpected result"	2023-09-07 17:30:26 +03:00
Vitaliy Filippov	51081c9b45	Put etcd into tmpfs for tests	2023-09-07 02:35:09 +03:00
Vitaliy Filippov	b7d398be5b	Fix sscanf validation usage (field count instead of null_byte == 0)	2023-09-07 02:34:35 +03:00
Vitaliy Filippov	85e9f67d9d	Add supported_truncate_flags	2023-09-06 17:37:52 +03:00
Vitaliy Filippov	79c6d6f323	Make QEMU driver compatible with QEMU 8.1	2023-08-24 02:23:55 +03:00
Vitaliy Filippov	ae760dbc1d	Fix co_truncate size division by BDRV_SECTOR_SIZE	2023-08-24 01:55:35 +03:00
Vitaliy Filippov	65487da4b1	Do not include msgr_rdma.h into messenger.h	2023-08-24 01:55:35 +03:00
Vitaliy Filippov	7862282938	Extract validation to check_rw(), remove duplicate code with OP_SYNC	2023-08-13 23:49:52 +03:00
Vitaliy Filippov	30ce2bd951	Fix buffer insert in cluster_client	2023-08-12 11:08:50 +03:00
Vitaliy Filippov	b1a0afd10a	Aggregate buffer flushes	2023-08-11 11:26:13 +03:00
Vitaliy Filippov	85b6134910	Return dirty buffers on read in client. Required at least to return buffers when they need to be replayed, but until they are actually replayed	2023-08-09 00:57:08 +03:00
Vitaliy Filippov	b1b07a393d	Fix incorrect marking op parts as done with snapshots (could probably lead to client hangs)	2023-08-09 00:57:08 +03:00
Vitaliy Filippov	7333022adf	Add a third I/O mode: O_DIRECT|O_SYNC, change parameters to data_io/meta_io/journal_io	2023-08-09 00:57:08 +03:00
Vitaliy Filippov	ab8627c9fa	Fix monitor retrying failed etcd connection in an infinite loop without pauses	2023-08-09 00:57:08 +03:00
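
Note on the writeback commits listed above: `test_writeback()` in the diff expects three consecutive 4 KiB writes to be flushed as one 12 KiB write, and four overlapping writes within the first 64 KiB to be flushed as one 64 KiB write. The sketch below is a toy model of that merging behaviour under simple assumptions (one inode, dirty data held in a `std::map` keyed by offset, no flushing, no limits). `toy_writeback_t`, `put` and `toy_writeback_example` are hypothetical names; this is not the cache implemented in Vitastor's cluster_client.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <iterator>
#include <map>
#include <utility>
#include <vector>

// Toy model only. Invariant after put(): buffers neither overlap nor touch.
struct toy_writeback_t
{
    std::map<uint64_t, std::vector<uint8_t>> dirty;

    void put(uint64_t offset, const std::vector<uint8_t> & data)
    {
        if (data.empty())
            return;
        uint64_t start = offset, end = offset + data.size();
        // Step back to the previous buffer if it overlaps or touches the new range
        auto it = dirty.upper_bound(start);
        if (it != dirty.begin())
        {
            auto prev = std::prev(it);
            if (prev->first + prev->second.size() >= start)
                it = prev;
        }
        // Walk over all overlapping or adjacent buffers and widen the merged range
        auto first = it;
        while (it != dirty.end() && it->first <= end)
        {
            if (it->first < start)
                start = it->first;
            uint64_t buf_end = it->first + it->second.size();
            if (buf_end > end)
                end = buf_end;
            ++it;
        }
        std::vector<uint8_t> merged(end - start);
        // Copy old data first, then the new write on top, so newer bytes win
        for (auto j = first; j != it; ++j)
            memcpy(merged.data() + (j->first - start), j->second.data(), j->second.size());
        memcpy(merged.data() + (offset - start), data.data(), data.size());
        dirty.erase(first, it);
        dirty[start] = std::move(merged);
    }
};

// Replays the first write sequence from test_writeback() against the toy model
void toy_writeback_example()
{
    toy_writeback_t wb;
    wb.put(0, std::vector<uint8_t>(4096, 0x55));
    wb.put(4096, std::vector<uint8_t>(4096, 0x55));
    wb.put(8192, std::vector<uint8_t>(4096, 0x55));
    assert(wb.dirty.size() == 1);
    assert(wb.dirty.begin()->second.size() == 3*4096);
    // A distant write stays a separate dirty buffer
    wb.put(1024*1024, std::vector<uint8_t>(4096, 0x66));
    assert(wb.dirty.size() == 2);
}
```

Keying the map by start offset makes adjacency a local check: only the buffer just before the new range and the buffers that begin inside it need to be examined.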