Merge pull request 'master' (#3) from vitalif/vitastor:master into master

Reviewed-on: #3
Release 1.4.4
2024-02-13 14:44:09 +03:00 · 2024-02-11 16:23:08 +03:00 · 2024-02-11 16:13:52 +03:00 · 2024-02-11 16:13:52 +03:00 · 2024-02-11 16:13:52 +03:00 · 2024-02-11 13:42:51 +03:00
103 changed files with 2385 additions and 464 deletions
--- a/.gitea/workflows/test.yml
+++ b/.gitea/workflows/test.yml
@@ -395,7 +395,7 @@ jobs:
    steps:
    - name: Run test
      id: test
-      timeout-minutes: 3
+      timeout-minutes: 6
      run: SCHEME=ec /root/vitastor/tests/test_snapshot_chain.sh
    - name: Print logs
      if: always() && steps.test.outcome == 'failure'
@@ -532,6 +532,24 @@ jobs:
          echo ""
        done
  test_switch_primary:
    runs-on: ubuntu-latest
    needs: build
    container: ${{env.TEST_IMAGE}}:${{github.sha}}
    steps:
    - name: Run test
      id: test
      timeout-minutes: 3
      run: /root/vitastor/tests/test_switch_primary.sh
    - name: Print logs
      if: always() && steps.test.outcome == 'failure'
      run: |
        for i in /root/vitastor/testdata/*.log /root/vitastor/testdata/*.txt; do
          echo "-------- $i --------"
          cat $i
          echo ""
        done
  test_write:
    runs-on: ubuntu-latest
    needs: build
--- a/.gitea/workflows/tests-to-yaml.pl
+++ b/.gitea/workflows/tests-to-yaml.pl
@@ -39,6 +39,10 @@ for my $line (<>)
                $test_name .= '_'.lc($1).'_'.$2;
            }
        }
        if ($test_name eq 'test_snapshot_chain_ec')
        {
            $timeout = 6;
        }
        $line =~ s!\./test_!/root/vitastor/tests/test_!;
        # Gitea CI doesn't support artifacts yet, lol
        #- name: Upload results
--- a/CLA-en.md
+++ b/CLA-en.md
@@ -0,0 +1,115 @@
 ## Contributor License Agreement
 > This Agreement is made in the Russian and English languages. **The English
 text of Agreement is for informational purposes only** and is not binding
 for the Parties.
 >
 > In the event of a conflict between the provisions of the Russian and
 English versions of this Agreement, the **Russian version shall prevail**.
 >
 > Russian version is published at https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md
 This document represents the offer of Filippov Vitaliy Vladimirovich
 ("Author"), author and copyright holder of Vitastor software ("Program"),
 acknowledged by a certificate of Federal Service for Intellectual
 Property of Russian Federation (Rospatent) # 2021617829 dated 20 May 2021,
 to "Contributors" to conclude this license agreement as follows
 ("Agreement" or "Offer").
 In accordance with Art. 435, Art. 438 of the Civil Code of the Russian
 Federation, this Agreement is an offer and in case of acceptance of the
 offer, an agreement is considered concluded on the conditions specified
 in the offer.
 1. Applicable Terms. \
   1.1. "Official Repository" shall mean the computer storage, operated by
        the Author, containing all prior and future versions of the Source
        Code of the Program, at Internet addresses https://git.yourcmc.ru/vitalif/vitastor/
        or https://github.com/vitalif/vitastor/. \
   1.2. "Contributions" shall mean results of intellectual activity
        (including, but not limited to, source code, libraries, components,
        texts, documentation) which can be software or elements of the software
        and which are provided by Contributors to the Author for inclusion
        in the Program. \
   1.3. "Contributor" shall mean a person who provides Contributions to
        the Author and agrees with all provisions of this Agreement.
        A Contributor can be: 1) an individual; or 2) a legal entity or an
        individual entrepreneur in case when an individual provides Contributions
        on behalf of third parties, including on behalf of his employer.
 2. Subject of the Agreement. \
   2.1. Subject of the Agreement shall be the Contributions sent to the Author by Contributors. \
   2.2. The Contributor grants to the Author the right to use Contributions at his own
        discretion and without any necessity to get a prior approval from Contributor or
        any other third party in any way, under a simple (non-exclusive), royalty-free,
        irrevocable license throughout the world by all means not contrary to law, in whole
        or as a part of the Program, or other open-source or closed-source computer programs,
        products or services (hereinafter -- the "License"), including, but not limited to: \
        2.2.1. to execute Contributions and use them for any tasks; \
        2.2.2. to publish and distribute Contributions in modified or unmodified form and/or to rent them; \
        2.2.3. to modify Contributions, add comments, illustrations or any explanations to Contributions while using them; \
        2.2.4. to create other results of intellectual activity based on Contributions, including derivative works and composite works; \
        2.2.5. to translate Contributions into other languages, including other programming languages; \
        2.2.6. to carry out rental and public display of Contributions; \
        2.2.7. to use Contributions under the trade name and/or any trademark or any other label, or without it, as the Author thinks fit; \
   2.3. The Contributor grants to the Author the right to sublicense any of the aforementioned
        rights to third parties on any terms at the Author's discretion. \
   2.4. The License is provided for the entire duration of Contributor's
        exclusive intellectual property rights to the Contributions. \
   2.5. The Contributor grants to the Author the right to decide how and where to mention,
        or to not mention at all, the fact of his authorship, name, nickname and/or company
        details when including Contributions into the Program or in any other computer
        programs, products or services.
 3. Acceptance of the Offer \
   3.1. The Contributor may provide Contributions to the Author in the form of
        a "Pull Request" in an Official Repository of the Program or by any
        other electronic means of communication, including, but not limited to,
        E-mail or messenger applications. \
   3.2. The acceptance of the Offer shall be the fact of provision of Contributions
        to the Author by the Contributor by any means with the following remark:
        “I accept Vitastor CLA agreement: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md”
        or “Я принимаю соглашение Vitastor CLA: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md”. \
   3.3. Date of acceptance of the Offer shall be the date of such provision.
 4. Rights and obligations of the parties. \
   4.1. The Contributor reserves the right to use Contributions by any lawful means
        not contrary to this Agreement. \
   4.2. The Author has the right to refuse to include Contributions into the Program
        at any moment with no explanation to the Contributor.
 5. Representations and Warranties. \
   5.1. The person providing Contributions for the purpose of their inclusion
        in the Program represents and warrants that he is the Contributor
        or legally acts on the Contributor's behalf. Name or company details
        of the Contributor shall be provided with the Contribution at the moment
        of their provision to the Author. \
   5.2. The Contributor represents and warrants that he legally owns exclusive
        intellectual property rights to the Contributions. \
   5.3. The Contributor represents and warrants that any further use of
        Contributions by the Author as provided by Contributor under the terms
        of the Agreement does not infringe on intellectual and other rights and
        legitimate interests of third parties. \
   5.4. The Contributor represents and warrants that he has all rights and legal
        capacity needed to accept this Offer; \
   5.5. The Contributor represents and warrants that Contributions don't
        contain malware or any information considered illegal under the law
        of Russian Federation.
 6. Termination of the Agreement \
   6.1. The Agreement may be terminated at will of both Author and Contributor,
        formalised in the written form or if the Agreement is terminated on
        reasons prescribed by the law of Russian Federation.
 7. Final Clauses \
   7.1. The Contributor may optionally sign the Agreement in the written form. \
   7.2. The Agreement is deemed to become effective from the Date of signing of
        the Agreement and until the expiration of Contributor's exclusive
        intellectual property rights to the Contributions. \
   7.3. The Author may unilaterally alter the Agreement without informing Contributors.
        The new version of the document shall come into effect 3 (three) days after
        being published in the Official Repository of the Program at Internet address
        [https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md).
        Contributors should keep informed about the actual version of the Agreement themselves. \
   7.4. If the Author and the Contributor fail to agree on disputable issues,
        disputes shall be referred to the Moscow Arbitration court.
--- a/CLA-ru.md
+++ b/CLA-ru.md
@@ -0,0 +1,108 @@
 ## Лицензионное соглашение с участником
 > Данная Оферта написана в Русской и Английской версиях. **Версия на английском
 языке предоставляется в информационных целях** и не связывает стороны договора.
 >
 > В случае несоответствий между положениями Русской и Английской версий Договора,
 **Русская версия имеет приоритет**.
 >
 > Английская версия опубликована по адресу https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md
 Настоящий договор-оферта (далее по тексту – Оферта, Договор) адресована физическим
 и юридическим лицам (далее – Участникам) и является официальным публичным предложением
 Филиппова Виталия Владимировича (далее – Автора) программного обеспечения Vitastor,
 свидетельство Федеральной службы по интеллектуальной собственности (Роспатент) № 2021617829
 от 20 мая 2021 г. (далее – Программа) о нижеследующем:
 1. Термины и определения \
   1.1. Репозиторий – электронное хранилище, содержащее исходный код Программы. \
   1.2. Доработка – результат интеллектуальной деятельности Участника, включающий
        в себя изменения или дополнения к исходному коду Программы, которые Участник
        желает включить в состав Программы для дальнейшего использования и распространения
        Автором и для этого направляет их Автору. \
   1.3. Участник – физическое или юридическое лицо, вносящее Доработки в код Программы. \
   1.4. ГК РФ – Гражданский кодекс Российской Федерации.
 2. Предмет оферты \
   2.1. Предметом настоящей оферты являются Доработки, отправляемые Участником Автору. \
   2.2. Участник предоставляет Автору право использовать Доработки по собственному усмотрению
        и без необходимости предварительного согласования с Участником или иным третьим лицом
        на условиях простой (неисключительной) безвозмездной безотзывной лицензии, полностью
        или фрагментарно, в составе Программы или других программ, продуктов или сервисов
        как с открытым, так и с закрытым исходным кодом, любыми способами, не противоречащими
        закону, включая, но не ограничиваясь следующими: \
        2.2.1. Запускать и использовать Доработки для выполнения любых задач; \
        2.2.2. Распространять, импортировать и доводить Доработки до всеобщего сведения; \
        2.2.3. Вносить в Доработки изменения, сокращения и дополнения, снабжать Доработки
               при их использовании комментариями, иллюстрациями или пояснениями; \
        2.2.4. Создавать на основе Доработок иные результаты интеллектуальной деятельности,
               в том числе производные и составные произведения; \
        2.2.5. Переводить Доработки на другие языки, в том числе на другие языки программирования; \
        2.2.6. Осуществлять прокат и публичный показ Доработок; \
        2.2.7. Использовать Доработки под любым фирменным наименованием, товарным знаком
               (знаком обслуживания) или иным обозначением, или без такового. \
   2.3. Участник предоставляет Автору право сублицензировать полученные права на Доработки
        третьим лицам на любых условиях на усмотрение Автора. \
   2.4. Участник предоставляет Автору права на Доработки на территории всего мира. \
   2.5. Участник предоставляет Автору права на весь срок действия исключительного права
        Участника на Доработки. \
   2.6. Участник предоставляет Автору права на Доработки на безвозмездной основе. \
   2.7. Участник разрешает Автору самостоятельно определять порядок, способ и
        место указания его имени, реквизитов и/или псевдонима при включении
        Доработок в состав Программы или других программ, продуктов или сервисов.
 3. Акцепт Оферты \
   3.1. Участник может передавать Доработки в адрес Автора через зеркала официального
        Репозитория Программы по адресам https://git.yourcmc.ru/vitalif/vitastor/ или
        https://github.com/vitalif/vitastor/ в виде “запроса на слияние” (pull request),
        либо в письменном виде или с помощью любых других электронных средств коммуникации,
        например, электронной почты или мессенджеров. \
   3.2. Факт передачи Участником Доработок в адрес Автора любым способом с одной из пометок
        “I accept Vitastor CLA agreement: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md”
        или “Я принимаю соглашение Vitastor CLA: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md”
        является полным и безоговорочным акцептом (принятием) Участником условий настоящей
        Оферты, т.е. Участник считается ознакомившимся с настоящим публичным договором и
        в соответствии с ГК РФ признается лицом, вступившим с Автором в договорные отношения
        на основании настоящей Оферты. \
   3.3. Датой акцептирования настоящей Оферты считается дата такой передачи.
 4. Права и обязанности Сторон \
   4.1. Участник сохраняет за собой право использовать Доработки любым законным
        способом, не противоречащим настоящему Договору. \
   4.2. Автор вправе отказать Участнику во включении Доработок в состав
        Программы без объяснения причин в любой момент по своему усмотрению.
 5. Гарантии и заверения \
   5.1. Лицо, направляющее Доработки для целей их включения в состав Программы,
        гарантирует, что является Участником или представителем Участника. Имя или реквизиты
        Участника должны быть указаны при их передаче в адрес Автора Программы. \
   5.2. Участник гарантирует, что является законным обладателем исключительных прав
        на Доработки. \
   5.3. Участник гарантирует, что на момент акцептирования настоящей Оферты ему
        ничего не известно (и не могло быть известно) о правах третьих лиц на
        передаваемые Автору Доработки или их часть, которые могут быть нарушены
        в связи с передачей Доработок по настоящему Договору. \
   5.4. Участник гарантирует, что является дееспособным лицом и обладает всеми
        необходимыми правами для заключения Договора. \
   5.5. Участник гарантирует, что Доработки не содержат вредоносного ПО, а также
        любой другой информации, запрещённой к распространению по законам Российской
        Федерации.
 6. Прекращение действия оферты \
   6.1. Действие настоящего договора может быть прекращено по соглашению сторон,
        оформленному в письменном виде, а также вследствие его расторжения по основаниям,
        предусмотренным законом.
 7. Заключительные положения \
   7.1. Участник вправе по желанию подписать настоящий Договор в письменном виде. \
   7.2. Настоящий договор действует с момента его заключения и до истечения срока
        действия исключительных прав Участника на Доработки. \
   7.3. Автор имеет право в одностороннем порядке вносить изменения и дополнения в договор
        без специального уведомления об этом Участников. Новая редакция документа вступает
        в силу через 3 (Три) календарных дня со дня опубликования в официальном Репозитории
        Программы по адресу в сети Интернет
        [https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md).
        Участники самостоятельно отслеживают действующие условия Оферты. \
   7.4. Все споры, возникающие между сторонами в процессе их взаимодействия по настоящему
        договору, решаются путём переговоров. В случае невозможности урегулирования споров
        переговорным порядком стороны разрешают их в Арбитражном суде г.Москвы.
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -2,6 +2,6 @@ cmake_minimum_required(VERSION 2.8.12)
 project(vitastor)
-set(VERSION "1.3.1")
+set(VERSION "1.4.4")
 add_subdirectory(src)
--- a/2
+++ b/2
@@ -1 +1 @@
-Subproject commit 45e6d1f13196a0824e2089a586c53b9de0283f17
+Subproject commit 8de8b467acbca50cfd8835c20e0e379110f3b32b
--- a/csi/Makefile
+++ b/csi/Makefile
@@ -1,4 +1,4 @@
-VERSION ?= v1.3.1
+VERSION ?= v1.4.4
 all: build push
--- a/csi/deploy/004-csi-nodeplugin.yaml
+++ b/csi/deploy/004-csi-nodeplugin.yaml
@@ -49,7 +49,7 @@ spec:
            capabilities:
              add: ["SYS_ADMIN"]
            allowPrivilegeEscalation: true
-          image: vitalif/vitastor-csi:v1.3.1
+          image: vitalif/vitastor-csi:v1.4.4
          args:
            - "--node=$(NODE_ID)"
            - "--endpoint=$(CSI_ENDPOINT)"
--- a/csi/deploy/007-csi-provisioner.yaml
+++ b/csi/deploy/007-csi-provisioner.yaml
@@ -121,7 +121,7 @@ spec:
            privileged: true
            capabilities:
              add: ["SYS_ADMIN"]
-          image: vitalif/vitastor-csi:v1.3.1
+          image: vitalif/vitastor-csi:v1.4.4
          args:
            - "--node=$(NODE_ID)"
            - "--endpoint=$(CSI_ENDPOINT)"
--- a/csi/src/config.go
+++ b/csi/src/config.go
@@ -5,7 +5,7 @@ package vitastor
 const (
    vitastorCSIDriverName    = "csi.vitastor.io"
-    vitastorCSIDriverVersion = "1.3.1"
+    vitastorCSIDriverVersion = "1.4.4"
 )
 // Config struct fills the parameters of request or user input
--- a/csi/src/nodeserver.go
+++ b/csi/src/nodeserver.go
@@ -14,6 +14,7 @@ import (
    "strconv"
    "strings"
    "syscall"
    "time"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
@@ -32,6 +33,7 @@ type NodeServer struct
    useVduse bool
    stateDir string
    mounter mount.Interface
    restartInterval time.Duration
 }
 type DeviceState struct
@@ -65,6 +67,16 @@ func NewNodeServer(driver *Driver) *NodeServer
    if (ns.useVduse)
    {
        ns.restoreVduseDaemons()
        dur, err := time.ParseDuration(os.Getenv("RESTART_INTERVAL"))
        if (err != nil)
        {
            dur = 10 * time.Second
        }
        ns.restartInterval = dur
        if (ns.restartInterval != time.Duration(0))
        {
            go ns.restarter()
        }
    }
    return ns
 }
@@ -176,7 +188,6 @@ func (ns *NodeServer) unmapNbd(devicePath string)
 func findByPidFile(pidFile string) (*os.Process, error)
 {
-    klog.Infof("killing process with PID from file %s", pidFile)
    pidBuf, err := os.ReadFile(pidFile)
    if (err != nil)
    {
@@ -197,6 +208,7 @@ func findByPidFile(pidFile string) (*os.Process, error)
 func killByPidFile(pidFile string) error
 {
+    klog.Infof("killing process with PID from file %s", pidFile)
    proc, err := findByPidFile(pidFile)
    if (err != nil)
    {
@@ -364,6 +376,21 @@ func (ns *NodeServer) unmapVduseById(vdpaId string)
    }
 }
 func (ns *NodeServer) restarter()
 {
    // Restart dead VDUSE daemons at regular intervals
    // Otherwise volume I/O may hang in case of a qemu-storage-daemon crash
    // Moreover, it may lead to a kernel panic if the kernel is configured to
    // panic on hung tasks
    ticker := time.NewTicker(ns.restartInterval)
    defer ticker.Stop()
    for
    {
        <-ticker.C
        ns.restoreVduseDaemons()
    }
 }
 func (ns *NodeServer) restoreVduseDaemons()
 {
    pattern := ns.stateDir+"vitastor-vduse-*.json"
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,4 +1,4 @@
-vitastor (1.3.1-1) unstable; urgency=medium
+vitastor (1.4.4-1) unstable; urgency=medium
  * Bugfixes
--- a/debian/vitastor.Dockerfile
+++ b/debian/vitastor.Dockerfile
@@ -35,8 +35,8 @@ RUN set -e -x; \
    mkdir -p /root/packages/vitastor-$REL; \
    rm -rf /root/packages/vitastor-$REL/*; \
    cd /root/packages/vitastor-$REL; \
-    cp -r /root/vitastor vitastor-1.3.1; \
-    cd vitastor-1.3.1; \
+    cp -r /root/vitastor vitastor-1.4.4; \
+    cd vitastor-1.4.4; \
    ln -s /root/fio-build/fio-*/ ./fio; \
    FIO=$(head -n1 fio/debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
    ls /usr/include/linux/raw.h || cp ./debian/raw.h /usr/include/linux/raw.h; \
@@ -49,8 +49,8 @@ RUN set -e -x; \
    rm -rf a b; \
    echo "dep:fio=$FIO" > debian/fio_version; \
    cd /root/packages/vitastor-$REL; \
-    tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_1.3.1.orig.tar.xz vitastor-1.3.1; \
-    cd vitastor-1.3.1; \
+    tar --sort=name --mtime='2020-01-01' --owner=0 --group=0 --exclude=debian -cJf vitastor_1.4.4.orig.tar.xz vitastor-1.4.4; \
+    cd vitastor-1.4.4; \
    V=$(head -n1 debian/changelog | perl -pe 's/^.*\((.*?)\).*$/$1/'); \
    DEBFULLNAME="Vitaliy Filippov <vitalif@yourcmc.ru>" dch -D $REL -v "$V""$REL" "Rebuild for $REL"; \
    DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage --jobs=auto -sa; \
--- a/docs/config/client.en.md
+++ b/docs/config/client.en.md
@@ -6,8 +6,8 @@
 # Client Parameters
-These parameters apply only to clients and affect their interaction with
-the cluster.
+These parameters apply only to Vitastor clients (QEMU, fio, NBD and so on) and
+affect their interaction with the cluster.
 - [client_max_dirty_bytes](#client_max_dirty_bytes)
 - [client_max_dirty_ops](#client_max_dirty_ops)
--- a/docs/config/client.ru.md
+++ b/docs/config/client.ru.md
@@ -6,7 +6,7 @@
 # Параметры клиентского кода
-Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD) и
+Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD и т.п.) и
 затрагивают логику их работы с кластером.
 - [client_max_dirty_bytes](#client_max_dirty_bytes)
--- a/docs/config/monitor.en.md
+++ b/docs/config/monitor.en.md
@@ -19,8 +19,8 @@ These parameters only apply to Monitors.
 ## etcd_mon_ttl
 - Type: seconds
-- Default: 30
-- Minimum: 10
+- Default: 1
+- Minimum: 5
 Monitor etcd lease refresh interval in seconds
--- a/docs/config/monitor.ru.md
+++ b/docs/config/monitor.ru.md
@@ -19,8 +19,8 @@
 ## etcd_mon_ttl
 - Тип: секунды
-- Значение по умолчанию: 30
-- Минимальное значение: 10
+- Значение по умолчанию: 1
+- Минимальное значение: 5
 Интервал обновления etcd резервации (lease) монитором
--- a/docs/config/network.en.md
+++ b/docs/config/network.en.md
@@ -215,8 +215,8 @@ is scheduled.
 ## up_wait_retry_interval
 - Type: milliseconds
-- Default: 500
-- Minimum: 50
+- Default: 50
+- Minimum: 10
 - Can be changed online: yes
 OSDs respond to clients with a special error code when they receive I/O
--- a/docs/config/network.ru.md
+++ b/docs/config/network.ru.md
@@ -224,8 +224,8 @@ OSD в любом случае согласовывают реальное зн
 ## up_wait_retry_interval
 - Тип: миллисекунды
-- Значение по умолчанию: 500
-- Минимальное значение: 50
+- Значение по умолчанию: 50
+- Минимальное значение: 10
 - Можно менять на лету: да
 Когда OSD получают от клиентов запросы ввода-вывода, относящиеся к не
--- a/docs/config/osd.en.md
+++ b/docs/config/osd.en.md
@@ -19,6 +19,7 @@ them, even without restarting by updating configuration in etcd.
 - [autosync_interval](#autosync_interval)
 - [autosync_writes](#autosync_writes)
 - [recovery_queue_depth](#recovery_queue_depth)
 - [recovery_sleep_us](#recovery_sleep_us)
 - [recovery_pg_switch](#recovery_pg_switch)
 - [recovery_sync_batch](#recovery_sync_batch)
 - [readonly](#readonly)
@@ -51,6 +52,14 @@ them, even without restarting by updating configuration in etcd.
 - [scrub_list_limit](#scrub_list_limit)
 - [scrub_find_best](#scrub_find_best)
 - [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce)
 - [recovery_tune_interval](#recovery_tune_interval)
 - [recovery_tune_util_low](#recovery_tune_util_low)
 - [recovery_tune_util_high](#recovery_tune_util_high)
 - [recovery_tune_client_util_low](#recovery_tune_client_util_low)
 - [recovery_tune_client_util_high](#recovery_tune_client_util_high)
 - [recovery_tune_agg_interval](#recovery_tune_agg_interval)
 - [recovery_tune_sleep_min_us](#recovery_tune_sleep_min_us)
 - [recovery_tune_sleep_cutoff_us](#recovery_tune_sleep_cutoff_us)
 ## etcd_report_interval
@@ -135,12 +144,24 @@ operations before issuing an fsync operation internally.
 ## recovery_queue_depth
 - Type: integer
-- Default: 4
+- Default: 1
 - Can be changed online: yes
-Maximum recovery operations per one primary OSD at any given moment of time.
-Currently it's the only parameter available to tune the speed or recovery
-and rebalancing, but it's planned to implement more.
+Maximum recovery and rebalance operations initiated by each OSD in parallel.
+Note that each OSD talks to a lot of other OSDs so actual number of parallel
+recovery operations per each OSD is greater than just recovery_queue_depth.
+Increasing this parameter can speedup recovery if [auto-tuning](#recovery_tune_interval)
+allows it or if it is disabled.
 ## recovery_sleep_us
 - Type: microseconds
 - Default: 0
 - Can be changed online: yes
 Delay for all recovery- and rebalance- related operations. If non-zero,
 such operations are artificially slowed down to reduce the impact on
 client I/O.
 ## recovery_pg_switch
@@ -508,3 +529,90 @@ the variant with most available equal copies is correct. For example, if
 you have 3 replicas and 1 of them differs, this one is considered to be
 corrupted. But if there is no "best" version with more copies than all
 others have then the object is also marked as inconsistent.
 ## recovery_tune_interval
 - Type: seconds
 - Default: 1
 - Can be changed online: yes
 Interval at which OSD re-considers client and recovery load and automatically
 adjusts [recovery_sleep_us](#recovery_sleep_us). Recovery auto-tuning is
 disabled if recovery_tune_interval is set to 0.
 Auto-tuning targets utilization. Utilization is a measure of load and is
 equal to the product of iops and average latency (so it may be greater
 than 1). You set "low" and "high" client utilization thresholds and two
 corresponding target recovery utilization levels. OSD calculates desired
 recovery utilization from client utilization using linear interpolation
 and auto-tunes recovery operation delay to make actual recovery utilization
 match desired.
 This reduces the impact of recovery/rebalance on client operations. The impact
 cannot be removed completely, but it should become acceptable. In some tests,
 rebalance previously dropped client write speed from 1.5 GB/s to 50-100 MB/s;
 with default auto-tuning settings it now only drops to ~1 GB/s.
 ## recovery_tune_util_low
 - Type: number
 - Default: 0.1
 - Can be changed online: yes
 Desired recovery/rebalance utilization when client load is high, i.e. when
 it is at or above recovery_tune_client_util_high.
 ## recovery_tune_util_high
 - Type: number
 - Default: 1
 - Can be changed online: yes
 Desired recovery/rebalance utilization when client load is low, i.e. when
 it is at or below recovery_tune_client_util_low.
 ## recovery_tune_client_util_low
 - Type: number
 - Default: 0
 - Can be changed online: yes
 Client utilization considered "low".
 ## recovery_tune_client_util_high
 - Type: number
 - Default: 0.5
 - Can be changed online: yes
 Client utilization considered "high".
 ## recovery_tune_agg_interval
 - Type: integer
 - Default: 10
 - Can be changed online: yes
 The number of last auto-tuning iterations to use for calculating the
 delay as average. Lower values result in quicker response to client
 load change, higher values result in more stable delay. Default value of 10
 is usually fine.
 ## recovery_tune_sleep_min_us
 - Type: microseconds
 - Default: 10
 - Can be changed online: yes
 Minimum possible value for auto-tuned recovery_sleep_us. Lower values
 are changed to 0.
 ## recovery_tune_sleep_cutoff_us
 - Type: microseconds
 - Default: 10000000
 - Can be changed online: yes
 Maximum possible value for auto-tuned recovery_sleep_us. Higher values
 are treated as outliers and ignored in aggregation.
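
One possible reading of how `recovery_tune_agg_interval`, `recovery_tune_sleep_min_us` and `recovery_tune_sleep_cutoff_us` interact is sketched below. The documentation only states that the delay is averaged over the last iterations, that values above the cutoff are ignored in aggregation, and that values below the minimum become 0; the order of these steps, the type `sleepAverager` and its method `add` are assumptions made for this illustration.

```go
package main

import "fmt"

// sleepAverager is a hypothetical moving-average helper.
type sleepAverager struct {
	window   int       // recovery_tune_agg_interval
	minUs    float64   // recovery_tune_sleep_min_us
	cutoffUs float64   // recovery_tune_sleep_cutoff_us
	samples  []float64 // most recent accepted samples, oldest first
}

// add records one per-iteration delay estimate and returns the
// aggregated recovery_sleep_us value under the assumptions above.
func (a *sleepAverager) add(sampleUs float64) float64 {
	// Samples above the cutoff are treated as outliers and skipped.
	if sampleUs <= a.cutoffUs {
		a.samples = append(a.samples, sampleUs)
		if len(a.samples) > a.window {
			a.samples = a.samples[1:]
		}
	}
	if len(a.samples) == 0 {
		return 0
	}
	sum := 0.0
	for _, s := range a.samples {
		sum += s
	}
	avg := sum / float64(len(a.samples))
	// Averages below the minimum are replaced by 0 (no delay).
	if avg < a.minUs {
		return 0
	}
	return avg
}

func main() {
	a := &sleepAverager{window: 10, minUs: 10, cutoffUs: 10000000}
	for _, s := range []float64{100, 200, 50000000, 300} {
		fmt.Printf("sample %.0f us -> sleep %.0f us\n", s, a.add(s))
	}
}
```

A smaller window makes the returned delay follow recent samples more closely, while a larger one smooths it, which matches the trade-off described for `recovery_tune_agg_interval`.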
--- a/docs/config/osd.ru.md
+++ b/docs/config/osd.ru.md
@@ -20,6 +20,7 @@
 - [autosync_interval](#autosync_interval)
 - [autosync_writes](#autosync_writes)
 - [recovery_queue_depth](#recovery_queue_depth)
 - [recovery_sleep_us](#recovery_sleep_us)
 - [recovery_pg_switch](#recovery_pg_switch)
 - [recovery_sync_batch](#recovery_sync_batch)
 - [readonly](#readonly)
@@ -52,6 +53,14 @@
 - [scrub_list_limit](#scrub_list_limit)
 - [scrub_find_best](#scrub_find_best)
 - [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce)
 - [recovery_tune_interval](#recovery_tune_interval)
 - [recovery_tune_util_low](#recovery_tune_util_low)
 - [recovery_tune_util_high](#recovery_tune_util_high)
 - [recovery_tune_client_util_low](#recovery_tune_client_util_low)
 - [recovery_tune_client_util_high](#recovery_tune_client_util_high)
 - [recovery_tune_agg_interval](#recovery_tune_agg_interval)
 - [recovery_tune_sleep_min_us](#recovery_tune_sleep_min_us)
 - [recovery_tune_sleep_cutoff_us](#recovery_tune_sleep_cutoff_us)
 ## etcd_report_interval
@@ -138,13 +147,25 @@ OSD, чтобы успевать очищать журнал - без них OSD
 ## recovery_queue_depth
 - Тип: целое число
-- Значение по умолчанию: 4
+- Значение по умолчанию: 1
 - Можно менять на лету: да
-Максимальное число операций восстановления на одном первичном OSD в любой
-момент времени. На данный момент единственный параметр, который можно менять
-для ускорения или замедления восстановления и перебалансировки данных, но
-в планах реализация других параметров.
+Максимальное число параллельных операций восстановления, инициируемых одним
+OSD в любой момент времени. Имейте в виду, что каждый OSD обычно работает с
+многими другими OSD, так что на практике параллелизм восстановления больше,
+чем просто recovery_queue_depth. Увеличение значения этого параметра может
+ускорить восстановление если [автотюнинг скорости](#recovery_tune_interval)
+разрешает это или если он отключён.
 ## recovery_sleep_us
 - Тип: микросекунды
 - Значение по умолчанию: 0
 - Можно менять на лету: да
 Delay for all recovery- and rebalance- related operations. If non-zero,
 such operations are artificially slowed down to reduce the impact on
 client I/O.
 ## recovery_pg_switch
@@ -535,3 +556,93 @@ EC (кодов коррекции ошибок) с более, чем 1 диск
 считается некорректной. Однако, если "лучшую" версию с числом доступных
 копий большим, чем у всех других версий, найти невозможно, то объект тоже
 маркируется неконсистентным.
 ## recovery_tune_interval
 - Тип: секунды
 - Значение по умолчанию: 1
 - Можно менять на лету: да
 Интервал, с которым OSD пересматривает клиентскую нагрузку и нагрузку
 восстановления и автоматически подстраивает [recovery_sleep_us](#recovery_sleep_us).
 Автотюнинг (автоподстройка) отключается, если recovery_tune_interval
 устанавливается в значение 0.
 Автотюнинг регулирует утилизацию. Утилизация является мерой нагрузки
 и равна произведению числа операций в секунду и средней задержки
 (то есть, она может быть выше 1). Вы задаёте два уровня клиентской
 утилизации - "низкий" и "высокий" (low и high) и два соответствующих
 целевых уровня утилизации операциями восстановления. OSD рассчитывает
 желаемый уровень утилизации восстановления линейной интерполяцией от
 клиентской утилизации и подстраивает задержку операций восстановления
 так, чтобы фактическая утилизация восстановления совпадала с желаемой.
 Это позволяет снизить влияние восстановления и ребаланса на клиентские
 операции. Конечно, невозможно исключить такое влияние полностью, но оно
 должно становиться адекватнее. В некоторых тестах перебалансировка могла
 снижать клиентскую скорость записи с 1.5 ГБ/с до 50-100 МБ/с, а теперь, с
 настройками автотюнинга по умолчанию, она снижается только до ~1 ГБ/с.
 ## recovery_tune_util_low
 - Тип: число
 - Значение по умолчанию: 0.1
 - Можно менять на лету: да
 Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
 высокая, то есть, находится на уровне или выше recovery_tune_client_util_high.
 ## recovery_tune_util_high
 - Тип: число
 - Значение по умолчанию: 1
 - Можно менять на лету: да
 Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
 низкая, то есть, находится на уровне или ниже recovery_tune_client_util_low.
 ## recovery_tune_client_util_low
 - Тип: число
 - Значение по умолчанию: 0
 - Можно менять на лету: да
 Клиентская утилизация, которая считается "низкой".
 ## recovery_tune_client_util_high
 - Тип: число
 - Значение по умолчанию: 0.5
 - Можно менять на лету: да
 Клиентская утилизация, которая считается "высокой".
 ## recovery_tune_agg_interval
 - Тип: целое число
 - Значение по умолчанию: 10
 - Можно менять на лету: да
 Число последних итераций автоподстройки для расчёта задержки как среднего
 значения. Меньшие значения параметра ускоряют отклик на изменение нагрузки,
 большие значения делают задержку стабильнее. Значение по умолчанию 10
 обычно нормальное и не требует изменений.
 ## recovery_tune_sleep_min_us
 - Тип: микросекунды
 - Значение по умолчанию: 10
 - Можно менять на лету: да
 Минимальное возможное значение авто-подстроенного recovery_sleep_us.
 Меньшие значения заменяются на 0.
 ## recovery_tune_sleep_cutoff_us
 - Тип: микросекунды
 - Значение по умолчанию: 10000000
 - Можно менять на лету: да
 Максимальное возможное значение авто-подстроенного recovery_sleep_us.
 Большие значения считаются случайными выбросами и игнорируются в
 усреднении.
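The three smoothing parameters documented above (`recovery_tune_agg_interval`, `recovery_tune_sleep_min_us`, `recovery_tune_sleep_cutoff_us`) can be read as a small averaging pipeline. The sketch below is only an illustration of that reading: the class name `SleepAggregator`, its option names and the exact order of the steps are assumptions, not Vitastor's actual OSD code.

```javascript
// Sketch only: the class name, option names and step order are assumptions.
class SleepAggregator
{
    constructor({ aggInterval = 10, sleepMinUs = 10, sleepCutoffUs = 10000000 } = {})
    {
        this.aggInterval = aggInterval;     // recovery_tune_agg_interval
        this.sleepMinUs = sleepMinUs;       // recovery_tune_sleep_min_us
        this.sleepCutoffUs = sleepCutoffUs; // recovery_tune_sleep_cutoff_us
        this.samples = [];
    }

    // Add the delay computed by one tuning iteration, return the smoothed delay
    add(sleepUs)
    {
        // Values above the cutoff are treated as outliers and skipped
        if (sleepUs <= this.sleepCutoffUs)
        {
            this.samples.push(sleepUs);
            // Keep only the last aggInterval samples
            if (this.samples.length > this.aggInterval)
                this.samples.shift();
        }
        if (!this.samples.length)
            return 0;
        const avg = this.samples.reduce((a, b) => a + b, 0) / this.samples.length;
        // Delays below the minimum are replaced with 0
        return avg < this.sleepMinUs ? 0 : avg;
    }
}
```

A shorter window reacts faster to load changes, while a longer one gives a steadier delay, which matches the trade-off described for `recovery_tune_agg_interval`.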
--- a/docs/config/src/make.js
+++ b/docs/config/src/make.js
@ -38,6 +38,7 @@ const types = {
        bool: 'boolean',
        int: 'integer',
        sec: 'seconds',
        float: 'number',
        ms: 'milliseconds',
        us: 'microseconds',
    },
@ -46,6 +47,7 @@ const types = {
        bool: 'булево (да/нет)',
        int: 'целое число',
        sec: 'секунды',
        float: 'число',
        ms: 'миллисекунды',
        us: 'микросекунды',
    },
--- a/docs/config/src/monitor.yml
+++ b/docs/config/src/monitor.yml
@ -1,7 +1,7 @@
 - name: etcd_mon_ttl
  type: sec
-  min: 10
+  min: 1
-  default: 30
+  default: 5
  info: Monitor etcd lease refresh interval in seconds
  info_ru: Интервал обновления etcd резервации (lease) монитором
 - name: etcd_mon_timeout
--- a/docs/config/src/network.yml
+++ b/docs/config/src/network.yml
@ -245,8 +245,8 @@
    повторная попытка соединения.
 - name: up_wait_retry_interval
  type: ms
-  min: 50
+  min: 10
-  default: 500
+  default: 50
  online: true
  info: |
    OSDs respond to clients with a special error code when they receive I/O
--- a/docs/config/src/osd.yml
+++ b/docs/config/src/osd.yml
@ -107,17 +107,29 @@
    принудительной отправкой fsync-а.
 - name: recovery_queue_depth
  type: int
-  default: 4
+  default: 1
  online: true
  info: |
-    Maximum recovery operations per one primary OSD at any given moment of time.
+    Maximum recovery and rebalance operations initiated by each OSD in parallel.
-    Currently it's the only parameter available to tune the speed or recovery
+    Note that each OSD talks to a lot of other OSDs so actual number of parallel
-    and rebalancing, but it's planned to implement more.
+    recovery operations per OSD is greater than just recovery_queue_depth.
    Increasing this parameter can speed up recovery if [auto-tuning](#recovery_tune_interval)
    allows it or if it is disabled.
  info_ru: |
-    Максимальное число операций восстановления на одном первичном OSD в любой
+    Максимальное число параллельных операций восстановления, инициируемых одним
-    момент времени. На данный момент единственный параметр, который можно менять
+    OSD в любой момент времени. Имейте в виду, что каждый OSD обычно работает с
-    для ускорения или замедления восстановления и перебалансировки данных, но
+    многими другими OSD, так что на практике параллелизм восстановления больше,
-    в планах реализация других параметров.
+    чем просто recovery_queue_depth. Увеличение значения этого параметра может
    ускорить восстановление если [автотюнинг скорости](#recovery_tune_interval)
    разрешает это или если он отключён.
 - name: recovery_sleep_us
  type: us
  default: 0
  online: true
  info: |
    Delay for all recovery- and rebalance- related operations. If non-zero,
    such operations are artificially slowed down to reduce the impact on
    client I/O.
 - name: recovery_pg_switch
  type: int
  default: 128
@ -626,3 +638,112 @@
    считается некорректной. Однако, если "лучшую" версию с числом доступных
    копий большим, чем у всех других версий, найти невозможно, то объект тоже
    маркируется неконсистентным.
 - name: recovery_tune_interval
  type: sec
  default: 1
  online: true
  info: |
    Interval at which OSD re-considers client and recovery load and automatically
    adjusts [recovery_sleep_us](#recovery_sleep_us). Recovery auto-tuning is
    disabled if recovery_tune_interval is set to 0.
    Auto-tuning targets utilization. Utilization is a measure of load and is
    equal to the product of iops and average latency (so it may be greater
    than 1). You set "low" and "high" client utilization thresholds and two
    corresponding target recovery utilization levels. OSD calculates desired
    recovery utilization from client utilization using linear interpolation
    and auto-tunes recovery operation delay to make actual recovery utilization
    match desired.
    This makes it possible to reduce recovery/rebalance impact on client operations.
    It is of course impossible to remove it completely, but the impact should become
    acceptable. In some tests rebalance could previously drop client write speed
    from 1.5 GB/s to 50-100 MB/s; with default auto-tuning settings it now only
    drops to ~1 GB/s.
  info_ru: |
    Интервал, с которым OSD пересматривает клиентскую нагрузку и нагрузку
    восстановления и автоматически подстраивает [recovery_sleep_us](#recovery_sleep_us).
    Автотюнинг (автоподстройка) отключается, если recovery_tune_interval
    устанавливается в значение 0.
    Автотюнинг регулирует утилизацию. Утилизация является мерой нагрузки
    и равна произведению числа операций в секунду и средней задержки
    (то есть, она может быть выше 1). Вы задаёте два уровня клиентской
    утилизации - "низкий" и "высокий" (low и high) и два соответствующих
    целевых уровня утилизации операциями восстановления. OSD рассчитывает
    желаемый уровень утилизации восстановления линейной интерполяцией от
    клиентской утилизации и подстраивает задержку операций восстановления
    так, чтобы фактическая утилизация восстановления совпадала с желаемой.
    Это позволяет снизить влияние восстановления и ребаланса на клиентские
    операции. Конечно, невозможно исключить такое влияние полностью, но оно
    должно становиться адекватнее. В некоторых тестах перебалансировка могла
    снижать клиентскую скорость записи с 1.5 ГБ/с до 50-100 МБ/с, а теперь, с
    настройками автотюнинга по умолчанию, она снижается только до ~1 ГБ/с.
 - name: recovery_tune_util_low
  type: float
  default: 0.1
  online: true
  info: |
    Desired recovery/rebalance utilization when client load is high, i.e. when
    it is at or above recovery_tune_client_util_high.
  info_ru: |
    Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
    высокая, то есть, находится на уровне или выше recovery_tune_client_util_high.
 - name: recovery_tune_util_high
  type: float
  default: 1
  online: true
  info: |
    Desired recovery/rebalance utilization when client load is low, i.e. when
    it is at or below recovery_tune_client_util_low.
  info_ru: |
    Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
    низкая, то есть, находится на уровне или ниже recovery_tune_client_util_low.
 - name: recovery_tune_client_util_low
  type: float
  default: 0
  online: true
  info: Client utilization considered "low".
  info_ru: Клиентская утилизация, которая считается "низкой".
 - name: recovery_tune_client_util_high
  type: float
  default: 0.5
  online: true
  info: Client utilization considered "high".
  info_ru: Клиентская утилизация, которая считается "высокой".
 - name: recovery_tune_agg_interval
  type: int
  default: 10
  online: true
  info: |
    The number of last auto-tuning iterations to use for calculating the
    delay as average. Lower values result in quicker response to client
    load change, higher values result in more stable delay. Default value of 10
    is usually fine.
  info_ru: |
    Число последних итераций автоподстройки для расчёта задержки как среднего
    значения. Меньшие значения параметра ускоряют отклик на изменение нагрузки,
    большие значения делают задержку стабильнее. Значение по умолчанию 10
    обычно нормальное и не требует изменений.
 - name: recovery_tune_sleep_min_us
  type: us
  default: 10
  online: true
  info: |
    Minimum possible value for auto-tuned recovery_sleep_us. Lower values
    are changed to 0.
  info_ru: |
    Минимальное возможное значение авто-подстроенного recovery_sleep_us.
    Меньшие значения заменяются на 0.
 - name: recovery_tune_sleep_cutoff_us
  type: us
  default: 10000000
  online: true
  info: |
    Maximum possible value for auto-tuned recovery_sleep_us. Higher values
    are treated as outliers and ignored in aggregation.
  info_ru: |
    Максимальное возможное значение авто-подстроенного recovery_sleep_us.
    Большие значения считаются случайными выбросами и игнорируются в
    усреднении.
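The interpolation described for `recovery_tune_interval` can be sketched as follows. This is a minimal illustration, not the OSD implementation: the function names are hypothetical, the defaults are copied from the parameter descriptions above, and latency is assumed to be measured in seconds so that utilization is a dimensionless number.

```javascript
// Sketch only: function names are hypothetical, not Vitastor's code.
// Defaults are taken from the parameter descriptions above.
const defaults = {
    recovery_tune_client_util_low: 0,
    recovery_tune_client_util_high: 0.5,
    recovery_tune_util_low: 0.1,
    recovery_tune_util_high: 1,
};

// Utilization = iops * average latency (assumed to be in seconds), so it may exceed 1
function utilization(iops, avgLatencySec)
{
    return iops * avgLatencySec;
}

// Map client utilization to the desired recovery utilization
function targetRecoveryUtil(clientUtil, cfg = defaults)
{
    const lo = cfg.recovery_tune_client_util_low;
    const hi = cfg.recovery_tune_client_util_high;
    if (clientUtil <= lo)
        return cfg.recovery_tune_util_high; // low client load: recovery may run fast
    if (clientUtil >= hi)
        return cfg.recovery_tune_util_low; // high client load: throttle recovery
    // Linear interpolation between the two target levels
    const t = (clientUtil - lo) / (hi - lo);
    return cfg.recovery_tune_util_high + t * (cfg.recovery_tune_util_low - cfg.recovery_tune_util_high);
}
```

With the defaults, a client utilization of 0.25 lies halfway between the thresholds, so the target recovery utilization is halfway between 1 and 0.1. The OSD would then adjust `recovery_sleep_us` until the measured recovery utilization approaches that target.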
--- a/docs/installation/kubernetes.en.md
+++ b/docs/installation/kubernetes.en.md
@ -37,6 +37,7 @@ Vitastor CSI supports:
 - Volume snapshots. Example: [snapshot class](../../csi/deploy/example-snapshot-class.yaml), [snapshot](../../csi/deploy/example-snapshot.yaml), [clone](../../csi/deploy/example-snapshot-clone.yaml)
 - [VDUSE](../usage/qemu.en.md#vduse) (preferred) and [NBD](../usage/nbd.en.md) device mapping methods
 - Upgrades with VDUSE - new handler processes are restarted when CSI pods are restarted themselves
 - VDUSE daemon auto-restart - handler processes are automatically restarted if they crash due to a bug in Vitastor client code
 - Multiple clusters by using multiple configuration files in ConfigMap.
 Remember that to use snapshots with CSI you also have to install [Snapshot Controller and CRDs](https://kubernetes-csi.github.io/docs/snapshot-controller.html#deployment).
--- a/docs/installation/kubernetes.ru.md
+++ b/docs/installation/kubernetes.ru.md
@ -37,6 +37,7 @@ CSI-плагин Vitastor поддерживает:
 - Снимки томов. Пример: [класс снимков](../../csi/deploy/example-snapshot-class.yaml), [снимок](../../csi/deploy/example-snapshot.yaml), [клон снимка](../../csi/deploy/example-snapshot-clone.yaml)
 - Способы подключения устройств [VDUSE](../usage/qemu.ru.md#vduse) (предпочитаемый) и [NBD](../usage/nbd.ru.md)
 - Обновление при использовании VDUSE - новые процессы-обработчики устройств успешно перезапускаются вместе с самими подами CSI
 - Автоперезапуск демонов VDUSE - процесс-обработчик автоматически перезапустится, если он внезапно упадёт из-за бага в коде клиента Vitastor
 - Несколько кластеров через задание нескольких файлов конфигурации в ConfigMap.
 Не забывайте, что для использования снимков нужно сначала установить [контроллер снимков и CRD](https://kubernetes-csi.github.io/docs/snapshot-controller.html#deployment).
--- a/docs/installation/proxmox.en.md
+++ b/docs/installation/proxmox.en.md
@ -25,7 +25,7 @@ vitastor: vitastor
    vitastor_pool testpool
    # path to the configuration file
    vitastor_config_path /etc/vitastor/vitastor.conf
-    # etcd address(es), required only if missing in the configuration file
+    # etcd address(es), OPTIONAL, required only if missing in the configuration file
    vitastor_etcd_address 192.168.7.2:2379/v3
    # prefix for keys in etcd
    vitastor_etcd_prefix /vitastor
--- a/docs/installation/proxmox.ru.md
+++ b/docs/installation/proxmox.ru.md
@ -24,7 +24,7 @@ vitastor: vitastor
    vitastor_pool testpool
    # Путь к файлу конфигурации
    vitastor_config_path /etc/vitastor/vitastor.conf
-    # Адрес(а) etcd, нужны, только если не указаны в vitastor.conf
+    # Адрес(а) etcd, ОПЦИОНАЛЬНЫ, нужны, только если не указаны в vitastor.conf
    vitastor_etcd_address 192.168.7.2:2379/v3
    # Префикс ключей метаданных в etcd
    vitastor_etcd_prefix /vitastor
--- a/docs/intro/features.en.md
+++ b/docs/intro/features.en.md
@ -32,6 +32,7 @@
 - [Scrubbing](../config/osd.en.md#auto_scrub) (verification of copies)
 - [Checksums](../config/layout-osd.en.md#data_csum_type)
 - [Client write-back cache](../config/client.en.md#client_enable_writeback)
 - [Intelligent recovery auto-tuning](../config/osd.en.md#recovery_tune_interval)
 ## Plugins and tools
--- a/docs/intro/features.ru.md
+++ b/docs/intro/features.ru.md
@ -34,6 +34,7 @@
 - [Фоновая проверка целостности](../config/osd.ru.md#auto_scrub) (сверка копий)
 - [Контрольные суммы](../config/layout-osd.ru.md#data_csum_type)
 - [Буферизация записи на стороне клиента](../config/client.ru.md#client_enable_writeback)
 - [Интеллектуальная автоподстройка скорости восстановления](../config/osd.ru.md#recovery_tune_interval)
 ## Драйверы и инструменты
--- a/docs/performance/theoretical.en.md
+++ b/docs/performance/theoretical.en.md
@ -11,19 +11,26 @@ Replicated setups:
 - Single-threaded write+fsync latency:
  - With immediate commit: 2 network roundtrips + 1 disk write.
  - With lazy commit: 4 network roundtrips + 1 disk write + 1 disk flush.
- Saturated parallel read iops: min(network bandwidth, sum(disk read iops)).
+- Linear read: `min(total network bandwidth, sum(disk read MB/s))`.
- Saturated parallel write iops: min(network bandwidth, sum(disk write iops / number of replicas / write amplification)).
+- Linear write: `min(total network bandwidth, sum(disk write MB/s / number of replicas))`.
 - Saturated parallel read iops: `min(total network bandwidth, sum(disk read iops))`.
 - Saturated parallel write iops: `min(total network bandwidth / number of replicas, sum(disk write iops / number of replicas / (write amplification = 4)))`.
-EC/XOR setups:
+EC/XOR setups (EC N+K):
 - Single-threaded (T1Q1) read latency: 1.5 network roundtrips + 1 disk read.
 - Single-threaded write+fsync latency:
  - With immediate commit: 3.5 network roundtrips + 1 disk read + 2 disk writes.
  - With lazy commit: 5.5 network roundtrips + 1 disk read + 2 disk writes + 2 disk fsyncs.
-  - 0.5 in actually (k-1)/k which means that an additional roundtrip doesn't happen when
+  - 0.5 is actually `(N-1)/N`, which means that an additional roundtrip doesn't happen when
    the read sub-operation can be served locally.
- Saturated parallel read iops: min(network bandwidth, sum(disk read iops)).
+- Linear read: `min(total network bandwidth, sum(disk read MB/s))`.
- Saturated parallel write iops: min(network bandwidth, sum(disk write iops * number of data drives / (number of data + parity drives) / write amplification)).
+- Linear write: `min(total network bandwidth, sum(disk write MB/s * N/(N+K)))`.
-  In fact, you should put disk write iops under the condition of ~10% reads / ~90% writes in this formula.
+- Saturated parallel read iops: `min(total network bandwidth, sum(disk read iops))`.
 - Saturated parallel write iops: roughly `total iops / (N+K) / WA`. More exactly,
  `min(total network bandwidth * N/(N+K), sum(disk randrw iops / (N*4 + K*5 + 1)))` with
  random read/write mix corresponding to `(N-1)/(N*4 + K*5 + 1)*100 % reads`.
  - For example, with EC 2+1 it is: `(7% randrw iops) / 14`.
  - With EC 6+3 it is: `(12.5% randrw iops) / 40`.
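The EC write formula above can be checked with a few lines of arithmetic. The helpers below are hypothetical and only restate the formula from the text; the network limit is assumed to be already expressed in the same unit as the disk figure (operations per second).

```javascript
// Sketch only: ecWriteModel and ecRandomWriteIops are hypothetical helpers.
// Divisor and read share of the random read/write mix for EC N+K
function ecWriteModel(n, k)
{
    const divisor = n*4 + k*5 + 1;
    const readPercent = (n-1) / divisor * 100;
    return { divisor, readPercent };
}

// Estimated saturated random write iops for EC N+K.
// totalNetworkLimit is assumed to be given in iops, like sumDiskRandrwIops.
function ecRandomWriteIops(n, k, totalNetworkLimit, sumDiskRandrwIops)
{
    const { divisor } = ecWriteModel(n, k);
    return Math.min(totalNetworkLimit * n / (n+k), sumDiskRandrwIops / divisor);
}
```

For EC 2+1 this gives a divisor of 14 with about 7% reads, and for EC 6+3 a divisor of 40 with 12.5% reads, matching the two examples in the list.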
 Write amplification for 4 KB blocks is usually 3-5 in Vitastor:
 1. Journal block write
--- a/docs/performance/theoretical.ru.md
+++ b/docs/performance/theoretical.ru.md
@ -11,20 +11,27 @@
 - Запись+fsync в 1 поток:
  - С мгновенным сбросом: 2 RTT + 1 запись.
  - С отложенным ("ленивым") сбросом: 4 RTT + 1 запись + 1 fsync.
- Параллельное чтение: сумма IOPS всех дисков либо производительность сети, если в сеть упрётся раньше.
+- Линейное чтение: сумма МБ/с чтения всех дисков, либо общая производительность сети (сумма пропускной способности сети всех нод), если в сеть упрётся раньше.
- Параллельная запись: сумма IOPS всех дисков / число реплик / WA либо производительность сети, если в сеть упрётся раньше.
+- Линейная запись: сумма МБ/с записи всех дисков / число реплик, либо производительность сети / число реплик, если в сеть упрётся раньше.
 - Параллельное случайное мелкое чтение: сумма IOPS чтения всех дисков, либо производительность сети, если в сеть упрётся раньше.
 - Параллельная случайная мелкая запись: сумма IOPS записи всех дисков / число реплик / WA, либо производительность сети / число реплик, если в сеть упрётся раньше.
-При использовании кодов коррекции ошибок (EC):
+При использовании кодов коррекции ошибок (EC N+K):
 - Задержка чтения в 1 поток (T1Q1): 1.5 RTT + 1 чтение.
 - Запись+fsync в 1 поток:
  - С мгновенным сбросом: 3.5 RTT + 1 чтение + 2 записи.
  - С отложенным ("ленивым") сбросом: 5.5 RTT + 1 чтение + 2 записи + 2 fsync.
- Под 0.5 на самом деле подразумевается (k-1)/k, где k - число дисков данных,
+- Под 0.5 на самом деле подразумевается (N-1)/N, где N - число дисков данных,
  что означает, что дополнительное обращение по сети не нужно, когда операция
  чтения обслуживается локально.
- Параллельное чтение: сумма IOPS всех дисков либо производительность сети, если в сеть упрётся раньше.
+- Линейное чтение: сумма МБ/с чтения всех дисков, либо общая производительность сети, если в сеть упрётся раньше.
- Параллельная запись: сумма IOPS всех дисков / общее число дисков данных и чётности / WA либо производительность сети, если в сеть упрётся раньше.
+- Линейная запись: сумма МБ/с записи всех дисков * N/(N+K), либо производительность сети * N / (N+K), если в сеть упрётся раньше.
-  Примечание: IOPS дисков в данном случае надо брать в смешанном режиме чтения/записи в пропорции, аналогичной формулам выше.
+- Параллельное случайное мелкое чтение: сумма IOPS чтения всех дисков либо производительность сети, если в сеть упрётся раньше.
 - Параллельная случайная мелкая запись: грубо `(сумма IOPS / (N+K) / WA)`. Если точнее, то:
  сумма смешанного IOPS всех дисков при `(N-1)/(N*4 + K*5 + 1)*100 %` чтения, делённая на `(N*4 + K*5 + 1)`.
  Либо, производительность сети * N/(N+K), если в сеть упрётся раньше.
  - Например, при EC 2+1 это: `(сумма IOPS при 7% чтения) / 14`.
  - При EC 6+3 это: `(сумма IOPS при 12.5% чтения) / 40`.
 WA (мультипликатор записи) для 4 КБ блоков в Vitastor обычно составляет 3-5:
 1. Запись метаданных в журнал
--- a/docs/usage/fio.en.md
+++ b/docs/usage/fio.en.md
@ -14,10 +14,13 @@ Vitastor has a fio driver which can be installed from the package vitastor-fio.
 Use the following command as an example to run tests with fio against a Vitastor cluster:
 ```
-fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -image=testimg
+fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -image=testimg
 ```
 If you don't want to access your image by name, you can specify pool number, inode number and size
 (`-pool=1 -inode=1 -size=400G`) instead of the image name (`-image=testimg`).
-See exact fio commands to use for benchmarking [here](../performance/understanding.en.md#команды-fio).
+You can also specify etcd address(es) explicitly by adding `-etcd=10.115.0.10:2379/v3`, or you
 can override the configuration file path by adding `-conf=/etc/vitastor/vitastor.conf`.
 See exact fio commands to use for benchmarking [here](../performance/understanding.en.md#fio-commands).
--- a/docs/usage/fio.ru.md
+++ b/docs/usage/fio.ru.md
@ -14,10 +14,13 @@
 Используйте следующую команду как пример для запуска тестов кластера Vitastor через fio:
 ```
-fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -image=testimg
+fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -image=testimg
 ```
 Вместо обращения к образу по имени (`-image=testimg`) можно указать номер пула, номер инода и размер:
 `-pool=1 -inode=1 -size=400G`.
 Вы также можете задать адрес(а) подключения к etcd явно, добавив `-etcd=10.115.0.10:2379/v3`,
 или переопределить путь к файлу конфигурации, добавив `-conf=/etc/vitastor/vitastor.conf`.
 Конкретные команды fio для тестирования производительности можно посмотреть [здесь](../performance/understanding.ru.md#команды-fio).
--- a/docs/usage/nfs.en.md
+++ b/docs/usage/nfs.en.md
@ -34,7 +34,7 @@ vitastor-nfs [STANDARD OPTIONS] [OTHER OPTIONS]
 --foreground 1    stay in foreground, do not daemonize
 ```
-Example start and mount commands:
+Example start and mount commands (etcd_address is optional):
 ```
 vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
--- a/docs/usage/nfs.ru.md
+++ b/docs/usage/nfs.ru.md
@ -33,7 +33,7 @@ vitastor-nfs [СТАНДАРТНЫЕ ОПЦИИ] [ДРУГИЕ ОПЦИИ]
 --foreground 1    не уходить в фон после запуска
 ```
-Пример монтирования Vitastor через NFS:
+Пример монтирования Vitastor через NFS (etcd_address необязателен):
 ```
 vitastor-nfs --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
--- a/docs/usage/qemu.en.md
+++ b/docs/usage/qemu.en.md
@ -16,13 +16,16 @@ Old syntax (-drive):
 ```
 qemu-system-x86_64 -enable-kvm -m 1024 \
-    -drive 'file=vitastor:etcd_host=192.168.7.2\:2379/v3:image=debian9',
+    -drive 'file=vitastor:image=debian9',
        format=raw,if=none,id=drive-virtio-disk0,cache=none \
    -device 'virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,
        id=virtio-disk0,bootindex=1,write-cache=off' \
    -vnc 0.0.0.0:0
 ```
 Etcd address may be specified explicitly by adding `:etcd_host=192.168.7.2\:2379/v3` to `file=`.
 Configuration file path may be overridden by adding `:config_path=/etc/vitastor/vitastor.conf`.
 New syntax (-blockdev):
 ```
@ -50,12 +53,12 @@ You can also specify inode ID, pool and size manually instead of `:image=<IMAGE>
 ## qemu-img
-For qemu-img, you should use `vitastor:etcd_host=<HOST>:image=<IMAGE>` as filename.
+For qemu-img, you should use `vitastor:image=<IMAGE>[:etcd_host=<HOST>]` as filename.
 For example, to upload a VM image into Vitastor, run:
 ```
-qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=debian10'
+qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:image=debian10'
 ```
 You can also specify `:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>`
@ -72,10 +75,10 @@ the snapshot separately using the following commands (key points are using `skip
 `-B backing_file` option):
 ```
-qemu-img convert -f raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=testimg@0' \
+qemu-img convert -f raw 'vitastor:image=testimg@0' \
    -O qcow2 testimg_0.qcow2
-qemu-img convert -f raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=testimg:skip-parents=1' \
+qemu-img convert -f raw 'vitastor:image=testimg:skip-parents=1' \
    -O qcow2 -o 'cluster_size=4k' -B testimg_0.qcow2 testimg.qcow2
 ```
--- a/docs/usage/qemu.ru.md
+++ b/docs/usage/qemu.ru.md
@ -18,13 +18,16 @@
 ```
 qemu-system-x86_64 -enable-kvm -m 1024 \
-    -drive 'file=vitastor:etcd_host=192.168.7.2\:2379/v3:image=debian9',
+    -drive 'file=vitastor:image=debian9',
        format=raw,if=none,id=drive-virtio-disk0,cache=none \
    -device 'virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,
        id=virtio-disk0,bootindex=1,write-cache=off' \
    -vnc 0.0.0.0:0
 ```
 Адрес подключения etcd можно задать явно, если добавить `:etcd_host=192.168.7.2\:2379/v3` к `file=`.
 Путь к файлу конфигурации можно переопределить, добавив `:config_path=/etc/vitastor/vitastor.conf`.
 Новый синтаксис (-blockdev):
 ```
@ -52,12 +55,12 @@ qemu-system-x86_64 -enable-kvm -m 1024 \
 ## qemu-img
-Для qemu-img используйте строку `vitastor:etcd_host=<HOST>:image=<IMAGE>` в качестве имени файла диска.
+Для qemu-img используйте строку `vitastor:image=<IMAGE>[:etcd_host=<HOST>]` в качестве имени файла диска.
 Например, чтобы загрузить образ диска в Vitastor:
 ```
-qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:image=testimg'
+qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:image=testimg'
 ```
 Если вы не хотите обращаться к образу по имени, вместо `:image=<IMAGE>` можно указать номер пула, номер инода и размер:
@ -73,10 +76,10 @@ qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.
 с помощью следующих команд (ключевые моменты - использование `skip-parents=1` и опции `-B backing_file.qcow2`):
 ```
-qemu-img convert -f raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=testimg@0' \
+qemu-img convert -f raw 'vitastor:image=testimg@0' \
    -O qcow2 testimg_0.qcow2
-qemu-img convert -f raw 'vitastor:etcd_host=192.168.7.2\:2379/v3:image=testimg:skip-parents=1' \
+qemu-img convert -f raw 'vitastor:image=testimg:skip-parents=1' \
    -O qcow2 -o 'cluster_size=4k' -B testimg_0.qcow2 testimg.qcow2
 ```
--- a/mon/PGUtil.js
+++ b/mon/PGUtil.js
@ -3,6 +3,7 @@
 module.exports = {
    scale_pg_count,
    scale_pg_history,
 };
 function add_pg_history(new_pg_history, new_pg, prev_pgs, prev_pg_history, old_pg)
@ -43,16 +44,18 @@ function finish_pg_history(merged_history)
    merged_history.all_peers = Object.values(merged_history.all_peers);
 }
-function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
+function scale_pg_history(prev_pg_history, prev_pgs, new_pgs)
 {
-    const old_pg_count = real_prev_pgs.length;
+    const new_pg_history = [];
    const old_pg_count = prev_pgs.length;
    const new_pg_count = new_pgs.length;
    // Add all possibly intersecting PGs to the history of new PGs
    if (!(new_pg_count % old_pg_count))
    {
        // New PG count is a multiple of old PG count
        for (let i = 0; i < new_pg_count; i++)
        {
-            add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i % old_pg_count);
+            add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i % old_pg_count);
            finish_pg_history(new_pg_history[i]);
        }
    }
@ -64,7 +67,7 @@ function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history
        {
            for (let j = 0; j < mul; j++)
            {
-                add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i+j*new_pg_count);
+                add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i+j*new_pg_count);
            }
            finish_pg_history(new_pg_history[i]);
        }
@ -76,7 +79,7 @@ function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history
        let merged_history = {};
        for (let i = 0; i < old_pg_count; i++)
        {
-            add_pg_history(merged_history, 1, real_prev_pgs, prev_pg_history, i);
+            add_pg_history(merged_history, 1, prev_pgs, prev_pg_history, i);
        }
        finish_pg_history(merged_history[1]);
        for (let i = 0; i < new_pg_count; i++)
@ -89,6 +92,12 @@ function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history
    {
        new_pg_history[i] = null;
    }
    return new_pg_history;
 }
 function scale_pg_count(prev_pgs, new_pg_count)
 {
    const old_pg_count = prev_pgs.length;
    // Just for the lp_solve optimizer - pick a "previous" PG for each "new" one
    if (prev_pgs.length < new_pg_count)
    {
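The index arithmetic used by `scale_pg_history` in the hunks above can be summarized in a standalone sketch. `sourcePgs` is a hypothetical helper with 0-based PG indices; the condition of the second branch and the definition of `mul` are not visible in this diff, so they are assumptions here.

```javascript
// Sketch only: sourcePgs is a hypothetical helper using 0-based PG indices.
// Returns the old PGs whose history may contribute to the given new PG.
function sourcePgs(oldPgCount, newPgCount, newPg)
{
    if (newPgCount % oldPgCount === 0)
    {
        // New PG count is a multiple of the old one: one old PG per new PG
        return [ newPg % oldPgCount ];
    }
    if (oldPgCount % newPgCount === 0)
    {
        // Old PG count is a multiple of the new one: several old PGs per new PG.
        // This condition and mul are assumed, they are not shown in the diff.
        const mul = oldPgCount / newPgCount;
        const res = [];
        for (let j = 0; j < mul; j++)
            res.push(newPg + j*newPgCount);
        return res;
    }
    // Otherwise any old PG may intersect with any new one
    return Array.from({ length: oldPgCount }, (_, i) => i);
}
```

For example, when growing from 4 to 8 PGs, new PG 5 inherits history from old PG 1; when shrinking from 8 to 4, new PG 1 merges the histories of old PGs 1 and 5.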
--- a/mon/mon.js
+++ b/mon/mon.js
@ -55,10 +55,11 @@ const etcd_tree = {
            // etcd connection - configurable online
            etcd_address: "10.0.115.10:2379/v3",
            // mon
-            etcd_mon_ttl: 30, // min: 10
+            etcd_mon_ttl: 5, // min: 1
            etcd_mon_timeout: 1000, // ms. min: 0
            etcd_mon_retries: 5, // min: 0
            mon_change_timeout: 1000, // ms. min: 100
            mon_retry_change_timeout: 50, // ms. min: 50
            mon_stats_timeout: 1000, // ms. min: 100
            osd_out_time: 600, // seconds. min: 0
            placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
@ -91,7 +92,7 @@ const etcd_tree = {
            peer_connect_timeout: 5, // seconds. min: 1
            osd_idle_timeout: 5, // seconds. min: 1
            osd_ping_timeout: 5, // seconds. min: 1
-            up_wait_retry_interval: 500, // ms. min: 50
+            up_wait_retry_interval: 50, // ms. min: 10
            max_etcd_attempts: 5,
            etcd_quick_timeout: 1000, // ms
            etcd_slow_timeout: 5000, // ms
@ -112,12 +113,12 @@ const etcd_tree = {
            client_queue_depth: 128, // unused
            recovery_queue_depth: 1,
            recovery_sleep_us: 0,
-            recovery_tune_min_util: 0.1,
+            recovery_tune_util_low: 0.1,
-            recovery_tune_min_client_util: 0,
+            recovery_tune_client_util_low: 0,
-            recovery_tune_max_util: 1.0,
+            recovery_tune_util_high: 1.0,
-            recovery_tune_max_client_util: 0.5,
+            recovery_tune_client_util_high: 0.5,
            recovery_tune_interval: 1,
-            recovery_tune_ewma_rate: 0.5,
+            recovery_tune_agg_interval: 10, // 10 times recovery_tune_interval
            recovery_tune_sleep_min_us: 10, // 10 microseconds
            recovery_pg_switch: 128,
            recovery_sync_batch: 16,
@ -389,7 +390,8 @@ class Mon
 {
    constructor(config)
    {
-        this.die = (e) => this._die(e);
+        this.failconnect = (e) => this._die(e, 2);
        this.die = (e) => this._die(e, 1);
        if (fs.existsSync(config.config_path||'/etc/vitastor/vitastor.conf'))
        {
            config = {
@ -400,7 +402,7 @@ class Mon
        this.parse_etcd_addresses(config.etcd_address||config.etcd_url);
        this.verbose = config.verbose || 0;
        this.initConfig = config;
-        this.config = {};
+        this.config = { ...config };
        this.etcd_prefix = config.etcd_prefix || '/vitastor';
        this.etcd_prefix = this.etcd_prefix.replace(/\/\/+/g, '/').replace(/^\/?(.*[^\/])\/?$/, '/$1');
        this.etcd_start_timeout = (config.etcd_start_timeout || 5) * 1000;
@ -478,10 +480,10 @@ class Mon
    check_config()
    {
-        this.config.etcd_mon_ttl = Number(this.config.etcd_mon_ttl) || 30;
+        this.config.etcd_mon_ttl = Number(this.config.etcd_mon_ttl) || 5;
-        if (this.config.etcd_mon_ttl < 10)
+        if (this.config.etcd_mon_ttl < 1)
        {
-            this.config.etcd_mon_ttl = 10;
+            this.config.etcd_mon_ttl = 1;
        }
        this.config.etcd_mon_timeout = Number(this.config.etcd_mon_timeout) || 0;
        if (this.config.etcd_mon_timeout <= 0)
@ -498,6 +500,11 @@ class Mon
        {
            this.config.mon_change_timeout = 100;
        }
        this.config.mon_retry_change_timeout = Number(this.config.mon_retry_change_timeout) || 50;
        if (this.config.mon_retry_change_timeout < 50)
        {
            this.config.mon_retry_change_timeout = 50;
        }
        this.config.mon_stats_timeout = Number(this.config.mon_stats_timeout) || 1000;
        if (this.config.mon_stats_timeout < 100)
        {
@ -598,7 +605,7 @@ class Mon
        }
        if (!this.ws)
        {
-            this.die('Failed to open etcd watch websocket');
+            this.failconnect('Failed to open etcd watch websocket');
        }
        const cur_addr = this.selected_etcd_url;
        this.ws_alive = true;
@ -614,7 +621,7 @@ class Mon
                console.log('etcd websocket timed out, restarting it');
                this.restart_watcher(cur_addr);
            }
-        }, (Number(this.config.etcd_keepalive_interval) || 30)*1000);
+        }, (Number(this.config.etcd_ws_keepalive_interval) || 30)*1000);
        this.ws.on('error', () => this.restart_watcher(cur_addr));
        this.ws.send(JSON.stringify({
            create_request: {
@ -668,7 +675,12 @@ class Mon
                {
                    this.parse_kv(e.kv);
                    const key = e.kv.key.substr(this.etcd_prefix.length);
-                    if (key.substr(0, 11) == '/osd/stats/' || key.substr(0, 10) == '/pg/stats/' || key.substr(0, 16) == '/osd/inodestats/')
+                    if (key.substr(0, 11) == '/osd/state/')
+                    {
+                        stats_changed = true;
+                        changed = true;
+                    }
+                    else if (key.substr(0, 11) == '/osd/stats/' || key.substr(0, 10) == '/pg/stats/' || key.substr(0, 16) == '/osd/inodestats/')
                    {
                        stats_changed = true;
                    }
@ -785,9 +797,9 @@ class Mon
            const res = await this.etcd_call('/lease/keepalive', { ID: this.etcd_lease_id }, this.config.etcd_mon_timeout, this.config.etcd_mon_retries);
            if (!res.result.TTL)
            {
-                this.die('Lease expired');
+                this.failconnect('Lease expired');
            }
-        }, this.config.etcd_mon_timeout);
+        }, this.config.etcd_mon_ttl*1000);
        if (!this.signals_set)
        {
            process.on('SIGINT', this.on_stop_cb);
@ -1230,6 +1242,89 @@ class Mon
        return aff_osds;
    }
+    async generate_pool_pgs(pool_id, osd_tree, levels)
+    {
+        const pool_cfg = this.state.config.pools[pool_id];
+        if (!this.validate_pool_cfg(pool_id, pool_cfg, false))
+        {
+            return null;
+        }
+        let pool_tree = osd_tree[pool_cfg.root_node || ''];
+        pool_tree = pool_tree ? pool_tree.children : [];
+        pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
+        this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
+        this.filter_osds_by_block_layout(
+            pool_tree,
+            pool_cfg.block_size || this.config.block_size || 131072,
+            pool_cfg.bitmap_granularity || this.config.bitmap_granularity || 4096,
+            pool_cfg.immediate_commit || this.config.immediate_commit || 'none'
+        );
+        // First try last_clean_pgs to minimize data movement
+        let prev_pgs = [];
+        for (const pg in ((this.state.history.last_clean_pgs.items||{})[pool_id]||{}))
+        {
+            prev_pgs[pg-1] = [ ...this.state.history.last_clean_pgs.items[pool_id][pg].osd_set ];
+        }
+        if (!prev_pgs.length)
+        {
+            // Fall back to config/pgs if it's empty
+            for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
+            {
+                prev_pgs[pg-1] = [ ...this.state.config.pgs.items[pool_id][pg].osd_set ];
+            }
+        }
+        const old_pg_count = prev_pgs.length;
+        const optimize_cfg = {
+            osd_tree: pool_tree,
+            pg_count: pool_cfg.pg_count,
+            pg_size: pool_cfg.pg_size,
+            pg_minsize: pool_cfg.pg_minsize,
+            max_combinations: pool_cfg.max_osd_combinations,
+            ordered: pool_cfg.scheme != 'replicated',
+        };
+        let optimize_result;
+        // Re-shuffle PGs if config/pgs.hash is empty
+        if (old_pg_count > 0 && this.state.config.pgs.hash)
+        {
+            if (prev_pgs.length != pool_cfg.pg_count)
+            {
+                // Scale PG count
+                // Do it even if old_pg_count is already equal to pool_cfg.pg_count,
+                // because last_clean_pgs may still contain the old number of PGs
+                PGUtil.scale_pg_count(prev_pgs, pool_cfg.pg_count);
+            }
+            for (const pg of prev_pgs)
+            {
+                while (pg.length < pool_cfg.pg_size)
+                {
+                    pg.push(0);
+                }
+            }
+            optimize_result = await LPOptimizer.optimize_change({
+                prev_pgs,
+                ...optimize_cfg,
+            });
+        }
+        else
+        {
+            optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
+        }
+        console.log(`Pool ${pool_id} (${pool_cfg.name || 'unnamed'}):`);
+        LPOptimizer.print_change_stats(optimize_result);
+        const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
+        return {
+            pool_id,
+            pgs: optimize_result.int_pgs,
+            stats: {
+                total_raw_tb: optimize_result.space,
+                pg_real_size: pg_effsize || pool_cfg.pg_size,
+                raw_to_usable: (pg_effsize || pool_cfg.pg_size) / (pool_cfg.scheme === 'replicated'
+                    ? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0))),
+                space_efficiency: optimize_result.space/(optimize_result.total_space||1),
+            },
+        };
+    }
    async recheck_pgs()
    {
        if (this.recheck_pgs_active)
@ -1244,158 +1339,47 @@ class Mon
        const { up_osds, levels, osd_tree } = this.get_osd_tree();
        const tree_cfg = {
            osd_tree,
            levels,
            pools: this.state.config.pools,
        };
        const tree_hash = sha1hex(stableStringify(tree_cfg));
        if (this.state.config.pgs.hash != tree_hash)
        {
            // Something has changed
-            const new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
-            const etcd_request = { compare: [], success: [] };
-            for (const pool_id in (this.state.config.pgs||{}).items||{})
-            {
-                if (!this.state.config.pools[pool_id])
-                {
-                    // Pool deleted. Delete all PGs, but first stop them.
-                    if (!await this.stop_all_pgs(pool_id))
-                    {
-                        this.recheck_pgs_active = false;
-                        this.schedule_recheck();
-                        return;
-                    }
-                    const prev_pgs = [];
-                    for (const pg in this.state.config.pgs.items[pool_id]||{})
-                    {
-                        prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
-                    }
-                    // Also delete pool statistics
-                    etcd_request.success.push({ requestDeleteRange: {
-                        key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
-                    } });
-                    this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
-                }
-            }
-            for (const pool_id in this.state.config.pools)
-            {
-                const pool_cfg = this.state.config.pools[pool_id];
-                if (!this.validate_pool_cfg(pool_id, pool_cfg, false))
-                {
-                    continue;
-                }
-                let pool_tree = osd_tree[pool_cfg.root_node || ''];
-                pool_tree = pool_tree ? pool_tree.children : [];
-                pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
-                this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
-                this.filter_osds_by_block_layout(
-                    pool_tree,
-                    pool_cfg.block_size || this.config.block_size || 131072,
-                    pool_cfg.bitmap_granularity || this.config.bitmap_granularity || 4096,
-                    pool_cfg.immediate_commit || this.config.immediate_commit || 'none'
-                );
-                // These are for the purpose of building history.osd_sets
-                const real_prev_pgs = [];
-                let pg_history = [];
-                for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
-                {
-                    real_prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
-                    if (this.state.pg.history[pool_id] &&
-                        this.state.pg.history[pool_id][pg])
-                    {
-                        pg_history[pg-1] = this.state.pg.history[pool_id][pg];
-                    }
-                }
-                // And these are for the purpose of minimizing data movement
-                let prev_pgs = [];
-                for (const pg in ((this.state.history.last_clean_pgs.items||{})[pool_id]||{}))
-                {
-                    prev_pgs[pg-1] = this.state.history.last_clean_pgs.items[pool_id][pg].osd_set;
-                }
-                prev_pgs = JSON.parse(JSON.stringify(prev_pgs.length ? prev_pgs : real_prev_pgs));
-                const old_pg_count = real_prev_pgs.length;
-                const optimize_cfg = {
-                    osd_tree: pool_tree,
-                    pg_count: pool_cfg.pg_count,
-                    pg_size: pool_cfg.pg_size,
-                    pg_minsize: pool_cfg.pg_minsize,
-                    max_combinations: pool_cfg.max_osd_combinations,
-                    ordered: pool_cfg.scheme != 'replicated',
-                };
-                let optimize_result;
-                if (old_pg_count > 0)
-                {
-                    if (old_pg_count != pool_cfg.pg_count)
-                    {
-                        // PG count changed. Need to bring all PGs down.
-                        if (!await this.stop_all_pgs(pool_id))
-                        {
-                            this.recheck_pgs_active = false;
-                            this.schedule_recheck();
-                            return;
-                        }
-                    }
-                    if (prev_pgs.length != pool_cfg.pg_count)
-                    {
-                        // Scale PG count
-                        // Do it even if old_pg_count is already equal to pool_cfg.pg_count,
-                        // because last_clean_pgs may still contain the old number of PGs
-                        const new_pg_history = [];
-                        PGUtil.scale_pg_count(prev_pgs, real_prev_pgs, pg_history, new_pg_history, pool_cfg.pg_count);
-                        pg_history = new_pg_history;
-                    }
-                    for (const pg of prev_pgs)
-                    {
-                        while (pg.length < pool_cfg.pg_size)
-                        {
-                            pg.push(0);
-                        }
-                    }
-                    if (!this.state.config.pgs.hash)
-                    {
-                        // Re-shuffle PGs
-                        optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
-                    }
-                    else
-                    {
-                        optimize_result = await LPOptimizer.optimize_change({
-                            prev_pgs,
-                            ...optimize_cfg,
-                        });
-                    }
-                }
-                else
-                {
-                    optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
-                }
-                if (old_pg_count != optimize_result.int_pgs.length)
-                {
-                    console.log(
-                        `PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+
-                        ` changed from: ${old_pg_count} to ${optimize_result.int_pgs.length}`
-                    );
-                    // Drop stats
-                    etcd_request.success.push({ requestDeleteRange: {
-                        key: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'/'),
-                        range_end: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'0'),
-                    } });
-                }
-                LPOptimizer.print_change_stats(optimize_result);
-                const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
-                this.state.pool.stats[pool_id] = {
-                    used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
-                    total_raw_tb: optimize_result.space,
-                    pg_real_size: pg_effsize || pool_cfg.pg_size,
-                    raw_to_usable: (pg_effsize || pool_cfg.pg_size) / (pool_cfg.scheme === 'replicated'
-                        ? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0))),
-                    space_efficiency: optimize_result.space/(optimize_result.total_space||1),
-                };
-                etcd_request.success.push({ requestPut: {
-                    key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
-                    value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
-                } });
-                this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, optimize_result.int_pgs, pg_history);
-            }
-            new_config_pgs.hash = tree_hash;
-            await this.save_pg_config(new_config_pgs, etcd_request);
+            console.log('Pool configuration or OSD tree changed, re-optimizing');
+            // First re-optimize PGs, but don't look at history yet
+            const optimize_results = await Promise.all(Object.keys(this.state.config.pools)
+                .map(pool_id => this.generate_pool_pgs(pool_id, osd_tree, levels)));
+            // Then apply the modification in the form of an optimistic transaction,
+            // each time considering new pg/history modifications (OSDs modify it during rebalance)
+            while (!await this.apply_pool_pgs(optimize_results, up_osds, osd_tree, tree_hash))
+            {
+                console.log(
+                    'Someone changed PG configuration while we also tried to change it.'+
+                    ' Retrying in '+this.config.mon_retry_change_timeout+' ms'
+                );
+                // Failed to apply - parallel change detected. Wait a bit and retry
+                const old_rev = this.etcd_watch_revision;
+                while (this.etcd_watch_revision === old_rev)
+                {
+                    await new Promise(ok => setTimeout(ok, this.config.mon_retry_change_timeout));
+                }
+                const new_ot = this.get_osd_tree();
+                const new_tcfg = {
+                    osd_tree: new_ot.osd_tree,
+                    levels: new_ot.levels,
+                    pools: this.state.config.pools,
+                };
+                if (sha1hex(stableStringify(new_tcfg)) !== tree_hash)
+                {
+                    // Configuration actually changed, restart from the beginning
+                    this.recheck_pgs_active = false;
+                    setImmediate(() => this.recheck_pgs().catch(this.die));
+                    return;
+                }
+                // Configuration didn't change, PG history probably changed, so just retry
+            }
+            console.log('PG configuration successfully changed');
        }
        else
        {
@ -1436,12 +1420,97 @@ class Mon
            }
            if (changed)
            {
-                await this.save_pg_config(new_config_pgs);
+                const ok = await this.save_pg_config(new_config_pgs);
+                if (ok)
+                    console.log('PG configuration successfully changed');
+                else
+                {
+                    console.log('Someone changed PG configuration while we also tried to change it. Retrying in '+this.config.mon_change_timeout+' ms');
+                    this.schedule_recheck();
+                }
            }
        }
        this.recheck_pgs_active = false;
    }
+    async apply_pool_pgs(results, up_osds, osd_tree, tree_hash)
+    {
+        for (const pool_id in (this.state.config.pgs||{}).items||{})
+        {
+            // We should stop all PGs when deleting a pool or changing its PG count
+            if (!this.state.config.pools[pool_id] ||
+                this.state.config.pgs.items[pool_id] && this.state.config.pools[pool_id].pg_count !=
+                Object.keys(this.state.config.pgs.items[pool_id]).reduce((a, c) => (a < (0|c) ? (0|c) : a), 0))
+            {
+                if (!await this.stop_all_pgs(pool_id))
+                {
+                    return false;
+                }
+            }
+        }
+        const new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
+        const etcd_request = { compare: [], success: [] };
+        for (const pool_id in (new_config_pgs||{}).items||{})
+        {
+            if (!this.state.config.pools[pool_id])
+            {
+                const prev_pgs = [];
+                for (const pg in new_config_pgs.items[pool_id]||{})
+                {
+                    prev_pgs[pg-1] = new_config_pgs.items[pool_id][pg].osd_set;
+                }
+                // Also delete pool statistics
+                etcd_request.success.push({ requestDeleteRange: {
+                    key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
+                } });
+                this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
+            }
+        }
+        for (const pool_res of results)
+        {
+            const pool_id = pool_res.pool_id;
+            const pool_cfg = this.state.config.pools[pool_id];
+            let pg_history = [];
+            for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
+            {
+                if (this.state.pg.history[pool_id] &&
+                    this.state.pg.history[pool_id][pg])
+                {
+                    pg_history[pg-1] = this.state.pg.history[pool_id][pg];
+                }
+            }
+            const real_prev_pgs = [];
+            for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
+            {
+                real_prev_pgs[pg-1] = [ ...this.state.config.pgs.items[pool_id][pg].osd_set ];
+            }
+            if (real_prev_pgs.length > 0 && real_prev_pgs.length != pool_res.pgs.length)
+            {
+                console.log(
+                    `Changing PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+
+                    ` from: ${real_prev_pgs.length} to ${pool_res.pgs.length}`
+                );
+                pg_history = PGUtil.scale_pg_history(pg_history, real_prev_pgs, pool_res.pgs);
+                // Drop stats
+                etcd_request.success.push({ requestDeleteRange: {
+                    key: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'/'),
+                    range_end: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'0'),
+                } });
+            }
+            const stats = {
+                used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
+                ...pool_res.stats,
+            };
+            etcd_request.success.push({ requestPut: {
+                key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
+                value: b64(JSON.stringify(stats)),
+            } });
+            this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, pool_res.pgs, pg_history);
+        }
+        new_config_pgs.hash = tree_hash;
+        return await this.save_pg_config(new_config_pgs, etcd_request);
+    }
    async save_pg_config(new_config_pgs, etcd_request = { compare: [], success: [] })
    {
        etcd_request.compare.push(
@ -1451,14 +1520,8 @@ class Mon
        etcd_request.success.push(
            { requestPut: { key: b64(this.etcd_prefix+'/config/pgs'), value: b64(JSON.stringify(new_config_pgs)) } },
        );
-        const res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0);
-        if (!res.succeeded)
-        {
-            console.log('Someone changed PG configuration while we also tried to change it. Retrying in '+this.config.mon_change_timeout+' ms');
-            this.schedule_recheck();
-            return;
-        }
-        console.log('PG configuration successfully changed');
+        const txn_res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0);
+        return txn_res.succeeded;
    }
    // Schedule next recheck at least at <unixtime>
@ -1577,9 +1640,13 @@ class Mon
        }
        const sum_diff = { op_stats: {}, subop_stats: {}, recovery_stats: {} };
        // Sum derived values instead of deriving summed
-        for (const osd in this.state.osd.stats)
+        for (const osd in this.state.osd.state)
        {
            const derived = this.prev_stats.osd_diff[osd];
+            if (!this.state.osd.state[osd] || !derived)
+            {
+                continue;
+            }
            for (const type in sum_diff)
            {
                for (const op in derived[type]||{})
@ -1680,9 +1747,13 @@ class Mon
            const used = this.state.pool.stats[pool_id].used_raw_tb;
            this.state.pool.stats[pool_id].used_raw_tb = Number(used)/1024/1024/1024/1024;
        }
-        for (const osd_num in this.state.osd.inodestats)
+        for (const osd_num in this.state.osd.state)
        {
            const ist = this.state.osd.inodestats[osd_num];
+            if (!ist || !this.state.osd.state[osd_num])
+            {
+                continue;
+            }
            for (const pool_id in ist)
            {
                inode_stats[pool_id] = inode_stats[pool_id] || {};
@ -1698,9 +1769,14 @@ class Mon
                }
            }
        }
-        for (const osd in this.prev_stats.osd_diff)
+        for (const osd in this.state.osd.state)
        {
-            for (const pool_id in this.prev_stats.osd_diff[osd].inode_stats)
+            const osd_diff = this.prev_stats.osd_diff[osd];
+            if (!osd_diff || !this.state.osd.state[osd])
+            {
+                continue;
+            }
+            for (const pool_id in osd_diff.inode_stats)
            {
                for (const inode_num in this.prev_stats.osd_diff[osd].inode_stats[pool_id])
                {
@ -1940,14 +2016,14 @@ class Mon
                return res.json;
            }
        }
-        this.die();
+        this.failconnect();
    }
-    _die(err)
+    _die(err, code)
    {
        // In fact we can just try to rejoin
        console.error(new Error(err || 'Cluster connection failed'));
-        process.exit(1);
+        process.exit(code || 2);
    }
    local_ips(all)
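Review note: the optimistic-transaction retry loop that this change introduces in recheck_pgs() can be sketched in isolation as follows. This is only an illustrative sketch, not the monitor's code: `FakeStore`, `apply_with_retry`, `compute` and `max_attempts` are hypothetical names, the in-memory store merely stands in for an etcd compare-and-swap transaction, and the attempt limit is an assumption added for the example (the loop in the diff has no such bound).

```javascript
// Hypothetical in-memory stand-in for etcd: a value guarded by a revision counter.
class FakeStore
{
    constructor(value)
    {
        this.value = value;
        this.revision = 1;
    }

    // Compare-and-swap: write only if nobody changed the value since expected_rev.
    txn(expected_rev, new_value)
    {
        if (this.revision !== expected_rev)
        {
            return false;
        }
        this.value = new_value;
        this.revision++;
        return true;
    }
}

// Retry the change until the transaction succeeds, re-reading the state each time.
// Returns the number of attempts that were needed.
async function apply_with_retry(store, compute, retry_ms = 50, max_attempts = 10)
{
    for (let attempt = 1; attempt <= max_attempts; attempt++)
    {
        const rev = store.revision;
        const new_value = await compute(store.value);
        if (store.txn(rev, new_value))
        {
            return attempt;
        }
        // Parallel change detected: wait a bit and retry with fresh state
        await new Promise(ok => setTimeout(ok, retry_ms));
    }
    throw new Error('Too many conflicting changes');
}
```

The point of the pattern is that the expensive computation runs without any lock, and the result is discarded and recomputed only when the guarded revision moved in the meantime.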
--- a/mon/package.json
+++ b/mon/package.json
@ -1,6 +1,6 @@
 {
  "name": "vitastor-mon",
-  "version": "1.3.1",
+  "version": "1.4.4",
  "description": "Vitastor SDS monitor service",
  "main": "mon-main.js",
  "scripts": {
--- a/mon/vitastor-osd@.service
+++ b/mon/vitastor-osd@.service
@ -8,7 +8,9 @@ PartOf=vitastor.target
 LimitNOFILE=1048576
 LimitNPROC=1048576
 LimitMEMLOCK=infinity
-ExecStart=bash -c 'exec vitastor-disk exec-osd /dev/vitastor/osd%i-data >>/var/log/vitastor/osd%i.log 2>&1'
+# Use the following for direct logs to files
+#ExecStart=bash -c 'exec vitastor-disk exec-osd /dev/vitastor/osd%i-data >>/var/log/vitastor/osd%i.log 2>&1'
+ExecStart=vitastor-disk exec-osd /dev/vitastor/osd%i-data
 ExecStartPre=+vitastor-disk pre-exec /dev/vitastor/osd%i-data
 WorkingDirectory=/
 User=vitastor
--- a/patches/VitastorPlugin.pm
+++ b/patches/VitastorPlugin.pm
@ -110,7 +110,6 @@ sub properties
        vitastor_etcd_address => {
            description => 'IP address(es) of etcd.',
            type => 'string',
-            format => 'pve-storage-portal-dns-list',
        },
        vitastor_etcd_prefix => {
            description => 'Prefix for Vitastor etcd metadata',
--- a/patches/cinder-vitastor.py
+++ b/patches/cinder-vitastor.py
@ -50,7 +50,7 @@ from cinder.volume import configuration
 from cinder.volume import driver
 from cinder.volume import volume_utils
-VERSION = '1.3.1'
+VERSION = '1.4.4'
 LOG = logging.getLogger(__name__)
--- a/patches/libvirt-9.10-vitastor.diff
+++ b/patches/libvirt-9.10-vitastor.diff
@ -0,0 +1,643 @@
 commit c1cd026e211e94b120028e7c98a6e4ce5afe9846
 Author: Vitaliy Filippov <vitalif@yourcmc.ru>
 Date:   Wed Jan 24 22:04:50 2024 +0300
    Add Vitastor support
 diff --git a/include/libvirt/libvirt-storage.h b/include/libvirt/libvirt-storage.h
 index aaad4a3da1..5f5daa8341 100644
 --- a/include/libvirt/libvirt-storage.h
 +++ b/include/libvirt/libvirt-storage.h
@@ -326,6 +326,7 @@ typedef enum {
     VIR_CONNECT_LIST_STORAGE_POOLS_ZFS           = 1 << 17, /* (Since: 1.2.8) */
     VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE      = 1 << 18, /* (Since: 3.1.0) */
     VIR_CONNECT_LIST_STORAGE_POOLS_ISCSI_DIRECT  = 1 << 19, /* (Since: 5.6.0) */
 +    VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR      = 1 << 20, /* (Since: 5.0.0) */
 } virConnectListAllStoragePoolsFlags;
 int                     virConnectListAllStoragePools(virConnectPtr conn,
 diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
 index 22ad43e1d7..56c81d6852 100644
 --- a/src/conf/domain_conf.c
 +++ b/src/conf/domain_conf.c
@@ -7185,7 +7185,8 @@ virDomainDiskSourceNetworkParse(xmlNodePtr node,
     src->configFile = virXPathString("string(./config/@file)", ctxt);
     if (src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTP ||
 -        src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS)
 +        src->protocol == VIR_STORAGE_NET_PROTOCOL_HTTPS ||
 +        src->protocol == VIR_STORAGE_NET_PROTOCOL_VITASTOR)
         src->query = virXMLPropString(node, "query");
     if (virDomainStorageNetworkParseHosts(node, ctxt, &src->hosts, &src->nhosts) < 0)
@@ -30618,6 +30619,7 @@ virDomainStorageSourceTranslateSourcePool(virStorageSource *src,
     case VIR_STORAGE_POOL_MPATH:
     case VIR_STORAGE_POOL_RBD:
 +    case VIR_STORAGE_POOL_VITASTOR:
     case VIR_STORAGE_POOL_SHEEPDOG:
     case VIR_STORAGE_POOL_GLUSTER:
     case VIR_STORAGE_POOL_LAST:
 diff --git a/src/conf/domain_validate.c b/src/conf/domain_validate.c
 index c72108886e..c739ed6c43 100644
 --- a/src/conf/domain_validate.c
 +++ b/src/conf/domain_validate.c
@@ -495,6 +495,7 @@ virDomainDiskDefValidateSourceChainOne(const virStorageSource *src)
         case VIR_STORAGE_NET_PROTOCOL_RBD:
             break;
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
         case VIR_STORAGE_NET_PROTOCOL_NBD:
         case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
         case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
@@ -541,7 +542,7 @@ virDomainDiskDefValidateSourceChainOne(const virStorageSource *src)
         }
     }
 -    /* internal snapshots and config files are currently supported only with rbd: */
 +    /* internal snapshots are currently supported only with rbd: */
     if (virStorageSourceGetActualType(src) != VIR_STORAGE_TYPE_NETWORK &&
         src->protocol != VIR_STORAGE_NET_PROTOCOL_RBD) {
         if (src->snapshot) {
@@ -549,10 +550,15 @@ virDomainDiskDefValidateSourceChainOne(const virStorageSource *src)
                            _("<snapshot> element is currently supported only with 'rbd' disks"));
             return -1;
         }
 +    }
 +    /* config files are currently supported only with rbd and vitastor: */
 +    if (virStorageSourceGetActualType(src) != VIR_STORAGE_TYPE_NETWORK &&
 +        src->protocol != VIR_STORAGE_NET_PROTOCOL_RBD &&
 +        src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR) {
         if (src->configFile) {
             virReportError(VIR_ERR_XML_ERROR, "%s",
 -                           _("<config> element is currently supported only with 'rbd' disks"));
 +                           _("<config> element is currently supported only with 'rbd' and 'vitastor' disks"));
             return -1;
         }
     }
 diff --git a/src/conf/schemas/domaincommon.rng b/src/conf/schemas/domaincommon.rng
 index b98a2ae602..7d7a872e01 100644
 --- a/src/conf/schemas/domaincommon.rng
 +++ b/src/conf/schemas/domaincommon.rng
@@ -1997,6 +1997,35 @@
     </element>
   </define>
 +  <define name="diskSourceNetworkProtocolVitastor">
 +    <element name="source">
 +      <interleave>
 +        <attribute name="protocol">
 +          <value>vitastor</value>
 +        </attribute>
 +        <ref name="diskSourceCommon"/>
 +        <optional>
 +          <attribute name="name"/>
 +        </optional>
 +        <optional>
 +          <attribute name="query"/>
 +        </optional>
 +        <zeroOrMore>
 +          <ref name="diskSourceNetworkHost"/>
 +        </zeroOrMore>
 +        <optional>
 +          <element name="config">
 +            <attribute name="file">
 +              <ref name="absFilePath"/>
 +            </attribute>
 +            <empty/>
 +          </element>
 +        </optional>
 +        <empty/>
 +      </interleave>
 +    </element>
 +  </define>
 +
   <define name="diskSourceNetworkProtocolISCSI">
     <element name="source">
       <attribute name="protocol">
@@ -2347,6 +2376,7 @@
       <ref name="diskSourceNetworkProtocolSimple"/>
       <ref name="diskSourceNetworkProtocolVxHS"/>
       <ref name="diskSourceNetworkProtocolNFS"/>
 +      <ref name="diskSourceNetworkProtocolVitastor"/>
     </choice>
   </define>
 diff --git a/src/conf/storage_conf.c b/src/conf/storage_conf.c
 index 68842004b7..1d69a788b6 100644
 --- a/src/conf/storage_conf.c
 +++ b/src/conf/storage_conf.c
@@ -56,7 +56,7 @@ VIR_ENUM_IMPL(virStoragePool,
               "logical", "disk", "iscsi",
               "iscsi-direct", "scsi", "mpath",
               "rbd", "sheepdog", "gluster",
 -              "zfs", "vstorage",
 +              "zfs", "vstorage", "vitastor",
 );
 VIR_ENUM_IMPL(virStoragePoolFormatFileSystem,
@@ -242,6 +242,18 @@ static virStoragePoolTypeInfo poolTypeInfo[] = {
           .formatToString = virStorageFileFormatTypeToString,
       }
     },
 +    {.poolType = VIR_STORAGE_POOL_VITASTOR,
 +     .poolOptions = {
 +         .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
 +                   VIR_STORAGE_POOL_SOURCE_NETWORK |
 +                   VIR_STORAGE_POOL_SOURCE_NAME),
 +      },
 +      .volOptions = {
 +          .defaultFormat = VIR_STORAGE_FILE_RAW,
 +          .formatFromString = virStorageVolumeFormatFromString,
 +          .formatToString = virStorageFileFormatTypeToString,
 +      }
 +    },
     {.poolType = VIR_STORAGE_POOL_SHEEPDOG,
      .poolOptions = {
          .flags = (VIR_STORAGE_POOL_SOURCE_HOST |
@@ -538,6 +550,11 @@ virStoragePoolDefParseSource(xmlXPathContextPtr ctxt,
                        _("element 'name' is mandatory for RBD pool"));
         return -1;
     }
 +    if (pool_type == VIR_STORAGE_POOL_VITASTOR && source->name == NULL) {
 +        virReportError(VIR_ERR_XML_ERROR, "%s",
 +                       _("element 'name' is mandatory for Vitastor pool"));
 +        return -1;
 +    }
     if (options->formatFromString) {
         g_autofree char *format = NULL;
@@ -1127,6 +1144,7 @@ virStoragePoolDefFormatBuf(virBuffer *buf,
     /* RBD, Sheepdog, Gluster and Iscsi-direct devices are not local block devs nor
      * files, so they don't have a target */
     if (def->type != VIR_STORAGE_POOL_RBD &&
 +        def->type != VIR_STORAGE_POOL_VITASTOR &&
         def->type != VIR_STORAGE_POOL_SHEEPDOG &&
         def->type != VIR_STORAGE_POOL_GLUSTER &&
         def->type != VIR_STORAGE_POOL_ISCSI_DIRECT) {
 diff --git a/src/conf/storage_conf.h b/src/conf/storage_conf.h
 index fc67957cfe..720c07ef74 100644
 --- a/src/conf/storage_conf.h
 +++ b/src/conf/storage_conf.h
@@ -103,6 +103,7 @@ typedef enum {
     VIR_STORAGE_POOL_GLUSTER,  /* Gluster device */
     VIR_STORAGE_POOL_ZFS,      /* ZFS */
     VIR_STORAGE_POOL_VSTORAGE, /* Virtuozzo Storage */
 +    VIR_STORAGE_POOL_VITASTOR, /* Vitastor */
     VIR_STORAGE_POOL_LAST,
 } virStoragePoolType;
@@ -454,6 +455,7 @@ VIR_ENUM_DECL(virStoragePartedFs);
                  VIR_CONNECT_LIST_STORAGE_POOLS_SCSI     | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_MPATH    | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_RBD      | \
 +                 VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER  | \
                  VIR_CONNECT_LIST_STORAGE_POOLS_ZFS      | \
 diff --git a/src/conf/storage_source_conf.c b/src/conf/storage_source_conf.c
 index f974a521b1..cd394d0a9f 100644
 --- a/src/conf/storage_source_conf.c
 +++ b/src/conf/storage_source_conf.c
@@ -88,6 +88,7 @@ VIR_ENUM_IMPL(virStorageNetProtocol,
               "ssh",
               "vxhs",
               "nfs",
 +              "vitastor",
 );
@@ -1301,6 +1302,7 @@ virStorageSourceNetworkDefaultPort(virStorageNetProtocol protocol)
         case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
             return 24007;
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
         case VIR_STORAGE_NET_PROTOCOL_RBD:
             /* we don't provide a default for RBD */
             return 0;
 diff --git a/src/conf/storage_source_conf.h b/src/conf/storage_source_conf.h
 index 5e7d127453..283709eeb3 100644
 --- a/src/conf/storage_source_conf.h
 +++ b/src/conf/storage_source_conf.h
@@ -129,6 +129,7 @@ typedef enum {
     VIR_STORAGE_NET_PROTOCOL_SSH,
     VIR_STORAGE_NET_PROTOCOL_VXHS,
     VIR_STORAGE_NET_PROTOCOL_NFS,
 +    VIR_STORAGE_NET_PROTOCOL_VITASTOR,
     VIR_STORAGE_NET_PROTOCOL_LAST
 } virStorageNetProtocol;
 diff --git a/src/conf/virstorageobj.c b/src/conf/virstorageobj.c
 index 59fa5da372..4739167f5f 100644
 --- a/src/conf/virstorageobj.c
 +++ b/src/conf/virstorageobj.c
@@ -1438,6 +1438,7 @@ virStoragePoolObjSourceFindDuplicateCb(const void *payload,
             return 1;
         break;
 +    case VIR_STORAGE_POOL_VITASTOR:
     case VIR_STORAGE_POOL_ISCSI_DIRECT:
     case VIR_STORAGE_POOL_RBD:
     case VIR_STORAGE_POOL_LAST:
@@ -1921,6 +1922,8 @@ virStoragePoolObjMatch(virStoragePoolObj *obj,
                (obj->def->type == VIR_STORAGE_POOL_MPATH))   ||
               (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_RBD) &&
                (obj->def->type == VIR_STORAGE_POOL_RBD))     ||
 +              (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR) &&
 +               (obj->def->type == VIR_STORAGE_POOL_VITASTOR)) ||
               (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG) &&
                (obj->def->type == VIR_STORAGE_POOL_SHEEPDOG)) ||
               (MATCH(VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER) &&
 diff --git a/src/libvirt-storage.c b/src/libvirt-storage.c
 index db7660aac4..561df34709 100644
 --- a/src/libvirt-storage.c
 +++ b/src/libvirt-storage.c
@@ -94,6 +94,7 @@ virStoragePoolGetConnect(virStoragePoolPtr pool)
  * VIR_CONNECT_LIST_STORAGE_POOLS_SCSI
  * VIR_CONNECT_LIST_STORAGE_POOLS_MPATH
  * VIR_CONNECT_LIST_STORAGE_POOLS_RBD
 + * VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR
  * VIR_CONNECT_LIST_STORAGE_POOLS_SHEEPDOG
  * VIR_CONNECT_LIST_STORAGE_POOLS_GLUSTER
  * VIR_CONNECT_LIST_STORAGE_POOLS_ZFS
 diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
 index 62e1be6672..71a1d42896 100644
 --- a/src/libxl/libxl_conf.c
 +++ b/src/libxl/libxl_conf.c
@@ -979,6 +979,7 @@ libxlMakeNetworkDiskSrcStr(virStorageSource *src,
     case VIR_STORAGE_NET_PROTOCOL_SSH:
     case VIR_STORAGE_NET_PROTOCOL_VXHS:
     case VIR_STORAGE_NET_PROTOCOL_NFS:
 +    case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
     case VIR_STORAGE_NET_PROTOCOL_LAST:
     case VIR_STORAGE_NET_PROTOCOL_NONE:
         virReportError(VIR_ERR_NO_SUPPORT,
 diff --git a/src/libxl/xen_xl.c b/src/libxl/xen_xl.c
 index f175359307..8efcf4c329 100644
 --- a/src/libxl/xen_xl.c
 +++ b/src/libxl/xen_xl.c
@@ -1456,6 +1456,7 @@ xenFormatXLDiskSrcNet(virStorageSource *src)
     case VIR_STORAGE_NET_PROTOCOL_SSH:
     case VIR_STORAGE_NET_PROTOCOL_VXHS:
     case VIR_STORAGE_NET_PROTOCOL_NFS:
 +    case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
     case VIR_STORAGE_NET_PROTOCOL_LAST:
     case VIR_STORAGE_NET_PROTOCOL_NONE:
         virReportError(VIR_ERR_NO_SUPPORT,
 diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
 index 7e9daf0bdc..825b4a3006 100644
 --- a/src/qemu/qemu_block.c
 +++ b/src/qemu/qemu_block.c
@@ -758,6 +758,38 @@ qemuBlockStorageSourceGetRBDProps(virStorageSource *src,
 }
 +static virJSONValue *
 +qemuBlockStorageSourceGetVitastorProps(virStorageSource *src)
 +{
 +    virJSONValue *ret = NULL;
 +    virStorageNetHostDef *host;
 +    size_t i;
 +    g_auto(virBuffer) buf = VIR_BUFFER_INITIALIZER;
 +    g_autofree char *etcd = NULL;
 +
 +    for (i = 0; i < src->nhosts; i++) {
 +        host = src->hosts + i;
 +        if ((virStorageNetHostTransport)host->transport != VIR_STORAGE_NET_HOST_TRANS_TCP) {
 +            return NULL;
 +        }
 +        virBufferAsprintf(&buf, i > 0 ? ",%s:%u" : "%s:%u", host->name, host->port);
 +    }
 +    if (src->nhosts > 0) {
 +        etcd = virBufferContentAndReset(&buf);
 +    }
 +
 +    if (virJSONValueObjectAdd(&ret,
 +                              "S:etcd-host", etcd,
 +                              "S:etcd-prefix", src->query,
 +                              "S:config-path", src->configFile,
 +                              "s:image", src->path,
 +                              NULL) < 0)
 +        return NULL;
 +
 +    return ret;
 +}
 +
 +
 static virJSONValue *
 qemuBlockStorageSourceGetSheepdogProps(virStorageSource *src)
 {
@@ -1140,6 +1172,12 @@ qemuBlockStorageSourceGetBackendProps(virStorageSource *src,
                 return NULL;
             break;
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
 +            driver = "vitastor";
 +            if (!(fileprops = qemuBlockStorageSourceGetVitastorProps(src)))
 +                return NULL;
 +            break;
 +
         case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
             driver = "sheepdog";
             if (!(fileprops = qemuBlockStorageSourceGetSheepdogProps(src)))
@@ -2032,6 +2070,7 @@ qemuBlockGetBackingStoreString(virStorageSource *src,
             case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
             case VIR_STORAGE_NET_PROTOCOL_RBD:
 +            case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
             case VIR_STORAGE_NET_PROTOCOL_VXHS:
             case VIR_STORAGE_NET_PROTOCOL_NFS:
             case VIR_STORAGE_NET_PROTOCOL_SSH:
@@ -2415,6 +2454,12 @@ qemuBlockStorageSourceCreateGetStorageProps(virStorageSource *src,
                 return -1;
             break;
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
 +            driver = "vitastor";
 +            if (!(location = qemuBlockStorageSourceGetVitastorProps(src)))
 +                return -1;
 +            break;
 +
         case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
             driver = "sheepdog";
             if (!(location = qemuBlockStorageSourceGetSheepdogProps(src)))
 diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
 index 953808fcfe..62860283d8 100644
 --- a/src/qemu/qemu_domain.c
 +++ b/src/qemu/qemu_domain.c
@@ -5215,7 +5215,8 @@ qemuDomainValidateStorageSource(virStorageSource *src,
     if (src->query &&
         (actualType != VIR_STORAGE_TYPE_NETWORK ||
          (src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTPS &&
 -          src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP))) {
 +          src->protocol != VIR_STORAGE_NET_PROTOCOL_HTTP &&
 +          src->protocol != VIR_STORAGE_NET_PROTOCOL_VITASTOR))) {
         virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
                        _("query is supported only with HTTP(S) protocols"));
         return -1;
@@ -10340,6 +10341,7 @@ qemuDomainPrepareStorageSourceTLS(virStorageSource *src,
         break;
     case VIR_STORAGE_NET_PROTOCOL_RBD:
 +    case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
     case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
     case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
     case VIR_STORAGE_NET_PROTOCOL_ISCSI:
 diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c
 index 73ff533827..e9c799ca8f 100644
 --- a/src/qemu/qemu_snapshot.c
 +++ b/src/qemu/qemu_snapshot.c
@@ -423,6 +423,7 @@ qemuSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDef *snapdisk,
         case VIR_STORAGE_NET_PROTOCOL_NONE:
         case VIR_STORAGE_NET_PROTOCOL_NBD:
         case VIR_STORAGE_NET_PROTOCOL_RBD:
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
         case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
         case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
         case VIR_STORAGE_NET_PROTOCOL_ISCSI:
@@ -648,6 +649,7 @@ qemuSnapshotPrepareDiskInternal(virDomainDiskDef *disk,
         case VIR_STORAGE_NET_PROTOCOL_NONE:
         case VIR_STORAGE_NET_PROTOCOL_NBD:
         case VIR_STORAGE_NET_PROTOCOL_RBD:
 +        case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
         case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
         case VIR_STORAGE_NET_PROTOCOL_GLUSTER:
         case VIR_STORAGE_NET_PROTOCOL_ISCSI:
 diff --git a/src/storage/storage_driver.c b/src/storage/storage_driver.c
 index 314fe930e0..fb615a8b4e 100644
 --- a/src/storage/storage_driver.c
 +++ b/src/storage/storage_driver.c
@@ -1626,6 +1626,7 @@ storageVolLookupByPathCallback(virStoragePoolObj *obj,
         case VIR_STORAGE_POOL_GLUSTER:
         case VIR_STORAGE_POOL_RBD:
 +        case VIR_STORAGE_POOL_VITASTOR:
         case VIR_STORAGE_POOL_SHEEPDOG:
         case VIR_STORAGE_POOL_ZFS:
         case VIR_STORAGE_POOL_LAST:
 diff --git a/src/storage_file/storage_source_backingstore.c b/src/storage_file/storage_source_backingstore.c
 index 80681924ea..8a3ade9ec0 100644
 --- a/src/storage_file/storage_source_backingstore.c
 +++ b/src/storage_file/storage_source_backingstore.c
@@ -287,6 +287,75 @@ virStorageSourceParseRBDColonString(const char *rbdstr,
 }
 +static int
 +virStorageSourceParseVitastorColonString(const char *colonstr,
 +                                         virStorageSource *src)
 +{
 +    char *p, *e, *next;
 +    g_autofree char *options = NULL;
 +
 +    /* optionally skip the "vitastor:" prefix if provided */
 +    if (STRPREFIX(colonstr, "vitastor:"))
 +        colonstr += strlen("vitastor:");
 +
 +    options = g_strdup(colonstr);
 +
 +    p = options;
 +    while (*p) {
 +        /* find : delimiter or end of string */
 +        for (e = p; *e && *e != ':'; ++e) {
 +            if (*e == '\\') {
 +                e++;
 +                if (*e == '\0')
 +                    break;
 +            }
 +        }
 +        if (*e == '\0') {
 +            next = e;    /* last kv pair */
 +        } else {
 +            next = e + 1;
 +            *e = '\0';
 +        }
 +
 +        if (STRPREFIX(p, "image=")) {
 +            src->path = g_strdup(p + strlen("image="));
 +        } else if (STRPREFIX(p, "etcd-prefix=")) {
 +            src->query = g_strdup(p + strlen("etcd-prefix="));
 +        } else if (STRPREFIX(p, "config-path=")) {
 +            src->configFile = g_strdup(p + strlen("config-path="));
 +        } else if (STRPREFIX(p, "etcd-host=")) {
 +            char *h, *sep;
 +
 +            h = p + strlen("etcd-host=");
 +            while (h < e) {
 +                for (sep = h; sep < e; ++sep) {
 +                    if (*sep == '\\' && (sep[1] == ',' ||
 +                                         sep[1] == ';' ||
 +                                         sep[1] == ' ')) {
 +                        *sep = '\0';
 +                        sep += 2;
 +                        break;
 +                    }
 +                }
 +
 +                if (virStorageSourceRBDAddHost(src, h) < 0)
 +                    return -1;
 +
 +                h = sep;
 +            }
 +        }
 +
 +        p = next;
 +    }
 +
 +    if (!src->path) {
 +        return -1;
 +    }
 +
 +    return 0;
 +}
 +
 +
 static int
 virStorageSourceParseNBDColonString(const char *nbdstr,
                                     virStorageSource *src)
@@ -399,6 +468,11 @@ virStorageSourceParseBackingColon(virStorageSource *src,
             return -1;
         break;
 +    case VIR_STORAGE_NET_PROTOCOL_VITASTOR:
 +        if (virStorageSourceParseVitastorColonString(path, src) < 0)
 +            return -1;
 +        break;
 +
     case VIR_STORAGE_NET_PROTOCOL_SHEEPDOG:
     case VIR_STORAGE_NET_PROTOCOL_LAST:
     case VIR_STORAGE_NET_PROTOCOL_NONE:
@@ -975,6 +1049,54 @@ virStorageSourceParseBackingJSONRBD(virStorageSource *src,
     return 0;
 }
 +static int
 +virStorageSourceParseBackingJSONVitastor(virStorageSource *src,
 +                                         virJSONValue *json,
 +                                         const char *jsonstr G_GNUC_UNUSED,
 +                                         int opaque G_GNUC_UNUSED)
 +{
 +    const char *filename;
 +    const char *image = virJSONValueObjectGetString(json, "image");
 +    const char *conf = virJSONValueObjectGetString(json, "config-path");
 +    const char *etcd_prefix = virJSONValueObjectGetString(json, "etcd-prefix");
 +    virJSONValue *servers = virJSONValueObjectGetArray(json, "server");
 +    size_t nservers;
 +    size_t i;
 +
 +    src->type = VIR_STORAGE_TYPE_NETWORK;
 +    src->protocol = VIR_STORAGE_NET_PROTOCOL_VITASTOR;
 +
 +    /* legacy syntax passed via 'filename' option */
 +    if ((filename = virJSONValueObjectGetString(json, "filename")))
 +        return virStorageSourceParseVitastorColonString(filename, src);
 +
 +    if (!image) {
 +        virReportError(VIR_ERR_INVALID_ARG, "%s",
 +                       _("missing image name in Vitastor backing volume "
 +                         "JSON specification"));
 +        return -1;
 +    }
 +
 +    src->path = g_strdup(image);
 +    src->configFile = g_strdup(conf);
 +    src->query = g_strdup(etcd_prefix);
 +
 +    if (servers) {
 +        nservers = virJSONValueArraySize(servers);
 +
 +        src->hosts = g_new0(virStorageNetHostDef, nservers);
 +        src->nhosts = nservers;
 +
 +        for (i = 0; i < nservers; i++) {
 +            if (virStorageSourceParseBackingJSONInetSocketAddress(src->hosts + i,
 +                                                                  virJSONValueArrayGet(servers, i)) < 0)
 +                return -1;
 +        }
 +    }
 +
 +    return 0;
 +}
 +
 static int
 virStorageSourceParseBackingJSONRaw(virStorageSource *src,
                                     virJSONValue *json,
@@ -1152,6 +1274,7 @@ static const struct virStorageSourceJSONDriverParser jsonParsers[] = {
     {"sheepdog", false, virStorageSourceParseBackingJSONSheepdog, 0},
     {"ssh", false, virStorageSourceParseBackingJSONSSH, 0},
     {"rbd", false, virStorageSourceParseBackingJSONRBD, 0},
 +    {"vitastor", false, virStorageSourceParseBackingJSONVitastor, 0},
     {"raw", true, virStorageSourceParseBackingJSONRaw, 0},
     {"nfs", false, virStorageSourceParseBackingJSONNFS, 0},
     {"vxhs", false, virStorageSourceParseBackingJSONVxHS, 0},
 diff --git a/src/test/test_driver.c b/src/test/test_driver.c
 index e87d7cfd44..ccc05d7aae 100644
 --- a/src/test/test_driver.c
 +++ b/src/test/test_driver.c
@@ -7335,6 +7335,7 @@ testStorageVolumeTypeForPool(int pooltype)
     case VIR_STORAGE_POOL_ISCSI_DIRECT:
     case VIR_STORAGE_POOL_GLUSTER:
     case VIR_STORAGE_POOL_RBD:
 +    case VIR_STORAGE_POOL_VITASTOR:
         return VIR_STORAGE_VOL_NETWORK;
     case VIR_STORAGE_POOL_LOGICAL:
     case VIR_STORAGE_POOL_DISK:
 diff --git a/tests/storagepoolcapsschemadata/poolcaps-fs.xml b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
 index eee75af746..8bd0a57bdd 100644
 --- a/tests/storagepoolcapsschemadata/poolcaps-fs.xml
 +++ b/tests/storagepoolcapsschemadata/poolcaps-fs.xml
@@ -204,4 +204,11 @@
       </enum>
     </volOptions>
   </pool>
 +  <pool type='vitastor' supported='no'>
 +    <volOptions>
 +      <defaultFormat type='raw'/>
 +      <enum name='targetFormatType'>
 +      </enum>
 +    </volOptions>
 +  </pool>
 </storagepoolCapabilities>
 diff --git a/tests/storagepoolcapsschemadata/poolcaps-full.xml b/tests/storagepoolcapsschemadata/poolcaps-full.xml
 index 805950a937..852df0de16 100644
 --- a/tests/storagepoolcapsschemadata/poolcaps-full.xml
 +++ b/tests/storagepoolcapsschemadata/poolcaps-full.xml
@@ -204,4 +204,11 @@
       </enum>
     </volOptions>
   </pool>
 +  <pool type='vitastor' supported='yes'>
 +    <volOptions>
 +      <defaultFormat type='raw'/>
 +      <enum name='targetFormatType'>
 +      </enum>
 +    </volOptions>
 +  </pool>
 </storagepoolCapabilities>
 diff --git a/tests/storagepoolxml2argvtest.c b/tests/storagepoolxml2argvtest.c
 index e8e40d695e..db55fe5f3a 100644
 --- a/tests/storagepoolxml2argvtest.c
 +++ b/tests/storagepoolxml2argvtest.c
@@ -65,6 +65,7 @@ testCompareXMLToArgvFiles(bool shouldFail,
     case VIR_STORAGE_POOL_GLUSTER:
     case VIR_STORAGE_POOL_ZFS:
     case VIR_STORAGE_POOL_VSTORAGE:
 +    case VIR_STORAGE_POOL_VITASTOR:
     case VIR_STORAGE_POOL_LAST:
     default:
         VIR_TEST_DEBUG("pool type '%s' has no xml2argv test", defTypeStr);
 diff --git a/tools/virsh-pool.c b/tools/virsh-pool.c
 index 36f00cf643..5f5bd3464e 100644
 --- a/tools/virsh-pool.c
 +++ b/tools/virsh-pool.c
@@ -1223,6 +1223,9 @@ cmdPoolList(vshControl *ctl, const vshCmd *cmd G_GNUC_UNUSED)
             case VIR_STORAGE_POOL_VSTORAGE:
                 flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VSTORAGE;
                 break;
 +            case VIR_STORAGE_POOL_VITASTOR:
 +                flags |= VIR_CONNECT_LIST_STORAGE_POOLS_VITASTOR;
 +                break;
             case VIR_STORAGE_POOL_LAST:
                 break;
             }
--- a/pull_request_template.yml
+++ b/pull_request_template.yml
@ -0,0 +1,28 @@
+name: Pull Request
+about: Submit a pull request
+body:
+  - type: textarea
+    id: description
+    attributes:
+      label: Description
+      description: Describe your pull request
+      placeholder: ""
+      value: ""
+    validations:
+      required: true
+  - type: input
+    id: author
+    attributes:
+      label: Contributor Name
+      description: Contributor Name or Company Details if the Contributor is a company
+      placeholder: ""
+    validations:
+      required: false
+  - type: checkboxes
+    id: terms
+    attributes:
+      label: CLA
+      description: By submitting this pull request, I accept [Vitastor CLA](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md)
+      options:
+        - label: "I accept Vitastor CLA agreement: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md"
+          required: true
--- a/rpm/build-tarball.sh
+++ b/rpm/build-tarball.sh
@ -24,4 +24,4 @@ rm fio
 mv fio-copy fio
 FIO=`rpm -qi fio | perl -e 'while(<>) { /^Epoch[\s:]+(\S+)/ && print "$1:"; /^Version[\s:]+(\S+)/ && print $1; /^Release[\s:]+(\S+)/ && print "-$1"; }'`
 perl -i -pe 's/(Requires:\s*fio)([^\n]+)?/$1 = '$FIO'/' $VITASTOR/rpm/vitastor-el$EL.spec
-tar --transform 's#^#vitastor-1.3.1/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-1.3.1$(rpm --eval '%dist').tar.gz *
+tar --transform 's#^#vitastor-1.4.4/#' --exclude 'rpm/*.rpm' -czf $VITASTOR/../vitastor-1.4.4$(rpm --eval '%dist').tar.gz *
--- a/rpm/vitastor-el7.Dockerfile
+++ b/rpm/vitastor-el7.Dockerfile
@ -36,7 +36,7 @@ ADD . /root/vitastor
 RUN set -e; \
    cd /root/vitastor/rpm; \
    sh build-tarball.sh; \
-    cp /root/vitastor-1.3.1.el7.tar.gz ~/rpmbuild/SOURCES; \
+    cp /root/vitastor-1.4.4.el7.tar.gz ~/rpmbuild/SOURCES; \
    cp vitastor-el7.spec ~/rpmbuild/SPECS/vitastor.spec; \
    cd ~/rpmbuild/SPECS/; \
    rpmbuild -ba vitastor.spec; \
--- a/rpm/vitastor-el7.spec
+++ b/rpm/vitastor-el7.spec
@ -1,11 +1,11 @@
 Name:           vitastor
-Version:        1.3.1
+Version:        1.4.4
 Release:        1%{?dist}
 Summary:        Vitastor, a fast software-defined clustered block storage
 License:        Vitastor Network Public License 1.1
 URL:            https://vitastor.io/
-Source0:        vitastor-1.3.1.el7.tar.gz
+Source0:        vitastor-1.4.4.el7.tar.gz
 BuildRequires:  liburing-devel >= 0.6
 BuildRequires:  gperftools-devel
--- a/rpm/vitastor-el8.Dockerfile
+++ b/rpm/vitastor-el8.Dockerfile
@ -35,7 +35,7 @@ ADD . /root/vitastor
 RUN set -e; \
    cd /root/vitastor/rpm; \
    sh build-tarball.sh; \
-    cp /root/vitastor-1.3.1.el8.tar.gz ~/rpmbuild/SOURCES; \
+    cp /root/vitastor-1.4.4.el8.tar.gz ~/rpmbuild/SOURCES; \
    cp vitastor-el8.spec ~/rpmbuild/SPECS/vitastor.spec; \
    cd ~/rpmbuild/SPECS/; \
    rpmbuild -ba vitastor.spec; \
--- a/rpm/vitastor-el8.spec
+++ b/rpm/vitastor-el8.spec
@ -1,11 +1,11 @@
 Name:           vitastor
-Version:        1.3.1
+Version:        1.4.4
 Release:        1%{?dist}
 Summary:        Vitastor, a fast software-defined clustered block storage
 License:        Vitastor Network Public License 1.1
 URL:            https://vitastor.io/
-Source0:        vitastor-1.3.1.el8.tar.gz
+Source0:        vitastor-1.4.4.el8.tar.gz
 BuildRequires:  liburing-devel >= 0.6
 BuildRequires:  gperftools-devel
--- a/rpm/vitastor-el9.Dockerfile
+++ b/rpm/vitastor-el9.Dockerfile
@ -18,7 +18,7 @@ ADD . /root/vitastor
 RUN set -e; \
    cd /root/vitastor/rpm; \
    sh build-tarball.sh; \
-    cp /root/vitastor-1.3.1.el9.tar.gz ~/rpmbuild/SOURCES; \
+    cp /root/vitastor-1.4.4.el9.tar.gz ~/rpmbuild/SOURCES; \
    cp vitastor-el9.spec ~/rpmbuild/SPECS/vitastor.spec; \
    cd ~/rpmbuild/SPECS/; \
    rpmbuild -ba vitastor.spec; \
--- a/rpm/vitastor-el9.spec
+++ b/rpm/vitastor-el9.spec
@ -1,11 +1,11 @@
 Name:           vitastor
-Version:        1.3.1
+Version:        1.4.4
 Release:        1%{?dist}
 Summary:        Vitastor, a fast software-defined clustered block storage
 License:        Vitastor Network Public License 1.1
 URL:            https://vitastor.io/
-Source0:        vitastor-1.3.1.el9.tar.gz
+Source0:        vitastor-1.4.4.el9.tar.gz
 BuildRequires:  liburing-devel >= 0.6
 BuildRequires:  gperftools-devel
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@ -16,7 +16,7 @@ if("${CMAKE_INSTALL_PREFIX}" MATCHES "^/usr/local/?$")
 	set(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}")
 endif()
-add_definitions(-DVERSION="1.3.1")
+add_definitions(-DVERSION="1.4.4")
 add_definitions(-Wall -Wno-sign-compare -Wno-comment -Wno-parentheses -Wno-pointer-arith -fdiagnostics-color=always -fno-omit-frame-pointer -I ${CMAKE_SOURCE_DIR}/src)
 add_link_options(-fno-omit-frame-pointer)
 if (${WITH_ASAN})
--- a/src/addr_util.cpp
+++ b/src/addr_util.cpp
@ -8,6 +8,7 @@
 #include <stdio.h>
 #include <stdexcept>
+#include <set>
 #include "addr_util.h"
@ -135,7 +136,7 @@ std::vector<std::string> getifaddr_list(std::vector<std::string> mask_cfg, bool
            throw std::runtime_error((include_v6 ? "Invalid IPv4 address mask: " : "Invalid IP address mask: ") + mask);
        }
    }
-    std::vector<std::string> addresses;
+    std::set<std::string> addresses;
    ifaddrs *list, *ifa;
    if (getifaddrs(&list) == -1)
    {
@ -149,7 +150,8 @@ std::vector<std::string> getifaddr_list(std::vector<std::string> mask_cfg, bool
        }
        int family = ifa->ifa_addr->sa_family;
        if ((family == AF_INET || family == AF_INET6 && include_v6) &&
-            (ifa->ifa_flags & (IFF_UP | IFF_RUNNING | IFF_LOOPBACK)) == (IFF_UP | IFF_RUNNING))
+            // Do not skip loopback addresses if the address filter is specified
+            (ifa->ifa_flags & (IFF_UP | IFF_RUNNING | (masks.size() ? 0 : IFF_LOOPBACK))) == (IFF_UP | IFF_RUNNING))
        {
            void *addr_ptr;
            if (family == AF_INET)
@ -182,11 +184,11 @@ std::vector<std::string> getifaddr_list(std::vector<std::string> mask_cfg, bool
            {
                throw std::runtime_error(std::string("inet_ntop: ") + strerror(errno));
            }
-            addresses.push_back(std::string(addr));
+            addresses.insert(std::string(addr));
        }
    }
    freeifaddrs(list);
-    return addresses;
+    return std::vector<std::string>(addresses.begin(), addresses.end());
 }
 int create_and_bind_socket(std::string bind_address, int bind_port, int listen_backlog, int *listening_port)
--- a/src/blockstore_flush.cpp
+++ b/src/blockstore_flush.cpp
@ -184,8 +184,7 @@ void journal_flusher_t::mark_trim_possible()
    if (trim_wanted > 0)
    {
        dequeuing = true;
-        if (!journal_trim_counter)
+        journal_trim_counter = 0;
-            journal_trim_counter = journal_trim_interval;
        bs->ringloop->wakeup();
    }
 }
@ -366,7 +365,7 @@ resume_0:
        !flusher->flush_queue.size() || !flusher->dequeuing)
    {
 stop_flusher:
-        if (flusher->trim_wanted > 0 && flusher->journal_trim_counter > 0)
+        if (flusher->trim_wanted > 0 && !flusher->journal_trim_counter)
        {
            // Attempt forced trim
            flusher->active_flushers++;
@ -1346,7 +1345,6 @@ bool journal_flusher_co::trim_journal(int wait_base)
    else if (wait_state == wait_base+2) goto resume_2;
    else if (wait_state == wait_base+3) goto resume_3;
    else if (wait_state == wait_base+4) goto resume_4;
-    flusher->journal_trim_counter = 0;
    new_trim_pos = bs->journal.get_trim_pos();
    if (new_trim_pos != bs->journal.used_start)
    {
@ -1419,6 +1417,7 @@ bool journal_flusher_co::trim_journal(int wait_base)
                exit(0);
            }
        }
+        flusher->journal_trim_counter = 0;
        flusher->trimming = false;
    }
    return true;
--- a/src/blockstore_impl.cpp
+++ b/src/blockstore_impl.cpp
@ -163,20 +163,10 @@ void blockstore_impl_t::loop()
            }
            else if (op->opcode == BS_OP_SYNC)
            {
-                // wait for all small writes to be submitted
+                // sync only completed writes?
                // wait for all big writes to complete, submit data device fsync
                // wait for the data device fsync to complete, then submit journal writes for big writes
                // then submit an fsync operation
                if (has_writes)
                {
                    // Can't submit SYNC before previous writes
                    continue;
                }
                wr_st = continue_sync(op);
                if (wr_st != 2)
                {
                    has_writes = wr_st > 0 ? 1 : 2;
                }
            }
            else if (op->opcode == BS_OP_STABLE)
            {
@ -205,6 +195,10 @@ void blockstore_impl_t::loop()
                    // ring is full, stop submission
                    break;
                }
+                else if (PRIV(op)->wait_for == WAIT_JOURNAL)
+                {
+                    PRIV(op)->wait_detail2 = (unstable_writes.size()+unstable_unsynced);
+                }
            }
        }
        if (op_idx != new_idx)
@ -283,7 +277,8 @@ void blockstore_impl_t::check_wait(blockstore_op_t *op)
    }
    else if (PRIV(op)->wait_for == WAIT_JOURNAL)
    {
-        if (journal.used_start == PRIV(op)->wait_detail)
+        if (journal.used_start == PRIV(op)->wait_detail &&
+            (unstable_writes.size()+unstable_unsynced) == PRIV(op)->wait_detail2)
        {
            // do not submit
 #ifdef BLOCKSTORE_DEBUG
@ -558,13 +553,14 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
            if (stable_count >= stable_alloc)
            {
                stable_alloc *= 2;
-                stable = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
+                obj_ver_id* nst = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
-                if (!stable)
+                if (!nst)
                {
                    op->retval = -ENOMEM;
                    FINISH_OP(op);
                    return;
                }
+                stable = nst;
            }
            stable[stable_count++] = {
                .oid = clean_it->first,
@ -642,8 +638,8 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
                            if (stable_count >= stable_alloc)
                            {
                                stable_alloc += 32768;
-                                stable = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
+                                obj_ver_id *nst = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
-                                if (!stable)
+                                if (!nst)
                                {
                                    if (unstable)
                                        free(unstable);
@ -651,6 +647,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
                                    FINISH_OP(op);
                                    return;
                                }
+                                stable = nst;
                            }
                            stable[stable_count++] = dirty_it->first;
                        }
@ -666,8 +663,8 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
                    if (unstable_count >= unstable_alloc)
                    {
                        unstable_alloc += 32768;
-                        unstable = (obj_ver_id*)realloc(unstable, sizeof(obj_ver_id) * unstable_alloc);
+                        obj_ver_id *nst = (obj_ver_id*)realloc(unstable, sizeof(obj_ver_id) * unstable_alloc);
-                        if (!unstable)
+                        if (!nst)
                        {
                            if (stable)
                                free(stable);
@ -675,6 +672,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
                            FINISH_OP(op);
                            return;
                        }
+                        unstable = nst;
                    }
                    unstable[unstable_count++] = dirty_it->first;
                }
@ -694,8 +692,8 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
    if (stable_count+unstable_count > stable_alloc)
    {
        stable_alloc = stable_count+unstable_count;
-        stable = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
+        obj_ver_id *nst = (obj_ver_id*)realloc(stable, sizeof(obj_ver_id) * stable_alloc);
-        if (!stable)
+        if (!nst)
        {
            if (unstable)
                free(unstable);
@ -703,6 +701,7 @@ void blockstore_impl_t::process_list(blockstore_op_t *op)
            FINISH_OP(op);
            return;
        }
+        stable = nst;
    }
    // Copy unstable entries
    for (int i = 0; i < unstable_count; i++)
--- a/src/blockstore_impl.h
+++ b/src/blockstore_impl.h
@ -55,6 +55,7 @@
 #define IS_JOURNAL(st) (((st) & 0x0F) == BS_ST_SMALL_WRITE)
 #define IS_BIG_WRITE(st) (((st) & 0x0F) == BS_ST_BIG_WRITE)
 #define IS_DELETE(st) (((st) & 0x0F) == BS_ST_DELETE)
+#define IS_INSTANT(st) (((st) & BS_ST_TYPE_MASK) == BS_ST_DELETE || ((st) & BS_ST_INSTANT))
 #define BS_SUBMIT_CHECK_SQES(n) \
    if (ringloop->sqes_left() < (n))\
@ -201,7 +202,7 @@ struct blockstore_op_private_t
 {
    // Wait status
    int wait_for;
-    uint64_t wait_detail;
+    uint64_t wait_detail, wait_detail2;
    int pending_ops;
    int op_state;
@ -277,6 +278,7 @@ class blockstore_impl_t
    int unsynced_big_write_count = 0, unstable_unsynced = 0;
    int unsynced_queued_ops = 0;
    allocator *data_alloc = NULL;
+    uint64_t used_blocks = 0;
    uint8_t *zero_object;
    void *metadata_buffer = NULL;
@ -376,7 +378,7 @@ class blockstore_impl_t
    // Stabilize
    int dequeue_stable(blockstore_op_t *op);
    int continue_stable(blockstore_op_t *op);
-    void mark_stable(const obj_ver_id & ov, bool forget_dirty = false);
+    void mark_stable(obj_ver_id ov, bool forget_dirty = false);
    void stabilize_object(object_id oid, uint64_t max_ver);
    blockstore_op_t* selective_sync(blockstore_op_t *op);
    int split_stab_op(blockstore_op_t *op, std::function<int(obj_ver_id v)> decider);
@ -430,7 +432,7 @@ public:
    inline uint32_t get_block_size() { return dsk.data_block_size; }
    inline uint64_t get_block_count() { return dsk.block_count; }
-    inline uint64_t get_free_block_count() { return data_alloc->get_free_count(); }
+    inline uint64_t get_free_block_count() { return dsk.block_count - used_blocks; }
    inline uint32_t get_bitmap_granularity() { return dsk.disk_alignment; }
    inline uint64_t get_journal_size() { return dsk.journal_len; }
 };
--- a/src/blockstore_init.cpp
+++ b/src/blockstore_init.cpp
@ -376,6 +376,7 @@ bool blockstore_init_meta::handle_meta_block(uint8_t *buf, uint64_t entries_per_
                else
                {
                    bs->inode_space_stats[entry->oid.inode] += bs->dsk.data_block_size;
                    bs->used_blocks++;
                }
                entries_loaded++;
 #ifdef BLOCKSTORE_DEBUG
@ -1181,6 +1182,7 @@ void blockstore_init_journal::erase_dirty_object(blockstore_dirty_db_t::iterator
            sp -= bs->dsk.data_block_size;
        else
            bs->inode_space_stats.erase(oid.inode);
        bs->used_blocks--;
    }
    bs->erase_dirty(dirty_it, dirty_end, clean_loc);
    // Remove it from the flusher's queue, too
--- a/src/blockstore_open.cpp
+++ b/src/blockstore_open.cpp
@ -19,7 +19,7 @@ void blockstore_impl_t::parse_config(blockstore_config_t & config, bool init)
    throttle_target_mbs = strtoull(config["throttle_target_mbs"].c_str(), NULL, 10);
    throttle_target_parallelism = strtoull(config["throttle_target_parallelism"].c_str(), NULL, 10);
    throttle_threshold_us = strtoull(config["throttle_threshold_us"].c_str(), NULL, 10);
-    if (config.find("autosync_writes") != config.end())
+    if (config["autosync_writes"] != "")
    {
        autosync_writes = strtoull(config["autosync_writes"].c_str(), NULL, 10);
    }
--- a/src/blockstore_stable.cpp
+++ b/src/blockstore_stable.cpp
@ -412,11 +412,40 @@ resume_4:
    return 2;
 }
-void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
+void blockstore_impl_t::mark_stable(obj_ver_id v, bool forget_dirty)
 {
    auto dirty_it = dirty_db.find(v);
    if (dirty_it != dirty_db.end())
    {
        if (IS_INSTANT(dirty_it->second.state))
        {
            // 'Instant' (non-EC) operations may complete and try to become stable out of order. Prevent it.
            auto back_it = dirty_it;
            while (back_it != dirty_db.begin())
            {
                back_it--;
                if (back_it->first.oid != v.oid)
                {
                    break;
                }
                if (!IS_STABLE(back_it->second.state))
                {
                    // There are preceding unstable versions, can't flush <v>
                    return;
                }
            }
            while (true)
            {
                dirty_it++;
                if (dirty_it == dirty_db.end() || dirty_it->first.oid != v.oid ||
                    !IS_SYNCED(dirty_it->second.state))
                {
                    dirty_it--;
                    break;
                }
                v.version = dirty_it->first.version;
            }
        }
        while (1)
        {
            bool was_stable = IS_STABLE(dirty_it->second.state);
@ -445,6 +474,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
                    if (!exists)
                    {
                        inode_space_stats[dirty_it->first.oid.inode] += dsk.data_block_size;
                        used_blocks++;
                    }
                    big_to_flush++;
                }
@ -455,6 +485,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
                        sp -= dsk.data_block_size;
                    else
                        inode_space_stats.erase(dirty_it->first.oid.inode);
                    used_blocks--;
                    big_to_flush++;
                }
            }
--- a/src/blockstore_sync.cpp
+++ b/src/blockstore_sync.cpp
@ -76,6 +76,7 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
        // 2nd step: Data device is synced, prepare & write journal entries
        // Check space in the journal and journal memory buffers
        blockstore_journal_check_t space_check(this);
        auto reservation = (unstable_writes.size()+unstable_unsynced+PRIV(op)->sync_big_writes.size())*journal.block_size;
        if (dsk.csum_block_size)
        {
            // More complex check because all journal entries have different lengths
@ -85,16 +86,14 @@ int blockstore_impl_t::continue_sync(blockstore_op_t *op)
                left--;
                auto & dirty_entry = dirty_db.at(sbw);
                uint64_t dyn_size = dsk.dirty_dyn_size(dirty_entry.offset, dirty_entry.len);
-                if (!space_check.check_available(op, 1, sizeof(journal_entry_big_write) + dyn_size,
-                    (unstable_writes.size()+unstable_unsynced)*journal.block_size))
+                if (!space_check.check_available(op, 1, sizeof(journal_entry_big_write) + dyn_size, left ? 0 : reservation))
                {
                    return 0;
                }
            }
        }
        else if (!space_check.check_available(op, PRIV(op)->sync_big_writes.size(),
-            sizeof(journal_entry_big_write) + dsk.clean_entry_bitmap_size,
-            (unstable_writes.size()+unstable_unsynced)*journal.block_size))
+            sizeof(journal_entry_big_write) + dsk.clean_entry_bitmap_size, reservation))
        {
            return 0;
        }
--- a/src/blockstore_write.cpp
+++ b/src/blockstore_write.cpp
@ -129,7 +129,7 @@ bool blockstore_impl_t::enqueue_write(blockstore_op_t *op)
    }
    bool imm = (op->len < dsk.data_block_size ? (immediate_commit != IMMEDIATE_NONE) : (immediate_commit == IMMEDIATE_ALL));
    if (wait_big && !is_del && !deleted && op->len < dsk.data_block_size && !imm ||
-        !imm && unsynced_queued_ops >= autosync_writes)
+        !imm && autosync_writes && unsynced_queued_ops >= autosync_writes)
    {
        // Issue an additional sync so that the previous big write can reach the journal
        blockstore_op_t *sync_op = new blockstore_op_t;
@ -320,7 +320,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
        blockstore_journal_check_t space_check(this);
        if (!space_check.check_available(op, unsynced_big_write_count + 1,
            sizeof(journal_entry_big_write) + dsk.clean_dyn_size,
-            (unstable_writes.size()+unstable_unsynced)*journal.block_size))
+            (unstable_writes.size()+unstable_unsynced+((dirty_it->second.state & BS_ST_INSTANT) ? 0 : 1))*journal.block_size))
        {
            return 0;
        }
@ -412,7 +412,7 @@ int blockstore_impl_t::dequeue_write(blockstore_op_t *op)
                sizeof(journal_entry_big_write) + dsk.clean_dyn_size, 0)
            || !space_check.check_available(op, 1,
                sizeof(journal_entry_small_write) + dyn_size,
-                op->len + (unstable_writes.size()+unstable_unsynced)*journal.block_size))
+                op->len + (unstable_writes.size()+unstable_unsynced+((dirty_it->second.state & BS_ST_INSTANT) ? 0 : 1))*journal.block_size))
        {
            return 0;
        }
@ -549,7 +549,7 @@ resume_2:
        uint64_t dyn_size = dsk.dirty_dyn_size(op->offset, op->len);
        blockstore_journal_check_t space_check(this);
        if (!space_check.check_available(op, 1, sizeof(journal_entry_big_write) + dyn_size,
-            (unstable_writes.size()+unstable_unsynced)*journal.block_size))
+            (unstable_writes.size()+unstable_unsynced+((dirty_it->second.state & BS_ST_INSTANT) ? 0 : 1))*journal.block_size))
        {
            return 0;
        }
@ -593,7 +593,7 @@ resume_4:
 #endif
        bool is_big = (dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_BIG_WRITE;
        bool imm = is_big ? (immediate_commit == IMMEDIATE_ALL) : (immediate_commit != IMMEDIATE_NONE);
-        bool is_instant = ((dirty_it->second.state & BS_ST_TYPE_MASK) == BS_ST_DELETE || (dirty_it->second.state & BS_ST_INSTANT));
+        bool is_instant = IS_INSTANT(dirty_it->second.state);
        if (imm)
        {
            auto & unstab = unstable_writes[op->oid];
--- a/src/cli_rm_data.cpp
+++ b/src/cli_rm_data.cpp
@ -17,6 +17,7 @@ struct rm_pg_t
    uint64_t obj_count = 0, obj_done = 0;
    int state = 0;
    int in_flight = 0;
    bool synced = false;
 };
 struct rm_inode_t
@ -48,6 +49,7 @@ struct rm_inode_t
                .objects = objects,
                .obj_count = objects.size(),
                .obj_done = 0,
                .synced = parent->cli->get_immediate_commit(inode),
            });
            if (min_offset == 0)
            {
@ -151,6 +153,37 @@ struct rm_inode_t
            }
            cur_list->obj_pos++;
        }
        if (cur_list->in_flight == 0 && cur_list->obj_pos == cur_list->objects.end() &&
            !cur_list->synced)
        {
            osd_op_t *op = new osd_op_t();
            op->op_type = OSD_OP_OUT;
            op->peer_fd = parent->cli->msgr.osd_peer_fds.at(cur_list->rm_osd_num);
            op->req = (osd_any_op_t){
                .sync = {
                    .header = {
                        .magic = SECONDARY_OSD_OP_MAGIC,
                        .id = parent->cli->next_op_id(),
                        .opcode = OSD_OP_SYNC,
                    },
                },
            };
            op->callback = [this, cur_list](osd_op_t *op)
            {
                cur_list->in_flight--;
                cur_list->synced = true;
                if (op->reply.hdr.retval < 0)
                {
                    fprintf(stderr, "Failed to sync OSD %lu (retval=%ld)\n",
                        cur_list->rm_osd_num, op->reply.hdr.retval);
                    error_count++;
                }
                delete op;
                continue_delete();
            };
            cur_list->in_flight++;
            parent->cli->msgr.outbox_push(op);
        }
    }
    void continue_delete()
@ -161,7 +194,8 @@ struct rm_inode_t
        }
        for (int i = 0; i < lists.size(); i++)
        {
-            if (!lists[i]->in_flight && lists[i]->obj_pos == lists[i]->objects.end())
+            if (!lists[i]->in_flight && lists[i]->obj_pos == lists[i]->objects.end() &&
+                lists[i]->synced)
            {
                delete lists[i];
                lists.erase(lists.begin()+i, lists.begin()+i+1);
@ -187,7 +221,7 @@ struct rm_inode_t
            {
                fprintf(stderr, "\n");
            }
-            if (parent->progress && (total_done < total_count || inactive_osds.size() > 0))
+            if (parent->progress && (total_done < total_count || inactive_osds.size() > 0 || error_count > 0))
            {
                fprintf(
                    stderr, "Warning: Pool:%u,ID:%lu inode data may not have been fully removed.\n"
--- a/src/cli_status.cpp
+++ b/src/cli_status.cpp
@ -106,7 +106,7 @@ resume_2:
            if (etcd_states[i]["error"].is_null())
            {
                etcd_alive++;
-                etcd_db_size = etcd_states[i]["dbSizeInUse"].uint64_value();
+                etcd_db_size = etcd_states[i]["dbSize"].uint64_value();
            }
        }
        int mon_count = 0;
--- a/src/cluster_client.cpp
+++ b/src/cluster_client.cpp
@ -352,13 +352,15 @@ void cluster_client_t::on_load_config_hook(json11::Json::object & etcd_global_co
    // up_wait_retry_interval
    up_wait_retry_interval = config["up_wait_retry_interval"].uint64_value();
    if (!up_wait_retry_interval)
    {
        up_wait_retry_interval = 500;
    }
-    else if (up_wait_retry_interval < 50)
-    {
-        up_wait_retry_interval = 50;
-    }
+    else if (up_wait_retry_interval < 10)
+    {
+        up_wait_retry_interval = 10;
+    }
    // log_level
    log_level = config["log_level"].uint64_value();
    msgr.parse_config(config);
    st_cli.parse_config(config);
    st_cli.load_pgs();
@ -703,6 +705,8 @@ resume_1:
        }
        goto resume_2;
    }
    // Protect from try_send completing the operation immediately
    op->inflight_count++;
    for (int i = 0; i < op->parts.size(); i++)
    {
        if (!(op->parts[i].flags & PART_SENT))
@ -726,8 +730,10 @@ resume_1:
            }
        }
    }
    op->inflight_count--;
    if (op->state == 1)
    {
        // Some suboperations have to be resent
        return 0;
    }
 resume_2:
@ -1147,11 +1153,14 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
        if (op->retval != -EINTR && op->retval != -EIO && op->retval != -ENOSPC)
        {
            stop_fd = part->op.peer_fd;
            if (op->retval != -EPIPE || log_level > 0)
            {
                fprintf(
                    stderr, "%s operation failed on OSD %lu: retval=%ld (expected %d), dropping connection\n",
                    osd_op_names[part->op.req.hdr.opcode], part->osd_num, part->op.reply.hdr.retval, expected
                );
            }
        }
        else
        {
            fprintf(
--- a/src/cluster_client.h
+++ b/src/cluster_client.h
@ -91,7 +91,7 @@ class cluster_client_t
    uint64_t client_max_buffered_ops = 0;
    uint64_t client_max_writeback_iodepth = 0;
-    int log_level;
+    int log_level = 0;
    int up_wait_retry_interval = 500; // ms
    int retry_timeout_id = 0;
--- a/src/disk_tool_prepare.cpp
+++ b/src/disk_tool_prepare.cpp
@ -440,18 +440,27 @@ std::vector<std::string> disk_tool_t::get_new_data_parts(vitastor_dev_info_t & d
                {
                    // Use this partition
                    use_parts.push_back(part["uuid"].string_value());
                    osds_exist++;
                }
                else
                {
                    std::string part_path = "/dev/disk/by-partuuid/"+strtolower(part["uuid"].string_value());
                    bool is_meta = sb["params"]["meta_device"].string_value() == part_path;
                    bool is_journal = sb["params"]["journal_device"].string_value() == part_path;
                    bool is_data = sb["params"]["data_device"].string_value() == part_path;
                    fprintf(
-                        stderr, "%s is already initialized for OSD %lu, skipping\n",
-                        part["node"].string_value().c_str(), sb["params"]["osd_num"].uint64_value()
+                        stderr, "%s is already initialized for OSD %lu%s, skipping\n",
+                        part["node"].string_value().c_str(), sb["params"]["osd_num"].uint64_value(),
+                        (is_data ? " data" : (is_meta ? " meta" : (is_journal ? " journal" : "")))
                    );
                    if (is_data || sb["params"]["data_device"].string_value().substr(0, 22) != "/dev/disk/by-partuuid/")
                    {
                        osds_size += part["size"].uint64_value()*dev.pt["sectorsize"].uint64_value();
                }
                        osds_exist++;
                    }
                }
            }
        }
        // Still create OSD(s) if a disk has no more than (max_other_percent) other data
        if (osds_exist >= osd_per_disk || (dev.free+osds_size) < dev.size*(100-max_other_percent)/100)
            fprintf(stderr, "%s is already partitioned, skipping\n", dev.path.c_str());
--- a/src/etcd_state_client.cpp
+++ b/src/etcd_state_client.cpp
@ -333,7 +333,7 @@ void etcd_state_client_t::start_etcd_watcher()
        etcd_watch_ws = NULL;
    }
    if (this->log_level > 1)
-        fprintf(stderr, "Trying to connect to etcd websocket at %s\n", etcd_address.c_str());
+        fprintf(stderr, "Trying to connect to etcd websocket at %s, watch from revision %lu\n", etcd_address.c_str(), etcd_watch_revision);
    etcd_watch_ws = open_websocket(tfd, etcd_address, etcd_api_path+"/watch", etcd_slow_timeout,
        [this, cur_addr = selected_etcd_address](const http_response_t *msg)
    {
@ -356,8 +356,8 @@ void etcd_state_client_t::start_etcd_watcher()
                        watch_id == ETCD_PG_HISTORY_WATCH_ID ||
                        watch_id == ETCD_OSD_STATE_WATCH_ID)
                        etcd_watches_initialised++;
-                    if (etcd_watches_initialised == 4 && this->log_level > 0)
-                        fprintf(stderr, "Successfully subscribed to etcd at %s\n", cur_addr.c_str());
+                    if (etcd_watches_initialised == ETCD_TOTAL_WATCHES && this->log_level > 0)
+                        fprintf(stderr, "Successfully subscribed to etcd at %s, revision %lu\n", cur_addr.c_str(), etcd_watch_revision);
                }
                if (data["result"]["canceled"].bool_value())
                {
@ -393,9 +393,13 @@ void etcd_state_client_t::start_etcd_watcher()
                        exit(1);
                    }
                }
-                if (etcd_watches_initialised == 4)
+                if (etcd_watches_initialised == ETCD_TOTAL_WATCHES && !data["result"]["header"]["revision"].is_null())
                {
-                    etcd_watch_revision = data["result"]["header"]["revision"].uint64_value()+1;
+                    // Protect against a revision being split into multiple messages and some
+                    // of them being lost. Even though I'm not sure if etcd actually splits them
+                    // Also sometimes etcd sends something without a header, like:
+                    // {"error": {"grpc_code": 14, "http_code": 503, "http_status": "Service Unavailable", "message": "error reading from server: EOF"}}
+                    etcd_watch_revision = data["result"]["header"]["revision"].uint64_value();
                    addresses_to_try.clear();
                }
                // First gather all changes into a hash to remove multiple overwrites
@ -507,7 +511,7 @@ void etcd_state_client_t::start_ws_keepalive()
    {
        ws_keepalive_timer = tfd->set_timer(etcd_ws_keepalive_interval*1000, true, [this](int)
        {
-            if (!etcd_watch_ws)
+            if (!etcd_watch_ws || etcd_watches_initialised < ETCD_TOTAL_WATCHES)
            {
                // Do nothing
            }
@ -636,18 +640,28 @@ void etcd_state_client_t::load_pgs()
            on_load_pgs_hook(false);
            return;
        }
        reset_pg_exists();
        if (!etcd_watch_revision)
        {
            etcd_watch_revision = data["header"]["revision"].uint64_value()+1;
            if (this->log_level > 3)
            {
                fprintf(stderr, "Loaded revision %lu of PG configuration\n", etcd_watch_revision-1);
            }
        }
        for (auto & res: data["responses"].array_items())
        {
            for (auto & kv_json: res["response_range"]["kvs"].array_items())
            {
                auto kv = parse_etcd_kv(kv_json);
                if (this->log_level > 3)
                {
                    fprintf(stderr, "Loaded key: %s -> %s\n", kv.key.c_str(), kv.value.dump().c_str());
                }
                parse_state(kv);
            }
        }
        clean_nonexistent_pgs();
        on_load_pgs_hook(true);
        start_etcd_watcher();
    });
@ -668,6 +682,73 @@ void etcd_state_client_t::load_pgs()
 }
 #endif
 void etcd_state_client_t::reset_pg_exists()
 {
    for (auto & pool_item: pool_config)
    {
        for (auto & pg_item: pool_item.second.pg_config)
        {
            pg_item.second.state_exists = false;
            pg_item.second.history_exists = false;
        }
    }
    seen_peers.clear();
 }
 void etcd_state_client_t::clean_nonexistent_pgs()
 {
    for (auto & pool_item: pool_config)
    {
        for (auto pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); )
        {
            auto & pg_cfg = pg_it->second;
            if (!pg_cfg.config_exists && !pg_cfg.state_exists && !pg_cfg.history_exists)
            {
                if (this->log_level > 3)
                {
                    fprintf(stderr, "PG %u/%u disappeared after reload, forgetting it\n", pool_item.first, pg_it->first);
                }
                pool_item.second.pg_config.erase(pg_it++);
            }
            else
            {
                if (!pg_cfg.state_exists)
                {
                    if (this->log_level > 3)
                    {
                        fprintf(stderr, "PG %u/%u primary OSD disappeared after reload, forgetting it\n", pool_item.first, pg_it->first);
                    }
                    parse_state((etcd_kv_t){
                        .key = etcd_prefix+"/pg/state/"+std::to_string(pool_item.first)+"/"+std::to_string(pg_it->first),
                    });
                }
                if (!pg_cfg.history_exists)
                {
                    if (this->log_level > 3)
                    {
                        fprintf(stderr, "PG %u/%u history disappeared after reload, forgetting it\n", pool_item.first, pg_it->first);
                    }
                    parse_state((etcd_kv_t){
                        .key = etcd_prefix+"/pg/history/"+std::to_string(pool_item.first)+"/"+std::to_string(pg_it->first),
                    });
                }
                pg_it++;
            }
        }
    }
    for (auto & peer_item: peer_states)
    {
        if (seen_peers.find(peer_item.first) == seen_peers.end())
        {
            fprintf(stderr, "OSD %lu state disappeared after reload, forgetting it\n", peer_item.first);
            parse_state((etcd_kv_t){
                .key = etcd_prefix+"/osd/state/"+std::to_string(peer_item.first),
            });
        }
    }
    seen_peers.clear();
 }
 void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
 {
    const std::string & key = kv.key;
@ -822,7 +903,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
        {
            for (auto & pg_item: pool_item.second.pg_config)
            {
-                pg_item.second.exists = false;
+                pg_item.second.config_exists = false;
            }
        }
        for (auto & pool_item: value["items"].object_items())
@ -845,7 +926,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
                    continue;
                }
                auto & parsed_cfg = this->pool_config[pool_id].pg_config[pg_num];
-                parsed_cfg.exists = true;
+                parsed_cfg.config_exists = true;
                parsed_cfg.pause = pg_item.second["pause"].bool_value();
                parsed_cfg.primary = pg_item.second["primary"].uint64_value();
                parsed_cfg.target_set.clear();
@ -866,7 +947,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
            int n = 0;
            for (auto pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); pg_it++)
            {
-                if (pg_it->second.exists && pg_it->first != ++n)
+                if (pg_it->second.config_exists && pg_it->first != ++n)
                {
                    fprintf(
                        stderr, "Invalid pool %u PG configuration: PG numbers don't cover whole 1..%lu range\n",
@ -874,7 +955,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
                    );
                    for (pg_it = pool_item.second.pg_config.begin(); pg_it != pool_item.second.pg_config.end(); pg_it++)
                    {
-                        pg_it->second.exists = false;
+                        pg_it->second.config_exists = false;
                    }
                    n = 0;
                    break;
@ -899,6 +980,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
            auto & pg_cfg = this->pool_config[pool_id].pg_config[pg_num];
            pg_cfg.target_history.clear();
            pg_cfg.all_peers.clear();
            pg_cfg.history_exists = !value.is_null();
            // Refuse to start PG if any set of the <osd_sets> has no live OSDs
            for (auto & hist_item: value["osd_sets"].array_items())
            {
@ -951,11 +1033,15 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
        }
        else if (value.is_null())
        {
-            this->pool_config[pool_id].pg_config[pg_num].cur_primary = 0;
-            this->pool_config[pool_id].pg_config[pg_num].cur_state = 0;
+            auto & pg_cfg = this->pool_config[pool_id].pg_config[pg_num];
+            pg_cfg.state_exists = false;
+            pg_cfg.cur_primary = 0;
+            pg_cfg.cur_state = 0;
        }
        else
        {
            auto & pg_cfg = this->pool_config[pool_id].pg_config[pg_num];
            pg_cfg.state_exists = true;
            osd_num_t cur_primary = value["primary"].uint64_value();
            int state = 0;
            for (auto & e: value["state"].array_items())
@ -983,8 +1069,8 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
                fprintf(stderr, "Unexpected pool %u PG %u state in etcd: primary=%lu, state=%s\n", pool_id, pg_num, cur_primary, value["state"].dump().c_str());
                return;
            }
-            this->pool_config[pool_id].pg_config[pg_num].cur_primary = cur_primary;
-            this->pool_config[pool_id].pg_config[pg_num].cur_state = state;
+            pg_cfg.cur_primary = cur_primary;
+            pg_cfg.cur_state = state;
        }
    }
    else if (key.substr(0, etcd_prefix.length()+11) == etcd_prefix+"/osd/state/")
@ -998,6 +1084,7 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
                value["port"].int64_value() > 0 && value["port"].int64_value() < 65536)
            {
                this->peer_states[peer_osd] = value;
                this->seen_peers.insert(peer_osd);
            }
            else
            {
--- a/src/etcd_state_client.h
+++ b/src/etcd_state_client.h
@ -3,6 +3,8 @@
 #pragma once
 #include <set>
 #include "json11/json11.hpp"
 #include "osd_id.h"
 #include "timerfd_manager.h"
@ -11,6 +13,7 @@
 #define ETCD_PG_STATE_WATCH_ID 2
 #define ETCD_PG_HISTORY_WATCH_ID 3
 #define ETCD_OSD_STATE_WATCH_ID 4
 #define ETCD_TOTAL_WATCHES 4
 #define DEFAULT_BLOCK_SIZE 128*1024
 #define MIN_DATA_BLOCK_SIZE 4*1024
@ -25,12 +28,12 @@ struct etcd_kv_t
 {
    std::string key;
    json11::Json value;
-    uint64_t mod_revision;
+    uint64_t mod_revision = 0;
 };
 struct pg_config_t
 {
-    bool exists;
+    bool config_exists, history_exists, state_exists;
    osd_num_t primary;
    std::vector<osd_num_t> target_set;
    std::vector<std::vector<osd_num_t>> target_history;
@ -61,21 +64,21 @@ struct pool_config_t
 struct inode_config_t
 {
-    uint64_t num;
+    uint64_t num = 0;
    std::string name;
-    uint64_t size;
+    uint64_t size = 0;
-    inode_t parent_id;
+    inode_t parent_id = 0;
-    bool readonly;
+    bool readonly = false;
    // Arbitrary metadata
    json11::Json meta;
    // Change revision of the metadata in etcd
-    uint64_t mod_revision;
+    uint64_t mod_revision = 0;
 };
 struct inode_watch_t
 {
    std::string name;
-    inode_config_t cfg;
+    inode_config_t cfg = {};
 };
 struct http_co_t;
@ -113,6 +116,7 @@ public:
    uint64_t etcd_watch_revision = 0;
    std::map<pool_id_t, pool_config_t> pool_config;
    std::map<osd_num_t, json11::Json> peer_states;
    std::set<osd_num_t> seen_peers;
    std::map<inode_t, inode_config_t> inode_config;
    std::map<std::string, inode_t> inode_by_name;
@ -138,6 +142,8 @@ public:
    void start_ws_keepalive();
    void load_global_config();
    void load_pgs();
    void reset_pg_exists();
    void clean_nonexistent_pgs();
    void parse_state(const etcd_kv_t & kv);
    void parse_config(const json11::Json & config);
    void insert_inode_config(const inode_config_t & cfg);
--- a/src/messenger.h
+++ b/src/messenger.h
@ -149,7 +149,7 @@ public:
    std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
    std::map<uint64_t, int> osd_peer_fds;
    // op statistics
-    osd_op_stats_t stats;
+    osd_op_stats_t stats, recovery_stats;
    void init();
    void parse_config(const json11::Json & config);
@ -175,6 +175,7 @@ public:
    bool connect_rdma(int peer_fd, std::string rdma_address, uint64_t client_max_msg);
 #endif
    void inc_op_stats(osd_op_stats_t & stats, uint64_t opcode, timespec & tv_begin, timespec & tv_end, uint64_t len);
    void measure_exec(osd_op_t *cur_op);
 protected:
--- a/src/msgr_op.cpp
+++ b/src/msgr_op.cpp
@ -24,3 +24,17 @@ osd_op_t::~osd_op_t()
        free(buf);
    }
 }
 bool osd_op_t::is_recovery_related()
 {
    return (req.hdr.opcode == OSD_OP_SEC_READ ||
        req.hdr.opcode == OSD_OP_SEC_WRITE ||
        req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) &&
        (req.sec_rw.flags & OSD_OP_RECOVERY_RELATED) ||
        req.hdr.opcode == OSD_OP_SEC_DELETE &&
        (req.sec_del.flags & OSD_OP_RECOVERY_RELATED) ||
        req.hdr.opcode == OSD_OP_SEC_STABILIZE &&
        (req.sec_stab.flags & OSD_OP_RECOVERY_RELATED) ||
        req.hdr.opcode == OSD_OP_SEC_SYNC &&
        (req.sec_sync.flags & OSD_OP_RECOVERY_RELATED);
 }
--- a/src/msgr_op.h
+++ b/src/msgr_op.h
@ -173,4 +173,6 @@ struct osd_op_t
    osd_op_buf_list_t iov;
    ~osd_op_t();
    bool is_recovery_related();
 };
--- a/src/msgr_send.cpp
+++ b/src/msgr_send.cpp
@ -131,6 +131,23 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
    }
 }
 void osd_messenger_t::inc_op_stats(osd_op_stats_t & stats, uint64_t opcode, timespec & tv_begin, timespec & tv_end, uint64_t len)
 {
    uint64_t usecs = (
        (tv_end.tv_sec - tv_begin.tv_sec)*1000000 +
        (tv_end.tv_nsec - tv_begin.tv_nsec)/1000
    );
    stats.op_stat_count[opcode]++;
    if (!stats.op_stat_count[opcode])
    {
        stats.op_stat_count[opcode] = 1;
        stats.op_stat_sum[opcode] = 0;
        stats.op_stat_bytes[opcode] = 0;
    }
    stats.op_stat_sum[opcode] += usecs;
    stats.op_stat_bytes[opcode] += len;
 }
 void osd_messenger_t::measure_exec(osd_op_t *cur_op)
 {
    // Measure execution latency
@ -142,29 +159,24 @@ void osd_messenger_t::measure_exec(osd_op_t *cur_op)
    {
        clock_gettime(CLOCK_REALTIME, &cur_op->tv_end);
    }
-    stats.op_stat_count[cur_op->req.hdr.opcode]++;
-    if (!stats.op_stat_count[cur_op->req.hdr.opcode])
-    {
-        stats.op_stat_count[cur_op->req.hdr.opcode]++;
-        stats.op_stat_sum[cur_op->req.hdr.opcode] = 0;
-        stats.op_stat_bytes[cur_op->req.hdr.opcode] = 0;
-    }
-    stats.op_stat_sum[cur_op->req.hdr.opcode] += (
-        (cur_op->tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
-        (cur_op->tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
-    );
+    uint64_t len = 0;
    if (cur_op->req.hdr.opcode == OSD_OP_READ ||
        cur_op->req.hdr.opcode == OSD_OP_WRITE ||
        cur_op->req.hdr.opcode == OSD_OP_SCRUB)
    {
        // req.rw.len is internally set to the full object size for scrubs
-        stats.op_stat_bytes[cur_op->req.hdr.opcode] += cur_op->req.rw.len;
+        len = cur_op->req.rw.len;
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ ||
        cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
        cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
    {
-        stats.op_stat_bytes[cur_op->req.hdr.opcode] += cur_op->req.sec_rw.len;
+        len = cur_op->req.sec_rw.len;
    }
+    inc_op_stats(stats, cur_op->req.hdr.opcode, cur_op->tv_begin, cur_op->tv_end, len);
+    if (cur_op->is_recovery_related())
+    {
+        inc_op_stats(recovery_stats, cur_op->req.hdr.opcode, cur_op->tv_begin, cur_op->tv_end, len);
+    }
 }
--- a/src/osd.cpp
+++ b/src/osd.cpp
@ -22,7 +22,7 @@ static blockstore_config_t json_to_bs(const json11::Json::object & config)
    {
        if (kv.second.is_string())
            bs[kv.first] = kv.second.string_value();
-        else
+        else if (!kv.second.is_null())
            bs[kv.first] = kv.second.dump();
    }
    return bs;
@ -194,7 +194,8 @@ void osd_t::parse_config(bool init)
        if (autosync_interval > MAX_AUTOSYNC_INTERVAL)
            autosync_interval = DEFAULT_AUTOSYNC_INTERVAL;
    }
-    if (!config["autosync_writes"].is_null())
+    if (config["autosync_writes"].is_number() ||
+        config["autosync_writes"].string_value() != "")
    {
        // Allow to set it to 0
        autosync_writes = config["autosync_writes"].uint64_value();
@ -209,21 +210,31 @@ void osd_t::parse_config(bool init)
    if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE)
        recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
    recovery_sleep_us = config["recovery_sleep_us"].uint64_value();
-    recovery_tune_min_util = config["recovery_tune_min_util"].is_null()
-        ? 0.1 : config["recovery_tune_min_util"].number_value();
-    recovery_tune_max_util = config["recovery_tune_max_util"].is_null()
-        ? 1.0 : config["recovery_tune_max_util"].number_value();
-    recovery_tune_min_client_util = config["recovery_tune_min_client_util"].is_null()
-        ? 0 : config["recovery_tune_min_client_util"].number_value();
-    recovery_tune_max_client_util = config["recovery_tune_max_client_util"].is_null()
-        ? 0.5 : config["recovery_tune_max_client_util"].number_value();
+    recovery_tune_util_low = config["recovery_tune_util_low"].is_null()
+        ? 0.1 : config["recovery_tune_util_low"].number_value();
+    if (recovery_tune_util_low < 0.01)
+        recovery_tune_util_low = 0.01;
+    recovery_tune_util_high = config["recovery_tune_util_high"].is_null()
+        ? 1.0 : config["recovery_tune_util_high"].number_value();
+    if (recovery_tune_util_high < 0.01)
+        recovery_tune_util_high = 0.01;
+    recovery_tune_client_util_low = config["recovery_tune_client_util_low"].is_null()
+        ? 0 : config["recovery_tune_client_util_low"].number_value();
+    if (recovery_tune_client_util_low < 0.01)
+        recovery_tune_client_util_low = 0.01;
+    recovery_tune_client_util_high = config["recovery_tune_client_util_high"].is_null()
+        ? 0.5 : config["recovery_tune_client_util_high"].number_value();
+    if (recovery_tune_client_util_high < 0.01)
+        recovery_tune_client_util_high = 0.01;
    auto old_recovery_tune_interval = recovery_tune_interval;
    recovery_tune_interval = config["recovery_tune_interval"].is_null()
        ? 1 : config["recovery_tune_interval"].uint64_value();
-    recovery_tune_ewma_rate = config["recovery_tune_ewma_rate"].is_null()
-        ? 0.5 : config["recovery_tune_ewma_rate"].number_value();
+    recovery_tune_agg_interval = config["recovery_tune_agg_interval"].is_null()
+        ? 10 : config["recovery_tune_agg_interval"].uint64_value();
    recovery_tune_sleep_min_us = config["recovery_tune_sleep_min_us"].is_null()
        ? 10 : config["recovery_tune_sleep_min_us"].uint64_value();
+    recovery_tune_sleep_cutoff_us = config["recovery_tune_sleep_cutoff_us"].is_null()
+        ? 10000000 : config["recovery_tune_sleep_cutoff_us"].uint64_value();
    recovery_pg_switch = config["recovery_pg_switch"].uint64_value();
    if (recovery_pg_switch < 1)
        recovery_pg_switch = DEFAULT_RECOVERY_PG_SWITCH;
@ -494,11 +505,12 @@ void osd_t::print_stats()
        {
            uint64_t bw = (recovery_stat[i].bytes - recovery_print_prev[i].bytes) / print_stats_interval;
            printf(
-                "[OSD %lu] %s recovery: %.1f op/s, B/W: %.2f %s, avg lat %ld us\n", osd_num, recovery_stat_names[i],
+                "[OSD %lu] %s recovery: %.1f op/s, B/W: %.2f %s, avg latency %ld us, delay %ld us\n", osd_num, recovery_stat_names[i],
                (recovery_stat[i].count - recovery_print_prev[i].count) * 1.0 / print_stats_interval,
                (bw > 1024*1024*1024 ? bw/1024.0/1024/1024 : (bw > 1024*1024 ? bw/1024.0/1024 : bw/1024.0)),
                (bw > 1024*1024*1024 ? "GB/s" : (bw > 1024*1024 ? "MB/s" : "KB/s")),
-                (recovery_stat[i].usec - recovery_print_prev[i].usec) / (recovery_stat[i].count - recovery_print_prev[i].count)
+                (recovery_stat[i].usec - recovery_print_prev[i].usec) / (recovery_stat[i].count - recovery_print_prev[i].count),
+                recovery_target_sleep_us
            );
        }
    }
@ -596,8 +608,8 @@ void osd_t::print_slow()
                    op->req.hdr.opcode == OSD_OP_SEC_STABILIZE || op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK ||
                    op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
                {
-                    bufprintf(" state=%d", PRIV(op->bs_op)->op_state);
+                    bufprintf(" state=%d", op->bs_op ? PRIV(op->bs_op)->op_state : -1);
-                    int wait_for = PRIV(op->bs_op)->wait_for;
+                    int wait_for = op->bs_op ? PRIV(op->bs_op)->wait_for : 0;
                    if (wait_for)
                    {
                        bufprintf(" wait=%d (detail=%lu)", wait_for, PRIV(op->bs_op)->wait_detail);
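Review note on the renamed `recovery_tune_*` thresholds parsed in `parse_config()`: they feed the linear interpolation in `osd_t::tune_recovery()`. A minimal standalone sketch of that formula follows, assuming the clamped default of 0.01 for `recovery_tune_client_util_low`; the struct and function names (`tune_limits_t`, `target_util`) are hypothetical and not part of this change.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for the four thresholds (not a Vitastor type).
struct tune_limits_t
{
    double util_low = 0.1;          // recovery_tune_util_low
    double util_high = 1.0;         // recovery_tune_util_high
    double client_util_low = 0.01;  // recovery_tune_client_util_low (default 0, clamped to 0.01)
    double client_util_high = 0.5;  // recovery_tune_client_util_high
};

// Same shape as the rtune_target_util expression in tune_recovery():
// full recovery utilisation while clients are idle, minimal utilisation
// once clients are busy, linear in between.
double target_util(const tune_limits_t & l, double client_util)
{
    if (client_util < l.client_util_low)
        return l.util_high;
    if (client_util >= l.client_util_high)
        return l.util_low;
    return l.util_low + (l.util_high - l.util_low) *
        (l.client_util_high - client_util) / (l.client_util_high - l.client_util_low);
}
```

With these defaults an idle cluster yields a target of 1.0, and a client utilisation of 0.5 or more yields 0.1.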
--- a/src/osd.h
+++ b/src/osd.h
@ -118,13 +118,14 @@ class osd_t
    int autosync_writes = DEFAULT_AUTOSYNC_WRITES;
    uint64_t recovery_queue_depth = 1;
    uint64_t recovery_sleep_us = 0;
-    double recovery_tune_min_util = 0.1;
-    double recovery_tune_min_client_util = 0;
-    double recovery_tune_max_util = 1.0;
-    double recovery_tune_max_client_util = 0.5;
+    double recovery_tune_util_low = 0.1;
+    double recovery_tune_client_util_low = 0;
+    double recovery_tune_util_high = 1.0;
+    double recovery_tune_client_util_high = 0.5;
    int recovery_tune_interval = 1;
-    double recovery_tune_ewma_rate = 0.5;
+    int recovery_tune_agg_interval = 10;
    int recovery_tune_sleep_min_us = 10;
+    int recovery_tune_sleep_cutoff_us = 10000000;
    int recovery_pg_switch = DEFAULT_RECOVERY_PG_SWITCH;
    int recovery_sync_batch = DEFAULT_RECOVERY_BATCH;
    int inode_vanish_time = 60;
@ -209,10 +210,11 @@ class osd_t
    int rtune_timer_id = -1;
    uint64_t rtune_avg_lat = 0;
    double rtune_client_util = 0, rtune_target_util = 1;
-    osd_op_stats_t rtune_prev_stats;
+    osd_op_stats_t rtune_prev_stats, rtune_prev_recovery_stats;
-    recovery_stat_t rtune_prev_recovery[2];
+    std::vector<uint64_t> recovery_target_sleep_items;
    uint64_t recovery_target_queue_depth = 1;
    uint64_t recovery_target_sleep_us = 0;
    uint64_t recovery_target_sleep_total = 0;
    int recovery_target_sleep_cur = 0, recovery_target_sleep_count = 0;
    // cluster connection
    void parse_config(bool init);
@ -281,6 +283,7 @@ class osd_t
    void exec_sync_stab_all(osd_op_t *cur_op);
    void exec_show_config(osd_op_t *cur_op);
    void exec_secondary(osd_op_t *cur_op);
+    void exec_secondary_real(osd_op_t *cur_op);
    void secondary_op_callback(osd_op_t *cur_op);
    // primary ops
@ -303,7 +306,7 @@ class osd_t
    bool remember_unstable_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t & loc_set, int base_state);
    void handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op);
    void handle_primary_bs_subop(osd_op_t *subop);
-    void add_bs_subop_stats(osd_op_t *subop);
+    void add_bs_subop_stats(osd_op_t *subop, bool recovery_related = false);
    void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval);
    void submit_primary_subops(int submit_type, uint64_t op_version, const uint64_t* osd_set, osd_op_t *cur_op);
--- a/src/osd_cluster.cpp
+++ b/src/osd_cluster.cpp
@ -262,7 +262,8 @@ void osd_t::report_statistics()
    for (auto st_it = inode_stats.begin(); st_it != inode_stats.end(); )
    {
        auto & kv = *st_it;
-        if (!bs_inode_space[kv.first])
+        auto spc_it = bs_inode_space.find(kv.first);
+        if (spc_it == bs_inode_space.end() || !spc_it->second) // prevent autovivification
        {
            // Is it an empty inode?
            if (!tv_now.tv_sec)
@ -651,7 +652,7 @@ void osd_t::apply_pg_config()
        {
            pg_num_t pg_num = kv.first;
            auto & pg_cfg = kv.second;
-            bool take = pg_cfg.exists && pg_cfg.primary == this->osd_num &&
+            bool take = pg_cfg.config_exists && pg_cfg.primary == this->osd_num &&
                !pg_cfg.pause && (!pg_cfg.cur_primary || pg_cfg.cur_primary == this->osd_num);
            auto pg_it = this->pgs.find({ .pool_id = pool_id, .pg_num = pg_num });
            bool currently_taken = pg_it != this->pgs.end() && pg_it->second.state != PG_OFFLINE;
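Review note on the `report_statistics()` change: `operator[]` on a map inserts a zero entry for a missing key, so merely testing an inode grows the container. The sketch below contrasts the two lookups; it models `bs_inode_space` as a plain `std::map<uint64_t, uint64_t>`, which is an assumption, and both helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Illustrative stand-in for bs_inode_space (assumed key/value types).
typedef std::map<uint64_t, uint64_t> inode_space_t;

// Old pattern: operator[] creates a zero entry for a missing key as a side effect.
bool is_empty_autoviv(inode_space_t & space, uint64_t inode)
{
    return !space[inode];
}

// New pattern: find() answers the same question without modifying the map.
bool is_empty_find(const inode_space_t & space, uint64_t inode)
{
    auto it = space.find(inode);
    return it == space.end() || !it->second;
}
```

Both helpers return the same answer; only the first one changes the size of the map.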
--- a/src/osd_flush.cpp
+++ b/src/osd_flush.cpp
@ -325,17 +325,7 @@ void osd_t::submit_recovery_op(osd_recovery_op_t *op)
        {
            printf("Recovery operation done for %lx:%lx\n", op->oid.inode, op->oid.stripe);
        }
-        if (recovery_target_sleep_us)
-        {
-            this->tfd->set_timer_us(recovery_target_sleep_us, false, [this, op](int timer_id)
-            {
        finish_recovery_op(op);
-            });
-        }
-        else
-        {
-            finish_recovery_op(op);
-        }
    };
    exec_op(op->osd_op);
 }
@ -356,7 +346,6 @@ void osd_t::apply_recovery_tune_interval()
    }
    else
    {
-        recovery_target_queue_depth = recovery_queue_depth;
        recovery_target_sleep_us = recovery_sleep_us;
    }
 }
@ -383,47 +372,82 @@ void osd_t::finish_recovery_op(osd_recovery_op_t *op)
 void osd_t::tune_recovery()
 {
-    static int total_client_ops[] = { OSD_OP_READ, OSD_OP_WRITE, OSD_OP_SYNC, OSD_OP_DELETE };
-    uint64_t total_client_usec = 0;
-    for (int i = 0; i < sizeof(total_client_ops)/sizeof(total_client_ops[0]); i++)
+    static int accounted_ops[] = {
+        OSD_OP_SEC_READ, OSD_OP_SEC_WRITE, OSD_OP_SEC_WRITE_STABLE,
+        OSD_OP_SEC_STABILIZE, OSD_OP_SEC_SYNC, OSD_OP_SEC_DELETE
+    };
+    uint64_t total_client_usec = 0, total_recovery_usec = 0, recovery_count = 0;
+    for (int i = 0; i < sizeof(accounted_ops)/sizeof(accounted_ops[0]); i++)
    {
-        total_client_usec += (msgr.stats.op_stat_sum[total_client_ops[i]] - rtune_prev_stats.op_stat_sum[total_client_ops[i]]);
-        rtune_prev_stats.op_stat_sum[total_client_ops[i]] = msgr.stats.op_stat_sum[total_client_ops[i]];
+        total_client_usec += (msgr.stats.op_stat_sum[accounted_ops[i]]
+            - rtune_prev_stats.op_stat_sum[accounted_ops[i]]);
+        total_recovery_usec += (msgr.recovery_stats.op_stat_sum[accounted_ops[i]]
+            - rtune_prev_recovery_stats.op_stat_sum[accounted_ops[i]]);
+        recovery_count += (msgr.recovery_stats.op_stat_count[accounted_ops[i]]
+            - rtune_prev_recovery_stats.op_stat_count[accounted_ops[i]]);
+        rtune_prev_stats.op_stat_sum[accounted_ops[i]] = msgr.stats.op_stat_sum[accounted_ops[i]];
+        rtune_prev_recovery_stats.op_stat_sum[accounted_ops[i]] = msgr.recovery_stats.op_stat_sum[accounted_ops[i]];
+        rtune_prev_recovery_stats.op_stat_count[accounted_ops[i]] = msgr.recovery_stats.op_stat_count[accounted_ops[i]];
    }
-    uint64_t total_recovery_usec = 0, recovery_count = 0;
-    total_recovery_usec += recovery_stat[0].usec-rtune_prev_recovery[0].usec;
-    total_recovery_usec += recovery_stat[1].usec-rtune_prev_recovery[1].usec;
-    recovery_count += recovery_stat[0].count-rtune_prev_recovery[0].count;
-    recovery_count += recovery_stat[1].count-rtune_prev_recovery[1].count;
-    memcpy(rtune_prev_recovery, recovery_stat, sizeof(recovery_stat));
+    total_client_usec -= total_recovery_usec;
    if (recovery_count == 0)
    {
        return;
    }
-    rtune_avg_lat = total_recovery_usec/recovery_count*recovery_tune_ewma_rate +
-        rtune_avg_lat*(1-recovery_tune_ewma_rate);
-    // client_util = count/interval * usec/1000000.0/count = usec/1000000.0/interval :-)
-    double client_util = total_client_usec/1000000.0/recovery_tune_interval;
-    rtune_client_util = rtune_client_util*(1-recovery_tune_ewma_rate) + client_util*recovery_tune_ewma_rate;
-    rtune_target_util = (rtune_client_util < recovery_tune_min_client_util
-        ? recovery_tune_max_util
-        : recovery_tune_min_util + (rtune_client_util >= recovery_tune_max_client_util
-            ? 0 : (recovery_tune_max_util-recovery_tune_min_util)*
-                (recovery_tune_max_client_util-rtune_client_util)/(recovery_tune_max_client_util-recovery_tune_min_client_util)
+    // example:
+    // total 3 GB/s
+    // recovery queue 1
+    // 120 OSDs
+    // EC 5+3
+    // 128kb block_size => 640kb object
+    // 3000*1024/640/120 = 40 MB/s per OSD = 64 recovered objects per OSD
+    //   = 64*8*2 subops = 1024 recovery subop iops
+    // 8 recovery subop queue
+    // => subop avg latency = 0.0078125 sec
+    // utilisation = 8
+    // target util 1
+    // intuitively target latency should be 8x of real
+    // target_lat = rtune_avg_lat * utilisation / target_util
+    //            = rtune_avg_lat * rtune_avg_lat * rtune_avg_iops / target_util
+    //            = 0.0625
+    // recovery utilisation will be 1
+    rtune_client_util = total_client_usec/1000000.0/recovery_tune_interval;
+    rtune_target_util = (rtune_client_util < recovery_tune_client_util_low
+        ? recovery_tune_util_high
+        : recovery_tune_util_low + (rtune_client_util >= recovery_tune_client_util_high
+            ? 0 : (recovery_tune_util_high-recovery_tune_util_low)*
+                (recovery_tune_client_util_high-rtune_client_util)/(recovery_tune_client_util_high-recovery_tune_client_util_low)
        )
    );
-    recovery_target_queue_depth = (int)rtune_target_util + (rtune_target_util < 1 || rtune_target_util-(int)rtune_target_util >= 0.1 ? 1 : 0);
-    // ideal_iops = 1s / real_latency
-    // ;; target_iops = target_util * ideal_iops
-    // => target_lat = target_queue * 1s / target_iops
-    // => target_lat = target_queue / target_util * real_latency
-    uint64_t target_lat = recovery_target_queue_depth/rtune_target_util * rtune_avg_lat;
-    recovery_target_sleep_us = target_lat > rtune_avg_lat+recovery_tune_sleep_min_us ? target_lat-rtune_avg_lat : 0;
-    if (log_level > 3)
+    rtune_avg_lat = total_recovery_usec/recovery_count;
+    uint64_t target_lat = rtune_avg_lat * rtune_avg_lat/1000000.0 * recovery_count/recovery_tune_interval / rtune_target_util;
+    auto sleep_us = target_lat > rtune_avg_lat+recovery_tune_sleep_min_us ? target_lat-rtune_avg_lat : 0;
+    if (sleep_us > recovery_tune_sleep_cutoff_us)
+    {
+        return;
+    }
+    if (recovery_target_sleep_items.size() != recovery_tune_agg_interval)
    {
+        recovery_target_sleep_items.resize(recovery_tune_agg_interval);
+        for (int i = 0; i < recovery_tune_agg_interval; i++)
+            recovery_target_sleep_items[i] = 0;
+        recovery_target_sleep_total = 0;
+        recovery_target_sleep_cur = 0;
+        recovery_target_sleep_count = 0;
+    }
+    recovery_target_sleep_total -= recovery_target_sleep_items[recovery_target_sleep_cur];
+    recovery_target_sleep_items[recovery_target_sleep_cur] = sleep_us;
+    recovery_target_sleep_cur = (recovery_target_sleep_cur+1) % recovery_tune_agg_interval;
+    recovery_target_sleep_total += sleep_us;
+    if (recovery_target_sleep_count < recovery_tune_agg_interval)
+        recovery_target_sleep_count++;
+    recovery_target_sleep_us = recovery_target_sleep_total / recovery_target_sleep_count;
+    if (log_level > 1)
+    {
        printf(
-            "recovery tune: client util %.2f (ewma %.2f), target util %.2f -> queue %ld, lat %lu us, real %lu us, pause %lu us\n",
-            client_util, rtune_client_util, rtune_target_util, recovery_target_queue_depth, target_lat, rtune_avg_lat, recovery_target_sleep_us
+            "[OSD %lu] auto-tune: client util: %.2f, recovery util: %.2f, lat: %lu us -> target util %.2f, delay %lu us\n",
+            osd_num, rtune_client_util, total_recovery_usec/1000000.0/recovery_tune_interval,
+            rtune_avg_lat, rtune_target_util, recovery_target_sleep_us
        );
    }
 }
@ -431,7 +455,7 @@ void osd_t::tune_recovery()
 // Just trigger write requests for degraded objects. They'll be recovered during writing
 bool osd_t::continue_recovery()
 {
-    while (recovery_ops.size() < recovery_target_queue_depth)
+    while (recovery_ops.size() < recovery_queue_depth)
    {
        osd_recovery_op_t op;
        if (pick_next_recovery(op))
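Review note on the new `tune_recovery()`: the EWMA is replaced by a plain mean over the last `recovery_tune_agg_interval` samples, and the per-interval delay is derived from `target_lat = avg_lat * (avg_lat * iops) / target_util`. The sketch below restates both pieces in isolation. `calc_sleep_us` and `sleep_window_t` are hypothetical names; the sketch assumes a window size greater than zero and a positive target utilisation, and it omits the `recovery_tune_sleep_cutoff_us` early return.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// avg_lat_us: mean recovery subop latency over the interval,
// ops: recovery subops completed during interval_sec seconds,
// util: target recovery utilisation (must be > 0).
uint64_t calc_sleep_us(uint64_t avg_lat_us, uint64_t ops, uint64_t interval_sec,
    double util, uint64_t sleep_min_us)
{
    // current utilisation = latency (in seconds) * iops
    uint64_t target_lat = avg_lat_us * (avg_lat_us/1000000.0) * ops/interval_sec / util;
    // delays shorter than sleep_min_us are not worth applying
    return target_lat > avg_lat_us+sleep_min_us ? target_lat-avg_lat_us : 0;
}

// Ring buffer holding the last `window` delay samples and their running sum.
struct sleep_window_t
{
    std::vector<uint64_t> items;
    uint64_t total = 0;
    size_t cur = 0, count = 0;

    explicit sleep_window_t(size_t window): items(window, 0) {}

    // Replaces the oldest sample and returns the mean of the stored samples.
    uint64_t add(uint64_t sleep_us)
    {
        total -= items[cur];
        items[cur] = sleep_us;
        cur = (cur+1) % items.size();
        total += sleep_us;
        if (count < items.size())
            count++;
        return total / count;
    }
};
```

Feeding each interval's `calc_sleep_us()` result into `add()` gives the smoothed value that the diff stores in `recovery_target_sleep_us`.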
--- a/src/osd_ops.h
+++ b/src/osd_ops.h
@ -34,6 +34,7 @@
 #define OSD_OP_MAX                  18
 #define OSD_RW_MAX                  64*1024*1024
 #define OSD_PROTOCOL_VERSION        1
+#define OSD_OP_RECOVERY_RELATED     (uint32_t)1
 // Memory alignment for direct I/O (usually 512 bytes)
 #ifndef DIRECT_IO_ALIGNMENT
@ -88,7 +89,8 @@ struct __attribute__((__packed__)) osd_op_sec_rw_t
    uint32_t len;
    // bitmap/attribute length - bitmap comes after header, but before data
    uint32_t attr_len;
-    uint32_t pad0;
+    // the only possible flag is OSD_OP_RECOVERY_RELATED
+    uint32_t flags;
 };
 struct __attribute__((__packed__)) osd_reply_sec_rw_t
@ -109,6 +111,9 @@ struct __attribute__((__packed__)) osd_op_sec_del_t
    object_id oid;
    // delete version (automatic or specific)
    uint64_t version;
+    // the only possible flag is OSD_OP_RECOVERY_RELATED
+    uint32_t flags;
+    uint32_t pad0;
 };
 struct __attribute__((__packed__)) osd_reply_sec_del_t
@ -121,6 +126,9 @@ struct __attribute__((__packed__)) osd_reply_sec_del_t
 struct __attribute__((__packed__)) osd_op_sec_sync_t
 {
    osd_op_header_t header;
+    // the only possible flag is OSD_OP_RECOVERY_RELATED
+    uint32_t flags;
+    uint32_t pad0;
 };
 struct __attribute__((__packed__)) osd_reply_sec_sync_t
@ -134,6 +142,9 @@ struct __attribute__((__packed__)) osd_op_sec_stab_t
    osd_op_header_t header;
    // obj_ver_id array length in bytes
    uint64_t len;
+    // the only possible flag is OSD_OP_RECOVERY_RELATED
+    uint32_t flags;
+    uint32_t pad0;
 };
 typedef osd_op_sec_stab_t osd_op_sec_rollback_t;
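Review note on `osd_op_sec_rw_t`: reusing the former `pad0` word for `flags` leaves the packed layout unchanged, which a `static_assert` can confirm. The sketch below uses reduced, hypothetical stand-in structs (the real ones begin with `osd_op_header_t` and carry more fields) and a hypothetical free function; it also assumes older peers send the padding word as zero, in which case they would simply read as "not recovery related".

```cpp
#include <cassert>
#include <cstdint>

#define OSD_OP_RECOVERY_RELATED (uint32_t)1

// Tail of the request before the change (illustrative subset of fields).
struct __attribute__((__packed__)) demo_sec_rw_old_t
{
    uint32_t len;
    uint32_t attr_len;
    uint32_t pad0;
};

// Tail of the request after the change: pad0 becomes flags.
struct __attribute__((__packed__)) demo_sec_rw_new_t
{
    uint32_t len;
    uint32_t attr_len;
    uint32_t flags;
};

// Reusing the padding word must not change the struct size.
static_assert(sizeof(demo_sec_rw_old_t) == sizeof(demo_sec_rw_new_t), "layout must not change");

// Hypothetical free-function form of the flag test.
bool is_recovery_related(const demo_sec_rw_new_t & req)
{
    return (req.flags & OSD_OP_RECOVERY_RELATED) != 0;
}
```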
--- a/src/osd_peering.cpp
+++ b/src/osd_peering.cpp
@ -222,6 +222,9 @@ void osd_t::start_pg_peering(pg_t & pg)
    }
    if (pg.pg_cursize < pg.pg_minsize)
    {
+        // FIXME: Incomplete EC PGs may currently easily lead to write hangs ("slow ops" in OSD logs)
+        // because such PGs don't flush unstable entries on secondary OSDs so they can't remove these
+        // entries from their journals...
        pg.state = PG_INCOMPLETE;
        report_pg_state(pg);
        return;
--- a/src/osd_primary.cpp
+++ b/src/osd_primary.cpp
@ -706,6 +706,26 @@ resume_5:
        remove_object_from_state(op_data->oid, &op_data->object_state, pg);
        deref_object_state(pg, &op_data->object_state, true);
    }
+    // Mark PG and OSDs as dirty
+    for (auto & chunk: (op_data->object_state ? op_data->object_state->osd_set : pg.cur_loc_set))
+    {
+        this->dirty_osds.insert(chunk.osd_num);
+    }
+    for (auto cl_it = msgr.clients.find(cur_op->peer_fd); cl_it != msgr.clients.end(); )
+    {
+        cl_it->second->dirty_pgs.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
+        break;
+    }
+    dirty_pgs.insert({ .pool_id = pg.pool_id, .pg_num = pg.pg_num });
+    if (immediate_commit == IMMEDIATE_NONE)
+    {
+        unstable_write_count++;
+        if (unstable_write_count >= autosync_writes)
+        {
+            unstable_write_count = 0;
+            autosync();
+        }
+    }
    pg.total_count--;
    cur_op->reply.hdr.retval = 0;
 continue_others:
--- a/src/osd_primary_subops.cpp
+++ b/src/osd_primary_subops.cpp
@ -221,6 +221,7 @@ int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t o
                    .offset = wr ? si->write_start : si->read_start,
                    .len = subop_len,
                    .attr_len = wr ? clean_entry_bitmap_size : 0,
+                    .flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
                };
 #ifdef OSD_DEBUG
                printf(
@ -300,7 +301,8 @@ void osd_t::handle_primary_bs_subop(osd_op_t *subop)
            " retval = "+std::to_string(bs_op->retval)+")"
        );
    }
-    add_bs_subop_stats(subop);
+    bool recovery_related = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB;
+    add_bs_subop_stats(subop, recovery_related);
    subop->req.hdr.opcode = bs_op_to_osd_op[bs_op->opcode];
    subop->reply.hdr.retval = bs_op->retval;
    if (bs_op->opcode == BS_OP_READ || bs_op->opcode == BS_OP_WRITE || bs_op->opcode == BS_OP_WRITE_STABLE)
@ -312,30 +314,33 @@ void osd_t::handle_primary_bs_subop(osd_op_t *subop)
    }
    delete bs_op;
    subop->bs_op = NULL;
-    subop->peer_fd = -1;
+    subop->peer_fd = SELF_FD;
    if (recovery_related && recovery_target_sleep_us)
    {
        tfd->set_timer_us(recovery_target_sleep_us, false, [=](int timer_id)
        {
            handle_primary_subop(subop, cur_op);
        });
    }
    else
    {
        handle_primary_subop(subop, cur_op);
    }
 }
-void osd_t::add_bs_subop_stats(osd_op_t *subop)
+void osd_t::add_bs_subop_stats(osd_op_t *subop, bool recovery_related)
 {
    // Include local blockstore ops in statistics
    uint64_t opcode = bs_op_to_osd_op[subop->bs_op->opcode];
    timespec tv_end;
    clock_gettime(CLOCK_REALTIME, &tv_end);
-    msgr.stats.op_stat_count[opcode]++;
-    if (!msgr.stats.op_stat_count[opcode])
+    uint64_t len = (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE)
+        ? subop->bs_op->len : 0;
+    msgr.inc_op_stats(msgr.stats, opcode, subop->tv_begin, tv_end, len);
+    if (recovery_related)
    {
-        msgr.stats.op_stat_count[opcode] = 1;
-        msgr.stats.op_stat_sum[opcode] = 0;
-        msgr.stats.op_stat_bytes[opcode] = 0;
+        // It is OSD_OP_RECOVERY_RELATED
+        msgr.inc_op_stats(msgr.recovery_stats, opcode, subop->tv_begin, tv_end, len);
    }
-    msgr.stats.op_stat_sum[opcode] += (
-        (tv_end.tv_sec - subop->tv_begin.tv_sec)*1000000 +
-        (tv_end.tv_nsec - subop->tv_begin.tv_nsec)/1000
-    );
-    if (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE)
-    {
-        msgr.stats.op_stat_bytes[opcode] += subop->bs_op->len;
-    }
 }
@ -558,6 +563,7 @@ void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_
                },
                .oid = chunk.oid,
                .version = chunk.version,
+                .flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
            } };
            subops[i].callback = [cur_op, this](osd_op_t *subop)
            {
@ -615,6 +621,7 @@ int osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
                    .id = msgr.next_subop_id++,
                    .opcode = OSD_OP_SEC_SYNC,
                },
+                .flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
            } };
            subops[i].callback = [cur_op, this](osd_op_t *subop)
            {
@ -674,6 +681,7 @@ void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
                    .opcode = OSD_OP_SEC_STABILIZE,
                },
                .len = (uint64_t)(stab_osd.len * sizeof(obj_ver_id)),
+                .flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
            } };
            subops[i].iov.push_back(op_data->unstable_writes + stab_osd.start, stab_osd.len * sizeof(obj_ver_id));
            subops[i].callback = [cur_op, this](osd_op_t *subop)
--- a/src/osd_primary_write.cpp
+++ b/src/osd_primary_write.cpp
@ -296,7 +296,6 @@ resume_7:
            if (!recovery_stat[recovery_type].count) // wrapped
            {
                memset(&recovery_print_prev[recovery_type], 0, sizeof(recovery_print_prev[recovery_type]));
-                memset(&rtune_prev_recovery[recovery_type], 0, sizeof(rtune_prev_recovery[recovery_type]));
                memset(&recovery_stat[recovery_type], 0, sizeof(recovery_stat[recovery_type]));
                recovery_stat[recovery_type].count++;
            }
--- a/src/osd_rmw.cpp
+++ b/src/osd_rmw.cpp
@ -861,15 +861,15 @@ static void calc_rmw_parity_copy_mod(osd_rmw_stripe_t *stripes, int pg_size, int
 static void calc_rmw_parity_copy_parity(osd_rmw_stripe_t *stripes, int pg_size, int pg_minsize,
    uint64_t *read_osd_set, uint64_t *write_osd_set, uint32_t chunk_size, uint32_t start, uint32_t end)
 {
-    if (write_osd_set != read_osd_set)
+    if (write_osd_set != read_osd_set && end != 0)
    {
        for (int role = pg_minsize; role < pg_size; role++)
        {
-            if (write_osd_set[role] != read_osd_set[role] && (start != 0 || end != chunk_size))
+            if (write_osd_set[role] != read_osd_set[role] && write_osd_set[role] != 0 && (start != 0 || end != chunk_size))
            {
                // Copy new parity into the read buffer to write it back
                memcpy(
-                    (uint8_t*)stripes[role].read_buf + start,
+                    (uint8_t*)stripes[role].read_buf + start - stripes[role].read_start,
                    stripes[role].write_buf,
                    end - start
                );
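Review note on the `calc_rmw_parity_copy_parity()` fix: `read_buf` holds only the chunk bytes from `read_start` to `read_end`, so a chunk offset has to be rebased by `read_start` before it is used as a buffer offset. A minimal sketch of that rebasing follows; `demo_stripe_t` and `copy_into_read_buf` are hypothetical and much smaller than `osd_rmw_stripe_t`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical reduced stripe: read_buf holds chunk bytes [read_start, read_end).
struct demo_stripe_t
{
    uint32_t read_start, read_end;
    uint8_t *read_buf;
};

// Copies new data for chunk range [start, end) into the partial read buffer.
void copy_into_read_buf(demo_stripe_t & s, const uint8_t *src, uint32_t start, uint32_t end)
{
    assert(start >= s.read_start && end <= s.read_end);
    // start is a chunk offset; subtract read_start to get the buffer offset
    memcpy(s.read_buf + start - s.read_start, src, end - start);
}
```

Without the subtraction, a stripe read from offset 120 KB (as in `test_recover_22`) would be written 120 KB past the start of a 4 KB buffer.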
--- a/src/osd_rmw_test.cpp
+++ b/src/osd_rmw_test.cpp
@ -30,6 +30,7 @@ void test16();
 void test_recover_22_d2();
 void test_ec43_error_bruteforce();
 void test_recover_53_d5();
+void test_recover_22();
 int main(int narg, char *args[])
 {
@ -70,6 +71,8 @@ int main(int narg, char *args[])
    test_ec43_error_bruteforce();
    // Test 19
    test_recover_53_d5();
+    // Test 20
+    test_recover_22();
    // End
    printf("all ok\n");
    return 0;
@ -1244,3 +1247,99 @@ void test_recover_53_d5()
    // Done
    use_ec(8, 5, false);
 }
 void test_recover_22()
 {
    const int bmp = 128*1024 / 4096 / 8;
    use_ec(4, 2, true);
    osd_num_t osd_set[4] = { 1, 2, 3, 4 };
    osd_num_t write_osd_set[4] = { 5, 0, 3, 0 };
    osd_rmw_stripe_t stripes[4] = {};
    unsigned bitmaps[4] = { 0 };
    // split
    void *write_buf = (uint8_t*)malloc_or_die(4096);
    set_pattern(write_buf, 4096, PATTERN0);
    split_stripes(2, 128*1024, 120*1024, 4096, stripes);
    assert(stripes[0].req_start == 120*1024 && stripes[0].req_end == 124*1024);
    assert(stripes[1].req_start == 0 && stripes[1].req_end == 0);
    assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
    assert(stripes[3].req_start == 0 && stripes[3].req_end == 0);
    // calc_rmw
    void *rmw_buf = calc_rmw(write_buf, stripes, osd_set, 4, 2, 2, write_osd_set, 128*1024, bmp);
    for (int i = 0; i < 4; i++)
        stripes[i].bmp_buf = bitmaps+i;
    assert(rmw_buf);
    assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
    assert(stripes[1].read_start == 120*1024 && stripes[1].read_end == 124*1024);
    assert(stripes[2].read_start == 0 && stripes[2].read_end == 0);
    assert(stripes[3].read_start == 0 && stripes[3].read_end == 0);
    assert(stripes[0].write_start == 120*1024 && stripes[0].write_end == 124*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 0);
    assert(stripes[2].write_start == 120*1024 && stripes[2].write_end == 124*1024);
    assert(stripes[3].write_start == 0 && stripes[3].write_end == 0);
    assert(stripes[0].read_buf == (uint8_t*)rmw_buf+4*1024);
    assert(stripes[1].read_buf == (uint8_t*)rmw_buf+132*1024);
    assert(stripes[2].read_buf == NULL);
    assert(stripes[3].read_buf == NULL);
    assert(stripes[0].write_buf == write_buf);
    assert(stripes[1].write_buf == NULL);
    assert(stripes[2].write_buf == (uint8_t*)rmw_buf);
    assert(stripes[3].write_buf == NULL);
    // encode
    set_pattern(stripes[0].read_buf, 128*1024, PATTERN1);
    set_pattern(stripes[1].read_buf, 4*1024, PATTERN2);
    memset(stripes[0].bmp_buf, 0xff, bmp);
    memset(stripes[1].bmp_buf, 0xff, bmp);
    calc_rmw_parity_ec(stripes, 4, 2, osd_set, write_osd_set, 128*1024, bmp);
    assert(*(uint32_t*)stripes[2].bmp_buf == 0);
    assert(stripes[0].write_start == 0 && stripes[0].write_end == 128*1024);
    assert(stripes[1].write_start == 0 && stripes[1].write_end == 0);
    assert(stripes[2].write_start == 120*1024 && stripes[2].write_end == 124*1024);
    assert(stripes[3].write_start == 0 && stripes[3].write_end == 0);
    assert(stripes[0].write_buf == stripes[0].read_buf);
    assert(stripes[1].write_buf == NULL);
    assert(stripes[2].write_buf == (uint8_t*)rmw_buf);
    assert(stripes[3].write_buf == NULL);
    check_pattern(stripes[2].write_buf, 4*1024, PATTERN0^PATTERN2);
    // decode and verify
    memset(stripes, 0, sizeof(stripes));
    split_stripes(2, 128*1024, 0, 256*1024, stripes);
    assert(stripes[0].req_start == 0 && stripes[0].req_end == 128*1024);
    assert(stripes[1].req_start == 0 && stripes[1].req_end == 128*1024);
    assert(stripes[2].req_start == 0 && stripes[2].req_end == 0);
    assert(stripes[3].req_start == 0 && stripes[3].req_end == 0);
    for (int role = 0; role < 4; role++)
    {
        stripes[role].read_start = stripes[role].req_start;
        stripes[role].read_end = stripes[role].req_end;
    }
    assert(extend_missing_stripes(stripes, write_osd_set, 2, 4) == 0);
    assert(stripes[0].read_start == 0 && stripes[0].read_end == 128*1024);
    assert(stripes[1].read_start == 0 && stripes[1].read_end == 128*1024);
    assert(stripes[2].read_start == 0 && stripes[2].read_end == 128*1024);
    assert(stripes[3].read_start == 0 && stripes[3].read_end == 0);
    void *read_buf = alloc_read_buffer(stripes, 4, 0);
    for (int i = 0; i < 4; i++)
        stripes[i].bmp_buf = bitmaps+i;
    assert(read_buf);
    assert(stripes[0].read_buf == read_buf);
    assert(stripes[1].read_buf == (uint8_t*)read_buf+128*1024);
    assert(stripes[2].read_buf == (uint8_t*)read_buf+2*128*1024);
    set_pattern(stripes[0].read_buf, 128*1024, PATTERN1);
    set_pattern(stripes[0].read_buf+120*1024, 4*1024, PATTERN0);
    set_pattern(stripes[2].read_buf, 128*1024, PATTERN1^PATTERN2);
    set_pattern(stripes[2].read_buf+120*1024, 4*1024, PATTERN0^PATTERN2);
    memset(stripes[0].bmp_buf, 0xff, bmp);
    memset(stripes[2].bmp_buf, 0, bmp);
    bitmaps[1] = 0;
    bitmaps[3] = 0;
    reconstruct_stripes_ec(stripes, 4, 2, bmp);
    assert(bitmaps[0] == 0xFFFFFFFF);
    assert(*(uint32_t*)stripes[1].bmp_buf == 0xFFFFFFFF);
    check_pattern(stripes[1].read_buf, 128*1024, PATTERN2);
    free(read_buf);
    // Done
    free(rmw_buf);
    free(write_buf);
    use_ec(4, 2, false);
 }
--- a/src/osd_secondary.cpp
+++ b/src/osd_secondary.cpp
@ -42,10 +42,44 @@ void osd_t::secondary_op_callback(osd_op_t *op)
    int retval = op->bs_op->retval;
    delete op->bs_op;
    op->bs_op = NULL;
    if (op->is_recovery_related() && recovery_target_sleep_us &&
        op->req.hdr.opcode == OSD_OP_SEC_STABILIZE)
    {
        // Apply pause AFTER commit. Do not apply pause to SYNC at all
        if (!op->tv_end.tv_sec)
        {
            clock_gettime(CLOCK_REALTIME, &op->tv_end);
        }
        tfd->set_timer_us(recovery_target_sleep_us, false, [this, op, retval](int timer_id)
        {
            finish_op(op, retval);
        });
    }
    else
    {
        finish_op(op, retval);
    }
 }
-void osd_t::exec_secondary(osd_op_t *cur_op)
+void osd_t::exec_secondary(osd_op_t *op)
 {
    if (op->is_recovery_related() && recovery_target_sleep_us &&
        op->req.hdr.opcode != OSD_OP_SEC_STABILIZE && op->req.hdr.opcode != OSD_OP_SEC_SYNC)
    {
        // Apply pause BEFORE write/delete
        tfd->set_timer_us(recovery_target_sleep_us, false, [this, op](int timer_id)
        {
            clock_gettime(CLOCK_REALTIME, &op->tv_begin);
            exec_secondary_real(op);
        });
    }
    else
    {
        exec_secondary_real(op);
    }
 }
 void osd_t::exec_secondary_real(osd_op_t *cur_op)
 {
    if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
    {
--- a/src/timerfd_manager.cpp
+++ b/src/timerfd_manager.cpp
@ -90,6 +90,12 @@ void timerfd_manager_t::clear_timer(int timer_id)
 void timerfd_manager_t::set_nearest()
 {
+    if (onstack > 0)
+    {
+        // Prevent re-entry
+        return;
+    }
+    onstack++;
 again:
    if (!timers.size())
    {
@ -139,6 +145,7 @@ again:
        }
        wait_state = wait_state | 1;
    }
+    onstack--;
 }
 void timerfd_manager_t::handle_readable()
--- a/src/timerfd_manager.h
+++ b/src/timerfd_manager.h
@ -22,6 +22,7 @@ class timerfd_manager_t
    int timerfd;
    int nearest = -1;
    int id = 1;
+    int onstack = 0;
    std::vector<timerfd_timer_t> timers;
    void inc_timer(timerfd_timer_t & t);
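Review note on the `onstack` counter: it makes `set_nearest()` a no-op when it is called again while an outer call is still running, presumably because work triggered from inside it can register new timers. The sketch below shows the same guard on a hypothetical reduced class; `demo_scheduler_t`, `runs` and `on_fire` are illustrative and do not exist in `timerfd_manager_t`.

```cpp
#include <cassert>
#include <functional>

// Hypothetical reduced class using the same guard as timerfd_manager_t::set_nearest().
struct demo_scheduler_t
{
    int onstack = 0;                // non-zero while set_nearest() is executing
    int runs = 0;                   // how many times the body actually ran
    std::function<void()> on_fire;  // stands in for work that may call back in

    void set_nearest()
    {
        if (onstack > 0)
        {
            // Prevent re-entry
            return;
        }
        onstack++;
        runs++;
        if (on_fire)
            on_fire();
        onstack--;
    }
};
```

A callback that calls `set_nearest()` again returns immediately, so the body runs once per outer call.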
--- a/src/vitastor.pc.in
+++ b/src/vitastor.pc.in
@ -6,7 +6,7 @@ includedir=${prefix}/@CMAKE_INSTALL_INCLUDEDIR@
 Name: Vitastor
 Description: Vitastor client library
-Version: 1.3.1
+Version: 1.4.4
 Libs: -L${libdir} -lvitastor_client
 Cflags: -I${includedir}
--- a/tests/run_3osds.sh
+++ b/tests/run_3osds.sh
@ -10,6 +10,7 @@ SCHEME=${SCHEME:-replicated}
 # OFFSET_ARGS
 # PG_SIZE
 # PG_MINSIZE
+# GLOBAL_CONFIG
 if [ "$SCHEME" = "ec" ]; then
    OSD_COUNT=${OSD_COUNT:-5}
@ -19,10 +20,10 @@ fi
 if [ "$IMMEDIATE_COMMIT" != "" ]; then
    NO_SAME="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all --log_level 10 --etcd_stats_interval 5"
-    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"immediate_commit":"all","client_enable_writeback":true}'
+    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"recovery_tune_util_low":1,"immediate_commit":"all","client_enable_writeback":true,"client_max_writeback_iodepth":32'$GLOBAL_CONFIG'}'
 else
    NO_SAME="--journal_sector_buffer_count 1024 --log_level 10 --etcd_stats_interval 5"
-    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"client_enable_writeback":true}'
+    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"recovery_tune_util_low":1,"client_enable_writeback":true,"client_max_writeback_iodepth":32'$GLOBAL_CONFIG'}'
 fi
 start_osd_on()
@ -53,7 +54,7 @@ for i in $(seq 1 $OSD_COUNT); do
    start_osd $i
 done
-(while true; do node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) &>./testdata/mon.log &
+(while true; do set +e; node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" --verbose 1; if [[ $? -ne 2 ]]; then break; fi; done) >>./testdata/mon.log 2>&1 &
 MON_PID=$!
 if [ "$SCHEME" = "ec" ]; then
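Reviewer's sketch (not part of the patch): the new monitor loop in run_3osds.sh restarts the monitor only while it exits with status 2 and stops on any other status. The commit message describes this as restarting mon "only on connection errors", so reading status 2 as the connection-error status is an inference from the loop itself. `fake_mon` below is a hypothetical stand-in for `node mon/mon-main.js`; it fails with 2 twice and then exits cleanly.

```bash
# Hypothetical stand-in for the monitor process (assumption, not Vitastor code):
# exits with status 2 on the first two runs, then with status 0.
attempts=0
fake_mon() {
    attempts=$((attempts + 1))
    if [ "$attempts" -le 2 ]; then
        return 2
    fi
    return 0
}

# Same control flow as the patched loop: keep restarting only while the
# exit status is 2, and leave the loop on any other status.
runs=0
while true; do
    set +e
    fake_mon
    status=$?
    runs=$((runs + 1))
    if [[ $status -ne 2 ]]; then
        break
    fi
done
echo "runs=$runs last_status=$status"
```

With the old `|| true` form the loop would never terminate, whatever the exit status was.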
--- a/tests/run_tests.sh
+++ b/tests/run_tests.sh
@ -45,6 +45,8 @@ IMMEDIATE_COMMIT=1 ./test_rebalance_verify.sh
 SCHEME=ec ./test_rebalance_verify.sh
 SCHEME=ec IMMEDIATE_COMMIT=1 ./test_rebalance_verify.sh
+./test_switch_primary.sh
 ./test_write.sh
 SCHEME=xor ./test_write.sh
--- a/tests/test_add_osd.sh
+++ b/tests/test_add_osd.sh
@ -1,7 +1,7 @@
 #!/bin/bash -ex
 PG_COUNT=2048
-
+GLOBAL_CONFIG=',"osd_out_time":1'
 . `dirname $0`/run_3osds.sh
 LD_PRELOAD="build/src/libfio_vitastor.so" \
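Reviewer's sketch (not part of the patch): run_3osds.sh splices `$GLOBAL_CONFIG` into the JSON literal just before the closing brace, so a test that needs extra global settings sets the variable to a fragment that starts with a comma. The snippet below only builds the string and does not call etcdctl. `build_global_config` is a hypothetical helper name, and the expansion is quoted here although the original leaves it unquoted.

```bash
# Hypothetical helper mirroring the string assembly in run_3osds.sh:
# the single-quoted JSON is closed, $GLOBAL_CONFIG is appended, then '}' follows.
build_global_config() {
    echo '{"recovery_queue_depth":1,"recovery_tune_util_low":1,"client_enable_writeback":true,"client_max_writeback_iodepth":32'"$GLOBAL_CONFIG"'}'
}

# Default case: the variable is empty, so the object closes after the last key.
GLOBAL_CONFIG=''
default_json=$(build_global_config)

# As in test_add_osd.sh: the leading comma keeps the resulting JSON valid.
GLOBAL_CONFIG=',"osd_out_time":1'
with_out_time=$(build_global_config)

echo "$default_json"
echo "$with_out_time"
```

A fragment without the leading comma would produce invalid JSON, which is why each test writes it as `',"osd_out_time":1'`.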
--- a/tests/test_change_pg_count.sh
+++ b/tests/test_change_pg_count.sh
@ -18,6 +18,7 @@ try_change()
    for i in {1..6}; do
        echo --- Change PG count to $n --- >>testdata/osd$i.log
    done
+    echo --- Change PG count to $n --- >>testdata/mon.log
    $ETCDCTL put /vitastor/config/pools '{"1":{'$POOLCFG',"pg_size":'$PG_SIZE',"pg_minsize":'$PG_MINSIZE',"pg_count":'$n'}}'
--- a/tests/test_failure_domain.sh
+++ b/tests/test_failure_domain.sh
@ -15,7 +15,7 @@ $ETCDCTL put /vitastor/osd/stats/7 '{"host":"host4","size":1073741824,"time":"'$
 $ETCDCTL put /vitastor/osd/stats/8 '{"host":"host4","size":1073741824,"time":"'$TIME'"}'
 $ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":4,"failure_domain":"rack"}}'
-node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
+node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" >>./testdata/mon.log 2>&1 &
 MON_PID=$!
 sleep 2
--- a/tests/test_heal.sh
+++ b/tests/test_heal.sh
@ -9,6 +9,7 @@ if [[ "$SCHEME" = "ec" ]]; then
 fi
 OSD_COUNT=${OSD_COUNT:-7}
 PG_COUNT=32
+GLOBAL_CONFIG=',"osd_out_time":1'
 . `dirname $0`/run_3osds.sh
 check_qemu
--- a/tests/test_minsize_1.sh
+++ b/tests/test_minsize_1.sh
@ -2,6 +2,7 @@
 PG_MINSIZE=1
 SCHEME=replicated
+GLOBAL_CONFIG=',"osd_out_time":1'
 . `dirname $0`/run_3osds.sh
--- a/tests/test_move_reappear.sh
+++ b/tests/test_move_reappear.sh
@ -7,7 +7,7 @@ OSD_COUNT=5
 OSD_ARGS="$OSD_ARGS"
 for i in $(seq 1 $OSD_COUNT); do
    dd if=/dev/zero of=./testdata/test_osd$i.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
-    build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 --etcd_stats_interval 5 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 &
+    build/src/vitastor-osd --log_level 10 --osd_num $i --bind_address 127.0.0.1 --etcd_stats_interval 5 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 &
    eval OSD${i}_PID=$!
 done
@ -53,6 +53,11 @@ for i in {1..30}; do
    fi
 done
+# Sync so all moved objects are removed from OSD 1 (they aren't removed without a sync)
+LD_PRELOAD="build/src/libfio_vitastor.so" \
+fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -iodepth=1 -fsync=1 -number_ios=2 -rw=write \
+    -etcd=$ETCD_URL -pool=1 -inode=2 -size=32M -cluster_log_level=10
 $ETCDCTL put /vitastor/config/pgs '{"items":{"1":{"1":{"osd_set":[4,5],"primary":0}}}}'
 $ETCDCTL put /vitastor/pg/history/1/1 '{"all_peers":[1,2,3]}'
--- a/tests/test_parity_change.sh
+++ b/tests/test_parity_change.sh
@ -0,0 +1,54 @@
+#!/bin/bash -ex
+# Test changing EC 4+1 into EC 4+3
+OSD_COUNT=7
+PG_COUNT=16
+SCHEME=ec
+PG_SIZE=5
+PG_DATA_SIZE=4
+PG_MINSIZE=5
+. `dirname $0`/run_3osds.sh
+try_change()
+{
+    n=$1
+    s=$2
+    for i in {1..10}; do
+        ($ETCDCTL get /vitastor/config/pgs --print-value-only |\
+            jq -s -e '(.[0].items["1"] | map(  ([ .osd_set[] | select(. != 0) ] | length) == '$s'  ) | length == '$n')
+                and ([ .[0].items["1"] | map(.osd_set)[][] ] | sort | unique == ["1","2","3","4","5","6","7"])') && \
+            ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == '$n'') && \
+            break
+        sleep 1
+    done
+    if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only |\
+        jq -s -e '(.[0].items["1"] | map(  ([ .osd_set[] | select(. != 0) ] | length) == '$s'  ) | length == '$n')
+            and ([ .[0].items["1"] | map(.osd_set)[][] ] | sort | unique == ["1","2","3","4","5","6","7"])'); then
+        $ETCDCTL get /vitastor/config/pgs
+        $ETCDCTL get --prefix /vitastor/pg/state/
+        format_error "FAILED: PG SIZE NOT CHANGED OR SOME OSDS DO NOT HAVE PGS"
+    fi
+    if ! ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == '$n); then
+        $ETCDCTL get /vitastor/config/pgs
+        $ETCDCTL get --prefix /vitastor/pg/state/
+        format_error "FAILED: PGS NOT UP AFTER PG SIZE CHANGE"
+    fi
+}
+LD_PRELOAD="build/src/libfio_vitastor.so" \
+    fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=1M -direct=1 -iodepth=4 \
+        -rw=write -etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -runtime=10
+PG_SIZE=7
+POOLCFG='"name":"testpool","failure_domain":"osd","scheme":"ec","parity_chunks":'$((PG_SIZE-PG_DATA_SIZE))
+$ETCDCTL put /vitastor/config/pools '{"1":{'$POOLCFG',"pg_size":'$PG_SIZE',"pg_minsize":'$PG_MINSIZE',"pg_count":'$PG_COUNT'}}'
+sleep 2
+try_change 16 7
+format_green OK
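Reviewer's sketch (not part of the patch): `try_change` polls both etcd conditions up to ten times, one second apart, and then evaluates each condition again to choose the error message. The polling part can be sketched as a small helper. `poll_until` and `pgs_active` are hypothetical names, and `pgs_active` stands in for the real etcdctl and jq pipeline so the snippet runs without a cluster.

```bash
# Hypothetical helper: run a check up to $1 times, pausing $2 seconds between
# attempts. Returns 0 as soon as the check succeeds, 1 if it never does.
poll_until() {
    local tries=$1 pause=$2
    shift 2
    local i
    for ((i = 1; i <= tries; i++)); do
        if "$@"; then
            return 0
        fi
        sleep "$pause"
    done
    return 1
}

# Hypothetical stand-in for the etcdctl + jq check: succeeds on the third call.
calls=0
pgs_active() {
    calls=$((calls + 1))
    [ "$calls" -ge 3 ]
}

if poll_until 10 0 pgs_active; then
    result=ok
else
    result=failed
fi
echo "$result after $calls checks"
```

In the test itself the pause would be 1 second, as in the loop above; 0 is used here only to keep the sketch fast.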
--- a/tests/test_scrub.sh
+++ b/tests/test_scrub.sh
@ -20,6 +20,9 @@ LD_PRELOAD="build/src/libfio_vitastor.so" \
    fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=1M -direct=1 -iodepth=4 \
        -mirror_file=./testdata/mirror.bin -end_fsync=1 -rw=write -etcd=$ETCD_URL -image=testimg
+# Save PG primary
+primary=$($ETCDCTL get --print-value-only /vitastor/config/pgs | jq -r '.items["1"]["1"].primary')
 # Intentionally corrupt OSD data and restart it
 zero_osd_pid=OSD${ZERO_OSD}_PID
 kill ${!zero_osd_pid}
@ -34,6 +37,9 @@ start_osd $ZERO_OSD
 # Wait until start
 wait_up 10
+# Wait until PG is back on the same primary
+wait_condition 10 "$ETCDCTL"$' get --print-value-only /vitastor/config/pgs | jq -s -e \'.[0].items["1"]["1"].primary == "'$primary'"'"'"
 # Trigger scrub
 $ETCDCTL put /vitastor/pg/history/1/1 `$ETCDCTL get --print-value-only /vitastor/pg/history/1/1 | jq -s -c '(.[0] // {}) + {"next_scrub":1}'`
Author	SHA1	Message	Date
antilles	7fea69ff5f	Merge pull request 'master' (#3 ) from vitalif/vitastor:master into master Reviewed-on: #3	2024-02-13 14:44:09 +03:00
Vitaliy Filippov	c777a0041a	Release 1.4.4 A couple of fixes for EC pools - Fix a segfault possible on partial EC overwrite in 1234 -> 5030 rebalance scenario - Fix two problems leading to EC pools stalling on rebalance & parallel sudden stops of OSDs, for example during a sudden poweroff of a host: - Recovery auto-tuning (1.4.0 feature) could apply too large delays and stall the EC journal - fixed by limiting delays with a new recovery_tune_sleep_cutoff_us parameter (10 seconds by default) and applying recovery pauses before write operations, not after them, to not occupy space in the journal for long time - Dynamic journal space reservation (1.3.0 feature) wasn't accounting new writes when checking the limit so OSDs could still fill the journal fully and stall - fixed by including new writes into the limit - Print etcd dbSize instead of dbSizeInUse in status	2024-02-11 16:23:08 +03:00
Vitaliy Filippov	2947ea93e8	Raise test_snapshot_chain_ec timeout to 6 minutes	2024-02-11 16:13:52 +03:00
Vitaliy Filippov	978bdc128a	Apply recovery pause before writes, after commits, and do not apply it to syncs to not block EC pools from functioning	2024-02-11 16:13:52 +03:00
Vitaliy Filippov	bb2f395f1e	Add cutoff threshold for recovery auto-tuning	2024-02-11 16:13:52 +03:00
Vitaliy Filippov	b127da40f7	Add a FIXME about incomplete PGs	2024-02-11 13:42:51 +03:00
Vitaliy Filippov	ca34a6047a	Fix dynamic journal space reservation: include the new write itself, too	2024-02-11 13:42:51 +03:00
Vitaliy Filippov	38ba76e893	Fix flusher sometimes being unable to trim journal when the flush queue is empty	2024-02-11 13:42:51 +03:00
Vitaliy Filippov	1e3c4edea0	Print etcd dbSize instead of dbSizeInUse in status	2024-02-11 13:42:51 +03:00
Vitaliy Filippov	e7ac855b07	Fix that EC segfault (1234 -> 5030 partial overwrite)	2024-02-11 13:42:51 +03:00
Vitaliy Filippov	c53357ac45	Add a test for EC segfault with partial overwrite in 1234 -> 5030 rebalance scenario	2024-02-11 13:42:51 +03:00
Vitaliy Filippov	27e9f244ec	Release 1.4.3 Hotfix for hotfix O:-) - "Write stall fix" was incomplete and EC write stalls could continue even on 1.4.2. Now they're finally fixed O:-) - Make monitor ignore statistics of stopped OSDs. Previously if you stopped all OSDs the last total I/O numbers would remain the same indefinitely	2024-02-09 00:29:31 +03:00
Vitaliy Filippov	8e25a28a08	Ignore down OSDs in monitor statistics aggregation	2024-02-09 00:22:36 +03:00
Vitaliy Filippov	5d3317e4f2	Followup to 1.4.2 write stall fix - sadly, the previous version was not working correctly :)	2024-02-08 19:34:29 +03:00
Vitaliy Filippov	016115c0d4	Release 1.4.2 - Log to systemd by default - Fix excessive autosyncs after every operation with disabled immediate_commit (introduced in 1.1.0) - Fix a possible write stall with EC due to the lack of OSD wakeup after stabilizing previous writes - Change sync operation semantics as a final fix to possible write stalls with EC and disabled immediate_commit - Sync after deleting data in CLI rm / rm-data if immediate_commit is disabled - Fix OSDs ignoring syncs & autosyncs for delete operations - Fix OSD space reporting sometimes adding garbage zeros for deleted inodes (causing extra pool/stats etcd keys for deleted pools) - Speed up monitor failover - change default etcd_mon_ttl from 30 to 5 seconds - Speed up operation retries - change default up_wait_retry_interval to 50 ms - Add patch for libvirt 9.10	2024-02-04 02:23:49 +03:00
Vitaliy Filippov	e026de95d5	Log to systemd by default	2024-02-04 01:21:31 +03:00
Vitaliy Filippov	77c10fd1f8	In fact, do not autosync blockstore when autosync_writes=0	2024-02-03 20:37:36 +03:00
Vitaliy Filippov	581d02e581	Mark secondary OSDs with deletions as dirty to not forget to sync & autosync them	2024-02-03 20:31:08 +03:00
Vitaliy Filippov	f03a9db4d9	Fix OSD space reporting sometimes adding garbage zeros for deleted inodes (causing extra pool/stats etcd keys for deleted pools)	2024-02-03 20:31:08 +03:00
Vitaliy Filippov	cb9c30bc31	Sync after sending all deletes to each PG in cli rm-data	2024-02-03 20:31:08 +03:00
Vitaliy Filippov	a86a380d20	Fix invalid parsing of autosync_writes in blockstore leading to autosyncs after every operation with disabled immediate_commit :D	2024-02-03 20:31:08 +03:00
Vitaliy Filippov	d2b43cb118	Change default etcd_mon_ttl	2024-01-29 23:45:19 +03:00
Vitaliy Filippov	cc76e6876b	Fix flapping "scrub" test	2024-01-28 14:59:33 +03:00
Vitaliy Filippov	1cec62d25d	Sync only completed writes Should be a final remaining fix to EC + non-capacitor (non-immediate-commit) write hangs :). First it was breaking non-EC ("instantly stable") writes because they sometimes complete out of order which was leading to the following error: terminate called after throwing an instance of 'std::runtime_error' what(): BUG: Unexpected dirty_entry 1000000000001:29480000 v65540 unstable state during flush: 0x151 But it is easily fixed by scanning previous and next dirty_entries in mark_stable.	2024-01-27 15:17:22 +03:00
Vitaliy Filippov	1c322b33ed	Change default up_wait_retry_interval to 50 ms	2024-01-26 01:51:08 +03:00
Vitaliy Filippov	d27524f441	Add patch for libvirt 9.10	2024-01-25 01:09:12 +03:00
Vitaliy Filippov	ba55f91409	Release 1.4.1 - Fix a monitor crash on primary OSD switching introduced in 1.4.0 - Fix "partly outside array bounds" warnings for GCC 12 in cpp-btree - Fix a realloc memory leak in theory possible with too large listings (OSD_OP_LIST)	2024-01-18 02:31:42 +03:00
Vitaliy Filippov	80aac39513	Add detailed formula for theoretical EC N+K random write performance	2024-01-18 00:36:32 +03:00
Vitaliy Filippov	2aa5aa7ab6	Add a test for simple master switching without PG reconfiguration Also use osd_out_time:1 only in select tests and restart mon in tests only on connection errors	2024-01-17 00:19:01 +03:00
Vitaliy Filippov	3ca3b8a8d8	Fix recheck_pgs bug introduced in 1.4.0	2024-01-16 23:49:21 +03:00
Vitaliy Filippov	2cf649eba6	Fix "partly outside array bounds" warnings for GCC 12 in cpp-btree	2024-01-15 03:04:33 +03:00
Vitaliy Filippov	5935640a4a	Add CLA PR form	2024-01-14 16:48:24 +03:00
Vitaliy Filippov	d00d4dbac0	Initialize mod_revision field in etcd_state_client	2024-01-13 01:30:28 +03:00
Vitaliy Filippov	5d9d6f32a0	Fix common realloc memory leak mistakes found by cppcheck	2024-01-13 01:30:28 +03:00
antilles	1e39b80f31	Merge pull request 'master' (#2 ) from vitalif/vitastor:master into master Reviewed-on: #2	2024-01-12 15:04:03 +03:00
Vitaliy Filippov	5280d1d561	Release 1.4.0 New features: - Intelligent recovery/rebalance speed auto-tuning to reduce its impact on clients (see README -> Features) - Auto-restoration of dead VDUSE daemons in CSI plugin - Add vitastor-disk update-sb command - Update QEMU for Debian Bookworm to 8.1 and use it for CSI plugin Bug fixes: - Fix pools SOMETIMES staying inactive after stopping a node due to OSDs not reacting to PG state changes caused by incorrect full reload of state from etcd on reconnection - Make monitors retry pool configuration changes quickier which fixes them being unable to apply changes when an ongoing rebalance is quickly making a lot of PGs clean - Fix CSI plugin not accepting array of strings as etcd address in /etc/vitastor/vitastor.conf - Allow multiple interfaces with the same IP address, for "simple routed" full mesh network - Do not ignore loopback addresses for OSD network (to make ECMP setups with frr possible) - Fix a rare client crash during OSD reconnections - Only treat data partitions as existing OSDs in vitastor-disk prepare - Remove etcd parameter from default command examples - Fix reported free space sometimes changing non-immediately after deletion of data from OSDs - Fix a possible OSD crash on print_slow when bs_op is NULL - Use the same etcd_ws_keepalive_interval in mon as in OSD - Fix mon not using values from config when /config/global is not present - Remove pve-storage-portal-dns-list format for vitastor_etcd_address - Parse log_level in cluster_client - Fix vitastor-nbd image existence check not working because of non-zeroed inode_watch fields - Do not warn on EPIPE in client unless log_level is raised explicitly - Fix incorrect error in CSI when searching for the device in /sys - Remove 2 last prints to stdout in etcd_state_client - Fix a possible OSD crash when checking corrupted journal entries	2024-01-12 01:28:33 +03:00
Vitaliy Filippov	317b0feb0a	Add a note about VDUSE daemon auto-restart	2024-01-12 01:27:36 +03:00
Vitaliy Filippov	247f0552db	Fix debug log "killing..." in CSI	2024-01-10 01:19:34 +03:00
antilles	f94f76ca89	Merge pull request 'Pull fresh master from base' (#1 ) from vitalif/vitastor:master into master Reviewed-on: #1	2024-01-09 13:25:13 +03:00
Vitaliy Filippov	2f228fa96a	Only treat data partitions as existing OSDs in vitastor-disk prepare	2023-12-31 11:46:47 +03:00
Vitaliy Filippov	2f6b9c0306	Remove etcd parameter from default command examples	2023-12-31 02:50:41 +03:00
Vitaliy Filippov	48b5f871e0	Add Contributor License Agreement in Russian and English	2023-12-31 01:23:52 +03:00
Vitaliy Filippov	c17f76a3e4	Add documentation for recovery auto-tuning	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	a6ab54b1ba	Do not allow negative util_low/high	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	99ee8596ea	Rename min/max_util to util_low/high	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	c4928e6ecd	Protect from try_send completing the operation immediately Fixes a possible use-after-free in case of continue_ops() calling try_send(), then connect_peer() -> set_timer() -> trigger_nearest() -> handle_op_part() -> continue_ops() again	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	ec7dcd1be5	Do not apply very large recovery pauses during tests	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	e600bbc151	Fix flapping move_reappear test by adding an fsync before stopping PG	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	8b8c1179a7	Use a separate used_blocks counter for free space stats to hide possibly delayed on-flush deallocation	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	d5a6fa6dd7	Fix possible crash on print_slow when bs_op is NULL	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	f757a35a8d	Retry PG changes without re-running lpsolve when pool configuration and OSD tree don't change OSDs often change their /pg/history keys during rebalance, so monitor receives additional transaction failures from etcd if it re-runs lpsolve which sometimes may even lead to monitor being unable to apply PG changes at all until rebalance completes	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	1edf86ed26	Aggregate recovery delay using simple mean over last 10 observations (EWMA is shit)	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	5ca7cde612	Experiment/WIP: Try to track "secondary" recovery ops separately	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	751935ddd8	WIP Auto-tune recovery speed	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	d84dee7098	Track recovery op latencies + refactor into a structure	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	dcc76eee15	Add a parity chunk count change test script	2023-12-26 23:48:41 +03:00
Vitaliy Filippov	2f38adeb3d	Restart dead VDUSE daemons at regular intervals	2023-12-24 12:58:50 +03:00
Vitaliy Filippov	f72f14e6a7	Clear old PG states, history, and OSD states on etcd state reload Also add protection from etcd watcher messages being split into multiple websocket messages - I'm not sure if etcd actually does that, but it's better to have extra protection anyway. Also check that all etcd watchers are started in the keepalive routine, otherwise it sometimes tries to revive etcd watchers starting with revision=1 which obviously always fails because this revision is nearly always compacted. All these changes should fix an old rarely reproduced bug where SOMETIMES OSDs didn't react to PG config changes which was leading to offline pools on node reboot. It happened on the full reload of state from etcd.	2023-12-24 02:02:13 +03:00
Vitaliy Filippov	1299373988	Use the same etcd_ws_keepalive_interval in OSD and mon	2023-12-23 20:07:29 +03:00
Vitaliy Filippov	178bb0e701	Prevent re-entry into timerfd set_nearest	2023-12-22 02:32:40 +03:00
Vitaliy Filippov	4ece4dfdd0	Fix mon not using values from config when /config/global is not present	2023-12-22 02:25:09 +03:00
Vitaliy Filippov	95631773b6	Remove pve-storage-portal-dns-list format for vitastor_etcd_address	2023-12-20 02:22:06 +03:00
Vitaliy Filippov	7239cfb91a	Parse log_level in cluster_client	2023-12-20 02:21:23 +03:00
Vitaliy Filippov	7cea642f4a	Fix vitastor-nbd image existence check not working because of non-zeroed inode_watch fields	2023-12-19 01:11:37 +03:00
Vitaliy Filippov	dc615403d9	Do not warn on EPIPE in client unless log_level is raised explicitly	2023-12-17 13:42:26 +03:00
Vitaliy Filippov	1a704e06ab	Allow multiple interfaces with the same IP address, for "simple routed" full mesh network	2023-12-17 13:25:56 +03:00
Vitaliy Filippov	575475de71	Do not ignore loopback addresses for OSD network (to make ECMP setups with frr possible)	2023-12-17 11:55:13 +03:00
@ -1 +1 @@
-Subproject commit 45e6d1f13196a0824e2089a586c53b9de0283f17
+Subproject commit 8de8b467acbca50cfd8835c20e0e379110f3b32b
@ -1,4 +1,4 @@
-VERSION ?= v1.3.1
+VERSION ?= v1.4.4

 all: build push
@ -1,4 +1,4 @@
-vitastor (1.3.1-1) unstable; urgency=medium
+vitastor (1.4.4-1) unstable; urgency=medium

 * Bugfixes