FIXME Sync only completed writes

Currently it's impossible because it leads to errors similar to:

    terminate called after throwing an instance of 'std::runtime_error'
      what(): BUG: Unexpected dirty_entry 1000000000001:29480000 v65540 unstable state during flush: 0x151

Probably because in that case flushers should wait for previous writes too.
Add Contributor License Agreement in Russian and English
2023-12-31 01:24:54 +03:00 · 2023-12-31 01:23:52 +03:00 · 2023-12-31 01:23:17 +03:00 · 2023-12-31 01:23:17 +03:00 · 2023-12-31 01:23:17 +03:00 · 2023-12-31 01:23:17 +03:00
35 changed files with 1178 additions and 281 deletions
--- a/CLA-en.md
+++ b/CLA-en.md
@@ -0,0 +1,115 @@
+## Contributor License Agreement
+
+> This Agreement is made in the Russian and English languages. **The English
+text of the Agreement is for informational purposes only** and is not binding
+on the Parties.
+>
+> In the event of a conflict between the provisions of the Russian and
+English versions of this Agreement, the **Russian version shall prevail**.
+>
+> The Russian version is published at https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md
+
+This document represents the offer of Filippov Vitaliy Vladimirovich
+("Author"), author and copyright holder of Vitastor software ("Program"),
+acknowledged by a certificate of Federal Service for Intellectual
+Property of Russian Federation (Rospatent) # 2021617829 dated 20 May 2021,
+to "Contributors" to conclude this license agreement as follows
+("Agreement" or "Offer").
+
+In accordance with Art. 435, Art. 438 of the Civil Code of the Russian
+Federation, this Agreement is an offer and in case of acceptance of the
+offer, an agreement is considered concluded on the conditions specified
+in the offer.
+
+1. Applicable Terms. \
+   1.1. "Official Repository" shall mean the computer storage, operated by
+        the Author, containing all prior and future versions of the Source
+        Code of the Program, at Internet addresses https://git.yourcmc.ru/vitalif/vitastor/
+        or https://github.com/vitalif/vitastor/. \
+   1.2. "Contributions" shall mean results of intellectual activity
+        (including, but not limited to, source code, libraries, components,
+        texts, documentation) which can be software or elements of the software
+        and which are provided by Contributors to the Author for inclusion
+        in the Program. \
+   1.3. "Contributor" shall mean a person who provides Contributions to
+        the Author and agrees with all provisions of this Agreement.
+        A Contributor can be: 1) an individual; or 2) a legal entity or an
+        individual entrepreneur in case when an individual provides Contributions
+        on behalf of third parties, including on behalf of his employer.
+
+2. Subject of the Agreement. \
+   2.1. Subject of the Agreement shall be the Contributions sent to the Author by Contributors. \
+   2.2. The Contributor grants to the Author the right to use Contributions at his own
+        discretion and without any necessity to get a prior approval from Contributor or
+        any other third party in any way, under a simple (non-exclusive), royalty-free,
+        irrevocable license throughout the world by all means not contrary to law, in whole
+        or as a part of the Program, or other open-source or closed-source computer programs,
+        products or services (hereinafter -- the "License"), including, but not limited to: \
+        2.2.1. to execute Contributions and use them for any tasks; \
+        2.2.2. to publish and distribute Contributions in modified or unmodified form and/or to rent them; \
+        2.2.3. to modify Contributions, add comments, illustrations or any explanations to Contributions while using them; \
+        2.2.4. to create other results of intellectual activity based on Contributions, including derivative works and composite works; \
+        2.2.5. to translate Contributions into other languages, including other programming languages; \
+        2.2.6. to carry out rental and public display of Contributions; \
+        2.2.7. to use Contributions under the trade name and/or any trademark or any other label, or without it, as the Author thinks fit; \
+   2.3. The Contributor grants to the Author the right to sublicense any of the aforementioned
+        rights to third parties on any terms at the Author's discretion. \
+   2.4. The License is provided for the entire duration of Contributor's
+        exclusive intellectual property rights to the Contributions. \
+   2.5. The Contributor grants to the Author the right to decide how and where to mention,
+        or to not mention at all, the fact of his authorship, name, nickname and/or company
+        details when including Contributions into the Program or in any other computer
+        programs, products or services.
+
+3. Acceptance of the Offer \
+   3.1. The Contributor may provide Contributions to the Author in the form of
+        a "Pull Request" in an Official Repository of the Program or by any
+        other electronic means of communication, including, but not limited to,
+        E-mail or messenger applications. \
+   3.2. The acceptance of the Offer shall be the fact of provision of Contributions
+        to the Author by the Contributor by any means with the following remark:
+        “I accept Vitastor CLA agreement: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md”
+        or “Я принимаю соглашение Vitastor CLA: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md”. \
+   3.3. Date of acceptance of the Offer shall be the date of such provision.
+
+4. Rights and obligations of the parties. \
+   4.1. The Contributor reserves the right to use Contributions by any lawful means
+        not contrary to this Agreement. \
+   4.2. The Author has the right to refuse to include Contributions into the Program
+        at any moment with no explanation to the Contributor.
+
+5. Representations and Warranties. \
+   5.1. The person providing Contributions for the purpose of their inclusion
+        in the Program represents and warrants that he is the Contributor
+        or legally acts on the Contributor's behalf. Name or company details
+        of the Contributor shall be provided with the Contribution at the moment
+        of their provision to the Author. \
+   5.2. The Contributor represents and warrants that he legally owns exclusive
+        intellectual property rights to the Contributions. \
+   5.3. The Contributor represents and warrants that any further use of
+        Contributions by the Author as provided by Contributor under the terms
+        of the Agreement does not infringe on intellectual and other rights and
+        legitimate interests of third parties. \
+   5.4. The Contributor represents and warrants that he has all rights and legal
+        capacity needed to accept this Offer. \
+   5.5. The Contributor represents and warrants that Contributions don't
+        contain malware or any information considered illegal under the law
+        of Russian Federation.
+
+6. Termination of the Agreement \
+   6.1. The Agreement may be terminated by mutual agreement of the Author and
+        the Contributor, formalised in written form, or on grounds prescribed
+        by the law of the Russian Federation.
+
+7. Final Clauses \
+   7.1. The Contributor may optionally sign the Agreement in the written form. \
+   7.2. The Agreement is deemed to become effective from the Date of signing of
+        the Agreement and until the expiration of Contributor's exclusive
+        intellectual property rights to the Contributions. \
+   7.3. The Author may unilaterally alter the Agreement without informing Contributors.
+        The new version of the document shall come into effect 3 (three) days after
+        being published in the Official Repository of the Program at Internet address
+        [https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md).
+        Contributors should keep themselves informed about the current version of the Agreement. \
+   7.4. If the Author and the Contributor fail to agree on disputable issues,
+        disputes shall be referred to the Moscow Arbitration court.
--- a/CLA-ru.md
+++ b/CLA-ru.md
@@ -0,0 +1,108 @@
+## Лицензионное соглашение с участником
+
+> Данная Оферта написана в Русской и Английской версиях. **Версия на английском
+языке предоставляется в информационных целях** и не связывает стороны договора.
+>
+> В случае несоответствий между положениями Русской и Английской версий Договора,
+**Русская версия имеет приоритет**.
+>
+> Английская версия опубликована по адресу https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md
+
+Настоящий договор-оферта (далее по тексту – Оферта, Договор) адресована физическим
+и юридическим лицам (далее – Участникам) и является официальным публичным предложением
+Филиппова Виталия Владимировича (далее – Автора) программного обеспечения Vitastor,
+свидетельство Федеральной службы по интеллектуальной собственности (Роспатент) № 2021617829
+от 20 мая 2021 г. (далее – Программа) о нижеследующем:
+
+1. Термины и определения \
+   1.1. Репозиторий – электронное хранилище, содержащее исходный код Программы. \
+   1.2. Доработка – результат интеллектуальной деятельности Участника, включающий
+        в себя изменения или дополнения к исходному коду Программы, которые Участник
+        желает включить в состав Программы для дальнейшего использования и распространения
+        Автором и для этого направляет их Автору. \
+   1.3. Участник – физическое или юридическое лицо, вносящее Доработки в код Программы. \
+   1.4. ГК РФ – Гражданский кодекс Российской Федерации.
+
+2. Предмет оферты \
+   2.1. Предметом настоящей оферты являются Доработки, отправляемые Участником Автору. \
+   2.2. Участник предоставляет Автору право использовать Доработки по собственному усмотрению
+        и без необходимости предварительного согласования с Участником или иным третьим лицом
+        на условиях простой (неисключительной) безвозмездной безотзывной лицензии, полностью
+        или фрагментарно, в составе Программы или других программ, продуктов или сервисов
+        как с открытым, так и с закрытым исходным кодом, любыми способами, не противоречащими
+        закону, включая, но не ограничиваясь следующими: \
+        2.2.1. Запускать и использовать Доработки для выполнения любых задач; \
+        2.2.2. Распространять, импортировать и доводить Доработки до всеобщего сведения; \
+        2.2.3. Вносить в Доработки изменения, сокращения и дополнения, снабжать Доработки
+               при их использовании комментариями, иллюстрациями или пояснениями; \
+        2.2.4. Создавать на основе Доработок иные результаты интеллектуальной деятельности,
+               в том числе производные и составные произведения; \
+        2.2.5. Переводить Доработки на другие языки, в том числе на другие языки программирования; \
+        2.2.6. Осуществлять прокат и публичный показ Доработок; \
+        2.2.7. Использовать Доработки под любым фирменным наименованием, товарным знаком
+               (знаком обслуживания) или иным обозначением, или без такового. \
+   2.3. Участник предоставляет Автору право сублицензировать полученные права на Доработки
+        третьим лицам на любых условиях на усмотрение Автора. \
+   2.4. Участник предоставляет Автору права на Доработки на территории всего мира. \
+   2.5. Участник предоставляет Автору права на весь срок действия исключительного права
+        Участника на Доработки. \
+   2.6. Участник предоставляет Автору права на Доработки на безвозмездной основе. \
+   2.7. Участник разрешает Автору самостоятельно определять порядок, способ и
+        место указания его имени, реквизитов и/или псевдонима при включении
+        Доработок в состав Программы или других программ, продуктов или сервисов.
+
+3. Акцепт Оферты \
+   3.1. Участник может передавать Доработки в адрес Автора через зеркала официального
+        Репозитория Программы по адресам https://git.yourcmc.ru/vitalif/vitastor/ или
+        https://github.com/vitalif/vitastor/ в виде “запроса на слияние” (pull request),
+        либо в письменном виде или с помощью любых других электронных средств коммуникации,
+        например, электронной почты или мессенджеров. \
+   3.2. Факт передачи Участником Доработок в адрес Автора любым способом с одной из пометок
+        “I accept Vitastor CLA agreement: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-en.md”
+        или “Я принимаю соглашение Vitastor CLA: https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md”
+        является полным и безоговорочным акцептом (принятием) Участником условий настоящей
+        Оферты, т.е. Участник считается ознакомившимся с настоящим публичным договором и
+        в соответствии с ГК РФ признается лицом, вступившим с Автором в договорные отношения
+        на основании настоящей Оферты. \
+   3.3. Датой акцептирования настоящей Оферты считается дата такой передачи.
+
+4. Права и обязанности Сторон \
+   4.1. Участник сохраняет за собой право использовать Доработки любым законным
+        способом, не противоречащим настоящему Договору. \
+   4.2. Автор вправе отказать Участнику во включении Доработок в состав
+        Программы без объяснения причин в любой момент по своему усмотрению.
+
+5. Гарантии и заверения \
+   5.1. Лицо, направляющее Доработки для целей их включения в состав Программы,
+        гарантирует, что является Участником или представителем Участника. Имя или реквизиты
+        Участника должны быть указаны при их передаче в адрес Автора Программы. \
+   5.2. Участник гарантирует, что является законным обладателем исключительных прав
+        на Доработки. \
+   5.3. Участник гарантирует, что на момент акцептирования настоящей Оферты ему
+        ничего не известно (и не могло быть известно) о правах третьих лиц на
+        передаваемые Автору Доработки или их часть, которые могут быть нарушены
+        в связи с передачей Доработок по настоящему Договору. \
+   5.4. Участник гарантирует, что является дееспособным лицом и обладает всеми
+        необходимыми правами для заключения Договора. \
+   5.5. Участник гарантирует, что Доработки не содержат вредоносного ПО, а также
+        любой другой информации, запрещённой к распространению по законам Российской
+        Федерации.
+
+6. Прекращение действия оферты \
+   6.1. Действие настоящего договора может быть прекращено по соглашению сторон,
+        оформленному в письменном виде, а также вследствие его расторжения по основаниям,
+        предусмотренным законом.
+
+7. Заключительные положения \
+   7.1. Участник вправе по желанию подписать настоящий Договор в письменном виде. \
+   7.2. Настоящий договор действует с момента его заключения и до истечения срока
+        действия исключительных прав Участника на Доработки. \
+   7.3. Автор имеет право в одностороннем порядке вносить изменения и дополнения в договор
+        без специального уведомления об этом Участников. Новая редакция документа вступает
+        в силу через 3 (Три) календарных дня со дня опубликования в официальном Репозитории
+        Программы по адресу в сети Интернет
+        [https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md](https://git.yourcmc.ru/vitalif/vitastor/src/branch/master/CLA-ru.md).
+        Участники самостоятельно отслеживают действующие условия Оферты. \
+   7.4. Все споры, возникающие между сторонами в процессе их взаимодействия по настоящему
+        договору, решаются путём переговоров. В случае невозможности урегулирования споров
+        переговорным порядком стороны разрешают их в Арбитражном суде г.Москвы.
--- a/docs/config/client.en.md
+++ b/docs/config/client.en.md
@@ -6,8 +6,8 @@

 # Client Parameters

-These parameters apply only to clients and affect their interaction with
-the cluster.
+These parameters apply only to Vitastor clients (QEMU, fio, NBD and so on) and
+affect their interaction with the cluster.

 - [client_max_dirty_bytes](#client_max_dirty_bytes)
 - [client_max_dirty_ops](#client_max_dirty_ops)
--- a/docs/config/client.ru.md
+++ b/docs/config/client.ru.md
@@ -6,7 +6,7 @@

 # Параметры клиентского кода

-Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD) и
+Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD и т.п.) и
 затрагивают логику их работы с кластером.

 - [client_max_dirty_bytes](#client_max_dirty_bytes)
--- a/docs/config/osd.en.md
+++ b/docs/config/osd.en.md
@@ -19,6 +19,7 @@ them, even without restarting by updating configuration in etcd.
 - [autosync_interval](#autosync_interval)
 - [autosync_writes](#autosync_writes)
 - [recovery_queue_depth](#recovery_queue_depth)
+- [recovery_sleep_us](#recovery_sleep_us)
 - [recovery_pg_switch](#recovery_pg_switch)
 - [recovery_sync_batch](#recovery_sync_batch)
 - [readonly](#readonly)
@@ -51,6 +52,13 @@ them, even without restarting by updating configuration in etcd.
 - [scrub_list_limit](#scrub_list_limit)
 - [scrub_find_best](#scrub_find_best)
 - [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce)
+- [recovery_tune_interval](#recovery_tune_interval)
+- [recovery_tune_util_low](#recovery_tune_util_low)
+- [recovery_tune_util_high](#recovery_tune_util_high)
+- [recovery_tune_client_util_low](#recovery_tune_client_util_low)
+- [recovery_tune_client_util_high](#recovery_tune_client_util_high)
+- [recovery_tune_agg_interval](#recovery_tune_agg_interval)
+- [recovery_tune_sleep_min_us](#recovery_tune_sleep_min_us)

 ## etcd_report_interval

@@ -135,12 +143,24 @@ operations before issuing an fsync operation internally.
 ## recovery_queue_depth

 - Type: integer
-- Default: 4
+- Default: 1
 - Can be changed online: yes

-Maximum recovery operations per one primary OSD at any given moment of time.
-Currently it's the only parameter available to tune the speed or recovery
-and rebalancing, but it's planned to implement more.
+Maximum number of recovery and rebalance operations initiated by each OSD in parallel.
+Note that each OSD talks to many other OSDs, so the actual number of parallel
+recovery operations per OSD is greater than recovery_queue_depth alone.
+Increasing this parameter can speed up recovery if [auto-tuning](#recovery_tune_interval)
+allows it or if auto-tuning is disabled.
+
+## recovery_sleep_us
+
+- Type: microseconds
+- Default: 0
+- Can be changed online: yes
+
+Delay for all recovery- and rebalance-related operations. If non-zero,
+such operations are artificially slowed down to reduce the impact on
+client I/O.

 ## recovery_pg_switch

@@ -508,3 +528,81 @@ the variant with most available equal copies is correct. For example, if
 you have 3 replicas and 1 of them differs, this one is considered to be
 corrupted. But if there is no "best" version with more copies than all
 others have then the object is also marked as inconsistent.
+
+## recovery_tune_interval
+
+- Type: seconds
+- Default: 1
+- Can be changed online: yes
+
+Interval at which the OSD reconsiders client and recovery load and automatically
+adjusts [recovery_sleep_us](#recovery_sleep_us). Recovery auto-tuning is
+disabled if recovery_tune_interval is set to 0.
+
+Auto-tuning targets utilization. Utilization is a measure of load and is
+equal to the product of iops and average latency (so it may be greater
+than 1). You set "low" and "high" client utilization thresholds and two
+corresponding target recovery utilization levels. The OSD calculates the desired
+recovery utilization from client utilization using linear interpolation
+and auto-tunes the recovery operation delay so that actual recovery utilization
+matches the desired value.
+
+This reduces the impact of recovery/rebalance on client operations. It is
+of course impossible to remove it completely, but it should become more reasonable.
+In some tests, rebalance previously dropped client write speed from 1.5 GB/s
+to 50-100 MB/s; with default auto-tuning settings it now only drops
+to ~1 GB/s.
+
+## recovery_tune_util_low
+
+- Type: number
+- Default: 0.1
+- Can be changed online: yes
+
+Desired recovery/rebalance utilization when client load is high, i.e. when
+it is at or above recovery_tune_client_util_high.
+
+## recovery_tune_util_high
+
+- Type: number
+- Default: 1
+- Can be changed online: yes
+
+Desired recovery/rebalance utilization when client load is low, i.e. when
+it is at or below recovery_tune_client_util_low.
+
+## recovery_tune_client_util_low
+
+- Type: number
+- Default: 0
+- Can be changed online: yes
+
+Client utilization considered "low".
+
+## recovery_tune_client_util_high
+
+- Type: number
+- Default: 0.5
+- Can be changed online: yes
+
+Client utilization considered "high".
+
+## recovery_tune_agg_interval
+
+- Type: integer
+- Default: 10
+- Can be changed online: yes
+
+The number of most recent auto-tuning iterations over which the delay is
+averaged. Lower values result in quicker response to client load changes,
+higher values result in a more stable delay. The default value of 10
+is usually fine.
+
+## recovery_tune_sleep_min_us
+
+- Type: microseconds
+- Default: 10
+- Can be changed online: yes
+
+Minimum possible value for auto-tuned recovery_sleep_us. Values lower
+than this value are changed to 0.
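
The interpolation described under recovery_tune_interval in the hunk above can be sketched as follows. This is an illustration only, not Vitastor's implementation: the function names `utilization` and `desiredRecoveryUtil` are hypothetical, latency is assumed to be expressed in seconds, and clamping outside the two thresholds is inferred from the "at or above" / "at or below" wording of the parameter descriptions. Only the parameter names and default values come from the documentation in this diff.

```javascript
// Defaults copied from the parameter descriptions in osd.en.md.
const defaults = {
    recovery_tune_util_low: 0.1,         // target recovery utilization under high client load
    recovery_tune_util_high: 1,          // target recovery utilization under low client load
    recovery_tune_client_util_low: 0,    // client utilization considered "low"
    recovery_tune_client_util_high: 0.5, // client utilization considered "high"
};

// Utilization = iops * average latency (assumed to be in seconds); may exceed 1.
function utilization(iops, avgLatencySec) {
    return iops * avgLatencySec;
}

// Hypothetical helper: map client utilization to the desired recovery
// utilization by linear interpolation, clamped at both thresholds.
function desiredRecoveryUtil(clientUtil, cfg = defaults) {
    const lo = cfg.recovery_tune_client_util_low;
    const hi = cfg.recovery_tune_client_util_high;
    if (clientUtil <= lo) return cfg.recovery_tune_util_high;
    if (clientUtil >= hi) return cfg.recovery_tune_util_low;
    const t = (clientUtil - lo) / (hi - lo);
    return cfg.recovery_tune_util_high
        + t * (cfg.recovery_tune_util_low - cfg.recovery_tune_util_high);
}

// Example: 500 client iops at 0.5 ms average latency.
const clientUtil = utilization(500, 0.0005);
console.log(clientUtil, desiredRecoveryUtil(clientUtil));
```

With the default settings, a client utilization of 0.25 lies halfway between the two thresholds, so this sketch yields a recovery target of about 0.55. How the OSD then converts that target into a concrete recovery_sleep_us value is not described in the documentation and is left out here.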
--- a/docs/config/osd.ru.md
+++ b/docs/config/osd.ru.md
@@ -20,6 +20,7 @@
 - [autosync_interval](#autosync_interval)
 - [autosync_writes](#autosync_writes)
 - [recovery_queue_depth](#recovery_queue_depth)
+- [recovery_sleep_us](#recovery_sleep_us)
 - [recovery_pg_switch](#recovery_pg_switch)
 - [recovery_sync_batch](#recovery_sync_batch)
 - [readonly](#readonly)
@@ -52,6 +53,13 @@
 - [scrub_list_limit](#scrub_list_limit)
 - [scrub_find_best](#scrub_find_best)
 - [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce)
+- [recovery_tune_interval](#recovery_tune_interval)
+- [recovery_tune_util_low](#recovery_tune_util_low)
+- [recovery_tune_util_high](#recovery_tune_util_high)
+- [recovery_tune_client_util_low](#recovery_tune_client_util_low)
+- [recovery_tune_client_util_high](#recovery_tune_client_util_high)
+- [recovery_tune_agg_interval](#recovery_tune_agg_interval)
+- [recovery_tune_sleep_min_us](#recovery_tune_sleep_min_us)

 ## etcd_report_interval

@@ -138,13 +146,25 @@ OSD, чтобы успевать очищать журнал - без них OSD
 ## recovery_queue_depth

 - Тип: целое число
-- Значение по умолчанию: 4
+- Значение по умолчанию: 1
 - Можно менять на лету: да

-Максимальное число операций восстановления на одном первичном OSD в любой
-момент времени. На данный момент единственный параметр, который можно менять
-для ускорения или замедления восстановления и перебалансировки данных, но
-в планах реализация других параметров.
+Максимальное число параллельных операций восстановления, инициируемых одним
+OSD в любой момент времени. Имейте в виду, что каждый OSD обычно работает с
+многими другими OSD, так что на практике параллелизм восстановления больше,
+чем просто recovery_queue_depth. Увеличение значения этого параметра может
+ускорить восстановление если [автотюнинг скорости](#recovery_tune_interval)
+разрешает это или если он отключён.
+
+## recovery_sleep_us
+
+- Тип: микросекунды
+- Значение по умолчанию: 0
+- Можно менять на лету: да
+
+Delay for all recovery- and rebalance-related operations. If non-zero,
+such operations are artificially slowed down to reduce the impact on
+client I/O.

 ## recovery_pg_switch

@@ -535,3 +555,83 @@ EC (кодов коррекции ошибок) с более, чем 1 диск
 считается некорректной. Однако, если "лучшую" версию с числом доступных
 копий большим, чем у всех других версий, найти невозможно, то объект тоже
 маркируется неконсистентным.
+
+## recovery_tune_interval
+
+- Тип: секунды
+- Значение по умолчанию: 1
+- Можно менять на лету: да
+
+Интервал, с которым OSD пересматривает клиентскую нагрузку и нагрузку
+восстановления и автоматически подстраивает [recovery_sleep_us](#recovery_sleep_us).
+Автотюнинг (автоподстройка) отключается, если recovery_tune_interval
+устанавливается в значение 0.
+
+Автотюнинг регулирует утилизацию. Утилизация является мерой нагрузки
+и равна произведению числа операций в секунду и средней задержки
+(то есть, она может быть выше 1). Вы задаёте два уровня клиентской
+утилизации - "низкий" и "высокий" (low и high) и два соответствующих
+целевых уровня утилизации операциями восстановления. OSD рассчитывает
+желаемый уровень утилизации восстановления линейной интерполяцией от
+клиентской утилизации и подстраивает задержку операций восстановления
+так, чтобы фактическая утилизация восстановления совпадала с желаемой.
+
+Это позволяет снизить влияние восстановления и ребаланса на клиентские
+операции. Конечно, невозможно исключить такое влияние полностью, но оно
+должно становиться адекватнее. В некоторых тестах перебалансировка могла
+снижать клиентскую скорость записи с 1.5 ГБ/с до 50-100 МБ/с, а теперь, с
+настройками автотюнинга по умолчанию, она снижается только до ~1 ГБ/с.
+
+## recovery_tune_util_low
+
+- Тип: число
+- Значение по умолчанию: 0.1
+- Можно менять на лету: да
+
+Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
+высокая, то есть, находится на уровне или выше recovery_tune_client_util_high.
+
+## recovery_tune_util_high
+
+- Тип: число
+- Значение по умолчанию: 1
+- Можно менять на лету: да
+
+Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
+низкая, то есть, находится на уровне или ниже recovery_tune_client_util_low.
+
+## recovery_tune_client_util_low
+
+- Тип: число
+- Значение по умолчанию: 0
+- Можно менять на лету: да
+
+Клиентская утилизация, которая считается "низкой".
+
+## recovery_tune_client_util_high
+
+- Тип: число
+- Значение по умолчанию: 0.5
+- Можно менять на лету: да
+
+Клиентская утилизация, которая считается "высокой".
+
+## recovery_tune_agg_interval
+
+- Тип: целое число
+- Значение по умолчанию: 10
+- Можно менять на лету: да
+
+Число последних итераций автоподстройки для расчёта задержки как среднего
+значения. Меньшие значения параметра ускоряют отклик на изменение нагрузки,
+большие значения делают задержку стабильнее. Значение по умолчанию 10
+обычно нормальное и не требует изменений.
+
+## recovery_tune_sleep_min_us
+
+- Тип: микросекунды
+- Значение по умолчанию: 10
+- Можно менять на лету: да
+
+Минимальное возможное значение авто-подстроенного recovery_sleep_us.
+Значения ниже данного заменяются на 0.
--- a/docs/config/src/make.js
+++ b/docs/config/src/make.js
@@ -38,6 +38,7 @@ const types = {
        bool: 'boolean',
        int: 'integer',
        sec: 'seconds',
+        float: 'number',
        ms: 'milliseconds',
        us: 'microseconds',
    },
@@ -46,6 +47,7 @@ const types = {
        bool: 'булево (да/нет)',
        int: 'целое число',
        sec: 'секунды',
+        float: 'число',
        ms: 'миллисекунды',
        us: 'микросекунды',
    },
--- a/docs/config/src/osd.yml
+++ b/docs/config/src/osd.yml
@@ -107,17 +107,29 @@
    принудительной отправкой fsync-а.
 - name: recovery_queue_depth
  type: int
-  default: 4
+  default: 1
  online: true
  info: |
-    Maximum recovery operations per one primary OSD at any given moment of time.
-    Currently it's the only parameter available to tune the speed or recovery
-    and rebalancing, but it's planned to implement more.
+    Maximum number of recovery and rebalance operations initiated by each OSD in parallel.
+    Note that each OSD talks to many other OSDs, so the actual number of parallel
+    recovery operations per OSD is greater than recovery_queue_depth alone.
+    Increasing this parameter can speed up recovery if [auto-tuning](#recovery_tune_interval)
+    allows it or if auto-tuning is disabled.
  info_ru: |
-    Максимальное число операций восстановления на одном первичном OSD в любой
-    момент времени. На данный момент единственный параметр, который можно менять
-    для ускорения или замедления восстановления и перебалансировки данных, но
-    в планах реализация других параметров.
+    Максимальное число параллельных операций восстановления, инициируемых одним
+    OSD в любой момент времени. Имейте в виду, что каждый OSD обычно работает с
+    многими другими OSD, так что на практике параллелизм восстановления больше,
+    чем просто recovery_queue_depth. Увеличение значения этого параметра может
+    ускорить восстановление если [автотюнинг скорости](#recovery_tune_interval)
+    разрешает это или если он отключён.
+- name: recovery_sleep_us
+  type: us
+  default: 0
+  online: true
+  info: |
+    Delay for all recovery- and rebalance-related operations. If non-zero,
+    such operations are artificially slowed down to reduce the impact on
+    client I/O.
 - name: recovery_pg_switch
  type: int
  default: 128
@@ -626,3 +638,101 @@
    считается некорректной. Однако, если "лучшую" версию с числом доступных
    копий большим, чем у всех других версий, найти невозможно, то объект тоже
    маркируется неконсистентным.
+- name: recovery_tune_interval
+  type: sec
+  default: 1
+  online: true
+  info: |
+    Interval at which the OSD reconsiders client and recovery load and automatically
+    adjusts [recovery_sleep_us](#recovery_sleep_us). Recovery auto-tuning is
+    disabled if recovery_tune_interval is set to 0.
+
+    Auto-tuning targets utilization. Utilization is a measure of load and is
+    equal to the product of iops and average latency (so it may be greater
+    than 1). You set "low" and "high" client utilization thresholds and two
+    corresponding target recovery utilization levels. The OSD calculates the desired
+    recovery utilization from client utilization using linear interpolation
+    and auto-tunes the recovery operation delay so that actual recovery utilization
+    matches the desired value.
+
+    This reduces the impact of recovery/rebalance on client operations. It is
+    of course impossible to remove it completely, but it should become more reasonable.
+    In some tests, rebalance previously dropped client write speed from 1.5 GB/s
+    to 50-100 MB/s; with default auto-tuning settings it now only drops
+    to ~1 GB/s.
+  info_ru: |
+    Интервал, с которым OSD пересматривает клиентскую нагрузку и нагрузку
+    восстановления и автоматически подстраивает [recovery_sleep_us](#recovery_sleep_us).
+    Автотюнинг (автоподстройка) отключается, если recovery_tune_interval
+    устанавливается в значение 0.
+
+    Автотюнинг регулирует утилизацию. Утилизация является мерой нагрузки
+    и равна произведению числа операций в секунду и средней задержки
+    (то есть, она может быть выше 1). Вы задаёте два уровня клиентской
+    утилизации - "низкий" и "высокий" (low и high) и два соответствующих
+    целевых уровня утилизации операциями восстановления. OSD рассчитывает
+    желаемый уровень утилизации восстановления линейной интерполяцией от
+    клиентской утилизации и подстраивает задержку операций восстановления
+    так, чтобы фактическая утилизация восстановления совпадала с желаемой.
+
+    Это позволяет снизить влияние восстановления и ребаланса на клиентские
+    операции. Конечно, невозможно исключить такое влияние полностью, но оно
+    должно становиться адекватнее. В некоторых тестах перебалансировка могла
+    снижать клиентскую скорость записи с 1.5 ГБ/с до 50-100 МБ/с, а теперь, с
+    настройками автотюнинга по умолчанию, она снижается только до ~1 ГБ/с.
+- name: recovery_tune_util_low
+  type: float
+  default: 0.1
+  online: true
+  info: |
+    Desired recovery/rebalance utilization when client load is high, i.e. when
+    it is at or above recovery_tune_client_util_high.
+  info_ru: |
+    Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
+    высокая, то есть, находится на уровне или выше recovery_tune_client_util_high.
+- name: recovery_tune_util_high
+  type: float
+  default: 1
+  online: true
+  info: |
+    Desired recovery/rebalance utilization when client load is low, i.e. when
+    it is at or below recovery_tune_client_util_low.
+  info_ru: |
+    Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
+    низкая, то есть, находится на уровне или ниже recovery_tune_client_util_low.
+- name: recovery_tune_client_util_low
+  type: float
+  default: 0
+  online: true
+  info: Client utilization considered "low".
+  info_ru: Клиентская утилизация, которая считается "низкой".
+- name: recovery_tune_client_util_high
+  type: float
+  default: 0.5
+  online: true
+  info: Client utilization considered "high".
+  info_ru: Клиентская утилизация, которая считается "высокой".
+- name: recovery_tune_agg_interval
+  type: int
+  default: 10
+  online: true
+  info: |
+    The number of most recent auto-tuning iterations over which the delay
+    is averaged. Lower values respond faster to client load changes,
+    higher values give a more stable delay. The default value of 10
+    is usually fine.
+  info_ru: |
+    Число последних итераций автоподстройки для расчёта задержки как среднего
+    значения. Меньшие значения параметра ускоряют отклик на изменение нагрузки,
+    большие значения делают задержку стабильнее. Значение по умолчанию 10
+    обычно нормальное и не требует изменений.
+- name: recovery_tune_sleep_min_us
+  type: us
+  default: 10
+  online: true
+  info: |
+    Minimum possible value for auto-tuned recovery_sleep_us. Values lower
+    than this value are changed to 0.
+  info_ru: |
+    Минимальное возможное значение авто-подстроенного recovery_sleep_us.
+    Значения ниже данного заменяются на 0.
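The interpolation described in the recovery_tune_* parameters above can be sketched roughly as follows. This is only an illustration of the documented behaviour, not the OSD's actual C++ code: the function names `utilization`, `targetRecoveryUtil` and `clampSleepUs` are hypothetical, and the `cfg` object simply mirrors the documented defaults.

```javascript
// Hypothetical sketch of the documented auto-tuning idea (not the OSD implementation).
// Values mirror the documented parameter defaults.
const cfg = {
    recovery_tune_util_low: 0.1,
    recovery_tune_util_high: 1,
    recovery_tune_client_util_low: 0,
    recovery_tune_client_util_high: 0.5,
    recovery_tune_sleep_min_us: 10,
};

// Utilization = operations per second * average latency (in seconds), so it may exceed 1
function utilization(iops, avgLatencyUs)
{
    return iops * avgLatencyUs / 1000000;
}

// Linear interpolation: low client load -> util_high, high client load -> util_low
function targetRecoveryUtil(clientUtil, c = cfg)
{
    if (clientUtil <= c.recovery_tune_client_util_low)
        return c.recovery_tune_util_high;
    if (clientUtil >= c.recovery_tune_client_util_high)
        return c.recovery_tune_util_low;
    const k = (clientUtil - c.recovery_tune_client_util_low) /
        (c.recovery_tune_client_util_high - c.recovery_tune_client_util_low);
    return c.recovery_tune_util_high + k * (c.recovery_tune_util_low - c.recovery_tune_util_high);
}

// Sleep values below recovery_tune_sleep_min_us are replaced with 0, as documented
function clampSleepUs(sleepUs, c = cfg)
{
    return sleepUs < c.recovery_tune_sleep_min_us ? 0 : sleepUs;
}
```

With the defaults, an idle cluster lets recovery run at utilization 1, while a client utilization of 0.5 or more pushes the recovery target down to 0.1. How the OSD converts the target into an actual recovery_sleep_us value is not shown here.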
--- a/docs/intro/features.en.md
+++ b/docs/intro/features.en.md
@ -32,6 +32,7 @@
 - [Scrubbing](../config/osd.en.md#auto_scrub) (verification of copies)
 - [Checksums](../config/layout-osd.en.md#data_csum_type)
 - [Client write-back cache](../config/client.en.md#client_enable_writeback)
+- [Intelligent recovery auto-tuning](../config/osd.en.md#recovery_tune_interval)

 ## Plugins and tools

--- a/docs/intro/features.ru.md
+++ b/docs/intro/features.ru.md
@ -34,6 +34,7 @@
 - [Фоновая проверка целостности](../config/osd.ru.md#auto_scrub) (сверка копий)
 - [Контрольные суммы](../config/layout-osd.ru.md#data_csum_type)
 - [Буферизация записи на стороне клиента](../config/client.ru.md#client_enable_writeback)
+- [Интеллектуальная автоподстройка скорости восстановления](../config/osd.ru.md#recovery_tune_interval)

 ## Драйверы и инструменты

--- a/mon/PGUtil.js
+++ b/mon/PGUtil.js
@ -3,6 +3,7 @@

 module.exports = {
    scale_pg_count,
+    scale_pg_history,
 };

 function add_pg_history(new_pg_history, new_pg, prev_pgs, prev_pg_history, old_pg)
@ -43,16 +44,18 @@ function finish_pg_history(merged_history)
    merged_history.all_peers = Object.values(merged_history.all_peers);
 }

-function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
+function scale_pg_history(prev_pg_history, prev_pgs, new_pgs)
 {
-    const old_pg_count = real_prev_pgs.length;
+    const new_pg_history = [];
+    const old_pg_count = prev_pgs.length;
+    const new_pg_count = new_pgs.length;
    // Add all possibly intersecting PGs to the history of new PGs
    if (!(new_pg_count % old_pg_count))
    {
        // New PG count is a multiple of old PG count
        for (let i = 0; i < new_pg_count; i++)
        {
-            add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i % old_pg_count);
+            add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i % old_pg_count);
            finish_pg_history(new_pg_history[i]);
        }
    }
@ -64,7 +67,7 @@ function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history
        {
            for (let j = 0; j < mul; j++)
            {
-                add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i+j*new_pg_count);
+                add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i+j*new_pg_count);
            }
            finish_pg_history(new_pg_history[i]);
        }
@ -76,7 +79,7 @@ function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history
        let merged_history = {};
        for (let i = 0; i < old_pg_count; i++)
        {
-            add_pg_history(merged_history, 1, real_prev_pgs, prev_pg_history, i);
+            add_pg_history(merged_history, 1, prev_pgs, prev_pg_history, i);
        }
        finish_pg_history(merged_history[1]);
        for (let i = 0; i < new_pg_count; i++)
@ -89,6 +92,12 @@ function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history
    {
        new_pg_history[i] = null;
    }
+    return new_pg_history;
+}
+
+function scale_pg_count(prev_pgs, new_pg_count)
+{
+    const old_pg_count = prev_pgs.length;
    // Just for the lp_solve optimizer - pick a "previous" PG for each "new" one
    if (prev_pgs.length < new_pg_count)
    {
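The way `scale_pg_history` picks "possibly intersecting" old PGs for each new PG can be illustrated with a small standalone function. This is a sketch under assumptions: `sourcePgs` is a hypothetical helper, it uses 0-based indexes, and the condition for the second branch (old count being a multiple of the new count) is inferred from the `mul` loop, since that line is not visible in the hunk.

```javascript
// Hypothetical illustration: which old PG indexes (0-based) feed the history of each new PG
function sourcePgs(oldCount, newCount)
{
    const result = [];
    for (let i = 0; i < newCount; i++)
    {
        if (newCount % oldCount === 0)
        {
            // New PG count is a multiple of the old one: each new PG has one source
            result.push([ i % oldCount ]);
        }
        else if (oldCount % newCount === 0)
        {
            // Assumed condition: old PG count is a multiple of the new one,
            // so several old PGs are merged into each new PG
            const src = [];
            for (let j = 0; j < oldCount / newCount; j++)
            {
                src.push(i + j*newCount);
            }
            result.push(src);
        }
        else
        {
            // Otherwise any old PG may intersect any new PG: merge all of them
            result.push([ ...Array(oldCount).keys() ]);
        }
    }
    return result;
}

console.log(JSON.stringify(sourcePgs(2, 4)));
console.log(JSON.stringify(sourcePgs(4, 2)));
```

For example, growing from 2 to 4 PGs maps new PGs to old PGs 0, 1, 0, 1, while shrinking from 4 to 2 merges old PGs 0 and 2 into the first new PG and 1 and 3 into the second.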
--- a/mon/mon.js
+++ b/mon/mon.js
@ -59,6 +59,7 @@ const etcd_tree = {
            etcd_mon_timeout: 1000, // ms. min: 0
            etcd_mon_retries: 5, // min: 0
            mon_change_timeout: 1000, // ms. min: 100
+            mon_retry_change_timeout: 50, // ms. min: 50
            mon_stats_timeout: 1000, // ms. min: 100
            osd_out_time: 600, // seconds. min: 0
            placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
@ -110,7 +111,15 @@ const etcd_tree = {
            autosync_interval: 5,
            autosync_writes: 128,
            client_queue_depth: 128, // unused
-            recovery_queue_depth: 4,
+            recovery_queue_depth: 1,
+            recovery_sleep_us: 0,
+            recovery_tune_util_low: 0.1,
+            recovery_tune_client_util_low: 0,
+            recovery_tune_util_high: 1.0,
+            recovery_tune_client_util_high: 0.5,
+            recovery_tune_interval: 1,
+            recovery_tune_agg_interval: 10, // 10 times recovery_tune_interval
+            recovery_tune_sleep_min_us: 10, // 10 microseconds
            recovery_pg_switch: 128,
            recovery_sync_batch: 16,
            no_recovery: false,
@ -490,6 +499,11 @@ class Mon
        {
            this.config.mon_change_timeout = 100;
        }
+        this.config.mon_retry_change_timeout = Number(this.config.mon_retry_change_timeout) || 50;
+        if (this.config.mon_retry_change_timeout < 50)
+        {
+            this.config.mon_retry_change_timeout = 50;
+        }
        this.config.mon_stats_timeout = Number(this.config.mon_stats_timeout) || 1000;
        if (this.config.mon_stats_timeout < 100)
        {
@ -1222,6 +1236,89 @@ class Mon
        return aff_osds;
    }

+    async generate_pool_pgs(pool_id, osd_tree, levels)
+    {
+        const pool_cfg = this.state.config.pools[pool_id];
+        if (!this.validate_pool_cfg(pool_id, pool_cfg, false))
+        {
+            return null;
+        }
+        let pool_tree = osd_tree[pool_cfg.root_node || ''];
+        pool_tree = pool_tree ? pool_tree.children : [];
+        pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
+        this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
+        this.filter_osds_by_block_layout(
+            pool_tree,
+            pool_cfg.block_size || this.config.block_size || 131072,
+            pool_cfg.bitmap_granularity || this.config.bitmap_granularity || 4096,
+            pool_cfg.immediate_commit || this.config.immediate_commit || 'none'
+        );
+        // First try last_clean_pgs to minimize data movement
+        let prev_pgs = [];
+        for (const pg in ((this.state.history.last_clean_pgs.items||{})[pool_id]||{}))
+        {
+            prev_pgs[pg-1] = [ ...this.state.history.last_clean_pgs.items[pool_id][pg].osd_set ];
+        }
+        if (!prev_pgs.length)
+        {
+            // Fall back to config/pgs if last_clean_pgs is empty
+            for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
+            {
+                prev_pgs[pg-1] = [ ...this.state.config.pgs.items[pool_id][pg].osd_set ];
+            }
+        }
+        const old_pg_count = prev_pgs.length;
+        const optimize_cfg = {
+            osd_tree: pool_tree,
+            pg_count: pool_cfg.pg_count,
+            pg_size: pool_cfg.pg_size,
+            pg_minsize: pool_cfg.pg_minsize,
+            max_combinations: pool_cfg.max_osd_combinations,
+            ordered: pool_cfg.scheme != 'replicated',
+        };
+        let optimize_result;
+        // Re-shuffle PGs from scratch if there are no previous PGs or config/pgs.hash is empty
+        if (old_pg_count > 0 && this.state.config.pgs.hash)
+        {
+            if (prev_pgs.length != pool_cfg.pg_count)
+            {
+                // Scale PG count
+                // prev_pgs may be taken from last_clean_pgs, which may still
+                // contain the old number of PGs
+                PGUtil.scale_pg_count(prev_pgs, pool_cfg.pg_count);
+            }
+            for (const pg of prev_pgs)
+            {
+                while (pg.length < pool_cfg.pg_size)
+                {
+                    pg.push(0);
+                }
+            }
+            optimize_result = await LPOptimizer.optimize_change({
+                prev_pgs,
+                ...optimize_cfg,
+            });
+        }
+        else
+        {
+            optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
+        }
+        console.log(`Pool ${pool_id} (${pool_cfg.name || 'unnamed'}):`);
+        LPOptimizer.print_change_stats(optimize_result);
+        const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
+        return {
+            pool_id,
+            pgs: optimize_result.int_pgs,
+            stats: {
+                total_raw_tb: optimize_result.space,
+                pg_real_size: pg_effsize || pool_cfg.pg_size,
+                raw_to_usable: (pg_effsize || pool_cfg.pg_size) / (pool_cfg.scheme === 'replicated'
+                    ? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0))),
+                space_efficiency: optimize_result.space/(optimize_result.total_space||1),
+            },
+        };
+    }
+
    async recheck_pgs()
    {
        if (this.recheck_pgs_active)
@ -1236,158 +1333,47 @@ class Mon
        const { up_osds, levels, osd_tree } = this.get_osd_tree();
        const tree_cfg = {
            osd_tree,
+            levels,
            pools: this.state.config.pools,
        };
        const tree_hash = sha1hex(stableStringify(tree_cfg));
        if (this.state.config.pgs.hash != tree_hash)
        {
            // Something has changed
-            const new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
-            const etcd_request = { compare: [], success: [] };
-            for (const pool_id in (this.state.config.pgs||{}).items||{})
+            console.log('Pool configuration or OSD tree changed, re-optimizing');
+            // First re-optimize PGs, but don't look at history yet
+            const optimize_results = await Promise.all(Object.keys(this.state.config.pools)
+                .map(pool_id => this.generate_pool_pgs(pool_id, osd_tree, levels)));
+            // Then apply the modification in the form of an optimistic transaction,
+            // each time considering new pg/history modifications (OSDs modify it during rebalance)
+            while (!await this.apply_pool_pgs(optimize_results, up_osds, osd_tree, tree_hash))
            {
-                if (!this.state.config.pools[pool_id])
-                {
-                    // Pool deleted. Delete all PGs, but first stop them.
-                    if (!await this.stop_all_pgs(pool_id))
-                    {
-                        this.recheck_pgs_active = false;
-                        this.schedule_recheck();
-                        return;
-                    }
-                    const prev_pgs = [];
-                    for (const pg in this.state.config.pgs.items[pool_id]||{})
-                    {
-                        prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
-                    }
-                    // Also delete pool statistics
-                    etcd_request.success.push({ requestDeleteRange: {
-                        key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
-                    } });
-                    this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
-                }
-            }
-            for (const pool_id in this.state.config.pools)
-            {
-                const pool_cfg = this.state.config.pools[pool_id];
-                if (!this.validate_pool_cfg(pool_id, pool_cfg, false))
-                {
-                    continue;
-                }
-                let pool_tree = osd_tree[pool_cfg.root_node || ''];
-                pool_tree = pool_tree ? pool_tree.children : [];
-                pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
-                this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
-                this.filter_osds_by_block_layout(
-                    pool_tree,
-                    pool_cfg.block_size || this.config.block_size || 131072,
-                    pool_cfg.bitmap_granularity || this.config.bitmap_granularity || 4096,
-                    pool_cfg.immediate_commit || this.config.immediate_commit || 'none'
+                console.log(
+                    'Someone changed PG configuration while we also tried to change it.'+
+                    ' Retrying in '+this.config.mon_retry_change_timeout+' ms'
                );
-                // These are for the purpose of building history.osd_sets
-                const real_prev_pgs = [];
-                let pg_history = [];
-                for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
+                // Failed to apply - parallel change detected. Wait a bit and retry
+                const old_rev = this.etcd_watch_revision;
+                while (this.etcd_watch_revision === old_rev)
                {
-                    real_prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
-                    if (this.state.pg.history[pool_id] &&
-                        this.state.pg.history[pool_id][pg])
-                    {
-                        pg_history[pg-1] = this.state.pg.history[pool_id][pg];
-                    }
+                    await new Promise(ok => setTimeout(ok, this.config.mon_retry_change_timeout));
                }
-                // And these are for the purpose of minimizing data movement
-                let prev_pgs = [];
-                for (const pg in ((this.state.history.last_clean_pgs.items||{})[pool_id]||{}))
-                {
-                    prev_pgs[pg-1] = this.state.history.last_clean_pgs.items[pool_id][pg].osd_set;
-                }
-                prev_pgs = JSON.parse(JSON.stringify(prev_pgs.length ? prev_pgs : real_prev_pgs));
-                const old_pg_count = real_prev_pgs.length;
-                const optimize_cfg = {
-                    osd_tree: pool_tree,
-                    pg_count: pool_cfg.pg_count,
-                    pg_size: pool_cfg.pg_size,
-                    pg_minsize: pool_cfg.pg_minsize,
-                    max_combinations: pool_cfg.max_osd_combinations,
-                    ordered: pool_cfg.scheme != 'replicated',
+                const new_ot = this.get_osd_tree();
+                const new_tcfg = {
+                    osd_tree: new_ot.osd_tree,
+                    levels: new_ot.levels,
+                    pools: this.state.config.pools,
                };
-                let optimize_result;
-                if (old_pg_count > 0)
+                if (sha1hex(stableStringify(new_tcfg)) !== tree_hash)
                {
-                    if (old_pg_count != pool_cfg.pg_count)
-                    {
-                        // PG count changed. Need to bring all PGs down.
-                        if (!await this.stop_all_pgs(pool_id))
-                        {
-                            this.recheck_pgs_active = false;
-                            this.schedule_recheck();
-                            return;
-                        }
-                    }
-                    if (prev_pgs.length != pool_cfg.pg_count)
-                    {
-                        // Scale PG count
-                        // Do it even if old_pg_count is already equal to pool_cfg.pg_count,
-                        // because last_clean_pgs may still contain the old number of PGs
-                        const new_pg_history = [];
-                        PGUtil.scale_pg_count(prev_pgs, real_prev_pgs, pg_history, new_pg_history, pool_cfg.pg_count);
-                        pg_history = new_pg_history;
-                    }
-                    for (const pg of prev_pgs)
-                    {
-                        while (pg.length < pool_cfg.pg_size)
-                        {
-                            pg.push(0);
-                        }
-                    }
-                    if (!this.state.config.pgs.hash)
-                    {
-                        // Re-shuffle PGs
-                        optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
-                    }
-                    else
-                    {
-                        optimize_result = await LPOptimizer.optimize_change({
-                            prev_pgs,
-                            ...optimize_cfg,
-                        });
-                    }
+                    // Configuration actually changed, restart from the beginning
+                    this.recheck_pgs_active = false;
+                    setImmediate(() => this.recheck_pgs().catch(this.die));
+                    return;
                }
-                else
-                {
-                    optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
-                }
-                if (old_pg_count != optimize_result.int_pgs.length)
-                {
-                    console.log(
-                        `PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+
-                        ` changed from: ${old_pg_count} to ${optimize_result.int_pgs.length}`
-                    );
-                    // Drop stats
-                    etcd_request.success.push({ requestDeleteRange: {
-                        key: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'/'),
-                        range_end: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'0'),
-                    } });
-                }
-                LPOptimizer.print_change_stats(optimize_result);
-                const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
-                this.state.pool.stats[pool_id] = {
-                    used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
-                    total_raw_tb: optimize_result.space,
-                    pg_real_size: pg_effsize || pool_cfg.pg_size,
-                    raw_to_usable: (pg_effsize || pool_cfg.pg_size) / (pool_cfg.scheme === 'replicated'
-                        ? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0))),
-                    space_efficiency: optimize_result.space/(optimize_result.total_space||1),
-                };
-                etcd_request.success.push({ requestPut: {
-                    key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
-                    value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
-                } });
-                this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, optimize_result.int_pgs, pg_history);
+                // Configuration didn't change, PG history probably changed, so just retry
            }
-            new_config_pgs.hash = tree_hash;
-            await this.save_pg_config(new_config_pgs, etcd_request);
+            console.log('PG configuration successfully changed');
        }
        else
        {
@ -1434,8 +1420,81 @@ class Mon
        this.recheck_pgs_active = false;
    }

-    async save_pg_config(new_config_pgs, etcd_request = { compare: [], success: [] })
+    async apply_pool_pgs(results, up_osds, osd_tree, tree_hash)
    {
+        for (const pool_id in (this.state.config.pgs||{}).items||{})
+        {
+            // We should stop all PGs when deleting a pool or changing its PG count
+            if (!this.state.config.pools[pool_id] ||
+                this.state.config.pgs.items[pool_id] && this.state.config.pools[pool_id].pg_count !=
+                Object.keys(this.state.config.pgs.items[pool_id]).reduce((a, c) => (a < (0|c) ? (0|c) : a), 0))
+            {
+                if (!await this.stop_all_pgs(pool_id))
+                {
+                    return false;
+                }
+            }
+        }
+        const new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
+        const etcd_request = { compare: [], success: [] };
+        for (const pool_id in (new_config_pgs||{}).items||{})
+        {
+            if (!this.state.config.pools[pool_id])
+            {
+                const prev_pgs = [];
+                for (const pg in new_config_pgs.items[pool_id]||{})
+                {
+                    prev_pgs[pg-1] = new_config_pgs.items[pool_id][pg].osd_set;
+                }
+                // Also delete pool statistics
+                etcd_request.success.push({ requestDeleteRange: {
+                    key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
+                } });
+                this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
+            }
+        }
+        for (const pool_res of results)
+        {
+            const pool_id = pool_res.pool_id;
+            const pool_cfg = this.state.config.pools[pool_id];
+            let pg_history = [];
+            for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
+            {
+                if (this.state.pg.history[pool_id] &&
+                    this.state.pg.history[pool_id][pg])
+                {
+                    pg_history[pg-1] = this.state.pg.history[pool_id][pg];
+                }
+            }
+            const real_prev_pgs = [];
+            for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
+            {
+                real_prev_pgs[pg-1] = [ ...this.state.config.pgs.items[pool_id][pg].osd_set ];
+            }
+            if (real_prev_pgs.length > 0 && real_prev_pgs.length != pool_res.pgs.length)
+            {
+                console.log(
+                    `Changing PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+
+                    ` from: ${real_prev_pgs.length} to ${pool_res.pgs.length}`
+                );
+                pg_history = PGUtil.scale_pg_history(pg_history, real_prev_pgs, pool_res.pgs);
+                // Drop stats
+                etcd_request.success.push({ requestDeleteRange: {
+                    key: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'/'),
+                    range_end: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'0'),
+                } });
+            }
+            const stats = {
+                used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
+                ...pool_res.stats,
+            };
+            etcd_request.success.push({ requestPut: {
+                key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
+                value: b64(JSON.stringify(stats)),
+            } });
+            this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, pool_res.pgs, pg_history);
+        }
+        new_config_pgs.hash = tree_hash;
        etcd_request.compare.push(
            { key: b64(this.etcd_prefix+'/mon/master'), target: 'LEASE', lease: ''+this.etcd_lease_id },
            { key: b64(this.etcd_prefix+'/config/pgs'), target: 'MOD', mod_revision: ''+this.etcd_watch_revision, result: 'LESS' },
@ -1443,14 +1502,8 @@ class Mon
        etcd_request.success.push(
            { requestPut: { key: b64(this.etcd_prefix+'/config/pgs'), value: b64(JSON.stringify(new_config_pgs)) } },
        );
-        const res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0);
-        if (!res.succeeded)
-        {
-            console.log('Someone changed PG configuration while we also tried to change it. Retrying in '+this.config.mon_change_timeout+' ms');
-            this.schedule_recheck();
-            return;
-        }
-        console.log('PG configuration successfully changed');
+        const txn_res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0);
+        return txn_res.succeeded;
    }

    // Schedule next recheck at least at <unixtime>
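The optimistic-transaction retry introduced by `apply_pool_pgs` and the loop in `recheck_pgs` can be reduced to the pattern below. This is a minimal sketch, not the monitor's code: `FakeStore` is a hypothetical in-memory stand-in for a key with a modification revision (it is not the etcd API), and `applyWithRetry`, `retryMs` and `maxAttempts` are illustrative names.

```javascript
// Hypothetical in-memory stand-in for a single key with a modification revision
class FakeStore
{
    constructor()
    {
        this.value = null;
        this.revision = 0;
    }

    // Commit only if nobody modified the key after seenRevision
    txn(seenRevision, newValue)
    {
        if (this.revision > seenRevision)
        {
            return { succeeded: false };
        }
        this.value = newValue;
        this.revision++;
        return { succeeded: true };
    }
}

// Rebuild the change from the current state and retry on conflict
async function applyWithRetry(store, makeValue, retryMs = 50, maxAttempts = 10)
{
    for (let attempt = 1; attempt <= maxAttempts; attempt++)
    {
        const seen = store.revision;
        const value = await makeValue(store.value);
        if (store.txn(seen, value).succeeded)
        {
            return attempt;
        }
        // Parallel change detected: wait a bit and retry
        await new Promise(ok => setTimeout(ok, retryMs));
    }
    throw new Error('Too many conflicting changes');
}
```

Unlike this sketch, the monitor computes the expensive optimization results once and only rebuilds the transaction on each retry; it also restarts from the beginning if the pool configuration or OSD tree hash has changed in the meantime.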
--- a/src/blockstore_impl.cpp
+++ b/src/blockstore_impl.cpp
@ -163,20 +163,10 @@ void blockstore_impl_t::loop()
            }
            else if (op->opcode == BS_OP_SYNC)
            {
-                // wait for all small writes to be submitted
-                // wait for all big writes to complete, submit data device fsync
+                // sync only completed writes?
                // wait for the data device fsync to complete, then submit journal writes for big writes
                // then submit an fsync operation
-                if (has_writes)
-                {
-                    // Can't submit SYNC before previous writes
-                    continue;
-                }
                wr_st = continue_sync(op);
-                if (wr_st != 2)
-                {
-                    has_writes = wr_st > 0 ? 1 : 2;
-                }
            }
            else if (op->opcode == BS_OP_STABLE)
            {
--- a/src/blockstore_impl.h
+++ b/src/blockstore_impl.h
@ -277,6 +277,7 @@ class blockstore_impl_t
    int unsynced_big_write_count = 0, unstable_unsynced = 0;
    int unsynced_queued_ops = 0;
    allocator *data_alloc = NULL;
+    uint64_t used_blocks = 0;
    uint8_t *zero_object;

    void *metadata_buffer = NULL;
@ -430,7 +431,7 @@ public:

    inline uint32_t get_block_size() { return dsk.data_block_size; }
    inline uint64_t get_block_count() { return dsk.block_count; }
-    inline uint64_t get_free_block_count() { return data_alloc->get_free_count(); }
+    inline uint64_t get_free_block_count() { return dsk.block_count - used_blocks; }
    inline uint32_t get_bitmap_granularity() { return dsk.disk_alignment; }
    inline uint64_t get_journal_size() { return dsk.journal_len; }
 };
--- a/src/blockstore_init.cpp
+++ b/src/blockstore_init.cpp
@ -376,6 +376,7 @@ bool blockstore_init_meta::handle_meta_block(uint8_t *buf, uint64_t entries_per_
                else
                {
                    bs->inode_space_stats[entry->oid.inode] += bs->dsk.data_block_size;
+                    bs->used_blocks++;
                }
                entries_loaded++;
 #ifdef BLOCKSTORE_DEBUG
@ -1181,6 +1182,7 @@ void blockstore_init_journal::erase_dirty_object(blockstore_dirty_db_t::iterator
            sp -= bs->dsk.data_block_size;
        else
            bs->inode_space_stats.erase(oid.inode);
+        bs->used_blocks--;
    }
    bs->erase_dirty(dirty_it, dirty_end, clean_loc);
    // Remove it from the flusher's queue, too
--- a/src/blockstore_stable.cpp
+++ b/src/blockstore_stable.cpp
@ -445,6 +445,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
                    if (!exists)
                    {
                        inode_space_stats[dirty_it->first.oid.inode] += dsk.data_block_size;
+                        used_blocks++;
                    }
                    big_to_flush++;
                }
@ -455,6 +456,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
                        sp -= dsk.data_block_size;
                    else
                        inode_space_stats.erase(dirty_it->first.oid.inode);
+                    used_blocks--;
                    big_to_flush++;
                }
            }
--- a/src/cluster_client.cpp
+++ b/src/cluster_client.cpp
@ -705,6 +705,8 @@ resume_1:
        }
        goto resume_2;
    }
+    // Protect from try_send completing the operation immediately
+    op->inflight_count++;
    for (int i = 0; i < op->parts.size(); i++)
    {
        if (!(op->parts[i].flags & PART_SENT))
@ -728,8 +730,10 @@ resume_1:
            }
        }
    }
+    op->inflight_count--;
    if (op->state == 1)
    {
+        // Some suboperations have to be resent
        return 0;
    }
 resume_2:
--- a/src/messenger.h
+++ b/src/messenger.h
@ -149,7 +149,7 @@ public:
    std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
    std::map<uint64_t, int> osd_peer_fds;
    // op statistics
-    osd_op_stats_t stats;
+    osd_op_stats_t stats, recovery_stats;

    void init();
    void parse_config(const json11::Json & config);
@ -175,6 +175,7 @@ public:
    bool connect_rdma(int peer_fd, std::string rdma_address, uint64_t client_max_msg);
 #endif

+    void inc_op_stats(osd_op_stats_t & stats, uint64_t opcode, timespec & tv_begin, timespec & tv_end, uint64_t len);
    void measure_exec(osd_op_t *cur_op);

 protected:
--- a/src/msgr_op.cpp
+++ b/src/msgr_op.cpp
@ -24,3 +24,17 @@ osd_op_t::~osd_op_t()
        free(buf);
    }
 }
+
+bool osd_op_t::is_recovery_related()
+{
+    return (req.hdr.opcode == OSD_OP_SEC_READ ||
+        req.hdr.opcode == OSD_OP_SEC_WRITE ||
+        req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) &&
+        (req.sec_rw.flags & OSD_OP_RECOVERY_RELATED) ||
+        req.hdr.opcode == OSD_OP_SEC_DELETE &&
+        (req.sec_del.flags & OSD_OP_RECOVERY_RELATED) ||
+        req.hdr.opcode == OSD_OP_SEC_STABILIZE &&
+        (req.sec_stab.flags & OSD_OP_RECOVERY_RELATED) ||
+        req.hdr.opcode == OSD_OP_SEC_SYNC &&
+        (req.sec_sync.flags & OSD_OP_RECOVERY_RELATED);
+}
--- a/src/msgr_op.h
+++ b/src/msgr_op.h
@ -173,4 +173,6 @@ struct osd_op_t
    osd_op_buf_list_t iov;

    ~osd_op_t();
+
+    bool is_recovery_related();
 };
--- a/src/msgr_send.cpp
+++ b/src/msgr_send.cpp
@ -131,6 +131,23 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
    }
 }

+void osd_messenger_t::inc_op_stats(osd_op_stats_t & stats, uint64_t opcode, timespec & tv_begin, timespec & tv_end, uint64_t len)
+{
+    uint64_t usecs = (
+        (tv_end.tv_sec - tv_begin.tv_sec)*1000000 +
+        (tv_end.tv_nsec - tv_begin.tv_nsec)/1000
+    );
+    stats.op_stat_count[opcode]++;
+    // Counter wrapped around to 0: restart the statistics for this opcode
+    if (!stats.op_stat_count[opcode])
+    {
+        stats.op_stat_count[opcode] = 1;
+        stats.op_stat_sum[opcode] = 0;
+        stats.op_stat_bytes[opcode] = 0;
+    }
+    stats.op_stat_sum[opcode] += usecs;
+    stats.op_stat_bytes[opcode] += len;
+}
+
 void osd_messenger_t::measure_exec(osd_op_t *cur_op)
 {
    // Measure execution latency
@ -142,29 +159,24 @@ void osd_messenger_t::measure_exec(osd_op_t *cur_op)
    {
        clock_gettime(CLOCK_REALTIME, &cur_op->tv_end);
    }
-    stats.op_stat_count[cur_op->req.hdr.opcode]++;
-    if (!stats.op_stat_count[cur_op->req.hdr.opcode])
-    {
-        stats.op_stat_count[cur_op->req.hdr.opcode]++;
-        stats.op_stat_sum[cur_op->req.hdr.opcode] = 0;
-        stats.op_stat_bytes[cur_op->req.hdr.opcode] = 0;
-    }
-    stats.op_stat_sum[cur_op->req.hdr.opcode] += (
-        (cur_op->tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
-        (cur_op->tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
-    );
+    uint64_t len = 0;
    if (cur_op->req.hdr.opcode == OSD_OP_READ ||
        cur_op->req.hdr.opcode == OSD_OP_WRITE ||
        cur_op->req.hdr.opcode == OSD_OP_SCRUB)
    {
        // req.rw.len is internally set to the full object size for scrubs
-        stats.op_stat_bytes[cur_op->req.hdr.opcode] += cur_op->req.rw.len;
+        len = cur_op->req.rw.len;
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ ||
        cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
        cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
    {
-        stats.op_stat_bytes[cur_op->req.hdr.opcode] += cur_op->req.sec_rw.len;
+        len = cur_op->req.sec_rw.len;
+    }
+    inc_op_stats(stats, cur_op->req.hdr.opcode, cur_op->tv_begin, cur_op->tv_end, len);
+    if (cur_op->is_recovery_related())
+    {
+        inc_op_stats(recovery_stats, cur_op->req.hdr.opcode, cur_op->tv_begin, cur_op->tv_end, len);
    }
 }
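Note: the bookkeeping that `inc_op_stats()` factors out can be tried in isolation. The following is a minimal sketch under stated assumptions, not the code from this commit: `mini_stats_t`, `MINI_OP_MAX`, `elapsed_usec()` and `mini_inc_op_stats()` are hypothetical names standing in for `osd_op_stats_t` and the messenger method. It shows the microsecond latency computed from two `timespec` values and the reset that happens when the op counter wraps to zero.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>

// Hypothetical stand-in for osd_op_stats_t (the real struct is not shown in this diff)
const int MINI_OP_MAX = 4;
struct mini_stats_t
{
    uint64_t op_stat_count[MINI_OP_MAX] = {};
    uint64_t op_stat_sum[MINI_OP_MAX] = {};
    uint64_t op_stat_bytes[MINI_OP_MAX] = {};
};

// Elapsed time in microseconds, same arithmetic as in the diff:
// whole seconds scaled to usec plus the (possibly negative) nanosecond difference
uint64_t elapsed_usec(const timespec & tv_begin, const timespec & tv_end)
{
    return (tv_end.tv_sec - tv_begin.tv_sec)*1000000 +
        (tv_end.tv_nsec - tv_begin.tv_nsec)/1000;
}

// Increment the counters for one opcode; if the counter wraps around to 0,
// restart it at 1 and drop the accumulated sums so averages stay consistent
void mini_inc_op_stats(mini_stats_t & stats, uint64_t opcode,
    const timespec & tv_begin, const timespec & tv_end, uint64_t len)
{
    stats.op_stat_count[opcode]++;
    if (!stats.op_stat_count[opcode])
    {
        stats.op_stat_count[opcode] = 1;
        stats.op_stat_sum[opcode] = 0;
        stats.op_stat_bytes[opcode] = 0;
    }
    stats.op_stat_sum[opcode] += elapsed_usec(tv_begin, tv_end);
    stats.op_stat_bytes[opcode] += len;
}
```

Calling the same helper once for the general counters and once more for a separate recovery-only instance is what lets the tuner later subtract recovery time from total time.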

--- a/src/osd.cpp
+++ b/src/osd.cpp
@ -68,14 +68,21 @@ osd_t::osd_t(const json11::Json & config, ring_loop_t *ringloop)
        }
    }

-    print_stats_timer_id = this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id)
+    if (print_stats_timer_id == -1)
    {
-        print_stats();
-    });
-    slow_log_timer_id = this->tfd->set_timer(slow_log_interval*1000, true, [this](int timer_id)
+        print_stats_timer_id = this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id)
+        {
+            print_stats();
+        });
+    }
+    if (slow_log_timer_id == -1)
    {
-        print_slow();
-    });
+        slow_log_timer_id = this->tfd->set_timer(slow_log_interval*1000, true, [this](int timer_id)
+        {
+            print_slow();
+        });
+    }
+    apply_recovery_tune_interval();

    msgr.tfd = this->tfd;
    msgr.ringloop = this->ringloop;
@ -97,6 +104,11 @@ osd_t::~osd_t()
        tfd->clear_timer(slow_log_timer_id);
        slow_log_timer_id = -1;
    }
+    if (rtune_timer_id >= 0)
+    {
+        tfd->clear_timer(rtune_timer_id);
+        rtune_timer_id = -1;
+    }
    if (print_stats_timer_id >= 0)
    {
        tfd->clear_timer(print_stats_timer_id);
@ -196,6 +208,30 @@ void osd_t::parse_config(bool init)
    recovery_queue_depth = config["recovery_queue_depth"].uint64_value();
    if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE)
        recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
+    recovery_sleep_us = config["recovery_sleep_us"].uint64_value();
+    recovery_tune_util_low = config["recovery_tune_util_low"].is_null()
+        ? 0.1 : config["recovery_tune_util_low"].number_value();
+    if (recovery_tune_util_low < 0.01)
+        recovery_tune_util_low = 0.01;
+    recovery_tune_util_high = config["recovery_tune_util_high"].is_null()
+        ? 1.0 : config["recovery_tune_util_high"].number_value();
+    if (recovery_tune_util_high < 0.01)
+        recovery_tune_util_high = 0.01;
+    recovery_tune_client_util_low = config["recovery_tune_client_util_low"].is_null()
+        ? 0 : config["recovery_tune_client_util_low"].number_value();
+    if (recovery_tune_client_util_low < 0.01)
+        recovery_tune_client_util_low = 0.01;
+    recovery_tune_client_util_high = config["recovery_tune_client_util_high"].is_null()
+        ? 0.5 : config["recovery_tune_client_util_high"].number_value();
+    if (recovery_tune_client_util_high < 0.01)
+        recovery_tune_client_util_high = 0.01;
+    auto old_recovery_tune_interval = recovery_tune_interval;
+    recovery_tune_interval = config["recovery_tune_interval"].is_null()
+        ? 1 : config["recovery_tune_interval"].uint64_value();
+    recovery_tune_agg_interval = config["recovery_tune_agg_interval"].is_null()
+        ? 10 : config["recovery_tune_agg_interval"].uint64_value();
+    recovery_tune_sleep_min_us = config["recovery_tune_sleep_min_us"].is_null()
+        ? 10 : config["recovery_tune_sleep_min_us"].uint64_value();
    recovery_pg_switch = config["recovery_pg_switch"].uint64_value();
    if (recovery_pg_switch < 1)
        recovery_pg_switch = DEFAULT_RECOVERY_PG_SWITCH;
@ -274,6 +310,10 @@ void osd_t::parse_config(bool init)
            print_slow();
        });
    }
+    if (old_recovery_tune_interval != recovery_tune_interval)
+    {
+        apply_recovery_tune_interval();
+    }
 }

 void osd_t::bind_socket()
@ -421,14 +461,6 @@ void osd_t::exec_op(osd_op_t *cur_op)
    }
 }

-void osd_t::reset_stats()
-{
-    msgr.stats = {};
-    prev_stats = {};
-    memset(recovery_stat_count, 0, sizeof(recovery_stat_count));
-    memset(recovery_stat_bytes, 0, sizeof(recovery_stat_bytes));
-}
-
 void osd_t::print_stats()
 {
    for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
@ -466,19 +498,20 @@ void osd_t::print_stats()
    }
    for (int i = 0; i < 2; i++)
    {
-        if (recovery_stat_count[0][i] != recovery_stat_count[1][i])
+        if (recovery_stat[i].count > recovery_print_prev[i].count)
        {
-            uint64_t bw = (recovery_stat_bytes[0][i] - recovery_stat_bytes[1][i]) / print_stats_interval;
+            uint64_t bw = (recovery_stat[i].bytes - recovery_print_prev[i].bytes) / print_stats_interval;
            printf(
-                "[OSD %lu] %s recovery: %.1f op/s, B/W: %.2f %s\n", osd_num, recovery_stat_names[i],
-                (recovery_stat_count[0][i] - recovery_stat_count[1][i]) * 1.0 / print_stats_interval,
+                "[OSD %lu] %s recovery: %.1f op/s, B/W: %.2f %s, avg latency %ld us, delay %ld us\n", osd_num, recovery_stat_names[i],
+                (recovery_stat[i].count - recovery_print_prev[i].count) * 1.0 / print_stats_interval,
                (bw > 1024*1024*1024 ? bw/1024.0/1024/1024 : (bw > 1024*1024 ? bw/1024.0/1024 : bw/1024.0)),
-                (bw > 1024*1024*1024 ? "GB/s" : (bw > 1024*1024 ? "MB/s" : "KB/s"))
+                (bw > 1024*1024*1024 ? "GB/s" : (bw > 1024*1024 ? "MB/s" : "KB/s")),
+                (recovery_stat[i].usec - recovery_print_prev[i].usec) / (recovery_stat[i].count - recovery_print_prev[i].count),
+                recovery_target_sleep_us
            );
-            recovery_stat_count[1][i] = recovery_stat_count[0][i];
-            recovery_stat_bytes[1][i] = recovery_stat_bytes[0][i];
        }
    }
+    memcpy(recovery_print_prev, recovery_stat, sizeof(recovery_stat));
    if (corrupted_objects > 0)
    {
        printf("[OSD %lu] %lu object(s) corrupted\n", osd_num, corrupted_objects);
@ -572,8 +605,8 @@ void osd_t::print_slow()
                    op->req.hdr.opcode == OSD_OP_SEC_STABILIZE || op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK ||
                    op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
                {
-                    bufprintf(" state=%d", PRIV(op->bs_op)->op_state);
-                    int wait_for = PRIV(op->bs_op)->wait_for;
+                    bufprintf(" state=%d", op->bs_op ? PRIV(op->bs_op)->op_state : -1);
+                    int wait_for = op->bs_op ? PRIV(op->bs_op)->wait_for : 0;
                    if (wait_for)
                    {
                        bufprintf(" wait=%d (detail=%lu)", wait_for, PRIV(op->bs_op)->wait_detail);
--- a/src/osd.h
+++ b/src/osd.h
@ -34,7 +34,7 @@
 #define DEFAULT_AUTOSYNC_INTERVAL 5
 #define DEFAULT_AUTOSYNC_WRITES 128
 #define MAX_RECOVERY_QUEUE 2048
-#define DEFAULT_RECOVERY_QUEUE 4
+#define DEFAULT_RECOVERY_QUEUE 1
 #define DEFAULT_RECOVERY_PG_SWITCH 128
 #define DEFAULT_RECOVERY_BATCH 16

@ -87,6 +87,11 @@ struct osd_chain_read_t

 struct osd_rmw_stripe_t;

+struct recovery_stat_t
+{
+    uint64_t count, usec, bytes;
+};
+
 class osd_t
 {
    // config
@ -111,7 +116,15 @@ class osd_t
    int immediate_commit = IMMEDIATE_NONE;
    int autosync_interval = DEFAULT_AUTOSYNC_INTERVAL; // "emergency" sync every 5 seconds
    int autosync_writes = DEFAULT_AUTOSYNC_WRITES;
-    int recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
+    uint64_t recovery_queue_depth = 1;
+    uint64_t recovery_sleep_us = 0;
+    double recovery_tune_util_low = 0.1;
+    double recovery_tune_client_util_low = 0;
+    double recovery_tune_util_high = 1.0;
+    double recovery_tune_client_util_high = 0.5;
+    int recovery_tune_interval = 1;
+    int recovery_tune_agg_interval = 10;
+    int recovery_tune_sleep_min_us = 10;
    int recovery_pg_switch = DEFAULT_RECOVERY_PG_SWITCH;
    int recovery_sync_batch = DEFAULT_RECOVERY_BATCH;
    int inode_vanish_time = 60;
@ -189,8 +202,18 @@ class osd_t
    std::map<uint64_t, inode_stats_t> inode_stats;
    std::map<uint64_t, timespec> vanishing_inodes;
    const char* recovery_stat_names[2] = { "degraded", "misplaced" };
-    uint64_t recovery_stat_count[2][2] = {};
-    uint64_t recovery_stat_bytes[2][2] = {};
+    recovery_stat_t recovery_stat[2];
+    recovery_stat_t recovery_print_prev[2];
+
+    // recovery auto-tuning
+    int rtune_timer_id = -1;
+    uint64_t rtune_avg_lat = 0;
+    double rtune_client_util = 0, rtune_target_util = 1;
+    osd_op_stats_t rtune_prev_stats, rtune_prev_recovery_stats;
+    std::vector<uint64_t> recovery_target_sleep_items;
+    uint64_t recovery_target_sleep_us = 0;
+    uint64_t recovery_target_sleep_total = 0;
+    int recovery_target_sleep_cur = 0, recovery_target_sleep_count = 0;

    // cluster connection
    void parse_config(bool init);
@ -208,8 +231,9 @@ class osd_t
    void create_osd_state();
    void renew_lease(bool reload);
    void print_stats();
+    void tune_recovery();
+    void apply_recovery_tune_interval();
    void print_slow();
-    void reset_stats();
    json11::Json get_statistics();
    void report_statistics();
    void report_pg_state(pg_t & pg);
@ -238,6 +262,7 @@ class osd_t
    bool submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t *fb, bool rollback, osd_num_t peer_osd, int count, obj_ver_id *data);
    bool pick_next_recovery(osd_recovery_op_t &op);
    void submit_recovery_op(osd_recovery_op_t *op);
+    void finish_recovery_op(osd_recovery_op_t *op);
    bool continue_recovery();
    pg_osd_set_state_t* change_osd_set(pg_osd_set_state_t *st, pg_t *pg);

@ -279,7 +304,7 @@ class osd_t
    bool remember_unstable_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t & loc_set, int base_state);
    void handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op);
    void handle_primary_bs_subop(osd_op_t *subop);
-    void add_bs_subop_stats(osd_op_t *subop);
+    void add_bs_subop_stats(osd_op_t *subop, bool recovery_related = false);
    void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval);

    void submit_primary_subops(int submit_type, uint64_t op_version, const uint64_t* osd_set, osd_op_t *cur_op);
--- a/src/osd_cluster.cpp
+++ b/src/osd_cluster.cpp
@ -213,12 +213,14 @@ json11::Json osd_t::get_statistics()
    st["subop_stats"] = subop_stats;
    st["recovery_stats"] = json11::Json::object {
        { recovery_stat_names[0], json11::Json::object {
-            { "count", recovery_stat_count[0][0] },
-            { "bytes", recovery_stat_bytes[0][0] },
+            { "count", recovery_stat[0].count },
+            { "bytes", recovery_stat[0].bytes },
+            { "usec", recovery_stat[0].usec },
        } },
        { recovery_stat_names[1], json11::Json::object {
-            { "count", recovery_stat_count[0][1] },
-            { "bytes", recovery_stat_bytes[0][1] },
+            { "count", recovery_stat[1].count },
+            { "bytes", recovery_stat[1].bytes },
+            { "usec", recovery_stat[1].usec },
        } },
    };
    return st;
--- a/src/osd_flush.cpp
+++ b/src/osd_flush.cpp
@ -325,26 +325,129 @@ void osd_t::submit_recovery_op(osd_recovery_op_t *op)
        {
            printf("Recovery operation done for %lx:%lx\n", op->oid.inode, op->oid.stripe);
        }
-        // CAREFUL! op = &recovery_ops[op->oid]. Don't access op->* after recovery_ops.erase()
-        op->osd_op = NULL;
-        recovery_ops.erase(op->oid);
-        delete osd_op;
-        if (immediate_commit != IMMEDIATE_ALL)
-        {
-            recovery_done++;
-            if (recovery_done >= recovery_sync_batch)
-            {
-                // Force sync every <recovery_sync_batch> operations
-                // This is required not to pile up an excessive amount of delete operations
-                autosync();
-                recovery_done = 0;
-            }
-        }
-        continue_recovery();
+        finish_recovery_op(op);
    };
    exec_op(op->osd_op);
 }

+void osd_t::apply_recovery_tune_interval()
+{
+    if (rtune_timer_id >= 0)
+    {
+        tfd->clear_timer(rtune_timer_id);
+        rtune_timer_id = -1;
+    }
+    if (recovery_tune_interval != 0)
+    {
+        rtune_timer_id = this->tfd->set_timer(recovery_tune_interval*1000, true, [this](int timer_id)
+        {
+            tune_recovery();
+        });
+    }
+    else
+    {
+        recovery_target_sleep_us = recovery_sleep_us;
+    }
+}
+
+void osd_t::finish_recovery_op(osd_recovery_op_t *op)
+{
+    // CAREFUL! op = &recovery_ops[op->oid]. Don't access op->* after recovery_ops.erase()
+    delete op->osd_op;
+    op->osd_op = NULL;
+    recovery_ops.erase(op->oid);
+    if (immediate_commit != IMMEDIATE_ALL)
+    {
+        recovery_done++;
+        if (recovery_done >= recovery_sync_batch)
+        {
+            // Force sync every <recovery_sync_batch> operations
+            // This is required not to pile up an excessive amount of delete operations
+            autosync();
+            recovery_done = 0;
+        }
+    }
+    continue_recovery();
+}
+
+void osd_t::tune_recovery()
+{
+    static int accounted_ops[] = {
+        OSD_OP_SEC_READ, OSD_OP_SEC_WRITE, OSD_OP_SEC_WRITE_STABLE,
+        OSD_OP_SEC_STABILIZE, OSD_OP_SEC_SYNC, OSD_OP_SEC_DELETE
+    };
+    uint64_t total_client_usec = 0, total_recovery_usec = 0, recovery_count = 0;
+    for (int i = 0; i < sizeof(accounted_ops)/sizeof(accounted_ops[0]); i++)
+    {
+        total_client_usec += (msgr.stats.op_stat_sum[accounted_ops[i]]
+            - rtune_prev_stats.op_stat_sum[accounted_ops[i]]);
+        total_recovery_usec += (msgr.recovery_stats.op_stat_sum[accounted_ops[i]]
+            - rtune_prev_recovery_stats.op_stat_sum[accounted_ops[i]]);
+        recovery_count += (msgr.recovery_stats.op_stat_count[accounted_ops[i]]
+            - rtune_prev_recovery_stats.op_stat_count[accounted_ops[i]]);
+        rtune_prev_stats.op_stat_sum[accounted_ops[i]] = msgr.stats.op_stat_sum[accounted_ops[i]];
+        rtune_prev_recovery_stats.op_stat_sum[accounted_ops[i]] = msgr.recovery_stats.op_stat_sum[accounted_ops[i]];
+        rtune_prev_recovery_stats.op_stat_count[accounted_ops[i]] = msgr.recovery_stats.op_stat_count[accounted_ops[i]];
+    }
+    total_client_usec -= total_recovery_usec;
+    if (recovery_count == 0)
+    {
+        return;
+    }
+    // example:
+    // total 3 GB/s
+    // recovery queue 1
+    // 120 OSDs
+    // EC 5+3
+    // 128kb block_size => 640kb object
+    // 3000*1024/640/120 = 40 MB/s per OSD = 64 recovered objects per OSD
+    //   = 64*8*2 subops = 1024 recovery subop iops
+    // 8 recovery subop queue
+    // => subop avg latency = 0.0078125 sec
+    // utilisation = 8
+    // target util 1
+    // intuitively target latency should be 8x of real
+    // target_lat = rtune_avg_lat * utilisation / target_util
+    //            = rtune_avg_lat * rtune_avg_lat * rtune_avg_iops / target_util
+    //            = 0.0625
+    // recovery utilisation will be 1
+    rtune_client_util = total_client_usec/1000000.0/recovery_tune_interval;
+    rtune_target_util = (rtune_client_util < recovery_tune_client_util_low
+        ? recovery_tune_util_high
+        : recovery_tune_util_low + (rtune_client_util >= recovery_tune_client_util_high
+            ? 0 : (recovery_tune_util_high-recovery_tune_util_low)*
+                (recovery_tune_client_util_high-rtune_client_util)/(recovery_tune_client_util_high-recovery_tune_client_util_low)
+        )
+    );
+    rtune_avg_lat = total_recovery_usec/recovery_count;
+    uint64_t target_lat = rtune_avg_lat * rtune_avg_lat/1000000.0 * recovery_count/recovery_tune_interval / rtune_target_util;
+    auto sleep_us = target_lat > rtune_avg_lat+recovery_tune_sleep_min_us ? target_lat-rtune_avg_lat : 0;
+    if (recovery_target_sleep_items.size() != recovery_tune_agg_interval)
+    {
+        recovery_target_sleep_items.resize(recovery_tune_agg_interval);
+        for (int i = 0; i < recovery_tune_agg_interval; i++)
+            recovery_target_sleep_items[i] = 0;
+        recovery_target_sleep_total = 0;
+        recovery_target_sleep_cur = 0;
+        recovery_target_sleep_count = 0;
+    }
+    recovery_target_sleep_total -= recovery_target_sleep_items[recovery_target_sleep_cur];
+    recovery_target_sleep_items[recovery_target_sleep_cur] = sleep_us;
+    recovery_target_sleep_cur = (recovery_target_sleep_cur+1) % recovery_tune_agg_interval;
+    recovery_target_sleep_total += sleep_us;
+    if (recovery_target_sleep_count < recovery_tune_agg_interval)
+        recovery_target_sleep_count++;
+    recovery_target_sleep_us = recovery_target_sleep_total / recovery_target_sleep_count;
+    if (log_level > 4)
+    {
+        printf(
+            "[OSD %lu] auto-tune: client util: %.2f, recovery util: %.2f, lat: %lu us -> target util %.2f, delay %lu us\n",
+            osd_num, rtune_client_util, total_recovery_usec/1000000.0/recovery_tune_interval,
+            rtune_avg_lat, rtune_target_util, recovery_target_sleep_us
+        );
+    }
+}
+
 // Just trigger write requests for degraded objects. They'll be recovered during writing
 bool osd_t::continue_recovery()
 {
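Note: the arithmetic in `tune_recovery()` above can be restated as a standalone sketch. This is an illustration under assumptions, not the commit's code: `tune_params_t`, `target_util()`, `recovery_sleep_us()` and `sleep_avg_t` are hypothetical names, and the parameter defaults are copied from the `osd.h` hunk without the 0.01 clamping done in `parse_config()`. The target utilisation is interpolated linearly between the high and low bounds as client utilisation grows, the target latency is the average latency scaled by current utilisation over target utilisation, and the resulting delay is smoothed over the last N tuning rounds.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical parameter bundle mirroring the recovery_tune_* settings
struct tune_params_t
{
    double util_low = 0.1, util_high = 1.0;
    double client_util_low = 0, client_util_high = 0.5;
    uint64_t sleep_min_us = 10;
    uint64_t interval_sec = 1;
};

// Idle clients => util_high, busy clients => util_low, linear in between
double target_util(const tune_params_t & p, double client_util)
{
    if (client_util < p.client_util_low)
        return p.util_high;
    if (client_util >= p.client_util_high)
        return p.util_low;
    return p.util_low + (p.util_high - p.util_low) *
        (p.client_util_high - client_util) / (p.client_util_high - p.client_util_low);
}

// target_lat = avg_lat * utilisation / target_util,
// where utilisation = avg_lat (in seconds) * recovery ops per second
uint64_t recovery_sleep_us(const tune_params_t & p, uint64_t total_recovery_usec,
    uint64_t recovery_count, double tgt_util)
{
    if (!recovery_count)
        return 0;
    uint64_t avg_lat = total_recovery_usec / recovery_count;
    uint64_t target_lat = static_cast<uint64_t>(
        avg_lat * (avg_lat / 1000000.0) * recovery_count / p.interval_sec / tgt_util
    );
    return target_lat > avg_lat + p.sleep_min_us ? target_lat - avg_lat : 0;
}

// Moving average over the last n values (n must be >= 1), kept in a ring buffer
struct sleep_avg_t
{
    std::vector<uint64_t> items;
    uint64_t total = 0;
    size_t cur = 0, count = 0;
    explicit sleep_avg_t(size_t n): items(n, 0) {}
    uint64_t push(uint64_t sleep_us)
    {
        total -= items[cur];
        items[cur] = sleep_us;
        cur = (cur+1) % items.size();
        total += sleep_us;
        if (count < items.size())
            count++;
        return total / count;
    }
};
```

With the numbers from the comment in the diff (1024 recovery subops per second at roughly 7.8 ms each, target utilisation 1), the sketch yields a target latency of about 62.5 ms, so the added delay is that value minus the measured average latency.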
--- a/src/osd_ops.h
+++ b/src/osd_ops.h
@ -34,6 +34,7 @@
 #define OSD_OP_MAX                  18
 #define OSD_RW_MAX                  64*1024*1024
 #define OSD_PROTOCOL_VERSION        1
+#define OSD_OP_RECOVERY_RELATED     (uint32_t)1

 // Memory alignment for direct I/O (usually 512 bytes)
 #ifndef DIRECT_IO_ALIGNMENT
@ -88,7 +89,8 @@ struct __attribute__((__packed__)) osd_op_sec_rw_t
    uint32_t len;
    // bitmap/attribute length - bitmap comes after header, but before data
    uint32_t attr_len;
-    uint32_t pad0;
+    // the only possible flag is OSD_OP_RECOVERY_RELATED
+    uint32_t flags;
 };

 struct __attribute__((__packed__)) osd_reply_sec_rw_t
@ -109,6 +111,9 @@ struct __attribute__((__packed__)) osd_op_sec_del_t
    object_id oid;
    // delete version (automatic or specific)
    uint64_t version;
+    // the only possible flag is OSD_OP_RECOVERY_RELATED
+    uint32_t flags;
+    uint32_t pad0;
 };

 struct __attribute__((__packed__)) osd_reply_sec_del_t
@ -121,6 +126,9 @@ struct __attribute__((__packed__)) osd_reply_sec_del_t
 struct __attribute__((__packed__)) osd_op_sec_sync_t
 {
    osd_op_header_t header;
+    // the only possible flag is OSD_OP_RECOVERY_RELATED
+    uint32_t flags;
+    uint32_t pad0;
 };

 struct __attribute__((__packed__)) osd_reply_sec_sync_t
@ -134,6 +142,9 @@ struct __attribute__((__packed__)) osd_op_sec_stab_t
    osd_op_header_t header;
    // obj_ver_id array length in bytes
    uint64_t len;
+    // the only possible flag is OSD_OP_RECOVERY_RELATED
+    uint32_t flags;
+    uint32_t pad0;
 };
 typedef osd_op_sec_stab_t osd_op_sec_rollback_t;

--- a/src/osd_primary_subops.cpp
+++ b/src/osd_primary_subops.cpp
@ -3,13 +3,15 @@

 #include "osd_primary.h"

+#define SELF_FD -1
+
 void osd_t::autosync()
 {
    if (immediate_commit != IMMEDIATE_ALL && !autosync_op)
    {
        autosync_op = new osd_op_t();
        autosync_op->op_type = OSD_OP_IN;
-        autosync_op->peer_fd = -1;
+        autosync_op->peer_fd = SELF_FD;
        autosync_op->req = (osd_any_op_t){
            .sync = {
                .header = {
@ -85,9 +87,13 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
    cur_op->reply.hdr.id = cur_op->req.hdr.id;
    cur_op->reply.hdr.opcode = cur_op->req.hdr.opcode;
    cur_op->reply.hdr.retval = retval;
-    if (cur_op->peer_fd == -1)
+    if (cur_op->peer_fd == SELF_FD)
    {
-        msgr.measure_exec(cur_op);
+        // Do not include internal primary writes (recovery/rebalance) into client op statistics
+        if (cur_op->req.hdr.opcode != OSD_OP_WRITE)
+        {
+            msgr.measure_exec(cur_op);
+        }
        // Copy lambda to be unaffected by `delete op`
        std::function<void(osd_op_t*)>(cur_op->callback)(cur_op);
    }
@ -215,6 +221,7 @@ int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t o
                    .offset = wr ? si->write_start : si->read_start,
                    .len = subop_len,
                    .attr_len = wr ? clean_entry_bitmap_size : 0,
+                    .flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
                };
 #ifdef OSD_DEBUG
                printf(
@ -294,7 +301,8 @@ void osd_t::handle_primary_bs_subop(osd_op_t *subop)
            " retval = "+std::to_string(bs_op->retval)+")"
        );
    }
-    add_bs_subop_stats(subop);
+    bool recovery_related = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB;
+    add_bs_subop_stats(subop, recovery_related);
    subop->req.hdr.opcode = bs_op_to_osd_op[bs_op->opcode];
    subop->reply.hdr.retval = bs_op->retval;
    if (bs_op->opcode == BS_OP_READ || bs_op->opcode == BS_OP_WRITE || bs_op->opcode == BS_OP_WRITE_STABLE)
@ -306,30 +314,33 @@ void osd_t::handle_primary_bs_subop(osd_op_t *subop)
    }
    delete bs_op;
    subop->bs_op = NULL;
-    subop->peer_fd = -1;
-    handle_primary_subop(subop, cur_op);
+    subop->peer_fd = SELF_FD;
+    if (recovery_related && recovery_target_sleep_us)
+    {
+        tfd->set_timer_us(recovery_target_sleep_us, false, [=](int timer_id)
+        {
+            handle_primary_subop(subop, cur_op);
+        });
+    }
+    else
+    {
+        handle_primary_subop(subop, cur_op);
+    }
 }

-void osd_t::add_bs_subop_stats(osd_op_t *subop)
+void osd_t::add_bs_subop_stats(osd_op_t *subop, bool recovery_related)
 {
    // Include local blockstore ops in statistics
    uint64_t opcode = bs_op_to_osd_op[subop->bs_op->opcode];
    timespec tv_end;
    clock_gettime(CLOCK_REALTIME, &tv_end);
-    msgr.stats.op_stat_count[opcode]++;
-    if (!msgr.stats.op_stat_count[opcode])
+    uint64_t len = (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE)
+        ? subop->bs_op->len : 0;
+    msgr.inc_op_stats(msgr.stats, opcode, subop->tv_begin, tv_end, len);
+    if (recovery_related)
    {
-        msgr.stats.op_stat_count[opcode] = 1;
-        msgr.stats.op_stat_sum[opcode] = 0;
-        msgr.stats.op_stat_bytes[opcode] = 0;
-    }
-    msgr.stats.op_stat_sum[opcode] += (
-        (tv_end.tv_sec - subop->tv_begin.tv_sec)*1000000 +
-        (tv_end.tv_nsec - subop->tv_begin.tv_nsec)/1000
-    );
-    if (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE)
-    {
-        msgr.stats.op_stat_bytes[opcode] += subop->bs_op->len;
+        // It is OSD_OP_RECOVERY_RELATED
+        msgr.inc_op_stats(msgr.recovery_stats, opcode, subop->tv_begin, tv_end, len);
    }
 }

@ -552,6 +563,7 @@ void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_
                },
                .oid = chunk.oid,
                .version = chunk.version,
+                .flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
            } };
            subops[i].callback = [cur_op, this](osd_op_t *subop)
            {
@ -609,6 +621,7 @@ int osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
                    .id = msgr.next_subop_id++,
                    .opcode = OSD_OP_SEC_SYNC,
                },
+                .flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
            } };
            subops[i].callback = [cur_op, this](osd_op_t *subop)
            {
@ -668,6 +681,7 @@ void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
                    .opcode = OSD_OP_SEC_STABILIZE,
                },
                .len = (uint64_t)(stab_osd.len * sizeof(obj_ver_id)),
+                .flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
            } };
            subops[i].iov.push_back(op_data->unstable_writes + stab_osd.start, stab_osd.len * sizeof(obj_ver_id));
            subops[i].callback = [cur_op, this](osd_op_t *subop)
--- a/src/osd_primary_write.cpp
+++ b/src/osd_primary_write.cpp
@ -292,16 +292,26 @@ resume_7:
    {
        {
            int recovery_type = op_data->object_state->state & (OBJ_DEGRADED|OBJ_INCOMPLETE) ? 0 : 1;
-            recovery_stat_count[0][recovery_type]++;
-            if (!recovery_stat_count[0][recovery_type])
+            recovery_stat[recovery_type].count++;
+            if (!recovery_stat[recovery_type].count) // wrapped
            {
-                recovery_stat_count[0][recovery_type]++;
-                recovery_stat_bytes[0][recovery_type] = 0;
+                memset(&recovery_print_prev[recovery_type], 0, sizeof(recovery_print_prev[recovery_type]));
+                memset(&recovery_stat[recovery_type], 0, sizeof(recovery_stat[recovery_type]));
+                recovery_stat[recovery_type].count++;
            }
            for (int role = 0; role < (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size); role++)
            {
-                recovery_stat_bytes[0][recovery_type] += op_data->stripes[role].write_end - op_data->stripes[role].write_start;
+                recovery_stat[recovery_type].bytes += op_data->stripes[role].write_end - op_data->stripes[role].write_start;
            }
+            if (!cur_op->tv_end.tv_sec)
+            {
+                clock_gettime(CLOCK_REALTIME, &cur_op->tv_end);
+            }
+            uint64_t usec = (
+                (cur_op->tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
+                (cur_op->tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
+            );
+            recovery_stat[recovery_type].usec += usec;
        }
        // Any kind of a non-clean object can have extra chunks, because we don't record objects
        // as degraded & misplaced or incomplete & misplaced at the same time. So try to remove extra chunks
--- a/src/osd_secondary.cpp
+++ b/src/osd_secondary.cpp
@ -42,7 +42,21 @@ void osd_t::secondary_op_callback(osd_op_t *op)
    int retval = op->bs_op->retval;
    delete op->bs_op;
    op->bs_op = NULL;
-    finish_op(op, retval);
+    if (op->is_recovery_related() && recovery_target_sleep_us)
+    {
+        if (!op->tv_end.tv_sec)
+        {
+            clock_gettime(CLOCK_REALTIME, &op->tv_end);
+        }
+        tfd->set_timer_us(recovery_target_sleep_us, false, [this, op, retval](int timer_id)
+        {
+            finish_op(op, retval);
+        });
+    }
+    else
+    {
+        finish_op(op, retval);
+    }
 }

 void osd_t::exec_secondary(osd_op_t *cur_op)
--- a/tests/run_3osds.sh
+++ b/tests/run_3osds.sh
@ -19,10 +19,10 @@ fi

 if [ "$IMMEDIATE_COMMIT" != "" ]; then
    NO_SAME="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all --log_level 10 --etcd_stats_interval 5"
-    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"immediate_commit":"all","client_enable_writeback":true}'
+    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"recovery_tune_util_low":1,"osd_out_time":1,"immediate_commit":"all","client_enable_writeback":true}'
 else
    NO_SAME="--journal_sector_buffer_count 1024 --log_level 10 --etcd_stats_interval 5"
-    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"client_enable_writeback":true}'
+    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"recovery_tune_util_low":1,"osd_out_time":1,"client_enable_writeback":true}'
 fi

 start_osd_on()
@ -53,7 +53,7 @@ for i in $(seq 1 $OSD_COUNT); do
    start_osd $i
 done

-(while true; do node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) &>./testdata/mon.log &
+(while true; do node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) >>./testdata/mon.log 2>&1 &
 MON_PID=$!

 if [ "$SCHEME" = "ec" ]; then
--- a/tests/test_change_pg_count.sh
+++ b/tests/test_change_pg_count.sh
@ -18,6 +18,7 @@ try_change()
    for i in {1..6}; do
        echo --- Change PG count to $n --- >>testdata/osd$i.log
    done
+    echo --- Change PG count to $n --- >>testdata/mon.log

    $ETCDCTL put /vitastor/config/pools '{"1":{'$POOLCFG',"pg_size":'$PG_SIZE',"pg_minsize":'$PG_MINSIZE',"pg_count":'$n'}}'

--- a/tests/test_failure_domain.sh
+++ b/tests/test_failure_domain.sh
@ -15,7 +15,7 @@ $ETCDCTL put /vitastor/osd/stats/7 '{"host":"host4","size":1073741824,"time":"'$
 $ETCDCTL put /vitastor/osd/stats/8 '{"host":"host4","size":1073741824,"time":"'$TIME'"}'
 $ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":4,"failure_domain":"rack"}}'

-node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
+node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" >>./testdata/mon.log 2>&1 &
 MON_PID=$!

 sleep 2
--- a/tests/test_move_reappear.sh
+++ b/tests/test_move_reappear.sh
@ -7,7 +7,7 @@ OSD_COUNT=5
 OSD_ARGS="$OSD_ARGS"
 for i in $(seq 1 $OSD_COUNT); do
    dd if=/dev/zero of=./testdata/test_osd$i.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
-    build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 --etcd_stats_interval 5 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 &
+    build/src/vitastor-osd --log_level 10 --osd_num $i --bind_address 127.0.0.1 --etcd_stats_interval 5 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 &
    eval OSD${i}_PID=$!
 done

@ -53,6 +53,11 @@ for i in {1..30}; do
    fi
 done

+# Sync so all moved objects are removed from OSD 1 (they aren't removed without a sync)
+LD_PRELOAD="build/src/libfio_vitastor.so" \
+fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -iodepth=1 -fsync=1 -number_ios=2 -rw=write \
+    -etcd=$ETCD_URL -pool=1 -inode=2 -size=32M -cluster_log_level=10
+
 $ETCDCTL put /vitastor/config/pgs '{"items":{"1":{"1":{"osd_set":[4,5],"primary":0}}}}'

 $ETCDCTL put /vitastor/pg/history/1/1 '{"all_peers":[1,2,3]}'
--- a/tests/test_parity_change.sh
+++ b/tests/test_parity_change.sh
@ -0,0 +1,54 @@
+#!/bin/bash -ex
+# Test changing EC 4+1 into EC 4+3
+
+OSD_COUNT=7
+PG_COUNT=16
+SCHEME=ec
+PG_SIZE=5
+PG_DATA_SIZE=4
+PG_MINSIZE=5
+
+. `dirname $0`/run_3osds.sh
+
+try_change()
+{
+    n=$1
+    s=$2
+
+    for i in {1..10}; do
+        ($ETCDCTL get /vitastor/config/pgs --print-value-only |\
+            jq -s -e '(.[0].items["1"] | map(  ([ .osd_set[] | select(. != 0) ] | length) == '$s'  ) | length == '$n')
+                and ([ .[0].items["1"] | map(.osd_set)[][] ] | sort | unique == ["1","2","3","4","5","6","7"])') && \
+            ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == '$n'') && \
+            break
+        sleep 1
+    done
+
+    if ! ($ETCDCTL get /vitastor/config/pgs --print-value-only |\
+        jq -s -e '(.[0].items["1"] | map(  ([ .osd_set[] | select(. != 0) ] | length) == '$s'  ) | length == '$n')
+            and ([ .[0].items["1"] | map(.osd_set)[][] ] | sort | unique == ["1","2","3","4","5","6","7"])'); then
+        $ETCDCTL get /vitastor/config/pgs
+        $ETCDCTL get --prefix /vitastor/pg/state/
+        format_error "FAILED: PG SIZE NOT CHANGED OR SOME OSDS DO NOT HAVE PGS"
+    fi
+
+    if ! ($ETCDCTL get --prefix /vitastor/pg/state/ --print-value-only | jq -s -e '([ .[] | select(.state == ["active"]) ] | length) == '$n); then
+        $ETCDCTL get /vitastor/config/pgs
+        $ETCDCTL get --prefix /vitastor/pg/state/
+        format_error "FAILED: PGS NOT UP AFTER PG SIZE CHANGE"
+    fi
+}
+
+LD_PRELOAD="build/src/libfio_vitastor.so" \
+    fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=1M -direct=1 -iodepth=4 \
+        -rw=write -etcd=$ETCD_URL -pool=1 -inode=1 -size=128M -runtime=10
+
+PG_SIZE=7
+POOLCFG='"name":"testpool","failure_domain":"osd","scheme":"ec","parity_chunks":'$((PG_SIZE-PG_DATA_SIZE))
+$ETCDCTL put /vitastor/config/pools '{"1":{'$POOLCFG',"pg_size":'$PG_SIZE',"pg_minsize":'$PG_MINSIZE',"pg_count":'$PG_COUNT'}}'
+
+sleep 2
+
+try_change 16 7
+
+format_green OK
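
Note on `try_change` in tests/test_parity_change.sh: the function polls etcd up to 10 times with a one-second pause, then repeats the same `jq` checks once more to decide whether to fail. The following is a minimal sketch of how that poll-then-verify pattern could be factored into a single helper. `retry_check` is a hypothetical name and is not part of the Vitastor test suite; the sketch assumes the condition can be wrapped in a command that returns a zero exit status on success.

```shell
# Hypothetical helper, not part of the Vitastor test suite.
# Usage: retry_check ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry_check()
{
    attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$attempts" ]; do
        # Calling the command inside `if` keeps `set -e` from aborting on failure
        if "$@"; then
            return 0
        fi
        # Do not sleep after the last failed attempt
        if [ "$attempt" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

With such a helper, each check would be written once, for example `retry_check 10 1 pgs_active "$n" || format_error "FAILED: PGS NOT UP AFTER PG SIZE CHANGE"`, where `pgs_active` is an assumed wrapper around the `$ETCDCTL ... | jq -s -e ...` pipeline.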
--- a/tests/test_vm_cont.sh
+++ b/tests/test_vm_cont.sh
@ -15,7 +15,7 @@ for i in $(seq 1 $OSD_COUNT); do
    eval OSD${i}_PID=$!
 done

-(while true; do node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) &>./testdata/mon.log &
+(while true; do node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) >>./testdata/mon.log 2>&1 &
 MON_PID=$!

 sleep 3
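
Note on the redirection change in tests/test_vm_cont.sh: `&>file` is bash shorthand for `>file 2>&1`, which truncates the file when it is opened, whereas `>>file 2>&1` opens it in append mode. The sketch below only illustrates that difference on a scratch file; it does not claim to reproduce the author's motivation for the change, and the file and messages are made up for the demonstration.

```shell
# Scratch file for the demonstration (not the test's mon.log)
log=$(mktemp)

# '>' truncates on open; in bash, '&>file' is shorthand for '>file 2>&1'
(echo "run 1"; echo "run 1 error" >&2) >"$log" 2>&1
(echo "run 2") >"$log" 2>&1
truncated_lines=$(($(wc -l <"$log")))

# '>>' opens in append mode and keeps what is already in the file
(echo "run 3"; echo "run 3 error" >&2) >>"$log" 2>&1
appended_lines=$(($(wc -l <"$log")))

echo "after truncating redirect: $truncated_lines line(s)"
echo "after appending redirect: $appended_lines line(s)"
rm -f "$log"
```

The truncating form leaves only the output of the last writer, while the appending form preserves whatever an earlier process already wrote to the log.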
Author	SHA1	Message	Date
Vitaliy Filippov	53184b3f81	FIXME Sync only completed writes. Currently it's impossible because it leads to errors similar to: terminate called after throwing an instance of 'std::runtime_error' what(): BUG: Unexpected dirty_entry 1000000000001:29480000 v65540 unstable state during flush: 0x151. Probably because in that case flushers should wait for previous writes too	2023-12-31 01:24:54 +03:00
Vitaliy Filippov	48b5f871e0	Add Contributor License Agreement in Russian and English	2023-12-31 01:23:52 +03:00
Vitaliy Filippov	c17f76a3e4	Add documentation for recovery auto-tuning	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	a6ab54b1ba	Do not allow negative util_low/high	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	99ee8596ea	Rename min/max_util to util_low/high	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	c4928e6ecd	Protect from try_send completing the operation immediately. Fixes a possible use-after-free in case of continue_ops() calling try_send(), then connect_peer() -> set_timer() -> trigger_nearest() -> handle_op_part() -> continue_ops() again	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	ec7dcd1be5	Do not apply very large recovery pauses during tests	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	e600bbc151	Fix flapping move_reappear test by adding an fsync before stopping PG	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	8b8c1179a7	Use a separate used_blocks counter for free space stats to hide possibly delayed on-flush deallocation	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	d5a6fa6dd7	Fix possible crash on print_slow when bs_op is NULL	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	f757a35a8d	Retry PG changes without re-running lpsolve when pool configuration and OSD tree don't change. OSDs often change their /pg/history keys during rebalance, so the monitor receives additional transaction failures from etcd if it re-runs lpsolve, which sometimes may even leave the monitor unable to apply PG changes at all until rebalance completes	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	1edf86ed26	Aggregate recovery delay using simple mean over last 10 observations (EWMA is shit)	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	5ca7cde612	Experiment/WIP: Try to track "secondary" recovery ops separately	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	751935ddd8	WIP Auto-tune recovery speed	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	d84dee7098	Track recovery op latencies + refactor into a structure	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	dcc76eee15	Add a parity chunk count change test script	2023-12-26 23:48:41 +03:00