Add documentation for recovery auto-tuning

Do not allow negative util_low/high
Rename min/max_util to util_low/high
2023-12-31 01:23:17 +03:00
31 changed files with 900 additions and 270 deletions
--- a/docs/config/client.en.md
+++ b/docs/config/client.en.md
@ -6,8 +6,8 @@

 # Client Parameters

-These parameters apply only to clients and affect their interaction with
-the cluster.
+These parameters apply only to Vitastor clients (QEMU, fio, NBD and so on) and
+affect their interaction with the cluster.

 - [client_max_dirty_bytes](#client_max_dirty_bytes)
 - [client_max_dirty_ops](#client_max_dirty_ops)
--- a/docs/config/client.ru.md
+++ b/docs/config/client.ru.md
@ -6,7 +6,7 @@

 # Параметры клиентского кода

-Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD) и
+Данные параметры применяются только к клиентам Vitastor (QEMU, fio, NBD и т.п.) и
 затрагивают логику их работы с кластером.

 - [client_max_dirty_bytes](#client_max_dirty_bytes)
--- a/docs/config/osd.en.md
+++ b/docs/config/osd.en.md
@ -19,6 +19,7 @@ them, even without restarting by updating configuration in etcd.
 - [autosync_interval](#autosync_interval)
 - [autosync_writes](#autosync_writes)
 - [recovery_queue_depth](#recovery_queue_depth)
+- [recovery_sleep_us](#recovery_sleep_us)
 - [recovery_pg_switch](#recovery_pg_switch)
 - [recovery_sync_batch](#recovery_sync_batch)
 - [readonly](#readonly)
@ -51,6 +52,13 @@ them, even without restarting by updating configuration in etcd.
 - [scrub_list_limit](#scrub_list_limit)
 - [scrub_find_best](#scrub_find_best)
 - [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce)
+- [recovery_tune_interval](#recovery_tune_interval)
+- [recovery_tune_util_low](#recovery_tune_util_low)
+- [recovery_tune_util_high](#recovery_tune_util_high)
+- [recovery_tune_client_util_low](#recovery_tune_client_util_low)
+- [recovery_tune_client_util_high](#recovery_tune_client_util_high)
+- [recovery_tune_agg_interval](#recovery_tune_agg_interval)
+- [recovery_tune_sleep_min_us](#recovery_tune_sleep_min_us)

 ## etcd_report_interval

@ -135,12 +143,24 @@ operations before issuing an fsync operation internally.
 ## recovery_queue_depth

 - Type: integer
-- Default: 4
+- Default: 1
 - Can be changed online: yes

-Maximum recovery operations per one primary OSD at any given moment of time.
-Currently it's the only parameter available to tune the speed or recovery
-and rebalancing, but it's planned to implement more.
+Maximum number of recovery and rebalance operations initiated by each OSD in parallel.
+Note that each OSD talks to many other OSDs, so the actual number of parallel
+recovery operations per OSD is greater than recovery_queue_depth alone.
+Increasing this parameter can speed up recovery if [auto-tuning](#recovery_tune_interval)
+allows it or is disabled.
+
+## recovery_sleep_us
+
+- Type: microseconds
+- Default: 0
+- Can be changed online: yes
+
+Delay added to every recovery- and rebalance-related operation. If non-zero,
+such operations are artificially slowed down to reduce the impact on
+client I/O.

 ## recovery_pg_switch

@ -508,3 +528,81 @@ the variant with most available equal copies is correct. For example, if
 you have 3 replicas and 1 of them differs, this one is considered to be
 corrupted. But if there is no "best" version with more copies than all
 others have then the object is also marked as inconsistent.
+
+## recovery_tune_interval
+
+- Type: seconds
+- Default: 1
+- Can be changed online: yes
+
+Interval at which OSD re-considers client and recovery load and automatically
+adjusts [recovery_sleep_us](#recovery_sleep_us). Recovery auto-tuning is
+disabled if recovery_tune_interval is set to 0.
+
+Auto-tuning targets utilization. Utilization is a measure of load and is
+equal to the product of iops and average latency (so it may be greater
+than 1). You set "low" and "high" client utilization thresholds and two
+corresponding target recovery utilization levels. OSD calculates desired
+recovery utilization from client utilization using linear interpolation
+and auto-tunes recovery operation delay to make actual recovery utilization
+match desired.
+
+This reduces the impact of recovery/rebalance on client operations. The impact
+cannot be removed completely, but it should become acceptable.
+In some tests rebalance previously dropped client write speed from 1.5 GB/s
+to 50-100 MB/s; with default auto-tuning settings it now only drops
+to ~1 GB/s.
+
+## recovery_tune_util_low
+
+- Type: number
+- Default: 0.1
+- Can be changed online: yes
+
+Desired recovery/rebalance utilization when client load is high, i.e. when
+it is at or above recovery_tune_client_util_high.
+
+## recovery_tune_util_high
+
+- Type: number
+- Default: 1
+- Can be changed online: yes
+
+Desired recovery/rebalance utilization when client load is low, i.e. when
+it is at or below recovery_tune_client_util_low.
+
+## recovery_tune_client_util_low
+
+- Type: number
+- Default: 0
+- Can be changed online: yes
+
+Client utilization considered "low".
+
+## recovery_tune_client_util_high
+
+- Type: number
+- Default: 0.5
+- Can be changed online: yes
+
+Client utilization considered "high".
+
+## recovery_tune_agg_interval
+
+- Type: integer
+- Default: 10
+- Can be changed online: yes
+
+Number of most recent auto-tuning iterations averaged to calculate the
+delay. Lower values result in quicker response to client
+load changes, higher values result in a more stable delay. The default value of 10
+is usually fine.
+
+## recovery_tune_sleep_min_us
+
+- Type: microseconds
+- Default: 10
+- Can be changed online: yes
+
+Minimum possible value for auto-tuned recovery_sleep_us. Values lower
+than this value are changed to 0.
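The interpolation described under recovery_tune_interval can be sketched in a few lines of JavaScript. This is only an illustration of the documented behaviour, not the OSD implementation (which is not part of this diff): the helper names `utilization` and `targetRecoveryUtil` are hypothetical, and clamping outside the low/high client thresholds is assumed from the "at or above" / "at or below" wording of the parameter descriptions.

```javascript
// Hypothetical helpers illustrating the documented auto-tuning target.
// Not taken from the Vitastor sources.

// Utilization = iops * average latency (latency in seconds), so it may exceed 1
function utilization(iops, avgLatencySec)
{
    return iops * avgLatencySec;
}

// Linear interpolation between the two configured points:
//   client util <= client_util_low  -> recovery_tune_util_high
//   client util >= client_util_high -> recovery_tune_util_low
function targetRecoveryUtil(clientUtil, cfg)
{
    const lo = cfg.recovery_tune_client_util_low;
    const hi = cfg.recovery_tune_client_util_high;
    if (clientUtil <= lo)
    {
        return cfg.recovery_tune_util_high;
    }
    if (clientUtil >= hi)
    {
        return cfg.recovery_tune_util_low;
    }
    const t = (clientUtil - lo) / (hi - lo);
    return cfg.recovery_tune_util_high +
        t * (cfg.recovery_tune_util_low - cfg.recovery_tune_util_high);
}

// Default values from this commit
const defaults = {
    recovery_tune_util_low: 0.1,
    recovery_tune_util_high: 1,
    recovery_tune_client_util_low: 0,
    recovery_tune_client_util_high: 0.5,
};
```

With these defaults an idle cluster gets a recovery utilization target of 1, a client utilization of 0.25 gives roughly 0.55, and anything at or above 0.5 gives 0.1.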
--- a/docs/config/osd.ru.md
+++ b/docs/config/osd.ru.md
@ -20,6 +20,7 @@
 - [autosync_interval](#autosync_interval)
 - [autosync_writes](#autosync_writes)
 - [recovery_queue_depth](#recovery_queue_depth)
+- [recovery_sleep_us](#recovery_sleep_us)
 - [recovery_pg_switch](#recovery_pg_switch)
 - [recovery_sync_batch](#recovery_sync_batch)
 - [readonly](#readonly)
@ -52,6 +53,13 @@
 - [scrub_list_limit](#scrub_list_limit)
 - [scrub_find_best](#scrub_find_best)
 - [scrub_ec_max_bruteforce](#scrub_ec_max_bruteforce)
+- [recovery_tune_interval](#recovery_tune_interval)
+- [recovery_tune_util_low](#recovery_tune_util_low)
+- [recovery_tune_util_high](#recovery_tune_util_high)
+- [recovery_tune_client_util_low](#recovery_tune_client_util_low)
+- [recovery_tune_client_util_high](#recovery_tune_client_util_high)
+- [recovery_tune_agg_interval](#recovery_tune_agg_interval)
+- [recovery_tune_sleep_min_us](#recovery_tune_sleep_min_us)

 ## etcd_report_interval

@ -138,13 +146,25 @@ OSD, чтобы успевать очищать журнал - без них OSD
 ## recovery_queue_depth

 - Тип: целое число
-- Значение по умолчанию: 4
+- Значение по умолчанию: 1
 - Можно менять на лету: да

-Максимальное число операций восстановления на одном первичном OSD в любой
-момент времени. На данный момент единственный параметр, который можно менять
-для ускорения или замедления восстановления и перебалансировки данных, но
-в планах реализация других параметров.
+Максимальное число параллельных операций восстановления, инициируемых одним
+OSD в любой момент времени. Имейте в виду, что каждый OSD обычно работает с
+многими другими OSD, так что на практике параллелизм восстановления больше,
+чем просто recovery_queue_depth. Увеличение значения этого параметра может
+ускорить восстановление если [автотюнинг скорости](#recovery_tune_interval)
+разрешает это или если он отключён.
+
+## recovery_sleep_us
+
+- Тип: микросекунды
+- Значение по умолчанию: 0
+- Можно менять на лету: да
+
+Delay added to every recovery- and rebalance-related operation. If non-zero,
+such operations are artificially slowed down to reduce the impact on
+client I/O.

 ## recovery_pg_switch

@ -535,3 +555,83 @@ EC (кодов коррекции ошибок) с более, чем 1 диск
 считается некорректной. Однако, если "лучшую" версию с числом доступных
 копий большим, чем у всех других версий, найти невозможно, то объект тоже
 маркируется неконсистентным.
+
+## recovery_tune_interval
+
+- Тип: секунды
+- Значение по умолчанию: 1
+- Можно менять на лету: да
+
+Интервал, с которым OSD пересматривает клиентскую нагрузку и нагрузку
+восстановления и автоматически подстраивает [recovery_sleep_us](#recovery_sleep_us).
+Автотюнинг (автоподстройка) отключается, если recovery_tune_interval
+устанавливается в значение 0.
+
+Автотюнинг регулирует утилизацию. Утилизация является мерой нагрузки
+и равна произведению числа операций в секунду и средней задержки
+(то есть, она может быть выше 1). Вы задаёте два уровня клиентской
+утилизации - "низкий" и "высокий" (low и high) и два соответствующих
+целевых уровня утилизации операциями восстановления. OSD рассчитывает
+желаемый уровень утилизации восстановления линейной интерполяцией от
+клиентской утилизации и подстраивает задержку операций восстановления
+так, чтобы фактическая утилизация восстановления совпадала с желаемой.
+
+Это позволяет снизить влияние восстановления и ребаланса на клиентские
+операции. Конечно, невозможно исключить такое влияние полностью, но оно
+должно становиться адекватнее. В некоторых тестах перебалансировка могла
+снижать клиентскую скорость записи с 1.5 ГБ/с до 50-100 МБ/с, а теперь, с
+настройками автотюнинга по умолчанию, она снижается только до ~1 ГБ/с.
+
+## recovery_tune_util_low
+
+- Тип: число
+- Значение по умолчанию: 0.1
+- Можно менять на лету: да
+
+Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
+высокая, то есть, находится на уровне или выше recovery_tune_client_util_high.
+
+## recovery_tune_util_high
+
+- Тип: число
+- Значение по умолчанию: 1
+- Можно менять на лету: да
+
+Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
+низкая, то есть, находится на уровне или ниже recovery_tune_client_util_low.
+
+## recovery_tune_client_util_low
+
+- Тип: число
+- Значение по умолчанию: 0
+- Можно менять на лету: да
+
+Клиентская утилизация, которая считается "низкой".
+
+## recovery_tune_client_util_high
+
+- Тип: число
+- Значение по умолчанию: 0.5
+- Можно менять на лету: да
+
+Клиентская утилизация, которая считается "высокой".
+
+## recovery_tune_agg_interval
+
+- Тип: целое число
+- Значение по умолчанию: 10
+- Можно менять на лету: да
+
+Число последних итераций автоподстройки для расчёта задержки как среднего
+значения. Меньшие значения параметра ускоряют отклик на изменение нагрузки,
+большие значения делают задержку стабильнее. Значение по умолчанию 10
+обычно нормальное и не требует изменений.
+
+## recovery_tune_sleep_min_us
+
+- Тип: микросекунды
+- Значение по умолчанию: 10
+- Можно менять на лету: да
+
+Минимальное возможное значение авто-подстроенного recovery_sleep_us.
+Значения ниже данного заменяются на 0.
--- a/docs/config/src/make.js
+++ b/docs/config/src/make.js
@ -38,6 +38,7 @@ const types = {
        bool: 'boolean',
        int: 'integer',
        sec: 'seconds',
+        float: 'number',
        ms: 'milliseconds',
        us: 'microseconds',
    },
@ -46,6 +47,7 @@ const types = {
        bool: 'булево (да/нет)',
        int: 'целое число',
        sec: 'секунды',
+        float: 'число',
        ms: 'миллисекунды',
        us: 'микросекунды',
    },
--- a/docs/config/src/osd.yml
+++ b/docs/config/src/osd.yml
@ -107,17 +107,29 @@
    принудительной отправкой fsync-а.
 - name: recovery_queue_depth
  type: int
-  default: 4
+  default: 1
  online: true
  info: |
-    Maximum recovery operations per one primary OSD at any given moment of time.
-    Currently it's the only parameter available to tune the speed or recovery
-    and rebalancing, but it's planned to implement more.
+    Maximum number of recovery and rebalance operations initiated by each OSD in parallel.
+    Note that each OSD talks to many other OSDs, so the actual number of parallel
+    recovery operations per OSD is greater than recovery_queue_depth alone.
+    Increasing this parameter can speed up recovery if [auto-tuning](#recovery_tune_interval)
+    allows it or is disabled.
  info_ru: |
-    Максимальное число операций восстановления на одном первичном OSD в любой
-    момент времени. На данный момент единственный параметр, который можно менять
-    для ускорения или замедления восстановления и перебалансировки данных, но
-    в планах реализация других параметров.
+    Максимальное число параллельных операций восстановления, инициируемых одним
+    OSD в любой момент времени. Имейте в виду, что каждый OSD обычно работает с
+    многими другими OSD, так что на практике параллелизм восстановления больше,
+    чем просто recovery_queue_depth. Увеличение значения этого параметра может
+    ускорить восстановление если [автотюнинг скорости](#recovery_tune_interval)
+    разрешает это или если он отключён.
+- name: recovery_sleep_us
+  type: us
+  default: 0
+  online: true
+  info: |
+    Delay added to every recovery- and rebalance-related operation. If non-zero,
+    such operations are artificially slowed down to reduce the impact on
+    client I/O.
 - name: recovery_pg_switch
  type: int
  default: 128
@ -626,3 +638,101 @@
    считается некорректной. Однако, если "лучшую" версию с числом доступных
    копий большим, чем у всех других версий, найти невозможно, то объект тоже
    маркируется неконсистентным.
+- name: recovery_tune_interval
+  type: sec
+  default: 1
+  online: true
+  info: |
+    Interval at which OSD re-considers client and recovery load and automatically
+    adjusts [recovery_sleep_us](#recovery_sleep_us). Recovery auto-tuning is
+    disabled if recovery_tune_interval is set to 0.
+
+    Auto-tuning targets utilization. Utilization is a measure of load and is
+    equal to the product of iops and average latency (so it may be greater
+    than 1). You set "low" and "high" client utilization thresholds and two
+    corresponding target recovery utilization levels. OSD calculates desired
+    recovery utilization from client utilization using linear interpolation
+    and auto-tunes recovery operation delay to make actual recovery utilization
+    match desired.
+
+    This reduces the impact of recovery/rebalance on client operations. The impact
+    cannot be removed completely, but it should become acceptable.
+    In some tests rebalance previously dropped client write speed from 1.5 GB/s
+    to 50-100 MB/s; with default auto-tuning settings it now only drops
+    to ~1 GB/s.
+  info_ru: |
+    Интервал, с которым OSD пересматривает клиентскую нагрузку и нагрузку
+    восстановления и автоматически подстраивает [recovery_sleep_us](#recovery_sleep_us).
+    Автотюнинг (автоподстройка) отключается, если recovery_tune_interval
+    устанавливается в значение 0.
+
+    Автотюнинг регулирует утилизацию. Утилизация является мерой нагрузки
+    и равна произведению числа операций в секунду и средней задержки
+    (то есть, она может быть выше 1). Вы задаёте два уровня клиентской
+    утилизации - "низкий" и "высокий" (low и high) и два соответствующих
+    целевых уровня утилизации операциями восстановления. OSD рассчитывает
+    желаемый уровень утилизации восстановления линейной интерполяцией от
+    клиентской утилизации и подстраивает задержку операций восстановления
+    так, чтобы фактическая утилизация восстановления совпадала с желаемой.
+
+    Это позволяет снизить влияние восстановления и ребаланса на клиентские
+    операции. Конечно, невозможно исключить такое влияние полностью, но оно
+    должно становиться адекватнее. В некоторых тестах перебалансировка могла
+    снижать клиентскую скорость записи с 1.5 ГБ/с до 50-100 МБ/с, а теперь, с
+    настройками автотюнинга по умолчанию, она снижается только до ~1 ГБ/с.
+- name: recovery_tune_util_low
+  type: float
+  default: 0.1
+  online: true
+  info: |
+    Desired recovery/rebalance utilization when client load is high, i.e. when
+    it is at or above recovery_tune_client_util_high.
+  info_ru: |
+    Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
+    высокая, то есть, находится на уровне или выше recovery_tune_client_util_high.
+- name: recovery_tune_util_high
+  type: float
+  default: 1
+  online: true
+  info: |
+    Desired recovery/rebalance utilization when client load is low, i.e. when
+    it is at or below recovery_tune_client_util_low.
+  info_ru: |
+    Желаемая утилизация восстановления в моменты, когда клиентская нагрузка
+    низкая, то есть, находится на уровне или ниже recovery_tune_client_util_low.
+- name: recovery_tune_client_util_low
+  type: float
+  default: 0
+  online: true
+  info: Client utilization considered "low".
+  info_ru: Клиентская утилизация, которая считается "низкой".
+- name: recovery_tune_client_util_high
+  type: float
+  default: 0.5
+  online: true
+  info: Client utilization considered "high".
+  info_ru: Клиентская утилизация, которая считается "высокой".
+- name: recovery_tune_agg_interval
+  type: int
+  default: 10
+  online: true
+  info: |
+    Number of most recent auto-tuning iterations averaged to calculate the
+    delay. Lower values result in quicker response to client
+    load changes, higher values result in a more stable delay. The default value of 10
+    is usually fine.
+  info_ru: |
+    Число последних итераций автоподстройки для расчёта задержки как среднего
+    значения. Меньшие значения параметра ускоряют отклик на изменение нагрузки,
+    большие значения делают задержку стабильнее. Значение по умолчанию 10
+    обычно нормальное и не требует изменений.
+- name: recovery_tune_sleep_min_us
+  type: us
+  default: 10
+  online: true
+  info: |
+    Minimum possible value for auto-tuned recovery_sleep_us. Values lower
+    than this value are changed to 0.
+  info_ru: |
+    Минимальное возможное значение авто-подстроенного recovery_sleep_us.
+    Значения ниже данного заменяются на 0.
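The interplay of recovery_tune_agg_interval and recovery_tune_sleep_min_us might look roughly like the sketch below. It assumes a plain arithmetic mean over the last N iterations and rounding to whole microseconds; the diff only states that the delay is calculated "as average" and that values below the minimum become 0, so the class name `SleepSmoother` and the exact arithmetic are assumptions, not the OSD code.

```javascript
// Hypothetical sketch; names and rounding are illustrative assumptions.
class SleepSmoother
{
    constructor(aggInterval, sleepMinUs)
    {
        this.aggInterval = aggInterval; // recovery_tune_agg_interval
        this.sleepMinUs = sleepMinUs;   // recovery_tune_sleep_min_us
        this.samples = [];
    }

    // Feed the raw delay computed in one tuning iteration,
    // get back the smoothed value to use as recovery_sleep_us
    push(rawSleepUs)
    {
        this.samples.push(rawSleepUs);
        if (this.samples.length > this.aggInterval)
        {
            // Keep only the last aggInterval iterations
            this.samples.shift();
        }
        const avg = this.samples.reduce((a, b) => a + b, 0) / this.samples.length;
        // Delays below the minimum are replaced with 0
        return avg < this.sleepMinUs ? 0 : Math.round(avg);
    }
}
```

A larger window smooths out spikes in the computed delay at the cost of reacting more slowly when client load changes, which matches the trade-off described for recovery_tune_agg_interval.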
--- a/docs/intro/features.en.md
+++ b/docs/intro/features.en.md
@ -32,6 +32,7 @@
 - [Scrubbing](../config/osd.en.md#auto_scrub) (verification of copies)
 - [Checksums](../config/layout-osd.en.md#data_csum_type)
 - [Client write-back cache](../config/client.en.md#client_enable_writeback)
+- [Intelligent recovery auto-tuning](../config/osd.en.md#recovery_tune_interval)

 ## Plugins and tools

--- a/docs/intro/features.ru.md
+++ b/docs/intro/features.ru.md
@ -34,6 +34,7 @@
 - [Фоновая проверка целостности](../config/osd.ru.md#auto_scrub) (сверка копий)
 - [Контрольные суммы](../config/layout-osd.ru.md#data_csum_type)
 - [Буферизация записи на стороне клиента](../config/client.ru.md#client_enable_writeback)
+- [Интеллектуальная автоподстройка скорости восстановления](../config/osd.ru.md#recovery_tune_interval)

 ## Драйверы и инструменты

--- a/mon/PGUtil.js
+++ b/mon/PGUtil.js
@ -3,6 +3,7 @@

 module.exports = {
    scale_pg_count,
+    scale_pg_history,
 };

 function add_pg_history(new_pg_history, new_pg, prev_pgs, prev_pg_history, old_pg)
@ -43,16 +44,18 @@ function finish_pg_history(merged_history)
    merged_history.all_peers = Object.values(merged_history.all_peers);
 }

-function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history, new_pg_count)
+function scale_pg_history(prev_pg_history, prev_pgs, new_pgs)
 {
-    const old_pg_count = real_prev_pgs.length;
+    const new_pg_history = [];
+    const old_pg_count = prev_pgs.length;
+    const new_pg_count = new_pgs.length;
    // Add all possibly intersecting PGs to the history of new PGs
    if (!(new_pg_count % old_pg_count))
    {
        // New PG count is a multiple of old PG count
        for (let i = 0; i < new_pg_count; i++)
        {
-            add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i % old_pg_count);
+            add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i % old_pg_count);
            finish_pg_history(new_pg_history[i]);
        }
    }
@ -64,7 +67,7 @@ function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history
        {
            for (let j = 0; j < mul; j++)
            {
-                add_pg_history(new_pg_history, i, real_prev_pgs, prev_pg_history, i+j*new_pg_count);
+                add_pg_history(new_pg_history, i, prev_pgs, prev_pg_history, i+j*new_pg_count);
            }
            finish_pg_history(new_pg_history[i]);
        }
@ -76,7 +79,7 @@ function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history
        let merged_history = {};
        for (let i = 0; i < old_pg_count; i++)
        {
-            add_pg_history(merged_history, 1, real_prev_pgs, prev_pg_history, i);
+            add_pg_history(merged_history, 1, prev_pgs, prev_pg_history, i);
        }
        finish_pg_history(merged_history[1]);
        for (let i = 0; i < new_pg_count; i++)
@ -89,6 +92,12 @@ function scale_pg_count(prev_pgs, real_prev_pgs, prev_pg_history, new_pg_history
    {
        new_pg_history[i] = null;
    }
+    return new_pg_history;
+}
+
+function scale_pg_count(prev_pgs, new_pg_count)
+{
+    const old_pg_count = prev_pgs.length;
    // Just for the lp_solve optimizer - pick a "previous" PG for each "new" one
    if (prev_pgs.length < new_pg_count)
    {
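The three branches of scale_pg_history() above select which old PGs may intersect each new PG. A standalone sketch of that index mapping follows. The function name `intersectingOldPgs` is hypothetical, and `mul` is assumed to be old_pg_count / new_pg_count, since its definition is outside the hunk shown here.

```javascript
// Hypothetical helper mirroring the index arithmetic visible in the diff.
// Returns 0-based indices of old PGs whose history may be relevant
// for the new PG with 0-based index newPg.
function intersectingOldPgs(newPg, oldCount, newCount)
{
    if (newCount % oldCount === 0)
    {
        // New PG count is a multiple of the old one: exactly one old PG
        return [ newPg % oldCount ];
    }
    if (oldCount % newCount === 0)
    {
        // Old PG count is a multiple of the new one: several old PGs merge
        const mul = oldCount / newCount; // assumption, not shown in the diff
        const res = [];
        for (let j = 0; j < mul; j++)
        {
            res.push(newPg + j*newCount);
        }
        return res;
    }
    // Otherwise any old PG may intersect with any new PG
    return Array.from({ length: oldCount }, (_, i) => i);
}
```

For example, when growing from 4 to 8 PGs, new PG 5 inherits history from old PG 1; when shrinking from 8 to 4 PGs, new PG 1 merges old PGs 1 and 5.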
--- a/mon/mon.js
+++ b/mon/mon.js
@ -59,6 +59,7 @@ const etcd_tree = {
            etcd_mon_timeout: 1000, // ms. min: 0
            etcd_mon_retries: 5, // min: 0
            mon_change_timeout: 1000, // ms. min: 100
+            mon_retry_change_timeout: 50, // ms. min: 10
            mon_stats_timeout: 1000, // ms. min: 100
            osd_out_time: 600, // seconds. min: 0
            placement_levels: { datacenter: 1, rack: 2, host: 3, osd: 4, ... },
@ -110,7 +111,15 @@ const etcd_tree = {
            autosync_interval: 5,
            autosync_writes: 128,
            client_queue_depth: 128, // unused
-            recovery_queue_depth: 4,
+            recovery_queue_depth: 1,
+            recovery_sleep_us: 0,
+            recovery_tune_util_low: 0.1,
+            recovery_tune_client_util_low: 0,
+            recovery_tune_util_high: 1.0,
+            recovery_tune_client_util_high: 0.5,
+            recovery_tune_interval: 1,
+            recovery_tune_agg_interval: 10, // 10 times recovery_tune_interval
+            recovery_tune_sleep_min_us: 10, // 10 microseconds
            recovery_pg_switch: 128,
            recovery_sync_batch: 16,
            no_recovery: false,
@ -490,6 +499,11 @@ class Mon
        {
            this.config.mon_change_timeout = 100;
        }
+        this.config.mon_retry_change_timeout = Number(this.config.mon_retry_change_timeout) || 50;
+        if (this.config.mon_retry_change_timeout < 10)
+        {
+            this.config.mon_retry_change_timeout = 10;
+        }
        this.config.mon_stats_timeout = Number(this.config.mon_stats_timeout) || 1000;
        if (this.config.mon_stats_timeout < 100)
        {
@ -1222,6 +1236,89 @@ class Mon
        return aff_osds;
    }

+    async generate_pool_pgs(pool_id, osd_tree, levels)
+    {
+        const pool_cfg = this.state.config.pools[pool_id];
+        if (!this.validate_pool_cfg(pool_id, pool_cfg, false))
+        {
+            return null;
+        }
+        let pool_tree = osd_tree[pool_cfg.root_node || ''];
+        pool_tree = pool_tree ? pool_tree.children : [];
+        pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
+        this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
+        this.filter_osds_by_block_layout(
+            pool_tree,
+            pool_cfg.block_size || this.config.block_size || 131072,
+            pool_cfg.bitmap_granularity || this.config.bitmap_granularity || 4096,
+            pool_cfg.immediate_commit || this.config.immediate_commit || 'none'
+        );
+        // First try last_clean_pgs to minimize data movement
+        let prev_pgs = [];
+        for (const pg in ((this.state.history.last_clean_pgs.items||{})[pool_id]||{}))
+        {
+            prev_pgs[pg-1] = [ ...this.state.history.last_clean_pgs.items[pool_id][pg].osd_set ];
+        }
+        if (!prev_pgs.length)
+        {
+            // Fall back to config/pgs if last_clean_pgs is empty
+            for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
+            {
+                prev_pgs[pg-1] = [ ...this.state.config.pgs.items[pool_id][pg].osd_set ];
+            }
+        }
+        const old_pg_count = prev_pgs.length;
+        const optimize_cfg = {
+            osd_tree: pool_tree,
+            pg_count: pool_cfg.pg_count,
+            pg_size: pool_cfg.pg_size,
+            pg_minsize: pool_cfg.pg_minsize,
+            max_combinations: pool_cfg.max_osd_combinations,
+            ordered: pool_cfg.scheme != 'replicated',
+        };
+        let optimize_result;
+        // Re-shuffle PGs if config/pgs.hash is empty
+        if (old_pg_count > 0 && this.state.config.pgs.hash)
+        {
+            if (prev_pgs.length != pool_cfg.pg_count)
+            {
+                // Scale PG count.
+                // prev_pgs may come from last_clean_pgs, which may still
+                // contain the old number of PGs
+                PGUtil.scale_pg_count(prev_pgs, pool_cfg.pg_count);
+            }
+            for (const pg of prev_pgs)
+            {
+                while (pg.length < pool_cfg.pg_size)
+                {
+                    pg.push(0);
+                }
+            }
+            optimize_result = await LPOptimizer.optimize_change({
+                prev_pgs,
+                ...optimize_cfg,
+            });
+        }
+        else
+        {
+            optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
+        }
+        console.log(`Pool ${pool_id} (${pool_cfg.name || 'unnamed'}):`);
+        LPOptimizer.print_change_stats(optimize_result);
+        const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
+        return {
+            pool_id,
+            pgs: optimize_result.int_pgs,
+            stats: {
+                total_raw_tb: optimize_result.space,
+                pg_real_size: pg_effsize || pool_cfg.pg_size,
+                raw_to_usable: (pg_effsize || pool_cfg.pg_size) / (pool_cfg.scheme === 'replicated'
+                    ? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0))),
+                space_efficiency: optimize_result.space/(optimize_result.total_space||1),
+            },
+        };
+    }
+
    async recheck_pgs()
    {
        if (this.recheck_pgs_active)
@ -1236,158 +1333,47 @@ class Mon
        const { up_osds, levels, osd_tree } = this.get_osd_tree();
        const tree_cfg = {
            osd_tree,
+            levels,
            pools: this.state.config.pools,
        };
        const tree_hash = sha1hex(stableStringify(tree_cfg));
        if (this.state.config.pgs.hash != tree_hash)
        {
            // Something has changed
-            const new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
-            const etcd_request = { compare: [], success: [] };
-            for (const pool_id in (this.state.config.pgs||{}).items||{})
-            {
-                if (!this.state.config.pools[pool_id])
-                {
-                    // Pool deleted. Delete all PGs, but first stop them.
-                    if (!await this.stop_all_pgs(pool_id))
-                    {
-                        this.recheck_pgs_active = false;
-                        this.schedule_recheck();
-                        return;
-                    }
-                    const prev_pgs = [];
-                    for (const pg in this.state.config.pgs.items[pool_id]||{})
-                    {
-                        prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
-                    }
-                    // Also delete pool statistics
-                    etcd_request.success.push({ requestDeleteRange: {
-                        key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
-                    } });
-                    this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
-                }
-            }
-            for (const pool_id in this.state.config.pools)
-            {
-                const pool_cfg = this.state.config.pools[pool_id];
-                if (!this.validate_pool_cfg(pool_id, pool_cfg, false))
-                {
-                    continue;
-                }
-                let pool_tree = osd_tree[pool_cfg.root_node || ''];
-                pool_tree = pool_tree ? pool_tree.children : [];
-                pool_tree = LPOptimizer.flatten_tree(pool_tree, levels, pool_cfg.failure_domain, 'osd');
-                this.filter_osds_by_tags(osd_tree, pool_tree, pool_cfg.osd_tags);
-                this.filter_osds_by_block_layout(
-                    pool_tree,
-                    pool_cfg.block_size || this.config.block_size || 131072,
-                    pool_cfg.bitmap_granularity || this.config.bitmap_granularity || 4096,
-                    pool_cfg.immediate_commit || this.config.immediate_commit || 'none'
-                );
-                // These are for the purpose of building history.osd_sets
-                const real_prev_pgs = [];
-                let pg_history = [];
-                for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
-                {
-                    real_prev_pgs[pg-1] = this.state.config.pgs.items[pool_id][pg].osd_set;
-                    if (this.state.pg.history[pool_id] &&
-                        this.state.pg.history[pool_id][pg])
-                    {
-                        pg_history[pg-1] = this.state.pg.history[pool_id][pg];
-                    }
-                }
-                // And these are for the purpose of minimizing data movement
-                let prev_pgs = [];
-                for (const pg in ((this.state.history.last_clean_pgs.items||{})[pool_id]||{}))
-                {
-                    prev_pgs[pg-1] = this.state.history.last_clean_pgs.items[pool_id][pg].osd_set;
-                }
-                prev_pgs = JSON.parse(JSON.stringify(prev_pgs.length ? prev_pgs : real_prev_pgs));
-                const old_pg_count = real_prev_pgs.length;
-                const optimize_cfg = {
-                    osd_tree: pool_tree,
-                    pg_count: pool_cfg.pg_count,
-                    pg_size: pool_cfg.pg_size,
-                    pg_minsize: pool_cfg.pg_minsize,
-                    max_combinations: pool_cfg.max_osd_combinations,
-                    ordered: pool_cfg.scheme != 'replicated',
-                };
-                let optimize_result;
-                if (old_pg_count > 0)
-                {
-                    if (old_pg_count != pool_cfg.pg_count)
-                    {
-                        // PG count changed. Need to bring all PGs down.
-                        if (!await this.stop_all_pgs(pool_id))
-                        {
-                            this.recheck_pgs_active = false;
-                            this.schedule_recheck();
-                            return;
-                        }
-                    }
-                    if (prev_pgs.length != pool_cfg.pg_count)
-                    {
-                        // Scale PG count
-                        // Do it even if old_pg_count is already equal to pool_cfg.pg_count,
-                        // because last_clean_pgs may still contain the old number of PGs
-                        const new_pg_history = [];
-                        PGUtil.scale_pg_count(prev_pgs, real_prev_pgs, pg_history, new_pg_history, pool_cfg.pg_count);
-                        pg_history = new_pg_history;
-                    }
-                    for (const pg of prev_pgs)
-                    {
-                        while (pg.length < pool_cfg.pg_size)
-                        {
-                            pg.push(0);
-                        }
-                    }
-                    if (!this.state.config.pgs.hash)
-                    {
-                        // Re-shuffle PGs
-                        optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
-                    }
-                    else
-                    {
-                        optimize_result = await LPOptimizer.optimize_change({
-                            prev_pgs,
-                            ...optimize_cfg,
-                        });
-                    }
-                }
-                else
-                {
-                    optimize_result = await LPOptimizer.optimize_initial(optimize_cfg);
-                }
-                if (old_pg_count != optimize_result.int_pgs.length)
+            console.log('Pool configuration or OSD tree changed, re-optimizing');
+            // First re-optimize PGs, but don't look at history yet
+            const optimize_results = await Promise.all(Object.keys(this.state.config.pools)
+                .map(pool_id => this.generate_pool_pgs(pool_id, osd_tree, levels)));
+            // Then apply the modification in the form of an optimistic transaction,
+            // each time considering new pg/history modifications (OSDs modify it during rebalance)
+            while (!await this.apply_pool_pgs(optimize_results, up_osds, osd_tree, tree_hash))
            {
                console.log(
-                        `PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+
-                        ` changed from: ${old_pg_count} to ${optimize_result.int_pgs.length}`
+                    'Someone changed PG configuration while we also tried to change it.'+
+                    ' Retrying in '+this.config.mon_retry_change_timeout+' ms'
                );
-                    // Drop stats
-                    etcd_request.success.push({ requestDeleteRange: {
-                        key: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'/'),
-                        range_end: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'0'),
-                    } });
+                // Failed to apply - parallel change detected. Wait a bit and retry
+                const old_rev = this.etcd_watch_revision;
+                while (this.etcd_watch_revision === old_rev)
+                {
+                    await new Promise(ok => setTimeout(ok, this.config.mon_retry_change_timeout));
                }
-                LPOptimizer.print_change_stats(optimize_result);
-                const pg_effsize = Math.min(pool_cfg.pg_size, Object.keys(pool_tree).length);
-                this.state.pool.stats[pool_id] = {
-                    used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
-                    total_raw_tb: optimize_result.space,
-                    pg_real_size: pg_effsize || pool_cfg.pg_size,
-                    raw_to_usable: (pg_effsize || pool_cfg.pg_size) / (pool_cfg.scheme === 'replicated'
-                        ? 1 : (pool_cfg.pg_size - (pool_cfg.parity_chunks||0))),
-                    space_efficiency: optimize_result.space/(optimize_result.total_space||1),
+                const new_ot = this.get_osd_tree();
+                const new_tcfg = {
+                    osd_tree: new_ot.osd_tree,
+                    levels: new_ot.levels,
+                    pools: this.state.config.pools,
                };
-                etcd_request.success.push({ requestPut: {
-                    key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
-                    value: b64(JSON.stringify(this.state.pool.stats[pool_id])),
-                } });
-                this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, optimize_result.int_pgs, pg_history);
+                if (sha1hex(stableStringify(new_tcfg)) !== tree_hash)
+                {
+                    // Configuration actually changed, restart from the beginning
+                    this.recheck_pgs_active = false;
+                    setImmediate(() => this.recheck_pgs().catch(this.die));
+                    return;
                }
-            new_config_pgs.hash = tree_hash;
-            await this.save_pg_config(new_config_pgs, etcd_request);
+                // Configuration didn't change, PG history probably changed, so just retry
+            }
+            console.log('PG configuration successfully changed');
        }
        else
        {
@ -1434,8 +1420,81 @@ class Mon
        this.recheck_pgs_active = false;
    }

-    async save_pg_config(new_config_pgs, etcd_request = { compare: [], success: [] })
+    async apply_pool_pgs(results, up_osds, osd_tree, tree_hash)
    {
+        for (const pool_id in (this.state.config.pgs||{}).items||{})
+        {
+            // We should stop all PGs when deleting a pool or changing its PG count
+            if (!this.state.config.pools[pool_id] ||
+                this.state.config.pgs.items[pool_id] && this.state.config.pools[pool_id].pg_count !=
+                Object.keys(this.state.config.pgs.items[pool_id]).reduce((a, c) => (a < (0|c) ? (0|c) : a), 0))
+            {
+                if (!await this.stop_all_pgs(pool_id))
+                {
+                    return false;
+                }
+            }
+        }
+        const new_config_pgs = JSON.parse(JSON.stringify(this.state.config.pgs));
+        const etcd_request = { compare: [], success: [] };
+        for (const pool_id in (new_config_pgs||{}).items||{})
+        {
+            if (!this.state.config.pools[pool_id])
+            {
+                const prev_pgs = [];
+                for (const pg in new_config_pgs.items[pool_id]||{})
+                {
+                    prev_pgs[pg-1] = new_config_pgs.items[pool_id][pg].osd_set;
+                }
+                // Also delete pool statistics
+                etcd_request.success.push({ requestDeleteRange: {
+                    key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
+                } });
+                this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, prev_pgs, [], []);
+            }
+        }
+        for (const pool_res of results)
+        {
+            const pool_id = pool_res.pool_id;
+            const pool_cfg = this.state.config.pools[pool_id];
+            let pg_history = [];
+            for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
+            {
+                if (this.state.pg.history[pool_id] &&
+                    this.state.pg.history[pool_id][pg])
+                {
+                    pg_history[pg-1] = this.state.pg.history[pool_id][pg];
+                }
+            }
+            const real_prev_pgs = [];
+            for (const pg in ((this.state.config.pgs.items||{})[pool_id]||{}))
+            {
+                real_prev_pgs[pg-1] = [ ...this.state.config.pgs.items[pool_id][pg].osd_set ];
+            }
+            if (real_prev_pgs.length > 0 && real_prev_pgs.length != pool_res.pgs.length)
+            {
+                console.log(
+                    `Changing PG count for pool ${pool_id} (${pool_cfg.name || 'unnamed'})`+
+                    ` from: ${real_prev_pgs.length} to ${pool_res.pgs.length}`
+                );
+                pg_history = PGUtil.scale_pg_history(pg_history, real_prev_pgs, pool_res.pgs);
+                // Drop stats
+                etcd_request.success.push({ requestDeleteRange: {
+                    key: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'/'),
+                    range_end: b64(this.etcd_prefix+'/pg/stats/'+pool_id+'0'),
+                } });
+            }
+            const stats = {
+                used_raw_tb: (this.state.pool.stats[pool_id]||{}).used_raw_tb || 0,
+                ...pool_res.stats,
+            };
+            etcd_request.success.push({ requestPut: {
+                key: b64(this.etcd_prefix+'/pool/stats/'+pool_id),
+                value: b64(JSON.stringify(stats)),
+            } });
+            this.save_new_pgs_txn(new_config_pgs, etcd_request, pool_id, up_osds, osd_tree, real_prev_pgs, pool_res.pgs, pg_history);
+        }
+        new_config_pgs.hash = tree_hash;
        etcd_request.compare.push(
            { key: b64(this.etcd_prefix+'/mon/master'), target: 'LEASE', lease: ''+this.etcd_lease_id },
            { key: b64(this.etcd_prefix+'/config/pgs'), target: 'MOD', mod_revision: ''+this.etcd_watch_revision, result: 'LESS' },
@ -1443,14 +1502,8 @@ class Mon
        etcd_request.success.push(
            { requestPut: { key: b64(this.etcd_prefix+'/config/pgs'), value: b64(JSON.stringify(new_config_pgs)) } },
        );
-        const res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0);
-        if (!res.succeeded)
-        {
-            console.log('Someone changed PG configuration while we also tried to change it. Retrying in '+this.config.mon_change_timeout+' ms');
-            this.schedule_recheck();
-            return;
-        }
-        console.log('PG configuration successfully changed');
+        const txn_res = await this.etcd_call('/kv/txn', etcd_request, this.config.etcd_mon_timeout, 0);
+        return txn_res.succeeded;
    }

    // Schedule next recheck at least at <unixtime>
--- a/src/blockstore_impl.h
+++ b/src/blockstore_impl.h
@ -277,6 +277,7 @@ class blockstore_impl_t
    int unsynced_big_write_count = 0, unstable_unsynced = 0;
    int unsynced_queued_ops = 0;
    allocator *data_alloc = NULL;
+    uint64_t used_blocks = 0;
    uint8_t *zero_object;

    void *metadata_buffer = NULL;
@ -430,7 +431,7 @@ public:

    inline uint32_t get_block_size() { return dsk.data_block_size; }
    inline uint64_t get_block_count() { return dsk.block_count; }
-    inline uint64_t get_free_block_count() { return data_alloc->get_free_count(); }
+    inline uint64_t get_free_block_count() { return dsk.block_count - used_blocks; }
    inline uint32_t get_bitmap_granularity() { return dsk.disk_alignment; }
    inline uint64_t get_journal_size() { return dsk.journal_len; }
 };
--- a/src/blockstore_init.cpp
+++ b/src/blockstore_init.cpp
@ -376,6 +376,7 @@ bool blockstore_init_meta::handle_meta_block(uint8_t *buf, uint64_t entries_per_
                else
                {
                    bs->inode_space_stats[entry->oid.inode] += bs->dsk.data_block_size;
+                    bs->used_blocks++;
                }
                entries_loaded++;
 #ifdef BLOCKSTORE_DEBUG
@ -1181,6 +1182,7 @@ void blockstore_init_journal::erase_dirty_object(blockstore_dirty_db_t::iterator
            sp -= bs->dsk.data_block_size;
        else
            bs->inode_space_stats.erase(oid.inode);
+        bs->used_blocks--;
    }
    bs->erase_dirty(dirty_it, dirty_end, clean_loc);
    // Remove it from the flusher's queue, too
--- a/src/blockstore_stable.cpp
+++ b/src/blockstore_stable.cpp
@ -445,6 +445,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
                    if (!exists)
                    {
                        inode_space_stats[dirty_it->first.oid.inode] += dsk.data_block_size;
+                        used_blocks++;
                    }
                    big_to_flush++;
                }
@ -455,6 +456,7 @@ void blockstore_impl_t::mark_stable(const obj_ver_id & v, bool forget_dirty)
                        sp -= dsk.data_block_size;
                    else
                        inode_space_stats.erase(dirty_it->first.oid.inode);
+                    used_blocks--;
                    big_to_flush++;
                }
            }
--- a/src/cluster_client.cpp
+++ b/src/cluster_client.cpp
@ -705,6 +705,8 @@ resume_1:
        }
        goto resume_2;
    }
+    // Protect from try_send completing the operation immediately
+    op->inflight_count++;
    for (int i = 0; i < op->parts.size(); i++)
    {
        if (!(op->parts[i].flags & PART_SENT))
@ -728,8 +730,10 @@ resume_1:
            }
        }
    }
+    op->inflight_count--;
    if (op->state == 1)
    {
+        // Some suboperations have to be resent
        return 0;
    }
 resume_2:
--- a/src/messenger.h
+++ b/src/messenger.h
@ -149,7 +149,7 @@ public:
    std::map<osd_num_t, osd_wanted_peer_t> wanted_peers;
    std::map<uint64_t, int> osd_peer_fds;
    // op statistics
-    osd_op_stats_t stats;
+    osd_op_stats_t stats, recovery_stats;

    void init();
    void parse_config(const json11::Json & config);
@ -175,6 +175,7 @@ public:
    bool connect_rdma(int peer_fd, std::string rdma_address, uint64_t client_max_msg);
 #endif

+    void inc_op_stats(osd_op_stats_t & stats, uint64_t opcode, timespec & tv_begin, timespec & tv_end, uint64_t len);
    void measure_exec(osd_op_t *cur_op);

 protected:
--- a/src/msgr_op.cpp
+++ b/src/msgr_op.cpp
@ -24,3 +24,17 @@ osd_op_t::~osd_op_t()
        free(buf);
    }
 }
+
+bool osd_op_t::is_recovery_related()
+{
+    return (req.hdr.opcode == OSD_OP_SEC_READ ||
+        req.hdr.opcode == OSD_OP_SEC_WRITE ||
+        req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE) &&
+        (req.sec_rw.flags & OSD_OP_RECOVERY_RELATED) ||
+        req.hdr.opcode == OSD_OP_SEC_DELETE &&
+        (req.sec_del.flags & OSD_OP_RECOVERY_RELATED) ||
+        req.hdr.opcode == OSD_OP_SEC_STABILIZE &&
+        (req.sec_stab.flags & OSD_OP_RECOVERY_RELATED) ||
+        req.hdr.opcode == OSD_OP_SEC_SYNC &&
+        (req.sec_sync.flags & OSD_OP_RECOVERY_RELATED);
+}
--- a/src/msgr_op.h
+++ b/src/msgr_op.h
@ -173,4 +173,6 @@ struct osd_op_t
    osd_op_buf_list_t iov;

    ~osd_op_t();
+
+    bool is_recovery_related();
 };
--- a/src/msgr_send.cpp
+++ b/src/msgr_send.cpp
@ -131,6 +131,23 @@ void osd_messenger_t::outbox_push(osd_op_t *cur_op)
    }
 }

+void osd_messenger_t::inc_op_stats(osd_op_stats_t & stats, uint64_t opcode, timespec & tv_begin, timespec & tv_end, uint64_t len)
+{
+    uint64_t usecs = (
+        (tv_end.tv_sec - tv_begin.tv_sec)*1000000 +
+        (tv_end.tv_nsec - tv_begin.tv_nsec)/1000
+    );
+    stats.op_stat_count[opcode]++;
+    if (!stats.op_stat_count[opcode])
+    {
+        stats.op_stat_count[opcode] = 1;
+        stats.op_stat_sum[opcode] = 0;
+        stats.op_stat_bytes[opcode] = 0;
+    }
+    stats.op_stat_sum[opcode] += usecs;
+    stats.op_stat_bytes[opcode] += len;
+}
+
 void osd_messenger_t::measure_exec(osd_op_t *cur_op)
 {
    // Measure execution latency
@ -142,29 +159,24 @@ void osd_messenger_t::measure_exec(osd_op_t *cur_op)
    {
        clock_gettime(CLOCK_REALTIME, &cur_op->tv_end);
    }
-    stats.op_stat_count[cur_op->req.hdr.opcode]++;
-    if (!stats.op_stat_count[cur_op->req.hdr.opcode])
-    {
-        stats.op_stat_count[cur_op->req.hdr.opcode]++;
-        stats.op_stat_sum[cur_op->req.hdr.opcode] = 0;
-        stats.op_stat_bytes[cur_op->req.hdr.opcode] = 0;
-    }
-    stats.op_stat_sum[cur_op->req.hdr.opcode] += (
-        (cur_op->tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
-        (cur_op->tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
-    );
+    uint64_t len = 0;
    if (cur_op->req.hdr.opcode == OSD_OP_READ ||
        cur_op->req.hdr.opcode == OSD_OP_WRITE ||
        cur_op->req.hdr.opcode == OSD_OP_SCRUB)
    {
        // req.rw.len is internally set to the full object size for scrubs
-        stats.op_stat_bytes[cur_op->req.hdr.opcode] += cur_op->req.rw.len;
+        len = cur_op->req.rw.len;
    }
    else if (cur_op->req.hdr.opcode == OSD_OP_SEC_READ ||
        cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE ||
        cur_op->req.hdr.opcode == OSD_OP_SEC_WRITE_STABLE)
    {
-        stats.op_stat_bytes[cur_op->req.hdr.opcode] += cur_op->req.sec_rw.len;
+        len = cur_op->req.sec_rw.len;
+    }
+    inc_op_stats(stats, cur_op->req.hdr.opcode, cur_op->tv_begin, cur_op->tv_end, len);
+    if (cur_op->is_recovery_related())
+    {
+        inc_op_stats(recovery_stats, cur_op->req.hdr.opcode, cur_op->tv_begin, cur_op->tv_end, len);
    }
 }

--- a/src/osd.cpp
+++ b/src/osd.cpp
@ -68,14 +68,21 @@ osd_t::osd_t(const json11::Json & config, ring_loop_t *ringloop)
        }
    }

+    if (print_stats_timer_id == -1)
+    {
        print_stats_timer_id = this->tfd->set_timer(print_stats_interval*1000, true, [this](int timer_id)
        {
            print_stats();
        });
+    }
+    if (slow_log_timer_id == -1)
+    {
        slow_log_timer_id = this->tfd->set_timer(slow_log_interval*1000, true, [this](int timer_id)
        {
            print_slow();
        });
+    }
+    apply_recovery_tune_interval();

    msgr.tfd = this->tfd;
    msgr.ringloop = this->ringloop;
@ -97,6 +104,11 @@ osd_t::~osd_t()
        tfd->clear_timer(slow_log_timer_id);
        slow_log_timer_id = -1;
    }
+    if (rtune_timer_id >= 0)
+    {
+        tfd->clear_timer(rtune_timer_id);
+        rtune_timer_id = -1;
+    }
    if (print_stats_timer_id >= 0)
    {
        tfd->clear_timer(print_stats_timer_id);
@ -196,6 +208,30 @@ void osd_t::parse_config(bool init)
    recovery_queue_depth = config["recovery_queue_depth"].uint64_value();
    if (recovery_queue_depth < 1 || recovery_queue_depth > MAX_RECOVERY_QUEUE)
        recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
+    recovery_sleep_us = config["recovery_sleep_us"].uint64_value();
+    recovery_tune_util_low = config["recovery_tune_util_low"].is_null()
+        ? 0.1 : config["recovery_tune_util_low"].number_value();
+    if (recovery_tune_util_low < 0.01)
+        recovery_tune_util_low = 0.01;
+    recovery_tune_util_high = config["recovery_tune_util_high"].is_null()
+        ? 1.0 : config["recovery_tune_util_high"].number_value();
+    if (recovery_tune_util_high < 0.01)
+        recovery_tune_util_high = 0.01;
+    recovery_tune_client_util_low = config["recovery_tune_client_util_low"].is_null()
+        ? 0 : config["recovery_tune_client_util_low"].number_value();
+    if (recovery_tune_client_util_low < 0.01)
+        recovery_tune_client_util_low = 0.01;
+    recovery_tune_client_util_high = config["recovery_tune_client_util_high"].is_null()
+        ? 0.5 : config["recovery_tune_client_util_high"].number_value();
+    if (recovery_tune_client_util_high < 0.01)
+        recovery_tune_client_util_high = 0.01;
+    auto old_recovery_tune_interval = recovery_tune_interval;
+    recovery_tune_interval = config["recovery_tune_interval"].is_null()
+        ? 1 : config["recovery_tune_interval"].uint64_value();
+    recovery_tune_agg_interval = config["recovery_tune_agg_interval"].is_null()
+        ? 10 : config["recovery_tune_agg_interval"].uint64_value();
+    recovery_tune_sleep_min_us = config["recovery_tune_sleep_min_us"].is_null()
+        ? 10 : config["recovery_tune_sleep_min_us"].uint64_value();
    recovery_pg_switch = config["recovery_pg_switch"].uint64_value();
    if (recovery_pg_switch < 1)
        recovery_pg_switch = DEFAULT_RECOVERY_PG_SWITCH;
@ -274,6 +310,10 @@ void osd_t::parse_config(bool init)
            print_slow();
        });
    }
+    if (old_recovery_tune_interval != recovery_tune_interval)
+    {
+        apply_recovery_tune_interval();
+    }
 }

 void osd_t::bind_socket()
@ -421,14 +461,6 @@ void osd_t::exec_op(osd_op_t *cur_op)
    }
 }

-void osd_t::reset_stats()
-{
-    msgr.stats = {};
-    prev_stats = {};
-    memset(recovery_stat_count, 0, sizeof(recovery_stat_count));
-    memset(recovery_stat_bytes, 0, sizeof(recovery_stat_bytes));
-}
-
 void osd_t::print_stats()
 {
    for (int i = OSD_OP_MIN; i <= OSD_OP_MAX; i++)
@ -466,19 +498,20 @@ void osd_t::print_stats()
    }
    for (int i = 0; i < 2; i++)
    {
-        if (recovery_stat_count[0][i] != recovery_stat_count[1][i])
+        if (recovery_stat[i].count > recovery_print_prev[i].count)
        {
-            uint64_t bw = (recovery_stat_bytes[0][i] - recovery_stat_bytes[1][i]) / print_stats_interval;
+            uint64_t bw = (recovery_stat[i].bytes - recovery_print_prev[i].bytes) / print_stats_interval;
            printf(
-                "[OSD %lu] %s recovery: %.1f op/s, B/W: %.2f %s\n", osd_num, recovery_stat_names[i],
-                (recovery_stat_count[0][i] - recovery_stat_count[1][i]) * 1.0 / print_stats_interval,
+                "[OSD %lu] %s recovery: %.1f op/s, B/W: %.2f %s, avg latency %ld us, delay %ld us\n", osd_num, recovery_stat_names[i],
+                (recovery_stat[i].count - recovery_print_prev[i].count) * 1.0 / print_stats_interval,
                (bw > 1024*1024*1024 ? bw/1024.0/1024/1024 : (bw > 1024*1024 ? bw/1024.0/1024 : bw/1024.0)),
-                (bw > 1024*1024*1024 ? "GB/s" : (bw > 1024*1024 ? "MB/s" : "KB/s"))
+                (bw > 1024*1024*1024 ? "GB/s" : (bw > 1024*1024 ? "MB/s" : "KB/s")),
+                (recovery_stat[i].usec - recovery_print_prev[i].usec) / (recovery_stat[i].count - recovery_print_prev[i].count),
+                recovery_target_sleep_us
            );
-            recovery_stat_count[1][i] = recovery_stat_count[0][i];
-            recovery_stat_bytes[1][i] = recovery_stat_bytes[0][i];
        }
    }
+    memcpy(recovery_print_prev, recovery_stat, sizeof(recovery_stat));
    if (corrupted_objects > 0)
    {
        printf("[OSD %lu] %lu object(s) corrupted\n", osd_num, corrupted_objects);
@ -572,8 +605,8 @@ void osd_t::print_slow()
                    op->req.hdr.opcode == OSD_OP_SEC_STABILIZE || op->req.hdr.opcode == OSD_OP_SEC_ROLLBACK ||
                    op->req.hdr.opcode == OSD_OP_SEC_READ_BMP)
                {
-                    bufprintf(" state=%d", PRIV(op->bs_op)->op_state);
-                    int wait_for = PRIV(op->bs_op)->wait_for;
+                    bufprintf(" state=%d", op->bs_op ? PRIV(op->bs_op)->op_state : -1);
+                    int wait_for = op->bs_op ? PRIV(op->bs_op)->wait_for : 0;
                    if (wait_for)
                    {
                        bufprintf(" wait=%d (detail=%lu)", wait_for, PRIV(op->bs_op)->wait_detail);
--- a/src/osd.h
+++ b/src/osd.h
@ -34,7 +34,7 @@
 #define DEFAULT_AUTOSYNC_INTERVAL 5
 #define DEFAULT_AUTOSYNC_WRITES 128
 #define MAX_RECOVERY_QUEUE 2048
-#define DEFAULT_RECOVERY_QUEUE 4
+#define DEFAULT_RECOVERY_QUEUE 1
 #define DEFAULT_RECOVERY_PG_SWITCH 128
 #define DEFAULT_RECOVERY_BATCH 16

@ -87,6 +87,11 @@ struct osd_chain_read_t

 struct osd_rmw_stripe_t;

+struct recovery_stat_t
+{
+    uint64_t count, usec, bytes;
+};
+
 class osd_t
 {
    // config
@ -111,7 +116,15 @@ class osd_t
    int immediate_commit = IMMEDIATE_NONE;
    int autosync_interval = DEFAULT_AUTOSYNC_INTERVAL; // "emergency" sync every 5 seconds
    int autosync_writes = DEFAULT_AUTOSYNC_WRITES;
-    int recovery_queue_depth = DEFAULT_RECOVERY_QUEUE;
+    uint64_t recovery_queue_depth = 1;
+    uint64_t recovery_sleep_us = 0;
+    double recovery_tune_util_low = 0.1;
+    double recovery_tune_client_util_low = 0;
+    double recovery_tune_util_high = 1.0;
+    double recovery_tune_client_util_high = 0.5;
+    int recovery_tune_interval = 1;
+    int recovery_tune_agg_interval = 10;
+    int recovery_tune_sleep_min_us = 10;
    int recovery_pg_switch = DEFAULT_RECOVERY_PG_SWITCH;
    int recovery_sync_batch = DEFAULT_RECOVERY_BATCH;
    int inode_vanish_time = 60;
@ -189,8 +202,18 @@ class osd_t
    std::map<uint64_t, inode_stats_t> inode_stats;
    std::map<uint64_t, timespec> vanishing_inodes;
    const char* recovery_stat_names[2] = { "degraded", "misplaced" };
-    uint64_t recovery_stat_count[2][2] = {};
-    uint64_t recovery_stat_bytes[2][2] = {};
+    recovery_stat_t recovery_stat[2];
+    recovery_stat_t recovery_print_prev[2];
+
+    // recovery auto-tuning
+    int rtune_timer_id = -1;
+    uint64_t rtune_avg_lat = 0;
+    double rtune_client_util = 0, rtune_target_util = 1;
+    osd_op_stats_t rtune_prev_stats, rtune_prev_recovery_stats;
+    std::vector<uint64_t> recovery_target_sleep_items;
+    uint64_t recovery_target_sleep_us = 0;
+    uint64_t recovery_target_sleep_total = 0;
+    int recovery_target_sleep_cur = 0, recovery_target_sleep_count = 0;

    // cluster connection
    void parse_config(bool init);
@ -208,8 +231,9 @@ class osd_t
    void create_osd_state();
    void renew_lease(bool reload);
    void print_stats();
+    void tune_recovery();
+    void apply_recovery_tune_interval();
    void print_slow();
-    void reset_stats();
    json11::Json get_statistics();
    void report_statistics();
    void report_pg_state(pg_t & pg);
@ -238,6 +262,7 @@ class osd_t
    bool submit_flush_op(pool_id_t pool_id, pg_num_t pg_num, pg_flush_batch_t *fb, bool rollback, osd_num_t peer_osd, int count, obj_ver_id *data);
    bool pick_next_recovery(osd_recovery_op_t &op);
    void submit_recovery_op(osd_recovery_op_t *op);
+    void finish_recovery_op(osd_recovery_op_t *op);
    bool continue_recovery();
    pg_osd_set_state_t* change_osd_set(pg_osd_set_state_t *st, pg_t *pg);

@ -279,7 +304,7 @@ class osd_t
    bool remember_unstable_write(osd_op_t *cur_op, pg_t & pg, pg_osd_set_t & loc_set, int base_state);
    void handle_primary_subop(osd_op_t *subop, osd_op_t *cur_op);
    void handle_primary_bs_subop(osd_op_t *subop);
-    void add_bs_subop_stats(osd_op_t *subop);
+    void add_bs_subop_stats(osd_op_t *subop, bool recovery_related = false);
    void pg_cancel_write_queue(pg_t & pg, osd_op_t *first_op, object_id oid, int retval);

    void submit_primary_subops(int submit_type, uint64_t op_version, const uint64_t* osd_set, osd_op_t *cur_op);
--- a/src/osd_cluster.cpp
+++ b/src/osd_cluster.cpp
@ -213,12 +213,14 @@ json11::Json osd_t::get_statistics()
    st["subop_stats"] = subop_stats;
    st["recovery_stats"] = json11::Json::object {
        { recovery_stat_names[0], json11::Json::object {
-            { "count", recovery_stat_count[0][0] },
-            { "bytes", recovery_stat_bytes[0][0] },
+            { "count", recovery_stat[0].count },
+            { "bytes", recovery_stat[0].bytes },
+            { "usec", recovery_stat[0].usec },
        } },
        { recovery_stat_names[1], json11::Json::object {
-            { "count", recovery_stat_count[0][1] },
-            { "bytes", recovery_stat_bytes[0][1] },
+            { "count", recovery_stat[1].count },
+            { "bytes", recovery_stat[1].bytes },
+            { "usec", recovery_stat[1].usec },
        } },
    };
    return st;
--- a/src/osd_flush.cpp
+++ b/src/osd_flush.cpp
@ -325,10 +325,37 @@ void osd_t::submit_recovery_op(osd_recovery_op_t *op)
        {
            printf("Recovery operation done for %lx:%lx\n", op->oid.inode, op->oid.stripe);
        }
+        finish_recovery_op(op);
+    };
+    exec_op(op->osd_op);
+}
+
+void osd_t::apply_recovery_tune_interval()
+{
+    if (rtune_timer_id >= 0)
+    {
+        tfd->clear_timer(rtune_timer_id);
+        rtune_timer_id = -1;
+    }
+    if (recovery_tune_interval != 0)
+    {
+        rtune_timer_id = this->tfd->set_timer(recovery_tune_interval*1000, true, [this](int timer_id)
+        {
+            tune_recovery();
+        });
+    }
+    else
+    {
+        recovery_target_sleep_us = recovery_sleep_us;
+    }
+}
+
+void osd_t::finish_recovery_op(osd_recovery_op_t *op)
+{
    // CAREFUL! op = &recovery_ops[op->oid]. Don't access op->* after recovery_ops.erase()
+    delete op->osd_op;
    op->osd_op = NULL;
    recovery_ops.erase(op->oid);
-        delete osd_op;
    if (immediate_commit != IMMEDIATE_ALL)
    {
        recovery_done++;
@ -341,8 +368,84 @@ void osd_t::submit_recovery_op(osd_recovery_op_t *op)
        }
    }
    continue_recovery();
+}
+
+void osd_t::tune_recovery()
+{
+    static int accounted_ops[] = {
+        OSD_OP_SEC_READ, OSD_OP_SEC_WRITE, OSD_OP_SEC_WRITE_STABLE,
+        OSD_OP_SEC_STABILIZE, OSD_OP_SEC_SYNC, OSD_OP_SEC_DELETE
    };
-    exec_op(op->osd_op);
+    uint64_t total_client_usec = 0, total_recovery_usec = 0, recovery_count = 0;
+    for (int i = 0; i < sizeof(accounted_ops)/sizeof(accounted_ops[0]); i++)
+    {
+        total_client_usec += (msgr.stats.op_stat_sum[accounted_ops[i]]
+            - rtune_prev_stats.op_stat_sum[accounted_ops[i]]);
+        total_recovery_usec += (msgr.recovery_stats.op_stat_sum[accounted_ops[i]]
+            - rtune_prev_recovery_stats.op_stat_sum[accounted_ops[i]]);
+        recovery_count += (msgr.recovery_stats.op_stat_count[accounted_ops[i]]
+            - rtune_prev_recovery_stats.op_stat_count[accounted_ops[i]]);
+        rtune_prev_stats.op_stat_sum[accounted_ops[i]] = msgr.stats.op_stat_sum[accounted_ops[i]];
+        rtune_prev_recovery_stats.op_stat_sum[accounted_ops[i]] = msgr.recovery_stats.op_stat_sum[accounted_ops[i]];
+        rtune_prev_recovery_stats.op_stat_count[accounted_ops[i]] = msgr.recovery_stats.op_stat_count[accounted_ops[i]];
+    }
+    total_client_usec -= total_recovery_usec;
+    if (recovery_count == 0)
+    {
+        return;
+    }
+    // example:
+    // total 3 GB/s
+    // recovery queue 1
+    // 120 OSDs
+    // EC 5+3
+    // 128kb block_size => 640kb object
+    // 3000*1024/640/120 = 40 recovered objects/s per OSD (25 MB/s per OSD)
+    //   = 40*8*2 subops = 640 recovery subop iops
+    // 8 recovery subop queue
+    // => subop avg latency = 8/640 = 0.0125 sec
+    // utilisation = 8
+    // target util 1
+    // intuitively target latency should be 8x of real
+    // target_lat = rtune_avg_lat * utilisation / target_util
+    //            = rtune_avg_lat * rtune_avg_lat * rtune_avg_iops / target_util
+    //            = 0.1
+    // recovery utilisation will be 1
+    rtune_client_util = total_client_usec/1000000.0/recovery_tune_interval;
+    rtune_target_util = (rtune_client_util < recovery_tune_client_util_low
+        ? recovery_tune_util_high
+        : recovery_tune_util_low + (rtune_client_util >= recovery_tune_client_util_high
+            ? 0 : (recovery_tune_util_high-recovery_tune_util_low)*
+                (recovery_tune_client_util_high-rtune_client_util)/(recovery_tune_client_util_high-recovery_tune_client_util_low)
+        )
+    );
+    rtune_avg_lat = total_recovery_usec/recovery_count;
+    uint64_t target_lat = rtune_avg_lat * rtune_avg_lat/1000000.0 * recovery_count/recovery_tune_interval / rtune_target_util;
+    auto sleep_us = target_lat > rtune_avg_lat+recovery_tune_sleep_min_us ? target_lat-rtune_avg_lat : 0;
+    if (recovery_target_sleep_items.size() != recovery_tune_agg_interval)
+    {
+        recovery_target_sleep_items.resize(recovery_tune_agg_interval);
+        for (int i = 0; i < recovery_tune_agg_interval; i++)
+            recovery_target_sleep_items[i] = 0;
+        recovery_target_sleep_total = 0;
+        recovery_target_sleep_cur = 0;
+        recovery_target_sleep_count = 0;
+    }
+    recovery_target_sleep_total -= recovery_target_sleep_items[recovery_target_sleep_cur];
+    recovery_target_sleep_items[recovery_target_sleep_cur] = sleep_us;
+    recovery_target_sleep_cur = (recovery_target_sleep_cur+1) % recovery_tune_agg_interval;
+    recovery_target_sleep_total += sleep_us;
+    if (recovery_target_sleep_count < recovery_tune_agg_interval)
+        recovery_target_sleep_count++;
+    recovery_target_sleep_us = recovery_target_sleep_total / recovery_target_sleep_count;
+    if (log_level > 4)
+    {
+        printf(
+            "[OSD %lu] auto-tune: client util: %.2f, recovery util: %.2f, lat: %lu us -> target util %.2f, delay %lu us\n",
+            osd_num, rtune_client_util, total_recovery_usec/1000000.0/recovery_tune_interval,
+            rtune_avg_lat, rtune_target_util, recovery_target_sleep_us
+        );
+    }
 }

 // Just trigger write requests for degraded objects. They'll be recovered during writing
--- a/src/osd_ops.h
+++ b/src/osd_ops.h
@ -34,6 +34,7 @@
 #define OSD_OP_MAX                  18
 #define OSD_RW_MAX                  64*1024*1024
 #define OSD_PROTOCOL_VERSION        1
+#define OSD_OP_RECOVERY_RELATED     (uint32_t)1

 // Memory alignment for direct I/O (usually 512 bytes)
 #ifndef DIRECT_IO_ALIGNMENT
@ -88,7 +89,8 @@ struct __attribute__((__packed__)) osd_op_sec_rw_t
    uint32_t len;
    // bitmap/attribute length - bitmap comes after header, but before data
    uint32_t attr_len;
-    uint32_t pad0;
+    // the only possible flag is OSD_OP_RECOVERY_RELATED
+    uint32_t flags;
 };

 struct __attribute__((__packed__)) osd_reply_sec_rw_t
@ -109,6 +111,9 @@ struct __attribute__((__packed__)) osd_op_sec_del_t
    object_id oid;
    // delete version (automatic or specific)
    uint64_t version;
+    // the only possible flag is OSD_OP_RECOVERY_RELATED
+    uint32_t flags;
+    uint32_t pad0;
 };

 struct __attribute__((__packed__)) osd_reply_sec_del_t
@ -121,6 +126,9 @@ struct __attribute__((__packed__)) osd_reply_sec_del_t
 struct __attribute__((__packed__)) osd_op_sec_sync_t
 {
    osd_op_header_t header;
+    // the only possible flag is OSD_OP_RECOVERY_RELATED
+    uint32_t flags;
+    uint32_t pad0;
 };

 struct __attribute__((__packed__)) osd_reply_sec_sync_t
@ -134,6 +142,9 @@ struct __attribute__((__packed__)) osd_op_sec_stab_t
    osd_op_header_t header;
    // obj_ver_id array length in bytes
    uint64_t len;
+    // the only possible flag is OSD_OP_RECOVERY_RELATED
+    uint32_t flags;
+    uint32_t pad0;
 };
 typedef osd_op_sec_stab_t osd_op_sec_rollback_t;

--- a/src/osd_primary_subops.cpp
+++ b/src/osd_primary_subops.cpp
@ -3,13 +3,15 @@

 #include "osd_primary.h"

+#define SELF_FD -1
+
 void osd_t::autosync()
 {
    if (immediate_commit != IMMEDIATE_ALL && !autosync_op)
    {
        autosync_op = new osd_op_t();
        autosync_op->op_type = OSD_OP_IN;
-        autosync_op->peer_fd = -1;
+        autosync_op->peer_fd = SELF_FD;
        autosync_op->req = (osd_any_op_t){
            .sync = {
                .header = {
@ -85,9 +87,13 @@ void osd_t::finish_op(osd_op_t *cur_op, int retval)
    cur_op->reply.hdr.id = cur_op->req.hdr.id;
    cur_op->reply.hdr.opcode = cur_op->req.hdr.opcode;
    cur_op->reply.hdr.retval = retval;
-    if (cur_op->peer_fd == -1)
+    if (cur_op->peer_fd == SELF_FD)
+    {
+        // Do not count internal primary writes (recovery/rebalance) in client op statistics
+        if (cur_op->req.hdr.opcode != OSD_OP_WRITE)
        {
            msgr.measure_exec(cur_op);
+        }
        // Copy lambda to be unaffected by `delete op`
        std::function<void(osd_op_t*)>(cur_op->callback)(cur_op);
    }
@ -215,6 +221,7 @@ int osd_t::submit_primary_subop_batch(int submit_type, inode_t inode, uint64_t o
                    .offset = wr ? si->write_start : si->read_start,
                    .len = subop_len,
                    .attr_len = wr ? clean_entry_bitmap_size : 0,
+                    .flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
                };
 #ifdef OSD_DEBUG
                printf(
@ -294,7 +301,8 @@ void osd_t::handle_primary_bs_subop(osd_op_t *subop)
            " retval = "+std::to_string(bs_op->retval)+")"
        );
    }
-    add_bs_subop_stats(subop);
+    bool recovery_related = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB;
+    add_bs_subop_stats(subop, recovery_related);
    subop->req.hdr.opcode = bs_op_to_osd_op[bs_op->opcode];
    subop->reply.hdr.retval = bs_op->retval;
    if (bs_op->opcode == BS_OP_READ || bs_op->opcode == BS_OP_WRITE || bs_op->opcode == BS_OP_WRITE_STABLE)
@ -306,30 +314,33 @@ void osd_t::handle_primary_bs_subop(osd_op_t *subop)
    }
    delete bs_op;
    subop->bs_op = NULL;
-    subop->peer_fd = -1;
+    subop->peer_fd = SELF_FD;
+    if (recovery_related && recovery_target_sleep_us)
+    {
+        tfd->set_timer_us(recovery_target_sleep_us, false, [=](int timer_id)
+        {
+            handle_primary_subop(subop, cur_op);
+        });
+    }
+    else
+    {
        handle_primary_subop(subop, cur_op);
    }
+}

-void osd_t::add_bs_subop_stats(osd_op_t *subop)
+void osd_t::add_bs_subop_stats(osd_op_t *subop, bool recovery_related)
 {
    // Include local blockstore ops in statistics
    uint64_t opcode = bs_op_to_osd_op[subop->bs_op->opcode];
    timespec tv_end;
    clock_gettime(CLOCK_REALTIME, &tv_end);
-    msgr.stats.op_stat_count[opcode]++;
-    if (!msgr.stats.op_stat_count[opcode])
+    uint64_t len = (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE)
+        ? subop->bs_op->len : 0;
+    msgr.inc_op_stats(msgr.stats, opcode, subop->tv_begin, tv_end, len);
+    if (recovery_related)
    {
-        msgr.stats.op_stat_count[opcode] = 1;
-        msgr.stats.op_stat_sum[opcode] = 0;
-        msgr.stats.op_stat_bytes[opcode] = 0;
-    }
-    msgr.stats.op_stat_sum[opcode] += (
-        (tv_end.tv_sec - subop->tv_begin.tv_sec)*1000000 +
-        (tv_end.tv_nsec - subop->tv_begin.tv_nsec)/1000
-    );
-    if (opcode == OSD_OP_SEC_READ || opcode == OSD_OP_SEC_WRITE)
-    {
-        msgr.stats.op_stat_bytes[opcode] += subop->bs_op->len;
+        // The op is OSD_OP_RECOVERY_RELATED: also count it in the separate recovery statistics
+        msgr.inc_op_stats(msgr.recovery_stats, opcode, subop->tv_begin, tv_end, len);
    }
 }

@ -552,6 +563,7 @@ void osd_t::submit_primary_del_batch(osd_op_t *cur_op, obj_ver_osd_t *chunks_to_
                },
                .oid = chunk.oid,
                .version = chunk.version,
+                .flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
            } };
            subops[i].callback = [cur_op, this](osd_op_t *subop)
            {
@ -609,6 +621,7 @@ int osd_t::submit_primary_sync_subops(osd_op_t *cur_op)
                    .id = msgr.next_subop_id++,
                    .opcode = OSD_OP_SEC_SYNC,
                },
+                .flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
            } };
            subops[i].callback = [cur_op, this](osd_op_t *subop)
            {
@ -668,6 +681,7 @@ void osd_t::submit_primary_stab_subops(osd_op_t *cur_op)
                    .opcode = OSD_OP_SEC_STABILIZE,
                },
                .len = (uint64_t)(stab_osd.len * sizeof(obj_ver_id)),
+                .flags = cur_op->peer_fd == SELF_FD && cur_op->req.hdr.opcode != OSD_OP_SCRUB ? OSD_OP_RECOVERY_RELATED : 0,
            } };
            subops[i].iov.push_back(op_data->unstable_writes + stab_osd.start, stab_osd.len * sizeof(obj_ver_id));
            subops[i].callback = [cur_op, this](osd_op_t *subop)
--- a/src/osd_primary_write.cpp
+++ b/src/osd_primary_write.cpp
@ -292,16 +292,26 @@ resume_7:
    {
        {
            int recovery_type = op_data->object_state->state & (OBJ_DEGRADED|OBJ_INCOMPLETE) ? 0 : 1;
-            recovery_stat_count[0][recovery_type]++;
-            if (!recovery_stat_count[0][recovery_type])
+            recovery_stat[recovery_type].count++;
+            if (!recovery_stat[recovery_type].count) // wrapped
            {
-                recovery_stat_count[0][recovery_type]++;
-                recovery_stat_bytes[0][recovery_type] = 0;
+                memset(&recovery_print_prev[recovery_type], 0, sizeof(recovery_print_prev[recovery_type]));
+                memset(&recovery_stat[recovery_type], 0, sizeof(recovery_stat[recovery_type]));
+                recovery_stat[recovery_type].count++;
            }
            for (int role = 0; role < (op_data->scheme == POOL_SCHEME_REPLICATED ? 1 : pg.pg_size); role++)
            {
-                recovery_stat_bytes[0][recovery_type] += op_data->stripes[role].write_end - op_data->stripes[role].write_start;
+                recovery_stat[recovery_type].bytes += op_data->stripes[role].write_end - op_data->stripes[role].write_start;
            }
+            if (!cur_op->tv_end.tv_sec)
+            {
+                clock_gettime(CLOCK_REALTIME, &cur_op->tv_end);
+            }
+            uint64_t usec = (
+                (cur_op->tv_end.tv_sec - cur_op->tv_begin.tv_sec)*1000000 +
+                (cur_op->tv_end.tv_nsec - cur_op->tv_begin.tv_nsec)/1000
+            );
+            recovery_stat[recovery_type].usec += usec;
        }
        // Any kind of a non-clean object can have extra chunks, because we don't record objects
        // as degraded & misplaced or incomplete & misplaced at the same time. So try to remove extra chunks
--- a/src/osd_secondary.cpp
+++ b/src/osd_secondary.cpp
@ -42,7 +42,21 @@ void osd_t::secondary_op_callback(osd_op_t *op)
    int retval = op->bs_op->retval;
    delete op->bs_op;
    op->bs_op = NULL;
+    if (op->is_recovery_related() && recovery_target_sleep_us)
+    {
+        if (!op->tv_end.tv_sec)
+        {
+            clock_gettime(CLOCK_REALTIME, &op->tv_end);
+        }
+        tfd->set_timer_us(recovery_target_sleep_us, false, [this, op, retval](int timer_id)
+        {
            finish_op(op, retval);
+        });
+    }
+    else
+    {
+        finish_op(op, retval);
+    }
 }

 void osd_t::exec_secondary(osd_op_t *cur_op)
--- a/tests/run_3osds.sh
+++ b/tests/run_3osds.sh
@ -19,10 +19,10 @@ fi

 if [ "$IMMEDIATE_COMMIT" != "" ]; then
    NO_SAME="--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --disable_data_fsync 1 --immediate_commit all --log_level 10 --etcd_stats_interval 5"
-    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"immediate_commit":"all","client_enable_writeback":true}'
+    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"recovery_tune_util_low":1,"osd_out_time":1,"immediate_commit":"all","client_enable_writeback":true}'
 else
    NO_SAME="--journal_sector_buffer_count 1024 --log_level 10 --etcd_stats_interval 5"
-    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"osd_out_time":1,"client_enable_writeback":true}'
+    $ETCDCTL put /vitastor/config/global '{"recovery_queue_depth":1,"recovery_tune_util_low":1,"osd_out_time":1,"client_enable_writeback":true}'
 fi

 start_osd_on()
@ -53,7 +53,7 @@ for i in $(seq 1 $OSD_COUNT); do
    start_osd $i
 done

-(while true; do node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) &>./testdata/mon.log &
+(while true; do node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) >>./testdata/mon.log 2>&1 &
 MON_PID=$!

 if [ "$SCHEME" = "ec" ]; then
--- a/tests/test_change_pg_count.sh
+++ b/tests/test_change_pg_count.sh
@ -18,6 +18,7 @@ try_change()
    for i in {1..6}; do
        echo --- Change PG count to $n --- >>testdata/osd$i.log
    done
+    echo --- Change PG count to $n --- >>testdata/mon.log

    $ETCDCTL put /vitastor/config/pools '{"1":{'$POOLCFG',"pg_size":'$PG_SIZE',"pg_minsize":'$PG_MINSIZE',"pg_count":'$n'}}'

--- a/tests/test_failure_domain.sh
+++ b/tests/test_failure_domain.sh
@ -15,7 +15,7 @@ $ETCDCTL put /vitastor/osd/stats/7 '{"host":"host4","size":1073741824,"time":"'$
 $ETCDCTL put /vitastor/osd/stats/8 '{"host":"host4","size":1073741824,"time":"'$TIME'"}'
 $ETCDCTL put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":4,"failure_domain":"rack"}}'

-node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" &>./testdata/mon.log &
+node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" >>./testdata/mon.log 2>&1 &
 MON_PID=$!

 sleep 2
--- a/tests/test_move_reappear.sh
+++ b/tests/test_move_reappear.sh
@ -7,7 +7,7 @@ OSD_COUNT=5
 OSD_ARGS="$OSD_ARGS"
 for i in $(seq 1 $OSD_COUNT); do
    dd if=/dev/zero of=./testdata/test_osd$i.bin bs=1024 count=1 seek=$((OSD_SIZE*1024-1))
-    build/src/vitastor-osd --osd_num $i --bind_address 127.0.0.1 --etcd_stats_interval 5 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 &
+    build/src/vitastor-osd --log_level 10 --osd_num $i --bind_address 127.0.0.1 --etcd_stats_interval 5 $OSD_ARGS --etcd_address $ETCD_URL $(build/src/vitastor-disk simple-offsets --format options ./testdata/test_osd$i.bin 2>/dev/null) >>./testdata/osd$i.log 2>&1 &
    eval OSD${i}_PID=$!
 done

@ -53,6 +53,11 @@ for i in {1..30}; do
    fi
 done

+# Sync so all moved objects are removed from OSD 1 (they aren't removed without a sync)
+LD_PRELOAD="build/src/libfio_vitastor.so" \
+fio -thread -name=test -ioengine=build/src/libfio_vitastor.so -bs=4k -direct=1 -iodepth=1 -fsync=1 -number_ios=2 -rw=write \
+    -etcd=$ETCD_URL -pool=1 -inode=2 -size=32M -cluster_log_level=10
+
 $ETCDCTL put /vitastor/config/pgs '{"items":{"1":{"1":{"osd_set":[4,5],"primary":0}}}}'

 $ETCDCTL put /vitastor/pg/history/1/1 '{"all_peers":[1,2,3]}'
--- a/tests/test_vm_cont.sh
+++ b/tests/test_vm_cont.sh
@ -15,7 +15,7 @@ for i in $(seq 1 $OSD_COUNT); do
    eval OSD${i}_PID=$!
 done

-(while true; do node mon/mon-main.js --etcd_url $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) &>./testdata/mon.log &
+(while true; do node mon/mon-main.js --etcd_address $ETCD_URL --etcd_prefix "/vitastor" --verbose 1 || true; done) >>./testdata/mon.log 2>&1 &
 MON_PID=$!

 sleep 3
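
The delay arithmetic in the auto-tune hunk of the OSD code above can be tried in isolation. The following is a minimal sketch, not the author's function: `compute_sleep_us` and its parameter names are hypothetical, the target utilisation is taken as an input because its formula is only partly visible in this diff, and the early return for an empty interval is an assumption. The expression for `target_lat` keeps the same operand order as the diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the arithmetic of the auto-tune hunk.
// total_recovery_usec: summed latency of recovery ops during the interval
// recovery_count:      number of recovery ops during the interval
// interval_sec:        length of the measurement interval in seconds
// target_util:         desired recovery utilisation
// sleep_min_us:        pauses not longer than this are dropped to zero
uint64_t compute_sleep_us(uint64_t total_recovery_usec, uint64_t recovery_count,
    double interval_sec, double target_util, uint64_t sleep_min_us)
{
    assert(interval_sec > 0 && target_util > 0);
    if (!recovery_count)
        return 0; // nothing measured; returning 0 here is an assumption
    uint64_t avg_lat = total_recovery_usec / recovery_count;
    // avg_lat/1e6 * ops per second is the measured utilisation; scaling
    // avg_lat by (measured / target) gives the per-op time to aim for
    uint64_t target_lat = (uint64_t)(
        (double)avg_lat * avg_lat / 1000000.0 * recovery_count / interval_sec / target_util
    );
    // The pause is the part of the target per-op time not covered by latency
    return target_lat > avg_lat + sleep_min_us ? target_lat - avg_lat : 0;
}
```

For instance, 1000 ops with a mean latency of 2000 us in a 1 second interval give a measured utilisation of 2.0; with a target of 1.0 the sketch returns a 2000 us pause per op, and it returns 0 once the measured utilisation is already at or below the target.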
Author	SHA1	Message	Date
Vitaliy Filippov	c17f76a3e4	Add documentation for recovery auto-tuning	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	a6ab54b1ba	Do not allow negative util_low/high	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	99ee8596ea	Rename min/max_util to util_low/high	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	c4928e6ecd	Protect from try_send completing the operation immediately. Fixes a possible use-after-free in case of continue_ops() calling try_send(), then connect_peer() -> set_timer() -> trigger_nearest() -> handle_op_part() -> continue_ops() again	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	ec7dcd1be5	Do not apply very large recovery pauses during tests	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	e600bbc151	Fix flapping move_reappear test by adding an fsync before stopping PG	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	8b8c1179a7	Use a separate used_blocks counter for free space stats to hide possibly delayed on-flush deallocation	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	d5a6fa6dd7	Fix possible crash on print_slow when bs_op is NULL	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	f757a35a8d	Retry PG changes without re-running lpsolve when pool configuration and OSD tree don't change. OSDs often change their /pg/history keys during rebalance, so monitor receives additional transaction failures from etcd if it re-runs lpsolve which sometimes may even lead to monitor being unable to apply PG changes at all until rebalance completes	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	1edf86ed26	Aggregate recovery delay using simple mean over last 10 observations (EWMA is shit)	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	5ca7cde612	Experiment/WIP: Try to track "secondary" recovery ops separately	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	751935ddd8	WIP Auto-tune recovery speed	2023-12-31 01:23:17 +03:00
Vitaliy Filippov	d84dee7098	Track recovery op latencies + refactor into a structure	2023-12-31 01:23:17 +03:00
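
Commit 1edf86ed26 above aggregates the recovery delay as a simple mean over the most recent observations. A minimal stand-alone sketch of that ring-buffer bookkeeping follows; the `sleep_window_t` type and its member names are hypothetical and only loosely follow the `recovery_target_sleep_*` fields in the diff, and the window size is passed in here rather than read from `recovery_tune_agg_interval`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-alone version of the recovery_target_sleep_* bookkeeping
struct sleep_window_t
{
    std::vector<uint64_t> items; // ring buffer of the last N observations
    uint64_t total = 0;          // sum of everything currently in the buffer
    size_t cur = 0;              // slot that the next observation overwrites
    size_t count = 0;            // number of filled slots, at most N

    explicit sleep_window_t(size_t n): items(n, 0)
    {
        assert(n > 0);
    }

    // Record one observation and return the mean over the filled slots
    uint64_t add(uint64_t sleep_us)
    {
        total -= items[cur];          // drop the value being overwritten
        items[cur] = sleep_us;
        cur = (cur+1) % items.size();
        total += sleep_us;
        if (count < items.size())
            count++;
        return total / count;
    }
};
```

With a window of 3, adding 300, 600, 0 and 900 in turn yields means of 300, 450, 300 and 500: the fourth value evicts the first. Dividing by the number of filled slots instead of the window size keeps the first few results from being pulled towards zero.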